
I'm trying to convert the output of the cv::cuda::FarnebackOpticalFlow algorithm into 3 YUV planes of unsigned bytes (for subsequent compression via FFMPEG).

I'm getting error code -217 ("unspecified launch failure") when calling stream.waitForCompletion() (not shown here). What triggers the error in my kernel is assigning a value to one of the output GpuMat objects (the dst_y(y, x) = ... line below).

I'm using OpenCV 3.3, compiled from source, under Windows 10.

I'm using the CMake cuda_add_executable() command to define my project, and I've defined the CUDA flags with set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -arch compute_50 -code sm_50), though I've tried variations with no more success. My graphics card is a GTX 970.

#include <cmath>
#include <stdexcept>
#include <string>

#include <cuda_runtime.h>

#include <opencv2/core/cuda.hpp>
#include <opencv2/core/cuda_stream_accessor.hpp>

using namespace cv;
using namespace cuda;


namespace 
{
    __global__ void kernelFunc(
        const PtrStepSz<float2>& src,
        float scale_x, float scale_y,
        PtrStepSzb dst_y, PtrStepSzb dst_u, PtrStepSzb dst_v)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;

        if (x < src.cols && y < src.rows && y >= 0 && x >= 0)
        {
            // Get input: optical flow, and scale it
            auto dx = scale_x * src(y, x).x, dy = scale_y * src(y, x).y;

            // Luma: flow vector length, compressed using an exponential function
            auto l = sqrt(dx*dx + dy*dy);
            l = exp(5*l) / (exp(5*l) + 1);
            dst_y(y, x) = 255 * l;

            // Chroma (U and V)
            dst_u(y, x) = 255 * (dx + 0.5);
            dst_v(y, x) = 255 * (dy + 0.5);
        }
    }

} // ns

void compress_optical_flow_mat_to_yuv444(const GpuMat& src, 
    GpuMat& dst_y, GpuMat& dst_u, GpuMat& dst_v, 
    Stream& stream)
{
    using namespace std::string_literals;

    dst_y.create(src.size(), CV_8UC1); 
    dst_u.create(src.size(), CV_8UC1); 
    dst_v.create(src.size(), CV_8UC1); 

    dim3 cthreads(16, 16); //(32, 8);
    dim3 cblocks(
        static_cast<int>(ceil(src.size().width  / static_cast<double>(cthreads.x))),
        static_cast<int>(ceil(src.size().height / static_cast<double>(cthreads.y))));

    // We scale optical flow so that the range [-0.5..0.5] covers half the width and half the height,
    // in pixels, of the image. In other words, a given pixel may not move by more than half the
    // image size per frame.
    float scale_x = 0.5f / src.size().width;
    float scale_y = 0.5f / src.size().height;
    auto cu_str = StreamAccessor::getStream(stream);

    kernelFunc<<<cblocks, cthreads, 0, cu_str>>>(src, scale_x, scale_y, dst_y, dst_u, dst_v);

    auto err = cudaGetLastError();
    if (err != 0) 
        throw std::runtime_error("compress_optical_flow_mat_to_yuv444() kernel call failed with error "s 
            + std::to_string(err) + ": " + cudaGetErrorString(err));
}
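
For reference, here is a minimal sketch of how this wrapper is driven; the image names and the download step are assumptions for illustration, not part of the original code:

// Hypothetical caller: compute Farneback flow on the GPU, convert it to
// YUV444 planes with the wrapper above, then synchronize the stream.
// Requires <opencv2/cudaoptflow.hpp> for cv::cuda::FarnebackOpticalFlow.
void example_usage(const GpuMat& prev_gray, const GpuMat& next_gray)
{
    Stream stream;
    GpuMat flow;                  // CV_32FC2, matches PtrStepSz<float2>
    GpuMat y, u, v;

    auto farneback = cv::cuda::FarnebackOpticalFlow::create();
    farneback->calc(prev_gray, next_gray, flow, stream);

    compress_optical_flow_mat_to_yuv444(flow, y, u, v, stream);

    stream.waitForCompletion();   // this is where the -217 error surfaces

    // The three CV_8UC1 planes can now be downloaded and handed to FFMPEG.
    Mat y_host, u_host, v_host;
    y.download(y_host);
    u.download(u_host);
    v.download(v_host);
}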

1 Answer


Remove the & from the src parameter in your kernel signature, i.e. change

__global__ void kernelFunc(
        const PtrStepSz<float2>& src,
        float scale_x, float scale_y,
        PtrStepSzb dst_y, PtrStepSzb dst_u, PtrStepSzb dst_v)

to

__global__ void kernelFunc(
        const PtrStepSz<float2> src,
        float scale_x, float scale_y,
        PtrStepSzb dst_y, PtrStepSzb dst_u, PtrStepSzb dst_v)

5 Comments

Bingo! And thanks a ton. So, passing a const reference should have worked in plain C++, but here we're passing stuff from the CPU to the GPU, so it's understandable that that rule went out the window. It would have been nice of nvcc to warn me though!
@JPNotADragon Glad to help. Although to be honest I am not completely sure either why the const reference is not working. I was facing the same issue and I accidentally stumbled upon this solution. If you find out the reason, do let me know.
Now that I think about it, the CPU-GPU interfacing may not be the only factor to consider. I'm using streaming, which means that the input data must still be available after the call to the wrapper function - which would not be the case here since the PtrStep object is a temporary produced by the GpuMat conversion operator.
It's not possible to use reference parameters on CUDA kernels unless Unified Memory is also being used (and the item has been allocated with a managed allocator). A reference parameter involves pass-by-reference, meaning the item passed is a reference to (i.e. the address of, or a pointer to) the original data item. The original data item here lives in host code, so its address is a pointer into host address space. To actually access the item's value (src), the device code would have to dereference this reference, i.e. dereference a host address, which is not legal in device code (see the sketch after these comments).
@RobertCrovella Thank you so much! You don't know how many times I had to resist the urge to tag you with this question in one of your posts.
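
To make the explanation above concrete, here is a minimal, hypothetical sketch (not from the original post) of the difference between passing a kernel parameter by value and by reference:

#include <cuda_runtime.h>

struct Params { int n; float* data; };   // stand-in for PtrStepSz

// OK: the struct is copied by value into device-accessible parameter space,
// so the embedded device pointer can be dereferenced safely.
__global__ void byValue(Params p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n) p.data[i] = 1.0f;
}

// Not OK (without managed memory): only the address of a host-side Params
// object reaches the device; dereferencing that host address in device code
// is an invalid access, which later surfaces as "unspecified launch failure".
// __global__ void byReference(const Params& p)
// {
//     int i = blockIdx.x * blockDim.x + threadIdx.x;
//     if (i < p.n) p.data[i] = 1.0f;    // illegal dereference of host memory
// }

int main()
{
    Params p{};
    p.n = 256;
    cudaMalloc(&p.data, p.n * sizeof(float));

    byValue<<<1, 256>>>(p);        // fine
    // byReference<<<1, 256>>>(p); // would fail at run time

    cudaDeviceSynchronize();
    cudaFree(p.data);
    return 0;
}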
