13

I have an application that requires processing multiple images in parallel in order to maintain real-time speed.

It is my understanding that I cannot call OpenCV's GPU functions in a multi-threaded fashion on a single CUDA device. I have tried an OpenMP code construct such as the following:

// One OpenMP thread per image; every thread issues cv::gpu calls on the same CUDA device
#pragma omp parallel for
for(int i=0; i<numImages; i++){
    for(int j=0; j<numChannels; j++){
        for(int k=0; k<pyramidDepth; k++){
            cv::gpu::multiply(pyramid[i][j][k], weightmap[i][k], pyramid[i][j][k]);
        }
    }
}

This seems to compile and execute correctly, but unfortunately the numImages threads appear to execute serially on the same CUDA device.

I should be able to execute multiple threads in parallel if I have multiple CUDA devices, correct? In order to get multiple CUDA devices, do I need multiple video cards?

Does anyone know if the nVidia GTX 690 dual-chip card works as two independent CUDA devices with OpenCV 2.4 or later? I found confirmation it can work as such with OpenCL, but no confirmation with regard to OpenCV.

1 Comment

Perhaps the answer is in the source code for OpenCV? Commented Jun 21, 2012 at 16:51

4 Answers

5

Just do the multiply by passing whole images to the cv::gpu::multiply() function.

OpenCV and CUDA will handle splitting and dividing the task in the best way. Generally each compute unit (i.e. core) in a GPU can run multiple threads (typically >=16 in CUDA). This is in addition to having cards that can appear as multiple GPUs, or putting multiple linked cards in one machine.

The whole point of cv::gpu is to save you from having to know anything about how the internals work.
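
For illustration, a minimal sketch of the whole-image approach with the OpenCV 2.4 gpu module might look like this (the Mat inputs img and weights are placeholders for your own data):

#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>

cv::Mat multiplyOnGpu(const cv::Mat& img, const cv::Mat& weights)
{
    cv::gpu::GpuMat d_img, d_weights, d_result;

    d_img.upload(img);          // host -> device copy
    d_weights.upload(weights);

    // One call for the whole image; OpenCV launches enough CUDA threads internally.
    cv::gpu::multiply(d_img, d_weights, d_result);

    cv::Mat result;
    d_result.download(result);  // device -> host copy
    return result;
}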


6 Comments

Yes, true. The multiply() function is written to take advantage of CUDA threading within the function itself. However, what I need is more than one multiply() function operating in parallel threads, and that does not seem to be possible without multiple GPUs. With more than one GPU you could execute a multiply() on each in parallel, for different images simultaneously.
@mmccullo - yes, cv::gpu uses low-level CUDA threading; you can call it in multiple user threads, but each will fully utilize the GPU until the other has finished. If you have a card with cuda2 it will use streams to do this async so your threads don't block
I am using CUDA v4.2. I am not sure what your reference to "cuda2" means exactly. It does not appear to necessarily block my OpenMP threads, but the execution time of my code above is only slightly better than executing in a single thread. It appears the execution of the multiple threads occurs serially on the single CUDA device -- otherwise the execution time should be much less than with a single thread on the same device. My test GPU is a Quadro2000M with 2GB and 192 CUDA cores. The images are 1280x960 RGB.
@mmccullo - compute capability >= 2 adds async streams
Ah, in fact my Quadro2000M is compute capability 2.1. I therefore did the following: cv::gpu::Stream stream[3]; for(int i=0; i<numImages; i++){ for(int j=0; j<numChannels; j++){ for(int k=0; k<pyramidDepth; k++){ cv::gpu::multiply(pyramid[i][j][k], weightmap[i][k], pyramid[i][j][k], stream[i]); } } } That appears to execute in parallel on the same CUDA device! Thanks.
4

The answer from Martin worked for me. The key is to make use of the gpu::Stream class if your CUDA device is listed as compute capability 2 or higher. I will restate it here because I could not post the code clip correctly in the comment mini editor.

cv::gpu::Stream stream[3];   // one stream per image (numImages = 3)

for(int i=0; i<numImages; i++){
    for(int j=0; j<numChannels; j++){
        for(int k=0; k<pyramidDepth; k++){
            // passing stream[i] enqueues the multiply asynchronously on that stream
            cv::gpu::multiply(pyramid[i][j][k], weightmap[i][k], pyramid[i][j][k], stream[i]);
        }
    }
}

The above code seems to execute the multiplies in parallel (numImages = 3 for my app). There are also Stream methods for uploading/downloading images to and from GPU memory, as well as methods to check the state of a stream, which help when synchronizing with other code.
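
For example, a minimal sketch of those Stream helpers (reusing the same multiply overload as above; the host Mats src[i], weight[i] and result[i] are placeholders for your own data):

cv::gpu::Stream stream[3];
cv::gpu::GpuMat d_src[3], d_weight[3], d_dst[3];
cv::Mat result[3];

for(int i = 0; i < 3; i++){
    stream[i].enqueueUpload(src[i], d_src[i]);       // asynchronous host -> device copy
    stream[i].enqueueUpload(weight[i], d_weight[i]);
    cv::gpu::multiply(d_src[i], d_weight[i], d_dst[i], stream[i]);
    stream[i].enqueueDownload(d_dst[i], result[i]);  // asynchronous device -> host copy
}

for(int i = 0; i < 3; i++)
    stream[i].waitForCompletion();                   // block until stream i has finished

Note that the transfers only overlap fully when the host memory is page-locked (cv::gpu::CudaMem); with plain cv::Mat buffers the copies may still block the calling thread.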

So... it apparently does not require multiple CUDA devices (i.e. GPU cards) in order to execute OpenCV GPU code in parallel!


0

I don't know anything about OpenCV's GPU functions, but if they are completely self-contained (i.e., create GPU context, transfer data to GPU, compute results, transfer results back to CPU), then it's not surprising that these functions appear serialized when using a single GPU.

If you have multiple GPUs, then there should be some way to tell the OpenCV function to target a specific GPU, and if you can target them effectively, I see no reason why the GPU function calls wouldn't be parallelized. According to the OpenCV wiki, the GPU functions target only a single GPU, but you can manually split up work yourself: http://opencv.willowgarage.com/wiki/OpenCV%20GPU%20FAQ#Can_I_use_two_or_more_GPUs.3F
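
As an illustration only, a rough sketch of that manual split might bind one OpenMP host thread to each device with cv::gpu::setDevice(); the containers hostImages and hostWeights are hypothetical stand-ins for your own data:

#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>

void multiplyAcrossGpus(std::vector<cv::Mat>& hostImages,
                        const std::vector<cv::Mat>& hostWeights)
{
    int numDevices = cv::gpu::getCudaEnabledDeviceCount();

    #pragma omp parallel for
    for(int dev = 0; dev < numDevices; dev++){
        cv::gpu::setDevice(dev);   // all cv::gpu calls on this thread now use device "dev"

        // naive split: image i is processed by device (i % numDevices)
        for(size_t i = dev; i < hostImages.size(); i += numDevices){
            cv::gpu::GpuMat d_img(hostImages[i]), d_w(hostWeights[i]), d_out;
            cv::gpu::multiply(d_img, d_w, d_out);
            d_out.download(hostImages[i]);   // overwrite the host image with the result
        }
    }
}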

Dual GPUs like the GTX 690 will appear as two distinct devices with their own memory as far as your GPU program is concerned. See here: http://forums.nvidia.com/index.php?showtopic=231726

Also, if you are going the dual-GPU route for compute applications, I would recommend against the GTX 690 because its compute performance is somewhat crippled compared to the GTX 590.

3 Comments

Interesting comment about the 690 vs. 590 performance. This nVidia page indicates a higher compute capability for the 690. Do you have any specifics on how the 690 is crippled?
"According to the OpenCV wiki, the GPU functions target only a single GPU, but you can manually split up work yourself" sadly the link is no more active. What does it mean manually split it up? You have to set the device Id before every gpu opencv call? Is there any official example supporting the statement.
Also does it mean that in SLI / CrossFire mode one should do the manual switch?
0

The GTX 690 behaves as 2 separate CUDA devices, regardless of which OpenCV version you use. You don't need multiple GPU cards to get multiple GPUs; you get two on a single card such as the GTX 690. But, from the CUDA programming perspective, there is not much difference between using the two GPUs on the 690 and using 2 GPUs on separately connected GPU cards. Many OpenCV users use the ArrayFire CUDA library to supplement OpenCV with more image processing features and easy multi-GPU scaling. Of course, my disclaimer is that I work on ArrayFire, but I really do think that it will help you in this case.
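
For what it's worth, a hypothetical sketch of that multi-GPU pattern in ArrayFire might look like the following; it assumes the modern C++ API (af::getDeviceCount / af::setDevice), which differs from the 2012-era function names, and uses random data as a stand-in for real images:

#include <arrayfire.h>

void multiplyOnEachGpu()
{
    int numDevices = af::getDeviceCount();

    for(int dev = 0; dev < numDevices; dev++){
        af::setDevice(dev);                         // subsequent arrays live on this GPU
        af::array img     = af::randu(960, 1280);   // stand-in for a real 1280x960 image
        af::array weights = af::randu(960, 1280);
        af::array result  = img * weights;          // element-wise multiply on device "dev"
        af::eval(result);                           // force the lazy computation to execute
    }
}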

