
Problem Statement: I have to continuously process 8-megapixel images captured from a camera. Several image processing algorithms must run on each image, such as color interpolation and color transformation. These operations take a long time on the CPU, so I decided to do them on the GPU using a CUDA kernel. I have already written a working CUDA kernel for color transformation, but I still need more of a performance boost.

There are basically two computational times:

  1. Copying the source image from CPU to GPU and vice-versa
  2. Processing of the source image at GPU

While an image is being copied from CPU to GPU, nothing else happens. Similarly, while the GPU is processing an image, nothing else happens.

MY IDEA: I want to use multi-threading to save some time: capture the next image while the previous image is being processed on the GPU. Then, when the GPU finishes processing the previous image, the next image is already there, ready to be transferred from CPU to GPU.

What I need: I am completely new to the world of multi-threading. I am watching some tutorials and reading other material to learn more about it, so I am looking for suggestions about the proper steps and proper logic.

  • @talonmies: I have asked politely to be told the reason for the downvote so that I can improve the post. So, there is nothing like a "sue me" thing here. Commented Jul 2, 2015 at 14:27
  • Use C++11 threads, google tutorials for either C++11 threads or boost threads (because boost threads are nearly identical). It isn't too hard to learn. Also, if you need to do any processing on the CPU, look into using "intrinsic" code to take advantage of SSE/AVX vector processing operations. Commented Jul 2, 2015 at 14:30
  • What is the frame rate of the incoming image stream? Is your image processing faster than real-time or do you lag behind? Commented Jul 2, 2015 at 14:31
  • @m.s. : I want to maintain 30 fps. Commented Jul 2, 2015 at 14:32
  • do you want to drop frames if you cannot achieve this frame rate? Commented Jul 2, 2015 at 14:33

2 Answers


I'm not sure you really need threads for this. CUDA supports asynchronous concurrent execution between host and device, without the need for multiple CPU threads. What you're asking for is a pretty standard "pipelined" algorithm. It would look something like this:

$ cat t832.cu
#include <stdio.h>

#define IMGSZ 8000000
// for this example, NUM_FRAMES must be less than 255
#define NUM_FRAMES 128
#define nTPB 256
#define nBLK 64


unsigned char cur_frame = 0;
unsigned char validated_frame = 0;


bool validate_image(unsigned char *img) {
  validated_frame++;
  for (int i = 0; i < IMGSZ; i++)
    if (img[i] != validated_frame) {
      printf("image validation failed at %d, was: %d, should be: %d\n", i, img[i], validated_frame);
      return false;}
  return true;
}

void CUDART_CB my_callback(cudaStream_t stream, cudaError_t status, void* data) {
    validate_image((unsigned char *)data);
}


bool capture_image(unsigned char *img){

  for (int i = 0; i < IMGSZ; i++) img[i] = cur_frame;
  if (++cur_frame == NUM_FRAMES) {cur_frame--; return true;}
  return false;
}

__global__ void img_proc_kernel(unsigned char *img){

  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  while(idx < IMGSZ){
    img[idx]++;
    idx += gridDim.x*blockDim.x;}
}

int main(){

  // setup

  bool done = false;
  unsigned char *h_imgA, *h_imgB, *d_imgA, *d_imgB;
  size_t dsize = IMGSZ*sizeof(unsigned char);
  cudaHostAlloc(&h_imgA, dsize, cudaHostAllocDefault);
  cudaHostAlloc(&h_imgB, dsize, cudaHostAllocDefault);
  cudaMalloc(&d_imgA, dsize);
  cudaMalloc(&d_imgB, dsize);
  cudaStream_t st1, st2;
  cudaStreamCreate(&st1); cudaStreamCreate(&st2);
  unsigned char *cur = h_imgA;
  unsigned char *d_cur = d_imgA;
  unsigned char *nxt = h_imgB;
  unsigned char *d_nxt = d_imgB;
  cudaStream_t *curst = &st1;
  cudaStream_t *nxtst = &st2;


  done = capture_image(cur); // grabs a frame and puts it in cur
  // enter main loop
  while (!done){
    cudaMemcpyAsync(d_cur, cur, dsize, cudaMemcpyHostToDevice, *curst); // send frame to device
    img_proc_kernel<<<nBLK, nTPB, 0, *curst>>>(d_cur); // process frame
    cudaMemcpyAsync(cur, d_cur, dsize, cudaMemcpyDeviceToHost, *curst);
  // insert a cuda stream callback here to copy the cur frame to output
    cudaStreamAddCallback(*curst, &my_callback, (void *)cur, 0);
    cudaStreamSynchronize(*nxtst); // prevent overrun
    done = capture_image(nxt); // capture nxt image while GPU is processing cur
    unsigned char *tmp = cur;
    cur = nxt;
    nxt = tmp;   // ping - pong
    tmp = d_cur;
    d_cur = d_nxt;
    d_nxt = tmp;
    cudaStream_t *st_tmp = curst;
    curst = nxtst;
    nxtst = st_tmp;
    }
}
$ nvcc -o t832 t832.cu
$ cuda-memcheck ./t832
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

There are many CUDA sample codes that may also be helpful, such as simpleStreams, asyncAPI, and simpleCallbacks.


10 Comments

"Pipelined algorithm"---this is exactly what I was trying to ask about through this question, but as I am new to CUDA and multi-threading, I did not know the term and was not able to find something suitable. Thanks.
PLEASE CHECK UPDATE. I have written a program based on your suggestion. It is not optimized code, but I just wanted to see it working. The program crashes without any error code. I think I am making a mistake in extracting the output image. I am a bit confused by the while loop. I don't understand whether done = capture_image(nxt); can work while the previous image is still being processed on the GPU.
Program flow: main() -> cudaProcessing() ->captureImage() -> cudaProcessing() -> colorTransformation_kernel()
Post a new question please. And I don't have openCV to work with. That is not a MCVE that I can use. If you want to understand the concepts, it should not be necessary to bring OpenCV into it.
Ok, I will post a new question. But I think that I am doing some conceptual mistake. Could you please have a look at cudaProcessing() because I think that's the place where I am making mistake.

Since your question is very broad, I can only offer the following advice:

1) Use CUDA streams

When using more than one CUDA stream, the CPU-to-GPU memory transfer, the GPU processing, and the GPU-to-CPU memory transfer can overlap. This way, the processing of the next image can already begin while the result of the previous one is transferred back.

You can also decompose each frame. Use n streams per frame and launch the image processing kernels n times with an offset.
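A minimal sketch of that per-frame decomposition might look like the following (hypothetical helper: the chunk count, stream count, kernel signature, and launch configuration are all assumptions, error checking is omitted, and the kernel must be written to process a chunk of the given size rather than the whole image):

```cuda
// Sketch: split one frame into NCHUNKS pieces and cycle them over NSTREAMS
// streams, so the copy of one chunk can overlap the kernel on another.
#define NSTREAMS 4
#define NCHUNKS  8

void process_frame(unsigned char *h_img, unsigned char *d_img,
                   size_t img_size, cudaStream_t streams[NSTREAMS])
{
  size_t chunk = img_size / NCHUNKS;   // assume it divides evenly
  for (int i = 0; i < NCHUNKS; ++i) {
    size_t off = i * chunk;
    cudaStream_t s = streams[i % NSTREAMS];
    // copy chunk i to the device on stream s
    cudaMemcpyAsync(d_img + off, h_img + off, chunk,
                    cudaMemcpyHostToDevice, s);
    // process only this chunk (kernel must take a size/offset)
    img_proc_kernel<<<nBLK, nTPB, 0, s>>>(d_img + off, chunk);
    // copy the result of chunk i back on the same stream
    cudaMemcpyAsync(h_img + off, d_img + off, chunk,
                    cudaMemcpyDeviceToHost, s);
  }
  for (int i = 0; i < NSTREAMS; ++i)
    cudaStreamSynchronize(streams[i]);
}
```

Note that the overlap only happens if the host buffers were allocated with cudaHostAlloc (pinned memory), as in the other answer.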

2) Apply the producer-consumer scheme

The producer thread captures frames from the camera and stores them in a thread-safe container. Each consumer thread fetches a frame from this source container, uploads it to the GPU using its own CUDA stream(s), launches the kernel, and copies the result back to the host. Each consumer thread synchronizes with its stream(s) before trying to get a new image from the source container.

A simple implementation could look like this:

#include <vector>
#include <thread>
#include <memory>

struct ThreadSafeContainer{ /*...*/ };

struct Producer
{
    Producer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {

    }

    void run()
    {
        while(true)
        {
            // grab image from camera
            // store image in container
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
};

struct Consumer
{
    Consumer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {
        cudaStreamCreate(&stream);
    }
    ~Consumer()
    {
        cudaStreamDestroy(stream);
    }

    void run()
    {
        while(true)
        {
            // read next image from container

            // upload to GPU
            cudaMemcpyAsync(...,...,...,stream);
            // run kernel
            kernel<<<..., ..., ..., stream>>>(...);
            // copy results back
            cudaMemcpyAsync(...,...,...,stream);

            // wait for results 
            cudaStreamSynchronize(stream);

            // do something with the results
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
    cudaStream_t stream; // or multiple streams per consumer
};


int main()
{
    // create an instance of ThreadSafeContainer which will be shared between Producer and Consumer instances
    auto container = std::make_shared<ThreadSafeContainer>();

    // create one instance of Producer, pass the shared container as an argument to the constructor
    auto p = std::make_shared<Producer>(container);
    // create a separate thread which executes Producer::run  
    std::thread producer_thread(&Producer::run, p);

    const int consumer_count = 2;
    std::vector<std::thread> consumer_threads;
    std::vector<std::shared_ptr<Consumer>> consumers;

    // create as many consumers as specified
    for (int i=0; i<consumer_count;++i)
    {
        // create one instance of Consumer, pass the shared container as an argument to the constructor
        auto c = std::make_shared<Consumer>(container);
        // create a separate thread which executes Consumer::run
        consumer_threads.push_back(std::thread(&Consumer::run, c));
    }

    // wait for the threads to finish, otherwise the program will just exit here and the threads will be killed
    // in this example, the program will never exit since the infinite loop in the run() methods never end
    producer_thread.join();
    for (auto& t : consumer_threads)
    {
        t.join();
    }

    return 0;
}

15 Comments

thanks a lot for your reply. I think that your second method is something which I am looking for. But I did not understand much conceptually. If possible, kindly elaborate it a little more or else let me know some more resources about it.
@skm the two methods are not mutually exclusive, you can combine them; i.e. having multiple streams AND a consumer-producer approach
As I have mentioned in my post, most of the computation time goes to two processes: 1. copying the image from CPU to GPU and 2. processing on the GPU. So, basically, I want to parallelise these two operations. While one image is being processed on the GPU, another image should already be transferred to some sort of buffer memory on the GPU.
Can you suggest the basic structure of the project so that I can explore the details myself by researching about that stuff.
@skm if you do not understand that basic C++ syntax, you should try to read up on C++ before approaching multithreading. If after that something is still unclear (which I doubt) post a new SO question
