6

How do I initialize device array which is allocated using cudaMalloc()?

I tried cudaMemset, but it fails to initialize the array to any value except 0. My call to cudaMemset looks like the one below, where value is set to 5.

cudaMemset(devPtr,value,number_bytes)
2
  • 1
Could you provide the code where you call cudaMemset? Commented May 14, 2012 at 20:31
  • 3
You understand that the value in cudaMemset is a byte value, not a word value, i.e. the same as in the C standard library memset? Commented May 15, 2012 at 9:16

2 Answers 2

13

As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:

cudaError_t cudaMemset  (   void *      devPtr,
                            int         value,
                            size_t      count    
                        )           

Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.

So value is a byte value. If you do something like:

int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);

what you are asking for is that each byte of devPtr be set to 5. If devPtr points to an array of integers, the result is that each integer word holds the value 84215045 (0x05050505). This is probably not what you had in mind.
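To make the byte-fill behaviour concrete, here is a minimal sketch (written for illustration and not compiled here, so treat it as a hedged example rather than a definitive program) that memsets an int array with the byte value 5 and copies it back to the host:

```cuda
// Sketch: demonstrate that cudaMemset fills *bytes*, not words.
// Filling an int array with the byte 5 yields 0x05050505 per word.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 4;
    int *devPtr = nullptr;
    cudaMalloc((void **)&devPtr, n * sizeof(int));
    cudaMemset(devPtr, 5, n * sizeof(int));   // sets every byte to 0x05

    int host[n];
    cudaMemcpy(host, devPtr, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", host[0]);                  // 84215045, not 5
    cudaFree(devPtr);
    return 0;
}
```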

Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as

template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    // Grid-stride loop: each thread writes every stride-th word,
    // so any grid size covers the whole array.
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}

(standard disclaimer: written in browser, never compiled, never tested, use at own risk).

Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
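By way of illustration (a sketch assuming the initKernel template above; the block size of 256 and N = 2333 are arbitrary choices, not requirements), a launch for a float array might look like:

```cuda
// Hypothetical launch of the initKernel template for N floats.
// 256 threads per block is a common but arbitrary choice; the
// grid-stride loop inside the kernel picks up any remainder.
const size_t N = 2333;
float *devPtr;
cudaMalloc((void **)&devPtr, N * sizeof(float));

const int blockSize = 256;
const int gridSize  = (int)((N + blockSize - 1) / blockSize);
initKernel<float><<<gridSize, blockSize>>>(devPtr, 3.14f, N);
cudaDeviceSynchronize();
```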

Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing but for 16 bit and 32 bit word types. If you need to set 64 bit or larger types (so doubles or vector types), your best option is to use your own kernel.
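For completeness, a driver-API sketch (not compiled here; the context setup is the minimal boilerplate the driver API requires) that sets N 32-bit words to 5 with cuMemsetD32:

```cuda
// Driver API sketch: cuMemsetD32 takes a *word* count, not bytes.
#include <cuda.h>

void setWords(size_t N) {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr devPtr;
    cuMemAlloc(&devPtr, N * sizeof(unsigned int));
    cuMemsetD32(devPtr, 5, N);   // every 32-bit word becomes 5

    cuMemFree(devPtr);
    cuCtxDestroy(ctx);
}
```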


5 Comments

Currently, in my code I'm doing the same thing, but I wanted a way with only cudaMemset.
@user997704: There isn't a way with cudaMemset. Either run a kernel of your own, or use cuMemsetD16/cuMemsetD32 from the driver API
For self-containment, it would be nice to see a sensible grid and block size for invoking such a kernel, e.g. if my array is N=2333, how would I call this kernel?
I actually don't really understand the stride loop. Why you need a loop at all? Shouldn't each thread set one element and be done with it?
@GiovanniAzua I believe the stride is there to have one thread initialize more than one element in the array. Sometimes it is better to give more work to a thread than to launch a lot of threads.
1

I also needed a solution to this question, and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid with for(; tidx < nwords; tidx += stride), nor, for that matter, the kernel invocation or the counter-intuitive use of word sizes.

Therefore I created a much simpler monolithic generic kernel and parameterized it with a stride, i.e. you may use it to initialize a matrix in multiple ways, e.g. setting rows or columns to any value:

template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid * incx < n) {
        a[tid * incx] = value;
    }
}

Then you may invoke the kernel like this:

template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
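A hypothetical call (assuming BLOCK_SIZE is a compile-time constant such as 256, and that the kernel and wrapper above are in scope) might look like:

```cuda
// Hypothetical usage of deviceInitializeArray above.
#define BLOCK_SIZE 256

double *a;
const size_t n = 1024;
cudaMalloc((void **)&a, n * sizeof(double));

// incx = 1 initializes every element; incx = 8 would set every 8th,
// which is how rows or columns of a matrix can be targeted.
deviceInitializeArray<double>(a, 1.0, n, 1);
cudaDeviceSynchronize();
```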

1 Comment

Have a look at this answer if you want to understand why the loop is a good idea.
