6

How do I initialize device array which is allocated using cudaMalloc()?

I tried cudaMemset, but it fails to initialize the array to any value except 0. My call to cudaMemset looks like the one below, where value is set to 5.

cudaMemset(devPtr,value,number_bytes)
2
  • 1
Could you provide the code where you call cudaMemset? Commented May 14, 2012 at 20:31
  • 3
You understand that the value in cudaMemset is a byte value, not a word value, i.e. the same as in the C standard library memset? Commented May 15, 2012 at 9:16

2 Answers 2

13

As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:

cudaError_t cudaMemset  (   void *      devPtr,
                            int         value,
                            size_t      count    
                        )           

Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.

So value is a byte value. If you do something like:

int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);

what you are asking for is that each byte of devPtr be set to 5. If devPtr points to an array of integers, the result is that each integer word holds the value 84215045 (0x05050505). This is probably not what you had in mind.
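To make the byte-fill behaviour concrete, here is a minimal sketch (written for illustration and not compiled here, so treat it as a hedged example rather than a definitive program) that memsets an int array with the byte value 5 and copies it back to the host:

```cuda
// Sketch: demonstrate that cudaMemset fills *bytes*, not words.
// Filling an int array with the byte 5 yields 0x05050505 per word.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 4;
    int *devPtr = nullptr;
    cudaMalloc((void **)&devPtr, n * sizeof(int));
    cudaMemset(devPtr, 5, n * sizeof(int));   // sets every byte to 0x05

    int host[n];
    cudaMemcpy(host, devPtr, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", host[0]);                  // 84215045, not 5
    cudaFree(devPtr);
    return 0;
}
```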

Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as

template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    // Grid-stride loop: each thread writes every stride-th word,
    // so any grid size covers the whole array.
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}

(standard disclaimer: written in browser, never compiled, never tested, use at own risk).

Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
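By way of illustration (a sketch assuming the initKernel template above; the block size of 256 and N = 2333 are arbitrary choices, not requirements), a launch for a float array might look like:

```cuda
// Hypothetical launch of the initKernel template for N floats.
// 256 threads per block is a common but arbitrary choice; the
// grid-stride loop inside the kernel picks up any remainder.
const size_t N = 2333;
float *devPtr;
cudaMalloc((void **)&devPtr, N * sizeof(float));

const int blockSize = 256;
const int gridSize  = (int)((N + blockSize - 1) / blockSize);
initKernel<float><<<gridSize, blockSize>>>(devPtr, 3.14f, N);
cudaDeviceSynchronize();
```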

Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing but for 16 bit and 32 bit word types. If you need to set 64 bit or larger types (so doubles or vector types), your best option is to use your own kernel.
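For completeness, a driver-API sketch (not compiled here; the context setup is the minimal boilerplate the driver API requires) that sets N 32-bit words to 5 with cuMemsetD32:

```cuda
// Driver API sketch: cuMemsetD32 takes a *word* count, not bytes.
#include <cuda.h>

void setWords(size_t N) {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr devPtr;
    cuMemAlloc(&devPtr, N * sizeof(unsigned int));
    cuMemsetD32(devPtr, 5, N);   // every 32-bit word becomes 5

    cuMemFree(devPtr);
    cuCtxDestroy(ctx);
}
```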


5 Comments

Currently, in my code I'm doing the same thing, but I wanted a way with only cudaMemset.
@user997704: There isn't a way with cudaMemset. Either run a kernel of your own, or use cuMemsetD16/cuMemsetD32 from the driver API
For self-containment, it would be nice to see a sensible grid and block size for invoking such a kernel, e.g. if my array is N=2333, how would I call this kernel?
I actually don't really understand the stride loop. Why you need a loop at all? Shouldn't each thread set one element and be done with it?
@GiovanniAzua I believe the stride is there to have one thread initialize more than one element in the array. Sometimes it is better to give more work to a thread than to launch a lot of threads.
1

I also needed a solution to this question, and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid with for(; tidx < nwords; tidx += stride), nor, for that matter, the kernel invocation or the counter-intuitive use of word sizes.

Therefore I created a much simpler monolithic generic kernel and parameterized it with a stride, i.e. you may use it to initialize a matrix in multiple ways, e.g. setting rows or columns to any value:

template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid * incx < n) {
        a[tid * incx] = value;
    }
}

Then you may invoke the kernel like this:

template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
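A hypothetical call (assuming BLOCK_SIZE is a compile-time constant such as 256, and that the kernel and wrapper above are in scope) might look like:

```cuda
// Hypothetical usage of deviceInitializeArray above.
#define BLOCK_SIZE 256

double *a;
const size_t n = 1024;
cudaMalloc((void **)&a, n * sizeof(double));

// incx = 1 initializes every element; incx = 8 would set every 8th,
// which is how rows or columns of a matrix can be targeted.
deviceInitializeArray<double>(a, 1.0, n, 1);
cudaDeviceSynchronize();
```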

1 Comment

Have a look at this answer if you want to understand why the loop is a good idea.
