
I have an integer matrix of size 100x200x800, stored on the host in a flat 100*200*800 vector, i.e., I have

int* h_data = (int*)malloc(sizeof(int)*100*200*800);

On the device (GPU), I want to pad each dimension with zeros such that I obtain a matrix of size 128x256x1024, allocated as follows:

int *d_data;
cudaMalloc((void**)&d_data, sizeof(int)*128*256*1024);

What is the best approach to obtain the zero-padded matrix? I have two ideas:

  1. Iterate through individual submatrices on the host and copy them directly to the correct location on the device.
    • This approach requires many cudaMemcpy calls and is thus likely to be very slow
  2. On the device, allocate memory for a 100x200x800 matrix and a 128x256x1024 matrix and write a kernel that copies the samples to the correct memory space
    • This approach is probably much faster but requires allocating memory for two matrices on the device

Is there any possibility for three-dimensional matrix indexing similar to MATLAB? In MATLAB, I could simply do the following:

h_data = rand(100, 200, 800);
d_data = zeros(128, 256, 1024);
d_data(1:100, 1:200, 1:800) = h_data;

Alternatively, if I copy the data to the device using cudaMemcpy(d_data, h_data, sizeof(int)*100*200*800, cudaMemcpyHostToDevice);, is it possible to reorder data in place such that I do not have to allocate memory for a second matrix, maybe using cudaMemcpy3D or cudaMemset3D?

  • The answer to the first question totally depends on the planned usage on the GPU: how will you address the array (sparsely, linearly, etc.), how many times will you read and write, etc. Commented Oct 12, 2021 at 22:11
  • I want to perform an FFT on the zero-padded matrix using the cuFFT library and copy the data back to the host. This should be performed continuously on a stream, i.e., as soon as one FFT is computed, the next matrix should be processed. I'm new to CUDA and I couldn't find any information about sparse or linear memory. The matrix itself is not sparse. I hope this information helps. Commented Oct 13, 2021 at 12:27
  • The second approach would be preferred for most factors of consideration that I can think of. The extra allocation here is 64MB in size. That's a problem? No, you cannot use matlab style matrix slicing notation to refer to or populate matrices in C++, which is what CUDA is based on. Commented Oct 13, 2021 at 15:14

1 Answer


As you hypothesize, you can use cudaMemcpy3D for this operation. Basically:

  1. Allocate your device array as normal
  2. Zero it with cudaMemset
  3. Use cudaMemcpy3D to copy the selected subarray from the host source to the correct region of the device destination array.

The cudaMemcpy3D API is a bit baroque, cryptically documented, and has a few common traps for beginners. Basically, a linear memory transfer requires a pitched pointer for both the source and the destination, and an extent denoting the size of the transfer. The confusing part is that the argument meanings change depending on whether the source and/or destination memory is a CUDA array or pitched linear memory. In code, you will want something like this:

#include <cstdlib>
#include <cuda_runtime.h>

// Host source: 100 x 200 x 800, densely packed
int hw = 100, hh = 200, hd = 800;
size_t hpitch = hw * sizeof(int);
int* h_data = (int*)malloc(hpitch * hh * hd);

// Device destination: 128 x 256 x 1024, zero-initialized
int dw = 128, dh = 256, dd = 1024;
size_t dpitch = dw * sizeof(int);
int *d_data; 
cudaMalloc((void**)&d_data, dpitch * dh * dd);
cudaMemset(d_data, 0, dpitch * dh * dd);

cudaPitchedPtr src = make_cudaPitchedPtr(h_data, hpitch, hw, hh);
cudaPitchedPtr dst = make_cudaPitchedPtr(d_data, dpitch, dw, dh);

// For linear (non-array) memory, the extent width is given in bytes
cudaExtent copyext = make_cudaExtent(hpitch, hh, hd);

cudaMemcpy3DParms copyparms = {0};
copyparms.srcPtr = src;
copyparms.dstPtr = dst;
copyparms.extent = copyext;
copyparms.kind = cudaMemcpyHostToDevice;

cudaMemcpy3D(&copyparms);

[Note: all done in the browser, never compiled or run; use at own risk]


1 Comment

Thank you so much. This was exactly what I was looking for. You could edit your answer and change copyparms.desPtr to copyparms.dstPtr. Everything else was working immediately.
