
I have a flattened 4-D array on the host, and I want to copy a part of it (the red region shown in the image below) to the device.
[image: red sub-region of the flattened 4-D array]

I don't know how to copy a region that is not contiguous in memory.
The reason I copy only a part of the array is that the original array is over 10 GB and I only need 10% of it.
At first I tried doing it with a for loop, but it took too much time.
Is there a better approach?

#include <cuda_runtime.h>
#include <stdlib.h>

int main(){
    int nx = 100, ny = 200, nz = 300, nch = 400;
    int idx_x_beg  = 50;  int idx_x_end  = 100;   // end indices are exclusive
    int idx_y_beg  = 100; int idx_y_end  = 200;
    int idx_z_beg  = 150; int idx_z_end  = 300;
    int idx_ch_beg = 200; int idx_ch_end = 400;

    // extents of the sub-block to copy
    int idx_x_size  = idx_x_end  - idx_x_beg;
    int idx_y_size  = idx_y_end  - idx_y_beg;
    int idx_z_size  = idx_z_end  - idx_z_beg;
    int idx_ch_size = idx_ch_end - idx_ch_beg;

    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray;
    cudaMalloc((void**)&d_4dArray, sizeof(double)*nx*ny*nz*nch);

    // one small cudaMemcpy per contiguous x-run of the sub-block: very slow
    for (int temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (int temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (int temp_y = 0; temp_y < idx_y_size; temp_y++) {
                cudaMemcpy(d_4dArray + ((size_t)temp_ch*idx_z_size + temp_z)*idx_y_size*idx_x_size + temp_y*idx_x_size
                         , h_4dArray + ((size_t)(temp_ch + idx_ch_beg)*nz + (temp_z + idx_z_beg))*ny*nx + (size_t)(temp_y + idx_y_beg)*nx + idx_x_beg
                         , sizeof(double)*idx_x_size, cudaMemcpyHostToDevice);
            }
        }
    }

    return 0;
}
  • What you have shown couldn't possibly be correct; you have no pointer arithmetic on your pointers passed to cudaMemcpy. You are always copying the first set of bytes from each pointer. But leaving that aside, copy everything to a contiguous buffer on the host. Use roughly what you have outlined, except memcpy instead of cudaMemcpy (with proper pointer arithmetic/pointer offsets). Then copy that contiguous buffer to the device in a single cudaMemcpy call. Commented Apr 28, 2021 at 16:38
  • @RobertCrovella Sorry, I copied the wrong version of the code. Do you mean to gather from each start pointer produced by the for loop into one buffer, and then copy it with a single cudaMemcpy? Is there any example code in the CUDA Samples? Commented Apr 28, 2021 at 17:09

1 Answer


For copying a subset of an array, CUDA provides cudaMemcpy2D (which can copy a single 2D section of a multidimensional array) and cudaMemcpy3D (which can copy a single 3D section of a multidimensional array). You can find lots of questions on the cuda tag showing how to use them.
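
For example, assuming the end indices from the question are exclusive, a single x/y sub-rectangle of one (z, channel) slice could be pulled out with one strided cudaMemcpy2D call roughly like this (a sketch, not a complete program; d_slice, x_size, and y_size are illustrative names, not from the question):

// Rough sketch: copy the x/y sub-rectangle of one (z, ch) slice of the
// row-major host array into a tightly packed device buffer.
int x_size = idx_x_end - idx_x_beg;
int y_size = idx_y_end - idx_y_beg;
double *d_slice;
cudaMalloc(&d_slice, sizeof(double) * x_size * y_size);

const double *src = h_4dArray
                  + ((size_t)idx_ch_beg * nz + idx_z_beg) * ny * nx   // origin of the (z, ch) slice
                  + (size_t)idx_y_beg * nx + idx_x_beg;               // corner of the sub-rectangle
cudaMemcpy2D(d_slice, sizeof(double) * x_size,    // packed destination, pitch = copied row length
             src,     sizeof(double) * nx,        // source pitch = full row length in bytes
             sizeof(double) * x_size,             // bytes per row to copy
             y_size,                              // number of rows
             cudaMemcpyHostToDevice);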

There are two problems with those approaches:

  1. They don't necessarily extend to the 4D case, i.e. you might still need a loop.
  2. The performance of these operations (host<->device transfer speed) is often significantly lower than that of a single cudaMemcpy operation copying the same number of bytes in aggregate.

So there is no free lunch here. I believe the best suggestion is to create an extra "contiguous" buffer on the host, do all your slice-copying into that buffer, then copy that buffer from host to device in a single cudaMemcpy call. After that, if you still need the 4D representation on the device, you will need to write a device kernel that "scatters" the data for you, conceptually the reverse of the code you have shown.

Sorry, I'm not going to write all that code for you. However, I will rough out the first portion of it (getting everything copied to a single contiguous buffer on the device), using the code you have shown:

#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

int main(){
    int nx = 100, ny = 200, nz = 300, nch = 400;
    int idx_x_beg  = 50;  int idx_x_end  = 100;   // end indices are exclusive
    int idx_y_beg  = 100; int idx_y_end  = 200;
    int idx_z_beg  = 150; int idx_z_end  = 300;
    int idx_ch_beg = 200; int idx_ch_end = 400;

    int idx_x_size  = idx_x_end  - idx_x_beg;
    int idx_y_size  = idx_y_end  - idx_y_beg;
    int idx_z_size  = idx_z_end  - idx_z_beg;
    int idx_ch_size = idx_ch_end - idx_ch_beg;

    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray, *h_temp, *d_temp;
    // total number of elements in the sub-block
    size_t temp_sz = (size_t)idx_x_size*idx_y_size*idx_z_size*idx_ch_size;
    h_temp = (double *)malloc(temp_sz*sizeof(double));
    cudaMalloc(&d_temp, temp_sz*sizeof(double));
    cudaMalloc((void**)&d_4dArray, sizeof(double)*nx*ny*nz*nch);
    size_t size_tr = 0;
    // gather the sub-block into a contiguous host buffer, one x-run at a time
    for (int temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (int temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (int temp_y = 0; temp_y < idx_y_size; temp_y++) {
                memcpy(h_temp + size_tr
                     , h_4dArray + ((size_t)(temp_ch + idx_ch_beg)*nz + (temp_z + idx_z_beg))*ny*nx + (size_t)(temp_y + idx_y_beg)*nx + idx_x_beg
                     , sizeof(double)*idx_x_size);
                size_tr += idx_x_size;
            }
        }
    }
    // one large host-to-device transfer instead of many small ones
    cudaMemcpy(d_temp, h_temp, temp_sz*sizeof(double), cudaMemcpyHostToDevice);
    // if necessary, put a cuda kernel here to scatter data from d_temp to d_4dArray
    return 0;
}

After that, as indicated, if you need the 4D representation on the device, you will need a CUDA kernel to scatter the data for you.
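
For completeness, here is a minimal sketch of what such a scatter kernel could look like (the kernel name, launch configuration, and the exclusive-end-index convention are my own assumptions, not part of the original code):

// Illustrative scatter kernel: each thread copies one element of the packed
// buffer d_temp back to its original position within the full 4-D array d_4dArray.
__global__ void scatter4d(const double *d_temp, double *d_4dArray,
                          int nx, int ny, int nz,
                          int x_beg, int x_end, int y_beg, int y_end,
                          int z_beg, int z_end, int ch_beg, int ch_end)
{
    int sx = x_end - x_beg, sy = y_end - y_beg;
    int sz = z_end - z_beg, sch = ch_end - ch_beg;
    size_t n = (size_t)sx * sy * sz * sch;
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Unpack the linear index of the packed buffer into (ch, z, y, x).
    int x  =  i % sx;
    int y  = (i /  sx) % sy;
    int z  = (i / ((size_t)sx * sy)) % sz;
    int ch =  i / ((size_t)sx * sy * sz);

    // Corresponding linear index in the full 4-D array.
    size_t full = (((size_t)(ch + ch_beg) * nz + (z + z_beg)) * ny + (y + y_beg)) * nx
                + (x + x_beg);
    d_4dArray[full] = d_temp[i];
}

// Example launch: one thread per element of the packed buffer, e.g.
// scatter4d<<<(temp_sz + 255) / 256, 256>>>(d_temp, d_4dArray, nx, ny, nz,
//     idx_x_beg, idx_x_end, idx_y_beg, idx_y_end,
//     idx_z_beg, idx_z_end, idx_ch_beg, idx_ch_end);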


1 Comment

Thanks for the details. I tried your suggestion: building the contiguous buffer took 0.0021 ms and the memcpy took 310 ms. That is impressive! Actually the original 4D array is really big, and the part of it that I want to compute on the graphics card is really small. So when d_temp is small, the whole process took only 0.0035 ms on average. I learned that it is much more efficient to realign the part of the matrix and pass it to the device. So interesting! Thank you again.
