
I have a flattened 4-D array on the host, and I want to copy a part of it (the red region shown in the image below) to the device.
[image: red sub-region of the flattened 4-D array]

I don't know how to copy a region that is not contiguous in memory.
The reason I copy only a part of the array is that the original array is over 10 GB and I only need 10% of it.
At first I tried doing it with a for loop, but it took too much time.
Is there a better approach?

#include <cuda_runtime.h>
#include <stdlib.h>

int main(){
    int nx = 100, ny = 200, nz = 300, nch = 400;
    int idx_x_beg  = 50;  int idx_x_end  = 100;   // end indices are exclusive
    int idx_y_beg  = 100; int idx_y_end  = 200;
    int idx_z_beg  = 150; int idx_z_end  = 300;
    int idx_ch_beg = 200; int idx_ch_end = 400;

    // extents of the sub-block to copy
    int idx_x_size  = idx_x_end  - idx_x_beg;
    int idx_y_size  = idx_y_end  - idx_y_beg;
    int idx_z_size  = idx_z_end  - idx_z_beg;
    int idx_ch_size = idx_ch_end - idx_ch_beg;

    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray;
    cudaMalloc((void**)&d_4dArray, sizeof(double)*nx*ny*nz*nch);

    // one small cudaMemcpy per contiguous x-run of the sub-block: very slow
    for (int temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (int temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (int temp_y = 0; temp_y < idx_y_size; temp_y++) {
                cudaMemcpy(d_4dArray + ((size_t)temp_ch*idx_z_size + temp_z)*idx_y_size*idx_x_size + temp_y*idx_x_size
                         , h_4dArray + ((size_t)(temp_ch + idx_ch_beg)*nz + (temp_z + idx_z_beg))*ny*nx + (size_t)(temp_y + idx_y_beg)*nx + idx_x_beg
                         , sizeof(double)*idx_x_size, cudaMemcpyHostToDevice);
            }
        }
    }

    return 0;
}
  • What you have shown couldn't possibly be correct; you have no pointer arithmetic on your pointers passed to cudaMemcpy. You are always copying the first set of bytes from each pointer. But leaving that aside, copy everything to a contiguous buffer on the host. Use roughly what you have outlined, except memcpy instead of cudaMemcpy (with proper pointer arithmetic/pointer offsets). Then copy that contiguous buffer to the device in a single cudaMemcpy call. Commented Apr 28, 2021 at 16:38
  • @RobertCrovella Sorry, I copied the wrong version of the code. Do you mean to gather from each start pointer produced by the for loop into one buffer, and then copy it with a single cudaMemcpy? Is there any example code in the CUDA Samples? Commented Apr 28, 2021 at 17:09

1 Answer


For copying a subset of an array, CUDA provides cudaMemcpy2D (which can copy a single 2D section of a multidimensional array) and cudaMemcpy3D (which can copy a single 3D section of a multidimensional array). You can find lots of questions on the cuda tag showing how to use them.
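
For example, assuming the end indices from the question are exclusive, a single x/y sub-rectangle of one (z, channel) slice could be pulled out with one strided cudaMemcpy2D call roughly like this (a sketch, not a complete program; d_slice, x_size, and y_size are illustrative names, not from the question):

// Rough sketch: copy the x/y sub-rectangle of one (z, ch) slice of the
// row-major host array into a tightly packed device buffer.
int x_size = idx_x_end - idx_x_beg;
int y_size = idx_y_end - idx_y_beg;
double *d_slice;
cudaMalloc(&d_slice, sizeof(double) * x_size * y_size);

const double *src = h_4dArray
                  + ((size_t)idx_ch_beg * nz + idx_z_beg) * ny * nx   // origin of the (z, ch) slice
                  + (size_t)idx_y_beg * nx + idx_x_beg;               // corner of the sub-rectangle
cudaMemcpy2D(d_slice, sizeof(double) * x_size,    // packed destination, pitch = copied row length
             src,     sizeof(double) * nx,        // source pitch = full row length in bytes
             sizeof(double) * x_size,             // bytes per row to copy
             y_size,                              // number of rows
             cudaMemcpyHostToDevice);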

There are two problems with those approaches:

  1. They don't necessarily extend to the 4D case, i.e. you might still need a loop.
  2. The performance of these operations (host<->device transfer speed) is often significantly lower than that of a single cudaMemcpy operation copying the same number of bytes in aggregate.

So there is no free lunch here. I believe the best suggestion is to create an extra "contiguous" buffer on the host, do all your slice-copying into that buffer, then copy that buffer from host to device in a single cudaMemcpy call. After that, if you still need the 4D representation on the device, you will need to write a device kernel that "scatters" the data for you, conceptually the reverse of the code you have shown.

Sorry, I'm not going to write all that code for you. However, I will rough out the first portion of it (getting everything copied to a single contiguous buffer on the device), using the code you have shown:

#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

int main(){
    int nx = 100, ny = 200, nz = 300, nch = 400;
    int idx_x_beg  = 50;  int idx_x_end  = 100;   // end indices are exclusive
    int idx_y_beg  = 100; int idx_y_end  = 200;
    int idx_z_beg  = 150; int idx_z_end  = 300;
    int idx_ch_beg = 200; int idx_ch_end = 400;

    int idx_x_size  = idx_x_end  - idx_x_beg;
    int idx_y_size  = idx_y_end  - idx_y_beg;
    int idx_z_size  = idx_z_end  - idx_z_beg;
    int idx_ch_size = idx_ch_end - idx_ch_beg;

    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray, *h_temp, *d_temp;
    // total number of elements in the sub-block
    size_t temp_sz = (size_t)idx_x_size*idx_y_size*idx_z_size*idx_ch_size;
    h_temp = (double *)malloc(temp_sz*sizeof(double));
    cudaMalloc(&d_temp, temp_sz*sizeof(double));
    cudaMalloc((void**)&d_4dArray, sizeof(double)*nx*ny*nz*nch);
    size_t size_tr = 0;
    // gather the sub-block into a contiguous host buffer, one x-run at a time
    for (int temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (int temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (int temp_y = 0; temp_y < idx_y_size; temp_y++) {
                memcpy(h_temp + size_tr
                     , h_4dArray + ((size_t)(temp_ch + idx_ch_beg)*nz + (temp_z + idx_z_beg))*ny*nx + (size_t)(temp_y + idx_y_beg)*nx + idx_x_beg
                     , sizeof(double)*idx_x_size);
                size_tr += idx_x_size;
            }
        }
    }
    // one large host-to-device transfer instead of many small ones
    cudaMemcpy(d_temp, h_temp, temp_sz*sizeof(double), cudaMemcpyHostToDevice);
    // if necessary, put a cuda kernel here to scatter data from d_temp to d_4dArray
    return 0;
}

After that, as indicated, if you need the 4D representation on the device, you will need a CUDA kernel to scatter the data for you.
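
For completeness, here is a minimal sketch of what such a scatter kernel could look like (the kernel name, launch configuration, and the exclusive-end-index convention are my own assumptions, not part of the original code):

// Illustrative scatter kernel: each thread copies one element of the packed
// buffer d_temp back to its original position within the full 4-D array d_4dArray.
__global__ void scatter4d(const double *d_temp, double *d_4dArray,
                          int nx, int ny, int nz,
                          int x_beg, int x_end, int y_beg, int y_end,
                          int z_beg, int z_end, int ch_beg, int ch_end)
{
    int sx = x_end - x_beg, sy = y_end - y_beg;
    int sz = z_end - z_beg, sch = ch_end - ch_beg;
    size_t n = (size_t)sx * sy * sz * sch;
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Unpack the linear index of the packed buffer into (ch, z, y, x).
    int x  =  i % sx;
    int y  = (i /  sx) % sy;
    int z  = (i / ((size_t)sx * sy)) % sz;
    int ch =  i / ((size_t)sx * sy * sz);

    // Corresponding linear index in the full 4-D array.
    size_t full = (((size_t)(ch + ch_beg) * nz + (z + z_beg)) * ny + (y + y_beg)) * nx
                + (x + x_beg);
    d_4dArray[full] = d_temp[i];
}

// Example launch: one thread per element of the packed buffer, e.g.
// scatter4d<<<(temp_sz + 255) / 256, 256>>>(d_temp, d_4dArray, nx, ny, nz,
//     idx_x_beg, idx_x_end, idx_y_beg, idx_y_end,
//     idx_z_beg, idx_z_end, idx_ch_beg, idx_ch_end);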


1 Comment

Thanks for the details. I tried your suggestion: building the contiguous buffer took 0.0021 ms and the memcpy took 310 ms. That is impressive! Actually the original 4D array is really big, and the part of it that I want to compute on the graphics card is really small. So when d_temp is small, the whole process took only 0.0035 ms on average. I learned that it is much more efficient to realign the part of the matrix and pass it to the device. So interesting! Thank you again.
