I have a 4-D array stored flattened in a host array, and I want to copy part of it (the red region) to the device, as shown in the image below.

I don't know how to copy a region that is not contiguous in memory.
I only copy part of the array because the original is over 10 GB and I need only about 10% of it.
My first attempt used nested for loops with one cudaMemcpy per row, but it took far too long.
Is there a better approach?
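#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Full array extents; x is the fastest-varying dimension.
    size_t nx = 100, ny = 200, nz = 300, nch = 400;
    // Sub-region to copy, as half-open ranges [beg, end).
    size_t idx_x_beg = 50, idx_x_end = 100;
    size_t idx_y_beg = 100, idx_y_end = 200;
    size_t idx_z_beg = 150, idx_z_end = 300;
    size_t idx_ch_beg = 200, idx_ch_end = 400;
    size_t idx_x_size = idx_x_end - idx_x_beg, idx_y_size = idx_y_end - idx_y_beg;
    size_t idx_z_size = idx_z_end - idx_z_beg, idx_ch_size = idx_ch_end - idx_ch_beg;
    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray;  // holds only the packed sub-region
    cudaMalloc((void**)&d_4dArray, sizeof(double)*idx_x_size*idx_y_size*idx_z_size*idx_ch_size);
    // One cudaMemcpy per contiguous run along x (this is the slow part).
    for (size_t temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (size_t temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (size_t temp_y = 0; temp_y < idx_y_size; temp_y++) {
                cudaMemcpy(d_4dArray + temp_ch*idx_z_size*idx_y_size*idx_x_size + temp_z*idx_y_size*idx_x_size + temp_y*idx_x_size,
                           h_4dArray + (idx_ch_beg + temp_ch)*nz*ny*nx + (idx_z_beg + temp_z)*ny*nx + (idx_y_beg + temp_y)*nx + idx_x_beg,
                           sizeof(double)*idx_x_size, cudaMemcpyHostToDevice);
            }
        }
    }
    cudaFree(d_4dArray);
    free(h_4dArray);
    return 0;
}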
cudaMemcpy. You are always copying the first set of bytes from each pointer. But leaving that aside, copy everything to a contiguous buffer on the host. Use roughly what you have outlined, except memcpy instead of cudaMemcpy (with proper pointer arithmetic/pointer offsets). Then copy that contiguous buffer to the device in a single cudaMemcpy call.

cudaMemcpy? Is there any example code in CUDA Samples?
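
The packing step suggested in the comment could look roughly like the sketch below. It is not taken from the CUDA Samples. The `Range` struct and the `pack_subarray` function are hypothetical names introduced for illustration, and the sketch assumes x is the fastest-varying dimension and that each range is half-open, `[beg, end)`. Only the host-side packing is compiled here; the single device transfer is shown as a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: half-open index range [beg, end) along one axis.
struct Range {
    std::size_t beg, end;
    std::size_t size() const { return end - beg; }
};

// Packs a 4-D sub-region of a flattened array (layout: ch, z, y, x with x
// fastest-varying) into one contiguous host buffer using plain memcpy.
std::vector<double> pack_subarray(const double* src,
                                  std::size_t nx, std::size_t ny,
                                  std::size_t nz, std::size_t nch,
                                  Range x, Range y, Range z, Range ch)
{
    assert(x.beg <= x.end && x.end <= nx);
    assert(y.beg <= y.end && y.end <= ny);
    assert(z.beg <= z.end && z.end <= nz);
    assert(ch.beg <= ch.end && ch.end <= nch);

    std::vector<double> packed(x.size() * y.size() * z.size() * ch.size());
    double* dst = packed.data();

    for (std::size_t c = ch.beg; c < ch.end; ++c) {
        for (std::size_t k = z.beg; k < z.end; ++k) {
            for (std::size_t j = y.beg; j < y.end; ++j) {
                // Start of the contiguous run along x for this (c, k, j).
                const double* row = src + ((c * nz + k) * ny + j) * nx + x.beg;
                std::memcpy(dst, row, x.size() * sizeof(double));
                dst += x.size();
            }
        }
    }
    return packed;
}

// With the packed buffer in hand, one transfer is enough (CUDA runtime call,
// shown as a comment because this sketch is host-only; d_sub is assumed to be
// a device pointer with room for packed.size() doubles):
//   cudaMemcpy(d_sub, packed.data(), packed.size() * sizeof(double),
//              cudaMemcpyHostToDevice);
```

For the sizes in the question this would be called as `pack_subarray(h_4dArray, nx, ny, nz, nch, Range{50, 100}, Range{100, 200}, Range{150, 300}, Range{200, 400})`. The host loop still visits every row, but a host `memcpy` is much cheaper per call than a `cudaMemcpy`, so the per-call overhead is paid only once for the device transfer. If transfer speed matters further, the packed buffer could instead be allocated as pinned memory with `cudaMallocHost`.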