I have some code for texture object allocation and Host to Device copy. It is just a modification of the answer here. I do not explicitly use streams, just cudaSetDevice()
This code works fine, however, when I run the Visual Profiler, I can see that the memory copies from Host to Array are not asynchronous. They are allocated each to their own device stream, but the second one does not start until the first one finishes (running on 2 GPUs). I have tried it with large images, so I make certain that its not overhead from CPU.
My guess is that there is something in the code that requires to be synchronous thus halts the CPU, but I don't know what. What can I do to make this loop asynchronous?
An MCVE:
void CreateTexture(int num_devices,float* imagedata, int nVoxelX, int nVoxelY, int nVoxelZ ,cudaArray** d_cuArrTex, cudaTextureObject_t *texImage);
int main(void)
{
int deviceCount =0 ;
cudaGetDeviceCount(&deviceCount);
int nVoxelX=512;
int nVoxelY=512;
int nVoxelZ=512;
float* image=(float*)malloc(nVoxelX*nVoxelY*nVoxelZ*sizeof(float));
cudaTextureObject_t *texImg =new cudaTextureObject_t[deviceCount];
cudaArray **d_cuArrTex = new cudaArray*[deviceCount];
CreateTexture(deviceCount,image, nVoxelX,nVoxelY, nVoxelZ,d_cuArrTex,texImg);
}
Actual function:
void CreateTexture(int num_devices, float* imagedata, int nVoxelX, int nVoxelY, int nVoxelZ ,cudaArray** d_cuArrTex, cudaTextureObject_t *texImage)
{
//size_t size_image=nVoxelX*nVoxelY*nVoxelZ;
for (unsigned int i = 0; i < num_devices; i++){
cudaSetDevice(i);
//cudaArray Descriptor
const cudaExtent extent = make_cudaExtent(nVoxelX, nVoxelY, nVoxelZ);
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
//cuda Array
cudaMalloc3DArray(&d_cuArrTex[i], &channelDesc, extent);
//cudaCheckErrors("Texture memory allocation fail");
cudaMemcpy3DParms copyParams = {0};
//Array creation
copyParams.srcPtr = make_cudaPitchedPtr((void *)imagedata, extent.width*sizeof(float), extent.width, extent.height);
copyParams.dstArray = d_cuArrTex[i];
copyParams.extent = extent;
copyParams.kind = cudaMemcpyHostToDevice;
cudaMemcpy3DAsync(©Params);
//cudaCheckErrors("Texture memory data copy fail");
//Array creation End
cudaResourceDesc texRes;
memset(&texRes, 0, sizeof(cudaResourceDesc));
texRes.resType = cudaResourceTypeArray;
texRes.res.array.array = d_cuArrTex[i];
cudaTextureDesc texDescr;
memset(&texDescr, 0, sizeof(cudaTextureDesc));
texDescr.normalizedCoords = false;
texDescr.filterMode = cudaFilterModePoint;
texDescr.addressMode[0] = cudaAddressModeBorder;
texDescr.addressMode[1] = cudaAddressModeBorder;
texDescr.addressMode[2] = cudaAddressModeBorder;
texDescr.readMode = cudaReadModeElementType;
cudaCreateTextureObject(&texImage[i], &texRes, &texDescr, NULL);
//cudaCheckErrors("Texture object creation fail");
}
}
--default-stream per-threadflag in the compiler options. However I seem to get the same result with the flag there. This is just a MCVE, there are several independent loop calls tocudaSetDevicebefore, where I check for the device names and device properties, without Asynchronous calls. I didn't add those to the MCVE to minimize code, but thanks for pointing that out.cudaCreateTextureObjectis a synchronizing call. My approach would be to attempt to get just the copies working in a loop.