
I have some code for texture object allocation and host-to-device copy. It is just a modification of the answer here. I do not explicitly use streams, just cudaSetDevice().

This code works fine; however, when I run the Visual Profiler, I can see that the memory copies from host to array are not asynchronous. They are each allocated to their own device stream, but the second one does not start until the first one finishes (running on 2 GPUs). I have tried it with large images, so I am certain it's not overhead from the CPU.

My guess is that there is something in the code that is required to be synchronous and thus halts the CPU, but I don't know what. What can I do to make this loop asynchronous?

An MCVE:

void CreateTexture(int num_devices, float* imagedata, int nVoxelX, int nVoxelY, int nVoxelZ, cudaArray** d_cuArrTex, cudaTextureObject_t *texImage);

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    int nVoxelX = 512;
    int nVoxelY = 512;
    int nVoxelZ = 512;
    float* image = (float*)malloc(nVoxelX*nVoxelY*nVoxelZ*sizeof(float));

    cudaTextureObject_t *texImg = new cudaTextureObject_t[deviceCount];
    cudaArray **d_cuArrTex = new cudaArray*[deviceCount];

    CreateTexture(deviceCount, image, nVoxelX, nVoxelY, nVoxelZ, d_cuArrTex, texImg);
}

Actual function:

void CreateTexture(int num_devices, float* imagedata, int nVoxelX, int nVoxelY, int nVoxelZ ,cudaArray** d_cuArrTex, cudaTextureObject_t *texImage)
{
    //size_t size_image=nVoxelX*nVoxelY*nVoxelZ;
    for (unsigned int i = 0; i < num_devices; i++){
        cudaSetDevice(i);

        //cudaArray Descriptor
        const cudaExtent extent = make_cudaExtent(nVoxelX, nVoxelY, nVoxelZ);
        cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
        //cuda Array
        cudaMalloc3DArray(&d_cuArrTex[i], &channelDesc, extent);
        //cudaCheckErrors("Texture memory allocation fail");
        cudaMemcpy3DParms copyParams = {0};


        //Array creation
        copyParams.srcPtr   = make_cudaPitchedPtr((void *)imagedata, extent.width*sizeof(float), extent.width, extent.height);
        copyParams.dstArray = d_cuArrTex[i];
        copyParams.extent   = extent;
        copyParams.kind     = cudaMemcpyHostToDevice;
        cudaMemcpy3DAsync(&copyParams);
        //cudaCheckErrors("Texture memory data copy fail");


        //Array creation End
        cudaResourceDesc    texRes;
        memset(&texRes, 0, sizeof(cudaResourceDesc));
        texRes.resType = cudaResourceTypeArray;
        texRes.res.array.array  = d_cuArrTex[i];
        cudaTextureDesc     texDescr;
        memset(&texDescr, 0, sizeof(cudaTextureDesc));
        texDescr.normalizedCoords = false;
        texDescr.filterMode = cudaFilterModePoint;
        texDescr.addressMode[0] = cudaAddressModeBorder;
        texDescr.addressMode[1] = cudaAddressModeBorder;
        texDescr.addressMode[2] = cudaAddressModeBorder;
        texDescr.readMode = cudaReadModeElementType;
        cudaCreateTextureObject(&texImage[i], &texRes, &texDescr, NULL);
        //cudaCheckErrors("Texture object creation fail");
    }
}
  • The asynchronous memory operations are only asynchronous when they are run in a non-blocking stream. Unless you are compiling with the default stream per-thread option enabled, this looks like expected behaviour to me. I would also be a bit suspicious about having context initialization inside a tight loop with asynchronous APIs; that might also block. Commented Feb 18, 2019 at 14:28
  • Thanks for the comments @talonmies. I did not have the --default-stream per-thread flag in the compiler options; however, I seem to get the same result with the flag there. This is just an MCVE; there are several independent loop calls to cudaSetDevice before this, where I check the device names and device properties, without asynchronous calls. I didn't add those to the MCVE to minimize code, but thanks for pointing that out. Commented Feb 18, 2019 at 14:44
  • Copies from host to device from one CPU to several GPUs have some implications for system topology if you expect the transfers to overlap. Are you sure that each GPU has a fully independent PCIe path (e.g. they are not sharing the same CPU root port via a PCIe switch)? If the 2 or more GPUs happen to be attached to 2 separate CPU sockets, this may be even more complicated. What is the system topology? It's not sensible to expect to properly engineer this scenario without that information. Commented Feb 18, 2019 at 17:00
  • I wouldn't be at all surprised if cudaCreateTextureObject is a synchronizing call. My approach would be to attempt to get just the copies working in a loop. Commented Feb 18, 2019 at 17:01
  • There is no async version of cudaCreateTextureObject. I've provided an answer indicating overlap. Commented Feb 18, 2019 at 17:27

1 Answer


The two main problems I can see with the code are:

  1. Your host allocation is a pageable allocation. Asynchrony of copy operations in CUDA where one of the targets is host memory requires a pinned allocation for the host memory.

  2. You have other synchronizing operations in your create textures loop. Device allocation operations (cudaMalloc3DArray in this case) are synchronizing, in my experience. I haven't run tests to determine if cudaCreateTextureObject is synchronizing, but I wouldn't be surprised if it was. Therefore my recommendation for asynchrony in general is to get synchronizing operations out of the loop.
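For point 1, a pinned allocation is obtained with cudaHostAlloc and released with cudaFreeHost; a minimal sketch replacing the malloc from the question (error checking omitted):

```cuda
// Sketch: pinned (page-locked) host allocation in place of malloc.
// cudaMemcpy3DAsync can only overlap with other work when the host
// side of the transfer is pinned memory like this.
float* image = NULL;
cudaHostAlloc(&image, nVoxelX*nVoxelY*nVoxelZ*sizeof(float), cudaHostAllocDefault);
// ... fill image, issue the async copies ...
cudaFreeHost(image);  // pinned memory must be freed with cudaFreeHost, not free()
```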

In your case, we can refactor your code as follows, which seems to allow overlap of operations from the perspective of nvprof:

$ cat t399.cu
void CreateTexture(int num_devices, float* imagedata, int nVoxelX, int nVoxelY, int nVoxelZ ,cudaArray** d_cuArrTex, cudaTextureObject_t *texImage)
{
    //size_t size_image=nVoxelX*nVoxelY*nVoxelZ;

    const cudaExtent extent = make_cudaExtent(nVoxelX, nVoxelY, nVoxelZ);
    for (unsigned int i = 0; i < num_devices; i++){
        cudaSetDevice(i);

        //cudaArray Descriptor
        cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
        //cuda Array
        cudaMalloc3DArray(&d_cuArrTex[i], &channelDesc, extent);
        //cudaCheckErrors("Texture memory allocation fail");
    }
    for (unsigned int i = 0; i < num_devices; i++){
        cudaSetDevice(i);
        cudaMemcpy3DParms copyParams = {0};
        //Array creation
        copyParams.srcPtr   = make_cudaPitchedPtr((void *)imagedata, extent.width*sizeof(float), extent.width, extent.height);
        copyParams.dstArray = d_cuArrTex[i];
        copyParams.extent   = extent;
        copyParams.kind     = cudaMemcpyHostToDevice;
        cudaMemcpy3DAsync(&copyParams);
        //cudaCheckErrors("Texture memory data copy fail");
    }
    for (unsigned int i = 0; i < num_devices; i++){
        cudaSetDevice(i);
        //Array creation End
        cudaResourceDesc    texRes;
        memset(&texRes, 0, sizeof(cudaResourceDesc));
        texRes.resType = cudaResourceTypeArray;
        texRes.res.array.array  = d_cuArrTex[i];
        cudaTextureDesc     texDescr;
        memset(&texDescr, 0, sizeof(cudaTextureDesc));
        texDescr.normalizedCoords = false;
        texDescr.filterMode = cudaFilterModePoint;
        texDescr.addressMode[0] = cudaAddressModeBorder;
        texDescr.addressMode[1] = cudaAddressModeBorder;
        texDescr.addressMode[2] = cudaAddressModeBorder;
        texDescr.readMode = cudaReadModeElementType;
        cudaCreateTextureObject(&texImage[i], &texRes, &texDescr, NULL);
        //cudaCheckErrors("Texture object creation fail");
    }
    for (unsigned int i = 0; i < num_devices; i++){
        cudaSetDevice(i);
        cudaDeviceSynchronize();
    }
}

int main(void)
{
  int deviceCount =0 ;
  cudaGetDeviceCount(&deviceCount);

  int nVoxelX=512;
  int nVoxelY=512;
  int nVoxelZ=512;
  float* image;

  cudaHostAlloc(&image, nVoxelX*nVoxelY*nVoxelZ*sizeof(float), cudaHostAllocDefault);

  cudaTextureObject_t *texImg =new cudaTextureObject_t[deviceCount];
  cudaArray **d_cuArrTex = new cudaArray*[deviceCount];

  CreateTexture(deviceCount,image, nVoxelX,nVoxelY, nVoxelZ,d_cuArrTex,texImg);
}


$ nvcc -o t399 t399.cu
$ cuda-memcheck ./t399
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof --print-gpu-trace ./t399
==19953== NVPROF is profiling process 19953, command: ./t399
==19953== Profiling application: ./t399
==19953== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
1.55311s  90.735ms                    -               -         -         -         -  512.00MB  5.5106GB/s      Pinned       Array  Tesla P100-PCIE         1         7  [CUDA memcpy HtoA]
1.55316s  90.640ms                    -               -         -         -         -  512.00MB  5.5163GB/s      Pinned       Array   Tesla K40m (1)         2        18  [CUDA memcpy HtoA]
1.55318s  85.962ms                    -               -         -         -         -  512.00MB  5.8165GB/s      Pinned       Array  Tesla K20Xm (2)         3        29  [CUDA memcpy HtoA]
1.55320s  89.908ms                    -               -         -         -         -  512.00MB  5.5612GB/s      Pinned       Array  Tesla K20Xm (3)         4        40  [CUDA memcpy HtoA]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

My system here is a 4-GPU system with two GPUs hanging off each of 2 root ports. Therefore the host-to-device pinned transfer bandwidth of about 10GB/s on PCIe Gen3 is getting split between the 2 GPUs on each port, but careful study of the profiler start and duration times for the transfers indicates all 4 are overlapped from the profiler's perspective.
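As an aside, instead of relying on each device's default stream, the copies could be issued into explicitly created non-blocking streams. A sketch of just the copy loop under that approach (untested here; assumes extent, imagedata, and d_cuArrTex are set up as in the code above, error checking omitted):

```cuda
// Sketch: one explicitly created non-blocking stream per device,
// with the async copy enqueued in that stream.
cudaStream_t *streams = new cudaStream_t[num_devices];
for (int i = 0; i < num_devices; i++){
    cudaSetDevice(i);
    cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
    cudaMemcpy3DParms copyParams = {0};
    copyParams.srcPtr   = make_cudaPitchedPtr((void *)imagedata, extent.width*sizeof(float), extent.width, extent.height);
    copyParams.dstArray = d_cuArrTex[i];
    copyParams.extent   = extent;
    copyParams.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3DAsync(&copyParams, streams[i]);  // enqueue in the non-blocking stream
}
// Wait for all copies to complete, then clean up the streams.
for (int i = 0; i < num_devices; i++){
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
delete[] streams;
```

The source host memory still needs to be pinned for these copies to actually overlap.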


4 Comments

Thanks Robert. I haven't tested it yet, but you provided enough information and proof of it working. I will try this later.
Hi Robert. Your code works well, but unfortunately I cannot use pinned memory in my application; my loss here. However, I see an interesting effect when running on my 2 GPUs and inspecting with the Visual Profiler. It does not matter which of the two I allocate first (thus it is not a hardware issue): the first one that I allocate always has a speed of less than 3GB/s, while the second one reaches 11GB/s (for an 8GB array). I can reproduce this in the same CUDA call (same context) and within different calls. The context is created before the calls. Any idea why this may be happening?
I wouldn't be able to make any guesses without some understanding of your system topology.
Fair, I will try to investigate and post a question with all the relevant information.
