3

Writing some signal processing in CUDA I recently made huge progress in optimizing it. By using 1D textures and adjusting my access patterns I managed to get a 10× performance boost. (I previously tried transaction aligned prefetching from global into shared memory, but the nonuniform access patterns happening later messed up the warp→shared cache bank association (I think)).

So now I'm facing the problem, how CUDA textures and bindings interact with asynchronous memcpy.

Consider the following kernel

texture<...> mytexture;

__global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = tex1Dfetch(texture, threadIdx.x);
}

The kernel is launched in multiple streams

extern void *sourcedata;

#define N_CUDA_STREAMS ...

cudaStream stream[N_CUDA_STREAMS];
void *d_pOut[N_CUDA_STREAMS];
void *d_texData[N_CUDA_STREAMS];

for(int k_stream = 0; k_stream < N_CUDA_STREAMS; k_stream++) {
    cudaStreamCreate(stream[k_stream]);

    cudaMalloc(&d_pOut[k_stream], ...);
    cudaMalloc(&d_texData[k_stream], ...);
}

/* ... */

for(int i_datablock; i_datablock < n_datablocks; i_datablock++) {
    int const k_stream = i_datablock % N_CUDA_STREAMS;
    cudaMemcpyAsync(d_texData[k_stream], (char*)sourcedata + i_datablock * blocksize, ..., stream[k_stream]);

    cudaBindTexture(0, &mytexture, d_texData[k_stream], ...);

    mykernel<<<..., stream[k_stream]>>>(d_pOut);
}

Now what I wonder about is, since there is only one texture reference, what happens when I bind a buffer to a texture while other streams' kernels access that texture? cudaBindStream doesn't take a stream parameter, so I'm worried that by binding the texture to another device pointer while running kernels are asynchronously accessing said texture I'll divert their accesses to the other data.

The CUDA documentation doesn't tell anything about this. If have to to disentangle this to allow concurrent access, it seems I'd have to create a number of texture references and use a switch statementto chose between them, based on the stream number passed as a kernel launch parameter.

Unfortunately CUDA doesn't allow to put arrays of textures on the device side, i.e. the following does not work:

texture<...> texarray[N_CUDA_STREAMS];

Layered textures are not an option, because the amount of data I have only fits within a plain 1D texture not bound to a CUDA array (see table F-2 in the CUDA 4.2 C Programming Guide).

3
  • I'm not sure, but I think that such rebinding of texture would cause problems. However, OpenCL allows creation of arrays of textures, so if you can't solve your problem with CUDA, you can consider switching to OpenCL, it usually is pretty straightforward. Commented Sep 13, 2012 at 17:52
  • @aland: Do you know a OpenCL counterpart to CUFFT with similar performance? Commented Sep 13, 2012 at 18:31
  • I don't know any more-or-less established library, but there are a lot of codes on the Internet, so you can probably find something suiting your needs. Commented Sep 13, 2012 at 20:07

1 Answer 1

5

Indeed you cannot unbind the texture while still using it in a different stream.

Since the number of streams doesn't need to be large to hide the asynchronous memcpys (2 would already do), you could use C++ templates to give each stream its own texture:

texture<float, 1, cudaReadModeElementType> mytexture1;
texture<float, 1, cudaReadModeElementType> mytexture2;

template<int TexSel> __device__ float myTex1Dfetch(int x);

template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }


template<int TexSel> __global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = myTex1Dfetch<TexSel>(threadIdx.x);
}


int main(void)
{
    float *out_d[2];

    // ...

    mykernel<1><<<blocks, threads, stream[0]>>>(out_d[0]);
    mykernel<2><<<blocks, threads, stream[1]>>>(out_d[1]);

    // ...
}
Sign up to request clarification or add additional context in comments.

9 Comments

How can I use templates texture references?
I'm thinking of something like (completely untested!)
Your comment seems to be missing something.
@datenwolf, don't use templated texture references. Declare multiple texture references, then templatize your kernel to read from one based on the template parameter. Then wrap that kernel in templated host code that binds the texture reference corresponding to the template parameter, then invokes the kernel with the template parameter. (It is harmless for multiple texture references to be bound to the same CUDA array.)
My turn to be puzzled. I have personally optimized real-life HPC code where two streams were sufficient to hide host/device traffic almost perfectly behind kernel execution. While the kernels are executing in one stream, asynchronous copies are executing in the other stream. There is a bit of cross-stream synchronization using cudaStreamWaitEvent to make sure the copies are actually completed before the data is consumed by the other thread. Obviously there can be situations where more than two stream may be needed for best performance.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.