
I want to print the global 2D array variable d_t using printf inside main, but I get a compile warning:

a __device__ variable "d_t" cannot be directly read in a host function

How can I copy a global 2D array from the device to the host and then print the first column of each row?

__device__ double *d_t;

__device__ size_t d_gridPitch;


__global__ void kernelFunc()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    double* rowt = (double*)((char *)d_t + i * d_gridPitch);
    rowt[0] = rowt[0] + 40000;

}


int main()
{
    int size = 16;
    size_t d_pitchLoc;
    double *d_tLoc;

    cudaMallocPitch((void**)&d_tLoc, &d_pitchLoc, size * sizeof(double), size);
    cudaMemset2D(d_tLoc, d_pitchLoc, 0, size * sizeof(double), size);

    cudaMemcpyToSymbol(d_gridPitch, &d_pitchLoc, sizeof(d_pitchLoc));
    cudaMemcpyToSymbol(d_t, & d_tLoc, sizeof(d_tLoc));

    kernelFunc<<<1,size>>>();

    for(int i=0; i< size; i++){
        double* rowt = (double*)((char *)d_t + i * d_gridPitch);
        printf("%.0f, ",rowt[0]);
    }

    cudaDeviceReset();

    return 0;
}
  • The cudaMemcpy2D function is used to copy to or from a pitched allocation (i.e. one created with cudaMallocPitch). See the API documentation for cudaMemcpy2D. If you search on the CUDA tag you will find many questions and answers that demonstrate proper usage. Use proper CUDA error checking. Commented Mar 26, 2016 at 0:30

1 Answer


As indicated in comments, the cudaMemcpy2D API is designed for exactly this task. You must allocate or statically define a host memory buffer or container to act as storage for the data from the device, and then provide the pitch of that host buffer to the cudaMemcpy2D call. The API handles the pitch conversion without any further intervention on the caller side.

Replace the print loop with something like this:

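// Contiguous host buffer, so its pitch is simply the row width in bytes
double* h_t = new double[size * size];
cudaMemcpy2D(h_t, size * sizeof(double), d_tLoc, d_pitchLoc,
        size * sizeof(double), size, cudaMemcpyDeviceToHost);
// Print column j = 0 of each row (requires #include <iostream>)
for(int i=0, j=0; i< size; i++){
    std::cout << h_t[i * size + j] << std::endl;
}
delete[] h_t;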

[Note I'm using iostream here for the printing. CUDA uses a C++ compiler for compiling host code, and you should prefer iostream functions over cstdio because they are less error prone and support improved diagnostics on most platforms].

You can see that the API call form is very similar to the cudaMemset2D call that I provided for you in your last question.
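To make the "pitch conversion" mentioned above concrete, here is a minimal host-only sketch of a pitched 2D copy: each row of the padded source is copied into a densely packed destination. It is plain C++ with no CUDA calls and is only an illustration of the idea, not how cudaMemcpy2D is implemented. The names copyPitched2D and firstColumnAfterCopy, and the test values row * 100 + col, are hypothetical and chosen just for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not a CUDA API): copies `height` rows of `widthBytes`
// bytes each. Consecutive rows start `srcPitch` bytes apart in the source
// and `dstPitch` bytes apart in the destination.
void copyPitched2D(void* dst, std::size_t dstPitch,
                   const void* src, std::size_t srcPitch,
                   std::size_t widthBytes, std::size_t height)
{
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dstPitch, s + row * srcPitch, widthBytes);
    }
}

// Hypothetical demo: builds a padded size x size buffer whose element
// (row, col) holds row * 100 + col, copies it into a dense buffer, and
// returns the first column of the dense copy.
std::vector<double> firstColumnAfterCopy(std::size_t size, std::size_t srcPitch)
{
    const std::size_t widthBytes = size * sizeof(double);
    assert(srcPitch >= widthBytes);

    // Padding bytes are filled with 0xFF so that a wrong stride would show up.
    std::vector<unsigned char> padded(srcPitch * size, 0xFF);
    for (std::size_t row = 0; row < size; ++row) {
        for (std::size_t col = 0; col < size; ++col) {
            const double value = static_cast<double>(row * 100 + col);
            std::memcpy(padded.data() + row * srcPitch + col * sizeof(double),
                        &value, sizeof(double));
        }
    }

    // The dense destination has a pitch equal to the row width in bytes.
    std::vector<double> dense(size * size);
    copyPitched2D(dense.data(), widthBytes, padded.data(), srcPitch,
                  widthBytes, size);

    std::vector<double> column(size);
    for (std::size_t row = 0; row < size; ++row) {
        column[row] = dense[row * size];
    }
    return column;
}
```

Calling firstColumnAfterCopy(4, 64) models four rows of four doubles (32 bytes of data per row) stored 64 bytes apart. The argument order of copyPitched2D deliberately mirrors the cudaMemcpy2D call shown earlier, minus the copy-direction flag.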
