
I have a very big array in device memory, and I need to partition it into smaller parts. I was wondering whether I could use an array of arrays to access the parts by index.

I tried to write the following code, but it returns garbage, which I suspect is due to undefined behavior. No error is reported, and I don't know whether this approach is even possible.

#include <stdio.h>
#include <assert.h>
#include <iostream>

inline
cudaError_t checkCuda(cudaError_t result) {
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}

__global__ void cudaVectorFill(int **array, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        array[0][i] = 1;
    else if (i < 2 * N)
        array[1][i - N] = 2;
    else if (i < 3 * N)
        array[2][i - 2 * N] = 3;
}

int main() {

    int N = 100000000;

    int **array = new int*[3];
 
    checkCuda( cudaMalloc(&array[0], N * sizeof(int)) );
    checkCuda( cudaMalloc(&array[1], N * sizeof(int)) );
    checkCuda( cudaMalloc(&array[2], N * sizeof(int)) );
 
    cudaVectorFill<<<(3 * N + 1023) / 1024, 1024>>>(array, N);

    checkCuda( cudaPeekAtLastError() );
 
    auto *host_array0 = new int[1];
    auto *host_array1 = new int[1];
    auto *host_array2 = new int[1];
 
    checkCuda( cudaMemcpy(host_array0, array[0], 1 * sizeof(int), cudaMemcpyDeviceToHost) );
    checkCuda( cudaMemcpy(host_array1, array[1], 1 * sizeof(int), cudaMemcpyDeviceToHost) );
    checkCuda( cudaMemcpy(host_array2, array[2], 1 * sizeof(int), cudaMemcpyDeviceToHost) );
 
    std::cout << *host_array0 << std::endl << *host_array1 << std::endl << *host_array2 << std::endl;

    return 0;
}

Output is something like:

707093096
707093104
707093112

The correct output should be:

1
2
3
  • The memory pointed to by array is host memory. You can't pass plain host pointers to CUDA kernels like that; you need to allocate memory for it on the GPU. Commented Apr 17, 2022 at 1:02
  • You can do it with pointer arithmetic or pointers; just get the memory spaces correct. Commented Apr 17, 2022 at 1:04
  • You could use arrays of a single dimension and calculate multidimensional indices manually, e.g. [row*100+column]. Commented Apr 17, 2022 at 2:22
  • You mentioned both facts, but gave no actual reasoning in the question. It is unusual that you cannot allocate large blocks of device memory. What is your GPU memory size, and what is the largest block you can successfully allocate (with no prior allocations, as far as possible; if you also use the graphics card for display, some memory will already be allocated and in use)? Can you also call cudaMemGetInfo and post the two results, please? Commented Apr 17, 2022 at 6:08
  • They also have A100s with 40 GB, but you probably need to order a Pro+ account and be lucky to get an A100 assigned. Commented Apr 18, 2022 at 4:46

2 Answers


As noted in comments, if you are passing pointers to a GPU kernel, they have to be accessible to the GPU. That means you either explicitly allocate a copy of the host array of device pointers and populate it on the device, or rely on managed or otherwise GPU accessible host memory.

One approach that will probably work in this case is:

int N = 100000000;

int **array = new int*[3];
 
checkCuda( cudaMalloc(&array[0], N * sizeof(int)) );
checkCuda( cudaMalloc(&array[1], N * sizeof(int)) );
checkCuda( cudaMalloc(&array[2], N * sizeof(int)) );

int **array_d;
checkCuda( cudaMalloc(&array_d, 3 * sizeof(int*)) );
checkCuda( cudaMemcpy(array_d, array, 3 * sizeof(int*), cudaMemcpyHostToDevice) );
 
cudaVectorFill<<<(3 * N + 1023) / 1024, 1024>>>(array_d, N);

[Standard disclaimer, code written in browser, no guarantees implied or given, use at own risk]

i.e. after building array in host memory, make a copy of it in GPU memory and pass that GPU copy to your kernel. There might be other problems in your code; I haven't analyzed beyond the first six lines.
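Alternatively, the pointer-arithmetic approach mentioned in the comments avoids the device-side pointer table entirely: make one large allocation and compute offsets. A sketch (untested; the kernel name cudaVectorFillFlat is made up, and checkCuda is the helper from the question):

```cuda
// One allocation covering all three partitions; no array of pointers needed.
__global__ void cudaVectorFillFlat(int *base, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 3 * N)
        base[i] = i / N + 1;  // partition k = i / N gets the value k + 1
}

// Host side:
// int *base;
// checkCuda( cudaMalloc(&base, 3 * N * sizeof(int)) );
// cudaVectorFillFlat<<<(3 * N + 1023) / 1024, 1024>>>(base, N);
// The k-th sub-array is simply base + k * N, e.g. for the copy-back:
// checkCuda( cudaMemcpy(host, base + k * N, sizeof(int), cudaMemcpyDeviceToHost) );
```

This also replaces three cudaMalloc calls with one, at the cost of requiring the partitions to be contiguous and equally sized.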


FYI, I just found another approach for 2D allocation in device memory. See method 3 in this example for more information. So we can use something like:

int N = 100000000;

int **array;
cudaMallocManaged(&array, 3 * sizeof(int *));
cudaMallocManaged(&(array[0]), N * sizeof(int));
cudaMallocManaged(&(array[1]), N * sizeof(int));
cudaMallocManaged(&(array[2]), N * sizeof(int));

cudaVectorFill<<<(3 * N + 1023) / 1024, 1024>>>(array, N);

It also worked fine. Note that kernel launches are asynchronous, so call cudaDeviceSynchronize() before reading the managed arrays back on the host.
