Indexing scheme vs. array of pointers in CUDA

Question

I want to add two 2D arrays in CUDA.

When I use an indexing scheme, the program works fine:

#define COLUMNS 3
#define ROWS 2    
__global__ void add(int *a, int *b, int *c)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int i = (COLUMNS*y) + x;
    c[i] = a[i] + b[i];
}    

int main()
{
    int a[ROWS][COLUMNS], b[ROWS][COLUMNS], c[ROWS][COLUMNS];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **) &dev_a, ROWS*COLUMNS*sizeof(int));
    cudaMalloc((void **) &dev_b, ROWS*COLUMNS*sizeof(int));
    cudaMalloc((void **) &dev_c, ROWS*COLUMNS*sizeof(int));
    for (int y = 0; y < ROWS; y++) // Fill Arrays
        for (int x = 0; x < COLUMNS; x++)
        {
            a[y][x] = x;
            b[y][x] = y;
        }
    cudaMemcpy(dev_a, a, ROWS*COLUMNS*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, ROWS*COLUMNS*sizeof(int), cudaMemcpyHostToDevice);
    dim3 grid(COLUMNS,ROWS);
    add<<<grid,1>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, ROWS*COLUMNS*sizeof(int), cudaMemcpyDeviceToHost);
    return 0;
}

However, it does not work when the matrix b is represented as an array of pointers rather than by an indexing scheme as above:

int a[ROWS][COLUMNS], **b, c[ROWS][COLUMNS];
int *dev_a, *dev_b, *dev_c;
b = (int**)malloc(ROWS*sizeof(int*));
for (int i = 0; i < ROWS; i++)
    b[i] = (int*) malloc(COLUMNS*sizeof(int));

Why?

I used the example from here: http://www.math.uaa.alaska.edu/~afkjm/cs448/handouts/cuda-firstprograms.pdf

What does not work? How do you copy the array-of-pointers matrix to the device? That's likely the problem.
– leftaroundabout, Commented Aug 5, 2012 at 17:29

Pixelchemist · Accepted Answer · 2013-02-20 15:13:03Z

Even though the question is old, I will try to give a hint to everyone who comes here with a related problem.

I agree with leftaroundabout. The compile-time-constant allocation without malloc

int a[ROWS][COLUMNS], b[ROWS][COLUMNS], c[ROWS][COLUMNS];

allocates one contiguous block of memory for each variable (a, b, c), each ROWS*COLUMNS ints in size. That block can be copied to the device in a single memcpy operation.

In the pointer-to-pointer case each row is allocated separately, so we are not guaranteed to have one contiguous block of memory that can be copied to the device using a single memcpy.

In the given **b case, one has to copy the data of each row separately to achieve the analogous behaviour.

for (int o=0; o<ROWS; ++o)
{
  cudaMemcpy((dev_b+o*COLUMNS), b[o], COLUMNS*sizeof(int), cudaMemcpyHostToDevice);
}

Nevertheless, I think one should handle a and b using the same scheme to avoid index confusion in case the elements are ordered differently. (A C array declared as a[ROWS][COLUMNS] is stored in row-major order, which matches the kernel's index i = COLUMNS*y + x.)
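
An alternative to one cudaMemcpy per row is to pack the separately allocated rows into a single contiguous host buffer first. The following is only a minimal host-side sketch in plain C, so it can be tried without a GPU. The helper names flatten_rows and make_b are hypothetical and not part of the question's code, and malloc error handling is omitted for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define COLUMNS 3
#define ROWS 2

/* Hypothetical helper: pack ROWS separately allocated rows into one
   contiguous buffer in row-major order, matching i = COLUMNS*y + x. */
void flatten_rows(int **rows, int *flat)
{
    assert(rows != NULL && flat != NULL);
    for (int y = 0; y < ROWS; ++y)
        memcpy(flat + y * COLUMNS, rows[y], COLUMNS * sizeof(int));
}

/* Hypothetical helper: build b as in the question, one malloc per row,
   filled with b[y][x] = y. Allocation failures are not checked here. */
int **make_b(void)
{
    int **b = (int **)malloc(ROWS * sizeof(int *));
    for (int y = 0; y < ROWS; ++y) {
        b[y] = (int *)malloc(COLUMNS * sizeof(int));
        for (int x = 0; x < COLUMNS; ++x)
            b[y][x] = y;
    }
    return b;
}
```

After calling flatten_rows(b, flat_b) on a buffer flat_b of ROWS*COLUMNS ints, a single cudaMemcpy(dev_b, flat_b, ROWS*COLUMNS*sizeof(int), cudaMemcpyHostToDevice) should behave like the static-array case in the question, at the cost of one extra host-side copy.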
