I want to add two 2D arrays in CUDA.

When I use a flat indexing scheme, the program works fine:

#define COLUMNS 3
#define ROWS 2    
__global__ void add(int *a, int *b, int *c)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int i = (COLUMNS*y) + x;
    c[i] = a[i] + b[i];
}    

int main()
{
    int a[ROWS][COLUMNS], b[ROWS][COLUMNS], c[ROWS][COLUMNS];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **) &dev_a, ROWS*COLUMNS*sizeof(int));
    cudaMalloc((void **) &dev_b, ROWS*COLUMNS*sizeof(int));
    cudaMalloc((void **) &dev_c, ROWS*COLUMNS*sizeof(int));
    for (int y = 0; y < ROWS; y++) // Fill Arrays
        for (int x = 0; x < COLUMNS; x++)
        {
            a[y][x] = x;
            b[y][x] = y;
        }
    cudaMemcpy(dev_a, a, ROWS*COLUMNS*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, ROWS*COLUMNS*sizeof(int), cudaMemcpyHostToDevice);
    dim3 grid(COLUMNS,ROWS);
    add<<<grid,1>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, ROWS*COLUMNS*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a); // Release device memory
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}

However, it does not work when the matrix b is represented as an array of pointers rather than by the indexing scheme above:

int a[ROWS][COLUMNS], **b, c[ROWS][COLUMNS];
int *dev_a, *dev_b, *dev_c;
b = (int**)malloc(ROWS*sizeof(int*));
for (int i = 0; i < ROWS; i++)
    b[i] = (int*) malloc(COLUMNS*sizeof(int));

Why?

I used the example from here: http://www.math.uaa.alaska.edu/~afkjm/cs448/handouts/cuda-firstprograms.pdf

What does not work? How do you copy the array-of-pointers matrix to the device? That is likely the problem. Commented Aug 5, 2012 at 17:29

1 Answer

Even though the question is old, I will try to give a hint to anyone who arrives here with a related problem.

I agree with leftaroundabout. The compile-time-constant allocation without malloc

int a[ROWS][COLUMNS], b[ROWS][COLUMNS], c[ROWS][COLUMNS];

allocates a single contiguous block of memory for each variable (a, b, c), ROWS*COLUMNS*sizeof(int) bytes in size. That block can be copied to the device with a single cudaMemcpy call.
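A minimal host-only sketch of that point, assuming `std::memcpy` as a stand-in for `cudaMemcpy` so it runs without a GPU. The helper name `copy_whole_block` is hypothetical. It copies the whole 2D array in one call, after which element `[y][x]` is found at flat index `COLUMNS*y + x`, the same index the kernel in the question computes.

```cpp
#include <cstring>

constexpr int ROWS = 2;
constexpr int COLUMNS = 3;

// A built-in 2D array holds exactly ROWS*COLUMNS ints with no gaps between rows.
static_assert(sizeof(int[ROWS][COLUMNS]) == ROWS * COLUMNS * sizeof(int),
              "2D array is one block of ROWS*COLUMNS ints");

// Hypothetical helper: copies the whole matrix into a flat buffer in one call.
// std::memcpy stands in for cudaMemcpy(dev_a, a, ..., cudaMemcpyHostToDevice).
void copy_whole_block(const int (&m)[ROWS][COLUMNS], int *flat)
{
    std::memcpy(flat, m, sizeof(m)); // sizeof(m) == ROWS*COLUMNS*sizeof(int)
}
```

The destination `flat` must have room for at least `ROWS*COLUMNS` ints.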

In the pointer-to-pointer case each row is allocated separately, so there is no guarantee of one contiguous block of memory that could be copied to the device with a single cudaMemcpy. Passing b itself to cudaMemcpy would copy the host row pointers, not the integers they point to.

In the given **b case, the data of each row has to be copied separately to get the analogous behaviour:

for (int o=0; o<ROWS; ++o)
{
  cudaMemcpy((dev_b+o*COLUMNS), b[o], COLUMNS*sizeof(int), cudaMemcpyHostToDevice);
}
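The same row-by-row pattern can be tried on the host alone. This is only a sketch: `gather_rows` is a hypothetical helper, and `std::memcpy` again stands in for `cudaMemcpy`, with a `std::vector` playing the role of the device buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: gathers separately allocated rows into one flat,
// row-major buffer. std::memcpy stands in for cudaMemcpy here.
std::vector<int> gather_rows(int *const *rows, int n_rows, int n_cols)
{
    assert(rows != nullptr);
    std::vector<int> flat(static_cast<std::size_t>(n_rows) * n_cols);
    for (int o = 0; o < n_rows; ++o) {
        // Same destination offset as dev_b + o*COLUMNS in the loop above.
        std::memcpy(flat.data() + static_cast<std::size_t>(o) * n_cols,
                    rows[o], n_cols * sizeof(int));
    }
    return flat;
}
```

Calling `gather_rows(b, ROWS, COLUMNS)` with the `int **b` from the question would yield a buffer laid out the way the kernel expects, with element `[y][x]` at index `COLUMNS*y + x`.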

Nevertheless, I think a and b should be handled with the same scheme, to avoid index confusion if the elements end up in a different order. (A variable declared as a[ROWS][COLUMNS] is stored in row-major order, which C and C++ guarantee.)
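One way to give b the same layout as a while keeping the `b[y][x]` syntax is to allocate a single block and point the row pointers into it. This is a sketch of that alternative, not something from the question or the linked handout. The type `Matrix2D` and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: one contiguous block plus a table of row pointers.
struct Matrix2D {
    std::vector<int> data;   // rows*cols ints, row-major, contiguous
    std::vector<int *> row;  // row[y] points at data[y*cols]

    Matrix2D(int rows, int cols)
        : data(static_cast<std::size_t>(rows) * cols), row(rows)
    {
        assert(rows > 0 && cols > 0);
        for (int y = 0; y < rows; ++y)
            row[y] = data.data() + static_cast<std::size_t>(y) * cols;
    }

    // A copy would keep row pointers into the source's data, so copying is disabled.
    Matrix2D(const Matrix2D &) = delete;
    Matrix2D &operator=(const Matrix2D &) = delete;
};
```

With `Matrix2D m(ROWS, COLUMNS);` and `int **b = m.row.data();`, the fill loop from the question can keep writing `b[y][x] = y;`, and `m.data.data()` is a single block of `ROWS*COLUMNS` ints that could be passed to one cudaMemcpy, just like a.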
