I am currently using the following Reduction function to sum all of the elements in an array with CUDA:
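    // warpReduce(sdata, tID) is defined elsewhere; it needs blockDim.x >= 64.
    __global__ void reduceSum(int *input, int *input2, int *input3, int *outdata, int size){
        extern __shared__ int sdata[];
        unsigned int tID = threadIdx.x;
        unsigned int i = tID + blockIdx.x * (blockDim.x * 2);

        // Guard both loads so that threads past the end of the array add 0.
        int sum = 0;
        if (i < size) sum += input[i];
        if (i + blockDim.x < size) sum += input[i + blockDim.x];
        sdata[tID] = sum;
        __syncthreads();

        // Tree reduction in shared memory down to the last 64 elements.
        for (unsigned int stride = blockDim.x / 2; stride > 32; stride >>= 1)
        {
            if (tID < stride)
            {
                sdata[tID] += sdata[tID + stride];
            }
            __syncthreads();
        }

        if (tID < 32){ warpReduce(sdata, tID); }

        // Thread 0 writes this block's partial sum.
        if (tID == 0)
        {
            outdata[blockIdx.x] = sdata[0];
        }
    }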
However, as you can see from the function parameters, I would like to sum three separate arrays inside a single reduction kernel. An obvious way to do this would be to launch the kernel three times, passing a different array each time, and that would work fine. This is only a test kernel, though: the real kernel will take an array of structs, and I will need to sum the X, Y and Z values across all of the structs, which is why I need to do all three sums in one kernel.
I have initialized and allocated memory for all three arrays:
    int test[1000];
    std::fill_n(test, 1000, 1);
    int *d_test;
    int test2[1000];
    std::fill_n(test2, 1000, 2);
    int *d_test2;
    int test3[1000];
    std::fill_n(test3, 1000, 3);
    int *d_test3;

    cudaMalloc((void**)&d_test, 1000 * sizeof(int));
    cudaMalloc((void**)&d_test2, 1000 * sizeof(int));
    cudaMalloc((void**)&d_test3, 1000 * sizeof(int));
I am unsure what grid and block dimensions to use for this kind of kernel, and I am not entirely sure how to modify the reduction loop so that the results are laid out the way I want, i.e. an output array like this:
    | Block 1 Result | Block 2 Result | Block 3 Result | Block 4 Result | Block 5 Result | Block 6 Result |
    |        Test Array 1 Sums        |        Test Array 2 Sums        |        Test Array 3 Sums        |
I hope that makes sense. Or is there a better way to have a single reduction function that can return the sum of Struct.X, Struct.Y or Struct.Z?
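One possible arrangement for the layout described above, as a minimal sketch rather than a definitive answer: use a 2D grid in which `blockIdx.x` walks along one array and `blockIdx.y` picks which of the three arrays the block reduces. The names `reduceSum3` and `sumThree` are hypothetical, the block size of 256 is an assumption (it must be a power of two here), the warp-level shortcut is replaced by a plain tree reduction, and error checking is omitted for brevity.

```cuda
#include <array>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: blockIdx.y selects the array, blockIdx.x the chunk of it.
// Assumes blockDim.x is a power of two.
__global__ void reduceSum3(const int *in0, const int *in1, const int *in2,
                           int *outdata, unsigned int size)
{
    extern __shared__ int sdata[];
    const int *input = (blockIdx.y == 0) ? in0 : (blockIdx.y == 1) ? in1 : in2;

    unsigned int tID = threadIdx.x;
    unsigned int i = tID + blockIdx.x * (blockDim.x * 2);

    // Each thread loads up to two elements; out-of-range loads contribute 0.
    int sum = 0;
    if (i < size) sum += input[i];
    if (i + blockDim.x < size) sum += input[i + blockDim.x];
    sdata[tID] = sum;
    __syncthreads();

    // Plain tree reduction down to one element (no warp-level shortcut).
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tID < stride) sdata[tID] += sdata[tID + stride];
        __syncthreads();
    }

    // Row-major layout: partial sums of array 0 first, then array 1, then array 2.
    if (tID == 0) outdata[blockIdx.y * gridDim.x + blockIdx.x] = sdata[0];
}

// Hypothetical host helper: copies the arrays over, launches once, returns 3 totals.
std::array<int, 3> sumThree(const int *a, const int *b, const int *c, int n)
{
    const int threads = 256;  // assumed block size
    const int blocksPerArray = (n + threads * 2 - 1) / (threads * 2);
    const int *host[3] = {a, b, c};
    int *dev[3], *d_out;
    for (int k = 0; k < 3; ++k) {
        cudaMalloc((void**)&dev[k], n * sizeof(int));
        cudaMemcpy(dev[k], host[k], n * sizeof(int), cudaMemcpyHostToDevice);
    }
    cudaMalloc((void**)&d_out, 3 * blocksPerArray * sizeof(int));

    dim3 grid(blocksPerArray, 3);  // x: chunks of one array, y: which array
    reduceSum3<<<grid, threads, threads * sizeof(int)>>>(dev[0], dev[1], dev[2], d_out, n);

    // Only a handful of per-block results remain, so finish the sum on the host.
    std::vector<int> partial(3 * blocksPerArray);
    cudaMemcpy(partial.data(), d_out, partial.size() * sizeof(int), cudaMemcpyDeviceToHost);

    std::array<int, 3> total = {0, 0, 0};
    for (int k = 0; k < 3; ++k)
        for (int blk = 0; blk < blocksPerArray; ++blk)
            total[k] += partial[k * blocksPerArray + blk];

    for (int k = 0; k < 3; ++k) cudaFree(dev[k]);
    cudaFree(d_out);
    return total;
}
```

With 1000 elements and 256 threads per block, each array needs two blocks, so the grid is 2 x 3 and the output holds six partial sums in the order shown in the layout above. Calling `sumThree(test, test2, test3, 1000)` with arrays filled with 1, 2 and 3 should give totals of 1000, 2000 and 3000.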
Here's the struct:
    template <typename T>
    struct planet {
        T x, y, z;
        T vx, vy, vz;
        T mass;
    };
I need to add up all the vx and store it, all the vy and store it and all the vz and store it.
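For the struct case, one way this might be done is to let every thread accumulate its own vx, vy and vz totals and then reduce three shared-memory arrays side by side. The sketch below makes several assumptions: `reduceVelocity`, `sumVelocities` and `BLOCK` are hypothetical names, the block size is fixed at 256 (a power of two), the input is non-empty, and error checking is left out.

```cuda
#include <array>
#include <vector>
#include <cuda_runtime.h>

template <typename T>
struct planet {
    T x, y, z;
    T vx, vy, vz;
    T mass;
};

constexpr int BLOCK = 256;  // assumed block size, must be a power of two

// Hypothetical kernel: each block writes three partial sums (vx, vy, vz).
template <typename T>
__global__ void reduceVelocity(const planet<T> *p, T *out, unsigned int n)
{
    __shared__ T sx[BLOCK], sy[BLOCK], sz[BLOCK];
    unsigned int tID = threadIdx.x;

    // Grid-stride loop: each thread keeps private running sums of the 3 fields.
    T ax = 0, ay = 0, az = 0;
    for (unsigned int i = blockIdx.x * blockDim.x + tID; i < n;
         i += blockDim.x * gridDim.x) {
        ax += p[i].vx;
        ay += p[i].vy;
        az += p[i].vz;
    }
    sx[tID] = ax; sy[tID] = ay; sz[tID] = az;
    __syncthreads();

    // One tree reduction loop, applied to the three shared arrays together.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tID < stride) {
            sx[tID] += sx[tID + stride];
            sy[tID] += sy[tID + stride];
            sz[tID] += sz[tID + stride];
        }
        __syncthreads();
    }

    // Layout: all vx partial sums, then all vy, then all vz.
    if (tID == 0) {
        out[0 * gridDim.x + blockIdx.x] = sx[0];
        out[1 * gridDim.x + blockIdx.x] = sy[0];
        out[2 * gridDim.x + blockIdx.x] = sz[0];
    }
}

// Hypothetical host helper: returns {sum vx, sum vy, sum vz}; assumes h is non-empty.
std::array<float, 3> sumVelocities(const std::vector<planet<float>> &h)
{
    const unsigned int n = static_cast<unsigned int>(h.size());
    const int blocks = (n + BLOCK - 1) / BLOCK;

    planet<float> *d_p;
    float *d_out;
    cudaMalloc((void**)&d_p, n * sizeof(planet<float>));
    cudaMalloc((void**)&d_out, 3 * blocks * sizeof(float));
    cudaMemcpy(d_p, h.data(), n * sizeof(planet<float>), cudaMemcpyHostToDevice);

    reduceVelocity<float><<<blocks, BLOCK>>>(d_p, d_out, n);

    std::vector<float> partial(3 * blocks);
    cudaMemcpy(partial.data(), d_out, partial.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // Finish the few per-block partial sums on the host.
    std::array<float, 3> total = {0.f, 0.f, 0.f};
    for (int c = 0; c < 3; ++c)
        for (int blk = 0; blk < blocks; ++blk)
            total[c] += partial[c * blocks + blk];

    cudaFree(d_p);
    cudaFree(d_out);
    return total;
}
```

Reading `p[i].vx`, `p[i].vy` and `p[i].vz` from an array of structs is strided rather than fully coalesced, so this version favors simplicity over peak bandwidth; storing each field in its own array would likely be faster but changes the data layout.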
struct my_struct { int x,y,z;} data[1000];? This matters because a reduction operation like this will be limited by memory bandwidth. Both the organization of the data in memory and the access pattern therefore need to be understood to achieve the highest performance. A good solution will arrange its memory accesses to make full use of the available memory bandwidth.