CUDA Reduction on Shared Memory with Multiple Arrays

Question

I am currently using the following Reduction function to sum all of the elements in an array with CUDA:

__global__ void reduceSum(int *input, int *input2, int *input3, int *outdata, int size){
    extern __shared__ int sdata[];

    unsigned int tID = threadIdx.x;
    unsigned int i = tID + blockIdx.x * (blockDim.x * 2);
    sdata[tID] = input[i] + input[i + blockDim.x];
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 32; stride >>= 1)
    {
        if (tID < stride)
        {
            sdata[tID] += sdata[tID + stride];
        }
        __syncthreads();
    }
    
    if (tID < 32){ warpReduce(sdata, tID); }

    if (tID == 0)
    {
        outdata[blockIdx.x] = sdata[0];
    }
}

However, as the function parameters show, I would like to sum three separate arrays inside one reduction function. A simple way to do this would be to launch the kernel three times, passing a different array each time, and that would work fine. This is only a test kernel, though: the real kernel will take an array of structs, and I will need to sum the X, Y and Z values across all of the structs, which is why I need to sum them all in one kernel.

I have initialized and allocated memory for all three arrays:

    int test[1000];
    std::fill_n(test, 1000, 1);
    int *d_test;

    int test2[1000];
    std::fill_n(test2, 1000, 2);
    int *d_test2;

    int test3[1000];
    std::fill_n(test3, 1000, 3);
    int *d_test3;

    cudaMalloc((void**)&d_test, 1000 * sizeof(int));
    cudaMalloc((void**)&d_test2, 1000 * sizeof(int));
    cudaMalloc((void**)&d_test3, 1000 * sizeof(int));

I am unsure what grid and block dimensions I should use for this kind of kernel, and I am not entirely sure how to modify the reduction loop to place the data as I want it, i.e. with the output array laid out like this:

Block 1 Result|Block 2 Result|Block 3 Result|Block 4 Result|Block 5 Result|Block 6 Result|

      Test Array 1 Sums              Test Array 2 Sums            Test Array 3 Sums

I hope that makes sense. Or is there a better way to have only one reduction function but be able to return the summation of Struct.X, Struct.Y or struct.Z?

Here's the struct:

template <typename T>
struct planet {
    T x, y, z;
    T vx, vy, vz;
    T mass;
};

I need to add up all the vx and store it, all the vy and store it and all the vz and store it.

Why not provide an actual definition of the array of structures that you would like to sum? Is it just: struct my_struct { int x,y,z;} data[1000]; ? This matters because a reduction like this will be limited by memory bandwidth, so both the organization of data in memory and the access pattern need to be understood to achieve the highest performance. A good solution will arrange its memory accesses to make full use of the available memory bandwidth.
– Robert Crovella, Commented Feb 25, 2016 at 16:54
Sorry, you are right, I've updated the main post with the definition of the struct.
– Conor Watson, Commented Feb 26, 2016 at 11:19

Robert Crovella · Accepted Answer · 2016-02-27 15:08:58Z

Or is there a better way to have only one reduction function but be able to return the summation of Struct.X, Struct.Y or struct.Z?

Usually a principal focus of accelerated computing is speed. Speed (performance) of GPU codes often depends heavily on data storage and access patterns. Therefore, although as you point out in your question we could realize a solution in a number of ways, let's focus on something that should be relatively fast.

Reductions like this don't have much arithmetic/operation intensity, so our focus for performance will mostly revolve around data storage for efficient access. When accessing global memory, GPUs will typically do so in large chunks -- 32 byte or 128 byte chunks. To make efficient use of the memory subsystem, we'll want to use all 32 or 128 of those bytes that are requested, on each request.

But the implied data storage pattern of your structure:

template <typename T>
struct planet {
    T x, y, z;
    T vx, vy, vz;
    T mass;
};

pretty much rules this out. For this problem you care about vx, vy, and vz. Those 3 items should be contiguous within a given structure (element), but in an array of those structures, they will be separated by the necessary storage for the other structure items, at least:

planet0:       T x
               T y
               T z               ---------------
               T vx      <--           ^
               T vy      <--           |
               T vz      <--       32-byte read
               T mass                  |
planet1:       T x                     |
               T y                     v
               T z               ---------------
               T vx      <--
               T vy      <--
               T vz      <--
               T mass
planet2:       T x
               T y
               T z
               T vx      <--
               T vy      <--
               T vz      <--
               T mass

(for the sake of example, assuming T is float)

This points out a key drawback of Array of Structures (AoS) storage formats on a GPU. Accessing the same element from consecutive structures is inefficient, due to the access granularity (32-byte) of the GPU. The usual suggestion for performance in such cases is to convert the AoS storage to SoA (structure of arrays):
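To put a rough number on that inefficiency, here is a small host-side sketch in plain C++. It assumes T is float, and the names PlanetF, usefulFraction and vxOffset are made up for illustration. It only does layout arithmetic; it does not measure anything on a GPU.

```cpp
#include <cstddef>

// Same member layout as the planet struct above, with T = float.
// (PlanetF is a hypothetical name used only for this illustration.)
struct PlanetF {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};

// Distance in bytes between the same field of two consecutive array elements.
constexpr std::size_t kStride = sizeof(PlanetF);
// Bytes per element that the velocity reduction actually needs (vx, vy, vz).
constexpr std::size_t kUseful = 3 * sizeof(float);

// Fraction of the array's bytes that the reduction uses. If every byte of
// the array ends up being fetched, this is the best-case bus efficiency.
constexpr double usefulFraction() {
    return static_cast<double>(kUseful) / static_cast<double>(kStride);
}

// Byte offset of vx in element k of a PlanetF array.
constexpr std::size_t vxOffset(std::size_t k) {
    return k * kStride + offsetof(PlanetF, vx);
}
```

With 4-byte floats the struct is 28 bytes, so at best 12 of every 28 bytes fetched (about 43%) are velocity data, and consecutive vx values sit 28 bytes apart rather than 4.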

template <typename T, int N>
struct planets {
    T x[N], y[N], z[N];
    T vx[N], vy[N], vz[N];
    T mass[N];
};

The above is just one possible example, probably not what you would actually use, as the structure would serve little purpose, since we would only have one structure for N planets. The point is, now when I access vx for consecutive planets, the individual vx elements are all adjacent in memory, so a 32-byte read gives me 32 bytes worth of vx data, with no wasted or unused elements.
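As a minimal host-side sketch of that conversion, assuming the planets start out in a std::vector of the original struct: VelocitySoA and toVelocitySoA are hypothetical names, and only the velocity fields are split out because those are the ones being reduced.

```cpp
#include <cstddef>
#include <vector>

// The original AoS element type from the question.
template <typename T>
struct planet {
    T x, y, z;
    T vx, vy, vz;
    T mass;
};

// Hypothetical SoA container holding only the velocity components.
template <typename T>
struct VelocitySoA {
    std::vector<T> vx, vy, vz;
};

// Copy the velocity fields of each planet into three contiguous arrays.
template <typename T>
VelocitySoA<T> toVelocitySoA(const std::vector<planet<T>>& aos) {
    VelocitySoA<T> soa;
    soa.vx.reserve(aos.size());
    soa.vy.reserve(aos.size());
    soa.vz.reserve(aos.size());
    for (const planet<T>& p : aos) {
        soa.vx.push_back(p.vx);
        soa.vy.push_back(p.vy);
        soa.vz.push_back(p.vz);
    }
    return soa;
}
```

Each of the three vectors is then one contiguous block that could be copied to the device with cudaMemcpy and passed to a reduction as a plain T* array.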

With such a transformation, the reduction problem becomes relatively simple again, from the standpoint of code organization. You can use essentially the same as your single array reduction code, either called 3 times in a row or else with a straightforward extension to the kernel code to essentially handle all 3 arrays independently. A "3-in-1" kernel might look something like this:

template <typename T>
__global__ void reduceSum(T *input_vx, T *input_vy, T *input_vz, T *outdata_vx, T *outdata_vy, T *outdata_vz, int size){
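    // Declare dynamic shared memory as raw bytes and cast it: a templated
    // `extern __shared__ T sdata[]` gives conflicting declarations once the
    // kernel is instantiated for more than one T.
    extern __shared__ __align__(sizeof(T)) unsigned char smem[];
    T *sdata = reinterpret_cast<T *>(smem);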

    const int VX = 0;
    const int VY = blockDim.x;
    const int VZ = 2*blockDim.x;

    unsigned int tID = threadIdx.x;
    unsigned int i = tID + blockIdx.x * (blockDim.x * 2);
    sdata[tID+VX] = input_vx[i] + input_vx[i + blockDim.x];
    sdata[tID+VY] = input_vy[i] + input_vy[i + blockDim.x];
    sdata[tID+VZ] = input_vz[i] + input_vz[i + blockDim.x];
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 32; stride >>= 1)
    {
        if (tID < stride)
        {
            sdata[tID+VX] += sdata[tID+VX + stride];
            sdata[tID+VY] += sdata[tID+VY + stride];
            sdata[tID+VZ] += sdata[tID+VZ + stride];
        }
        __syncthreads();
    }

    if (tID < 32){ warpReduce(sdata+VX, tID); }
    if (tID < 32){ warpReduce(sdata+VY, tID); }
    if (tID < 32){ warpReduce(sdata+VZ, tID); }

    if (tID == 0)
    {
        outdata_vx[blockIdx.x] = sdata[VX];
        outdata_vy[blockIdx.x] = sdata[VY];
        outdata_vz[blockIdx.x] = sdata[VZ];
    }
}

(coded in browser - not tested - merely an extension of what you have shown as a "reference kernel")
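Since the kernel is untested, a host-side reference for the per-block partial sums may be handy as a check. This is a sketch in plain C++, and blockPartialSums is a hypothetical helper name. It assumes block b covers elements [b * 2 * threads, (b + 1) * 2 * threads), which matches the index arithmetic in the kernel, and it treats anything past the end of the input as zero.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference: one partial sum per block, where each block covers
// 2 * threads consecutive elements of `in`. Elements past the end of `in`
// contribute nothing, i.e. they are treated as zero.
template <typename T>
std::vector<T> blockPartialSums(const std::vector<T>& in, std::size_t threads) {
    const std::size_t perBlock = 2 * threads;
    const std::size_t blocks = (in.size() + perBlock - 1) / perBlock;
    std::vector<T> out(blocks, T(0));
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i / perBlock] += in[i];
    }
    return out;
}
```

Running it once per input array gives the values to compare against outdata_vx, outdata_vy and outdata_vz. Note that the kernel as written never checks size, so the comparison only makes sense if the device arrays are zero-padded up to a multiple of 2 * blockDim.x.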

The above AoS -> SoA data transformation will likely have performance benefits elsewhere in your code as well. Since the proposed kernel will handle 3 arrays at once, the grid and block dimensions should be exactly the same as what you would use for your reference kernel in the single-array case. Shared memory storage will need to increase (triple) per block.
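A minimal sketch of that launch arithmetic, in plain host-side C++: LaunchConfig and configFor are hypothetical names. It assumes the thread count is a power of two, and that warpReduce (not shown in the question) expects at least 64 threads per block, as the `stride > 32` loop suggests.

```cpp
#include <cstddef>

// Hypothetical helper describing a launch of the 3-in-1 kernel.
struct LaunchConfig {
    std::size_t threads;      // threads per block (assumed power of two)
    std::size_t blocks;       // number of blocks in the grid
    std::size_t sharedBytes;  // dynamic shared memory per block, in bytes
    std::size_t paddedSize;   // elements each input array must provide
};

template <typename T>
LaunchConfig configFor(std::size_t n, std::size_t threads) {
    // Each thread loads two elements, so one block consumes 2 * threads.
    const std::size_t perBlock = 2 * threads;
    LaunchConfig cfg;
    cfg.threads = threads;
    cfg.blocks = (n + perBlock - 1) / perBlock;   // round up
    cfg.sharedBytes = 3 * threads * sizeof(T);    // one slice each for vx, vy, vz
    cfg.paddedSize = cfg.blocks * perBlock;       // kernel has no bounds check
    return cfg;
}
```

For the 1000-element test arrays with 128 threads per block this gives 4 blocks, with each array padded to 1024 elements. The launch would then look like reduceSum<int><<<cfg.blocks, cfg.threads, cfg.sharedBytes>>>(...), and each output array needs room for cfg.blocks partial sums.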

James W · Accepted Answer · 2016-02-27 16:42:54Z

Robert Crovella gave an excellent answer that highlights the importance of the AoS -> SoA layout transformation, which often improves performance on the GPU. I'd just like to propose a middle ground that might be more convenient. The CUDA language provides a few vector types for just the purpose you describe (see this section of the CUDA programming guide).

For example, CUDA defines int3, a datatype that stores 3 integers.

 struct int3
 {
    int x; int y; int z;
 };

Similar types exist for floats, chars, doubles etc. What's nice about these datatypes is that they can be loaded with a single instruction, which may give you a small performance boost. See this NVIDIA blog post for a discussion of this. It's also a more "natural" datatype for this case, and it might make other parts of your code easier to work with. You could define, for example:

struct planets {
    float3 position[N];
    float3 velocity[N];
    int mass[N];
};

A reduction kernel that uses this datatype might look something like this (adapted from Robert's).

__inline__ __device__ void SumInt3(int3 const & input1, int3 const & input2, int3 & result)
{
    result.x = input1.x + input2.x;
    result.y = input1.y + input2.y;
    result.z = input1.z + input2.z;
}

__inline__ __device__ void WarpReduceInt3(int3 const & input, int3 & output, unsigned int const tID)
{
    output.x = WarpReduce(input.x, tID);
    output.y = WarpReduce(input.y, tID);
    output.z = WarpReduce(input.z, tID);    
}

__global__ void reduceSum(int3 * inputData, int3 * output, int size){
    extern __shared__ int3 sdata[];

    int3 temp;

    unsigned int tID = threadIdx.x;
    unsigned int i = tID + blockIdx.x * (blockDim.x * 2);

    // Load and sum two integer triplets, store the answer in temp.
    SumInt3(inputData[i], inputData[i + blockDim.x], temp);

    // Write the temporary answer to shared memory.
    sdata[tID] = temp;

    __syncthreads();

    // Run the shared-memory tree down to stride 32 so that sdata[0..31]
    // holds all remaining partial sums. This assumes WarpReduce (not shown)
    // combines register values within a warp and does not itself read
    // sdata[tID + 32].
    for (unsigned int stride = blockDim.x / 2; stride >= 32; stride >>= 1)
    {
        if (tID < stride)
        {
            SumInt3(sdata[tID], sdata[tID + stride], temp);
            sdata[tID] = temp;
        }
        __syncthreads();
    }

    // Sum the intermediate results across a warp.
    // No need to write the answer to shared memory,
    // as only the contribution from tID == 0 will matter.
    if (tID < 32)
    {
        WarpReduceInt3(sdata[tID], temp, tID);
    }

    if (tID == 0)
    {
        output[blockIdx.x] = temp;
    }
}

int3 and float3 cannot be loaded in a single instruction. Given that packed int3 or float3 storage will fall on varying boundaries, the compiler will almost certainly break it up into 3 int or float loads. Since these individual int or float loads now have intervening members that are not useful, you will once again run into the efficiency problem I mentioned in my answer. There's a reason why the blog post you linked didn't suggest using a vector-3 method.
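The "varying boundaries" point can be checked with a little layout arithmetic. This is a host-side sketch in plain C++ in which Int3 is a stand-in with the same three int members as CUDA's int3, and elementOffset and startsOn16 are hypothetical helper names. It assumes 4-byte ints.

```cpp
#include <cstddef>

// Stand-in with the same members as CUDA's int3: three packed ints.
struct Int3 { int x, y, z; };

// Byte offset of element k in a packed Int3 array.
constexpr std::size_t elementOffset(std::size_t k) {
    return k * sizeof(Int3);
}

// True if element k starts on a 16-byte boundary, which a single
// 128-bit vector load would need.
constexpr bool startsOn16(std::size_t k) {
    return elementOffset(k) % 16 == 0;
}
```

Elements start at byte offsets 0, 12, 24, 36, 48 and so on, so only every fourth one lands on a 16-byte boundary. That is consistent with the compiler falling back to three separate 4-byte loads per element.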
