OpenMP Offloading Shared Memory Problem

Hi,
I’ve encountered some problems with an OpenMP offloading program that uses shared memory.

According to OMP111, the data of globalized variables will be optimized into shared memory.

I compile the following code sample with the command below:

clang++ -O3 -fopenmp -fopenmp-targets=nvptx64 --offload-arch=sm_89 test.c -o test
#pragma omp target data map(to: A[0:N][0:M], B[0:N][0:M]) map(tofrom: C[0:N][0:N])
{
  #pragma omp target teams distribute
  for (int i = 0; i < n; i += TILE_SIZE)
    for (int j = 0; j < n; j += TILE_SIZE)
      for (int k = 0; k < m; k += TILE_SIZE) {
        // use of shared memory
        float sA1[TILE_SIZE][TILE_SIZE];
        float sA2[TILE_SIZE][TILE_SIZE];
        float sB1[TILE_SIZE][TILE_SIZE];
        float sB2[TILE_SIZE][TILE_SIZE];
        
        // other memory copy and computation
        ...
      }
}

If I set TILE_SIZE to 32, the program works fine, but if TILE_SIZE is set to 64, the program crashes. So I suspect this might be a shared memory optimization issue.

According to this, I also tried:

env LIBOMPTARGET_SHARED_MEMORY_SIZE=256000 ./test

but the environment variable seems to have no effect and the program still crashes.

I’m using LLVM 19.1.6 with the following settings:

cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_C_COMPILER="clang" \
      -DCMAKE_CXX_COMPILER="clang++"  \
      -DLLVM_ENABLE_PROJECTS="clang" \
      -DLLVM_ENABLE_RUNTIMES="openmp;offload" \
      -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
      -DLLVM_PARALLEL_COMPILE_JOBS=16 \
      -DLLVM_BUILD_LLVM_DYLIB=ON \
      -DLLVM_LINK_LLVM_DYLIB=ON \
      -DOMPTARGET_DEBUG=ON \
      -DOPENMP_ENABLE_LIBOMP_PROFILING=ON \
      ../llvm

Is this a shared memory allocation problem or could it be caused by something else?

At first glance, I would suspect the shared memory capacity to be the issue, given that it works with smaller TILE_SIZEs. Does env LIBOMPTARGET_INFO=-1 or env LIBOMPTARGET_DEBUG=1 provide any more information about the last kernel launch or the error that might be useful?

Note that env LIBOMPTARGET_SHARED_MEMORY_SIZE=256000 influences the dynamic shared memory size, not the static one. So it might even cause an issue if you want dynamic shared memory (akin to the shared-memory argument you pass to a CUDA kernel launch) and also have static shared memory buffers (akin to __shared__).

To use the dynamic shared memory feature, try something like:

      // Carve four disjoint tiles out of the single dynamic shared memory block.
      float *sA1 = (float *)llvm_omp_target_dynamic_shared_alloc();
      float *sA2 = &sA1[TILE_SIZE * TILE_SIZE];
      float *sB1 = &sA2[TILE_SIZE * TILE_SIZE];
      float *sB2 = &sB1[TILE_SIZE * TILE_SIZE];
      #define ACC_SHARED(A, I, J) (A)[(I) * TILE_SIZE + (J)]

      // Use sA1
      ACC_SHARED(sA1, 3, 4) = 42;

And start this with env LIBOMPTARGET_SHARED_MEMORY_SIZE=65536 (four 64 × 64 float buffers: 64 * 64 * 4 * 4 bytes).

If you have a full reproducer, we can probably help look into it, at least with LLVM main.


Most modern CUDA GPUs have 48 KiB of static shared memory available for use. With a tile size of 32 you use 16 KiB while a tile size of 64 uses 64 KiB, which is over the limit as far as I’m aware.


Thank you @jdoerfert and @jhuber6 ,

I found that the problem was due to my misunderstanding of the static/dynamic shared memory attribute and its declaration.
After replacing the static usage with dynamic usage, the problem was resolved.
