CUDA_ERROR_INVALID_VALUE when converting tiled scf.parallel to GPU kernels

lilil · September 9, 2025, 7:27am

I’m encountering a CUDA_ERROR_INVALID_VALUE error when executing GPU kernels that are generated from tiled scf.parallel operations. The specific error message is:

'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, smem, stream, params, extra)' failed with 'CUDA_ERROR_INVALID_VALUE'

Under what circumstances does tiling scf.parallel operations lead to this CUDA launch error during GPU kernel execution? Do I need to manually control the tile sizes to satisfy specific hardware constraints? If so, what are these constraints?

Example IR (I'm not sure whether this specific case is the one that triggers the error):

// before tiling
scf.parallel (%arg1, %arg2, %arg3) = (%c0, %c0, %c0) to (%c128, %c112, %c112) step (%c1, %c1, %c1) {
  %2742 = memref.load %reinterpret_cast_90[%arg1, %arg2, %arg3] : memref<128x112x112xf32>
  %2743 = memref.load %14[] : memref<f32>
  %2744 = arith.cmpf oge, %2742, %2743 : f32
  %2745 = arith.extui %2744 : i1 to i32
  %2746 = arith.sitofp %2745 : i32 to f32
  memref.store %2746, %reinterpret_cast_91[%arg1, %arg2, %arg3] : memref<128x112x112xf32>
  scf.reduce 
}

// after tiling
%c0_96 = arith.constant 0 : index
%c32_97 = arith.constant 32 : index
%c32_98 = arith.constant 32 : index
%c32_99 = arith.constant 32 : index
%120 = arith.muli %c1, %c32_97 : index
%121 = arith.muli %c1, %c32_98 : index
%122 = arith.muli %c1, %c32_99 : index
scf.parallel (%arg1, %arg2, %arg3) = (%c0, %c0, %c0) to (%c128, %c112, %c112) step (%120, %121, %122) {
  scf.parallel (%arg4, %arg5, %arg6) = (%c0_96, %c0_96, %c0_96) to (%120, %121, %122) step (%c1, %c1, %c1) {
    %4253 = arith.addi %arg4, %arg1 : index
    %4254 = arith.addi %arg5, %arg2 : index
    %4255 = arith.addi %arg6, %arg3 : index
    %true = arith.constant true
    %4256 = arith.muli %arg4, %c1 : index
    %4257 = arith.addi %4256, %arg1 : index
    %4258 = arith.cmpi ult, %4257, %c128 : index
    %4259 = arith.andi %true, %4258 : i1
    %4260 = arith.muli %arg5, %c1 : index
    %4261 = arith.addi %4260, %arg2 : index
    %4262 = arith.cmpi ult, %4261, %c112 : index
    %4263 = arith.andi %4259, %4262 : i1
    %4264 = arith.muli %arg6, %c1 : index
    %4265 = arith.addi %4264, %arg3 : index
    %4266 = arith.cmpi ult, %4265, %c112 : index
    %4267 = arith.andi %4263, %4266 : i1
    scf.if %4267 {
      %4268 = memref.load %reinterpret_cast_94[%4253, %4254, %4255] : memref<128x112x112xf32>
      %4269 = memref.load %14[] : memref<f32>
      %4270 = arith.cmpf oge, %4268, %4269 : f32
      %4271 = arith.extui %4270 : i1 to i32
      %4272 = arith.sitofp %4271 : i32 to f32
      memref.store %4272, %reinterpret_cast_95[%4253, %4254, %4255] : memref<128x112x112xf32>
    }
    scf.reduce 
  }
  scf.reduce 
}
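
One thing that can be checked by hand on the IR above is whether the tile sizes fit the CUDA launch limits. This is a minimal sketch, assuming the outer scf.parallel is mapped to the grid and the inner one to the thread block, and assuming the limits common on current NVIDIA GPUs (1024 threads per block, block dimensions up to 1024 x 1024 x 64). The real limits should be queried from the device, for example with cuDeviceGetAttribute, and the order in which loop dimensions map to x/y/z depends on the mapping pass. The helper name `check_launch` is hypothetical. A block that exceeds these limits is one known way to get CUDA_ERROR_INVALID_VALUE from cuLaunchKernel, though it may not be the cause here.

```python
from math import prod

# Assumed limits (typical for current NVIDIA GPUs, not read from a device).
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)           # x, y, z
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)   # x, y, z


def check_launch(upper_bounds, tile_sizes):
    """Derive grid/block sizes from loop bounds and tile sizes, then
    report every launch limit they violate.

    Assumes loops start at 0 with step 1 before tiling, and that loop
    dimension i maps to hardware dimension i (x, y, z).
    """
    block = tuple(tile_sizes)
    # Ceiling division: number of tiles needed to cover each dimension.
    grid = tuple(-(-ub // t) for ub, t in zip(upper_bounds, tile_sizes))

    problems = []
    for name, dims, limits in (("block", block, MAX_BLOCK_DIM),
                               ("grid", grid, MAX_GRID_DIM)):
        for axis, d, limit in zip("xyz", dims, limits):
            if d < 1:
                problems.append(f"{name}.{axis} = {d} must be at least 1")
            elif d > limit:
                problems.append(f"{name}.{axis} = {d} exceeds {limit}")

    threads = prod(block)
    if threads > MAX_THREADS_PER_BLOCK:
        problems.append(
            f"{threads} threads per block exceeds {MAX_THREADS_PER_BLOCK}")
    return grid, block, problems


if __name__ == "__main__":
    # Bounds and tile sizes taken from the tiled IR above.
    grid, block, problems = check_launch((128, 112, 112), (32, 32, 32))
    print("grid:", grid, "block:", block)
    for p in problems:
        print("problem:", p)
```

For the 32 x 32 x 32 tiles above, this reports a 4 x 4 x 4 grid and 32768 threads per block, which is over the assumed 1024-thread limit. Under the same assumptions, tiles of 8 x 8 x 8 (512 threads) would pass the check.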

Any insights on debugging this issue or understanding the hardware limitations that might be causing this error would be greatly appreciated. Thank you!
