CUDA_ERROR_INVALID_VALUE when converting tiled scf.parallel to GPU kernels

I get a CUDA_ERROR_INVALID_VALUE error when running GPU kernels generated from tiled scf.parallel operations. The full error message is:

'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, smem, stream, params, extra)' failed with 'CUDA_ERROR_INVALID_VALUE'

Under what circumstances does tiling scf.parallel operations cause this launch failure at kernel execution time? Do I need to choose the tile sizes manually so that the launch satisfies specific hardware constraints? If so, what are those constraints?
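To make the question concrete, here is a rough sketch of the kind of check I suspect applies to the launch parameters. The limits are assumptions taken from what is commonly documented for recent NVIDIA GPUs (1024 threads per block, block dimensions up to 1024 x 1024 x 64, grid dimensions up to 2^31-1 x 65535 x 65535); the real values should be queried from the device. The function name launch_config_problems is hypothetical, not part of MLIR or CUDA.

```python
from math import prod

# Assumed limits, as commonly documented for recent NVIDIA GPUs.
# The actual values for a given device should be queried at runtime.
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)


def launch_config_problems(grid, block):
    """Return a list of violations for a (grid, block) launch configuration."""
    problems = []
    for axis, g, b, g_max, b_max in zip("xyz", grid, block, MAX_GRID_DIM, MAX_BLOCK_DIM):
        # Every dimension must be at least 1 and within the per-axis limit.
        if not 1 <= g <= g_max:
            problems.append(f"grid.{axis}={g} outside [1, {g_max}]")
        if not 1 <= b <= b_max:
            problems.append(f"block.{axis}={b} outside [1, {b_max}]")
    # The product of the block dimensions is limited separately.
    threads = prod(block)
    if threads > MAX_THREADS_PER_BLOCK:
        problems.append(f"{threads} threads per block exceeds {MAX_THREADS_PER_BLOCK}")
    return problems
```

If my understanding is right, a configuration that fails any of these checks would be rejected by cuLaunchKernel before the kernel runs, which would match the error above.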

Example IR (I am not sure that this specific case is the one that triggers the error):

// before tiling
scf.parallel (%arg1, %arg2, %arg3) = (%c0, %c0, %c0) to (%c128, %c112, %c112) step (%c1, %c1, %c1) {
  %2742 = memref.load %reinterpret_cast_90[%arg1, %arg2, %arg3] : memref<128x112x112xf32>
  %2743 = memref.load %14[] : memref<f32>
  %2744 = arith.cmpf oge, %2742, %2743 : f32
  %2745 = arith.extui %2744 : i1 to i32
  %2746 = arith.sitofp %2745 : i32 to f32
  memref.store %2746, %reinterpret_cast_91[%arg1, %arg2, %arg3] : memref<128x112x112xf32>
  scf.reduce 
}

// after tiling
%c0_96 = arith.constant 0 : index
%c32_97 = arith.constant 32 : index
%c32_98 = arith.constant 32 : index
%c32_99 = arith.constant 32 : index
%120 = arith.muli %c1, %c32_97 : index
%121 = arith.muli %c1, %c32_98 : index
%122 = arith.muli %c1, %c32_99 : index
scf.parallel (%arg1, %arg2, %arg3) = (%c0, %c0, %c0) to (%c128, %c112, %c112) step (%120, %121, %122) {
  scf.parallel (%arg4, %arg5, %arg6) = (%c0_96, %c0_96, %c0_96) to (%120, %121, %122) step (%c1, %c1, %c1) {
    %4253 = arith.addi %arg4, %arg1 : index
    %4254 = arith.addi %arg5, %arg2 : index
    %4255 = arith.addi %arg6, %arg3 : index
    %true = arith.constant true
    %4256 = arith.muli %arg4, %c1 : index
    %4257 = arith.addi %4256, %arg1 : index
    %4258 = arith.cmpi ult, %4257, %c128 : index
    %4259 = arith.andi %true, %4258 : i1
    %4260 = arith.muli %arg5, %c1 : index
    %4261 = arith.addi %4260, %arg2 : index
    %4262 = arith.cmpi ult, %4261, %c112 : index
    %4263 = arith.andi %4259, %4262 : i1
    %4264 = arith.muli %arg6, %c1 : index
    %4265 = arith.addi %4264, %arg3 : index
    %4266 = arith.cmpi ult, %4265, %c112 : index
    %4267 = arith.andi %4263, %4266 : i1
    scf.if %4267 {
      %4268 = memref.load %reinterpret_cast_94[%4253, %4254, %4255] : memref<128x112x112xf32>
      %4269 = memref.load %14[] : memref<f32>
      %4270 = arith.cmpf oge, %4268, %4269 : f32
      %4271 = arith.extui %4270 : i1 to i32
      %4272 = arith.sitofp %4271 : i32 to f32
      memref.store %4272, %reinterpret_cast_95[%4253, %4254, %4255] : memref<128x112x112xf32>
    }
    scf.reduce 
  }
  scf.reduce 
}
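
As a sanity check on the IR above, this small sketch computes the launch dimensions I would expect, assuming the outer scf.parallel is mapped to the grid and the inner one to the thread block, with lower bounds of 0 and a step of 1 before tiling. That mapping is my assumption; I have not confirmed what the conversion actually emits. The helper name tiled_launch_dims is hypothetical.

```python
from math import prod


def tiled_launch_dims(upper_bounds, tile_sizes):
    """Grid and block sizes for a tiled loop nest (assumes lower bound 0, step 1)."""
    # Outer loop: one block per tile, rounding up for partial tiles.
    grid = tuple(-(-ub // t) for ub, t in zip(upper_bounds, tile_sizes))
    # Inner loop: one thread per point inside a tile.
    block = tuple(tile_sizes)
    return grid, block


grid, block = tiled_launch_dims((128, 112, 112), (32, 32, 32))
print(grid, block, prod(block))  # (4, 4, 4) (32, 32, 32) 32768
```

Under that assumption each block would contain 32 * 32 * 32 = 32768 threads, far above the 1024 threads per block that I have seen documented as the limit, so this may be the cause.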

Any insights on debugging this issue or understanding the hardware limitations that might be causing this error would be greatly appreciated. Thank you!