[GPU] Why does `gpu.launch_func` only accept a single async dependency?

Hi all!

I am trying to address [GPUToLLVM] `gpu.launch_func` lowering in gpu-to-llvm only supports 1 async dependency · Issue #156984 · llvm/llvm-project · GitHub, but as pointed out in the issue, there are explicit checks in the code that reject a `gpu.launch_func` with multiple async dependencies: llvm-project/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp at main · llvm/llvm-project · GitHub

Is there a reason why this is the case?

I looked at the CUDA runtime API for submitting a kernel and at the HIP runtime API for submitting a kernel, and both seem more than happy to accept any valid stream in their respective APIs; I’m assuming these streams have no restrictions on how many tasks/kernels can be queued on a stream at once.

Are we doing this because different hardware implementations may have streams that behave differently (e.g. out-of-order execution, non-default behavior in CUDA/HIP)? Would any implementation that handles this also need to consider the type of the stream, which afaik has no representation in MLIR (because instead of “streams” MLIR just uses vectors of Values)?

Also, how are these handled on the Nvidia/AMD side? Is everyone else manually placing a gpu.wait on all the async dependencies of a gpu.launch before launching the kernel? Are there any other vendor-specific reasons that prevent gpu.launch_func ops from depending on multiple async ops?

Thanks so much for reading, and thanks for your time!

Hi, the MLIR restriction to a single token is not related to how many tasks you can launch on a stream.

Take this code:

module attributes {gpu.container_module} {
  gpu.binary @kernels  [#gpu.object<#rocdl.target, assembly = "">]
  func.func @async(%arg0: index) {
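    // 'gpu.wait async' with no dependencies creates the initial token;
    // after lowering it becomes the stream-creation call, and every op
    // below consumes the previous token, forming a single chain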
    %0 = gpu.wait async
    %1 = gpu.launch_func async [%0] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)  
    %2 = gpu.launch_func async [%1] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)  
    %memref, %asyncToken = gpu.alloc async [%2] () : memref<7xf32>
    %3 = gpu.dealloc async [%asyncToken] %memref : memref<7xf32>
    gpu.wait [%3]
    return
  }
}

When lowered and translated to LLVM it generates:

define void @async(i64 %0) {
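  ; a single stream (%2) is created once, and every call below is queued on it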
  %2 = call ptr @mgpuStreamCreate()
  %3 = alloca %0, align 8
  %4 = alloca ptr, i64 0, align 8
  %5 = load ptr, ptr @kernels_module, align 8
  %6 = call ptr @mgpuModuleGetFunction(ptr %5, ptr @kernels_kernel_name)
  call void @mgpuLaunchKernel(ptr %6, i64 %0, i64 %0, i64 %0, i64 %0, i64 %0, i64 %0, i32 0, ptr %2, ptr %4, ptr null, i64 0)
  %7 = alloca %1, align 8
  %8 = alloca ptr, i64 0, align 8
  %9 = load ptr, ptr @kernels_module, align 8
  %10 = call ptr @mgpuModuleGetFunction(ptr %9, ptr @kernels_kernel_name)
  call void @mgpuLaunchKernel(ptr %10, i64 %0, i64 %0, i64 %0, i64 %0, i64 %0, i64 %0, i32 0, ptr %2, ptr %8, ptr null, i64 0)
  %11 = call ptr @mgpuMemAlloc(i64 28, ptr %2, i8 0)
  call void @mgpuMemFree(ptr %11, ptr %2)
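  ; the final 'gpu.wait [%3]' becomes a synchronize + destroy of the stream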
  call void @mgpuStreamSynchronize(ptr %2)
  call void @mgpuStreamDestroy(ptr %2)
  ret void
}

For reference, the mgpu* calls in the code above are defined in llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp at main · llvm/llvm-project · GitHub.

As seen from the above code, all the kernels get scheduled on the same stream.

For all intents and purposes, a !gpu.async.token at a certain point becomes a stream, which is the reason for the restriction to a single token on the launch: the APIs you reference also take only a single stream.

You can always use the gpu.wait operation to tie n tokens to the result token it defines, and then use that result token on a gpu.launch_func/gpu.launch. This is functionally equivalent to specifying multiple tokens on a launch, so it’s better to have a single canonical form. I don’t see why the restriction should be removed if the semantics are already expressible.
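For example, a minimal sketch (the token names %t0 and %t1 are hypothetical; @kernels::@kernel and %arg0 are taken from the example above):

%merged = gpu.wait async [%t0, %t1]  // tie both tokens to a single result token
%t2 = gpu.launch_func async [%merged] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)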

Hi @fabianmc, @bondhugula, thanks for explaining! I see where I was confused before, and I understand the rationale now.

With regard to [GPUToLLVM] `gpu.launch_func` lowering in gpu-to-llvm only supports 1 async dependency · Issue #156984 · llvm/llvm-project · GitHub, do you have a preference as to how I should go about this?

  1. I could add a pattern in GPUToLLVM that inserts a gpu.wait in front of ops that have multiple async dependencies but shouldn’t (e.g. gpu.launch_func or gpu.alloc), and then rely on the pattern already in GPUToLLVM to convert the newly added gpu.wait ops into the LLVM dialect (see the sketch after this list), or
  2. If such a rewrite shouldn’t live in GPUToLLVM, I could put it in another pass, e.g. GPUAsyncRegionPass.
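
For concreteness, here is a sketch of what the rewrite in option 1 would do (the operand names are made up; the launch is the one from the example above). Before:

%t = gpu.launch_func async [%t0, %t1] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)

After:

%merged = gpu.wait async [%t0, %t1]  // inserted by the new pattern
%t = gpu.launch_func async [%merged] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)

The existing GPUToLLVM pattern for gpu.wait would then lower the inserted op as usual.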

I haven’t heard back, so I am tentatively going with the GPUToLLVM approach. If there are problems with this, please feel free to let me know. Thanks!