[GPU] Why does `gpu.launch_func` only accept a single async dependency?

Hi all!

I am trying to address [GPUToLLVM] `gpu.launch_func` lowering in gpu-to-llvm only supports 1 async dependency · Issue #156984 · llvm/llvm-project · GitHub, but as pointed out in the issue, there are explicit checks in the code that reject a `gpu.launch_func` with multiple async dependencies: llvm-project/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp at main · llvm/llvm-project · GitHub

Is there a reason why this is the case?

I looked at the CUDA runtime API for submitting a kernel and at the HIP runtime API for submitting a kernel, and both seem more than happy to accept any valid stream in their respective APIs; I’m assuming these streams have no restrictions on how many tasks/kernels can be queued on a stream at once.

Are we doing this because different hardware implementations may have streams that behave differently (e.g. out-of-order execution, non-default behavior in CUDA/HIP)? Would any implementation that handles this also need to consider the type of the stream, which afaik has no representation in MLIR (because instead of “streams” MLIR just uses vectors of Values)?

Also, how are these handled on the Nvidia/AMD side? Is everyone else manually placing a gpu.wait on all the async dependencies of a gpu.launch before launching the kernel? Are there any other vendor-specific reasons that prevent gpu.launch_func ops from depending on multiple async ops?

Thanks so much for reading, and thanks for your time!

Hi, the MLIR restriction to a single token is not related to how many tasks you can launch on a stream.

Take this code:

module attributes {gpu.container_module} {
  gpu.binary @kernels  [#gpu.object<#rocdl.target, assembly = "">]
  func.func @async(%arg0: index) {
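    // 'gpu.wait async' with no dependencies creates the initial token;
    // after lowering it becomes the stream-creation call, and every op
    // below consumes the previous token, forming a single chain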
    %0 = gpu.wait async
    %1 = gpu.launch_func async [%0] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)  
    %2 = gpu.launch_func async [%1] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)  
    %memref, %asyncToken = gpu.alloc async [%2] () : memref<7xf32>
    %3 = gpu.dealloc async [%asyncToken] %memref : memref<7xf32>
    gpu.wait [%3]
    return
  }
}

When lowered and translated to LLVM it generates:

define void @async(i64 %0) {
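  ; a single stream (%2) is created once, and every call below is queued on it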
  %2 = call ptr @mgpuStreamCreate()
  %3 = alloca %0, align 8
  %4 = alloca ptr, i64 0, align 8
  %5 = load ptr, ptr @kernels_module, align 8
  %6 = call ptr @mgpuModuleGetFunction(ptr %5, ptr @kernels_kernel_name)
  call void @mgpuLaunchKernel(ptr %6, i64 %0, i64 %0, i64 %0, i64 %0, i64 %0, i64 %0, i32 0, ptr %2, ptr %4, ptr null, i64 0)
  %7 = alloca %1, align 8
  %8 = alloca ptr, i64 0, align 8
  %9 = load ptr, ptr @kernels_module, align 8
  %10 = call ptr @mgpuModuleGetFunction(ptr %9, ptr @kernels_kernel_name)
  call void @mgpuLaunchKernel(ptr %10, i64 %0, i64 %0, i64 %0, i64 %0, i64 %0, i64 %0, i32 0, ptr %2, ptr %8, ptr null, i64 0)
  %11 = call ptr @mgpuMemAlloc(i64 28, ptr %2, i8 0)
  call void @mgpuMemFree(ptr %11, ptr %2)
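  ; the final 'gpu.wait [%3]' becomes a synchronize + destroy of the stream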
  call void @mgpuStreamSynchronize(ptr %2)
  call void @mgpuStreamDestroy(ptr %2)
  ret void
}

For reference, the mgpu* calls in the code above are defined in llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp at main · llvm/llvm-project · GitHub.

As seen from the above code, all the kernels get scheduled on the same stream.

For all intents and purposes, a !gpu.async.token at a certain point becomes a stream, which is the reason for the restriction to a single token on the launch: the APIs you reference also take only a single stream.

You can always use the gpu.wait operation to tie n tokens to the result token it defines, and then use that result token on a gpu.launch_func/gpu.launch. This is functionally equivalent to specifying multiple tokens on a launch, so it’s better to have a single canonical form. I don’t see why the restriction should be removed if the semantics are already expressible.
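For example, a minimal sketch (the token names %t0 and %t1 are hypothetical; @kernels::@kernel and %arg0 are taken from the example above):

%merged = gpu.wait async [%t0, %t1]  // tie both tokens to a single result token
%t2 = gpu.launch_func async [%merged] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)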

Hi @fabianmc, @bondhugula, thanks for explaining! I see where I was confused before, and I understand the rationale now.

With regard to [GPUToLLVM] `gpu.launch_func` lowering in gpu-to-llvm only supports 1 async dependency · Issue #156984 · llvm/llvm-project · GitHub, do you have a preference as to how I should go about this?

  1. I could add a pattern in GPUToLLVM that inserts a gpu.wait in front of ops that have multiple async dependencies but shouldn’t (e.g. gpu.launch_func or gpu.alloc), and then rely on the pattern already in GPUToLLVM to convert the newly added gpu.wait ops into the LLVM dialect (see the sketch after this list), or
  2. If such a rewrite shouldn’t live in GPUToLLVM, I could put it in another pass, e.g. GPUAsyncRegionPass.
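
For concreteness, here is a sketch of what the rewrite in option 1 would do (the operand names are made up; the launch is the one from the example above). Before:

%t = gpu.launch_func async [%t0, %t1] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)

After:

%merged = gpu.wait async [%t0, %t1]  // inserted by the new pattern
%t = gpu.launch_func async [%merged] @kernels::@kernel blocks in (%arg0, %arg0, %arg0) threads in (%arg0, %arg0, %arg0)

The existing GPUToLLVM pattern for gpu.wait would then lower the inserted op as usual.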

I haven’t heard back, so I am tentatively going with the GPUToLLVM approach. If there are problems with this, please feel free to let me know. Thanks!