Hi all!
I am trying to address llvm/llvm-project issue #156984 ("[GPUToLLVM] `gpu.launch_func` lowering in gpu-to-llvm only supports 1 async dependency"), but as pointed out in the issue, there are explicit checks in `mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp` that prevent `gpu.launch_func` from being lowered with multiple async dependencies.
Is there a reason why this is the case?
I looked at the CUDA runtime API and the HIP runtime API for submitting a kernel, and both seem more than happy to accept any valid stream. I'm assuming these streams place no restrictions on how many tasks/kernels can be queued on a stream at once.
Are we doing this because different hardware implementations may have streams that behave differently (e.g. out-of-order execution, or non-default stream behavior in CUDA/HIP)? Would any implementation that handles multiple dependencies also need to consider the type of the streams, which afaik has no representation in MLIR (since instead of "streams" MLIR just uses lists of async-token `Value`s)?
Also, how is this handled on the Nvidia/AMD side? Is everyone else manually placing a `gpu.wait` on all the async dependencies of a `gpu.launch_func` before launching the kernel? Are there any other vendor-specific reasons that prevent `gpu.launch_func` ops from depending on multiple async ops?
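For concreteness, here is a rough IR sketch (kernel names and shapes are made up) of the workaround I mean: collapsing several async tokens into a single one with an async `gpu.wait` so that the launch only carries one dependency, which is the shape the current lowering accepts:

```mlir
// Two independent async ops, each producing its own token:
%t0 = gpu.launch_func async @kernels::@a
        blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
%t1 = gpu.memcpy async [%t0] %dst, %src : memref<8xf32>, memref<8xf32>

// Merge the tokens into one dependency before the next launch:
%merged = gpu.wait async [%t0, %t1]
%t2 = gpu.launch_func async [%merged] @kernels::@b
        blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
```

If the lowering could accept `[%t0, %t1]` directly on the second `gpu.launch_func`, the extra `gpu.wait` (and the stream synchronization it implies) would be unnecessary.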
Thanks so much for reading, and thanks for your time!