but it does nothing: the pass runs, but I get back the same MLIR code. I assume this means the GPU dialect cannot be converted to a lower-level dialect? But it can’t be converted to LLVM either, because if I try to do so, I get:
cannot be converted to LLVM IR: missing `LLVMTranslationDialectInterface` registration for dialect for op: gpu.alloc
Then how am I supposed to lower the GPU dialect?
PS: I expected gpu.alloc to be lowered to some llvm.call to cudaMalloc (on the device). However, because the GPU dialect should also work with AMD, I don’t know where I should specify that I want to lower for NVIDIA.
The GPU dialect represents both host code and device code, which need to be converted differently. You need to create a (host) module that contains gpu modules, then convert the gpu modules to either NVVM or ROCDL and compile them to binaries with the appropriate passes. This will give you the (host) module with the kernels embedded as binary blobs, which can be compiled further. In addition to @mehdi_amini’s example, here’s an end-to-end integration test: llvm-project/mlir/test/Integration/GPU/CUDA/gpu-to-cubin.mlir at main · llvm/llvm-project · GitHub.
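For what it's worth, here is a rough C++ sketch of such a pipeline (pass names and header paths are taken from the upstream tree and may differ in your revision; the serialization step additionally requires an MLIR build with the CUDA or ROCm runner enabled):

```cpp
// Rough sketch: a host module containing gpu.module ops, the device side
// lowered to NVVM (use ROCDL for AMD), then the host side lowered to LLVM
// plus calls into the runtime wrappers. Header paths and pass names may
// differ between MLIR revisions.
#include "mlir/Conversion/GPUCommon/GPUCommonPass.h"
#include "mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/GPU/Transforms/Passes.h"
#include "mlir/Pass/PassManager.h"

void buildGpuLoweringPipeline(mlir::PassManager &pm) {
  // Outline gpu.launch bodies into kernels inside gpu.module ops.
  pm.addPass(mlir::createGpuKernelOutliningPass());
  // Device side: lower the contents of each gpu.module to NVVM.
  pm.addNestedPass<mlir::gpu::GPUModuleOp>(
      mlir::createLowerGpuOpsToNVVMOpsPass());
  // Device side: serialize each gpu.module to a binary blob. The pass name
  // depends on the revision (the linked test uses gpu-to-cubin), so it is
  // left as a placeholder here.
  // pm.addNestedPass<mlir::gpu::GPUModuleOp>(/* gpu-to-cubin or equivalent */);
  // Host side: lower gpu.launch_func, gpu.alloc, etc. to llvm.call into the
  // mgpu* runtime wrappers.
  pm.addPass(mlir::createGpuToLLVMConversionPass());
}
```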
It can be converted to a mix of LLVM and target-specific dialects (NVVM, ROCDL). It cannot be translated to LLVM IR, which is where the registration error pops up. Please be careful not to confuse conversion (between dialects) with translation (to LLVM IR).
So, according to these minimal examples, the GpuToLLVMConversionPass pass only works if the code contains alloc/dealloc with async. Why is that?
This may be something that got broken when async support was added, but I didn’t follow the transition closely enough. It does not seem expected to me that the non-async forms would just be silently ignored here (otherwise, what’s the point of having them!).
mgpuMemAlloc/Free() are intended to map to cuMemAllocAsync() (they don’t yet because we haven’t upgraded to CUDA 11.3) and therefore take a stream argument. The stream is converted from !gpu.async.token, which is missing from the operands in your initial code.
The gpu-async-region pass gets you from your initial code to the async variant.
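In C++ terms, that means running gpu-async-region before the GPU-to-LLVM conversion. A minimal sketch (pass creation functions as in the upstream tree; the nesting op is func::FuncOp in newer revisions, FuncOp in older ones):

```cpp
#include "mlir/Conversion/GPUCommon/GPUCommonPass.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/GPU/Transforms/Passes.h"
#include "mlir/Pass/PassManager.h"

void addHostLoweringPasses(mlir::PassManager &pm) {
  // Rewrites gpu.alloc/gpu.dealloc/gpu.launch_func into their async variants,
  // threading !gpu.async.token values through them; those tokens become the
  // stream arguments of the mgpu* wrappers.
  pm.addNestedPass<mlir::func::FuncOp>(mlir::createGpuAsyncRegionPass());
  pm.addPass(mlir::createGpuToLLVMConversionPass());
}
```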
I would like to ask a final question. Why doesn’t the GPU dialect have something like populateGPUToLLVMConversionPatterns, as Affine, Std, etc. do? If I’m not wrong, the only way to lower GPU to LLVM is to add a pass like the one I have written below, so it is not possible to use transitive lowering with the GPU dialect. Please correct me if I’m mistaken.
I guess this behaviour is by design, probably because it makes sense to have some dialect emit GPU code and then lower that GPU code with a pass (createGpuToLLVMConversionPass).
Some of the transformations in the GPU dialect lowering are not rewrite-pattern based but walk the IR instead. For example, the passes around introducing async behavior work that way.
I think for the GPU to LLVM transformation in particular there is no good reason. If you would like those patterns to be exposed, feel free to add a populate method. Note that you also need to configure the type converter and legality.
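For reference, here is a rough sketch of what such a pass could look like once a populate method exists (populateMyGpuToLLVMConversionPatterns below is a placeholder name for whatever method you add, not an existing upstream API; header paths may differ by revision):

```cpp
#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"
#include "mlir/Conversion/LLVMCommon/LoweringOptions.h"
#include "mlir/Conversion/LLVMCommon/TypeConverter.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/DialectConversion.h"

// Placeholder for the populate method you would add/expose.
void populateMyGpuToLLVMConversionPatterns(mlir::LLVMTypeConverter &converter,
                                           mlir::RewritePatternSet &patterns);

struct MyGpuToLLVMPass
    : public mlir::PassWrapper<MyGpuToLLVMPass,
                               mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    mlir::MLIRContext *ctx = &getContext();

    // Type converter: memref, index, etc. -> LLVM dialect types.
    mlir::LowerToLLVMOptions options(ctx);
    mlir::LLVMTypeConverter converter(ctx, options);

    mlir::RewritePatternSet patterns(ctx);
    populateMyGpuToLLVMConversionPatterns(converter, patterns);

    // Legality: the host-side gpu ops must be gone after this conversion; the
    // gpu.module contents are handled by the NVVM/ROCDL and serialization
    // passes instead.
    mlir::LLVMConversionTarget target(*ctx);
    target.addIllegalDialect<mlir::gpu::GPUDialect>();

    if (mlir::failed(mlir::applyPartialConversion(getOperation(), target,
                                                  std::move(patterns))))
      signalPassFailure();
  }
};
```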
Sure, I’ve been playing with it and I have just created the populate method. It works pretty well.
Sadly, I’m still having issues with the GPU dialect. I’m able to generate pure LLVM code, but when generating the final executable, I don’t know how to link my code against the mgpu functions (e.g., mgpuStreamCreate), so right now I get undefined reference to ... errors. @ftynse mentioned that there were two backends (NVIDIA and AMD), but I don’t even know how to choose between them.
I honestly think that some of the aspects of the GPU dialect we have been discussing in this thread should be documented somewhere (maybe they are and I just didn’t find them)…