See the custom_dispatch README for an overview of this approach.
This sample demonstrates how to define external device functions that can be dispatched from within IREE programs when following the IREE CUDA ABI. The user authoring the kernels compiles their CUDA code to PTX blobs and can dispatch functions within those blobs by declaring them in their IR.
Note that currently only entire kernel launches can be modeled and this prevents IREE from performing optimizations it otherwise can. In the future PTX linking will be implemented such that the external functions are referenced and linked with the compiler-produced portions such that more information about the usage of the dispatch can be used to specialize/prune the hand-authored kernel. Since the IREE CUDA ABI is not version-stable this entire kernel approach may require updating when taking new IREE versions while function-level linking would not.
Since today only entire kernels can be provided the user must specify an empty executable (no builtin.module contents) and thus must provide objects for all targets they are compiling for. When partial function linking is available it'll be possible to provide fallback code as IR for when objects are not available.
+------------+              +-------------------+       +--------------+
| kernels.cu | -> nvcc -+-> | kernels_sm_52.ptx | -+    | example.mlir |
+------------+          |   +-------------------+  |    +--------------+
                        |   +-------------------+  |           v
                        +-> | kernels_sm_80.ptx | -+----> iree-compile
                            +-------------------+              v
                                                        +--------------+
                                                        | example.vmfb |
                                                        +--------------+
extern "C" __global__ void simple_mul(const float* __restrict__ binding0, const float* __restrict__ binding1, float* __restrict__ binding2, int dim) { int tid = blockDim.x * blockIdx.x + threadIdx.x; if (tid < dim) binding2[tid] = binding0[tid] * binding1[tid]; }
nvcc ... (TODO, see CMakeLists.txt) -o kernels_sm_80.ptx
%device if runtime device information is needed.hal.executable.source private @executable attributes { objects = #hal.executable.objects<{ #nvptx_sm_52_target = [ #hal.executable.object<{path = "kernels_sm_80.ptx"}> ] }> hal.executable.export public @simple_mul ordinal(0) layout(#hal.pipeline.layout<constants = 1, bindings = [ #hal.pipeline.binding<storage_buffer, ReadOnly>, #hal.pipeline.binding<storage_buffer, ReadOnly>, #hal.pipeline.binding<storage_buffer> ]>) attributes {workgroup_size = [64 : index, 1 : index, 1 : index]} { ^bb0(%device: !hal.device, %workload: index): %x = affine.apply affine_map<()[s0] -> (s0 ceildiv 64)>()[%workload] %c1 = arith.constant 1 : index hal.return %x, %c1, %c1 : index, index, index } }
%0 = flow.dispatch @executable::@simple_mul[%dim](%dim_i32, %arg0, %arg1) : (i32, tensor<?xf32>{%dim}, tensor<?xf32>{%dim}) -> tensor<?xf32>{%dim}
This presumes that iree-compile and iree-run-module have been installed or built. See here for instructions for CMake setup and building from source.
Ensure that the CUDA SDK and nvcc is on your PATH:
nvcc --version
Build the iree-sample-deps CMake target to compile the .cu to .ptx:
cmake --build ../iree-build/ --target iree-sample-deps
In a user application this would be replaced with whatever build infrastructure the user has for compiling kernels to PTX. No IREE compiler or runtime changes are required and the normal compiler install can be used.
Compile the example module to a .vmfb file and pass the path to the build directory so the .spv files can be found:
iree-compile \
    --iree-hal-executable-object-search-path=../iree-build/ \
    samples/custom_dispatch/cuda/kernels/example.mlir \
    -o=/tmp/example.vmfb
Run the example program using the custom kernels:
iree-run-module \
    --device=cuda \
    --function=mixed_invocation \
    --input=8xf32=2 \
    --input=8xf32=4 \
    /tmp/example.vmfb