See the custom_dispatch README for an overview of this approach.
This sample demonstrates how to define external device functions that can be dispatched from within IREE programs when following the IREE CUDA ABI. The user authoring the kernels compiles their CUDA code to PTX blobs and can dispatch functions within those blobs by declaring them in their IR.
Note that currently only entire kernel launches can be modeled, and this prevents IREE from performing optimizations it could otherwise perform. In the future, PTX linking will be implemented so that the external functions are referenced and linked with the compiler-produced portions, allowing information about how the dispatch is used to specialize/prune the hand-authored kernel. Because the IREE CUDA ABI is not version-stable, this whole-kernel approach may require updating when taking new IREE versions, while function-level linking would not.
Since today only entire kernels can be provided, the user must specify an empty executable (no `builtin.module` contents) and thus must provide objects for all targets they are compiling for. When partial function linking is available it'll be possible to provide fallback code as IR for when objects are not available.
```
+------------+              +-------------------+      +--------------+
| kernels.cu | -> nvcc -+-> | kernels_sm_52.ptx | -+   | example.mlir |
+------------+          |   +-------------------+  |   +--------------+
                        |   +-------------------+  |          v
                        +-> | kernels_sm_80.ptx | -+----> iree-compile
                            +-------------------+              v
                                                       +--------------+
                                                       | example.vmfb |
                                                       +--------------+
```
extern "C" __global__ void simple_mul(const float* __restrict__ binding0, const float* __restrict__ binding1, float* __restrict__ binding2, int dim) { int tid = blockDim.x * blockIdx.x + threadIdx.x; if (tid < dim) binding2[tid] = binding0[tid] * binding1[tid]; }
```sh
nvcc ... (TODO, see CMakeLists.txt) -o kernels_sm_80.ptx
```
The export op's workgroup count calculation region receives a `%device` argument that can be queried if runtime device information is needed.

```mlir
hal.executable.source private @executable attributes {
  objects = #hal.executable.objects<{
    #nvptx_sm_80_target = [
      #hal.executable.object<{path = "kernels_sm_80.ptx"}>
    ]
  }>
} {
  hal.executable.export public @simple_mul ordinal(0)
      layout(#hal.pipeline.layout<constants = 1, bindings = [
        #hal.pipeline.binding<storage_buffer, ReadOnly>,
        #hal.pipeline.binding<storage_buffer, ReadOnly>,
        #hal.pipeline.binding<storage_buffer>
      ]>) attributes {workgroup_size = [64 : index, 1 : index, 1 : index]} {
  ^bb0(%device: !hal.device, %workload: index):
    %x = affine.apply affine_map<()[s0] -> (s0 ceildiv 64)>()[%workload]
    %c1 = arith.constant 1 : index
    hal.return %x, %c1, %c1 : index, index, index
  }
}
```
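The `#nvptx_sm_80_target` alias keyed in the `objects` map is not shown above; it refers to a CUDA executable target attribute defined elsewhere in the module. A minimal sketch of what such aliases might look like (the attribute contents here are illustrative, not copied from the sample):

```mlir
// Hypothetical target aliases: each architecture being compiled for needs a
// matching entry in the executable's objects map.
#nvptx_sm_52_target = #hal.executable.target<"cuda", "cuda-nvptx-fb", {target_arch = "sm_52"}>
#nvptx_sm_80_target = #hal.executable.target<"cuda", "cuda-nvptx-fb", {target_arch = "sm_80"}>
```

With the executable declared, dispatches to the exported function can be made directly from the program's IR: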
```mlir
%0 = flow.dispatch @executable::@simple_mul[%dim](%dim_i32, %arg0, %arg1)
    : (i32, tensor<?xf32>{%dim}, tensor<?xf32>{%dim}) -> tensor<?xf32>{%dim}
```
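In the sample, this dispatch is made from a `mixed_invocation` function (run in the final step below) that mixes the hand-authored kernel with ordinary tensor ops compiled by IREE. A rough sketch of what that can look like; the surrounding ops are illustrative, not lifted from `example.mlir`:

```mlir
// Hypothetical caller: the hand-authored PTX handles the multiply while IREE
// code-generates the ops around it and schedules everything as usual.
func.func @mixed_invocation(%arg0: tensor<?xf32>, %arg1: tensor<?xf32>) -> tensor<?xf32> {
  %c0 = arith.constant 0 : index
  %dim = tensor.dim %arg0, %c0 : tensor<?xf32>
  %dim_i32 = arith.index_cast %dim : index to i32
  // Dispatch to the external kernel declared above.
  %0 = flow.dispatch @executable::@simple_mul[%dim](%dim_i32, %arg0, %arg1)
      : (i32, tensor<?xf32>{%dim}, tensor<?xf32>{%dim}) -> tensor<?xf32>{%dim}
  // Normal code-generated work mixed in after the custom dispatch.
  %1 = arith.addf %0, %arg1 : tensor<?xf32>
  return %1 : tensor<?xf32>
}
```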
This presumes that `iree-compile` and `iree-run-module` have been installed or built. See here for instructions on CMake setup and building from source.
Ensure that the CUDA SDK and `nvcc` are on your `PATH`:
```sh
nvcc --version
```
Build the `iree-sample-deps` CMake target to compile the `.cu` to `.ptx`:
```sh
cmake --build ../iree-build/ --target iree-sample-deps
```
In a user application this would be replaced with whatever build infrastructure the user has for compiling kernels to PTX. No IREE compiler or runtime changes are required and the normal compiler install can be used.
Compile the example module to a `.vmfb` file, passing the path to the build directory so the `.ptx` files can be found:
```sh
iree-compile \
    --iree-hal-executable-object-search-path=../iree-build/ \
    samples/custom_dispatch/cuda/kernels/example.mlir \
    -o=/tmp/example.vmfb
```
Run the example program using the custom kernels:
```sh
iree-run-module \
    --device=cuda \
    --function=mixed_invocation \
    --input=8xf32=2 \
    --input=8xf32=4 \
    /tmp/example.vmfb
```