| # Custom Dispatches | 
 |  | 
 | "Dispatches" in IREE are parallelized device function calls where "device" may | 
 | be a CPU task system, a GPU, or a dedicated accelerator. Parallelism is a | 
first-class concept in this model, and plugging custom dispatches into IREE
requires reasoning about these function calls as if they were parallelized and
dispatched across a 3D grid (as on GPUs or CPU task systems). Note that a
degenerate case of grid dispatch is a grid size of 1x1x1, which turns the
dispatch into a simple (if inefficient) device-side function call.
 |  | 
 | In normal workflows the IREE compiler forms the dispatch functions by way of | 
 | fusion and then uses code generation ("codegen") to translate the dispatch | 
functions into backend-specific forms like PTX, SPIR-V, or LLVM IR. It's
possible to augment this translation either by bypassing these steps entirely
and providing an already-translated representation of the dispatch function, or
by extending the code generation portion to call out to external functions from
within the generated dispatch (sometimes called "microkernels" or "device
libraries").
 |  | 
 | There's ongoing work across the core IREE compiler and specific backends to | 
 | enable more extension points and ways to connect to them from frontends or | 
 | compiler transforms. This current set of samples demonstrates a very early | 
 | version of this extensibility that can be used to tunnel through dispatch | 
 | workloads by bypassing code generation (in the case of PTX/SPIR-V) or coarsely | 
interoperating (CPU function calls). In its current form it is intended for
things that would traditionally be custom ops in ML frameworks, and it produces
programs that are much smaller, more hermetic, retargetable, and optimizable
than traditional custom ops allow, as there's nearly zero performance delta
between what the compiler can dispatch and what the user decides to dispatch.
 |  | 
 | ## Approaches | 
 |  | 
In the fullness of time all backends will support all approaches, but there are
currently limitations and these samples only cover the supported cases:
 |  | 
 | |                    | CPU                | CUDA               | Metal | SPIR-V             | WGSL            | | 
 | |--------------------|:------------------:|:------------------:|:-----:|:------------------:|:---------------:| | 
 | | Static Functions   | :white_check_mark: | TBD                | TBD   | TBD                | TBD             | | 
 | | Dynamic Functions  | :white_check_mark: | TBD                | TBD   | :grey_question:    | :grey_question: | | 
 | | Static Dispatches  | TBD                | :white_check_mark: | TBD   | :white_check_mark: | TBD             | | 
 | | Dynamic Dispatches | TBD                | TBD                | TBD   | TBD                | TBD             | | 
 | | Commands           | TBD                | TBD                | TBD   | TBD                | TBD             | | 
 |  | 
 | ### Statically-linked Function Calls | 
 |  | 
 | **Overview**: user defines functions in .c files, compiles them with specific | 
 | settings to .o files, emits calls to the functions in IR interleaved with other | 
 | IR, and lets the IREE compiler link the objects into the final binary. | 
 |  | 
 | **Workflow**: | 
 |  | 
 | ``` | 
 | +-------------+               +---------------------+       +--------------+ | 
 | | functions.c | -> clang -+-> | functions_aarch64.o | -+    | example.mlir | | 
 | +-------------+           |   +---------------------+  |    +--------------+ | 
 |                           |   +---------------------+  |           v | 
 |                           +-> | functions_x86_64.o  | -+----> iree-compile | 
 |                               +---------------------+              v | 
 |                                                             +--------------+ | 
 |                                                    hermetic | example.vmfb | | 
 |                                                             +--------------+ | 
 | ``` | 
 |  | 
 | **Samples**: | 
 |  | 
 | * CPU: [custom_dispatch/cpu/embedded/](./cpu/embedded/) (.c -> .o) | 
 |  | 
This approach is usable both from frontends and as something that can be
synthesized by compiler transforms ("replace op X with a call to extern fX and
link in object f.o"). The referenced object files can also be generated on the
fly and embedded into the IR to allow for compile-time specialization.
 |  | 
 | This is the preferred method of extension as it preserves IREE's ability to | 
 | specialize executables, optimize aggressively across call boundaries, | 
 | hermetically deploy the custom code without runtime changes, and portably target | 
 | multiple platforms and architectures. | 
 |  | 
 | ### Dynamically-linked Function Calls | 
 |  | 
**Overview**: user defines functions in any source language with a compatible C
ABI, wires them up and links them into their runtime binary, declares the
externally available functions in IR, and emits calls to them interleaved with
other IR.
 |  | 
 | **Workflows**: | 
 |  | 
 | Statically-linked into the hosting runtime: | 
 | ``` | 
 |                      +--------------+   +--------------+ | 
 |                      | example.mlir |   | runtime srcs | -+ | 
 |                      +--------------+   +--------------+  | | 
 | +--------------+            v           +--------------+  | | 
 | | declarations | ----> iree-compile     | imports.c    | -+-> custom runtime | 
 | +--------------+            v           +--------------+            ^ | 
 |                      +--------------+                               | | 
 |                      | example.vmfb | - - - - - - - - - - - - - - - + | 
 |                      +--------------+ | 
 | ``` | 
 |  | 
 | Dynamically-linked into the hosting runtime via system libraries: | 
 | ``` | 
 | +----------+    +---------------+      +--------------+ | 
 | | plugin.c | -> | plugin.so/dll |-+    | example.mlir | | 
 | +----------+    +---------------+ |    +--------------+ | 
 |                                   |           v | 
 |                                   |      iree-compile | 
 |                                   |           v | 
 |                                   |    +--------------+ | 
 |                                   |    | example.vmfb | (non-hermetic) | 
 |                                   |    +--------------+ | 
 |                                   |           | | 
 |                                   +-----+-----+ | 
 |                                         v | 
 |                                +-----------------+ | 
 |                                | iree-run-module | | 
 |                                +-----------------+ | 
 | ``` | 
 |  | 
 | Dynamically-linked into the hosting runtime via portable embedded ELFs: | 
 |  | 
 | ``` | 
 | +----------+      +-------------------+       +--------------+ | 
 | | plugin.c | -+-> | plugin_aarch64.so | -+    | example.mlir | | 
 | +----------+  |   +-------------------+  |    +--------------+ | 
 |               |   +-------------------+  |           v | 
 |               +-> | plugin_x86_64.so  | -+      iree-compile | 
 |                   +-------------------+  |           v | 
 |                        +------------+    |    +--------------+ | 
 |                        | plugin.sos | <--+    | example.vmfb | (non-hermetic) | 
 |                        +------------+         +--------------+ | 
 |                              |                       | | 
 |                              +----------+------------+ | 
 |                                         v | 
 |                                +-----------------+ | 
 |                                | iree-run-module | | 
 |                                +-----------------+ | 
 | ``` | 
 |  | 
 | **Samples**: | 
 |  | 
 | * CPU (plugins): [custom_dispatch/cpu/plugin/](./cpu/plugin/) (.c -> .so/.sos) | 
 |  | 
Unlike the other approaches, this requires runtime device support for dynamic
linking and adds complexity for the user, who must carefully version their input
programs and runtime libraries themselves. IREE's CPU backend does provide basic
support for optional imports so that users can emit their calls with fallbacks,
but otherwise such behavior falls on the user to implement.
 |  | 
 | ### Statically-linked Dispatch Functions | 
 |  | 
 | **Overview**: user produces the final dispatch executable binary themselves, | 
 | declares it in IR, and dispatches it. | 
 |  | 
 | **Workflow**: | 
 |  | 
 | ``` | 
 | +------------+              +-------------------+       +--------------+ | 
 | | kernels.cu | -> nvcc -+-> | kernels_sm_52.ptx | -+    | example.mlir | | 
 | +------------+          |   +-------------------+  |    +--------------+ | 
 |                         |   +-------------------+  |           v | 
 |                         +-> | kernels_sm_80.ptx | -+----> iree-compile | 
 |                             +-------------------+              v | 
 |                                                         +--------------+ | 
 |                                                hermetic | example.vmfb | | 
 |                                                         +--------------+ | 
 | ``` | 
 |  | 
 | **Samples**: | 
 |  | 
 | * CUDA/PTX: [custom_dispatch/cuda/kernels/](./cuda/kernels/) (.cu -> .ptx) | 
 | * Vulkan/SPIR-V: [custom_dispatch/vulkan/shaders/](./vulkan/shaders/) (.glsl -> .spv) | 
 |  | 
Here IREE is used to schedule the work and ensure that buffers and parameters
are passed to the dispatch function, but otherwise treats the function as
opaque. This disables some IREE optimizations, but the functions can still be
scheduled concurrently and with parallelization. Many custom kernels can be
implemented this way instead of requiring much heavier-weight runtime custom
calls that would prevent the asynchronous scheduling IREE uses.
 |  | 
 | The dispatch functions are embedded into the IREE compiler outputs such that no | 
 | runtime changes are required and compiled programs are hermetic. Multi-targeting | 
 | is enabled by allowing the user to provide binaries for the target devices and | 
 | architectures they are compiling for. | 
 |  | 
 | ### Dynamically-linked Dispatch Functions | 
 |  | 
**Overview**: user defines functions in a target-specific source language,
compiles them into target-specific libraries, wires them up and links them into
their runtime binary, declares the externally available dispatch executable in
IR, and emits dispatches of it interleaved with other IR.
 |  | 
 | **Workflow**: | 
 |  | 
 | ``` | 
 |                      +--------------+   +--------------+ | 
 |                      | example.mlir |   | runtime srcs | -+ | 
 |                      +--------------+   +--------------+  | | 
 | +--------------+            v           +--------------+  | | 
 | | declarations | ----> iree-compile     | dispatches.c | -+-> custom runtime | 
 | +--------------+            v           +--------------+            ^ | 
 |                      +--------------+                               | | 
 |                      | example.vmfb | - - - - - - - - - - - - - - - + | 
 |                      +--------------+ | 
 | ``` | 
 |  | 
 | **Samples**: plumbing required. | 
 |  | 
As with statically-linked dispatch functions, IREE does the scheduling and
manages resources while deferring the entire dispatch logic to
externally-sourced executables. In contrast to the static linking approach, the
compiler emits references to runtime-provided target-specific executables that
must be built into the runtime. This means deployment gets more complicated:
compiler-produced outputs are no longer hermetic and users must handle
versioning and platform constraints themselves.
 |  | 
 | ### Custom Commands | 
 |  | 
 | **Overview**: user writes custom command buffer operations in VM modules, | 
 | declares them in IR, dispatches them, and then links their custom modules into | 
 | the runtime. | 
 |  | 
 | **Workflow**: | 
 |  | 
 | ``` | 
 |                      +--------------+   +--------------+ | 
 |                      | example.mlir |   | runtime srcs | -+ | 
 |                      +--------------+   +--------------+  | | 
 | +--------------+            v           +--------------+  | | 
 | | declarations | ----> iree-compile     | module.c     | -+-> custom runtime | 
 | +--------------+            v           +--------------+            ^ | 
 |                      +--------------+                               | | 
 |                      | example.vmfb | - - - - - - - - - - - - - - - + | 
 |                      +--------------+ | 
 | ``` | 
 |  | 
 | **Samples**: plumbing required. | 
 |  | 
Though more work than the other approaches, this allows for a host-side call
that
 | can produce new transfer or execution commands. The compiler effectively treats | 
 | these calls as if they were custom versions of `hal.command_buffer.dispatch`/ | 
 | `iree_hal_command_buffer_dispatch` and the runtime custom module receives the | 
 | device, command buffer, and push constants/bindings. Portable modules can | 
 | use the HAL APIs to schedule more commands (multiple dispatches, transfers, | 
 | collectives, etc) while backend-specific ones can crack open the HAL objects and | 
 | get the internals with all the normal API stability caveats. | 
 |  | 
 | An example use case would be calling a CUDA library that takes a `CUstream` as | 
an argument. The user would define their custom module with a call that uses
`iree/hal/drivers/cuda/api.h` to cast the command buffer to a stream (using
graph capture when the command buffer is constructing a graph), make the call,
 | and then return. This only works if the library is designed for asynchronous and | 
 | deferred execution - if the library makes large allocations, schedules blocking | 
 | operations, or has side-effecting behavior then full custom modules must be | 
 | used (see the [custom_module/async/](/samples/custom_module/async/) sample). | 
 |  | 
Whenever possible the dispatch function call or dispatch function substitution
approaches should be used instead; they are sufficient for most workloads that
don't involve other libraries.
 |  | 
 | ### Compile Time Inlining Custom Function Calls | 
 |  | 
**Overview**: user defines functions with MLIR dialects IREE is able to ingest,
paired with a matcher and replacement pattern. The matcher runs as a
preprocessing step and calls into the replacement pattern for all successful
matches. The replacement pattern then imports the externally defined function,
declares it in IR, and emits calls to it interleaved with other IR.
 |  | 
 | **Workflows**: | 
 |  | 
Statically matched and imported external functions:
 | ``` | 
 |                                             +--------------+ | 
 |                                             | example.mlir | | 
 | +--------------+       +--------------+     +--------------+ | 
 | | (one of the  |       | functions +  |            v | 
 | | above static | ----> | matchers +   | ----> iree-compile | 
 | | workflows)   |       | replace.mlir |            v | 
 | +--------------+       +--------------+     +--------------+ | 
 |                                             | example.vmfb | | 
 |                                             +--------------+ | 
 | ``` | 
 |  | 
 | **Samples**: | 
 |  | 
* CPU: [custom_dispatch/cpu/embedded/](./cpu/embedded/) (.c -> .o)
  * [custom_dispatch/cpu/embedded/example_transform_spec.mlir](./cpu/embedded/example_transform_spec.mlir) (.mlir)
* Vulkan/SPIR-V: [custom_dispatch/vulkan/shaders/](./vulkan/shaders/) (.glsl -> .spv)
  * [custom_dispatch/vulkan/shaders/example_transform_spec.mlir](./vulkan/shaders/example_transform_spec.mlir) (.mlir)
 |  | 
 | The above two samples build on top of a couple of the static workflows shown | 
 | above, but should work with any of the other approaches. The idea is to separate | 
 | the custom kernel from the target module to be compiled, allowing integration of | 
 | custom dispatches with default IREE codegen without the need to build a custom | 
 | set of compiler tools around IREE to generate the necessary IR. | 
 |  | 
 | There are a number of possible points at which the match and replace can happen; | 
 | the above shows it after import + input conversion. Other plugin points are | 
possible (e.g. before input conversion or after global optimization), but
currently lack some ergonomics in the available matchers.
 |  | 
 | ### Others | 
 |  | 
 | Most other situations are covered by [custom modules](/samples/custom_module/). | 
 | These can still be asynchronous and scheduled in device order but incur | 
 | additional overheads and deployment complexity as custom runtimes are required. |