
Custom Dispatches

“Dispatches” in IREE are parallelized device function calls where “device” may be a CPU task system, a GPU, or a dedicated accelerator. Parallelism is a first-class concept in this model and plugging custom dispatches into IREE requires reasoning about these function calls as if they were parallelized and dispatched across a 3D grid (as on GPUs or CPU task systems). Note that a degenerate case of grid dispatch is a grid size of 1x1x1 which turns the dispatches into simple (if inefficient) device-side function calls.
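
The grid model described above can be sketched in plain C. This is only an illustration: `grid_id_t`, `dispatch_fn_t`, and `dispatch_grid` are hypothetical names, not IREE APIs, and the loop runs workgroups serially where a real device would run them in parallel and in no guaranteed order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical workgroup identifier (or grid size) in a 3D dispatch grid.
typedef struct {
  uint32_t x, y, z;
} grid_id_t;

// Hypothetical dispatch function type: invoked once per workgroup.
typedef void (*dispatch_fn_t)(grid_id_t id, grid_id_t count, void* user_data);

// Walks the whole grid and calls fn once per workgroup. The serial loop is
// an assumption for illustration; fn must not rely on iteration order.
void dispatch_grid(dispatch_fn_t fn, grid_id_t count, void* user_data) {
  assert(fn != NULL);
  for (uint32_t z = 0; z < count.z; ++z) {
    for (uint32_t y = 0; y < count.y; ++y) {
      for (uint32_t x = 0; x < count.x; ++x) {
        grid_id_t id = {x, y, z};
        fn(id, count, user_data);
      }
    }
  }
}

// Example workgroup body: counts how many workgroups were invoked.
void count_invocations(grid_id_t id, grid_id_t count, void* user_data) {
  (void)id;
  (void)count;
  ++*(uint32_t*)user_data;
}
```

With a grid of 1x1x1 the loops run exactly once, which is the degenerate "simple function call" case mentioned above.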

In normal workflows the IREE compiler forms the dispatch functions by way of fusion and then uses code generation (“codegen”) to translate the dispatch functions into backend-specific forms like PTX, SPIR-V, or LLVM IR. It's possible to augment this translation by either bypassing these steps entirely and providing the already translated representation of the dispatch function or extending the code generation portion by calling out to external functions from within the generated dispatch (sometimes “microkernels” or “device libraries”).

There's ongoing work across the core IREE compiler and specific backends to enable more extension points and ways to connect to them from frontends or compiler transforms. The current set of samples demonstrates a very early version of this extensibility that can be used to tunnel through dispatch workloads by bypassing code generation (in the case of PTX/SPIR-V) or coarsely interoperating (CPU function calls). In its current form it is intended for things that would traditionally be custom ops in ML frameworks. It produces much smaller, hermetic, retargetable, and optimizable programs than traditional custom ops can, as there's nearly zero performance delta between what the compiler can dispatch and what the user decides to dispatch.


In the fullness of time all backends will support all approaches but currently there are limitations and these samples only cover the supported cases:

| Static Functions | :white_check_mark: | TBD | TBD | TBD | TBD |
| Dynamic Functions | :white_check_mark: | TBD | TBD | :grey_question: | :grey_question: |
| Static Dispatches | TBD | :white_check_mark: | TBD | :white_check_mark: | TBD |

Statically-linked Function Calls

Overview: user defines functions in .c files, compiles them with specific settings to .o files, emits calls to the functions in IR interleaved with other IR, and lets the IREE compiler link the objects into the final binary.


+-------------+               +---------------------+       +--------------+
| functions.c | -> clang -+-> | functions_aarch64.o | -+    | example.mlir |
+-------------+           |   +---------------------+  |    +--------------+
                          |   +---------------------+  |           v
                          +-> | functions_x86_64.o  | -+----> iree-compile
                              +---------------------+              v
                                                            +--------------+
                                                   hermetic | example.vmfb |
                                                            +--------------+


This approach is usable both from frontends as well as something that can be synthesized by compiler transforms (“replace op X with call to extern fX and link in object f.o”) and the object files referenced can be generated on the fly and embedded into the IR to allow for compile-time specialization.

This is the preferred method of extension as it preserves IREE's ability to specialize executables, optimize aggressively across call boundaries, hermetically deploy the custom code without runtime changes, and portably target multiple platforms and architectures.
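
As a rough illustration of what a function in `functions.c` might look like, the sketch below defines an elementwise multiply. The parameter struct and the function signature are assumptions made for this example; the actual calling convention is whatever the calls emitted in IR expect and is not specified here.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical parameter block passed to the function. The real layout
// depends on how the call is emitted in IR and is an assumption here.
typedef struct {
  const float* lhs;
  const float* rhs;
  float* out;
  size_t count;  // number of elements handled by this call
} simple_mul_params_t;

// Multiplies lhs and rhs elementwise into out.
void simple_mul(const simple_mul_params_t* params) {
  assert(params != NULL);
  for (size_t i = 0; i < params->count; ++i) {
    params->out[i] = params->lhs[i] * params->rhs[i];
  }
}
```

The same source would be compiled once per target architecture (the `functions_aarch64.o` and `functions_x86_64.o` objects in the diagram above) so that the compiler can link the matching object into the final binary.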

Dynamically-linked Function Calls

Overview: user defines functions in any source language with a compatible C ABI, wires them up and links them in their runtime binary, declares the externally available functions in IR, and emits calls to the functions in IR interleaved with other IR.


Statically-linked into the hosting runtime:

                     +--------------+   +--------------+
                     | example.mlir |   | runtime srcs | -+
                     +--------------+   +--------------+  |
+--------------+            v           +--------------+  |
| declarations | ----> iree-compile     | imports.c    | -+-> custom runtime
+--------------+            v           +--------------+            ^
                     +--------------+                               |
                     | example.vmfb | - - - - - - - - - - - - - - - +

Dynamically-linked into the hosting runtime via system libraries:

+----------+    +---------------+      +--------------+
| plugin.c | -> |               |-+    | example.mlir |
+----------+    +---------------+ |    +--------------+
                                  |           v
                                  |      iree-compile
                                  |           v
                                  |    +--------------+
                                  |    | example.vmfb | (non-hermetic)
                                  |    +--------------+
                                  |           |
                                  +-----+-----+
                                        v
                               +-----------------+
                               | iree-run-module |
                               +-----------------+

Dynamically-linked into the hosting runtime via portable embedded ELFs:

+----------+      +-------------------+       +--------------+
| plugin.c | -+-> |                   | -+    | example.mlir |
+----------+  |   +-------------------+  |    +--------------+
              |   +-------------------+  |           v
              +-> |                   | -+      iree-compile
                  +-------------------+  |           v
                       +------------+    |    +--------------+
                       | plugin.sos | <--+    | example.vmfb | (non-hermetic)
                       +------------+         +--------------+
                             |                       |
                             +-----------+-----------+
                                         v
                                +-----------------+
                                | iree-run-module |
                                +-----------------+


Unlike the other approaches, this requires runtime device support for dynamic linking and adds complexity for users, who must version their input programs and their runtime libraries themselves. IREE's CPU backend does provide basic support for optional imports such that users can emit their calls and add fallbacks, but otherwise such behavior falls on the user to implement.
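
The optional-import-with-fallback idea can be sketched as a lookup in a table of runtime-provided functions. Everything here (`import_entry_t`, `resolve_import`, `call_optional`, and the two example functions) is hypothetical and only illustrates the pattern; it is not how IREE's CPU backend resolves imports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical signature shared by all imports in this sketch.
typedef int (*import_fn_t)(int);

// One function exported by the hosting runtime or a plugin.
typedef struct {
  const char* name;
  import_fn_t fn;
} import_entry_t;

// Returns the import named `name`, or NULL if the runtime lacks it.
import_fn_t resolve_import(const import_entry_t* table, size_t count,
                           const char* name) {
  for (size_t i = 0; i < count; ++i) {
    if (strcmp(table[i].name, name) == 0) return table[i].fn;
  }
  return NULL;
}

// Example plugin-provided implementation and a built-in fallback.
int plugin_square(int x) { return x * x; }
int fallback_square(int x) { return x * x; }

// Calls the import if present, otherwise the fallback. Sets
// *used_fallback to 1 when the fallback path was taken.
int call_optional(const import_entry_t* table, size_t count, const char* name,
                  import_fn_t fallback, int arg, int* used_fallback) {
  assert(fallback != NULL && used_fallback != NULL);
  import_fn_t fn = resolve_import(table, count, name);
  *used_fallback = (fn == NULL);
  return (fn != NULL ? fn : fallback)(arg);
}
```

A program built this way still runs on a runtime that lacks the plugin, which is one way to soften the versioning burden described above.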

Statically-linked Dispatch Functions

Overview: user produces the final dispatch executable binary themselves, declares it in IR, and dispatches it.


+------------+              +-------------------+       +--------------+
|            | -> nvcc -+-> | kernels_sm_52.ptx | -+    | example.mlir |
+------------+          |   +-------------------+  |    +--------------+
                        |   +-------------------+  |           v
                        +-> | kernels_sm_80.ptx | -+----> iree-compile
                            +-------------------+              v
                                                        +--------------+
                                               hermetic | example.vmfb |
                                                        +--------------+


Here IREE is used for scheduling the work and ensuring that buffers and parameters are passed to the dispatch function but otherwise treats them as opaque. This disables some IREE optimizations but the functions are still able to be scheduled concurrently and with parallelization. Many custom kernels can often be implemented like this instead of needing much heavier-weight runtime custom calls that prevent the asynchronous scheduling that IREE uses.

The dispatch functions are embedded into the IREE compiler outputs such that no runtime changes are required and compiled programs are hermetic. Multi-targeting is enabled by allowing the user to provide binaries for the target devices and architectures they are compiling for.
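
To make the multi-targeting idea concrete, the sketch below picks one of several user-provided kernel binaries for a given device architecture. The selection rule (highest architecture that does not exceed the device's) and all names are assumptions for illustration; the source does not describe how IREE chooses among the provided binaries.

```c
#include <assert.h>
#include <stddef.h>

// One user-provided kernel binary, e.g. {52, "kernels_sm_52.ptx"}.
typedef struct {
  int sm_arch;            // architecture the binary was built for
  const char* file_name;  // file provided by the user
} kernel_object_t;

// Returns the object with the highest sm_arch that is <= device_arch,
// or NULL if none is compatible. This rule is an assumed policy.
const kernel_object_t* select_kernel_object(const kernel_object_t* objects,
                                            size_t count, int device_arch) {
  assert(objects != NULL || count == 0);
  const kernel_object_t* best = NULL;
  for (size_t i = 0; i < count; ++i) {
    if (objects[i].sm_arch > device_arch) continue;  // too new for device
    if (best == NULL || objects[i].sm_arch > best->sm_arch) {
      best = &objects[i];
    }
  }
  return best;
}
```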

Dynamically-linked Dispatch Functions

Overview: user defines functions in a target-specific source language, compiles them into target-specific libraries, wires them up and links them in their runtime binary, declares the externally available dispatch executable in IR, and emits calls to the functions in IR interleaved with other IR.


                     +--------------+   +--------------+
                     | example.mlir |   | runtime srcs | -+
                     +--------------+   +--------------+  |
+--------------+            v           +--------------+  |
| declarations | ----> iree-compile     | dispatches.c | -+-> custom runtime
+--------------+            v           +--------------+            ^
                     +--------------+                               |
                     | example.vmfb | - - - - - - - - - - - - - - - +

Samples: plumbing required.

Similar to statically-linked dispatch functions IREE is doing the scheduling and managing resources but deferring the entire dispatch logic to externally-sourced executables. In contrast to the static linking approach this has the compiler emit references to runtime-provided target-specific executables that must be built into the runtime. This means that deployment gets more complicated as compiler-produced outputs are no longer hermetic and users must handle versioning and platform constraints themselves.

Custom Commands

Overview: user writes custom command buffer operations in VM modules, declares them in IR, dispatches them, and then links their custom modules into the runtime.


                     +--------------+   +--------------+
                     | example.mlir |   | runtime srcs | -+
                     +--------------+   +--------------+  |
+--------------+            v           +--------------+  |
| declarations | ----> iree-compile     | module.c     | -+-> custom runtime
+--------------+            v           +--------------+            ^
                     +--------------+                               |
                     | example.vmfb | - - - - - - - - - - - - - - - +

Samples: plumbing required.

Though more work than the other approaches, this allows for a host-side call that can produce new transfer or execution commands. The compiler effectively treats these calls as if they were custom versions of hal.command_buffer.dispatch/iree_hal_command_buffer_dispatch, and the runtime custom module receives the device, command buffer, and push constants/bindings. Portable modules can use the HAL APIs to schedule more commands (multiple dispatches, transfers, collectives, etc.), while backend-specific ones can crack open the HAL objects and get the internals, with all the normal API stability caveats.

An example use case would be calling a CUDA library that takes a CUstream as an argument. The user would define their custom module with a call that uses iree/hal/drivers/cuda/api.h to cast the command buffer to a stream (using graph capture when the command buffer is constructing a graph), make the call, and then return. This only works if the library is designed for asynchronous and deferred execution: if the library makes large allocations, schedules blocking operations, or has side-effecting behavior, then full custom modules must be used (see the custom_module/async/ sample).
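
The "record now, execute later" contract that custom commands must respect can be illustrated with a mock command buffer. The types and functions below are hypothetical stand-ins and deliberately avoid the real HAL or CUDA APIs; the point is only that the custom command appends work and returns without executing or blocking.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_COMMANDS 16

typedef enum { MOCK_CMD_COPY, MOCK_CMD_DISPATCH } mock_cmd_kind_t;

typedef struct {
  mock_cmd_kind_t kind;
  int binding;  // index of the buffer binding the command touches
} mock_cmd_t;

// Mock command buffer: a fixed-size list of recorded commands.
typedef struct {
  mock_cmd_t commands[MOCK_MAX_COMMANDS];
  size_t count;
} mock_command_buffer_t;

// Appends one command; returns 0 on success or -1 if the buffer is full.
int mock_record(mock_command_buffer_t* cb, mock_cmd_kind_t kind, int binding) {
  assert(cb != NULL);
  if (cb->count >= MOCK_MAX_COMMANDS) return -1;
  cb->commands[cb->count].kind = kind;
  cb->commands[cb->count].binding = binding;
  ++cb->count;
  return 0;
}

// Hypothetical custom command: expands into one copy and two dispatches.
// It only records work; nothing executes or blocks here.
int custom_command(mock_command_buffer_t* cb, int binding) {
  if (mock_record(cb, MOCK_CMD_COPY, binding) != 0) return -1;
  if (mock_record(cb, MOCK_CMD_DISPATCH, binding) != 0) return -1;
  return mock_record(cb, MOCK_CMD_DISPATCH, binding);
}
```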

Whenever possible, the dispatch function call or dispatch function substitution approaches should be used instead; they are sufficient for most workloads that do not involve other libraries.

Compile Time Inlining Custom Function Calls

Overview: user defines functions with MLIR dialects IREE is able to ingest, paired with a matcher and replacement pattern. The matcher runs as a preprocessing step and calls into the replacement pattern for all successful matches. The replacement pattern imports an externally defined function, wires it up and links it into the runtime binary, declares the externally available functions in IR, and emits calls to the functions in IR interleaved with other IR.


Statically matched and imported external functions

                                            | example.mlir |
+--------------+       +--------------+     +--------------+
| (one of the  |       | functions +  |            v
| above static | ----> | matchers +   | ----> iree-compile
| workflows)   |       | replace.mlir |            v
+--------------+       +--------------+     +--------------+
                                            | example.vmfb |


The above two samples build on top of a couple of the static workflows shown above, but should work with any of the other approaches. The idea is to separate the custom kernel from the target module to be compiled, allowing integration of custom dispatches with default IREE codegen without the need to build a custom set of compiler tools around IREE to generate the necessary IR.

There are a number of possible points at which the match and replace can happen; the above shows it after import + input conversion. Other plugin points are possible (e.g. before input conversion or after global optimization), but currently are missing some ergonomics on the available matchers.
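
The match-and-replace flow can be sketched on a toy list of operations. This is a C stand-in for what would really be expressed in MLIR; `toy_op_t`, the op names, and the callee name are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Toy operation: a name plus, once replaced, the external callee.
typedef struct {
  const char* name;    // e.g. "toy.mul"
  const char* callee;  // NULL until replaced by a call
} toy_op_t;

// Matcher: returns 1 if op is an unreplaced instance of `target`.
int matches(const toy_op_t* op, const char* target) {
  return op->callee == NULL && strcmp(op->name, target) == 0;
}

// Replacement pattern: rewrites op into a call to the external function.
void replace_with_call(toy_op_t* op, const char* callee) {
  op->name = "toy.call";
  op->callee = callee;
}

// Runs the matcher over all ops and applies the replacement to each
// successful match. Returns the number of ops replaced.
size_t match_and_replace(toy_op_t* ops, size_t count, const char* target,
                         const char* callee) {
  assert(ops != NULL || count == 0);
  size_t replaced = 0;
  for (size_t i = 0; i < count; ++i) {
    if (matches(&ops[i], target)) {
      replace_with_call(&ops[i], callee);
      ++replaced;
    }
  }
  return replaced;
}
```

Keeping the matcher and the replacement separate mirrors the split described above: the custom kernel and its matching rules live apart from the module being compiled.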


Most other situations are covered by custom modules. These can still be asynchronous and scheduled in device order but incur additional overheads and deployment complexity as custom runtimes are required.