# Custom Dispatches
"Dispatches" in IREE are parallelized device function calls where "device" may
be a CPU task system, a GPU, or a dedicated accelerator. Parallelism is a
first-class concept in this model, and plugging custom dispatches into IREE
requires reasoning about these function calls as if they were parallelized and
dispatched across a 3D grid (as on GPUs or CPU task systems). Note that the
degenerate case of a 1x1x1 grid turns a dispatch into a simple (if inefficient)
device-side function call.
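
For intuition, here is a minimal C sketch of the grid semantics (illustrative
only, not an IREE API): the runtime behaves as if it ran this loop, with every
workgroup invocation free to execute concurrently and in any order.

```c
// Illustrative only (not an IREE API): a dispatch behaves as if the device
// ran this loop, with the workgroup invocations free to run in parallel.
typedef void (*workgroup_fn_t)(int x, int y, int z, void* bindings);

static void dispatch_grid(workgroup_fn_t workgroup_fn, void* bindings,
                          int count_x, int count_y, int count_z) {
  for (int z = 0; z < count_z; ++z)
    for (int y = 0; y < count_y; ++y)
      for (int x = 0; x < count_x; ++x)
        workgroup_fn(x, y, z, bindings);  // conceptually parallel
}
// A 1x1x1 grid reduces to a single plain function call.
```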
In normal workflows the IREE compiler forms the dispatch functions by way of
fusion and then uses code generation ("codegen") to translate the dispatch
functions into backend-specific forms like PTX, SPIR-V, or LLVM IR. It's
possible to augment this translation either by bypassing these steps entirely
and providing an already-translated representation of the dispatch function or
by extending the code generation portion to call out to external functions from
within the generated dispatch (sometimes called "microkernels" or "device
libraries").
There's ongoing work across the core IREE compiler and specific backends to
enable more extension points and ways to connect to them from frontends or
compiler transforms. This current set of samples demonstrates a very early
version of this extensibility that can be used to tunnel through dispatch
workloads by bypassing code generation (in the case of PTX/SPIR-V) or coarsely
interoperating (CPU function calls). In its current form it is intended for
things that would traditionally be custom ops in ML frameworks, and it produces
programs that are smaller, hermetic, retargetable, and more optimizable than
traditional custom ops allow, as there's nearly zero performance delta between
what the compiler dispatches and what the user decides to dispatch.
## Approaches
In the fullness of time all backends will support all approaches, but there are
currently limitations and these samples cover only the supported cases:

| | CPU | CUDA | Metal | SPIR-V | WGSL |
|--------------------|:------------------:|:------------------:|:-----:|:------------------:|:---------------:|
| Static Functions | :white_check_mark: | TBD | TBD | TBD | TBD |
| Dynamic Functions | :white_check_mark: | TBD | TBD | :grey_question: | :grey_question: |
| Static Dispatches | TBD | :white_check_mark: | TBD | :white_check_mark: | TBD |
| Dynamic Dispatches | TBD | TBD | TBD | TBD | TBD |
| Commands | TBD | TBD | TBD | TBD | TBD |
### Statically-linked Function Calls
**Overview**: user defines functions in .c files, compiles them with specific
settings to .o files, emits calls to the functions in IR interleaved with other
IR, and lets the IREE compiler link the objects into the final binary.
**Workflow**:
```
+-------------+ +---------------------+ +--------------+
| functions.c | -> clang -+-> | functions_aarch64.o | -+ | example.mlir |
+-------------+ | +---------------------+ | +--------------+
| +---------------------+ | v
+-> | functions_x86_64.o | -+----> iree-compile
+---------------------+ v
+--------------+
hermetic | example.vmfb |
+--------------+
```
**Samples**:
* CPU: [custom_dispatch/cpu/embedded/](./cpu/embedded/) (.c -> .o)
This approach is usable both from frontends and by compiler transforms that
synthesize the calls ("replace op X with a call to extern fX and link in object
f.o"), and the referenced object files can be generated on the fly and embedded
into the IR to allow for compile-time specialization.
This is the preferred method of extension as it preserves IREE's ability to
specialize executables, optimize aggressively across call boundaries,
hermetically deploy the custom code without runtime changes, and portably target
multiple platforms and architectures.
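
As a concrete sketch, the exported function in such a `functions.c` might look
like the following; the function name, parameter convention, and clang flags
here are illustrative rather than the sample's exact ABI:

```c
// functions.c: a hypothetical element-wise multiply "microkernel".
// Compiled freestanding per target architecture so the object can be linked
// into the hermetic executable, e.g.:
//   clang -target x86_64-unknown-unknown-eabi-elf -ffreestanding -O3 \
//         -c functions.c -o functions_x86_64.o
#include <stddef.h>

// Called from the compiler-generated dispatch with one workgroup's slice of
// the problem; the parameter convention is defined by the IR that declares
// and calls this function (illustrative here).
void simple_mul_workgroup(const float* restrict lhs,
                          const float* restrict rhs,
                          float* restrict out,
                          size_t offset, size_t size) {
  for (size_t i = offset; i < offset + size; ++i) {
    out[i] = lhs[i] * rhs[i];
  }
}
```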
### Dynamically-linked Function Calls
**Overview**: user defines functions in any source language with a compatible C
ABI, wires them up and links them into their runtime binary, declares the
externally available functions in IR, and emits calls to them interleaved with
other IR.
**Workflows**:
Statically-linked into the hosting runtime:
```
+--------------+ +--------------+
| example.mlir | | runtime srcs | -+
+--------------+ +--------------+ |
+--------------+ v +--------------+ |
| declarations | ----> iree-compile | imports.c | -+-> custom runtime
+--------------+ v +--------------+ ^
+--------------+ |
| example.vmfb | - - - - - - - - - - - - - - - +
+--------------+
```
Dynamically-linked into the hosting runtime via system libraries:
```
+----------+ +---------------+ +--------------+
| plugin.c | -> | plugin.so/dll |-+ | example.mlir |
+----------+ +---------------+ | +--------------+
| v
| iree-compile
| v
| +--------------+
| | example.vmfb | (non-hermetic)
| +--------------+
| |
+-----+-----+
v
+-----------------+
| iree-run-module |
+-----------------+
```
Dynamically-linked into the hosting runtime via portable embedded ELFs:
```
+----------+ +-------------------+ +--------------+
| plugin.c | -+-> | plugin_aarch64.so | -+ | example.mlir |
+----------+ | +-------------------+ | +--------------+
| +-------------------+ | v
+-> | plugin_x86_64.so | -+ iree-compile
+-------------------+ | v
+------------+ | +--------------+
| plugin.sos | <--+ | example.vmfb | (non-hermetic)
+------------+ +--------------+
| |
+----------+------------+
v
+-----------------+
| iree-run-module |
+-----------------+
```
**Samples**:
* CPU (plugins): [custom_dispatch/cpu/plugin/](./cpu/plugin/) (.c -> .so/.sos)
Unlike the other approaches this requires runtime device support for dynamic
linking and introduces complexity for users, who must be careful to version
their input programs and runtime libraries themselves. IREE's CPU backend does
provide basic support for optional imports such that users can emit their calls
with fallbacks, but otherwise such behavior falls on the user to implement.
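
As a sketch, a plugin-provided import can look like the following; the
params-struct marshaling and `int(void*, void*, void*)` convention follow the
cpu/plugin sample but are abbreviated here, and the boilerplate that registers
the symbol with the runtime is omitted:

```c
// plugin.c: a hypothetical import resolved at runtime by a plugin. The
// mechanism packs each call's operands into a single params struct; this
// layout and calling convention are illustrative, not a stable ABI reference.
#include <stddef.h>

typedef struct {
  const float* restrict binding0;
  const float* restrict binding1;
  float* restrict binding2;
  size_t size;
} simple_mul_workgroup_params_t;

// Returns 0 on success; non-zero values surface as dispatch failures.
// Registration with the plugin's export table is omitted here.
static int simple_mul_workgroup(void* params_ptr, void* context,
                                void* reserved) {
  simple_mul_workgroup_params_t* params =
      (simple_mul_workgroup_params_t*)params_ptr;
  for (size_t i = 0; i < params->size; ++i) {
    params->binding2[i] = params->binding0[i] * params->binding1[i];
  }
  return 0;
}
```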
### Statically-linked Dispatch Functions
**Overview**: user produces the final dispatch executable binary themselves,
declares it in IR, and dispatches it.
**Workflow**:
```
+------------+ +-------------------+ +--------------+
| kernels.cu | -> nvcc -+-> | kernels_sm_52.ptx | -+ | example.mlir |
+------------+ | +-------------------+ | +--------------+
| +-------------------+ | v
+-> | kernels_sm_80.ptx | -+----> iree-compile
+-------------------+ v
+--------------+
hermetic | example.vmfb |
+--------------+
```
**Samples**:
* CUDA/PTX: [custom_dispatch/cuda/kernels/](./cuda/kernels/) (.cu -> .ptx)
* Vulkan/SPIR-V: [custom_dispatch/vulkan/shaders/](./vulkan/shaders/) (.glsl -> .spv)
Here IREE schedules the work and ensures that buffers and parameters are passed
to the dispatch function but otherwise treats it as opaque. This disables some
IREE optimizations but the functions can still be scheduled concurrently and in
parallel. Many custom kernels can be implemented this way instead of requiring
much heavier-weight runtime custom calls that defeat the asynchronous
scheduling IREE uses.
The dispatch functions are embedded into the IREE compiler outputs such that no
runtime changes are required and compiled programs are hermetic. Multi-targeting
is enabled by allowing the user to provide binaries for the target devices and
architectures they are compiling for.
### Dynamically-linked Dispatch Functions
**Overview**: user defines functions in a target-specific source language,
compiles them into target-specific libraries, wires them up and links them into
their runtime binary, declares the externally available dispatch executable in
IR, and emits dispatches to it interleaved with other IR.
**Workflow**:
```
+--------------+ +--------------+
| example.mlir | | runtime srcs | -+
+--------------+ +--------------+ |
+--------------+ v +--------------+ |
| declarations | ----> iree-compile | dispatches.c | -+-> custom runtime
+--------------+ v +--------------+ ^
+--------------+ |
| example.vmfb | - - - - - - - - - - - - - - - +
+--------------+
```
**Samples**: plumbing required.
As with statically-linked dispatch functions, IREE does the scheduling and
manages resources but defers the entire dispatch logic to externally-sourced
executables. In contrast to the static linking approach, the compiler emits
references to runtime-provided target-specific executables that must be built
into the runtime. This makes deployment more complicated, as compiler-produced
outputs are no longer hermetic and users must handle versioning and platform
constraints themselves.
### Custom Commands
**Overview**: user writes custom command buffer operations in VM modules,
declares them in IR, dispatches them, and then links their custom modules into
the runtime.
**Workflow**:
```
+--------------+ +--------------+
| example.mlir | | runtime srcs | -+
+--------------+ +--------------+ |
+--------------+ v +--------------+ |
| declarations | ----> iree-compile | module.c | -+-> custom runtime
+--------------+ v +--------------+ ^
+--------------+ |
| example.vmfb | - - - - - - - - - - - - - - - +
+--------------+
```
**Samples**: plumbing required.
Though more work than the other approaches, this allows for a host-side call
that can produce new transfer or execution commands. The compiler effectively
treats these calls as if they were custom versions of
`hal.command_buffer.dispatch`/`iree_hal_command_buffer_dispatch` and the
runtime custom module receives the device, command buffer, and push
constants/bindings. Portable modules can use the HAL APIs to schedule more
commands (multiple dispatches, transfers, collectives, etc.) while
backend-specific ones can crack open the HAL objects and get the internals with
all the normal API stability caveats.
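
A sketch of the shape of such a command implementation follows; the function
name and marshaling are hypothetical, and while
`iree_hal_command_buffer_copy_buffer` is a real HAL API its exact signature has
varied across IREE versions:

```c
// module.c: a hypothetical custom command. The compiler routes the call here
// with the command buffer being recorded; the module appends commands to it
// and returns without blocking. Module/export registration is omitted.
#include "iree/hal/api.h"  // public HAL C API

// Records one extra transfer command into the caller's command buffer.
// Marshaling is illustrative; the copy call is abbreviated and its exact
// parameters vary across IREE versions.
static iree_status_t my_module_record_copy(
    iree_hal_command_buffer_t* command_buffer, iree_hal_buffer_t* source,
    iree_hal_buffer_t* target, iree_device_size_t length) {
  return iree_hal_command_buffer_copy_buffer(
      command_buffer, source, /*source_offset=*/0,
      target, /*target_offset=*/0, length);
}
```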
An example use case would be calling a CUDA library that takes a `CUstream` as
an argument. The user would define their custom module with a call that uses
the APIs in `iree/hal/drivers/cuda/api.h` to cast the command buffer to a
stream (using graph capture when the command buffer is constructing a graph),
make the call, and then return. This only works if the library is designed for
asynchronous and deferred execution; if the library makes large allocations,
schedules blocking operations, or has side-effecting behavior then full custom
modules must be used (see the
[custom_module/async/](/samples/custom_module/async/) sample).
Whenever possible the dispatch function call or dispatch function substitution
approaches should be used instead; they are sufficient for most workloads that
don't involve other libraries.
### Compile-Time Inlining of Custom Function Calls
**Overview**: user defines functions in MLIR dialects IREE is able to ingest,
paired with a matcher and a replacement pattern. The matcher runs as a
preprocessing step and invokes the replacement pattern for each successful
match. The replacement pattern declares and imports the externally defined
function (using one of the approaches above) and replaces the matched ops with
calls to it interleaved with the other IR.
**Workflows**:
Statically matched and imported external functions:
```
+--------------+
| example.mlir |
+--------------+ +--------------+ +--------------+
| (one of the | | functions + | v
| above static | ----> | matchers + | ----> iree-compile
| workflows) | | replace.mlir | v
+--------------+ +--------------+ +--------------+
| example.vmfb |
+--------------+
```
**Samples**:
* CPU: [custom_dispatch/cpu/embedded/](./cpu/embedded/) (.c -> .o)
* [custom_dispatch/cpu/embedded/example_transform_spec.mlir](./cpu/embedded/example_transform_spec.mlir) (.mlir)
* Vulkan/SPIR-V: [custom_dispatch/vulkan/shaders/](./vulkan/shaders/) (.glsl -> .spv)
* [custom_dispatch/vulkan/shaders/example_transform_spec.mlir](./vulkan/shaders/example_transform_spec.mlir) (.mlir)
The above two samples build on top of a couple of the static workflows shown
above, but should work with any of the other approaches. The idea is to separate
the custom kernel from the target module to be compiled, allowing integration of
custom dispatches with default IREE codegen without the need to build a custom
set of compiler tools around IREE to generate the necessary IR.
There are a number of possible points at which the match and replace can
happen; the above shows it after import + input conversion. Other plugin points
are possible (e.g. before input conversion or after global optimization) but
currently lack some ergonomics in the available matchers.
### Others
Most other situations are covered by [custom modules](/samples/custom_module/).
These can still be asynchronous and scheduled in device order but incur
additional overheads and deployment complexity as custom runtimes are required.