!!! note

    Much of this describes provisions for extension within IREE, but until the core of the system has settled little work will be done to fully flesh them out and document them in detail. A large majority of things that would make someone want to extend IREE can instead be accomplished much more easily and performantly using native MLIR dialects that are then processed by the IREE compiler.
IREE has a compiler and runtime separation, a multi-layered architecture, and a split between execution of “host code” that schedules compute-heavy work and SPMD “device code” that performs the bulk of the compute operations. Each axis has a different set of extension mechanisms that can be used independently or combined.
Organized below are some of the mechanisms IREE provides for extending the core compiler and runtime and when they should(n't) be used. The goal of these progressively lower-level extension mechanisms is to make it easier for users to fall into the pit of success:
!!! quote

    "_a well-designed system makes it easy to do the right things and annoying (but not impossible) to do the wrong things._" - [Jeff Atwood](https://blog.codinghorror.com/falling-into-the-pit-of-success/)
The amount of engineering complexity for initial bring-up and maintenance increases with each subsequently lower-level approach, and it is best to start from the top and exit as fast as possible: this is a choose-your-own-adventure where you're trying to escape the dungeon with both the loot and your limbs :dragon:. Avoid the temptation of immediately dropping down to making external C calls at runtime just because that's how it's been done before: it's easier, more robust, and more performant to use the system as it is intended to be used.
The primary goal when extending any framework should first be to avoid extending it at all. There is no mechanism that is free - whether in terms of engineering effort to develop and maintain over time, include in compiler deployments, or include in runtime deployments. As a system scales in deployment configurations the available mechanisms for extension increase but so too does the chaos introduced by extensions that do not also scale with that design. Users are the only ones who can determine the tradeoffs they are willing to accept: for example, the mechanism to extend device code with a custom runtime call to a C function does not work on GPUs and gets significantly more complicated on CPUs as sandboxes/enclaves are used - but if the user scenario is for local process CPU-only execution that may not matter.
Consider in normal software development when one would choose to write more code (possibly packaging it into a reusable library) vs. changing the programming language or compiler they are using to compile their code vs. changing the operating systems their code runs on. The further one gets from the problem they are trying to solve the more work, coordination, and maintenance is involved and though there are reasons to make changes across the stack they should be done only when a simpler solution would not suffice.
An author will retain more control over their logic the closer they sit to the inputs to the compiler. IREE provides several mechanisms that try to keep control with the author and remain robust to changes in IREE or MLIR internals, and it is strongly encouraged that those looking to extend take those routes first. Contributions that help everyone are very welcome but do have a higher cost, and it's often much easier to design and justify upstream changes with working examples in forks or at higher levels of the stack.
From a performance perspective the rule is to colocate code with the data it is acting on: tensor data, for example, should almost exclusively be manipulated by device code as tensors live on device. Attempting to use tensor data with host code will result in synchronization points and host/device transfers that can decimate performance. This can lead to seemingly paradoxical situations where swapping out compiler-generated code for a human-authored “fast path” can be slower than even the most naive compiler results. An important thing to keep in mind with compilers is that it is exceedingly difficult to produce code by hand that is consistently more performant across a broad range of deployments and the first temptation should always be to improve the compiler - extending it via other mechanisms when not required by the task is often just premature optimization.
!!! tldr "TL;DR"

    Convert your custom ops into standard MLIR dialects.
```
+------------+      +--------+      +---------------+
| Your input | -+-> |  iree  | -+-> | IREE compiler |
+------------+  |   +--------+  |   +---------------+
                |   +--------+  |
                +-> | linalg | -+
                |   +--------+  |
                |      ....     |
```
The easiest, cleanest, and most robust path to extend IREE is to make use of what MLIR is designed for: composing dialects and converting between them. IREE supports several input dialects such as tosa, mhlo, linalg, and the standard arith, math, tensor, and scf dialects. Any source IR that can be turned into that mix of dialects (directly or transitively) will work with the whole IREE pipeline for all deployment configurations and targets. If the computation can be expressed in this form it is always the best route: deployments stay small, no additional code needs to be modified or included at runtime, and the result runs on all device types and execution modes.
This mechanism can also be layered with any of the subsequent lower-level ones: if some part of the operation runs on the host and some part on device then decomposing it such that it contains as many standard ops for flow control as possible and linear algebra/custom ops for the dense math will reduce the engineering effort required on both sides and lead to an easier to maintain solution even if lower-level extension is required.
A large majority of classic ML “custom ops” can be accomplished with this approach. When bringing up projects built on IREE it's best to concisely describe the operation in more elemental mathematical representations and then add optimizations where required knowing that things will still work even if those optimizations never happen.
To make use of this approach one just needs to follow the standard MLIR dialect conversion behavior: add a dialect with ops, add a conversion pass, and run that pass before providing the resulting IR to the IREE compiler. See Creating a Dialect.
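As a rough sketch of that flow, the pass below expands ops from a hypothetical `mycustom` dialect into the standard arith/math dialects before the IR is handed to the IREE compiler. The `mycustom` dialect, its `SoftplusOp`, and the pass/pattern names here are illustrative assumptions, not part of IREE or MLIR, and exact MLIR class names may shift between versions:

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/Math/IR/Math.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/DialectConversion.h"

using namespace mlir;

namespace {

// Expands the hypothetical scalar op `%y = mycustom.softplus %x : f32` into
// log(1 + exp(x)) using only standard math/arith ops.
struct LowerSoftplus : public OpRewritePattern<mycustom::SoftplusOp> {
  using OpRewritePattern::OpRewritePattern;
  LogicalResult matchAndRewrite(mycustom::SoftplusOp op,
                                PatternRewriter &rewriter) const override {
    Location loc = op.getLoc();
    Value exp = rewriter.create<math::ExpOp>(loc, op.getOperand());
    Value one = rewriter.create<arith::ConstantOp>(
        loc, rewriter.getFloatAttr(op.getType(), 1.0));
    Value sum = rewriter.create<arith::AddFOp>(loc, exp, one);
    rewriter.replaceOpWithNewOp<math::LogOp>(op, sum);
    return success();
  }
};

// Runs before iree-compile: afterwards no mycustom ops remain and the module
// only uses dialects the IREE compiler already accepts.
struct ConvertMyCustomToStandard
    : public PassWrapper<ConvertMyCustomToStandard, OperationPass<ModuleOp>> {
  void runOnOperation() override {
    MLIRContext *context = &getContext();
    ConversionTarget target(*context);
    target.addLegalDialect<arith::ArithDialect, math::MathDialect,
                           func::FuncDialect>();
    target.addIllegalDialect<mycustom::MyCustomDialect>();
    RewritePatternSet patterns(context);
    patterns.add<LowerSoftplus>(context);
    if (failed(applyPartialConversion(getOperation(), target,
                                      std::move(patterns))))
      signalPassFailure();
  }
};

}  // namespace
```

The resulting module then contains only dialects IREE already understands and can be passed to the IREE compiler unchanged.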
Think of this like authoring C++ sources with templates that you compile into your application: Clang (and LLVM beyond) don't know about your library details and instead just process it as it would any other code. You can take the same source and pass it to GCC and it'll be robust to underlying changes in the system.
!!! tldr "TL;DR"

    Import MLIR functions in the compiler and custom modules at runtime.
```mlir
// Main user module compiled by IREE:
module @model {
  // Declare a synchronous external function:
  func.func private @my_custom_module.sync_func(%input: tensor<?xf32>) -> i32
  // Declare an asynchronous external function:
  func.func private @my_custom_module.async_func(%input: tensor<?xf32>) -> tensor<?xf32>
      attributes { iree.abi.model = "coarse-fences", nosideeffects }
  func.func @predict() {
    ...
    // Call a synchronous/blocking external function:
    %sync_result = call @my_custom_module.sync_func(%sync_input) : (tensor<?xf32>) -> i32
    ...
    ...
    // Call an asynchronous/non-blocking external function:
    %async_result = call @my_custom_module.async_func(%async_input) : (tensor<?xf32>) -> tensor<?xf32>
    ...
  }
}
```
IREE provides dynamic linking at runtime via its VM interfaces. For code that runs on the host and requires syscalls or calling out to existing libraries - such as file IO, text processing, and JPEG decoding - this is an easy way to interop without paying attention to the more complex details of device code. An IREE module compiled using custom modules is portable and dynamically deployable so long as the custom module is registered at runtime.
This approach conceptually matches what normal native binaries do in an OS: imports are declared and at runtime they are resolved based on the available exports of modules in the system. Just as with normal systems engineering, the design of the API between modules is up to the user and, depending on rigor, can have several pitfalls; these problems and their solutions are not IREE-specific, and anyone who has designed a shared library interface can apply the same rules here in IREE around versioning, performance, etc. One does not add 2 integers via a syscall and the same holds here: custom modules and the functions within them should perform a large amount of work to hide the overheads involved in cross-module calls, and users must be aware that the compiler cannot optimize across the call boundaries.
See the synchronous tensor I/O and asynchronous tensor I/O samples.
The runtime portion requires that the code be exported to the VM system by way of an iree_vm_module_t interface. A low-level native interface exists with minimal overhead and is used, for example, by the IREE HAL itself. There is also a C++ wrapper that is significantly easier to work with; however, it needs some performance improvements.
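As a purely conceptual sketch, the host-side logic behind the `@my_custom_module.sync_func` import shown earlier might look like the function below. In a real custom module this would be exposed through an iree_vm_module_t implementation (or the C++ wrapper), which handles argument marshaling; the plain signature, the `SyncFunc` name, and the workload are illustrative assumptions only:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side logic for @my_custom_module.sync_func: counts the
// positive elements of the input tensor's contents. A real entry point would
// receive the tensor through the VM's buffer/ref types rather than a
// std::vector, and should do enough work per call to amortize the
// cross-module call overhead the compiler cannot optimize away.
int32_t SyncFunc(const std::vector<float>& input) {
  int32_t positive_count = 0;
  for (float value : input) {
    if (value > 0.0f) ++positive_count;
  }
  return positive_count;
}
```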
Full end-to-end examples can be found under samples/custom_modules/.
!!! tldr "TL;DR"

    Add patterns to `iree/Compiler/Codegen/` to emit target code.
The easiest and most robust path for specializations of device code is to emit such code mixed with the IREE compiler generated code at the highest possible level of abstraction within the target pipeline. For example, if the code can be represented with the vector dialect then inserting conversion patterns between linalg and vector enables the emitted code to be specialized further based on user configuration and optimized with the full set of available passes that run in the pipeline. For each level lower one goes, more flexibility is gained (such as being able to emit inline assembly blocks that do anything) at the cost of generality and multi-targeting applicability.
How much the tradeoff matters is based on the behavior of the extension. If a pattern changing a transcendental function to an approximation can operate at the vector level then all IREE deployment targets can benefit from the pattern and as new targets are made available they will automatically receive the benefits. In contrast, a pattern at the vector level that turns generic vector operations into architecture-specific LLVM intrinsics by its nature only pertains to a single target family and can be done at a lower level. As a rule of thumb if a particular pattern is going to need ~N implementations for ~N targets that are all mostly the same it's better to try to move that higher in the stack.
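For instance, a target-independent rewrite like the sketch below could run on the standard math/arith representation so that every backend benefits. The `ApproximateTanh` name, the handling of only scalar f32, and the accuracy of the approximation are illustrative assumptions; the real polynomial approximation patterns handle vectors, range reduction, and accuracy far more carefully:

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Math/IR/Math.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

namespace {
// Toy rewrite: replaces math.tanh with a low-accuracy rational approximation
// so no backend needs a libm call. Scalar f32 only to keep the sketch small;
// a production pattern would also splat constants for vector operands.
struct ApproximateTanh : public OpRewritePattern<math::TanhOp> {
  using OpRewritePattern::OpRewritePattern;
  LogicalResult matchAndRewrite(math::TanhOp op,
                                PatternRewriter &rewriter) const override {
    Location loc = op.getLoc();
    Value x = op.getOperand();
    if (!x.getType().isF32()) return failure();
    auto cst = [&](float v) -> Value {
      return rewriter.create<arith::ConstantOp>(loc,
                                                rewriter.getF32FloatAttr(v));
    };
    Value x2 = rewriter.create<arith::MulFOp>(loc, x, x);
    // tanh(x) ~= x * (27 + x^2) / (27 + 9 * x^2)
    Value num = rewriter.create<arith::MulFOp>(
        loc, x, rewriter.create<arith::AddFOp>(loc, cst(27.0f), x2));
    Value den = rewriter.create<arith::AddFOp>(
        loc, cst(27.0f), rewriter.create<arith::MulFOp>(loc, cst(9.0f), x2));
    rewriter.replaceOpWithNewOp<arith::DivFOp>(op, num, den);
    return success();
  }
};
}  // namespace
```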
At this point the complexity of extending things is still fairly constrained: a C++ pass or pattern is verified with normal lit tests and can be upstreamed easily either into MLIR or IREE (a large number of IREE patterns are upstreamed, benefiting all users of MLIR). Cross-compilation and versioning are not a factor and the IREE artifacts can be considered durable at a coarse level (outside of major target architectural changes).
Note that depending on the target there are various mechanisms for representing code in MLIR, up to and including inline assembly snippets in IR via llvm.inline_asm, and patterns can operate at multiple levels of representation: linalg or vector as well as target-specific representations like SPIR-V and LLVM IR.

There are several ways to author patterns and passes in MLIR; as one example, linalg uses a Python-based DSL for defining some of its extended ops. There are many examples within both MLIR and IREE to learn from, one specifically being the polynomial approximation expansion patterns.
!!! tldr "TL;DR"

    Statically link external object files into IREE executables.
For large bodies of existing device code or library calls that are available for static linkage, the work involved to reimplement them at higher levels of the stack can be cost-prohibitive even if it leads to better results. In these cases, just as with a normal toolchain, one would want to declare an external function, call it, and add the object file to the linker command line. In IREE the same can be performed by taking compatible bitcode or native object files and linking them in with the generated code. An MLIR pattern would declare and emit the call and the target-specific IREE linker would pull in the objects.
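For a sense of scale, the object-file side is ordinary native code. A hypothetical function like the one below could be compiled (with flags matching the deployment target, sanitizer mode, etc.) into an object or bitcode file that a target backend links into the executable; the symbol name and workload here are illustrative assumptions only:

```cpp
#include <stddef.h>

// Hypothetical externally linked helper that compiler-emitted calls would
// target. Keeping it freestanding (no globals, TLS, or syscalls) avoids most
// of the linking headaches across targets and deployment modes.
extern "C" void my_fused_activation(const float* in, float* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    float x = in[i];
    out[i] = x > 0.0f ? x : 0.1f * x;  // leaky ReLU as a stand-in workload
  }
}
```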
As the linking behavior varies per target (for example, some targets like SPIR-V don't have traditional linkers) how this is performed is up to the IREE target backends. The complexity involved in producing the object files to link will also vary per-backend and the complexity of the deployment: cross-compiling for multiple architectures or compilation modes (ASAN, etc) will require unique copies of the object files matching that precise configuration.
At this point generality is largely out, as is the ability to cleanly upstream such files. It should be apparent why a few dozen lines of C++ or PDL that avoid all of this complexity are more appealing. In extremely specific cases - a single platform/architecture/version for a single program deployed via a specific artifact composition - it's not so bad, but IREE is designed such that extreme specificity is an optional mode of the more general solution. This does not mean the mechanism is not useful in some situations, only that it should be a last resort when one of the easier-to-manage solutions is not viable - not a shortcut to avoid writing some C++ patterns.
Where possible, express such code in linalg, vector, or LLVM IR and use target-specific conversion patterns instead. As the linking behavior varies per target backend there is no general solution at this level: if targeting the CPU then the system native linker or lld needs to be provided the object files, SPIR-V targets will need the SPIR-V binaries merged directly, and Metal shader libraries will need to be constructed with the Apple-specific metallib tooling. Producing these files and performing the linking is outside the scope of IREE.
If the files can be acquired then compiler changes will be required to emit calls to them and invoke the linker with the files.
On the CPU an alternative is to use the static library output mode where IREE produces an object file and then the user invokes the linker themselves; this still requires the compiler changes to emit the calls but avoids needing to teach the compiler how to link the files.
!!! tldr "TL;DR"

    Dynamically link external C functions at runtime from device code.

    It is pitch black. You are likely to be eaten by a grue.
This is the lowest-level integration in the system and is designed to act as an escape hatch and - as with any emergency escape hatch - it's not designed for ergonomics. Users should try first to come in through the door, and needing to reach for this mechanism should trigger alarms about the approach being taken.
IREE's execution model for device code and native machine binary deployment mechanisms are designed with several constraints in order to make all of the above approaches possible and performant. Calling arbitrary C functions from deep within the system can introduce subtle (and not-so-subtle) bugs that are extremely difficult to track down and versioning between the compiler emitting the calls and the runtime providing the implementations can cause skew unless held carefully. Consider the methods added here like syscalls in that they must be extremely focused and if they are ever likely to change (including being removed) then care will be needed just as with versioning or redirecting a syscall. Designing good stable interfaces is hard and a classic pit of failure.
Some things to note:
Most of the constraints here come from the SPMD parallelism model, platform-agnostic deployment format, and overall data-oriented design of IREE. Code operating in this fashion has a certain shape and that is usually not the same as big legacy single-threaded CPU-focused BLAS libraries that perform their own caching, internal thread and state management, and other shenanigans. IREE is not designed to wrap such things and if any of these constraints are issues it is more an indicator that the approach needs adjustment than anything else. Trying to bypass or work around the constraints is possible - after all, IREE is an open source project and any user is welcome to fork it - but unsupported by the core IREE team.
The compiler is changed to produce calls to imports via a dynamic import table provided to each dispatch function. The import table is declared in the executable library for use at runtime. Runtime applications register an import provider to resolve named symbols in the import table to C functions that marshal arguments and results.
The compiler side needs some additional work but an example is included here: Issue 7504. The runtime side is complete and resolution is performed by a user-supplied iree_hal_executable_import_provider_t.
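Conceptually, the provider's job is just to map the symbol names the compiler emitted into callable C functions. The sketch below illustrates that name-to-function contract with purely hypothetical names and a single simplified signature; the real resolution goes through iree_hal_executable_import_provider_t and its calling convention as defined in the runtime headers:

```cpp
#include <cstddef>
#include <map>
#include <string>

// A hypothetical import that device code could call through the import table.
// It must marshal its own arguments/results and be safe to call concurrently
// from many worker threads (the SPMD dispatch model).
extern "C" void my_simd_op(const float* in, float* out, size_t count) {
  for (size_t i = 0; i < count; ++i) out[i] = in[i] * 2.0f;
}

// Hypothetical resolver: given a symbol name the compiler emitted into the
// import table, return the function to call (or nullptr if unknown so other
// providers can be consulted). A real import table holds functions of varying
// signatures; a single ImportFn type is used here only to keep the sketch
// self-contained.
using ImportFn = void (*)(const float*, float*, size_t);
ImportFn ResolveImport(const std::string& symbol_name) {
  static const std::map<std::string, ImportFn> kImports = {
      {"my_simd_op", &my_simd_op},
  };
  auto it = kImports.find(symbol_name);
  return it == kImports.end() ? nullptr : it->second;
}
```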