A not-so-concise walkthrough of various IREE features that are in the design process and planned for future versions. A lot of the questions around how the IREE IR is designed and why certain components exist (such as the VM) hopefully become much clearer when seeing where we want to go with the infrastructure we are building (as opposed to where we currently are with our MVP slice). This document is not meant to encompass the entire design of any individual feature and if there's interest please say hi on the iree-discuss mailing list.
linalg: High-level Hierarchical Optimizationflow: Data- and Execution-Flow Modelingflow.streamflow.stream through the CFGflow.dispatchflow.executableshal: Hardware Abstraction Layer and Multi-Architecture Executableshal.interfacesvm: Lightweight Virtual MachineIREE's current input dialect is the XLA HLO dialect representing operations on tensors. This was a pragmatic decision based on having HLO already defined and proof of existing models being lowered to it from Tensorflow, allowing us to focus on the IREE-specific portions of work. Unfortunately, HLO is tied to Tensorflow and has many quirks that would not otherwise have been designed had that not been the case. There are discussions happening about an upstream MLIR Tensor Compute Primitives dialect that HLO can be lowered into, allowing IREE (and other backends) to decouple themselves from XLA and be easier to target from frontends.
linalg: High-level Hierarchical OptimizationIt's required that IREE inputs are all in tensor form (and not in-place memref updates) in order to perform a large majority of the flow transformations. Recent work in the Linalg dialect is adding support for operating on value-semantic tensors, meaning that we can first apply mhlo to linalg lowerings and any of the transformations available in Linalg prior to performing our own flow lowerings. The advantage is that Linalg will have much stronger and principled code motion and nested loop transformation optimizations than is possible on higher-level ops. As not all operations can be represented as linalg ops IREE will be able to ingest a mix of linalg, std, and mhlo (or its replacement) ops.
Very little effort has been applied to mhlo optimizations and there are a significant number of missing folders, canonicalizers, and simple transformations. Many of these happen in legacy XLA C++ backends; however we need them in MLIR so that we can make use of dynamic shapes, mixed dialect inputs, etc. The tf2xla bridge work (converting Tensorflow models into the corresponding mhlo ops) is nearing its initial milestones and afterward we expect more of these missing pieces to be filled in.
Examples of the optimizations that will greatly benefit IREE (and any other backend consuming mhlo) include:
HLO only operates on tensor values - even for simple scalars - and this presents a problem when attempting to determine which code should be specified to run on accelerators vs. what should run on the host. The canonical example is mhlo.while, which as seen in the example below uses scalar tensors for its loop iteration counter and comparison.
%start = constant dense<1> : tensor<i32> %bound = constant dense<3> : tensor<i32> %res = "mhlo.while"(%start) ( { ^bb0(%count: tensor<i32>): %1 = "mhlo.compare"(%count, %bound) {comparison_direction = "LT"} : (tensor<i32>, tensor<i32>) -> tensor<i1> "mhlo.return"(%1) : (tensor<i1>) -> () }, { ^bb0(%count: tensor<i32>): %1 = mhlo.add %count, %count : tensor<i32> "mhlo.return"(%1) : (tensor<i32>) -> () }) : (tensor<i32>) -> tensor<i32>
A naïve but correct lowering (what's currently in IREE) would perform the comparison and increment on the device and insert a host readback to see if the loop should continue:
func @main() -> tensor<i32> attributes {iree.module.export, iree.reflection = {f = "I1!R6!B3!t6", fv = "1"}} { %cst = constant dense<1> : tensor<i32> %cst_0 = constant dense<3> : tensor<i32> %cst_1 = constant dense<1> : vector<3xi32> br ^bb1(%cst : tensor<i32>) ^bb1(%2: tensor<i32>): // 2 preds: ^bb0, ^bb2 %3 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %2 : tensor<i32>, %arg2 = %cst_0 : tensor<i32>) -> tensor<i1> { %8 = flow.dispatch @main_ex_dispatch_0::@main_ex_dispatch_0[%arg0 : vector<3xi32>](%arg1, %arg2) : (tensor<i32>, tensor<i32>) -> tensor<i1> flow.return %8 : tensor<i1> } %4 = flow.tensor.load %3 : tensor<i1> cond_br %4, ^bb2(%2 : tensor<i32>), ^bb3(%2 : tensor<i32>) ^bb2(%5: tensor<i32>): // pred: ^bb1 %6 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %5 : tensor<i32>) -> tensor<i32> { %8 = flow.dispatch @main_ex_dispatch_1::@main_ex_dispatch_1[%arg0 : vector<3xi32>](%arg1) : (tensor<i32>) -> tensor<i32> flow.return %8 : tensor<i32> } br ^bb1(%6 : tensor<i32>) ^bb3(%7: tensor<i32>): // pred: ^bb1 return %7 : tensor<i32> }
Of note is the flow.tensor.load op indicating a host readback. Though this correctly executes the loop it is extremely inefficient. What's desired is for the loop iterator and condition to all happen on the host, with the iterator being passed to the loop body as an argument that can be encoded into a command buffer in future lowering stages. This eliminates host readback and allows for much larger flow.stream sequences, feeding more into the pipeline for the accelerator.
Not all source frontends have this issue (misrepresenting simple host computation as non-dense tensor operations), and our goal is to add a transformation that heuristically converts mhlo ops acting on small tensors to std ops acting on primitive values (i32, index, etc).
It's assumed that any work related to quantization/compression has happened prior to lowering into IREE dialects. Our plan is to use the proposed Quantization Transforms to achieve both training and inference-time quantization of types in a way that preserves maximum accuracy. IREE will support running with original unquantized floats in all cases, allowing for a smooth on-ramp to quantization and the gains in performance and reduction in model size that come from it.
As future work IREE would like to move beyond these transformation-directed approaches to quantization and interface directly to frontends which have a defined enough type system to represent accurate quantized (and otherwise compressed) computations directly, not relying exclusively on compiler-side type inference transforms.
flow: Data- and Execution-Flow ModelingThe flow dialect is designed to allow us to extract as much concurrency as possible from a program and partition IR into the scheduling and execution domains. Today we have the IR structure and transformation flow in place but have not yet got to the most interesting things such an infrastructure enables. A majority of the largest performance, latency, and memory usage improvements IREE can offer are determined first here and all following lowerings benefit. The fastest code is the code you don‘t execute and the smallest allocation is the allocation you don’t make ;)
flow.streamA majority of the readbacks we have today (manifested as flow.tensor.load.* ops) will be removed when we have an HLO tensor->primitive conversion. There will still be cases when readbacks are required for correctness but they usually fall into a small set of usage patterns. For those that don't this is one place where IREE will warn about performance issues, allowing programs that perform suboptimally but encouraging authors to adjust their input model to enable better behavior. The IREE VM also has specific support for hiding readback latency in an efficient way via coroutines.
The most common case we are currently seeing in the IR is that of dynamic copies where the offsets are dependent on the result of previous computations. Source models may have top-k + gather operations, for example. These appear as a flow.stream, a flow.tensor.load, and then another flow.stream that uses the loaded value for a flow.tensor.update (or other operation):
%index_tensor = flow.ex.stream.fragment(...) -> tensor<i32> { ... } %index = flow.tensor.load %index_tensor : tensor<i32> %result = flow.ex.stream.fragment(%arg0 = %index : i32, ...) -> ... { %0 = flow.dispatch ... %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32> ... }
Today the flow.tensor.update turns into HAL command buffer transfer operations that must have their offsets known at recording time. This is a limitation of vkCmdCopyBuffer but not a fundamental limitation of any hardware. In fact several drivers implement copies as small built-in shader programs meaning that we could perform the same expansion here with the right primitives. This would allow, in the above example, both the index to be computed and the tensor to be updated within the same stream to entirely remove the host round-trip.
flow.stream through the CFGThe current flow.ex.stream.fragment, as denoted by the experimental tag, is a temporary implementation designed to get the concept of streams lowered to the HAL dialect. For streams to be effective at modeling larger concurrency scopes they need to be able to move across branches in the CFG. This intuitively follows exactly what one would do if recording commands in C:
vkCmdCopyBuffer(cmd, ...); if (some_flag) { vkCmdBindPipeline(cmd, ..., pipeline_a); } else { vkCmdBindPipeline(cmd, ..., pipeline_b); } vkCmdDispatch(cmd, ...);
The corresponding flow IR:
flow.stream.append[%s0](...) { flow.tensor.update ... } %b = cmpi ne %some_flag, ... cond_br %b, ^a(%s0), ^b(%s0) ^a(%s1): flow.stream.append[%s1](...) { flow.dispatch @pipeline_a, ... } br ^end(%s1) ^b(%s2): flow.stream.append[%s2](...) { flow.dispatch @pipeline_b, ... } br ^end(%s2) ^end(%s3): ...
This allows the entire stream to be lowered into one command buffer without the need for any host round-trips. The conversion into the flow dialect will walk the CFG and attempt to thread the flow.stream values through so long as there are no external dependencies.
flow.dispatchWhile the flow.stream threading through the CFG can remove many of the simpler conditional dispatches there will always be some that will have their execution dependent on the result of prior dispatches. For these a flow.cond_dispatch will allow a condition to be provided that must be true for the dispatch to actually be performed.
For targets that natively support predication in their command buffers (such as D3D12's ID3D12GraphicsCommandList::SetPredication) this provides a host round-trip-free way of conditionally executing dispatches and transfers. Unfortunately Vulkan support is still lacking, though Nvidia supports the VK_EXT_conditional_rendering extension that exposes the same behavior.
For targets that do not support predication natively it's still possible to emulate predication with indirect dispatches. In this model the workgroup counts normally used to dispatch execution are sourced from another device buffer at the time the dispatch is made instead of sourced from the command buffer at the time the dispatch is recorded. Degenerate dispatches with counts of 0, 0, 0 allow for effective neutering of the dispatch with minimal overhead (vs. the significant penalty of a host round-trip!).
By modeling such predication at the flow level we are able to lower into the HAL with target-aware predication semantics and fuse indirect dispatch workgroup count calculations into existing dispatches already being performed such that overhead is reduced.
flow.executablesWhile still in the flow dialect, the executables are target-agnostic. This makes simple IR tree diffing a potential solution to deduplication. Since most of the dispatches originate from the same source-language library calls in input frameworks there's a high likelihood of duplication, and depending on when inlining is performed we may have stronger or weaker ability to perform the deduplication. Thanks to the MLIR canonicalization pass (that ensures ops are rearranged into consistent canonical representations) the IR comparisons can be done rather trivially.
Common subexpression elimination is performed many times during lowering, however there comes a point where the CSE can introduce false dependencies and additional allocations that are otherwise avoidable. For example if a broadcasting operation is CSE'd and then the result is used by two or more operations that are scheduled independently what would have been a relatively cheap lowering of the broadcast to a simple index remapping now becomes an additional dispatch, materialization of an intermediate tensor, and a barrier:
%bcast = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32> %mul1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32> // (pretend something here that prevents fusion) %mul2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
%bcast = flow.dispatch.region(%cst : tensor<f32>) -> tensor<1024x10xf32> { %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32> return %0 : tensor<1024x10xf32> } // a barrier will be required here %mul1 = flow.dispatch.region(%arg0 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> { %1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32> return %1 : tensor<1024x10xf32> } %mul2 = flow.dispatch.region(%arg1 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> { %2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32> return %2 : tensor<1024x10xf32> }
Instead the broadcast should be rematerialized inside of both dispatch regions as the cost of doing so is significantly less in compute resources and then the intermediate tensor will not be required at all. Though at first it may seem counter-intuitive to undo such a critical optimization as CSE (both to code size and often to compute) but here it's something we must carefully balance while looking at the whole system. It gets even more important when considering multi-device execution as the cost of sharing memory and synchronizing may be extremely non-trivial.
While still within the flow dialect we have the ability to easily split streams and safely shuffle around operations. Target execution backends can opt into such behavior to ensure that device restrictions such as maximum in-flight memory, maximum scheduling depth, and capabilities are observed. For heterogeneous configurations the intent is that certain operations, dispatches, and streams can be attributed to specify which device categories they should be lowered. The constraint solving that takes place can be provided with generic heuristics (“big GEMMs go on the accelerator”), profile-guided databases based on benchmarks, learned traits via ML, etc.
hal: Hardware Abstraction Layer and Multi-Architecture ExecutablesAs the IREE HAL is designed almost 1:1 with a compute-only Vulkan API many of the techniques classically used in real-time graphics apply. The benefit we have by modeling our usage of such a low-level API in IR is that the normal work - some of which is very non-trivial - for managing allocations, tracking resource lifetime, and ensuring proper synchronization/barriers is something we can apply the full force of an offline compiler against.
hal.interfacesThe hal.interface op specifies the ABI between the scheduler and the device containing the buffer bindings and additional non-buffer data (parameters, shapes, specialization flags, etc). Today a naïve ordering is used uniformly for all targets however it is possible for target backends to opt into providing their own interfaces based on target configuration. The same hal.executable may have multiple interfaces and the same backend may use one or more. This is useful for when target capabilities may vary at runtime, such as the number of available storage buffer bindings in Vulkan. By exposing a few hal.interface variants with different binding amounts the Vulkan backend could make better use of the larger number of bindings available at runtime while still providing support for smaller configurations.
Once we have multiple hal.interfaces defined for executables the scheduler needs to emit HAL ops that properly switch between them. By having a canonical form for bindings we can ensure that only the differences between the interfaces will need additional code.
Though the flow dialect attempts to fuse as many ops as possible into dispatch regions, it's not always possible for all target backends to schedule a region as a single dispatch. A classic example is algorithms like parallel reduction commonly used on GPUs that may require many dispatches to identical executables, while other algorithms may vary the executables they use based on the input parameters such as shape or the target runtime device support.
By default the flow.dispatch executable translation to hal.executables is performed 1:1 and it is assumed that a single dispatch is required. Extending target backends with scheduling interfaces (enabling them to opt into different scheduling behavior) will allow the backends to emit any number of hal.executables and any stream commands (such as additional dispatches or transfers) they may need. This is effectively equivalent to what would be done at runtime only because we are still operating on IR prior to buffer allocation and can use the hal ringbuffer primitive. Through this we can elide many of the allocations that would otherwise be required at runtime (and the concurrency-limiting false dependencies that usually come along with scratch memory).
Since the algorithm used may vary based on the parameters of the dispatch (such as the shape of the reduction which may be dynamically determined) scheduling specialization may occur even when targeting a single backend. In many cases folding and canonicalization can eliminate the overhead as whether one dynamically computed workgroup size is used instead of another the same IR is present.
Many explicit hardware APIs require knowing how buffers are used alongside with where they should be located. For example this additional information determines caching policy on buffer accesses (write-through, write-back, etc), visibility of writes across compute units, and the possible MMU properties that may need to be maintained/matched for the buffer. By using the SSA-form value-semantics of the MLIR tensor as used in the flow dialect we have complete information of where buffers may be used or at least where they enter or leave regions where we can derive such information.
Analysis passes can run over IR to attribute tensors such that when allocation is performed when lowering to the hal dialect we do so from an allocator compatible with where the buffer will be used, with memory types chosen based on the potential cost and location of operations performed (write-only on host vs. read-write on host and device, etc), and with usage bits indicating what kind of operations may be performed on the buffer. Many of these are local transformations as most buffers are only live within very small regions such as the flow.stream encompassing their usage.
Traditional systems need to either use very permissive buffer properties or heuristics that can introduce additional non-trivial overhead when such heuristics are incorrect. For example, OpenGL had several such usage hints that drivers were then able to use but almost no drivers behaved as desired in all cases and it lead to additional memory ghosting, copies, readbacks, and unpredictable performance. For almost all uses of the buffers within an IREE invocation we instead can know precisely where and how buffers may need to be moved and do it a minimum number of times if it is required.
For targets that may require runtime preprocessing of their executables prior to dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch compilation mechanism based on Vulkan's Pipeline Cache.
Today each executable is compiled on-demand and cached only for the process lifetime. Though some drivers may provide their own caching we can make better use of the explicit caching and compilation behavior with the additional information we have in the compiler.
For any given entry point (or group of entry points) into an IREE module we can perform reachability analysis to know which executables may be executed when that entry point is invoked. In this way we can emit pre-invocation compilation checks (similar to an std::call_once block) that provides all required executables for compilation and allows more efficient compilation through multithreading the compiler invocations. These same compilation caching function can be exposed and invoked manually by an application to force pre-compilation when it is least likely to impact the user, such as a post-install/first-run step or concurrently while other application features are loading.
We can use zero or more scoped caches for executables within a module. Completely dynamic modules (such as those emitted in eager-mode usage) may avoid the caching overhead entirely, while modules that have several primary usage modes (such as training and inference) may choose to use independent caches for each such mode.
The caches generated can then be retrieved and saved by the hosting application. Upon the next execution the application can provide the caches and if still valid they will be used to avoid compilation.
An advantage of representing executable binaries in IR after translation is that we can apply various post-compilation compression and minification techniques while still know precisely where the executable will be used. This is extremely important for SPIR-V as it is not designed to be a small at-rest format. Though the biggest lever we have to control generated code size is higher-level deduplication and specialization there will still be a sufficiently large number of executable binaries we will need to embed within the final modules and having targeted approaches for reducing their size beyond just “gzip everything” is very powerful.
For example, SMOL-V is a fantastic lossless SPIR-V compression technique that, when coupled with modern dictionary-based compression algorithms, can save significant binary size. As a data point, the SPIR-V corpus SMOL-V uses for testing goes from 4.8MiB of raw SPIR-V to 348KiB of compressed SMOL-V.
Combined with Batched Executable Caching and Precompilation we can easily use shared dictionaries and other cross-artifact compression in a relatively plug-in way.
It's still an area that needs more research but one goal of the IREE design was to enable efficient target- and context-aware compression of large constants (typically model weights/parameters/embeddings). This may mean reusing existing hardware compression formats on GPUs, ML accelerator-specific formats, or very-low-bit-depth (1-4 bit per value) quantization techniques that cannot be directly used without first decompressing. The inspiration here is formats like Crunch and Basis Universal that perform “supercompression”, and we may even be able to use these directly as then we can make use of GPU hardware samplers to do the 4-bit to 32-bit decompression, etc.
The IREE HAL - much like Vulkan it is based on - eschews much of the state that traditional APIs have in favor of (mostly) immutable state objects (pipeline layouts, pipeline states, descriptor sets, etc). There are still a few stateful entry points in the API, though, and deduplicating or reordering redundant calls can reduce both IR, API, and execution overhead.
The key place this will have the largest impact is around descriptor set bindings and push descriptors, both of which are state and can have non-trivial setup overhead. A canonicalization for such commands that inspects the target hal.command_buffer to see if the same state was set prior and code motion to move such commands out of loop bodies when possible would be helpful.
A core concept of the IREE scheduler that allows for overlapping in-flight invocations is that of the resource timeline. This identifies module state that can be in use by multiple invocations and assigns timeline milestones denoting when the resource will be in the appropriate state for the current invocation to proceed. Conceptually it is like a epoch-based synchronization mechanism as commonly found in garbage collectors to allow for lock-free asynchronous memory reclamation.
The advantage we have in the IR is that we know both the usage of all resources thanks to buffer usage tracking and the synchronization domains of all resources (in most cases). This allows us to effectively assign one timeline semaphore per writeable resource while in practice having far fewer than 1:1, as for example if two resources are only ever written in the same command buffer only one semaphore is needed to signal the completion of both writes.
By transforming IR to sink all resource reads and writes closest to where the value is used we can enlarge the time windows that can overlap across invocations that may share those resources. This is similar to what out-of-order CPUs do with register renaming/reorder buffers/etc and something we can apply some traditional instruction scheduling techniques to (only here our ‘instructions’ are entire command buffer dispatches/transfers).
Two degenerate cases of this approach are that of resource indirection (iree.ptr<tensor<T>>) and dynamic resource shapes. In these two cases it may not be possible to continue recording commands even if we are able to ensure execution is appropriately synchronized. This is where indirect dispatch, predication, indirect command buffers, and VM coroutines can all help cover for the times where we are unable to transform away the indirection or emit shape logic without data dependencies.
(When properly implemented) almost all buffers required during execution never escape the command buffers they are used in or a single VM invocation. We can trivially identify this from the explicit captures of flow.stream and flow.dispatch ops and the fact that all tensor types have value-semantics. Only those tensor values loaded-from/stored-to module state or that cross the exported module function boundary need special consideration while almost everything else can live transiently only so long as it is required during execution.
Thanks to this information about buffer usage and lifetime we can use a ringbuffer to store the transient tensor data and other required data reservations such as uniform buffers used to pass dynamic parameters (shapes, flags, etc) into dispatches. This gives the compiler and the application a knob that allows them to control maximum concurrency (by having a very large ringbuffer) or maximum memory usage (by having a minimally small ringbuffer).
Allocating tensors from the ringbuffer does not require sophisticated runtime packing as we can emit IR to calculate required sizes for dynamically shaped tensors. Whether a basic block reserves %sz = constant 42 : index bytes or %sz = std.muli %cst, %dyn_dim : index bytes doesn't materially change how the allocations are performed. Since almost all usage involves simple write head bumps there is no need for ahead-of-time memory planning or large fixed allocations, and since no buffer within the ringbuffer can alias we can have coarse (read: low overhead) guarantees about the availability of certain regions of the ringbuffer (“when this event is signaled all prior ringbuffer writes have completed”).
Usually any planning we may want to perform can be done in IR via code motion. For example applying traditional algorithms used to reduce register pressure will help us attain narrower live windows within the ringbuffer leading to a larger number of in-flight operations for the same ringbuffer memory usage.
We may end up using both a classical ringbuffer and a variant known as the bip buffer because it is better for descriptor set utilization (as we can provide many dispatch parameters with a single base offset bound once at the beginning of a region).
Functions calls made across modules (either from C++ into the VM, VM->VM, or VM->C++) should be able to define timeline semaphores used to wait and signal on the call. We can do this by making all exports automatically have the semaphores and then make invocations populate them if they were not provided by the caller. In this way we can allow multiple invocations of exported functions to chain naturally with internal asynchronous workloads, turning most IREE invocations into just recording of command buffers that can never block.
When combined with VM coroutine support we even have the ability to interleave any required host execution between the wait and signal semaphores provided such that the caller never knows on which device execution is taking place. It's still possible to provide synchronous wrappers that emulate blocking behavior but by having the core system designed around a single system-supported primitive we avoid the need for additional things like interrupt watchdog threads, implicit blocking, and other pitfalls.
One approach to using multiple cores on a CPU is to perform interior parallelization of operations such as OpenMP or library-call-based custom thread pools (gemmlowp). This works when each individual operation is relatively costly vs. potential pipeline bubbles caused by work spinning down near the end of an operation and spinning up at the beginning of the next.
IREE is designed to handle many more workloads - some of which have very narrow shapes but very deep pipelines (like search algorithms) - such that the above approach of multithreading within ops becomes a bottleneck. These workloads are traditionally very poorly handled by frameworks and issues with oversubscription, pipeline stalls, and suboptimal system schedulers (such as on Android) can lead to more time being spent thrashing about than actually executing real work.
The approach we take here is to treat the cores of a CPU as if they were computation units on a GPU, each able to perform some set of heterogeneous work independent of others units. This means that the concurrency we are trying to model at the flow level and communicate to the runtime via the hal that explicitly states which dispatches can overlap and the size of the workgroups can trivially be used to distribute this work over many cores exactly as a GPU would do it. Integration with library calls that may require their own threading (such as Ruy) requires that they be able to use the IREE thread pool instead of their own.
In this way we can avoid pipeline bubbles and other latency-inducing unpredictable scheduling. This does not mean that we treat individual units of work at the same scale as we would for GPUs, but instead that we tile and have one or more processing units that allows us to work on those tiles. Whether the tile size is defined by a library call contract, heuristics, or empirically is TBD, but expect workgroup sizes in the thousands to millions of invocations vs. normal GPU workgroup sizes in the dozens to hundreds of invocations.
To achieve this style of scheduling efficiently we'll likely use something like marl as the scheduler. Marl provides cross-platform low-overhead fibers and is compatible with this style of scheduling as it was built for the Swiftshader software rasterizer.
Even if IREE was only targeting CPUs the assertion is that we would still want to schedule this way and it's only an incidental benefit that if building for heterogeneous targets the scheduling code may be shared (just with a different divisor for workgroup count calculations).
vm: Lightweight Virtual MachineThe VM is designed as a dynamic linkage ABI, stable bytecode representation, and intermediate lowering IR. Many of the optimizations we can perform on it will benefit all use cases (such as when lowering to LLVM IR) by allowing higher-level program transformations around synchronization that are difficult to perform on arbitrary LLVM IR.
One of the largest features currently missing from the VM is coroutines (aka user-mode fiber scheduling). Coroutines are what will allow us to have multiple in-flight invocations into a module - some of which may be waiting on external events - without the need for complex multithreading logic or state machine machinations.
In many cases once semaphores are exposed to callers we will not need to yield in the VM. The user will call into the module with provided semaphores, the work to perform will be recorded to one or more command buffers and submitted to the device, and then control return will return to the caller immediately.
In cases requiring host readbacks that we were not able to remove, however, additional VM code may need to run prior to when the final semaphore is signaled. To preserve the asynchronous interface and immediate execution guarantees the compiler can emit explicit yield points (vm.yield) that are known-good locations for yielding (such as most resources not required after the yield having been flushed/discarded, partial synchronization scope availability if other work may be able to execute concurrently irrespective of the yielded coroutine, etc).
When the VM encounters the yield at runtime it will suspend the coroutine until a defined condition is met. Many coroutines can be in various states at any given time and - thanks to the resource timeline - can still be memory safe. For example if two stateless invocations are made with a common wait semaphore both can be recorded and submitted without waiting on each other. If there is internal module state accessed the invocations are implicitly ordered by invocation order (similar to what Vulkan calls API order) based on internal resource timeline semaphores.
Waking the coroutines can be performed by either an application-provided callback in the case of the application already having a periodic event which is doing bookkeeping (such as frame end callbacks when rendering or Looper idle events on Android), giving direct control over the frequency and location which IREE utilizes to perform additional work. A helper will be provided as well that runs a dedicated IREE thread to do this, but the expectation is that applications can often do a better (and importantly more predictable) job.
By utilizing coroutines IREE will have a way to fill traditional pipeline bubbles even with execution from the same module (let alone across modules) in the situation where host readbacks or other logic is required. This increases overall throughput and utilization while reducing host wakeups as many coroutines can be processed at once to submit new work to the device queues, though it does not help reduce per-invocation latency.
External code such as the HAL implementation or user ops may provide the wait handles used for continuation. For example, the HAL can expose a function that yields and wakes only when one or more timeline semaphores reach their target values:
// submit work hal.device.yield %semaphore4 >= %sem4_target, %semaphore5 >= %sem5_target // continue here, possibly much later in time
Though coroutines help throughput there is a way we‘ve found to reduce latency that’s been documented as cellular batching. This same technique has been implemented in prior internal systems and is one of the motivating design goals for IREE's creation. The core idea is to identify small uniform work that can be partitioned and scheduled greedily such as to enable batching or reduce associated invocation costs (such as refreshing accelerator SRAM/caches with new parameters). This usually manifests as finding large GEMM/GEMV operations using the same fixed parameters and either dynamically increasing the batch size by adding the waiting work (without deferring the actual execution time) or sequencing them back to back to ensure better cache utilization. Which approach is taken depends on any data dependencies that may be present (such as LSTM state feedback edges).
With the foundation of coroutines in IREE it's possible to yield execution at any given point - including during command buffer recording - and wake on specific conditions. A majority of the logic can be built into the module itself with very little need for runtime machinery, as shared VM variables can be used to track pending work across invocations (even from different parts of the program) and flush based on logic wholly controlled by the user or compiler (such as count/max time latency/etc limits). This allows for the large variety of scheduling behavior various applications may want to use, such as a zero-latency batch-only-within-this-invocation to a Nagle's Algorithm-esque time or limit based behavior or even some learned model-specific windowing.
Design work is still required on how to represent this in IR but the current thought is to model the regions in which deferred execution is possible and beneficial and allow during lowering to the VM additional transformations. This is similar to how the async-await behavior works in C# where the async keyword is just sugar that expands to additional generated helper utilities.
A simple strawman representation for sequential dispatch may look like:
hal.scheduling_policy @defer_policy { // max time, max count, max live memory, etc } ... hal.command_buffer.dispatch.deferred @defer_policy, @dispatch, ... // vm.yield added here during lowering
There are many cases to explore and as cellular batching can have performance benefits of several orders of magnitudes it'll be one of the primary areas of research in the long-term.
For scenarios where dynamic module loading is not required and entire modules can be compiled into applications we can lower the VM IR to LLVM IR within MLIR's transformation pipeline. Instead of embedding vm.call ops that are dispatched at runtime to things like the HAL we can instead lower to llvm::CallInst to runtime-resolved function pointers. This still enables all of the flexibility of heterogeneous/runtime-determined devices, pluggable diagnostics, and backend composition without any need for flatbuffers or the VM bytecode interpreter.
The VM was designed to make such a lowering easy and the C-style struct-based function pointer registration for runtime modules was designed to make emitting code that used it fairly robust even when linked in dynamically such as when embedded in shared objects.
An extension of this is what we've been calling ‘runtimeless mode’, where the IREE VM linkage code is statically linked into the binary alongside the generated module LLVM IR. If only a single HAL backend is linked in then (with some build-fu) we should be able to get call devirtualization to reduce code size to precisely the functionality used by the module.
Currently the VM only supports two types: i32 and vm.ref<T>. This is an intentional limitation such that we can determine what is really needed to express the scheduling we perform, with the idea being that such a limited model will make it easier to use techniques like indirect command buffers to compile the VM itself to an accelerator executable that dispatches work without host involvement.
As we port more models we may find a few primitives that are worth bringing into the VM design such that it's worth potential complications to future porting. These includes types like f32 (for simple float calculations/comparisons), list/dict (easier python compatibility), and vector<4xf32> (for simple inline calculations that are not worth dispatch overhead/synchronization).
Though IREE will use many different tricks such as predication to build deep pipelines there is still the requirement that the command recording and submission happens on the host CPU. Though the cost of this in terms of latency and power use can be minimized by coalescing and timelines there is still the possibility of non-trivial roundtrips being introduced that limit performance. For particular applications like low-power always-on compute or where there is significantly branchy behavior (such as search algorithms) it is important that the decision making logic as to what is dispatched runs as close to real-time as possible within the execution pipeline.
The IREE VM is designed to be runnable on-device in a secure and cooperative way (no pointers, indirect buffer handles to allow for memory space rearrangement op-to-op, deterministic execution and explicit yield points, etc).
The recent efforts to bring indirect command buffers to Vulkan and Metal‘s Indirect Command Buffers (that both derive inspiration from NV_command_list) are one such target for this. Either by lowering the VM IR to LLVM IR or SPIR-V, by a special conversion to target-specific forms, or by actually executing the VM bytecode directly on-device (it’s ~1000 LoC) we should be able to prototype what full on-device usage is like. Even if only some VM functions the compiler deems useful to schedule on the device are used and the rest run on the host (particularly those functions calling imported functions) some of the most costly logic that creates tight coupling of the host and device scheduling can be limited.