| # IREE Design Roadmap |
| |
| <a id="markdown-IREE%20Design%20Roadmap" name="IREE%20Design%20Roadmap"></a> |
| |
| <!-- WARNING: DO NOT EDIT THIS FILE IN AN EDITOR WITH AUTO FORMATTING --> |
| |
A not-so-concise walkthrough of various IREE features that are in the design
process and planned for future versions. Many of the questions around how the
IREE IR is designed and why certain components (such as the VM) exist should
become much clearer when seen in light of where we want to take the
infrastructure we are building (as opposed to where we currently are with our
MVP slice). This document is not meant to encompass the entire design of any
individual feature; if there's interest in a particular topic, please say hi on
the [iree-discuss](https://groups.google.com/forum/#!forum/iree-discuss)
mailing list.
| |
| <!-- TOC --> |
| |
| - [IREE Design Roadmap](#iree-design-roadmap) |
| - [Input Dialects](#input-dialects) |
| - [Future MLIR XLA HLO Replacement](#future-mlir-xla-hlo-replacement) |
| - [`linalg`: High-level Hierarchical Optimization](#linalg-high-level-hierarchical-optimization) |
| - [XLA HLO: Canonicalizations](#xla-hlo-canonicalizations) |
| - [XLA HLO: Tensor to Primitive Conversion](#xla-hlo-tensor-to-primitive-conversion) |
| - [Quantization](#quantization) |
| - [`flow`: Data- and Execution-Flow Modeling](#flow-data--and-execution-flow-modeling) |
| - [Avoiding Readbacks with `flow.stream`](#avoiding-readbacks-with-flowstream) |
| - [Threading `flow.stream` through the CFG](#threading-flowstream-through-the-cfg) |
| - [Predication of `flow.dispatch`](#predication-of-flowdispatch) |
| - [Deduping `flow.executable`s](#deduping-flowexecutables) |
| - [Rematerializing CSE'd Expressions](#rematerializing-csed-expressions) |
| - [Device Placement](#device-placement) |
| - [`hal`: Hardware Abstraction Layer and Multi-Architecture Executables](#hal-hardware-abstraction-layer-and-multi-architecture-executables) |
| - [Allow Targets to Specify `hal.interface`s](#allow-targets-to-specify-halinterfaces) |
| - [Target-specific Scheduling Specialization](#target-specific-scheduling-specialization) |
| - [Buffer Usage Tracking](#buffer-usage-tracking) |
| - [Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation) |
| - [Target-aware Executable Compression](#target-aware-executable-compression) |
| - [Target-aware Constant Compression](#target-aware-constant-compression) |
| - [Command Buffer Stateful Deduplication](#command-buffer-stateful-deduplication) |
| - [Resource Timeline](#resource-timeline) |
| - [Transient Tensor Ringbuffer](#transient-tensor-ringbuffer) |
| - [Timeline Semaphores on the Module ABI](#timeline-semaphores-on-the-module-abi) |
| - [GPU-like CPU Scheduling](#gpu-like-cpu-scheduling) |
| - [`vm`: Lightweight Virtual Machine](#vm-lightweight-virtual-machine) |
| - [Coroutines for Batching and Cooperative Scheduling](#coroutines-for-batching-and-cooperative-scheduling) |
| - [Cellular Batching](#cellular-batching) |
| - [Lowering to LLVM IR](#lowering-to-llvm-ir) |
| - [Improved Type Support](#improved-type-support) |
| - [Indirect Command Buffer/On-Accelerator Execution](#indirect-command-bufferon-accelerator-execution) |
| |
| <!-- /TOC --> |
| |
| ## Input Dialects |
| |
| <a id="markdown-Input%20Dialects" name="Input%20Dialects"></a> |
| |
| ### Future MLIR XLA HLO Replacement |
| |
| <a id="markdown-Future%20MLIR%20XLA%20HLO%20Replacement" name="Future%20MLIR%20XLA%20HLO%20Replacement"></a> |
| |
IREE's current input dialect is the XLA HLO dialect representing operations on
tensors. This was a pragmatic decision based on HLO already being defined and
on the proof of existing models being lowered to it from TensorFlow, allowing
us to focus on the IREE-specific portions of the work. Unfortunately, HLO is
tied to TensorFlow and carries many quirks that would not have been designed in
had that not been the case. There are discussions happening about an upstream
MLIR
[Tensor Compute Primitives](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/)
dialect that HLO can be lowered into, allowing IREE (and other backends) to
decouple themselves from XLA and become easier to target from frontends.
| |
| ### `linalg`: High-level Hierarchical Optimization |
| |
| <a id="markdown-%60linalg%60%3A%20High-level%20Hierarchical%20Optimization" name="%60linalg%60%3A%20High-level%20Hierarchical%20Optimization"></a> |
| |
IREE requires that its inputs all be in tensor form (and not in-place memref
updates) in order to perform the large majority of the `flow` transformations.
Recent work in the [Linalg](https://mlir.llvm.org/docs/Dialects/Linalg/)
dialect is adding support for operating on value-semantic tensors, meaning that
we can first apply `xla_hlo` to `linalg` lowerings and any of the
transformations available in Linalg prior to performing our own `flow`
lowerings. The advantage is that Linalg will have much stronger and more
principled code motion and nested loop transformation optimizations than are
possible on higher-level ops. As not all operations can be represented as
`linalg` ops, IREE will be able to ingest a mix of `linalg`, `std`, and
`xla_hlo` (or its replacement) ops.
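
A sketch of what such mixed-dialect input might look like (op spellings and
the `linalg.generic` trait syntax are approximate and only for illustration):

```mlir
#map = affine_map<(d0) -> (d0)>

// Hypothetical mixed-dialect input: an xla_hlo op feeding a linalg op that
// operates on value-semantic tensors, with std ops inside the region.
func @mixed(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
  %0 = xla_hlo.add %arg0, %arg1 : tensor<4xf32>
  %1 = linalg.generic {args_in = 2 : i64, args_out = 1 : i64,
                       indexing_maps = [#map, #map, #map],
                       iterator_types = ["parallel"]} %0, %arg1 {
  ^bb0(%a: f32, %b: f32):
    %2 = mulf %a, %b : f32
    linalg.yield %2 : f32
  } : tensor<4xf32>, tensor<4xf32> -> tensor<4xf32>
  return %1 : tensor<4xf32>
}
```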
| |
| ### XLA HLO: Canonicalizations |
| |
| <a id="markdown-XLA%20HLO%3A%20Canonicalizations" name="XLA%20HLO%3A%20Canonicalizations"></a> |
| |
Very little effort has been applied to `xla_hlo` optimizations and there are a
significant number of missing folders, canonicalizers, and simple
transformations. Many of these happen in legacy XLA C++ backends; however, we
need them in MLIR so that we can make use of dynamic shapes, mixed dialect
inputs, etc. The `tf2xla` bridge work (converting TensorFlow models into the
corresponding `xla_hlo` ops) is nearing its initial milestones, and afterward
we expect more of these missing pieces to be filled in.
| |
| Examples of the optimizations that will greatly benefit IREE (and any other |
| backend consuming `xla_hlo`) include: |
| |
| - Eliding unneeded transpose, reshape, and broadcast operations. |
| - Inserting transpose, reshape, and broadcast operations to allow for more |
| optimal memory access patterns (such as transposing gather input to allow |
| for memcpy-like transfers instead of column-wise cache-unfriendly accesses). |
- Moving operations above broadcasts such that the smallest amount of work is
  performed (see the sketch after this list).
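
As a sketch of that last optimization (shapes and attribute values here are
illustrative only), multiplying two broadcasted values can be rewritten to
multiply the small pre-broadcast operands and broadcast the result once:

```mlir
// Before: the multiply runs over the full broadcasted 4x1024 shape.
%0 = "xla_hlo.broadcast_in_dim"(%x) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<4xf32>) -> tensor<4x1024xf32>
%1 = "xla_hlo.broadcast_in_dim"(%y) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<4xf32>) -> tensor<4x1024xf32>
%2 = xla_hlo.multiply %0, %1 : tensor<4x1024xf32>

// After: the multiply runs over only 4 elements before broadcasting.
%3 = xla_hlo.multiply %x, %y : tensor<4xf32>
%4 = "xla_hlo.broadcast_in_dim"(%3) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<4xf32>) -> tensor<4x1024xf32>
```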
| |
| ### XLA HLO: Tensor to Primitive Conversion |
| |
| <a id="markdown-XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion" name="XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion"></a> |
| |
HLO only operates on tensor values - even for simple scalars - and this
presents a problem when attempting to determine which code should run on
accelerators vs. which should run on the host. The canonical example is
`xla_hlo.while`, which, as seen in the example below, uses scalar tensors for
its loop iteration counter and comparison.
| |
| ```mlir |
| %start = constant dense<1> : tensor<i32> |
| %bound = constant dense<3> : tensor<i32> |
| %res = "xla_hlo.while"(%start) ( { |
| ^bb0(%count: tensor<i32>): |
| %1 = "xla_hlo.compare"(%count, %bound) {comparison_direction = "LT"} : (tensor<i32>, tensor<i32>) -> tensor<i1> |
| "xla_hlo.return"(%1) : (tensor<i1>) -> () |
| }, { |
| ^bb0(%count: tensor<i32>): |
| %1 = xla_hlo.add %count, %count : tensor<i32> |
| "xla_hlo.return"(%1) : (tensor<i32>) -> () |
| }) : (tensor<i32>) -> tensor<i32> |
| ``` |
| |
| A naïve but correct lowering (what's currently in IREE) would perform the |
| comparison and increment on the device and insert a host readback to see if the |
| loop should continue: |
| |
| ```mlir |
| func @main() -> tensor<i32> attributes {iree.module.export, iree.reflection = {f = "I1!R6!B3!t6", fv = "1"}} { |
| %cst = constant dense<1> : tensor<i32> |
| %cst_0 = constant dense<3> : tensor<i32> |
| %cst_1 = constant dense<1> : vector<3xi32> |
| br ^bb1(%cst : tensor<i32>) |
| ^bb1(%2: tensor<i32>): // 2 preds: ^bb0, ^bb2 |
| %3 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %2 : tensor<i32>, %arg2 = %cst_0 : tensor<i32>) -> tensor<i1> { |
| %8 = flow.dispatch @main_ex_dispatch_0::@main_ex_dispatch_0[%arg0 : vector<3xi32>](%arg1, %arg2) : (tensor<i32>, tensor<i32>) -> tensor<i1> |
| flow.return %8 : tensor<i1> |
| } |
| %4 = flow.tensor.load %3 : tensor<i1> |
| cond_br %4, ^bb2(%2 : tensor<i32>), ^bb3(%2 : tensor<i32>) |
| ^bb2(%5: tensor<i32>): // pred: ^bb1 |
| %6 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %5 : tensor<i32>) -> tensor<i32> { |
| %8 = flow.dispatch @main_ex_dispatch_1::@main_ex_dispatch_1[%arg0 : vector<3xi32>](%arg1) : (tensor<i32>) -> tensor<i32> |
| flow.return %8 : tensor<i32> |
| } |
| br ^bb1(%6 : tensor<i32>) |
| ^bb3(%7: tensor<i32>): // pred: ^bb1 |
| return %7 : tensor<i32> |
| } |
| ``` |
| |
Of note is the `flow.tensor.load` op indicating a host readback. Though this
correctly executes the loop, it is extremely inefficient. What's desired is for
the loop iterator and condition to be computed entirely on the host, with the
iterator being passed to the loop body as an argument that can be encoded into
a command buffer in future lowering stages. This eliminates the host readback
and allows for much larger `flow.stream` sequences, feeding more into the
pipeline for the accelerator.

Not all source frontends have this issue (misrepresenting simple host
computation as non-dense tensor operations), and our goal is to add a
transformation that heuristically converts `xla_hlo` ops acting on small
tensors to `std` ops acting on primitive values (`i32`, `index`, etc.).
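
A sketch of what the loop above could become after such a transformation
(exact `std` op spellings are approximate): the counter and comparison become
primitive host values, leaving the device involved only where dense work
actually exists.

```mlir
// Hypothetical host-side loop skeleton using std ops on primitive i32
// values; no tensors (and thus no readbacks) are involved in the control
// flow itself.
%start = constant 1 : i32
%bound = constant 3 : i32
br ^loop(%start : i32)
^loop(%count: i32):
  %cond = cmpi "slt", %count, %bound : i32
  cond_br %cond, ^body(%count : i32), ^exit(%count : i32)
^body(%count_body: i32):
  %next = addi %count_body, %count_body : i32
  br ^loop(%next : i32)
^exit(%result: i32):
  return %result : i32
```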
| |
| ### Quantization |
| |
| <a id="markdown-Quantization" name="Quantization"></a> |
| |
| It's assumed that any work related to quantization/compression has happened |
| prior to lowering into IREE dialects. Our plan is to use the proposed |
| [Quantization Transforms](https://llvm.discourse.group/t/rfc-a-proposal-for-implementing-quantization-transformations-in-mlir/655) |
| to achieve both training and inference-time quantization of types in a way that |
| preserves maximum accuracy. IREE will support running with original unquantized |
| floats in all cases, allowing for a smooth on-ramp to quantization and the gains |
| in performance and reduction in model size that come from it. |
| |
| As future work IREE would like to move beyond these transformation-directed |
| approaches to quantization and interface directly to frontends which have a |
| defined enough type system to represent accurate quantized (and otherwise |
| compressed) computations directly, not relying exclusively on compiler-side type |
| inference transforms. |
| |
| ## `flow`: Data- and Execution-Flow Modeling |
| |
| <a id="markdown-%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling" name="%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling"></a> |
| |
The `flow` dialect is designed to allow us to extract as much concurrency as
possible from a program and partition IR into the scheduling and execution
domains. Today we have the IR structure and transformation flow in place but
have not yet gotten to the most interesting things such an infrastructure
enables. A majority of the largest performance, latency, and memory usage
improvements IREE can offer are determined first here and all following
lowerings benefit. _The fastest code is the code you don't execute and the
smallest allocation is the allocation you don't make_ ;)
| |
| ### Avoiding Readbacks with `flow.stream` |
| |
| <a id="markdown-Avoiding%20Readbacks%20with%20%60flow.stream%60" name="Avoiding%20Readbacks%20with%20%60flow.stream%60"></a> |
| |
A majority of the readbacks we have today (manifested as `flow.tensor.load.*`
ops) will be removed when we have an
[HLO tensor->primitive conversion](#xla-hlo-tensor-to-primitive-conversion).
There will still be cases where readbacks are required for correctness, but
they usually fall into a small set of usage patterns. For those that don't,
this is one place where IREE will warn about performance issues, allowing
programs that perform suboptimally to still run while encouraging authors to
adjust their input models to enable better behavior. The IREE VM also has
specific support for hiding readback latency in an efficient way via
[coroutines](#coroutines-for-batching-and-cooperative-scheduling).
| |
| The most common case we are currently seeing in the IR is that of dynamic copies |
| where the offsets are dependent on the result of previous computations. Source |
| models may have top-k + gather operations, for example. These appear as a |
| `flow.stream`, a `flow.tensor.load`, and then another `flow.stream` that uses |
| the loaded value for a `flow.tensor.update` (or other operation): |
| |
| ```mlir |
| %index_tensor = flow.ex.stream.fragment(...) -> tensor<i32> { ... } |
| %index = flow.tensor.load %index_tensor : tensor<i32> |
| %result = flow.ex.stream.fragment(%arg0 = %index : i32, ...) -> ... { |
| %0 = flow.dispatch ... |
| %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32> |
| ... |
| } |
| ``` |
| |
Today the `flow.tensor.update` turns into HAL command buffer transfer
operations that must have their offsets known at recording time. This is a
limitation of `vkCmdCopyBuffer` but not a fundamental limitation of any
hardware. In fact, several drivers implement copies as small built-in shader
programs, meaning that we could perform the same expansion here with the right
primitives. This would allow, in the above example, both the index to be
computed and the tensor to be updated within the same stream, entirely removing
the host round-trip.
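
A sketch of what that could look like (the indirect update op here is
hypothetical and does not exist today): the index never leaves the device and
both fragments fuse into a single stream.

```mlir
// Hypothetical fused form: the dynamically computed index is consumed by an
// indirect tensor update within the same stream, so no flow.tensor.load
// (host readback) is required.
%result = flow.ex.stream.fragment(%arg0 = %input : tensor<1x10xf32>, ...) -> tensor<1x10xf32> {
  %index = flow.dispatch @compute_index::@compute_index[...](...) : (...) -> tensor<i32>
  %0 = flow.dispatch ...
  %1 = flow.tensor.update.indirect %0, %arg0[%index] : tensor<10xf32> -> tensor<1x10xf32>
  flow.return %1 : tensor<1x10xf32>
}
```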
| |
| ### Threading `flow.stream` through the CFG |
| |
| <a id="markdown-Threading%20%60flow.stream%60%20through%20the%20CFG" name="Threading%20%60flow.stream%60%20through%20the%20CFG"></a> |
| |
The current `flow.ex.stream.fragment`, as denoted by the `ex`perimental tag, is
a temporary implementation designed to get the concept of streams lowered to
the HAL dialect. For streams to be effective at modeling larger concurrency
scopes they need to be able to move across branches in the CFG. This
intuitively follows exactly what one would do if recording commands by hand
with an API like Vulkan:
| |
| ```c++ |
| vkCmdCopyBuffer(cmd, ...); |
| if (some_flag) { |
| vkCmdBindPipeline(cmd, ..., pipeline_a); |
| } else { |
| vkCmdBindPipeline(cmd, ..., pipeline_b); |
| } |
| vkCmdDispatch(cmd, ...); |
| ``` |
| |
| The corresponding `flow` IR: |
| |
| ```mlir |
| flow.stream.append[%s0](...) { |
| flow.tensor.update ... |
| } |
| %b = cmpi ne %some_flag, ... |
| cond_br %b, ^a(%s0), ^b(%s0) |
| ^a(%s1): |
| flow.stream.append[%s1](...) { |
| flow.dispatch @pipeline_a, ... |
| } |
| br ^end(%s1) |
| ^b(%s2): |
| flow.stream.append[%s2](...) { |
| flow.dispatch @pipeline_b, ... |
| } |
| br ^end(%s2) |
| ^end(%s3): |
| ... |
| ``` |
| |
| This allows the entire stream to be lowered into one command buffer without the |
| need for any host round-trips. The conversion into the `flow` dialect will walk |
| the CFG and attempt to thread the `flow.stream` values through so long as there |
| are no external dependencies. |
| |
| ### Predication of `flow.dispatch` |
| |
| <a id="markdown-Predication%20of%20%60flow.dispatch%60" name="Predication%20of%20%60flow.dispatch%60"></a> |
| |
While the
[`flow.stream` threading through the CFG](#threading-flowstream-through-the-cfg)
can remove many of the simpler conditional dispatches, there will always be
some whose execution depends on the results of prior dispatches. For these, a
`flow.cond_dispatch` will allow a condition to be provided that must be true
for the dispatch to actually be performed.
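
A strawman of what `flow.cond_dispatch` might look like (the op does not exist
yet; the syntax mirrors `flow.dispatch`):

```mlir
// The condition is computed on-device by a prior dispatch; the predicated
// dispatch only executes if %cond is true at execution time.
%cond = flow.dispatch @check::@check[%c1 : vector<3xi32>](%state) : (tensor<i32>) -> tensor<i1>
%res = flow.cond_dispatch %cond, @pipeline_a::@entry[%workload : vector<3xi32>](%arg0) : (tensor<4xf32>) -> tensor<4xf32>
```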
| |
For targets that natively support predication in their command buffers (such as
D3D12's
[ID3D12GraphicsCommandList::SetPredication](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12graphicscommandlist-setpredication))
this provides a host round-trip-free way of conditionally executing dispatches
and transfers. Unfortunately core Vulkan support is still lacking, though
NVIDIA supports the
[VK_EXT_conditional_rendering](https://www.saschawillems.de/blog/2018/09/05/vulkan-conditional-rendering/)
extension that exposes the same behavior.
| |
| For targets that do not support predication natively it's still possible to |
| emulate predication with |
| [indirect dispatches](https://github.com/gpuweb/gpuweb/issues/31). In this model |
| the workgroup counts normally used to dispatch execution are sourced from |
| another device buffer at the time the dispatch is made instead of sourced from |
| the command buffer at the time the dispatch is recorded. Degenerate dispatches |
| with counts of `0, 0, 0` allow for effective neutering of the dispatch with |
| minimal overhead (vs. the significant penalty of a host round-trip!). |
| |
| By modeling such predication at the `flow` level we are able to lower into the |
| HAL with target-aware predication semantics and fuse indirect dispatch workgroup |
| count calculations into existing dispatches already being performed such that |
| overhead is reduced. |
| |
| ### Deduping `flow.executable`s |
| |
| <a id="markdown-Deduping%20%60flow.executable%60s" name="Deduping%20%60flow.executable%60s"></a> |
| |
| While still in the `flow` dialect, the executables are target-agnostic. This |
| makes simple IR tree diffing a potential solution to deduplication. Since most |
| of the dispatches originate from the same source-language library calls in input |
| frameworks there's a high likelihood of duplication, and depending on when |
| inlining is performed we may have stronger or weaker ability to perform the |
| deduplication. Thanks to the MLIR canonicalization pass (that ensures ops are |
| rearranged into consistent canonical representations) the IR comparisons can be |
| done rather trivially. |
| |
| ### Rematerializing CSE'd Expressions |
| |
| <a id="markdown-Rematerializing%20CSE'd%20Expressions" name="Rematerializing%20CSE'd%20Expressions"></a> |
| |
Common subexpression elimination is performed many times during lowering;
however, there comes a point where CSE can introduce false dependencies and
additional allocations that are otherwise avoidable. For example, if a
broadcasting operation is CSE'd and its result is then used by two or more
operations that are scheduled independently, what would have been a relatively
cheap lowering of the broadcast to a simple index remapping now becomes an
additional dispatch, materialization of an intermediate tensor, and a barrier:
| |
| ```mlir |
| %bcast = "xla_hlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32> |
| %mul1 = xla_hlo.multiply %arg0, %bcast : tensor<1024x10xf32> |
| // (pretend something here that prevents fusion) |
| %mul2 = xla_hlo.multiply %arg1, %bcast : tensor<1024x10xf32> |
| ``` |
| |
After partitioning into dispatch regions, the CSE'd broadcast becomes its own
dispatch feeding both consumers:

```mlir
| %bcast = flow.dispatch.region(%cst : tensor<f32>) -> tensor<1024x10xf32> { |
| %0 = "xla_hlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32> |
| return %0 : tensor<1024x10xf32> |
| } |
| // a barrier will be required here |
| %mul1 = flow.dispatch.region(%arg0 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> { |
| %1 = xla_hlo.multiply %arg0, %bcast : tensor<1024x10xf32> |
| return %1 : tensor<1024x10xf32> |
| } |
| %mul2 = flow.dispatch.region(%arg1 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> { |
| %2 = xla_hlo.multiply %arg1, %bcast : tensor<1024x10xf32> |
| return %2 : tensor<1024x10xf32> |
| } |
| ``` |
| |
Instead, the broadcast should be rematerialized inside of both dispatch
regions, as the cost of doing so is significantly less in compute resources and
the intermediate tensor will then not be required at all. Though at first it
may seem counter-intuitive to undo such a critical optimization as CSE (which
often helps both code size and compute), it's something we must carefully
balance while looking at the whole system. It gets even more important when
considering multi-device execution, as the cost of sharing memory and
synchronizing may be extremely non-trivial.
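
A sketch of the rematerialized form, reusing the dispatch region structure
from above: the broadcast folds into each consumer and the intermediate
`%bcast` tensor (and its barrier) disappears.

```mlir
%mul1 = flow.dispatch.region(%cst : tensor<f32>, %arg0 : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  // Rematerialized broadcast: a cheap index remapping within the dispatch.
  %0 = "xla_hlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  %1 = xla_hlo.multiply %arg0, %0 : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
%mul2 = flow.dispatch.region(%cst : tensor<f32>, %arg1 : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %0 = "xla_hlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  %1 = xla_hlo.multiply %arg1, %0 : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
```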
| |
| ### Device Placement |
| |
| <a id="markdown-Device%20Placement" name="Device%20Placement"></a> |
| |
While still within the `flow` dialect we have the ability to easily split
streams and safely shuffle around operations. Target execution backends can opt
into such behavior to ensure that device restrictions such as maximum in-flight
memory, maximum scheduling depth, and capabilities are observed. For
heterogeneous configurations the intent is that certain operations, dispatches,
and streams can be attributed to specify which device categories they should be
lowered to. The constraint solving that takes place can be provided with
generic heuristics ("big GEMMs go on the accelerator"), profile-guided
databases based on benchmarks, learned traits via ML, etc.
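
As a sketch (the attribute name is hypothetical), such attribution might be as
simple as tagging a dispatch with the device category the constraint solver
selected:

```mlir
// A large GEMM pinned to an accelerator category; untagged dispatches are
// left for heuristics or profile-guided placement to decide.
%0 = flow.dispatch @gemm::@gemm[%workload : vector<3xi32>](%lhs, %rhs)
    {iree.placement = "accelerator"} : (tensor<2048x2048xf32>, tensor<2048x2048xf32>) -> tensor<2048x2048xf32>
```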
| |
| ## `hal`: Hardware Abstraction Layer and Multi-Architecture Executables |
| |
| <a id="markdown-%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables" name="%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables"></a> |
| |
As the IREE HAL is designed almost 1:1 with a compute-only Vulkan API, many of
the techniques classically used in real-time graphics apply. The benefit we
have by modeling our usage of such a low-level API in IR is that the normal
work - some of which is very non-trivial - for managing allocations, tracking
resource lifetime, and ensuring proper synchronization/barriers is something we
can apply the full force of an offline compiler against.
| |
| ### Allow Targets to Specify `hal.interface`s |
| |
| <a id="markdown-Allow%20Targets%20to%20Specify%20%60hal.interface%60s" name="Allow%20Targets%20to%20Specify%20%60hal.interface%60s"></a> |
| |
The `hal.interface` op specifies the ABI between the scheduler and the device,
containing the buffer bindings and additional non-buffer data (parameters,
shapes, specialization flags, etc). Today a naïve ordering is used uniformly
for all targets; however, it is possible for target backends to opt into
providing their own interfaces based on target configuration. The same
`hal.executable` may have multiple interfaces and the same backend may use one
or more. This is useful when target capabilities may vary at runtime, such as
the
[number of available storage buffer bindings](https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxPerStageDescriptorStorageBuffers&platform=android)
in Vulkan. By exposing a few `hal.interface` variants with different binding
counts the Vulkan backend could make better use of the larger number of
bindings available at runtime while still providing support for smaller
configurations.
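
As a sketch (binding layouts illustrative), the same executable could expose a
minimal variant for constrained devices alongside a wider variant used when
the runtime reports enough storage buffer bindings:

```mlir
// Minimal variant: arguments packed into a single buffer for devices with
// few available bindings.
hal.interface @io_packed {
  hal.interface.binding @args, set=0, binding=0, type="StorageBuffer", access="Read"
  hal.interface.binding @ret0, set=0, binding=1, type="StorageBuffer", access="Write|Discard"
}
// Wider variant: one binding per argument when the device limits allow.
hal.interface @io_wide {
  hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
  hal.interface.binding @arg1, set=0, binding=1, type="StorageBuffer", access="Read"
  hal.interface.binding @ret0, set=0, binding=2, type="StorageBuffer", access="Write|Discard"
}
```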
| |
| Once we have multiple `hal.interface`s defined for executables the scheduler |
| needs to emit HAL ops that properly switch between them. By having a canonical |
| form for bindings we can ensure that only the differences between the interfaces |
| will need additional code. |
| |
| ### Target-specific Scheduling Specialization |
| |
| <a id="markdown-Target-specific%20Scheduling%20Specialization" name="Target-specific%20Scheduling%20Specialization"></a> |
| |
Though the `flow` dialect attempts to fuse as many ops as possible into
dispatch regions, it's not always possible for all target backends to schedule
a region as a single dispatch. A classic example is an algorithm like
[parallel reduction](https://en.wikipedia.org/wiki/Reduction_Operator#PRAM-algorithm),
commonly used on GPUs, that may require many dispatches to identical
executables, while other algorithms may vary the executables they use based on
input parameters such as shape or the target runtime device support.
| |
By default the `flow.dispatch` executable translation to `hal.executable`s is
performed 1:1 and it is assumed that a single dispatch is required. Extending
target backends with scheduling interfaces (enabling them to opt into different
scheduling behavior) will allow the backends to emit any number of
`hal.executable`s and any stream commands (such as additional dispatches or
transfers) they may need. This is effectively equivalent to what would
otherwise be done at runtime, except that because we are still operating on IR
prior to buffer allocation we can use the `hal` ringbuffer primitive. Through
this we can elide many of the allocations that would otherwise be required at
runtime (and the concurrency-limiting false dependencies that usually come
along with scratch memory).
| |
Since the algorithm used may vary based on the parameters of the dispatch (such
as the shape of the reduction, which may be dynamically determined), scheduling
specialization may occur even when targeting a single backend. In many cases
folding and canonicalization can eliminate the overhead, as the same IR is
present regardless of which dynamically computed workgroup size is used.
| |
| ### Buffer Usage Tracking |
| |
| <a id="markdown-Buffer%20Usage%20Tracking" name="Buffer%20Usage%20Tracking"></a> |
| |
Many explicit hardware APIs require knowing how buffers are used along with
where they should be located. For example, this additional information
determines the caching policy on buffer accesses (write-through, write-back,
etc), the visibility of writes across compute units, and the possible MMU
properties that may need to be maintained/matched for the buffer. By using the
SSA-form value-semantics of the MLIR `tensor` as used in the `flow` dialect we
have complete information about where buffers are used, or at least where they
enter or leave regions from which we can derive such information.
| |
Analysis passes can run over the IR to attribute tensors such that when
allocation is performed during lowering to the `hal` dialect we do so from an
allocator compatible with where the buffer will be used, with memory types
chosen based on the potential cost and location of operations performed
(write-only on host vs. read-write on host and device, etc), and with usage
bits indicating what kind of operations may be performed on the buffer. Many of
these are local transformations, as most buffers are only live within very
small regions such as the `flow.stream` encompassing their usage.
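
A sketch of what the resulting allocations might look like (op syntax
approximate): usage analysis picks the memory type and usage bits per buffer
rather than defaulting to the most permissive combination.

```mlir
// A transient tensor only touched by dispatches gets device-local memory and
// dispatch-only usage; a result read back by the caller gets host-visible
// memory with transfer usage instead.
%transient = hal.allocator.allocate %allocator, "DeviceLocal", "Dispatch", %size0 : !hal.buffer
%readback = hal.allocator.allocate %allocator, "DeviceLocal|HostVisible", "Transfer|Dispatch", %size1 : !hal.buffer
```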
| |
Traditional systems need to use either very permissive buffer properties or
heuristics that can introduce additional non-trivial overhead when those
heuristics are incorrect. For example,
[OpenGL had several such usage hints](https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferData.xhtml)
that drivers were then able to use, but almost no drivers behaved as desired in
all cases and it led to additional memory ghosting, copies, readbacks, and
unpredictable performance. For almost all uses of the buffers within an IREE
invocation we can instead know precisely where and how buffers need to be moved
and can do so the minimum number of times required.
| |
| ### Batched Executable Caching and Precompilation |
| |
| <a id="markdown-Batched%20Executable%20Caching%20and%20Precompilation" name="Batched%20Executable%20Caching%20and%20Precompilation"></a> |
| |
| For targets that may require runtime preprocessing of their executables prior to |
| dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch |
| compilation mechanism based on Vulkan's |
| [Pipeline Cache](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html). |
| |
| Today each executable is compiled on-demand and cached only for the process |
| lifetime. Though some drivers may provide their own caching we can make better |
| use of the explicit caching and compilation behavior with the additional |
| information we have in the compiler. |
| |
For any given entry point (or group of entry points) into an IREE module we can
perform reachability analysis to know which executables may be executed when
that entry point is invoked. In this way we can emit pre-invocation compilation
checks (similar to a `std::call_once` block) that provide all required
executables for compilation and allow more efficient compilation through
multithreading the compiler invocations. These same compilation caching
functions can be exposed and invoked manually by an application to force
pre-compilation when it is least likely to impact the user, such as a
post-install/first-run step or concurrently while other application features
are loading.
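
A strawman of the emitted check (all ops here are hypothetical): each entry
point guards a one-time batch preparation of every executable it can reach.

```mlir
func @predict(%arg0 : tensor<4xf32>) -> tensor<4xf32> {
  // Runs at most once per context; all executables reachable from @predict
  // are handed to the cache together so compilation can be multithreaded.
  vm.call.once @__prepare_predict_executables() : () -> ()
  ...
}
func @__prepare_predict_executables() {
  hal.executable_cache.prepare @_exe_cache, [@predict_ex_dispatch_0, @predict_ex_dispatch_1]
  return
}
```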
| |
| We can use zero or more scoped caches for executables within a module. |
| Completely dynamic modules (such as those emitted in eager-mode usage) may avoid |
| the caching overhead entirely, while modules that have several primary usage |
| modes (such as training and inference) may choose to use independent caches for |
| each such mode. |
| |
| The caches generated can then be retrieved and saved by the hosting application. |
| Upon the next execution the application can provide the caches and if still |
| valid they will be used to avoid compilation. |
| |
| ### Target-aware Executable Compression |
| |
| <a id="markdown-Target-aware%20Executable%20Compression" name="Target-aware%20Executable%20Compression"></a> |
| |
An advantage of representing executable binaries in IR after translation is
that we can apply various post-compilation compression and minification
techniques while still knowing precisely where the executable will be used.
This is extremely important for SPIR-V as it is not designed to be a small
at-rest format. Though the biggest lever we have to control generated code size
is higher-level deduplication and specialization, there will still be a
sufficiently large number of executable binaries we will need to embed within
the final modules, and having targeted approaches for reducing their size
beyond just "gzip everything" is very powerful.
| |
| For example, [SMOL-V](https://github.com/aras-p/smol-v) is a fantastic lossless |
| SPIR-V compression technique that, when coupled with modern dictionary-based |
| compression algorithms, can save significant binary size. As a data point, the |
| SPIR-V corpus SMOL-V uses for testing goes from 4.8MiB of raw SPIR-V to 348KiB |
| of compressed SMOL-V. |
| |
| Combined with |
| [Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation) |
| we can easily use shared dictionaries and other cross-artifact compression in a |
| relatively plug-in way. |
| |
| ### Target-aware Constant Compression |
| |
| <a id="markdown-Target-aware%20Constant%20Compression" name="Target-aware%20Constant%20Compression"></a> |
| |
This is still an area that needs more research, but one goal of the IREE design
was to enable efficient target- and context-aware compression of large
constants (typically model weights/parameters/embeddings). This may mean
reusing existing hardware compression formats on GPUs, ML accelerator-specific
formats, or very-low-bit-depth (1-4 bits per value) quantization techniques
that cannot be directly used without first decompressing. The inspiration here
is formats like [Crunch](https://github.com/BinomialLLC/crunch) and
[Basis Universal](https://github.com/BinomialLLC/basis_universal) that perform
["supercompression"](http://gamma.cs.unc.edu/GST/gst.pdf), and we may even be
able to use these directly, as we could then make use of GPU hardware samplers
to do the 4-bit to 32-bit decompression, etc.
| |
| ### Command Buffer Stateful Deduplication |
| |
| <a id="markdown-Command%20Buffer%20Stateful%20Deduplication" name="Command%20Buffer%20Stateful%20Deduplication"></a> |
| |
The IREE HAL - much like the Vulkan API it is based on - eschews much of the
state that traditional APIs have in favor of (mostly) immutable state objects
(pipeline layouts, pipeline states, descriptor sets, etc). There are still a
few stateful entry points in the API, though, and deduplicating or reordering
redundant calls can reduce IR, API, and execution overhead.

The key place this will have the largest impact is around descriptor set
bindings and push descriptors, both of which are stateful and can have
non-trivial setup overhead. A canonicalization for such commands that inspects
the target `hal.command_buffer` to see whether the same state was already set,
plus code motion to hoist such commands out of loop bodies when possible, would
be helpful.
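
A sketch of the kind of redundancy this targets (HAL op syntax approximate):
the second push of identical descriptor state is removable, and loop-invariant
pushes can be hoisted out of loop bodies entirely.

```mlir
hal.command_buffer.push_descriptor_set %cmd, %layout, set=0, bindings=[0 = (%buf, %off, %len)]
hal.command_buffer.dispatch %cmd, %exe, entry_point=0, workgroup_xyz=[%x, %y, %z]
// Identical state pushed again with no intervening change: elidable.
hal.command_buffer.push_descriptor_set %cmd, %layout, set=0, bindings=[0 = (%buf, %off, %len)]
hal.command_buffer.dispatch %cmd, %exe2, entry_point=0, workgroup_xyz=[%x, %y, %z]
```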
| |
| ### Resource Timeline |
| |
| <a id="markdown-Resource%20Timeline" name="Resource%20Timeline"></a> |
| |
A core concept of the IREE scheduler that allows for overlapping in-flight
invocations is that of the resource timeline. This identifies module state that
can be in use by multiple invocations and assigns timeline milestones denoting
when the resource will be in the appropriate state for the current invocation
to proceed. Conceptually it is like an epoch-based synchronization mechanism of
the kind commonly found in garbage collectors to allow for lock-free
asynchronous memory reclamation.
| |
The advantage we have in the IR is that we know both the usage of all
resources, thanks to [buffer usage tracking](#buffer-usage-tracking), and the
synchronization domains of all resources (in most cases). This allows us to
effectively assign one timeline semaphore per writeable resource while in
practice having far fewer than 1:1; for example, if two resources are only ever
written in the same command buffer, only one semaphore is needed to signal the
completion of both writes.
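
A sketch of the idea (ops here are hypothetical): two variables written by the
same submission share one semaphore, and later invocations wait on its
milestone rather than on the whole device going idle.

```mlir
// Invocation N writes @var_a and @var_b in one command buffer and signals
// milestone %n on the shared semaphore when both writes complete.
hal.ex.submit_and_signal %cmd, %sem, %n
// Invocation N+1 need only wait for milestone %n before reading either
// variable; unrelated work can proceed concurrently.
hal.semaphore.await %sem, min_value = %n
```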
| |
| By transforming IR to sink all resource reads and writes closest to where the |
| value is used we can enlarge the time windows that can overlap across |
| invocations that may share those resources. This is similar to what out-of-order |
| CPUs do with register renaming/reorder buffers/etc and something we can apply |
| some traditional instruction scheduling techniques to (only here our |
| 'instructions' are entire command buffer dispatches/transfers). |
| |
Two degenerate cases of this approach are those of resource indirection
(`iree.ptr<tensor<T>>`) and dynamic resource shapes. In these two cases it may
not be possible to continue recording commands even if we are able to ensure
execution is appropriately synchronized. This is where indirect dispatch,
[predication](#predication-of-flowdispatch),
[indirect command buffers](#indirect-command-bufferon-accelerator-execution),
and [VM coroutines](#coroutines-for-batching-and-cooperative-scheduling) can
all help cover for the times where we are unable to transform away the
indirection or emit shape logic without data dependencies.
| |
| ### Transient Tensor Ringbuffer |
| |
| <a id="markdown-Transient%20Tensor%20Ringbuffer" name="Transient%20Tensor%20Ringbuffer"></a> |
| |
(When properly implemented) almost all buffers required during execution never
escape the command buffers they are used in or a single VM invocation. We can
trivially identify this from the explicit captures of `flow.stream` and
`flow.dispatch` ops and the fact that all tensor types have value-semantics.
Only those tensor values loaded-from/stored-to module state or that cross the
exported module function boundary need special consideration, while almost
everything else can live transiently only so long as it is required during
execution.
| |
| Thanks to this information about buffer usage and lifetime we can use a |
| [ringbuffer](https://en.wikipedia.org/wiki/Circular_buffer) to store the |
| transient tensor data and other required data reservations such as uniform |
| buffers used to pass dynamic parameters (shapes, flags, etc) into dispatches. |
| This gives the compiler and the application a knob that allows them to control |
| maximum concurrency (by having a very large ringbuffer) or maximum memory usage |
| (by having a minimally small ringbuffer). |
| |
Allocating tensors from the ringbuffer does not require sophisticated runtime
packing, as we can emit IR to calculate the required sizes for dynamically
shaped tensors. Whether a basic block reserves `%sz = constant 42 : index`
bytes or `%sz = std.muli %cst, %dyn_dim : index` bytes doesn't materially
change how the allocations are performed. Since almost all usage involves
simple write head bumps, there is no need for ahead-of-time memory planning or
large fixed allocations, and since no buffer within the ringbuffer can alias we
can have coarse (_read: low overhead_) guarantees about the availability of
certain regions of the ringbuffer (_"when this event is signaled all prior
ringbuffer writes have completed"_).
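
A sketch of a dynamically sized reservation (the ringbuffer op is
hypothetical): the size math is plain IR and the allocation itself is just a
bump of the write head.

```mlir
// size = %dyn_dim * 1024 elements * 4 bytes per f32.
%c1024 = constant 1024 : index
%c4 = constant 4 : index
%elems = muli %dyn_dim, %c1024 : index
%size = muli %elems, %c4 : index
%buffer = hal.ringbuffer.reserve %ring, %size : !hal.buffer
```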
| |
| Usually any planning we may want to perform can be done in IR via code motion. |
| For example applying traditional algorithms used to reduce register pressure |
| will help us attain narrower live windows within the ringbuffer leading to a |
| larger number of in-flight operations for the same ringbuffer memory usage. |
| |
| We may end up using both a classical ringbuffer and a variant known as the |
| [bip buffer](https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The-Circular-Buffer-with-a-Twist) |
| because it is better for descriptor set utilization (as we can provide many |
| dispatch parameters with a single base offset bound once at the beginning of a |
| region). |
| |
| ### Timeline Semaphores on the Module ABI |
| |
| <a id="markdown-Timeline%20Semaphores%20on%20the%20Module%20ABI" name="Timeline%20Semaphores%20on%20the%20Module%20ABI"></a> |
| |
Function calls made across modules (either from C++ into the VM, VM->VM, or
VM->C++) should be able to define timeline semaphores used to wait and signal
on the call. We can do this by making all exports automatically have the
semaphores and then making invocations populate them if they were not provided
by the caller. In this way we can allow multiple invocations of exported
functions to chain naturally with internal asynchronous workloads, turning most
IREE invocations into just the recording of command buffers that never block.
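
A sketch of what such an export might look like (types and attribute names
illustrative): the wait/signal pairs become explicit arguments, defaulted by
the runtime when the caller doesn't provide them.

```mlir
// The invocation waits until %wait_sem reaches %wait_value, records its
// command buffers, and signals %signal_sem to %signal_value when the result
// is ready - never blocking the calling thread.
func @predict(%wait_sem : !hal.semaphore, %wait_value : i32,
              %input : tensor<4xf32>,
              %signal_sem : !hal.semaphore, %signal_value : i32)
    -> tensor<4xf32> attributes {iree.module.export}
```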
| |
| When combined with |
| [VM coroutine support](#coroutines-for-batching-and-cooperative-scheduling) we |
| even have the ability to interleave any required host execution between the wait |
| and signal semaphores provided such that the caller never knows on which device |
| execution is taking place. It's still possible to provide synchronous wrappers |
| that emulate blocking behavior but by having the core system designed around a |
| single system-supported primitive we avoid the need for additional things like |
| interrupt watchdog threads, implicit blocking, and other pitfalls. |
| |
| ### GPU-like CPU Scheduling |
| |
| <a id="markdown-GPU-like%20CPU%20Scheduling" name="GPU-like%20CPU%20Scheduling"></a> |
| |
One approach to using multiple cores on a CPU is to perform interior
parallelization of operations, such as with OpenMP or library-call-based custom
thread pools (as in gemmlowp). This works when each individual operation is
relatively costly vs. the potential pipeline bubbles caused by work spinning
down near the end of one operation and spinning up at the beginning of the
next.
| |
IREE is designed to handle many more workloads - some of which have very narrow
shapes but very deep pipelines (like search algorithms) - such that the above
approach of multithreading within ops becomes a bottleneck. These workloads are
traditionally very poorly handled by frameworks, and issues with
oversubscription, pipeline stalls, and suboptimal system schedulers (such as on
Android) can lead to more time being spent thrashing about than actually
executing real work.
| |
The approach we take here is to treat the cores of a CPU as if they were
computation units on a GPU, each able to perform some set of heterogeneous work
independent of the other units. The concurrency we model at the `flow` level
and communicate to the runtime via the `hal` dialect - which explicitly states
which dispatches can overlap and the sizes of the workgroups - can then
trivially be used to distribute this work over many cores exactly as a GPU
would do it. Integration with library calls that may require their own
threading (such as Ruy) requires that they be able to use the IREE thread pool
instead of their own.
| |
In this way we can avoid pipeline bubbles and other latency-inducing
unpredictable scheduling. This does not mean that we treat individual units of
work at the same scale as we would for GPUs, but instead that we tile and have
one or more processing units work on those tiles. Whether the tile size is
defined by a library call contract, heuristics, or empirically is TBD, but
expect workgroup sizes in the thousands to millions of invocations vs. normal
GPU workgroup sizes in the dozens to hundreds of invocations.
| |
To achieve this style of scheduling efficiently we'll likely use
[marl](https://github.com/google/marl) as the scheduler. Marl provides
cross-platform low-overhead fibers and is compatible with this style of
scheduling as it was built for the SwiftShader software rasterizer.
| |
Even if IREE were only targeting CPUs, the assertion is that we would still
want to schedule this way; it's only an incidental benefit that when building
for heterogeneous targets the scheduling code may be shared (just with a
different divisor for workgroup count calculations).
| |
| ## `vm`: Lightweight Virtual Machine |
| |
| <a id="markdown-%60vm%60%3A%20Lightweight%20Virtual%20Machine" name="%60vm%60%3A%20Lightweight%20Virtual%20Machine"></a> |
| |
| The VM is designed as a dynamic linkage ABI, stable bytecode representation, and |
| intermediate lowering IR. Many of the optimizations we can perform on it will |
| benefit all use cases (such as when lowering to LLVM IR) by allowing |
| higher-level program transformations around synchronization that are difficult |
| to perform on arbitrary LLVM IR. |
| |
| ### Coroutines for Batching and Cooperative Scheduling |
| |
| <a id="markdown-Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling" name="Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling"></a> |
| |
| One of the largest features currently missing from the VM is coroutines (aka |
| user-mode fiber scheduling). Coroutines are what will allow us to have multiple |
| in-flight invocations into a module - some of which may be waiting on external |
| events - without the need for complex multithreading logic or state machine |
| machinations. |
| |
In many cases
[once semaphores are exposed to callers](#timeline-semaphores-on-the-module-abi)
we will not need to yield in the VM. The user will call into the module with
provided semaphores, the work to perform will be recorded to one or more
command buffers and submitted to the device, and then control will return to
the caller immediately.
| |
| In cases requiring host readbacks that we were not able to remove, however, |
| additional VM code may need to run prior to when the final semaphore is |
| signaled. To preserve the asynchronous interface and immediate execution |
| guarantees the compiler can emit explicit yield points (`vm.yield`) that are |
| known-good locations for yielding (such as most resources not required after the |
| yield having been flushed/discarded, partial synchronization scope availability |
| if other work may be able to execute concurrently irrespective of the yielded |
| coroutine, etc). |
| |
When the VM encounters the yield at runtime it will suspend the coroutine until
a defined condition is met. Many coroutines can be in various states at any
given time and - thanks to the resource timeline - can still be memory safe.
For example, if two stateless invocations are made with a common wait
semaphore, both can be recorded and submitted without waiting on each other. If
internal module state is accessed, the invocations are implicitly ordered by
invocation order (similar to what Vulkan calls
[API order](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch02s02.html#fundamentals-queueoperation-apiorder))
based on internal resource timeline semaphores.
| |
Waking the coroutines can be performed by an application-provided callback when
the application already has a periodic event doing bookkeeping (such as
frame-end callbacks when rendering or Looper idle events on Android), giving
the application direct control over the frequency and location at which IREE
performs additional work. A helper that runs a dedicated IREE thread to do this
will be provided as well, but the expectation is that applications can often do
a better (and, importantly, more predictable) job.
| |
| By utilizing coroutines IREE will have a way to fill traditional pipeline |
| bubbles even with execution from the same module (let alone across modules) in |
| the situation where host readbacks or other logic is required. This increases |
| overall throughput and utilization while reducing host wakeups as many |
| coroutines can be processed at once to submit new work to the device queues, |
| though it does not help reduce per-invocation latency. |
| |
| External code such as the HAL implementation or user ops may provide the wait |
| handles used for continuation. For example, the HAL can expose a function that |
| yields and wakes only when one or more timeline semaphores reach their target |
| values: |
| |
| ```mlir |
| // submit work |
| hal.device.yield %semaphore4 >= %sem4_target, %semaphore5 >= %sem5_target |
| // continue here, possibly much later in time |
| ``` |
| |
| #### Cellular Batching |
| |
| <a id="markdown-Cellular%20Batching" name="Cellular%20Batching"></a> |
| |
Though coroutines help throughput, there is a way we've found to reduce latency
that's been documented as
[cellular batching](http://madsys.cs.tsinghua.edu.cn/publications/EUROSYS2018-gao.pdf).
This same technique has been implemented in prior internal systems and is one
of the motivating design goals for IREE's creation. The core idea is to
identify small uniform work that can be partitioned and scheduled greedily so
as to enable batching or reduce associated invocation costs (such as refreshing
accelerator SRAM/caches with new parameters). This usually manifests as finding
large GEMM/GEMV operations using the same fixed parameters and either
dynamically increasing the batch size by adding the waiting work (without
deferring the actual execution time) or sequencing them back to back to ensure
better cache utilization. Which approach is taken depends on any data
dependencies that may be present (such as LSTM state feedback edges).
| |
With the foundation of coroutines in IREE it's possible to yield execution at
any given point - including during command buffer recording - and wake on
specific conditions. A majority of the logic can be built into the module
itself with very little need for runtime machinery, as shared VM variables can
be used to track pending work across invocations (even from different parts of
the program) and flush based on logic wholly controlled by the user or compiler
(such as count limits, maximum time/latency limits, etc). This allows for the
large variety of scheduling behaviors various applications may want, ranging
from zero-latency batch-only-within-this-invocation behavior to
[Nagle's Algorithm](https://en.wikipedia.org/wiki/Nagle%27s_algorithm)-esque
time- or limit-based behavior, or even some learned model-specific windowing.
| |
Design work is still required on how to represent this in IR, but the current
thought is to model the regions in which deferred execution is possible and
beneficial, and to allow additional transformations during lowering to the VM.
This is similar to how the async-await behavior works in C#, where the async
keyword is just sugar that expands to additional generated helper utilities.
| |
| A simple strawman representation for sequential dispatch may look like: |
| |
| ```mlir |
| hal.scheduling_policy @defer_policy { |
| // max time, max count, max live memory, etc |
| } |
| ... |
| hal.command_buffer.dispatch.deferred @defer_policy, @dispatch, ... |
| // vm.yield added here during lowering |
| ``` |
| |
There are many cases to explore, and as cellular batching can have performance
benefits of several orders of magnitude it'll be one of the primary areas of
research in the long term.
| |
| ### Lowering to LLVM IR |
| |
| <a id="markdown-Lowering%20to%20LLVM%20IR" name="Lowering%20to%20LLVM%20IR"></a> |
| |
For scenarios where dynamic module loading is not required and entire modules
can be compiled into applications, we can lower the VM IR to LLVM IR within
MLIR's transformation pipeline. Instead of embedding `vm.call` ops that are
dispatched at runtime to things like the HAL, we can lower to `llvm::CallInst`s
of runtime-resolved function pointers. This still enables all of the
flexibility of heterogeneous/runtime-determined devices, pluggable diagnostics,
and backend composition without any need for FlatBuffers or the VM bytecode
interpreter.
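
A sketch of the lowered form (LLVM dialect syntax approximate): instead of a
`vm.call` into the interpreter, the import becomes a function pointer loaded
from a module state struct and called directly.

```mlir
// Load the runtime-resolved import from the module's state struct and call
// it through the pointer - no bytecode interpretation involved.
%fn = llvm.load %import_slot : !llvm<"i32 (i8*, i32)**">
%status = llvm.call %fn(%state, %value) : (!llvm<"i8*">, !llvm.i32) -> !llvm.i32
```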
| |
The VM was designed to make such a lowering easy, and the C-style struct-based
function pointer registration used by runtime modules was designed to make
emitting code against it fairly robust even when linked in dynamically, such as
when embedded in shared objects.
| |
| An extension of this is what we've been calling 'runtimeless mode', where the |
| IREE VM linkage code is statically linked into the binary alongside the |
| generated module LLVM IR. If only a single HAL backend is linked in then (with |
| some build-fu) we should be able to get call devirtualization to reduce code |
| size to precisely the functionality used by the module. |
| |
| ### Improved Type Support |
| |
| <a id="markdown-Improved%20Type%20Support" name="Improved%20Type%20Support"></a> |
| |
| Currently the VM only supports two types: `i32` and `vm.ref<T>`. This is an |
| intentional limitation such that we can determine what is really needed to |
| express the scheduling we perform, with the idea being that such a limited model |
| will make it easier to use techniques like |
| [indirect command buffers](#indirect-command-bufferon-accelerator-execution) to |
| compile the VM itself to an accelerator executable that dispatches work without |
| host involvement. |
| |
As we port more models we may find a few primitives that are worth bringing
into the VM design even at the cost of potential complications to future
porting. These include types like `f32` (for simple float
calculations/comparisons), `list`/`dict` (for easier Python compatibility), and
`vector<4xf32>` (for simple inline calculations that are not worth the dispatch
overhead/synchronization).
| |
| ### Indirect Command Buffer/On-Accelerator Execution |
| |
| <a id="markdown-Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution" name="Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution"></a> |
| |
Though IREE will use many different tricks such as
[predication](#predication-of-flowdispatch) to build deep pipelines, there is
still the requirement that command recording and submission happen on the host
CPU. Though the cost of this in terms of latency and power use can be minimized
by coalescing and timelines, there is still the possibility of non-trivial
round-trips being introduced that limit performance. For particular
applications like low-power always-on compute, or where there is significantly
branchy behavior (such as search algorithms), it is important that the
decision-making logic as to what is dispatched runs as close to real-time as
possible within the execution pipeline.
| |
| The IREE VM is designed to be runnable on-device in a secure and cooperative way |
| (no pointers, indirect buffer handles to allow for memory space rearrangement |
| op-to-op, deterministic execution and explicit yield points, etc). |
| |
The recent efforts to bring indirect command buffers to Vulkan, and Metal's
[Indirect Command Buffers](https://developer.apple.com/documentation/metal/indirect_command_buffers/encoding_indirect_command_buffers_on_the_gpu)
(both of which derive inspiration from
[NV_command_list](https://www.khronos.org/registry/OpenGL/extensions/NV/NV_command_list.txt)),
are one such target for this. Whether by
[lowering the VM IR to LLVM IR](#lowering-to-llvm-ir) or SPIR-V, by a special
conversion to target-specific forms, or by actually executing the VM bytecode
directly on-device (it's ~1000 LoC), we should be able to prototype what full
on-device usage is like. Even if only the VM functions the compiler deems
useful to schedule on the device run there, with the rest running on the host
(particularly those functions calling imported functions), some of the most
costly logic that creates tight coupling of the host and device scheduling can
be limited.