| # IREE Design Roadmap |
| |
| <a id="markdown-IREE%20Design%20Roadmap" name="IREE%20Design%20Roadmap"></a> |
| |
| <!-- WARNING: DO NOT EDIT THIS FILE IN AN EDITOR WITH AUTO FORMATTING --> |
| |
A not-so-concise walkthrough of various IREE features that are in the design
process and planned for future versions. Many of the questions around how the
IREE IR is designed and why certain components (such as the VM) exist should
become much clearer when seen in light of where we want to take the
infrastructure we are building (as opposed to where we currently are with our
MVP slice). This document is not meant to encompass the entire design of any
individual feature; if there's interest in a particular topic, please say hi on
the [iree-discuss](https://groups.google.com/forum/#!forum/iree-discuss)
mailing list.
| |
| <!-- TOC --> |
| |
| - [IREE Design Roadmap](#iree-design-roadmap) |
| - [Input Dialects](#input-dialects) |
| - [Future MLIR XLA HLO Replacement](#future-mlir-xla-hlo-replacement) |
| - [`linalg`: High-level Hierarchical Optimization](#linalg-high-level-hierarchical-optimization) |
| - [XLA HLO: Canonicalizations](#xla-hlo-canonicalizations) |
| - [XLA HLO: Tensor to Primitive Conversion](#xla-hlo-tensor-to-primitive-conversion) |
| - [Quantization](#quantization) |
| - [`flow`: Data- and Execution-Flow Modeling](#flow-data--and-execution-flow-modeling) |
| - [Avoiding Readbacks with `flow.stream`](#avoiding-readbacks-with-flowstream) |
| - [Threading `flow.stream` through the CFG](#threading-flowstream-through-the-cfg) |
| - [Predication of `flow.dispatch`](#predication-of-flowdispatch) |
| - [Deduping `flow.executable`s](#deduping-flowexecutables) |
| - [Rematerializing CSE'd Expressions](#rematerializing-csed-expressions) |
| - [Device Placement](#device-placement) |
| - [`hal`: Hardware Abstraction Layer and Multi-Architecture Executables](#hal-hardware-abstraction-layer-and-multi-architecture-executables) |
| - [Allow Targets to Specify `hal.interface`s](#allow-targets-to-specify-halinterfaces) |
| - [Target-specific Scheduling Specialization](#target-specific-scheduling-specialization) |
| - [Buffer Usage Tracking](#buffer-usage-tracking) |
| - [Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation) |
| - [Target-aware Executable Compression](#target-aware-executable-compression) |
| - [Target-aware Constant Compression](#target-aware-constant-compression) |
| - [Command Buffer Stateful Deduplication](#command-buffer-stateful-deduplication) |
| - [Resource Timeline](#resource-timeline) |
| - [Transient Tensor Ringbuffer](#transient-tensor-ringbuffer) |
| - [Timeline Semaphores on the Module ABI](#timeline-semaphores-on-the-module-abi) |
| - [GPU-like CPU Scheduling](#gpu-like-cpu-scheduling) |
| - [`vm`: Lightweight Virtual Machine](#vm-lightweight-virtual-machine) |
| - [Coroutines for Batching and Cooperative Scheduling](#coroutines-for-batching-and-cooperative-scheduling) |
| - [Cellular Batching](#cellular-batching) |
| - [Lowering to LLVM IR](#lowering-to-llvm-ir) |
| - [Improved Type Support](#improved-type-support) |
| - [Indirect Command Buffer/On-Accelerator Execution](#indirect-command-bufferon-accelerator-execution) |
| |
| <!-- /TOC --> |
| |
| ## Input Dialects |
| |
| <a id="markdown-Input%20Dialects" name="Input%20Dialects"></a> |
| |
| ### Future MLIR XLA HLO Replacement |
| |
| <a id="markdown-Future%20MLIR%20XLA%20HLO%20Replacement" name="Future%20MLIR%20XLA%20HLO%20Replacement"></a> |
| |
IREE's current input dialect is the XLA HLO dialect representing operations on
tensors. This was a pragmatic decision based on HLO already being defined and
on the proof of existing models being lowered to it from TensorFlow, allowing
us to focus on the IREE-specific portions of the work. Unfortunately, HLO is
tied to TensorFlow and carries many quirks that would not have been designed in
had that not been the case. There are discussions happening about an upstream
MLIR
[Tensor Compute Primitives](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/)
dialect that HLO can be lowered into, allowing IREE (and other backends) to
decouple themselves from XLA and become easier to target from frontends.
| |
| ### `linalg`: High-level Hierarchical Optimization |
| |
| <a id="markdown-%60linalg%60%3A%20High-level%20Hierarchical%20Optimization" name="%60linalg%60%3A%20High-level%20Hierarchical%20Optimization"></a> |
| |
IREE requires that its inputs all be in tensor form (and not in-place memref
updates) in order to perform the large majority of the `flow` transformations.
Recent work in the [Linalg](https://mlir.llvm.org/docs/Dialects/Linalg/)
dialect is adding support for operating on value-semantic tensors, meaning that
we can first apply `xla_hlo` to `linalg` lowerings and any of the
transformations available in Linalg prior to performing our own `flow`
lowerings. The advantage is that Linalg will have much stronger and more
principled code motion and nested loop transformation optimizations than are
possible on higher-level ops. As not all operations can be represented as
`linalg` ops, IREE will be able to ingest a mix of `linalg`, `std`, and
`xla_hlo` (or its replacement) ops.
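
A sketch of what such mixed-dialect input might look like (op spellings and
the `linalg.generic` trait syntax are approximate and only for illustration):

```mlir
#map = affine_map<(d0) -> (d0)>

// Hypothetical mixed-dialect input: an xla_hlo op feeding a linalg op that
// operates on value-semantic tensors, with std ops inside the region.
func @mixed(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
  %0 = xla_hlo.add %arg0, %arg1 : tensor<4xf32>
  %1 = linalg.generic {args_in = 2 : i64, args_out = 1 : i64,
                       indexing_maps = [#map, #map, #map],
                       iterator_types = ["parallel"]} %0, %arg1 {
  ^bb0(%a: f32, %b: f32):
    %2 = mulf %a, %b : f32
    linalg.yield %2 : f32
  } : tensor<4xf32>, tensor<4xf32> -> tensor<4xf32>
  return %1 : tensor<4xf32>
}
```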
| |
| ### XLA HLO: Canonicalizations |
| |
| <a id="markdown-XLA%20HLO%3A%20Canonicalizations" name="XLA%20HLO%3A%20Canonicalizations"></a> |
| |
Very little effort has been applied to `xla_hlo` optimizations and there are a
significant number of missing folders, canonicalizers, and simple
transformations. Many of these happen in legacy XLA C++ backends; however, we
need them in MLIR so that we can make use of dynamic shapes, mixed dialect
inputs, etc. The `tf2xla` bridge work (converting TensorFlow models into the
corresponding `xla_hlo` ops) is nearing its initial milestones, and afterward
we expect more of these missing pieces to be filled in.
| |
| Examples of the optimizations that will greatly benefit IREE (and any other |
| backend consuming `xla_hlo`) include: |
| |
| - Eliding unneeded transpose, reshape, and broadcast operations. |
| - Inserting transpose, reshape, and broadcast operations to allow for more |
| optimal memory access patterns (such as transposing gather input to allow |
| for memcpy-like transfers instead of column-wise cache-unfriendly accesses). |
- Moving operations above broadcasts such that the smallest amount of work is
  performed (see the sketch after this list).
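
As a sketch of that last optimization (shapes and attribute values here are
illustrative only), multiplying two broadcasted values can be rewritten to
multiply the small pre-broadcast operands and broadcast the result once:

```mlir
// Before: the multiply runs over the full broadcasted 4x1024 shape.
%0 = "xla_hlo.broadcast_in_dim"(%x) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<4xf32>) -> tensor<4x1024xf32>
%1 = "xla_hlo.broadcast_in_dim"(%y) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<4xf32>) -> tensor<4x1024xf32>
%2 = xla_hlo.multiply %0, %1 : tensor<4x1024xf32>

// After: the multiply runs over only 4 elements before broadcasting.
%3 = xla_hlo.multiply %x, %y : tensor<4xf32>
%4 = "xla_hlo.broadcast_in_dim"(%3) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<4xf32>) -> tensor<4x1024xf32>
```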
| |
| ### XLA HLO: Tensor to Primitive Conversion |
| |
| <a id="markdown-XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion" name="XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion"></a> |
| |
HLO only operates on tensor values - even for simple scalars - and this
presents a problem when attempting to determine which code should run on
accelerators vs. which should run on the host. The canonical example is
`xla_hlo.while`, which, as seen in the example below, uses scalar tensors for
its loop iteration counter and comparison.
| |
| ```mlir |
| %start = constant dense<1> : tensor<i32> |
| %bound = constant dense<3> : tensor<i32> |
| %res = "xla_hlo.while"(%start) ( { |
| ^bb0(%count: tensor<i32>): |
| %1 = "xla_hlo.compare"(%count, %bound) {comparison_direction = "LT"} : (tensor<i32>, tensor<i32>) -> tensor<i1> |
| "xla_hlo.return"(%1) : (tensor<i1>) -> () |
| }, { |
| ^bb0(%count: tensor<i32>): |
| %1 = xla_hlo.add %count, %count : tensor<i32> |
| "xla_hlo.return"(%1) : (tensor<i32>) -> () |
| }) : (tensor<i32>) -> tensor<i32> |
| ``` |
| |
| A naïve but correct lowering (what's currently in IREE) would perform the |
| comparison and increment on the device and insert a host readback to see if the |
| loop should continue: |
| |
| ```mlir |
| func @main() -> tensor<i32> attributes {iree.module.export, iree.reflection = {f = "I1!R6!B3!t6", fv = "1"}} { |
| %cst = constant dense<1> : tensor<i32> |
| %cst_0 = constant dense<3> : tensor<i32> |
| %cst_1 = constant dense<1> : vector<3xi32> |
| br ^bb1(%cst : tensor<i32>) |
| ^bb1(%2: tensor<i32>): // 2 preds: ^bb0, ^bb2 |
| %3 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %2 : tensor<i32>, %arg2 = %cst_0 : tensor<i32>) -> tensor<i1> { |
| %8 = flow.dispatch @main_ex_dispatch_0::@main_ex_dispatch_0[%arg0 : vector<3xi32>](%arg1, %arg2) : (tensor<i32>, tensor<i32>) -> tensor<i1> |
| flow.return %8 : tensor<i1> |
| } |
| %4 = flow.tensor.load %3 : tensor<i1> |
| cond_br %4, ^bb2(%2 : tensor<i32>), ^bb3(%2 : tensor<i32>) |
| ^bb2(%5: tensor<i32>): // pred: ^bb1 |
| %6 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %5 : tensor<i32>) -> tensor<i32> { |
| %8 = flow.dispatch @main_ex_dispatch_1::@main_ex_dispatch_1[%arg0 : vector<3xi32>](%arg1) : (tensor<i32>) -> tensor<i32> |
| flow.return %8 : tensor<i32> |
| } |
| br ^bb1(%6 : tensor<i32>) |
| ^bb3(%7: tensor<i32>): // pred: ^bb1 |
| return %7 : tensor<i32> |
| } |
| ``` |
| |
Of note is the `flow.tensor.load` op indicating a host readback. Though this
correctly executes the loop, it is extremely inefficient. What's desired is for
the loop iterator and condition to be computed entirely on the host, with the
iterator being passed to the loop body as an argument that can be encoded into
a command buffer in future lowering stages. This eliminates the host readback
and allows for much larger `flow.stream` sequences, feeding more into the
pipeline for the accelerator.

Not all source frontends have this issue (misrepresenting simple host
computation as non-dense tensor operations), and our goal is to add a
transformation that heuristically converts `xla_hlo` ops acting on small
tensors to `std` ops acting on primitive values (`i32`, `index`, etc.).
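
A sketch of what the loop above could become after such a transformation
(exact `std` op spellings are approximate): the counter and comparison become
primitive host values, leaving the device involved only where dense work
actually exists.

```mlir
// Hypothetical host-side loop skeleton using std ops on primitive i32
// values; no tensors (and thus no readbacks) are involved in the control
// flow itself.
%start = constant 1 : i32
%bound = constant 3 : i32
br ^loop(%start : i32)
^loop(%count: i32):
  %cond = cmpi "slt", %count, %bound : i32
  cond_br %cond, ^body(%count : i32), ^exit(%count : i32)
^body(%count_body: i32):
  %next = addi %count_body, %count_body : i32
  br ^loop(%next : i32)
^exit(%result: i32):
  return %result : i32
```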
| |
| ### Quantization |
| |
| <a id="markdown-Quantization" name="Quantization"></a> |
| |
| It's assumed that any work related to quantization/compression has happened |
| prior to lowering into IREE dialects. Our plan is to use the proposed |
| [Quantization Transforms](https://llvm.discourse.group/t/rfc-a-proposal-for-implementing-quantization-transformations-in-mlir/655) |
| to achieve both training and inference-time quantization of types in a way that |
| preserves maximum accuracy. IREE will support running with original unquantized |
| floats in all cases, allowing for a smooth on-ramp to quantization and the gains |
| in performance and reduction in model size that come from it. |
| |
| As future work IREE would like to move beyond these transformation-directed |
| approaches to quantization and interface directly to frontends which have a |
| defined enough type system to represent accurate quantized (and otherwise |
| compressed) computations directly, not relying exclusively on compiler-side type |
| inference transforms. |
| |
| ## `flow`: Data- and Execution-Flow Modeling |
| |
| <a id="markdown-%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling" name="%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling"></a> |
| |
The `flow` dialect is designed to allow us to extract as much concurrency as
possible from a program and partition IR into the scheduling and execution
domains. Today we have the IR structure and transformation flow in place but
have not yet gotten to the most interesting things such an infrastructure
enables. A majority of the largest performance, latency, and memory usage
improvements IREE can offer are determined first here and all following
lowerings benefit. _The fastest code is the code you don't execute and the
smallest allocation is the allocation you don't make_ ;)
| |
| ### Avoiding Readbacks with `flow.stream` |
| |
| <a id="markdown-Avoiding%20Readbacks%20with%20%60flow.stream%60" name="Avoiding%20Readbacks%20with%20%60flow.stream%60"></a> |
| |
A majority of the readbacks we have today (manifested as `flow.tensor.load.*`
ops) will be removed when we have an
[HLO tensor->primitive conversion](#xla-hlo-tensor-to-primitive-conversion).
There will still be cases where readbacks are required for correctness, but
they usually fall into a small set of usage patterns. For those that don't,
this is one place where IREE will warn about performance issues, allowing
programs that perform suboptimally to still run while encouraging authors to
adjust their input models to enable better behavior. The IREE VM also has
specific support for hiding readback latency in an efficient way via
[coroutines](#coroutines-for-batching-and-cooperative-scheduling).
| |
| The most common case we are currently seeing in the IR is that of dynamic copies |
| where the offsets are dependent on the result of previous computations. Source |
| models may have top-k + gather operations, for example. These appear as a |
| `flow.stream`, a `flow.tensor.load`, and then another `flow.stream` that uses |
| the loaded value for a `flow.tensor.update` (or other operation): |
| |
| ```mlir |
| %index_tensor = flow.ex.stream.fragment(...) -> tensor<i32> { ... } |
| %index = flow.tensor.load %index_tensor : tensor<i32> |
| %result = flow.ex.stream.fragment(%arg0 = %index : i32, ...) -> ... { |
| %0 = flow.dispatch ... |
| %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32> |
| ... |
| } |
| ``` |
| |
Today the `flow.tensor.update` turns into HAL command buffer transfer
operations that must have their offsets known at recording time. This is a
limitation of `vkCmdCopyBuffer` but not a fundamental limitation of any
hardware. In fact, several drivers implement copies as small built-in shader
programs, meaning that we could perform the same expansion here with the right
primitives. This would allow, in the above example, both the index to be
computed and the tensor to be updated within the same stream, entirely removing
the host round-trip.
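
A sketch of what that could look like (the indirect update op here is
hypothetical and does not exist today): the index never leaves the device and
both fragments fuse into a single stream.

```mlir
// Hypothetical fused form: the dynamically computed index is consumed by an
// indirect tensor update within the same stream, so no flow.tensor.load
// (host readback) is required.
%result = flow.ex.stream.fragment(%arg0 = %input : tensor<1x10xf32>, ...) -> tensor<1x10xf32> {
  %index = flow.dispatch @compute_index::@compute_index[...](...) : (...) -> tensor<i32>
  %0 = flow.dispatch ...
  %1 = flow.tensor.update.indirect %0, %arg0[%index] : tensor<10xf32> -> tensor<1x10xf32>
  flow.return %1 : tensor<1x10xf32>
}
```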
| |
| ### Threading `flow.stream` through the CFG |
| |
| <a id="markdown-Threading%20%60flow.stream%60%20through%20the%20CFG" name="Threading%20%60flow.stream%60%20through%20the%20CFG"></a> |
| |
The current `flow.ex.stream.fragment`, as denoted by the `ex`perimental tag, is
a temporary implementation designed to get the concept of streams lowered to
the HAL dialect. For streams to be effective at modeling larger concurrency
scopes they need to be able to move across branches in the CFG. This
intuitively follows exactly what one would do if recording commands by hand
with an API like Vulkan:
| |
| ```c++ |
| vkCmdCopyBuffer(cmd, ...); |
| if (some_flag) { |
| vkCmdBindPipeline(cmd, ..., pipeline_a); |
| } else { |
| vkCmdBindPipeline(cmd, ..., pipeline_b); |
| } |
| vkCmdDispatch(cmd, ...); |
| ``` |
| |
| The corresponding `flow` IR: |
| |
| ```mlir |
| flow.stream.append[%s0](...) { |
| flow.tensor.update ... |
| } |
| %b = cmpi ne %some_flag, ... |
| cond_br %b, ^a(%s0), ^b(%s0) |
| ^a(%s1): |
| flow.stream.append[%s1](...) { |
| flow.dispatch @pipeline_a, ... |
| } |
| br ^end(%s1) |
| ^b(%s2): |
| flow.stream.append[%s2](...) { |
| flow.dispatch @pipeline_b, ... |
| } |
| br ^end(%s2) |
| ^end(%s3): |
| ... |
| ``` |
| |
| This allows the entire stream to be lowered into one command buffer without the |
| need for any host round-trips. The conversion into the `flow` dialect will walk |
| the CFG and attempt to thread the `flow.stream` values through so long as there |
| are no external dependencies. |
| |
| ### Predication of `flow.dispatch` |
| |
| <a id="markdown-Predication%20of%20%60flow.dispatch%60" name="Predication%20of%20%60flow.dispatch%60"></a> |
| |
While the
[`flow.stream` threading through the CFG](#threading-flowstream-through-the-cfg)
can remove many of the simpler conditional dispatches, there will always be
some whose execution depends on the results of prior dispatches. For these, a
`flow.cond_dispatch` will allow a condition to be provided that must be true
for the dispatch to actually be performed.
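
A strawman of what `flow.cond_dispatch` might look like (the op does not exist
yet; the syntax mirrors `flow.dispatch`):

```mlir
// The condition is computed on-device by a prior dispatch; the predicated
// dispatch only executes if %cond is true at execution time.
%cond = flow.dispatch @check::@check[%c1 : vector<3xi32>](%state) : (tensor<i32>) -> tensor<i1>
%res = flow.cond_dispatch %cond, @pipeline_a::@entry[%workload : vector<3xi32>](%arg0) : (tensor<4xf32>) -> tensor<4xf32>
```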
| |
For targets that natively support predication in their command buffers (such as
D3D12's
[ID3D12GraphicsCommandList::SetPredication](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12graphicscommandlist-setpredication))
this provides a host round-trip-free way of conditionally executing dispatches
and transfers. Unfortunately core Vulkan support is still lacking, though
NVIDIA supports the
[VK_EXT_conditional_rendering](https://www.saschawillems.de/blog/2018/09/05/vulkan-conditional-rendering/)
extension that exposes the same behavior.
| |
| For targets that do not support predication natively it's still possible to |
| emulate predication with |
| [indirect dispatches](https://github.com/gpuweb/gpuweb/issues/31). In this model |
| the workgroup counts normally used to dispatch execution are sourced from |
| another device buffer at the time the dispatch is made instead of sourced from |
| the command buffer at the time the dispatch is recorded. Degenerate dispatches |
| with counts of `0, 0, 0` allow for effective neutering of the dispatch with |
| minimal overhead (vs. the significant penalty of a host round-trip!). |
| |
| By modeling such predication at the `flow` level we are able to lower into the |
| HAL with target-aware predication semantics and fuse indirect dispatch workgroup |
| count calculations into existing dispatches already being performed such that |
| overhead is reduced. |
| |
| ### Deduping `flow.executable`s |
| |
| <a id="markdown-Deduping%20%60flow.executable%60s" name="Deduping%20%60flow.executable%60s"></a> |
| |
| While still in the `flow` dialect, the executables are target-agnostic. This |
| makes simple IR tree diffing a potential solution to deduplication. Since most |
| of the dispatches originate from the same source-language library calls in input |
| frameworks there's a high likelihood of duplication, and depending on when |
| inlining is performed we may have stronger or weaker ability to perform the |
| deduplication. Thanks to the MLIR canonicalization pass (that ensures ops are |
| rearranged into consistent canonical representations) the IR comparisons can be |
| done rather trivially. |
| |
| ### Rematerializing CSE'd Expressions |
| |
| <a id="markdown-Rematerializing%20CSE'd%20Expressions" name="Rematerializing%20CSE'd%20Expressions"></a> |
| |
Common subexpression elimination is performed many times during lowering;
however, there comes a point where CSE can introduce false dependencies and
additional allocations that are otherwise avoidable. For example, if a
broadcasting operation is CSE'd and its result is then used by two or more
operations that are scheduled independently, what would have been a relatively
cheap lowering of the broadcast to a simple index remapping now becomes an
additional dispatch, materialization of an intermediate tensor, and a barrier:
| |
| ```mlir |
| %bcast = "xla_hlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32> |
| %mul1 = xla_hlo.multiply %arg0, %bcast : tensor<1024x10xf32> |
| // (pretend something here that prevents fusion) |
| %mul2 = xla_hlo.multiply %arg1, %bcast : tensor<1024x10xf32> |
| ``` |
| |
After partitioning into dispatch regions, the CSE'd broadcast becomes its own
dispatch feeding both consumers:

```mlir
| %bcast = flow.dispatch.region(%cst : tensor<f32>) -> tensor<1024x10xf32> { |
| %0 = "xla_hlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32> |
| return %0 : tensor<1024x10xf32> |
| } |
| // a barrier will be required here |
| %mul1 = flow.dispatch.region(%arg0 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> { |
| %1 = xla_hlo.multiply %arg0, %bcast : tensor<1024x10xf32> |
| return %1 : tensor<1024x10xf32> |
| } |
| %mul2 = flow.dispatch.region(%arg1 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> { |
| %2 = xla_hlo.multiply %arg1, %bcast : tensor<1024x10xf32> |
| return %2 : tensor<1024x10xf32> |
| } |
| ``` |
| |
Instead, the broadcast should be rematerialized inside of both dispatch
regions, as the cost of doing so is significantly less in compute resources and
the intermediate tensor will then not be required at all. Though at first it
may seem counter-intuitive to undo such a critical optimization as CSE (which
often helps both code size and compute), it's something we must carefully
balance while looking at the whole system. It gets even more important when
considering multi-device execution, as the cost of sharing memory and
synchronizing may be extremely non-trivial.
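
A sketch of the rematerialized form, reusing the dispatch region structure
from above: the broadcast folds into each consumer and the intermediate
`%bcast` tensor (and its barrier) disappears.

```mlir
%mul1 = flow.dispatch.region(%cst : tensor<f32>, %arg0 : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  // Rematerialized broadcast: a cheap index remapping within the dispatch.
  %0 = "xla_hlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  %1 = xla_hlo.multiply %arg0, %0 : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
%mul2 = flow.dispatch.region(%cst : tensor<f32>, %arg1 : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %0 = "xla_hlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  %1 = xla_hlo.multiply %arg1, %0 : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
```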
| |
| ### Device Placement |
| |
| <a id="markdown-Device%20Placement" name="Device%20Placement"></a> |
| |
While still within the `flow` dialect we have the ability to easily split
streams and safely shuffle around operations. Target execution backends can opt
into such behavior to ensure that device restrictions such as maximum in-flight
memory, maximum scheduling depth, and capabilities are observed. For
heterogeneous configurations the intent is that certain operations, dispatches,
and streams can be attributed to specify which device categories they should be
lowered to. The constraint solving that takes place can be provided with
generic heuristics ("big GEMMs go on the accelerator"), profile-guided
databases based on benchmarks, learned traits via ML, etc.
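
As a sketch (the attribute name is hypothetical), such attribution might be as
simple as tagging a dispatch with the device category the constraint solver
selected:

```mlir
// A large GEMM pinned to an accelerator category; untagged dispatches are
// left for heuristics or profile-guided placement to decide.
%0 = flow.dispatch @gemm::@gemm[%workload : vector<3xi32>](%lhs, %rhs)
    {iree.placement = "accelerator"} : (tensor<2048x2048xf32>, tensor<2048x2048xf32>) -> tensor<2048x2048xf32>
```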
| |
| ## `hal`: Hardware Abstraction Layer and Multi-Architecture Executables |
| |
| <a id="markdown-%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables" name="%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables"></a> |
| |
As the IREE HAL is designed almost 1:1 with a compute-only Vulkan API, many of
the techniques classically used in real-time graphics apply. The benefit we
have by modeling our usage of such a low-level API in IR is that the normal
work - some of which is very non-trivial - for managing allocations, tracking
resource lifetime, and ensuring proper synchronization/barriers is something we
can apply the full force of an offline compiler against.
| |
| ### Allow Targets to Specify `hal.interface`s |
| |
| <a id="markdown-Allow%20Targets%20to%20Specify%20%60hal.interface%60s" name="Allow%20Targets%20to%20Specify%20%60hal.interface%60s"></a> |
| |
The `hal.interface` op specifies the ABI between the scheduler and the device,
containing the buffer bindings and additional non-buffer data (parameters,
shapes, specialization flags, etc). Today a naïve ordering is used uniformly
for all targets; however, it is possible for target backends to opt into
providing their own interfaces based on target configuration. The same
`hal.executable` may have multiple interfaces and the same backend may use one
or more. This is useful when target capabilities may vary at runtime, such as
the
[number of available storage buffer bindings](https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxPerStageDescriptorStorageBuffers&platform=android)
in Vulkan. By exposing a few `hal.interface` variants with different binding
counts the Vulkan backend could make better use of the larger number of
bindings available at runtime while still providing support for smaller
configurations.
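
As a sketch (binding layouts illustrative), the same executable could expose a
minimal variant for constrained devices alongside a wider variant used when
the runtime reports enough storage buffer bindings:

```mlir
// Minimal variant: arguments packed into a single buffer for devices with
// few available bindings.
hal.interface @io_packed {
  hal.interface.binding @args, set=0, binding=0, type="StorageBuffer", access="Read"
  hal.interface.binding @ret0, set=0, binding=1, type="StorageBuffer", access="Write|Discard"
}
// Wider variant: one binding per argument when the device limits allow.
hal.interface @io_wide {
  hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
  hal.interface.binding @arg1, set=0, binding=1, type="StorageBuffer", access="Read"
  hal.interface.binding @ret0, set=0, binding=2, type="StorageBuffer", access="Write|Discard"
}
```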
| |
| Once we have multiple `hal.interface`s defined for executables the scheduler |
| needs to emit HAL ops that properly switch between them. By having a canonical |
| form for bindings we can ensure that only the differences between the interfaces |
| will need additional code. |
| |
| ### Target-specific Scheduling Specialization |
| |
| <a id="markdown-Target-specific%20Scheduling%20Specialization" name="Target-specific%20Scheduling%20Specialization"></a> |
| |
Though the `flow` dialect attempts to fuse as many ops as possible into
dispatch regions, it's not always possible for all target backends to schedule
a region as a single dispatch. A classic example is an algorithm like
[parallel reduction](https://en.wikipedia.org/wiki/Reduction_Operator#PRAM-algorithm),
commonly used on GPUs, that may require many dispatches to identical
executables, while other algorithms may vary the executables they use based on
input parameters such as shape or the target runtime device support.
| |
By default the `flow.dispatch` executable translation to `hal.executable`s is
performed 1:1 and it is assumed that a single dispatch is required. Extending
target backends with scheduling interfaces (enabling them to opt into different
scheduling behavior) will allow the backends to emit any number of
`hal.executable`s and any stream commands (such as additional dispatches or
transfers) they may need. This is effectively equivalent to what would
otherwise be done at runtime, except that because we are still operating on IR
prior to buffer allocation we can use the `hal` ringbuffer primitive. Through
this we can elide many of the allocations that would otherwise be required at
runtime (and the concurrency-limiting false dependencies that usually come
along with scratch memory).
| |
Since the algorithm used may vary based on the parameters of the dispatch (such
as the shape of the reduction, which may be dynamically determined), scheduling
specialization may occur even when targeting a single backend. In many cases
folding and canonicalization can eliminate the overhead, as the same IR is
present regardless of which dynamically computed workgroup size is used.
| |
| ### Buffer Usage Tracking |
| |
| <a id="markdown-Buffer%20Usage%20Tracking" name="Buffer%20Usage%20Tracking"></a> |
| |
Many explicit hardware APIs require knowing how buffers are used along with
where they should be located. For example, this additional information
determines the caching policy on buffer accesses (write-through, write-back,
etc), the visibility of writes across compute units, and the possible MMU
properties that may need to be maintained/matched for the buffer. By using the
SSA-form value-semantics of the MLIR `tensor` as used in the `flow` dialect we
have complete information about where buffers are used, or at least where they
enter or leave regions from which we can derive such information.
| |
Analysis passes can run over the IR to attribute tensors such that when
allocation is performed during lowering to the `hal` dialect we do so from an
allocator compatible with where the buffer will be used, with memory types
chosen based on the potential cost and location of operations performed
(write-only on host vs. read-write on host and device, etc), and with usage
bits indicating what kind of operations may be performed on the buffer. Many of
these are local transformations, as most buffers are only live within very
small regions such as the `flow.stream` encompassing their usage.
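
A sketch of what the resulting allocations might look like (op syntax
approximate): usage analysis picks the memory type and usage bits per buffer
rather than defaulting to the most permissive combination.

```mlir
// A transient tensor only touched by dispatches gets device-local memory and
// dispatch-only usage; a result read back by the caller gets host-visible
// memory with transfer usage instead.
%transient = hal.allocator.allocate %allocator, "DeviceLocal", "Dispatch", %size0 : !hal.buffer
%readback = hal.allocator.allocate %allocator, "DeviceLocal|HostVisible", "Transfer|Dispatch", %size1 : !hal.buffer
```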
| |
Traditional systems need to use either very permissive buffer properties or
heuristics that can introduce additional non-trivial overhead when those
heuristics are incorrect. For example,
[OpenGL had several such usage hints](https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferData.xhtml)
that drivers were then able to use, but almost no drivers behaved as desired in
all cases and it led to additional memory ghosting, copies, readbacks, and
unpredictable performance. For almost all uses of the buffers within an IREE
invocation we can instead know precisely where and how buffers need to be moved
and can do so the minimum number of times required.
| |
| ### Batched Executable Caching and Precompilation |
| |
| <a id="markdown-Batched%20Executable%20Caching%20and%20Precompilation" name="Batched%20Executable%20Caching%20and%20Precompilation"></a> |
| |
| For targets that may require runtime preprocessing of their executables prior to |
| dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch |
| compilation mechanism based on Vulkan's |
| [Pipeline Cache](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html). |
| |
| Today each executable is compiled on-demand and cached only for the process |
| lifetime. Though some drivers may provide their own caching we can make better |
| use of the explicit caching and compilation behavior with the additional |
| information we have in the compiler. |
| |
For any given entry point (or group of entry points) into an IREE module we can
perform reachability analysis to know which executables may be executed when
that entry point is invoked. In this way we can emit pre-invocation compilation
checks (similar to a `std::call_once` block) that provide all required
executables for compilation and allow more efficient compilation through
multithreading the compiler invocations. These same compilation caching
functions can be exposed and invoked manually by an application to force
pre-compilation when it is least likely to impact the user, such as a
post-install/first-run step or concurrently while other application features
are loading.
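
A strawman of the emitted check (all ops here are hypothetical): each entry
point guards a one-time batch preparation of every executable it can reach.

```mlir
func @predict(%arg0 : tensor<4xf32>) -> tensor<4xf32> {
  // Runs at most once per context; all executables reachable from @predict
  // are handed to the cache together so compilation can be multithreaded.
  vm.call.once @__prepare_predict_executables() : () -> ()
  ...
}
func @__prepare_predict_executables() {
  hal.executable_cache.prepare @_exe_cache, [@predict_ex_dispatch_0, @predict_ex_dispatch_1]
  return
}
```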
| |
| We can use zero or more scoped caches for executables within a module. |
| Completely dynamic modules (such as those emitted in eager-mode usage) may avoid |
| the caching overhead entirely, while modules that have several primary usage |
| modes (such as training and inference) may choose to use independent caches for |
| each such mode. |
| |
| The caches generated can then be retrieved and saved by the hosting application. |
| Upon the next execution the application can provide the caches and if still |
| valid they will be used to avoid compilation. |
| |
| ### Target-aware Executable Compression |
| |
| <a id="markdown-Target-aware%20Executable%20Compression" name="Target-aware%20Executable%20Compression"></a> |
| |
An advantage of representing executable binaries in IR after translation is
that we can apply various post-compilation compression and minification
techniques while still knowing precisely where the executable will be used.
This is extremely important for SPIR-V as it is not designed to be a small
at-rest format. Though the biggest lever we have to control generated code size
is higher-level deduplication and specialization, there will still be a
sufficiently large number of executable binaries we will need to embed within
the final modules, and having targeted approaches for reducing their size
beyond just "gzip everything" is very powerful.
| |
| For example, [SMOL-V](https://github.com/aras-p/smol-v) is a fantastic lossless |
| SPIR-V compression technique that, when coupled with modern dictionary-based |
| compression algorithms, can save significant binary size. As a data point, the |
| SPIR-V corpus SMOL-V uses for testing goes from 4.8MiB of raw SPIR-V to 348KiB |
| of compressed SMOL-V. |
| |
| Combined with |
| [Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation) |
| we can easily use shared dictionaries and other cross-artifact compression in a |
| relatively plug-in way. |
| |
| ### Target-aware Constant Compression |
| |
| <a id="markdown-Target-aware%20Constant%20Compression" name="Target-aware%20Constant%20Compression"></a> |
| |
This is still an area that needs more research, but one goal of the IREE design
was to enable efficient target- and context-aware compression of large
constants (typically model weights/parameters/embeddings). This may mean
reusing existing hardware compression formats on GPUs, ML accelerator-specific
formats, or very-low-bit-depth (1-4 bits per value) quantization techniques
that cannot be directly used without first decompressing. The inspiration here
is formats like [Crunch](https://github.com/BinomialLLC/crunch) and
[Basis Universal](https://github.com/BinomialLLC/basis_universal) that perform
["supercompression"](http://gamma.cs.unc.edu/GST/gst.pdf), and we may even be
able to use these directly, as we could then make use of GPU hardware samplers
to do the 4-bit to 32-bit decompression, etc.
| |
| ### Command Buffer Stateful Deduplication |
| |
| <a id="markdown-Command%20Buffer%20Stateful%20Deduplication" name="Command%20Buffer%20Stateful%20Deduplication"></a> |
| |
The IREE HAL - much like the Vulkan API it is based on - eschews much of the
state that traditional APIs have in favor of (mostly) immutable state objects
(pipeline layouts, pipeline states, descriptor sets, etc). There are still a
few stateful entry points in the API, though, and deduplicating or reordering
redundant calls can reduce IR, API, and execution overhead.

The key place this will have the largest impact is around descriptor set
bindings and push descriptors, both of which are stateful and can have
non-trivial setup overhead. A canonicalization for such commands that inspects
the target `hal.command_buffer` to see whether the same state was already set,
plus code motion to hoist such commands out of loop bodies when possible, would
be helpful.
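
A sketch of the kind of redundancy this targets (HAL op syntax approximate):
the second push of identical descriptor state is removable, and loop-invariant
pushes can be hoisted out of loop bodies entirely.

```mlir
hal.command_buffer.push_descriptor_set %cmd, %layout, set=0, bindings=[0 = (%buf, %off, %len)]
hal.command_buffer.dispatch %cmd, %exe, entry_point=0, workgroup_xyz=[%x, %y, %z]
// Identical state pushed again with no intervening change: elidable.
hal.command_buffer.push_descriptor_set %cmd, %layout, set=0, bindings=[0 = (%buf, %off, %len)]
hal.command_buffer.dispatch %cmd, %exe2, entry_point=0, workgroup_xyz=[%x, %y, %z]
```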
| |
| ### Resource Timeline |
| |
| <a id="markdown-Resource%20Timeline" name="Resource%20Timeline"></a> |
| |
A core concept of the IREE scheduler that allows for overlapping in-flight
invocations is that of the resource timeline. This identifies module state that
can be in use by multiple invocations and assigns timeline milestones denoting
when the resource will be in the appropriate state for the current invocation
to proceed. Conceptually it is like an epoch-based synchronization mechanism of
the kind commonly found in garbage collectors to allow for lock-free
asynchronous memory reclamation.
| |
The advantage we have in the IR is that we know both the usage of all
resources, thanks to [buffer usage tracking](#buffer-usage-tracking), and the
synchronization domains of all resources (in most cases). This allows us to
effectively assign one timeline semaphore per writeable resource while in
practice having far fewer than 1:1; for example, if two resources are only ever
written in the same command buffer, only one semaphore is needed to signal the
completion of both writes.
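
A sketch of the idea (ops here are hypothetical): two variables written by the
same submission share one semaphore, and later invocations wait on its
milestone rather than on the whole device going idle.

```mlir
// Invocation N writes @var_a and @var_b in one command buffer and signals
// milestone %n on the shared semaphore when both writes complete.
hal.ex.submit_and_signal %cmd, %sem, %n
// Invocation N+1 need only wait for milestone %n before reading either
// variable; unrelated work can proceed concurrently.
hal.semaphore.await %sem, min_value = %n
```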
| |
| By transforming IR to sink all resource reads and writes closest to where the |
| value is used we can enlarge the time windows that can overlap across |
| invocations that may share those resources. This is similar to what out-of-order |
| CPUs do with register renaming/reorder buffers/etc and something we can apply |
| some traditional instruction scheduling techniques to (only here our |
| 'instructions' are entire command buffer dispatches/transfers). |
| |
Two degenerate cases of this approach are those of resource indirection
(`iree.ptr<tensor<T>>`) and dynamic resource shapes. In these two cases it may
not be possible to continue recording commands even if we are able to ensure
execution is appropriately synchronized. This is where indirect dispatch,
[predication](#predication-of-flowdispatch),
[indirect command buffers](#indirect-command-bufferon-accelerator-execution),
and [VM coroutines](#coroutines-for-batching-and-cooperative-scheduling) can
all help cover for the times where we are unable to transform away the
indirection or emit shape logic without data dependencies.
| |
| ### Transient Tensor Ringbuffer |
| |
| <a id="markdown-Transient%20Tensor%20Ringbuffer" name="Transient%20Tensor%20Ringbuffer"></a> |
| |
(When properly implemented) almost all buffers required during execution never
escape the command buffers they are used in or a single VM invocation. We can
trivially identify this from the explicit captures of `flow.stream` and
`flow.dispatch` ops and the fact that all tensor types have value-semantics.
Only those tensor values loaded-from/stored-to module state or that cross the
exported module function boundary need special consideration, while almost
everything else can live transiently only so long as it is required during
execution.
| |
| Thanks to this information about buffer usage and lifetime we can use a |
| [ringbuffer](https://en.wikipedia.org/wiki/Circular_buffer) to store the |
| transient tensor data and other required data reservations such as uniform |
| buffers used to pass dynamic parameters (shapes, flags, etc) into dispatches. |
| This gives the compiler and the application a knob that allows them to control |
| maximum concurrency (by having a very large ringbuffer) or maximum memory usage |
| (by having a minimally small ringbuffer). |
| |
Allocating tensors from the ringbuffer does not require sophisticated runtime
packing, as we can emit IR to calculate the required sizes for dynamically
shaped tensors. Whether a basic block reserves `%sz = constant 42 : index`
bytes or `%sz = std.muli %cst, %dyn_dim : index` bytes doesn't materially
change how the allocations are performed. Since almost all usage involves
simple write head bumps, there is no need for ahead-of-time memory planning or
large fixed allocations, and since no buffer within the ringbuffer can alias we
can have coarse (_read: low overhead_) guarantees about the availability of
certain regions of the ringbuffer (_"when this event is signaled all prior
ringbuffer writes have completed"_).
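
A sketch of a dynamically sized reservation (the ringbuffer op is
hypothetical): the size math is plain IR and the allocation itself is just a
bump of the write head.

```mlir
// size = %dyn_dim * 1024 elements * 4 bytes per f32.
%c1024 = constant 1024 : index
%c4 = constant 4 : index
%elems = muli %dyn_dim, %c1024 : index
%size = muli %elems, %c4 : index
%buffer = hal.ringbuffer.reserve %ring, %size : !hal.buffer
```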
| |
| Usually any planning we may want to perform can be done in IR via code motion. |
| For example applying traditional algorithms used to reduce register pressure |
| will help us attain narrower live windows within the ringbuffer leading to a |
| larger number of in-flight operations for the same ringbuffer memory usage. |
| |
| We may end up using both a classical ringbuffer and a variant known as the |
| [bip buffer](https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The-Circular-Buffer-with-a-Twist) |
| because it is better for descriptor set utilization (as we can provide many |
| dispatch parameters with a single base offset bound once at the beginning of a |
| region). |
| |
| ### Timeline Semaphores on the Module ABI |
| |
| <a id="markdown-Timeline%20Semaphores%20on%20the%20Module%20ABI" name="Timeline%20Semaphores%20on%20the%20Module%20ABI"></a> |
| |
Function calls made across modules (either from C++ into the VM, VM->VM, or
VM->C++) should be able to define timeline semaphores used to wait and signal
on the call. We can do this by making all exports automatically have the
semaphores and then making invocations populate them if they were not provided
by the caller. In this way we can allow multiple invocations of exported
functions to chain naturally with internal asynchronous workloads, turning most
IREE invocations into just the recording of command buffers that never block.
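
A sketch of what such an export might look like (types and attribute names
illustrative): the wait/signal pairs become explicit arguments, defaulted by
the runtime when the caller doesn't provide them.

```mlir
// The invocation waits until %wait_sem reaches %wait_value, records its
// command buffers, and signals %signal_sem to %signal_value when the result
// is ready - never blocking the calling thread.
func @predict(%wait_sem : !hal.semaphore, %wait_value : i32,
              %input : tensor<4xf32>,
              %signal_sem : !hal.semaphore, %signal_value : i32)
    -> tensor<4xf32> attributes {iree.module.export}
```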
| |
| When combined with |
| [VM coroutine support](#coroutines-for-batching-and-cooperative-scheduling) we |
| even have the ability to interleave any required host execution between the wait |
| and signal semaphores provided such that the caller never knows on which device |
| execution is taking place. It's still possible to provide synchronous wrappers |
| that emulate blocking behavior but by having the core system designed around a |
| single system-supported primitive we avoid the need for additional things like |
| interrupt watchdog threads, implicit blocking, and other pitfalls. |
| |
| ### GPU-like CPU Scheduling |
| |
| <a id="markdown-GPU-like%20CPU%20Scheduling" name="GPU-like%20CPU%20Scheduling"></a> |
| |
One approach to using multiple cores on a CPU is to perform interior
parallelization of operations, such as with OpenMP or library-call-based custom
thread pools (as in gemmlowp). This works when each individual operation is
relatively costly vs. the potential pipeline bubbles caused by work spinning
down near the end of one operation and spinning up at the beginning of the
next.
| |
IREE is designed to handle many more workloads - some of which have very narrow
shapes but very deep pipelines (like search algorithms) - such that the above
approach of multithreading within ops becomes a bottleneck. These workloads are
traditionally very poorly handled by frameworks, and issues with
oversubscription, pipeline stalls, and suboptimal system schedulers (such as on
Android) can lead to more time being spent thrashing about than actually
executing real work.
| |
The approach we take here is to treat the cores of a CPU as if they were
computation units on a GPU, each able to perform some set of heterogeneous work
independent of the other units. The concurrency we model at the `flow` level
and communicate to the runtime via the `hal` dialect - which explicitly states
which dispatches can overlap and the sizes of the workgroups - can then
trivially be used to distribute this work over many cores exactly as a GPU
would do it. Integration with library calls that may require their own
threading (such as Ruy) requires that they be able to use the IREE thread pool
instead of their own.
| |
In this way we can avoid pipeline bubbles and other latency-inducing
unpredictable scheduling. This does not mean that we treat individual units of
work at the same scale as we would for GPUs, but instead that we tile and have
one or more processing units work on those tiles. Whether the tile size is
defined by a library call contract, heuristics, or empirically is TBD, but
expect workgroup sizes in the thousands to millions of invocations vs. normal
GPU workgroup sizes in the dozens to hundreds of invocations.
| |
To achieve this style of scheduling efficiently we'll likely use
[marl](https://github.com/google/marl) as the scheduler. Marl provides
cross-platform low-overhead fibers and is compatible with this style of
scheduling as it was built for the SwiftShader software rasterizer.
| |
Even if IREE were only targeting CPUs, the assertion is that we would still
want to schedule this way; it's only an incidental benefit that when building
for heterogeneous targets the scheduling code may be shared (just with a
different divisor for workgroup count calculations).
| |
| ## `vm`: Lightweight Virtual Machine |
| |
| <a id="markdown-%60vm%60%3A%20Lightweight%20Virtual%20Machine" name="%60vm%60%3A%20Lightweight%20Virtual%20Machine"></a> |
| |
| The VM is designed as a dynamic linkage ABI, stable bytecode representation, and |
| intermediate lowering IR. Many of the optimizations we can perform on it will |
| benefit all use cases (such as when lowering to LLVM IR) by allowing |
| higher-level program transformations around synchronization that are difficult |
| to perform on arbitrary LLVM IR. |
| |
| ### Coroutines for Batching and Cooperative Scheduling |
| |
| <a id="markdown-Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling" name="Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling"></a> |
| |
| One of the largest features currently missing from the VM is coroutines (aka |
| user-mode fiber scheduling). Coroutines are what will allow us to have multiple |
| in-flight invocations into a module - some of which may be waiting on external |
| events - without the need for complex multithreading logic or state machine |
| machinations. |
| |
In many cases
[once semaphores are exposed to callers](#timeline-semaphores-on-the-module-abi)
we will not need to yield in the VM. The user will call into the module with
provided semaphores, the work to perform will be recorded to one or more
command buffers and submitted to the device, and then control will return to
the caller immediately.
| |
| In cases requiring host readbacks that we were not able to remove, however, |
| additional VM code may need to run prior to when the final semaphore is |
| signaled. To preserve the asynchronous interface and immediate execution |
| guarantees the compiler can emit explicit yield points (`vm.yield`) that are |
| known-good locations for yielding (such as most resources not required after the |
| yield having been flushed/discarded, partial synchronization scope availability |
| if other work may be able to execute concurrently irrespective of the yielded |
| coroutine, etc). |
| |
When the VM encounters the yield at runtime it will suspend the coroutine until
a defined condition is met. Many coroutines can be in various states at any
given time and - thanks to the resource timeline - can still be memory safe.
For example, if two stateless invocations are made with a common wait
semaphore, both can be recorded and submitted without waiting on each other. If
internal module state is accessed, the invocations are implicitly ordered by
invocation order (similar to what Vulkan calls
[API order](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch02s02.html#fundamentals-queueoperation-apiorder))
based on internal resource timeline semaphores.
| |
Waking the coroutines can be performed by an application-provided callback when
the application already has a periodic event doing bookkeeping (such as
frame-end callbacks when rendering or Looper idle events on Android), giving
the application direct control over the frequency and location at which IREE
performs additional work. A helper that runs a dedicated IREE thread to do this
will be provided as well, but the expectation is that applications can often do
a better (and, importantly, more predictable) job.
| |
| By utilizing coroutines IREE will have a way to fill traditional pipeline |
| bubbles even with execution from the same module (let alone across modules) in |
| the situation where host readbacks or other logic is required. This increases |
| overall throughput and utilization while reducing host wakeups as many |
| coroutines can be processed at once to submit new work to the device queues, |
| though it does not help reduce per-invocation latency. |
| |
| External code such as the HAL implementation or user ops may provide the wait |
| handles used for continuation. For example, the HAL can expose a function that |
| yields and wakes only when one or more timeline semaphores reach their target |
| values: |
| |
| ```mlir |
| // submit work |
| hal.device.yield %semaphore4 >= %sem4_target, %semaphore5 >= %sem5_target |
| // continue here, possibly much later in time |
| ``` |
| |
| #### Cellular Batching |
| |
| <a id="markdown-Cellular%20Batching" name="Cellular%20Batching"></a> |
| |
Though coroutines help throughput, there is a way we've found to reduce latency
that's been documented as
[cellular batching](http://madsys.cs.tsinghua.edu.cn/publications/EUROSYS2018-gao.pdf).
This same technique has been implemented in prior internal systems and is one
of the motivating design goals for IREE's creation. The core idea is to
identify small uniform work that can be partitioned and scheduled greedily so
as to enable batching or reduce associated invocation costs (such as refreshing
accelerator SRAM/caches with new parameters). This usually manifests as finding
large GEMM/GEMV operations using the same fixed parameters and either
dynamically increasing the batch size by adding the waiting work (without
deferring the actual execution time) or sequencing them back to back to ensure
better cache utilization. Which approach is taken depends on any data
dependencies that may be present (such as LSTM state feedback edges).
| |
With the foundation of coroutines in IREE it's possible to yield execution at
any given point - including during command buffer recording - and wake on
specific conditions. A majority of the logic can be built into the module
itself with very little need for runtime machinery, as shared VM variables can
be used to track pending work across invocations (even from different parts of
the program) and flush based on logic wholly controlled by the user or compiler
(such as count limits, maximum time/latency limits, etc). This allows for the
large variety of scheduling behaviors various applications may want, ranging
from zero-latency batch-only-within-this-invocation behavior to
[Nagle's Algorithm](https://en.wikipedia.org/wiki/Nagle%27s_algorithm)-esque
time- or limit-based behavior, or even some learned model-specific windowing.
| |
Design work is still required on how to represent this in IR, but the current
thought is to model the regions in which deferred execution is possible and
beneficial, and to allow additional transformations during lowering to the VM.
This is similar to how the async-await behavior works in C#, where the async
keyword is just sugar that expands to additional generated helper utilities.
| |
| A simple strawman representation for sequential dispatch may look like: |
| |
| ```mlir |
| hal.scheduling_policy @defer_policy { |
| // max time, max count, max live memory, etc |
| } |
| ... |
| hal.command_buffer.dispatch.deferred @defer_policy, @dispatch, ... |
| // vm.yield added here during lowering |
| ``` |
| |
There are many cases to explore, and as cellular batching can have performance
benefits of several orders of magnitude it'll be one of the primary areas of
research in the long term.
| |
| ### Lowering to LLVM IR |
| |
| <a id="markdown-Lowering%20to%20LLVM%20IR" name="Lowering%20to%20LLVM%20IR"></a> |
| |
For scenarios where dynamic module loading is not required and entire modules
can be compiled into applications, we can lower the VM IR to LLVM IR within
MLIR's transformation pipeline. Instead of embedding `vm.call` ops that are
dispatched at runtime to things like the HAL, we can lower to `llvm::CallInst`s
of runtime-resolved function pointers. This still enables all of the
flexibility of heterogeneous/runtime-determined devices, pluggable diagnostics,
and backend composition without any need for FlatBuffers or the VM bytecode
interpreter.
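
A sketch of the lowered form (LLVM dialect syntax approximate): instead of a
`vm.call` into the interpreter, the import becomes a function pointer loaded
from a module state struct and called directly.

```mlir
// Load the runtime-resolved import from the module's state struct and call
// it through the pointer - no bytecode interpretation involved.
%fn = llvm.load %import_slot : !llvm<"i32 (i8*, i32)**">
%status = llvm.call %fn(%state, %value) : (!llvm<"i8*">, !llvm.i32) -> !llvm.i32
```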
| |
The VM was designed to make such a lowering easy, and the C-style struct-based
function pointer registration used by runtime modules was designed to make
emitting code against it fairly robust even when linked in dynamically, such as
when embedded in shared objects.
| |
| An extension of this is what we've been calling 'runtimeless mode', where the |
| IREE VM linkage code is statically linked into the binary alongside the |
| generated module LLVM IR. If only a single HAL backend is linked in then (with |
| some build-fu) we should be able to get call devirtualization to reduce code |
| size to precisely the functionality used by the module. |
| |
| ### Improved Type Support |
| |
| <a id="markdown-Improved%20Type%20Support" name="Improved%20Type%20Support"></a> |
| |
| Currently the VM only supports two types: `i32` and `vm.ref<T>`. This is an |
| intentional limitation such that we can determine what is really needed to |
| express the scheduling we perform, with the idea being that such a limited model |
| will make it easier to use techniques like |
| [indirect command buffers](#indirect-command-bufferon-accelerator-execution) to |
| compile the VM itself to an accelerator executable that dispatches work without |
| host involvement. |
| |
As we port more models we may find a few primitives that are worth bringing
into the VM design even at the cost of potential complications to future
porting. These include types like `f32` (for simple float
calculations/comparisons), `list`/`dict` (for easier Python compatibility), and
`vector<4xf32>` (for simple inline calculations that are not worth the dispatch
overhead/synchronization).
| |
| ### Indirect Command Buffer/On-Accelerator Execution |
| |
| <a id="markdown-Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution" name="Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution"></a> |
| |
Though IREE will use many different tricks such as
[predication](#predication-of-flowdispatch) to build deep pipelines, there is
still the requirement that command recording and submission happen on the host
CPU. Though the cost of this in terms of latency and power use can be minimized
by coalescing and timelines, there is still the possibility of non-trivial
round-trips being introduced that limit performance. For particular
applications like low-power always-on compute, or where there is significantly
branchy behavior (such as search algorithms), it is important that the
decision-making logic as to what is dispatched runs as close to real-time as
possible within the execution pipeline.
| |
| The IREE VM is designed to be runnable on-device in a secure and cooperative way |
| (no pointers, indirect buffer handles to allow for memory space rearrangement |
| op-to-op, deterministic execution and explicit yield points, etc). |
| |
The recent efforts to bring indirect command buffers to Vulkan, and Metal's
[Indirect Command Buffers](https://developer.apple.com/documentation/metal/indirect_command_buffers/encoding_indirect_command_buffers_on_the_gpu)
(both of which derive inspiration from
[NV_command_list](https://www.khronos.org/registry/OpenGL/extensions/NV/NV_command_list.txt)),
are one such target for this. Whether by
[lowering the VM IR to LLVM IR](#lowering-to-llvm-ir) or SPIR-V, by a special
conversion to target-specific forms, or by actually executing the VM bytecode
directly on-device (it's ~1000 LoC), we should be able to prototype what full
on-device usage is like. Even if only the VM functions the compiler deems
useful to schedule on the device run there, with the rest running on the host
(particularly those functions calling imported functions), some of the most
costly logic that creates tight coupling of the host and device scheduling can
be limited.