# IREE Design Roadmap

<a id="markdown-IREE%20Design%20Roadmap" name="IREE%20Design%20Roadmap"></a>

<!-- WARNING: DO NOT EDIT THIS FILE IN AN EDITOR WITH AUTO FORMATTING -->

A not-so-concise walkthrough of various IREE features that are in the design
process and planned for future versions. A lot of the questions around how the
IREE IR is designed and why certain components exist (such as the VM) hopefully
become much clearer when seeing where we want to go with the infrastructure we
are building (as opposed to where we currently are with our MVP slice). This
document is not meant to encompass the entire design of any individual feature;
if there's interest, please say hi on the
[iree-discuss](https://groups.google.com/forum/#!forum/iree-discuss) mailing
list.

<!-- TOC -->

- [IREE Design Roadmap](#iree-design-roadmap)
    - [Input Dialects](#input-dialects)
        - [Future MLIR XLA HLO Replacement](#future-mlir-xla-hlo-replacement)
        - [`linalg`: High-level Hierarchical Optimization](#linalg-high-level-hierarchical-optimization)
        - [XLA HLO: Canonicalizations](#xla-hlo-canonicalizations)
        - [XLA HLO: Tensor to Primitive Conversion](#xla-hlo-tensor-to-primitive-conversion)
        - [Quantization](#quantization)
    - [`flow`: Data- and Execution-Flow Modeling](#flow-data--and-execution-flow-modeling)
        - [Avoiding Readbacks with `flow.stream`](#avoiding-readbacks-with-flowstream)
        - [Threading `flow.stream` through the CFG](#threading-flowstream-through-the-cfg)
        - [Predication of `flow.dispatch`](#predication-of-flowdispatch)
        - [Deduping `flow.executable`s](#deduping-flowexecutables)
        - [Rematerializing CSE'd Expressions](#rematerializing-csed-expressions)
        - [Device Placement](#device-placement)
    - [`hal`: Hardware Abstraction Layer and Multi-Architecture Executables](#hal-hardware-abstraction-layer-and-multi-architecture-executables)
        - [Allow Targets to Specify `hal.interface`s](#allow-targets-to-specify-halinterfaces)
        - [Target-specific Scheduling Specialization](#target-specific-scheduling-specialization)
        - [Buffer Usage Tracking](#buffer-usage-tracking)
        - [Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation)
        - [Target-aware Executable Compression](#target-aware-executable-compression)
        - [Target-aware Constant Compression](#target-aware-constant-compression)
        - [Command Buffer Stateful Deduplication](#command-buffer-stateful-deduplication)
        - [Resource Timeline](#resource-timeline)
        - [Transient Tensor Ringbuffer](#transient-tensor-ringbuffer)
        - [Timeline Semaphores on the Module ABI](#timeline-semaphores-on-the-module-abi)
        - [GPU-like CPU Scheduling](#gpu-like-cpu-scheduling)
    - [`vm`: Lightweight Virtual Machine](#vm-lightweight-virtual-machine)
        - [Coroutines for Batching and Cooperative Scheduling](#coroutines-for-batching-and-cooperative-scheduling)
            - [Cellular Batching](#cellular-batching)
        - [Lowering to LLVM IR](#lowering-to-llvm-ir)
        - [Improved Type Support](#improved-type-support)
        - [Indirect Command Buffer/On-Accelerator Execution](#indirect-command-bufferon-accelerator-execution)

<!-- /TOC -->

## Input Dialects

<a id="markdown-Input%20Dialects" name="Input%20Dialects"></a>

### Future MLIR XLA HLO Replacement

<a id="markdown-Future%20MLIR%20XLA%20HLO%20Replacement" name="Future%20MLIR%20XLA%20HLO%20Replacement"></a>

IREE's current input dialect is the XLA HLO dialect representing operations on
tensors. This was a pragmatic decision based on having HLO already defined and
proof of existing models being lowered to it from TensorFlow, allowing us to
focus on the IREE-specific portions of work. Unfortunately, HLO is tied to
TensorFlow and has many quirks that would not otherwise have been designed had
that not been the case. There are discussions happening about an upstream MLIR
[Tensor Compute Primitives](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/)
dialect that HLO can be lowered into, allowing IREE (and other backends) to
decouple themselves from XLA and be easier to target from frontends.

### `linalg`: High-level Hierarchical Optimization

<a id="markdown-%60linalg%60%3A%20High-level%20Hierarchical%20Optimization" name="%60linalg%60%3A%20High-level%20Hierarchical%20Optimization"></a>

It's required that IREE inputs are all in tensor form (and not in-place memref
updates) in order to perform a large majority of the `flow` transformations.
Recent work in the [Linalg](https://mlir.llvm.org/docs/Dialects/Linalg/) dialect
is adding support for operating on value-semantic tensors, meaning that we can
first apply `mhlo` to `linalg` lowerings and any of the transformations
available in Linalg prior to performing our own `flow` lowerings. The advantage
is that Linalg will have much stronger and more principled code motion and
nested loop transformation optimizations than are possible on higher-level ops.
As not all operations can be represented as `linalg` ops, IREE will be able to
ingest a mix of `linalg`, `std`, and `mhlo` (or its replacement) ops.

### XLA HLO: Canonicalizations

<a id="markdown-XLA%20HLO%3A%20Canonicalizations" name="XLA%20HLO%3A%20Canonicalizations"></a>

Very little effort has been applied to `mhlo` optimizations and there are a
significant number of missing folders, canonicalizers, and simple
transformations. Many of these happen in legacy XLA C++ backends; however, we
need them in MLIR so that we can make use of dynamic shapes, mixed dialect
inputs, etc. The `tf2xla` bridge work (converting TensorFlow models into the
corresponding `mhlo` ops) is nearing its initial milestones and afterward we
expect more of these missing pieces to be filled in.

Examples of the optimizations that will greatly benefit IREE (and any other
backend consuming `mhlo`) include:

- Eliding unneeded transpose, reshape, and broadcast operations.
- Inserting transpose, reshape, and broadcast operations to allow for more
  optimal memory access patterns (such as transposing gather input to allow
  for memcpy-like transfers instead of column-wise cache-unfriendly accesses).
- Moving operations above broadcasts such that the smallest amount of work is
  performed.

### XLA HLO: Tensor to Primitive Conversion

<a id="markdown-XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion" name="XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion"></a>

HLO only operates on tensor values - even for simple scalars - and this presents
a problem when attempting to determine which code should be specified to run on
accelerators vs. what should run on the host. The canonical example is
`mhlo.while`, which as seen in the example below uses scalar tensors for its
loop iteration counter and comparison.

```mlir
%start = constant dense<1> : tensor<i32>
%bound = constant dense<3> : tensor<i32>
%res = "mhlo.while"(%start) ( {
^bb0(%count: tensor<i32>):
  %1 = "mhlo.compare"(%count, %bound) {comparison_direction = "LT"} : (tensor<i32>, tensor<i32>) -> tensor<i1>
  "mhlo.return"(%1) : (tensor<i1>) -> ()
}, {
^bb0(%count: tensor<i32>):
  %1 = mhlo.add %count, %count : tensor<i32>
  "mhlo.return"(%1) : (tensor<i32>) -> ()
}) : (tensor<i32>) -> tensor<i32>
```

A naïve but correct lowering (what's currently in IREE) would perform the
comparison and increment on the device and insert a host readback to see if the
loop should continue:

```mlir
func @main() -> tensor<i32> attributes {iree.module.export, iree.reflection = {f = "I1!R6!B3!t6", fv = "1"}} {
  %cst = constant dense<1> : tensor<i32>
  %cst_0 = constant dense<3> : tensor<i32>
  %cst_1 = constant dense<1> : vector<3xi32>
  br ^bb1(%cst : tensor<i32>)
^bb1(%2: tensor<i32>):  // 2 preds: ^bb0, ^bb2
  %3 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %2 : tensor<i32>, %arg2 = %cst_0 : tensor<i32>) -> tensor<i1> {
    %8 = flow.dispatch @main_ex_dispatch_0::@main_ex_dispatch_0[%arg0 : vector<3xi32>](%arg1, %arg2) : (tensor<i32>, tensor<i32>) -> tensor<i1>
    flow.return %8 : tensor<i1>
  }
  %4 = flow.tensor.load %3 : tensor<i1>
  cond_br %4, ^bb2(%2 : tensor<i32>), ^bb3(%2 : tensor<i32>)
^bb2(%5: tensor<i32>):  // pred: ^bb1
  %6 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %5 : tensor<i32>) -> tensor<i32> {
    %8 = flow.dispatch @main_ex_dispatch_1::@main_ex_dispatch_1[%arg0 : vector<3xi32>](%arg1) : (tensor<i32>) -> tensor<i32>
    flow.return %8 : tensor<i32>
  }
  br ^bb1(%6 : tensor<i32>)
^bb3(%7: tensor<i32>):  // pred: ^bb1
  return %7 : tensor<i32>
}
```

Of note is the `flow.tensor.load` op indicating a host readback. Though this
correctly executes the loop, it is extremely inefficient. What's desired is for
the loop counter and condition to be computed entirely on the host, with the
counter being passed to the loop body as an argument that can be encoded into a
command buffer in future lowering stages. This eliminates the host readback and
allows for much larger `flow.stream` sequences, feeding more into the pipeline
for the accelerator.

Not all source frontends have this issue (misrepresenting simple host
computation as non-dense tensor operations), and our goal is to add a
transformation that heuristically converts `mhlo` ops acting on small tensors
to `std` ops acting on primitive values (`i32`, `index`, etc).
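
As a concrete illustration, the converted loop amounts to something like the
following host-side sketch (hypothetical code, not IREE's actual lowering): the
counter and comparison become plain scalar operations while any heavy tensor
work in the body stays on the device.

```c++
#include <cstdint>

// Hypothetical host-side equivalent of the `mhlo.while` above after the
// scalar tensors are converted to primitive values.
int32_t RunLoop() {
  int32_t count = 1;        // %start, now a plain i32 instead of tensor<i32>
  const int32_t bound = 3;  // %bound
  while (count < bound) {   // mhlo.compare, evaluated on the host
    count = count + count;  // mhlo.add on a scalar, evaluated on the host
    // Device dispatches for the real tensor work would be recorded here, with
    // `count` encoded directly into the command buffer.
  }
  return count;
}
```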

### Quantization

<a id="markdown-Quantization" name="Quantization"></a>

It's assumed that any work related to quantization/compression has happened
prior to lowering into IREE dialects. Our plan is to use the proposed
[Quantization Transforms](https://llvm.discourse.group/t/rfc-a-proposal-for-implementing-quantization-transformations-in-mlir/655)
to achieve both training and inference-time quantization of types in a way that
preserves maximum accuracy. IREE will support running with original unquantized
floats in all cases, allowing for a smooth on-ramp to quantization and the gains
in performance and reduction in model size that come from it.

As future work, IREE would like to move beyond these transformation-directed
approaches to quantization and interface directly with frontends that have a
sufficiently well-defined type system to represent accurate quantized (and
otherwise compressed) computations directly, not relying exclusively on
compiler-side type inference transforms.

## `flow`: Data- and Execution-Flow Modeling

<a id="markdown-%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling" name="%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling"></a>

The `flow` dialect is designed to allow us to extract as much concurrency as
possible from a program and partition IR into the scheduling and execution
domains. Today we have the IR structure and transformation flow in place but
have not yet gotten to the most interesting things such an infrastructure
enables. A majority of the largest performance, latency, and memory usage
improvements IREE can offer are determined first here and all following
lowerings benefit. _The fastest code is the code you don't execute and the
smallest allocation is the allocation you don't make_ ;)

### Avoiding Readbacks with `flow.stream`

<a id="markdown-Avoiding%20Readbacks%20with%20%60flow.stream%60" name="Avoiding%20Readbacks%20with%20%60flow.stream%60"></a>

A majority of the readbacks we have today (manifested as `flow.tensor.load.*`
ops) will be removed when we have an
[HLO tensor->primitive conversion](#xla-hlo-tensor-to-primitive-conversion).
There will still be cases when readbacks are required for correctness but they
usually fall into a small set of usage patterns. For those that don't, this is
one place where IREE will warn about performance issues, allowing programs that
perform suboptimally to still run while encouraging authors to adjust their
input model to enable better behavior. The IREE VM also has specific support
for hiding readback latency in an efficient way via
[coroutines](#coroutines-for-batching-and-cooperative-scheduling).

The most common case we are currently seeing in the IR is that of dynamic copies
where the offsets are dependent on the result of previous computations. Source
models may have top-k + gather operations, for example. These appear as a
`flow.stream`, a `flow.tensor.load`, and then another `flow.stream` that uses
the loaded value for a `flow.tensor.update` (or other operation):

```mlir
%index_tensor = flow.ex.stream.fragment(...) -> tensor<i32> { ... }
%index = flow.tensor.load %index_tensor : tensor<i32>
%result = flow.ex.stream.fragment(%arg0 = %index : i32, ...) -> ... {
  %0 = flow.dispatch ...
  %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32>
  ...
}
```

Today the `flow.tensor.update` turns into HAL command buffer transfer operations
that must have their offsets known at recording time. This is a limitation of
`vkCmdCopyBuffer` but not a fundamental limitation of any hardware. In fact,
several drivers implement copies as small built-in shader programs, meaning that
we could perform the same expansion here with the right primitives. This would
allow, in the above example, both the index to be computed and the tensor to be
updated within the same stream to entirely remove the host round-trip.

### Threading `flow.stream` through the CFG

<a id="markdown-Threading%20%60flow.stream%60%20through%20the%20CFG" name="Threading%20%60flow.stream%60%20through%20the%20CFG"></a>

The current `flow.ex.stream.fragment`, as denoted by the `ex`perimental tag, is
a temporary implementation designed to get the concept of streams lowered to the
HAL dialect. For streams to be effective at modeling larger concurrency scopes
they need to be able to move across branches in the CFG. This intuitively
follows exactly what one would do if recording commands in C:

```c++
vkCmdCopyBuffer(cmd, ...);
if (some_flag) {
  vkCmdBindPipeline(cmd, ..., pipeline_a);
} else {
  vkCmdBindPipeline(cmd, ..., pipeline_b);
}
vkCmdDispatch(cmd, ...);
```

The corresponding `flow` IR:

```mlir
  flow.stream.append[%s0](...) {
    flow.tensor.update ...
  }
  %b = cmpi ne %some_flag, ...
  cond_br %b, ^a(%s0), ^b(%s0)
^a(%s1):
  flow.stream.append[%s1](...) {
    flow.dispatch @pipeline_a, ...
  }
  br ^end(%s1)
^b(%s2):
  flow.stream.append[%s2](...) {
    flow.dispatch @pipeline_b, ...
  }
  br ^end(%s2)
^end(%s3):
  ...
```

This allows the entire stream to be lowered into one command buffer without the
need for any host round-trips. The conversion into the `flow` dialect will walk
the CFG and attempt to thread the `flow.stream` values through so long as there
are no external dependencies.

### Predication of `flow.dispatch`

<a id="markdown-Predication%20of%20%60flow.dispatch%60" name="Predication%20of%20%60flow.dispatch%60"></a>

While the
[`flow.stream` threading through the CFG](#threading-flowstream-through-the-cfg)
can remove many of the simpler conditional dispatches there will always be some
that will have their execution dependent on the result of prior dispatches. For
these a `flow.cond_dispatch` will allow a condition to be provided that must be
true for the dispatch to actually be performed.

For targets that natively support predication in their command buffers (such as
D3D12's
[ID3D12GraphicsCommandList::SetPredication](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12graphicscommandlist-setpredication))
this provides a host round-trip-free way of conditionally executing dispatches
and transfers. Unfortunately Vulkan support is still lacking, though Nvidia
supports the
[VK_EXT_conditional_rendering](https://www.saschawillems.de/blog/2018/09/05/vulkan-conditional-rendering/)
extension that exposes the same behavior.

For targets that do not support predication natively it's still possible to
emulate predication with
[indirect dispatches](https://github.com/gpuweb/gpuweb/issues/31). In this model
the workgroup counts normally used to dispatch execution are sourced from
another device buffer at the time the dispatch is made instead of sourced from
the command buffer at the time the dispatch is recorded. Degenerate dispatches
with counts of `0, 0, 0` allow for effective neutering of the dispatch with
minimal overhead (vs. the significant penalty of a host round-trip!).
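
A minimal sketch of that emulation (illustrative names only; not IREE's HAL
API): a tiny prior dispatch or fused epilogue writes the workgroup counts that
the indirect dispatch later consumes, so a false predicate simply zeroes them
out.

```c++
#include <cstdint>

// Matches the x/y/z workgroup count layout consumed by an indirect dispatch
// (e.g. VkDispatchIndirectCommand). Names here are illustrative only.
struct DispatchArgs {
  uint32_t workgroups_x;
  uint32_t workgroups_y;
  uint32_t workgroups_z;
};

// A prior (tiny) dispatch or fused epilogue writes this struct into a device
// buffer; the indirect dispatch then sources its counts from that buffer at
// execution time rather than from the command buffer at record time.
DispatchArgs SelectWorkgroupCounts(bool predicate, DispatchArgs real_counts) {
  // Degenerate 0x0x0 counts execute no workgroups - effectively a no-op -
  // which is far cheaper than reading the predicate back on the host.
  return predicate ? real_counts : DispatchArgs{0, 0, 0};
}
```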

By modeling such predication at the `flow` level we are able to lower into the
HAL with target-aware predication semantics and fuse indirect dispatch workgroup
count calculations into existing dispatches already being performed such that
overhead is reduced.

### Deduping `flow.executable`s

<a id="markdown-Deduping%20%60flow.executable%60s" name="Deduping%20%60flow.executable%60s"></a>

While still in the `flow` dialect, the executables are target-agnostic. This
makes simple IR tree diffing a potential solution to deduplication. Since most
of the dispatches originate from the same source-language library calls in input
frameworks there's a high likelihood of duplication, and depending on when
inlining is performed we may have stronger or weaker ability to perform the
deduplication. Thanks to the MLIR canonicalization pass (that ensures ops are
rearranged into consistent canonical representations) the IR comparisons can be
done rather trivially.
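
A sketch of what such deduplication could look like (hypothetical helpers, not
the actual pass): compare the canonicalized executable IR and redirect
duplicate dispatches at the first equivalent executable.

```c++
#include <map>
#include <string>
#include <vector>

// Hypothetical representation of a flow.executable after canonicalization.
struct Executable {
  std::string name;
  std::string canonical_ir;  // printed, canonicalized IR used as the dedup key
};

// Returns a map from duplicate executable name -> surviving executable name.
// Callers would then rewrite flow.dispatch ops to reference the survivor.
std::map<std::string, std::string> DedupeExecutables(
    const std::vector<Executable>& executables) {
  std::map<std::string, std::string> replacements;
  std::map<std::string, std::string> seen;  // canonical IR -> first name
  for (const auto& executable : executables) {
    auto it = seen.find(executable.canonical_ir);
    if (it == seen.end()) {
      seen[executable.canonical_ir] = executable.name;
    } else {
      replacements[executable.name] = it->second;
    }
  }
  return replacements;
}
```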

### Rematerializing CSE'd Expressions

<a id="markdown-Rematerializing%20CSE'd%20Expressions" name="Rematerializing%20CSE'd%20Expressions"></a>

Common subexpression elimination is performed many times during lowering;
however, there comes a point where the CSE can introduce false dependencies and
additional allocations that are otherwise avoidable. For example, if a
broadcasting operation is CSE'd and then the result is used by two or more
operations that are scheduled independently, what would have been a relatively
cheap lowering of the broadcast to a simple index remapping now becomes an
additional dispatch, materialization of an intermediate tensor, and a barrier:

```mlir
%bcast = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
%mul1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
// (pretend something here that prevents fusion)
%mul2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
```

```mlir
%bcast = flow.dispatch.region(%cst : tensor<f32>) -> tensor<1024x10xf32> {
  %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  return %0 : tensor<1024x10xf32>
}
// a barrier will be required here
%mul1 = flow.dispatch.region(%arg0 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
%mul2 = flow.dispatch.region(%arg1 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
  return %2 : tensor<1024x10xf32>
}
```

Instead, the broadcast should be rematerialized inside of both dispatch regions,
as the cost of doing so is significantly less in compute resources and the
intermediate tensor will not be required at all. Though at first it may seem
counter-intuitive to undo such a critical optimization as CSE (both for code
size and often for compute), it's something we must carefully balance while
looking at the whole system. It gets even more important when considering
multi-device execution, as the cost of sharing memory and synchronizing may be
extremely non-trivial.

### Device Placement

<a id="markdown-Device%20Placement" name="Device%20Placement"></a>

While still within the `flow` dialect we have the ability to easily split
streams and safely shuffle around operations. Target execution backends can opt
into such behavior to ensure that device restrictions such as maximum in-flight
memory, maximum scheduling depth, and capabilities are observed. For
heterogeneous configurations the intent is that certain operations, dispatches,
and streams can be attributed to specify which device categories they should be
lowered to. The constraint solving that takes place can be provided with generic
heuristics ("big GEMMs go on the accelerator"), profile-guided databases based
on benchmarks, learned traits via ML, etc.

## `hal`: Hardware Abstraction Layer and Multi-Architecture Executables

<a id="markdown-%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables" name="%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables"></a>

As the IREE HAL is designed almost 1:1 with a compute-only Vulkan API, many of
the techniques classically used in real-time graphics apply. The benefit we have
by modeling our usage of such a low-level API in IR is that the normal work -
some of which is very non-trivial - for managing allocations, tracking resource
lifetime, and ensuring proper synchronization/barriers is something we can apply
the full force of an offline compiler against.

### Allow Targets to Specify `hal.interface`s

<a id="markdown-Allow%20Targets%20to%20Specify%20%60hal.interface%60s" name="Allow%20Targets%20to%20Specify%20%60hal.interface%60s"></a>

The `hal.interface` op specifies the ABI between the scheduler and the device
containing the buffer bindings and additional non-buffer data (parameters,
shapes, specialization flags, etc). Today a naïve ordering is used uniformly for
all targets; however, it is possible for target backends to opt into providing
their own interfaces based on target configuration. The same `hal.executable`
may have multiple interfaces and the same backend may use one or more. This is
useful when target capabilities may vary at runtime, such as the
[number of available storage buffer bindings](https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxPerStageDescriptorStorageBuffers&platform=android)
in Vulkan. By exposing a few `hal.interface` variants with different binding
amounts the Vulkan backend could make better use of the larger number of
bindings available at runtime while still providing support for smaller
configurations.

Once we have multiple `hal.interface`s defined for executables the scheduler
needs to emit HAL ops that properly switch between them. By having a canonical
form for bindings we can ensure that only the differences between the interfaces
will need additional code.

### Target-specific Scheduling Specialization

<a id="markdown-Target-specific%20Scheduling%20Specialization" name="Target-specific%20Scheduling%20Specialization"></a>

Though the `flow` dialect attempts to fuse as many ops as possible into dispatch
regions, it's not always possible for all target backends to schedule a region
as a single dispatch. A classic example is algorithms like
[parallel reduction](https://en.wikipedia.org/wiki/Reduction_Operator#PRAM-algorithm)
commonly used on GPUs that may require many dispatches to identical executables,
while other algorithms may vary the executables they use based on the input
parameters such as shape or the target runtime device support.

By default the `flow.dispatch` executable translation to `hal.executable`s is
performed 1:1 and it is assumed that a single dispatch is required. Extending
target backends with scheduling interfaces (enabling them to opt into different
scheduling behavior) will allow the backends to emit any number of
`hal.executable`s and any stream commands (such as additional dispatches or
transfers) they may need. This is effectively equivalent to what would be done
at runtime, only because we are still operating on IR prior to buffer allocation
we can use the `hal` ringbuffer primitive. Through this we can elide many of the
allocations that would otherwise be required at runtime (and the
concurrency-limiting false dependencies that usually come along with scratch
memory).

Since the algorithm used may vary based on the parameters of the dispatch (such
as the shape of the reduction, which may be dynamically determined), scheduling
specialization may occur even when targeting a single backend. In many cases
folding and canonicalization can eliminate the overhead, as the same IR is
present regardless of which dynamically computed workgroup size is used.

### Buffer Usage Tracking

<a id="markdown-Buffer%20Usage%20Tracking" name="Buffer%20Usage%20Tracking"></a>

Many explicit hardware APIs require knowing how buffers are used alongside where
they should be located. For example, this additional information determines
caching policy on buffer accesses (write-through, write-back, etc), visibility
of writes across compute units, and the possible MMU properties that may need to
be maintained/matched for the buffer. By using the SSA-form value-semantics of
the MLIR `tensor` as used in the `flow` dialect we have complete information of
where buffers may be used or at least where they enter or leave regions where we
can derive such information.

Analysis passes can run over IR to attribute tensors such that when allocation
is performed when lowering to the `hal` dialect we do so from an allocator
compatible with where the buffer will be used, with memory types chosen based on
the potential cost and location of operations performed (write-only on host vs.
read-write on host and device, etc), and with usage bits indicating what kind of
operations may be performed on the buffer. Many of these are local
transformations as most buffers are only live within very small regions such as
the `flow.stream` encompassing their usage.
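
As a rough sketch of the kind of information such an analysis would attach to
each buffer before allocation (the flag names below are illustrative, not the
actual `hal` attributes):

```c++
#include <cstdint>

// Illustrative usage/memory-type bits an analysis pass might attach to a
// tensor before allocation; the real HAL enums differ.
enum BufferUsage : uint32_t {
  kUsageTransferSrc = 1u << 0,
  kUsageTransferDst = 1u << 1,
  kUsageDispatchRead = 1u << 2,
  kUsageDispatchWrite = 1u << 3,
};

enum MemoryType : uint32_t {
  kMemoryHostVisible = 1u << 0,
  kMemoryDeviceLocal = 1u << 1,
};

struct BufferPlacement {
  uint32_t usage;        // OR of BufferUsage bits derived from all uses
  uint32_t memory_type;  // chosen from where reads/writes actually occur
};

// Example: a transient tensor only ever written and read by dispatches can be
// placed in device-local memory with no host-visible requirement.
constexpr BufferPlacement kTransientDispatchBuffer = {
    kUsageDispatchRead | kUsageDispatchWrite, kMemoryDeviceLocal};
```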

Traditional systems need to either use very permissive buffer properties or
heuristics that can introduce additional non-trivial overhead when such
heuristics are incorrect. For example,
[OpenGL had several such usage hints](https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferData.xhtml)
that drivers were then able to use, but almost no drivers behaved as desired in
all cases and it led to additional memory ghosting, copies, readbacks, and
unpredictable performance. For almost all uses of the buffers within an IREE
invocation we can instead know precisely where and how buffers may need to be
moved and do it a minimum number of times if it is required.

### Batched Executable Caching and Precompilation

<a id="markdown-Batched%20Executable%20Caching%20and%20Precompilation" name="Batched%20Executable%20Caching%20and%20Precompilation"></a>

For targets that may require runtime preprocessing of their executables prior to
dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch
compilation mechanism based on Vulkan's
[Pipeline Cache](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html).

Today each executable is compiled on-demand and cached only for the process
lifetime. Though some drivers may provide their own caching we can make better
use of the explicit caching and compilation behavior with the additional
information we have in the compiler.

For any given entry point (or group of entry points) into an IREE module we can
perform reachability analysis to know which executables may be executed when
that entry point is invoked. In this way we can emit pre-invocation compilation
checks (similar to an `std::call_once` block) that provide all required
executables for compilation and allow more efficient compilation through
multithreading the compiler invocations. These same compilation caching
functions can be exposed and invoked manually by an application to force
pre-compilation when it is least likely to impact the user, such as a
post-install/first-run step or concurrently while other application features
are loading.
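
A minimal sketch of the emitted pre-invocation check, assuming a hypothetical
`CompileAll` runtime helper (the actual IREE executable cache API is not
shown):

```c++
#include <mutex>
#include <string>
#include <vector>

// Hypothetical runtime hook that compiles a batch of executables, possibly
// fanning the work out across threads; the real HAL cache API differs.
void CompileAll(const std::vector<std::string>& executable_names);

// Emitted once per entry point (or group of entry points): the first caller
// triggers batched compilation of every executable reachable from the entry
// point; later callers pass through with only the cost of the flag check.
void EnsureEntryPointExecutablesCompiled() {
  static std::once_flag compiled_flag;
  std::call_once(compiled_flag, [] {
    CompileAll({"main_ex_dispatch_0", "main_ex_dispatch_1"});
  });
}
```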

We can use zero or more scoped caches for executables within a module.
Completely dynamic modules (such as those emitted in eager-mode usage) may avoid
the caching overhead entirely, while modules that have several primary usage
modes (such as training and inference) may choose to use independent caches for
each such mode.

The caches generated can then be retrieved and saved by the hosting application.
Upon the next execution the application can provide the caches and if still
valid they will be used to avoid compilation.

### Target-aware Executable Compression

<a id="markdown-Target-aware%20Executable%20Compression" name="Target-aware%20Executable%20Compression"></a>

An advantage of representing executable binaries in IR after translation is that
we can apply various post-compilation compression and minification techniques
while still knowing precisely where the executable will be used. This is
extremely important for SPIR-V as it is not designed to be a small at-rest
format. Though the biggest lever we have to control generated code size is
higher-level deduplication and specialization, there will still be a
sufficiently large number of executable binaries we will need to embed within
the final modules and having targeted approaches for reducing their size beyond
just "gzip everything" is very powerful.

For example, [SMOL-V](https://github.com/aras-p/smol-v) is a fantastic lossless
SPIR-V compression technique that, when coupled with modern dictionary-based
compression algorithms, can save significant binary size. As a data point, the
SPIR-V corpus SMOL-V uses for testing goes from 4.8MiB of raw SPIR-V to 348KiB
of compressed SMOL-V.

Combined with
[Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation)
we can easily use shared dictionaries and other cross-artifact compression in a
relatively plug-in way.

### Target-aware Constant Compression

<a id="markdown-Target-aware%20Constant%20Compression" name="Target-aware%20Constant%20Compression"></a>

It's still an area that needs more research but one goal of the IREE design was
to enable efficient target- and context-aware compression of large constants
(typically model weights/parameters/embeddings). This may mean reusing existing
hardware compression formats on GPUs, ML accelerator-specific formats, or
very-low-bit-depth (1-4 bit per value) quantization techniques that cannot be
directly used without first decompressing. The inspiration here is formats like
[Crunch](https://github.com/BinomialLLC/crunch) and
[Basis Universal](https://github.com/BinomialLLC/basis_universal) that perform
["supercompression"](http://gamma.cs.unc.edu/GST/gst.pdf), and we may even be
able to use these directly as then we can make use of GPU hardware samplers to
do the 4-bit to 32-bit decompression, etc.

### Command Buffer Stateful Deduplication

<a id="markdown-Command%20Buffer%20Stateful%20Deduplication" name="Command%20Buffer%20Stateful%20Deduplication"></a>

The IREE HAL - much like the Vulkan API it is based on - eschews much of the
state that traditional APIs have in favor of (mostly) immutable state objects
(pipeline layouts, pipeline states, descriptor sets, etc). There are still a few
stateful entry points in the API, though, and deduplicating or reordering
redundant calls can reduce IR, API, and execution overhead.

The key place this will have the largest impact is around descriptor set
bindings and push descriptors, both of which are state and can have non-trivial
setup overhead. A canonicalization for such commands that inspects the target
`hal.command_buffer` to see if the same state was set prior and code motion to
move such commands out of loop bodies when possible would be helpful.
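
A sketch of the shape of such a deduplication (illustrative types only; the
same check can be performed statically over `hal.command_buffer` ops in IR):
track the last state set and drop re-binds of identical state.

```c++
#include <cstdint>
#include <optional>

using PipelineHandle = uint64_t;

// Illustrative recorder wrapper that elides redundant pipeline binds; the same
// idea applies to descriptor set bindings and push descriptors.
class StatefulRecorder {
 public:
  // Returns true if a bind was actually recorded, false if it was redundant.
  bool BindPipeline(PipelineHandle pipeline) {
    if (last_pipeline_ == pipeline) return false;  // same state; skip the call
    last_pipeline_ = pipeline;
    // ... record the actual bind into the underlying command buffer ...
    return true;
  }

 private:
  std::optional<PipelineHandle> last_pipeline_;
};
```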

### Resource Timeline

<a id="markdown-Resource%20Timeline" name="Resource%20Timeline"></a>

A core concept of the IREE scheduler that allows for overlapping in-flight
invocations is that of the resource timeline. This identifies module state that
can be in use by multiple invocations and assigns timeline milestones denoting
when the resource will be in the appropriate state for the current invocation to
proceed. Conceptually it is like an epoch-based synchronization mechanism as
commonly found in garbage collectors to allow for lock-free asynchronous memory
reclamation.

The advantage we have in the IR is that we know both the usage of all resources
thanks to [buffer usage tracking](#buffer-usage-tracking) and the
synchronization domains of all resources (in most cases). This allows us to
effectively assign one timeline semaphore per writeable resource while in
practice having far fewer than 1:1; for example, if two resources are only ever
written in the same command buffer, only one semaphore is needed to signal the
completion of both writes.
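
Conceptually the bookkeeping looks something like the following sketch (names
are illustrative): each writeable resource maps to the semaphore/value pair at
which its last write completes, and later invocations wait on that point before
reusing the resource.

```c++
#include <cstdint>
#include <unordered_map>

using ResourceId = uint64_t;
using SemaphoreId = uint64_t;

// Illustrative timeline point: the semaphore/value pair a consumer must wait
// on before it may read or overwrite the resource.
struct TimelinePoint {
  SemaphoreId semaphore;
  uint64_t value;  // payload value signaled when the last write completes
};

class ResourceTimeline {
 public:
  // Called when a command buffer writing `resource` is submitted.
  void MarkWrite(ResourceId resource, TimelinePoint completion) {
    last_write_[resource] = completion;
  }

  // Called before recording a use of `resource` in a later invocation.
  // Returns the point to wait on, if any prior write is still outstanding.
  const TimelinePoint* QueryWait(ResourceId resource) const {
    auto it = last_write_.find(resource);
    return it == last_write_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<ResourceId, TimelinePoint> last_write_;
};
```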

By transforming IR to sink all resource reads and writes closest to where the
value is used we can enlarge the time windows that can overlap across
invocations that may share those resources. This is similar to what out-of-order
CPUs do with register renaming/reorder buffers/etc and something we can apply
some traditional instruction scheduling techniques to (only here our
'instructions' are entire command buffer dispatches/transfers).

Two degenerate cases of this approach are that of resource indirection
(`iree.ptr<tensor<T>>`) and dynamic resource shapes. In these two cases it may
not be possible to continue recording commands even if we are able to ensure
execution is appropriately synchronized. This is where indirect dispatch,
[predication](#predication-of-flowdispatch),
[indirect command buffers](#indirect-command-bufferon-accelerator-execution),
and [VM coroutines](#coroutines-for-batching-and-cooperative-scheduling) can all
help cover for the times where we are unable to transform away the indirection
or emit shape logic without data dependencies.

### Transient Tensor Ringbuffer

<a id="markdown-Transient%20Tensor%20Ringbuffer" name="Transient%20Tensor%20Ringbuffer"></a>

(When properly implemented) almost all buffers required during execution never
escape the command buffers they are used in or a single VM invocation. We can
trivially identify this from the explicit captures of `flow.stream` and
`flow.dispatch` ops and the fact that all tensor types have value-semantics.
Only those tensor values loaded-from/stored-to module state or that cross the
exported module function boundary need special consideration while almost
everything else can live transiently only so long as it is required during
execution.

Thanks to this information about buffer usage and lifetime we can use a
[ringbuffer](https://en.wikipedia.org/wiki/Circular_buffer) to store the
transient tensor data and other required data reservations such as uniform
buffers used to pass dynamic parameters (shapes, flags, etc) into dispatches.
This gives the compiler and the application a knob that allows them to control
maximum concurrency (by having a very large ringbuffer) or maximum memory usage
(by having a minimally small ringbuffer).

Allocating tensors from the ringbuffer does not require sophisticated runtime
packing as we can emit IR to calculate required sizes for dynamically shaped
tensors. Whether a basic block reserves `%sz = constant 42 : index` bytes or
`%sz = std.muli %cst, %dyn_dim : index` bytes doesn't materially change how the
allocations are performed. Since almost all usage involves simple write head
bumps there is no need for ahead-of-time memory planning or large fixed
allocations, and since no buffer within the ringbuffer can alias we can have
coarse (_read: low overhead_) guarantees about the availability of certain
regions of the ringbuffer (_"when this event is signaled all prior ringbuffer
writes have completed"_).
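
A sketch of the write-head bump allocation described above (wrap-around and
reclamation, which the resource timeline governs, are elided; the alignment
value is illustrative):

```c++
#include <cstddef>
#include <cstdint>

// Minimal transient ringbuffer sketch: allocation is a bump of the write head,
// with sizes either constant or computed from dynamic dimensions in IR.
// Reclamation (advancing the read head) is driven by completion events and is
// omitted here.
class TransientRingbuffer {
 public:
  explicit TransientRingbuffer(size_t capacity) : capacity_(capacity) {}

  // Returns the byte offset of the reservation, or -1 if the ring is full
  // (at which point the caller must wait on the resource timeline).
  ptrdiff_t Reserve(size_t size, size_t alignment = 256) {
    size_t aligned_head = (write_head_ + alignment - 1) & ~(alignment - 1);
    if (aligned_head + size > capacity_) return -1;  // wrap/wait elided
    write_head_ = aligned_head + size;
    return static_cast<ptrdiff_t>(aligned_head);
  }

 private:
  size_t capacity_;
  size_t write_head_ = 0;
};
```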

Usually any planning we may want to perform can be done in IR via code motion.
For example applying traditional algorithms used to reduce register pressure
will help us attain narrower live windows within the ringbuffer leading to a
larger number of in-flight operations for the same ringbuffer memory usage.

We may end up using both a classical ringbuffer and a variant known as the
[bip buffer](https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The-Circular-Buffer-with-a-Twist)
because it is better for descriptor set utilization (as we can provide many
dispatch parameters with a single base offset bound once at the beginning of a
region).

### Timeline Semaphores on the Module ABI

<a id="markdown-Timeline%20Semaphores%20on%20the%20Module%20ABI" name="Timeline%20Semaphores%20on%20the%20Module%20ABI"></a>

Function calls made across modules (either from C++ into the VM, VM->VM, or
VM->C++) should be able to define timeline semaphores used to wait and signal on
the call. We can do this by making all exports automatically have the semaphores
and then having invocations populate them if they were not provided by the
caller. In this way we can allow multiple invocations of exported functions to
chain naturally with internal asynchronous workloads, turning most IREE
invocations into just recording of command buffers that can never block.
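
A sketch of what such an export looks like from the caller's perspective
(hypothetical signatures, not the actual IREE API): every invocation takes a
wait point and a signal point so callers can chain work without blocking.

```c++
#include <cstdint>

// Illustrative handles only; the real IREE HAL semaphore types differ.
struct SemaphoreHandle { uint64_t id; };

struct TimelinePoint {
  SemaphoreHandle semaphore;
  uint64_t value;
};

// Hypothetical shape of an exported module function: execution begins once
// `wait` is reached and `signal` is signaled when results are ready. The call
// itself only records work and returns immediately.
void InvokePredict(/*inputs..., outputs...,*/ TimelinePoint wait,
                   TimelinePoint signal);

// Chaining two invocations device-side: the second waits on the first's signal
// point, so the host never blocks in between.
void RunPipeline(SemaphoreHandle timeline) {
  InvokePredict({timeline, /*value=*/0}, {timeline, /*value=*/1});
  InvokePredict({timeline, /*value=*/1}, {timeline, /*value=*/2});
}
```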

When combined with
[VM coroutine support](#coroutines-for-batching-and-cooperative-scheduling) we
even have the ability to interleave any required host execution between the wait
and signal semaphores provided such that the caller never knows on which device
execution is taking place. It's still possible to provide synchronous wrappers
that emulate blocking behavior, but by having the core system designed around a
single system-supported primitive we avoid the need for additional things like
interrupt watchdog threads, implicit blocking, and other pitfalls.

### GPU-like CPU Scheduling

<a id="markdown-GPU-like%20CPU%20Scheduling" name="GPU-like%20CPU%20Scheduling"></a>

One approach to using multiple cores on a CPU is to perform interior
parallelization of operations such as OpenMP or library-call-based custom thread
pools (gemmlowp). This works when each individual operation is relatively costly
vs. potential pipeline bubbles caused by work spinning down near the end of an
operation and spinning up at the beginning of the next.

IREE is designed to handle many more workloads - some of which have very narrow
shapes but very deep pipelines (like search algorithms) - such that the above
approach of multithreading within ops becomes a bottleneck. These workloads are
traditionally very poorly handled by frameworks and issues with
oversubscription, pipeline stalls, and suboptimal system schedulers (such as on
Android) can lead to more time being spent thrashing about than actually
executing real work.

The approach we take here is to treat the cores of a CPU as if they were
computation units on a GPU, each able to perform some set of heterogeneous work
independent of other units. The concurrency we model at the `flow` level and
communicate to the runtime via the `hal` - which explicitly states which
dispatches can overlap and the size of the workgroups - can then trivially be
used to distribute this work over many cores exactly as a GPU would do it.
Integration with library calls that may require their own threading (such as
Ruy) requires that they be able to use the IREE thread pool instead of their
own.

In this way we can avoid pipeline bubbles and other latency-inducing
unpredictable scheduling. This does not mean that we treat individual units of
work at the same scale as we would for GPUs, but instead that we tile and have
one or more processing units that allow us to work on those tiles. Whether the
tile size is defined by a library call contract, heuristics, or empirically is
TBD, but expect workgroup sizes in the thousands to millions of invocations vs.
normal GPU workgroup sizes in the dozens to hundreds of invocations.
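
A simplified sketch of distributing workgroups across cores in this style
(plain threads instead of a fiber scheduler like marl, purely for
illustration):

```c++
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Treats each CPU core like a GPU compute unit: workers pull workgroup indices
// from a shared counter and invoke the tiled workgroup body, just as a GPU
// would distribute workgroups across its units.
void DispatchOnCores(uint32_t workgroup_count,
                     const std::function<void(uint32_t)>& workgroup_body) {
  std::atomic<uint32_t> next_workgroup{0};
  unsigned worker_count = std::max(1u, std::thread::hardware_concurrency());
  std::vector<std::thread> workers;
  workers.reserve(worker_count);
  for (unsigned i = 0; i < worker_count; ++i) {
    workers.emplace_back([&] {
      for (uint32_t wg = next_workgroup.fetch_add(1); wg < workgroup_count;
           wg = next_workgroup.fetch_add(1)) {
        workgroup_body(wg);  // one tile of work, sized by the compiler
      }
    });
  }
  for (auto& worker : workers) worker.join();
}
```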

To achieve this style of scheduling efficiently we'll likely use
[marl](https://github.com/google/marl) as the scheduler. This provides
cross-platform low-overhead fibers and is compatible with this style of
scheduling as it was built for the SwiftShader software rasterizer.

Even if IREE were only targeting CPUs the assertion is that we would still want
to schedule this way, and it's only an incidental benefit that if building for
heterogeneous targets the scheduling code may be shared (just with a different
divisor for workgroup count calculations).

## `vm`: Lightweight Virtual Machine

<a id="markdown-%60vm%60%3A%20Lightweight%20Virtual%20Machine" name="%60vm%60%3A%20Lightweight%20Virtual%20Machine"></a>

The VM is designed as a dynamic linkage ABI, stable bytecode representation, and
intermediate lowering IR. Many of the optimizations we can perform on it will
benefit all use cases (such as when lowering to LLVM IR) by allowing
higher-level program transformations around synchronization that are difficult
to perform on arbitrary LLVM IR.

### Coroutines for Batching and Cooperative Scheduling

<a id="markdown-Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling" name="Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling"></a>

One of the largest features currently missing from the VM is coroutines (aka
user-mode fiber scheduling). Coroutines are what will allow us to have multiple
in-flight invocations into a module - some of which may be waiting on external
events - without the need for complex multithreading logic or state machine
machinations.

In many cases
[once semaphores are exposed to callers](#timeline-semaphores-on-the-module-abi)
we will not need to yield in the VM. The user will call into the module with
provided semaphores, the work to perform will be recorded to one or more command
buffers and submitted to the device, and then control will return to the caller
immediately.

In cases requiring host readbacks that we were not able to remove, however,
additional VM code may need to run prior to when the final semaphore is
signaled. To preserve the asynchronous interface and immediate execution
guarantees the compiler can emit explicit yield points (`vm.yield`) that are
known-good locations for yielding (such as most resources not required after the
yield having been flushed/discarded, partial synchronization scope availability
if other work may be able to execute concurrently irrespective of the yielded
coroutine, etc).

When the VM encounters the yield at runtime it will suspend the coroutine until
a defined condition is met. Many coroutines can be in various states at any
given time and - thanks to the resource timeline - can still be memory safe. For
example if two stateless invocations are made with a common wait semaphore both
can be recorded and submitted without waiting on each other. If there is
internal module state accessed the invocations are implicitly ordered by
invocation order (similar to what Vulkan calls
[API order](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch02s02.html#fundamentals-queueoperation-apiorder))
based on internal resource timeline semaphores.

Waking the coroutines can be performed by an application-provided callback in
the case of the application already having a periodic event which is doing
bookkeeping (such as frame end callbacks when rendering or Looper idle events on
Android), giving direct control over the frequency and location at which IREE
performs additional work. A helper will be provided as well that runs a
dedicated IREE thread to do this, but the expectation is that applications can
often do a better (and importantly more predictable) job.

By utilizing coroutines IREE will have a way to fill traditional pipeline
bubbles even with execution from the same module (let alone across modules) in
the situation where host readbacks or other logic is required. This increases
overall throughput and utilization while reducing host wakeups as many
coroutines can be processed at once to submit new work to the device queues,
though it does not help reduce per-invocation latency.

External code such as the HAL implementation or user ops may provide the wait
handles used for continuation. For example, the HAL can expose a function that
yields and wakes only when one or more timeline semaphores reach their target
values:

```mlir
// submit work
hal.device.yield %semaphore4 >= %sem4_target, %semaphore5 >= %sem5_target
// continue here, possibly much later in time
```

#### Cellular Batching

<a id="markdown-Cellular%20Batching" name="Cellular%20Batching"></a>

Though coroutines help throughput there is a way we've found to reduce latency
that's been documented as
[cellular batching](http://madsys.cs.tsinghua.edu.cn/publications/EUROSYS2018-gao.pdf).
This same technique has been implemented in prior internal systems and is one of
the motivating design goals for IREE's creation. The core idea is to identify
small uniform work that can be partitioned and scheduled greedily such as to
enable batching or reduce associated invocation costs (such as refreshing
accelerator SRAM/caches with new parameters). This usually manifests as finding
large GEMM/GEMV operations using the same fixed parameters and either
dynamically increasing the batch size by adding the waiting work (without
deferring the actual execution time) or sequencing them back to back to ensure
better cache utilization. Which approach is taken depends on any data
dependencies that may be present (such as LSTM state feedback edges).

With the foundation of coroutines in IREE it's possible to yield execution at
any given point - including during command buffer recording - and wake on
specific conditions. A majority of the logic can be built into the module itself
with very little need for runtime machinery, as shared VM variables can be used
to track pending work across invocations (even from different parts of the
program) and flush based on logic wholly controlled by the user or compiler
(such as count/max time latency/etc limits). This allows for the large variety
of scheduling behavior various applications may want to use, ranging from a
zero-latency batch-only-within-this-invocation policy to a
[Nagle's Algorithm](https://en.wikipedia.org/wiki/Nagle%27s_algorithm)-esque
time- or limit-based behavior, or even some learned model-specific windowing.
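
The flush decision itself can be ordinary compiled module logic; a sketch of a
count-or-deadline policy of the sort described above (names and thresholds are
illustrative):

```c++
#include <chrono>
#include <cstddef>

// Illustrative deferral policy: pending requests accumulate until either a
// batch-size or a latency threshold is hit, then the batched dispatch is
// flushed. The same logic could live in VM variables and compiled module code.
class DeferPolicy {
 public:
  DeferPolicy(size_t max_batch, std::chrono::microseconds max_latency)
      : max_batch_(max_batch), max_latency_(max_latency) {}

  // Returns true if the accumulated batch should be flushed now.
  bool ShouldFlush(size_t pending_count,
                   std::chrono::steady_clock::time_point oldest_enqueue) const {
    if (pending_count == 0) return false;
    if (pending_count >= max_batch_) return true;
    return (std::chrono::steady_clock::now() - oldest_enqueue) >= max_latency_;
  }

 private:
  size_t max_batch_;
  std::chrono::microseconds max_latency_;
};
```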

Design work is still required on how to represent this in IR but the current
thought is to model the regions in which deferred execution is possible and
beneficial and to allow additional transformations during lowering to the VM.
This is similar to how the async-await behavior works in C# where the async
keyword is just sugar that expands to additional generated helper utilities.

A simple strawman representation for sequential dispatch may look like:

```mlir
hal.scheduling_policy @defer_policy {
  // max time, max count, max live memory, etc
}
...
hal.command_buffer.dispatch.deferred @defer_policy, @dispatch, ...
// vm.yield added here during lowering
```

There are many cases to explore and as cellular batching can have performance
benefits of several orders of magnitude it'll be one of the primary areas of
research in the long-term.

### Lowering to LLVM IR

<a id="markdown-Lowering%20to%20LLVM%20IR" name="Lowering%20to%20LLVM%20IR"></a>

For scenarios where dynamic module loading is not required and entire modules
can be compiled into applications we can lower the VM IR to LLVM IR within
MLIR's transformation pipeline. Instead of embedding `vm.call` ops that are
dispatched at runtime to things like the HAL we can instead lower to
`llvm::CallInst`s calling runtime-resolved function pointers. This still enables
all of the flexibility of heterogeneous/runtime-determined devices, pluggable
diagnostics, and backend composition without any need for flatbuffers or the VM
bytecode interpreter.

The VM was designed to make such a lowering easy and the C-style struct-based
function pointer registration for runtime modules was designed to make emitting
code that uses it fairly robust even when linked in dynamically, such as when
embedded in shared objects.
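
A sketch of the style of registration this refers to (structure and names are
illustrative, not the actual IREE VM module ABI):

```c++
#include <cstdint>

// Illustrative C-style module interface: the generated LLVM IR (or C) calls
// through runtime-resolved function pointers instead of vm.call ops, so the
// same module code works whether linked statically or loaded from a shared
// object. The real IREE native module struct differs.
extern "C" {

typedef int (*module_fn_t)(void* module_state, void* args, void* results);

typedef struct {
  const char* name;  // e.g. "hal.command_buffer.dispatch"
  module_fn_t fn;    // resolved at registration time
} module_export_t;

typedef struct {
  const char* module_name;
  const module_export_t* exports;
  int export_count;
} module_descriptor_t;

// Hypothetical host-runtime hook; generated code keeps the returned pointers
// and calls exports through them.
const module_descriptor_t* example_resolve_module(const char* name);

}  // extern "C"
```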

An extension of this is what we've been calling 'runtimeless mode', where the
IREE VM linkage code is statically linked into the binary alongside the
generated module LLVM IR. If only a single HAL backend is linked in then (with
some build-fu) we should be able to get call devirtualization to reduce code
size to precisely the functionality used by the module.

### Improved Type Support

<a id="markdown-Improved%20Type%20Support" name="Improved%20Type%20Support"></a>

Currently the VM only supports two types: `i32` and `vm.ref<T>`. This is an
intentional limitation such that we can determine what is really needed to
express the scheduling we perform, with the idea being that such a limited model
will make it easier to use techniques like
[indirect command buffers](#indirect-command-bufferon-accelerator-execution) to
compile the VM itself to an accelerator executable that dispatches work without
host involvement.

As we port more models we may find a few primitives that are worth bringing into
the VM design even at the cost of potential complications to future porting.
These include types like `f32` (for simple float calculations/comparisons),
`list`/`dict` (easier python compatibility), and `vector<4xf32>` (for simple
inline calculations that are not worth dispatch overhead/synchronization).

### Indirect Command Buffer/On-Accelerator Execution

<a id="markdown-Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution" name="Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution"></a>

Though IREE will use many different tricks such as
[predication](#predication-of-flowdispatch) to build deep pipelines there is
still the requirement that the command recording and submission happens on the
host CPU. Though the cost of this in terms of latency and power use can be
minimized by coalescing and timelines there is still the possibility of
non-trivial round-trips being introduced that limit performance. For particular
applications like low-power always-on compute or where there is significant
branching behavior (such as search algorithms) it is important that the
decision-making logic as to what is dispatched runs as close to real-time as
possible within the execution pipeline.

The IREE VM is designed to be runnable on-device in a secure and cooperative way
(no pointers, indirect buffer handles to allow for memory space rearrangement
op-to-op, deterministic execution and explicit yield points, etc).

The recent efforts to bring indirect command buffers to Vulkan and Metal's
[Indirect Command Buffers](https://developer.apple.com/documentation/metal/indirect_command_buffers/encoding_indirect_command_buffers_on_the_gpu)
(that both derive inspiration from
[NV_command_list](https://www.khronos.org/registry/OpenGL/extensions/NV/NV_command_list.txt))
are one such target for this. Either by
[lowering the VM IR to LLVM IR](#lowering-to-llvm-ir) or SPIR-V, by a special
conversion to target-specific forms, or by actually executing the VM bytecode
directly on-device (it's ~1000 LoC) we should be able to prototype what full
on-device usage is like. Even if only some VM functions the compiler deems
useful to schedule on the device are used and the rest run on the host
(particularly those functions calling imported functions) some of the most
costly logic that creates tight coupling of the host and device scheduling can
be limited.