# IREE Design Roadmap

<a id="markdown-IREE%20Design%20Roadmap" name="IREE%20Design%20Roadmap"></a>

<!-- WARNING: DO NOT EDIT THIS FILE IN AN EDITOR WITH AUTO FORMATTING -->

A not-so-concise walkthrough of various IREE features that are in the design
process and planned for future versions. A lot of the questions around how the
IREE IR is designed and why certain components exist (such as the VM) hopefully
become much clearer when seeing where we want to go with the infrastructure we
are building (as opposed to where we currently are with our MVP slice). This
document is not meant to encompass the entire design of any individual feature;
if there's interest please say hi on the
[iree-discuss](https://groups.google.com/forum/#!forum/iree-discuss) mailing
list.

<!-- TOC -->

- [IREE Design Roadmap](#iree-design-roadmap)
    - [Input Dialects](#input-dialects)
        - [Future MLIR XLA HLO Replacement](#future-mlir-xla-hlo-replacement)
        - [`linalg`: High-level Hierarchical Optimization](#linalg-high-level-hierarchical-optimization)
        - [XLA HLO: Canonicalizations](#xla-hlo-canonicalizations)
        - [XLA HLO: Tensor to Primitive Conversion](#xla-hlo-tensor-to-primitive-conversion)
        - [Quantization](#quantization)
    - [`flow`: Data- and Execution-Flow Modeling](#flow-data--and-execution-flow-modeling)
        - [Avoiding Readbacks with `flow.stream`](#avoiding-readbacks-with-flowstream)
        - [Threading `flow.stream` through the CFG](#threading-flowstream-through-the-cfg)
        - [Predication of `flow.dispatch`](#predication-of-flowdispatch)
        - [Deduping `flow.executable`s](#deduping-flowexecutables)
        - [Rematerializing CSE'd Expressions](#rematerializing-csed-expressions)
        - [Device Placement](#device-placement)
    - [`hal`: Hardware Abstraction Layer and Multi-Architecture Executables](#hal-hardware-abstraction-layer-and-multi-architecture-executables)
        - [Allow Targets to Specify `hal.interface`s](#allow-targets-to-specify-halinterfaces)
        - [Target-specific Scheduling Specialization](#target-specific-scheduling-specialization)
        - [Buffer Usage Tracking](#buffer-usage-tracking)
        - [Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation)
        - [Target-aware Executable Compression](#target-aware-executable-compression)
        - [Target-aware Constant Compression](#target-aware-constant-compression)
        - [Command Buffer Stateful Deduplication](#command-buffer-stateful-deduplication)
        - [Resource Timeline](#resource-timeline)
        - [Transient Tensor Ringbuffer](#transient-tensor-ringbuffer)
        - [Timeline Semaphores on the Module ABI](#timeline-semaphores-on-the-module-abi)
        - [GPU-like CPU Scheduling](#gpu-like-cpu-scheduling)
    - [`vm`: Lightweight Virtual Machine](#vm-lightweight-virtual-machine)
        - [Coroutines for Batching and Cooperative Scheduling](#coroutines-for-batching-and-cooperative-scheduling)
            - [Cellular Batching](#cellular-batching)
        - [Lowering to LLVM IR](#lowering-to-llvm-ir)
        - [Improved Type Support](#improved-type-support)
        - [Indirect Command Buffer/On-Accelerator Execution](#indirect-command-bufferon-accelerator-execution)

<!-- /TOC -->

## Input Dialects

<a id="markdown-Input%20Dialects" name="Input%20Dialects"></a>

### Future MLIR XLA HLO Replacement

<a id="markdown-Future%20MLIR%20XLA%20HLO%20Replacement" name="Future%20MLIR%20XLA%20HLO%20Replacement"></a>

IREE's current input dialect is the XLA HLO dialect representing operations on
tensors. This was a pragmatic decision based on having HLO already defined and
proof of existing models being lowered to it from Tensorflow, allowing us to
focus on the IREE-specific portions of work. Unfortunately, HLO is tied to
Tensorflow and has many quirks that would not otherwise have been designed had
that not been the case. There are discussions happening about an upstream MLIR
[Tensor Compute Primitives](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/)
dialect that HLO can be lowered into, allowing IREE (and other backends) to
decouple themselves from XLA and be easier to target from frontends.

### `linalg`: High-level Hierarchical Optimization

<a id="markdown-%60linalg%60%3A%20High-level%20Hierarchical%20Optimization" name="%60linalg%60%3A%20High-level%20Hierarchical%20Optimization"></a>

It's required that IREE inputs are all in tensor form (and not in-place memref
updates) in order to perform a large majority of the `flow` transformations.
Recent work in the [Linalg](https://mlir.llvm.org/docs/Dialects/Linalg/) dialect
is adding support for operating on value-semantic tensors, meaning that we can
first apply `mhlo` to `linalg` lowerings and any of the transformations
available in Linalg prior to performing our own `flow` lowerings. The advantage
is that Linalg will have much stronger and more principled code motion and
nested loop transformation optimizations than are possible on higher-level ops.
As not all operations can be represented as `linalg` ops, IREE will be able to
ingest a mix of `linalg`, `std`, and `mhlo` (or its replacement) ops.

### XLA HLO: Canonicalizations

<a id="markdown-XLA%20HLO%3A%20Canonicalizations" name="XLA%20HLO%3A%20Canonicalizations"></a>

Very little effort has been applied to `mhlo` optimizations and there are a
significant number of missing folders, canonicalizers, and simple
transformations. Many of these happen in legacy XLA C++ backends; however, we
need them in MLIR so that we can make use of dynamic shapes, mixed dialect
inputs, etc. The `tf2xla` bridge work (converting Tensorflow models into the
corresponding `mhlo` ops) is nearing its initial milestones and afterward we
expect more of these missing pieces to be filled in.

Examples of the optimizations that will greatly benefit IREE (and any other
backend consuming `mhlo`) include:

- Eliding unneeded transpose, reshape, and broadcast operations.
- Inserting transpose, reshape, and broadcast operations to allow for more
  optimal memory access patterns (such as transposing gather input to allow for
  memcpy-like transfers instead of column-wise cache-unfriendly accesses).
- Moving operations above broadcasts such that the smallest amount of work is
  performed.

### XLA HLO: Tensor to Primitive Conversion

<a id="markdown-XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion" name="XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion"></a>

HLO only operates on tensor values - even for simple scalars - and this
presents a problem when attempting to determine which code should be specified
to run on accelerators vs. what should run on the host. The canonical example
is `mhlo.while`, which as seen in the example below uses scalar tensors for its
loop iteration counter and comparison.

```mlir
%start = constant dense<1> : tensor<i32>
%bound = constant dense<3> : tensor<i32>
%res = "mhlo.while"(%start) ( {
^bb0(%count: tensor<i32>):
  %1 = "mhlo.compare"(%count, %bound) {comparison_direction = "LT"} : (tensor<i32>, tensor<i32>) -> tensor<i1>
  "mhlo.return"(%1) : (tensor<i1>) -> ()
}, {
^bb0(%count: tensor<i32>):
  %1 = mhlo.add %count, %count : tensor<i32>
  "mhlo.return"(%1) : (tensor<i32>) -> ()
}) : (tensor<i32>) -> tensor<i32>
```

A naïve but correct lowering (what's currently in IREE) would perform the
comparison and increment on the device and insert a host readback to see if the
loop should continue:

```mlir
func @main() -> tensor<i32> attributes {iree.module.export, iree.reflection = {f = "I1!R6!B3!t6", fv = "1"}} {
  %cst = constant dense<1> : tensor<i32>
  %cst_0 = constant dense<3> : tensor<i32>
  %cst_1 = constant dense<1> : vector<3xi32>
  br ^bb1(%cst : tensor<i32>)
^bb1(%2: tensor<i32>):  // 2 preds: ^bb0, ^bb2
  %3 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %2 : tensor<i32>, %arg2 = %cst_0 : tensor<i32>) -> tensor<i1> {
    %8 = flow.dispatch @main_ex_dispatch_0::@main_ex_dispatch_0[%arg0 : vector<3xi32>](%arg1, %arg2) : (tensor<i32>, tensor<i32>) -> tensor<i1>
    flow.return %8 : tensor<i1>
  }
  %4 = flow.tensor.load %3 : tensor<i1>
  cond_br %4, ^bb2(%2 : tensor<i32>), ^bb3(%2 : tensor<i32>)
^bb2(%5: tensor<i32>):  // pred: ^bb1
  %6 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %5 : tensor<i32>) -> tensor<i32> {
    %8 = flow.dispatch @main_ex_dispatch_1::@main_ex_dispatch_1[%arg0 : vector<3xi32>](%arg1) : (tensor<i32>) -> tensor<i32>
    flow.return %8 : tensor<i32>
  }
  br ^bb1(%6 : tensor<i32>)
^bb3(%7: tensor<i32>):  // pred: ^bb1
  return %7 : tensor<i32>
}
```

Of note is the `flow.tensor.load` op indicating a host readback. Though this
correctly executes the loop, it is extremely inefficient. What's desired is for
the loop iterator and condition to all happen on the host, with the iterator
being passed to the loop body as an argument that can be encoded into a command
buffer in future lowering stages. This eliminates host readback and allows for
much larger `flow.stream` sequences, feeding more into the pipeline for the
accelerator.

Not all source frontends have this issue (misrepresenting simple host
computation as non-dense tensor operations), and our goal is to add a
transformation that heuristically converts `mhlo` ops acting on small tensors
to `std` ops acting on primitive values (`i32`, `index`, etc).
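
As a rough sketch of what that conversion could produce for the `mhlo.while`
example above (the heuristic and its exact output are still to be designed),
the control flow becomes plain `std` ops on `i32` values that run on the host:

```mlir
%start = constant 1 : i32
%bound = constant 3 : i32
br ^bb1(%start : i32)
^bb1(%count: i32):  // 2 preds: ^bb0, ^bb2
  // The comparison is now a host-side primitive op - no device readback.
  %continue = cmpi "slt", %count, %bound : i32
  cond_br %continue, ^bb2(%count : i32), ^bb3(%count : i32)
^bb2(%cur: i32):
  %next = addi %cur, %cur : i32
  br ^bb1(%next : i32)
^bb3(%result: i32):
  // %result can be passed into dispatches as a primitive command buffer value.
  ...
```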

### Quantization

<a id="markdown-Quantization" name="Quantization"></a>

It's assumed that any work related to quantization/compression has happened
prior to lowering into IREE dialects. Our plan is to use the proposed
[Quantization Transforms](https://llvm.discourse.group/t/rfc-a-proposal-for-implementing-quantization-transformations-in-mlir/655)
to achieve both training and inference-time quantization of types in a way that
preserves maximum accuracy. IREE will support running with original unquantized
floats in all cases, allowing for a smooth on-ramp to quantization and the gains
in performance and reduction in model size that come from it.

As future work, IREE would like to move beyond these transformation-directed
approaches to quantization and interface directly to frontends which have a
defined enough type system to represent accurate quantized (and otherwise
compressed) computations directly, not relying exclusively on compiler-side
type inference transforms.

## `flow`: Data- and Execution-Flow Modeling

<a id="markdown-%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling" name="%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling"></a>

The `flow` dialect is designed to allow us to extract as much concurrency as
possible from a program and partition IR into the scheduling and execution
domains. Today we have the IR structure and transformation flow in place but
have not yet gotten to the most interesting things such an infrastructure
enables. A majority of the largest performance, latency, and memory usage
improvements IREE can offer are determined first here and all following
lowerings benefit. _The fastest code is the code you don't execute and the
smallest allocation is the allocation you don't make_ ;)

### Avoiding Readbacks with `flow.stream`

<a id="markdown-Avoiding%20Readbacks%20with%20%60flow.stream%60" name="Avoiding%20Readbacks%20with%20%60flow.stream%60"></a>

A majority of the readbacks we have today (manifested as `flow.tensor.load.*`
ops) will be removed when we have an
[HLO tensor->primitive conversion](#xla-hlo-tensor-to-primitive-conversion).
There will still be cases when readbacks are required for correctness but they
usually fall into a small set of usage patterns. For those that don't, this is
one place where IREE will warn about performance issues, still allowing
programs that perform suboptimally to run while encouraging authors to adjust
their input model to enable better behavior. The IREE VM also has specific
support for hiding readback latency in an efficient way via
[coroutines](#coroutines-for-batching-and-cooperative-scheduling).

The most common case we are currently seeing in the IR is that of dynamic
copies where the offsets are dependent on the result of previous computations.
Source models may have top-k + gather operations, for example. These appear as
a `flow.stream`, a `flow.tensor.load`, and then another `flow.stream` that uses
the loaded value for a `flow.tensor.update` (or other operation):

```mlir
%index_tensor = flow.ex.stream.fragment(...) -> tensor<i32> { ... }
%index = flow.tensor.load %index_tensor : tensor<i32>
%result = flow.ex.stream.fragment(%arg0 = %index : i32, ...) -> ... {
  %0 = flow.dispatch ...
  %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32>
  ...
}
```

Today the `flow.tensor.update` turns into HAL command buffer transfer operations
that must have their offsets known at recording time. This is a limitation of
`vkCmdCopyBuffer` but not a fundamental limitation of any hardware. In fact,
several drivers implement copies as small built-in shader programs, meaning that
we could perform the same expansion here with the right primitives. This would
allow, in the above example, both the index to be computed and the tensor to be
updated within the same stream to entirely remove the host round-trip.
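
A sketch of the desired end state (illustrative only; the ops for in-stream
indexed updates have not been designed yet) keeps the computed index on the
device so the whole sequence records into a single command buffer:

```mlir
%result = flow.ex.stream.fragment(...) -> ... {
  // The index-producing dispatch stays within the stream...
  %index = flow.dispatch @topk::@topk[%arg0 : vector<3xi32>](...) : (...) -> tensor<i32>
  %0 = flow.dispatch ...
  // ...and the update consumes it directly, with the offset resolved on the
  // device (e.g. via a small builtin copy shader) instead of on the host.
  %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32>
  ...
}
```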

### Threading `flow.stream` through the CFG

<a id="markdown-Threading%20%60flow.stream%60%20through%20the%20CFG" name="Threading%20%60flow.stream%60%20through%20the%20CFG"></a>

The current `flow.ex.stream.fragment`, as denoted by the `ex`perimental tag, is
a temporary implementation designed to get the concept of streams lowered to
the HAL dialect. For streams to be effective at modeling larger concurrency
scopes they need to be able to move across branches in the CFG. This intuitively
follows exactly what one would do if recording commands in C:

```c++
vkCmdCopyBuffer(cmd, ...);
if (some_flag) {
  vkCmdBindPipeline(cmd, ..., pipeline_a);
} else {
  vkCmdBindPipeline(cmd, ..., pipeline_b);
}
vkCmdDispatch(cmd, ...);
```

The corresponding `flow` IR:

```mlir
flow.stream.append[%s0](...) {
  flow.tensor.update ...
}
%b = cmpi ne %some_flag, ...
cond_br %b, ^a(%s0), ^b(%s0)
^a(%s1):
flow.stream.append[%s1](...) {
  flow.dispatch @pipeline_a, ...
}
br ^end(%s1)
^b(%s2):
flow.stream.append[%s2](...) {
  flow.dispatch @pipeline_b, ...
}
br ^end(%s2)
^end(%s3):
...
```

This allows the entire stream to be lowered into one command buffer without the
need for any host round-trips. The conversion into the `flow` dialect will walk
the CFG and attempt to thread the `flow.stream` values through so long as there
are no external dependencies.

### Predication of `flow.dispatch`

<a id="markdown-Predication%20of%20%60flow.dispatch%60" name="Predication%20of%20%60flow.dispatch%60"></a>

While the
[`flow.stream` threading through the CFG](#threading-flowstream-through-the-cfg)
can remove many of the simpler conditional dispatches, there will always be some
that have their execution dependent on the result of prior dispatches. For
these, a `flow.cond_dispatch` will allow a condition to be provided that must be
true for the dispatch to actually be performed.
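
A strawman of what such an op could look like (the syntax is illustrative only;
`flow.cond_dispatch` has not been designed in detail yet):

```mlir
%result = flow.ex.stream.fragment(...) -> ... {
  // %pred is produced by a prior dispatch in the same stream; when it is false
  // the conditional dispatch is skipped entirely without a host round-trip.
  %pred = flow.dispatch @compute_pred::@compute_pred[%arg0 : vector<3xi32>](%arg1) : (tensor<i32>) -> tensor<i1>
  %0 = flow.cond_dispatch %pred, @some_work::@some_work[%arg0 : vector<3xi32>](%arg2) : (tensor<4xf32>) -> tensor<4xf32>
  flow.return %0 : tensor<4xf32>
}
```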

For targets that natively support predication in their command buffers (such as
D3D12's
[ID3D12GraphicsCommandList::SetPredication](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12graphicscommandlist-setpredication))
this provides a host round-trip-free way of conditionally executing dispatches
and transfers. Unfortunately, Vulkan support is still lacking, though Nvidia
supports the
[VK_EXT_conditional_rendering](https://www.saschawillems.de/blog/2018/09/05/vulkan-conditional-rendering/)
extension that exposes the same behavior.

For targets that do not support predication natively, it's still possible to
emulate predication with
[indirect dispatches](https://github.com/gpuweb/gpuweb/issues/31). In this
model the workgroup counts normally used to dispatch execution are sourced from
another device buffer at the time the dispatch is made instead of sourced from
the command buffer at the time the dispatch is recorded. Degenerate dispatches
with counts of `0, 0, 0` allow for effective neutering of the dispatch with
minimal overhead (vs. the significant penalty of a host round-trip!).
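
Sketched in terms of the existing `hal` command buffer ops (the operand syntax
here is approximate and only meant to show the shape of the lowering), the
emulation becomes:

```mlir
// A tiny dispatch writes either the real workgroup counts or [0, 0, 0] into
// %counts_buf based on the predicate computed earlier in the same stream.
hal.command_buffer.dispatch %cmd, %select_counts_exe, entry_point = 0, workgroup_xyz = [%c1, %c1, %c1]
hal.command_buffer.execution_barrier %cmd, ...
// The real work sources its workgroup counts from the buffer at execution
// time; zero counts neuter the dispatch with negligible overhead.
hal.command_buffer.dispatch.indirect %cmd, %work_exe, entry_point = 0, workgroups = %counts_buf[%c0]
```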

By modeling such predication at the `flow` level we are able to lower into the
HAL with target-aware predication semantics and fuse indirect dispatch workgroup
count calculations into existing dispatches already being performed such that
overhead is reduced.

### Deduping `flow.executable`s

<a id="markdown-Deduping%20%60flow.executable%60s" name="Deduping%20%60flow.executable%60s"></a>

While still in the `flow` dialect, the executables are target-agnostic. This
makes simple IR tree diffing a potential solution to deduplication. Since most
of the dispatches originate from the same source-language library calls in
input frameworks there's a high likelihood of duplication, and depending on
when inlining is performed we may have stronger or weaker ability to perform
the deduplication. Thanks to the MLIR canonicalization pass (which ensures ops
are rearranged into consistent canonical representations) the IR comparisons
can be done rather trivially.

### Rematerializing CSE'd Expressions

<a id="markdown-Rematerializing%20CSE'd%20Expressions" name="Rematerializing%20CSE'd%20Expressions"></a>

Common subexpression elimination is performed many times during lowering;
however, there comes a point where CSE can introduce false dependencies and
additional allocations that are otherwise avoidable. For example, if a
broadcasting operation is CSE'd and the result is then used by two or more
operations that are scheduled independently, what would have been a relatively
cheap lowering of the broadcast to a simple index remapping now becomes an
additional dispatch, materialization of an intermediate tensor, and a barrier:

```mlir
%bcast = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
%mul1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
// (pretend something here that prevents fusion)
%mul2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
```

```mlir
%bcast = flow.dispatch.region(%cst : tensor<f32>) -> tensor<1024x10xf32> {
  %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  return %0 : tensor<1024x10xf32>
}
// a barrier will be required here
%mul1 = flow.dispatch.region(%arg0 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
%mul2 = flow.dispatch.region(%arg1 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
  return %2 : tensor<1024x10xf32>
}
```

Instead, the broadcast should be rematerialized inside both dispatch regions, as
the cost of doing so is significantly less in compute resources and the
intermediate tensor will not be required at all. At first it may seem
counter-intuitive to undo such a critical optimization as CSE (which benefits
both code size and often compute), but it's something we must carefully balance
while looking at the whole system. It becomes even more important when
considering multi-device execution, as the cost of sharing memory and
synchronizing may be extremely non-trivial.
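
For illustration, the rematerialized form of the example above clones the
broadcast into each region so the intermediate tensor and barrier disappear
entirely:

```mlir
%mul1 = flow.dispatch.region(%cst : tensor<f32>, %arg0 : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  // The broadcast is recomputed locally - a cheap index remapping when fused.
  %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  %1 = mhlo.multiply %arg0, %0 : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
%mul2 = flow.dispatch.region(%cst : tensor<f32>, %arg1 : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  %1 = mhlo.multiply %arg1, %0 : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
```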

### Device Placement

<a id="markdown-Device%20Placement" name="Device%20Placement"></a>

While still within the `flow` dialect we have the ability to easily split
streams and safely shuffle around operations. Target execution backends can opt
into such behavior to ensure that device restrictions such as maximum in-flight
memory, maximum scheduling depth, and capabilities are observed. For
heterogeneous configurations the intent is that certain operations, dispatches,
and streams can be attributed to specify which device categories they should be
lowered to. The constraint solving that takes place can be provided with generic
heuristics ("big GEMMs go on the accelerator"), profile-guided databases based
on benchmarks, learned traits via ML, etc.
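
A hypothetical sketch of what such attribution could look like (the attribute
name and category values are illustrative; no placement attribute has been
finalized):

```mlir
// A large matmul-heavy dispatch pinned to an accelerator-class device, while a
// small gather-style dispatch stays on the CPU next to the host logic.
%0 = flow.dispatch @big_gemm::@big_gemm[%wl0 : vector<3xi32>](%lhs, %rhs)
    {flow.device_placement = "accelerator"} : (tensor<2048x2048xf32>, tensor<2048x2048xf32>) -> tensor<2048x2048xf32>
%1 = flow.dispatch @gather::@gather[%wl1 : vector<3xi32>](%table, %indices)
    {flow.device_placement = "cpu"} : (tensor<1024x16xf32>, tensor<8xi32>) -> tensor<8x16xf32>
```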

## `hal`: Hardware Abstraction Layer and Multi-Architecture Executables

<a id="markdown-%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables" name="%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables"></a>

As the IREE HAL is designed almost 1:1 with a compute-only Vulkan API, many of
the techniques classically used in real-time graphics apply. The benefit we have
by modeling our usage of such a low-level API in IR is that the normal work -
some of which is very non-trivial - for managing allocations, tracking resource
lifetime, and ensuring proper synchronization/barriers is something we can apply
the full force of an offline compiler against.

### Allow Targets to Specify `hal.interface`s

<a id="markdown-Allow%20Targets%20to%20Specify%20%60hal.interface%60s" name="Allow%20Targets%20to%20Specify%20%60hal.interface%60s"></a>

The `hal.interface` op specifies the ABI between the scheduler and the device
containing the buffer bindings and additional non-buffer data (parameters,
shapes, specialization flags, etc). Today a naïve ordering is used uniformly for
all targets; however, it is possible for target backends to opt into providing
their own interfaces based on target configuration. The same `hal.executable`
may have multiple interfaces and the same backend may use one or more. This is
useful for when target capabilities may vary at runtime, such as the
[number of available storage buffer bindings](https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxPerStageDescriptorStorageBuffers&platform=android)
in Vulkan. By exposing a few `hal.interface` variants with different binding
amounts the Vulkan backend could make better use of the larger number of
bindings available at runtime while still providing support for smaller
configurations.
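
For reference, a `hal.interface` for a simple two-input, one-output dispatch
looks roughly like the following under today's naïve ordering (attribute
spellings may differ slightly across IREE versions):

```mlir
hal.interface @legacy_io {
  // Read-only inputs and a writable output, each bound as a storage buffer.
  hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
  hal.interface.binding @arg1, set=0, binding=1, type="StorageBuffer", access="Read"
  hal.interface.binding @ret0, set=0, binding=2, type="StorageBuffer", access="Write|Discard"
}
```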

Once we have multiple `hal.interface`s defined for executables the scheduler
needs to emit HAL ops that properly switch between them. By having a canonical
form for bindings we can ensure that only the differences between the interfaces
will need additional code.

### Target-specific Scheduling Specialization

<a id="markdown-Target-specific%20Scheduling%20Specialization" name="Target-specific%20Scheduling%20Specialization"></a>

Though the `flow` dialect attempts to fuse as many ops as possible into dispatch
regions, it's not always possible for all target backends to schedule a region
as a single dispatch. A classic example is algorithms like
[parallel reduction](https://en.wikipedia.org/wiki/Reduction_Operator#PRAM-algorithm)
commonly used on GPUs that may require many dispatches to identical executables,
while other algorithms may vary the executables they use based on the input
parameters such as shape or the target runtime device support.

By default the `flow.dispatch` executable translation to `hal.executable`s is
performed 1:1 and it is assumed that a single dispatch is required. Extending
target backends with scheduling interfaces (enabling them to opt into different
scheduling behavior) will allow the backends to emit any number of
`hal.executable`s and any stream commands (such as additional dispatches or
transfers) they may need. This is effectively equivalent to what would be done
at runtime, except that because we are still operating on IR prior to buffer
allocation we can use the `hal` ringbuffer primitive. Through this we can elide
many of the allocations that would otherwise be required at runtime (and the
concurrency-limiting false dependencies that usually come along with scratch
memory).
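
As a purely illustrative sketch (op spellings approximate), a backend opting
into a two-pass parallel reduction could expand a single `flow`-level dispatch
into two dependent HAL dispatches separated by a barrier, with the intermediate
partial sums placed in the transient ringbuffer:

```mlir
// Pass 1: each workgroup reduces a tile into a transient partials buffer.
hal.command_buffer.dispatch %cmd, %reduce_pass1, entry_point = 0, workgroup_xyz = [%num_tiles, %c1, %c1]
hal.command_buffer.execution_barrier %cmd, ...
// Pass 2: a single workgroup reduces the partials into the final result.
hal.command_buffer.dispatch %cmd, %reduce_pass2, entry_point = 0, workgroup_xyz = [%c1, %c1, %c1]
```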

Since the algorithm used may vary based on the parameters of the dispatch (such
as the shape of the reduction, which may be dynamically determined) scheduling
specialization may occur even when targeting a single backend. In many cases
folding and canonicalization can eliminate the overhead, as the same IR is
present regardless of which dynamically computed workgroup size is used.

### Buffer Usage Tracking

<a id="markdown-Buffer%20Usage%20Tracking" name="Buffer%20Usage%20Tracking"></a>

Many explicit hardware APIs require knowing how buffers are used along with
where they should be located. For example this additional information determines
caching policy on buffer accesses (write-through, write-back, etc), visibility
of writes across compute units, and the possible MMU properties that may need to
be maintained/matched for the buffer. By using the SSA-form value-semantics of
the MLIR `tensor` as used in the `flow` dialect we have complete information of
where buffers may be used or at least where they enter or leave regions where we
can derive such information.

Analysis passes can run over IR to attribute tensors such that when allocation
is performed during lowering to the `hal` dialect we do so from an allocator
compatible with where the buffer will be used, with memory types chosen based on
the potential cost and location of operations performed (write-only on host vs.
read-write on host and device, etc), and with usage bits indicating what kind of
operations may be performed on the buffer. Many of these are local
transformations as most buffers are only live within very small regions such as
the `flow.stream` encompassing their usage.
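
A sketch of the result at allocation time (the memory type and usage flag names
shown are illustrative, not an exhaustive or exact list):

```mlir
// A transient tensor only touched by dispatches: device-local memory with
// transfer/dispatch usage bits and no host mapping.
%transient = hal.allocator.allocate %allocator, "DeviceLocal", "Transfer|Dispatch", %size0 : !hal.buffer
// A result the application reads back: host-visible mapping is also required.
%result = hal.allocator.allocate %allocator, "HostVisible|DeviceVisible", "Transfer|Dispatch|Mapping", %size1 : !hal.buffer
```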

Traditional systems need to either use very permissive buffer properties or
heuristics that can introduce additional non-trivial overhead when such
heuristics are incorrect. For example,
[OpenGL had several such usage hints](https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferData.xhtml)
that drivers were then able to use but almost no drivers behaved as desired in
all cases and it led to additional memory ghosting, copies, readbacks, and
unpredictable performance. For almost all uses of the buffers within an IREE
invocation we can instead know precisely where and how buffers may need to be
moved and do so a minimum number of times if it is required.

### Batched Executable Caching and Precompilation

<a id="markdown-Batched%20Executable%20Caching%20and%20Precompilation" name="Batched%20Executable%20Caching%20and%20Precompilation"></a>

For targets that may require runtime preprocessing of their executables prior to
dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch
compilation mechanism based on Vulkan's
[Pipeline Cache](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html).

Today each executable is compiled on-demand and cached only for the process
lifetime. Though some drivers may provide their own caching we can make better
use of the explicit caching and compilation behavior with the additional
information we have in the compiler.

For any given entry point (or group of entry points) into an IREE module we can
perform reachability analysis to know which executables may be executed when
that entry point is invoked. In this way we can emit pre-invocation compilation
checks (similar to an `std::call_once` block) that provide all required
executables for compilation and allow more efficient compilation through
multithreading the compiler invocations. These same compilation caching
functions can be exposed and invoked manually by an application to force
pre-compilation when it is least likely to impact the user, such as a
post-install/first-run step or concurrently while other application features
are loading.
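
A hypothetical IR sketch of such a one-time preparation block (the op names are
illustrative; the real caching ops may differ):

```mlir
// Executed once per scope (guarded like std::call_once): every executable
// reachable from the entry point is handed to the cache for batch compilation.
%cache = hal.executable_cache.create %device, identifier = "inference" : !hal.executable_cache
%exe0 = hal.executable_cache.prepare %cache, @main_ex_dispatch_0 : !hal.executable
%exe1 = hal.executable_cache.prepare %cache, @main_ex_dispatch_1 : !hal.executable
```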

We can use zero or more scoped caches for executables within a module.
Completely dynamic modules (such as those emitted in eager-mode usage) may avoid
the caching overhead entirely, while modules that have several primary usage
modes (such as training and inference) may choose to use independent caches for
each such mode.

The caches generated can then be retrieved and saved by the hosting application.
Upon the next execution the application can provide the caches and, if still
valid, they will be used to avoid compilation.

### Target-aware Executable Compression

<a id="markdown-Target-aware%20Executable%20Compression" name="Target-aware%20Executable%20Compression"></a>

An advantage of representing executable binaries in IR after translation is that
we can apply various post-compilation compression and minification techniques
while still knowing precisely where the executable will be used. This is
extremely important for SPIR-V as it is not designed to be a small at-rest
format. Though the biggest lever we have to control generated code size is
higher-level deduplication and specialization, there will still be a
sufficiently large number of executable binaries we will need to embed within
the final modules, and having targeted approaches for reducing their size beyond
just "gzip everything" is very powerful.

For example, [SMOL-V](https://github.com/aras-p/smol-v) is a fantastic lossless
SPIR-V compression technique that, when coupled with modern dictionary-based
compression algorithms, can save significant binary size. As a data point, the
SPIR-V corpus SMOL-V uses for testing goes from 4.8MiB of raw SPIR-V to 348KiB
of compressed SMOL-V.

Combined with
[Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation)
we can easily use shared dictionaries and other cross-artifact compression in a
relatively plug-in way.

### Target-aware Constant Compression

<a id="markdown-Target-aware%20Constant%20Compression" name="Target-aware%20Constant%20Compression"></a>

It's still an area that needs more research, but one goal of the IREE design was
to enable efficient target- and context-aware compression of large constants
(typically model weights/parameters/embeddings). This may mean reusing existing
hardware compression formats on GPUs, ML accelerator-specific formats, or
very-low-bit-depth (1-4 bit per value) quantization techniques that cannot be
directly used without first decompressing. The inspiration here is formats like
[Crunch](https://github.com/BinomialLLC/crunch) and
[Basis Universal](https://github.com/BinomialLLC/basis_universal) that perform
["supercompression"](http://gamma.cs.unc.edu/GST/gst.pdf), and we may even be
able to use these directly as then we can make use of GPU hardware samplers to
do the 4-bit to 32-bit decompression, etc.

### Command Buffer Stateful Deduplication

<a id="markdown-Command%20Buffer%20Stateful%20Deduplication" name="Command%20Buffer%20Stateful%20Deduplication"></a>

The IREE HAL - much like the Vulkan API it is based on - eschews much of the
state that traditional APIs have in favor of (mostly) immutable state objects
(pipeline layouts, pipeline states, descriptor sets, etc). There are still a few
stateful entry points in the API, though, and deduplicating or reordering
redundant calls can reduce IR, API, and execution overhead.

The key place this will have the largest impact is around descriptor set
bindings and push descriptors, both of which are state and can have non-trivial
setup overhead. A canonicalization for such commands that inspects the target
`hal.command_buffer` to see if the same state was set prior, together with code
motion to move such commands out of loop bodies when possible, would be helpful.
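
For instance (op spellings approximate), a loop body that re-pushes an identical
descriptor set on every iteration is a prime candidate for hoisting:

```mlir
^loop_body(%i : index):
  // Identical bindings pushed each iteration - a canonicalization could hoist
  // this above the loop since the bound state never changes within it.
  hal.command_buffer.push_descriptor_set %cmd, %layout, set = 0, bindings = [0 = (%buf, %c0, %len)]
  hal.command_buffer.dispatch %cmd, %exe, entry_point = 0, workgroup_xyz = [%x, %y, %z]
  ...
```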

### Resource Timeline

<a id="markdown-Resource%20Timeline" name="Resource%20Timeline"></a>

A core concept of the IREE scheduler that allows for overlapping in-flight
invocations is that of the resource timeline. This identifies module state that
can be in use by multiple invocations and assigns timeline milestones denoting
when the resource will be in the appropriate state for the current invocation to
proceed. Conceptually it is like an epoch-based synchronization mechanism as
commonly found in garbage collectors to allow for lock-free asynchronous memory
reclamation.

The advantage we have in the IR is that we know both the usage of all resources
thanks to [buffer usage tracking](#buffer-usage-tracking) and the
synchronization domains of all resources (in most cases). This allows us to
effectively assign one timeline semaphore per writeable resource while in
practice having far fewer than 1:1; for example, if two resources are only ever
written in the same command buffer only one semaphore is needed to signal the
completion of both writes.

By transforming IR to sink all resource reads and writes closest to where the
value is used we can enlarge the time windows that can overlap across
invocations that may share those resources. This is similar to what out-of-order
CPUs do with register renaming/reorder buffers/etc and something we can apply
some traditional instruction scheduling techniques to (only here our
'instructions' are entire command buffer dispatches/transfers).

Two degenerate cases of this approach are that of resource indirection
(`iree.ptr<tensor<T>>`) and dynamic resource shapes. In these two cases it may
not be possible to continue recording commands even if we are able to ensure
execution is appropriately synchronized. This is where indirect dispatch,
[predication](#predication-of-flowdispatch),
[indirect command buffers](#indirect-command-bufferon-accelerator-execution),
and [VM coroutines](#coroutines-for-batching-and-cooperative-scheduling) can all
help cover for the times where we are unable to transform away the indirection
or emit shape logic without data dependencies.

### Transient Tensor Ringbuffer

<a id="markdown-Transient%20Tensor%20Ringbuffer" name="Transient%20Tensor%20Ringbuffer"></a>

(When properly implemented) almost all buffers required during execution never
escape the command buffers they are used in or a single VM invocation. We can
trivially identify this from the explicit captures of `flow.stream` and
`flow.dispatch` ops and the fact that all tensor types have value-semantics.
Only those tensor values loaded-from/stored-to module state or that cross the
exported module function boundary need special consideration, while almost
everything else can live transiently only so long as it is required during
execution.

Thanks to this information about buffer usage and lifetime we can use a
[ringbuffer](https://en.wikipedia.org/wiki/Circular_buffer) to store the
transient tensor data and other required data reservations such as uniform
buffers used to pass dynamic parameters (shapes, flags, etc) into dispatches.
This gives the compiler and the application a knob that allows them to control
maximum concurrency (by having a very large ringbuffer) or maximum memory usage
(by having a minimally small ringbuffer).

Allocating tensors from the ringbuffer does not require sophisticated runtime
packing as we can emit IR to calculate required sizes for dynamically shaped
tensors. Whether a basic block reserves `%sz = constant 42 : index` bytes or
`%sz = std.muli %cst, %dyn_dim : index` bytes doesn't materially change how the
allocations are performed. Since almost all usage involves simple write head
bumps there is no need for ahead-of-time memory planning or large fixed
allocations, and since no buffer within the ringbuffer can alias we can have
coarse (_read: low overhead_) guarantees about the availability of certain
regions of the ringbuffer (_"when this event is signaled all prior ringbuffer
writes have completed"_).
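
A small sketch of that calculation (the ringbuffer reservation op is
hypothetical; only the size math uses existing `std` ops):

```mlir
// Byte size of a dynamically shaped tensor<?x128xf32> computed entirely in IR.
%c512 = constant 512 : index  // 128 elements * 4 bytes per f32
%byte_size = std.muli %dyn_dim, %c512 : index
// Hypothetical op: bump the ringbuffer write head and get the reserved offset.
%offset = hal.ringbuffer.reserve %ringbuffer, %byte_size : index
```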

Usually any planning we may want to perform can be done in IR via code motion.
For example, applying traditional algorithms used to reduce register pressure
will help us attain narrower live windows within the ringbuffer, leading to a
larger number of in-flight operations for the same ringbuffer memory usage.

We may end up using both a classical ringbuffer and a variant known as the
[bip buffer](https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The-Circular-Buffer-with-a-Twist)
because it is better for descriptor set utilization (as we can provide many
dispatch parameters with a single base offset bound once at the beginning of a
region).

### Timeline Semaphores on the Module ABI

<a id="markdown-Timeline%20Semaphores%20on%20the%20Module%20ABI" name="Timeline%20Semaphores%20on%20the%20Module%20ABI"></a>

Function calls made across modules (either from C++ into the VM, VM->VM, or
VM->C++) should be able to define timeline semaphores used to wait and signal on
the call. We can do this by making all exports automatically have the semaphores
and then making invocations populate them if they were not provided by the
caller. In this way we can allow multiple invocations of exported functions to
chain naturally with internal asynchronous workloads, turning most IREE
invocations into just the recording of command buffers that never block.
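
Conceptually each exported function gains wait/signal semaphore parameters along
the lines of the sketch below (types and parameter order are illustrative, not a
finalized ABI):

```mlir
// The implementation waits on (%wait_sem >= %wait_value) before touching the
// inputs and signals %signal_sem to %signal_value once the outputs are ready.
func @predict(%wait_sem : !hal.semaphore, %wait_value : i64,
              %signal_sem : !hal.semaphore, %signal_value : i64,
              %input : !hal.buffer_view) -> !hal.buffer_view
    attributes {iree.module.export}
```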

When combined with
[VM coroutine support](#coroutines-for-batching-and-cooperative-scheduling) we
even have the ability to interleave any required host execution between the wait
and signal semaphores provided such that the caller never knows on which device
execution is taking place. It's still possible to provide synchronous wrappers
that emulate blocking behavior, but by having the core system designed around a
single system-supported primitive we avoid the need for additional things like
interrupt watchdog threads, implicit blocking, and other pitfalls.

### GPU-like CPU Scheduling

<a id="markdown-GPU-like%20CPU%20Scheduling" name="GPU-like%20CPU%20Scheduling"></a>

One approach to using multiple cores on a CPU is to perform interior
parallelization of operations, such as with OpenMP or library-call-based custom
thread pools (gemmlowp). This works when each individual operation is relatively
costly vs. potential pipeline bubbles caused by work spinning down near the end
of an operation and spinning up at the beginning of the next.

IREE is designed to handle many more workloads - some of which have very narrow
shapes but very deep pipelines (like search algorithms) - such that the above
approach of multithreading within ops becomes a bottleneck. These workloads are
traditionally very poorly handled by frameworks, and issues with
oversubscription, pipeline stalls, and suboptimal system schedulers (such as on
Android) can lead to more time being spent thrashing about than actually
executing real work.

The approach we take here is to treat the cores of a CPU as if they were
computation units on a GPU, each able to perform some set of heterogeneous work
independent of other units. This means that the concurrency we are trying to
model at the `flow` level and communicate to the runtime via the `hal` - which
explicitly states which dispatches can overlap and the size of the workgroups -
can trivially be used to distribute this work over many cores exactly as a GPU
would do it. Integration with library calls that may require their own threading
(such as Ruy) requires that they be able to use the IREE thread pool instead of
their own.

In this way we can avoid pipeline bubbles and other latency-inducing
unpredictable scheduling. This does not mean that we treat individual units of
work at the same scale as we would for GPUs, but instead that we tile and have
one or more processing units that allow us to work on those tiles. Whether the
tile size is defined by a library call contract, heuristics, or empirically is
TBD, but expect workgroup sizes in the thousands to millions of invocations vs.
normal GPU workgroup sizes in the dozens to hundreds of invocations.

To achieve this style of scheduling efficiently we'll likely use
[marl](https://github.com/google/marl) as the scheduler. This provides
cross-platform low-overhead fibers and is compatible with this style of
scheduling as it was built for the SwiftShader software rasterizer.

Even if IREE were only targeting CPUs the assertion is that we would still want
to schedule this way, and it's only an incidental benefit that when building for
heterogeneous targets the scheduling code may be shared (just with a different
divisor for workgroup count calculations).

## `vm`: Lightweight Virtual Machine

<a id="markdown-%60vm%60%3A%20Lightweight%20Virtual%20Machine" name="%60vm%60%3A%20Lightweight%20Virtual%20Machine"></a>

The VM is designed as a dynamic linkage ABI, stable bytecode representation, and
intermediate lowering IR. Many of the optimizations we can perform on it will
benefit all use cases (such as when lowering to LLVM IR) by allowing
higher-level program transformations around synchronization that are difficult
to perform on arbitrary LLVM IR.

### Coroutines for Batching and Cooperative Scheduling

<a id="markdown-Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling" name="Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling"></a>

One of the largest features currently missing from the VM is coroutines (aka
user-mode fiber scheduling). Coroutines are what will allow us to have multiple
in-flight invocations into a module - some of which may be waiting on external
events - without the need for complex multithreading logic or state machine
machinations.

In many cases
[once semaphores are exposed to callers](#timeline-semaphores-on-the-module-abi)
we will not need to yield in the VM. The user will call into the module with
provided semaphores, the work to perform will be recorded to one or more command
buffers and submitted to the device, and then control will return to the caller
immediately.

In cases requiring host readbacks that we were not able to remove, however,
additional VM code may need to run prior to when the final semaphore is
signaled. To preserve the asynchronous interface and immediate execution
guarantees the compiler can emit explicit yield points (`vm.yield`) that are
known-good locations for yielding (such as most resources not required after the
yield having been flushed/discarded, partial synchronization scope availability
if other work may be able to execute concurrently irrespective of the yielded
coroutine, etc).

When the VM encounters the yield at runtime it will suspend the coroutine until
a defined condition is met. Many coroutines can be in various states at any
given time and - thanks to the resource timeline - can still be memory safe. For
example, if two stateless invocations are made with a common wait semaphore both
can be recorded and submitted without waiting on each other. If internal module
state is accessed the invocations are implicitly ordered by invocation order
(similar to what Vulkan calls
[API order](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch02s02.html#fundamentals-queueoperation-apiorder))
based on internal resource timeline semaphores.

Waking the coroutines can be performed by an application-provided callback in
the case of the application already having a periodic event that does
bookkeeping (such as frame-end callbacks when rendering or Looper idle events on
Android), giving direct control over the frequency and location at which IREE
performs additional work. A helper that runs a dedicated IREE thread to do this
will be provided as well, but the expectation is that applications can often do
a better (and, importantly, more predictable) job.

By utilizing coroutines IREE will have a way to fill traditional pipeline
bubbles even with execution from the same module (let alone across modules) in
the situation where host readbacks or other logic is required. This increases
overall throughput and utilization while reducing host wakeups, as many
coroutines can be processed at once to submit new work to the device queues,
though it does not help reduce per-invocation latency.

External code such as the HAL implementation or user ops may provide the wait
handles used for continuation. For example, the HAL can expose a function that
yields and wakes only when one or more timeline semaphores reach their target
values:

```mlir
// submit work
hal.device.yield %semaphore4 >= %sem4_target, %semaphore5 >= %sem5_target
// continue here, possibly much later in time
```

#### Cellular Batching

<a id="markdown-Cellular%20Batching" name="Cellular%20Batching"></a>

Though coroutines help throughput there is a way we've found to reduce latency
that's been documented as
[cellular batching](http://madsys.cs.tsinghua.edu.cn/publications/EUROSYS2018-gao.pdf).
This same technique has been implemented in prior internal systems and is one
of the motivating design goals for IREE's creation. The core idea is to
identify small uniform work that can be partitioned and scheduled greedily so
as to enable batching or to reduce associated invocation costs (such as
refreshing accelerator SRAM/caches with new parameters). This usually
manifests as finding large GEMM/GEMV operations that use the same fixed
parameters and either dynamically increasing the batch size by adding the
waiting work (without deferring the actual execution time) or sequencing them
back to back to ensure better cache utilization. Which approach is taken
depends on any data dependencies that may be present (such as LSTM state
feedback edges).

With the foundation of coroutines in IREE it's possible to yield execution at
any given point - including during command buffer recording - and wake on
specific conditions. A majority of the logic can be built into the module
itself with very little need for runtime machinery, as shared VM variables can
be used to track pending work across invocations (even from different parts of
the program) and flush based on logic wholly controlled by the user or compiler
(such as count limits, maximum time/latency limits, etc). This allows for the
large variety of scheduling behaviors applications may want, ranging from
zero-latency batching only within a single invocation, to
[Nagle's Algorithm](https://en.wikipedia.org/wiki/Nagle%27s_algorithm)-esque
time- or count-based behavior, to some learned model-specific windowing.

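
As a sketch of what that module-side bookkeeping could look like - the ops
below are hypothetical strawman IR and the count-based policy is only an
illustrative assumption, not a committed design - a shared VM global might
accumulate pending requests and trigger a flush once a limit is hit:

```mlir
// Hypothetical strawman: a mutable VM global tracks pending requests across
// invocations; work is flushed as one batched submission once a count limit
// (or, equally, a deadline) is reached.
vm.global.i32 @pending_count mutable : i32

func @enqueue(%request : !hal.buffer) {
  // ... append %request to the shared pending batch state ...
  %count = vm.global.load.i32 @pending_count : i32
  %one = vm.const.i32 1 : i32
  %new_count = vm.add.i32 %count, %one : i32
  vm.global.store.i32 %new_count, @pending_count : i32
  %max = vm.const.i32 8 : i32  // illustrative count limit
  %should_flush = vm.cmp.gte.i32.s %new_count, %max : i32
  vm.cond_br %should_flush, ^flush, ^defer
^flush:
  // ... record and submit one batched dispatch covering all pending work ...
  %zero = vm.const.i32.zero : i32
  vm.global.store.i32 %zero, @pending_count : i32
  vm.return
^defer:
  vm.return  // leave the work pending; a later invocation or timer flushes it
}
```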

Design work is still required on how to represent this in IR, but the current
thought is to model the regions in which deferred execution is possible and
beneficial and to allow additional transformations during lowering to the VM.
This is similar to how async-await behavior works in C#, where the async
keyword is just sugar that expands to additional generated helper utilities.

A simple strawman representation for sequential dispatch may look like:

```mlir
hal.scheduling_policy @defer_policy {
  // max time, max count, max live memory, etc
}
...
hal.command_buffer.dispatch.deferred @defer_policy, @dispatch, ...
// vm.yield added here during lowering
```

There are many cases to explore, and as cellular batching can have performance
benefits of several orders of magnitude it will be one of the primary areas of
research in the long term.

### Lowering to LLVM IR

<a id="markdown-Lowering%20to%20LLVM%20IR" name="Lowering%20to%20LLVM%20IR"></a>

For scenarios where dynamic module loading is not required and entire modules
can be compiled into applications, we can lower the VM IR to LLVM IR within
MLIR's transformation pipeline. Instead of embedding `vm.call` ops that are
dispatched at runtime to things like the HAL, we can lower to `llvm::CallInst`s
on runtime-resolved function pointers. This still enables all of the
flexibility of heterogeneous/runtime-determined devices, pluggable diagnostics,
and backend composition without any need for flatbuffers or the VM bytecode
interpreter.

The VM was designed to make such a lowering easy, and the C-style struct-based
function pointer registration for runtime modules was designed to make emitting
code that uses it fairly robust even when linked in dynamically, such as when
embedded in shared objects.

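
As a rough sketch of the lowered form - the names and table layout below are
purely illustrative, not the actual registration ABI - a former `vm.call` to an
import could become an indirect call through a struct of function pointers that
the runtime fills in when the module is loaded:

```mlir
// Hypothetical strawman: the import table is a C-style struct of function
// pointers registered by the runtime; what was a vm.call dispatched by the
// interpreter becomes a direct indirect call through a slot in that table.
%imports = llvm.load %module_state_ptr ...      // struct of registered imports
%submit_fn = llvm.extractvalue %imports[3] ...  // illustrative slot index
%status = llvm.call %submit_fn(%module_state_ptr, %cmd) ...
```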

An extension of this is what we've been calling 'runtimeless mode', where the
IREE VM linkage code is statically linked into the binary alongside the
generated module LLVM IR. If only a single HAL backend is linked in then (with
some build-fu) we should be able to get call devirtualization to reduce code
size to precisely the functionality used by the module.

### Improved Type Support

<a id="markdown-Improved%20Type%20Support" name="Improved%20Type%20Support"></a>

Currently the VM only supports two types: `i32` and `vm.ref<T>`. This is an
intentional limitation so that we can determine what is really needed to
express the scheduling we perform, the idea being that such a limited model
will make it easier to use techniques like
[indirect command buffers](#indirect-command-bufferon-accelerator-execution) to
compile the VM itself to an accelerator executable that dispatches work without
host involvement.

As we port more models we may find a few primitives that are worth bringing
into the VM design even at the cost of potential complications to future
porting. These include types like `f32` (for simple float
calculations/comparisons), `list`/`dict` (for easier Python compatibility), and
`vector<4xf32>` (for simple inline calculations that are not worth the dispatch
overhead/synchronization).

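
For example - with hypothetical ops, since none of these exist in the VM
today - `f32` support would let a small data-dependent decision stay on the
host without a round trip through a device dispatch:

```mlir
// Hypothetical strawman: trivial scalar math and a comparison kept in the VM,
// used only to decide whether further device work should be submitted.
%threshold = vm.const.f32 0.5
%scaled = vm.mul.f32 %score, %threshold : f32
%exceeds = vm.cmp.gt.f32 %scaled, %limit : f32
vm.cond_br %exceeds, ^submit_more_work, ^done
```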

### Indirect Command Buffer/On-Accelerator Execution

<a id="markdown-Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution" name="Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution"></a>

Though IREE will use many different tricks such as
[predication](#predication-of-flowdispatch) to build deep pipelines, there is
still the requirement that command recording and submission happen on the host
CPU. Though the cost of this in terms of latency and power use can be minimized
by coalescing and timelines, there is still the possibility of non-trivial
roundtrips being introduced that limit performance. For particular applications
like low-power always-on compute, or where there is significantly branchy
behavior (such as search algorithms), it is important that the decision-making
logic as to what is dispatched runs as close to real time as possible within
the execution pipeline.

The IREE VM is designed to be runnable on-device in a secure and cooperative
way (no pointers, indirect buffer handles to allow for memory space
rearrangement op-to-op, deterministic execution and explicit yield points,
etc).

The recent efforts to bring indirect command buffers to Vulkan, and Metal's
[Indirect Command Buffers](https://developer.apple.com/documentation/metal/indirect_command_buffers/encoding_indirect_command_buffers_on_the_gpu)
(both of which derive inspiration from
[NV_command_list](https://www.khronos.org/registry/OpenGL/extensions/NV/NV_command_list.txt)),
are one such target for this. Either by
[lowering the VM IR to LLVM IR](#lowering-to-llvm-ir) or SPIR-V, by a special
conversion to target-specific forms, or by actually executing the VM bytecode
directly on-device (it's ~1000 LoC), we should be able to prototype what full
on-device usage is like. Even if only some VM functions - those the compiler
deems useful to schedule on the device - run there while the rest run on the
host (particularly functions calling imported functions), some of the most
costly logic that creates tight coupling of the host and device scheduling can
be limited.