# IREE Design Roadmap

<a id="markdown-IREE%20Design%20Roadmap" name="IREE%20Design%20Roadmap"></a>

<!-- WARNING: DO NOT EDIT THIS FILE IN AN EDITOR WITH AUTO FORMATTING -->

A not-so-concise walkthrough of various IREE features that are in the design
process and planned for future versions. A lot of the questions around how the
IREE IR is designed and why certain components exist (such as the VM) hopefully
become much clearer when seeing where we want to go with the infrastructure we
are building (as opposed to where we currently are with our MVP slice). This
document is not meant to encompass the entire design of any individual feature;
if there's interest please say hi on the
[iree-discuss](https://groups.google.com/forum/#!forum/iree-discuss) mailing
list.

<!-- TOC -->

- [IREE Design Roadmap](#iree-design-roadmap)
    - [Input Dialects](#input-dialects)
        - [Future MLIR XLA HLO Replacement](#future-mlir-xla-hlo-replacement)
        - [`linalg`: High-level Hierarchical Optimization](#linalg-high-level-hierarchical-optimization)
        - [XLA HLO: Canonicalizations](#xla-hlo-canonicalizations)
        - [XLA HLO: Tensor to Primitive Conversion](#xla-hlo-tensor-to-primitive-conversion)
        - [Quantization](#quantization)
    - [`flow`: Data- and Execution-Flow Modeling](#flow-data--and-execution-flow-modeling)
        - [Avoiding Readbacks with `flow.stream`](#avoiding-readbacks-with-flowstream)
        - [Threading `flow.stream` through the CFG](#threading-flowstream-through-the-cfg)
        - [Predication of `flow.dispatch`](#predication-of-flowdispatch)
        - [Deduping `flow.executable`s](#deduping-flowexecutables)
        - [Rematerializing CSE'd Expressions](#rematerializing-csed-expressions)
        - [Device Placement](#device-placement)
    - [`hal`: Hardware Abstraction Layer and Multi-Architecture Executables](#hal-hardware-abstraction-layer-and-multi-architecture-executables)
        - [Allow Targets to Specify `hal.interface`s](#allow-targets-to-specify-halinterfaces)
        - [Target-specific Scheduling Specialization](#target-specific-scheduling-specialization)
        - [Buffer Usage Tracking](#buffer-usage-tracking)
        - [Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation)
        - [Target-aware Executable Compression](#target-aware-executable-compression)
        - [Target-aware Constant Compression](#target-aware-constant-compression)
        - [Command Buffer Stateful Deduplication](#command-buffer-stateful-deduplication)
        - [Resource Timeline](#resource-timeline)
        - [Transient Tensor Ringbuffer](#transient-tensor-ringbuffer)
        - [Timeline Semaphores on the Module ABI](#timeline-semaphores-on-the-module-abi)
        - [GPU-like CPU Scheduling](#gpu-like-cpu-scheduling)
    - [`vm`: Lightweight Virtual Machine](#vm-lightweight-virtual-machine)
        - [Coroutines for Batching and Cooperative Scheduling](#coroutines-for-batching-and-cooperative-scheduling)
            - [Cellular Batching](#cellular-batching)
        - [Lowering to LLVM IR](#lowering-to-llvm-ir)
        - [Improved Type Support](#improved-type-support)
        - [Indirect Command Buffer/On-Accelerator Execution](#indirect-command-bufferon-accelerator-execution)

<!-- /TOC -->

## Input Dialects

<a id="markdown-Input%20Dialects" name="Input%20Dialects"></a>

### Future MLIR XLA HLO Replacement

<a id="markdown-Future%20MLIR%20XLA%20HLO%20Replacement" name="Future%20MLIR%20XLA%20HLO%20Replacement"></a>

IREE's current input dialect is the XLA HLO dialect representing operations on
tensors. This was a pragmatic decision based on having HLO already defined and
proof of existing models being lowered to it from Tensorflow, allowing us to
focus on the IREE-specific portions of work. Unfortunately, HLO is tied to
Tensorflow and has many quirks that would not otherwise have been designed had
that not been the case. There are discussions happening about an upstream MLIR
[Tensor Compute Primitives](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/)
dialect that HLO can be lowered into, allowing IREE (and other backends) to
decouple themselves from XLA and be easier to target from frontends.

### `linalg`: High-level Hierarchical Optimization

<a id="markdown-%60linalg%60%3A%20High-level%20Hierarchical%20Optimization" name="%60linalg%60%3A%20High-level%20Hierarchical%20Optimization"></a>

It's required that IREE inputs are all in tensor form (and not in-place memref
updates) in order to perform a large majority of the `flow` transformations.
Recent work in the [Linalg](https://mlir.llvm.org/docs/Dialects/Linalg/) dialect
is adding support for operating on value-semantic tensors, meaning that we can
first apply `mhlo` to `linalg` lowerings and any of the transformations
available in Linalg prior to performing our own `flow` lowerings. The advantage
is that Linalg will have much stronger and more principled code motion and
nested loop transformation optimizations than are possible on higher-level ops.
As not all operations can be represented as `linalg` ops, IREE will be able to
ingest a mix of `linalg`, `std`, and `mhlo` (or its replacement) ops.

### XLA HLO: Canonicalizations

<a id="markdown-XLA%20HLO%3A%20Canonicalizations" name="XLA%20HLO%3A%20Canonicalizations"></a>

Very little effort has been applied to `mhlo` optimizations and there are a
significant number of missing folders, canonicalizers, and simple
transformations. Many of these happen in legacy XLA C++ backends; however, we
need them in MLIR so that we can make use of dynamic shapes, mixed dialect
inputs, etc. The `tf2xla` bridge work (converting Tensorflow models into the
corresponding `mhlo` ops) is nearing its initial milestones and afterward we
expect more of these missing pieces to be filled in.

Examples of the optimizations that will greatly benefit IREE (and any other
backend consuming `mhlo`) include:

- Eliding unneeded transpose, reshape, and broadcast operations.
- Inserting transpose, reshape, and broadcast operations to allow for more
  optimal memory access patterns (such as transposing gather input to allow for
  memcpy-like transfers instead of column-wise cache-unfriendly accesses).
- Moving operations above broadcasts such that the smallest amount of work is
  performed.

### XLA HLO: Tensor to Primitive Conversion

<a id="markdown-XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion" name="XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion"></a>

HLO only operates on tensor values - even for simple scalars - and this
presents a problem when attempting to determine which code should be specified
to run on accelerators vs. what should run on the host. The canonical example
is `mhlo.while`, which as seen in the example below uses scalar tensors for its
loop iteration counter and comparison.

```mlir
%start = constant dense<1> : tensor<i32>
%bound = constant dense<3> : tensor<i32>
%res = "mhlo.while"(%start) ( {
^bb0(%count: tensor<i32>):
  %1 = "mhlo.compare"(%count, %bound) {comparison_direction = "LT"} : (tensor<i32>, tensor<i32>) -> tensor<i1>
  "mhlo.return"(%1) : (tensor<i1>) -> ()
}, {
^bb0(%count: tensor<i32>):
  %1 = mhlo.add %count, %count : tensor<i32>
  "mhlo.return"(%1) : (tensor<i32>) -> ()
}) : (tensor<i32>) -> tensor<i32>
```

A naïve but correct lowering (what's currently in IREE) would perform the
comparison and increment on the device and insert a host readback to see if the
loop should continue:

```mlir
func @main() -> tensor<i32> attributes {iree.module.export, iree.reflection = {f = "I1!R6!B3!t6", fv = "1"}} {
  %cst = constant dense<1> : tensor<i32>
  %cst_0 = constant dense<3> : tensor<i32>
  %cst_1 = constant dense<1> : vector<3xi32>
  br ^bb1(%cst : tensor<i32>)
^bb1(%2: tensor<i32>):  // 2 preds: ^bb0, ^bb2
  %3 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %2 : tensor<i32>, %arg2 = %cst_0 : tensor<i32>) -> tensor<i1> {
    %8 = flow.dispatch @main_ex_dispatch_0::@main_ex_dispatch_0[%arg0 : vector<3xi32>](%arg1, %arg2) : (tensor<i32>, tensor<i32>) -> tensor<i1>
    flow.return %8 : tensor<i1>
  }
  %4 = flow.tensor.load %3 : tensor<i1>
  cond_br %4, ^bb2(%2 : tensor<i32>), ^bb3(%2 : tensor<i32>)
^bb2(%5: tensor<i32>):  // pred: ^bb1
  %6 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %5 : tensor<i32>) -> tensor<i32> {
    %8 = flow.dispatch @main_ex_dispatch_1::@main_ex_dispatch_1[%arg0 : vector<3xi32>](%arg1) : (tensor<i32>) -> tensor<i32>
    flow.return %8 : tensor<i32>
  }
  br ^bb1(%6 : tensor<i32>)
^bb3(%7: tensor<i32>):  // pred: ^bb1
  return %7 : tensor<i32>
}
```

Of note is the `flow.tensor.load` op indicating a host readback. Though this
correctly executes the loop, it is extremely inefficient. What's desired is for
the loop iterator and condition to all happen on the host, with the iterator
being passed to the loop body as an argument that can be encoded into a command
buffer in future lowering stages. This eliminates host readback and allows for
much larger `flow.stream` sequences, feeding more into the pipeline for the
accelerator.

Not all source frontends have this issue (misrepresenting simple host
computation as non-dense tensor operations), and our goal is to add a
transformation that heuristically converts `mhlo` ops acting on small tensors
to `std` ops acting on primitive values (`i32`, `index`, etc).
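
As a rough sketch of what that conversion could produce for the `mhlo.while`
example above (the heuristic and its exact output are still to be designed),
the control flow becomes plain `std` ops on `i32` values that run on the host:

```mlir
%start = constant 1 : i32
%bound = constant 3 : i32
br ^bb1(%start : i32)
^bb1(%count: i32):  // 2 preds: ^bb0, ^bb2
  // The comparison is now a host-side primitive op - no device readback.
  %continue = cmpi "slt", %count, %bound : i32
  cond_br %continue, ^bb2(%count : i32), ^bb3(%count : i32)
^bb2(%cur: i32):
  %next = addi %cur, %cur : i32
  br ^bb1(%next : i32)
^bb3(%result: i32):
  // %result can be passed into dispatches as a primitive command buffer value.
  ...
```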

### Quantization

<a id="markdown-Quantization" name="Quantization"></a>

It's assumed that any work related to quantization/compression has happened
prior to lowering into IREE dialects. Our plan is to use the proposed
[Quantization Transforms](https://llvm.discourse.group/t/rfc-a-proposal-for-implementing-quantization-transformations-in-mlir/655)
to achieve both training and inference-time quantization of types in a way that
preserves maximum accuracy. IREE will support running with original unquantized
floats in all cases, allowing for a smooth on-ramp to quantization and the gains
in performance and reduction in model size that come from it.

As future work, IREE would like to move beyond these transformation-directed
approaches to quantization and interface directly to frontends which have a
defined enough type system to represent accurate quantized (and otherwise
compressed) computations directly, not relying exclusively on compiler-side
type inference transforms.

## `flow`: Data- and Execution-Flow Modeling

<a id="markdown-%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling" name="%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling"></a>

The `flow` dialect is designed to allow us to extract as much concurrency as
possible from a program and partition IR into the scheduling and execution
domains. Today we have the IR structure and transformation flow in place but
have not yet gotten to the most interesting things such an infrastructure
enables. A majority of the largest performance, latency, and memory usage
improvements IREE can offer are determined first here and all following
lowerings benefit. _The fastest code is the code you don't execute and the
smallest allocation is the allocation you don't make_ ;)

### Avoiding Readbacks with `flow.stream`

<a id="markdown-Avoiding%20Readbacks%20with%20%60flow.stream%60" name="Avoiding%20Readbacks%20with%20%60flow.stream%60"></a>

A majority of the readbacks we have today (manifested as `flow.tensor.load.*`
ops) will be removed when we have an
[HLO tensor->primitive conversion](#xla-hlo-tensor-to-primitive-conversion).
There will still be cases when readbacks are required for correctness but they
usually fall into a small set of usage patterns. For those that don't, this is
one place where IREE will warn about performance issues, still allowing
programs that perform suboptimally to run while encouraging authors to adjust
their input model to enable better behavior. The IREE VM also has specific
support for hiding readback latency in an efficient way via
[coroutines](#coroutines-for-batching-and-cooperative-scheduling).

The most common case we are currently seeing in the IR is that of dynamic
copies where the offsets are dependent on the result of previous computations.
Source models may have top-k + gather operations, for example. These appear as
a `flow.stream`, a `flow.tensor.load`, and then another `flow.stream` that uses
the loaded value for a `flow.tensor.update` (or other operation):

```mlir
%index_tensor = flow.ex.stream.fragment(...) -> tensor<i32> { ... }
%index = flow.tensor.load %index_tensor : tensor<i32>
%result = flow.ex.stream.fragment(%arg0 = %index : i32, ...) -> ... {
  %0 = flow.dispatch ...
  %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32>
  ...
}
```

Today the `flow.tensor.update` turns into HAL command buffer transfer operations
that must have their offsets known at recording time. This is a limitation of
`vkCmdCopyBuffer` but not a fundamental limitation of any hardware. In fact,
several drivers implement copies as small built-in shader programs, meaning that
we could perform the same expansion here with the right primitives. This would
allow, in the above example, both the index to be computed and the tensor to be
updated within the same stream to entirely remove the host round-trip.
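
A sketch of the desired end state (illustrative only; the ops for in-stream
indexed updates have not been designed yet) keeps the computed index on the
device so the whole sequence records into a single command buffer:

```mlir
%result = flow.ex.stream.fragment(...) -> ... {
  // The index-producing dispatch stays within the stream...
  %index = flow.dispatch @topk::@topk[%arg0 : vector<3xi32>](...) : (...) -> tensor<i32>
  %0 = flow.dispatch ...
  // ...and the update consumes it directly, with the offset resolved on the
  // device (e.g. via a small builtin copy shader) instead of on the host.
  %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32>
  ...
}
```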

### Threading `flow.stream` through the CFG

<a id="markdown-Threading%20%60flow.stream%60%20through%20the%20CFG" name="Threading%20%60flow.stream%60%20through%20the%20CFG"></a>

The current `flow.ex.stream.fragment`, as denoted by the `ex`perimental tag, is
a temporary implementation designed to get the concept of streams lowered to
the HAL dialect. For streams to be effective at modeling larger concurrency
scopes they need to be able to move across branches in the CFG. This intuitively
follows exactly what one would do if recording commands in C:

```c++
vkCmdCopyBuffer(cmd, ...);
if (some_flag) {
  vkCmdBindPipeline(cmd, ..., pipeline_a);
} else {
  vkCmdBindPipeline(cmd, ..., pipeline_b);
}
vkCmdDispatch(cmd, ...);
```

The corresponding `flow` IR:

```mlir
flow.stream.append[%s0](...) {
  flow.tensor.update ...
}
%b = cmpi ne %some_flag, ...
cond_br %b, ^a(%s0), ^b(%s0)
^a(%s1):
flow.stream.append[%s1](...) {
  flow.dispatch @pipeline_a, ...
}
br ^end(%s1)
^b(%s2):
flow.stream.append[%s2](...) {
  flow.dispatch @pipeline_b, ...
}
br ^end(%s2)
^end(%s3):
...
```

This allows the entire stream to be lowered into one command buffer without the
need for any host round-trips. The conversion into the `flow` dialect will walk
the CFG and attempt to thread the `flow.stream` values through so long as there
are no external dependencies.

### Predication of `flow.dispatch`

<a id="markdown-Predication%20of%20%60flow.dispatch%60" name="Predication%20of%20%60flow.dispatch%60"></a>

While the
[`flow.stream` threading through the CFG](#threading-flowstream-through-the-cfg)
can remove many of the simpler conditional dispatches, there will always be some
that have their execution dependent on the result of prior dispatches. For
these, a `flow.cond_dispatch` will allow a condition to be provided that must be
true for the dispatch to actually be performed.
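
A strawman of what such an op could look like (the syntax is illustrative only;
`flow.cond_dispatch` has not been designed in detail yet):

```mlir
%result = flow.ex.stream.fragment(...) -> ... {
  // %pred is produced by a prior dispatch in the same stream; when it is false
  // the conditional dispatch is skipped entirely without a host round-trip.
  %pred = flow.dispatch @compute_pred::@compute_pred[%arg0 : vector<3xi32>](%arg1) : (tensor<i32>) -> tensor<i1>
  %0 = flow.cond_dispatch %pred, @some_work::@some_work[%arg0 : vector<3xi32>](%arg2) : (tensor<4xf32>) -> tensor<4xf32>
  flow.return %0 : tensor<4xf32>
}
```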

For targets that natively support predication in their command buffers (such as
D3D12's
[ID3D12GraphicsCommandList::SetPredication](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12graphicscommandlist-setpredication))
this provides a host round-trip-free way of conditionally executing dispatches
and transfers. Unfortunately, Vulkan support is still lacking, though Nvidia
supports the
[VK_EXT_conditional_rendering](https://www.saschawillems.de/blog/2018/09/05/vulkan-conditional-rendering/)
extension that exposes the same behavior.

For targets that do not support predication natively, it's still possible to
emulate predication with
[indirect dispatches](https://github.com/gpuweb/gpuweb/issues/31). In this
model the workgroup counts normally used to dispatch execution are sourced from
another device buffer at the time the dispatch is made instead of sourced from
the command buffer at the time the dispatch is recorded. Degenerate dispatches
with counts of `0, 0, 0` allow for effective neutering of the dispatch with
minimal overhead (vs. the significant penalty of a host round-trip!).
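
Sketched in terms of the existing `hal` command buffer ops (the operand syntax
here is approximate and only meant to show the shape of the lowering), the
emulation becomes:

```mlir
// A tiny dispatch writes either the real workgroup counts or [0, 0, 0] into
// %counts_buf based on the predicate computed earlier in the same stream.
hal.command_buffer.dispatch %cmd, %select_counts_exe, entry_point = 0, workgroup_xyz = [%c1, %c1, %c1]
hal.command_buffer.execution_barrier %cmd, ...
// The real work sources its workgroup counts from the buffer at execution
// time; zero counts neuter the dispatch with negligible overhead.
hal.command_buffer.dispatch.indirect %cmd, %work_exe, entry_point = 0, workgroups = %counts_buf[%c0]
```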

By modeling such predication at the `flow` level we are able to lower into the
HAL with target-aware predication semantics and fuse indirect dispatch workgroup
count calculations into existing dispatches already being performed such that
overhead is reduced.

### Deduping `flow.executable`s

<a id="markdown-Deduping%20%60flow.executable%60s" name="Deduping%20%60flow.executable%60s"></a>

While still in the `flow` dialect, the executables are target-agnostic. This
makes simple IR tree diffing a potential solution to deduplication. Since most
of the dispatches originate from the same source-language library calls in
input frameworks there's a high likelihood of duplication, and depending on
when inlining is performed we may have stronger or weaker ability to perform
the deduplication. Thanks to the MLIR canonicalization pass (which ensures ops
are rearranged into consistent canonical representations) the IR comparisons
can be done rather trivially.

### Rematerializing CSE'd Expressions

<a id="markdown-Rematerializing%20CSE'd%20Expressions" name="Rematerializing%20CSE'd%20Expressions"></a>

Common subexpression elimination is performed many times during lowering;
however, there comes a point where CSE can introduce false dependencies and
additional allocations that are otherwise avoidable. For example, if a
broadcasting operation is CSE'd and the result is then used by two or more
operations that are scheduled independently, what would have been a relatively
cheap lowering of the broadcast to a simple index remapping now becomes an
additional dispatch, materialization of an intermediate tensor, and a barrier:

```mlir
%bcast = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
%mul1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
// (pretend something here that prevents fusion)
%mul2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
```

```mlir
%bcast = flow.dispatch.region(%cst : tensor<f32>) -> tensor<1024x10xf32> {
  %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  return %0 : tensor<1024x10xf32>
}
// a barrier will be required here
%mul1 = flow.dispatch.region(%arg0 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
%mul2 = flow.dispatch.region(%arg1 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
  return %2 : tensor<1024x10xf32>
}
```

Instead, the broadcast should be rematerialized inside both dispatch regions, as
the cost of doing so is significantly less in compute resources and the
intermediate tensor will not be required at all. At first it may seem
counter-intuitive to undo such a critical optimization as CSE (which benefits
both code size and often compute), but it's something we must carefully balance
while looking at the whole system. It becomes even more important when
considering multi-device execution, as the cost of sharing memory and
synchronizing may be extremely non-trivial.
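
For illustration, the rematerialized form of the example above clones the
broadcast into each region so the intermediate tensor and barrier disappear
entirely:

```mlir
%mul1 = flow.dispatch.region(%cst : tensor<f32>, %arg0 : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  // The broadcast is recomputed locally - a cheap index remapping when fused.
  %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  %1 = mhlo.multiply %arg0, %0 : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
%mul2 = flow.dispatch.region(%cst : tensor<f32>, %arg1 : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
  %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
  %1 = mhlo.multiply %arg1, %0 : tensor<1024x10xf32>
  return %1 : tensor<1024x10xf32>
}
```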

### Device Placement

<a id="markdown-Device%20Placement" name="Device%20Placement"></a>

While still within the `flow` dialect we have the ability to easily split
streams and safely shuffle around operations. Target execution backends can opt
into such behavior to ensure that device restrictions such as maximum in-flight
memory, maximum scheduling depth, and capabilities are observed. For
heterogeneous configurations the intent is that certain operations, dispatches,
and streams can be attributed to specify which device categories they should be
lowered to. The constraint solving that takes place can be provided with generic
heuristics ("big GEMMs go on the accelerator"), profile-guided databases based
on benchmarks, learned traits via ML, etc.
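
A hypothetical sketch of what such attribution could look like (the attribute
name and category values are illustrative; no placement attribute has been
finalized):

```mlir
// A large matmul-heavy dispatch pinned to an accelerator-class device, while a
// small gather-style dispatch stays on the CPU next to the host logic.
%0 = flow.dispatch @big_gemm::@big_gemm[%wl0 : vector<3xi32>](%lhs, %rhs)
    {flow.device_placement = "accelerator"} : (tensor<2048x2048xf32>, tensor<2048x2048xf32>) -> tensor<2048x2048xf32>
%1 = flow.dispatch @gather::@gather[%wl1 : vector<3xi32>](%table, %indices)
    {flow.device_placement = "cpu"} : (tensor<1024x16xf32>, tensor<8xi32>) -> tensor<8x16xf32>
```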

## `hal`: Hardware Abstraction Layer and Multi-Architecture Executables

<a id="markdown-%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables" name="%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables"></a>

As the IREE HAL is designed almost 1:1 with a compute-only Vulkan API, many of
the techniques classically used in real-time graphics apply. The benefit we have
by modeling our usage of such a low-level API in IR is that the normal work -
some of which is very non-trivial - for managing allocations, tracking resource
lifetime, and ensuring proper synchronization/barriers is something we can apply
the full force of an offline compiler against.

### Allow Targets to Specify `hal.interface`s

<a id="markdown-Allow%20Targets%20to%20Specify%20%60hal.interface%60s" name="Allow%20Targets%20to%20Specify%20%60hal.interface%60s"></a>

The `hal.interface` op specifies the ABI between the scheduler and the device
containing the buffer bindings and additional non-buffer data (parameters,
shapes, specialization flags, etc). Today a naïve ordering is used uniformly for
all targets; however, it is possible for target backends to opt into providing
their own interfaces based on target configuration. The same `hal.executable`
may have multiple interfaces and the same backend may use one or more. This is
useful for when target capabilities may vary at runtime, such as the
[number of available storage buffer bindings](https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxPerStageDescriptorStorageBuffers&platform=android)
in Vulkan. By exposing a few `hal.interface` variants with different binding
amounts the Vulkan backend could make better use of the larger number of
bindings available at runtime while still providing support for smaller
configurations.
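
For reference, a `hal.interface` for a simple two-input, one-output dispatch
looks roughly like the following under today's naïve ordering (attribute
spellings may differ slightly across IREE versions):

```mlir
hal.interface @legacy_io {
  // Read-only inputs and a writable output, each bound as a storage buffer.
  hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
  hal.interface.binding @arg1, set=0, binding=1, type="StorageBuffer", access="Read"
  hal.interface.binding @ret0, set=0, binding=2, type="StorageBuffer", access="Write|Discard"
}
```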

Once we have multiple `hal.interface`s defined for executables the scheduler
needs to emit HAL ops that properly switch between them. By having a canonical
form for bindings we can ensure that only the differences between the interfaces
will need additional code.

### Target-specific Scheduling Specialization

<a id="markdown-Target-specific%20Scheduling%20Specialization" name="Target-specific%20Scheduling%20Specialization"></a>

Though the `flow` dialect attempts to fuse as many ops as possible into dispatch
regions, it's not always possible for all target backends to schedule a region
as a single dispatch. A classic example is algorithms like
[parallel reduction](https://en.wikipedia.org/wiki/Reduction_Operator#PRAM-algorithm)
commonly used on GPUs that may require many dispatches to identical executables,
while other algorithms may vary the executables they use based on the input
parameters such as shape or the target runtime device support.

By default the `flow.dispatch` executable translation to `hal.executable`s is
performed 1:1 and it is assumed that a single dispatch is required. Extending
target backends with scheduling interfaces (enabling them to opt into different
scheduling behavior) will allow the backends to emit any number of
`hal.executable`s and any stream commands (such as additional dispatches or
transfers) they may need. This is effectively equivalent to what would be done
at runtime, except that because we are still operating on IR prior to buffer
allocation we can use the `hal` ringbuffer primitive. Through this we can elide
many of the allocations that would otherwise be required at runtime (and the
concurrency-limiting false dependencies that usually come along with scratch
memory).
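
As a purely illustrative sketch (op spellings approximate), a backend opting
into a two-pass parallel reduction could expand a single `flow`-level dispatch
into two dependent HAL dispatches separated by a barrier, with the intermediate
partial sums placed in the transient ringbuffer:

```mlir
// Pass 1: each workgroup reduces a tile into a transient partials buffer.
hal.command_buffer.dispatch %cmd, %reduce_pass1, entry_point = 0, workgroup_xyz = [%num_tiles, %c1, %c1]
hal.command_buffer.execution_barrier %cmd, ...
// Pass 2: a single workgroup reduces the partials into the final result.
hal.command_buffer.dispatch %cmd, %reduce_pass2, entry_point = 0, workgroup_xyz = [%c1, %c1, %c1]
```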

Since the algorithm used may vary based on the parameters of the dispatch (such
as the shape of the reduction, which may be dynamically determined) scheduling
specialization may occur even when targeting a single backend. In many cases
folding and canonicalization can eliminate the overhead, as the same IR is
present regardless of which dynamically computed workgroup size is used.

### Buffer Usage Tracking

<a id="markdown-Buffer%20Usage%20Tracking" name="Buffer%20Usage%20Tracking"></a>

Many explicit hardware APIs require knowing how buffers are used along with
where they should be located. For example this additional information determines
caching policy on buffer accesses (write-through, write-back, etc), visibility
of writes across compute units, and the possible MMU properties that may need to
be maintained/matched for the buffer. By using the SSA-form value-semantics of
the MLIR `tensor` as used in the `flow` dialect we have complete information of
where buffers may be used or at least where they enter or leave regions where we
can derive such information.

Analysis passes can run over IR to attribute tensors such that when allocation
is performed during lowering to the `hal` dialect we do so from an allocator
compatible with where the buffer will be used, with memory types chosen based on
the potential cost and location of operations performed (write-only on host vs.
read-write on host and device, etc), and with usage bits indicating what kind of
operations may be performed on the buffer. Many of these are local
transformations as most buffers are only live within very small regions such as
the `flow.stream` encompassing their usage.
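
A sketch of the result at allocation time (the memory type and usage flag names
shown are illustrative, not an exhaustive or exact list):

```mlir
// A transient tensor only touched by dispatches: device-local memory with
// transfer/dispatch usage bits and no host mapping.
%transient = hal.allocator.allocate %allocator, "DeviceLocal", "Transfer|Dispatch", %size0 : !hal.buffer
// A result the application reads back: host-visible mapping is also required.
%result = hal.allocator.allocate %allocator, "HostVisible|DeviceVisible", "Transfer|Dispatch|Mapping", %size1 : !hal.buffer
```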

Traditional systems need to either use very permissive buffer properties or
heuristics that can introduce additional non-trivial overhead when such
heuristics are incorrect. For example,
[OpenGL had several such usage hints](https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferData.xhtml)
that drivers were then able to use but almost no drivers behaved as desired in
all cases and it led to additional memory ghosting, copies, readbacks, and
unpredictable performance. For almost all uses of the buffers within an IREE
invocation we can instead know precisely where and how buffers may need to be
moved and do so a minimum number of times if it is required.

### Batched Executable Caching and Precompilation

<a id="markdown-Batched%20Executable%20Caching%20and%20Precompilation" name="Batched%20Executable%20Caching%20and%20Precompilation"></a>

For targets that may require runtime preprocessing of their executables prior to
dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch
compilation mechanism based on Vulkan's
[Pipeline Cache](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html).

Today each executable is compiled on-demand and cached only for the process
lifetime. Though some drivers may provide their own caching we can make better
use of the explicit caching and compilation behavior with the additional
information we have in the compiler.

For any given entry point (or group of entry points) into an IREE module we can
perform reachability analysis to know which executables may be executed when
that entry point is invoked. In this way we can emit pre-invocation compilation
checks (similar to an `std::call_once` block) that provide all required
executables for compilation and allow more efficient compilation through
multithreading the compiler invocations. These same compilation caching
functions can be exposed and invoked manually by an application to force
pre-compilation when it is least likely to impact the user, such as a
post-install/first-run step or concurrently while other application features
are loading.
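
A hypothetical IR sketch of such a one-time preparation block (the op names are
illustrative; the real caching ops may differ):

```mlir
// Executed once per scope (guarded like std::call_once): every executable
// reachable from the entry point is handed to the cache for batch compilation.
%cache = hal.executable_cache.create %device, identifier = "inference" : !hal.executable_cache
%exe0 = hal.executable_cache.prepare %cache, @main_ex_dispatch_0 : !hal.executable
%exe1 = hal.executable_cache.prepare %cache, @main_ex_dispatch_1 : !hal.executable
```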

We can use zero or more scoped caches for executables within a module.
Completely dynamic modules (such as those emitted in eager-mode usage) may avoid
the caching overhead entirely, while modules that have several primary usage
modes (such as training and inference) may choose to use independent caches for
each such mode.

The caches generated can then be retrieved and saved by the hosting application.
Upon the next execution the application can provide the caches and, if still
valid, they will be used to avoid compilation.

### Target-aware Executable Compression

<a id="markdown-Target-aware%20Executable%20Compression" name="Target-aware%20Executable%20Compression"></a>

An advantage of representing executable binaries in IR after translation is that
we can apply various post-compilation compression and minification techniques
while still knowing precisely where the executable will be used. This is
extremely important for SPIR-V as it is not designed to be a small at-rest
format. Though the biggest lever we have to control generated code size is
higher-level deduplication and specialization, there will still be a
sufficiently large number of executable binaries we will need to embed within
the final modules, and having targeted approaches for reducing their size beyond
just "gzip everything" is very powerful.

For example, [SMOL-V](https://github.com/aras-p/smol-v) is a fantastic lossless
SPIR-V compression technique that, when coupled with modern dictionary-based
compression algorithms, can save significant binary size. As a data point, the
SPIR-V corpus SMOL-V uses for testing goes from 4.8MiB of raw SPIR-V to 348KiB
of compressed SMOL-V.

Combined with
[Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation)
we can easily use shared dictionaries and other cross-artifact compression in a
relatively plug-in way.

### Target-aware Constant Compression

<a id="markdown-Target-aware%20Constant%20Compression" name="Target-aware%20Constant%20Compression"></a>

It's still an area that needs more research, but one goal of the IREE design was
to enable efficient target- and context-aware compression of large constants
(typically model weights/parameters/embeddings). This may mean reusing existing
hardware compression formats on GPUs, ML accelerator-specific formats, or
very-low-bit-depth (1-4 bit per value) quantization techniques that cannot be
directly used without first decompressing. The inspiration here is formats like
[Crunch](https://github.com/BinomialLLC/crunch) and
[Basis Universal](https://github.com/BinomialLLC/basis_universal) that perform
["supercompression"](http://gamma.cs.unc.edu/GST/gst.pdf), and we may even be
able to use these directly as then we can make use of GPU hardware samplers to
do the 4-bit to 32-bit decompression, etc.

### Command Buffer Stateful Deduplication

<a id="markdown-Command%20Buffer%20Stateful%20Deduplication" name="Command%20Buffer%20Stateful%20Deduplication"></a>

The IREE HAL - much like the Vulkan API it is based on - eschews much of the
state that traditional APIs have in favor of (mostly) immutable state objects
(pipeline layouts, pipeline states, descriptor sets, etc). There are still a few
stateful entry points in the API, though, and deduplicating or reordering
redundant calls can reduce IR, API, and execution overhead.

The key place this will have the largest impact is around descriptor set
bindings and push descriptors, both of which are state and can have non-trivial
setup overhead. A canonicalization for such commands that inspects the target
`hal.command_buffer` to see if the same state was set prior, together with code
motion to move such commands out of loop bodies when possible, would be helpful.
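
For instance (op spellings approximate), a loop body that re-pushes an identical
descriptor set on every iteration is a prime candidate for hoisting:

```mlir
^loop_body(%i : index):
  // Identical bindings pushed each iteration - a canonicalization could hoist
  // this above the loop since the bound state never changes within it.
  hal.command_buffer.push_descriptor_set %cmd, %layout, set = 0, bindings = [0 = (%buf, %c0, %len)]
  hal.command_buffer.dispatch %cmd, %exe, entry_point = 0, workgroup_xyz = [%x, %y, %z]
  ...
```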

### Resource Timeline

<a id="markdown-Resource%20Timeline" name="Resource%20Timeline"></a>

A core concept of the IREE scheduler that allows for overlapping in-flight
invocations is that of the resource timeline. This identifies module state that
can be in use by multiple invocations and assigns timeline milestones denoting
when the resource will be in the appropriate state for the current invocation to
proceed. Conceptually it is like an epoch-based synchronization mechanism as
commonly found in garbage collectors to allow for lock-free asynchronous memory
reclamation.

The advantage we have in the IR is that we know both the usage of all resources
thanks to [buffer usage tracking](#buffer-usage-tracking) and the
synchronization domains of all resources (in most cases). This allows us to
effectively assign one timeline semaphore per writeable resource while in
practice having far fewer than 1:1; for example, if two resources are only ever
written in the same command buffer only one semaphore is needed to signal the
completion of both writes.

By transforming IR to sink all resource reads and writes closest to where the
value is used we can enlarge the time windows that can overlap across
invocations that may share those resources. This is similar to what out-of-order
CPUs do with register renaming/reorder buffers/etc and something we can apply
some traditional instruction scheduling techniques to (only here our
'instructions' are entire command buffer dispatches/transfers).

Two degenerate cases of this approach are that of resource indirection
(`iree.ptr<tensor<T>>`) and dynamic resource shapes. In these two cases it may
not be possible to continue recording commands even if we are able to ensure
execution is appropriately synchronized. This is where indirect dispatch,
[predication](#predication-of-flowdispatch),
[indirect command buffers](#indirect-command-bufferon-accelerator-execution),
and [VM coroutines](#coroutines-for-batching-and-cooperative-scheduling) can all
help cover for the times where we are unable to transform away the indirection
or emit shape logic without data dependencies.

### Transient Tensor Ringbuffer

<a id="markdown-Transient%20Tensor%20Ringbuffer" name="Transient%20Tensor%20Ringbuffer"></a>

(When properly implemented) almost all buffers required during execution never
escape the command buffers they are used in or a single VM invocation. We can
trivially identify this from the explicit captures of `flow.stream` and
`flow.dispatch` ops and the fact that all tensor types have value-semantics.
Only those tensor values loaded-from/stored-to module state or that cross the
exported module function boundary need special consideration, while almost
everything else can live transiently only so long as it is required during
execution.

Thanks to this information about buffer usage and lifetime we can use a
[ringbuffer](https://en.wikipedia.org/wiki/Circular_buffer) to store the
transient tensor data and other required data reservations such as uniform
buffers used to pass dynamic parameters (shapes, flags, etc) into dispatches.
This gives the compiler and the application a knob that allows them to control
maximum concurrency (by having a very large ringbuffer) or maximum memory usage
(by having a minimally small ringbuffer).

Allocating tensors from the ringbuffer does not require sophisticated runtime
packing as we can emit IR to calculate required sizes for dynamically shaped
tensors. Whether a basic block reserves `%sz = constant 42 : index` bytes or
`%sz = std.muli %cst, %dyn_dim : index` bytes doesn't materially change how the
allocations are performed. Since almost all usage involves simple write head
bumps there is no need for ahead-of-time memory planning or large fixed
allocations, and since no buffer within the ringbuffer can alias we can have
coarse (_read: low overhead_) guarantees about the availability of certain
regions of the ringbuffer (_"when this event is signaled all prior ringbuffer
writes have completed"_).
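
A small sketch of that calculation (the ringbuffer reservation op is
hypothetical; only the size math uses existing `std` ops):

```mlir
// Byte size of a dynamically shaped tensor<?x128xf32> computed entirely in IR.
%c512 = constant 512 : index  // 128 elements * 4 bytes per f32
%byte_size = std.muli %dyn_dim, %c512 : index
// Hypothetical op: bump the ringbuffer write head and get the reserved offset.
%offset = hal.ringbuffer.reserve %ringbuffer, %byte_size : index
```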

Usually any planning we may want to perform can be done in IR via code motion.
For example, applying traditional algorithms used to reduce register pressure
will help us attain narrower live windows within the ringbuffer, leading to a
larger number of in-flight operations for the same ringbuffer memory usage.

We may end up using both a classical ringbuffer and a variant known as the
[bip buffer](https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The-Circular-Buffer-with-a-Twist)
because it is better for descriptor set utilization (as we can provide many
dispatch parameters with a single base offset bound once at the beginning of a
region).

### Timeline Semaphores on the Module ABI

<a id="markdown-Timeline%20Semaphores%20on%20the%20Module%20ABI" name="Timeline%20Semaphores%20on%20the%20Module%20ABI"></a>

Function calls made across modules (either from C++ into the VM, VM->VM, or
VM->C++) should be able to define timeline semaphores used to wait and signal on
the call. We can do this by making all exports automatically have the semaphores
and then making invocations populate them if they were not provided by the
caller. In this way we can allow multiple invocations of exported functions to
chain naturally with internal asynchronous workloads, turning most IREE
invocations into just the recording of command buffers that never block.
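
Conceptually each exported function gains wait/signal semaphore parameters along
the lines of the sketch below (types and parameter order are illustrative, not a
finalized ABI):

```mlir
// The implementation waits on (%wait_sem >= %wait_value) before touching the
// inputs and signals %signal_sem to %signal_value once the outputs are ready.
func @predict(%wait_sem : !hal.semaphore, %wait_value : i64,
              %signal_sem : !hal.semaphore, %signal_value : i64,
              %input : !hal.buffer_view) -> !hal.buffer_view
    attributes {iree.module.export}
```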

When combined with
[VM coroutine support](#coroutines-for-batching-and-cooperative-scheduling) we
even have the ability to interleave any required host execution between the wait
and signal semaphores provided such that the caller never knows on which device
execution is taking place. It's still possible to provide synchronous wrappers
that emulate blocking behavior, but by having the core system designed around a
single system-supported primitive we avoid the need for additional things like
interrupt watchdog threads, implicit blocking, and other pitfalls.

### GPU-like CPU Scheduling

<a id="markdown-GPU-like%20CPU%20Scheduling" name="GPU-like%20CPU%20Scheduling"></a>

One approach to using multiple cores on a CPU is to perform interior
parallelization of operations, such as with OpenMP or library-call-based custom
thread pools (gemmlowp). This works when each individual operation is relatively
costly vs. potential pipeline bubbles caused by work spinning down near the end
of an operation and spinning up at the beginning of the next.

IREE is designed to handle many more workloads - some of which have very narrow
shapes but very deep pipelines (like search algorithms) - such that the above
approach of multithreading within ops becomes a bottleneck. These workloads are
traditionally very poorly handled by frameworks, and issues with
oversubscription, pipeline stalls, and suboptimal system schedulers (such as on
Android) can lead to more time being spent thrashing about than actually
executing real work.

The approach we take here is to treat the cores of a CPU as if they were
computation units on a GPU, each able to perform some set of heterogeneous work
independent of other units. This means that the concurrency we are trying to
model at the `flow` level and communicate to the runtime via the `hal` - which
explicitly states which dispatches can overlap and the size of the workgroups -
can trivially be used to distribute this work over many cores exactly as a GPU
would do it. Integration with library calls that may require their own threading
(such as Ruy) requires that they be able to use the IREE thread pool instead of
their own.

In this way we can avoid pipeline bubbles and other latency-inducing
unpredictable scheduling. This does not mean that we treat individual units of
work at the same scale as we would for GPUs, but instead that we tile and have
one or more processing units that allow us to work on those tiles. Whether the
tile size is defined by a library call contract, heuristics, or empirically is
TBD, but expect workgroup sizes in the thousands to millions of invocations vs.
normal GPU workgroup sizes in the dozens to hundreds of invocations.

To achieve this style of scheduling efficiently we'll likely use
[marl](https://github.com/google/marl) as the scheduler. This provides
cross-platform low-overhead fibers and is compatible with this style of
scheduling as it was built for the SwiftShader software rasterizer.

Even if IREE were only targeting CPUs the assertion is that we would still want
to schedule this way, and it's only an incidental benefit that when building for
heterogeneous targets the scheduling code may be shared (just with a different
divisor for workgroup count calculations).

## `vm`: Lightweight Virtual Machine

<a id="markdown-%60vm%60%3A%20Lightweight%20Virtual%20Machine" name="%60vm%60%3A%20Lightweight%20Virtual%20Machine"></a>

The VM is designed as a dynamic linkage ABI, stable bytecode representation, and
intermediate lowering IR. Many of the optimizations we can perform on it will
benefit all use cases (such as when lowering to LLVM IR) by allowing
higher-level program transformations around synchronization that are difficult
to perform on arbitrary LLVM IR.

### Coroutines for Batching and Cooperative Scheduling

<a id="markdown-Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling" name="Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling"></a>

One of the largest features currently missing from the VM is coroutines (aka
user-mode fiber scheduling). Coroutines are what will allow us to have multiple
in-flight invocations into a module - some of which may be waiting on external
events - without the need for complex multithreading logic or state machine
machinations.

In many cases
[once semaphores are exposed to callers](#timeline-semaphores-on-the-module-abi)
we will not need to yield in the VM. The user will call into the module with
provided semaphores, the work to perform will be recorded to one or more command
buffers and submitted to the device, and then control will return to the caller
immediately.

In cases requiring host readbacks that we were not able to remove, however,
additional VM code may need to run prior to when the final semaphore is
signaled. To preserve the asynchronous interface and immediate execution
guarantees the compiler can emit explicit yield points (`vm.yield`) that are
known-good locations for yielding (such as most resources not required after the
yield having been flushed/discarded, partial synchronization scope availability
if other work may be able to execute concurrently irrespective of the yielded
coroutine, etc).

When the VM encounters the yield at runtime it will suspend the coroutine until
a defined condition is met. Many coroutines can be in various states at any
given time and - thanks to the resource timeline - can still be memory safe. For
example, if two stateless invocations are made with a common wait semaphore both
can be recorded and submitted without waiting on each other. If internal module
state is accessed the invocations are implicitly ordered by invocation order
(similar to what Vulkan calls
[API order](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch02s02.html#fundamentals-queueoperation-apiorder))
based on internal resource timeline semaphores.

Waking the coroutines can be performed by an application-provided callback in
the case of the application already having a periodic event that does
bookkeeping (such as frame-end callbacks when rendering or Looper idle events on
Android), giving direct control over the frequency and location at which IREE
performs additional work. A helper that runs a dedicated IREE thread to do this
will be provided as well, but the expectation is that applications can often do
a better (and, importantly, more predictable) job.

By utilizing coroutines IREE will have a way to fill traditional pipeline
bubbles even with execution from the same module (let alone across modules) in
the situation where host readbacks or other logic is required. This increases
overall throughput and utilization while reducing host wakeups, as many
coroutines can be processed at once to submit new work to the device queues,
though it does not help reduce per-invocation latency.

External code such as the HAL implementation or user ops may provide the wait
handles used for continuation. For example, the HAL can expose a function that
yields and wakes only when one or more timeline semaphores reach their target
values:

```mlir
// submit work
hal.device.yield %semaphore4 >= %sem4_target, %semaphore5 >= %sem5_target
// continue here, possibly much later in time
```

#### Cellular Batching

<a id="markdown-Cellular%20Batching" name="Cellular%20Batching"></a>

Though coroutines help throughput there is a way we've found to reduce latency
that's been documented as
[cellular batching](http://madsys.cs.tsinghua.edu.cn/publications/EUROSYS2018-gao.pdf).
This same technique has been implemented in prior internal systems and is one
of the motivating design goals for IREE's creation. The core idea is to
identify small uniform work that can be partitioned and scheduled greedily so
as to enable batching or to reduce associated invocation costs (such as
refreshing accelerator SRAM/caches with new parameters). This usually
manifests as finding large GEMM/GEMV operations that use the same fixed
parameters and either dynamically increasing the batch size by adding the
waiting work (without deferring the actual execution time) or sequencing them
back to back to ensure better cache utilization. Which approach is taken
depends on any data dependencies that may be present (such as LSTM state
feedback edges).

With the foundation of coroutines in IREE it's possible to yield execution at
any given point - including during command buffer recording - and wake on
specific conditions. A majority of the logic can be built into the module
itself with very little need for runtime machinery, as shared VM variables can
be used to track pending work across invocations (even from different parts of
the program) and flush based on logic wholly controlled by the user or compiler
(such as count limits, maximum time/latency limits, etc). This allows for the
large variety of scheduling behaviors applications may want, ranging from
zero-latency batching only within a single invocation, to
[Nagle's Algorithm](https://en.wikipedia.org/wiki/Nagle%27s_algorithm)-esque
time- or count-based behavior, to some learned model-specific windowing.

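
As a sketch of what that module-side bookkeeping could look like - the ops
below are hypothetical strawman IR and the count-based policy is only an
illustrative assumption, not a committed design - a shared VM global might
accumulate pending requests and trigger a flush once a limit is hit:

```mlir
// Hypothetical strawman: a mutable VM global tracks pending requests across
// invocations; work is flushed as one batched submission once a count limit
// (or, equally, a deadline) is reached.
vm.global.i32 @pending_count mutable : i32

func @enqueue(%request : !hal.buffer) {
  // ... append %request to the shared pending batch state ...
  %count = vm.global.load.i32 @pending_count : i32
  %one = vm.const.i32 1 : i32
  %new_count = vm.add.i32 %count, %one : i32
  vm.global.store.i32 %new_count, @pending_count : i32
  %max = vm.const.i32 8 : i32  // illustrative count limit
  %should_flush = vm.cmp.gte.i32.s %new_count, %max : i32
  vm.cond_br %should_flush, ^flush, ^defer
^flush:
  // ... record and submit one batched dispatch covering all pending work ...
  %zero = vm.const.i32.zero : i32
  vm.global.store.i32 %zero, @pending_count : i32
  vm.return
^defer:
  vm.return  // leave the work pending; a later invocation or timer flushes it
}
```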

Design work is still required on how to represent this in IR, but the current
thought is to model the regions in which deferred execution is possible and
beneficial and to allow additional transformations during lowering to the VM.
This is similar to how async-await behavior works in C#, where the async
keyword is just sugar that expands to additional generated helper utilities.

A simple strawman representation for sequential dispatch may look like:

```mlir
hal.scheduling_policy @defer_policy {
  // max time, max count, max live memory, etc
}
...
hal.command_buffer.dispatch.deferred @defer_policy, @dispatch, ...
// vm.yield added here during lowering
```

There are many cases to explore, and as cellular batching can have performance
benefits of several orders of magnitude it will be one of the primary areas of
research in the long term.

### Lowering to LLVM IR

<a id="markdown-Lowering%20to%20LLVM%20IR" name="Lowering%20to%20LLVM%20IR"></a>

For scenarios where dynamic module loading is not required and entire modules
can be compiled into applications, we can lower the VM IR to LLVM IR within
MLIR's transformation pipeline. Instead of embedding `vm.call` ops that are
dispatched at runtime to things like the HAL, we can lower to `llvm::CallInst`s
on runtime-resolved function pointers. This still enables all of the
flexibility of heterogeneous/runtime-determined devices, pluggable diagnostics,
and backend composition without any need for flatbuffers or the VM bytecode
interpreter.

The VM was designed to make such a lowering easy, and the C-style struct-based
function pointer registration for runtime modules was designed to make emitting
code that uses it fairly robust even when linked in dynamically, such as when
embedded in shared objects.

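
As a rough sketch of the lowered form - the names and table layout below are
purely illustrative, not the actual registration ABI - a former `vm.call` to an
import could become an indirect call through a struct of function pointers that
the runtime fills in when the module is loaded:

```mlir
// Hypothetical strawman: the import table is a C-style struct of function
// pointers registered by the runtime; what was a vm.call dispatched by the
// interpreter becomes a direct indirect call through a slot in that table.
%imports = llvm.load %module_state_ptr ...      // struct of registered imports
%submit_fn = llvm.extractvalue %imports[3] ...  // illustrative slot index
%status = llvm.call %submit_fn(%module_state_ptr, %cmd) ...
```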

An extension of this is what we've been calling 'runtimeless mode', where the
IREE VM linkage code is statically linked into the binary alongside the
generated module LLVM IR. If only a single HAL backend is linked in then (with
some build-fu) we should be able to get call devirtualization to reduce code
size to precisely the functionality used by the module.

### Improved Type Support

<a id="markdown-Improved%20Type%20Support" name="Improved%20Type%20Support"></a>

Currently the VM only supports two types: `i32` and `vm.ref<T>`. This is an
intentional limitation so that we can determine what is really needed to
express the scheduling we perform, the idea being that such a limited model
will make it easier to use techniques like
[indirect command buffers](#indirect-command-bufferon-accelerator-execution) to
compile the VM itself to an accelerator executable that dispatches work without
host involvement.

As we port more models we may find a few primitives that are worth bringing
into the VM design even at the cost of potential complications to future
porting. These include types like `f32` (for simple float
calculations/comparisons), `list`/`dict` (for easier Python compatibility), and
`vector<4xf32>` (for simple inline calculations that are not worth the dispatch
overhead/synchronization).

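
For example - with hypothetical ops, since none of these exist in the VM
today - `f32` support would let a small data-dependent decision stay on the
host without a round trip through a device dispatch:

```mlir
// Hypothetical strawman: trivial scalar math and a comparison kept in the VM,
// used only to decide whether further device work should be submitted.
%threshold = vm.const.f32 0.5
%scaled = vm.mul.f32 %score, %threshold : f32
%exceeds = vm.cmp.gt.f32 %scaled, %limit : f32
vm.cond_br %exceeds, ^submit_more_work, ^done
```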

### Indirect Command Buffer/On-Accelerator Execution

<a id="markdown-Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution" name="Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution"></a>

Though IREE will use many different tricks such as
[predication](#predication-of-flowdispatch) to build deep pipelines, there is
still the requirement that command recording and submission happen on the host
CPU. Though the cost of this in terms of latency and power use can be minimized
by coalescing and timelines, there is still the possibility of non-trivial
roundtrips being introduced that limit performance. For particular applications
like low-power always-on compute, or where there is significantly branchy
behavior (such as search algorithms), it is important that the decision-making
logic as to what is dispatched runs as close to real time as possible within
the execution pipeline.

The IREE VM is designed to be runnable on-device in a secure and cooperative
way (no pointers, indirect buffer handles to allow for memory space
rearrangement op-to-op, deterministic execution and explicit yield points,
etc).

The recent efforts to bring indirect command buffers to Vulkan, and Metal's
[Indirect Command Buffers](https://developer.apple.com/documentation/metal/indirect_command_buffers/encoding_indirect_command_buffers_on_the_gpu)
(both of which derive inspiration from
[NV_command_list](https://www.khronos.org/registry/OpenGL/extensions/NV/NV_command_list.txt)),
are one such target for this. Either by
[lowering the VM IR to LLVM IR](#lowering-to-llvm-ir) or SPIR-V, by a special
conversion to target-specific forms, or by actually executing the VM bytecode
directly on-device (it's ~1000 LoC), we should be able to prototype what full
on-device usage is like. Even if only some VM functions - those the compiler
deems useful to schedule on the device - run there while the rest run on the
host (particularly functions calling imported functions), some of the most
costly logic that creates tight coupling of the host and device scheduling can
be limited.