Adding design doc for the IREE invocation execution model. (#9630)

Not all of the implementation has landed, so the examples are abstract, but
this is what we're building towards.
diff --git a/docs/developers/design_docs/execution_model.md b/docs/developers/design_docs/execution_model.md
new file mode 100644
index 0000000..aa67e33
--- /dev/null
+++ b/docs/developers/design_docs/execution_model.md
@@ -0,0 +1,478 @@
+# IREE Invocation Execution Model
+
+This document describes the behavior of the user-visible invocation mechanism
+IREE uses to schedule program execution. Internally IREE uses a very similar
+model to track its own workloads and in turn carries that down to target APIs
+and devices that themselves use a similar model. The intent is to expose the
+device model in an abstracted way that allows the full execution intent to be
+captured and propagated to the hardware that executes it. Though we focus here
+on the user-visible portion of execution there is really only one "IREE
+execution model" and the entire stack follows the same design. At its core
+this design is just an instantiation of an
+[out-of-order execution](https://en.wikipedia.org/wiki/Out-of-order_execution)
+algorithm such as those
+[originating from the 1960's](https://en.wikipedia.org/wiki/Tomasulo_algorithm).
+
+## Glossary
+
+```mermaid
+stateDiagram
+    state UserApplication {
+      direction BT
+      state Context0 {
+        ModuleA-->ModuleAState0
+        ModuleB-->ModuleBState0
+      }
+      state Context1 {
+        ModuleA-->ModuleAState1
+        ModuleB-->ModuleBState1
+        ModuleC-->ModuleCState1
+      }
+      state ModuleA {
+        @func1
+        @func2
+      }
+      state ModuleB {
+        @func3
+        @func4
+      }
+      state ModuleC {
+        @func5
+      }
+    }
+```
+
+### Program
+
+An IREE _program_ is a collection of _modules_ instantiated in a _context_ from
+which _invocations_ can be made. Invocations are ordered on a user-controlled
+_timeline_ that uses _fences_ to define the execution order requirements to
+enable out-of-order execution. A hosting user application may have multiple
+programs, or multiple instances of the same program, loaded and running
+invocations concurrently across multiple timelines.
+
+### Module
+
+_Modules_ define executable code and data that can be loaded, linked, and run à
+la [ELF shared libraries](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format).
+Modules may be implemented in C/C++, as bytecode or C sources generated by the
+IREE compiler, or via any other mechanism that can run code and implement the [`iree_vm_module_t` interface](https://github.com/google/iree/blob/0e8d8370699912c6b51889e8f7e967690102402c/runtime/src/iree/vm/module.h#L335-L437).
+Modules on their own are read-only and can be reused across many contexts.
+
+Traditional ML runtimes would use a model (graph, etc) as their module
+representation. In IREE everything is a module including runtime subsystems like
+the HAL and user-provided custom code. This ensures that anything IREE can do
+can be externalized and replaced by users without needing to modify the core
+IREE code.
+
+### Context
+
+A collection of _modules_ is linked and instantiated in a _context_. Each
+context operates independently and carries its own copies of mutable module
+state. _Invocations_ execute within a context scope and hosting applications
+coordinate across contexts as required. Contexts are cheap to create
+(microseconds) and cheap to retain (~100B + program state) such that users can
+decide how to manage them based on their scenario.
+
+Traditional ML runtimes would call these "sessions" but in IREE everything is a
+_program_. Whether the program is stateful or stateless and how the program is
+invoked is up to the program author.
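+
+For example (a sketch in the pseudocode style used later in this document;
+`load_module` and `create_context` are illustrative names, not the exact
+runtime API):
+
+```python
+# Modules are read-only and loaded once.
+module_a = load_module("module_a.vmfb")
+module_b = load_module("module_b.vmfb")
+# Each context links the modules and carries its own copy of mutable state.
+context0 = create_context([module_a, module_b])
+context1 = create_context([module_a, module_b])
+# Invocations made against the two contexts are fully independent.
+```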
+
+### Invocation
+
+An _invocation_ represents a single call into a function exported from a
+module, using the program state stored in a context. Users can decide whether
+to perform synchronous blocking invocations or asynchronous non-blocking
+invocations per-call; the behavior of the invocation is independent of the
+target function and a user program may contain a mix of both.
+
+As an example, a user program may synchronously invoke a `@query_output_shapes`
+function to preallocate storage for an asynchronous `@execute_in_place`
+function to write into.
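+
+Sketched in the pseudocode used later in this document (the `allocate_buffer`
+helper is illustrative):
+
+```python
+# Synchronous: blocks until the shapes are available.
+shapes = invoke(@query_output_shapes, args)
+# Preallocate storage for the results using the queried shapes.
+output = allocate_buffer(shapes)
+# Asynchronous: schedules execution that writes into the storage.
+async_invoke(@execute_in_place, args, output, wait_fence, signal_fence)
+```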
+
+### Timeline
+
+A _timeline_ represents the observable order of execution. Users define their
+own timelines and communicate them to IREE via _fences_. Timelines need not
+match the order of invocations unless the user dictates that they must by way
+of fences. In the absence of fences all invocations execute in an arbitrary
+order and may execute concurrently, just as threads in C do without barriers.
+
+Each timeline can be thought of as an independent clock domain that may operate
+asynchronously at its own frequency with only fences acting to tie separate
+timelines together. This directly mirrors real hardware constraints like
+[clock domain crossing](https://en.wikipedia.org/wiki/Globally_asynchronous_locally_synchronous)
+as each execution scope (thread on core, driver calls to queues, kernel queues
+to device queues, device queues to compute unit queues, etc) is naturally
+operating at different rates and well-designed systems must tolerate that
+variability.
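+
+For example (a sketch; `create_timeline` is an illustrative name), work on two
+timelines proceeds independently until a fence from one is used to gate work
+on the other:
+
+```python
+timeline_a = create_timeline()
+timeline_b = create_timeline()
+# Work on timeline_a progresses at its own rate...
+async_invoke(@produce, timeline_a.at(t), timeline_a.at(t+1))
+# ...and only this fence ties timeline_b's work to timeline_a's progress.
+async_invoke(@consume, timeline_a.at(t+1), timeline_b.at(u))
+```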
+
+### Fence
+
+A _fence_ is a specific point of progress in one or more _timelines_ acting as
+a barrier, fork, or join point. Fences only guard execution ordering and not
+any particular resources, though users can use them to guard resources by
+defining when in time the resources are available for use.
+
+Waits on fences are wait-until operations specifying that the timeline must
+reach _at least_ a specific point. This allows for flexible reordering and
+deferral of execution as executors can pull forward scheduled work based on
+policy (run similar work together, etc).
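+
+Conceptually a fence is little more than a (timeline, value) pair with
+wait-until semantics; a minimal sketch (not the actual runtime types):
+
+```python
+class Fence:
+  """A point on a timeline; waits block until at least that point."""
+  def __init__(self, timeline, value):
+    self.timeline = timeline
+    self.value = value
+  def is_reached(self):
+    # Wait-until semantics: satisfied once the timeline has advanced to or
+    # past the value, so executors are free to run ahead of schedule.
+    return self.timeline.current_value >= self.value
+```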
+
+### Hardware Abstraction Layer (HAL)
+
+The HAL is an optional feature of IREE that provides a consistent interface
+across execution resources. It is used internally by IREE programs to define
+and submit work to devices and signal across them, but may also be used by
+users to directly interface with hardware in a compatible way. Exposing the
+HAL API allows users to efficiently manage their data and custom execution
+without expensive marshaling. Most users will only interact with HAL buffers
+as they work with their data, but more advanced integrations can directly
+insert IREE into existing device contexts to transparently share scheduling
+and resources or insert their own code into IREE to pipeline custom execution.
+
+## Execution by Timelines
+
+**NOTE**: this defines _an_ execution scheme that IREE supports but a user may
+use one or more such schemes in a single program - just as a C application may
+mix single- and multi-threaded code within itself for different components.
+
+The combination of _invocations_, _timelines_, and _fences_ allows users to
+provide future knowledge to lower layers of the system by declaring their
+availability requirements; the lower layers are then able to execute the work
+out-of-order so long as the specified requirements are met. The primary goal
+when designing for such a system is to specify as few requirements as possible
+in order to provide the maximum amount of scheduling freedom to the
+implementation.
+
+This makes timelines one of the most critical components of the interface.
+The purpose of invocations is to schedule work against one or more timelines and
+what happens within the invocations is an implementation detail of the program.
+
+### Sequential Execution
+
+Here we say _"a user invokes a function to schedule execution on a timeline"_
+vs. a more traditional _"a user invokes a function to execute work"_, and this
+manifests in the IREE ABI as invocations taking fences that define specific
+points on timelines which the user may observe:
+
+```python
+# Fences are effectively just timeline + integer tuples and are cheap to hold.
+wait_fence = my_timeline.at(t)
+signal_fence = my_timeline.at(t+1)
+# Schedule work against the timeline.
+# All work prior to t must complete before execution can occur and after
+# execution the timeline will advance to t+1.
+async_invoke(@some_fn, wait_fence, signal_fence)
+# The invocation may have returned immediately after the work was scheduled;
+# until the fence is reached no actual execution may have occurred. To
+# synchronize the user code with the timeline the user can block until the fence
+# is reached.
+signal_fence.wait()
+```
+
+To the user this would appear as:
+
+```mermaid
+sequenceDiagram
+    User->>@some_func: invoke
+    activate @some_func
+    @some_func->>User: ;
+    @some_func-->>@some_func: wait t
+    @some_func-->>User: signal t+1
+    deactivate @some_func
+```
+
+This means that from the user's perspective the _actual_ operations performed
+by the invocation are not important: the only thing the user can observe in
+this situation is when the timeline reaches `t+1`, as they specified. Whether
+the invocation internally requires many steps across timelines internal to the
+program in order to complete is an implementation detail. Actual execution may
+look like this:
+
+```mermaid
+sequenceDiagram
+    User->>@some_func: invoke
+    activate @some_func
+    @some_func->>User:  ;
+    @some_func->>@some_func: ;
+    @some_func-->>Device A: ;
+    Device A-->>Device A: wait t
+    activate Device A
+    @some_func->>@some_func: ;
+    @some_func-->>Device B: ;
+    activate Device B
+    @some_func->>@some_func: ;
+    Device A-->>@some_func: ;
+    deactivate Device A
+    @some_func->>@some_func: ;
+    @some_func-->>Device B: ;
+    activate Device B
+    deactivate @some_func
+    Device B-->>User: signal t+1
+    deactivate Device B
+    deactivate Device B
+```
+
+Even in this simple user-synchronous example the system is able to internally
+run several concurrent timelines with a minimal number of synchronization
+points and the lowest possible latency, as the user is notified immediately
+without any intermediate layers needing to be woken, scheduled, executed, and
+passed on.
+
+### Pipelined Execution
+
+The true power of timelines comes from the ability to pipeline execution. Users
+define DAGs with fences and can construct arbitrarily complex execution
+topologies whether from the same program or across multiple programs:
+
+```mermaid
+stateDiagram
+    direction LR
+    state fence0 <<fork>>
+    [*] --> fence0
+    fence0 --> @fn0
+    state fence1 <<fork>>
+    @fn0 --> fence1
+    fence1 --> @fn1
+    fence1 --> @fn2
+    state fence2 <<join>>
+    @fn1 --> fence2
+    @fn2 --> fence2
+    @fn3 --> fence2
+    fence0 --> @fn4
+    @fn4 --> fence2
+    fence2 --> [*]
+```
+
+This is a simple extension to the synchronous example using the same primitives:
+
+```python
+# Timeline is defined by the user.
+fence_a = my_timeline.at(t)
+fence_b = my_timeline.at(t+1)
+fence_c = my_timeline.at(t+2)
+# Invocations are launched using the fences and may not complete immediately.
+async_invoke(@fn0, fence_a, fence_b)
+async_invoke(@fn1, fence_b, fence_c)
+async_invoke(@fn2, fence_b, fence_c)
+async_invoke(@fn3, None, fence_c)
+async_invoke(@fn4, fence_a, fence_c)
+# Blocking here but no need to; could pass fence_c on to other invocations.
+fence_c.wait()
+```
+
+The critical point is that the user never has to wait for any particular
+invocation to complete before being able to schedule more work against the
+timeline, even if those invocations cannot themselves complete synchronously.
+The lower layers of the system are able to fully model the execution as early
+as possible without needing to communicate with (and importantly synchronize
+with) the user.
+
+### I/O
+
+Users define the semantics of their programs themselves. For example, if the
+user knows the precise shape of an output buffer they can preallocate the
+buffer and pass it in. If they don't, they can factor out the shape
+calculation and invoke that synchronously in order to compute the shape,
+allocate the appropriately sized buffer, and pass that in. Or they could
+decide to only deal with synchronous invocations and return a
+program-allocated buffer view with the appropriate shape in their callback.
+IREE does not dictate the design of user programs and as such enables mixed
+stateful/stateless, asynchronous/synchronous, and arbitrary scheduling models
+(enqueue/drain, windowing, etc).
+
+Inputs and outputs to invocations are provided by the user as primitive values
+(integers, floats, etc), supported builtin types (lists, byte buffers/strings),
+custom user types, and HAL types like buffers or buffer views (buffers + shape
+and type metadata). One or more wait fences can be used to order invocation
+access to one or more inputs by indicating that the resource is not available
+until a certain fence is reached. Similarly one or more signal fences can be
+used to order subsequent access to the resources by indicating the advancement
+of the timeline when they are available.
+
+```python
+# wait_fence_a must be reached before buffer_a and buffer_b can be read.
+# wait_fence_b must be reached before buffer_c can be read.
+# buffer_a will be ready to read when signal_fence_a has been reached.
+async_invoke(@fn,
+             (wait_fence_a, buffer_a, buffer_b),
+             42,  # no ordering required on value types
+             (wait_fence_b, buffer_c),
+             (signal_fence_a, buffer_a))
+```
+
+The above example demonstrates an in-place operation on `buffer_a`. It's also
+possible for invocations to return values:
+
+```python
+result = invoke(@sum, 1, 2)  # = 3
+```
+
+When executed asynchronously a callback, or any construct that can be built
+upon callbacks (like promises/futures), can receive the results:
+
+```python
+def my_callback(result):
+  print(result)  # 3
+async_invoke(@sum, 1, 2, my_callback)
+```
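+
+Promises/futures are a thin layer over that same callback mechanism (a sketch
+using Python's `concurrent.futures`; the `invoke_future` helper is
+illustrative):
+
+```python
+import concurrent.futures
+
+def invoke_future(fn, *args):
+  # Resolve the future from the invocation's completion callback.
+  future = concurrent.futures.Future()
+  async_invoke(fn, *args, future.set_result)
+  return future
+
+result = invoke_future(@sum, 1, 2).result()  # = 3
+```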
+
+### Stream-ordered Allocations
+
+Invocations generally have only a few KB of overhead and pipelined command
+buffers take only a small amount more. Storage buffers, however, can easily
+take hundreds of MB per invocation for I/O and transient state. This compounds
+as program usage becomes more complex or multiple programs are involved. IREE
+supports traditional host-ordered allocations (à la malloc/free) for persistent
+buffers like large constants/read-only data or user-managed ringbuffers.
+Stream-ordered allocations are also supported to allow for pooled buffer
+reservations that can be allocated in a scheduled order alongside program
+execution. For more detailed examples see the CUDA blog posts describing their
+implementation: [part 1](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/), [part 2](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/).
+
+With stream-ordered allocations each allocation and deallocation operation is
+scheduled with wait and signal fences just as with invocations. This allows
+these allocation operations to execute remotely on device without host
+program involvement. For example, scheduling `alloca0`/`dealloca0` and
+`alloca1`/`dealloca1` interleaved with the function execution allows for the
+transient memory required for executing `@fn0` to remain uncommitted until
+immediately before it is executed, committed during execution, and then
+decommitted immediately after execution. The memory required for passing
+data from `@fn0` to the subsequent `@fn1` and `@fn2` survives until both have
+completed executing and is only then decommitted. By using the same scheduling
+primitives as execution the allocation topology can be as arbitrarily complex as
+the invocation topology:
+
+```mermaid
+stateDiagram
+    direction LR
+    state fence0a <<fork>>
+    [*] --> fence0a
+    state fence0b <<fork>>
+    fence0a --> alloca0
+    fence0a --> alloca1
+    alloca0 --> fence0b
+    alloca1 --> fence0b
+    fence0b --> @fn0
+    state fence1a <<fork>>
+    @fn0 --> fence1a
+    state fence1b <<fork>>
+    fence1a --> dealloca0
+    dealloca0 --> fence1b
+    fence1b --> @fn1
+    fence1b --> @fn2
+    state fence2a <<join>>
+    @fn1 --> fence2a
+    @fn2 --> fence2a
+    state fence2b <<join>>
+    fence2a --> dealloca1
+    dealloca1 --> fence2b
+    fence2b --> [*]
+```
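+
+In the same pseudocode as the earlier examples, this topology could be
+scheduled as follows (a sketch assuming hypothetical `async_alloca` and
+`async_dealloca` operations that take wait/signal fences just as invocations
+do):
+
+```python
+fence_0a, fence_0b = my_timeline.at(t),   my_timeline.at(t+1)
+fence_1a, fence_1b = my_timeline.at(t+2), my_timeline.at(t+3)
+fence_2a, fence_2b = my_timeline.at(t+4), my_timeline.at(t+5)
+# Reserve transient storage; commitment happens on-device in fence order.
+buf0 = async_alloca(size0, fence_0a, fence_0b)
+buf1 = async_alloca(size1, fence_0a, fence_0b)
+async_invoke(@fn0, fence_0b, fence_1a)
+# @fn0's scratch memory is decommitted immediately after it executes.
+async_dealloca(buf0, fence_1a, fence_1b)
+async_invoke(@fn1, fence_1b, fence_2a)
+async_invoke(@fn2, fence_1b, fence_2a)
+# Storage passing data from @fn0 to @fn1/@fn2 survives until both complete.
+async_dealloca(buf1, fence_2a, fence_2b)
+fence_2b.wait()
+```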
+
+When operating in this way allocations from the host perspective are just
+reservations for a slice of pooled storage that will be committed at some
+point in the future. Likewise deallocations from the host perspective release
+the prior reservation and schedule the paired decommit at some point in the
+future.
+Scheduling N sequential invocations thus requires only enough committed storage
+for a single invocation in addition to the I/O (unless that too is
+stream-ordered).
+
+This scheduling behavior allows for both minimal peak memory consumption
+regardless of the number of programs or invocation pipeline depth and sharing
+of committed storage across programs: the memory consumption of a program at
+rest is near zero when stateless and the sum of all state when stateful. Target
+devices that natively support stream-ordered allocations (like CUDA) can even
+share pools across processes.
+
+Another feature, provided in combination with the forward progress guaranteed
+by fences, is that execution can continue even when memory constrained so long
+as the memory pool can service at least a single request at a time. A device
+can serialize two independent invocations each requiring 400MB of transient
+memory when the system only has 512MB available, with no user-visible impact
+besides increased latency. This does require the user to ensure they schedule
+work that is possible to run, or rely on the target system having paging in
+order to lighten the strictness of the pool quotas.
+
+Stream-ordered allocations performed by the user for invocation inputs can be
+declared as transferred to the program. This allows the program to eagerly
+deallocate or reuse the input storage while still preserving the internal
+scheduling requirements of the program.
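+
+A sketch of what that might look like (the `transfer` marker here is
+hypothetical):
+
+```python
+input_buf = async_alloca(input_size, fence_a, fence_b)
+# Ownership moves to the program, which may deallocate or reuse the storage
+# as soon as its own scheduling requirements allow.
+async_invoke(@fn, (fence_b, transfer(input_buf)), signal_fence)
+```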
+
+### Internal State
+
+A stateful program may contain internal timelines that it uses to order its own
+execution. Take for example [this simple stateful program](https://github.com/google/iree-jax/blob/main/tests/program/dynamic_state.py):
+
+```python
+class TrivialKernel(Program):
+  _x0 = Program.export_global(x_type)
+  def get(self):
+    return self._x0
+  def set(self, x=x_type):
+    self._x0 = x
+  def matmul(self, x=y_type):
+    self._x0 = self._matmul(x, self._x0)
+  @Program.kernel
+  def _matmul(x, x0):
+    return jnp.matmul(x, x0)
+```
+
+Each invocation of `matmul` needs to be executed in-order with prior invocations
+as there is a data dependency established on `self._x0`. Attempts to `get` or
+`set` must also be sequenced correctly with the `matmul` invocations. A basic
+usage like this:
+
+```python
+m = TrivialKernel()
+m.set(input)
+m.matmul(a)
+m.matmul(b)
+m.matmul(c)
+output = m.get()
+print(output)  # implicit wait
+```
+
+Would be executed as:
+
+```mermaid
+sequenceDiagram
+    activate User
+    User->>TrivialKernel: @set(input)
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    activate Device
+    TrivialKernel->>User: ;
+    User->>TrivialKernel: @matmul(a)
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    TrivialKernel->>User: ;
+    User->>TrivialKernel: @matmul(b)
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    TrivialKernel->>User: ;
+    User->>TrivialKernel: @matmul(c)
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    TrivialKernel->>User: ;
+    User->>TrivialKernel: @get()
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    TrivialKernel->>User: ;
+    Device-->>Device: ;
+    deactivate User
+    User->>User: (wait)
+    Device-->>User: (signal)
+    deactivate Device
+    activate User
+    User->>User: print(output)
+    deactivate User
+```
+
+Note that although the user provided no timeline of their own, execution is
+still ordered correctly due to the internal timeline constructed by the
+program. If the user wanted to also pipeline execution with another program
+they could do so by providing their own fences.
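+
+As a sketch of what that internal ordering amounts to (illustrative only; the
+program generates the equivalent of this itself), the usage above behaves as
+if each call chained fences on a program-internal timeline:
+
+```python
+internal = program_timeline  # owned by the TrivialKernel instance
+async_invoke(@set,    internal.at(t+0), internal.at(t+1), input)
+async_invoke(@matmul, internal.at(t+1), internal.at(t+2), a)
+async_invoke(@matmul, internal.at(t+2), internal.at(t+3), b)
+async_invoke(@matmul, internal.at(t+3), internal.at(t+4), c)
+# The result is ready once the timeline reaches t+5.
+output = async_invoke(@get, internal.at(t+4), internal.at(t+5))
+print(output)  # implicitly waits on internal.at(t+5)
+```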
diff --git a/docs/developers/design_docs/simple_ir_walkthrough.md b/docs/developers/design_docs/simple_ir_walkthrough.md
deleted file mode 100644
index f1b3bc2..0000000
--- a/docs/developers/design_docs/simple_ir_walkthrough.md
+++ /dev/null
@@ -1,511 +0,0 @@
-# Simple IR Walkthrough
-
-Note that this doc is quite outdated. We expect to update it soon.
-
-## Overview
-
-This walks through the process of lowering TensorFlow python to an IREE module,
-demonstrating the MLIR that exists at each stage. Many individual intermediate
-transforms are skipped for clarity but the major dialect milestones during
-lowering are present.
-
-**NOTE**: this represents the IR as it exists at the time of writing, which is
-planned to undergo significant changes soon. Take this more as a conceptual
-walkthrough than a reference for what the IR looks like.
-
-## TensorFlow to XLA HLO
-
-The "frontend" in this example is TensorFlow and we import that into MLIR in the
-TensorFlow dialect and lower it to the mid-level IR of XLA HLO. Many backends
-can consume the XLA HLO (such as TPU and CUDA), not just IREE, meaning that the
-work required to convert the TensorFlow ops to the much more restricted set of
-XLA HLO is shared amongst many projects.
-
-### TensorFlow Python
-
-This is using the TensorFlow 1.0 syntax producing a GraphDef. IREE is designed
-to work best with the TensorFlow 2.0 SavedModel representation.
-
-```python
-import tensorflow as tf
-with tf.Session() as session:
-  arg0 = tf.placeholder(tf.float32, shape=[4])
-  arg1 = tf.placeholder(tf.float32, shape=[4])
-  result = tf.multiply(arg0, arg1)
-  print(session.graph_def)
-```
-
-### TensorFlow GraphDef
-
-GraphDefs do not contain information about the feeds and fetches nor do they
-identify exported functions. TensorFlow 2.0 makes this significantly easier but
-since most are familiar with TF1.0 the GraphDef is displayed here.
-
-```protobuf
-node {
-  name: "Placeholder"
-  op: "Placeholder"
-  attr {
-    key: "dtype"
-    value { type: DT_FLOAT }
-  }
-  attr {
-    key: "shape"
-    value { shape { dim { size: 4 } } }
-  }
-}
-node {
-  name: "Placeholder_1"
-  op: "Placeholder"
-  attr {
-    key: "dtype"
-    value { type: DT_FLOAT }
-  }
-  attr {
-    key: "shape"
-    value { shape { dim { size: 4 } } }
-  }
-}
-node {
-  name: "Mul"
-  op: "Mul"
-  input: "Placeholder"
-  input: "Placeholder_1"
-  attr {
-    key: "T"
-    value { type: DT_FLOAT }
-  }
-}
-```
-
-### XLA HLO
-
-XLA HLO is the dialect we try to lower to as instead of 1400+ ops in TensorFlow
-we end up with ~30 ops that better represent the actual math being performed.
-The
-[XLA Operation Semantics](https://www.tensorflow.org/xla/operation_semantics)
-are well(ish) documented and a great starting point for lowering into other
-dialects. The existing
-[tf2xla](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/tf2xla)
-bridge can be used to convert the ops from GraphDef to XLA HLO, while a
-[new implementation](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/mlir/xla/transforms)
-based in MLIR is currently being written.
-
-```mlir
-func @simple_mul(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
-  %0 = mhlo.multiply(%arg0, %arg1) : tensor<4xf32>
-  return %0 : tensor<4xf32>
-}
-```
-
-What was a graph of nodes now looks much more like a traditional program:
-there's a function with a well-defined signature, an operation that performs
-some math on the operands, and the result of the math is returned.
-
-In the XLA HLO dialect it's possible to express control flow (calls, loops,
-conditionals), complex multi-operation regions like reductions, etc. All
-TensorFlow graph semantics (control edges, switch/merge, etc) are lowered to
-this form, and all data edges are converted to SSA values.
-
-## IREE Module IR
-
-Once lowered to XLA HLO the IREE transformations work to legalize and lower to a
-high-level sequencer dialect (iree_hl_seq). At this point we are still operating
-on tensors with value-semantics allowing us to use the SSA representation in
-MLIR to do some relatively complex (yet easy to express) transforms.
-
-### Dispatch Region Identification
-
-The final IREE module is designed to have as few sequencer operations as
-possible. This is achieved by clustering operations into regions such that data
-dependencies and execution order are correctly observed and that the dispatch
-workload (roughly the shape of the output) is compatible. Jumping ahead a bit,
-the dispatch regions correspond to dispatches against the target API (such as
-Vulkan vkCmdDispatch) modulo threadgroup sizes. When still operating with value
-semantics it's easy to use SSA use-def chains to ensure we are preserving the
-expected behavior of the program.
-
-```mlir
-func @simple_mul(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
-  %cst = arith.constant dense<[4, 1, 1]> : tensor<3xi32>
-  %0 = iree.dispatch_region[%cst : tensor<3xi32>](%arg2 = %arg0 : tensor<4xf32>, %arg3 = %arg1 : tensor<4xf32>) : tensor<4xf32> {
-    %1 = arith.mulf %arg2, %arg3 : tensor<4xf32>
-    iree.return %1 : tensor<4xf32>
-  }
-  return %0 : tensor<4xf32>
-}
-```
-
-In the above example, the workload is defined by `%cst` as 4x1x1. If there were
-other ops that were also of a `dot(4,1,1)` workload we could cluster those here.
-
-Other dispatch-like operations, such as reductions, are also identified and
-clustered appropriately at this stage. What we end up with is a top-level IR
-performing dispatches with nested regions containing the work to perform. When
-all identification has completed the goal is to have no math outside of the
-dispatch regions (though copies are permitted).
-
-Additional passes may run that combine, split, or otherwise transform the
-dispatch regions based on a set of cost functions or target capability metrics.
-For example, to ensure predictable maximum latency larger dispatch regions may
-be split based on how much memory bandwidth they are likely to consume.
-
-### Executable Outlining and High-level Sequencer IR
-
-The first step in lowering to the IREE sequencer IR describing the runtime
-sequence of operations to perform is to isolate the work being performed from
-how it is to be dispatched. We outline dispatch regions into `iree.executable`s
-and replace the original `iree.dispatch_region` with `iree_hl_seq.dispatch` ops
-referencing those executables. At this point we still have not specified what
-our exact lowering targets are, however we know enough to establish the basic
-ABI used to pass parameters.
-
-```mlir
-module {
-  iree.multi_arch_executable @simple_mul_ex_dispatch_0() {
-    iree.executable(Unspecified) {
-      module {
-        func @simple_mul_rgn_dispatch_0(%arg0: memref<4xf32>, %arg1: memref<4xf32>, %arg2: memref<4xf32>)
-  attributes  {iree.executable.export} {
-          %0 = iree.load_input(%arg0 : memref<4xf32>) : tensor<4xf32>
-          %1 = iree.load_input(%arg1 : memref<4xf32>) : tensor<4xf32>
-          %2 = arith.mulf %0, %1 : tensor<4xf32>
-          iree.store_output(%2 : tensor<4xf32>, %arg2 : memref<4xf32>)
-          iree.return
-        }
-      }
-    }
-  }
-  func @simple_mul(%arg0: memref<4xf32>, %arg1: memref<4xf32>) -> memref<4xf32> {
-    %0 = iree_interp.constant dense<[4, 1, 1]> : tensor<3xi32>
-    %1 = "iree_hl_seq.alloc_heap"() : () -> memref<4xf32>
-    iree_hl_seq.dispatch simple_mul_ex_dispatch_0::simple_mul_rgn_dispatch_0[%0 : memref<3xi32>](%arg0, %arg1, %1) : (memref<4xf32>, memref<4xf32>, memref<4xf32>) -> ()
-    iree_hl_seq.return %1 : memref<4xf32>
-  }
-}
-```
-
-Here we've allocated the output argument for the dispatch region via
-`iree_hl_seq.alloc_heap` and passed it as an argument into the dispatch. The
-executable entry point function gains a matching output argument where the final
-result is stored. The `iree.load_input` and `iree.store_output` pseudo-commands
-are used by backends in following lowering steps to determine how to load and
-store their arguments.
-
-### Low-level Sequencer IR
-
-Once we've established the signatures between the sequencer and the executable
-we can lower the sequencer IR to an explicitly-allocated dialect and perform
-memory allocation. Here we attempt to alias/reuse buffers, determine buffers
-that can be entirely elided, and reorder dispatches so that they can more easily
-be grouped based on required barriers. Thanks to MLIR's built-in folding logic
-we can also do some IR optimizations such as converting the generic dispatch to
-a `iree_ll_seq.static_dispatch`, as we know the workload size at compile-time.
-
-As part of this we also propagate any static information we can determine, such
-as the workload, into the executables. This is to help aid backends in lowering
-more efficiently when possible.
-
-```mlir
-module {
-  iree.multi_arch_executable @simple_mul_ex_dispatch_0[0]() {
-    iree.executable(Unspecified) {
-      module {
-        func @simple_mul_rgn_dispatch_0(%arg0: memref<4xf32>, %arg1: memref<4xf32>, %arg2: memref<4xf32>)
-        attributes  {iree.executable.export, iree.executable.workload = dense<[4, 1, 1]> : tensor<3xi32>, iree.ordinal = 0 : i32} {
-          %0 = iree.load_input(%arg0 : memref<4xf32>) : tensor<4xf32>
-          %1 = iree.load_input(%arg1 : memref<4xf32>) : tensor<4xf32>
-          %2 = arith.mulf %0, %1 : tensor<4xf32>
-          iree.store_output(%2 : tensor<4xf32>, %arg2 : memref<4xf32>)
-          iree.return
-        }
-      }
-    }
-  }
-  func @simple_mul(%arg0: memref<4xf32>, %arg1: memref<4xf32>) -> memref<4xf32>
-  attributes {iree.ordinal = 0 : i32} {
-    %0 = "iree_ll_seq.alloc_heap"() : () -> memref<4xf32>
-    iree_ll_seq.static_dispatch simple_mul_ex_dispatch_0::simple_mul_rgn_dispatch_0[dense<[4, 1, 1]> : tensor<3xi32>](%arg0, %arg1, %0) : (memref<4xf32>, memref<4xf32>, memref<4xf32>) -> ()
-    iree_ll_seq.return %0 : memref<4xf32>
-  }
-}
-```
-
-### Executable Lowering to SPIR-V
-
-For each executable and target combination we invoke an MLIR translation to some
-target dialect. Here, we are lowering to the SPIR-V dialect, and use the current
-IREE-specific XLA HLO-to-SPIR-V lowering passes. Other lowerings, as they become
-available, can be swapped in. Below is the `simple_mul_ex_dispatch_0` executable
-fully lowered to SPIR-V in the canonical MLIR SPIR-V dialect, which can be
-trivially serialized to SPIR-V words. Note how the `iree.load_input` and
-`iree.load_output` ops are lowered to storage buffer loads and stores.
-
-```mlir
-module {
-  spv.module "Logical" "GLSL450" {
-    spv.globalVariable @globalInvocationID built_in("GlobalInvocationId") : !spv.ptr<vector<3xi32>, Input>
-    spv.globalVariable @simple_mul_rgn_dispatch_0_arg_0 bind(0, 0) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-    spv.globalVariable @simple_mul_rgn_dispatch_0_arg_1 bind(0, 1) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-    spv.globalVariable @simple_mul_rgn_dispatch_0_arg_2 bind(0, 2) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-    func @simple_mul_rgn_dispatch_0() {
-      %0 = spv._address_of @globalInvocationID : !spv.ptr<vector<3xi32>, Input>
-      %1 = spv.Load "Input" %0 : vector<3xi32>
-      %2 = spv.CompositeExtract %1[0 : i32] : vector<3xi32>
-      %3 = spv.CompositeExtract %1[1 : i32] : vector<3xi32>
-      %4 = spv.CompositeExtract %1[2 : i32] : vector<3xi32>
-      %5 = spv._address_of @simple_mul_rgn_dispatch_0_arg_0 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-      %6 = spv.constant 0 : i32
-      %7 = spv.AccessChain %5[%6, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-      %8 = spv.Load "StorageBuffer" %7 : f32
-      %9 = spv._address_of @simple_mul_rgn_dispatch_0_arg_1 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-      %10 = spv.constant 0 : i32
-      %11 = spv.AccessChain %9[%10, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-      %12 = spv.Load "StorageBuffer" %11 : f32
-      %13 = spv.FMul %8, %12 : f32
-      %14 = spv._address_of @simple_mul_rgn_dispatch_0_arg_2 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-      %15 = spv.constant 0 : i32
-      %16 = spv.AccessChain %14[%15, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-      spv.Store "StorageBuffer" %16, %13 : f32
-      spv.Return
-    }
-    spv.EntryPoint "GLCompute" @simple_mul_rgn_dispatch_0, @globalInvocationID
-    spv.ExecutionMode @simple_mul_rgn_dispatch_0 "LocalSize", 1, 1, 1
-  } attributes {capabilities = ["Shader"], extensions = ["SPV_KHR_storage_buffer_storage_class"]}
-}
-```
-
-### Final Module
-
-Below is the final module containing executables for both the IREE reference
-interpreter backend and the Vulkan/SPIR-V backend, as well as the sequencer IR
-function detailing how to dispatch the workload.
-
-```mlir
-module {
-  iree.multi_arch_executable @simple_mul_ex_dispatch_0[0]() {
-    iree.executable(IreeBytecode) {
-      module {
-        func @simple_mul_rgn_dispatch_0(%arg0: memref<4xf32>, %arg1: memref<4xf32>, %arg2: memref<4xf32>)
-        attributes  {iree.executable.export, iree.executable.workload = dense<[4, 1, 1]> : tensor<3xi32>, iree.ordinal = 0 : i32} {
-          %0 = "iree_ll_interp.alloc_heap"() : () -> memref<4xf32>
-          "iree_ll_interp.mul_f"(%arg0, %arg1, %0) : (memref<4xf32>, memref<4xf32>, memref<4xf32>) -> ()
-          %1 = "iree_ll_interp.constant"() {value = dense<0> : tensor<1xi32>} : () -> memref<1xi32>
-          %2 = "iree_ll_interp.constant"() {value = dense<4> : tensor<1xi32>} : () -> memref<1xi32>
-          "iree_ll_interp.dynamic_copy"(%0, %1, %arg2, %1, %2) : (memref<4xf32>, memref<1xi32>, memref<4xf32>, memref<1xi32>, memref<1xi32>) -> ()
-          iree.return
-        }
-      }
-    }
-    iree.executable(SPIRV) {
-      spv.module "Logical" "GLSL450" {
-        spv.globalVariable @globalInvocationID built_in("GlobalInvocationId") : !spv.ptr<vector<3xi32>, Input>
-        spv.globalVariable @simple_mul_rgn_dispatch_0_arg_0 bind(0, 0) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-        spv.globalVariable @simple_mul_rgn_dispatch_0_arg_1 bind(0, 1) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-        spv.globalVariable @simple_mul_rgn_dispatch_0_arg_2 bind(0, 2) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-        func @simple_mul_rgn_dispatch_0() {
-          %0 = spv._address_of @globalInvocationID : !spv.ptr<vector<3xi32>, Input>
-          %1 = spv.Load "Input" %0 : vector<3xi32>
-          %2 = spv.CompositeExtract %1[0 : i32] : vector<3xi32>
-          %3 = spv.CompositeExtract %1[1 : i32] : vector<3xi32>
-          %4 = spv.CompositeExtract %1[2 : i32] : vector<3xi32>
-          %5 = spv._address_of @simple_mul_rgn_dispatch_0_arg_0 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-          %6 = spv.constant 0 : i32
-          %7 = spv.AccessChain %5[%6, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-          %8 = spv.Load "StorageBuffer" %7 : f32
-          %9 = spv._address_of @simple_mul_rgn_dispatch_0_arg_1 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-          %10 = spv.constant 0 : i32
-          %11 = spv.AccessChain %9[%10, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-          %12 = spv.Load "StorageBuffer" %11 : f32
-          %13 = spv.FMul %8, %12 : f32
-          %14 = spv._address_of @simple_mul_rgn_dispatch_0_arg_2 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-          %15 = spv.constant 0 : i32
-          %16 = spv.AccessChain %14[%15, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer>
-          spv.Store "StorageBuffer" %16, %13 : f32
-          spv.Return
-        }
-        spv.EntryPoint "GLCompute" @simple_mul_rgn_dispatch_0, @globalInvocationID
-        spv.ExecutionMode @simple_mul_rgn_dispatch_0 "LocalSize", 1, 1, 1
-      } attributes {capabilities = ["Shader"], extensions = ["SPV_KHR_storage_buffer_storage_class"]}
-    }
-  }
-  func @simple_mul(%arg0: memref<4xf32>, %arg1: memref<4xf32>) -> memref<4xf32>
-  attributes {iree.ordinal = 0 : i32} {
-    %0 = "iree_ll_seq.alloc_heap"() : () -> memref<4xf32>
-    iree_ll_seq.static_dispatch simple_mul_ex_dispatch_0::simple_mul_rgn_dispatch_0[dense<[4, 1, 1]> : tensor<3xi32>](%arg0, %arg1, %0) : (memref<4xf32>, memref<4xf32>, memref<4xf32>) -> ()
-    iree_ll_seq.return %0 : memref<4xf32>
-  }
-}
-```
-
-## Runtime
-
-### IREE VM
-
-The above IREE module (containing the sequencer function IR and the SPIR-V
-executable) can be serialized to a FlatBuffer. This FlatBuffer is optimized for
-minimal runtime overhead and there's zero load-time work required. This is
-useful in scenarios where either ease of debugging or dynamic deployment is
-required (such as when downloading models to run in a store-signed app on
-Android or iOS). Since the majority of the compute-intensive work is happening
-on the GPU (or CPU) via the generated SPIR-V the overhead for processing the
-sequencer IR is minimal, often an order of magnitude less than traditional ML
-runtimes.
-
-The VM targets the IREE HAL API, meaning that you get access to Vulkan, CPU, and
-other backends that are available in IREE. The HAL API is just an interface,
-though, and is easy to map to existing application abstractions that may exist.
-This means that implementing a HAL that maps to app primitives gives you access
-to the VM without needing to modify the IREE compiler or VM code.
-
-The VM is really simple and effectively does the same as demonstrated below in
-the HAL codegen example, just with a bytecode instead of C++ code. This layering
-allows us to optimize for the fast case (codegen) while still being able to
-reuse almost the entire infrastructure for the dynamic case.
-
-### IREE HAL Codegen
-
-For models that are static at application compile-time and app binaries can be
-redeployed if the model changes it's possible to generate C++ code that uses the
-IREE HAL API. This avoids the need for a VM at the cost of recompilations when
-the model changes and less debugger support. Since the HAL API is still used the
-heterogeneous device support IREE provides is still available.
-
-As with the VM the HAL API is just an interface; implementing a custom mapping
-from that interface to an existing API is easy and gives the ability to switch
-between VM or codegen approaches with no code beyond the interface
-implementation required.
-
-**NOTE**: this is not yet fully implemented/open sourced, but is coming soon.
-Here's a pseudo-codeish example of what a module would look like:
-
-```c++
-class SimpleMulModule : public iree::vm::Module {
- public:
-  // Creates the module and prepares it for execution in the given context.
-  // This may assign device handles, cache executables, etc.
-  static iree::StatusOr<std::unique_ptr<SimpleMulModule>> Create(
-      iree::vm::Context* context) {
-    // <prepare executable, allocate transient buffers, etc>
-  }
-
-  // Synchronous call to @simple_mul. Simplest form of the API and may perform
-  // internal pipelining but will appear synchronous to callers.
-  //
-  // Note that this assumes that the inputs are available and visible to the
-  // target devices. If you are exclusively using the synchronous API that will
-  // be the case.
-  //
-  // Matches IR:
-  // func @simple_mul(%arg0: memref<4xf32>,
-  //                  %arg1: memref<4xf32>) -> memref<4xf32>
-  iree::StatusOr<iree::hal::BufferView> simple_mul(
-      iree::hal::BufferView arg0,
-      iree::hal::BufferView arg1) {
-    iree::hal::Device* device = select_device(0);
-
-    // Buffers are allocated conservatively as we don't know what the caller
-    // will do with it. Buffers used internally or across async calls can be
-    // placed in device memory.
-    //
-    // Matches IR:
-    // %0 = "iree_ll_seq.alloc_heap"() : () -> memref<4xf32>
-    ASSIGN_OR_RETURN(auto result, device->allocator()->Allocate(
-        IREE_HAL_MEMORY_TYPE_HOST_LOCAL | IREE_HAL_MEMORY_TYPE_DEVICE_VISIBLE,
-        IREE_HAL_BUFFER_USAGE_DISPATCH_STORAGE |
-        IREE_HAL_BUFFER_USAGE_MAPPING));
-    auto result_view = iree::hal::BufferView(
-        std::move(result), {4}, sizeof(float));
-
-    // To show that this is just a wrapper around the real execution we just
-    // call into the async version of the function.
-    ASSIGN_OR_RETURN(auto fence, device->CreateFence(0u));
-    auto completed_fence_value = iree::hal::FenceValue{add_ref(fence), 1u};
-    RETURN_IF_ERROR(simple_mul(
-        device,
-        /*wait_semaphore=*/{},
-        arg0, arg1, result_view,
-        /*signal_semaphore=*/{},
-        completed_fence_value));
-
-    // Wait until results are ready.
-    RETURN_IF_ERROR(device->WaitAllFences(
-        {completed_fence_value}, absl::InfiniteDuration()));
-
-    // The allocated buffer escapes this function.
-    // Callers can provide already-allocated buffers with the async API.
-    //
-    // Matches IR:
-    // iree_ll_seq.return %0 : memref<4xf32>
-    return result_view;
-  }
-
-  // Asynchronous variant of the function that can (optionally) wait on existing
-  // semaphores that indicate that arguments are ready for use and
-  // (optionally) signal both semaphores and fences when the results are ready.
-  //
-  // Multiple variants of this API can be exposed such as ones returning a
-  // iree::hal::SubmissionBatch that can be submitted by the caller, however
-  // this is usually fine for most uses as any additional required submissions
-  // are handled internally as needed.
-  iree::Status simple_mul(
-      iree::hal::Device* device,
-      iree::hal::SemaphoreValue wait_semaphore,
-      iree::hal::BufferView arg0,
-      iree::hal::BufferView arg1,
-      iree::hal::BufferView out0,
-      iree::hal::SemaphoreValue signal_semaphore,
-      iree::hal::FenceValue signal_fence) {
-    // Record the command buffer with any commands we can.
-    // In more complex examples this would include barriers, events, transfers,
-    // and multiple dispatches. In many cases only one command buffer is
-    // required however more complex flow control may require multiple.
-    //
-    // Matches IR:
-    // iree_ll_seq.static_dispatch ...
-    ASSIGN_OR_RETURN(auto cmd, device->CreateCommandBuffer(
-        IREE_HAL_COMMAND_BUFFER_MODE_ONE_SHOT,
-        IREE_HAL_COMMAND_CATEGORY_DISPATCH));
-    RETURN_IF_ERROR(cmd->Begin());
-    iree::hal::DispatchRequest dispatch_request;
-    dispatch_request.executable = device_executable(device, 0);
-    dispatch_request.workload = {4, 1, 1};
-    dispatch_request.bindings = {
-      {arg0.buffer, arg0.shape, arg0.element_size},
-      {arg1.buffer, arg1.shape, arg1.element_size},
-      {out0.buffer, out0.shape, out0.element_size},
-    };
-    RETURN_IF_ERROR(cmd->Dispatch(dispatch_request));
-    RETURN_IF_ERROR(cmd->End());
-
-    // TBD: show resource tracking.
-
-    // Submit for execution using the semaphores we were told to wait on.
-    // In more complex examples where we may have to submit multiple command
-    // buffers we'll use the wait/signal semaphores as the boundary
-    // synchronization primitives.
-    auto* command_queue = device->command_queues()[0];
-    return command_queue->Submit({
-      iree::hal::SubmissionBatch{
-        {wait_semaphore},
-        {cmd},
-        {signal_semaphore},
-      },
-    }, signal_fence);
-  }
-};
-```
-
-### Custom Codegen
-
-Using the final IREE module (containing executables and sequencer IR) it's
-possible to generate code for any target. For example, instead of using the IREE
-HAL and C++ one could generate straight C directly against their target API or
-hardware (such as directly calling Vulkan or launching DSP executables). We
-refer to this form as "runtimeless," as beyond the code required to run the
-program there's no more than what one would write by hand if they were very
-carefully hand-translating the model.
-
-Because we are still changing the IR we have not yet written a backend that does
-this, however we plan to demonstrate this for targeting small embedded systems
-and DSPs in the future.