# IREE Invocation Execution Model
This documents the behavior of the user-visible invocation mechanism IREE uses
to schedule program execution. Internally IREE uses a similar model to track its
own workloads and in turn carries that down to the target APIs and devices,
which themselves use a similar model. The intent is to expose the device model
in an abstracted way that allows the full execution intent to be captured and
communicated down to the hardware that executes it. Though this document focuses
on the user-visible portion of execution there is really only one "IREE
execution model" and the entire stack follows the same design. At its core this
design is just an instantiation of an
[out-of-order execution](https://en.wikipedia.org/wiki/Out-of-order_execution) algorithm such as those [originating in the 1960s](https://en.wikipedia.org/wiki/Tomasulo_algorithm).
## Glossary
```mermaid
stateDiagram
state UserApplication {
direction BT
state Context0 {
ModuleA-->ModuleAState0
ModuleB-->ModuleBState0
}
state Context1 {
ModuleA-->ModuleAState1
ModuleB-->ModuleBState1
ModuleC-->ModuleCState1
}
state ModuleA {
@func1
@func2
}
state ModuleB {
@func3
@func4
}
state ModuleC {
@func5
}
}
```
### Program
An IREE _program_ is a collection of _modules_ instantiated in a _context_ from
which _invocations_ can be made. Invocations are ordered on a user-controlled
_timeline_ that uses _fences_ to define the execution order requirements to
enable out-of-order execution. A hosting user application may have multiple
programs, or multiple instances of the same program, loaded at a time and may be
running invocations against them concurrently across multiple timelines.
### Module
_Modules_ define executable code and data that can be loaded, linked, and run à
la [ELF shared libraries](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format).
Modules may be implemented in C/C++, as bytecode or C sources generated by the
IREE compiler, or by any other mechanism that can run code and implement the [`iree_vm_module_t` interface](https://github.com/google/iree/blob/0e8d8370699912c6b51889e8f7e967690102402c/runtime/src/iree/vm/module.h#L335-L437).
Modules on their own are read-only and can be reused across many contexts.
Traditional ML runtimes would use a model (graph, etc) as their module
representation. In IREE everything is a module including runtime subsystems like
the HAL and user-provided custom code. This ensures that anything IREE can do
can be externalized and replaced by users without needing to modify the core
IREE code.
### Context
A collection of _modules_ is linked and instantiated in a _context_. Each
context operates independently and carries its own copies of mutable module
state. _Invocations_ execute within a context scope and hosting applications
coordinate across contexts as required. Contexts are cheap to create
(microseconds) and retain (~100B + program state) such that users can decide
how to manage them based on their scenario.
Traditional ML runtimes would call these "sessions" but in IREE everything is a
_program_. Whether the program is stateful or stateless and how the program is
invoked is up to the program author.
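As a rough sketch (this is not the IREE runtime API, just a model of the
relationship), a module behaves like an immutable library while each context
holds its own copy of the module's mutable state:
```python
# A minimal model of the module/context relationship: modules are read-only and
# shareable while contexts carry an independent copy of each module's state.
class Module:
    def __init__(self, functions, initial_state):
        self.functions = functions          # exported functions (read-only)
        self.initial_state = initial_state  # template for per-context state

class Context:
    def __init__(self, modules):
        # Each context instantiates its own mutable state for every module.
        self.state = {module: dict(module.initial_state) for module in modules}

    def invoke(self, module, name, *args):
        return module.functions[name](self.state[module], *args)

counter = Module({"inc": lambda s: s.update(n=s["n"] + 1) or s["n"]}, {"n": 0})
ctx0, ctx1 = Context([counter]), Context([counter])
ctx0.invoke(counter, "inc")  # -> 1
ctx1.invoke(counter, "inc")  # -> 1: contexts never share mutable state
```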
### Invocation
An _invocation_ represents a single call into an exported module function using
the program state stored in a context. Users can decide whether to perform
synchronous blocking invocations or asynchronous non-blocking invocations
per-call; the behavior of the invocation is independent from the target function
and a user program may contain a mix of both.
As an example, a user program may synchronously invoke a `@query_output_shapes`
function to preallocate storage for an asynchronous `@execute_in_place`
function to write into.
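A sketch of that pattern in the pseudocode conventions used later in this
document (`invoke`, `async_invoke`, and the `allocate_buffer` helper are
illustrative rather than a concrete API):
```python
wait_fence = my_timeline.at(t)
signal_fence = my_timeline.at(t+1)
# Synchronous blocking invocation: compute the output shape up front.
output_shape = invoke(@query_output_shapes, input_view)
output_storage = allocate_buffer(output_shape)  # hypothetical allocation helper
# Asynchronous non-blocking invocation: writes into the preallocated storage once
# the wait fence is reached and signals the timeline when the result is ready.
async_invoke(@execute_in_place, wait_fence, signal_fence,
             input_view, output_storage)
```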
### Timeline
A _timeline_ represents the observable order of execution. Users define their
own timelines and communicate them to IREE via _fences_. Timelines do not match
up with the order of invocations unless the user dictates they must by way of
fences. In the absence of fences all invocations may execute in an arbitrary
order and may execute concurrently, much like C threads with no barriers.
Each timeline can be thought of as an independent clock domain that may operate
asynchronously at its own frequency with only fences acting to tie separate
timelines together. This directly mirrors real hardware constraints like
[clock domain crossing](https://en.wikipedia.org/wiki/Globally_asynchronous_locally_synchronous)
as each execution scope (thread on core, driver calls to queues, kernel queues
to device queues, device queues to compute unit queues, etc) is naturally
operating at different rates and well-designed systems must tolerate that
variability.
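For illustration, in the pseudocode style used later in this document (the
function names are hypothetical), a fence is the only thing tying two otherwise
independent timelines together:
```python
# timeline_a and timeline_b advance independently at their own rates.
fence_a = timeline_a.at(t)
# @produce advances timeline_a to t when its work completes.
async_invoke(@produce, None, fence_a)
# @consume may not begin until timeline_a reaches t and advances timeline_b to u
# when it completes; without fence_a the two invocations would be unordered.
async_invoke(@consume, fence_a, timeline_b.at(u))
```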
### Fence
A _fence_ is a specific point of progress in one or more _timelines_ acting as
a barrier, fork, or join point. Fences only guard execution ordering and not any
particular resources, though users can use them to guard resources by defining
when in time the resources are available for use.
Waits on fences are wait-until operations specifying that the timeline must
reach _at least_ a specific point. This allows for flexible reordering and
deferral of execution as executors can pull forward scheduled work based on
policy (run similar work together, etc).
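A runnable sketch of the wait-until semantics (this models the concept only and
is not the IREE fence API): a timeline is a monotonically increasing value and a
fence wait completes once the value has reached _at least_ the requested point.
```python
import threading

class Timeline:
    """A monotonically increasing counter with wait-until semantics."""
    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def signal(self, value):
        # Advance the timeline; progress is monotonic and never moves backwards.
        with self._cond:
            self._value = max(self._value, value)
            self._cond.notify_all()

    def wait(self, value):
        # Wait-until: returns once the timeline has reached at least `value`.
        with self._cond:
            self._cond.wait_for(lambda: self._value >= value)

timeline = Timeline()
threading.Timer(0.1, timeline.signal, args=(2,)).start()
timeline.wait(1)  # satisfied once the timeline reaches 2, which is at least 1
```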
### Hardware Abstraction Layer (HAL)
The HAL is an optional feature of IREE that provides a consistent interface
across execution resources. It is used internally by IREE programs to define and
submit work to devices and signal across them, but it may also be used by users
to directly interface with hardware in a compatible way. Exposing the HAL API
allows users to efficiently manage their data and custom execution without
expensive marshaling. Most users will only interact with HAL buffers as they
work with their data, but more advanced integrations can directly insert IREE
into existing device contexts to transparently share scheduling and resources or
insert their own code into IREE to pipeline custom execution.
## Execution by Timelines
**NOTE**: this defines _an_ execution scheme that IREE supports, but a user may
use one or more such schemes in a single program, just as a C application may
mix single- and multi-threaded code within itself for different components.
The combination of _invocations_, _timelines_, and _fences_ allows users to
provide future knowledge to lower layers of the system by declaring their
availability requirements; the lower layers are then able to execute the work
out of order so long as the specified requirements are met. The primary goal
when designing for such a system is to specify as few requirements as possible
in order to give the implementation the maximum amount of scheduling freedom.
This makes timelines one of the most critical components of the interface.
The purpose of invocations is to schedule work against one or more timelines;
what happens within an invocation is an implementation detail of the program.
### Sequential Execution
Here we say _"a user invokes a function to schedule execution on a timeline"_
rather than the more traditional _"a user invokes a function to execute work"_.
This manifests in the IREE ABI as invocations taking fences that define specific
points on timelines which the user may observe:
```python
# Fences are effectively just timeline + integer tuples and are cheap to hold.
wait_fence = my_timeline.at(t)
signal_fence = my_timeline.at(t+1)
# Schedule work against the timeline.
# All work prior to t must complete before execution can occur and after
# execution the timeline will advance to t+1.
async_invoke(@some_fn, wait_fence, signal_fence)
# The invocation may have returned immediately after the work was scheduled;
# until the fence is reached no actual execution may have occurred. To
# synchronize the user code with the timeline the user can block until the fence
# is reached.
signal_fence.wait()
```
To the user this would appear as:
```mermaid
sequenceDiagram
User->>@some_func: invoke
activate @some_func
@some_func->>User: ;
@some_func-->>@some_func: wait t
@some_func-->>User: signal t+1
deactivate @some_func
```
This means that from the user's perspective the _actual_ operations performed by
the invocation are not important: the only thing the user can observe in this
situation is when the timeline reaches `t+1`, as they specified. Whether the
invocation internally needs many steps to complete (for example because the
program has timelines of its own) is an implementation detail. Actual execution
may look like this:
```mermaid
sequenceDiagram
User->>@some_func: invoke
activate @some_func
@some_func->>User: ;
@some_func->>@some_func: ;
@some_func-->>Device A: ;
Device A-->>Device A: wait t
activate Device A
@some_func->>@some_func: ;
@some_func-->>Device B: ;
activate Device B
@some_func->>@some_func: ;
Device A-->>@some_func: ;
deactivate Device A
@some_func->>@some_func: ;
@some_func-->>Device B: ;
activate Device B
deactivate @some_func
Device B-->>User: signal t+1
deactivate Device B
deactivate Device B
```
Even in this simple user-synchronous example the system is able to internally
run several concurrent timelines with a minimal number of synchronization points
and the lowest possible latency, as the user is immediately notified without
any intermediate layers needing to be woken, scheduled, executed, and passed on.
### Pipelined Execution
The true power of timelines comes from the ability to pipeline execution. Users
define DAGs with fences and can construct arbitrarily complex execution
topologies, whether from the same program or across multiple programs:
```mermaid
stateDiagram
direction LR
state fence0 <<fork>>
[*] --> fence0
fence0 --> @fn0
state fence1 <<fork>>
@fn0 --> fence1
fence1 --> @fn1
fence1 --> @fn2
state fence2 <<join>>
@fn1 --> fence2
@fn2 --> fence2
@fn3 --> fence2
fence0 --> @fn4
@fn4 --> fence2
fence2 --> [*]
```
This is a simple extension to the synchronous example using the same primitives:
```python
# Timeline is defined by the user.
fence_a = my_timeline.at(t)
fence_b = my_timeline.at(t+1)
fence_c = my_timeline.at(t+2)
# Invocations are launched using the fences and may not complete immediately.
async_invoke(@fn0, fence_a, fence_b)
async_invoke(@fn1, fence_b, fence_c)
async_invoke(@fn2, fence_b, fence_c)
async_invoke(@fn3, None, fence_c)
async_invoke(@fn4, fence_a, fence_c)
# Blocking here but no need to; could pass fence_c on to other invocations.
fence_c.wait()
```
The critical point is that the user never has to wait for any particular
invocation to complete before being able to schedule more work against the
timeline, even if those invocations cannot themselves complete synchronously.
The lower layers of the system are able to fully model the execution as early as
possible without needing to communicate (and, importantly, synchronize) with the
user.
### I/O
Users define the semantics of their programs themselves. For example, if the
user knows the precise shape of an output buffer they can preallocate the buffer
and pass it in. If they don't, they can factor out the shape calculation and
invoke it synchronously to compute the shape, allocate an appropriately sized
buffer, and pass that in. Or they could decide to only deal with synchronous
invocations and return a program-allocated buffer view with the appropriate
shape in their callback. IREE does not dictate the design of user programs and
as such enables mixed stateful/stateless, asynchronous/synchronous, and
arbitrary scheduling models (enqueue/drain, windowing, etc).
Inputs and outputs to invocations are provided by the user as primitive values
(integers, floats, etc), supported builtin types (lists, byte buffers/strings),
custom user types, and HAL types like buffers or buffer views (buffers + shape
and type metadata). One or more wait fences can be used to order invocation
access to one or more inputs by indicating that the resource is not available
until a certain fence is reached. Similarly one or more signal fences can be
used to order subsequent access to the resources by indicating the advancement
of the timeline when they are available.
```python
# wait_fence_a must be reached before buffer_a and buffer_b can be read.
# wait_fence_b must be reached before buffer_c can be read.
# buffer_a will be ready to read when signal_fence_a has been reached.
async_invoke(@fn,
             (wait_fence_a, buffer_a, buffer_b),
             42,  # no ordering required on value types
             (wait_fence_b, buffer_c),
             (signal_fence_a, buffer_a))
```
The above example demonstrates an in-place operation on `buffer_a`. It's also
possible for invocations to return values:
```python
result = invoke(@sum, 1, 2) # = 3
```
When executed asynchronously a callback or any construct that can be built upon
them (like promises/futures) can receive the results:
```python
def my_callback(result):
    print(result)  # 3
async_invoke(@sum, 1, 2, my_callback)
```
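For example, a future can be layered over the callback form above; a sketch
using Python's `concurrent.futures.Future` and the same illustrative
`async_invoke`:
```python
from concurrent.futures import Future

def invoke_future(fn, *args):
    future = Future()
    async_invoke(fn, *args, future.set_result)  # the callback fulfills the future
    return future

result_future = invoke_future(@sum, 1, 2)
print(result_future.result())  # 3; blocks only if the result is not yet ready
```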
### Stream-ordered Allocations
Invocations generally have only a few KB of overhead and pipelined command
buffers take only a small amount more. Storage buffers, however, can easily
take hundreds of MB per invocation for I/O and transient state. This compounds
as program usage becomes more complex or multiple programs are involved. IREE
supports traditional host-ordered allocations (à la malloc/free) for persistent
buffers like large constants/read-only data or user-managed ringbuffers.
Stream-ordered allocations are also supported to allow for pooled buffer
reservations that can be allocated in a scheduled order alongside program
execution. For more detailed examples see the CUDA blog posts describing their
implementation: [part 1](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/), [part 2](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/).
With stream-ordered allocations, each allocation and deallocation operation is
scheduled with wait and signal fences just as with invocations. This allows
these allocation operations to execute remotely on a device without host
program involvement. For example, scheduling `alloca0`/`dealloca0` and
`alloca1`/`dealloca1` interleaved with the function execution allows for the
transient memory required for executing `@fn0` to remain uncommitted until
immediately before it is executed, committed during execution, and then
decommitted immediately after execution. The memory required for passing
data from `@fn0` to the subsequent `@fn1` and `@fn2` survives until after they
have completed executing before being decommitted. By using the same scheduling
primitives as execution the allocation topology can be as arbitrarily complex as
the invocation topology:
```mermaid
stateDiagram
direction LR
state fence0a <<fork>>
[*] --> fence0a
state fence0b <<fork>>
fence0a --> alloca0
fence0a --> alloca1
alloca0 --> fence0b
alloca1 --> fence0b
fence0b --> @fn0
state fence1a <<fork>>
@fn0 --> fence1a
state fence1b <<fork>>
fence1a --> dealloca0
dealloca0 --> fence1b
fence1b --> @fn1
fence1b --> @fn2
state fence2a <<join>>
@fn1 --> fence2a
@fn2 --> fence2a
fence2a --> dealloca1
state fence2b <<join>>
dealloca1 --> fence2b
fence2b --> [*]
```
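In the pseudocode style used above, this topology could be scheduled as follows;
`async_alloca`/`async_dealloca` are illustrative stand-ins for queue-ordered
allocation requests rather than a concrete API:
```python
fence_0a, fence_0b = my_timeline.at(t), my_timeline.at(t+1)
fence_1a, fence_1b = my_timeline.at(t+2), my_timeline.at(t+3)
fence_2a, fence_2b = my_timeline.at(t+4), my_timeline.at(t+5)
# Reserve transient storage; the memory is committed in queue order, not here.
transient0 = async_alloca(size0, fence_0a, fence_0b)
transient1 = async_alloca(size1, fence_0a, fence_0b)
async_invoke(@fn0, fence_0b, fence_1a, transient0, transient1)
# transient0 is only needed while @fn0 runs and can be decommitted right after.
async_dealloca(transient0, fence_1a, fence_1b)
# transient1 carries results from @fn0 to @fn1/@fn2 and survives until both finish.
async_invoke(@fn1, fence_1b, fence_2a, transient1)
async_invoke(@fn2, fence_1b, fence_2a, transient1)
async_dealloca(transient1, fence_2a, fence_2b)
```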
When operating in this way, allocations from the host perspective are just
reservations for a slice of pooled storage that will be committed at some point
in the future. Likewise, deallocations from the host perspective release the
prior reservation and schedule the paired decommit at some point in the future.
Scheduling N sequential invocations thus requires only enough committed storage
for a single invocation in addition to the I/O (unless that too is
stream-ordered).
This scheduling behavior allows for both minimal peak memory consumption,
regardless of the number of programs or invocation pipeline depth, and sharing
of committed storage across programs: the memory consumption of a program at
rest is near zero when stateless and the sum of all state when stateful. Target
devices that natively support stream-ordered allocations (like CUDA) can even
share pools across processes.
Another benefit, in combination with the forward progress guaranteed by fences,
is that execution can continue even when memory is constrained so long as the
pool can service a single request at a time. A device can serialize two
independent invocations that each require 400MB of transient memory when the
system only has 512MB available, with no user-visible impact besides increased
latency. This does require the user to ensure the work they schedule is possible
to run, or to rely on the target system supporting paging to soften the
strictness of the pool quotas.
Stream-ordered allocations performed by the user for invocation inputs can be
declared as transferred to the program. This allows the program to eagerly
deallocate or reuse the input storage while still preserving the internal
scheduling requirements of the program.
### Internal State
A stateful program may contain internal timelines that it uses to order its own
execution. Take for example [this simple stateful program](https://github.com/google/iree-jax/blob/main/tests/program/dynamic_state.py):
```python
class TrivialKernel(Program):
  _x0 = Program.export_global(x_type)
  def get(self):
    return self._x0
  def set(self, x=x_type):
    self._x0 = x
  def matmul(self, x=y_type):
    self._x0 = self._matmul(x, self._x0)
  @Program.kernel
  def _matmul(x, x0):
    return jnp.matmul(x, x0)
```
Each invocation of `matmul` needs to be executed in order with prior invocations,
as there is a data dependency established on `self._x0`. Attempts to `get` or
`set` must also be sequenced correctly with the `matmul` invocations. A basic
usage like this:
```python
m = TrivialKernel()
m.set(input)
m.matmul(a)
m.matmul(b)
m.matmul(c)
output = m.get()
print(output) # implicit wait
```
Would be executed as:
```mermaid
sequenceDiagram
activate User
User->>TrivialKernel: @set(input)
activate TrivialKernel
TrivialKernel-->>Device: ;
deactivate TrivialKernel
activate Device
TrivialKernel->>User: ;
User->>TrivialKernel: @matmul(a)
activate TrivialKernel
TrivialKernel-->>Device: ;
deactivate TrivialKernel
TrivialKernel->>User: ;
User->>TrivialKernel: @matmul(b)
activate TrivialKernel
TrivialKernel-->>Device: ;
deactivate TrivialKernel
TrivialKernel->>User: ;
User->>TrivialKernel: @matmul(c)
activate TrivialKernel
TrivialKernel-->>Device: ;
deactivate TrivialKernel
TrivialKernel->>User: ;
User->>TrivialKernel: @get()
activate TrivialKernel
TrivialKernel-->>Device: ;
deactivate TrivialKernel
TrivialKernel->>User: ;
Device-->>Device: ;
deactivate User
User->>User: (wait)
Device-->>User: (signal)
deactivate Device
activate User
User->>User: print(output)
deactivate User
```
Note that although the user provided no timeline of their own, execution is
still ordered correctly due to the internal timeline constructed by the program.
If the user wanted to also pipeline execution with another program they could do
so by providing their own fences.
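For example, in the pseudocode style used above (the producing program, its
`produce` function, and the `input_buffer` it fills are hypothetical), the user
could order this program behind another one on a timeline of their own:
```python
# The user ties the two programs together with fences on their own timeline.
fence_a = my_timeline.at(t)
fence_b = my_timeline.at(t+1)
# Another program fills `input_buffer` and advances the timeline to t when done.
async_invoke(@other_program.produce, None, fence_a, input_buffer)
# This program's @set may not consume the buffer until t is reached; it advances
# the timeline to t+1 so further work can be ordered against it.
async_invoke(@m.set, fence_a, fence_b, input_buffer)
```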