Adding parameters as a concept to stream/hal/tooling. (#15104)

Parameters are externalized storage for resources that are
asynchronously accessible and device-aware. Parameters can be read or
written on the same device timelines as the operations that consume or
produce them and with locality pinning to ensure memory doesn't need to
move. Parameters are referenced by an optional scope (a file name, a
model name, whatever) and a unique key within that scope. Using a scope
is strongly recommended to distinguish sets of parameters that may
otherwise collide when multiple model parts are compiled together.
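
As a minimal sketch of the referencing scheme (the struct and names here
are hypothetical illustrations, not the runtime's actual types):
```c
// Hypothetical sketch of a scoped parameter reference as described above;
// not the runtime's actual types.
#include <stdio.h>

typedef struct {
  const char* scope;  // optional grouping, e.g. a file or model name
  const char* key;    // unique within the scope
} parameter_ref_t;

int main(void) {
  // Two model parts compiled together may reuse the same key; the scope
  // keeps them from colliding.
  parameter_ref_t a = {"encoder", "layer0.weight"};
  parameter_ref_t b = {"decoder", "layer0.weight"};
  printf("%s::%s vs %s::%s\n", a.scope, a.key, b.scope, b.key);
  return 0;
}
```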

Parameters are provided to programs by a virtual interface and can
support shared parameters (same storage used in multiple contexts, or
outliving a single instantiation in a context), in-memory caches,
memory-mapped files (including directly using the mapped memory for
execution when devices support it), iree_hal_file_t usage for
device-supported I/O, and parameter subsetting for things like runtime
sharding. A basic file cache is implemented to allow programs to decide
when and where they want to use parameters without needing to bind them
to devices at startup.
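
A rough conceptual sketch of the provider interface shape (the function
names and signatures here are illustrative assumptions, not the actual
runtime API):
```c
// Conceptual sketch only: names/signatures are assumptions loosely modeled
// on the idea of a virtual parameter provider, not the actual runtime API.
#include <stddef.h>
#include <stdint.h>

typedef struct parameter_provider_t {
  void* self;
  // Resolve scope::key and read [offset, offset+length) into a destination.
  // Real providers operate asynchronously on device timelines and device
  // buffers rather than synchronously on host pointers.
  int (*read)(void* self, const char* scope, const char* key,
              uint64_t offset, void* dst, size_t length);
  int (*write)(void* self, const char* scope, const char* key,
               uint64_t offset, const void* src, size_t length);
} parameter_provider_t;

// A program can hold several providers and query them lazily; a file cache
// behind this interface lets files be opened/mapped only when a parameter
// is actually used instead of binding everything to devices at startup.
```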

Alongside read(+load) and write operations, gather and scatter allow
batching large numbers of reads and writes into/from single buffers. For
parameter providers that can batch I/O this means a handful (~1-4) of
calls out can service many more operations (~thousands). Modeling
gather/scatter explicitly also gives us a point where we could extract
the mapping and use it to repack files or defragment memory in the
future.
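
To make the batching concrete, here's a self-contained sketch
(hypothetical types, not the runtime API) of how many per-parameter
reads collapse into one gather call over a span list; the offsets mirror
the `a0`/`a1` entries in the dump output further below:
```c
// Sketch of gather batching: many (source range -> target offset) entries
// are accumulated and issued as a single call. Hypothetical types only.
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
  uint64_t source_offset;  // byte offset of the parameter in its file
  uint64_t target_offset;  // byte offset in the destination buffer
  uint64_t length;         // bytes to transfer
} gather_entry_t;

// One call carries the whole span list; a provider that supports batched
// I/O can service thousands of entries with a handful of operations.
static void gather(const gather_entry_t* entries, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    printf("copy %" PRIu64 " bytes: file+%" PRIu64 " -> buffer+%" PRIu64 "\n",
           entries[i].length, entries[i].source_offset,
           entries[i].target_offset);
  }
}

int main(void) {
  gather_entry_t entries[] = {
      {120, 0, 32},   // e.g. parameter `a0`
      {152, 32, 32},  // e.g. parameter `a1`
  };
  gather(entries, sizeof(entries) / sizeof(entries[0]));
  return 0;
}
```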

Parameters are currently defined by the `#stream.parameter.named`
attribute which specifies an optional parameter scope and a scope-unique
key for the parameter along with its logical type. Today these are
intended to be used as default values on global ops (mutable or
immutable) and are hackily processed as if they were constants. Future
changes will allow parameter mutation and storage, but what's present
should be enough for inference and for training parameter
initialization.

Example parameter (here a tensor, though parameters can be other types
in the future, acting as bags of bits):
```mlir
util.global private @"model.layer-1.kernel" = #stream.parameter.named<"mnist"::"model.layer-1.kernel"> : tensor<784x128xf32>
```

Parameters can optionally have a subrange specified, indicating that the
logical tensor is a block of some larger storage. When sharding, this
can be used to have an individual shard load a subset of the parameter
data:
```mlir
util.global private @"model.layer-1.kernel-shard-0" = #stream.parameter.named<"mnist"::"model.layer-1.kernel", {offset = 0}> : tensor<392x128xf32>
util.global private @"model.layer-1.kernel-shard-1" = #stream.parameter.named<"mnist"::"model.layer-1.kernel", {offset = 200704}> : tensor<392x128xf32>
```
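
The offsets are plain byte offsets into the full parameter storage. For
the example above, shard 1 starts immediately after shard 0's 392 rows
of the 784x128 f32 kernel:
```c
// Worked example of the shard byte offsets used above.
#include <stdio.h>

int main(void) {
  const unsigned rows_per_shard = 392;  // 784 rows split across 2 shards
  const unsigned cols = 128;
  const unsigned elem_size = 4;         // sizeof(f32)
  // Shard 0 starts at byte offset 0; shard 1 starts after shard 0's data.
  printf("%u\n", rows_per_shard * cols * elem_size);  // prints 200704
  return 0;
}
```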

In this initial implementation we err on the side of optimizing for
discrete-memory devices (GPUs/etc) by emitting gathers of all
parameters. On unified memory systems where we could zero-copy import
parameters into device memory this is wasteful, but it ensures proper
alignment/packing and limits runtime overhead. Setting the resource
memory model to `unified` via the `#stream.resource_config` attribute
(helper flag `--iree-stream-resource-memory-model=unified`) will instead
alias parameter memory where possible at the cost of increased runtime
overhead. Future changes will connect the resource memory model to those
of the devices under compilation and allow heterogeneous deployments to
treat parameters used exclusively on different devices in whatever way
is best for each device.
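
For example, opting a compilation into the unified model might look
something like this (only the memory-model flag is from this change;
other flags are elided):
```
$ iree-compile \
    --iree-stream-resource-memory-model=unified \
    ... \
    input.mlir -o module.vmfb
```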

Basic tooling support for read-only parameters has been added for
testing by allowing parameter files to be specified on the command line:
```
$ iree-run-module \
    --parameter_mode=mmap \
    --parameters=some_scope=some/file0.safetensors \
    --parameters=other_scope=some/file1.gguf \
    --module=...
```

Currently parameters are only usable from the full HAL implementation
and not the inline HAL. The parameter file format and index code was
kept portable so that it could be reused for a lighter-weight feature
set if we wanted to support parameters in the inline HAL, but since the
cases where the inline HAL is interesting are usually small models on
tiny systems where optimizing parameter storage is critical to
memory/performance I haven't bothered here.

Since all parameter file formats are terrible, a new parameter file
format that is less terrible for our uses will be introduced in future
changes. It's still experimental and not fully wired up, but it will be
something we can convert other formats into for optimized use as both
immutable constant and mutable variable storage in our tools, for cases
where direct compatibility with existing frameworks (without conversion
steps) is not required.

The `iree-dump-parameters` tool can be used to inspect any of the
parameter file formats the tooling can load and to extract individual
parameters. It indexes parameters using the same flags as the rest of
the tooling, so it can also be useful for seeing which parameters are
actually available without trial and error in other tools. Example
output:
```
$ ../iree-build/tools/iree-dump-parameters.exe --parameters=a=tools/test/parameters_a.safetensors --parameters=runtime/src/iree/io/formats/gguf/testdata/multiple.gguf --extract=a::a0=a0.bin --extract=tensor0=tensor0.bin
//===--------------------------------------------------------------------------------------------------------------===//
// Parameter scope `a` (2 entries, 64 total bytes)
//===------------+------------------+------------------+-----------------------------------------------------------===//
//         Start |              End |           Length | Key
//---------------+------------------+------------------+--------------------------------------------------------------//
             120 |              152 |               32 | `a0`
             152 |              184 |               32 | `a1`

//===--------------------------------------------------------------------------------------------------------------===//
// Parameter scope `` (3 entries, 72 total bytes)
//===------------+------------------+------------------+-----------------------------------------------------------===//
//         Start |              End |           Length | Key
//---------------+------------------+------------------+--------------------------------------------------------------//
             448 |              464 |               16 | `tensor0`
             512 |              520 |                8 | `tensor1`
             576 |              624 |               48 | `tensor2`

Extracting parameter `a::a0` (32b) to `a0.bin`...
Extracting parameter `tensor0` (16b) to `tensor0.bin`...
```

Progress on #14987.

---------

Co-authored-by: Stella Laurenzo <stellaraccident@gmail.com>