# IREE: An Experimental MLIR Execution Environment
**DISCLAIMER**: This is not an officially supported Google product. It's an
experimental playground for low-level/tightly integrated machine learning
libraries that make use of modern hardware acceleration APIs and techniques (see
[non goals](#non-goals)).
## Table of Contents
- [Quickstart](#quickstart)
- [Project Goals](#project-goals)
- [Non-Goals](#non-goals)
- [Milestones](#milestones)
- [Status](#status)
- [Dependencies](#dependencies)
- [License](#license)
<a name="quickstart"></a>
## Quickstart
More coming soon! Performing full model translation may require a few steps
(such as ensuring you have a working TensorFlow build); however, we'll have
pre-translated example models that allow independent testing of the runtime
portions.
* [Getting Started on Windows](docs/getting_started_on_windows.md)
* [Getting Started on Linux](docs/getting_started_on_linux.md)
* [Getting Started](docs/getting_started.md)
See also:
* [Using Colab](docs/using_colab.md)
* [Vulkan and SPIR-V](docs/vulkan_and_spirv.md)
<a name="project-goals"></a>
## Project Goals
IREE (**I**ntermediate **R**epresentation **E**xecution **E**nvironment,
pronounced as "eerie") is an experimental compiler backend for
[MLIR](https://github.com/tensorflow/mlir) that lowers ML models to an IR that
is optimized for real-time mobile/edge inference against heterogeneous hardware
accelerators.
The IR produced contains the sequencing information required to communicate
pipelined data dependencies and parallelism to low-level hardware APIs like
Vulkan, and embeds hardware/API-specific binaries such as SPIR-V or compiled
ARM code. As the IR is specified against an abstract execution environment,
there are many potential ways to run a compiled model; one such way is included
as an example and testbed for runtime optimization experiments.
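To make that more concrete, below is a minimal sketch, in plain Vulkan rather
than anything IREE emits, of how a producer/consumer dependency between two
dispatches can be expressed with a semaphore so that the ordering is resolved
entirely on the GPU. The queue, command buffers, and semaphore are assumed to
have been created elsewhere.

```c++
// Plain Vulkan illustration (not IREE output): chain two command buffer
// submissions so the consumer's compute work waits on the producer's results.
#include <vulkan/vulkan.h>

void SubmitPipelinedDispatches(VkQueue queue,
                               VkCommandBuffer producer_commands,
                               VkCommandBuffer consumer_commands,
                               VkSemaphore producer_done) {
  // The first submission signals |producer_done| when its dispatches finish.
  VkSubmitInfo producer_submit = {};
  producer_submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  producer_submit.commandBufferCount = 1;
  producer_submit.pCommandBuffers = &producer_commands;
  producer_submit.signalSemaphoreCount = 1;
  producer_submit.pSignalSemaphores = &producer_done;
  vkQueueSubmit(queue, 1, &producer_submit, VK_NULL_HANDLE);

  // The second submission waits on that semaphore before its compute stage
  // runs, expressing the producer->consumer data dependency without a CPU
  // round-trip.
  VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
  VkSubmitInfo consumer_submit = {};
  consumer_submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  consumer_submit.waitSemaphoreCount = 1;
  consumer_submit.pWaitSemaphores = &producer_done;
  consumer_submit.pWaitDstStageMask = &wait_stage;
  consumer_submit.commandBufferCount = 1;
  consumer_submit.pCommandBuffers = &consumer_commands;
  vkQueueSubmit(queue, 1, &consumer_submit, VK_NULL_HANDLE);
}
```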
The included layered runtime scales from generated code for a particular API
(such as emitting C code calling external DSP kernels), to a HAL (**H**ardware
**A**bstraction **L**ayer) that allows the same generated code to target
multiple APIs (like Vulkan and Direct3D 12), to a full VM allowing runtime model
loading for flexible deployment options and heterogeneous execution. Consider
both the compiler and the included runtime a toolbox for making it easier, via
the versatility of MLIR, to take ML models from their source to whatever degree
of integration your application requires.
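As a purely hypothetical illustration of that layering (these classes are not
IREE's actual API), a HAL boils down to generated code written once against a
small device interface, with each backend such as Vulkan, Direct3D 12, or a CPU
interpreter providing an implementation:

```c++
// Hypothetical sketch only -- illustrative names, not IREE's real interfaces.
#include <cstddef>
#include <memory>

// Opaque device-visible memory holding tensor data.
class HalBuffer {
 public:
  virtual ~HalBuffer() = default;
  virtual size_t byte_length() const = 0;
};

// The small surface generated code is written against; a Vulkan, Direct3D 12,
// or CPU interpreter backend would each implement it.
class HalDevice {
 public:
  virtual ~HalDevice() = default;
  virtual std::unique_ptr<HalBuffer> AllocateBuffer(size_t byte_length) = 0;
  // Enqueues a dispatch of a precompiled executable (e.g. a SPIR-V module),
  // identified here by a simple ordinal for brevity.
  virtual void Dispatch(int executable_ordinal, int x, int y, int z) = 0;
};

// Generated code targets HalDevice once and runs on whichever backend is
// supplied at runtime.
void RunGeneratedProgram(HalDevice& device) {
  std::unique_ptr<HalBuffer> input = device.AllocateBuffer(/*byte_length=*/1024);
  device.Dispatch(/*executable_ordinal=*/0, /*x=*/16, /*y=*/1, /*z=*/1);
}
```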
### Demonstrate MLIR
IREE has been developed alongside MLIR and is used as an example of how
non-traditional ML compiler backends and runtimes can be built: it focuses more
on the math being performed and how that math is scheduled than on graphs of
"ops", and in some cases allows doing away with a runtime entirely. It seeks to
show how more holistic approaches that exploit the MLIR framework and its
various dialects can be both easy to understand and powerful in the
optimizations they enable for code size, runtime complexity, and performance.
### Demonstrate Advanced Models
By using models with much greater complexity than the usual references (such as
MobileNet), we want to show how weird things can get when model authors are
allowed to get creative: dynamic shapes, dynamic flow control, dynamic
multi-model dispatch (including models that conditionally dispatch other
models), streaming models, tree-based search algorithms, etc. We are trying to
build IREE from the ground up to enable these models and run them efficiently on
modern hardware. Many of our example models are sequence-to-sequence language
models from the [Lingvo](https://github.com/tensorflow/lingvo) project
representing cutting-edge speech recognition and translation work.
### Demonstrate ML-as-a-Game-Engine
An observation that has driven the development of IREE is that ML workloads are
not much different from traditional game rendering workloads: math is performed
on buffers with varying levels of concurrency and ordering in a pipelined
fashion against accelerators designed to make such operations fast. In fact,
most ML is performed on the same hardware that was designed for games! Our
approach is to use the compiler to transform ML workloads into ones that look
eerily _(pun intended)_ similar to what a game performs in per-frame render
workloads, optimize for low-latency and predictable execution, and integrate
well into existing systems for both batched and interactive usage. The IREE
runtime is designed to feel more like game engine middleware than a standalone
ML inference system, though much work remains towards that goal. This should
make it easy to use existing tools for high-performance/low-power optimization
of GPU workloads, identify driver or system issues introducing latency, and
help to improve the ecosystem overall.
### Demonstrate Standards-based ML via Vulkan and SPIR-V
With the above observation that ML can look like games from the systems
perspective, it follows that APIs and technologies good for games should
probably also be good for ML. In many cases we've identified only a few key
differences, and just as extensions and API improvements have been added to
graphics/compute standards for decades, we hope to demonstrate and evaluate
small, tactical changes that can have large impacts on ML performance through
these standard APIs. We would love for hardware vendors to be able to make ML
efficient on their hardware without the need for bespoke runtimes or special
access, such that any ML workload produced by any tool runs well. We'd consider
the IREE experiment a success if what resulted was some worked examples that
help advance the entire ecosystem!
<a name="non-goals"></a>
## Non-Goals
* Replacing parts of the supported TensorFlow ecosystem of tools: The authors
within Google work closely with TensorFlow and contribute to it regularly.
However, IREE is exploring some different angles of the problem and is
experimental. We will seek to leverage anything of value that we learn in an
appropriate way to make TensorFlow better over time, but the two should not
be conflated.
* Providing an [SLA](https://en.wikipedia.org/wiki/Service-level_agreement) of
any kind: IREE is infrastructure research, not a supported product. If it
gains mind-share or traction, we would revisit that in conjunction with
finding a more permanent way to align it with the broader constellation of
ML tooling.
<a name="milestones"></a>
## Milestones
We are currently just at the starting line, with basic
[MNIST MLP](https://github.com/keras-team/keras/blob/master/examples/mnist_mlp.py)
running end-to-end on both a CPU interpreter and Vulkan. As we scale out the
compiler we will be increasing the complexity of the models that can run and
demonstrating more of the optimizations we've found useful in previous efforts
to run them efficiently on edge devices.
A short-term
[Roadmap](https://github.com/google/iree/blob/master/docs/roadmap.md) is
available that describes the major areas we are focusing on in addition to the
more infrastructure-focused work listed below.
We'll be setting up GitHub milestones with issues tracking major feature work we
are planning. For now, our areas of work are:
* Allocation tracking and aliasing in the compiler
* Pipelined scheduler in the VM for issuing proper command buffers
* New CPU interpreter that enables lightweight execution on ARM and x86
* C code generator and API to demonstrate "runtimeless" mode
* Quantization using the MLIR quantization framework
* High-level integration and examples when working with TensorFlow 2.0
* Switching from IREE's XLA-to-SPIR-V backend to the general MLIR SPIR-V
backend
Things we are interested in but don't yet have in progress:
* Ahead-of-time compiled ARM NEON backend (perhaps via
[SPIRV-LLVM](https://github.com/KhronosGroup/SPIRV-LLVM-Translator/),
[SPIRV-to-ISPC](https://github.com/GameTechDev/SPIRV-Cross), or some other
technique)
* HAL backends for Metal 2 and Direct3D 12
* Profile-guided optimization support for scheduling feedback
<a name="status"></a>
## Current Status
### Documentation
Coming soon :)
### Build System and CI
* We support Bazel for builds of all parts of the project.
* We also maintain a CMake build for a subset of runtime components designed
to be used in other systems.
### Code and Style
The project is currently _very_ early, and the codebase is a mix of code
written prior to many of the more recent ergonomic improvements in MLIR and its
TableGen. Future changes will replace the legacy code style with prettier forms
and simplify the project structure to make it easier to separate the different
components. Some entire portions of the code (such as the CPU interpreter) will
likely be dropped or rewritten. For now, assume churn!
The compiler portions of the code (almost exclusively under `iree/compiler/`)
follow the LLVM style guide and have the same system requirements as MLIR
itself. In general this requires a more modern C++ compiler.
The runtime portions vary, but most are designed to work with C++11 and use
[Abseil](https://github.com/abseil/abseil-cpp) to bring in future C++14 and
C++17 features.
### Hardware Support
We are mostly targeting Vulkan and Metal on recent mobile devices, and as such
have limited our usage of hardware features and vendor extensions to those that
are broadly available there. This is mainly just to keep our focus tight and
does not preclude usage of features outside the standard sets or for other
hardware types (in fact, we have a lot of fun ideas for
`VK_NVX_device_generated_commands` and Metal 2.1's Indirect Command Buffers!).
<a name="dependencies"></a>
## Dependencies
NOTE: during the initial open source release we are still cleaning things up.
If weird dependencies/layering make life difficult for your particular use
case, please file an issue so we can make sure to fix it.
### Compiler
The compiler has several layers that allow scaling the dependencies required
based on the source and target formats. In all cases
[MLIR](https://github.com/tensorflow/mlir) is required, and for models not
originating from TensorFlow (or already in XLA HLO format) it is the only
dependency. When targeting the IREE Runtime VM and HAL,
[FlatBuffers](https://google.github.io/flatbuffers/) is required for
serialization. Converting from TensorFlow models requires a dependency on
TensorFlow (though only the parts required for conversion).
### Runtime VM
The VM providing dynamic model deployment and advanced scheduling behavior
requires [Abseil](https://github.com/abseil/abseil-cpp) for its common types;
however, contributions are welcome to make it possible to replace Abseil with
other libraries via shims/forwarding. The core types used by the runtime
(excluding command line flags and such in tools) are limited to types coming in
future C++ versions (variant, optional, string_view, etc.), cheap types
(absl::Span), or simple standard containers (absl::InlinedVector).
[FlatBuffers](https://google.github.io/flatbuffers/) is used to load compiled
modules.
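As a rough, illustrative sketch (names here are made up for the example and are
not actual IREE declarations), runtime code can stay within exactly that set of
types:

```c++
// Illustrative only: the kinds of Abseil types the runtime limits itself to.
#include "absl/container/inlined_vector.h"
#include "absl/strings/string_view.h"
#include "absl/types/optional.h"
#include "absl/types/span.h"

// Finds an exported function by name without copying or owning any strings.
absl::optional<int> FindExportOrdinal(
    absl::Span<const absl::string_view> names, absl::string_view wanted) {
  for (size_t i = 0; i < names.size(); ++i) {
    if (names[i] == wanted) return static_cast<int>(i);
  }
  return absl::nullopt;
}

// Small fixed-capacity list that avoids heap allocation in the common case.
absl::InlinedVector<int, 4> DefaultBindingOrdinals() { return {0, 1, 2}; }
```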
### Runtime HAL
The HAL and the provided implementations (Vulkan, etc.) also use
[Abseil](https://github.com/abseil/abseil-cpp). Contributions are welcome to
allow other types to be swapped in. A C99 HAL API with no dependencies is
planned for code generation targets.
### Testing and Tooling
[SwiftShader](https://github.com/google/swiftshader) is used to provide fast,
hardware-independent testing of the Vulkan and SPIR-V portions of the
toolchain.
<a name="license"></a>
## License
IREE is licensed under the terms of the Apache license. See [LICENSE](LICENSE)
for more information.