Merge docs/developers into docs/website/. (#15396)

Fixes https://github.com/openxla/iree/issues/15116.


![image](https://github.com/openxla/iree/assets/4010439/db4e0fe5-36cb-4cae-8597-6607493f774f)

## Notes for review

| Location description | Preview URL |
| --------- | -------------- |
| source folder (GitHub markdown) | [`docs/website/docs/developers/` on my fork](https://github.com/ScottTodd/iree/tree/developer-docs-folder/docs/website/docs/developers) |
| website (mkdocs site generator) | https://scotttodd.github.io/iree/developers |

* I tried to split this change into multiple smaller PRs / reviewable
commits, but the nature of moving so many files around and getting them
integrated into the website made that tricky
* I may have missed some URL renamings; mkdocs at least warns about broken
links on the website itself

## Overview

This takes the existing
[`/docs/developers/`](https://github.com/openxla/iree/tree/main/docs/developers)
folder and merges it into
[`docs/website/`](https://github.com/openxla/iree/tree/main/docs/website)
for publishing on https://iree.dev/.

The website has historically served a loosely defined group of "users" (hoping
to take a model authored in an ML framework and deploy it to some target). This
broadens the scope of the website to also include developers (from established
maintainers to new contributors).

This change does a few things:

* Moves existing pages across folders
* Deletes some stale pages
* Updates page style to match what the "mkdocs" site generator and
"mkdocs material" framework use
* Updates links across the project to use website URLs, relative links,
or GitHub links as appropriate

## Detailed list of changes

* Added "edit this page" buttons to all pages
![image](https://github.com/openxla/iree/assets/4010439/04660ff5-3612-4062-820b-ef82686aad80)
* Merged `building_with_bazel_[linux, macos, windows]` into a single
"Building with Bazel" page that uses content tabs

![image](https://github.com/openxla/iree/assets/4010439/2beedc02-79d5-4039-a847-d86c5d471ee0)
* Renamed files so alphabetical sorting highlights the category that
each file belongs in

![image](https://github.com/openxla/iree/assets/4010439/c5a732ad-82f3-468b-b942-26d7e9d3c4ae)
* Renamed files from using underscores to using dashes (more natural for
URLs)
* Merged some "debugging integration test" pages and deleted outdated
information (pointing at old TensorFlow code that no longer exists)
* Moved "developer tips" from the top level "Guides" category into the
"General development topics" subsection under this new top level
"Developers" category
* Applied lint and style fixes to all files (e.g. max line length,
`Subsection capitalization` instead of `Subsection Capitalization`)
* Merged "contributor tips" into "contributing"
* Redirected or removed references to docs/developers/ (e.g. website 404
page pointed there as another place to look for docs)
* Deleted "codegen passes", "hal driver features", and "dynamic shapes"
design docs (all were stale)
* Removed references to old processes (quirks based on supporting
Google's downstream monorepo)

## Future work

This PR is focused primarily on moving pages over and making minor
changes where appropriate. More work is needed to refresh the content on
several pages. The "developer docs" should be seen as a wiki of sorts,
so the support expectations are lower, but outdated or missing
documentation can be worse than no documentation in some respects.

Known issues to follow up on:

* The "Contributing" page should be updated, perhaps with a separate
page for "Infrastructure" forked out
* We have many "benchmarking" and "profiling" pages. That's great, but
people shouldn't need to read all of the pages to be productive
* The design docs are _very_ outdated. I removed a few of them, but we
should figure out if the remaining ones are worth keeping around. New
pages would be nice too
* These pages could have icons and other style tweaks, e.g. the sidebar
shows icons but it looks better if all pages list them:

![image](https://github.com/openxla/iree/assets/4010439/7efb0bbd-23aa-4b91-8ee3-4787d800a3fb)
* mkdocs [material] supports showing revision dates on files. That would
be useful for showing how fresh a file is, but files can be touched by
refactorings and generated files don't have git information... need to
think through that a bit
diff --git a/docs/website/README.md b/docs/website/README.md
index 6f777b2..84cd5e8 100644
--- a/docs/website/README.md
+++ b/docs/website/README.md
@@ -1,10 +1,10 @@
 # IREE User-Facing Documentation Website
 
-This directory contains the source and assets for IREE's website, hosted on
-[GitHub Pages](https://pages.github.com/).
+This directory contains the source and assets for <https://iree.dev/>.
 
 The website is generated using [MkDocs](https://www.mkdocs.org/), with the
-[Material for MkDocs](https://squidfunk.github.io/mkdocs-material/) theme.
+[Material for MkDocs](https://squidfunk.github.io/mkdocs-material/) theme and
+is served using [GitHub Pages](https://pages.github.com/).
 
 ## How to edit this documentation
 
@@ -38,3 +38,40 @@
 ```shell
 mkdocs gh-deploy --remote-name <your remote>
 ```
+
+## Website sections and authoring tips
+
+For more details on how this is set up, see
+[IREE Website Overview - July 10, 2023](https://docs.google.com/presentation/d/116TyW_aCsPXmmjRYI2tRqpOwDaGNoV8LDC_j9hsMrDk/edit?usp=sharing)
+(though note that the website organization has changed since then).
+
+For documentation language and style, the guide at
+<https://developers.google.com/style> offers good advice.
+
+### Building from source
+
+Instructions on how to build the project from source on supported platforms.
+
+* Focus on instructions that apply to all users, independent of specific
+  package managers and development styles
+* Set developers up for success with good default options
+* Explain how to interact with the build system and its outputs
+
+### Guides
+
+Workflow-oriented guides showing users how to accomplish tasks
+
+### Reference
+
+Unopinionated descriptions of system components
+
+### Developers
+
+Less structured pages for project development topics
+
+* Pages may be "promoted" from this category to another category if they are
+  generally useful to a wide enough range of developers
+
+### Community (Blog)
+
+A place to showcase work across the community
diff --git a/docs/website/docs/building-from-source/getting-started.md b/docs/website/docs/building-from-source/getting-started.md
index 1203e43..8ed175d 100644
--- a/docs/website/docs/building-from-source/getting-started.md
+++ b/docs/website/docs/building-from-source/getting-started.md
@@ -153,7 +153,7 @@
     -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
     ```
 
-    See also our [developer documentation for ccache](https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/ccache.md).
+    See also our [developer documentation for ccache](../developers/building/cmake-with-ccache.md).
 
 ### :octicons-gear-16: Optional components
 
diff --git a/docs/website/docs/community/blog/posts/cuda-backend.md b/docs/website/docs/community/blog/posts/cuda-backend.md
index 9c3c56c..340c1d7 100644
--- a/docs/website/docs/community/blog/posts/cuda-backend.md
+++ b/docs/website/docs/community/blog/posts/cuda-backend.md
@@ -28,7 +28,8 @@
 
 ### HAL support
 
-IREE has a [HAL API](https://github.com/openxla/iree/blob/main/docs/developers/design_roadmap.md#hal-hardware-abstraction-layer-and-multi-architecture-executables)
+IREE has a
+[HAL API](https://iree.dev/developers/design-docs/design-roadmap/#hal-hardware-abstraction-layer-and-multi-architecture-executables)
 that abstract all the targets behind a common interface. The first step to
 supporting a CUDA target was to map the HAL API onto CUDA. We use the CUDA
 driver API to reduce dependencies and be closer to the hardware. The HAL API is
@@ -96,7 +97,7 @@
 ![Compilation diagram](./cuda-bring_up.png)
 
 The steps to reproduce running a simple op end to end through CUDA backend are
-described [here](https://github.com/openxla/iree/blob/main/docs/developers/design_docs/cuda_backend.md#example).
+described [here](../../../developers/design-docs/cuda-backend.md/#example).
 
 ## Performance
 
diff --git a/docs/website/docs/developers/building/bazel.md b/docs/website/docs/developers/building/bazel.md
new file mode 100644
index 0000000..790345f
--- /dev/null
+++ b/docs/website/docs/developers/building/bazel.md
@@ -0,0 +1,220 @@
+# Building with Bazel
+
+This page walks through building IREE from source using the
+[Bazel build system](https://bazel.build/).
+
+!!! warning
+
+    Bazel build support is primarily for internal project infrastructure. We
+    strongly recommend [using CMake](../../building-from-source/index.md)
+    instead.
+
+    Our Bazel configuration is also _only_ tested on Linux. Windows and macOS
+    may be unstable.
+
+## :octicons-download-16: Prerequisites
+
+=== ":fontawesome-brands-linux: Linux"
+
+    1. Install Bazel, matching IREE's
+        [`.bazelversion`](https://github.com/openxla/iree/blob/main/.bazelversion)
+        by following the
+        [official docs](https://bazel.build/install).
+
+    2. Install a compiler such as Clang (GCC is not fully supported).
+
+        ```shell
+        sudo apt install clang
+        ```
+
+        Set environment variables for Bazel:
+
+        ```shell
+        export CC=clang
+        export CXX=clang++
+        ```
+
+    3. Install Python build requirements:
+
+        ```shell
+        python -m pip install -r runtime/bindings/python/iree/runtime/build_requirements.txt
+        ```
+
+=== ":fontawesome-brands-apple: macOS"
+
+    1. Install [Homebrew](https://brew.sh/):
+
+        ```shell
+        /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
+        ```
+
+    2. Install Bazel, matching IREE's
+        [`.bazelversion`](https://github.com/openxla/iree/blob/main/.bazelversion)
+        by following the [official docs](https://bazel.build/install/os-x) or
+        via Homebrew:
+
+        ```shell
+        brew install bazel
+        ```
+
+    3. Install Python build requirements:
+
+        ```shell
+        python -m pip install -r runtime/bindings/python/iree/runtime/build_requirements.txt
+        ```
+
+=== ":fontawesome-brands-windows: Windows"
+
+    !!! tip
+
+        You can simplify installation by using a package manager like
+        [Scoop](https://scoop.sh/) or [Chocolatey](https://chocolatey.org/).
+
+    1. Install Bazel, matching IREE's
+        [`.bazelversion`](https://github.com/openxla/iree/blob/main/.bazelversion)
+        by following the [official docs](https://bazel.build/install/windows).
+
+        Also install [MSYS2](https://www.msys2.org/) by following Bazel's documentation.
+
+    2. Install Python3 ([docs here](https://www.python.org/downloads/windows/))
+        and Python build requirements:
+
+        ```shell
+        python -m pip install -r runtime/bindings/python/iree/runtime/build_requirements.txt
+        ```
+
+    3. Install the full Visual Studio or "Build Tools For Visual Studio" from the
+        [downloads page](https://visualstudio.microsoft.com/downloads/) then
+        set the `BAZEL_VS` environment variable:
+
+        ```powershell
+        > $env:BAZEL_VS = "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools"
+        ```
+
+## :octicons-rocket-16: Quickstart: clone and build
+
+### Clone
+
+Use [Git](https://git-scm.com/) to clone the IREE repository and initialize its
+submodules:
+
+```shell
+git clone https://github.com/openxla/iree.git
+cd iree
+git submodule update --init
+```
+
+Configure Bazel:
+
+```shell
+# This generates a `configured.bazelrc` file by analyzing your environment.
+# Skipping this step will make it difficult to select your platform/compiler.
+python3 configure_bazel.py
+```
+
+=== ":fontawesome-brands-linux: Linux"
+
+    (No Linux-specific tips for configuring)
+
+=== ":fontawesome-brands-apple: macOS"
+
+    (No macOS-specific tips for configuring)
+
+=== ":fontawesome-brands-windows: Windows"
+
+    !!! tip
+
+        Clone to a short path like `C:\projects\` to avoid issues with Windows
+        maximum path lengths (260 characters).
+
+    !!! tip
+
+        `configure_bazel.py` only detects that you have Windows and will output
+        the default `--config=windows` to `configured.bazelrc`, which assumes
+        the latest version of MSVC. To avoid some warnings, you may want to
+        replace it with (for example) `--config=msvc2022`.
+
+### Build
+
+Run all core tests:
+
+```shell
+bazel test -k //...
+```
+
+!!! tip
+
+    You can add flags like `--test_env=IREE_VULKAN_DISABLE=1` to your test
+    command to change how/which tests run.
+
+In general, build artifacts will be under the `bazel-bin` directory at the top
+level.
+
+## :octicons-gear-16: Recommended `user.bazelrc`
+
+You can put a `user.bazelrc` file at the root of the repository and it will be
+ignored by git.
+
+=== ":fontawesome-brands-linux: Linux"
+
+    ```shell
+    build --disk_cache=/tmp/bazel-cache
+
+    # Use --config=debug to compile IREE and LLVM without optimizations
+    # and with assertions enabled.
+    build:debug --config=asserts --compilation_mode=opt '--per_file_copt=iree|llvm@-O0' --strip=never
+
+    # Use --config=asserts to enable assertions. This has to be done globally:
+    # Code compiled with and without assertions can't be linked together (ODR violation).
+    build:asserts --compilation_mode=opt '--copt=-UNDEBUG'
+    ```
+
+=== ":fontawesome-brands-apple: macOS"
+
+    ```shell
+    build --disk_cache=/tmp/bazel-cache
+
+    # Use --config=debug to compile IREE and LLVM without optimizations
+    # and with assertions enabled.
+    build:debug --config=asserts --compilation_mode=opt '--per_file_copt=iree|llvm@-O0' --strip=never
+
+    # Use --config=asserts to enable assertions. This has to be done globally:
+    # Code compiled with and without assertions can't be linked together (ODR violation).
+    build:asserts --compilation_mode=opt '--copt=-UNDEBUG'
+    ```
+
+=== ":fontawesome-brands-windows: Windows"
+
+    ```shell
+    build --disk_cache=c:/bazelcache
+    build:debug --compilation_mode=dbg --copt=/O2 --per_file_copt=iree@/Od --strip=never
+    ```
+
+## What's next?
+
+### Take a look around
+
+Build all of IREE's 'tools' directory:
+
+```shell
+bazel build tools/...
+```
+
+Check out what was built:
+
+```shell
+ls bazel-bin/tools/
+./bazel-bin/tools/iree-compile --help
+```
+
+Translate a
+[MLIR file](https://github.com/openxla/iree/blob/main/samples/models/simple_abs.mlir)
+and execute a function in the compiled module:
+
+```shell
+# iree-run-mlir <compiler flags> [input.mlir] <runtime flags>
+$ ./bazel-bin/tools/iree-run-mlir \
+  --iree-hal-target-backends=vmvx --print-mlir \
+  ./samples/models/simple_abs.mlir \
+  --input=f32=-2
+```
diff --git a/docs/website/docs/developers/building/cmake-options-and-variables.md b/docs/website/docs/developers/building/cmake-options-and-variables.md
new file mode 100644
index 0000000..9c4b137
--- /dev/null
+++ b/docs/website/docs/developers/building/cmake-options-and-variables.md
@@ -0,0 +1,172 @@
+# CMake options and variables
+
+## Frequently-used CMake variables
+
+### `CMAKE_BUILD_TYPE`
+
+* type: STRING
+
+Sets the build type. Possible values are `Release`, `Debug`,
+`RelWithDebInfo` and `MinSizeRel`. If unset, build type is set to `Release`.
+
+### `CMAKE_<LANG>_COMPILER`
+
+* type: STRING
+
+This is the command that will be used as the `<LANG>` compiler, where `<LANG>`
+is `C` or `CXX` for IREE. Set these variables to compile IREE with `clang` and
+`clang++`, for example. Once set, these variables can not be changed.
+
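+For example (a minimal sketch; adjust the generator and paths to your setup),
+the build type and compilers can be set at configure time:
+
+```shell
+cmake -G Ninja -B ../iree-build/ \
+  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
+  -DCMAKE_C_COMPILER=clang \
+  -DCMAKE_CXX_COMPILER=clang++ \
+  .
+```
+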
+## IREE-specific CMake options and variables
+
+This gives a brief explanation of IREE-specific CMake options and variables.
+
+### `IREE_ENABLE_RUNTIME_TRACING`
+
+* type: BOOL
+
+Enables instrumented runtime tracing. Defaults to `OFF`.
+
+### `IREE_ENABLE_COMPILER_TRACING`
+
+* type: BOOL
+
+Enables instrumented compiler tracing. This requires that
+`IREE_ENABLE_RUNTIME_TRACING` also be set. Defaults to `OFF`.
+
+### `IREE_BUILD_COMPILER`
+
+* type: BOOL
+
+Builds the IREE compiler. Defaults to `ON`.
+
+### `IREE_BUILD_TESTS`
+
+* type: BOOL
+
+Builds IREE unit tests. Defaults to `ON`.
+
+### `IREE_BUILD_DOCS`
+
+* type: BOOL
+
+Builds IREE documentation files. Defaults to `OFF`.
+
+### `IREE_BUILD_SAMPLES`
+
+* type: BOOL
+
+Builds IREE sample projects. Defaults to `ON`.
+
+### `IREE_BUILD_PYTHON_BINDINGS`
+
+* type: BOOL
+
+Builds the IREE python bindings. Defaults to `OFF`.
+
+### `IREE_BUILD_BINDINGS_TFLITE`
+
+* type: BOOL
+
+Builds the IREE TFLite C API compatibility shim. Defaults to `ON`.
+
+### `IREE_BUILD_BINDINGS_TFLITE_JAVA`
+
+* type: BOOL
+
+Builds the IREE TFLite Java bindings with the C API compatibility shim.
+Defaults to `ON`.
+
+### `IREE_BUILD_EXPERIMENTAL_REMOTING`
+
+* type: BOOL
+
+Builds experimental remoting component. Defaults to `OFF`.
+
+### `IREE_HAL_DRIVER_DEFAULTS`
+
+* type: BOOL
+
+Default setting for each `IREE_HAL_DRIVER_*` option.
+
+### `IREE_HAL_DRIVER_*`
+
+* type: BOOL
+
+Individual options enabling the build for each runtime HAL driver.
+
+### `IREE_TARGET_BACKEND_DEFAULTS`
+
+* type: BOOL
+
+Default setting for each `IREE_TARGET_BACKEND_*` option.
+
+### `IREE_TARGET_BACKEND_*`
+
+* type: BOOL
+
+Individual options enabling the build for each compiler target backend.
+
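+For example, to build only a specific driver and target backend, you might turn
+off the defaults and enable individual options (a sketch; the option names here
+are illustrative and should match the drivers/backends in your checkout):
+
+```shell
+cmake -G Ninja -B ../iree-build/ \
+  -DIREE_HAL_DRIVER_DEFAULTS=OFF \
+  -DIREE_HAL_DRIVER_VULKAN=ON \
+  -DIREE_TARGET_BACKEND_DEFAULTS=OFF \
+  -DIREE_TARGET_BACKEND_VULKAN_SPIRV=ON \
+  .
+```
+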
+### `IREE_INPUT_*`
+
+* type: BOOL
+
+Individual options enabling each set of input dialects.
+
+### `IREE_OUTPUT_FORMAT_C`
+
+* type: BOOL
+
+Enables the vm-c compiler output format, using MLIR EmitC. Defaults to `ON`.
+
+### `IREE_DEV_MODE`
+
+* type: BOOL
+
+Configure settings to optimize for IREE development (as opposed to CI or
+release). Defaults to `OFF`. For example, this will downgrade some compiler
+diagnostics from errors to warnings.
+
+### `IREE_ENABLE_LLD`
+
+* type: BOOL
+
+Use lld when linking. Defaults to `OFF`. This option is equivalent to
+`-DIREE_USE_LINKER=lld`. The options `IREE_ENABLE_LLD` and `IREE_USE_LINKER`
+cannot be set at the same time.
+
+### `IREE_ENABLE_ASAN`
+
+* type: BOOL
+
+Enable [address sanitizer](https://clang.llvm.org/docs/AddressSanitizer.html) if
+the current build type is Debug and the compiler supports it.
+
+### `IREE_ENABLE_MSAN`
+
+* type: BOOL
+
+Enable [memory sanitizer](https://clang.llvm.org/docs/MemorySanitizer.html) if
+the current build type is Debug and the compiler supports it.
+
+### `IREE_ENABLE_TSAN`
+
+* type: BOOL
+
+Enable [thread sanitizer](https://clang.llvm.org/docs/ThreadSanitizer.html) if
+the current build type is Debug and the compiler supports it.
+
+### `IREE_ENABLE_UBSAN`
+
+* type: BOOL
+
+Enable [undefined behavior sanitizer](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html)
+if the current build type is Debug and the compiler supports it.
+
+## Cross-compilation
+
+When cross compiling (using a toolchain file like
+[`android.toolchain.cmake`](https://android.googlesource.com/platform/ndk/+/master/build/cmake/android.toolchain.cmake)),
+first build and install IREE's tools for your host configuration, then use the
+`IREE_HOST_BIN_DIR` CMake option to point the cross compiled build at the
+host tools.
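+
+As a rough sketch (assuming an Android NDK toolchain file; exact flags vary by
+platform and by which options you need), the two-step flow looks something
+like:
+
+```shell
+# 1. Build and install the host tools.
+cmake -G Ninja -B ../iree-build-host/ \
+  -DCMAKE_INSTALL_PREFIX=../iree-build-host/install \
+  .
+cmake --build ../iree-build-host/ --target install
+
+# 2. Configure the cross-compiled build, pointing at the host tools.
+cmake -G Ninja -B ../iree-build-android/ \
+  -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK}/build/cmake/android.toolchain.cmake" \
+  -DIREE_HOST_BIN_DIR="$(realpath ../iree-build-host/install/bin)" \
+  -DIREE_BUILD_COMPILER=OFF \
+  -DANDROID_ABI=arm64-v8a \
+  -DANDROID_PLATFORM=android-29 \
+  .
+```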
diff --git a/docs/website/docs/developers/building/cmake-with-ccache.md b/docs/website/docs/developers/building/cmake-with-ccache.md
new file mode 100644
index 0000000..759b98c
--- /dev/null
+++ b/docs/website/docs/developers/building/cmake-with-ccache.md
@@ -0,0 +1,75 @@
+# CMake with `ccache`
+
+[`ccache`](https://ccache.dev/) is a compilation cache. In principle, just
+prepending compiler invocations with `ccache` is all one needs to enable it,
+e.g.
+
+```shell
+ccache clang foo.c -c -o foo.o
+```
+
+takes care of executing `clang` with these arguments and caches the output file
+`foo.o`. The next invocation then skips executing `clang` altogether.
+
+When the cache is hit, the speedup is such that the "compilation" becomes
+essentially free. However, `ccache` only caches compilation,
+[not linking](https://stackoverflow.com/a/29828811).
+
+Here are a few scenarios where `ccache` helps:
+
+* Incremental rebuilds. While `cmake` always tries to avoid unnecessary work in
+  incremental rebuilds, it can only make simple decisions based on file
+  timestamps. `ccache` sees deeper: if the raw source code isn't already a
+  cache hit, it will try again after preprocessing and discarding comments.
+* One pain point with `cmake` is having to start over from a clean build
+  directory from time to time, which by default means paying again the full cost
+  of a cold build. Thankfully `ccache` keeps its cache outside of any `cmake`
+  build directory, so the first build in the new clean build directory may be
+  very fast.
+
+## Installing and setting up `ccache`
+
+`ccache` is available on most platforms. On Debian-based Linux distributions,
+do:
+
+```shell
+sudo apt install ccache
+```
+
+The one `ccache` setting that you probably need to configure is the maximum
+cache size. The default `5G` is too small for our purposes. To set the cache max
+size, do this once:
+
+```shell
+ccache --max-size=20G
+```
+
+**Tip:** At the moment (late 2020), most of the code we're building is
+`third_party/llvm-project`, so the fundamental limiting factor to how far we
+can cache away rebuilds is how often that dependency gets updated. Given how
+frequently it is currently updated, `20G` is enough to keep the cache size
+from being the limiting factor.
+
+## Telling CMake to use `ccache`
+
+Use the CMake
+[COMPILER_LAUNCHER functionality](https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_COMPILER_LAUNCHER.html)
+by setting `CMAKE_C_COMPILER_LAUNCHER=ccache` and
+`CMAKE_CXX_COMPILER_LAUNCHER=ccache` in your CMake configure command.
+Notes:
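+For example (a sketch; combine with whatever other options you normally pass):
+
+```shell
+cmake -G Ninja -B ../iree-build/ \
+  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
+  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
+  .
+```
+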
+
+* This approach only works with the `Ninja` and `Makefile` generators
+  (`cmake -G` flag). When using other generators, another approach is needed,
+  based on wrapping the compiler in a script that prepends `ccache`. See this
+  [article](https://crascit.com/2016/04/09/using-ccache-with-cmake/).
+
+## Ensuring that `ccache` is used and monitoring cache hits
+
+The `ccache -s` command dumps statistics, including a cache hit count and ratio.
+It's convenient to run periodically with `watch` in a separate terminal:
+
+```shell
+watch -n 0.1 ccache -s  # update the stats readout every 0.1 seconds
+```
diff --git a/docs/website/docs/developers/building/emscripten.md b/docs/website/docs/developers/building/emscripten.md
new file mode 100644
index 0000000..2323859
--- /dev/null
+++ b/docs/website/docs/developers/building/emscripten.md
@@ -0,0 +1,66 @@
+# Building with Emscripten
+
+[Emscripten](https://emscripten.org/index.html) is a complete compiler
+toolchain to WebAssembly, using LLVM, with a special focus on speed, size, and
+the Web platform. Emscripten can be used to compile parts of IREE to
+[WebAssembly](https://webassembly.org/) for execution within web browsers or
+other Wasm runtimes.
+
+## Status
+
+IREE's _runtime_ can be compiled through Emscripten in some limited
+configurations. More of the runtime will be supported over time.
+
+IREE's _compiler_ can be compiled through Emscripten with local changes. More
+work is needed for this to be generally supported.
+
+## Prerequisites
+
+Read <https://emscripten.org/docs/getting_started/downloads.html> and run
+
+```shell
+./emsdk install latest
+./emsdk activate latest
+source ./emsdk_env.sh
+```
+
+## Building IREE's runtime with Emscripten
+
+### Host configuration
+
+Build and install at least the compiler tools on your host machine, or install
+them from a binary distribution:
+
+```shell
+$ cmake -G Ninja -B ../iree-build-host/ \
+    -DCMAKE_C_COMPILER=clang \
+    -DCMAKE_CXX_COMPILER=clang++ \
+    -DCMAKE_INSTALL_PREFIX=../iree-build-host/install \
+    .
+$ cmake --build ../iree-build-host/ --target install
+```
+
+### Target configuration
+
+```shell
+$ emcmake cmake -G Ninja -B ../iree-build-emscripten/ \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DIREE_HOST_BIN_DIR=$(realpath ../iree-build-host/install/bin) \
+  -DIREE_BUILD_TESTS=OFF \
+  -DIREE_BUILD_COMPILER=OFF \
+  .
+```
+
+Build:
+
+```shell
+cmake --build ../iree-build-emscripten/ \
+  --target iree_samples_simple_embedding_simple_embedding_vmvx_sync
+```
+
+### Load into a WebAssembly environment
+
+Copy the outputs from the build process (e.g. `simple_embedding_vmvx_sync.js`
+and `simple_embedding_vmvx_sync.wasm`) into your application and follow
+instructions at either <https://webassembly.org/getting-started/developers-guide/>
+or <https://developer.mozilla.org/en-US/docs/WebAssembly/Loading_and_running>.
diff --git a/docs/website/docs/developers/debugging/android-with-lldb.md b/docs/website/docs/developers/debugging/android-with-lldb.md
new file mode 100644
index 0000000..7774b7c
--- /dev/null
+++ b/docs/website/docs/developers/debugging/android-with-lldb.md
@@ -0,0 +1,61 @@
+# Android debugging with LLDB
+
+This doc shows how to use LLDB to debug native binaries on Android. For a more
+complete explanation, see the
+[official LLDB documentation on remote debugging](https://lldb.llvm.org/use/remote.html).
+
+## Prerequisites
+
+We assume the following setup:
+
+1. [Android NDK is installed](https://developer.android.com/ndk/downloads) and
+   the `ANDROID_NDK` environment variable is set to the installation path.
+2. Your Android device connected and configured for
+   [`adb`](https://developer.android.com/studio/command-line/adb).
+3. The Android binary of interest is already compiled and the command to run it
+   (in `adb shell`) is `<your-binary> [program args...]`. This does *not* have
+   to be a proper Android app with a manifest, etc.
+
+## Running manually
+
+1. Push the toolchain files, including `lldb-server`, to your device:
+
+    ```shell
+    adb shell "mkdir -p /data/local/tmp/tools"
+    adb push "$ANDROID_NDK"/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.6/lib/linux/aarch64/* /data/local/tmp/tools
+    ```
+
+    You may need to adjust the clang toolchain version to match the one in your
+    NDK. You can find it with
+    `find "$ANDROID_NDK/toolchains/llvm/prebuilt" -name lldb-server`.
+
+2. Set up port forwarding. We are going to use port 5039 but you are free to
+   pick a different one:
+
+    ```shell
+    adb forward tcp:5039 tcp:5039
+    ```
+
+3. Start an `lldb-server` in a new interactive adb shell:
+
+    ```shell
+    adb shell
+    /data/local/tmp/tools/lldb-server platform --listen '*:5039' --server
+    ```
+
+4. Launch `lldb`, connect to the server and run the binary:
+
+    ```shell
+    lldb -o 'platform select remote-android' \
+        -o 'platform connect connect://:5039' \
+        -o 'platform shell cd /data/local/tmp'
+    target create <your-binary>
+    run [program args...]
+    ```
+
+    You can either use the system `lldb` or a prebuilt under `"$ANDROID_NDK"/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.6/lib/linux/<your-host-arch>`.
+
+    Explanation: each `-o` (short for `--one-shot`) tells lldb to execute a
+    command on startup. You can run those manually in the lldb shell, if you
+    prefer. Then, we tell lldb which working directory to use, where to find the
+    executable, and what command line arguments to use.
diff --git a/docs/website/docs/developers/debugging/compile-time-regressions.md b/docs/website/docs/developers/debugging/compile-time-regressions.md
new file mode 100644
index 0000000..a6ea76c
--- /dev/null
+++ b/docs/website/docs/developers/debugging/compile-time-regressions.md
@@ -0,0 +1,226 @@
+# Compile time regression debugging
+
+So the IREE compiler used to compile a program quickly, but it is now slower.
+What do you do?
+
+## Initial information gathering
+
+Try to answer as many of these questions as you can:
+
+> **When did compilation get slower?**
+
+  A specific git commit is ideal, but "sometime in the last week" is a good
+  starting point. You'll ultimately want to find a culprit release or git
+  commit that changed the compiler code.
+
+> **How much slower did compilation get?**
+
+  Be specific - did it jump from 1 minute to 2 minutes, or 1 minute to 1 hour?
+  Identifying the scale of the regression can help set the priority to
+  investigate it.
+
+> **What is the full compile command?**
+
+  Try to extract the input program and full list of flags passed to the
+  compiler binary so that others can reproduce what you're seeing. Try to
+  distill this as much as possible to using just native tools (no Python or
+  other framework layers).
+
+> **What environment is the compiler running in?**
+
+  Are you using a `Debug` build, or a release build? What operating system and
+  size machine is running the compiler (e.g. Linux developer machine, or a
+  smaller system)?
+
+## Culprit finding and bisecting
+
+If you only have a rough idea of when something changed and want to narrow that
+down to a specific code change, bisecting can help.
+
+### Running `git bisect`
+
+Building the compiler from source and using
+[`git bisect`](https://git-scm.com/docs/git-bisect) will let you pinpoint
+specific commits in IREE, though it typically won't let you step through changes
+in submodules (e.g. MLIR updates in `third_party/llvm-project/`).
+
+**Tip**: [Configure ccache](../building/cmake-with-ccache.md) if you'll be
+rebuilding the compiler while bisecting.
+
+A manual workflow with `git bisect` looks like this:
+
+```bash
+git bisect start --first-parent
+git bisect good [<rev>]
+git bisect bad [<rev>]
+
+# Read the prompts from the command as it runs
+# At each step, test the compiler:
+#   git submodule update
+#   cmake --build build/ --target iree-compile
+#   ./build/tools/iree-compile <args>
+#       attach Tracy, observe timing, print IR, etc. to determine if fast or slow
+#       if fast, `git bisect good`
+#       if slow, `git bisect bad`
+#   repeat
+```
+
+An automated workflow can use `git bisect run` and a script:
+
+```shell
+# run_bisect.sh
+git submodule update
+cmake --build build/ --target iree-compile
+# Other logic here
+```
+
+```bash
+git bisect start --first-parent
+git bisect good [<rev>]
+git bisect bad [<rev>]
+git bisect run run_bisect.sh
+```
+
+#### Sample: compile executable sources individually with a timeout
+
+```bash
+#!/bin/bash
+
+set -xeuo pipefail
+
+# --------------------------------------------------------------------------- #
+# Settings                                                                    #
+# --------------------------------------------------------------------------- #
+
+INPUT_FILE_PATH="/path/to/program.mlirbc"
+TMP_DIR="../iree-tmp"
+
+declare -a COMPILER_FLAGS=(
+  "--iree-input-type=stablehlo"
+  "--iree-hal-target-backends=cuda"
+  "--iree-hal-cuda-llvm-target-arch=sm_80"
+)
+
+TIMEOUT_SECONDS_FOR_COMPILING_EACH_SOURCE=10
+
+# --------------------------------------------------------------------------- #
+# Utility functions                                                           #
+# --------------------------------------------------------------------------- #
+
+# Call to have `git bisect` skip this commit (don't mark as good _or_ bad)
+# https://git-scm.com/docs/git-bisect#_bisect_run
+skip_on_error() {
+  >&2 echo "** Skipping due to error: $1 **"
+  exit 125  # Special exit code for `git bisect skip`
+}
+
+# --------------------------------------------------------------------------- #
+# Main script                                                                 #
+# --------------------------------------------------------------------------- #
+
+# Store git version hash, so we can dump artifacts to unique directories later.
+GIT_SHA="$(git rev-parse --short HEAD)"
+
+echo "** Building iree-compile at ${GIT_SHA} **"
+
+# The `git bisect` command only checks out a commit, so update submodules.
+git submodule update
+
+# Build the compiler. You'll want ccache configured to make this fast!
+cmake --build ../iree-build/ --target iree-compile || skip_on_error "CMake build failed"
+
+# Run the compiler, dumping executable sources and stopping.
+SOURCES_DIR="${TMP_DIR}/sources-${GIT_SHA}"
+echo "** Running iree-compile at ${GIT_SHA}, dumping sources to ${SOURCES_DIR} **"
+../iree-build/tools/iree-compile \
+    ${INPUT_FILE_PATH} \
+    ${COMPILER_FLAGS[@]} \
+    --iree-hal-dump-executable-sources-to=${SOURCES_DIR} \
+    --compile-to=executable-sources \
+    -o /dev/null
+
+# Run the compiler again on each executable individually.
+echo "** Running iree-compile at ${GIT_SHA} for each executable source **"
+SOURCES=($(ls -1 ${SOURCES_DIR}))
+for SOURCE in "${SOURCES[@]}"; do
+  echo "  * Compiling: ${SOURCE} *"
+  timeout --verbose ${TIMEOUT_SECONDS_FOR_COMPILING_EACH_SOURCE} \
+   ../iree-build/tools/iree-compile ${SOURCES_DIR}/${SOURCE} \
+    ${COMPILER_FLAGS[@]} \
+    --compile-mode=hal-executable \
+    -o /dev/null
+done
+```
+
+## Profiling and tracing
+
+If you want to understand _why_ the compiler is fast or slow, or if you want to
+compare performance in detail between two versions, consider these profiling
+options.
+
+### MLIR pass timing
+
+The `-mlir-timing` flag enables
+[Pass Timing](https://mlir.llvm.org/docs/PassManagement/#pass-timing)
+instrumentation. Once the compiler finishes running, this prints a report like
+
+```shell
+===-------------------------------------------------------------------------===
+                      ... Pass execution timing report ...
+===-------------------------------------------------------------------------===
+  Total Execution Time: 0.0203 seconds
+
+   ---Wall Time---  --- Name ---
+   0.0047 ( 55.9%)  Canonicalizer
+   0.0019 ( 22.2%)  VerifierPass
+   0.0016 ( 18.5%)  LLVMLoweringPass
+   0.0003 (  3.4%)  CSE
+   0.0002 (  1.9%)  (A) DominanceInfo
+   0.0084 (100.0%)  Total
+```
+
+This is easy data to collect, especially remotely over SSH, but it might not
+paint a complete picture and requires waiting for compilation to finish.
+
+### Using Tracy
+
+<!-- TODO(scotttodd): update link -->
+
+See our documentation on
+[profiling with Tracy](../performance/profiling-with-tracy.md). For compile
+time regressions, pay particular attention to the different compilation phases
+(Flow/Stream/HAL), how many times `TranslateExecutablesPass` runs, and if there
+are outlier passes that take significantly longer to run than others.
+
+Here are some previous analyses for inspiration:
+
+* <https://github.com/openxla/iree/issues/12033>
+* <https://github.com/openxla/iree/issues/12035>
+* <https://github.com/openxla/iree/issues/12183>
+* <https://github.com/openxla/iree/issues/13189>
+
+Example slow trace:
+
+![slow trace](https://user-images.githubusercontent.com/4010439/233436147-2fa0fbb3-80cd-474c-bfff-3441c2d8f8fc.png)
+
+Example fast trace:
+
+![fast trace](https://user-images.githubusercontent.com/4010439/233455673-7469066b-2b0d-4462-b6a5-3af4a502e591.png)
+
+Example sampling statistics showing 10s of minutes in LLVM codegen:
+
+![slow LLVM codegen](https://user-images.githubusercontent.com/4010439/233441298-3c4f5afa-d1cc-43b3-8900-58652f295fe2.png)
+
+## Stepping through compiler IR
+
+Debugging an MLIR-based compiler like IREE usually involves reading IR at some
+point. For compile time regressions, it helps to snapshot the IR at a few key
+phases and look for differences between fast compilation and slow compilation.
+
+Here is one useful flag combination:
+
+```shell
+--mlir-disable-threading \
+--mlir-elide-elementsattrs-if-larger=8 \
+--mlir-print-ir-after=iree-hal-materialize-interfaces
+```
diff --git a/docs/website/docs/developers/debugging/integration-tests.md b/docs/website/docs/developers/debugging/integration-tests.md
new file mode 100644
index 0000000..5761c48
--- /dev/null
+++ b/docs/website/docs/developers/debugging/integration-tests.md
@@ -0,0 +1,121 @@
+# Integration test debugging
+
+This document includes tips for triaging integration test correctness issues.
+Feel free to reach out to @hanhanW or ask questions on Discord for more help.
+
+## General tips
+
+### Narrow down reproducers
+
+* Models themselves can be large, and IREE breaks models into dispatches/kernels
+and then launches those individually. Program outputs could diverge starting
+from any individual launch. To get a smaller reproducer, you can use
+[--iree-flow-trace-dispatch-tensors](../general/developer-overview.md#-iree-flow-trace-dispatch-tensors)
+(see the sketch after this list).
+* You can compare the logs between builds/backends to get an idea about which
+dispatch results in wrong outputs. The dumped inputs can be reused in a
+flagfile.
+
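+A minimal sketch of using that flag (assuming it is passed at compile time;
+file names here are placeholders):
+
+```shell
+# Compile with dispatch tracing enabled.
+iree-compile \
+  --iree-hal-target-backends=llvm-cpu \
+  --iree-flow-trace-dispatch-tensors \
+  model.mlir -o model_traced.vmfb
+# Then run the module as usual with iree-run-module; the inputs and outputs
+# of each dispatch are printed as the program executes.
+```
+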
+Once a suspicious dispatch is identified, we can create a test case based on
+the dispatch function. The dispatch function can be derived after the
+`OutlineDispatchRegions` pass. The function signature has to be modified
+manually: turn the `flow.dispatch.tensor.load` results into function arguments
+and replace the `flow.dispatch.tensor.store` with a `return` op.
+
+Note: This only works when the dispatch formation logic is identical between
+runs.
+
+## iree-samples repository tests
+
+Follow the [README](https://github.com/iree-org/iree-samples#readme) to run the
+model. The MLIR files will be generated, and you can find the paths to the
+saved files in the logs. For example:
+
+``` shell
+[ RUN      ] MobilenetV2Int8Test.test_compile_tflite
+I0401 17:27:04.084272 140182373025024 test_util.py:119] Setting up for IREE
+I0401 17:27:04.085064 140182373025024 binaries.py:218] Invoke IREE Pipeline:
+  /tmp/iree-samples/iree-samples.venv/lib/python3.9/site-packages/iree/tools/tflite/iree-import-tflite
+    /tmp/iree-samples/tflitehub/tmp/mobilenet_v2_int8_test.py/model.tflite
+    --mlir-print-debuginfo
+    --save-temp-tfl-input=/tmp/iree-samples/tflitehub/tmp/mobilenet_v2_int8_test.py/tflite.mlir
+    --save-temp-iree-input=/tmp/iree-samples/tflitehub/tmp/mobilenet_v2_int8_test.py/tosa.mlir
+```
+
+Unfortunately, the artifacts are not dumped in the runs. There is an
+[issue](https://github.com/openxla/iree/issues/8756) for tracking this. A
+workaround can be found in the issue.
+
+## TensorFlow integration tests
+
+These are steps to reproduce/address failures in TF/TFLite integration tests.
+These instructions are most stable on Linux, though they may work with a few
+tweaks on Windows and macOS.
+
+All steps here assume starting from the IREE root directory.
+
+1. First create a Python virtual environment to install packages into:
+
+    ```bash
+    python -m venv iree-tf.venv
+    source iree-tf.venv/bin/activate
+
+    # Install test requirements
+    python -m pip install -r ./integrations/tensorflow/test/requirements.txt
+    ```
+
+2. Install IREE's tools and Python bindings or build them from source
+
+    Install distributed packages
+
+    ```bash
+    # Install packages from nightly releases
+    # This should work for most cases, as the importers change infrequently
+    python -m pip install \
+      iree-compiler iree-runtime iree-tools-tf iree-tools-tflite \
+      --find-links https://iree.dev/pip-release-links.html
+    ```
+
+    _OR_ build from source
+
+    ```bash
+    # Build Python bindings from source
+    cmake -G Ninja -B ../iree-build/ -DIREE_BUILD_PYTHON_BINDINGS=ON .
+    cmake --build ../iree-build/
+
+    # Add IREE built-from-source Python packages to PYTHONPATH
+    source .env
+
+    # Install IREE TF/TFLite Python packages
+    python -m pip install integrations/tensorflow/python_projects/iree_tf
+    python -m pip install integrations/tensorflow/python_projects/iree_tflite
+    ```
+
+3. Run the python test command line
+
+    The command can be obtained from the run file. For example, if
+    `iree_tfl_tests/llvmcpu_posenet_i8.run` failed,
+
+    ```bash
+    cd integrations/tensorflow/test/
+    cat iree_tfl_tests/llvmcpu_posenet_i8.run
+
+    # REQUIRES: llvmcpu
+    # RUN: %PYTHON -m iree_tfl_tests.posenet_i8_test --target_backend=llvmcpu --artifacts_dir=%t
+
+    cd python/
+    python -m iree_tfl_tests.posenet_i8_test --target_backend=llvmcpu --artifacts_dir=/tmp/posenet_i8_failure
+    ```
+
+    Note that the command can only be run under the
+    `integrations/tensorflow/test/python` directory.
+
+4. Extract intermediate files and use with native tools
+
+    The test will create an `iree_input.mlir` file in the temp directory
+    specified. That file can then be fed into `iree-compile` (built locally to
+    reproduce the error):
+
+    ```bash
+    iree-compile \
+      --iree-hal-target-backends=llvm-cpu \
+      --iree-input-type=stablehlo \
+      iree_input.mlir
+    ```
diff --git a/docs/website/docs/developers/debugging/releases.md b/docs/website/docs/developers/debugging/releases.md
new file mode 100644
index 0000000..82554d8
--- /dev/null
+++ b/docs/website/docs/developers/debugging/releases.md
@@ -0,0 +1,140 @@
+# Release debugging playbook
+
+## Tools and locations
+
+* `.github/workflows/build_package.yml`: Release packaging jobs
+* `build_tools/github_actions/build_dist.py`: Main script to build various
+  release packages (for all platforms). We usually use this when reproducing
+  issues to approximate exactly what the CI does. It assumes the source is
+  checked out in a subdirectory named `c` and writes builds to `iree-build` and
+  `iree-install` as peers of it. To use locally, just symlink your source dir
+  as `c` in an empty directory (versus checking out), as in the sketch after
+  this list.
+
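+For example, a local repro might look like this (a sketch; paths are
+illustrative):
+
+```shell
+mkdir -p /tmp/iree-release-repro && cd /tmp/iree-release-repro
+ln -s /path/to/your/iree/checkout c
+python ./c/build_tools/github_actions/build_dist.py main-dist
+```
+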
+## Mapping releases back to git commits
+
+The source IREE commit SHA is embedded into pip releases in a few places.
+Starting in a Python venv, you can find the IREE commit from both the shell:
+
+```shell
+"$(find . -name 'iree-compile' -executable)" --version
+IREE (https://iree.dev):
+  IREE compiler version 20231016.553 @ f1cb2692a086738d7f16274b9b3af6d2c15ef133
+  LLVM version 18.0.0git
+  Optimized build
+```
+
+and the Python API:
+
+```shell
+python -c "import iree.compiler.version as v; print(v.REVISIONS['IREE'])"
+f1cb2692a086738d7f16274b9b3af6d2c15ef133
+```
+
+## Manylinux releases
+
+The Linux releases are done in a manylinux2014 docker container. At the time of
+this writing, it has gcc 9.3.1 and Python versions 3.5 - 3.9 under `/opt/python`.
+Note that this docker image approximates a 2014 era RHEL distro, patched with
+backported (newer) dev packages. It builds with gcc and the BFD linker unless
+you arrange otherwise. `yum` can be used to get some packages.
+
+Get a docker shell (see exact docker image in build_package.yml workflow):
+
+```shell
+docker run --rm -it -v $(pwd):/work/c stellaraccident/manylinux2014_x86_64-bazel-4.2.2:latest /bin/bash
+```
+
+Remember that docker runs as root unless you take steps otherwise. Avoid
+writing files in the `/work/c` directory to avoid scattering root-owned files
+on your workstation.
+
+The default system Python is 2.x, so you must select one of the more modern
+ones:
+
+```shell
+export PATH=/opt/python/cp39-cp39/bin:$PATH
+```
+
+Build core installation:
+
+```shell
+# (from within docker)
+cd /work
+python ./c/build_tools/github_actions/build_dist.py main-dist
+
+# Also supports:
+#   main-dist
+#   py-runtime-pkg
+#   py-xla-compiler-tools-pkg
+#   py-tflite-compiler-tools-pkg
+#   py-tf-compiler-tools-pkg
+```
+
+You can `git bisect` on the host and keep running the above in the docker
+container. Note that every time you run `build_dist.py`, it deletes the cmake
+cache but otherwise leaves the build directory (so it pays the configure cost
+but is otherwise incremental). You can just `cd iree-build` and run `ninja`
+for faster iteration (after the first build or if changing cmake flags).
+
+For example, extended debugging in the manylinux container:
+
+```shell
+cd /work/iree-build
+# If doing extended debugging in the container, these may make you happier.
+yum install ccache devtoolset-9-libasan-devel gdb
+
+# Get an LLVM symbolizer.
+yum install llvm9.0
+ln -s /usr/bin/llvm-symbolizer-9.0 /usr/bin/llvm-symbolizer
+
+# You can manipulate cmake flags. These may get you a better debug experience.
+cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DIREE_ENABLE_ASAN=ON -DCMAKE_EXE_LINKER_FLAGS=-fuse-ld=gold -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache .
+
+ninja
+
+# Or you may need this if buggy LLVM tools (like mlir-tblgen) are leaking :(
+ASAN_OPTIONS="detect_leaks=0" ninja
+```
+
+Other tips:
+
+* If debugging the runtime, you may have a better time just building the
+  Release mode `main-dist` package above once, which will drop binaries in the
+  `iree-install` directory. Then build the `py-runtime-pkg` or equivalent and
+  iterate further in the build directory. Ditto for TF/XLA/etc.
+
+## Testing releases on your fork
+
+To avoid interrupting the regular releases published on the IREE github, you
+can test any changes to the release process on your own fork.  Some setup is
+required before these github actions will work on your fork and development
+branch.
+
+You can run
+[`schedule_candidate_release.yml`](https://github.com/openxla/iree/blob/main/.github/workflows/schedule_candidate_release.yml)
+with a workflow dispatch from the actions tab. If you want to test using a
+commit other than the latest green on your `main` branch, modify the section
+that
+[identifies the latest green commit](https://github.com/openxla/iree/blob/c7b29123f8bd80c1346d2a9e6c5227b372b75616/.github/workflows/schedule_candidate_release.yml#L25)
+to search from another commit or just hardcode one.
+
+To speed up
+[`build_package.yml`](https://github.com/openxla/iree/blob/main/.github/workflows/build_package.yml),
+you may want to comment out some of the builds
+[here](https://github.com/openxla/iree/blob/392449e986493bf710e3da637ebf807715da9ffe/.github/workflows/build_package.yml#L34-L87).
+The
+[`py-pure-pkgs`](https://github.com/openxla/iree/blob/392449e986493bf710e3da637ebf807715da9ffe/.github/workflows/build_package.yml#L52)
+build takes only ~2 minutes and the
+[`py-runtime-pkg`](https://github.com/openxla/iree/blob/392449e986493bf710e3da637ebf807715da9ffe/.github/workflows/build_package.yml#L39)
+build takes ~5, while the others can take several hours.
+
+From your development branch, you can manually run the
+[Schedule Snapshot Release](https://github.com/openxla/iree/actions/workflows/schedule_snapshot_release.yml)
+action, which invokes the
+[Build Release Packages](https://github.com/openxla/iree/actions/workflows/build_package.yml)
+action, which finally invokes the
+[Validate and Publish Release](https://github.com/openxla/iree/actions/workflows/validate_and_publish_release.yml)
+action.  If you already have a draft release and know the release id, package
+version, and run ID from a previous Build Release Packages run, you can
+also manually run just the Validate and Publish Release action.
diff --git a/docs/website/docs/developers/debugging/sanitizers.md b/docs/website/docs/developers/debugging/sanitizers.md
new file mode 100644
index 0000000..1c9f342
--- /dev/null
+++ b/docs/website/docs/developers/debugging/sanitizers.md
@@ -0,0 +1,136 @@
+# Sanitizers (ASan/MSan/TSan)
+
+[AddressSanitizer](https://clang.llvm.org/docs/AddressSanitizer.html),
+[MemorySanitizer](https://clang.llvm.org/docs/MemorySanitizer.html) and
+[ThreadSanitizer](https://clang.llvm.org/docs/ThreadSanitizer.html) are tools
+provided by `clang` to detect certain classes of errors in C/C++ programs. They
+consist of compiler instrumentation (so your program's executable code is
+modified) and runtime libraries (so e.g. the `malloc` function may get
+replaced).
+
+They are abbreviated as "ASan", "MSan" and "TSan" respectively.
+
+They all incur large overhead, so only enable them while debugging.
+
+Tool   | Detects | Helps debug what? | Slowdown | Memory overhead | Android support
+------ | ------- | ----------------- | -------- | --------------- | ---------------
+ASan   | Out-of-bounds accesses, use-after-free, use-after-return, memory leaks | Crashes, non-deterministic results, memory leaks | 2x | 3x | Yes
+MSan   | Uninitialized memory reads | Non-deterministic results | 3x | ? | Yes
+TSan   | Data races | Many bugs in multi-thread code | 5x-15x | 5x-10x | [No](https://github.com/android/ndk/issues/1171)
+
+!!! note
+
+    See
+    [this documentation](https://clang.llvm.org/docs/AddressSanitizer.html#memory-leak-detection)
+    on leak detection. It is only enabled by default on some platforms.
+
+## Support status and how to enable each sanitizer
+
+### ASan (AddressSanitizer)
+
+Enabling ASan in the IREE build is a simple matter of setting the
+`IREE_ENABLE_ASAN` CMake option:
+
+```shell
+cmake -DIREE_ENABLE_ASAN=ON ...
+```
+
+### TSan (ThreadSanitizer)
+
+To enable TSan, at the moment, the following 3 CMake options must be set:
+
+```shell
+cmake \
+  -DIREE_ENABLE_TSAN=ON \
+  -DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON \
+  -DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON \
+  -DIREE_BUILD_SAMPLES=OFF \
+  ...
+```
+
+In practice, `IREE_ENABLE_TSAN` alone would be enough for many targets, but
+not all. The problem is that an IREE runtime built with `IREE_ENABLE_TSAN`
+cannot load an IREE-compiled LLVM/CPU module unless the following flags were
+passed to the IREE compiler: `--iree-llvmcpu-sanitize=thread` and
+`--iree-llvmcpu-link-embedded=false`.
+
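+For example, when compiling a module by hand for use with a TSan-enabled
+runtime, the invocation might look like this (a sketch; the input file, output
+name, and other flags depend on your module):
+
+```shell
+iree-compile \
+  --iree-hal-target-backends=llvm-cpu \
+  --iree-llvmcpu-sanitize=thread \
+  --iree-llvmcpu-link-embedded=false \
+  input.mlir -o module_tsan.vmfb
+```
+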
+The CMake options `IREE_BYTECODE_MODULE_ENABLE_TSAN` and
+`IREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER` ensure that the above flags are
+passed to the IREE compiler when building modules used in tests, benchmarks,
+etc. (anything that internally uses the CMake `iree_bytecode_module` macro).
+
+The CMake option `IREE_BUILD_SAMPLES=OFF` is needed because samples [currently
+assume](https://github.com/openxla/iree/pull/8893) that the embedded linker is
+used, so they are incompatible with
+`IREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON`.
+
+At the moment, CMake logic heavy-handedly enforces that whenever
+`IREE_ENABLE_TSAN` is set, these other two CMake variables are also set.
+That ensures that all tests succeed: no test is expected to fail with TSan.
+
+If you know what you're doing (i.e. if you are not building targets that
+internally involve a LLVM/CPU `iree_bytecode_module`), feel free to locally
+comment out the CMake error and only set `IREE_ENABLE_TSAN`. Also see a
+[past attempt](https://github.com/openxla/iree/pull/8966) to relax that CMake
+validation.
+
+### MSan (MemorySanitizer)
+
+In theory that should be a simple matter of
+
+```shell
+-DIREE_ENABLE_MSAN=ON
+```
+
+However, that requires making and using a custom
+build of libc++ with MSan as explained in
+[this documentation](https://github.com/google/sanitizers/wiki/MemorySanitizerLibcxxHowTo).
+
+As of April 2022, all of IREE's tests succeeded with MSan on Linux/x86-64,
+provided that the `vulkan` driver was disabled (due to lack of MSan
+instrumentation in the NVIDIA Vulkan driver).
+
+### UBSan (UndefinedBehaviorSanitizer)
+
+Enabling UBSan in the IREE build is a simple matter of setting the
+`IREE_ENABLE_UBSAN` CMake option:
+
+```shell
+cmake -DIREE_ENABLE_UBSAN=ON ...
+```
+
+Note that both ASan and UBSan can be enabled in the same build.
+
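+For example (a sketch combining the two options shown above):
+
+```shell
+cmake -DIREE_ENABLE_ASAN=ON -DIREE_ENABLE_UBSAN=ON ...
+```
+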
+## Symbolizing the reports
+
+### Desktop platforms
+
+On desktop platforms, getting nicely symbolized reports is covered in [this
+documentation](https://clang.llvm.org/docs/AddressSanitizer.html#symbolizing-the-reports).
+The gist of it is to make sure that `llvm-symbolizer` is in your `PATH`, or
+make the `ASAN_SYMBOLIZER_PATH` environment variable point to it.
+
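+For example (a sketch, assuming `llvm-symbolizer` is installed somewhere
+`which` can find it):
+
+```shell
+export ASAN_SYMBOLIZER_PATH="$(which llvm-symbolizer)"
+```
+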
+### Android
+
+On Android it's more complicated due to
+[this](https://github.com/android/ndk/issues/753) Android NDK issue.
+Fortunately, we have a script to perform the symbolization. Copy the raw output
+from the sanitizer and feed it into the `stdin` of the
+`build_tools/scripts/android_symbolize.sh` script, with the `ANDROID_NDK`
+environment variable pointing to the NDK root directory, like this:
+
+```shell
+ANDROID_NDK=~/android-ndk-r21d ./build_tools/scripts/android_symbolize.sh < /tmp/asan.txt
+```
+
+Where `/tmp/asan.txt` is where you've pasted the raw sanitizer report.
+
+!!! tip
+
+    This script will happily just echo any line that isn't a stack frame.
+    That means you can feed it the whole `ASan` report at once, and it will
+    output a symbolized version of it. DO NOT run it on a single stack at a
+    time! That is unlike the symbolizer tool that's being added in NDK r22, and
+    one of the reasons why we prefer to keep our own script. For more details
+    see
+    [this comment](https://github.com/android/ndk/issues/753#issuecomment-719719789).
diff --git a/docs/website/docs/developers/design-docs/cuda-backend.md b/docs/website/docs/developers/design-docs/cuda-backend.md
new file mode 100644
index 0000000..74b2f69
--- /dev/null
+++ b/docs/website/docs/developers/design-docs/cuda-backend.md
@@ -0,0 +1,154 @@
+# CUDA backend
+
+!!! note - "Authored March, 2021"
+
+This document is intended to provide an overview of the design choices made to
+support CUDA within IREE. It describes both the HAL runtime and the NVVM
+codegen side.
+
+## CUDA HAL Driver
+
+The CUDA HAL driver is in the [`iree/hal/drivers/cuda/`][iree-cuda] directory.
+It is written in C following the standards of the rest of the HAL module.
+
+### CUDA library dependency
+
+IREE calls directly into the [`CUDA driver API`][cuda-driver]. The CUDA library
+is loaded dynamically and the `cuda.h` header from the CUDA SDK is part of the
+IREE third_party project, so IREE doesn't require the CUDA SDK to be installed
+when building IREE tools.
+
+At runtime the HAL CUDA driver loads the libcuda.so/nvcuda.dll library and a
+subset of the CUDA driver API used in the HAL. The list of functions used is
+in the file [`iree/hal/drivers/cuda/dynamic_symbols_tables.h`][cuda-symbols].
+
+### Driver
+
+There is no direct equivalent in CUDA to the HAL driver abstraction. We use it
+to hold the symbols loaded for all the devices.
+
+### Device
+
+The equivalent of a HAL device in CUDA is the `CUcontext`; it holds all the
+state related to memory allocations.
+
+### Command buffer
+
+We implement command buffers using the [`CUDA Graph API`][cuda-graph]. Using
+the Graph API makes it easy to encode fine-grained dependencies between
+dispatches without having to create multiple streams.
+
+Note that the Graph API is meant for command buffers that can be recorded once
+and used several times, so there may be a performance penalty when using the
+Graph API for one-shot (direct) command buffers. It is likely that we will
+also have a pure stream implementation in the future if we see performance
+problems with direct command buffer usage.
+
+### Event and Barrier
+
+In the HAL, Events and Barriers are used for GPU<->GPU synchronization, either
+within a command buffer (Event and Barrier) or between command buffers.
+
+The current implementation ignores events and barriers and serializes all the
+nodes of the graph in order to have a conservative but correct solution.
+
+The design we plan for the future is to map dependencies within a command
+buffer to graph dependencies in the CUDA Graph API. When an event is signaled
+all the leaf nodes of the graph will be saved in HAL data structure and when
+the same command buffer waits on the signal we will add all the nodes as
+dependency to the future nodes added to the graph.
+
+For simplicity we always serialize command buffers sent to the same command
+queue.
+
+### Allocator
+
+The allocator will forward allocation requests to `cuMemHostAlloc` for host
+accessible memory and `cuMemAlloc` for device only memory.
+
+### Buffer
+
+CUDA buffers are represented either as a host pointer or a device pointer of
+type `CUdeviceptr`.
+
+### Executable
+
+A HAL executable maps naturally to a PTX module. The compiler will generate a
+flatbuffer containing a PTX text module as well as a list of entry point
+function names and the workgroup size associated with those entry points.
+
+### Semaphore
+
+Timeline semaphores are used in IREE to handle coarse-grained synchronization
+for CPU<->GPU, GPU<->GPU and CPU<->CPU. The interface closely follows the
+[`Vulkan timeline semaphore spec`][vulkan-semaphore].
+
+There is currently no simple way to implement this on CUDA. There are several
+solutions discussed on this [`IREE issue`][semaphore-issue] but no obvious
+solution. For now we force CPU and GPU to be synchronized after every submit to
+ensure correctness and ignore the semaphore.
+
+## NVVM Codegen
+
+### NVVM and PTX
+
+NVVM is a CUDA-specific IR composed of LLVM IR and NVVM-specific intrinsics.
+It can be compiled to PTX text using the LLVM PTX backend. NVVM has an
+associated dialect in MLIR that translates 1:1 to NVVM intrinsics. This is
+what we are using to generate the PTX kernel code.
+
+### IREE flow
+
+IREE's target-independent codegen converts the compiler input to Linalg on
+Tensors. Afterward, IREE calls the LinalgToLLVMGPU codegen passes.
+
+Once we get into the LinalgToLLVMGPU passes, we first bufferize to generate
+Linalg on Buffers. Then we apply generic MLIR passes to convert Linalg to the
+SCF dialect and then SCF to the Standard dialect. After that we convert the
+Standard dialect to the LLVM+NVVM dialects.
+
+## Example
+
+Save the following MLIR in `/tmp/add.mlir`:
+
+```mlir
+func.func @add(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
+  %0 = tensor.empty() : tensor<4xf32>
+  %1 = linalg.generic {
+    indexing_maps = [
+      affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]}
+      ins(%arg0, %arg1 : tensor<4xf32>, tensor<4xf32>)
+      outs(%0 : tensor<4xf32>) {
+  ^bb0(%in: f32, %in_0: f32, %out: f32):
+    %2 = arith.addf %in, %in_0 : f32
+    linalg.yield %2 : f32
+  } -> tensor<4xf32>
+  return %1 : tensor<4xf32>
+}
+```
+
+```shell
+# First compile into a VM bytecode module.
+$ ../iree-build/tools/iree-compile \
+  --iree-hal-target-backends=cuda \
+  /tmp/add.mlir \
+  -o /tmp/add.vmfb
+
+# Run the module through CUDA HAL backend.
+$ ../iree-build/tools/iree-run-module \
+  --device=cuda \
+  --module=/tmp/add.vmfb \
+  --function=add \
+  --input="4xf32=[1 2 3 4]" \
+  --input="4xf32=[2 2 2 2]"
+
+EXEC @add
+4xf32=3 4 5 6
+```
+
+[iree-cuda]: https://github.com/openxla/iree/tree/main/iree/hal/drivers/cuda/
+[cuda-symbols]: https://github.com/openxla/iree/blob/main/iree/hal/drivers/cuda/dynamic_symbols_tables.h
+[cuda-driver]: https://docs.nvidia.com/cuda/cuda-driver-api/index.html
+[cuda-graph]: https://developer.nvidia.com/blog/cuda-graphs/
+[vulkan-semaphore]: https://www.khronos.org/blog/vulkan-timeline-semaphores
+[semaphore-issue]: https://github.com/openxla/iree/issues/4727
diff --git a/docs/website/docs/developers/design-docs/design-roadmap.md b/docs/website/docs/developers/design-docs/design-roadmap.md
new file mode 100644
index 0000000..7b37354
--- /dev/null
+++ b/docs/website/docs/developers/design-docs/design-roadmap.md
@@ -0,0 +1,717 @@
+# Design roadmap
+
+A not-so-concise walkthrough of various IREE features that are in the design
+process and planned for future versions. A lot of the questions around how the
+IREE IR is designed and why certain components exist (such as the VM) hopefully
+become much clearer when seeing where we want to go with the infrastructure we
+are building (as opposed to where we currently are with our MVP slice). This
+document is not meant to encompass the entire design of any individual
+feature; if there's interest, please say hi on the
+[iree-discuss](https://groups.google.com/forum/#!forum/iree-discuss) mailing
+list.
+
+[TOC]
+
+## Input Dialects
+
+### Quantization
+
+It's assumed that any work related to quantization/compression has happened
+prior to lowering into IREE dialects. Our plan is to use the proposed
+[Quantization Transforms](https://llvm.discourse.group/t/rfc-a-proposal-for-implementing-quantization-transformations-in-mlir/655)
+to achieve both training and inference-time quantization of types in a way that
+preserves maximum accuracy. IREE will support running with original unquantized
+floats in all cases, allowing for a smooth on-ramp to quantization and the gains
+in performance and reduction in model size that come from it.
+
+As future work IREE would like to move beyond these transformation-directed
+approaches to quantization and interface directly to frontends which have a
+defined enough type system to represent accurate quantized (and otherwise
+compressed) computations directly, not relying exclusively on compiler-side type
+inference transforms.
+
+## `flow`: Data- and Execution-Flow Modeling
+
+The `flow` dialect is designed to allow us to extract as much concurrency as
+possible from a program and partition IR into the scheduling and execution
+domains. Today we have the IR structure and transformation flow in place but
+have not yet gotten to the most interesting things such an infrastructure
+enables.
+A majority of the largest performance, latency, and memory usage improvements
+IREE can offer are determined first here and all following lowerings benefit.
+_The fastest code is the code you don't execute and the smallest allocation is
+the allocation you don't make_ ;)
+
+### Avoiding Readbacks with `flow.stream`
+
+A majority of the readbacks we have today (manifested as `flow.tensor.load.*`
+ops) will be removed when we have an
+[HLO tensor->primitive conversion](#xla-hlo-tensor-to-primitive-conversion).
+There will still be cases when readbacks are required for correctness but they
+usually fall into a small set of usage patterns. For those that don't, this is
+one place where IREE will warn about performance issues, allowing programs to
+run suboptimally while encouraging authors to adjust their input model to
+enable better behavior.
+readback latency in an efficient way via
+[coroutines](#coroutines-for-batching-and-cooperative-scheduling).
+
+The most common case we are currently seeing in the IR is that of dynamic copies
+where the offsets are dependent on the result of previous computations. Source
+models may have top-k + gather operations, for example. These appear as a
+`flow.stream`, a `flow.tensor.load`, and then another `flow.stream` that uses
+the loaded value for a `flow.tensor.update` (or other operation):
+
+```mlir
+%index_tensor = flow.ex.stream.fragment(...) -> tensor<i32> { ... }
+%index = flow.tensor.load %index_tensor : tensor<i32>
+%result = flow.ex.stream.fragment(%arg0 = %index : i32, ...) -> ... {
+  %0 = flow.dispatch ...
+  %1 = flow.tensor.update %0, %arg2[%index] : tensor<10xf32> -> tensor<1x10xf32>
+  ...
+}
+```
+
+Today the `flow.tensor.update` turns into HAL command buffer transfer operations
+that must have their offsets known at recording time. This is a limitation of
+`vkCmdCopyBuffer` but not a fundamental limitation of any hardware. In fact
+several drivers implement copies as small built-in shader programs meaning that
+we could perform the same expansion here with the right primitives. This would
+allow, in the above example, both the index to be computed and the tensor to be
+updated within the same stream to entirely remove the host round-trip.
+
+### Threading `flow.stream` through the CFG
+
+The current `flow.ex.stream.fragment`, as denoted by the `ex`perimental tag, is
+a temporary implementation designed to get the concept of streams lowered to the
+HAL dialect. For streams to be effective at modeling larger concurrency scopes
+they need to be able to move across branches in the CFG. This intuitively
+follows exactly what one would do if recording commands in C:
+
+```c++
+vkCmdCopyBuffer(cmd, ...);
+if (some_flag) {
+  vkCmdBindPipeline(cmd, ..., pipeline_a);
+} else {
+  vkCmdBindPipeline(cmd, ..., pipeline_b);
+}
+vkCmdDispatch(cmd, ...);
+```
+
+The corresponding `flow` IR:
+
+```mlir
+  flow.stream.append[%s0](...) {
+    flow.tensor.update ...
+  }
+  %b = arith.cmpi ne %some_flag, ...
+  cond_br %b, ^a(%s0), ^b(%s0)
+^a(%s1):
+  flow.stream.append[%s1](...) {
+    flow.dispatch @pipeline_a, ...
+  }
+  br ^end(%s1)
+^b(%s2):
+  flow.stream.append[%s2](...) {
+    flow.dispatch @pipeline_b, ...
+  }
+  br ^end(%s2)
+^end(%s3):
+  ...
+```
+
+This allows the entire stream to be lowered into one command buffer without the
+need for any host round-trips. The conversion into the `flow` dialect will walk
+the CFG and attempt to thread the `flow.stream` values through so long as there
+are no external dependencies.
+
+### Predication of `flow.dispatch`
+
+While the
+[`flow.stream` threading through the CFG](#threading-flowstream-through-the-cfg)
+can remove many of the simpler conditional dispatches there will always be some
+that will have their execution dependent on the result of prior dispatches. For
+these a `flow.cond_dispatch` will allow a condition to be provided that must be
+true for the dispatch to actually be performed.
+
+For targets that natively support predication in their command buffers (such as
+D3D12's
+[ID3D12GraphicsCommandList::SetPredication](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12graphicscommandlist-setpredication))
+this provides a host round-trip-free way of conditionally executing dispatches
+and transfers. Unfortunately Vulkan support is still lacking, though Nvidia
+supports the
+[VK_EXT_conditional_rendering](https://www.saschawillems.de/blog/2018/09/05/vulkan-conditional-rendering/)
+extension that exposes the same behavior.
+
+For targets that do not support predication natively it's still possible to
+emulate predication with
+[indirect dispatches](https://github.com/gpuweb/gpuweb/issues/31). In this model
+the workgroup counts normally used to dispatch execution are sourced from
+another device buffer at the time the dispatch is made instead of sourced from
+the command buffer at the time the dispatch is recorded. Degenerate dispatches
+with counts of `0, 0, 0` allow for effective neutering of the dispatch with
+minimal overhead (vs. the significant penalty of a host round-trip!).
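+
+Sketched with the Vulkan API used elsewhere in this document, the emulation
+looks roughly like this: an earlier dispatch writes either the real workgroup
+counts or `{0, 0, 0}` into `counts_buffer`, and the predicated dispatch reads
+its counts from that buffer at execution time (names are illustrative):
+
+```c++
+#include <vulkan/vulkan.h>
+
+void record_predicated_dispatch(VkCommandBuffer cmd, VkPipeline pipeline,
+                                VkBuffer counts_buffer) {
+  vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
+  // Counts of {0, 0, 0} neuter the dispatch with minimal overhead.
+  vkCmdDispatchIndirect(cmd, counts_buffer, /*offset=*/0);
+}
+```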
+
+By modeling such predication at the `flow` level we are able to lower into the
+HAL with target-aware predication semantics and fuse indirect dispatch workgroup
+count calculations into existing dispatches already being performed such that
+overhead is reduced.
+
+### Deduping `flow.executable`s
+
+While still in the `flow` dialect, the executables are target-agnostic. This
+makes simple IR tree diffing a potential solution to deduplication. Since most
+of the dispatches originate from the same source-language library calls in input
+frameworks there's a high likelihood of duplication, and depending on when
+inlining is performed we may have stronger or weaker ability to perform the
+deduplication. Thanks to the MLIR canonicalization pass (that ensures ops are
+rearranged into consistent canonical representations) the IR comparisons can be
+done rather trivially.
+
+### Rematerializing CSE'd Expressions
+
+Common subexpression elimination is performed many times during lowering,
+however there comes a point where the CSE can introduce false dependencies and
+additional allocations that are otherwise avoidable. For example if a
+broadcasting operation is CSE'd and then the result is used by two or more
+operations that are scheduled independently what would have been a relatively
+cheap lowering of the broadcast to a simple index remapping now becomes an
+additional dispatch, materialization of an intermediate tensor, and a barrier:
+
+```mlir
+%bcast = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
+%mul1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
+// (pretend something here that prevents fusion)
+%mul2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
+```
+
+```mlir
+%bcast = flow.dispatch.region(%cst : tensor<f32>) -> tensor<1024x10xf32> {
+  %0 = "mhlo.broadcast_in_dim"(%cst) : (tensor<f32>) -> tensor<1024x10xf32>
+  return %0 : tensor<1024x10xf32>
+}
+// a barrier will be required here
+%mul1 = flow.dispatch.region(%arg0 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
+  %1 = mhlo.multiply %arg0, %bcast : tensor<1024x10xf32>
+  return %1 : tensor<1024x10xf32>
+}
+%mul2 = flow.dispatch.region(%arg1 : tensor<1024x10xf32>, %bcast : tensor<1024x10xf32>) -> tensor<1024x10xf32> {
+  %2 = mhlo.multiply %arg1, %bcast : tensor<1024x10xf32>
+  return %2 : tensor<1024x10xf32>
+}
+```
+
+Instead the broadcast should be rematerialized inside of both dispatch regions
+as the cost of doing so is significantly less in compute resources and then the
+intermediate tensor will not be required at all. At first it may seem
+counter-intuitive to undo such a critical optimization as CSE (beneficial both
+to code size and often to compute), but here it's something we must carefully
+balance while looking at the whole system. It gets even more important when
+considering
+multi-device execution as the cost of sharing memory and synchronizing may be
+extremely non-trivial.
+
+### Device Placement
+
+While still within the `flow` dialect we have the ability to easily split
+streams and safely shuffle around operations. Target execution backends can opt
+into such behavior to ensure that device restrictions such as maximum in-flight
+memory, maximum scheduling depth, and capabilities are observed. For
+heterogeneous configurations the intent is that certain operations, dispatches,
+and streams can be attributed to specify which device categories they should be
+lowered to. The constraint solving that takes place can be provided with generic
+heuristics ("big GEMMs go on the accelerator"), profile-guided databases based
+on benchmarks, learned traits via ML, etc.
+
+## `hal`: Hardware Abstraction Layer and Multi-Architecture Executables
+
+As the IREE HAL is designed almost 1:1 with a compute-only Vulkan API many of
+the techniques classically used in real-time graphics apply. The benefit we have
+by modeling our usage of such a low-level API in IR is that the normal work -
+some of which is very non-trivial - for managing allocations, tracking resource
+lifetime, and ensuring proper synchronization/barriers is something we can apply
+the full force of an offline compiler against.
+
+### Allow Targets to Specify `hal.interface`s
+
+The `hal.interface` op specifies the ABI between the scheduler and the device
+containing the buffer bindings and additional non-buffer data (parameters,
+shapes, specialization flags, etc). Today a naïve ordering is used uniformly
+for all targets; however, it is possible for target backends to opt into
+providing their own interfaces based on target configuration. The same
+`hal.executable`
+may have multiple interfaces and the same backend may use one or more. This is
+useful for when target capabilities may vary at runtime, such as the
+[number of available storage buffer bindings](https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxPerStageDescriptorStorageBuffers&platform=android)
+in Vulkan. By exposing a few `hal.interface` variants with different binding
+amounts the Vulkan backend could make better use of the larger number of
+bindings available at runtime while still providing support for smaller
+configurations.
+
+Once we have multiple `hal.interface`s defined for executables the scheduler
+needs to emit HAL ops that properly switch between them. By having a canonical
+form for bindings we can ensure that only the differences between the interfaces
+will need additional code.
+
+### Target-specific Scheduling Specialization
+
+Though the `flow` dialect attempts to fuse as many ops as possible into dispatch
+regions, it's not always possible for all target backends to schedule a region
+as a single dispatch. A classic example is algorithms like
+[parallel reduction](https://en.wikipedia.org/wiki/Reduction_Operator#PRAM-algorithm)
+commonly used on GPUs that may require many dispatches to identical executables,
+while other algorithms may vary the executables they use based on the input
+parameters such as shape or the target runtime device support.
+
+By default the `flow.dispatch` executable translation to `hal.executable`s is
+performed 1:1 and it is assumed that a single dispatch is required. Extending
+target backends with scheduling interfaces (enabling them to opt into different
+scheduling behavior) will allow the backends to emit any number of
+`hal.executable`s and any stream commands (such as additional dispatches or
+transfers) they may need. This is effectively equivalent to what would be done
+at runtime, except that because we are still operating on IR prior to buffer
+allocation we can use the `hal` ringbuffer primitive. Through this we can
+elide many of
+the allocations that would otherwise be required at runtime (and the
+concurrency-limiting false dependencies that usually come along with scratch
+memory).
+
+Since the algorithm used may vary based on the parameters of the dispatch (such
+as the shape of the reduction which may be dynamically determined) scheduling
+specialization may occur even when targeting a single backend. In many cases
+folding and canonicalization can eliminate the overhead, as the same IR is
+present regardless of which dynamically computed workgroup size is used.
+
+### Buffer Usage Tracking
+
+Many explicit hardware APIs require knowing how buffers are used alongside
+where they should be located. For example this additional information determines
+caching policy on buffer accesses (write-through, write-back, etc), visibility
+of writes across compute units, and the possible MMU properties that may need to
+be maintained/matched for the buffer. By using the SSA-form value-semantics of
+the MLIR `tensor` as used in the `flow` dialect we have complete information of
+where buffers may be used or at least where they enter or leave regions where we
+can derive such information.
+
+Analysis passes can run over IR to attribute tensors such that when allocation
+is performed when lowering to the `hal` dialect we do so from an allocator
+compatible with where the buffer will be used, with memory types chosen based on
+the potential cost and location of operations performed (write-only on host vs.
+read-write on host and device, etc), and with usage bits indicating what kind of
+operations may be performed on the buffer. Many of these are local
+transformations as most buffers are only live within very small regions such as
+the `flow.stream` encompassing their usage.
+
+Traditional systems need to either use very permissive buffer properties or
+heuristics that can introduce additional non-trivial overhead when such
+heuristics are incorrect. For example,
+[OpenGL had several such usage hints](https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferData.xhtml)
+that drivers were then able to use but almost no drivers behaved as desired in
+all cases and it led to additional memory ghosting, copies, readbacks, and
+unpredictable performance. For almost all uses of the buffers within an IREE
+invocation we instead can know precisely where and how buffers may need to be
+moved and do it a minimum number of times if it is required.
+
+### Batched Executable Caching and Precompilation
+
+For targets that may require runtime preprocessing of their executables prior to
+dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch
+compilation mechanism based on Vulkan's
+[Pipeline Cache](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html).
+
+Today each executable is compiled on-demand and cached only for the process
+lifetime. Though some drivers may provide their own caching we can make better
+use of the explicit caching and compilation behavior with the additional
+information we have in the compiler.
+
+For any given entry point (or group of entry points) into an IREE module we can
+perform reachability analysis to know which executables may be executed when
+that entry point is invoked. In this way we can emit pre-invocation compilation
+checks (similar to an `std::call_once` block) that provide all required
+executables for compilation and allow more efficient compilation through
+multithreading the compiler invocations. These same compilation caching
+functions can be exposed and invoked manually by an application to force
+pre-compilation
+when it is least likely to impact the user, such as a post-install/first-run
+step or concurrently while other application features are loading.
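+
+A sketch of such a pre-invocation check is below; `CompileAll` and the
+reachable executable list are hypothetical stand-ins for what the compiler
+would emit:
+
+```c++
+#include <cstdio>
+
+#include <mutex>
+#include <string>
+#include <vector>
+
+// Hypothetical stand-in for handing a batch of executables to the HAL cache.
+void CompileAll(const std::vector<std::string>& executables) {
+  for (const auto& name : executables) {
+    std::printf("compiling %s\n", name.c_str());
+  }
+}
+
+static std::once_flag entry_point_prepare_flag;
+
+// Compiles everything reachable from the entry point exactly once, no matter
+// how many threads invoke it concurrently.
+void PrepareEntryPoint() {
+  static const std::vector<std::string> kReachableExecutables = {
+      "dispatch_0", "dispatch_3", "dispatch_7"};
+  std::call_once(entry_point_prepare_flag,
+                 [] { CompileAll(kReachableExecutables); });
+}
+```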
+
+We can use zero or more scoped caches for executables within a module.
+Completely dynamic modules (such as those emitted in eager-mode usage) may avoid
+the caching overhead entirely, while modules that have several primary usage
+modes (such as training and inference) may choose to use independent caches for
+each such mode.
+
+The caches generated can then be retrieved and saved by the hosting application.
+Upon the next execution the application can provide the caches and if still
+valid they will be used to avoid compilation.
+
+### Target-aware Executable Compression
+
+An advantage of representing executable binaries in IR after translation is that
+we can apply various post-compilation compression and minification techniques
+while still knowing precisely where the executable will be used. This is
+extremely important for SPIR-V as it is not designed to be a small at-rest
+format. Though
+the biggest lever we have to control generated code size is higher-level
+deduplication and specialization there will still be a sufficiently large number
+of executable binaries we will need to embed within the final modules and having
+targeted approaches for reducing their size beyond just "gzip everything" is
+very powerful.
+
+For example, [SMOL-V](https://github.com/aras-p/smol-v) is a fantastic lossless
+SPIR-V compression technique that, when coupled with modern dictionary-based
+compression algorithms, can save significant binary size. As a data point, the
+SPIR-V corpus SMOL-V uses for testing goes from 4.8MiB of raw SPIR-V to 348KiB
+of compressed SMOL-V.
+
+Combined with
+[Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation)
+we can easily use shared dictionaries and other cross-artifact compression in a
+relatively plug-in way.
+
+### Target-aware Constant Compression
+
+It's still an area that needs more research but one goal of the IREE design was
+to enable efficient target- and context-aware compression of large constants
+(typically model weights/parameters/embeddings). This may mean reusing existing
+hardware compression formats on GPUs, ML accelerator-specific formats, or
+very-low-bit-depth (1-4 bit per value) quantization techniques that cannot be
+directly used without first decompressing. The inspiration here is formats like
+[Crunch](https://github.com/BinomialLLC/crunch) and
+[Basis Universal](https://github.com/BinomialLLC/basis_universal) that perform
+["supercompression"](http://gamma.cs.unc.edu/GST/gst.pdf), and we may even be
+able to use these directly as then we can make use of GPU hardware samplers to
+do the 4-bit to 32-bit decompression, etc.
+
+### Command Buffer Stateful Deduplication
+
+The IREE HAL - much like Vulkan it is based on - eschews much of the state that
+traditional APIs have in favor of (mostly) immutable state objects (pipeline
+layouts, pipeline states, descriptor sets, etc). There are still a few stateful
+entry points in the API, though, and deduplicating or reordering redundant calls
+can reduce IR, API, and execution overhead.
+
+The key place this will have the largest impact is around descriptor set
+bindings and push descriptors, both of which are state and can have non-trivial
+setup overhead. A canonicalization for such commands that inspects the target
+`hal.command_buffer` to see if the same state was set prior and code motion to
+move such commands out of loop bodies when possible would be helpful.
+
+### Resource Timeline
+
+A core concept of the IREE scheduler that allows for overlapping in-flight
+invocations is that of the resource timeline. This identifies module state that
+can be in use by multiple invocations and assigns timeline milestones denoting
+when the resource will be in the appropriate state for the current invocation to
+proceed. Conceptually it is like an epoch-based synchronization mechanism as
+commonly found in garbage collectors to allow for lock-free asynchronous memory
+reclamation.
+
+The advantage we have in the IR is that we know both the usage of all resources
+thanks to [buffer usage tracking](#buffer-usage-tracking) and the
+synchronization domains of all resources (in most cases). This allows us to
+effectively assign one timeline semaphore per writeable resource while in
+practice having far fewer than 1:1, as for example if two resources are only
+ever written in the same command buffer only one semaphore is needed to signal
+the completion of both writes.
+
+By transforming IR to sink all resource reads and writes closest to where the
+value is used we can enlarge the time windows that can overlap across
+invocations that may share those resources. This is similar to what out-of-order
+CPUs do with register renaming/reorder buffers/etc and something we can apply
+some traditional instruction scheduling techniques to (only here our
+'instructions' are entire command buffer dispatches/transfers).
+
+Two degenerate cases of this approach are that of resource indirection
+(`util.ptr<tensor<T>>`) and dynamic resource shapes. In these two cases it may
+not be possible to continue recording commands even if we are able to ensure
+execution is appropriately synchronized. This is where indirect dispatch,
+[predication](#predication-of-flowdispatch),
+[indirect command buffers](#indirect-command-bufferon-accelerator-execution),
+and [VM coroutines](#coroutines-for-batching-and-cooperative-scheduling) can all
+help cover for the times where we are unable to transform away the indirection
+or emit shape logic without data dependencies.
+
+### Transient Tensor Ringbuffer
+
+(When properly implemented) almost all buffers required during execution never
+escape the command buffers they are used in or a single VM invocation. We can
+trivially identify this from the explicit captures of `flow.stream` and
+`flow.dispatch` ops and the fact that all tensor types have value-semantics.
+Only those tensor values loaded-from/stored-to module state or that cross the
+exported module function boundary need special consideration while almost
+everything else can live transiently only so long as it is required during
+execution.
+
+Thanks to this information about buffer usage and lifetime we can use a
+[ringbuffer](https://en.wikipedia.org/wiki/Circular_buffer) to store the
+transient tensor data and other required data reservations such as uniform
+buffers used to pass dynamic parameters (shapes, flags, etc) into dispatches.
+This gives the compiler and the application a knob that allows them to control
+maximum concurrency (by having a very large ringbuffer) or maximum memory usage
+(by having a minimally small ringbuffer).
+
+Allocating tensors from the ringbuffer does not require sophisticated runtime
+packing as we can emit IR to calculate required sizes for dynamically shaped
+tensors. Whether a basic block reserves `%sz = arith.constant 42 : index` bytes
+or `%sz = arith.muli %cst, %dyn_dim : index` bytes doesn't materially change how
+the allocations are performed. Since almost all usage involves simple write head
+bumps there is no need for ahead-of-time memory planning or large fixed
+allocations, and since no buffer within the ringbuffer can alias we can have
+coarse (_read: low overhead_) guarantees about the availability of certain
+regions of the ringbuffer (_"when this event is signaled all prior ringbuffer
+writes have completed"_).
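+
+The core allocation behavior is just a bump of the write head over a fixed
+ring, as in the conceptual sketch below (real usage would also track
+completion, e.g. a semaphore value, before reusing a region):
+
+```c++
+#include <cstddef>
+
+class TransientRingbuffer {
+ public:
+  explicit TransientRingbuffer(size_t capacity) : capacity_(capacity) {}
+
+  // Returns the offset of a |size|-byte reservation, wrapping when the write
+  // head would run past the end of the ring (assumes size <= capacity).
+  size_t Allocate(size_t size) {
+    if (head_ + size > capacity_) head_ = 0;
+    size_t offset = head_;
+    head_ += size;
+    return offset;
+  }
+
+ private:
+  size_t capacity_;
+  size_t head_ = 0;
+};
+```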
+
+Usually any planning we may want to perform can be done in IR via code motion.
+For example applying traditional algorithms used to reduce register pressure
+will help us attain narrower live windows within the ringbuffer leading to a
+larger number of in-flight operations for the same ringbuffer memory usage.
+
+We may end up using both a classical ringbuffer and a variant known as the
+[bip buffer](https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The-Circular-Buffer-with-a-Twist)
+because it is better for descriptor set utilization (as we can provide many
+dispatch parameters with a single base offset bound once at the beginning of a
+region).
+
+### Timeline Semaphores on the Module ABI
+
+Function calls made across modules (either from C++ into the VM, VM->VM, or
+VM->C++) should be able to define timeline semaphores used to wait and signal on
+the call. We can do this by making all exports automatically have the semaphores
+and then make invocations populate them if they were not provided by the caller.
+In this way we can allow multiple invocations of exported functions to chain
+naturally with internal asynchronous workloads, turning most IREE invocations
+into just recording of command buffers that can never block.
+
+When combined with
+[VM coroutine support](#coroutines-for-batching-and-cooperative-scheduling) we
+even have the ability to interleave any required host execution between the wait
+and signal semaphores provided such that the caller never knows on which device
+execution is taking place. It's still possible to provide synchronous wrappers
+that emulate blocking behavior but by having the core system designed around a
+single system-supported primitive we avoid the need for additional things like
+interrupt watchdog threads, implicit blocking, and other pitfalls.
+
+### GPU-like CPU Scheduling
+
+One approach to using multiple cores on a CPU is to perform interior
+parallelization of individual operations via OpenMP or library-call-based
+custom thread pools (gemmlowp). This works when each individual operation is
+relatively costly
+vs. potential pipeline bubbles caused by work spinning down near the end of an
+operation and spinning up at the beginning of the next.
+
+IREE is designed to handle many more workloads - some of which have very narrow
+shapes but very deep pipelines (like search algorithms) - such that the above
+approach of multithreading within ops becomes a bottleneck. These workloads are
+traditionally very poorly handled by frameworks and issues with
+oversubscription, pipeline stalls, and suboptimal system schedulers (such as on
+Android) can lead to more time being spent thrashing about than actually
+executing real work.
+
+The approach we take here is to treat the cores of a CPU as if they were
+computation units on a GPU, each able to perform some set of heterogeneous work
+independent of other units. This means that the concurrency we model at the
+`flow` level and communicate to the runtime via the `hal` (which explicitly
+states which dispatches can overlap and the size of the workgroups) can
+trivially be used to distribute this work over many cores exactly as a GPU
+would. Integration with library calls that may require their own threading
+(such as Ruy) requires that they be able to use the IREE thread pool instead of
+their own.
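+
+A conceptual sketch of distributing the workgroups of one dispatch across CPU
+threads follows; a real implementation would use persistent workers/fibers
+(e.g. marl) rather than spawning threads per dispatch:
+
+```c++
+#include <algorithm>
+#include <functional>
+#include <thread>
+#include <vector>
+
+void DispatchOnCPU(int workgroup_count,
+                   const std::function<void(int workgroup_id)>& workgroup_fn) {
+  const int worker_count = static_cast<int>(
+      std::max(1u, std::thread::hardware_concurrency()));
+  std::vector<std::thread> workers;
+  for (int w = 0; w < worker_count; ++w) {
+    workers.emplace_back([=] {
+      // Strided partitioning: worker w handles workgroups w, w+N, w+2N, ...
+      for (int wg = w; wg < workgroup_count; wg += worker_count) {
+        workgroup_fn(wg);
+      }
+    });
+  }
+  for (auto& t : workers) t.join();
+}
+```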
+
+In this way we can avoid pipeline bubbles and other latency-inducing
+unpredictable scheduling. This does not mean that we treat individual units of
+work at the same scale as we would for GPUs, but instead that we tile and have
+one or more processing units that allows us to work on those tiles. Whether the
+tile size is defined by a library call contract, heuristics, or empirically is
+TBD, but expect workgroup sizes in the thousands to millions of invocations vs.
+normal GPU workgroup sizes in the dozens to hundreds of invocations.
+
+To achieve this style of scheduling efficiently we'll likely use something like
+[marl](https://github.com/google/marl) as the scheduler. Marl provides
+cross-platform low-overhead fibers and is compatible with this style of
+scheduling as it was built for the Swiftshader software rasterizer.
+
+Even if IREE were only targeting CPUs, the assertion is that we would still want
+to schedule this way and it's only an incidental benefit that if building for
+heterogeneous targets the scheduling code may be shared (just with a different
+divisor for workgroup count calculations).
+
+## `vm`: Lightweight Virtual Machine
+
+The VM is designed as a dynamic linkage ABI, stable bytecode representation, and
+intermediate lowering IR. Many of the optimizations we can perform on it will
+benefit all use cases (such as when lowering to LLVM IR) by allowing
+higher-level program transformations around synchronization that are difficult
+to perform on arbitrary LLVM IR.
+
+### Coroutines for Batching and Cooperative Scheduling
+
+One of the largest features currently missing from the VM is coroutines (aka
+user-mode fiber scheduling). Coroutines are what will allow us to have multiple
+in-flight invocations into a module - some of which may be waiting on external
+events - without the need for complex multithreading logic or state machine
+machinations.
+
+In many cases
+[once semaphores are exposed to callers](#timeline-semaphores-on-the-module-abi)
+we will not need to yield in the VM. The user will call into the module with
+provided semaphores, the work to perform will be recorded to one or more command
+buffers and submitted to the device, and then control will return to the caller
+immediately.
+
+In cases requiring host readbacks that we were not able to remove, however,
+additional VM code may need to run prior to when the final semaphore is
+signaled. To preserve the asynchronous interface and immediate execution
+guarantees the compiler can emit explicit yield points (`vm.yield`) that are
+known-good locations for yielding (such as most resources not required after the
+yield having been flushed/discarded, partial synchronization scope availability
+if other work may be able to execute concurrently irrespective of the yielded
+coroutine, etc).
+
+When the VM encounters the yield at runtime it will suspend the coroutine until
+a defined condition is met. Many coroutines can be in various states at any
+given time and - thanks to the resource timeline - can still be memory safe. For
+example if two stateless invocations are made with a common wait semaphore both
+can be recorded and submitted without waiting on each other. If there is
+internal module state accessed the invocations are implicitly ordered by
+invocation order (similar to what Vulkan calls
+[API order](https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch02s02.html#fundamentals-queueoperation-apiorder))
+based on internal resource timeline semaphores.
+
+Waking the coroutines can be performed by an application-provided callback when
+the application already has a periodic bookkeeping event (such as frame-end
+callbacks when rendering or Looper idle events on Android), giving direct
+control over the frequency and location at which IREE performs additional work.
+A helper that runs a dedicated IREE thread to do this will be provided as well,
+but the expectation is that applications can often do a better (and importantly
+more predictable) job.
+
+By utilizing coroutines IREE will have a way to fill traditional pipeline
+bubbles even with execution from the same module (let alone across modules) in
+the situation where host readbacks or other logic is required. This increases
+overall throughput and utilization while reducing host wakeups as many
+coroutines can be processed at once to submit new work to the device queues,
+though it does not help reduce per-invocation latency.
+
+External code such as the HAL implementation or user ops may provide the wait
+handles used for continuation. For example, the HAL can expose a function that
+yields and wakes only when one or more timeline semaphores reach their target
+values:
+
+```mlir
+// submit work
+hal.device.yield %semaphore4 >= %sem4_target, %semaphore5 >= %sem5_target
+// continue here, possibly much later in time
+```
+
+#### Cellular Batching
+
+Though coroutines help throughput there is a way we've found to reduce latency
+that's been documented as
+[cellular batching](http://madsys.cs.tsinghua.edu.cn/publications/EUROSYS2018-gao.pdf).
+This same technique has been implemented in prior internal systems and is one of
+the motivating design goals for IREE's creation. The core idea is to identify
+small uniform work that can be partitioned and scheduled greedily such as to
+enable batching or reduce associated invocation costs (such as refreshing
+accelerator SRAM/caches with new parameters). This usually manifests as finding
+large GEMM/GEMV operations using the same fixed parameters and either
+dynamically increasing the batch size by adding the waiting work (without
+deferring the actual execution time) or sequencing them back to back to ensure
+better cache utilization. Which approach is taken depends on any data
+dependencies that may be present (such as LSTM state feedback edges).
+
+With the foundation of coroutines in IREE it's possible to yield execution at
+any given point - including during command buffer recording - and wake on
+specific conditions. A majority of the logic can be built into the module itself
+with very little need for runtime machinery, as shared VM variables can be used
+to track pending work across invocations (even from different parts of the
+program) and flush based on logic wholly controlled by the user or compiler
+(such as count/max time latency/etc limits). This allows for the large variety
+of scheduling behavior various applications may want to use, ranging from
+zero-latency batch-only-within-this-invocation behavior to
+[Nagle's Algorithm](https://en.wikipedia.org/wiki/Nagle%27s_algorithm)-esque
+time- or limit-based behavior, or even some learned model-specific windowing.
+
+Design work is still required on how to represent this in IR but the current
+thought is to model the regions in which deferred execution is possible and
+beneficial and allow additional transformations during lowering to the VM. This
+is similar to how the async-await behavior works in C# where the async keyword
+is just sugar that expands to additional generated helper utilities.
+
+A simple strawman representation for sequential dispatch may look like:
+
+```mlir
+hal.scheduling_policy @defer_policy {
+  // max time, max count, max live memory, etc
+}
+...
+hal.command_buffer.dispatch.deferred @defer_policy, @dispatch, ...
+// vm.yield added here during lowering
+```
+
+There are many cases to explore and as cellular batching can have performance
+benefits of several orders of magnitude it'll be one of the primary areas of
+research in the long-term.
+
+### Lowering to LLVM IR
+
+For scenarios where dynamic module loading is not required and entire modules
+can be compiled into applications we can lower the VM IR to LLVM IR within
+MLIR's transformation pipeline. Instead of embedding `vm.call` ops that are
+dispatched at runtime to things like the HAL we can instead lower to
+`llvm::CallInst`s that call runtime-resolved function pointers. This still
+enables all
+of the flexibility of heterogeneous/runtime-determined devices, pluggable
+diagnostics, and backend composition without any need for FlatBuffers or the VM
+bytecode interpreter.
+
+The VM was designed to make such a lowering easy and the C-style struct-based
+function pointer registration for runtime modules was designed to make emitting
+code that used it fairly robust even when linked in dynamically such as when
+embedded in shared objects.
+
+An extension of this is what we've been calling 'runtimeless mode', where the
+IREE VM linkage code is statically linked into the binary alongside the
+generated module LLVM IR. If only a single HAL backend is linked in then (with
+some build-fu) we should be able to get call devirtualization to reduce code
+size to precisely the functionality used by the module.
+
+### Improved Type Support
+
+Currently the VM only supports two types: `i32` and `vm.ref<T>`. This is an
+intentional limitation such that we can determine what is really needed to
+express the scheduling we perform, with the idea being that such a limited model
+will make it easier to use techniques like
+[indirect command buffers](#indirect-command-bufferon-accelerator-execution) to
+compile the VM itself to an accelerator executable that dispatches work without
+host involvement.
+
+As we port more models we may find a few primitives that are worth bringing into
+the VM design such that it's worth potential complications to future porting.
+These include types like `f32` (for simple float calculations/comparisons),
+`list`/`dict` (easier python compatibility), and `vector<4xf32>` (for simple
+inline calculations that are not worth dispatch overhead/synchronization).
+
+### Indirect Command Buffer/On-Accelerator Execution
+
+Though IREE will use many different tricks such as
+[predication](#predication-of-flowdispatch) to build deep pipelines there is
+still the requirement that the command recording and submission happens on the
+host CPU. Though the cost of this in terms of latency and power use can be
+minimized by coalescing and timelines there is still the possibility of
+non-trivial roundtrips being introduced that limit performance. For particular
+applications like low-power always-on compute or where there is significantly
+branchy behavior (such as search algorithms) it is important that the decision
+making logic as to what is dispatched runs as close to real-time as possible
+within the execution pipeline.
+
+The IREE VM is designed to be runnable on-device in a secure and cooperative way
+(no pointers, indirect buffer handles to allow for memory space rearrangement
+op-to-op, deterministic execution and explicit yield points, etc).
+
+The recent efforts to bring indirect command buffers to Vulkan and Metal's
+[Indirect Command Buffers](https://developer.apple.com/documentation/metal/indirect_command_buffers/encoding_indirect_command_buffers_on_the_gpu)
+(that both derive inspiration from
+[NV_command_list](https://www.khronos.org/registry/OpenGL/extensions/NV/NV_command_list.txt))
+are one such target for this. Either by
+[lowering the VM IR to LLVM IR](#lowering-to-llvm-ir) or SPIR-V, by a special
+conversion to target-specific forms, or by actually executing the VM bytecode
+directly on-device (it's ~1000 LoC) we should be able to prototype what full
+on-device usage is like. Even if only some VM functions the compiler deems
+useful to schedule on the device are used and the rest run on the host
+(particularly those functions calling imported functions) some of the most
+costly logic that creates tight coupling of the host and device scheduling can
+be limited.
diff --git a/docs/website/docs/developers/design-docs/function-abi.md b/docs/website/docs/developers/design-docs/function-abi.md
new file mode 100644
index 0000000..1148dfa
--- /dev/null
+++ b/docs/website/docs/developers/design-docs/function-abi.md
@@ -0,0 +1,193 @@
+# Function ABI
+
+!!! note
+
+    Authored December, 2019
+
+    Updated August, 2021
+
+A key job of the IREE compiler and runtime is capturing function call semantics
+from the originating system and providing mechanisms so that invocations can be
+performed in as similar way as possible in various target languages. In general,
+this requires additional metadata on top of the raw characteristics of a
+function. Where possible, this is done by attaching attributes to a function.
+
+- `iree.abi` : JSON encoded description of the function's calling convention.
+
+## V1 ABI
+
+This is the default ABI supported by the IREE VM invocations. It attempts to
+provide a default calling convention that can be used without further reflection
+metadata but which may be enhanced with it.
+
+It natively allows monomorphic functions to be exported where arguments and
+results are composed of the following types:
+
+### Value Types
+
+- Byte aligned integer type (i8, i16, i32, i64)
+- Floating point value (f16, f32, f64)
+
+### Reference Types
+
+- ND-Array buffers of Value Types:
+
+    - Simple: Packed, C-layout
+    - Strided: Arbitrary layout with strides (future)
+
+- String (byte arrays)
+
+- Opaque reference object
+
+### Sequence Types
+
+- Tuples: fixed length lists where each position has its own type bound
+- Homogeneous list: lists of arbitrary size where a single type bound applies
+    to all elements
+
+The intent with these low level types is that calling conventions can be
+synthesized to bind arbitrary high level, domain/language specific signatures to
+these types, possibly by way of additional reflection metadata.
+
+### Representations
+
+The above are all representable with native constructs in the VM:
+
+- ValueType:
+
+    - Runtime:
+        [`iree_vm_value`](https://github.com/openxla/iree/blob/main/iree/vm/value.h)
+    - Compile Time: primitive MLIR integer/floating point types
+
+- Simple ND-Array Buffer:
+
+    - Runtime:
+        [`iree_hal_buffer_view`](https://github.com/openxla/iree/blob/main/iree/hal/buffer_view.h)
+    - Compile Time: `tensor<>`
+
+- String:
+
+    - Runtime:
+        [`iree_vm_list`](https://github.com/openxla/iree/blob/main/iree/vm/list.h)
+        containing `i8`
+    - Compile Time: `!util.list<i8>`
+
+- Tuple:
+
+    - Runtime:
+        [`iree_vm_list`](https://github.com/openxla/iree/blob/main/iree/vm/list.h)
+        of variant
+    - Compile Time: `!util.list<?>`
+    - Note that these are statically type erased at the boundary.
+
+- TypedList (homogeneous):
+
+    - Runtime:
+        [`iree_vm_list`](https://github.com/openxla/iree/blob/main/iree/vm/list.h)
+        of `T`
+    - Compile Time: `!util.list<T>`
+
+### Extended Type Calling Conventions
+
+While the above features of the native ABI may be sufficient for direct use by
+various programs, many programs and callers will need to represent various
+higher level types, consistently mapping them to the above facilities. This
+section describes calling conventions for various higher level types which do
+not map 1:1 to the above. Not all source language types are representable, and
+extending these calling conventions (and the fundamental types above) is demand
+driven.
+
+All of these calling conventions presume that the arity of the arguments/results
+of the raw function matches the user-level function, meaning that the calling
+convention is specified per argument/result. Higher-level whole function
+transformations may also exist for some domains but are outside of the scope of
+this specification.
+
+#### Structure
+
+A `Structure` is a common enough entity to have a dedicated calling convention.
+In C-like languages, this may just be a `struct`. In Python, it is typically a
+`dict` with an associated schema providing a name and type bound for each of its
+slots. In both, its slots are of fixed arity.
+
+In this convention, such a structure is represented as a `Tuple` in the native
+calling convention (i.e. `!util.list` of variant type). The order of the
+elements of the tuple is the natural order of the structure, where that is
+either:
+
+- For a C-like system where order is determinate, it is the order of
+    declaration.
+- For a name-based system (i.e. bind to `dict`) where no order is defined, the
+    natural order will be the lexically sorted order of the keys.
+
+#### String
+
+Most languages interop with byte arrays (i.e. the native ABI `String` type)
+by way of applying an encoding. Such strings are just a sequence of bytes (i.e.
+`!util.list<i8>`).
+
+#### Typed List
+
+High level lists which all share the same type bound are represented as a
+`TypedList` in the native ABI.
+
+#### NDArray of Reference Types
+
+NDArrays of reference types are considered separately from those of value types.
+Internally, the code generated for them is completely different from what gets
+generated for numeric based arrays (i.e. has ref-counting, ownership semantics,
+non-POD, etc). These types are permitted for completeness, not necessarily
+performance: by nature they are already indirected and have overheads.
+
+In the native ABI, these are represented as a composite tuple type (i.e. today a
+list since sugar for tuple is not yet defined): `!iree.tuple<!util.list<T>,
+!util.list<index>>`. The first element of the tuple is the list of values,
+packed with a C-layout, and the second element is the list of dimension sizes.
+
+#### Reflection
+
+Additional reflection metadata may be encoded in a custom JSON form, providing
+additional typing hints for arguments and results. If present, this will be a
+reflection attribute with key `d`, containing a serialized JSON object.
+
+The JSON object contains:
+
+- `a` (array): List of type records for each argument.
+- `r` (array): List of type records for each result.
+
+Type records are one of:
+
+- A string naming a primitive type:
+
+    - `i[0-9]+`: Integer type with given bit width
+    - `f[0-9]+`: IEEE floating point type with given bit width
+    - `bf16`: BFloat16
+
+- JSON `null`: A null reference value
+
+- `"unknown"`: An unknown/unmapped type
+
+- An array, interpreted as a tuple describing a compound type.
+
+##### Compound type tuples
+
+A compound type tuple has a type identifier as its first element, followed with
+type specific fields:
+
+- `["named", "key", {slot_type}]`: Associates a name with a slot. This is
+    used with the root argument list to denote named arguments that can be
+    passed positionally or by keyword.
+- `["ndarray", {element_type}, {rank}, {dim...}]`: For unknown rank, the
+    `rank` will be `null` and there will be no dims. Any unknown dim will be
+    `null`.
+- `["slist", {slot_type...}]`: An anonymous structured list of fixed arity and
+    slot specific types. If there are gaps in the list, empty slots will have a
+    `null` type.
+- `["stuple", {slot_type...}]`: Same as `slist` but some languages
+    differentiate between sequences represented as lists and those represented
+    as tuples (read-only lists).
+- `["sdict", ["key", {slot_type}]...]`: An anonymous structure with named
+    slots. Note that when passing these types, the keys are not passed to the
+    function (only the slot values).
+- `["py_homogeneous_list", {element_type}]`: A Python list of unknown size
+    with elements sharing a common type bound given by `element_type`.
diff --git a/docs/website/docs/developers/design-docs/invocation-execution-model.md b/docs/website/docs/developers/design-docs/invocation-execution-model.md
new file mode 100644
index 0000000..c5d31ef
--- /dev/null
+++ b/docs/website/docs/developers/design-docs/invocation-execution-model.md
@@ -0,0 +1,489 @@
+# Invocation execution model
+
+!!! note - "Authored June, 2022"
+
+This documents the behavior of the user-visible invocation mechanism IREE uses
+to schedule program execution. Internally IREE uses a very similar model for
+tracking its own workloads and carries that down to target APIs and devices
+that themselves use a similar model. The intent is to expose
+the device model in an abstracted way that allows for the full capture and
+communication of the execution intent to be propagated to the hardware that
+executes it. Though here we focus on the user-visible portion of execution
+there is really only one "IREE execution model" and the entire stack follows
+the same design. At its core this design is just an instantiation of an
+[out-of-order execution](https://en.wikipedia.org/wiki/Out-of-order_execution)
+algorithm such as those
+[originating from the 1960's](https://en.wikipedia.org/wiki/Tomasulo_algorithm).
+
+## Glossary
+
+```mermaid
+stateDiagram
+    state UserApplication {
+      direction BT
+      state Context0 {
+        ModuleA-->ModuleAState0
+        ModuleB-->ModuleBState0
+      }
+      state Context1 {
+        ModuleA-->ModuleAState1
+        ModuleB-->ModuleBState1
+        ModuleC-->ModuleCState1
+      }
+      state ModuleA {
+        @func1
+        @func2
+      }
+      state ModuleB {
+        @func3
+        @func4
+      }
+      state ModuleC {
+        @func5
+      }
+    }
+```
+
+### Program
+
+An IREE _program_ is a collection of _modules_ instantiated in a _context_ from
+which _invocations_ can be made. Invocations are ordered on a user-controlled
+_timeline_ that uses _fences_ to define the execution order requirements to
+enable out-of-order execution. A hosting user application may have multiple
+programs or multiple instances of the same program available and running
+invocations at a time across multiple timelines.
+
+### Module
+
+_Modules_ define executable code and data that can be loaded, linked, and run à
+la [ELF shared libraries](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format).
+Modules may be implemented as C/C++, generated bytecode or C sources from the
+IREE compiler, or any other mechanism that can run code and implement the
+[`iree_vm_module_t` interface](https://github.com/openxla/iree/blob/0e8d8370699912c6b51889e8f7e967690102402c/runtime/src/iree/vm/module.h#L335-L437).
+Modules on their own are read-only and can be reused across many contexts.
+
+Traditional ML runtimes would use a model (graph, etc) as their module
+representation. In IREE everything is a module including runtime subsystems like
+the HAL and user-provided custom code. This ensures that anything IREE can do
+can be externalized and replaced by users without needing to modify the core
+IREE code.
+
+### Context
+
+A collection of _modules_ are linked and instantiated in a _context_. Each
+context operates independently and carries its own copies of mutable module
+state. _Invocations_ execute within a context scope and hosting applications
+coordinate across contexts as required. Contexts are cheap to create
+(microseconds) and retain (~100B + program state) such that users can decide
+how to manage them based on their scenario.
+
+Traditional ML runtimes would call these "sessions" but in IREE everything is a
+_program_. Whether the program is stateful or stateless and how the program is
+invoked is up to the program author.
+
+### Invocation
+
+An _invocation_ represents a single call into a module exported function using
+the program state stored in a context. Users can decide whether to perform
+synchronous blocking invocations or asynchronous non-blocking invocations
+per-call; the behavior of the invocation is independent from the target function
+and a user program may contain a mix of both.
+
+As an example a user program may synchronously invoke a `@query_output_shapes`
+function to preallocate storage for an asynchronous `@execute_in_place`
+function to write into.
+
+### Timeline
+
+A _timeline_ represents the observable order of execution. Users define their
+own timelines and communicate them to IREE via _fences_. Timelines do not match
+up with the order of invocations unless the user dictates they must by way of
+fences. In the absence of fences all invocations execute in an arbitrary order
+and they may execute concurrently just as threads in C with no barriers.
+
+Each timeline can be thought of as an independent clock domain that may operate
+asynchronously at its own frequency with only fences acting to tie separate
+timelines together. This directly mirrors real hardware constraints like
+[clock domain crossing](https://en.wikipedia.org/wiki/Globally_asynchronous_locally_synchronous)
+as each execution scope (thread on core, driver calls to queues, kernel queues
+to device queues, device queues to compute unit queues, etc) is naturally
+operating at different rates and well-designed systems must tolerate that
+variability.
+
+### Fence
+
+A _fence_ is a specific point of progress in one or more _timelines_ acting as
+a barrier, fork, or join point. Fences only guard execution ordering and not any
+particular resources, though users can use them to guard resources by defining
+when in time the resources are available for use.
+
+Waits on fences are wait-until operations specifying that the timeline must
+reach _at least_ a specific point. This allows for flexible reordering and
+deferral of execution as executors can pull forward scheduled work based on
+policy (run similar work together, etc).
+
+### Hardware Abstraction Layer (HAL)
+
+The HAL is an optional feature of IREE that is used to provide a consistent
+interface across execution resources. It is used internally by IREE programs to
+define and submit work to devices and signal across them but may also be used by
+users to directly interface with hardware in a compatible way. Exposing the
+HAL API allows for users to efficiently manage their data and custom
+execution without expensive marshaling. Most users will only interact with HAL
+buffers as they work with their data but more advanced integrations can directly
+insert IREE into existing device contexts to transparently share scheduling and
+resources or insert their own code into IREE to pipeline custom execution.
+
+## Execution by Timelines
+
+**NOTE**: this defines _an_ execution scheme that IREE supports but a user may
+use one or more such schemes in a single program - just as a C application may
+mix single- and multi-threaded code within itself for different components.
+
+The combination of _invocations_, _timelines_, and _fences_ allows users
+to provide future knowledge to lower layers of the system by declaring their
+availability requirements, and the lower layers are then able to execute the
+work out-of-order so long as the specified requirements are met. The primary
+goal when designing for such a system is to specify as few requirements as
+possible in order to give the implementation the maximum amount of scheduling
+freedom.
+
+This makes timelines one of the most critical components of the interface.
+The purpose of invocations is to schedule work against one or more timelines and
+what happens within the invocations is an implementation detail of the program.
+
+### Sequential Execution
+
+Here we say _"a user invokes a function to schedule execution on a timeline"_
+rather than the more traditional _"a user invokes a function to execute work"_.
+This manifests in the IREE ABI as invocations taking fences that define
+specific points on timelines that the user may observe:
+
+```python
+# Fences are effectively just timeline + integer tuples and are cheap to hold.
+wait_fence = my_timeline.at(t)
+signal_fence = my_timeline.at(t+1)
+# Schedule work against the timeline.
+# All work prior to t must complete before execution can occur and after
+# execution the timeline will advance to t+1.
+async_invoke(@some_fn, wait_fence, signal_fence)
+# The invocation may have returned immediately after the work was scheduled;
+# until the fence is reached no actual execution may have occurred. To
+# synchronize the user code with the timeline the user can block until the fence
+# is reached.
+signal_fence.wait()
+```
+
+To the user this would appear as:
+
+```mermaid
+sequenceDiagram
+    User->>@some_func: invoke
+    activate @some_func
+    @some_func->>User: ;
+    @some_func-->>@some_func: wait t
+    @some_func-->>User: signal t+1
+    deactivate @some_func
+```
+
+This means from the user's perspective the _actual_ operations performed by the
+invocation are not important: the only thing the user can observe in this
+situation is when the timeline reaches `t+1` as they specified. Whether the
+invocation internally needs many steps to complete (for example, because the
+program has timelines of its own) is an implementation detail. Actual execution
+may look like this:
+
+```mermaid
+sequenceDiagram
+    User->>@some_func: invoke
+    activate @some_func
+    @some_func->>User:  ;
+    @some_func->>@some_func: ;
+    @some_func-->>Device A: ;
+    Device A-->>Device A: wait t
+    activate Device A
+    @some_func->>@some_func: ;
+    @some_func-->>Device B: ;
+    activate Device B
+    @some_func->>@some_func: ;
+    Device A-->>@some_func: ;
+    deactivate Device A
+    @some_func->>@some_func: ;
+    @some_func-->>Device B: ;
+    activate Device B
+    deactivate @some_func
+    Device B-->>User: signal t+1
+    deactivate Device B
+    deactivate Device B
+```
+
+Even in this simple user-synchronous example the system is able to internally
+run several concurrent timelines with a minimal number of synchronization points
+and the lowest possible latency as the user is immediately notified without
+any intermediate layers needing to be woken, scheduled, executed, and passed on.
+
+### Pipelined Execution
+
+The true power of timelines comes from the ability to pipeline execution. Users
+define DAGs with fences and can construct arbitrarily complex execution
+topologies whether from the same program or across multiple programs:
+
+```mermaid
+stateDiagram
+    direction LR
+    state fence0 <<fork>>
+    [*] --> fence0
+    fence0 --> @fn0
+    state fence1 <<fork>>
+    @fn0 --> fence1
+    fence1 --> @fn1
+    fence1 --> @fn2
+    state fence2 <<join>>
+    @fn1 --> fence2
+    @fn2 --> fence2
+    @fn3 --> fence2
+    fence0 --> @fn4
+    @fn4 --> fence2
+    fence2 --> [*]
+```
+
+This is a simple extension to the synchronous example using the same primitives:
+
+```python
+# Timeline is defined by the user.
+fence_a = my_timeline.at(t)
+fence_b = my_timeline.at(t+1)
+fence_c = my_timeline.at(t+2)
+# Invocations are launched using the fences and may not complete immediately.
+async_invoke(@fn0, fence_a, fence_b)
+async_invoke(@fn1, fence_b, fence_c)
+async_invoke(@fn2, fence_b, fence_c)
+async_invoke(@fn3, None, fence_c)
+async_invoke(@fn4, fence_a, fence_c)
+# Blocking here but no need to; could pass fence_c on to other invocations.
+fence_c.wait()
+```
+
+The critical point is that the user never has to wait for any particular
+invocation to complete before being able to schedule more work against the
+timeline, even if those invocations could not themselves complete
+synchronously. The lower layers of the system are able to fully model the
+execution as early as possible without needing to communicate (and importantly
+synchronize) with the user.
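+
+Continuing the example above with the same pseudo-API (`@fn5` is an
+illustrative name), more work can be chained after `fence_c` without ever
+blocking the host:
+
+```python
+# Instead of fence_c.wait(), keep scheduling against the timeline.
+fence_d = my_timeline.at(t+3)
+async_invoke(@fn5, fence_c, fence_d)  # scheduled before @fn0..@fn4 complete
+# ... do other host work here ...
+fence_d.wait()  # only block when a result is actually needed
+```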
+
+### I/O
+
+Users define the semantics of their programs themselves. For example, if the
+user knows the precise shape of an output buffer they can preallocate the
+buffer and pass it in. If they don't, they can factor out the shape
+calculation, invoke it synchronously to compute the shape, allocate an
+appropriately sized buffer, and pass that in. Or they could decide to only
+deal with synchronous invocations and return a program-allocated buffer view
+with the appropriate shape in their callback. IREE does not dictate the
+design of user programs and as such enables mixed stateful/stateless,
+asynchronous/synchronous, and arbitrary scheduling models (enqueue/drain,
+windowing, etc).
+
+Inputs and outputs to invocations are provided by the user as primitive values
+(integers, floats, etc), supported builtin types (lists, byte buffers/strings),
+custom user types, and HAL types like buffers or buffer views (buffers + shape
+and type metadata). One or more wait fences can be used to order invocation
+access to one or more inputs by indicating that the resource is not available
+until a certain fence is reached. Similarly one or more signal fences can be
+used to order subsequent access to the resources by indicating the advancement
+of the timeline when they are available.
+
+```python
+# wait_fence_a must be reached before buffer_a and buffer_b can be read.
+# wait_fence_b must be reached before buffer_c can be read.
+# buffer_a will be ready to read when signal_fence_a has been reached.
+async_invoke(@fn,
+             (wait_fence_a, buffer_a, buffer_b),
+             42,  # no ordering required on value types
+             (wait_fence_b, buffer_c),
+             (signal_fence_a, buffer_a))
+```
+
+The above example demonstrates an in-place operation on `buffer_a`. It's also
+possible for invocations to return values:
+
+```python
+result = invoke(@sum, 1, 2)  # = 3
+```
+
+When executed asynchronously a callback or any construct that can be built upon
+them (like promises/futures) can receive the results:
+
+```python
+def my_callback(result):
+  print(result)  # 3
+async_invoke(@sum, 1, 2, my_callback)
+```
+
+### Stream-ordered Allocations
+
+Invocations generally have only a few KB of overhead and pipelined command
+buffers take only a small amount more. Storage buffers, however, can easily
+take hundreds of MB per invocation for I/O and transient state. This compounds
+as program usage becomes more complex or multiple programs are involved. IREE
+supports traditional host-ordered allocations (à la malloc/free) for persistent
+buffers like large constants/read-only data or user-managed ringbuffers.
+Stream-ordered allocations are also supported to allow for pooled buffer
+reservations that can be allocated in a scheduled order alongside program
+execution.
+
+For more detailed examples see the CUDA blog posts describing their
+implementation:
+[part 1](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/),
+[part 2](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/).
+
+With stream-ordered allocations each allocation and deallocation operation is
+scheduled with wait and signal fences just as with invocations. This allows
+these allocation operations to execute remotely on device without host
+program involvement. For example, scheduling `alloca0`/`dealloca0` and
+`alloca1`/`dealloca1` interleaved with the function execution allows for the
+transient memory required for executing `@fn0` to remain uncommitted until
+immediately before it is executed, committed during execution, and then
+decommitted immediately after execution. The memory required for passing
+data from `@fn0` to the subsequent `@fn1` and `@fn2` survives until after they
+have completed executing before being decommitted. By using the same scheduling
+primitives as execution the allocation topology can be as arbitrarily complex as
+the invocation topology:
+
+```mermaid
+stateDiagram
+    direction LR
+    state fence0a <<fork>>
+    [*] --> fence0a
+    state fence0b <<fork>>
+    fence0a --> alloca0
+    fence0a --> alloca1
+    alloca0 --> fence0b
+    alloca1 --> fence0b
+    fence0b --> @fn0
+    state fence1a <<fork>>
+    @fn0 --> fence1a
+    state fence1b <<fork>>
+    fence1a --> dealloca0
+    dealloca0 --> fence1b
+    fence1b --> @fn1
+    fence1b --> @fn2
+    state fence2a <<join>>
+    @fn1 --> fence2a
+    @fn2 --> fence2a
+    fence2a --> dealloca1
+    state fence2b <<join>>
+    dealloca1 --> fence2b
+    fence2b --> [*]
+```
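+
+A sketch of the same topology using the scheduling primitives from the earlier
+examples (`async_alloca`/`async_dealloca` are illustrative names, not a real
+API):
+
+```python
+fence_0a, fence_0b = my_timeline.at(t), my_timeline.at(t+1)
+fence_1a, fence_1b = my_timeline.at(t+2), my_timeline.at(t+3)
+fence_2a, fence_2b = my_timeline.at(t+4), my_timeline.at(t+5)
+# Transient storage is reserved now but only committed in timeline order.
+buf0 = async_alloca(size0, fence_0a, fence_0b)
+buf1 = async_alloca(size1, fence_0a, fence_0b)
+async_invoke(@fn0, fence_0b, fence_1a, buf0, buf1)
+async_dealloca(buf0, fence_1a, fence_1b)  # decommitted once @fn0 has finished
+async_invoke(@fn1, fence_1b, fence_2a, buf1)
+async_invoke(@fn2, fence_1b, fence_2a, buf1)
+async_dealloca(buf1, fence_2a, fence_2b)  # released after @fn1 and @fn2
+fence_2b.wait()
+```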
+
+When operating in this way, allocations from the host perspective are just
+reservations for a slice of pooled storage that will be committed at some point
+in the future. Likewise, deallocations from the host perspective release the
+prior reservation and schedule the paired decommit at some point in the future.
+Scheduling N sequential invocations thus requires only enough committed storage
+for a single invocation in addition to the I/O (unless that too is
+stream-ordered).
+
+This scheduling behavior allows for both minimal peak memory consumption
+regardless of the number of programs or invocation pipeline depth and sharing
+of committed storage across programs: the memory consumption of a program at
+rest is near zero when stateless and the sum of all state when stateful. Target
+devices that natively support stream-ordered allocations (like CUDA) can even
+share pools across processes.
+
+Another feature this provides, in combination with the forward progress
+guaranteed by fences, is that execution can continue even when memory is
+constrained so long as the pool can service a single request at a time. A
+device can serialize two independent invocations each requiring 400MB of
+transient memory when the system only has 512MB available, with no user-visible
+impact besides increased latency. This does require that the user schedule work
+that is actually possible to run, or that the target system support paging to
+relax the strictness of the pool quotas.
+
+Stream-ordered allocations performed by the user for invocation inputs can be
+declared as transferred to the program. This allows the program to eagerly
+deallocate or reuse the input storage while still preserving the internal
+scheduling requirements of the program.
+
+### Internal State
+
+A stateful program may contain internal timelines that it uses to order its own
+execution. Take for example
+[this simple stateful program](https://github.com/google/iree-jax/blob/main/tests/program/dynamic_state.py):
+
+```python
+class TrivialKernel(Program):
+  _x0 = Program.export_global(x_type)
+  def get(self):
+    return self._x0
+  def set(self, x=x_type):
+    self._x0 = x
+  def matmul(self, x=y_type):
+    self._x0 = self._matmul(x, self._x0)
+  @Program.kernel
+  def _matmul(x, x0):
+    return jnp.matmul(x, x0)
+```
+
+Each invocation of `matmul` needs to be executed in-order with prior invocations
+as there is a data dependency established on `self._x0`. Attempts to `get` or
+`set` must also be sequenced correctly with the `matmul` invocations. A basic
+usage like this:
+
+```python
+m = TrivialKernel()
+m.set(input)
+m.matmul(a)
+m.matmul(b)
+m.matmul(c)
+output = m.get()
+print(output)  # implicit wait
+```
+
+Would be executed as:
+
+```mermaid
+sequenceDiagram
+    activate User
+    User->>TrivialKernel: @set(input)
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    activate Device
+    TrivialKernel->>User: ;
+    User->>TrivialKernel: @matmul(a)
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    TrivialKernel->>User: ;
+    User->>TrivialKernel: @matmul(b)
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    TrivialKernel->>User: ;
+    User->>TrivialKernel: @matmul(c)
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    TrivialKernel->>User: ;
+    User->>TrivialKernel: @get()
+    activate TrivialKernel
+    TrivialKernel-->>Device: ;
+    deactivate TrivialKernel
+    TrivialKernel->>User: ;
+    Device-->>Device: ;
+    deactivate User
+    User->>User: (wait)
+    Device-->>User: (signal)
+    deactivate Device
+    activate User
+    User->>User: print(output)
+    deactivate User
+```
+
+Note that although the user provided no timeline of their own, execution is
+still ordered correctly due to the internal timeline constructed by the
+program. If the user wanted to also pipeline execution with another program,
+they could do so by providing their own fences.
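+
+As a hedged sketch at the level of the IREE ABI (reusing the illustrative
+`async_invoke`/fence pseudo-API from earlier; `@other_program_fn` is a made-up
+name), the user could chain this program's `matmul` behind work from another
+program:
+
+```python
+fence_a = my_timeline.at(t)
+fence_b = my_timeline.at(t+1)
+# Another program's output becomes available when fence_a is reached...
+async_invoke(@other_program_fn, None, fence_a)
+# ...and this program's matmul is scheduled against it without blocking.
+async_invoke(@matmul, fence_a, fence_b)
+fence_b.wait()
+```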
diff --git a/docs/website/docs/developers/general/contributing-ci-enabled-jobs.png b/docs/website/docs/developers/general/contributing-ci-enabled-jobs.png
new file mode 100644
index 0000000..f92be6f
--- /dev/null
+++ b/docs/website/docs/developers/general/contributing-ci-enabled-jobs.png
Binary files differ
diff --git a/docs/website/docs/developers/general/contributing-ci-extra.png b/docs/website/docs/developers/general/contributing-ci-extra.png
new file mode 100644
index 0000000..f6c3573
--- /dev/null
+++ b/docs/website/docs/developers/general/contributing-ci-extra.png
Binary files differ
diff --git a/docs/website/docs/developers/general/contributing.md b/docs/website/docs/developers/general/contributing.md
new file mode 100644
index 0000000..5041630
--- /dev/null
+++ b/docs/website/docs/developers/general/contributing.md
@@ -0,0 +1,262 @@
+# Contributing
+
+This is a more detailed version of the top-level
+[CONTRIBUTING.md](https://github.com/openxla/iree/blob/main/CONTRIBUTING.md)
+file. We keep it separate to avoid everyone getting a pop-up when creating a PR
+each time it changes.
+
+<!-- TODO(scotttodd): Update this document
+    * pull more text into this, update that to point to the website
+    * document access controls (join organization then team)
+    * document revert policy
+    * document where new community members should start
+-->
+
+## Build systems
+
+IREE supports building from source with both Bazel and CMake.
+
+* CMake is the preferred build system and offers the most flexible
+  configuration options
+* Bazel is a stricter build system and helps with usage in Google's downstream
+  source repository
+* Certain dependencies (think large/complex projects like CUDA, TensorFlow,
+  PyTorch, etc.) may be difficult to support with one build system or the
+  other, so the project may configure these as optional
+
+## Continuous integration (CI)
+
+IREE uses [GitHub Actions](https://docs.github.com/en/actions) for CI. The
+primary CI is configured in the
+[ci.yml workflow file](https://github.com/openxla/iree/blob/main/.github/workflows/ci.yml).
+
+### Self-hosted runners
+
+In addition to the default runners GitHub provides, IREE uses
+[self-hosted runners](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners)
+to run many of its workflow jobs. These enable access to additional compute and
+custom configurations such as accelerators. Configuration scripting is checked
+in to this repository (see the
+[README for that directory](https://github.com/openxla/iree/blob/main/build_tools/github_actions/runner/README.md)).
+
+### Custom managed runners
+
+In addition to our self-hosted runners, we use GitHub's
+[large managed runners](https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners)
+for some platforms that are more trouble to configure ourselves (e.g. Mac).
+
+### CI behavior manipulation
+
+The setup step of the CI determines which CI jobs to run. This is controlled by
+the
+[configure_ci.py](https://github.com/openxla/iree/blob/main/build_tools/github_actions/configure_ci.py)
+script. It will generally run a pre-determined set of jobs on presubmit with
+some jobs kept as post-submit only. If changes are only to a certain set of
+excluded files that we know don't affect CI (e.g. docs), then it will skip the
+jobs. You can customize which jobs run using
+[git trailers](https://git-scm.com/docs/git-interpret-trailers) in the PR
+description. The available options are
+
+``` text
+ci-skip: jobs,to,skip
+ci-extra: extra,jobs,to,run
+ci-exactly: exact,set,of,jobs,to,run
+skip-ci: free form reason
+skip-llvm-integrate-benchmark: free form reason
+benchmark-extra: extra,benchmarks,to,run
+runner-env: [testing|prod]
+```
+
+The first three follow the same format and instruct the setup script on which
+jobs to include or exclude from its run. They take a comma-separated list of
+jobs, which must be from the set of top-level job identifiers in the ci.yml
+file, or the special keyword "all" to indicate all jobs. `ci-skip` removes
+jobs that
+would otherwise be included, though it is not an error to list jobs that would
+not be included by default. `ci-extra` adds additional jobs that would not have
+otherwise been run, though it is not an error to list jobs that would have been
+included anyway. It *is* an error to list a job in both of these fields.
+`ci-exactly` provides an exact list of jobs that should run. It is mutually
+exclusive with both `ci-skip` and `ci-extra`. In all these cases, the setup does
+not make any effort to ensure that job dependencies are satisfied. Thus, if you
+request skipping the `build_all` job, all the jobs that depend on it will fail,
+not be skipped. `skip-ci` is an older option that simply skips all jobs. Its
+usage is deprecated and it is mutually exclusive with all of the other `ci-*`
+options. Prefer `ci-skip: all`.
+
+Benchmarks don't run by default on PRs, and must be specifically requested. They
+*do* run by default on PRs detected to be an integration of LLVM into IREE, but
+this behavior can be disabled with `skip-llvm-integrate-benchmark`. The
+`benchmark-extra` option allows specifying additional benchmark presets to run
+as part of benchmarking. It accepts a comma-separated list of benchmark presets.
+This combines with labels added to the PR (which are a more limited set of
+options). See the
+[benchmark suites documentation](../performance/benchmark-suites.md).
+
+The `runner-env` option controls which runner environment to target for our
+self-hosted runners. We maintain a test environment to allow testing out new
+configurations prior to rolling them out. This trailer is for advanced users who
+are working on the CI infrastructure itself.
+
+#### CI configuration recipes
+
+Copy/paste any of these at the bottom of a PR description to change what the CI
+runs.
+
+* Also run Windows and macOS builds that are normally post-merge only:
+
+  ``` text
+  ci-extra: build_test_all_windows,build_test_all_macos_arm64,build_test_all_macos_x86_64
+  ```
+
+* Also run GPU tests on NVIDIA A100 runners (opt-in due to low availability):
+
+  ``` text
+  ci-extra: test_a100
+  ```
+
+* Skip all CI builds and tests, e.g. for comment-only changes:
+
+  ``` text
+  skip-ci: Comment-only change.
+  ```
+
+* Only run Bazel builds, e.g. for changes only affecting Bazel rules:
+
+  ``` text
+  ci-exactly: build_test_all_bazel
+  ```
+
+For example, this PR opted in to running the `build_test_all_windows` job:
+
+![ci-extra](./contributing-ci-extra.png)
+
+The enabled jobs can be viewed from the Summary page of an action run:
+
+![ci_enabled_jobs](./contributing-ci-enabled-jobs.png)
+
+## Contributor tips
+
+These are opinionated tips documenting workflows that some members of the team
+have found useful. They are focused on meta-tooling, not on IREE code
+specifically (you will find the latter in the
+[Developer Overview](./developer-overview.md)).
+
+!!! note
+
+    It is certainly possible to use workflows other than these. Some common
+    tasks, especially for maintainers, will likely be made easier by using
+    these flows.
+
+We assume a basic knowledge of `git` and GitHub and suggest some specific ways
+of using them.
+
+### Useful tools
+
+* GitHub CLI (<https://github.com/cli/cli>). A CLI for interacting with GitHub.
+    Most importantly, it allows scripting the creation of pull requests.
+* Refined GitHub Chrome and Firefox Extension:
+    <https://github.com/sindresorhus/refined-github>. Nice extension that adds a
+    bunch of features to the GitHub UI.
+* VSCode: <https://code.visualstudio.com/>. The most commonly used IDE amongst
+    IREE developers.
+* [Ccache](https://ccache.dev/), a fast C/C++ compiler cache. See our
+  [CMake with `ccache`](../building/cmake-with-ccache.md) page.
+
+### Git structure
+
+We tend to use the "triangular" or "forking" workflow. Develop primarily on a
+clone of the repository on your development machine. Any local branches named
+the same as persistent branches from the
+[main repository](https://github.com/openxla/iree) are pristine (though
+potentially stale) copies. You only fast-forward these to match upstream and
+otherwise do development on other branches. When sending PRs, you push to a
+different branch on your public fork and create the PR from there.
+
+<!-- TODO(scotttodd): screenshots / diagrams here
+  (https://mermaid.js.org/syntax/gitgraph.html?) -->
+
+#### Setup
+
+1. Create a fork of the main repository.
+
+2. Create a local git repository with remotes `upstream` (the main repository)
+    and `origin` (your personal fork). To list your current remotes, run
+    `git remote -v`.
+
+    a. If you already cloned from the main repository (e.g. by following the
+    getting started guide):
+
+    ```shell
+    # From your existing git repo
+    $ git remote rename origin upstream
+    $ git remote add origin https://github.com/<github_username>/iree.git
+    ```
+
+    b. If you haven't already cloned:
+
+    ```shell
+    # From whatever directory under which you want to nest your repo
+    $ git clone https://github.com/<github_username>/iree.git
+    $ cd iree
+    $ git remote add upstream https://github.com/openxla/iree.git
+    ```
+
+    This is especially important for maintainers who have write access (so can
+    push directly to the main repository) and admins who have elevated
+    privileges (so can push directly to protected branches). These names are
+    just suggestions, but you might find some scripts where the defaults are for
+    remotes named like this. For extra safety, you can make it difficult to push
+    directly to upstream by setting the push url to something invalid: `git
+    remote set-url --push upstream DISABLE`, which requires re-enabling the push
+    URL explicitly before pushing. You can wrap this behavior in a custom git
+    command like
+    [git-sudo](https://gist.github.com/GMNGeoffrey/42dd9a9792390094a43bdb69659320c0).
+
+3. Use a script like
+    [git_update.sh](https://github.com/openxla/iree/blob/main/build_tools/scripts/git/git_update.sh)
+    to easily synchronize `main` with `upstream`. Submodules make this a
+    little trickier than it should be. You can also turn this into a git command
+    by adding it to your path as `git-update`.
+
+#### Git config
+
+These are some additional options you could put in your top-level `.gitconfig`
+or repository-specific `.git/config` files that are conducive to the
+recommended workflow:
+
+```ini
+[push]
+  default = current
+[alias]
+  # Delete branches that you pushed and have been deleted upstream, e.g. because
+  # the PR was merged.
+  gone = ! "git fetch -p  && git for-each-ref --format '%(refname:short) %(upstream:track)' | awk '$2 == \"[gone]\" {print $1}' | xargs -r git branch -D"
+  # Update from upstream (custom command) and delete obsolete local branches.
+  sync = ! (git update main && git gone)
+  # Create a new branch based off of main (requires a clean working directory).
+  new = "!f(){ \\\ngit checkout main && git switch -c $1; \\\n}; f"
+  # Display branches in a useful "latest last" format
+  br = for-each-ref --sort=committerdate refs/heads/ --format='%(HEAD) %(color:yellow)%(refname:short)%(color:reset) - %(color:red)%(objectname:short)%(color:reset) - %(contents:subject) (%(color:green)%(committerdate:relative)%(color:reset))'
+  # `git git foo` -> `git foo` typo fixer
+  git = "!f(){ \\\n git \"$@\"; \\\n}; f"
+  # Get the git root directory
+  root = rev-parse --show-toplevel
+  # checkout, but also sync submodules
+  ch = "!f() { \\\n git checkout \"$@\"; git submodule sync && git submodule update --init; \\\n}; f"
+  # See the diff for a PR branch vs the main branch
+  diffmain = diff --merge-base main
+  # See only the files that differ vs the main branch
+  whatsout = diffmain --name-only
+[checkout]
+  # If a branch name matches multiple remotes, check it out from origin.
+  defaultRemote = origin
+[pull]
+  # When pulling, only complete the pull if it's a clean fast-forward.
+  ff = only
+[remote]
+  # Push to your fork (origin) by default
+  pushDefault = origin
+[url "ssh://git@github.com/"]
+  # Pull with https (so no auth required), but push with ssh.
+  pushInsteadOf = https://github.com/
+```
diff --git a/docs/website/docs/developers/general/developer-overview.md b/docs/website/docs/developers/general/developer-overview.md
new file mode 100644
index 0000000..94b51f0
--- /dev/null
+++ b/docs/website/docs/developers/general/developer-overview.md
@@ -0,0 +1,209 @@
+# Developer overview
+
+This guide provides an overview of IREE's project structure and main tools for
+developers.
+
+## Project code layout
+
+* [/compiler/](https://github.com/openxla/iree/blob/main/compiler/):
+  MLIR dialects, LLVM compiler passes, module translation code, etc.
+    * [bindings/](https://github.com/openxla/iree/blob/main/compiler/bindings/):
+    Python and other language bindings
+* [/runtime/](https://github.com/openxla/iree/tree/main/runtime/):
+  Standalone runtime code including the VM and HAL drivers
+    * [bindings/](https://github.com/openxla/iree/tree/main/runtime/bindings/):
+    Python and other language bindings
+* [/integrations/](https://github.com/openxla/iree/blob/main/integrations/):
+  Integrations between IREE and other frameworks, such as TensorFlow
+* [/tests/](https://github.com/openxla/iree/blob/main/tests/):
+  Tests for full compiler->runtime workflows
+* [/tools/](https://github.com/openxla/iree/blob/main/tools/):
+  Developer tools (`iree-compile`, `iree-run-module`, etc.)
+* [/samples/](https://github.com/openxla/iree/blob/main/samples/): Also see the
+  separate <https://github.com/iree-org/iree-samples> repository
+
+## IREE compiler code layout
+
+* [API/](https://github.com/openxla/iree/tree/main/compiler/src/iree/compiler/API):
+  Public C API
+* [Codegen/](https://github.com/openxla/iree/tree/main/compiler/src/iree/compiler/Codegen):
+  Code generation for compute kernels
+* [Dialect/](https://github.com/openxla/iree/tree/main/compiler/src/iree/compiler/Dialect):
+  MLIR dialects (`Flow`, `HAL`, `Stream`, `VM`, etc.)
+* [InputConversion/](https://github.com/openxla/iree/tree/main/compiler/src/iree/compiler/InputConversion):
+  Conversions from input dialects and preprocessing
+
+## IREE runtime code layout
+
+* [base/](https://github.com/openxla/iree/blob/main/runtime/src/iree/base/):
+  Common types and utilities used throughout the runtime
+* [hal/](https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/):
+  **H**ardware **A**bstraction **L**ayer for IREE's runtime, with
+  implementations for hardware and software backends
+* [schemas/](https://github.com/openxla/iree/blob/main/runtime/src/iree/schemas/):
+  Data storage format definitions, primarily using
+  [FlatBuffers](https://google.github.io/flatbuffers/)
+* [task/](https://github.com/openxla/iree/blob/main/runtime/src/iree/task/):
+  System for running tasks across multiple CPU threads
+* [tooling/](https://github.com/openxla/iree/blob/main/runtime/src/iree/tooling/):
+  Utilities for tests and developer tools, not suitable for use as-is in
+  downstream applications
+* [vm/](https://github.com/openxla/iree/blob/main/runtime/src/iree/vm/):
+  Bytecode **V**irtual **M**achine used to work with IREE modules and invoke
+  IREE functions
+
+## Developer tools
+
+IREE's core compiler accepts programs in supported input MLIR dialects (e.g.
+`stablehlo`, `tosa`, `linalg`). Import tools and APIs may be used to convert
+from framework-specific formats like TensorFlow
+[SavedModel](https://www.tensorflow.org/guide/saved_model) to MLIR modules.
+While programs are ultimately compiled down to modules suitable for running on
+some combination of IREE's target deployment platforms, IREE's developer tools
+can run individual compiler passes, translations, and other transformations step
+by step.
+
+### iree-opt
+
+`iree-opt` is a tool for testing IREE's compiler passes. It is similar to
+[mlir-opt](https://github.com/llvm/llvm-project/tree/main/mlir/tools/mlir-opt)
+and runs sets of IREE's compiler passes on `.mlir` input files. See "conversion"
+in [MLIR's Glossary](https://mlir.llvm.org/getting_started/Glossary/#conversion)
+for more information. Transformations performed by `iree-opt` can range from
+individual passes performing isolated manipulations to broad pipelines that
+encompass a sequence of steps.
+
+Test `.mlir` files that are checked in typically include a `RUN` block at the
+top of the file that specifies which passes should be performed and if
+`FileCheck` should be used to test the generated output.
+
+Here's an example of a small compiler pass running on a
+[test file](https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Dialect/Util/Transforms/test/drop_compiler_hints.mlir):
+
+```shell
+$ ../iree-build/tools/iree-opt \
+  --split-input-file \
+  --mlir-print-ir-before-all \
+  --iree-drop-compiler-hints \
+  $PWD/compiler/src/iree/compiler/Dialect/Util/Transforms/test/drop_compiler_hints.mlir
+```
+
+For a more complex example, here's how to run IREE's complete transformation
+pipeline targeting the VMVX backend on the
+[fullyconnected.mlir](https://github.com/openxla/iree/blob/main/tests/e2e/stablehlo_models/fullyconnected.mlir)
+model file:
+
+```shell
+$ ../iree-build/tools/iree-opt \
+  --iree-transformation-pipeline \
+  --iree-hal-target-backends=vmvx \
+  $PWD/tests/e2e/stablehlo_models/fullyconnected.mlir
+```
+
+### iree-compile
+
+`iree-compile` is IREE's main compiler driver for generating binaries from
+supported input MLIR assembly.
+
+For example, to translate `simple_abs.mlir` to an IREE module:
+
+```shell
+$ ../iree-build/tools/iree-compile \
+  --iree-hal-target-backends=vmvx \
+  $PWD/samples/models/simple_abs.mlir \
+  -o /tmp/simple_abs_vmvx.vmfb
+```
+
+### iree-run-module
+
+The `iree-run-module` program takes an already translated IREE module as input
+and executes an exported main function using the provided inputs.
+
+This program can be used in sequence with `iree-compile` to translate a
+`.mlir` file to an IREE module and then execute it. Here is an example command
+that executes the simple `simple_abs_vmvx.vmfb` compiled from `simple_abs.mlir`
+above on IREE's VMVX driver:
+
+```shell
+$ ../iree-build/tools/iree-run-module \
+  --module=/tmp/simple_abs_vmvx.vmfb \
+  --device=local-task \
+  --function=abs \
+  --input=f32=-2
+```
+
+### iree-check-module
+
+The `iree-check-module` program takes an already translated IREE module as input
+and executes it as a series of
+[googletest](https://github.com/google/googletest) tests. This is the test
+runner for the IREE [check framework](./testing-guide.md#end-to-end-tests).
+
+```shell
+$ ../iree-build/tools/iree-compile \
+  --iree-input-type=stablehlo \
+  --iree-hal-target-backends=vmvx \
+  $PWD/tests/e2e/xla_ops/abs.mlir \
+  -o /tmp/abs.vmfb
+```
+
+```shell
+$ ../iree-build/tools/iree-check-module \
+  --device=local-task \
+  --module=/tmp/abs.vmfb
+```
+
+### iree-run-mlir
+
+The `iree-run-mlir` program takes a `.mlir` file as input, translates it to an
+IREE bytecode module, and executes the module.
+
+It is designed for testing and debugging, not production uses, and therefore
+does some additional work that usually must be explicit, like marking every
+function as exported by default and running all of them.
+
+For example, to execute the contents of
+[samples/models/simple_abs.mlir](https://github.com/openxla/iree/blob/main/samples/models/simple_abs.mlir):
+
+```shell
+# iree-run-mlir <compiler flags> [input.mlir] <runtime flags>
+$ ../iree-build/tools/iree-run-mlir \
+  --iree-hal-target-backends=vmvx \
+  $PWD/samples/models/simple_abs.mlir \
+  --input=f32=-2
+```
+
+### iree-dump-module
+
+The `iree-dump-module` program prints the contents of an IREE module FlatBuffer
+file.
+
+For example, to inspect the module translated above:
+
+```shell
+../iree-build/tools/iree-dump-module /tmp/simple_abs_vmvx.vmfb
+```
+
+### Useful generic flags
+
+#### Read inputs from a file
+
+All the IREE tools support reading input values from a file. This is quite
+useful for debugging. Use `--help` with each tool to see which flag to set.
+The inputs are expected to be newline-separated. Each input should be either a
+scalar or a buffer. Scalars should be in the format `type=value` and buffers
+should be in the format `[shape]xtype=[value]`. For example:
+
+``` text
+1x5xf32=1,-2,-3,4,-5
+1x5x3x1xf32=15,14,13,12,11,10,9,8,7,6,5,4,3,2,1
+```
+
+#### `--iree-flow-trace-dispatch-tensors`
+
+This flag enables tracing inputs and outputs for each dispatch function. It
+makes it easier to narrow down test cases, since IREE breaks an ML workload
+into multiple dispatch functions. When the flag is on, IREE will insert trace
+points before and after each dispatch function. The first trace op is for
+inputs and the second is for outputs, so there will be two events per dispatch
+function.
diff --git a/docs/website/docs/guides/developer-tips.md b/docs/website/docs/developers/general/developer-tips.md
similarity index 98%
rename from docs/website/docs/guides/developer-tips.md
rename to docs/website/docs/developers/general/developer-tips.md
index f6a2988..04b8a67 100644
--- a/docs/website/docs/guides/developer-tips.md
+++ b/docs/website/docs/developers/general/developer-tips.md
@@ -1,8 +1,9 @@
----
+<!-- TODO(scotttodd): add icons to all developers/ pages -->
+<!-- ---
 icon: material/lightbulb-on
----
+--- -->
 
-# IREE developer tips and tricks
+# Developer tips and tricks
 
 The IREE compiler is built using [MLIR](https://mlir.llvm.org/), so it naturally
 supports the common
diff --git a/docs/website/docs/developers/general/release-management.md b/docs/website/docs/developers/general/release-management.md
new file mode 100644
index 0000000..33a3a57
--- /dev/null
+++ b/docs/website/docs/developers/general/release-management.md
@@ -0,0 +1,38 @@
+# Release management
+
+IREE cuts automated releases via a workflow that is
+[triggered daily](https://github.com/openxla/iree/blob/main/.github/workflows/schedule_candidate_release.yml).
+The only constraint placed on the commit that is released is that it has passed
+all CI checks. These are published on GitHub with the "pre-release" status. For
+debugging this process, see the
+[Release debugging playbook](../debugging/releases.md).
+
+We periodically promote one of these candidates to a "stable" release by
+removing the "pre-release" status. This makes it show up as a "latest" release
+on GitHub. We also push the Python packages for this release to PyPI.
+
+## Picking a candidate to promote
+
+When selecting a candidate we use the following criteria:
+
+1. ⪆4 days old so that problems with it may have been spotted
+2. Contains no P0 regressions vs the previous stable release
+3. LLVM submodule commit ideally exists upstream (no cherry picks or patches)
+
+When you've identified a potential candidate, email the iree-discuss list with
+the proposal and solicit feedback. People may point out known regressions or
+request that some feature make the cut.
+
+## Promoting a candidate to stable
+
+1. (Authorized users only) Push to PyPI using
+    [pypi_deploy.sh](https://github.com/openxla/iree/blob/main/build_tools/python_deploy/pypi_deploy.sh)
+
+    * For Googlers, the password is stored at <http://go/iree-pypi-password>
+
+2. Open the release on GitHub. Rename the release from "candidate" to "stable",
+    uncheck the option for "pre-release", and check the option for "latest".
+
+    ![rename_release](./release-renaming.png)
+
+    ![promote_release](./release-promotion.png)
diff --git a/docs/website/docs/developers/general/release-promotion.png b/docs/website/docs/developers/general/release-promotion.png
new file mode 100644
index 0000000..8633117
--- /dev/null
+++ b/docs/website/docs/developers/general/release-promotion.png
Binary files differ
diff --git a/docs/website/docs/developers/general/release-renaming.png b/docs/website/docs/developers/general/release-renaming.png
new file mode 100644
index 0000000..e8bc98f
--- /dev/null
+++ b/docs/website/docs/developers/general/release-renaming.png
Binary files differ
diff --git a/docs/website/docs/developers/general/testing-guide.md b/docs/website/docs/developers/general/testing-guide.md
new file mode 100644
index 0000000..59b08e1
--- /dev/null
+++ b/docs/website/docs/developers/general/testing-guide.md
@@ -0,0 +1,400 @@
+# Testing guide
+
+Like the IREE project in general, IREE tests are divided into a few different
+components and use different tooling depending on the needs of that component.
+
+| Test type       | Test                                            | Build system | Supported platforms |
+|:--------------  | :-----------------                              | -----------  | -------------       |
+| Compiler tests  | iree_lit_test                                   | Bazel/CMake  | Host                |
+| Runtime tests   | iree_cc_test                                    | Bazel/CMake  | Host/Device         |
+|                 | iree_native_test                                | Bazel/CMake  | Host/Device         |
+|                 | iree_hal_cts_test_suite                         | CMake        | Host/Device         |
+| Core E2E tests  | iree_check_test                                 | Bazel/CMake  | Host/Device         |
+|                 | iree_trace_runner_test                          | Bazel/CMake  | Host/Device         |
+|                 | iree_generated_trace_runner_test                | Bazel/CMake  | Host/Device         |
+|                 | iree_static_linker_test                         | CMake        | Host/Device         |
+
+There are also `*_test_suite` targets that group test targets with the same
+configuration together.
+
+## Compiler tests
+
+Tests for the IREE compilation pipeline are written as lit tests in the same
+style as MLIR.
+
+By convention, IREE includes tests for
+
+* printing and parsing of ops in `.../IR/test/{OP_CATEGORY}_ops.mlir` files
+* folding and canonicalization in `.../IR/test/{OP_CATEGORY}_folding.mlir` files
+* compiler passes and pipelines in other `.../test/*.mlir` files
+
+### Running a test
+
+For the test
+[`iree/compiler/Dialect/VM/Conversion/MathToVM/test/arithmetic_ops.mlir`](https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Dialect/VM/Conversion/MathToVM/test/arithmetic_ops.mlir)
+
+With CMake, run this from the build directory:
+
+```shell
+ctest -R iree/compiler/Dialect/VM/Conversion/MathToVM/test/arithmetic_ops.mlir.test
+```
+
+With Bazel, run this from the repo root:
+
+```shell
+bazel test //compiler/src/iree/compiler/Dialect/VM/Conversion/MathToVM/test:arithmetic_ops.mlir.test
+```
+
+### Writing a test
+
+For advice on writing MLIR compiler tests, see the
+[MLIR testing guide](https://mlir.llvm.org/getting_started/TestingGuide/). Tests
+should be `.mlir` files in a `test` directory adjacent to the functionality they
+are testing. Instead of `mlir-opt`, use `iree-opt`, which registers IREE
+dialects and passes and doesn't register some unnecessary core ones.
+
+As with most parts of the IREE compiler, these should not have a dependency on
+the runtime.
+
+### Configuring the build system
+
+In the Bazel BUILD file, create an `iree_lit_test_suite` rule. We usually create
+a single suite that globs all `.mlir` files in the directory and is called
+"lit".
+
+```python
+load("//iree/build_tools/bazel:iree_lit_test.bzl", "iree_lit_test_suite")
+
+iree_lit_test_suite(
+    name = "lit",
+    srcs = glob(["*.mlir"]),
+    tools = [
+        "@llvm-project//llvm:FileCheck",
+        "//tools:iree-opt",
+    ],
+)
+```
+
+There is a corresponding CMake function, calls to which will be generated by our
+[Bazel to CMake converter](https://github.com/openxla/iree/tree/main/build_tools/bazel_to_cmake/bazel_to_cmake.py).
+
+```cmake
+iree_lit_test_suite(
+  NAME
+    lit
+  SRCS
+    "arithmetic_ops.mlir"
+  DATA
+    FileCheck
+    iree-opt
+)
+```
+
+You can also create a test for a single file with `iree_lit_test`.
+
+## Runtime tests
+
+Tests for the runtime C++ code use the
+[GoogleTest](https://github.com/google/googletest) testing framework. They
+should generally follow the style and best practices of that framework.
+
+### Running a test
+
+For the test
+[`/runtime/src/iree/base/bitfield_test.cc`](https://github.com/openxla/iree/blob/main/runtime/src/iree/base/bitfield_test.cc):
+
+With CMake, run this from the build directory:
+
+```shell
+ctest -R iree/base/bitfield_test
+```
+
+With Bazel, run this from the repo root:
+
+```shell
+bazel test //runtime/src/iree/base:bitfield_test
+```
+
+### Setting test environments
+
+Parallel testing for `ctest` can be enabled via the `CTEST_PARALLEL_LEVEL`
+environment variable. For example:
+
+```shell
+export CTEST_PARALLEL_LEVEL=$(nproc)
+```
+
+To use the Vulkan backend as test driver, you may need to select between a
+Vulkan implementation from SwiftShader and multiple Vulkan-capable hardware
+devices. This can be done via environment variables. See the
+[generic Vulkan setup](../vulkan-environment-setup.md#useful-environment-variables)
+page for details regarding these variables.
+
+For Bazel, you can persist the configuration in `user.bazelrc` to save typing.
+For example:
+
+```shell
+test:vkswiftshader --test_env="LD_LIBRARY_PATH=..."
+test:vkswiftshader --test_env="VK_LAYER_PATH=..."
+test:vknative --test_env="LD_LIBRARY_PATH=..."
+test:vknative --test_env="VK_LAYER_PATH=..."
+```
+
+Then you can use `bazel test --config=vkswiftshader` to select SwiftShader as
+the Vulkan implementation. Similarly for other implementations.
+
+### Writing a test
+
+For advice on writing tests in the GoogleTest framework, see the
+[GoogleTest primer](https://github.com/google/googletest/blob/main/docs/primer.md).
+Test files for source file `foo.cc` with build target `foo` should live in the
+same directory with source file `foo_test.cc` and build target `foo_test`. You
+should `#include` `iree/testing/gtest.h` instead of any of the gtest or gmock
+headers.
+
+As with all parts of the IREE runtime, these should not have a dependency on the
+compiler.
+
+### Configuring the build system
+
+In the Bazel BUILD file, create a `cc_test` target with your test file as the
+source and any necessary dependencies. Usually, you can link in a standard gtest
+main function. Use `iree/testing:gtest_main` instead of the `gtest_main` that
+comes with gtest.
+
+```python
+cc_test(
+    name = "arena_test",
+    srcs = ["arena_test.cc"],
+    deps = [
+        ":arena",
+        "//iree/testing:gtest_main",
+    ],
+)
+```
+
+We have created a corresponding CMake function `iree_cc_test` that mirrors the
+Bazel rule's behavior. Our
+[Bazel to CMake converter](https://github.com/openxla/iree/tree/main/build_tools/bazel_to_cmake/bazel_to_cmake.py)
+should generally derive the `CMakeLists.txt` file from the BUILD file:
+
+```cmake
+iree_cc_test(
+  NAME
+    arena_test
+  SRCS
+    "arena_test.cc"
+  DEPS
+    ::arena
+    iree::testing::gtest_main
+)
+```
+
+There are other, more specific test targets, such as `iree_hal_cts_test_suite`,
+which are designed to test specific runtime support with template configuration
+and are not supported by Bazel rules.
+
+## IREE core end-to-end (e2e) tests
+
+Here "end-to-end" means from the input accepted by the IREE core compiler
+(dialects like TOSA, StableHLO, Linalg) to execution using the IREE runtime
+components. It does not include tests of the integrations with ML frameworks
+(e.g. TensorFlow, PyTorch) or bindings to other languages (e.g. Python).
+
+We avoid using the more traditional `lit` tests used elsewhere in the compiler
+for runtime execution tests. Lit tests require running the compiler tools on
+the test platform through shell or python scripts that act on files from a local
+file system. On platforms like Android, the web, and embedded systems, each of
+these features is either not available or is severely limited.
+
+Instead, to test these flows we use a custom framework called `check`. The check
+framework compiles test programs on the host machine into standalone test binary
+files that can be pushed to test devices (such as Android phones) where they
+run with gtest style assertions (e.g. `check.expect_almost_eq(lhs, rhs)`).
+
+### Building e2e tests
+
+The files needed by these tests are not built by default with CMake. You'll
+need to build the special `iree-test-deps` target to generate test files prior
+to running CTest (from the build directory):
+
+```shell
+cmake --build . --target iree-test-deps
+```
+
+To run e2e model tests in
+[generated_e2e_model_tests.cmake](https://github.com/openxla/iree/tree/main/tests/e2e/stablehlo_models/generated_e2e_model_tests.cmake),
+because of their dependencies, `-DIREE_BUILD_E2E_TEST_ARTIFACTS=ON` needs to be
+set when configuring CMake. Also see
+[IREE Benchmark Suite Prerequisites](../performance/benchmark-suites.md#prerequisites)
+for required packages.
+
+### Running a test
+
+For the test
+[`tests/e2e/xla_ops/floor.mlir`](https://github.com/openxla/iree/tree/main/tests/e2e/xla_ops/floor.mlir)
+compiled for the VMVX target backend and running on the VMVX driver (here they
+match exactly, but in principle there's a many-to-many mapping from backends to
+drivers).
+
+With CMake, run this from the build directory:
+
+```shell
+ctest -R tests/e2e/xla_ops/check_vmvx_local-task_floor.mlir
+```
+
+With Bazel, run this from the repo root:
+
+```shell
+bazel test tests/e2e/xla_ops:check_vmvx_local-task_floor.mlir
+```
+
+### Setting test environments
+
+Similarly, you can use environment variables to select Vulkan implementations
+for running tests as explained in the [Runtime tests](#runtime-tests) section.
+
+### Writing a test
+
+These tests live in `tests/e2e`. A single test consists of a `.mlir` source
+file specifying an IREE module where each exported function takes no inputs and
+returns no results and corresponds to a single test case.
+
+As an example, here are some tests for the MHLO floor operation:
+
+```mlir
+func.func @tensor() {
+  %input = util.unfoldable_constant dense<[0.0, 1.1, 2.5, 4.9]> : tensor<4xf32>
+  %result = "mhlo.floor"(%input) : (tensor<4xf32>) -> tensor<4xf32>
+  check.expect_almost_eq_const(%result, dense<[0.0, 1.0, 2.0, 4.0]> : tensor<4xf32>): tensor<4xf32>
+  return
+}
+
+func.func @scalar() {
+  %input = util.unfoldable_constant dense<101.3> : tensor<f32>
+  %result = "mhlo.floor"(%input) : (tensor<f32>) -> tensor<f32>
+  check.expect_almost_eq_const(%result, dense<101.0> : tensor<f32>): tensor<f32>
+  return
+}
+
+func.func @negative() {
+  %input = util.unfoldable_constant dense<-1.1> : tensor<f32>
+  %result = "mhlo.floor"(%input) : (tensor<f32>) -> tensor<f32>
+  check.expect_almost_eq_const(%result, dense<-2.0> : tensor<f32>): tensor<f32>
+  return
+}
+```
+
+Test cases are created in gtest for each public function exported by the module.
+
+Note the use of `util.unfoldable_constant` to specify test constants. If we were
+to use a regular constant the compiler would fold away everything at compile
+time and our test would not actually test the runtime. `unfoldable_constant`
+adds a barrier that prevents folding. To prevent folding/constant propagation
+on an arbitrary SSA value you can use `util.optimization_barrier`.
+
+Next we use this input constant to exercise the runtime feature under test (in
+this case, just a single floor operation). Finally, we use a check dialect
+operation to make an assertion about the output. There are a few different
+[assertion operations](../../reference/mlir-dialects/Check.md).
+Here we use the `expect_almost_eq_const` op: *almost* because we are comparing
+floats and want to allow for floating-point imprecision, and *const* because we
+want to compare it to a constant value. This last part is just syntactic sugar
+around
+
+```mlir
+%expected = arith.constant dense<101.0> : tensor<f32>
+check.expect_almost_eq(%result, %expected) : tensor<f32>
+```
+
+The output of running this test looks like:
+
+```txt
+[==========] Running 4 tests from 1 test suite.
+[----------] Global test environment set-up.
+[----------] 4 tests from module
+[ RUN      ] module.tensor
+[       OK ] module.tensor (76 ms)
+[ RUN      ] module.scalar
+[       OK ] module.scalar (79 ms)
+[ RUN      ] module.double
+[       OK ] module.double (55 ms)
+[ RUN      ] module.negative
+[       OK ] module.negative (54 ms)
+[----------] 4 tests from module (264 ms total)
+
+[----------] Global test environment tear-down
+[==========] 4 tests from 1 test suite ran. (264 ms total)
+[  PASSED  ] 4 tests.
+```
+
+The "module" name for the test suite comes from the default name for an implicit
+MLIR module. To give the test suite a more descriptive name, use an explicit
+named top-level module in this file.
+
+### Configuring the build system
+
+A single `.mlir` source file can be turned into a test target with the
+`iree_check_test` Bazel macro (and corresponding CMake function).
+
+```python
+load("//build_tools/bazel:iree_check_test.bzl", "iree_check_test")
+
+iree_check_test(
+    name = "check_vmvx_local-task_floor.mlir",
+    src = "floor.mlir",
+    driver = "local-task",
+    target_backend = "vmvx",
+)
+```
+
+The target naming convention is "check_backend_driver_src". The generated test
+will automatically be tagged with a "driver=vmvx" tag, which can help filter
+tests by backend (especially when many tests are generated, as below).
+
+Usually we want to create a suite of tests across many backends and drivers.
+This can be accomplished with additional macros. For a single backend/driver
+pair:
+
+```python
+load("//build_tools/bazel:iree_check_test.bzl", "iree_check_single_backend_test_suite")
+
+iree_check_single_backend_test_suite(
+    name = "check_vmvx_local-task",
+    srcs = glob(["*.mlir"]),
+    driver = "local-task",
+    target_backend = "vmvx",
+)
+```
+
+This will generate a separate test target for each file in `srcs` with a name
+following the convention above as well as a Bazel
+[test_suite](https://bazel.build/reference/be/general#test_suite)
+called "check_vmvx_local-task" that will run all the generated tests.
+
+You can also generate suites across multiple pairs:
+
+```python
+load("//build_tools/bazel:iree_check_test.bzl", "iree_check_test_suite")
+
+iree_check_test_suite(
+    name = "check",
+    srcs = ["success.mlir"],
+    # Leave this argument off to run on all supported backend/driver pairs.
+    target_backends_and_drivers = [
+        ("vmvx", "local-task"),
+        ("vulkan-spirv", "vulkan"),
+    ],
+)
+```
+
+This will create a test per source file and backend/driver pair, a test suite
+per backend/driver pair, and a test suite, "check", that will run all the tests.
+
+The CMake functions follow a similar pattern. The calls to them are generated in
+our `CMakeLists.txt` file by
+[bazel_to_cmake](https://github.com/openxla/iree/tree/main/build_tools/bazel_to_cmake/bazel_to_cmake.py).
+
+There are other test targets that generate tests based on template configuration
+and platform detection, such as `iree_static_linker_test`. Those targets are
+not supported by Bazel rules at this point.
diff --git a/docs/website/docs/developers/index.md b/docs/website/docs/developers/index.md
new file mode 100644
index 0000000..226be22
--- /dev/null
+++ b/docs/website/docs/developers/index.md
@@ -0,0 +1,6 @@
+# Developers
+
+These pages cover topics useful for project maintainers and contributors.
+
+!!! caution
+    Some of these pages may be stale. Contributions are always welcome!
diff --git a/docs/website/docs/developers/performance/benchmark-suites.md b/docs/website/docs/developers/performance/benchmark-suites.md
new file mode 100644
index 0000000..9c8f565
--- /dev/null
+++ b/docs/website/docs/developers/performance/benchmark-suites.md
@@ -0,0 +1,316 @@
+# Benchmark suites
+
+IREE Benchmark Suites is a collection of benchmarks for IREE developers to
+track performance improvements/regressions during development.
+
+The benchmark suites are run for each commit on the main branch and the results
+are uploaded to <https://perf.iree.dev> for regression analysis (for the
+currently supported targets). On pull requests, users can add `benchmarks:*`
+labels to trigger benchmark runs. The results will be compared with
+<https://perf.iree.dev> and posted in the comments.
+
+Information about the definitions of the benchmark suites can be found in the
+[IREE Benchmark Suites Configurations](https://github.com/openxla/iree/blob/main/build_tools/python/benchmark_suites/iree/README.md).
+
+## Running benchmark suites locally
+
+### Prerequisites
+
+Install `iree-import-tf` and `iree-import-tflite` in your Python environment
+(see
+[Tensorflow Integration](https://iree.dev/guides/ml-frameworks/tensorflow/)
+and
+[TFLite Integration](https://iree.dev/guides/ml-frameworks/tflite/)).
+
+### Choose benchmark presets
+
+IREE Benchmark Suites contain many benchmarks for different devices and model
+sizes, so building all of them takes considerable space and time. Benchmarks
+are therefore grouped into presets to allow building and running only a subset
+of them. The available presets are:
+
+Execution benchmarks:
+
+- `android-cpu`: benchmarks for mobile CPUs
+- `android-gpu`: benchmarks for mobile GPUs
+- `cuda`: benchmarks for CUDA with a small model set
+- `cuda-large`: benchmarks for CUDA with a large model set
+- `vulkan-nvidia`: benchmarks for Vulkan on NVIDIA graphics cards
+- `x86_64`: benchmarks for x86_64 CPUs with a small model set
+- `x86_64-large`: benchmarks for x86_64 with a large model set
+
+Compilation benchmarks (to collect compilation statistics, such as module
+sizes):
+
+- `comp-stats`: compilation benchmarks with a small model set
+- `comp-stats-large`: compilation benchmark with a large model set
+
+Note that `*-large` presets will download and build a few hundred GB of
+artifacts.
+
+Set environment variables with the chosen benchmark presets for the steps
+below, for example:
+
+```sh
+export EXECUTION_BENCHMARK_PRESETS="cuda,x86_64"
+export COMPILATION_BENCHMARK_PRESETS="comp-stats"
+```
+
+### Build benchmark suites
+
+Configure IREE with `-DIREE_BUILD_E2E_TEST_ARTIFACTS=ON`:
+
+```sh
+cmake -GNinja -B "${IREE_BUILD_DIR?}" -S "${IREE_REPO?}" \
+  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
+  -DCMAKE_C_COMPILER=clang \
+  -DCMAKE_CXX_COMPILER=clang++ \
+  -DIREE_ENABLE_LLD=ON \
+  -DIREE_BUILD_E2E_TEST_ARTIFACTS=ON
+```
+
+If you only need the imported MLIR models:
+
+```sh
+cmake --build "${IREE_BUILD_DIR?}" --target \
+  iree-benchmark-import-models
+  # For large benchmarks (this will take > 100G disk space)
+  # iree-benchmark-import-models-large
+```
+
+Otherwise, compile the benchmark suites and tools for benchmarking:
+
+```sh
+cmake --build "${IREE_BUILD_DIR?}" --target \
+  iree-benchmark-suites \
+  # If any *-large preset is enabled, also build this target:
+  # iree-benchmark-suites-large \
+  iree-benchmark-module
+export E2E_TEST_ARTIFACTS_DIR="${IREE_BUILD_DIR?}/e2e_test_artifacts"
+```
+
+> TODO(#13683): Each preset should have its own target to further reduce
+> unnecessary builds
+
+### Run benchmarks
+
+Export the execution benchmark config:
+
+```sh
+build_tools/benchmarks/export_benchmark_config.py execution \
+  --benchmark_presets="${EXECUTION_BENCHMARK_PRESETS?}" \
+  > "${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json"
+```
+
+Run benchmarks (currently only supported on a Linux host):
+
+```sh
+build_tools/benchmarks/run_benchmarks_on_linux.py \
+  --normal_benchmark_tool_dir="${IREE_BUILD_DIR?}/tools" \
+  --e2e_test_artifacts_dir="${E2E_TEST_ARTIFACTS_DIR?}" \
+  --execution_benchmark_config="${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json" \
+  --target_device_name="<target_device_name, e.g. c2-standard-16>" \
+  --output="${E2E_TEST_ARTIFACTS_DIR?}/benchmark_results.json" \
+  --verbose \
+  --cpu_uarch="<host CPU uarch, e.g. CascadeLake>"
+# Traces can be collected by adding:
+# --traced_benchmark_tool_dir="${IREE_TRACED_BUILD_DIR?}/tools" \
+# --trace_capture_tool=/path/to/iree-tracy-capture \
+# --capture_tarball=captured_tracy_files.tar.gz
+```
+
+Note that:
+
+- `<target_device_name>` selects a benchmark group that targets a specific
+    device:
+    - Common options:
+        - `c2-standard-16` for x86_64 CPU benchmarks.
+        - `a2-highgpu-1g` for NVIDIA GPU benchmarks.
+    - All device names are defined under
+        [build_tools/python/e2e_test_framework/device_specs](https://github.com/openxla/iree/tree/main/build_tools/python/e2e_test_framework/device_specs).
+- To run x86_64 benchmarks, `--cpu_uarch` currently needs to be provided and
+    only `CascadeLake` is available.
+- To build traced benchmark tools, see
+    [Profiling with Tracy](profiling-with-tracy.md).
+
+Filters can be used to select the benchmarks:
+
+```sh
+build_tools/benchmarks/run_benchmarks_on_linux.py \
+  --normal_benchmark_tool_dir="${IREE_BUILD_DIR?}/tools" \
+  --e2e_test_artifacts_dir="${E2E_TEST_ARTIFACTS_DIR?}" \
+  --execution_benchmark_config="${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json" \
+  --target_device_name="c2-standard-16" \
+  --output="${E2E_TEST_ARTIFACTS_DIR?}/benchmark_results.json" \
+  --verbose \
+  --cpu_uarch="CascadeLake" \
+  --model_name_regex="MobileBert*" \
+  --driver_filter_regex='local-task' \
+  --mode_regex="4-thread"
+```
+
+### Generate compilation statistics (compilation benchmarks)
+
+Export the compilation benchmark config:
+
+```sh
+build_tools/benchmarks/export_benchmark_config.py compilation \
+  --benchmark_presets="${COMPILATION_BENCHMARK_PRESETS?}" \
+  > "${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json"
+```
+
+Generate the compilation statistics:
+
+```sh
+build_tools/benchmarks/collect_compilation_statistics.py \
+  --compilation_benchmark_config="${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json" \
+  --e2e_test_artifacts_dir="${E2E_TEST_ARTIFACTS_DIR?}" \
+  --build_log="${IREE_BUILD_DIR?}/.ninja_log" \
+  --output="${E2E_TEST_ARTIFACTS_DIR?}/compile_stats_results.json"
+```
+
+Note that you need to use [Ninja](https://ninja-build.org/) to build the
+benchmark suites as the tool collects information from its build log.
+
+### Show execution / compilation benchmark results
+
+If you want to generate a comparison report locally, you can use
+[diff_local_benchmarks.py](https://github.com/openxla/iree/blob/main/build_tools/benchmarks/diff_local_benchmarks.py)
+script to compare two result json files and generate the report. For example:
+
+```sh
+build_tools/benchmarks/diff_local_benchmarks.py \
+  --base "${E2E_TEST_ARTIFACTS_DIR?}/before_benchmark_results.json" \
+  --target "${E2E_TEST_ARTIFACTS_DIR?}/after_benchmark_results.json" \
+  > report.md
+```
+
+An example that compares compilation statistics:
+
+```sh
+build_tools/benchmarks/diff_local_benchmarks.py \
+  --base-compile-stats "${E2E_TEST_ARTIFACTS_DIR?}/before_compile_stats_results.json" \
+  --target-compile-stats "${E2E_TEST_ARTIFACTS_DIR?}/after_compile_stats_results.json" \
+  > report.md
+```
+
+### Find compile and run commands to reproduce benchmarks
+
+Each benchmark in the benchmark suites has a benchmark ID. You can find a
+benchmark ID in several places:
+
+- In the series URL on <https://perf.iree.dev>
+    - Execution benchmark: `https://perf.iree.dev/serie?IREE?<benchmark_id>`
+    - Compilation benchmark:
+        `https://perf.iree.dev/serie?IREE?<benchmark_id>-<metric_id>`
+- In `benchmark_results.json` and `compile_stats_results.json`
+    - Execution benchmark result has a field `run_config_id`
+    - Compilation benchmark result has a field `gen_config_id`
+- In PR benchmark summary or the markdown generated by
+    `diff_local_benchmarks.py`, each benchmark has the link to its
+    <https://perf.iree.dev> URL, which includes the benchmark ID.
+
+If you don't have artifacts locally, see
+[Fetching Benchmark Artifacts from CI](#fetching-benchmark-artifacts-from-ci) to
+find the GCS directory of the CI artifacts. Then fetch the needed files:
+
+```sh
+# Get ${E2E_TEST_ARTIFACTS_DIR_URL} from "Fetching Benchmark Artifacts from CI".
+export E2E_TEST_ARTIFACTS_DIR="e2e_test_artifacts"
+
+# Download all artifacts
+mkdir "${E2E_TEST_ARTIFACTS_DIR?}"
+gcloud storage cp -r "${E2E_TEST_ARTIFACTS_DIR_URL?}" "${E2E_TEST_ARTIFACTS_DIR?}"
+```
+
+Run the helper tool to dump benchmark commands from benchmark configs:
+
+```sh
+build_tools/benchmarks/benchmark_helper.py dump-cmds \
+  --execution_benchmark_config="${E2E_TEST_ARTIFACTS_DIR?}/execution-benchmark-config.json" \
+  --compilation_benchmark_config="${E2E_TEST_ARTIFACTS_DIR?}/compilation-benchmark-config.json" \
+  --e2e_test_artifacts_dir="${E2E_TEST_ARTIFACTS_DIR?}" \
+  --benchmark_id="<benchmark_id>"
+```
+
+### Get full list of benchmarks
+
+The commands below output the full list of execution and compilation benchmarks,
+including the benchmark names and their flags:
+
+```sh
+build_tools/benchmarks/export_benchmark_config.py execution > "${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json"
+build_tools/benchmarks/export_benchmark_config.py compilation > "${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json"
+build_tools/benchmarks/benchmark_helper.py dump-cmds \
+  --execution_benchmark_config="${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json" \
+  --compilation_benchmark_config="${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json"
+```
+
+## Fetching benchmark artifacts from CI
+
+### 1. Find the corresponding CI workflow run
+
+On the commit of the benchmark run, you can find the list of the workflow jobs
+by clicking the green check mark. Click any job that starts with `CI /`:
+
+![image](https://user-images.githubusercontent.com/2104162/234647960-3df9d0f0-a34a-47ad-bda8-095ae44de865.png)
+
+### 2. Get URLs of GCS artifacts
+
+On the CI page, click `Summary` on the top-left to open the summary page. Scroll
+down to the section titled "Artifact Links", which lists the links to the
+artifacts. Paste its content in your shell to define all needed variables for
+the following steps:
+
+![image](https://user-images.githubusercontent.com/2104162/234716421-3a69b6ad-211d-4e39-8f9e-a4f22f91739d.png)
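+
+The pasted content defines environment variables along these lines (the bucket
+and paths below are hypothetical placeholders, not real values):
+
+```sh
+export E2E_TEST_ARTIFACTS_DIR_URL="gs://<bucket>/<workflow_run>/e2e-test-artifacts"
+export EXECUTION_BENCHMARK_RESULTS_DIR_URL="gs://<bucket>/<workflow_run>/benchmark-results"
+export COMPILATION_BENCHMARK_RESULTS_URL="gs://<bucket>/<workflow_run>/compilation-benchmark-results.json"
+```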
+
+### 3. Fetch the benchmark artifacts
+
+To fetch files from the GCS URL, the gcloud CLI tool
+(<https://cloud.google.com/sdk/docs/install>) can list the directory contents and
+download files (see <https://cloud.google.com/sdk/gcloud/reference/storage> for
+more usage details). If you want to use CI artifacts to reproduce benchmarks
+locally, see
+[Find compile and run commands to reproduce benchmarks](#find-compile-and-run-commands-to-reproduce-benchmarks).
+
+Assume you have the GCS URL variables from
+[Get URLs of GCS artifacts](#2-get-urls-of-gcs-artifacts).
+
+Download artifacts:
+
+```sh
+# The GCS directory has the same structure as your local ${IREE_BUILD_DIR?}/e2e_test_artifacts.
+gcloud storage ls "${E2E_TEST_ARTIFACTS_DIR_URL?}"
+
+# Download all source and imported MLIR files:
+gcloud storage cp "${E2E_TEST_ARTIFACTS_DIR_URL?}/*.mlir" "<target_dir>"
+```
+
+Execution and compilation benchmark configs can be downloaded at:
+
+```sh
+# Execution benchmark config:
+gcloud storage cp \
+  "${E2E_TEST_ARTIFACTS_DIR_URL?}/execution-benchmark-config.json" \
+  "${E2E_TEST_ARTIFACTS_DIR?}/exec_config.json"
+
+# Compilation benchmark config:
+gcloud storage cp \
+  "${E2E_TEST_ARTIFACTS_DIR_URL?}/compilation-benchmark-config.json" \
+  "${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json"
+```
+
+Benchmark raw results and traces can be downloaded at:
+
+```sh
+# Execution benchmark raw results
+gcloud storage cp "${EXECUTION_BENCHMARK_RESULTS_DIR_URL?}/benchmark-results-*.json" .
+
+# Optional: Merge raw results into a single file
+build_tools/benchmarks/benchmark_helper.py merge-results benchmark-results-*.json > benchmark_results.json
+
+# Execution benchmark traces
+gcloud storage cp "${EXECUTION_BENCHMARK_RESULTS_DIR_URL?}/benchmark-traces-*.tar.gz" .
+
+# Compilation benchmark results
+gcloud storage cp "${COMPILATION_BENCHMARK_RESULTS_URL?}" .
+```
diff --git a/docs/website/docs/developers/performance/benchmarking.md b/docs/website/docs/developers/performance/benchmarking.md
new file mode 100644
index 0000000..7844255
--- /dev/null
+++ b/docs/website/docs/developers/performance/benchmarking.md
@@ -0,0 +1,388 @@
+# Benchmarking
+
+IREE uses benchmarks to inspect performance at varying levels of granularity.
+Benchmarking is implemented using the
+[Google Benchmark library](https://github.com/google/benchmark). To understand
+performance details and guide optimization, please refer to the
+IREE [profiling](./profiling.md) documentation.
+
+## Module Benchmarks
+
+`iree-benchmark-module` is a program accepting (almost) the same inputs as
+`iree-run-module` that will benchmark the invocation of a single entry function.
+It measures timing for the whole process of invoking a function through the VM,
+including allocating and freeing output buffers. This is a high-level benchmark
+of an entire invocation flow. It provides a big picture view but, like an
+integration test, depends on many different variables. For finer-grained
+measurements more akin to unit tests, see
+[Executable Benchmarks](#executable-benchmarks).
+
+To use `iree-benchmark-module`, generate an IREE module for the target backend:
+
+```shell
+$ bazel run //tools:iree-compile -- \
+  --iree-hal-target-backends=vmvx \
+  $PWD/samples/models/simple_abs.mlir \
+  -o /tmp/module.fb
+```
+
+and then benchmark an exported function in that module:
+
+```shell
+$ bazel run //tools:iree-benchmark-module -- \
+  --module=/tmp/module.fb \
+  --device=local-task \
+  --function=abs \
+  --input=f32=-2
+```
+
+You'll see output like
+
+```shell
+Run on (12 X 4500 MHz CPU s)
+CPU Caches:
+  L1 Data 32K (x6)
+  L1 Instruction 32K (x6)
+  L2 Unified 1024K (x6)
+  L3 Unified 8448K (x1)
+Load Average: 2.21, 1.93, 3.34
+***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
+ be noisy and will incur extra overhead.
+***WARNING*** Library was built as DEBUG. Timings may be affected.
+------------------------------------------------------------------------------
+Benchmark                                    Time             CPU   Iterations
+------------------------------------------------------------------------------
+BM_RunModule/process_time/real_time       0.22 ms         0.23 ms         3356
+```
+
+Notice that there are a few warnings in there (you may not see all of these).
+The benchmark library helpfully warns about some common issues that will affect
+benchmark timing. When trying to obtain real benchmark numbers, you should
+generally build an optimized build (`-c opt` in Bazel) and
+[disable CPU scaling](#cpu-configuration).
+
+```shell
+bazel build -c opt //tools:iree-benchmark-module
+```
+
+Another thing to consider is that, depending on where you are running the
+benchmark, you might want to avoid having additional programs running at the
+same time. Bazel itself runs a server that can be quite a memory hog even when
+it's not being actively invoked, so we'll invoke the binary directly instead.
+Use your favorite process manager (e.g. [htop](https://hisham.hm/htop/) or
+[pkill](https://en.wikipedia.org/wiki/Pkill) on Linux) to kill heavy-weight
+programs such as Chrome and Bazel.
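+
+For example, a minimal sketch (the process and program names here are just
+illustrative; adjust them for your system):
+
+```shell
+# Stop the Bazel server and close other heavy-weight programs before benchmarking.
+bazel shutdown
+pkill chrome
+```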
+
+Now we'll actually invoke the binary:
+
+```shell
+$ ./bazel-bin/tools/iree-benchmark-module \
+  --module=/tmp/module.fb \
+  --device=local-task \
+  --function=abs \
+  --input=f32=-2
+```
+
+```shell
+Run on (12 X 4500 MHz CPU s)
+CPU Caches:
+  L1 Data 32K (x6)
+  L1 Instruction 32K (x6)
+  L2 Unified 1024K (x6)
+  L3 Unified 8448K (x1)
+Load Average: 1.49, 3.42, 3.49
+------------------------------------------------------------------------------
+Benchmark                                    Time             CPU   Iterations
+------------------------------------------------------------------------------
+BM_RunModule/process_time/real_time      0.011 ms        0.014 ms        61654
+```
+
+Remember to [restore CPU scaling](#cpu-configuration) when you're done.
+
+## Executable Benchmarks
+
+We also benchmark the performance of individual parts of the IREE system in
+isolation. IREE breaks a model down into dispatch functions. To benchmark all
+of the dispatch functions, generate an IREE module with the
+`-iree-flow-export-benchmark-funcs` flag set:
+
+```shell
+$ build/tools/iree-compile \
+  --iree-input-type=stablehlo \
+  --iree-flow-export-benchmark-funcs \
+  --iree-hal-target-backends=vmvx \
+  tests/e2e/stablehlo_models/fullyconnected.mlir \
+  -o /tmp/fullyconnected.vmfb
+```
+
+and then benchmark all exported dispatch functions (and all exported functions)
+in that module:
+
+```shell
+$ build/tools/iree-benchmark-module \
+  --module=/tmp/fullyconnected.vmfb \
+  --device=local-task
+```
+
+If no entry function is specified via `--function`, `iree-benchmark-module` will
+register a benchmark for each exported function that takes no inputs.
+
+You will see output like:
+
+```shell
+Run on (72 X 3700 MHz CPU s)
+CPU Caches:
+  L1 Data 32 KiB (x36)
+  L1 Instruction 32 KiB (x36)
+  L2 Unified 1024 KiB (x36)
+  L3 Unified 25344 KiB (x2)
+Load Average: 4.39, 5.72, 6.76
+---------------------------------------------------------------------------------------------
+Benchmark                                                   Time             CPU   Iterations
+---------------------------------------------------------------------------------------------
+BM_main_ex_dispatch_0_benchmark/process_time/real_time  0.030 ms        0.037 ms        34065
+BM_main_ex_dispatch_1_benchmark/process_time/real_time  0.034 ms        0.042 ms        20567
+BM_main_ex_dispatch_2_benchmark/process_time/real_time  0.043 ms        0.051 ms        18576
+BM_main_ex_dispatch_3_benchmark/process_time/real_time  0.029 ms        0.036 ms        21345
+BM_main_ex_dispatch_4_benchmark/process_time/real_time  0.042 ms        0.051 ms        15880
+BM_main_ex_dispatch_5_benchmark/process_time/real_time  0.030 ms        0.037 ms        17854
+BM_main_ex_dispatch_6_benchmark/process_time/real_time  0.043 ms        0.052 ms        14919
+BM_main_benchmark/process_time/real_time                0.099 ms        0.107 ms         5892
+```
+
+### Bytecode Module Benchmarks
+
+Normally, the IREE VM is expected to be integrated into applications to drive
+model execution, so its performance is of crucial importance. We strive to
+introduce as little overhead as possible and have several benchmark binaries
+dedicated to evaluating the VM's performance. These benchmark binaries are
+named `*_benchmark` in the
+[`iree/vm/`](https://github.com/openxla/iree/tree/main/runtime/src/iree/vm)
+directory. They also use the Google Benchmark library, as described above.
+
+## CPU Configuration
+
+When benchmarking, it's important to consider the configuration of your CPUs.
+Most notably, CPU scaling can give variable results, so you'll usually want to
+disable it. This can get pretty complex, but the most basic thing to do is to
+run all CPUs at maximum frequency. The other thing to consider is what CPU(s)
+your program is running on. Both of these get more complicated on mobile and in
+multithreaded workloads.
+
+### Linux
+
+Google benchmark provides some
+[instructions](https://github.com/google/benchmark#disabling-cpu-frequency-scaling).
+Note that the library will print "CPU scaling is enabled" warnings for any
+configuration that
+[doesn't have the scaling governor set to performance](https://github.com/google/benchmark/blob/3d1c2677686718d906f28c1d4da001c42666e6d2/src/sysinfo.cc#L228).
+Similarly, the CPU frequency it reports is the
+[maximum frequency of cpu0](https://github.com/google/benchmark/blob/3d1c2677686718d906f28c1d4da001c42666e6d2/src/sysinfo.cc#L533),
+not the frequency of the processor it's actually running on. This means that
+more advanced configurations should ignore these messages.
+
+Turn off CPU scaling before benchmarking:
+
+```shell
+sudo cpupower frequency-set --governor performance
+```
+
+Restore CPU scaling after benchmarking:
+
+```shell
+sudo cpupower frequency-set --governor powersave
+```
+
+To learn more about different scaling governor settings, see
+<https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt>. To restrict
+which CPUs you run on, use the `taskset` command, which takes a hexadecimal
+mask.
+
+To run only on the lowest-numbered CPU, you can run:
+
+```shell
+taskset 1 sleep 20 &
+```
+
+You can confirm that the process is running on the given CPU:
+
+```shell
+ps -o psr $!
+```
+
+Note that `$!` indicates the process ID of the last executed background command,
+so you can only use this shorthand if you didn't run any commands after the
+sleep. For more info on taskset, see <https://linux.die.net/man/1/taskset>.
+
+### Android
+
+Read and understand the [Linux](#linux) instructions first.
+
+Android doesn't give us quite as nice tooling, but the principle is basically
+the same. One important difference is that thermal throttling is a much bigger
+concern on mobile. Without a cooling plate, it is likely that high clock speeds
+will overheat the device and engage thermal throttling, which will ignore
+whatever clock speeds you may have set to prevent things from catching on fire.
+Therefore the naive approach above is likely not a good idea.
+
+You will likely need to be root (use `su` or `adb root`). The commands will
+depend on your exact phone and number of cores. First play around and make sure
+you understand what everything means. Note that each CPU has its own files which
+are used to control its behavior, but changes to a single CPU will sometimes
+affect others (see `/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus`).
+
+Some useful files:
+
+```shell
+/proc/cpuinfo
+/sys/devices/system/cpu/possible
+/sys/devices/system/cpu/present
+/sys/devices/system/cpu/cpu0/online
+/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
+/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
+/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
+/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
+/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
+/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
+/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus
+/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
+```
+
+See the clockspeed of each CPU:
+
+```shell
+$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
+    paste \
+      "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_cur_freq" \
+      "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_min_freq" \
+      "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_max_freq"; \
+done
+```
+
+Before changing things, make sure to check the current scaling governor settings
+first so you can put them back when you're done.
+
+```shell
+$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
+    cat "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
+done
+```
+
+#### Single-Core Example
+
+Here's an example to run IREE in a single-threaded context on CPU 7 at its
+lowest clock speed.
+
+First we'll take control of the clockspeed by setting the governor to
+"userspace".
+
+```shell
+$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
+  echo userspace > \
+    "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
+done
+```
+
+We can now set individual clock speeds. We'll pin cpu7 to its minimum frequency.
+We choose the minimum instead of the maximum here to mitigate thermal throttling
+concerns:
+
+```shell
+$ cat /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_min_freq > \
+/sys/devices/system/cpu/cpu7/cpufreq/scaling_setspeed
+```
+
+We can confirm the frequencies of all the CPUs by running the same command as
+above. Now to run a command specifically on cpu7, use `taskset 80`
+(hex for 10000000):
+
+```shell
+taskset 80 sleep 20 &
+ps -o psr $!
+```
+
+Remember to clean up when you're done! Here we'll set the scaling governor back
+to `schedutil`, because that's what it was before on the particular device this
+was tested on, but that governor may not exist on all devices.
+
+```shell
+$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
+  echo schedutil > \
+    "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
+done
+```
+
+#### Android Scripts
+
+We provide a few scripts to set clockspeeds on Android (under
+`build_tools/benchmarks`). These are somewhat device-specific:
+
+* The `set_android_scaling_governor.sh` script works on all CPUs, but the
+  default governor name may be different across devices.
+* The `set_*_gpu_scaling_policy.sh` script used should match the actual GPU on
+  your device.
+
+Sample configuration steps for Pixel 6:
+
+1. Copy all scripts to the device:
+
+   ```shell
+   adb push build_tools/benchmarks/*.sh /data/local/tmp
+   ```
+
+1. Launch interactive adb shell as super user:
+
+   ```shell
+   adb shell
+   oriole:/ # su
+   oriole:/ # cd /data/local/tmp
+   ```
+
+1. Pin frequencies (high clockspeeds):
+
+   ```shell
+   oriole:/ # ./set_android_scaling_governor.sh
+    CPU info (before changing governor):
+    cpu     governor        cur     min     max
+    ------------------------------------------------
+    cpu0    sched_pixel     1098000 300000  1803000
+    cpu1    sched_pixel     1598000 300000  1803000
+    cpu2    sched_pixel     1598000 300000  1803000
+    cpu3    sched_pixel     1098000 300000  1803000
+    cpu4    sched_pixel     400000  400000  2253000
+    cpu5    sched_pixel     400000  400000  2253000
+    cpu6    sched_pixel     500000  500000  2802000
+    cpu7    sched_pixel     500000  500000  2802000
+    Setting CPU frequency governor to performance
+    CPU info (after changing governor):
+    cpu     governor        cur     min     max
+    ------------------------------------------------
+    cpu0    performance     1803000 300000  1803000
+    cpu1    performance     1803000 300000  1803000
+    cpu2    performance     1803000 300000  1803000
+    cpu3    performance     1803000 300000  1803000
+    cpu4    performance     2253000 400000  2253000
+    cpu5    performance     2253000 400000  2253000
+    cpu6    performance     2802000 500000  2802000
+    cpu7    performance     2802000 500000  2802000
+   oriole:/data/local/tmp # ./set_pixel6_gpu_scaling_policy.sh
+    GPU info (before changing frequency scaling policy):
+    policy                                  cur     min     max
+    --------------------------------------------------------------
+    coarse_demand [adaptive] always_on      251000  151000  848000
+    Setting GPU frequency scaling policy to performance
+    GPU info (after changing frequency scaling policy):
+    policy                                  cur     min     max
+    --------------------------------------------------------------
+    coarse_demand adaptive [always_on]      848000  151000  848000
+   ```
+
+1. Restore default frequencies:
+
+   ```shell
+   oriole:/ # ./set_android_scaling_governor.sh sched_pixel
+   ...
+   oriole:/ # ./set_pixel6_gpu_scaling_policy.sh default
+   ...
+   ```
+
+TODO(scotttodd): Windows instructions
diff --git a/docs/website/docs/developers/performance/profiling-cpu-events.md b/docs/website/docs/developers/performance/profiling-cpu-events.md
new file mode 100644
index 0000000..7477490
--- /dev/null
+++ b/docs/website/docs/developers/performance/profiling-cpu-events.md
@@ -0,0 +1,267 @@
+# Profiling CPUs
+
+CPUs are able to
+[record](https://en.wikipedia.org/wiki/Hardware_performance_counter) certain
+events that may be relevant when investigating the performance of a program. A
+common example of such an event is a
+["cache miss"](https://en.wikipedia.org/wiki/CPU_cache#Cache_miss), when the
+program tries to access data in memory that isn't already in some CPU cache,
+causing that access to be slower than it could otherwise be.
+
+Querying and analyzing this data can be useful, but is hard in two distinct
+ways:
+
+* Depending on the CPU and on the OS, both hardware and software limitations can
+  get in the way of obtaining accurate data.
+* This data tends to be inherently difficult to interpret, even when it is
+  perfectly accurate. In practice it is often noisy and inaccurate, which makes
+  interpretation even more complicated.
+
+There are two parts to this page: platform-specific information about [how to
+query](#perf-and-simpleperf-on-linux-and-android) this data, and, at the end, a
+platform-independent explanation of [how to
+interpret](#interpreting-cpu-event-counts) it.
+
+## Perf and Simpleperf, on Linux and Android
+
+### Overview
+
+The Linux kernel exposes system event counters to user-space programs by means
+of the
+[`perf_event_open`](https://man7.org/linux/man-pages/man2/perf_event_open.2.html)
+system call. This includes both hardware event counters (such as CPU cache
+events) and software events from the kernel (such as page faults and context
+switches). Anyone may use this system call to implement a profiler, but Linux
+readily offers one, [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page).
+
+### Preserving artifacts
+
+By default IREE cleans up any temporary files it creates while running. Tools
+like perf, however, require those files exist even after the process has exited.
+The environment variable `IREE_PRESERVE_DYLIB_TEMP_FILES` can be set to preserve
+the files. This is only needed for the CPU path when using the system loader.
+
+```shell
+export IREE_PRESERVE_DYLIB_TEMP_FILES=1
+```
+
+### Desktop Linux
+
+On desktop Linux we can use
+[`perf`](https://perf.wiki.kernel.org/index.php/Main_Page). It is provided on
+most Linux distributions, for instance on Debian-based distributions do:
+
+```shell
+sudo apt install linux-perf
+```
+
+Run the program to be profiled, prepending its command line with `perf record`.
+By default this will write the profile data to the current directory,
+`./perf.data`. Sometimes this isn't ideal, such as when the current directory is
+under version control. Explicit paths can be specified with the `-o` flag to
+direct the output of `perf record`, and then with the `-i` flag to select the
+input of subsequent commands analyzing the profile. Example:
+
+```shell
+perf record -o /tmp/perf.data \
+  ./tools/iree-benchmark-module \
+    --device=local-task \
+    ... command-line arguments of iree-benchmark-module as usual ...
+```
+
+By default, this samples time spent. One may specify instead an event to sample
+by, with the `-e` flag. For instance, to sample by L1 cache misses, one may do:
+
+```shell
+perf record -o /tmp/perf.data -e L1-dcache-load-misses \
+  ./tools/iree-benchmark-module \
+    --device=local-task \
+    ... command-line arguments of iree-benchmark-module as usual ...
+```
+
+`perf list` dumps the list of event types.
+
+Once you have recorded a profile, there are two main ways to analyze it: `perf
+report` and `perf annotate`.
+
+`perf report` breaks down the event counts by symbol. In the default case where
+what was sampled was time, this is just an ordinary profile by symbol name, no
+different than what could be viewed in other profilers such as
+[Tracy](profiling-with-tracy.md). Where it gets really interesting is when the
+profile was recording a specific event type, as in the above `-e
+L1-dcache-load-misses` example:
+
+``` shell
+perf report -i /tmp/perf.data
+
+Samples: 6K of event 'L1-dcache-load-misses', Event count (approx.): 362571861
+Overhead  Command          Shared Object              Symbol
+  61.53%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_31
+  13.30%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_11
+   2.11%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_13
+   1.90%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_19
+   1.54%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_25
+   1.49%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_5
+```
+
+`perf annotate` breaks down the event counts by instruction. Again, in the
+default case where what was sampled was time, this is no different than what
+could be viewed in Tracy, and the real motivation to use `perf` is when
+profiling by specific event types as in the above `-e L1-dcache-load-misses`
+example:
+
+``` shell
+perf annotate -i perf.data
+
+Samples: 6K of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 362571861
+serving_default_ex_dispatch_31  /tmp/dylib_executablenzpx2Q.so [Percent: local period]
+  1.66 │        movups -0x1000(%rdi),%xmm10
+  0.48 │        movups -0x800(%rdi),%xmm9
+  0.82 │        movups (%rdi),%xmm8
+  0.49 │        movaps %xmm1,%xmm4
+  0.12 │        shufps $0x0,%xmm1,%xmm4
+  0.14 │        mulps  %xmm5,%xmm4
+  0.28 │        addps  %xmm6,%xmm4
+  0.60 │        movaps %xmm3,%xmm6
+  0.34 │        shufps $0x0,%xmm3,%xmm6
+```
+
+#### Warning
+
+`perf annotate` is even noisier than `perf report`, as it can be overly
+optimistic, depending on the CPU, in pinning an event to a specific instruction.
+Typically, this works fairly well on x86 CPUs and less well on ARM CPUs and more
+generally on anything mobile. Even on a desktop x86 CPU, this is noisy, as the
+above example (recorded on a Skylake workstation) shows: it blamed a `mulps
+%xmm5,%xmm4` instruction for a cache miss, which doesn't make sense as that
+instruction only touches registers.
+
+### Android
+
+On Android we can use
+[`simpleperf`](https://developer.android.com/ndk/guides/simpleperf).
+
+In theory, as Android is Linux, it should be possible to use `perf`.
+Unfortunately, `perf` is difficult to build for Android. Fortunately,
+`simpleperf` is readily available: it is preinstalled on current Android
+`userdebug` images, and it is part of the Android NDK.
+
+First, we record on the device:
+
+```shell
+adb shell \
+  simpleperf record -e raw-l1d-cache-refill -o /data/local/tmp/perf.data \
+    /data/local/tmp/iree-benchmark-module \
+      --device=local-task \
+      ... command-line arguments of iree-benchmark-module as usual ...
+```
+
+Then pull the recorded data from the device, and analyze on the desktop. We
+assume that `${ANDROID_NDK}` points to the local copy of the Android NDK.
+
+```shell
+adb pull /data/local/tmp/perf.data /tmp/perf.data
+${ANDROID_NDK}/simpleperf/report.py -i /tmp/perf.data
+```
+
+This prints a breakdown of `raw-l1d-cache-refill` events by symbol.
+
+Like with `perf`, a list of event types can be queried by the `list` subcommand:
+
+```shell
+adb shell simpleperf list
+```
+
+#### No support for `annotate` by CPU event
+
+There is no `simpleperf annotate`. The `simpleperf` documentation lists a couple
+of
+[ways](https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md#show-annotated-source-code-and-disassembly)
+of achieving the same thing.
+
+However:
+
+* The common case of annotating by time, as opposed to annotating by CPU event,
+  is supported by [Tracy](profiling-with-tracy.md).
+* Annotating by CPU event inherently does not work well, due to hardware
+  limitations of the ARM CPUs found in Android devices. That is, the hardware
+  is too imprecise at pinning an event to a particular instruction.
+
+## Interpreting CPU event counts
+
+### Problems
+
+There are multiple layers of complexity in interpreting CPU event counts.
+
+#### These events are in themselves normal
+
+The first difficulty is in the fact that most of these events are *normal*. So
+just knowing that they happened is not in itself actionable.
+
+For example, if we learn that some code causes cache misses, that isn't big
+news: so does all code. Maybe this code has *too many* cache misses, but how
+many is too many? Maybe this code alone accounts for a large fraction of the
+overall total of the whole program, but maybe even that is normal, for instance
+if the code being studied is the 'hot' part of the program where a large
+fraction of overall time is spent?
+
+#### These events are hardware-dependent and under-documented
+
+Many of these events have a meaning that varies between CPUs and that is
+difficult to characterize on any CPU, let alone in a way that applies to all
+CPUs.
+
+For example, take the "L2 data cache refill". On ARM, with `simpleperf`, that
+would be `raw-l2d-cache-refill`. Questions:
+
+* Is “L2” [inclusive](https://en.wikipedia.org/wiki/Cache_inclusion_policy) of
+  “L1”?
+* How many bytes are transferred per “refill”?
+* Are accesses induced by speculative execution or by automatic pre-fetching
+  counted in the same way as accesses induced by actual code execution?
+
+The answers to all of the above questions are CPU-dependent. They may even vary
+between the CPU cores of the same Android device.
+
+#### These events are imprecise and noisy, particularly on ARM CPUs
+
+Expect noise levels above 10% in many CPU event counts on ARM CPUs. Moreover, on
+ARM, as discussed above, there is inaccuracy in which instruction is blamed for
+which event, which will increase inaccuracy of per-symbol breakdowns for very
+cheap symbols (and makes `perf annotate` impossible as noted above). Finally, be
+aware that some ARM CPUs may perform event count interpolation, so we may not
+have any access to true hardware counts.
+
+### Recommendations
+
+Here is a workflow pattern that allows making significant use of CPU event
+counts, despite all the problems noted above:
+
+* Hypothesize that some code diff might help performance, and might help
+  reducing the number of CPU events of a certain type, and that the two might be
+  related.
+* Benchmark with and without the code diff, on the same device, everything else
+  being equal.
+    * Let your benchmark perform a fixed number of iterations, or, if using a
+    benchmark termination condition of the form "run until at least N seconds
+    have elapsed", carefully divide event counts by the actual number of
+    iterations that were run.
+* If the observed CPU event count difference is significant, go ahead and claim
+  that your code diff probably helps with that aspect of CPU behavior.
+
+Some things NOT to be done:
+
+* Don’t try to compare different metrics, not even when it seems obvious that
+  they should satisfy a simple relationship, not even on the same CPU (e.g. “L1
+  accesses should be greater than L2 accesses”).
+* Don’t divide by some “total” metric to get some kind of ratio. For example,
+  don’t try to compute a “cache miss ratio” as the quotient of “cache refill”
+  over “all cache accesses” metrics. The first problem with that (even before we
+  get to CPU-specific issues) is that it rewards increases to the “all cache
+  accesses” metric: if something bad happens in your codegen and your kernel
+  ends up spilling a lot of registers to the stack, that’s going to be a lot
+  more accesses which will all be L1 hits, so that’ll help this ratio look
+  better! So, more generally, just try to minimize some CPU metrics (that count
+  “costly” events), not some more complex math expression formed from arithmetic
+  on CPU metrics.
diff --git a/docs/website/docs/developers/performance/profiling-gpu-vulkan.md b/docs/website/docs/developers/performance/profiling-gpu-vulkan.md
new file mode 100644
index 0000000..a92d6d9
--- /dev/null
+++ b/docs/website/docs/developers/performance/profiling-gpu-vulkan.md
@@ -0,0 +1,134 @@
+# Profiling GPUs using Vulkan
+
+[Tracy](./profiling-with-tracy.md) offers great insights into CPU/GPU
+interactions and Vulkan API usage
+details. However, information at a finer granularity, especially inside a
+particular shader dispatch, is missing. To supplement general purpose tools
+like Tracy, vendor-specific tools can be used.
+
+(TODO: add some pictures for each tool)
+
+## RenderDoc
+
+Support for [RenderDoc](https://github.com/baldurk/renderdoc) can be enabled by
+configuring cmake with `-DIREE_ENABLE_RENDERDOC_PROFILING=ON`. When built in to
+IREE the profiling functionality is available for programmatic use via the
+`iree_hal_device_profiling_begin` and `iree_hal_device_profiling_end` APIs.
+
+When using one of the standard IREE tools (`iree-run-module`,
+`iree-benchmark-module`, etc) the `--device_profiling_mode=queue` flag can be
+passed to enable capture around the entire invocation (be careful when
+benchmarking as the recordings can be quite large!). The default capture file
+name can be specified with `--device_profiling_file=foo.rdc`.
+
+Capturing in the RenderDoc UI can be done by specifying the IREE tool or
+embedding application (`iree-run-module`, etc) as the launch executable and
+adding all arguments as normal.
+
+Capturing from the command line can be done using `renderdoccmd` with the
+specified file appearing (by default) in the executable directory:
+
+```shell
+renderdoccmd capture tools/iree-run-module --device_profiling_mode=queue --device_profiling_file=foo.rdc ...
+stat tools/foo.rdc
+renderdoccmd capture tools/iree-run-module --device_profiling_mode=queue --device_profiling_file=/some/path/foo.rdc ...
+stat /some/path/foo.rdc
+```
+
+## Android GPUs
+
+There are multiple GPU vendors for the Android platform, each offering their
+own tools. [Android GPU Inspector](https://gpuinspector.dev/)
+(AGI) provides a cross-vendor solution. See the
+[documentation](https://gpuinspector.dev/docs/) for more details.
+
+### Build Android app to run IREE
+
+In order to perform capture and analysis with AGI, you will need a full Android
+app. In IREE we have a simple Android native app wrapper to help package
+IREE core libraries together with a specific VM bytecode invocation into an
+Android app. The wrapper and its documentation are placed at
+[`tools/android/run_module_app/`](https://github.com/openxla/iree/tree/main/tools/android/run_module_app).
+
+For example, to package a module compiled from the following
+`stablehlo-dot.mlir` as an Android app:
+
+```mlir
+func @dot(%lhs: tensor<2x4xf32>, %rhs: tensor<4x2xf32>) -> tensor<2x2xf32> {
+  %0 = "stablehlo.dot"(%lhs, %rhs) : (tensor<2x4xf32>, tensor<4x2xf32>) -> tensor<2x2xf32>
+  return %0 : tensor<2x2xf32>
+}
+```
+
+```shell
+# First compile into a VM bytecode module
+$ /path/to/iree/build/tools/iree-compile \
+  --iree-input-type=stablehlo \
+  --iree-hal-target-backends=vulkan-spirv \
+  /path/to/stablehlo-dot.mlir \
+  -o /tmp/stablehlo-dot.vmfb
+
+# Then package the Android app
+$ /path/to/iree/source/tools/android/run_module_app/build_apk.sh \
+  ./build-apk \
+  --device vulkan \
+  --module /tmp/stablehlo-dot.vmfb \
+  --function dot \
+  --input=/path/to/input/file
+```
+
+Where `/path/to/input/file` is a file containing inputs to `dot`, for example:
+
+``` text
+2x4xf32=[[1.0 2.0 3.0 4.0][5.0 6.0 7.0 8.0]]
+4x2xf32=[[9.0 10.0][11.0 12.0][13.0 14.0][15.0 16.0]]
+```
+
+The above will build an `iree-run-module.apk` under the `./build-apk/`
+directory, which you can then install via `adb install`.
+
+`build_apk.sh` needs the Android SDK and NDK internally, an easy way to manage
+them is by installing [Android Studio](https://developer.android.com/studio).
+After installation, you will need to set up a few environment variables, which
+are printed at the beginning of `build_apk.sh` invocation.
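+
+For example, a sketch of the kind of environment setup involved (the exact
+variable names and paths are printed by `build_apk.sh` itself; the values below
+are placeholders):
+
+```shell
+export ANDROID_SDK_ROOT="$HOME/Android/Sdk"
+export ANDROID_NDK="$ANDROID_SDK_ROOT/ndk/<version>"
+```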
+
+### Capture and analyze with AGI
+
+You can follow AGI's
+[Getting Started](https://gpuinspector.dev/docs/getting-started) page to learn
+how to use it. In general the steps are:
+
+* Install the latest AGI from <https://github.com/google/agi/releases> and launch.
+* Fill in the "Application" field by searching for the app. The line should
+  read like
+  `android.intent.action.MAIN:dev.iree.run_module/android.app.NativeActivity`.
+* Select "start at beginning" and choose a proper duration.
+* Configure system profile to include all GPU counters.
+* Start capture.
+
+Generated traces are in the [perfetto](https://perfetto.dev/) format. They can
+be viewed directly within AGI and also online in a browser at
+<https://ui.perfetto.dev/>, without needing an Android device.
+
+## Desktop GPUs
+
+Vulkan supports both graphics and compute, but most tools in the Vulkan
+ecosystem focus on graphics. As a result, some Vulkan profiling tools expect
+commands to correspond to a sequence of frames presented to displays via
+framebuffers. This means additional steps for IREE and other Vulkan
+applications that solely rely on headless compute. For graphics-focused tools,
+we need to wrap IREE's logic inside a dummy rendering loop in order to provide
+the necessary markers for these tools to perform capture and analysis.
+
+### AMD
+
+For AMD GPUs, [Radeon GPU Profiler](https://gpuopen.com/rgp/) (RGP) is the tool
+to understand fine details of how IREE GPU performs. See the
+[documentation](https://radeon-gpuprofiler.readthedocs.io/en/latest/) for
+details.
+
+### NVIDIA
+
+For NVIDIA GPUs, [NVIDIA Nsight Graphics](https://developer.nvidia.com/nsight-graphics)
+is the tool to understand fine details of how IREE GPU performs. See the
+[documentation](https://docs.nvidia.com/nsight-graphics/UserGuide/index.html)
+for details.
diff --git a/docs/website/docs/developers/performance/profiling-with-tracy.md b/docs/website/docs/developers/performance/profiling-with-tracy.md
new file mode 100644
index 0000000..c0d0d23
--- /dev/null
+++ b/docs/website/docs/developers/performance/profiling-with-tracy.md
@@ -0,0 +1,420 @@
+# Profiling with Tracy
+
+[Tracy](https://github.com/wolfpld/tracy) is a profiler that puts together in a
+single view both instrumentation and system profiling (sampling, systrace). It's
+key to understand the distinction between the two:
+
+* *Instrumentation* is code built into the process being profiled, collecting
+  timestamps at the start and end of "zones". Once it's enabled at build time,
+  it typically just works &mdash; it is a part of our application logic just
+  like anything else, so there's no reason why it would not work.
+* *Sampling* and *SysTrace* rely on specific
+  system features to collect information on what is *actually* running. These
+  rely on OS and binary (ELF) file features, so they can take a bit more care to
+  get to work properly.
+
+There are two components to Tracy. They communicate over a TCP socket.
+
+* The "client" is the program being profiled.
+* The "server" is:
+    * Either the Tracy profiler UI (which we build as `iree-tracy-profiler`),
+    * Or the Tracy command-line capture tool (`iree-tracy-capture`) that can
+        save a trace for later loading in the Tracy profiler UI.
+
+## The Tracy manual
+
+The primary source of Tracy documentation, including for build instructions, is
+a PDF manual that's part of each numbered release.
+[Download](https://github.com/wolfpld/tracy/releases/latest/download/tracy.pdf)
+or
+[view in browser](https://docs.google.com/viewer?url=https://github.com/wolfpld/tracy/releases/latest/download/tracy.pdf).
+
+## Overview
+
+We will go through each step below, but here is an overview. It highlights the
+simpler subset of instructions when only instrumentation is needed, vs. the
+additional steps needed when Sampling is also wanted.
+
+Component | Instrumentation only | Instrumentation and Sampling
+--- | --- | ---
+Build Tracy capture (`iree-tracy-capture`) | Base instructions below for [dependencies](#install-dependencies) and [build](#build-the-tracy-tools) | Same
+Build Tracy profiler (`iree-tracy-profiler`) | Base instructions below for [dependencies](#install-dependencies) and [build](#build-the-tracy-tools) | Same plus [`capstone-next` instructions](#do-you-need-capstone-next) for CPU disassembly to work
+Build the IREE compiler (`iree-compile`) for profiling your own modules  | [Nothing particular](#build-the-iree-compiler-iree-compile) | Same
+Build the IREE compiler (`iree-compile`) for profiling the compiler itself | [Also need](#build-the-iree-compiler-iree-compile) CMake setting: `IREE_ENABLE_COMPILER_TRACING` | Same
+Compile your IREE module (run `iree-compile`) | [Nothing particular](#compile-your-iree-module-run-iree-compile) | [Also need](#additional-steps-for-sampling) to pass `--iree-llvmcpu-link-embedded=false` (and also, for `llvm-cpu` backend, pass `--iree-llvmcpu-debug-symbols=true`, but that is currently default).
+Build IREE device binaries (`iree-run-module` etc) | [Base instructions below](#build-iree-device-binaries-with-tracy-instrumentation-clients) (CMake: set `IREE_ENABLE_RUNTIME_TRACING`) | [Also need](#additional-steps-for-sampling-1) debug information (Set `CMAKE_BUILD_TYPE` to `RelWithDebInfo`).
+Run IREE device binaries loading your modules | [Nothing particular](#running-the-profiled-program) (May need to set the environment variable `TRACY_NO_EXIT=1` for short-running benchmarks) | [Also need](#additional-steps-for-sampling-2) to set the environment variable `IREE_PRESERVE_DYLIB_TEMP_FILES` and adjust device security settings or run as root depending on OS.
+Run Tracy capture (`iree-tracy-capture`) to collect the trace | If device!=host (e.g. Android), [set up TCP port forwarding](#running-the-tracy-capture-cli-connecting-and-saving-profiles). | Same
+Build IREE's own tests and benchmark suites with Tracy instrumentation | [As above](#build-iree-device-binaries-with-tracy-instrumentation-clients), CMake: set `IREE_ENABLE_RUNTIME_TRACING`. | [Also need](#additional-steps-for-sampling) the CMake setting `IREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER` so that `--iree-llvmcpu-link-embedded=false` will be passed to `iree-compile`.
+
+## Install dependencies
+
+### Do you need capstone-next?
+
+You can skip this section if you don't need disassembly of CPU code.
+
+[Capstone](https://github.com/capstone-engine/capstone) is the disassembly
+framework used by Tracy. The default branch, which is what OS packages still
+distribute, is running a few years behind current CPU architectures.
+
+Newer CPU architectures such as RISC-V, or newer extensions of existing
+architectures (e.g. new SIMD instructions in the ARM architecture) are typically
+only supported in the
+[`next`](https://github.com/capstone-engine/capstone/tree/next) branch. If you
+need that support, check out and build that branch. Consider uninstalling any OS
+package for `capstone` or otherwise ensure that your IREE build will pick up
+your `next` branch build.
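+
+A minimal sketch of checking out and building the `next` branch (assuming
+capstone's CMake build; adjust the install prefix and options as needed):
+
+```shell
+git clone --branch next --depth 1 https://github.com/capstone-engine/capstone
+cmake -S capstone -B capstone/build -DCMAKE_BUILD_TYPE=Release
+cmake --build capstone/build
+sudo cmake --install capstone/build
+```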
+
+### Linux
+
+If you haven't opted to build `capstone-next` (see above section), install the
+OS package for `capstone` now (Debian-based distributions):
+
+```shell
+sudo apt install libcapstone-dev
+```
+
+Install other dependencies:
+
+```shell
+sudo apt install libtbb-dev libzstd-dev libglfw3-dev libfreetype6-dev libgtk-3-dev
+```
+
+If you only build the command-line tool `iree-tracy-capture` and not the
+graphical `iree-tracy-profiler`, you can install only:
+
+```shell
+sudo apt install libtbb-dev libzstd-dev
+```
+
+The zstd version on Ubuntu 18.04 is old. You will need to install it from source
+from <https://github.com/facebook/zstd.git>.
+
+### Mac
+
+If you haven't opted to build `capstone-next` (see above section), install the
+system `capstone` now:
+
+```shell
+brew install capstone
+```
+
+Install other dependencies:
+
+```shell
+brew install pkg-config glfw freetype tbb zstd
+```
+
+## Build the Tracy tools
+
+A CMake-based build system for Tracy is maintained as part of IREE. In your IREE
+desktop build directory, set the following CMake option:
+
+```shell
+cmake -DIREE_BUILD_TRACY=ON -DIREE_ENABLE_LLD=ON .
+```
+
+That enables building the Tracy server tools, `iree-tracy-profiler` and
+`iree-tracy-capture`, introduced above. It also enables building the tool
+`iree-tracy-csvexport` which can be used to export a captured trace as a
+CSV file (see Section 6 "Exporting zone statistics to CSV" in the Tracy manual).
+
+If profiling on Android/ARM, you might need the patch discussed in the next
+paragraph.
+
+Consider building **without** assertions (`cmake -DIREE_ENABLE_ASSERTIONS=OFF`).
+At least `iree-tracy-profiler` has some
+[faulty assertions](https://github.com/wolfpld/tracy/pull/382) that can cause
+the profiler UI to crash during normal usage.
+
+Rebuild, either everything or just these specific targets:
+
+```shell
+cmake --build . --target iree-tracy-profiler iree-tracy-capture iree-tracy-csvexport
+```
+
+This should have created the `iree-tracy-profiler`, `iree-tracy-capture`, and
+`iree-tracy-csvexport` binaries:
+
+```shell
+$ find . -name iree-tracy-*
+./tracy/iree-tracy-profiler
+./tracy/iree-tracy-capture
+./tracy/iree-tracy-csvexport
+```
+
+## Build the IREE compiler (`iree-compile`)
+
+Most people don't need to rebuild `iree-compile` at all for Tracy and can skip
+this section.
+
+If you want to profile `iree-compile` itself as opposed to just profiling
+modules compiled with it, then rebuild it with the CMake setting
+`IREE_ENABLE_COMPILER_TRACING` set to `ON`.
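+
+For example, a sketch of reconfiguring and rebuilding just the compiler,
+following the same CMake reconfigure pattern used elsewhere on this page (the
+build directory name is a placeholder):
+
+```shell
+cd iree-host-build-dir
+cmake -DIREE_ENABLE_COMPILER_TRACING=ON .
+cmake --build . --target iree-compile
+```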
+
+## Compile your IREE module (run `iree-compile`)
+
+If you only want Instrumentation and not Sampling then you don't need anything
+particular here. Just run `iree-compile` as usual.
+
+### Additional steps for Sampling
+
+In order for Sampling to work with your compiled modules, add this flag to your
+`iree-compile` command line: `--iree-llvmcpu-link-embedded=false`.
+
+For the `llvm-cpu` target backend, sampling features also rely on debug
+information in the compiled module, enabled by
+`--iree-llvmcpu-debug-symbols=true`, but that is currently the default.
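+
+For example, a sketch of compiling a module with Sampling-friendly flags for the
+`llvm-cpu` backend (the input and output paths here are placeholders):
+
+```shell
+iree-compile \
+  --iree-hal-target-backends=llvm-cpu \
+  --iree-llvmcpu-link-embedded=false \
+  --iree-llvmcpu-debug-symbols=true \
+  /path/to/model.mlir \
+  -o /tmp/module.vmfb
+```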
+
+When building IREE's own test and benchmark suites, if Tracy Sampling support
+is wanted, set the CMake setting
+`IREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER` to `ON`. It has the effect of
+passing that `--iree-llvmcpu-link-embedded=false` when compiling test/benchmark
+modules.
+
+## Build IREE device binaries with Tracy instrumentation ("clients")
+
+Set the CMake setting `IREE_ENABLE_RUNTIME_TRACING` to `ON` and rebuild IREE
+device binaries, e.g.
+
+```shell
+cd iree-device-build-dir
+cmake -DIREE_ENABLE_RUNTIME_TRACING=ON .
+cmake --build .
+```
+
+### Additional steps for Sampling
+
+In order for Sampling features to work, make sure that binaries contain debug
+information. That usually means changing the `CMAKE_BUILD_TYPE` to
+`RelWithDebInfo` instead of `Release`.
+
+In your IREE device build directory, set the following CMake options:
+
+```shell
+cd iree-device-build-dir
+cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .
+```
+
+## Running the profiled program
+
+The basic recipe is to just run your program as usual on the device and, while
+it is running, run `iree-tracy-capture` on the host to connect to it.
+
+In the typical case of a short-running benchmark, one usually runs with the
+environment variable `TRACY_NO_EXIT` defined so that the benchmark does not
+exit until `iree-tracy-capture` has connected to it.
+
+Example:
+
+```shell
+TRACY_NO_EXIT=1 /data/local/tmp/iree-benchmark-module ... (usual flags)
+```
+
+### Additional steps for Sampling
+
+In order for Sampling to work, the IREE compiled module code mapping must still
+be accessible by the time Tracy tries to read symbols and code. This requires
+setting the environment variable `IREE_PRESERVE_DYLIB_TEMP_FILES`. It is
+easiest to set it to `1`, but one may also set it to an explicit path where the
+temporary files should be preserved.
+
+Example:
+
+```shell
+TRACY_NO_EXIT=1 IREE_PRESERVE_DYLIB_TEMP_FILES=1 /data/local/tmp/iree-benchmark-module ... (usual flags)
+```
+
+Tracing doesn't work properly on VMs (see "Problematic Platforms / Virtual
+Machines" section 2.1.6.4 of the [manual](#the-tracy-manual)). To get sampling,
+you should run the profiled program on bare metal.
+
+## Operating system settings required for Sampling and SysTrace
+
+### Desktop Linux
+
+On desktop Linux, the profiled application must be run as root, e.g. with
+`sudo`. Otherwise, profile data will lack important components.
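+
+For example, a sketch (flags elided as usual; `-E` preserves environment
+variables such as `TRACY_NO_EXIT` and `IREE_PRESERVE_DYLIB_TEMP_FILES` if you
+have set them):
+
+```shell
+sudo -E ./tools/iree-benchmark-module ... (usual flags)
+```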
+
+### Android
+
+When profiling on an Android device, in order to get the most useful information
+in the trace, tweak system permissions as follows before profiling. This needs
+to be done again after every reboot of the Android device.
+
+From your desktop, get a shell on the Android device:
+
+```shell
+adb shell
+```
+
+The following commands are meant to be run from that Android device shell.
+First, get root access for this shell:
+
+```shell
+$ su
+#
+```
+
+Now run the following commands as root on the Android device:
+
+```shell
+setenforce 0
+mount -o remount,hidepid=0 /proc
+echo 0 > /proc/sys/kernel/perf_event_paranoid
+echo 0 > /proc/sys/kernel/kptr_restrict
+```
+
+Note: in order for this to work, the device needs to be *rooted*, which means
+that the above `su` command must succeed. This is sometimes confused with the
+`adb root` command, but that's not the same. `adb root` restarts the `adbd`
+daemon as root, which causes device shells to be root shells by default. This is
+unnecessary here and we don't recommend it: real Android applications *never*
+run as root, so Tracy/Android *has* to support running benchmarks as regular
+user and it's best to stick to this for the sake of realistic benchmarks.
+Internally, Tracy executes `su` commands to perform certain actions, so it too
+relies on the device being *rooted* without relying on the benchmark process
+being run as root.
+
+### "RESOURCE_EXHAUSTED; failed to open file" issue
+
+This is a
+[known issue with how tracy operates](https://github.com/wolfpld/tracy/issues/512).
+One way to work around it is to manually increase the total number of files
+that can be kept open simultaneously and run the benchmark command with that
+setting:
+
+```shell
+sudo sh -c "ulimit -n <bigNum> && <myTracyInstrumentedProgram>"
+```
+
+**Explanation:**
+
+Tracy keeps a number of file descriptors open that, depending on the machine
+and its settings, may exceed the limit allowed by the system, resulting in
+`iree` failing to open more files. In particular, it is common to have a
+relatively low limit when running with `sudo`.
+
+## Running the Tracy Capture CLI, connecting and saving profiles
+
+While the program that you want to profile is still running (thanks to
+`TRACY_NO_EXIT=1`), start the Tracy capture tool in another terminal. From the
+IREE build directory:
+
+```shell
+tracy/iree-tracy-capture -o myprofile.tracy
+Connecting to 127.0.0.1:8086...
+```
+
+It should connect to the IREE client and save the output to `myprofile.tracy`,
+which can be visualized in the profiler UI described in the next section. You
+can start the capture tool first to make sure you don't miss any capture events.
+
+Note that the connection uses TCP port 8086. If the Tracy-instrumented program
+is running on a separate machine, this port needs to be forwarded. In
+particular, when benchmarking on Android, this is needed:
+
+```shell
+adb forward tcp:8086 tcp:8086
+```
+
+## Running the Tracy profiler UI, connecting and visualizing
+
+If you have previously captured a tracy file (previous section), this command
+should succeed loading it (from the IREE build directory):
+
+```shell
+tracy/iree-tracy-profiler myprofile.tracy
+```
+
+Alternatively, while the program that you want to profile is still running
+(possibly thanks to `TRACY_NO_EXIT=1`), the Tracy profiler can connect to it
+directly (so it is not required to capture the trace into a file): just running
+
+```shell
+tracy/iree-tracy-profiler
+```
+
+should show a dialog offering to connect to a client i.e. a profiled program:
+
+![Tracy connection dialog](https://gist.github.com/bjacob/ff7dec20c1dfc7d0fc556cc7275bca9a/raw/fe4e22ca0301ebbfd537c47332a4a2c300a417b3/tracy_connect.jpeg)
+
+If connecting doesn't work:
+
+* If the profiled program is on a separate machine, make sure you've correctly
+    set up port forwarding.
+* On Android, the `adb forward` may need to be run again.
+* Make sure that the profiled program is still running. Do you need
+    `TRACY_NO_EXIT=1`?
+* Kill the profiled program and restart it.
+
+You should now start seeing a profile. The initial view should look like this:
+
+![Tracy initial view, normal case](https://gist.githubusercontent.com/bjacob/ff7dec20c1dfc7d0fc556cc7275bca9a/raw/fe4e22ca0301ebbfd537c47332a4a2c300a417b3/tracy_initial_view.jpeg)
+
+Before going further, take a second to check that your recorded profile has all
+the data that it should have. Permissions issues, as discussed above, could
+cause it to lack "sampling" or "CPU data" information, particularly on Android.
+For example, here is what the initial view looks like when one forgot to run
+the profiled program as root on Desktop Linux (where running as root is
+required, as explained above):
+
+![Tracy initial view, permissions issue](https://gist.githubusercontent.com/bjacob/ff7dec20c1dfc7d0fc556cc7275bca9a/raw/fe4e22ca0301ebbfd537c47332a4a2c300a417b3/tracy_permissions_issue.jpeg)
+
+Notice how the latter screenshot is lacking the following elements:
+
+* No 'CPU data' header on the left side, with the list of all CPU cores. The
+    'CPU usage' graph is something else.
+* No 'ghost' icon next to the 'Main thread' header.
+
+Click the 'Statistics' button at the top. It will open a window like this:
+
+![Tracy statistics window](https://gist.githubusercontent.com/bjacob/ff7dec20c1dfc7d0fc556cc7275bca9a/raw/fe4e22ca0301ebbfd537c47332a4a2c300a417b3/tracy_statistics.jpeg)
+
+See how the above screenshot has two radio buttons at the top: 'Instrumentation'
+and 'Sampling'. At this point, if you don't see the 'Sampling' radio button, you
+need to resolve that first, as discussed in the permissions notes above.
+
+These 'Instrumentation' and 'Sampling' statistics correspond to the two kinds of
+data that Tracy collects about your program. In the Tracy main view, they
+correspond, respectively, to 'instrumentation' and 'ghost' zones. Refer to the
+[Tracy PDF manual](#the-tracy-manual) for a general introduction to these
+concepts. For each thread, the ghost icon toggles the view between these two
+kinds of zones.
+
+Back to the main view, look for the part of the timeline that is of interest to
+you. Your area of interest might not be on the Main thread. In fact, it might be
+on a thread that's not visible in the initial view at all. To pan around with
+the mouse, hold the **right mouse button** down (or its keyboard equivalent on
+macOS). Alternatively, look for the 'Frame' control at the top of the Tracy
+window. Use the 'next frame' arrow button until more interesting threads appear.
+
+IREE module code tends to run on a thread whose name contains the word `worker`.
+
+Once you have identified the thread of interest, you typically want to click its
+ghost icon to view its "ghost" (i.e. sampling) zones.
+
+Here is what you should get when clicking on a ghost zone:
+
+![ghost zone source view](https://gist.githubusercontent.com/bjacob/ff7dec20c1dfc7d0fc556cc7275bca9a/raw/fe4e22ca0301ebbfd537c47332a4a2c300a417b3/tracy_source_view.jpeg)
+
+The percentages column to the left of the disassembly shows where time is being
+spent. This is unique to the sampling data (ghost zones) and has no equivalent
+in the instrumentation data (instrumentation zones). Here is what we get
+clicking on the corresponding instrumentation zone:
+
+![instrumentation zone source view](https://gist.githubusercontent.com/bjacob/ff7dec20c1dfc7d0fc556cc7275bca9a/raw/fe4e22ca0301ebbfd537c47332a4a2c300a417b3/tracy_normal_zone_info.jpeg)
+
+This still has a 'Source' button, but it only shows the last C++ caller that
+had explicit Tracy information, so here we see a file under `iree/hal`, whereas
+the ghost zone saw into the compiled IREE module that it calls into, with the
+source view pointing to the `.mlir` file.
+
+## Configuring Tracy instrumentation
+
+Set IREE's `IREE_TRACING_MODE` value (defined in
+[iree/base/tracing.h](https://github.com/openxla/iree/blob/main/runtime/src/iree/base/tracing.h))
+to adjust which tracing features, such as allocation tracking and callstacks,
+are enabled.
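+
+As a sketch (assuming a CMake build that already has Tracy instrumentation
+enabled), the value can be overridden by passing the define through the
+compiler flags at configure time and rebuilding; valid values are listed in
+`tracing.h`:
+
+```shell
+# The build directory below is a placeholder; reuse your existing configure
+# command and just append the define.
+cmake -B ../iree-build/ -G Ninja \
+    -DCMAKE_C_FLAGS="-DIREE_TRACING_MODE=<mode>" \
+    -DCMAKE_CXX_FLAGS="-DIREE_TRACING_MODE=<mode>" \
+    .
+cmake --build ../iree-build/
+```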
diff --git a/docs/website/docs/developers/performance/profiling.md b/docs/website/docs/developers/performance/profiling.md
new file mode 100644
index 0000000..a5a4274
--- /dev/null
+++ b/docs/website/docs/developers/performance/profiling.md
@@ -0,0 +1,25 @@
+# Profiling overview
+
+IREE [benchmarking](./benchmarking.md) gives us an accurate and reproducible
+view of program performance at specific levels of granularity. To analyze system
+behavior in more depth, there are various ways to
+[profile](https://en.wikipedia.org/wiki/Profiling_(computer_programming)) IREE.
+
+## Tracy
+
+Tracy is a profiler that's been used for a wide range of profiling tasks on
+IREE. Refer to [Profiling with Tracy](./profiling-with-tracy.md).
+
+## Vulkan GPU Profiling
+
+[Tracy](./profiling-with-tracy.md) offers great insights into CPU/GPU
+interactions and Vulkan API usage details. However, information at a finer
+granularity, especially inside a particular shader dispatch, is missing. To
+supplement general purpose tools like Tracy, vendor-specific tools can be used.
+Refer to [Profiling GPUs using Vulkan](./profiling-gpu-vulkan.md).
+
+## CPU cache and other CPU event profiling
+
+For some advanced CPU profiling needs such as querying CPU cache and other
+events, one may need to use some OS-specific profilers. See
+[Profiling CPUs](./profiling-cpu-events.md).
diff --git a/docs/website/docs/developers/usage-best-practices.md b/docs/website/docs/developers/usage-best-practices.md
new file mode 100644
index 0000000..5bd33a6
--- /dev/null
+++ b/docs/website/docs/developers/usage-best-practices.md
@@ -0,0 +1,87 @@
+# Usage best practices
+
+This page contains a list of best practices for getting the most out of IREE,
+spanning model authoring, ahead-of-time compilation, and runtime use. Treat
+these as a collection of ideas to consider or areas to start benchmarking when
+working on your own applications.
+
+## Introduction
+
+Common themes include:
+
+* Give the compiler as much information as possible
+* Give the compiler opportunities to batch work together or defer computation
+* Keep compute devices saturated with work through pipelining
+* Use dense math where possible, particularly for inner loop bodies
+* Limit synchronization points between devices like CPUs and GPUs
+* Profile early and often, using the right tools for each level of granularity
+
+## Practices for model authoring
+
+### Track state within your model when possible
+
+If your model is stateful, prefer to store that state directly within your
+program rather than externalizing it through arguments and return values. By
+keeping state inside your program, the compiler is better able to reason about
+it, and function calls will have lower overhead.
+
+If you do externalize state, try to pack that state into a limited number of
+arguments.
+
+See the
+[variables and state](https://github.com/openxla/iree/tree/main/samples/variables_and_state)
+sample for further guidance on tracking and using state.
+
+### Limit uses of dynamic shapes
+
+While IREE aims to support general dynamic shapes use, it is better able to
+optimize parts of programs where shapes are static. Slowly varying dimensions
+like batch index or timestamp are safer uses of dynamic shapes than faster
+varying dimensions like the x/y/channel dimensions of images.
+
+See the
+[dynamic shapes](https://github.com/openxla/iree/tree/main/samples/dynamic_shapes)
+sample for further guidance on using dynamic shapes.
+
+## Practices for compilation settings
+
+TODO: which compiler targets to use (try both CUDA and Vulkan?)
+
+TODO: use the most specific LLVM target triple you can?
+
+### Tuning compilation heuristics
+
+IREE runs its own suite of benchmarks continuously using the definitions at
+<https://github.com/openxla/iree/tree/main/benchmarks>. The flags set for these
+benchmarks represent the latest manually tuned values for workloads we track
+closely, and referencing them may help with your own search for peak performance.
+You can use these flags in your own explorations, but note that as compiler
+performance matures, the existing flags will gradually be replaced with
+attributes for autotuning or command line options for experimental features.
+
+## Practices for runtime use
+
+TODO: sample code, profile numbers
+
+### Tuning runtime settings
+
+When running on the CPU, the task system flags specified in
+[iree/task/api.c](https://github.com/openxla/iree/blob/main/runtime/src/iree/task/api.c)
+give control over how worker threads will be created. For example, the
+`--task_topology_group_count=3` flag can be set to explicitly run on three
+workers rather than relying on heuristic selection, which defaults to one
+worker per detected physical core.
+
+If running on a single thread or on a system with no threading support, the
+`local-sync` HAL driver can be used instead of the multithreaded `local-task`
+HAL driver to reduce dependencies and code size. When running with the
+`local-sync` driver, all execution happens inline on the thread invoking the
+IREE runtime and blocks until it has completed.
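+
+As an illustration, a hypothetical `iree-run-module` invocation combining
+these settings might look like the following (the module, entry function, and
+input below are placeholders):
+
+```shell
+iree-run-module \
+    --device=local-task \
+    --task_topology_group_count=3 \
+    --module=my_module.vmfb \
+    --function=main \
+    --input="1x4xf32=0"
+```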
+
+### Do the minimum amount of work: cache queries and reuse buffers
+
+When using IREE's runtime libraries, try to front-load queries, particularly
+queries using strings that look up into maps like
+`iree_runtime_session_call_by_name`, so that hot sections of code are doing the
+minimum amount of work: routing inputs through buffers, scheduling runtime
+calls, and routing outputs through other buffers.
diff --git a/docs/website/docs/developers/vulkan-environment-setup.md b/docs/website/docs/developers/vulkan-environment-setup.md
new file mode 100644
index 0000000..ff76271
--- /dev/null
+++ b/docs/website/docs/developers/vulkan-environment-setup.md
@@ -0,0 +1,216 @@
+# Vulkan environment setup
+
+[Vulkan](https://www.khronos.org/vulkan/) is a new generation graphics and
+compute API that provides high-efficiency, cross-platform access to modern GPUs
+used in a wide variety of devices from PCs and consoles to mobile phones and
+embedded platforms.
+
+This page lists steps and tips for setting up and troubleshooting a Vulkan
+development environment. The information here is meant to be generic.
+
+## Vulkan architecture
+
+Vulkan adopts a layered architecture, which aims to better support extensibility.
+There are four components involved in this architecture:
+
+* The Vulkan Application
+* [The Vulkan Loader][VulkanLoader]
+* [Vulkan Layers][VulkanLayer]
+* [Installable Client Drivers (ICDs)][VulkanICD]
+
+![High Level View of Loader][VulkanArchPicture]
+
+The Vulkan loader sits between the Vulkan application, which calls Vulkan APIs,
+and the ICDs, which implement these Vulkan APIs. Vulkan layers augment the
+Vulkan system to provide optional features like validation and debugging. The
+Vulkan loader composes a chain of requested layers, which processes the Vulkan
+application's API calls one by one, and finally redirects the API calls made by
+the Vulkan application to one or more ICDs.
+
+It's highly recommended to read the
+[Architecture of the Vulkan Loader Interfaces Overview][VulkanArchOverview] to
+get a general understanding of what these components are and how they interact
+with one another.
+
+## Vulkan development environment setup
+
+### Windows
+
+You need to install the [Vulkan SDK][VulkanSDK] from LunarG to get the Vulkan
+loader.
+
+Typically the Vulkan SDK will be installed at `C:\VulkanSDK\<version>\` and
+there will be an environment variable `VULKAN_SDK` pointing to it. You can run
+the `vulkancube` executable under the `Bin\` subdirectory of the Vulkan SDK to
+make sure everything works properly. If not, you probably need to check whether
+the graphics card is Vulkan capable or update the driver.
+
+### Debian/Ubuntu
+
+The following packages should be installed to provide a proper Vulkan runtime
+and to verify that it functions properly:
+
+* [libvulkan1][PackageLibVulkan1] for the Vulkan loader `libvulkan.so`.
+* [AMD] [mesa-vulkan-drivers][PackageMesaVulkan] for Mesa AMD Vulkan ICD.
+* [NVIDIA] [nvidia-vulkan-icd][PackageNvidiaVulkan] for NVIDIA Vulkan ICD.
+
+The above packages provide the Vulkan loader and ICDs. With them, a Vulkan
+application should be able to run. You may additionally want to install
+
+* [vulkan-tools][PackageVulkanTools] for command-line tools like `vulkaninfo`
+    (dumping available ICDs and their capabilities) and GUI applications like
+    `vulkancube` (rendering a rotating cube).
+
+In order to develop Vulkan applications, you additionally need the following
+packages (a combined install command is sketched after this list):
+
+* [libvulkan-dev][PackageVulkanDev] for various Vulkan header files.
+* [vulkan-validationlayers][PackageVulkanValidation] for Vulkan validation
+    layers like `VkLayer_standard_validation`.
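+
+On a recent Debian/Ubuntu system with an AMD GPU, for example, the packages
+above can typically be installed in one go (swap in the NVIDIA ICD package as
+appropriate; exact package names may vary by release):
+
+```shell
+sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools \
+    libvulkan-dev vulkan-validationlayers
+```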
+
+### Linux
+
+For other Linux distros, please consult the corresponding package management
+tools for the packages needed. (And please feel free to update this doc
+regarding them.)
+
+You can also download and install the [Vulkan SDK][VulkanSDK] from LunarG. It
+packages the loader with many useful layers and other shader tools. The source
+code of the SDK component projects is included, allowing you to recompile the
+artifacts if needed.
+
+You can also build the Vulkan SDK component projects like
+[Vulkan-Loader][VulkanLoaderSource] and
+[Vulkan-ValidationLayers][VulkanValidationLayersSource] from source. But note
+that when building these components separately, you need to make sure they are
+consistent with one another (e.g., using the same version of Vulkan headers) to
+function together.
+
+If multiple versions of the Vulkan loader exist, you may also need to set
+`LD_LIBRARY_PATH` and `LD_PRELOAD` to load the desired version of the loader.
+For example:
+
+```shell
+LD_LIBRARY_PATH={PATH_TO_VULKAN_SDK}/x86_64/lib/
+LD_PRELOAD=libvulkan.so.1
+```
+
+This can also be done by sourcing the proper `setup-env.sh` from one of the
+downloaded Vulkan SDKs.
+
+### Android
+
+Please make sure your Android device is Vulkan capable. Vulkan has been
+supported on Android since version 7, but we track newer Android versions
+(10+) closely and haven't set a clear minimum version yet.
+
+#### Troubleshooting Vulkan function `vkCreateInstance` not available
+
+Since Android 8 Oreo, Android re-architected the OS framework with
+[project Treble](https://source.android.com/devices/architecture#hidl).
+Framework libraries and
+[vendor libraries](https://source.android.com/devices/architecture/vndk) have a
+more strict and clear separation. Their dependencies are carefully scrutinized
+and only selected cases are allowed. This is enforced with
+[linker namespaces](https://source.android.com/devices/architecture/vndk/linker-namespace).
+
+`/data/local/tmp` is the preferred directory for automating native binary tests
+built using the NDK toolchain. Such tests should be allowed to access
+libraries like `libvulkan.so` for their functionality. However, there was an
+issue with fully treblized Android 10 where `/data/local/tmp` did not have
+access to the linker namespaces needed by `libvulkan.so`. This should be
+[fixed](https://android.googlesource.com/platform/system/linkerconfig/+/296da5b1eb88a3527ee76352c2d987f82f3252eb)
+now. But, as is typical with Android system updates, it can take a long time
+for the fix to propagate, if it ever does.
+
+A known workaround is to symlink the vendor Vulkan implementation under
+`/vendor/lib[64]` as `libvulkan.so` under `/data/local/tmp` and use
+`LD_LIBRARY_PATH=/data/local/tmp` when invoking IREE executables.
+
+For Qualcomm Adreno GPUs, the vendor Vulkan implementation is at
+`/vendor/lib[64]/hw/vulkan.*.so`. So for example for Snapdragon 865:
+
+```shell
+adb shell ln -s /vendor/lib64/hw/vulkan.kona.so /data/local/tmp/libvulkan.so
+```
+
+For ARM Mali GPUs, there is only one monolithic driver
+(`/vendor/lib[64]/libGLES_mali.so`) for OpenGL and Vulkan and the Vulkan vendor
+driver (`/vendor/lib[64]/hw/vulkan.*.so`) is just a symlink to it. So for
+example:
+
+```shell
+adb shell ln -s /vendor/lib64/libGLES_mali.so /data/local/tmp/libvulkan.so
+```
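+
+With the symlink in place, a hypothetical invocation (the binary and module
+paths below are placeholders) would then look like:
+
+```shell
+adb shell LD_LIBRARY_PATH=/data/local/tmp \
+    /data/local/tmp/iree-run-module --device=vulkan \
+    --module=/data/local/tmp/my_module.vmfb --function=main
+```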
+
+## Vulkan debugging and profiling
+
+### RenderDoc
+
+RenderDoc is an awesome tool that one can use to capture and introspect Vulkan
+applications. It can be downloaded from RenderDoc's
+[website](https://renderdoc.org/) or compiled from [source][RenderDocSource].
+
+## Vulkan development environment troubleshooting
+
+### Useful environment variables
+
+There are a few environment variables that can alter the default Vulkan loader
+behavior and print verbose information, notably:
+
+* `VK_LOADER_DEBUG`: enable loader debug messages. Setting it to `all` will
+    enable the most verbose logging from the loader. This is especially useful
+    when trying to see what layers/ICDs are searched and used.
+* `VK_ICD_FILENAMES`: force the loader to use a specific ICD. This is
+    especially useful when you have multiple Vulkan capable devices and want to
+    select which one to use manually.
+* `VK_INSTANCE_LAYERS`: force the loader to enable the given layers. For
+    example, you can force enable `VK_LAYER_LUNARG_api_dump` to have a detailed
+    dump of all Vulkan API calls made by the application. You can force enable
+    `VK_LAYER_LUNARG_core_validation` to validate the API calls made by the
+    application.
+* `VK_LAYER_PATH`: override the loader's standard layer library search folders.
+
+Please see the [Vulkan loader's documentation][VulkanLoaderEnvVars] for a
+detailed explanation of these variables.
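+
+For example, to see which layers and ICDs the loader finds and loads, you can
+combine these with `vulkaninfo` from the `vulkan-tools` package mentioned
+above:
+
+```shell
+VK_LOADER_DEBUG=all vulkaninfo 2>&1 | less
+```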
+
+### Setting environment variables for Bazel test
+
+Bazel runs tests in a sandbox and environment variables must be passed through
+to the test runner. Consider putting environment setup in a `user.bazelrc` to
+save typing. For example:
+
+```starlark
+test --test_env="LD_LIBRARY_PATH=/absolute/path/to/vulkan/sdk/x86_64/lib/"
+test --test_env="LD_PRELOAD=libvulkan.so.1"
+test --test_env="VK_LAYER_PATH=/absolute/path/to/additional/layers/:$VK_LAYER_PATH"
+```
+
+### SSH on Linux and X forwarding
+
+Physical devices enumerated on NVIDIA drivers can be affected by the `DISPLAY`
+environment variable. If you are running under an SSH session to Linux or using
+Chrome Remote Desktop and have problems with physical device enumeration, you
+probably want to check the `DISPLAY` environment variable and set it to point
+to a display at the server side, for example:
+
+```bash
+export DISPLAY=:0
+```
+
+[VulkanArchOverview]: https://github.com/KhronosGroup/Vulkan-Loader/blob/master/loader/LoaderAndLayerInterface.md#overview
+[VulkanArchPicture]: https://raw.githubusercontent.com/KhronosGroup/Vulkan-Loader/master/docs/images/high_level_loader.png
+[VulkanICD]: https://github.com/KhronosGroup/Vulkan-Loader/blob/master/loader/LoaderAndLayerInterface.md#installable-client-drivers
+[VulkanLayer]: https://github.com/KhronosGroup/Vulkan-Loader/blob/master/loader/LoaderAndLayerInterface.md#layers
+[VulkanLoader]: https://github.com/KhronosGroup/Vulkan-Loader/blob/master/loader/LoaderAndLayerInterface.md#the-loader
+[VulkanLoaderEnvVars]: https://github.com/KhronosGroup/Vulkan-Loader/blob/master/loader/LoaderAndLayerInterface.md#table-of-debug-environment-variables
+[VulkanLoaderSource]: https://github.com/KhronosGroup/Vulkan-Loader
+[VulkanSDK]: https://www.lunarg.com/vulkan-sdk/
+[VulkanValidationLayersSource]: https://github.com/KhronosGroup/Vulkan-ValidationLayers
+[PackageLibVulkan1]: https://packages.ubuntu.com/focal/libvulkan1
+[PackageMesaVulkan]: https://packages.ubuntu.com/focal/mesa-vulkan-drivers
+[PackageNvidiaVulkan]: https://packages.debian.org/buster/nvidia-vulkan-icd
+[PackageVulkanDev]: https://packages.ubuntu.com/focal/libvulkan-dev
+[PackageVulkanTools]: https://packages.ubuntu.com/focal/vulkan-tools
+[PackageVulkanValidation]: https://packages.ubuntu.com/eoan/vulkan-validationlayers
+[RenderDocSource]: https://github.com/baldurk/renderdoc
diff --git a/docs/website/docs/guides/index.md b/docs/website/docs/guides/index.md
index 99d800f..fbef597 100644
--- a/docs/website/docs/guides/index.md
+++ b/docs/website/docs/guides/index.md
@@ -33,7 +33,3 @@
   for AMD-specific solutions
 * [:simple-apple: GPU - Metal](./deployment-configurations/gpu-metal.md)
   for running on Apple hardware
-
-## Other topics
-
-* [:material-lightbulb-on: Developer tips and tricks](./developer-tips.md)
diff --git a/docs/website/mkdocs.yml b/docs/website/mkdocs.yml
index 4867083..7a746a5 100644
--- a/docs/website/mkdocs.yml
+++ b/docs/website/mkdocs.yml
@@ -2,6 +2,7 @@
 site_url: https://iree.dev/
 repo_url: https://github.com/openxla/iree
 repo_name: openxla/iree
+edit_uri: blob/main/docs/website/docs/
 
 theme:
   name: material
@@ -9,6 +10,7 @@
   logo_alt: IREE
   icon:
     repo: fontawesome/brands/github
+    edit: material/file-eye-outline
   font:
     text: Noto
     code: Noto Sans Mono
@@ -18,6 +20,7 @@
   custom_dir: overrides
 
   features:
+    - content.action.edit # Link to view/edit documentation source on GitHub
     - content.code.annotate # Allow inline annotations
     - content.code.copy # Enable copy button
     - content.tabs.link # Link content tabs across site (e.g. Windows/Linux)
@@ -136,8 +139,6 @@
           - GPU - CUDA: "guides/deployment-configurations/gpu-cuda.md"
           - GPU - ROCm: "guides/deployment-configurations/gpu-rocm.md"
           - GPU - Metal: "guides/deployment-configurations/gpu-metal.md"
-      - "Other topics":
-          - Developer tips and tricks: "guides/developer-tips.md"
   - "Reference":
       - "reference/index.md"
       - "API bindings":
@@ -165,6 +166,40 @@
           - Glossary: "reference/glossary.md"
           - Optimization options: "reference/optimization-options.md"
           - Extensions: "reference/extensions.md"
+  - "Developers":
+      - "developers/index.md"
+      - "General development topics":
+          - "developers/general/contributing.md"
+          - "developers/general/developer-overview.md"
+          - "developers/general/developer-tips.md"
+          - "developers/general/release-management.md"
+          - "developers/general/testing-guide.md"
+      - "Building":
+          - "developers/building/bazel.md"
+          - "developers/building/emscripten.md"
+          - "developers/building/cmake-options-and-variables.md"
+          - "developers/building/cmake-with-ccache.md"
+      - "Debugging":
+          - "developers/debugging/android-with-lldb.md"
+          - "developers/debugging/compile-time-regressions.md"
+          - "developers/debugging/integration-tests.md"
+          - "developers/debugging/releases.md"
+          - "developers/debugging/sanitizers.md"
+      - "Performance":
+          - "developers/performance/benchmarking.md"
+          - "developers/performance/benchmark-suites.md"
+          - "developers/performance/profiling.md"
+          - "developers/performance/profiling-cpu-events.md"
+          - "developers/performance/profiling-gpu-vulkan.md"
+          - "developers/performance/profiling-with-tracy.md"
+      - "Design docs":
+          - "developers/design-docs/cuda-backend.md"
+          - "developers/design-docs/design-roadmap.md"
+          - "developers/design-docs/function-abi.md"
+          - "developers/design-docs/invocation-execution-model.md"
+      - "Other topics":
+          - "developers/usage-best-practices.md"
+          - "developers/vulkan-environment-setup.md"
   - "Community":
       - "community/index.md"
       - "Blog":
@@ -219,3 +254,6 @@
         # Some blog post names/paths changed when setting up the blog plugin
         "community/blog/2021-07-19-tflite-tosa.md": "community/blog/posts/tflite-tosa.md"
         "community/blog/2021-10-13-mmt4d.md": "community/blog/posts/mmt4d.md"
+
+        # "Developers" section was added
+        "guides/developer-tips.md": "developers/general/developer-tips.md"
diff --git a/docs/website/overrides/404.html b/docs/website/overrides/404.html
index 64dbee5..fd3a129 100644
--- a/docs/website/overrides/404.html
+++ b/docs/website/overrides/404.html
@@ -6,8 +6,6 @@
 
 <p>Sorry, we couldn't find that page.</p>
 
-<p>The <a href="https://github.com/openxla/iree/tree/main/docs/developers"><code>docs/developers/</code></a> directory on GitHub might be helpful.
-
 <p>Click <a href="{{ config.site_url }}">here</a> to go back to the home page.</p>
 
 {% endblock %}