Merge docs/developers into docs/website/. (#15396)
Fixes https://github.com/openxla/iree/issues/15116.

## Notes for review
| Location description | Preview URL |
| -------------------- | ----------- |
| source folder (GitHub markdown) | [`docs/website/docs/developers/` on my fork](https://github.com/ScottTodd/iree/tree/developer-docs-folder/docs/website/docs/developers) |
| website (mkdocs site generator) | https://scotttodd.github.io/iree/developers |

* I tried to split this change into multiple smaller PRs / reviewable
commits, but the nature of moving so many files around and getting them
integrated into the website made that tricky
* I may have missed some URL renamings; mkdocs at least warns about broken
links on the website itself (see the local preview sketch below)
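
To preview the generated site locally (a sketch, assuming the standard MkDocs
workflow and that `docs/website/requirements.txt` lists the site dependencies):

```shell
cd docs/website
python -m pip install -r requirements.txt
mkdocs serve  # broken link warnings show up in the console output
```
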
## Overview
This takes the existing
[`/docs/developers/`](https://github.com/openxla/iree/tree/main/docs/developers)
folder and merges it into
[`docs/website/`](https://github.com/openxla/iree/tree/main/docs/website)
for publishing on https://iree.dev/.
The website has historically served a loosely defined group of "users" (people
hoping to take a model authored in an ML framework and deploy it to some
target). This change broadens the scope of the website to also include
developers (from established maintainers to new contributors).
This change does a few things:
* Moves existing pages across folders
* Deletes some stale pages
* Updates page style to match what the "mkdocs" site generator and
"mkdocs material" framework use
* Updates links across the project to use website URLs, relative links,
or GitHub links as appropriate
## Detailed list of changes
* Added "edit this page" buttons to all pages

* Merged `building_with_bazel_[linux, macos, windows]` into a single
"Building with Bazel" page that uses content tabs

* Renamed files so alphabetical sorting highlights the category that
each file belongs in

* Renamed files from using underscores to using dashes (more natural for
URLs)
* Merged some "debugging integration test" pages and deleted outdated
information (pointing at old TensorFlow code that no longer exists)
* Moved "developer tips" from the top level "Guides" category into the
"General development topics" subsection under this new top level
"Developers" category
* Applied lint and style fixes to all files (e.g. max line length,
`Subsection capitalization` instead of `Subsection Capitalization`)
* Merged "contributor tips" into "contributing"
* Redirected or removed references to docs/developers/ (e.g. website 404
page pointed there as another place to look for docs)
* Deleted "codegen passes", "hal driver features", and "dynamic shapes"
design docs (all were stale)
* Removed references to old processes (quirks based on supporting
Google's downstream monorepo)
## Future work
This PR is focused primarily on moving pages over and making minor
changes where appropriate. More work is needed to refresh the content on
several pages. The "developer docs" should be seen as a wiki of sorts,
so the support expectations are lower, but outdated or missing
documentation can be worse than no documentation in some respects.
Known issues to follow up on:
* The "Contributing" page should be updated, perhaps with a separate
page for "Infrastructure" forked out
* We have many "benchmarking" and "profiling" pages. That's great, but
people shouldn't need to read all of the pages to be productive
* The design docs are _very_ outdated. I removed a few of them, but we
should figure out if the remaining ones are worth keeping around. New
pages would be nice too
* These pages could have icons and other style tweaks, e.g. the sidebar shows
icons for some pages, but it looks better if all pages list them

* mkdocs [material] supports showing revision dates on files. That would be
useful for showing how fresh a file is, but files can be touched by
refactorings and generated files don't have git information... this needs more
thought (a quick git-based check is sketched below)
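
In the meantime, one way to spot check a page's freshness is plain git (the
path below is just an example):

```shell
git log -1 --format="%ci %h %s" -- docs/website/docs/developers/general/contributing.md
```
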
diff --git a/docs/README.md b/docs/README.md
deleted file mode 100644
index 7174d47..0000000
--- a/docs/README.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# IREE Documentation
-
-Documentation exclusively for project developers lives in
-[`developers/`](developers/), while the source pages and assets for IREE's
-user-focused website live in [`website/`](website/).
-
-Developer documentation should use GitHub-flavored markdown (see
-[this guide](https://guides.github.com/features/mastering-markdown/)), while
-the website uses [MkDocs](https://www.mkdocs.org/), with the
-[Material for MkDocs](https://squidfunk.github.io/mkdocs-material/) theme.
-
-A high bar should be set for pages published to the website:
-
-* Page structure should be consistent across sections
-* Content should be kept up to date regularly
-* Instructions should be portable across environments (e.g. don't
- overspecialize on a specific Linux distribution or a particular version of
- Visual Studio on Windows)
-
-When in doubt, the guide at https://developers.google.com/style offers good
-instructions.
-
-Developer documentation _can_ compromise on each of these points. Pages may
-also be promoted to website/ after some refinement.
-
-For more details on how this is set up, see
-
-* [IREE Website Overview - July 10, 2023](https://docs.google.com/presentation/d/116TyW_aCsPXmmjRYI2tRqpOwDaGNoV8LDC_j9hsMrDk/edit?usp=sharing)
diff --git a/docs/developers/README.md b/docs/developers/README.md
deleted file mode 100644
index a2b9599..0000000
--- a/docs/developers/README.md
+++ /dev/null
@@ -1,4 +0,0 @@
-# IREE Developer Documentation
-
-Project documentation, developer guides, and other pages not published on
-IREE's user-facing website.
diff --git a/docs/developers/debugging/integration_correctness_issue_breakdown.md b/docs/developers/debugging/integration_correctness_issue_breakdown.md
deleted file mode 100644
index 4996e9f..0000000
--- a/docs/developers/debugging/integration_correctness_issue_breakdown.md
+++ /dev/null
@@ -1,51 +0,0 @@
-This doc describes tips for triaging correctness issues. Feel free to reach out
-to @hanhanW or ask questions on Discord if you need help or tips on triaging
-correctness issues.
-
-# Decouple the repro from integrations
-
-## TF integration tests
-
-See [instructions for reproducing failures in TF/TFLite integration tests](https://github.com/hanhanW/iree/blob/main/docs/developers/debugging/tf_integrations_test_repro.md).
-
-Input data is not dumped within the flagfile. You can construct the function
-inputs by looking at `log.txt`. There is an [issue](https://github.com/openxla/iree/issues/8658)
-tracking this.
-
-## iree-samples
-
-Follow the [README](https://github.com/iree-org/iree-samples#readme) to run the model.
-The MLIR files will be generated. You'll find the path to the saved files in the
-log. E.g.,
-
-``` shell
-[ RUN ] MobilenetV2Int8Test.test_compile_tflite
-I0401 17:27:04.084272 140182373025024 test_util.py:119] Setting up for IREE
-I0401 17:27:04.085064 140182373025024 binaries.py:218] Invoke IREE Pipeline:
- /tmp/iree-samples/iree-samples.venv/lib/python3.9/site-packages/iree/tools/tflite/iree-import-tflite
- /tmp/iree-samples/tflitehub/tmp/mobilenet_v2_int8_test.py/model.tflite
- --mlir-print-debuginfo
- --save-temp-tfl-input=/tmp/iree-samples/tflitehub/tmp/mobilenet_v2_int8_test.py/tflite.mlir
- --save-temp-iree-input=/tmp/iree-samples/tflitehub/tmp/mobilenet_v2_int8_test.py/tosa.mlir
-```
-
-Unfortunately, the artifacts are not dumped in the runs. There is an [issue](https://github.com/openxla/iree/issues/8756)
-for tracking this. A workaround can be found in the issue.
-
-# Narrow down the repro
-
-The model itself is big. IREE breaks a model into dispatches and launches the
-kernels. The inputs and outputs could start diverging at any one of those
-launches. To get a smaller repro, you can use [-iree-flow-trace-dispatch-tensors](https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/developer_overview.md#iree-flow-trace-dispatch-tensors).
-You can compare the logs between builds/backends to figure out which dispatch
-produces wrong outputs. The dumped inputs can be reused in a flagfile.
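-
-A minimal sketch of using that flag (the compile command, backend, and paths
-here are assumptions, not part of the original instructions):
-
-``` shell
-iree-compile --iree-hal-target-backends=llvm-cpu \
-  --iree-flow-trace-dispatch-tensors \
-  /tmp/model.mlir -o /tmp/model_traced.vmfb
-# Running the traced module prints each dispatch's inputs/outputs; compare the
-# prints across backends/builds to find the first dispatch that diverges.
-```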
-
-Once we have identified the suspicious dispatch, we can create a test case based
-on the dispatch function. The dispatch function can be derived after the
-`OutlineDispatchRegions` pass. The function signature has to be modified
-manually: turn the `flow.dispatch.tensor.load` values into function arguments,
-and replace the `flow.dispatch.tensor.store` with a `return` op.
-
-At this stage, the repro is narrowed down to a single dispatch function.
-
-Note: This only works when the dispatch formation logic is identical between runs.
diff --git a/docs/developers/debugging/lldb_on_android.md b/docs/developers/debugging/lldb_on_android.md
deleted file mode 100644
index 1fb6e7d..0000000
--- a/docs/developers/debugging/lldb_on_android.md
+++ /dev/null
@@ -1,61 +0,0 @@
-This doc shows how to use LLDB to debug native binaries on Android. For a more
-complete explanation, see the
-[official LLDB documentation on remote debugging](https://lldb.llvm.org/use/remote.html).
-
-# Debugging with LLDB on Android
-
-## Prerequisites
-
-We assume the following setup:
-
-1. [Android NDK is installed](https://developer.android.com/ndk/downloads) and
- the `ANDROID_NDK` environment variable is set to the installation path.
-1. Your Android device connected and configured for
- [`adb`](https://developer.android.com/studio/command-line/adb).
-1. The Android binary of interest is already compiled and the command to run it
- (in `adb shell`) is `<your-binary> [program args...]`. This does *not* have
- to be a proper Android app with a manifest, etc.
-
-## Running Manually
-
-1. Push the toolchain files, including `lldb-server`, to your device:
-
- ```shell
- adb shell "mkdir -p /data/local/tmp/tools"
- adb push "$ANDROID_NDK"/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.6/lib/linux/aarch64/* /data/local/tmp/tools
- ```
-
- You may need to adjust the clang toolchain version to match the one in your
- NDK. You can find it with
- `find "$ANDROID_NDK/toolchains/llvm/prebuilt" -name lldb-server`.
-
-1. Set up port forwarding. We are going to use port 5039 but you are free to
- pick a different one:
-
- ```shell
- adb forward tcp:5039 tcp:5039
- ```
-
-1. Start an `lldb-server` in a new interactive adb shell:
-
- ```shell
- adb shell
- /data/local/tmp/tools/lldb-server platform --listen '*:5039' --server
- ```
-
-1. Launch `lldb`, connect to the server and run the binary:
-
- ```shell
- lldb -o 'platform select remote-android' \
- -o 'platform connect connect://:5039' \
- -o 'platform shell cd /data/local/tmp'
- target create <your-binary>
- run [program args...]
- ```
-
- You can either use the system `lldb` or a prebuilt under `"$ANDROID_NDK"/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.6/lib/linux/<your-host-arch>`.
-
- Explanation: each `-o` (short for `--one-shot`) tells lldb to execute a
- command on startup. You can run those manually in the lldb shell, if you
- prefer. Then, we tell lldb which working directory to use, where to find the
- executable, and what command line arguments to use.
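-
-   Once connected, normal lldb commands work. A small hedged example session
-   (the breakpoint target is just an illustration):
-
-   ```shell
-   breakpoint set --name main
-   run
-   bt            # print a backtrace when the breakpoint is hit
-   continue
-   ```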
diff --git a/docs/developers/debugging/tf_integrations_test_repro.md b/docs/developers/debugging/tf_integrations_test_repro.md
deleted file mode 100644
index 3cde538..0000000
--- a/docs/developers/debugging/tf_integrations_test_repro.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# Debugging failures in TF/TFLite integration tests
-
-These are steps to reproduce/address failures in TF/TFLite integration tests.
-These instructions are most stable on Linux, though they may work with a few
-tweaks on Windows and macOS.
-
-All steps here assume starting from the IREE root directory.
-
-1. First create a Python virtual environment to install packages into:
-
- ```bash
- python -m venv iree-tf.venv
- source iree-tf.venv/bin/activate
-
- # Install test requirements
- python -m pip install -r ./integrations/tensorflow/test/requirements.txt
- ```
-
-2. Install IREE's tools and Python bindings or build them from source
-
- Install distributed packages
-
- ```bash
- # Install packages from nightly releases
- # This should work for most cases, as the importers change infrequently
- python -m pip install \
- iree-compiler iree-runtime iree-tools-tf iree-tools-tflite \
- --find-links https://iree.dev/pip-release-links.html
- ```
-
- _OR_ build from source
-
- ```bash
- # Build Python bindings from source
- cmake -G Ninja -B ../iree-build/ -DIREE_BUILD_PYTHON_BINDINGS=ON .
- cmake --build ../iree-build/
-
- # Add IREE built-from-source Python packages to PYTHONPATH
- source .env
-
- # Install IREE TF/TFLite Python packages
- python -m pip install integrations/tensorflow/python_projects/iree_tf
- python -m pip install integrations/tensorflow/python_projects/iree_tflite
- ```
-
-3. Run the python test command line
-
- The command can be obtained from the run file. For example, if
- `iree_tfl_tests/llvmcpu_posenet_i8.run` failed,
-
- ```bash
- cd integrations/tensorflow/test/
- cat iree_tfl_tests/llvmcpu_posenet_i8.run
-
- # REQUIRES: llvmcpu
- # RUN: %PYTHON -m iree_tfl_tests.posenet_i8_test --target_backend=llvmcpu --artifacts_dir=%t
-
- cd python/
- python -m iree_tfl_tests.posenet_i8_test --target_backend=llvmcpu --artifacts_dir=/tmp/posenet_i8_failure
- ```
-
-    Note that the command can only be run from within the
-    `integrations/tensorflow/test/python` directory.
-
-4. Extract intermediate files and use with native tools
-
-    The test will create an `iree_input.mlir` in the specified temp directory.
-    That file can then be fed into `iree-compile` (built locally) to reproduce
-    the error:
-
- ```bash
- iree-compile \
- --iree-hal-target-backends=llvm-cpu \
- --iree-input-type=stablehlo \
- iree_input.mlir
- ```
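-
-    If you add `-o` to write the compiled module to a file, a sketch of running
-    it with `iree-run-module` looks like the following (the device, function
-    name, and input shape are hypothetical and depend on the test):
-
-    ```bash
-    iree-compile \
-      --iree-hal-target-backends=llvm-cpu \
-      --iree-input-type=stablehlo \
-      iree_input.mlir -o /tmp/iree_input.vmfb
-
-    iree-run-module \
-      --device=local-task \
-      --module=/tmp/iree_input.vmfb \
-      --function=main \
-      --input="1x224x224x3xf32=0"
-    ```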
diff --git a/docs/developers/design_docs/codegen_passes.md b/docs/developers/design_docs/codegen_passes.md
deleted file mode 100644
index 4f69024..0000000
--- a/docs/developers/design_docs/codegen_passes.md
+++ /dev/null
@@ -1,659 +0,0 @@
-# IREE CPU/GPU Code Generation Pipeline
-
-This document is intended to provide an overview of the codegen pipeline within
-IREE used to generate CPU/GPU code. It intends to give an overview of the main
-passes used, the objective of the pass, the current implementation, and what it
-is expected to achieve in the long term.
-
-Note that while the code generation pipeline supports dynamic shapes, this work
-is very preliminary and is not covered here.
-
-## Input to the codegen pipeline
-
-The input to the code generation pipeline is the module within the
-`hal.executable.variant` operation. Functions within this module that do __not__
-have `Visibility::Private` are the *entry point* functions of the dispatch
-region. These are the functions that are *invoked* by the IREE runtime. In
-addition, each dispatch region also contains a `hal.interface` operation that
-describes the ABI to use for the dispatch region. Two examples of the input to
-the code generation pipeline are shown below. In both of these, a single
-dispatch function contains a sequence of MHLO operations that the dispatch
-region creation has grouped into a single region. Ideally the grouped operations
-are fused into a single kernel.
-
-```mlir
-hal.executable.variant "vulkan*" {
- module attributes {spv.target_env = ...} {
- func @main_ex_dispatch() {
- %c0 = constant 0 : index
- %0 = hal.interface.load.tensor @legacy_io::@arg0,
- offset = %c0 : tensor<32x24xf32>
- %1 = hal.interface.load.tensor @legacy_io::@arg1,
- offset = %c0 : tensor<24x16xf32>
- %2 = "mhlo.dot"(%0, %1) {precision_config = ["DEFAULT", "DEFAULT"]} :
- (tensor<32x24xf32>, tensor<24x16xf32>) -> tensor<32x16xf32>
- hal.interface.store.tensor %2, @legacy_io::@ret0,
- offset = %c0 : tensor<32x16xf32>
- return
- }
- hal.interface private @legacy_io {
- hal.interface.binding @arg0, set=0, binding=0,
- type="StorageBuffer", access="Read"
- hal.interface.binding @arg1, set=0, binding=1,
- type="StorageBuffer", access="Read"
- hal.interface.binding @ret0, set=0, binding=2,
- type="StorageBuffer", access="Write|Discard"
- }
- }
-}
-```
-
-<a name="snippet1"></a> Snippet 1 : Dispatch region with matrix-matrix multiply
-operation.
-
-```mlir
-hal.executable.variant "vulkan*" {
- module attributes {spv.target_env = ...} {
- func @main_ex_dispatch() {
- %c0 = constant 0 : index
- %0 = hal.interface.load.tensor @legacy_io::@arg0,
- offset = %c0 : tensor<10x15xf32>
- %1 = hal.interface.load.tensor @legacy_io::@arg1,
- offset = %c0 : tensor<10x15xf32>
- %2 = hal.interface.load.tensor @legacy_io::@arg2,
- offset = %c0 : tensor<15xf32>
- %3 = "mhlo.add"(%0, %1) :
- (tensor<10x15xf32>, tensor<10x15xf32>) -> tensor<10x15xf32>
- %4 = "mhlo.broadcast"(%2) : (tensor<15xf32>) -> tensor<10x15xf32>
- %5 = "mhlo.multiply"(%3, %4) :
- (tensor<10x15xf32>, tensor<10x15xf32>) -> tensor<10x15xf32>
- hal.interface.store.tensor %5, @legacy_io::@ret0,
- offset = %c0 : tensor<10x15xf32>
- return
- }
- hal.interface private @legacy_io {
- hal.interface.binding @arg0, set=0, binding=0,
- type="StorageBuffer", access="Read"
- hal.interface.binding @arg1, set=0, binding=1,
- type="StorageBuffer", access="Read"
- hal.interface.binding @arg2, set=0, binding=2,
- type="StorageBuffer", access="Read"
- hal.interface.binding @ret0, set=0, binding=3,
- type="StorageBuffer", access="Write|Discard"
- }
- }
-}
-```
-
-<a name="snippet2"></a> Snippet 2 : Dispatch region with element-wise
-operations.
-
-__Roadmap Note__: The current implementation might not actually fuse the
-operations grouped into a dispatch region into a single kernel. It is possible
-to end up with multiple kernels per dispatch region. Over time we plan to
-address this by using fusion at different levels (see below).
-
-The inputs to the dispatch region are materialized within the entry point
-function using the `hal.interface.load.tensor` operation. This operation returns
-a `tensor` view of the buffer used to store the inputs. Similarly, the results of
-the dispatch region are *written* out using the `hal.interface.store.tensor`
-operation.
-
-The main constraint that the code generation operates under is that it should
-not require additional (temporary) buffers to execute the operations grouped
-together within a dispatch region. The rationale behind this constraint is that
-buffer allocation/synchronization in IREE happens at the granularity of dispatch
-regions, allowing the scheduler to make better decisions about where to insert
-appropriate synchronizations.
-
-The IR after all the passes used in the lowering from MHLO to SPIR-V for the
-above two examples can be found here ([matrix-matrix multiply op][DotAfterAll],
-[elementwise ops][PwAfterAll]). Below is a description of the major passes used.
-
-## Conversion from MHLO dialect to Linalg on buffers
-
-The code generation pipeline heavily relies on use of
-[Structured Operations][LinalgRationale], specifically the
-[Linalg Dialect][LinalgDialect]. Both the Linalg operations on `tensor`s and on
-`memref`s are central to the progressive lowering approach followed here. The
-first part of the code generation pipeline is to convert the MHLO operations on
-`tensor`s to Linalg operations on `memref`s. This part of the pipeline is common
-to both CPU and GPU code generation.
-
-The steps involved in this conversion are shown below. Each of the arrows
-represents a pass in the pipeline:
-
-
-
-The next sections describe each of these passes in more detail.
-
-### MHLO to Linalg on tensors
-
-The first step is to convert MHLO operations to Linalg on tensors. This is done
-using the [HLOToLinalgPass][HLOToLinalgPass] from TensorFlow. An example of the
-conversion is shown below, where the `mhlo.add`, `mhlo.broadcast` and
-`mhlo.multiply` operations are converted to `linalg.generic` operations on
-tensors.
-
-```mlir
-#map0 = affine_map<(d0, d1) -> (d0, d1)>
-#map1 = affine_map<(d0, d1) -> (d1)>
-%3 = linalg.generic
- {args_in = 2 : i64, args_out = 1 : i64,
- indexing_maps = [#map0, #map0, #map0],
- iterator_types = ["parallel", "parallel"]} %0, %1 {
- ^bb0(%arg0: f32, %arg1: f32): // no predecessors
- %5 = addf %arg0, %arg1 : f32
- linalg.yield %5 : f32
- } : tensor<10x15xf32>, tensor<10x15xf32> -> tensor<10x15xf32>
-%4 = linalg.generic
- {args_in = 1 : i64, args_out = 1 : i64,
- indexing_maps = [#map1, #map0],
- iterator_types = ["parallel", "parallel"]} %2 {
- ^bb0(%arg0: f32): // no predecessors
- linalg.yield %arg0 : f32
- }: tensor<15xf32> -> tensor<10x15xf32>
-%5 = linalg.generic
- {args_in = 2 : i64, args_out = 1 : i64,
- indexing_maps = [#map0, #map0, #map0],
- iterator_types = ["parallel", "parallel"]} %3, %4 {
- ^bb0(%arg0: f32, %arg1: f32): // no predecessors
- %5 = mulf %arg0, %arg1 : f32
- linalg.yield %5 : f32
- }: tensor<10x15xf32>, tensor<10x15xf32> -> tensor<10x15xf32>
-```
-
-<a name="snippet3"></a> Snippet 3 : MHLO to Linalg conversion for
-[element-wise operations](#snippet2)
-
-At the time of writing the representation of Linalg on `tensor`s does not model
-reduction iterator types completely. Specifically, the reduction in Linalg is
-modeled using a read-modify-write approach, i.e. each iteration of the reduction
-loop reads the value stored in the output, adds its contribution, and writes
-back to the same location. This means the output has to be *initialized* to the
-null element of the reduction operator (i.e. 0 if the reduction is done using
-addition). This works for operations on buffers. Since tensors are SSA values
-they cannot be updated in-place. As a result, the reduction semantics does not
-map as well to `tensor`s. For now it is treated as a convention that when the
-Linalg operation is converted to use `memref`s it has to be initialized
-appropriately before performing the reduction. Due to this, the conversion from
-MHLO op to Linalg op is only done for operations which do not need a *reduction*
-iterator type in the converted Linalg op. Consequently, only element-wise
-operations, broadcast operations and data movement operations (like copy and
-transpose) are converted to Linalg operations at this stage.
-
-__Roadmap note__: One long term solution for the above is to have operations on
-tensors that have *reduction* iterator type to take an additional argument that
-contains the initial value of the result tensor. When the operation is converted
-to use `memref`s, the buffer for the initial value operand can be reused for the
-result. The details involved have not been fully worked out yet.
-
-### Fusion of Linalg on tensor operations
-
-The Linalg on `tensor` operations generated at the previous step are fused using
-the [LinalgFusionOfTensorOps][LinalgFusionOfTensorOps] from MLIR. Since
-`tensor`s are SSA values, fusion at this stage can be done without using alias
-analysis or dependence analysis based on reads and writes. Instead the use-def
-chains for the `tensor` values can be used to implement producer-consumer
-fusion. This stage fuses most elementwise operations, broadcast operations and
-data movement operations. An example of the fused op is shown below.
-
-```mlir
-#map0 = affine_map<(d0, d1) -> (d0, d1)>
-#map1 = affine_map<(d0, d1) -> (d1)>
-%3 = linalg.generic
- {args_in = 3 : i64, args_out = 1 : i64,
- indexing_maps = [#map0, #map0, #map1, #map0],
- iterator_types = ["parallel", "parallel"]} %0, %1, %2 {
- ^bb0(%arg0: f32, %arg1: f32, %arg2: f32): // no predecessors
- %4 = addf %arg0, %arg1 : f32
- %5 = mulf %4, %arg2 : f32
- linalg.yield %5 : f32
- }: tensor<10x15xf32>, tensor<10x15xf32>, tensor<15xf32> -> tensor<10x15xf32>
-```
-
-<a name="snippet4"></a> Snippet 4: Fusion of Linalg operation on tensors for
-element-wise operations shown in [Snippet 3](#snippet3)
-
-### Conversion of Linalg on tensors to Linalg on buffers
-
-Post fusion, all the operations on `tensor`s are converted to analogous operations
-on `memref`s. In general, this requires a buffer allocation pass. In IREE,
-buffer allocation happens at the granularity of dispatch regions, and as
-mentioned [earlier](#input-to-the-codegen-pipeline), the dispatch region is not
-expected to use any additional temporary buffers. So instead of having another
-buffer allocation pass within the code generation pipeline, a simpler approach
-is used within IREE:
-
-- For each `hal.interface.store.tensor` an `iree.placeholder` operation is
- created. The latter uses the same `hal.interface.binding` as the former, but
- returns a `memref` view of the output of the dispatch region instead of a
- `tensor` view. This `iree.placeholder` operation is added to start of the
- entry point function.
-
-- A map is constructed that for a given `tensor` records the `memref` value to
- use during the conversion. In this map the `tensor` value used in the
- `hal.interface.store.tensor` is mapped to the `memref` value returned by the
- created `iree.placeholder` operation.
-
-- The Dialect Conversion framework is used to implement a set of patterns that
-  convert operations on `tensor`s to operations on `memref`s:
-
-  - A `hal.interface.load.tensor` is replaced with an `iree.placeholder` to
- get the `memref` view of the input to the dispatch region.
-  - All Linalg operations on `tensor`s (expected to be just `linalg.generic`
- or `linalg.indexed_generic` operations) are converted to the
- corresponding operation on `memref`s. Instead of returning a `tensor`
- value the converted operation takes an additional `memref` operand as
- argument. This `memref` is where the result of the operation is
-    populated. The current implementation looks up the `memref` to use in the
- map constructed previously. If there is no `memref` associated with the
- result `tensor` the conversion fails.
-  - At this stage, any `mhlo` operations not converted to a Linalg operation
-    are directly converted to Linalg operations on buffers. This is done
-    for operations that, when converted to Linalg, have a *reduction* iterator
- type. Some examples of ops converted this way are
-
- - `mhlo.dot`
- - `mhlo.reduce`
- - `mhlo.conv`
- - `mhlo.reduce_window`.
-
-    Since the specification of the Linalg operations requires the output
- `memref` to be initialized appropriately, a `linalg.fill` operation is
- used to achieve this.
-
-__Roadmap Note__ : Right now the code-generation pipeline relies on fusion of
-operations at the tensor level. In the near future, we want to be able to fuse
-operations like `linalg.matmul` and `linalg.conv` with consumers/producers that
-are element-wise operations using the
-[fusion of Linalg operation on `memref`s][LinalgFusionOnBuffers].
-
-At this stage of the compilation all operations must have been converted to
-Linalg operations on buffers. Shown below is the IR at the end of this stage
-for the two examples in Snippets 1 and 2.
-
-```mlir
-func @main_ex_dispatch() {
- %0 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@ret0} : memref<32x16xf32>
- %c0 = constant 0 : index
- %1 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg0} : memref<32x24xf32>
- %2 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg1} : memref<24x16xf32>
- %cst = constant 0.000000e+00 : f32
- linalg.matmul(%1, %2, %0) :
- memref<32x24xf32>, memref<24x16xf32>, memref<32x16xf32>
- return
-}
-```
-
-<a name="snippet5"></a> Snippet 5 : Matrix-matrix multiply after conversion to
-Linalg operation on `memref`s.
-
-```mlir
-#map0 = affine_map<(d0, d1) -> (d0, d1)>
-#map1 = affine_map<(d0, d1) -> (d1)>
-func @main_ex_dispatch() {
- %0 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@ret0} : memref<10x15xf32>
- %c0 = constant 0 : index
- %1 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg0} : memref<10x15xf32>
- %2 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg1} : memref<10x15xf32>
- %3 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg2} : memref<15xf32>
- linalg.generic
- {args_in = 3 : i64, args_out = 1 : i64,
- indexing_maps = [#map0, #map0, #map1, #map0],
- iterator_types = ["parallel", "parallel"]} %1, %2, %3, %0 {
- ^bb0(%arg0: f32, %arg1: f32, %arg2: f32, %arg3: f32): // no predecessors
- %4 = addf %arg0, %arg1 : f32
- %5 = mulf %4, %arg2 : f32
- linalg.yield %5 : f32
- }: memref<10x15xf32>, memref<10x15xf32>, memref<15xf32>, memref<10x15xf32>
- return
-}
-```
-
-<a name="snippet6"></a> Snippet 6 : Elementwise operations after conversion to
-Linalg operation on `memref`s
-
-The rest of the code-generation differs on whether the compilation is for CPU
-(using LLVM) or for GPU (using SPIR-V).
-
-## Conversion from Linalg on buffers to SPIR-V dialect
-
-The following sections describe the progressive lowering of Linalg operations on
-buffers to the SPIR-V dialect. Once lowered to the SPIR-V dialect, the IR can be
-serialized into a SPIR-V binary using the
-[serialization mechanism provided by the SPIR-V dialect][SpirvSerialization].
-The steps involved in the lowering are described below, with each of the arrows
-representing a pass.
-
-
-
-These passes are described below in more detail.
-
-### Tiling and fusion on buffer operations
-
-GPU hardware typically provides multiple levels of compute hierarchy, namely
-*workgroup* level, *subgroup* level and *workitem* level. These map to blocks,
-warps and threads, respectively, in CUDA terminology. Tiling is a way to map the
-computations to each level of the compute hierarchy. For example, 3-D tiling a
-`linalg.matmul` operation decomposes the computation into several tiled
-matrix-matrix multiplies.
-[Tiling transformation in Linalg dialect][LinalgTiling] generates the
-outer-loops that iterate over tiled `linalg.matmul` operations. These outer
-loops can be mapped to different workgroups, if they are parallel. The tiled
-`linalg.matmul` operation can be further tiled to map to subgroups. Finally, the
-tiled operation can be lowered to loops with individual iterations mapped to
-workitems. The [LinalgTileAndFusePass][LinalgTileAndFuse] uses the Linalg Tiling
-patterns ([defined here][LinalgTilingPatterns]) to tile operations like
-`linalg.matmul`, `linalg.conv` and `linalg.*_pooling`. The result of tiling the
-code in Snippet 5 is shown below. As expected, there are two parallel loops that
-iterate over tiles of the original iteration space (i.e. inter-tile loops) and
-can be distributed to workgroups.
-
-```mlir
-func @main_ex_dispatch_0()
- attributes {
- spv.entry_point_abi = {local_size = dense<[8, 8, 1]> : vector<3xi32>}} {
- %cst = constant 0.000000e+00 : f32
- %c32 = constant 32 : index
- %c24 = constant 24 : index
- %c16 = constant 16 : index
- %c0 = constant 0 : index
-  %c4 = constant 4 : index
-  %c8 = constant 8 : index
- %0 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@ret0} : memref<32x16xf32>
- %1 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg0} : memref<32x24xf32>
- %2 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg1} : memref<24x16xf32>
- linalg.fill(%cst, %0) : f32, memref<32x16xf32>
- scf.parallel (%arg0, %arg1) = (%c0, %c0) to (%c32, %c16) step (%c8, %c8) {
-    scf.for %arg2 = %c0 to %c24 step %c4 {
- ...
- %5 = subview %1[%arg0, %arg2]...
- ...
- %8 = subview %2[%arg2, %arg1]...
- ...
- %11 = subview %0[%arg0, %arg1]..
- linalg.matmul {__internal_linalg_transform__ = "workgroup"} %5, %8, %11...
- }
- scf.yield
- }
- return
-}
-```
-
-<a name="snippet7"></a> Snippet 7 : `linalg.matmul` after tiling.
-
-#### Tile Size and Workgroup Size
-
-When operations that are to be tiled exist within the dispatch function (like
-`linalg.matmul` or `linalg.conv`), this pass also decides:
-
-1. The tile size to be used for the tiling.
-1. The workgroup size to be used.
-
-The tile size and workgroup size are closely linked since the code within the
-tiled loops is to be collectively executed by the entire workgroup. In other
-words, all workitems in the workgroup collaborate to execute the tiled
-`linalg.matmul`.
-
-__Roadmap Note__ : Currently the tile sizes used in this pass are hard-wired.
-Not much effort has been put into finding ideal tile sizes for each operation on
-different hardware. The value used is meant to be a baseline to test
-functionality, with performance considerations addressed over time.
-
-#### Markers
-
-Downstream passes have to handle tiled Linalg operations and untiled Linalg
-operations that might exist in the same function in different ways. For example,
-while the former are to be executed collectively by workitems within a
-workgroup, the latter have to be executed by all workitems across workgroups.
-One way to distinguish these two operations is to use the marker mechanism in
-Linalg ([LinalgTransformationFilter][LinalgTilingPatterns]). This is a `StrAttr`
-whose value can be used to encode the scope of the operation. For example, in
-Snippet 7 above, the tiled `linalg.matmul` operation has a marker `workgroup` to
-indicate that this operation needs to be executed by a workgroup in a collective
-manner. At this time, the code-generation pipeline uses only the `workgroup`
-marker.
-
-__Roadmap Note__ : Markers are meant to be short-lived, ideally set and consumed
-within the same pass. In the current pipeline the lifetime spans passes to allow
-lowering to different hierarchies. The separate passes that implement the
-lowering from Linalg to SPIR-V can be combined into a single pass, relying on
-the A -> B -> C translation mechanism of the Dialect Conversion framework to
-implement the progressive lowering. In the interest of separation of concerns
-and for better debuggability, these passes are kept separate at the cost of
-having the lifetimes of markers span passes.
-
-#### Promoting subviews to use workgroup local memory and use of synchronizations
-
-`Workgroup` memory (or `shared memory` in CUDA terminology) can be used to
-prefetch the inputs to the tiled operation. For example in the matrix-matrix
-multiply case, the same data row (column) of the LHS (RHS) matrix is read by
-multiple workitems. Prefetching the data into `Workgroup` memory can reduce the
-number of loads to `StorageClass` memory by an order of magnitude. This
-transformation can be achieved by using the
-[`Linalg Promotion`][LinalgPromotionPatterns] which modifies the `subview`s that
-are the operands to the tiled Linalg operation to use a new `memref` object. The
-size of this `memref` is computed from the size of the `subview`. This `memref`
-object is later lowered to use `Workgroup` memory Storage Class. The snippet
-below shows this transformation when applied to `linalg.matmul` (along with
-tiling). The newly created `memref` objects are annotated with the memory space
-`3` to indicate that they are to be lowered to use `Workgroup` memory. The copy
-of data from the original `memref` into the new `memref`, as well as the
-necessary synchronization constructs are generated as well. Note the memory
-space annotation used here is consistent with the
-[address space annotations used in NVVM][NVVMAddressSpace].
-
-```mlir
-func @matmul_tile()
- attributes {
- spv.entry_point_abi = {local_size = dense<[8, 8, 1]> : vector<3xi32>}} {
- %c32 = constant 32 : index
- %c24 = constant 24 : index
- %c16 = constant 16 : index
- %c4 = constant 4 : index
- %c8 = constant 8 : index
- %c0 = constant 0 : index
- %c1 = constant 1 : index
- %0 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg0} : memref<32x24xf32>
- %1 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg1} : memref<24x16xf32>
- %2 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@ret0} : memref<32x16xf32>
- scf.parallel (%arg0, %arg1) = (%c0, %c0) to (%c32, %c16) step (%c8, %c8) {
- scf.for %arg2 = %c0 to %c24 step %c4 {
- ...
- %5 = subview %0[%arg0, %arg2]...
- ...
- %8 = subview %1[%arg2, %arg1]...
- ...
- %11 = subview %2[%arg0, %arg1]...
- %12 = alloc(%c8, %c4) : memref<?x?xf32, 3>
- %13 = subview %12[%c0, %c0]...
- %14 = alloc(%c4, %c8) : memref<?x?xf32, 3>
- %15 = subview %14[%c0, %c0]...
- linalg.copy(%5, %13) {__internal_linalg_transform__ = "workgroup"}
- : memref<?x?xf32, #map2>, memref<?x?xf32, #map2, 3>
- spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
- linalg.copy(%8, %15) {__internal_linalg_transform__ = "workgroup"}
- : memref<?x?xf32, #map2>, memref<?x?xf32, #map2, 3>
- spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
- linalg.matmul {__internal_linalg_transform__ = "workgroup"} %13, %15, %11...
- spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
- dealloc %12 : memref<?x?xf32, 3>
- dealloc %14 : memref<?x?xf32, 3>
- }
- scf.yield
- }
- return
-}
-```
-
-<a name="snippet8"></a> Snippet 8: `linalg.matmul` after tiling and promotion of
-operand subviews to use `Workgroup` memory.
-
-### Distributing to workgroups and workitems
-
-After tiling, the operations within the dispatch functions are either
-`scf.parallel` operations or Linalg operations.
-
-- The outer `scf.parallel` operations represent parallel loops that are to be
- distributed across workgroups. The distribution here assumes that the number
- of workgroups along each dimension is equal to the number of iterations of
- the `scf.parallel` operation.
-
-- Linalg operations that are not tiled, and are therefore __not within__ `scf`
- operations, are lowered to loops. The resulting outer `scf.parallel`
- operations are collapsed to have a single induction variable. This loop is
- then distributed across workitems using their `GlobalInvocationId`, (which
-  then distributed across workitems using their `GlobalInvocationId` (which
-  is the same as `blockIdx * blockDim + threadIdx` in CUDA terminology).
-- Linalg operations that are tiled, and are therefore __within__ `scf`
- operations, are lowered to loops and the iterations of the `scf.parallel`
- operations are mapped to workitems using their `LocalInvocationId` (which is
-  the same as `threadIdx` in CUDA terminology). Note that these operations are
- tagged with the `workgroup` marker which makes it easy to disambiguate from
- the case where Linalg operations are outside of `scf` operations. Here too,
- the distribution assumes that the workgroup size is greater than or equal to
- the number of iterations of the partitioned loop.
-
-These transformations are applied by the [`ConvertToGPUPass`][ConvertToGPU].
-Below is the result of applying this pass to Snippet 7. The outer `scf.parallel`
-loop is distributed across workgroups. The tiled `linalg.matmul` operation is
-lowered to loops, and the outer `scf.parallel` operations generated during this
-lowering are distributed across workitems within the workgroup.
-
-```mlir
-func @main_ex_dispatch_0_dispatch_1()
- attributes {
- spv.entry_point_abi = {local_size = dense<[8, 8, 1]> : vector<3xi32>}} {
- %c24 = constant 24 : index
- %c8 = constant 8 : index
- %c4 = constant 4 : index
- %c0 = constant 0 : index
- %c1 = constant 1 : index
- %0 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@ret0} : memref<32x16xf32>
- %1 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg0} : memref<32x24xf32>
- %2 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg1} : memref<24x16xf32>
- %3 = "gpu.block_id"() {dimension = "x"} : () -> index
- %4 = "gpu.block_id"() {dimension = "y"} : () -> index
- %5 = muli %4, %c8 : index
- %6 = muli %3, %c8 : index
- scf.for %arg0 = %c0 to %c24 step %c4 {
- ...
- %15 = subview %1[%5, %arg0]
- ...
- %20 = subview %2[%arg0, %6]
- %21 = subview %0[%5, %6]
- %22 = "gpu.thread_id"() {dimension = "x"} : () -> index
- %23 = "gpu.thread_id"() {dimension = "y"} : () -> index
- %24 = cmpi "slt", %23, %10 : index
- %25 = cmpi "slt", %22, %19 : index
- %26 = and %24, %25 : i1
- scf.if %26 {
- scf.for %arg1 = %c0 to %14 step %c1 {
- %27 = load %15[%23, %arg1] : memref<?x?xf32, #map0>
- %28 = load %20[%arg1, %22] : memref<?x?xf32, #map1>
- %29 = load %21[%23, %22] : memref<?x?xf32, #map1>
-        %30 = mulf %27, %28 : f32
-        %31 = addf %29, %30 : f32
-        store %31, %21[%23, %22] : memref<?x?xf32, #map1>
- }
- }
- }
- return
-}
-```
-
-<a name="snippet9"></a> Snippet 9: `linalg.matmul` after distributing parallel
-inter-tile loops to workgroups and intra-tile loops to workitems.
-
-[Snippet 6](#snippet6) shows the fused element-wise operations represented using
-a `linalg.generic` operation. This operation is not tiled in the
-`LinalgTileAndFusePass`. So the `ConvertToGPUPass` lowers this operation to
-`scf.parallel` loops, which are collapsed into a `scf.parallel` operation with a
-single induction variable. This loop is then distributed across workitems using
-the `GlobalInvocationId`. The resulting IR is shown below.
-
-```mlir
-func @main_ex_dispatch_0()
- attributes {
- spv.entry_point_abi = {local_size = dense<[32, 1, 1]> : vector<3xi32>}} {
- %c50 = constant 50 : index
- %c5 = constant 5 : index
- %0 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@ret0} : memref<10x15xf32>
- %1 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg0} : memref<10x15xf32>
- %2 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg1} : memref<10x15xf32>
- %3 = iree.placeholder for "interface buffer"
- {binding = @legacy_io::@arg2} : memref<15xf32>
- %4 = "gpu.block_id"() {dimension = "x"} : () -> index
- %5 = "gpu.block_dim"() {dimension = "x"} : () -> index
- %6 = "gpu.thread_id"() {dimension = "x"} : () -> index
- %7 = muli %4, %5 : index
- %8 = addi %7, %6 : index
- %9 = cmpi "slt", %8, %c50 : index
- scf.if %9 {
- %10 = divi_signed %8, %c5 : index
- %11 = remi_signed %8, %c5 : index
- %12 = load %1[%10, %11] : memref<10x15xf32>
- %13 = load %2[%10, %11] : memref<10x15xf32>
- %14 = load %3[%11] : memref<15xf32>
- %15 = addf %12, %13 : f32
- %16 = mulf %15, %14 : f32
- store %16, %0[%10, %11] : memref<10x15xf32>
- }
- return
-}
-```
-
-<a name="snippet10"></a> Snippet 10: Distributing the iterations for pointwise
-operations for GPU execution.
-
-### Lowering to SPIR-V dialect
-
-The last step is to take the result of the previous pass and lower it to the
-SPIR-V dialect. Since the SPIR-V dialect is *closed*, i.e. it has a separate type
-system, it's best to lower all the operations to SPIR-V in one step. This is done
-by applying all the patterns that lower all the different IR constructs into
-SPIR-V within the [`ConvertToSPIRVPass`][ConvertToSPIRV]. These are
-
-- [GPU dialect to SPIR-V conversion][GPUToSPIRV].
-- [SCF dialect to SPIR-V conversion][SCFToSPIRV].
-- [Standard dialect to SPIR-V conversion][StandardToSPIRV].
-- Patterns that lower the `iree.placeholder` instruction into SPIR-V.
-
-Once applied, the resulting IR is in the SPIR-V dialect and can be serialized to
-a SPIR-V binary.
-
-[ConvertToGPU]: https://github.com/openxla/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToGPUPass.cpp
-[ConvertToSPIRV]: https://github.com/openxla/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToSPIRVPass.cpp
-[DotAfterAll]: https://gist.github.com/MaheshRavishankar/9e2d406296f469515c4a79bf1e7eef44
-[GPUToSPIRV]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Conversion/GPUToSPIRV/ConvertGPUToSPIRV.h
-[HLOToLinalgPass]: https://github.com/tensorflow/tensorflow/blob/75c40f6bff2faa3d90a375dfa4025b2e6e2d7a3d/tensorflow/compiler/mlir/xla/transforms/passes.h#L67
-[LinalgDialect]: https://mlir.llvm.org/docs/Dialects/Linalg/
-[LinalgFusionOnBuffers]: https://github.com/llvm/llvm-project/blob/ef868a848e6def288d2df7a1b3ebe09463afc8d0/mlir/include/mlir/Dialect/Linalg/Utils/Utils.h#L86
-[LinalgFusionOfTensorOps]: https://github.com/llvm/llvm-project/blob/80cb25cbd555f9634836b766c86aead435b60eaa/mlir/include/mlir/Dialect/Linalg/Passes.td#L30
-[LinalgPromotionPatterns]: https://github.com/llvm/llvm-project/blob/303a7f7a26e2aae1cb85f49dccbc0b5d14e0b2e0/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h#L358
-[LinalgRationale]: https://mlir.llvm.org/docs/Rationale/RationaleLinalgDialect/
-[LinalgTileAndFuse]: https://github.com/openxla/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/LinalgTileAndFusePass.cpp
-[LinalgTiling]: https://mlir.llvm.org/docs/Dialects/Linalg/#set-of-key-transformationsa-namekey_transformationsa
-[LinalgTilingPatterns]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
-[NVVMAddressSpace]: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#address-space
-[PwAfterAll]: https://gist.github.com/MaheshRavishankar/02cdd22f7c99e568f933244b5a679510
-[SCFToSPIRV]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Conversion/SCFToSPIRV/SCFToSPIRV.h
-[SpirvSerialization]: https://mlir.llvm.org/docs/Dialects/SPIR-V/#serialization-and-deserialization
-[StandardToSPIRV]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Conversion/StandardToSPIRV/ConvertStandardToSPIRV.h
diff --git a/docs/developers/design_docs/cuda_backend.md b/docs/developers/design_docs/cuda_backend.md
deleted file mode 100644
index c0bffeb..0000000
--- a/docs/developers/design_docs/cuda_backend.md
+++ /dev/null
@@ -1,111 +0,0 @@
-# IREE CUDA backend
-
-This document is intended to provide an overview of the design choices made to support CUDA within IREE. It describes both the HAL runtime and the NVVM codegen side.
-
-## CUDA HAL Driver
-
-The CUDA HAL driver is in the [`iree/hal/drivers/cuda/`][iree-cuda] directory. It is
-written in C following the standards of the rest of the HAL module.
-
-### CUDA library dependency
-
-IREE calls directly into the [`CUDA driver API`][cuda-driver]. The CUDA library is loaded dynamically and the cuda.h header from the CUDA SDK is part of the IREE third_party project, so IREE doesn't require the CUDA SDK to be installed when building IREE tools.
-At runtime the HAL CUDA driver will load the libcuda.so/nvcuda.dll library and resolve a subset of the CUDA driver API used by the HAL. The list of functions being used is in the file [`iree/hal/drivers/cuda/dynamic_symbols_tables.h`][cuda-symbols].
-
-### Driver
-
-There is no direct equivalent in CUDA to the HAL driver abstraction. We use it to hold the symbols loaded for all the devices.
-
-### Device
-
-The equivalent of a HAL device in CUDA is the `CUcontext`; it holds all the state related to memory allocations.
-
-### Command buffer
-
-We implement command buffers using the [`CUDA Graph API`][cuda-graph]. Using the Graph API makes it easy to encode fine-grained dependencies between dispatches without having to create multiple streams.
-Note that the Graph API is meant for command buffers that can be recorded once and used several times, and there may be a performance penalty when using the Graph API for direct command buffers. It is likely that we will also have a pure stream implementation in the future if we see performance problems with direct command buffer usage.
-
-### Event and Barrier
-
-In the HAL, Event and Barrier are used for GPU<->GPU synchronization either within a command buffer (Event and Barrier) or between command buffers.
-
-The current implementation ignores events and barriers and serializes all the nodes of the graph in order to have a conservative but correct solution.
-
-The design we plan for the future is to map dependencies within a command buffer to graph dependencies in the CUDA Graph API. When an event is signaled, all the leaf nodes of the graph will be saved in a HAL data structure, and when the same command buffer waits on the signal we will add all those nodes as dependencies of the future nodes added to the graph.
-
-For simplicity we always serialize command buffers sent to the same command queue.
-
-### Allocator
-
-The allocator will forward allocation requests to `cuMemHostAlloc` for host accessible memory and `cuMemAlloc` for device only memory.
-
-### Buffer
-
-CUDA buffers are represented either as a host pointer or a device pointer of type `CUdeviceptr`.
-
-### Executable
-
-A HAL executable maps naturally to a PTX module. The compiler will generate a FlatBuffer containing a PTX text module as well as a list of entry point function names and the workgroup sizes associated with those entry points.
-
-### Semaphore
-
-Timeline semaphores are used in IREE to handle coarse-grained synchronization for CPU<->GPU, GPU<->GPU and CPU<->CPU. The interface closely follows the [`Vulkan timeline semaphore spec`][vulkan-semaphore].
-There is currently no simple way to implement this on CUDA. Several solutions are discussed in this [`IREE issue`][semaphore-issue] but none is obvious. For now we force the CPU and GPU to be synchronized after every submit to ensure correctness and ignore the semaphore.
-
-## NVVM Codegen
-
-### NVVM and PTX
-
-NVVM is a CUDA-specific IR composed of LLVM IR and NVVM-specific intrinsics. It can be compiled to PTX text using the LLVM PTX backend. NVVM has an associated dialect in MLIR that translates 1:1 to NVVM intrinsics. This is what we are using to generate the PTX kernel code.
-
-### IREE flow
-
-IREE's [`target independent codegen`][codegen-passes] converts the compiler input to Linalg on Tensors. Afterward IREE will call the LinalgToLLVMGPU codegen passes.
-
-Once we get into the LinalgToLLVMGPU passes, we first bufferize to generate Linalg on buffers. Then we apply generic MLIR passes to convert Linalg to the SCF dialect and then SCF to the Standard dialect. After that we convert the Standard dialect to the LLVM+NVVM dialects.
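-
-To inspect how the IR evolves through this pipeline, one option (a sketch; the
-input file and backend flag are assumptions borrowed from the example below) is
-MLIR's generic IR printing:
-
-```shell
-iree-compile --iree-hal-target-backends=cuda \
-  --mlir-print-ir-after-all \
-  /tmp/add.mlir -o /tmp/add.vmfb 2> /tmp/add_ir_dump.txt
-```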
-
-## Example
-
-Save the following MLIR in `/tmp/add.mlir`:
-
-```mlir
-func.func @add(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
- %0 = tensor.empty() : tensor<4xf32>
- %1 = linalg.generic {
- indexing_maps = [
- affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]}
- ins(%arg0, %arg1 : tensor<4xf32>, tensor<4xf32>)
- outs(%0 : tensor<4xf32>) {
- ^bb0(%in: f32, %in_0: f32, %out: f32):
- %2 = arith.addf %in, %in_0 : f32
- linalg.yield %2 : f32
- } -> tensor<4xf32>
- return %1 : tensor<4xf32>
-}
-```
-
-```shell
-# First compile into a VM bytecode module.
-$ ../iree-build/tools/iree-compile \
- --iree-hal-target-backends=cuda \
- /tmp/add.mlir \
- -o /tmp/add.vmfb
-
-# Run the module through CUDA HAL backend.
-$ ../iree-build/tools/iree-run-module \
- --device=cuda \
- --module=/tmp/add.vmfb \
- --function=add \
- --input="4xf32=[1 2 3 4]" \
- --input="4xf32=[2 2 2 2]"
-
-EXEC @add
-4xf32=3 4 5 6
-```
-
-[iree-cuda]: https://github.com/openxla/iree/tree/main/iree/hal/drivers/cuda/
-[cuda-symbols]: https://github.com/openxla/iree/blob/main/iree/hal/drivers/cuda/dynamic_symbols_tables.h
-[cuda-driver]: https://docs.nvidia.com/cuda/cuda-driver-api/index.html
-[cuda-graph]: https://developer.nvidia.com/blog/cuda-graphs/
-[vulkan-semaphore]: https://www.khronos.org/blog/vulkan-timeline-semaphores
-[semaphore-issue]: https://github.com/openxla/iree/issues/4727
-[codegen-passes]: https://github.com/openxla/iree/blob/main/docs/design_docs/codegen_passes.md
diff --git a/docs/developers/design_docs/dynamic_shapes.md b/docs/developers/design_docs/dynamic_shapes.md
deleted file mode 100644
index cc59dc4..0000000
--- a/docs/developers/design_docs/dynamic_shapes.md
+++ /dev/null
@@ -1,166 +0,0 @@
-# Dynamic Shapes
-
-NOTE: Effort is being made to make this facility generic so that it can be
-eventually upstreamed to MLIR in some fashion. However, because MLIR lacks a set
-of frontend ops and generally does not currently have any frontend oriented
-infrastructure, it is being prototyped within IREE in order to find a working
-set of ops and algorithms.
-
-## Levels of dynamicism
-
-In general, there are three levels of shape information that can be present in
-the input IR (or trivially derived by applying some form of shape inferencing
-algorithm). Each additional one imposes more work on the compiler and runtime,
-so generally, the implementation progresses by addressing each once the former
-is well established:
-
-1. Fully static shapes: No tensors have dynamic dimensions. All tensors are
- ranked.
-2. Ranked Dynamicism: All tensors have ranks, but some dimensions may be
- unspecified.
-3. Unranked Dynamicism: Some tensors have indeterminate ranks.
-
-At this stage, *Dynamic Shapes* in IREE refers to supporting ranked dynamic
-tensors, where some dimensions are left unspecified at public function
-boundaries. It is expected that once this is solid, some support can be
-considered for unranked dynamicism, and it is further expected that this will
-entail new ops, algorithms and runtime support, apart from what is needed for
-ranked dynamicism.
-
-Within the category of Ranked Dynamicism, it is well known that some dynamic
-dimensions are easier to deal with than others: in common DNN use, outer
-dimensions are much easier and more common with respect to code generation and
-kernel fanout than dynamic inner dimensions.
-
-While the shape handling machinery is relatively generic, we expect that real
-backends will be limited with respect to how much they support all combinations
-of dynamic dimensions. Eventually, IREE intends to solve this by having
-relatively robust CPU fallback for fully dynamic cases and actionable warnings
-that pinpoint when more specificity could increase performance.
-
-## Compiler Frontend
-
-In general, the IREE compiler frontend should accept modules containing
-functions with operands/results that have dynamic dimensions. Such functions may
-also have runtime dependent shapes in the form of `GetShape`-style ops which get
-a shape from an arbitrary tensor, perform some arithmetic on it and use the
-results elsewhere.
-
-### Shape dialect and lowering
-
-IREE is introducing a `shape` dialect with a handful of ops and transformations
-that are useful for materializing dynamic shape computations in terms of high
-level ops on tensors.
-
-#### Types:
-
-* `ranked_shape`: This value type represents the dynamic dimensions of a
- partially known, ranked shape. It is used early in the compiler to represent
- anywhere that dynamic dimensions need to be passed (i.e. function
- args/results, etc). At lower levels of the compiler, it will generally be
- dis-aggregated into loose SSA values. This type also carries the datatype
- used to represent the dimensions. This is currently fixed to i32 but may be
-  leveraged eventually to use smaller integers when such things are known to be
- legal.
-
-#### Ops:
-
-* `get_ranked_shape`: Takes a tensor SSA value and returns a corresponding
- `ranked_shape`. Early in the compilation flow, anything that needs a ranked
- shape should add such ops so that the compiler can later determine which
- shape computations to materialize. Getting the `ranked_shape` of a static
- tensor should yield a constant.
-* `tie_shape`: Takes tensor and ranked_shape SSA values and returns the
- tensor. This is used as a junction point by the shape materialization passes
- to know at various points precisely what the shape is.
-* ... TODO: need `get_shape_dim` and conversions to/from 1D tensors and loose
- SSA values.
-
-### Materialization
-
-#### Function signature expansion
-
-Early in the process, all functions should have their arguments and results
-expanded so that any dynamic tensors in their signature will gain a new
-argument/result for the corresponding `ranked_shape`. This is done by expanding
-the signatures and for arguments, inserting placeholder `tie_shape` ops which
-preserve the association for later materialization. For results,
-`get_ranked_shape` ops are inserted.
-
-This is carried out by the `iree-shape-expand-function-dynamic-dims` pass, which
-uses the conversion framework under the hood to perform type expansion.
-
-This pass is typically done early in the compiler flow.
-
-#### Shape dependent codegen
-
-A lot of scheduling logic will need to access shapes (i.e. allocation, workgroup
-size calculation, etc). In general, this should all be done based on a
-`get_ranked_shape` op and corresponding `get_shape_dim` ops. For fully static
-cases, these should reduce down to constants. For dynamic dimensions, the
-`get_ranked_shape` ops serve as anchors where later parts of the compiler know
-they need to materialize shape values.
-
-#### Materializing shape computations
-
-TODO: We have a sketch of this but are still proving it out.
-
-Generally, it should be possible, for any `get_ranked_shape` op, to trace up the
-use-def chain and materialize shape manipulation arithmetic. Once materialized,
-a `tie_shape` op should be inserted to memorialize the junction. Eventually,
-every `get_ranked_shape` op should follow a `tie_shape` op, and the
-canonicalization rules will elide the `get_ranked_shape`. There is complexity
-around blocks, control flow, etc, but this basic algorithm should be workable.
-
-Work is ongoing upstream to provide a facility to register shape functions with
-ops, which would provide a dynamic, dialect independent way to know what
-arithmetic to materialize. However, in most cases this is not necessary. The
-built-in traits around types and sizes will allow most propagation to happen
-without shape functions. We intend to start with a static set of cases for the
-rest in order to prove the concept.
-
-#### Scalarization
-
-TODO: We have a sketch of this but are still proving it out.
-
-It is quite common in real-world DNN usage to get the 1D tensor representing a
-shape and perform arbitrary tensor ops on it (usually basic arithmetic,
-slicing, concatenating, tiling, etc). While this is perfectly acceptable from a
-correctness standpoint, it is usually not performant: shapes are typically very
-small one-dimensional vectors, and computations on them are usually trivial to
-reduce to small sequences of scalar machine code of a form that CPUs are very
-good at executing. Further, we often want to run these calculations eagerly
-when dispatching functions, etc (e.g. to pre-allocate buffers), and having them
-isolated (versus treating them as arbitrary dense computations) can be quite
-valuable.
-
-We expect that the type bracketing that happens with `ranked_shape` and the
-corresponding ops will make it easy to write some simple DRR patterns to
-identify such shape manipulation sequences and lower them directly to regions of
-`vm` ops operating on scalars. Such regions can be retained and directly emitted
-when lowering to the `vm` dialect and/or CPU code generation and would run with
-low overhead along with any other scheduling code.
-
-While an optimization, we suspect this is an important one.
-
-### Shape inference
-
-TODO: This is mostly placeholder
-
-There is work happening upstream to implement MLIR-integrated shape inference.
-In the meantime, IREE expects that the input to the compiler has already had
-some shape inference performed on it. In practice, for TensorFlow, there is a
-pass which applies TensorFlow's pre-MLIR shape inference mechanisms to derive
-such things. This has limitations but is a reasonable starting point.
-
-## Compiler Backends
-
-TODO: This is mostly placeholder.
-
-Much of the newer structured-ops based codegen is capable of working (within
-bounds) with ranked dynamic shapes without much work. Given the lack of an e2e
-story, much of this has been done "by way of code review" and there are
-certainly issues to be resolved.
-
-In addition, there are several ABI issues and negotiations with the backend that
-still need to be fleshed out.
diff --git a/docs/developers/design_docs/hal_driver_features.md b/docs/developers/design_docs/hal_driver_features.md
deleted file mode 100644
index 2a423e9..0000000
--- a/docs/developers/design_docs/hal_driver_features.md
+++ /dev/null
@@ -1,120 +0,0 @@
-# HAL Driver Features
-
-Heterogeneity is one of IREE's core design principles. IREE aims to support
-various accelerators for compute, ranging from general purpose CPUs and GPUs to
-other special purpose accelerators. IREE provides a
-[Hardware Abstraction Layer (HAL)][iree-hal] as a common interface to these
-accelerators. IREE exposes it via a [C API][iree-hal-c-api] for programmers and
-an MLIR [dialect][iree-hal-dialect] for compilers.
-
-Heterogeneity inevitably means IREE needs to provide a solution for managing
-different features on different accelerators and their availability. This doc
-describes the designs and mechanisms.
-
-## General HAL Driver Features
-
-IREE uses compilers to generate native code for each accelerator, serialize the
-native code, and embed it in a single module file following the FlatBuffer
-encoding format. The native code embedded in the final FlatBuffer file
-indicates the target architecture and required feature sets. At runtime IREE
-selects a HAL driver meeting all the requirements and dispatches the workload
-to it.
-
-[TODO: describe the HAL functionality, C API, and dialect abstraction]
-
-## Vulkan HAL Driver Features
-
-Vulkan has many mechanisms for supporting different hardware implementations:
-versions, extensions, features, limits. Vulkan uses SPIR-V to express the GPU
-program, but Vulkan is just one of the clients SPIR-V supports. So SPIR-V has
-its own mechanisms for supporting different clients: versions, capabilities,
-extensions. The mechanisms in these two domains bear lots of similarity, but
-they are not exactly the same. We need to bridge these two worlds inside IREE.
-
-IREE has its own [Vulkan dialect][iree-vulkan-dialect], which defines the Vulkan
-target environment, including [versions][iree-vulkan-base-td],
-[extensions][iree-vulkan-base-td], and [features][iree-vulkan-cap-td]. These
-definitions leverage MLIR attributes for storage, parsing/printing, and
-validation. For example, we can have the following Vulkan target environment:
-
-```
-target_env = #vk.target_env<
- v1.1, r(120),
- [VK_KHR_spirv_1_4, VK_KHR_storage_buffer_storage_class],
- {
- maxComputeSharedMemorySize = 16384 : i32,
- maxComputeWorkGroupInvocations = 1024: i32,
- maxComputeWorkGroupSize = dense<[128, 8, 4]>: vector<3xi32>,
- subgroupFeatures = 7: i32,
- subgroupSize = 64 : i32
- }
->
-```
-
-The above describes a Vulkan implementation that supports specification version
-1.1.120, supports the `VK_KHR_spirv_1_4` and `VK_KHR_storage_buffer_storage_class`
-extensions, has a max compute workgroup invocation count of 1024, and so on.
-
-The above bears lots of similarity with the output of the
-[`vulkaninfo`][vulkaninfo] utility. That's intended: `vulkaninfo` gives a
-detailed dump of a Vulkan implementation by following the structures of all the
-registered extensions to the specification. We pick relevant fields from it to
-compose the list above to drive code generation. These are just different
-formats for expressing the Vulkan implementation; one can imagine having a tool
-to directly derive the MLIR attribute form used by IREE from `vulkaninfo`'s
-JSON dump.
-
-When compiling ML models towards Vulkan, one specifies the target environment as
-a `#vk.target_env` attribute assembly via the
-[`iree-vulkan-target-env`][iree-vulkan-target-env] command-line option. At the
-moment only one target environment is supported; in the future this is expected
-to support multiple ones so that one can compile towards different Vulkan
-implementations at once and embed all of them in the final FlatBuffer and select
-at runtime. Another command-line option, `iree-vulkan-target-triple` is also
-available to allow specifying common triples and avoiding the lengthy target
-environment assembly string. `iree-vulkan-target-triple` will be overridden by
-`iree-vulkan-target-env` if both are given.
-
-Under the hood, this Vulkan target environment is then converted to the SPIR-V
-target environment counterpart to drive code generation. The conversion happens
-in one of the Vulkan dialect's [utility functions][iree-vulkan-target-conv]. The
-converted SPIR-V target environment is [attached][iree-spirv-target-attach] to
-the dispatch region's module for SPIR-V passes to use.
-
-SPIR-V's target environment is very similar to the Vulkan target environment
-above; it lives in the upstream MLIR repo and is documented
-[here][mlir-spirv-target] and implemented in the SPIR-V dialect's
-[`SPIRVAttributes.h`][mlir-spirv-attr] and
-[`TargetAndABI.td`][mlir-spirv-target-td].
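-
-For example (abridged, and with syntax that is approximate rather than
-authoritative; see the linked upstream documentation for the exact form), the
-converted SPIR-V counterpart of the Vulkan target environment above might look
-like:
-
-```mlir
-target_env = #spv.target_env<
- #spv.vce<v1.3, [Shader, GroupNonUniform], [SPV_KHR_storage_buffer_storage_class]>,
- {
- max_compute_workgroup_invocations = 1024 : i32,
- max_compute_workgroup_size = dense<[128, 8, 4]> : vector<3xi32>
- }
->
-```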
-
-[PR #3469][pr-3469], along with patch [D89364][d89364], shows an example of the
-changes needed to add support for the
-[VK_NV_cooperative_matrix][vk-coop-mat-ext] extension for Vulkan/SPIR-V. The
-overall steps are as follows:
-1. Add the enum corresponding to the extension to `VK_ExtensionAttr` in
- [VulkanBase.td][iree-vulkan-base-td].
-1. Add necessary capability bits to [`VK_CapabilitiesAttr`][iree-vulkan-cap-td].
-1. Add a corresponding attribute to the `SPV_ResourceLimitsAttr` in
- [TargetAndABI.td][mlir-spirv-target-td]. (Note: The corresponding SPIR-V
- extension is likely already defined in
- [`SPV_ExtensionAttr`][mlir-spirv-extensions-attr])
-1. Convert the capability bits specified in the attribute added to
- `VK_CapabilitiesAttr` to the attribute added to `SPV_ResourceLimitsAttr`.
-
-[d89364]: https://reviews.llvm.org/D89364
-[iree-hal]: https://github.com/openxla/iree/tree/main/iree/hal
-[iree-hal-c-api]: https://github.com/openxla/iree/blob/main/iree/hal/api.h
-[iree-hal-dialect]: https://github.com/openxla/iree/tree/main/iree/compiler/Dialect/HAL
-[iree-vulkan-dialect]: https://github.com/openxla/iree/tree/main/iree/compiler/Dialect/Vulkan
-[iree-vulkan-base-td]: https://github.com/openxla/iree/blob/main/iree/compiler/Dialect/Vulkan/IR/VulkanBase.td
-[iree-vulkan-cap-td]: https://github.com/openxla/iree/blob/main/iree/compiler/Dialect/Vulkan/IR/VulkanAttributes.td
-[iree-vulkan-target-env]: https://github.com/openxla/iree/blob/b4739d704de15029cd671e53e7d7e743f4ca2e35/iree/compiler/Dialect/HAL/Target/VulkanSPIRV/VulkanSPIRVTarget.cpp#L66-L70
-[iree-vulkan-target-triple]: https://github.com/openxla/iree/blob/main/iree/compiler/Dialect/Vulkan/Utils/TargetEnvUtils.cpp
-[iree-vulkan-target-conv]: https://github.com/openxla/iree/blob/b4739d704de15029cd671e53e7d7e743f4ca2e35/iree/compiler/Dialect/Vulkan/Utils/TargetEnvUtils.h#L29-L42
-[iree-spirv-target-attach]: https://github.com/openxla/iree/blob/b4739d704de15029cd671e53e7d7e743f4ca2e35/iree/compiler/Dialect/HAL/Target/VulkanSPIRV/VulkanSPIRVTarget.cpp#L228-L240
-[mlir-spirv-extensions-attr]: https://github.com/llvm/llvm-project/blob/076305568cd6c7c02ceb9cfc35e1543153406d19/mlir/include/mlir/Dialect/SPIRV/SPIRVBase.td#L314
-[mlir-spirv-target]: https://mlir.llvm.org/docs/Dialects/SPIR-V/#target-environment
-[mlir-spirv-attr]: https://github.com/llvm/llvm-project/blob/076305568cd6c7c02ceb9cfc35e1543153406d19/mlir/include/mlir/Dialect/SPIRV/SPIRVAttributes.h
-[mlir-spirv-target-td]: https://github.com/llvm/llvm-project/blob/076305568cd6c7c02ceb9cfc35e1543153406d19/mlir/include/mlir/Dialect/SPIRV/TargetAndABI.td
-[pr-3469]: https://github.com/openxla/iree/pull/3469
-[vk-coop-mat-ext]: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_NV_cooperative_matrix.html
-[vulkaninfo]: https://vulkan.lunarg.com/doc/view/latest/linux/vulkaninfo.html
diff --git a/docs/developers/design_docs/hlo_to_linalg.png b/docs/developers/design_docs/hlo_to_linalg.png
deleted file mode 100755
index 469ed26..0000000
--- a/docs/developers/design_docs/hlo_to_linalg.png
+++ /dev/null
Binary files differ
diff --git a/docs/developers/design_docs/linalg_to_spirv.png b/docs/developers/design_docs/linalg_to_spirv.png
deleted file mode 100755
index fd6aee7..0000000
--- a/docs/developers/design_docs/linalg_to_spirv.png
+++ /dev/null
Binary files differ
diff --git a/docs/developers/developing_iree/contributing.md b/docs/developers/developing_iree/contributing.md
deleted file mode 100644
index d5774eb..0000000
--- a/docs/developers/developing_iree/contributing.md
+++ /dev/null
@@ -1,135 +0,0 @@
-# Contributing
-
-This is a more detailed version of the top-level
-[CONTRIBUTING.md](/CONTRIBUTING.md) file. We keep it separate to avoid everyone
-getting a pop-up when creating a PR each time it changes.
-
-## Downstream Integrations
-
-Due to the process by which we synchronize this GitHub project with our internal
-Google source code repository, there are some oddities in our workflows and
-processes. We aim to minimize these, and especially to mitigate their impact on
-external contributors, but when they come up, they are documented here for
-clarity and transparency. If any of these things are particularly troublesome or
-painful for your workflow, please reach out to us so we can prioritize a fix.
-
-Hopefully these quirks actually make usage in other downstream projects easier,
-but integrators may need to look past some details (like the Bazel build system,
-Android support, etc.) based on their specific requirements.
-
-## Build Systems
-
-IREE supports building from source with both Bazel and CMake. CMake is the
-preferred build system for open source users and offers the most flexible
-configuration options. Bazel is a stricter build system and helps with usage in
-the Google internal source repository. Certain dependencies (think large/complex
-projects like CUDA, TensorFlow, PyTorch, etc.) may be difficult to support with
-one build system or the other, so the project may configure these as optional.
-
-## CI
-
-IREE uses [GitHub Actions](https://docs.github.com/en/actions) for CI. The
-primary CI is configured in the
-[ci.yml workflow file](/.github/workflows/ci.yml).
-
-### Self-Hosted Runners
-
-In addition to the default runners GitHub provides, IREE uses
-[self-hosted runners](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners)
-to run many of its workflow jobs. These enable access to additional compute and
-custom configurations such as accelerators. Configuration scripting is checked
-in to this repository (see the
-[README for that directory](/build_tools/github_actions/runner/README.md)).
-
-### Custom Managed Runners
-
-In addition to our self-hosted runners, we use GitHub's
-[large managed runners](https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners)
-for some platforms that are more trouble to configure ourselves (e.g. Mac).
-
-### CI Behavior Manipulation
-
-The setup step of the CI determines which CI jobs to run. This is controlled by
-the [configure_ci.py](/build_tools/github_actions/configure_ci.py) script. It
-will generally run a pre-determined set of jobs on presubmit with some jobs kept
-as post-submit only. If changes are only to a certain set of excluded files that
-we know don't affect CI (e.g. docs), then it will skip the jobs. You can
-customize which jobs run using
-[git trailers](https://git-scm.com/docs/git-interpret-trailers) in the PR
-description. The available options are
-
-``` text
-ci-skip: jobs,to,skip
-ci-extra: extra,jobs,to,run
-ci-exactly: exact,set,of,jobs,to,run
-skip-ci: free form reason
-skip-llvm-integrate-benchmark: free form reason
-benchmark-extra: extra,benchmarks,to,run
-runner-env: [testing|prod]
-```
-
-The first three follow the same format and instruct the setup script on which
-jobs to include or exclude from its run. They take a comma-separated list of
-jobs which must be from the set of top-level job identifiers in the ci.yml file or
-the special keyword "all" to indicate all jobs. `ci-skip` removes jobs that
-would otherwise be included, though it is not an error to list jobs that would
-not be included by default. `ci-extra` adds additional jobs that would not have
-otherwise been run, though it is not an error to list jobs that would have been
-included anyway. It *is* an error to list a job in both of these fields.
-`ci-exactly` provides an exact list of jobs that should run. It is mutually
-exclusive with both `ci-skip` and `ci-extra`. In all these cases, the setup does
-not make any effort to ensure that job dependencies are satisfied. Thus, if you
-request skipping the `build_all` job, all the jobs that depend on it will fail,
-not be skipped. `skip-ci` is an older option that simply skips all jobs. Its
-usage is deprecated and it is mutually exclusive with all of the other `ci-*`
-options. Prefer `ci-skip: all`.
-
-Benchmarks don't run by default on PRs, and must be specifically requested. They
-*do* run by default on PRs detected to be an integration of LLVM into IREE, but
-this behavior can be disabled with `skip-llvm-integrate-benchmark`. The
-`benchmark-extra` option allows specifying additional benchmark presets to run
-as part of benchmarking. It accepts a comma-separated list of benchmark presets.
-This combines with labels added to the PR (which are a more limited set of
-options). See the [benchmark suites documentation](./benchmark_suites.md).
-
-The `runner-env` option controls which runner environment to target for our
-self-hosted runners. We maintain a test environment to allow testing out new
-configurations prior to rolling them out. This trailer is for advanced users who
-are working on the CI infrastructure itself.
-
-#### CI configuration recipes
-
-Copy/paste any of these at the bottom of a PR description to change what the CI
-runs.
-
-* Also run Windows and macOS builds that are normally post-merge only:
-
- ``` text
- ci-extra: build_test_all_windows,build_test_all_macos_arm64,build_test_all_macos_x86_64
- ```
-
-* Also run GPU tests on NVIDIA A100 runners (opt-in due to low availability):
-
- ``` text
- ci-extra: test_a100
- ```
-
-* Skip all CI builds and tests, e.g. for comment-only changes:
-
- ``` text
- skip-ci: Comment-only change.
- ```
-
-* Only run Bazel builds, e.g. for changes only affecting Bazel rules:
-
- ``` text
- ci-exactly: build_test_all_bazel
- ```
-
-For example, this PR opted in to running the `build_test_all_windows` job:
-
-
-
-The enabled jobs can be viewed from the Summary page of an action run:
-
-
diff --git a/docs/developers/developing_iree/contributor_tips.md b/docs/developers/developing_iree/contributor_tips.md
deleted file mode 100644
index dd7947a..0000000
--- a/docs/developers/developing_iree/contributor_tips.md
+++ /dev/null
@@ -1,117 +0,0 @@
-# Contributor Tips
-
-This is an opinionated guide documenting workflows that some members of the team
-have found useful. It is focused on meta-tooling, not on IREE code specifically
-(you will find the latter in the
-[Developer Overview](./developer_overview.md)). It is certainly possible to use
-workflows other than these, but some common tasks, especially for maintainers,
-will likely be made easier if you use these flows. It assumes a basic knowledge
-of `git` and GitHub and suggests some specific ways of using it.
-
-## Git Structure
-
-We tend to use the "triangular" or "forking" workflow. Develop primarily on a
-clone of the repository on your development machine. Any local branches named
-the same as persistent branches from the
-[main repository](https://github.com/openxla/iree) (currently `main` and
-`stable`) are pristine (though potentially stale) copies. You only fast-forward
-these to match upstream and otherwise do development on other branches. When
-sending PRs, you push to a different branch on your public fork and create the
-PR from there.
-
-### Setup
-
-1. Create a fork of the main repository.
-
-2. Create a local git repository with remotes `upstream` (the main repository)
- and `origin` (your personal fork). To list your current remotes
- `git remote -v`.
-
- a. If you already cloned from the main repository (e.g. by following the
- getting started guide):
-
- ```shell
- # From your existing git repo
- $ git remote rename origin upstream
- $ git remote add origin https://github.com/<github_username>/iree.git
- ```
-
- b. If you haven't already cloned:
-
- ```shell
- # From whatever directory under which you want to nest your repo
- $ git clone https://github.com/<github_username>/iree.git
- $ cd iree
- $ git remote add upstream https://github.com/openxla/iree.git
- ```
-
- This is especially important for maintainers who have write access (so can
- push directly to the main repository) and admins who have elevated
- privileges (so can push directly to protected branches). These names are
- just suggestions, but you might find some scripts where the defaults are for
- remotes named like this. For extra safety, you can make it difficult to push
- directly to upstream by setting the push url to something invalid: `git
- remote set-url --push upstream DISABLE`, which requires re-enabling the push
- URL explicitly before pushing. You can wrap this behavior in a custom git
- command like
- [git-sudo](https://gist.github.com/GMNGeoffrey/42dd9a9792390094a43bdb69659320c0).
-
-3. Use a script like
- [git_update.sh](/build_tools/scripts/git/git_update.sh)
- to easily synchronize `main` with `upstream`. Submodules make this a
- little trickier than it should be. You can also turn this into a git command
- by adding it to your path as `git-update`.
-
-#### Git Config
-
-These are some additional options you could put in your top-level `.gitconfig`
-or repository-specific `.git/config` files that are conducive to the
-recommended workflow:
-
-```ini
-[push]
- default = current
-[alias]
- # Delete branches that you pushed and have been deleted upstream, e.g. because
- # the PR was merged.
- gone = ! "git fetch -p && git for-each-ref --format '%(refname:short) %(upstream:track)' | awk '$2 == \"[gone]\" {print $1}' | xargs -r git branch -D"
- # Update from upstream (custom command) and delete obsolete local branches.
- sync = ! (git update main && git gone)
- # Create a new branch based off of main (requires a clean working directory).
- new = "!f(){ \\\ngit checkout main && git switch -c $1; \\\n}; f"
- # Display branches in a useful "latest last" format
- br = for-each-ref --sort=committerdate refs/heads/ --format='%(HEAD) %(color:yellow)%(refname:short)%(color:reset) - %(color:red)%(objectname:short)%(color:reset) - %(contents:subject) (%(color:green)%(committerdate:relative)%(color:reset))'
- # `git git foo` -> `git foo` typo fixer
- git = "!f(){ \\\n git \"$@\"; \\\n}; f"
- # Get the git root directory
- root = rev-parse --show-toplevel
- # checkout, but also sync submodules
- ch = "!f() { \\\n git checkout \"$@\"; git submodule sync && git submodule update --init; \\\n}; f"
- # See the diff for a PR branch vs the main branch
- diffmain = diff --merge-base main
- # See only the files that differ vs the main branch
- whatsout = diffmain --name-only
-[checkout]
- # If the branch given to checkout exists on multiple remotes, prefer origin.
- defaultRemote = origin
-[pull]
- # When pulling, only complete the pull if it's a clean fast-forward.
- ff = only
-[remote]
- # Push to your fork (origin) by default
- pushDefault = origin
-[url "ssh://git@github.com/"]
- # Pull with https (so no auth required), but push with ssh.
- pushInsteadOf = https://github.com/
-```
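-
-With this config in place (and assuming you have installed `git_update.sh` as a
-`git update` command as suggested above), a typical flow might look like:
-
-```shell
-$ git sync          # fast-forward main from upstream and prune merged branches
-$ git new my-fix    # create a new branch off of an up-to-date main
-$ git whatsout      # list files your branch changes relative to main
-```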
-
-## Useful Tools
-
-* GitHub CLI (<https://github.com/cli/cli>). A CLI for interacting with GitHub.
- Most importantly, it allows scripting the creation of pull requests.
-* Refined GitHub Chrome and Firefox Extension:
- <https://github.com/sindresorhus/refined-github>. Nice extension that adds a
- bunch of features to the GitHub UI.
-* VSCode: <https://code.visualstudio.com/>. The most commonly used IDE amongst
- IREE developers.
-* [ccache](./ccache.md)
diff --git a/docs/developers/developing_iree/e2e_benchmarking.md b/docs/developers/developing_iree/e2e_benchmarking.md
deleted file mode 100644
index e8f0eaf..0000000
--- a/docs/developers/developing_iree/e2e_benchmarking.md
+++ /dev/null
@@ -1,404 +0,0 @@
-# Benchmark IREE and TFLite
-
-<!-- TODO(meadowlark): Update this doc once the API is stable and change
- default to cmake. -->
-
-We use our end-to-end TensorFlow integration tests to test compilation and
-numerical accuracy, and to generate compilation and benchmarking artifacts.
-This allows us to validate that our benchmarks are behaving as we expect them
-to, and to run them using valid inputs for each model.
-
-This guide assumes that you can run the TensorFlow integration tests. See
-[this doc](https://iree.dev/building-from-source/getting-started/#python-bindings)
-for more information. That doc also covers writing new tests, which you'll need
-to do if you'd like to benchmark a new TensorFlow model.
-
-## 1. Run IREE's E2E TensorFlow tests to generate the benchmarking artifacts
-
-```shell
-# Continuing from the "Running Python Tests" section of the doc linked above.
-# We need not only Python bindings, but also the TensorFlow compiler frontend.
-# Self-contained commands from that doc --- skip that if you have already
-# completed a build with Python bindings AND TensorFlow compiler frontend.
-$ mkdir -p iree-build && cd iree-build/ # Make and cd into some build directory
-$ cmake ../iree -G Ninja \
- -DCMAKE_C_COMPILER=clang \
- -DCMAKE_CXX_COMPILER=clang++ \
- -DIREE_BUILD_PYTHON_BINDINGS=ON
-$ cmake --build .
-# Also from the Python get-started doc, set this environment variable:
-$ export PYTHONPATH=$(pwd)/bindings/python
-```
-
-```shell
-# --target_backends: All tests allow you to specify one or more backends to generate benchmarking artifacts for.
-# --artifacts_dir: The default location for these artifacts is under /tmp/iree/modules
-# This is a Python 3 program. On some systems, such as Debian derivatives,
-# use 'python3' instead of 'python'.
-$ python ../iree/integrations/tensorflow/e2e/matrix_ops_static_test.py \
- --target_backends=iree_vmvx
-# View the generated artifacts:
-$ tree /tmp/iree/modules/MatrixOpsStaticModule/
-```
-
-```shell
-# Some tests use additional flags to specify features of the Module to test/benchmark:
-# --model: the tf.keras.applications model to test
-# --data: the dataset (and corresponding image shapes) to create the model for
-$ python ../iree/integrations/tensorflow/e2e/keras/applications/applications_test.py \
- --target_backends=iree_vmvx \
- --model=MobileNetV3Small \
- --data=imagenet
-# View the generated artifacts:
-$ tree /tmp/iree/modules/MobileNetV3Small/
-```
-
-Each test/module has a folder with the following artifacts (filtered to only
-include those relevant for benchmarking):
-
-```shell
-# Example for a generic module `ModuleName`:
-/tmp/iree/modules/ModuleName
- ├── iree_vmvx # Or any other IREE backend.
- │ └── compiled.vmfb # A flatbuffer containing IREE's compiled code.
- └── tflite
- ├── module_method_1.tflite
- │ # A method on ModuleName compiled to bytes with TFLite, which can
- │ # be used by the TFLite's benchmark_model binary.
- ├── module_method_2.tflite
- ├── ...
- └── traces
- ├── traced_function_1
- │ └── graph_path
- │ # In general, a trace's name does not have to match the name
- │ # of the method(s) on the tf.Module that it calls. This file
- │ # points to the correct module_method_*.tflite graph file
- │ # for TFLite's benchmark_model to use.
- ├── traced_function_2
- └── ...
-
-# Example for MatrixOpsStaticModule:
-/tmp/iree/modules/MatrixOpsStaticModule
- ├── iree_llvmcpu
- │ └── compiled.vmfb
- ├── iree_vmvx
- │ └──compiled.vmfb
- ├── iree_vulkan
- │ └── compiled.vmfb
- └── tflite
- ├── basic_matmul.tflite
- ├── matmul_broadcast_singleton_dimension.tflite
- ├── matmul_lhs_batch.tflite
- ├── matmul_rhs_batch.tflite
- └── traces
- ├── basic_matmul
- │ └── graph_path
- ├── matmul_broadcast_singleton_dimension
- │ └── graph_path
- ├── matmul_lhs_batch
- │ └── graph_path
- └── matmul_rhs_batch
- └── graph_path
-```
-
-## 2. Benchmarking IREE on desktop
-
-See also ./benchmarking.md
-
-Use iree-benchmark-module to benchmark the generated model. For example, to
-benchmark a static left-hand-side batched matmul using `MatrixOpsStaticModule`
-on VMVX run:
-
-```shell
-$ tools/iree-benchmark-module \
- --module=/tmp/iree/modules/MatrixOpsStaticModule/iree_vmvx/compiled.vmfb \
- --device=local-task \
- --function=matmul_lhs_batch \
- --input=256x64x32xf32=2 \
- --input=32x16xf32=3
-```
-
-Note that the arguments to `--input` are shapes plus an arbitrary value
-to populate a splat. Some more complicated models might have very different
-performance characteristics depending on the input data, so this manual
-specification will not work well.
-
-TODO(#6688): Discuss new yaml trace files.
-
-## 3. Benchmarking TFLite on desktop
-
-### 3.1 Build TFLite's `benchmark_model` binary
-
-```shell
-# Enter the TensorFlow Bazel workspace.
-$ cd third_party/tensorflow/
-
-# Build the benchmark_model binary.
-$ bazel build --copt=-mavx2 -c opt \
- //tensorflow/lite/tools/benchmark:benchmark_model
-
-# By default, TFLite/x86 uses various matrix multiplication libraries.
-# It is possible to force it to only use Ruy for all matrix multiplications.
-# That is the default on ARM but not on x86. This will overwrite the
-# previous binary unless you move it.
-#
-# Note that Ruy takes care of -mavx2 and other AVX extensions internally,
-# so this passing this flag here isn't going to make a difference to
-# matrix multiplications. However, the rest of TFLite's kernels outside
-# of ruy will still benefit from -mavx2.
-$ bazel build --copt=-mavx2 -c opt \
- --define=tflite_with_ruy=true \
- //tensorflow/lite/tools/benchmark:benchmark_model
-
-# The binary can now be found in the following directory:
-$ ls bazel-bin/tensorflow/lite/tools/benchmark/
-```
-
-### 3.2 Benchmark the model on TFLite
-
-We pass TFLite the graph generated from the test above (located at the path from
-graph_path). It will generate fake inputs for the model.
-
-Using `MatrixOpsStaticModule`'s left-hand-side batched matmul again as an
-example we can run the benchmark as follows:
-
-```shell
-# Run within `third_party/tensorflow/`.
-$ ./bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
- --graph=$(cat "/tmp/iree/modules/MatrixOpsStaticModule/tflite/traces/matmul_lhs_batch/graph_path") \
- --warmup_runs=1 \
- --num_threads=1 \
- --num_runs=100 \
- --enable_op_profiling=true
-```
-
-## 4. Benchmarking IREE on Android
-
-### 4.1 Prepare the benchmarking tools
-
-IREE only supports compiling to Android with CMake. Documentation on setting up
-your environment to cross-compile to Android can be found
-[here](https://iree.dev/building-from-source/android/).
-
-```shell
-# After following the instructions above up to 'Build all targets', the
-# iree-benchmark-module binary should be in the following directory:
-$ ls build-android/tools/
-
-# Copy the benchmarking binary to phone.
-$ adb push build-android/tools/iree-benchmark-module /data/local/tmp
-```
-
-### 4.2 Push IREE's compilation / benchmarking artifacts to the device
-
-In this example we'll only copy over the files we need to benchmark a single
-module on a single backend, but you can easily copy all of the modules over
-as well.
-
-Using `MatrixOpsStaticModule`'s left-hand-side batched matmul again as an
-example:
-
-```shell
-# Make a directory for the module/backend pair we want to benchmark.
-$ adb shell mkdir -p /data/local/tmp/MatrixOpsStaticModule/iree_vmvx/
-
-# Transfer the files.
-$ adb push /tmp/iree/modules/MatrixOpsStaticModule/iree_vmvx/* \
- /data/local/tmp/MatrixOpsStaticModule/iree_vmvx/
-```
-
-### 4.3 Benchmark the module
-
-```shell
-$ adb shell /data/local/tmp/iree-benchmark-module \
- --module="/data/local/tmp/MatrixOpsStaticModule/iree_vmvx/compiled.vmfb" \
- --device=local-task \
- --function=matmul_lhs_batch \
- --input=256x64x32xf32=2 \
- --input=32x16xf32=3
-```
-
-## 5. Benchmarking TFLite on Android
-
-### 5.1 Prepare the benchmarking tools
-
-There are three options for getting TFLite's `benchmark_model` binary for
-Android.
-
-The first two are to build it directly, either in a
-[`docker` container](https://www.tensorflow.org/lite/guide/build_android#set_up_build_environment_using_docker)
-or
-[in your own
-environment](https://www.tensorflow.org/lite/guide/build_android#set_up_build_environment_without_docker).
-To build TensorFlow tools for Android:
-
-- Run `./configure` under the TensorFlow repo.
-- Add the following section to the TensorFlow WORKSPACE file.
-
-``` starlark
-android_ndk_repository(
- name="androidndk",
- path="/full/path/to/android_ndk",
-)
-```
-
-TODO(hanchung): Place the Android setup to somewhere outside IREE, e.g.,
-TensorFlow.
-
-Then you can configure the TFLite `benchmark_model` binary in the following
-ways:
-
-```shell
-# Build the benchmark_model binary without any add-ons.
-# Note that unlike TFLite/x86, TFLite/ARM uses Ruy by default for all
-# matrix multiplications (No need to pass tflite_with_ruy), except for some
-# matrix*vector products. Below we show how to force using ruy also for that.
-$ bazel build -c opt \
- --config=android_arm64 \
- --cxxopt='--std=c++17' \
- //tensorflow/lite/tools/benchmark:benchmark_model
-
-# Copy the benchmarking binary to phone and allow execution.
-$ adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
- /data/local/tmp
-$ adb shell chmod +x /data/local/tmp/benchmark_model
-```
-
-```shell
-# Build the benchmark_model binary using ruy even for matrix*vector
-# products. This is only worth trying in models that are heavy on matrix*vector
-# shapes, typically LSTMs and other RNNs.
-$ bazel build -c opt \
- --config=android_arm64 \
- --cxxopt='--std=c++17' \
- --copt=-DTFLITE_WITH_RUY_GEMV \
- //tensorflow/lite/tools/benchmark:benchmark_model
-
-# Rename the binary for comparison with the standard benchmark_model.
-$ mv bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
- bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_ruy_gemv
-$ adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_ruy_gemv \
- /data/local/tmp/
-$ adb shell chmod +x /data/local/tmp/benchmark_model_plus_ruy_gemv
-```
-
-```shell
-# Build the benchmark_model binary with flex.
-$ bazel build -c opt \
- --config=android_arm64 \
- --cxxopt='--std=c++17' \
- //tensorflow/lite/tools/benchmark:benchmark_model_plus_flex
-
-# Copy the benchmarking binary to phone and allow execution.
-$ adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
- /data/local/tmp
-$ adb shell chmod +x /data/local/tmp/benchmark_model_plus_flex
-```
-
-Alternatively, you can download and install the
-[Android Benchmark App](https://www.tensorflow.org/lite/performance/measurement#android_benchmark_app).
-If you choose to install the app then you'll have to modify the benchmarking
-commands below slightly, as shown in
-[this example](https://www.tensorflow.org/lite/performance/measurement#run_benchmark).
-
-### 5.2 Run the benchmark
-
-```shell
-# Copy the data over to the phone.
-$ adb shell mkdir -p /data/local/tmp/MatrixOpsStaticModule/tflite
-$ adb push /tmp/iree/modules/MatrixOpsStaticModule/tflite/* \
- /data/local/tmp/MatrixOpsStaticModule/tflite/
-```
-
-```shell
-# Benchmark with TFLite.
-$ adb shell taskset f0 /data/local/tmp/benchmark_model \
- --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \
- --warmup_runs=1 \
- --num_threads=1 \
- --num_runs=10
-```
-
-```shell
-# Benchmark with TFLite + RUY GEMV
-$ adb shell taskset f0 /data/local/tmp/benchmark_model_plus_ruy_gemv \
- --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \
- --warmup_runs=1 \
- --num_threads=1 \
- --num_runs=10
-```
-
-```shell
-# Benchmark with TFLite + Flex.
-$ adb shell taskset f0 /data/local/tmp/benchmark_model_plus_flex \
- --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \
- --warmup_runs=1 \
- --num_threads=1 \
- --num_runs=10
-```
-
-```shell
-# Benchmark with TFLite running on GPU.
-$ adb shell taskset f0 /data/local/tmp/benchmark_model \
- --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \
- --warmup_runs=1 \
- --num_threads=1 \
- --num_runs=10 \
- --use_gpu=true
-```
-
-Running the benchmark on GPU won't give op profiling. To get detailed profiling
-information for the GPU you can run the following script:
-
-```shell
-# Op profiling on GPU using OpenCL backend.
-$ sh tensorflow/lite/delegates/gpu/cl/testing/run_performance_profiling.sh \
- -m /data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite
-```
-
-Note: You will have to manually specify the TFLite graph that you want to
-benchmark, as the `graph_path` file assumes that the graph has not moved. The
-name of the `.tflite` graph that you need to benchmark _may_ be different from
-the name of the trace that you want to benchmark, but you can use `cat` on
-the `graph_path` file to verify the correct `.tflite` filename if you're unsure.
-
-Tip:<br>
-Sometimes `benchmark_model` falls back to using the CPU even
-when `use_gpu` is set. To get more information, you can turn on traces in the
-tool with `adb shell setprop debug.tflite.trace 1`.
-
-### Profile
-
-There are two profilers built into TFLite's `benchmark_model` program. Both of
-them impact latencies, so they should only be used to get a breakdown of the
-relative time spent in each operator type; they should not be enabled for the
-purpose of measuring latency.
-
-The first is `enable_op_profiling`. It's based on timestamps before and after
-each op. It's a runtime command-line flag taken by `benchmark_model`. Example:
-
-``` bash
-$ adb shell taskset f0 /data/local/tmp/benchmark_model \
- --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \
- --warmup_runs=1 \
- --num_threads=1 \
- --num_runs=10 \
- --enable_op_profiling=true
-```
-
-The second is `ruy_profiler`. Despite its name, it's available regardless of
-whether `ruy` is used for the matrix multiplications. It's a sampling profiler,
-which allows it to provide some more detailed information, particularly on
-matrix multiplications. It's a build-time switch:
-
-``` bash
-$ bazel build \
- --define=ruy_profiler=true \
- -c opt \
- --config=android_arm64 \
- //tensorflow/lite/tools/benchmark:benchmark_model
-```
-
-The binary thus built can be run as above; no command-line flag is needed.
diff --git a/docs/developers/developing_iree/releasing.md b/docs/developers/developing_iree/releasing.md
deleted file mode 100644
index 1e42462..0000000
--- a/docs/developers/developing_iree/releasing.md
+++ /dev/null
@@ -1,81 +0,0 @@
-# Releasing
-
-IREE cuts automated releases via a workflow that is
-[triggered daily](https://github.com/openxla/iree/blob/main/.github/workflows/schedule_candidate_release.yml).
-The only constraint placed on the commit that is released is that it has passed
-all CI checks. These are published on GitHub with the "pre-release" status. For
-debugging this process, see the
-[Debugging releases playbook](/docs/developers/debugging/releases.md).
-
-We periodically promote one of these candidates to a "stable" release by
-removing the "pre-release" status. This makes it show up as a "latest" release
-on GitHub. We also push the Python packages for this release to PyPI.
-
-The criteria for selecting this candidate is a bit more involved.
-
-## Coupling to the Google Integrate
-
-The Google team that manages these stable releases at the moment is also
-responsible for integrating the IREE source code into Google's monorepo. For
-convenience, we select a candidate pre-release, attempt to integrate it into
-Google's monorepo and then promote it to stable if that was successful without
-cherry picks that would affect the quality of the release (because they wouldn't
-be present in the promoted version).
-
-This gives some additional validation to the release because it is stress-tested
-running in a different environment and we not-infrequently catch some issues. We
-do not currently have a way to add cherry-picks to a release, so if cherry-picks
-for functional issues are required, we skip promoting the candidate to "stable".
-
-This coupling introduces some additional constraints to the process that are not
-inherent. It would be perfectly fine to start promoting candidates based on some
-other process, but since the same people are managing both, we've coupled these
-so we don't have to keep track of as many different versions.
-
-As the project matures, we will likely remove this coupling. In particular we
-will likely start integrating into Google's monorepo at a faster cadence than
-the stable releases, so a 1:1 mapping there will not make sense.
-
-The PyPI password is also currently stored in Google's internal secret
-management system, so for others to manage the deployment, we would need to
-store it elsewhere with appropriate ACLs.
-
-At the point where others want to engage in the release process, we can easily
-remove any coupling to any Google processes.
-
-## Picking a Candidate to Promote
-
-When selecting a candidate we use the following criteria:
-
-1. ⪆4 days old so that problems with it may have been spotted.
-2. Contains no P0 regressions vs the previous stable release.
-3. LLVM submodule commit exists upstream (no cherry picks or patches) and
- matches a commit already integrated into Google's monorepo
-
-The constraint on LLVM version is largely due to our current process for doing
-so. We aim to lift this limitation and if the process were decoupled from the
-Google integration (see
-[Coupling to the Google Integrate](#coupling-to-the-google-integrate)), it would
-go away anyway.
-
-There is currently no specific tracking for P0 regressions (process creation in
-progress). When you've identified a potential candidate, email the iree-discuss
-list with the proposal and solicit feedback. People may point out known
-regressions or request that some feature make the cut.
-
-## Releasing
-
-1. (Googlers only) Integrate into Google's monorepo, following
- <http://go/iree-g3-integrate-playbook>. If OSS-relevant cherry-picks were
- required to complete this, STOP: do not promote the candidate.
-
-2. (Googlers only) Push to PyPI using
- [pypi_deploy.sh](/build_tools/python_deploy/pypi_deploy.sh) and the
- password stored at <http://go/iree-pypi-password>.
-
-3. Open the release on GitHub. Rename the release from "candidate" to "stable",
- uncheck the option for "pre-release", and check the option for "latest".
-
- 
-
- 
diff --git a/docs/developers/developing_iree/repository_management.md b/docs/developers/developing_iree/repository_management.md
deleted file mode 100644
index 3fb3e3a..0000000
--- a/docs/developers/developing_iree/repository_management.md
+++ /dev/null
@@ -1,21 +0,0 @@
-# IREE Repository Management
-
-Due to the process by which we synchronize this GitHub project with our internal
-Google source code repository, there are some oddities in our workflows and
-processes. We aim to minimize these, and especially to mitigate their impact on
-external contributors, but they are documented here for clarity and
-transparency. If any of these things are particularly troublesome or painful for
-your workflow, please reach out to us so we can prioritize a fix.
-
-Hopefully these quirks actually make usage in other downstream projects easier,
-but integrators may need to look past some details (like the Bazel build system,
-Android support, etc.) based on their specific requirements.
-
-## Build Systems
-
-IREE supports building from source with both Bazel and CMake. CMake is the
-preferred build system for open source users and offers the most flexible
-configuration options. Bazel is a stricter build system and helps with usage in
-the Google internal source repository. Certain dependencies (think large/complex
-projects like CUDA, TensorFlow, PyTorch, etc.) may be difficult to support with
-one build system or the other, so the project may configure these as optional.
diff --git a/docs/developers/get_started/README.md b/docs/developers/get_started/README.md
deleted file mode 100644
index 74b2b91..0000000
--- a/docs/developers/get_started/README.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Additional getting started guides
-
----
-
-The primary guides are located at
-<https://iree.dev/building-from-source/> (source in
-[the website/ folder](../../website/docs/building-from-source/))
-
----
-
-The files in this folder contain a mix of legacy and minimal-effort (Bazel)
-documentation.
diff --git a/docs/developers/get_started/building_with_bazel_linux.md b/docs/developers/get_started/building_with_bazel_linux.md
deleted file mode 100644
index 27126b3..0000000
--- a/docs/developers/get_started/building_with_bazel_linux.md
+++ /dev/null
@@ -1,117 +0,0 @@
-# Getting Started on Linux with Bazel
-
-**NOTE** Bazel build support is primarily for internal project infrastructure.
-We strongly recommend users build with CMake instead.
-
-This guide walks through building the core compiler and runtime parts of IREE
-from source. Auxiliary components like the Python bindings and Vulkan driver are
-documented separately, as they require further setup.
-
-## Prerequisites
-
-### Install Bazel
-
-Install Bazel, matching IREE's
-[`.bazelversion`](https://github.com/openxla/iree/blob/main/.bazelversion) by
-following the
-[official docs](https://docs.bazel.build/versions/master/install.html).
-
-### Install a Compiler
-
-We recommend Clang. GCC is not fully supported.
-
-```shell
-sudo apt install clang
-```
-
-Set environment variables for Bazel:
-
-```shell
-export CC=clang
-export CXX=clang++
-```
-
-### Install python3 numpy
-
-```shell
-python3 -m pip install numpy
-```
-
-## Clone and Build
-
-### Clone
-
-Clone the repository, initialize its submodules and configure:
-
-```shell
-git clone https://github.com/openxla/iree.git
-cd iree
-git submodule update --init
-python3 configure_bazel.py
-```
-
-> Tip:<br>
-> Editors and other programs can also clone the
-> repository, just make sure that they initialize the submodules.
-
-### Build
-
-Run all core tests:
-
-```shell
-bazel test -k //...
-```
-
-> Tip:<br>
-> You can add flags like
-> `--test_env=IREE_VULKAN_DISABLE=1` to your test command to change how/which
-> tests run.
-
-In general, build artifacts will be under the `bazel-bin` directory at the top
-level.
-
-## Recommended user.bazelrc
-
-You can put a user.bazelrc at the root of the repository and it will be ignored
-by git. The recommended contents for Linux are:
-
-```shell
-build --disk_cache=/tmp/bazel-cache
-
-# Use --config=debug to compile IREE and LLVM without optimizations
-# and with assertions enabled.
-build:debug --config=asserts --compilation_mode=opt '--per_file_copt=iree|llvm@-O0' --strip=never
-
-# Use --config=asserts to enable assertions. This has to be done globally:
-# Code compiled with and without assertions can't be linked together (ODR violation).
-build:asserts --compilation_mode=opt '--copt=-UNDEBUG'
-```
-
-## What's next?
-
-### Take a Look Around
-
-Build all of IREE's 'tools' directory:
-
-```shell
-bazel build tools/...
-```
-
-Check out what was built:
-
-```shell
-ls bazel-bin/tools/
-./bazel-bin/tools/iree-compile --help
-```
-
-Translate a
-[MLIR file](https://github.com/openxla/iree/blob/main/samples/models/simple_abs.mlir)
-and execute a function in the compiled module:
-
-```shell
-# iree-run-mlir <compiler flags> [input.mlir] <runtime flags>
-$ ./bazel-bin/tools/iree-run-mlir \
- --iree-hal-target-backends=vmvx --print-mlir \
- ./samples/models/simple_abs.mlir \
- --input=f32=-2
-```
diff --git a/docs/developers/get_started/building_with_bazel_macos.md b/docs/developers/get_started/building_with_bazel_macos.md
deleted file mode 100644
index 884aa20..0000000
--- a/docs/developers/get_started/building_with_bazel_macos.md
+++ /dev/null
@@ -1,120 +0,0 @@
-# Getting Started on macOS with Bazel
-
-**NOTE** Bazel build support is primarily for internal project infrastructure.
-We strongly recommend users build with CMake instead.
-
-This guide walks through building the core compiler and runtime parts of IREE
-from source. Auxiliary components like the Python bindings and Vulkan driver are
-not documented for macOS at this time.
-
-IREE is not officially supported on macOS at this time. It may work, but it is
-not a part of our open source CI, and may be intermittently broken.
-Contributions related to macOS support and documentation are welcome however.
-
-## Prerequisites
-
-### Install Homebrew
-
-This guide uses [Homebrew](https://brew.sh/) to install IREE's dependencies.
-
-```shell
-/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
-```
-
-### Install Bazel
-
-Install Bazel via Homebrew:
-
-```shell
-brew install bazel
-```
-
-Note: when you first run `bazel` to build IREE, it will prompt you to copy and
-run a shell command to select the right version.
-
-### Install python3 numpy
-
-```shell
-python3 -m pip install numpy --user
-```
-
-## Clone and Build
-
-### Clone
-
-Clone the repository, initialize its submodules and configure:
-
-```shell
-git clone https://github.com/openxla/iree.git
-cd iree
-git submodule update --init
-python3 configure_bazel.py
-```
-
-> Tip:<br>
-> Editors and other programs can also clone the
-> repository, just make sure that they initialize the submodules.
-
-### Build
-
-Run all core tests that pass on our OSS CI:
-
-```shell
-$ bazel test -k //... \
- --test_env=IREE_VULKAN_DISABLE=1 \
- --build_tag_filters="-nokokoro" \
- --test_tag_filters="-nokokoro,-driver=vulkan"
-```
-
-> Tip:<br>
-> Not all tests are passing on macOS, but the build does
-> complete successfully at the time of writing.
-
-In general, build artifacts will be under the `bazel-bin` directory at the top
-level.
-
-## Recommended user.bazelrc
-
-You can put a user.bazelrc at the root of the repository and it will be ignored
-by git. The recommended contents for Linux/macOS are:
-
-```shell
-build --disk_cache=/tmp/bazel-cache
-
-# Use --config=debug to compile IREE and LLVM without optimizations
-# and with assertions enabled.
-build:debug --config=asserts --compilation_mode=opt '--per_file_copt=iree|llvm@-O0' --strip=never
-
-# Use --config=asserts to enable assertions. This has to be done globally:
-# Code compiled with and without assertions can't be linked together (ODR violation).
-build:asserts --compilation_mode=opt '--copt=-UNDEBUG'
-```
-
-## What's next?
-
-### Take a Look Around
-
-Build all of IREE's 'tools' directory:
-
-```shell
-bazel build tools/...
-```
-
-Check out what was built:
-
-```shell
-ls bazel-bin/tools/
-./bazel-bin/tools/iree-compile --help
-```
-
-Translate a
-[MLIR file](https://github.com/openxla/iree/blob/main/samples/models/simple_abs.mlir)
-and execute a function in the compiled module:
-
-```shell
-# iree-run-mlir <compiler flags> [input.mlir] <runtime flags>
-$ ./bazel-bin/tools/iree-run-mlir \
- --iree-hal-target-backends=vmvx --print-mlir \
- ./samples/models/simple_abs.mlir \
- --input=f32=-2
-```
diff --git a/docs/developers/get_started/building_with_bazel_windows.md b/docs/developers/get_started/building_with_bazel_windows.md
deleted file mode 100644
index b67cafd..0000000
--- a/docs/developers/get_started/building_with_bazel_windows.md
+++ /dev/null
@@ -1,120 +0,0 @@
-# Getting Started on Windows with Bazel
-
-**NOTE** Bazel build support is primarily for internal project infrastructure.
-Bazel on Windows is particularly unstable and unsupported.
-We strongly recommend users build with CMake instead.
-
-This guide walks through building the core compiler and runtime parts of IREE
-from source. Auxiliary components like the Python bindings and Vulkan driver are
-documented separately, as they require further setup.
-
-## Prerequisites
-
-> Tip:<br>
-> You can simplify installation by using a package
-> manager like [Scoop](https://scoop.sh/) or
-> [Chocolatey](https://chocolatey.org/).
-
-### Install Bazel
-
-Install Bazel version > 2.0.0 (see
-[`.bazelversion`](https://github.com/openxla/iree/blob/main/.bazelversion) for
-the specific version IREE uses) by following the
-[official docs](https://docs.bazel.build/versions/master/install-windows.html).
-
-Also install [MSYS2](https://www.msys2.org/) by following Bazel's documentation.
-
-### Install Python3
-
-Instructions for installation can be found
-[here](https://www.python.org/downloads/windows/).
-
-### Install Build Tools For Visual Studio
-
-Install the full Visual Studio or "Build Tools For Visual Studio" from the
-[downloads page](https://visualstudio.microsoft.com/downloads/).
-
-Set a few environment variables. You are welcome to configure these however you
-choose. For example, you could set them as system or user level environment
-variables through your "System Properties" or you could use a shell such as
-PowerShell or [cmder](https://cmder.net/). Setting them through PowerShell
-would look like this:
-
-```powershell
-> $env:BAZEL_VS = "C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools"
-```
-
-## Clone and Build
-
-### Clone
-
-Using your shell of choice (such as PowerShell or [cmder](https://cmder.net/)),
-clone the repository, initialize its submodules, and configure:
-
-```powershell
-> git clone https://github.com/openxla/iree.git
-> cd iree
-> git submodule update --init
-> python configure_bazel.py
-```
-
-> Tip:<br>
-> Clone to a short path like `C:\projects\` to avoid
-> issues with Windows maximum path lengths (260 characters).
-
-> Tip:<br>
-> Editors and other programs can also clone the
-> repository, just make sure that they initialize the submodules.
-
-> Tip:<br>
-> configure_bazel.py only detects that you have Windows
-> and will output the default `--config=windows` to `configured.bazelrc`, which
-> assumes the latest version of MSVC. To avoid some warnings, you may want to
-> replace it with `--config=msvc2017`.
-
-### Build
-
-Run all core tests:
-
-```powershell
-> bazel test -k //...
-```
-
-In general, build artifacts will be under the `bazel-bin` directory at the top
-level.
-
-## Recommended user.bazelrc
-
-You can put a user.bazelrc at the root of the repository and it will be ignored
-by git. The recommended contents for Windows are:
-
-```starlark
-build --disk_cache=c:/bazelcache
-build:debug --compilation_mode=dbg --copt=/O2 --per_file_copt=iree@/Od --strip=never
-```
-
-## What's next?
-
-### Take a Look Around
-
-Build all of IREE's 'tools' directory:
-
-```powershell
-> bazel build tools/...
-```
-
-Check out what was built:
-
-```powershell
-> dir bazel-bin\tools\
-> .\bazel-bin\tools\iree-compile.exe --help
-```
-
-Translate a
-[MLIR file](https://github.com/openxla/iree/blob/main/samples/models/simple_abs.mlir)
-and execute a function in the compiled module:
-
-```powershell
-> REM iree-run-mlir <compiler flags> [input.mlir] <runtime flags>
-> .\bazel-bin\tools\iree-run-mlir.exe --iree-hal-target-backends=vmvx --print-mlir .\samples\models\simple_abs.mlir --input=f32=-2
-```
diff --git a/docs/website/README.md b/docs/website/README.md
index 6f777b2..84cd5e8 100644
--- a/docs/website/README.md
+++ b/docs/website/README.md
@@ -1,10 +1,10 @@
# IREE User-Facing Documentation Website
-This directory contains the source and assets for IREE's website, hosted on
-[GitHub Pages](https://pages.github.com/).
+This directory contains the source and assets for <https://iree.dev/>.
The website is generated using [MkDocs](https://www.mkdocs.org/), with the
-[Material for MkDocs](https://squidfunk.github.io/mkdocs-material/) theme.
+[Material for MkDocs](https://squidfunk.github.io/mkdocs-material/) theme and
+is served using [GitHub Pages](https://pages.github.com/).
## How to edit this documentation
@@ -38,3 +38,40 @@
```shell
mkdocs gh-deploy --remote-name <your remote>
```
+
+## Website sections and authoring tips
+
+For more details on how this is set up, see
+[IREE Website Overview - July 10, 2023](https://docs.google.com/presentation/d/116TyW_aCsPXmmjRYI2tRqpOwDaGNoV8LDC_j9hsMrDk/edit?usp=sharing)
+(though note that the website organization has changed since then).
+
+For documentation language and style, the guide at
+<https://developers.google.com/style> offers good advice.
+
+### Building from source
+
+Instructions on how to build the project from source on supported platforms.
+
+* Focus on instructions that apply to all users, independent of specific
+ package managers and development styles
+* Set developers up for success with good default options
+* Explain how to interact with the build system and its outputs
+
+### Guides
+
+Workflow-oriented guides showing users how to accomplish tasks
+
+### Reference
+
+Unopinionated descriptions of system components
+
+### Developers
+
+Less structured pages for project development topics
+
+* Pages may be "promoted" from this category to another category if they are
+ generally useful to a wide enough range of developers
+
+### Community (Blog)
+
+A place to showcase work across the community
diff --git a/docs/website/docs/building-from-source/getting-started.md b/docs/website/docs/building-from-source/getting-started.md
index 1203e43..8ed175d 100644
--- a/docs/website/docs/building-from-source/getting-started.md
+++ b/docs/website/docs/building-from-source/getting-started.md
@@ -153,7 +153,7 @@
-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
```
- See also our [developer documentation for ccache](https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/ccache.md).
+ See also our [developer documentation for ccache](../developers/building/cmake-with-ccache.md).
### :octicons-gear-16: Optional components
diff --git a/docs/website/docs/community/blog/posts/cuda-backend.md b/docs/website/docs/community/blog/posts/cuda-backend.md
index 9c3c56c..340c1d7 100644
--- a/docs/website/docs/community/blog/posts/cuda-backend.md
+++ b/docs/website/docs/community/blog/posts/cuda-backend.md
@@ -28,7 +28,8 @@
### HAL support
-IREE has a [HAL API](https://github.com/openxla/iree/blob/main/docs/developers/design_roadmap.md#hal-hardware-abstraction-layer-and-multi-architecture-executables)
+IREE has a
+[HAL API](https://iree.dev/developers/design-docs/design-roadmap/#hal-hardware-abstraction-layer-and-multi-architecture-executables)
that abstract all the targets behind a common interface. The first step to
supporting a CUDA target was to map the HAL API onto CUDA. We use the CUDA
driver API to reduce dependencies and be closer to the hardware. The HAL API is
@@ -96,7 +97,7 @@

The steps to reproduce running a simple op end to end through CUDA backend are
-described [here](https://github.com/openxla/iree/blob/main/docs/developers/design_docs/cuda_backend.md#example).
+described [here](../../../developers/design-docs/cuda-backend.md#example).
## Performance
diff --git a/docs/website/docs/developers/building/bazel.md b/docs/website/docs/developers/building/bazel.md
new file mode 100644
index 0000000..790345f
--- /dev/null
+++ b/docs/website/docs/developers/building/bazel.md
@@ -0,0 +1,220 @@
+# Building with Bazel
+
+This page walks through building IREE from source using the
+[Bazel build system](https://bazel.build/).
+
+!!! warning
+
+ Bazel build support is primarily for internal project infrastructure. We
+ strongly recommend [using CMake](../../building-from-source/index.md)
+ instead.
+
+ Our Bazel configuration is also _only_ tested on Linux. Windows and macOS
+ may be unstable.
+
+## :octicons-download-16: Prerequisites
+
+=== ":fontawesome-brands-linux: Linux"
+
+ 1. Install Bazel, matching IREE's
+ [`.bazelversion`](https://github.com/openxla/iree/blob/main/.bazelversion)
+ by following the
+ [official docs](https://bazel.build/install).
+
+ 2. Install a compiler such as Clang (GCC is not fully supported).
+
+ ```shell
+ sudo apt install clang
+ ```
+
+ Set environment variables for Bazel:
+
+ ```shell
+ export CC=clang
+ export CXX=clang++
+ ```
+
+ 3. Install Python build requirements:
+
+ ```shell
+ python -m pip install -r runtime/bindings/python/iree/runtime/build_requirements.txt
+ ```
+
+=== ":fontawesome-brands-apple: macOS"
+
+ 1. Install [Homebrew](https://brew.sh/):
+
+ ```shell
+ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
+ ```
+
+ 2. Install Bazel, matching IREE's
+ [`.bazelversion`](https://github.com/openxla/iree/blob/main/.bazelversion)
+ by following the [official docs](https://bazel.build/install/os-x) or
+ via Homebrew:
+
+ ```shell
+ brew install bazel
+ ```
+
+ 3. Install Python build requirements:
+
+ ```shell
+ python -m pip install -r runtime/bindings/python/iree/runtime/build_requirements.txt
+ ```
+
+=== ":fontawesome-brands-windows: Windows"
+
+ !!! tip
+
+ You can simplify installation by using a package manager like
+ [Scoop](https://scoop.sh/) or [Chocolatey](https://chocolatey.org/).
+
+ 1. Install Bazel, matching IREE's
+ [`.bazelversion`](https://github.com/openxla/iree/blob/main/.bazelversion)
+ by following the [official docs](https://bazel.build/install/windows).
+
+ Also install [MSYS2](https://www.msys2.org/) by following Bazel's documentation.
+
+ 2. Install Python3 ([docs here](https://www.python.org/downloads/windows/))
+ and Python build requirements:
+
+ ```shell
+ python -m pip install -r runtime/bindings/python/iree/runtime/build_requirements.txt
+ ```
+
+ 3. Install the full Visual Studio or "Build Tools For Visual Studio" from the
+ [downloads page](https://visualstudio.microsoft.com/downloads/) then
+ set the `BAZEL_VS` environment variable:
+
+ ```powershell
+ > $env:BAZEL_VS = "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools"
+ ```
+
+## :octicons-rocket-16: Quickstart: clone and build
+
+### Clone
+
+Use [Git](https://git-scm.com/) to clone the IREE repository and initialize its
+submodules:
+
+```shell
+git clone https://github.com/openxla/iree.git
+cd iree
+git submodule update --init
+```
+
+Configure Bazel:
+
+```shell
+# This generates a `configured.bazelrc` file by analyzing your environment.
+# Skipping this step will make it difficult to select your platform/compiler.
+python3 configure_bazel.py
+```
+
+=== ":fontawesome-brands-linux: Linux"
+
+ (No Linux-specific tips for configuring)
+
+=== ":fontawesome-brands-apple: macOS"
+
+ (No macOS-specific tips for configuring)
+
+=== ":fontawesome-brands-windows: Windows"
+
+ !!! tip
+
+ Clone to a short path like `C:\projects\` to avoid issues with Windows
+ maximum path lengths (260 characters).
+
+ !!! tip
+
+ `configure_bazel.py` only detects that you have Windows and will output
+ the default `--config=windows` to `configured.bazelrc`, which assumes
+ the latest version of MSVC. To avoid some warnings, you may want to
+ replace it with (for example) `--config=msvc2022`.
+
+### Build
+
+Run all core tests:
+
+```shell
+bazel test -k //...
+```
+
+!!! tip
+
+ You can add flags like `--test_env=IREE_VULKAN_DISABLE=1` to your test
+ command to change how/which tests run.
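+
+    For example, a filtered invocation combining these flags might look like
+    this (the package path is just an illustration):
+
+    ```shell
+    bazel test -k //runtime/... \
+        --test_env=IREE_VULKAN_DISABLE=1 \
+        --test_output=errors
+    ```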
+
+In general, build artifacts will be under the `bazel-bin` directory at the top
+level.
+
+## :octicons-gear-16: Recommended `user.bazelrc`
+
+You can put a `user.bazelrc` file at the root of the repository and it will be
+ignored by git.
+
+=== ":fontawesome-brands-linux: Linux"
+
+ ```shell
+ build --disk_cache=/tmp/bazel-cache
+
+ # Use --config=debug to compile IREE and LLVM without optimizations
+ # and with assertions enabled.
+ build:debug --config=asserts --compilation_mode=opt '--per_file_copt=iree|llvm@-O0' --strip=never
+
+ # Use --config=asserts to enable assertions. This has to be done globally:
+ # Code compiled with and without assertions can't be linked together (ODR violation).
+ build:asserts --compilation_mode=opt '--copt=-UNDEBUG'
+ ```
+
+=== ":fontawesome-brands-apple: macOS"
+
+ ```shell
+ build --disk_cache=/tmp/bazel-cache
+
+ # Use --config=debug to compile IREE and LLVM without optimizations
+ # and with assertions enabled.
+ build:debug --config=asserts --compilation_mode=opt '--per_file_copt=iree|llvm@-O0' --strip=never
+
+ # Use --config=asserts to enable assertions. This has to be done globally:
+ # Code compiled with and without assertions can't be linked together (ODR violation).
+ build:asserts --compilation_mode=opt '--copt=-UNDEBUG'
+ ```
+
+=== ":fontawesome-brands-windows: Windows"
+
+ ```shell
+ build --disk_cache=c:/bazelcache
+ build:debug --compilation_mode=dbg --copt=/O2 --per_file_copt=iree@/Od --strip=never
+ ```
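+
+With a `user.bazelrc` like the ones above in place, the named configs can be
+selected per invocation, for example:
+
+```shell
+# Run the test suite with assertions enabled (uses the build:asserts config).
+bazel test --config=asserts -k //...
+
+# Build the tools using the build:debug config defined above
+# (target pattern is illustrative).
+bazel build --config=debug //tools/...
+```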
+
+## What's next?
+
+### Take a look around
+
+Build all of IREE's 'tools' directory:
+
+```shell
+bazel build tools/...
+```
+
+Check out what was built:
+
+```shell
+ls bazel-bin/tools/
+./bazel-bin/tools/iree-compile --help
+```
+
+Translate an
+[MLIR file](https://github.com/openxla/iree/blob/main/samples/models/simple_abs.mlir)
+and execute a function in the compiled module:
+
+```shell
+# iree-run-mlir <compiler flags> [input.mlir] <runtime flags>
+$ ./bazel-bin/tools/iree-run-mlir \
+ --iree-hal-target-backends=vmvx --print-mlir \
+ ./samples/models/simple_abs.mlir \
+ --input=f32=-2
+```
diff --git a/docs/developers/get_started/cmake_options_and_variables.md b/docs/website/docs/developers/building/cmake-options-and-variables.md
similarity index 71%
rename from docs/developers/get_started/cmake_options_and_variables.md
rename to docs/website/docs/developers/building/cmake-options-and-variables.md
index 28e6dc6..9c4b137 100644
--- a/docs/developers/get_started/cmake_options_and_variables.md
+++ b/docs/website/docs/developers/building/cmake-options-and-variables.md
@@ -1,116 +1,164 @@
-# CMake Options and Variables
+# CMake options and variables
-## Frequently-used CMake Variables
+## Frequently-used CMake variables
-### `CMAKE_BUILD_TYPE`:STRING
+### `CMAKE_BUILD_TYPE`
+
+* type: STRING
Sets the build type. Possible values are `Release`, `Debug`,
`RelWithDebInfo` and `MinSizeRel`. If unset, build type is set to `Release`.
-### `CMAKE_<LANG>_COMPILER`:STRING
+### `CMAKE_<LANG>_COMPILER`
+
+* type: STRING
This is the command that will be used as the `<LANG>` compiler, where `<LANG>`
is `C` or `CXX` in IREE. Set these variables to compile IREE with `clang` and
`clang++`, respectively. Once set, these variables cannot be changed.
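+
+For example, Clang can be selected at configure time like this (a minimal
+sketch; adjust the generator and build directory to your setup):
+
+```shell
+cmake -G Ninja -B ../iree-build/ . \
+    -DCMAKE_C_COMPILER=clang \
+    -DCMAKE_CXX_COMPILER=clang++ \
+    -DCMAKE_BUILD_TYPE=RelWithDebInfo
+```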
-## IREE-specific CMake Options and Variables
+## IREE-specific CMake options and variables
This gives a brief explanation of IREE specific CMake options and variables.
-### `IREE_ENABLE_RUNTIME_TRACING`:BOOL
+### `IREE_ENABLE_RUNTIME_TRACING`
+
+* type: BOOL
Enables instrumented runtime tracing. Defaults to `OFF`.
-### `IREE_ENABLE_COMPILER_TRACING`:BOOL
+### `IREE_ENABLE_COMPILER_TRACING`
+
+* type: BOOL
Enables instrumented compiler tracing. This requires that
`IREE_ENABLE_RUNTIME_TRACING` also be set. Defaults to `OFF`.
-### `IREE_BUILD_COMPILER`:BOOL
+### `IREE_BUILD_COMPILER`
+
+* type: BOOL
Builds the IREE compiler. Defaults to `ON`.
-### `IREE_BUILD_TESTS`:BOOL
+### `IREE_BUILD_TESTS`
+
+* type: BOOL
Builds IREE unit tests. Defaults to `ON`.
-### `IREE_BUILD_DOCS`:BOOL
+### `IREE_BUILD_DOCS`
+
+* type: BOOL
Builds IREE documentation files. Defaults to `OFF`.
-### `IREE_BUILD_SAMPLES`:BOOL
+### `IREE_BUILD_SAMPLES`
+
+* type: BOOL
Builds IREE sample projects. Defaults to `ON`.
-### `IREE_BUILD_PYTHON_BINDINGS`:BOOL
+### `IREE_BUILD_PYTHON_BINDINGS`
+
+* type: BOOL
Builds the IREE python bindings. Defaults to `OFF`.
-### `IREE_BUILD_BINDINGS_TFLITE`:BOOL
+### `IREE_BUILD_BINDINGS_TFLITE`
+
+* type: BOOL
Builds the IREE TFLite C API compatibility shim. Defaults to `ON`.
-### `IREE_BUILD_BINDINGS_TFLITE_JAVA`:BOOL
+### `IREE_BUILD_BINDINGS_TFLITE_JAVA`
+
+* type: BOOL
Builds the IREE TFLite Java bindings with the C API compatibility shim.
Defaults to `ON`.
-### `IREE_BUILD_EXPERIMENTAL_REMOTING`:BOOL
+### `IREE_BUILD_EXPERIMENTAL_REMOTING`
+
+* type: BOOL
Builds experimental remoting component. Defaults to `OFF`.
-### `IREE_HAL_DRIVER_DEFAULTS`:BOOL
+### `IREE_HAL_DRIVER_DEFAULTS`
+
+* type: BOOL
Default setting for each `IREE_HAL_DRIVER_*` option.
-### `IREE_HAL_DRIVER_*`:BOOL
+### `IREE_HAL_DRIVER_*`
+
+* type: BOOL
Individual options enabling the build for each runtime HAL driver.
-### `IREE_TARGET_BACKEND_DEFAULTS`:BOOL
+### `IREE_TARGET_BACKEND_DEFAULTS`
+
+* type: BOOL
Default setting for each `IREE_TARGET_BACKEND_*` option.
-### `IREE_TARGET_BACKEND_*`:BOOL
+### `IREE_TARGET_BACKEND_*`
+
+* type: BOOL
Individual options enabling the build for each compiler target backend.
-### `IREE_INPUT_*`:BOOL
+### `IREE_INPUT_*`
+
+* type: BOOL
Individual options enabling each set of input dialects.
-### `IREE_OUTPUT_FORMAT_C`:BOOL
+### `IREE_OUTPUT_FORMAT_C`
+
+* type: BOOL
Enables the vm-c compiler output format, using MLIR EmitC. Defaults to `ON`.
-### `IREE_DEV_MODE`:BOOL
+### `IREE_DEV_MODE`
+
+* type: BOOL
Configure settings to optimize for IREE development (as opposed to CI or
release). Defaults to `OFF`. For example, this will downgrade some compiler
diagnostics from errors to warnings.
-### `IREE_ENABLE_LLD`:BOOL
+### `IREE_ENABLE_LLD`
+
+* type: BOOL
Use lld when linking. Defaults to `OFF`. This option is equivalent to
`-DIREE_USE_LINKER=lld`. The options `IREE_ENABLE_LLD` and `IREE_USE_LINKER`
cannot be set at the same time.
-### `IREE_ENABLE_ASAN`:BOOL
+### `IREE_ENABLE_ASAN`
+
+* type: BOOL
Enable [address sanitizer](https://clang.llvm.org/docs/AddressSanitizer.html) if
the current build type is Debug and the compiler supports it.
-### `IREE_ENABLE_MSAN`:BOOL
+### `IREE_ENABLE_MSAN`
+
+* type: BOOL
Enable [memory sanitizer](https://clang.llvm.org/docs/MemorySanitizer.html) if
the current build type is Debug and the compiler supports it.
-### `IREE_ENABLE_TSAN`:BOOL
+### `IREE_ENABLE_TSAN`
+
+* type: BOOL
Enable [thread sanitizer](https://clang.llvm.org/docs/ThreadSanitizer.html) if
the current build type is Debug and the compiler supports it.
-### `IREE_ENABLE_UBSAN`:BOOL
+### `IREE_ENABLE_UBSAN`
+
+* type: BOOL
Enable [undefined behavior sanitizer](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html)
if the current build type is Debug and the compiler supports it.
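+
+As an example, these options are passed as `-D` definitions at configure time.
+A sketch of a development configuration using only options documented above
+(adjust values to your needs):
+
+```shell
+cmake -G Ninja -B ../iree-build/ . \
+    -DCMAKE_BUILD_TYPE=Debug \
+    -DIREE_BUILD_COMPILER=ON \
+    -DIREE_BUILD_PYTHON_BINDINGS=ON \
+    -DIREE_ENABLE_ASAN=ON
+```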
diff --git a/docs/developers/developing_iree/ccache.md b/docs/website/docs/developers/building/cmake-with-ccache.md
similarity index 98%
rename from docs/developers/developing_iree/ccache.md
rename to docs/website/docs/developers/building/cmake-with-ccache.md
index 2036998..759b98c 100644
--- a/docs/developers/developing_iree/ccache.md
+++ b/docs/website/docs/developers/building/cmake-with-ccache.md
@@ -1,4 +1,4 @@
-# Using `ccache` to build IREE
+# CMake with `ccache`
[`ccache`](https://ccache.dev/) is a compilation cache. In principle, just
prepending compiler invocations with `ccache` is all one needs to enable it,
diff --git a/docs/developers/get_started/building_with_emscripten.md b/docs/website/docs/developers/building/emscripten.md
similarity index 90%
rename from docs/developers/get_started/building_with_emscripten.md
rename to docs/website/docs/developers/building/emscripten.md
index 181a271..2323859 100644
--- a/docs/developers/get_started/building_with_emscripten.md
+++ b/docs/website/docs/developers/building/emscripten.md
@@ -1,4 +1,4 @@
-# Getting Started With Emscripten
+# Building with Emscripten
[Emscripten](https://emscripten.org/index.html) is a complete compiler
toolchain to WebAssembly, using LLVM, with a special focus on speed, size, and
@@ -24,9 +24,9 @@
source ./emsdk_env.sh
```
-## Building IREE's Runtime with Emscripten
+## Building IREE's runtime with Emscripten
-### Host Configuration
+### Host configuration
Build and install at least the compiler tools on your host machine, or install
them from a binary distribution:
@@ -40,7 +40,7 @@
$ cmake --build ../iree-build-host/ --target install
```
-### Target Configuration
+### Target configuration
```shell
$ emcmake cmake -G Ninja -B ../iree-build-emscripten/ \
@@ -58,7 +58,7 @@
--target iree_samples_simple_embedding_simple_embedding_vmvx_sync
```
-### Load into a WebAssembly Environment
+### Load into a WebAssembly environment
Copy the outputs from the build process (e.g. `simple_embedding_vmvx_sync.js`
and `simple_embedding_vmvx_sync.wasm`) into your application and follow
diff --git a/docs/website/docs/developers/debugging/android-with-lldb.md b/docs/website/docs/developers/debugging/android-with-lldb.md
new file mode 100644
index 0000000..7774b7c
--- /dev/null
+++ b/docs/website/docs/developers/debugging/android-with-lldb.md
@@ -0,0 +1,61 @@
+# Android debugging with LLDB
+
+This doc shows how to use LLDB to debug native binaries on Android. For a more
+complete explanation, see the
+[official LLDB documentation on remote debugging](https://lldb.llvm.org/use/remote.html).
+
+## Prerequisites
+
+We assume the following setup:
+
+1. [Android NDK is installed](https://developer.android.com/ndk/downloads) and
+ the `ANDROID_NDK` environment variable is set to the installation path.
+2. Your Android device is connected and configured for
+   [`adb`](https://developer.android.com/studio/command-line/adb).
+3. The Android binary of interest is already compiled and the command to run it
+ (in `adb shell`) is `<your-binary> [program args...]`. This does *not* have
+ to be a proper Android app with a manifest, etc.
+
+## Running manually
+
+1. Push the toolchain files, including `lldb-server`, to your device:
+
+ ```shell
+ adb shell "mkdir -p /data/local/tmp/tools"
+ adb push "$ANDROID_NDK"/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.6/lib/linux/aarch64/* /data/local/tmp/tools
+ ```
+
+ You may need to adjust the clang toolchain version to match the one in your
+ NDK. You can find it with
+ `find "$ANDROID_NDK/toolchains/llvm/prebuilt" -name lldb-server`.
+
+2. Set up port forwarding. We are going to use port 5039 but you are free to
+ pick a different one:
+
+ ```shell
+ adb forward tcp:5039 tcp:5039
+ ```
+
+3. Start an `lldb-server` in a new interactive adb shell:
+
+ ```shell
+ adb shell
+ /data/local/tmp/tools/lldb-server platform --listen '*:5039' --server
+ ```
+
+4. Launch `lldb`, connect to the server and run the binary:
+
+ ```shell
+ lldb -o 'platform select remote-android' \
+ -o 'platform connect connect://:5039' \
+ -o 'platform shell cd /data/local/tmp'
+ target create <your-binary>
+ run [program args...]
+ ```
+
+ You can either use the system `lldb` or a prebuilt under `"$ANDROID_NDK"/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.6/lib/linux/<your-host-arch>`.
+
+    Explanation: each `-o` (short for `--one-line`) tells lldb to execute a
+ command on startup. You can run those manually in the lldb shell, if you
+ prefer. Then, we tell lldb which working directory to use, where to find the
+ executable, and what command line arguments to use.
diff --git a/docs/developers/debugging/compile_time_regressions.md b/docs/website/docs/developers/debugging/compile-time-regressions.md
similarity index 93%
rename from docs/developers/debugging/compile_time_regressions.md
rename to docs/website/docs/developers/debugging/compile-time-regressions.md
index d15f1dc..a6ea76c 100644
--- a/docs/developers/debugging/compile_time_regressions.md
+++ b/docs/website/docs/developers/debugging/compile-time-regressions.md
@@ -1,4 +1,4 @@
-# Debugging Compile Time Regressions
+# Compile time regression debugging
So the IREE compiler used to compile a program quickly, but it is now slower.
What do you do?
@@ -7,26 +7,26 @@
Try to answer as many of these questions as you can:
-* **When did compilation get slower?**
+> **When did compilation get slower?**
A specific git commit is ideal, but "sometime in the last week" is a good
starting point. You'll ultimately want to find a culprit release or git
commit that changed the compiler code.
-* **How much slower did compilation get?**
+> **How much slower did compilation get?**
Be specific - did it jump from 1 minute to 2 minutes, or 1 minute to 1 hour?
Identifying the scale of the regression can help set the priority to
investigate it.
-* **What is the full compile command?**
+> **What is the full compile command?**
Try to extract the input program and full list of flags passed to the
compiler binary so that others can reproduce what you're seeing. Try to
distill this as much as possible to using just native tools (no Python or
other framework layers).
-* **What environment is the compiler running in?**
+> **What environment is the compiler running in?**
Are you using a `Debug` build, or a release build? What operating system and
size machine is running the compiler (e.g. Linux developer machine, or a
@@ -44,7 +44,7 @@
specific commits in IREE, though it typically won't let you step through changes
in submodules (e.g. MLIR updates in `third_party/llvm-project/`).
-**Tip**: [Configure ccache](../developing_iree/ccache.md) if you'll be
+**Tip**: [Configure ccache](../building/cmake-with-ccache.md) if you'll be
rebuilding the compiler while bisecting
A manual workflow with `git bisect` looks like this:
@@ -81,9 +81,7 @@
git bisect run run_bisect.sh
```
-Other sample scripts:
-
-#### Compile executable sources individually with a timeout
+#### Sample: compile executable sources individually with a timeout
```bash
#!/bin/bash
@@ -186,8 +184,10 @@
### Using Tracy
+<!-- TODO(scotttodd): update link -->
+
See our documentation on
-[profiling with Tracy](../developing_iree/profiling_with_tracy.md). For compile
+[profiling with Tracy](../performance/profiling-with-tracy.md). For compile
time regressions, pay particular attention to the different compilation phases
(Flow/Stream/HAL), how many times `TranslateExecutablesPass` runs, and if there
are outlier passes that take significantly longer to run than others.
diff --git a/docs/website/docs/developers/debugging/integration-tests.md b/docs/website/docs/developers/debugging/integration-tests.md
new file mode 100644
index 0000000..5761c48
--- /dev/null
+++ b/docs/website/docs/developers/debugging/integration-tests.md
@@ -0,0 +1,121 @@
+# Integration test debugging
+
+This document includes tips for triaging integration test correctness issues.
+Feel free to reach out to @hanhanW or ask questions on Discord for more help.
+
+## General tips
+
+### Narrow down reproducers
+
+* Models themselves can be large, and IREE breaks models into
+  dispatches/kernels and then launches those individually. Program outputs
+  could diverge starting from any individual launch. To get a smaller
+  reproducer, you can use
+  [--iree-flow-trace-dispatch-tensors](../general/developer-overview.md#-iree-flow-trace-dispatch-tensors)
+  (see the example after this list).
+* You can compare the logs between builds/backends to get an idea about which
+  dispatch results in wrong outputs. The dumped inputs can be reused in a
+  flagfile.
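+
+A compile/run pair that dumps tensors at dispatch boundaries might look like
+this (a sketch; the input file, target backend, function name, and input shape
+are placeholders):
+
+```shell
+# Compile with dispatch tensor tracing enabled.
+iree-compile \
+    --iree-hal-target-backends=llvm-cpu \
+    --iree-flow-trace-dispatch-tensors \
+    model.mlir -o /tmp/model-traced.vmfb
+
+# Run the module; traced tensors are printed as each dispatch executes.
+iree-run-module \
+    --device=local-task \
+    --module=/tmp/model-traced.vmfb \
+    --function=main \
+    --input="1x4xf32=0"
+```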
+
+Once a suspicious dispatch is identified, we can create a test case based on
+the dispatch function. The dispatch function can be derived after the
+`OutlineDispatchRegions` pass. The function signatures have to be modified
+manually: turn the `flow.dispatch.tensor.load` values into function arguments,
+and replace `flow.dispatch.tensor.store` with a `return` op.
+
+Note: This only works when the dispatch formation logic is identical between
+runs.
+
+## iree-samples repository tests
+
+Follow the [README](https://github.com/iree-org/iree-samples#readme) to run
+the model. The MLIR files will be generated, and the log shows where they are
+saved. For example:
+
+```shell
+[ RUN ] MobilenetV2Int8Test.test_compile_tflite
+I0401 17:27:04.084272 140182373025024 test_util.py:119] Setting up for IREE
+I0401 17:27:04.085064 140182373025024 binaries.py:218] Invoke IREE Pipeline:
+ /tmp/iree-samples/iree-samples.venv/lib/python3.9/site-packages/iree/tools/tflite/iree-import-tflite
+ /tmp/iree-samples/tflitehub/tmp/mobilenet_v2_int8_test.py/model.tflite
+ --mlir-print-debuginfo
+ --save-temp-tfl-input=/tmp/iree-samples/tflitehub/tmp/mobilenet_v2_int8_test.py/tflite.mlir
+ --save-temp-iree-input=/tmp/iree-samples/tflitehub/tmp/mobilenet_v2_int8_test.py/tosa.mlir
+```
+
+Unfortunately, the artifacts are not dumped in the runs. There is an
+[issue](https://github.com/openxla/iree/issues/8756) for tracking this. A
+workaround can be found in the issue.
+
+## TensorFlow integration tests
+
+These are steps to reproduce/address failures in TF/TFLite integration tests.
+These instructions are most stable on Linux, though they may work with a few
+tweaks on Windows and macOS.
+
+All steps here assume starting from the IREE root directory.
+
+1. First create a Python virtual environment to install packages into:
+
+ ```bash
+ python -m venv iree-tf.venv
+ source iree-tf.venv/bin/activate
+
+ # Install test requirements
+ python -m pip install -r ./integrations/tensorflow/test/requirements.txt
+ ```
+
+2. Install IREE's tools and Python bindings or build them from source
+
+ Install distributed packages
+
+ ```bash
+ # Install packages from nightly releases
+ # This should work for most cases, as the importers change infrequently
+ python -m pip install \
+ iree-compiler iree-runtime iree-tools-tf iree-tools-tflite \
+ --find-links https://iree.dev/pip-release-links.html
+ ```
+
+ _OR_ build from source
+
+ ```bash
+ # Build Python bindings from source
+ cmake -G Ninja -B ../iree-build/ -DIREE_BUILD_PYTHON_BINDINGS=ON .
+ cmake --build ../iree-build/
+
+ # Add IREE built-from-source Python packages to PYTHONPATH
+ source .env
+
+ # Install IREE TF/TFLite Python packages
+ python -m pip install integrations/tensorflow/python_projects/iree_tf
+ python -m pip install integrations/tensorflow/python_projects/iree_tflite
+ ```
+
+3. Run the python test command line
+
+ The command can be obtained from the run file. For example, if
+ `iree_tfl_tests/llvmcpu_posenet_i8.run` failed,
+
+ ```bash
+ cd integrations/tensorflow/test/
+ cat iree_tfl_tests/llvmcpu_posenet_i8.run
+
+ # REQUIRES: llvmcpu
+ # RUN: %PYTHON -m iree_tfl_tests.posenet_i8_test --target_backend=llvmcpu --artifacts_dir=%t
+
+ cd python/
+ python -m iree_tfl_tests.posenet_i8_test --target_backend=llvmcpu --artifacts_dir=/tmp/posenet_i8_failure
+ ```
+
+ Note that the command can only be run under
+ `integrations/tensorflow/test/python` directory.
+
+4. Extract intermediate files and use with native tools
+
+ The test will create an `iree_input.mlir` in the temp directory specified.
+ Those can then be fed into `iree-compile` (built locally to reproduce the
+ error)
+
+ ```bash
+ iree-compile \
+ --iree-hal-target-backends=llvm-cpu \
+ --iree-input-type=stablehlo \
+ iree_input.mlir
+ ```
diff --git a/docs/developers/debugging/releases.md b/docs/website/docs/developers/debugging/releases.md
similarity index 99%
rename from docs/developers/debugging/releases.md
rename to docs/website/docs/developers/debugging/releases.md
index 5b05dec..82554d8 100644
--- a/docs/developers/debugging/releases.md
+++ b/docs/website/docs/developers/debugging/releases.md
@@ -1,4 +1,4 @@
-# Debugging releases playbook
+# Release debugging playbook
## Tools and Locations
diff --git a/docs/developers/developing_iree/sanitizers.md b/docs/website/docs/developers/debugging/sanitizers.md
similarity index 82%
rename from docs/developers/developing_iree/sanitizers.md
rename to docs/website/docs/developers/debugging/sanitizers.md
index 95e362a..1c9f342 100644
--- a/docs/developers/developing_iree/sanitizers.md
+++ b/docs/website/docs/developers/debugging/sanitizers.md
@@ -1,4 +1,4 @@
-# Using Address/Memory/Thread Sanitizers
+# Sanitizers (ASan/MSan/TSan)
[AddressSanitizer](https://clang.llvm.org/docs/AddressSanitizer.html),
[MemorySanitizer](https://clang.llvm.org/docs/MemorySanitizer.html) and
@@ -14,15 +14,15 @@
Tool | Detects | Helps debug what? | Slowdown | Memory overhead | Android support
------ | ------- | ----------------- | -------- | --------------- | ---------------
-ASan | Out-of-bounds accesses,<br>Use-after-free,<br>Use-after-return,<br>Memory leaks (*), ... | Crashes,<br>non-deterministic results,<br>memory leaks (*) | 2x | 3x | Yes
+ASan | Out-of-bounds accesses, use-after-free, use-after-return, memory leaks | Crashes, non-deterministic results, memory leaks | 2x | 3x | Yes
MSan | Uninitialized memory reads | Non-deterministic results | 3x | ? | Yes
TSan | Data races | Many bugs in multi-thread code | 5x-15x | 5x-10x | [No](https://github.com/android/ndk/issues/1171)
-Notes:
+!!! note
-* (*) See [this
- documentation](https://clang.llvm.org/docs/AddressSanitizer.html#memory-leak-detection)
- on leak detection. It is only enabled by default on some platforms.
+ See
+ [this documentation](https://clang.llvm.org/docs/AddressSanitizer.html#memory-leak-detection)
+ on leak detection. It is only enabled by default on some platforms.
## Support status and how to enable each sanitizer
@@ -71,7 +71,7 @@
If you know what you're doing (i.e. if you are not building targets that
internally involve a LLVM/CPU `iree_bytecode_module`), feel free to locally
comment out the CMake error and only set `IREE_ENABLE_TSAN`. Also see a
-[past attempt]((<https://github.com/openxla/iree/pull/8966>) to relax that CMake
+[past attempt](https://github.com/openxla/iree/pull/8966) to relax that CMake
validation.
### MSan (MemorySanitizer)
@@ -125,9 +125,12 @@
Where `/tmp/asan.txt` is where you've pasted the raw sanitizer report.
-**Tip:** this script will happily just echo any line that isn't a stack frame.
-That means you can feed it the whole `ASan` report at once, and it will output a
-symbolized version of it. DO NOT run it on a single stack at a time! That is
-unlike the symbolizer tool that's being added in NDK r22, and one of the reasons
-why we prefer to keep our own script. For more details see [this
-comment](https://github.com/android/ndk/issues/753#issuecomment-719719789)
+!!! tip
+
+ This script will happily just echo any line that isn't a stack frame.
+ That means you can feed it the whole `ASan` report at once, and it will
+ output a symbolized version of it. DO NOT run it on a single stack at a
+ time! That is unlike the symbolizer tool that's being added in NDK r22, and
+ one of the reasons why we prefer to keep our own script. For more details
+ see
+ [this comment](https://github.com/android/ndk/issues/753#issuecomment-719719789).
diff --git a/docs/website/docs/developers/design-docs/cuda-backend.md b/docs/website/docs/developers/design-docs/cuda-backend.md
new file mode 100644
index 0000000..74b2f69
--- /dev/null
+++ b/docs/website/docs/developers/design-docs/cuda-backend.md
@@ -0,0 +1,154 @@
+# CUDA backend
+
+!!! note - "Authored March, 2021"
+
+This document is intended to provide an overview of the design choices made to
+support CUDA within IREE. It describes both the HAL runtime and the NVVM
+codegen side.
+
+## CUDA HAL driver
+
+The CUDA HAL driver is in the [`iree/hal/drivers/cuda/`][iree-cuda] directory.
+It is written in C following the standards of the rest of the HAL module.
+
+### CUDA library dependency
+
+IREE calls directly into the [`CUDA driver API`][cuda-driver]. The CUDA library
+is loaded dynamically and the `cuda.h` header from the CUDA SDK is part of the
+IREE `third_party` project. Therefore IREE doesn't require the CUDA SDK to be
+installed when building IREE tools.
+
+At runtime the HAL CUDA driver will load the `libcuda.so`/`nvcuda.dll` library
+and a subset of the CUDA driver API used in the HAL. The list of functions
+being used is in the file
+[`iree/hal/drivers/cuda/dynamic_symbols_tables.h`][cuda-symbols].
+
+### Driver
+
+There is no direct equivalent in CUDA to the HAL driver abstraction. We use it
+to hold the symbols loaded for all the devices.
+
+### Device
+
+The equivalent of the HAL device in CUDA is the `CUcontext`; it holds all the
+state related to memory allocations.
+
+### Command buffer
+
+We implement command buffers using the [`CUDA Graph API`][cuda-graph]. Using
+the Graph API allows us to easily encode fine-grained dependencies between
+dispatches without having to create multiple streams.
+
+Note that the Graph API is meant to be used for command buffers that can be
+recorded once and used several times, and there may be a performance penalty
+to using the Graph API for direct command buffers. It is likely that we will
+also have a pure stream implementation in the future if we see performance
+problems with direct command buffer usage.
+
+### Event and barrier
+
+In the HAL, Event and Barrier are used for GPU<->GPU synchronization, either
+within a command buffer (Event and Barrier) or between command buffers.
+
+The current implementation ignores events and barriers and serializes all the
+nodes of the graph in order to have a conservative but correct solution.
+
+The design we plan for the future is to map dependencies within a command
+buffer to graph dependencies in the CUDA Graph API. When an event is signaled
+all the leaf nodes of the graph will be saved in HAL data structure and when
+the same command buffer waits on the signal we will add all the nodes as
+dependency to the future nodes added to the graph.
+
+For simplicity we always serialize command buffers sent to the same command
+queue.
+
+### Allocator
+
+The allocator will forward allocation requests to `cuMemHostAlloc` for
+host-accessible memory and `cuMemAlloc` for device-only memory.
+
+### Buffer
+
+CUDA buffers are represented either as a host pointer or a device pointer of
+type `CUdeviceptr`.
+
+### Executable
+
+A HAL executable maps naturally to a PTX module. The compiler will generate a
+FlatBuffer containing a PTX text module as well as a list of entry point
+function names and the workgroup size associated with those entry points.
+
+### Semaphore
+
+Timeline semaphores are used in IREE to handle coarse-grained synchronization
+for CPU<->GPU, GPU<->GPU and CPU<->CPU. The interface closely follows the
+[`Vulkan timeline semaphore spec`][vulkan-semaphore].
+
+There is currently no simple way to implement this on CUDA. There are several
+solutions discussed on this [`IREE issue`][semaphore-issue] but no obvious
+solution. For now we force CPU and GPU to be synchronized after every submit to
+ensure correctness and ignore the semaphore.
+
+## NVVM codegen
+
+### NVVM and PTX
+
+NVVM is a CUDA-specific IR composed of LLVM IR and NVVM-specific intrinsics. It
+can be compiled to PTX text using the LLVM PTX backend. NVVM has an associated
+dialect in MLIR that translates 1:1 to NVVM intrinsics. This is what we are
+using to generate the PTX kernel code.
+
+### IREE flow
+
+IREE's target-independent codegen converts the compiler input to Linalg on
+Tensors. Afterward IREE will call the LinalgToLLVMGPU codegen passes.
+
+Once we get into the LinalgToLLVMGPU passes we first bufferize to generate
+Linalg on Buffers. Then we apply generic MLIR passes to convert Linalg to the
+SCF dialect and then SCF to the Standard dialect. After that we convert the
+Standard dialect to the LLVM+NVVM dialects.
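+
+When debugging this pipeline it can help to print the IR after each pass via
+the generic MLIR flags that `iree-compile` exposes (a sketch; the file paths
+match the example below and the IR dump goes to stderr):
+
+```shell
+../iree-build/tools/iree-compile \
+    --iree-hal-target-backends=cuda \
+    --mlir-print-ir-after-all \
+    /tmp/add.mlir -o /tmp/add.vmfb 2> /tmp/add-ir-dump.txt
+```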
+
+## Example
+
+Save the following MLIR in `/tmp/add.mlir`:
+
+```mlir
+func.func @add(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
+ %0 = tensor.empty() : tensor<4xf32>
+ %1 = linalg.generic {
+ indexing_maps = [
+ affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]}
+ ins(%arg0, %arg1 : tensor<4xf32>, tensor<4xf32>)
+ outs(%0 : tensor<4xf32>) {
+ ^bb0(%in: f32, %in_0: f32, %out: f32):
+ %2 = arith.addf %in, %in_0 : f32
+ linalg.yield %2 : f32
+ } -> tensor<4xf32>
+ return %1 : tensor<4xf32>
+}
+```
+
+```shell
+# First compile into a VM bytecode module.
+$ ../iree-build/tools/iree-compile \
+ --iree-hal-target-backends=cuda \
+ /tmp/add.mlir \
+ -o /tmp/add.vmfb
+
+# Run the module through CUDA HAL backend.
+$ ../iree-build/tools/iree-run-module \
+ --device=cuda \
+ --module=/tmp/add.vmfb \
+ --function=add \
+ --input="4xf32=[1 2 3 4]" \
+ --input="4xf32=[2 2 2 2]"
+
+EXEC @add
+4xf32=3 4 5 6
+```
+
+[iree-cuda]: https://github.com/openxla/iree/tree/main/iree/hal/drivers/cuda/
+[cuda-symbols]: https://github.com/openxla/iree/blob/main/iree/hal/drivers/cuda/dynamic_symbols_tables.h
+[cuda-driver]: https://docs.nvidia.com/cuda/cuda-driver-api/index.html
+[cuda-graph]: https://developer.nvidia.com/blog/cuda-graphs/
+[vulkan-semaphore]: https://www.khronos.org/blog/vulkan-timeline-semaphores
+[semaphore-issue]: https://github.com/openxla/iree/issues/4727
diff --git a/docs/developers/design_roadmap.md b/docs/website/docs/developers/design-docs/design-roadmap.md
similarity index 75%
rename from docs/developers/design_roadmap.md
rename to docs/website/docs/developers/design-docs/design-roadmap.md
index 7220860..7b37354 100644
--- a/docs/developers/design_roadmap.md
+++ b/docs/website/docs/developers/design-docs/design-roadmap.md
@@ -1,8 +1,4 @@
-# IREE Design Roadmap
-
-<a id="markdown-IREE%20Design%20Roadmap" name="IREE%20Design%20Roadmap"></a>
-
-<!-- WARNING: DO NOT EDIT THIS FILE IN AN EDITOR WITH AUTO FORMATTING -->
+# Design roadmap
A not-so-concise walkthrough of various IREE features that are in the design
process and planned for future versions. A lot of the questions around how the
@@ -14,167 +10,12 @@
[iree-discuss](https://groups.google.com/forum/#!forum/iree-discuss) mailing
list.
-<!-- TOC -->
-
-- [IREE Design Roadmap](#iree-design-roadmap)
- - [Input Dialects](#input-dialects)
- - [Future MLIR XLA HLO Replacement](#future-mlir-xla-hlo-replacement)
- - [`linalg`: High-level Hierarchical Optimization](#linalg-high-level-hierarchical-optimization)
- - [XLA HLO: Canonicalizations](#xla-hlo-canonicalizations)
- - [XLA HLO: Tensor to Primitive Conversion](#xla-hlo-tensor-to-primitive-conversion)
- - [Quantization](#quantization)
- - [`flow`: Data- and Execution-Flow Modeling](#flow-data--and-execution-flow-modeling)
- - [Avoiding Readbacks with `flow.stream`](#avoiding-readbacks-with-flowstream)
- - [Threading `flow.stream` through the CFG](#threading-flowstream-through-the-cfg)
- - [Predication of `flow.dispatch`](#predication-of-flowdispatch)
- - [Deduping `flow.executable`s](#deduping-flowexecutables)
- - [Rematerializing CSE'd Expressions](#rematerializing-csed-expressions)
- - [Device Placement](#device-placement)
- - [`hal`: Hardware Abstraction Layer and Multi-Architecture Executables](#hal-hardware-abstraction-layer-and-multi-architecture-executables)
- - [Allow Targets to Specify `hal.interface`s](#allow-targets-to-specify-halinterfaces)
- - [Target-specific Scheduling Specialization](#target-specific-scheduling-specialization)
- - [Buffer Usage Tracking](#buffer-usage-tracking)
- - [Batched Executable Caching and Precompilation](#batched-executable-caching-and-precompilation)
- - [Target-aware Executable Compression](#target-aware-executable-compression)
- - [Target-aware Constant Compression](#target-aware-constant-compression)
- - [Command Buffer Stateful Deduplication](#command-buffer-stateful-deduplication)
- - [Resource Timeline](#resource-timeline)
- - [Transient Tensor Ringbuffer](#transient-tensor-ringbuffer)
- - [Timeline Semaphores on the Module ABI](#timeline-semaphores-on-the-module-abi)
- - [GPU-like CPU Scheduling](#gpu-like-cpu-scheduling)
- - [`vm`: Lightweight Virtual Machine](#vm-lightweight-virtual-machine)
- - [Coroutines for Batching and Cooperative Scheduling](#coroutines-for-batching-and-cooperative-scheduling)
- - [Cellular Batching](#cellular-batching)
- - [Lowering to LLVM IR](#lowering-to-llvm-ir)
- - [Improved Type Support](#improved-type-support)
- - [Indirect Command Buffer/On-Accelerator Execution](#indirect-command-bufferon-accelerator-execution)
-
-<!-- /TOC -->
+[TOC]
## Input Dialects
-<a id="markdown-Input%20Dialects" name="Input%20Dialects"></a>
-
-### Future MLIR XLA HLO Replacement
-
-<a id="markdown-Future%20MLIR%20XLA%20HLO%20Replacement" name="Future%20MLIR%20XLA%20HLO%20Replacement"></a>
-
-IREE's current input dialect is the XLA HLO dialect representing operations on
-tensors. This was a pragmatic decision based on having HLO already defined and
-proof of existing models being lowered to it from Tensorflow, allowing us to
-focus on the IREE-specific portions of work. Unfortunately, HLO is tied to
-Tensorflow and has many quirks that would not otherwise have been designed had
-that not been the case. There are discussions happening about an upstream MLIR
-[Tensor Compute Primitives](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/)
-dialect that HLO can be lowered into, allowing IREE (and other backends) to
-decouple themselves from XLA and be easier to target from frontends.
-
-### `linalg`: High-level Hierarchical Optimization
-
-<a id="markdown-%60linalg%60%3A%20High-level%20Hierarchical%20Optimization" name="%60linalg%60%3A%20High-level%20Hierarchical%20Optimization"></a>
-
-It's required that IREE inputs are all in tensor form (and not in-place memref
-updates) in order to perform a large majority of the `flow` transformations.
-Recent work in the [Linalg](https://mlir.llvm.org/docs/Dialects/Linalg/) dialect
-is adding support for operating on value-semantic tensors, meaning that we can
-first apply `mhlo` to `linalg` lowerings and any of the transformations
-available in Linalg prior to performing our own `flow` lowerings. The advantage
-is that Linalg will have much stronger and principled code motion and nested
-loop transformation optimizations than is possible on higher-level ops. As not
-all operations can be represented as `linalg` ops IREE will be able to ingest a
-mix of `linalg`, `std`, and `mhlo` (or its replacement) ops.
-
-### XLA HLO: Canonicalizations
-
-<a id="markdown-XLA%20HLO%3A%20Canonicalizations" name="XLA%20HLO%3A%20Canonicalizations"></a>
-
-Very little effort has been applied to `mhlo` optimizations and there are a
-significant number of missing folders, canonicalizers, and simple
-transformations. Many of these happen in legacy XLA C++ backends; however we
-need them in MLIR so that we can make use of dynamic shapes, mixed dialect
-inputs, etc. The `tf2xla` bridge work (converting Tensorflow models into the
-corresponding `mhlo` ops) is nearing its initial milestones and afterward we
-expect more of these missing pieces to be filled in.
-
-Examples of the optimizations that will greatly benefit IREE (and any other
-backend consuming `mhlo`) include:
-
-- Eliding unneeded transpose, reshape, and broadcast operations.
-- Inserting transpose, reshape, and broadcast operations to allow for more
- optimal memory access patterns (such as transposing gather input to allow
- for memcpy-like transfers instead of column-wise cache-unfriendly accesses).
-- Moving operations above broadcasts such that the smallest amount of work is
- performed.
-
-### XLA HLO: Tensor to Primitive Conversion
-
-<a id="markdown-XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion" name="XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion"></a>
-
-HLO only operates on tensor values - even for simple scalars - and this presents
-a problem when attempting to determine which code should be specified to run on
-accelerators vs. what should run on the host. The canonical example is
-`mhlo.while`, which as seen in the example below uses scalar tensors for its
-loop iteration counter and comparison.
-
-```mlir
-%start = arith.constant dense<1> : tensor<i32>
-%bound = arith.constant dense<3> : tensor<i32>
-%res = "mhlo.while"(%start) ( {
-^bb0(%count: tensor<i32>):
- %1 = "mhlo.compare"(%count, %bound) {comparison_direction = "LT"} : (tensor<i32>, tensor<i32>) -> tensor<i1>
- "mhlo.return"(%1) : (tensor<i1>) -> ()
-}, {
-^bb0(%count: tensor<i32>):
- %1 = mhlo.add %count, %count : tensor<i32>
- "mhlo.return"(%1) : (tensor<i32>) -> ()
-}) : (tensor<i32>) -> tensor<i32>
-```
-
-A naïve but correct lowering (what's currently in IREE) would perform the
-comparison and increment on the device and insert a host readback to see if the
-loop should continue:
-
-```mlir
-func @main() -> tensor<i32> attributes {iree.reflection = {f = "I1!R6!B3!t6", fv = "1"}} {
- %cst = arith.constant dense<1> : tensor<i32>
- %cst_0 = arith.constant dense<3> : tensor<i32>
- %cst_1 = arith.constant dense<1> : vector<3xi32>
- br ^bb1(%cst : tensor<i32>)
-^bb1(%2: tensor<i32>): // 2 preds: ^bb0, ^bb2
- %3 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %2 : tensor<i32>, %arg2 = %cst_0 : tensor<i32>) -> tensor<i1> {
- %8 = flow.dispatch @main_ex_dispatch_0::@main_ex_dispatch_0[%arg0 : vector<3xi32>](%arg1, %arg2) : (tensor<i32>, tensor<i32>) -> tensor<i1>
- flow.return %8 : tensor<i1>
- }
- %4 = flow.tensor.load %3 : tensor<i1>
- cond_br %4, ^bb2(%2 : tensor<i32>), ^bb3(%2 : tensor<i32>)
-^bb2(%5: tensor<i32>): // pred: ^bb1
- %6 = flow.ex.stream.fragment(%arg0 = %cst_1 : vector<3xi32>, %arg1 = %5 : tensor<i32>) -> tensor<i32> {
- %8 = flow.dispatch @main_ex_dispatch_1::@main_ex_dispatch_1[%arg0 : vector<3xi32>](%arg1) : (tensor<i32>) -> tensor<i32>
- flow.return %8 : tensor<i32>
- }
- br ^bb1(%6 : tensor<i32>)
-^bb3(%7: tensor<i32>): // pred: ^bb1
- return %7 : tensor<i32>
-}
-```
-
-Of note is the `flow.tensor.load` op indicating a host readback. Though this
-correctly executes the loop it is extremely inefficient. What's desired is for
-the loop iterator and condition to all happen on the host, with the iterator
-being passed to the loop body as an argument that can be encoded into a command
-buffer in future lowering stages. This eliminates host readback and allows for
-much larger `flow.stream` sequences, feeding more into the pipeline for the
-accelerator.
-
-Not all source frontends have this issue (misrepresenting simple host
-computation as non-dense tensor operations), and our goal is to add a
-transformation that heuristically converts `mhlo` ops acting on small tensors to
-`std` ops acting on primitive values (`i32`, `index`, etc).
-
### Quantization
-<a id="markdown-Quantization" name="Quantization"></a>
-
It's assumed that any work related to quantization/compression has happened
prior to lowering into IREE dialects. Our plan is to use the proposed
[Quantization Transforms](https://llvm.discourse.group/t/rfc-a-proposal-for-implementing-quantization-transformations-in-mlir/655)
@@ -191,8 +32,6 @@
## `flow`: Data- and Execution-Flow Modeling
-<a id="markdown-%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling" name="%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling"></a>
-
The `flow` dialect is designed to allow us to extract as much concurrency as
possible from a program and partition IR into the scheduling and execution
domains. Today we have the IR structure and transformation flow in place but
@@ -204,8 +43,6 @@
### Avoiding Readbacks with `flow.stream`
-<a id="markdown-Avoiding%20Readbacks%20with%20%60flow.stream%60" name="Avoiding%20Readbacks%20with%20%60flow.stream%60"></a>
-
A majority of the readbacks we have today (manifested as `flow.tensor.load.*`
ops) will be removed when we have an
[HLO tensor->primitive conversion](#xla-hlo-tensor-to-primitive-conversion).
@@ -215,7 +52,7 @@
perform suboptimally but encouraging authors to adjust their input model to
enable better behavior. The IREE VM also has specific support for hiding
readback latency in an efficient way via
-[coroutines](coroutines-for-batching-and-cooperative-scheduling).
+[coroutines](#coroutines-for-batching-and-cooperative-scheduling).
The most common case we are currently seeing in the IR is that of dynamic copies
where the offsets are dependent on the result of previous computations. Source
@@ -243,8 +80,6 @@
### Threading `flow.stream` through the CFG
-<a id="markdown-Threading%20%60flow.stream%60%20through%20the%20CFG" name="Threading%20%60flow.stream%60%20through%20the%20CFG"></a>
-
The current `flow.ex.stream.fragment`, as denoted by the `ex`perimental tag, is
a temporary implementation designed to get the concept of streams lowered to the
HAL dialect. For streams to be effective at modeling larger concurrency scopes
@@ -290,8 +125,6 @@
### Predication of `flow.dispatch`
-<a id="markdown-Predication%20of%20%60flow.dispatch%60" name="Predication%20of%20%60flow.dispatch%60"></a>
-
While the
[`flow.stream` threading through the CFG](#threading-flowstream-through-the-cfg)
can remove many of the simpler conditional dispatches there will always be some
@@ -324,8 +157,6 @@
### Deduping `flow.executable`s
-<a id="markdown-Deduping%20%60flow.executable%60s" name="Deduping%20%60flow.executable%60s"></a>
-
While still in the `flow` dialect, the executables are target-agnostic. This
makes simple IR tree diffing a potential solution to deduplication. Since most
of the dispatches originate from the same source-language library calls in input
@@ -337,8 +168,6 @@
### Rematerializing CSE'd Expressions
-<a id="markdown-Rematerializing%20CSE'd%20Expressions" name="Rematerializing%20CSE'd%20Expressions"></a>
-
Common subexpression elimination is performed many times during lowering,
however there comes a point where the CSE can introduce false dependencies and
additional allocations that are otherwise avoidable. For example if a
@@ -381,8 +210,6 @@
### Device Placement
-<a id="markdown-Device%20Placement" name="Device%20Placement"></a>
-
While still within the `flow` dialect we have the ability to easily split
streams and safely shuffle around operations. Target execution backends can opt
into such behavior to ensure that device restrictions such as maximum in-flight
@@ -395,8 +222,6 @@
## `hal`: Hardware Abstraction Layer and Multi-Architecture Executables
-<a id="markdown-%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables" name="%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables"></a>
-
As the IREE HAL is designed almost 1:1 with a compute-only Vulkan API many of
the techniques classically used in real-time graphics apply. The benefit we have
by modeling our usage of such a low-level API in IR is that the normal work -
@@ -406,8 +231,6 @@
### Allow Targets to Specify `hal.interface`s
-<a id="markdown-Allow%20Targets%20to%20Specify%20%60hal.interface%60s" name="Allow%20Targets%20to%20Specify%20%60hal.interface%60s"></a>
-
The `hal.interface` op specifies the ABI between the scheduler and the device
containing the buffer bindings and additional non-buffer data (parameters,
shapes, specialization flags, etc). Today a naïve ordering is used uniformly for
@@ -428,8 +251,6 @@
### Target-specific Scheduling Specialization
-<a id="markdown-Target-specific%20Scheduling%20Specialization" name="Target-specific%20Scheduling%20Specialization"></a>
-
Though the `flow` dialect attempts to fuse as many ops as possible into dispatch
regions, it's not always possible for all target backends to schedule a region
as a single dispatch. A classic example is algorithms like
@@ -459,8 +280,6 @@
### Buffer Usage Tracking
-<a id="markdown-Buffer%20Usage%20Tracking" name="Buffer%20Usage%20Tracking"></a>
-
Many explicit hardware APIs require knowing how buffers are used alongside with
where they should be located. For example this additional information determines
caching policy on buffer accesses (write-through, write-back, etc), visibility
@@ -491,8 +310,6 @@
### Batched Executable Caching and Precompilation
-<a id="markdown-Batched%20Executable%20Caching%20and%20Precompilation" name="Batched%20Executable%20Caching%20and%20Precompilation"></a>
-
For targets that may require runtime preprocessing of their executables prior to
dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch
compilation mechanism based on Vulkan's
@@ -525,8 +342,6 @@
### Target-aware Executable Compression
-<a id="markdown-Target-aware%20Executable%20Compression" name="Target-aware%20Executable%20Compression"></a>
-
An advantage of representing executable binaries in IR after translation is that
we can apply various post-compilation compression and minification techniques
while still know precisely where the executable will be used. This is extremely
@@ -550,8 +365,6 @@
### Target-aware Constant Compression
-<a id="markdown-Target-aware%20Constant%20Compression" name="Target-aware%20Constant%20Compression"></a>
-
It's still an area that needs more research but one goal of the IREE design was
to enable efficient target- and context-aware compression of large constants
(typically model weights/parameters/embeddings). This may mean reusing existing
@@ -566,8 +379,6 @@
### Command Buffer Stateful Deduplication
-<a id="markdown-Command%20Buffer%20Stateful%20Deduplication" name="Command%20Buffer%20Stateful%20Deduplication"></a>
-
The IREE HAL - much like Vulkan it is based on - eschews much of the state that
traditional APIs have in favor of (mostly) immutable state objects (pipeline
layouts, pipeline states, descriptor sets, etc). There are still a few stateful
@@ -582,8 +393,6 @@
### Resource Timeline
-<a id="markdown-Resource%20Timeline" name="Resource%20Timeline"></a>
-
A core concept of the IREE scheduler that allows for overlapping in-flight
invocations is that of the resource timeline. This identifies module state that
can be in use by multiple invocations and assigns timeline milestones denoting
@@ -613,14 +422,12 @@
execution is appropriately synchronized. This is where indirect dispatch,
[predication](#predication-of-flowdispatch),
[indirect command buffers](#indirect-command-bufferon-accelerator-execution),
-and [VM coroutines](coroutines-for-batching-and-cooperative-scheduling) can all
+and [VM coroutines](#coroutines-for-batching-and-cooperative-scheduling) can all
help cover for the times where we are unable to transform away the indirection
or emit shape logic without data dependencies.
### Transient Tensor Ringbuffer
-<a id="markdown-Transient%20Tensor%20Ringbuffer" name="Transient%20Tensor%20Ringbuffer"></a>
-
(When properly implemented) almost all buffers required during execution never
escape the command buffers they are used in or a single VM invocation. We can
trivially identify this from the explicit captures of `flow.stream` and
@@ -645,9 +452,9 @@
the allocations are performed. Since almost all usage involves simple write head
bumps there is no need for ahead-of-time memory planning or large fixed
allocations, and since no buffer within the ringbuffer can alias we can have
-coarse (*read: low overhead*) guarantees about the availability of certain
-regions of the ringbuffer (*"when this event is signaled all prior ringbuffer
-writes have completed"*).
+coarse (_read: low overhead_) guarantees about the availability of certain
+regions of the ringbuffer (_"when this event is signaled all prior ringbuffer
+writes have completed"_).
Usually any planning we may want to perform can be done in IR via code motion.
For example applying traditional algorithms used to reduce register pressure
@@ -662,8 +469,6 @@
### Timeline Semaphores on the Module ABI
-<a id="markdown-Timeline%20Semaphores%20on%20the%20Module%20ABI" name="Timeline%20Semaphores%20on%20the%20Module%20ABI"></a>
-
Functions calls made across modules (either from C++ into the VM, VM->VM, or
VM->C++) should be able to define timeline semaphores used to wait and signal on
the call. We can do this by making all exports automatically have the semaphores
@@ -683,8 +488,6 @@
### GPU-like CPU Scheduling
-<a id="markdown-GPU-like%20CPU%20Scheduling" name="GPU-like%20CPU%20Scheduling"></a>
-
One approach to using multiple cores on a CPU is to perform interior
parallelization of operations such as OpenMP or library-call-based custom thread
pools (gemmlowp). This works when each individual operation is relatively costly
@@ -729,8 +532,6 @@
## `vm`: Lightweight Virtual Machine
-<a id="markdown-%60vm%60%3A%20Lightweight%20Virtual%20Machine" name="%60vm%60%3A%20Lightweight%20Virtual%20Machine"></a>
-
The VM is designed as a dynamic linkage ABI, stable bytecode representation, and
intermediate lowering IR. Many of the optimizations we can perform on it will
benefit all use cases (such as when lowering to LLVM IR) by allowing
@@ -739,8 +540,6 @@
### Coroutines for Batching and Cooperative Scheduling
-<a id="markdown-Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling" name="Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling"></a>
-
One of the largest features currently missing from the VM is coroutines (aka
user-mode fiber scheduling). Coroutines are what will allow us to have multiple
in-flight invocations into a module - some of which may be waiting on external
@@ -801,8 +600,6 @@
#### Cellular Batching
-<a id="markdown-Cellular%20Batching" name="Cellular%20Batching"></a>
-
Though coroutines help throughput there is a way we've found to reduce latency
that's been documented as
[cellular batching](http://madsys.cs.tsinghua.edu.cn/publications/EUROSYS2018-gao.pdf).
@@ -852,8 +649,6 @@
### Lowering to LLVM IR
-<a id="markdown-Lowering%20to%20LLVM%20IR" name="Lowering%20to%20LLVM%20IR"></a>
-
For scenarios where dynamic module loading is not required and entire modules
can be compiled into applications we can lower the VM IR to LLVM IR within
MLIR's transformation pipeline. Instead of embedding `vm.call` ops that are
@@ -876,8 +671,6 @@
### Improved Type Support
-<a id="markdown-Improved%20Type%20Support" name="Improved%20Type%20Support"></a>
-
Currently the VM only supports two types: `i32` and `vm.ref<T>`. This is an
intentional limitation such that we can determine what is really needed to
express the scheduling we perform, with the idea being that such a limited model
@@ -894,8 +687,6 @@
### Indirect Command Buffer/On-Accelerator Execution
-<a id="markdown-Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution" name="Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution"></a>
-
Though IREE will use many different tricks such as
[predication](#predication-of-flowdispatch) to build deep pipelines there is
still the requirement that the command recording and submission happens on the
diff --git a/docs/developers/design_docs/function_abi.md b/docs/website/docs/developers/design-docs/function-abi.md
similarity index 71%
rename from docs/developers/design_docs/function_abi.md
rename to docs/website/docs/developers/design-docs/function-abi.md
index baa1154..1148dfa 100644
--- a/docs/developers/design_docs/function_abi.md
+++ b/docs/website/docs/developers/design-docs/function-abi.md
@@ -1,4 +1,10 @@
-# Function Signatures
+# Function ABI
+
+!!! note
+
+ Authored December, 2019
+
+ Updated August, 2021
A key job of the IREE compiler and runtime is capturing function call semantics
from the originating system and providing mechanisms so that invocations can be
@@ -6,7 +12,7 @@
this requires additional metadata on top of the raw characteristics of a
function. Where possible, this is done by attaching attributes to a function.
-- `iree.abi` : JSON encoded description of the function's calling convention.
+- `iree.abi` : JSON encoded description of the function's calling convention.
## V1 ABI
@@ -19,24 +25,24 @@
### Value Types:
-- Byte aligned integer type (i8, i16, i32, i64)
-- Floating point value (f16, f32, f64)
+- Byte aligned integer type (i8, i16, i32, i64)
+- Floating point value (f16, f32, f64)
### Reference Types:
-- ND-Array buffers of Value Types:
+- ND-Array buffers of Value Types:
- - Simple: Packed, C-layout
- - Strided: Arbitrary layout with strides (future)
+ - Simple: Packed, C-layout
+ - Strided: Arbitrary layout with strides (future)
-- String (byte arrays)
+- String (byte arrays)
-- Opaque reference object
+- Opaque reference object
### Sequence Types:
-- Tuples: fixed length lists where each position has its own type bound
-- Homogenous list: lists of arbitrary size where a single type bound applies
+- Tuples: fixed length lists where each position has its own type bound
+- Homogenous list: lists of arbitrary size where a single type bound applies
to all elements
The intent with these low level types is that calling conventions can be
@@ -47,39 +53,39 @@
The above are all representable with native constructs in the VM:
-- ValueType:
+- ValueType:
- - Runtime:
+ - Runtime:
[`iree_vm_value`](https://github.com/openxla/iree/blob/main/iree/vm/value.h)
- - Compile Time: primitive MLIR integer/floating point types
+ - Compile Time: primitive MLIR integer/floating point types
-- Simple ND-Array Buffer:
+- Simple ND-Array Buffer:
- - Runtime:
+ - Runtime:
[`iree_hal_buffer_view`](https://github.com/openxla/iree/blob/main/iree/hal/buffer_view.h)
- - Compile Time: `tensor<>`
+ - Compile Time: `tensor<>`
-- String:
+- String:
- - Runtime:
+ - Runtime:
[`iree_vm_list`](https://github.com/openxla/iree/blob/main/iree/vm/list.h)
containing `i8`
- - Compile Time: `!util.list<i8>`
+ - Compile Time: `!util.list<i8>`
-- Tuple:
+- Tuple:
- - Runtime:
+ - Runtime:
[`iree_vm_list`](https://github.com/openxla/iree/blob/main/iree/vm/list.h)
of variant
- - Compile Time: `!util.list<?>`
- - Note that these are statically type erased at the boundary.
+ - Compile Time: `!util.list<?>`
+ - Note that these are statically type erased at the boundary.
-- TypedList (homogenous):
+- TypedList (homogenous):
- - Runtime:
+ - Runtime:
[`iree_vm_list`](https://github.com/openxla/iree/blob/main/iree/vm/list.h)
of `T`
- - Compile Time: `!util.list<T>`
+ - Compile Time: `!util.list<T>`
### Extended Type Calling Conventions
@@ -109,9 +115,9 @@
elements of the tuple are the natural order of the structure, where that is
either:
-- For a C-like system where order is determinate, it is the order of
+- For a C-like system where order is determinate, it is the order of
declaration.
-- For a name-based system (i.e. bind to `dict`) where no order is defined, the
+- For a name-based system (i.e. bind to `dict`) where no order is defined, the
natural order will be the lexically sorted order of the keys.
#### String
@@ -146,42 +152,42 @@
The JSON object contains:
-- `a` (array): List of type records for each argument.
-- `r` (array): List of type records for each argument.
+- `a` (array): List of type records for each argument.
+- `r` (array): List of type records for each result.
Type records are one of:
-- A string naming a primitive type:
+- A string naming a primitive type:
- - `i[0-9]+`: Integer type with given bit width
- - `f[0-9]+`: IEEE floating point type with given bit width
- - `bf16`: BFloat16
+ - `i[0-9]+`: Integer type with given bit width
+ - `f[0-9]+`: IEEE floating point type with given bit width
+ - `bf16`: BFloat16
-- JSON `null`: A null reference value
+- JSON `null`: A null reference value
-- `"unknown"`: An unknown/unmapped type
+- `"unknown"`: An unknown/unmapped type
-- An array, interpreted as a tuple describing a compound type.
+- An array, interpreted as a tuple describing a compound type.
##### Compound type tuples
A compound type tuple has a type identifier as its first element, followed with
type specific fields:
-- `["named", "key", {slot_type}]`: Associates a name with a slot. This is
+- `["named", "key", {slot_type}]`: Associates a name with a slot. This is
used with the root argument list to denote named arguments that can be
passed positionally or by keyword.
-- `["ndarray", {element_type}, {rank}, {dim...}]`: For unknown rank, the
+- `["ndarray", {element_type}, {rank}, {dim...}]`: For unknown rank, the
`rank` will be `null` and there will be no dims. Any unknown dim will be
`null`.
-- `["slist", {slot_type...}]`: An anonymous structured list of fixed arity and
+- `["slist", {slot_type...}]`: An anonymous structured list of fixed arity and
slot specific types. If there are gaps in the list, empty slots will have a
`null` type.
-- `["stuple", {slot_type...}]`: Same as `slist` but some languages
+- `["stuple", {slot_type...}]`: Same as `slist` but some languages
differentiate between sequences represented as lists and those represented
as tuples (read-only lists).
-- `["sdict", ["key", {slot_type}]...]`: An anonymous structure with named
+- `["sdict", ["key", {slot_type}]...]`: An anonymous structure with named
slots. Note that when passing these types, the keys are not passed to the
function (only the slot values).
-- `["py_homogeneous_list", {element_type}]`: A Python list of unknown size
+- `["py_homogeneous_list", {element_type}]`: A Python list of unknown size
with elements sharing a common type bound given by `element_type`.
diff --git a/docs/developers/design_docs/execution_model.md b/docs/website/docs/developers/design-docs/invocation-execution-model.md
similarity index 95%
rename from docs/developers/design_docs/execution_model.md
rename to docs/website/docs/developers/design-docs/invocation-execution-model.md
index aa67e33..c5d31ef 100644
--- a/docs/developers/design_docs/execution_model.md
+++ b/docs/website/docs/developers/design-docs/invocation-execution-model.md
@@ -1,4 +1,6 @@
-# IREE Invocation Execution Model
+# Invocation execution model
+
+!!! note "Authored June, 2022"
This documents the behavior of the user-visible invocation mechanism IREE uses
to schedule program execution. Internally IREE uses a very similar modeling for
@@ -9,7 +11,9 @@
executes it. Though here we focus on the user-visible portion of execution
there is really only one "IREE execution model" and the entire stack follows
the same design. At its core this design is just an instantiation of an
-[out-of-order execution](https://en.wikipedia.org/wiki/Out-of-order_execution) algorithm such as those [originating from the 1960's](https://en.wikipedia.org/wiki/Tomasulo_algorithm).
+[out-of-order execution](https://en.wikipedia.org/wiki/Out-of-order_execution)
+algorithm such as those
+[originating from the 1960's](https://en.wikipedia.org/wiki/Tomasulo_algorithm).
## Glossary
@@ -54,7 +58,8 @@
_Modules_ define executable code and data that can be loaded, linked, and run à
la [ELF shared libraries](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format).
Modules may be implemented as C/C++, generated bytecode or C sources from the
-IREE compiler, or any other mechanism that can run code and implement the [`iree_vm_module_t` interface](https://github.com/google/iree/blob/0e8d8370699912c6b51889e8f7e967690102402c/runtime/src/iree/vm/module.h#L335-L437).
+IREE compiler, or any other mechanism that can run code and implement the
+[`iree_vm_module_t` interface](https://github.com/google/iree/blob/0e8d8370699912c6b51889e8f7e967690102402c/runtime/src/iree/vm/module.h#L335-L437).
Modules on their own are read-only and can be reused across many contexts.
Traditional ML runtimes would use a model (graph, etc) as their module
@@ -170,6 +175,7 @@
```
To the user this would appear as:
+
```mermaid
sequenceDiagram
User->>@some_func: invoke
@@ -324,8 +330,12 @@
buffers like large constants/read-only data or user-managed ringbuffers.
Stream-ordered allocations are also supported to allow for pooled buffer
reservations that can be allocated in a scheduled order alongside program
-execution. For more detailed examples see the CUDA blog posts describing their
-implementation: [part 1](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/), [part 2](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/).
+execution.
+
+For more detailed examples see the CUDA blog posts describing their
+implementation:
+[part 1](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/),
+[part 2](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/).
With stream-ordered allocations each allocation and deallocation operation is
scheduled with wait and signal fences just as with invocations. This allows
@@ -400,7 +410,8 @@
### Internal State
A stateful program may contain internal timelines that it uses to order its own
-execution. Take for example [this simple stateful program](https://github.com/google/iree-jax/blob/main/tests/program/dynamic_state.py):
+execution. Take for example
+[this simple stateful program](https://github.com/google/iree-jax/blob/main/tests/program/dynamic_state.py):
```python
class TrivialKernel(Program):
diff --git a/docs/developers/assets/ci_enabled_jobs.png b/docs/website/docs/developers/general/contributing-ci-enabled-jobs.png
similarity index 100%
rename from docs/developers/assets/ci_enabled_jobs.png
rename to docs/website/docs/developers/general/contributing-ci-enabled-jobs.png
Binary files differ
diff --git a/docs/developers/assets/ci-extra.png b/docs/website/docs/developers/general/contributing-ci-extra.png
similarity index 100%
rename from docs/developers/assets/ci-extra.png
rename to docs/website/docs/developers/general/contributing-ci-extra.png
Binary files differ
diff --git a/docs/website/docs/developers/general/contributing.md b/docs/website/docs/developers/general/contributing.md
new file mode 100644
index 0000000..5041630
--- /dev/null
+++ b/docs/website/docs/developers/general/contributing.md
@@ -0,0 +1,262 @@
+# Contributing
+
+This is a more detailed version of the top-level
+[CONTRIBUTING.md](https://github.com/openxla/iree/blob/main/CONTRIBUTING.md)
+file. We keep it separate so that everyone creating a PR doesn't get a pop-up
+each time this document changes.
+
+<!-- TODO(scotttodd): Update this document
+ * pull more text into this, update that to point to the website
+ * document access controls (join organization then team)
+ * document revert policy
+ * document where new community members should start
+-->
+
+## Build systems
+
+IREE supports building from source with both Bazel and CMake.
+
+* CMake is the preferred build system and offers the most flexible
+ configuration options
+* Bazel is a stricter build system and helps with usage in Google's downstream
+ source repository
+* Certain dependencies (think large/complex projects like CUDA, TensorFlow,
+ PyTorch, etc.) may be difficult to support with one build system or the
+ other, so the project may configure these as optional
+
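+For reference, a minimal CMake configure-and-build sketch (the flags here are
+illustrative; see the
+[CMake options and variables](../building/cmake-options-and-variables.md) page
+for the full set of options):
+
+```shell
+# Configure with Ninja into a sibling build directory, then build everything.
+cmake -G Ninja -B ../iree-build/ -S . -DCMAKE_BUILD_TYPE=RelWithDebInfo
+cmake --build ../iree-build/
+```
+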
+## Continuous integration (CI)
+
+IREE uses [GitHub Actions](https://docs.github.com/en/actions) for CI. The
+primary CI is configured in the
+[ci.yml workflow file](https://github.com/openxla/iree/blob/main/.github/workflows/ci.yml).
+
+### Self-hosted runners
+
+In addition to the default runners GitHub provides, IREE uses
+[self-hosted runners](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners)
+to run many of its workflow jobs. These enable access to additional compute and
+custom configurations such as accelerators. Configuration scripting is checked
+in to this repository (see the
+[README for that directory](https://github.com/openxla/iree/blob/main/build_tools/github_actions/runner/README.md)).
+
+### Custom managed runners
+
+In addition to our self-hosted runners, we use GitHub's
+[large managed runners](https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners)
+for some platforms that are more trouble to configure ourselves (e.g. Mac).
+
+### CI behavior manipulation
+
+The setup step of the CI determines which CI jobs to run. This is controlled by
+the
+[configure_ci.py](https://github.com/openxla/iree/blob/main/build_tools/github_actions/configure_ci.py)
+script. It will generally run a pre-determined set of jobs on presubmit with
+some jobs kept as post-submit only. If changes are only to a certain set of
+excluded files that we know don't affect CI (e.g. docs), then it will skip the
+jobs. You can customize which jobs run using
+[git trailers](https://git-scm.com/docs/git-interpret-trailers) in the PR
+description. The available options are:
+
+``` text
+ci-skip: jobs,to,skip
+ci-extra: extra,jobs,to,run
+ci-exactly: exact,set,of,jobs,to,run
+skip-ci: free form reason
+skip-llvm-integrate-benchmark: free form reason
+benchmark-extra: extra,benchmarks,to,run
+runner-env: [testing|prod]
+```
+
+The first three follow the same format and instruct the setup script on which
+jobs to include or exclude from its run. They take a comma-separated list of
+jobs, which must be top-level job identifiers from the ci.yml file or
+the special keyword "all" to indicate all jobs. `ci-skip` removes jobs that
+would otherwise be included, though it is not an error to list jobs that would
+not be included by default. `ci-extra` adds additional jobs that would not have
+otherwise been run, though it is not an error to list jobs that would have been
+included anyway. It *is* an error to list a job in both of these fields.
+`ci-exactly` provides an exact list of jobs that should run. It is mutually
+exclusive with both `ci-skip` and `ci-extra`. In all these cases, the setup does
+not make any effort to ensure that job dependencies are satisfied. Thus, if you
+request skipping the `build_all` job, all the jobs that depend on it will fail,
+not be skipped. `skip-ci` is an older option that simply skips all jobs. Its
+usage is deprecated and it is mutually exclusive with all of the other `ci-*`
+options. Prefer `ci-skip: all`.
+
+Benchmarks don't run by default on PRs, and must be specifically requested. They
+*do* run by default on PRs detected to be an integration of LLVM into IREE, but
+this behavior can be disabled with `skip-llvm-integrate-benchmark`. The
+`benchmark-extra` option allows specifying additional benchmark presets to run
+as part of benchmarking. It accepts a comma-separated list of benchmark presets.
+This combines with labels added to the PR (which are a more limited set of
+options). See the
+[benchmark suites documentation](../performance/benchmark-suites.md).
+
+The `runner-env` option controls which runner environment to target for our
+self-hosted runners. We maintain a test environment to allow testing out new
+configurations prior to rolling them out. This trailer is for advanced users who
+are working on the CI infrastructure itself.
+
+#### CI configuration recipes
+
+Copy/paste any of these at the bottom of a PR description to change what the CI
+runs.
+
+* Also run Windows and macOS builds that are normally post-merge only:
+
+ ``` text
+ ci-extra: build_test_all_windows,build_test_all_macos_arm64,build_test_all_macos_x86_64
+ ```
+
+* Also run GPU tests on NVIDIA A100 runners (opt-in due to low availability):
+
+ ``` text
+ ci-extra: test_a100
+ ```
+
+* Skip all CI builds and tests, e.g. for comment-only changes:
+
+ ``` text
+ skip-ci: Comment-only change.
+ ```
+
+* Only run Bazel builds, e.g. for changes only affecting Bazel rules:
+
+ ``` text
+ ci-exactly: build_test_all_bazel
+ ```
+
+For example, this PR opted in to running the `build_test_all_windows` job:
+
+![ci-extra](./contributing-ci-extra.png)
+
+The enabled jobs can be viewed from the Summary page of an action run:
+
+![ci_enabled_jobs](./contributing-ci-enabled-jobs.png)
+
+## Contributor tips
+
+These are opinionated tips documenting workflows that some members of the team
+have found useful. They are focused on meta-tooling, not on IREE code
+specifically (you will find the latter in the
+[Developer Overview](./developer-overview.md)).
+
+!!! note
+
+ It is certainly possible to use workflows other than these. Some common
+ tasks, especially for maintainers, will likely be made easier by using
+ these flows.
+
+We assume a basic knowledge of `git` and GitHub and suggest some specific ways
+of using them.
+
+### Useful tools
+
+* GitHub CLI (<https://github.com/cli/cli>). A CLI for interacting with GitHub.
+ Most importantly, it allows scripting the creation of pull requests.
+* Refined GitHub Chrome and Firefox Extension:
+ <https://github.com/sindresorhus/refined-github>. Nice extension that adds a
+ bunch of features to the GitHub UI.
+* VSCode: <https://code.visualstudio.com/>. The most commonly used IDE amongst
+ IREE developers.
+* [Ccache](https://ccache.dev/), a fast C/C++ compiler cache. See our
+ [CMake with `ccache`](../building/cmake-with-ccache.md) page.
+
+### Git structure
+
+We tend to use the "triangular" or "forking" workflow. Develop primarily on a
+clone of the repository on your development machine. Any local branches named
+the same as persistent branches from the
+[main repository](https://github.com/openxla/iree) are pristine (though
+potentially stale) copies. You only fast-forward these to match upstream and
+otherwise do development on other branches. When sending PRs, you push to a
+different branch on your public fork and create the PR from there.
+
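+As a quick sketch (assuming the `origin`/`upstream` remote names described in
+the setup below, and a hypothetical `my-feature` branch), the flow looks like:
+
+```shell
+git switch -c my-feature       # branch locally
+# ... commit changes ...
+git push -u origin my-feature  # push the branch to your fork
+gh pr create --fill            # open the PR (or use the GitHub web UI)
+```
+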
+<!-- TODO(scotttodd): screenshots / diagrams here
+ (https://mermaid.js.org/syntax/gitgraph.html?) -->
+
+#### Setup
+
+1. Create a fork of the main repository.
+
+2. Create a local git repository with remotes `upstream` (the main repository)
+ and `origin` (your personal fork). To list your current remotes, run
+ `git remote -v`.
+
+ a. If you already cloned from the main repository (e.g. by following the
+ getting started guide):
+
+ ```shell
+ # From your existing git repo
+ $ git remote rename origin upstream
+ $ git remote add origin https://github.com/<github_username>/iree.git
+ ```
+
+ b. If you haven't already cloned:
+
+ ```shell
+ # From whatever directory under which you want to nest your repo
+ $ git clone https://github.com/<github_username>/iree.git
+ $ cd iree
+ $ git remote add upstream https://github.com/openxla/iree.git
+ ```
+
+ This is especially important for maintainers who have write access (so can
+ push directly to the main repository) and admins who have elevated
+ privileges (so can push directly to protected branches). These names are
+ just suggestions, but you might find some scripts where the defaults are for
+ remotes named like this. For extra safety, you can make it difficult to push
+ directly to upstream by setting the push url to something invalid: `git
+ remote set-url --push upstream DISABLE`, which requires re-enabling the push
+ URL explicitly before pushing. You can wrap this behavior in a custom git
+ command like
+ [git-sudo](https://gist.github.com/GMNGeoffrey/42dd9a9792390094a43bdb69659320c0).
+
+3. Use a script like
+ [git_update.sh](https://github.com/openxla/iree/blob/main/build_tools/scripts/git/git_update.sh)
+ to easily synchronize `main` with `upstream`. Submodules make this a
+ little trickier than it should be. You can also turn this into a git command
+ by adding it to your path as `git-update`.
+
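+If you disabled pushes to `upstream` as suggested in step 2, `git remote -v`
+should show something like the following (a sketch assuming the remote names
+above):
+
+```shell
+$ git remote -v
+origin    https://github.com/<github_username>/iree.git (fetch)
+origin    https://github.com/<github_username>/iree.git (push)
+upstream  https://github.com/openxla/iree.git (fetch)
+upstream  DISABLE (push)
+```
+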
+#### Git config
+
+These are some additional options you could put in your top-level `.gitconfig`
+or repository-specific `.git/config` files that are conducive to the
+recommended workflow.
+
+```ini
+[push]
+ default = current
+[alias]
+ # Delete local branches that you pushed and that have since been deleted
+ # upstream, e.g. because the PR was merged.
+ gone = ! "git fetch -p && git for-each-ref --format '%(refname:short) %(upstream:track)' | awk '$2 == \"[gone]\" {print $1}' | xargs -r git branch -D"
+ # Update from upstream (custom command) and delete obsolete local branches.
+ sync = ! (git update main && git gone)
+ # Create a new branch based off of main (requires a clean working directory).
+ new = "!f(){ \\\ngit checkout main && git switch -c $1; \\\n}; f"
+ # Display branches in a useful "latest last" format
+ br = for-each-ref --sort=committerdate refs/heads/ --format='%(HEAD) %(color:yellow)%(refname:short)%(color:reset) - %(color:red)%(objectname:short)%(color:reset) - %(contents:subject) (%(color:green)%(committerdate:relative)%(color:reset))'
+ # `git git foo` -> `git foo` typo fixer
+ git = "!f(){ \\\n git \"$@\"; \\\n}; f"
+ # Get the git root directory
+ root = rev-parse --show-toplevel
+ # checkout, but also sync submodules
+ ch = "!f() { \\\n git checkout \"$@\"; git submodule sync && git submodule update --init; \\\n}; f"
+ # See the diff for a PR branch vs the main branch
+ diffmain = diff --merge-base main
+ # See only the files that differ vs the main branch
+ whatsout = diffmain --name-only
+[checkout]
+ # If the checkout command is ambiguous across remotes, default to origin.
+ defaultRemote = origin
+[pull]
+ # When pulling, only complete the pull if it's a clean fast-forward.
+ ff = only
+[remote]
+ # Push to your fork (origin) by default
+ pushDefault = origin
+[url "ssh://git@github.com/"]
+ # Pull with https (so no auth required), but push with ssh.
+ pushInsteadOf = https://github.com/
+```
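+
+As a usage sketch, with the aliases above in place (plus the custom `git
+update` command from the setup section), day-to-day commands look like:
+
+```shell
+git sync           # update main from upstream and prune branches gone upstream
+git new my-feature # hypothetical branch name; creates a branch off of main
+git whatsout       # list the files that differ from the main branch
+```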
diff --git a/docs/developers/developing_iree/developer_overview.md b/docs/website/docs/developers/general/developer-overview.md
similarity index 92%
rename from docs/developers/developing_iree/developer_overview.md
rename to docs/website/docs/developers/general/developer-overview.md
index fa0f189..94b51f0 100644
--- a/docs/developers/developing_iree/developer_overview.md
+++ b/docs/website/docs/developers/general/developer-overview.md
@@ -1,12 +1,9 @@
-# Developer Overview
+# Developer overview
This guide provides an overview of IREE's project structure and main tools for
developers.
-**Note: project layout is evolving at the moment, see
-<https://github.com/openxla/iree/issues/8955>**
-
-## Project Code Layout
+## Project code layout
* [/compiler/](https://github.com/openxla/iree/blob/main/compiler/):
MLIR dialects, LLVM compiler passes, module translation code, etc.
@@ -25,7 +22,7 @@
* [/samples/](https://github.com/openxla/iree/blob/main/samples/): Also see the
separate <https://github.com/iree-org/iree-samples> repository
-## IREE Compiler Code Layout
+## IREE compiler code layout
* [API/](https://github.com/openxla/iree/tree/main/compiler/src/iree/compiler/API):
Public C API
@@ -36,7 +33,7 @@
* [InputConversion/](https://github.com/openxla/iree/tree/main/compiler/src/iree/compiler/InputConversion):
Conversions from input dialects and preprocessing
-## IREE Runtime Code Layout
+## IREE runtime code layout
* [base/](https://github.com/openxla/iree/blob/main/runtime/src/iree/base/):
Common types and utilities used throughout the runtime
@@ -55,7 +52,7 @@
Bytecode **V**irtual **M**achine used to work with IREE modules and invoke
IREE functions
-## Developer Tools
+## Developer tools
IREE's core compiler accepts programs in supported input MLIR dialects (e.g.
`stablehlo`, `tosa`, `linalg`). Import tools and APIs may be used to convert
@@ -140,7 +137,7 @@
The `iree-check-module` program takes an already translated IREE module as input
and executes it as a series of
[googletest](https://github.com/google/googletest) tests. This is the test
-runner for the IREE [check framework](./testing_guide.md#end-to-end-tests).
+runner for the IREE [check framework](./testing-guide.md#end-to-end-tests).
```shell
$ ../iree-build/tools/iree-compile \
@@ -189,13 +186,11 @@
### Useful generic flags
-There are a few useful generic flags when working with IREE tools:
-
#### Read inputs from a file
All the IREE tools support reading input values from a file. This is quite
-useful for debugging. Use `-help` for each tool to see what the flag to set. The
-inputs are expected to be newline-separated. Each input should be either a
+useful for debugging. Use `--help` for each tool to see which flag to set.
+The inputs are expected to be newline-separated. Each input should be either a
scalar or a buffer. Scalars should be in the format `type=value` and buffers
should be in the format `[shape]xtype=[value]`. For example:
@@ -204,7 +199,7 @@
1x5x3x1xf32=15,14,13,12,11,10,9,8,7,6,5,4,3,2,1
```
-#### `iree-flow-trace-dispatch-tensors`
+#### `--iree-flow-trace-dispatch-tensors`
This flag will enable tracing inputs and outputs for each dispatch function. It
is easier to narrow down test cases, since IREE breaks a ML workload into
diff --git a/docs/website/docs/guides/developer-tips.md b/docs/website/docs/developers/general/developer-tips.md
similarity index 98%
rename from docs/website/docs/guides/developer-tips.md
rename to docs/website/docs/developers/general/developer-tips.md
index f6a2988..04b8a67 100644
--- a/docs/website/docs/guides/developer-tips.md
+++ b/docs/website/docs/developers/general/developer-tips.md
@@ -1,8 +1,9 @@
----
+<!-- TODO(scotttodd): add icons to all developers/ pages -->
+<!-- ---
icon: material/lightbulb-on
----
+--- -->
-# IREE developer tips and tricks
+# Developer tips and tricks
The IREE compiler is built using [MLIR](https://mlir.llvm.org/), so it naturally
supports the common
diff --git a/docs/website/docs/developers/general/release-management.md b/docs/website/docs/developers/general/release-management.md
new file mode 100644
index 0000000..33a3a57
--- /dev/null
+++ b/docs/website/docs/developers/general/release-management.md
@@ -0,0 +1,38 @@
+# Release management
+
+IREE cuts automated releases via a workflow that is
+[triggered daily](https://github.com/openxla/iree/blob/main/.github/workflows/schedule_candidate_release.yml).
+The only constraint placed on the commit that is released is that it has passed
+all CI checks. These are published on GitHub with the "pre-release" status. For
+debugging this process, see the
+[Release debugging playbook](../debugging/releases.md).
+
+We periodically promote one of these candidates to a "stable" release by
+removing the "pre-release" status. This makes it show up as a "latest" release
+on GitHub. We also push the Python packages for this release to PyPI.
+
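+For reference, users can then install a stable release from PyPI (package
+names at the time of writing):
+
+```shell
+python -m pip install iree-compiler iree-runtime
+```
+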
+## Picking a candidate to promote
+
+When selecting a candidate we use the following criteria:
+
+1. ⪆4 days old so that problems with it may have been spotted
+2. Contains no P0 regressions vs the previous stable release
+3. LLVM submodule commit ideally exists upstream (no cherry picks or patches)
+
+When you've identified a potential candidate, email the iree-discuss list with
+the proposal and solicit feedback. People may point out known regressions or
+request that some feature make the cut.
+
+## Promoting a candidate to stable
+
+1. (Authorized users only) Push to PyPI using
+ [pypi_deploy.sh](https://github.com/openxla/iree/blob/main/build_tools/python_deploy/pypi_deploy.sh)
+
+ * For Googlers, the password is stored at <http://go/iree-pypi-password>
+
+2. Open the release on GitHub. Rename the release from "candidate" to "stable",
+ uncheck the option for "pre-release", and check the option for "latest".
+
+ ![rename_release](./release-renaming.png)
+
+ ![promote_release](./release-promotion.png)
diff --git a/docs/developers/assets/promote_release.png b/docs/website/docs/developers/general/release-promotion.png
similarity index 100%
rename from docs/developers/assets/promote_release.png
rename to docs/website/docs/developers/general/release-promotion.png
Binary files differ
diff --git a/docs/developers/assets/rename_release.png b/docs/website/docs/developers/general/release-renaming.png
similarity index 100%
rename from docs/developers/assets/rename_release.png
rename to docs/website/docs/developers/general/release-renaming.png
Binary files differ
diff --git a/docs/developers/developing_iree/testing_guide.md b/docs/website/docs/developers/general/testing-guide.md
similarity index 84%
rename from docs/developers/developing_iree/testing_guide.md
rename to docs/website/docs/developers/general/testing-guide.md
index c0028e6..59b08e1 100644
--- a/docs/developers/developing_iree/testing_guide.md
+++ b/docs/website/docs/developers/general/testing-guide.md
@@ -1,4 +1,4 @@
-# Testing Guide
+# Testing guide
Like the IREE project in general, IREE tests are divided into a few different
components and use different tooling depending on the needs of that component.
@@ -11,27 +11,27 @@
| | iree_hal_cts_test_suite | CMake | Host/Device |
| Core E2E tests | iree_check_test | Bazel/CMake | Host/Device |
| | iree_trace_runner_test | Bazel/CMake | Host/Device |
-| | iree_single_backend_generated_trace_runner_test | Bazel/CMake | Host/Device |
| | iree_generated_trace_runner_test | Bazel/CMake | Host/Device |
| | iree_static_linker_test | CMake | Host/Device |
There are also more `*_test_suite` targets that group test targets with the
same configuration together.
-## Compiler Tests
+## Compiler tests
Tests for the IREE compilation pipeline are written as lit tests in the same
style as MLIR.
-By convention, IREE includes tests for printing and parsing of MLIR ops in
-`.../IR/test/{OP_CATEGORY}_ops.mlir` files, tests for folding and
-canonicalization in `.../IR/test/{OP_CATEGORY}_folding.mlir` files, and tests
-for compiler passes and pipelines in other `.../test/*.mlir` files.
+By convention, IREE includes tests for
-### Running a Test
+* printing and parsing of ops in `.../IR/test/{OP_CATEGORY}_ops.mlir` files
+* folding and canonicalization in `.../IR/test/{OP_CATEGORY}_folding.mlir` files
+* compiler passes and pipelines in other `.../test/*.mlir` files
+
+### Running a test
For the test
-<https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Dialect/VM/Conversion/MathToVM/test/arithmetic_ops.mlir>
+[`iree/compiler/Dialect/VM/Conversion/MathToVM/test/arithmetic_ops.mlir`](https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Dialect/VM/Conversion/MathToVM/test/arithmetic_ops.mlir)
With CMake, run this from the build directory:
@@ -45,7 +45,7 @@
bazel test //compiler/src/iree/compiler/Dialect/VM/Conversion/MathToVM/test:arithmetic_ops.mlir.test
```
-### Writing a Test
+### Writing a test
For advice on writing MLIR compiler tests, see the
[MLIR testing guide](https://mlir.llvm.org/getting_started/TestingGuide/). Tests
@@ -56,13 +56,13 @@
As with most parts of the IREE compiler, these should not have a dependency on
the runtime.
-### Configuring the Build System
+### Configuring the build system
In the Bazel BUILD file, create a `iree_lit_test_suite` rule. We usually create
a single suite that globs all `.mlir` files in the directory and is called
"lit".
-```bzl
+```python
load("//iree/build_tools/bazel:iree_lit_test.bzl", "iree_lit_test_suite")
iree_lit_test_suite(
@@ -76,7 +76,7 @@
```
There is a corresponding CMake function, calls to which will be generated by our
-[Bazel to CMake Converter](https://github.com/openxla/iree/tree/main/build_tools/bazel_to_cmake/bazel_to_cmake.py).
+[Bazel to CMake converter](https://github.com/openxla/iree/tree/main/build_tools/bazel_to_cmake/bazel_to_cmake.py).
```cmake
iree_lit_test_suite(
@@ -92,13 +92,13 @@
You can also create a test for a single file with `iree_lit_test`.
-## Runtime Tests
+## Runtime tests
Tests for the runtime C++ code use the
-[Google Test](https://github.com/google/googletest) testing framework. They
+[GoogleTest](https://github.com/google/googletest) testing framework. They
should generally follow the style and best practices of that framework.
-### Running a Test
+### Running a test
For the test
[`/runtime/src/iree/base/bitfield_test.cc`](https://github.com/openxla/iree/blob/main/runtime/src/iree/base/bitfield_test.cc):
@@ -127,7 +127,7 @@
To use the Vulkan backend as test driver, you may need to select between a
Vulkan implementation from SwiftShader and multiple Vulkan-capable hardware
devices. This can be done via environment variables. See the
-[generic Vulkan setup](../get_started/vulkan_environment_setup.md#useful-environment-variables)
+[generic Vulkan setup](../vulkan-environment-setup.md#useful-environment-variables)
page for details regarding these variables.
For Bazel, you can persist the configuration in `user.bazelrc` to save typing.
@@ -143,10 +143,10 @@
Then you can use `bazel test --config=vkswiftshader` to select SwiftShader as
the Vulkan implementation. Similarly for other implementations.
-### Writing a Test
+### Writing a test
-For advice on writing tests in the Googletest framework, see the
-[Googletest primer](https://github.com/google/googletest/blob/main/docs/primer.md).
+For advice on writing tests in the GoogleTest framework, see the
+[GoogleTest primer](https://github.com/google/googletest/blob/main/docs/primer.md).
Test files for source file `foo.cc` with build target `foo` should live in the
same directory with source file `foo_test.cc` and build target `foo_test`. You
should `#include` `iree/testing/gtest.h` instead of any of the gtest or gmock
@@ -155,14 +155,14 @@
As with all parts of the IREE runtime, these should not have a dependency on the
compiler.
-### Configuring the Build System
+### Configuring the build system
In the Bazel BUILD file, create a `cc_test` target with your test file as the
source and any necessary dependencies. Usually, you can link in a standard gtest
main function. Use `iree/testing:gtest_main` instead of the `gtest_main` that
comes with gtest.
-```bzl
+```python
cc_test(
name = "arena_test",
srcs = ["arena_test.cc"],
@@ -194,13 +194,12 @@
which are designed to test specific runtime support with template configuration
and is not supported by Bazel rules.
-## IREE Core End-to-End Tests
+## IREE core end-to-end (e2e) tests
-Here "End-to-End" means from the input accepted by the IREE core compiler
-(dialects like TOSA, MHLO, Linalg, and Arithmetic) to execution using the
-IREE runtime components. It does not include tests of the integrations with ML
-frameworks (e.g. TensorFlow, TensorFlow Lite) or bindings to other languages
-(e.g. Python).
+Here "end-to-end" means from the input accepted by the IREE core compiler
+(dialects like TOSA, StableHLO, Linalg) to execution using the IREE runtime
+components. It does not include tests of the integrations with ML frameworks
+(e.g. TensorFlow, PyTorch) or bindings to other languages (e.g. Python).
We avoid using the more traditional `lit` tests used elsewhere in the compiler
for runtime execution tests. Lit tests require running the compiler tools on
@@ -224,16 +223,16 @@
```
To run e2e model tests in
-[generated_e2e_model_tests.cmake](/tests/e2e/stablehlo_models/generated_e2e_model_tests.cmake),
+[generated_e2e_model_tests.cmake](https://github.com/openxla/iree/tree/main/tests/e2e/stablehlo_models/generated_e2e_model_tests.cmake),
because of their dependencies, `-DIREE_BUILD_E2E_TEST_ARTIFACTS=ON` needs to be
set when configuring CMake. Also see
-[IREE Benchmark Suite Prerequisites](/docs/developers/developing_iree/benchmark_suites.md#prerequisites)
+[IREE Benchmark Suite Prerequisites](../performance/benchmark-suites.md#prerequisites)
for required packages.
### Running a Test
For the test
-<https://github.com/openxla/iree/tree/main/tests/e2e/xla_ops/floor.mlir>
+[`tests/e2e/xla_ops/floor.mlir`](https://github.com/openxla/iree/tree/main/tests/e2e/xla_ops/floor.mlir)
compiled for the VMVX target backend and running on the VMVX driver (here they
match exactly, but in principle there's a many-to-many mapping from backends to
drivers).
@@ -253,9 +252,9 @@
### Setting test environments
Similarly, you can use environment variables to select Vulkan implementations
-for running tests as explained in the [Runtime Tests](#runtime-tests) section.
+for running tests as explained in the [Runtime tests](#runtime-tests) section.
-### Writing a Test
+### Writing a test
These tests live in `tests/e2e`. A single test consists of a `.mlir` source
file specifying an IREE module where each exported function takes no inputs and
@@ -297,7 +296,7 @@
Next we use this input constant to exercise the runtime feature under test (in
this case, just a single floor operation). Finally, we use a check dialect
operation to make an assertion about the output. There are a few different
-[assertion operations](https://github.com/openxla/iree/tree/main/compiler/src/iree/compiler/Modules/Check).
+[assertion operations](../../reference/mlir-dialects/Check.md).
Here we use the `expect_almost_eq_const` op: *almost* because we are comparing
floats and want to allow for floating-point imprecision, and *const* because we
want to compare it to a constant value. This last part is just syntactic sugar
@@ -333,12 +332,12 @@
MLIR module. To give the test suite a more descriptive name, use an explicit
named top-level module in this file.
-### Configuring the Build System
+### Configuring the build system
A single `.mlir` source file can be turned into a test target with the
`iree_check_test` Bazel macro (and corresponding CMake function).
-```bzl
+```python
load("//build_tools/bazel:iree_check_test.bzl", "iree_check_test")
iree_check_test(
@@ -357,7 +356,7 @@
This can be accomplished with additional macros. For a single backend/driver
pair:
-```bzl
+```python
load("//build_tools/bazel:iree_check_test.bzl", "iree_check_single_backend_test_suite")
iree_check_single_backend_test_suite(
@@ -370,12 +369,12 @@
This will generate a separate test target for each file in `srcs` with a name
following the convention above as well as a Bazel
-[test_suite](https://docs.bazel.build/versions/master/be/general.html#test_suite)
+[test_suite](https://bazel.build/reference/be/general#test_suite)
called "check_vmvx_local-task" that will run all the generated tests.
You can also generate suites across multiple pairs:
-```bzl
+```python
load("//build_tools/bazel:iree_check_test.bzl", "iree_check_test_suite")
iree_check_test_suite(
@@ -399,11 +398,3 @@
There are other test targets that generate tests based on template configuration
and platform detection, such as `iree_static_linker_test`. Those targets are
not supported by Bazel rules at this point.
-
-## Binding Tests
-
-TODO(laurenzo): Explain binding test setup.
-
-## Integration Tests
-
-TODO(silvasean): Explain integration test setup.
diff --git a/docs/website/docs/developers/index.md b/docs/website/docs/developers/index.md
new file mode 100644
index 0000000..226be22
--- /dev/null
+++ b/docs/website/docs/developers/index.md
@@ -0,0 +1,6 @@
+# Developers
+
+These pages cover topics useful for project maintainers and contributors.
+
+!!! caution
+ Some of these pages may be stale. Contributions are always welcome!
diff --git a/docs/developers/developing_iree/benchmark_suites.md b/docs/website/docs/developers/performance/benchmark-suites.md
similarity index 92%
rename from docs/developers/developing_iree/benchmark_suites.md
rename to docs/website/docs/developers/performance/benchmark-suites.md
index ebbf7cf..9c8f565 100644
--- a/docs/developers/developing_iree/benchmark_suites.md
+++ b/docs/website/docs/developers/performance/benchmark-suites.md
@@ -1,4 +1,4 @@
-# IREE Benchmark Suites
+# Benchmark suites
IREE Benchmark Suites is a collection of benchmarks for IREE developers to
track performance improvements/regressions during development.
@@ -10,9 +10,9 @@
<https://perf.iree.dev> and post in the comments.
Information about the definitions of the benchmark suites can be found in the
-[IREE Benchmark Suites Configurations](/build_tools/python/benchmark_suites/iree/README.md).
+[IREE Benchmark Suites Configurations](https://github.com/openxla/iree/blob/main/build_tools/python/benchmark_suites/iree/README.md).
-## Running Benchmark Suites Locally
+## Running benchmark suites locally
### Prerequisites
@@ -22,7 +22,7 @@
and
[TFLite Integration](https://iree.dev/guides/ml-frameworks/tflite/)).
-### Choose Benchmark Presets
+### Choose benchmark presets
IREE Benchmark Suites contain many benchmarks for different devices and model
sizes, which can take lots of space and time to build all of them. So benchmarks
@@ -56,7 +56,7 @@
export COMPILATION_BENCHMARK_PRESETS="comp-stats"
```
-### Build Benchmark Suites
+### Build benchmark suites
Configure IREE with `-DIREE_BUILD_E2E_TEST_ARTIFACTS=ON`:
@@ -92,7 +92,7 @@
> TODO(#13683): Each preset should have its own target to further reduce
> unnecessary builds
-### Run Benchmarks
+### Run benchmarks
Export the execution benchmark config:
@@ -126,11 +126,11 @@
- `c2-standard-16` for x86_64 CPU benchmarks.
- `a2-highgpu-1g` for NVIDIA GPU benchmarks.
- All device names are defined under
- [build_tools/python/e2e_test_framework/device_specs](/build_tools/python/e2e_test_framework/device_specs).
+ [build_tools/python/e2e_test_framework/device_specs](https://github.com/openxla/iree/tree/main/build_tools/python/e2e_test_framework/device_specs).
- To run x86_64 benchmarks, right now `--cpu_uarch` needs to be provided and
only `CascadeLake` is available currently.
- To build traced benchmark tools, see
- [Profiling with Tracy](/docs/developers/developing_iree/profiling_with_tracy.md).
+ [Profiling with Tracy](profiling-with-tracy.md).
Filters can be used to select the benchmarks:
@@ -148,7 +148,7 @@
--mode_regex="4-thread"
```
-### Generate Compilation Statistics (Compilation Benchmarks)
+### Generate compilation statistics (compilation benchmarks)
Export the compilation benchmark config:
@@ -171,10 +171,10 @@
Note that you need to use [Ninja](https://ninja-build.org/) to build the
benchmark suites as the tool collects information from its build log.
-### Show Execution / Compilation Benchmark Results
+### Show execution / compilation benchmark results
If you want to generate a comparison report locally, you can use
-[diff_local_benchmarks.py](/build_tools/benchmarks/diff_local_benchmarks.py)
+[diff_local_benchmarks.py](https://github.com/openxla/iree/blob/main/build_tools/benchmarks/diff_local_benchmarks.py)
script to compare two result json files and generate the report. For example:
```sh
@@ -193,7 +193,7 @@
> report.md
```
-### Find Compile and Run Commands to Reproduce Benchmarks
+### Find compile and run commands to reproduce benchmarks
Each benchmark has its benchmark ID in the benchmark suites; you will see a
benchmark ID at:
@@ -232,7 +232,7 @@
--benchmark_id="<benchmark_id>"
```
-### Get Full List of Benchmarks
+### Get full list of benchmarks
The commands below output the full list of execution and compilation benchmarks,
including the benchmark names and their flags:
@@ -245,7 +245,7 @@
--compilation_benchmark_config="${E2E_TEST_ARTIFACTS_DIR?}/comp_config.json"
```
-## Fetching Benchmark Artifacts from CI
+## Fetching benchmark artifacts from CI
### 1. Find the corresponding CI workflow run
diff --git a/docs/developers/developing_iree/benchmarking.md b/docs/website/docs/developers/performance/benchmarking.md
similarity index 100%
rename from docs/developers/developing_iree/benchmarking.md
rename to docs/website/docs/developers/performance/benchmarking.md
diff --git a/docs/developers/developing_iree/profiling_cpu_events.md b/docs/website/docs/developers/performance/profiling-cpu-events.md
similarity index 95%
rename from docs/developers/developing_iree/profiling_cpu_events.md
rename to docs/website/docs/developers/performance/profiling-cpu-events.md
index e352f6b..7477490 100644
--- a/docs/developers/developing_iree/profiling_cpu_events.md
+++ b/docs/website/docs/developers/performance/profiling-cpu-events.md
@@ -1,12 +1,12 @@
-# CPU cache and other CPU event profiling
+# Profiling CPUs
CPUs are able to
[record](https://en.wikipedia.org/wiki/Hardware_performance_counter) certain
events that may be relevant when investigating the performance of a program. A
-common example of such an event is a ["cache
-miss"](https://en.wikipedia.org/wiki/CPU_cache#Cache_miss), when the program
-tries to access data in memory that isn't already in some CPU cache, causing
-that access to be slower than it could otherwise be.
+common example of such an event is a
+["cache miss"](https://en.wikipedia.org/wiki/CPU_cache#Cache_miss), when the
+program tries to access data in memory that isn't already in some CPU cache,
+causing that access to be slower than it could otherwise be.
Querying and analyzing this data can be useful, but is hard in two distinct
ways:
@@ -34,7 +34,7 @@
switches). Anyone may use this system call to implement a profiler, but Linux
readily offers one, [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page).
-### Preserving Artifacts
+### Preserving artifacts
By default IREE cleans up any temporary files it creates while running. Tools
like perf, however, require those files exist even after the process has exited.
@@ -45,7 +45,7 @@
export IREE_PRESERVE_DYLIB_TEMP_FILES=1
```
-### Desktop Linux
+### Desktop Linux
On desktop Linux we can use
[`perf`](https://perf.wiki.kernel.org/index.php/Main_Page). It is provided on
@@ -87,7 +87,7 @@
`perf report` breaks down the event counts by symbol. In the default case where
what was sampled was time, this is just an ordinary profile by symbol name, no
different than what could be viewed in other profilers such as
-[Tracy](profiling_with_tracy.md). Where it gets really interesting is when the
+[Tracy](profiling-with-tracy.md). Where it gets really interesting is when the
profile was recording a specific event type, as in the above `-e
L1-dcache-load-misses` example:
@@ -183,7 +183,7 @@
However:
* The common case of annotating by time, as opposed to annotating by CPU event,
- is supported by [Tracy](profiling_with_tracy.md).
+ is supported by [Tracy](profiling-with-tracy.md).
* Annotating by CPU event is inherently not working due to hardware limitations
of the ARM CPUs found in Android devices. That is, the hardware is too
imprecise at pinning an event to a particular instruction.
diff --git a/docs/developers/developing_iree/profiling_vulkan_gpu.md b/docs/website/docs/developers/performance/profiling-gpu-vulkan.md
similarity index 97%
rename from docs/developers/developing_iree/profiling_vulkan_gpu.md
rename to docs/website/docs/developers/performance/profiling-gpu-vulkan.md
index 50fb37c..a92d6d9 100644
--- a/docs/developers/developing_iree/profiling_vulkan_gpu.md
+++ b/docs/website/docs/developers/performance/profiling-gpu-vulkan.md
@@ -1,6 +1,6 @@
-# Vulkan GPU Profiling
+# Profiling GPUs using Vulkan
-[Tracy](./profiling_with_tracy.md) offers great insights into CPU/GPU
+[Tracy](./profiling-with-tracy.md) offers great insights into CPU/GPU
interactions and Vulkan API usage
details. However, information at a finer granularity, especially inside a
particular shader dispatch, is missing. To supplement general purpose tools
diff --git a/docs/developers/developing_iree/profiling_with_tracy.md b/docs/website/docs/developers/performance/profiling-with-tracy.md
similarity index 100%
rename from docs/developers/developing_iree/profiling_with_tracy.md
rename to docs/website/docs/developers/performance/profiling-with-tracy.md
diff --git a/docs/developers/developing_iree/profiling.md b/docs/website/docs/developers/performance/profiling.md
similarity index 75%
rename from docs/developers/developing_iree/profiling.md
rename to docs/website/docs/developers/performance/profiling.md
index 46cb0f2..a5a4274 100644
--- a/docs/developers/developing_iree/profiling.md
+++ b/docs/website/docs/developers/performance/profiling.md
@@ -1,4 +1,4 @@
-# Profiling
+# Profiling overview
IREE [benchmarking](./benchmarking.md) gives us an accurate and reproducible
view of program performance at specific levels of granularity. To analyze system
@@ -8,18 +8,18 @@
## Tracy
Tracy is a profiler that's been used for a wide range of profiling tasks on
-IREE. Refer to [profiling_with_tracy.md](./profiling_with_tracy.md).
+IREE. Refer to [Profiling with Tracy](./profiling-with-tracy.md).
## Vulkan GPU Profiling
-[Tracy](./profiling_with_tracy.md) offers great insights into CPU/GPU
+[Tracy](./profiling-with-tracy.md) offers great insights into CPU/GPU
interactions and Vulkan API usage details. However, information at a finer
granularity, especially inside a particular shader dispatch, is missing. To
supplement general purpose tools like Tracy, vendor-specific tools can be used.
-Refer to [profiling_vulkan_gpu.md](./profiling_vulkan_gpu.md).
+Refer to [Profiling GPUs using Vulkan](./profiling-gpu-vulkan.md).
## CPU cache and other CPU event profiling
For some advanced CPU profiling needs such as querying CPU cache and other
events, one may need to use some OS-specific profilers. See
-[profiling_cpu_events.md](./profiling_cpu_events.md).
+[Profiling CPUs](./profiling-cpu-events.md).
diff --git a/docs/developers/best_practices.md b/docs/website/docs/developers/usage-best-practices.md
similarity index 96%
rename from docs/developers/best_practices.md
rename to docs/website/docs/developers/usage-best-practices.md
index 3df6d81..5bd33a6 100644
--- a/docs/developers/best_practices.md
+++ b/docs/website/docs/developers/usage-best-practices.md
@@ -1,4 +1,4 @@
-# IREE Best Practices
+# Usage best practices
This page contains a list of best practices for getting the most out of IREE,
spanning model authoring, ahead-of-time compilation, and runtime use. Treat
@@ -52,7 +52,7 @@
### Tuning compilation heuristics
IREE runs its own suite of benchmarks continuously using the definitions at
-https://github.com/openxla/iree/tree/main/benchmarks. The flags set for these
+<https://github.com/openxla/iree/tree/main/benchmarks>. The flags set for these
benchmarks represent the latest manually tuned values for workloads we track
closely and referencing them may help with your own search for peak performance.
You can use these flags in your own explorations, but note that as compiler
diff --git a/docs/developers/get_started/vulkan_environment_setup.md b/docs/website/docs/developers/vulkan-environment-setup.md
similarity index 97%
rename from docs/developers/get_started/vulkan_environment_setup.md
rename to docs/website/docs/developers/vulkan-environment-setup.md
index e06fc33..ff76271 100644
--- a/docs/developers/get_started/vulkan_environment_setup.md
+++ b/docs/website/docs/developers/vulkan-environment-setup.md
@@ -1,11 +1,11 @@
-# Generic Vulkan Development Environment Setup and Troubleshooting
+# Vulkan environment setup
[Vulkan](https://www.khronos.org/vulkan/) is a new generation graphics and
compute API that provides high-efficiency, cross-platform access to modern GPUs
used in a wide variety of devices from PCs and consoles to mobile phones and
embedded platforms.
-This page lists steps and tips for setting up and trouble shooting a Vulkan
+This page lists steps and tips for setting up and troubleshooting a Vulkan
development environment. The information here is meant to be generic.
## Vulkan architecture
diff --git a/docs/website/docs/guides/index.md b/docs/website/docs/guides/index.md
index 99d800f..fbef597 100644
--- a/docs/website/docs/guides/index.md
+++ b/docs/website/docs/guides/index.md
@@ -33,7 +33,3 @@
for AMD-specific solutions
* [:simple-apple: GPU - Metal](./deployment-configurations/gpu-metal.md)
for running on Apple hardware
-
-## Other topics
-
-* [:material-lightbulb-on: Developer tips and tricks](./developer-tips.md)
diff --git a/docs/website/mkdocs.yml b/docs/website/mkdocs.yml
index 4867083..7a746a5 100644
--- a/docs/website/mkdocs.yml
+++ b/docs/website/mkdocs.yml
@@ -2,6 +2,7 @@
site_url: https://iree.dev/
repo_url: https://github.com/openxla/iree
repo_name: openxla/iree
+edit_uri: blob/main/docs/website/docs/
theme:
name: material
@@ -9,6 +10,7 @@
logo_alt: IREE
icon:
repo: fontawesome/brands/github
+ edit: material/file-eye-outline
font:
text: Noto
code: Noto Sans Mono
@@ -18,6 +20,7 @@
custom_dir: overrides
features:
+ - content.action.edit # Link to view/edit documentation source on GitHub
- content.code.annotate # Allow inline annotations
- content.code.copy # Enable copy button
- content.tabs.link # Link content tabs across site (e.g. Windows/Linux)
@@ -136,8 +139,6 @@
- GPU - CUDA: "guides/deployment-configurations/gpu-cuda.md"
- GPU - ROCm: "guides/deployment-configurations/gpu-rocm.md"
- GPU - Metal: "guides/deployment-configurations/gpu-metal.md"
- - "Other topics":
- - Developer tips and tricks: "guides/developer-tips.md"
- "Reference":
- "reference/index.md"
- "API bindings":
@@ -165,6 +166,40 @@
- Glossary: "reference/glossary.md"
- Optimization options: "reference/optimization-options.md"
- Extensions: "reference/extensions.md"
+ - "Developers":
+ - "developers/index.md"
+ - "General development topics":
+ - "developers/general/contributing.md"
+ - "developers/general/developer-overview.md"
+ - "developers/general/developer-tips.md"
+ - "developers/general/release-management.md"
+ - "developers/general/testing-guide.md"
+ - "Building":
+ - "developers/building/bazel.md"
+ - "developers/building/emscripten.md"
+ - "developers/building/cmake-options-and-variables.md"
+ - "developers/building/cmake-with-ccache.md"
+ - "Debugging":
+ - "developers/debugging/android-with-lldb.md"
+ - "developers/debugging/compile-time-regressions.md"
+ - "developers/debugging/integration-tests.md"
+ - "developers/debugging/releases.md"
+ - "developers/debugging/sanitizers.md"
+ - "Performance":
+ - "developers/performance/benchmarking.md"
+ - "developers/performance/benchmark-suites.md"
+ - "developers/performance/profiling.md"
+ - "developers/performance/profiling-cpu-events.md"
+ - "developers/performance/profiling-gpu-vulkan.md"
+ - "developers/performance/profiling-with-tracy.md"
+ - "Design docs":
+ - "developers/design-docs/cuda-backend.md"
+ - "developers/design-docs/design-roadmap.md"
+ - "developers/design-docs/function-abi.md"
+ - "developers/design-docs/invocation-execution-model.md"
+ - "Other topics":
+ - "developers/usage-best-practices.md"
+ - "developers/vulkan-environment-setup.md"
- "Community":
- "community/index.md"
- "Blog":
@@ -219,3 +254,6 @@
# Some blog post names/paths changed when setting up the blog plugin
"community/blog/2021-07-19-tflite-tosa.md": "community/blog/posts/tflite-tosa.md"
"community/blog/2021-10-13-mmt4d.md": "community/blog/posts/mmt4d.md"
+
+ # "Developers" section was added
+ "guides/developer-tips.md": "developers/general/developer-tips.md"
diff --git a/docs/website/overrides/404.html b/docs/website/overrides/404.html
index 64dbee5..fd3a129 100644
--- a/docs/website/overrides/404.html
+++ b/docs/website/overrides/404.html
@@ -6,8 +6,6 @@
<p>Sorry, we couldn't find that page.</p>
-<p>The <a href="https://github.com/openxla/iree/tree/main/docs/developers"><code>docs/developers/</code></a> directory on GitHub might be helpful.
-
<p>Click <a href="{{ config.site_url }}">here</a> to go back to the home page.</p>
{% endblock %}