Fixing roadmap TOC.

Closes https://github.com/google/iree/pull/1296

COPYBARA_INTEGRATE_REVIEW=https://github.com/google/iree/pull/1296 from google:benvanik-roadmap-toc d7e8c726117954aacffac43fdeb3af76683999b0
PiperOrigin-RevId: 303392309
diff --git a/docs/roadmap_design.md b/docs/roadmap_design.md
index 6987a01..3063585 100644
--- a/docs/roadmap_design.md
+++ b/docs/roadmap_design.md
@@ -1,6 +1,8 @@
 # IREE Design Roadmap
 
-<a id="markdown-iree-design-roadmap" name="iree-design-roadmap"></a>
+<a id="markdown-IREE%20Design%20Roadmap" name="IREE%20Design%20Roadmap"></a>
+
+<!-- WARNING: DO NOT EDIT THIS FILE IN AN EDITOR WITH AUTO FORMATTING -->
 
 A not-so-concise walkthrough of various IREE features that are in the design
 process and planned for future versions. A lot of the questions around how the
@@ -51,11 +53,11 @@
 
 ## Input Dialects
 
-<a id="markdown-input-dialects" name="input-dialects"></a>
+<a id="markdown-Input%20Dialects" name="Input%20Dialects"></a>
 
 ### Future MLIR XLA HLO Replacement
 
-<a id="markdown-future-mlir-xla-hlo-replacement" name="future-mlir-xla-hlo-replacement"></a>
+<a id="markdown-Future%20MLIR%20XLA%20HLO%20Replacement" name="Future%20MLIR%20XLA%20HLO%20Replacement"></a>
 
 IREE's current input dialect is the XLA HLO dialect representing operations on
 tensors. This was a pragmatic decision based on having HLO already defined and
@@ -69,7 +71,7 @@
 
 ### `linalg`: High-level Hierarchical Optimization
 
-<a id="markdown-linalg-high-level-hierarchical-optimization" name="linalg-high-level-hierarchical-optimization"></a>
+<a id="markdown-%60linalg%60%3A%20High-level%20Hierarchical%20Optimization" name="%60linalg%60%3A%20High-level%20Hierarchical%20Optimization"></a>
 
 It's required that IREE inputs are all in tensor form (and not in-place memref
 updates) in order to perform a large majority of the `flow` transformations.
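A minimal sketch of this distinction, assuming a generic-form `xla_hlo.add` and standard-dialect memref ops (all illustrative, not drawn from this change): the tensor form keeps dataflow explicit in SSA values, while the memref form hides it behind an in-place update that the `flow` transformations cannot analyze.

```mlir
// Tensor form: values are immutable SSA results, so dataflow is explicit.
func @tensor_form(%arg0 : tensor<4xf32>) -> tensor<4xf32> {
  %0 = "xla_hlo.add"(%arg0, %arg0)
      : (tensor<4xf32>, tensor<4xf32>) -> tensor<4xf32>
  return %0 : tensor<4xf32>
}

// Memref form: an in-place update through a buffer, which obscures the
// producer/consumer relationships the `flow` transformations need to see.
func @memref_form(%arg0 : memref<4xf32>) {
  %c0 = constant 0 : index
  %0 = load %arg0[%c0] : memref<4xf32>
  %1 = addf %0, %0 : f32
  store %1, %arg0[%c0] : memref<4xf32>
  return
}
```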
@@ -84,7 +86,7 @@
 
 ### XLA HLO: Canonicalizations
 
-<a id="markdown-xla-hlo-canonicalizations" name="xla-hlo-canonicalizations"></a>
+<a id="markdown-XLA%20HLO%3A%20Canonicalizations" name="XLA%20HLO%3A%20Canonicalizations"></a>
 
 Very little effort has been applied to `xla_hlo` optimizations and there are a
 significant number of missing folders, canonicalizers, and simple
@@ -106,7 +108,7 @@
 
 ### XLA HLO: Tensor to Primitive Conversion
 
-<a id="markdown-xla-hlo-tensor-to-primitive-conversion" name="xla-hlo-tensor-to-primitive-conversion"></a>
+<a id="markdown-XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion" name="XLA%20HLO%3A%20Tensor%20to%20Primitive%20Conversion"></a>
 
 HLO only operates on tensor values - even for simple scalars - and this presents
 a problem when attempting to determine which code should be specified to run on
@@ -171,7 +173,7 @@
 
 ### Quantization
 
-<a id="markdown-quantization" name="quantization"></a>
+<a id="markdown-Quantization" name="Quantization"></a>
 
 It's assumed that any work related to quantization/compression has happened
 prior to lowering into IREE dialects. Our plan is to use the proposed
@@ -189,7 +191,7 @@
 
 ## `flow`: Data- and Execution-Flow Modeling
 
-<a id="markdown-flow-data--and-execution-flow-modeling" name="flow-data--and-execution-flow-modeling"></a>
+<a id="markdown-%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling" name="%60flow%60%3A%20Data-%20and%20Execution-Flow%20Modeling"></a>
 
 The `flow` dialect is designed to allow us to extract as much concurrency as
 possible from a program and partition IR into the scheduling and execution
@@ -202,7 +204,7 @@
 
 ### Avoiding Readbacks with `flow.stream`
 
-<a id="markdown-avoiding-readbacks-with-flowstream" name="avoiding-readbacks-with-flowstream"></a>
+<a id="markdown-Avoiding%20Readbacks%20with%20%60flow.stream%60" name="Avoiding%20Readbacks%20with%20%60flow.stream%60"></a>
 
 A majority of the readbacks we have today (manifested as `flow.tensor.load.*`
 ops) will be removed when we have an
@@ -241,7 +243,7 @@
 
 ### Threading `flow.stream` through the CFG
 
-<a id="markdown-threading-flowstream-through-the-cfg" name="threading-flowstream-through-the-cfg"></a>
+<a id="markdown-Threading%20%60flow.stream%60%20through%20the%20CFG" name="Threading%20%60flow.stream%60%20through%20the%20CFG"></a>
 
 The current `flow.ex.stream.fragment`, as denoted by the `ex`perimental tag, is
 a temporary implementation designed to get the concept of streams lowered to the
@@ -288,7 +290,7 @@
 
 ### Predication of `flow.dispatch`
 
-<a id="markdown-predication-of-flowdispatch" name="predication-of-flowdispatch"></a>
+<a id="markdown-Predication%20of%20%60flow.dispatch%60" name="Predication%20of%20%60flow.dispatch%60"></a>
 
 While the
 [`flow.stream` threading through the CFG](#threading-flowstream-through-the-cfg)
@@ -322,7 +324,7 @@
 
 ### Deduping `flow.executable`s
 
-<a id="markdown-deduping-flowexecutables" name="deduping-flowexecutables"></a>
+<a id="markdown-Deduping%20%60flow.executable%60s" name="Deduping%20%60flow.executable%60s"></a>
 
 While still in the `flow` dialect, the executables are target-agnostic. This
 makes simple IR tree diffing a potential solution to deduplication. Since most
@@ -335,7 +337,7 @@
 
 ### Rematerializing CSE'd Expressions
 
-<a id="markdown-rematerializing-csed-expressions" name="rematerializing-csed-expressions"></a>
+<a id="markdown-Rematerializing%20CSE'd%20Expressions" name="Rematerializing%20CSE'd%20Expressions"></a>
 
 Common subexpression elimination is performed many times during lowering;
 however, there comes a point where the CSE can introduce false dependencies and
@@ -379,7 +381,7 @@
 
 ### Device Placement
 
-<a id="markdown-device-placement" name="device-placement"></a>
+<a id="markdown-Device%20Placement" name="Device%20Placement"></a>
 
 While still within the `flow` dialect, we have the ability to easily split
 streams and safely shuffle around operations. Target execution backends can opt
@@ -393,7 +395,7 @@
 
 ## `hal`: Hardware Abstraction Layer and Multi-Architecture Executables
 
-<a id="markdown-hal-hardware-abstraction-layer-and-multi-architecture-executables" name="hal-hardware-abstraction-layer-and-multi-architecture-executables"></a>
+<a id="markdown-%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables" name="%60hal%60%3A%20Hardware%20Abstraction%20Layer%20and%20Multi-Architecture%20Executables"></a>
 
 As the IREE HAL is designed almost 1:1 with a compute-only Vulkan API, many of
 the techniques classically used in real-time graphics apply. The benefit we have
@@ -404,7 +406,7 @@
 
 ### Allow Targets to Specify `hal.interface`s
 
-<a id="markdown-allow-targets-to-specify-halinterfaces" name="allow-targets-to-specify-halinterfaces"></a>
+<a id="markdown-Allow%20Targets%20to%20Specify%20%60hal.interface%60s" name="Allow%20Targets%20to%20Specify%20%60hal.interface%60s"></a>
 
 The `hal.interface` op specifies the ABI between the scheduler and the device
 containing the buffer bindings and additional non-buffer data (parameters,
@@ -426,7 +428,7 @@
 
 ### Target-specific Scheduling Specialization
 
-<a id="markdown-target-specific-scheduling-specialization" name="target-specific-scheduling-specialization"></a>
+<a id="markdown-Target-specific%20Scheduling%20Specialization" name="Target-specific%20Scheduling%20Specialization"></a>
 
 Though the `flow` dialect attempts to fuse as many ops as possible into dispatch
 regions, it's not always possible for all target backends to schedule a region
@@ -457,7 +459,7 @@
 
 ### Buffer Usage Tracking
 
-<a id="markdown-buffer-usage-tracking" name="buffer-usage-tracking"></a>
+<a id="markdown-Buffer%20Usage%20Tracking" name="Buffer%20Usage%20Tracking"></a>
 
 Many explicit hardware APIs require knowing how buffers are used along with
 where they should be located. For example, this additional information determines
@@ -489,7 +491,7 @@
 
 ### Batched Executable Caching and Precompilation
 
-<a id="markdown-batched-executable-caching-and-precompilation" name="batched-executable-caching-and-precompilation"></a>
+<a id="markdown-Batched%20Executable%20Caching%20and%20Precompilation" name="Batched%20Executable%20Caching%20and%20Precompilation"></a>
 
 For targets that may require runtime preprocessing of their executables prior to
 dispatch, such as SPIR-V or MSL, the IREE HAL provides a caching and batch
@@ -523,7 +525,7 @@
 
 ### Target-aware Executable Compression
 
-<a id="markdown-target-aware-executable-compression" name="target-aware-executable-compression"></a>
+<a id="markdown-Target-aware%20Executable%20Compression" name="Target-aware%20Executable%20Compression"></a>
 
 An advantage of representing executable binaries in IR after translation is that
 we can apply various post-compilation compression and minification techniques
@@ -548,7 +550,7 @@
 
 ### Target-aware Constant Compression
 
-<a id="markdown-target-aware-constant-compression" name="target-aware-constant-compression"></a>
+<a id="markdown-Target-aware%20Constant%20Compression" name="Target-aware%20Constant%20Compression"></a>
 
 It's still an area that needs more research, but one goal of the IREE design was
 to enable efficient target- and context-aware compression of large constants
@@ -564,7 +566,7 @@
 
 ### Command Buffer Stateful Deduplication
 
-<a id="markdown-command-buffer-stateful-deduplication" name="command-buffer-stateful-deduplication"></a>
+<a id="markdown-Command%20Buffer%20Stateful%20Deduplication" name="Command%20Buffer%20Stateful%20Deduplication"></a>
 
 The IREE HAL - much like Vulkan, on which it is based - eschews much of the state that
 traditional APIs have in favor of (mostly) immutable state objects (pipeline
@@ -580,7 +582,7 @@
 
 ### Resource Timeline
 
-<a id="markdown-resource-timeline" name="resource-timeline"></a>
+<a id="markdown-Resource%20Timeline" name="Resource%20Timeline"></a>
 
 A core concept of the IREE scheduler that allows for overlapping in-flight
 invocations is that of the resource timeline. This identifies module state that
@@ -617,7 +619,7 @@
 
 ### Transient Tensor Ringbuffer
 
-<a id="markdown-transient-tensor-ringbuffer" name="transient-tensor-ringbuffer"></a>
+<a id="markdown-Transient%20Tensor%20Ringbuffer" name="Transient%20Tensor%20Ringbuffer"></a>
 
 (When properly implemented) almost all buffers required during execution never
 escape the command buffers they are used in or outlive a single VM invocation. We can
@@ -660,7 +662,7 @@
 
 ### Timeline Semaphores on the Module ABI
 
-<a id="markdown-timeline-semaphores-on-the-module-abi" name="timeline-semaphores-on-the-module-abi"></a>
+<a id="markdown-Timeline%20Semaphores%20on%20the%20Module%20ABI" name="Timeline%20Semaphores%20on%20the%20Module%20ABI"></a>
 
 Function calls made across modules (either from C++ into the VM, VM->VM, or
 VM->C++) should be able to define timeline semaphores used to wait and signal on
@@ -681,7 +683,7 @@
 
 ### GPU-like CPU Scheduling
 
-<a id="markdown-gpu-like-cpu-scheduling" name="gpu-like-cpu-scheduling"></a>
+<a id="markdown-GPU-like%20CPU%20Scheduling" name="GPU-like%20CPU%20Scheduling"></a>
 
 One approach to using multiple cores on a CPU is to perform interior
 parallelization of operations via approaches such as OpenMP or library-call-based custom thread
@@ -727,7 +729,7 @@
 
 ## `vm`: Lightweight Virtual Machine
 
-<a id="markdown-vm-lightweight-virtual-machine" name="vm-lightweight-virtual-machine"></a>
+<a id="markdown-%60vm%60%3A%20Lightweight%20Virtual%20Machine" name="%60vm%60%3A%20Lightweight%20Virtual%20Machine"></a>
 
 The VM is designed as a dynamic linkage ABI, stable bytecode representation, and
 intermediate lowering IR. Many of the optimizations we can perform on it will
@@ -737,7 +739,7 @@
 
 ### Coroutines for Batching and Cooperative Scheduling
 
-<a id="markdown-coroutines-for-batching-and-cooperative-scheduling" name="coroutines-for-batching-and-cooperative-scheduling"></a>
+<a id="markdown-Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling" name="Coroutines%20for%20Batching%20and%20Cooperative%20Scheduling"></a>
 
 One of the largest features currently missing from the VM is coroutines (aka
 user-mode fiber scheduling). Coroutines are what will allow us to have multiple
@@ -799,7 +801,7 @@
 
 #### Cellular Batching
 
-<a id="markdown-cellular-batching" name="cellular-batching"></a>
+<a id="markdown-Cellular%20Batching" name="Cellular%20Batching"></a>
 
 Though coroutines help throughput, there is a way we've found to reduce latency
 that's been documented as
@@ -850,7 +852,7 @@
 
 ### Lowering to LLVM IR
 
-<a id="markdown-lowering-to-llvm-ir" name="lowering-to-llvm-ir"></a>
+<a id="markdown-Lowering%20to%20LLVM%20IR" name="Lowering%20to%20LLVM%20IR"></a>
 
 For scenarios where dynamic module loading is not required and entire modules
 can be compiled into applications, we can lower the VM IR to LLVM IR within
@@ -874,7 +876,7 @@
 
 ### Improved Type Support
 
-<a id="markdown-improved-type-support" name="improved-type-support"></a>
+<a id="markdown-Improved%20Type%20Support" name="Improved%20Type%20Support"></a>
 
 Currently the VM only supports two types: `i32` and `vm.ref<T>`. This is an
 intentional limitation such that we can determine what is really needed to
@@ -892,7 +894,7 @@
 
 ### Indirect Command Buffer/On-Accelerator Execution
 
-<a id="markdown-indirect-command-bufferon-accelerator-execution" name="indirect-command-bufferon-accelerator-execution"></a>
+<a id="markdown-Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution" name="Indirect%20Command%20Buffer%2FOn-Accelerator%20Execution"></a>
 
 Though IREE will use many different tricks such as
 [predication](#predication-of-flowdispatch) to build deep pipelines, there is