Adding description of IREE Codegen pass pipeline.
This change adds a description of the IREE Codegen pass pipeline used
to generate SPIR-V code. The LLVM side is TODO.
Also includes a minor change to make the code consistent with the
description.
diff --git a/docs/design_docs/codegen_passes.md b/docs/design_docs/codegen_passes.md
new file mode 100644
index 0000000..5177d37
--- /dev/null
+++ b/docs/design_docs/codegen_passes.md
@@ -0,0 +1,636 @@
+# IREE CPU/GPU Code Generation Pipeline
+
+This document provides an overview of the codegen pipeline within IREE used to
+generate CPU/GPU code. It describes the main passes used, the objective of each
+pass, the current implementation, and what each pass is expected to achieve in
+the long term.
+
+Note that while the code generation pipeline supports dynamic shapes, this work
+is very preliminary and is not described here.
+
+## Input to the codegen pipeline
+
+The input to the code generation pipeline is the module within the
+`hal.executable.target` operation. Functions within this module that do __not__
+have `Visibility::Private` are the *entry point* functions of the dispatch
+region. These are the functions that are *invoked* by the IREE runtime. In
+addition, each dispatch region also contains a `hal.interface` operation that
+describes the ABI to use for the dispatch region. Two examples of the input to
+the code generation pipeline are shown below. In both of these, a single
+dispatch function contains a sequence of MHLO operations that the dispatch
+region creation has grouped into a single region. Ideally the grouped operations
+are fused into a single kernel.
+
+```mlir
+hal.executable.target "vulkan*" {
+ module attributes {spv.target_env = ...} {
+ func @main_ex_dispatch() {
+ %c0 = constant 0 : index
+ %0 = hal.interface.load.tensor @legacy_io::@arg0,
+ offset = %c0 : tensor<4x5xf32>
+ %1 = hal.interface.load.tensor @legacy_io::@arg1,
+ offset = %c0 : tensor<5x10xf32>
+ %2 = "mhlo.dot"(%0, %1) {precision_config = ["DEFAULT", "DEFAULT"]} :
+ (tensor<4x5xf32>, tensor<5x10xf32>) -> tensor<4x10xf32>
+ hal.interface.store.tensor %2, @legacy_io::@ret0,
+ offset = %c0 : tensor<4x10xf32>
+ return
+ }
+ hal.interface @legacy_io attributes {sym_visibility = "private"} {
+ hal.interface.binding @arg0, set=0, binding=0,
+ type="StorageBuffer", access="Read"
+ hal.interface.binding @arg1, set=0, binding=1,
+ type="StorageBuffer", access="Read"
+ hal.interface.binding @ret0, set=0, binding=2,
+ type="StorageBuffer", access="Write|Discard"
+ }
+ }
+}
+```
+<a name="snippet1"></a>
+Snippet 1: Dispatch region with a matrix-matrix multiply operation.
+
+```mlir
+hal.executable.target "vulkan*" {
+ module attributes {spv.target_env = ...} {
+ func @main_ex_dispatch() {
+ %c0 = constant 0 : index
+ %0 = hal.interface.load.tensor @legacy_io::@arg0,
+ offset = %c0 : tensor<10x5xf32>
+ %1 = hal.interface.load.tensor @legacy_io::@arg1,
+ offset = %c0 : tensor<10x5xf32>
+ %2 = hal.interface.load.tensor @legacy_io::@arg2,
+ offset = %c0 : tensor<10x5xf32>
+ %3 = "mhlo.add"(%0, %1) :
+ (tensor<10x5xf32>, tensor<10x5xf32>) -> tensor<10x5xf32>
+ %4 = "mhlo.multiply"(%3, %2) :
+ (tensor<10x5xf32>, tensor<10x5xf32>) -> tensor<10x5xf32>
+ hal.interface.store.tensor %4, @legacy_io::@ret0,
+ offset = %c0 : tensor<10x5xf32>
+ return
+ }
+ hal.interface @legacy_io attributes {sym_visibility = "private"} {
+ hal.interface.binding @arg0, set=0, binding=0,
+ type="StorageBuffer", access="Read"
+ hal.interface.binding @arg1, set=0, binding=1,
+ type="StorageBuffer", access="Read"
+ hal.interface.binding @arg2, set=0, binding=2,
+ type="StorageBuffer", access="Read"
+ hal.interface.binding @ret0, set=0, binding=3,
+ type="StorageBuffer", access="Write|Discard"
+ }
+ }
+}
+```
+<a name="snippet2"></a>
+Snippet 2: Dispatch region with element-wise operations.
+
+__Roadmap Note__: The current implementation might not actually fuse the
+operations grouped into a dispatch region into a single kernel. It is possible
+to end up with multiple kernels per dispatch region. Over time we plan to address
+this by using fusion at different levels (see below).
+
+The inputs to the dispatch region are materialized within the entry point
+function using the `hal.interface.load.tensor` operation. This operation returns
+a `tensor` view of the buffer used to store the inputs. Similarly, the results
+of the dispatch region are *written* out using the `hal.interface.store.tensor`
+operation.
+
+The main constraint that the code generation operates under is that it should
+not require additional (temporary) buffers to execute the operations grouped
+together within a dispatch region. The rationale behind this constraint is that
+buffer allocation/synchronization in IREE happens at the granularity of dispatch
+regions, allowing the scheduler to make better decisions about where to insert
+the appropriate synchronizations.
+
+The IR after all the passes used in the lowering from MHLO to SPIR-V for the
+above two examples can be found here ([matrix-matrix multiply op][DotAfterAll],
+[element-wise ops][PwAfterAll]). Below is a description of the major passes used.
+
+## Conversion from MHLO dialect to Linalg on buffers
+
+The code generation pipeline relies heavily on the use of [Structured
+Operations][LinalgRationale], specifically the [Linalg Dialect][LinalgDialect].
+Both the Linalg operations on `tensor`s and those on `memref`s are central to
+the progressive lowering approach followed here. The first part of the code
+generation pipeline converts the MHLO operations on `tensor`s to Linalg
+operations on `memref`s. This part of the pipeline is common to both CPU and
+GPU code generation.
+
+The steps involved in this conversion are shown below, with each of the arrows
+representing a pass in the pipeline. The next sections describe each of these
+passes in more detail.
+
+![MHLO to Linalg on buffers](./hlo_to_linalg.png)
+
+### MHLO to Linalg on tensors
+
+The first step is to convert MHLO operations to Linalg on tensors. This is done
+using the [HLOToLinalgPass][HLOToLinalgPass] from Tensorflow. An example of the
+conversion is shown below, where the `mhlo.add` and `mhlo.multiply` operations
+are each converted to a `linalg.generic` operation on tensors.
+
+```mlir
+#map0 = affine_map<(d0, d1) -> (d0, d1)>
+%3 = linalg.generic
+ {args_in = 2 : i64, args_out = 1 : i64,
+ indexing_maps = [#map0, #map0, #map0],
+ iterator_types = ["parallel", "parallel"]} %0, %1 {
+ ^bb0(%arg0: f32, %arg1: f32): // no predecessors
+ %5 = addf %arg0, %arg1 : f32
+ linalg.yield %5 : f32
+ } : tensor<10x5xf32>, tensor<10x5xf32> -> tensor<10x5xf32>
+%4 = linalg.generic
+ {args_in = 2 : i64, args_out = 1 : i64,
+ indexing_maps = [#map0, #map0, #map0],
+ iterator_types = ["parallel", "parallel"]} %3, %2 {
+ ^bb0(%arg0: f32, %arg1: f32): // no predecessors
+ %5 = mulf %arg0, %arg1 : f32
+ linalg.yield %5 : f32
+ }: tensor<10x5xf32>, tensor<10x5xf32> -> tensor<10x5xf32>
+```
+<a name="snippet3"></a>
+Snippet 3: MHLO to Linalg conversion for the [element-wise operations](#snippet2).
+
+At the time of writing, the representation of Linalg on `tensor`s does not model
+reduction iterator types completely. Specifically, reduction in Linalg is
+modeled using a read-modify-write approach, i.e. each iteration of the reduction
+loop reads the value stored in the output, adds its contribution, and writes
+back to the same location. This means the output has to be *initialized* to the
+null element of the reduction operator (i.e. 0 if the reduction is done using
+addition). This works for operations on buffers, but since tensors are SSA
+values that cannot be updated in place, the reduction semantics do not map as
+well to `tensor`s. For now it is treated as a convention that when the Linalg
+operation is converted to use `memref`s, the output has to be initialized
+appropriately before performing the reduction. Due to this, the conversion from
+an MHLO op to a Linalg op is only done for operations that do not need a
+*reduction* iterator type in the converted Linalg op. Consequently, only
+element-wise operations, broadcast operations and data movement operations
+(like copy and transpose) are converted to Linalg operations at this stage.
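+
+For example, summing each row of a `10x5` buffer into a `10`-element result
+would be expressed as below once lowered to `memref`s (an illustrative sketch,
+not IR produced by the pipeline); the `linalg.fill` provides the initialization
+required by the convention described above.
+
+```mlir
+#map0 = affine_map<(d0, d1) -> (d0, d1)>
+#map1 = affine_map<(d0, d1) -> (d0)>
+// Initialize the output to the null element of addition.
+linalg.fill(%out, %cst) : memref<10xf32>, f32
+// Each iteration of the reduction loop (d1) reads %out, adds its
+// contribution, and writes back to the same location.
+linalg.generic
+ {args_in = 1 : i64, args_out = 1 : i64,
+ indexing_maps = [#map0, #map1],
+ iterator_types = ["parallel", "reduction"]} %in, %out {
+ ^bb0(%arg0: f32, %arg1: f32): // no predecessors
+ %0 = addf %arg1, %arg0 : f32
+ linalg.yield %0 : f32
+ }: memref<10x5xf32>, memref<10xf32>
+```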
+
+__Roadmap Note__: One long-term solution for the above is for operations on
+tensors that have a *reduction* iterator type to take an additional argument
+containing the initial value of the result tensor. When the operation is
+converted to use `memref`s, the buffer for the initial value operand can be
+reused for the result. The details involved have not been fully worked out yet.
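+
+A purely hypothetical sketch of what this might look like is shown below; the
+`init(...)` syntax does not exist and is used here only to illustrate passing
+the initial value explicitly.
+
+```mlir
+// Hypothetical syntax: %init carries the initial value of the result tensor,
+// and its buffer can be reused for the result after conversion to memrefs.
+%result = linalg.generic
+ {args_in = 1 : i64, args_out = 1 : i64,
+ indexing_maps = [#map0, #map1],
+ iterator_types = ["parallel", "reduction"]} %in init(%init) {
+ ^bb0(%arg0: f32, %arg1: f32):
+ %0 = addf %arg1, %arg0 : f32
+ linalg.yield %0 : f32
+ }: tensor<10x5xf32>, tensor<10xf32> -> tensor<10xf32>
+```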
+
+### Fusion of Linalg on tensor operations
+
+The Linalg on `tensor` operations generated in the previous step are fused using
+the [LinalgFusionOfTensorOps][LinalgFusionOfTensorOps] pass from MLIR. Since
+`tensor`s are SSA values, fusion at this stage can be done without using alias
+analysis or dependence analysis based on reads and writes. Instead the use-def
+chains of the `tensor` values can be used to implement producer-consumer
+fusion. This stage fuses most element-wise operations, broadcast operations and
+data movement operations. An example of the fused op is shown below.
+
+```mlir
+#map0 = affine_map<(d0, d1) -> (d0, d1)>
+%3 = linalg.generic
+ {args_in = 3 : i64, args_out = 1 : i64,
+ indexing_maps = [#map0, #map0, #map0, #map0],
+ iterator_types = ["parallel", "parallel"]} %0, %1, %2 {
+ ^bb0(%arg0: f32, %arg1: f32, %arg2: f32): // no predecessors
+ %4 = addf %arg0, %arg1 : f32
+ %5 = mulf %4, %arg2 : f32
+ linalg.yield %5 : f32
+ }: tensor<?x5xf32>, tensor<?x5xf32>, tensor<?x5xf32> -> tensor<?x5xf32>
+```
+<a name="snippet4"></a>
+Snippet 4: Fusion of the Linalg operations on tensors for the element-wise
+operations shown in [Snippet 3](#snippet3).
+
+### Conversion of Linalg on tensors to Linalg on buffers
+
+Post fusion, all the operations on `tensor`s are converted to analogous
+operations on `memref`s. In general, this requires a buffer allocation pass. In
+IREE, buffer allocation happens at the granularity of dispatch regions, and as
+mentioned [earlier](#input-to-the-codegen-pipeline), the dispatch region is not
+expected to use any additional temporary buffers. So instead of having another
+buffer allocation pass within the code generation pipeline, a simpler approach
+is used within IREE:
+
+- For each `hal.interface.store.tensor`, an `iree.placeholder` operation is
+ created. The latter uses the same `hal.interface.binding` as the former, but
+ returns a `memref` view of the output of the dispatch region instead of a
+ `tensor` view. This `iree.placeholder` operation is added to the start of the
+ entry point function (see the sketch after this list).
+
+- A map is constructed that records, for a given `tensor`, the `memref` value to
+ use during the conversion. In this map, the `tensor` value used in a
+ `hal.interface.store.tensor` is mapped to the `memref` value returned by the
+ created `iree.placeholder` operation.
+
+- The Dialect Conversion framework is used to implement a set of patterns that
+ convert from operations on `tensor`s to operations on `memref`s:
+
+ - A `hal.interface.load.tensor` is replaced with an `iree.placeholder` to
+ get the `memref` view of the input to the dispatch region.
+ - All Linalg operations on `tensor`s (expected to be just `linalg.generic`
+ or `linalg.indexed_generic` operations) are converted to the
+ corresponding operation on `memref`s. Instead of returning a `tensor`
+ value, the converted operation takes an additional `memref` operand as
+ argument. This `memref` is where the result of the operation is
+ populated. The current implementation looks up the `memref` to use in the
+ map constructed previously. If there is no `memref` associated with the
+ result `tensor`, the conversion fails.
+ - At this stage, any `mhlo` operations not converted to Linalg operations
+ are directly converted to Linalg operations on buffers. This is done
+ for operations that, when converted to Linalg, have a *reduction* iterator
+ type. Some examples of ops converted this way are:
+
+ - `mhlo.dot`
+ - `mhlo.reduce`
+ - `mhlo.conv`
+ - `mhlo.reduce_window`
+
+ Since the specification of these Linalg operations requires the output
+ `memref` to be initialized appropriately, a `linalg.fill` operation is
+ used to achieve this.
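+
+For example, the `hal.interface.store.tensor` in [Snippet 2](#snippet2) results
+in an `iree.placeholder` of the form shown below (a sketch of just this step;
+the complete IR after the conversion is shown in Snippet 6), and the stored
+`tensor` value is mapped to the `memref` this operation returns.
+
+```mlir
+// Uses the same binding (@legacy_io::@ret0) as the hal.interface.store.tensor,
+// but returns a memref view of the output buffer.
+%0 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@ret0} : memref<10x5xf32>
+```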
+
+__Roadmap Note__: Right now the code-generation pipeline relies on fusion of
+operations at the tensor level. In the near future, we want to be able to fuse
+operations like `linalg.matmul` and `linalg.conv` with consumers/producers that
+are element-wise operations, using the [fusion of Linalg operations on
+`memref`s][LinalgFusionOnBuffers].
+
+At this stage of the compilation, all operations must have been converted to
+Linalg operations on buffers. Shown below is the IR at the end of this stage
+for the two examples in Snippets 1 and 2.
+
+```mlir
+func @main_ex_dispatch() {
+ %0 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@ret0} : memref<4x10xf32>
+ %c0 = constant 0 : index
+ %1 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg0} : memref<4x5xf32>
+ %2 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg1} : memref<5x10xf32>
+ %cst = constant 0.000000e+00 : f32
+ linalg.matmul(%1, %2, %0) :
+ memref<4x5xf32>, memref<5x10xf32>, memref<4x10xf32>
+ return
+}
+```
+<a name="snippet5"></a>
+Snippet 5: Matrix-matrix multiply after conversion to a Linalg operation on
+`memref`s.
+
+```mlir
+#map0 = affine_map<(d0, d1) -> (d0, d1)>
+func @main_ex_dispatch() {
+ %0 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@ret0} : memref<10x5xf32>
+ %c0 = constant 0 : index
+ %1 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg0} : memref<10x5xf32>
+ %2 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg1} : memref<10x5xf32>
+ %3 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg2} : memref<10x5xf32>
+ linalg.generic
+ {args_in = 3 : i64, args_out = 1 : i64,
+ indexing_maps = [#map0, #map0, #map0, #map0],
+ iterator_types = ["parallel", "parallel"]} %1, %2, %3, %0 {
+ ^bb0(%arg0: f32, %arg1: f32, %arg2: f32, %arg3: f32): // no predecessors
+ %4 = addf %arg0, %arg1 : f32
+ %5 = mulf %4, %arg2 : f32
+ linalg.yield %5 : f32
+ }: memref<10x5xf32>, memref<10x5xf32>, memref<10x5xf32>, memref<10x5xf32>
+ return
+}
+```
+<a name="snippet6"></a>
+Snippet 6: Element-wise operations after conversion to Linalg operations on
+`memref`s.
+
+The rest of the code generation differs depending on whether the compilation
+targets the CPU (using LLVM) or the GPU (using SPIR-V).
+
+## Conversion from Linalg on buffers to SPIR-V dialect
+
+The following sections describe the progressive lowering of Linalg operations
+on buffers to the SPIR-V dialect. Once lowered to the SPIR-V dialect, the IR
+can be serialized into a SPIR-V binary using the [serialization mechanism
+provided by the SPIR-V dialect][SpirvSerialization]. The steps involved in the
+lowering are shown below, with each of the arrows representing a pass. The next
+sections describe these passes in more detail.
+
+![Linalg on buffers to SPIR-V](./linalg_to_spirv.png)
+
+### Tiling and fusion on buffer operations
+
+GPU hardware typically provides multiple levels of compute hierarchy, namely
+the *workgroup* level, *subgroup* level and *workitem* level. These map to
+blocks, warps and threads, respectively, in CUDA terminology. Tiling is a way
+to map the computation to each level of the compute hierarchy. For example,
+3-D tiling of a `linalg.matmul` operation decomposes the computation into
+several tiled matrix-matrix multiplies. The [tiling transformation in the
+Linalg dialect][LinalgTiling] generates the outer loops that iterate over the
+tiled `linalg.matmul` operations. These outer loops can be mapped to different
+workgroups if they are parallel. The tiled `linalg.matmul` operation can be
+further tiled to map to subgroups. Finally, the tiled operation can be lowered
+to loops with individual iterations mapped to workitems. The
+[LinalgTileAndFusePass][LinalgTileAndFuse] uses the Linalg tiling patterns
+([defined here][LinalgTilingPatterns]) to tile operations like `linalg.matmul`,
+`linalg.conv` and `linalg.*_pooling`. The result of tiling the code in Snippet 5
+is shown below. As expected, there are two parallel loops that iterate over
+tiles of the original iteration space (i.e. inter-tile loops) and can be
+distributed to workgroups.
+
+```mlir
+func @main_ex_dispatch_0()
+ attributes {
+ spv.entry_point_abi = {local_size = dense<[8, 8, 1]> : vector<3xi32>}} {
+ %cst = constant 0.000000e+00 : f32
+ %c0 = constant 0 : index
+ %c4 = constant 4 : index
+ %c5 = constant 5 : index
+ %c8 = constant 8 : index
+ %c10 = constant 10 : index
+ %0 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@ret0} : memref<4x10xf32>
+ %1 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg0} : memref<4x5xf32>
+ %2 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg1} : memref<5x10xf32>
+ linalg.fill(%0, %cst) : memref<4x10xf32>, f32
+ scf.parallel (%arg0, %arg1) = (%c0, %c0) to (%c4, %c10) step (%c8, %c8) {
+ scf.for %arg2 = %c0 to %c5 step %c4 {
+ ...
+ %5 = subview %1[%arg0, %arg2]...
+ ...
+ %8 = subview %2[%arg2, %arg1]...
+ ...
+ %11 = subview %0[%arg0, %arg1]...
+ linalg.matmul {__internal_linalg_transform__ = "workgroup"} %5, %8, %11...
+ }
+ scf.yield
+ }
+ return
+}
+```
+<a name="snippet7"></a>
+Snippet 7: `linalg.matmul` after tiling.
+
+#### Tile Size and Workgroup Size
+
+When operations that are to be tiled exist within the dispatch function (like
+`linalg.matmul` or `linalg.conv`), this pass also decides
+1. the tile size to be used for the tiling, and
+1. the workgroup size to be used.
+
+The tile size and workgroup size are closely linked since the code within the
+tiled loops is to be executed collectively by the entire workgroup. In other
+words, all workitems in the workgroup collaborate to execute the tiled
+`linalg.matmul`. For example, in Snippet 7 the tile sizes for the two parallel
+dimensions are `8` and `8` (with `4` for the reduction dimension), and the
+workgroup size is `[8, 8, 1]`, so each workitem computes one element of the
+`8x8` output tile.
+
+__Roadmap Note__: Currently the tile sizes used in this pass are hard-wired.
+Not much effort has been put into finding ideal tile sizes for each operation
+on different hardware. The values used are meant to be a baseline to test
+functionality, with performance considerations addressed over time.
+
+#### Markers
+
+Downstream passes have to handle tiled Linalg operations and untiled Linalg
+operations, which might exist in the same function, in different ways. For
+example, while the former are to be executed collectively by workitems within a
+workgroup, the latter have to be executed by all workitems across
+workgroups. One way to distinguish these two kinds of operations is to use the
+marker mechanism in Linalg ([LinalgMarker][LinalgTilingPatterns]). This is a
+`StrAttr` whose value can be used to encode the scope of the operation. For
+example, in Snippet 7 above, the tiled `linalg.matmul` operation has a marker
+`workgroup` to indicate that this operation needs to be executed by a workgroup
+in a collective manner. At this time, the code-generation pipeline uses only
+the `workgroup` marker.
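+
+The sketch below (illustrative only, with details elided in the style of the
+snippets above) contrasts the two cases: the tiled `linalg.matmul` carries the
+marker, while the untiled `linalg.generic` in the same function does not.
+
+```mlir
+scf.parallel (%arg0, %arg1) = (%c0, %c0) to (%c4, %c10) step (%c8, %c8) {
+ ...
+ // Marked: executed collectively by the workitems of a single workgroup.
+ linalg.matmul {__internal_linalg_transform__ = "workgroup"} %5, %8, %11...
+ scf.yield
+}
+// Unmarked: later lowered to loops distributed across all workitems of all
+// workgroups.
+linalg.generic {args_in = 2 : i64, args_out = 1 : i64, ...} %12, %13, %14 {
+ ...
+}: memref<10x5xf32>, memref<10x5xf32>, memref<10x5xf32>
+```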
+
+__Roadmap Note__: Markers are meant to be short-lived, ideally set and consumed
+within the same pass. In the current pipeline the lifetime spans passes to allow
+lowering to different levels of the hierarchy. The separate passes that
+implement the lowering from Linalg to SPIR-V could be combined into a single
+pass, relying on the A -> B -> C translation mechanism of the Dialect
+Conversion framework to implement the progressive lowering. In the interest of
+separation of concerns and for better debuggability, these passes are kept
+separate at the cost of having the lifetimes of markers span passes.
+
+#### Promoting subviews to use workgroup memory and inserting synchronization
+
+`Workgroup` memory (or `shared memory` in CUDA terminology) can be used to
+prefetch the inputs to the tiled operation. For example, in the matrix-matrix
+multiply case, the same row (column) of the LHS (RHS) matrix is read by
+multiple workitems. Prefetching the data into `Workgroup` memory can reduce the
+number of loads from `StorageBuffer` memory by an order of magnitude. This
+transformation can be achieved by using the [Linalg promotion
+patterns][LinalgPromotionPatterns], which modify the `subview`s that are the
+operands of the tiled Linalg operation to use new `memref` objects. The size of
+each such `memref` is computed from the size of the `subview`. These `memref`
+objects are later lowered to use the `Workgroup` storage class. The snippet
+below shows this transformation when applied to `linalg.matmul` (along with
+tiling). The newly created `memref` objects are annotated with memory space `3`
+to indicate that they are to be lowered to use `Workgroup` memory. The copy of
+data from the original `memref`s into the new `memref`s, as well as the
+necessary synchronization constructs, are generated as well. Note that the
+memory space annotation used here is consistent with the [address space
+annotations used in NVVM][NVVMAddressSpace].
+
+```mlir
+func @matmul_tile()
+ attributes {
+ spv.entry_point_abi = {local_size = dense<[8, 8, 1]> : vector<3xi32>}} {
+ %c96 = constant 96 : index
+ %c4 = constant 4 : index
+ %c8 = constant 8 : index
+ %c0 = constant 0 : index
+ %c1 = constant 1 : index
+ %0 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg0} : memref<96x96xf32>
+ %1 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg1} : memref<96x96xf32>
+ %2 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@ret0} : memref<96x96xf32>
+ scf.parallel (%arg0, %arg1) = (%c0, %c0) to (%c96, %c96) step (%c8, %c8) {
+ scf.for %arg2 = %c0 to %c96 step %c4 {
+ ...
+ %5 = subview %0[%arg0, %arg2]...
+ ...
+ %8 = subview %1[%arg2, %arg1]...
+ ...
+ %11 = subview %2[%arg0, %arg1]...
+ %12 = alloc(%c8, %c4) : memref<?x?xf32, 3>
+ %13 = subview %12[%c0, %c0]...
+ %14 = alloc(%c4, %c8) : memref<?x?xf32, 3>
+ %15 = subview %14[%c0, %c0]...
+ linalg.copy(%5, %13) {__internal_linalg_transform__ = "workgroup"}
+ : memref<?x?xf32, #map2>, memref<?x?xf32, #map2, 3>
+ spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
+ linalg.copy(%8, %15) {__internal_linalg_transform__ = "workgroup"}
+ : memref<?x?xf32, #map2>, memref<?x?xf32, #map2, 3>
+ spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
+ linalg.matmul {__internal_linalg_transform__ = "workgroup"} %13, %15, %11...
+ spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
+ dealloc %12 : memref<?x?xf32, 3>
+ dealloc %14 : memref<?x?xf32, 3>
+ }
+ scf.yield
+ }
+ return
+}
+```
+
+<a name="snippet8"></a>
+Snippet 8: `linalg.matmul` after tiling and promotion of operand subviews to use
+`Workgroup` memory.
+
+### Distributing to workgroups and workitems
+
+After tiling, the operations within the dispatch function are either
+`scf.parallel` operations or Linalg operations.
+
+- The outer `scf.parallel` operations represent parallel loops that are to be
+ distributed across workgroups. The distribution here assumes that the number
+ of workgroups along each dimension is equal to the number of iterations of the
+ `scf.parallel` operation.
+
+- Linalg operations that are not tiled, and are therefore __not within__ `scf`
+ operations, are lowered to loops. The resulting outer `scf.parallel` operations
+ are collapsed to have a single induction variable. This loop is then
+ distributed across workitems using their `GlobalInvocationId` (which is the
+ same as `blockIdx * blockDim + threadIdx` in CUDA terminology).
+
+- Linalg operations that are tiled, and are therefore __within__ `scf`
+ operations, are lowered to loops, and the iterations of the `scf.parallel`
+ operations are mapped to workitems using their `LocalInvocationId` (which is
+ the same as `threadIdx` in CUDA terminology). Note that these operations are
+ tagged with the `workgroup` marker, which makes it easy to disambiguate them
+ from the case where Linalg operations are outside of `scf` operations. Here
+ too, the distribution assumes that the workgroup size is greater than or equal
+ to the number of iterations of the partitioned loop.
+
+These transformations are applied by the
+[`ConvertToGPUPass`][ConvertToGPU]. Below is the result of applying this pass to
+Snippet 7. The outer `scf.parallel` loop is distributed across workgroups. The
+tiled `linalg.matmul` operation is lowered to loops, and the outer
+`scf.parallel` operation generated during this lowering is distributed across
+workitems within the workgroup.
+
+```mlir
+func @main_ex_dispatch_0_dispatch_1()
+ attributes {
+ spv.entry_point_abi = {local_size = dense<[8, 8, 1]> : vector<3xi32>}} {
+ %c5 = constant 5 : index
+ %c8 = constant 8 : index
+ %c4 = constant 4 : index
+ %c0 = constant 0 : index
+ %c1 = constant 1 : index
+ %0 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@ret0} : memref<4x10xf32>
+ %1 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg0} : memref<4x5xf32>
+ %2 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg1} : memref<5x10xf32>
+ %3 = "gpu.block_id"() {dimension = "x"} : () -> index
+ %4 = muli %3, %c8 : index
+ scf.for %arg0 = %c0 to %c5 step %c4 {
+ ...
+ %9 = subview %1[0, %arg0]
+ ...
+ %14 = subview %2[%arg0, %4]
+ %15 = subview %0[0, %4]
+ %16 = "gpu.thread_id"() {dimension = "x"} : () -> index
+ %17 = "gpu.thread_id"() {dimension = "y"} : () -> index
+ %18 = cmpi "slt", %17, %c4 : index
+ %19 = cmpi "slt", %16, %13 : index
+ %20 = and %18, %19 : i1
+ scf.if %20 {
+ scf.for %arg1 = %c0 to %8 step %c1 {
+ %21 = load %9[%17, %arg1] : memref<4x?xf32, #map0>
+ %22 = load %14[%arg1, %16] : memref<?x?xf32, #map1>
+ %23 = load %15[%17, %16] : memref<4x?xf32, #map1>
+ %24 = mulf %21, %22 : f32
+ %25 = addf %23, %24 : f32
+ store %25, %15[%17, %16] : memref<4x?xf32, #map1>
+ }
+ }
+ }
+ return
+}
+```
+<a name="snippet9"></a>
+Snippet 9: `linalg.matmul` after distributing parallel inter-tile loops to
+workgroups and intra-tile loops to workitems.
+
+[Snippet 6](#snippet6) shows the fused element-wise operations represented using
+a `linalg.generic` operation. This operation is not tiled in the
+`LinalgTileAndFusePass`. So the `ConvertToGPUPass` lowers this operation to
+`scf.parallel` loops, which are collapsed into an `scf.parallel` operation with a
+single induction variable. This loop is then distributed across workitems using
+the `GlobalInvocationId`. The resulting IR is shown below.
+
+```mlir
+func @main_ex_dispatch_0()
+ attributes {
+ spv.entry_point_abi = {local_size = dense<[32, 1, 1]> : vector<3xi32>}} {
+ %c50 = constant 50 : index
+ %c5 = constant 5 : index
+ %0 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@ret0} : memref<10x5xf32>
+ %1 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg0} : memref<10x5xf32>
+ %2 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg1} : memref<10x5xf32>
+ %3 = iree.placeholder for "interface buffer"
+ {binding = @legacy_io::@arg2} : memref<10x5xf32>
+ %4 = "gpu.block_id"() {dimension = "x"} : () -> index
+ %5 = "gpu.block_dim"() {dimension = "x"} : () -> index
+ %6 = "gpu.thread_id"() {dimension = "x"} : () -> index
+ %7 = muli %4, %5 : index
+ %8 = addi %7, %6 : index
+ %9 = cmpi "slt", %8, %c50 : index
+ scf.if %9 {
+ %10 = divi_signed %8, %c5 : index
+ %11 = remi_signed %8, %c5 : index
+ %12 = load %1[%10, %11] : memref<10x5xf32>
+ %13 = load %2[%10, %11] : memref<10x5xf32>
+ %14 = load %3[%10, %11] : memref<10x5xf32>
+ %15 = addf %12, %13 : f32
+ %16 = mulf %15, %14 : f32
+ store %16, %0[%10, %11] : memref<10x5xf32>
+ }
+ return
+}
+```
+<a name="snippet10"></a>
+Snippet 10: Distributing the iterations of the element-wise operations for GPU
+execution.
+
+### Lowering to SPIR-V dialect
+
+The last step is to take the result of the previous pass and lower it to the
+SPIR-V dialect. Since the SPIR-V dialect is *closed*, i.e. it has its own type
+system, it is best to lower all the operations to SPIR-V in one step. This is
+done by applying all the patterns that lower the different IR constructs into
+SPIR-V within the [`ConvertToSPIRVPass`][ConvertToSPIRV]. These are:
+
+- [GPU dialect to SPIR-V conversion][GPUToSPIRV].
+- [SCF dialect to SPIR-V conversion][SCFToSPIRV].
+- [Standard dialect to SPIR-V conversion][StandardToSPIRV].
+- Patterns that lower the `iree.placeholder` operation into the corresponding
+ SPIR-V operations.
+
+Once these patterns are applied, the resulting IR is in the SPIR-V dialect and
+can be serialized to a SPIR-V binary.
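+
+For reference, below is a heavily elided sketch of the kind of IR this produces
+for the element-wise example; most types and operations are elided, and the
+complete IR for both examples is available through the links given
+[earlier](#input-to-the-codegen-pipeline).
+
+```mlir
+spv.module Logical GLSL450 requires #spv.vce<v1.0, [Shader], [...]> {
+ spv.globalVariable @arg0 bind(0, 0) : !spv.ptr<..., StorageBuffer>
+ ...
+ spv.func @main_ex_dispatch_0() "None" {
+ %0 = spv._address_of @arg0 : !spv.ptr<..., StorageBuffer>
+ ...
+ %10 = spv.AccessChain %0[...] : !spv.ptr<..., StorageBuffer>
+ %11 = spv.Load "StorageBuffer" %10 : f32
+ ...
+ spv.Store "StorageBuffer" %20, %19 : f32
+ spv.Return
+ }
+ spv.EntryPoint "GLCompute" @main_ex_dispatch_0
+ spv.ExecutionMode @main_ex_dispatch_0 "LocalSize", 32, 1, 1
+}
+```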
+
+[ConvertToGPU]: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToGPUPass.cpp
+[ConvertToSPIRV]: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToSPIRVPass.cpp
+[DotAfterAll]: https://gist.github.com/MaheshRavishankar/9e2d406296f469515c4a79bf1e7eef44
+[GPUToSPIRV]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Conversion/GPUToSPIRV/ConvertGPUToSPIRV.h
+[HLOToLinalgPass]: https://github.com/tensorflow/tensorflow/blob/75c40f6bff2faa3d90a375dfa4025b2e6e2d7a3d/tensorflow/compiler/mlir/xla/transforms/passes.h#L67
+[LinalgDialect]: https://mlir.llvm.org/docs/Dialects/Linalg/
+[LinalgFusionOnBuffers]: https://github.com/llvm/llvm-project/blob/ef868a848e6def288d2df7a1b3ebe09463afc8d0/mlir/include/mlir/Dialect/Linalg/Utils/Utils.h#L86
+[LinalgFusionOfTensorOps]: https://github.com/llvm/llvm-project/blob/80cb25cbd555f9634836b766c86aead435b60eaa/mlir/include/mlir/Dialect/Linalg/Passes.td#L30
+[LinalgPromotionPatterns]: https://github.com/llvm/llvm-project/blob/303a7f7a26e2aae1cb85f49dccbc0b5d14e0b2e0/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h#L358
+[LinalgRationale]: https://mlir.llvm.org/docs/Rationale/RationaleLinalgDialect/
+[LinalgTileAndFuse]: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/LinalgTileAndFusePass.cpp
+[LinalgTiling]: https://mlir.llvm.org/docs/Dialects/Linalg/#set-of-key-transformationsa-namekey_transformationsa
+[LinalgTilingPatterns]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
+[NVVMAddressSpace]: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#address-space
+[PwAfterAll]: https://gist.github.com/MaheshRavishankar/02cdd22f7c99e568f933244b5a679510
+[SCFToSPIRV]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Conversion/SCFToSPIRV/SCFToSPIRV.h
+[SpirvSerialization]: https://mlir.llvm.org/docs/Dialects/SPIR-V/#serialization-and-deserialization
+[StandardToSPIRV]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Conversion/StandardToSPIRV/ConvertStandardToSPIRV.h
diff --git a/docs/design_docs/hlo_to_linalg.png b/docs/design_docs/hlo_to_linalg.png
new file mode 100755
index 0000000..469ed26
--- /dev/null
+++ b/docs/design_docs/hlo_to_linalg.png
Binary files differ
diff --git a/docs/design_docs/linalg_to_spirv.png b/docs/design_docs/linalg_to_spirv.png
new file mode 100755
index 0000000..fd6aee7
--- /dev/null
+++ b/docs/design_docs/linalg_to_spirv.png
Binary files differ
diff --git a/iree/compiler/Conversion/LinalgToSPIRV/ConvertToGPUPass.cpp b/iree/compiler/Conversion/LinalgToSPIRV/ConvertToGPUPass.cpp
index 98e91fa..81514ec 100644
--- a/iree/compiler/Conversion/LinalgToSPIRV/ConvertToGPUPass.cpp
+++ b/iree/compiler/Conversion/LinalgToSPIRV/ConvertToGPUPass.cpp
@@ -563,7 +563,7 @@
ConversionPatternRewriter &rewriter) const override {
// Check for marker that specifies that the linalg op is to be partitioned
// across threads within a workgroup.
- if (!hasWorkItemMarker(linalgOp)) return failure();
+ if (!hasWorkGroupMarker(linalgOp)) return failure();
Optional<linalg::LinalgLoops> loops =
linalg::linalgLowerOpToLoops<scf::ParallelOp>(rewriter, linalgOp);
if (!loops) return failure();
@@ -587,7 +587,7 @@
LogicalResult matchAndRewrite(
LinalgOpTy linalgOp, ArrayRef<Value> operands,
ConversionPatternRewriter &rewriter) const override {
- if (!hasWorkItemMarker(linalgOp)) return failure();
+ if (!hasWorkGroupMarker(linalgOp)) return failure();
Optional<linalg::LinalgLoops> loops =
linalg::linalgLowerOpToLoops<scf::ParallelOp>(rewriter, linalgOp);
if (!loops) return failure();
diff --git a/iree/compiler/Conversion/LinalgToSPIRV/LinalgTileAndFusePass.cpp b/iree/compiler/Conversion/LinalgToSPIRV/LinalgTileAndFusePass.cpp
index e9dddd6..934e5ae 100644
--- a/iree/compiler/Conversion/LinalgToSPIRV/LinalgTileAndFusePass.cpp
+++ b/iree/compiler/Conversion/LinalgToSPIRV/LinalgTileAndFusePass.cpp
@@ -314,7 +314,7 @@
LogicalResult matchAndRewrite(Operation *op,
PatternRewriter &rewriter) const override {
- if (!hasWorkItemMarker(op)) return failure();
+ if (!hasWorkGroupMarker(op)) return failure();
return linalg::LinalgPromotionPattern<linalg::MatmulOp>::matchAndRewrite(
op, rewriter);
}
@@ -365,7 +365,7 @@
.setLoopType(linalg::LinalgTilingLoopType::ParallelLoops),
tileSizeCalculator.getWorkGroupSize(),
linalg::LinalgMarker(ArrayRef<Identifier>(),
- Identifier::get(getWorkItemMarker(), context)));
+ Identifier::get(getWorkGroupMarker(), context)));
applyPatternsAndFoldGreedily(getOperation(), tilingPatterns);
if (useWorkgroupMemory) {
@@ -385,7 +385,7 @@
[&](OpBuilder &b, Value src, Value dst) -> LogicalResult {
return copyToFromWorkgroupMemory(b, src, dst);
}),
- linalg::LinalgMarker(Identifier::get(getWorkItemMarker(), context),
+ linalg::LinalgMarker(Identifier::get(getWorkGroupMarker(), context),
Identifier::get(PromotionMarker, context)));
applyPatternsAndFoldGreedily(getOperation(), promotionPatterns);
}
@@ -394,7 +394,7 @@
OpBuilder builder(context);
funcOp.walk([&builder](linalg::LinalgOp linalgOp) {
if (hasMarker(linalgOp, PromotionMarker)) {
- setWorkItemMarker(linalgOp);
+ setWorkGroupMarker(linalgOp);
insertBarrierAfter(builder, linalgOp.getLoc(), linalgOp);
}
});
diff --git a/iree/compiler/Conversion/LinalgToSPIRV/MarkerUtils.cpp b/iree/compiler/Conversion/LinalgToSPIRV/MarkerUtils.cpp
index c874234..47747de 100644
--- a/iree/compiler/Conversion/LinalgToSPIRV/MarkerUtils.cpp
+++ b/iree/compiler/Conversion/LinalgToSPIRV/MarkerUtils.cpp
@@ -36,8 +36,6 @@
StringRef getWorkGroupMarker() { return "workgroup"; }
-StringRef getWorkItemMarker() { return "workitem"; }
-
bool hasMarker(Operation *op, StringRef marker) {
return checkMarkerValue(op, marker);
}
@@ -46,10 +44,6 @@
return checkMarkerValue(op, getWorkGroupMarker());
}
-bool hasWorkItemMarker(Operation *op) {
- return checkMarkerValue(op, getWorkItemMarker());
-}
-
void setMarker(Operation *op, StringRef marker) {
op->setAttr(linalg::LinalgTransforms::kLinalgTransformMarker,
StringAttr::get(marker, op->getContext()));
@@ -57,6 +51,5 @@
void setWorkGroupMarker(Operation *op) { setMarker(op, getWorkGroupMarker()); }
-void setWorkItemMarker(Operation *op) { setMarker(op, getWorkItemMarker()); }
} // namespace iree_compiler
} // namespace mlir
diff --git a/iree/compiler/Conversion/LinalgToSPIRV/MarkerUtils.h b/iree/compiler/Conversion/LinalgToSPIRV/MarkerUtils.h
index 36dccca..e512ead 100644
--- a/iree/compiler/Conversion/LinalgToSPIRV/MarkerUtils.h
+++ b/iree/compiler/Conversion/LinalgToSPIRV/MarkerUtils.h
@@ -31,7 +31,7 @@
namespace iree_compiler {
-/// Marker to denote that a linalg operation is to be partitioned to workitems.
-StringRef getWorkItemMarker();
+/// Marker to denote that a linalg operation is to be partitioned across
+/// workitems within a workgroup.
+StringRef getWorkGroupMarker();
/// Returns true if an operation has the specified `marker`. When `marker` is
/// empty, returns true if the operation has any marker.
@@ -39,14 +39,14 @@
-/// Returns true if an operation has marker to denote that it is to be
-/// partitioned to workitems.
-bool hasWorkItemMarker(Operation *);
+/// Returns true if an operation has a marker to denote that it is to be
+/// partitioned across workitems within a workgroup.
+bool hasWorkGroupMarker(Operation *);
/// Sets a given marker on an operation.
void setMarker(Operation *, StringRef);
-/// Sets marker to denote that a linalg operation is to be partitioned to
-/// workitems.
-void setWorkItemMarker(Operation *);
+/// Sets a marker to denote that a linalg operation is to be partitioned
+/// across workitems within a workgroup.
+void setWorkGroupMarker(Operation *);
} // namespace iree_compiler
} // namespace mlir
diff --git a/iree/compiler/Conversion/LinalgToSPIRV/test/convert_to_gpu.mlir b/iree/compiler/Conversion/LinalgToSPIRV/test/convert_to_gpu.mlir
index 679f523..64621f3 100644
--- a/iree/compiler/Conversion/LinalgToSPIRV/test/convert_to_gpu.mlir
+++ b/iree/compiler/Conversion/LinalgToSPIRV/test/convert_to_gpu.mlir
@@ -162,7 +162,7 @@
%12 = dim %arg2, %c1 : memref<?x?xf32>
%13 = affine.min #map0(%arg4)[%12]
%14 = subview %arg2[%arg3, %arg4] [%11, %13] [1, 1] : memref<?x?xf32> to memref<?x?xf32, #map2>
- linalg.matmul %5, %9, %14 {__internal_linalg_transform__ = "workitem"} : (memref<?x?xf32, #map2>, memref<?x?xf32, #map2>, memref<?x?xf32, #map2>)
+ linalg.matmul %5, %9, %14 {__internal_linalg_transform__ = "workgroup"} : (memref<?x?xf32, #map2>, memref<?x?xf32, #map2>, memref<?x?xf32, #map2>)
}
scf.yield
}
@@ -235,7 +235,7 @@
%13 = affine.min #map5(%arg5)[%4]
%14 = dim %arg2, %c3 : memref<?x?x?x?xf32>
%15 = subview %arg2[%arg3, %arg4, %arg5, 0] [%11, %12, %13, %14] [1, 1, 1, 1] : memref<?x?x?x?xf32> to memref<?x?x?x?xf32, #map3>
- linalg.conv(%arg0, %9, %15) {__internal_linalg_transform__ = "workitem", dilations = [1, 1], strides = [1, 1]} : memref<?x?x?x?xf32>, memref<?x?x?x?xf32, #map3>, memref<?x?x?x?xf32, #map3>
+ linalg.conv(%arg0, %9, %15) {__internal_linalg_transform__ = "workgroup", dilations = [1, 1], strides = [1, 1]} : memref<?x?x?x?xf32>, memref<?x?x?x?xf32, #map3>, memref<?x?x?x?xf32, #map3>
scf.yield
}
return
@@ -364,7 +364,7 @@
%9 = affine.min #map3(%arg3)[%2]
%10 = affine.min #map4(%arg4)[%3]
%11 = subview %arg2[%arg3, %arg4] [%9, %10] [1, 1] : memref<?x?xf32> to memref<?x?xf32, #map2>
- linalg.pooling_max(%8, %arg1, %11) {__internal_linalg_transform__ = "workitem", dilations = [1, 1], strides = [1, 1]} : memref<?x?xf32, #map2>, memref<?x?xf32>, memref<?x?xf32, #map2>
+ linalg.pooling_max(%8, %arg1, %11) {__internal_linalg_transform__ = "workgroup", dilations = [1, 1], strides = [1, 1]} : memref<?x?xf32, #map2>, memref<?x?xf32>, memref<?x?xf32, #map2>
scf.yield
}
return
diff --git a/iree/compiler/Conversion/LinalgToSPIRV/test/convert_to_gpu_option.mlir b/iree/compiler/Conversion/LinalgToSPIRV/test/convert_to_gpu_option.mlir
index 1701535..63f8aa5 100644
--- a/iree/compiler/Conversion/LinalgToSPIRV/test/convert_to_gpu_option.mlir
+++ b/iree/compiler/Conversion/LinalgToSPIRV/test/convert_to_gpu_option.mlir
@@ -32,7 +32,7 @@
%13 = affine.min #map5(%arg5)[%4]
%14 = dim %arg2, %c3 : memref<?x?x?x?xf32>
%15 = subview %arg2[%arg3, %arg4, %arg5, 0] [%11, %12, %13, %14] [1, 1, 1, 1] : memref<?x?x?x?xf32> to memref<?x?x?x?xf32, #map3>
- linalg.conv(%arg0, %9, %15) {__internal_linalg_transform__ = "workitem", dilations = [1, 1], strides = [1, 1]} : memref<?x?x?x?xf32>, memref<?x?x?x?xf32, #map3>, memref<?x?x?x?xf32, #map3>
+ linalg.conv(%arg0, %9, %15) {__internal_linalg_transform__ = "workgroup", dilations = [1, 1], strides = [1, 1]} : memref<?x?x?x?xf32>, memref<?x?x?x?xf32, #map3>, memref<?x?x?x?xf32, #map3>
scf.yield
}
return
diff --git a/iree/compiler/Conversion/LinalgToSPIRV/test/cyclic_to_workgroup.mlir b/iree/compiler/Conversion/LinalgToSPIRV/test/cyclic_to_workgroup.mlir
index 110ac24..cac18ab 100644
--- a/iree/compiler/Conversion/LinalgToSPIRV/test/cyclic_to_workgroup.mlir
+++ b/iree/compiler/Conversion/LinalgToSPIRV/test/cyclic_to_workgroup.mlir
@@ -27,7 +27,7 @@
%12 = dim %arg2, %c1 : memref<?x?xf32>
%13 = affine.min #map0(%arg4)[%12]
%14 = subview %arg2[%arg3, %arg4] [%11, %13] [1, 1] : memref<?x?xf32> to memref<?x?xf32, #map2>
- linalg.matmul %5, %9, %14 {__internal_linalg_transform__ = "workitem"} : (memref<?x?xf32, #map2>, memref<?x?xf32, #map2>, memref<?x?xf32, #map2>)
+ linalg.matmul %5, %9, %14 {__internal_linalg_transform__ = "workgroup"} : (memref<?x?xf32, #map2>, memref<?x?xf32, #map2>, memref<?x?xf32, #map2>)
}
scf.yield
}
diff --git a/iree/compiler/Conversion/LinalgToSPIRV/test/linalg_tile_and_fuse.mlir b/iree/compiler/Conversion/LinalgToSPIRV/test/linalg_tile_and_fuse.mlir
index 0e2fe6d..1728d35 100644
--- a/iree/compiler/Conversion/LinalgToSPIRV/test/linalg_tile_and_fuse.mlir
+++ b/iree/compiler/Conversion/LinalgToSPIRV/test/linalg_tile_and_fuse.mlir
@@ -51,7 +51,7 @@
// CHECK: %[[VIEW2:.+]] = subview %[[ARG2]]
// CHECK: linalg.conv
// CHECK-SAME: %[[ARG0]], %[[VIEW1]], %[[VIEW2]]
-// CHECK-SAME: "workitem"
+// CHECK-SAME: "workgroup"
// -----
@@ -81,7 +81,7 @@
// CHECK: %[[VIEW1:.+]] = subview %[[ARG1]]
// CHECK: %[[VIEW2:.+]] = subview %[[ARG2]]
// CHECK: linalg.matmul
-// CHECK-SAME: "workitem"
+// CHECK-SAME: "workgroup"
// CHECK-SAME: %[[VIEW0]], %[[VIEW1]], %[[VIEW2]]
// -----
@@ -111,4 +111,4 @@
// CHECK: %[[VIEW2:.+]] = subview %[[ARG2]]
// CHECK: linalg.pooling_max
// CHECK-SAME: %[[VIEW0]], %[[ARG1]], %[[VIEW2]]
-// CHECK-SAME: "workitem"
+// CHECK-SAME: "workgroup"
diff --git a/iree/compiler/Conversion/LinalgToSPIRV/test/workgroup_memory_promotion.mlir b/iree/compiler/Conversion/LinalgToSPIRV/test/workgroup_memory_promotion.mlir
index 76cfcb8..a24c77b 100644
--- a/iree/compiler/Conversion/LinalgToSPIRV/test/workgroup_memory_promotion.mlir
+++ b/iree/compiler/Conversion/LinalgToSPIRV/test/workgroup_memory_promotion.mlir
@@ -36,12 +36,12 @@
// CHECK: %[[ALLOC2:.+]] = alloc(%[[C4]], %[[C8]]) : memref<?x?xf32, 3>
// CHECK: %[[SUBVIEW2:.+]] = subview %[[ALLOC2]]
// CHECK: linalg.copy(%[[ARG0SV]], %[[SUBVIEW1]])
-// CHECK-SAME: "workitem"
+// CHECK-SAME: "workgroup"
// CHECK: spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
// CHECK: linalg.copy(%[[ARG1SV]], %[[SUBVIEW2]])
-// CHECK-SAME: "workitem"
+// CHECK-SAME: "workgroup"
// CHECK: spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
-// CHECK: linalg.matmul {{.*}}"workitem"{{.*}} %[[SUBVIEW1]], %[[SUBVIEW2]], %[[RET0SV]]
+// CHECK: linalg.matmul {{.*}}"workgroup"{{.*}} %[[SUBVIEW1]], %[[SUBVIEW2]], %[[RET0SV]]
// CHECK: spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
// CHECK-DAG: dealloc %[[ALLOC1]] : memref<?x?xf32, 3>
// CHECK-DAG: dealloc %[[ALLOC2]] : memref<?x?xf32, 3>