# IREE CPU/GPU Code Generation Pipeline
This document provides an overview of the code generation (codegen) pipeline
within IREE used to generate CPU/GPU code. It describes the main passes used, the
objective of each pass, the current implementation, and what each pass is expected
to achieve in the long term.
Note that while the code generation pipeline supports dynamic shapes, this support
is still very preliminary and is not described here.
## Input to the codegen pipeline
The input to the code generation pipeline is the module within the
`hal.executable.variant` operation. Functions within this module that do __not__
have `Visibility::Private` are the *entry point* functions of the dispatch
region. These are the functions that are *invoked* by the IREE runtime. In
addition, each dispatch region also contains a `hal.interface` operation that
describes the ABI to use for the dispatch region. Two examples of the input to
the code generation pipeline are shown below. In both of these, a single
dispatch function contains a sequence of MHLO operations that the dispatch
region creation has grouped into a single region. Ideally the grouped operations
are fused into a single kernel.
```mlir
hal.executable.variant "vulkan*" {
module attributes {spv.target_env = ...} {
func @main_ex_dispatch() {
%c0 = constant 0 : index
%0 = hal.interface.load.tensor @legacy_io::@arg0,
offset = %c0 : tensor<32x24xf32>
%1 = hal.interface.load.tensor @legacy_io::@arg1,
offset = %c0 : tensor<24x16xf32>
%2 = "mhlo.dot"(%0, %1) {precision_config = ["DEFAULT", "DEFAULT"]} :
(tensor<32x24xf32>, tensor<24x16xf32>) -> tensor<32x16xf32>
hal.interface.store.tensor %2, @legacy_io::@ret0,
offset = %c0 : tensor<32x16xf32>
return
}
hal.interface private @legacy_io {
hal.interface.binding @arg0, set=0, binding=0,
type="StorageBuffer", access="Read"
hal.interface.binding @arg1, set=0, binding=1,
type="StorageBuffer", access="Read"
hal.interface.binding @ret0, set=0, binding=2,
type="StorageBuffer", access="Write|Discard"
}
}
}
```
<a name="snippet1"></a> Snippet 1 : Dispatch region with matrix-matrix multiply
operation.
```mlir
hal.executable.variant "vulkan*" {
module attributes {spv.target_env = ...} {
func @main_ex_dispatch() {
%c0 = constant 0 : index
%0 = hal.interface.load.tensor @legacy_io::@arg0,
offset = %c0 : tensor<10x15xf32>
%1 = hal.interface.load.tensor @legacy_io::@arg1,
offset = %c0 : tensor<10x15xf32>
%2 = hal.interface.load.tensor @legacy_io::@arg2,
offset = %c0 : tensor<15xf32>
%3 = "mhlo.add"(%0, %1) :
(tensor<10x15xf32>, tensor<10x15xf32>) -> tensor<10x15xf32>
%4 = "mhlo.broadcast"(%2) : (tensor<15xf32>) -> tensor<10x15xf32>
%5 = "mhlo.multiply"(%3, %4) :
(tensor<10x15xf32>, tensor<10x15xf32>) -> tensor<10x15xf32>
hal.interface.store.tensor %5, @legacy_io::@ret0,
offset = %c0 : tensor<10x15xf32>
return
}
hal.interface private @legacy_io {
hal.interface.binding @arg0, set=0, binding=0,
type="StorageBuffer", access="Read"
hal.interface.binding @arg1, set=0, binding=1,
type="StorageBuffer", access="Read"
hal.interface.binding @arg2, set=0, binding=2,
type="StorageBuffer", access="Read"
hal.interface.binding @ret0, set=0, binding=3,
type="StorageBuffer", access="Write|Discard"
}
}
}
```
<a name="snippet2"></a> Snippet 2 : Dispatch region with element-wise
operations.
__Roadmap Note__: The current implementation might not actually fuse the
operations grouped into a dispatch region into a single kernel. It is possible
to end up with multiple kernels per dispatch region. Over time we plan to
address this by using fusion at different levels (see below).
The inputs to the dispatch region are materialized within the entry point
function using the `hal.interface.load.tensor` operation. This operation returns
a `tensor` view of the buffer used to store the inputs. Similarly, the results of
the dispatch region are *written* out using the `hal.interface.store.tensor`
operation.
The main constraint that the code generation operates under is that it should
not require additional (temporary) buffers to execute the operations grouped
together within a dispatch region. The rationale behind this constraint is that
buffer allocation/synchronization in IREE happens at the granularity of dispatch
regions, allowing the scheduler to make better decisions about where to insert
appropriate synchronizations.
The IR after all the passes used in the lowering from MHLO to SPIR-V for the
above two examples can be found here ([matrix-matrix multiply op][DotAfterAll],
[elementwise ops][PwAfterAll]). Below is a description of the major passes used.
## Conversion from MHLO dialect to Linalg on buffers
The code generation pipeline heavily relies on use of
[Structured Operations][LinalgRationale], specifically the
[Linalg Dialect][LinalgDialect]. Both the Linalg operations on `tensor`s and those
on `memref`s are central to the progressive lowering approach followed here. The
first part of the code generation pipeline converts the MHLO operations on
`tensor`s to Linalg operations on `memref`s. This part of the pipeline is common
to both CPU and GPU code generation.
The steps involved in this conversion are shown below. Each of the arrows
represents a pass in the pipeline:
![MHLO To Linalg on `memref` conversion](./hlo_to_linalg.png)
The next sections describe each of these passes in more detail.
### MHLO to Linalg on tensors
The first step is to convert MHLO operations to Linalg on tensors. This is done
using the [HLOToLinalgPass][HLOToLinalgPass] from Tensorflow. An example of the
conversion is shown below, where the `mhlo.add`, `mhlo.broadcast` and
`mhlo.multiply` operations are converted to `linalg.generic` operations on
tensors.
```mlir
#map0 = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d1)>
%3 = linalg.generic
{args_in = 2 : i64, args_out = 1 : i64,
indexing_maps = [#map0, #map0, #map0],
iterator_types = ["parallel", "parallel"]} %0, %1 {
^bb0(%arg0: f32, %arg1: f32): // no predecessors
%5 = addf %arg0, %arg1 : f32
linalg.yield %5 : f32
} : tensor<10x15xf32>, tensor<10x15xf32> -> tensor<10x15xf32>
%4 = linalg.generic
{args_in = 1 : i64, args_out = 1 : i64,
indexing_maps = [#map1, #map0],
iterator_types = ["parallel", "parallel"]} %2 {
^bb0(%arg0: f32): // no predecessors
linalg.yield %arg0 : f32
}: tensor<15xf32> -> tensor<10x15xf32>
%5 = linalg.generic
{args_in = 2 : i64, args_out = 1 : i64,
indexing_maps = [#map0, #map0, #map0],
iterator_types = ["parallel", "parallel"]} %3, %4 {
^bb0(%arg0: f32, %arg1: f32): // no predecessors
%5 = mulf %arg0, %arg1 : f32
linalg.yield %5 : f32
}: tensor<10x15xf32>, tensor<10x15xf32> -> tensor<10x15xf32>
```
<a name="snippet3"></a> Snippet 3 : MHLO to Linalg conversion for
[element-wise operations](#snippet2)
At the time of writing the representation of Linalg on `tensor`s does not model
reduction iterator types completely. Specifically, the reduction in Linalg is
modeled using a read-modify-write approach, i.e. each iteration of the reduction
loop reads the value stored in the output, adds its contribution, and writes it
back to the same location. This means the output has to be *initialized* to the
null element of the reduction operator (i.e. 0 if the reduction is done using
addition). This works for operations on buffers, but since tensors are SSA values
they cannot be updated in-place. As a result, the reduction semantics do not
map as well to `tensor`s. For now it is treated as a convention that when the
Linalg operation is converted to use `memref`s it has to be initialized
appropriately before performing the reduction. Due to this, the conversion from
MHLO op to Linalg op is only done for operations which do not need a *reduction*
iterator type in the converted Linalg op. Consequently, only element-wise
operations, broadcast operations and data movement operations (like copy and
transpose) are converted to Linalg operations at this stage.
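To make the initialization requirement concrete, here is a small Python sketch
(purely illustrative, not IREE or MLIR code) of the read-modify-write convention:
the output must already hold the identity of the reduction (0 for addition) before
the loop runs, which is natural for buffers but has no in-place analogue for SSA
`tensor` values.
```python
# Illustrative sketch of the read-modify-write reduction convention used by
# Linalg on buffers (not IREE code): a row-sum over a 2-D "buffer".

def row_sum(input_buffer, output_buffer):
    rows, cols = len(input_buffer), len(input_buffer[0])
    # The output must already hold the identity of the reduction (0 for add);
    # the loop only ever reads, modifies, and writes back in place.
    for i in range(rows):
        for j in range(cols):
            output_buffer[i] = output_buffer[i] + input_buffer[i][j]

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
out = [0.0, 0.0]   # analogous to the `linalg.fill` initialization on memrefs
row_sum(a, out)
print(out)         # [6.0, 15.0]
```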
__Roadmap note__: One long-term solution for the above is for operations on
tensors that have a *reduction* iterator type to take an additional argument that
contains the initial value of the result tensor. When the operation is converted
to use `memref`s, the buffer for the initial value operand can be reused for the
result. The details involved have not been fully worked out yet.
### Fusion of Linalg on tensor operations
The Linalg on `tensor` operations generated at the previous step are fused using
the [LinalgFusionOfTensorOps][LinalgFusionOfTensorOps] pass from MLIR. Since
`tensor`s are SSA values, fusion at this stage can be done without using alias
analysis or dependence analysis based on reads and writes. Instead the use-def
chains for the `tensor` values can be used to implement producer-consumer
fusion. This stage fuses most elementwise operations, broadcast operations and
data movement operations. An example of the fused op is shown below.
```mlir
#map0 = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d1)>
%3 = linalg.generic
{args_in = 3 : i64, args_out = 1 : i64,
indexing_maps = [#map0, #map0, #map1, #map0],
iterator_types = ["parallel", "parallel"]} %0, %1, %2 {
^bb0(%arg0: f32, %arg1: f32, %arg2: f32): // no predecessors
%4 = addf %arg0, %arg1 : f32
%5 = mulf %4, %arg2 : f32
linalg.yield %5 : f32
}: tensor<10x15xf32>, tensor<10x15xf32>, tensor<15xf32> -> tensor<10x15xf32>
```
<a name="snippet4"></a> Snippet 4: Fusion of Linalg operation on tensors for
element-wise operations shown in [Snippet 3](#snippet3)
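The payoff of this fusion can be pictured with a short Python sketch (illustrative
only, using NumPy as a stand-in): the unfused form materializes two full
intermediate tensors, while the fused form, mirroring the single `linalg.generic`
in Snippet 4, computes the final value in one pass over the output.
```python
# Illustrative sketch (not IREE code) of what producer-consumer fusion buys:
# the unfused form materializes two 10x15 intermediates, the fused form none.
import numpy as np

lhs = np.random.rand(10, 15).astype(np.float32)
rhs = np.random.rand(10, 15).astype(np.float32)
bias = np.random.rand(15).astype(np.float32)

# Unfused: one "kernel" per operation, with intermediate tensors in between.
add_result = lhs + rhs                               # intermediate tensor #1
broadcast_result = np.broadcast_to(bias, (10, 15))   # intermediate tensor #2
unfused = add_result * broadcast_result

# Fused: a single loop nest computing the final value element by element,
# mirroring the single linalg.generic in Snippet 4.
fused = np.empty((10, 15), dtype=np.float32)
for i in range(10):
    for j in range(15):
        fused[i, j] = (lhs[i, j] + rhs[i, j]) * bias[j]

assert np.allclose(unfused, fused)
```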
### Conversion of Linalg on tensors to Linalg on buffers
After fusion, all the operations on `tensor`s are converted to analogous operations
on `memref`s. In general, this requires a buffer allocation pass. In IREE,
buffer allocation happens at the granularity of dispatch regions, and as
mentioned [earlier](#input-to-the-codegen-pipeline), the dispatch region is not
expected to use any additional temporary buffers. So instead of having another
buffer allocation pass within the code generation pipeline, a simpler approach
is used within IREE:
- For each `hal.interface.store.tensor` an `iree.placeholder` operation is
created. The latter uses the same `hal.interface.binding` as the former, but
returns a `memref` view of the output of the dispatch region instead of a
`tensor` view. This `iree.placeholder` operation is added to the start of the
entry point function.
- A map is constructed that for a given `tensor` records the `memref` value to
use during the conversion. In this map the `tensor` value used in the
`hal.interface.store.tensor` is mapped to the `memref` value returned by the
created `iree.placeholder` operation.
- The Dialect Conversion framework is used to implement a set of patterns that
convert from operations on `tensor`s to operations on `memref`s:
- A `hal.interface.load.tensor` is replaced with an `iree.placeholder` to
get the `memref` view of the input to the dispatch region.
- All Linalg operations on `tensor`s (expected to be just `linalg.generic`
or `linalg.indexed_generic` operations) are converted to the
corresponding operations on `memref`s. Instead of returning a `tensor`
value, the converted operation takes an additional `memref` operand as an
argument. This `memref` is where the result of the operation is
populated. The current implementation looks up the `memref` to use in the
map constructed previously. If there is no `memref` associated with the
result `tensor`, the conversion fails (a sketch of this convention follows
the list).
- At this stage, any `mhlo` operations not converted to Linalg operations
are directly converted to Linalg operations on buffers. This is done
for operations that, when converted to Linalg, have a *reduction* iterator
type. Some examples of ops converted this way are
- `mhlo.dot`
- `mhlo.reduce`
- `mhlo.conv`
- `mhlo.reduce_window`
Since the specification of the Linalg operations requires the output
`memref` to be initialized appropriately, a `linalg.fill` operation is
used to achieve this.
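The convention above can be summarized with the following Python sketch (the names
and data structures are hypothetical, for illustration only, and not the actual
pass implementation): the `memref` for a result is only ever found through the map
seeded from the `hal.interface.store.tensor` operations, and the conversion fails
rather than allocating a temporary buffer.
```python
# Illustrative sketch (not the actual pass) of the buffer-assignment convention:
# every result tensor must already have a memref decided by a
# hal.interface.store.tensor; otherwise the conversion fails rather than
# allocating a temporary buffer.

def assign_buffers(store_ops):
    # Map from tensor SSA value -> memref created by an iree.placeholder that
    # uses the same hal.interface.binding as the store.
    tensor_to_memref = {}
    for store in store_ops:
        tensor_to_memref[store["tensor"]] = "placeholder(" + store["binding"] + ")"
    return tensor_to_memref

def convert_linalg_op(op, tensor_to_memref):
    result = op["result_tensor"]
    if result not in tensor_to_memref:
        raise RuntimeError("no memref for result tensor; conversion fails")
    # The converted op writes into the mapped memref instead of returning a
    # tensor value.
    return {"inputs": op["inputs"], "outputs": [tensor_to_memref[result]]}

stores = [{"tensor": "%5", "binding": "@legacy_io::@ret0"}]
mapping = assign_buffers(stores)
print(convert_linalg_op({"result_tensor": "%5", "inputs": ["%0", "%1", "%2"]},
                        mapping))
```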
__Roadmap Note__ : Right now the code-generation pipeline relies on fusion of
operations at the tensor level. In the near future, we want to be able to fuse
operations like `linalg.matmul` and `linalg.conv` with consumers/producers that
are element-wise operations using the
[fusion of Linalg operation on `memref`s][LinalgFusionOnBuffers].
At this stage of the compilation all operations must have been converted to
Linalg operations on buffers. Shown below is the IR at the end of this stage
for the two examples in Snippets 1 and 2.
```mlir
func @main_ex_dispatch() {
%0 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@ret0} : memref<32x16xf32>
%c0 = constant 0 : index
%1 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg0} : memref<32x24xf32>
%2 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg1} : memref<24x16xf32>
%cst = constant 0.000000e+00 : f32
linalg.matmul(%1, %2, %0) :
memref<32x24xf32>, memref<24x16xf32>, memref<32x16xf32>
return
}
```
<a name="snippet5"></a> Snippet 5 : Matrix-matrix multiply after conversion to
Linalg operation on `memref`s.
```mlir
#map0 = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d1)>
func @main_ex_dispatch() {
%0 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@ret0} : memref<10x15xf32>
%c0 = constant 0 : index
%1 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg0} : memref<10x15xf32>
%2 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg1} : memref<10x15xf32>
%3 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg2} : memref<15xf32>
linalg.generic
{args_in = 3 : i64, args_out = 1 : i64,
indexing_maps = [#map0, #map0, #map1, #map0],
iterator_types = ["parallel", "parallel"]} %1, %2, %3, %0 {
^bb0(%arg0: f32, %arg1: f32, %arg2: f32, %arg3: f32): // no predecessors
%4 = addf %arg0, %arg1 : f32
%5 = mulf %4, %arg2 : f32
linalg.yield %5 : f32
}: memref<10x15xf32>, memref<10x15xf32>, memref<15xf32>, memref<10x15xf32>
return
}
```
<a name="snippet6"></a> Snippet 6 : Elementwise operations after conversion to
Linalg operation on `memref`s
The rest of the code-generation differs on whether the compilation is for CPU
(using LLVM) or for GPU (using SPIR-V).
## Conversion from Linalg on buffers to SPIR-V dialect
The following sections describe the progressive lowering of Linalg operation on
buffers to SPIR-V dialect. Once lowered to the SPIR-V dialect, it can be
serialized into a SPIR-V binary using the
[serialization mechanism provided by the SPIR-V dialect][SpirvSerialization].
The steps involved in the lowering are described below, with each of the arrows
representing a pass.
![Linalg on `memref` to SPIR-V conversion](./linalg_to_spirv.png)
These passes are described below in more detail.
### Tiling and fusion on buffer operations
GPU hardware typically provides multiple levels of compute hierarchy, namely the
*workgroup* level, *subgroup* level and *workitem* level. These map to blocks,
warps and threads, respectively, in CUDA terminology. Tiling is a way to map the
computations to each level of the compute hierarchy. For example, 3-D tiling of a
`linalg.matmul` operation decomposes the computation into several tiled
matrix-matrix multiplies.
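As a mental model, the loop structure produced by 3-D tiling can be sketched in
Python as follows (this is illustrative only; the tile sizes are arbitrary): the
three outer loops enumerate tiles of the iteration space, and the body is itself a
small matrix-matrix multiply on one tile.
```python
# Illustrative sketch of 3-D tiling of a matmul (not IREE code). The three
# outer loops iterate over tiles of the (M, N, K) iteration space; the inner
# three loops are the "tiled matmul" on one tile.
import numpy as np

M, N, K = 32, 16, 24
TM, TN, TK = 8, 8, 4      # tile sizes, chosen only for illustration

lhs = np.random.rand(M, K).astype(np.float32)
rhs = np.random.rand(K, N).astype(np.float32)
out = np.zeros((M, N), dtype=np.float32)

for i0 in range(0, M, TM):          # inter-tile loop: parallel, maps to workgroups
    for j0 in range(0, N, TN):      # inter-tile loop: parallel, maps to workgroups
        for k0 in range(0, K, TK):  # inter-tile reduction loop: sequential
            # Tiled matmul on (TM x TK) * (TK x TN) subviews of the operands.
            for i in range(i0, i0 + TM):
                for j in range(j0, j0 + TN):
                    for k in range(k0, k0 + TK):
                        out[i, j] += lhs[i, k] * rhs[k, j]

assert np.allclose(out, lhs @ rhs)
```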
The [tiling transformation in the Linalg dialect][LinalgTiling] generates the
outer loops that iterate over tiled `linalg.matmul` operations. These outer
loops can be mapped to different workgroups, if they are parallel. The tiled
`linalg.matmul` operation can be further tiled to map to subgroups. Finally, the
tiled operation can be lowered to loops with individual iterations mapped to
workitems. The [LinalgTileAndFusePass][LinalgTileAndFuse] uses the Linalg Tiling
patterns ([defined here][LinalgTilingPatterns]) to tile operations like
`linalg.matmul`, `linalg.conv` and `linalg.*_pooling`. The result of tiling the
code in Snippet 5 is shown below. As expected, there are two parallel loops that
iterate over tiles of the original iteration space (i.e. inter-tile loops) and
can be distributed to workgroups.
```mlir
func @main_ex_dispatch_0()
attributes {
spv.entry_point_abi = {local_size = dense<[8, 8, 1]> : vector<3xi32>}} {
%cst = constant 0.000000e+00 : f32
%c32 = constant 32 : index
%c24 = constant 24 : index
%c16 = constant 16 : index
%c0 = constant 0 : index
%c4 = constant 4 : index
%c8 = constant 8 : index
%0 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@ret0} : memref<32x16xf32>
%1 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg0} : memref<32x24xf32>
%2 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg1} : memref<24x16xf32>
linalg.fill(%cst, %0) : f32, memref<32x16xf32>
scf.parallel (%arg0, %arg1) = (%c0, %c0) to (%c32, %c16) step (%c8, %c8) {
scf.for %arg2 = %c0 to %c24 step %c4 {
...
%5 = subview %1[%arg0, %arg2]...
...
%8 = subview %2[%arg2, %arg1]...
...
%11 = subview %0[%arg0, %arg1]...
linalg.matmul {__internal_linalg_transform__ = "workgroup"} %5, %8, %11...
}
scf.yield
}
return
}
```
<a name="snippet7"></a> Snippet 7 : `linalg.matmul` after tiling.
#### Tile Size and Workgroup Size
When operations that are to be tiled exist within the dispatch function (like
`linalg.matmul` or `linalg.conv`), this pass also decides:
1. The tile size to be used for the tiling.
1. The workgroup size to be used.
The tile size and workgroup size are closely linked since the code within the
tiled loops is to be collectively executed by the entire workgroup. In other
words, all workitems in the workgroup collaborate to execute the tiled
`linalg.matmul`.
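As a concrete example of this linkage, with the workgroup size `[8, 8, 1]` and the
8x8x4 tile sizes visible in Snippet 7, each workitem can be made responsible for
exactly one element of the 8x8 output tile. The check below is purely illustrative
arithmetic, not IREE code.
```python
# With a workgroup of 8x8 workitems and an 8x8 output tile, the mapping can be
# one output element per workitem (illustrative arithmetic only).
workgroup_size = (8, 8, 1)
tile_m, tile_n = 8, 8
elements_per_tile = tile_m * tile_n
workitems = workgroup_size[0] * workgroup_size[1] * workgroup_size[2]
assert elements_per_tile == workitems   # one element of the tile per workitem
```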
__Roadmap Note__ : Currently the tile sizes used in this pass are hard-wired.
Not much effort has been put into finding ideal tile sizes for each operation on
different hardware. The values used are meant to be a baseline to test
functionality, with performance considerations addressed over time.
#### Markers
Downstream passes have to handle tiled Linalg operations and untiled Linalg
operations that might exist in the same function in different ways. For example,
while the former are to be executed collectively by workitems within a
workgroup, the latter have to be executed by all workitems across workgroups.
One way to distinguish these two kinds of operations is to use the marker mechanism in
Linalg ([LinalgTransformationFilter][LinalgTilingPatterns]). This is a `StrAttr`
whose value can be used to encode the scope of the operation. For example, in
Snippet 7 above, the tiled `linalg.matmul` operation has a marker `workgroup` to
indicate that this operation needs to be executed by a workgroup in a collective
manner. At this time, the code-generation pipeline uses only the `workgroup`
marker.
__Roadmap Note__ : Markers are meant to be short-lived, ideally set and consumed
within the same pass. In the current pipeline the lifetime spans passes to allow
lowering to different hierarchies. The separate passes that implement the
lowering from Linalg to SPIR-V can be combined into a single pass, relying on the
A -> B -> C translation mechanism of the Dialect Conversion framework to implement
the progressive lowering. In the interest of separation of concerns and for better
debuggability these passes are kept separate at the cost of having lifetimes of
markers span passes.
#### Promoting subviews to use workgroup local memory and use of synchronizations
`Workgroup` memory (or `shared memory` in CUDA terminology) can be used to
prefetch the inputs to the tiled operation. For example in the matrix-matrix
multiply case, the same data row (column) of the LHS (RHS) matrix is read by
multiple workitems. Prefetching the data into `Workgroup` memory can reduce the
number of loads to `StorageClass` memory by an order of magnitude. This
transformation can be achieved by using the
[`Linalg Promotion`][LinalgPromotionPatterns] patterns, which modify the `subview`s that
are the operands to the tiled Linalg operation to use a new `memref` object. The
size of this `memref` is computed from the size of the `subview`. This `memref`
object is later lowered to use the `Workgroup` memory storage class. The snippet
below shows this transformation when applied to `linalg.matmul` (along with
tiling). The newly created `memref` objects are annotated with the memory space
`3` to indicate that they are to be lowered to use `Workgroup` memory. The copy
of data from the original `memref` into the new `memref`, along with the
necessary synchronization constructs, is generated as well. Note that the memory
space annotation used here is consistent with the
[address space annotations used in NVVM][NVVMAddressSpace].
```mlir
func @matmul_tile()
attributes {
spv.entry_point_abi = {local_size = dense<[8, 8, 1]> : vector<3xi32>}} {
%c32 = constant 32 : index
%c24 = constant 24 : index
%c16 = constant 16 : index
%c4 = constant 4 : index
%c8 = constant 8 : index
%c0 = constant 0 : index
%c1 = constant 1 : index
%0 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg0} : memref<32x24xf32>
%1 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg1} : memref<24x16xf32>
%2 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@ret0} : memref<32x16xf32>
scf.parallel (%arg0, %arg1) = (%c0, %c0) to (%c32, %c16) step (%c8, %c8) {
scf.for %arg2 = %c0 to %c24 step %c4 {
...
%5 = subview %0[%arg0, %arg2]...
...
%8 = subview %1[%arg2, %arg1]...
...
%11 = subview %2[%arg0, %arg1]...
%12 = alloc(%c8, %c4) : memref<?x?xf32, 3>
%13 = subview %12[%c0, %c0]...
%14 = alloc(%c4, %c8) : memref<?x?xf32, 3>
%15 = subview %14[%c0, %c0]...
linalg.copy(%5, %13) {__internal_linalg_transform__ = "workgroup"}
: memref<?x?xf32, #map2>, memref<?x?xf32, #map2, 3>
spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
linalg.copy(%8, %15) {__internal_linalg_transform__ = "workgroup"}
: memref<?x?xf32, #map2>, memref<?x?xf32, #map2, 3>
spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
linalg.matmul {__internal_linalg_transform__ = "workgroup"} %13, %15, %11...
spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"
dealloc %12 : memref<?x?xf32, 3>
dealloc %14 : memref<?x?xf32, 3>
}
scf.yield
}
return
}
```
<a name="snippet8"></a> Snippet 8: `linalg.matmul` after tiling and promotion of
operand subviews to use `Workgroup` memory.
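The reduction in the number of `StorageClass` loads can be estimated with simple,
illustrative arithmetic based on the 8x8x4 tile sizes seen in Snippet 8; the exact
savings on real hardware depend on caching and scheduling, so treat this as a
back-of-the-envelope sketch rather than a measurement.
```python
# Rough, illustrative count of global-memory loads per K-step of the tiled
# matmul in Snippet 8 (8x8 output tile, K-tile of 4), with and without
# promotion of the LHS/RHS subviews to Workgroup memory.
tile_m, tile_n, tile_k = 8, 8, 4

# Without promotion: every (i, j) workitem loads a row of the LHS tile and a
# column of the RHS tile, each of length tile_k, straight from global memory.
loads_without = tile_m * tile_n * (2 * tile_k)   # 512

# With promotion: the LHS (8x4) and RHS (4x8) tiles are copied into Workgroup
# memory once, and all reuse is then served from there.
loads_with = tile_m * tile_k + tile_k * tile_n   # 64

print(loads_without // loads_with)               # 8x fewer global loads
```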
### Distributing to workgroups and workitems
After tiling, the operations within the dispatch functions are either
`scf.parallel` operations or Linalg operations.
- The outer `scf.parallel` operations represent parallel loops that are to be
distributed across workgroups. The distribution here assumes that the number
of workgroups along each dimension is equal to the number of iterations of
the `scf.parallel` operation.
- Linalg operations that are not tiled, and are therefore __not within__ `scf`
operations, are lowered to loops. The resulting outer `scf.parallel`
operations are collapsed to have a single induction variable. This loop is
then distributed across workitems using their `GlobalInvocationId` (which
is the same as `blockIdx * blockDim + threadIdx` in CUDA terminology); see the
sketch after this list.
- Linalg operations that are tiled, and are therefore __within__ `scf`
operations, are lowered to loops and the iterations of the `scf.parallel`
operations are mapped to workitems using their `LocalInvocationId` (which is
the same as `threadIdx` in CUDA terminology). Note that these operations are
tagged with the `workgroup` marker which makes it easy to disambiguate from
the case where Linalg operations are outside of `scf` operations. Here too,
the distribution assumes that the workgroup size is greater than or equal to
the number of iterations of the partitioned loop.
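The two distribution schemes boil down to the following index arithmetic, shown
here as an illustrative Python sketch (the names `block_id`, `block_dim` and
`thread_id` stand in for the corresponding `gpu` dialect values; this is not IREE
code).
```python
# Illustrative index arithmetic for the two distribution schemes (not IREE
# code). block_id/block_dim/thread_id stand in for gpu.block_id, gpu.block_dim
# and gpu.thread_id along one dimension.

def global_invocation_id(block_id, block_dim, thread_id):
    # Used for untiled Linalg ops: the collapsed scf.parallel loop is
    # distributed across all workitems of all workgroups.
    return block_id * block_dim + thread_id   # blockIdx * blockDim + threadIdx

def local_invocation_id(thread_id):
    # Used for tiled Linalg ops (marked "workgroup"): iterations of the
    # intra-tile scf.parallel loops are mapped to workitems of one workgroup.
    return thread_id                          # threadIdx

def should_execute(invocation_id, trip_count):
    # A workitem only executes an iteration if its id falls inside the loop
    # bounds, which is the purpose of the scf.if guards in Snippets 9 and 10.
    return invocation_id < trip_count
```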
These transformations are applied by the [`ConvertToGPUPass`][ConvertToGPU].
Below is the result of applying this pass to Snippet 7. The outer `scf.parallel`
loop is distributed across workgroups. The tiled `linalg.matmul` operation is
lowered to loops, and the outer `scf.parallel` operations generated during this
lowering are distributed across workitems within the workgroup.
```mlir
func @main_ex_dispatch_0_dispatch_1()
attributes {
spv.entry_point_abi = {local_size = dense<[8, 8, 1]> : vector<3xi32>}} {
%c24 = constant 24 : index
%c8 = constant 8 : index
%c4 = constant 4 : index
%c0 = constant 0 : index
%c1 = constant 1 : index
%0 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@ret0} : memref<32x16xf32>
%1 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg0} : memref<32x24xf32>
%2 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg1} : memref<24x16xf32>
%3 = "gpu.block_id"() {dimension = "x"} : () -> index
%4 = "gpu.block_id"() {dimension = "y"} : () -> index
%5 = muli %4, %c8 : index
%6 = muli %3, %c8 : index
scf.for %arg0 = %c0 to %c24 step %c4 {
...
%15 = subview %1[%5, %arg0]
...
%20 = subview %2[%arg0, %6]
%21 = subview %0[%5, %6]
%22 = "gpu.thread_id"() {dimension = "x"} : () -> index
%23 = "gpu.thread_id"() {dimension = "y"} : () -> index
%24 = cmpi "slt", %23, %10 : index
%25 = cmpi "slt", %22, %19 : index
%26 = and %24, %25 : i1
scf.if %26 {
scf.for %arg1 = %c0 to %14 step %c1 {
%27 = load %15[%23, %arg1] : memref<?x?xf32, #map0>
%28 = load %20[%arg1, %22] : memref<?x?xf32, #map1>
%29 = load %21[%23, %22] : memref<?x?xf32, #map1>
%30 = mulf %27, %28 : f32
%31 = addf %29, %30 : f32
store %31, %21[%23, %22] : memref<?x?xf32, #map1>
}
}
}
return
}
```
<a name="snippet9"></a> Snippet 9: `linalg.matmul` after distributing parallel
inter-tile loops to workgroups and intra-tile loops to workitems.
[Snippet 6](#snippet6) shows the fused element-wise operations represented using
a `linalg.generic` operation. This operation is not tiled in the
`LinalgTileAndFusePass`. So the `ConvertToGPUPass` lowers this operation to
`scf.parallel` loops, which are collapsed into a `scf.parallel` operation with a
single induction variable. This loop is then distributed across workitems using
the `GlobalInvocationId`. The resulting IR is shown below.
```mlir
func @main_ex_dispatch_0()
attributes {
spv.entry_point_abi = {local_size = dense<[32, 1, 1]> : vector<3xi32>}} {
%c50 = constant 50 : index
%c5 = constant 5 : index
%0 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@ret0} : memref<10x15xf32>
%1 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg0} : memref<10x15xf32>
%2 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg1} : memref<10x15xf32>
%3 = iree.placeholder for "interface buffer"
{binding = @legacy_io::@arg2} : memref<15xf32>
%4 = "gpu.block_id"() {dimension = "x"} : () -> index
%5 = "gpu.block_dim"() {dimension = "x"} : () -> index
%6 = "gpu.thread_id"() {dimension = "x"} : () -> index
%7 = muli %4, %5 : index
%8 = addi %7, %6 : index
%9 = cmpi "slt", %8, %c50 : index
scf.if %9 {
%10 = divi_signed %8, %c5 : index
%11 = remi_signed %8, %c5 : index
%12 = load %1[%10, %11] : memref<10x15xf32>
%13 = load %2[%10, %11] : memref<10x15xf32>
%14 = load %3[%11] : memref<15xf32>
%15 = addf %12, %13 : f32
%16 = mulf %15, %14 : f32
store %16, %0[%10, %11] : memref<10x15xf32>
}
return
}
```
<a name="snippet10"></a> Snippet 10: Distributing the iterations for pointwise
operations for GPU execution.
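The `divi_signed`/`remi_signed` pair in Snippet 10 delinearizes the single
collapsed induction variable back into 2-D coordinates. A generic Python sketch of
such a delinearization (illustrative only; written for a row-major 10x15 shape) is
shown below.
```python
# Illustrative delinearization of a collapsed 1-D parallel index back into
# (row, col) coordinates of a row-major 10x15 buffer, mirroring the
# divi_signed / remi_signed pattern in Snippet 10.
rows, cols = 10, 15

def delinearize(linear_id):
    return linear_id // cols, linear_id % cols   # (row, col)

# Every element of the 10x15 iteration space is visited exactly once.
visited = {delinearize(i) for i in range(rows * cols)}
assert len(visited) == rows * cols
```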
### Lowering to SPIR-V dialect
The last step is to take the result of the previous pass and lower it to the
SPIR-V dialect. Since the SPIR-V dialect is *closed*, i.e. it has a separate type
system, it is best to lower all the operations to SPIR-V in one step. This is done
by applying all the patterns that lower all the different IR constructs into
SPIR-V within the [`ConvertToSPIRVPass`][ConvertToSPIRV]. These are
- [GPU dialect to SPIR-V conversion][GPUToSPIRV].
- [SCF dialect to SPIR-V conversion][SCFToSPIRV].
- [Standard dialect to SPIR-V conversion][StandardToSPIRV].
- Patterns that lower the `iree.placeholder` operation into SPIR-V.
Once these patterns are applied, the resulting IR is in the SPIR-V dialect and can
be serialized to a SPIR-V binary.
[ConvertToGPU]: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToGPUPass.cpp
[ConvertToSPIRV]: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToSPIRVPass.cpp
[DotAfterAll]: https://gist.github.com/MaheshRavishankar/9e2d406296f469515c4a79bf1e7eef44
[GPUToSPIRV]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Conversion/GPUToSPIRV/ConvertGPUToSPIRV.h
[HLOToLinalgPass]: https://github.com/tensorflow/tensorflow/blob/75c40f6bff2faa3d90a375dfa4025b2e6e2d7a3d/tensorflow/compiler/mlir/xla/transforms/passes.h#L67
[LinalgDialect]: https://mlir.llvm.org/docs/Dialects/Linalg/
[LinalgFusionOnBuffers]: https://github.com/llvm/llvm-project/blob/ef868a848e6def288d2df7a1b3ebe09463afc8d0/mlir/include/mlir/Dialect/Linalg/Utils/Utils.h#L86
[LinalgFusionOfTensorOps]: https://github.com/llvm/llvm-project/blob/80cb25cbd555f9634836b766c86aead435b60eaa/mlir/include/mlir/Dialect/Linalg/Passes.td#L30
[LinalgPromotionPatterns]: https://github.com/llvm/llvm-project/blob/303a7f7a26e2aae1cb85f49dccbc0b5d14e0b2e0/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h#L358
[LinalgRationale]: https://mlir.llvm.org/docs/Rationale/RationaleLinalgDialect/
[LinalgTileAndFuse]: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/LinalgTileAndFusePass.cpp
[LinalgTiling]: https://mlir.llvm.org/docs/Dialects/Linalg/#set-of-key-transformationsa-namekey_transformationsa
[LinalgTilingPatterns]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
[NVVMAddressSpace]: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#address-space
[PwAfterAll]: https://gist.github.com/MaheshRavishankar/02cdd22f7c99e568f933244b5a679510
[SCFToSPIRV]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Conversion/SCFToSPIRV/SCFToSPIRV.h
[SpirvSerialization]: https://mlir.llvm.org/docs/Dialects/SPIR-V/#serialization-and-deserialization
[StandardToSPIRV]: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Conversion/StandardToSPIRV/ConvertStandardToSPIRV.h