commit	c319c2db6034a5359f73e0e2fe66b81617707b81
author	Ben Vanik <ben.vanik@gmail.com>	Fri Feb 24 18:18:47 2023 -0800
committer	GitHub <noreply@github.com>	Sat Feb 25 02:18:47 2023 +0000
tree	bba00e9bb4e6a985472f739da116253d6f8cab19
parent	289b9a1ebf954be3995cc1895254d09f2623341f

Adding initial dispatch instrumentation support. (#12357)

This adds a few new `hal.instrument.*` ops, a pass that instruments
dispatches to a basic level, LLVM CPU support, a runtime tooling flag,
and a prototype tool to dump the instrument data.

At the core of this is support for a new compiler-generated function
`__query_instruments` that allows modules to pass back a list of
buffers. The `--instrument_file=` tooling flag gathers the buffers
from all modules and concatenates them into a single binary file. The
file is a chunked transport stream that today contains only the
dispatch instrumentation chunk types.

Dispatch instrumentation will be enabled when the
`--iree-hal-instrument-dispatches=` flag is set to a power-of-two buffer
size. Most programs can usually get by with 16mib while memory access
instrumentation may require 256mib or 2gib.
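
As a rough illustration of the power-of-two requirement, the sketch below parses size strings such as `16mib` and rejects values that are not a power of two. `parse_buffer_size` and its unit table are hypothetical helpers written for this note; they are not IREE's flag parser, and the set of suffixes the real flag accepts may differ.

```python
import re

# Hypothetical unit table; the suffixes the real compiler flag accepts may differ.
_UNITS = {"kib": 1 << 10, "mib": 1 << 20, "gib": 1 << 30}


def parse_buffer_size(text: str) -> int:
    """Parses strings like '16mib' into bytes and checks for a power of two."""
    match = re.fullmatch(r"(\d+)(kib|mib|gib)?", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    # A missing suffix is treated as a plain byte count (an assumption).
    size = int(match.group(1)) * _UNITS.get(match.group(2), 1)
    # A power of two has exactly one bit set, so size & (size - 1) is zero.
    if size == 0 or size & (size - 1):
        raise ValueError(f"{text!r} is not a power-of-two size")
    return size


if __name__ == "__main__":
    for candidate in ("16mib", "256mib", "2gib"):
        print(candidate, "->", parse_buffer_size(candidate), "bytes")
```

All three sizes mentioned above pass the check; something like `24mib` would be rejected.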

On the CPU side the `--iree-llvmcpu-instrument-memory-accesses=true`
flag will enable tracking every load/store from/to a memref (scalars and
vectors) by address and length. This can be used to observe memory
access patterns and the addresses being accessed by particular
workgroups. We should be able to support this on other backends
(definitely CUDA, and possibly SPIR-V using relative buffer offsets
or something similar).

In addition to tracking workgroup launches and, optionally, memory
accesses, there are also placeholders for printf-style string formatting
and value probes. `hal.instrument.print` still needs conversion work in
each backend, and though `hal.instrument.value` works there's no nice way
of inserting value probes today.

Example commands for getting a memory access dump (16mib is pretty small
for this; 2gib is better when access tracking is enabled):
```sh
iree-compile \
    --iree-hal-target-backends=llvm-cpu \
    --iree-hal-instrument-dispatches=16mib \
    --iree-llvmcpu-instrument-memory-accesses=true \
    runtime/src/iree/runtime/testdata/simple_mul.mlir \
    -o=simple_mul_instr.vmfb
iree-run-module \
    --device=local-sync \
    --module=simple_mul_instr.vmfb \
    --function=simple_mul \
    --input=4xf32=2 \
    --input=4xf32=4 \
    --instrument_file=instrument.bin
iree-dump-instruments instrument.bin
```

Expected output for a simple_mul of 4x4096 elements (multiple
workgroups). Note that the export sources are listed, as well as all
dispatch sites for each export (in this case just one):
```
$ ../iree-build/tools/iree-dump-instruments instrument_mem.bin

//===----------------------------------------------------------------------===//
// export[0]: simple_mul_dispatch_0_generic_16384
//===----------------------------------------------------------------------===//
func.func @simple_mul_dispatch_0_generic_16384(%arg0: !stream.binding {stream.alignment = 64 : index}, %arg1: !stream.binding {stream.alignment = 64 : index}, %arg2: !stream.binding {stream.alignment = 64 : index}) {
  %c0 = arith.constant 0 : index
  %0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<16384xf32>>
  %1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<16384xf32>>
  %2 = stream.binding.subspan %arg2[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<16384xf32>>
  %3 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [16384], strides = [1] : !flow.dispatch.tensor<readonly:tensor<16384xf32>> -> tensor<16384xf32>
  %4 = flow.dispatch.tensor.load %1, offsets = [0], sizes = [16384], strides = [1] : !flow.dispatch.tensor<readonly:tensor<16384xf32>> -> tensor<16384xf32>
  %5 = tensor.empty() : tensor<16384xf32>
  %6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%3, %4 : tensor<16384xf32>, tensor<16384xf32>) outs(%5 : tensor<16384xf32>) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %7 = arith.mulf %in, %in_0 : f32
    linalg.yield %7 : f32
  } -> tensor<16384xf32>
  flow.dispatch.tensor.store %6, %2, offsets = [0], sizes = [16384], strides = [1] : tensor<16384xf32> -> !flow.dispatch.tensor<writeonly:tensor<16384xf32>>
  return
}

//===----------------------------------------------------------------------===//
// dispatch site 0: simple_mul_dispatch_0_generic_16384
//===----------------------------------------------------------------------===//

0000000000000000 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 0,0,0 pid:52
0000000000000000 | LOAD  000002705bc3ad80 16
0000000000000000 | LOAD  000002705bc4ae80 16
0000000000000000 | STORE 000002705bc7a100 16
0000000000000000 | LOAD  000002705bc3ad90 16
0000000000000000 | LOAD  000002705bc4ae90 16
0000000000000000 | STORE 000002705bc7a110 16
0000000000000000 | LOAD  000002705bc3ada0 16
0000000000000000 | LOAD  000002705bc4aea0 16
0000000000000000 | STORE 000002705bc7a120 16
0000000000000000 | LOAD  000002705bc3adb0 16
0000000000000000 | LOAD  000002705bc4aeb0 16
0000000000000000 | STORE 000002705bc7a130 16
0000000000000000 | LOAD  000002705bc3adc0 16
0000000000000000 | LOAD  000002705bc4aec0 16
0000000000000000 | STORE 000002705bc7a140 16
...
000000000000c040 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 1,0,0 pid:52
000000000000c040 | LOAD  000002705bc3ed80 16
000000000000c040 | LOAD  000002705bc4ee80 16
000000000000c040 | STORE 000002705bc7e100 16
...
```
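
The text dump above is regular enough to post-process with a small script. The sketch below is not part of this change: it assumes the line layout shown in the excerpt (which is a proof of concept and may change) and assumes the first-column workgroup key is unique per workgroup within one dump. `summarize` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Patterns derived from the dump excerpt above; the format may change.
WORKGROUP_RE = re.compile(
    r"^([0-9a-fA-F]{16}) \| WORKGROUP dispatch\((\d+) (\S+) (\d+)x(\d+)x(\d+)\) "
    r"(\d+),(\d+),(\d+) pid:(\d+)$")
ACCESS_RE = re.compile(
    r"^([0-9a-fA-F]{16}) \| (LOAD|STORE)\s+([0-9a-fA-F]+) (\d+)$")

SAMPLE = """\
0000000000000000 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 0,0,0 pid:52
0000000000000000 | LOAD  000002705bc3ad80 16
0000000000000000 | LOAD  000002705bc4ae80 16
0000000000000000 | STORE 000002705bc7a100 16
000000000000c040 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 1,0,0 pid:52
000000000000c040 | LOAD  000002705bc3ed80 16
000000000000c040 | LOAD  000002705bc4ee80 16
000000000000c040 | STORE 000002705bc7e100 16
"""


def summarize(lines):
    """Maps workgroup id (x, y, z) to its LOAD/STORE (address, length) lists."""
    key_to_id = {}
    accesses = defaultdict(lambda: {"LOAD": [], "STORE": []})
    for line in lines:
        line = line.strip()
        match = WORKGROUP_RE.match(line)
        if match:
            # Remember which workgroup id this workgroup key belongs to.
            key_to_id[match.group(1)] = tuple(
                int(g) for g in match.group(7, 8, 9))
            continue
        match = ACCESS_RE.match(line)
        if match and match.group(1) in key_to_id:
            workgroup_id = key_to_id[match.group(1)]
            accesses[workgroup_id][match.group(2)].append(
                (int(match.group(3), 16), int(match.group(4))))
    return dict(accesses)


summary = summarize(SAMPLE.splitlines())
# Byte distance between the first load of workgroup 1 and of workgroup 0.
stride = summary[(1, 0, 0)]["LOAD"][0][0] - summary[(0, 0, 0)]["LOAD"][0][0]
print(hex(stride), stride // 4)
```

On the excerpt, the first load of workgroup 1,0,0 is 0x4000 bytes (4096 `f32` elements) past the first load of workgroup 0,0,0, which is consistent with the 16384 elements being split across 4 workgroups.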

The printed data is currently just a proof of concept; the first column
is the workgroup key. More advanced visualizations can do better (still
WIP; the screenshots below show a dispatch listing and the memory
accesses made by a particular dispatch):

![image](https://user-images.githubusercontent.com/75337/221122437-2eb50c30-f8af-415e-90cd-f67ae7fdbdcd.png)

![image](https://user-images.githubusercontent.com/75337/221122967-bf37cf44-4991-484e-b654-a0b267ce527f.png)

![image](https://user-images.githubusercontent.com/75337/221122600-767b7a1b-4f60-45f0-8a88-1c504d36c063.png)

There's still some iteration needed on printing values. For now, a simple
test with this change:
```diff
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp b/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
index 55bab04a8..24abf2255 100644
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
@@ -442,6 +442,22 @@ struct ConvertHALInstrumentWorkgroupOp
             rewriter.create<LLVM::ConstantOp>(loc, i64Type, 0xFFFFFFFFFFll)),
         rewriter.create<LLVM::ConstantOp>(loc, i64Type, 24));

+    // HACK: test writing out a value.
+    {
+      Value valueOperand = rewriter.create<arith::ConstantOp>(
+          loc, rewriter.getUI32IntegerAttr(0xFFFFFFFFu));
+      rewriter.create<IREE::HAL::InstrumentValueOp>(
+          loc, valueOperand.getType(), instrumentOp.getBuffer(), workgroupKey,
+          rewriter.getI8IntegerAttr(0), valueOperand);
+    }
+    {
+      Value valueOperand = rewriter.create<arith::ConstantOp>(
+          loc, rewriter.getF32FloatAttr(1.234f));
+      rewriter.create<IREE::HAL::InstrumentValueOp>(
+          loc, valueOperand.getType(), instrumentOp.getBuffer(), workgroupKey,
+          rewriter.getI8IntegerAttr(1), valueOperand);
+    }
+
     rewriter.replaceOp(instrumentOp, workgroupKey);
     return success();
   }
```
will produce this output:
```
0000000000000000 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 0,0,0 pid:50
0000000000000000 | VALUE 0000 = 4294967295
0000000000000000 | VALUE 0001 = 1.234000e+00 1.234000
0000000000000040 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 1,0,0 pid:50
0000000000000040 | VALUE 0000 = 4294967295
0000000000000040 | VALUE 0001 = 1.234000e+00 1.234000
0000000000000080 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 2,0,0 pid:50
0000000000000080 | VALUE 0000 = 4294967295
0000000000000080 | VALUE 0001 = 1.234000e+00 1.234000
00000000000000C0 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 3,0,0 pid:50
00000000000000C0 | VALUE 0000 = 4294967295
00000000000000C0 | VALUE 0001 = 1.234000e+00 1.234000
```
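
To make the `VALUE` lines easier to read, the sketch below shows how a 32-bit payload could be rendered either as an unsigned integer or as a float. It only mimics the text shown above: `format_value`, the little-endian byte order, and the idea that the dump tool is handed raw 32-bit payloads are assumptions made for illustration, not a description of how `iree-dump-instruments` is implemented.

```python
import struct


def format_value(ordinal: int, raw: bytes, kind: str) -> str:
    """Renders a 4-byte payload in the style of the VALUE lines above."""
    if kind == "u32":
        (value,) = struct.unpack("<I", raw)
        text = str(value)
    elif kind == "f32":
        (value,) = struct.unpack("<f", raw)
        # The dump above shows floats in both exponent and fixed notation.
        text = "%e %f" % (value, value)
    else:
        raise ValueError(f"unsupported kind: {kind}")
    return "VALUE %04d = %s" % (ordinal, text)


if __name__ == "__main__":
    # Same constants as the HACK in the diff: 0xFFFFFFFF and 1.234f.
    print(format_value(0, struct.pack("<I", 0xFFFFFFFF), "u32"))
    print(format_value(1, struct.pack("<f", 1.234), "f32"))
```

The ordinals 0 and 1 correspond to the `getI8IntegerAttr(0)` and `getI8IntegerAttr(1)` arguments in the diff.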

Ergonomics-wise, for things like the value instrumentation we could try
adding dialect attributes (or something similar) earlier in the pipeline
to track values or flag patterns for particular experiments.

In the future we can add both additional dispatch instrumentation and
new chunk types from various modules. Examples include
compiler-generated profile-guided optimization markers, VM code coverage
markers, HAL counters for allocations or submissions, or device
timestamp streams.

Only the CPU backend is supported right now as that's what I'm familiar
with. I'd hoped to lower the `hal.instrument.*` ops to memrefs so the
lowering could be shared, but memref descriptors are unfortunately still
a thing, and with something tracking every single memory access they
produced hundreds of thousands of instructions. I eagerly await the day
when someone works to kill memref descriptors.

38 files changed
