Adding initial dispatch instrumentation support. (#12357)

This adds a few new `hal.instrument.*` ops, a pass that instruments
dispatches to a basic level, LLVM CPU support, a runtime tooling flag,
and a prototype tool to dump the instrument data.

At the core of this is support for a new compiler-generated function
`__query_instruments` that allows modules to pass back a list of
buffers. The `--instrument_file=` tooling flag will gather all buffers
from all modules and concatenate them together into a binary file. The
resulting file is a chunked transport stream that today contains only
the dispatch instrumentation chunk types.

Dispatch instrumentation will be enabled when the
`--iree-hal-instrument-dispatches=` flag is set to a power-of-two buffer
size. Most programs can get by with 16mib, while memory access
instrumentation may require 256mib or 2gib.
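
To make the power-of-two constraint concrete, here is a minimal sketch of
checking a size string such as `16mib` before passing it to the flag. The
helper name and the accepted suffixes are assumptions made for this example;
this is not the compiler's own flag parser.

```python
import re

# Hypothetical helper for illustration; the suffix table is an assumption.
_UNITS = {"": 1, "kib": 1 << 10, "mib": 1 << 20, "gib": 1 << 30}


def parse_buffer_size(text: str) -> int:
    """Parses sizes such as '16mib' into bytes and rejects non-powers-of-two."""
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", text.strip().lower())
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    size = int(match.group(1)) * _UNITS[match.group(2)]
    # A positive integer is a power of two iff exactly one bit is set.
    if size <= 0 or size & (size - 1):
        raise ValueError(f"size must be a power of two: {text!r}")
    return size
```

With this check, `16mib`, `256mib`, and `2gib` are all accepted, while a value
like `24mib` is rejected.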

On the CPU side the `--iree-llvmcpu-instrument-memory-accesses=true`
flag will enable tracking every load/store from/to a memref (scalars and
vectors) by address and length. This can be used to observe memory
access patterns and the addresses being accessed by particular
workgroups. We should be able to support this on other backends
(definitely CUDA, and possibly SPIR-V using relative buffer offsets or
something similar).

In addition to tracking workgroup launches and optionally memory
accesses there are also placeholders for printf-style string formatting
and value probes. `hal.instrument.print` still needs conversion work in
each backend, and though `hal.instrument.value` works, there's no nice
way of inserting value probes today.

Example commands for getting a memory access dump (16mib is pretty small
for this; 2gib is better when access tracking is enabled):
iree-compile \
    --iree-hal-target-backends=llvm-cpu \
    --iree-hal-instrument-dispatches=16mib \
    --iree-llvmcpu-instrument-memory-accesses=true \
    runtime/src/iree/runtime/testdata/simple_mul.mlir \
    -o simple_mul_instr.vmfb
iree-run-module \
    --device=local-sync \
    --module=simple_mul_instr.vmfb \
    --function=simple_mul \
    --input=4xf32=2 \
    --input=4xf32=4 \
    --instrument_file=instrument.bin
iree-dump-instruments instrument.bin

Expected output for a simple_mul of 4x4096 elements (multiple
workgroups). Note that the export sources are listed, as well as all
dispatch sites of each export (in this case just one):
$ ../iree-build/tools/iree-dump-instruments instrument_mem.bin

// export[0]: simple_mul_dispatch_0_generic_16384
func.func @simple_mul_dispatch_0_generic_16384(%arg0: !stream.binding {stream.alignment = 64 : index}, %arg1: !stream.binding {stream.alignment = 64 : index}, %arg2: !stream.binding {stream.alignment = 64 : index}) {
  %c0 = arith.constant 0 : index
  %0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<16384xf32>>
  %1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<16384xf32>>
  %2 = stream.binding.subspan %arg2[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<16384xf32>>
  %3 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [16384], strides = [1] : !flow.dispatch.tensor<readonly:tensor<16384xf32>> -> tensor<16384xf32>
  %4 = flow.dispatch.tensor.load %1, offsets = [0], sizes = [16384], strides = [1] : !flow.dispatch.tensor<readonly:tensor<16384xf32>> -> tensor<16384xf32>
  %5 = tensor.empty() : tensor<16384xf32>
  %6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%3, %4 : tensor<16384xf32>, tensor<16384xf32>) outs(%5 : tensor<16384xf32>) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %7 = arith.mulf %in, %in_0 : f32
    linalg.yield %7 : f32
  } -> tensor<16384xf32>
  flow.dispatch.tensor.store %6, %2, offsets = [0], sizes = [16384], strides = [1] : tensor<16384xf32> -> !flow.dispatch.tensor<writeonly:tensor<16384xf32>>
  return
}

// dispatch site 0: simple_mul_dispatch_0_generic_16384

0000000000000000 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 0,0,0 pid:52
0000000000000000 | LOAD  000002705bc3ad80 16
0000000000000000 | LOAD  000002705bc4ae80 16
0000000000000000 | STORE 000002705bc7a100 16
0000000000000000 | LOAD  000002705bc3ad90 16
0000000000000000 | LOAD  000002705bc4ae90 16
0000000000000000 | STORE 000002705bc7a110 16
0000000000000000 | LOAD  000002705bc3ada0 16
0000000000000000 | LOAD  000002705bc4aea0 16
0000000000000000 | STORE 000002705bc7a120 16
0000000000000000 | LOAD  000002705bc3adb0 16
0000000000000000 | LOAD  000002705bc4aeb0 16
0000000000000000 | STORE 000002705bc7a130 16
0000000000000000 | LOAD  000002705bc3adc0 16
0000000000000000 | LOAD  000002705bc4aec0 16
0000000000000000 | STORE 000002705bc7a140 16
000000000000c040 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 1,0,0 pid:52
000000000000c040 | LOAD  000002705bc3ed80 16
000000000000c040 | LOAD  000002705bc4ee80 16
000000000000c040 | STORE 000002705bc7e100 16
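
The text listing above is regular enough to post-process with a small script.
As a sketch, assuming the line shape shown in this listing stays as-is (it is a
proof of concept and may well change), the following groups LOAD/STORE records
by workgroup key and reports bytes moved plus the touched address range. The
function name and the shape of the returned dictionary are made up for this
example.

```python
import re

# Matches lines such as: "000000000000c040 | LOAD  000002705bc3ed80 16"
# The line shape is taken from the listing above and is an assumption about
# the tool's current text output, not a stable format.
_ACCESS = re.compile(
    r"^([0-9a-fA-F]{16}) \| (LOAD|STORE)\s+([0-9a-fA-F]+) (\d+)$")


def summarize_accesses(lines):
    """Returns {workgroup_key: {"LOAD": bytes, "STORE": bytes, "lo", "hi"}}."""
    summary = {}
    for line in lines:
        m = _ACCESS.match(line.strip())
        if not m:
            continue  # WORKGROUP headers, IR, comments, and blank lines
        key, kind = int(m.group(1), 16), m.group(2)
        addr, length = int(m.group(3), 16), int(m.group(4))
        entry = summary.setdefault(
            key, {"LOAD": 0, "STORE": 0, "lo": addr, "hi": addr + length})
        entry[kind] += length
        # Track the half-open address range [lo, hi) touched by this key.
        entry["lo"] = min(entry["lo"], addr)
        entry["hi"] = max(entry["hi"], addr + length)
    return summary
```

Feeding it the output of `iree-dump-instruments` line by line gives one entry
per workgroup key, which is a quick way to eyeball whether workgroups touch
disjoint address ranges.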

The printed data is currently just a proof of concept; the first column
is the workgroup key. More advanced visualizations can do better things
(still WIP: a dispatch listing and the memory accesses made by a
particular dispatch).

There's still some iteration needed on printing values. For now a simple
test with this:
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp b/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
index 55bab04a8..24abf2255 100644
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
@@ -442,6 +442,22 @@ struct ConvertHALInstrumentWorkgroupOp
             rewriter.create<LLVM::ConstantOp>(loc, i64Type, 0xFFFFFFFFFFll)),
         rewriter.create<LLVM::ConstantOp>(loc, i64Type, 24));

+    // HACK: test writing out a value.
+    {
+      Value valueOperand = rewriter.create<arith::ConstantOp>(
+          loc, rewriter.getUI32IntegerAttr(0xFFFFFFFFu));
+      rewriter.create<IREE::HAL::InstrumentValueOp>(
+          loc, valueOperand.getType(), instrumentOp.getBuffer(), workgroupKey,
+          rewriter.getI8IntegerAttr(0), valueOperand);
+    }
+    {
+      Value valueOperand = rewriter.create<arith::ConstantOp>(
+          loc, rewriter.getF32FloatAttr(1.234f));
+      rewriter.create<IREE::HAL::InstrumentValueOp>(
+          loc, valueOperand.getType(), instrumentOp.getBuffer(), workgroupKey,
+          rewriter.getI8IntegerAttr(1), valueOperand);
+    }
     rewriter.replaceOp(instrumentOp, workgroupKey);
     return success();
will produce this output:
0000000000000000 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 0,0,0 pid:50
0000000000000000 | VALUE 0000 = 4294967295
0000000000000000 | VALUE 0001 = 1.234000e+00 1.234000
0000000000000040 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 1,0,0 pid:50
0000000000000040 | VALUE 0000 = 4294967295
0000000000000040 | VALUE 0001 = 1.234000e+00 1.234000
0000000000000080 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 2,0,0 pid:50
0000000000000080 | VALUE 0000 = 4294967295
0000000000000080 | VALUE 0001 = 1.234000e+00 1.234000
00000000000000C0 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 3,0,0 pid:50
00000000000000C0 | VALUE 0000 = 4294967295
00000000000000C0 | VALUE 0001 = 1.234000e+00 1.234000

Ergonomics-wise, for things like the value instrumentation we could try
adding dialect attributes (or something similar) earlier on that track
values, flagged patterns, etc. for particular experiments.

In the future we can add both additional dispatch instrumentation and
new chunk types from various modules. Examples include
compiler-generated profile-guided optimization markers, VM code coverage
markers, HAL counters for allocations or submissions, or device
timestamp streams.

Only the CPU backend is supported right now, as that's what I'm familiar
with. I'd hoped to do the `hal.instrument.*` op lowering to memrefs so
it could be shared, but memref descriptors are unfortunately still a
thing, and with something tracking every single memory access it created
hundreds of thousands of instructions. I eagerly await the day when
someone works to kill memref descriptors.
