commit | c319c2db6034a5359f73e0e2fe66b81617707b81 | |
---|---|---|
author | Ben Vanik <ben.vanik@gmail.com> | Fri Feb 24 18:18:47 2023 -0800 |
committer | GitHub <noreply@github.com> | Sat Feb 25 02:18:47 2023 +0000 |
tree | bba00e9bb4e6a985472f739da116253d6f8cab19 | |
parent | 289b9a1ebf954be3995cc1895254d09f2623341f | |
Adding initial dispatch instrumentation support. (#12357)

This adds a few new `hal.instrument.*` ops, a pass that instruments dispatches to a basic level, LLVM CPU support, a runtime tooling flag, and a prototype tool to dump the instrument data.

At the core of this is support for a new compiler-generated function `__query_instruments` that allows modules to pass back a list of buffers. The `--instrument_file=` tooling flag will gather all buffers from all modules and concatenate them together into a binary file. The format of the resulting file is a chunked transport stream that today contains only the dispatch instrumentation chunk types.

Dispatch instrumentation will be enabled when the `--iree-hal-instrument-dispatches=` flag is set to a power-of-two buffer size. Most programs can usually get by with 16mib, while memory access instrumentation may require 256mib or 2gib.

On the CPU side the `--iree-llvmcpu-instrument-memory-accesses=true` flag will enable tracking every load/store from/to a memref (scalars and vectors) by address and length. This can be used to observe memory access patterns and the addresses being accessed by particular workgroups. We should be able to support this on other backends (definitely CUDA, and possibly SPIR-V using relative buffer offsets or something).

In addition to tracking workgroup launches and optionally memory accesses, there are also placeholders for printf-style string formatting and value probes. `hal.instrument.print` still needs conversion work in each backend, and though `hal.instrument.value` works there's no nice way of inserting them today.
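As a rough illustration of the power-of-two requirement on the buffer size, the sketch below parses a size string such as `16mib` and rejects anything that is not a power of two. The helper name `parse_buffer_size` and the suffix table are assumptions made for this example; it is not the parser IREE itself uses, and only the `mib` and `gib` spellings appear in this message.

```python
import re

# Hypothetical suffix table: only "mib" and "gib" are spelled out in the
# commit message; the other entries are assumptions for illustration.
_UNITS = {"": 1, "kib": 1 << 10, "mib": 1 << 20, "gib": 1 << 30}


def parse_buffer_size(text: str) -> int:
    """Parse a size such as '16mib' into bytes and require a power of two."""
    match = re.fullmatch(r"(\d+)([a-z]*)", text.strip().lower())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    size = int(match.group(1)) * _UNITS[match.group(2)]
    # A positive integer is a power of two iff exactly one bit is set.
    if size <= 0 or size & (size - 1) != 0:
        raise ValueError(f"size must be a power of two: {text!r}")
    return size
```

With this sketch, `parse_buffer_size("16mib")` yields the byte count for a 16 MiB buffer, while a value such as `24mib` raises `ValueError`.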
Example commands for getting a memory access dump (16mib is pretty small for this, 2gib is better when access tracking is enabled):

```sh
iree-compile \
    --iree-hal-target-backends=llvm-cpu \
    --iree-hal-instrument-dispatches=16mib \
    --iree-llvmcpu-instrument-memory-accesses=true \
    runtime/src/iree/runtime/testdata/simple_mul.mlir \
    -o=simple_mul_instr.vmfb
iree-run-module \
    --device=local-sync \
    --module=simple_mul_instr.vmfb \
    --function=simple_mul \
    --input=4xf32=2 \
    --input=4xf32=4 \
    --instrument_file=instrument.bin
iree-dump-instruments instrument.bin
```

Expected output for a simple_mul which has 4x4096 elements (multiple workgroups). Note that export sources are listed along with all dispatch sites for that export (in this case just one):

```
$ ../iree-build/tools/iree-dump-instruments instrument_mem.bin
//===----------------------------------------------------------------------===//
// export[0]: simple_mul_dispatch_0_generic_16384
//===----------------------------------------------------------------------===//
func.func @simple_mul_dispatch_0_generic_16384(%arg0: !stream.binding {stream.alignment = 64 : index}, %arg1: !stream.binding {stream.alignment = 64 : index}, %arg2: !stream.binding {stream.alignment = 64 : index}) {
  %c0 = arith.constant 0 : index
  %0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<16384xf32>>
  %1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<16384xf32>>
  %2 = stream.binding.subspan %arg2[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<16384xf32>>
  %3 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [16384], strides = [1] : !flow.dispatch.tensor<readonly:tensor<16384xf32>> -> tensor<16384xf32>
  %4 = flow.dispatch.tensor.load %1, offsets = [0], sizes = [16384], strides = [1] : !flow.dispatch.tensor<readonly:tensor<16384xf32>> -> tensor<16384xf32>
  %5 = tensor.empty() : tensor<16384xf32>
  %6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%3, %4 : tensor<16384xf32>, tensor<16384xf32>) outs(%5 : tensor<16384xf32>) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %7 = arith.mulf %in, %in_0 : f32
    linalg.yield %7 : f32
  } -> tensor<16384xf32>
  flow.dispatch.tensor.store %6, %2, offsets = [0], sizes = [16384], strides = [1] : tensor<16384xf32> -> !flow.dispatch.tensor<writeonly:tensor<16384xf32>>
  return
}

//===----------------------------------------------------------------------===//
// dispatch site 0: simple_mul_dispatch_0_generic_16384
//===----------------------------------------------------------------------===//
0000000000000000 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 0,0,0 pid:52
0000000000000000 | LOAD 000002705bc3ad80 16
0000000000000000 | LOAD 000002705bc4ae80 16
0000000000000000 | STORE 000002705bc7a100 16
0000000000000000 | LOAD 000002705bc3ad90 16
0000000000000000 | LOAD 000002705bc4ae90 16
0000000000000000 | STORE 000002705bc7a110 16
0000000000000000 | LOAD 000002705bc3ada0 16
0000000000000000 | LOAD 000002705bc4aea0 16
0000000000000000 | STORE 000002705bc7a120 16
0000000000000000 | LOAD 000002705bc3adb0 16
0000000000000000 | LOAD 000002705bc4aeb0 16
0000000000000000 | STORE 000002705bc7a130 16
0000000000000000 | LOAD 000002705bc3adc0 16
0000000000000000 | LOAD 000002705bc4aec0 16
0000000000000000 | STORE 000002705bc7a140 16
...
000000000000c040 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 1,0,0 pid:52
000000000000c040 | LOAD 000002705bc3ed80 16
000000000000c040 | LOAD 000002705bc4ee80 16
000000000000c040 | STORE 000002705bc7e100 16
...
```

The printed data is currently just a proof of concept; the first column is the workgroup key. More advanced visualizations can do better things (still WIP; the original screenshots showed a dispatch listing and the memory accesses made by a particular dispatch).

There's still some iteration needed on printing values.
For now a simple test with this:

```diff
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp b/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
index 55bab04a8..24abf2255 100644
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/ConvertToLLVM.cpp
@@ -442,6 +442,22 @@ struct ConvertHALInstrumentWorkgroupOp
             rewriter.create<LLVM::ConstantOp>(loc, i64Type, 0xFFFFFFFFFFll)),
         rewriter.create<LLVM::ConstantOp>(loc, i64Type, 24));
 
+    // HACK: test writing out a value.
+    {
+      Value valueOperand = rewriter.create<arith::ConstantOp>(
+          loc, rewriter.getUI32IntegerAttr(0xFFFFFFFFu));
+      rewriter.create<IREE::HAL::InstrumentValueOp>(
+          loc, valueOperand.getType(), instrumentOp.getBuffer(), workgroupKey,
+          rewriter.getI8IntegerAttr(0), valueOperand);
+    }
+    {
+      Value valueOperand = rewriter.create<arith::ConstantOp>(
+          loc, rewriter.getF32FloatAttr(1.234f));
+      rewriter.create<IREE::HAL::InstrumentValueOp>(
+          loc, valueOperand.getType(), instrumentOp.getBuffer(), workgroupKey,
+          rewriter.getI8IntegerAttr(1), valueOperand);
+    }
+
     rewriter.replaceOp(instrumentOp, workgroupKey);
     return success();
   }
```

will produce this output:

```
0000000000000000 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 0,0,0 pid:50
0000000000000000 | VALUE 0000 = 4294967295
0000000000000000 | VALUE 0001 = 1.234000e+00 1.234000
0000000000000040 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 1,0,0 pid:50
0000000000000040 | VALUE 0000 = 4294967295
0000000000000040 | VALUE 0001 = 1.234000e+00 1.234000
0000000000000080 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 2,0,0 pid:50
0000000000000080 | VALUE 0000 = 4294967295
0000000000000080 | VALUE 0001 = 1.234000e+00 1.234000
00000000000000C0 | WORKGROUP dispatch(0 simple_mul_dispatch_0_generic_16384 4x1x1) 3,0,0 pid:50
00000000000000C0 | VALUE 0000 = 4294967295
00000000000000C0 | VALUE 0001 = 1.234000e+00 1.234000
```

Ergonomics-wise, for things like the value instrumentation we could try adding dialect attributes or something earlier on that tracks values, flagged patterns, etc. for particular experiments.

In the future we can add both additional dispatch instrumentation and new chunk types from various modules. Examples include compiler-generated profile-guided optimization markers, VM code coverage markers, HAL counters for allocations or submissions, or device timestamp streams.

Only the CPU backend is supported right now, as that's what I'm familiar with. I'd hoped to do the `hal.instrument.*` op lowering to memrefs so it could be shared, but memref descriptors are unfortunately still a thing, and with something tracking every single memory access it created hundreds of thousands of instructions. I eagerly await the day when someone works to kill memref descriptors.
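Until richer visualizations exist, the text dump shown earlier in this message can be post-processed with a small script. The sketch below groups `LOAD`/`STORE` records by the workgroup key in the first column and totals the bytes touched. It assumes the `key | KIND address length` layout seen in the sample output and that the length column is decimal (consistent with the 0x10 address stride between records); `summarize_dump` is a hypothetical helper, not part of `iree-dump-instruments`.

```python
from collections import defaultdict


def summarize_dump(text: str) -> dict:
    """Total LOAD/STORE bytes per workgroup key from the textual dump."""
    totals = defaultdict(lambda: {"LOAD": 0, "STORE": 0})
    for line in text.splitlines():
        key, sep, record = line.partition(" | ")
        fields = record.split()
        # Skip export listings, WORKGROUP/VALUE records, and "..." elisions.
        if not sep or len(fields) != 3 or fields[0] not in ("LOAD", "STORE"):
            continue
        kind, _address, length = fields
        # Assumption: the length column is a decimal byte count.
        totals[key.strip()][kind] += int(length)
    return {key: dict(counts) for key, counts in totals.items()}
```

Feeding it the output of `iree-dump-instruments` gives one entry per workgroup key, which is enough to spot workgroups that touch far more memory than their peers.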
IREE (Intermediate Representation Execution Environment, pronounced as “eerie”) is an MLIR-based end-to-end compiler and runtime that lowers Machine Learning (ML) models to a unified IR that scales up to meet the needs of the datacenter and down to satisfy the constraints and special considerations of mobile and edge deployments.
See our website for project details, user guides, and instructions on building from source.
IREE is still in its early phase. We have settled on the overarching infrastructure and are actively improving various software components as well as project logistics. It is still quite far from ready for everyday use and is made available without any support at the moment. With that said, we welcome any kind of feedback on any communication channel!
See our website for more information.
IREE is licensed under the terms of the Apache 2.0 License with LLVM Exceptions. See LICENSE for more information.