hal_executable_library_call hook to study CPU event counts on Linux

To use this, build IREE with:
cmake -DCMAKE_C_FLAGS=-DIREE_HAL_EXECUTABLE_LIBRARY_CALL_HOOK .
to enable the hooks in the IREE runtime. This allows hooks to be loaded with LD_PRELOAD=...some_hooks.so
cmake -DIREE_BUILD_EXPERIMENTAL_HAL_EXECUTABLE_LIBRARY_CALL_HOOKS=ON .
to enable building this directory, which provides one such hooks .so implementation.

Example:
Suppose that we have a program like this, matmul.mlir:
func.func @matmul_dynamic(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>) outs(%acc: tensor<?x?xf32>) -> tensor<?x?xf32>
  return %result: tensor<?x?xf32>
}
Compile it as usual, but also dump the executable intermediates so that the actual name of each dispatch function is visible and can be used as a filter:
tools/iree-compile ~/matmul.mlir -o /tmp/matmul.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-enable-ukernels=all \
  --iree-hal-dump-executable-intermediates-to=/tmp
Thanks to the dumped intermediates in /tmp, we see, for instance, that the interesting function for us is named matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32.
So we run like this:
IREE_HOOK_FILTER_NAME=matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32 \
IREE_HOOK_PERF_EVENT_TYPES=ls_any_fills_from_sys.all \
LD_PRELOAD=/home/benoit/iree-build/experimental/hal_executable_library_call_hooks/libiree_experimental_hal_executable_library_call_hooks_hook_cpu_events_linux.so \
tools/iree-benchmark-module \
  --device_allocator=caching \
  --module=/tmp/matmul.vmfb \
  --function=matmul_dynamic \
  --input=4000x4000xf32 \
  --input=4000x4000xf32 \
  --input=4000x4000xf32
[!NOTE] This tool relies on the perf_event_open system call. Most Linux systems do not grant sufficient permissions for it by default. Override this by writing 0 to the file /proc/sys/kernel/perf_event_paranoid.
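Before launching a long benchmark, it can be useful to check the current setting. The following is a minimal Python sketch, not part of this directory. The function names are hypothetical, and the threshold of 0 is simply the value recommended in the note above. Lower values of perf_event_paranoid are more permissive, so any value at or below the threshold is accepted.

```python
from pathlib import Path

# Path named in the note above.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_setting(path=PARANOID_PATH):
    """Return the raw file contents, or None if the file cannot be read
    (for example on a non-Linux system)."""
    try:
        return path.read_text()
    except OSError:
        return None


def paranoid_setting_allows_hook(raw_text, max_level=0):
    """Return True if the setting is at most max_level.

    max_level=0 is an assumption taken from the note above; lower values
    (such as -1) are more permissive and therefore also accepted.
    """
    return int(raw_text.strip()) <= max_level


if __name__ == "__main__":
    raw = read_paranoid_setting()
    if raw is None:
        print("perf_event_paranoid is not readable on this system")
    elif paranoid_setting_allows_hook(raw):
        print("perf_event_paranoid =", raw.strip(), "(sufficient)")
    else:
        print("perf_event_paranoid =", raw.strip(), "(write 0 to", PARANOID_PATH, ")")
```

The script only reads the setting. Changing it still requires writing to the file with root privileges.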
The benchmark command above produces output like this:
Statistics for thread iree-worker-15:
  15536 matching calls, of which:
    15536 calls on cpu 31
  duration_ms:
    mean: 52.7
    16-ile means: 39.6 44.8 46.5 47.6 48.4 49.1 49.7 50.3 50.8 51.4 52 52.8 53.9 55.4 59 92.6
  ls_any_fills_from_sys.all_dram_io:
    mean: 1.57e+03
    16-ile means: 942 1.24e+03 1.35e+03 1.44e+03 1.5e+03 1.54e+03 1.58e+03 1.61e+03 1.63e+03 1.66e+03 1.68e+03 1.71e+03 1.73e+03 1.77e+03 1.81e+03 2.02e+03
  correlation of duration_ms vs. ls_any_fills_from_sys.all_dram_io: 0.46
  conditional probability of duration_ms 16-ile (↓) given ls_any_fills_from_sys.all_dram_io 16-ile (→):
    0 1 2 3 4 5 6 7 8 9 a b c d e f
    0 ▆▆ ▃▃ ▂▁ ▁▁ ▁_ __ __ _ _ _ _ _ _ _
    1 ▃▃ ▄▄ ▃▃ ▂▂ ▁▁ ▁_ __ __ __ __ _ __ _ _ __
    2 ▂▁ ▃▃ ▃▃ ▃▂ ▂▂ ▂▁ ▁_ ▁_ ▁_ __ __ __ __ _ _ __
    3 ▁_ ▂▂ ▂▂ ▂▂ ▂▂ ▂▁ ▂▁ ▂▁ ▁▁ ▁▁ ▁▁ ▁_ ▁_ __ __ __
    4 __ ▁▁ ▂▁ ▂▂ ▂▁ ▂▂ ▂▁ ▂▁ ▁▁ ▁▁ ▁▁ ▁▁ ▁▁ ▁_ __ __
    5 _ ▁_ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▁▁ ▂▁ ▁▁ ▁_ __
    6 _ __ ▁▁ ▁▁ ▁▁ ▂▁ ▂▁ ▁▁ ▂▁ ▂▁ ▂▁ ▂▁ ▁▁ ▂▁ ▂▁ __
    7 _ __ ▁_ ▁▁ ▁▁ ▁▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▂ ▂▁ ▂▁ ▂▁ ▁▁ ▁_
    8 _ __ ▁_ ▁_ ▁▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▁▁
    9 _ _ __ ▁_ ▁▁ ▁▁ ▁▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▁▁
    a _ _ ▁_ ▁▁ ▁▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▂ ▂▂ ▂▁
    b _ _ __ ▁_ ▁▁ ▁▁ ▁▁ ▁▁ ▂▁ ▁▁ ▂▁ ▂▁ ▂▁ ▂▂ ▂▂ ▂▁
    c _ _ ▁_ ▁▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁
    d _ _ _ ▁_ ▁▁ ▁▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▁ ▂▂ ▂▁ ▂▁ ▂▁
    e _ __ __ ▁_ ▁_ ▁▁ ▁▁ ▂▁ ▂▁ ▂▁ ▁▁ ▂▁ ▂▁ ▂▂ ▂▁ ▂▂
    f __ _ _ __ ▁_ ▁_ ▁_ ▁_ ▁_ ▁_ ▁▁ ▁▁ ▂▁ ▂▁ ▂▂ ▅▄
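To help read this output, here is a minimal Python sketch of how per-bucket means and a conditional probability table could be computed from paired per-call samples. It is an illustration under stated assumptions, not the hook's actual implementation. It assumes rank-based buckets of near-equal size, and counts normalized per column, which matches the table's labeling (rows are duration buckets, columns are event-count buckets). All function names are hypothetical.

```python
def bucket_indices(values, bucket_count):
    """Assign each value a rank-based bucket index in [0, bucket_count)."""
    if len(values) < bucket_count:
        raise ValueError("need at least bucket_count samples")
    # Indices of the samples, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * bucket_count // len(values)
    return buckets


def bucket_means(values, bucket_count):
    """Mean of the samples in each bucket, lowest bucket first."""
    sums = [0.0] * bucket_count
    counts = [0] * bucket_count
    for value, b in zip(values, bucket_indices(values, bucket_count)):
        sums[b] += value
        counts[b] += 1
    return [s / c for s, c in zip(sums, counts)]


def conditional_probability_table(row_values, col_values, bucket_count):
    """table[r][c] = P(row sample in bucket r | column sample in bucket c)."""
    row_buckets = bucket_indices(row_values, bucket_count)
    col_buckets = bucket_indices(col_values, bucket_count)
    counts = [[0] * bucket_count for _ in range(bucket_count)]
    for r, c in zip(row_buckets, col_buckets):
        counts[r][c] += 1
    # Normalize each column so that it sums to 1.
    col_totals = [sum(counts[r][c] for r in range(bucket_count))
                  for c in range(bucket_count)]
    return [[counts[r][c] / col_totals[c] for c in range(bucket_count)]
            for r in range(bucket_count)]


# Toy data: four calls whose duration rises with the event count.
durations_ms = [10.0, 20.0, 30.0, 40.0]
event_counts = [1, 2, 3, 4]
print(bucket_means(durations_ms, 2))
print(conditional_probability_table(durations_ms, event_counts, 2))
```

Under this reading, a table whose weight sits on the diagonal, as in the output above, means that calls with more events tend to fall in slower duration buckets. Uniform columns would indicate no relationship.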
As this is an LD_PRELOAD hook, it can't take command-line arguments, so all the settings are controlled by environment variables:
IREE_HOOK_FILTER_NAME: If specified, only executable library calls to this specific function name are recorded. Otherwise all calls are gathered, which typically makes the results hard to interpret. One almost always wants to specify this.

IREE_HOOK_SKIP_START_MS: How many milliseconds to skip initially before recording data. Think of it as warm-up. Default 0.

IREE_HOOK_PERF_EVENT_TYPES: Comma-separated list of events to count. The available event names are a subset of the ones available in Linux's perf. The exact list is what is dumped by IREE_HOOK_LIST_EVENT_TYPES=1. If multiple event types are specified, a cross-event conditional probability table is printed for each pair of events, so the output grows quadratically. In general, pass only one event type unless specifically interested in correlating two events. There may also be CPU-specific overhead or limits associated with querying multiple event types simultaneously.

IREE_HOOK_LIST_EVENT_TYPES: If defined, dumps the list of event types that can be used in IREE_HOOK_PERF_EVENT_TYPES, and exits. Note that some are AMD-specific (indicated by [AMD]) while others are generic.

IREE_HOOK_OUTPUT_CSV: If defined, must point to an existing directory (e.g. /tmp/csv) in which to dump raw CSV data files. Otherwise, no CSV is dumped, just overall stats.

IREE_HOOK_BUCKET_COUNT: Controls the number of buckets (i.e. quantiles, such as the 16-iles in the output above) to distinguish when printing stats. Higher values result in more detailed but heavier output. Default 16.

IREE_HOOK_NO_PROBABILITY_TABLE: If defined, skips printing probability tables. Their size is (bucket_count) x (bucket_count).

IREE_HOOK_GAMMA: Gamma-correction factor in the semigraphical probability-table rendering. Default 0.5.
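To illustrate what a gamma-correction factor does in such a rendering, here is a minimal Python sketch. It is an assumption-laden illustration rather than the hook's actual code: the glyph ramp, the one-glyph-per-cell layout (the real output above uses two characters per cell), and the function name are all hypothetical. The idea shown is that a probability p is displayed as p raised to the power gamma, so a gamma below 1 makes small probabilities more visible.

```python
# Hypothetical 9-level ramp from "empty" to "full".
GLYPHS = "_▁▂▃▄▅▆▇█"


def render_cell(probability, gamma=0.5):
    """Map a probability in [0, 1] to one glyph after gamma correction."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    corrected = probability ** gamma
    # Split [0, 1] into len(GLYPHS) equal bands; clamp so 1.0 maps to the top glyph.
    level = min(int(corrected * len(GLYPHS)), len(GLYPHS) - 1)
    return GLYPHS[level]


# With 16 buckets, a uniform column has probability 1/16 in every cell.
uniform = 1.0 / 16
print(render_cell(uniform, gamma=0.5))  # gamma < 1 lifts the cell off the baseline
print(render_cell(uniform, gamma=1.0))  # no correction: the cell stays at the baseline
```

In this sketch, gamma=1.0 would leave most cells of a 16 x 16 table indistinguishable from zero, which is one plausible reason for a default below 1.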