IREE uses benchmarks to inspect performance at varying levels of granularity. Benchmarking is implemented using the Google Benchmark library. To understand performance details and guide optimization, please refer to the IREE profiling documentation.
iree-benchmark-module is a program accepting (almost) the same inputs as iree-run-module that will benchmark the invocation of a single entry function. It measures timing for the whole process of invoking a function through the VM, including allocating and freeing output buffers. This is a high-level benchmark of an entire invocation flow. It provides a big-picture view but, like an integration test, depends on many different variables. For finer-grained measurements more akin to unit tests, see Executable Benchmarks.
To use iree-benchmark-module, generate an IREE module for the target backend:
$ bazel run //iree/tools:iree-compile -- \
  -iree-mlir-to-vm-bytecode-module \
  -iree-hal-target-backends=vmvx \
  $PWD/iree/samples/models/simple_abs.mlir \
  -o /tmp/module.fb
and then benchmark an exported function in that module:
$ bazel run //iree/tools:iree-benchmark-module -- \
  --module_file=/tmp/module.fb \
  --driver=vmvx \
  --entry_function=abs \
  --function_input=f32=-2
You'll see output like:
Run on (12 X 4500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 1024K (x6)
  L3 Unified 8448K (x1)
Load Average: 2.21, 1.93, 3.34
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
------------------------------------------------------------------------------
BM_RunModule/process_time/real_time      0.22 ms         0.23 ms         3356
Notice that there are a few warnings in there (you may not see all of these). The benchmark library helpfully warns about some common issues that will affect benchmark timing. When trying to obtain real benchmark numbers, you should generally build an optimized build (-c opt in Bazel) and disable CPU scaling.
$ bazel build -c opt //iree/tools:iree-benchmark-module
Another thing to consider is that, depending on where you are running the benchmark, you might want to avoid additional programs running at the same time. Bazel itself runs a server even when it's not being actively invoked that can be quite a memory hog, so we'll instead invoke the binary directly. Use your favorite process manager (e.g. htop or pkill on Linux) to kill heavy-weight programs such as Chrome and Bazel.
Now we'll actually invoke the binary:
$ ./bazel-bin/iree/tools/iree-benchmark-module \
  --module_file=/tmp/module.fb \
  --driver=vmvx \
  --entry_function=abs \
  --function_input=f32=-2
Run on (12 X 4500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 1024K (x6)
  L3 Unified 8448K (x1)
Load Average: 1.49, 3.42, 3.49
------------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
------------------------------------------------------------------------------
BM_RunModule/process_time/real_time     0.011 ms        0.014 ms        61654
Remember to restore CPU scaling when you're done.
We also benchmark the performance of individual parts of the IREE system in isolation. IREE breaks a model down into dispatch functions. To benchmark all the dispatch functions, generate an IREE module with the -iree-flow-export-benchmark-funcs flag set:
$ build/iree/tools/iree-compile \
  -iree-input-type=mhlo \
  -iree-mlir-to-vm-bytecode-module \
  -iree-flow-export-benchmark-funcs \
  -iree-hal-target-backends=vmvx \
  iree/test/e2e/models/fullyconnected.mlir \
  -o /tmp/fullyconnected.vmfb
and then benchmark all exported dispatch functions (and all exported functions) in that module:
$ build/iree/tools/iree-benchmark-module --module_file=/tmp/fullyconnected.vmfb --driver=vmvx
If no entry_function is specified, iree-benchmark-module will register a benchmark for each exported function that takes no inputs.
You will see output like:
Run on (72 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x36)
  L1 Instruction 32 KiB (x36)
  L2 Unified 1024 KiB (x36)
  L3 Unified 25344 KiB (x2)
Load Average: 4.39, 5.72, 6.76
---------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
---------------------------------------------------------------------------------------------
BM_main_ex_dispatch_0_benchmark/process_time/real_time   0.030 ms        0.037 ms        34065
BM_main_ex_dispatch_1_benchmark/process_time/real_time   0.034 ms        0.042 ms        20567
BM_main_ex_dispatch_2_benchmark/process_time/real_time   0.043 ms        0.051 ms        18576
BM_main_ex_dispatch_3_benchmark/process_time/real_time   0.029 ms        0.036 ms        21345
BM_main_ex_dispatch_4_benchmark/process_time/real_time   0.042 ms        0.051 ms        15880
BM_main_ex_dispatch_5_benchmark/process_time/real_time   0.030 ms        0.037 ms        17854
BM_main_ex_dispatch_6_benchmark/process_time/real_time   0.043 ms        0.052 ms        14919
BM_main_benchmark/process_time/real_time                 0.099 ms        0.107 ms         5892
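With many dispatch functions it can help to rank the rows by time. The following is a minimal sketch, not part of IREE: slowest_benchmarks is a hypothetical helper that reads the console output on stdin, and it assumes every row reports Time in the same unit (ms in the output above).

```shell
# Print the N slowest benchmark rows (default 3) from Google Benchmark
# console output on stdin. slowest_benchmarks is a hypothetical helper name.
# Assumes every row uses the same time unit; mixed units would sort wrongly.
slowest_benchmarks() {
  # Keep only result rows, emit "time unit name", then sort by time descending.
  awk '/^BM_/ { print $2, $3, $1 }' | LC_ALL=C sort -rn | head -n "${1:-3}"
}

# Example on three rows copied from the sample output.
slowest_benchmarks 2 <<'EOF'
BM_main_ex_dispatch_0_benchmark/process_time/real_time 0.030 ms 0.037 ms 34065
BM_main_ex_dispatch_2_benchmark/process_time/real_time 0.043 ms 0.051 ms 18576
BM_main_benchmark/process_time/real_time 0.099 ms 0.107 ms 5892
EOF
```

In practice you would pipe the output of iree-benchmark-module into the function instead of using a here-document.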
Normally, the IREE VM is expected to be integrated into applications and to drive model execution, so its performance is crucial. We strive to introduce as little overhead as possible and have several benchmark binaries dedicated to evaluating the VM's performance. These benchmark binaries are named *_benchmark in the iree/vm/ directory. Like the benchmarks above, they use the Google Benchmark library.
When benchmarking, it's important to consider the configuration of your CPUs. Most notably, CPU scaling can give variable results, so you'll usually want to disable it. This can get pretty complex, but the most basic thing to do is to run all CPUs at maximum frequency. The other thing to consider is what CPU(s) your program is running on. Both of these get more complicated on mobile and in multithreaded workloads.
Google Benchmark provides some instructions. Note that the library will print "CPU scaling is enabled" warnings for any configuration that doesn't have the scaling governor set to performance. Similarly, the CPU frequency it reports is the maximum frequency of cpu0, not the frequency of the processor it's actually running on. This means that more advanced configurations should ignore these messages.
Turn off CPU scaling before benchmarking.
$ sudo cpupower frequency-set --governor performance
Restore CPU scaling after benchmarking:
$ sudo cpupower frequency-set --governor powersave
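Before a benchmark run, it can be useful to confirm which CPUs are not on the performance governor. The sketch below reads the standard cpufreq sysfs files directly. The function name non_performance_cpus and the CPU_ROOT override are hypothetical, introduced here only for illustration (CPU_ROOT lets you point the function at a fake directory tree to try it out).

```shell
# List every CPU whose scaling governor is not "performance".
# non_performance_cpus and CPU_ROOT are hypothetical names for this sketch.
non_performance_cpus() {
  root="${CPU_ROOT:-/sys/devices/system/cpu}"
  for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    # Skip CPUs without cpufreq support (or an unmatched glob).
    [ -r "$f" ] || continue
    gov=$(cat "$f")
    if [ "$gov" != "performance" ]; then
      cpu=${f#"$root"/}          # e.g. cpu3/cpufreq/scaling_governor
      echo "${cpu%%/*}: $gov"    # e.g. cpu3: powersave
    fi
  done
}

non_performance_cpus
```

Empty output means every CPU that exposes cpufreq is already set to performance.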
To learn more about different scaling governor settings, see https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt. To restrict which CPUs you run on, use the taskset command, which takes a hexadecimal mask.
To only run on the lowest-numbered CPU you can run
$ taskset 1 sleep 20 &
You can confirm that the process is running on the given CPU:
$ ps -o psr $!
Note that $! indicates the process ID of the last executed background command, so you can only use this shorthand if you didn't run any commands after the sleep. For more info on taskset, see https://linux.die.net/man/1/taskset.
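Each bit of the taskset mask selects one CPU, with bit 0 being the lowest-numbered CPU. As a small sketch, the hypothetical helper cpu_mask below (not part of taskset or IREE) builds the hexadecimal mask from a list of zero-based CPU indices:

```shell
# Build a hexadecimal taskset mask from zero-based CPU indices.
# cpu_mask is a hypothetical helper name used for illustration.
cpu_mask() {
  mask=0
  for cpu in "$@"; do
    mask=$(( mask | (1 << cpu) ))   # set the bit for this CPU
  done
  printf '%x\n' "$mask"
}

cpu_mask 0      # prints 1  (lowest-numbered CPU only)
cpu_mask 7      # prints 80 (CPU 7 only)
cpu_mask 0 1    # prints 3  (CPUs 0 and 1)
```

The result can be passed straight to taskset, for example as the mask argument in front of the command you want to pin.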
Read and understand the Linux instructions first.
Android doesn't give us quite as nice tooling, but the principle is basically the same. One important difference is that thermal throttling is a much bigger concern on mobile. Without a cooling plate, it is likely that high clock speeds will overheat the device and engage thermal throttling, which, to prevent things from catching on fire, will override whatever clock speeds you may have set. Therefore the naive approach above is likely not a good idea.
You will likely need to be root (use su or adb root). The commands will depend on your exact phone and number of cores. First play around and make sure you understand what everything means. Note that each CPU has its own files which are used to control its behavior, but changes to a single CPU will sometimes affect others (see /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus).
Some useful files:
/proc/cpuinfo
/sys/devices/system/cpu/possible
/sys/devices/system/cpu/present
/sys/devices/system/cpu/cpu0/online
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
See the clock speed of each CPU:
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
    paste \
      "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_cur_freq" \
      "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_min_freq" \
      "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_max_freq"; \
  done
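The loops in this section assume that /sys/devices/system/cpu/present holds a single range such as 0-7. The kernel's CPU list format also allows single indices and comma-separated ranges (for example 0-3,5-7), which the tr and seq pipeline does not handle. A minimal sketch of a more general expansion, using a hypothetical helper named expand_cpu_list:

```shell
# Expand a kernel CPU list such as "0-3,5-7" into one index per line.
# expand_cpu_list is a hypothetical helper name used for illustration.
expand_cpu_list() {
  echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
    # A lone index like "4" has no upper bound, so use it for both ends.
    seq "$first" "${last:-$first}"
  done
}

expand_cpu_list "0-3,5-7"
```

Its output could replace the backquoted pipeline in the for loops, for example by passing it the contents of the present file.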
Before changing things, make sure to check the current scaling governor settings first so you can put them back when you're done.
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
    cat "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
  done
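Rather than noting the governors by hand, you could record them in a file and replay that file later. This is a sketch under assumptions: save_governors, restore_governors, and the CPU_ROOT variable are hypothetical names, writing to the real sysfs files requires root, and the sketch does not handle a governor that is no longer available on restore.

```shell
# CPU_ROOT is a hypothetical override so the sketch can be tried on a fake tree.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Record "<governor file> <governor>" for every CPU that exposes cpufreq.
save_governors() {
  for f in "$CPU_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
    if [ -r "$f" ]; then
      echo "$f $(cat "$f")"
    fi
  done > "$1"
}

# Write each recorded governor back to the file it was read from.
restore_governors() {
  while read -r f gov; do
    echo "$gov" > "$f"
  done < "$1"
}
```

Call save_governors with a state file path before changing anything, and restore_governors with the same path when you are done benchmarking.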
Here's an example to run IREE in a single-threaded context on CPU 7 at its lowest clock speed.
First we'll take control of the clockspeed by setting the governor to “userspace”.
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
    echo userspace > \
      "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
  done
We can now set individual clock speeds. We'll pin cpu7 to its minimum frequency. We choose the minimum instead of the maximum here to mitigate thermal throttling concerns.
$ cat /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_min_freq > \
    /sys/devices/system/cpu/cpu7/cpufreq/scaling_setspeed
We can confirm the frequencies of all the CPUs by running the same command as above. Now, to run a command specifically on cpu7, use taskset 80 (hexadecimal 80 is binary 10000000, which selects only bit 7):
$ taskset 80 sleep 20 &
$ ps -o psr $!
Remember to clean up when you're done! Here we'll set the scaling governor back to schedutil because that's what it was on the particular device this was tested on, but that governor may not exist on all devices.
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
    echo schedutil > \
      "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
  done
TODO(scotttodd): Windows instructions