|  | # Benchmarking | 
|  |  | 
|  | IREE uses benchmarks to inspect performance at varying levels of granularity. | 
|  | Benchmarking is implemented using the | 
|  | [Google Benchmark library](https://github.com/google/benchmark). To understand | 
|  | performance details and guide optimization, please refer to the | 
|  | IREE [profiling](./profiling.md) documentation. | 
|  |  | 
|  | ## Module Benchmarks | 
|  |  | 
`iree-benchmark-module` is a program that accepts (almost) the same inputs as
`iree-run-module` and benchmarks the invocation of a single entry function.
|  | It measures timing for the whole process of invoking a function through the VM, | 
|  | including allocating and freeing output buffers. This is a high-level benchmark | 
|  | of an entire invocation flow. It provides a big picture view, but depends on | 
|  | many different variables, like an integration test. For finer-grained | 
measurements more akin to unit tests, see
[Executable Benchmarks](#executable-benchmarks) and
[Bytecode Module Benchmarks](#bytecode-module-benchmarks) below.
|  |  | 
|  | To use `iree-benchmark-module`, generate an IREE module for the target backend: | 
|  |  | 
|  | ```shell | 
|  | $ bazel run //iree/tools:iree-translate -- \ | 
|  | -iree-mlir-to-vm-bytecode-module \ | 
|  | --iree-hal-target-backends=vmla \ | 
|  | $PWD/iree/tools/test/simple.mlir \ | 
|  | -o /tmp/module.fb | 
|  | ``` | 
|  |  | 
|  | and then benchmark an exported function in that module: | 
|  |  | 
|  | ```shell | 
|  | $ bazel run //iree/tools:iree-benchmark-module -- \ | 
|  | --module_file=/tmp/module.fb \ | 
|  | --driver=vmla \ | 
|  | --entry_function=abs \ | 
|  | --function_inputs="i32=-2" | 
|  | ``` | 
|  |  | 
You'll see output like:
|  |  | 
|  | ```shell | 
|  | Run on (12 X 4500 MHz CPU s) | 
|  | CPU Caches: | 
|  | L1 Data 32K (x6) | 
|  | L1 Instruction 32K (x6) | 
|  | L2 Unified 1024K (x6) | 
|  | L3 Unified 8448K (x1) | 
|  | Load Average: 2.21, 1.93, 3.34 | 
|  | ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may | 
|  | be noisy and will incur extra overhead. | 
|  | ***WARNING*** Library was built as DEBUG. Timings may be affected. | 
|  | ------------------------------------------------------------------------------ | 
|  | Benchmark                                    Time             CPU   Iterations | 
|  | ------------------------------------------------------------------------------ | 
|  | BM_RunModule/process_time/real_time       0.22 ms         0.23 ms         3356 | 
|  | ``` | 
|  |  | 
|  | Notice that there are a few warnings in there (you may not see all of these). | 
|  | The benchmark library helpfully warns about some common issues that will affect | 
|  | benchmark timing. When trying to obtain real benchmark numbers, you should | 
|  | generally build an optimized build (`-c opt` in Bazel) and | 
|  | [disable CPU scaling](#cpu-configuration). | 
|  |  | 
|  | ```shell | 
|  | $ bazel build -c opt //iree/tools:iree-benchmark-module | 
|  | ``` | 
|  |  | 
Another thing to consider is that, depending on where you are running the
benchmark, you might want to avoid additional programs running at the same
time. Bazel itself runs a server in the background even when it's not being
actively invoked, and that server can be quite a memory hog, so we'll invoke
the binary directly instead. Use your favorite process manager (e.g.
[htop](https://hisham.hm/htop/) or [pkill](https://en.wikipedia.org/wiki/Pkill)
on Linux) to kill heavy-weight programs such as Chrome and Bazel.
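
For example, on a Linux workstation you might shut down the background Bazel
server and kill any Chrome processes first. A sketch; the process names here
are illustrative, so adjust them for your system:

```shell
# Stop the background Bazel server.
$ bazel shutdown
# Kill any running Chrome processes (process name may differ on your system).
$ pkill chrome
```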
|  |  | 
|  | Now we'll actually invoke the binary: | 
|  |  | 
|  | ```shell | 
|  | $ ./bazel-bin/iree/tools/iree-benchmark-module \ | 
|  | --module_file=/tmp/module.fb \ | 
|  | --driver=vmla \ | 
|  | --entry_function=abs \ | 
|  | --function_inputs="i32=-2" | 
|  | ``` | 
|  |  | 
|  | ```shell | 
|  | Run on (12 X 4500 MHz CPU s) | 
|  | CPU Caches: | 
|  | L1 Data 32K (x6) | 
|  | L1 Instruction 32K (x6) | 
|  | L2 Unified 1024K (x6) | 
|  | L3 Unified 8448K (x1) | 
|  | Load Average: 1.49, 3.42, 3.49 | 
|  | ------------------------------------------------------------------------------ | 
|  | Benchmark                                    Time             CPU   Iterations | 
|  | ------------------------------------------------------------------------------ | 
|  | BM_RunModule/process_time/real_time      0.011 ms        0.014 ms        61654 | 
|  | ``` | 
|  |  | 
|  | Remember to [restore CPU scaling](#cpu-configuration) when you're done. | 
|  |  | 
|  | ## Executable Benchmarks | 
|  |  | 
|  | We also benchmark the performance of individual parts of the IREE system in | 
isolation. IREE breaks a model down into dispatch functions. To benchmark all the
|  | dispatch functions, generate an IREE module with | 
|  | `-iree-mlir-to-executable-benchmark-vm-module` for the target backend: | 
|  |  | 
|  | ```shell | 
|  | $ build/iree/tools/iree-translate \ | 
|  | -iree-mlir-to-executable-benchmark-vm-module \ | 
|  | -iree-hal-target-backends=vmla \ | 
|  | iree/test/e2e/models/fullyconnected.mlir \ | 
|  | -o /tmp/fullyconnected.vmfb | 
|  | ``` | 
|  |  | 
|  | and then benchmark all exported dispatch functions (and all exported functions) | 
|  | in that module: | 
|  |  | 
|  | ```shell | 
$ build/iree/tools/iree-benchmark-module \
--module_file=/tmp/fullyconnected.vmfb \
--driver=vmla
|  | ``` | 
|  |  | 
|  | If no `entry_function` is specified, `iree-benchmark-module` will register a | 
|  | benchmark for each exported function that takes no inputs. | 
|  |  | 
|  | You will see output like: | 
|  |  | 
|  | ```shell | 
|  | Run on (72 X 3700 MHz CPU s) | 
|  | CPU Caches: | 
|  | L1 Data 32 KiB (x36) | 
|  | L1 Instruction 32 KiB (x36) | 
|  | L2 Unified 1024 KiB (x36) | 
|  | L3 Unified 25344 KiB (x2) | 
|  | Load Average: 4.39, 5.72, 6.76 | 
|  | --------------------------------------------------------------------------------------------- | 
|  | Benchmark                                                   Time             CPU   Iterations | 
|  | --------------------------------------------------------------------------------------------- | 
|  | BM_main_ex_dispatch_0_entry/process_time/real_time      0.030 ms        0.037 ms        34065 | 
|  | BM_main_ex_dispatch_1_entry/process_time/real_time      0.034 ms        0.042 ms        20567 | 
|  | BM_main_ex_dispatch_2_entry/process_time/real_time      0.043 ms        0.051 ms        18576 | 
|  | BM_main_ex_dispatch_3_entry/process_time/real_time      0.029 ms        0.036 ms        21345 | 
|  | BM_main_ex_dispatch_4_entry/process_time/real_time      0.042 ms        0.051 ms        15880 | 
|  | BM_main_ex_dispatch_5_entry/process_time/real_time      0.030 ms        0.037 ms        17854 | 
|  | BM_main_ex_dispatch_6_entry/process_time/real_time      0.043 ms        0.052 ms        14919 | 
|  | BM_main_dummy_args/process_time/real_time               0.099 ms        0.107 ms         5892 | 
|  | ``` | 
|  |  | 
|  | ### Bytecode Module Benchmarks | 
|  |  | 
Normally the IREE VM is expected to be integrated into applications to drive
model execution, so its performance is of crucial importance. We strive to
introduce as little overhead as possible and have several benchmark binaries
dedicated to evaluating the VM's performance. These benchmark binaries are
named `*_benchmark` in the [`iree/vm/`](https://github.com/google/iree/tree/main/iree/vm)
directory. They also use the Google Benchmark library, like the benchmarks
above.
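
For example, you can discover and run these with Bazel. A sketch; the target
name in the `bazel run` command is illustrative, so substitute one of the
targets returned by the query:

```shell
# List the benchmark binaries under iree/vm/.
$ bazel query 'filter(".*_benchmark", kind("cc_binary", //iree/vm/...))'
# Build and run one in an optimized configuration (example target name).
$ bazel run -c opt //iree/vm:bytecode_dispatch_benchmark
```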
|  |  | 
|  | ## CPU Configuration | 
|  |  | 
|  | When benchmarking, it's important to consider the configuration of your CPUs. | 
|  | Most notably, CPU scaling can give variable results, so you'll usually want to | 
|  | disable it. This can get pretty complex, but the most basic thing to do is to | 
|  | run all CPUs at maximum frequency. The other thing to consider is what CPU(s) | 
|  | your program is running on. Both of these get more complicated on mobile and in | 
|  | multithreaded workloads. | 
|  |  | 
|  | ### Linux | 
|  |  | 
Google Benchmark provides some
[instructions](https://github.com/google/benchmark#disabling-cpu-frequency-scaling).
Note that the library will print "CPU scaling is enabled" warnings for any
configuration that
[doesn't have the scaling governor set to performance](https://github.com/google/benchmark/blob/3d1c2677686718d906f28c1d4da001c42666e6d2/src/sysinfo.cc#L228).
Similarly, the CPU frequency it reports is the
|  | [maximum frequency of cpu0](https://github.com/google/benchmark/blob/3d1c2677686718d906f28c1d4da001c42666e6d2/src/sysinfo.cc#L533), | 
|  | not the frequency of the processor it's actually running on. This means that | 
|  | more advanced configurations should ignore these messages. | 
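
To see which governor each CPU is currently using (the same sysfs files shown
in the [Android](#android) section below), you can read them directly:

```shell
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```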
|  |  | 
Turn off CPU scaling before benchmarking:
|  |  | 
|  | ```shell | 
|  | $ sudo cpupower frequency-set --governor performance | 
|  | ``` | 
|  |  | 
|  | Restore CPU scaling after benchmarking: | 
|  |  | 
|  | ```shell | 
|  | $ sudo cpupower frequency-set --governor powersave | 
|  | ``` | 
|  |  | 
To learn more about the different scaling governor settings, see
|  | https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt. To restrict | 
|  | which CPUs you run on, use the `taskset` command which takes a hexadecimal mask. | 
|  |  | 
To only run on the lowest-numbered CPU, you can run:
|  |  | 
|  | ```shell | 
|  | $ taskset 1 sleep 20 & | 
|  | ``` | 
|  |  | 
|  | You can confirm that the process is running on the given CPU: | 
|  |  | 
|  | ```shell | 
|  | $ ps -o psr $! | 
|  | ``` | 
|  |  | 
|  | Note that `$!` indicates the process ID of the last executed background command, | 
|  | so you can only use this shorthand if you didn't run any commands after the | 
|  | sleep. For more info on taskset, see https://linux.die.net/man/1/taskset. | 
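
Putting this together with the benchmark binary from earlier, you can pin the
whole benchmark run to the lowest-numbered CPU, for example (same flags as in
the earlier invocation):

```shell
$ taskset 1 ./bazel-bin/iree/tools/iree-benchmark-module \
--module_file=/tmp/module.fb \
--driver=vmla \
--entry_function=abs \
--function_inputs="i32=-2"
```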
|  |  | 
|  | ### Android | 
|  |  | 
|  | Read and understand the [Linux](#linux) instructions first. | 
|  |  | 
Android doesn't give us quite such nice tooling, but the principle is basically
the same. One important difference is that thermal throttling is a much bigger
concern on mobile. Without a cooling plate, high clock speeds are likely to
overheat the device and engage thermal throttling, which will ignore whatever
clock speeds you may have set in order to keep the device from catching fire.
Therefore, the naive approach above is likely not a good idea.
|  |  | 
|  | You will likely need to be root (use `su` or `adb root`). The commands will | 
|  | depend on your exact phone and number of cores. First play around and make sure | 
|  | you understand what everything means. Note that each CPU has its own files which | 
|  | are used to control its behavior, but changes to a single CPU will sometimes | 
|  | affect others (see `/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus`). | 
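
For example, to poke at these files from a host machine (assuming a rooted or
userdebug device connected over adb):

```shell
$ adb root
$ adb shell cat /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus
```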
|  |  | 
|  | Some useful files: | 
|  |  | 
|  | ```shell | 
|  | /proc/cpuinfo | 
|  | /sys/devices/system/cpu/possible | 
|  | /sys/devices/system/cpu/present | 
|  | /sys/devices/system/cpu/cpu0/online | 
|  | /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors | 
|  | /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor | 
|  | /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies | 
|  | /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq | 
|  | /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq | 
|  | /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq | 
|  | /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus | 
|  | /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed | 
|  | ``` | 
|  |  | 
See the clockspeed of each CPU:
|  |  | 
|  | ```shell | 
|  | $ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \ | 
|  | paste \ | 
|  | "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_cur_freq" \ | 
|  | "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_min_freq" \ | 
|  | "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_max_freq"; \ | 
|  | done | 
|  | ``` | 
|  |  | 
|  | Before changing things, make sure to check the current scaling governor settings | 
|  | first so you can put them back when you're done. | 
|  |  | 
|  | ```shell | 
|  | $ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \ | 
|  | cat "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \ | 
|  | done | 
|  | ``` | 
|  |  | 
|  | #### Single-Core Example | 
|  |  | 
|  | Here's an example to run IREE in a single-threaded context on CPU 7 at its | 
|  | lowest clock speed. | 
|  |  | 
|  | First we'll take control of the clockspeed by setting the governor to | 
|  | "userspace". | 
|  |  | 
|  | ```shell | 
|  | $ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \ | 
|  | echo userspace > \ | 
|  | "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \ | 
|  | done | 
|  | ``` | 
|  |  | 
We can now set individual clock speeds. We'll pin cpu7 to its minimum frequency.
We choose the minimum instead of the maximum here to mitigate thermal throttling
concerns:
|  |  | 
|  | ```shell | 
|  | $ cat /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_min_freq > \ | 
|  | /sys/devices/system/cpu/cpu7/cpufreq/scaling_setspeed | 
|  | ``` | 
|  |  | 
|  | We can confirm the frequencies of all the CPUs by running the same command | 
above. Now to run a command specifically on cpu7, use `taskset 80`
(the hexadecimal mask `80`, i.e. binary `10000000`, which selects only CPU 7):
|  |  | 
|  | ```shell | 
$ taskset 80 sleep 20 &
|  | $ ps -o psr $! | 
|  | ``` | 
|  |  | 
Remember to clean up when you're done! Here we'll set the scaling governor back
to `schedutil` because that's what it was before on the particular device this
was tested on, but that governor may not exist on all devices.
|  |  | 
|  | ```shell | 
|  | $ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \ | 
|  | echo schedutil > \ | 
|  | "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \ | 
|  | done | 
|  | ``` | 
|  |  | 
|  | TODO(scotttodd): Windows instructions |