Benchmarking

IREE uses benchmarks to inspect performance at varying levels of granularity. Benchmarking is implemented using the Google Benchmark library. To understand performance details and guide optimization, please refer to the IREE profiling documentation.

Module Benchmarks

iree-benchmark-module is a program accepting (almost) the same inputs as iree-run-module that will benchmark the invocation of a single entry function. It measures timing for the whole process of invoking a function through the VM, including allocating and freeing output buffers. This is a high-level benchmark of an entire invocation flow. It provides a big picture view, but depends on many different variables, like an integration test. For finer-grained measurements more akin to unit tests, see Executable Benchmarks.

To use iree-benchmark-module, generate an IREE module for the target backend:

$ bazel run //iree/tools:iree-compile -- \
  -iree-mlir-to-vm-bytecode-module \
  -iree-hal-target-backends=vmvx \
  $PWD/iree/samples/models/simple_abs.mlir \
  -o /tmp/module.fb

and then benchmark an exported function in that module:

$ bazel run //iree/tools:iree-benchmark-module -- \
  --module_file=/tmp/module.fb \
  --driver=vmvx \
  --entry_function=abs \
  --function_input=f32=-2

You'll see output like

Run on (12 X 4500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 1024K (x6)
  L3 Unified 8448K (x1)
Load Average: 2.21, 1.93, 3.34
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
 be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
BM_RunModule/process_time/real_time       0.22 ms         0.23 ms         3356

Notice that there are a few warnings in there (you may not see all of these). The benchmark library helpfully warns about some common issues that will affect benchmark timing. When trying to obtain real benchmark numbers, you should generally build an optimized build (-c opt in Bazel) and disable CPU scaling.

$ bazel build -c opt //iree/tools:iree-benchmark-module

Another thing to consider is that depending on where you are running the benchmark you might want to avoid additional programs running at the same time. Bazel itself runs a server even when it‘s not being actively invoked that can be quite a memory hog, so we’ll instead invoke the binary directly. Use your favorite process manager (e.g. htop or pkill on Linux) to kill heavy-weight programs such as Chrome and Bazel.

Now we'll actually invoke the binary:

$ ./bazel-bin/iree/tools/iree-benchmark-module \
  --module_file=/tmp/module.fb \
  --driver=vmvx \
  --entry_function=abs \
  --function_input=f32=-2

Run on (12 X 4500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 1024K (x6)
  L3 Unified 8448K (x1)
Load Average: 1.49, 3.42, 3.49
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
BM_RunModule/process_time/real_time      0.011 ms        0.014 ms        61654

Remember to restore CPU scaling when you're done.

Executable Benchmarks

We also benchmark the performance of individual parts of the IREE system in isolation. IREE breaks a model down to dispatch functions. To benchmark all the dispatch functions, generate an IREE module with the -iree-flow-export-benchmark-funcs flag set:

$ build/iree/tools/iree-compile \
  -iree-input-type=mhlo \
  -iree-mlir-to-vm-bytecode-module \
  -iree-flow-export-benchmark-funcs \
  -iree-hal-target-backends=vmvx \
  iree/test/e2e/models/fullyconnected.mlir \
  -o /tmp/fullyconnected.vmfb

and then benchmark all exported dispatch functions (and all exported functions) in that module:

$ build/iree/tools/iree-benchmark-module
  --module_file=/tmp/fullyconnected.vmfb
  --driver=vmvx

If no entry_function is specified, iree-benchmark-module will register a benchmark for each exported function that takes no inputs.

You will see output like:

Run on (72 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x36)
  L1 Instruction 32 KiB (x36)
  L2 Unified 1024 KiB (x36)
  L3 Unified 25344 KiB (x2)
Load Average: 4.39, 5.72, 6.76
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
BM_main_ex_dispatch_0_benchmark/process_time/real_time  0.030 ms        0.037 ms        34065
BM_main_ex_dispatch_1_benchmark/process_time/real_time  0.034 ms        0.042 ms        20567
BM_main_ex_dispatch_2_benchmark/process_time/real_time  0.043 ms        0.051 ms        18576
BM_main_ex_dispatch_3_benchmark/process_time/real_time  0.029 ms        0.036 ms        21345
BM_main_ex_dispatch_4_benchmark/process_time/real_time  0.042 ms        0.051 ms        15880
BM_main_ex_dispatch_5_benchmark/process_time/real_time  0.030 ms        0.037 ms        17854
BM_main_ex_dispatch_6_benchmark/process_time/real_time  0.043 ms        0.052 ms        14919
BM_main_benchmark/process_time/real_time                0.099 ms        0.107 ms         5892

Bytecode Module Benchmarks

Normally, the IREE VM is expected to be integrated into applications and driving model execution. So its performance is of crucial importance. We strive to introduce as little overhead as possible and have several benchmark binaries dedicated for evaluating the VM's performance. These benchmark binaries are named as *_benchmark in the iree/vm/ directory. They also use the Google Benchmark library as the above.

CPU Configuration

When benchmarking, it‘s important to consider the configuration of your CPUs. Most notably, CPU scaling can give variable results, so you’ll usually want to disable it. This can get pretty complex, but the most basic thing to do is to run all CPUs at maximum frequency. The other thing to consider is what CPU(s) your program is running on. Both of these get more complicated on mobile and in multithreaded workloads.

Linux

Google benchmark provides some instructions. Note that the library will print “CPU scaling is enabled” warnings for any configuration that doesn't have the quota governor set to performance. Similarly the CPU frequency it reports is the maximum frequency of cpu0, not the frequency of the processor it's actually running on. This means that more advanced configurations should ignore these messages.

Turn off CPU scaling before benchmarking.

$ sudo cpupower frequency-set --governor performance

Restore CPU scaling after benchmarking:

$ sudo cpupower frequency-set --governor powersave

To learn more about different quota governor settings, see https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt. To restrict which CPUs you run on, use the taskset command which takes a hexadecimal mask.

To only run on the lowest-numbered CPU you can run

$ taskset 1 sleep 20 &

You can confirm that the process is running on the given CPU:

$ ps -o psr $!

Note that $! indicates the process ID of the last executed background command, so you can only use this shorthand if you didn't run any commands after the sleep. For more info on taskset, see https://linux.die.net/man/1/taskset.

Android

Read and understand the Linux instructions first.

Android doesn't give us quite as nice tooling, but the principle is basically the same. One important difference is that thermal throttling is a much bigger concern on mobile. Without a cooling plate, it is likely that high clock speeds will overheat the device and engage thermal throttling, which will ignore whatever clock speeds you may have set to prevent things from catching on fire. Therefore the naive approach above is likely not a good idea.

You will likely need to be root (use su or adb root). The commands will depend on your exact phone and number of cores. First play around and make sure you understand what everything means. Note that each CPU has its own files which are used to control its behavior, but changes to a single CPU will sometimes affect others (see /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus).

Some useful files:

/proc/cpuinfo
/sys/devices/system/cpu/possible
/sys/devices/system/cpu/present
/sys/devices/system/cpu/cpu0/online
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed

See the clockspeed of each CPU

$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
    paste \
      "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_cur_freq" \
      "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_min_freq" \
      "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_max_freq"; \
done

Before changing things, make sure to check the current scaling governor settings first so you can put them back when you're done.

$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
    cat "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
done

Single-Core Example

Here's an example to run IREE in a single-threaded context on CPU 7 at its lowest clock speed.

First we'll take control of the clockspeed by setting the governor to “userspace”.

$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
  echo userspace > \
    "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
done

We can now set individual clock speeds. We'll pin cpu7 to its minimum frequency. We choose the minimum instead of the maximum here to mitigate thermal throttling concerns

$ cat /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_min_freq > \
/sys/devices/system/cpu/cpu7/cpufreq/scaling_setspeed

We can confirm the frequencies of all the CPUs by running the same command above. Now to run a command specifically on cpu7, use taskset 80 (hex for 10000000):

$ taskset 80 sleep 20 &
$ ps -o psr $!

Remember to cleanup when you‘re done! Here we’ll set the scaling governor back to schedutil because that's what they were before on the particular device this, was tested on, but that may not exist on all devices.

$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
  echo schedutil > \
    "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
done

TODO(scotttodd): Windows instructions