
 # Benchmarking

 IREE uses benchmarks to inspect performance at varying levels of granularity.
 Benchmarking is implemented using the
 [Google Benchmark library](https://github.com/google/benchmark). To understand
 performance details and guide optimization, please refer to the
 IREE [profiling](./profiling.md) documentation.

 ## Module Benchmarks

 `iree-benchmark-module` is a program accepting (almost) the same inputs as
 `iree-run-module` that will benchmark the invocation of a single entry function.
 It measures timing for the whole process of invoking a function through the VM,
 including allocating and freeing output buffers. This is a high-level benchmark
 of an entire invocation flow. It provides a big picture view, but depends on
 many different variables, like an integration test. For finer-grained
 measurements more akin to unit tests, see
 [Executable Benchmarks](#executable-benchmarks).

 To use `iree-benchmark-module`, generate an IREE module for the target backend:

 ```shell
 $ bazel run //iree/tools:iree-translate -- \
   -iree-mlir-to-vm-bytecode-module \
   --iree-hal-target-backends=vmla \
   $PWD/iree/tools/test/iree-benchmark-module.mlir \
   -o /tmp/module.fb
 ```

 and then benchmark an exported function in that module:

 ```shell
 $ bazel run //iree/tools:iree-benchmark-module -- \
   --module_file=/tmp/module.fb \
   --driver=vmla \
   --entry_function=abs \
   --function_input=i32=-2
 ```

 You'll see output like:

 ```shell
 Run on (12 X 4500 MHz CPU s)
 CPU Caches:
   L1 Data 32K (x6)
   L1 Instruction 32K (x6)
   L2 Unified 1024K (x6)
   L3 Unified 8448K (x1)
 Load Average: 2.21, 1.93, 3.34
 ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
  be noisy and will incur extra overhead.
 ***WARNING*** Library was built as DEBUG. Timings may be affected.
 ------------------------------------------------------------------------------
 Benchmark                                    Time             CPU   Iterations
 ------------------------------------------------------------------------------
 BM_RunModule/process_time/real_time       0.22 ms         0.23 ms         3356
 ```

 Notice that there are a few warnings in there (you may not see all of these).
 The benchmark library helpfully warns about some common issues that will affect
 benchmark timing. When trying to obtain real benchmark numbers, you should
 generally build an optimized build (`-c opt` in Bazel) and
 [disable CPU scaling](#cpu-configuration).

 ```shell
 $ bazel build -c opt //iree/tools:iree-benchmark-module
 ```

 Depending on where you run the benchmark, you may also want to stop other
 programs from running at the same time. Bazel keeps a server running even when
 it is not being actively invoked, and that server can be quite a memory hog, so
 we'll invoke the binary directly instead. Use your favorite process manager
 (e.g. [htop](https://hisham.hm/htop/) or
 [pkill](https://en.wikipedia.org/wiki/Pkill) on Linux) to kill heavyweight
 programs such as Chrome and Bazel.

 Now we'll actually invoke the binary:

 ```shell
 $ ./bazel-bin/iree/tools/iree-benchmark-module \
   --module_file=/tmp/module.fb \
   --driver=vmla \
   --entry_function=abs \
   --function_input=i32=-2
 ```

 ```shell
 Run on (12 X 4500 MHz CPU s)
 CPU Caches:
   L1 Data 32K (x6)
   L1 Instruction 32K (x6)
   L2 Unified 1024K (x6)
   L3 Unified 8448K (x1)
 Load Average: 1.49, 3.42, 3.49
 ------------------------------------------------------------------------------
 Benchmark                                    Time             CPU   Iterations
 ------------------------------------------------------------------------------
 BM_RunModule/process_time/real_time      0.011 ms        0.014 ms        61654
 ```

 Remember to [restore CPU scaling](#cpu-configuration) when you're done.

 ## Executable Benchmarks

 We also benchmark the performance of individual parts of the IREE system in
 isolation. IREE breaks a model down to dispatch functions. To benchmark all the
 dispatch functions, generate an IREE module with the
 `-iree-flow-export-benchmark-funcs` flag set:

 ```shell
 $ build/iree/tools/iree-translate \
   -iree-mlir-to-vm-bytecode-module \
   -iree-flow-export-benchmark-funcs \
   -iree-hal-target-backends=vmla \
   iree/test/e2e/models/fullyconnected.mlir \
   -o /tmp/fullyconnected.vmfb
 ```

 and then benchmark all exported dispatch functions (and all exported functions)
 in that module:

 ```shell
 $ build/iree/tools/iree-benchmark-module \
   --module_file=/tmp/fullyconnected.vmfb \
   --driver=vmla
 ```

 If no `entry_function` is specified, `iree-benchmark-module` will register a
 benchmark for each exported function that takes no inputs.

 You will see output like:

 ```shell
 Run on (72 X 3700 MHz CPU s)
 CPU Caches:
   L1 Data 32 KiB (x36)
   L1 Instruction 32 KiB (x36)
   L2 Unified 1024 KiB (x36)
   L3 Unified 25344 KiB (x2)
 Load Average: 4.39, 5.72, 6.76
 ---------------------------------------------------------------------------------------------
 Benchmark                                                   Time             CPU   Iterations
 ---------------------------------------------------------------------------------------------
 BM_main_ex_dispatch_0_benchmark/process_time/real_time  0.030 ms        0.037 ms        34065
 BM_main_ex_dispatch_1_benchmark/process_time/real_time  0.034 ms        0.042 ms        20567
 BM_main_ex_dispatch_2_benchmark/process_time/real_time  0.043 ms        0.051 ms        18576
 BM_main_ex_dispatch_3_benchmark/process_time/real_time  0.029 ms        0.036 ms        21345
 BM_main_ex_dispatch_4_benchmark/process_time/real_time  0.042 ms        0.051 ms        15880
 BM_main_ex_dispatch_5_benchmark/process_time/real_time  0.030 ms        0.037 ms        17854
 BM_main_ex_dispatch_6_benchmark/process_time/real_time  0.043 ms        0.052 ms        14919
 BM_main_benchmark/process_time/real_time                0.099 ms        0.107 ms         5892
 ```
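
With many dispatch functions it can help to rank the rows by time. The following is a minimal sketch and not part of IREE: `rank_benchmarks` is a hypothetical helper that reads the console table on standard input, and it assumes every row reports its time in the same unit.

```shell
# Hypothetical helper: read Google Benchmark console output on stdin and print
# "<time> <unit> <name>" for every BM_ row, slowest first.
# Assumes all rows use the same time unit, as in the sample output above.
rank_benchmarks() {
  awk '/^ *BM_/ { print $2, $3, $1 }' | sort -rn
}
```

Piping the `iree-benchmark-module` output shown above through `rank_benchmarks` would put `BM_main_benchmark` (0.099 ms) on the first line, followed by the individual dispatch functions.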

 ### Bytecode Module Benchmarks

 The IREE VM is normally integrated into applications, where it drives model
 execution, so its performance is crucial. We strive to introduce as little
 overhead as possible and have several benchmark binaries dedicated to
 evaluating the VM's performance. These benchmark binaries are named
 `*_benchmark` and live in the
 [`iree/vm/`](https://github.com/google/iree/tree/main/iree/vm) directory. Like
 the benchmarks above, they use the Google Benchmark library.

 ## CPU Configuration

 When benchmarking, it's important to consider the configuration of your CPUs.
 Most notably, CPU scaling can give variable results, so you'll usually want to
 disable it. This can get pretty complex, but the most basic thing to do is to
 run all CPUs at maximum frequency. The other thing to consider is what CPU(s)
 your program is running on. Both of these get more complicated on mobile and in
 multithreaded workloads.

 ### Linux

 Google Benchmark provides some
 [instructions](https://github.com/google/benchmark#disabling-cpu-frequency-scaling).
 Note that the library will print "CPU scaling is enabled" warnings for any
 configuration that
 [doesn't have the scaling governor set to performance](https://github.com/google/benchmark/blob/3d1c2677686718d906f28c1d4da001c42666e6d2/src/sysinfo.cc#L228).
 Similarly, the CPU frequency it reports is the
 [maximum frequency of cpu0](https://github.com/google/benchmark/blob/3d1c2677686718d906f28c1d4da001c42666e6d2/src/sysinfo.cc#L533),
 not the frequency of the processor it's actually running on. This means that
 more advanced configurations should ignore these messages.

 Turn off CPU scaling before benchmarking:

 ```shell
 $ sudo cpupower frequency-set --governor performance
 ```

 Restore CPU scaling after benchmarking:

 ```shell
 $ sudo cpupower frequency-set --governor powersave
 ```
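
To confirm the governor actually changed, you can read the per-CPU `scaling_governor` files directly. This is a minimal sketch rather than an IREE tool: `check_performance_governor` and the `CPU_ROOT` variable are hypothetical names, and `CPU_ROOT` exists only so the function can be tried against a scratch directory.

```shell
# Root of the cpufreq sysfs tree (hypothetical override for experimentation).
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Print every CPU whose governor is not "performance".
# Returns 0 only when all readable governors are set to "performance".
check_performance_governor() {
  status=0
  for f in "$CPU_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -r "$f" ] || continue
    governor=$(cat "$f")
    if [ "$governor" != "performance" ]; then
      echo "$f: $governor"
      status=1
    fi
  done
  return "$status"
}
```

Running `check_performance_governor` before a benchmark prints nothing and succeeds when every CPU is ready. Otherwise it lists the CPUs that still use another governor.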

 To learn more about the different scaling governor settings, see
 https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt.

 To restrict which CPUs you run on, use the `taskset` command, which takes a
 hexadecimal mask. To run only on the lowest-numbered CPU:

 ```shell
 $ taskset 1 sleep 20 &
 ```

 You can confirm that the process is running on the given CPU:

 ```shell
 $ ps -o psr $!
 ```

 Note that `$!` expands to the process ID of the most recently started
 background command, so this shorthand only works if you haven't started another
 background command since the `sleep`. For more info on taskset, see
 https://linux.die.net/man/1/taskset.
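
The mask has bit N set for CPU N, so it can be computed instead of written by hand. A minimal sketch follows; `cpu_mask` is a hypothetical helper name, not a standard tool.

```shell
# Hypothetical helper: print the hexadecimal taskset mask that selects the
# CPU numbers given as arguments (bit N of the mask selects CPU N).
cpu_mask() {
  mask=0
  for cpu in "$@"; do
    mask=$((mask | (1 << cpu)))
  done
  printf '%x\n' "$mask"
}
```

For example, `cpu_mask 0` prints `1`, `cpu_mask 7` prints `80`, and `cpu_mask 4 5 6 7` prints `f0`, so `taskset "$(cpu_mask 7)" sleep 20 &` would pin the command to CPU 7.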

 ### Android

 Read and understand the [Linux](#linux) instructions first.

 Android doesn't give us quite as nice tooling, but the principle is basically
 the same. One important difference is that thermal throttling is a much bigger
 concern on mobile. Without a cooling plate, it is likely that high clock speeds
 will overheat the device and engage thermal throttling, which overrides
 whatever clock speeds you may have set in order to keep things from catching on
 fire. Therefore, the naive approach of running all CPUs at maximum frequency is
 likely not a good idea.

 You will likely need to be root (use `su` or `adb root`). The commands will
 depend on your exact phone and number of cores. First play around and make sure
 you understand what everything means. Note that each CPU has its own files which
 are used to control its behavior, but changes to a single CPU will sometimes
 affect others (see `/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus`).

 Some useful files:

 ```shell
 /proc/cpuinfo
 /sys/devices/system/cpu/possible
 /sys/devices/system/cpu/present
 /sys/devices/system/cpu/cpu0/online
 /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
 /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
 /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
 /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus
 /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
 ```

 See the clock speed of each CPU:

 ```shell
 $ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
     paste \
       "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_cur_freq" \
       "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_min_freq" \
       "/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_max_freq"; \
 done
 ```
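
The loop above assumes that `present` holds a single range such as `0-7`. The kernel's CPU list format also allows a single number or comma-separated ranges such as `0-3,6`. The following is a minimal sketch that expands any of these forms; `expand_cpu_list` is a hypothetical helper name.

```shell
# Hypothetical helper: expand a kernel CPU list such as "0-3,6" into one CPU
# number per line. A bare number like "6" is treated as the range "6-6".
expand_cpu_list() {
  echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
    seq "$first" "${last:-$first}"
  done
}
```

It could stand in for the `tr '-' ' ' | xargs seq` pipeline, for example `for i in $(expand_cpu_list "$(cat /sys/devices/system/cpu/present)")`.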

 Before changing things, make sure to check the current scaling governor settings
 first so you can put them back when you're done.

 ```shell
 $ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
     cat "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
 done
 ```
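
Rather than noting the governors down by hand, you can record them in a file and replay that file afterwards. This is a minimal sketch under stated assumptions: `save_governors`, `restore_governors`, and the `CPU_ROOT` override are hypothetical names, the sysfs paths are assumed to contain no spaces, and writing the real files requires root.

```shell
# Root of the cpufreq sysfs tree (hypothetical override for experimentation).
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Write one "<governor file> <governor>" line per CPU to the file named by $1.
save_governors() {
  for f in "$CPU_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -r "$f" ] || continue
    printf '%s %s\n' "$f" "$(cat "$f")"
  done > "$1"
}

# Re-apply the governors recorded by save_governors in the file named by $1.
restore_governors() {
  while read -r f governor; do
    echo "$governor" > "$f"
  done < "$1"
}
```

A session would call `save_governors /data/local/tmp/governors.txt` before changing anything and `restore_governors /data/local/tmp/governors.txt` when done. The file location is only an example.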

 #### Single-Core Example

 Here's an example to run IREE in a single-threaded context on CPU 7 at its
 lowest clock speed.

 First we'll take control of the clockspeed by setting the governor to
 "userspace".

 ```shell
 $ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
   echo userspace > \
     "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
 done
 ```

 We can now set individual clock speeds. We'll pin cpu7 to its minimum frequency.
 We choose the minimum instead of the maximum here to mitigate thermal throttling
 concerns:

 ```shell
 $ cat /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_min_freq > \
 /sys/devices/system/cpu/cpu7/cpufreq/scaling_setspeed
 ```

 We can confirm the frequencies of all the CPUs by rerunning the clock speed
 command above. Now, to run a command specifically on cpu7, use `taskset 80`
 (hex for binary 10000000):

 ```shell
 $ taskset 80 sleep 20 &
 $ ps -o psr $!
 ```

 Remember to clean up when you're done! Here we'll set the scaling governors back
 to schedutil because that's what they were before on the particular device this
 was tested on, but that governor may not exist on all devices.

 ```shell
 $ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
   echo schedutil > \
     "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
 done
 ```

 TODO(scotttodd): Windows instructions