docs/developers/developing_iree/profiling_cpu_events.md - 3p/openxla/iree - Git at Google

 # CPU cache and other CPU event profiling

 CPUs are able to
 [record](https://en.wikipedia.org/wiki/Hardware_performance_counter) certain
 events that may be relevant when investigating the performance of a program. A
 common example of such an event is a ["cache
 miss"](https://en.wikipedia.org/wiki/CPU_cache#Cache_miss), when the program
 tries to access data in memory that isn't already in some CPU cache, causing
 that access to be slower than it could otherwise be.

 Querying and analyzing this data can be useful, but is hard in two distinct
 ways:
 * Depending on the CPU and on the OS, both hardware and software limitations can
   get in the way of obtaining accurate data.
 * This data tends to be inherently difficult to interpret, even when it is
   perfectly accurate. In practice it is often noisy and inaccurate, which makes
   interpretation even more complicated.

 There are two parts to this page: platform-specific information about [how to
 query](#perf-and-simpleperf-on-linux-and-android) this data, and, at the end, a
 platform-independent explanation of [how to
 interpret](#interpreting-cpu-event-counts) it.

 ## Perf and Simpleperf, on Linux and Android

 ### Overview

 The Linux kernel exposes system event counters to user-space programs by means
 of the
 [`perf_event_open`](https://man7.org/linux/man-pages/man2/perf_event_open.2.html)
 system call. This includes both hardware event counters (such as CPU cache
 events) and software events from the kernel (such as page faults and context
 switches). Anyone may use this system call to implement a profiler, but Linux
 readily offers one, [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page).

 ### Preserving Artifacts

 By default IREE cleans up any temporary files it creates while running. Tools
 like perf, however, require those files exist even after the process has exited.
 The environment variable `IREE_PRESERVE_DYLIB_TEMP_FILES` can be set to preserve
 the files. This is only needed for the CPU path when using the system loader.

 ```shell
 export IREE_PRESERVE_DYLIB_TEMP_FILES=1
 ```

 ### Desktop Linux

 On desktop Linux we can use
 [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page). It is provided on
 most Linux distributions, for instance on Debian-based distributions do:
 ```shell
 sudo apt install linux-perf
 ```

 Run the program to be profiled, prepending its command line with `perf record`.
 By default this will write the profile data to the current directory,
 `./perf.data`. Sometimes this isn't ideal, such as then the current directory is
 under version control. Explicit paths can be specified by `-o` flag to direct
 the output of `perf record`, and then by `-i` flags to select the input of
 subsequent commands analyzing the profile. Example:

 ```shell
 perf record -o /tmp/perf.data \
   ./iree/tools/iree-benchmark-module \
     --driver=dylib \
     ... command-line arguments of iree-benchmark-module as usual ...
 ```

 By default, this samples time spent. One may specify instead an event to sample
 by, with the `-e` flag. For instance, to sample by L1 cache misses, one may do:

 ```shell
 perf record -o /tmp/perf.data -e L1-dcache-load-misses \
   ./iree/tools/iree-benchmark-module \
     --driver=dylib \
     ... command-line arguments of iree-benchmark-module as usual ...
 ```

 `perf list` dumps the list of event types.

 Once you have recorded a profile, there are two main ways to analyze it: `perf
 report` and `perf annotate`.

 `perf report` breaks down the event counts by symbol. In the default case where
 what was sampled was time, this is just an ordinary profile by symbol name, no
 different than what could be viewed in other profilers such as
 [Tracy](profiling_with_tracy.md). Where it gets really interesting is when the
 profile was recording a specific event type, as in the above `-e
 L1-dcache-load-misses` example:

 ```
 perf report -i /tmp/perf.data

 Samples: 6K of event 'L1-dcache-load-misses', Event count (approx.): 362571861
 Overhead  Command          Shared Object              Symbol
   61.53%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_31
   13.30%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_11
    2.11%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_13
    1.90%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_19
    1.54%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_25
    1.49%  cpu0             dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_5
 ```

 `perf annotate` breaks down the event counts by instruction. Again, in the
 default case where what was sampled was time, this is no different than what
 could be viewed in Tracy, and the real motivation to use `perf` is when
 profiling by specific event types as in the above `-e L1-dcache-load-misses`
 example:

 ```
 perf annotate -i perf.data

 Samples: 6K of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 362571861
 serving_default_ex_dispatch_31  /tmp/dylib_executablenzpx2Q.so [Percent: local period]
   1.66 │        movups -0x1000(%rdi),%xmm10
   0.48 │        movups -0x800(%rdi),%xmm9
   0.82 │        movups (%rdi),%xmm8
   0.49 │        movaps %xmm1,%xmm4
   0.12 │        shufps $0x0,%xmm1,%xmm4
   0.14 │        mulps  %xmm5,%xmm4
   0.28 │        addps  %xmm6,%xmm4
   0.60 │        movaps %xmm3,%xmm6
   0.34 │        shufps $0x0,%xmm3,%xmm6
 ```

 #### Warning

 `perf annotate` is even noisier than `perf report` as it can be overly
 optimistic, depending on the CPU, to pin an event to a specific instruction.
 Typically, this works fairly well on x86 CPUs and less well on ARM CPUs and more
 generally on anything mobile. Even on a desktop x86 CPU, this is noisy, as the
 above example (recorded on a Skylake workstation) shows: it blamed a `mulps
 %xmm5,%xmm4` instruction for a cache miss, which doesn't make sense as that
 instruction only touches registers.

 ### Android

 On Android we can use
 [`simpleperf`](https://developer.android.com/ndk/guides/simpleperf). It's
 preinstalled on current Android `userdebug` images, and part of the Android NDK.

 In theory, as Android is Linux, it should be possible to use `perf`.
 Unfortunately, `perf` is difficult to build for Android. Fortunately,
 `simpleperf` is readily available: it is preinstalled in Android `userdebug`
 images, and it is part of the Android NDK.

 First, we record on the device:
 ```shell
 adb shell \
   simpleperf record -e raw-l1d-cache-refill -o /data/local/tmp/perf.data \
     /data/local/tmp/iree-benchmark-module \
       --driver=dylib \
       ... command-line arguments of iree-benchmark-module as usual ...
 ```

 Then pull the recorded data from the device, and analyze on the desktop. We
 assume that `${ANDROID_NDK}` points to the local copy of the Android NDK.
 ```shell
 adb pull /data/local/tmp/perf.data /tmp/perf.data
 ${ANDROID_NDK}/simpleperf/report.py -i /tmp/perf.data
 ```

 This prints a breakdown of `raw-l1d-cache-refill` events by symbol.

 Like with `perf`, a list of event types can be queried by the `list` subcommand:

 ```shell
 adb shell simpleperf list
 ```

 #### No support for `annotate` by CPU event

 There is no `simpleperf annotate`. The `simpleperf` documentation lists a couple
 of
 [ways](https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md#show-annotated-source-code-and-disassembly)
 of achieving the same thing.

 However:
 * The common case of annotating by time, as opposed to annotating by CPU event,
   is supported by [Tracy](profiling_with_tracy.md).
 * Annotating by CPU event is inherently not working due to hardware limitations
   of the ARM CPUs found in Android devices. That is, the hardware is too
   imprecise at pinning an event to a particular instruction.

 ## Interpreting CPU event counts

 ### Problems

 There are multiple layers of complexity in interpreting CPU event counts.

 #### These events are in themselves normal

 The first difficulty is in the fact that most of these events are *normal*. So
 just knowing that they happened is not in itself actionable.

 For example, if we learn that some code causes cache misses, that isn't big
 news: so does all code. Maybe this code has *too many* cache misses, but how
 many is too many? Maybe this code alone accounts for a large fraction of the
 overall total of the whole program, but maybe even that is normal, for instance
 if the code being studied is the 'hot' part of the program where a large
 fraction of overall time is spent?

 #### These events are hardware-dependent and under-documented

 Many of these events have a meaning that varies between CPUs and that is
 difficult to characterize on any CPU, let alone in a way that applies to all
 CPUs.

 For example, take the "L2 data cache refill". On ARM, with `simpleperf`, that
 would be `raw-l2d-cache-refill`. Questions:
 * Is “L2” [inclusive](https://en.wikipedia.org/wiki/Cache_inclusion_policy) of
   “L1”?
 * How many bytes are transferred per “refill”?
 * Are accesses induced by speculative execution or by automatic pre-fetching
   counted in the same way as accesses induced by actual code execution?

 The answers to all of the above questions are CPU-dependent. They may even vary
 between the CPU cores of the same Android device.

 #### These events are imprecise and noisy, particularly on ARM CPUs

 Expect noise levels above 10% in many CPU event counts on ARM CPUs. Moreover, on
 ARM, as discussed above, there is inaccuracy in which instruction is blamed for
 which event, which will increase inaccuracy of per-symbol breakdowns for very
 cheap symbols (and makes `perf annotate` impossible as noted above). Finally, be
 aware that some ARM CPUs may perform event count interpolation, so we may not
 have any access to true hardware counts.

 ### Recommendations

 Here is a workflow pattern that allows to make significant use of CPU event
 counts, despite all the problems noted above:
 * Hypothesize that some code diff might help performance, and might help
   reducing the number of CPU events of a certain type, and that the two might be
   related.
 * Benchmark with and without the code diff, on the same device, everything else
   being equal.
   * Let your benchmark perform a fixed number of iterations, or, if using a
     benchmark termination condition of the form "run until at least N seconds
     have elapsed", carefully divide event counts by the actual number of
     iterations that were run.
 * If the observed CPU event count difference is significant, go ahead and claim
   that your code diff probably helps with that aspect of CPU behavior.

 Some things NOT to be done:
 * Don’t try to compare different metrics, not even when it seems obvious that
   they should satisfy a simple relationship, not even on the same CPU (e.g. “L1
   accesses should be greater than L2 accesses”).
 * Don’t divide by some “total” metric to get some kinds of ratios. For example,
   don’t try to compute a “cache miss ratio” as quotient of “cache refill” over
   “all cache accesses” metrics. The first problem with that (even before we get
   to CPU-specific issues) is that that’s rewarding increases to the “all cache
   accesses” metrics, so if something bad happens in your codegen and your kernel
   ends up spilling a lot of register to the stack, that’s going to be a lot more
   accesses which will all be L1 hits so that’ll help this ratio look better!  So
   more generally, just try to minimize some CPU metrics (that count “costly”
   events), not some more complex math expression formed from arithmetic on CPU
   metrics.
	# CPU cache and other CPU event profiling

	CPUs are able to
	[record](https://en.wikipedia.org/wiki/Hardware_performance_counter) certain
	events that may be relevant when investigating the performance of a program. A
	common example of such an event is a ["cache
	miss"](https://en.wikipedia.org/wiki/CPU_cache#Cache_miss), when the program
	tries to access data in memory that isn't already in some CPU cache, causing
	that access to be slower than it could otherwise be.

	Querying and analyzing this data can be useful, but is hard in two distinct
	ways:
	* Depending on the CPU and on the OS, both hardware and software limitations can
	get in the way of obtaining accurate data.
	* This data tends to be inherently difficult to interpret, even when it is
	perfectly accurate. In practice it is often noisy and inaccurate, which makes
	interpretation even more complicated.

	There are two parts to this page: platform-specific information about [how to
	query](#perf-and-simpleperf-on-linux-and-android) this data, and, at the end, a
	platform-independent explanation of [how to
	interpret](#interpreting-cpu-event-counts) it.

	## Perf and Simpleperf, on Linux and Android

	### Overview

	The Linux kernel exposes system event counters to user-space programs by means
	of the
	[`perf_event_open`](https://man7.org/linux/man-pages/man2/perf_event_open.2.html)
	system call. This includes both hardware event counters (such as CPU cache
	events) and software events from the kernel (such as page faults and context
	switches). Anyone may use this system call to implement a profiler, but Linux
	readily offers one, [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page).

	### Preserving Artifacts

	By default IREE cleans up any temporary files it creates while running. Tools
	like perf, however, require those files exist even after the process has exited.
	The environment variable `IREE_PRESERVE_DYLIB_TEMP_FILES` can be set to preserve
	the files. This is only needed for the CPU path when using the system loader.

	```shell
	export IREE_PRESERVE_DYLIB_TEMP_FILES=1
	```

	### Desktop Linux

	On desktop Linux we can use
	[`perf`](https://perf.wiki.kernel.org/index.php/Main_Page). It is provided on
	most Linux distributions, for instance on Debian-based distributions do:
	```shell
	sudo apt install linux-perf
	```

	Run the program to be profiled, prepending its command line with `perf record`.
	By default this will write the profile data to the current directory,
	`./perf.data`. Sometimes this isn't ideal, such as then the current directory is
	under version control. Explicit paths can be specified by `-o` flag to direct
	the output of `perf record`, and then by `-i` flags to select the input of
	subsequent commands analyzing the profile. Example:

	```shell
	perf record -o /tmp/perf.data \
	./iree/tools/iree-benchmark-module \
	--driver=dylib \
	... command-line arguments of iree-benchmark-module as usual ...
	```

	By default, this samples time spent. One may specify instead an event to sample
	by, with the `-e` flag. For instance, to sample by L1 cache misses, one may do:

	```shell
	perf record -o /tmp/perf.data -e L1-dcache-load-misses \
	./iree/tools/iree-benchmark-module \
	--driver=dylib \
	... command-line arguments of iree-benchmark-module as usual ...
	```

	`perf list` dumps the list of event types.

	Once you have recorded a profile, there are two main ways to analyze it: `perf
	report` and `perf annotate`.

	`perf report` breaks down the event counts by symbol. In the default case where
	what was sampled was time, this is just an ordinary profile by symbol name, no
	different than what could be viewed in other profilers such as
	[Tracy](profiling_with_tracy.md). Where it gets really interesting is when the
	profile was recording a specific event type, as in the above `-e
	L1-dcache-load-misses` example:

	```
	perf report -i /tmp/perf.data

	Samples: 6K of event 'L1-dcache-load-misses', Event count (approx.): 362571861
	Overhead Command Shared Object Symbol
	61.53% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_31
	13.30% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_11
	2.11% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_13
	1.90% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_19
	1.54% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_25
	1.49% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_5
	```

	`perf annotate` breaks down the event counts by instruction. Again, in the
	default case where what was sampled was time, this is no different than what
	could be viewed in Tracy, and the real motivation to use `perf` is when
	profiling by specific event types as in the above `-e L1-dcache-load-misses`
	example:

	```
	perf annotate -i perf.data

	Samples: 6K of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 362571861
	serving_default_ex_dispatch_31 /tmp/dylib_executablenzpx2Q.so [Percent: local period]
	1.66 │ movups -0x1000(%rdi),%xmm10
	0.48 │ movups -0x800(%rdi),%xmm9
	0.82 │ movups (%rdi),%xmm8
	0.49 │ movaps %xmm1,%xmm4
	0.12 │ shufps $0x0,%xmm1,%xmm4
	0.14 │ mulps %xmm5,%xmm4
	0.28 │ addps %xmm6,%xmm4
	0.60 │ movaps %xmm3,%xmm6
	0.34 │ shufps $0x0,%xmm3,%xmm6
	```

	#### Warning

	`perf annotate` is even noisier than `perf report` as it can be overly
	optimistic, depending on the CPU, to pin an event to a specific instruction.
	Typically, this works fairly well on x86 CPUs and less well on ARM CPUs and more
	generally on anything mobile. Even on a desktop x86 CPU, this is noisy, as the
	above example (recorded on a Skylake workstation) shows: it blamed a `mulps
	%xmm5,%xmm4` instruction for a cache miss, which doesn't make sense as that
	instruction only touches registers.

	### Android

	On Android we can use
	[`simpleperf`](https://developer.android.com/ndk/guides/simpleperf). It's
	preinstalled on current Android `userdebug` images, and part of the Android NDK.

	In theory, as Android is Linux, it should be possible to use `perf`.
	Unfortunately, `perf` is difficult to build for Android. Fortunately,
	`simpleperf` is readily available: it is preinstalled in Android `userdebug`
	images, and it is part of the Android NDK.

	First, we record on the device:
	```shell
	adb shell \
	simpleperf record -e raw-l1d-cache-refill -o /data/local/tmp/perf.data \
	/data/local/tmp/iree-benchmark-module \
	--driver=dylib \
	... command-line arguments of iree-benchmark-module as usual ...
	```

	Then pull the recorded data from the device, and analyze on the desktop. We
	assume that `${ANDROID_NDK}` points to the local copy of the Android NDK.
	```shell
	adb pull /data/local/tmp/perf.data /tmp/perf.data
	${ANDROID_NDK}/simpleperf/report.py -i /tmp/perf.data
	```

	This prints a breakdown of `raw-l1d-cache-refill` events by symbol.

	Like with `perf`, a list of event types can be queried by the `list` subcommand:

	```shell
	adb shell simpleperf list
	```

	#### No support for `annotate` by CPU event

	There is no `simpleperf annotate`. The `simpleperf` documentation lists a couple
	of
	[ways](https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md#show-annotated-source-code-and-disassembly)
	of achieving the same thing.

	However:
	* The common case of annotating by time, as opposed to annotating by CPU event,
	is supported by [Tracy](profiling_with_tracy.md).
	* Annotating by CPU event is inherently not working due to hardware limitations
	of the ARM CPUs found in Android devices. That is, the hardware is too
	imprecise at pinning an event to a particular instruction.

	## Interpreting CPU event counts

	### Problems

	There are multiple layers of complexity in interpreting CPU event counts.

	#### These events are in themselves normal

	The first difficulty is in the fact that most of these events are normal. So
	just knowing that they happened is not in itself actionable.

	For example, if we learn that some code causes cache misses, that isn't big
	news: so does all code. Maybe this code has too many cache misses, but how
	many is too many? Maybe this code alone accounts for a large fraction of the
	overall total of the whole program, but maybe even that is normal, for instance
	if the code being studied is the 'hot' part of the program where a large
	fraction of overall time is spent?

	#### These events are hardware-dependent and under-documented

	Many of these events have a meaning that varies between CPUs and that is
	difficult to characterize on any CPU, let alone in a way that applies to all
	CPUs.

	For example, take the "L2 data cache refill". On ARM, with `simpleperf`, that
	would be `raw-l2d-cache-refill`. Questions:
	* Is “L2” [inclusive](https://en.wikipedia.org/wiki/Cache_inclusion_policy) of
	“L1”?
	* How many bytes are transferred per “refill”?
	* Are accesses induced by speculative execution or by automatic pre-fetching
	counted in the same way as accesses induced by actual code execution?

	The answers to all of the above questions are CPU-dependent. They may even vary
	between the CPU cores of the same Android device.

	#### These events are imprecise and noisy, particularly on ARM CPUs

	Expect noise levels above 10% in many CPU event counts on ARM CPUs. Moreover, on
	ARM, as discussed above, there is inaccuracy in which instruction is blamed for
	which event, which will increase inaccuracy of per-symbol breakdowns for very
	cheap symbols (and makes `perf annotate` impossible as noted above). Finally, be
	aware that some ARM CPUs may perform event count interpolation, so we may not
	have any access to true hardware counts.

	### Recommendations

	Here is a workflow pattern that allows to make significant use of CPU event
	counts, despite all the problems noted above:
	* Hypothesize that some code diff might help performance, and might help
	reducing the number of CPU events of a certain type, and that the two might be
	related.
	* Benchmark with and without the code diff, on the same device, everything else
	being equal.
	* Let your benchmark perform a fixed number of iterations, or, if using a
	benchmark termination condition of the form "run until at least N seconds
	have elapsed", carefully divide event counts by the actual number of
	iterations that were run.
	* If the observed CPU event count difference is significant, go ahead and claim
	that your code diff probably helps with that aspect of CPU behavior.

	Some things NOT to be done:
	* Don’t try to compare different metrics, not even when it seems obvious that
	they should satisfy a simple relationship, not even on the same CPU (e.g. “L1
	accesses should be greater than L2 accesses”).
	* Don’t divide by some “total” metric to get some kinds of ratios. For example,
	don’t try to compute a “cache miss ratio” as quotient of “cache refill” over
	“all cache accesses” metrics. The first problem with that (even before we get
	to CPU-specific issues) is that that’s rewarding increases to the “all cache
	accesses” metrics, so if something bad happens in your codegen and your kernel
	ends up spilling a lot of register to the stack, that’s going to be a lot more
	accesses which will all be L1 hits so that’ll help this ratio look better! So
	more generally, just try to minimize some CPU metrics (that count “costly”
	events), not some more complex math expression formed from arithmetic on CPU
	metrics.