# CPU cache and other CPU event profiling
CPUs are able to
[record](https://en.wikipedia.org/wiki/Hardware_performance_counter) certain
events that may be relevant when investigating the performance of a program. A
common example of such an event is a ["cache
miss"](https://en.wikipedia.org/wiki/CPU_cache#Cache_miss), when the program
tries to access data in memory that isn't already in some CPU cache, causing
that access to be slower than it could otherwise be.
Querying and analyzing this data can be useful, but is hard in two distinct
ways:
* Depending on the CPU and on the OS, both hardware and software limitations can
get in the way of obtaining accurate data.
* This data tends to be inherently difficult to interpret, even when it is
perfectly accurate. In practice it is often noisy and inaccurate, which makes
interpretation even more complicated.
There are two parts to this page: platform-specific information about [how to
query](#perf-and-simpleperf-on-linux-and-android) this data, and, at the end, a
platform-independent explanation of [how to
interpret](#interpreting-cpu-event-counts) it.
## Perf and Simpleperf, on Linux and Android
### Overview
The Linux kernel exposes system event counters to user-space programs by means
of the
[`perf_event_open`](https://man7.org/linux/man-pages/man2/perf_event_open.2.html)
system call. This includes both hardware event counters (such as CPU cache
events) and software events from the kernel (such as page faults and context
switches). Anyone may use this system call to implement a profiler, but Linux
readily offers one, [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page).
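Before recording any profile at all, aggregate event counts for a whole program run can be obtained with `perf stat`. A minimal example (the event names here are generic aliases that `perf` maps to CPU-specific counters, so not all of them are available on every CPU, and `./my-program` is a placeholder for whatever command is to be measured):

```shell
# Count a few common hardware events over a whole program run.
perf stat -e cycles,instructions,cache-references,cache-misses ./my-program
```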
### Preserving Artifacts
By default, IREE cleans up any temporary files it creates while running. Tools
like `perf`, however, require that those files still exist after the process
has exited. The environment variable `IREE_PRESERVE_DYLIB_TEMP_FILES` can be
set to preserve the files. This is only needed for the CPU path when using the
system loader.
```shell
export IREE_PRESERVE_DYLIB_TEMP_FILES=1
```
### Desktop Linux
On desktop Linux we can use
[`perf`](https://perf.wiki.kernel.org/index.php/Main_Page). It is provided by
most Linux distributions; for instance, on Debian-based distributions:
```shell
sudo apt install linux-perf
```
Run the program to be profiled, prepending its command line with `perf record`.
By default this writes the profile data to `perf.data` in the current
directory. Sometimes this isn't ideal, such as when the current directory is
under version control. An explicit output path can be given to `perf record`
with the `-o` flag, and then passed to subsequent commands analyzing the
profile with the `-i` flag. Example:
```shell
perf record -o /tmp/perf.data \
./iree/tools/iree-benchmark-module \
--driver=dylib \
... command-line arguments of iree-benchmark-module as usual ...
```
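If `perf record` fails with a permission error, the kernel may be restricting
access to performance events for unprivileged users. On most Linux systems this
is governed by the `kernel.perf_event_paranoid` sysctl, which can be lowered
(the exact value needed depends on what is being profiled; see the kernel
documentation for the meaning of each value):

```shell
# Allow unprivileged users to profile their own processes.
sudo sysctl kernel.perf_event_paranoid=1
```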
By default, this samples time spent. One may specify instead an event to sample
by, with the `-e` flag. For instance, to sample by L1 cache misses, one may do:
```shell
perf record -o /tmp/perf.data -e L1-dcache-load-misses \
./iree/tools/iree-benchmark-module \
--driver=dylib \
... command-line arguments of iree-benchmark-module as usual ...
```
`perf list` dumps the list of event types.
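The list is long; `perf list` also accepts a filter argument to narrow it down,
for instance to cache events only:

```shell
perf list cache
```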
Once you have recorded a profile, there are two main ways to analyze it: `perf
report` and `perf annotate`.
`perf report` breaks down the event counts by symbol. In the default case where
what was sampled was time, this is just an ordinary profile by symbol name, no
different than what could be viewed in other profilers such as
[Tracy](profiling_with_tracy.md). Where it gets really interesting is when the
profile was recording a specific event type, as in the above `-e
L1-dcache-load-misses` example:
```
perf report -i /tmp/perf.data
Samples: 6K of event 'L1-dcache-load-misses', Event count (approx.): 362571861
Overhead  Command  Shared Object              Symbol
  61.53%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_31
  13.30%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_11
   2.11%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_13
   1.90%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_19
   1.54%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_25
   1.49%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_5
```
`perf annotate` breaks down the event counts by instruction. Again, in the
default case where what was sampled was time, this is no different than what
could be viewed in Tracy, and the real motivation to use `perf` is when
profiling by specific event types as in the above `-e L1-dcache-load-misses`
example:
```
perf annotate -i /tmp/perf.data
Samples: 6K of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 362571861
serving_default_ex_dispatch_31 /tmp/dylib_executablenzpx2Q.so [Percent: local period]
1.66 │ movups -0x1000(%rdi),%xmm10
0.48 │ movups -0x800(%rdi),%xmm9
0.82 │ movups (%rdi),%xmm8
0.49 │ movaps %xmm1,%xmm4
0.12 │ shufps $0x0,%xmm1,%xmm4
0.14 │ mulps %xmm5,%xmm4
0.28 │ addps %xmm6,%xmm4
0.60 │ movaps %xmm3,%xmm6
0.34 │ shufps $0x0,%xmm3,%xmm6
```
#### Warning
`perf annotate` is even noisier than `perf report`: attributing an event to a
specific instruction requires a precision that, depending on the CPU, the
hardware may not offer. Typically this works fairly well on x86 CPUs, and less
well on ARM CPUs and more generally on anything mobile. Even on a desktop x86
CPU it is noisy, as the above example (recorded on a Skylake workstation)
shows: a cache miss was blamed on a `mulps %xmm5,%xmm4` instruction, which
makes no sense as that instruction only touches registers.
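On x86 CPUs that support precise sampling (such as Intel PEBS), some of this
imprecision (often called "skid") can be mitigated by appending a precision
modifier to the event name. The `:pp` suffix below requests higher precision;
whether a given event supports it depends on the CPU:

```shell
perf record -o /tmp/perf.data -e L1-dcache-load-misses:pp \
    ./iree/tools/iree-benchmark-module \
    --driver=dylib \
    ... command-line arguments of iree-benchmark-module as usual ...
```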
### Android
On Android we can use
[`simpleperf`](https://developer.android.com/ndk/guides/simpleperf). In theory,
as Android is Linux, it should be possible to use `perf` itself. Unfortunately,
`perf` is difficult to build for Android. Fortunately, `simpleperf` is readily
available: it is preinstalled on current Android `userdebug` images, and it is
part of the Android NDK.
First, we record on the device:
```shell
adb shell \
simpleperf record -e raw-l1d-cache-refill -o /data/local/tmp/perf.data \
/data/local/tmp/iree-benchmark-module \
--driver=dylib \
... command-line arguments of iree-benchmark-module as usual ...
```
Then pull the recorded data from the device and analyze it on the desktop. We
assume that `${ANDROID_NDK}` points to the local copy of the Android NDK.
```shell
adb pull /data/local/tmp/perf.data /tmp/perf.data
${ANDROID_NDK}/simpleperf/report.py -i /tmp/perf.data
```
This prints a breakdown of `raw-l1d-cache-refill` events by symbol.
As with `perf`, the list of event types can be queried with the `list`
subcommand:
```shell
adb shell simpleperf list
```
#### No support for `annotate` by CPU event
There is no `simpleperf annotate`. The `simpleperf` documentation lists a couple
of
[ways](https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md#show-annotated-source-code-and-disassembly)
of achieving the same thing.
However:
* The common case of annotating by time, as opposed to annotating by CPU
  event, is supported by [Tracy](profiling_with_tracy.md).
* Annotating by CPU event inherently does not work, due to hardware limitations
  of the ARM CPUs found in Android devices: the hardware is too imprecise to
  pin an event to a particular instruction.
## Interpreting CPU event counts
### Problems
There are multiple layers of complexity in interpreting CPU event counts.
#### These events are in themselves normal
The first difficulty is that most of these events are *normal*, so just knowing
that they happened is not in itself actionable.

For example, if we learn that some code causes cache misses, that isn't big
news: so does all code. Maybe this code has *too many* cache misses, but how
many is too many? Maybe this code alone accounts for a large fraction of the
program's overall total, but even that may be normal, for instance if the code
being studied is the 'hot' part of the program where a large fraction of
overall time is spent.
#### These events are hardware-dependent and under-documented
Many of these events have a meaning that varies between CPUs and that is
difficult to characterize on any CPU, let alone in a way that applies to all
CPUs.
For example, take the "L2 data cache refill". On ARM, with `simpleperf`, that
would be `raw-l2d-cache-refill`. Questions:
* Is “L2” [inclusive](https://en.wikipedia.org/wiki/Cache_inclusion_policy) of
“L1”?
* How many bytes are transferred per “refill”?
* Are accesses induced by speculative execution or by automatic pre-fetching
counted in the same way as accesses induced by actual code execution?
The answers to all of the above questions are CPU-dependent. They may even vary
between the CPU cores of the same Android device.
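To see which raw events a particular device actually exposes, `simpleperf list`
with the `raw` category prints the raw PMU events that the device's CPUs
advertise; both the set of events and their exact meaning can differ from one
device, and even one CPU core, to another:

```shell
adb shell simpleperf list raw
```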
#### These events are imprecise and noisy, particularly on ARM CPUs
Expect noise levels above 10% in many CPU event counts on ARM CPUs. Moreover,
on ARM, as discussed above, there is inaccuracy in which instruction is blamed
for which event, which degrades per-symbol breakdowns for very cheap symbols
(and rules out annotating by CPU event, as noted above). Finally, be aware that
some ARM CPUs may interpolate event counts, so we may not have access to true
hardware counts at all.
### Recommendations
Here is a workflow pattern that makes it possible to get significant value out
of CPU event counts, despite all the problems noted above:

* Hypothesize that some code diff might help performance, and might help reduce
  the number of CPU events of a certain type, and that the two might be
  related.
* Benchmark with and without the code diff, on the same device, everything else
being equal.
  * Let your benchmark perform a fixed number of iterations, or, if using a
    benchmark termination condition of the form "run until at least N seconds
    have elapsed", carefully divide event counts by the actual number of
    iterations that were run.
* If the observed CPU event count difference is significant, go ahead and claim
that your code diff probably helps with that aspect of CPU behavior.
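As a concrete sketch of the measurement step on Android, `simpleperf stat`
prints aggregate event counts for a whole run, which is convenient for such A/B
comparisons. The two binary names below are hypothetical, standing for builds
without and with the code diff:

```shell
# Baseline build (without the code diff).
adb shell simpleperf stat -e raw-l1d-cache-refill \
    /data/local/tmp/iree-benchmark-module-baseline \
    --driver=dylib \
    ... command-line arguments of iree-benchmark-module as usual ...

# Experimental build (with the code diff); same device, same arguments.
adb shell simpleperf stat -e raw-l1d-cache-refill \
    /data/local/tmp/iree-benchmark-module-experiment \
    --driver=dylib \
    ... command-line arguments of iree-benchmark-module as usual ...
```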
Some things NOT to be done:
* Don’t try to compare different metrics, not even when it seems obvious that
they should satisfy a simple relationship, not even on the same CPU (e.g. “L1
accesses should be greater than L2 accesses”).
* Don’t divide by some “total” metric to compute ratios. For example, don’t try
  to compute a “cache miss ratio” as the quotient of a “cache refill” metric
  over an “all cache accesses” metric. The first problem with that (even before
  we get to CPU-specific issues) is that it rewards increases to the “all cache
  accesses” metric: if something bad happens in your codegen and your kernel
  ends up spilling a lot of registers to the stack, that's many more accesses,
  all of them L1 hits, so the ratio looks better! So, more generally, just try
  to minimize some CPU metrics that count “costly” events, not some more
  complex expression formed by arithmetic on CPU metrics.