CPUs are able to record certain events that may be relevant when investigating the performance of a program. A common example of such an event is a “cache miss”, when the program tries to access data in memory that isn't already in some CPU cache, causing that access to be slower than it could otherwise be.
Querying and analyzing this data can be useful, but is hard in two distinct ways: querying is platform-specific, and interpreting the resulting event counts takes experience. Accordingly, there are two parts to this page: platform-specific information about how to query this data, and, at the end, a platform-independent explanation of how to interpret it.
The Linux kernel exposes system event counters to user-space programs by means of the `perf_event_open` system call. This includes both hardware event counters (such as CPU cache events) and software events from the kernel (such as page faults and context switches). Anyone may use this system call to implement a profiler, but Linux readily offers one, `perf`.
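To illustrate what tools like `perf` build on, here is a toy Python sketch that calls `perf_event_open` directly via `ctypes` to count the calling process's user-space instructions. This is a sketch under stated assumptions, not a profiler: the syscall number below is for x86-64 only, the `perf_event_attr` structure is only partially filled in, and the code degrades gracefully (returning `None`) where the syscall is unavailable, e.g. due to permissions, seccomp, or a different architecture.

```python
import ctypes
import os
import struct

# Toy sketch of perf_event_open via ctypes. Assumptions: Linux on x86-64
# (SYS_perf_event_open differs on other architectures), and a kernel that
# accepts a 128-byte perf_event_attr. Not a substitute for perf itself.
PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
PERF_EVENT_IOC_ENABLE = 0x2400
PERF_EVENT_IOC_DISABLE = 0x2401
SYS_perf_event_open = 298  # x86-64 only
ATTR_SIZE = 128            # sizeof(struct perf_event_attr) on recent kernels

def count_instructions(fn):
    """Run fn() and return its user-space instruction count, or None if
    perf_event_open is unavailable (permissions, seccomp, other arch)."""
    libc = ctypes.CDLL(None, use_errno=True)
    attr = bytearray(ATTR_SIZE)
    # Leading fields of struct perf_event_attr:
    #   u32 type; u32 size; u64 config; u64 sample_period;
    #   u64 sample_type; u64 read_format; u64 <flags bitfield>; ...
    flags = (1 << 0) | (1 << 5) | (1 << 6)  # disabled, exclude_kernel, exclude_hv
    struct.pack_into("<IIQQQQQ", attr, 0, PERF_TYPE_HARDWARE, ATTR_SIZE,
                     PERF_COUNT_HW_INSTRUCTIONS, 0, 0, 0, flags)
    buf = (ctypes.c_char * ATTR_SIZE).from_buffer(attr)
    fd = libc.syscall(SYS_perf_event_open, buf, 0, -1, -1, 0)  # pid=0: self
    if fd < 0:
        return None
    try:
        libc.ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)
        fn()
        libc.ioctl(fd, PERF_EVENT_IOC_DISABLE, 0)
        return struct.unpack("<Q", os.read(fd, 8))[0]  # u64 counter value
    finally:
        os.close(fd)

n = count_instructions(lambda: sum(range(100_000)))
print("instructions:", n)
```

`perf` does essentially this, plus sampling, event multiplexing, and symbolization, which is why one normally uses it rather than the raw syscall.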
By default IREE cleans up any temporary files it creates while running. Tools like `perf`, however, require those files to exist even after the process has exited. The environment variable `IREE_PRESERVE_DYLIB_TEMP_FILES` can be set to preserve the files. This is only needed for the CPU path when using the system loader.

```shell
export IREE_PRESERVE_DYLIB_TEMP_FILES=1
```
On desktop Linux we can use `perf`. It is provided on most Linux distributions; for instance, on Debian-based distributions, do:

```shell
sudo apt install linux-perf
```
Run the program to be profiled, prepending its command line with `perf record`. By default this will write the profile data to `./perf.data` in the current directory. Sometimes this isn't ideal, such as when the current directory is under version control. An explicit path can be specified with the `-o` flag to direct the output of `perf record`, and then with the `-i` flag to select the input of subsequent commands analyzing the profile. Example:
```shell
perf record -o /tmp/perf.data \
  ./iree/tools/iree-benchmark-module \
    --driver=dylib \
    ... command-line arguments of iree-benchmark-module as usual ...
```
By default, this samples time spent. One may instead specify an event to sample by, with the `-e` flag. For instance, to sample by L1 cache misses, one may do:
```shell
perf record -o /tmp/perf.data -e L1-dcache-load-misses \
  ./iree/tools/iree-benchmark-module \
    --driver=dylib \
    ... command-line arguments of iree-benchmark-module as usual ...
```
`perf list` dumps the list of event types.
Once you have recorded a profile, there are two main ways to analyze it: `perf report` and `perf annotate`.
`perf report` breaks down the event counts by symbol. In the default case where what was sampled was time, this is just an ordinary profile by symbol name, no different from what could be viewed in other profilers such as Tracy. Where it gets really interesting is when the profile was recorded for a specific event type, as in the above `-e L1-dcache-load-misses` example:
```
perf report -i /tmp/perf.data

Samples: 6K of event 'L1-dcache-load-misses', Event count (approx.): 362571861
Overhead  Command  Shared Object              Symbol
  61.53%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_31
  13.30%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_11
   2.11%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_13
   1.90%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_19
   1.54%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_25
   1.49%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_5
```
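To post-process such a breakdown programmatically (for example, to diff two profiles), the textual report output can be parsed. A rough sketch, assuming the column layout shown above; the exact layout varies across `perf` versions, so treat this as a starting point, not a robust parser:

```python
import re

# Matches lines like:
#   61.53%  cpu0  dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_31
# capturing the overhead percentage and the symbol name.
LINE_RE = re.compile(
    r"^\s*(?P<pct>\d+\.\d+)%\s+\S+\s+(?P<dso>\S+)\s+\[\S+\]\s+(?P<sym>\S+)")

def parse_report(text):
    """Extract (overhead_percent, symbol) pairs from perf report text."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append((float(m.group("pct")), m.group("sym")))
    return rows

sample = """
    61.53%  cpu0  dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_31
    13.30%  cpu0  dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_11
"""
print(parse_report(sample))
```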
`perf annotate` breaks down the event counts by instruction. Again, in the default case where what was sampled was time, this is no different from what could be viewed in Tracy, and the real motivation to use `perf` is when profiling by specific event types, as in the above `-e L1-dcache-load-misses` example:
```
perf annotate -i perf.data

Samples: 6K of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 362571861
serving_default_ex_dispatch_31  /tmp/dylib_executablenzpx2Q.so [Percent: local period]
  1.66 │ movups -0x1000(%rdi),%xmm10
  0.48 │ movups -0x800(%rdi),%xmm9
  0.82 │ movups (%rdi),%xmm8
  0.49 │ movaps %xmm1,%xmm4
  0.12 │ shufps $0x0,%xmm1,%xmm4
  0.14 │ mulps  %xmm5,%xmm4
  0.28 │ addps  %xmm6,%xmm4
  0.60 │ movaps %xmm3,%xmm6
  0.34 │ shufps $0x0,%xmm3,%xmm6
```
`perf annotate` is even noisier than `perf report`, because pinning an event to a specific instruction is, depending on the CPU, overly optimistic. Typically, this works fairly well on x86 CPUs and less well on ARM CPUs, and more generally on anything mobile. Even on a desktop x86 CPU, this is noisy, as the above example (recorded on a Skylake workstation) shows: it blamed a `mulps %xmm5,%xmm4` instruction for a cache miss, which doesn't make sense as that instruction only touches registers.
On Android we can use `simpleperf`. It's preinstalled on current Android `userdebug` images, and part of the Android NDK.
In theory, as Android is Linux, it should be possible to use `perf`. Unfortunately, `perf` is difficult to build for Android. Fortunately, `simpleperf` is readily available, as noted above.
First, we record on the device:
```shell
adb shell \
  simpleperf record -e raw-l1d-cache-refill -o /data/local/tmp/perf.data \
    /data/local/tmp/iree-benchmark-module \
      --driver=dylib \
      ... command-line arguments of iree-benchmark-module as usual ...
```
Then pull the recorded data from the device, and analyze it on the desktop. We assume that `${ANDROID_NDK}` points to the local copy of the Android NDK.
```shell
adb pull /data/local/tmp/perf.data /tmp/perf.data
${ANDROID_NDK}/simpleperf/report.py -i /tmp/perf.data
```
This prints a breakdown of `raw-l1d-cache-refill` events by symbol.
Like with `perf`, a list of event types can be queried with the `list` subcommand:

```shell
adb shell simpleperf list
```
### `annotate` by CPU event

There is no `simpleperf annotate`. The `simpleperf` documentation lists a couple of ways of achieving the same thing.
However, as discussed below, per-instruction event attribution is inaccurate on ARM CPUs, which limits the usefulness of such annotations.
There are multiple layers of complexity in interpreting CPU event counts.
The first difficulty lies in the fact that most of these events are normal, so just knowing that they happened is not in itself actionable.

For example, if we learn that some code causes cache misses, that isn't big news: so does all code. Maybe this code has too many cache misses, but how many is too many? Maybe this code alone accounts for a large fraction of the overall total of the whole program, but even that may be normal, for instance if the code being studied is the "hot" part of the program where a large fraction of overall time is spent.
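One way to give a raw event count context is to normalize it against another counter from the same run, such as the total number of loads, so that differently sized workloads become comparable. A small illustrative sketch; the event names in the comments and the numbers below are hypothetical, not taken from any particular profile:

```python
def misses_per_kilo_loads(l1d_load_misses, l1d_loads):
    """Normalize a miss count to misses per 1000 loads, which is more
    comparable across runs than a raw count. Both inputs would come from
    the same run, e.g. the L1-dcache-load-misses and L1-dcache-loads
    counters on x86, or their raw-* equivalents under simpleperf."""
    if l1d_loads == 0:
        return 0.0
    return 1000.0 * l1d_load_misses / l1d_loads

# Hypothetical numbers: ~363M misses out of 12 billion loads.
rate = misses_per_kilo_loads(362_571_861, 12_000_000_000)
print(f"{rate:.1f} misses per 1000 loads")
```

Whether a given rate is "too many" is still workload- and CPU-dependent, but a ratio like this at least lets one compare the same code across changes.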
The second difficulty is that many of these events have a meaning that varies between CPUs and that is difficult to characterize on any CPU, let alone in a way that applies to all CPUs.
For example, take the "L2 data cache refill". On ARM, with `simpleperf`, that would be `raw-l2d-cache-refill`. Questions:
The answers to all of the above questions are CPU-dependent. They may even vary between the CPU cores of the same Android device.
Expect noise levels above 10% in many CPU event counts on ARM CPUs. Moreover, on ARM, as discussed above, there is inaccuracy in which instruction is blamed for which event, which increases the inaccuracy of per-symbol breakdowns for very cheap symbols (and makes a `perf annotate` equivalent impossible, as noted above). Finally, be aware that some ARM CPUs may perform event count interpolation, so we may not have access to true hardware counts.
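Given such noise levels, it helps to record each configuration several times and only trust differences that clearly exceed the measured run-to-run spread. A sketch of that idea; the noise factor and the numbers below are illustrative assumptions, not recommendations:

```python
import statistics

def relative_spread(counts):
    """Coefficient of variation of repeated event counts: an empirical
    estimate of the run-to-run noise for this event on this CPU."""
    mean = statistics.mean(counts)
    return statistics.stdev(counts) / mean if mean else 0.0

def significantly_different(before, after, noise_factor=3.0):
    """Treat a change in mean counts as real only if it exceeds a few
    times the worst observed noise of the two sets of runs."""
    noise = max(relative_spread(before), relative_spread(after))
    mean_before = statistics.mean(before)
    mean_after = statistics.mean(after)
    change = abs(mean_after - mean_before) / mean_before
    return change > noise_factor * noise

# Illustrative counts with roughly 10% run-to-run noise, as can happen
# on ARM CPUs: the small difference in means is lost in the noise.
before = [1.00e9, 1.10e9, 0.95e9]
after = [0.97e9, 1.05e9, 1.02e9]
print(significantly_different(before, after))
```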
Here is a workflow pattern that makes it possible to derive significant value from CPU event counts, despite all the problems noted above:
Some things NOT to be done: