CPUs are able to record certain events that may be relevant when investigating the performance of a program. A common example of such an event is a “cache miss”, when the program tries to access data in memory that isn't already in some CPU cache, causing that access to be slower than it could otherwise be.
Querying and analyzing this data can be useful, but is hard in two distinct ways: querying is platform-specific, and interpreting the resulting event counts takes experience. Accordingly, there are two parts to this page: platform-specific information about how to query this data, and, at the end, a platform-independent explanation of how to interpret it.
The Linux kernel exposes system event counters to user-space programs by means of the `perf_event_open` system call. This includes both hardware event counters (such as CPU cache events) and software events from the kernel (such as page faults and context switches). Anyone may use this system call to implement a profiler, but Linux readily offers one, `perf`.
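To illustrate what tools like `perf` build on, here is a toy Python sketch that calls `perf_event_open` directly via `ctypes` to count the calling process's user-space instructions. This is a sketch under stated assumptions, not a profiler: the syscall number below is for x86-64 only, the `perf_event_attr` structure is only partially filled in, and the code degrades gracefully (returning `None`) where the syscall is unavailable, e.g. due to permissions, seccomp, or a different architecture.

```python
import ctypes
import os
import struct

# Toy sketch of perf_event_open via ctypes. Assumptions: Linux on x86-64
# (SYS_perf_event_open differs on other architectures), and a kernel that
# accepts a 128-byte perf_event_attr. Not a substitute for perf itself.
PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
PERF_EVENT_IOC_ENABLE = 0x2400
PERF_EVENT_IOC_DISABLE = 0x2401
SYS_perf_event_open = 298  # x86-64 only
ATTR_SIZE = 128            # sizeof(struct perf_event_attr) on recent kernels

def count_instructions(fn):
    """Run fn() and return its user-space instruction count, or None if
    perf_event_open is unavailable (permissions, seccomp, other arch)."""
    libc = ctypes.CDLL(None, use_errno=True)
    attr = bytearray(ATTR_SIZE)
    # Leading fields of struct perf_event_attr:
    #   u32 type; u32 size; u64 config; u64 sample_period;
    #   u64 sample_type; u64 read_format; u64 <flags bitfield>; ...
    flags = (1 << 0) | (1 << 5) | (1 << 6)  # disabled, exclude_kernel, exclude_hv
    struct.pack_into("<IIQQQQQ", attr, 0, PERF_TYPE_HARDWARE, ATTR_SIZE,
                     PERF_COUNT_HW_INSTRUCTIONS, 0, 0, 0, flags)
    buf = (ctypes.c_char * ATTR_SIZE).from_buffer(attr)
    fd = libc.syscall(SYS_perf_event_open, buf, 0, -1, -1, 0)  # pid=0: self
    if fd < 0:
        return None
    try:
        libc.ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)
        fn()
        libc.ioctl(fd, PERF_EVENT_IOC_DISABLE, 0)
        return struct.unpack("<Q", os.read(fd, 8))[0]  # u64 counter value
    finally:
        os.close(fd)

n = count_instructions(lambda: sum(range(100_000)))
print("instructions:", n)
```

`perf` does essentially this, plus sampling, event multiplexing, and symbolization, which is why one normally uses it rather than the raw syscall.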
By default IREE cleans up any temporary files it creates while running. Tools like `perf`, however, require those files to exist even after the process has exited. The environment variable `IREE_PRESERVE_DYLIB_TEMP_FILES` can be set to preserve the files. This is only needed for the CPU path when using the system loader.

```shell
export IREE_PRESERVE_DYLIB_TEMP_FILES=1
```
On desktop Linux we can use `perf`. It is provided on most Linux distributions; for instance, on Debian-based distributions, do:

```shell
sudo apt install linux-perf
```
Run the program to be profiled, prepending its command line with `perf record`. By default this will write the profile data to `./perf.data` in the current directory. Sometimes this isn't ideal, such as when the current directory is under version control. An explicit path can be specified with the `-o` flag to direct the output of `perf record`, and then with the `-i` flag to select the input of subsequent commands analyzing the profile. Example:
```shell
perf record -o /tmp/perf.data \
  ./iree/tools/iree-benchmark-module \
    --driver=dylib \
    ... command-line arguments of iree-benchmark-module as usual ...
```
By default, this samples time spent. One may instead specify an event to sample by, with the `-e` flag. For instance, to sample by L1 cache misses, one may do:
```shell
perf record -o /tmp/perf.data -e L1-dcache-load-misses \
  ./iree/tools/iree-benchmark-module \
    --driver=dylib \
    ... command-line arguments of iree-benchmark-module as usual ...
```
`perf list` dumps the list of event types.
Once you have recorded a profile, there are two main ways to analyze it: `perf report` and `perf annotate`.
`perf report` breaks down the event counts by symbol. In the default case where what was sampled was time, this is just an ordinary profile by symbol name, no different from what could be viewed in other profilers such as Tracy. Where it gets really interesting is when the profile was recorded for a specific event type, as in the above `-e L1-dcache-load-misses` example:
```
perf report -i /tmp/perf.data

Samples: 6K of event 'L1-dcache-load-misses', Event count (approx.): 362571861
Overhead  Command  Shared Object              Symbol
  61.53%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_31
  13.30%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_11
   2.11%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_13
   1.90%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_19
   1.54%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_25
   1.49%  cpu0     dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_5
```
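To post-process such a breakdown programmatically (for example, to diff two profiles), the textual report output can be parsed. A rough sketch, assuming the column layout shown above; the exact layout varies across `perf` versions, so treat this as a starting point, not a robust parser:

```python
import re

# Matches lines like:
#   61.53%  cpu0  dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_31
# capturing the overhead percentage and the symbol name.
LINE_RE = re.compile(
    r"^\s*(?P<pct>\d+\.\d+)%\s+\S+\s+(?P<dso>\S+)\s+\[\S+\]\s+(?P<sym>\S+)")

def parse_report(text):
    """Extract (overhead_percent, symbol) pairs from perf report text."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append((float(m.group("pct")), m.group("sym")))
    return rows

sample = """
    61.53%  cpu0  dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_31
    13.30%  cpu0  dylib_executablenzpx2Q.so  [.] serving_default_ex_dispatch_11
"""
print(parse_report(sample))
```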
`perf annotate` breaks down the event counts by instruction. Again, in the default case where what was sampled was time, this is no different from what could be viewed in Tracy, and the real motivation to use `perf` is when profiling by specific event types, as in the above `-e L1-dcache-load-misses` example:
```
perf annotate -i perf.data

Samples: 6K of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 362571861
serving_default_ex_dispatch_31  /tmp/dylib_executablenzpx2Q.so [Percent: local period]
  1.66 │ movups -0x1000(%rdi),%xmm10
  0.48 │ movups -0x800(%rdi),%xmm9
  0.82 │ movups (%rdi),%xmm8
  0.49 │ movaps %xmm1,%xmm4
  0.12 │ shufps $0x0,%xmm1,%xmm4
  0.14 │ mulps  %xmm5,%xmm4
  0.28 │ addps  %xmm6,%xmm4
  0.60 │ movaps %xmm3,%xmm6
  0.34 │ shufps $0x0,%xmm3,%xmm6
```
`perf annotate` is even noisier than `perf report`, because pinning an event to a specific instruction is, depending on the CPU, overly optimistic. Typically, this works fairly well on x86 CPUs and less well on ARM CPUs, and more generally on anything mobile. Even on a desktop x86 CPU, this is noisy, as the above example (recorded on a Skylake workstation) shows: it blamed a `mulps %xmm5,%xmm4` instruction for a cache miss, which doesn't make sense as that instruction only touches registers.
On Android we can use `simpleperf`. It's preinstalled on current Android `userdebug` images, and part of the Android NDK.
In theory, as Android is Linux, it should be possible to use `perf`. Unfortunately, `perf` is difficult to build for Android. Fortunately, `simpleperf` is readily available, as noted above.
First, we record on the device:
```shell
adb shell \
  simpleperf record -e raw-l1d-cache-refill -o /data/local/tmp/perf.data \
    /data/local/tmp/iree-benchmark-module \
      --driver=dylib \
      ... command-line arguments of iree-benchmark-module as usual ...
```
Then pull the recorded data from the device, and analyze it on the desktop. We assume that `${ANDROID_NDK}` points to the local copy of the Android NDK.
```shell
adb pull /data/local/tmp/perf.data /tmp/perf.data
${ANDROID_NDK}/simpleperf/report.py -i /tmp/perf.data
```
This prints a breakdown of `raw-l1d-cache-refill` events by symbol.
Like with `perf`, a list of event types can be queried with the `list` subcommand:

```shell
adb shell simpleperf list
```
### `annotate` by CPU event

There is no `simpleperf annotate`. The `simpleperf` documentation lists a couple of ways of achieving the same thing.
However, as discussed below, per-instruction event attribution is inaccurate on ARM CPUs, which limits the usefulness of such annotations.
There are multiple layers of complexity in interpreting CPU event counts.
The first difficulty lies in the fact that most of these events are normal, so just knowing that they happened is not in itself actionable.

For example, if we learn that some code causes cache misses, that isn't big news: so does all code. Maybe this code has too many cache misses, but how many is too many? Maybe this code alone accounts for a large fraction of the overall total of the whole program, but even that may be normal, for instance if the code being studied is the "hot" part of the program where a large fraction of overall time is spent.
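One way to give a raw event count context is to normalize it against another counter from the same run, such as the total number of loads, so that differently sized workloads become comparable. A small illustrative sketch; the event names in the comments and the numbers below are hypothetical, not taken from any particular profile:

```python
def misses_per_kilo_loads(l1d_load_misses, l1d_loads):
    """Normalize a miss count to misses per 1000 loads, which is more
    comparable across runs than a raw count. Both inputs would come from
    the same run, e.g. the L1-dcache-load-misses and L1-dcache-loads
    counters on x86, or their raw-* equivalents under simpleperf."""
    if l1d_loads == 0:
        return 0.0
    return 1000.0 * l1d_load_misses / l1d_loads

# Hypothetical numbers: ~363M misses out of 12 billion loads.
rate = misses_per_kilo_loads(362_571_861, 12_000_000_000)
print(f"{rate:.1f} misses per 1000 loads")
```

Whether a given rate is "too many" is still workload- and CPU-dependent, but a ratio like this at least lets one compare the same code across changes.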
The second difficulty is that many of these events have a meaning that varies between CPUs and that is difficult to characterize on any CPU, let alone in a way that applies to all CPUs.
For example, take the "L2 data cache refill". On ARM, with `simpleperf`, that would be `raw-l2d-cache-refill`. Questions:
The answers to all of the above questions are CPU-dependent. They may even vary between the CPU cores of the same Android device.
Expect noise levels above 10% in many CPU event counts on ARM CPUs. Moreover, on ARM, as discussed above, there is inaccuracy in which instruction is blamed for which event, which increases the inaccuracy of per-symbol breakdowns for very cheap symbols (and makes a `perf annotate` equivalent impossible, as noted above). Finally, be aware that some ARM CPUs may perform event count interpolation, so we may not have access to true hardware counts.
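Given such noise levels, it helps to record each configuration several times and only trust differences that clearly exceed the measured run-to-run spread. A sketch of that idea; the noise factor and the numbers below are illustrative assumptions, not recommendations:

```python
import statistics

def relative_spread(counts):
    """Coefficient of variation of repeated event counts: an empirical
    estimate of the run-to-run noise for this event on this CPU."""
    mean = statistics.mean(counts)
    return statistics.stdev(counts) / mean if mean else 0.0

def significantly_different(before, after, noise_factor=3.0):
    """Treat a change in mean counts as real only if it exceeds a few
    times the worst observed noise of the two sets of runs."""
    noise = max(relative_spread(before), relative_spread(after))
    mean_before = statistics.mean(before)
    mean_after = statistics.mean(after)
    change = abs(mean_after - mean_before) / mean_before
    return change > noise_factor * noise

# Illustrative counts with roughly 10% run-to-run noise, as can happen
# on ARM CPUs: the small difference in means is lost in the noise.
before = [1.00e9, 1.10e9, 0.95e9]
after = [0.97e9, 1.05e9, 1.02e9]
print(significantly_different(before, after))
```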
Here is a workflow pattern that makes it possible to derive significant value from CPU event counts, despite all the problems noted above:
Some things NOT to be done: