| # CPU cache and other CPU event profiling |
| |
| CPUs are able to |
| [record](https://en.wikipedia.org/wiki/Hardware_performance_counter) certain |
| events that may be relevant when investigating the performance of a program. A |
| common example of such an event is a ["cache |
| miss"](https://en.wikipedia.org/wiki/CPU_cache#Cache_miss), when the program |
| tries to access data in memory that isn't already in some CPU cache, causing |
| that access to be slower than it could otherwise be. |
| |
| Querying and analyzing this data can be useful, but is hard in two distinct |
| ways: |
| * Depending on the CPU and on the OS, both hardware and software limitations can |
| get in the way of obtaining accurate data. |
| * This data tends to be inherently difficult to interpret, even when it is |
| perfectly accurate. In practice it is often noisy and inaccurate, which makes |
| interpretation even more complicated. |
| |
| There are two parts to this page: platform-specific information about [how to |
| query](#perf-and-simpleperf-on-linux-and-android) this data, and, at the end, a |
| platform-independent explanation of [how to |
| interpret](#interpreting-cpu-event-counts) it. |
| |
| ## Perf and Simpleperf, on Linux and Android |
| |
| ### Overview |
| |
| The Linux kernel exposes system event counters to user-space programs by means |
| of the |
| [`perf_event_open`](https://man7.org/linux/man-pages/man2/perf_event_open.2.html) |
| system call. This includes both hardware event counters (such as CPU cache |
| events) and software events from the kernel (such as page faults and context |
| switches). Anyone may use this system call to implement a profiler, but Linux |
| readily offers one, [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page). |
| |
| ### Preserving Artifacts |
| |
| By default IREE cleans up any temporary files it creates while running. Tools |
| like perf, however, require those files exist even after the process has exited. |
| The environment variable `IREE_PRESERVE_DYLIB_TEMP_FILES` can be set to preserve |
| the files. This is only needed for the CPU path when using the system loader. |
| |
| ```shell |
| export IREE_PRESERVE_DYLIB_TEMP_FILES=1 |
| ``` |
| |
| ### Desktop Linux |
| |
| On desktop Linux we can use |
| [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page). It is provided on |
| most Linux distributions, for instance on Debian-based distributions do: |
| ```shell |
| sudo apt install linux-perf |
| ``` |
| |
| Run the program to be profiled, prepending its command line with `perf record`. |
| By default this will write the profile data to the current directory, |
| `./perf.data`. Sometimes this isn't ideal, such as then the current directory is |
| under version control. Explicit paths can be specified by `-o` flag to direct |
| the output of `perf record`, and then by `-i` flags to select the input of |
| subsequent commands analyzing the profile. Example: |
| |
| ```shell |
| perf record -o /tmp/perf.data \ |
| ./iree/tools/iree-benchmark-module \ |
| --driver=dylib \ |
| ... command-line arguments of iree-benchmark-module as usual ... |
| ``` |
| |
| By default, this samples time spent. One may specify instead an event to sample |
| by, with the `-e` flag. For instance, to sample by L1 cache misses, one may do: |
| |
| ```shell |
| perf record -o /tmp/perf.data -e L1-dcache-load-misses \ |
| ./iree/tools/iree-benchmark-module \ |
| --driver=dylib \ |
| ... command-line arguments of iree-benchmark-module as usual ... |
| ``` |
| |
| `perf list` dumps the list of event types. |
| |
| Once you have recorded a profile, there are two main ways to analyze it: `perf |
| report` and `perf annotate`. |
| |
| `perf report` breaks down the event counts by symbol. In the default case where |
| what was sampled was time, this is just an ordinary profile by symbol name, no |
| different than what could be viewed in other profilers such as |
| [Tracy](profiling_with_tracy.md). Where it gets really interesting is when the |
| profile was recording a specific event type, as in the above `-e |
| L1-dcache-load-misses` example: |
| |
| ``` |
| perf report -i /tmp/perf.data |
| |
| Samples: 6K of event 'L1-dcache-load-misses', Event count (approx.): 362571861 |
| Overhead Command Shared Object Symbol |
| 61.53% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_31 |
| 13.30% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_11 |
| 2.11% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_13 |
| 1.90% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_19 |
| 1.54% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_25 |
| 1.49% cpu0 dylib_executablenzpx2Q.so [.] serving_default_ex_dispatch_5 |
| ``` |
| |
| `perf annotate` breaks down the event counts by instruction. Again, in the |
| default case where what was sampled was time, this is no different than what |
| could be viewed in Tracy, and the real motivation to use `perf` is when |
| profiling by specific event types as in the above `-e L1-dcache-load-misses` |
| example: |
| |
| ``` |
| perf annotate -i perf.data |
| |
| Samples: 6K of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 362571861 |
| serving_default_ex_dispatch_31 /tmp/dylib_executablenzpx2Q.so [Percent: local period] |
| 1.66 │ movups -0x1000(%rdi),%xmm10 |
| 0.48 │ movups -0x800(%rdi),%xmm9 |
| 0.82 │ movups (%rdi),%xmm8 |
| 0.49 │ movaps %xmm1,%xmm4 |
| 0.12 │ shufps $0x0,%xmm1,%xmm4 |
| 0.14 │ mulps %xmm5,%xmm4 |
| 0.28 │ addps %xmm6,%xmm4 |
| 0.60 │ movaps %xmm3,%xmm6 |
| 0.34 │ shufps $0x0,%xmm3,%xmm6 |
| ``` |
| |
| #### Warning |
| |
| `perf annotate` is even noisier than `perf report` as it can be overly |
| optimistic, depending on the CPU, to pin an event to a specific instruction. |
| Typically, this works fairly well on x86 CPUs and less well on ARM CPUs and more |
| generally on anything mobile. Even on a desktop x86 CPU, this is noisy, as the |
| above example (recorded on a Skylake workstation) shows: it blamed a `mulps |
| %xmm5,%xmm4` instruction for a cache miss, which doesn't make sense as that |
| instruction only touches registers. |
| |
| ### Android |
| |
| On Android we can use |
| [`simpleperf`](https://developer.android.com/ndk/guides/simpleperf). It's |
| preinstalled on current Android `userdebug` images, and part of the Android NDK. |
| |
| In theory, as Android is Linux, it should be possible to use `perf`. |
| Unfortunately, `perf` is difficult to build for Android. Fortunately, |
| `simpleperf` is readily available: it is preinstalled in Android `userdebug` |
| images, and it is part of the Android NDK. |
| |
| First, we record on the device: |
| ```shell |
| adb shell \ |
| simpleperf record -e raw-l1d-cache-refill -o /data/local/tmp/perf.data \ |
| /data/local/tmp/iree-benchmark-module \ |
| --driver=dylib \ |
| ... command-line arguments of iree-benchmark-module as usual ... |
| ``` |
| |
| Then pull the recorded data from the device, and analyze on the desktop. We |
| assume that `${ANDROID_NDK}` points to the local copy of the Android NDK. |
| ```shell |
| adb pull /data/local/tmp/perf.data /tmp/perf.data |
| ${ANDROID_NDK}/simpleperf/report.py -i /tmp/perf.data |
| ``` |
| |
| This prints a breakdown of `raw-l1d-cache-refill` events by symbol. |
| |
| Like with `perf`, a list of event types can be queried by the `list` subcommand: |
| |
| ```shell |
| adb shell simpleperf list |
| ``` |
| |
| #### No support for `annotate` by CPU event |
| |
| There is no `simpleperf annotate`. The `simpleperf` documentation lists a couple |
| of |
| [ways](https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md#show-annotated-source-code-and-disassembly) |
| of achieving the same thing. |
| |
| However: |
| * The common case of annotating by time, as opposed to annotating by CPU event, |
| is supported by [Tracy](profiling_with_tracy.md). |
| * Annotating by CPU event is inherently not working due to hardware limitations |
| of the ARM CPUs found in Android devices. That is, the hardware is too |
| imprecise at pinning an event to a particular instruction. |
| |
| ## Interpreting CPU event counts |
| |
| ### Problems |
| |
| There are multiple layers of complexity in interpreting CPU event counts. |
| |
| #### These events are in themselves normal |
| |
| The first difficulty is in the fact that most of these events are *normal*. So |
| just knowing that they happened is not in itself actionable. |
| |
| For example, if we learn that some code causes cache misses, that isn't big |
| news: so does all code. Maybe this code has *too many* cache misses, but how |
| many is too many? Maybe this code alone accounts for a large fraction of the |
| overall total of the whole program, but maybe even that is normal, for instance |
| if the code being studied is the 'hot' part of the program where a large |
| fraction of overall time is spent? |
| |
| #### These events are hardware-dependent and under-documented |
| |
| Many of these events have a meaning that varies between CPUs and that is |
| difficult to characterize on any CPU, let alone in a way that applies to all |
| CPUs. |
| |
| For example, take the "L2 data cache refill". On ARM, with `simpleperf`, that |
| would be `raw-l2d-cache-refill`. Questions: |
| * Is “L2” [inclusive](https://en.wikipedia.org/wiki/Cache_inclusion_policy) of |
| “L1”? |
| * How many bytes are transferred per “refill”? |
| * Are accesses induced by speculative execution or by automatic pre-fetching |
| counted in the same way as accesses induced by actual code execution? |
| |
| The answers to all of the above questions are CPU-dependent. They may even vary |
| between the CPU cores of the same Android device. |
| |
| #### These events are imprecise and noisy, particularly on ARM CPUs |
| |
| Expect noise levels above 10% in many CPU event counts on ARM CPUs. Moreover, on |
| ARM, as discussed above, there is inaccuracy in which instruction is blamed for |
| which event, which will increase inaccuracy of per-symbol breakdowns for very |
| cheap symbols (and makes `perf annotate` impossible as noted above). Finally, be |
| aware that some ARM CPUs may perform event count interpolation, so we may not |
| have any access to true hardware counts. |
| |
| ### Recommendations |
| |
| Here is a workflow pattern that allows to make significant use of CPU event |
| counts, despite all the problems noted above: |
| * Hypothesize that some code diff might help performance, and might help |
| reducing the number of CPU events of a certain type, and that the two might be |
| related. |
| * Benchmark with and without the code diff, on the same device, everything else |
| being equal. |
| * Let your benchmark perform a fixed number of iterations, or, if using a |
| benchmark termination condition of the form "run until at least N seconds |
| have elapsed", carefully divide event counts by the actual number of |
| iterations that were run. |
| * If the observed CPU event count difference is significant, go ahead and claim |
| that your code diff probably helps with that aspect of CPU behavior. |
| |
| Some things NOT to be done: |
| * Don’t try to compare different metrics, not even when it seems obvious that |
| they should satisfy a simple relationship, not even on the same CPU (e.g. “L1 |
| accesses should be greater than L2 accesses”). |
| * Don’t divide by some “total” metric to get some kinds of ratios. For example, |
| don’t try to compute a “cache miss ratio” as quotient of “cache refill” over |
| “all cache accesses” metrics. The first problem with that (even before we get |
| to CPU-specific issues) is that that’s rewarding increases to the “all cache |
| accesses” metrics, so if something bad happens in your codegen and your kernel |
| ends up spilling a lot of register to the stack, that’s going to be a lot more |
| accesses which will all be L1 hits so that’ll help this ratio look better! So |
| more generally, just try to minimize some CPU metrics (that count “costly” |
| events), not some more complex math expression formed from arithmetic on CPU |
| metrics. |