Tweaks to e2e_benchmarking.md (#3513)

* Clarify benchmarking docs on Ruy in TFLite.

* Clarify the -mavx2 flag on the x86 build and remove it from the ARM build.

* Expand docs on profiling and advise against benchmarking with profiling enabled.
diff --git a/docs/developing_iree/e2e_benchmarking.md b/docs/developing_iree/e2e_benchmarking.md
index 3fe8604..96523c4 100644
--- a/docs/developing_iree/e2e_benchmarking.md
+++ b/docs/developing_iree/e2e_benchmarking.md
@@ -178,12 +178,25 @@
 # Enter the TensorFlow Bazel workspace.
 cd third_party/tensorflow/
 
-# Build the benchmark_model binary without RUY...
+# Build the benchmark_model binary.
 bazel build --copt=-mavx2 -c opt \
   //tensorflow/lite/tools/benchmark:benchmark_model
 
-# ...or build the benchmark_model binary with RUY. This will overwrite the
+# By default, TFLite/x86 uses various matrix multiplication libraries.
+# It is possible to force it to use only Ruy for all matrix multiplications;
+# that is already the default on ARM, but not on x86. This will overwrite
+# the previous binary unless you move it.
+#
+# Note that Ruy takes care of -mavx2 and other AVX extensions internally,
+# so passing this flag here makes no difference to the matrix
+# multiplications themselves. However, the rest of TFLite's kernels
+# outside of Ruy will still benefit from -mavx2.
 bazel build --copt=-mavx2 -c opt \
   --define=tflite_with_ruy=true \
   //tensorflow/lite/tools/benchmark:benchmark_model
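+
+# As the comment above notes, this overwrites the previous binary. A minimal
+# sketch of keeping both binaries around for comparison, mirroring the ARM
+# section below (the _plus_ruy suffix is just a naming convention):
+mv bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
+  bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_ruy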
@@ -274,6 +287,9 @@
 
 ```shell
 # Build the benchmark_model binary without any add-ons.
+# Note that unlike TFLite/x86, TFLite/ARM uses Ruy by default for all
+# matrix multiplications (no need to pass tflite_with_ruy), except for
+# some matrix*vector products. Below we show how to force Ruy for those too.
 bazel build -c opt \
   --config=android_arm64 \
   --cxxopt='--std=c++17' \
@@ -286,20 +302,21 @@
 ```
 
 ```shell
-# Build the benchmark_model binary with ruy.
-bazel build --copt=-mavx2 -c opt \
+# Build the benchmark_model binary using Ruy even for matrix*vector
+# products. This is only worth trying on models that are heavy on
+# matrix*vector shapes, typically LSTMs and other RNNs.
+bazel build -c opt \
   --config=android_arm64 \
   --cxxopt='--std=c++17' \
-  --define=tflite_with_ruy=true \
   --copt=-DTFLITE_WITH_RUY_GEMV \
   //tensorflow/lite/tools/benchmark:benchmark_model
 
 # Rename the binary for comparison with the standard benchmark_model.
 mv bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
-  bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_ruy
-adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_ruy \
+  bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_ruy_gemv
+adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_ruy_gemv \
   /data/local/tmp/
-adb shell chmod +x /data/local/tmp/benchmark_model_plus_ruy
+adb shell chmod +x /data/local/tmp/benchmark_model_plus_ruy_gemv
 ```
 
 ```shell
@@ -336,17 +353,23 @@
   --warmup_runs=1 \
   --num_threads=1 \
-  --num_runs=10 \
-  --enable_op_profiling=true
+  --num_runs=10
 ```
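+
+Note: `taskset f0` sets the CPU affinity mask to `0xf0` (CPUs 4-7), which on
+typical big.LITTLE Android devices pins the benchmark to the big cores and
+gives more stable timings.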
 
 ```shell
-# Benchmark with TFLite + RUY.
-adb shell taskset f0 /data/local/tmp/benchmark_model_plus_ruy \
+# Benchmark with TFLite + Ruy GEMV.
+adb shell taskset f0 /data/local/tmp/benchmark_model_plus_ruy_gemv \
   --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \
   --warmup_runs=1 \
   --num_threads=1 \
-  --num_runs=10 \
-  --enable_op_profiling=true
+  --num_runs=10
 ```
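+
+Comparing the latencies reported by `benchmark_model` and
+`benchmark_model_plus_ruy_gemv` on the same graph shows whether forcing Ruy
+for matrix*vector products helps the model at hand.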
 
 ```shell
@@ -356,7 +379,6 @@
   --warmup_runs=1 \
   --num_threads=1 \
-  --num_runs=10 \
-  --enable_op_profiling=true
+  --num_runs=10
 ```
 
 ```shell
@@ -366,7 +388,9 @@
   --warmup_runs=1 \
   --num_threads=1 \
   --num_runs=10 \
-  --enable_op_profiling=true \
   --use_gpu=true
 ```
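+
+Note: `--use_gpu=true` makes `benchmark_model` execute the graph via the
+TFLite GPU delegate instead of the CPU kernels.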
 
@@ -384,3 +408,44 @@
 name of the `.tflite` graph that you need to benchmark _may_ be different from
 the name of the trace that you want to benchmark, but you can use `cat` on
 the `graph_path` file to verify the correct `.tflite` filename if you're unsure.
+
+### Profile
+
+There are two profilers built into TFLite's `benchmark_model` program. Both of them affect latencies, so they should only be used to get a breakdown of the relative time spent in each operator type; they should not be enabled when the goal is to measure a latency.
+
+The first is `enable_op_profiling`, which is based on timestamps taken before and after each op. It is a runtime command-line flag of `benchmark_model`. Example:
+
+```shell
+adb shell taskset f0 /data/local/tmp/benchmark_model \
+  --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \
+  --warmup_runs=1 \
+  --num_threads=1 \
+  --num_runs=10 \
+  --enable_op_profiling=true
+```
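+
+At the end of the run, this prints a table breaking down the time spent in each operator type.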
+
+The second is `ruy_profiler`. Despite its name, it's available regardless of whether Ruy is used for the matrix multiplications. It's a sampling profiler, which allows it to provide more detailed information, particularly on matrix multiplications. It's a build-time switch:
+
+```shell
+bazel build \
+  --define=ruy_profiler=true \
+  -c opt \
+  --config=android_arm64 \
+  //tensorflow/lite/tools/benchmark:benchmark_model
+```
+
+The binary built this way can be run just like above; no extra command-line flag is needed.
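+
+For example, after pushing the freshly built binary to the device and making it executable as before, it can be invoked with the same flags; the profiling report should be printed along with the normal benchmark output:
+
+```shell
+adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model /data/local/tmp/
+adb shell chmod +x /data/local/tmp/benchmark_model
+adb shell taskset f0 /data/local/tmp/benchmark_model \
+  --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \
+  --warmup_runs=1 \
+  --num_threads=1 \
+  --num_runs=10
+```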