)]}'
{
  "commit": "4983668103817e5f6c58b410e6ebd6c49aa03b2b",
  "tree": "90c7b0f4d121ae7a3e8a4670d5471bda7f9733cf",
  "parents": [
    "a7a784e1f3f2df1080525c41f19a942771694480"
  ],
  "author": {
    "name": "bjacob",
    "email": "benoitjacob@google.com",
    "time": "Wed Oct 11 16:21:21 2023 -0400"
  },
  "committer": {
    "name": "GitHub",
    "email": "noreply@github.com",
    "time": "Wed Oct 11 16:21:21 2023 -0400"
  },
  "message": "`mmt4d` ukernel for the `bf16*bf16-\u003ef32` case using AVX-512-BF16 (#15089)\n\n`bf16*bf16-\u003ef32` is the one floating-point element type that can perform\r\nfaster than regular `f32` on widely-available x86 CPUs, as AVX-512-BF16\r\nis available on Intel Cooper Lake and AMD Zen4 microarchitectures.\r\n\r\nBy contrast, AVX-512-FP16 is only available on Intel Sapphire Rapids.\r\n\r\nThe below benchmark results show this being a 2x speedup over `f32` and\r\nmatching current `i8*i8-\u003ei32` performance with VNNI, which could make it\r\nquite useful as rounding f32 to the nearest bf16 is easier and less\r\naccuracy-compromising than i8-quantizing. The accumulator is still f32\r\nso there\u0027s no change there.\r\n\r\n### Crazy idea\r\n\r\nRewriting `f32` matmuls into rounding LHS/RHS to `bf16` then doing a\r\n`bf16*bf16-\u003ef32` matmul is something that we could do as an (opt-in)\r\nautomatic rewrite for people who have a `f32` workload on their hands\r\nand who would like a quick flag to try to get a cheap speedup. Accuracy\r\nis likely to be good enough for a majority of workloads, unlike much\r\nharder `i8` quantization. Try this flag, get 2x speedup on f32 workloads\r\non recent x86 CPU, and a 4x speedup on recent ARM. By contrast, an\r\nequivalent flag for fp16 would be a 2x slowdown on x86 (except Sapphire\r\nRapids) and a 2x speedup on recent ARM.\r\n\r\n### Benchmark on AMD Zen4 (Ryzen 9 7940HS), `mmt4d_benchmark`\r\n(single-thread microbenchmark):\r\n\r\n|Benchmark                                              |Gop/s|\r\n|-------------------------------------------------------|-----|\r\n|BM_mmt4d_f32f32f32_tile_1x8x1_avx2_fma                 |25.5 |\r\n|BM_mmt4d_f32f32f32_tile_2x8x1_avx2_fma                 |43.3 |\r\n|BM_mmt4d_f32f32f32_tile_4x8x1_avx2_fma                 |82.3 |\r\n|BM_mmt4d_f32f32f32_tile_8x8x1_avx2_fma                 |132.9|\r\n|BM_mmt4d_f32f32f32_tile_1x16x1_avx512_base             |44.1 |\r\n|BM_mmt4d_f32f32f32_tile_2x16x1_avx512_base             |87.8 |\r\n|BM_mmt4d_f32f32f32_tile_4x16x1_avx512_base             |167.1|\r\n|BM_mmt4d_f32f32f32_tile_8x16x1_avx512_base             |166.7|\r\n|BM_mmt4d_f32f32f32_tile_16x16x1_avx512_base            |166.3|\r\n|BM_mmt4d_f16f16f32_tile_1x8x1_avx2_fma                 |21.6 |\r\n|BM_mmt4d_f16f16f32_tile_2x8x1_avx2_fma                 |41.0 |\r\n|BM_mmt4d_f16f16f32_tile_4x8x1_avx2_fma                 |74.1 |\r\n|BM_mmt4d_f16f16f32_tile_8x8x1_avx2_fma                 |81.0 |\r\n|BM_mmt4d_f16f16f32_tile_1x16x1_avx512_base             |38.6 |\r\n|BM_mmt4d_f16f16f32_tile_2x16x1_avx512_base             |64.5 |\r\n|BM_mmt4d_f16f16f32_tile_4x16x1_avx512_base             |73.5 |\r\n|BM_mmt4d_f16f16f32_tile_8x16x1_avx512_base             |79.4 |\r\n|BM_mmt4d_f16f16f32_tile_16x16x1_avx512_base            |82.6 |\r\n|BM_mmt4d_f16f16f16_tile_1x8x1_avx2_fma                 |21.6 |\r\n|BM_mmt4d_f16f16f16_tile_2x8x1_avx2_fma                 |40.9 |\r\n|BM_mmt4d_f16f16f16_tile_4x8x1_avx2_fma                 |73.8 |\r\n|BM_mmt4d_f16f16f16_tile_8x8x1_avx2_fma                 |80.1 |\r\n|BM_mmt4d_f16f16f16_tile_1x16x1_avx512_base             |39.1 |\r\n|BM_mmt4d_f16f16f16_tile_2x16x1_avx512_base             |66.5 |\r\n|BM_mmt4d_f16f16f16_tile_4x16x1_avx512_base             |73.7 |\r\n|BM_mmt4d_f16f16f16_tile_8x16x1_avx512_base             |79.3 |\r\n|BM_mmt4d_f16f16f16_tile_16x16x1_avx512_base            |82.3 |\r\n|BM_mmt4d_bf16bf16f32_tile_1x16x2_avx512_bf16           |68.0 |\r\n|BM_mmt4d_bf16bf16f32_tile_2x16x2_avx512_bf16           |123.2|\r\n|BM_mmt4d_bf16bf16f32_tile_4x16x2_avx512_bf16           |228.9|\r\n|BM_mmt4d_bf16bf16f32_tile_8x16x2_avx512_bf16           |333.6|\r\n|BM_mmt4d_bf16bf16f32_tile_16x16x2_avx512_bf16          |332.2|\r\n|BM_mmt4d_i8i8i32_tile_1x8x2_avx2_fma                   |57.6 |\r\n|BM_mmt4d_i8i8i32_tile_2x8x2_avx2_fma                   |78.3 |\r\n|BM_mmt4d_i8i8i32_tile_4x8x2_avx2_fma                   |92.9 |\r\n|BM_mmt4d_i8i8i32_tile_8x8x2_avx2_fma                   |186.0|\r\n|BM_mmt4d_i8i8i32_tile_1x16x2_avx512_base               |37.8 |\r\n|BM_mmt4d_i8i8i32_tile_2x16x2_avx512_base               |49.0 |\r\n|BM_mmt4d_i8i8i32_tile_4x16x2_avx512_base               |59.2 |\r\n|BM_mmt4d_i8i8i32_tile_8x16x2_avx512_base               |118.4|\r\n|BM_mmt4d_i8i8i32_tile_16x16x2_avx512_base              |231.8|\r\n|BM_mmt4d_i8i8i32_tile_1x16x2_avx512_vnni               |50.8 |\r\n|BM_mmt4d_i8i8i32_tile_2x16x2_avx512_vnni               |70.0 |\r\n|BM_mmt4d_i8i8i32_tile_4x16x2_avx512_vnni               |83.3 |\r\n|BM_mmt4d_i8i8i32_tile_8x16x2_avx512_vnni               |165.5|\r\n|BM_mmt4d_i8i8i32_tile_16x16x2_avx512_vnni              |328.1|",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "cdbce72a0fec176ecaee2a6dd3f080673b322bc6",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/BUILD.bazel",
      "new_id": "88a4cc8b153f2f0d6c3d1bc1238604923b3d0ff0",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/BUILD.bazel"
    },
    {
      "type": "modify",
      "old_id": "fb8ddedff3cd7ffea6d6aecda5d8bc24c585c257",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/CMakeLists.txt",
      "new_id": "8da4a7eaf84aa5afa01c3a5a15e29673d3fbe1b8",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/CMakeLists.txt"
    },
    {
      "type": "modify",
      "old_id": "53627c49bad54407975f814c5a4b51bafc06f546",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/common_x86_64_entry_point.h",
      "new_id": "9720e64c374ca2f5d395d8b3b5ee8f58b597f90e",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/common_x86_64_entry_point.h"
    },
    {
      "type": "modify",
      "old_id": "3740e33ec40aca6d287827d6a32eab28f201b7cd",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/config_x86_64.h.in",
      "new_id": "776fe95ee6a8eee7aff53a170b6b0a058b464bee",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/config_x86_64.h.in"
    },
    {
      "type": "add",
      "old_id": "0000000000000000000000000000000000000000",
      "old_mode": 0,
      "old_path": "/dev/null",
      "new_id": "2171393a02dd904d29db5b00e9f5aefb302fe8da",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_avx512_bf16.c"
    },
    {
      "type": "modify",
      "old_id": "18cd9ddc45de22503ea94f04866ef0a4b1356077",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_entry_point.c",
      "new_id": "cbc6b780066bf0b8f0f7ef9e8068722cd4a277be",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_entry_point.c"
    },
    {
      "type": "modify",
      "old_id": "6d63762503a09664b20d731e438b564064ec3b86",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_internal.h",
      "new_id": "b98de8c5854300237a983aa03287fc1c3f915afe",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_internal.h"
    },
    {
      "type": "modify",
      "old_id": "25ee8c9588d13387b0b8717a030fd20d40ad0f5d",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/tools/mmt4d_benchmark.c",
      "new_id": "46556bb614354fb7435d9b842dc79851eb8aae32",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/tools/mmt4d_benchmark.c"
    },
    {
      "type": "modify",
      "old_id": "9b9d0ba4d988b34eb6d4e0bef5fe44ddcfdef80a",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/tools/mmt4d_test.c",
      "new_id": "eb5c28ee874ff5dc2bf3d7c4a4052a47a2391bdc",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/tools/mmt4d_test.c"
    },
    {
      "type": "modify",
      "old_id": "1861c3c53212dae9ef19b77edd5e54a26ec5d48c",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/tools/util.c",
      "new_id": "e5fd58a90eeea3e6b0be22b3789f5808868e0ea1",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/tools/util.c"
    }
  ]
}
