)]}'
{
  "commit": "f458a88724397414192c38e64f8af76e18eee49e",
  "tree": "02020cab391bb4029a3f45552603e95530bc0f49",
  "parents": [
    "f9d7599874087fa30b32446d5459b627a5024558"
  ],
  "author": {
    "name": "bjacob",
    "email": "benoitjacob@google.com",
    "time": "Fri Nov 10 10:08:36 2023 -0500"
  },
  "committer": {
    "name": "GitHub",
    "email": "noreply@github.com",
    "time": "Fri Nov 10 10:08:36 2023 -0500"
  },
  "message": "Optimized vecmat ukernel tile functions for `i16 x u4 -\u003e i32` on AVX-512-VNNI (#15525)\n\nThis kernel is parametrized in N0, allowing N0\u003d\u003d16 and N0\u003d\u003d32.\r\nPerformance on AMD Ryzen 9 7950X3D:\r\n  - With N0\u003d16:  180 Gop/s.\r\n  - With N0\u003d32:  240 Gop/s.\r\n\r\nThese numbers show that there\u0027s a nice reward for going extra large, but\r\nthat\u0027s also a liability for vecmat shapes whose N dimension isn\u0027t a\r\nmultiple of 32. Maybe we can keep both for now.\r\n\r\nThis is currently by far our fastest vecmat tile function --- it\u0027s fast\r\neven by general-matmul standards, while usually vecmat\u0027s low arithmetic\r\nintensity relegates it to lower performance levels. It shows what\u0027s\r\npossible now that we\u0027ve decoupled vecmat tile shapes from general matmul\r\ntile shapes in #15431 . That 32x8 is not a truncation of a general\r\nmatmul tile shape. Other element types and CPU architectures all need to\r\nget the same treatment.\r\n\r\nThe idea of this kernel is to split the LHS s16 values into high and low\r\n8-bit components to be able to use `_mm512_dpbusd_epi32`.\r\n\r\nIn itself, that doesn\u0027t reduce the number of arithmetic instructions:\r\nwhile each now computes a 4D dot-product instead of a 2D one as in\r\n`_mm512_dpwssd_epi32`, we now need twice more of them to do separately\r\nthe high and low 8bit parts of the LHS s16 values.\r\n\r\nThe real benefit is that this removes the need to extend RHS u4 values\r\nto s16. Since this is a vecmat kernel, the LHS is small and the RHS is\r\nbig, so it matters to avoid RHS-processing work.\r\n\r\nIt\u0027s not trivial how to use `_mm512_dpbusd_epi32`, with its quirky\r\nunsigned * signed semantics. We take advantage of the fact that our u4\r\nRHS values, when extended to u8, do not use the top bit -- so they are\r\nalso interpretable as s8 values in place. So this is specific to RHS\r\nbeing less-than-8-bit values (it\u0027s not specific beyond that to 4bit).\r\nMeanwhile, when we split the LHS s16 values into high and low 8bit\r\ncomponents the high 8bits are signed s8 and the low 8bit are unsigned\r\nu8. So, for each of the combinations of operands that we have to feed\r\n`_mm512_dpbusd_epi32`, we manage to find an operand order that\r\naccomodates the instruction\u0027s requirements on signednesses.",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "943d6074ffc5c59d6699455b84cf824fc3143964",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_avx512_vnni.c",
      "new_id": "db47a07516320bfb13db6089cfd9bdb6d0b9083e",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_avx512_vnni.c"
    },
    {
      "type": "modify",
      "old_id": "b957957e13c05e0d4280d0974c26cd779bb276de",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_entry_point.c",
      "new_id": "b82451adf1e3a5b8e2f4d2415572fd95a6964ba9",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_entry_point.c"
    },
    {
      "type": "modify",
      "old_id": "f82ff0c889b79d54b46c8fbcd3b245fde094625e",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_internal.h",
      "new_id": "c295edfaf7536c56c3fe97770e61320ab6f15019",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_internal.h"
    },
    {
      "type": "modify",
      "old_id": "ffa40a775a8765039574f9889c3d25c0df8010a7",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/tools/mmt4d_benchmark.c",
      "new_id": "c81ab9364443b399817a94a84a313fc146c69ba3",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/tools/mmt4d_benchmark.c"
    },
    {
      "type": "modify",
      "old_id": "995309fe374a927d0b05010e67ceb0dd80d78067",
      "old_mode": 33188,
      "old_path": "runtime/src/iree/builtins/ukernel/tools/mmt4d_test.c",
      "new_id": "440a028bd724da8d0b15349d20133209249fa668",
      "new_mode": 33188,
      "new_path": "runtime/src/iree/builtins/ukernel/tools/mmt4d_test.c"
    }
  ]
}
