ukernels: update README.md (#16358)
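
The rewritten README describes ukernels as functions that do arithmetic on
buffers passed as pointer and stride arguments, with no libc, no OS interface,
no state, and no side effects beyond writing to destination buffers. As a rough
illustration of that style, here is a minimal sketch. Every name in it
(`example_uk_row_sum`, its params struct, `EXAMPLE_UK_FLAG_ACCUMULATE`) is
hypothetical and does not exist in the IREE sources; the accumulate flag only
mirrors the idea of the mmt4d accumulate flag used in the ukernel tests.

```c
// Hypothetical example, not part of the IREE sources. There are deliberately
// no #include lines: the README rules out even C standard library headers.

// Assumed flag for this sketch: add to the existing destination values
// instead of overwriting them.
#define EXAMPLE_UK_FLAG_ACCUMULATE 1u

// All inputs arrive through a caller-owned struct: buffers as pointers,
// layout as a stride (in elements), behavior as flags.
typedef struct example_uk_row_sum_params_t {
  const float* in_buffer;  // read-only source, `rows` rows of `cols` elements
  float* out_buffer;       // destination, one element per row
  long in_stride0;         // distance in elements between consecutive rows
  long rows;
  long cols;
  unsigned flags;
} example_uk_row_sum_params_t;

// Sums each input row into out_buffer. It uses no globals, allocates nothing
// and calls nothing, so its only effect is the write to the destination.
void example_uk_row_sum(const example_uk_row_sum_params_t* params) {
  for (long i = 0; i < params->rows; ++i) {
    const float* row = params->in_buffer + i * params->in_stride0;
    float acc = (params->flags & EXAMPLE_UK_FLAG_ACCUMULATE)
                    ? params->out_buffer[i]
                    : 0.f;
    for (long j = 0; j < params->cols; ++j) acc += row[j];
    params->out_buffer[i] = acc;
  }
}
```

Because the function keeps no state and writes only to `out_buffer`, calls with
disjoint destination buffers can run concurrently on multiple threads, which is
the reentrancy property the README asks for.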

diff --git a/runtime/src/iree/builtins/ukernel/README.md b/runtime/src/iree/builtins/ukernel/README.md
index 59868f6..a2ff53f 100644
--- a/runtime/src/iree/builtins/ukernel/README.md
+++ b/runtime/src/iree/builtins/ukernel/README.md
@@ -1,155 +1,38 @@
-IREE Microkernels Library: `libukernel`
-=======================================
+# Microkernels library
 
-This library provides builtin microkernels to both the IREE VMVX module for
-runtime linkage and the IREE compiler for ahead-of-time compilation. Each
-deployment approach has tradeoffs and the intent with this library is to share
-the same compiler passes/infrastructure for emitting the microkernel ops and
-the same microkernel implementations.
+## Walk-through presentation
 
-## Runtime Linkage
+Here is a walk-through of how ukernels are built and used in IREE:
+https://gist.github.com/bjacob/2c160c102cee33562826c730945b49f2
 
-For deployments targeting the IREE VM the compiler will produce .vmfb modules
-that use the VMVX module (`iree/modules/vmvx/module.c`). The code in this
-library is linked into the runtime VMVX module and called via the VM FFI:
-```
-                     +------------+      +---------+      +================+
-                     | input.mlir | ---> | codegen | ---> |  iree-compile  |
-                     +------------+      +---------+      +================+
-                                                                  |
-                                                                  v
-+-----------+      +------------+      +--------------+    +--------------+
-| mmt4d_*.c | ---> | C compiler | ---> | libukernel.a |    | .vmfb module |
-+-----------+      +------------+      +--------------+    +--------------+
-                                             |                    |
-                                             v                    v
-                                 +-------------------+      +============+
-                                 | iree/modules/vmvx | ---> | VM context |
-                                 +-------------------+      +============+
-```
+## What is a microkernel?
 
-As the definition of the VMVX ops in the compiler have to match the ones in the
-runtime the interface is difficult to change without breaking binary
-compatibility. Because of this the exported VMVX methods are intended to be
-generic, stable, and consistent across platforms. The microkernels in this
-library are insulated by the VMVX module layer which can perform versioning and
-provide fallbacks as needed.
+A microkernel (abbreviated as "ukernel") is a function that can be used as a lowering for an MLIR arithmetic operation. Specifically, the IREE Codegen dialect defines a `ukernel.generic` operation, which takes a "ukernel function name" attribute and is lowered to a `func.call` op calling the specified function. This directory is where we build the functions that can be used for that purpose.
 
-## Ahead-of-time Linkage
+## What can ukernels do?
 
-For deployments using ahead-of-time compilation the library is compiled to
-bitcode files that are loaded and linked while producing the generated code:
-```
-+-----------+      +-------+      +-------------------------------+
-| mmt4d_*.c | ---> | clang | ---> |+--------------------------------+
-+-----------+      +-------+      +| libukernel_[arch]_[variant].bc |
-                                   +--------------------------------+
-                                                  |||
-                                                  vvv
-      +------------+      +---------+      +================+
-      | input.mlir | ---> | codegen | ---> |  iree-compile  |
-      +------------+      +---------+      +================+
-                                                   |
-                      +----------------------------+
-                      v                            v
-         +------------------------+   +----------------------------+
-         | static library (.o/.a) |   | dynamic library (.so/.dll) |
-         +------------------------+   +----------------------------+
-```
+Ukernels can:
+* Perform arithmetic, and access (read and write) memory buffers passed to them as pointer and stride arguments.
 
-By linking the generated code together with the library bitcode the compiler can
-perform intra-procedural optimization to efficiently cull unused code paths and
-propagate known-constant values. The compiler outputs are hermetic and avoid
-version skew between the compiler and the runtime.
+Ukernels cannot:
+* Use any library, not even the C standard library.
+  * In particular, ukernels can't allocate memory. Any buffer needs to be passed to them by the caller.
+  * Ukernels can't even `#include` C standard library headers. Depending on the toolchain/platform, even `stdint.h` can bring in OS dependencies.
+* Specialize for or interface with the operating system in any way.
+  * If a ukernel needs information that would typically come from the OS, such as CPU identification details, that information needs to be passed to it as an argument, moving the problem to the caller.
+  * Ukernels are built once for each target architecture, not for each target platform. Different platforms (e.g. Windows vs Linux) need to be able to share the exact same ukernel code.
+* Have side effects, besides writing to destination buffers.
+* Have state, such as accessing globals.
+* Be non-reentrant. Ukernels will be called concurrently on multiple threads.
 
-## Bitcode Files
+## How are ukernels compiled?
 
-The IREE compiler embeds bitcode files and when producing executable libraries
-will select one for linkage based on the specified target machine. As these
-bitcode files can only be produced by a cross-compilation-enabled Clang they are
-built offline and checked into the repository. Future improvements to the
-compiler could also allow for external files to be specified to avoid the need
-to rebuild the compiler however for now this keeps things simple and hermetic.
+Ukernels are typically built in two different ways:
+1. Ukernels are compiled to LLVM bitcode using the `iree_bitcode_library` function for CPU ukernels, and analogous functions for GPU ukernels. The resulting `.bc` bitcode files are then embedded as static data in the IREE compiler. This works in exactly the same way in the CMake and Bazel builds.
+    * The IREE compiler also accepts external ukernel bitcode files, which makes it possible to use externally built ukernels.
+2. Ukernels are also built as a normal library using the native toolchain. This is mostly used for local development, testing and benchmarking. Only the CMake build implements this fully, including all the architecture-specific code paths; the Bazel build has only a minimal stub that leaves the architecture-specific code paths out.
+    * There is only one way in which this native-toolchain build of ukernels is actually used in IREE: with the VMVX back-end, ukernels are supported by linking this native ukernel code into the VMVX module, which is part of the IREE runtime.
 
-Usage is currently not wired up in the compiler but will look very similar to
-the `iree/builtins/device/` approach.
+## Unit tests and microbenchmarks
 
-## Engineering Requirements
-
-As this library is directly merged into the compiler-generated code there are
-specific restrictions as to what can be used inherited from the IREE executable
-requirements:
-
-* No mutable globals/static variables or thread-local storage
-* No syscalls
-* No libc calls outside of builtins (like memset/memcpy) - _no mallocs_!
-
-Though precompiled bitcode files only need to work with Clang the library may
-also be built on other toolchains such as GCC and MSVC (or older version of
-Clang). When standard intrinsics are used this will generally not be a problem
-however inline assembly may need compiler-specific variants or at least
-exclusions that fall back to generic paths.
-
-### Compile-time Configuration
-
-Preprocessor statements used to control behavior must only use information known
-when the bitcode files are being compiled. This means that if the bitcode file
-being produced is for AArch64 it is safe to use the `__aarch64__` macro.
-Information that is only available after the bitcode file is produced - such as
-in the IREE compiler pipelines - must use link-time configuration.
-
-### Link-time Configuration
-
-As we are producing bitcode files we cannot rely on the C preprocessor for
-changing behavior based on some information only known during linking. In other
-cases we may want to specialize code paths based on knowledge about the context
-in which the kernels are used. To provide this link-time modification ability
-there is support for flags by way of `extern` globals. These globals are either
-specified by the IREE compiler when linking the bitcode or by the hosting
-application when linked statically.
-
-For example, this flag can be specified by either passing a define when
-compiling the library for standalone/VMVX use or using the
-`overridePlatformGlobal` helper when emitting LLVM IR in the IREE compiler:
-```c
-#if defined(IREE_UK_PLATFORM_EXAMPLE_FLAG)
-static const int iree_microkernels_platform_example_flag =
-    IREE_UK_PLATFORM_EXAMPLE_FLAG;
-#else
-extern int iree_microkernels_platform_example_flag;
-#endif  // IREE_UK_PLATFORM_EXAMPLE_FLAG
-```
-
-Any code may then use this flag to condition/control behavior:
-```c
-if (iree_microkernels_platform_example_flag >= 1) {
-  // Do something special.
-}
-```
-
-When linking libmicrokernels statically the flags can be provided by the hosting
-application via compiler defines:
-`-DIREE_UK_PLATFORM_EXAMPLE_FLAG=123`.
-
-When producing bitcode the flags are left symbolic and the IREE compiler
-provides their values:
-```c++
-overridePlatformGlobal(*bitcodeModule,
-                       "iree_microkernels_platform_example_flag", 123u);
-```
-
-What flags are useful and how to handle cases where flags are arch-dependent are
-still TBD.
-
-## Testing and Benchmarking
-
-[`tools/mmt4d_test.cc`](tools/mmt4d_test.cc) provides a gtest runner
-that compares the results of the optimized implementations for the target
-architecture against a reference implementation for correctness.
-
-[`tools/mmt4d_benchmark.c`](tools/mmt4d_benchmark.c) provides a
-benchmark suite for the optimized implementations of the target architecture.
-
-Both are compiled for the CMake target and can be used to develop
-implementations without the need to rebuild/run the compiler or produce full
-compiled artifacts that operate in the runtime.
+The `tools/` directory contains unit tests and microbenchmarks for ukernels. This makes it possible to develop ukernels within this directory as a self-contained C project.
diff --git a/runtime/src/iree/builtins/ukernel/common.h b/runtime/src/iree/builtins/ukernel/common.h
index 8dca92e..a981e1b 100644
--- a/runtime/src/iree/builtins/ukernel/common.h
+++ b/runtime/src/iree/builtins/ukernel/common.h
@@ -7,25 +7,6 @@
 #ifndef IREE_BUILTINS_UKERNEL_COMMON_H_
 #define IREE_BUILTINS_UKERNEL_COMMON_H_
 
-//===----------------------------------------------------------------------===//
-// Generic microkernel library
-//===----------------------------------------------------------------------===//
-//
-// Rules summary:
-// 1. Microkernels are bare-metal, excluding even the standard C library.
-//    a. Can't #include any system header.
-//    b. Can't #include any standard library header.
-//    c. Can't interface with the OS in any way.
-// 2. Microkernels code may be specialized for a target CPU architecture, but
-//    not for a target platform/OS/triple. In particular:
-//    a. It's OK to have a `#ifdef __aarch64__` but not a `#ifdef __ANDROID__`.
-// 3. Microkernels are pure/reentrant/stateless.
-//    a. Pure: the only effect of calling a ukernel is to write to destination
-//       buffers specified by pointers passed as ukernel arguments.
-//    b. Reentrant: ukernels may be called concurrently with
-//       themselves, other ukernels, or any other code, on any thread.
-//    c. Stateless: ukernels can't access any nonconstant global variable.
-
 #ifdef __cplusplus
 #error This file should only be included in ukernel/ code, which should be C, not C++.
 #endif  // __cplusplus
diff --git a/runtime/src/iree/builtins/ukernel/tools/mmt4d_test.c.orig b/runtime/src/iree/builtins/ukernel/tools/mmt4d_test.c.orig
deleted file mode 100644
index b3f9e2c..0000000
--- a/runtime/src/iree/builtins/ukernel/tools/mmt4d_test.c.orig
+++ /dev/null
@@ -1,357 +0,0 @@
-// Copyright 2022 The IREE Authors
-//
-// Licensed under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-
-#include "iree/base/api.h"
-#include "iree/base/internal/math.h"
-#include "iree/builtins/ukernel/api.h"
-#include "iree/builtins/ukernel/mmt4d_internal.h"
-#include "iree/builtins/ukernel/tools/test.h"
-#include "iree/builtins/ukernel/tools/util.h"
-
-static void iree_mmt4d_reference_innerloop_f32f32f32(
-    float* out_ptr, const float* lhs_ptr, const float* rhs_ptr,
-    const iree_uk_mmt4d_params_t* params) {
-  float acc = params->flags & IREE_UK_FLAG_MMT4D_ACCUMULATE ? *out_ptr : 0.f;
-  for (iree_uk_index_t k = 0; k < params->K; ++k) {
-    for (iree_uk_index_t k0 = 0; k0 < params->K0; ++k0) {
-      float lhs_f32 = lhs_ptr[k * params->M0 * params->K0 + k0];
-      float rhs_f32 = rhs_ptr[k * params->N0 * params->K0 + k0];
-      acc += lhs_f32 * rhs_f32;
-    }
-  }
-  *out_ptr = acc;
-}
-
-static void iree_mmt4d_reference_innerloop_f16f16f32(
-    float* out_ptr, const uint16_t* lhs_ptr, const uint16_t* rhs_ptr,
-    const iree_uk_mmt4d_params_t* params) {
-  float acc = params->flags & IREE_UK_FLAG_MMT4D_ACCUMULATE ? *out_ptr : 0.f;
-  for (iree_uk_index_t k = 0; k < params->K; ++k) {
-    for (iree_uk_index_t k0 = 0; k0 < params->K0; ++k0) {
-      float lhs_f32 =
-          iree_math_f16_to_f32(lhs_ptr[k * params->M0 * params->K0 + k0]);
-      float rhs_f32 =
-          iree_math_f16_to_f32(rhs_ptr[k * params->N0 * params->K0 + k0]);
-      acc += lhs_f32 * rhs_f32;
-    }
-  }
-  *out_ptr = acc;
-}
-
-static void iree_mmt4d_reference_innerloop_f16f16f16(
-    uint16_t* out_ptr, const uint16_t* lhs_ptr, const uint16_t* rhs_ptr,
-    const iree_uk_mmt4d_params_t* params) {
-  uint16_t acc = params->flags & IREE_UK_FLAG_MMT4D_ACCUMULATE ? *out_ptr : 0;
-  for (iree_uk_index_t k = 0; k < params->K; ++k) {
-    for (iree_uk_index_t k0 = 0; k0 < params->K0; ++k0) {
-      float lhs_f32 =
-          iree_math_f16_to_f32(lhs_ptr[k * params->M0 * params->K0 + k0]);
-      float rhs_f32 =
-          iree_math_f16_to_f32(rhs_ptr[k * params->N0 * params->K0 + k0]);
-      float acc_f32 = iree_math_f16_to_f32(acc);
-      acc = iree_math_f32_to_f16(acc_f32 + lhs_f32 * rhs_f32);
-    }
-  }
-  *out_ptr = acc;
-}
-
-static void iree_mmt4d_reference_innerloop_bf16bf16f32(
-    float* out_ptr, const uint16_t* lhs_ptr, const uint16_t* rhs_ptr,
-    const iree_uk_mmt4d_params_t* params) {
-  float acc = params->flags & IREE_UK_FLAG_MMT4D_ACCUMULATE ? *out_ptr : 0.f;
-  for (iree_uk_index_t k = 0; k < params->K; ++k) {
-    for (iree_uk_index_t k0 = 0; k0 < params->K0; ++k0) {
-      float lhs_f32 =
-          iree_math_bf16_to_f32(lhs_ptr[k * params->M0 * params->K0 + k0]);
-      float rhs_f32 =
-          iree_math_bf16_to_f32(rhs_ptr[k * params->N0 * params->K0 + k0]);
-      acc += lhs_f32 * rhs_f32;
-    }
-  }
-  *out_ptr = acc;
-}
-
-static void iree_mmt4d_reference_innerloop_bf16bf16bf16(
-    uint16_t* out_ptr, const uint16_t* lhs_ptr, const uint16_t* rhs_ptr,
-    const iree_uk_mmt4d_params_t* params) {
-  uint16_t acc = params->flags & IREE_UK_FLAG_MMT4D_ACCUMULATE ? *out_ptr : 0;
-  for (iree_uk_index_t k = 0; k < params->K; ++k) {
-    for (iree_uk_index_t k0 = 0; k0 < params->K0; ++k0) {
-      float lhs_f32 =
-          iree_math_bf16_to_f32(lhs_ptr[k * params->M0 * params->K0 + k0]);
-      float rhs_f32 =
-          iree_math_bf16_to_f32(rhs_ptr[k * params->N0 * params->K0 + k0]);
-      float acc_f32 = iree_math_bf16_to_f32(acc);
-      acc = iree_math_f32_to_bf16(acc_f32 + lhs_f32 * rhs_f32);
-    }
-  }
-  *out_ptr = acc;
-}
-
-static void iree_mmt4d_reference_innerloop_s8s8s32(
-    int32_t* out_ptr, const int8_t* lhs_ptr, const int8_t* rhs_ptr,
-    const iree_uk_mmt4d_params_t* params) {
-  int32_t acc = params->flags & IREE_UK_FLAG_MMT4D_ACCUMULATE ? *out_ptr : 0;
-  for (iree_uk_index_t k = 0; k < params->K; ++k) {
-    for (iree_uk_index_t k0 = 0; k0 < params->K0; ++k0) {
-      int32_t lhs_i32 = lhs_ptr[k * params->M0 * params->K0 + k0];
-      int32_t rhs_i32 = rhs_ptr[k * params->N0 * params->K0 + k0];
-      acc += lhs_i32 * rhs_i32;
-    }
-  }
-  *out_ptr = acc;
-}
-
-static void iree_mmt4d_reference(const iree_uk_mmt4d_params_t* params) {
-  iree_uk_mmt4d_type_t mmt4d_type = iree_uk_mmt4d_type(params->flags);
-  iree_uk_index_t lhs_elem_size =
-      iree_uk_type_size(iree_uk_mmt4d_lhs_type(mmt4d_type));
-  iree_uk_index_t rhs_elem_size =
-      iree_uk_type_size(iree_uk_mmt4d_rhs_type(mmt4d_type));
-  iree_uk_index_t out_elem_size =
-      iree_uk_type_size(iree_uk_mmt4d_out_type(mmt4d_type));
-  for (iree_uk_index_t i = 0; i < params->M; ++i) {
-    for (iree_uk_index_t j = 0; j < params->N; ++j) {
-      void* out_tile_ptr = ((char*)params->out_buffer) +
-                           (params->out_offset + i * params->out_stride0 +
-                            j * params->M0 * params->N0) *
-                               out_elem_size;
-      const void* lhs_panel_ptr =
-          ((const char*)params->lhs_buffer) +
-          (params->lhs_offset + i * params->lhs_stride0) * lhs_elem_size;
-      const void* rhs_panel_ptr =
-          ((const char*)params->rhs_buffer) +
-          (params->rhs_offset + j * params->rhs_stride0) * rhs_elem_size;
-      for (iree_uk_index_t i0 = 0; i0 < params->M0; ++i0) {
-        for (iree_uk_index_t j0 = 0; j0 < params->N0; ++j0) {
-          void* out_ptr =
-              ((char*)out_tile_ptr) + (i0 * params->N0 + j0) * out_elem_size;
-          const void* lhs_ptr =
-              ((char*)lhs_panel_ptr) + i0 * params->K0 * lhs_elem_size;
-          const void* rhs_ptr =
-              ((char*)rhs_panel_ptr) + j0 * params->K0 * rhs_elem_size;
-          switch (params->flags & IREE_UK_FLAG_MMT4D_TYPE_MASK) {
-            case IREE_UK_FLAG_MMT4D_TYPE_F32F32F32:
-              iree_mmt4d_reference_innerloop_f32f32f32(
-                  (float*)out_ptr, (const float*)lhs_ptr, (const float*)rhs_ptr,
-                  params);
-              break;
-            case IREE_UK_FLAG_MMT4D_TYPE_F16F16F32:
-              iree_mmt4d_reference_innerloop_f16f16f32(
-                  (float*)out_ptr, (const uint16_t*)lhs_ptr,
-                  (const uint16_t*)rhs_ptr, params);
-              break;
-            case IREE_UK_FLAG_MMT4D_TYPE_F16F16F16:
-              iree_mmt4d_reference_innerloop_f16f16f16(
-                  (uint16_t*)out_ptr, (const uint16_t*)lhs_ptr,
-                  (const uint16_t*)rhs_ptr, params);
-              break;
-            case IREE_UK_FLAG_MMT4D_TYPE_BF16BF16F32:
-              iree_mmt4d_reference_innerloop_bf16bf16f32(
-                  (float*)out_ptr, (const uint16_t*)lhs_ptr,
-                  (const uint16_t*)rhs_ptr, params);
-              break;
-            case IREE_UK_FLAG_MMT4D_TYPE_BF16BF16BF16:
-              iree_mmt4d_reference_innerloop_bf16bf16bf16(
-                  (uint16_t*)out_ptr, (const uint16_t*)lhs_ptr,
-                  (const uint16_t*)rhs_ptr, params);
-              break;
-            case IREE_UK_FLAG_MMT4D_TYPE_S8S8S32:
-              iree_mmt4d_reference_innerloop_s8s8s32(
-                  (int32_t*)out_ptr, (const int8_t*)lhs_ptr,
-                  (const int8_t*)rhs_ptr, params);
-              break;
-            default:
-              IREE_UK_ASSERT(false && "unhandled type");
-          }
-          out_ptr = ((char*)out_ptr) + out_elem_size;
-        }
-      }
-    }
-  }
-}
-
-static void iree_uk_test_mmt4d_for_shape_params(
-    iree_uk_test_t* test, const iree_uk_mmt4d_params_t* src_params) {
-  iree_uk_mmt4d_params_t params;
-  memcpy(&params, src_params, sizeof params);
-  // Populate strides first - we need them below to compute buffer lengths.
-  // Randomly make strides either tight or not to exercise all cases.
-  iree_uk_random_engine_t* engine = iree_uk_test_random_engine(test);
-  params.lhs_stride0 =
-      params.K * params.M0 * params.K0 + iree_uk_random_engine_get_0_1(engine);
-  params.rhs_stride0 =
-      params.K * params.N0 * params.K0 + iree_uk_random_engine_get_0_1(engine);
-  params.out_stride0 =
-      params.N * params.M0 * params.N0 + iree_uk_random_engine_get_0_1(engine);
-  iree_uk_mmt4d_type_t mmt4d_type = iree_uk_mmt4d_type(params.flags);
-  iree_uk_type_t lhs_type = iree_uk_mmt4d_lhs_type(mmt4d_type);
-  iree_uk_type_t rhs_type = iree_uk_mmt4d_rhs_type(mmt4d_type);
-  iree_uk_type_t out_type = iree_uk_mmt4d_out_type(mmt4d_type);
-  iree_uk_index_t lhs_buffer_size =
-      iree_uk_2d_buffer_length(lhs_type, params.M, params.lhs_stride0);
-  iree_uk_index_t rhs_buffer_size =
-      iree_uk_2d_buffer_length(rhs_type, params.N, params.rhs_stride0);
-  void* lhs_buffer = malloc(lhs_buffer_size);
-  void* rhs_buffer = malloc(rhs_buffer_size);
-  iree_uk_write_random_buffer(lhs_buffer, lhs_buffer_size, lhs_type, engine);
-  iree_uk_write_random_buffer(rhs_buffer, rhs_buffer_size, rhs_type, engine);
-  params.lhs_offset = iree_uk_random_engine_get_0_65535(engine);
-  params.rhs_offset = iree_uk_random_engine_get_0_65535(engine);
-  params.out_offset = iree_uk_random_engine_get_0_65535(engine);
-  params.lhs_buffer = (const char*)lhs_buffer -
-                      (params.lhs_offset * iree_uk_type_size(lhs_type));
-  params.rhs_buffer = (const char*)rhs_buffer -
-                      (params.rhs_offset * iree_uk_type_size(rhs_type));
-
-  iree_uk_mmt4d_params_t reference_params;
-  memcpy(&reference_params, &params, sizeof params);
-  iree_uk_index_t out_buffer_size =
-      iree_uk_2d_buffer_length(out_type, params.M, params.out_stride0);
-  void* reference_out_buffer = malloc(out_buffer_size);
-  iree_uk_write_random_buffer(reference_out_buffer, out_buffer_size, out_type,
-                              engine);
-  reference_params.out_buffer =
-      (char*)reference_out_buffer -
-      (params.out_offset * iree_uk_type_size(out_type));
-
-  iree_uk_mmt4d_params_t actual_params;
-  memcpy(&actual_params, &params, sizeof params);
-  void* actual_out_buffer = malloc(out_buffer_size);
-  memcpy(actual_out_buffer, reference_out_buffer, out_buffer_size);
-  actual_params.out_buffer = (char*)actual_out_buffer -
-                             (params.out_offset * iree_uk_type_size(out_type));
-
-  iree_mmt4d_reference(&reference_params);
-  iree_uk_mmt4d(&actual_params);
-
-  // For now we use exact comparisons, even for float, even though the reference
-  // code accumulates in a different order compared to the actual code. This
-  // relies on picking input test matrix elements so that all intermediate
-  // values are exactly representable - i.e. small integer numerators. This
-  // become problematic when we do float16. See the comment at the top of this
-  // file explaining how we refrain from letting this grow into a 1000-line-long
-  // fully-featured test.
-  if (memcmp(actual_out_buffer, reference_out_buffer, out_buffer_size)) {
-    IREE_UK_TEST_FAIL(test);
-  }
-
-  free(reference_out_buffer);
-  free(actual_out_buffer);
-  free(lhs_buffer);
-  free(rhs_buffer);
-}
-
-static void iree_uk_test_mmt4d_for_tile_params(iree_uk_test_t* test,
-                                               const void* src_params) {
-  typedef struct shape_mnk_t {
-    int m, n, k;
-  } shape_mnk_t;
-  const shape_mnk_t shapes[] = {
-      // Degenerate case M==0. Vacuous.
-      {0, 1, 1},
-      {0, 5, 7},
-      // Degenerate case N==0. Vacuous.
-      {1, 0, 1},
-      {5, 0, 7},
-      // Degenerate case K==0. Vacuous if flags have ACCUMULATE. Zeroing the
-      // output buffer otherwise.
-      {1, 1, 0},
-      {5, 7, 0},
-      // Non-degenerate cases.
-      {1, 1, 1},
-      {1, 1, 2},
-      {1, 1, 10},
-      {1, 1, 1000},
-      {2, 1, 1},
-      {1, 2, 1},
-      {2, 2, 2},
-      {5, 7, 13},
-  };
-  for (int i = 0; i < IREE_ARRAYSIZE(shapes); ++i) {
-    iree_uk_mmt4d_params_t params;
-    memcpy(&params, src_params, sizeof params);
-    params.cpu_data = iree_uk_test_cpu_data(test);
-    shape_mnk_t shape = shapes[i];
-    params.M = shape.m;
-    params.N = shape.n;
-    params.K = shape.k;
-    for (int accumulate = 0; accumulate <= 1; ++accumulate) {
-      if (accumulate) params.flags |= IREE_UK_FLAG_MMT4D_ACCUMULATE;
-      iree_uk_test_mmt4d_for_shape_params(test, &params);
-    }
-  }
-}
-
-static void iree_uk_test_mmt4d_impl(iree_uk_uint32_t flags, int M0, int N0,
-                                    int K0, const char* cpu_features,
-                                    const char* code_path_suffix) {
-  char types_str[32];
-  iree_uk_mmt4d_type_t mmt4d_type = iree_uk_mmt4d_type(flags);
-  iree_uk_type_triple_str(types_str, sizeof types_str, mmt4d_type);
-  iree_uk_mmt4d_params_t params = {
-      .flags = flags, .M0 = M0, .N0 = N0, .K0 = K0};
-  char test_label_str[256];
-  snprintf(test_label_str, sizeof test_label_str, "types:%s tile:%dx%dx%d%s",
-           types_str, M0, N0, K0, code_path_suffix);
-  iree_uk_test(test_label_str, iree_uk_test_mmt4d_for_tile_params, &params,
-               cpu_features);
-}
-
-static void iree_uk_test_mmt4d(iree_uk_uint32_t flags, int M0, int N0, int K0,
-                               const char* cpu_features) {
-  iree_uk_test_mmt4d_impl(flags, M0, N0, K0, cpu_features, "");
-}
-
-static void iree_uk_test_mmt4d_default_and_intrinsics(
-    iree_uk_uint32_t flags, int M0, int N0, int K0, const char* cpu_features) {
-  iree_uk_test_mmt4d_impl(flags, M0, N0, K0, cpu_features, "");
-#if defined(IREE_UK_HAVE_BOTH_INLINE_ASM_AND_INTRINSICS)
-  iree_uk_test_mmt4d_impl(flags | IREE_UK_FLAG_MMT4D_PREFER_INTRINSICS, M0, N0,
-                          K0, cpu_features, " intrinsics");
-#endif  // defined(IREE_UK_HAVE_BOTH_INLINE_ASM_AND_INTRINSICS)
-}
-
-int main(int argc, char** argv) {
-  // Generic tests, not matching any particular CPU feature. This is the place
-  // to test weird M0, N0, K0 to ensure e.g. that we haven't unwittingly baked
-  // in a power-of-two assumption
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F32F32F32, 3, 5, 7, "");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_S8S8S32, 9, 6, 3, "");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F16F16F32, 4, 6, 5, "");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F16F16F16, 3, 5, 8, "");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_BF16BF16F32, 11, 4, 1, "");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_BF16BF16BF16, 2, 9, 3, "");
-
-#if defined(IREE_ARCH_ARM_64)
-  // On arm64, some code paths have inline asm and intrinsics variants. For them
-  // we use iree_uk_test_mmt4d_default_and_intrinsics to test both.
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F32F32F32, 8, 8, 1, "");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F16F16F32, 8, 8, 1, "");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F16F16F16, 8, 8, 1, "");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_S8S8S32, 8, 8, 1, "");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_S8S8S32, 8, 8, 4, "dotprod");
-  iree_uk_test_mmt4d_default_and_intrinsics(IREE_UK_FLAG_MMT4D_TYPE_S8S8S32, 8,
-                                            8, 8, "i8mm");
-#elif defined(IREE_ARCH_X86_64)
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F32F32F32, 8, 4, 1, "");  // SSE
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F32F32F32, 8, 8, 1, "avx2_fma");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F32F32F32, 16, 16, 1,
-                     "avx512_base");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F16F16F32, 8, 8, 1, "avx2_fma");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F16F16F32, 16, 16, 1,
-                     "avx512_base");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F16F16F16, 8, 8, 1, "avx2_fma");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_F16F16F16, 16, 16, 1,
-                     "avx512_base");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_S8S8S32, 8, 4, 2, "");  // SSE2
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_S8S8S32, 8, 8, 2, "avx2_fma");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_S8S8S32, 16, 16, 2, "avx512_base");
-  iree_uk_test_mmt4d(IREE_UK_FLAG_MMT4D_TYPE_S8S8S32, 16, 16, 2, "avx512_vnni");
-#endif  // defined(IREE_ARCH_ARM_64)
-
-  return iree_uk_test_exit_status();
-}