[Codegen][CPU] Seed C-bitcode ukernel framework with bf16 and i8 seeds. (#24567)
Adds compiler/plugins/target/LLVMCPU/builtins/ukernel/, the new home for
LLVMCPU-only C microkernels. Modelled after the ROCM C ukernel framework
(compiler/plugins/target/ROCM/builtins/ukernel/): compile each ukernel
to LLVM bitcode, embed those .bc files into iree-compile, and emit them
as self-contained `hal.executable_object`s on dispatch variants —
distinct from the legacy mmt4d ukernels under runtime/, which serve a
now-defunct VMVX path and are built twice.
This first commit lands the scaffolding plus two inner-K-loop seeds with
stub bodies (filled in by a later commit):
* `iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c` (bf16 x bf16 -> f32)
and `iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c` (i8 x i8 ->
i32 via VNNI), plus `common.h` (no-stdlib stdint replacements);
* `iree_bitcode_library` + `iree_c_embed_data` rules, driven by a
`(stem, features, copts)` table so adding a seed is one row (CMake
regenerated; `iree_bitcode_library` in `bazel_to_cmake_converter.py`
extended to forward `out`, so the embedded filename is
`<name>.<arch>_<features>.bc`);
* a `README.md` design doc covering scope, build, IR representation,
end-to-end flow, and authoring/test guidance;
* `#iree_cpu.ukernel_provider`, the CPU analogue of
`#rocm.ukernel_provider`, declared in `IREECPUAttrs.td`/`.cpp` with a
`createAndReplaceWithUkernelOp` that currently delegates to the default
`LowerBitcodeUKernelsPass` fallback. Specialized `inner_tiled` handling
(threading `intrinsics_{m,n,k}` and the outer K count) will land in a
follow-up alongside `SelectUKernels` and pipeline integration.
Tests:
* roundtrip lit test for the new `#iree_cpu.ukernel_provider` attribute;
* lit tests running `--iree-codegen-lower-bitcode-ukernels` on ops
carrying both `iree_codegen.ukernel = "..."` and a user-supplied
`hal.executable.objects` bitcode (one per seed), asserting they are
rewritten to `iree_codegen.ukernel.generic` with the bitcode preserved.
Progress towards https://github.com/iree-org/iree/issues/24574.
---------
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>diff --git a/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py b/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py
index ab6df6c..fdd76a7 100644
--- a/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py
+++ b/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py
@@ -1150,9 +1150,12 @@
f" PUBLIC\n)\n\n"
)
- def iree_bitcode_library(self, name, arch, srcs, internal_hdrs=None, copts=None):
+ def iree_bitcode_library(
+ self, name, arch, srcs, internal_hdrs=None, copts=None, out=None
+ ):
name_block = self._convert_string_arg_block("NAME", name, quote=False)
arch_block = self._convert_string_arg_block("ARCH", arch, quote=False)
+ out_block = self._convert_string_arg_block("OUT", out, quote=True)
hdrs_block = self._convert_srcs_block(internal_hdrs, block_name="INTERNAL_HDRS")
srcs_block = self._convert_srcs_block(srcs)
copts_block = self._convert_string_list_block("COPTS", copts, sort=False)
@@ -1161,6 +1164,7 @@
f"iree_bitcode_library(\n"
f"{name_block}"
f"{arch_block}"
+ f"{out_block}"
f"{hdrs_block}"
f"{srcs_block}"
f"{copts_block}"
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/BUILD.bazel b/compiler/plugins/target/LLVMCPU/builtins/ukernel/BUILD.bazel
new file mode 100644
index 0000000..d99ef18
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/BUILD.bazel
@@ -0,0 +1,87 @@
+# Copyright 2026 The IREE Authors
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+load("//build_tools/bazel:build_defs.oss.bzl", "iree_cmake_extra_content")
+load("//build_tools/bazel:iree_bitcode_library.bzl", "iree_bitcode_library")
+load("//build_tools/embed_data:build_defs.bzl", "iree_c_embed_data")
+
+package(
+ default_visibility = ["//visibility:public"],
+ features = ["layering_check"],
+ licenses = ["notice"], # Apache 2.0
+)
+
+iree_cmake_extra_content(
+ content = """
+if(NOT IREE_TARGET_BACKEND_LLVM_CPU)
+ return()
+endif()
+""",
+ inline = True,
+)
+
+#===------------------------------------------------------------------------===#
+# UKernel bitcode files
+#===------------------------------------------------------------------------===#
+#
+# Each entry below compiles one .c file to one .bc file for a specific
+# (arch, features) tuple. Each .c file is named after the MMA intrinsic it
+# implements (lowercased, with `iree_uk_` prefix), matching the
+# verbatim-intrinsic convention used by the AMDGPU C ukernels. The bitcode
+# output name appends the build-side `<arch>_<features>` suffix so the
+# compiler-side lookup in `EmbeddedDataDirectory` can find the right bitcode
+# for a given target's feature set. See README.md for the design.
+
+X86_64_AVX512_BF16_COPTS = [
+ "-mavx512f",
+ "-mavx512bf16",
+]
+
+X86_64_AVX512_VNNI_COPTS = [
+ "-mavx512f",
+ "-mavx512vnni",
+]
+
+# (stem, features, copts) for each x86_64 ukernel. `stem` is both the .c file
+# base name and the ukernel function name; `features` is the ISA-extension
+# suffix appended (as `_x86_64_<features>` to the library name and
+# `.x86_64_<features>.bc` to the output) for the EmbeddedDataDirectory lookup.
+X86_64_UKERNELS = [
+ (
+ "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16",
+ "avx512bf16",
+ X86_64_AVX512_BF16_COPTS,
+ ),
+ (
+ "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16",
+ "avx512vnni",
+ X86_64_AVX512_VNNI_COPTS,
+ ),
+]
+
+[iree_bitcode_library(
+ name = "%s_x86_64_%s" % (stem, features),
+ srcs = ["%s.c" % stem],
+ out = "%s.x86_64_%s.bc" % (stem, features),
+ arch = "x86_64",
+ copts = copts,
+ internal_hdrs = ["common.h"],
+) for (stem, features, copts) in X86_64_UKERNELS]
+
+#===------------------------------------------------------------------------===#
+# Embedded bitcode TOC
+#===------------------------------------------------------------------------===#
+
+iree_c_embed_data(
+ name = "iree_uk_cpu_bitcode",
+ srcs = [
+ ":%s.x86_64_%s.bc" % (stem, features)
+ for (stem, features, copts) in X86_64_UKERNELS
+ ],
+ c_file_output = "iree_uk_cpu_bitcode.c",
+ flatten = True,
+ h_file_output = "iree_uk_cpu_bitcode.h",
+)
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/CMakeLists.txt b/compiler/plugins/target/LLVMCPU/builtins/ukernel/CMakeLists.txt
new file mode 100644
index 0000000..c7b3069
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/CMakeLists.txt
@@ -0,0 +1,63 @@
+################################################################################
+# Autogenerated by build_tools/bazel_to_cmake/bazel_to_cmake.py from #
+# compiler/plugins/target/LLVMCPU/builtins/ukernel/BUILD.bazel #
+# #
+# Use iree_cmake_extra_content from iree/build_defs.oss.bzl to add arbitrary #
+# CMake-only content. #
+# #
+# To disable autogeneration for this file entirely, delete this header. #
+################################################################################
+
+iree_add_all_subdirs()
+
+if(NOT IREE_TARGET_BACKEND_LLVM_CPU)
+ return()
+endif()
+
+iree_bitcode_library(
+ NAME
+ iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16_x86_64_avx512bf16
+ ARCH
+ x86_64
+ OUT
+ "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc"
+ INTERNAL_HDRS
+ "common.h"
+ SRCS
+ "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c"
+ COPTS
+ "-mavx512f"
+ "-mavx512bf16"
+)
+
+iree_bitcode_library(
+ NAME
+ iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16_x86_64_avx512vnni
+ ARCH
+ x86_64
+ OUT
+ "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.x86_64_avx512vnni.bc"
+ INTERNAL_HDRS
+ "common.h"
+ SRCS
+ "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c"
+ COPTS
+ "-mavx512f"
+ "-mavx512vnni"
+)
+
+iree_c_embed_data(
+ NAME
+ iree_uk_cpu_bitcode
+ SRCS
+ "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc"
+ "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.x86_64_avx512vnni.bc"
+ C_FILE_OUTPUT
+ "iree_uk_cpu_bitcode.c"
+ H_FILE_OUTPUT
+ "iree_uk_cpu_bitcode.h"
+ FLATTEN
+ PUBLIC
+)
+
+### BAZEL_TO_CMAKE_PRESERVES_ALL_CONTENT_BELOW_THIS_LINE ###
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/common.h b/compiler/plugins/target/LLVMCPU/builtins/ukernel/common.h
new file mode 100644
index 0000000..d340276
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/common.h
@@ -0,0 +1,45 @@
+// Copyright 2026 The IREE Authors
+//
+// Licensed under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+// Local definitions, substitute for standard headers.
+//
+// Our microkernels are compiled with `-nostdinc -ffreestanding`, so we do not
+// include standard C library headers (not even `<stdint.h>`). This is to
+// guarantee build portability: the compilation host's standard headers may
+// not be configured for the CPU bitcode target, and pulling them in invariably
+// runs into platform-specific surprises. Clang's *intrinsic* headers
+// (`<immintrin.h>` etc.) are header-only and target-aware, so they are fine to
+// include directly.
+
+#ifndef COMPILER_PLUGINS_TARGET_LLVMCPU_BUILTINS_UKERNEL_COMMON_H_
+#define COMPILER_PLUGINS_TARGET_LLVMCPU_BUILTINS_UKERNEL_COMMON_H_
+
+//===----------------------------------------------------------------------===//
+// Local replacements for stdint.h, using Clang's predefined type macros.
+//===----------------------------------------------------------------------===//
+
+typedef __INT8_TYPE__ int8_t;
+typedef __INT16_TYPE__ int16_t;
+typedef __INT32_TYPE__ int32_t;
+typedef __INT64_TYPE__ int64_t;
+typedef __UINT8_TYPE__ uint8_t;
+typedef __UINT16_TYPE__ uint16_t;
+typedef __UINT32_TYPE__ uint32_t;
+typedef __UINT64_TYPE__ uint64_t;
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+//===----------------------------------------------------------------------===//
+// Ukernel function attributes.
+//===----------------------------------------------------------------------===//
+
+// Ukernels are designed to be inlined into their caller so the per-call
+// constant arguments (e.g. `intrinsics_m`) drive specialization of unrolled
+// inner loops via constant propagation + DCE. Always inline.
+#define IREE_UK_ALWAYS_INLINE __attribute__((always_inline))
+
+#endif // COMPILER_PLUGINS_TARGET_LLVMCPU_BUILTINS_UKERNEL_COMMON_H_
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c b/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c
new file mode 100644
index 0000000..978c478
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c
@@ -0,0 +1,52 @@
+// Copyright 2026 The IREE Authors
+//
+// Licensed under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include <immintrin.h>
+#include "common.h"
+
+// Microkernel for `iree_codegen.inner_tiled` with
+// `#iree_cpu.data_tiled_mma_layout<intrinsic =
+// MMA_X86_AVX512BF16_1x16x2_F32_BF16>`. Function name matches the
+// intrinsic name verbatim (lowercased, with the `iree_uk_` prefix), in line
+// with the AMDGPU C ukernel convention.
+//
+// The "inner K loop" the ukernel owns is the loop over the K *tiles* that
+// sits *inside* the outer M/N loops; those outer M/N loops are tiled away by
+// ordinary IREE tiling before this ukernel runs. The ukernel handles
+// arbitrary positive `intrinsics_{m,n,k}` (passed as arguments and looped
+// over); the loops fully unroll after the ukernel is inlined into its
+// constant-`intrinsics_*` caller -- the bitcode-LTO equivalent of a C++
+// template.
+//
+// ABI: each shaped operand is passed as (base pointer, element offset) so the
+// caller doesn't need a GEP before the call; the accumulator additionally
+// gets the element stride of its innermost cross-intrinsic (N) dimension.
+//
+// NOTE (seed scaffolding): this initial seed has a stub body. It exists so
+// that the surrounding *framework* -- bitcode build, embedding,
+// `hal.executable_object` injection, IR rewrite to `ukernel.generic` -- can
+// be landed and lit-tested. A follow-up commit replaces the body with the
+// `_mm512_dpbf16_ps`-based inner loop and adds an e2e matmul test for it.
+IREE_UK_ALWAYS_INLINE
+void iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16(
+ const uint16_t *lhs_base, int64_t lhs_offset, const uint16_t *rhs_base,
+ int64_t rhs_offset, float *acc_base, int64_t acc_offset, int64_t acc_stride,
+ int32_t k_outer, int32_t intrinsics_m, int32_t intrinsics_n,
+ int32_t intrinsics_k) {
+ (void)lhs_base;
+ (void)lhs_offset;
+ (void)rhs_base;
+ (void)rhs_offset;
+ (void)acc_base;
+ (void)acc_offset;
+ (void)acc_stride;
+ (void)k_outer;
+ (void)intrinsics_m;
+ (void)intrinsics_n;
+ (void)intrinsics_k;
+ // TODO(ukernels): real inner K loop using `_mm512_dpbf16_ps`, looping over
+ // intrinsics_{m,n,k}.
+}
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c b/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c
new file mode 100644
index 0000000..478ea1a
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c
@@ -0,0 +1,61 @@
+// Copyright 2026 The IREE Authors
+//
+// Licensed under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include <immintrin.h>
+#include "common.h"
+
+// Microkernel for `iree_codegen.inner_tiled` with
+// `#iree_cpu.data_tiled_mma_layout<intrinsic =
+// MMA_X86_AVX512VNNI_16x16x2_I32_I8_CASTI16>`. Function name matches the
+// intrinsic name verbatim (lowercased, with the `iree_uk_` prefix), in line
+// with the AMDGPU C ukernel convention.
+//
+// Implements the inner K-loop for the unrolled (intrinsics_m, intrinsics_n,
+// intrinsics_k) tile built from the 16x16x2 i8 VNNI intrinsic via AVX-512
+// VNNI `vpdpwssd`. The "CASTI16" in the MMA intrinsic name reflects that
+// the s8 inputs are zero/sign-extended into i16 lanes before being fed to
+// the 16-bit VNNI instruction; that cast is handled in the inner loop.
+//
+// `intrinsics_{m,n,k}` are passed as function arguments and so look like
+// runtime values inside this translation unit, but the ukernel is always
+// inlined into its caller (a bug otherwise) and the caller always passes
+// the matching `DataTiledMMAAttr` constants. Together with post-inline IR
+// optimization on the linked bitcode, the body specializes to specific
+// compile-time `intrinsics_{m,n,k}` values at each call site.
+//
+// NOTE (seed scaffolding): this initial seed has a stub body. It exists so
+// that the surrounding *framework* -- bitcode build, embedding,
+// `hal.executable_object` injection, IR rewrite to `ukernel.generic` -- can
+// be exercised end-to-end. A follow-up commit replaces the body with the
+// `_mm512_dpwssd_epi32`-based inner loop and adds an e2e matmul test for
+// it. This is the "practically useful" seed: i8x i8->i32 via VNNI is a
+// workhorse for quantized inference, and codegen has a residual perf gap
+// on this case.
+// ABI matches the inner_tiled -> ukernel.generic lowering (see
+// `iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16`): each shaped operand passed
+// as (base, element offset), ACC additionally as its innermost
+// cross-intrinsic stride, then the scalar `k_outer` / `intrinsics_{m,n,k}`.
+IREE_UK_ALWAYS_INLINE
+void iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16(
+ const void *lhs_base, int64_t lhs_offset, const void *rhs_base,
+ int64_t rhs_offset, void *acc_base, int64_t acc_offset, int64_t acc_stride,
+ int32_t k_outer, int32_t intrinsics_m, int32_t intrinsics_n,
+ int32_t intrinsics_k) {
+ (void)lhs_base;
+ (void)lhs_offset;
+ (void)rhs_base;
+ (void)rhs_offset;
+ (void)acc_base;
+ (void)acc_offset;
+ (void)acc_stride;
+ (void)k_outer;
+ (void)intrinsics_m;
+ (void)intrinsics_n;
+ (void)intrinsics_k;
+ // TODO(ukernels): real inner K loop using `_mm512_dpwssd_epi32` after
+ // widening the s8 LHS/RHS halves to i16 lanes (loop over
+ // intrinsics_{m,n,k} like the bf16 ukernel).
+}
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/BUILD.bazel b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/BUILD.bazel
new file mode 100644
index 0000000..5ff5b44
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/BUILD.bazel
@@ -0,0 +1,30 @@
+# Copyright 2026 The IREE Authors
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+load("//build_tools/bazel:enforce_glob.bzl", "enforce_glob")
+load("//build_tools/bazel:iree_lit_test.bzl", "iree_lit_test_suite")
+
+package(
+ features = ["layering_check"],
+ licenses = ["notice"], # Apache 2.0
+)
+
+iree_lit_test_suite(
+ name = "lit",
+ srcs = enforce_glob(
+ # keep sorted
+ [
+ "lower_inner_tiled_to_bitcode_ukernel.mlir",
+ "lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir",
+ ],
+ include = ["*.mlir"],
+ ),
+ cfg = "//compiler:lit.cfg.py",
+ tools = [
+ "//tools:iree-opt",
+ "@llvm-project//llvm:FileCheck",
+ ],
+)
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/CMakeLists.txt b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/CMakeLists.txt
new file mode 100644
index 0000000..1671a1e
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/CMakeLists.txt
@@ -0,0 +1,24 @@
+################################################################################
+# Autogenerated by build_tools/bazel_to_cmake/bazel_to_cmake.py from #
+# compiler/plugins/target/LLVMCPU/builtins/ukernel/test/BUILD.bazel #
+# #
+# Use iree_cmake_extra_content from iree/build_defs.oss.bzl to add arbitrary #
+# CMake-only content. #
+# #
+# To disable autogeneration for this file entirely, delete this header. #
+################################################################################
+
+iree_add_all_subdirs()
+
+iree_lit_test_suite(
+ NAME
+ lit
+ SRCS
+ "lower_inner_tiled_to_bitcode_ukernel.mlir"
+ "lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir"
+ TOOLS
+ FileCheck
+ iree-opt
+)
+
+### BAZEL_TO_CMAKE_PRESERVES_ALL_CONTENT_BELOW_THIS_LINE ###
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel.mlir b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel.mlir
new file mode 100644
index 0000000..6021168
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel.mlir
@@ -0,0 +1,64 @@
+// RUN: iree-opt --iree-codegen-lower-bitcode-ukernels --split-input-file %s | FileCheck %s
+
+// Tests that an op carrying a `iree_codegen.ukernel = <"name", bitcode>`
+// descriptor is rewritten by `LowerBitcodeUKernels` to an
+// `iree_codegen.ukernel.generic` call, and that an attached
+// `hal.executable.objects` (carrying the ukernel bitcode bytes) is preserved
+// on the new op. The `#iree_cpu.ukernel_provider` on the executable target is
+// looked up but currently delegates the rewrite to the default fallback in the
+// pass; specialized handling of `iree_codegen.inner_tiled` (threading
+// `intrinsics_{m,n,k}` and the outer K count as scalar operands) will be
+// added in a follow-up commit alongside the SelectUKernels pass that sets
+// the descriptor and attaches the bitcode.
+//
+// Bitcode bytes here are an opaque placeholder; the pass treats
+// `hal.executable.objects` as a discardable attribute and does not parse it.
+
+#executable_target = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {
+ iree_codegen.ukernel_provider = #iree_cpu.ukernel_provider
+}>
+#map = affine_map<(d0, d1, d2) -> (d0, d2)>
+#map1 = affine_map<(d0, d1, d2) -> (d1, d2)>
+#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
+
+// CHECK-LABEL: @bf16_matmul_with_ukernel_descriptor
+// CHECK-SAME: %[[LHS:[a-zA-Z0-9]+]]: tensor<16x2xbf16>
+// CHECK-SAME: %[[RHS:[a-zA-Z0-9]+]]: tensor<16x2xbf16>
+// CHECK-NOT: linalg.generic
+// CHECK: %[[OUT:.+]] = linalg.fill
+// CHECK: %[[UKERNEL:.+]] = iree_codegen.ukernel.generic
+// CHECK-SAME: {hal.executable.objects = [{{.*}}"iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc"
+// CHECK-SAME: iree_codegen.ukernel = #iree_codegen.ukernel_descriptor<"iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16", bitcode>}
+// CHECK-SAME: "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16"
+// CHECK-SAME: ins(%[[LHS]], %[[RHS]] : tensor<16x2xbf16>, tensor<16x2xbf16>)
+// CHECK-SAME: outs(%[[OUT]] : tensor<16x16xf32>)
+// CHECK: return %[[UKERNEL]]
+module attributes {hal.executable.target = #executable_target} {
+ func.func @bf16_matmul_with_ukernel_descriptor(
+ %arg0: tensor<16x2xbf16>, %arg1: tensor<16x2xbf16>) -> tensor<16x16xf32> {
+ %cst = arith.constant 0.000000e+00 : f32
+ %0 = tensor.empty() : tensor<16x16xf32>
+ %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<16x16xf32>) -> tensor<16x16xf32>
+ %2 = linalg.generic {
+ indexing_maps = [#map, #map1, #map2],
+ iterator_types = ["parallel", "parallel", "reduction"]
+ } ins(%arg0, %arg1 : tensor<16x2xbf16>, tensor<16x2xbf16>)
+ outs(%1 : tensor<16x16xf32>) attrs = {
+ iree_codegen.ukernel = #iree_codegen.ukernel_descriptor<
+ "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16", bitcode>,
+ hal.executable.objects = [
+ #hal.executable.object<{
+ path = "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc",
+ data = dense<[0, 1, 2, 3]> : vector<4xi8>}>
+ ]
+ } {
+ ^bb0(%in: bf16, %in_0: bf16, %out: f32):
+ %3 = arith.extf %in : bf16 to f32
+ %4 = arith.extf %in_0 : bf16 to f32
+ %5 = arith.mulf %3, %4 : f32
+ %6 = arith.addf %out, %5 : f32
+ linalg.yield %6 : f32
+ } -> tensor<16x16xf32>
+ return %2 : tensor<16x16xf32>
+ }
+}
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir
new file mode 100644
index 0000000..fdc348d
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir
@@ -0,0 +1,57 @@
+// RUN: iree-opt --iree-codegen-lower-bitcode-ukernels --split-input-file %s | FileCheck %s
+
+// Companion to lower_inner_tiled_to_bitcode_ukernel.mlir, exercising the
+// i8 x i8 -> i32 VNNI seed. Same shape of test (user-supplied
+// `hal.executable.objects` carrying placeholder bitcode bytes), distinct
+// from the bf16 test only in the ukernel name and element types — this
+// guards against a regression where the framework would accidentally
+// hard-code the bf16 seed.
+
+#executable_target = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {
+ iree_codegen.ukernel_provider = #iree_cpu.ukernel_provider
+}>
+#map = affine_map<(d0, d1, d2) -> (d0, d2)>
+#map1 = affine_map<(d0, d1, d2) -> (d1, d2)>
+#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
+
+// CHECK-LABEL: @i8_matmul_with_ukernel_descriptor
+// CHECK-SAME: %[[LHS:[a-zA-Z0-9]+]]: tensor<16x2xi8>
+// CHECK-SAME: %[[RHS:[a-zA-Z0-9]+]]: tensor<16x2xi8>
+// CHECK-NOT: linalg.generic
+// CHECK: %[[OUT:.+]] = linalg.fill
+// CHECK: %[[UKERNEL:.+]] = iree_codegen.ukernel.generic
+// CHECK-SAME: {hal.executable.objects = [{{.*}}"iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.x86_64_avx512vnni.bc"
+// CHECK-SAME: iree_codegen.ukernel = #iree_codegen.ukernel_descriptor<"iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16", bitcode>}
+// CHECK-SAME: "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16"
+// CHECK-SAME: ins(%[[LHS]], %[[RHS]] : tensor<16x2xi8>, tensor<16x2xi8>)
+// CHECK-SAME: outs(%[[OUT]] : tensor<16x16xi32>)
+// CHECK: return %[[UKERNEL]]
+module attributes {hal.executable.target = #executable_target} {
+ func.func @i8_matmul_with_ukernel_descriptor(
+ %arg0: tensor<16x2xi8>, %arg1: tensor<16x2xi8>) -> tensor<16x16xi32> {
+ %c0 = arith.constant 0 : i32
+ %0 = tensor.empty() : tensor<16x16xi32>
+ %1 = linalg.fill ins(%c0 : i32) outs(%0 : tensor<16x16xi32>) -> tensor<16x16xi32>
+ %2 = linalg.generic {
+ indexing_maps = [#map, #map1, #map2],
+ iterator_types = ["parallel", "parallel", "reduction"]
+ } ins(%arg0, %arg1 : tensor<16x2xi8>, tensor<16x2xi8>)
+ outs(%1 : tensor<16x16xi32>) attrs = {
+ iree_codegen.ukernel = #iree_codegen.ukernel_descriptor<
+ "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16", bitcode>,
+ hal.executable.objects = [
+ #hal.executable.object<{
+ path = "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.x86_64_avx512vnni.bc",
+ data = dense<[0, 1, 2, 3]> : vector<4xi8>}>
+ ]
+ } {
+ ^bb0(%in: i8, %in_0: i8, %out: i32):
+ %3 = arith.extsi %in : i8 to i32
+ %4 = arith.extsi %in_0 : i8 to i32
+ %5 = arith.muli %3, %4 : i32
+ %6 = arith.addi %out, %5 : i32
+ linalg.yield %6 : i32
+ } -> tensor<16x16xi32>
+ return %2 : tensor<16x16xi32>
+ }
+}
diff --git a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp
index e2aca45..60ed072 100644
--- a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp
+++ b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp
@@ -6,6 +6,8 @@
#include "iree/compiler/Codegen/Dialect/CPU/IR/IREECPUTypes.h"
#include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.h"
+#include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenOps.h"
+#include "iree/compiler/Codegen/Dialect/Codegen/IR/UKernelOps.h"
#include "iree/compiler/Codegen/Dialect/Codegen/Utils/MMAUtils.h"
#include "iree/compiler/Dialect/Encoding/IR/EncodingTypes.h"
#include "llvm/ADT/STLExtras.h"
@@ -1094,6 +1096,22 @@
bool InnerTiledSemanticsAttr::getOpaque() const { return false; }
//===----------------------------------------------------------------------===//
+// UKernelProviderAttr
+//===----------------------------------------------------------------------===//
+
+std::optional<LogicalResult> UKernelProviderAttr::createAndReplaceWithUkernelOp(
+ RewriterBase &rewriter, StringRef name, DictionaryAttr targetConfiguration,
+ Operation *contextualOp, ArrayRef<Value> inputs, ArrayRef<Value> outputs,
+ SmallVectorImpl<Value> &otherOperands) const {
+ // Fall through to the default UKernelGenericOp construction in
+ // LowerBitcodeUKernelsPass. Specialized handling of `inner_tiled` ops
+ // (threading `intrinsics_{m,n,k}` and the outer K count as scalar operands)
+ // will be added in a follow-up commit alongside the SelectUKernels pass that
+ // sets the `iree_codegen.ukernel` descriptor and attaches the bitcode.
+ return std::nullopt;
+}
+
+//===----------------------------------------------------------------------===//
// Attribute Registration
//===----------------------------------------------------------------------===//
diff --git a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.td b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.td
index 413eb35..9629fd8 100644
--- a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.td
+++ b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.td
@@ -131,6 +131,37 @@
}
//===----------------------------------------------------------------------===//
+// UKernelProvider attribute
+//===----------------------------------------------------------------------===//
+
+def IREECPU_UKernelProviderAttr :
+ AttrDef<IREECPU_Dialect, "UKernelProvider", [
+ DeclareAttrInterfaceMethods<IREECodegen_UKernelProviderInterface, [
+ "createAndReplaceWithUkernelOp"
+ ]>
+ ]> {
+ let mnemonic = "ukernel_provider";
+ let summary = [{
+ Provides built-in C-bitcode ukernel implementations for the LLVMCPU
+ target backend.
+ }];
+ let description = [{
+ Implements the `UKernelProviderInterface` for the LLVMCPU target.
+ `createAndReplaceWithUkernelOp` rewrites an `iree_codegen.inner_tiled` op
+ carrying an `iree_cpu.data_tiled_mma_layout` and an
+ `iree_codegen.ukernel = "<name>"` descriptor into an
+ `iree_codegen.ukernel.generic` call, and attaches the matching ukernel
+ bitcode as a `hal.executable_object` on the dispatch's executable variant.
+
+ Bitcode is resolved by name: first against any `hal.executable.objects`
+ already attached above the op (so user-supplied bitcode wins), then
+ against ukernels embedded into `iree-compile` at LLVMCPU plugin init.
+ }];
+ let parameters = (ins);
+ let assemblyFormat = [{}];
+}
+
+//===----------------------------------------------------------------------===//
// CPU Pipeline Attribute
//===----------------------------------------------------------------------===//
diff --git a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/roundtrip.mlir b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/roundtrip.mlir
index 27fb3b4..29d2d44 100644
--- a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/roundtrip.mlir
+++ b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/roundtrip.mlir
@@ -69,3 +69,16 @@
// CHECK-SAME: test = 123 : i32
// CHECK-LABEL: @test_arbitrary_keys()
// CHECK-SAME: lowering_config = #[[$CONFIG]]
+
+// -----
+
+// Round-trip the LLVMCPU ukernel provider attribute. This is the CPU analogue
+// of #rocm.ukernel_provider, set on a `hal.executable.target` config to enable
+// the built-in C-bitcode ukernels under
+// compiler/plugins/target/LLVMCPU/builtins/ukernel/.
+func.func @test_ukernel_provider() attributes {
+ iree_codegen.ukernel_provider = #iree_cpu.ukernel_provider} {
+ return
+}
+// CHECK-LABEL: @test_ukernel_provider()
+// CHECK-SAME: iree_codegen.ukernel_provider = #iree_cpu.ukernel_provider