[Codegen][CPU] Seed C-bitcode ukernel framework with bf16 and i8 seeds. (#24567)

Adds compiler/plugins/target/LLVMCPU/builtins/ukernel/, the new home for
LLVMCPU-only C microkernels. Modelled after the ROCM C ukernel framework
(compiler/plugins/target/ROCM/builtins/ukernel/): compile each ukernel
to LLVM bitcode, embed those .bc files into iree-compile, and emit them
as self-contained `hal.executable_object`s on dispatch variants —
distinct from the legacy mmt4d ukernels under runtime/, which serve a
now-defunct VMVX path and are built twice.

This first commit lands the scaffolding plus two inner-K-loop seeds with
stub bodies (filled in by a later commit):
* `iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c` (bf16 x bf16 -> f32)
and `iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c` (i8 x i8 ->
i32 via VNNI), plus `common.h` (no-stdlib stdint replacements);
* `iree_bitcode_library` + `iree_c_embed_data` rules, driven by a
`(stem, features, copts)` table so adding a seed is one row (CMake
regenerated; `iree_bitcode_library` in `bazel_to_cmake_converter.py`
extended to forward `out`, so the embedded filename is
`<name>.<arch>_<features>.bc`);
* a `README.md` design doc covering scope, build, IR representation,
end-to-end flow, and authoring/test guidance;
* `#iree_cpu.ukernel_provider`, the CPU analogue of
`#rocm.ukernel_provider`, declared in `IREECPUAttrs.td`/`.cpp` with a
`createAndReplaceWithUkernelOp` that currently delegates to the default
`LowerBitcodeUKernelsPass` fallback. Specialized `inner_tiled` handling
(threading `intrinsics_{m,n,k}` and the outer K count) will land in a
follow-up alongside `SelectUKernels` and pipeline integration.

Tests:
* roundtrip lit test for the new `#iree_cpu.ukernel_provider` attribute;
* lit tests running `--iree-codegen-lower-bitcode-ukernels` on ops
carrying both `iree_codegen.ukernel = "..."` and a user-supplied
`hal.executable.objects` bitcode (one per seed), asserting they are
rewritten to `iree_codegen.ukernel.generic` with the bitcode preserved.

Progress towards https://github.com/iree-org/iree/issues/24574.

---------

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py b/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py
index ab6df6c..fdd76a7 100644
--- a/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py
+++ b/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py
@@ -1150,9 +1150,12 @@
             f"  PUBLIC\n)\n\n"
         )
 
-    def iree_bitcode_library(self, name, arch, srcs, internal_hdrs=None, copts=None):
+    def iree_bitcode_library(
+        self, name, arch, srcs, internal_hdrs=None, copts=None, out=None
+    ):
         name_block = self._convert_string_arg_block("NAME", name, quote=False)
         arch_block = self._convert_string_arg_block("ARCH", arch, quote=False)
+        out_block = self._convert_string_arg_block("OUT", out, quote=True)
         hdrs_block = self._convert_srcs_block(internal_hdrs, block_name="INTERNAL_HDRS")
         srcs_block = self._convert_srcs_block(srcs)
         copts_block = self._convert_string_list_block("COPTS", copts, sort=False)
@@ -1161,6 +1164,7 @@
             f"iree_bitcode_library(\n"
             f"{name_block}"
             f"{arch_block}"
+            f"{out_block}"
             f"{hdrs_block}"
             f"{srcs_block}"
             f"{copts_block}"
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/BUILD.bazel b/compiler/plugins/target/LLVMCPU/builtins/ukernel/BUILD.bazel
new file mode 100644
index 0000000..d99ef18
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/BUILD.bazel
@@ -0,0 +1,87 @@
+# Copyright 2026 The IREE Authors
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+load("//build_tools/bazel:build_defs.oss.bzl", "iree_cmake_extra_content")
+load("//build_tools/bazel:iree_bitcode_library.bzl", "iree_bitcode_library")
+load("//build_tools/embed_data:build_defs.bzl", "iree_c_embed_data")
+
+package(
+    default_visibility = ["//visibility:public"],
+    features = ["layering_check"],
+    licenses = ["notice"],  # Apache 2.0
+)
+
+iree_cmake_extra_content(
+    content = """
+if(NOT IREE_TARGET_BACKEND_LLVM_CPU)
+  return()
+endif()
+""",
+    inline = True,
+)
+
+#===------------------------------------------------------------------------===#
+# UKernel bitcode files
+#===------------------------------------------------------------------------===#
+#
+# Each entry below compiles one .c file to one .bc file for a specific
+# (arch, features) tuple. Each .c file is named after the MMA intrinsic it
+# implements (lowercased, with `iree_uk_` prefix), matching the
+# verbatim-intrinsic convention used by the AMDGPU C ukernels. The bitcode
+# output name appends the build-side `<arch>_<features>` suffix so the
+# compiler-side lookup in `EmbeddedDataDirectory` can find the right bitcode
+# for a given target's feature set. See README.md for the design.
+
+X86_64_AVX512_BF16_COPTS = [
+    "-mavx512f",
+    "-mavx512bf16",
+]
+
+X86_64_AVX512_VNNI_COPTS = [
+    "-mavx512f",
+    "-mavx512vnni",
+]
+
+# (stem, features, copts) for each x86_64 ukernel. `stem` is both the .c file
+# base name and the ukernel function name; `features` is the ISA-extension
+# suffix appended (as `_x86_64_<features>` to the library name and
+# `.x86_64_<features>.bc` to the output) for the EmbeddedDataDirectory lookup.
+X86_64_UKERNELS = [
+    (
+        "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16",
+        "avx512bf16",
+        X86_64_AVX512_BF16_COPTS,
+    ),
+    (
+        "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16",
+        "avx512vnni",
+        X86_64_AVX512_VNNI_COPTS,
+    ),
+]
+
+[iree_bitcode_library(
+    name = "%s_x86_64_%s" % (stem, features),
+    srcs = ["%s.c" % stem],
+    out = "%s.x86_64_%s.bc" % (stem, features),
+    arch = "x86_64",
+    copts = copts,
+    internal_hdrs = ["common.h"],
+) for (stem, features, copts) in X86_64_UKERNELS]
+
+#===------------------------------------------------------------------------===#
+# Embedded bitcode TOC
+#===------------------------------------------------------------------------===#
+
+iree_c_embed_data(
+    name = "iree_uk_cpu_bitcode",
+    srcs = [
+        ":%s.x86_64_%s.bc" % (stem, features)
+        for (stem, features, copts) in X86_64_UKERNELS
+    ],
+    c_file_output = "iree_uk_cpu_bitcode.c",
+    flatten = True,
+    h_file_output = "iree_uk_cpu_bitcode.h",
+)
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/CMakeLists.txt b/compiler/plugins/target/LLVMCPU/builtins/ukernel/CMakeLists.txt
new file mode 100644
index 0000000..c7b3069
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/CMakeLists.txt
@@ -0,0 +1,63 @@
+################################################################################
+# Autogenerated by build_tools/bazel_to_cmake/bazel_to_cmake.py from           #
+# compiler/plugins/target/LLVMCPU/builtins/ukernel/BUILD.bazel                 #
+#                                                                              #
+# Use iree_cmake_extra_content from iree/build_defs.oss.bzl to add arbitrary   #
+# CMake-only content.                                                          #
+#                                                                              #
+# To disable autogeneration for this file entirely, delete this header.        #
+################################################################################
+
+iree_add_all_subdirs()
+
+if(NOT IREE_TARGET_BACKEND_LLVM_CPU)
+  return()
+endif()
+
+iree_bitcode_library(
+  NAME
+    iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16_x86_64_avx512bf16
+  ARCH
+    x86_64
+  OUT
+    "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc"
+  INTERNAL_HDRS
+    "common.h"
+  SRCS
+    "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c"
+  COPTS
+    "-mavx512f"
+    "-mavx512bf16"
+)
+
+iree_bitcode_library(
+  NAME
+    iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16_x86_64_avx512vnni
+  ARCH
+    x86_64
+  OUT
+    "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.x86_64_avx512vnni.bc"
+  INTERNAL_HDRS
+    "common.h"
+  SRCS
+    "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c"
+  COPTS
+    "-mavx512f"
+    "-mavx512vnni"
+)
+
+iree_c_embed_data(
+  NAME
+    iree_uk_cpu_bitcode
+  SRCS
+    "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc"
+    "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.x86_64_avx512vnni.bc"
+  C_FILE_OUTPUT
+    "iree_uk_cpu_bitcode.c"
+  H_FILE_OUTPUT
+    "iree_uk_cpu_bitcode.h"
+  FLATTEN
+  PUBLIC
+)
+
+### BAZEL_TO_CMAKE_PRESERVES_ALL_CONTENT_BELOW_THIS_LINE ###
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/common.h b/compiler/plugins/target/LLVMCPU/builtins/ukernel/common.h
new file mode 100644
index 0000000..d340276
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/common.h
@@ -0,0 +1,45 @@
+// Copyright 2026 The IREE Authors
+//
+// Licensed under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+// Local definitions, substitute for standard headers.
+//
+// Our microkernels are compiled with `-nostdinc -ffreestanding`, so we do not
+// include standard C library headers (not even `<stdint.h>`). This is to
+// guarantee build portability: the compilation host's standard headers may
+// not be configured for the CPU bitcode target, and pulling them in invariably
+// runs into platform-specific surprises. Clang's *intrinsic* headers
+// (`<immintrin.h>` etc.) are header-only and target-aware, so they are fine to
+// include directly.
+
+#ifndef COMPILER_PLUGINS_TARGET_LLVMCPU_BUILTINS_UKERNEL_COMMON_H_
+#define COMPILER_PLUGINS_TARGET_LLVMCPU_BUILTINS_UKERNEL_COMMON_H_
+
+//===----------------------------------------------------------------------===//
+// Local replacements for stdint.h, using Clang's predefined type macros.
+//===----------------------------------------------------------------------===//
+
+typedef __INT8_TYPE__ int8_t;
+typedef __INT16_TYPE__ int16_t;
+typedef __INT32_TYPE__ int32_t;
+typedef __INT64_TYPE__ int64_t;
+typedef __UINT8_TYPE__ uint8_t;
+typedef __UINT16_TYPE__ uint16_t;
+typedef __UINT32_TYPE__ uint32_t;
+typedef __UINT64_TYPE__ uint64_t;
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+//===----------------------------------------------------------------------===//
+// Ukernel function attributes.
+//===----------------------------------------------------------------------===//
+
+// Ukernels are designed to be inlined into their caller so the per-call
+// constant arguments (e.g. `intrinsics_m`) drive specialization of unrolled
+// inner loops via constant propagation + DCE. Always inline.
+#define IREE_UK_ALWAYS_INLINE __attribute__((always_inline))
+
+#endif // COMPILER_PLUGINS_TARGET_LLVMCPU_BUILTINS_UKERNEL_COMMON_H_
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c b/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c
new file mode 100644
index 0000000..978c478
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c
@@ -0,0 +1,52 @@
+// Copyright 2026 The IREE Authors
+//
+// Licensed under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include <immintrin.h>
+#include "common.h"
+
+// Microkernel for `iree_codegen.inner_tiled` with
+// `#iree_cpu.data_tiled_mma_layout<intrinsic =
+//     MMA_X86_AVX512BF16_1x16x2_F32_BF16>`. Function name matches the
+// intrinsic name verbatim (lowercased, with the `iree_uk_` prefix), in line
+// with the AMDGPU C ukernel convention.
+//
+// The "inner K loop" the ukernel owns is the loop over the K *tiles* that
+// sits *inside* the outer M/N loops; those outer M/N loops are tiled away by
+// ordinary IREE tiling before this ukernel runs. The ukernel handles
+// arbitrary positive `intrinsics_{m,n,k}` (passed as arguments and looped
+// over); the loops fully unroll after the ukernel is inlined into its
+// constant-`intrinsics_*` caller -- the bitcode-LTO equivalent of a C++
+// template.
+//
+// ABI: each shaped operand is passed as (base pointer, element offset) so the
+// caller doesn't need a GEP before the call; the accumulator additionally
+// gets the element stride of its innermost cross-intrinsic (N) dimension.
+//
+// NOTE (seed scaffolding): this initial seed has a stub body. It exists so
+// that the surrounding *framework* -- bitcode build, embedding,
+// `hal.executable_object` injection, IR rewrite to `ukernel.generic` -- can
+// be landed and lit-tested. A follow-up commit replaces the body with the
+// `_mm512_dpbf16_ps`-based inner loop and adds an e2e matmul test for it.
+IREE_UK_ALWAYS_INLINE
+void iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16(
+    const uint16_t *lhs_base, int64_t lhs_offset, const uint16_t *rhs_base,
+    int64_t rhs_offset, float *acc_base, int64_t acc_offset, int64_t acc_stride,
+    int32_t k_outer, int32_t intrinsics_m, int32_t intrinsics_n,
+    int32_t intrinsics_k) {
+  (void)lhs_base;
+  (void)lhs_offset;
+  (void)rhs_base;
+  (void)rhs_offset;
+  (void)acc_base;
+  (void)acc_offset;
+  (void)acc_stride;
+  (void)k_outer;
+  (void)intrinsics_m;
+  (void)intrinsics_n;
+  (void)intrinsics_k;
+  // TODO(ukernels): real inner K loop using `_mm512_dpbf16_ps`, looping over
+  // intrinsics_{m,n,k}.
+}
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c b/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c
new file mode 100644
index 0000000..478ea1a
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c
@@ -0,0 +1,61 @@
+// Copyright 2026 The IREE Authors
+//
+// Licensed under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include <immintrin.h>
+#include "common.h"
+
+// Microkernel for `iree_codegen.inner_tiled` with
+// `#iree_cpu.data_tiled_mma_layout<intrinsic =
+//     MMA_X86_AVX512VNNI_16x16x2_I32_I8_CASTI16>`. Function name matches the
+// intrinsic name verbatim (lowercased, with the `iree_uk_` prefix), in line
+// with the AMDGPU C ukernel convention.
+//
+// Implements the inner K-loop for the unrolled (intrinsics_m, intrinsics_n,
+// intrinsics_k) tile built from the 16x16x2 i8 VNNI intrinsic via AVX-512
+// VNNI `vpdpwssd`. The "CASTI16" in the MMA intrinsic name reflects that
+// the s8 inputs are zero/sign-extended into i16 lanes before being fed to
+// the 16-bit VNNI instruction; that cast is handled in the inner loop.
+//
+// `intrinsics_{m,n,k}` are passed as function arguments and so look like
+// runtime values inside this translation unit, but the ukernel is always
+// inlined into its caller (a bug otherwise) and the caller always passes
+// the matching `DataTiledMMAAttr` constants. Together with post-inline IR
+// optimization on the linked bitcode, the body specializes to specific
+// compile-time `intrinsics_{m,n,k}` values at each call site.
+//
+// NOTE (seed scaffolding): this initial seed has a stub body. It exists so
+// that the surrounding *framework* -- bitcode build, embedding,
+// `hal.executable_object` injection, IR rewrite to `ukernel.generic` -- can
+// be exercised end-to-end. A follow-up commit replaces the body with the
+// `_mm512_dpwssd_epi32`-based inner loop and adds an e2e matmul test for
+// it. This is the "practically useful" seed: i8x i8->i32 via VNNI is a
+// workhorse for quantized inference, and codegen has a residual perf gap
+// on this case.
+// ABI matches the inner_tiled -> ukernel.generic lowering (see
+// `iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16`): each shaped operand passed
+// as (base, element offset), ACC additionally as its innermost
+// cross-intrinsic stride, then the scalar `k_outer` / `intrinsics_{m,n,k}`.
+IREE_UK_ALWAYS_INLINE
+void iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16(
+    const void *lhs_base, int64_t lhs_offset, const void *rhs_base,
+    int64_t rhs_offset, void *acc_base, int64_t acc_offset, int64_t acc_stride,
+    int32_t k_outer, int32_t intrinsics_m, int32_t intrinsics_n,
+    int32_t intrinsics_k) {
+  (void)lhs_base;
+  (void)lhs_offset;
+  (void)rhs_base;
+  (void)rhs_offset;
+  (void)acc_base;
+  (void)acc_offset;
+  (void)acc_stride;
+  (void)k_outer;
+  (void)intrinsics_m;
+  (void)intrinsics_n;
+  (void)intrinsics_k;
+  // TODO(ukernels): real inner K loop using `_mm512_dpwssd_epi32` after
+  // widening the s8 LHS/RHS halves to i16 lanes (loop over
+  // intrinsics_{m,n,k} like the bf16 ukernel).
+}
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/BUILD.bazel b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/BUILD.bazel
new file mode 100644
index 0000000..5ff5b44
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/BUILD.bazel
@@ -0,0 +1,30 @@
+# Copyright 2026 The IREE Authors
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+load("//build_tools/bazel:enforce_glob.bzl", "enforce_glob")
+load("//build_tools/bazel:iree_lit_test.bzl", "iree_lit_test_suite")
+
+package(
+    features = ["layering_check"],
+    licenses = ["notice"],  # Apache 2.0
+)
+
+iree_lit_test_suite(
+    name = "lit",
+    srcs = enforce_glob(
+        # keep sorted
+        [
+            "lower_inner_tiled_to_bitcode_ukernel.mlir",
+            "lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir",
+        ],
+        include = ["*.mlir"],
+    ),
+    cfg = "//compiler:lit.cfg.py",
+    tools = [
+        "//tools:iree-opt",
+        "@llvm-project//llvm:FileCheck",
+    ],
+)
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/CMakeLists.txt b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/CMakeLists.txt
new file mode 100644
index 0000000..1671a1e
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/CMakeLists.txt
@@ -0,0 +1,24 @@
+################################################################################
+# Autogenerated by build_tools/bazel_to_cmake/bazel_to_cmake.py from           #
+# compiler/plugins/target/LLVMCPU/builtins/ukernel/test/BUILD.bazel            #
+#                                                                              #
+# Use iree_cmake_extra_content from iree/build_defs.oss.bzl to add arbitrary   #
+# CMake-only content.                                                          #
+#                                                                              #
+# To disable autogeneration for this file entirely, delete this header.        #
+################################################################################
+
+iree_add_all_subdirs()
+
+iree_lit_test_suite(
+  NAME
+    lit
+  SRCS
+    "lower_inner_tiled_to_bitcode_ukernel.mlir"
+    "lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir"
+  TOOLS
+    FileCheck
+    iree-opt
+)
+
+### BAZEL_TO_CMAKE_PRESERVES_ALL_CONTENT_BELOW_THIS_LINE ###
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel.mlir b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel.mlir
new file mode 100644
index 0000000..6021168
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel.mlir
@@ -0,0 +1,64 @@
+// RUN: iree-opt --iree-codegen-lower-bitcode-ukernels --split-input-file %s | FileCheck %s
+
+// Tests that an op carrying a `iree_codegen.ukernel = <"name", bitcode>`
+// descriptor is rewritten by `LowerBitcodeUKernels` to an
+// `iree_codegen.ukernel.generic` call, and that an attached
+// `hal.executable.objects` (carrying the ukernel bitcode bytes) is preserved
+// on the new op. The `#iree_cpu.ukernel_provider` on the executable target is
+// looked up but currently delegates the rewrite to the default fallback in the
+// pass; specialized handling of `iree_codegen.inner_tiled` (threading
+// `intrinsics_{m,n,k}` and the outer K count as scalar operands) will be
+// added in a follow-up commit alongside the SelectUKernels pass that sets
+// the descriptor and attaches the bitcode.
+//
+// Bitcode bytes here are an opaque placeholder; the pass treats
+// `hal.executable.objects` as a discardable attribute and does not parse it.
+
+#executable_target = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {
+  iree_codegen.ukernel_provider = #iree_cpu.ukernel_provider
+}>
+#map = affine_map<(d0, d1, d2) -> (d0, d2)>
+#map1 = affine_map<(d0, d1, d2) -> (d1, d2)>
+#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
+
+// CHECK-LABEL: @bf16_matmul_with_ukernel_descriptor
+// CHECK-SAME:    %[[LHS:[a-zA-Z0-9]+]]: tensor<16x2xbf16>
+// CHECK-SAME:    %[[RHS:[a-zA-Z0-9]+]]: tensor<16x2xbf16>
+// CHECK-NOT:     linalg.generic
+// CHECK:         %[[OUT:.+]] = linalg.fill
+// CHECK:         %[[UKERNEL:.+]] = iree_codegen.ukernel.generic
+// CHECK-SAME:        {hal.executable.objects = [{{.*}}"iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc"
+// CHECK-SAME:         iree_codegen.ukernel = #iree_codegen.ukernel_descriptor<"iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16", bitcode>}
+// CHECK-SAME:        "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16"
+// CHECK-SAME:        ins(%[[LHS]], %[[RHS]] : tensor<16x2xbf16>, tensor<16x2xbf16>)
+// CHECK-SAME:        outs(%[[OUT]] : tensor<16x16xf32>)
+// CHECK:         return %[[UKERNEL]]
+module attributes {hal.executable.target = #executable_target} {
+  func.func @bf16_matmul_with_ukernel_descriptor(
+      %arg0: tensor<16x2xbf16>, %arg1: tensor<16x2xbf16>) -> tensor<16x16xf32> {
+    %cst = arith.constant 0.000000e+00 : f32
+    %0 = tensor.empty() : tensor<16x16xf32>
+    %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<16x16xf32>) -> tensor<16x16xf32>
+    %2 = linalg.generic {
+      indexing_maps = [#map, #map1, #map2],
+      iterator_types = ["parallel", "parallel", "reduction"]
+    } ins(%arg0, %arg1 : tensor<16x2xbf16>, tensor<16x2xbf16>)
+      outs(%1 : tensor<16x16xf32>) attrs = {
+      iree_codegen.ukernel = #iree_codegen.ukernel_descriptor<
+          "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16", bitcode>,
+      hal.executable.objects = [
+        #hal.executable.object<{
+          path = "iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc",
+          data = dense<[0, 1, 2, 3]> : vector<4xi8>}>
+      ]
+    } {
+    ^bb0(%in: bf16, %in_0: bf16, %out: f32):
+      %3 = arith.extf %in : bf16 to f32
+      %4 = arith.extf %in_0 : bf16 to f32
+      %5 = arith.mulf %3, %4 : f32
+      %6 = arith.addf %out, %5 : f32
+      linalg.yield %6 : f32
+    } -> tensor<16x16xf32>
+    return %2 : tensor<16x16xf32>
+  }
+}
diff --git a/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir
new file mode 100644
index 0000000..fdc348d
--- /dev/null
+++ b/compiler/plugins/target/LLVMCPU/builtins/ukernel/test/lower_inner_tiled_to_bitcode_ukernel_i8_vnni.mlir
@@ -0,0 +1,57 @@
+// RUN: iree-opt --iree-codegen-lower-bitcode-ukernels --split-input-file %s | FileCheck %s
+
+// Companion to lower_inner_tiled_to_bitcode_ukernel.mlir, exercising the
+// i8 x i8 -> i32 VNNI seed. Same shape of test (user-supplied
+// `hal.executable.objects` carrying placeholder bitcode bytes), distinct
+// from the bf16 test only in the ukernel name and element types — this
+// guards against a regression where the framework would accidentally
+// hard-code the bf16 seed.
+
+#executable_target = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {
+  iree_codegen.ukernel_provider = #iree_cpu.ukernel_provider
+}>
+#map = affine_map<(d0, d1, d2) -> (d0, d2)>
+#map1 = affine_map<(d0, d1, d2) -> (d1, d2)>
+#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
+
+// CHECK-LABEL: @i8_matmul_with_ukernel_descriptor
+// CHECK-SAME:    %[[LHS:[a-zA-Z0-9]+]]: tensor<16x2xi8>
+// CHECK-SAME:    %[[RHS:[a-zA-Z0-9]+]]: tensor<16x2xi8>
+// CHECK-NOT:     linalg.generic
+// CHECK:         %[[OUT:.+]] = linalg.fill
+// CHECK:         %[[UKERNEL:.+]] = iree_codegen.ukernel.generic
+// CHECK-SAME:        {hal.executable.objects = [{{.*}}"iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.x86_64_avx512vnni.bc"
+// CHECK-SAME:         iree_codegen.ukernel = #iree_codegen.ukernel_descriptor<"iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16", bitcode>}
+// CHECK-SAME:        "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16"
+// CHECK-SAME:        ins(%[[LHS]], %[[RHS]] : tensor<16x2xi8>, tensor<16x2xi8>)
+// CHECK-SAME:        outs(%[[OUT]] : tensor<16x16xi32>)
+// CHECK:         return %[[UKERNEL]]
+module attributes {hal.executable.target = #executable_target} {
+  func.func @i8_matmul_with_ukernel_descriptor(
+      %arg0: tensor<16x2xi8>, %arg1: tensor<16x2xi8>) -> tensor<16x16xi32> {
+    %c0 = arith.constant 0 : i32
+    %0 = tensor.empty() : tensor<16x16xi32>
+    %1 = linalg.fill ins(%c0 : i32) outs(%0 : tensor<16x16xi32>) -> tensor<16x16xi32>
+    %2 = linalg.generic {
+      indexing_maps = [#map, #map1, #map2],
+      iterator_types = ["parallel", "parallel", "reduction"]
+    } ins(%arg0, %arg1 : tensor<16x2xi8>, tensor<16x2xi8>)
+      outs(%1 : tensor<16x16xi32>) attrs = {
+      iree_codegen.ukernel = #iree_codegen.ukernel_descriptor<
+          "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16", bitcode>,
+      hal.executable.objects = [
+        #hal.executable.object<{
+          path = "iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.x86_64_avx512vnni.bc",
+          data = dense<[0, 1, 2, 3]> : vector<4xi8>}>
+      ]
+    } {
+    ^bb0(%in: i8, %in_0: i8, %out: i32):
+      %3 = arith.extsi %in : i8 to i32
+      %4 = arith.extsi %in_0 : i8 to i32
+      %5 = arith.muli %3, %4 : i32
+      %6 = arith.addi %out, %5 : i32
+      linalg.yield %6 : i32
+    } -> tensor<16x16xi32>
+    return %2 : tensor<16x16xi32>
+  }
+}
diff --git a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp
index e2aca45..60ed072 100644
--- a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp
+++ b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp
@@ -6,6 +6,8 @@
 
 #include "iree/compiler/Codegen/Dialect/CPU/IR/IREECPUTypes.h"
 #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.h"
+#include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenOps.h"
+#include "iree/compiler/Codegen/Dialect/Codegen/IR/UKernelOps.h"
 #include "iree/compiler/Codegen/Dialect/Codegen/Utils/MMAUtils.h"
 #include "iree/compiler/Dialect/Encoding/IR/EncodingTypes.h"
 #include "llvm/ADT/STLExtras.h"
@@ -1094,6 +1096,22 @@
 bool InnerTiledSemanticsAttr::getOpaque() const { return false; }
 
 //===----------------------------------------------------------------------===//
+// UKernelProviderAttr
+//===----------------------------------------------------------------------===//
+
+std::optional<LogicalResult> UKernelProviderAttr::createAndReplaceWithUkernelOp(
+    RewriterBase &rewriter, StringRef name, DictionaryAttr targetConfiguration,
+    Operation *contextualOp, ArrayRef<Value> inputs, ArrayRef<Value> outputs,
+    SmallVectorImpl<Value> &otherOperands) const {
+  // Fall through to the default UKernelGenericOp construction in
+  // LowerBitcodeUKernelsPass. Specialized handling of `inner_tiled` ops
+  // (threading `intrinsics_{m,n,k}` and the outer K count as scalar operands)
+  // will be added in a follow-up commit alongside the SelectUKernels pass that
+  // sets the `iree_codegen.ukernel` descriptor and attaches the bitcode.
+  return std::nullopt;
+}
+
+//===----------------------------------------------------------------------===//
 // Attribute Registration
 //===----------------------------------------------------------------------===//
 
diff --git a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.td b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.td
index 413eb35..9629fd8 100644
--- a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.td
+++ b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.td
@@ -131,6 +131,37 @@
 }
 
 //===----------------------------------------------------------------------===//
+// UKernelProvider attribute
+//===----------------------------------------------------------------------===//
+
+def IREECPU_UKernelProviderAttr  :
+    AttrDef<IREECPU_Dialect, "UKernelProvider", [
+    DeclareAttrInterfaceMethods<IREECodegen_UKernelProviderInterface, [
+        "createAndReplaceWithUkernelOp"
+      ]>
+    ]> {
+  let mnemonic = "ukernel_provider";
+  let summary = [{
+    Provides built-in C-bitcode ukernel implementations for the LLVMCPU
+    target backend.
+  }];
+  let description = [{
+    Implements the `UKernelProviderInterface` for the LLVMCPU target.
+    `createAndReplaceWithUkernelOp` rewrites an `iree_codegen.inner_tiled` op
+    carrying an `iree_cpu.data_tiled_mma_layout` and an
+    `iree_codegen.ukernel = "<name>"` descriptor into an
+    `iree_codegen.ukernel.generic` call, and attaches the matching ukernel
+    bitcode as a `hal.executable_object` on the dispatch's executable variant.
+
+    Bitcode is resolved by name: first against any `hal.executable.objects`
+    already attached above the op (so user-supplied bitcode wins), then
+    against ukernels embedded into `iree-compile` at LLVMCPU plugin init.
+  }];
+  let parameters = (ins);
+  let assemblyFormat = [{}];
+}
+
+//===----------------------------------------------------------------------===//
 // CPU Pipeline Attribute
 //===----------------------------------------------------------------------===//
 
diff --git a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/roundtrip.mlir b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/roundtrip.mlir
index 27fb3b4..29d2d44 100644
--- a/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/roundtrip.mlir
+++ b/compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/roundtrip.mlir
@@ -69,3 +69,16 @@
 // CHECK-SAME:      test = 123 : i32
 // CHECK-LABEL: @test_arbitrary_keys()
 // CHECK-SAME:    lowering_config = #[[$CONFIG]]
+
+// -----
+
+// Round-trip the LLVMCPU ukernel provider attribute. This is the CPU analogue
+// of #rocm.ukernel_provider, set on a `hal.executable.target` config to enable
+// the built-in C-bitcode ukernels under
+// compiler/plugins/target/LLVMCPU/builtins/ukernel/.
+func.func @test_ukernel_provider() attributes {
+    iree_codegen.ukernel_provider = #iree_cpu.ukernel_provider} {
+  return
+}
+// CHECK-LABEL: @test_ukernel_provider()
+// CHECK-SAME:    iree_codegen.ukernel_provider = #iree_cpu.ukernel_provider