Separate architecture generic<->specific bitcode (#13825)

This is the main PR towards #13804 . `iree_bitcode_library` gains the
ability to produce either arch-specific or generic bitcode. We build
separately the architecture-specific parts of ukernel code (what's under
`ukernel/arch/`) and the generic parts (what's directly in `ukernel/`).
Then in the compiler, we unconditionally load the generic bitcode, and
architecture-specific bitcode if any is availble for the target
architecture.

Before you ask: why not just produce N side-by-side,
architecture-specific bitcode modules, one per architecture that we care
about? We want microkernels to just work, all the time, not be forever
stuck in "advanced feature that may cause trouble" limbo. Since lacking
a required microkernel is a linker error (unless perhaps you go through
the trouble of linking a
[plugin](https://github.com/openxla/iree/tree/main/experimental/cpu_ukernel)
at runtime), we want to always unconditionally have bitcode for all
ukernels for all architectures, even the ones that we don't have really
optimized microkernels for yet and just want functional correctness for.
That means at least 8 architectures today
(`{x86,arm,riscv,wasm}_{32,64}`), probably dozens in the future. So that
would be a lot of side-by-side copies. We would start to have to be
reluctant to add more ukernels. By contrast, if we can get
architecture-generic bitcode to work (as this PR does) then we can have
1 single copy of that architecture-generic bitcode regardless of the
number of target architectures supported; and any additional bitcode,
architecture-specific bitcode, is proportional to the engineering effort
invested in optimizing for each target architecture.

So that's why I think architecture-generic bitcode is worth the effort.

The central difficulty is that Clang doesn't have any switch allowing to
directly produce target-independent bitcode.

From Clang's perspective (which IIUC is well summarized by [this
answer](https://stackoverflow.com/questions/71868733/how-to-make-target-independent-ir-with-llvm)),
target-independence is a property of the source language, and C isn't a
target-independent language in general.

But ukernels code isn't any C code, it's C code that's carefully written
to be target-independent outside of that `arch` subdir:
* We don't use target-dependent types (e.g. `ssize_t`) only fixed-width
types (e.g. `iree_uk_ssize_t` is `iree_uk_int64_t`, see #13834).
* We do use pointers, which are technically target-dependent, but that
target-dependence doesn't appear until later down the lowerings: as we
are outputting LLVM IR here, pointers are still an opaque `ptr` type.
* We don't do `#if` based on target-dependent tokens. Selection of
architecture-specific code paths has been reimplemented as strong
symbols (in architecture-specific code) overriding weak symbols (in
architecture-independent code) in #13715.
* We don't `#include` any standard library or system header, so our code
is truly self-contained, and that's guarded by the flags we pass Clang
when compiling to bitcode.

So we are in a special case here, so it's not unreasonable to think that
we known better than Clang and try to work past its reluctance to
produce target-independent IR.

Inspecting the IR produced from compiling our architecture-independent
ukernel files showed that the target-dependence in the resulting IR is
limited to a few target attributes and a target triple, that have been
automatically added but don't seem to play any role. Editing these away
made `llc` happy to compile that IR to *another* target architecture.

This motivated the approach in this PR: a `strip_target_info.py` script
simply drops the target details from LLVM IR.

`iree_bitcode_library` gains an `arch=` parameter. When not specified,
IR is processed with `strip_target_info.py`. When specified, IR is left
unprocessed and the right `-target` flag is passed. Generally, all the
copts are automatically set by `iree_bitcode_library` now, though each
call site may still override anything as usual (rule copts being
appended after).
diff --git a/build_tools/bazel/iree_bitcode_library.bzl b/build_tools/bazel/iree_bitcode_library.bzl
index b4f9b38..09282f3 100644
--- a/build_tools/bazel/iree_bitcode_library.bzl
+++ b/build_tools/bazel/iree_bitcode_library.bzl
@@ -6,65 +6,156 @@
 
 """Rules for compiling with clang to produce bitcode libraries."""
 
+def iree_arch_to_llvm_arch(
+        iree_arch = None):
+    """Converts an IREE_ARCH value to the corresponding LLVM arch name.
+
+    Similar to the CMake function with the same name.
+
+    Args:
+        iree_arch: IREE_ARCH string value.
+
+    Returns:
+        The LLVM name for that architecture (first component of target triple).
+    """
+
+    if not iree_arch:
+        return None
+    if iree_arch == "arm_64":
+        return "aarch64"
+    if iree_arch == "arm_32":
+        return "arm"
+    if iree_arch == "x86_64":
+        return "x86_64"
+    if iree_arch == "x86_32":
+        return "i386"
+    if iree_arch == "riscv_64":
+        return "riscv64"
+    if iree_arch == "riscv_32":
+        return "riscv32"
+    if iree_arch == "wasm_64":
+        return "wasm64"
+    if iree_arch == "wasm_32":
+        return "wasm32"
+    fail("Unhandled IREE_ARCH value %s" % iree_arch)
+
 def iree_bitcode_library(
         name,
         srcs,
-        hdrs = [],
+        internal_hdrs = [],
         copts = [],
-        defines = [],
-        data = [],
         out = None,
-        clang_tool = "@llvm-project//clang:clang",
-        link_tool = "@llvm-project//llvm:llvm-link",
-        builtin_headers_dep = "@llvm-project//clang:builtin_headers_gen",
-        builtin_headers_path = "external/llvm-project/clang/staging/include/",
+        arch = None,
         **kwargs):
     """Builds an LLVM bitcode library from an input file via clang.
 
     Args:
         name: Name of the target.
+        arch: Target architecture to compile for, in IREE_ARCH format. If left
+              empty, will produce architecture-independent bitcode by stripping
+              target triple and target attributes; that only makes sense if the
+              sources being compiled are truly architecture-independent.
         srcs: source files to pass to clang.
-        hdrs: additional headers included by the source files.
+        internal_hdrs: all headers transitively included by the source files.
+                       Unlike typical Bazel `hdrs`, these are not exposed as
+                       interface headers. This would normally be part of `srcs`,
+                       but separating it was easier for `bazel_to_cmake`, as
+                       CMake does not need this, and making this explicitly
+                       Bazel-only allows using `filegroup` on the Bazel side.
         copts: additional flags to pass to clang.
-        defines: preprocessor definitions to pass to clang.
-        data: additional data required during compilation.
         out: output file name (defaults to name.bc).
-        clang_tool: the clang to use to compile the source.
-        link_tool: llvm-link tool used for linking bitcode files.
-        builtin_headers_dep: clang builtin headers (stdbool, stdint, etc).
-        builtin_headers_path: relative path to the builtin headers rule.
         **kwargs: any additional attributes to pass to the underlying rules.
     """
 
+    clang_tool = "@llvm-project//clang:clang"
+    link_tool = "@llvm-project//llvm:llvm-link"
+    builtin_headers_dep = "@llvm-project//clang:builtin_headers_gen"
+    builtin_headers_path = "external/llvm-project/clang/staging/include/"
+
+    base_copts = [
+        # C17 with no system deps.
+        "-std=c17",
+        "-nostdinc",
+        "-ffreestanding",
+
+        # Optimized and unstamped.
+        "-O3",
+        "-DNDEBUG",
+        "-fno-ident",
+        "-fdiscard-value-names",
+
+        # Set the size of wchar_t to 4 bytes (instead of 2 bytes).
+        # This must match what the runtime is built with.
+        "-fno-short-wchar",
+
+        # Object file only in bitcode format:
+        "-c",
+        "-emit-llvm",
+
+        # Force the library into standalone mode (not depending on build-directory
+        # configuration).
+        "-DIREE_DEVICE_STANDALONE=1",
+    ]
+
+    llvmir_processing_tool = None
+    if arch:
+        # Compile to the specified target architecture.
+        base_copts.extend(["-target", iree_arch_to_llvm_arch(arch)])
+    else:
+        # Output text rather than binary serialization of LLVM IR for processing
+        base_copts.append("-S")
+
+        # Strip target information from generated LLVM IR.
+        llvmir_processing_tool = "//build_tools/scripts:strip_target_info"
+
     bitcode_files = []
-    for bitcode_src in srcs:
-        bitcode_out = "%s_%s.bc" % (name, bitcode_src)
-        bitcode_files.append(bitcode_out)
-        system_headers = ["immintrin.h"]
+    for src in srcs:
+        bitcode_out = "%s_%s.bc" % (name, src)
         native.genrule(
             name = "gen_%s" % (bitcode_out),
-            srcs = [bitcode_src] + hdrs + [builtin_headers_dep],
+            srcs = [src, builtin_headers_dep] + internal_hdrs,
             outs = [bitcode_out],
             cmd = " && ".join([
                 " ".join([
                     "$(location %s)" % (clang_tool),
                     "-isystem $(BINDIR)/%s" % builtin_headers_path,
-                    " ".join(copts),
-                    " ".join(["-D%s" % (define) for define in defines]),
+                    " ".join(base_copts + copts),
                     " ".join(["-I $(BINDIR)/runtime/src"]),
                     " ".join(["-I runtime/src"]),
                     "-o $(location %s)" % (bitcode_out),
-                    "$(location %s)" % (bitcode_src),
+                    "$(location %s)" % (src),
                 ]),
             ]),
-            tools = data + [
+            tools = [
                 clang_tool,
             ],
-            message = "Compiling %s to %s..." % (bitcode_src, bitcode_out),
+            message = "Compiling %s to %s..." % (src, bitcode_out),
             output_to_bindir = 1,
             **kwargs
         )
 
+        if llvmir_processing_tool:
+            processed_bitcode_out = "%s_%s.processed.bc" % (name, src)
+            native.genrule(
+                name = "gen_%s" % (processed_bitcode_out),
+                srcs = [bitcode_out],
+                outs = [processed_bitcode_out],
+                cmd = " ".join([
+                    "$(location %s)" % (llvmir_processing_tool),
+                    "< $(location %s)" % bitcode_out,
+                    "> $(location %s)" % processed_bitcode_out,
+                ]),
+                tools = [
+                    llvmir_processing_tool,
+                ],
+                message = "Processing %s into %s using %s..." % (bitcode_out, processed_bitcode_out, llvmir_processing_tool),
+                output_to_bindir = 1,
+                **kwargs
+            )
+            bitcode_files.append(processed_bitcode_out)
+        else:
+            bitcode_files.append(bitcode_out)
+
     if not out:
         out = "%s.bc" % (name)
     native.genrule(
@@ -78,7 +169,7 @@
                 " ".join(["$(locations %s)" % (src) for src in bitcode_files]),
             ]),
         ]),
-        tools = data + [link_tool],
+        tools = [link_tool],
         message = "Linking bitcode library %s to %s..." % (name, out),
         output_to_bindir = 1,
         **kwargs
diff --git a/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py b/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py
index 91de990..f6b479d 100644
--- a/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py
+++ b/build_tools/bazel_to_cmake/bazel_to_cmake_converter.py
@@ -251,11 +251,13 @@
     pass
 
   def filegroup(self, name, **kwargs):
-    # Not implemented yet. Might be a no-op, or may want to evaluate the srcs
-    # attribute and pass them along to any targets that depend on the filegroup.
+    # Not implemented, but allowed for Bazel-only uses, such as declaring internal
+    # headers and other kinds of files that Bazel enforces but CMake doesn't care
+    # about. If we ever need to implement this, this might be a no-op, or may
+    # want to evaluate the srcs attribute and pass them along to any targets
+    # that depend on the filegroup.
     # Cross-package dependencies and complicated globs could be hard to handle.
-
-    self._convert_unimplemented_function("filegroup", name)
+    pass
 
   def sh_binary(self, name, **kwargs):
     if self._should_skip_target(**kwargs):
@@ -483,49 +485,30 @@
   def iree_bitcode_library(self,
                            name,
                            srcs,
-                           hdrs=None,
+                           internal_hdrs=None,
                            copts=None,
-                           defines=None,
-                           data=None,
-                           clang_tool=None,
-                           builtin_headers=None,
-                           testonly=None):
+                           arch=None):
     name_block = self._convert_string_arg_block("NAME", name, quote=False)
     srcs_block = self._convert_srcs_block(srcs)
-    hdrs_block = self._convert_string_list_block("HDRS", hdrs, sort=True)
     copts_block = self._convert_string_list_block("COPTS", copts, sort=False)
-    defines_block = self._convert_string_list_block("DEFINES", defines)
-    data_block = self._convert_target_list_block("DATA", data)
-    clang_tool_block = self._convert_target_block("CLANG_TOOL", clang_tool)
-    builtin_headers_block = self._convert_target_list_block(
-        "BUILTIN_HEADERS", builtin_headers)
-    testonly_block = self._convert_option_block("TESTONLY", testonly)
+    arch_block = self._convert_string_arg_block("ARCH", arch, quote=False)
 
     self._converter.body += (f"iree_bitcode_library(\n"
                              f"{name_block}"
                              f"{srcs_block}"
-                             f"{hdrs_block}"
                              f"{copts_block}"
-                             f"{defines_block}"
-                             f"{data_block}"
-                             f"{clang_tool_block}"
-                             f"{builtin_headers_block}"
-                             f"{testonly_block}"
-                             f"  PUBLIC\n)\n\n")
+                             f"{arch_block}"
+                             f")\n\n")
 
-  def iree_link_bitcode(self, name, bitcode_files, data=None, testonly=None):
+  def iree_link_bitcode(self, name, bitcode_files):
     name_block = self._convert_string_arg_block("NAME", name, quote=False)
     bitcode_files_block = self._convert_srcs_block(
         [f.replace(":", "/") for f in bitcode_files])
-    data_block = self._convert_target_list_block("DATA", data)
-    testonly_block = self._convert_option_block("TESTONLY", testonly)
 
     self._converter.body += (f"iree_link_bitcode(\n"
                              f"{name_block}"
                              f"{bitcode_files_block}"
-                             f"{data_block}"
-                             f"{testonly_block}"
-                             f"  PUBLIC\n)\n\n")
+                             f"\n)\n\n")
 
   def iree_bytecode_module(self,
                            name,
diff --git a/build_tools/cmake/iree_bitcode_library.cmake b/build_tools/cmake/iree_bitcode_library.cmake
index 05ee273..836c5ae 100644
--- a/build_tools/cmake/iree_bitcode_library.cmake
+++ b/build_tools/cmake/iree_bitcode_library.cmake
@@ -12,32 +12,22 @@
 #
 # Parameters:
 # NAME: Name of target (see Note).
-# SRCS: Source files to pass to clang.
-# HDRS: Additional headers included by the source files.
+# SRCS: Source files. Headers go here as well, as in iree_cc_library. There is
+#       no concept of public headers (HDRS) here.
 # COPTS: additional flags to pass to clang.
-# DEFINES: Preprocessor definitions to pass to clang.
-# DATA: Additional data required during compilation.
 # OUT: Output file name (defaults to NAME.bc).
-# PUBLIC: Add this so that this library will be exported under ${PACKAGE}::
-#     Also in IDE, target will appear in ${PACKAGE} folder while non PUBLIC
-#     will be in ${PACKAGE}/internal.
-# TESTONLY: When added, this target will only be built if IREE_BUILD_TESTS=ON.
 function(iree_bitcode_library)
   cmake_parse_arguments(
     _RULE
-    "PUBLIC;TESTONLY"
-    "NAME;OUT"
-    "SRCS;HDRS;COPTS;DEFINES;DATA"
+    ""
+    "NAME;OUT;ARCH"
+    "SRCS;COPTS"
     ${ARGN}
   )
 
   set(_CLANG_TOOL "$<TARGET_FILE:${IREE_CLANG_TARGET}>")
   set(_LINK_TOOL "$<TARGET_FILE:${IREE_LLVM_LINK_TARGET}>")
 
-  if(_RULE_TESTONLY AND NOT IREE_BUILD_TESTS)
-    return()
-  endif()
-
   if(DEFINED _RULE_OUT)
     set(_OUT "${_RULE_OUT}")
   else()
@@ -54,38 +44,90 @@
   # override this but it should be harmless.
   set(_BUILTIN_HEADERS_PATH "${IREE_BINARY_DIR}/llvm-project/lib/clang/${_CLANG_VERSION_MAJOR}/include/")
 
-  set(_ARGS "")
-  list(APPEND _ARGS "-isystem" "${_BUILTIN_HEADERS_PATH}")
-  list(APPEND _ARGS "-I" "${IREE_SOURCE_DIR}/runtime/src")
-  list(APPEND _ARGS "-I" "${IREE_BINARY_DIR}/runtime/src")
-  
-  list(APPEND _ARGS "${_RULE_COPTS}")
-  foreach(_DEFINE ${_RULE_DEFINES})
-    list(APPEND _ARGS "-D${_DEFINE}")
-  endforeach()
+  set(_COPTS
+    # C17 with no system deps.
+    "-std=c17"
+    "-nostdinc"
+    "-ffreestanding"
+
+    # Optimized and unstamped.
+    "-O3"
+    "-DNDEBUG"
+    "-fno-ident"
+    "-fdiscard-value-names"
+
+    # Set the size of wchar_t to 4 bytes (instead of 2 bytes).
+    # This must match what the runtime is built with.
+    "-fno-short-wchar"
+
+    # Object file only in bitcode format:
+    "-c"
+    "-emit-llvm"
+
+    # Force the library into standalone mode (not depending on build-directory
+    # configuration).
+    "-DIREE_DEVICE_STANDALONE=1"
+  )
+
+  list(APPEND _COPTS "-isystem" "${_BUILTIN_HEADERS_PATH}")
+  list(APPEND _COPTS "-I" "${IREE_SOURCE_DIR}/runtime/src")
+  list(APPEND _COPTS "-I" "${IREE_BINARY_DIR}/runtime/src")
+  list(APPEND _COPTS "${_RULE_COPTS}")
+
+  if(_RULE_ARCH)
+    # Compile to the specified target architecture.
+    iree_arch_to_llvm_arch(_LLVM_ARCH "${_RULE_ARCH}")
+    list(APPEND _COPTS "-target" "${_LLVM_ARCH}")
+  else()
+    # Output text rather than binary serialization of LLVM IR for processing.
+    list(APPEND _COPTS "-S")
+    # Strip target information from generated LLVM IR.
+    set(_LLVMIR_PROCESSING_TOOL "${IREE_SOURCE_DIR}/build_tools/scripts/strip_target_info.py")
+  endif()
 
   set(_BITCODE_FILES)
-  foreach(_BITCODE_SRC ${_RULE_SRCS})
-    get_filename_component(_BITCODE_SRC_PATH "${_BITCODE_SRC}" REALPATH)
-    set(_BITCODE_FILE "${_RULE_NAME}_${_BITCODE_SRC}.bc")
-    list(APPEND _BITCODE_FILES ${_BITCODE_FILE})
+  foreach(_SRC ${_RULE_SRCS})
+    get_filename_component(_BITCODE_SRC_PATH "${_SRC}" REALPATH)
+    set(_BITCODE_FILE "${_RULE_NAME}_${_SRC}.bc")
     add_custom_command(
       OUTPUT
-        ${_BITCODE_FILE}
+        "${_BITCODE_FILE}"
       COMMAND
-        ${_CLANG_TOOL}
-        ${_ARGS}
+        "${_CLANG_TOOL}"
+        ${_COPTS}
         "${_BITCODE_SRC_PATH}"
         "-o"
         "${_BITCODE_FILE}"
       DEPENDS
-        ${_CLANG_TOOL}
-        ${_LINK_TOOL}
-        ${_BITCODE_SRC}
+        "${_CLANG_TOOL}"
+        "${_LINK_TOOL}"
+        "${_SRC}"
       COMMENT
-        "Compiling ${_BITCODE_SRC} to ${_BITCODE_FILE}"
+        "Compiling ${_SRC} to ${_BITCODE_FILE}"
       VERBATIM
     )
+
+    if(_LLVMIR_PROCESSING_TOOL)
+      set(_PROCESSED_BITCODE_FILE "${_RULE_NAME}_${_SRC}.processed.bc")
+      list(APPEND _BITCODE_FILES ${_PROCESSED_BITCODE_FILE})
+      add_custom_command(
+        OUTPUT
+          "${_PROCESSED_BITCODE_FILE}"
+        COMMAND
+          "python3"
+          "${_LLVMIR_PROCESSING_TOOL}"
+          < "${_BITCODE_FILE}"
+          > "${_PROCESSED_BITCODE_FILE}"
+        DEPENDS
+          "${_BITCODE_FILE}"
+          "${_LLVMIR_PROCESSING_TOOL}"
+        COMMENT
+          "Processing ${_BITCODE_FILE} into ${_PROCESSED_BITCODE_FILE} using ${_LLVMIR_PROCESSING_TOOL}"
+        VERBATIM
+      )
+    else()  # _LLVMIR_PROCESSING_TOOL
+      list(APPEND _BITCODE_FILES ${_BITCODE_FILE})
+    endif()  # _LLVMIR_PROCESSING_TOOL
   endforeach()
 
   add_custom_command(
@@ -98,7 +140,6 @@
       "${_OUT}"
     DEPENDS
       ${_LINK_TOOL}
-      ${_RULE_SRCS}
       ${_BITCODE_FILES}
     COMMENT
       "Linking bitcode to ${_OUT}"
@@ -120,30 +161,18 @@
 # Parameters:
 # NAME: Name of target (see Note).
 # SRCS: Source files to pass to clang.
-# HDRS: Additional headers included by the source files.
-# COPTS: additional flags to pass to clang.
-# DEFINES: Preprocessor definitions to pass to clang.
-# DATA: Additional data required during compilation.
 # OUT: Output file name (defaults to NAME.bc).
-# PUBLIC: Add this so that this library will be exported under ${PACKAGE}::
-#     Also in IDE, target will appear in ${PACKAGE} folder while non PUBLIC
-#     will be in ${PACKAGE}/internal.
-# TESTONLY: When added, this target will only be built if IREE_BUILD_TESTS=ON.
 function(iree_link_bitcode)
   cmake_parse_arguments(
     _RULE
-    "PUBLIC;TESTONLY"
+    ""
     "NAME;OUT"
-    "SRCS;DEFINES;DATA"
+    "SRCS"
     ${ARGN}
   )
 
   set(_LINK_TOOL "$<TARGET_FILE:${IREE_LLVM_LINK_TARGET}>")
 
-  if(_RULE_TESTONLY AND NOT IREE_BUILD_TESTS)
-    return()
-  endif()
-
   if(DEFINED _RULE_OUT)
     set(_OUT "${_RULE_OUT}")
   else()
diff --git a/build_tools/cmake/iree_macros.cmake b/build_tools/cmake/iree_macros.cmake
index cf2102f..43f130e 100644
--- a/build_tools/cmake/iree_macros.cmake
+++ b/build_tools/cmake/iree_macros.cmake
@@ -109,6 +109,10 @@
     set(${DST_LLVM_ARCH_VARIABLE} "riscv64" PARENT_SCOPE)
   elseif("${SRC_ARCH}" STREQUAL "riscv_32")
     set(${DST_LLVM_ARCH_VARIABLE} "riscv32" PARENT_SCOPE)
+  elseif("${SRC_ARCH}" STREQUAL "wasm_64")
+    set(${DST_LLVM_ARCH_VARIABLE} "wasm64" PARENT_SCOPE)
+  elseif("${SRC_ARCH}" STREQUAL "wasm_32")
+    set(${DST_LLVM_ARCH_VARIABLE} "wasm32" PARENT_SCOPE)
   else()
     message(SEND_ERROR "What is the LLVM name of the architecture that we call ${SRC_ARCH} ?")
     set(${DST_LLVM_ARCH_VARIABLE} "unknown" PARENT_SCOPE)
diff --git a/build_tools/scripts/BUILD.bazel b/build_tools/scripts/BUILD.bazel
new file mode 100644
index 0000000..6c39341
--- /dev/null
+++ b/build_tools/scripts/BUILD.bazel
@@ -0,0 +1,16 @@
+# Copyright 2023 The IREE Authors
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+package(
+    default_visibility = ["//visibility:public"],
+    features = ["layering_check"],
+    licenses = ["notice"],  # Apache 2.0
+)
+
+py_binary(
+    name = "strip_target_info",
+    srcs = ["strip_target_info.py"],
+)
diff --git a/build_tools/scripts/strip_target_info.py b/build_tools/scripts/strip_target_info.py
new file mode 100644
index 0000000..7c936d7
--- /dev/null
+++ b/build_tools/scripts/strip_target_info.py
@@ -0,0 +1,36 @@
+#!/usr/bin/env python3
+# Copyright 2023 The IREE Authors
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+"""Strip LLVM IR of target triple and target-specific attributes
+"""
+
+import sys
+import re
+import os
+
+
+def main():
+  sys.stdout.write(f";\n")
+  sys.stdout.write(f"; Processed by {os.path.basename(__file__)}\n")
+  sys.stdout.write(f";\n")
+  target_triple_regex = re.compile(r'^\s*target triple\s*=\s*"[^"]*"')
+  target_cpu_regex = re.compile(r'"target-cpu"="[^"]*"')
+  target_features_regex = re.compile(r'"target-features"="[^"]*"')
+  tune_cpu_regex = re.compile(r'"tune-cpu"="[^"]*"')
+
+  for line in sys.stdin:
+    if "target" in line:
+      if re.match(target_triple_regex, line):
+        continue
+      line = re.sub(target_cpu_regex, '', line)
+      line = re.sub(target_features_regex, '', line)
+      line = re.sub(tune_cpu_regex, '', line)
+
+    sys.stdout.write(line)
+
+
+if __name__ == "__main__":
+  main()
diff --git a/compiler/src/iree/compiler/Codegen/Utils/Utils.cpp b/compiler/src/iree/compiler/Codegen/Utils/Utils.cpp
index 513e688..1eb2e28 100644
--- a/compiler/src/iree/compiler/Codegen/Utils/Utils.cpp
+++ b/compiler/src/iree/compiler/Codegen/Utils/Utils.cpp
@@ -106,6 +106,28 @@
   return llvm::Triple(triple.value().str());
 }
 
+const char *getIreeArchNameForTargetTriple(llvm::Triple triple) {
+  if (triple.isX86()) {
+    return triple.isArch64Bit() ? "x86_64" : "x86_32";
+  }
+  if (triple.isWasm()) {
+    return triple.isArch64Bit() ? "wasm_64" : "wasm_32";
+  }
+  if (triple.isAArch64()) {
+    return "arm_64";
+  }
+  if (triple.isARM()) {
+    return "arm_32";
+  }
+  if (triple.isRISCV64()) {
+    return "riscv_64";
+  }
+  if (triple.isRISCV32()) {
+    return "riscv_32";
+  }
+  return "unknown";
+}
+
 bool isVMVXBackend(IREE::HAL::ExecutableTargetAttr targetAttr) {
   return targetAttr && targetAttr.getBackend().getValue().startswith("vmvx");
 }
diff --git a/compiler/src/iree/compiler/Codegen/Utils/Utils.h b/compiler/src/iree/compiler/Codegen/Utils/Utils.h
index fcb7025..f14d980 100644
--- a/compiler/src/iree/compiler/Codegen/Utils/Utils.h
+++ b/compiler/src/iree/compiler/Codegen/Utils/Utils.h
@@ -63,6 +63,10 @@
 std::optional<llvm::Triple> getTargetTriple(
     IREE::HAL::ExecutableTargetAttr targetAttr);
 
+/// Returns the target architecture name, in IREE_ARCH convention, from the
+/// given target triple.
+const char *getIreeArchNameForTargetTriple(llvm::Triple triple);
+
 /// Methods to get target information.
 bool isVMVXBackend(IREE::HAL::ExecutableTargetAttr targetAttr);
 bool hasMicrokernels(IREE::HAL::ExecutableTargetAttr targetAttr);
diff --git a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/BUILD.bazel b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/BUILD.bazel
index aa1c5b2..651ca3e 100644
--- a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/BUILD.bazel
+++ b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/BUILD.bazel
@@ -25,10 +25,11 @@
         "UKernel.h",
     ],
     deps = [
+        "//compiler/src/iree/compiler/Codegen/Utils",
         "//compiler/src/iree/compiler/Dialect/HAL/IR",
         "//runtime/src/iree/builtins/device:libdevice_bitcode",
         "//runtime/src/iree/builtins/musl/bin:libmusl",
-        "//runtime/src/iree/builtins/ukernel:libukernel_bitcode",
+        "//runtime/src/iree/builtins/ukernel:embed_ukernel_bitcode",
         "@llvm-project//llvm:BitReader",
         "@llvm-project//llvm:Core",
         "@llvm-project//llvm:Support",
diff --git a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/CMakeLists.txt b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/CMakeLists.txt
index c89659c..ae54d3b 100644
--- a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/CMakeLists.txt
+++ b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/CMakeLists.txt
@@ -29,7 +29,8 @@
     MLIRSupport
     iree::builtins::device::libdevice_bitcode
     iree::builtins::musl::bin::libmusl
-    iree::builtins::ukernel::libukernel_bitcode
+    iree::builtins::ukernel::embed_ukernel_bitcode
+    iree::compiler::Codegen::Utils
     iree::compiler::Dialect::HAL::IR
   PUBLIC
 )
diff --git a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp
index e06b687..86df47c 100644
--- a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp
+++ b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp
@@ -6,7 +6,8 @@
 
 #include "iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.h"
 
-#include "iree/builtins/ukernel/libukernel.h"
+#include "iree/builtins/ukernel/ukernel_bitcode.h"
+#include "iree/compiler/Codegen/Utils/Utils.h"
 #include "llvm/Bitcode/BitcodeReader.h"
 #include "llvm/Support/MemoryBufferRef.h"
 #include "mlir/Support/LLVM.h"
@@ -16,46 +17,42 @@
 namespace IREE {
 namespace HAL {
 
-static const iree_file_toc_t *lookupUKernelFile(StringRef filename) {
-  for (size_t i = 0; i < iree_builtins_libukernel_size(); ++i) {
-    const auto &file_toc = iree_builtins_libukernel_create()[i];
-    if (filename == file_toc.name) return &file_toc;
+static std::unique_ptr<llvm::Module> loadUKernelBitcodeFile(
+    StringRef filename, llvm::LLVMContext& context) {
+  const iree_file_toc_t* file_start = iree_ukernel_bitcode_create();
+  const iree_file_toc_t* file_end = file_start + iree_ukernel_bitcode_size();
+  for (const iree_file_toc_t* file = file_start; file < file_end; ++file) {
+    if (filename == file->name) {
+      llvm::MemoryBufferRef bitcodeBufferRef(
+          llvm::StringRef(file->data, file->size), file->name);
+      auto bitcodeFile = llvm::parseBitcodeFile(bitcodeBufferRef, context);
+      if (!bitcodeFile) return nullptr;
+      return std::move(*bitcodeFile);
+    }
   }
   return nullptr;
 }
 
-static const iree_file_toc_t *lookupUKernelFile(
-    llvm::TargetMachine *targetMachine) {
-  const auto &triple = targetMachine->getTargetTriple();
-
-  // NOTE: other arch-specific checks go here.
-
-  // Fallback path using the generic wasm variants as they are largely
-  // machine-agnostic.
-  if (triple.isX86()) {
-    return lookupUKernelFile("ukernel_bitcode.bc");
-  } else {
-    return nullptr;
-  }
+std::unique_ptr<llvm::Module> loadUKernelBaseBitcode(
+    llvm::LLVMContext& context) {
+  std::unique_ptr<llvm::Module> baseBitcode =
+      loadUKernelBitcodeFile("ukernel_bitcode_base.bc", context);
+  assert(baseBitcode && "base ukernel bitcode file not found");
+  return baseBitcode;
 }
 
-std::optional<std::unique_ptr<llvm::Module>> loadUKernelBitcode(
-    llvm::TargetMachine *targetMachine, llvm::LLVMContext &context) {
-  // Find a bitcode file for the current architecture.
-  const auto *file = lookupUKernelFile(targetMachine);
-  if (!file) {
-    return std::nullopt;
-  }
-
-  // Load the generic bitcode file contents.
-  llvm::MemoryBufferRef bitcodeBufferRef(
-      llvm::StringRef(file->data, file->size), file->name);
-  auto bitcodeFile = llvm::parseBitcodeFile(bitcodeBufferRef, context);
-  if (!bitcodeFile) {
-    // TODO: Do we want to error out here or silently proceed.
-    return std::nullopt;
-  }
-  return std::move(*bitcodeFile);
+std::unique_ptr<llvm::Module> loadUKernelArchBitcode(
+    llvm::TargetMachine* targetMachine, llvm::LLVMContext& context) {
+  const char* archName =
+      getIreeArchNameForTargetTriple(targetMachine->getTargetTriple());
+  char archBitcodeFilename[64];
+  snprintf(archBitcodeFilename, sizeof archBitcodeFilename,
+           "ukernel_bitcode_%s.bc", archName);
+  std::unique_ptr<llvm::Module> archBitcode =
+      loadUKernelBitcodeFile(archBitcodeFilename, context);
+  // archBitcode is optional: we don't have arch-specific ukernel code for all
+  // architectures. So it's normal to be returning nullptr here.
+  return archBitcode;
 }
 
 }  // namespace HAL
diff --git a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.h b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.h
index af662b2..ca9d8f4 100644
--- a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.h
+++ b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.h
@@ -15,7 +15,10 @@
 namespace IREE {
 namespace HAL {
 
-std::optional<std::unique_ptr<llvm::Module>> loadUKernelBitcode(
+std::unique_ptr<llvm::Module> loadUKernelBaseBitcode(
+    llvm::LLVMContext &context);
+
+std::unique_ptr<llvm::Module> loadUKernelArchBitcode(
     llvm::TargetMachine *targetMachine, llvm::LLVMContext &context);
 
 }  // namespace HAL
diff --git a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp
index 225ec6f..23f6dc5 100644
--- a/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp
+++ b/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp
@@ -12,6 +12,7 @@
 #include "iree-dialects/Dialect/LinalgTransform/LinalgTransformOps.h"
 #include "iree/compiler/Codegen/Dialect/IREECodegenDialect.h"
 #include "iree/compiler/Codegen/LLVMCPU/LLVMCPUPasses.h"
+#include "iree/compiler/Codegen/Utils/Utils.h"
 #include "iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/Device.h"
 #include "iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/Musl.h"
 #include "iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.h"
@@ -444,24 +445,47 @@
              << targetTriple.str() << "'";
     }
 
-    // Link in ukernel file.
+    // Link in ukernel bitcode.
     if (hasMicrokernel(variantOp)) {
       auto setAlwaysInline = [&](llvm::Module &module) {
         for (auto &func : module.getFunctionList()) {
           func.addFnAttr(llvm::Attribute::AlwaysInline);
         }
       };
-      if (std::optional<std::unique_ptr<llvm::Module>> libUKernel =
-              loadUKernelBitcode(targetMachine.get(), context)) {
+
+      std::unique_ptr<llvm::Module> archBitcode =
+          loadUKernelArchBitcode(targetMachine.get(), context);
+
+      // The archBitcode contains overrides for weak symbols that will come in
+      // the baseBitcode below. So we link it before baseBitcode, with
+      // OverrideFromSrc.
+      if (archBitcode) {
+        // Sequence that access before we std::move(archBitcode)!
+        StringRef archBitcodeName = archBitcode->getName();
         if (failed(linkBitcodeModule(
-                variantOp.getLoc(), moduleLinker, llvm::Linker::LinkOnlyNeeded,
-                *targetMachine, "libukernel", std::move(libUKernel.value()),
+                variantOp.getLoc(), moduleLinker, llvm::Linker::OverrideFromSrc,
+                *targetMachine, archBitcodeName, std::move(archBitcode),
                 setAlwaysInline))) {
           return mlir::emitError(variantOp.getLoc())
-                 << "failed linking in ukernel library for target triple '"
+                 << "failed linking in architecture-specific ukernel bitcode "
+                    "for target triple '"
                  << targetTriple.str() << "'";
         }
       }
+
+      // The baseBitcode module contains weak symbols for fallbacks.
+      // So we link it after the archBitcode and with LinkOnlyNeeded.
+      std::unique_ptr<llvm::Module> baseBitcode =
+          loadUKernelBaseBitcode(context);
+      // Sequence that access before we std::move(baseBitcode)!
+      StringRef baseBitcodeName = baseBitcode->getName();
+      if (failed(linkBitcodeModule(variantOp.getLoc(), moduleLinker,
+                                   llvm::Linker::LinkOnlyNeeded, *targetMachine,
+                                   baseBitcodeName, std::move(baseBitcode),
+                                   setAlwaysInline))) {
+        return mlir::emitError(variantOp.getLoc())
+               << "failed linking in base ukernel bitcode";
+      }
     }
 
     // Strip any compiler identifiers that may have snuck in. We let the linker
@@ -752,35 +776,7 @@
             break;
         }
       }
-      switch (targetTriple.getArch()) {
-        case llvm::Triple::ArchType::arm:
-          format += "arm_32";
-          break;
-        case llvm::Triple::ArchType::aarch64:
-          format += "arm_64";
-          break;
-        case llvm::Triple::ArchType::riscv32:
-          format += "riscv_32";
-          break;
-        case llvm::Triple::ArchType::riscv64:
-          format += "riscv_64";
-          break;
-        case llvm::Triple::ArchType::wasm32:
-          format += "wasm_32";
-          break;
-        case llvm::Triple::ArchType::wasm64:
-          format += "wasm_64";
-          break;
-        case llvm::Triple::ArchType::x86:
-          format += "x86_32";
-          break;
-        case llvm::Triple::ArchType::x86_64:
-          format += "x86_64";
-          break;
-        default:
-          format += "unknown";
-          break;
-      }
+      format += getIreeArchNameForTargetTriple(targetTriple);
     }
 
     // Add some configurations to the `hal.executable.target` attribute.
diff --git a/runtime/src/iree/builtins/device/BUILD.bazel b/runtime/src/iree/builtins/device/BUILD.bazel
index 531d733..347eceb 100644
--- a/runtime/src/iree/builtins/device/BUILD.bazel
+++ b/runtime/src/iree/builtins/device/BUILD.bazel
@@ -50,41 +50,19 @@
 )
 
 # TODO(benvanik): rule for building a matrix of bitcode files.
-# TODO(benvanik): make some of these flags inside of iree_bitcode_library; maybe
-# via an iree_cpu_bitcode_library so that we can have an
-# iree_cuda_bitcode_library that can differ.
-
-BITCODE_COPTS = [
-    # C17 with no system deps.
-    "-std=c17",
-    "-nostdinc",
-    "-ffreestanding",
-
-    # Optimized and unstamped.
-    "-O3",
-    "-fno-ident",
-    "-fdiscard-value-names",
-
-    # Object file only in bitcode format:
-    "-c",
-    "-emit-llvm",
-
-    # Force the library into standalone mode (not linking into hosting apps).
-    "-DIREE_DEVICE_STANDALONE=1",
-]
 
 iree_bitcode_library(
     name = "libdevice_wasm32_generic",
     srcs = BITCODE_SRCS,
-    hdrs = BITCODE_HDRS,
-    copts = BITCODE_COPTS + ["--target=wasm32"],
+    arch = "wasm_32",
+    internal_hdrs = BITCODE_HDRS,
 )
 
 iree_bitcode_library(
     name = "libdevice_wasm64_generic",
     srcs = BITCODE_SRCS,
-    hdrs = BITCODE_HDRS,
-    copts = BITCODE_COPTS + ["--target=wasm64"],
+    arch = "wasm_64",
+    internal_hdrs = BITCODE_HDRS,
 )
 
 c_embed_data(
diff --git a/runtime/src/iree/builtins/device/CMakeLists.txt b/runtime/src/iree/builtins/device/CMakeLists.txt
index 19c83d3..427f29f 100644
--- a/runtime/src/iree/builtins/device/CMakeLists.txt
+++ b/runtime/src/iree/builtins/device/CMakeLists.txt
@@ -31,20 +31,8 @@
     libdevice_wasm32_generic
   SRCS
     "device_generic.c"
-  HDRS
-    "device.h"
-  COPTS
-    "-std=c17"
-    "-nostdinc"
-    "-ffreestanding"
-    "-O3"
-    "-fno-ident"
-    "-fdiscard-value-names"
-    "-c"
-    "-emit-llvm"
-    "-DIREE_DEVICE_STANDALONE=1"
-    "--target=wasm32"
-  PUBLIC
+  ARCH
+    wasm_32
 )
 
 iree_bitcode_library(
@@ -52,20 +40,8 @@
     libdevice_wasm64_generic
   SRCS
     "device_generic.c"
-  HDRS
-    "device.h"
-  COPTS
-    "-std=c17"
-    "-nostdinc"
-    "-ffreestanding"
-    "-O3"
-    "-fno-ident"
-    "-fdiscard-value-names"
-    "-c"
-    "-emit-llvm"
-    "-DIREE_DEVICE_STANDALONE=1"
-    "--target=wasm64"
-  PUBLIC
+  ARCH
+    wasm_64
 )
 
 iree_c_embed_data(
diff --git a/runtime/src/iree/builtins/ukernel/BUILD.bazel b/runtime/src/iree/builtins/ukernel/BUILD.bazel
index ee07a42..b97a6a5 100644
--- a/runtime/src/iree/builtins/ukernel/BUILD.bazel
+++ b/runtime/src/iree/builtins/ukernel/BUILD.bazel
@@ -5,7 +5,7 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 
 load("//build_tools/bazel:build_defs.oss.bzl", "iree_cmake_extra_content", "iree_runtime_cc_library")
-load("//build_tools/bazel:iree_bitcode_library.bzl", "iree_bitcode_library", "iree_link_bitcode")
+load("//build_tools/bazel:iree_bitcode_library.bzl", "iree_bitcode_library")
 load("//build_tools/embed_data:build_defs.bzl", "c_embed_data")
 
 package(
@@ -99,31 +99,6 @@
     inline = True,
 )
 
-BITCODE_COPTS = [
-    # C17 with no system deps.
-    "-std=c17",
-    "-nostdinc",
-    "-ffreestanding",
-
-    # Optimized and unstamped.
-    "-O3",
-    "-DNDEBUG",
-    "-fno-ident",
-    "-fdiscard-value-names",
-
-    # Set the size of wchar_t to 4 bytes (instead of 2 bytes).
-    # This must match what the runtime is built with.
-    "-fno-short-wchar",
-
-    # Object file only in bitcode format:
-    "-c",
-    "-emit-llvm",
-
-    # Force configure for X86_64.
-    "-target",
-    "x86_64-unknown-unknown-eabi-elf",
-]
-
 UKERNEL_BASE_SRCS = [
     "mmt4d.c",
     "mmt4d_tile.c",
@@ -131,54 +106,38 @@
     "pack_tile.c",
     "query_tile_sizes.c",
     "unpack_tile.c",
+    "weak.c",
 ]
 
-UKERNEL_HDRS = [
-    "//runtime/src/iree/builtins/ukernel:common.h",
-    "//runtime/src/iree/builtins/ukernel:pack.h",
-    "//runtime/src/iree/builtins/ukernel:exported_bits.h",
-    "//runtime/src/iree/builtins/ukernel:static_assert.h",
-    "//runtime/src/iree/builtins/ukernel:api.h",
-    "//runtime/src/iree/builtins/ukernel:unpack.h",
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes.h",
-    "//runtime/src/iree/builtins/ukernel:unpack_internal.h",
-    "//runtime/src/iree/builtins/ukernel:mmt4d.h",
-    "//runtime/src/iree/builtins/ukernel:mmt4d_internal.h",
-    "//runtime/src/iree/builtins/ukernel:pack_internal.h",
-    "//runtime/src/iree/builtins/ukernel:elementwise.h",
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes_internal.h",
-    "//runtime/src/iree/builtins/ukernel/arch/x86_64:common_x86_64.h",
-    "//runtime/src/iree/schemas:cpu_data.h",
-    "//runtime/src/iree/schemas:cpu_feature_bits.inl",
-]
+filegroup(
+    name = "bitcode_internal_headers",
+    srcs = internal_headers + [
+        "exported_bits.h",
+        "static_assert.h",
+        "//runtime/src/iree/schemas:cpu_data.h",
+        "//runtime/src/iree/schemas:cpu_feature_bits.inl",
+    ],
+)
 
 iree_bitcode_library(
     name = "ukernel_bitcode_base",
     srcs = UKERNEL_BASE_SRCS,
-    hdrs = UKERNEL_HDRS,
-    copts = BITCODE_COPTS,
-)
-
-iree_link_bitcode(
-    name = "ukernel_bitcode",
-    bitcode_files = [
-        "ukernel_bitcode_base.bc",
-        "arch/x86_64:ukernel_bitcode_x86_64_base.bc",
-        "arch/x86_64:ukernel_bitcode_x86_64_avx2_fma.bc",
-        "arch/x86_64:ukernel_bitcode_x86_64_avx512_base.bc",
-        "arch/x86_64:ukernel_bitcode_x86_64_avx512_vnni.bc",
+    internal_hdrs = [
+        ":bitcode_internal_headers",
+        "//runtime/src/iree/builtins/ukernel/arch/x86_64:bitcode_internal_headers",
     ],
 )
 
 c_embed_data(
-    name = "libukernel_bitcode",
+    name = "embed_ukernel_bitcode",
     srcs = [
-        ":ukernel_bitcode.bc",
+        ":ukernel_bitcode_base.bc",
+        "//runtime/src/iree/builtins/ukernel/arch/x86_64:ukernel_bitcode_x86_64.bc",
     ],
-    c_file_output = "libukernel.c",
+    c_file_output = "ukernel_bitcode.c",
     flatten = True,
-    h_file_output = "libukernel.h",
-    identifier = "iree_builtins_libukernel",
+    h_file_output = "ukernel_bitcode.h",
+    identifier = "iree_ukernel_bitcode",
     deps = [
         "//runtime/src:runtime_defines",
     ],
diff --git a/runtime/src/iree/builtins/ukernel/CMakeLists.txt b/runtime/src/iree/builtins/ukernel/CMakeLists.txt
index e9cd859..3292437 100644
--- a/runtime/src/iree/builtins/ukernel/CMakeLists.txt
+++ b/runtime/src/iree/builtins/ukernel/CMakeLists.txt
@@ -116,64 +116,23 @@
     "pack_tile.c"
     "query_tile_sizes.c"
     "unpack_tile.c"
-  HDRS
-    "//runtime/src/iree/builtins/ukernel/arch/x86_64:common_x86_64.h"
-    "//runtime/src/iree/builtins/ukernel:api.h"
-    "//runtime/src/iree/builtins/ukernel:common.h"
-    "//runtime/src/iree/builtins/ukernel:elementwise.h"
-    "//runtime/src/iree/builtins/ukernel:exported_bits.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d_internal.h"
-    "//runtime/src/iree/builtins/ukernel:pack.h"
-    "//runtime/src/iree/builtins/ukernel:pack_internal.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes_internal.h"
-    "//runtime/src/iree/builtins/ukernel:static_assert.h"
-    "//runtime/src/iree/builtins/ukernel:unpack.h"
-    "//runtime/src/iree/builtins/ukernel:unpack_internal.h"
-    "//runtime/src/iree/schemas:cpu_data.h"
-    "//runtime/src/iree/schemas:cpu_feature_bits.inl"
-  COPTS
-    "-std=c17"
-    "-nostdinc"
-    "-ffreestanding"
-    "-O3"
-    "-DNDEBUG"
-    "-fno-ident"
-    "-fdiscard-value-names"
-    "-fno-short-wchar"
-    "-c"
-    "-emit-llvm"
-    "-target"
-    "x86_64-unknown-unknown-eabi-elf"
-  PUBLIC
-)
-
-iree_link_bitcode(
-  NAME
-    ukernel_bitcode
-  SRCS
-    "arch/x86_64/ukernel_bitcode_x86_64_avx2_fma.bc"
-    "arch/x86_64/ukernel_bitcode_x86_64_avx512_base.bc"
-    "arch/x86_64/ukernel_bitcode_x86_64_avx512_vnni.bc"
-    "arch/x86_64/ukernel_bitcode_x86_64_base.bc"
-    "ukernel_bitcode_base.bc"
-  PUBLIC
+    "weak.c"
 )
 
 iree_c_embed_data(
   NAME
-    libukernel_bitcode
+    embed_ukernel_bitcode
   SRCS
-    "ukernel_bitcode.bc"
+    "runtime/src/iree/builtins/ukernel/arch/x86_64/ukernel_bitcode_x86_64.bc"
+    "ukernel_bitcode_base.bc"
   DEPS
 
   C_FILE_OUTPUT
-    "libukernel.c"
+    "ukernel_bitcode.c"
   H_FILE_OUTPUT
-    "libukernel.h"
+    "ukernel_bitcode.h"
   IDENTIFIER
-    "iree_builtins_libukernel"
+    "iree_ukernel_bitcode"
   FLATTEN
   PUBLIC
 )
diff --git a/runtime/src/iree/builtins/ukernel/arch/x86_64/BUILD.bazel b/runtime/src/iree/builtins/ukernel/arch/x86_64/BUILD.bazel
index aa4bbcc..183c6dd 100644
--- a/runtime/src/iree/builtins/ukernel/arch/x86_64/BUILD.bazel
+++ b/runtime/src/iree/builtins/ukernel/arch/x86_64/BUILD.bazel
@@ -5,7 +5,7 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 
 load("//build_tools/bazel:build_defs.oss.bzl", "iree_cmake_extra_content")
-load("//build_tools/bazel:iree_bitcode_library.bzl", "iree_bitcode_library")
+load("//build_tools/bazel:iree_bitcode_library.bzl", "iree_bitcode_library", "iree_link_bitcode")
 
 package(
     default_visibility = ["//visibility:public"],
@@ -24,49 +24,12 @@
     inline = True,
 )
 
-BITCODE_COPTS = [
-    # C17 with no system deps.
-    "-std=c17",
-    "-nostdinc",
-    "-ffreestanding",
-
-    # Optimized and unstamped.
-    "-O3",
-    "-DNDEBUG",
-    "-fno-ident",
-    "-fdiscard-value-names",
-
-    # Set the size of wchar_t to 4 bytes (instead of 2 bytes).
-    # This must match what the runtime is built with.
-    "-fno-short-wchar",
-
-    # Object file only in bitcode format:
-    "-c",
-    "-emit-llvm",
-
-    # Force configure for X86_64.
-    "-target",
-    "x86_64-unknown-unknown-eabi-elf",
-]
-
-UKERNEL_HDRS = [
-    "//runtime/src/iree/builtins/ukernel:common.h",
-    "//runtime/src/iree/builtins/ukernel:pack.h",
-    "//runtime/src/iree/builtins/ukernel:exported_bits.h",
-    "//runtime/src/iree/builtins/ukernel:static_assert.h",
-    "//runtime/src/iree/builtins/ukernel:api.h",
-    "//runtime/src/iree/builtins/ukernel:unpack.h",
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes.h",
-    "//runtime/src/iree/builtins/ukernel:unpack_internal.h",
-    "//runtime/src/iree/builtins/ukernel:mmt4d.h",
-    "//runtime/src/iree/builtins/ukernel:mmt4d_internal.h",
-    "//runtime/src/iree/builtins/ukernel:pack_internal.h",
-    "//runtime/src/iree/builtins/ukernel:elementwise.h",
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes_internal.h",
-    "//runtime/src/iree/builtins/ukernel/arch/x86_64:common_x86_64.h",
-    "//runtime/src/iree/schemas:cpu_data.h",
-    "//runtime/src/iree/schemas:cpu_feature_bits.inl",
-]
+filegroup(
+    name = "bitcode_internal_headers",
+    srcs = [
+        "//runtime/src/iree/builtins/ukernel/arch/x86_64:common_x86_64.h",
+    ],
+)
 
 UKERNEL_X86_64_BASE_SRCS = [
     "mmt4d_x86_64.c",
@@ -78,8 +41,11 @@
 iree_bitcode_library(
     name = "ukernel_bitcode_x86_64_base",
     srcs = UKERNEL_X86_64_BASE_SRCS,
-    hdrs = UKERNEL_HDRS,
-    copts = BITCODE_COPTS,
+    arch = "x86_64",
+    internal_hdrs = [
+        ":bitcode_internal_headers",
+        "//runtime/src/iree/builtins/ukernel:bitcode_internal_headers",
+    ],
 )
 
 UKERNEL_X86_64_AVX2_FMA_SRCS = [
@@ -88,13 +54,20 @@
     "unpack_x86_64_avx2_fma.c",
 ]
 
+UKERNEL_X86_64_AVX2_FMA_COPTS = [
+    "-mavx",
+    "-mavx2",
+    "-mfma",
+]
+
 iree_bitcode_library(
     name = "ukernel_bitcode_x86_64_avx2_fma",
     srcs = UKERNEL_X86_64_AVX2_FMA_SRCS,
-    hdrs = UKERNEL_HDRS,
-    copts = BITCODE_COPTS + [
-        "-mavx2",
-        "-mfma",
+    arch = "x86_64",
+    copts = UKERNEL_X86_64_AVX2_FMA_COPTS,
+    internal_hdrs = [
+        ":bitcode_internal_headers",
+        "//runtime/src/iree/builtins/ukernel:bitcode_internal_headers",
     ],
 )
 
@@ -104,16 +77,22 @@
     "unpack_x86_64_avx512_base.c",
 ]
 
+UKERNEL_X86_64_AVX512_BASE_COPTS = UKERNEL_X86_64_AVX2_FMA_COPTS + [
+    "-mavx512f",
+    "-mavx512vl",
+    "-mavx512cd",
+    "-mavx512bw",
+    "-mavx512dq",
+]
+
 iree_bitcode_library(
     name = "ukernel_bitcode_x86_64_avx512_base",
     srcs = UKERNEL_X86_64_AVX512_BASE_SRCS,
-    hdrs = UKERNEL_HDRS,
-    copts = BITCODE_COPTS + [
-        "-mavx512f",
-        "-mavx512vl",
-        "-mavx512cd",
-        "-mavx512bw",
-        "-mavx512dq",
+    arch = "x86_64",
+    copts = UKERNEL_X86_64_AVX512_BASE_COPTS,
+    internal_hdrs = [
+        ":bitcode_internal_headers",
+        "//runtime/src/iree/builtins/ukernel:bitcode_internal_headers",
     ],
 )
 
@@ -121,17 +100,28 @@
     "mmt4d_x86_64_avx512_vnni.c",
 ]
 
+UKERNEL_X86_64_AVX512_VNNI_COPTS = UKERNEL_X86_64_AVX512_BASE_COPTS + [
+    "-mavx512vnni",
+]
+
 iree_bitcode_library(
     name = "ukernel_bitcode_x86_64_avx512_vnni",
     srcs = UKERNEL_X86_64_AVX512_VNNI_SRCS,
-    hdrs = UKERNEL_HDRS,
-    copts = BITCODE_COPTS + [
-        "-mavx512f",
-        "-mavx512vl",
-        "-mavx512cd",
-        "-mavx512bw",
-        "-mavx512dq",
-        "-mavx512vnni",
+    arch = "x86_64",
+    copts = UKERNEL_X86_64_AVX512_VNNI_COPTS,
+    internal_hdrs = [
+        ":bitcode_internal_headers",
+        "//runtime/src/iree/builtins/ukernel:bitcode_internal_headers",
+    ],
+)
+
+iree_link_bitcode(
+    name = "ukernel_bitcode_x86_64",
+    bitcode_files = [
+        "ukernel_bitcode_x86_64_base.bc",
+        "ukernel_bitcode_x86_64_avx2_fma.bc",
+        "ukernel_bitcode_x86_64_avx512_base.bc",
+        "ukernel_bitcode_x86_64_avx512_vnni.bc",
     ],
 )
 
diff --git a/runtime/src/iree/builtins/ukernel/arch/x86_64/CMakeLists.txt b/runtime/src/iree/builtins/ukernel/arch/x86_64/CMakeLists.txt
index 94bc4e5..c08fc99 100644
--- a/runtime/src/iree/builtins/ukernel/arch/x86_64/CMakeLists.txt
+++ b/runtime/src/iree/builtins/ukernel/arch/x86_64/CMakeLists.txt
@@ -20,37 +20,8 @@
     "pack_x86_64.c"
     "query_tile_sizes_x86_64.c"
     "unpack_x86_64.c"
-  HDRS
-    "//runtime/src/iree/builtins/ukernel/arch/x86_64:common_x86_64.h"
-    "//runtime/src/iree/builtins/ukernel:api.h"
-    "//runtime/src/iree/builtins/ukernel:common.h"
-    "//runtime/src/iree/builtins/ukernel:elementwise.h"
-    "//runtime/src/iree/builtins/ukernel:exported_bits.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d_internal.h"
-    "//runtime/src/iree/builtins/ukernel:pack.h"
-    "//runtime/src/iree/builtins/ukernel:pack_internal.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes_internal.h"
-    "//runtime/src/iree/builtins/ukernel:static_assert.h"
-    "//runtime/src/iree/builtins/ukernel:unpack.h"
-    "//runtime/src/iree/builtins/ukernel:unpack_internal.h"
-    "//runtime/src/iree/schemas:cpu_data.h"
-    "//runtime/src/iree/schemas:cpu_feature_bits.inl"
-  COPTS
-    "-std=c17"
-    "-nostdinc"
-    "-ffreestanding"
-    "-O3"
-    "-DNDEBUG"
-    "-fno-ident"
-    "-fdiscard-value-names"
-    "-fno-short-wchar"
-    "-c"
-    "-emit-llvm"
-    "-target"
-    "x86_64-unknown-unknown-eabi-elf"
-  PUBLIC
+  ARCH
+    x86_64
 )
 
 iree_bitcode_library(
@@ -60,39 +31,12 @@
     "mmt4d_x86_64_avx2_fma.c"
     "pack_x86_64_avx2_fma.c"
     "unpack_x86_64_avx2_fma.c"
-  HDRS
-    "//runtime/src/iree/builtins/ukernel/arch/x86_64:common_x86_64.h"
-    "//runtime/src/iree/builtins/ukernel:api.h"
-    "//runtime/src/iree/builtins/ukernel:common.h"
-    "//runtime/src/iree/builtins/ukernel:elementwise.h"
-    "//runtime/src/iree/builtins/ukernel:exported_bits.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d_internal.h"
-    "//runtime/src/iree/builtins/ukernel:pack.h"
-    "//runtime/src/iree/builtins/ukernel:pack_internal.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes_internal.h"
-    "//runtime/src/iree/builtins/ukernel:static_assert.h"
-    "//runtime/src/iree/builtins/ukernel:unpack.h"
-    "//runtime/src/iree/builtins/ukernel:unpack_internal.h"
-    "//runtime/src/iree/schemas:cpu_data.h"
-    "//runtime/src/iree/schemas:cpu_feature_bits.inl"
   COPTS
-    "-std=c17"
-    "-nostdinc"
-    "-ffreestanding"
-    "-O3"
-    "-DNDEBUG"
-    "-fno-ident"
-    "-fdiscard-value-names"
-    "-fno-short-wchar"
-    "-c"
-    "-emit-llvm"
-    "-target"
-    "x86_64-unknown-unknown-eabi-elf"
+    "-mavx"
     "-mavx2"
     "-mfma"
-  PUBLIC
+  ARCH
+    x86_64
 )
 
 iree_bitcode_library(
@@ -102,42 +46,17 @@
     "mmt4d_x86_64_avx512_base.c"
     "pack_x86_64_avx512_base.c"
     "unpack_x86_64_avx512_base.c"
-  HDRS
-    "//runtime/src/iree/builtins/ukernel/arch/x86_64:common_x86_64.h"
-    "//runtime/src/iree/builtins/ukernel:api.h"
-    "//runtime/src/iree/builtins/ukernel:common.h"
-    "//runtime/src/iree/builtins/ukernel:elementwise.h"
-    "//runtime/src/iree/builtins/ukernel:exported_bits.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d_internal.h"
-    "//runtime/src/iree/builtins/ukernel:pack.h"
-    "//runtime/src/iree/builtins/ukernel:pack_internal.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes_internal.h"
-    "//runtime/src/iree/builtins/ukernel:static_assert.h"
-    "//runtime/src/iree/builtins/ukernel:unpack.h"
-    "//runtime/src/iree/builtins/ukernel:unpack_internal.h"
-    "//runtime/src/iree/schemas:cpu_data.h"
-    "//runtime/src/iree/schemas:cpu_feature_bits.inl"
   COPTS
-    "-std=c17"
-    "-nostdinc"
-    "-ffreestanding"
-    "-O3"
-    "-DNDEBUG"
-    "-fno-ident"
-    "-fdiscard-value-names"
-    "-fno-short-wchar"
-    "-c"
-    "-emit-llvm"
-    "-target"
-    "x86_64-unknown-unknown-eabi-elf"
+    "-mavx"
+    "-mavx2"
+    "-mfma"
     "-mavx512f"
     "-mavx512vl"
     "-mavx512cd"
     "-mavx512bw"
     "-mavx512dq"
-  PUBLIC
+  ARCH
+    x86_64
 )
 
 iree_bitcode_library(
@@ -145,43 +64,29 @@
     ukernel_bitcode_x86_64_avx512_vnni
   SRCS
     "mmt4d_x86_64_avx512_vnni.c"
-  HDRS
-    "//runtime/src/iree/builtins/ukernel/arch/x86_64:common_x86_64.h"
-    "//runtime/src/iree/builtins/ukernel:api.h"
-    "//runtime/src/iree/builtins/ukernel:common.h"
-    "//runtime/src/iree/builtins/ukernel:elementwise.h"
-    "//runtime/src/iree/builtins/ukernel:exported_bits.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d.h"
-    "//runtime/src/iree/builtins/ukernel:mmt4d_internal.h"
-    "//runtime/src/iree/builtins/ukernel:pack.h"
-    "//runtime/src/iree/builtins/ukernel:pack_internal.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes.h"
-    "//runtime/src/iree/builtins/ukernel:query_tile_sizes_internal.h"
-    "//runtime/src/iree/builtins/ukernel:static_assert.h"
-    "//runtime/src/iree/builtins/ukernel:unpack.h"
-    "//runtime/src/iree/builtins/ukernel:unpack_internal.h"
-    "//runtime/src/iree/schemas:cpu_data.h"
-    "//runtime/src/iree/schemas:cpu_feature_bits.inl"
   COPTS
-    "-std=c17"
-    "-nostdinc"
-    "-ffreestanding"
-    "-O3"
-    "-DNDEBUG"
-    "-fno-ident"
-    "-fdiscard-value-names"
-    "-fno-short-wchar"
-    "-c"
-    "-emit-llvm"
-    "-target"
-    "x86_64-unknown-unknown-eabi-elf"
+    "-mavx"
+    "-mavx2"
+    "-mfma"
     "-mavx512f"
     "-mavx512vl"
     "-mavx512cd"
     "-mavx512bw"
     "-mavx512dq"
     "-mavx512vnni"
-  PUBLIC
+  ARCH
+    x86_64
+)
+
+iree_link_bitcode(
+  NAME
+    ukernel_bitcode_x86_64
+  SRCS
+    "ukernel_bitcode_x86_64_avx2_fma.bc"
+    "ukernel_bitcode_x86_64_avx512_base.bc"
+    "ukernel_bitcode_x86_64_avx512_vnni.bc"
+    "ukernel_bitcode_x86_64_base.bc"
+
 )
 
 endif()  # IREE_BUILD_COMPILER AND IREE_TARGET_BACKEND_LLVM_CPU