Separate architecture generic<->specific bitcode (#13825)

This is the main PR towards #13804. `iree_bitcode_library` gains the
ability to produce either arch-specific or generic bitcode. We build
the architecture-specific parts of ukernel code (what's under
`ukernel/arch/`) separately from the generic parts (what's directly in
`ukernel/`). Then in the compiler, we unconditionally load the generic
bitcode, plus architecture-specific bitcode if any is available for the
target architecture.
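
As a rough illustration of that loading rule, here is a minimal Python
sketch. It is not the compiler's actual code: the function name
`bitcode_modules_to_load` and the file names are hypothetical, chosen
only to show "generic always, arch-specific when present".

```python
from typing import Dict, List

# Hypothetical file name for the single generic module.
GENERIC_BITCODE = "ukernel_generic.bc"


def bitcode_modules_to_load(
    target_arch: str, arch_bitcode: Dict[str, str]
) -> List[str]:
    """Returns the bitcode modules to load for `target_arch`.

    `arch_bitcode` maps an IREE_ARCH value (e.g. "arm_64") to the
    architecture-specific module available for it, if any.
    """
    # The generic module is always loaded, whatever the target is.
    modules = [GENERIC_BITCODE]
    # Architecture-specific bitcode is optional: add it only if it exists.
    if target_arch in arch_bitcode:
        modules.append(arch_bitcode[target_arch])
    return modules
```

With `arch_bitcode = {"arm_64": "ukernel_arm_64.bc"}`, a `riscv_32` target
would get only the generic module, while `arm_64` would get both.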

Before you ask: why not just produce N side-by-side,
architecture-specific bitcode modules, one per architecture that we care
about? We want microkernels to just work, all the time, not be forever
stuck in "advanced feature that may cause trouble" limbo. Since lacking
a required microkernel is a linker error (unless perhaps you go through
the trouble of linking a
[plugin](https://github.com/openxla/iree/tree/main/experimental/cpu_ukernel)
at runtime), we want to always unconditionally have bitcode for all
ukernels for all architectures, even the ones that we don't have really
optimized microkernels for yet and just want functional correctness for.
That means at least 8 architectures today
(`{x86,arm,riscv,wasm}_{32,64}`), probably dozens in the future. So that
would be a lot of side-by-side copies, and we would start to be
reluctant to add more ukernels. By contrast, if we can get
architecture-generic bitcode to work (as this PR does) then we can have
a single copy of that architecture-generic bitcode regardless of the
number of target architectures supported; any additional,
architecture-specific bitcode is proportional to the engineering effort
invested in optimizing for each target architecture.

So that's why I think architecture-generic bitcode is worth the effort.

The central difficulty is that Clang doesn't have any switch to
directly produce target-independent bitcode.

From Clang's perspective (which IIUC is well summarized by [this
answer](https://stackoverflow.com/questions/71868733/how-to-make-target-independent-ir-with-llvm)),
target-independence is a property of the source language, and C isn't a
target-independent language in general.

But ukernel code isn't just any C code: it's C code that's carefully
written to be target-independent outside of that `arch` subdir:
* We don't use target-dependent types (e.g. `ssize_t`), only fixed-width
types (e.g. `iree_uk_ssize_t` is `iree_uk_int64_t`, see #13834).
* We do use pointers, which are technically target-dependent, but that
target-dependence doesn't appear until later down the lowerings: as we
are outputting LLVM IR here, pointers are still an opaque `ptr` type.
* We don't do `#if` based on target-dependent tokens. Selection of
architecture-specific code paths has been reimplemented as strong
symbols (in architecture-specific code) overriding weak symbols (in
architecture-independent code) in #13715.
* We don't `#include` any standard library or system header, so our code
is truly self-contained, and that's guarded by the flags we pass Clang
when compiling to bitcode.

So we are in a special case here, and it's not unreasonable to think
that we know better than Clang and try to work past its reluctance to
produce target-independent IR.

Inspecting the IR produced from compiling our architecture-independent
ukernel files showed that the target-dependence in the resulting IR is
limited to a few target attributes and a target triple, which were
added automatically but don't seem to play any role. Editing these away
made `llc` happy to compile that IR to *another* target architecture.

This motivated the approach in this PR: a `strip_target_info.py` script
simply drops the target details from LLVM IR.
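
To make the idea concrete, here is a minimal Python sketch of such a
stripping pass over textual LLVM IR. It is an illustration, not the
contents of `strip_target_info.py`: the function name and the exact set
of attributes removed (`target-cpu`, `target-features`, `tune-cpu`) are
assumptions.

```python
import re

# Assumption: these string attributes are the ones carrying target details.
# The real script may remove a different set.
_TARGET_ATTR = re.compile(
    r'\s*"(?:target-cpu|target-features|tune-cpu)"="[^"]*"'
)
_TARGET_TRIPLE = re.compile(r"^\s*target triple\s*=")


def strip_target_info(ir_text: str) -> str:
    """Returns `ir_text` without its target triple and target attributes."""
    kept = []
    for line in ir_text.splitlines():
        # Drop the module-level `target triple = "..."` line entirely.
        if _TARGET_TRIPLE.match(line):
            continue
        # Remove target attributes from attribute groups, keeping the rest.
        kept.append(_TARGET_ATTR.sub("", line))
    return "\n".join(kept) + "\n"
```

The Bazel rule in this PR invokes the real script as a filter (IR on
stdin, stripped IR on stdout), so a sketch like this would be wired up by
passing `sys.stdin.read()` through the function and writing the result to
`sys.stdout`.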

`iree_bitcode_library` gains an `arch=` parameter. When not specified,
IR is processed with `strip_target_info.py`. When specified, IR is left
unprocessed and the right `-target` flag is passed. Generally, all the
copts are now set automatically by `iree_bitcode_library`, though each
call site may still override anything as usual (rule copts are
appended after the defaults).
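
The flag selection described above can be summarized by the following
Python sketch. It mirrors the Starlark logic in this PR but is not the
rule itself: `clang_flags` is a hypothetical helper, and the base flag
list is abbreviated.

```python
from typing import List, Optional, Sequence

# Same mapping as iree_arch_to_llvm_arch in this PR.
IREE_TO_LLVM_ARCH = {
    "arm_64": "aarch64",
    "arm_32": "arm",
    "x86_64": "x86_64",
    "x86_32": "i386",
    "riscv_64": "riscv64",
    "riscv_32": "riscv32",
    "wasm_64": "wasm64",
    "wasm_32": "wasm32",
}

# Abbreviated subset of the default flags set by the rule.
BASE_COPTS = ["-std=c17", "-nostdinc", "-ffreestanding", "-O3", "-c", "-emit-llvm"]


def clang_flags(arch: Optional[str] = None, copts: Sequence[str] = ()) -> List[str]:
    """Hypothetical helper: assembles Clang flags for one source file."""
    flags = list(BASE_COPTS)
    if arch:
        # Architecture-specific bitcode: compile for that target, no stripping.
        flags += ["-target", IREE_TO_LLVM_ARCH[arch]]
    else:
        # Generic bitcode: emit textual IR so it can be post-processed.
        flags.append("-S")
    # Call-site copts come last so they can override the defaults.
    return flags + list(copts)
```

For example, `clang_flags("arm_64", ["-march=armv8.2-a+dotprod"])` ends
with `-target aarch64 -march=armv8.2-a+dotprod`, whereas `clang_flags()`
ends with `-S` and its output would then go through the stripping step.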
diff --git a/build_tools/bazel/iree_bitcode_library.bzl b/build_tools/bazel/iree_bitcode_library.bzl
index b4f9b38..09282f3 100644
--- a/build_tools/bazel/iree_bitcode_library.bzl
+++ b/build_tools/bazel/iree_bitcode_library.bzl
@@ -6,65 +6,156 @@
 
 """Rules for compiling with clang to produce bitcode libraries."""
 
+def iree_arch_to_llvm_arch(
+        iree_arch = None):
+    """Converts an IREE_ARCH value to the corresponding LLVM arch name.
+
+    Similar to the CMake function with the same name.
+
+    Args:
+        iree_arch: IREE_ARCH string value.
+
+    Returns:
+        The LLVM name for that architecture (first component of target triple).
+    """
+
+    if not iree_arch:
+        return None
+    if iree_arch == "arm_64":
+        return "aarch64"
+    if iree_arch == "arm_32":
+        return "arm"
+    if iree_arch == "x86_64":
+        return "x86_64"
+    if iree_arch == "x86_32":
+        return "i386"
+    if iree_arch == "riscv_64":
+        return "riscv64"
+    if iree_arch == "riscv_32":
+        return "riscv32"
+    if iree_arch == "wasm_64":
+        return "wasm64"
+    if iree_arch == "wasm_32":
+        return "wasm32"
+    fail("Unhandled IREE_ARCH value %s" % iree_arch)
+
 def iree_bitcode_library(
         name,
         srcs,
-        hdrs = [],
+        internal_hdrs = [],
         copts = [],
-        defines = [],
-        data = [],
         out = None,
-        clang_tool = "@llvm-project//clang:clang",
-        link_tool = "@llvm-project//llvm:llvm-link",
-        builtin_headers_dep = "@llvm-project//clang:builtin_headers_gen",
-        builtin_headers_path = "external/llvm-project/clang/staging/include/",
+        arch = None,
         **kwargs):
     """Builds an LLVM bitcode library from an input file via clang.
 
     Args:
         name: Name of the target.
+        arch: Target architecture to compile for, in IREE_ARCH format. If left
+              empty, will produce architecture-independent bitcode by stripping
+              target triple and target attributes; that only makes sense if the
+              sources being compiled are truly architecture-independent.
         srcs: source files to pass to clang.
-        hdrs: additional headers included by the source files.
+        internal_hdrs: all headers transitively included by the source files.
+                       Unlike typical Bazel `hdrs`, these are not exposed as
+                       interface headers. This would normally be part of `srcs`,
+                       but separating it was easier for `bazel_to_cmake`, as
+                       CMake does not need this, and making this explicitly
+                       Bazel-only allows using `filegroup` on the Bazel side.
         copts: additional flags to pass to clang.
-        defines: preprocessor definitions to pass to clang.
-        data: additional data required during compilation.
         out: output file name (defaults to name.bc).
-        clang_tool: the clang to use to compile the source.
-        link_tool: llvm-link tool used for linking bitcode files.
-        builtin_headers_dep: clang builtin headers (stdbool, stdint, etc).
-        builtin_headers_path: relative path to the builtin headers rule.
         **kwargs: any additional attributes to pass to the underlying rules.
     """
 
+    clang_tool = "@llvm-project//clang:clang"
+    link_tool = "@llvm-project//llvm:llvm-link"
+    builtin_headers_dep = "@llvm-project//clang:builtin_headers_gen"
+    builtin_headers_path = "external/llvm-project/clang/staging/include/"
+
+    base_copts = [
+        # C17 with no system deps.
+        "-std=c17",
+        "-nostdinc",
+        "-ffreestanding",
+
+        # Optimized and unstamped.
+        "-O3",
+        "-DNDEBUG",
+        "-fno-ident",
+        "-fdiscard-value-names",
+
+        # Set the size of wchar_t to 4 bytes (instead of 2 bytes).
+        # This must match what the runtime is built with.
+        "-fno-short-wchar",
+
+        # Object file only in bitcode format:
+        "-c",
+        "-emit-llvm",
+
+        # Force the library into standalone mode (not depending on build-directory
+        # configuration).
+        "-DIREE_DEVICE_STANDALONE=1",
+    ]
+
+    llvmir_processing_tool = None
+    if arch:
+        # Compile to the specified target architecture.
+        base_copts.extend(["-target", iree_arch_to_llvm_arch(arch)])
+    else:
+        # Output text rather than binary serialization of LLVM IR for processing
+        base_copts.append("-S")
+
+        # Strip target information from generated LLVM IR.
+        llvmir_processing_tool = "//build_tools/scripts:strip_target_info"
+
     bitcode_files = []
-    for bitcode_src in srcs:
-        bitcode_out = "%s_%s.bc" % (name, bitcode_src)
-        bitcode_files.append(bitcode_out)
-        system_headers = ["immintrin.h"]
+    for src in srcs:
+        bitcode_out = "%s_%s.bc" % (name, src)
         native.genrule(
             name = "gen_%s" % (bitcode_out),
-            srcs = [bitcode_src] + hdrs + [builtin_headers_dep],
+            srcs = [src, builtin_headers_dep] + internal_hdrs,
             outs = [bitcode_out],
             cmd = " && ".join([
                 " ".join([
                     "$(location %s)" % (clang_tool),
                     "-isystem $(BINDIR)/%s" % builtin_headers_path,
-                    " ".join(copts),
-                    " ".join(["-D%s" % (define) for define in defines]),
+                    " ".join(base_copts + copts),
                     " ".join(["-I $(BINDIR)/runtime/src"]),
                     " ".join(["-I runtime/src"]),
                     "-o $(location %s)" % (bitcode_out),
-                    "$(location %s)" % (bitcode_src),
+                    "$(location %s)" % (src),
                 ]),
             ]),
-            tools = data + [
+            tools = [
                 clang_tool,
             ],
-            message = "Compiling %s to %s..." % (bitcode_src, bitcode_out),
+            message = "Compiling %s to %s..." % (src, bitcode_out),
             output_to_bindir = 1,
             **kwargs
         )
 
+        if llvmir_processing_tool:
+            processed_bitcode_out = "%s_%s.processed.bc" % (name, src)
+            native.genrule(
+                name = "gen_%s" % (processed_bitcode_out),
+                srcs = [bitcode_out],
+                outs = [processed_bitcode_out],
+                cmd = " ".join([
+                    "$(location %s)" % (llvmir_processing_tool),
+                    "< $(location %s)" % bitcode_out,
+                    "> $(location %s)" % processed_bitcode_out,
+                ]),
+                tools = [
+                    llvmir_processing_tool,
+                ],
+                message = "Processing %s into %s using %s..." % (bitcode_out, processed_bitcode_out, llvmir_processing_tool),
+                output_to_bindir = 1,
+                **kwargs
+            )
+            bitcode_files.append(processed_bitcode_out)
+        else:
+            bitcode_files.append(bitcode_out)
+
     if not out:
         out = "%s.bc" % (name)
     native.genrule(
@@ -78,7 +169,7 @@
                 " ".join(["$(locations %s)" % (src) for src in bitcode_files]),
             ]),
         ]),
-        tools = data + [link_tool],
+        tools = [link_tool],
         message = "Linking bitcode library %s to %s..." % (name, out),
         output_to_bindir = 1,
         **kwargs