# Custom CPU Dispatch Functions for Statically-linked Embedded ELFs

See the [custom_dispatch README](/samples/custom_dispatch/README.md) for an
overview of this approach.

This sample demonstrates how to define external device functions that can be
dispatched from within IREE programs via simple function calls. Here the
functions are declared in the MLIR executables, called as normal calls, and
then defined in a `.c` file that is cross-compiled for various architectures.
An attribute on each executable specifies which object files to link against;
the compiler uses it when performing its final linking and runs LTO to optimize
across both the generated and hand-authored portions.

### Work in Progress

The calling convention used for passing pointers is currently unwieldy: the MLIR
`memref` type is used to model buffer references, and each one expands to many
arguments (five per 1-D buffer in the sample below: base pointer, aligned
pointer, offset, size, and stride). Future revisions will just pass the pointers
instead.

Currently weak linkage is not available, and external functions must always be
provided when referenced. In future versions, fallback IR will allow object
files to be specified only for certain platforms while allowing others to be
generated via normal codegen paths.

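
As an illustration of why the expanded convention is unwieldy, the sketch below
groups the five scalars passed per buffer into a struct. It assumes the standard
MLIR memref descriptor layout for a rank-1 buffer (allocated pointer, aligned
pointer, offset, size, stride). The `memref_1d_f32_t` type and `memref_1d_load`
helper are hypothetical names used only for this illustration; they are not part
of IREE or of this sample.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper type (not an IREE API): the five scalars that one
// rank-1 memref<?xf32> argument expands to, assuming the standard MLIR
// memref descriptor layout.
typedef struct {
  const float* base;     // allocated pointer (e.g. `binding0`)
  const float* aligned;  // aligned pointer (e.g. `binding0_aligned`)
  size_t offset;         // element offset from the aligned pointer
  size_t size;           // number of elements in the view
  size_t stride;         // distance between consecutive elements, in elements
} memref_1d_f32_t;

// Loads logical element `i` of the view described by `m`.
static inline float memref_1d_load(memref_1d_f32_t m, size_t i) {
  assert(i < m.size);
  return m.aligned[m.offset + i * m.stride];
}
```

Under this layout, indexing the first pointer directly (as the sample function
does) matches `memref_1d_load` only when both pointers are equal, the offset is
0, and the stride is 1.
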
## Workflow

```
+-------------+               +---------------------+       +--------------+
| functions.c | -> clang -+-> | functions_aarch64.o | -+    | example.mlir |
+-------------+           |   +---------------------+  |    +--------------+
                          |   +---------------------+  |           v
                          +-> | functions_x86_64.o  | -+----> iree-compile
                              +---------------------+              v
                                                            +--------------+
                                                            | example.vmfb |
                                                            +--------------+
```

1. The user authors their functions in bare-metal C (no TLS, no threads, no
   malloc, etc). These functions can cover entire workgroups (and a dispatch can
   be a single workgroup so effectively just function calls) or be utilities
   used by the function for localized work (microkernels, data type conversion,
   etc). It's important to remember that parallelism scheduling is done
   _outside_ of the function via the workgroup count and multiple threads may be
   executing the function at any time.

   ```c
   #include <stddef.h>  // size_t

   // NOTE: this will be simplified in the future:
   //  void simple_mul_workgroup(
   //    const float* restrict binding0, const float* restrict binding1,
   //    float* restrict binding2, size_t dim, size_t tid);
   void simple_mul_workgroup(
       const float* restrict binding0, const float* restrict binding0_aligned,
       size_t binding0_offset, size_t binding0_size, size_t binding0_stride,
       const float* restrict binding1, const float* restrict binding1_aligned,
       size_t binding1_offset, size_t binding1_size, size_t binding1_stride,
       float* restrict binding2, float* restrict binding2_aligned,
       size_t binding2_offset, size_t binding2_size, size_t binding2_stride,
       size_t dim, size_t tid) {
     // Each invocation handles up to 64 elements starting at `tid`, clamped to
     // `dim` so the last workgroup does not run past the end of the buffers.
     size_t end = tid + 64;
     if (end > dim) end = dim;
     for (size_t i = tid; i < end; ++i) {
       binding2[i] = binding0[i] * binding1[i];
     }
   }
   ```

2. Source files are compiled to object files with bare-metal settings. Each
   architecture the user is targeting will need its own object file(s).

   ```sh
   clang -target aarch64 ...[see CMakeLists.txt]... functions.c -o functions_aarch64.o
   ```

3. The user (or compiler transforms) adds calls to their functions by declaring
   them and marking them as statically-linked.

   ```mlir
   func.func private @simple_mul_workgroup(
       %binding0: memref<?xf32>, %binding1: memref<?xf32>, %binding2: memref<?xf32>,
       %dim: index, %tid: index) attributes {hal.import.static}
   ...
   func.call @simple_mul_workgroup(%memref0, %memref1, %memref2, %dim, %tid) : (memref<?xf32>, memref<?xf32>, memref<?xf32>, index, index) -> ()
   ```

4. The user (or compiler transforms) annotates the executables with the object
   files to link against, which provide the function definitions.

   ```mlir
   stream.executable private @executable attributes {
     hal.executable.objects = #hal.executable.objects<{
       #aarch64_target = [
         #hal.executable.object<{path = "functions_aarch64.o"}>
       ]
     }>
   } {
     ...
   }
   ```

5. The IREE compiler selects the appropriate object files for the target
   configuration and links them into the binaries it produces.

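
To make the scheduling contract from step 1 concrete, the following host-side
sketch stands in for the runtime and invokes the workgroup function once per
tile. It is not how IREE itself dispatches work: `run_all_workgroups` is a
hypothetical helper, the simplified signature is the one from the NOTE comment
in step 1, and the mapping `tid = workgroup index * 64` is an assumption based
on the `tid + 64` bound in the sample.

```c
#include <assert.h>
#include <stddef.h>

// Tile width, assumed from the `tid + 64` bound in the sample function.
enum { TILE = 64 };

// Simplified form of the sample function (signature from its NOTE comment).
static void simple_mul_workgroup(const float* restrict binding0,
                                 const float* restrict binding1,
                                 float* restrict binding2, size_t dim,
                                 size_t tid) {
  size_t end = tid + TILE;
  if (end > dim) end = dim;
  for (size_t i = tid; i < end; ++i) {
    binding2[i] = binding0[i] * binding1[i];
  }
}

// Hypothetical stand-in for the scheduler: derives the workgroup count from
// `dim` and calls the function once per workgroup. Returns the count.
static size_t run_all_workgroups(const float* lhs, const float* rhs,
                                 float* out, size_t dim) {
  // `restrict` in the callee requires that the output not alias the inputs.
  assert(out != lhs && out != rhs);
  size_t workgroup_count = (dim + TILE - 1) / TILE;  // ceil(dim / TILE)
  for (size_t wg = 0; wg < workgroup_count; ++wg) {
    simple_mul_workgroup(lhs, rhs, out, dim, wg * TILE);
  }
  return workgroup_count;
}
```

The loop here is sequential, but in a real dispatch the calls may run
concurrently on multiple threads. That is safe for this function because each
call writes a disjoint range of the output and keeps no state between calls.
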
## Instructions

This presumes that `iree-compile` and `iree-run-module` have been installed or
built. See the
[getting started guide](https://iree.dev/building-from-source/getting-started/)
for instructions on CMake setup and building from source.

0. Ensure that `clang` is on your PATH:

   ```
   clang --version
   ```

1. Build the `iree-sample-deps` CMake target to compile
   [functions.c](./functions.c) to object files for aarch64 and x86_64:

   ```
   cmake --build ../iree-build/ --target iree-sample-deps
   ```

   In a user application this would be replaced with whatever build
   infrastructure the user has for compiling code to object files. No IREE
   compiler or runtime changes are required and the normal compiler install can
   be used. Note that specific flags are required when producing the object
   files.

2. Compile the [example module](./example_stream.mlir) to a .vmfb file and pass
   the path to the build directory so the .o files can be found:

   ```
   iree-compile \
       --iree-hal-executable-object-search-path=../iree-build/ \
       samples/custom_dispatch/cpu/embedded/example_stream.mlir \
       -o=/tmp/example.vmfb
   ```

   [example_stream.mlir](./example_stream.mlir) demonstrates a high-level
   approach without needing to specify too much information while
   [example_hal.mlir](./example_hal.mlir) shows the lower-level representation
   it gets expanded into.

3. Run the example program using the custom functions:

   ```
   iree-run-module \
       --device=local-sync \
       --function=mixed_invocation \
       --input=8xf32=2 \
       --input=8xf32=4 \
       --module=/tmp/example.vmfb
   ```

## Custom Kernel Match and Replace Scripting Instructions

Follow steps 0 and 1 above to build the samples, and then compile with one
additional flag that gives the path to the kernel matcher and replacer:

```
iree-compile \
    --iree-hal-executable-object-search-path=../iree-build/ \
    --iree-preprocessing-transform-spec-filename=samples/custom_dispatch/cpu/embedded/example_transform_spec.mlir \
    samples/custom_dispatch/cpu/embedded/example_transform.mlir \
    -o=/tmp/example.vmfb
```

Then run the example the same way, this time passing four inputs:

```
iree-run-module \
    --device=local-sync \
    --function=mixed_invocation \
    --input=5xf32=7 \
    --input=5xf32=4 \
    --input=10xf32=-4 \
    --input=10xf32=3 \
    --module=/tmp/example.vmfb
```