# Custom CPU Dispatch Functions for Statically-linked Embedded ELFs
See the [custom_dispatch README](/samples/custom_dispatch/README.md) for an
overview of this approach.
This sample demonstrates how to define external device functions that can be
dispatched from within IREE programs via simple function calls. Here the
functions are declared in the MLIR executables, called as normal calls, and
then defined in a .c file that is cross-compiled for various architectures.
The compiler reads the `hal.executable.objects` attribute to determine which
object files to link against during its final link step, and runs link-time
optimization (LTO) across both the generated and hand-authored code.
### Work in Progress
The calling convention for passing pointers is currently unwieldy: the MLIR
`memref` type is used to model buffer references, and each `memref` expands to
several arguments (base pointer, aligned pointer, offset, size, and stride for
the rank-1 buffers in this sample). Future revisions will pass only the
pointers.
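
As a rough illustration of the expansion described above, the sketch below
gathers the five values of one rank-1 `memref<?xf32>` argument into a struct and
applies the standard MLIR strided-addressing rule. The struct and helper names
are hypothetical and not part of IREE; the field order is assumed to follow the
parameter order of the C function shown later in this sample.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper type (not an IREE API): the five values that one
// rank-1 memref<?xf32> argument expands to, in the order used by the sample.
typedef struct {
  const float* base;     // allocated (base) pointer
  const float* aligned;  // aligned pointer used for addressing
  size_t offset;         // element offset from `aligned`
  size_t size;           // number of elements
  size_t stride;         // element stride between consecutive entries
} memref_1d_f32_t;

// Loads element `i` using strided addressing:
//   address = aligned + offset + i * stride
static inline float memref_1d_f32_load(memref_1d_f32_t m, size_t i) {
  assert(i < m.size);
  return m.aligned[m.offset + i * m.stride];
}
```

Indexing the base pointer directly gives the same result only when the base and
aligned pointers coincide, the offset is 0, and the stride is 1.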
Currently, weak linkage is not available, and external functions must always be
provided when referenced. In future versions, fallback IR will allow object
files to be specified only for certain platforms while code for the others is
generated via the normal codegen paths.
## Workflow
```
+-------------+               +---------------------+       +--------------+
| functions.c | -> clang -+-> | functions_aarch64.o | -+    | example.mlir |
+-------------+           |   +---------------------+  |    +--------------+
                          |   +---------------------+  |           v
                          +-> | functions_x86_64.o  | -+----> iree-compile
                              +---------------------+              v
                                                            +--------------+
                                                            | example.vmfb |
                                                            +--------------+
```
1. The user authors their functions in bare-metal C (no TLS, no threads, no
malloc, etc.). These functions can cover entire workgroups (a dispatch can be
a single workgroup, making it effectively a plain function call) or be
utilities used by the function for localized work (microkernels, data type
conversion, etc.). Parallelism is scheduled _outside_ of the function via the
workgroup count, so multiple threads may be executing the function at any
time.
```c
// NOTE: this will be simplified in the future:
//   void simple_mul_workgroup(
//       const float* restrict binding0, const float* restrict binding1,
//       float* restrict binding2, size_t dim, size_t tid);
void simple_mul_workgroup(
    const float* restrict binding0, const float* restrict binding0_aligned,
    size_t binding0_offset, size_t binding0_size, size_t binding0_stride,
    const float* restrict binding1, const float* restrict binding1_aligned,
    size_t binding1_offset, size_t binding1_size, size_t binding1_stride,
    float* restrict binding2, float* restrict binding2_aligned,
    size_t binding2_offset, size_t binding2_size, size_t binding2_stride,
    size_t dim, size_t tid) {
  // Each invocation handles up to 64 elements starting at `tid`, clamped to
  // the total element count `dim`.
  size_t end = tid + 64;
  if (end > dim) end = dim;
  for (size_t i = tid; i < end; ++i) {
    binding2[i] = binding0[i] * binding1[i];
  }
}
```
2. Source files are compiled to object files with bare-metal settings. Each
architecture the user is targeting will need its own object file(s).
```sh
clang -target aarch64 ...[see CMakeLists.txt]... functions.c -o functions_aarch64.o
```
3. The user (or compiler transforms) adds calls to their functions by declaring
them and marking them as statically-linked.
```mlir
func.func private @simple_mul_workgroup(
    %binding0: memref<?xf32>, %binding1: memref<?xf32>, %binding2: memref<?xf32>,
    %dim: index, %tid: index) attributes {hal.import.static}
...
func.call @simple_mul_workgroup(%memref0, %memref1, %memref2, %dim, %tid) : (memref<?xf32>, memref<?xf32>, memref<?xf32>, index, index) -> ()
```
4. The user (or compiler transforms) annotates the executables with the objects
to link against providing the function definitions.
```mlir
stream.executable private @executable attributes {
  hal.executable.objects = #hal.executable.objects<{
    #aarch64_target = [
      #hal.executable.object<{path = "functions_aarch64.o"}>
    ]
  }>
```
5. The IREE compiler selects the appropriate object files for the target
configuration and links them into the binaries it produces.
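
Step 1 notes that parallelism is scheduled outside the function. The sketch
below illustrates that idea with a sequential stand-in for the runtime: it
computes a workgroup count and calls the function once per workgroup. It uses
the simplified signature from the NOTE comment in step 1. The name
`dispatch_simple_mul`, the `_simplified` suffix, and the assumption that `tid`
is the workgroup index times 64 are illustrative only and are not taken from
the IREE sources.

```c
#include <assert.h>
#include <stddef.h>

// Simplified form of the step 1 function (the signature its NOTE comment
// describes); each call handles one tile of up to 64 elements.
static void simple_mul_workgroup_simplified(
    const float* restrict binding0, const float* restrict binding1,
    float* restrict binding2, size_t dim, size_t tid) {
  size_t end = tid + 64;
  if (end > dim) end = dim;
  for (size_t i = tid; i < end; ++i) {
    binding2[i] = binding0[i] * binding1[i];
  }
}

// Hypothetical stand-in for the scheduler: derives the workgroup count from
// the element count and invokes the function once per workgroup. This loop is
// sequential; a real runtime may execute the calls on different threads.
static size_t dispatch_simple_mul(const float* lhs, const float* rhs,
                                  float* out, size_t dim) {
  assert(lhs && rhs && out);
  const size_t tile = 64;  // assumed to match the tile size in the function
  size_t workgroup_count = (dim + tile - 1) / tile;  // ceil(dim / tile)
  for (size_t wg = 0; wg < workgroup_count; ++wg) {
    simple_mul_workgroup_simplified(lhs, rhs, out, dim, wg * tile);
  }
  return workgroup_count;
}
```

With `dim = 100` this issues two calls (`tid = 0` and `tid = 64`). Running them
concurrently would be safe here because each call writes a disjoint range of
`binding2`.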
## Instructions
These instructions presume that `iree-compile` and `iree-run-module` have been
installed or built. [See here](https://iree.dev/building-from-source/getting-started/)
for instructions on CMake setup and building from source.
0. Ensure that `clang` is on your PATH:
```
clang --version
```
1. Build the `iree-sample-deps` CMake target to compile
[functions.c](./functions.c) to object files for aarch64 and x86_64:
```
cmake --build ../iree-build/ --target iree-sample-deps
```
In a user application this would be replaced with whatever build
infrastructure the user has for compiling code to object files. No IREE
compiler or runtime changes are required and the normal compiler install can
be used. Note that specific bare-metal compiler flags are required when
producing the object files; see CMakeLists.txt.
2. Compile the [example module](./example_stream.mlir) to a .vmfb file and pass
the path to the build directory so the .o files can be found:
```
iree-compile \
  --iree-hal-executable-object-search-path=../iree-build/ \
  samples/custom_dispatch/cpu/embedded/example_stream.mlir \
  -o=/tmp/example.vmfb
```
[example_stream.mlir](./example_stream.mlir) demonstrates a high-level
approach without needing to specify too much information while
[example_hal.mlir](./example_hal.mlir) shows the lower-level representation
it gets expanded into.
3. Run the example program using the custom functions:
```
iree-run-module \
  --device=local-sync \
  --function=mixed_invocation \
  --input=8xf32=2 \
  --input=8xf32=4 \
  --module=/tmp/example.vmfb
```
## Custom Kernel Match and Replace Scripting Instructions
Follow steps 0 and 1 above to build the samples, then compile with one
additional flag that gives the path to the transform spec containing the kernel
matcher and replacer:
```
iree-compile \
  --iree-hal-executable-object-search-path=../iree-build/ \
  --iree-preprocessing-transform-spec-filename=samples/custom_dispatch/cpu/embedded/example_transform_spec.mlir \
  samples/custom_dispatch/cpu/embedded/example_transform.mlir \
  -o=/tmp/example.vmfb
```
Then run the example as before, this time with four inputs:
```
iree-run-module \
  --device=local-sync \
  --function=mixed_invocation \
  --input=5xf32=7 \
  --input=5xf32=4 \
  --input=10xf32=-4 \
  --input=10xf32=3 \
  --module=/tmp/example.vmfb
```