This sample shows how to run a simple pointwise array multiplication bytecode module on various HAL device targets with minimal runtime overhead. Some of these devices are compatible with bare-metal systems that lack threading or file IO support.
The main bytecode testing tool, iree-run-module, requires proper operating system support to set up the runtime environment that executes an IREE bytecode module. On embedded systems, facilities such as a file system or multithreaded asynchronous control may not be available. This sample demonstrates how to set up the simplest framework to load and run IREE bytecode with various target backends.
Set up the CMake configuration with -DIREE_BUILD_SAMPLES=ON (default on).
Then build the sample targets with CMake:
    cmake --build <build dir> --target samples/simple_embedding/all
or with Bazel:
    bazel build samples/simple_embedding:all
The resulting executables are named simple_embedding_<HAL devices>.
The sample consists of three parts:
The simple pointwise array multiplication op, with an entry function called simple_mul, two <4xf32> inputs, and one <4xf32> output. The ML bytecode modules are generated automatically at build time by the host compiler iree-compile, using the target HAL device configurations.
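For orientation, the computation that simple_mul performs can be sketched in plain C. This is only a host-side reference for checking results, not the code the compiler generates; the name simple_mul_reference and the SIMPLE_MUL_ELEMENT_COUNT macro are hypothetical and do not appear in the sample.

```c
#include <assert.h>
#include <stddef.h>

// Number of elements in each <4xf32> tensor used by the sample.
#define SIMPLE_MUL_ELEMENT_COUNT 4

// Hypothetical host-side reference for simple_mul (not part of the sample):
// computes out[i] = lhs[i] * rhs[i] for every element.
static void simple_mul_reference(const float* lhs, const float* rhs,
                                 float* out, size_t count) {
  assert(lhs != NULL && rhs != NULL && out != NULL);
  for (size_t i = 0; i < count; ++i) {
    out[i] = lhs[i] * rhs[i];
  }
}
```

Calling it with two four-element float arrays and SIMPLE_MUL_ELEMENT_COUNT gives values that can be compared against the output buffer of the bytecode module.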
The main function of the sample has the following steps:
The HAL device for different target backends. Devices are created using a specific executable loader and device constructor. For example, device_embedded_sync.c creates a “sync” device with the embedded ELF loader:
    iree_hal_sync_device_params_t params;
    iree_hal_sync_device_params_initialize(&params);
    iree_hal_executable_loader_t* loader = NULL;
    IREE_RETURN_IF_ERROR(iree_hal_embedded_elf_loader_create(
        /*plugin_manager=*/NULL, iree_allocator_system(), &loader));
    iree_string_view_t identifier = iree_make_cstring_view("local-sync");
    iree_status_t status = iree_hal_sync_device_create(
        identifier, &params, /*loader_count=*/1, &loader,
        iree_allocator_system(), device);
Whereas for device_embedded.c, the “sync device” is replaced with the multithreaded “task device”, which uses a “task executor”:
    ...
    iree_task_executor_t* executor = NULL;
    iree_host_size_t executor_count = 0;
    iree_status_t status = iree_task_executors_create_from_flags(
        iree_allocator_system(), 1, &executor, &executor_count);
    IREE_ASSERT_EQ(executor_count, 1, "NUMA unsupported");
    iree_string_view_t identifier = iree_make_cstring_view("local-task");
    if (iree_status_is_ok(status)) {
      // Create the device.
      status = iree_hal_task_device_create(
          identifier, &params, /*queue_count=*/1, &executor,
          /*loader_count=*/1, &loader, iree_allocator_system(), device);
    }
An example that uses a higher-level driver registry is in device_vulkan.c.
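The split between a shared main function and per-device constructors can be illustrated without any IREE headers. The following is a minimal sketch under stated assumptions: sample_device_t, sample_device_create_fn_t, create_sync_device, create_task_device, and run_with_device are all hypothetical stand-ins, not IREE types or functions. It only shows how common code might depend on a constructor signature rather than on a particular device.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for a device handle; this is not an IREE type.
typedef struct sample_device_t {
  const char* identifier;   // Name the device was created under.
  int uses_worker_threads;  // 0: work runs inline; 1: work goes to an executor.
} sample_device_t;

// Common constructor signature that each device file would implement.
typedef int (*sample_device_create_fn_t)(sample_device_t* out_device);

// Stand-in for a "sync" device constructor: no threads required.
static int create_sync_device(sample_device_t* out_device) {
  assert(out_device != NULL);
  out_device->identifier = "local-sync";
  out_device->uses_worker_threads = 0;
  return 0;  // 0 means success in this sketch.
}

// Stand-in for a "task" device constructor: relies on worker threads.
static int create_task_device(sample_device_t* out_device) {
  assert(out_device != NULL);
  out_device->identifier = "local-task";
  out_device->uses_worker_threads = 1;
  return 0;
}

// Shared code sees only the constructor signature, never device details.
static int run_with_device(sample_device_create_fn_t create_device,
                           sample_device_t* out_device) {
  if (create_device == NULL) return -1;  // No constructor was provided.
  return create_device(out_device);
}
```

In this sketch the constructor is passed as a function pointer so that everything fits in one file. A build that produces one executable per device could instead link a single constructor definition into each binary.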
To avoid file IO, the bytecode module is converted into a data stream (module_data) that is embedded in the executable. The same strategy can be applied when building applications for embedded systems that lack proper file IO.
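The embedding strategy described above can be sketched as a constant byte array compiled into the program and handed around as a pointer plus a length. Everything here is an assumption made for illustration: embedded_blob_t, load_embedded_module_data, and checksum_blob are hypothetical names, and the four bytes are arbitrary placeholders rather than real module contents. In practice a build step would generate the array from the compiled module file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical read-only view over embedded bytes; this is not an IREE type.
typedef struct embedded_blob_t {
  const uint8_t* data;
  size_t size;
} embedded_blob_t;

// Placeholder contents only. A build step would normally generate this array
// from the compiled bytecode module file.
static const uint8_t module_data[] = {0x49, 0x52, 0x45, 0x45};

// Returns the embedded bytes without touching a file system.
static embedded_blob_t load_embedded_module_data(void) {
  embedded_blob_t blob = {module_data, sizeof(module_data)};
  return blob;
}

// Example consumer that needs only a pointer and a length: sums all bytes.
static uint32_t checksum_blob(embedded_blob_t blob) {
  assert(blob.data != NULL || blob.size == 0);
  uint32_t sum = 0;
  for (size_t i = 0; i < blob.size; ++i) {
    sum += blob.data[i];
  }
  return sum;
}
```

Because the array is const, a linker can typically place it in read-only storage, so the data costs no heap allocation at startup.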
Some of the devices in this sample support a generic platform (that is, machine mode without an operating system). For example, device_vmvx_sync should support any architecture that IREE supports, and device_embedded_sync should support any architecture supported by the llvm-cpu codegen target backend (the bytecode module data may need to be added if it is not already in device_embedded_sync.c).