| # Custom CUDA Dispatch Kernels |
| |
| See the [custom_dispatch README](/samples/custom_dispatch/README.md) for an |
| overview of this approach. |
| |
This sample demonstrates how to define external device functions, written
against the IREE CUDA ABI, that can be dispatched from within IREE programs.
The user authoring the kernels compiles their CUDA code to PTX blobs and
dispatches functions within those blobs by declaring them in their IR.
| |
## Work in Progress
| |
Note that currently only entire kernel launches can be modeled, which prevents
IREE from performing optimizations it could otherwise apply. In the future, PTX
linking will be implemented so that the external functions are referenced and
linked with the compiler-produced portions, allowing information about how each
dispatch is used to specialize and prune the hand-authored kernel. Since the
IREE CUDA ABI is not version-stable, this whole-kernel approach may require
updates when moving to new IREE versions, while function-level linking would
not.
| |
Since today only entire kernels can be provided, the user must specify an empty
executable (no `builtin.module` contents) and must therefore provide objects
for all targets they are compiling for. When partial function linking is
available, it will be possible to provide fallback code as IR for cases where
objects are not available.
| |
| ## Workflow |
| |
| ``` |
| +------------+ +-------------------+ +--------------+ |
| | kernels.cu | -> nvcc -+-> | kernels_sm_52.ptx | -+ | example.mlir | |
| +------------+ | +-------------------+ | +--------------+ |
| | +-------------------+ | v |
| +-> | kernels_sm_80.ptx | -+----> iree-compile |
| +-------------------+ v |
| +--------------+ |
| | example.vmfb | |
| +--------------+ |
| ``` |
| |
| 1. The user authors their kernels in a .cu file. |
| |
| ```c |
// Kernel arguments follow the IREE CUDA ABI: the bindings declared in the
// pipeline layout, in order, followed by the push constants.
extern "C" __global__ void simple_mul(const float* __restrict__ binding0,
                                      const float* __restrict__ binding1,
                                      float* __restrict__ binding2, int dim) {
  // One thread per element; `dim` guards the tail of the final block.
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if (tid < dim) binding2[tid] = binding0[tid] * binding1[tid];
}
| ``` |
| |
| 2. Source files are compiled to PTX files via nvcc. Each architecture the user |
| is targeting will need its own object file(s). |
| |
```
nvcc ... (TODO, see CMakeLists.txt) -o kernels_sm_80.ptx
```
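
For reference, an invocation along these lines produces PTX for one
architecture; the flags are illustrative and the exact set used by this
sample lives in its CMakeLists.txt:

```
nvcc --ptx --gpu-architecture=sm_80 kernels.cu -o kernels_sm_80.ptx
```

Running this once per target architecture yields the per-architecture objects
referenced in the next step.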
| |
3. The user (or compiler transforms) declares the externally defined kernels and
   the target-to-objects map for each architecture. The layout specifies how
   the dispatch arguments are mapped to buffers, and the workgroup size is the
   CUDA block size. The region on the export computes the workgroup count (the
   CUDA grid size) and can query `%device` if runtime device information is
   needed.
| |
| ```mlir |
hal.executable.source private @executable attributes {
  objects = #hal.executable.objects<{
    #nvptx_sm_52_target = [
      #hal.executable.object<{path = "kernels_sm_52.ptx"}>
    ],
    #nvptx_sm_80_target = [
      #hal.executable.object<{path = "kernels_sm_80.ptx"}>
    ]
  }>
} {
  hal.executable.export public @simple_mul ordinal(0)
      layout(#hal.pipeline.layout<push_constants = 1, sets = [
        <0, bindings = [
          <0, storage_buffer, ReadOnly>,
          <1, storage_buffer, ReadOnly>,
          <2, storage_buffer>
        ]>
      ]>) attributes {workgroup_size = [64 : index, 1 : index, 1 : index]} {
  ^bb0(%device: !hal.device, %workload: index):
    %x = affine.apply affine_map<()[s0] -> (s0 ceildiv 64)>()[%workload]
    %c1 = arith.constant 1 : index
    hal.return %x, %c1, %c1 : index, index, index
  }
}
| ``` |
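
Note that `#nvptx_sm_52_target` and `#nvptx_sm_80_target` are attribute aliases
for `#hal.executable.target` attributes defined at module scope. A minimal
sketch, assuming the stock CUDA target configuration (see
[example.mlir](./example.mlir) for the authoritative definitions):

```mlir
// Illustrative aliases; `target_arch` selects the PTX architecture to match.
#nvptx_sm_52_target = #hal.executable.target<"cuda", "cuda-nvptx-fb", {target_arch = "sm_52"}>
#nvptx_sm_80_target = #hal.executable.target<"cuda", "cuda-nvptx-fb", {target_arch = "sm_80"}>
```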
| |
4. The user (or compiler transforms) dispatches the kernel. Primitive operands
   such as `%dim_i32` are passed as push constants while tensor operands are
   bound to the declared bindings in order.
| |
| ```mlir |
| %0 = flow.dispatch @executable::@simple_mul[%dim](%dim_i32, %arg0, %arg1) : |
| (i32, tensor<?xf32>{%dim}, tensor<?xf32>{%dim}) -> tensor<?xf32>{%dim} |
| ``` |
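
The `%dim` workload and `%dim_i32` operand can be derived from the inputs; a
minimal sketch (names chosen for illustration, not copied from
[example.mlir](./example.mlir)):

```mlir
// Query the dynamic length of the inputs and narrow it to the i32 the kernel
// receives as its `dim` push constant.
%c0 = arith.constant 0 : index
%dim = tensor.dim %arg0, %c0 : tensor<?xf32>
%dim_i32 = arith.index_cast %dim : index to i32
```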
| |
5. The IREE compiler selects the appropriate object files for the target
   configuration and links them into the binaries it produces. At runtime,
   dispatches are automatically routed to the appropriate variant among those
   available.
| |
| ## Instructions |
| |
This presumes that `iree-compile` and `iree-run-module` have been installed or
built. [See here](https://iree.dev/building-from-source/getting-started/)
for instructions on CMake setup and building from source.
| |
0. Ensure that the [CUDA SDK](https://developer.nvidia.com/cuda-downloads) is
   installed and that `nvcc` is on your PATH:
| |
| ``` |
| nvcc --version |
| ``` |
| |
| 1. Build the `iree-sample-deps` CMake target to compile the .cu to .ptx: |
| |
| ``` |
| cmake --build ../iree-build/ --target iree-sample-deps |
| ``` |
| |
| In a user application this would be replaced with whatever build |
| infrastructure the user has for compiling kernels to PTX. No IREE |
| compiler or runtime changes are required and the normal compiler install can |
| be used. |
| |
2. Compile the [example module](./example.mlir) to a .vmfb file and pass the
   path to the build directory so the .ptx files can be found:
| |
| ``` |
| iree-compile \ |
| --iree-hal-executable-object-search-path=../iree-build/ \ |
| samples/custom_dispatch/cuda/kernels/example.mlir \ |
| -o=/tmp/example.vmfb |
| ``` |
| |
| 3. Run the example program using the custom kernels: |
| |
| ``` |
| iree-run-module \ |
| --device=cuda \ |
| --function=mixed_invocation \ |
| --input=8xf32=2 \ |
| --input=8xf32=4 \ |
| /tmp/example.vmfb |
| ``` |