NCCL integration (#11585)
This change enables basic NCCL support in the CUDA runtime, enough to
run a minimal test. Much more work remains on top of it.
Two environment variables are introduced to set the number of
processes and the process ID:
1. `IREE_CUDA_NCCL_NPROCS`
2. `IREE_CUDA_NCCL_PROCID`
The GPU ID can be set using `--device=cuda://<index>` or
`--device=cuda://GPU-<uuid>` for `iree-run-module`.
The NCCL dynamic library is loaded only when users set
`IREE_CUDA_NCCL_NPROCS` to >= 1. (Without knowing the number of
processes, we can't create a unique ID, which is needed to create a
channel.)
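The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code in this change: `parse_env_count`, `nccl_env_config_t`, `read_nccl_env_config`, and `should_load_nccl` are hypothetical names, and treating an unset or malformed variable as 0 is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper (not IREE's actual code): parses a non-negative decimal
// integer from an environment variable value. Returns `fallback` when the
// value is NULL, empty, malformed, or out of range.
static int parse_env_count(const char* value, int fallback) {
  if (!value || !*value) return fallback;
  char* end = NULL;
  errno = 0;
  long parsed = strtol(value, &end, 10);
  if (errno == ERANGE || *end != '\0') return fallback;
  if (parsed < 0 || parsed > INT_MAX) return fallback;
  return (int)parsed;
}

// Hypothetical container for the two settings introduced by this change.
typedef struct {
  int nprocs;  // IREE_CUDA_NCCL_NPROCS; assumed 0 when unset or invalid
  int procid;  // IREE_CUDA_NCCL_PROCID; assumed 0 when unset or invalid
} nccl_env_config_t;

// Reads both variables from the process environment.
static nccl_env_config_t read_nccl_env_config(void) {
  nccl_env_config_t config;
  config.nprocs = parse_env_count(getenv("IREE_CUDA_NCCL_NPROCS"), 0);
  config.procid = parse_env_count(getenv("IREE_CUDA_NCCL_PROCID"), 0);
  return config;
}

// NCCL is only needed once the process count is known (>= 1); otherwise no
// unique ID, and therefore no channel, can be created.
static bool should_load_nccl(const nccl_env_config_t* config) {
  assert(config != NULL);
  return config->nprocs >= 1;
}
```

In this sketch, a run with `IREE_CUDA_NCCL_NPROCS` unset makes `should_load_nccl` return false, so a single-device run would never touch the NCCL library.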
Follow-up work is needed on top of this change:
1. a full set of E2E tests from stream async ops to the runtime,
2. support for high-level ops such as stablehlo, and
3. a CI test setup.
Here is a sample allgather test:
```mlir
func.func @main() -> !hal.buffer_view {
  %c0 = arith.constant 0 : index
  %c2 = arith.constant 2 : index
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  %input_cst = stream.tensor.constant : tensor<2xi32> in !stream.resource<constant> =
      dense<[101, 102]> : tensor<2xi32>
  %input = stream.async.transfer %input_cst : !stream.resource<constant>{%c8} -> !stream.resource<*>{%c8}
  %fill_val = arith.constant -1 : i32
  %output = stream.tensor.splat %fill_val :
      i32 -> tensor<2x2xi32> in !stream.resource<*>{%c16}
  %channel = stream.channel.default on(#hal.affinity.queue<[0]>) : !stream.channel
  %0 = stream.async.collective<all_gather : si32>[%c2]
      on(#hal.affinity.queue<[0]>) channel(%channel)
      %input[%c0 to %c8 for %c8],
      %output[%c0 to %c16 for %c16] :
      !stream.resource<*>{%c8} -> %output as !stream.resource<*>{%c16}
  %1 = stream.async.transfer %0 : !stream.resource<*>{%c16} -> !stream.resource<external>{%c16}
  %result = stream.tensor.export %1 :
      tensor<2x2xi32> in !stream.resource<external>{%c16} -> !hal.buffer_view
  return %result : !hal.buffer_view
}
```
A sample command to compile the test:
```zsh
iree-compile --iree-hal-cuda-llvm-target-arch=sm_86 --iree-hal-target-backends=cuda -o allgather.vmfb allgather.mlir
```
Here is a sample command line for a host with two CUDA devices,
followed by its output:
```zsh
IREE_CUDA_NCCL_NPROCS=2 NCCL_COMM_ID=127.0.0.1:8000 IREE_CUDA_NCCL_PROCID=0 iree-run-module --device=cuda://0 --module_file=allgather.vmfb --entry_function=main & \
IREE_CUDA_NCCL_NPROCS=2 NCCL_COMM_ID=127.0.0.1:8000 IREE_CUDA_NCCL_PROCID=1 iree-run-module --device=cuda://1 --module_file=allgather.vmfb --entry_function=main
EXEC @main
EXEC @main
result[0]: hal.buffer_view
2x2xi32=[101 102][101 102]
result[0]: hal.buffer_view
2x2xi32=[101 102][101 102]
```
diff --git a/build_tools/bazel/workspace.bzl b/build_tools/bazel/workspace.bzl
index 917fb7e..0ffafa6 100644
--- a/build_tools/bazel/workspace.bzl
+++ b/build_tools/bazel/workspace.bzl
@@ -166,3 +166,10 @@
build_file = iree_repo_alias + "//:build_tools/third_party/tracy_client/BUILD.overlay",
path = paths.join(iree_path, "third_party/tracy"),
)
+
+ maybe(
+ native.new_local_repository,
+ name = "nccl",
+ build_file = iree_repo_alias + "//:build_tools/third_party/nccl/BUILD.overlay",
+ path = paths.join(iree_path, "third_party/nccl"),
+ )