NCCL integration (#11585)

This integration enables basic NCCL features in the CUDA runtime, enough
to run a minimal test. Much more remains to be built on top of it.

Two environment variables are introduced to set the number of processes
and the process ID, respectively:

1. `IREE_CUDA_NCCL_NPROCS`
2. `IREE_CUDA_NCCL_PROCID`

The GPU ID can be set using `--device=cuda://<index>` or
`--device=cuda://GPU-<uuid>` for `iree-run-module`.

The NCCL dynamic library is loaded only when users set
`IREE_CUDA_NCCL_NPROCS` to >= 1. (Without knowing the number of
processes, we can't create a unique ID, which is needed to create a
channel.)
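
As a rough illustration of the condition above, a launcher script could
sanity-check the two variables before starting `iree-run-module`. This is
only a sketch: `check_nccl_env` is a hypothetical helper, not part of
IREE, and the requirement that the process ID be smaller than the process
count is an assumption taken from NCCL's usual rank convention. The
runtime's own validation may differ.

```shell
# Hypothetical pre-flight check; check_nccl_env is an illustrative name,
# not an IREE tool. Returns 0 if the variables look usable, 1 otherwise.
check_nccl_env() {
  nprocs="${IREE_CUDA_NCCL_NPROCS:-}"
  procid="${IREE_CUDA_NCCL_PROCID:-}"
  # Both values must be non-empty strings of decimal digits.
  case "$nprocs" in ''|*[!0-9]*) return 1 ;; esac
  case "$procid" in ''|*[!0-9]*) return 1 ;; esac
  # The process count must be >= 1; the ID is assumed to be in [0, count).
  [ "$nprocs" -ge 1 ] && [ "$procid" -lt "$nprocs" ]
}
```

A script could then run `check_nccl_env || exit 1` before launching each
process.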

Follow-up work building on this change includes:

1. a full set of E2E tests from stream async ops to the runtime,
2. support for high-level ops such as stablehlo, and
3. a CI test setup.

Here is a sample allgather test:

```mlir
func.func @main() -> !hal.buffer_view {
  %c0 = arith.constant 0 : index
  %c2 = arith.constant 2 : index
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  %input_cst = stream.tensor.constant : tensor<2xi32> in !stream.resource<constant> =
    dense<[101, 102]> : tensor<2xi32>
  %input = stream.async.transfer %input_cst : !stream.resource<constant>{%c8} -> !stream.resource<*>{%c8}
  %fill_val = arith.constant -1 : i32
  %output = stream.tensor.splat %fill_val :
    i32 -> tensor<2x2xi32> in !stream.resource<*>{%c16}
  %channel = stream.channel.default on(#hal.affinity.queue<[0]>) : !stream.channel

  %0 = stream.async.collective<all_gather : si32>[%c2]
      on(#hal.affinity.queue<[0]>) channel(%channel)
      %input[%c0 to %c8 for %c8],
      %output[%c0 to %c16 for %c16] :
      !stream.resource<*>{%c8} -> %output as !stream.resource<*>{%c16}
  %1 = stream.async.transfer %0 : !stream.resource<*>{%c16} -> !stream.resource<external>{%c16}
  %result = stream.tensor.export %1 :
    tensor<2x2xi32> in !stream.resource<external>{%c16} -> !hal.buffer_view
  return %result : !hal.buffer_view
}
```

A sample command to compile the test is:
```zsh
iree-compile --iree-hal-cuda-llvm-target-arch=sm_86 --iree-hal-target-backends=cuda -o allgather.vmfb allgather.mlir
```

Here is a sample command line for a host with two CUDA devices, followed
by its output:
```zsh
IREE_CUDA_NCCL_NPROCS=2 NCCL_COMM_ID=127.0.0.1:8000 IREE_CUDA_NCCL_PROCID=0 iree-run-module --device=cuda://0 --module_file=allgather.vmfb --entry_function=main & \
IREE_CUDA_NCCL_NPROCS=2 NCCL_COMM_ID=127.0.0.1:8000 IREE_CUDA_NCCL_PROCID=1 iree-run-module --device=cuda://1 --module_file=allgather.vmfb --entry_function=main
EXEC @main
EXEC @main
result[0]: hal.buffer_view
2x2xi32=[101 102][101 102]
result[0]: hal.buffer_view
2x2xi32=[101 102][101 102]
```
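
The two-process command above could be generalized to N processes with a
small launcher along these lines. This is a sketch, not a tested tool:
`NPROCS`, `COMM_ID`, `MODULE`, `DRY_RUN`, `rank_command`, and `launch_all`
are hypothetical names, and it assumes that process i runs on CUDA device
i, as in the sample. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical launcher sketch. NPROCS, COMM_ID, MODULE, DRY_RUN,
# rank_command, and launch_all are illustrative names, not part of IREE.
NPROCS="${NPROCS:-2}"
COMM_ID="${COMM_ID:-127.0.0.1:8000}"
MODULE="${MODULE:-allgather.vmfb}"
DRY_RUN="${DRY_RUN:-1}"  # 1: only print the commands; 0: launch them.

# Print the full command line for one process ID.
# Assumes process i uses CUDA device i, as in the sample above.
rank_command() {
  echo "IREE_CUDA_NCCL_NPROCS=$NPROCS NCCL_COMM_ID=$COMM_ID IREE_CUDA_NCCL_PROCID=$1" \
    "iree-run-module --device=cuda://$1 --module_file=$MODULE --entry_function=main"
}

# Start (or print) one process per ID, then wait for all of them.
launch_all() {
  i=0
  while [ "$i" -lt "$NPROCS" ]; do
    if [ "$DRY_RUN" = "1" ]; then
      rank_command "$i"
    else
      IREE_CUDA_NCCL_NPROCS="$NPROCS" NCCL_COMM_ID="$COMM_ID" IREE_CUDA_NCCL_PROCID="$i" \
        iree-run-module --device="cuda://$i" --module_file="$MODULE" --entry_function=main &
    fi
    i=$((i + 1))
  done
  wait
}

launch_all
```

Setting `DRY_RUN=0` launches the processes in the background and waits
for all of them to exit.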
diff --git a/build_tools/bazel/workspace.bzl b/build_tools/bazel/workspace.bzl
index 917fb7e..0ffafa6 100644
--- a/build_tools/bazel/workspace.bzl
+++ b/build_tools/bazel/workspace.bzl
@@ -166,3 +166,10 @@
         build_file = iree_repo_alias + "//:build_tools/third_party/tracy_client/BUILD.overlay",
         path = paths.join(iree_path, "third_party/tracy"),
     )
+
+    maybe(
+        native.new_local_repository,
+        name = "nccl",
+        build_file = iree_repo_alias + "//:build_tools/third_party/nccl/BUILD.overlay",
+        path = paths.join(iree_path, "third_party/nccl"),
+    )