# Running NCCL tests in a Managed Service for Kubernetes® cluster with InfiniBand™-connected GPUs

To boost the performance of high-performance computing (HPC) and AI workloads in a Managed Service for Kubernetes cluster, you can interconnect the GPUs on its nodes directly using InfiniBand.

In this tutorial, you will create a Managed Service for Kubernetes cluster with GPUs interconnected using InfiniBand, install operators and drivers from NVIDIA on it, and run NVIDIA NCCL tests to check InfiniBand performance.

## Costs

The tutorial includes the following chargeable resources:

* [Compute virtual machines with GPUs](../../compute/resources/pricing)
* [Managed Service for Kubernetes cluster](../resources/pricing)

## Prerequisites

1. [Install and configure the Nebius AI Cloud CLI](../../cli/quickstart).

2. Save the IDs of the default subnet and the default `k8s-node-group-sa` service account to environment variables:

   ```bash theme={null}
   export NB_SUBNET_ID=$(nebius vpc subnet list --format json \
     | jq -r '.items[0].metadata.id')
   export NB_SA_ID=$(nebius iam service-account get-by-name \
     --name k8s-node-group-sa --format json \
     | jq -r '.metadata.id')
   ```

3. [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) and [Helm](https://helm.sh/docs/intro/install/).
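
Before moving on, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not a feature of the Nebius AI Cloud CLI: `missing_tools` and `empty_vars` are hypothetical helper names introduced here. It assumes that `jq` is installed, because the commands above use it, and it treats the string `null` as empty, because `jq -r` prints `null` when the requested field does not exist.

```bash
# Print the names of any commands from the argument list that are not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Print the names of any variables from the argument list that are
# unset, empty, or set to the literal string "null".
empty_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ] || [ "$value" = "null" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Example usage (run after completing the prerequisites):
#   missing_tools nebius jq kubectl helm
#   empty_vars NB_SUBNET_ID NB_SA_ID
```

Both calls in the usage comment should print nothing when the prerequisites are met. Any name that is printed points to a tool that still needs to be installed or a variable that was not populated.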

## Steps

### Set up a Managed Service for Kubernetes cluster with GPUs and InfiniBand

1. Create a GPU cluster:

   ```bash theme={null}
   export NB_GPU_CLUSTER_ID=$(nebius compute gpu-cluster create \
     --name k8s-gpus --infiniband-fabric fabric-3 \
     --format json | jq -r ".metadata.id")
   ```

2. Create a Managed Service for Kubernetes cluster with a public endpoint:

   ```bash theme={null}
   export NB_MK8S_CLUSTER_ID=$(nebius mk8s cluster create \
     --name nccl \
     --control-plane-version 1.33 \
     --control-plane-endpoints-public-endpoint=true \
     --control-plane-subnet-id $NB_SUBNET_ID \
     --format json | jq -r '.metadata.id')
   ```

3. Create a node group in the cluster:

   ```bash theme={null}
   nebius mk8s node-group create \
     --name nccl-gpu-nodes \
     --parent-id $NB_MK8S_CLUSTER_ID \
     --fixed-node-count 2 \
     --template-service-account-id $NB_SA_ID \
     --template-resources-platform "gpu-h100-sxm" \
     --template-resources-preset "8gpu-128vcpu-1600gb" \
     --template-boot-disk-type network_ssd \
     --template-boot-disk-size-bytes 137438953472 \
     --template-gpu-settings-drivers-preset cuda12 \
     --template-gpu-cluster-id $NB_GPU_CLUSTER_ID
   ```

   This tutorial requires that:

   * The node group has the GPU cluster specified.

   * The nodes use a [VM platform and preset](../../compute/virtual-machines/types) compatible with GPU clusters:

     | Platform                                                                | Presets               | [Regions](/overview/regions)                                                                                             |
     | ----------------------------------------------------------------------- | --------------------- | ------------------------------------------------------------------------------------------------------------------------ |
     | NVIDIA® B300 NVLink with Intel Granite Rapids  <br />(`gpu-b300-sxm`)   | `8gpu-192vcpu-2768gb` | `uk-south1`*<Tooltip href="/overview/regions" cta="Private region">\*</Tooltip>*                                         |
     | NVIDIA® B200 NVLink with Intel Emerald Rapids  <br />(`gpu-b200-sxm`)   | `8gpu-160vcpu-1792gb` | `us-central1`                                                                                                            |
     | NVIDIA® B200 NVLink with Intel Emerald Rapids  <br />(`gpu-b200-sxm-a`) | `8gpu-160vcpu-1792gb` | `me-west1`                                                                                                               |
     | NVIDIA® H200 NVLink with Intel Sapphire Rapids  <br />(`gpu-h200-sxm`)  | `8gpu-128vcpu-1600gb` | `eu-north1`, `eu-north2`*<Tooltip href="/overview/regions" cta="Private region">\*</Tooltip>*, `eu-west1`, `us-central1` |
     | NVIDIA® H100 NVLink with Intel Sapphire Rapids  <br />(`gpu-h100-sxm`)  | `8gpu-128vcpu-1600gb` | `eu-north1`                                                                                                              |

     In this command, the nodes use the `gpu-h100-sxm` VM platform with the `8gpu-128vcpu-1600gb` preset.

   * The nodes use a boot disk image offered by Managed Kubernetes that contains drivers and other components for GPUs. Without this image, you need to install the drivers and components manually. For more details, see [GPU drivers and other components](./set-up#gpu-drivers-and-other-components).

4. Generate a kubeconfig file with the cluster details for kubectl:

   ```bash theme={null}
   nebius mk8s cluster get-credentials \
     --id $NB_MK8S_CLUSTER_ID --external
   ```

   To verify that kubectl is connected to the cluster, you can run `kubectl cluster-info`.
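
The `--template-boot-disk-size-bytes` value in the node group command is expressed in bytes: 137438953472 bytes equals 128 GiB. If you want a different boot disk size, the conversion can be sketched as follows. `gib_to_bytes` is a hypothetical helper name used only for illustration.

```bash
# Convert a size in GiB to bytes (1 GiB = 1024 * 1024 * 1024 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 128   # prints 137438953472, the value used in the node group command
```

You could then pass the result to the flag, for example `--template-boot-disk-size-bytes "$(gib_to_bytes 256)"`.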

### Run the NCCL tests

1. Install the [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/trainer/legacy-v1/overview/) (also known as Kubeflow Trainer).

   ```bash theme={null}
   kubectl apply --server-side -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.9.3"
   ```

2. Create a namespace for the tests, named `nccl-test` in this tutorial:

   ```bash theme={null}
   kubectl create ns nccl-test
   ```

3. Create `nccl-test.yaml` with an `MPIJob` for your tests.

   <Note>
     This example is for 2 nodes. If you [created a node group](#set-up-a-managed-service-for-kubernetes-cluster-with-gpus-and-infiniband) with a different number of nodes, update the `-np` value of the `mpirun` command in `.spec.mpiReplicaSpecs.Launcher.template.spec.containers[0].args` and the number of workers in `.spec.mpiReplicaSpecs.Worker.replicas` accordingly.
   </Note>

   <Accordion title="nccl-test.yaml — NVIDIA B200 GPUs">
     ```yaml theme={null}
     apiVersion: kubeflow.org/v1
     kind: MPIJob
     metadata:
       name: nccl-test-nebius
     spec:
       slotsPerWorker: 8 # Number of GPUs on each node
       mpiReplicaSpecs:
         Launcher:
           replicas: 1
           template:
             spec:
               containers:
               - args:
                 # In `-np 16`, 16 is the total number of GPUs on all nodes 
                 # (`.spec.slotsPerWorker` × `.spec.mpiReplicaSpecs.Worker.replicas`)
                 - 'mpirun -np 16 -bind-to none -x LD_LIBRARY_PATH 
                   -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 
                   -mca coll ^hcoll
                   -x UCX_NET_DEVICES=eth0
                   -x NCCL_IB_HCA=mlx5 
                   -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 
                   -x NCCL_COLLNET_ENABLE=0
                   /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1'
                 command:
                 - /bin/bash
                 - -c
                 env:
                 - name: OMPI_ALLOW_RUN_AS_ROOT
                   value: "1"
                 - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                   value: "1"
                 image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.26.5-ubu22.04-cu12.8
                 name: nccl
                 resources:
                   requests:
                     cpu: 2
                     memory: 1208Mi
                 securityContext:
                   privileged: true
               initContainers:
               - command:
                 - sh
                 - -c
                 - ulimit -Hl unlimited && ulimit -Sl unlimited
                 image: busybox:1.27.2
                 name: init-limit
                 securityContext:
                   privileged: true
         Worker:
           replicas: 2 # Number of nodes
           template:
             spec:
               automountServiceAccountToken: false
               containers:
               - image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.26.5-ubu22.04-cu12.8
                 name: nccl
                 resources: 
                   # If you have other applications running in your cluster, 
                   # adjust the `cpu` and `memory` values according to 
                   # the resources available on the nodes
                   limits:
                     cpu: 96
                     memory: 1600G
                     nvidia.com/gpu: 8
                   requests:
                     cpu: 96
                     memory: 1600G
                     nvidia.com/gpu: 8
                 securityContext:
                   privileged: true
                 volumeMounts:
                 - mountPath: /dev/shm
                   name: dshm
               enableServiceLinks: false
               initContainers:
               - command:
                 - sh
                 - -c
                 - ulimit -Hl unlimited && ulimit -Sl unlimited
                 image: busybox:1.27.2
                 name: init-limit
                 securityContext:
                   privileged: true
               volumes:
               - emptyDir:
                   medium: Memory
                 name: dshm
       runPolicy:
         cleanPodPolicy: Running
     ```
   </Accordion>

   <Accordion title="nccl-test.yaml — NVIDIA H200 or H100 GPUs">
     ```yaml theme={null}
     apiVersion: kubeflow.org/v1
     kind: MPIJob
     metadata:
       name: nccl-test-nebius
     spec:
       slotsPerWorker: 8 # Number of GPUs on each node
       mpiReplicaSpecs:
         Launcher:
           replicas: 1
           template:
             spec:
               containers:
               - args:
                 # In `-np 16`, 16 is the total number of GPUs on all nodes 
                 # (`.spec.slotsPerWorker` × `.spec.mpiReplicaSpecs.Worker.replicas`)
                 - 'mpirun -np 16 -bind-to none -x LD_LIBRARY_PATH 
                   -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 
                   -x NCCL_IB_HCA=mlx5 
                   -x UCX_NET_DEVICES=eth0
                   -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 
                   -x NCCL_COLLNET_ENABLE=0
                   /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1'
                 command:
                 - /bin/bash
                 - -c
                 env:
                 - name: OMPI_ALLOW_RUN_AS_ROOT
                   value: "1"
                 - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                   value: "1"
                 image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4
                 name: nccl
                 resources:
                   requests:
                     cpu: 2
                     memory: 1208Mi
                 securityContext:
                   privileged: true
               initContainers:
               - command:
                 - sh
                 - -c
                 - ulimit -Hl unlimited && ulimit -Sl unlimited
                 image: busybox:1.27.2
                 name: init-limit
                 securityContext:
                   privileged: true
         Worker:
           replicas: 2 # Number of nodes
           template:
             spec:
               automountServiceAccountToken: false
               containers:
               - image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4
                 name: nccl
                 resources: 
                   # If you have other applications running in your cluster, 
                   # adjust the `cpu` and `memory` values according to 
                   # the resources available on the nodes
                   limits:
                     cpu: 96
                     memory: 1600G
                     nvidia.com/gpu: 8
                   requests:
                     cpu: 96
                     memory: 1600G
                     nvidia.com/gpu: 8
                 securityContext:
                   privileged: true
                 volumeMounts:
                 - mountPath: /dev/shm
                   name: dshm
               enableServiceLinks: false
               initContainers:
               - command:
                 - sh
                 - -c
                 - ulimit -Hl unlimited && ulimit -Sl unlimited
                 image: busybox:1.27.2
                 name: init-limit
                 securityContext:
                   privileged: true
               volumes:
               - emptyDir:
                   medium: Memory
                 name: dshm
       runPolicy:
         cleanPodPolicy: Running
     ```
   </Accordion>

4. Deploy the `MPIJob` in `nccl-test`:

   ```bash theme={null}
   kubectl apply -f nccl-test.yaml -n nccl-test
   ```

5. Check that the test pods are running:

   ```bash theme={null}
   kubectl get pods -w -n nccl-test
   ```

   Wait until all the pods are running, like this:

   ```text theme={null}
   NAME                        READY   STATUS    RESTARTS   AGE
   nccl-test-nebius-launcher   1/1     Running   0          24s
   nccl-test-nebius-worker-0   1/1     Running   0          24s
   nccl-test-nebius-worker-1   1/1     Running   0          24s
   ```

6. Check the test logs:

   ```bash theme={null}
   kubectl logs -f nccl-test-nebius-launcher -n nccl-test \
     | grep -v "NCCL INFO"
   ```

   In the output, find the `Avg bus bandwidth` line. A value above 300 GB/s indicates that the InfiniBand connection is working as expected.

   Example:

   ```
   ...
   #                                                              out-of-place                       in-place          
   #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
   #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      536870912     134217728     float     sum      -1   3674.4  146.11  283.09      0   3648.4  147.15  285.11      0
     1073741824     268435456     float     sum      -1   6411.6  167.47  324.47      0   6416.7  167.33  324.21      0
     2147483648     536870912     float     sum      -1    12735  168.62  326.71      0    12979  165.45  320.57      0
     4294967296    1073741824     float     sum      -1    25389  169.17  327.76      0    25598  167.79  325.09      0
     8589934592    2147483648     float     sum      -1    50979  168.50  326.47      0    50799  169.10  327.63      0
   # Out of bounds values : 0 OK
   # Avg bus bandwidth    : 317.11
   ```

   The average bus bandwidth does not equal the raw InfiniBand bandwidth because some of the NCCL operations that it measures run over NVLink. Nevertheless, it gives an accurate estimate of the connection's performance.

   To stop streaming logs, press **Ctrl** + **C**.

7. Delete the `MPIJob`.

   ```bash theme={null}
   kubectl delete -f nccl-test.yaml -n nccl-test
   ```

   <Note>
     You should delete the `MPIJob` even if you want to run another test. In that case, redeploy the `MPIJob` as described in steps 3–4.
   </Note>
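
The two numeric checks in the steps above can be scripted. The following is a minimal sketch under stated assumptions: `total_ranks`, `avg_busbw` and `busbw_above` are hypothetical helper names, and the parser assumes that the log contains a line in the `# Avg bus bandwidth    : <value>` format shown in the example output. The 300 GB/s threshold comes from this tutorial.

```bash
# Total MPI ranks for `mpirun -np`:
# GPUs per node (.spec.slotsPerWorker) times worker nodes (Worker.replicas).
total_ranks() {
  echo $(( $1 * $2 ))
}

# Read nccl-tests output on stdin and print the "Avg bus bandwidth" value.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t\r]/, "", $2); print $2 }'
}

# Succeed if the value in $1 is strictly greater than the threshold in $2 (GB/s).
busbw_above() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

total_ranks 8 2   # prints 16, the `-np` value for 2 nodes with 8 GPUs each
```

To use the bandwidth check, capture the value while the launcher pod still exists, that is, before you delete the `MPIJob`: `bw=$(kubectl logs nccl-test-nebius-launcher -n nccl-test | avg_busbw)`. Then `busbw_above "$bw" 300` returns a zero exit status when the average exceeds the threshold.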

## How to delete the created resources

Some of the created resources are chargeable. If you no longer need them, delete them so that Nebius AI Cloud stops charging for them:

* Delete the installed operator:

  ```bash theme={null}
  kubectl delete -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.9.3"
  ```

* Delete the node group with GPUs:

  ```bash theme={null}
  export NB_MK8S_CLUSTER_ID=$(nebius mk8s cluster get-by-name \
    --name nccl --format json | jq -r '.metadata.id')
  nebius mk8s node-group delete --id \
    $(nebius mk8s node-group get-by-name \
    --name nccl-gpu-nodes --parent-id $NB_MK8S_CLUSTER_ID \
    --format json | jq -r '.metadata.id')
  ```

* Delete the entire cluster:

  ```bash theme={null}
  nebius mk8s cluster delete --id \
    $(nebius mk8s cluster get-by-name \
    --name nccl --format json | jq -r '.metadata.id')
  ```

***

*InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.*
