Running NCCL tests in a Managed Service for Kubernetes® cluster with InfiniBand™-connected GPUs

To boost performance of the high-performance computing (HPC) and AI workloads that you run in a Managed Service for Kubernetes cluster, you can set it up so that the GPUs on its nodes are interconnected directly using InfiniBand. In this tutorial, you will create a Managed Service for Kubernetes cluster with GPUs interconnected using InfiniBand, install operators and drivers from NVIDIA on it, and run NVIDIA NCCL tests to check InfiniBand performance.

Costs

The tutorial includes the following chargeable resources:

Prerequisites

Install and configure the Nebius AI Cloud CLI.

Save IDs of the default subnet and the k8s-node-group-sa default service account to environment variables:

export SUBNET_ID=$(nebius vpc subnet list --format json \
  | jq -r '.items[0].metadata.id')
export SA_ID=$(nebius iam service-account get-by-name \
  --name k8s-node-group-sa --format json \
  | jq -r '.metadata.id')

Install kubectl and Helm.

Steps

Set up a Managed Service for Kubernetes cluster with GPUs and InfiniBand

Create a GPU cluster:

export GPU_CLUSTER_ID=$(nebius compute gpu-cluster create \
  --name k8s-gpus --infiniband-fabric fabric-3 \
  --format json | jq -r ".metadata.id")

Create a Managed Service for Kubernetes cluster with a public endpoint:

export MK8S_CLUSTER_ID=$(nebius mk8s cluster create \
  --name nccl \
  --control-plane-version 1.33 \
  --control-plane-endpoints-public-endpoint=true \
  --control-plane-subnet-id $SUBNET_ID \
  --format json | jq -r '.metadata.id')

Create a node group in the cluster:

nebius mk8s node-group create \
  --name nccl-gpu-nodes \
  --parent-id $MK8S_CLUSTER_ID \
  --fixed-node-count 2 \
  --template-service-account-id $SA_ID \
  --template-resources-platform "gpu-h100-sxm" \
  --template-resources-preset "8gpu-128vcpu-1600gb" \
  --template-boot-disk-type network_ssd \
  --template-boot-disk-size-bytes 137438953472 \
  --template-gpu-settings-drivers-preset cuda12 \
  --template-gpu-cluster-id $GPU_CLUSTER_ID

For this tutorial, it is required that:

The node group has the GPU cluster specified.
The node group includes at least two nodes.

The nodes use a VM platform and preset compatible with GPU clusters:

Platform	Presets	Regions
NVIDIA® B300 NVLink with Intel Granite Rapids (`gpu-b300-sxm`)	`8gpu-192vcpu-2768gb`	`uk-south1`
NVIDIA® B200 NVLink with Intel Emerald Rapids (`gpu-b200-sxm`)	`8gpu-160vcpu-1792gb`	`us-central1`
NVIDIA® B200 NVLink with Intel Emerald Rapids (`gpu-b200-sxm-a`)	`8gpu-160vcpu-1792gb`	`me-west1`
NVIDIA® H200 NVLink with Intel Sapphire Rapids (`gpu-h200-sxm`)	`8gpu-128vcpu-1600gb`	`eu-north1`, `eu-north2`, `eu-west1`, `us-central1`
NVIDIA® H100 NVLink with Intel Sapphire Rapids (`gpu-h100-sxm`)	`8gpu-128vcpu-1600gb`	`eu-north1`

In this command, the nodes use the gpu-h100-sxm VM platform with the 8gpu-128vcpu-1600gb preset.

The nodes use a boot disk image offered by Managed Kubernetes that contains drivers and other components for GPUs. Without this image, you need to install the drivers and components manually. For more details, see GPU drivers and other components.

Generate a kubeconfig file with the cluster details for kubectl:
```
nebius mk8s cluster get-credentials \
  --id $MK8S_CLUSTER_ID --external
```
To verify that kubectl is connected to the cluster, you can run kubectl cluster-info.

Run the NCCL tests

Install the Kubeflow Training Operator (also known as Kubeflow Trainer).

kubectl apply --server-side -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.9.3"

Create a namespace for the tests, named nccl-test in this tutorial:
```
kubectl create ns nccl-test
```

Create nccl-test.yaml with an MPIJob for your tests.

This example is for 2 nodes. If you created a node group with a different number of nodes, change accordingly the mpirun command in .spec.mpiReplicaSpecs.Launcher.template.spec.containers[0].args and the number of workers in .spec.mpiReplicaSpecs.Worker.replicas.

nccl-test.yaml — NVIDIA B200 GPUs

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: nccl-test-nebius
spec:
  slotsPerWorker: 8 # Number of GPUs on each node
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - args:
            # In `-np 16`, 16 is the total number of GPUs on all nodes
            # (`.spec.slotsPerWorker` × `.spec.mpiReplicaSpecs.Worker.replicas`)
            - 'mpirun -np 16 -bind-to none -x LD_LIBRARY_PATH
              -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0
              -mca coll ^hcoll
              -x UCX_NET_DEVICES=eth0
              -x NCCL_IB_HCA=mlx5
              -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1
              -x NCCL_COLLNET_ENABLE=0
              /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1'
            command:
            - /bin/bash
            - -c
            env:
            - name: OMPI_ALLOW_RUN_AS_ROOT
              value: "1"
            - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
              value: "1"
            image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.26.5-ubu22.04-cu12.8
            name: nccl
            resources:
              requests:
                cpu: 2
                memory: 1208Mi
            securityContext:
              privileged: true
          initContainers:
          - command:
            - sh
            - -c
            - ulimit -Hl unlimited && ulimit -Sl unlimited
            image: busybox:1.27.2
            name: init-limit
            securityContext:
              privileged: true
    Worker:
      replicas: 2 # Number of nodes
      template:
        spec:
          automountServiceAccountToken: false
          containers:
          - image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.26.5-ubu22.04-cu12.8
            name: nccl
            resources:
              # If you have other applications running in your cluster,
              # adjust the `cpu` and `memory` values according to
              # the resources available on the nodes
              limits:
                cpu: 96
                memory: 1600G
                nvidia.com/gpu: 8
              requests:
                cpu: 96
                memory: 1600G
                nvidia.com/gpu: 8
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          enableServiceLinks: false
          initContainers:
          - command:
            - sh
            - -c
            - ulimit -Hl unlimited && ulimit -Sl unlimited
            image: busybox:1.27.2
            name: init-limit
            securityContext:
              privileged: true
          volumes:
          - emptyDir:
              medium: Memory
            name: dshm
  runPolicy:
    cleanPodPolicy: Running

nccl-test.yaml — NVIDIA H200 or H100 GPUs

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: nccl-test-nebius
spec:
  slotsPerWorker: 8 # Number of GPUs on each node
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - args:
            # In `-np 16`, 16 is the total number of GPUs on all nodes
            # (`.spec.slotsPerWorker` × `.spec.mpiReplicaSpecs.Worker.replicas`)
            - 'mpirun -np 16 -bind-to none -x LD_LIBRARY_PATH
              -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0
              -x NCCL_IB_HCA=mlx5
              -x UCX_NET_DEVICES=eth0
              -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1
              -x NCCL_COLLNET_ENABLE=0
              /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1'
            command:
            - /bin/bash
            - -c
            env:
            - name: OMPI_ALLOW_RUN_AS_ROOT
              value: "1"
            - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
              value: "1"
            image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4
            name: nccl
            resources:
              requests:
                cpu: 2
                memory: 1208Mi
            securityContext:
              privileged: true
          initContainers:
          - command:
            - sh
            - -c
            - ulimit -Hl unlimited && ulimit -Sl unlimited
            image: busybox:1.27.2
            name: init-limit
            securityContext:
              privileged: true
    Worker:
      replicas: 2 # Number of nodes
      template:
        spec:
          automountServiceAccountToken: false
          containers:
          - image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4
            name: nccl
            resources:
              # If you have other applications running in your cluster,
              # adjust the `cpu` and `memory` values according to
              # the resources available on the nodes
              limits:
                cpu: 96
                memory: 1600G
                nvidia.com/gpu: 8
              requests:
                cpu: 96
                memory: 1600G
                nvidia.com/gpu: 8
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          enableServiceLinks: false
          initContainers:
          - command:
            - sh
            - -c
            - ulimit -Hl unlimited && ulimit -Sl unlimited
            image: busybox:1.27.2
            name: init-limit
            securityContext:
              privileged: true
          volumes:
          - emptyDir:
              medium: Memory
            name: dshm
  runPolicy:
    cleanPodPolicy: Running

Deploy the MPIJob in nccl-test:

kubectl apply -f nccl-test.yaml -n nccl-test

Check that the test Pods are running:

kubectl get pods -w -n nccl-test

Wait until all the Pods are running, like this:

NAME                        READY   STATUS    RESTARTS   AGE
nccl-test-nebius-launcher   1/1     Running   0          24s
nccl-test-nebius-worker-0   1/1     Running   0          24s
nccl-test-nebius-worker-1   1/1     Running   0          24s

Check the test logs:

kubectl logs -f nccl-test-nebius-launcher -n nccl-test \
  | grep -v "NCCL INFO"

In the result, check the average bus bandwidth. If its value is higher than 300 GB/sec, the connection is stable. Example:

...
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   536870912     134217728     float     sum      -1   3674.4  146.11  283.09      0   3648.4  147.15  285.11      0
  1073741824     268435456     float     sum      -1   6411.6  167.47  324.47      0   6416.7  167.33  324.21      0
  2147483648     536870912     float     sum      -1    12735  168.62  326.71      0    12979  165.45  320.57      0
  4294967296    1073741824     float     sum      -1    25389  169.17  327.76      0    25598  167.79  325.09      0
  8589934592    2147483648     float     sum      -1    50979  168.50  326.47      0    50799  169.10  327.63      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 317.11

The average bus bandwidth is not equal to the InfiniBand one as some of the NCCL operations it measures use NVLink. Nevertheless, it accurately estimates the connection. To stop streaming logs, press Ctrl + C.

Delete the MPIJob.
```
kubectl delete -f nccl-test.yaml -n nccl-test
```
You should delete the MPIJob even if you want to run another test. In this case, redeploy the MPIJob as described in steps 2–3.

How to delete the created resources

Some of the created resources are chargeable. If you do not need them, delete these resources, so Nebius AI Cloud does not charge for them:

Delete the installed operator:

kubectl delete -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.9.3"

Delete the node group with GPUs:

export MK8S_CLUSTER_ID=$(nebius mk8s cluster get-by-name \
  --name nccl --format json | jq -r '.metadata.id')
nebius mk8s node-group delete --id \
  $(nebius mk8s node-group get-by-name \
  --name nccl-gpu-nodes --parent-id $MK8S_CLUSTER_ID \
  --format json | jq -r '.metadata.id')

Delete the entire cluster:

nebius mk8s cluster delete --id \
  $(nebius mk8s cluster get-by-name \
  --name nccl --format json | jq -r '.metadata.id')

InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.

​Costs

​Prerequisites

​Steps

​Set up a Managed Service for Kubernetes cluster with GPUs and InfiniBand

​Run the NCCL tests

​How to delete the created resources

Costs

Prerequisites

Steps

Set up a Managed Service for Kubernetes cluster with GPUs and InfiniBand

Run the NCCL tests

How to delete the created resources