To boost the performance of high-performance computing (HPC) and AI workloads that you run in a Managed Service for Kubernetes cluster, you can set it up so that the GPUs on its nodes are interconnected directly over InfiniBand. In this tutorial, you will create a Managed Service for Kubernetes cluster with GPUs interconnected over InfiniBand, install the Kubeflow Training Operator on it, and run NVIDIA NCCL tests to check InfiniBand performance.

Costs

This tutorial uses chargeable resources: the GPU cluster and the Managed Service for Kubernetes cluster with its GPU node group.

Prerequisites

  1. Install and configure the Nebius AI Cloud CLI.
  2. Save IDs of the default subnet and the k8s-node-group-sa default service account to environment variables:
    export NB_SUBNET_ID=$(nebius vpc subnet list --format json \
      | jq -r '.items[0].metadata.id')
    export NB_SA_ID=$(nebius iam service-account get-by-name \
      --name k8s-node-group-sa --format json \
      | jq -r '.metadata.id')
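If a lookup fails, `jq` prints `null` or an empty string, and later commands fail with a confusing error. A small sanity check you can run after the exports (the placeholder IDs below are hypothetical, standing in for real lookup results):

```shell
# Hypothetical placeholder values; in practice these come from the
# `nebius ... | jq` commands above.
NB_SUBNET_ID="vpcsubnet-example"
NB_SA_ID="serviceaccount-example"

# Fail fast if either lookup returned nothing or jq printed "null".
for val in "$NB_SUBNET_ID" "$NB_SA_ID"; do
  if [ -z "$val" ] || [ "$val" = "null" ]; then
    echo "error: a prerequisite ID is empty" >&2
    exit 1
  fi
done
echo "prerequisite IDs resolved"
```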
    
  3. Install kubectl and Helm.

Steps

Set up a Managed Service for Kubernetes cluster with GPUs and InfiniBand

  1. Create a GPU cluster:
    export NB_GPU_CLUSTER_ID=$(nebius compute gpu-cluster create \
      --name k8s-gpus --infiniband-fabric fabric-3 \
      --format json | jq -r ".metadata.id")
    
  2. Create a Managed Service for Kubernetes cluster with a public endpoint:
    export NB_MK8S_CLUSTER_ID=$(nebius mk8s cluster create \
      --name nccl \
      --control-plane-version 1.32 \
      --control-plane-endpoints-public-endpoint=true \
      --control-plane-subnet-id $NB_SUBNET_ID \
      --format json | jq -r '.metadata.id')
    
  3. Create a node group in the cluster:
    nebius mk8s node-group create \
      --name nccl-gpu-nodes \
      --parent-id $NB_MK8S_CLUSTER_ID \
      --fixed-node-count 2 \
      --template-service-account-id $NB_SA_ID \
      --template-resources-platform "gpu-h100-sxm" \
      --template-resources-preset "8gpu-128vcpu-1600gb" \
      --template-boot-disk-type network_ssd \
      --template-boot-disk-size-bytes 137438953472 \
      --template-gpu-settings-drivers-preset cuda12 \
      --template-gpu-cluster-id $NB_GPU_CLUSTER_ID
    
    For this tutorial, it is required that:
    • The node group has the GPU cluster specified.
    • The nodes use a VM platform and preset compatible with GPU clusters:
      | Platform | Preset | Regions |
      | --- | --- | --- |
      | NVIDIA® B300 NVLink with Intel Granite Rapids (gpu-b300-sxm) | 8gpu-192vcpu-2768gb | uk-south1 |
      | NVIDIA® B200 NVLink with Intel Emerald Rapids (gpu-b200-sxm) | 8gpu-160vcpu-1792gb | us-central1 |
      | NVIDIA® B200 NVLink with Intel Emerald Rapids (gpu-b200-sxm-a) | 8gpu-160vcpu-1792gb | me-west1 |
      | NVIDIA® H200 NVLink with Intel Sapphire Rapids (gpu-h200-sxm) | 8gpu-128vcpu-1600gb | eu-north1, eu-north2, eu-west1, us-central1 |
      | NVIDIA® H100 NVLink with Intel Sapphire Rapids (gpu-h100-sxm) | 8gpu-128vcpu-1600gb | eu-north1 |

      In this command, the nodes use the gpu-h100-sxm VM platform with the 8gpu-128vcpu-1600gb preset.
    • The nodes use a boot disk image offered by Managed Kubernetes that contains drivers and other components for GPUs. Without this image, you need to install the drivers and components manually. For more details, see GPU drivers and other components.
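The --template-boot-disk-size-bytes flag takes a plain byte count. As a quick sanity check, the 137438953472 used above works out to 128 GiB:

```shell
# Boot disk size in the node-group command above, expressed in GiB.
GIB=$((1024 * 1024 * 1024))   # bytes in one GiB
BOOT_DISK_GIB=128
echo $((BOOT_DISK_GIB * GIB))   # prints 137438953472
```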
  4. Generate a kubeconfig file with the cluster details for kubectl:
    nebius mk8s cluster get-credentials \
      --id $NB_MK8S_CLUSTER_ID --external
    
    To verify that kubectl is connected to the cluster, you can run kubectl cluster-info.

Run the NCCL tests

  1. Install the Kubeflow Training Operator.
    kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
    
  2. Create a namespace for the tests, named nccl-test in this tutorial:
    kubectl create ns nccl-test
    
  3. Create nccl-test.yaml with an MPIJob for your tests.
    This example is for 2 nodes. If you created a node group with a different number of nodes, adjust the mpirun command in .spec.mpiReplicaSpecs.Launcher.template.spec.containers[0].args and the number of workers in .spec.mpiReplicaSpecs.Worker.replicas accordingly.
    apiVersion: kubeflow.org/v1
    kind: MPIJob
    metadata:
      name: nccl-test-nebius
    spec:
      slotsPerWorker: 8 # Number of GPUs on each node
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            spec:
              containers:
              - args:
                # In `-np 16`, 16 is the total number of GPUs on all nodes 
                # (`.spec.slotsPerWorker` × `.spec.mpiReplicaSpecs.Worker.replicas`)
                - 'mpirun -np 16 -bind-to none -x LD_LIBRARY_PATH 
                  -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 
                  -mca coll ^hcoll
                  -x UCX_NET_DEVICES=eth0
                  -x NCCL_IB_HCA=mlx5 
                  -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 
                  -x NCCL_COLLNET_ENABLE=0
                  /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1'
                command:
                - /bin/bash
                - -c
                env:
                - name: OMPI_ALLOW_RUN_AS_ROOT
                  value: "1"
                - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                  value: "1"
                image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.26.5-ubu22.04-cu12.8
                name: nccl
                resources:
                  requests:
                    cpu: 2
                    memory: 1208Mi
                securityContext:
                  privileged: true
              initContainers:
              - command:
                - sh
                - -c
                - ulimit -Hl unlimited && ulimit -Sl unlimited
                image: busybox:1.27.2
                name: init-limit
                securityContext:
                  privileged: true
        Worker:
          replicas: 2 # Number of nodes
          template:
            spec:
              automountServiceAccountToken: false
              containers:
              - image: cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.26.5-ubu22.04-cu12.8
                name: nccl
                resources: 
                  # If you have other applications running in your cluster, 
                  # adjust the `cpu` and `memory` values according to 
                  # the resources available on the nodes
                  limits:
                    cpu: 96
                    memory: 1600G
                    nvidia.com/gpu: 8
                  requests:
                    cpu: 96
                    memory: 1600G
                    nvidia.com/gpu: 8
                securityContext:
                  privileged: true
                volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
              enableServiceLinks: false
              initContainers:
              - command:
                - sh
                - -c
                - ulimit -Hl unlimited && ulimit -Sl unlimited
                image: busybox:1.27.2
                name: init-limit
                securityContext:
                  privileged: true
              volumes:
              - emptyDir:
                  medium: Memory
                name: dshm
      runPolicy:
        cleanPodPolicy: Running
    
    
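As the comments in the manifest note, the -np value passed to mpirun is derived from the worker count and the GPUs per worker. A quick sketch of that arithmetic:

```shell
# Total GPU count across all workers:
# .spec.slotsPerWorker * .spec.mpiReplicaSpecs.Worker.replicas
SLOTS_PER_WORKER=8   # GPUs per node
WORKER_REPLICAS=2    # number of nodes
NP=$((SLOTS_PER_WORKER * WORKER_REPLICAS))
echo "mpirun -np $NP"   # prints: mpirun -np 16
```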
  4. Deploy the MPIJob in nccl-test:
    kubectl apply -f nccl-test.yaml -n nccl-test
    
  5. Check that the test pods are running:
    kubectl get pods -w -n nccl-test
    
    Wait until all the pods are running, like this:
    NAME                        READY   STATUS    RESTARTS   AGE
    nccl-test-nebius-launcher   1/1     Running   0          24s
    nccl-test-nebius-worker-0   1/1     Running   0          24s
    nccl-test-nebius-worker-1   1/1     Running   0          24s
    
  6. Check the test logs:
    kubectl logs -f nccl-test-nebius-launcher -n nccl-test \
      | grep -v "NCCL INFO"
    
    In the result, check the average bus bandwidth. If its value is higher than 300 GB/sec, the connection is stable. Example:
    ...
    #                                                              out-of-place                       in-place          
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       536870912     134217728     float     sum      -1   3674.4  146.11  283.09      0   3648.4  147.15  285.11      0
      1073741824     268435456     float     sum      -1   6411.6  167.47  324.47      0   6416.7  167.33  324.21      0
      2147483648     536870912     float     sum      -1    12735  168.62  326.71      0    12979  165.45  320.57      0
      4294967296    1073741824     float     sum      -1    25389  169.17  327.76      0    25598  167.79  325.09      0
      8589934592    2147483648     float     sum      -1    50979  168.50  326.47      0    50799  169.10  327.63      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 317.11
    
    The average bus bandwidth does not equal the raw InfiniBand bandwidth, because some of the NCCL operations it measures run over NVLink. Nevertheless, it is a good estimate of interconnect performance. To stop streaming logs, press Ctrl + C.
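If you want to apply the 300 GB/s threshold programmatically rather than by eye, a small awk sketch over the summary line works; here it is fed the sample value from the log output above:

```shell
# Extract the average bus bandwidth from the NCCL summary line and
# compare it against the 300 GB/s threshold from this tutorial.
sample_log='# Avg bus bandwidth    : 317.11'
avg=$(printf '%s\n' "$sample_log" \
  | awk -F: '/Avg bus bandwidth/ {gsub(/ /, "", $2); print $2}')
# `v + 0` forces a numeric comparison in awk.
awk -v v="$avg" 'BEGIN { exit (v + 0 > 300 ? 0 : 1) }' \
  && echo "bandwidth OK: ${avg} GB/s"   # prints: bandwidth OK: 317.11 GB/s
```

In a real run, you would pipe the launcher log into awk instead of using the sample_log placeholder.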
  7. Delete the MPIJob.
    kubectl delete -f nccl-test.yaml -n nccl-test
    
    You should delete the MPIJob even if you want to run another test. To run it again, redeploy the MPIJob as described in steps 3–4.

How to delete the created resources

Some of the created resources are chargeable. If you no longer need them, delete them so that Nebius AI Cloud does not charge for them:
  • Delete the installed operator:
    kubectl delete -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
    
  • Delete the node group with GPUs:
    export NB_MK8S_CLUSTER_ID=$(nebius mk8s cluster get-by-name \
      --name nccl --format json | jq -r '.metadata.id')
    nebius mk8s node-group delete --id \
      $(nebius mk8s node-group get-by-name \
      --name nccl-gpu-nodes --parent-id $NB_MK8S_CLUSTER_ID \
      --format json | jq -r '.metadata.id')
    
  • Delete the entire cluster:
    nebius mk8s cluster delete --id \
      $(nebius mk8s cluster get-by-name \
      --name nccl --format json | jq -r '.metadata.id')
    

InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.