Costs
The tutorial includes chargeable resources.
Prerequisites
- Install and configure the Nebius AI Cloud CLI.
- Save IDs of the default subnet and the k8s-node-group-sa default service account to environment variables.
- Install kubectl and Helm.
Steps
Set up a Managed Service for Kubernetes cluster with GPUs and InfiniBand
- Create a GPU cluster.
- Create a Managed Service for Kubernetes cluster with a public endpoint.
- Create a node group in the cluster.
For this tutorial, it is required that:
- The node group has the GPU cluster specified.
- The nodes use a VM platform and preset compatible with GPU clusters:

| Platform | Presets | Regions |
| --- | --- | --- |
| NVIDIA® B300 NVLink with Intel Granite Rapids (gpu-b300-sxm) | 8gpu-192vcpu-2768gb | uk-south1 |
| NVIDIA® B200 NVLink with Intel Emerald Rapids (gpu-b200-sxm) | 8gpu-160vcpu-1792gb | us-central1 |
| NVIDIA® B200 NVLink with Intel Emerald Rapids (gpu-b200-sxm-a) | 8gpu-160vcpu-1792gb | me-west1 |
| NVIDIA® H200 NVLink with Intel Sapphire Rapids (gpu-h200-sxm) | 8gpu-128vcpu-1600gb | eu-north1, eu-north2, eu-west1, us-central1 |
| NVIDIA® H100 NVLink with Intel Sapphire Rapids (gpu-h100-sxm) | 8gpu-128vcpu-1600gb | eu-north1 |

  In this command, the nodes use the gpu-h100-sxm VM platform with the 8gpu-128vcpu-1600gb preset.
- The nodes use a boot disk image offered by Managed Kubernetes that contains drivers and other components for GPUs. Without this image, you need to install the drivers and components manually. For more details, see GPU drivers and other components.
- Generate a kubeconfig file with the cluster details for kubectl.
  To verify that kubectl is connected to the cluster, you can run kubectl cluster-info.
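Once kubectl is connected, it can also be useful to confirm that every node in the group exposes the expected number of GPUs. The shell sketch below is not part of the original tutorial. The helper name check_gpu_nodes is hypothetical, and the sketch assumes that the GPU components in the boot disk image advertise GPUs through the nvidia.com/gpu resource, which is the usual name for NVIDIA GPUs in Kubernetes but is not confirmed here.

```shell
# Hypothetical helper (not from the tutorial): reads "NODE GPU_COUNT" lines on
# stdin and fails if no nodes are listed or if any node reports a GPU count
# different from the expected one (8 for all presets in the table above).
check_gpu_nodes() {
  expected="${1:-8}"
  awk -v want="$expected" '
    {
      seen++
      if ($2 != want) {
        bad++
        print "unexpected GPU count on " $1 ": " $2
      }
    }
    END {
      if (seen == 0) {
        print "no nodes found"
        exit 1
      }
      exit (bad > 0)
    }
  '
}

# Possible usage, assuming GPUs are advertised as nvidia.com/gpu:
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
#     | check_gpu_nodes 8
```

A node without an advertised GPU resource shows up as <none> in the kubectl output, which the helper reports as a mismatch.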
Run the NCCL tests
- Install the Kubeflow Training Operator.
- Create a namespace for the tests, named nccl-test in this tutorial.
- Create nccl-test.yaml with an MPIJob for your tests.
  This example is for 2 nodes. If you created a node group with a different number of nodes, change the mpirun command in .spec.mpiReplicaSpecs.Launcher.template.spec.containers[0].args and the number of workers in .spec.mpiReplicaSpecs.Worker.replicas accordingly.
  Use the nccl-test.yaml variant that matches your GPUs: NVIDIA B200, or NVIDIA H200 or H100.
- Deploy the MPIJob in the nccl-test namespace.
- Check that the test pods are running.
  Wait until all the pods are in the Running status.
- Check the test logs.
  In the result, check the average bus bandwidth. If its value is higher than 300 GB/s, the connection is stable.
  The average bus bandwidth is not equal to the InfiniBand bandwidth, because some of the NCCL operations it measures use NVLink. Nevertheless, it gives an accurate estimate of the connection.
  To stop streaming logs, press Ctrl + C.
- Delete the MPIJob.
  You should delete the MPIJob even if you want to run another test. In this case, redeploy the MPIJob as described in steps 2–3.
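The checks in the steps above can be scripted. The following shell sketch is not part of the original tutorial, and all function names in it are hypothetical. It makes two assumptions: that the mpirun command is sized with one MPI process per GPU, and that the NCCL test logs end with a summary line of the form "# Avg bus bandwidth : <value>", as nccl-tests typically prints. The only kubectl calls are the standard delete and apply subcommands.

```shell
# Sketch only: helper names are hypothetical and not part of the tutorial.

NAMESPACE="nccl-test"      # namespace name used in this tutorial
MANIFEST="nccl-test.yaml"  # manifest file name used in this tutorial

# Total MPI process count for the mpirun command, ASSUMING one process per GPU.
# All presets listed in this tutorial have 8 GPUs per node.
mpi_process_count() {
  local nodes="$1" gpus_per_node="${2:-8}"
  echo $(( nodes * gpus_per_node ))
}

# Reads `kubectl get pods --no-headers` output on stdin. Succeeds only if at
# least one pod is listed and every pod has the Running status (third column).
all_pods_running() {
  awk '
    { seen++; if ($3 != "Running") bad++ }
    END { exit (seen == 0 || bad > 0) }
  '
}

# Reads NCCL test logs on stdin and compares the average bus bandwidth with a
# threshold in GB/s (300 by default, as in the log-checking step).
# ASSUMPTION: the summary line looks like "# Avg bus bandwidth    : <number>".
check_bus_bandwidth() {
  local threshold="${1:-300}"
  awk -v min="$threshold" -F':' '
    /Avg bus bandwidth/ { bw = $NF + 0; found = 1 }
    END {
      if (!found) {
        print "no average bus bandwidth line found"
        exit 2
      }
      printf "average bus bandwidth: %s GB/s (threshold: %s)\n", bw, min
      exit (bw > min ? 0 : 1)
    }
  '
}

# Deletes the previous MPIJob (if any) and deploys it again for another run.
rerun_nccl_test() {
  kubectl delete -f "$MANIFEST" -n "$NAMESPACE" --ignore-not-found
  kubectl apply -f "$MANIFEST" -n "$NAMESPACE"
}
```

For example, piping the output of kubectl get pods -n nccl-test --no-headers into all_pods_running returns success once every pod is Running, and piping the test logs into check_bus_bandwidth applies the 300 GB/s threshold. For a node group with 4 nodes, mpi_process_count 4 gives the process count under the one-process-per-GPU assumption.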
How to delete the created resources
Some of the created resources are chargeable. If you do not need them, delete these resources so that Nebius AI Cloud does not charge for them:
- Delete the installed operator.
- Delete the node group with GPUs.
- Delete the entire cluster.
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.