InfiniBand™ networking for Compute virtual machines with GPUs
You can group your virtual machines with GPUs into a GPU cluster. The cluster accelerates high-performance computing (HPC) tasks, such as training and inference, that require more processing power than a single VM can provide.

GPU clusters are built on secure high-speed InfiniBand networking. Each GPU in a VM is connected through a network interface card (NIC) that provides 400 Gbps of bandwidth. Because a Compute VM for GPU clusters contains 8 GPUs, the total bandwidth per node is 3.2 Tbps.

Nebius AI Cloud uses GPUDirect RDMA, an NVIDIA remote direct memory access (RDMA) technology that lets data flow directly between each GPU and its NIC, bypassing the CPU and thus speeding up data exchange.
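The per-node figure follows directly from the per-NIC numbers above (a minimal arithmetic check; the constants come from this section, not from any API):

```shell
# Per-node InfiniBand bandwidth: 8 GPUs, each with a dedicated 400 Gbps NIC.
# 3200 Gbps = 3.2 Tbps per node.
echo "$(( 8 * 400 )) Gbps total"
```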
Each GPU cluster is created in one of the physical InfiniBand fabrics, where the GPUs interconnected over InfiniBand are located. Each fabric has limited GPU capacity.

When creating a GPU cluster, select an InfiniBand fabric for it, taking into account the type of GPUs you are going to use. For example, if you select fabric-7, you can only add NVIDIA® H200 NVLink with Intel Sapphire Rapids GPUs to this cluster.

Available fabrics and corresponding regions (private regions are marked with *):
| Fabric | Platform | Region |
|---|---|---|
| | NVIDIA® H100 NVLink with Intel Sapphire Rapids (gpu-h100-sxm) | eu-north1 |
| fabric-3 | NVIDIA® H100 NVLink with Intel Sapphire Rapids (gpu-h100-sxm) | eu-north1 |
| fabric-4 | NVIDIA® H100 NVLink with Intel Sapphire Rapids (gpu-h100-sxm) | eu-north1 |
| fabric-5 | NVIDIA® H200 NVLink with Intel Sapphire Rapids (gpu-h200-sxm) | eu-west1 |
| fabric-6 | NVIDIA® H100 NVLink with Intel Sapphire Rapids (gpu-h100-sxm) | eu-north1 |
| fabric-7 | NVIDIA® H200 NVLink with Intel Sapphire Rapids (gpu-h200-sxm) | eu-north1 |
| eu-north2-a | NVIDIA® H200 NVLink with Intel Sapphire Rapids (gpu-h200-sxm) | eu-north2 |
| me-west1-a | NVIDIA® B200 NVLink with Intel Emerald Rapids (gpu-b200-sxm-a) | me-west1 |
| uk-south1-a | NVIDIA® B300 NVLink with Intel Granite Rapids (gpu-b300-sxm) | uk-south1 |
| us-central1-a | NVIDIA® H200 NVLink with Intel Sapphire Rapids (gpu-h200-sxm) | us-central1 |
| us-central1-b | NVIDIA® B200 NVLink with Intel Emerald Rapids (gpu-b200-sxm) | us-central1 |
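The fabric is chosen once, when the GPU cluster is created. A minimal sketch of the creation command (the `--infiniband-fabric` flag name and the values here are assumptions, not confirmed by this page — verify against `nebius compute gpu-cluster create --help` for your CLI version):

```shell
# Create a GPU cluster pinned to a specific InfiniBand fabric.
# Flag name --infiniband-fabric is assumed; check the CLI help for the exact flag.
nebius compute gpu-cluster create \
  --name my-training-cluster \
  --infiniband-fabric fabric-3
```

All VMs later added to this cluster must use a GPU platform that matches the fabric (per the table above, fabric-3 hosts gpu-h100-sxm).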
In most cases, you do not need to change the preselected fabric. We recommend creating a GPU cluster in a different fabric only if that fabric better suits your target platform or if you experience capacity issues with an existing GPU cluster.
All virtual machines added to the GPU cluster, including Managed Service for Kubernetes nodes, must be in the same project.
Depending on how you specify the parameters in the `nebius compute instance create` command:

JSON
Use `.spec.gpu_cluster.id` for the GPU cluster ID. Specify a VM platform with GPUs in `.spec.resources.platform` and a preset compatible with GPU clusters in `.spec.resources.preset`. The compatible platforms and presets are:

CLI parameters
Use `--gpu-cluster-id` for the GPU cluster ID and `--boot-disk-existing-disk-id` for the boot disk ID. Specify a VM platform with GPUs in `--resources-platform` and a preset compatible with GPU clusters in `--resources-preset`. The compatible platforms and presets are:
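For the JSON variant, the fields named above sit in the instance spec roughly as follows (a hedged sketch: the `.spec.gpu_cluster.id`, `.spec.resources.platform`, and `.spec.resources.preset` paths come from this page, while the placeholder ID and the `8gpu-128vcpu-1600gb` preset value are illustrative assumptions):

```json
{
  "spec": {
    "gpu_cluster": {
      "id": "<gpu-cluster-ID>"
    },
    "resources": {
      "platform": "gpu-h100-sxm",
      "preset": "8gpu-128vcpu-1600gb"
    }
  }
}
```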
To test InfiniBand performance in a Compute cluster, you can run the NVIDIA NCCL test in it. For instructions, see our tutorial on running distributed jobs with mpirun, which uses the NCCL test as an example.
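For orientation, a typical NCCL all-reduce benchmark launched over two 8-GPU nodes looks like this (a sketch, not the tutorial's exact command: host names, the nccl-tests build path, and process counts are assumptions to adapt to your cluster):

```shell
# Launch the NCCL all_reduce benchmark as 16 MPI processes: 8 per node.
# -H lists hosts with slot counts; -x forwards an environment variable.
mpirun -np 16 -H node0:8,node1:8 \
  -x NCCL_DEBUG=INFO \
  ./nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
```

The benchmark reports bus bandwidth per message size; on a healthy fabric it should approach the NIC line rate for large messages.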