Why InfiniBand topology is important
Workload managers like Slurm and Volcano allow you to specify the network topology of the virtual machines in a GPU cluster for topology-aware job scheduling. A topology file describes how the VMs are located relative to each other. When you provide a topology file to a workload manager, it schedules distributed jobs on the worker nodes that are topologically closest to each other. This reduces network latency and improves performance in both real workloads and synthetic tests. For example, AllReduce tests from the NCCL Tests suite that we ran in Nebius AI Cloud as topology-aware jobs showed performance gains of up to 20%, depending on cluster size, compared to the same tests run without a topology file.
Architecture of InfiniBand topology
Every virtual machine with GPUs is connected to a set of three nodes that belong to a particular InfiniBand fabric, with each node located on a separate network layer. The hierarchy of an InfiniBand network includes the following layers with different types of switches:
- InfiniBand fabric layer. Contains a root switch.
- Point of delivery (POD). Represents a set of racks with servers. Core switches interconnect PODs.
- Scalable unit (SU). Consists of a set of servers. Leaf switches interconnect scalable units.
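The three-layer hierarchy above maps naturally onto the tree format of a Slurm `topology.conf` file. The sketch below is illustrative only: the switch and node names (`root`, `pod1`, `su1`, `worker[...]`) are hypothetical, and your actual file should reflect the topology reported for your cluster.

```
# Hypothetical topology.conf mirroring the three layers described above
SwitchName=root Switches=pod[1-2]       # InfiniBand fabric layer (root switch)
SwitchName=pod1 Switches=su[1-2]        # POD interconnected by core switches
SwitchName=pod2 Switches=su[3-4]
SwitchName=su1  Nodes=worker[1-8]       # scalable units behind leaf switches
SwitchName=su2  Nodes=worker[9-16]
SwitchName=su3  Nodes=worker[17-24]
SwitchName=su4  Nodes=worker[25-32]
```

With this file in place, Slurm's tree topology plugin prefers placing a job's nodes under the same leaf switch before spilling into other SUs or PODs.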
Cost of network communication
The cost of network communication increases as traffic crosses higher layers. For example, if you need to transfer data from one POD to another, the data goes through a root switch, which consumes more resources. The number of connections between PODs or SUs does not influence the cost: if a connection or data exchange stays within one entity (a POD or an SU), the cost of network communication does not change.

InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.
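The layer-crossing rule above can be sketched as a small cost function. This is an illustrative model, not an official API: node names of the form `"pod/su/worker"` and the integer cost scale are assumptions for the example.

```python
# Illustrative sketch: relative communication cost between two workers,
# derived from their position in the POD/SU hierarchy described above.
# The "pod/su/worker" naming scheme here is hypothetical.

def hop_cost(a: str, b: str) -> int:
    """0 = same SU (same leaf switch), 1 = same POD but different SUs,
    2 = different PODs (traffic must cross the root switch)."""
    pod_a, su_a, _ = a.split("/")
    pod_b, su_b, _ = b.split("/")
    if pod_a != pod_b:
        return 2  # crosses the root switch at the fabric layer
    if su_a != su_b:
        return 1  # crosses switches within one POD
    return 0      # stays within one scalable unit

print(hop_cost("pod1/su1/worker1", "pod1/su1/worker2"))   # 0
print(hop_cost("pod1/su1/worker1", "pod1/su2/worker9"))   # 1
print(hop_cost("pod1/su1/worker1", "pod2/su3/worker17"))  # 2
```

A topology-aware scheduler effectively minimizes the maximum of this cost over all node pairs assigned to a job.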