Modern AI/ML workloads depend heavily on high-throughput, low-latency communication between nodes. In GPU clusters connected via InfiniBand™, the physical network topology has a direct impact on performance. Topology-aware scheduling (TAS) enables Kubernetes schedulers to optimize workload placement based on how nodes are physically connected within the InfiniBand fabric.

With this feature, Nebius AI Cloud exposes InfiniBand topology information as node labels, allowing schedulers to place workloads on nodes that are closer in the network hierarchy. This can improve communication efficiency and provide performance gains for distributed workloads. For more information about Kubernetes scheduling, see the Kubernetes scheduler documentation.
Prerequisites
- Create a Managed Service for Kubernetes cluster and attach at least one GPU node group to it.
- Install kubectl.
- Connect to the cluster by using kubectl.
How to view topology labels in your cluster
View the topology labels on GPU nodes by using kubectl. The labels have the following meanings:

| Label | Description |
|---|---|
| `topology.nebius.com/gpu-cluster-id` | Identifies a connected high-speed network domain. Nodes with the same value have high-speed network connectivity between each other. |
| `topology.nebius.com/tier-0` | The closest topology level. For example, this can represent a direct accelerator interconnect, such as multi-node NVLink between NVIDIA GPUs. This label is set only for nodes that support this type of communication. |
| `topology.nebius.com/tier-1` | A lower-level network locality domain. For example, this can represent rack-level switches that connect nodes in one or more racks into a single block. |
| `topology.nebius.com/tier-2` | A wider network locality domain. For example, this can represent spine-level switches that connect multiple blocks inside a data center. |

Two nodes that share the same `topology.nebius.com/tier-1` value are expected to be closer to each other than two nodes that only share the same `topology.nebius.com/tier-2` value.
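One way to list these labels is kubectl's standard `-L` (`--label-columns`) flag, which prints one column per label key. This is a minimal sketch; the `LABELS` variable name is only illustrative.

```shell
# Topology label keys from the table above, as a comma-separated list.
LABELS="topology.nebius.com/gpu-cluster-id,topology.nebius.com/tier-0,topology.nebius.com/tier-1,topology.nebius.com/tier-2"

# -L (--label-columns) adds one column per label key to the node listing.
kubectl get nodes -L "$LABELS"
```

Each column header is derived from the last segment of the label key, for example `TIER-1`. An empty cell means that the label is not set on that node.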
Example output
In this example:

- `computeinstance-e00f4wsk77x4vsr58s` and `computeinstance-e00v7g42bam61yqzp3` share the same `GPU-CLUSTER-ID` and `TIER-2` values, which means they belong to the same high-speed network domain and the same wider network locality domain. These nodes have different `TIER-1` values, which indicates that they belong to different lower-level locality domains.
- `computeinstance-e00tw5jypq4zvfrsrx` belongs to a different GPU cluster and topology hierarchy because all of its topology label values are different.
- `computeinstance-e00nm33x3y9597zzxj` does not have topology labels. This usually means that the node is not attached to a GPU cluster, or TAS is not enabled.
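To narrow the listing to one locality domain, or to find nodes without topology labels, standard kubectl label selectors can help. The following is a sketch: `<gpu_cluster_id>` is a placeholder for a value taken from your own node listing, and the variable names are illustrative.

```shell
# Placeholder: replace with a GPU-CLUSTER-ID value from your own nodes.
GPU_CLUSTER_ID="<gpu_cluster_id>"

# Equality selector: nodes in the same high-speed network domain,
# shown together with their tier-2 and tier-1 values.
SAME_CLUSTER="topology.nebius.com/gpu-cluster-id=${GPU_CLUSTER_ID}"
kubectl get nodes -l "$SAME_CLUSTER" -L topology.nebius.com/tier-2,topology.nebius.com/tier-1

# "!key" selector: nodes on which the label is not set at all.
UNLABELED='!topology.nebius.com/gpu-cluster-id'
kubectl get nodes -l "$UNLABELED"
```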
How to enable topology-aware scheduling
Kueue is used below as an example scheduler. You can also use other schedulers that support TAS, such as Volcano.
Install and configure Kueue
- Install Kueue.
- Enable TAS.
- Create a file named `kueue-tas.yaml` to configure the Kueue resources. To get the `<node_group_ID>`, open your Kubernetes cluster in the web console, go to the Node groups tab and copy the node group ID.
- Apply the configuration.
- Check that the Kueue resources were created.
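The configuration steps above can be sketched as follows. Treat this as an illustration under stated assumptions, not as the exact manifest for your cluster: all resource names and quotas are hypothetical, the `nebius.com/node-group-id` label key is an assumption that you should verify on your nodes, and the `apiVersion` values depend on your Kueue release (`kubectl api-resources --api-group=kueue.x-k8s.io` shows what your cluster serves). In Kueue releases where TAS is still gated, the `TopologyAwareScheduling` feature gate also has to be enabled on the Kueue controller manager.

```shell
# Sketch only: names, quotas and the node-group label key are assumptions.
# Replace <node_group_ID> before applying.
cat > kueue-tas.yaml <<'EOF'
# Topology levels, ordered from the widest domain to the narrowest.
apiVersion: kueue.x-k8s.io/v1alpha1   # assumption: depends on your Kueue release
kind: Topology
metadata:
  name: nebius-topology               # hypothetical name
spec:
  levels:
  - nodeLabel: topology.nebius.com/gpu-cluster-id
  - nodeLabel: topology.nebius.com/tier-2
  - nodeLabel: topology.nebius.com/tier-1
  - nodeLabel: kubernetes.io/hostname
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tas-flavor                    # hypothetical name
spec:
  nodeLabels:
    # Assumed label key; check it with: kubectl get nodes --show-labels
    nebius.com/node-group-id: <node_group_ID>
  topologyName: nebius-topology
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tas-cluster-queue             # hypothetical name
spec:
  namespaceSelector: {}               # accept workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: tas-flavor
      resources:
      - name: cpu
        nominalQuota: 256             # placeholder quota
      - name: memory
        nominalQuota: 2Ti             # placeholder quota
      - name: nvidia.com/gpu
        nominalQuota: 16              # placeholder quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: tas-user-queue                # hypothetical name
spec:
  clusterQueue: tas-cluster-queue
EOF

# Apply the resources, then list them to confirm that they exist.
kubectl apply -f kueue-tas.yaml
kubectl get topologies,resourceflavors,clusterqueues,localqueues
```

The `tier-0` label is left out of the levels in this sketch because it is set only on nodes that support that type of interconnect.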
Schedule workloads with TAS using Kueue
To request TAS, add a topology annotation to the Pod template of your workload.

- Create a file named `job-tas.yaml` that requests TAS for the workload. In the file, replace the following variables:
  - `<number_of_replicas>`: Number of Pods that run in parallel.
  - `<annotations_string>`: Requested topology constraint. The following table lists the available values:

    | Scenario | Description | Value of `<annotations_string>` |
    |---|---|---|
    | Required to run within a GPU cluster | Kueue admits the workload only if enough resources are available within the same GPU cluster. Otherwise, the workload remains pending until resources become available or the topology constraint changes. | `kueue.x-k8s.io/podset-required-topology: "topology.nebius.com/gpu-cluster-id"` |
    | Required to run within a topology domain | Kueue admits the workload only if enough resources are available within the same topology domain. Otherwise, the workload remains pending until resources become available or the topology constraint changes. | `kueue.x-k8s.io/podset-required-topology: "topology.nebius.com/tier-2"` |
    | Preferred to run within a topology domain | Kueue tries to schedule Pods within the same topology domain. If this is not possible, Kueue can place Pods across multiple topology domains. | `kueue.x-k8s.io/podset-preferred-topology: "topology.nebius.com/tier-2"` |

    In Kueue, a podset represents a group of Pods belonging to the same workload (for example, replicas of a Job).
- Create the Job.
- Check admission status.
If Kueue does not admit the workload, reduce parallelism, use less restrictive topology constraints or increase available GPU capacity.
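The workload steps above can be sketched as follows. This is an illustration under assumptions: the shell variables `REPLICAS` and `ANNOTATION` stand in for `<number_of_replicas>` and `<annotations_string>`, the queue name `tas-user-queue` must match a LocalQueue that exists in your cluster, and the Job name prefix, image and GPU count are placeholders.

```shell
# Illustrative values; adjust them to your workload.
REPLICAS=2
ANNOTATION='kueue.x-k8s.io/podset-preferred-topology: "topology.nebius.com/tier-2"'

# Unquoted heredoc delimiter, so the shell substitutes the two variables.
cat > job-tas.yaml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  generateName: tas-sample-                     # hypothetical name prefix
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue   # assumed LocalQueue name
spec:
  parallelism: ${REPLICAS}
  completions: ${REPLICAS}
  completionMode: Indexed
  template:
    metadata:
      annotations:
        ${ANNOTATION}
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36                     # placeholder image
        command: ["sleep", "60"]
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
EOF

# generateName requires "create" rather than "apply".
kubectl create -f job-tas.yaml

# Workload objects show whether Kueue admitted the Job.
kubectl get workloads
```

To switch to a required constraint, set `ANNOTATION` to one of the `podset-required-topology` values from the table and regenerate the file.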
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.