Topology-aware scheduling for GPU workloads in Managed Service for Kubernetes®

Modern AI/ML workloads depend heavily on high-throughput, low-latency communication between nodes. In GPU clusters connected via InfiniBand™, the physical network topology directly affects performance.

Topology-aware scheduling (TAS) lets Kubernetes schedulers place workloads based on how nodes are physically connected within the InfiniBand fabric. With this feature, Nebius AI Cloud exposes InfiniBand topology information as node labels, so schedulers can place the Pods of a workload on nodes that are closer to each other in the network hierarchy. This can improve communication efficiency and performance for distributed workloads.

For more information about Kubernetes scheduling, see the Kubernetes scheduler documentation.

Prerequisites

How to view topology labels in your cluster

View the topology labels on GPU nodes with the following command:

```
kubectl get nodes -L topology.nebius.com/gpu-cluster-id,topology.nebius.com/tier-2,topology.nebius.com/tier-1,kubernetes.io/hostname
```

The following labels can be present on GPU nodes:

| Label | Description |
| --- | --- |
| `topology.nebius.com/gpu-cluster-id` | Identifies a connected high-speed network domain. Nodes with the same value have high-speed network connectivity between each other. |
| `topology.nebius.com/tier-0` | The closest topology level. For example, this can represent a direct accelerator interconnect, such as multi-node NVLink between NVIDIA GPUs. This label is set only for nodes that support this type of communication. |
| `topology.nebius.com/tier-1` | A lower-level network locality domain. For example, this can represent rack-level switches that connect nodes in one or more racks into a single block. |
| `topology.nebius.com/tier-2` | A wider network locality domain. For example, this can represent spine-level switches that connect multiple blocks inside a data center. |

The lower the tier level shared by two nodes, the better the expected communication performance between them. For example, two nodes with the same topology.nebius.com/tier-1 value are expected to be closer to each other than two nodes that only share the same topology.nebius.com/tier-2 value.

Example output

```
NAME                                 STATUS   ROLES    AGE    VERSION   GPU-CLUSTER-ID                         TIER-2                             TIER-1                             HOSTNAME
computeinstance-e00f4wsk77x4vsr58s   Ready    <none>   144d   v1.32.9   computegpucluster-e00agxzkvne8558nv8   959d125cbd887219574193fdba27b2c8   282cfcbf8e735653e4ce9884052cb523   computeinstance-e00f4wsk77x4vsr58s
computeinstance-e00nm33x3y9597zzxj   Ready    <none>   54d    v1.32.9                                                                                                                computeinstance-e00nm33x3y9597zzxj
computeinstance-e00tw5jypq4zvfrsrx   Ready    <none>   77d    v1.32.9   computegpucluster-e00z7ftxx6dacdbra5   5ca19afc3d62844695b08033aeba635b   297bbb6a0db40d875c266b589dc95f5b   computeinstance-e00tw5jypq4zvfrsrx
computeinstance-e00v7g42bam61yqzp3   Ready    <none>   144d   v1.32.9   computegpucluster-e00agxzkvne8558nv8   959d125cbd887219574193fdba27b2c8   04011d5d2b17ec74672df94efbbeeb15   computeinstance-e00v7g42bam61yqzp3
```

Nodes that share the same value for a label belong to the same topology domain at that level. A topology domain is a group of nodes that are physically close to each other in the network hierarchy and are therefore expected to have faster communication between them. For example, in the sample output:

computeinstance-e00f4wsk77x4vsr58s and computeinstance-e00v7g42bam61yqzp3 share the same GPU-CLUSTER-ID and TIER-2 values, which means they belong to the same high-speed network domain and the same wider network locality domain. These nodes have different TIER-1 values, which indicates that they belong to different lower-level locality domains.
computeinstance-e00tw5jypq4zvfrsrx belongs to a different GPU cluster and topology hierarchy because all of its topology label values are different.
computeinstance-e00nm33x3y9597zzxj does not have topology labels. This usually means that the node is not attached to a GPU cluster, or TAS is not enabled.

The exact physical meaning of each tier depends on the infrastructure configuration and is not guaranteed to match these examples.
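
The comparison described above can also be scripted. The following is a minimal sketch, not part of any Nebius or Kubernetes tooling: `closest_shared_level` is a hypothetical helper that reads one node per line in the form `NAME GPU-CLUSTER-ID TIER-2 TIER-1` and prints the closest level that two nodes share. It assumes the input is produced with `kubectl get nodes -o custom-columns=...`, which prints `<none>` for a missing label, and it ignores the `tier-0` and hostname levels.

```shell
# Hypothetical helper: print the closest topology level shared by two nodes.
# Input (stdin): one node per line as "NAME GPU-CLUSTER-ID TIER-2 TIER-1",
# with "<none>" in place of a missing label.
# Assumed way to produce that input (adjust to your cluster):
#   kubectl get nodes --no-headers -o 'custom-columns=NAME:.metadata.name,CLUSTER:.metadata.labels.topology\.nebius\.com/gpu-cluster-id,TIER2:.metadata.labels.topology\.nebius\.com/tier-2,TIER1:.metadata.labels.topology\.nebius\.com/tier-1' \
#     | closest_shared_level <node_a> <node_b>
closest_shared_level() {
  awk -v a="$1" -v b="$2" '
    $1 == a { ca = $2; t2a = $3; t1a = $4; seen_a = 1 }
    $1 == b { cb = $2; t2b = $3; t1b = $4; seen_b = 1 }
    END {
      if (!seen_a || !seen_b) { print "unknown node"; exit 1 }
      # Walk the hierarchy from the widest level down; stop at the first mismatch.
      if (ca == "<none>" || ca != cb)         { level = "none" }
      else if (t2a == "<none>" || t2a != t2b) { level = "gpu-cluster-id" }
      else if (t1a == "<none>" || t1a != t1b) { level = "tier-2" }
      else                                    { level = "tier-1" }
      print level
    }'
}
```

For the sample output above, comparing computeinstance-e00f4wsk77x4vsr58s with computeinstance-e00v7g42bam61yqzp3 would print `tier-2`, because the two nodes share the GPU cluster and tier-2 values but not the tier-1 value.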

How to enable topology-aware scheduling

Kueue is used below as an example scheduler. You can also use other schedulers that support TAS, such as Volcano.

Steps

Install and configure Kueue

Install Kueue.

Enable TAS:

```
kubectl -n kueue-system patch deployment kueue-controller-manager \
 --type json \
 -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--feature-gates=TopologyAwareScheduling=true"}]'
```
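
Because the patch appends an argument, running it twice adds the flag twice. As a hedged sketch, the hypothetical helper `has_tas_gate` below checks whether the gate is already present in the container arguments. It assumes the arguments are read from the first container of the `kueue-controller-manager` deployment, as in the patch above, and it does not detect feature gates set through a Kueue configuration file.

```shell
# Hypothetical helper: succeed if the TAS feature gate appears in the
# container arguments passed on stdin.
# Assumed way to produce that input:
#   kubectl -n kueue-system get deployment kueue-controller-manager \
#     -o jsonpath='{.spec.template.spec.containers[0].args}' \
#     | has_tas_gate && echo "TAS feature gate is already set"
has_tas_gate() {
  grep -q 'TopologyAwareScheduling=true'
}
```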

Create a file named kueue-tas.yaml to configure the Kueue resources:

```
apiVersion: kueue.x-k8s.io/v1beta2
kind: Topology
metadata:
  name: ib-topology
spec:
  levels:
    - nodeLabel: "topology.nebius.com/gpu-cluster-id"
    - nodeLabel: "topology.nebius.com/tier-2"
    - nodeLabel: "topology.nebius.com/tier-1"
    - nodeLabel: "kubernetes.io/hostname"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: gpu-flavor-tas
spec:
  nodeLabels:
    nebius.com/node-group-id: "<node_group_ID>"
  topologyName: "ib-topology"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue-tas
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: "gpu-flavor-tas"
          resources:
            - name: "cpu"
              nominalQuota: 100
            - name: "memory"
              nominalQuota: "100Gi"
            - name: "nvidia.com/gpu"
              nominalQuota: "16"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: gpu-user-queue-tas
  namespace: default
spec:
  clusterQueue: gpu-cluster-queue-tas
```

To get the <node_group_ID>, open your Kubernetes cluster in the web console, go to the Node groups tab and copy the node group ID.

Apply the configuration:
```
kubectl apply -f kueue-tas.yaml
```

Check that the Kueue resources were created:

```
kubectl get topology,resourceflavor,clusterqueue,localqueue
```

Schedule workloads with TAS using Kueue

To request TAS, add a topology annotation to the Pod template of your workload.

Create a file named job-tas.yaml that requests TAS for the workload:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: job-tas
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: gpu-user-queue-tas
spec:
  parallelism: <number_of_replicas>
  completions: <number_of_replicas>
  completionMode: Indexed
  template:
    metadata:
      annotations:
        <annotations_string>
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
        resources:
          requests:
            cpu: "100m"
            memory: "100Mi"
            nvidia.com/gpu: 8
          limits:
            nvidia.com/gpu: 8
      restartPolicy: Never
```

Replace the following variables:

<number_of_replicas>: Number of Pods to run in parallel. The manifest uses the same value for both parallelism and completions.

<annotations_string>: Requested topology constraint. See the following table for available values:

| Scenario | Description | Value of `<annotations_string>` |
| --- | --- | --- |
| Required to run within a GPU cluster | Kueue admits the workload only if enough resources are available within the same GPU cluster. Otherwise, the workload remains pending until resources become available or the topology constraint changes. | `kueue.x-k8s.io/podset-required-topology: "topology.nebius.com/gpu-cluster-id"` |
| Required to run within a topology domain | Kueue admits the workload only if enough resources are available within the same topology domain. Otherwise, the workload remains pending until resources become available or the topology constraint changes. | `kueue.x-k8s.io/podset-required-topology: "topology.nebius.com/tier-2"` |
| Preferred to run within a topology domain | Kueue tries to schedule Pods within the same topology domain. If this is not possible, Kueue can place Pods across multiple topology domains. | `kueue.x-k8s.io/podset-preferred-topology: "topology.nebius.com/tier-2"` |

In Kueue, a podset represents a group of Pods belonging to the same workload (for example, replicas of a Job).
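
If you prefer to fill in the two placeholders from the command line instead of editing the file by hand, the substitution can be sketched as follows. `render_job` is a hypothetical helper, not a Kueue or kubectl feature. It assumes that job-tas.yaml still contains the literal `<number_of_replicas>` and `<annotations_string>` placeholders and that the annotation key follows the `podset-required-topology` / `podset-preferred-topology` pattern from the table.

```shell
# Hypothetical helper: fill the placeholders of the Job template read from stdin.
#   $1 = number of replicas
#   $2 = topology level label, for example topology.nebius.com/tier-2
#   $3 = "required" or "preferred"
render_job() {
  case "$3" in
    required|preferred) ;;
    *) echo "mode must be 'required' or 'preferred'" >&2; return 1 ;;
  esac
  # "|" is used as the sed delimiter because the label contains "/".
  sed -e "s|<number_of_replicas>|$1|g" \
      -e "s|<annotations_string>|kueue.x-k8s.io/podset-$3-topology: \"$2\"|"
}
```

For example, `render_job 2 topology.nebius.com/tier-2 required < job-tas.yaml | kubectl apply -f -` would submit a two-Pod Job that must fit within a single tier-2 domain.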

Create the Job:
```
kubectl apply -f job-tas.yaml
```
Check admission status:
```
kubectl get workloads -A
kubectl describe workload <workload_name> -n <namespace>
```
If Kueue does not admit the workload, reduce parallelism, use a less restrictive topology constraint, or increase the available GPU capacity.
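
To judge how far to reduce parallelism, it can help to know how many nodes each topology domain contains. The sketch below is an illustration under stated assumptions: `nodes_per_domain` is a hypothetical helper that reads `NAME DOMAIN` pairs and counts nodes per domain, largest first. A required constraint cannot be satisfied by more Pods than fit into the largest domain. The count is only an upper bound, because it ignores nodes that are busy, tainted, or outside the node group selected by the ResourceFlavor, as well as the ClusterQueue quota.

```shell
# Hypothetical helper: count nodes per topology domain, largest domain first.
# Input (stdin): one node per line as "NAME DOMAIN"; "<none>" marks a node
# without the label and is skipped.
# Assumed way to produce that input for the tier-2 level:
#   kubectl get nodes --no-headers \
#     -o 'custom-columns=NAME:.metadata.name,DOMAIN:.metadata.labels.topology\.nebius\.com/tier-2' \
#     | nodes_per_domain
nodes_per_domain() {
  awk '$2 != "<none>" { count[$2]++ }
       END { for (d in count) print count[d], d }' |
    sort -k1,1nr -k2,2
}
```

Each output line has the form `<node_count> <domain_value>`. With the example Job, which requests 8 GPUs per Pod, and assuming one such Pod per node, the first line gives the largest parallelism that a required constraint at that level could admit.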

InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.
