To run ML, AI and high-performance computing (HPC) workloads in your Managed Service for Kubernetes cluster, you need to add nodes with GPUs to it. Managed Kubernetes nodes are Compute virtual machines, and you can choose VMs with GPUs to serve as nodes in your clusters. In this article, you will learn how to set up GPUs in a Managed Kubernetes cluster. The article also touches on interconnecting GPUs using InfiniBand™ to accelerate your workloads; this topic is covered in detail in another article.

How to add nodes with GPUs to a cluster

When creating a node group in a Managed Service for Kubernetes cluster, specify a virtual machine platform that supports GPUs:
In the node group creation form (Compute → Kubernetes → your cluster → Node groups → Create node group), under Computing resources:
  1. Select With GPU.
  2. Select a platform and a preset. For available platforms and presets, see Types of virtual machines and GPUs in Nebius AI Cloud and How to find out platforms and presets available in a project.
  3. Under GPU settings, keep the Install NVIDIA GPU drivers and other components option enabled.
  4. Under Drivers, select a CUDA driver version. For available driver versions, see GPU drivers and other components.
  5. Under Operating system, select an OS. Available operating systems depend on the selected driver.
    If you need to modify the NVIDIA device plug-in (for example, to enable multi-instance GPU), disable the Install NVIDIA GPU drivers and other components option and then install the NVIDIA GPU Operator manually.
To enable InfiniBand interconnect between the nodes with GPUs, specify a GPU cluster when creating the node group. For more details, see Interconnecting GPUs in Managed Service for Kubernetes® clusters using InfiniBand™.
You cannot change the VM platform and preset or the GPU cluster of an existing node group. Create a new node group instead.
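Once a node group with GPUs is Ready, workloads are scheduled onto it by requesting the standard nvidia.com/gpu extended resource, which the installed GPU components advertise on each GPU node. A minimal sketch (the pod name and image tag are illustrative; pick an image that matches your driver preset):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Illustrative CUDA base image; any image with nvidia-smi works
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # the pod is scheduled only onto a node with a free GPU
```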

GPU drivers and other components

For node groups with GPUs, Managed Kubernetes offers boot disk images with GPU drivers and other components required for GPUs. You can select a Managed Kubernetes GPU image with --template-gpu-settings-drivers-preset. The preset determines the CUDA toolkit and NVIDIA driver series. Each preset has a default operating system (OS), which you can optionally override with --template-os.
Driver preset                 | cuda12.8    | cuda13.0    | cuda12.4
NVIDIA Data Center GPU Driver | 570.x       | 580.x       | 550.x
OS                            | ubuntu24.04 | ubuntu24.04 | ubuntu22.04
If your cluster’s control plane is on Kubernetes 1.30 (deprecated), use cuda12 instead of cuda12.8 (Ubuntu 24.04) or cuda12.4 (Ubuntu 22.04). Kubernetes 1.31 and later support cuda12.8 and cuda12.4, so there is no need to use cuda12 on these versions.
To confirm which drivers_preset and os values are supported for your platform and Kubernetes version, check the compatibility matrix:
nebius mk8s node-group get-compatibility-matrix \
  --cluster-kubernetes-version 1.33 \
  --platform gpu-h200-sxm
Example output:
versions:
  - items:
      - compatible_platforms:
          - gpu-h200-sxm
        os: ubuntu24.04
      - compatible_platforms:
          - gpu-h200-sxm
        drivers_preset: cuda12.8
        os: ubuntu24.04
      - compatible_platforms:
          - gpu-h200-sxm
        drivers_preset: cuda12
        os: ubuntu24.04
      - compatible_platforms:
          - gpu-h200-sxm
        drivers_preset: cuda13.0
        os: ubuntu24.04
    kubernetes_version: "1.33"
Use the returned drivers_preset and os values to select the driver branch and, optionally, an operating system (OS) in a node group configuration. For instructions on how to specify these parameters when creating a node group, see How to add nodes with GPUs to a cluster.

How to change the driver preset

To change the driver preset for an existing node group, run:
nebius mk8s node-group update \
  --id <group_ID> \
  --template-gpu-settings-drivers-preset <new_preset>
When you change the driver preset, Managed Kubernetes recreates all nodes in the group according to the group’s deployment strategy: creates replacement nodes and then cordons, drains and deletes the existing nodes.
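You can follow the rollout with kubectl; both commands below require access to the cluster, and the custom-columns query uses the standard escaped form of the nvidia.com/gpu resource name:

```shell
# Watch existing nodes get cordoned, drained and replaced
kubectl get nodes --watch

# Once the replacement nodes are Ready, confirm they advertise GPU capacity
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```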

How to install the drivers and components on existing node groups

You can create a node group without the boot disk image. For example, you may opt not to use the Install NVIDIA GPU drivers and other components option when creating the node group in the web console. In this case, you can choose one of the following options to install the drivers and components:
  • Create a new node group with the image and migrate your workloads to it (recommended). For instructions, see Moving workload from the existing node group.
  • Modify the node group to use the image
    When you modify a node group, Managed Kubernetes recreates each node according to the group’s deployment strategy: creates a replacement node and then cordons, drains and deletes the existing node.
    Run the nebius mk8s node-group update command:
    nebius mk8s node-group update \
      --id <node_group_ID> \
      --template-gpu-settings-drivers-preset cuda12.8
    
  • Manually install NVIDIA operators. You can install Kubernetes operators from NVIDIA that manage the components required for GPUs and their networking:
    • NVIDIA Network Operator. Installing NVIDIA Network Operator is required when at least one node group in the cluster does not use the boot disk image offered by Managed Kubernetes and satisfies any of the following conditions: In all other cases, NVIDIA Network Operator is optional.
    • NVIDIA GPU Operator. Any cluster with at least one node group that has GPUs and does not use the boot disk image offered by Managed Kubernetes must have NVIDIA GPU Operator installed.
    To install the operators, follow the instructions that match your setup, depending on whether you enabled the InfiniBand interconnect.
    Install and check the operators in the exact order presented in these instructions. The operators depend on each other.
    With InfiniBand interconnect:
    1. Prepare your environment:
      1. Configure kubectl, the Kubernetes CLI, to work with your cluster:
        nebius mk8s cluster get-credentials \
          --id <cluster_ID> --external
        
        For more details, see How to connect to Managed Service for Kubernetes® clusters using kubectl.
      2. Install Helm, the package manager for Kubernetes that we will use to install the operator:
        curl -fsSL -o get_helm.sh \
          https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
        chmod 700 get_helm.sh
        ./get_helm.sh
        
        For more ways to install, see the Helm documentation.
    2. Install the NVIDIA Network Operator from the Nebius AI Cloud chart repository:
      helm install network-operator \
        oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-network-operator/chart/network-operator \
        --version 25.7.0 \
        -n nvidia-network-operator --create-namespace \
        --wait
      
    3. Verify that the NVIDIA Network Operator installed its components correctly. Get the NICClusterPolicy instance status:
      kubectl get nicclusterpolicy.mellanox.com nic-cluster-policy \
        -n nvidia-network-operator -o json | jq -r '.status'
      
      Example output:
      {
        "appliedStates": [
          ...
          {
            "name": "state-OFED",
            "state": "ready"
          },
          ...
        ],
        "state": "ready"
      }
      
      While state-OFED is notReady, you can check the driver installation logs:
      kubectl logs -n nvidia-network-operator \
        $(kubectl get pods -n nvidia-network-operator \
        | grep mofed | head -1 | awk '{print $1}')
      
    4. Install the NVIDIA GPU Operator from the Nebius AI Cloud chart repository:
      helm install gpu-operator \
        oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-gpu-operator/chart/gpu-operator \
        --version v25.10.0 \
        --set driver.version=580.95.05 \
        -n nvidia-gpu-operator \
        --create-namespace \
        --wait
      
      GPUDirect RDMA is enabled by default and uses the recommended DMA-BUF Linux kernel module. For more command options, see the NVIDIA GPU Operator documentation.
    5. Verify that the GPU driver is installed correctly.
      Do not check the GPU driver until you install both operators.
      Get the last log line from each DaemonSet that installs the driver:
      for pod in $(kubectl get pods -n nvidia-gpu-operator \
          | grep nvidia-driver-daemonset | awk '{print $1}'); do
        echo -e "$pod:\n\t$(kubectl logs -n nvidia-gpu-operator $pod --tail 1)";
      done
      
      If the last lines are Done, now waiting for signal, the driver should work correctly.
    Without InfiniBand interconnect:
    1. Prepare your environment:
      1. Configure kubectl, the Kubernetes CLI, to work with your cluster:
        nebius mk8s cluster get-credentials \
          --id <cluster_ID> --external
        
        For more details, see How to connect to Managed Service for Kubernetes® clusters using kubectl.
      2. Install Helm, the package manager for Kubernetes that we will use to install the operator:
        curl -fsSL -o get_helm.sh \
          https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
        chmod 700 get_helm.sh
        ./get_helm.sh
        
        For more ways to install, see the Helm documentation.
    2. Install the NVIDIA GPU Operator from the Nebius AI Cloud chart repository:
      helm install gpu-operator \
        oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-gpu-operator/chart/gpu-operator \
        --version v25.10.0 \
        --set driver.version=580.95.05 \
        -n nvidia-gpu-operator --create-namespace \
        --wait
      
      For more options, see the NVIDIA GPU Operator documentation.
    3. Verify that the GPU driver is installed correctly. Get the last log line from each DaemonSet that installs the driver:
      for pod in $(kubectl get pods -n nvidia-gpu-operator \
          | grep nvidia-driver-daemonset | awk '{print $1}'); do
        echo -e "$pod:\n\t$(kubectl logs -n nvidia-gpu-operator $pod --tail 1)";
      done
      
      If the last lines are Done, now waiting for signal, the driver should work correctly.
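      As a final smoke test, you can run nvidia-smi inside one of the driver pods, reusing the same grep pipeline as in the log check; the command requires cluster access, and the pod name pattern is the one created by the GPU Operator:

      ```shell
      # Run nvidia-smi in the first driver daemonset pod; on success it prints
      # the driver version and one row per GPU on that node
      kubectl exec -n nvidia-gpu-operator \
        "$(kubectl get pods -n nvidia-gpu-operator \
          | grep nvidia-driver-daemonset | head -1 | awk '{print $1}')" \
        -- nvidia-smi
      ```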

Example: Using CUDA for vector addition

To test CUDA support in a cluster with GPU nodes and drivers installed, you can run a small CUDA application that adds two vectors together:
  1. Connect to the cluster using kubectl.
  2. Follow the instructions in the NVIDIA GPU Operator documentation.
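If you prefer a self-contained check, the sketch below writes a manifest that runs NVIDIA's public CUDA vector-add sample and requests one GPU; the image tag is illustrative, so pick one that matches your driver preset:

```shell
# Write a pod manifest for NVIDIA's CUDA vector-add sample
cat > cuda-vectoradd.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      # Illustrative tag; see NVIDIA's registry for tags matching your CUDA version
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Apply it and read the result (requires cluster access); the sample
# prints "Test PASSED" on success:
# kubectl apply -f cuda-vectoradd.yaml
# kubectl logs pod/cuda-vectoradd
```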

See also


InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.