To run ML, AI and high-performance computing (HPC) workloads in your Managed Service for Kubernetes cluster, you need to add nodes with GPUs to it. Managed Kubernetes nodes are Compute virtual machines, and you can choose VMs with GPUs to serve as nodes in your clusters. In this article, you will learn how to set up GPUs in a Managed Kubernetes cluster. The article also touches on interconnecting GPUs using InfiniBand™ to accelerate your workloads; this topic is covered in detail in another article.

How to add nodes with GPUs to a cluster

When creating a node group in a Managed Service for Kubernetes cluster, specify a virtual machine platform that supports GPUs:
In the node group creation form (Compute → Kubernetes → your cluster → Node groups → Create node group), under Computing resources:
  1. Select With GPU.
  2. Select a platform and a preset. For available platforms and presets, see Types of virtual machines and GPUs in Nebius AI Cloud and How to find out platforms and presets available in a project.
  3. Under GPU settings, keep the Install NVIDIA GPU drivers and other components option enabled.
  4. Under Drivers, select a CUDA driver version. For available driver versions, see GPU drivers and other components.
  5. Under Operating system, select an OS. Available operating systems depend on the selected driver.
    If you need to modify the NVIDIA device plug-in (for example, to enable multi-instance GPU), disable the Install NVIDIA GPU drivers and other components option and then install the NVIDIA GPU Operator manually.
To enable InfiniBand interconnect between the nodes with GPUs, specify a GPU cluster when creating the node group. For more details, see Interconnecting GPUs in Managed Service for Kubernetes® clusters using InfiniBand™.
You cannot change the VM platform and preset or the GPU cluster of an existing node group. Create a new node group instead.
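Once a node group with GPUs is Ready, workloads are scheduled onto it by requesting the standard nvidia.com/gpu extended resource, which the installed GPU components advertise on each GPU node. A minimal sketch (the pod name and image tag are illustrative; pick an image that matches your driver preset):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Illustrative CUDA base image; any image with nvidia-smi works
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # the pod is scheduled only onto a node with a free GPU
```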

GPU drivers and other components

For node groups with GPUs, Managed Kubernetes offers boot disk images with GPU drivers and other components required for GPUs. You can select a Managed Kubernetes GPU image with --template-gpu-settings-drivers-preset. The preset determines the CUDA toolkit and NVIDIA driver series. Each preset has a default operating system (OS), which you can optionally override with --template-os.
Driver preset                 | cuda12.8    | cuda13.0    | cuda12.4
NVIDIA Data Center GPU Driver | 570.x       | 580.x       | 550.x
OS                            | ubuntu24.04 | ubuntu24.04 | ubuntu22.04
If your cluster’s control plane is on Kubernetes 1.30 (deprecated), use cuda12 instead of cuda12.8 (Ubuntu 24.04) or cuda12.4 (Ubuntu 22.04). Kubernetes 1.31 and later support cuda12.8 and cuda12.4, so there is no need to use cuda12 on these versions.
To confirm which drivers_preset and os values are supported for your platform and Kubernetes version, check the compatibility matrix:
nebius mk8s node-group get-compatibility-matrix \
  --cluster-kubernetes-version 1.33 \
  --platform gpu-h200-sxm
Example output:
versions:
  - items:
      - compatible_platforms:
          - gpu-h200-sxm
        os: ubuntu24.04
      - compatible_platforms:
          - gpu-h200-sxm
        drivers_preset: cuda12.8
        os: ubuntu24.04
      - compatible_platforms:
          - gpu-h200-sxm
        drivers_preset: cuda12
        os: ubuntu24.04
      - compatible_platforms:
          - gpu-h200-sxm
        drivers_preset: cuda13.0
        os: ubuntu24.04
    kubernetes_version: "1.33"
Use the returned drivers_preset and os values to select the driver branch and, optionally, an operating system (OS) in a node group configuration. For instructions on how to specify these parameters when creating a node group, see How to add nodes with GPUs to a cluster.

How to change the driver preset

To change the driver preset for an existing node group, run:
nebius mk8s node-group update \
  --id <group_ID> \
  --template-gpu-settings-drivers-preset <new_preset>
When you change the driver preset, Managed Kubernetes recreates all nodes in the group according to the group’s deployment strategy: creates replacement nodes and then cordons, drains and deletes the existing nodes.
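You can follow the rollout with kubectl; both commands below require access to the cluster, and the custom-columns query uses the standard escaped form of the nvidia.com/gpu resource name:

```shell
# Watch existing nodes get cordoned, drained and replaced
kubectl get nodes --watch

# Once the replacement nodes are Ready, confirm they advertise GPU capacity
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```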

How to install the drivers and components on existing node groups

You can create a node group without the boot disk image. For example, you may opt not to use the Install NVIDIA GPU drivers and other components option when creating the node group in the web console. In this case, you can choose one of the following options to install the drivers and components:
  • Create a new node group with the image and migrate your workloads to it (recommended). For instructions, see Moving workload from the existing node group.
  • Modify the node group to use the image
    When you modify a node group, Managed Kubernetes recreates each node according to the group’s deployment strategy: creates a replacement node and then cordons, drains and deletes the existing node.
    Run the nebius mk8s node-group update command:
    nebius mk8s node-group update \
      --id <node_group_ID> \
      --template-gpu-settings-drivers-preset cuda12.8
    
  • Manually install NVIDIA operators. You can install Kubernetes operators from NVIDIA that manage the components required for GPUs and their networking:
    • NVIDIA Network Operator. Installing NVIDIA Network Operator is required when at least one node group in the cluster does not use the boot disk image offered by Managed Kubernetes and satisfies any of the following conditions: In all other cases, NVIDIA Network Operator is optional.
    • NVIDIA GPU Operator. Any cluster with at least one node group that has GPUs and does not use the boot disk image offered by Managed Kubernetes must have NVIDIA GPU Operator installed.
    To install the operators, follow the instructions that match your setup, depending on whether you enabled the InfiniBand interconnect.
    Install and check the operators in the exact order presented in these instructions. The operators depend on each other.
    With InfiniBand interconnect:
    1. Prepare your environment:
      1. Configure kubectl, the Kubernetes CLI, to work with your cluster:
        nebius mk8s cluster get-credentials \
          --id <cluster_ID> --external
        
        For more details, see How to connect to Managed Service for Kubernetes® clusters using kubectl.
      2. Install Helm, the package manager for Kubernetes that we will use to install the operator:
        curl -fsSL -o get_helm.sh \
          https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
        chmod 700 get_helm.sh
        ./get_helm.sh
        
        For more ways to install, see the Helm documentation.
    2. Install the NVIDIA Network Operator from the Nebius AI Cloud chart repository:
      helm install network-operator \
        oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-network-operator/chart/network-operator \
        --version 25.7.0 \
        -n nvidia-network-operator --create-namespace \
        --wait
      
    3. Verify that the NVIDIA Network Operator installed its components correctly. Get the NICClusterPolicy instance status:
      kubectl get nicclusterpolicy.mellanox.com nic-cluster-policy \
        -n nvidia-network-operator -o json | jq -r '.status'
      
      Example output:
      {
        "appliedStates": [
          ...
          {
            "name": "state-OFED",
            "state": "ready"
          },
          ...
        ],
        "state": "ready"
      }
      
      While state-OFED is notReady, you can check the driver installation logs:
      kubectl logs -n nvidia-network-operator \
        $(kubectl get pods -n nvidia-network-operator \
        | grep mofed | head -1 | awk '{print $1}')
      
    4. Install the NVIDIA GPU Operator from the Nebius AI Cloud chart repository:
      helm install gpu-operator \
        oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-gpu-operator/chart/gpu-operator \
        --version v25.10.0 \
        --set driver.version=580.95.05 \
        -n nvidia-gpu-operator \
        --create-namespace \
        --wait
      
      GPUDirect RDMA is enabled by default and uses the recommended DMA-BUF Linux kernel module. For more command options, see the NVIDIA GPU Operator documentation.
    5. Verify that the GPU driver is installed correctly.
      Do not check the GPU driver until you install both operators.
      Get the last log line from each DaemonSet that installs the driver:
      for pod in $(kubectl get pods -n nvidia-gpu-operator \
          | grep nvidia-driver-daemonset | awk '{print $1}'); do
        echo -e "$pod:\n\t$(kubectl logs -n nvidia-gpu-operator $pod --tail 1)";
      done
      
      If the last lines are Done, now waiting for signal, the driver should work correctly.
    Without InfiniBand interconnect:
    1. Prepare your environment:
      1. Configure kubectl, the Kubernetes CLI, to work with your cluster:
        nebius mk8s cluster get-credentials \
          --id <cluster_ID> --external
        
        For more details, see How to connect to Managed Service for Kubernetes® clusters using kubectl.
      2. Install Helm, the package manager for Kubernetes that we will use to install the operator:
        curl -fsSL -o get_helm.sh \
          https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
        chmod 700 get_helm.sh
        ./get_helm.sh
        
        For more ways to install, see the Helm documentation.
    2. Install the NVIDIA GPU Operator from the Nebius AI Cloud chart repository:
      helm install gpu-operator \
        oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-gpu-operator/chart/gpu-operator \
        --version v25.10.0 \
        --set driver.version=580.95.05 \
        -n nvidia-gpu-operator --create-namespace \
        --wait
      
      For more options, see the NVIDIA GPU Operator documentation.
    3. Verify that the GPU driver is installed correctly. Get the last log line from each DaemonSet that installs the driver:
      for pod in $(kubectl get pods -n nvidia-gpu-operator \
          | grep nvidia-driver-daemonset | awk '{print $1}'); do
        echo -e "$pod:\n\t$(kubectl logs -n nvidia-gpu-operator $pod --tail 1)";
      done
      
      If the last lines are Done, now waiting for signal, the driver should work correctly.
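      As a final smoke test, you can run nvidia-smi inside one of the driver pods, reusing the same grep pipeline as in the log check; the command requires cluster access, and the pod name pattern is the one created by the GPU Operator:

      ```shell
      # Run nvidia-smi in the first driver daemonset pod; on success it prints
      # the driver version and one row per GPU on that node
      kubectl exec -n nvidia-gpu-operator \
        "$(kubectl get pods -n nvidia-gpu-operator \
          | grep nvidia-driver-daemonset | head -1 | awk '{print $1}')" \
        -- nvidia-smi
      ```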

Example: Using CUDA for vector addition

To test CUDA support in a cluster with GPU nodes and drivers installed, you can run a small CUDA application that adds two vectors together:
  1. Connect to the cluster using kubectl.
  2. Follow the instructions in the NVIDIA GPU Operator documentation.
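If you prefer a self-contained check, the sketch below writes a manifest that runs NVIDIA's public CUDA vector-add sample and requests one GPU; the image tag is illustrative, so pick one that matches your driver preset:

```shell
# Write a pod manifest for NVIDIA's CUDA vector-add sample
cat > cuda-vectoradd.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      # Illustrative tag; see NVIDIA's registry for tags matching your CUDA version
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Apply it and read the result (requires cluster access); the sample
# prints "Test PASSED" on success:
# kubectl apply -f cuda-vectoradd.yaml
# kubectl logs pod/cuda-vectoradd
```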

See also


InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.