# Working with GPUs in the Managed Service for Kubernetes®

To run ML, AI and high-performance computing (HPC) workloads in your Managed Service for Kubernetes cluster, you need to add nodes with GPUs to it. Managed Kubernetes nodes are Compute virtual machines, and you can choose VMs with GPUs to serve as nodes in your clusters.

In this article, you will learn how to set up GPUs in a Managed Kubernetes cluster. The article also touches on interconnecting GPUs using InfiniBand™ to accelerate your workloads; this topic is covered in detail in [another article](./clusters).

## How to add nodes with GPUs to a cluster

When [creating a node group](../node-groups/manage) in a Managed Service for Kubernetes cluster, specify a virtual machine platform that supports GPUs:

<Tabs>
  <Tab title="Web console">
    In the node group creation form (<Icon icon="https://mintcdn.com/nebius-ai-cloud/1Ha0sWR6e1mnIaHS/_assets/sidebar/compute.svg?fit=max&auto=format&n=1Ha0sWR6e1mnIaHS&q=85&s=b91340217b08a1456d88ae0347f281d1" width="16" height="16" data-path="_assets/sidebar/compute.svg" /> **Compute** → **Kubernetes** → your cluster → **Node groups** → **Create node group**), under **Computing resources**:

    1. Select **With GPU**.
    2. Select a platform and a preset. For available platforms and presets, see [Types of virtual machines and GPUs in Nebius AI Cloud](../../compute/virtual-machines/types) and [How to find out platforms and presets available in a project](../../compute/virtual-machines/list-platforms).
    3. Under **GPU settings**, keep the **Install NVIDIA GPU drivers and other components** option enabled.
    4. Under **Drivers**, select a CUDA driver version. For available driver versions, see [GPU drivers and other components](#gpu-drivers-and-other-components).
    5. Under **Operating system**, select an OS. Available operating systems depend on the selected driver.

           <Note>
             If you need to modify the NVIDIA device plug-in (for example, to enable multi-instance GPU), disable the **Install NVIDIA GPU drivers and other components** option. Then, [manually install the GPU operator](#how-to-install-the-drivers-and-components-on-existing-node-groups).
           </Note>
  </Tab>

  <Tab title="CLI">
    Add GPU parameters to the [nebius mk8s node-group create](/cli/reference/mk8s/node-group/create) command:

    ```bash theme={null}
    nebius mk8s node-group create \
      --template-resources-platform gpu-h100-sxm \
      --template-resources-preset 8gpu-128vcpu-1600gb \
      --template-gpu-settings-drivers-preset cuda12.8 \
      ...
    ```

    * In `--template-resources-platform`, specify a platform with GPUs. In `--template-resources-preset`, specify a compatible preset (number of GPUs and vCPUs, RAM size). For available platforms and presets, see [Types of virtual machines and GPUs in Nebius AI Cloud](../../compute/virtual-machines/types) and [How to find out platforms and presets available in a project](../../compute/virtual-machines/list-platforms).

    * In `--template-gpu-settings-drivers-preset`, specify a supported preset to use a boot disk image that contains drivers and other components for GPUs. For more details, see [GPU drivers and other components](#gpu-drivers-and-other-components).

      If you want to [install the drivers manually](#how-to-install-the-drivers-and-components-on-existing-node-groups), omit the `--template-gpu-settings-drivers-preset` parameter.

          <Note>
            If you need to modify the NVIDIA device plug-in (for example, to enable multi-instance GPU), omit the `--template-gpu-settings-drivers-preset` parameter. Then, [manually install the GPU operator](#how-to-install-the-drivers-and-components-on-existing-node-groups).
          </Note>

    See an [example](../node-groups/manage#examples) of a full specification and CLI command.
  </Tab>
</Tabs>

To enable InfiniBand interconnect between the nodes with GPUs, specify a GPU cluster when creating the node group. For more details, see [Interconnecting GPUs in Managed Service for Kubernetes® clusters using InfiniBand™](./clusters).

<Warning>
  You cannot change the VM platform and preset or the GPU cluster of an existing node group. Create a new node group instead.
</Warning>
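
Once the nodes of the new group have joined the cluster, you may want to confirm that Kubernetes sees their GPUs. The sketch below assumes that kubectl is already configured for the cluster and that the NVIDIA device plug-in is running on the nodes, in which case GPUs are advertised as the `nvidia.com/gpu` resource. The `sum_gpus` and `show_gpus` helpers are hypothetical names used only for illustration:

```bash
# Sum the GPU column of a two-column "NAME GPU" table read from stdin.
# Nodes without GPUs show "<none>" in that column and are counted as 0.
sum_gpus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Print allocatable GPUs per node, followed by the cluster-wide total.
show_gpus() {
  table=$(kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu')
  printf '%s\n' "$table"
  printf 'Total allocatable GPUs: %s\n' "$(printf '%s\n' "$table" | sum_gpus)"
}
```

After defining the functions in your shell, run `show_gpus`. If a GPU node shows `<none>`, the drivers and other components are probably not ready on it yet.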

## GPU drivers and other components

For node groups with GPUs, Managed Kubernetes offers boot disk images with GPU drivers and other components required for GPUs.

You can specify Managed Kubernetes GPU images with `--template-gpu-settings-drivers-preset`. The preset determines the CUDA toolkit and NVIDIA driver series. Each preset has a default operating system (OS), which you can optionally override with `--template-os`.

| Driver preset                 | `cuda12.8`    | `cuda13.0`    |
| ----------------------------- | ------------- | ------------- |
| NVIDIA Data Center GPU Driver | 570.x         | 580.x         |
| OS                            | `ubuntu24.04` | `ubuntu24.04` |

<Warning>
  If your cluster's control plane is on Kubernetes 1.30 (deprecated), use `cuda12` instead of `cuda12.8` (Ubuntu 24.04). Kubernetes 1.31 and later support `cuda12.8`, so there is no need to use `cuda12` on these versions.
</Warning>
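
If you create node groups from scripts, the rule in the warning above can be encoded as a small helper. This is only a sketch: `suggest_cuda12_preset` is a hypothetical function, not part of the Nebius AI Cloud CLI, and it covers only the versions that the warning mentions.

```bash
# Suggest a CUDA 12 drivers preset for a Kubernetes version given as
# "1.<minor>": 1.30 -> cuda12, 1.31 and later -> cuda12.8.
# Hypothetical helper; other versions are rejected rather than guessed.
suggest_cuda12_preset() {
  minor=${1#1.}   # "1.33" -> "33"
  case $minor in
    ''|*[!0-9]*)
      echo "unsupported version format: $1" >&2
      return 1
      ;;
  esac
  if [ "$minor" -ge 31 ]; then
    echo cuda12.8
  elif [ "$minor" -eq 30 ]; then
    echo cuda12
  else
    echo "no recommendation for Kubernetes $1" >&2
    return 1
  fi
}
```

For example, `suggest_cuda12_preset 1.30` prints `cuda12`. Treat the result as a starting point and confirm it against the compatibility matrix for your platform.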

To confirm which `drivers_preset` and `os` values are supported for your platform and Kubernetes version, check the compatibility matrix:

```bash theme={null}
nebius mk8s node-group get-compatibility-matrix \
  --cluster-kubernetes-version 1.33 \
  --platform gpu-h200-sxm
```

Example output:

```yaml theme={null}
versions:
  - items:
      - compatible_platforms:
          - gpu-h200-sxm
        os: ubuntu24.04
      - compatible_platforms:
          - gpu-h200-sxm
        drivers_preset: cuda12.8
        os: ubuntu24.04
      - compatible_platforms:
          - gpu-h200-sxm
        drivers_preset: cuda12
        os: ubuntu24.04
      - compatible_platforms:
          - gpu-h200-sxm
        drivers_preset: cuda13.0
        os: ubuntu24.04
    kubernetes_version: "1.33"
```

Use the returned `drivers_preset` and `os` values to select the driver branch and, optionally, an OS in a node group configuration. For instructions on how to specify these parameters when creating a node group, see [How to add nodes with GPUs to a cluster](#how-to-add-nodes-with-gpus-to-a-cluster).
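
To pick a preset in a script, you can reduce the matrix to the list of distinct `drivers_preset` values. The following sketch assumes the command prints YAML shaped like the example output above. It matches lines as text and is not a real YAML parser, and `list_driver_presets` is a hypothetical helper name:

```bash
# Print the distinct drivers_preset values found in the compatibility
# matrix YAML read from stdin. Entries without a drivers_preset key
# are skipped.
list_driver_presets() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "drivers_preset:") print $(i + 1) }' \
    | LC_ALL=C sort -u
}

# Usage, with the same parameters as in the command above:
#   nebius mk8s node-group get-compatibility-matrix \
#     --cluster-kubernetes-version 1.33 \
#     --platform gpu-h200-sxm | list_driver_presets
```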

### How to change the driver preset

To change the driver preset for an existing node group, run:

```bash theme={null}
nebius mk8s node-group update \
  --id <group_ID> \
  --template-gpu-settings-drivers-preset <new_preset>
```

<Warning>
  When you change the driver preset, Managed Kubernetes recreates all nodes in the group according to the group's [deployment strategy](../node-groups/manage#node-group-parameters): creates replacement nodes and then cordons, drains and deletes the existing nodes.
</Warning>

### How to install the drivers and components on existing node groups

You can create a node group without the boot disk image that contains the GPU drivers and other components. For example, you may opt not to use the **Install NVIDIA GPU drivers and other components** option when you create the node group in the web console. In this case, you can choose one of the following options to install the drivers and components:

* **Create a new node group with the image and migrate your workloads to it** (recommended)

  For instructions, see [Moving workload from the existing node group](../node-groups/moving-workload).

* **Modify the node group to use the image**

  <Warning>
    When you modify a node group, Managed Kubernetes recreates each node according to the group's [deployment strategy](../node-groups/manage#node-group-parameters): creates a replacement node and then cordons, drains and deletes the existing node.
  </Warning>

  <Accordion title="How to modify the node group">
    <Tabs>
      <Tab title="CLI">
        Run the [nebius mk8s node-group update](/cli/reference/mk8s/node-group/update) command:

        ```bash theme={null}
        nebius mk8s node-group update \
          --id <node_group_ID> \
          --template-gpu-settings-drivers-preset cuda12.8
        ```
      </Tab>
    </Tabs>
  </Accordion>

* **Manually install NVIDIA operators**

  You can install Kubernetes operators from NVIDIA that manage components required for GPUs and their networking:

  * **NVIDIA Network Operator**

    Installing NVIDIA Network Operator is required when at least one node group in the cluster does not use the boot disk image offered by Managed Kubernetes and satisfies any of the following conditions:

    * The node group uses NVIDIA B200 GPUs.
    * The node group is added to a GPU cluster for [InfiniBand interconnection](./clusters).

    In all other cases, NVIDIA Network Operator is optional.

  * **NVIDIA GPU Operator**

    Any cluster with at least one node group that has GPUs and does not use the boot disk image offered by Managed Kubernetes must have NVIDIA GPU Operator installed.

  To install the operators, follow the instructions that match your setup, depending on whether you enabled the InfiniBand interconnection:

  <Accordion title={`With InfiniBand: GPU and network operators`}>
    <Warning>
      Install and check the operators in the exact order presented in these instructions. The operators depend on each other.
    </Warning>

    1. Prepare your environment:

       1. Configure kubectl, the Kubernetes CLI, to work with your cluster:

          ```bash theme={null}
          nebius mk8s cluster get-credentials \
            --id <cluster_ID> --external
          ```

          For more details, see [How to connect to Managed Service for Kubernetes® clusters using kubectl](/kubernetes/connect).

       2. Install [Helm](https://helm.sh/docs/), the package manager for Kubernetes that we will use to install the operators:

          ```bash theme={null}
          curl -fsSL -o get_helm.sh \
            https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
          chmod 700 get_helm.sh
          ./get_helm.sh
          ```

          For more ways to install, see the [Helm documentation](https://helm.sh/docs/intro/install/).

    2. Install the NVIDIA Network Operator from the Nebius AI Cloud chart repository:

       ```bash theme={null}
       helm install network-operator \
         oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-network-operator/chart/network-operator \
         --version 25.7.0 \
         -n nvidia-network-operator --create-namespace \
         --wait
       ```

    3. Verify that the NVIDIA Network Operator installed its components correctly. Get the `NICClusterPolicy` instance status:

       ```bash theme={null}
       kubectl get nicclusterpolicy.mellanox.com nic-cluster-policy \
         -n nvidia-network-operator -o json | jq -r '.status'
       ```

       Example output:

       ```json theme={null}
       {
         "appliedStates": [
           ...
           {
             "name": "state-OFED",
             "state": "ready"
           },
           ...
         ],
         "state": "ready"
       }
       ```

       While `state-OFED` is `notReady`, you can check the driver installation logs:

       ```bash theme={null}
       kubectl logs -n nvidia-network-operator \
         $(kubectl get pods -n nvidia-network-operator \
         | grep mofed | head -1 | awk '{print $1}')
       ```

    4. Install the NVIDIA GPU Operator from the Nebius AI Cloud chart repository:

         <CodeGroup>
           ```bash For NVIDIA B300 GPUs theme={null}
           helm install gpu-operator \
             oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-gpu-operator/chart/gpu-operator \
             --version v25.10.0 \
             --set driver.version=580.95.05 \
             -n nvidia-gpu-operator \
             --create-namespace \
             --wait
           ```

           ```bash For any other GPUs theme={null}
           helm install gpu-operator \
             oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-gpu-operator/chart/gpu-operator \
             --version v25.10.0 \
             -n nvidia-gpu-operator \
             --create-namespace \
             --wait
           ```
         </CodeGroup>

       [GPUDirect RDMA](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html) is enabled by default and uses the recommended DMA-BUF Linux kernel module.

       For more command parameters, see the [NVIDIA GPU Operator documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options).

    5. Verify that the GPU driver is installed correctly.

         <Warning>
           Do not check the GPU driver until you install both operators.
         </Warning>

       Get the last log line from each pod of the DaemonSet that installs the driver:

       ```bash theme={null}
       for pod in $(kubectl get pods -n nvidia-gpu-operator \
           | grep nvidia-driver-daemonset | awk '{print $1}'); do
         echo -e "$pod:\n\t$(kubectl logs -n nvidia-gpu-operator $pod --tail 1)";
       done
       ```

       If the last line for every pod is `Done, now waiting for signal`, the driver should work correctly.
  </Accordion>

  <Accordion title={`Without InfiniBand: GPU operator`}>
    1. Prepare your environment:

       1. Configure kubectl, the Kubernetes CLI, to work with your cluster:

          ```bash theme={null}
          nebius mk8s cluster get-credentials \
            --id <cluster_ID> --external
          ```

          For more details, see [How to connect to Managed Service for Kubernetes® clusters using kubectl](/kubernetes/connect).

       2. Install [Helm](https://helm.sh/docs/), the package manager for Kubernetes that we will use to install the operator:

          ```bash theme={null}
          curl -fsSL -o get_helm.sh \
            https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
          chmod 700 get_helm.sh
          ./get_helm.sh
          ```

          For more ways to install, see the [Helm documentation](https://helm.sh/docs/intro/install/).

    2. Install the NVIDIA GPU Operator from the Nebius AI Cloud chart repository:

         <CodeGroup>
           ```bash For NVIDIA B300 GPUs theme={null}
           helm install gpu-operator \
             oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-gpu-operator/chart/gpu-operator \
             --version v25.10.0 \
             --set driver.version=580.95.05 \
             -n nvidia-gpu-operator --create-namespace \
             --wait
           ```

           ```bash For any other GPUs theme={null}
           helm install gpu-operator \
             oci://cr.eu-north1.nebius.cloud/marketplace/nebius/nvidia-gpu-operator/chart/gpu-operator \
             --version v25.10.0 \
             -n nvidia-gpu-operator --create-namespace \
             --wait
           ```
         </CodeGroup>

       For more options, see the [NVIDIA GPU Operator documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options).

    3. Verify that the GPU driver is installed correctly.

       Get the last log line from each pod of the DaemonSet that installs the driver:

       ```bash theme={null}
       for pod in $(kubectl get pods -n nvidia-gpu-operator \
           | grep nvidia-driver-daemonset | awk '{print $1}'); do
         echo -e "$pod:\n\t$(kubectl logs -n nvidia-gpu-operator $pod --tail 1)";
       done
       ```

       If the last line for every pod is `Done, now waiting for signal`, the driver should work correctly.
  </Accordion>
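
The driver check from the instructions above can also be scripted, for example to gate a deployment pipeline on it. The sketch below reuses the namespace, the pod name pattern and the log message from the instructions and assumes that kubectl is configured for the cluster. The function names `driver_line_ok` and `check_driver_pods` are hypothetical:

```bash
# Succeed when a log line contains the message that the driver
# container prints once the installation has finished.
driver_line_ok() {
  case $1 in
    *"Done, now waiting for signal"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check the last log line of every driver pod and report its state.
# Fails if no driver pods are found or if any pod is not ready yet.
check_driver_pods() {
  ns=nvidia-gpu-operator
  found=0
  pending=0
  for pod in $(kubectl get pods -n "$ns" -o name | grep nvidia-driver-daemonset); do
    found=1
    last=$(kubectl logs -n "$ns" "$pod" --tail 1)
    if driver_line_ok "$last"; then
      echo "READY    $pod"
    else
      echo "PENDING  $pod: $last"
      pending=1
    fi
  done
  [ "$found" -eq 1 ] && [ "$pending" -eq 0 ]
}
```

Run `check_driver_pods` after defining the functions in your shell. A non-zero exit status means that at least one pod is still installing the driver or that no driver pods were found.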

## Example: Using CUDA for vector addition

To test CUDA support in a cluster whose GPU nodes already have the drivers installed, you can run a small CUDA application that adds two vectors:

1. [Connect to the cluster using kubectl](../connect).
2. Follow the instructions in the [NVIDIA GPU Operator documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#cuda-vectoradd).
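
For orientation, the steps from the NVIDIA documentation roughly come down to the following sketch. The pod name and the image tag are assumptions based on the sample in that documentation and may be outdated, so check the linked page for the current manifest. The `run_vectoradd` and `vectoradd_result` function names are hypothetical:

```bash
# Create a pod that runs NVIDIA's vectorAdd sample and requests one GPU.
# The image tag is an assumption; verify it in the NVIDIA documentation.
run_vectoradd() {
  kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Print the sample's output and delete the pod.
# Call this once the pod has completed.
vectoradd_result() {
  kubectl logs pod/cuda-vectoradd
  kubectl delete pod cuda-vectoradd
}
```

Run `run_vectoradd`, wait until `kubectl get pod cuda-vectoradd` shows that the pod has completed, and then run `vectoradd_result`. According to the NVIDIA documentation, the log of a successful run should contain a `Test PASSED` line.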

## See also

* [Interconnecting GPUs in a Managed Kubernetes cluster using InfiniBand](./clusters)
* [Tutorial: Running NCCL tests in a cluster with InfiniBand-connected GPUs](./nccl-test)
* [Creating and modifying node groups](../node-groups/manage)

***

*InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.*
