Prerequisites
- CLI
- Terraform
- Install and configure the Nebius AI Cloud CLI.
- Create a cluster and save its ID to an environment variable:
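A minimal sketch of doing this with the CLI. The cluster creation flags and the JSON output shape are assumptions, not confirmed by this page; check `nebius mk8s cluster create --help` for the exact syntax:

```shell
# Hypothetical sketch: the subcommand flags and output format are
# assumptions; verify with `nebius mk8s cluster create --help`.
NB_K8S_CLUSTER_ID=$(nebius mk8s cluster create \
  --name mk8s-cluster-test \
  --format json | jq -r '.metadata.id')
export NB_K8S_CLUSTER_ID
```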
How to create node groups
Node groups define the characteristics of the virtual machines (VMs) that run your workloads. Each node group includes identical nodes created with the same template. You can create different types of node groups depending on your performance, cost and availability requirements. For example, you can choose high-performance GPUs for compute-intensive workloads or preemptible VMs to reduce costs for interruptible tasks.
Regular node groups
- Web console
- CLI
- Terraform
- In the sidebar, go to Compute → Kubernetes.
- Open the page of the cluster where you want to create a node group.
- Switch to the Node groups tab.
- Click Create node group.
- On the page that opens, specify a name for the node group (for example, mk8s-node-group-test).
- (Optional) Enable the Assign public IPv4 addresses option if you want the nodes to be accessible from the internet.
- Under Size, specify the initial Number of nodes. If you want to let the node group scale up or down depending on the workload, enable autoscaling. After that, specify the minimum and maximum number of nodes that the group can have.
- Configure the Computing resources section:
  - Select whether the node group should have GPUs.
  - Select a regular VM type. VMs without GPUs support only the regular type. For information about creating preemptible node groups, see the instructions below.
- (Optional) For a regular VM with GPUs, select Reservation usage and specify whether Managed Kubernetes should allocate resources for the node group from reservations. The Reservation usage field is displayed only if you have capacity block groups.
  More information about reservation usage:
  - With reservations: The resources are allocated from reservations (capacity block groups). For example, if a Nebius manager has created a capacity block group for you, Managed Kubernetes allocates GPUs for the node group from this capacity block group. This ensures that resources are always available, even if VMs in the node group are stopped (for example, by you or a maintenance event). You can use one of the following reservation types:
    - Any (default): You do not need to select reservations. The service uses the reservations that are most suitable for the configuration of your VM.
    - Specific: Select specific reservations. Make sure to select reservations that have enough capacity and that do not expire within the next few days.
  - Without reservations: The resources are allocated from a common pool, and no reservations are used for the node group.
- Select an available platform and a preset (a combination of GPUs, vCPUs and RAM) that fits your workload requirements.
- (Optional) If you create a node group with 8 GPUs (for example, for training models), use a GPU cluster for the node group. InfiniBand™ in the cluster allows you to accelerate tasks that require high-performance computing (HPC) power. A single node group without InfiniBand cannot perform these tasks as quickly.
  To use a GPU cluster, select an existing one or create a new cluster:
  - Click Create in the GPU cluster field.
  - In the window that opens, specify the cluster name and InfiniBand fabric. To select the fabric, see InfiniBand fabrics.
  - Click Create.
- (Optional) Enable or disable GPU settings. They are enabled by default, and they allow Managed Kubernetes to pre-install NVIDIA drivers and the Container Toolkit. You can also select a specific NVIDIA CUDA driver version. Disable GPU settings only if you need to install specific driver versions manually or use a custom operator. Disabling is not recommended.
- Select an operating system for the nodes (for example, Ubuntu 24.04 LTS).
- Under Node storage, select the disk type and specify the size in GiB. The supported disk types are the following:
- SSD: Standard solid-state drive for general-purpose workloads.
- SSD NRD: Network-replicated SSD providing higher reliability through data duplication across the network.
- SSD IO: High-performance SSD optimized for I/O-intensive operations with lower latency.
- (Optional) If you want to attach a filesystem to your node group, specify its settings in the Shared filesystems section:
  - Click Attach shared filesystem.
  - In the window that opens, select an existing filesystem or create a new one.
  - If you create a new filesystem, specify its name, size and block size.
  - Click Attach filesystem or Create and attach filesystem.
  - After the window closes, specify a mount tag for mounting the filesystem to the VM. Create your own tag, such as my-filesystem, and make sure that it is unique within the VM.
  - To mount the filesystem to the node group automatically, keep the Auto mount option enabled.
- (Optional) In the Username and SSH key field, add credentials so that you can connect to the node group:
  - Generate an SSH key pair.
  - If you added an SSH key earlier and want to reuse it, select the key from the drop-down list in the Username and SSH key field.
  - If you want to add a new key, click Add credentials.
  - In the window that opens, specify the username of the node group user, the public key of your SSH key pair and a credentials name to recognize the key in the list.
  - Click Add credentials.
- (Optional) Under Additional, select or create a service account that will perform actions on behalf of the nodes.
- Click Create node group.
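The SSH key pair mentioned in the credentials step can be generated with the standard ssh-keygen tool; the filename and comment below are examples:

```shell
# Generate an Ed25519 key pair; the filename and comment are examples.
ssh-keygen -t ed25519 -f mk8s_node_group_key -N "" -C "mk8s-node-group-access"
# Paste the contents of the .pub file into the Add credentials window:
cat mk8s_node_group_key.pub
```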
Preemptible node groups
Preemptible nodes use virtual machines that can be stopped by Nebius AI Cloud at any time. These VMs are more cost-efficient than regular ones and are suitable for workloads that tolerate interruptions, such as batch processing or training ML models. For more information about how preemptible VMs work, see Preemptible virtual machines.
- Web console
- CLI
- Terraform
- In the sidebar, go to Compute → Kubernetes.
- Create a cluster or choose an existing one.
- On the cluster page, switch to the Node groups tab.
- Click Create node group.
- When creating the node group, under Computing resources, select:
  - With GPU
  - Preemptible VM type
How to modify node groups
Modifying the node group template (the GPU cluster, GPU settings and boot disk) triggers a rolling update: Managed Kubernetes replaces each node with a new one that has the new configuration. If you modify other parameters, Managed Kubernetes does not replace the nodes; they remain unchanged. During the node group update, by default, no node is unavailable and the group size can increase by one node. This behavior comes from the default values of the deployment strategy parameters: --strategy-max-unavailable-count 0 and --strategy-max-surge-count 1. You can change them when you modify a node group by using the CLI.
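For example, a hedged CLI sketch of relaxing these defaults when modifying a node group. The `--strategy-*` flags are the ones documented on this page; the `--id` flag is an assumption, so check `nebius mk8s node-group update --help`:

```shell
# --strategy-* flags are documented in the parameters section of this
# page; --id is an assumption — verify with the CLI help.
nebius mk8s node-group update \
  --id "$NB_K8S_NODE_GROUP_ID" \
  --strategy-max-unavailable-count 1 \
  --strategy-max-surge-count 2
```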
- Web console
- CLI
- Terraform
To modify a node group:
- In the sidebar, go to Compute → Kubernetes.
- Open the page of the required cluster and then go to the Node groups tab.
- Open the page of the node group that you wish to change.
- Switch to the Settings tab and then modify the required parameters. The following parameters are available for editing:
- Name: Name of the node group.
- Size:
- Number of nodes: Target and fixed number of nodes (if autoscaling is disabled). The maximum number is 100.
- Enable autoscaling: Allows you to set the range of nodes within which the cluster autoscaler adds or removes nodes as needed.
- Computing resources: Select whether the node group should have GPUs, and then specify the hardware configuration:
  - VM type:
- Regular: Standard VMs for high-availability production workloads.
- Preemptible: Lower-cost VMs that may be terminated by the platform at any time.
- Available platform and Preset: Combination of GPUs, vCPUs and RAM that fits your workload requirements. For more information, see Types of virtual machines and GPUs in Nebius AI Cloud.
- GPU cluster: GPU cluster with InfiniBand. Allows you to accelerate tasks that require HPC power. Available only if the node group contains 8 GPUs.
- GPU settings: If enabled, the system pre-installs NVIDIA drivers and the Container Toolkit. You can also select a specific NVIDIA CUDA driver version. Disable GPU settings only if you need to install specific driver versions manually or use a custom operator.
- Drivers: CUDA driver version; available when GPU settings are enabled.
- Operating system: OS for the nodes, for example, Ubuntu 24.04 LTS.
- Node storage:
- Disk type: Type of the boot disk.
- Size: Size of the boot disk in GiB.
- Click Save changes.
Deployment strategy and quotas
When you modify a node group’s Kubernetes version or node template, Managed Kubernetes performs a rolling update of each node in the group:
- Creates a replacement node.
- Cordons the existing node (marks it as unschedulable).
- Drains the existing node (evicts all pods from it).
- Deletes the existing node.
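Managed Kubernetes performs these steps automatically. For reference, the cordon and drain steps correspond to the standard kubectl operations (the node name below is a placeholder):

```shell
# Manual equivalents of the cordon and drain steps:
kubectl cordon my-node     # mark the node unschedulable
kubectl drain my-node --ignore-daemonsets --delete-emptydir-data  # evict pods
kubectl uncordon my-node   # revert, if ever needed
```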
For example, suppose each node uses 8 GPUs, 128 vCPUs, 1600 GiB RAM and a public IP address, you have 3 nodes in the cluster, and the deployment strategy is --strategy-max-surge-count 2. During the update, you need quotas for the following additional resources:
- 16 GPUs (2 × 8)
- 256 vCPUs (2 × 128)
- 3200 GiB RAM (2 × 1600)
- 2 public IP addresses
If your quotas allow for only one extra node, the update is still performed using the default --strategy-max-surge-count 1. In this case, nodes are updated one by one: while one node is being replaced, update attempts for the others may temporarily fail but will eventually complete.
When you or the autoscaler scales a node group up or down, Managed Kubernetes does not recreate any nodes.
Node group parameters
- CLI
- Terraform
The nebius mk8s node-group create and nebius mk8s node-group update commands support the following parameters:
- Metadata
  - --name: Node group name. Must be unique within the tenant. Cannot be changed after creation.
- Kubernetes version on nodes
  - --version: Kubernetes version in <major>.<minor> format. The recommended version is 1.33. For more information, see Kubernetes versions in Managed Service for Kubernetes.
- Node group size
  - --fixed-node-count: Number of nodes in the group. The maximum is 100.
  - --autoscaling-min-node-count, --autoscaling-max-node-count: Set the range of nodes within which the cluster autoscaler adds or removes nodes as needed.
- Node template
  All nodes in a group are identical and are created based on a node template. A node template is similar to a virtual machine specification in Compute. The node template has the following parameters:
  - --template-taints: Array of Kubernetes taints (rules that repel pods from nodes) for all nodes in the group.
  - --template-resources-platform: A platform with GPUs, see Interconnecting GPUs in Managed Service for Kubernetes® clusters using InfiniBand™.
  - --template-resources-preset: A compatible preset (number of GPUs and vCPUs, RAM size), see Types of virtual machines and GPUs in Nebius AI Cloud.
  - --template-gpu-settings-drivers-preset: GPU drivers preset, see GPU drivers and other components.
  - --template-gpu-cluster-id: GPU cluster ID.
  - --template-service-account-id: Service account ID. You can add a service account, for example, to pull images from Container Registry.
  - --template-network-interfaces: Network interface configuration (for example, subnet ID; see How to use a non-default subnet for Managed Service for Kubernetes® clusters and node groups).
  - --template-filesystems: Filesystems for nodes, see How to attach volumes to VMs.
  - --template-reservation-policy-policy: Policy for reservation usage. You can use reservations of capacity resources and run your node group based on them. As a result, the node group resources are reserved and always available.
  - --template-reservation-policy-reservation-ids: IDs of specific reservations (capacity block groups that a Nebius manager has created). For information about how to configure --template-reservation-policy-policy and --template-reservation-policy-reservation-ids, see How to add reservations to node groups.
- Deployment strategy
The deployment strategy of a node group defines how it is updated when necessary — for example, when you modify the group’s node template or Kubernetes version, or when nodes fail and need to be replaced. For more details, see Deployment strategy and quotas.
The following parameters specify the deployment strategy:
  - --strategy-max-unavailable-percent, --strategy-max-unavailable-count: The maximum number of nodes in a group that can be unavailable at any time during an update, set as a percentage of the group’s target size or as a number of nodes. When a percentage is used, the number of nodes is calculated by rounding down. For example, if the value of --strategy-max-unavailable-percent is 40 and the group’s target size is 3, at most ⌊3 × 40%⌋ = ⌊1.2⌋ = 1 node can be unavailable at any time during the update. As a result, nodes are replaced one at a time; a running node is not stopped or deleted until the previous one has been replaced by a new running node. The default value is 0. Cannot be set to 0 if --strategy-max-surge-count or --strategy-max-surge-percent is 0.
  - --strategy-max-surge-percent, --strategy-max-surge-count: The maximum number of nodes in a group that can exceed the group’s target size at any time during an update, set as a percentage of the target size or as a number of nodes. For example, if the value of --strategy-max-surge-count is 2 and the group’s target size is 3, the group can have at most 3 + 2 = 5 nodes at any time during the update. The default value is --strategy-max-surge-count 1. Cannot be set to 0 if --strategy-max-unavailable-count or --strategy-max-unavailable-percent is 0.
  - --strategy-drain-timeout: The maximum amount of time it can take to drain a node during the update. If the timeout is set, a node in the updated group is deleted when it reaches the timeout, even if its draining is not complete.
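The round-down rule for percentage-based values can be sketched as follows; `max_unavailable_nodes` is a hypothetical helper for illustration, not part of the CLI:

```python
import math

def max_unavailable_nodes(target_size: int, percent: int) -> int:
    """Maximum number of nodes that may be unavailable during an update
    when --strategy-max-unavailable-percent is used: the given percentage
    of the target size, rounded down."""
    return math.floor(target_size * percent / 100)

# The example above: target size 3, 40% -> floor(1.2) = 1 node.
print(max_unavailable_nodes(3, 40))  # 1
```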
How to delete node groups
- Web console
- CLI
- Terraform
- In the sidebar, go to Compute → Kubernetes.
- Open the cluster page and then go to the Node groups tab.
- Open the page of the node group that you want to remove.
- Switch to the Settings tab.
- Click Delete node group.
- Confirm the deletion.
Examples
- CLI
- Terraform
- Creating a node group with two nodes, each with 8 NVIDIA H100 GPUs, 128 vCPUs, 1600 GiB of RAM, a 100 GiB network SSD disk and Kubernetes version 1.33:
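A sketch of such a create command using the flags listed in the parameters section. The platform and preset identifiers and the `--parent-id` flag are assumptions, so check the CLI help for the exact values:

```shell
# Hypothetical values: the platform/preset names and --parent-id are
# assumptions; the boot-disk size flag is not listed on this page —
# verify everything with `nebius mk8s node-group create --help`.
nebius mk8s node-group create \
  --parent-id "$NB_K8S_CLUSTER_ID" \
  --name mk8s-node-group-test \
  --version 1.33 \
  --fixed-node-count 2 \
  --template-resources-platform gpu-h100-sxm \
  --template-resources-preset 8gpu-128vcpu-1600gb
```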
- Modifying the node group from the previous example (ID $NB_K8S_NODE_GROUP_ID) to add a node and enable public IP addresses for all nodes:
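A hedged sketch of the update: `--fixed-node-count` is documented above, while the `--id` flag is an assumption. Public IP configuration goes through `--template-network-interfaces`, whose exact syntax is not given on this page:

```shell
# --id is an assumption; add the public-IP setting via
# --template-network-interfaces after checking
# `nebius mk8s node-group update --help` for its syntax.
nebius mk8s node-group update \
  --id "$NB_K8S_NODE_GROUP_ID" \
  --fixed-node-count 3
```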
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.