Prerequisites
- CLI
- Terraform
- Install and configure the Nebius AI Cloud CLI.
- Create a cluster and save its ID to an environment variable:
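A minimal sketch of doing this with the CLI. The cluster creation flags and the JSON output shape are assumptions, not confirmed by this page; check `nebius mk8s cluster create --help` for the exact syntax:

```shell
# Hypothetical sketch: the subcommand flags and output format are
# assumptions; verify with `nebius mk8s cluster create --help`.
NB_K8S_CLUSTER_ID=$(nebius mk8s cluster create \
  --name mk8s-cluster-test \
  --format json | jq -r '.metadata.id')
export NB_K8S_CLUSTER_ID
```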
How to create node groups
Node groups define the characteristics of the virtual machines (VMs) that run your workloads. Each node group includes identical nodes created with the same template. You can create different types of node groups depending on your performance, cost and availability requirements. For example, you can choose high-performance GPUs for compute-intensive workloads or preemptible VMs to reduce costs for interruptible tasks.
Regular node groups
- Web console
- CLI
- Terraform
- In the sidebar, go to Compute → Kubernetes.
- Open the page of the cluster where you want to create a node group.
- Switch to the Node groups tab.
- Click Create node group.
- On the page that opens, specify a name for the node group (for example, mk8s-node-group-test).
- (Optional) Enable the Assign public IPv4 addresses option if you want the nodes to be accessible from the internet.
- Under Size, specify the initial Number of nodes. If you want to let the node group scale up or down depending on the workload, enable autoscaling. After that, specify the minimum and maximum number of nodes that the group can have.
- Configure the Computing resources section:
  - Select whether the node group should have GPUs.
  - Select a regular VM type. VMs without GPUs support only the regular type. For information about creating preemptible node groups, see the instructions below.
- (Optional) For a regular VM with GPUs, select Reservation usage and specify whether Managed Kubernetes should allocate resources for the node group from reservations. The Reservation usage field is displayed only if you have capacity block groups.
  More information about reservation usage:
  - With reservations: The resources are allocated from reservations (capacity block groups). For example, if a Nebius manager has created a capacity block group for you, Managed Kubernetes allocates GPUs for the node group from this capacity block group. This ensures that resources are always available, even if VMs in the node group are stopped (for example, by you or a maintenance event). You can use one of the following reservation types:
    - Any (default): You do not need to select reservations. The service uses the reservations that are most suitable for the configuration of your VM.
    - Specific: Select specific reservations. Make sure to select reservations that have enough capacity and that do not expire within the next few days.
  - Without reservations: The resources are allocated from a common pool, and no reservations are used for the node group.
- Select an available platform and a preset (a combination of GPUs, vCPUs and RAM) that fits your workload requirements.
- (Optional) If you create a node group with 8 GPUs (for example, for training models), use a GPU cluster for the node group. InfiniBand™ in the cluster allows you to accelerate tasks that require high-performance computing (HPC) power. A single node group without InfiniBand cannot perform these tasks as quickly.
  To use a GPU cluster, select an existing one or create a new cluster:
  - Click Create in the GPU cluster field.
  - In the window that opens, specify the cluster name and InfiniBand fabric. To select the fabric, see InfiniBand fabrics.
  - Click Create.
- (Optional) Enable or disable GPU settings. They are enabled by default, and they allow Managed Kubernetes to pre-install NVIDIA drivers and the Container Toolkit. You can also select a specific NVIDIA CUDA driver version. Disable GPU settings only if you need to install specific driver versions manually or use a custom operator. Disabling is not recommended.
- Select an operating system for the nodes (for example, Ubuntu 24.04 LTS).
- Under Node storage, select the disk type and specify the size in GiB. The supported disk types are the following:
- SSD: Standard solid-state drive for general-purpose workloads.
- SSD NRD: Network-replicated SSD providing higher reliability through data duplication across the network.
- SSD IO: High-performance SSD optimized for I/O-intensive operations with lower latency.
- (Optional) If you want to attach a filesystem to your node group, specify its settings in the Shared filesystems section:
  - Click Attach shared filesystem.
  - In the window that opens, select an existing filesystem or create a new one.
  - If you create a new filesystem, specify its name, size and block size.
  - Click Attach filesystem or Create and attach filesystem.
  - After the window closes, specify a mount tag for mounting the filesystem to the VM. Create your own tag, such as my-filesystem, and make sure that it is unique within the VM.
  - To mount the filesystem to the node group automatically, keep the Auto mount option enabled.
- (Optional) In the Username and SSH key field, add credentials so that you can connect to the node group:
  - Generate an SSH key pair.
  - If you added an SSH key earlier and want to reuse it, select the key from the drop-down list in the Username and SSH key field.
  - If you want to add a new key, click Add credentials.
  - In the window that opens, specify the username of the node group user, the public key of your SSH key pair and a credentials name to recognize the key in the list.
  - Click Add credentials.
- (Optional) Under Additional, select or create a service account that will perform actions on behalf of the nodes.
- Click Create node group.
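The SSH key pair mentioned in the credentials step can be generated with the standard ssh-keygen tool; the filename and comment below are examples:

```shell
# Generate an Ed25519 key pair; the filename and comment are examples.
ssh-keygen -t ed25519 -f mk8s_node_group_key -N "" -C "mk8s-node-group-access"
# Paste the contents of the .pub file into the Add credentials window:
cat mk8s_node_group_key.pub
```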
Preemptible node groups
Preemptible nodes use virtual machines that can be stopped by Nebius AI Cloud at any time. These VMs are more cost-efficient than regular ones and are suitable for workloads that tolerate interruptions, such as batch processing or training ML models. For more information about how preemptible VMs work, see Preemptible virtual machines.
- Web console
- CLI
- Terraform
- In the sidebar, go to Compute → Kubernetes.
- Create a cluster or choose an existing one.
- On the cluster page, switch to the Node groups tab.
- Click Create node group.
- When creating the node group, under Computing resources, select:
  - With GPU
  - Preemptible VM type
How to modify node groups
Modifying the node group template (the GPU cluster, GPU settings and boot disk) triggers a rolling update: Managed Kubernetes replaces each node with a new one that has the new configuration. If you modify other parameters, Managed Kubernetes does not replace the nodes; they remain unchanged. During the node group update, by default, no node is unavailable and the group size can increase by one node. This behavior comes from the default values of the deployment strategy parameters: --strategy-max-unavailable-count 0 and --strategy-max-surge-count 1. You can change them when you modify a node group by using the CLI.
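For example, a hedged CLI sketch of relaxing these defaults when modifying a node group. The `--strategy-*` flags are the ones documented on this page; the `--id` flag is an assumption, so check `nebius mk8s node-group update --help`:

```shell
# --strategy-* flags are documented in the parameters section of this
# page; --id is an assumption — verify with the CLI help.
nebius mk8s node-group update \
  --id "$NB_K8S_NODE_GROUP_ID" \
  --strategy-max-unavailable-count 1 \
  --strategy-max-surge-count 2
```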
- Web console
- CLI
- Terraform
To modify a node group:
- In the sidebar, go to Compute → Kubernetes.
- Open the page of the required cluster and then go to the Node groups tab.
- Open the page of the node group that you wish to change.
- Switch to the Settings tab and then modify the required parameters. The following parameters are available for editing:
- Name: Name of the node group.
- Size:
- Number of nodes: Target and fixed number of nodes (if autoscaling is disabled). The maximum number is 100.
- Enable autoscaling: Allows you to set the range of nodes within which the cluster autoscaler adds or removes nodes as needed.
- Computing resources: Select whether the node group should have GPUs, and then specify the hardware configuration:
  - VM type:
- Regular: Standard VMs for high-availability production workloads.
- Preemptible: Lower-cost VMs that may be terminated by the platform at any time.
- Available platform and Preset: Combination of GPUs, vCPUs and RAM that fits your workload requirements. For more information, see Types of virtual machines and GPUs in Nebius AI Cloud.
- GPU cluster: GPU cluster with InfiniBand. Allows you to accelerate tasks that require HPC power. Available only if the node group contains 8 GPUs.
- GPU settings: If enabled, the system pre-installs NVIDIA drivers and the Container Toolkit. You can also select a specific NVIDIA CUDA driver version. Disable GPU settings only if you need to install specific driver versions manually or use a custom operator.
- Drivers: CUDA driver version; available when GPU settings are enabled.
- Operating system: OS for the nodes, for example, Ubuntu 24.04 LTS.
- Node storage:
- Disk type: Type of the boot disk.
- Size: Size of the boot disk in GiB.
- Click Save changes.
Deployment strategy and quotas
When you modify a node group’s Kubernetes version or node template, Managed Kubernetes performs a rolling update of each node in the group:
- Creates a replacement node.
- Cordons the existing node (marks it as unschedulable).
- Drains the existing node (evicts all pods from it).
- Deletes the existing node.
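Managed Kubernetes performs these steps automatically. For reference, the cordon and drain steps correspond to the standard kubectl operations (the node name below is a placeholder):

```shell
# Manual equivalents of the cordon and drain steps:
kubectl cordon my-node     # mark the node unschedulable
kubectl drain my-node --ignore-daemonsets --delete-emptydir-data  # evict pods
kubectl uncordon my-node   # revert, if ever needed
```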
For example, suppose each node uses 8 GPUs, 128 vCPUs, 1600 GiB RAM and a public IP address, you have 3 nodes in the cluster, and the deployment strategy is --strategy-max-surge-count 2. During the update, you need quotas for the following additional resources:
- 16 GPUs (2 × 8)
- 256 vCPUs (2 × 128)
- 3200 GiB RAM (2 × 1600)
- 2 public IP addresses
If your quotas allow for only one extra node, the update is still performed using the default --strategy-max-surge-count 1. In this case, nodes are updated one by one: while one node is being replaced, update attempts for the others may temporarily fail but will eventually complete.
When you or the autoscaler scales a node group up or down, Managed Kubernetes does not recreate any nodes.
Node group parameters
- CLI
- Terraform
The nebius mk8s node-group create and nebius mk8s node-group update commands support the following parameters:
- Metadata
  - --name: Node group name. Must be unique within the tenant. Cannot be changed after creation.
- Kubernetes version on nodes
  - --version: Kubernetes version in <major>.<minor> format. The recommended version is 1.33. For more information, see Kubernetes versions in Managed Service for Kubernetes.
- Node group size
  - --fixed-node-count: Number of nodes in the group. The maximum is 100.
  - --autoscaling-min-node-count, --autoscaling-max-node-count: Set the range of nodes within which the cluster autoscaler adds or removes nodes as needed.
- Node template
  All nodes in a group are identical and are created based on a node template. A node template is similar to a virtual machine specification in Compute. The node template has the following parameters:
  - --template-taints: Array of Kubernetes taints (rules that repel pods from nodes) for all nodes in the group.
  - --template-resources-platform: A platform with GPUs, see Interconnecting GPUs in Managed Service for Kubernetes® clusters using InfiniBand™.
  - --template-resources-preset: A compatible preset (number of GPUs and vCPUs, RAM size), see Types of virtual machines and GPUs in Nebius AI Cloud.
  - --template-gpu-settings-drivers-preset: GPU drivers preset, see GPU drivers and other components.
  - --template-gpu-cluster-id: GPU cluster ID.
  - --template-service-account-id: Service account ID. You can add a service account, for example, to pull images from Container Registry.
  - --template-network-interfaces: Network interface configuration (for example, subnet ID; see How to use a non-default subnet for Managed Service for Kubernetes® clusters and node groups).
  - --template-filesystems: Filesystems for nodes, see How to attach volumes to VMs.
  - --template-reservation-policy-policy: Policy for reservation usage. You can use reservations of capacity resources and run your node group based on them. As a result, the node group resources are reserved and always available.
  - --template-reservation-policy-reservation-ids: IDs of specific reservations (capacity block groups that a Nebius manager has created). For information about how to configure --template-reservation-policy-policy and --template-reservation-policy-reservation-ids, see How to add reservations to node groups.
- Deployment strategy
The deployment strategy of a node group defines how it is updated when necessary — for example, when you modify the group’s node template or Kubernetes version, or when nodes fail and need to be replaced. For more details, see Deployment strategy and quotas.
The following parameters specify the deployment strategy:
  - --strategy-max-unavailable-percent, --strategy-max-unavailable-count: The maximum number of nodes in a group that can be unavailable at any time during an update, set as a percentage of the group’s target size or as a number of nodes. When a percentage is used, the number of nodes is calculated by rounding down. For example, if the value of --strategy-max-unavailable-percent is 40 and the group’s target size is 3, at most ⌊3 × 40%⌋ = ⌊1.2⌋ = 1 node can be unavailable at any time during the update. As a result, nodes are replaced one at a time; a running node is not stopped or deleted until the previous one has been replaced by a new running node. The default value is 0. Cannot be set to 0 if --strategy-max-surge-count or --strategy-max-surge-percent is 0.
  - --strategy-max-surge-percent, --strategy-max-surge-count: The maximum number of nodes in a group that can exceed the group’s target size at any time during an update, set as a percentage of the target size or as a number of nodes. For example, if the value of --strategy-max-surge-count is 2 and the group’s target size is 3, the group can have at most 3 + 2 = 5 nodes at any time during the update. The default value is --strategy-max-surge-count 1. Cannot be set to 0 if --strategy-max-unavailable-count or --strategy-max-unavailable-percent is 0.
  - --strategy-drain-timeout: The maximum amount of time it can take to drain a node during the update. If the timeout is set, a node in the updated group is deleted when it reaches the timeout, even if its draining is not complete.
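The round-down rule for percentage-based values can be sketched as follows; `max_unavailable_nodes` is a hypothetical helper for illustration, not part of the CLI:

```python
import math

def max_unavailable_nodes(target_size: int, percent: int) -> int:
    """Maximum number of nodes that may be unavailable during an update
    when --strategy-max-unavailable-percent is used: the given percentage
    of the target size, rounded down."""
    return math.floor(target_size * percent / 100)

# The example above: target size 3, 40% -> floor(1.2) = 1 node.
print(max_unavailable_nodes(3, 40))  # 1
```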
How to delete node groups
- Web console
- CLI
- Terraform
- In the sidebar, go to Compute → Kubernetes.
- Open the cluster page and then go to the Node groups tab.
- Open the page of the node group that you want to remove.
- Switch to the Settings tab.
- Click Delete node group.
- Confirm the deletion.
Examples
- CLI
- Terraform
- Creating a node group with two nodes, each with 8 NVIDIA H100 GPUs, 128 vCPUs, 1600 GiB of RAM, a 100 GiB network SSD disk and Kubernetes version 1.33:
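A sketch of such a create command using the flags listed in the parameters section. The platform and preset identifiers and the `--parent-id` flag are assumptions, so check the CLI help for the exact values:

```shell
# Hypothetical values: the platform/preset names and --parent-id are
# assumptions; the boot-disk size flag is not listed on this page —
# verify everything with `nebius mk8s node-group create --help`.
nebius mk8s node-group create \
  --parent-id "$NB_K8S_CLUSTER_ID" \
  --name mk8s-node-group-test \
  --version 1.33 \
  --fixed-node-count 2 \
  --template-resources-platform gpu-h100-sxm \
  --template-resources-preset 8gpu-128vcpu-1600gb
```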
- Modifying the node group from the previous example (ID $NB_K8S_NODE_GROUP_ID) to add a node and enable public IP addresses for all nodes:
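A hedged sketch of the update: `--fixed-node-count` is documented above, while the `--id` flag is an assumption. Public IP configuration goes through `--template-network-interfaces`, whose exact syntax is not given on this page:

```shell
# --id is an assumption; add the public-IP setting via
# --template-network-interfaces after checking
# `nebius mk8s node-group update --help` for its syntax.
nebius mk8s node-group update \
  --id "$NB_K8S_NODE_GROUP_ID" \
  --fixed-node-count 3
```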
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.