Maintenance in Managed Service for Kubernetes®

The goal of maintenance in Managed Service for Kubernetes is to terminate a node as gracefully as possible. When a node must be terminated, Nebius AI Cloud issues a maintenance event. Maintenance events are triggered when software or hardware fails on the physical machines that host your nodes, or when Nebius AI Cloud runs planned maintenance. Software and hardware failures account for the vast majority of maintenance events. Managed Kubernetes listens for maintenance events issued by the underlying services. In particular, because every Kubernetes node is a Compute virtual machine, Managed Kubernetes tracks Compute maintenance events.

How maintenance occurs

Nebius AI Cloud issues a maintenance event. When the event is issued, Managed Kubernetes sets the NebiusMaintenanceScheduled Kubernetes condition on the node. You can check the node's list of conditions to make sure that the service has issued the event.
The Managed Kubernetes service detects the event on a node. The service groups nodes into batches within a given node group. If many maintenance events are expected in a Managed Kubernetes cluster, batching prevents all nodes from being stopped at once. The batch size is 1, or the .spec.strategy.max_unavailable value if that value is greater than 1. You can check the .spec.strategy.max_unavailable parameter by using the following command:
```
nebius mk8s node-group get --id <node_group_ID>
```
To stop scheduling new Pods, Managed Kubernetes cordons the node.
The service waits for workloads on the node to finish. Workloads should finish at least one hour before the SLA deadline of the maintenance event, which is the latest time at which the maintenance can take place. You can check the SLA deadline together with the Kubernetes conditions.
To remove existing Pods, Managed Kubernetes drains the node. The drain takes up to one hour.
Nebius AI Cloud stops the Compute VM (that is, the node).
Nebius AI Cloud starts the VM.
Managed Kubernetes uncordons the node and enables scheduling new Pods.
Managed Kubernetes removes the NebiusMaintenanceScheduled condition from the node.

After that, the node is considered to be healthy. Workloads can run on this node again.
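
The steps above can be followed from the command line. The sketch below is illustrative and not part of the service: the helper names (maintenance_condition, cordoned_nodes, batch_size) are hypothetical, it assumes kubectl is configured for your cluster, and it assumes .spec.strategy.max_unavailable is a plain integer count. It only defines functions and does not change anything in the cluster.

```shell
# Hypothetical helpers for observing maintenance on Managed Kubernetes nodes.
# Assumes kubectl already points at the cluster you want to inspect.

# Print the NebiusMaintenanceScheduled condition of a node, if it is set.
# Usage: maintenance_condition <node_name>
maintenance_condition() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="NebiusMaintenanceScheduled")]}'
}

# List nodes that are currently cordoned (marked unschedulable).
cordoned_nodes() {
  kubectl get nodes --field-selector spec.unschedulable=true
}

# Compute the batch size rule described above:
# 1, or max_unavailable if that value is greater than 1.
# Assumption: max_unavailable is passed as an integer count.
# Usage: batch_size <max_unavailable>
batch_size() {
  max_unavailable="${1:-0}"
  if [ "$max_unavailable" -gt 1 ]; then
    echo "$max_unavailable"
  else
    echo 1
  fi
}
```

For example, running maintenance_condition with a node name prints the condition only while maintenance is scheduled for that node, and batch_size 3 prints the batch size that would apply to a node group whose max_unavailable value is 3.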

Manual launch of maintenance

The service runs maintenance automatically. However, you can also launch it manually if a maintenance event has been issued for your node. For more information, see How to launch maintenance manually in Managed Kubernetes.