How maintenance occurs
- Nebius AI Cloud issues a maintenance event. When the event is issued, Managed Kubernetes assigns the `NebiusMaintenanceScheduled` Kubernetes condition to the node. You can check the list of node conditions to make sure that the service has issued the event.
- The Managed Kubernetes service detects an event on a node.
The service groups nodes into batches within a given node group. If many maintenance events are expected in a Managed Kubernetes cluster, batching prevents all nodes from being stopped at once.
The batch size equals either 1 or the `.spec.strategy.max_unavailable` value, if this value is greater than 1. You can check the `.spec.strategy.max_unavailable` parameter by using the following command:
- To stop scheduling new pods, Managed Kubernetes cordons the node.
- The service waits for workloads on the node to finish. They should finish at least one hour before the SLA deadline of the maintenance event, which is the latest time by which the maintenance should take place. You can check the SLA deadline together with the node's Kubernetes conditions.
- To remove existing pods, Managed Kubernetes drains the node. The drain takes up to one hour.
- Nebius AI Cloud stops the Compute VM (that is, the node).
- Nebius AI Cloud starts the VM.
- Managed Kubernetes uncordons the node and re-enables scheduling of new pods.
- Managed Kubernetes removes the `NebiusMaintenanceScheduled` condition from the node.
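To check whether a maintenance event has been issued for a node, you can look for the `NebiusMaintenanceScheduled` condition among the node's conditions. The following is a minimal sketch, assuming you already have the Node object as JSON text (for example, the output of `kubectl get node <name> -o json`). It only matches on the condition's `type`. The helper name `find_maintenance_condition` and the example node are hypothetical. The sketch does not cover where the SLA deadline is exposed.

```python
import json
from typing import Optional

# Condition type named in this section.
MAINTENANCE_CONDITION = "NebiusMaintenanceScheduled"


def find_maintenance_condition(node_json: str) -> Optional[dict]:
    """Return the NebiusMaintenanceScheduled condition of a node, or None.

    Hypothetical helper: `node_json` is assumed to be the JSON text of a
    Kubernetes Node object, e.g. from `kubectl get node <name> -o json`.
    """
    node = json.loads(node_json)
    # Node conditions are stored in .status.conditions, each with a "type" field.
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == MAINTENANCE_CONDITION:
            return condition
    return None


# Usage with a trimmed, made-up node object.
example = json.dumps({
    "metadata": {"name": "node-a"},
    "status": {"conditions": [
        {"type": "Ready", "status": "True"},
        {"type": "NebiusMaintenanceScheduled", "status": "True"},
    ]},
})
print(find_maintenance_condition(example) is not None)  # True
```

Because the condition is removed at the end of maintenance, a `None` result after an event was issued suggests that maintenance on that node has completed.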
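The batch-size rule reduces to taking the larger of 1 and `.spec.strategy.max_unavailable`. The sketch below illustrates this rule with the hypothetical helpers `batch_size` and `make_batches`. It assumes `max_unavailable` is an integer node count rather than a percentage, and that nodes are taken in list order. How the service actually assigns nodes to batches is not described here.

```python
from typing import List, Sequence


def batch_size(max_unavailable: int) -> int:
    """Batch size: 1, or max_unavailable if it is greater than 1.

    Assumes max_unavailable is an integer node count (not a percentage).
    """
    return max_unavailable if max_unavailable > 1 else 1


def make_batches(nodes: Sequence[str], max_unavailable: int) -> List[List[str]]:
    """Split the nodes of one node group into consecutive batches (list order assumed)."""
    size = batch_size(max_unavailable)
    return [list(nodes[i:i + size]) for i in range(0, len(nodes), size)]


group = ["node-a", "node-b", "node-c", "node-d", "node-e"]
print(make_batches(group, 0))  # five batches of one node each
print(make_batches(group, 2))  # [['node-a', 'node-b'], ['node-c', 'node-d'], ['node-e']]
```

With the default of one node per batch, at most one node in the group is unavailable at a time. Raising `max_unavailable` speeds up maintenance at the cost of more capacity being offline simultaneously.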
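The timing requirement for workloads is simple arithmetic: workloads should finish no later than one hour before the SLA deadline. This is a small sketch of that calculation, using a made-up deadline. The function names are hypothetical, and obtaining the real deadline from the node is left out.

```python
from datetime import datetime, timedelta, timezone

# Workloads should finish at least this long before the SLA deadline.
WORKLOAD_MARGIN = timedelta(hours=1)


def latest_workload_finish(sla_deadline: datetime) -> datetime:
    """Latest time by which workloads on the node should finish (hypothetical helper)."""
    return sla_deadline - WORKLOAD_MARGIN


def finishes_in_time(expected_end: datetime, sla_deadline: datetime) -> bool:
    """True if a workload expected to end at `expected_end` respects the one-hour margin."""
    return expected_end <= latest_workload_finish(sla_deadline)


deadline = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)  # made-up SLA deadline
print(latest_workload_finish(deadline))  # 2025-01-10 11:00:00+00:00
print(finishes_in_time(datetime(2025, 1, 10, 11, 30, tzinfo=timezone.utc), deadline))  # False
```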
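The order of the per-node steps can be modeled as a simple simulation. The sketch below makes two assumptions that the steps above do not state. First, batches are processed one after another. Second, nodes in the same batch move through the steps together. The step labels and the helpers `maintenance_log` and `max_nodes_down` are hypothetical. The simulation shows that no more nodes are stopped at once than fit in one batch.

```python
from typing import List

# Hypothetical labels for the per-node steps, in the order listed above.
STEPS = [
    "cordon",              # stop scheduling new pods
    "wait_for_workloads",  # wait for workloads on the node to finish
    "drain",               # remove existing pods
    "stop_vm",             # Nebius AI Cloud stops the Compute VM
    "start_vm",            # Nebius AI Cloud starts the VM
    "uncordon",            # re-enable scheduling of new pods
    "remove_condition",    # remove NebiusMaintenanceScheduled from the node
]


def maintenance_log(batches: List[List[str]]) -> List[str]:
    """Return an ordered "node:step" log, one batch after another.

    Assumes all nodes in a batch complete each step before any moves on.
    """
    log = []
    for batch in batches:
        for step in STEPS:
            for node in batch:
                log.append(f"{node}:{step}")
    return log


def max_nodes_down(log: List[str]) -> int:
    """Peak number of nodes whose VM is stopped at the same time."""
    down = peak = 0
    for event in log:
        _, step = event.rsplit(":", 1)
        if step == "stop_vm":
            down += 1
            peak = max(peak, down)
        elif step == "start_vm":
            down -= 1
    return peak


log = maintenance_log([["node-a", "node-b"], ["node-c"]])
print(max_nodes_down(log))  # 2, the size of the largest batch
```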