- A Network SSD Non-replicated (Network SSD NRD) boot disk experiences input/output (I/O) issues and does not work correctly.
- A node is reporting a false or unknown status.
- A node is experiencing problems with GPUs.
I/O issues of a Network SSD NRD boot disk
If a Network SSD NRD boot disk on a node cannot be read from or written to, the boot disk reports I/O errors. If the errors persist for 30 seconds or more, Managed Kubernetes sets the Kubernetes condition NebiusBootDiskIOError = True for the node.
To fix the I/O issues, the service runs an automatic recovery: it deletes the node and then creates a new one with a different name and healthy boot disk.
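To see whether a node currently carries this condition, you can inspect the `status.conditions` list of the Node object. The following is a minimal sketch, assuming the node has already been fetched and parsed into a dictionary (for example, from JSON output of the Kubernetes API). The helper name `condition_status` and the sample node are illustrative and not part of Managed Kubernetes.

```python
from typing import Optional


def condition_status(node: dict, condition_type: str) -> Optional[str]:
    """Return the status string of a node condition, or None if it is absent."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == condition_type:
            return cond.get("status")
    return None


# Illustrative Node object, trimmed to the fields used here (made-up data).
node = {
    "metadata": {"name": "example-node"},
    "status": {
        "conditions": [
            {"type": "NebiusBootDiskIOError", "status": "True"},
        ]
    },
}

# Kubernetes condition statuses are the strings "True", "False" or "Unknown".
boot_disk_failing = condition_status(node, "NebiusBootDiskIOError") == "True"
print(boot_disk_failing)
```

The helper returns `None` when the condition is missing, so a caller can tell "not reported" apart from an explicit `False`.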
False or unknown status of a node
Managed Kubernetes runs health checks and sets the NodeReady Kubernetes condition to reflect a node’s status. If the condition remains in the False status for more than five minutes, or in the Unknown status for more than 15 minutes, the service runs an automatic recovery: it deletes the node and then creates a healthy one with a different name.
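The two thresholds can be expressed as a small decision function. This is only a sketch of the rule described above, not the service's implementation: it assumes the time spent in a status is measured from the moment the condition last changed, and the names `RECOVERY_THRESHOLDS` and `needs_recovery` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the text: False for more than 5 minutes,
# Unknown for more than 15 minutes.
RECOVERY_THRESHOLDS = {
    "False": timedelta(minutes=5),
    "Unknown": timedelta(minutes=15),
}


def needs_recovery(status: str, last_transition: datetime, now: datetime) -> bool:
    """Return True if the condition has stayed in `status` longer than its threshold.

    Assumption: `last_transition` is when the condition entered `status`.
    """
    threshold = RECOVERY_THRESHOLDS.get(status)
    if threshold is None:
        # A status of "True" (or anything else) never triggers recovery here.
        return False
    return now - last_transition > threshold


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(needs_recovery("False", now - timedelta(minutes=6), now))
print(needs_recovery("Unknown", now - timedelta(minutes=10), now))
```

With these inputs, a node that has been in the False status for six minutes qualifies for recovery, while a node that has been Unknown for ten minutes does not yet.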
Issues with GPUs on a node
Managed Kubernetes runs several health checks for components of a GPU-based cluster. For example, the service checks GPUs, InfiniBand™ and NVLink by using the nvidia-smi, dcgmi and dmesg tools. The service also checks whether a GPU node experiences Xid errors or problems with error correction code (ECC) memory.
Each GPU health check runs every five minutes. If all GPU health checks have passed, the NebiusGPUError Kubernetes condition is set to the False status.
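The relationship between individual checks and the condition can be sketched as a simple aggregation. The text only states that the condition is False when all checks pass. Treating any failed check as True is an assumption made for this example, and the check names used below are hypothetical.

```python
from typing import Mapping


def gpu_error_condition(check_results: Mapping[str, bool]) -> str:
    """Map per-check results (True = passed) to a NebiusGPUError status string.

    Assumption: a single failed check is enough to report "True".
    """
    return "False" if all(check_results.values()) else "True"


# Hypothetical check names and made-up results, for illustration only.
healthy = {"nvidia_smi": True, "dcgmi": True, "xid_errors": True, "ecc": True}
degraded = {"nvidia_smi": True, "dcgmi": True, "xid_errors": False, "ecc": True}

print(gpu_error_condition(healthy))
print(gpu_error_condition(degraded))
```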
If the condition is set to the True status, Managed Kubernetes automatically recovers the node:
- To stop scheduling new pods, Managed Kubernetes cordons the node.
- The service waits until all workloads that consume GPUs are finished or stopped, and until these GPUs are released.
- To remove existing pods, Managed Kubernetes drains the node. The drain takes up to one hour.
- Nebius AI Cloud stops the node (the Compute virtual machine that the node is based on).
- Nebius AI Cloud starts the node.
- Managed Kubernetes uncordons the node and enables scheduling new pods.
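The order of the recovery steps above can be summarized in a short sketch. It does not call any real Nebius AI Cloud or Kubernetes API: the step names and the `run` callback are hypothetical placeholders for whatever performs each action.

```python
from typing import Callable, List

# Recovery steps in the order listed above (step names are illustrative).
GPU_RECOVERY_STEPS = (
    "cordon",                # stop scheduling new pods
    "wait_for_gpu_release",  # wait until GPU workloads finish and GPUs are released
    "drain",                 # remove existing pods (takes up to one hour)
    "stop_vm",               # stop the underlying Compute virtual machine
    "start_vm",              # start it again
    "uncordon",              # allow scheduling new pods
)


def recover_gpu_node(node_name: str, run: Callable[[str, str], None]) -> None:
    """Invoke `run(step, node_name)` for each recovery step, in order."""
    for step in GPU_RECOVERY_STEPS:
        run(step, node_name)


# Usage: record the steps instead of executing them.
log: List[str] = []
recover_gpu_node("example-node", lambda step, node: log.append(f"{step}:{node}"))
print(log)
```

Passing the action as a callback keeps the ordering separate from the execution, so the same sequence can be logged, dry-run or tested without touching a cluster.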
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.