When Compute schedules maintenance for your virtual machine (VM), it assigns a reason code that describes why the maintenance was triggered. The reason code helps you assess the severity of the underlying error and decide whether additional action is needed. After you identify the reason, stop and start your VM to prepare it for maintenance.
## How to identify the reason code for a maintenance event
You can view the reason code for maintenance events by:
- Checking the maintenance notification banner in the web console;
- Using the Nebius AI Cloud CLI to list all active maintenance events scheduled for resources in a project.
Run the following command, specifying your project ID:

```shell
nebius compute maintenance list-active --parent-id <project_ID>
```
The output contains a list of all maintenance events that are scheduled for resources in the project you specified.
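If many resources are affected, it can help to group the events by reason code before deciding what to do. The sketch below is a minimal example of such triage; the event structure (a dict with `instance_id` and `reason` keys) is an assumption for illustration and may not match the actual CLI output schema.

```python
# Group maintenance events by reason code to triage them in bulk.
# NOTE: the event dict shape used here is hypothetical, not the CLI's schema.
from collections import Counter

def count_by_reason(events):
    """Return a Counter mapping each reason code to its number of events."""
    return Counter(event["reason"] for event in events)

events = [
    {"instance_id": "vm-1", "reason": "HW_GPU_XID_119"},
    {"instance_id": "vm-2", "reason": "HW_IB_LINK_DOWN"},
    {"instance_id": "vm-3", "reason": "HW_GPU_XID_119"},
]
print(count_by_reason(events))
```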
## Reason codes

Maintenance events can be triggered by GPU, InfiniBand™, or node-level errors. The tables below show the reason codes that map to the different types of errors.

If maintenance was triggered by a condition that is not mapped to one of these reason codes, Compute assigns `OTHER` as the reason code.
### GPU errors

| Reason code | Description |
|---|---|
| `HW_GPU_PCI_FALLEN_OFF_BUS` | A GPU or NVSwitch has fallen off the PCI bus, typically due to critical thermal or power issues. The affected node is taken out of service for hardware inspection. |
| `HW_GPU_PCI_CONFIG_ERROR` | Unexpected GPU PCI configuration detected, or critical PCI errors observed between the GPU, deltaboard and motherboard. Requires physical hardware maintenance. |
| `HW_GPU_NVLINK_DOWN` | An NVLink connection is down on a Blackwell or newer GPU. Requires a GPU reset or VM restart to recover. |
| `HW_GPU_XID_62` | The GPU internal micro-controller has halted (XID 62). Requires a GPU reset or VM restart. |
| `HW_GPU_XID_109` | GPU context switch timeout (XID 109). Typically not fatal to running workloads, but may require a GPU reset or VM restart. |
| `HW_GPU_XID_119` | GSP RPC timeout (XID 119). Requires a GPU reset or VM restart. |
| `HW_GPU_FW_VERSION_UNAVAILABLE` | DCGM could not report the GPU firmware version. This is usually a symptom of other underlying hardware errors. |
| `HW_GPU_DRIVER_INIT_FAILED` | The NVIDIA® driver failed to initialize one or more GPUs. Typically caused by other hardware errors. |
### InfiniBand™ errors

| Reason code | Description |
|---|---|
| `HW_IB_LINK_DOWN` | The InfiniBand link has been in a physically down state for more than 3 minutes. |
| `HW_IB_PCI_FALLEN_OFF_BUS` | The InfiniBand adapter has fallen off the PCI bus, typically due to critical thermal or power issues. |
| `HW_IB_PCI_CONFIG_ERROR` | Unexpected InfiniBand PCI configuration detected, typically due to critical PCI errors. |
### Node-level errors

| Reason code | Description |
|---|---|
| `HW_NODE_OFFLINE` | The node hosting the VM went offline. The cause may vary. Affected VMs are force-migrated and will experience an unexpected reboot. |
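The reason codes above follow a consistent prefix convention (`HW_GPU_`, `HW_IB_`, `HW_NODE_`), which makes them easy to bucket programmatically. A minimal sketch of that mapping, with a fallback mirroring how Compute assigns `OTHER` to unmapped conditions:

```python
# Map a maintenance reason code to its error category by prefix.
# The prefixes come from the reason-code tables; anything else falls
# back to "OTHER", matching Compute's behavior for unmapped conditions.
def categorize_reason(code):
    if code.startswith("HW_GPU_"):
        return "GPU error"
    if code.startswith("HW_IB_"):
        return "InfiniBand error"
    if code.startswith("HW_NODE_"):
        return "Node-level error"
    return "OTHER"

print(categorize_reason("HW_GPU_XID_62"))    # GPU error
print(categorize_reason("HW_IB_LINK_DOWN"))  # InfiniBand error
```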
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.