
When Compute schedules maintenance for your virtual machine (VM), it assigns a reason code that describes why the maintenance was triggered. After you identify the reason code, stop and start your VM to prepare it for maintenance. The reason code also helps you assess the severity of the error and decide whether additional action is needed.

How to identify the reason code for a maintenance event

You can view the reason code for maintenance events in two ways:
  • Check the maintenance notification banner in the web console.
  • Use the Nebius AI Cloud CLI to list all active maintenance events scheduled for resources in a project. Run the following command, specifying your project ID:

    nebius compute maintenance list-active --parent-id <project_ID>

    The output lists all maintenance events scheduled for resources in the specified project.
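If you want to triage events programmatically, the reason codes can be pulled out of a listing with a simple pattern match. A minimal sketch, assuming the listing is available as plain text; the sample output below is hypothetical and the real CLI output format may differ:

```python
import re

# Hypothetical sample of a maintenance-event listing; the actual
# `nebius compute maintenance list-active` output may be formatted differently.
listing = """
event-1  instance-abc  HW_GPU_XID_119   scheduled
event-2  instance-def  HW_IB_LINK_DOWN  scheduled
"""

# Documented reason codes start with HW_; Compute falls back to OTHER
# for conditions that are not mapped to a known code.
codes = re.findall(r"\b(?:HW_[A-Z0-9_]+|OTHER)\b", listing)
print(codes)  # ['HW_GPU_XID_119', 'HW_IB_LINK_DOWN']
```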

Reason codes

Maintenance events can be triggered by GPU, InfiniBand™ or node-level errors. The tables below show the reason codes that map to different types of errors. If maintenance was triggered by a condition that is not mapped to one of these reason codes, Compute assigns OTHER as the reason code.

GPU errors

Reason code | Description
HW_GPU_PCI_FALLEN_OFF_BUS | A GPU or NVSwitch has fallen off the PCI bus, typically due to critical thermal or power issues. The affected node is taken out of service for hardware inspection.
HW_GPU_PCI_CONFIG_ERROR | Unexpected GPU PCI configuration detected, or critical PCI errors observed between the GPU, deltaboard and motherboard. Requires physical hardware maintenance.
HW_GPU_NVLINK_DOWN | An NVLink connection is down on a Blackwell or newer GPU. Requires a GPU reset or VM restart to recover.
HW_GPU_XID_62 | The GPU internal micro-controller has halted (XID 62). Requires a GPU reset or VM restart.
HW_GPU_XID_109 | GPU context switch timeout (XID 109). Typically not fatal to running workloads, but may require a GPU reset or VM restart.
HW_GPU_XID_119 | GSP RPC timeout (XID 119). Requires a GPU reset or VM restart.
HW_GPU_FW_VERSION_UNAVAILABLE | DCGM could not report the GPU firmware version. This is usually a symptom of other underlying hardware errors.
HW_GPU_DRIVER_INIT_FAILED | The NVIDIA® driver failed to initialize one or more GPUs. Typically caused by other hardware errors.

InfiniBand™ errors

Reason code | Description
HW_IB_LINK_DOWN | The InfiniBand link has been in a physically down state for more than 3 minutes.
HW_IB_PCI_FALLEN_OFF_BUS | The InfiniBand adapter has fallen off the PCI bus, typically due to critical thermal or power issues.
HW_IB_PCI_CONFIG_ERROR | Unexpected InfiniBand PCI configuration detected, typically due to critical PCI errors.

Node-level errors

Reason code | Description
HW_NODE_OFFLINE | The node hosting the VM went offline. The cause may vary. Affected VMs are force-migrated and will experience an unexpected reboot.
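The tables above group reason codes by prefix (HW_GPU_, HW_IB_, HW_NODE_), which makes a simple client-side triage helper possible. A minimal sketch; the function name and return labels are illustrative, not part of any Nebius API:

```python
def classify_reason_code(code: str) -> str:
    """Map a maintenance reason code to its error family,
    following the prefix convention used in the tables above."""
    if code.startswith("HW_GPU_"):
        return "GPU error"
    if code.startswith("HW_IB_"):
        return "InfiniBand error"
    if code.startswith("HW_NODE_"):
        return "node-level error"
    # Compute assigns OTHER to conditions not mapped to a known code.
    return "other"

print(classify_reason_code("HW_GPU_XID_62"))    # GPU error
print(classify_reason_code("HW_IB_LINK_DOWN"))  # InfiniBand error
print(classify_reason_code("OTHER"))            # other
```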
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.