Health management in Soperator

Soperator has built-in health checks that continuously monitor the fleet of worker nodes in your cluster, and an auto-healing system that isolates and replaces broken nodes. When deployed in Nebius AI Cloud (Managed Service for Soperator or Pro Solution for Soperator), Soperator also relies on maintenance events from the Compute service and Kubernetes® to auto-heal worker nodes. You can also use custom health checks to run additional checks on worker nodes.
Built-in health checks
For all deployment types, Soperator runs built-in health checks on a schedule for each worker node with GPUs. Most of these checks are considered critical: if a worker node fails a critical check, Soperator marks it as requiring further action. Critical checks include, but are not limited to, the following:
- GPU checks:
  - AllReduce, an NCCL test (with and without InfiniBand, outside and inside Docker containers)
  - CUDA samples, such as vectorAdd, simpleMultiGPU, deviceQuery, and p2pBandwidthLatencyTest
  - NVIDIA Data Center GPU Manager (DCGM) diagnostics
  - GPU stress test
- RAM checks: bandwidth and latency
For details about how the built-in checks are implemented, see the following resources:
- Active Checks – Health and system checks framework: description of the checks' architecture and implementation
- soperator-activechecks Helm chart:
  - values.yaml: list of checks
  - scripts/: scripts for each check
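
If you want to reproduce the gist of these checks manually on a specific node, a rough sketch from a login node might look like the following. The node name worker-0, the GPU count, and the test parameters are assumptions; the exact commands that Soperator runs are defined in the scripts/ directory of the soperator-activechecks chart.

```bash
# Run DCGM diagnostics on one worker node (level 3 is an extended run).
srun --nodes=1 --nodelist=worker-0 dcgmi diag -r 3

# Run the NCCL AllReduce test across all 8 GPUs of the same node.
srun --nodes=1 --nodelist=worker-0 --gpus-per-node=8 \
    all_reduce_perf -b 512M -e 8G -f 2 -g 8
```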
Node isolation
When a critical check fails, Soperator performs the extensive check procedure:
1. Drains the node, waiting for running Slurm jobs to finish. The drain reason has the [node_problem] prefix.
2. Moves the node into the suspicious reservation, preventing new jobs from being scheduled on it.
3. Runs extensive checks on the node, which include hardware-level tests and re-runs of most critical checks.
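
You can observe this state from a login node with standard Slurm commands; a minimal sketch (the exact output depends on your Slurm version):

```bash
# List drained nodes and their drain reasons; nodes isolated after a failed
# critical check have reasons prefixed with [node_problem].
sinfo -R

# Show reservations; isolated nodes appear in the suspicious reservation,
# which keeps new jobs from being scheduled on them.
scontrol show reservation
```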
Node replacement
If the extensive checks fail, Soperator drains the node. The drain reason now has the [hardware_problem] prefix. Soperator marks all worker nodes with this prefix as unhealthy Kubernetes nodes, which triggers automatic re-creation of the node.
If the node passes the extensive checks, Soperator removes it from the suspicious reservation, and jobs can run on the node again.
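
To follow a node through replacement, a rough sketch: check for [hardware_problem] drain reasons in Slurm and, if your deployment type gives you access to the underlying Kubernetes cluster, watch the corresponding worker node being re-created.

```bash
# From a login node: find worker nodes drained because of a hardware problem.
sinfo -R | grep hardware_problem

# If you have kubectl access to the underlying Kubernetes cluster,
# watch nodes being marked unhealthy and re-created.
kubectl get nodes --watch
```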
In Managed Service for Soperator and Pro Solution for Soperator, Compute may schedule maintenance for the underlying virtual machine (VM) of the worker node during the extensive check procedure. This typically indicates a hardware issue that Compute has already detected. In this case, Soperator immediately stops the checks, then drains and recreates the node.
Custom health checks (Slurm prolog and epilog programs)
All Soperator deployment types support Slurm prolog and epilog programs for job steps. You can configure them by using the --task-prolog and --task-epilog parameters of srun, either in batch scripts or in direct srun calls. The prolog and epilog programs specified in --task-prolog and --task-epilog run on each worker node before and after the job step launched by the srun call. You can use them to run custom health checks on worker nodes.
For more details about prolog and epilog programs, see the Slurm documentation.

By default, Soperator doesn't auto-heal worker nodes that fail custom health checks. To set up custom auto-healing in your Managed Service for Soperator or Pro Solution for Soperator clusters, contact support or your personal manager.

For example, you can run nvidia-smi before and after the training step in your batch script (my_ml_job.sh) to check the GPU utilization and health.
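
The following is a minimal sketch of such a batch script. The helper script path /shared/checks/gpu_check.sh, the log directory, the resource requests, and the training command python train.py are assumptions; adapt them to your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=my_ml_job
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8

# /shared/checks/gpu_check.sh is a hypothetical helper script that must be
# available on every worker node (for example, on a shared filesystem).
# It could look like this:
#
#   #!/bin/bash
#   # Log GPU utilization and health for the current node.
#   nvidia-smi >> "/shared/logs/gpu_check_${SLURM_JOB_ID}_$(hostname).log" 2>&1

# Run the helper before (--task-prolog) and after (--task-epilog) the training
# job step on every worker node that the step uses.
srun --task-prolog=/shared/checks/gpu_check.sh \
     --task-epilog=/shared/checks/gpu_check.sh \
     python train.py
```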