Moving capacity between training and inference workloads

If you share a capacity block group between a Soperator cluster and an inference node group, you can move GPU capacity between them without stopping the entire Soperator cluster. The steps below are the same whether training and inference run in separate Managed Service for Kubernetes clusters or in different node groups within one cluster. Use Slurm power management commands to release or add worker nodes in Soperator, and change the inference node group size to consume or free capacity. For information on how ephemeral nodes work, see Ephemeral nodes in Soperator.

Prerequisites

Reserve a capacity block group that your Soperator worker nodes and inference node group share.
Make sure your Soperator cluster runs version 3.0 or later and has ephemeral nodes enabled. Nebius enables ephemeral nodes when the cluster is provisioned, so when you request the cluster, ask your Nebius manager or technical support to enable them on the relevant worker node sets. If the scontrol power commands do not work, ask support to upgrade the cluster or enable ephemeral nodes.
Set up an inference node group that uses the same capacity block:
- Separate clusters: Create a Managed Service for Kubernetes cluster for inference workloads and add a node group to it.
- Same cluster: Add a node group for inference workloads to the cluster that already runs Soperator.
Make sure you are in a group that has at least the editor role within your tenant or project; for example, the default editors group. You can check this in the Administration → IAM section of the web console.
Generate an SSH key pair and set up access to a login node in the Soperator cluster.

How to move capacity from training to inference

When training nodes are idle but inference needs more GPUs, release nodes from Soperator and add them to the inference node group.

Connect to a login node in the Soperator cluster.
List worker nodes and their states:
```
sinfo -Nel
```
Choose nodes that are idle or that you are ready to drain. See node states for details.
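One way to shortlist candidates is to filter the sinfo output for nodes whose compact state is exactly idle. The sketch below assumes that the %t column shows plain idle for powered-on idle nodes and appends a suffix (such as ~) for other conditions. list_idle_nodes is a hypothetical helper name, not part of Slurm or Soperator.

```shell
# Hypothetical helper: print the names of nodes whose compact state is exactly "idle".
# -h drops the header, -N prints one line per node, %N is the node name, %t the state.
list_idle_nodes() {
  sinfo -h -N -o "%N %t" |
    awk '$2 == "idle" { print $1 }' |
    sort -u   # a node in several partitions is listed once per partition
}
```

Run list_idle_nodes on a login node and pick the nodes to release from its output.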
Release the chosen nodes from the Soperator cluster:
- To deprovision nodes when they are idle, use plain power down. Without the asap parameter, power down has lower priority than starting new jobs from the queue, so Slurm may run queued jobs on the node before powering it down:
  scontrol power down <node_list> Reason="move capacity to inference"
- To drain nodes so that no new jobs are scheduled on them, wait for any running jobs to finish, and then deprovision the nodes:
  scontrol power down asap <node_list> Reason="move capacity to inference"
- To power down nodes immediately and cancel running jobs:
  scontrol power down force <node_list> Reason="move capacity to inference"
Replace <node_list> in the commands above with a Slurm hostlist of worker node names, for example worker-10,worker-11, worker-[10,11], or worker-[0-3,5-8,13],worker-cpu-18.
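Because the inference node group is later resized by the number of released nodes, it can help to count how many nodes a hostlist covers. The sketch below relies on scontrol show hostnames, which expands a hostlist into one name per line. count_nodes is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: print how many nodes a Slurm hostlist expands to.
# Quote the hostlist so the shell does not treat the brackets as a glob pattern.
count_nodes() {
  scontrol show hostnames "$1" | awk 'END { print NR }'
}
```

For example, count_nodes 'worker-[10,11]' should print 2 on a login node.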
To prevent the nodes from powering back up automatically when queued jobs target them, drain them:
```
scontrol update NodeName=<node_list> State=drain Reason="prevent power up"
```
Use the same Slurm hostlist as in the power down command. For details, see Automatic node provisioning.
Wait until the nodes are powered down. Confirm their state with the following command:
```
sinfo -N -o "%N %t %E"
```
Powered-down ephemeral nodes remain in the node list in a powered-down cloud state, which sinfo typically marks with a ~ suffix on the state (for example, idle~). They no longer run worker Pods.
In the Managed Service for Kubernetes cluster that hosts your inference workloads, increase the inference node group size by the same number of nodes you released from Soperator. Use a node group that draws GPUs from the same capacity block group.

The released GPUs are now available to inference workloads.
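The wait for power-down in the procedure above can be scripted instead of checked by hand. The following is a minimal sketch, assuming that sinfo reports powered-down nodes with a state that ends in ~ (for example, idle~ or drain~); confirm the exact state strings on your cluster first. all_powered_down and wait_for_power_down are hypothetical helper names, and the 30-second polling interval is an arbitrary default.

```shell
# Hypothetical helper: succeed only if every listed node has a state ending in "~".
# Fails if sinfo returns no nodes, so an empty or mistyped hostlist is not reported as done.
all_powered_down() {
  sinfo -h -N -n "$1" -o "%N %t" |
    awk 'NF >= 2 { seen = 1; if ($2 !~ /~$/) pending = 1 }
         END { exit ((seen && !pending) ? 0 : 1) }'
}

# Hypothetical helper: poll until all nodes in the hostlist ($1) are powered down.
# $2 is the polling interval in seconds (defaults to 30).
wait_for_power_down() {
  until all_powered_down "$1"; do
    sleep "${2:-30}"
  done
}
```

Call wait_for_power_down 'worker-[10,11]' after the power down command, and resize the inference node group once it returns.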

How to move capacity from inference to training

When inference traffic drops and you want to run training jobs on idle GPUs, scale down the inference node group and power worker nodes back on in Soperator.

In the Managed Service for Kubernetes cluster that hosts your inference workloads, reduce the inference node group size by the number of nodes you want to move to training. Wait until the nodes are removed and the GPUs are released to the capacity block.
Connect to a login node in the Soperator cluster.
If you drained the nodes when you released them, resume them:
```
scontrol update NodeName=<node_list> State=resume
```
Power on worker nodes in Soperator:
```
scontrol power up <node_list>
```
Replace <node_list> in the command above with a Slurm hostlist of the worker nodes that you want to bring back, for example worker-10,worker-11, worker-[10,11], or worker-[0-3,5-8,13],worker-cpu-18. Soperator creates worker Pods for the requested nodes if the capacity block has enough free GPUs. Alternatively, submit a job with srun or sbatch that needs those nodes; Slurm may then power them on automatically. For details, see Automatic node provisioning.
Confirm that the nodes are available for scheduling:
```
sinfo -N -o "%N %t %E"
```
While nodes are powering up, sinfo typically marks their state with a # suffix (for example, idle#). Once they are ready for new jobs, they should report the idle state.
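To see which nodes are still on their way back, you can filter the same sinfo output for anything that is not yet schedulable. This sketch assumes that ready nodes show one of the compact states idle, mix, or alloc (a node that picks up a queued job immediately is no longer idle). nodes_not_ready is a hypothetical helper name.

```shell
# Hypothetical helper: print "<node> <state>" for listed nodes that are not yet
# in a schedulable state (idle, mix, or alloc without any suffix).
nodes_not_ready() {
  sinfo -h -N -n "$1" -o "%N %t" |
    awk '$2 !~ /^(idle|mix|alloc)$/ { print $1, $2 }' |
    sort -u
}
```

An empty result from nodes_not_ready 'worker-[10,11]' suggests that all listed nodes are back in service.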
