Prerequisites
- If you need worker nodes with GPUs, make sure that you have capacity block groups that reserve GPUs.
- Make sure you are in a group that has at least the editor role within your tenant; for example, the default editors group. You can check this in the Administration → IAM section of the web console.
- Generate at least one SSH key pair to connect to Slurm login nodes as the default root user.

How to generate an SSH key pair
If you do not have an SSH key pair, generate it on your local machine:
- In the terminal, go to the ~/.ssh directory.
- Create an SSH key pair:
  The -C "<comment>" flag is optional, but it helps distinguish the key from others.
- At the prompt that appears, enter the following information:
  - Name of the file where the key should be stored.
  - Passphrase for the key. Press Enter if you do not want to use a passphrase.
- Get the contents of the generated public key:
  Use the file name that you specified during the key pair creation.
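The steps above can be sketched as follows. The key file name (id_ed25519_slurm) and the comment are examples; substitute your own values:

```shell
# Go to the SSH directory (create it if it does not exist).
mkdir -p ~/.ssh && cd ~/.ssh

# Create an ed25519 key pair. -N "" skips the passphrase;
# omit it if you want ssh-keygen to prompt for one.
ssh-keygen -t ed25519 -C "slurm-login" -f id_ed25519_slurm -N ""

# Print the public key; copy this line into the cluster configuration.
cat id_ed25519_slurm.pub
```

The printed line starts with ssh-ed25519 and is safe to share; the private key (the file without the .pub suffix) must stay on your machine.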
How to create a cluster
- Web console
- In the sidebar, go to Compute → Soperator.
- Click Create cluster.
- In the Overview section, configure the cluster's general parameters:
  - Enter the cluster name.
  - Add one or more SSH public keys (ssh-ed25519 AAA***) to access the login node.
- Configure node sets. A cluster must have a login node set and at least one worker node set.
  - For the login node set, specify the number of nodes.
- For each worker node set, specify the following:
  - Name.
  - Whether the nodes should use GPUs.
  - Platform and preset.
  - Reservation ID from your capacity block group.
  - Number of nodes.
  To create a copy of an existing node set, click → Clone node set.
- Add visible and, optionally, hidden partitions. You can use the default partitions or define your own.
  Partitions group nodes into logical (and possibly overlapping) sets and define how workloads are scheduled on those node sets. See details in the Slurm quickstart on partitions.
  In Managed Soperator, hidden partitions are not listed by Slurm CLI tools as available (per the Hidden parameter in slurm.conf), but you can create and manage them in the web console and other Nebius AI Cloud interfaces. For each partition, specify:
  - PartitionName: A unique partition name that you will use when submitting jobs.
  - Nodes: The worker node sets that the partition can schedule jobs on. A partition can include one or more node sets, and a node set can belong to more than one partition.
  - PriorityTier: Determines how Soperator prioritizes partitions when resources are limited. A higher partition priority means that more jobs from this partition are favored for the same resources.
  - DefaultTime: The default time limit for jobs submitted to the partition, in HH:MM:SS format. Jobs inherit this limit unless a different time limit is set in their submission settings.
  - DefMemPerNode: The default amount of memory available to each node for jobs scheduled in the partition. Must not exceed the node capacity.
  - PreemptMode: Controls what happens to currently running jobs when higher-priority jobs require resources.
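As a sketch, the parameters above correspond to a single partition definition line in slurm.conf. The partition name, node set names, and values below are hypothetical examples, not defaults:

```
PartitionName=batch Nodes=workers-1,workers-2 PriorityTier=10 DefaultTime=01:00:00 DefMemPerNode=16000 PreemptMode=REQUEUE Hidden=NO
```

A job submitted with sbatch -p batch would then inherit the one-hour time limit and the per-node memory default unless its submission settings override them.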
- Add volumes. A cluster can include cluster, shared, local, and memory volumes.
  - Cluster volumes are created per cluster and are available to all node sets.
  - Shared volumes are created per project and are available to all node sets.
  - Local volumes are created per node and store temporary or runtime data.
  - Memory volumes store data in RAM. You cannot change them.
- Review the configuration on the Review page and click Create cluster.
What’s next
- Connect to the cluster.
- To save costs when you are not using the cluster, stop and start it.
How to delete a cluster
- In the sidebar, go to Managed Soperator.
- In the list of clusters, find the one that you want to delete.
- Next to the cluster, click → Delete.
- Enter the cluster name to confirm and click Delete cluster.
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.