Soperator deploys Slurm to Kubernetes clusters. In a Soperator cluster, Slurm nodes, storage and other components are Kubernetes resources: Pods, PersistentVolumes, etc. The diagram below outlines the architecture of a Soperator cluster:
[Diagram: Soperator cluster architecture]

Cluster specification and Slurm configuration

When Soperator is installed in a Kubernetes cluster, it adds the SlurmCluster custom resource to that cluster. This resource contains the specification of the Slurm cluster deployed in the Kubernetes cluster. The Slurm operator itself is a Pod that uses the SlurmCluster specification to create and reconcile the Kubernetes resources that make up the Slurm cluster, such as login, worker and controller nodes, and storage resources. The configuration files of Slurm itself (slurm.conf, gres.conf, cgroup.conf, plugstack.conf, etc.) are Kubernetes ConfigMaps controlled by the Slurm operator.
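As a rough illustration, a SlurmCluster resource declares how many nodes of each type the operator should create. The API group and field names below are placeholders, not the actual CRD schema:

```yaml
# Illustrative sketch of a SlurmCluster resource.
# apiVersion, kind fields and spec layout are placeholders and
# may differ from the real CRD installed by Soperator.
apiVersion: slurm.example.com/v1
kind: SlurmCluster
metadata:
  name: my-slurm-cluster
spec:
  slurmNodes:
    controller:
      size: 2    # Pods running slurmctld
    worker:
      size: 8    # Pods running slurmd
    login:
      size: 2    # Pods running sshd for user access
```

The operator watches resources of this kind and reconciles the cluster toward the declared state, recreating Pods, ConfigMaps and volumes as needed.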

Nodes

In Soperator clusters, all Slurm nodes are Kubernetes Pods. The main types of Slurm nodes are described below. For simplicity, some node types are not shown in the diagram in this article: for example, DBD (database daemon) nodes used for accounting, nodes that export metrics, and nodes managed by other Kubernetes operators for backups and auto-healing.

Login nodes

To work with a Slurm cluster (submit jobs, check their status, write sbatch scripts and prepare data for them, etc.), users connect to its login nodes. The sshd daemon runs on every login node. Soperator balances load between login nodes: each time a user connects to the cluster via SSH, they are directed to a random login node.

Worker nodes

Worker nodes, also known as compute nodes, perform computations for Slurm jobs. The slurmd daemon runs on every worker node. It monitors, launches and terminates jobs. For more information on how to work with login and worker nodes, see Connecting to login and worker nodes.

Controller nodes

Controller nodes orchestrate Slurm activities, such as job queuing, monitoring node states and allocating resources. The central management daemon, slurmctld, runs on all controller nodes.
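In Slurm terms, running slurmctld on several controller nodes corresponds to listing multiple SlurmctldHost entries in slurm.conf: the first entry is the primary controller and the rest are fallbacks. The host names below are illustrative; in Soperator this file is generated by the operator and stored in a ConfigMap:

```
# slurm.conf excerpt (illustrative host names)
SlurmctldHost=controller-0   # primary controller
SlurmctldHost=controller-1   # takes over if the primary fails
```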

Persistent storage

Soperator’s main storage feature is its shared root filesystem. It is mounted to all login and worker nodes in a special way — you see it as the root directory (/) in your SSH sessions and Slurm jobs. This helps maintain the traditional Slurm user experience where you work with the entire root filesystem on each node. The filesystem is shared, which means you do not need to keep it identical across nodes manually. When you make changes to the filesystem on one node, these changes automatically show up on other nodes. The shared root filesystem is implemented as a Kubernetes PersistentVolume (PV) that ensures data is preserved when nodes restart. Soperator also uses PVs for system needs, like storing cluster and controller states, etc.
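A volume shared by many Pods relies on the ReadWriteMany access mode, which lets all login and worker nodes mount the same filesystem read-write at once. A minimal sketch of such a claim is below; the name, size and storage class are placeholders, not Soperator's actual resource definitions:

```yaml
# Illustrative PersistentVolumeClaim for a shared filesystem.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-root-fs       # placeholder name
spec:
  accessModes:
    - ReadWriteMany          # mountable read-write by many Pods at once
  resources:
    requests:
      storage: 1Ti           # placeholder size
  storageClassName: shared-fs  # placeholder: any RWX-capable class
```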

See also

For more information about the Soperator cluster architecture, see: