Cluster status
To list current worker nodes, run the sinfo command. For more information about sinfo, see Slurm documentation.
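As a minimal sketch of listing nodes, assuming the Slurm client tools are on your PATH (the guard only keeps the snippet harmless on machines without Slurm):

```shell
if command -v sinfo >/dev/null 2>&1; then
  sinfo         # partition-oriented summary of node states
  sinfo -N -l   # one line per node, in long format
else
  echo "sinfo not found: run this on a node with Slurm client tools" >&2
fi
```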
Output example:
If a node state is shown as idle*, this means that the node did not respond and is unavailable. A node that does not respond quickly enough goes down. For more information about common node states, see Node states.
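To look for non-responding nodes directly, sinfo can restrict its output to them, and it can also list the reasons recorded for unavailable nodes. A minimal sketch, assuming the Slurm client tools are on your PATH:

```shell
if command -v sinfo >/dev/null 2>&1; then
  sinfo --dead   # only nodes that are not responding (shown with a trailing *)
  sinfo -R       # reasons recorded for nodes in the down, drained or fail states
fi
```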
You can customize the columns in the sinfo output by using the -o option. For example, sinfo -o "%20P %5D %14F %8z %10m %10d %11l %16f %N" lists the partitions and, for each one, shows the total number of nodes, the node counts by state (allocated/idle/other/total), the memory per node and the maximum time limit for jobs. For more information about these options, see Slurm documentation.
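The format string above can be read field by field. The sketch below stores it in a variable (SINFO_FORMAT is a name chosen here for illustration) and annotates each specifier as described in the sinfo manual:

```shell
# Each %<width><letter> field is one column:
#   %P partition            %D number of nodes
#   %F nodes by state, as allocated/idle/other/total
#   %z sockets:cores:threads per node
#   %m memory per node (MB) %d temporary disk per node (MB)
#   %l maximum time limit for a job
#   %f node features        %N node names
SINFO_FORMAT="%20P %5D %14F %8z %10m %10d %11l %16f %N"

if command -v sinfo >/dev/null 2>&1; then
  sinfo -o "$SINFO_FORMAT"
fi
```

The number between % and the letter sets the column width, so you can widen a column if its values are truncated.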
To get more information about a particular node, run the following command:
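In Slurm, per-node details are typically shown with scontrol show node. A minimal sketch, where worker-0 is a hypothetical node name that you would replace with one from the sinfo output:

```shell
NODE_NAME="worker-0"   # hypothetical; use a name from the NODELIST column of sinfo

if command -v scontrol >/dev/null 2>&1; then
  scontrol show node "$NODE_NAME"   # state, CPUs, memory, features, reason, and more
fi
```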
Node states
Some of the common node states include:

| Node state | Description | State causes |
|---|---|---|
| idle | The node is not currently running any jobs and is available for scheduling. | No jobs assigned yet. |
| allocated | The node is actively running one or more jobs. | A job has been assigned and is executing. |
| mixed | Some CPUs on the node are allocated to jobs, while others remain idle. | Partial job allocations. |
| down | The node is unavailable for use. | Hardware failure, maintenance or manual marking by an admin. |
| drained | The node is excluded from the scheduling pool for new jobs. Jobs submitted before the node was drained may run to completion. | Marked for draining manually or by Soperator health checks. |
| unknown | The node state cannot be determined. | Communication issues between the controller and the node. |
| fail | The node has failed and cannot execute jobs. | Critical hardware or software issues. |
How to drain and resume a node
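The sketch below wraps the standard scontrol update calls for both operations in two helper functions. The function names, the node name and the reason text are hypothetical; changing node states usually requires administrator privileges.

```shell
# usage: drain_node <node_name> <reason>
drain_node() {
  # Slurm expects a reason when a node is set to DRAIN
  scontrol update NodeName="$1" State=DRAIN Reason="$2"
}

# usage: resume_node <node_name>
resume_node() {
  scontrol update NodeName="$1" State=RESUME
}
```

For example, drain_node worker-0 "planned maintenance" takes the node out of scheduling, and resume_node worker-0 returns it.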
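To drain a node (that is, stop scheduling more jobs and make the node unavailable), set its state to DRAIN with scontrol update and give a reason. To return the node to the scheduling pool, set its state to RESUME.
Job queue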
To list all jobs currently running or pending (that is, waiting for resources), run the squeue command. If a job has the PD status, the NODELIST(REASON) column shows the reason why the job is pending. For more information about possible reasons for this status, see Slurm documentation.
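A few common ways to query the queue, as a minimal sketch that assumes the Slurm client tools are on your PATH:

```shell
if command -v squeue >/dev/null 2>&1; then
  squeue                  # all running and pending jobs
  squeue -u "$(id -un)"   # only jobs of the current user
  squeue -t PENDING       # only pending jobs; NODELIST(REASON) shows why they wait
fi
```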
Job details
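In Slurm, the details of a queued or running job are typically shown with scontrol show job. A minimal sketch, where 12345 is a hypothetical job ID taken from the JOBID column of the job queue:

```shell
JOB_ID=12345   # hypothetical; replace with a real job ID

if command -v scontrol >/dev/null 2>&1; then
  scontrol show job "$JOB_ID"   # state, resources, time limits, working directory, and more
fi
```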
To get more details about a job, pass its job ID to the scontrol show job command.
Completed job statistics
To get the details about the jobs already completed, run the sacct command. For more information about sacct, see Slurm documentation.
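As a minimal sketch, assuming that job accounting is enabled on the cluster and using a hypothetical job ID, you can narrow the sacct output to one job or to jobs started today:

```shell
JOB_ID=12345   # hypothetical; replace with a real job ID

if command -v sacct >/dev/null 2>&1; then
  # Selected fields for a single job
  sacct -j "$JOB_ID" --format=JobID,JobName,Partition,State,Elapsed,ExitCode
  # Your jobs since the start of the current day
  sacct --starttime "$(date +%Y-%m-%d)"
fi
```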