You can monitor the performance of your Soperator cluster on preconfigured dashboards in Grafana®.

Prerequisites

  1. Connect to your cluster. You should see the SSH welcome message. For example:
    Welcome to Soperator cluster
    
    ...
    
    System information as of Thu May  8 10:43:02 UTC 2025:
    ...
    
    Slurm nodes:
      PARTITION   CPUS   MEMORY    GRES                                 NODES   NODELIST                  STATE   REASON
      main        128    1553408   gpu:nvidia_h100_80gb_hbm3:8(S:0-1)   2       worker-[0-1]              idle    none
    
    No user jobs in the queue
    
    No other users are currently logged in
    
    To open monitoring dashboards in your browser:
      1. Execute this command on your local computer:
         `ssh -L 3000:metrics-grafana.monitoring-system.svc:80 -N <USER>@<LOGIN_IP>`
      2. Open `localhost:3000` in your browser
    ...
    
  2. In the SSH welcome message, find the command that opens monitoring dashboards. In the example above, the command is ssh -L 3000:metrics-grafana.monitoring-system.svc:80 -N <USER>@<LOGIN_IP>. The address in the command for your cluster might be different.

How to view metrics in Grafana

  1. On your local machine, run the command to open monitoring dashboards that you got from the SSH welcome message. For example:
    ssh -L 3000:metrics-grafana.monitoring-system.svc:80 -N <username>@<public_IP_address>
    
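    In this command, replace <username> and <public_IP_address> with the username and public IP address that you use to connect to the cluster. If port 3000 is already in use on your local machine, replace it with another free port. Because of the -N option, the command runs no remote command and keeps running while the tunnel is open, so leave it running while you use the dashboards.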
  2. Open localhost:3000 (or localhost:<port>) in your browser.
  3. In the sidebar, select Dashboards. Review the metrics on these dashboards. For example, you can see the metrics of Slurm jobs and resource allocations.
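
If you often change the local port, a small helper can build the port-forwarding command for you. The following is a minimal sketch, not part of Soperator: the function name grafana_tunnel_cmd is hypothetical, and the remote address is taken from the example welcome message, so replace it if your cluster shows a different one.

```shell
# Hypothetical helper: print the SSH port-forwarding command for Grafana.
# Usage: grafana_tunnel_cmd <username> <public_IP_address> [local_port]
# Assumption: the remote address matches the example welcome message.
grafana_tunnel_cmd() (
  user="$1"
  login_ip="$2"
  local_port="${3:-3000}"   # falls back to 3000 when no port is given
  remote="metrics-grafana.monitoring-system.svc:80"
  printf 'ssh -L %s:%s -N %s@%s\n' "$local_port" "$remote" "$user" "$login_ip"
)
```

For example, grafana_tunnel_cmd <username> <public_IP_address> 8080 prints the command with local port 8080. After you run the printed command, open localhost:8080 in your browser.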

How to view metrics for worker nodes

The nodes of your Soperator cluster are Compute virtual machines. You can view their metrics on Monitoring dashboards in the web console. To find the monitoring link and the ID of the virtual machine that runs a worker node:
  1. Connect to a login node of your Soperator cluster.
  2. Run the following command:
    scontrol show node worker-<number>
    
    Output example:
    NodeName=worker-0 Arch=x86_64 CoresPerSocket=32
       CPUAlloc=0 CPUEfctv=128 CPUTot=128 CPULoad=0.97
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=gpu:nvidia_h100_80gb_hbm3:8(S:0-1)
       NodeAddr=10.0.35.138 NodeHostName=worker-0 Version=24.05.5
       OS=Linux 5.15.0-133-generic #144-Ubuntu SMP Fri Feb 7 20:47:38 UTC 2025
       RealMemory=1553408 AllocMem=0 FreeMem=1421003 Sockets=2 Boards=1
       State=IDLE+DYNAMIC_NORM ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=main
       BootTime=2025-03-11T11:28:45 SlurmdStartTime=2025-03-11T12:39:23
       LastBusyTime=2025-05-08T13:42:21 ResumeAfterTime=None
       CfgTRES=cpu=128,mem=1517G,billing=128
       AllocTRES=
       CurrentWatts=0 AveWatts=0
    
       Extra={ "monitoring": "https://console.eu.nebius.com/project-e00x6706bdmd42yjyn/compute/instances/computeinstance-****/monitoring" }
       InstanceId=computeinstance-****
    
    Copy the link from the monitoring parameter in the Extra field. The InstanceId field contains the ID of the virtual machine.
  3. Open the link in your browser. There, you can view the dashboards for the virtual machine that runs the worker node.
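
To avoid copying the link by hand, you can filter the scontrol output. The following is a minimal sketch that assumes the Extra and InstanceId fields are formatted as in the output example above. The function names monitoring_url and instance_id are hypothetical.

```shell
# Hypothetical helpers: read `scontrol show node` output on stdin.
# Assumption: the fields look like the output example in this section.

# Print the value of the "monitoring" key from the Extra field.
monitoring_url() {
  sed -n 's/.*"monitoring": *"\([^"]*\)".*/\1/p'
}

# Print the value of the InstanceId field.
instance_id() {
  sed -n 's/^ *InstanceId=\([^ ]*\).*/\1/p'
}
```

On a login node, you could run scontrol show node worker-<number> | monitoring_url to print only the link, or pipe the same output to instance_id to print only the ID of the virtual machine.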

The Grafana Labs Marks are trademarks of Grafana Labs, and are used with Grafana Labs’ permission. We are not affiliated with, endorsed or sponsored by Grafana Labs or its affiliates.