To work with metrics in Prometheus, connect Prometheus to Observability Metrics and query the data by using PromQL.

Prerequisites

  1. Install and configure Nebius AI Cloud CLI.
  2. If you don’t have a service account for observability services, create one.
  3. Make sure that the service account is in a group that has at least the viewer role within your tenant; for example, the default viewers group. You can check this in the Administration → IAM section of the web console. If the service account is not in the required group, click the ⋮ (vertical ellipsis) button → Add to group, and select viewers.
  4. Issue a static key for the service account using the following command:
    nebius iam static-key issue \
      --name <name_for_the_key> \
      --account-service-account-id <service_account_ID> \
      --service=OBSERVABILITY
    
    Copy the value of the static key from the token parameter of the response. You will need it in later steps.
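
One way to keep the copied key out of your shell history and working directory listings is to store it in a file that only your user can read. The following is a minimal sketch, not part of the Nebius AI Cloud CLI: the function name save_static_key and the file path in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: store a static key in a file that only the current user can read.
# Usage: save_static_key <key_value> <path>
save_static_key() {
  local key="$1" path="$2"
  # Write the secret under a restrictive umask so a new file is never world-readable.
  ( umask 077 && printf '%s' "$key" > "$path" )
  # Tighten permissions explicitly in case the file already existed.
  chmod 600 "$path"
}
```

For example, save_static_key '<static_key_for_service_account>' "$HOME/.nebius-observability.key". Prometheus scrape configurations also accept file-based credentials (bearer_token_file, or credentials_file under authorization in newer versions), so a file like this could be referenced instead of pasting the key inline.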

How to connect Prometheus

Prometheus can only show a limited amount of monitoring data. If you have a large infrastructure, consider connecting a data source in Grafana® instead.
  1. Download the latest release of Prometheus for your platform.
  2. Extract the contents and switch to the folder with Prometheus:
    tar xvfz prometheus-***.tar.gz
    cd prometheus-***
    
  3. Create the prometheus.yml configuration file that tells Prometheus how to retrieve the metrics. The exact configuration depends on your Prometheus version; the following example passes the static key in the bearer_token field:
    scrape_configs:
      - job_name: 'Export time series from Nebius Observability'
        honor_labels: true
        scrape_interval: 15s
        scheme: https
        metrics_path: '/projects/<project_ID>/service-provider/prometheus/federate'
        params:
          match[]:
            - '{__name__=~".+"}'
        bearer_token: '<static_key_for_service_account>'
        static_configs:
          - targets:
            - 'read.monitoring.api.nebius.cloud'
    
    In this file, change the following parameters:
    • bearer_token: Enter the static key that you got earlier.
    • metrics_path: Specify your project ID in the URL. Optionally, add a service in the path in the following format:
      metrics_path: '/projects/<project_ID>/buckets/<service>/prometheus/federate'
      
      The following services are available:
      • compute: metrics related to Compute virtual machines.
      • gpu: GPU-related metrics.
      • nbs: metrics related to Compute volumes.
      • sp_storage: metrics related to Object Storage.
      • msp: metrics related to Managed Service for PostgreSQL® and Managed Service for MLflow.
    • match[]: Optionally, specify which data Prometheus collects by filtering on labels or metric names. For example, to collect only metrics whose names start with disk, set the following value:
      match[]:
        - '{__name__=~"^disk.*"}'
      
    • scrape_interval: You can change the interval, but we recommend a value of at least 15 seconds.
  4. Start Prometheus:
    ./prometheus --config.file=prometheus.yml
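
If Prometheus starts but shows no data, it can help to check the configuration file and the endpoint separately. The promtool binary from the same release archive validates the file with ./promtool check config prometheus.yml. The sketch below then requests the federate endpoint directly with curl, sending the static key as a bearer token in the same way the bearer_token setting does. The function names and the NEBIUS_STATIC_KEY variable are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Build the federate URL for a project. The optional second argument is a
# service name such as compute or gpu, as listed in the metrics_path description.
federate_url() {
  local project_id="$1" service="${2:-}"
  local host='read.monitoring.api.nebius.cloud'
  if [ -n "$service" ]; then
    printf 'https://%s/projects/%s/buckets/%s/prometheus/federate\n' "$host" "$project_id" "$service"
  else
    printf 'https://%s/projects/%s/service-provider/prometheus/federate\n' "$host" "$project_id"
  fi
}

# Request the data that Prometheus would scrape, limited to one match[] selector.
# Expects the static key in NEBIUS_STATIC_KEY (a variable name chosen for this sketch).
fetch_federate() {
  local url="$1" selector="$2"
  curl --silent --show-error --fail --get \
    --header "Authorization: Bearer ${NEBIUS_STATIC_KEY}" \
    --data-urlencode "match[]=${selector}" \
    "$url"
}
```

For example, fetch_federate "$(federate_url '<project_ID>' gpu)" '{__name__=~".+"}' | head prints the first lines of the response. An error from curl at this stage points to the key, the project ID or the path rather than to Prometheus itself.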
    

How to shard large scraping jobs

If a scraping job needs to return a large amount of data, shard (split) it into several jobs. Use sharding when any of the following is true:
  • Prometheus takes too long to retrieve metrics because one job requests too many time series.
  • A large scraping job intermittently times out or becomes unreliable.
  • You expect your cluster to grow significantly and want to avoid reworking the Prometheus configuration later.
To shard a scraping job, create multiple scrape_configs entries that use the same metrics_path but different match[] selectors. Make the selectors non-overlapping so that the same metric is not collected more than once.
For example, when you collect only GPU metrics, split the requests by the uuid label:
scrape_configs:
 - job_name: 'Nebius Observability: GPU metrics, shard 1'
   honor_labels: true
   scrape_interval: 15s
   scheme: https
   metrics_path: '/projects/<project_ID>/buckets/gpu/prometheus/federate'
   params:
     match[]:
       - '{uuid=~"GPU-[0-7].*"}'
   bearer_token: '<static_key_for_service_account>'
   static_configs:
     - targets:
       - 'read.monitoring.api.nebius.cloud'

 - job_name: 'Nebius Observability: GPU metrics, shard 2'
   honor_labels: true
   scrape_interval: 15s
   scheme: https
   metrics_path: '/projects/<project_ID>/buckets/gpu/prometheus/federate'
   params:
     match[]:
       - '{uuid=~"GPU-[8-9a-f].*"}'
   bearer_token: '<static_key_for_service_account>'
   static_configs:
     - targets:
       - 'read.monitoring.api.nebius.cloud'

 - job_name: 'Nebius Observability: GPU metrics, shard 3'
   honor_labels: true
   scrape_interval: 15s
   scheme: https
   metrics_path: '/projects/<project_ID>/buckets/gpu/prometheus/federate'
   params:
     match[]:
       - '{uuid=""}'
   bearer_token: '<static_key_for_service_account>'
   static_configs:
     - targets:
       - 'read.monitoring.api.nebius.cloud'
Choose one sharding strategy and use it consistently. For example, split requests by service, by metric name prefix, or by a stable label that clearly partitions your infrastructure.
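
Because the shard jobs differ only in their number and match[] selector, the repeated YAML can be generated rather than written by hand. The following sketch prints a scrape_configs section in the same layout as the GPU example; the function names are hypothetical, and the key placeholder is left in the output for you to fill in.

```shell
#!/usr/bin/env bash
# Print one scrape job. Arguments: project ID, shard number, match[] selector.
print_shard_job() {
  local project_id="$1" shard="$2" selector="$3"
  cat <<EOF
 - job_name: 'Nebius Observability: GPU metrics, shard ${shard}'
   honor_labels: true
   scrape_interval: 15s
   scheme: https
   metrics_path: '/projects/${project_id}/buckets/gpu/prometheus/federate'
   params:
     match[]:
       - '${selector}'
   bearer_token: '<static_key_for_service_account>'
   static_configs:
     - targets:
       - 'read.monitoring.api.nebius.cloud'

EOF
}

# Print a scrape_configs section with one job per selector, numbered from 1.
# Arguments: project ID, then any number of selectors.
print_sharded_config() {
  local project_id="$1"; shift
  local shard=1 selector
  echo 'scrape_configs:'
  for selector in "$@"; do
    print_shard_job "$project_id" "$shard" "$selector"
    shard=$((shard + 1))
  done
}
```

For example, print_sharded_config '<project_ID>' '{uuid=~"GPU-[0-7].*"}' '{uuid=~"GPU-[8-9a-f].*"}' '{uuid=""}' > prometheus.yml reproduces the three jobs shown above. The script does not check that the selectors are non-overlapping; that remains your responsibility.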

How to explore and manage metrics

Open http://localhost:9090 in your browser and explore the metrics by using PromQL queries. For example, to get all metrics related to Compute virtual machines, enter the following query:
{instance_id=~"computeinstance-.*"}
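
The same query can also be run from a terminal through the Prometheus HTTP API, which is convenient for scripts and quick checks. This is a minimal sketch: the function name promql_query and the PROMETHEUS_URL variable are assumptions made for this example, and the address defaults to the local server used above.

```shell
#!/usr/bin/env bash
# Run an instant PromQL query against a Prometheus server and print the JSON response.
# PROMETHEUS_URL is a variable name chosen for this sketch; it defaults to the local server.
promql_query() {
  local query="$1"
  # --get sends the URL-encoded query as a query-string parameter.
  curl --silent --show-error --fail --get \
    --data-urlencode "query=${query}" \
    "${PROMETHEUS_URL:-http://localhost:9090}/api/v1/query"
}
```

For example, promql_query '{instance_id=~"computeinstance-.*"}' returns the current value of every matching series as JSON.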

The Grafana Labs Marks are trademarks of Grafana Labs, and are used with Grafana Labs’ permission. We are not affiliated with, endorsed or sponsored by Grafana Labs or its affiliates. Postgres, PostgreSQL and the Slonik Logo are trademarks or registered trademarks of the PostgreSQL Community Association of Canada, and used with their permission.