The NCCL Inspector by NVIDIA® is a plug-in for the NVIDIA Collective Communications Library (NCCL) that provides detailed, per-communicator, per-collective performance and metadata logging for distributed GPU training jobs. It helps you see how NCCL communication behaves inside a workload, without changing the training code.
If your Soperator cluster is prepared with NCCL Inspector support, you can enable it for specific jobs. When enabled for a job, the NCCL Inspector collects performance data from NCCL operations and sends the resulting metrics to the Grafana® installation that is part of your Soperator cluster. This gives you a job-level view of communication bandwidth, latency, message sizes and collective types.
The NCCL Inspector is available on demand for Soperator clusters deployed in Nebius AI Cloud. For details, contact support or your personal manager.

Enabling NCCL Inspector for jobs

On a Soperator cluster prepared with NCCL Inspector support, enable it for a Slurm job (for example, a batch job) by setting the NCCL_INSPECTOR_ENABLE environment variable in the job:
export NCCL_INSPECTOR_ENABLE=1
After that, run your workload as usual with sbatch or srun. You don’t need to change the training application code.
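As an illustration, the snippet below sketches how the variable could fit into a batch job. The job name, node and task counts, the nccl_inspector_job.sh filename and train.py are hypothetical placeholders, not values required by Soperator. Only NCCL_INSPECTOR_ENABLE=1 comes from this page.

```shell
# Write a hypothetical batch script; all names and sizes are placeholders.
cat > nccl_inspector_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=nccl-inspector-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Enable the NCCL Inspector for this job.
export NCCL_INSPECTOR_ENABLE=1

# train.py stands in for your unchanged training application.
srun python train.py
EOF

# Submit the script only where Slurm is available; otherwise just report.
if command -v sbatch >/dev/null 2>&1; then
  sbatch nccl_inspector_job.sh
else
  echo "sbatch not found; wrote nccl_inspector_job.sh without submitting"
fi
```

Exporting the variable inside the batch script, rather than in your login shell, keeps the setting scoped to this job.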
To collect data for NCCL point-to-point (P2P) operations, the NCCL Inspector requires NCCL 2.30.3 or higher.
To disable the NCCL Inspector for an individual srun call, add the --snccliprecon-enabled=0 parameter to the srun command.

Accessing Grafana dashboards

The following NCCL Inspector dashboards are available in Grafana:
  • NCCL Inspector Job Performance: primary per-job view.
  • NCCL Inspector Metrics: metric-level overview.
  • NCCL Inspector Raw Metrics: raw metrics view.
For instructions about viewing dashboards in Grafana, see Monitoring metrics of Soperator clusters.

The Grafana Labs Marks are trademarks of Grafana Labs, and are used with Grafana Labs’ permission. We are not affiliated with, endorsed or sponsored by Grafana Labs or its affiliates.