Soperator includes pre-built NCCL tests that you can use to validate collective communication between GPUs over the available high-performance network and assess network performance in your cluster. The NVIDIA Collective Communications Library (NCCL) used in these benchmarks enables communication over NVLink for single-node runs, and over a combination of NVLink and InfiniBand™ for multi-node runs. There are two versions of the NCCL tests included in Soperator: a single-node version (all_reduce_perf) and a multi-node version with the _mpi suffix (all_reduce_perf_mpi). In this article, you will run the all_reduce_perf_mpi NCCL test on multiple nodes. The all-reduce operation is important for synchronizing gradients during multi-GPU training, which makes all_reduce_perf_mpi useful for evaluating both network and compute resources.
How to run the test
- Connect to a login node.
- Create an output directory for Slurm logs (see the example commands after the note below).
- Create an sbatch script for your platform (8 × NVIDIA H100/H200, 8 × NVIDIA B200, or 8 × NVIDIA B300) and save it as nccl_all_reduce.sbatch. The sections of such a script are described and sketched under Script structure and configuration below.
- Submit the job (see the example commands after the note below).
The examples assume the job is run on 8 nodes using the #SBATCH --nodes directive. To run on a different number of nodes, override this value with a command-line argument when submitting the job: sbatch --nodes=<node_count> nccl_all_reduce.sbatch. Slurm uses the command-line value when both are present. For more details about running and configuring jobs, see Running Slurm batch jobs.
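For reference, the shell commands for creating the log directory and submitting the job might look like the following sketch. The results directory name is an assumption chosen to match the log path used later in this article.

```bash
# On the login node: create the directory for Slurm logs
# (the name must match the --output/--error paths in the sbatch script).
mkdir -p results

# Submit the job with the node count set in the script (8 nodes):
sbatch nccl_all_reduce.sbatch

# Or override the node count at submission time:
sbatch --nodes=<node_count> nccl_all_reduce.sbatch
```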
Script structure and configuration
The example scripts include three sections:
- Slurm configuration parameters
- Environment variables for NCCL tuning
- Launch command for the parallel NCCL testing
Slurm configuration parameters
This section defines job parameters using #SBATCH directives. These directives configure job submission options.
For more details, see Job configuration and the SBATCH documentation.
- --job-name: Job name shown in the squeue and sacct output.
- --time: Job time limit (30 minutes in these examples).
- --output, --error: File names for the job output and error. You can use pattern substitution, for example: %x (job name) and %j (job ID).
- --nodes: Number of Slurm worker nodes used for the test. You can override this with sbatch --nodes=<node_count>.
- --gpus-per-node: Number of GPUs per node. The examples use all available GPUs for each platform (for example, 8 GPUs on 8×H200 nodes).
- --ntasks-per-node: Number of parallel processes per node. This example uses one process per GPU, so the value matches the number of GPUs (for example, 8). Some ML frameworks instead launch one process per node and handle parallelism internally; if you want to test how such frameworks would work, set this value to 1.
- --cpus-per-task: Number of CPUs per process. In these examples, all CPUs are evenly divided across processes (total cores ÷ number of GPUs). This helps ensure consistent performance and avoids interference from cluster defaults. For one-process-per-node setups, set the value to the total number of CPUs per node. For CPU counts for each platform, see presets.
- --mem: Amount of system memory per node. The examples use all available memory (--mem=0) to avoid unintended limitations from cluster defaults. You can also set a specific memory amount using units, for example: --mem=4G.
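For illustration, the Slurm section of such a script for an 8 × H200 platform might look like the sketch below. The job name, the log file paths, and the --cpus-per-task value are assumptions; derive --cpus-per-task from your platform's preset (total CPUs per node ÷ GPUs per node).

```bash
#!/bin/bash
#SBATCH --job-name=nccl_all_reduce   # shown in squeue/sacct (name is an assumption)
#SBATCH --time=00:30:00              # 30-minute time limit
#SBATCH --output=results/%x-%j.out   # %x = job name, %j = job ID
#SBATCH --error=results/%x-%j.out    # send stderr to the same file
#SBATCH --nodes=8                    # override with: sbatch --nodes=<node_count>
#SBATCH --gpus-per-node=8            # all GPUs on the node
#SBATCH --ntasks-per-node=8          # one process per GPU
#SBATCH --cpus-per-task=16           # assumption: total CPUs per node / GPUs per node
#SBATCH --mem=0                      # use all available node memory
```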
Environment variables for NCCL tuning
This section defines environment variables for NCCL and related parallel libraries. For more details, see the NCCL environment variables documentation.
- UCX_NET_DEVICES: Comma-separated list of high-speed network interfaces used by the Unified Communication X (UCX) library. UCX is commonly used as a transport layer in Message Passing Interface (MPI) implementations and other libraries (for example, the NVIDIA Inference Xfer Library, NIXL). Use the platform-specific list of the node's 8 InfiniBand ports to achieve maximum bandwidth. Otherwise, use an Ethernet interface with IP connectivity across all worker nodes, for example: UCX_NET_DEVICES=eth0.
- NCCL tuning variables:
  - NCCL_IB_QPS_PER_CONNECTION=2: Sets the number of InfiniBand queue pairs per connection to 2 (instead of the default 1). Using more queue pairs may improve throughput by increasing routing entropy.
  - NCCL_NVLS_ENABLE=1: Enables NVLink SHARP support, allowing the NVSwitch to offload part of the computation work. This option is typically enabled by default, but may be disabled in some container images.
  - NCCL_BUFFSIZE=8388608: Sets the size of the internal NCCL communication buffer to 8 MB. A larger buffer can help improve performance in some scenarios by allowing more data to be processed in each communication step.
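In the script, this section might look like the following sketch. The commented-out InfiniBand port list is a placeholder; the actual device names are platform-specific.

```bash
# High-speed interfaces for UCX. Use the platform-specific InfiniBand port list
# for maximum bandwidth, or fall back to an Ethernet interface with IP
# connectivity across all worker nodes:
export UCX_NET_DEVICES=eth0
# export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,...   # platform-specific IB ports (placeholder)

# NCCL tuning
export NCCL_IB_QPS_PER_CONNECTION=2   # 2 InfiniBand queue pairs per connection (default is 1)
export NCCL_NVLS_ENABLE=1             # enable NVLink SHARP offload
export NCCL_BUFFSIZE=8388608          # 8 MB internal communication buffer
```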
Launch command configuration
This section uses the srun command to start the application in parallel across multiple nodes and processes, and defines arguments for both srun and the NCCL tests.
- srun --mpi=pmix: Selects the Process Management Interface for Exascale (PMIx) used to exchange rank information between Slurm's launcher (srun) and the MPI library (in the case of all_reduce_perf_mpi, Open MPI).
- all_reduce_perf_mpi: Runs the NCCL benchmark for the all-reduce operation, built with support for multi-node execution. NCCL also provides similar benchmarks for other collective operations like AllGather and ReduceScatter.
- -b 512M: Sets the minimum message size, meaning the test begins with messages of 512 MB.
- -e 16G: Defines the maximum message size, so the test will go up to 16 GB.
- -f 2: Specifies the multiplication factor, so each step doubles the message size (for example, 512 MB → 1 GB → 2 GB, etc.).
- -g 1: Indicates that each thread will use 1 GPU.
- -t 1: Sets the number of threads per process to 1.
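Combining these options, the launch line of the script might look as follows.

```bash
# Run the multi-node all-reduce benchmark: one process per GPU,
# message sizes from 512 MB to 16 GB, doubling at each step.
srun --mpi=pmix all_reduce_perf_mpi -b 512M -e 16G -f 2 -g 1 -t 1
```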
How to validate the results
When the job finishes, it saves both the standard output and standard error to a file named results/nccl_all_reduce-<job_ID>.out. A successful run should:
- Complete without errors
- Show # Out of bounds values : 0 OK
- Show increasing bus bandwidth (busbw) as the collective message size grows. In most cases, busbw will reach a stable peak bandwidth at larger collective operation sizes.
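For example, you can check these conditions from the login node with commands like the following sketch; replace <job_ID> with your Slurm job ID.

```bash
# Confirm the correctness check passed ("Out of bounds values : 0 OK")
grep "Out of bounds values" results/nccl_all_reduce-<job_ID>.out

# Look for NCCL or MPI errors in the same log
grep -i "error" results/nccl_all_reduce-<job_ID>.out
```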
Example output
This example was generated on a cluster of 8 virtual machines, each equipped with 8 NVIDIA H200 GPUs. The values are provided for illustrative purposes; actual results may vary depending on factors such as hardware or system configuration.
Each line of the test output describes one collective operation size and includes:
- Size of the exchange (in bytes)
- Element count (based on the data type)
- Data type (for example, floating point)
- Reduction operation (for example, sum)
- Rank-related information
Results are reported for two modes:
- Out-of-place, where input and output buffers are separate
- In-place, where input and output share the same buffer
For each mode, the output reports the following metrics:
- Time: The duration of a single collective operation iteration, measured in microseconds (μs).
- Algorithm bandwidth (algbw): The size of the input array for the collective operation divided by time. Shows how fast one iteration completes based on the data size.
- Bus bandwidth (busbw): The algorithm bandwidth corrected for the number of communicating ranks to better estimate the peak hardware bandwidth. Due to this correction and the large speed difference between NVLink and InfiniBand, the final bus bandwidth value may not accurately reflect true hardware bandwidth. For example, in some scenarios, such as two-node runs that involve both NVLink and InfiniBand, the busbw value can overestimate the load on InfiniBand.
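To make the correction concrete: for the all-reduce operation, the NCCL tests documentation computes bus bandwidth from algorithm bandwidth as busbw = algbw × 2 × (n − 1) / n, where n is the number of ranks taking part in the operation, so the factor approaches 2 as the number of ranks grows.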
Avg bus bandwidth is the average of all busbw values reported across the different message sizes. When interpreting or comparing this average, consider the range of message sizes used in the test (defined by the -b, -e, and -f arguments, and visible in the size column). Smaller message sizes typically result in lower bus bandwidth, while larger sizes reach higher bandwidth. As a result, the average can vary significantly depending on the size range included in the test.
For more details about the test and understanding its results, see NCCL tests documentation.
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.