Soperator comes bundled with pre-built NCCL tests. You can run them to check the networking performance between GPUs on a single node or on multiple nodes, communicating over NVLink or InfiniBand™. In this article, you will run the all_reduce_perf_mpi NCCL test. The all-reduce operation is crucial for synchronizing gradients during multi-GPU training, which makes all_reduce_perf_mpi useful for evaluating both network and compute resources.

How to run the test

  1. Connect to a login node.
  2. Create the following batch script, for example named nccl.sbatch:
    #!/bin/bash
    #SBATCH --job-name=nccl_multi_node
    #SBATCH --output=results/nccl_multi_node-%j.out
    #SBATCH --error=results/nccl_multi_node-%j.out
    #SBATCH --ntasks-per-node=8
    #SBATCH --gpus-per-node=8
    #SBATCH --cpus-per-task=16
    
    # Use 2 InfiniBand queue pairs per connection between ranks
    export NCCL_IB_QPS_PER_CONNECTION=2
    
    # Use NVLink SHARP to offload all-reduce to NVSwitch
    export NCCL_NVLS_ENABLE=1
    
    # Double buffer size for NCCL communications
    export NCCL_BUFFSIZE=8388608
    
    # Prevent MPI from using InfiniBand
    export UCX_NET_DEVICES=eth0
    
    # Run a multi-node MPI NCCL test
    srun --mpi=pmix \
      all_reduce_perf_mpi -b 512M -e 8G -f 2 -g 1
    
  3. If necessary, edit the script to configure the test.
  4. Submit the script to Slurm:
    sbatch --nodes=8 nccl.sbatch
    
    For more details about running and configuring jobs, see Running Slurm batch jobs.
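After submitting, you can check on the job and follow the test output. For example (123456 stands in for the job ID that sbatch prints):

```shell
# Check the state of your submitted jobs
squeue --me

# Follow the test output as it is written
# (the file name matches the #SBATCH --output pattern in the script;
# 123456 is a placeholder job ID)
tail -f results/nccl_multi_node-123456.out
```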

Configuring the test

The following sections describe settings that are specific to the script above. For more details on configuring batch scripts, see Running Slurm batch jobs.

Number of worker nodes

sbatch --nodes=8 runs the test on 8 worker nodes. You can change this number. When the test runs on multiple nodes, it uses NVLink for communication between GPUs on the same node and InfiniBand for GPUs on different nodes. To benchmark NVLink specifically, run the test on one node.
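For example, to benchmark NVLink only, you could submit the same script on a single node:

```shell
# Run the same batch script on one node, so all GPU-to-GPU
# communication stays within the node's NVLink domain
sbatch --nodes=1 nccl.sbatch
```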

Number of vCPUs

#SBATCH --ntasks-per-node=8 (one task per GPU) and #SBATCH --cpus-per-task=16 in the script mean that the test uses 8 × 16 = 128 vCPUs on each worker node. If your nodes have a different number of vCPUs, adjust #SBATCH --cpus-per-task accordingly.
For example, if the nodes are Compute virtual machines with 8 NVIDIA B200 GPUs, each node has 160 vCPUs, so you can write #SBATCH --cpus-per-task=20. For more details on types of VMs that can be worker nodes in Soperator in Nebius AI Cloud, see Types of virtual machines and GPUs in Nebius AI Cloud.
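The arithmetic above can be sketched as a quick shell check (the vCPU and GPU counts are the example values from this section):

```shell
# Derive --cpus-per-task from a node's total vCPUs and GPU count,
# assuming one Slurm task per GPU. 160 vCPUs and 8 GPUs are the
# example values for an 8x B200 node from this section.
VCPUS_PER_NODE=160
GPUS_PER_NODE=8
CPUS_PER_TASK=$(( VCPUS_PER_NODE / GPUS_PER_NODE ))
echo "$CPUS_PER_TASK"   # prints 20
```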

Environment variables

The script uses the following NCCL and UCX environment variables:
  • NCCL_IB_QPS_PER_CONNECTION=2 makes each connection between two ranks (GPU processes) use two InfiniBand queue pairs.
  • NCCL_NVLS_ENABLE=1 explicitly enables NVLink SHARP (NVLS), which offloads the all-reduce operation to the NVSwitch domain.
  • NCCL_BUFFSIZE=8388608 increases the buffer size for NCCL communications between pairs of GPUs from 4 MiB (default) to 8 MiB.
  • UCX_NET_DEVICES=eth0 makes MPI use the eth0 network interface instead of InfiniBand.
These variables and their values are adapted for running all_reduce_perf_mpi in Soperator clusters in Nebius AI Cloud. They might not fit your configuration or other workloads. For more details on these and other available variables, see NCCL documentation and DOCA UCX documentation.
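If you want to confirm that NCCL actually picks these settings up, one option is to enable NCCL's own logging before the srun line in the batch script (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL variables; the exact log volume depends on your NCCL version):

```shell
# Make NCCL log its initialization and the environment variables
# it reads, so you can verify the settings above take effect
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,ENV
```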

all_reduce_perf_mpi parameters

all_reduce_perf_mpi uses the following parameters that you can customize:
  • -b, -f and -e: The start size, the increment factor and the end size of data that the test uses. For example, -b 512M -f 2 -e 8G means that the first iteration works with 512 MiB of data, which then doubles in size at each following iteration (1 GiB, 2 GiB, 4 GiB) until it reaches 8 GiB.
  • -g: The number of GPUs per task.
For more parameters, see NCCL tests documentation.
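The sweep that -b 512M -f 2 -e 8G produces can be enumerated with a short shell loop; the sizes match the first column of the sample output below:

```shell
# Enumerate the data sizes tested by -b 512M -f 2 -e 8G:
# start at 512 MiB and multiply by the factor until reaching 8 GiB
size=$(( 512 * 1024 * 1024 ))        # -b 512M
end=$(( 8 * 1024 * 1024 * 1024 ))    # -e 8G
factor=2                             # -f 2
while [ "$size" -le "$end" ]; do
  echo "$size"
  size=$(( size * factor ))
done
# prints 536870912, 1073741824, 2147483648, 4294967296, 8589934592
```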

Understanding the results

all_reduce_perf_mpi measures the performance of the all-reduce collective operation across multiple GPUs. All-reduce combines data from all participating GPU processes and then distributes the result back to all processes. To start the processes and initialize NCCL on worker nodes, all_reduce_perf_mpi uses the Message Passing Interface (MPI). The output of all_reduce_perf_mpi looks like this:
#  Rank  0 Group  0 Pid 121061 on   worker-0 device  0 [0000:8d:00] NVIDIA H200
...
#  Rank 63 Group  0 Pid  89331 on  worker-13 device  7 [0000:b7:00] NVIDIA H200
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
   536870912     134217728     float     sum      -1   3489.1  153.87  302.94      0   3343.1  160.59  316.16      0
  1073741824     268435456     float     sum      -1   6060.2  177.18  348.82      0   6061.9  177.13  348.72      0
  2147483648     536870912     float     sum      -1    11523  186.36  366.89      0    11799  182.00  358.32      0
  4294967296    1073741824     float     sum      -1    22312  192.49  378.97      0    22410  191.65  377.31      0
  8589934592    2147483648     float     sum      -1    44596  192.62  379.22      0    44542  192.85  379.67      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 355.703 
#
The values above are provided for illustrative purposes. Actual results may vary depending on factors such as hardware or system configuration.
The results include three metrics for each iteration of the test:
  • Time: The duration of the iteration in microseconds (μs).
  • Algorithm bandwidth (algbw): The size of the iteration’s data divided by time. Shows how fast the iteration completed based on the data size.
  • Bus bandwidth (busbw): The algorithm bandwidth corrected for the number of ranks to better estimate the peak hardware bandwidth. Due to this correction, the bus bandwidth may not accurately reflect true hardware bandwidth in two-node tests that use both NVLink and InfiniBand.
Each iteration is performed out-of-place, where input and output buffers are different, and in-place, where the buffers are the same. For more details about the test and understanding its results, see NCCL tests documentation.
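Per the NCCL tests documentation, for all-reduce the bus bandwidth is derived from the algorithm bandwidth as busbw = algbw × 2(n−1)/n, where n is the number of ranks. A quick check against the first out-of-place row of the sample output above (64 ranks, algbw 153.87 GB/s):

```shell
# Compute busbw from algbw for all-reduce:
# busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks.
# 64 ranks and 153.87 GB/s are taken from the sample output above.
awk 'BEGIN {
  n = 64
  algbw = 153.87
  busbw = algbw * 2 * (n - 1) / n
  printf "%.2f\n", busbw   # prints 302.93
}'
```

The small difference from the 302.94 in the sample table comes from rounding of the printed algbw value.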
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.