all_reduce_perf_mpi NCCL test

The all-reduce operation is crucial for synchronizing gradients during multi-GPU training, which makes all_reduce_perf_mpi useful for evaluating both network and compute resources.
How to run the test
- Connect to a login node.
- Create the following batch script, for example named nccl.sbatch:
- If necessary, edit the script to configure the test.
- Submit the script to Slurm:
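The batch script itself is not shown here. As a sketch only, assuming the all_reduce_perf_mpi binary is available on the worker nodes and that jobs are launched with mpirun (your cluster may use srun instead), a script matching the settings described in the sections below might look like this:

```shell
#!/bin/bash
# Hypothetical sketch of nccl.sbatch; adjust values for your cluster.
#SBATCH --job-name=nccl-all-reduce
#SBATCH --nodes=8             # number of worker nodes
#SBATCH --ntasks-per-node=8   # one MPI task per GPU
#SBATCH --cpus-per-task=16    # 8 tasks x 16 vCPUs = 128 vCPUs per node
#SBATCH --gpus-per-node=8

# NCCL and UCX settings (see "Environment variables" below)
export NCCL_IB_QPS_PER_CONNECTION=2
export NCCL_NVLS_ENABLE=1
export NCCL_BUFFSIZE=8388608
export UCX_NET_DEVICES=eth0

# Sweep from 512 MiB to 8 GiB, doubling each iteration, one GPU per task
mpirun all_reduce_perf_mpi -b 512M -f 2 -e 8G -g 1
```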
For more details about running and configuring jobs, see Running Slurm batch jobs.
Configuring the test
The following sections describe settings that are specific to the script above. For more details on configuring batch scripts, see Running Slurm batch jobs.

Number of worker nodes
#SBATCH --nodes=8 means that the test runs on 8 worker nodes. You can change this number.
If you run the test on multiple nodes, it uses NVLink for communications between GPUs on the same node and InfiniBand for GPUs on different nodes. To benchmark NVLink specifically, run the test on one node.
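Because command-line options passed to sbatch override the matching #SBATCH directives in the script, a single-node NVLink-only run can be submitted without editing the script (assuming it is named nccl.sbatch):

```shell
# Run on a single node so all GPU-to-GPU communication stays on NVLink
sbatch --nodes=1 nccl.sbatch
```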
Number of vCPUs
#SBATCH --ntasks-per-node=8 (one task per GPU) and #SBATCH --cpus-per-task=16 in the script mean that the test uses 8 × 16 = 128 vCPUs on each worker node. If your nodes have a different number of vCPUs, you can adjust the number of vCPUs per node accordingly.
For example, if the nodes are Compute virtual machines with 8 NVIDIA B200 GPUs, each node has 160 vCPUs, so you can write #SBATCH --cpus-per-task=20. For more details on types of VMs that can be worker nodes in Soperator in Nebius AI Cloud, see Types of virtual machines and GPUs in Nebius AI Cloud.
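The arithmetic above can be sketched as follows; the vCPU and GPU counts are the illustrative values from the two node types mentioned:

```python
def cpus_per_task(vcpus_per_node: int, gpus_per_node: int) -> int:
    """With one task per GPU, give each task an equal share of the node's vCPUs."""
    return vcpus_per_node // gpus_per_node

# 128-vCPU node with 8 GPUs -> #SBATCH --cpus-per-task=16
print(cpus_per_task(128, 8))  # 16
# 160-vCPU B200 node with 8 GPUs -> #SBATCH --cpus-per-task=20
print(cpus_per_task(160, 8))  # 20
```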
Environment variables
The script uses the following NCCL and UCX environment variables:
- NCCL_IB_QPS_PER_CONNECTION=2 makes each connection between two ranks (GPU processes) use two InfiniBand queue pairs.
- NCCL_NVLS_ENABLE=1 explicitly enables NVLink SHARP (NVLS), which offloads the all-reduce operation to the NVSwitch domain.
- NCCL_BUFFSIZE=8388608 increases the buffer size for NCCL communications between pairs of GPUs from 4 MiB (default) to 8 MiB.
- UCX_NET_DEVICES=eth0 makes MPI use the eth0 network interface instead of InfiniBand.
These values are tuned for running all_reduce_perf_mpi in Soperator clusters in Nebius AI Cloud. They might not fit your configuration or other workloads. For more details on these and other available variables, see NCCL documentation and DOCA UCX documentation.
all_reduce_perf_mpi parameters
all_reduce_perf_mpi uses the following parameters that you can customize:
- -b, -f and -e: The start size, the increment factor and the end size of the data that the test uses. For example, -b 512M -f 2 -e 8G means that the first iteration works with 512 MiB of data, which then doubles in size at each following iteration (1 GiB, 2 GiB, 4 GiB) until it reaches 8 GiB.
- -g: The number of GPUs per task.
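As a sketch, the sequence of data sizes produced by -b 512M -f 2 -e 8G can be enumerated like this (sizes in MiB for readability):

```python
def sweep_sizes(start_mib: int, factor: int, end_mib: int) -> list[int]:
    """Enumerate the data sizes the test iterates over: start, start*factor, ... up to end."""
    sizes, size = [], start_mib
    while size <= end_mib:
        sizes.append(size)
        size *= factor
    return sizes

# -b 512M -f 2 -e 8G -> 512 MiB, 1 GiB, 2 GiB, 4 GiB, 8 GiB
print(sweep_sizes(512, 2, 8192))  # [512, 1024, 2048, 4096, 8192]
```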
Understanding the results
all_reduce_perf_mpi measures the performance of the all-reduce collective operation across multiple GPUs. All-reduce combines data from all participating GPU processes and then distributes the result back to all processes. To start the processes and initialize NCCL on worker nodes, all_reduce_perf_mpi uses the Message Passing Interface (MPI).
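The semantics of all-reduce (here, a sum) can be illustrated without GPUs: every rank contributes its buffer, the buffers are combined element-wise, and every rank receives the full result. This is only a toy model of the operation NCCL performs in hardware:

```python
def all_reduce_sum(rank_buffers: list[list[float]]) -> list[list[float]]:
    """Combine buffers element-wise across ranks, then give every rank the result."""
    reduced = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(reduced) for _ in rank_buffers]

# Three "ranks", each holding a two-element buffer
ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(all_reduce_sum(ranks))  # every rank ends up with [9.0, 12.0]
```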
The output of all_reduce_perf_mpi looks like this:
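A sketch of the output shape, with invented placeholder numbers rather than real measurements:

```
#                                                       out-of-place                      in-place
#       size         count    type   redop    root     time   algbw   busbw #wrong      time   algbw   busbw #wrong
#        (B)    (elements)                             (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
   536870912     134217728   float     sum      -1   3472.1  154.62  304.41      0    3418.8  157.03  309.15      0
  1073741824     268435456   float     sum      -1   6884.4  155.97  307.07      0    6790.0  158.14  311.34      0
```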
The values above are provided for illustrative purposes. Actual results may vary depending on factors such as hardware or system configuration.
- Time: The duration of the iteration in microseconds (μs).
- Algorithm bandwidth (algbw): The size of the iteration’s data divided by time. Shows how fast the iteration completed based on the data size.
- Bus bandwidth (busbw): The algorithm bandwidth corrected for the number of ranks to better estimate the peak hardware bandwidth. Due to this correction, the bus bandwidth may not accurately reflect true hardware bandwidth in two-node tests that use both NVLink and InfiniBand.
The test reports each metric for two modes: out-of-place, where the input and output buffers are different, and in-place, where the buffers are the same.
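For all-reduce, the correction that NCCL tests apply is busbw = algbw × 2(n−1)/n, where n is the number of ranks (per the NCCL tests documentation); a quick sketch:

```python
def all_reduce_busbw(algbw_gbps: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2*(n-1)/n (NCCL tests convention)."""
    return algbw_gbps * 2 * (n_ranks - 1) / n_ranks

# 64 ranks (8 nodes x 8 GPUs): the factor approaches 2 as the rank count grows
print(all_reduce_busbw(100.0, 64))  # 196.875
```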
For more details about the test and understanding its results, see NCCL tests documentation.
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.