This guide explains how to use Open MPI and MPIrun to run parallel jobs on Compute virtual machines (VMs) that have GPUs and are added to a GPU cluster. The guide uses the NCCL tests developed by NVIDIA as an example of a job that you can run with MPIrun. You can also run these tests by using Slurm, or in a Managed Service for Kubernetes cluster with a node group that has a GPU cluster attached.

Costs

Nebius AI Cloud charges only for running the virtual machines that make up your GPU cluster. For more details, see Compute pricing.

Prerequisites

  1. Create a GPU cluster if you do not already have one.
  2. Create virtual machines and add them to the cluster.

Steps

Install Open MPI on each VM in the cluster

For each VM in the GPU cluster:
  1. Get the VM’s private IP address.
  2. Connect to the VM through SSH.
  3. Install the Open MPI library on the VM:
    sudo apt-get install openmpi-bin
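
Before moving on, it can help to confirm that the installation put mpirun on the PATH of each VM. A minimal sketch, assuming a POSIX shell; the helper name have_cmd is hypothetical and not part of Open MPI:

```shell
# Hypothetical helper: succeeds if the given command is found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd mpirun; then
  # Print the installed Open MPI version.
  mpirun --version
else
  echo "mpirun not found: install openmpi-bin first"
fi
```

Running the same check on every VM helps catch a missed installation before the tests start.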
    

Build the tests on one of the VMs

Choose one of the VMs as the main VM – you will run the tests from it. Build the tests on the main VM:
  1. Clone the NVIDIA repository with the tests:
    git clone https://github.com/NVIDIA/nccl-tests
    
  2. Build the tests with Open MPI:
    cd nccl-tests
    MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi MPI=1 make
    
  3. Copy the built binary file, build/all_reduce_perf, to the same directory (~/nccl-tests/build) on each of the other VMs.

Set up SSH connectivity between the VMs in the cluster

  1. On the main VM, generate an SSH key pair without a passphrase:
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
    
  2. Copy the generated key pair, ~/.ssh/id_ed25519 and ~/.ssh/id_ed25519.pub, to the same directory on each of the other VMs.
  3. On all other VMs, add the public key from the pair to the list of authorized keys:
    cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
    
For more details, see the Open MPI documentation.
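
To confirm that the key setup works before running the tests, you can try a non-interactive login from the main VM to each of the other VMs. A minimal sketch; the function name check_ssh and the addresses in the usage comment are hypothetical placeholders:

```shell
# check_ssh IP...: try a non-interactive SSH login to each VM and report the result.
# Returns 0 if every login succeeded, 1 otherwise.
check_ssh() {
  failed=0
  for ip in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$ip" true; then
      echo "${ip}: ok"
    else
      echo "${ip}: FAILED"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (placeholder addresses):
#   check_ssh 10.0.0.2 10.0.0.3 10.0.0.4
```

A FAILED result does not always mean the key is wrong: in batch mode, ssh also refuses hosts whose host key is not yet in ~/.ssh/known_hosts, so connecting to each VM once interactively may be needed first.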

Run the tests

Run the tests from the main VM with the mpirun command:
mpirun --host <IP_address_1>:8,<IP_address_2>:8,<IP_address_3>:8,<IP_address_4>:8 \
  --allow-run-as-root -np 32 \
  -mca pml ucx \
  ~/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
Where:
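  • IP_address_[1-4]: IP addresses of the VMs where you want to run the test.
  • :8: Number of GPUs on the VM.
  • -np 32: Total number of processes. With -g 1, each process uses one GPU, so the value is 4 VMs × 8 GPUs.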
  • -mca pml ucx: Instruction for MPI communications to go through InfiniBand™ using UCX. To use Ethernet instead, replace the option with -mca btl_tcp_if_include eth0. This changes only the MPI communications; the data exchanges of the test itself still go through InfiniBand.
  • ~/nccl-tests/build/all_reduce_perf: A path to the binary file that should be available on all VMs.
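
The --host value and the -np count in the command above follow a simple pattern: one IP:GPUs entry per VM, and a process count equal to the total number of GPUs. A minimal sketch that assembles both for review, assuming 8 GPUs per VM; the helper names build_host_list and total_ranks and the 10.0.0.x addresses are hypothetical placeholders:

```shell
# build_host_list GPUS_PER_VM IP...: print "IP:GPUS,IP:GPUS,..." for mpirun --host.
build_host_list() {
  gpus_per_vm="$1"; shift
  hosts=""
  for ip in "$@"; do
    # Prepend a comma only when the list already has an entry.
    hosts="${hosts:+${hosts},}${ip}:${gpus_per_vm}"
  done
  printf '%s\n' "$hosts"
}

# total_ranks GPUS_PER_VM IP...: print GPUS_PER_VM multiplied by the number of VMs.
total_ranks() {
  echo $(( $1 * ($# - 1) ))
}

# Placeholder values: replace with the GPU count and private IPs of your VMs.
GPUS_PER_VM=8
VM_IPS="10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4"

# $VM_IPS is left unquoted on purpose so that it splits into one argument per IP.
HOSTS="$(build_host_list "$GPUS_PER_VM" $VM_IPS)"
NP="$(total_ranks "$GPUS_PER_VM" $VM_IPS)"

# Print the command instead of running it, so it can be checked first.
echo "mpirun --host ${HOSTS} --allow-run-as-root -np ${NP} -mca pml ucx" \
     "~/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1"
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; paste the output into the main VM's shell once the values look right.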
In the output, check the average bus bandwidth. If its value is higher than 300 GB/s, the connection is stable. Example:
...

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   536870912     134217728     float     sum      -1   3674.4  146.11  283.09      0   3648.4  147.15  285.11      0
  1073741824     268435456     float     sum      -1   6411.6  167.47  324.47      0   6416.7  167.33  324.21      0
  2147483648     536870912     float     sum      -1    12735  168.62  326.71      0    12979  165.45  320.57      0
  4294967296    1073741824     float     sum      -1    25389  169.17  327.76      0    25598  167.79  325.09      0
  8589934592    2147483648     float     sum      -1    50979  168.50  326.47      0    50799  169.10  327.63      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 317.11
The average bus bandwidth is not equal to the InfiniBand™ bandwidth, because some of the NCCL operations it measures use NVLink. Nevertheless, it gives an accurate estimate of the connection quality.
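
If you save the test output to a file, the comparison with the 300 GB/s threshold can be scripted. A minimal sketch, assuming the summary line has the same form as in the example output; the function name check_busbw is hypothetical, and the temporary file only stands in for a real log:

```shell
# check_busbw LOGFILE [THRESHOLD]: compare the average bus bandwidth with a threshold.
# Exit status: 0 if above the threshold (default 300), 1 if not, 2 if the line is missing.
check_busbw() {
  awk -v min="${2:-300}" '
    /^# Avg bus bandwidth/ { bw = $NF; found = 1 }
    END {
      if (!found) { print "no average bus bandwidth line found"; exit 2 }
      if (bw + 0 > min + 0) { print "OK: " bw " GB/s"; exit 0 }
      print "LOW: " bw " GB/s"; exit 1
    }
  ' "$1"
}

# Demonstration with the summary line from the example output.
log="$(mktemp)"
printf '# Avg bus bandwidth    : 317.11\n' > "$log"
check_busbw "$log"   # prints: OK: 317.11 GB/s
rm -f "$log"
```

With a real run, you could save the output first, for example by appending "| tee nccl.log" to the mpirun command, and then call check_busbw nccl.log.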

How to delete the chargeable resources

The virtual machines that make up your GPU cluster are chargeable. If you no longer need the VMs, delete them so that Nebius AI Cloud stops charging for them:
  1. In the sidebar, go to Compute → Virtual machines.
  2. Next to the virtual machine’s name, click the vertical ellipsis button (⋮) → Delete.
  3. Enter the VM name and confirm deletion.

InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.