To set up the infrastructure for ML workloads, create a virtual machine (VM) with eight GPUs and a shared filesystem for training, and a VM with one GPU for inference. In this guide, we will use the Nebius AI Cloud CLI to create VMs in a project in the eu-north1 region.

Before you start

Install the Nebius AI Cloud CLI

The Nebius AI Cloud CLI manages all Nebius AI Cloud resources. For more details, see the Nebius AI Cloud CLI documentation. To install and initialize the Nebius AI Cloud CLI, run the following commands one by one:
curl -sSL https://storage.eu-north1.nebius.cloud/cli/install.sh | bash
nebius profile create
The last command, nebius profile create, will open the Nebius AI Cloud web console sign-in screen in your browser. Sign in to the web console to complete the initialization. After that, get the project ID and save it in the CLI configuration:
nebius config set parent-id <project_ID>

Install jq

In this guide, we will use jq to extract IDs and tokens from JSON data returned by the Nebius AI Cloud CLI. For more details, see the jq documentation.
sudo apt-get install jq
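
The commands in this guide pipe CLI output into jq and extract the .metadata.id field. As a quick check that jq is installed and behaves as expected, you can run the same filter on a sample JSON document shaped like the CLI output (the ID value below is made up):

```shell
# Hypothetical sample of the JSON shape the CLI returns; the ID is made up
echo '{"metadata": {"id": "computedisk-example"}}' | jq -r ".metadata.id"
# prints: computedisk-example
```

The -r flag makes jq print the raw string without surrounding quotes, which is what you want when saving a value to an environment variable.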

Generate keys for SSH access to the VM

Generate a key pair for SSH access to the VM and save it to the default location:
ssh-keygen -t ed25519
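
If you prefer to skip the interactive prompts (for example, when provisioning from a script), ssh-keygen accepts the key path and passphrase as flags; the path below is the default one, and the empty passphrase is a choice you may want to change:

```shell
# Generate an Ed25519 key pair at the default path with an empty passphrase
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -q
# Print the public key that cloud-init will install on the VM
cat ~/.ssh/id_ed25519.pub
```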

Create a VM with eight GPUs, InfiniBand™ connectivity, and a shared filesystem for training

  1. Create a boot disk and save its ID to an environment variable:
    export TR_VM_BOOT_DISK_ID=$(nebius compute disk create \
      --name training-vm-disk-1 \
      --size-gibibytes 200 \
      --type network_ssd \
      --source-image-family-image-family ubuntu22.04-cuda12 \
      --block-size-bytes 4096 \
      --format json | jq -r ".metadata.id")
    
    The command creates a 200 GiB SSD disk with a 4 KiB block size and an Ubuntu boot image with pre-installed NVIDIA GPU drivers. For details about boot disk images (--source-image-family-image-family), see Boot disk images for Compute virtual machines.
  2. Create a shared filesystem and save its ID to an environment variable:
    export TR_VM_FILESYSTEM_ID=$(nebius compute filesystem create \
      --name training-vm-filesystem-1 \
      --size-gibibytes 1024 \
      --type network_ssd \
      --block-size-bytes 4096 \
      --format json | jq -r ".metadata.id")
    
    The command creates a 1 TiB SSD shared filesystem with 4 KiB blocks.
  3. Get the subnet ID and save it to an environment variable:
    export SUBNET_ID=$(nebius vpc subnet list \
      --format json \
      | jq -r ".items[0].metadata.id")
    
    The command takes the first subnet in the project. Example subnet ID: vpcsubnet-e0dcbaa76x2024xyz8.
  4. For high-speed networking and efficient training, consider interconnecting the GPUs of multiple VMs over InfiniBand by placing the VMs in a GPU cluster. To do this, create the GPU cluster before creating the VM and save its ID so the VM can join it:
    export GPU_CLUSTER_ID=$(nebius compute gpu-cluster create \
      --name gpu-cluster-name \
      --infiniband-fabric fabric-3 \
      --format json \
      | jq -r ".metadata.id")
    
  5. Create a VM with 8 GPUs for training:
    export NETWORK_INTERFACE_NAME=training-vm-network-interface
    export USER_DATA=$(jq -Rs '.' <<EOF
    users:
      - name: user
        sudo: ALL=(ALL) NOPASSWD:ALL
        shell: /bin/bash
        ssh_authorized_keys:
          - $(cat ~/.ssh/id_ed25519.pub)
    EOF
    )
    
    export TR_VM_ID=$(nebius compute instance create \
      --format json \
      ${GPU_CLUSTER_ID:+--gpu-cluster-id} ${GPU_CLUSTER_ID:+"$GPU_CLUSTER_ID"} \
      - <<EOF | jq -r ".metadata.id"
    {
      "metadata": {
        "name": "training-instance"
      },
      "spec": {
        "stopped": false,
        "cloud_init_user_data": $USER_DATA,
        "resources": {
          "platform": "gpu-h100-sxm",
          "preset": "8gpu-128vcpu-1600gb"
        },
        "boot_disk": {
          "attach_mode": "READ_WRITE",
          "existing_disk": {
            "id": "$TR_VM_BOOT_DISK_ID"
          }
        },
        "filesystems": [
          {
            "attach_mode": "READ_WRITE",
            "mount_tag": "training-vm-filesystem-1",
            "existing_filesystem": {
              "id": "$TR_VM_FILESYSTEM_ID"
            }
          }
        ],
        "network_interfaces": [
          {
            "name": "$NETWORK_INTERFACE_NAME",
            "subnet_id": "$SUBNET_ID",
            "ip_address": {},
            "public_ip_address": {}
          }
        ]
      }
    }
    EOF
    )
    
    This example assumes that your VMs have public IP addresses, so you can later connect to them over SSH. If you need isolated VMs without public addresses, remove the "public_ip_address": {} line from the VM configuration. To access such a VM, you can later set up a WireGuard jump server: this approach improves security while still providing access to the VM within the same subnet. For more information about creating VMs and managing their network parameters, see How to create a virtual machine in Nebius AI Cloud.
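
Because each step above captures an ID with jq, a failed CLI call silently leaves the variable empty or set to the string null. A quick sanity check on the variables exported above before moving on:

```shell
# Fail fast if any captured ID is missing; "null" means jq found no .metadata.id
for var in TR_VM_BOOT_DISK_ID TR_VM_FILESYSTEM_ID SUBNET_ID TR_VM_ID; do
  val=$(eval "echo \$$var")
  if [ -z "$val" ] || [ "$val" = "null" ]; then
    echo "ERROR: $var is not set; re-run the corresponding command" >&2
  fi
done
```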

Create a VM with one GPU for inference

  1. Create a boot disk and save its ID to an environment variable:
    export INF_VM_BOOT_DISK_ID=$(nebius compute disk create \
      --name inference-vm-disk-1 \
      --size-gibibytes 200 \
      --type network_ssd \
      --source-image-family-image-family ubuntu22.04-cuda12 \
      --block-size-bytes 4096 \
      --format json | jq -r ".metadata.id")
    
The command creates a 200 GiB SSD disk with a 4 KiB block size and an Ubuntu boot image with pre-installed NVIDIA GPU drivers. For details about boot disk images (--source-image-family-image-family), see Boot disk images for Compute virtual machines.
  2. Create a VM with one GPU for inference:
    export NETWORK_INTERFACE_NAME=single-gpu-node-compute-api-network-interface
    export USER_DATA=$(jq -Rs '.' <<EOF
    users:
      - name: user
        sudo: ALL=(ALL) NOPASSWD:ALL
        shell: /bin/bash
        ssh_authorized_keys:
          - $(cat ~/.ssh/id_ed25519.pub)
    EOF
    )
    
    export INF_VM_ID=$(nebius compute instance create \
      --format json \
      - <<EOF | jq -r ".metadata.id"
    {
      "metadata": {
        "name": "inference-instance"
      },
      "spec": {
        "stopped": false,
        "cloud_init_user_data": $USER_DATA,
        "resources": {
          "platform": "gpu-h100-sxm",
          "preset": "1gpu-16vcpu-200gb"
        },
        "boot_disk": {
          "attach_mode": "READ_WRITE",
          "existing_disk": {
            "id": "$INF_VM_BOOT_DISK_ID"
          }
        },
        "network_interfaces": [
          {
            "name": "$NETWORK_INTERFACE_NAME",
            "subnet_id": "$SUBNET_ID",
            "ip_address": {},
            "public_ip_address": {}
          }
        ]
      }
    }
    EOF
    )
    
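
The jq -Rs '.' call used for USER_DATA wraps the cloud-init YAML into a single JSON string so it can be embedded in the request body. To preview the document the VM will actually receive, decode the string back with jq:

```shell
# Round-trip check: encode a cloud-init snippet as a JSON string, then decode it
USER_DATA=$(jq -Rs '.' <<EOF
users:
  - name: user
    shell: /bin/bash
EOF
)
printf '%s\n' "$USER_DATA" | jq -r .
```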

Connect to the VMs

Connect to the VM for training via SSH:
  1. Get your VM’s public IP address and save it to an environment variable:
    export TR_PUBLIC_IP_ADDRESS=$(nebius compute instance get \
      --id $TR_VM_ID \
      --format json \
      | jq -r '.status.network_interfaces[0].public_ip_address.address | split("/")[0]')
    
  2. Use the public IP address to connect to the VM:
    ssh user@$TR_PUBLIC_IP_ADDRESS
    
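The shared filesystem attached in the training VM spec is exposed to the guest under the mount tag set earlier (training-vm-filesystem-1), but it is not mounted automatically. A sketch of mounting it once you are connected to the training VM, assuming the virtiofs driver; the mount point /mnt/filesystem is an arbitrary choice:

```shell
# On the training VM: mount the attached shared filesystem by its mount tag
sudo mkdir -p /mnt/filesystem
sudo mount -t virtiofs training-vm-filesystem-1 /mnt/filesystem
df -h /mnt/filesystem   # verify the 1 TiB filesystem is mounted
```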
Connect to the VM for inference via SSH:
  1. Get your VM’s public IP address and save it to an environment variable:
    export INF_PUBLIC_IP_ADDRESS=$(nebius compute instance get \
      --id $INF_VM_ID \
      --format json \
      | jq -r '.status.network_interfaces[0].public_ip_address.address | split("/")[0]')
    
  2. Use the public IP address to connect to the VM:
    ssh user@$INF_PUBLIC_IP_ADDRESS
    

What’s next

InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.