Nebius AI Cloud supports integration with SkyPilot, an open-source framework for running AI workloads on different cloud infrastructures. SkyPilot can create clusters of Compute virtual machines (VMs) and run workloads on them based on task definitions like this:
resources:
  cloud: nebius
  accelerators: H100:8
  region: eu-north1

run: |
  nvidia-smi

Costs

The tutorial uses the following chargeable resources: Compute VMs with GPUs, GPU clusters and, optionally, an Object Storage bucket.

Prerequisites

  1. Install dependencies:
    1. Install Python version 3.10 or higher.
    2. Install Rust:
      curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
      
      For more ways to install, see the Rust website.
    3. Install and configure the Nebius AI Cloud CLI.
    4. Install jq.
    5. Reload your terminal.
  2. Install the latest nightly build of SkyPilot:
    pip3 install "skypilot-nightly[nebius]"
    
  3. Clone the Nebius solution library from GitHub and go to the skypilot directory:
    git clone https://github.com/nebius/nebius-solution-library.git
    cd nebius-solution-library/skypilot
    
  4. Run the nebius-setup.sh script to create and configure a service account that will manage resources in your project on behalf of SkyPilot:
    chmod +x nebius-setup.sh
    ./nebius-setup.sh
    
    After running the script, follow its prompts. If you want to test mounting Object Storage buckets to VMs, enable Object Storage support when prompted.
  5. Check that SkyPilot can access your project:
    sky check nebius
    
    If the check is successful, the output shows that Nebius AI Cloud support is enabled:
    Checking credentials to enable clouds for SkyPilot.
    Nebius: enabled [compute, storage]
    
    To enable a cloud, follow the hints above and rerun: sky check
    If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html
    
    🎉 Enabled clouds 🎉
    Nebius [compute, storage]
    
    Using SkyPilot API server: http://127.0.0.1:46580
    
  6. If you want to test mounting Object Storage buckets to VMs, create a bucket.

Steps

  1. Run SkyPilot tasks from the Nebius AI Cloud solution library. You can run some or all of these tasks, depending on your use cases.
  2. Work with the VMs managed by SkyPilot as part of the tasks.
If you face issues when you launch tasks or work with the VMs, see the troubleshooting section.

Run tasks

Run a GPU check

The GPU check in this tutorial creates a cluster of one VM with 8 NVIDIA H100 GPUs and runs the NVIDIA System Management Interface (nvidia-smi) on it. nvidia-smi lists the GPUs available on the VM and the processes running on them.
  1. Launch the task:
    sky launch -c basic-job examples/basic-job.yaml
    
  2. When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
If the launch is successful, the output contains the list of GPUs and processes running on them returned by nvidia-smi:
⚙︎ Launching on Nebius eu-north1.
└── Instance is up.
✓ Cluster launched: basic-job.  View logs: sky api logs -l sky-2025-02-27-11-36-07-257438/provision.log
⚙︎ Syncing files.
✓ Setup detached.
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
...
(task, pid=7059) +-----------------------------------------------------------------------------------------+
(task, pid=7059) | NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
(task, pid=7059) |-----------------------------------------+------------------------+----------------------+
(task, pid=7059) | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
(task, pid=7059) | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
(task, pid=7059) |                                         |                        |               MIG M. |
(task, pid=7059) |=========================================+========================+======================|
(task, pid=7059) |   0  NVIDIA H100 80GB HBM3          On  |   00000000:8D:00.0 Off |                    0 |
(task, pid=7059) | N/A   29C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
(task, pid=7059) |                                         |                        |             Disabled |
(task, pid=7059) +-----------------------------------------+------------------------+----------------------+
(task, pid=7059) |   1  NVIDIA H100 80GB HBM3          On  |   00000000:91:00.0 Off |                    0 |
...
✓ Job finished (status: SUCCEEDED).

Run an Object Storage check

The Object Storage check creates a VM and mounts a bucket from your project to it.
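The mount is defined in the task's file_mounts section. The following is a hedged sketch of what that part of examples/test-cloud-bucket.yaml might look like; the mount path, mode and run commands are assumptions based on the log output below, so check the actual file in the repository:

```yaml
# Sketch of a SkyPilot task that mounts a Nebius Object Storage bucket.
resources:
  cloud: nebius
  region: eu-north1

file_mounts:
  /my_data:
    source: nebius://my-nebius-bucket  # replace with your bucket name
    mode: MOUNT                        # mount the bucket as a directory

run: |
  echo "Do we have data?"
  ls -l /my_data
```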
  1. In examples/test-cloud-bucket.yaml, find source: nebius://my-nebius-bucket under file_mounts and replace my-nebius-bucket with the name of your bucket.
  2. Launch the task:
    sky launch -c test-cloud-bucket examples/test-cloud-bucket.yaml
    
  3. When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
If the launch is successful, the output contains the list of objects in the bucket:
⚙︎ Launching on Nebius eu-north1.
└── Instance is up.
✓ Cluster launched: test-cloud-bucket.  View logs: sky api logs -l sky-2025-03-25-13-42-18-045624/provision.log
⚙︎ Syncing files.
  Mounting (to 1 node): nebius://another-example-bucket -> /my_data
✓ Storage mounted.  View logs: sky api logs -l sky-2025-03-25-13-42-18-045624/storage_mounts.log
✓ Setup detached.
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(setup pid=3963) Setup will be executed on every `sky launch` command on all nodes
(task, pid=3963) Run will be executed on every `sky exec` command on all nodes
(task, pid=3963) Do we have data?
(task, pid=3963) total 4
(task, pid=3963) drwxr-xr-x 2 ubuntu ubuntu 4096 Mar 25 12:44 lorem-ipsum
✓ Job finished (status: SUCCEEDED).

Run an InfiniBand™ check

The InfiniBand check creates a cluster of two VMs with 8 NVIDIA H100 GPUs each, adds them to a GPU cluster for InfiniBand networking, and runs the ib_send_bw test from perftest. The test measures bandwidth when sending data between GPUs on different VMs.
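A two-node task of this shape can be sketched as follows. The run section is an assumption for illustration: ib_send_bw runs as a server on the head node and as a client on the worker, using SkyPilot-provided environment variables (SKYPILOT_NODE_RANK, SKYPILOT_NODE_IPS) to coordinate; the actual examples/infiniband-test.yaml may set up perftest differently:

```yaml
# Sketch of a two-node SkyPilot task for an InfiniBand bandwidth test.
resources:
  cloud: nebius
  accelerators: H100:8
  region: eu-north1

num_nodes: 2

run: |
  # Head node (rank 0) acts as the ib_send_bw server;
  # the worker connects to it by the head node's IP.
  HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n 1)
  if [ "$SKYPILOT_NODE_RANK" -eq 0 ]; then
    ib_send_bw --report_gbits
  else
    ib_send_bw --report_gbits "$HEAD_IP"
  fi
```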
  1. Launch the task:
    sky launch -c infiniband-test examples/infiniband-test.yaml
    
  2. When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
If the launch is successful, the output contains the results of the test:
⚙︎ Launching on Nebius eu-north1.
└── Instances are up.
✓ Cluster launched: infiniband-test.  View logs: sky api logs -l sky-2025-02-27-09-15-19-016219/provision.log
✓ Setup detached.
⚙︎ Job submitted, ID: 14
├── Waiting for task resources on 2 nodes.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
...
(worker1, rank=1, pid=33870, ip=192.168.0.15) ---------------------------------------------------------------------------------------
(worker1, rank=1, pid=33870, ip=192.168.0.15)                     Send BW Test
(worker1, rank=1, pid=33870, ip=192.168.0.15)  Dual-port       : OFF            Device         : mlx5_0
(worker1, rank=1, pid=33870, ip=192.168.0.15)  Number of qps   : 1              Transport type : IB
(worker1, rank=1, pid=33870, ip=192.168.0.15)  Connection type : RC             Using SRQ      : OFF
(worker1, rank=1, pid=33870, ip=192.168.0.15)  PCIe relax order: ON
(worker1, rank=1, pid=33870, ip=192.168.0.15)  ibv_wr* API     : ON
(worker1, rank=1, pid=33870, ip=192.168.0.15)  TX depth        : 128
(worker1, rank=1, pid=33870, ip=192.168.0.15)  CQ Moderation   : 1
(worker1, rank=1, pid=33870, ip=192.168.0.15)  Mtu             : 4096[B]
(worker1, rank=1, pid=33870, ip=192.168.0.15)  Link type       : IB
(worker1, rank=1, pid=33870, ip=192.168.0.15)  Max inline data : 0[B]
(worker1, rank=1, pid=33870, ip=192.168.0.15)  rdma_cm QPs       : OFF
(worker1, rank=1, pid=33870, ip=192.168.0.15)  Data ex. method : Ethernet
(worker1, rank=1, pid=33870, ip=192.168.0.15) ---------------------------------------------------------------------------------------
(worker1, rank=1, pid=33870, ip=192.168.0.15)  local address: LID 0x1334 QPN 0x0131 PSN 0xcdddde
(worker1, rank=1, pid=33870, ip=192.168.0.15)  remote address: LID 0x132f QPN 0x0131 PSN 0x90f79b
(worker1, rank=1, pid=33870, ip=192.168.0.15) ---------------------------------------------------------------------------------------
(worker1, rank=1, pid=33870, ip=192.168.0.15)  #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
(worker1, rank=1, pid=33870, ip=192.168.0.15)  65536      1000             361.82             361.67               0.689839
(worker1, rank=1, pid=33870, ip=192.168.0.15) ---------------------------------------------------------------------------------------
✓ Job finished (status: SUCCEEDED).

Run a training task

The training task adapts a tutorial from the PyTorch documentation and its implementation from the SkyPilot documentation. It creates a cluster of two VMs with 8 NVIDIA H100 GPUs each and adds them to a GPU cluster for InfiniBand networking. Then, it uses PyTorch to train a GPT-like model with Distributed Data Parallel (DDP) on the VMs.
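Multi-node PyTorch jobs in SkyPilot typically start torchrun on every node, passing each node's rank and the head node's address via SkyPilot-provided environment variables (SKYPILOT_NUM_NODES, SKYPILOT_NUM_GPUS_PER_NODE, SKYPILOT_NODE_RANK, SKYPILOT_NODE_IPS). The following is a hedged sketch of that pattern; the script name, port and setup steps are assumptions, and the actual examples/distributed-training.yaml may differ:

```yaml
# Sketch of a two-node DDP training task; the real YAML in the solution
# library may use different script names and setup steps.
resources:
  cloud: nebius
  accelerators: H100:8
  region: eu-north1

num_nodes: 2

run: |
  # The first IP in the node list is the head node, used as the rendezvous address.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n 1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=29500 \
    main.py  # hypothetical training script name
```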
  1. Launch the task:
    sky launch -c distributed-training examples/distributed-training.yaml
    
  2. When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
If the launch is successful, the output contains training logs:
⚙︎ Launching on Nebius eu-north1.
└── Instances are up.
✓ Cluster launched: distributed-training.  View logs: sky api logs -l sky-2025-02-27-11-27-07-706257/provision.log
⚙︎ Syncing files.
✓ Setup detached.
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 2 nodes.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
...
...
(task, pid=8591) [GPU4] Epoch 10 | Iter 0 | Eval Loss 1.94895
(task, pid=8591) [GPU7] Epoch 10 | Iter 0 | Eval Loss 1.93593
(task, pid=8591) [GPU6] Epoch 10 | Iter 0 | Eval Loss 1.95961
(task, pid=8591) I0227 16:33:56.668000 20640 site-packages/torch/distributed/elastic/agent/server/api.py:879] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
(task, pid=8591) I0227 16:33:56.669000 20640 site-packages/torch/distributed/elastic/agent/server/api.py:932] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
(task, pid=8591) I0227 16:33:56.670000 20640 site-packages/torch/distributed/elastic/agent/server/api.py:946] Done waiting for other agents. Elapsed: 0.0005085468292236328 seconds
✓ Job finished (status: SUCCEEDED).
The training task also has a single-VM (non-distributed) version. To launch it, run sky launch -c ai-training examples/ai-training.yaml.

Work with the VMs managed by SkyPilot

After you create a cluster and launch a task on it, you can use SkyPilot and Nebius AI Cloud tools to monitor the cluster and connect to its VMs.

Monitor the SkyPilot clusters and VMs

To see the statuses of the SkyPilot clusters, run the following command:
sky status
You can also monitor individual VMs in the Nebius AI Cloud web console; their names are prefixed with the cluster name.

Connect to the VMs

SkyPilot sets up SSH access to VMs in clusters automatically.
  • To connect to the main (“head”) VM of the cluster, run ssh <cluster_name>. For example:
    ssh distributed-training
    
  • To connect to other VMs (“workers”), run ssh <cluster_name>-worker<index>. For example:
    ssh distributed-training-worker1
    
For more details, see the SkyPilot documentation.

Troubleshoot issues

Unavailable resources

When you run sky launch, you might get the following error:
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 2x Nebius({'H100': 8})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
Reasons for provision failures (for details, please check the log above):
Resource                       Reason                                        
Nebius(gpu-h100-sxm_8gpu-      Failed to acquire resources in all zones in   
128vcpu-1600gb, {'H100': 8})   eu-north1 for {Nebius({'H100': 8})}.   
This error means that the resources you specified in the task definition exceed your Compute quotas. For details on how to view and manage quotas, see Quotas in Nebius AI Cloud.

Authentication errors

If you update your credentials, for example, by rerunning nebius-setup.sh, you might get authentication errors when you run SkyPilot commands. This means the SkyPilot API server cached the previous credentials. To update them on the server, restart it:
sky api stop
sky api start
After that, run sky check nebius again to check that SkyPilot can access your Nebius AI Cloud project.

Other issues

If you face problems that are not covered by this tutorial, you can create issues in the relevant GitHub repositories, such as nebius-solution-library or SkyPilot.

How to delete the created resources

Some of the created resources are chargeable. If you no longer need them, delete them so that Nebius AI Cloud does not charge you for them:
  • Delete VMs and GPU clusters that SkyPilot created for all tasks in this tutorial:
    sky down basic-job test-cloud-bucket infiniband-test distributed-training
    
  • Delete the bucket that you used in the Object Storage task.

InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.