Nebius AI Cloud supports integration with SkyPilot, an open-source framework for running AI workloads on different cloud infrastructures. SkyPilot can create clusters of Compute virtual machines (VMs) and run workloads on them based on task definitions.
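As a rough illustration of what a task definition can look like, the sketch below writes a minimal SkyPilot task file from the shell. The file name gpu-check-sketch.yaml and the task name are placeholders invented for this example, not files from the ML Cookbook. The H100:1 accelerator request mirrors the GPU check described later in this tutorial.

```shell
# Write a minimal, hypothetical SkyPilot task definition to a local file.
# The file name and task name are placeholders, not part of the ML Cookbook.
task_file="gpu-check-sketch.yaml"

cat > "$task_file" <<'EOF'
# Hypothetical task: one VM with a single H100 GPU that runs nvidia-smi.
name: gpu-check-sketch

resources:
  cloud: nebius
  accelerators: H100:1

run: |
  nvidia-smi
EOF

# Show the resulting file so you can review it before launching anything.
cat "$task_file"
```

A file like this is what you pass to sky launch. The tasks in this tutorial use the ready-made definitions from the ML Cookbook instead.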
Costs
The tutorial includes the following chargeable resources:
- Managed SkyPilot API Server (standalone application)
- Compute virtual machines
- Compute disks
- Object Storage buckets
Prerequisites
- Deploy the Managed SkyPilot API Server:
  - In the Nebius AI Cloud console, go to AI Services → SkyPilot.
  - Enter a name for the application or keep the default one.
  - Select a Platform and a Preset (vCPUs and RAM) for the API server VM.
  - Click Deploy application.
- Install uv, a Python package manager, on your local machine.
- Install the SkyPilot CLI.
- Connect to the managed API server. On the application page, click How to connect and copy the login command. Run the command in your terminal, replacing <your_server_endpoint> with the public endpoint URL from your deployed application.
- Verify that SkyPilot can access your project. If the check is successful, the output shows Nebius as enabled.
- Clone the Nebius ML Cookbook repository and go to the skypilot directory. The ML Cookbook contains example task definitions used in this tutorial.
- If you want to test mounting Object Storage buckets to VMs, create a bucket.
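Before you continue, it can help to confirm that the command-line tools from the list above are installed. The following is a minimal sketch that only inspects your PATH and does not install or change anything. The check_tool helper is a name made up for this example, and git is included on the assumption that you use it to clone the repository. The commands in the trailing comments are standard SkyPilot CLI commands, but the exact login command for your server is the one shown under How to connect.

```shell
# Report whether a command-line tool is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found"
  fi
}

# Tools used in the prerequisites (git is an assumption for cloning).
for tool in uv sky git; do
  check_tool "$tool"
done

# Once the tools are installed, the connection steps from the list are
# (replace <your_server_endpoint> with your application's endpoint URL):
#   sky api login -e <your_server_endpoint>
#   sky check nebius
```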
Steps
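- Run SkyPilot tasks from the Nebius ML Cookbook:
  - Run a GPU check
  - Run an Object Storage check
  - Run an InfiniBand™ check
  - Run a training task

  You can choose to run some or all of these tasks, depending on your use cases.
- Work with the VMs managed by SkyPilot as part of the tasks:
  - Monitor the SkyPilot clusters and VMs
  - Connect to the VMs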
Run tasks
Run a GPU check
The GPU check in this tutorial creates a VM with one NVIDIA® H100 GPU and runs the NVIDIA System Management Interface (nvidia-smi) on it. nvidia-smi outputs the list of GPUs available on the VM and the list of processes running on the GPUs.
- Launch the task with sky launch.
- When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
After the task starts, SkyPilot shows the output of nvidia-smi.
Run an Object Storage check
The Object Storage check creates a VM and mounts a bucket from your project to it.
- In examples/test-cloud-bucket.yaml, find source: nebius://my-nebius-bucket under file_mounts and replace my-nebius-bucket with the name of your bucket.
- Launch the task with sky launch.
- When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
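The bucket name replacement from the first step can also be scripted. The sketch below works on a small stand-in file rather than the real examples/test-cloud-bucket.yaml, so it is safe to run anywhere. The stand-in file name, the /my_data mount path and the your-bucket-name value are placeholders made up for this example. Only the source line comes from the tutorial.

```shell
# Hypothetical stand-in for the file_mounts section of the task file.
# The /my_data mount path is invented for this sketch.
task_file="test-cloud-bucket-sketch.yaml"
cat > "$task_file" <<'EOF'
file_mounts:
  /my_data:
    source: nebius://my-nebius-bucket
EOF

bucket_name="your-bucket-name"   # replace with the name of your bucket

# Rewrite the source line; going through a temporary file avoids relying
# on sed's in-place option, which differs between sed implementations.
sed "s|nebius://my-nebius-bucket|nebius://${bucket_name}|" "$task_file" > "${task_file}.tmp"
mv "${task_file}.tmp" "$task_file"

# Print the updated line to confirm the replacement.
grep "source:" "$task_file"
```

To apply the same edit to the cookbook file, you would point task_file at examples/test-cloud-bucket.yaml and skip the step that creates the stand-in file.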
Run an InfiniBand™ check
The InfiniBand check creates a cluster of two VMs with 8 NVIDIA® H100 GPUs each, connected with InfiniBand, and runs the ib_send_bw test from perftest. The test measures bandwidth when sending data between GPUs on different VMs.
- Launch the task with sky launch.
- When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
Run a training task
The training task adapts a tutorial from the PyTorch documentation and its implementation from the SkyPilot documentation. It creates a cluster of two VMs with 8 NVIDIA® H100 GPUs each, connected with InfiniBand. Then, it uses PyTorch to train a GPT-like model with Distributed Data Parallel on the VMs.
- Launch the task with sky launch.
- When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
The training task also has a single VM (non-distributed) version. To launch it, run sky launch -c ai-training examples/ai-training.yaml.
Work with the VMs managed by SkyPilot
After you have created a cluster and launched a task on it, you can use SkyPilot and Nebius AI Cloud tools to monitor the cluster and connect to its VMs.
Monitor the SkyPilot clusters and VMs
To see the statuses of the SkyPilot clusters, run sky status.
Connect to the VMs
SkyPilot sets up SSH access to VMs in clusters automatically.
- To connect to the main (“head”) VM of the cluster, run ssh <cluster_name>.
- To connect to other VMs (“workers”), run ssh <cluster_name>-worker<index>.
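To make the naming pattern concrete, the sketch below prints the SSH command for every VM in a cluster without connecting to anything. The print_ssh_commands helper and the my-cluster name are made up for this example, and the sketch assumes that worker indexes start at 1.

```shell
# Print the SSH commands for every VM in a SkyPilot cluster.
# Usage: print_ssh_commands <cluster_name> <number_of_worker_VMs>
# Assumption: worker indexes start at 1.
print_ssh_commands() {
  cluster_name="$1"
  worker_count="$2"
  echo "ssh ${cluster_name}"   # head VM
  index=1
  while [ "$index" -le "$worker_count" ]; do
    echo "ssh ${cluster_name}-worker${index}"   # worker VMs
    index=$((index + 1))
  done
}

# Hypothetical two-VM cluster: one head VM and one worker.
print_ssh_commands my-cluster 1
```

For the two-VM clusters in this tutorial, that means one command for the head VM and one for the single worker.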
Troubleshoot issues
Unavailable resources
When you run sky launch, you might get the following error:
Connection errors
If SkyPilot commands fail with connection errors, make sure the Managed SkyPilot API Server is running. Check the application status in the Nebius AI Cloud console under AI Services → SkyPilot, then run sky api login again with the correct endpoint URL.
Authentication errors
If you get authentication errors when running SkyPilot commands, verify that you are connected to the managed API server by running sky api login again with the correct endpoint URL.
Other issues
If you face problems that are not covered by this tutorial, you can create issues in the relevant GitHub repositories.
How to delete the created resources
Some of the created resources are chargeable. If you do not need them, delete these resources so that Nebius AI Cloud does not charge for them:
- Delete the VMs and GPU clusters that SkyPilot created for all tasks in this tutorial with sky down.
- Delete the bucket that you used in the Object Storage task.
- If you no longer need the Managed SkyPilot API Server, delete it in the Nebius AI Cloud console. Go to AI Services → SkyPilot, open the application, go to the Settings tab and click Delete application.
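If you launched several clusters, you may want to review the teardown commands before running them. The sketch below only prints one sky down command per cluster and does not delete anything. The print_teardown_commands helper and the my-cluster name are placeholders for this example. Use the cluster names that sky status reports for your project.

```shell
# Print the teardown command for each cluster so you can review it first.
# Usage: print_teardown_commands <cluster_name> [<cluster_name> ...]
print_teardown_commands() {
  for cluster in "$@"; do
    echo "sky down ${cluster}"
  done
}

# Placeholder cluster names; replace them with the names from sky status.
print_teardown_commands my-cluster ai-training
```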
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.