Costs
The tutorial includes the following chargeable resources:
Prerequisites
- Install dependencies:
  - Install Python version 3.10 or higher.
  - Install Rust:
    For more ways to install, see the Rust website.
  - Install and configure the Nebius AI Cloud CLI.
  - Install jq.
- Reload your terminal.
- Install the latest nightly build of SkyPilot:
- Clone the Nebius solution library from GitHub and go to the skypilot directory:
- Run the nebius-setup.sh script to create and configure a service account that will manage resources in your project on behalf of SkyPilot:
  After running the script, follow its prompts. If you want to test mounting Object Storage buckets to VMs, enable Object Storage support when prompted.
- Check that SkyPilot can access your project:
  If the check is successful, the output shows that Nebius AI Cloud support is enabled:
- If you want to test mounting Object Storage buckets to VMs, create a bucket.
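Before you run the tasks, it can help to confirm that the tools from the list above are on your PATH. The following is a minimal sketch, not part of the solution library: the helper names `missing_tools` and `python_is_310_or_newer` are hypothetical, and the binary names in the usage note (`python3`, `cargo`, `nebius`, `jq`, `sky`) are assumptions about how the dependencies are installed on your machine.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed only if python3 reports version 3.10 or higher.
python_is_310_or_newer() {
    python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)'
}
```

For example, `missing_tools python3 cargo nebius jq sky` prints nothing when every tool is installed. Running `sky check nebius` afterwards confirms that SkyPilot can access your project.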
Steps
- Run SkyPilot tasks from the Nebius AI Cloud solution library: You can choose to run some or all of these tasks, depending on your use cases.
- Work with the VMs managed by SkyPilot as part of the tasks:
Run tasks
Run a GPU check
The GPU check in this tutorial creates a cluster of one VM with 8 NVIDIA H100 GPUs and runs the NVIDIA System Management Interface (nvidia-smi) on it. nvidia-smi outputs the list of GPUs available on the VM and the list of processes running on the GPUs.
- Launch the task:
- When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
nvidia-smi:
Output example
Run an Object Storage check
The Object Storage check creates a VM and mounts a bucket from your project to it.
- In examples/test-cloud-bucket.yaml, find source: nebius://my-nebius-bucket under file_mounts and replace my-nebius-bucket with the name of your bucket.
- Launch the task:
- When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
Output example
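The edit to `examples/test-cloud-bucket.yaml` can also be scripted. The following is a minimal sketch, assuming the file contains the placeholder `nebius://my-nebius-bucket` exactly as described above. The function name `set_bucket_name` is hypothetical, and the bucket name must not contain `|`, `&` or `\`, which are special in the `sed` expression.

```shell
# Replace the placeholder bucket name in a SkyPilot task file.
# Usage: set_bucket_name <task_file> <bucket_name>
set_bucket_name() {
    task_file=$1
    bucket_name=$2
    tmp_file=$(mktemp) || return 1
    status=0
    # Write to a temporary file first so that the edit works with both GNU and BSD sed.
    sed "s|nebius://my-nebius-bucket|nebius://${bucket_name}|" "$task_file" > "$tmp_file" &&
        cat "$tmp_file" > "$task_file" || status=$?
    rm -f "$tmp_file"
    return "$status"
}
```

For example, run `set_bucket_name examples/test-cloud-bucket.yaml <bucket_name>` from the skypilot directory before you launch the task.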
Run an InfiniBand™ check
The InfiniBand check creates a cluster of two VMs with 8 NVIDIA H100 GPUs each, adds them to a GPU cluster for InfiniBand networking, and runs the ib_send_bw test from perftest. The test measures bandwidth when sending data between GPUs on different VMs.
- Launch the task:
- When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
Output example
Run a training task
The training task adapts a tutorial from the PyTorch documentation and its implementation from the SkyPilot documentation. It creates a cluster of two VMs with 8 NVIDIA H100 GPUs each and adds them to a GPU cluster for InfiniBand networking. Then, it uses PyTorch to train a GPT-like model with Distributed Data Parallel (DDP) on the VMs.
- Launch the task:
- When SkyPilot displays which resources it is going to create and asks you to confirm the launch, enter Y and press Enter.
Output example
The training task also has a single-VM (non-distributed) version. To launch it, run sky launch -c ai-training examples/ai-training.yaml.
Work with the VMs managed by SkyPilot
After you create a cluster and launch a task on it, you can use SkyPilot and Nebius AI Cloud tools to monitor the cluster and connect to its VMs.
Monitor the SkyPilot clusters and VMs
To see the statuses of the SkyPilot clusters, run sky status.
Connect to the VMs
SkyPilot sets up SSH access to VMs in clusters automatically.
- To connect to the main (“head”) VM of the cluster, run ssh <cluster_name>. For example:
- To connect to other VMs (“workers”), run ssh <cluster_name>-worker<index>. For example:
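Because the host names follow a fixed pattern, they can be generated for every VM in a cluster. The following is a minimal sketch, assuming that worker indexes start at 1 (the tutorial does not state the first index). The function name `cluster_ssh_hosts` and the cluster name in the usage note are hypothetical.

```shell
# Print the SSH host alias for each VM in a cluster: the head VM first, then the workers.
# Usage: cluster_ssh_hosts <cluster_name> <total_vm_count>
cluster_ssh_hosts() {
    cluster_name=$1
    total_vms=$2
    printf '%s\n' "$cluster_name"
    index=1
    while [ "$index" -lt "$total_vms" ]; do
        printf '%s-worker%s\n' "$cluster_name" "$index"
        index=$((index + 1))
    done
}
```

For a two-VM cluster named `my-cluster`, `cluster_ssh_hosts my-cluster 2` prints `my-cluster` and `my-cluster-worker1`. You can then run `ssh my-cluster` to reach the head VM or `ssh my-cluster-worker1` to reach the worker.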
Troubleshoot issues
Unavailable resources
When you run sky launch, you might get the following error:
Authentication errors
If you update your credentials, for example, by rerunning nebius-setup.sh, you might get authentication errors when you run SkyPilot commands. This means the SkyPilot API server cached the previous credentials. To update them on the server, restart it:
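A minimal sketch of the restart, assuming a locally running API server and a SkyPilot version that provides the `sky api stop` and `sky api start` subcommands. The wrapper name `restart_sky_api_server` is hypothetical.

```shell
# Stop the local SkyPilot API server, then start it again so that it reloads the credentials.
restart_sky_api_server() {
    sky api stop && sky api start
}
```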
Then, run sky check nebius again to check that SkyPilot can access your Nebius AI Cloud project.
Other issues
If you face problems that are not covered by this tutorial, you can create issues in the relevant GitHub repositories:
How to delete the created resources
Some of the created resources are chargeable. If you do not need them, delete these resources so that Nebius AI Cloud does not charge for them:
- Delete VMs and GPU clusters that SkyPilot created for all tasks in this tutorial:
- Delete the bucket that you used in the Object Storage task.
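The cluster cleanup can be sketched as follows. This assumes that every cluster SkyPilot currently manages was created for this tutorial, because `sky down --all` tears down all of them. If you have other clusters, pass specific cluster names to `sky down` instead. The wrapper name `delete_tutorial_clusters` is hypothetical.

```shell
# Tear down every cluster that SkyPilot manages (assumption: all of them belong to this tutorial).
delete_tutorial_clusters() {
    sky down --all
}
```

After the command finishes, `sky status` should no longer list the tutorial clusters.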
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.