If a job fails, gets canceled, or returns an error, Serverless AI deletes the underlying virtual machine (VM) and boot disk to avoid costs. In this case, you can't connect to the job and debug it. To keep the VM alive for debugging, run the job with a "sleep-on-fail" wrapper.

Keep the VM alive on failure

To keep the VM alive for debugging when a job fails, use bash -lc and append a long sleep if the main command fails. Add the following parameters when creating the job:
nebius ai job create \
  ... \
  --container-command bash \
  --args "-lc '<your_main_command> || (echo FAILED; sleep 86400)'"
Replace <your_main_command> with the actual command you want to run. When you use this wrapper, the job behaves as follows:
  • If the main command succeeds, the job exits normally.
  • If the main command fails, the job keeps running for 24 hours. During this time, you can connect to the underlying VM via SSH and debug.
For a shorter debug window, reduce the sleep duration: for example, sleep 3600 keeps the VM alive for one hour.
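You can test the wrapper pattern locally before submitting a job. The sketch below uses a deliberately failing command (false) in place of your real workload and a short sleep in place of the 24-hour window:

```shell
# Simulate a failing main command: `false` stands in for <your_main_command>.
# On failure, the wrapper prints FAILED and sleeps (shortened to 2 seconds
# here; use 86400 in a real job to keep the VM up for 24 hours).
bash -lc 'false || (echo FAILED; sleep 2)'

# A succeeding command skips the || branch entirely, so the job exits normally:
bash -lc 'true || (echo FAILED; sleep 86400)'
```

Note that the || branch runs whenever the main command exits with a nonzero status, so the wrapper works for any command or script that reports failure through its exit code.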

Connect to the job container by using SSH

To connect to the job container by using SSH, create the job with at least one --ssh-key parameter. Then, to run a shell inside the job container for debugging, run:
nebius ai job ssh <job_ID>
A shell starts in the container. To customize the connection, use the following parameters:
  • -i or --identity-file: Identity file for SSH authentication. Default: your SSH key.
  • -s or --shell: Shell to run inside the container. Default: sh.
For example, to use bash and a specific key, run:
nebius ai job ssh <job_ID> -i ~/.ssh/id_rsa -s bash

Open a shell inside the container

Use the container ID to open a shell inside the container:
sudo docker exec -it <container_ID> /bin/bash
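If you don't have the container ID yet, you can list the running containers on the VM first. This assumes a standard Docker daemon is running on the VM (run the command after connecting via SSH):

```shell
# List running containers; the job container's ID appears in the first column.
sudo docker ps
```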

Run tests inside the container

  • Check currently running processes:
    ps aux
    
You can identify which processes use the most compute resources, check whether your script is running as expected, and see when the processes started.
  • Check if GPUs are accessible:
    nvidia-smi
    
    This test helps confirm whether the GPUs were allocated correctly and whether the job workload could use them.
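To make the process check more targeted, you can sort by resource usage. The snippet below is a generic Linux example (not specific to Serverless AI), assuming a procps-style ps:

```shell
# Show the five most CPU-hungry processes, header line included.
ps aux --sort=-%cpu | head -n 6

# Likewise for resident memory usage:
ps aux --sort=-%rss | head -n 6
```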