Keep the VM alive on failure
To keep the VM alive for debugging when a job or endpoint fails, use bash -lc and append a long sleep that runs only if the main command fails. Add the following parameters when creating the job or endpoint:
Replace <your_main_command> with the actual command you want to run.
When you use this wrapper, the job or endpoint behaves as follows:
- If the main command succeeds, the job or endpoint exits normally.
- If the main command fails, the job or endpoint keeps running for 24 hours. During this time, you can connect to the underlying VM via SSH and debug.
Connect to the container by using SSH
You can connect to the container of the job or endpoint in the following ways:
- Directly
- Through the underlying VM
To connect to the container by using SSH, you should create a job or create an endpoint with at least one --ssh-key.
To run a shell inside the container for debugging, run nebius ai job ssh <job_ID> or nebius ai endpoint ssh <endpoint_ID>. A shell starts in the container.
You can add the following parameters to these commands:
- -i or --identity-file: Identity file for SSH authentication. Default: your SSH key.
- -s or --shell: Shell to run inside the container. Default: sh.
For example, to use bash and a specific key, run:
- For the job:
- For the endpoint:
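As a sketch, the two commands can be wrapped in small helper functions that combine the --shell and --identity-file parameters described above. The function names ssh_into_job and ssh_into_endpoint are hypothetical, and placing the parameters after the ID is an assumption about argument order.

```shell
# Hypothetical helpers; only the nebius subcommands and the two flags
# come from the text above. Argument 1 is the ID, argument 2 is the key path.

ssh_into_job() {
  # Open bash in the job's container, authenticating with a specific key.
  nebius ai job ssh "$1" --shell bash --identity-file "$2"
}

ssh_into_endpoint() {
  # Open bash in the endpoint's container, authenticating with a specific key.
  nebius ai endpoint ssh "$1" --shell bash --identity-file "$2"
}
```

For example, ssh_into_job <job_ID> ~/.ssh/id_ed25519 would start bash in the job's container, where ~/.ssh/id_ed25519 stands in for the path to your own private key. This only works if bash is installed in the container image; otherwise, keep the default sh.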
Run tests inside the container
After you connect to the container:
- Check currently running processes:
You can identify which processes used the most computing resources, check if the script was running as expected and see when the processes started.
- Check if GPUs are accessible:
This test helps confirm whether the GPUs were allocated correctly and whether the job or endpoint workloads could use them.
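Both checks can be sketched as follows. The commands shown are assumptions: ps aux for the process list and nvidia-smi for the GPU check are common choices, but either tool may be missing from a minimal container image. The function names check_processes and check_gpus are hypothetical.

```shell
# Hypothetical helpers for the two checks above.

check_processes() {
  # List all processes. With the procps version of ps, the %CPU and %MEM
  # columns show resource usage and START shows when each process began.
  ps aux
}

check_gpus() {
  # nvidia-smi lists the GPUs visible to the container, if the tool is present.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
  else
    echo "nvidia-smi not found: GPUs or drivers may not be visible in this container" >&2
    return 1
  fi
}
```

Run check_processes and then check_gpus inside the container. If check_gpus reports that nvidia-smi is missing, that alone does not prove the GPUs were not allocated, because the image may simply not include the tool.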