Soperator clusters allow you to use Docker Engine to run jobs in containers.
Limitations
When using Docker Engine with Slurm, consider the following limitations:
- Docker Engine doesn’t respect Slurm resource allocations. Docker may use all resources of a node, regardless of the settings that you specify in sbatch. We recommend using Enroot or other supported container runtimes to run Docker containers. If you want to use Docker Engine, use the -N and --exclusive settings to allocate entire nodes to Slurm jobs.
- Docker containers aren’t managed as part of the Slurm job lifecycle. If a job is canceled, fails or times out, containers started with srun docker run continue running. Stop them manually or adjust your job script to stop the containers when the job receives the SIGTERM or SIGKILL signals.
- Performance may be degraded without local disks. If no local disk is available, Docker uses the VFS storage driver, which leads to significantly lower performance.
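One way to stop containers from the job script is a small per-task wrapper that traps SIGTERM. The following is a minimal sketch and not a tested recipe for Soperator: the helper names (container_name, stop_container, run_with_cleanup) and the container naming scheme are hypothetical, and it assumes that SIGTERM actually reaches the wrapper process. A shell script can't trap SIGKILL, so the sketch handles SIGTERM only.

```shell
#!/bin/bash
# Sketch under stated assumptions: helper names and the naming scheme are hypothetical.

# Build a container name that is unique per Slurm job and per task.
container_name() {
    echo "slurm-${SLURM_JOB_ID:-nojob}-${SLURM_PROCID:-0}"
}

# Stop the named container; ignore the error if it is already gone.
stop_container() {
    docker stop "$1" >/dev/null 2>&1 || true
}

# Start the container in the background and wait for it, so that the
# TERM trap can run while the container is still active.
run_with_cleanup() {
    local name
    name="$(container_name)"
    # The name is expanded now, when the trap is installed.
    trap "stop_container '$name'; exit 143" TERM
    docker run --rm --name "$name" "$@" &
    wait $!
}

# In a wrapper script launched by srun, the last line would be:
#   run_with_cleanup "$@"
```

Running docker run in the background and waiting for it matters here: the shell defers trap handlers until a foreground command returns, but the wait builtin is interrupted by the signal, so cleanup can start immediately.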
How to run a Docker container in a Slurm job
- Connect to a login node of your Soperator cluster.
- Create a batch script that runs your workload in a container. For example, create the test_nccl.sh script that pulls a Docker image with Ubuntu and CUDA toolkit from NVIDIA, installs NCCL tests and their dependencies, and runs NCCL tests in a Docker container. The script uses the following parameters:
  - #SBATCH -N specifies how many nodes to allocate.
  - #SBATCH --exclusive specifies that no other jobs may be scheduled on these nodes until this job is completed.
  - The --device=/dev/infiniband parameter for docker allows access to InfiniBand™ from inside Docker containers.
  - The -v parameter for docker makes paths from the shared filesystem visible from inside the container.
- Start the job by submitting the script with sbatch. The output contains the job ID.
- When the job is completed, review the contents of output.log. The output contains the logs of the container starting up and installing dependencies, followed by the results of NCCL tests.
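The batch workflow described in this section can be sketched as follows. This is an illustration under assumptions, not the original script from this page: the node count, the image tag, the shared path /shared/data and the --gpus all flag (which requires the NVIDIA Container Toolkit) are all assumed values, and nvidia-smi stands in for the NCCL test installation and run.

```shell
# Write a simplified batch script to test_nccl.sh.
cat > test_nccl.sh <<'EOF'
#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive
#SBATCH --output=output.log

# Assumed image tag; replace it with the image that your workload needs.
IMAGE="nvidia/cuda:12.4.1-devel-ubuntu22.04"
# Hypothetical path on the shared filesystem.
SHARED_DIR="/shared/data"

# Placeholder workload: nvidia-smi instead of building and running NCCL tests.
srun docker run --rm --gpus all \
    --device=/dev/infiniband \
    -v "${SHARED_DIR}:${SHARED_DIR}" \
    "${IMAGE}" \
    nvidia-smi
EOF

# Submit the script from the login node:
#   sbatch test_nccl.sh
```

The -N and --exclusive directives follow the recommendation in the limitations: because Docker Engine ignores Slurm resource allocations, the job takes whole nodes.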
How to run a Docker container in an interactive mode
- Connect to a login node of your Soperator cluster.
- To run an interactive session on a node and prevent any other allocations on this node, use salloc. This command allocates a worker node to a new job and opens a terminal on this node.
- Start a Docker container on the worker node with docker run. The --rm parameter ensures that the container is automatically deleted when it exits. If your workload needs access to the shared filesystem, use the -v parameter to make paths from the shared filesystem visible from inside the container. For multi-node GPU workloads, use the --device=/dev/infiniband parameter for docker, which allows access to InfiniBand from inside Docker containers.
- After you finish the interactive session and exit, you can see the confirmation that the node is no longer allocated.
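The interactive steps in this section might look like the sketch below. The salloc flag values, the helper name start_container and the default shared path /shared/data are assumptions for illustration; the docker parameters are the ones described in the steps, plus -it for an interactive terminal.

```shell
# On the login node, request an exclusive interactive allocation
# (flag values are assumptions; adjust them to your cluster):
#   salloc -N 1 --exclusive

# Hypothetical helper to define in the shell on the worker node.
# Usage: start_container <image> [command...]
start_container() {
    # SHARED_DIR is a hypothetical variable that points to a shared filesystem path.
    local shared_dir="${SHARED_DIR:-/shared/data}"
    docker run --rm -it \
        --device=/dev/infiniband \
        -v "${shared_dir}:${shared_dir}" \
        "$@"
}

# Example call on the worker node:
#   start_container ubuntu:22.04 bash
```

Mounting the shared path at the same location inside the container keeps file paths identical on the node and in the container, which is a convenience choice and not a requirement.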
How to get information about your Docker containers
To list all containers, including the ones that are already finished, connect to a worker node and run docker ps -a.
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.