Limitations
Docker Engine does not respect Slurm resource allocations and may use all resources of a node, regardless of the settings that you specify in sbatch. It is recommended to use Enroot or other supported container runtimes to run Docker containers. If you want to use Docker Engine, use the -N and --exclusive settings to allocate entire nodes to Slurm jobs that run the containers and follow other instructions in this article.
If no local disk is available, Docker uses the VFS storage driver, which leads to significantly lower performance.
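To see which storage driver is in effect on a node, one option is to query Docker Engine directly. The sketch below assumes that the docker CLI is installed on the node where it runs; report_storage_driver is a hypothetical helper name introduced only for illustration, and the messages it prints are not Docker output.

```shell
# Hypothetical helper: classify a storage driver name.
# "vfs" is the slow fallback described above.
report_storage_driver() {
  driver="$1"
  case "$driver" in
    vfs) echo "warning: Docker uses the VFS storage driver; expect lower performance" ;;
    "")  echo "unknown: could not read the storage driver" ;;
    *)   echo "ok: Docker uses the $driver storage driver" ;;
  esac
}

# Query the local Docker Engine only if the CLI is present
# (for example, on a worker node).
if command -v docker >/dev/null 2>&1; then
  report_storage_driver "$(docker info --format '{{.Driver}}' 2>/dev/null)"
fi
```

If the helper reports vfs, the node most likely has no local disk available to Docker.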
How to run a Docker container in a Slurm job
- Connect to a login node of your Soperator cluster.
- Create a batch script that runs your workload in a container. For example, create the test_nccl.sh script that pulls a Docker image with Ubuntu and CUDA toolkit from NVIDIA, installs NCCL tests and their dependencies, and runs NCCL tests in a Docker container. The script uses the following parameters:
  - #SBATCH -N specifies how many nodes to allocate.
  - #SBATCH --exclusive specifies that no other jobs may be scheduled on these nodes until this job is completed.
  - The --device=/dev/infiniband parameter for docker allows access to InfiniBand™ from inside Docker containers.
  - The -v parameter for docker makes paths from the shared filesystem visible from inside the container.
- Start the job by submitting the script with sbatch. The output contains the job ID.
- When the job is completed, review the contents of output.log. The output contains the logs of the container starting up and installing dependencies, followed by the results of NCCL tests.
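As a sketch of the batch script that these steps describe, the commands below write a hypothetical single-node test_nccl.sh. Several details are assumptions rather than values from this article: the image tag, the /shared path, the --gpus all option (which relies on the NVIDIA Container Toolkit being set up for Docker), the --output setting that sends logs to output.log, and the way NCCL tests are built inside the container. The sketch also assumes that the image ships the NCCL development files that the build needs.

```shell
# Write a hypothetical single-node variant of test_nccl.sh.
# The quoted 'EOF' keeps the script text literal (no expansion here).
cat > test_nccl.sh <<'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --output=output.log

# Assumed values: adjust the image tag and the shared path to your cluster.
IMAGE="nvidia/cuda:12.4.1-devel-ubuntu22.04"
SHARED_DIR="/shared"

# --device exposes InfiniBand devices; -v mounts the shared filesystem path.
docker run --rm --gpus all \
  --device=/dev/infiniband \
  -v "${SHARED_DIR}:${SHARED_DIR}" \
  "${IMAGE}" \
  bash -c 'apt-get update && apt-get install -y git build-essential &&
           git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests &&
           make -C /opt/nccl-tests &&
           /opt/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1'
EOF

# On a login node, the script would then be submitted with:
#   sbatch test_nccl.sh
```

The sketch allocates one whole node, in line with the recommendation to combine -N and --exclusive when you use Docker Engine.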
How to run a Docker container in an interactive mode
- Connect to a login node of your Soperator cluster.
- To run an interactive session on a node and prevent any other allocations on this node, use salloc. This command allocates a worker node to a new job and opens a terminal on this node.
- Start a Docker container on the worker node. The --rm parameter ensures that the container is automatically deleted when it exits. If your workload needs access to the shared filesystem, use the -v parameter to make paths from the shared filesystem visible from inside the container. For multi-node GPU workloads, use the --device=/dev/infiniband parameter for docker, which allows access to InfiniBand from inside Docker containers.
- After you finish the interactive session and exit, you can see the confirmation that the node is no longer allocated.
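The interactive steps above can be sketched as follows. The helper build_docker_cmd is hypothetical and only prints a docker run command line, so that you can review the options before running anything. The ubuntu:22.04 image, the /shared path, the -it option, and the salloc options in the comment are assumptions for illustration, not values from this article.

```shell
# On the login node, the allocation could be requested with, for example:
#   salloc -N 1 --exclusive

# Hypothetical helper: print (but do not run) a docker command
# for an interactive container.
#   $1: image name
#   $2: shared filesystem path to mount (empty to skip)
#   $3: "yes" to expose InfiniBand devices
build_docker_cmd() {
  image="$1"
  shared_dir="$2"
  use_ib="$3"
  cmd="docker run --rm -it"
  if [ -n "$shared_dir" ]; then
    cmd="$cmd -v $shared_dir:$shared_dir"
  fi
  if [ "$use_ib" = "yes" ]; then
    cmd="$cmd --device=/dev/infiniband"
  fi
  echo "$cmd $image bash"
}

build_docker_cmd ubuntu:22.04 /shared yes
# prints: docker run --rm -it -v /shared:/shared --device=/dev/infiniband ubuntu:22.04 bash
```

Printing the command instead of running it keeps the sketch safe to try on any host; on a worker node, you can run the printed command as is.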
How to get information about your Docker containers
To list all containers, including the ones that are already finished, connect to a worker node and run docker ps -a.

InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.
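For the container listing described in this section, a minimal sketch could wrap docker ps in a small function. The function name list_all_containers and the choice of columns are assumptions for illustration; the -a option includes containers that have already exited.

```shell
# Hypothetical wrapper: list all containers, including exited ones,
# as a table with a few commonly useful columns.
list_all_containers() {
  docker ps -a --format 'table {{.ID}}\t{{.Image}}\t{{.Status}}\t{{.Names}}'
}

# Run the listing only where the docker CLI is present (a worker node).
if command -v docker >/dev/null 2>&1; then
  list_all_containers || echo "could not query Docker Engine" >&2
else
  echo "docker CLI not found; run this on a worker node" >&2
fi
```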