- To download data from other Slurm clusters or remote servers via SSH: use rsync.
- To download data to or from an Object Storage bucket or other S3-compatible storage: use rclone or the AWS CLI.
How to download data by using rsync
Rsync can transfer files via SSH between a Soperator cluster and a remote server. You can use rsync to migrate data from an external data source to a Soperator cluster.
Use rsync to download binaries, configuration files, container images, or other small files. It is not intended for transferring large files.
Unlike other tools, rsync can preserve file permissions and ownership. You can combine rsync with rclone: download the files with rclone, then update their permissions with rsync. For more information, see the example below.
You can transfer data with rsync directly or within a Slurm job.
Download data directly
For example, to download a directory via SSH from a remote server to the shared filesystem of your Soperator cluster:

- Connect to a login node of your Soperator cluster.
- Run the following command:
In this command, specify the following:
- Private SSH key for the remote server
- Username and host to connect to the remote server
- Path to the remote directory with data
- Path to the directory in the shared filesystem where the data should be downloaded
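Assembled from the parameters above, such a command might look like the following sketch; the key path, user, host, and directories are placeholder values, not the exact ones from this guide:

```shell
# Download a remote directory over SSH into the shared filesystem.
# The key path, user@host, and both directories below are placeholders.
rsync -av --progress \
  -e "ssh -i ~/.ssh/remote_server_key" \
  user@remote.example.com:/data/dataset/ \
  /shared/dataset/
```

The -a flag preserves permissions, ownership, and timestamps, and -e tells rsync which SSH command and private key to use for the connection.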
Download data within a Slurm job
A Slurm job runs on a worker node, which has more processing resources than a login node. In addition, a data transfer running as a Slurm job continues even if your connection to the login node is interrupted. To download data within a Slurm job:

- Connect to a login node of your Soperator cluster.
- Create the rsync_copy.batch file in the shared filesystem of your Soperator cluster and paste the following contents into it:
- Run a job to transfer data:
In this command, specify the following:
- Private SSH key for the remote server
- Username and host to connect to the remote server
- Path to the remote directory with data
- Path to the directory in the shared filesystem where the data should be downloaded
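As a sketch, such a batch script and its submission might look like this; the Slurm options, argument order, and paths are assumptions, not the exact script from this guide:

```shell
#!/bin/bash
#SBATCH --job-name=rsync-copy
#SBATCH --output=rsync_copy.out

# Arguments (all placeholders):
#   $1 - private SSH key for the remote server
#   $2 - user@host of the remote server
#   $3 - remote directory with data
#   $4 - target directory in the shared filesystem
rsync -av -e "ssh -i $1" "$2:$3" "$4"
```

With this argument convention, a submission from a login node could look like: sbatch rsync_copy.batch ~/.ssh/remote_server_key user@remote.example.com /data/dataset /shared/dataset.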
How to download data by using rclone
Rclone is a versatile tool for downloading or uploading data between various locations. It needs more configuration than rsync, but after the initial setup it lets you download large amounts of data (10-100 TiB) quickly.
To work with rclone, create the ~/.config/rclone/rclone.conf configuration file on the machine from which you run the commands. In this file, create a profile for each remote location and specify information such as the location's address and type, and the credentials for connecting to it. For the full list of location types and their settings, see the rclone documentation.
After you configure rclone profiles for several remote locations, you can move data between these locations or to the local machine.
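As an illustration, a profile for an S3-compatible location in ~/.config/rclone/rclone.conf might look like the following sketch; the profile name, endpoint, and credentials are placeholders:

```
[object-storage]
type = s3
provider = Other
endpoint = https://storage.example.com
access_key_id = <your-access-key-id>
secret_access_key = <your-secret-access-key>
```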
For example, to download data from an Object Storage bucket to a shared filesystem of a Soperator cluster:
- Connect to a login node of your cluster.
- Create the ~/.config/rclone/rclone.conf configuration file. For example, this file can have the following contents:

  This is a remote profile for an Object Storage bucket. Add more profiles for other locations if needed.

- Create the rclone_copy.batch script that transfers data between remote locations that have rclone profiles configured, or to the shared filesystem of your Soperator cluster. Paste the following contents into the rclone_copy.batch file:

  The rclone sync command synchronizes the contents of directories in two locations. To download data into an empty directory, you can also use rclone copy or other rclone subcommands.

- Run a job to download the data from the slurm-mlperf-training Object Storage bucket to the mlperf-data directory in the shared filesystem of your Soperator cluster:

  The job runs on one of the worker nodes and downloads the data to the shared filesystem, so that all nodes can access it.
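The rclone_copy.batch script described above might be sketched as follows; the Slurm options and the argument convention are assumptions:

```shell
#!/bin/bash
#SBATCH --job-name=rclone-copy
#SBATCH --output=rclone_copy.out

# Arguments (placeholders):
#   $1 - source, e.g. a remote profile and path such as object-storage:slurm-mlperf-training
#   $2 - destination, e.g. a directory in the shared filesystem
rclone sync --progress "$1" "$2"
```

With this convention, the download from the example could be submitted as: sbatch rclone_copy.batch object-storage:slurm-mlperf-training /shared/mlperf-data.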
How to use several nodes for data download
rclone is a single-node utility and doesn’t let you take full advantage of Slurm parallelism. However, to speed up large downloads, you can distribute the load manually. To do so, start several jobs by using the rclone_copy.batch script described above and specify a different subdirectory in each job. As the Soperator cluster has a shared filesystem, the downloaded data is available to all nodes.
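For instance, assuming the rclone_copy.batch script takes a source and a destination as arguments, you could submit one job per subdirectory like this (the subdirectory names are hypothetical):

```shell
# Submit a separate rclone job for each top-level subdirectory of the bucket,
# so several worker nodes download in parallel into the shared filesystem.
for subdir in train validation test; do
  sbatch rclone_copy.batch \
    "object-storage:slurm-mlperf-training/$subdir" \
    "/shared/mlperf-data/$subdir"
done
```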
Example with rclone and rsync combined
rclone downloads files from a remote server faster than rsync, but rsync can preserve file ownership and permissions. You can combine these two tools: first download the files with rclone, which is fast, then adjust permissions with rsync.
To download data from a remote location configured in the rclone configuration file to a local directory:
- Connect to a login node of your Soperator cluster.
- Create the ~/.config/rclone/rclone.conf configuration file. In this file, specify profiles for all remote locations that you are going to use.
- Submit the job that runs the rclone_copy.batch script:

  In this command, specify the following:

  - Name of the remote profile configured in rclone.conf and path to the remote directory with data
  - Path to the directory in the shared filesystem where the data should be downloaded
- Get the job ID from the output of the last command.
- Submit the job that runs the rsync_copy.batch script with the condition that it should start after the previous job completes:

  In this command, specify:

  - ID of the job running rclone that you obtained in the previous step
  - Private SSH key for the remote server
  - Username and host to connect to the remote server, and path to the remote directory with data
  - Path to the directory in the shared filesystem where the data should be downloaded
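These two steps can be chained in one sketch; the sbatch --parsable flag prints only the job ID, and all profile names, paths, and hosts below are placeholders:

```shell
# Submit the rclone download job and capture its job ID.
JOB_ID=$(sbatch --parsable rclone_copy.batch \
  object-storage:slurm-mlperf-training /shared/mlperf-data)

# Start the rsync permission-fixing job only after the rclone job succeeds.
sbatch --dependency=afterok:"$JOB_ID" rsync_copy.batch \
  ~/.ssh/remote_server_key user@remote.example.com /data/dataset /shared/mlperf-data
```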
How to download data by using the AWS CLI
The AWS CLI allows you to download data from any S3-compatible storage, including buckets in Object Storage. You can use it for smaller amounts of data (no more than 10 TiB).

- Connect to a login node of your Soperator cluster.
- Create the aws_copy.batch file with the following contents:
- Submit the aws_copy.batch job:
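As a sketch, the aws_copy.batch file might look like this; the bucket name, endpoint URL, and target directory are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=aws-copy
#SBATCH --output=aws_copy.out

# Download a bucket from S3-compatible storage into the shared filesystem.
# The bucket, endpoint, and destination below are placeholders.
aws s3 sync s3://slurm-mlperf-training /shared/mlperf-data \
  --endpoint-url https://storage.example.com
```

Submit it from a login node with: sbatch aws_copy.batch.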