Hi,

This post is an homage to a Stack Overflow answer about copying data to and from S3. That shared work saved me a lot of time, and I believe people who share their work like this do not receive enough recognition.

The problem is that I have multiple GB of data split across thousands of files. Those files are selected for download by a semi-automated pipeline for model training, so the number of files to download varies from run to run. This also means the data cannot be prepared ahead of time. The approach from the official boto3 documentation for copying data from S3 takes too long: even with asynchronous execution, downloading those files takes a few hours. Imagine a scenario where you want to fine-tune a deep learning model on a machine with multiple GPUs, but you have to wait several hours for the data to be copied 😱. Preprocessing steps are not feasible either, since the data is filtered at request time. Downloading everything via the AWS CLI is not an option, as the S3 buckets contain far more data than the training run needs. The simplest approach is to increase the throughput. And here is the beauty, copied directly from Pierre D:

import os
import boto3
import botocore
import boto3.s3.transfer as s3transfer
from pathlib import Path


def fast_upload(session, bucketname, s3dir, filelist, progress_func, workers=20):
    # The speed-up comes from a larger connection pool plus multi-threaded transfers.
    botocore_config = botocore.config.Config(max_pool_connections=workers)
    s3client = session.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(
        use_threads=True,
        max_concurrency=workers,
    )
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    for filepath in filelist:
        # Place each local file under the s3dir prefix, keeping its file name.
        dst = str(Path(s3dir, filepath.name))
        s3t.upload(
            str(filepath), bucketname, dst,  # str() in case filepath is a pathlib.Path
            subscribers=[
                s3transfer.ProgressCallbackInvoker(progress_func),
            ],
        )
    s3t.shutdown()  # wait for all queued transfers to finish

Depending on my internet connection, I can upload/download 10,000 files in minutes. Before I go a bit more into depth, here is the example usage:

from tqdm import tqdm

bucketname = 'bucket-name'
key = 'some/path/in/s3'
filelist = [...]  # local files to upload, as pathlib.Path objects

with tqdm(desc='upload', ncols=60, unit_scale=1) as pbar:
    fast_upload(boto3.Session(), bucketname, key, filelist, pbar.update)

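My actual problem is downloading rather than uploading, but the same transfer manager also exposes a download method, so the pattern carries over. Below is a sketch of a download counterpart along the same lines; it is my adaptation, not Pierre D's code, and it assumes you already know the object keys and that the local target directory exists:

import os
import boto3
import botocore
import boto3.s3.transfer as s3transfer


def fast_download(session, bucketname, keylist, localdir, progress_func, workers=20):
    # Same trick as fast_upload: a bigger connection pool plus multi-threading.
    botocore_config = botocore.config.Config(max_pool_connections=workers)
    s3client = session.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(
        use_threads=True,
        max_concurrency=workers,
    )
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    for s3key in keylist:
        # Write every object into localdir (must exist), keeping only the object's file name.
        dst = os.path.join(localdir, os.path.basename(s3key))
        s3t.download(
            bucketname, s3key, dst,
            subscribers=[
                s3transfer.ProgressCallbackInvoker(progress_func),
            ],
        )
    s3t.shutdown()  # blocks until all queued transfers are finished
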
So, what is happening here?

The upload relies on the Transfer Manager from the s3transfer library, which supports multi-threaded transfers. I haven't tested for an upper limit; in general, I use 20 or 30 threads and hit the limit of my internet connection (German internet 🥲). I also like the integration with a progress callback (tqdm): the callback only receives the number of transferred bytes, so the progress bar counts bytes rather than files. Thanks to this shared code snippet, I could solve this small data transfer problem.
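
Because the callback reports bytes, you can also give tqdm the total number of bytes up front and get a real percentage instead of a plain counter. A small sketch of that variant, assuming the files to upload sit in a hypothetical local data/ directory and reusing the bucket and prefix from the example above:

from pathlib import Path

import boto3
from tqdm import tqdm

bucketname = 'bucket-name'
key = 'some/path/in/s3'
filelist = sorted(Path('data').glob('*.bin'))  # hypothetical local files

totalsize = sum(f.stat().st_size for f in filelist)  # bytes to transfer in total

# total=totalsize turns the bar into a percentage; unit='B' with unit_scale
# renders the transferred bytes in a human-readable way.
with tqdm(desc='upload', total=totalsize, ncols=60, unit='B', unit_scale=1) as pbar:
    fast_upload(boto3.Session(), bucketname, key, filelist, pbar.update)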

Thank you for your attention.