One of the most popular AWS products for data storage is S3 – Simple Storage Service. It lets you store vast amounts of data across several storage classes for every use imaginable, all wrapped in a well-made CLI tool that allows you to manage your data effectively.
The Use Case
We maintain a large call-center handling tens of thousands of calls a day. Each of those calls has to be recorded for compliance and review reasons, so you can imagine the number of files this generates over the years. To make sure we have redundant backups of our data, we opted to store it in S3: that way it is always accessible and safe, for an affordable price.
The backup process implementation was pretty simple in the beginning. It just synced our local recording archive with the S3 bucket using the AWS CLI tool.
aws s3 sync /home/archive/RECORDINGS/ s3://call-recording-backup/
Pretty simple, right? Just run this every day and we have backups!
The Problem
This worked fine until we reached somewhere in the area of 7 million files that had to be checked on every sync. The AWS CLI simply couldn't iterate through that many local files fast enough, and my backup process was no longer able to keep up with the number of files we had to back up.
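If you want a quick sense of how many files the CLI has to walk, a plain count of the archive (using standard find, with the same path as above) already hints at the problem:

find /home/archive/RECORDINGS/ -type f | wc -l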
The (Officially Proposed) Solution
The AWS docs suggest sharding the upload into chunks to optimize the data transfer, using the --include and --exclude parameters of the aws s3 command.
aws s3 sync /home/archive/RECORDINGS/ s3://call-recording-backup/ --exclude="*" --include="2020-03-*"
This example excludes all files in the directory except the ones starting with the “2020-03-” prefix. Sounds great, right? Well, it would, if it made any difference whatsoever.
The underlying implementation of this functionality is flawed because it still iterates through every single file on the client side, and only then decides whether to upload it. The core of the problem remains: the CLI still has to walk every file, which cannot be done in a sensible amount of time at this scale.
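You can see this for yourself with a quick check (assuming the same paths as the examples above): even with every file excluded, a dry run still has to walk the entire local tree before concluding there is nothing to do.

time aws s3 sync /home/archive/RECORDINGS/ s3://call-recording-backup/ --exclude="*" --dryrun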
There are proposals to make this functionality work in a smarter manner, but they have not been implemented at the time of writing.
The Actual Solution
Since the AWS CLI tool does not offer any out-of-the-box wildcard filtering that would actually speed things up, you have to handle it on your own. The exact solution will depend on your use case, but the goal is the same: upload only the files that were added or modified since the last time the sync was run.
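As a general starting point, here is a minimal sketch of that idea. It assumes GNU find, bash, and that the local directory layout mirrors the bucket keys; adapt the paths and the modification window to your own setup.

# Sketch: upload only the files modified within the last day.
cd /home/archive/RECORDINGS/ || exit 1
find . -type f -mtime -1 -print0 | while IFS= read -r -d '' file; do
    aws s3 cp "$file" "s3://call-recording-backup/${file#./}"
done

This still walks the local tree, but it skips the per-file comparison against the bucket and only spawns an upload for the files that actually changed, which stays cheap as long as the daily delta is small.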
In my case the solution was fairly simple, as call-center agents cannot travel through time (yet). New call recordings are created every day, so all I have to do is make sure I upload the recordings from the previous day.
aws s3 sync /home/archive/RECORDINGS/$(date +%F --date="yesterday") s3://vici-call-recording-backup/$(date +%F --date="yesterday")
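To keep this hands-off, you can wrap the command in a small script and schedule it from cron (the script path and schedule below are examples, not part of the original setup). Putting the command in a script also sidesteps cron's special treatment of the % character used by date.

#!/bin/bash
# backup-recordings.sh: sync yesterday's recordings to S3.
DAY=$(date +%F --date="yesterday")
aws s3 sync "/home/archive/RECORDINGS/$DAY" "s3://vici-call-recording-backup/$DAY"

A crontab entry to run it at 01:15 every night could then look like this:

15 1 * * * /usr/local/bin/backup-recordings.sh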
The solution was obvious in the end, but the official docs made it sound like they offer a remedy for the problem, when in reality it has no effect at this scale.