2023-07-26
Uploading huge files can take a long time. To solve this problem, we can upload a single file as a set of parts, each part representing a contiguous byte range of the file. These parts are independent of one another and can be uploaded in parallel; one part failing to upload will not stop the entire upload.
The process of multipart uploads in AWS is as follows:

1. Initiate the upload with CreateMultipartUpload, which returns an upload ID.
2. Upload the parts, referencing that upload ID.
3. Complete the upload with CompleteMultipartUpload, listing every part.

Parts can be uploaded with UploadPart or UploadPartCopy, each with their pros, cons, and limitations. How you orchestrate your multipart upload will depend greatly on your final file size, the sizes of each part, as well as the limits of multipart upload.
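A minimal sketch of that three-step flow, assuming Python with boto3 (the post does not show its own code here) and placeholder bucket and key names:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "merged-object"  # placeholders

# 1. Initiate the multipart upload and keep the upload ID.
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

# 2. Upload parts (UploadPart shown; UploadPartCopy also works). Parts are
#    independent, so they can be uploaded in parallel and in any order.
part = s3.upload_part(
    Bucket=bucket,
    Key=key,
    UploadId=upload_id,
    PartNumber=1,
    Body=b"\0" * (5 * 1024 * 1024),  # at least 5 MiB unless it is the last part
)

# 3. Complete the upload by listing every part number with its ETag.
s3.complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=upload_id,
    MultipartUpload={"Parts": [{"PartNumber": 1, "ETag": part["ETag"]}]},
)
```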
As of this post, the limits of AWS multipart upload are as follows:

- Maximum object size: 5 TiB
- Maximum number of parts: 10000
- Part numbers: [1, 10000]
- Part size: [5 MiB, 5 GiB], with no minimum size limit on the last part

With the above limits, you may face some problems, such as merging files whose sizes fall outside the part size limit of [5 MiB, 5 GiB].

UploadPart
This uploads a part in a multipart upload. You must provide the content body of the part, so if you are running this in a Lambda, you will be holding the content body in memory.
A typical implementation of multipart uploads would be to retrieve byte ranges of a huge file and upload them part by part using UploadPart.
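A minimal sketch of that approach, again assuming boto3; upload_byte_range_as_part is a hypothetical helper:

```python
import boto3

s3 = boto3.client("s3")

def upload_byte_range_as_part(bucket, source_key, dest_key, upload_id,
                              part_number, start, end):
    """Fetch bytes [start, end] of a source object and upload them as a part.

    The whole range is held in memory, so part sizes are bounded by the
    memory available to the function (e.g. a Lambda's memory limit).
    """
    body = s3.get_object(
        Bucket=bucket,
        Key=source_key,
        Range=f"bytes={start}-{end}",  # HTTP ranges are inclusive on both ends
    )["Body"].read()
    response = s3.upload_part(
        Bucket=bucket,
        Key=dest_key,
        UploadId=upload_id,
        PartNumber=part_number,
        Body=body,
    )
    return {"PartNumber": part_number, "ETag": response["ETag"]}
```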
UploadPartCopy
This copies an existing S3 object to be used as a part in a multipart upload. Note that you can specify byte ranges as well. Compared to UploadPart, you do not have to hold the content body in memory, allowing you to specify larger parts using this method.
If you need to merge multiple objects that are within the part size limit of [5 MiB, 5 GiB], this is an efficient way to do so.
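A similar sketch for the copy-based approach; copy_object_as_part is again a hypothetical helper:

```python
import boto3

s3 = boto3.client("s3")

def copy_object_as_part(bucket, source_key, dest_key, upload_id,
                        part_number, start=None, end=None):
    """Use an existing object (or a byte range of it) as a part.

    Nothing is downloaded: S3 copies the bytes server-side, so parts can be
    as large as the 5 GiB part size limit allows.
    """
    kwargs = dict(
        Bucket=bucket,
        Key=dest_key,
        UploadId=upload_id,
        PartNumber=part_number,
        CopySource={"Bucket": bucket, "Key": source_key},
    )
    if start is not None:
        kwargs["CopySourceRange"] = f"bytes={start}-{end}"
    response = s3.upload_part_copy(**kwargs)
    return {"PartNumber": part_number, "ETag": response["CopyPartResult"]["ETag"]}
```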
Given many files of different sizes, how can you merge them all using multipart uploads?
We can achieve the above by optimising part sizes, as well as the number of parts. This allows us to upload the biggest file using the smallest number of parts.
This is the process I found worked for our use case:
1. Get the size of each file by running HeadObject on them.
2. Files of at least 5 MiB are considered large, and must be handled differently from parts smaller than 5 MiB, which are considered small.
3. Large parts can be uploaded using UploadPartCopy, since we can use byte ranges to ensure parts are within the part size limit.
4. Files larger than 5 GiB must be split into multiple parts using UploadPartCopy. Split them into parts of 5 GiB - 5 MiB; this ensures that the final part is at least 5 MiB in size.
5. Small parts must be uploaded using UploadPart, since we have to construct parts that are at least 5 MiB in size. Form at least 5 MiB groups using small parts, or form groups as large as your function or program allows.

Note that through this process, the original order of the files is not preserved. If the order in which the files are merged is important, this process will not work.
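Here is a rough sketch of how that planning step might look. plan_parts is a hypothetical helper, and folding a sub-5 MiB remainder into the previous chunk is my own guard for keeping every chunk within the limits:

```python
MiB = 1024 * 1024
GiB = 1024 * MiB
MIN_PART = 5 * MiB           # minimum part size (except the final part)
MAX_PART = 5 * GiB           # maximum part size
CHUNK = MAX_PART - MIN_PART  # 5 GiB - 5 MiB

def plan_parts(files):
    """Turn [(key, size), ...] into an ordered list of part instructions.

    Large files (>= 5 MiB) become "copy" instructions for UploadPartCopy,
    split by byte range when they exceed 5 GiB. Small files (< 5 MiB) are
    grouped into "upload" instructions for UploadPart, each group at least
    5 MiB in total. Groups go last, so an undersized leftover group can
    legally be the final part.
    """
    copies, uploads, small = [], [], []

    for key, size in files:
        if size < MIN_PART:
            small.append((key, size))
            continue
        # Split large files into chunks of 5 GiB - 5 MiB. If the remainder
        # would fall below 5 MiB, fold it into the last chunk, which then
        # still fits under the 5 GiB part size limit.
        start = 0
        while start < size:
            end = min(start + CHUNK, size)
            if end < size and size - end < MIN_PART:
                end = size
            copies.append({"op": "copy", "key": key, "range": (start, end - 1)})
            start = end

    # Group small files into batches of at least 5 MiB each.
    batch, batch_size = [], 0
    for key, size in small:
        batch.append(key)
        batch_size += size
        if batch_size >= MIN_PART:
            uploads.append({"op": "upload", "keys": batch})
            batch, batch_size = [], 0
    if batch:
        uploads.append({"op": "upload", "keys": batch})  # final part may be < 5 MiB

    return copies + uploads
```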
If the order matters, I suggest you explore using byte ranges to create parts across multiple files.
Since UploadPartCopy only works for a single object, you are restricted to using UploadPart to upload your part. Therefore, the sizes of your parts will be restricted by the memory limits of your function or program.
This was a fun problem to solve, especially when I had the eureka moment of splitting files larger than 5 GiB into parts of 5 GiB - 5 MiB.
It is helpful to structure your functions such that one generates a set of instructions to be executed, while another consumes those instructions and executes them. This allows you to test your feature logic without caring about implementation details.
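For example, with a pure planner like the hypothetical plan_parts sketch above, the splitting rules can be unit-tested with plain tuples and no S3 calls at all:

```python
def test_large_file_is_split_within_part_limits():
    # 12 GiB splits into 5115 MiB + 5115 MiB + 2058 MiB: every chunk sits
    # within the [5 MiB, 5 GiB] part size limit.
    instructions = plan_parts([("big.bin", 12 * GiB)])
    assert len(instructions) == 3
    for ins in instructions:
        start, end = ins["range"]
        assert MIN_PART <= end - start + 1 <= MAX_PART
```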
Here is the code.
Thanks for reading!