Class StreamTransferManager
- java.lang.Object
  - alex.mojaki.s3upload.StreamTransferManager
public class StreamTransferManager extends Object
Manages streaming of data to S3 without knowing the size beforehand and without keeping it all in memory or writing to disk. The data is split into chunks and uploaded using the multipart upload API by one or more separate threads.
After creating an instance with details of the upload, use getMultiPartOutputStreams() to get a list of MultiPartOutputStreams. When you finish writing data, call MultiPartOutputStream.close(). Parts will be uploaded to S3 as you write.
Once all streams have been closed, call complete(). Alternatively you can call abort() at any point if needed.
Here is an example. A lot of the code relates to setting up threads for creating data unrelated to the library. The essential parts are commented.
    AmazonS3Client client = new AmazonS3Client(awsCreds);
    // Setting up
    int numStreams = 2;
    final StreamTransferManager manager = new StreamTransferManager(bucket, key, client)
            .numStreams(numStreams)
            .numUploadThreads(2)
            .queueCapacity(2)
            .partSize(10);
    final List<MultiPartOutputStream> streams = manager.getMultiPartOutputStreams();
    ExecutorService pool = Executors.newFixedThreadPool(numStreams);
    for (int i = 0; i < numStreams; i++) {
        final int streamIndex = i;
        pool.submit(new Runnable() {
            public void run() {
                try {
                    MultiPartOutputStream outputStream = streams.get(streamIndex);
                    for (int lineNum = 0; lineNum < 1000000; lineNum++) {
                        String line = generateData(streamIndex, lineNum);
                        // Writing data and potentially sending off a part
                        outputStream.write(line.getBytes());
                    }
                    // The stream must be closed once all the data has been written
                    outputStream.close();
                } catch (Exception e) {
                    // Aborts all uploads
                    manager.abort(e);
                }
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    // Finishing off
    manager.complete();
The final file on S3 will then usually be the result of concatenating all the data written to each stream, in the order in which the streams appear in the list obtained from getMultiPartOutputStreams(). However, this may not be true if multiple streams are used and some of them produce less than 5 MB of data. This is because the multipart upload API does not allow the uploading of more than one part smaller than 5 MB, which leads to fundamental limits on what this class can accomplish. If order of data is important to you, then either use only one stream or ensure that you write at least 5 MB to every stream.
While performing the multipart upload this class will create instances of InitiateMultipartUploadRequest, UploadPartRequest, and CompleteMultipartUploadRequest, fill in the essential details, and send them off. If you need to add additional details then override the appropriate customise*Request methods and set the required properties within. Note that if no data is written (i.e. the object body is empty) then a normal (not multipart) upload will be performed and customisePutEmptyObjectRequest will be called instead.
This class does not perform retries when uploading. If an exception is thrown at any stage the upload will be aborted and the exception rethrown, wrapped in a RuntimeException.
You can configure the upload process by calling any of the chaining setter methods numStreams(int), numUploadThreads(int), queueCapacity(int), or partSize(long) before calling getMultiPartOutputStreams. Parts that have been produced sit in a queue of specified capacity while they wait for a thread to upload them. The worst case memory usage is (numUploadThreads + queueCapacity) * partSize + numStreams * (partSize + 6MB), while higher values for these first three parameters may lead to better resource usage and throughput. If you are uploading very large files, you may need to increase the part size - see partSize(long) for details.
Author:
- Alex Hall
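To make the worst-case memory formula from the class description concrete, here is a minimal sketch that simply evaluates it. The class UploadMemoryEstimate and its method are hypothetical helpers written for illustration; they are not part of this library.

```java
/** Hypothetical helper (not part of the library): evaluates the documented worst-case memory formula. */
final class UploadMemoryEstimate {
    // The fixed "6MB" per-stream term that appears in the documented formula.
    private static final long STREAM_OVERHEAD_MB = 6;

    /**
     * (numUploadThreads + queueCapacity) * partSize + numStreams * (partSize + 6MB),
     * with every size expressed in MB.
     */
    static long worstCaseMb(int numStreams, int numUploadThreads, int queueCapacity, long partSizeMb) {
        return (long) (numUploadThreads + queueCapacity) * partSizeMb
                + (long) numStreams * (partSizeMb + STREAM_OVERHEAD_MB);
    }

    public static void main(String[] args) {
        // Settings from the class-level example: 2 streams, 2 upload threads, queue capacity 2, 10 MB parts.
        System.out.println(worstCaseMb(2, 2, 2, 10) + " MB"); // prints "72 MB"
        // Documented defaults: 1 stream, 1 upload thread, queue capacity 1, 5 MB parts.
        System.out.println(worstCaseMb(1, 1, 1, 5) + " MB");  // prints "21 MB"
    }
}
```

An estimate like this can help when choosing settings against a fixed heap budget before calling getMultiPartOutputStreams().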
-
-
Field Summary
Fields (Modifier and Type, Field)
- protected String bucketName
- protected boolean checkIntegrity
- protected int numStreams
- protected int numUploadThreads
- protected int partSize
- protected String putKey
- protected int queueCapacity
- protected com.amazonaws.services.s3.AmazonS3 s3Client
- protected String uploadId
-
Constructor Summary
Constructors (Constructor, Description)
- StreamTransferManager(String bucketName, String putKey, com.amazonaws.services.s3.AmazonS3 s3Client)
- StreamTransferManager(String bucketName, String putKey, com.amazonaws.services.s3.AmazonS3 s3Client, int numStreams, int numUploadThreads, int queueCapacity, int partSize): Deprecated.
-
Method Summary
Methods (Modifier and Type, Method, Description)
- void abort(): Aborts the upload.
- RuntimeException abort(Throwable t): Aborts the upload and rethrows the argument, wrapped in a RuntimeException if necessary.
- StreamTransferManager checkIntegrity(boolean checkIntegrity): Sets whether a data integrity check should be performed during and after upload.
- void complete(): Blocks while waiting for the threads uploading the contents of the streams returned by getMultiPartOutputStreams() to finish, then sends a request to S3 to complete the upload.
- void customiseCompleteRequest(com.amazonaws.services.s3.model.CompleteMultipartUploadRequest request)
- void customiseInitiateRequest(com.amazonaws.services.s3.model.InitiateMultipartUploadRequest request)
- void customisePutEmptyObjectRequest(com.amazonaws.services.s3.model.PutObjectRequest request)
- void customiseUploadPartRequest(com.amazonaws.services.s3.model.UploadPartRequest request)
- List<MultiPartOutputStream> getMultiPartOutputStreams(): Get the list of output streams to write to.
- StreamTransferManager numStreams(int numStreams): Sets the number of MultiPartOutputStreams that will be created and returned by getMultiPartOutputStreams() for you to write to.
- StreamTransferManager numUploadThreads(int numUploadThreads): Sets the number of threads that will be created to upload the data in parallel to S3.
- StreamTransferManager partSize(long partSize): Sets the size in MB of the parts to be uploaded to S3.
- StreamTransferManager queueCapacity(int queueCapacity): Sets the capacity of the queue where completed parts from the output streams will sit waiting to be taken by the upload threads.
- String toString()
-
-
-
Field Detail
-
bucketName
protected final String bucketName
-
putKey
protected final String putKey
-
s3Client
protected final com.amazonaws.services.s3.AmazonS3 s3Client
-
uploadId
protected String uploadId
-
numStreams
protected int numStreams
-
numUploadThreads
protected int numUploadThreads
-
queueCapacity
protected int queueCapacity
-
partSize
protected int partSize
-
checkIntegrity
protected boolean checkIntegrity
-
-
Constructor Detail
-
StreamTransferManager
public StreamTransferManager(String bucketName, String putKey, com.amazonaws.services.s3.AmazonS3 s3Client)
-
StreamTransferManager
@Deprecated public StreamTransferManager(String bucketName, String putKey, com.amazonaws.services.s3.AmazonS3 s3Client, int numStreams, int numUploadThreads, int queueCapacity, int partSize)
Deprecated. Deprecated constructor kept for backward compatibility. Use StreamTransferManager(String, String, AmazonS3) and then chain the desired setters.
-
-
Method Detail
-
numStreams
public StreamTransferManager numStreams(int numStreams)
Sets the number of MultiPartOutputStreams that will be created and returned by getMultiPartOutputStreams() for you to write to.
By default this is 1; increase it if you want to write to multiple streams from different threads in parallel.
If you are writing large files with many streams, you may need to increase the part size to avoid running out of part numbers - see partSize(long) for more details.
Each stream may hold up to partSize(long) + 6MB in memory at a time.
Returns:
- this StreamTransferManager for chaining.
Throws:
- IllegalArgumentException - if the argument is less than 1.
- IllegalStateException - if getMultiPartOutputStreams() has already been called, initiating the upload.
-
numUploadThreads
public StreamTransferManager numUploadThreads(int numUploadThreads)
Sets the number of threads that will be created to upload the data in parallel to S3.
By default this is 1; increase it if uploading is a speed bottleneck and you have network bandwidth to spare.
Each thread may hold up to partSize(long) in memory at a time.
Returns:
- this StreamTransferManager for chaining.
Throws:
- IllegalArgumentException - if the argument is less than 1.
- IllegalStateException - if getMultiPartOutputStreams() has already been called, initiating the upload.
-
queueCapacity
public StreamTransferManager queueCapacity(int queueCapacity)
Sets the capacity of the queue where completed parts from the output streams will sit waiting to be taken by the upload threads.
By default this is 1; increase it if you want the threads that write to the streams to stay consistently busy instead of blocking while they wait for upload threads.
Each part sitting in the queue will hold up to partSize(long) MB in memory at a time.
Returns:
- this StreamTransferManager for chaining.
Throws:
- IllegalArgumentException - if the argument is less than 1.
- IllegalStateException - if getMultiPartOutputStreams() has already been called, initiating the upload.
-
partSize
public StreamTransferManager partSize(long partSize)
Sets the size in MB of the parts to be uploaded to S3.
By default this is 5, which is the minimum that AWS allows. You may need to increase it if you are uploading very large files or writing to many output streams.
AWS allows up to 10,000 parts to be uploaded for a single object, and each part must be identified by a unique number from 1 to 10,000. These part numbers are allocated evenly by the manager to each output stream. Therefore the maximum amount of data that can be written to a stream is 10000/numStreams * partSize. If you try to write more, an IndexOutOfBoundsException will be thrown. The total object size can be at most 5 TB, so if you're using just one stream, there is no reason to set this higher than 525. If you're using more streams, you may want a higher value in case some streams get more data than others.
Increasing this value will of course increase memory usage.
Returns:
- this StreamTransferManager for chaining.
Throws:
- IllegalArgumentException - if the argument is less than 5.
- IllegalArgumentException - if the resulting part size in bytes cannot fit in a 32 bit int.
- IllegalStateException - if getMultiPartOutputStreams() has already been called, initiating the upload.
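The per-stream limit and the figure of 525 quoted for partSize(long) can be checked with a little arithmetic. The sketch below is a hypothetical helper, not library code, and it assumes that "allocated evenly" means integer division of the 10,000 part numbers by the number of streams; the library's exact allocation may differ.

```java
/** Hypothetical helper (not part of the library) for reasoning about part size limits. */
final class PartSizePlanner {
    static final int MAX_PARTS = 10_000;                // AWS limit on parts per object
    static final long MAX_OBJECT_MB = 5L * 1024 * 1024; // 5 TB expressed in MB
    static final long MIN_PART_SIZE_MB = 5;             // minimum part size accepted by partSize(long)

    /** Documented limit 10000/numStreams * partSize, assuming integer division. */
    static long maxMbPerStream(int numStreams, long partSizeMb) {
        return (MAX_PARTS / numStreams) * partSizeMb;
    }

    /** Smallest part size (in MB) that lets each stream accept perStreamMb of data. */
    static long minPartSizeMb(long perStreamMb, int numStreams) {
        long partsPerStream = MAX_PARTS / numStreams;
        long needed = (perStreamMb + partsPerStream - 1) / partsPerStream; // ceiling division
        return Math.max(MIN_PART_SIZE_MB, needed);
    }

    public static void main(String[] args) {
        System.out.println(maxMbPerStream(1, 5));            // prints 50000
        System.out.println(maxMbPerStream(4, 5));            // prints 12500
        System.out.println(minPartSizeMb(MAX_OBJECT_MB, 1)); // prints 525
    }
}
```

With the default 5 MB parts and a single stream, this works out to roughly 49 GB before part numbers run out, which is why larger uploads need a larger part size.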
-
checkIntegrity
public StreamTransferManager checkIntegrity(boolean checkIntegrity)
Sets whether a data integrity check should be performed during and after upload.
By default this is disabled.
The integrity check consists of two steps. First, each uploaded part is verified by setting the Content-MD5 header for Amazon S3 to check against its own hash. If they don't match, the AWS SDK will throw an exception. The header value is the base64-encoded 128-bit MD5 digest of the part body.
The second step is to ensure integrity of the final object merged from the uploaded parts. This is achieved by comparing the expected ETag value with the actual ETag returned by S3. However, the ETag value is not an MD5 hash. When S3 combines the parts of a multipart upload into the final object, the ETag value is set to the hex-encoded MD5 hash of the concatenated binary-encoded MD5 hashes of each part followed by "-" and the number of parts, for instance:
    57f456164b0e5f365aaf9bb549731f32-95
Note that AWS doesn't document this, so their hashing algorithm might change without notice, which would lead to false alarm exceptions. If the ETags don't match, an IntegrityCheckException will be thrown after completing the upload. This will not abort or revert the upload.
Parameters:
- checkIntegrity - true if data integrity should be checked
Returns:
- this StreamTransferManager for chaining.
Throws:
- IllegalStateException - if getMultiPartOutputStreams() has already been called, initiating the upload.
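The two hashes described for checkIntegrity can be sketched with the JDK alone. This is an illustration of the scheme as described above, not the library's actual implementation; the class MultipartEtagSketch and its methods are hypothetical names, and the ETag format itself is, as noted, undocumented by AWS.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Hypothetical sketch (not library code) of the hashes used by the integrity check. */
final class MultipartEtagSketch {
    static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required on every Java platform", e);
        }
    }

    /** Step 1: Content-MD5 header value, the base64-encoded MD5 digest of one part body. */
    static String contentMd5(byte[] partBody) {
        return Base64.getEncoder().encodeToString(newMd5().digest(partBody));
    }

    /** Step 2: hex MD5 of the concatenated binary part digests, then "-" and the part count. */
    static String expectedEtag(List<byte[]> parts) {
        MessageDigest outer = newMd5();
        for (byte[] part : parts) {
            outer.update(newMd5().digest(part)); // feed each part's raw 16-byte digest
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : outer.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex + "-" + parts.size();
    }

    public static void main(String[] args) {
        List<byte[]> parts = Arrays.asList(
                "first part".getBytes(StandardCharsets.UTF_8),
                "second part".getBytes(StandardCharsets.UTF_8));
        System.out.println(contentMd5(parts.get(0)));
        System.out.println(expectedEtag(parts)); // 32 hex characters followed by "-2"
    }
}
```

Because the outer hash covers the part digests in order, the same data split at different part boundaries yields a different ETag, so the expected value has to be computed from the parts exactly as they were uploaded.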
-
getMultiPartOutputStreams
public List<MultiPartOutputStream> getMultiPartOutputStreams()
Get the list of output streams to write to.
The first call to this method initiates the multipart upload. All setter methods must be called before this.
-
complete
public void complete()
Blocks while waiting for the threads uploading the contents of the streams returned by getMultiPartOutputStreams() to finish, then sends a request to S3 to complete the upload. For the former to complete, it's essential that every stream is closed, otherwise the upload threads will block forever waiting for more data.
-
abort
public RuntimeException abort(Throwable t)
Aborts the upload and rethrows the argument, wrapped in a RuntimeException if necessary. Write throw abort(e) to make it clear to the compiler and readers that the code stops here.
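The reason for writing throw abort(e) is easiest to see in a method that returns a value. The sketch below uses a hypothetical stand-in for abort(Throwable) that only mimics the documented contract (always throws, wrapping in a RuntimeException if necessary); it does not talk to S3 and is not the library's code.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch (not library code) of the "throw abort(e)" idiom. */
final class AbortPatternSketch {
    /** Stand-in for manager.abort(Throwable): declared to return, but always throws. */
    static RuntimeException abort(Throwable t) {
        // A real manager would also cancel the multipart upload at this point.
        if (t instanceof RuntimeException) {
            throw (RuntimeException) t;
        }
        throw new RuntimeException(t);
    }

    static int lengthOf(Callable<String> source) {
        try {
            return source.call().length();
        } catch (Exception e) {
            // "throw" tells the compiler this branch never falls through,
            // so no dummy return statement is needed after the catch block.
            throw abort(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lengthOf(() -> "abc")); // prints 3
    }
}
```

If the catch block called abort(e) without throw, this method would fail to compile with a missing return statement, even though abort never returns normally.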
-
abort
public void abort()
Aborts the upload. Repeated calls have no effect.
-
customiseInitiateRequest
public void customiseInitiateRequest(com.amazonaws.services.s3.model.InitiateMultipartUploadRequest request)
-
customiseUploadPartRequest
public void customiseUploadPartRequest(com.amazonaws.services.s3.model.UploadPartRequest request)
-
customiseCompleteRequest
public void customiseCompleteRequest(com.amazonaws.services.s3.model.CompleteMultipartUploadRequest request)
-
customisePutEmptyObjectRequest
public void customisePutEmptyObjectRequest(com.amazonaws.services.s3.model.PutObjectRequest request)
-
-