
Add a MultipartUploadStream IO object #46

Merged · 30 commits · Nov 16, 2023
Conversation

nantiamak (Collaborator):
This PR adds a MultipartUploadStream IO object to CloudStore.jl, which facilitates uploading a stream of data chunks to blob storage. After creating a MultipartUploadStream object, we can repeatedly call write() to upload each part; once there are no more chunks, we complete the upload and close the stream.

nantiamak (Collaborator, Author):
@Drvi I pushed a first version of the MultipartUploadStream struct with two tests, one for S3 and one for Azure. I guess we could also integrate with GzipCompressorStream, but I think it's not necessary for starters.

@nantiamak nantiamak requested a review from Drvi October 17, 2023 12:22
codecov bot commented Oct 17, 2023:
Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (328b427) 83.13% compared to head (0790f79) 84.21%.

❗ Current head 0790f79 differs from pull request most recent head c18952b. Consider uploading reports for the commit c18952b to get more accurate results

Files           Patch %   Lines
src/object.jl   94.91%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   83.13%   84.21%   +1.07%     
==========================================
  Files           7        7              
  Lines         587      646      +59     
==========================================
+ Hits          488      544      +56     
- Misses         99      102       +3     


Drvi (Member) left a comment:

Hey Nantia and thanks for the PR. I think the implementation would work, but it will probably be noticeably slower than CloudStore.put because we don't attempt any parallelism here. Even though I think your main concern was memory overhead, it shouldn't be that hard to get even faster than CloudStore.put since we should be able to overlap arrow serialization and uploading.

An approach inspired by PrefetchedDownloadStream would be:
a) When creating MultipartUploadStream, you'd also spawn multiple tasks
b) The spawned tasks would try to take buffers from a new Channel field we'd add to the MultipartUploadStream, and upload them
c) When we call Base.write(io::MultipartUploadStream, bytes) we put bytes to the Channel and let the tasks handle it.
d) The close method would have to block until all submitted buffers were successfully uploaded, and then shut down the tasks.

For bonus points, once we're done uploading a buffer, we'd give it back to the caller, maybe to a user-provided channel, so you could safely reuse them.
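The steps (a)–(d) above, plus the buffer-reuse bonus, could be sketched roughly as follows. This is a minimal illustration independent of CloudStore.jl internals: `UploadStreamSketch`, `upload_part`, and `nworkers` are hypothetical names, and the real implementation would carry part numbers, eTags, and error state as well.

```julia
# Minimal sketch of the proposed channel-based design (names are illustrative).
struct UploadStreamSketch
    queue::Channel{Vector{UInt8}}   # buffers submitted by write()
    done::Channel{Vector{UInt8}}    # bonus: hand uploaded buffers back for reuse
    tasks::Vector{Task}
end

function UploadStreamSketch(upload_part::Function; nworkers::Int=4)
    queue = Channel{Vector{UInt8}}(Inf)
    done = Channel{Vector{UInt8}}(Inf)
    # (a)+(b): spawn workers that pull buffers from the channel and upload them
    tasks = map(1:nworkers) do _
        Threads.@spawn for buf in queue
            upload_part(buf)       # stand-in for the real per-part upload
            put!(done, buf)        # give the buffer back to the caller
        end
    end
    return UploadStreamSketch(queue, done, tasks)
end

# (c): write just enqueues the bytes and lets the workers handle the upload
Base.write(io::UploadStreamSketch, bytes::Vector{UInt8}) = put!(io.queue, bytes)

# (d): close blocks until all submitted buffers were uploaded, then shuts down
function Base.close(io::UploadStreamSketch)
    close(io.queue)            # workers drain remaining buffers, then exit
    foreach(wait, io.tasks)    # block until every part is uploaded
    close(io.done)
end
```

Closing the queue lets the `for buf in queue` loops terminate naturally after draining, which is what makes `close` both a barrier and a shutdown in this sketch.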

nantiamak (Collaborator, Author):
Hi @Drvi, thanks for the feedback! It's very helpful.
One question I have is about the number of spawned tasks for the upload. For download/prefetching this is dependent on prefetch_multipart_size and the size of input, but for upload we cannot know upfront the total upload size. How should the number of spawned tasks be determined in this case?
I think this is also related to how we'll know when all the tasks are done and the parts uploaded, as for _download_task we decrease io.cond.ntasks by 1 each time a task is done, if I understand correctly.

Drvi (Member) commented Oct 23, 2023:

@nantiamak Ah, good point. I think the design space is a bit larger than I initially thought :) A couple of options:

How many tasks to spawn?

  • Pre-spawn a fixed number of tasks, or a number based on some estimate of the input size. Relatively easy, but you might over-spawn for small files...
  • Spawn tasks dynamically in write, up to a limit, as needed (e.g. uploading tasks grab a semaphore; if there is a new buffer to upload and the semaphore is not fully acquired, we check whether we've already spawned the maximum number of tasks). This is probably trickier to implement.
  • Or we can do what CloudStore.put does and just always spawn a new task for each buffer during write; probably the simplest.

In any case, we should target each buffer being 8MiB (MULTIPART_SIZE). For PrefetchedDownloadStream I had to experiment quite a bit on EC2 to figure out which combinations of buffer sizes and number of tasks worked well; usually it was a good idea to follow the behavior of CloudStore.get.

I think this is also related to how we'll know when all the tasks are done and the parts uploaded

Yes, so for PrefetchedDownloadStream I used TaskCondition as a counter, but since you are already using the OrderedSynchronizer, I think you could use their counter that is used internally (https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/main/src/synchronizer.jl#L120C1-L120C20) together with the condition. Maybe we should rethink the API of OrderedSynchronizer so we wouldn't have to touch internals like this...

Drvi (Member) left a comment:

Hey Nantia, exciting progress! Left a couple of concurrency related notes. Thanks for working on this!

src/object.jl (outdated), comment on lines 485 to 488:
if x.ntasks == 0
break
end
wait(x.cond_wait)
Drvi (Member):

I think before we check the ntasks value, we should confirm that we're in a good state. If an upload task failed for whatever reason before it decremented the counter here, and if it notified before we grabbed the lock, we'd wait forever. The upload task grabs the lock to error-notify -- if, in the same block, you also set some state that signals the error, then it would be safe, because here in close:
a) if we didn't hold the lock, we'd wait for the upload task to error-notify and set the error state; then we'd grab the lock and check the error state;
b) if we did hold the lock, the upload task would successfully error-notify us once we reached wait.
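The pattern being described can be sketched with plain Base primitives. The field names (`ntasks`, `exc`, `cond_wait`) mirror the PR discussion but are assumptions here; the key point is that the error state is set and the condition notified under the same lock that the waiter holds while checking.

```julia
# Sketch of the error-safe shutdown: record errors and notify under one lock.
mutable struct UploadState
    ntasks::Int                          # parts still in flight
    exc::Union{Nothing,Exception}        # error state set by failed tasks
    cond_wait::Threads.Condition
end
UploadState(n::Int) = UploadState(n, nothing, Threads.Condition())

# Called by an upload task when it finishes (err === nothing) or fails.
function task_finished!(st::UploadState, err::Union{Nothing,Exception})
    Base.@lock st.cond_wait begin
        err === nothing ? (st.ntasks -= 1) : (st.exc = err)
        notify(st.cond_wait)             # error state and notify share the lock
    end
end

# Called from close(): re-check the error state before every wait, so a task
# that error-notified before we grabbed the lock can never be missed.
function wait_all(st::UploadState)
    Base.@lock st.cond_wait begin
        while true
            st.exc === nothing || throw(st.exc)
            st.ntasks == 0 && break
            wait(st.cond_wait)           # releases the lock while sleeping
        end
    end
end
```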

nantiamak (Collaborator, Author):

I added an is_error flag in MultipartUploadStream, which I'm setting to true here, when catching an exception. I'm looking for a test case checking this code path. Any thoughts?

src/object.jl Outdated
end
end
# atomically increment our part counter
@atomic io.cur_part_id += 1
Drvi (Member):

Since we increment after we call put!, I think it could happen that two tasks could reach the put! call with the same value

nantiamak (Collaborator, Author):

Indeed, I've seen this happening. On the other hand, if we increment before put!, two tasks might have incremented io.cur_part_id before reaching put!, and io.sync.i, which is incremented inside put!, will be != io.cur_part_id, resulting in a deadlock.

nantiamak (Collaborator, Author):

I fixed this by assigning io.cur_part_id to a variable and passing it as a parameter to _upload_task, rather than getting its value from io.cur_part_id directly.
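The fix can be sketched as follows: snapshot the counter into a local before spawning, so each task carries the id that was current when its chunk was submitted. `submit_chunk` and `part_id_counter` are illustrative names, not the PR's actual identifiers.

```julia
# Sketch: make the part id the caller's problem by capturing it before @spawn.
part_id_counter = Threads.Atomic{Int}(0)

function submit_chunk(upload_task::Function, chunk)
    # atomic_add! returns the *old* value, so the first part gets id 1
    this_part = Threads.atomic_add!(part_id_counter, 1) + 1
    # each task closes over its own local `this_part`, not the shared counter,
    # so two tasks can never observe the same id
    return Threads.@spawn upload_task(this_part, chunk)
end
```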

Drvi (Member):

Yes, making this the callers problem seems like a nice solution to me 👍

nantiamak (Collaborator, Author):
CI/Julia 1.6 is failing with UndefVarError: @atomic not defined. How can I get past this error? Is it an incompatibility with older Julia versions?

nantiamak (Collaborator, Author):
@Drvi About your comment "we should target each buffer being 8MiB (MULTIPART_SIZE)", the buffer is currently constructed outside write(). Inside write we only put it in the channel. Do you mean that buffers should be created inside write() or that wherever they're created, the batch size passed to a buffer should be of size MULTIPART_SIZE?

Drvi (Member) commented Oct 31, 2023:

Hey @nantiamak, sorry for the delay.

CI/Julia 1.6 is failing with UndefVarError: @atomic not defined. How to go past this error? Is it an incompatibility issue with older Julia versions?

You can see e.g. in the OrderedSynchronizer code how Jacob dealt with the issue -- https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/main/src/synchronizer.jl
There are @static if VERSION < v"1.7" version checks which are using the older API closed::Threads.Atomic{Bool} instead of @atomic closed::Bool etc.
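The compatibility pattern could look roughly like this: pick the field type and the increment code at parse time based on the Julia version. `PartCounter`, `increment!`, and `current` are hypothetical names for illustration, not ConcurrentUtilities.jl's actual API.

```julia
# Sketch of the @static version gate: Threads.Atomic on 1.6, @atomic on 1.7+.
@static if VERSION < v"1.7"
    mutable struct PartCounter
        n::Threads.Atomic{Int}           # pre-1.7 atomic wrapper type
    end
    PartCounter() = PartCounter(Threads.Atomic{Int}(0))
    increment!(c::PartCounter) = Threads.atomic_add!(c.n, 1) + 1  # returns new value
    current(c::PartCounter) = c.n[]
else
    mutable struct PartCounter
        @atomic n::Int                   # 1.7+ atomic struct field
    end
    PartCounter() = PartCounter(0)
    increment!(c::PartCounter) = @atomic c.n += 1                 # returns new value
    current(c::PartCounter) = @atomic c.n
end
```

Because `@static if` resolves at parse time, only one struct definition is ever evaluated, so callers use the same `increment!`/`current` interface on both versions.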

@Drvi About your comment "we should target each buffer being 8MiB (MULTIPART_SIZE)", the buffer is currently constructed outside write(). Inside write we only put it in the channel. Do you mean that buffers should be created inside write() or that wherever they're created, the batch size passed to a buffer should be of size MULTIPART_SIZE?

I meant that in our usage code we should target roughly 8MiB chunks to be given to the MultipartUploadStream. On the other hand, I agree it would be useful for the MultipartUploadStream to have the ability to do the chunking internally, but I don't know how to do that without it being more complicated than just chunking externally. I'd say this is an open design question worth some experimenting.

nantiamak (Collaborator, Author):
@Drvi Thanks for the pointer!
Indeed, I agree that chunking is more straightforward to do externally for the time being.

nantiamak (Collaborator, Author):
@Drvi Could you please take another look at this PR to see if we can merge it?

Drvi (Member) commented Nov 13, 2023:

@nantiamak Sorry for the late reply, in short:

  • We should either document that this should be called behind a semaphore or implement the semaphore throttling ourselves (put! does throttle the number of tasks in flight and we should not deviate from that)
  • I think we should document that the chunks need to be written in order, which makes me think that we don't really need the OrderedSynchronizer: when we get the (part_n, partETag), all we need to do is io.eTags[part_n] = partETag behind the lock (making sure io.eTags is grown as needed).
  • Our close method does two things -- it waits for the submitted chunks to be written and then it closes the channel and calls completeMultipartUpload. I think we should separate the two into wait and close.
  • The wait should actually throw the error if there was one.

Also, could you try larger files for the benchmark, say 700MiB, and use a semaphore?

notify(io.cond_wait)
end
catch e
isopen(io.upload_queue) && close(io.upload_queue, e)
nantiamak (Collaborator, Author):

@Drvi Could you help me with adding a test case that triggers an error during uploading and covers these lines? I tried with uploading a part that was smaller than the minimum S3 size and got the following error, but it didn't exercise this catch block. 🤔

HTTP.Exceptions.StatusError(400, "POST", "/jl-minio-23869/test.csv?uploadId=7e57586f-e75d-4cf6-a14e-f480ebe655cd", HTTP.Messages.Response:
  """
  HTTP/1.1 400 Bad Request
  Accept-Ranges: bytes
  Cache-Control: no-cache
  Content-Length: 364
  Content-Security-Policy: block-all-mixed-content
  Content-Type: application/xml
  Server: MinIO
  Strict-Transport-Security: max-age=31536000; includeSubDomains
  Vary: Origin, Accept-Encoding
  X-Accel-Buffering: no
  X-Amz-Request-Id: 179736FE62556FB8
  X-Content-Type-Options: nosniff
  X-Xss-Protection: 1; mode=block
  Date: Mon, 13 Nov 2023 15:04:10 GMT
 
  <?xml version="1.0" encoding="UTF-8"?>
  <Error><Code>EntityTooSmall</Code><Message>Your proposed upload is smaller than the minimum allowed object size.</Message><Key>test.csv</Key><BucketName>jl-minio-23869</BucketName><Resource>/jl-minio-23869/test.csv</Resource><RequestId>179736FE62556FB8</RequestId><HostId>7b2b4a12-8baf-4d22-bc30-e4c3f2f12bff</HostId></Error>""")
  Stacktrace:
    [1] (::HTTP.ConnectionRequest.var"#connections#4"{HTTP.ConnectionRequest.var"#connections#1#5"{HTTP.TimeoutRequest.var"#timeouts#3"{HTTP.TimeoutRequest.var"#timeouts#1#4"{HTTP.ExceptionRequest.var"#exceptions#2"
...

Drvi (Member):

I think you had a good idea: try writing a small file and then testing that the channel is closed and that the exc field is populated.

nantiamak (Collaborator, Author):

I tried that, but when the error happens I get Exiting on signal: TERMINATED and the test terminates before I get to check the channel and the exception.

src/object.jl Outdated
Base.@lock io.cond_wait begin
while true
io.ntasks == 0 && !io.is_error && break
if io.is_error
nantiamak (Collaborator, Author):

@Drvi I added this if check here to account for the case when the upload task has error notified before we enter wait() here.

Drvi (Member):

I think you don't need the is_error field; you can check !isnothing(io.exc) instead. Check for the error first, and only if we don't throw, do io.ntasks == 0 && break.



end
end
catch e
rethrow()
Drvi (Member) commented Nov 14, 2023:

I think we are missing a way to signal to S3/Blobs that we're aborting the multi-part upload a la AbortMultipartUpload... can you open an issue about it?

Also, I think we should consider offering the following syntax to the user:

MultipartUploadStream(...) do io 
    ...
end

something like

function MultipartUploadStream(f::Function, args...; kwargs...)
    io = MultipartUploadStream(args...; kwargs...)
    try
        f(io)
        wait(io)
        close(io)
    catch e
        abort(io, e) # todo, we don't have this yet
        rethrow()
    end
end

nantiamak (Collaborator, Author):

There is already an issue for adding more low-level functionality around multipart uploads, including listing parts and listing uploads that are in progress, which are related to aborting: #3. Should I add a comment there about abort?

nantiamak (Collaborator, Author):

For the alternative syntax, f would need to encapsulate a call to MultipartUploadStream.write() though, right?

Drvi (Member):

#3

ah, sorry, I didn't realize there was an issue already

For the alternative syntax, f would need to encapsulate a call to MultipartUploadStream.write() though, right?

Yes, similar to what one would do with a local file

open("file") do io
    write(io, "my stuff")
end

src/object.jl (outdated), comment on lines 427 to 431:
mutable struct MultipartUploadStream <: IO
store::AbstractStore
url::String
credentials::Union{Nothing, AWS.Credentials, Azure.Credentials}
uploadState
Drvi (Member):

Should we specialize on the store type? What is the type of uploadState? Ideally, we'd specialize so that touching the credentials and store is inferrable

nantiamak (Collaborator, Author) commented Nov 14, 2023:

I replaced AbstractStore with Union{AWS.Bucket, Azure.Container} for store. uploadState is either a String or nothing.

Drvi (Member) commented Nov 15, 2023:

But do we want to specialize MultipartUploadStream{T<:AbstractStore}, similar to PrefetchedDownloadStream? It's possible that Julia will manage to union-split small unions like these, but it would be nice to make sure we don't box unnecessarily. It would also enforce that we have compatible Credential and Store objects (i.e. both Azure- or both AWS-specific).

nantiamak (Collaborator, Author):

I don't mind making MultipartUploadStream being parametrized on AbstractStore, but I'm not quite following what's the benefit in comparison to Union{AWS.Bucket, Azure.Container}, as I'm not very familiar with the details of boxing. We do use a union for credentials. Is this because there isn't an abstract struct for those?

Drvi (Member):

Maybe boxing is not the right CS term, but the idea is to make the code as inferrable as possible, so Julia can optimize things easily without having to do dynamic dispatch and so on. In these type-unstable cases, Julia tends to allocate defensively because it doesn't know which types it will encounter. In this specific example, it might be ok since the unions are small, but there is no harm in specializing these since people tend to know in advance if they want to talk to Azure or S3 anyway.
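The suggested specialization could be sketched like this. Everything here is a self-contained stand-in: `Bucket`/`Container` mimic AWS.Bucket/Azure.Container, and the `credentials_type` pairing is an assumed way to enforce matching vendors, not CloudStore.jl's actual mechanism.

```julia
# Sketch: parametrize the stream on the store type so field access is
# inferrable, and reject mismatched store/credential vendors at construction.
abstract type AbstractStore end
struct Bucket    <: AbstractStore end   # stand-in for AWS.Bucket
struct Container <: AbstractStore end   # stand-in for Azure.Container
struct AWSCredentials end
struct AzureCredentials end

# which credential type goes with which store (illustrative pairing)
credentials_type(::Type{Bucket})    = AWSCredentials
credentials_type(::Type{Container}) = AzureCredentials

mutable struct MultipartUploadStreamSketch{T<:AbstractStore,C}
    store::T           # concrete type known to the compiler, no union boxing
    credentials::C
    function MultipartUploadStreamSketch(store::T, creds) where {T<:AbstractStore}
        creds isa credentials_type(T) ||
            throw(ArgumentError("credentials do not match store vendor"))
        new{T,typeof(creds)}(store, creds)
    end
end
```

With the parametric form, `io.store` and `io.credentials` have concrete inferred types per instantiation, which is the inferrability point being made above.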

nantiamak (Collaborator, Author):

Thanks for the clarification! I admit I like the flexibility of Union 😄, but I see your point.

nantiamak (Collaborator, Author):
A couple more results on a larger file and with adding a semaphore for MultipartUploadStream.

Method                 Filename                Schema                                     Size (MB)  Time (s)  Allocations
MultipartUploadStream  csv_various_larger.csv  Tuple{Int64,Int64,Float64,Float64,VS,VS}   860.3      140.94   362.30 k allocations: 834.048 MiB, 0.01% gc time
Regular Put            csv_various_larger.csv  Tuple{Int64,Int64,Float64,Float64,VS,VS}   860.3       21.16   356.52 k allocations: 831.520 MiB, 0.04% gc time

There is a big difference now between MultipartUploadStream and put, which could be because I'm not configuring the semaphore correctly. I've currently set the default value to 4 * Threads.nthreads() similar to defaultBatchSize().

Drvi (Member) commented Nov 14, 2023:

Hmm, the performance difference seems problematic, we should investigate. Can you share the benchmarking code again?

nantiamak (Collaborator, Author):
@Drvi Regarding the following:

I think we should document that the chunks need to be written in order, which makes me think that we don't really need the OrderedSynchronizer, when we get the (part_n, partETag) all we need to do is io.eTags[part_n] = partETag behind the lock (and making sure the io.eTags is grown as needed).

Why should we change this behaviour for MultipartUploadStream? putObjectImpl(), which also does a multipart upload, works with an OrderedSynchronizer.

Drvi (Member) commented Nov 15, 2023:

Why should we change this behaviour for MultipartUploadStream? putObjectImpl(), which also does a multipart upload, works with an OrderedSynchronizer.

I just think it is simpler to use one synchronization mechanism than two. Since we already do the locking for the condition, we might as well assign the eTag to the eTags vector without involving the OrderedSynchronizer.

nantiamak (Collaborator, Author):
Ah I get your point now. But what do you mean by "making sure the io.eTags is grown as needed"? I only know of push! to grow a vector without knowing its size beforehand, but if I'm not mistaken you mean something different here.

nantiamak (Collaborator, Author):
@Drvi Do you maybe mean to use resize!?
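A minimal sketch of what "grow io.eTags as needed" could look like with resize!, assuming parts may complete out of order. The function name `record_etag!` and the use of an empty-string placeholder for not-yet-uploaded parts are illustrative assumptions, not the PR's final code.

```julia
# Sketch: grow the eTags vector under the lock, then assign by part number.
function record_etag!(etags::Vector{String}, lk::ReentrantLock,
                      part_n::Int, etag::String)
    Base.@lock lk begin
        if length(etags) < part_n
            old = length(etags)
            resize!(etags, part_n)
            # resize! leaves new String slots undefined; fill the gaps so the
            # vector stays fully readable while earlier parts are in flight
            for i in old+1:part_n-1
                etags[i] = ""
            end
        end
        etags[part_n] = etag
    end
    return etags
end
```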

nantiamak (Collaborator, Author):
@Drvi I think I've addressed all of your feedback.

@nantiamak nantiamak requested a review from Drvi November 16, 2023 11:59
Drvi (Member) left a comment:

I think this looks good, I've left a couple of docs improvement ideas but apart from that, I think this is ready to be used. I'm tempted to mark this API as experimental since we'd probably want a story about reusing buffers and handing them back to the caller via a channel and also a way to pre-spawn uploader tasks.

src/object.jl (outdated), comment on lines 407 to 408:
Data chunks are written to a channel and spawned tasks read buffers from this channel.
We expect the chunks to be written in order.
Drvi (Member):

I'd mention that this currently spawns one task per chunk, and let's explicitly mention the write method, and also put as an alternative.

nantiamak (Collaborator, Author):

Could you please elaborate a bit more on the use of put as an alternative here? Doesn't put require the total input to be known upfront?

Drvi (Member):

Just to mention that if you don't need to upload files in parts (or if you simply cannot, because your file is too small), you can always use put.

Drvi (Member) commented Nov 16, 2023:

Btw, I think the benchmark results are heavily influenced by the performance of copyto!. I added some logging to it:

@time "copy " copyto!(buf, 1, csv, i, nb)

And the copy got progressively slower over time:

copy : 0.016634 seconds
copy : 0.028683 seconds
copy : 0.041162 seconds
copy : 0.052870 seconds
copy : 0.065678 seconds
copy : 0.077296 seconds
copy : 0.089498 seconds
copy : 0.101765 seconds
copy : 0.113838 seconds
copy : 0.126034 seconds
copy : 0.139463 seconds
copy : 0.151689 seconds
copy : 0.163353 seconds
copy : 0.175153 seconds
copy : 0.187306 seconds
copy : 0.199294 seconds
copy : 0.212154 seconds
copy : 0.223732 seconds
copy : 0.236243 seconds
copy : 0.248252 seconds
copy : 0.260920 seconds
copy : 0.272577 seconds
copy : 0.284779 seconds
copy : 0.297669 seconds
copy : 0.309319 seconds
copy : 0.322311 seconds
copy : 0.333299 seconds
copy : 0.347037 seconds
copy : 0.357561 seconds
copy : 0.371530 seconds
copy : 0.382138 seconds
copy : 0.395172 seconds
copy : 0.408420 seconds
copy : 0.420083 seconds
copy : 0.430917 seconds
copy : 0.445276 seconds
copy : 0.455702 seconds
copy : 0.468660 seconds
copy : 0.481685 seconds
copy : 0.493671 seconds
copy : 0.504613 seconds
copy : 0.516266 seconds
copy : 0.528877 seconds
copy : 0.550385 seconds
copy : 0.554715 seconds
copy : 0.565004 seconds
copy : 0.577645 seconds
copy : 0.590031 seconds
copy : 0.603796 seconds
copy : 0.615178 seconds
copy : 0.625942 seconds
copy : 0.638203 seconds
copy : 0.650485 seconds
copy : 0.663207 seconds
copy : 0.674765 seconds
copy : 0.688626 seconds
copy : 0.699776 seconds
copy : 0.712376 seconds
copy : 0.724644 seconds
copy : 0.737175 seconds
copy : 0.748462 seconds
copy : 0.760385 seconds
copy : 0.777539 seconds
copy : 0.784269 seconds

nantiamak (Collaborator, Author):
Wow! Good catch. I didn't expect it to be something in the benchmark code.

nantiamak (Collaborator, Author) commented Nov 16, 2023:

Then, I'll remove the comment about performance getting worse with larger files for now, as it might be misleading, but I've mentioned that the API is experimental.

@nantiamak nantiamak merged commit d6dee8c into main Nov 16, 2023
5 checks passed
@nantiamak nantiamak deleted the nm-add-multipart-upload-stream branch November 16, 2023 14:50