Clean up dangling S3 multipart uploads #111955

Conversation

@DaveCTurner (Contributor) commented on Aug 18, 2024:

If Elasticsearch fails part-way through a multipart upload to S3 it will
generally try to abort the upload, but it's possible that the abort
attempt also fails. In this case the upload becomes _dangling_. Dangling
uploads consume storage space, and therefore cost money, until they are
eventually aborted.

Earlier versions of Elasticsearch require users to check for dangling
multipart uploads, and to manually abort any that they find. This commit
introduces a cleanup process which aborts all dangling uploads on each
snapshot delete instead.

Closes #44971
Closes #101169
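
For reference, the manual cleanup that earlier versions left to operators looks roughly like the following AWS SDK for Java (v1) sketch; the bucket name, base path, and one-day staleness cut-off are placeholders, and this is an illustration of the operation being automated rather than the code added by this change:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUpload;
import com.amazonaws.services.s3.model.MultipartUploadListing;
import java.time.Duration;
import java.time.Instant;

public class AbortDanglingUploads {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        String bucket = "my-snapshot-bucket";       // placeholder
        String prefix = "my-repository-base-path/"; // placeholder
        Instant cutoff = Instant.now().minus(Duration.ofDays(1)); // only abort clearly stale uploads

        ListMultipartUploadsRequest listRequest = new ListMultipartUploadsRequest(bucket).withPrefix(prefix);
        MultipartUploadListing listing;
        do {
            // List in-progress multipart uploads under the repository's base path.
            listing = s3.listMultipartUploads(listRequest);
            for (MultipartUpload upload : listing.getMultipartUploads()) {
                // Skip recent uploads to avoid racing with a snapshot that is still running.
                if (upload.getInitiated().toInstant().isBefore(cutoff)) {
                    // Abort the dangling upload so S3 stops charging for its parts.
                    s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, upload.getKey(), upload.getUploadId()));
                }
            }
            // Continue from the pagination markers if the listing was truncated.
            listRequest.setKeyMarker(listing.getNextKeyMarker());
            listRequest.setUploadIdMarker(listing.getNextUploadIdMarker());
        } while (listing.isTruncated());
    }
}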

@DaveCTurner added the >enhancement, :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), and v8.16.0 labels on Aug 18, 2024
Documentation preview:

@elasticsearchmachine (Collaborator) commented:
Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner force-pushed the 2024/08/18/s3-cleanup-multipart-upload branch from 162494c to 5fe1390 on August 18, 2024 19:05
@elasticsearchmachine (Collaborator) commented:
Hi @DaveCTurner, I've updated the changelog YAML for you.

@DaveCTurner marked this pull request as ready for review on August 19, 2024 05:40
@elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine added the Team:Distributed (Meta label for distributed team) label on Aug 19, 2024
Review thread on the following code:

@Override
public void onFailure(Exception e) {
    logger.warn("failed to get multipart uploads for cleanup during snapshot delete", e);
    snapshotDeleteListener.onFailure(e);
A reviewer (Member) commented:
Does this mean that failures happening while getting the list of multipart uploads would also fail the snapshot deletion? getMultipartUploadCleanupListener doesn't seem to throw. Sorry if this is not related, but ActionListener<ActionListener<>> is really bumping this up to a whole new level! :)))

@DaveCTurner (Author) replied:
> Does this mean that failures happening while getting the list of multipart uploads would also fail the snapshot deletion?

Hm, that was how it worked in an earlier iteration, but in the finished version we log a failure there and otherwise treat it as if there are no multipart uploads. Let me try and simplify that, sec.

> ActionListener<ActionListener<>> is really bumping this up to a whole new level

[image]

@DaveCTurner (Author) added:
> Let me try and simplify that, sec.

Actually this is a little tricky: in principle we could turn this into a Consumer<ActionListener<Void>>, but then we submit a task to a threadpool executor, and that sort of thing can theoretically fail. It won't actually fail today, because snapshotExecutor has an infinite queue and just silently drops work when shut down rather than rejecting it, but I would still prefer to write code like this in a style that allows for failures.
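
Roughly the shape I mean; everything here other than getMultipartUploadCleanupListener is a stand-in name, not the real code, but it shows why the caller needs an onFailure path that a plain Consumer<ActionListener<Void>> would not give it:

import java.util.concurrent.Executor;
import org.elasticsearch.action.ActionListener;

class MultipartCleanupSketch {
    private final Executor snapshotExecutor; // stand-in for the real snapshot thread pool

    MultipartCleanupSketch(Executor snapshotExecutor) {
        this.snapshotExecutor = snapshotExecutor;
    }

    void getMultipartUploadCleanupListener(ActionListener<ActionListener<Void>> listener) {
        try {
            // The listing work is forked onto the snapshot executor...
            snapshotExecutor.execute(() -> listener.onResponse(buildCleanupListener()));
        } catch (Exception e) {
            // ...and the fork itself can in principle be rejected, e.g. by a shut-down executor,
            // so the result has to be delivered through something with an onFailure path.
            listener.onFailure(e);
        }
    }

    private ActionListener<Void> buildCleanupListener() {
        // Hypothetical helper standing in for the listener that actually aborts dangling uploads.
        return new ActionListener<Void>() {
            @Override
            public void onResponse(Void unused) { /* dangling uploads aborted */ }

            @Override
            public void onFailure(Exception e) { /* cleanup failed; real code would log */ }
        };
    }
}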

The reviewer (Member) replied:
I'm ok with the ActionListener<ActionListener<>> part. FWIW, returning a specialized object with clearer names might make it more readable in my opinion, but underneath that would still be an action listener anyway. But why log and propagate the failure if it's not supposed to happen?

@DaveCTurner (Author) replied:
If we're wrong (e.g. some future change to threadpool behaviour makes us wrong) then we have to do something in production when assertions are disabled, and this seems like the most sensible thing to do. There are loads of other spots where we assert false to indicate to the reader that we think something is impossible, but nonetheless react in a somewhat reasonable fashion when assertions are disabled.
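
Roughly the pattern I mean, as an illustrative sketch only (the logger message and listener names are assumptions, not the exact code in this change): the assert trips tests if the "impossible" branch is ever taken, while production builds, where assertions are disabled, still react in a reasonable way.

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.ActionListener;

class UnexpectedFailureSketch {
    private static final Logger logger = LogManager.getLogger(UnexpectedFailureSketch.class);

    static void onUnexpectedFailure(Exception e, ActionListener<Void> listener) {
        assert false : e; // we believe this branch is unreachable today; fail loudly in tests if not
        logger.warn("unexpected failure while preparing multipart upload cleanup", e);
        // With assertions disabled, do something sensible rather than dropping the failure.
        listener.onFailure(e);
    }
}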

@DaveCTurner added the auto-merge label (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) on Aug 19, 2024
@elasticsearchmachine merged commit e6b830e into elastic:main on Aug 19, 2024
15 checks passed
@DaveCTurner deleted the 2024/08/18/s3-cleanup-multipart-upload branch on August 19, 2024 16:50
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024