Clean up dangling S3 multipart uploads #111955

Conversation

@DaveCTurner (Contributor) commented on Aug 18, 2024:

If Elasticsearch fails part-way through a multipart upload to S3 it will
generally try to abort the upload, but it's possible that the abort
attempt also fails. In this case the upload becomes _dangling_. Dangling
uploads consume storage space, and therefore cost money, until they are
eventually aborted.

Earlier versions of Elasticsearch require users to check for dangling
multipart uploads, and to manually abort any that they find. This commit
introduces a cleanup process which aborts all dangling uploads on each
snapshot delete instead.

Closes #44971
Closes #101169
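
For reference, the manual cleanup that earlier versions left to operators looks roughly like the following AWS SDK for Java (v1) sketch; the bucket name, base path, and one-day staleness cut-off are placeholders, and this is an illustration of the operation being automated rather than the code added by this change:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUpload;
import com.amazonaws.services.s3.model.MultipartUploadListing;
import java.time.Duration;
import java.time.Instant;

public class AbortDanglingUploads {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        String bucket = "my-snapshot-bucket";       // placeholder
        String prefix = "my-repository-base-path/"; // placeholder
        Instant cutoff = Instant.now().minus(Duration.ofDays(1)); // only abort clearly stale uploads

        ListMultipartUploadsRequest listRequest = new ListMultipartUploadsRequest(bucket).withPrefix(prefix);
        MultipartUploadListing listing;
        do {
            // List in-progress multipart uploads under the repository's base path.
            listing = s3.listMultipartUploads(listRequest);
            for (MultipartUpload upload : listing.getMultipartUploads()) {
                // Skip recent uploads to avoid racing with a snapshot that is still running.
                if (upload.getInitiated().toInstant().isBefore(cutoff)) {
                    // Abort the dangling upload so S3 stops charging for its parts.
                    s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, upload.getKey(), upload.getUploadId()));
                }
            }
            // Continue from the pagination markers if the listing was truncated.
            listRequest.setKeyMarker(listing.getNextKeyMarker());
            listRequest.setUploadIdMarker(listing.getNextUploadIdMarker());
        } while (listing.isTruncated());
    }
}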

@DaveCTurner added the >enhancement, :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), and v8.16.0 labels on Aug 18, 2024
Documentation preview:

@elasticsearchmachine (Collaborator) commented:
Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner force-pushed the 2024/08/18/s3-cleanup-multipart-upload branch from 162494c to 5fe1390 on August 18, 2024 19:05
@elasticsearchmachine (Collaborator) commented:
Hi @DaveCTurner, I've updated the changelog YAML for you.

@DaveCTurner marked this pull request as ready for review on August 19, 2024 05:40
@elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine added the Team:Distributed (Meta label for distributed team) label on Aug 19, 2024
Review thread on the following code:

@Override
public void onFailure(Exception e) {
    logger.warn("failed to get multipart uploads for cleanup during snapshot delete", e);
    snapshotDeleteListener.onFailure(e);
A reviewer (Member) commented:
Does this mean that failures happening while getting the list of multipart uploads would also fail the snapshot deletion? getMultipartUploadCleanupListener doesn't seem to throw. Sorry if this is not related, but ActionListener<ActionListener<>> is really bumping this up to a whole new level! :)))

@DaveCTurner (Author) replied:
> Does this mean that failures happening while getting the list of multipart uploads would also fail the snapshot deletion?

Hm, that was how it worked in an earlier iteration, but in the finished version we log a failure there and otherwise treat it as if there are no multipart uploads. Let me try and simplify that, sec.

> ActionListener<ActionListener<>> is really bumping this up to a whole new level

[image]

@DaveCTurner (Author) added:
> Let me try and simplify that, sec.

Actually this is a little tricky: in principle we could turn this into a Consumer<ActionListener<Void>>, but then we submit a task to a threadpool executor, and that sort of thing can theoretically fail. It won't actually fail today, because snapshotExecutor has an infinite queue and just silently drops work when shut down rather than rejecting it, but I would still prefer to write code like this in a style that allows for failures.
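
Roughly the shape I mean; everything here other than getMultipartUploadCleanupListener is a stand-in name, not the real code, but it shows why the caller needs an onFailure path that a plain Consumer<ActionListener<Void>> would not give it:

import java.util.concurrent.Executor;
import org.elasticsearch.action.ActionListener;

class MultipartCleanupSketch {
    private final Executor snapshotExecutor; // stand-in for the real snapshot thread pool

    MultipartCleanupSketch(Executor snapshotExecutor) {
        this.snapshotExecutor = snapshotExecutor;
    }

    void getMultipartUploadCleanupListener(ActionListener<ActionListener<Void>> listener) {
        try {
            // The listing work is forked onto the snapshot executor...
            snapshotExecutor.execute(() -> listener.onResponse(buildCleanupListener()));
        } catch (Exception e) {
            // ...and the fork itself can in principle be rejected, e.g. by a shut-down executor,
            // so the result has to be delivered through something with an onFailure path.
            listener.onFailure(e);
        }
    }

    private ActionListener<Void> buildCleanupListener() {
        // Hypothetical helper standing in for the listener that actually aborts dangling uploads.
        return new ActionListener<Void>() {
            @Override
            public void onResponse(Void unused) { /* dangling uploads aborted */ }

            @Override
            public void onFailure(Exception e) { /* cleanup failed; real code would log */ }
        };
    }
}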

The reviewer (Member) replied:
I'm ok with the ActionListener<ActionListener<>> part. FWIW, returning a specialized object with clearer names might make it more readable in my opinion, but underneath that would still be an action listener anyway. But why log and propagate the failure if it's not supposed to happen?

@DaveCTurner (Author) replied:
If we're wrong (e.g. some future change to threadpool behaviour makes us wrong) then we have to do something in production when assertions are disabled, and this seems like the most sensible thing to do. There are loads of other spots where we assert false to indicate to the reader that we think something is impossible, but nonetheless react in a somewhat reasonable fashion when assertions are disabled.
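
Roughly the pattern I mean, as an illustrative sketch only (the logger message and listener names are assumptions, not the exact code in this change): the assert trips tests if the "impossible" branch is ever taken, while production builds, where assertions are disabled, still react in a reasonable way.

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.ActionListener;

class UnexpectedFailureSketch {
    private static final Logger logger = LogManager.getLogger(UnexpectedFailureSketch.class);

    static void onUnexpectedFailure(Exception e, ActionListener<Void> listener) {
        assert false : e; // we believe this branch is unreachable today; fail loudly in tests if not
        logger.warn("unexpected failure while preparing multipart upload cleanup", e);
        // With assertions disabled, do something sensible rather than dropping the failure.
        listener.onFailure(e);
    }
}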

@DaveCTurner added the auto-merge label (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) on Aug 19, 2024
@elasticsearchmachine merged commit e6b830e into elastic:main on Aug 19, 2024
15 checks passed
@DaveCTurner deleted the 2024/08/18/s3-cleanup-multipart-upload branch on August 19, 2024 16:50
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024