Clean up dangling S3 multipart uploads #111955
Conversation
Documentation preview:
Hi @DaveCTurner, I've created a changelog YAML for you.
If Elasticsearch fails part-way through a multipart upload to S3 it will generally try and abort the upload, but it's possible that the abort attempt also fails. In this case the upload becomes _dangling_. Dangling uploads consume storage space, and therefore cost money, until they are eventually aborted. Earlier versions of Elasticsearch require users to check for dangling multipart uploads, and to manually abort any that they find. This commit introduces a cleanup process which aborts all dangling uploads on each snapshot delete instead.

Closes #44971
Closes #101169
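For context, here is a minimal sketch of what such a cleanup amounts to against the AWS SDK v1 for Java. This is not the PR's actual code; the class, method, and parameter names are illustrative, and the real implementation runs inside the repository's snapshot-delete machinery.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUpload;
import com.amazonaws.services.s3.model.MultipartUploadListing;

public final class DanglingUploadCleanup {
    /**
     * Lists every in-progress multipart upload under the given prefix and
     * aborts it, freeing the storage held by the already-uploaded parts.
     * Illustrative only: the PR runs this during snapshot deletes, when no
     * legitimate uploads should be in flight.
     */
    static void abortDanglingUploads(AmazonS3 s3, String bucket, String prefix) {
        ListMultipartUploadsRequest request = new ListMultipartUploadsRequest(bucket).withPrefix(prefix);
        MultipartUploadListing listing;
        do {
            listing = s3.listMultipartUploads(request);
            for (MultipartUpload upload : listing.getMultipartUploads()) {
                s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, upload.getKey(), upload.getUploadId()));
            }
            // The listing is paginated; continue from the returned markers.
            request.setKeyMarker(listing.getNextKeyMarker());
            request.setUploadIdMarker(listing.getNextUploadIdMarker());
        } while (listing.isTruncated());
    }
}
```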
force-pushed from 162494c to 5fe1390
Hi @DaveCTurner, I've updated the changelog YAML for you.
Pinging @elastic/es-distributed (Team:Distributed)
```java
@Override
public void onFailure(Exception e) {
    logger.warn("failed to get multipart uploads for cleanup during snapshot delete", e);
    snapshotDeleteListener.onFailure(e);
}
```
Does this mean that failures happening while getting the list of multipart uploads would also fail the snapshot deletion? getMultipartUploadCleanupListener doesn't seem to throw. Sorry if this is not related, but ActionListener<ActionListener<>> is really bumping this up to a whole new level! :)))
> Does this mean that failures happening while getting the list of multipart uploads would also fail the snapshot deletion?

Hm, that was how it worked in an earlier iteration, but in the finished version we log a failure there and otherwise treat it as if there are no multipart uploads. Let me try and simplify that, sec.
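A minimal sketch of that log-and-continue behaviour using Elasticsearch's ActionListener.wrap; the abortUploads helper and snapshotDeleteListener here are hypothetical stand-ins for the PR's real plumbing:

```java
// Treat a failure to list the uploads as "nothing to clean up" rather than
// failing the snapshot delete itself: log it and continue with an empty list.
ActionListener<List<MultipartUpload>> listingListener = ActionListener.wrap(
    uploads -> abortUploads(uploads, snapshotDeleteListener),
    e -> {
        logger.warn("failed to get multipart uploads for cleanup during snapshot delete", e);
        abortUploads(List.of(), snapshotDeleteListener);
    }
);
```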
> ActionListener<ActionListener<>> is really bumping this up to a whole new level
> Let me try and simplify that, sec.
Actually this is a little tricky: in principle we could turn this into a Consumer<ActionListener<Void>>, but then we submit a task to a threadpool executor, and that sort of thing can theoretically fail. It won't actually fail today, because snapshotExecutor has an unbounded queue and just silently drops work when shut down rather than rejecting it, but I would still prefer to write code like this in a style that allows for failures.
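Roughly, the difference is whether the step that produces the cleanup listener has a failure channel of its own. A simplified sketch, with illustrative method and field names rather than the PR's exact code:

```java
// With ActionListener<ActionListener<Void>>, obtaining the cleanup listener is
// itself an async step that can fail, e.g. if the executor rejects the task.
void getMultipartUploadCleanupListener(ActionListener<ActionListener<Void>> outer) {
    try {
        snapshotExecutor.execute(() -> outer.onResponse(buildCleanupListener()));
    } catch (RejectedExecutionException e) {
        // A plain Consumer<ActionListener<Void>> would have no way to report this.
        outer.onFailure(e);
    }
}
```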
I'm ok with the ActionListener<ActionListener<>> part. FWIW, returning a specialized object with clearer names might make it more readable, in my opinion, but underneath that would still be an action listener. But why log and propagate the failure if it's not supposed to happen?
If we're wrong (e.g. some future change to threadpool behaviour makes us wrong) then we have to do something in production when assertions are disabled, and this seems like the most sensible thing to do. There are loads of other spots where we assert false to indicate to the reader that we think something is impossible, but nonetheless react in a somewhat reasonable fashion when assertions are disabled.
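The pattern being referred to looks roughly like this; the message and fallback are illustrative:

```java
void onUnexpectedFailure(Exception e) {
    // Tests run with assertions enabled (-ea), so this trips loudly there if
    // some future change makes the "impossible" case reachable.
    assert false : e;
    // In production, where assertions are disabled, fall back to something
    // reasonable: log the failure and carry on without the cleanup.
    logger.warn("unexpected failure during multipart upload cleanup", e);
}
```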