Applier manager improvements #5062
Conversation
The map is only ever used in the loop to create and remove stacks, so it doesn't need to be stored in the struct. This ensures that there can't be any racy concurrent accesses to it. Signed-off-by: Tom Wieczorek <[email protected]>
The only reason these channels get closed is if the watcher itself gets closed. This happens only when the method returns, which in turn only happens when the context is done. In this case, the loop has already exited without a select on a potentially closed channel. So the branches that checked for closed channels were effectively unreachable during runtime. Signed-off-by: Tom Wieczorek <[email protected]>
Rename cancelWatcher to stop and wait until the newly added stopped channel is closed. Also, add a stopped channel to each stack to do the same for each stack-specific goroutine. Signed-off-by: Tom Wieczorek <[email protected]>
Cancel the contexts with a cause. Add this cause to the log statements when exiting loops. Rename bundlePath to bundleDir to reflect the fact that it is a directory, not a file. Signed-off-by: Tom Wieczorek <[email protected]>
Exit the loop on error and restart it after a one-minute delay to allow it to recover in a new run. Also replace the bespoke retry loop for stacks with the Kubernetes client's wait package. Signed-off-by: Tom Wieczorek <[email protected]>
Seems to be a remnant from the past. Signed-off-by: Tom Wieczorek <[email protected]>
I've created a pull request into the fork with a test: twz123#134
Thx @emosbaugh! I added the commit here, but it's not signed-off. Can you maybe sign it and do a force push? You should be able to do it directly on the branch in my fork, as "allow edits by maintainers" is checked.
Signed-off-by: Ethan Mosbaugh <[email protected]>
Force-pushed from afd01db to 11b3197.
Done. Sorry about that.
@twz123 is the intention to backport this fix, and to what version? Thanks!
We can check if it's easy to backport. If yes, all good. If not, we can maybe re-target your patch to the release-1.31 branch and backport that to 1.30 - 1.28 instead. Or we do it the other way round: merge your patch into main, and I'll rebase this one on top of yours. I'll check tomorrow in detail...
Thanks!
Alright, the code changes themselves can be backported with just a few small merge conflicts that are straightforward to resolve. The test case, on the other hand, doesn't work at all in 1.28-1.30. This is presumably due to some non-trivial improvements that have been made to the fake clients quite recently. We can do a backport excluding the tests, or we could try to backport the fake client improvements as well, which might be quite a bit of work.
The code change is what's most important. Thanks!
Left one minor Q on the timeout used
 go func() {
-	_ = m.runWatchers(watcherCtx)
+	defer close(stopped)
+	wait.UntilWithContext(ctx, m.runWatchers, 1*time.Minute)
1 minute seems a bit arbitrary here; any reasoning why that time is used?
oh, didn't realize auto-merge was set. oh well, it was minor anyways 😂
It's a trade-off between busy-loop/log-spam and a reasonable self-healing delay. I think everything between, say, 10 secs and a couple of minutes would be fine here, so one minute was just the thing I came up with when writing that code 🙈
Successfully created backport PR for
Description
The stacks don't need to be stored in the manager struct
The map is only ever used in the loop to create and remove stacks, so it doesn't need to be stored in the struct. This ensures that there can't be any racy concurrent accesses to it.
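A minimal sketch of that pattern, with illustrative type and field names rather than the actual k0s code: the map exists only as a local variable of the loop, so nothing outside the loop can reach it and no locking is needed.

package sketch

import "context"

// Illustrative types only; the real manager watches a directory of manifests.
type stack struct{ name string }

type stackEvent struct {
	name    string
	removed bool
}

type manager struct {
	events <-chan stackEvent
}

// runWatchers owns the stacks map as a local variable. It is created when the
// loop starts and becomes unreachable when the loop returns, so no other
// goroutine can ever observe it.
func (m *manager) runWatchers(ctx context.Context) error {
	stacks := make(map[string]stack)

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev := <-m.events:
			if ev.removed {
				delete(stacks, ev.name)
			} else {
				stacks[ev.name] = stack{name: ev.name}
			}
		}
	}
}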
Don't check for closed watch channels
The only reason these channels get closed is if the watcher itself gets closed. This happens only when the method returns, which in turn only happens when the context is done. In this case, the loop has already exited without a select on a potentially closed channel. So the branches that checked for closed channels were effectively unreachable during runtime.
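For illustration, using a hypothetical watcher type rather than the real one: because the watcher is closed only after the loop has returned, the receive branches can never see a closed channel and don't need the ", ok" form.

package sketch

import (
	"context"
	"log"
)

// fsWatcher stands in for the real filesystem watcher; its channels are
// closed only by its Close method, which the manager calls after the loop
// has exited.
type fsWatcher struct {
	Events chan string
	Errors chan error
}

// watchLoop returns only when ctx is done. Since the watcher is closed
// strictly after that, Events and Errors cannot be closed while this select
// is still running, so no "ev, ok := <-..." closed-channel branch is needed.
func watchLoop(ctx context.Context, w *fsWatcher, apply func(string)) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-w.Events:
			apply(ev)
		case err := <-w.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}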
Wait for goroutines to exit
Rename cancelWatcher to stop and wait until the newly added stopped channel is closed. Also, add a stopped channel to each stack to do the same for each stack-specific goroutine.
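A minimal sketch of the stop/stopped handshake described above (field and function names are illustrative): Stop cancels the context and then blocks until the goroutine has signalled, by closing its stopped channel, that it has fully exited.

package sketch

import "context"

// handle pairs a cancel function with a channel the goroutine closes on exit.
type handle struct {
	stop    context.CancelFunc
	stopped chan struct{}
}

// start launches run in a goroutine and returns a handle for stopping it.
func start(parent context.Context, run func(context.Context)) *handle {
	ctx, cancel := context.WithCancel(parent)
	h := &handle{stop: cancel, stopped: make(chan struct{})}

	go func() {
		defer close(h.stopped) // signal that the goroutine has exited
		run(ctx)
	}()

	return h
}

// Stop requests termination and waits until the goroutine is really gone,
// so callers know no stack-specific work is still running afterwards.
func (h *handle) Stop() {
	h.stop()
	<-h.stopped
}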
Restart watch loop on errors
Exit the loop on error and restart it after a one-minute delay to allow it to recover in a new run. Also replace the bespoke retry loop for stacks with the Kubernetes client's wait package.
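Roughly how the wait package gives the restart behaviour (a standalone sketch, not the actual manager code): wait.UntilWithContext re-runs the loop body with a fixed delay between runs until the context is cancelled, so a failed run recovers in a fresh invocation instead of taking the manager down.

package main

import (
	"context"
	"errors"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// runWatchers stands in for the real watch loop; it returns on error or when
// the context is done.
func runWatchers(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Run the loop; whenever it exits, run it again one minute later in a
	// fresh invocation, until the context is done.
	wait.UntilWithContext(ctx, func(ctx context.Context) {
		if err := runWatchers(ctx); err != nil && !errors.Is(err, context.Canceled) {
			log.Printf("watch loop exited: %v", err)
		}
	}, 1*time.Minute)
}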
Improve logging
Cancel the contexts with a cause. Add this cause to the log statements when exiting loops. Rename bundlePath to bundleDir to reflect the fact that it is a directory, not a file.
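A small sketch of the cause-aware cancellation (the error message is made up): context.WithCancelCause lets the shutdown path say why it is cancelling, and context.Cause surfaces that reason in the exit log instead of a bare "context canceled".

package main

import (
	"context"
	"errors"
	"log"
)

func main() {
	// Cancel with a cause so loops can report why they are exiting.
	ctx, cancel := context.WithCancelCause(context.Background())

	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ctx.Done()
		// context.Cause returns the error passed to cancel, which makes the
		// log line more informative than the generic context.Canceled.
		log.Printf("watch loop done: %v", context.Cause(ctx))
	}()

	cancel(errors.New("applier manager is stopping"))
	<-done
}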
Remove unused applier field
Seems to be a remnant from the past.
Type of change
How Has This Been Tested?
Checklist: