Applier manager improvements #5062
Conversation
The map is only ever used in the loop to create and remove stacks, so it doesn't need to be stored in the struct. This ensures that there can't be any racy concurrent accesses to it. Signed-off-by: Tom Wieczorek <[email protected]>
The only reason these channels get closed is if the watcher itself gets closed. This happens only when the method returns, which in turn only happens when the context is done. In this case, the loop has already exited without a select on a potentially closed channel. So the branches that checked for closed channels were effectively unreachable during runtime. Signed-off-by: Tom Wieczorek <[email protected]>
Rename cancelWatcher to stop and wait until the newly added stopped channel is closed. Also, add a stopped channel to each stack to do the same for each stack-specific goroutine. Signed-off-by: Tom Wieczorek <[email protected]>
Cancel the contexts with a cause. Add this cause to the log statements when exiting loops. Rename bundlePath to bundleDir to reflect the fact that it is a directory, not a file. Signed-off-by: Tom Wieczorek <[email protected]>
Exit the loop on error and restart it after a one-minute delay to allow it to recover in a new run. Also replace the bespoke retry loop for stacks with the Kubernetes client's wait package. Signed-off-by: Tom Wieczorek <[email protected]>
Seems to be a remnant from the past. Signed-off-by: Tom Wieczorek <[email protected]>
I've created a pull request into the fork with a test: twz123#134
Thx @emosbaugh! I added the commit here, but it's not signed-off. Can you maybe sign it and do a force push? You should be able to do it directly on the branch in my fork, as "allow edits by maintainers" is checked.
Signed-off-by: Ethan Mosbaugh <[email protected]>
Force-pushed from afd01db to 11b3197.
Done. Sorry about that.
@twz123 is the intention to backport this fix, and to what version? Thanks!
We can check if it's easy to backport. If yes, all good. If not, we can maybe re-target your patch to the release-1.31 branch and backport that to 1.30 - 1.28 instead. Or we do it the other way round: merge your patch into main, and I'll rebase this one on top of yours. I'll check tomorrow in detail...
Thanks!
Alright, the code changes themselves can be backported with just a few small merge conflicts that are straightforward to resolve. The test case, on the other hand, doesn't work at all in 1.28-1.30. This is presumably due to some non-trivial improvements that have been made to the fake clients quite recently. We can do a backport excluding the tests, or we could try to backport the fake client improvements as well, which might be quite a bit of work.
The code change is what's most important. Thanks!
Left one minor Q on the timeout used
 go func() {
-	_ = m.runWatchers(watcherCtx)
+	defer close(stopped)
+	wait.UntilWithContext(ctx, m.runWatchers, 1*time.Minute)
1 minute seems a bit arbitrary here; any reasoning why that time is used?
oh, didn't realize auto-merge was set. oh well, it was minor anyways 😂
It's a trade-off between busy-loop/log-spam and a reasonable self-healing delay. I think everything between, say, 10 secs and a couple of minutes would be fine here, so one minute was just the thing I came up with when writing that code 🙈
Successfully created backport PR for
Description
The stacks don't need to be stored in the manager struct
The map is only ever used in the loop to create and remove stacks, so it doesn't need to be stored in the struct. This ensures that there can't be any racy concurrent accesses to it.
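A minimal sketch of that pattern, with illustrative type and field names rather than the actual k0s code: the map exists only as a local variable of the loop, so nothing outside the loop can reach it and no locking is needed.

package sketch

import "context"

// Illustrative types only; the real manager watches a directory of manifests.
type stack struct{ name string }

type stackEvent struct {
	name    string
	removed bool
}

type manager struct {
	events <-chan stackEvent
}

// runWatchers owns the stacks map as a local variable. It is created when the
// loop starts and becomes unreachable when the loop returns, so no other
// goroutine can ever observe it.
func (m *manager) runWatchers(ctx context.Context) error {
	stacks := make(map[string]stack)

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev := <-m.events:
			if ev.removed {
				delete(stacks, ev.name)
			} else {
				stacks[ev.name] = stack{name: ev.name}
			}
		}
	}
}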
Don't check for closed watch channels
The only reason these channels get closed is if the watcher itself gets closed. This happens only when the method returns, which in turn only happens when the context is done. In this case, the loop has already exited without a select on a potentially closed channel. So the branches that checked for closed channels were effectively unreachable during runtime.
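For illustration, using a hypothetical watcher type rather than the real one: because the watcher is closed only after the loop has returned, the receive branches can never see a closed channel and don't need the ", ok" form.

package sketch

import (
	"context"
	"log"
)

// fsWatcher stands in for the real filesystem watcher; its channels are
// closed only by its Close method, which the manager calls after the loop
// has exited.
type fsWatcher struct {
	Events chan string
	Errors chan error
}

// watchLoop returns only when ctx is done. Since the watcher is closed
// strictly after that, Events and Errors cannot be closed while this select
// is still running, so no "ev, ok := <-..." closed-channel branch is needed.
func watchLoop(ctx context.Context, w *fsWatcher, apply func(string)) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-w.Events:
			apply(ev)
		case err := <-w.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}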
Wait for goroutines to exit
Rename cancelWatcher to stop and wait until the newly added stopped channel is closed. Also, add a stopped channel to each stack to do the same for each stack-specific goroutine.
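A minimal sketch of the stop/stopped handshake described above (field and function names are illustrative): Stop cancels the context and then blocks until the goroutine has signalled, by closing its stopped channel, that it has fully exited.

package sketch

import "context"

// handle pairs a cancel function with a channel the goroutine closes on exit.
type handle struct {
	stop    context.CancelFunc
	stopped chan struct{}
}

// start launches run in a goroutine and returns a handle for stopping it.
func start(parent context.Context, run func(context.Context)) *handle {
	ctx, cancel := context.WithCancel(parent)
	h := &handle{stop: cancel, stopped: make(chan struct{})}

	go func() {
		defer close(h.stopped) // signal that the goroutine has exited
		run(ctx)
	}()

	return h
}

// Stop requests termination and waits until the goroutine is really gone,
// so callers know no stack-specific work is still running afterwards.
func (h *handle) Stop() {
	h.stop()
	<-h.stopped
}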
Restart watch loop on errors
Exit the loop on error and restart it after a one-minute delay to allow it to recover in a new run. Also replace the bespoke retry loop for stacks with the Kubernetes client's wait package.
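Roughly how the wait package gives the restart behaviour (a standalone sketch, not the actual manager code): wait.UntilWithContext re-runs the loop body with a fixed delay between runs until the context is cancelled, so a failed run recovers in a fresh invocation instead of taking the manager down.

package main

import (
	"context"
	"errors"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// runWatchers stands in for the real watch loop; it returns on error or when
// the context is done.
func runWatchers(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Run the loop; whenever it exits, run it again one minute later in a
	// fresh invocation, until the context is done.
	wait.UntilWithContext(ctx, func(ctx context.Context) {
		if err := runWatchers(ctx); err != nil && !errors.Is(err, context.Canceled) {
			log.Printf("watch loop exited: %v", err)
		}
	}, 1*time.Minute)
}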
Improve logging
Cancel the contexts with a cause. Add this cause to the log statements when exiting loops. Rename bundlePath to bundleDir to reflect the fact that it is a directory, not a file.
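A small sketch of the cause-aware cancellation (the error message is made up): context.WithCancelCause lets the shutdown path say why it is cancelling, and context.Cause surfaces that reason in the exit log instead of a bare "context canceled".

package main

import (
	"context"
	"errors"
	"log"
)

func main() {
	// Cancel with a cause so loops can report why they are exiting.
	ctx, cancel := context.WithCancelCause(context.Background())

	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ctx.Done()
		// context.Cause returns the error passed to cancel, which makes the
		// log line more informative than the generic context.Canceled.
		log.Printf("watch loop done: %v", context.Cause(ctx))
	}()

	cancel(errors.New("applier manager is stopping"))
	<-done
}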
Remove unused applier field
Seems to be a remnant from the past.
Type of change
How Has This Been Tested?
Checklist: