Replies: 4 comments 4 replies
-
Hi @larhauga, we are facing the same issue in our clusters. A simple reconcile does not work.
-
Hi @larhauga, apologies for hijacking this thread. I am trying to install the scale-set controller Helm chart in a similar way using Flux and a Kustomization, and I could not find related info in the user documentation. I would really appreciate it if you could share some insights on your ARC setup.
-
I have been struggling to get around this, and I think I've found a solution: in your HelmRelease for the runner scale set, make sure to add:
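Something like this (a minimal sketch; the release name, namespace, and chart source are placeholders for whatever you already have, and only the `driftDetection` stanza is the actual addition):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: arc-runner-set        # placeholder release name
  namespace: arc-runners      # placeholder namespace
spec:
  interval: 5m
  chart:
    spec:
      chart: gha-runner-scale-set
      sourceRef:
        kind: HelmRepository
        name: actions-runner-controller   # placeholder chart source
  # The addition: have Flux detect differences between the Helm
  # storage and the cluster, and correct them on reconcile.
  driftDetection:
    mode: enabled
```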
Per the Flux HelmRelease documentation, drift detection looks for differences between what Helm thinks is deployed and what is actually running in the cluster.
I've been able to go back and forth through version upgrades successfully, and this seems to stop the resources from disappearing. I was noticing that after the controller upgrade occurred, the listeners would be killed; with drift detection enabled they come back. This is still less than ideal, as I'm not sure what other negatives could occur by using that setting.
-
Hey @larhauga, this is a great question, and your assumption about why we did it is exactly right. I'll try to explain the reasoning behind it.
This is not to say we won't revisit this decision. It is perfectly reasonable to expect that you can upgrade the autoscaling runner set without touching the controller. But with the current rate of change in the CRDs, this constraint greatly helps us debug and track down issues while keeping the controllers relatively simple. Another approach may be to allow version mismatches only on patch versions, since by definition they should only include bug fixes. Anyway, I hope this clarifies the constraint.
-
Hi,
we use Flux to manage the different objects related to the scale-set controller and the scale sets, and additionally use `helm template` to generate the objects from the Helm chart. This gives us full control over the resources being created while still following upstream changes.
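Roughly, the pattern looks like this (a minimal sketch, not our exact setup; the source name, path, and chart URL are placeholders/examples):

```yaml
# Flux applies pre-rendered manifests committed to the repo, e.g. the
# output of `helm template` against the upstream chart:
#   helm template arc oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: arc-scale-sets
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  sourceRef:
    kind: GitRepository
    name: fleet-repo              # placeholder Git source
  path: ./clusters/prod/arc       # placeholder: where the rendered manifests are committed
```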
However, after the upgrade from 0.6.1 to 0.7.0, we were surprised by the way the upgrade procedure went.
We did not expect the scale-set controller to delete the resources and the AutoscalingRunnerSets to end up deleted; this results in a long wait until a new reconcile.
There may be technical reasons for the way this is implemented, but it is unexpected and I think it can cause problems (and downtime) in the long run. Having to ensure that all the different scale sets have the correct version set as a label is problematic (see the sketch below).
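For illustration, this is the kind of coupling I mean (a sketch; I am assuming the controller compares the chart's standard app.kubernetes.io/version label, and all names are placeholders):

```yaml
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: my-runner-set                            # placeholder
  namespace: arc-runners                         # placeholder
  labels:
    # Assumption: if this version does not match the version of the
    # running controller, the resource gets deleted during an upgrade.
    app.kubernetes.io/version: "0.7.0"
spec:
  githubConfigUrl: https://github.com/my-org     # placeholder
```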
If you use Helm from the command line, or indirectly, you have control over the timing of reconciles, but I think this can lead to different race conditions in the long run.
Is there a technical reason for the scale-set controller to delete the AutoscalingRunnerSets? I can understand it deleting the EphemeralRunners, but not that it would need the versions to match, since only one controller version should be running at a time.
If it is necessary to manage the version of the scale sets, I think it should be the controller's concern, not the Helm template's. If breaking changes are needed on the CRDs, the version should be increased or the contract expanded between versions.
I hope that we can work towards the controller not deleting the AutoscalingRunnerSets during an upgrade, so that Kubernetes can handle the upgrade more seamlessly.
Is there any knowledge about the reason for this being the case, or more about the intent of the version labels? (I could not find any related comments in the code.)
What would be the outcome of removing the code that deletes the AutoscalingRunnerSets? And if that is not viable, could allowing some version skew serve as an interim solution?
Hopefully this question can start a discussion on how the process can be improved in the future.