Insulate users from kubeadm API version changes #2769
Comments
I would like to add:
Final thought.
Agreed to all of @fabriziopandini's points. kubeadm follows the philosophy of a CLI tool (like ssh, ftp, etc.) and it cannot anticipate all infrastructure-related failures, but having a sane, best-effort amount of retries in the CLI tool makes sense.
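To make "best-effort retries" concrete, here is a minimal sketch (not kubeadm's actual implementation) of wrapping a transiently failing step in exponential backoff using the wait helpers from k8s.io/apimachinery; the step name, delays, and attempt count are illustrative assumptions:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// joinControlPlane is a stand-in for a step that can fail transiently,
// e.g. while an API server endpoint behind a load balancer settles.
func joinControlPlane() error {
	// ... real work would go here ...
	return fmt.Errorf("connection refused")
}

func main() {
	backoff := wait.Backoff{
		Duration: 5 * time.Second, // initial delay
		Factor:   2.0,             // double the delay each attempt
		Steps:    5,               // give up after 5 attempts
	}

	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := joinControlPlane(); err != nil {
			fmt.Printf("retrying after transient error: %v\n", err)
			return false, nil // not done yet, retry
		}
		return true, nil // success, stop retrying
	})
	if err != nil {
		fmt.Printf("giving up: %v\n", err)
	}
}
```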
Hopefully scheduled for 1.19. Depends a lot on sig-release and partly on sig-arch!
This can be useful, no doubt. Like I've mentioned today, interestingly we have not seen major complaints about the failures CAPI is seeing. Users are applying custom timeouts around their cluster creation on custom infrastructure (e.g. "I know what my GCE-running cluster needs").
@randomvariable can you expand on this point?
There is a tracking issue for that as well. It will be a long process and the timeline is unclear.
For #2254 we will likely have some component on the machine call back to an infrastructure API notification service (or back to the management cluster) to provide information about the failure. Providing users with access to the log is one case, but a machine-readable output, perhaps along the lines of an expanded range of error codes, could update a specific condition on the Machine reflecting the exact kubeadm failure. A controller could then take appropriate remediative action. I agree this is long-term, however.
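As a hedged sketch of that long-term idea: if kubeadm reported richer, machine-readable failure reasons, a controller could surface them as a condition that remediation logic reacts to. The condition type and reason strings below are purely hypothetical, not existing Cluster API or kubeadm APIs:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical machine-readable failure reasons that kubeadm could emit.
const (
	reasonEtcdUnavailable      = "EtcdUnavailable"
	reasonAPIServerUnreachable = "APIServerUnreachable"
)

// conditionFromKubeadmFailure translates a hypothetical kubeadm failure
// reason into a condition a controller could set on a Machine, so that
// remediation can react to the specific failure rather than a generic error.
func conditionFromKubeadmFailure(reason, message string) metav1.Condition {
	return metav1.Condition{
		Type:    "BootstrapSucceeded", // hypothetical condition type
		Status:  metav1.ConditionFalse,
		Reason:  reason,
		Message: message,
	}
}

func main() {
	c := conditionFromKubeadmFailure(reasonEtcdUnavailable,
		"kubeadm join failed: etcd cluster is not healthy")
	fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```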
/area dependency
Removing this as a proposal; it rather seems like a future cleanup.
/kind cleanup
WRT:
In 1.19 kubeadm merged a number of fixes and backported them to 1.17 and 1.18: kubernetes/kubeadm#2091
/assign @fabriziopandini
Adding up-to-date comments to the rest of the tasks:
[1] The timeline is unclear; we are blocked on the lack of a policy for component extractions out of k/k.
The fixes above should cover this task.
We did not merge any PRs in 1.19 for MRO as the contributor was busy with other tasks, but the boilerplate is in place.
This is very long term, potentially after [1].
v1beta1 is scheduled for removal in kubeadm 1.20, and my proposal would be to keep us on track for this effort.
/milestone v0.4.0
@neolit123 thanks for the update!
xref my comment from #3323 (comment): We should stop exposing the kubeadm v1betax types in our KubeadmConfig/KubeadmControlPlane specs, and instead use our own types. This would allow us to separate what users fill in from which kubeadm API version we end up using in our bootstrap data. As @detiber pointed out, we still have to know which version of the kubeadm types to use when generating our kubeadm yaml file and when interacting with the kubeadm-config ConfigMap.
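To make that separation concrete, here is a minimal sketch of the idea (all type and function names are hypothetical, not actual CABPK APIs): users fill in a Cluster API-owned config type, and the bootstrap provider converts it into whichever kubeadm API version is appropriate when rendering the kubeadm YAML:

```go
package bootstrap

// CAPIKubeadmClusterConfiguration is a hypothetical Cluster API-owned type
// that users would fill in instead of the kubeadm v1betaX types directly.
type CAPIKubeadmClusterConfiguration struct {
	KubernetesVersion    string
	ControlPlaneEndpoint string
	ClusterName          string
}

// kubeadmClusterConfigYAML renders the user-facing type into a kubeadm
// ClusterConfiguration document for the chosen kubeadm API version.
// Real code would marshal the corresponding Go types rather than
// templating strings.
func kubeadmClusterConfigYAML(cfg CAPIKubeadmClusterConfiguration, apiVersion string) string {
	return "apiVersion: kubeadm.k8s.io/" + apiVersion + "\n" +
		"kind: ClusterConfiguration\n" +
		"clusterName: " + cfg.ClusterName + "\n" +
		"kubernetesVersion: " + cfg.KubernetesVersion + "\n" +
		"controlPlaneEndpoint: " + cfg.ControlPlaneEndpoint + "\n"
}
```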
I know we don't have a label for it, but just for tracking:
/area node-agent
/assign @fabriziopandini
This task is going to be tackled as a release blocker for CABPK first. As part of it, we also need a plan of action for v1alpha3, where we translate the v1beta1 types currently used into v1beta2 or later when creating newer Kubernetes clusters.
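A small, hedged sketch of that translation step: pick the kubeadm config API version from the target Kubernetes version, then convert the stored v1beta1 fields accordingly. The cutoff below (v1beta2 from Kubernetes 1.15 onward) reflects kubeadm's documented history, but treat the function as an illustration rather than the exact logic CABPK uses:

```go
package bootstrap

import (
	"k8s.io/apimachinery/pkg/util/version"
)

// kubeadmAPIVersionFor chooses which kubeadm config API group version to
// render for a given target Kubernetes version. The cutoff is illustrative:
// kubeadm introduced the v1beta2 config API in Kubernetes 1.15.
func kubeadmAPIVersionFor(kubernetesVersion string) (string, error) {
	v, err := version.ParseSemantic(kubernetesVersion)
	if err != nil {
		return "", err
	}
	if v.AtLeast(version.MustParseSemantic("1.15.0")) {
		return "v1beta2", nil
	}
	return "v1beta1", nil
}
```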
/lifecycle active
/kind release-blocking
@vincepri @CecileRobertMichon I think we can close this one. Instead, IMO we should bump up the priority of the following issues/PRs, which are mandatory to get v1alpha4 to work with Kubernetes v1.22:
@fabriziopandini @CecileRobertMichon @vincepri Will 1.22 support be mandatory for the CAPI v0.4.0 release, and does "bump up" then mean release blocking? (I don't really know when we want to release v0.4.0; the Kubernetes 1.22 release seems to be mid-August according to https://github.com/justaugustus/enhancements/blob/2887453eac5cbc5fbd31112fd3d0be2be17b456c/keps/sig-release/2572-release-cadence/README.md)
Yes, by "bump up" I meant making the issues listed above release blocking.
+1 to not blocking the v0.4 release on the release timeline of k8s 1.22. @fabriziopandini, you are saying that we can cut v1alpha4 even if 1.22 is not out yet, given that we have fixed the known compatibility issues defined above, correct?
Yes, let's do our best to get ready for v1.22 within v0.4, but we should not wait for it (I edited my comment above to make it clear, I hope).
Just to clarify: if we're not ready for 1.22 we shouldn't block the release, but rather add the support in a patch release.
Also, given that 1.22 isn't ready yet, we're relying on an alpha/beta/rc version which, until release, could still bring in additional required changes.
I think we are in agreement.
What about closing this issue?
/close
@vincepri: Closing this issue.
Goals
Non-Goals/Future Work
User Story
As an operator, I want kubeadm to have better support for Cluster API's use cases to reduce the number of failed machines in my infrastructure.
Detailed Description
In a number of environments, machines can intermittently fail to bootstrap. The most common of these failures occur during control plane joins, which lead to temporary changes in etcd and API server availability, mediated by the speed of the underlying infrastructure and the particulars of infrastructure load balancers.
Some ugly hacks have been introduced, notably #2763, to retry kubeadm operations. As a long-term solution, Cluster API should be a good kubeadm citizen and make changes in kubeadm to do the appropriate retries, covering the variety of infrastructure providers supported by Cluster API. In addition, the KCP controller re-implements some of the
Contract changes [optional]
Data model changes [optional]
[Describe contract changes between Cluster API controllers, if applicable.]
/kind proposal