add dry run e2e tests #2653

Open
1 of 3 tasks
neolit123 opened this issue Feb 9, 2022 · 12 comments · May be fixed by kubernetes/kubernetes#126776


neolit123 commented Feb 9, 2022

kubeadm is currently missing integration / e2e tests for --dry-run.
this means that if we happen to break our dry-run support for a particular command (e.g. init), we will not know about it until users report it to us.

xref #2649

kubeadm has integration tests here:
https://github.com/kubernetes/kubernetes/tree/master/cmd/kubeadm/test/cmd
these tests execute a precompiled kubeadm binary to perform some checks and look for exit status 0.

we can use the same method for the init, join and reset tests with --dry-run, because dry-run is reentrant.
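
for illustration only, a check of that shape could look roughly like this (a shell sketch, not the actual Go harness under cmd/kubeadm/test/cmd; the ./kubeadm path is an assumption):

```bash
# sketch: run the precompiled binary with --dry-run and require exit status 0
# (./kubeadm is a placeholder for wherever the test harness builds the binary)
if ./kubeadm init --dry-run > /dev/null 2>&1; then
  echo "PASS: kubeadm init --dry-run exited 0"
else
  echo "FAIL: kubeadm init --dry-run exited non-zero"
  exit 1
fi
```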

but we cannot use this method for the upgrade * commands, because --dry-run for upgrade expects an existing cluster:

  • kubeconfig files in /etc/kubernetes/...
  • a running kube-apiserver
  • other running components...

to have everything in the same place we can add dry-run checks as part of a kinder e2e test workflow:
https://github.com/kubernetes/kubeadm/tree/main/kinder/ci/workflows

the workflow can look like the following (a rough command sketch follows the list):

  • allocate a kinder cluster with 1 node
  • call kubeadm init --dry-run on it (add --upload-certs and other special flags, how to test external CA?)
  • call kubeadm join --dry-run on it (add --control-plane, --certificate-key and other flags?)
  • call kubeadm reset --dry-run on it
  • call kubeadm init ... to create an actual k8s node
  • call kubeadm upgrade apply --dry-run to dry run the "primary node" upgrade of this node
  • call kubeadm upgrade node --dry-run to dry run the "secondary node" upgrade of this node
  • .. cleanup
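
a rough sketch of the commands the workflow tasks would run on the node(s); the flag values (<token>, <hash>, <version>, the endpoint) are placeholders and the exact flag sets are assumptions, the real tasks would live under kinder/ci/workflows:

```bash
# dry-run phases first, on the single node
kubeadm init --dry-run --upload-certs
kubeadm join <control-plane-endpoint>:6443 --dry-run \
  --token <token> --discovery-token-ca-cert-hash sha256:<hash>
kubeadm reset --dry-run --force

# then a real node, so the upgrade dry-runs have a cluster to inspect
kubeadm init                               # create an actual control-plane node
kubeadm upgrade apply --dry-run <version>  # "primary node" upgrade path
kubeadm upgrade node --dry-run             # "secondary node" upgrade path
kubeadm reset --force                      # cleanup
```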

tasks:

@neolit123 neolit123 added area/dry-run area/test help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Feb 9, 2022
@neolit123 neolit123 added this to the v1.24 milestone Feb 9, 2022
@neolit123 neolit123 modified the milestones: v1.24, v1.25 Mar 29, 2022

SataQiu commented Apr 12, 2022

@neolit123 It seems that we cannot run kubeadm join --dry-run without an actual Kubernetes control-plane. 😅

The join phase will try to fetch the cluster-info ConfigMap even in dry-run mode.

I0412 10:01:39.066015     149 join.go:530] [preflight] Discovering cluster-info
I0412 10:01:39.067243     149 token.go:80] [discovery] Created cluster-info discovery client, requesting info from "127.0.0.1:6443"
I0412 10:01:39.101380     149 round_trippers.go:553] GET https://127.0.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s  in 10 milliseconds
I0412 10:01:39.104282     149 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 127.0.0.1:6443: connect: connection refused
I0412 10:01:45.026633     149 round_trippers.go:553] GET https://127.0.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s  in 5 milliseconds
I0412 10:01:45.027445     149 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 127.0.0.1:6443: connect: connection refused

Therefore, we need at least one worker node to complete the dry-run tests.
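
for example, with a two-node kinder cluster (one control-plane node plus one worker; names and discovery arguments below are placeholders, not the real workflow) the sequencing could be:

```bash
# sketch: join --dry-run only works once a real control plane is reachable
# on the control-plane node:
kubeadm init
# on the worker node, discovery of the cluster-info ConfigMap now succeeds:
kubeadm join <control-plane-endpoint>:6443 --dry-run \
  --token <token> --discovery-token-ca-cert-hash sha256:<hash>
```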


neolit123 commented Apr 12, 2022

Hm, i think we should fix that and use a fake CM or the dry-run client... in dry-run, other API calls already work like that. But if we use the dry-run client, it means we will probably have to skip validation of the CM as well 🤔

If you prefer, we can merge the initial test job without the join test, but it seems we have to fix it in k/k eventually.

EDIT: Or maybe ... join does need a control plane and it will fail later even if we use fake cluster-info?


SataQiu commented Apr 12, 2022

neolit123 replied:

i think the refactor is doable. we need to rename the "init" dry-run client to be a generic one and use it for "join" as well.
it's probably not that much work, but i haven't looked at all the details.

we can merge the current PR, but keep this issue open until we can do that in k/k after code freeze for 1.24.


neolit123 commented Apr 20, 2022

@SataQiu looks like the current e2e tests are flaky.

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-dryrun-latest

the error is:

[preflight] Some fatal errors occurred:
[ERROR CRI]: container runtime is not running: output: time="2022-04-20T18:56:52Z" level=fatal msg="connect: connect endpoint 'unix:///var/run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded"
, error: exit status 1

my guess is that we are running kubeadm ... inside the nodes before the container runtime has started. i can't remember if kinder's kubeadm-init action has a "wait for CRI" of sorts, but likely it does. one option would be to add e.g. sleep 10 before the first kubeadm init in task-03-init-dryrun.

EDIT: unclear if sleep is in the node images, possibly yes.

alternatively this could be a weird bug where containerd in the nodes is simply refusing to start for some reason, despite:

I0420 20:57:18.894578 107 initconfiguration.go:117] detected and using CRI socket: unix:///var/run/containerd/containerd.sock
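
instead of a fixed sleep, a small wait loop on the CRI socket could also work (a sketch only; the containerd socket path and the 30x2s retry budget are assumptions):

```bash
# sketch: poll the CRI endpoint for up to ~60s before the first kubeadm call
for i in $(seq 1 30); do
  crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock info > /dev/null 2>&1 && break
  echo "waiting for containerd ($i/30) ..."
  sleep 2
done
```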


SataQiu commented Apr 26, 2022

> @SataQiu looks like the current e2e tests are flaky. [...]

It looks like the kubeadm-kinder-dryrun job was deleted by kubernetes/test-infra@c694052 😓


neolit123 commented Apr 26, 2022

Oh looks like @RA489 's PR deleted it in the 1.24 updates and i didn't see that..

Can you please send it again.


SataQiu commented Apr 26, 2022

> Oh looks like @RA489 's PR deleted it in the 1.24 updates and i didn't see that..
> Can you please send it again.

Sure!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@neolit123 neolit123 removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022

RA489 commented Jul 25, 2022

/remove-lifecycle stale

@neolit123 neolit123 modified the milestones: v1.25, v1.26 Aug 25, 2022

neolit123 commented Oct 11, 2022 via email

@neolit123 neolit123 modified the milestones: v1.26, v1.27 Nov 21, 2022
@neolit123 neolit123 modified the milestones: v1.27, v1.28 Apr 17, 2023
@neolit123 neolit123 modified the milestones: v1.28, v1.29 Jul 21, 2023
@neolit123 neolit123 modified the milestones: v1.29, v1.30 Nov 1, 2023
@neolit123 neolit123 modified the milestones: v1.30, v1.31 Apr 5, 2024
@neolit123 neolit123 modified the milestones: v1.31, v1.32 Aug 7, 2024
@neolit123 neolit123 self-assigned this Aug 7, 2024