kubeadm join does not explicitly wait for etcd to have grown when joining secondary control plane #1353
Comments
/assign |
Hi, I'm still getting the same error message while performing `kubeadm join`. I'm following the steps specified here -> kubeadm/high-availability. Error message (partially omitted):
For a more detailed log, I've made another attempt with verbosity set to 10 -> HERE
|
I'm in the same boat as @oneoneonepig. I'm writing a Chef cookbook, and with kubeadm 1.18 my tests are failing with the described error when the second master joins the first one. |
we are not seeing etcd-related e2e failures in our most recent signal:
this will hopefully get fixed once we have proper retries for |
Here is the result of my test suite:
My test is doing the following steps:
The tests are all failing at the same step:
|
does this fix work for flannel? please note that if one CNI plugin fails you should switch to another - kubeadm recommends Calico or WeaveNet. |
Thank you @neolit123 for pointing me to this, but I don't think this is useful for me, as the Flannel pods are fine. My cookbook waits for the Flannel pods to be in the "Running" state, and it also waits for the master node to be in the "Ready" state. I have tests on both the Flannel pods and the master node status.
Oh ... I remember having read on the kubernetes.io documentation that Flannel should work in many cases, but I can't find it now. Anyway, I want to implement more CNI drivers, so I can give Calico or WeaveNet a try, but I'm not sure that would solve the issue, since with kubeadm 1.14 and 1.15 everything works well. |
we saw a number of issues related to flannel in recent k8s versions. |
Oh, okay, then I will give Calico a try 😃 |
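(Side note for anyone following along: switching to Calico typically amounts to applying its manifest after kubeadm init; the manifest URL below is the one Calico documented at the time and may have moved in newer releases.)

```sh
# Install the Calico CNI plugin on the first control-plane node.
# The manifest URL is illustrative and may differ for newer Calico versions.
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
```

Note that the pod CIDR expected by the Calico manifest may need to match the podSubnet passed to kubeadm.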
I am seeing the same issue as reported by @oneoneonepig and @zedtux on fresh cluster installs on Ubuntu 18.04 machines with both v1.18.0 and v1.18.1. Have tried with both Weave and Calico ... I am following the instructions at https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/ I am using a "stacked control plane and etcd node" setup, and HAProxy as my load balancer. Have tried changing the haproxy config to just refer to the first control plane node (in case there is some race with the LB switching to the second control plane node before it is fully configured ...)
Am a little stuck with how to make progress here, so would appreciate some guidance from the community. |
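(For comparison, a minimal HAProxy configuration for fronting the apiservers of a stacked control plane looks roughly like the sketch below; the bind port and backend addresses are placeholders and have to match your own controlPlaneEndpoint and control-plane nodes.)

```
# Illustrative haproxy.cfg fragment; IPs and ports are placeholders.
frontend kube-apiserver
    bind *:9443
    mode tcp
    option tcplog
    default_backend kube-apiserver-backend

backend kube-apiserver-backend
    mode tcp
    option tcp-check
    balance roundrobin
    server control-plane-1 192.168.0.11:6443 check
    server control-plane-2 192.168.0.12:6443 check
```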
Am wondering if a new issue should be opened for these recent problems, given that this issue is in "Closed" state? |
we have e2e tests using HAProxy as the LB and we are not seeing issues, so it must be specific to your setup: a bit hard to advise on how to resolve it, as kubeadm is LB agnostic.
i think solving this new issue may resolve the issues that you are seeing:
is this problem permanent or only for a period of time during LB reconfiguration?
to my understanding this is where the problem lies. once you reconfigure the LB you are going to see an LB blackout, but it should be temporary. |
Wanted to clarify that I was not changing the HAProxy configuration midstream ... in one experiment, I had listed both the control plane nodes as my backend servers, and in the other, just the first node. Is there a timeline for #2092? Also, in your e2e tests, are you using a physical cluster (with potentially additional network delays between the components), or a single host setup with multiple containers? I am not very familiar with the kind setup, hence this question. My experiments are being run on a cluster with multiple physical nodes (the LB and the two control-plane nodes are on different hosts). Are there some logs I can collect/share that might shed light on this issue? |
should be fixed in 1.19 and backported to 1.18, 1.17...hopefully.
it's a docker in docker setup, so yes the nodes are local containers and we don't simulate network latency. yet the issue that is being discussed in this thread is apparently very rare. are the networking delays significant on your side? and are the control plane machines spec-ed well on CPU/RAM?
@zedtux 's logs are missing. @ereslibre might have more suggestions. |
The control plane nodes are KVM VMs on Ubuntu hosts. Each control plane node VM is provisioned with 4 CPUs and 8 GB of DRAM. The inter-VM latencies don't show anything out of place. I have repeated the experiment starting from a clean install and will share the following logs below:
Appreciate any insights you might have. --- Output of
--- Output of
--- Output of
--- Output of attempted
--- Output of
|
Here are the commands output.

Few information

Before running kubeadm, keepalived is configured in order to create the VIP for the first control plane.

kubeadm init (first control plane)

kubeadm config file:

apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
advertiseAddress: 172.28.128.10
bindPort: 6443
certificateKey: bf9685289b0e0a8685cd412608f2d9ac33c99a57bd44fde3cf3ebcef59d928c9
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: 1.16.8
controlPlaneEndpoint: "172.28.128.10:9443"
clusterName: kubernetes
apiServer:
extraArgs:
advertise-address: 172.28.128.10
networking:
dnsDomain: cluster.local
podSubnet: 10.244.0.0/16
serviceSubnet: 10.96.0.0/12
Worker join
kubeadm init (second control plane)
|
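(For context, joining the second control plane in a setup like the one above is typically done with a command along these lines; the endpoint and certificate key are taken from the config shown earlier, while the token and CA cert hash are placeholders printed by kubeadm init on the first control plane.)

```sh
# Placeholders: <token> and <hash> come from the kubeadm init output on the first
# control plane; --certificate-key matches the certificateKey in the config above.
kubeadm join 172.28.128.10:9443 \
    --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane \
    --certificate-key bf9685289b0e0a8685cd412608f2d9ac33c99a57bd44fde3cf3ebcef59d928c9
```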
@zedtux i'm not seeing errors when your worker and control-plane nodes are joining.
i think you might be facing this same etcd bug: not directly related to RAID, but my understanding there is that etcd does not tolerate slow disks. |
@neolit123 - thanks for the etcd slow disk issue pointer. |
Okay, checked my configuration and ran fio-based tests to see if the disk fdatasync latency is the issue ... It does not appear to be. I followed the guidance in:

For the suggested fio command (--ioengine=sync --fdatasync=1 --size=22m --bs=2300), my two control plane nodes show 99% fsync latency well below the required 10ms. I verified that the underlying disks are SSDs, with data loss protection enabled (so Linux sees them as "write through" write-cache devices). Control plane node 1:
Control plane node 2:
Looking through the etcdserver logs I had shared yesterday, the warning about the range request taking too long comes after a sequence of gRPC errors (and only after the second node has been added to the cluster) ... so don't know if disk latency is the cause or something else:
For what it is worth, I saw a post on StackOverflow from a few days back where someone else is reporting the same issue:

Finally, I assume your internal e2e tests are doing this, but wanted to confirm that you are also testing the 'stacked control plane and etcd' configuration, not just the external etcd one ... wanted to rule out any possible issues there. Thanks! |
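(For anyone wanting to run the same disk check: the fio job referenced above looks roughly like the following; the target directory is a placeholder and should point at the filesystem backing etcd, e.g. the one holding /var/lib/etcd.)

```sh
# Measure fdatasync latency the way the etcd guidance suggests; the 99th percentile
# of the fsync values should stay below ~10ms. The directory is a placeholder.
mkdir -p /var/lib/etcd-fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-fio-test --size=22m --bs=2300 --name=etcd-fsync-test
```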
we have no solution for this on the kubeadm side. current indicators point towards an etcd issue.
we are primarily testing stacked etcd, but also have some tests for external. |
@neolit123 the issue is not when the worker joins the first control plane, it's when the second control plane tries to join the first one. |
your logs show:
...
so it says it passed. |
Oh, actually you are right, I was focusing on the following line:
But then there are those lines:
So you are right, it works. 🤔. I'll review my test suite. |
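(If it helps the test suite: one way to confirm that etcd really grew to two members is to query it directly on the first control plane. The certificate paths below are kubeadm's defaults on a stacked setup; any client cert signed by the etcd CA works.)

```sh
# List etcd members from the first control-plane node; expects the etcdctl v3 CLI
# (alternatively, exec into the etcd static pod and run the same command there).
ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    member list
```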
@neolit123 - Wanted to provide an update ... After a few days of debugging, managed to root cause and fix the issue. My setup had a misconfiguration, where the control-plane Linux hosts had the network interface MTU set to the wrong value. Fixing the MTU enabled the control-plane cluster setup to work as intended. What I did not anticipate, and why it took a while to debug is the following:
So, while I know how to fix this now, wanted to request that this be documented in the setup requirements so others don't stumble into it and end up spending the same amount of time to find/fix the issue as I did. Don't know how you would do it, but adding it to the pre-flight checks for kubeadm would also be helpful. Thanks again for all the time you spent on this. |
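(For anyone hitting the same MTU problem: the value can be inspected and corrected with iproute2; the interface name and MTU below are only examples and depend on your environment and CNI overlay.)

```sh
# Inspect the current MTU of the node's primary interface (name is environment-specific).
ip link show dev ens3

# Set a corrected value for the current boot; persist it via netplan, ifupdown,
# or NetworkManager depending on the distribution.
sudo ip link set dev ens3 mtu 1500
```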
what value did you set?
there are 100s of different things that can go wrong with the networking of a Node, so i don't think kubeadm should document all of them. (from a previous post)
this is the first case i've seen a user report MTU issues. |
Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use `kubeadm version`): 1.13 onwards.

What happened?

When joining a secondary control plane (`kubeadm join --experimental-control-plane`, or by using `controlPlane` configuration on the `kubeadm` YAML), and `etcd` needs growing from 1 instance to 2, there's a temporary blackout that we are not explicitly waiting for.

This is a spin-off of #1321 (comment).

Different issues were leading to the #1321 error on the new joining master: `error uploading configuration: Get https://10.11.0.1:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config: unexpected EOF`.

- Kubeadm fails to bring up a HA cluster due to EOF error when uploading configmap #1321: `kubeadm` won't recreate them (as they exist), but their SANs are wrong and `etcd` will fail to grow to 2 nodes. The previous error happens because `uploadconfig` is the immediate next step after we write the etcd static pod manifests on the new master.
- kubeadm join controlplane not pulling images and fails #1341: the `uploadconfig` phase can timeout with the previously mentioned error because the system is still pulling images while we are trying to upload the new cluster configuration (with the new node).
- This issue: when growing `etcd` from 1 instance to 2 there's a blackout of etcd that we should explicitly wait for. Instead, we are at the moment implicitly waiting for it in this phase (`uploadconfig`). If this blackout is long enough (many index syncs required, slow I/O), the `uploadconfig` phase can timeout with the error mentioned before.

What you expected to happen?
When we grow `etcd` we should perform an intermediate step explicitly waiting for `etcd` to be available (or healthy). Then, we can continue with the next step, in this case `uploadconfig`.

The timeout happens here because `uploadconfig` happens to be the first phase after the static etcd pod creation that tries to reach the apiserver while the etcd blackout is still happening, when it tries to call `GetClusterStatus`.
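A rough sketch of the explicit wait described above, expressed as a shell loop for illustration (kubeadm itself would do this through its etcd client with retries); it assumes `etcdctl` is available on the joining node and uses kubeadm's default certificate paths:

```sh
# Poll the grown etcd cluster until it reports healthy again before moving on to the
# uploadconfig phase; certificate paths are kubeadm's defaults on a stacked setup.
until ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/peer.crt \
      --key=/etc/kubernetes/pki/etcd/peer.key \
      endpoint health; do
    sleep 5
done
```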
How to reproduce it (as minimally and precisely as possible)?
Please, refer to #1321 (comment)