[Standard] Stabilize node distribution standard #639
Comments
Topic 1: How is node distribution handled on installations with shared control-plane nodes (e.g. Kamaji, Gardener, etc.)? This question was answered in the Container Call on 2024-06-27:
For example, regiocloud supports the Node Failure Tolerance case but not the Zone Failure Tolerance case.
Topic 2: Differentiation between Node distribution and things like High Availability or Redundancy …
I brought this issue up in today's Team Container Call and edited the above sections accordingly. As part of #649 we will also get access to Gardener and soon Kamaji clusters. One thing I want to make you aware of @cah-hbaum: in the call, it was pointed out that the term shared control-plane isn't correct. The control plane isn't shared; instead, the control-plane nodes are shared, and thus we should always say shared control-plane node. (I edited the above texts accordingly as well to refer to shared control-plane nodes.)
Another potential problem: the concept of using the "host-id" label may not play nice with VM live migrations. I do not have any operational experience with e.g. OpenStack live migrations (who triggers them, when, ...?), but I guess that any provider-initiated live migration (which might be standard practice within zones, I guess) would invalidate any scheduling decision that Kubernetes made based on the "host-id" label. As Kubernetes does not reevaluate scheduling decisions, pods may end up on the same host anyway (if the label even gets updated). That, in turn, may be worked around by using the Kubernetes descheduler project. If I did not miss anything, I guess there are roughly the following options:
…
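To illustrate the dependency being described here, below is a minimal sketch of a pod anti-affinity keyed on the host-id, written as a plain Python dict as it could appear inside a pod template. The label key `topology.scs.community/host-id` is only an assumed placeholder for whatever label the standard ends up mandating. The `IgnoredDuringExecution` part of the field name is exactly the limitation discussed above: the constraint is evaluated at scheduling time only and is not re-checked after a live migration.

```python
# Sketch (assumed label key): spread replicas of "my-ha-app" across distinct
# host-id values, i.e. distinct physical hosts -- as seen at scheduling time.
pod_template_spec_fragment = {
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": "my-ha-app"}},
                    # Hypothetical host-id label; a live migration that changes
                    # the underlying hypervisor is not reflected in placements
                    # that were already made against this key.
                    "topologyKey": "topology.scs.community/host-id",
                }
            ]
        }
    }
}
```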
@joshmue Thanks for bringing this to our attention! So let me try to get this straight.
I think I still don't quite understand what happens in case of a live migration. I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live". But would k8s even notice anything about that? What would the process look like? I think I also don't quite understand how the node distribution is implemented. I suppose two levels of anti-affinity would be required: …
How does the host-id label play into this process?
Still, "I do not have any operational experience with e. g. Openstack live migrations", but AFAIK:
Yes (not only Control Plane nodes, though).
Exactly that is the problem: Kubernetes would not (per-se) notice anything about that and the process would be undefined.
Generally, not well, as relying on it for Pod scheduling (instead of e. g. |
Please keep in mind that I am researching this topic from scratch, but after some digging I was able to find some useful information here: https://trilio.io/kubernetes-disaster-recovery/kubernetes-on-openstack/. I will ask some questions once I know a little bit more about the topic.
Placing nodes on separate physical machines helps tolerate faults and supports failsafe mechanisms. This is because the anti-affinity policies result in the separation of key components across various nodes, which, in effect, reduces the likelihood of failures. Also, around such distribution requirements, certain checks have to be put in place to ensure that the node distribution standards are met. While the host-id label is helpful for distinguishing physical hosts, it can pose some difficulties during live migrations, especially because it does not track node relocation. Instead of a host-based label, we could use a cluster name to designate a 'logical group' or 'cluster zone' as a software construct, which can be adjusted along with migrations in the respective cluster. Such a label would be less strict and would interact better with live eviction.
This is what the theory says, at least. EDIT: I have found a problem with what I have written here, because I had not taken into consideration that our Kubernetes cluster is standing on OpenStack instances with separated hardware nodes.
Topic 1: I have been able to install a tenant control plane using Kamaji, but there are several steps that have to be done before it happens.
This installation is performed using manifests; I am leaving a link here to get to know the documentation: …
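For completeness, here is a rough sketch of how such a TenantControlPlane object can be created through the Kubernetes API. The API group/version `kamaji.clastix.io/v1alpha1`, the resource plural `tenantcontrolplanes`, and the empty spec are assumptions based on my reading of the Kamaji documentation and should be double-checked against it.

```python
# Sketch: creating a Kamaji TenantControlPlane as a custom resource.
# Group/version/plural and the spec contents are assumptions -- consult the
# Kamaji docs for the authoritative schema.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

tenant_cp = {
    "apiVersion": "kamaji.clastix.io/v1alpha1",
    "kind": "TenantControlPlane",
    "metadata": {"name": "tenant-cp-example", "namespace": "default"},
    "spec": {
        # Kubernetes version, control-plane replicas, service exposure, etc.,
        # filled in according to the Kamaji documentation.
    },
}

api.create_namespaced_custom_object(
    group="kamaji.clastix.io",
    version="v1alpha1",
    namespace="default",
    plural="tenantcontrolplanes",
    body=tenant_cp,
)
```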
I think we are trying to solve for a special case here. Live migrations don't happen all that often. I plead for ignoring live migration.
The control plane node is a VM. So there is only one dimension.
When speaking about usual workload K8s nodes, this would mean that there can only be either...
Right?
Initially, with active anti-affinity, no two Kubernetes nodes can run on a single hypervisor. So yes, that's right. Once you reach this limit, you either need to add more hypervisors or re-evaluate the anti-affinity rule.
Kubernetes may make scheduling decisions based on outdated host-id labels if any VMs are migrated. This could lead to clusters where multiple nodes end up on the same hypervisor, potentially creating unexpected single points of failure.
👍 What I'm trying to get at is this:
If there are active OpenStack anti-affinities, there is no use case for a "host-id" node label to begin with. If node and hypervisor have a 1:1 (or 0:1) relationship, K8s pod anti-affinities can just target …
If there are no OpenStack anti-affinities, ...
In conclusion, given that live migrations may happen occasionally, I do not see any use case for this label.
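For context on what "active OpenStack anti-affinities" look like in practice, here is a minimal sketch using the openstacksdk Python client. The cloud name, the group name, and the choice of soft-anti-affinity are placeholders, and the exact arguments depend on the SDK version and Nova microversion (newer microversions take a single policy plus rules instead of a policies list).

```python
# Sketch: create a (soft-)anti-affinity server group so that Nova tries to
# place the Kubernetes node VMs on distinct hypervisors.
import openstack

conn = openstack.connect(cloud="mycloud")  # placeholder cloud name

server_group = conn.compute.create_server_group(
    name="k8s-node-anti-affinity",
    policies=["soft-anti-affinity"],  # or "anti-affinity" for a hard constraint
)

# New node VMs would then reference this group via the "group" scheduler hint
# at creation time, so the constraint is enforced by Nova, not by Kubernetes.
print(server_group.id)
```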
The labels are not meant to influence scheduling in any way. They are meant to make scheduling transparent to the end user.
Sure? The current standard says:
...and then goes on to describe … This concerns K8s scheduling, not OpenStack scheduling, of course.
The relevant point (and the one that describes the labels) is …
I ought to know, because I worked with Hannes on that, and we added this mostly because we needed the labels for the compliance test.
So the standard is basically intended to say: …?
I'm not competent to speak on scheduling. It's quite possible that these labels are ALSO used for scheduling. In the case of region and availability zone, this is probably true. The question is: how does host-id play into this?
I think I see where you're coming from, having a focus on compliance testing. Do you see my point that, from a general POV, it requires a great deal of imagination to interpret the standard the way it was intended?
I do not think it should, for the reasons above. I also guess compliance tests should reflect the requirements of a standard, and AFAIU the standard does not forbid placing multiple nodes on the same host.
https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone
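For reference, the well-known zone label linked above is what Kubernetes-level spreading mechanisms normally key on. A minimal sketch of a topology spread constraint follows, as a plain Python dict as it could appear in a pod spec; the app label and the skew value are illustrative placeholders.

```python
# Sketch: spread replicas of "my-ha-app" evenly across failure zones using the
# standard topology.kubernetes.io/zone label; placeholders are illustrative.
pod_spec_fragment = {
    "topologySpreadConstraints": [
        {
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {"matchLabels": {"app": "my-ha-app"}},
        }
    ]
}
```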
The standard says: …
So we need to be able to validate that. It also says …
But you could have only one failure zone. Then, still, the control-plane nodes must be distributed over multiple physical hosts. The host-id field is not necessarily meant for scheduling (particularly not for the control plane, where the user cannot schedule anything anyway, right?). Does that make sense?
BTW, I'm open to improving the wording to avoid any misunderstanding here. At this point, though, we first have to agree on what's reasonable at all.
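To make the validation angle concrete, here is a minimal sketch of such a check with the official Kubernetes Python client. The label key `topology.scs.community/host-id` is an assumed placeholder; whichever host-id label the standard actually mandates would be used instead.

```python
# Sketch: verify that no two control-plane nodes report the same host-id.
from collections import Counter
from kubernetes import client, config

HOST_ID_LABEL = "topology.scs.community/host-id"  # assumed label key

config.load_kube_config()
nodes = client.CoreV1Api().list_node().items

control_plane_nodes = [
    n for n in nodes
    if "node-role.kubernetes.io/control-plane" in (n.metadata.labels or {})
]
hosts = Counter(
    (n.metadata.labels or {}).get(HOST_ID_LABEL, "<unset>")
    for n in control_plane_nodes
)
print(dict(hosts))

# Distribution holds only if every control-plane node carries the label and no
# host-id value occurs more than once.
assert "<unset>" not in hosts and all(c == 1 for c in hosts.values())
```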
Did not see that, actually! Still, let's go through some cases of what "failure zone" may mean:
If …
Theoretically, one may define "failure zone" as something like: …
But the standard already implicitly says that the smallest imaginable unit is a single …
EDIT: But yes, introducing this specific requirement may be a bit confusing, given the other wording referring to logical failure zones. And mandating it may only be checkable by having a "host-id" with some strict definition - or (better) by defining that …
I see that this is not explicitly forbidden in the standard, but all the text hints towards it being forbidden, so I assumed it was: …
Like a network?
I have thought about it like this: we have one failure zone with one control plane, and the workers may be spread across different machines, physical or virtual.
Like it is mentioned here: …
Well. It seems that the concepts of failure zone and physical host are a bit at odds. From the Kubernetes POV, two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean? If that's all true, then I'm wondering why the hosts have been introduced in the first place. There must have been discussions about that in Team Container with intelligent and experienced people involved.
I just wanted to give an example of a theoretically viable, yet hypothetical runtime unit within a single machine.
Yes. CSPs with hosts as failure zones would still have problems with live migrations and the assumption that topology labels do not change, but by removing the "host-id" requirement, this problem should be exclusive to such small/tiny providers. On another note, the recommendation here...
does not seem to take etcd quorum and/or etcd scaling sweet spots into account (https://etcd.io/docs/v3.5/faq/). But it does not strictly mandate questionable design choices (it only slightly hints at them), so I will not go into too much detail here.
You raised an important point about the potential misalignment between the concepts of failure zones and physical hosts. AFAIU from Kubernetes' perspective, failure zones are abstract constructs defined to ensure redundancy and fault isolation. The actual granularity of these zones (e.g., a rack, a data center, or even an individual physical host) depends on the cloud service provider's (CSP's) design. Kubernetes treats all nodes within a failure zone as equally vulnerable because the assumption is that a failure impacting one could potentially affect all others in the same zone. This approach is why zones matter more than individual hosts when scheduling workloads. For smaller CSPs, defining each host or rack as its own failure zone might be a practical approach to increase redundancy, especially when physical resources are limited. It aligns with your suggestion to mandate multiple zones while dropping specific focus on physical hosts.
Etcd's own documentation highlights the challenges of maintaining quorum and scalability in distributed systems, particularly as the cluster size increases beyond the optimal sweet spot of 3-5 nodes. Right now I am wondering: what alternative strategies could be employed to balance the need for fault tolerance across failure zones while adhering to etcd's quorum and scaling best practices?
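For reference, the quorum arithmetic behind those sweet spots (per the etcd FAQ): a cluster of n members needs floor(n/2)+1 members for quorum and therefore tolerates floor((n-1)/2) failures, which is why even member counts add no fault tolerance. A quick illustration:

```python
# Quorum size and failure tolerance for etcd cluster sizes 1..7 (see etcd FAQ).
for n in range(1, 8):
    quorum = n // 2 + 1
    tolerated = n - quorum  # equals (n - 1) // 2
    print(f"{n} members: quorum {quorum}, tolerates {tolerated} failure(s)")

# 3 and 4 members both tolerate one failure, 5 and 6 both tolerate two --
# hence the usual recommendation of odd cluster sizes (3 or 5).
```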
We have an availability zone standard (0121); you probably know it better than I do. I would highly discourage disconnecting the notion of infra-layer availability zones from "Failure Zones" in Kubernetes now. That is a recipe for confusion. Single hosts can fail for a variety of reasons, e.g. broken RAM, a broken PSU, a broken network port, or even just a regular maintenance operation (hypervisor or firmware upgrade). In a data center, these events happen much more often than the outage of a complete room/zone/AZ. We want to avoid one host taking down several control-plane nodes in the cluster; that is the whole point of having several nodes in the first place. Yes, multi-AZ is nicer, but that is a luxury that we don't always have. Having multiple physical hosts is much better than not. If we cannot succeed with an upstream …
Given that comment, can we assume that the Node Distribution and High Availability topics will be separated for the purposes of the standard? Would separate standards be clearer than creating corner cases?
Also, I have found that in the k8s-node-anti-affinity standard there is already a note regarding high availability, but it still does not define how those machines have to be connected to each other. In a productive environment, the control plane usually runs across multiple machines and … That is why I have created a separate scs-0219-v1-high-availability on my branch, in draft mode, for discussion.
Follow-up for #524
The goal is to set the `Node distribution` standard to `Stable` after all discussion topics are debated and decided and the necessary changes derived from these discussions are integrated into the standard and its test.
The following topics need to be discussed:
- Differentiation between `Node distribution` and things like `High Availability` or `Redundancy`? Should this standard only be a precursor for a `High Availability` standard? (more information under Taxonomy of failsafe levels #579)
- Should `etcd` (https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/#external-etcd-topology) be integrated here? (see Create v2 of node distribution standard (issues/#494) #524 (comment))
(https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/#external-etcd-topology) be integrated here? (see Create v2 of node distribution standard (issues/#494) #524 (comment))The text was updated successfully, but these errors were encountered: