[Standard] Stabilize node distribution standard #639
Comments
Topic 1: How is node distribution handled on installations with shared control-plane nodes (e.g. Kamaji, Gardener, etc.)? This question was answered in the Container Call on 2024-06-27:
For example, regiocloud supports the Node Failure Tolerance case but not the Zone Failure Tolerance case.
Topic 2: Differentiation between Node distribution and things like High Availability or Redundancy …
I brought this issue up in today's Team Container Call and edited the above sections accordingly. As part of #649 we will also get access to Gardener and soon Kamaji clusters. One thing I want to make you aware of @cah-hbaum: in the call, it was pointed out that the term shared control-plane isn't correct. The control plane isn't shared; instead, the control-plane nodes are shared, and thus we should always say shared control-plane node. (I edited the above texts accordingly as well to refer to shared control-plane nodes.)
Another potential problem: the concept of using the "host-id" label may not play nice with VM live migrations. I do not have any operational experience with e.g. OpenStack live migrations (who triggers them, when, ...?), but I guess that any provider-initiated live migration (which might be standard practice within zones, I guess) would invalidate any scheduling decision that Kubernetes made based on the "host-id" label. As Kubernetes does not reevaluate scheduling decisions, pods may end up on the same host anyway (if the label even gets updated). That, in turn, may be worked around by using the Kubernetes descheduler project. If I did not miss anything, I guess there are roughly the following options:
…
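To illustrate the dependency being described here, below is a minimal sketch of a pod anti-affinity keyed on the host-id, written as a plain Python dict as it could appear inside a pod template. The label key `topology.scs.community/host-id` is only an assumed placeholder for whatever label the standard ends up mandating. The `IgnoredDuringExecution` part of the field name is exactly the limitation discussed above: the constraint is evaluated at scheduling time only and is not re-checked after a live migration.

```python
# Sketch (assumed label key): spread replicas of "my-ha-app" across distinct
# host-id values, i.e. distinct physical hosts -- as seen at scheduling time.
pod_template_spec_fragment = {
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": "my-ha-app"}},
                    # Hypothetical host-id label; a live migration that changes
                    # the underlying hypervisor is not reflected in placements
                    # that were already made against this key.
                    "topologyKey": "topology.scs.community/host-id",
                }
            ]
        }
    }
}
```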
@joshmue Thanks for bringing this to our attention! So let me try to get this straight.
I think I still don't quite understand what happens in case of a live migration. I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live". But would k8s even notice anything about that? What would the process look like? I think I also don't quite understand how the node distribution is implemented. I suppose two levels of anti-affinity would be required: …
How does the host-id label play into this process?
Still, "I do not have any operational experience with e. g. Openstack live migrations", but AFAIK:
Yes (not only Control Plane nodes, though).
Exactly that is the problem: Kubernetes would not (per-se) notice anything about that and the process would be undefined.
Generally, not well, as relying on it for Pod scheduling (instead of e. g. |
Please keep in mind that I am researching this topic from scratch, but after some digging I was able to find some useful information here: https://trilio.io/kubernetes-disaster-recovery/kubernetes-on-openstack/. I will ask some questions once I know a little bit more about the topic.
Placing nodes on separate physical machines helps tolerate faults and supports failsafe mechanisms. This is because the anti-affinity policies result in the separation of key components across various nodes, which, in effect, reduces the likelihood of failures. Also, around such distribution requirements, certain checks have to be put in place to ensure that the node distribution standards are met. While the host-id label is helpful for distinguishing physical hosts, it can pose some difficulties during live migrations, especially because it does not track node relocation. Instead of a host-based label, we could use a cluster name to designate a 'logical group' or 'cluster zone' as a software construct, which can be adjusted along with migrations in the respective cluster. Such a label would be less strict and would interact better with live eviction.
This is what the theory says, at least. EDIT: I have found a problem with what I have written here, because I had not taken into consideration that our Kubernetes cluster is standing on OpenStack instances with separated hardware nodes.
Topic 1: I have been able to install a tenant control plane using Kamaji, but there are several steps that have to be done before it happens.
This installation is performed using manifests; I am leaving a link here to get to know the documentation: …
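For completeness, here is a rough sketch of how such a TenantControlPlane object can be created through the Kubernetes API. The API group/version `kamaji.clastix.io/v1alpha1`, the resource plural `tenantcontrolplanes`, and the empty spec are assumptions based on my reading of the Kamaji documentation and should be double-checked against it.

```python
# Sketch: creating a Kamaji TenantControlPlane as a custom resource.
# Group/version/plural and the spec contents are assumptions -- consult the
# Kamaji docs for the authoritative schema.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

tenant_cp = {
    "apiVersion": "kamaji.clastix.io/v1alpha1",
    "kind": "TenantControlPlane",
    "metadata": {"name": "tenant-cp-example", "namespace": "default"},
    "spec": {
        # Kubernetes version, control-plane replicas, service exposure, etc.,
        # filled in according to the Kamaji documentation.
    },
}

api.create_namespaced_custom_object(
    group="kamaji.clastix.io",
    version="v1alpha1",
    namespace="default",
    plural="tenantcontrolplanes",
    body=tenant_cp,
)
```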
I think we are trying to solve for a special case here. Live migrations don't happen all that often. I plead for ignoring live migration.
The control plane node is a VM. So there is only one dimension.
When speaking about usual workload K8s nodes, this would mean that there can only be either...
Right?
Initially, with active anti-affinity, no two Kubernetes nodes can run on a single hypervisor. So yes, that's right. Once you reach this limit, you either need to add more hypervisors or re-evaluate the anti-affinity rule.
Kubernetes may make scheduling decisions based on outdated host-id labels if any VMs are migrated. This could lead to clusters where multiple nodes end up on the same hypervisor, potentially creating unexpected single points of failure.
👍 What I'm trying to get at is this:
If there are active OpenStack anti-affinities, there is no use case for a "host-id" node label to begin with. If node and hypervisor have a 1:1 (or 0:1) relationship, K8s pod anti-affinities can just target …
If there are no OpenStack anti-affinities, ...
In conclusion, given that live migrations may happen occasionally, I do not see any use case for this label.
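For context on what "active OpenStack anti-affinities" look like in practice, here is a minimal sketch using the openstacksdk Python client. The cloud name, the group name, and the choice of soft-anti-affinity are placeholders, and the exact arguments depend on the SDK version and Nova microversion (newer microversions take a single policy plus rules instead of a policies list).

```python
# Sketch: create a (soft-)anti-affinity server group so that Nova tries to
# place the Kubernetes node VMs on distinct hypervisors.
import openstack

conn = openstack.connect(cloud="mycloud")  # placeholder cloud name

server_group = conn.compute.create_server_group(
    name="k8s-node-anti-affinity",
    policies=["soft-anti-affinity"],  # or "anti-affinity" for a hard constraint
)

# New node VMs would then reference this group via the "group" scheduler hint
# at creation time, so the constraint is enforced by Nova, not by Kubernetes.
print(server_group.id)
```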
The labels are not meant to influence scheduling in any way. They are meant to make scheduling transparent to the end user.
Sure? The current standard says:
...and then goes on to describe … This concerns K8s scheduling, not OpenStack scheduling, of course.
The relevant point (and the one that describes the labels) is …
I ought to know, because I worked with Hannes on that, and we added this mostly because we needed the labels for the compliance test.
So the standard is basically intended to say: …?
I'm not competent to speak on scheduling. It's quite possible that these labels are ALSO used for scheduling. In the case of region and availability zone, this is probably true. The question is: how does host-id play into this?
I think I see where you're coming from, having a focus on compliance testing. Do you see my point that, from a general POV, it requires a great deal of imagination to interpret the standard the way it was intended?
I do not think it should, for the reasons above. I also guess compliance tests should reflect the requirements of a standard, and AFAIU the standard does not forbid placing multiple nodes on the same host.
https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone
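For reference, the well-known zone label linked above is what Kubernetes-level spreading mechanisms normally key on. A minimal sketch of a topology spread constraint follows, as a plain Python dict as it could appear in a pod spec; the app label and the skew value are illustrative placeholders.

```python
# Sketch: spread replicas of "my-ha-app" evenly across failure zones using the
# standard topology.kubernetes.io/zone label; placeholders are illustrative.
pod_spec_fragment = {
    "topologySpreadConstraints": [
        {
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {"matchLabels": {"app": "my-ha-app"}},
        }
    ]
}
```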
The standard says: …
So we need to be able to validate that. It also says …
But you could have only one failure zone. Then, still, the control-plane nodes must be distributed over multiple physical hosts. The host-id field is not necessarily meant for scheduling (particularly not for the control plane, where the user cannot schedule anything anyway, right?). Does that make sense?
BTW, I'm open to improving the wording to avoid any misunderstanding here. At this point, though, we first have to agree on what's reasonable at all.
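To make the validation angle concrete, here is a minimal sketch of such a check with the official Kubernetes Python client. The label key `topology.scs.community/host-id` is an assumed placeholder; whichever host-id label the standard actually mandates would be used instead.

```python
# Sketch: verify that no two control-plane nodes report the same host-id.
from collections import Counter
from kubernetes import client, config

HOST_ID_LABEL = "topology.scs.community/host-id"  # assumed label key

config.load_kube_config()
nodes = client.CoreV1Api().list_node().items

control_plane_nodes = [
    n for n in nodes
    if "node-role.kubernetes.io/control-plane" in (n.metadata.labels or {})
]
hosts = Counter(
    (n.metadata.labels or {}).get(HOST_ID_LABEL, "<unset>")
    for n in control_plane_nodes
)
print(dict(hosts))

# Distribution holds only if every control-plane node carries the label and no
# host-id value occurs more than once.
assert "<unset>" not in hosts and all(c == 1 for c in hosts.values())
```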
Did not see that, actually! Still, let's go through some cases of what "failure zone" may mean:
If …
Theoretically, one may define "failure zone" as something like: …
But the standard already implicitly says that the smallest imaginable unit is a single …
EDIT: But yes, introducing this specific requirement may be a bit confusing, given the other wording referring to logical failure zones. And mandating it may only be checkable by having a "host-id" with some strict definition - or (better) by defining that …
I see that this is not explicitly forbidden in the standard, but all the text hints towards it being forbidden, so I assumed it was: …
Like a network?
I have thought about it like this: we have one failure zone with one control plane, and the workers may be spread across different machines, physical or virtual.
Like it is mentioned here: …
Well. It seems that the concepts of failure zone and physical host are a bit at odds. From the Kubernetes POV, two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean? If that's all true, then I'm wondering why the hosts have been introduced in the first place. There must have been discussions about that in Team Container with intelligent and experienced people involved.
I just wanted to give an example of a theoretically viable, yet hypothetical runtime unit within a single machine.
Yes. CSPs with hosts as failure zones would still have problems with live migrations and the assumption that topology labels do not change, but by removing the "host-id" requirement, this problem should be exclusive to such small/tiny providers. On another note, the recommendation here...
does not seem to take etcd quorum and/or etcd scaling sweet spots into account (https://etcd.io/docs/v3.5/faq/). But it does not strictly mandate questionable design choices (it only slightly hints at them), so I will not go into too much detail here.
You raised an important point about the potential misalignment between the concepts of failure zones and physical hosts. AFAIU from Kubernetes' perspective, failure zones are abstract constructs defined to ensure redundancy and fault isolation. The actual granularity of these zones (e.g., a rack, a data center, or even an individual physical host) depends on the cloud service provider's (CSP's) design. Kubernetes treats all nodes within a failure zone as equally vulnerable because the assumption is that a failure impacting one could potentially affect all others in the same zone. This approach is why zones matter more than individual hosts when scheduling workloads. For smaller CSPs, defining each host or rack as its own failure zone might be a practical approach to increase redundancy, especially when physical resources are limited. It aligns with your suggestion to mandate multiple zones while dropping specific focus on physical hosts.
Etcd's own documentation highlights the challenges of maintaining quorum and scalability in distributed systems, particularly as the cluster size increases beyond the optimal sweet spot of 3-5 nodes. Right now I am wondering: what alternative strategies could be employed to balance the need for fault tolerance across failure zones while adhering to etcd's quorum and scaling best practices?
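For reference, the quorum arithmetic behind those sweet spots (per the etcd FAQ): a cluster of n members needs floor(n/2)+1 members for quorum and therefore tolerates floor((n-1)/2) failures, which is why even member counts add no fault tolerance. A quick illustration:

```python
# Quorum size and failure tolerance for etcd cluster sizes 1..7 (see etcd FAQ).
for n in range(1, 8):
    quorum = n // 2 + 1
    tolerated = n - quorum  # equals (n - 1) // 2
    print(f"{n} members: quorum {quorum}, tolerates {tolerated} failure(s)")

# 3 and 4 members both tolerate one failure, 5 and 6 both tolerate two --
# hence the usual recommendation of odd cluster sizes (3 or 5).
```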
We have an availability zone standard (0121); you probably know it better than I do. I would highly discourage disconnecting the notion of infra-layer availability zones from "Failure Zones" in Kubernetes now. That is a recipe for confusion. Single hosts can fail for a variety of reasons, e.g. broken RAM, a broken PSU, a broken network port, or even just a regular maintenance operation (hypervisor or firmware upgrade). In a data center, these events happen much more often than the outage of a complete room/zone/AZ. We want to avoid one host taking down several control-plane nodes in the cluster; that is the whole point of having several nodes in the first place. Yes, multi-AZ is nicer, but that is a luxury that we don't always have. Having multiple physical hosts is much better than not. If we cannot succeed with an upstream …
Given that comment, can we assume that the Node Distribution and High Availability topics will be separated for the purposes of the standard? Would separate standards be clearer than creating corner cases?
Also, I have found that in the k8s-node-anti-affinity standard there is already a note regarding high availability, but it still does not define how those machines have to be connected to each other. In a productive environment, the control plane usually runs across multiple machines and … That is why I have created a separate scs-0219-v1-high-availability on my branch, in draft mode, for discussion.
Follow-up for #524
The goal is to set the `Node distribution` standard to `Stable` after all discussion topics are debated and decided and the necessary changes derived from these discussions are integrated into the standard and its test.
The following topics need to be discussed:
- Differentiation between `Node distribution` and things like `High Availability` or `Redundancy`? Should this standard only be a precursor for a `High Availability` standard? (more information under Taxonomy of failsafe levels #579)
- Should `etcd` (https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/#external-etcd-topology) be integrated here? (see Create v2 of node distribution standard (issues/#494) #524 (comment))
(https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/#external-etcd-topology) be integrated here? (see Create v2 of node distribution standard (issues/#494) #524 (comment))The text was updated successfully, but these errors were encountered: