Context
We run Karpenter in production on an O(15)-node EKS cluster, using four mutually exclusive NodePools for different classes of application.
Our primary NodePool is workload, which provisions capacity for the majority of our pods.
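(For illustration, mutual exclusivity between NodePools is commonly achieved with per-pool taints that only the matching class of pods tolerates; the sketch below shows that pattern, but the taint key, values, and nodeClassRef name are hypothetical, not our actual config.)

```yaml
# Illustrative sketch only: per-pool taints are one common way to keep
# NodePools mutually exclusive. The taint key and value are hypothetical.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload
spec:
  template:
    spec:
      # Only pods that tolerate this taint can schedule onto workload nodes;
      # each of the other three pools would carry its own distinct taint.
      taints:
        - key: example.com/node-pool # hypothetical key
          value: workload
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: workload # hypothetical EC2NodeClass name
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```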
Observed Behavior:
Approximately daily, we experience a period of high volatility among the Karpenter-managed workload nodes, caused by consolidation disruptions (reason: Underutilised).
This usually means that a large proportion of workload nodes get disrupted and replaced in a short period of time.
We usually see the newly created nodes run for about 5-10 minutes before they too are disrupted as Underutilised.
This disruption period usually persists for 2-3 generations of replacement nodes before stopping abruptly. The resulting nodes then typically run without disruption for many hours.
Notably, these events typically occur outside office hours, when changes to the running pods (e.g. rolling upgrades) are very unlikely and traffic is very low.
The resulting node topology is usually comparable to the starting topology, if not more complex, which does not suggest any significant resource underutilisation. In the worst cases, however, the pods hosted on these nodes have been restarted up to four times in rapid succession, which is not desirable.
For example:
On 27th September at 22:45 (local time) we had 14 running workload nodeclaims:
7x m6a.large (or equivalent)
4x m7i-flex.xlarge (or equivalent)
3x m7i-flex.2xlarge
Between 22:45 and 23:15 (local time), 7 of these nodeclaims were disrupted and replaced with successive generations of m7i-flex.large nodeclaims (or equivalent), for a total of 15 "Underutilised" disruptions.
At the end of this process we were running 17 workload nodeclaims:
11x m7i-flex.large (or equivalent)
3x m6a.xlarge (or equivalent)
3x m7i-flex.2xlarge
So the net effect was to replace one xlarge node with four large nodes and shuffle the instance generations slightly.
Pictorially (figure omitted): each green bar represents a nodeclaim, with time along the x-axis.
Expected Behavior:
Consolidation disruption due to underutilisation occurs as a single operation, such that pods hosted on the affected nodes experience only one restart.
Nodeclaims created as replacements during "Underutilised" consolidation should not themselves be provisioned in an Underutilised state that necessitates further disruption.
Reproduction Steps (Please include YAML):
Our workload NodePool config is:
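(The full manifest, along with the corresponding EC2NodeClass, is not reproduced in this extract. As a stand-in, here is a minimal sketch of the disruption-relevant NodePool fields in the v1 schema; the consolidateAfter setting, resource names, and requirements are illustrative placeholders, and the 50% budget matches the figure raised under Additional Questions below.)

```yaml
# Minimal sketch (karpenter.sh/v1 schema); values are illustrative,
# not the production manifest.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload
spec:
  disruption:
    # Allow consolidation of both empty and underutilised nodes.
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m # hypothetical value
    budgets:
      - nodes: "50%" # at most half the pool may be disrupted at once
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: workload # hypothetical EC2NodeClass name
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"] # hypothetical
```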
Versions:
Kubernetes version (kubectl version): 1.29
However, we have observed this behaviour as far back as chart v0.36.2, using the v1beta1 CRDs, and on Kubernetes v1.28 and earlier.
Additional Questions:
Can you explain what threshold the Karpenter controller uses to determine Underutilisation?
We see 50% of nodes affected during the consolidation disruption window (matching our disruption budget), but why do we not see similar disruption before and after this ~30-minute window?
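(For reference, the 50% figure corresponds to a disruption budget of roughly the shape sketched below; the values are illustrative. Note that a budget with no schedule or duration applies at all times, so the budget alone would not explain why disruption is confined to a ~30-minute window.)

```yaml
# Illustrative budget shape (karpenter.sh/v1). A budget without a
# schedule or duration is active at all times.
spec:
  disruption:
    budgets:
      - nodes: "50%"
      # Optional schedule/duration fields scope a budget to a window,
      # e.g. blocking all disruption for 30 minutes from 22:00 daily:
      # - nodes: "0"
      #   schedule: "0 22 * * *"
      #   duration: 30m
```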