Karpenter "Underutilised" disruption causing excessive node churn #7146

headj-origami opened this issue Oct 2, 2024
Description

Context
We run Karpenter in production for an O(15) node EKS cluster, using four (mutually exclusive) NodePools for different classes of application.

Our primary NodePool is workload, which provisions capacity for the majority of our pods.

Observed Behavior:
Approximately daily, we experience a period of high volatility in our (Karpenter-managed) workload nodes, caused by consolidation disruptions (reason: Underutilised).

This usually means that a large proportion of workload nodes are disrupted and replaced within a short period of time.
We usually see the newly created nodes run for about 5-10 minutes before they too are disrupted as Underutilised.

This disruption period usually occurs for 2-3 generations of replacement nodes, before stopping abruptly. The resulting nodes then typically run without disruption for many hours.

Notably, these events typically occur outside of office hours, when changes to the running pods (e.g. rolling upgrades) are very unlikely and traffic is usually very low.

The end-state node topology is usually comparable to the starting topology, if not more complex, which doesn't suggest there was any significant resource underutilisation. However, in the worst cases the pods hosted on these nodes may have been restarted up to four times in rapid succession, which is not desirable.
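For the pods most sensitive to repeated restarts, we are evaluating Karpenter's pod-level opt-out annotation as a stopgap. A minimal sketch follows (the Deployment name, labels, and image are placeholders, not taken from our real manifests); this only shields individual workloads from voluntary disruption and does not address the churn itself.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: restart-sensitive-app    # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: restart-sensitive-app    # placeholder label
  template:
    metadata:
      labels:
        app: restart-sensitive-app
      annotations:
        # Karpenter will not voluntarily disrupt a node while it hosts a pod
        # carrying this annotation
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: app
          image: example.com/app:latest    # placeholder image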

As a concrete example of this churn: on 27th September at 22:45 (local time) we had 14 running workload nodeclaims:

  • 7x m6a.large (or equivalent)
  • 4x m7i-flex.xlarge (or equivalent)
  • 3x m7i-flex.2xlarge

Between 22:45 and 23:15 (local time), 7 of these nodeclaims were disrupted and replaced with successive generations of m7i-flex.large nodeclaims (or equivalent) - a total of 15 "Underutilised" disruptions.

At the end of this process we were running 17 workload nodeclaims:

  • 11x m7i-flex.large (or equivalent)
  • 3x m6a.xlarge (or equivalent)
  • 3x m7i-flex.2xlarge

So the net effect was replacing one xlarge node with 4 large nodes, and shuffling the instance generations slightly.

Pictorially:
[image: nodeclaim timeline; each green bar represents a nodeclaim, with time along the x axis]

Expected Behavior:

  • Consolidation disruption due to underutilisation occurs as a single operation, so that pods hosted on these nodes only experience one restart.
  • Nodeclaims created by "Underutilised" consolidation should not themselves be provisioned in an Underutilised state that necessitates further disruption. (An interim mitigation we are considering is sketched below.)
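In the interim, and separately from the expected fix, we are considering lengthening consolidateAfter so that freshly provisioned nodes get a longer settle period before being re-evaluated. A sketch of what we may trial (the 30m value is illustrative; everything not shown is unchanged from the full config under Reproduction Steps):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Illustrative value only (currently 5m): wait longer after pod churn on a
    # node before it can be considered for consolidation again
    consolidateAfter: 30m
    budgets:
      - nodes: 50%
  limits:
    cpu: '256'
    memory: 1Ti
  # template: unchanged from the full NodePool config below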

Reproduction Steps (Please include YAML):
Our workload NodePool config is:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app.kubernetes.io/managed-by: Helm
  name: workload
spec:
  disruption:
    budgets:
      - nodes: 50%
    consolidateAfter: 5m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '256'
    memory: 1Ti
  template:
    metadata:
      labels:
        app: karpenter
        environment: prod
        name: karpenter
    spec:
      expireAfter: 336h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - r
            - m
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - '3'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values:
            - '17'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values:
            - '0'
        - key: karpenter.k8s.aws/instance-memory
          operator: Lt
          values:
            - '131073'
        - key: karpenter.k8s.aws/instance-memory
          operator: Gt
          values:
            - '2047'
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-west-2a
            - eu-west-2b
            - eu-west-2c
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
      taints:
        - effect: NoSchedule
          key: karpenter.sh

The corresponding EC2NodeClass is:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app: karpenter
    app.kubernetes.io/managed-by: Helm
    name: karpenter
  name: default
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - name: amazon-eks-node-1.29-*
      owner: amazon
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: >-
          <masked>
        volumeSize: 100Gi
        volumeType: gp3
  instanceProfile: KarpenterNodeInstanceProfile-prod-eu-west-2-eks
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks

Versions:

  • Chart Version: 1.0.2
  • Kubernetes Version (kubectl version): 1.29

However, we have observed this behaviour as far back as chart v0.36.2 using the v1beta1 CRDs.
We've also seen this on v1.28 and earlier versions of Kubernetes.

Additional Questions:

  • Can you explain what threshold the Karpenter controller uses to determine Underutilisation?
  • We see 50% of nodes being affected during the consolidation disruption window (matching our disruption budget), but why do we not see similar disruption before and after this ~30-minute window?
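On the second question: to constrain these windows ourselves, we are considering a reason-scoped disruption budget along the lines below. This is a sketch only, assuming the v1 budgets fields (reasons, schedule, duration) behave as documented; the overnight cron window is illustrative and not something we currently run.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
    budgets:
      # General cap: at most 50% of nodes disrupted at once
      - nodes: 50%
      # Sketch: block "Underutilized" consolidation from 18:00 (UTC) for 15h
      - nodes: "0"
        reasons:
          - Underutilized
        schedule: '0 18 * * *'
        duration: 15h
  # limits and template: unchanged from the full config above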