
Some kube-state-metrics shards are serving up stale metrics #2372

Closed
schahal opened this issue Apr 16, 2024 · 9 comments · Fixed by #2478
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@schahal

schahal commented Apr 16, 2024

What happened:

We found some kube-state-metrics shards are serving up stale metrics.

For example, this pod is running and healthy:

$ kubectl get pods provider-kubernetes-a3cbbe355fa7-6d9d468f59-xbfsq
NAME                                                READY   STATUS    RESTARTS   AGE
provider-kubernetes-a3cbbe355fa7-6d9d468f59-xbfsq   1/1     Running   0          87m

However, for the past hour we see kube_pod_container_status_waiting_reason reporting it as ContainerCreating:

[Screenshot 2024-04-16 2:00 PM: graph of kube_pod_container_status_waiting_reason showing the pod stuck in ContainerCreating for the past hour]

And to prove this is being served by KSM, we looked at the incriminating shard's (kube-state-metrics-5) /metrics endpoint and saw this metric is definitely stale:

kube_pod_container_status_waiting_reason{namespace="<redacted>",pod="provider-kubernetes-a3cbbe355fa7-678fd88bc5-76dw4",uid="<redacted>",container="package-runtime",reason="ContainerCreating"} 1

This is one such example; there appear to be several similar cases.
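
For reference, a minimal way to inspect a shard's /metrics endpoint directly looks roughly like this (port 8080 and the port-forward approach are assumptions; the pod names are from this report):

# Port-forward the suspect shard (assumes the metrics server listens on 8080)
$ kubectl -n kube-state-metrics port-forward pod/kube-state-metrics-5 8080:8080 &

# Grep the shard's metrics for the pod in question
$ curl -s localhost:8080/metrics | grep kube_pod_container_status_waiting_reason | grep provider-kubernetes-a3cbbe355fa7

# Compare against what the API server actually reports for that pod
$ kubectl get pod provider-kubernetes-a3cbbe355fa7-678fd88bc5-76dw4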

What you expected to happen:

The expectation is that the metric(s) match reality.

How to reproduce it (as minimally and precisely as possible):

Unfortunately, we're not quite sure when or why it gets into this state (anecdotally, it almost always happens when we upgrade KSM, though today there was no update besides some Prometheus agents).

We can mitigate the issue by restarting all the KSM shards... e.g.,

$ kubectl rollout restart -n kube-state-metrics statefulset kube-state-metrics

... if that's any clue to determine root cause.

Anything else we need to know?:

  1. When I originally ran into the problem, I thought it had something to do with the Compatibility Matrix. But starting with KSM v2.11.0, I confirmed the client libraries are updated for my version of k8s (v1.28)

  2. There's nothing out of the ordinary in the KSM logs:

kube-state-metrics-5 logs:
I0409 08:17:51.349017       1 wrapper.go:120] "Starting kube-state-metrics"
W0409 08:17:51.349231       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0409 08:17:51.350019       1 server.go:199] "Used resources" resources=["limitranges","storageclasses","deployments","resourcequotas","statefulsets","cronjobs","endpoints","ingresses","namespaces","nodes","poddisruptionbudgets","mutatingwebhookconfigurations","replicasets","horizontalpodautoscalers","networkpolicies","validatingwebhookconfigurations","volumeattachments","daemonsets","jobs","services","certificatesigningrequests","configmaps","persistentvolumeclaims","replicationcontrollers","secrets","persistentvolumes","pods"]
I0409 08:17:51.350206       1 types.go:227] "Using all namespaces"
I0409 08:17:51.350225       1 types.go:145] "Using node type is nil"
I0409 08:17:51.350241       1 server.go:226] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: "
W0409 08:17:51.350258       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0409 08:17:51.350658       1 utils.go:70] "Tested communication with server"
I0409 08:17:52.420690       1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="28+" gitVersion="v1.28.6-eks-508b6b3" gitTreeState="clean" gitCommit="25a726351cee8ee6facce01af4214605e089d5da" platform="linux/amd64"
I0409 08:17:52.420837       1 utils.go:76] "Communication with server successful"
I0409 08:17:52.422588       1 server.go:350] "Started metrics server" metricsServerAddress="[::]:8080"
I0409 08:17:52.422595       1 server.go:339] "Started kube-state-metrics self metrics server" telemetryAddress="[::]:8081"
I0409 08:17:52.423030       1 server.go:73] level=info msg="Listening on" address="[::]:8080"
I0409 08:17:52.423052       1 server.go:73] level=info msg="TLS is disabled." http2=false address="[::]:8080"
I0409 08:17:52.423075       1 server.go:73] level=info msg="Listening on" address="[::]:8081"
I0409 08:17:52.423093       1 server.go:73] level=info msg="TLS is disabled." http2=false address="[::]:8081"
I0409 08:17:55.422262       1 config.go:84] "Using custom resource plural" resource="autoscaling.k8s.io_v1_VerticalPodAutoscaler" plural="verticalpodautoscalers"
I0409 08:17:55.422479       1 discovery.go:274] "discovery finished, cache updated"
I0409 08:17:55.422544       1 metrics_handler.go:106] "Autosharding enabled with pod" pod="kube-state-metrics/kube-state-metrics-5"
I0409 08:17:55.422573       1 metrics_handler.go:107] "Auto detecting sharding settings"
I0409 08:17:55.430380       1 metrics_handler.go:82] "Configuring sharding of this instance to be shard index (zero-indexed) out of total shards" shard=5 totalShards=16
I0409 08:17:55.431104       1 custom_resource_metrics.go:79] "Custom resource state added metrics" familyNames=["kube_customresource_vpa_containerrecommendations_target","kube_customresource_vpa_containerrecommendations_target"]
I0409 08:17:55.431143       1 builder.go:282] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments,autoscaling.k8s.io/v1, Resource=verticalpodautoscalers"
I0416 16:47:01.423216       1 config.go:84] "Using custom resource plural" resource="autoscaling.k8s.io_v1_VerticalPodAutoscaler" plural="verticalpodautoscalers"
I0416 16:47:01.423283       1 config.go:209] "reloaded factory" GVR="autoscaling.k8s.io/v1, Resource=verticalpodautoscalers"
I0416 16:47:01.423466       1 builder.go:208] "Updating store" GVR="autoscaling.k8s.io/v1, Resource=verticalpodautoscalers"
I0416 16:47:01.423499       1 discovery.go:274] "discovery finished, cache updated"
I0416 16:47:01.423527       1 metrics_handler.go:106] "Autosharding enabled with pod" pod="kube-state-metrics/kube-state-metrics-5"
I0416 16:47:01.423545       1 metrics_handler.go:107] "Auto detecting sharding settings"
  3. This may be related to #2355 (kube-state-metrics with autosharding stops updating shards when the labels of the statefulset are updated), but I'm not sure enough about the PR linked there to decide conclusively.

Environment:

  • kube-state-metrics version: v2.12.0 (this has occurred in previous versions too)
  • Kubernetes version (use kubectl version): v1.28.6
  • Cloud provider or hardware configuration: EKS
  • Other info:
@schahal schahal added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 16, 2024
@CatherineF-dev
Contributor

qq: have your Statefulset labels been changed?

@schahal
Author

schahal commented Apr 18, 2024

have your Statefulset labels been changed?

For this particular case, we don't suspect they'd changed (though we drop the metric to confirm this 100%).

But in the other cases where we run into this issue, the labels almost always change, particularly the chart version label when we upgrade:

Labels:             app.kubernetes.io/component=metrics
                    app.kubernetes.io/instance=kube-state-metrics
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/name=kube-state-metrics
                    app.kubernetes.io/part-of=kube-state-metrics
                    app.kubernetes.io/version=2.12.0
                    helm.sh/chart=kube-state-metrics-5.18.1
                    release=kube-state-metrics
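
For reference, a quick way to check whether the StatefulSet labels changed across an upgrade might look like this (the namespace, release name, and chart repository are assumptions):

# Show the labels currently on the live StatefulSet
$ kubectl -n kube-state-metrics get statefulset kube-state-metrics --show-labels

# Or diff the freshly rendered chart against the live objects before applying it
$ helm template kube-state-metrics prometheus-community/kube-state-metrics | kubectl diff -f -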

@logicalhan
Member

/assign @CatherineF-dev
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 18, 2024
@CatherineF-dev
Contributor

But in the other cases where we run into this issue, the labels almost always change, particularly the chart version label when we upgrade:

This is related to #2347

For this particular case, we don't suspect they'd changed (though we drop the metric to confirm this 100%).

This is a new issue.

@LaikaN57 LaikaN57 mentioned this issue May 1, 2024
@schahal
Author

schahal commented May 6, 2024

This is related to #2347

For the purposes of this issue, I think it's wholly related to #2347 (the one time we claimed the statefulset may not have changed labels, we had no proof of that).

IMO, we can track this issue to that PR for closure (and if we do see another case of stale metrics, we can gather the exact circumstances in a separate issue if needed).

@LaikaN57
Contributor

LaikaN57 commented Jul 18, 2024

Looks like this will be resolved in v2.13.0

@schahal
Author

schahal commented Jul 27, 2024

For tracking purposes, this problem still persists (even in the latest version).

This may lend credence to the one time we ran into this issue and claimed there was no label change. I believe #2431 is reporting the same issue.

The labels/versions for reference:

  Labels:           app=kube-state-metrics
                    app.kubernetes.io/component=metrics
                    app.kubernetes.io/instance=kube-state-metrics
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/name=kube-state-metrics
                    app.kubernetes.io/part-of=kube-state-metrics
                    app.kubernetes.io/version=2.13.0
                    helm.sh/chart=kube-state-metrics-5.24.0
                    release=kube-state-metrics
  Annotations:      kubectl.kubernetes.io/restartedAt: 2024-07-26TXX:XX:XX-XX:XX   # <--- the workaround
                    our-workaround-rev-to-trigger-shard-refresh: <someSHA>
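
For reference, a rough, hypothetical equivalent of bumping that workaround annotation by hand (the namespace and StatefulSet name are assumptions; the patch itself is only a sketch):

# Bump a pod-template annotation on the StatefulSet so every shard pod is recreated
# and re-lists objects from the API server
$ kubectl -n kube-state-metrics patch statefulset kube-state-metrics --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"our-workaround-rev-to-trigger-shard-refresh":"<someSHA>"}}}}}'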

@CatherineF-dev
Contributor

@schahal could you reproduce this issue consistently? If so, could you help provide detailed steps to reproduce it? You can anonymize the pod names.

@schahal
Author

schahal commented Aug 7, 2024

could you reproduce this issue consistently? If so, could you help provide detailed steps to reproduce it?

Aside from what's in the description, I feel like this consistently happens any time the StatefulSet is updated, e.g.:

  1. I change our kube-state-metrics helm chart version from version: 5.25.0 to version: 5.25.1
  2. I then let our continuous delivery pipeline (in this case ArgoCD) render the templates (e.g., helm template ...)
  3. Those rendered templates are then applied to the Kubernetes cluster, updating the Statefulset

Invariably, right after that we get shards with stale metrics, mitigated only by restarting the pods.
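
For concreteness, a rough manual equivalent of steps 2 and 3 might look like this (the chart repository, release name, and namespace are assumptions; in our case ArgoCD performs the render and apply):

# Render the chart at the bumped version and apply it, which updates the StatefulSet
# and its labels (including the chart version label)
$ helm template kube-state-metrics prometheus-community/kube-state-metrics \
    --version 5.25.1 --namespace kube-state-metrics > rendered.yaml
$ kubectl apply -n kube-state-metrics -f rendered.yaml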

#2431 and this Slack thread have other perspectives from different users on the same symptom, which may shed more light.
