Some kube-state-metrics shards are serving up stale metrics #2372
Comments
qq: have your StatefulSet labels been changed?
For this particular case, we don't suspect they'd changed (though we drop the metric, so we can't confirm this 100%). But in the other cases where we run into this issue, the labels almost always do get changed, particularly the chart version when we upgrade.
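If it helps anyone checking the same thing, the label and chart-version drift can be spotted with something like the sketch below (namespace and release name are assumptions, not our exact setup):

```shell
# Inspect the labels currently set on the KSM StatefulSet
# (namespace and object name are placeholders for illustration):
kubectl -n kube-system get statefulset kube-state-metrics \
  -o jsonpath='{.metadata.labels}{"\n"}'

# If installed via the Helm chart, see whether the chart version changed across upgrades:
helm -n kube-system history kube-state-metrics
```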
/assign @CatherineF-dev
This is related to #2347
This is a new issue.
For the purposes of this issue, I think it's wholly related to #2347 (the one time we claimed the StatefulSet may not have changed labels, we had no proof of that). IMO, we can track this issue against that PR for closure, and if we do see another case of stale metrics, we can try to gather those exact circumstances in a separate issue if needed.
Looks like this will be resolved in v2.13.0
For tracking purposes, this problem still persists (even in the latest version). This may lend credence to the one time we ran into this issue and claimed there was no label change. So I believe #2431 is reporting the same issue. The labels/versions for reference:
@schahal could you reproduce this issue consistently? If so, could you help provide detailed steps to reproduce it? You can anonymize pod names.
Aside from what's in the description, I feel like this consistently happens anytime the kube-state-metrics chart is upgraded. Invariably, right after that we get shards with stale metrics, mitigated only by restarting the pods. #2431 and this Slack thread have other perspectives from different users on the same symptom, which may shed some other light.
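For anyone trying to reproduce after an upgrade, a rough spot check like the sketch below (shard count, namespace, and port are assumptions) shows which shards still export the suspicious series:

```shell
# Port-forward to each shard in turn and grep for the stale series
# (shard count, namespace, and port 8080 are assumptions):
for i in 0 1 2 3 4 5; do
  kubectl -n kube-system port-forward "pod/kube-state-metrics-$i" 8080:8080 >/dev/null 2>&1 &
  pf=$!
  sleep 2
  echo "=== shard $i ==="
  curl -s localhost:8080/metrics | grep 'kube_pod_container_status_waiting_reason' || true
  kill "$pf"
done
```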
What happened:
We found some kube-state-metrics shards are serving up stale metrics.
For example, this pod is running and healthy:
However, for the past hour we see that `kube_pod_container_status_waiting_reason` is reporting it in `ContainerCreating`:

And to prove this is being served by KSM, we looked at the incriminating shard's (`kube-state-metrics-5`) /metrics endpoint and saw this metric is definitely stale:

This is one such example; there seem to be several such situations.
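For reference, a shard's /metrics output can be pulled directly with something along these lines (namespace is an assumption; 8080 is KSM's default metrics port):

```shell
# Forward the shard's metrics port locally and grep for the stale series
# (namespace is an assumption; 8080 is KSM's default metrics port):
kubectl -n kube-system port-forward pod/kube-state-metrics-5 8080:8080 &
pf=$!
sleep 2
curl -s localhost:8080/metrics | grep 'kube_pod_container_status_waiting_reason'
kill "$pf"
```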
What you expected to happen:
The expectation is that the metric(s) match reality.
How to reproduce it (as minimally and precisely as possible):
Unfortunately, we're not quite sure when/why it gets into this state (anecdotally, it almost always happens when we upgrade KSM, though today there was no update besides some Prometheus agents)
We can mitigate the issue by restarting all the KSM shards... e.g.,
... if that's any clue to determine root cause.
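Concretely, the restart amounts to something like the following sketch (StatefulSet name and namespace are assumptions):

```shell
# Restart all KSM shards and wait for them to come back
# (StatefulSet name and namespace are assumptions):
kubectl -n kube-system rollout restart statefulset kube-state-metrics
kubectl -n kube-system rollout status statefulset kube-state-metrics
```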
Anything else we need to know?:
When I originally ran into the problem, I thought it had something to do with the Compatibility Matrix. But starting with KSM v2.11.0, I confirmed the client libraries are updated for my version of k8s (v1.28)
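For completeness, that comparison boils down to checking the server version against the KSM image actually running (StatefulSet name and namespace below are assumptions):

```shell
# Server version vs. the KSM image actually running
# (StatefulSet name and namespace are assumptions):
kubectl version
kubectl -n kube-system get statefulset kube-state-metrics \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```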
There's nothing out of the ordinary in the KSM logs:
(kube-state-metrics-5 logs omitted)
Environment:
- Kubernetes version (use `kubectl version`): v1.28.6