Sharded kube state metrics returns stale metrics #2431

mal-berbatov-ci · 2024-06-25T15:39:33Z

Hi,

This is tangentially related to #2372, but in our case the trigger is definitely not a label change on the kube-state-metrics statefulset.

What happened:
Occasionally, we see that a shard of our KSM is reporting stale metrics, namely that a pod is stuck in the “Pending” state. We can easily verify that this isn’t the case, and rolling out the statefulset clears the issue.

Anything else we need to know?:
We have been seeing this at least since v2.9.2 of KSM, and only on sharded KSM installations. It is rare, but happens occasionally.
After upgrading KSM to v2.12.0, we saw an issue where a version upgrade of a component across our k8s fleet (which updated the labels of that component's pods) caused a flurry of alerts about these pods being stuck in a pending state. The pods themselves were fine, but the upgrade caused all of our KSM installations on v2.12.0 to start serving stale metrics. Just to clarify, the statefulset labels for KSM were unchanged. Rather, the labels of an unrelated component were updated.

We were also staggering the v2.12.0 update of KSM, meaning that our dev/staging clusters were on v2.12.0, whilst our production clusters were on v2.9.2. The component version upgrade was being rolled out to all clusters, but only our dev/staging clusters saw KSM serving stale metrics. It looks quite likely that some change between v2.9.2 and v2.12.0 has made this issue worse.

Prior to this particular component version upgrade with KSM on v2.12.0, I hadn’t been able to spot any pattern for when KSM starts serving stale metrics. I can verify, however, that in all cases the statefulset labels for KSM remained unchanged. We do not drop the kube_statefulset_labels metric, so I have been able to check this.

Environment:
kube-state-metrics version: v2.12.0 & v2.9.2
Kubernetes version (use kubectl version): v1.27 to v1.30
Cloud provider or hardware configuration: GKE

dgrisonnet · 2024-06-27T16:49:32Z

/assign @CatherineF-dev
/triage accepted

CatherineF-dev · 2024-06-27T16:56:07Z

> serving stale metrics

Do we know which metric this is?

> Sharded kube state metrics

Does non-sharded kube-state-metrics have this issue or not?

Also, could you reproduce this issue consistently?

mal-berbatov-ci · 2024-07-04T14:53:22Z

The stale metric that we saw was kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}. We have an alert checking this expression, which made us aware of the issue. The pods in question were all reporting "Pending". I did not check other metrics. I'd guess they were stale too, but if the issue happens again, I can double-check.

Non-sharded KSM installations do not have this issue.

I could not reproduce the issue consistently. I tried once but couldn't replicate it.
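The manual check described in this thread (comparing what kube_pod_status_phase reports with the pods' real state) could be sketched roughly as follows. This is only an illustration: the function names, the sample scrape text, and the `actual` mapping are hypothetical, and the real phases would have to come from the API server (for example via `kubectl get pods`). The parser is deliberately simple and does not handle escaped quotes in label values.

```python
import re

# Matches lines such as:
# kube_pod_status_phase{namespace="default",pod="web-0",phase="Pending"} 1
SAMPLE_RE = re.compile(r'^kube_pod_status_phase\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def reported_phases(exposition_text):
    """Return {(namespace, pod): phase} for every sample whose value is 1."""
    phases = {}
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match or float(match.group("value")) != 1.0:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("namespace", ""), labels.get("pod", ""))
        phases[key] = labels.get("phase", "")
    return phases


def stale_pods(exposition_text, actual_phases):
    """List pods whose reported phase disagrees with the phase seen from the API server."""
    stale = []
    for key, reported in sorted(reported_phases(exposition_text).items()):
        actual = actual_phases.get(key)
        if actual != reported:
            stale.append((key, reported, actual))
    return stale


# Hypothetical scrape from one KSM shard and hypothetical real pod phases.
scrape = """
kube_pod_status_phase{namespace="default",pod="web-0",uid="a1",phase="Pending"} 1
kube_pod_status_phase{namespace="default",pod="web-0",uid="a1",phase="Running"} 0
kube_pod_status_phase{namespace="default",pod="web-1",uid="b2",phase="Running"} 1
"""
actual = {("default", "web-0"): "Running", ("default", "web-1"): "Running"}

print(stale_pods(scrape, actual))
```

In this made-up data, web-0 is flagged because the shard still reports it as Pending while it is actually Running.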

mal-berbatov-ci · 2024-07-04T15:08:52Z

Actually, looking at historical metrics, they are in fact all stale for the specific exported_pod values that were seeing this issue. It also looks like when the rollout of the service we were updating happened, the metric moved from shard 0 to shard 1. It also looks like some metrics for other components were not getting registered until we did a rollout restart of the KSM statefulset.

All in all, it definitely does look like stale metrics were being served across every single shard of KSM, and values were not updated until the shards were restarted.
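One possible reading of the shard move: as far as I understand the kube-state-metrics README, sharding assigns each object to a shard by hashing its UID and reducing the result to the total number of shards, so a pod recreated during a rollout gets a new UID and may land on a different shard. The sketch below is an assumption-laden illustration of that idea only. The `shard_for` function is hypothetical, and md5 with a plain modulo is a stand-in that may not match the hash KSM actually uses.

```python
import hashlib


def shard_for(uid, total_shards):
    """Map an object UID to a shard index in [0, total_shards).

    Assumption: md5 + modulo is used purely for illustration here.
    """
    digest = hashlib.md5(uid.encode("utf-8")).hexdigest()
    return int(digest, 16) % total_shards


# Hypothetical UIDs for the same pod name before and after a rollout.
old_uid = "11111111-aaaa-4bbb-8ccc-000000000001"
new_uid = "22222222-dddd-4eee-8fff-000000000002"

print("old pod owned by shard", shard_for(old_uid, 2))
print("new pod owned by shard", shard_for(new_uid, 2))
```

Whether the two UIDs map to different shards depends on the hash. The point is only that shard ownership follows the UID, not the pod name, so the old shard has to drop its series when the old object is deleted.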
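If this happens again, one way to look for a shard holding on to old series might be to check whether the same pod is reported by more than one shard at the same time. The following is a minimal sketch under stated assumptions: the `pods_on_multiple_shards` helper and the sample tuples are hypothetical, and the (shard, namespace, pod, uid) values would have to be extracted from each shard's scrape separately.

```python
from collections import defaultdict


def pods_on_multiple_shards(samples):
    """Find pods that are reported by more than one shard.

    samples: iterable of (shard, namespace, pod, uid) tuples, one per series
    seen on a shard. Returns {(namespace, pod): sorted list of (shard, uid)}.
    """
    seen = defaultdict(set)
    for shard, namespace, pod, uid in samples:
        seen[(namespace, pod)].add((shard, uid))
    return {
        key: sorted(entries)
        for key, entries in seen.items()
        if len({shard for shard, _ in entries}) > 1
    }


# Hypothetical data: web-0 is still reported by shard 0 under its old UID
# while shard 1 reports the recreated pod.
samples = [
    (0, "default", "web-0", "old-uid"),
    (1, "default", "web-0", "new-uid"),
    (1, "default", "web-1", "uid-3"),
]

print(pods_on_multiple_shards(samples))
```

A pod showing up on two shards with different UIDs would suggest that one shard never processed the delete for the old object.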

mal-berbatov-ci added the kind/bug Categorizes issue or PR as related to a bug. label Jun 25, 2024

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 25, 2024

k8s-ci-robot assigned CatherineF-dev Jun 27, 2024

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 27, 2024

schahal mentioned this issue Jul 27, 2024

Some kube-state-metrics shards are serving up stale metrics #2372

Closed
