Sharded kube state metrics returns stale metrics #2431
Labels
kind/bug
triage/accepted
Hi,
This is tangentially related to #2372, but in our case the trigger is definitely not a label change on the kube-state-metrics statefulset.
What happened:
Occasionally, a shard of our KSM reports stale metrics; specifically, it shows a pod stuck in the “Pending” state. We can easily verify that the pod is not actually pending, and rolling out the KSM statefulset clears the issue.
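For illustration, the check described above can be sketched as a comparison between the phase a KSM shard reports via `kube_pod_status_phase` and the phase the API server reports. This is a minimal sketch, not tooling from this issue: the function names, the tuple layout of the samples, and the `<missing>` marker are all hypothetical, and it assumes the samples have already been scraped and parsed.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical input shape: (namespace, pod, phase, value) taken from
# kube_pod_status_phase, where value is 1 for the phase KSM believes the
# pod is in and 0 for every other phase.
Sample = Tuple[str, str, str, float]
PodKey = Tuple[str, str]  # (namespace, pod)


def reported_phases(samples: Iterable[Sample]) -> Dict[PodKey, str]:
    """Collapse the per-phase samples into one reported phase per pod."""
    phases: Dict[PodKey, str] = {}
    for namespace, pod, phase, value in samples:
        if value == 1:
            phases[(namespace, pod)] = phase
    return phases


def find_stale_pods(
    samples: Iterable[Sample], actual: Dict[PodKey, str]
) -> List[Tuple[str, str, str, str]]:
    """Return (namespace, pod, reported, actual) where the two disagree.

    Pods that KSM still reports but that no longer exist in `actual`
    are listed with the placeholder phase "<missing>".
    """
    stale = []
    for key, reported in sorted(reported_phases(samples).items()):
        live = actual.get(key, "<missing>")
        if live != reported:
            stale.append((key[0], key[1], reported, live))
    return stale


if __name__ == "__main__":
    # Made-up example data, not taken from our clusters.
    ksm_samples = [
        ("default", "web-0", "Pending", 1),
        ("default", "web-0", "Running", 0),
        ("default", "db-0", "Running", 1),
    ]
    api_phases = {("default", "web-0"): "Running", ("default", "db-0"): "Running"}
    for row in find_stale_pods(ksm_samples, api_phases):
        print(row)
```

The actual phases could come from something like `kubectl get pods -A -o json` (reading `.status.phase`), and the samples from the metrics endpoint of each shard; how those are collected is left out here.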
Anything else we need to know?:
We have been seeing this since at least v2.9.2 of KSM, and only on sharded KSM installations. It happens rarely.
After upgrading KSM to v2.12.0, a version upgrade of a component across our k8s fleet (which updated the labels on that component's pods) triggered a flurry of alerts about these pods being stuck in a pending state. The pods themselves were fine, but the rollout caused all of our KSM installations on v2.12.0 to start serving stale metrics. To clarify, the statefulset labels for KSM were unchanged; it was the labels of an unrelated component that were updated.
We were also staggering the v2.12.0 update of KSM, meaning that our dev/staging clusters were on v2.12.0, whilst our production clusters were on v2.9.2. The component version upgrade was being rolled out to all clusters, but only our dev/staging clusters saw KSM serving stale metrics. It looks quite likely that a change between v2.9.2 and v2.12.0 has made this issue worse.
Before this particular component version upgrade with KSM on v2.12.0, I hadn't been able to spot any pattern in when KSM starts serving stale metrics. In all cases, however, the statefulset labels for KSM remained unchanged. We do not drop the kube_statefulset_labels metric, so I have been able to verify this.
Environment:
kube-state-metrics version: v2.12.0 & v2.9.2
Kubernetes version (use kubectl version): v1.27 to v1.30
Cloud provider or hardware configuration: GKE