[bug] Incorrect CPU Usage by Instance display #124

Closed
guessi opened this issue Oct 2, 2024 · 11 comments · Fixed by #125

Contributor

guessi commented Oct 2, 2024

Describe the bug

Incorrect display
Unit: Percent (0.0-1.0)
image

Correct display
Unit: Percent (0-100)
image

How to reproduce?

Import the latest k8s-system-api-server.json and find the "API Server - CPU Usage by instance" panel.

Expected behavior

The panel should not display percentages above 100.

Additional context

No response

@guessi guessi added the bug Something isn't working label Oct 2, 2024
guessi added a commit to guessi/grafana-dashboards-kubernetes that referenced this issue Oct 2, 2024
guessi added a commit to guessi/grafana-dashboards-kubernetes that referenced this issue Oct 2, 2024
Owner

dotdc commented Oct 2, 2024

Hi @guessi,

Thank you for opening this issue and the pull request.
I don't see this error on my side, and I wonder whether the issue is related to the panel query or to duplicate series in your setup.
Do you think you could compare the panel query with your pods resource usage to see if it matches?

Something like:

2024-10-02 21 45 29 localhost 38221db47cb0

Query (update label like pod name to match your setup):

sum(rate(container_cpu_usage_seconds_total{namespace="kube-system", pod=~"kube-apiserver-k8s-control-plane-001", image="", cluster=""}[$__rate_interval]))
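For reference, `rate()` over a CPU-seconds counter yields CPU-seconds consumed per second, which is the number of cores in use rather than a 0-1 fraction. A minimal Python sketch of that arithmetic follows, assuming two hypothetical counter samples; the helper `simple_rate` is illustrative only, and it ignores the counter resets and window extrapolation that the real PromQL `rate()` handles.

```python
def simple_rate(samples):
    """Approximate PromQL rate(): counter increase divided by elapsed seconds.

    `samples` is a list of (unix_timestamp, counter_value) pairs, oldest first.
    Counter resets and window extrapolation are deliberately ignored.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical samples: the counter grew by 150 CPU-seconds over 60 seconds.
samples = [(0, 1000.0), (60, 1150.0)]
cores_used = simple_rate(samples)
print(cores_used)  # 2.5
```

A result of 2.5 here means roughly two and a half cores were busy over the window, so the value can exceed 1 on a multi-core machine.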

Contributor Author

guessi commented Oct 3, 2024

Hi @dotdc,

Allow me to provide more background.

In most circumstances, both percentunit (Percent (0.0-1.0)) and percent (Percent (0-100)) are okay. However, if you stress test multi-core API servers, percentunit shows wrong values, which is a problem.

Please verify again with the latest Grafana (11.2.2), and be sure to put your API server under stress so that it is running under high load at the time. I believe you should observe the issue on your side as well.
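The difference between the two units can be sketched in a few lines of Python. This is a simplified model of how the two Grafana units render a number, not Grafana's actual implementation: `percentunit` treats 1.0 as 100%, while `percent` displays the raw value with a percent sign. The function names and the sample value of 2.5 cores are assumptions for illustration, and the sketch assumes the panel query returns a per-second rate of CPU seconds (cores in use).

```python
def format_percentunit(value, decimals=1):
    """Model of the 'Percent (0.0-1.0)' unit: 1.0 is rendered as 100%."""
    return f"{value * 100:.{decimals}f}%"


def format_percent(value, decimals=1):
    """Model of the 'Percent (0-100)' unit: the raw value gets a % suffix."""
    return f"{value:.{decimals}f}%"


# Hypothetical query result: an API server using 2.5 cores.
cores_used = 2.5
print(format_percentunit(cores_used))  # 250.0%
print(format_percent(cores_used))      # 2.5%
```

Under this model, any process using more than one core is rendered above 100% with `percentunit`, which matches the symptom described for multi-core API servers.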

Owner

dotdc commented Oct 3, 2024

I compared with the pods view dashboard, which uses another metric (container_cpu_usage_seconds_total), and got the same percentage on my side. Regardless of load, the numbers should be the same. Can you compare them on your side?

Screenshot From 2024-10-03 07-24-01

Contributor Author

guessi commented Oct 3, 2024

Hi @dotdc

Version info

I'm running the latest version. What about yours?
image

The only difference for me is the Percent setup
image

Comparison

image

image

Contributor Author

guessi commented Oct 3, 2024

Note that this is about k8s-system-api-server.json, specifically the "CPU Usage by instance" panel of the API Server dashboard.

Contributor Author

guessi commented Oct 3, 2024

Please note the key is that the issue is about Multi-core API servers monitoring.

Owner

dotdc commented Oct 3, 2024

I understand the difference between the two units; I'm just not convinced by this change given what I see on my setups.

Can you share:

  • Kubernetes distribution and version
  • Version of kube-prometheus-stack (or deployment method with involved components versions)
  • process_cpu_seconds_total is a host-based metric. Do you have other instances running on the same host (including VMs/containers)?
  • Do you see the same value in both dashboards (k8s-system-api-server.json and k8s-views-pods.json)?
  • Can you share a sample series of process_cpu_seconds_total related to the apiserver from Grafana Explore? (I need the full line, ideally as text.)
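One way to collect the comparison requested above as text is to run both queries against the Prometheus HTTP API (`/api/v1/query`). The sketch below is a rough starting point under several assumptions: the Prometheus URL, the `job="apiserver"` selector, the pod name pattern, and the fixed `[5m]` window (used in place of Grafana's `$__rate_interval`, which Prometheus itself does not understand) are all placeholders to adapt to your setup.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here, e.g. through a port-forward.
PROM_URL = "http://localhost:9090"

# Assumption: label selectors below are placeholders, adjust to your cluster.
QUERIES = {
    "process": 'sum by (instance) (rate(process_cpu_seconds_total{job="apiserver"}[5m]))',
    "container": 'sum by (pod) (rate(container_cpu_usage_seconds_total'
                 '{namespace="kube-system", pod=~"kube-apiserver-.*"}[5m]))',
}


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_values(response):
    """Map each series' sorted label pairs to its float value."""
    return {
        tuple(sorted(series["metric"].items())): float(series["value"][1])
        for series in response["data"]["result"]
    }


def instant_query(base_url, promql):
    """Run one instant query and return {labels: value}."""
    with urllib.request.urlopen(build_query_url(base_url, promql)) as resp:
        return extract_values(json.load(resp))


def compare(base_url=PROM_URL):
    """Print the results of both queries so they can be pasted as text."""
    for name, promql in QUERIES.items():
        print(name, instant_query(base_url, promql))
```

Calling `compare()` prints one line per query. If the two metrics agree, the per-instance and per-pod values should be close to each other, both expressed in cores.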

Owner

dotdc commented Oct 3, 2024

Multi-core API servers monitoring.

Could you elaborate on this?

Contributor Author

guessi commented Oct 4, 2024

I'm running Amazon EKS 1.31 (released just a few days ago), and all components are at their latest versions, but the {version/distro/cloud} matrix doesn't actually matter. Even if you are running a self-managed Kubernetes cluster, I believe the issue can easily be reproduced by running API servers on a multi-core machine.

Owner

dotdc commented Oct 13, 2024

I've been able to reproduce the issue, and it turns out that my initial results were wrong because I tested with the empty image label (#122), which introduced a regression that was later removed in #126.

Thank you for pointing that out and sorry for the delay and the confusion.

Owner

dotdc commented Oct 13, 2024

🎉 This issue has been resolved in version 2.3.3 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@dotdc dotdc added the released label Oct 13, 2024