
kube_endpoint_address duplicates with Prometheus 2.52 #2408

Closed
gdlx opened this issue May 31, 2024 · 13 comments · Fixed by #2503
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@gdlx

gdlx commented May 31, 2024

After upgrading to Prometheus 2.52, we got alerts about dropped duplicate samples.

  • The Prometheus log showed the following warning:

     scrape_pool=serviceMonitor/monitoring/kube-prometheus-stack-kube-state-metrics/0 target=http://100.91.220.12:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=1
    
  • Setting the log level to debug showed the affected series:

    scrape_pool=serviceMonitor/monitoring/kube-prometheus-stack-kube-state-metrics/0 target=http://100.91.220.12:8080/metrics msg="Duplicate sample for timestamp" series="kube_endpoint_address{namespace=\"monitoring\",endpoint=\"prometheus-operated\",ip=\"100.91.68.8\",ready=\"true\"}"
    
  • Checking the indicated series indeed showed the following duplicates:

    kube_endpoint_address{namespace="monitoring",endpoint="prometheus-operated",ip="100.91.68.8",ready="true"} 1
    kube_endpoint_address{namespace="monitoring",endpoint="prometheus-operated",ip="100.91.68.8",ready="true"} 1
    
  • The prometheus-operated endpoint has the following subsets:

    subsets:
      - addresses:
          - ip: 100.91.43.113
            hostname: prometheus-kube-prometheus-istio-0
            nodeName: ip-100-91-48-253.eu-west-3.compute.internal
            targetRef:
              kind: Pod
              namespace: monitoring
              name: prometheus-kube-prometheus-istio-0
              uid: 1180e2a5-75e4-4098-961c-940264115438
          - ip: 100.91.68.8
            hostname: prometheus-kube-prometheus-stack-prometheus-0
            nodeName: ip-100-91-212-113.eu-west-3.compute.internal
            targetRef:
              kind: Pod
              namespace: monitoring
              name: prometheus-kube-prometheus-stack-prometheus-0
              uid: 257bdfed-e2b4-49c7-aaea-1b7bee1a520d
        ports:
          - name: http-web
            port: 9090
            protocol: TCP
      - addresses:
          - ip: 100.91.68.8
            hostname: prometheus-kube-prometheus-stack-prometheus-0
            nodeName: ip-100-91-212-113.eu-west-3.compute.internal
            targetRef:
              kind: Pod
              namespace: monitoring
              name: prometheus-kube-prometheus-stack-prometheus-0
              uid: 257bdfed-e2b4-49c7-aaea-1b7bee1a520d
        ports:
          - name: grpc
            port: 10901
            protocol: TCP

We can see the two entries resolve to the same IP (100.91.68.8) but to different ports.
gRPC is enabled only by the Thanos sidecar container, and only on one Prometheus instance.
I think there would have been no duplicates if both instances had the same config: there would then have been a single subset containing both addresses and both ports.
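The failure mode above can be reproduced with a minimal Python sketch (a hypothetical simplification of kube-state-metrics' actual Go generator, using the subsets shown above): emitting one series per address per subset, with no port in the label set, produces two identical label sets for 100.91.68.8.

```python
# Hypothetical simplification of how kube-state-metrics walks Endpoints
# subsets: one kube_endpoint_address series is emitted per address, per
# subset, and the port is not part of the label set.
subsets = [
    {"addresses": ["100.91.43.113", "100.91.68.8"], "port": 9090},
    {"addresses": ["100.91.68.8"], "port": 10901},
]

series = [
    f'kube_endpoint_address{{namespace="monitoring",endpoint="prometheus-operated",ip="{ip}",ready="true"}}'
    for subset in subsets
    for ip in subset["addresses"]
]

# 100.91.68.8 appears in both subsets, so the same label set is emitted twice.
print(len(series) - len(set(series)))  # 1 duplicate
```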

The only way I see to fix this would be to add a port label to the kube_endpoint_address metric.
Is there something else I can do, or would this be considered a bug?

Thanks !

Environment:

  • kube-state-metrics version: 2.12.0
  • Kubernetes version: 1.28
  • Cloud provider or hardware configuration: AWS EKS
  • Other info:
@gdlx gdlx added the kind/bug Categorizes issue or PR as related to a bug. label May 31, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 31, 2024
@gdlx gdlx changed the title Endpoint duplicates kube_endpoint_address duplicates with Prometheus 2.52 May 31, 2024
@eimarfandino

I noticed the same; we are seeing

kube_endpoint_address{namespace="monitoring",endpoint="alertmanager-operated",ip="10.25.119.228",ready="true"} 1
kube_endpoint_address{namespace="monitoring",endpoint="alertmanager-operated",ip="10.25.119.228",ready="true"} 1

I do not know if this is related, but in our case the endpoint has this IP twice with different ports: IP 10.25.119.228 listens on both port 9094 and port 9093.

@gdlx
Author

gdlx commented Jun 5, 2024

I do not know if this is related

@eimarfandino Yes, different ports but same issue.

@zoopp

zoopp commented Jun 10, 2024

I'm writing to confirm that I'm seeing this on GKE as well. Services with multiple ports bound to the same IP lead to duplicate metrics being exported by kube-state-metrics. For example (IP addresses masked):

kube_endpoint_address{namespace="monitoring",endpoint="gke-mcs-dncadu41te",ip="xx.xx.xx.xx",ready="true"} 1
kube_endpoint_address{namespace="monitoring",endpoint="gke-mcs-dncadu41te",ip="xx.xx.xx.xx",ready="true"} 1
kube_endpoint_address{namespace="monitoring",endpoint="gke-mcs-1j2b5u4e7g",ip="yy.yy.yy.yy",ready="true"} 1
kube_endpoint_address{namespace="monitoring",endpoint="gke-mcs-1j2b5u4e7g",ip="yy.yy.yy.yy",ready="true"} 1
kube_endpoint_address{namespace="monitoring",endpoint="gke-mcs-1q8fig66j0",ip="zz.zz.zz.zz",ready="true"} 1
kube_endpoint_address{namespace="monitoring",endpoint="gke-mcs-1q8fig66j0",ip="zz.zz.zz.zz",ready="true"} 1
kube_endpoint_address{namespace="monitoring",endpoint="gke-mcs-7eknv6n114",ip="aa.aa.aa.aa",ready="true"} 1
kube_endpoint_address{namespace="monitoring",endpoint="gke-mcs-7eknv6n114",ip="zz.zz.zz.zz",ready="true"} 1

@dgrisonnet
Member

/assign
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2024
@Serializator
Contributor

The kube_endpoint_address metric is explicitly for the addresses of an endpoint, and the kube_endpoint_ports metric is for its ports. Adding a port label to kube_endpoint_address would muddy the waters between these metrics and their purposes.

The other option is to ensure the IPs are unique when generating these metrics. There are a few concerns I have regarding this approach.

  • Can an IP address with different ports be available (.Addresses) and not ready (.NotReadyAddresses) at the same time?
  • The metric values of kube_endpoint_address_available and kube_endpoint_address_not_ready would no longer match the number of kube_endpoint_address series.

If we were to consider adding the port to the kube_endpoint_address metric, could this open up a conversation about a more generic kube_endpoint metric better suited for this? What was the original decision-making behind the separate _address and _ports metrics for endpoints?
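The second concern can be illustrated with a small sketch (hypothetical numbers taken from the subsets in the issue description): deduplicating IPs would leave kube_endpoint_address_available counting per-subset addresses while fewer kube_endpoint_address series are exported.

```python
# Hypothetical illustration of the count-mismatch concern: the
# *_available gauge counts addresses across all subsets, while a
# deduplicated kube_endpoint_address would emit one series per unique IP.
subset_addresses = [
    ["100.91.43.113", "100.91.68.8"],  # subset 0 (port 9090)
    ["100.91.68.8"],                   # subset 1 (port 10901)
]

available = sum(len(addrs) for addrs in subset_addresses)
unique_series = len({ip for addrs in subset_addresses for ip in addrs})

print(available, unique_series)  # 3 2 -> the counts no longer agree
```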

@gdlx
Author

gdlx commented Jun 24, 2024

@Serializator Does that mean the clean fix would be for the Prometheus Operator not to reuse the same address across different instances?
That would consume more IPs but would avoid this kind of hybrid endpoint subset.

@Serializator
Contributor

Hi @gdlx! The Prometheus Operator is not doing anything it shouldn't, so I think it's on KSM to support this unforeseen circumstance. The Prometheus Operator just happens to be the one bringing this problem to light; if it weren't the Prometheus Operator, it would have been something else.

@dgrisonnet
Member

The bug lies in the fact that we don't distinguish between endpoint subsets. The metric was written in a way where we assumed that addresses and ports would always be unique for a single endpoint and never duplicated between subsets.

I looked a bit at Kubernetes' validation for Endpoints, and it allows duplicate IP/port pairs between subsets:
https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/validation/validation.go#L7069-L7092

I think the only option we have here is to add a subset label set to the index of the subset within the endpoint.
We could theoretically also replace kube_endpoint_address and kube_endpoint_ports with a kube_endpoint_subsets metric, but both metrics are stable, and looking at the validation code we have no guarantee that two subsets won't contain the same IP/port pair.
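A sketch of that idea (hypothetical label name and a Python simplification, not the actual KSM code): carrying the subset index in the label set makes the previously identical series distinct.

```python
# Hypothetical sketch of the proposed fix: a "subset" label holding the
# subset's index disambiguates addresses that repeat across subsets.
subsets = [
    {"addresses": ["100.91.43.113", "100.91.68.8"]},
    {"addresses": ["100.91.68.8"]},
]

series = [
    f'kube_endpoint_address{{endpoint="prometheus-operated",ip="{ip}",ready="true",subset="{i}"}}'
    for i, subset in enumerate(subsets)
    for ip in subset["addresses"]
]

print(len(series) == len(set(series)))  # True: no duplicate label sets
```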

Any thoughts @mrueg @rexagod?

@mrueg
Member

mrueg commented Jun 28, 2024

We could also add a port label to kube_endpoint_address and mark kube_endpoint_ports as deprecated.

@kdeyko

kdeyko commented Jul 3, 2024

Hi there!
Is there any workaround while this is in progress?
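One possible stopgap while this is open (a suggestion, not confirmed by the maintainers in this thread): drop the affected metric at scrape time with a metric relabeling rule, at the cost of losing kube_endpoint_address entirely. In plain Prometheus scrape config this looks like the following; with prometheus-operator the equivalent goes under the ServiceMonitor's metricRelabelings.

```yaml
# Drop kube_endpoint_address before ingestion so the duplicate samples
# never reach the TSDB. This discards the metric entirely, so only do
# this if no alerts or dashboards depend on it.
metric_relabel_configs:
  - source_labels: [__name__]
    regex: kube_endpoint_address
    action: drop
```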

@eherot

eherot commented Jul 30, 2024

It sounds like this issue may have been brought to light by prometheus/prometheus#12933

@realjump

realjump commented Sep 3, 2024

Have now also come across this issue. Also interested in a fix / workaround.

@Le1ns

Le1ns commented Sep 9, 2024

Have now also come across this issue. Also interested in a fix / workaround.

Same here. I have external databases and created a Service with ports 5432 and 9187 (the exporter), so now I get the duplicate error.
Interested in a fix.

mrueg added a commit to mrueg/kube-state-metrics that referenced this issue Sep 22, 2024
This marks kube_endpoint_ports as deprecated

Fixes kubernetes#2408
mrueg added a commit to mrueg/kube-state-metrics that referenced this issue Sep 22, 2024
This marks kube_endpoint_ports as deprecated

Fixes kubernetes#2408