feat: Adding cluster dimension to all aggregating alerting rules #524
base: master
Conversation
Signed-off-by: Thor Anker Kvisgård Lange <[email protected]>
This might actually be a bit related to #457. I acknowledge the argument that the output from the recording rules could have the cluster label added by the in-cluster Prometheus.
Signed-off-by: Thor Anker Kvisgård Lange <[email protected]>
If it helps, this is just what I was looking for! My use case is to use Prometheus federation and generate alerts at the central Prometheus; that way we can develop alerts for our cluster services on useful data and have a deployment pipeline for getting the alert rules into prod.
@csmarchbanks @povilasv what do you think of this? I'm generally happy with it, but don't want to make the decision by myself as it kind of influences how we treat any new contribution. What do you think?
I think I am ok with this as well. I certainly understand the use case; my only concern is that it encourages alerting far away from a cluster, which can be less reliable. @langecode, as you point out this is related to #457: will you still be evaluating the rules inside the local Prometheus instances? Otherwise I think some of the rules that are used in the alerts will not have a cluster label if run in something like Cortex.
I hope I have been through all the expressions of the alerting rules to make sure they carry the cluster label all the way through; that was the intention, anyway. I did not touch the recording rules, since the output of the recording rules will have the label added by the in-cluster Prometheus.

I get your point about the evaluation happening away from the cluster; however, we are considering a multi-tenancy setup where we would like to be able to centrally control alerting across tenants and their clusters. Alerting may be customized for each tenant or cluster, but it is still controlled outside of the individual cluster.
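For context, the in-cluster Prometheus typically attaches such a label to everything it forwards via `external_labels`. A minimal sketch, assuming the label name `cluster` and an illustrative cluster name, neither of which is part of this PR:

```yaml
# Sketch: in-cluster Prometheus attaching a cluster identifier to all
# series it federates or remote-writes upstream.
global:
  external_labels:
    cluster: prod-eu-1
```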
I'm fine with this approach as long as we don't break the single-tenant use case, i.e. all the alerts work without a cluster label.
Yes indeed, both cases should be supported. I validated the changed expressions on a kind deployment with the mixin running in Prometheus/Grafana. The Prometheus was forwarding metrics to a Cortex instance, and I ran the expressions against both Prometheus (having no cluster label) and Cortex (having the cluster label associated with the data). So I tried to test the compatibility; however, it was done by hand, so it is not an automated validation.
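For what it is worth, this compatibility is what PromQL semantics predict: aggregating by a label that is absent from the input series simply yields a single group with an empty value for that label. Taking one of the mixin's burn-rate recording rules as an example:

```promql
sum(apiserver_request:burnrate1h) by (cluster)
```

On a single-cluster Prometheus without a `cluster` label this returns one series; on Cortex it returns one series per cluster.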
One small comment; otherwise it seems like there is agreement to accept this!
```diff
@@ -18,14 +18,16 @@ local utils = import 'utils.libsonnet';
       {
         alert: 'KubeAPIErrorBudgetBurn',
         expr: |||
-          sum(apiserver_request:burnrate%s) > (%.2f * %.5f)
+          sum(apiserver_request:burnrate%s) by (%s) > (%.2f * %.5f)
```
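For illustration, once the placeholders (window, grouping label, and thresholds) are substituted, the rendered expression looks something like the following; the concrete numbers are illustrative and vary per burn-rate window:

```promql
sum(apiserver_request:burnrate1h) by (cluster) > (14.40 * 0.01000)
```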
Do you think it is worth using named variables here? Takes a moment to make sure the order is correct.
Yeah, I actually had that same experience adding in the `by` clause. I tried to keep the disruption minimal; that was why I did not change the inputs to the substitution.
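For reference, Jsonnet's `%` operator also accepts an object of named values, so the named-variable variant suggested above could look roughly like this (a sketch only; the field names and concrete values are illustrative, not taken from the PR):

```jsonnet
{
  alert: 'KubeAPIErrorBudgetBurn',
  expr: |||
    sum(apiserver_request:burnrate%(window)s) by (%(clusterLabel)s) > (%(factor).2f * %(errorBudget).5f)
  ||| % {
    window: '1h',                          // burn-rate window
    clusterLabel: $._config.clusterLabel,  // grouping label from the mixin config
    factor: 14.4,                          // illustrative burn-rate factor
    errorBudget: 0.01,                     // illustrative error budget (1 - SLO target)
  },
}
```

This makes the argument order irrelevant, at the cost of a slightly noisier template.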
Is there anything I can do to help move this along? It seems like there is not much left to do other than fixing the merge conflicts?
This is indeed a very nice collection of alerting and recording rules as well as dashboards. However, while the dashboards are prepared to run on something like Cortex, assuming time series from multiple clusters, the alerting rules did not all have multi-cluster support.
I have gone through the alerting rules to make sure the label defined in the configuration option clusterLabel is present on all alerts. So if alerts are evaluated in a setup with data from multiple clusters, each alert will carry a label identifying the cluster that triggered it, which can be used for routing in Alertmanager, among other things.
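For example, a consumer of the mixin could set the label name when importing it; a minimal sketch (the import path is illustrative):

```jsonnet
// Sketch: consuming the mixin with an explicit cluster label.
local kubernetes = import 'kubernetes-mixin/mixin.libsonnet';

kubernetes {
  _config+:: {
    clusterLabel: 'cluster',  // the label Alertmanager routes and silences can match on
  },
}
```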