
feat: Adding cluster dimension to all aggregating alerting rules #524

Open

wants to merge 3 commits into base: master
Conversation

langecode

This is a very nice collection of alerting and recording rules as well as dashboards. However, while the dashboards are prepared to run on something like Cortex, assuming time series from multiple clusters, not all of the alerting rules had multi-cluster support.

I have gone through the alerting rules to make sure the label defined in the clusterLabel configuration option is present on all alerts. If the alerts are evaluated in a setup with data from multiple clusters, each alert will then carry a label identifying the cluster that triggered it, which can be used for routing in Alertmanager, for example.
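As an illustration of the effect (a sketch with a hypothetical window and hypothetical threshold values; the real rules are generated from jsonnet with the configured clusterLabel substituted in):

```yaml
# Before: sum() drops all labels, so a firing alert cannot be
# attributed to a specific cluster.
- alert: KubeAPIErrorBudgetBurn
  expr: sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)

# After: grouping by the configured clusterLabel (here "cluster")
# yields one alert per cluster, which Alertmanager can route on.
- alert: KubeAPIErrorBudgetBurn
  expr: sum(apiserver_request:burnrate5m) by (cluster) > (14.40 * 0.01000)
```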

@langecode
Author

This might actually be a bit related to #457. I acknowledge the argument that the recording rules' output could have the cluster label attached on its way out of the local Prometheus. However, I think this use case is a little different, as evaluating the alerting rules outside of the local Prometheus can make sense for central surveillance and monitoring of the clusters. I would like to be able to control and change the rules outside of the cluster, just as I would like to be able to trigger Alertmanager outside of the cluster purely based on the metrics.

Signed-off-by: Thor Anker Kvisgård Lange <[email protected]>
@jsmcnair

jsmcnair commented Dec 3, 2020

If it helps, this is just what I was looking for! My use case is to use Prometheus federation and generate alerts at the central Prometheus. That way we can develop alerts for our cluster services against useful data and have a deployment pipeline for getting the alert rules into prod.

@brancz
Member

brancz commented Dec 4, 2020

@csmarchbanks @povilasv what do you think of this? I'm generally happy with it, but don't want to make the decision by myself as it kind of influences how we treat any new contribution. What do you think?

@csmarchbanks
Member

I think I am ok with this as well. I certainly understand the use case; my only concern is that it encourages alerting far away from a cluster, which can be less reliable.

@langecode, as you point out this is related to #457, will you still be evaluating the rules inside of the local Prometheus instances? Otherwise I think some of the rules that are used in the alerts will not have a cluster label if run in something like Cortex.

@langecode
Author

langecode commented Dec 4, 2020 via email

Contributor

@povilasv left a comment

I'm fine with this approach as long as we don't break the single-tenant use case.

I.e. all the alerts work without a cluster label.

@langecode
Author

langecode commented Dec 7, 2020 via email

Member

@csmarchbanks left a comment

One small comment, otherwise it seems like there is agreement to accept this!

@@ -18,14 +18,16 @@ local utils = import 'utils.libsonnet';
       {
         alert: 'KubeAPIErrorBudgetBurn',
         expr: |||
-          sum(apiserver_request:burnrate%s) > (%.2f * %.5f)
+          sum(apiserver_request:burnrate%s) by (%s) > (%.2f * %.5f)
Member

Do you think it is worth using named variables here? It takes a moment to make sure the order is correct.

Author

Yeah, I actually had that same experience adding in the `by` clause. I tried to keep the disruption minimal; that was why I did not change the input for the substitution.
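For reference, jsonnet's `%` formatting also accepts named variables when given an object instead of an array, which would make the argument order explicit. A sketch, not the code in this PR, with hypothetical variable names and values:

```jsonnet
{
  alert: 'KubeAPIErrorBudgetBurn',
  expr: |||
    sum(apiserver_request:burnrate%(window)s) by (%(clusterLabel)s) > (%(factor).2f * %(budget).5f)
  ||| % {
    window: '5m',                          // hypothetical burn-rate window suffix
    clusterLabel: $._config.clusterLabel,  // label configured for multi-cluster setups
    factor: 14.40,                         // hypothetical burn-rate factor
    budget: 0.01000,                       // hypothetical error budget
  },
}
```

With named keys, reordering or adding a substitution cannot silently swap values the way positional `%s` arguments can.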

@jsmcnair

Is there anything I can do to help move this along? It seems like there's not much left to do other than fix the merge conflicts.

5 participants