feat: Adding cluster dimension to all aggregating alerting rules #524
base: master
Conversation
Signed-off-by: Thor Anker Kvisgård Lange <[email protected]>
This might actually be a bit related to #457. I acknowledge the argument that the output from the recording rules could have the cluster label added by the in-cluster Prometheus.
Signed-off-by: Thor Anker Kvisgård Lange <[email protected]>
If it helps, this is just what I was looking for! My use case is to use Prometheus federation and generate alerts at the central Prometheus; that way we can develop alerts for our cluster services on useful data and have a deployment pipeline for getting the alert rules into prod.
@csmarchbanks @povilasv what do you think of this? I'm generally happy with it, but don't want to make the decision by myself as it kind of influences how we treat any new contribution. What do you think?
I think I am ok with this as well. I certainly understand the use case; my only concern is that it encourages alerting far away from a cluster, which can be less reliable. @langecode, as you point out this is related to #457: will you still be evaluating the rules inside the local Prometheus instances? Otherwise I think some of the rules that are used in the alerts will not have a cluster label if run in something like Cortex.
I hope I have been through all the expressions of the alerting rules to make sure they carry the cluster label all the way through; that was the intention, anyway. I did not touch the recording rules, since the output of the recording rules will have the label added by the in-cluster Prometheus.

I get your point about the evaluation happening away from the cluster; however, we are considering a multi-tenancy setup where we would like to be able to centrally control alerting across tenants and their clusters. Alerting may be customized for each tenant or cluster, but it is still controlled outside of the individual cluster.
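For context, the in-cluster Prometheus typically attaches such a label to everything it forwards via `external_labels`. A minimal sketch, assuming the label name `cluster` and an illustrative cluster name, neither of which is part of this PR:

```yaml
# Sketch: in-cluster Prometheus attaching a cluster identifier to all
# series it federates or remote-writes upstream.
global:
  external_labels:
    cluster: prod-eu-1
```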
I'm fine with this approach as long as we don't break the single-tenant use case, i.e. all the alerts work without a cluster label.
Yes indeed, both cases should be supported. I validated the changed expressions on a kind deployment with the mixin running in Prometheus/Grafana. The Prometheus was forwarding metrics to a Cortex instance, and I ran the expressions against both Prometheus (having no cluster label) and Cortex (having the cluster label associated with the data). So I tried to test the compatibility; however, it was done by hand, so it is not an automated validation.
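For what it is worth, this compatibility is what PromQL semantics predict: aggregating by a label that is absent from the input series simply yields a single group with an empty value for that label. Taking one of the mixin's burn-rate recording rules as an example:

```promql
sum(apiserver_request:burnrate1h) by (cluster)
```

On a single-cluster Prometheus without a `cluster` label this returns one series; on Cortex it returns one series per cluster.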
One small comment; otherwise it seems like there is agreement to accept this!
```diff
@@ -18,14 +18,16 @@ local utils = import 'utils.libsonnet';
       {
         alert: 'KubeAPIErrorBudgetBurn',
         expr: |||
-          sum(apiserver_request:burnrate%s) > (%.2f * %.5f)
+          sum(apiserver_request:burnrate%s) by (%s) > (%.2f * %.5f)
```
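For illustration, once the placeholders (window, grouping label, and thresholds) are substituted, the rendered expression looks something like the following; the concrete numbers are illustrative and vary per burn-rate window:

```promql
sum(apiserver_request:burnrate1h) by (cluster) > (14.40 * 0.01000)
```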
Do you think it is worth using named variables here? Takes a moment to make sure the order is correct.
Yeah, I actually had that same experience adding in the `by` clause. I tried to keep the disruption minimal; that was why I did not change the inputs to the substitution.
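For reference, Jsonnet's `%` operator also accepts an object of named values, so the named-variable variant suggested above could look roughly like this (a sketch only; the field names and concrete values are illustrative, not taken from the PR):

```jsonnet
{
  alert: 'KubeAPIErrorBudgetBurn',
  expr: |||
    sum(apiserver_request:burnrate%(window)s) by (%(clusterLabel)s) > (%(factor).2f * %(errorBudget).5f)
  ||| % {
    window: '1h',                          // burn-rate window
    clusterLabel: $._config.clusterLabel,  // grouping label from the mixin config
    factor: 14.4,                          // illustrative burn-rate factor
    errorBudget: 0.01,                     // illustrative error budget (1 - SLO target)
  },
}
```

This makes the argument order irrelevant, at the cost of a slightly noisier template.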
Is there anything I can do to help move this along? It seems like there is not much left to do other than fixing the merge conflicts?
This is indeed a very nice collection of alerting and recording rules as well as dashboards. However, while the dashboards are prepared to run on something like Cortex, assuming time series from multiple clusters, the alerting rules did not all have multi-cluster support.
I have gone through the alerting rules to make sure the label defined in the configuration option clusterLabel is present on all alerts. So if alerts are evaluated in a setup with data from multiple clusters, each alert will carry a label identifying the cluster that triggered it, which can be used for routing in Alertmanager, among other things.
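For example, a consumer of the mixin could set the label name when importing it; a minimal sketch (the import path is illustrative):

```jsonnet
// Sketch: consuming the mixin with an explicit cluster label.
local kubernetes = import 'kubernetes-mixin/mixin.libsonnet';

kubernetes {
  _config+:: {
    clusterLabel: 'cluster',  // the label Alertmanager routes and silences can match on
  },
}
```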