Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerts runtime support #3783

Merged
merged 29 commits into from
Feb 9, 2024
Merged

Alerts runtime support #3783

merged 29 commits into from
Feb 9, 2024

Conversation

begelundmuller
Copy link
Contributor

@begelundmuller begelundmuller commented Jan 5, 2024

Overview: Implements an Alert resource type and associated YAML parser and reconciler with the following high-level features:

  • An alert can be triggered on a cron schedule and/or when its dependencies are refreshed
  • An alert can either check relative to its current execution time (default) or for each of a given interval since last refresh
  • An alert has a query that it runs every time it is checked
  • An alert execution results in one of three states:
    • pass (query returned no rows)
    • fail (query returned 1+ rows)
    • error (query failed)
  • An alert can send emails either on every trigger OR only when there is a state change

Preliminary YAML syntax:

kind: alert
title: My Alert

# Configuration of when to check the alert
refresh:
  # When a query dependency is refreshed
  ref_update: true

  # On a time schedule:
  # cron: "0 * * * *"
  # every: 24h
  # time_zone: UTC

  # Disable the alert without deleting it
  # disable: true

# Watermark to use as the execution time of the alert (default: trigger_time).
# For "inherit", it will take the lowest watermark of the alert's refs.
watermark: trigger_time|inherit

# Configure the alert to check once per elapsed time interval (instead of once for the current watermark).
# If multiple time intervals have elapsed, the alert will be checked for each interval.
# If no time intervals have elapsed, the alert will not be checked.
intervals:
  # ISO duration of each interval
  duration: P1D
  # Whether unclosed intervals should be checked
  check_unclosed: false|true
  # Maximum number of intervals to check for one invocation.
  # If the limit is exceeded, it will only check the most recent intervals.
  limit: 10

# Configuration for the query to run
query:
  name: MetricsViewAggregation
  for: 
    # The user's attributes to use for security policies
    user_email: [email protected]
  args:
    metrics_view: mv1
    dimensions:
      - country
    measures:
      - sales
    time_range:
      iso_duration: P1W
    having:
      # Condition for: sales >= 1000
      cond:
        op: OPERATION_GTE
        exprs:
          - ident: sales
          - val: 1000

# Configuration for how to send emails
email:
  recipients:
    - [email protected]

  # The alert status cases to trigger an email for:
  on_recover: true
  on_fail: true
  on_error: true

  # Whether to send an email again if the alert triggers with the same status as previously
  renotify: false,true # Default: false
  # How long to wait until renotifying
  renotify_after: P24H

Other YAML changes:
In dashboard definitions, you can now specify a watermark expression:

model: foo
timeseries:  __time
watermark: date_trunc('day', max(__time)) 
# If not specified, the watermark defaults to max({{ timeseries }})

In any refresh clause (for sources, models, reports, alerts), you now have two new options:

refresh:
  disable: true # completely disable the resource (without deleting it)
  ref_update: false # don't refresh when a dependency is refreshed

Task list:

  • Define initial resource spec and state
  • Implement initial YAML syntax
  • Implement triggering logic
  • Implement interval logic
  • Implement query logic
  • Implement email logic
  • Experiment with watermark triggers (based on changes to max timestamp of underlying data)

Contributes to #3791

@begelundmuller begelundmuller self-assigned this Jan 5, 2024
Copy link

netlify bot commented Feb 1, 2024

Deploy Preview for rill-developer canceled.

Name Link
🔨 Latest commit 15d0118
🔍 Latest deploy log https://app.netlify.com/sites/rill-developer/deploys/65bbd2ca1080d30008431c43

@begelundmuller begelundmuller marked this pull request as ready for review February 1, 2024 17:32
// Enforce timeout
timeout := alertCheckDefaultTimeout
if a.Spec.TimeoutSeconds > 0 {
timeout = time.Duration(a.Spec.TimeoutSeconds) * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the timeout be a multiple of number of each block of interval this needs to run?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the execution in depth, it will recover from last executed interval. But there will be a failure in reconcile no? Would that lead to resource being deleted and all the history being lost?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good questions. I made the timeout apply to all queries together since the purpose of it is to prevent runaway query execution that makes reconciles take a very long time.

About failure, yes it will recover and lead to a reconcile error, but that won't cause the resource to be deleted. The resource will still be there (and keep its state), it will just have .Meta.ReconcileError set until it is retried.

}

// Note: Only writing the state version to the hash, not spec version, because it doesn't matter whether the spec/meta changes, only whether the state changes.
r, err := r.C.Get(ctx, ref, false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this cause an issue if the ref was created for the 1st time, went into error state and was fixed?

I remember this case when state version was reset and was the same if it came out of error state vs being created for the 1st time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed it

@begelundmuller
Copy link
Contributor Author

Linter is failing for an unrelated reason fixed in #3996

* Initial commit

* Use virtual file API

* Fix get file query
Copy link
Collaborator

@AdityaHegde AdityaHegde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@AdityaHegde AdityaHegde merged commit 4e55e1b into main Feb 9, 2024
7 checks passed
@AdityaHegde AdityaHegde deleted the begelundmuller/alerts-mvp branch February 9, 2024 05:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants