
Alarm on percentage of failing workers #313

Open
emilymcafee opened this issue Jul 28, 2019 · 3 comments

@emilymcafee
Contributor

Background: For some services, you don't need careful monitoring of worker errors if you have careful monitoring of the dead letter queue (DLQ): worker errors that don't end with a message on the DLQ were retried successfully. For these services, the only worker error monitoring we'd want is for widespread failure across all workers. However, if the number of workers running at a given time is variable, this isn't achievable with the current watchbot error alerting, which requires a static threshold.

Feature request: The ability to configure the error alarm with a percentage of failures would be great, e.g. “alarm when more than 75% of running jobs are failing.”
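
A sketch of how such an option might be documented, mirroring the existing errorThreshold parameter quoted further down in this thread; the name `errorPercentageThreshold` and its default are hypothetical, not something watchbot ships today:

/**
 * Hypothetical parameter, for illustration only.
 *
 * @param {Number|ref} [options.errorPercentageThreshold=75] - Watchbot would
 * create a CloudWatch alarm that fires when more than this percentage of the
 * jobs handed to workers during the evaluation period end in a worker error.
 * Like errorThreshold, it could be provided as a number or a reference.
 */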

/cc @mapbox/platform

@rclark
Contributor

rclark commented Jul 29, 2019

Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations?

In the statement

alarm when more than 75% of running jobs are failing

... would you want "failing" to include or exclude jobs that were later retried successfully?

@freenerd
Contributor

For reference, here is the current alarm that would be replaced by a percentage-based one:

* @param {Number|ref} [options.errorThreshold=10] - Watchbot creates a
* CloudWatch alarm that will fire if there have been more than this number
* of failed worker invocations in a 60 second period. This parameter can be provided as
* either a number or a reference, i.e. `{"Ref": "..."}`.

Resources[prefixed('WorkerErrorsAlarm')] = {
  Type: 'AWS::CloudWatch::Alarm',
  Properties: {
    AlarmName: cf.join('-', [cf.ref('AWS::StackName'), prefixed('-worker-errors'), cf.region]),
    AlarmDescription:
      `https://github.com/mapbox/ecs-watchbot/blob/${options.watchbotVersion}/docs/alarms.md#workererrors`,
    EvaluationPeriods: 1,
    Statistic: 'Sum',
    Threshold: options.errorThreshold,
    Period: '60',
    ComparisonOperator: 'GreaterThanThreshold',
    Namespace: 'Mapbox/ecs-watchbot',
    MetricName: cf.join([prefixed('WorkerErrors-'), cf.stackName]),
    AlarmActions: [notify]
  }
};

Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations?

I guess this depends on how many jobs finish per period. As seen above, the current alarm uses a period of 1 minute. We could use the same period here.

alarm when more than 75% of running jobs are failing

I actually think we'd want 100% of running jobs failing as the detection metric. This would show that there is a systemic error that makes all jobs fail.

... would you want "failing" to include or exclude jobs that were later retried successfully?

For jobs that continuously fail and don't succeed through retries, there is the dead letter queue (and the alarm on it). The idea of this alarm is to get alerted faster when there is a systemic problem.

One thing to note: we might have to tune this alarm so it doesn't trigger when only a very low number of tasks ran (e.g. 1 task finished, it failed, the alarm fires). Not sure if this is something we need to protect against.
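
One way that low-volume case could be handled is a conditional CloudWatch metric math expression. This is only a sketch: `errors` and `received` are assumed metric math IDs for the watchbot worker errors and SQS messages received metrics (see the composed alarm sketch further down), and the minimum of 5 finished jobs is an arbitrary placeholder:

{
  Id: 'percentage',
  Label: 'WorkerErrorPercentage',
  // only compute a failure percentage once at least 5 jobs finished in the
  // period; otherwise report 0 so a single failed job can't trip the alarm
  Expression: 'IF(received >= 5, 100 * errors / received, 0)',
  ReturnData: true
}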

@rclark
Contributor

rclark commented Jul 31, 2019

So pragmatically, which metrics do we compose to build this one? I think it would be:

  • SQS.NumberOfMessagesReceived = counts messages that have been handed to a worker
  • Watchbot.WorkerErrors = counts errors in watchbot worker scripts

So worker errors / messages received I guess?

Note that the worker errors metric wouldn't cover watcher failures, though my hunch is that those are quite rare and probably not what this alarm is trying to capture.
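
A rough sketch of how those two metrics could be composed into a percentage alarm with CloudWatch metric math, reusing the cf/prefixed/notify helpers from the existing alarm quoted above. The queue's logical name (prefixed('Queue')), the alarm name, and the 75% threshold are assumptions for illustration, not watchbot's actual implementation:

Resources[prefixed('WorkerErrorPercentageAlarm')] = {
  Type: 'AWS::CloudWatch::Alarm',
  Properties: {
    AlarmName: cf.join('-', [cf.ref('AWS::StackName'), prefixed('-worker-error-percentage'), cf.region]),
    AlarmDescription: 'Percentage of received messages whose worker invocation failed',
    EvaluationPeriods: 1,
    Threshold: 75,
    ComparisonOperator: 'GreaterThanThreshold',
    TreatMissingData: 'notBreaching',
    Metrics: [
      {
        Id: 'errors',
        ReturnData: false,
        MetricStat: {
          Metric: {
            Namespace: 'Mapbox/ecs-watchbot',
            MetricName: cf.join([prefixed('WorkerErrors-'), cf.stackName])
          },
          Period: 60,
          Stat: 'Sum'
        }
      },
      {
        Id: 'received',
        ReturnData: false,
        MetricStat: {
          Metric: {
            Namespace: 'AWS/SQS',
            MetricName: 'NumberOfMessagesReceived',
            // assumes the watchbot queue's logical name is prefixed('Queue')
            Dimensions: [{ Name: 'QueueName', Value: cf.getAtt(prefixed('Queue'), 'QueueName') }]
          },
          Period: 60,
          Stat: 'Sum'
        }
      },
      {
        // worker errors as a percentage of messages handed to workers; the
        // IF() guard sketched earlier could be swapped in to ignore quiet periods
        Id: 'percentage',
        Expression: '100 * errors / received',
        Label: 'WorkerErrorPercentage',
        ReturnData: true
      }
    ],
    AlarmActions: [notify]
  }
};

With TreatMissingData set to notBreaching, quiet periods where no messages were received shouldn't trip the alarm.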
