
Alarm on percentage of failing workers #313

Open
emilymcafee opened this issue Jul 28, 2019 · 3 comments

@emilymcafee
Contributor

Background: For some services, you don't need careful monitoring of worker errors if you have careful monitoring of the dead letter queue (DLQ): worker errors that don't end with a message on the DLQ were retried successfully. For these services, the only worker error monitoring we'd want is for widespread failure across all workers. However, if the number of workers running at a given time is variable, this isn't achievable with the current watchbot error alerting, which requires a static threshold.

Feature request: The ability to configure the error alarm with a percentage of failures would be great, e.g. “alarm when more than 75% of running jobs are failing.”
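
A sketch of how such an option might be documented, mirroring the existing errorThreshold parameter quoted further down in this thread; the name `errorPercentageThreshold` and its default are hypothetical, not something watchbot ships today:

/**
 * Hypothetical parameter, for illustration only.
 *
 * @param {Number|ref} [options.errorPercentageThreshold=75] - Watchbot would
 * create a CloudWatch alarm that fires when more than this percentage of the
 * jobs handed to workers during the evaluation period end in a worker error.
 * Like errorThreshold, it could be provided as a number or a reference.
 */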

/cc @mapbox/platform

@rclark
Contributor

rclark commented Jul 29, 2019

Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations?

In the statement

alarm when more than 75% of running jobs are failing

... would you want "failing" to include or exclude jobs that were later retried successfully?

@freenerd
Contributor

For reference, here is the current alarm that would be replaced by a percentage-based one:

* @param {Number|ref} [options.errorThreshold=10] - Watchbot creates a
* CloudWatch alarm that will fire if there have been more than this number
* of failed worker invocations in a 60 second period. This parameter can be provided as
* either a number or a reference, i.e. `{"Ref": "..."}`.

Resources[prefixed('WorkerErrorsAlarm')] = {
  Type: 'AWS::CloudWatch::Alarm',
  Properties: {
    AlarmName: cf.join('-', [cf.ref('AWS::StackName'), prefixed('-worker-errors'), cf.region]),
    AlarmDescription:
      `https://github.com/mapbox/ecs-watchbot/blob/${options.watchbotVersion}/docs/alarms.md#workererrors`,
    EvaluationPeriods: 1,
    Statistic: 'Sum',
    Threshold: options.errorThreshold,
    Period: '60',
    ComparisonOperator: 'GreaterThanThreshold',
    Namespace: 'Mapbox/ecs-watchbot',
    MetricName: cf.join([prefixed('WorkerErrors-'), cf.stackName]),
    AlarmActions: [notify]
  }
};

Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations?

I guess this depends on how many jobs finish per period. As seen above, the current alarm uses a period of 1 minute. We could use the same period here.

alarm when more than 75% of running jobs are failing

I actually think we'd want 100% of running jobs failing as the detection metric. This would show that there is a systemic error that makes all jobs fail.

... would you want "failing" to include or exclude jobs that were later retried successfully?

For jobs that continuously fail and don't succeed through retries, there is the dead letter queue (and the alarm on it). The idea of this alarm is to get alerted faster when there is a systemic problem.

One thing to note: we might have to tune this alarm so it doesn't trigger when only a very low number of tasks ran (e.g. 1 task finished, it failed, the alarm fires). Not sure if this is something we need to protect against.
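
One way that low-volume case could be handled is a conditional CloudWatch metric math expression. This is only a sketch: `errors` and `received` are assumed metric math IDs for the watchbot worker errors and SQS messages received metrics (see the composed alarm sketch further down), and the minimum of 5 finished jobs is an arbitrary placeholder:

{
  Id: 'percentage',
  Label: 'WorkerErrorPercentage',
  // only compute a failure percentage once at least 5 jobs finished in the
  // period; otherwise report 0 so a single failed job can't trip the alarm
  Expression: 'IF(received >= 5, 100 * errors / received, 0)',
  ReturnData: true
}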

@rclark
Contributor

rclark commented Jul 31, 2019

So pragmatically, which metrics do we compose to build this one? I think it would be:

  • SQS.NumberOfMessagesReceived = counts messages that have been handed to a worker
  • Watchbot.WorkerErrors = counts errors in watchbot worker scripts

So worker errors / messages received I guess?

Note that the worker errors metric wouldn't cover watcher failures, though my hunch is that those are quite rare and probably not what this alarm is trying to capture.
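
A rough sketch of how those two metrics could be composed into a percentage alarm with CloudWatch metric math, reusing the cf/prefixed/notify helpers from the existing alarm quoted above. The queue's logical name (prefixed('Queue')), the alarm name, and the 75% threshold are assumptions for illustration, not watchbot's actual implementation:

Resources[prefixed('WorkerErrorPercentageAlarm')] = {
  Type: 'AWS::CloudWatch::Alarm',
  Properties: {
    AlarmName: cf.join('-', [cf.ref('AWS::StackName'), prefixed('-worker-error-percentage'), cf.region]),
    AlarmDescription: 'Percentage of received messages whose worker invocation failed',
    EvaluationPeriods: 1,
    Threshold: 75,
    ComparisonOperator: 'GreaterThanThreshold',
    TreatMissingData: 'notBreaching',
    Metrics: [
      {
        Id: 'errors',
        ReturnData: false,
        MetricStat: {
          Metric: {
            Namespace: 'Mapbox/ecs-watchbot',
            MetricName: cf.join([prefixed('WorkerErrors-'), cf.stackName])
          },
          Period: 60,
          Stat: 'Sum'
        }
      },
      {
        Id: 'received',
        ReturnData: false,
        MetricStat: {
          Metric: {
            Namespace: 'AWS/SQS',
            MetricName: 'NumberOfMessagesReceived',
            // assumes the watchbot queue's logical name is prefixed('Queue')
            Dimensions: [{ Name: 'QueueName', Value: cf.getAtt(prefixed('Queue'), 'QueueName') }]
          },
          Period: 60,
          Stat: 'Sum'
        }
      },
      {
        // worker errors as a percentage of messages handed to workers; the
        // IF() guard sketched earlier could be swapped in to ignore quiet periods
        Id: 'percentage',
        Expression: '100 * errors / received',
        Label: 'WorkerErrorPercentage',
        ReturnData: true
      }
    ],
    AlarmActions: [notify]
  }
};

With TreatMissingData set to notBreaching, quiet periods where no messages were received shouldn't trip the alarm.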
