Add minSize and maxSize of service scaleup and scaledown, deadletter queue threshold, info to doc #211

Merged
merged 7 commits into from
Jun 12, 2018
31 changes: 31 additions & 0 deletions docs/worker-retry-cycle.md
@@ -0,0 +1,31 @@
## Worker retry cycle

Any single Watchbot message will be attempted up to `deadletterThreshold` times. Each time the message fails, it is put back into the queue with an increasing backoff interval before it can be attempted again. With the default threshold of 10, these intervals look like:

attempt number | backoff interval (s)
--- | ---
1 | 2
2 | 4
3 | 8
4 | 16
5 | 32
6 | 64
7 | 128
8 | 256
9 | 512
10 | --> dead letter queue

This means that after a failure on the 9th attempt, the message will be invisible for at least 512 seconds before it is retried. Providing an increasing backoff interval with an increasing number of failures helps alleviate load that your processing may be placing on external systems.
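
The intervals in the table are consistent with the backoff doubling on each failed attempt, i.e. `2 ** attempt` seconds. A minimal sketch under that assumption follows; the function name `backoffSeconds` is hypothetical and is not taken from Watchbot's code.

```javascript
// Hypothetical helper: reproduces the table above, assuming the backoff
// doubles with each failed attempt (2^attempt seconds).
function backoffSeconds(attempt) {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError('attempt must be a positive integer');
  }
  return 2 ** attempt;
}

// Collect the intervals for attempts 1 through 9, the rows of the table
// that carry a backoff value.
const intervals = [];
for (let attempt = 1; attempt <= 9; attempt++) {
  intervals.push(backoffSeconds(attempt));
}

console.log(intervals.join(', '));
```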

The default `deadletterThreshold` is 10. You can change it by setting the
`deadletterThreshold` option when creating the Watchbot service.
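
As an illustration of how an override interacts with the defaults: the template function in `lib/template.js` appears to merge the caller's options over a defaults object, so any key left out keeps its default. The snippet below imitates that merge with a trimmed-down defaults object. The `withDefaults` helper and the override values are hypothetical and exist only for this example.

```javascript
// Trimmed-down copy of the defaults that relate to retries and scaling.
const defaults = {
  maxSize: 1,
  minSize: 0,
  deadletterThreshold: 10
};

// Hypothetical helper imitating the merge: caller options win over defaults,
// and the defaults object itself is left unmodified.
function withDefaults(options = {}) {
  return Object.assign({}, defaults, options);
}

// Override two values; minSize falls back to its default of 0.
const resolved = withDefaults({ maxSize: 20, deadletterThreshold: 5 });
console.log(resolved);
```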

Each time a message fails during processing, it is recorded in [the WorkerErrors or FailedWorkerPlacement metrics](./logging-and-metrics.md#custom-metrics). The [WorkerErrors alarm](./alarms.md#workererrors) will trigger whenever there are more than a configured number of failed attempts per minute. The [FailedWorkerPlacement alarm](./alarms.md#failedworkerplacement) will trigger if there are more than 5 failed placements per minute.

With the default `deadletterThreshold` of 10, a message whose 10th attempt fails will have been retrying for a minimum of 17 minutes, and at this point it will fall into a dead letter queue.
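
The 17-minute figure is the sum of the backoff intervals that precede the 10th attempt: 2 + 4 + ... + 512 = 1022 seconds. A minimal sketch that generalizes this to other thresholds is shown here. It assumes the doubling pattern from the table continues unchanged and ignores the time spent processing each attempt; the function name is hypothetical. For large thresholds SQS's 12-hour limit on visibility timeouts would come into play, which this sketch does not model.

```javascript
// Hypothetical helper: minimum seconds a message spends backing off before
// its final attempt, assuming a backoff of 2^attempt seconds after each of
// the first (deadletterThreshold - 1) failures.
function minimumRetrySeconds(deadletterThreshold) {
  let total = 0;
  for (let attempt = 1; attempt < deadletterThreshold; attempt++) {
    total += 2 ** attempt;
  }
  return total;
}

const seconds = minimumRetrySeconds(10);
console.log(seconds);                   // 1022
console.log((seconds / 60).toFixed(1)); // "17.0"
```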

## The dead letter queue

If a message fails processing `deadletterThreshold` times, Watchbot will stop attempting it. The message will be dropped into a second SQS queue, called a dead letter queue. When there are **any** messages visible in this queue, Watchbot will trip the [DeadLetter alarm](./alarms.md#deadletter). This helps to give visibility into edge-case messages that may highlight a bug in worker code.
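
In the CloudFormation template, this threshold is applied through the source queue's SQS redrive policy, where `maxReceiveCount` is set from `options.deadletterThreshold`. A simplified sketch of that fragment follows. The `redrivePolicy` helper and the ARN are hypothetical placeholders; `deadLetterTargetArn` and `maxReceiveCount` are the property names SQS defines.

```javascript
// Hypothetical helper: builds the RedrivePolicy object for the source queue.
// SQS moves a message to the dead letter queue once it has been received
// more than maxReceiveCount times without being deleted.
function redrivePolicy(deadLetterQueueArn, deadletterThreshold = 10) {
  return {
    deadLetterTargetArn: deadLetterQueueArn,
    maxReceiveCount: deadletterThreshold
  };
}

// Placeholder ARN for illustration only.
const policy = redrivePolicy(
  'arn:aws:sqs:us-east-1:123456789012:example-dead-letter',
  5
);
console.log(JSON.stringify(policy, null, 2));
```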

Once a message is in the dead letter queue, it will stay there until it is manually removed or until it expires after 14 days. See [the CLI documentation](./command-line-utilities.md#dead-letter) for instructions on interacting with the dead letter queue.
27 changes: 18 additions & 9 deletions lib/template.js
@@ -35,9 +35,12 @@ const pkg = require(path.resolve(__dirname, '..', 'package.json'));
* to specify this in order to differentiate the resources.
* @param {String} [options.family] - the name of the task definition family
* that watchbot will create revisions of.
* @param {Number|ref} [options.workers=1] - the maximum number of worker
* containers that can be launched to process jobs concurrently. This parameter
* can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
* @param {Number|ref} [options.maxSize=1] - the maximum size for the service to
* scale up to. This parameter can be provided as either a number or a reference,
* i.e. `{"Ref": "..."}`.
* @param {Number|ref} [options.minSize=0] - the minimum size for the service to
* scale down to. This parameter can be provided as either a number or a reference,
* i.e. `{"Ref": "..."}`.
* @param {String} [options.mounts=''] - if your worker containers need to mount
* files or folders from the host EC2 file system, specify those mounts with this parameter.
* A single persistent mount point can be specified as `{host location}:{container location}`,
@@ -97,6 +100,10 @@ const pkg = require(path.resolve(__dirname, '..', 'package.json'));
* of 1-minute periods before an alarm is triggered. The default is 1 period, or
* 1 minute. This parameter can be provided as either a number or a reference,
* i.e. `{"Ref": "..."}`.
* @param {Number|ref} [options.deadletterThreshold=10] - Use this parameter to
* control the number of times a message is delivered to the source queue
* before being moved to the dead-letter queue. This parameter can be provided
* as either a number or a reference, i.e. `{"Ref": "..."}`.
*/
module.exports = (options = {}) => {
['service', 'serviceVersion', 'command', 'cluster'].forEach((required) => {
@@ -111,14 +118,16 @@ module.exports = (options = {}) => {
env: {},
messageTimeout: 600,
messageRetention: 1209600,
workers: 1,
maxSize: 1,
minSize: 0,
mounts: '',
privileged: false,
family: options.service,
errorThreshold: 10,
alarmThreshold: 40,
alarmPeriods: 24,
failedPlacementAlarmPeriods: 1
failedPlacementAlarmPeriods: 1,
deadletterThreshold: 10
},
options
);
@@ -210,7 +219,7 @@ module.exports = (options = {}) => {
MessageRetentionPeriod: options.messageRetention,
RedrivePolicy: {
deadLetterTargetArn: cf.getAtt(prefixed('DeadLetterQueue'), 'Arn'),
maxReceiveCount: 10
maxReceiveCount: options.deadletterThreshold
}
}
};
@@ -439,8 +448,8 @@ module.exports = (options = {}) => {
'/',
cf.getAtt(prefixed('Service'), 'Name')
]),
MinCapacity: 0,
MaxCapacity: options.workers,
MinCapacity: options.minSize,
MaxCapacity: options.maxSize,
RoleARN: cf.getAtt(prefixed('ScalingRole'), 'Arn')
}
};
@@ -457,7 +466,7 @@ module.exports = (options = {}) => {
MetricAggregationType: 'Average',
StepAdjustments: [
{
ScalingAdjustment: Math.ceil(options.workers / 10),
ScalingAdjustment: Math.ceil(options.maxSize / 10),
MetricIntervalLowerBound: 0.0
}
]
4 changes: 2 additions & 2 deletions test/__snapshots__/template.jest.js.snap
@@ -353,7 +353,7 @@ Object {
"StepAdjustments": Array [
Object {
"MetricIntervalLowerBound": 0,
"ScalingAdjustment": 9,
"ScalingAdjustment": 1,
},
],
},
@@ -435,7 +435,7 @@ Object {
},
"SoupScalingTarget": Object {
"Properties": Object {
"MaxCapacity": 90,
"MaxCapacity": 1,
"MinCapacity": 0,
"ResourceId": Object {
"Fn::Join": Array [