Add link to MAX_RETRY allocation explain message #113657

Draft: wants to merge 2 commits into base: main

17 changes: 10 additions & 7 deletions docs/reference/cluster/allocation-explain.asciidoc
@@ -162,7 +162,7 @@ node.
====== Maximum number of retries exceeded

The following response contains an allocation explanation for an unassigned
primary shard that has reached the maximum number of allocation retry attempts.

[source,js]
----
@@ -195,17 +195,20 @@ primary shard that has reached the maximum number of allocation retry attempts.
{
"decider": "max_retry",
"decision" : "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
Contributor

The API call in the message is actually POST /_cluster/reroute?retry_failed&metric=none - see org.elasticsearch.cluster.routing.allocation.decider.MaxRetryAllocationDecider#RETRY_FAILED_API.
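
For reference, here is a minimal Java sketch of issuing the call named in the comment above. Only the path and query string come from the comment; the host `http://localhost:9200`, the class name, and the method name are assumptions for illustration, and authentication is left out. The sketch builds the request with the JDK's `java.net.http` client and sends it only when a cluster address is passed as an argument.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RetryFailedRequestSketch {
    // Path and query string as quoted in the review comment.
    static final String RETRY_FAILED_PATH = "/_cluster/reroute?retry_failed&metric=none";

    // Builds (but does not send) the POST request against a hypothetical base URL.
    static HttpRequest build(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + RETRY_FAILED_PATH))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = build(args.length > 0 ? args[0] : "http://localhost:9200");
        System.out.println(request.method() + " " + request.uri());
        if (args.length > 0) {
            // Only contact a cluster when an address was given explicitly.
            HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }
}
```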

Contributor

Until we solidify the URL, I vote for leaving it off. I believe the GitHub suggestion to apply Dave's comment would then appear as:

Suggested change
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed&metric=none] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"

}
]
}
]
}
----
// NOTCONSOLE

If decider message indicates a transient allocation issue, use
<<cluster-reroute,the cluster reroute API>> to retry allocation.
This message indicates that the cluster was previously unable to
allocate this shard and chose to put a hold on further attempts.
This is done to avoid burdening the cluster with repeated requests that will fail.
If no other `no` decisions are present, then the transient allocation issue
that caused these failures has most likely been resolved, and you can use the
<<cluster-reroute,the cluster reroute API>> to retry allocation.
Comment on lines +209 to +211
Contributor

I think this will be confusing: there are normally always some `no` decisions, e.g. for nodes in the wrong data tier.

Also, I'd rather we used the imperative voice: "use the reroute API" rather than just suggesting "you can ...".

Finally, there's a duplicate "the" (one inside the link and one outside).

Contributor

Brainstorming, I might say:

Elasticsearch queues shard allocation retries in batches. If the cluster has long-running shard recoveries or a large number of them in progress, this process may time out for some shards, resulting in MAX_RETRY. This surfaces infrequently but is expected behavior: it prevents infinite retries, which may impact cluster performance. When encountered, use <<cluster-reroute,the cluster reroute API>> to retry allocation.


====== No valid shard copy

@@ -334,7 +337,7 @@ queued to allocate but currently waiting on other queued shards.
----
// NOTCONSOLE

This is a transient message that might appear when a large amount of shards are allocating.

===== Assigned shard

@@ -437,7 +440,7 @@ cluster balance.
===== No arguments

If you call the API with no arguments, {es} retrieves an allocation explanation
for an arbitrary unassigned primary or replica shard, returning any unassigned primary shards first.

[source,console]
----
@@ -15,6 +15,7 @@
import org.elasticsearch.cluster.routing.UnassignedInfo;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.ReferenceDocs;

/**
* An allocation decider that prevents shards from being allocated on any node if the shards allocation has been retried N times without
@@ -72,9 +73,10 @@ private static Decision debugDecision(Decision decision, UnassignedInfo info, in
return Decision.single(
Decision.Type.NO,
NAME,
"shard has exceeded the maximum number of retries [%d] on failed allocation attempts - manually call [%s] to retry, [%s]",
"shard has exceeded the maximum number of retries [%d] on failed allocation attempts - manually call [%s] to retry, and for more information, see [%s], [%s]",
maxRetries,
RETRY_FAILED_API,
ReferenceDocs.ALLOCATION_EXPLAIN_MAX_RETRY,
info.toString()
);
} else {
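
To see what the new format string would produce, here is a standalone Java sketch. It assumes `Decision.single` applies ordinary `java.util.Formatter` semantics to its explanation and arguments, and it substitutes plain strings for `ReferenceDocs.ALLOCATION_EXPLAIN_MAX_RETRY` and `info.toString()`. The class name, method name, and sample values are hypothetical; the value of `RETRY_FAILED_API` is taken from the review comment on the docs change.

```java
import java.util.Locale;

public class MaxRetryMessageSketch {
    // Value quoted in the review comment on the docs change.
    static final String RETRY_FAILED_API = "POST /_cluster/reroute?retry_failed&metric=none";

    // Formats the explanation with the same template as the new line in the diff.
    static String explain(int maxRetries, String docsLink, String unassignedInfo) {
        return String.format(
            Locale.ROOT,
            "shard has exceeded the maximum number of retries [%d] on failed allocation attempts - "
                + "manually call [%s] to retry, and for more information, see [%s], [%s]",
            maxRetries,
            RETRY_FAILED_API,
            docsLink,
            unassignedInfo
        );
    }

    public static void main(String[] args) {
        // Placeholder link and unassigned info, for illustration only.
        System.out.println(explain(5, "https://docs.example/max-retry", "unassigned_info[...]"));
    }
}
```

Under these assumptions the link is wrapped in square brackets (`see [...]`), whereas the example response in the docs change shows it without brackets, so one of the two may need adjusting.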
@@ -84,6 +84,7 @@ public enum ReferenceDocs {
FLOOD_STAGE_WATERMARK,
X_OPAQUE_ID,
FORMING_SINGLE_NODE_CLUSTERS,
ALLOCATION_EXPLAIN_MAX_RETRY,
// this comment keeps the ';' on the next line so every entry above has a trailing ',' which makes the diff for adding new links cleaner
;

@@ -43,5 +43,6 @@
"MAX_SHARDS_PER_NODE": "size-your-shards.html#troubleshooting-max-shards-open",
"FLOOD_STAGE_WATERMARK": "fix-watermark-errors.html",
"X_OPAQUE_ID": "api-conventions.html#x-opaque-id",
"FORMING_SINGLE_NODE_CLUSTERS": "modules-discovery-bootstrap-cluster.html#modules-discovery-bootstrap-cluster-joining"
"FORMING_SINGLE_NODE_CLUSTERS": "modules-discovery-bootstrap-cluster.html#modules-discovery-bootstrap-cluster-joining",
"ALLOCATION_EXPLAIN_MAX_RETRY": "cluster-allocation-explain.html#_maximum_number_of_retries_exceeded"
Contributor

Please add a fixed [[anchor-name]] in the docs rather than using the #_auto_generated one which might change inadvertently.

Contributor

See #113667 which forbids this.
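
To illustrate the concern, here is a hypothetical Java check that flags link targets whose fragment looks auto-generated. It is not the check from #113667, whose contents are not shown here. It assumes that auto-generated fragments begin with an underscore, as `#_maximum_number_of_retries_exceeded` does; the class and method names are made up for this sketch.

```java
public class AnchorCheckSketch {
    // Returns true when the link has a fragment that starts with '_',
    // which this sketch treats as a sign of an auto-generated anchor.
    static boolean usesAutoGeneratedAnchor(String link) {
        int hash = link.indexOf('#');
        return hash >= 0 && link.startsWith("_", hash + 1);
    }

    public static void main(String[] args) {
        String[] links = {
            "cluster-allocation-explain.html#_maximum_number_of_retries_exceeded",
            "api-conventions.html#x-opaque-id",
            "fix-watermark-errors.html"
        };
        for (String link : links) {
            System.out.println(link + " -> " + usesAutoGeneratedAnchor(link));
        }
    }
}
```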

}