Add link to MAX_RETRY allocation explain message #113657

Draft
wants to merge 2 commits into
base: main


matthewabbott

Adds a reference link for "maximum number of retries exceeded" to the max_retry allocation explanation string.

Adds more detail to the documentation page, explaining that the retry limit exists to protect the cluster, but that the underlying cause of the failures may since have been resolved, so allocation can be retried.

Also adds POST to the example _cluster/reroute API call in the explanation, because some customers used GET and were confused when it didn't work.

@matthewabbott matthewabbott added >docs General docs changes >non-issue :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Meta label for distributed team Team:Docs Meta label for docs team Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. labels Sep 27, 2024

Documentation preview:

@elasticsearchmachine elasticsearchmachine added v9.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Sep 27, 2024

@DaveCTurner DaveCTurner left a comment


Some small comments. Also you need to run ./gradlew precommit and fix up the issues.

@@ -195,17 +195,20 @@ primary shard that has reached the maximum number of allocation retry attempts.
 {
 "decider": "max_retry",
 "decision" : "NO",
-"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
+"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"

The API call in the message is actually POST /_cluster/reroute?retry_failed&metric=none - see org.elasticsearch.cluster.routing.allocation.decider.MaxRetryAllocationDecider#RETRY_FAILED_API.
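
For readers following along, here is a minimal sketch of building that request with the correct HTTP method, using only Python's standard library. The helper name `build_retry_failed_request` and the `http://localhost:9200` base URL are assumptions for illustration; only the method and path come from the comment above.

```python
from urllib.request import Request


def build_retry_failed_request(base_url: str) -> Request:
    """Build (but do not send) the reroute call quoted in the review comment.

    The method is set explicitly to POST; a Request without data would
    otherwise default to GET, which is the confusion described in the PR.
    """
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed&metric=none"
    return Request(url, method="POST")


# Hypothetical local cluster address, used only for illustration.
req = build_retry_failed_request("http://localhost:9200")
print(req.get_method(), req.full_url)

# To actually send it against a reachable cluster:
#   from urllib.request import urlopen
#   with urlopen(req) as resp:
#       print(resp.status)
```

The request is only constructed here, so the sketch runs without a cluster; authentication and TLS settings are left out and would be needed in most real deployments.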

Comment on lines +209 to +211
If no other `no` decisions are present, then the transient allocation issue
that caused these failures has most likely been resolved, and you can use the
<<cluster-reroute,the cluster reroute API>> to retry allocation.

I think this will be confusing: there are almost always some `no` decisions, e.g. for nodes in the wrong data tier.

Also, I'd rather we used the imperative voice: "use the reroute API" rather than just suggesting "you can ...".

Finally, there's a duplicate "the" (one inside the link and one outside).
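
To illustrate the first point, here is a rough sketch that separates nodes blocked only by `max_retry` from nodes that also carry unrelated `NO` decisions. It assumes the allocation explain response has a `node_allocation_decisions` list whose entries contain a `deciders` list with `decider` and `decision` keys; the sample response, node names, and the `data_tier` decider entry are made up for illustration.

```python
def nodes_blocked_only_by_max_retry(explain: dict) -> list[str]:
    """Return names of nodes whose only NO decision comes from max_retry."""
    result = []
    for node in explain.get("node_allocation_decisions", []):
        # Collect the names of all deciders that said NO for this node.
        no_deciders = {
            d.get("decider")
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        }
        if no_deciders == {"max_retry"}:
            result.append(node.get("node_name", "<unknown>"))
    return result


# Made-up sample, trimmed to the fields the function reads.
sample = {
    "node_allocation_decisions": [
        {
            "node_name": "node-hot-1",
            "deciders": [{"decider": "max_retry", "decision": "NO"}],
        },
        {
            "node_name": "node-cold-1",
            "deciders": [
                {"decider": "max_retry", "decision": "NO"},
                {"decider": "data_tier", "decision": "NO"},
            ],
        },
    ]
}

print(nodes_blocked_only_by_max_retry(sample))  # ['node-hot-1']
```

In this sample the response as a whole contains a `NO` decision unrelated to `max_retry`, yet one node is blocked by `max_retry` alone, which is why "no other `no` decisions are present" is hard to apply across the whole response.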


Brainstorming, I might say

Elasticsearch queues shard allocation retries in batches. If shard recoveries in the cluster are long-running or numerous, this process may time out for some shards, resulting in MAX_RETRY. This surfaces infrequently but is expected: it prevents infinite retries, which could impact cluster performance. When encountered, run <<cluster-reroute,the cluster reroute API>> to retry allocation.

@@ -43,5 +43,6 @@
 "MAX_SHARDS_PER_NODE": "size-your-shards.html#troubleshooting-max-shards-open",
 "FLOOD_STAGE_WATERMARK": "fix-watermark-errors.html",
 "X_OPAQUE_ID": "api-conventions.html#x-opaque-id",
-"FORMING_SINGLE_NODE_CLUSTERS": "modules-discovery-bootstrap-cluster.html#modules-discovery-bootstrap-cluster-joining"
+"FORMING_SINGLE_NODE_CLUSTERS": "modules-discovery-bootstrap-cluster.html#modules-discovery-bootstrap-cluster-joining",
+"ALLOCATION_EXPLAIN_MAX_RETRY": "cluster-allocation-explain.html#_maximum_number_of_retries_exceeded"

Please add a fixed [[anchor-name]] in the docs rather than using the #_auto_generated one which might change inadvertently.


See #113667 which forbids this.
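
As a rough idea of what such a check might look like (this is not the implementation in #113667, which is not shown here), the sketch below flags link targets whose fragment starts with an underscore. It assumes auto-generated AsciiDoc anchors carry a leading `_`, as the `#_maximum_number_of_retries_exceeded` fragment in this PR does; the function name is hypothetical.

```python
def auto_generated_anchors(links: dict[str, str]) -> list[str]:
    """Return the keys whose target uses an underscore-prefixed fragment."""
    flagged = []
    for key, target in links.items():
        # Split "page.html#fragment" into page, separator, and fragment.
        _, sep, fragment = target.partition("#")
        if sep and fragment.startswith("_"):
            flagged.append(key)
    return flagged


# Entries copied from the diff hunk under review.
links = {
    "FLOOD_STAGE_WATERMARK": "fix-watermark-errors.html",
    "X_OPAQUE_ID": "api-conventions.html#x-opaque-id",
    "ALLOCATION_EXPLAIN_MAX_RETRY": "cluster-allocation-explain.html#_maximum_number_of_retries_exceeded",
}

print(auto_generated_anchors(links))  # ['ALLOCATION_EXPLAIN_MAX_RETRY']
```

Only the new entry is flagged, which matches the request to add a fixed `[[anchor-name]]` in the docs and point the link at that instead.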


@@ -195,17 +195,20 @@ primary shard that has reached the maximum number of allocation retry attempts.
 {
 "decider": "max_retry",
 "decision" : "NO",
-"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
+"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"

Until we solidify the URL, I vote for leaving it off. I believe the GitHub suggestion applying Dave's comment would then appear as:

Suggested change
-"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
+"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed&metric=none] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"

Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >docs General docs changes external-contributor Pull request authored by a developer outside the Elasticsearch team >non-issue Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Distributed Meta label for distributed team Team:Docs Meta label for docs team v9.0.0
4 participants