Add link to MAX_RETRY allocation explain message #113657

matthewabbott · 2024-09-27T01:43:22Z

Adds a reference link to the "maximum number of retries exceeded" documentation section to the max_retry allocation explanation string.

Adds more detail to the documentation page, explaining that allocation retries are stopped to protect the cluster, but that the real cause of the issue may since have gone away, so allocation can be retried.

Also adds POST to the example _cluster/reroute API call in the explanation, because some customers used GET and were confused when it didn't work.

github-actions · 2024-09-27T01:43:31Z

Documentation preview:

✨ Changed pages

DaveCTurner

Some small comments. Also you need to run ./gradlew precommit and fix up the issues.

DaveCTurner · 2024-09-27T07:37:10Z

docs/reference/cluster/allocation-explain.asciidoc

@@ -195,17 +195,20 @@ primary shard that has reached the maximum number of allocation retry attempts.
        {
          "decider": "max_retry",
          "decision" : "NO",
-          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
+          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"


The API call in the message is actually POST /_cluster/reroute?retry_failed&metric=none - see org.elasticsearch.cluster.routing.allocation.decider.MaxRetryAllocationDecider#RETRY_FAILED_API.
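For illustration, here is a minimal sketch of issuing that call as a POST rather than a GET. It only builds the request; the base URL `http://localhost:9200` and the helper name `build_retry_failed_request` are assumptions for the example, not part of this PR.

```python
from urllib.request import Request


def build_retry_failed_request(base_url: str) -> Request:
    """Build (but do not send) the reroute retry call quoted in the comment above.

    base_url is a hypothetical cluster address supplied by the caller.
    """
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed&metric=none"
    # Set the method explicitly: without it, a body-less Request defaults to GET,
    # which is the mistake the PR description says confused some users.
    return Request(url, method="POST")


req = build_retry_failed_request("http://localhost:9200")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; a secured cluster would additionally need authentication headers, which this sketch leaves out.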

DaveCTurner · 2024-09-27T07:39:35Z

docs/reference/cluster/allocation-explain.asciidoc

+If no other `no` decisions are present, then the transient allocation issue
+that caused these failures has most likely been resolved, and you can use the
+<<cluster-reroute,the cluster reroute API>> to retry allocation.


I think this will be confusing: there are almost always some no decisions, e.g. for nodes in the wrong data tier.

Also I'd rather we used the imperative voice: "use the reroute API" rather than just suggesting "you can ...".

Finally, there's a duplicate "the" (one inside the link and one outside).

Brainstorming, I might say

Elasticsearch queues shard allocation retries in batches. If there are long-running shard recoveries, or a large number of them, occurring within the cluster, this process may time out for some shards, resulting in MAX_RETRY. This surfaces infrequently, but it is expected: the limit prevents infinite retries, which may impact cluster performance. When encountered, run <<cluster-reroute,the cluster reroute API>> to retry allocation.
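To make the concern about other no decisions concrete, here is a rough sketch that separates max_retry NO decisions from the rest in an allocation explain response. It assumes the response carries a `node_allocation_decisions` list whose entries each have a `deciders` list shaped like the snippet in the diff; the sample data, the node names, and the `data_tier` decider entry are made up for illustration.

```python
def split_no_decisions(explain: dict):
    """Return (max_retry_no, other_no) decider entries from an explain response.

    Assumes explain["node_allocation_decisions"][i]["deciders"] is a list of
    objects with "decider", "decision" and "explanation" keys.
    """
    max_retry_no, other_no = [], []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") != "NO":
                continue
            entry = {"node": node.get("node_name"), **decider}
            if decider.get("decider") == "max_retry":
                max_retry_no.append(entry)
            else:
                other_no.append(entry)
    return max_retry_no, other_no


# Hypothetical response fragment: one node blocked by max_retry,
# another blocked for an unrelated reason (wrong data tier).
sample = {
    "node_allocation_decisions": [
        {"node_name": "node-a", "deciders": [
            {"decider": "max_retry", "decision": "NO", "explanation": "..."}]},
        {"node_name": "node-b", "deciders": [
            {"decider": "data_tier", "decision": "NO", "explanation": "..."}]},
    ]
}

max_retry_no, other_no = split_no_decisions(sample)
print(len(max_retry_no), len(other_no))
```

In this sample the second list is non-empty even though the shard could be allocated to node-a after a retry, which is why "no other no decisions" is a weak signal.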

DaveCTurner · 2024-09-27T07:40:29Z

server/src/main/resources/org/elasticsearch/common/reference-docs-links.json

@@ -43,5 +43,6 @@
  "MAX_SHARDS_PER_NODE": "size-your-shards.html#troubleshooting-max-shards-open",
  "FLOOD_STAGE_WATERMARK": "fix-watermark-errors.html",
  "X_OPAQUE_ID": "api-conventions.html#x-opaque-id",
-  "FORMING_SINGLE_NODE_CLUSTERS": "modules-discovery-bootstrap-cluster.html#modules-discovery-bootstrap-cluster-joining"
+  "FORMING_SINGLE_NODE_CLUSTERS": "modules-discovery-bootstrap-cluster.html#modules-discovery-bootstrap-cluster-joining",
+  "ALLOCATION_EXPLAIN_MAX_RETRY": "cluster-allocation-explain.html#_maximum_number_of_retries_exceeded"


Please add a fixed [[anchor-name]] in the docs rather than using the #_auto_generated one which might change inadvertently.

See #113667 which forbids this.

X-post example.
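As a rough sketch of the kind of check being asked for, the snippet below flags entries in a links mapping whose fragment starts with an underscore, which is how the auto-generated anchor in this diff looks. This is not the check added in #113667; the function name and the underscore heuristic are assumptions for illustration.

```python
import json


def find_auto_generated_anchors(links: dict) -> dict:
    """Return the entries whose URL fragment begins with '_'.

    Heuristic assumption: auto-generated anchors start with an underscore,
    as in '#_maximum_number_of_retries_exceeded'.
    """
    flagged = {}
    for key, target in links.items():
        _, sep, fragment = target.partition("#")
        if sep and fragment.startswith("_"):
            flagged[key] = target
    return flagged


# Entries copied from the diff hunk above.
links = json.loads("""
{
  "FLOOD_STAGE_WATERMARK": "fix-watermark-errors.html",
  "X_OPAQUE_ID": "api-conventions.html#x-opaque-id",
  "ALLOCATION_EXPLAIN_MAX_RETRY": "cluster-allocation-explain.html#_maximum_number_of_retries_exceeded"
}
""")

print(find_auto_generated_anchors(links))
```

Only the new ALLOCATION_EXPLAIN_MAX_RETRY entry is flagged; the entries with explicit anchors or no fragment pass.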

stefnestor · 2024-09-27T18:22:10Z

docs/reference/cluster/allocation-explain.asciidoc

@@ -195,17 +195,20 @@ primary shard that has reached the maximum number of allocation retry attempts.
        {
          "decider": "max_retry",
          "decision" : "NO",
-          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
+          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"


Until we solidify the URL, I vote for leaving it off. The GitHub suggestion to apply Dave's comment would then, I believe, appear as:

Suggested change

-          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
+          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed&metric=none] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"

matthewabbott added 2 commits September 26, 2024 18:33

Add link to MAX_RETRY allocation explain message. Modify max_retry do…

54049ab

…c page to further explain message.

make format for consistent with the rest of the doc

51b760e

elasticsearchmachine added the v9.0.0 and external-contributor labels Sep 27, 2024

DaveCTurner reviewed Sep 27, 2024

stefnestor reviewed Sep 27, 2024