Add link to MAX_RETRY allocation explain message #113657
…c page to further explain message.
Some small comments. Also, you need to run `./gradlew precommit` and fix up the issues.
@@ -195,17 +195,20 @@ primary shard that has reached the maximum number of allocation retry attempts.
{
"decider": "max_retry",
"decision" : "NO",
- "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
+ "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
The API call in the message is actually `POST /_cluster/reroute?retry_failed&metric=none` - see `org.elasticsearch.cluster.routing.allocation.decider.MaxRetryAllocationDecider#RETRY_FAILED_API`.
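To make the difference from the message text concrete, here is a minimal Python sketch that builds (but does not send) the request quoted in the comment above. The base URL `http://localhost:9200` and the helper name `build_retry_failed_request` are assumptions for illustration only; they do not come from the PR.

```python
from urllib.request import Request

# Assumption: a local, unsecured cluster. Adjust the base URL for a real deployment.
BASE_URL = "http://localhost:9200"


def build_retry_failed_request(base_url: str = BASE_URL) -> Request:
    """Build, without sending, the retry request quoted in the review comment."""
    # retry_failed is passed as a bare flag and metric=none is included,
    # exactly as in the call quoted by the reviewer.
    url = f"{base_url}/_cluster/reroute?retry_failed&metric=none"
    # The method is POST; the PR description notes that users who tried GET
    # were confused when it did not work.
    return Request(url, method="POST")


request = build_retry_failed_request()
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would additionally need whatever authentication and TLS settings the cluster requires.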
If no other `no` decisions are present, then the transient allocation issue
that caused these failures has most likely been resolved, and you can use the
<<cluster-reroute,the cluster reroute API>> to retry allocation.
I think this will be confusing: there are normally always some `no` decisions, e.g. for nodes in the wrong data tier.
Also, I'd rather we used the imperative voice: "use the reroute API" rather than just suggesting "you can ...".
Finally, there's a duplicate "the" (one inside the link and one outside).
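The point about other `no` decisions can be illustrated with a small Python sketch that lists, per node, which deciders returned `NO` in an allocation explain response. The sample response below is made up for illustration (node names and the second decider are assumptions), and only the `node_allocation_decisions`, `node_name`, `deciders`, `decider`, and `decision` fields are read.

```python
from typing import Any


def summarize_no_deciders(explain: dict[str, Any]) -> dict[str, list[str]]:
    """Map each node name to the deciders that returned NO for it."""
    summary: dict[str, list[str]] = {}
    for node in explain.get("node_allocation_decisions", []):
        blockers = [
            d["decider"]
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        ]
        summary[node.get("node_name", "unknown")] = blockers
    return summary


def blocked_only_by_max_retry(explain: dict[str, Any]) -> list[str]:
    """Return the nodes whose only NO decision comes from max_retry."""
    return [
        name
        for name, blockers in summarize_no_deciders(explain).items()
        if blockers == ["max_retry"]
    ]


# Illustrative data only: two hypothetical nodes, one also rejected for its tier.
sample = {
    "node_allocation_decisions": [
        {"node_name": "node-hot", "deciders": [
            {"decider": "max_retry", "decision": "NO"},
        ]},
        {"node_name": "node-cold", "deciders": [
            {"decider": "max_retry", "decision": "NO"},
            {"decider": "data_tier", "decision": "NO"},
        ]},
    ]
}

print(blocked_only_by_max_retry(sample))
```

In this made-up response, a rule of the form "no other `no` decisions are present" would never hold cluster-wide, because the second node carries an unrelated `NO`, which is the confusion the comment describes.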
Brainstorming, I might say:
Elasticsearch queues shard allocation retries in batches. If there are long-running or numerous shard recoveries occurring within the cluster, this process may time out for some shards, resulting in `MAX_RETRY`. This surfaces infrequently, but it is expected behavior that prevents infinite retries, which may impact cluster performance. When encountered, run <<cluster-reroute,the cluster reroute API>> to retry allocation.
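As a rough model of the bounded-retry idea in the suggested wording, the Python sketch below returns `NO` once the failed-attempt count reaches the limit. This is a simplification under stated assumptions, not the actual `MaxRetryAllocationDecider` logic: the function name is hypothetical, and the limit of 5 is taken from the `[5]` in the quoted explanation message.

```python
# Assumption: the limit mirrors the "[5]" shown in the quoted explanation string.
DEFAULT_MAX_RETRIES = 5


def max_retry_decision(failed_attempts: int,
                       max_retries: int = DEFAULT_MAX_RETRIES) -> str:
    """Simplified model: stop allocating once the retry budget is used up."""
    if failed_attempts < 0 or max_retries < 0:
        raise ValueError("counts must be non-negative")
    # Bounding the retries is what prevents a shard from being retried forever.
    return "NO" if failed_attempts >= max_retries else "YES"


for attempts in (0, 4, 5):
    print(attempts, max_retry_decision(attempts))
```

The quoted message, with `failed_attempts[5]` against a maximum of `[5]`, corresponds to the `NO` branch of this model.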
@@ -43,5 +43,6 @@
"MAX_SHARDS_PER_NODE": "size-your-shards.html#troubleshooting-max-shards-open",
"FLOOD_STAGE_WATERMARK": "fix-watermark-errors.html",
"X_OPAQUE_ID": "api-conventions.html#x-opaque-id",
- "FORMING_SINGLE_NODE_CLUSTERS": "modules-discovery-bootstrap-cluster.html#modules-discovery-bootstrap-cluster-joining"
+ "FORMING_SINGLE_NODE_CLUSTERS": "modules-discovery-bootstrap-cluster.html#modules-discovery-bootstrap-cluster-joining",
+ "ALLOCATION_EXPLAIN_MAX_RETRY": "cluster-allocation-explain.html#_maximum_number_of_retries_exceeded"
Please add a fixed `[[anchor-name]]` in the docs rather than using the `#_auto_generated` one, which might change inadvertently.
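A check along these lines can be sketched in a few lines of Python: flag any link-map entry whose fragment starts with `_`, the prefix seen on the auto-generated anchor in this diff. This is only an illustration of the idea, with a hypothetical function name; it is not the check from the repository, and the `_` prefix rule is an assumption based on the one example above.

```python
def auto_generated_anchors(link_map: dict[str, str]) -> list[str]:
    """Return the keys whose target uses a fragment that starts with '_'."""
    flagged = []
    for key, target in link_map.items():
        # partition() yields an empty separator when the target has no fragment.
        _, sep, anchor = target.partition("#")
        if sep and anchor.startswith("_"):
            flagged.append(key)
    return flagged


# Entries copied from the diff hunk in this thread.
links = {
    "FLOOD_STAGE_WATERMARK": "fix-watermark-errors.html",
    "X_OPAQUE_ID": "api-conventions.html#x-opaque-id",
    "ALLOCATION_EXPLAIN_MAX_RETRY":
        "cluster-allocation-explain.html#_maximum_number_of_retries_exceeded",
}

print(auto_generated_anchors(links))
```

Only the new `ALLOCATION_EXPLAIN_MAX_RETRY` entry is flagged; the other two either have no fragment or use an explicitly named anchor.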
See #113667 which forbids this.
@@ -195,17 +195,20 @@ primary shard that has reached the maximum number of allocation retry attempts.
{
"decider": "max_retry",
"decision" : "NO",
- "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
+ "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
Until we solidify the URL, I vote for leaving it off. The GitHub suggestion to apply Dave's comment would then, I believe, appear as:
- "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed=true] to retry, and for more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html#_maximum_number_of_retries_exceeded, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
+ "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed&metric=none] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-07-30T21:04:12.166Z], failed_attempts[5], failed_nodes[[mEKjwwzLT1yJVb8UxT6anw]], delayed=false, details[failed shard on node [mEKjwwzLT1yJVb8UxT6anw]: failed recovery, failure RecoveryFailedException], allocation_status[deciders_no]]]"
Adds a "maximum number of retries exceeded" reference link to the `max_retry` allocation explanation string.
Adds more detail to the documentation page describing that this was done to protect the cluster, but the real cause of the issue may now be gone and so allocation can be retried.
Also adds `POST` to the example `_cluster/reroute` API in the explanation because some customers would use `GET` and be confused why it didn't work.