Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#101193 Preserve Step Info Across ILM Auto Retries #113187

Merged

Conversation

lukewhiting
Copy link
Contributor

@lukewhiting lukewhiting commented Sep 19, 2024

This PR does a few things:

  • Adds a new field to the Explain ILM lifecycle endpoint (<target>/_ilm/explain) called previous_step_info that contains the previous step runs step_info field contents
  • Logic in the Lifecycle step transition logic to copy the step info of a previous run into the new field
  • Docs for the new field

Bonus content includes:

  • Refactored and simplified test object mutation logic
  • Refactored test's to use recommended Clock interface for time dependent tests rather than custom supplier Reverted this as it's too risky / unrelated to PR

Commits are atomic so feel free to review commit by commit if you prefer reviewing smaller chunks

Copy link
Contributor

Documentation preview:

@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team v9.0.0 labels Sep 19, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@lukewhiting lukewhiting force-pushed the 101193-step-info-missing-exception-info branch from dca5328 to 940cee7 Compare September 19, 2024 14:30
@elasticsearchmachine
Copy link
Collaborator

Hi @lukewhiting, I've created a changelog YAML for you.

@lukewhiting lukewhiting force-pushed the 101193-step-info-missing-exception-info branch 2 times, most recently from 940cee7 to 14e3a7a Compare September 19, 2024 14:34
@elasticsearchmachine
Copy link
Collaborator

Hi @lukewhiting, I've updated the changelog YAML for you.

@lukewhiting lukewhiting linked an issue Sep 19, 2024 that may be closed by this pull request
@lukewhiting lukewhiting added v8.16.0 auto-backport Automatically create backport pull requests when merged labels Sep 20, 2024
@lukewhiting
Copy link
Contributor Author

Don't forget to close the issue in the PR description.

Yep :-) I have linked it to this PR so GitHub should auto close it once it's merged but I'll make a note to double check

@lukewhiting
Copy link
Contributor Author

lukewhiting commented Sep 24, 2024

Local functional testing complete.

Method:

  1. Index template created with policy that has a rollover step
  2. Index (ilm-error-test-000001) created to match template and dummy doc inserted
  3. Wait for ILM explain on index to confirm in check-rollover-ready state
  4. Create new index (ilm-error-test-000000) that also matches the template
  5. Manually create alias (ilm-error-test) for second index that matches rollover_alias of policy
  6. ILM explain for the first index enters Error state with step info index.lifecycle.rollover_alias [ilm-error-test] does not point to index [ilm-error-test-000001]
  7. Wait for auto retry
  8. ILM explain for ilm-error-test-000001 shows following output that matches expected with failed_step_retry_count incremented by 1 and previous_step_info populated with previous attempts step info.

Test Output:

{
  "indices": {
    "ilm-error-test-000001": {
      "index": "ilm-error-test-000001",
      "managed": true,
      "policy": "365-days-default",
      "index_creation_date_millis": 1727179631766,
      "time_since_index_creation": "19.92m",
      "lifecycle_date_millis": 1727179631766,
      "age": "19.92m",
      "phase": "hot",
      "phase_time_millis": 1727180790334,
      "action": "rollover",
      "action_time_millis": 1727179631820,
      "step": "check-rollover-ready",
      "step_time_millis": 1727180790334,
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 1,
      "previous_step_info": {
        "type": "illegal_argument_exception",
        "reason": "index.lifecycle.rollover_alias [ilm-error-test] does not point to index [ilm-error-test-000001]"
      },
      "phase_execution": {
        "policy": "365-days-default",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "max_primary_shard_docs": 200000000,
              "min_docs": 1,
              "max_primary_shard_size": "50gb"
            }
          }
        },
        "version": 1,
        "modified_date_in_millis": 1727179591769
      }
    }
  }
}

@nielsbauman
Copy link
Contributor

@lukewhiting That functional test looks great! Well done! I think it would also be worthwhile to ensure the previous_step_info isn't cleared after the second retry, but we can do that in the integration test that we'll be writing.

@gmarouli
Copy link
Contributor

@lukewhiting nice work coming up with the test scenario!

Local functional testing complete.

Method:

  1. Index template created with policy that has a rollover step
  2. Index (ilm-error-test-000001) created to match template and dummy doc inserted
  3. Wait for ILM explain on index to confirm in check-rollover-ready state
  4. Create new index (ilm-error-test-000000) that also matches the template
  5. Manually create alias (ilm-error-test) for second index that matches rollover_alias of policy
  6. ILM explain for the first index enters Error state with step info index.lifecycle.rollover_alias [ilm-error-test] does not point to index [ilm-error-test-000001]
  7. Wait for auto retry
  8. ILM explain for ilm-error-test-000001 shows following output that matches expected with failed_step_retry_count incremented by 1 and previous_step_info populated with previous attempts step info.

Test Output:

{
  "indices": {
    "ilm-error-test-000001": {
      "index": "ilm-error-test-000001",
      "managed": true,
      "policy": "365-days-default",
      "index_creation_date_millis": 1727179631766,
      "time_since_index_creation": "19.92m",
      "lifecycle_date_millis": 1727179631766,
      "age": "19.92m",
      "phase": "hot",
      "phase_time_millis": 1727180790334,
      "action": "rollover",
      "action_time_millis": 1727179631820,
      "step": "check-rollover-ready",
      "step_time_millis": 1727180790334,
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 1,
      "previous_step_info": {
        "type": "illegal_argument_exception",
        "reason": "index.lifecycle.rollover_alias [ilm-error-test] does not point to index [ilm-error-test-000001]"
      },
      "phase_execution": {
        "policy": "365-days-default",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "max_primary_shard_docs": 200000000,
              "min_docs": 1,
              "max_primary_shard_size": "50gb"
            }
          }
        },
        "version": 1,
        "modified_date_in_millis": 1727179591769
      }
    }
  }
}

I think this could be simplified in an java rest test for example:

  1. Decrease ILM polling frequency to reduce the test duration
  2. Index template created without the rollover alias but with a regular alias and with policy that has a rollover step with max_doc: 1
  3. Index (ilm-error-test-000001) created to match template and dummy doc inserted
  4. Wait for ILM explain on index to confirm in check-rollover-ready state
  5. ILM explain will throw the error setting [index.lifecycle.rollover_alias] for index
  6. Wait for auto retry
  7. ILM explain for ilm-error-test-000001 shows following output that matches expected with failed_step_retry_count incremented by 1 and previous_step_info populated with previous attempts step info.

It's very similar with what you have just a few less steps. Let me know if you want to work on it together.

Nit: I think the test-update-serverless can be removed because ILM is not used in serverless. That should speed up ci as well.

@lukewhiting
Copy link
Contributor Author

@nielsbauman @gmarouli Integration test for new field added based on the above manual testing: 91dac7e

Copy link
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a minor comment but LGTM :)

@lukewhiting lukewhiting added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 30, 2024
@lukewhiting lukewhiting merged commit b1b249d into elastic:main Sep 30, 2024
15 of 16 checks passed
@lukewhiting lukewhiting deleted the 101193-step-info-missing-exception-info branch September 30, 2024 10:44
@lukewhiting lukewhiting restored the 101193-step-info-missing-exception-info branch September 30, 2024 10:45
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

@lukewhiting lukewhiting deleted the 101193-step-info-missing-exception-info branch September 30, 2024 10:45
lukewhiting added a commit to lukewhiting/elasticsearch that referenced this pull request Sep 30, 2024
…3187)

* Add new Previous Step Info field to LifecycleExecutionState

* Add new field to IndexLifecycleExplainResponse

* Add new field to TransportExplainLifecycleAction

* Add logic to IndexLifecycleTransition to keep previous setp info

* Switch tests to use Java standard Clock class

for any time based testing, this is the recommended method

* Fix tests for new field

Also refactor tests to newer style

* Add test to ensure step info is preserved

Across auto retries

* Add docs for new field

* Changelog Entry

* Update docs/changelog/113187.yaml

* Revert "Switch tests to use Java standard Clock class"

This reverts commit 241074c.

* PR Changes

* PR Changes - Improve docs wording

Co-authored-by: Mary Gouseti <[email protected]>

* Integration test for new ILM explain field

* Use ROOT locale instead of default toLowerCase

* PR Changes - Switch to block strings

* Remove forbidden API usage

---------

Co-authored-by: Mary Gouseti <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Sep 30, 2024
* Add new Previous Step Info field to LifecycleExecutionState

* Add new field to IndexLifecycleExplainResponse

* Add new field to TransportExplainLifecycleAction

* Add logic to IndexLifecycleTransition to keep previous setp info

* Switch tests to use Java standard Clock class

for any time based testing, this is the recommended method

* Fix tests for new field

Also refactor tests to newer style

* Add test to ensure step info is preserved

Across auto retries

* Add docs for new field

* Changelog Entry

* Update docs/changelog/113187.yaml

* Revert "Switch tests to use Java standard Clock class"

This reverts commit 241074c.

* PR Changes

* PR Changes - Improve docs wording



* Integration test for new ILM explain field

* Use ROOT locale instead of default toLowerCase

* PR Changes - Switch to block strings

* Remove forbidden API usage

---------

Co-authored-by: Mary Gouseti <[email protected]>
matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this pull request Oct 4, 2024
…3187)

* Add new Previous Step Info field to LifecycleExecutionState

* Add new field to IndexLifecycleExplainResponse

* Add new field to TransportExplainLifecycleAction

* Add logic to IndexLifecycleTransition to keep previous setp info

* Switch tests to use Java standard Clock class

for any time based testing, this is the recommended method

* Fix tests for new field

Also refactor tests to newer style

* Add test to ensure step info is preserved

Across auto retries

* Add docs for new field

* Changelog Entry

* Update docs/changelog/113187.yaml

* Revert "Switch tests to use Java standard Clock class"

This reverts commit 241074c.

* PR Changes

* PR Changes - Improve docs wording

Co-authored-by: Mary Gouseti <[email protected]>

* Integration test for new ILM explain field

* Use ROOT locale instead of default toLowerCase

* PR Changes - Switch to block strings

* Remove forbidden API usage

---------

Co-authored-by: Mary Gouseti <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement Team:Data Management Meta label for data/management team v8.16.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ILM] Still show step_info for is_auto_retryable_error
4 participants