
Investigate race condition in ocrWF #1289

Closed

peetucket opened this issue Jun 13, 2024 · 6 comments

@peetucket (Member) commented Jun 13, 2024

ocrWF:start-ocr errored out a few times, indicating the object could not be opened during accessioning.

This could be because the last step of accessionWF is still considered to be running.

This shouldn't occur because of https://github.com/sul-dlss/dor-services-app/blob/main/app/services/workflow_state_service.rb#L50, but we should verify that this is working as expected and modify it if needed to avoid the race condition.

Error came from here: https://github.com/sul-dlss/dor-services-app/blob/main/app/services/version_service.rb#L164

Or maybe we add a retry in start-ocr around the open call (less desirable).

A retry has been implemented since the error occurred again. See #1321
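
For illustration, here is a rough sketch of what a bounded retry around the open call in start-ocr could look like. This is not the actual #1321 implementation; the dor-services-client calls, error class, helper name, retry count, and backoff are all assumptions.

```ruby
# Rough sketch only, not the #1321 implementation. Assumes dor-services-client's
# object_client.version.open and UnexpectedResponse error class, and that a
# `logger` is available in the worker. Helper name and backoff are illustrative.
MAX_OPEN_TRIES = 3

def open_object_with_retry(druid)
  tries = 0
  begin
    tries += 1
    object_client = Dor::Services::Client.object(druid)
    object_client.version.open(description: 'Start OCR')
  rescue Dor::Services::Client::UnexpectedResponse => e
    # Log every failure so we can see how often the race actually happens.
    logger.warn("start-ocr: open failed for #{druid} (attempt #{tries}): #{e.message}")
    raise if tries >= MAX_OPEN_TRIES

    sleep(5 * tries) # crude linear backoff to give end-accession time to finish
    retry
  end
end
```

Keeping the retry bounded and logging each attempt would also tell us how often the race actually fires.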

Some more backstory:

Error coming from https://github.com/sul-dlss/dor-services-client/blob/main/lib/dor/services/client.rb#L43-L55

There is still no explanation for why this happens occasionally. It happened for a very large object for which a number of steps ran slowly, but I don't think that should matter: https://argo-qa.stanford.edu/view/druid:xh838rk3862

I believe this is where the source exception comes from: dor-services-client calls DSA and asks it to open a version, and .open calls ensure_openable! first, which raises the exception:

https://github.com/sul-dlss/dor-services-app/blob/main/app/services/version_service.rb#L157-L158

This calls https://github.com/sul-dlss/dor-services-app/blob/main/app/services/workflow_state_service.rb#L55-L59, which looks for the 'accessioned' lifecycle via the workflow service, and which I am guessing comes back as missing at first. Since all of these calls go to workflow-server-rails, once a completed end-accession step is recorded there, the lifecycle should come back.
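
To make the chain easier to follow, here is a condensed paraphrase of what I believe the linked code is doing. This is not the actual DSA source; the method names follow the files linked above, but the signatures and error class are approximations.

```ruby
# Condensed paraphrase of the chain described above; not the actual DSA source.
# Method names follow the linked files; signatures and error class are approximations.
class VersionService
  class VersioningError < StandardError; end

  def initialize(druid:, workflow_state_service:)
    @druid = druid
    @workflow_state_service = workflow_state_service
  end

  # dor-services-client's open request lands here (roughly).
  def open(**_options)
    ensure_openable!
    # ... create and persist the newly opened version ...
  end

  private

  # Raises unless the workflow service already reports 'accessioned';
  # this is the exception start-ocr is seeing.
  def ensure_openable!
    raise VersioningError, 'Object cannot be opened' unless @workflow_state_service.accessioned?
  end
end

class WorkflowStateService
  def initialize(druid:, workflow_client:)
    @druid = druid
    @workflow_client = workflow_client
  end

  # True once workflow-server-rails records the 'accessioned' lifecycle, which
  # only appears after accessionWF:end-accession is marked completed.
  def accessioned?
    !@workflow_client.lifecycle(druid: @druid, milestone_name: 'accessioned').nil?
  end
end
```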

This looks potentially suspicious, but it has to do with Solr, which I don't think should matter in our case: https://github.com/sul-dlss/workflow-server-rails/blob/main/app/controllers/steps_controller.rb#L57-L61

@peetucket (Member Author) commented Jun 13, 2024

It is possible this is related to the issues fixed in #1290, but I am not 100% sure. It feels related, but I cannot logically explain why that change would account for the race condition. It is worthy of more thought, but we may want to see what happens with the fix above and whether this recurs before spending a lot of time chasing it down.

@peetucket (Member Author)

Potentially add a retry, but with extra logging so we can see when this happens?

@peetucket (Member Author)

Not having seen this happen recently, closing until it actually happens again.

peetucket reopened this Jun 28, 2024
peetucket removed their assignment Jun 28, 2024
@peetucket (Member Author) commented Jun 28, 2024

Analysis:

One solution is to move all of the cleanup work to a previous step in accessionWF, so that by the time we get to accessionWF:end-accession, no async jobs are being kicked off, and thus the step will complete normally and immediately.

We noticed there is already a previous step in accessionWF called reset-workspace that seems to be doing similar tasks, though in different background jobs. It would probably make sense to collapse and normalize this logic, perhaps even combining the work of https://github.com/sul-dlss/dor-services-app/blob/main/app/jobs/cleanup_job.rb and https://github.com/sul-dlss/dor-services-app/blob/main/app/jobs/reset_workspace_job.rb (which call out to https://github.com/sul-dlss/dor-services-app/blob/main/app/services/cleanup_service.rb and https://github.com/sul-dlss/dor-services-app/blob/main/app/services/) so that we don't have multiple jobs doing similar work in two different workflow steps.

We could end up with a single reset-workspace step that does all of this work in one job.
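
A rough sketch of the shape this could take; the class, service, and method names below are placeholders rather than the current DSA interfaces, and the workflow client factory is assumed from the existing app setup.

```ruby
# Rough sketch of one consolidated job behind a single reset-workspace step.
# Service/method names are placeholders, not the current DSA interfaces.
class ConsolidatedResetWorkspaceJob < ApplicationJob
  queue_as :default

  def perform(druid:, version:)
    # Do all workspace cleanup and reset work here, so that by the time
    # accessionWF:end-accession runs there are no async jobs left in flight.
    CleanupService.cleanup(druid)                                 # placeholder for cleanup_job's work
    ResetWorkspaceService.reset(druid: druid, version: version)   # placeholder for reset_workspace_job's work

    # Only mark the workflow step complete after both pieces of work finish.
    workflow_client.update_status(druid: druid,
                                  workflow: 'accessionWF',
                                  process: 'reset-workspace',
                                  status: 'completed')
  end

  private

  # Assumes the existing DSA workflow client factory.
  def workflow_client
    WorkflowClientFactory.build
  end
end
```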

@peetucket (Member Author)

Investigation complete.

Possible fixes in #1322 and sul-dlss/dor-services-app#5110

@jmartin-sul (Member)

> Investigation complete.
>
> Possible fixes in #1322 and sul-dlss/dor-services-app#5110

thanks for looking into this!

i finally just read over this explanation, nice trace through.

i added the #1322 and sul-dlss/dor-services-app#5110 to the backlog for prioritization as possible work during the speech-to-text WC, time permitting -- seems like it would be nice to get rid of a potential race condition for large files, and from the explanation, it sounds like a similar problem might crop up in the speechToTextWF, since we're taking a similar approach with it to ocrWF, and since the issue resides in accessionWF end-accession. happy to remove from the board or leave unprioritized if that assessment seems off-base, or if this seems otherwise not worth the effort right now.
