-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate race condition in ocrWF #1289
Comments
It is possible this is related to the issues fixed in #1290 but not 100% sure. It feels related but I cannot logically explain why that change would explain the race condition. Worthy of more thought but may want to see what happens with that fix first above first and see if this recurs before spending a lot of time chasing this down. |
Potentially add retry but with extra logging to see when this happens? |
Not having seen this happen recently, closing until it actually happens again. |
Analysis:
One solution is to move all of the cleanup work to a previous step in accessionWF, so by the time we get to accessionWF:end-accession, there are no async jobs being called, and thus the step will complete normally and immediately. We noticed there is already a previous step in accessionWF called reset-workspace, that seems to be doing similar tasks, though in different background jobs. It would probably make sense to collapse and normalize this logic, perhaps even combining the work of https://github.com/sul-dlss/dor-services-app/blob/main/app/jobs/cleanup_job.rb and https://github.com/sul-dlss/dor-services-app/blob/main/app/jobs/reset_workspace_job.rb (which call out to https://github.com/sul-dlss/dor-services-app/blob/main/app/services/cleanup_service.rb and https://github.com/sul-dlss/dor-services-app/blob/main/app/services/) so that we don't have multiple jobs doing similar work in two different workflow steps. We could end up with a single reset-workspace step that does all of this work in a single job. |
Investigation complete. Possible fixes in #1322 and sul-dlss/dor-services-app#5110 |
thanks for looking into this! i finally just read over this explanation, nice trace through. i added the #1322 and sul-dlss/dor-services-app#5110 to the backlog for prioritization as possible work during the speech-to-text WC, time permitting -- seems like it would be nice to get rid of a potential race condition for large files, and from the explanation, it sounds liike a similar problem might crop up in the |
ocrWF:start-ocr errored out a few times indicating the object could not be opened during accessioning
This could be because the last step of accessionWF still considered as running.
This shouldn't occur because of https://github.com/sul-dlss/dor-services-app/blob/main/app/services/workflow_state_service.rb#L50 but we should verify this working as expected and modify if needed to avoid race condition.
Error came from here: https://github.com/sul-dlss/dor-services-app/blob/main/app/services/version_service.rb#L164
Or maybe we a retry in start-ocr around the open call (less desirable)
Retry implemented as it occurred again. See #1321
Some more backstory:
Error coming from https://github.com/sul-dlss/dor-services-client/blob/main/lib/dor/services/client.rb#L43-L55
Still no explanation for why this happens sometimes. This happened for a very large object for which a number of steps ran slowly, but that shouldn't matter I don't think: https://argo-qa.stanford.edu/view/druid:xh838rk3862
This is where the source exception is coming from I believe, basically dor-services-client calls DSA and asks it to open, and then .open calls ensure_openable! first, which throws the exception:
https://github.com/sul-dlss/dor-services-app/blob/main/app/services/version_service.rb#L157-L158
This calls https://github.com/sul-dlss/dor-services-app/blob/main/app/services/workflow_state_service.rb#L55-L59 which looks for the lifecycle of accessioned via the workflow service, which I guess is coming back as missing at first. Since all of the calls are to workflow-server-rails, once a completed step of end-accession is there, it should come back.
This looks potentially suspicious, but has to do with solr which I don't think should matter in our case: https://github.com/sul-dlss/workflow-server-rails/blob/main/app/controllers/steps_controller.rb#L57-L61
The text was updated successfully, but these errors were encountered: