
Investigate race condition in ocrWF #1289

Closed

peetucket opened this issue Jun 13, 2024 · 6 comments

@peetucket (Member) commented Jun 13, 2024

ocrWF:start-ocr errored out a few times, indicating the object could not be opened during accessioning.

This could be because the last step of accessionWF is still considered to be running.

This shouldn't occur because of https://github.com/sul-dlss/dor-services-app/blob/main/app/services/workflow_state_service.rb#L50, but we should verify that this is working as expected and modify it if needed to avoid the race condition.

Error came from here: https://github.com/sul-dlss/dor-services-app/blob/main/app/services/version_service.rb#L164

Or maybe we add a retry in start-ocr around the open call (less desirable).

A retry has been implemented since the error occurred again. See #1321
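
For illustration, here is a rough sketch of what a bounded retry around the open call in start-ocr could look like. This is not the actual #1321 implementation; the dor-services-client calls, error class, helper name, retry count, and backoff are all assumptions.

```ruby
# Rough sketch only, not the #1321 implementation. Assumes dor-services-client's
# object_client.version.open and UnexpectedResponse error class, and that a
# `logger` is available in the worker. Helper name and backoff are illustrative.
MAX_OPEN_TRIES = 3

def open_object_with_retry(druid)
  tries = 0
  begin
    tries += 1
    object_client = Dor::Services::Client.object(druid)
    object_client.version.open(description: 'Start OCR')
  rescue Dor::Services::Client::UnexpectedResponse => e
    # Log every failure so we can see how often the race actually happens.
    logger.warn("start-ocr: open failed for #{druid} (attempt #{tries}): #{e.message}")
    raise if tries >= MAX_OPEN_TRIES

    sleep(5 * tries) # crude linear backoff to give end-accession time to finish
    retry
  end
end
```

Keeping the retry bounded and logging each attempt would also tell us how often the race actually fires.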

Some more backstory:

Error coming from https://github.com/sul-dlss/dor-services-client/blob/main/lib/dor/services/client.rb#L43-L55

There is still no explanation for why this happens occasionally. It happened for a very large object for which a number of steps ran slowly, but I don't think that should matter: https://argo-qa.stanford.edu/view/druid:xh838rk3862

I believe this is where the source exception comes from: dor-services-client calls DSA and asks it to open a version, and .open calls ensure_openable! first, which raises the exception:

https://github.com/sul-dlss/dor-services-app/blob/main/app/services/version_service.rb#L157-L158

This calls https://github.com/sul-dlss/dor-services-app/blob/main/app/services/workflow_state_service.rb#L55-L59, which looks for the 'accessioned' lifecycle via the workflow service, and which I am guessing comes back as missing at first. Since all of these calls go to workflow-server-rails, once a completed end-accession step is recorded there, the lifecycle should come back.
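
To make the chain easier to follow, here is a condensed paraphrase of what I believe the linked code is doing. This is not the actual DSA source; the method names follow the files linked above, but the signatures and error class are approximations.

```ruby
# Condensed paraphrase of the chain described above; not the actual DSA source.
# Method names follow the linked files; signatures and error class are approximations.
class VersionService
  class VersioningError < StandardError; end

  def initialize(druid:, workflow_state_service:)
    @druid = druid
    @workflow_state_service = workflow_state_service
  end

  # dor-services-client's open request lands here (roughly).
  def open(**_options)
    ensure_openable!
    # ... create and persist the newly opened version ...
  end

  private

  # Raises unless the workflow service already reports 'accessioned';
  # this is the exception start-ocr is seeing.
  def ensure_openable!
    raise VersioningError, 'Object cannot be opened' unless @workflow_state_service.accessioned?
  end
end

class WorkflowStateService
  def initialize(druid:, workflow_client:)
    @druid = druid
    @workflow_client = workflow_client
  end

  # True once workflow-server-rails records the 'accessioned' lifecycle, which
  # only appears after accessionWF:end-accession is marked completed.
  def accessioned?
    !@workflow_client.lifecycle(druid: @druid, milestone_name: 'accessioned').nil?
  end
end
```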

This looks potentially suspicious, but it has to do with Solr, which I don't think should matter in our case: https://github.com/sul-dlss/workflow-server-rails/blob/main/app/controllers/steps_controller.rb#L57-L61

@peetucket (Member Author) commented Jun 13, 2024

It is possible this is related to the issues fixed in #1290, but I am not 100% sure. It feels related, but I cannot logically explain why that change would account for the race condition. It is worthy of more thought, but we may want to see what happens with the fix above and whether this recurs before spending a lot of time chasing it down.

@peetucket (Member Author)

Potentially add a retry, but with extra logging so we can see when this happens?

@peetucket (Member Author)

Not having seen this happen recently, closing until it actually happens again.

peetucket reopened this Jun 28, 2024
peetucket removed their assignment Jun 28, 2024
@peetucket (Member Author) commented Jun 28, 2024

Analysis:

One solution is to move all of the cleanup work to a previous step in accessionWF, so that by the time we get to accessionWF:end-accession, no async jobs are being kicked off, and thus the step will complete normally and immediately.

We noticed there is already a previous step in accessionWF called reset-workspace that seems to be doing similar tasks, though in different background jobs. It would probably make sense to collapse and normalize this logic, perhaps even combining the work of https://github.com/sul-dlss/dor-services-app/blob/main/app/jobs/cleanup_job.rb and https://github.com/sul-dlss/dor-services-app/blob/main/app/jobs/reset_workspace_job.rb (which call out to https://github.com/sul-dlss/dor-services-app/blob/main/app/services/cleanup_service.rb and https://github.com/sul-dlss/dor-services-app/blob/main/app/services/) so that we don't have multiple jobs doing similar work in two different workflow steps.

We could end up with a single reset-workspace step that does all of this work in one job.
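
A rough sketch of the shape this could take; the class, service, and method names below are placeholders rather than the current DSA interfaces, and the workflow client factory is assumed from the existing app setup.

```ruby
# Rough sketch of one consolidated job behind a single reset-workspace step.
# Service/method names are placeholders, not the current DSA interfaces.
class ConsolidatedResetWorkspaceJob < ApplicationJob
  queue_as :default

  def perform(druid:, version:)
    # Do all workspace cleanup and reset work here, so that by the time
    # accessionWF:end-accession runs there are no async jobs left in flight.
    CleanupService.cleanup(druid)                                 # placeholder for cleanup_job's work
    ResetWorkspaceService.reset(druid: druid, version: version)   # placeholder for reset_workspace_job's work

    # Only mark the workflow step complete after both pieces of work finish.
    workflow_client.update_status(druid: druid,
                                  workflow: 'accessionWF',
                                  process: 'reset-workspace',
                                  status: 'completed')
  end

  private

  # Assumes the existing DSA workflow client factory.
  def workflow_client
    WorkflowClientFactory.build
  end
end
```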

@peetucket (Member Author)

Investigation complete.

Possible fixes in #1322 and sul-dlss/dor-services-app#5110

@jmartin-sul (Member)

> Investigation complete.
>
> Possible fixes in #1322 and sul-dlss/dor-services-app#5110

thanks for looking into this!

i finally just read over this explanation, nice trace through.

i added the #1322 and sul-dlss/dor-services-app#5110 to the backlog for prioritization as possible work during the speech-to-text WC, time permitting -- seems like it would be nice to get rid of a potential race condition for large files, and from the explanation, it sounds like a similar problem might crop up in the speechToTextWF, since we're taking a similar approach with it to ocrWF, and since the issue resides in accessionWF end-accession. happy to remove from the board or leave unprioritized if that assessment seems off-base, or if this seems otherwise not worth the effort right now.
