adjust job-id params and namespace for sending files to s3 #1373

peetucket · 2024-09-27T19:07:56Z

Why was this change made? 🤔

Fixes #1371 - job id is now druid-version and so is the place where we put the files in S3 bucket

How was this change tested? 🤨

Spec

peetucket · 2024-09-27T19:08:28Z

lib/dor/text_extraction/speech_to_text.rb

        @cocina_object = cocina_object
        @workflow_context = workflow_context
        @bare_druid = cocina_object.externalIdentifier.delete_prefix('druid:')
-        @logger = logger || Logger.new($stdout)
      end


this class never uses the logger, so remove for now unless we specifically need it later (it came as copypasta from the similar Ocr class)

peetucket · 2024-09-27T19:09:00Z

lib/robots/dor_repo/speech_to_text/fetch_files.rb

          end
        end

        private

        def sttable_filenames
-          Dor::TextExtraction::SpeechToText.new(cocina_object:, workflow_context: workflow.context).filenames_to_stt
+          Dor::TextExtraction::SpeechToText.new(cocina_object:).filenames_to_stt


the filenames_to_stt method doesn't need workflow_context so don't bother passing it in

peetucket · 2024-09-27T20:04:19Z

lib/robots/dor_repo/speech_to_text/stt_create.rb

@@ -27,8 +27,7 @@ def perform_work
        def send_sqs_message
          message_body = {
            id: job_id,
-            druid:,
-            media:
+            druid:


leaving the druid in the job message (even though it happens to be included in the job_id now) ... maybe useful for logging or other purposes to have it distinct

jmartin-sul

assuming i didn't miss anything about the job JSON on the speech-to-text python worker side intentionally changing, i think this drops a field that's actually needed. if the same change is in the works on the python side, will approve.

jmartin-sul · 2024-10-01T19:17:12Z

lib/robots/dor_repo/speech_to_text/stt_create.rb

@@ -27,8 +27,7 @@ def perform_work
        def send_sqs_message
          message_body = {
            id: job_id,
-            druid:,
-            media:


so this got removed in the documentation, but afaict from the code and from testing locally yesterday evening, the media field is actually still required in the message body: https://github.com/sul-dlss/speech-to-text/blob/1fd0eb35b721349c75c75be5151dffc8564d4084/speech_to_text.py#L80-L85

i assume the removal from STT documentation was a mistake or a leftover from a discarded development path with the code. so adding that field back was one of the little touchups i was going to make (to the speech-to-text README) in the small follow-on PR i mentioned this morning at standup.

(or, maybe you and ed have discussed and this lines up with a coming change in the speech-to-text python worker?)

Ok, I hadn't further discussed this was @edsu - was operating under the assumption that the whipser container would look for all files that matched the key for the message sent.

BUT... we could also continue to send the media key here in the message body, and it could basically do this for the container (i.e. look at what is in S3 for that job_id and then send all of those files in the message body)

yeah, i could see either approach, and don't think i have a well thought out preference. seems like an easy adjustment either way... i might bias towards assuming for now that the current worker code is right... i just put up a small PR to that effect in speech-to-text: sul-dlss/speech-to-text#22

unless @edsu has this change to just list bucket contents in the works already?

i think we discussed this at the end of last week, but i can't remember if we settled on an approach or which way we decided if so.

i have to grab lunch after this DLSS all staff that's happening right now, and it'll be 5 eastern time once that's over anyway... maybe we can confer and decide (or remind me of last week's decision 😅) in the morning? seems like waiting till the morning shouldn't leave anyone seriously blocked this afternoon?

jmartin-sul · 2024-10-01T19:27:05Z

spec/robots/dor_repo/speech_to_text/stt_create_spec.rb

  end

  context 'when the message is sent successfully' do
-    let(:message_body) { { id: job_id, druid:, media: ['file1.mov', 'file2.mp3'], options: { model: 'large', max_line_count: 80, beam_size: 10 } }.to_json }
+    let(:message_body) { { id: job_id, druid:, options: { model: 'large', max_line_count: 80, beam_size: 10 } }.to_json }


and then i think this'll need to get reverted if the job json structure isn't actually changing

peetucket commented Sep 27, 2024

View reviewed changes

peetucket force-pushed the 1371-job-ids branch from 9ebbfc1 to 1fa3e60 Compare September 27, 2024 20:03

peetucket commented Sep 27, 2024

View reviewed changes

peetucket force-pushed the 1371-job-ids branch from 1fa3e60 to 7cb472c Compare September 27, 2024 20:08

peetucket added 2 commits September 30, 2024 16:45

adjust job-id params and namespace for sending files to s3

4987bdc

add method to return s3 key for filename

0f48f40

peetucket force-pushed the 1371-job-ids branch from 788f3e3 to 0f48f40 Compare September 30, 2024 23:45

jmartin-sul requested changes Oct 1, 2024

View reviewed changes

include media files in sqs message body

38200c7

jmartin-sul approved these changes Oct 1, 2024

View reviewed changes

jmartin-sul merged commit a45edb4 into main Oct 1, 2024
5 checks passed

jmartin-sul deleted the 1371-job-ids branch October 1, 2024 21:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adjust job-id params and namespace for sending files to s3 #1373

adjust job-id params and namespace for sending files to s3 #1373

peetucket commented Sep 27, 2024

peetucket Sep 27, 2024

peetucket Sep 27, 2024

peetucket Sep 27, 2024

jmartin-sul left a comment

jmartin-sul Oct 1, 2024

jmartin-sul Oct 1, 2024

peetucket Oct 1, 2024

jmartin-sul Oct 1, 2024

jmartin-sul Oct 1, 2024

adjust job-id params and namespace for sending files to s3 #1373

adjust job-id params and namespace for sending files to s3 #1373

Conversation

peetucket commented Sep 27, 2024

Why was this change made? 🤔

How was this change tested? 🤨

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmartin-sul left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment