Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adjust job-id params and namespace for sending files to s3 #1373

Merged
merged 3 commits into from
Oct 1, 2024

Conversation

peetucket
Copy link
Member

Why was this change made? 🤔

Fixes #1371 - job id is now druid-version and so is the place where we put the files in S3 bucket

How was this change tested? 🤨

Spec

@cocina_object = cocina_object
@workflow_context = workflow_context
@bare_druid = cocina_object.externalIdentifier.delete_prefix('druid:')
@logger = logger || Logger.new($stdout)
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class never uses the logger, so remove for now unless we specifically need it later (it came as copypasta from the similar Ocr class)

end
end

private

def sttable_filenames
Dor::TextExtraction::SpeechToText.new(cocina_object:, workflow_context: workflow.context).filenames_to_stt
Dor::TextExtraction::SpeechToText.new(cocina_object:).filenames_to_stt
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the filenames_to_stt method doesn't need workflow_context so don't bother passing it in

@@ -27,8 +27,7 @@ def perform_work
def send_sqs_message
message_body = {
id: job_id,
druid:,
media:
druid:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaving the druid in the job message (even though it happens to be included in the job_id now) ... maybe useful for logging or other purposes to have it distinct

Copy link
Member

@jmartin-sul jmartin-sul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming i didn't miss anything about the job JSON on the speech-to-text python worker side intentionally changing, i think this drops a field that's actually needed. if the same change is in the works on the python side, will approve.

@@ -27,8 +27,7 @@ def perform_work
def send_sqs_message
message_body = {
id: job_id,
druid:,
media:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this got removed in the documentation, but afaict from the code and from testing locally yesterday evening, the media field is actually still required in the message body: https://github.com/sul-dlss/speech-to-text/blob/1fd0eb35b721349c75c75be5151dffc8564d4084/speech_to_text.py#L80-L85

i assume the removal from STT documentation was a mistake or a leftover from a discarded development path with the code. so adding that field back was one of the little touchups i was going to make (to the speech-to-text README) in the small follow-on PR i mentioned this morning at standup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(or, maybe you and ed have discussed and this lines up with a coming change in the speech-to-text python worker?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I hadn't further discussed this was @edsu - was operating under the assumption that the whipser container would look for all files that matched the key for the message sent.

BUT... we could also continue to send the media key here in the message body, and it could basically do this for the container (i.e. look at what is in S3 for that job_id and then send all of those files in the message body)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i could see either approach, and don't think i have a well thought out preference. seems like an easy adjustment either way... i might bias towards assuming for now that the current worker code is right... i just put up a small PR to that effect in speech-to-text: sul-dlss/speech-to-text#22

unless @edsu has this change to just list bucket contents in the works already?

i think we discussed this at the end of last week, but i can't remember if we settled on an approach or which way we decided if so.

i have to grab lunch after this DLSS all staff that's happening right now, and it'll be 5 eastern time once that's over anyway... maybe we can confer and decide (or remind me of last week's decision 😅) in the morning? seems like waiting till the morning shouldn't leave anyone seriously blocked this afternoon?

end

context 'when the message is sent successfully' do
let(:message_body) { { id: job_id, druid:, media: ['file1.mov', 'file2.mp3'], options: { model: 'large', max_line_count: 80, beam_size: 10 } }.to_json }
let(:message_body) { { id: job_id, druid:, options: { model: 'large', max_line_count: 80, beam_size: 10 } }.to_json }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then i think this'll need to get reverted if the job json structure isn't actually changing

@jmartin-sul jmartin-sul merged commit a45edb4 into main Oct 1, 2024
5 checks passed
@jmartin-sul jmartin-sul deleted the 1371-job-ids branch October 1, 2024 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQS Job Id should be druid-version pair
2 participants