-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adjust job-id params and namespace for sending files to s3 #1373
Conversation
@cocina_object = cocina_object | ||
@workflow_context = workflow_context | ||
@bare_druid = cocina_object.externalIdentifier.delete_prefix('druid:') | ||
@logger = logger || Logger.new($stdout) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this class never uses the logger, so remove for now unless we specifically need it later (it came as copypasta from the similar Ocr class)
end | ||
end | ||
|
||
private | ||
|
||
def sttable_filenames | ||
Dor::TextExtraction::SpeechToText.new(cocina_object:, workflow_context: workflow.context).filenames_to_stt | ||
Dor::TextExtraction::SpeechToText.new(cocina_object:).filenames_to_stt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the filenames_to_stt
method doesn't need workflow_context so don't bother passing it in
9ebbfc1
to
1fa3e60
Compare
@@ -27,8 +27,7 @@ def perform_work | |||
def send_sqs_message | |||
message_body = { | |||
id: job_id, | |||
druid:, | |||
media: | |||
druid: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaving the druid in the job message (even though it happens to be included in the job_id now) ... maybe useful for logging or other purposes to have it distinct
1fa3e60
to
7cb472c
Compare
788f3e3
to
0f48f40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming i didn't miss anything about the job JSON on the speech-to-text python worker side intentionally changing, i think this drops a field that's actually needed. if the same change is in the works on the python side, will approve.
@@ -27,8 +27,7 @@ def perform_work | |||
def send_sqs_message | |||
message_body = { | |||
id: job_id, | |||
druid:, | |||
media: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this got removed in the documentation, but afaict from the code and from testing locally yesterday evening, the media
field is actually still required in the message body: https://github.com/sul-dlss/speech-to-text/blob/1fd0eb35b721349c75c75be5151dffc8564d4084/speech_to_text.py#L80-L85
i assume the removal from STT documentation was a mistake or a leftover from a discarded development path with the code. so adding that field back was one of the little touchups i was going to make (to the speech-to-text README) in the small follow-on PR i mentioned this morning at standup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or, maybe you and ed have discussed and this lines up with a coming change in the speech-to-text python worker?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I hadn't further discussed this was @edsu - was operating under the assumption that the whipser container would look for all files that matched the key for the message sent.
BUT... we could also continue to send the media key here in the message body, and it could basically do this for the container (i.e. look at what is in S3 for that job_id and then send all of those files in the message body)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i could see either approach, and don't think i have a well thought out preference. seems like an easy adjustment either way... i might bias towards assuming for now that the current worker code is right... i just put up a small PR to that effect in speech-to-text: sul-dlss/speech-to-text#22
unless @edsu has this change to just list bucket contents in the works already?
i think we discussed this at the end of last week, but i can't remember if we settled on an approach or which way we decided if so.
i have to grab lunch after this DLSS all staff that's happening right now, and it'll be 5 eastern time once that's over anyway... maybe we can confer and decide (or remind me of last week's decision 😅) in the morning? seems like waiting till the morning shouldn't leave anyone seriously blocked this afternoon?
end | ||
|
||
context 'when the message is sent successfully' do | ||
let(:message_body) { { id: job_id, druid:, media: ['file1.mov', 'file2.mp3'], options: { model: 'large', max_line_count: 80, beam_size: 10 } }.to_json } | ||
let(:message_body) { { id: job_id, druid:, options: { model: 'large', max_line_count: 80, beam_size: 10 } }.to_json } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then i think this'll need to get reverted if the job json structure isn't actually changing
Why was this change made? 🤔
Fixes #1371 - job id is now druid-version and so is the place where we put the files in S3 bucket
How was this change tested? 🤨
Spec