
small readme and comment touchups #22

Open · wants to merge 1 commit into main
Conversation

jmartin-sul (Member):

follow-ons from review and testing of #9:

  • tweak the readme section on job structure to match the current code, to explain more explicitly the relationship between job fields and worker file processing, and to remove vestiges (i think?) of an earlier iteration of Add initial Docker container #9 where the job info was passed to the worker via an actual file on an S3 bucket, instead of as the payload of an SQS message.
  • tweak a code comment that seemed to drift from implementation, and add a region_name param to a boto3.resource call, since i got errors in local testing without that.

@@ -74,21 +74,29 @@ Usually the message on the DONE queue will be processed by the captionWF in comm
docker run --rm --tty --env-file .env sul-speech-to-text --receive
```

-## The Job File
+## The Job Message Structure
Member Author:

"File" seemed like a remnant of the initial implementation where the job info was dropped on an S3 bucket, maybe?

  "id": "gy983cn1444-v2",
  "media": [
    "snl_tomlin_phone_company.mp4"
  ],
Member Author:

might be jumping the gun here if you have a change in the works where the job path in the bucket just gets listed, and no explicit file list is needed. but seems like the current state of the code on main still needs this list?

Member:

we could make the media include the full key in the bucket, e.g. gy983cn1444-v2/snl_tomlin_phone_company.mp4 instead of just snl_tomlin_phone_company.mp4

Member:

or the container could do that since it has all of the info (if we don't want to be explicit in the message itself)

Member Author:

i don't think i have a preference, other than maybe a very slight bias towards what the code currently does (concatenate job id and media entry to get full key), just because that seems as good as the other options in principle, and it's already implemented, so feels like an acceptable path of least resistance.

Contributor:

I did prematurely remove this part of the docs with #20 in mind. So it sounds like we want to submit media filenames as part of the job message? I was thinking it was redundant, because the Job ID would be used to look in the bucket for relevant files. But perhaps it's better to be explicit?

Member:

My latest PR sends the media element in the message with an array of filenames (with the job_id key part, so basically from the root of the bucket). If we decide later this is redundant and not needed, we can remove it from the message, but it also won't do any harm if it's ignored.

Member Author:

d'oh, i think we had discussed this last week, but my fault if i forgot that we decided to just list the bucket and not send a list of file names explicitly in the job message. i'm totally fine with that approach. i did put up this PR after i put #20 in the ready column 🤦, so i suspect this was just me forgetting what we'd all discussed and settled on at the end of last week.

i'll amend this PR to get rid of the file list stuff.

sorry @peetucket for the misguided feedback i left on your PR yesterday!

Member Author:

leaving this in since we finally settled on closing #20

}
```

The convention for the job id is druid-version, since the identifier needs to at least be unique to the druid version to process. But it could also be something more opaque, like a UUID.

The worker will look in the configured S3 bucket for files to process at `"media/{job['id']}/{media_file}"` for each `media_file` in `job["media"]`. E.g. `gy983cn1444-v2/snl_tomlin_phone_company.mp4` for the above example JSON. You can see this illustrated in the `create_job` and `add_media` test functions in `speech_to_text.py`.
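The key construction described in this paragraph can be sketched in a few lines. (`build_keys` is a hypothetical helper for illustration, not a function from the repo; the `media/` prefix and field names follow the README's template string.)

```python
def build_keys(job):
    """Return the S3 object keys the worker would fetch for a job.

    Mirrors the README convention: "media/{job['id']}/{media_file}"
    for each media_file in job["media"].
    """
    return [f"media/{job['id']}/{media_file}" for media_file in job["media"]]


job = {
    "id": "gy983cn1444-v2",
    "media": ["snl_tomlin_phone_company.mp4"],
}
print(build_keys(job))  # ['media/gy983cn1444-v2/snl_tomlin_phone_company.mp4']
```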
Member Author:

perhaps these added sentences are me being too wordy, happy to trim or drop if you think they're overkill.

Contributor:

I think this is good to add, sure. I was kind of hoping to keep the docs independent of SDR wherever possible, if someone else wanted to use it. So maybe flipping the wording?

The job id must be a unique identifier like a UUID. In some use cases a natural key could be used, as is the case in the SDR where druid-version is used.
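As a small illustration of the two id styles in the suggested wording (the druid-version value is an example from this thread, not a requirement):

```python
import uuid

# Opaque, globally unique job id (safe default for any deployment)
opaque_id = str(uuid.uuid4())

# Natural key: SDR-style druid-version, unique per druid version to process
natural_id = "gy983cn1444-v2"

print(opaque_id, natural_id)
```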

Member Author:

i like it, will do, thx

@@ -184,6 +184,7 @@ def get_aws(service_name):
         cred = response["Credentials"]
         aws = boto3.resource(
             service_name,
+            region_name=os.environ.get("AWS_REGION"),
Member Author:

without this change, i got this exception when trying to process the queue (running the docker container locally):

```
% docker run --rm --tty --env-file .env sul-speech-to-text --no-daemon
Traceback (most recent call last):
  File "/app/speech_to_text.py", line 342, in <module>
    main(daemon=args.daemon)
  File "/app/speech_to_text.py", line 24, in main
    job = get_job()
          ^^^^^^^^^
  File "/app/speech_to_text.py", line 51, in get_job
    queue = get_todo_queue()
            ^^^^^^^^^^^^^^^^
  File "/app/speech_to_text.py", line 206, in get_todo_queue
    return get_queue(os.environ.get("SPEECH_TO_TEXT_TODO_SQS_QUEUE"))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/speech_to_text.py", line 216, in get_queue
    sqs = get_aws("sqs")
          ^^^^^^^^^^^^^^
  File "/app/speech_to_text.py", line 185, in get_aws
    aws = boto3.resource(
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/boto3/__init__.py", line 101, in resource
    return _get_default_session().resource(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...snip botocore internal part of stack trace...
  File "/usr/local/lib/python3.11/dist-packages/botocore/regions.py", line 278, in _endpoint_for_partition
    raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.
```

Contributor:

Ah, I guess I wasn't seeing that because I have this in my .env:

AWS_DEFAULT_REGION=us-west-2

and env-example has:

AWS_REGION=us-west-2

Maybe we can update env-example instead?

Member Author:

yup, confirmed that works instead, thanks. rebuilt the container without this change, reproduced the region error, updated AWS_REGION to AWS_DEFAULT_REGION in my .env, error went away.

will update env-example...
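The lookup order being settled on above can be sketched as a tiny helper. (`resolve_region` is hypothetical, not code from the repo; per this thread, boto3 reads `AWS_DEFAULT_REGION` from the environment on its own, so an explicit `region_name` argument is only needed when that variable is absent.)

```python
import os


def resolve_region(explicit=None, env=None):
    """Hypothetical sketch of the region lookup discussed above:
    an explicit region_name wins, otherwise fall back to
    AWS_DEFAULT_REGION (the variable boto3 reads by itself)."""
    env = os.environ if env is None else env
    region = explicit or env.get("AWS_DEFAULT_REGION")
    if not region:
        # Mirrors botocore's NoRegionError message
        raise RuntimeError("You must specify a region.")
    return region


print(resolve_region(env={"AWS_DEFAULT_REGION": "us-west-2"}))  # us-west-2
```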

README:
* add back field that the code still requires
* a bit more explanation of file naming and job id expectations for processing by worker
* typo fix and remove vestiges of earlier implementation where job info was a file on S3 instead of a message payload in SQS

speech_to_text.py
* replaced specific number in comment with param name to keep the two from drifting as easily
jmartin-sul (Member Author) commented Oct 3, 2024

@edsu this is ready for another look. i ended up squashing everything into one commit since it was just a small set of touchups, but for slightly easier review of the requested changes, here's the `git diff origin/touchups HEAD` result on my local copy before i pushed:

````diff
diff --git a/README.md b/README.md
index 18051a6..38483cd 100644
--- a/README.md
+++ b/README.md
@@ -87,7 +87,7 @@ The job is a JSON object (used as an SQS message payload) that contains informat
 }
 ```
 
-The convention for the job id is druid-version, since the identifier needs to at least be unique to the druid version to process. But it could also be something more opaque, like a UUID.
+The job id must be a unique identifier like a UUID. In some use cases a natural key could be used, as is the case in the SDR where druid-version is used.
 
 The worker will look in the configured S3 bucket for files to process at `"media/{job['id']}/{media_file}"` for each `media_file` in `job["media"]`. E.g. `gy983cn1444-v2/snl_tomlin_phone_company.mp4` for the above example JSON. You can see this illustrated in the `create_job` and `add_media` test functions in `speech_to_text.py`.
 
@@ -95,8 +95,11 @@ You can also pass in options for Whisper:
 
 ```json
 {
-  "id": "bc123df4567",
-  "media": ["cat_video.mp4"],
+  "id": "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F",
+  "media": [
+    "cat_video.mp4",
+    "The_Sea_otter.mp4"
+  ],
   "options": {
     "model": "large",
     "max_line_count": 80,
@@ -109,7 +112,7 @@ When you receive the message on the DONE SQS queue it will contain the JSON:
 
 ```json
 {
-  "id": "tr808sp1200",
+  "id": "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F",
   "media": ["bear_breaks_into_home_plays_piano_no_speech.mp4"],
   "options": {
     "model": "large",
diff --git a/env-example b/env-example
index 6e18611..d0958d2 100644
--- a/env-example
+++ b/env-example
@@ -1,6 +1,6 @@
 AWS_ACCESS_KEY_ID=CHANGE_ME
 AWS_SECRET_ACCESS_KEY=CHANGE_ME
-AWS_REGION=us-west-2
+AWS_DEFAULT_REGION=us-west-2
 AWS_ROLE_ARN=arn:aws:iam::418214828013:role/DevelopersRole
 SPEECH_TO_TEXT_S3_BUCKET=sul-speech-to-text-dev-your-username
 SPEECH_TO_TEXT_TODO_SQS_QUEUE=sul-speech-to-text-todo-dev-your-username
diff --git a/speech_to_text.py b/speech_to_text.py
index 2892871..cbf6412 100755
--- a/speech_to_text.py
+++ b/speech_to_text.py
@@ -184,7 +184,6 @@ def get_aws(service_name):
         cred = response["Credentials"]
         aws = boto3.resource(
             service_name,
-            region_name=os.environ.get("AWS_REGION"),
             aws_access_key_id=cred["AccessKeyId"],
             aws_secret_access_key=cred["SecretAccessKey"],
             aws_session_token=cred["SessionToken"],
````

(also, in case it helps someone else, i learned a markdown trick when i got frustrated formatting the above 😂)

@jmartin-sul jmartin-sul changed the title small readme and code touchups small readme and comment touchups Oct 3, 2024