
small readme and comment touchups #22

Open · wants to merge 1 commit into main
Conversation

jmartin-sul (Member):

follow-ons from review and testing of #9:

  • tweak the readme section on job structure to match the current code, to explain more explicitly the relationship between job fields and worker file processing, and to remove vestiges (i think?) of an earlier iteration of Add initial Docker container #9 where the job info was passed to the worker via an actual file on an S3 bucket, instead of as the payload of an SQS message.
  • tweak a code comment that seemed to drift from implementation, and add a region_name param to a boto3.resource call, since i got errors in local testing without that.

@@ -74,21 +74,29 @@ Usually the message on the DONE queue will be processed by the captionWF in comm
docker run --rm --tty --env-file .env sul-speech-to-text --receive
```

-## The Job File
+## The Job Message Structure
Member Author:

"File" seemed like a remnant of the initial implementation where the job info was dropped on an S3 bucket, maybe?

  "id": "gy983cn1444-v2",
  "media": [
    "snl_tomlin_phone_company.mp4"
  ],
Member Author:

might be jumping the gun here if you have a change in the works where the job path in the bucket just gets listed, and no explicit file list is needed. but seems like the current state of the code on main still needs this list?

Member:

we could make the media include the full key in the bucket, e.g. gy983cn1444-v2/snl_tomlin_phone_company.mp4 instead of just snl_tomlin_phone_company.mp4

Member:

or the container could do that since it has all of the info (if we don't want to be explicit in the message itself)

Member Author:

i don't think i have a preference, other than maybe a very slight bias towards what the code currently does (concatenate job id and media entry to get full key), just because that seems as good as the other options in principle, and it's already implemented, so feels like an acceptable path of least resistance.

Contributor:

I did prematurely remove this part of the docs with #20 in mind. So it sounds like we want to submit media filenames as part of the job message? I was thinking it was redundant, because the Job ID would be used to look in the bucket for relevant files. But perhaps it's better to be explicit?

Member:

My latest PR sends the media element in the message with an array of filenames (with the job_id key part, so basically from the root of the bucket). If we decide later this is redundant and not needed, we can remove it from the message, but it also won't do any harm if it's ignored.

Member Author:

d'oh, i think we had discussed this last week, but my fault if i forgot that we decided to just list the bucket and not send a list of file names explicitly in the job message. i'm totally fine with that approach. i did put up this PR after i put #20 in the ready column 🤦, so i suspect this was just me forgetting what we'd all discussed and settled on at the end of last week.

i'll amend this PR to get rid of the file list stuff.

sorry @peetucket for the misguided feedback i left on your PR yesterday!

Member Author:

leaving this in since we finally settled on closing #20

}
```

The convention for the job id is druid-version, since the identifier needs to at least be unique to the druid version to process. But it could also be something more opaque, like a UUID.

The worker will look in the configured S3 bucket for files to process at `"media/{job['id']}/{media_file}"` for each `media_file` in `job["media"]`. E.g. `gy983cn1444-v2/snl_tomlin_phone_company.mp4` for the above example JSON. You can see this illustrated in the `create_job` and `add_media` test functions in `speech_to_text.py`.
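The key construction described in this paragraph can be sketched in a few lines. (`build_keys` is a hypothetical helper for illustration, not a function from the repo; the `media/` prefix and field names follow the README's template string.)

```python
def build_keys(job):
    """Return the S3 object keys the worker would fetch for a job.

    Mirrors the README convention: "media/{job['id']}/{media_file}"
    for each media_file in job["media"].
    """
    return [f"media/{job['id']}/{media_file}" for media_file in job["media"]]


job = {
    "id": "gy983cn1444-v2",
    "media": ["snl_tomlin_phone_company.mp4"],
}
print(build_keys(job))  # ['media/gy983cn1444-v2/snl_tomlin_phone_company.mp4']
```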
Member Author:

perhaps these added sentences are me being too wordy, happy to trim or drop if you think they're overkill.

Contributor:

I think this is good to add, sure. I was kind of hoping to keep the docs independent of SDR wherever possible, if someone else wanted to use it. So maybe flipping the wording?

The job id must be a unique identifier like a UUID. In some use cases a natural key could be used, as is the case in the SDR where druid-version is used.
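As a small illustration of the two id styles in the suggested wording (the druid-version value is an example from this thread, not a requirement):

```python
import uuid

# Opaque, globally unique job id (safe default for any deployment)
opaque_id = str(uuid.uuid4())

# Natural key: SDR-style druid-version, unique per druid version to process
natural_id = "gy983cn1444-v2"

print(opaque_id, natural_id)
```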

Member Author:

i like it, will do, thx

@@ -184,6 +184,7 @@ def get_aws(service_name):
         cred = response["Credentials"]
         aws = boto3.resource(
             service_name,
+            region_name=os.environ.get("AWS_REGION"),
Member Author:

without this change, i got this exception when trying to process the queue (running the docker container locally):

```
% docker run --rm --tty --env-file .env sul-speech-to-text --no-daemon
Traceback (most recent call last):
  File "/app/speech_to_text.py", line 342, in <module>
    main(daemon=args.daemon)
  File "/app/speech_to_text.py", line 24, in main
    job = get_job()
          ^^^^^^^^^
  File "/app/speech_to_text.py", line 51, in get_job
    queue = get_todo_queue()
            ^^^^^^^^^^^^^^^^
  File "/app/speech_to_text.py", line 206, in get_todo_queue
    return get_queue(os.environ.get("SPEECH_TO_TEXT_TODO_SQS_QUEUE"))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/speech_to_text.py", line 216, in get_queue
    sqs = get_aws("sqs")
          ^^^^^^^^^^^^^^
  File "/app/speech_to_text.py", line 185, in get_aws
    aws = boto3.resource(
          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/boto3/__init__.py", line 101, in resource
    return _get_default_session().resource(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...snip botocore internal part of stack trace...
  File "/usr/local/lib/python3.11/dist-packages/botocore/regions.py", line 278, in _endpoint_for_partition
    raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.
```

Contributor:

Ah, I guess I wasn't seeing that because I have this in my .env:

AWS_DEFAULT_REGION=us-west-2

and env-example has:

AWS_REGION=us-west-2

Maybe we can update env-example instead?

Member Author:

yup, confirmed that works instead, thanks. rebuilt the container without this change, reproduced the region error, updated AWS_REGION to AWS_DEFAULT_REGION in my .env, error went away.

will update env-example...
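The lookup order being settled on above can be sketched as a tiny helper. (`resolve_region` is hypothetical, not code from the repo; per this thread, boto3 reads `AWS_DEFAULT_REGION` from the environment on its own, so an explicit `region_name` argument is only needed when that variable is absent.)

```python
import os


def resolve_region(explicit=None, env=None):
    """Hypothetical sketch of the region lookup discussed above:
    an explicit region_name wins, otherwise fall back to
    AWS_DEFAULT_REGION (the variable boto3 reads by itself)."""
    env = os.environ if env is None else env
    region = explicit or env.get("AWS_DEFAULT_REGION")
    if not region:
        # Mirrors botocore's NoRegionError message
        raise RuntimeError("You must specify a region.")
    return region


print(resolve_region(env={"AWS_DEFAULT_REGION": "us-west-2"}))  # us-west-2
```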

README:
* add back field that the code still requires
* a bit more explanation of file naming and job id expectations for processing by worker
* typo fix and remove vestiges of earlier implementation where job info was a file on S3 instead of a message payload in SQS

speech_to_text.py
* replaced specific number in comment with param name to keep the two from drifting as easily
jmartin-sul (Member Author) commented Oct 3, 2024

@edsu this is ready for another look. i ended up squashing everything into one commit since it was just a small set of touchups, but for slightly easier review of the requested changes, here's the `git diff origin/touchups HEAD` result on my local copy before i pushed:

````diff
diff --git a/README.md b/README.md
index 18051a6..38483cd 100644
--- a/README.md
+++ b/README.md
@@ -87,7 +87,7 @@ The job is a JSON object (used as an SQS message payload) that contains informat
 }
 ```
 
-The convention for the job id is druid-version, since the identifier needs to at least be unique to the druid version to process. But it could also be something more opaque, like a UUID.
+The job id must be a unique identifier like a UUID. In some use cases a natural key could be used, as is the case in the SDR where druid-version is used.
 
 The worker will look in the configured S3 bucket for files to process at `"media/{job['id']}/{media_file}"` for each `media_file` in `job["media"]`. E.g. `gy983cn1444-v2/snl_tomlin_phone_company.mp4` for the above example JSON. You can see this illustrated in the `create_job` and `add_media` test functions in `speech_to_text.py`.
 
@@ -95,8 +95,11 @@ You can also pass in options for Whisper:
 
 ```json
 {
-  "id": "bc123df4567",
-  "media": ["cat_video.mp4"],
+  "id": "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F",
+  "media": [
+    "cat_video.mp4",
+    "The_Sea_otter.mp4"
+  ],
   "options": {
     "model": "large",
     "max_line_count": 80,
@@ -109,7 +112,7 @@ When you receive the message on the DONE SQS queue it will contain the JSON:
 
 ```json
 {
-  "id": "tr808sp1200",
+  "id": "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F",
   "media": ["bear_breaks_into_home_plays_piano_no_speech.mp4"],
   "options": {
     "model": "large",
diff --git a/env-example b/env-example
index 6e18611..d0958d2 100644
--- a/env-example
+++ b/env-example
@@ -1,6 +1,6 @@
 AWS_ACCESS_KEY_ID=CHANGE_ME
 AWS_SECRET_ACCESS_KEY=CHANGE_ME
-AWS_REGION=us-west-2
+AWS_DEFAULT_REGION=us-west-2
 AWS_ROLE_ARN=arn:aws:iam::418214828013:role/DevelopersRole
 SPEECH_TO_TEXT_S3_BUCKET=sul-speech-to-text-dev-your-username
 SPEECH_TO_TEXT_TODO_SQS_QUEUE=sul-speech-to-text-todo-dev-your-username
diff --git a/speech_to_text.py b/speech_to_text.py
index 2892871..cbf6412 100755
--- a/speech_to_text.py
+++ b/speech_to_text.py
@@ -184,7 +184,6 @@ def get_aws(service_name):
         cred = response["Credentials"]
         aws = boto3.resource(
             service_name,
-            region_name=os.environ.get("AWS_REGION"),
             aws_access_key_id=cred["AccessKeyId"],
             aws_secret_access_key=cred["SecretAccessKey"],
             aws_session_token=cred["SessionToken"],
````

(also, in case it helps someone else, i learned a markdown trick when i got frustrated formatting the above 😂)

@jmartin-sul jmartin-sul changed the title small readme and code touchups small readme and comment touchups Oct 3, 2024