small readme and comment touchups #22
base: main
Conversation
@@ -74,21 +74,29 @@ Usually the message on the DONE queue will be processed by the captionWF in comm
 docker run --rm --tty --env-file .env sul-speech-to-text --receive
 ```

-## The Job File
+## The Job Message Structure
"File" seemed like a remnant of the initial implementation where the job info was dropped on an S3 bucket, maybe?
"id": "gy983cn1444-v2",
"media": [
  "snl_tomlin_phone_company.mp4"
],
might be jumping the gun here if you have a change in the works where the job path in the bucket just gets listed, and no explicit file list is needed. but seems like the current state of the code on `main` still needs this list?
we could make the media include the full key in the bucket, e.g. `gy983cn1444-v2/snl_tomlin_phone_company.mp4` instead of just `snl_tomlin_phone_company.mp4`
or the container could do that since it has all of the info (if we don't want to be explicit in the message itself)
i don't think i have a preference, other than maybe a very slight bias towards what the code currently does (concatenate job id and media entry to get full key), just because that seems as good as the other options in principle, and it's already implemented, so feels like an acceptable path of least resistance.
I did prematurely remove this part of the docs with #20 in mind. So it sounds like we want to submit media filenames as part of the job message? I was thinking it was redundant, because the Job ID would be used to look in the bucket for relevant files. But perhaps it's better to be explicit?
My latest PR sends the media element in the message with an array of filenames (with the job_id key part, so basically from the root of the bucket). If we decide later this is redundant and not needed, we can remove it from the message, but it also won't do any harm if it's ignored.
d'oh, i think we had discussed this last week, but my fault if i forgot that we decided to just list the bucket and not send a list of file names explicitly in the job message. i'm totally fine with that approach. i did put up this PR after i put #20 in the ready column 🤦, so i suspect this was just me forgetting what we'd all discussed and settled on at the end of last week.
i'll amend this PR to get rid of the file list stuff.
sorry @peetucket for the misguided feedback i left on your PR yesterday!
leaving this in since we finally settled on closing #20
}
```

The convention for the job id is druid-version, since the identifier needs to at least be unique to the druid version to process. But it could also be something more opaque, like a UUID.

The worker will look in the configured S3 bucket for files to process at `"media/{job['id']}/{media_file}"` for each `media_file` in `job["media"]`. E.g. `gy983cn1444-v2/snl_tomlin_phone_company.mp4` for the above example JSON. You can see this illustrated in the `create_job` and `add_media` test functions in `speech_to_text.py`.
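That key construction can be sketched in a few lines of Python. This is a hypothetical illustration of the convention described above, not code from the repo (the `media_keys` helper name is made up):

```python
# Hypothetical helper illustrating the S3 key construction described
# above; media_keys is not a function that exists in the repo.

def media_keys(job: dict) -> list[str]:
    # Concatenate the "media/" prefix, the job id, and each media
    # filename to get the full key for every file the worker fetches.
    return [f"media/{job['id']}/{media_file}" for media_file in job["media"]]

job = {"id": "gy983cn1444-v2", "media": ["snl_tomlin_phone_company.mp4"]}
print(media_keys(job))  # ['media/gy983cn1444-v2/snl_tomlin_phone_company.mp4']
```

This mirrors what the comment thread below calls "concatenate job id and media entry to get full key".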
perhaps these added sentences are me being too wordy, happy to trim or drop if you think they're overkill.
I think this is good to add, sure. I was kind of hoping to keep the docs independent of SDR wherever possible, if someone else wanted to use it. So maybe flipping the wording?
The job id must be a unique identifier like a UUID. In some use cases a natural key could be used, as is the case in the SDR where druid-version is used.
i like it, will do, thx
speech_to_text.py
Outdated
@@ -184,6 +184,7 @@ def get_aws(service_name):
     cred = response["Credentials"]
     aws = boto3.resource(
         service_name,
+        region_name=os.environ.get("AWS_REGION"),
without this change, i got this exception when trying to process the queue (running the docker container locally):
% docker run --rm --tty --env-file .env sul-speech-to-text --no-daemon
Traceback (most recent call last):
File "/app/speech_to_text.py", line 342, in <module>
main(daemon=args.daemon)
File "/app/speech_to_text.py", line 24, in main
job = get_job()
^^^^^^^^^
File "/app/speech_to_text.py", line 51, in get_job
queue = get_todo_queue()
^^^^^^^^^^^^^^^^
File "/app/speech_to_text.py", line 206, in get_todo_queue
return get_queue(os.environ.get("SPEECH_TO_TEXT_TODO_SQS_QUEUE"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/speech_to_text.py", line 216, in get_queue
sqs = get_aws("sqs")
^^^^^^^^^^^^^^
File "/app/speech_to_text.py", line 185, in get_aws
aws = boto3.resource(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/boto3/__init__.py", line 101, in resource
return _get_default_session().resource(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...snip botocore internal part of stack trace...
File "/usr/local/lib/python3.11/dist-packages/botocore/regions.py", line 278, in _endpoint_for_partition
raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.
Ah, I guess I wasn't seeing that because I have `AWS_DEFAULT_REGION=us-west-2` in my `.env`, and `env-example` has `AWS_REGION=us-west-2`. Maybe we can update `env-example` instead?
yup, confirmed that works instead, thanks. rebuilt the container without this change, reproduced the region error, updated `AWS_REGION` to `AWS_DEFAULT_REGION` in my .env, error went away. will update `env-example` ...
README:
* add back field that the code still requires
* a bit more explanation of file naming and job id expectations for processing by worker
* typo fix and remove vestiges of earlier implementation where job info was a file on S3 instead of a message payload in SQS

speech_to_text.py:
* replaced specific number in comment with param name to keep the two from drifting as easily
@edsu this is ready for another look. i ended up squashing everything into one commit since it was just a small set of touchups, but for slightly easier review of the requested changes, here's the diff:

diff --git a/README.md b/README.md
index 18051a6..38483cd 100644
--- a/README.md
+++ b/README.md
@@ -87,7 +87,7 @@ The job is a JSON object (used as an SQS message payload) that contains informat
}
```
-The convention for the job id is druid-version, since the identifier needs to at least be unique to the druid version to process. But it could also be something more opaque, like a UUID.
+The job id must be a unique identifier like a UUID. In some use cases a natural key could be used, as is the case in the SDR where druid-version is used.
The worker will look in the configured S3 bucket for files to process at `"media/{job['id']}/{media_file}"` for each `media_file` in `job["media"]`. E.g. `gy983cn1444-v2/snl_tomlin_phone_company.mp4` for the above example JSON. You can see this illustrated in the `create_job` and `add_media` test functions in `speech_to_text.py`.
@@ -95,8 +95,11 @@ You can also pass in options for Whisper:
```json
{
- "id": "bc123df4567",
- "media": ["cat_video.mp4"],
+ "id": "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F",
+ "media": [
+ "cat_video.mp4",
+ "The_Sea_otter.mp4"
+ ],
"options": {
"model": "large",
"max_line_count": 80,
@@ -109,7 +112,7 @@ When you receive the message on the DONE SQS queue it will contain the JSON:
```json
{
- "id": "tr808sp1200",
+ "id": "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F",
"media": ["bear_breaks_into_home_plays_piano_no_speech.mp4"],
"options": {
"model": "large",
diff --git a/env-example b/env-example
index 6e18611..d0958d2 100644
--- a/env-example
+++ b/env-example
@@ -1,6 +1,6 @@
AWS_ACCESS_KEY_ID=CHANGE_ME
AWS_SECRET_ACCESS_KEY=CHANGE_ME
-AWS_REGION=us-west-2
+AWS_DEFAULT_REGION=us-west-2
AWS_ROLE_ARN=arn:aws:iam::418214828013:role/DevelopersRole
SPEECH_TO_TEXT_S3_BUCKET=sul-speech-to-text-dev-your-username
SPEECH_TO_TEXT_TODO_SQS_QUEUE=sul-speech-to-text-todo-dev-your-username
diff --git a/speech_to_text.py b/speech_to_text.py
index 2892871..cbf6412 100755
--- a/speech_to_text.py
+++ b/speech_to_text.py
@@ -184,7 +184,6 @@ def get_aws(service_name):
cred = response["Credentials"]
aws = boto3.resource(
service_name,
- region_name=os.environ.get("AWS_REGION"),
aws_access_key_id=cred["AccessKeyId"],
aws_secret_access_key=cred["SecretAccessKey"],
aws_session_token=cred["SessionToken"],

(also, in case it helps someone else, i learned a markdown trick when i got frustrated formatting the above 😂)
follow-ons from review and testing of #9:
* add a `region_name` param to a `boto3.resource` instance creation, since i got errors in local testing without that.