check for model folder on startup #3105

eliteprox · 2024-07-26T19:42:18Z

What does this pull request do? Explain your changes. (required)

This quality-of-life improvement validates on startup that the ai-worker can find the model folder. It checks both the configuration and the docker-in-docker volume mapping, which should greatly improve the orchestrator experience. Without this change, the AI worker times out waiting for the container to become available.

This PR requires livepeer/ai-worker#131

Specific updates (required)

Checks whether the model exists on startup and when processing requests.
Uses a new method, ModelExists, in ai-worker that returns a boolean indicating whether the specific model folder exists.
Logs the exact path where the container looks for the model, both on startup and on individual requests, when the model is not found.
Improves response times by returning a 503 API error code immediately when the orchestrator is missing the model.

AI worker error log on startup:

2024/05/06 10:04:25 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 10:04:25.144208 2005927 starter.go:549] Error AI worker warming text-to-image container: model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist
I0506 10:04:25.144224 2005927 db.go:368] Closing DB

Gateway error log:

I0506 09:29:28.120307 1985227 discovery.go:180] Done fetching orch info numOrch=1 responses=1/1 timedOut=false
I0506 09:29:30.600500 1985227 ai_process.go:344] clientIP=127.0.0.1 request_id=14b57a61 Error submitting request cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1 try=1 orch=https://0.0.0.0:8936 err=Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600545 1985227 handlers.go:1479] clientIP=127.0.0.1 request_id=14b57a61 Error with API code=503 err=no orchestrators available within 2s timeout

AI Core error log on cold model request:

I0506 09:29:28.121922 1984042 ai_http.go:198] manifestID=27_stabilityai/stable-video-diffusion-img2vid-xt-1-1 orchSessionID=8983c425 clientIP=127.0.0.1 Received request id=6156387e cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
2024/05/06 09:29:30 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600020 1984042 handlers.go:1511] HTTP Response Error 503: Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1

How did you test each of these updates (required)

Started go-livepeer with aiModels.json config containing a model that does not exist with warm set to true
Started go-livepeer with aiModels.json config containing a model that does not exist with warm set to false
Sent an AI request through the gateway to go-livepeer for a cold model that doesn't exist; received an immediate 503 error response from the orchestrator.

Does this pull request close any open issues?
Addresses LIV-117

Checklist:

Read the contribution guide
make runs successfully
All tests in ./test.sh pass
README and other documentation updated
Pending changelog updated

ad-astra-video · 2024-08-09T17:59:21Z

@eliteprox do you think we could move the model exists check to the CheckAICapacity call? I think a log in that function call saying that the model folder does not exist would help the Orchestrator see what's going wrong and equally return a fast no capacity error so the Gateway can move on.

We would then not need to export it or include it in the interface. Since they both return the same error, I think it is a little cleaner to just handle it in the ai-worker. WDYT?

eliteprox · 2024-08-13T17:55:48Z

> @eliteprox do you think we could move the model exists check to the CheckAICapacity call? I think a log in that function call saying that the model folder does not exist would help the Orchestrator see what's going wrong and equally return a fast no capacity error so the Gateway can move on.
>
> We would then not need to export it or include it in the interface. Since they both return the same error, I think it is a little cleaner to just handle it in the ai-worker. WDYT?

I've made that change and retested it successfully. It is now working with cold models also. Thanks for the recommendation!

Warm startup
2024/08/13 13:30:22 INFO Removing existing managed container name=/audio-to-text_openai_whisper-large-v3
2024/08/13 13:30:24 ERROR model openai/whisper-large-v3 does not exist at /livepeer/ai/data/models/models--openai--whisper-large-v3
E0813 13:30:24.518225 1365250 starter.go:1147] Error AI worker warming audio-to-text container: model openai/whisper-large-v3 does not exist
I0813 13:30:24.518264 1365250 db.go:368] Closing DB

Cold startup
I0813 13:42:52.118602 1380808 ai_http.go:280] manifestID=31_openai/whisper-large-v3 orchSessionID=4a67bd69 clientIP=127.0.0.1 Received request id=81b64c09 cap=31 modelID=openai/whisper-large-v3
2024/08/13 13:42:52 ERROR model openai/whisper-large-v3 does not exist at /livepeer/ai/data/models/models--openai--whisper-large-v3
E0813 13:42:52.118653 1380808 handlers.go:1522] HTTP Response Error 500: model openai/whisper-large-v3 does not exist

This is now a change to the ai-worker only, so I'll close this PR.

eliteprox · 2024-08-13T17:56:47Z

Closing, as the change only requires an update to ai-runner; go mod can be updated in a release.

eliteprox changed the title ~~Check-models-folder~~ Check if model folder exists on startup and request processing Jul 26, 2024

eliteprox changed the base branch from master to ai-video July 26, 2024 19:44

eliteprox requested a review from rickstaa as a code owner July 26, 2024 19:44

eliteprox changed the title ~~Check if model folder exists on startup and request processing~~ check for model folder on startup Jul 26, 2024

eliteprox added 4 commits August 13, 2024 09:23

Check for model folder when creating container or processing request

f1c2428

Remove comment

aecd534

Update go mod

ef0614a

Update go mod

04b4eb7

eliteprox force-pushed the check-models-folder branch from c7d1d66 to 04b4eb7 August 13, 2024 17:29

eliteprox added 2 commits August 13, 2024 13:44

remove CheckModelExists interface in favor of CheckAICapacity

dde4234

remove ModelExists

e419b68

eliteprox closed this Aug 13, 2024
