check for model folder on startup #3105

Closed
wants to merge 6 commits into from

eliteprox
Contributor

@eliteprox eliteprox commented Jul 26, 2024

What does this pull request do? Explain your changes. (required)

This quality-of-life improvement validates on startup that the ai-worker can find the model folder. The check covers both the configuration and the docker-in-docker volume mapping, which should greatly improve the orchestrator experience. Without this change, the ai-worker times out waiting for the container to become available.

This PR requires livepeer/ai-worker#131

Specific updates (required)

  • Checks whether the model exists on startup and when processing requests.
  • Uses a new ModelExists method in ai-worker that returns a boolean indicating whether the specific model folder exists.
  • Logs the exact path where the container looks for the model, both on startup and on individual requests, when the model is not found.
  • Improves response times by returning a 503 API error code immediately when the orchestrator is missing the model.

AI worker error log on startup:

2024/05/06 10:04:25 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 10:04:25.144208 2005927 starter.go:549] Error AI worker warming text-to-image container: model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist
I0506 10:04:25.144224 2005927 db.go:368] Closing DB

Gateway error log:

I0506 09:29:28.120307 1985227 discovery.go:180] Done fetching orch info numOrch=1 responses=1/1 timedOut=false
I0506 09:29:30.600500 1985227 ai_process.go:344] clientIP=127.0.0.1 request_id=14b57a61 Error submitting request cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1 try=1 orch=https://0.0.0.0:8936 err=Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600545 1985227 handlers.go:1479] clientIP=127.0.0.1 request_id=14b57a61 Error with API code=503 err=no orchestrators available within 2s timeout

AI Core error log on cold model request:

I0506 09:29:28.121922 1984042 ai_http.go:198] manifestID=27_stabilityai/stable-video-diffusion-img2vid-xt-1-1 orchSessionID=8983c425 clientIP=127.0.0.1 Received request id=6156387e cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
2024/05/06 09:29:30 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600020 1984042 handlers.go:1511] HTTP Response Error 503: Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1

How did you test each of these updates (required)

  1. Started go-livepeer with an aiModels.json config containing a model that does not exist, with warm set to true.
  2. Started go-livepeer with an aiModels.json config containing a model that does not exist, with warm set to false.
  3. Sent an AI request through the gateway to go-livepeer for a cold model name that doesn't exist; received an immediate 503 error response from the orchestrator.

Does this pull request close any open issues?
Addresses LIV-117

@eliteprox eliteprox changed the title Check-models-folder Check if model folder exists on startup and request processing Jul 26, 2024
@eliteprox eliteprox changed the base branch from master to ai-video July 26, 2024 19:44
@eliteprox eliteprox requested a review from rickstaa as a code owner July 26, 2024 19:44
@eliteprox eliteprox changed the title Check if model folder exists on startup and request processing check for model folder on startup Jul 26, 2024
@ad-astra-video
Contributor

ad-astra-video commented Aug 9, 2024

@eliteprox do you think we could move the model-exists check into the CheckAICapacity call? I think a log in that function saying the model folder does not exist would help the Orchestrator see what's going wrong, and it would still return a fast no-capacity error so the Gateway can move on.

We would then not need to export it or include it in the interface. Since they both return the same error, I think it is a little cleaner to handle it just in the ai-worker. WDYT?

@eliteprox
Contributor Author

I've made that change and retested it successfully. It is now working with cold models also. Thanks for the recommendation!

Warm startup
2024/08/13 13:30:22 INFO Removing existing managed container name=/audio-to-text_openai_whisper-large-v3
2024/08/13 13:30:24 ERROR model openai/whisper-large-v3 does not exist at /livepeer/ai/data/models/models--openai--whisper-large-v3
E0813 13:30:24.518225 1365250 starter.go:1147] Error AI worker warming audio-to-text container: model openai/whisper-large-v3 does not exist
I0813 13:30:24.518264 1365250 db.go:368] Closing DB

Cold startup
I0813 13:42:52.118602 1380808 ai_http.go:280] manifestID=31_openai/whisper-large-v3 orchSessionID=4a67bd69 clientIP=127.0.0.1 Received request id=81b64c09 cap=31 modelID=openai/whisper-large-v3
2024/08/13 13:42:52 ERROR model openai/whisper-large-v3 does not exist at /livepeer/ai/data/models/models--openai--whisper-large-v3
E0813 13:42:52.118653 1380808 handlers.go:1522] HTTP Response Error 500: model openai/whisper-large-v3 does not exist

This is now a change to the ai-worker only, so I'll close this PR.

@eliteprox
Contributor Author

Closing, as the change only requires an update to ai-runner, and go mod can be updated in a release.

@eliteprox eliteprox closed this Aug 13, 2024