Only mark CUDA_ERROR_ILLEGAL_ADDRESS errors as unrecoverable errors #356

Open
yondonfu opened this issue Nov 14, 2022 · 3 comments

@yondonfu
Member

An internal CUDA function can return CUDA_ERROR_ILLEGAL_ADDRESS during Nvidia transcoding, which means the process is in an inconsistent state and needs to be restarted. The original context in which we encountered this issue is documented in livepeer/go-livepeer#1921. In #267 we implemented a panic whenever an unrecoverable error is encountered. In livepeer/go-livepeer#2057 we bumped the LPMS version to include this update, and then in livepeer/go-livepeer#2094 and livepeer/go-livepeer#2352 we moved the unrecoverable error check into go-livepeer.

The problem is that LPMS marks any unknown error (indicated by AVERROR_UNKNOWN) as unrecoverable. As a result, CUDA errors that do not warrant a process restart are also marked as unrecoverable, and go-livepeer panics on them.

For example, a CUDA OOM error is also treated as an unknown error by the libav code:

[AVHWDeviceContext @ 0x7f43ec093800] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_OUT_OF_MEMORY: out of memory 
ERROR: decoder.c:251] Unable to open hardware context for decoding : Unknown error occurred 
ERROR: decoder.c:285] Unable to open video decoder : Error number -1448234581 occurred 
E0111 23:30:50.790498       1 ffmpeg.go:503] Transcoder Return : Unrecoverable state, restart process 
panic: Unrecoverable state, restart process 
goroutine 6108 [running]: 
github.com/livepeer/lpms/ffmpeg.(*Transcoder).Transcode(0xc013aa0ec0, 0xc001f6de50, 0xc00056c3c0, 0x1, 0x1, 0x0, 0x0, 0x0) 
    /go/pkg/mod/github.com/livepeer/[email protected]/ffmpeg/ffmpeg.go:505 +0x25a7 
github.com/livepeer/go-livepeer/core.(*NvidiaTranscoder).Transcode(0xc013aa0ee0, 0xc0001db4a0, 0x2, 0x1, 0xc0001db401) 
    /build/core/transcoder.go:88 +0x1ce 
github.com/livepeer/go-livepeer/core.(*transcoderSession).loop(0xc012e54d00) 
    /build/core/lb.go:186 +0x184 
github.com/livepeer/go-livepeer/core.(*LoadBalancingTranscoder).createSession.func2(0xc012e54d00, 0xc012e54d80) 
    /build/core/lb.go:127 +0x2b 
created by github.com/livepeer/go-livepeer/core.(*LoadBalancingTranscoder).createSession 
    /build/core/lb.go:126 +0x61c 
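
To illustrate the behavior behind the trace above, here is a minimal C sketch of a check that treats every AVERROR_UNKNOWN as unrecoverable. It is not the actual LPMS code: the SKETCH_ names and the function are hypothetical, and only the two tag macros mirror FFmpeg's MKTAG/FFERRTAG definitions. The 'U','N','R','V' tag is an assumption, chosen because it evaluates to -1448234581, the error number printed in the log.

```c
/* Tag construction, mirroring FFmpeg's MKTAG/FFERRTAG macros. */
#define MKTAG(a, b, c, d) \
    ((a) | ((b) << 8) | ((c) << 16) | ((unsigned)(d) << 24))
#define FFERRTAG(a, b, c, d) (-(int)MKTAG(a, b, c, d))

/* Same value as libavutil's AVERROR_UNKNOWN. */
#define SKETCH_AVERROR_UNKNOWN FFERRTAG('U', 'N', 'K', 'N')

/* Hypothetical "unrecoverable" code; evaluates to -1448234581. */
#define SKETCH_ERR_UNRECOVERABLE FFERRTAG('U', 'N', 'R', 'V')

/*
 * Hypothetical classifier: any AVERROR_UNKNOWN is promoted to the
 * unrecoverable code, whatever CUDA error originally caused it.
 * All other error codes pass through unchanged.
 */
static int sketch_classify(int av_err)
{
    if (av_err == SKETCH_AVERROR_UNKNOWN)
        return SKETCH_ERR_UNRECOVERABLE;
    return av_err;
}
```

Because the CUDA OOM failure surfaces as "Unknown error occurred", a check of this shape cannot tell it apart from an illegal address error, so the caller panics in both cases.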

We should only mark CUDA_ERROR_ILLEGAL_ADDRESS errors as unrecoverable so that go-livepeer only panics for those errors.

@cyberj0g
Contributor

Addressed by: livepeer/FFmpeg@92c358e
AFAIK no changes to lpms are required.
Committed directly to the branch by accident; hopefully it's trivial enough not to mandate a PR in a secondary repo.

@yondonfu
Member Author

@cyberj0g Nice!

I see that livepeer/FFmpeg@92c358e updates ff_cuda_check() to return AVERROR(ENOMEM) if a CUDA_ERROR_OUT_OF_MEMORY is detected. Since LPMS only marks AVERROR_UNKNOWN as an unrecoverable error, with this change, LPMS will no longer mark CUDA OOM errors as unrecoverable.
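
The effect of that mapping can be sketched as follows. This is a simplified stand-in, not the code from the commit: the function name and the SKETCH_ constants are hypothetical, and the real ff_cuda_check() takes more arguments and also logs the error. The CUresult values (2 and 700) are the ones defined in the CUDA driver API header, and the sketch assumes, as described in this thread, that every other failure reaches LPMS as AVERROR_UNKNOWN.

```c
#include <errno.h>

/* Tag construction, mirroring FFmpeg's MKTAG/FFERRTAG macros. */
#define MKTAG(a, b, c, d) \
    ((a) | ((b) << 8) | ((c) << 16) | ((unsigned)(d) << 24))
#define FFERRTAG(a, b, c, d) (-(int)MKTAG(a, b, c, d))

/* Same value as libavutil's AVERROR_UNKNOWN. */
#define SKETCH_AVERROR_UNKNOWN FFERRTAG('U', 'N', 'K', 'N')

/* CUresult values from cuda.h, redefined so the sketch builds without CUDA. */
#define SKETCH_CUDA_SUCCESS               0
#define SKETCH_CUDA_ERROR_OUT_OF_MEMORY   2
#define SKETCH_CUDA_ERROR_ILLEGAL_ADDRESS 700

/* Hypothetical mapping from a CUDA result to a libav-style error code. */
static int sketch_cuda_to_averror(int cu_result)
{
    if (cu_result == SKETCH_CUDA_SUCCESS)
        return 0;
    if (cu_result == SKETCH_CUDA_ERROR_OUT_OF_MEMORY)
        return -ENOMEM;            /* what AVERROR(ENOMEM) yields on POSIX */
    return SKETCH_AVERROR_UNKNOWN; /* assumption: all other CUDA failures */
}
```

With this mapping an OOM failure no longer equals AVERROR_UNKNOWN, so a check that only looks for AVERROR_UNKNOWN stops treating it as unrecoverable.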

In the future, it would be nice to have a way to signal CUDA_ERROR_ILLEGAL_ADDRESS specifically from within ffmpeg. Even with this change, every CUDA error other than CUDA_ERROR_OUT_OF_MEMORY is still marked as unrecoverable, since they all reach LPMS as AVERROR_UNKNOWN. The change in your commit is still a useful improvement as-is, though!
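
One possible shape for such a dedicated signal is sketched below. Nothing here exists in ffmpeg or LPMS as far as this thread shows: the dedicated error code, both function names, and the SKETCH_ constants are hypothetical. The idea is simply that only CUDA_ERROR_ILLEGAL_ADDRESS maps to a distinct code, and the caller checks for that code instead of AVERROR_UNKNOWN.

```c
#include <errno.h>

/* Tag construction, mirroring FFmpeg's MKTAG/FFERRTAG macros. */
#define MKTAG(a, b, c, d) \
    ((a) | ((b) << 8) | ((c) << 16) | ((unsigned)(d) << 24))
#define FFERRTAG(a, b, c, d) (-(int)MKTAG(a, b, c, d))

/* Same value as libavutil's AVERROR_UNKNOWN. */
#define SKETCH_AVERROR_UNKNOWN FFERRTAG('U', 'N', 'K', 'N')

/* Hypothetical dedicated code reserved for illegal address errors. */
#define SKETCH_ERR_ILLEGAL_ADDRESS FFERRTAG('U', 'N', 'R', 'V')

/* Hypothetical mapping: only CUresult 700 gets the dedicated code. */
static int sketch_map_cuda_result(int cu_result)
{
    switch (cu_result) {
    case 0:   /* CUDA_SUCCESS */
        return 0;
    case 2:   /* CUDA_ERROR_OUT_OF_MEMORY */
        return -ENOMEM;
    case 700: /* CUDA_ERROR_ILLEGAL_ADDRESS */
        return SKETCH_ERR_ILLEGAL_ADDRESS;
    default:  /* any other CUDA failure stays generic */
        return SKETCH_AVERROR_UNKNOWN;
    }
}

/* Hypothetical caller-side check: restart only on the dedicated code. */
static int sketch_is_unrecoverable(int av_err)
{
    return av_err == SKETCH_ERR_ILLEGAL_ADDRESS;
}
```

Under this scheme an unknown CUDA error would be reported to the caller as an ordinary failure, and only the illegal address case would trigger the panic in go-livepeer.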

@Quintendos

Quintendos commented Dec 14, 2022

Diederick is working on this (Dec '22).
