Transcoding hangs under load #158

Open
darkdarkdragon opened this issue Oct 12, 2019 · 15 comments

@darkdarkdragon
Contributor

While doing throughput testing I hit a situation where transcoding hangs (reproduced three times).
The test transcoded one 10-minute generated video into seven 720p renditions across ten simultaneous streams. Looking at the output files, most were transcoded completely, but some were not. nvidia-smi shows that the Livepeer process is still using the GPU, but the incomplete output files are not growing in size.
There is a stopped GCE instance on which I saw this; I've left it intact.
To reproduce:

  1. Start the Ivan-gpu-p100 instance
  2. Go to /home/dark/go-livepeer
  3. Run bench.sh
  4. Wait 12 minutes
  5. Look at the /disk-1-temp directory - there will be 70 output files, most of them complete (same size) and some smaller, and those smaller files are not growing in size.
@darkdarkdragon
Contributor Author

The same thing also happened on the Genesis box with a single 1080 Ti while trying to transcode 50 streams of official_test_source_2s_keys_24pfs.mp4 with the renditions P144p30fps16x9;P360p30fps16x9;P576p30fps16x9. Out of 50 streams, only 38 reported done in the console.

@j0sh
Collaborator

j0sh commented Oct 17, 2019

Thanks for the report, a couple of notes:

  • Plex reported what sounds like a similar problem during their own benchmarks. Could this be related? https://www.elpamsoft.com/?p=Plex-Hardware-Transcoding

  • It would be good to also run a similar test with ffmpeg to further isolate the issue, or at least to rule out LPMS.

  • Do we know what the hardware is theoretically capable of handling, given our real-time transcoding target? 70 concurrent 720p streams sounds high. If things work OK with a smaller number, then we can use that as a basis for a hard cap.

@darkdarkdragon
Contributor Author

Plex reported what sounds like a similar problem during their own benchmarks. Could this be related? https://www.elpamsoft.com/?p=Plex-Hardware-Transcoding

In the provided link I can't find any reported problems; can you point me to what they're reporting?

It would be good to also run a similar test with ffmpeg to further isolate the issue, or at least to rule out LPMS.

Will do.

Do we know what the hardware is theoretically capable of handling, given our real-time transcoding target? 70 concurrent 720p streams sounds high. If things work OK with a smaller number, then we can use that as a basis for a hard cap.

For me it hung with eight GPUs, two streams per GPU, and four renditions (P144p30fps16x9;P360p30fps16x9;P576p30fps16x9;P720p30fps16x9) per stream.
That looks very underloaded.

@j0sh
Collaborator

j0sh commented Oct 17, 2019

In the provided link I can't find any reported problems; can you point me to what they're reporting?

@darkdarkdragon See the note on "Exceeding VRAM":

Exceeding the VRAM usage will cause new transcodes to buffer indefinitely, even once VRAM has dropped below the maximum.

For me it hung with eight GPUs, two streams per GPU, and four renditions (P144p30fps16x9;P360p30fps16x9;P576p30fps16x9;P720p30fps16x9) per stream.
That looks very underloaded.

So just 8 encodes per GPU? Curious, is there a pattern to which streams complete, and which remain incomplete?

Does it make a difference if the streams themselves are staggered a bit, e.g. started ~500ms-1s after one another?
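
For reference, a minimal Go sketch of what that staggering could look like in a benchmark driver; transcodeStream here is a hypothetical placeholder standing in for the existing per-stream transcode call, not actual bench.sh code:

package main

import (
    "sync"
    "time"
)

// transcodeStream is a hypothetical placeholder for the existing
// per-stream transcoding logic driven by the benchmark.
func transcodeStream(streamIdx int) {
    // ... existing transcoding logic for one stream ...
}

// runStaggered starts numStreams concurrent transcodes, delaying each
// start by `stagger` instead of launching them all at once.
func runStaggered(numStreams int, stagger time.Duration) {
    var wg sync.WaitGroup
    for i := 0; i < numStreams; i++ {
        wg.Add(1)
        go func(idx int) {
            defer wg.Done()
            transcodeStream(idx)
        }(i)
        time.Sleep(stagger) // e.g. 500ms-1s between stream starts
    }
    wg.Wait()
}

func main() {
    runStaggered(10, 750*time.Millisecond)
}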

@darkdarkdragon
Contributor Author

See the note on "Exceeding VRAM":

I thought that was something specific to the Plex client. Also, they're talking about new transcodes; in our case the transcode hangs halfway through.

So just 8 encodes per GPU? Curious, is there a pattern to which streams complete, and which remain incomplete?

How can I distinguish between streams?
But there is a pattern: last time, out of 70 total output files, 7 were stuck at the same point (same file size).
That was one GPU, 10 streams with 7 renditions (the same 720p renditions) each.
So it looks like a single stream did not complete; yet none of the streams reported completion (exited the Transcode3 function).

@j0sh
Collaborator

j0sh commented Oct 17, 2019

How can I distinguish between streams?

I usually have a file naming convention. Something like: out_<device_idx>_<sequence>_<rendition>.ts and so forth.
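
As an illustration only (not the actual benchmark code), a tiny Go helper that composes such names instead of using random ones:

package main

import "fmt"

// outName encodes which device, stream, and rendition produced a file,
// e.g. "out_0_3_P720p30fps16x9.ts". Illustrative sketch only.
func outName(deviceIdx, streamIdx int, rendition string) string {
    return fmt.Sprintf("out_%d_%d_%s.ts", deviceIdx, streamIdx, rendition)
}

func main() {
    fmt.Println(outName(0, 3, "P720p30fps16x9"))
}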

So it looks like a single stream did not complete

Ah, interesting. Is this consistently reproducible? Does it stop working at the same place each time? Do more streams make the issue worse? Anything in the logs?

@darkdarkdragon
Contributor Author

I usually have a file naming convention.

I'm using random names; I just took a piece of code from our transcoder.go.

Is this consistently reproducible?

It hangs in probably 90% of cases. Once, the same test with 10 streams passed and only hung on the next iteration, but most of the time it hangs.
It hangs at different points across runs, but at the same point within one run.
I just checked old data: in two cases out of three, 7 files did not complete (so one stream), but in one case there were 21 unfinished files.

@darkdarkdragon
Contributor Author

Example results from one run:

-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_06b1ef519dc674f48992.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_081d4ade8d2a3c1ccebb.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_0b60f83f94845bf86bdb.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_127fb018986db33eeb72.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_1b177f2ea9bfbac0f796.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_1d71d5c0d404fe97a07b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_23d7ebb33bb4732ab6ab.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_273ddeb0cb7a8134829d.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_2a6e92d7eb06e602fa72.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_2ba82c0a2af65c7bac19.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_2c0563db05615117ad73.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_2c72b44bc9f7e3c7d337.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_35c62d7365775d2ff0b5.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_3780f053557fd5e0e738.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_379f4eeae702a4cab428.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_3b38dc1de25b10da2f7f.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_3e0f379c137383a7ac37.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_3f15fb903342acd8cb08.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_41a03e29addc2be05f67.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_4e7f25fb48a9c55ccd1b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_5273f496a27cb0ad2736.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_52a42c3b2e9eb8c461dc.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_575275fe839622eb1c4b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_598d0039ce6581b5ade8.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_59ff7ae27a6dd2241205.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_5bd3c660b6e005cf9327.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_5e69fd492b31b1706473.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_65ede9ca1fbdb5febdfa.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_68c1a65652abd162ae4b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_68e458e519b9a5adbc71.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_6e136ebd831afba76b0b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_6e8b023cd6143979fcbc.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_732df4c1925b798ceda5.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_75d4b95583f334572acf.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_7b0b183433193ebfa81f.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_7e32f629d36554c32ff6.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_88d71c01e64b13bc0b51.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_8a5c12640211854c659f.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_8d8e10a69b9d02abed1a.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_98c62c6a8fe2bb6df0aa.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_9b6eb73feab0a27a2018.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_9f863b0f2288a5991e14.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_9fe95c8e6d2451e76285.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_a3d185993ee929c04be7.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_a6d9127d4a77b8e04114.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_ab04342228b924fd758e.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_ae8563ac3e870255f72f.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_afcb68ad60d011e99832.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b0833441295c9ea414c3.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_b13f3529ee054b5fcd5c.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b398704392443f53ca8a.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b405241d4f059ef2eb82.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b740a463b3c0e8410f5c.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b7aa00e25b32ff3234a1.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_c6c9d9248e6634f9a519.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_c89d6ff988fd434f8837.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_ca2fb40ef950c0eff0d8.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_d5fb385bf48e2a0f6666.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_d639398ee79d95e80eb9.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_d6f153c2442063dd1fa2.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_dc5945ab712c1f033e2a.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_dd4ce8d97ce55f4f0ab7.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_de78db1b1716946c7eb5.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_e799a345c04a0cc76444.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_f47cb2d6f6298e1a6160.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_f555b25b8deb7394cb16.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_f88c514aacede3e92ffa.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_f908a59eb870b14ffa59.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_faafbc9d2b7eb76e5f13.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_fcdebd9c16b987d707b6.ts

@darkdarkdragon
Contributor Author

@j0sh can I enable logging inside the ffmpeg libraries we're using?
Or I could put printfs into the LPMS C code...
Given that the files for one stream got stuck at the same point, it's probably something in the decoding (or the filters?).
But I don't understand why the other streams are not reporting completion.

@j0sh
Collaborator

j0sh commented Oct 17, 2019

I'm using random names; I just took a piece of code from our transcoder.go.

For the benchmarking, wouldn't it be useful to have a more consistent naming scheme to help diagnose issues like these?

enable logging inside the ffmpeg libraries we're using?

Either:
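
For illustration, one common way to raise the libav* log level from Go via cgo (a sketch under the assumption that the binary already links against libavutil; not necessarily the approach intended above):

package main

/*
#cgo pkg-config: libavutil
#include <libavutil/log.h>
*/
import "C"

import "fmt"

// enableFFmpegDebugLogging bumps libav* verbosity to DEBUG so that
// decoder setup/teardown (including the nvdec/cuvid paths) is traced to stderr.
func enableFFmpegDebugLogging() {
    C.av_log_set_level(C.AV_LOG_DEBUG)
}

func main() {
    enableFFmpegDebugLogging()
    fmt.Println("libav log level set to DEBUG; run the transcode benchmark now")
}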

@darkdarkdragon
Contributor Author

@j0sh and it is not a problem of exceeding VRAM: I've just looked at the VRAM consumption, and it was using around 7M of memory on a card with 16M.

@j0sh
Collaborator

j0sh commented Oct 17, 2019

it was using around 7M of memory on a card with 16M.

Just 7MB? Is that right?

Also, how is the measuring done? nvidia-smi dmon and pmon only show the instantaneous utilization of VRAM (e.g., how much VRAM was actually read and written during the polling interval), while nvidia-smi by itself shows the absolute amount of VRAM in use on each card, broken down per process.

From the Plex issue, it seems to have more to do with throughput than with the amount of memory in use. Anyway, agreed that it's hard to say whether it's the same issue we're having, but it seemed like an interesting and somewhat similar data point.

@darkdarkdragon
Contributor Author

Just 7MB? Is that right?

Just the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    56W / 250W |   7423MiB / 16280MiB |     37%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2117      C   /tmp/go-build655338158/b001/exe/livepeer    7411MiB |
+-----------------------------------------------------------------------------+

@darkdarkdragon
Contributor Author

Traced it down to this line:

https://github.com/FFmpeg/FFmpeg/blob/f7f4691f9f745f6a087879ab855dd65f9f16879d/libavcodec/nvdec.c#L162

It enters cuvidDestroyDecoder (a function in the Nvidia library) and never exits.
So this is either:

  1. A bug in the Nvidia drivers
  2. FFmpeg messing something up
  3. The Go runtime messing with threads/locks

Next I'll try Nvidia's beta driver.

hangstack.txt

@j0sh
Collaborator

j0sh commented Oct 22, 2019

Possibly related: cuda-memcheck is indicating issues within cuvidDestroyDecoder. This is also present in ffmpeg itself, which would seem to exclude issues on the golang / cgo side.

If the beta driver doesn't help, it would be interesting to see if the hang is also reproducible with ffmpeg itself.

cuda-memcheck with a custom program shows this:

========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuCtxSynchronize + 0x12d) [0x245a8d]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x2a380]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x1981e]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 (cuvidDestroyDecoder + 0x64) [0x17974]
=========     Host Frame:/home/josh/compiled/lib/libavcodec.so.58 [0x7a265]
=========     Host Frame:/home/josh/compiled/lib/libavcodec.so.58 (avcodec_close + 0x6c) [0xb0b21]
=========     Host Frame:/home/josh/compiled/lib/libavcodec.so.58 (avcodec_free_context + 0x19) [0x59d4b9]
=========     Host Frame:./a.out [0x4f5d]
=========     Host Frame:./a.out [0x7ffd]
=========     Host Frame:./a.out [0x838e]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76db]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x3f) [0x12188f]

This also occurs with ffmpeg:

$ cuda-memcheck ffmpeg -hwaccel cuvid -hwaccel_device 1 -c:v h264_cuvid -i in/short_testharness.m3u8 -c:a copy -c:v h264_nvenc -vf fps=30,scale_cuda=w=1280:h=720 -f null - -c:v h264_nvenc -vf fps=30,scale_cuda=1024:h=576 -f null -
...
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuCtxSynchronize + 0x12d) [0x245a8d]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x2a380]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x1981e]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 (cuvidDestroyDecoder + 0x64) [0x17974]
=========     Host Frame:/home/josh/compiled/lib/libavcodec.so.58 [0x7a265]
=========     Host Frame:/home/josh/compiled/lib/libavcodec.so.58 (avcodec_close + 0x6c) [0xb0b21]
=========     Host Frame:ffmpeg [0x2f71b]
=========     Host Frame:ffmpeg [0xbbee]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:ffmpeg [0xbdba]
=========
========= ERROR SUMMARY: 1 error
