Replies: 17 comments 21 replies
-
I'll add that I'm running Ubuntu 22.04 Server with the standard repos and a self-compiled UG. As you can see, the GPU is barely taxed.
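For reference, the GPU and NVENC load can be watched live with plain nvidia-smi while UltraGrid runs (nothing UltraGrid-specific, just the stock driver tooling):

```
# Print per-second utilization, including the enc (NVENC) and dec (NVDEC) columns
nvidia-smi dmon -s u
```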
-
Hi Alan, I think it is still all about the RGB->YCbCr conversion being slow -- it is done on the CPU and could perhaps be optimized (the low-hanging fruit would be to avoid bounds checking). However, I am wondering whether it is really needed -- I've tested with a 1080 Ti, which should have a similar chip to the P6000 (GP102 family), and it currently supports 10-bit RGB natively (tested on current Debian 11 with distro drivers):
MediaInfo on a key frame agrees that it is 10-bit RGB, and decoding also seems to work to R10k without any further options. I assume the conversion there is not hard-coded in the SIP but done using CUDA.
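If it helps anyone verify the same thing, the bit depth and pixel format of the encoded stream can also be checked with ffprobe (assuming the stream was saved to a file, here called out.hevc):

```
# Show the codec, pixel format and bit depth of the first video stream
ffprobe -v error -select_streams v:0 \
  -show_entries stream=codec_name,pix_fmt,bits_per_raw_sample out.hevc
```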
-
Thanks for the info, Martin. I was using the x265/FFmpeg libraries provided by Ubuntu 22.04, which I don't think have been updated to 5.x yet. I'll try the AppImage and also the rsavoury PPAs, which he keeps up to date.
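A quick way to confirm which FFmpeg/x265 the system is actually picking up, before and after switching to the PPA or the AppImage (standard distro tooling; package names may differ slightly):

```
ffmpeg -version | head -n 1               # FFmpeg release in PATH (4.x vs 5.x)
dpkg -l | grep -E 'libavcodec|libx265'    # packaged library versions
```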
-
Hi Martin, I got access to a T1000, so an even newer architecture than the P6000.
See the results below.
-
./UltraGrid-continuous-x86_64.AppImage -t testcard:size=3840x2160:fps=24:codec=R10k -c libavcodec:encoder=hevc_nvenc -VVV
Display device : none
Display initialized-none
[1658944579.938] [testcard] capture set to 3840x2160 @24.00p, codec R10k, bpp 4, pattern: bars, audio on
[1658944580.013] [lavc] Using codec: H.265, encoder: hevc_nvenc
[lavc] Codec supported pixel formats: yuv420p nv12 p010le yuv444p p016le yuv444p16le bgr0 bgra rgb0 rgba x2rgb10le x2bgr10le gbrp gbrp16le cuda
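As a side note, the pixel formats an FFmpeg build advertises for hevc_nvenc can also be listed directly, which is handy for comparing the distro build with the AppImage's bundled libavcodec:

```
# List hevc_nvenc capabilities, including its supported pixel formats
ffmpeg -hide_banner -h encoder=hevc_nvenc
```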
-
So yeah, there definitely seems to be some bottleneck in libavcodec that seriously prevents full utilization of any codec. Even x265 caps out at exactly the same performance numbers and only uses about 6-7 of the 40 cores available on this machine. At this point it seems impossible to get realtime UHD encoding.
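One way to check whether the ceiling is in libavcodec/x265 itself rather than in UltraGrid's capture and conversion path would be to benchmark the encoder in isolation with plain ffmpeg. A rough sketch (testsrc2 stands in for the capture source, and a 10-bit x265 build is assumed):

```
# Synthetic UHD 4:4:4 10-bit source fed straight into libx265, no capture involved
ffmpeg -benchmark -f lavfi -i testsrc2=size=3840x2160:rate=24 \
  -pix_fmt yuv444p10le -c:v libx265 -preset ultrafast \
  -t 30 -f null -
```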
-
Now take a look at this.... every 20th frame incurs a significant compression penalty. I suppose those are the I-frames.
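If it is useful to confirm that, the per-frame picture types and sizes can be dumped from a saved stream with ffprobe (out.hevc is a placeholder for the recorded bitstream):

```
# Print the picture type (I/P/B) and encoded size of every frame
ffprobe -v error -select_streams v:0 \
  -show_entries frame=pict_type,pkt_size -of csv out.hevc
```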
-
I'm looking back at my notes from 18 months ago, when NVENC was realtime. There may have been a regression in libavcodec since then. This is from January 2021.
-
OK... I know this is a lot of replies, but check this out. I downloaded a bunch of your archived continuous builds. Using the closest one to Jan. 2021, I ran the same command and compared it to the most recent build. While still not realtime, it is way faster. The next thing I will try is reverting to Ubuntu 20.04 instead of 22.04.
Old build:

New build:
./UltraGrid-continuous-x86_64.AppImage --tool uv -t decklink:codec=R10k -c libavcodec:encoder=hevc_nvenc:bitrate=40M --param lavc-use-codec=yuv444p16le --verbose
Display device : none
Display initialized-none
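For what it's worth, this kind of old-vs-new comparison can be scripted once both AppImages are local. A sketch, with placeholder file names for the two builds:

```
# Run each build for 30 seconds with the same settings and compare the reported FPS
for build in ./UltraGrid-2021-01-x86_64.AppImage ./UltraGrid-continuous-x86_64.AppImage; do
  echo "== $build =="
  timeout 30 "$build" --tool uv -t decklink:codec=R10k \
    -c libavcodec:encoder=hevc_nvenc:bitrate=40M \
    --param lavc-use-codec=yuv444p16le --verbose
done
```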
-
Hi, my notes:
-
Hi Martin, I'm using the latest continuous AppImage, which should include the delay=2 fix you made.
If I add delay=10, there is almost no change:
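For anyone wanting to isolate the effect of that option outside UltraGrid: delay here presumably maps to hevc_nvenc's delay AVOption, which can be A/B tested with plain ffmpeg, e.g.:

```
# Same synthetic source, only the NVENC output delay changed between runs
ffmpeg -benchmark -f lavfi -i testsrc2=size=3840x2160:rate=24 \
  -pix_fmt yuv444p16le -c:v hevc_nvenc -delay 10 -b:v 40M \
  -t 30 -f null -
```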
-
OK... so I compiled the latest FFmpeg master with DeckLink 12.4 support on Ubuntu 22.04. The summary is that it works and gets realtime. See below.
1) I'm not sure about the ramifications of the **full chroma interpolation for destination format 'x2rgb10le' not yet implemented** warning. Does that lead to less accurate color?
2) This gets realtime, is full 16-bit RGB, and has no warning, but it captures a green image... see attached. I haven't figured out a way to get 16-bit RGB and a non-green image.
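On point 1, if the warning is triggered by the conversion to x2rgb10le specifically, one hedged thing to try is targeting a different high-bit-depth format from the list hevc_nvenc advertises earlier in this thread (gbrp16le or yuv444p16le), which avoids that particular swscale path:

```
# Same pipeline, but with a planar 16-bit RGB intermediate instead of x2rgb10le
./UltraGrid-continuous-x86_64.AppImage -t decklink:codec=R10k \
  -c libavcodec:encoder=hevc_nvenc:bitrate=40M \
  --param lavc-use-codec=gbrp16le --verbose
```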
-
Hi, could you please try it with the current code (commit at least f67aa7c)? I am getting around 60 FPS for 2160p R10k video (yuv444p16le and /implicit/ gbrp16le; x2rgb10le doesn't seem to work, as indicated above, so it is not enabled) with default parameters and a 40 Mbps bitrate. Of course it depends on the content. It turned out that the buffer needs to be allocated by the codec itself (see the commit; perhaps it uses something like CUDA-pinned memory), otherwise performance is degraded.
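A minimal sketch of picking up that commit for a retest, assuming the usual autotools build of UltraGrid from the CESNET repository (configure options and the resulting binary path may differ per setup):

```
git clone https://github.com/CESNET/UltraGrid.git
cd UltraGrid
git checkout f67aa7c
./autogen.sh && make -j"$(nproc)"
# Rerun the earlier test with the freshly built uv binary (path is build-dependent)
./bin/uv -t testcard:size=3840x2160:fps=24:codec=R10k \
  -c libavcodec:encoder=hevc_nvenc:bitrate=40M --verbose
```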
-
I haven't yet looked at the actual output of the stream, but the stats are positive.
-
OK... so even though the encoder is realtime (see attached), I tried just about every pixel format and none produced a proper image. Even a standard one ... Also, if I omit ...
-
Encoder:

Receiver:

Test 1: GPU decode
Encoder -

Receiver -

What is worse: if the stream is not being sent when the receiver starts, or if the encoder is stopped with Ctrl-C and then started again, even immediately, it seems as if the decoder reverts to software decode -- the fans spin up and the FPS drops to ~21, which is the same result as if I omit the force cuvid param, even though the logs still indicate cuvid decode. You are probably not seeing this behavior since you are using an all-in-one command.
This is the decode time when the stream is already going and the receiver is then started:

And here it is if the stream is restarted after the receiver is launched, or if the initial stream starts after the receiver is launched:

Logs attached as:
#############################################################################################################
Test 2: SW decode
Same encoder command as before -

Receiver -

Now, looking at the log output with -VVV, the decompress is not the bottleneck here, as each frame only takes about 9 ms, yet I'm still only getting ~21 fps.
Now, in a real-world scenario, the application we use to output to SDI outputs RGB 12-bit 4:4:4, because its RGB 10-bit 4:4:4 mode only outputs limited range; 12-bit is full range. I've asked the company why, since with AJA the 10-bit mode can be set to full range, and they stated it is a BMD SDK limitation. When giving UG SDI 12-bit 4:4:4 with NVENC HEVC, even though we call ...
Additionally, I am using SDL due to the poor performance of GL. Even using the disable-10b param, GL is not able to provide realtime, even at HD/2K resolution. I'd prefer to use GL with the 10-bit option since it provides superior color representation.
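To pin down whether the receiver really drops back to software decode in those cases, it may help to benchmark the two decode paths on the same saved bitstream outside UltraGrid (out.hevc is a placeholder):

```
# Hardware decode via NVDEC (cuvid wrapper)
ffmpeg -benchmark -c:v hevc_cuvid -i out.hevc -f null -
# Software decode of the same stream for comparison
ffmpeg -benchmark -c:v hevc -i out.hevc -f null -
```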
-
The NVENC slowness is fixed.
-
Hello,
I am testing NVENC HEVC with a Quadro P6000, which is on the NVENC support matrix as supporting HEVC 10-bit encode. My input signal is 12G SDI UHD RGB 12-bit 4:4:4 into a BMD 8K. It seems that the color space conversions are still starving the encoder, resulting in non-realtime performance. Please see the following examples.
Let's start off basic with very few parameters:
Here we see that after the first partial interval it maintains a consistent realtime rate of ~24 fps. The problem is that it is encoding using bgr0, which is 8-bit RGB. We need 10-bit, though (YUV is acceptable).
OK, so now we add --param lavc-use-codec=yuv444p16le so that it uses a more accurate colorspace.
Here our performance is just as bad as with software CPU encoding.
Now add preset=p1 (the fastest preset) and we gain 2 fps, but it is still far from realtime.
Now add tune=ull (ultra-low latency), useful for streaming. Maybe we've gained 0.5 fps.
And finally, if I add a bunch of stuff, we get close, but still not realtime:
I feel like when I tested this workflow ~2 years ago I was able to get it realtime with ease. Am I missing something now? I would have thought that with NVENC this would be no issue. The only thing I can think of is that the R10k -> yuv444p16le conversion is starving the GPU encoder. Any thoughts?
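For readers following along, here is a hedged sketch of the progression described above, reconstructed only from the parameter names mentioned in this thread (the exact command lines were in the attachments, and the placement of preset/tune inside the -c option string is an assumption); uv stands for however UltraGrid is launched, AppImage or native binary:

```
# Baseline: DeckLink R10k capture into hevc_nvenc (ends up as 8-bit bgr0)
uv -t decklink:codec=R10k -c libavcodec:encoder=hevc_nvenc

# Force a 10/16-bit pixel format
uv -t decklink:codec=R10k -c libavcodec:encoder=hevc_nvenc \
   --param lavc-use-codec=yuv444p16le

# Add the fastest preset and ultra-low-latency tuning (syntax assumed)
uv -t decklink:codec=R10k \
   -c libavcodec:encoder=hevc_nvenc:preset=p1:tune=ull \
   --param lavc-use-codec=yuv444p16le
```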
Thanks,
Alan