Add PSNR (Y/U/V) for outbound-rtp #794
base: main
Conversation
This is similar to qpSum but codec-independent. Since PSNR requires additional computation, it is defined with an accompanying psnrMeasurements counter to allow the computation of an average PSNR. Defined as three components for the Y, U and V planes respectively. See also https://datatracker.ietf.org/doc/html/rfc8761#section-5
@henbos ^ |
This needs to be presented at the next virtual interim. Youenn mentions he'd like to hear about use cases of the metric. |
https://www.researchgate.net/publication/383545049_Low-Complexity_Video_PSNR_Measurement_in_Real-Time_Communication_Products has a whole paper about this. tl;dr is "qp is codec dependent", PSNR is not (but comes at a cost, hence this cannot be a simple sum). @youennf the folks who implemented https://developer.apple.com/documentation/videotoolbox/kvtcompressionpropertykey_calculatemeansquarederror?changes=l_4_8 might be able to tell you more too. cc @taste1981 |
I have not read the paper; is a preprint available? But based on my own interactions with PSNR, there is a source and a decoded image. Is the measurement on the outbound-rtp related to the source and the encoded image, i.e., the PSNR due to the encoder? |
This one is encoder PSNR, not scaling PSNR. Scaling PSNR would end up living on media-source in stats. |
webrtc-stats.html
Outdated
The PSNR is defined in [[ISO-29170-1:2017]].
</p>
<p class="note">
The frequency of PSNR measurements is outside the scope of this specification.
We should recommend that, perceptually, Y-PSNR is used. And PSNR values can vary based on motion and encoder settings.
Typically, average Y-PSNR is calculated by dividing by the number of frames.
that division is done by the application.
The other thing that I am curious about is whether the PSNR requires decoding the encoded video or is calculated as part of the encoder operation. Mainly the impact on CPU: if it requires some kind of decode step, I wonder if this is only calculated for I-frames or huge frames. |
@sprangerik have you looked at this? |
@jesup thoughts? |
I am supportive of this. For context, see also https://webrtc-review.googlesource.com/c/src/+/368960 |
The idea is to have it as part of the encoder process. The encoder is by definition also a decoder, so it can directly use both the raw input and reconstructed state without penalty. The actual PSNR calculation will of course often incur an extra CPU hit, unless it is already a part of e.g. a rate-distortion aware rate controller - but that's not often the case for real-time encoders. That's why it's proposed to limit the frequency of PSNR calculations. This of course means the user cannot count on PSNR metrics being populated. Even for a given stream, the PSNR values might suddenly disappear if e.g. there is a software/hardware switching event and only one implementation supports PSNR output. |
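To make the CPU-cost discussion concrete, here is a minimal illustrative sketch (not from the PR or from libWebRTC) of the per-plane PSNR computation an encoder could run against its own reconstructed output, assuming 8-bit samples with peak value 255:

```javascript
// Illustrative sketch: PSNR of one plane (e.g. Y) between the raw input
// and the encoder's reconstructed output. Assumes 8-bit samples (peak 255).
function planePsnr(original, reconstructed) {
  let sse = 0;
  for (let i = 0; i < original.length; i++) {
    const d = original[i] - reconstructed[i];
    sse += d * d; // sum of squared errors
  }
  const mse = sse / original.length;
  if (mse === 0) return Infinity; // identical planes
  return 10 * Math.log10((255 * 255) / mse);
}
```

The cost is O(width × height) per sampled frame, which is why the proposal limits how often measurements are taken rather than computing one for every frame.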
since webrtc-stats does aggregate values anyway, we could do a sumPsnr and countFrames, i.e., each time a PSNR is calculated it is added to the sum and the corresponding frame count counter goes up. If it is done for all frames, we would not need a frame counter |
This issue was discussed in WebRTC February 2025 meeting – (#794 Add PSNR) |
This seems like a nice feature that could have a few uses. I do wonder if it could be a separate API instead of part of the outbound RTP stats. My initial concerns are calculating this data regardless of whether the application is even interested in it, and the lack of any specification or recommendations on the frequency of measurements. Some pros and cons come to mind if this were implemented as a separate API instead.
Cons:
If this should remain in the stats could we consider adding some sort of getStats object to enable logging for this kind of data? |
WebRTC users routinely log getStats data, so adding this would not be any big overhead. If the stats are collected on a timescale of seconds, the overhead is usually negligible. (polling stats for every frame is not a good idea). |
If I understand correctly, the concern of overhead is in the browser doing an expensive calculation most websites would never request (though per-frame is not an issue due to caching, per-second might be; is never an acceptable frequency?). How expensive is this computation? Our web stats model is like a boat we keep loading with new stuff. Eventually, it becomes problematic. At some point (maybe now?) might we wish we had something like this?
await sender.getStats({verbosity: "high"}) // low | medium (default) | high |
I don't think WebRTC has to do these measurements very often for the PSNR measurements to be valuable and if they aren't done very often (say every second or every several seconds) then I don't think we need to make API changes. A similar example is that if you negotiate corruption-detection we do corruptionMeasurements, but since we only make these once per second they don't have any significant performance implications compared to the rest of the decoding pipeline. |
Btw this is unrelated to the polling frequency since the metrics only update when a measurement is made and a measurement happens in the background whether or not the app is polling getStats. (Polling getStats several times per second is bad because of the overhead of that call, not because of counters incrementing in the background) |
One concern is that this stat seems to require making two getStats calls over some interval. E.g. is the use case here to try one encoder setting, get stats, then wait 1 second and call getStats again expecting two different measurements? If so this might cause a divide-by-zero error in one browser but not another. |
All metrics in the getStats API are used like so: "delta foo / delta bar", that is true whether it is a rate (delta bytesSent / delta timestamp), or a measurement thingy (delta totalCorruptionProbability / delta corruptionMeasurements) or something more exotic like (delta qpSum / delta framesEncoded) or even (jitterBufferDelay / jitterBufferEmittedCount). I could go on with more examples but "divide by zero" is something that the user of this API should be aware of |
qpSum / framesDecoded might be a better example of a foot gun since that could fail when the network glitches but not in a stable environment. In practice web developers will write helper functions that do lookups of deltas and rates, taking care of foot guns. Also you have to be prepared for a metric not being present all of a sudden |
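Such a helper might look like this sketch (the function name and shape are hypothetical, not from any spec): it returns undefined instead of dividing by zero when no new measurements arrived, and likewise when the metric is absent.

```javascript
// Hypothetical helper: average of a summed metric between two stats
// snapshots, guarding against the divide-by-zero and missing-metric
// foot guns discussed above.
function deltaAverage(prev, curr, sumKey, countKey) {
  if (prev[sumKey] === undefined || curr[sumKey] === undefined) {
    return undefined; // metric unsupported or not yet populated
  }
  const dCount = curr[countKey] - prev[countKey];
  if (!(dCount > 0)) return undefined; // no new measurements this interval
  return (curr[sumKey] - prev[sumKey]) / dCount;
}
```

Usage would follow the "delta foo / delta bar" pattern from above, e.g. deltaAverage(ortp1, ortp2, "qpSum", "framesEncoded").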
Yes, I didn't mean to suggest the divide-by-zero hazard was limited to this API. The difference is, I think, the concern in this case is: |
PSNR is similar to qp so having it in getStats makes sense. As the paper says, we have done this at a frequency higher than once per second on devices where battery consumption is a concern and it works there. Hardware encoder support makes this "cheaper" even. Note that the calculation is done by the encoder so it cannot be triggered by calling getStats with some magic option. I considered whether it was possible to gate it on the corruption detection RTP header extension but that would have been quite awkward since it is not closely related (not without precedent; quite a few statistics depend on header extensions). When I say "A/B testing", consider a project like Jitsi moving to AV1, in particular the "Metrics Captured" section which, unsurprisingly, relies on getStats. See here for how one uses PSNR to evaluate when it is available. Such experiments are designed not to compare 🍎 to 🍌 (different browsers, different operating systems) so letting a UA decide on sampling frequency is not a concern as long as it does so consistently. |
Polling is just asking "do you have any new measurements for me?" It doesn't matter if the app polling interval and the browser polling interval align or not, and it's clear from the guidelines that there is no control of the sampling period. So I would argue that the only thing that matters is whether the measurements are arriving at a granular enough level to be useful. If the concern is that a browser implementer doesn't know what a useful measurement interval is, maybe we can provide some guidance there, but I fail to see the interop issue with different polling intervals that are all within a "useful" range. FTR I think 15 seconds is too large an interval since a lot can happen in that period of time. |
I would not even poll getStats for A/B testing purposes. One would typically poll periodically and use the last result, or call getStats explicitly before closing the peerconnection, and then calculate the average PSNR as psnrSum_{y,u,v}/psnrMeasurements. Only calls with enough psnrMeasurements should be taken into account, which one needs to do irrespective of sampling frequency to exclude "short calls". (While we are rambling: it seems Firefox throws when calling getStats on a closed peerconnection, which is not my understanding of #3; arguably, with all the transceivers gone, all the interesting stats disappear nowadays) |
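That end-of-call computation could be sketched as follows. The psnrSum/psnrMeasurements field names are from this PR's proposal; the threshold value is an arbitrary illustrative choice, not from any spec.

```javascript
// End-of-call average PSNR per plane, discarding "short calls" that have
// too few measurements. MIN_MEASUREMENTS is a hypothetical threshold.
const MIN_MEASUREMENTS = 10;
function averagePsnr({ psnrSum, psnrMeasurements }) {
  if (!psnrSum || !(psnrMeasurements >= MIN_MEASUREMENTS)) {
    return undefined; // unsupported, or too short a call to be meaningful
  }
  return {
    y: psnrSum.y / psnrMeasurements,
    u: psnrSum.u / psnrMeasurements,
    v: psnrSum.v / psnrMeasurements,
  };
}
```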
That's fine for telemetry. I think our concern was more someone making runtime decisions off stats, e.g.:
// probe and switch to best codec for media being sent right now:
let bestCodec, bestY = 0;
for (const codec of sender.getParameters().codecs) {
  const params = sender.getParameters();
  params.encodings[0].codec = codec;
  await sender.setParameters(params);
  await wait(1000);
  const ortp1 = [...(await sender.getStats()).values()].find(({type}) => type == "outbound-rtp");
  await wait(1000);
  const ortp2 = [...(await sender.getStats()).values()].find(({type}) => type == "outbound-rtp");
  const y = (ortp2.psnrSum.y - ortp1.psnrSum.y) / (ortp2.psnrMeasurements - ortp1.psnrMeasurements);
  if (bestY < y) {
    bestY = y;
    bestCodec = codec;
  }
}
const params = sender.getParameters();
params.encodings[0].codec = bestCodec;
await sender.setParameters(params);
This might be good for an implementer to know. |
I can replace that with "is implementation-defined" linking to https://infra.spec.whatwg.org/#implementation-defined And copy @sprangerik's great "these metrics should primarily be used as a basis for statistical analysis rather than be used as an absolute truth on a per-frame basis" |
We could say that PSNR measurement frequency is implementation-specific but that "the user agent SHOULD make PSNR measurements no less frequently than every X seconds, if PSNR measurements are supported for the current encoder implementation". I still think the app needs to handle the case that a PSNR measurement is not available for a given encoder implementation or browser, but this would give the app an upper bound on how long to maximally wait. |
From what I can see, browser implementations would boil down to the same line of code in libWebRTC for software encoders, no? None of the parameters for video encoding are "specified", so why do we need to be prescriptive here? Updated along the lines of #794 (comment) |
<p>
The PSNR is defined in [[ISO-29170-1:2017]].
</p>
This seems redundant with the same text under psnrSum (line 2235), which is already linked to here.
<p>
The PSNR is defined in [[ISO-29170-1:2017]].
</p>
<p class="note">
PSNR metrics should primarily be used as a basis for statistical analysis rather
than be used as an absolute truth on a per-frame basis.
The frequency of PSNR measurements is [=implementation-defined=].
</p>
We want to avoid "should" and normative statements in non-normative notes. If we want to be normative we should lift it out.
Based on #794 (comment) are we comfortable picking a min frequency of every 5 seconds?
(Edit: added "or the encoding frame rate whichever is lower")
<p class="note">
PSNR metrics should primarily be used as a basis for statistical analysis rather
than be used as an absolute truth on a per-frame basis.
The frequency of PSNR measurements is [=implementation-defined=].
</p>
<p>
If the current encoder supports taking PSNR measurements, their
frequency SHOULD be no less than every 5 seconds or the
encoding frame rate, whichever is lower.
</p>
<p class="note">
This allows for testing. PSNR measurements are intended for
statistical analysis, and aren't expected to be accurate down
to a frame.
</p>
Can you explain why 5 seconds? keyFramesEncoded and keyFramesDecoded are examples of very-low-frequency events happening already, where you cannot make an assumption about the minimum interval between two getStats calls that will give you an increase. Same for packetsLost.
What value would you like? The frequency of keyFramesEncoded and keyFramesDecoded is determined by external factors unlikely to vary by user agent.
5 is arbitrary. In this thread I heard 15 seconds was too high, and that 1 second was not too expensive, but also that we'd rather not overconstrain implementations too much. 10 seconds?
This value, whatever we pick would go into WPT and also give web developers that can't wait for some reason a minimum time to wait to be interoperable.
Hmm, maybe we need to also qualify that the encoder is actually encoding something? E.g. if the track is muted or a canvas track, then the frame rate may be less than what value we pick. Maybe we add "... or the encoding frame rate whichever is lower"?
tl;dr: this is a parameter of the video encoder configuration. It becomes observable in stats just like others.
What the spec should say is "implementation defined" - the implied warning is "do not compare apples to oranges"
The frequency of keyFramesEncoded is a good example: it is determined by encoder settings such as GOP size. One wants this as large as possible for good performance, but this is something where the encoder can be tuned without "interoperability" constraints since the decoder behavior is required to be flexible.
If one wants to get fancy about PSNR, one may need to take into account things like "is this a screen sharing track" (higher resolution, lower frame rate and wholly different content) or frame rate (which may depend on BWE). Worth the effort for statistical analysis? Unlikely.
15 seconds was arbitrary too, I think with the purpose of "I don't think this makes sense anymore". One second picked in the code is 3x the value from the paper (which has a parameter study) and hence 1/3rd as expensive in terms of power impact.
A SHOULD implies implementation-defined. Is there a value that would be satisfactory?
From editors meeting: the main reason for a number seems to be feature detection - people want to know that if they have waited this long between calls to getStats, and the frame counter has increased by X, and the stat doesn't show up, the feature is off. However, feature detection can be done by insisting that the stat is visible, with value zero, if supported. |
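Under that model, feature detection reduces to a presence check. A sketch, assuming the psnrMeasurements counter proposed in this PR (the helper name is hypothetical):

```javascript
// If the feature is supported, the counter is present (possibly still zero);
// if it is absent, PSNR output is off for this encoder implementation/browser.
function psnrSupported(outboundRtpStats) {
  return outboundRtpStats.psnrMeasurements !== undefined;
}
```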
If we make the PSNR measurement frequency implementation-defined, the spec should help the UA developer select a good frequency value. The spec should add some guidelines, for instance doing measurements every second or every 30 frames or so. |
Huh? This is a counter; it cannot disappear, it can only stop increasing. |
I think the gist of it is that we want the frequency to be a high as possible as long as the performance impact can be kept negligible. Perhaps we can add some text to that effect. Trying to give guidelines in terms of the expected frequency doesn't seem helpful to UA developers - but on the other hand those developers are probably in the best position to determine at which frequency the performance hit becomes non-negligible for their particular encoder implementations. |
That's great, because I don't think anyone is proposing a ceiling on frequency. A floor would be nice though. I disagree that giving guidance is bad, since what you wrote sounds like good guidance already. Let's write it down. |
As we say in German, "Guter Rat ist teuer" (good advice is hard to come by). The guidance should be "talk to your video encoding guys". That is what "implementation-defined" is for, no? |
This naturally leads to privacy questions if the selection of the rate depends on the encoder, the system load, the CPU... |
Who is proposing this?
How is this a new concern given video codecs in WebRTC already are adapting to CPU, thermal state, etc? Which is not described by any specification either. Note that PSNR should be possible to compute with JS and existing APIs already. And you are most welcome to invite privacy guys into your meetings obviously. |