Add PSNR (Y/U/V) for outbound-rtp #794

Open · wants to merge 4 commits into base: main

Conversation

@fippo (Contributor) commented Feb 5, 2025

This is similar to qpSum but codec-independent.

Since PSNR requires additional computation it is defined with an
accompanying psnrMeasurements counter to allow the computation of
an average PSNR.

Defined as a record with components for the Y, U and V planes respectively.

See also
https://datatracker.ietf.org/doc/html/rfc8761#section-5


Preview | Diff
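(For illustration only, a minimal sketch, assuming the psnrSum record with y/u/v members and the psnrMeasurements counter proposed in this PR, of how an application could read an average Y-PSNR from the outbound-rtp stats:)

// Sketch: average Y-PSNR for a video sender, assuming the proposed
// psnrSum / psnrMeasurements members are present on the outbound-rtp stats.
async function averageYPsnr(sender) {
  const report = await sender.getStats();
  const outboundRtp = [...report.values()].find(({type}) => type === "outbound-rtp");
  // Both members are optional; the current encoder may not support PSNR.
  if (!outboundRtp?.psnrSum || !outboundRtp.psnrMeasurements) return undefined;
  return outboundRtp.psnrSum.y / outboundRtp.psnrMeasurements;
}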

fippo added 3 commits January 23, 2025 10:09
This is similar to qpSum but codec-independent.
Since PSNR requires additional computation it is defined with an
accompanying psnrMeasurements counter to allow the computation of
an average PSNR.

Defined as three components for the Y, U and V planes respectively.

See also
  https://datatracker.ietf.org/doc/html/rfc8761#section-5
@fippo (Contributor, Author) commented Feb 5, 2025

@henbos ^

@henbos (Collaborator) commented Feb 6, 2025

This needs to be presented at the next virtual interim. Youenn mentions he'd like to hear about the use case for the metric.

@fippo (Contributor, Author) commented Feb 6, 2025

https://www.researchgate.net/publication/383545049_Low-Complexity_Video_PSNR_Measurement_in_Real-Time_Communication_Products has a whole paper about this.

tl;dr is "qp is codec-dependent", PSNR is not (but it comes at a cost, hence this cannot be a simple sum).

@youennf the folks who implemented https://developer.apple.com/documentation/videotoolbox/kvtcompressionpropertykey_calculatemeansquarederror?changes=l_4_8 might be able to tell you more too.

cc @taste1981

@vr000m (Contributor) commented Feb 18, 2025

I have not read the paper, is a preprint available?

But based on my own interactions with PSNR, there is a source and a decoded image. Is the measurement on the outbound-rtp related to the source and the encoded image, i.e., the PSNR due to the encoder?

@fippo (Contributor, Author) commented Feb 18, 2025

This one is encoder PSNR, not scaling PSNR. Scaling PSNR would end up living on media-source in stats.
I'll have a copy of the paper sent to you

Review thread on the proposed spec text:

The PSNR is defined in [[ISO-29170-1:2017]].
</p>
<p class="note">
The frequency of PSNR measurements is outside the scope of this specification.
Contributor:

We should recommend that, perceptually, Y-PSNR is the one to use. PSNR values can also vary based on motion and encoder settings.

Typically, average Y-PSNR is calculated by dividing by the number of frames.

Contributor:

that division is done by the application.

@vr000m (Contributor) commented Feb 18, 2025

The other thing that I am curious about is whether the PSNR requires decoding the encoded video, or whether it is calculated as part of the encoder operation. I am mainly thinking of the impact on CPU if it requires some kind of decode step; I wonder if this is only calculated for I-frames or huge frames.

@alvestrand (Contributor):

@sprangerik have you looked at this?

@jan-ivar (Member):

@jesup thoughts?

@sprangerik (Contributor):

I am supportive of this.

For context, see also https://webrtc-review.googlesource.com/c/src/+/368960

@sprangerik (Contributor):

The other thing that I am curious about is whether the PSNR requires decoding the encoded video, or whether it is calculated as part of the encoder operation. I am mainly thinking of the impact on CPU if it requires some kind of decode step; I wonder if this is only calculated for I-frames or huge frames.

The idea is to have it as part of the encoder process. The encoder is by definition also a decoder, so it can directly use both the raw input and reconstructed state without penalty. The actual PSNR calculation will of course often incur an extra CPU hit, unless it is already a part of e.g. a rate-distortion aware rate controller - but that's not often the case for real-time encoders. That's why it's proposed to limit the frequency of PSNR calculations.

This of course means the user cannot count on PSNR metrics being populated. Even for a given stream, the PSNR values might suddenly disappear if e.g. there is a software/hardware switching event and only one implementation supports PSNR output.
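(For intuition only, not part of the PR: a minimal sketch of the per-plane computation described above, assuming 8-bit samples and two equally sized planes, the raw input and the encoder's reconstructed output.)

// Sketch: PSNR of a single 8-bit plane (e.g. Y), given the raw input samples
// and the encoder's reconstructed samples: PSNR = 10 * log10(255^2 / MSE).
function planePsnr(rawPlane, reconstructedPlane) {
  let sumSquaredError = 0;
  for (let i = 0; i < rawPlane.length; i++) {
    const diff = rawPlane[i] - reconstructedPlane[i];
    sumSquaredError += diff * diff;
  }
  const mse = sumSquaredError / rawPlane.length;
  return mse === 0 ? Infinity : 10 * Math.log10((255 * 255) / mse);
}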

@vr000m (Contributor) commented Feb 22, 2025

Since webrtc-stats aggregates values anyway, we could do a sumPsnr and countFrames, i.e., each time a PSNR is calculated it is added and the corresponding frame count counter goes up. If it is done for all frames, we would not need a frame counter.

@dontcallmedom-bot:

This issue was discussed in WebRTC February 2025 meeting – (#794 Add PSNR)

@Drekabi commented Feb 27, 2025

This seems like a nice feature that could have a few uses. I do wonder if it could be a separate API instead of part of the outbound RTP stats. My initial concerns are that this data would be calculated regardless of whether the application is even interested in it, and that there is no specification or recommendation on the frequency of measurements.

Some pros and cons that come to mind if this were implemented as a separate API instead.
Pros:

  • This would allow for those interested in the information to request a PSNR directly.
  • Eliminates the need to constantly be collecting PSNR measurements that may not be used.
  • It would allow for applications to have some control over the frequency of PSNR measurements.

Cons:

  • Applications would not get the PSNR sum for free and would need to calculate and store it on their own.

If this should remain in the stats could we consider adding some sort of getStats object to enable logging for this kind of data?

@alvestrand (Contributor):

WebRTC users routinely log getStats data, so adding this would not be any big overhead. If the stats are collected on a timescale of seconds, the overhead is usually negligible. (polling stats for every frame is not a good idea).
I presume that the stats would not be populated if not implemented or enabled; I don't know if it's worth it to expose a control to turn computation on and off on a track.

@jan-ivar (Member):

If I understand correctly, the overhead concern is about the browser doing an expensive calculation most websites would never request (though per-frame is not an issue due to caching, per-second might be; is never an acceptable frequency?)

How expensive is this computation? Our webstats model is like a boat we keep loading with new stuff. Eventually, it becomes problematic.

At some point (maybe now?) might we wish we had something like this?

await sender.getStats({verbosity: "high"}) // low | medium (default) | high

@henbos (Collaborator) commented Feb 27, 2025

I don't think WebRTC has to do these measurements very often for the PSNR measurements to be valuable, and if they aren't done very often (say, every second or every few seconds) then I don't think we need to make API changes.

A similar example is that if you negotiate corruption-detection we do corruptionMeasurements, but since we only make these once per second they don't have any significant performance implications compared to the rest of the decoding pipeline.

@henbos (Collaborator) commented Feb 27, 2025

Btw, this is unrelated to the polling frequency, since the metrics only update when a measurement is made, and a measurement happens in the background whether or not the app is polling getStats. (Polling getStats several times per second is bad because of the overhead of that call, not because of counters incrementing in the background.)

@jan-ivar (Member):

One concern is this stat seems to require making two getStats calls over some interval.

E.g. is the use case here to try one encoder setting, get stats, then wait 1 second and call getStats again expecting two different measurements? If so, this might cause a divide-by-zero error in one browser but not another.

@henbos (Collaborator) commented Feb 27, 2025

All metrics in the getStats API are used like so: "delta foo / delta bar". That is true whether it is a rate (delta bytesSent / delta timestamp), a measurement thingy (delta totalCorruptionProbability / delta corruptionMeasurements), something more exotic like (delta qpSum / delta framesEncoded), or even (jitterBufferDelay / jitterBufferEmittedCount).

I could go on with more examples, but "divide by zero" is something that the user of this API should be aware of.

@henbos (Collaborator) commented Feb 27, 2025

qpSum / framesDecoded might be a better example of a foot gun, since that could fail when the network glitches but not in a stable environment.

In practice, web developers will write helper functions that do lookups of deltas and rates, taking care of the foot guns.

Also, you have to be prepared for a metric suddenly not being present.
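(For illustration, a hypothetical helper of that kind, guarding the "delta foo / delta bar" pattern against a zero denominator:)

// Sketch: "delta numerator / delta denominator" between two snapshots of the
// same stats object, returning undefined instead of dividing by zero.
function deltaRatio(prev, curr, numeratorKey, denominatorKey) {
  const deltaDenominator = curr[denominatorKey] - prev[denominatorKey];
  if (!deltaDenominator) return undefined;
  return (curr[numeratorKey] - prev[numeratorKey]) / deltaDenominator;
}
// e.g. deltaRatio(prevOutboundRtp, currOutboundRtp, "qpSum", "framesEncoded")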

@jan-ivar (Member):

Yes, I didn't mean to suggest the divide-by-zero hazard was limited to this API. The difference is that something like framesDecoded is tied to real-world reception of RTP, which won't vary by user agent.

I think the concern in this case is:

  1. lack of cap on frequency might lead two browsers to implement vastly different intervals, e.g. 1 vs 15 seconds.
  2. whether an anticipated use case is A/B codec testing where a website might tune its interval to the fastest browser

@fippo (Contributor, Author) commented Feb 27, 2025

PSNR is similar to qp so having it in getStats makes sense.

As the paper says, we have done this at a frequency higher than once per second on devices where battery consumption is a concern, and it works there. Hardware encoder support makes this even "cheaper". Note that the calculation is done by the encoder, so it cannot be triggered by calling getStats with some magic option. I considered whether it was possible to gate it on the corruption-detection RTP header extension, but that would have been quite awkward since it is not closely related (not without precedent though, quite a few statistics depend on header extensions).

When I say "A/B testing", consider a project like Jitsi moving to AV1, in particular the "Metrics Captured" section which, unsurprisingly, relies on getStats. See here for how one uses PSNR to evaluate when it is available.

Such experiments are designed not to compare 🍎 to 🍌 (different browsers, different operating systems) so letting a UA decide on sampling frequency is not a concern as long as it does so consistently.

@henbos (Collaborator) commented Feb 28, 2025

I think the concern in this case is:

  1. lack of cap on frequency might lead two browsers to implement vastly different intervals, e.g. 1 vs 15 seconds.
  2. whether an anticipated use case is A/B codec testing where a website might tune its interval to the fastest browser

Polling is just asking "do you have any new measurements for me?"
Whether I poll once per 15 seconds and get that measurement on the first try, or I poll getStats 15 times and get 14 "no"s and 1 "yes", I think I end up in the same situation, so what exactly was the problem there (and why is this different from the other hundred+ metrics we have)?

It doesn't matter whether the app polling interval and the browser polling interval align, and it's clear from the guidelines that there is no control over the sampling period. So I would argue that the only thing that matters is whether the measurements arrive at a granular enough level to be useful.

If the concern is that a browser implementer doesn't know what a useful measurement interval is, maybe we can provide some guidance there, but I fail to see the interop issue with different polling intervals that are all within a "useful" range. FTR I think 15 seconds is too large an interval since a lot can happen in that period of time.

@fippo (Contributor, Author) commented Feb 28, 2025

I would not even poll getStats for A/B testing purposes. One would typically either poll periodically and use the last result, or call getStats explicitly before closing the peerconnection, and then calculate the average PSNR as psnrSum_{y,u,v} / psnrMeasurements. Only calls with enough psnrMeasurements should be taken into account, which one needs to do irrespective of sampling frequency to exclude "short calls".

(While we are rambling: it seems Firefox throws when calling getStats on a closed peerconnection, which is not my understanding of #3. Arguably, with all the transceivers gone, all the interesting stats disappear nowadays.)
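(A minimal sketch of that end-of-call pattern; MIN_PSNR_MEASUREMENTS is a hypothetical application-chosen threshold for excluding short calls:)

// Sketch: end-of-call average PSNR per plane, assuming the proposed psnrSum /
// psnrMeasurements members; calls with too few measurements are discarded.
const MIN_PSNR_MEASUREMENTS = 10; // hypothetical application-chosen threshold
async function endOfCallAveragePsnr(sender) {
  const report = await sender.getStats();
  const outboundRtp = [...report.values()].find(({type}) => type === "outbound-rtp");
  if (!outboundRtp?.psnrSum || (outboundRtp.psnrMeasurements || 0) < MIN_PSNR_MEASUREMENTS) {
    return undefined; // unsupported or a "short call"; exclude from the analysis
  }
  const {psnrSum, psnrMeasurements} = outboundRtp;
  return {
    y: psnrSum.y / psnrMeasurements,
    u: psnrSum.u / psnrMeasurements,
    v: psnrSum.v / psnrMeasurements,
  };
}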

@jan-ivar (Member):

I think the concern in this case is:
...
2. whether an anticipated use case is A/B codec testing where a website might tune its interval to the fastest browser

Polling is just asking "do you have any new measurements for me?" Whether I poll once per 15 seconds and get that measurement on the first try or if I poll getStats 15 times and get 14 "no's" and 1 "yes", I think I end up in the same situation, so what exactly was the problem there

That's fine for telemetry. I think our concern was more someone making runtime decisions off stats, e.g.:

// probe and switch to best codec for media being sent right now:
let bestCodec, bestY = 0;
for (const codec of sender.getParameters().codecs) {
  const params = sender.getParameters();
  params.encodings[0].codec = codec;
  await sender.setParameters(params);
  await wait(1000);
  const ortp1 = [...(await sender.getStats()).values()].find(({type}) => type == "outbound-rtp");
  await wait(1000);
  const ortp2 = [...(await sender.getStats()).values()].find(({type}) => type == "outbound-rtp");
  const y = (ortp2.psnrSum.y - ortp1.psnrSum.y) / (ortp2.psnrMeasurements - ortp1.psnrMeasurements);
  if (bestY < y) { bestY = y; bestCodec = codec; }
}
const params = sender.getParameters();
params.encodings[0].codec = bestCodec;
await sender.setParameters(params);

(and why is this different from the other hundred+ metrics we have)?

No other stat has this note:
[screenshot of the proposed note: "The frequency of PSNR measurements is outside the scope of this specification."]

FTR I think 15 second is too large of an interval since a lot can happen in that period of time.

This might be good for an implementer to know.

@fippo (Contributor, Author) commented Feb 28, 2025

I can replace that with "is implementation-defined" linking to https://infra.spec.whatwg.org/#implementation-defined

And copy @sprangerik's great "these metrics should primarily be used as a basis for statistical analysis rather than be used as an absolute truth on a per-frame basis"

@henbos (Collaborator) commented Mar 3, 2025

We could say that PSNR measurement frequency is implementation-specific but that "the user agent SHOULD make PSNR measurements no less frequently than every X seconds, if PSNR measurements are supported for the current encoder implementation". I still think the app needs to handle the case where a PSNR measurement is not available for a given encoder implementation or browser, but this would give the app an upper bound on how long to wait.

@fippo (Contributor, Author) commented Mar 3, 2025

From what I can see, browser implementations would boil down to the same line of code in libWebRTC for software encoders, no? None of the parameters for video encoding are "specified", so why do we need to be prescriptive here?

updated along the lines of #794 (comment)

Comment on lines +2248 to +2250
<p>
The PSNR is defined in [[ISO-29170-1:2017]].
</p>
Member:

This seems redundant with the same text under psnrSum (line 2235), which is already linked to here.

Suggested change
<p>
The PSNR is defined in [[ISO-29170-1:2017]].
</p>

Comment on lines +2251 to +2255
<p class="note">
PSNR metrics should primarily be used as a basis for statistical analysis rather
than be used as an absolute truth on a per-frame basis.
The frequency of PSNR measurements is [=implementation-defined=].
</p>
@jan-ivar (Member) commented Mar 3, 2025:

We want to avoid "should" and normative statements in non-normative notes. If we want to be normative we should lift it out.

Based on #794 (comment) are we comfortable picking a min frequency of every 5 seconds?

(Edit: added "or the encoding frame rate whichever is lower")

Suggested change
<p class="note">
PSNR metrics should primarily be used as a basis for statistical analysis rather
than be used as an absolute truth on a per-frame basis.
The frequency of PSNR measurements is [=implementation-defined=].
</p>
<p>
If the current encoder supports taking PSNR measurements, their
frequency SHOULD be no less than every 5 seconds or the
encoding frame rate, whichever is lower.
</p>
<p class="note">
This allows for testing. PSNR measurements are intended for
statistical analysis, and aren't expected to be accurate down
to a frame.
</p>

Contributor Author:

Can you explain why 5 seconds? keyFramesEncoded and keyFramesDecoded are examples of very-low-frequency events that already happen, where you cannot make an assumption about the minimum interval between two getStats calls that will give you an increase. Same for packetsLost.

Member:

What value would you like? The frequency of keyFramesEncoded and keyFramesDecoded is determined by external factors unlikely to vary by user agent.

Member:

5 is arbitrary. In this thread I heard 15 seconds was too high, and that 1 second was not too expensive, but also that we'd rather not overconstrain implementations too much. 10 seconds?

This value, whatever we pick, would go into WPT and also give web developers who can't wait for some reason a minimum time to wait to be interoperable.

Member:

Hmm, maybe we need to also qualify that the encoder is actually encoding something? E.g. if the track is muted or a canvas track, then the frame rate may be less than what value we pick. Maybe we add "... or the encoding frame rate whichever is lower"?

Contributor Author:

tl;dr: this is a parameter of the video encoder configuration. It becomes observable in stats just like others.
What the spec should say is "implementation defined" - the implied warning is "do not compare apples to oranges"

The frequency of keyFramesEncoded is a good example; it is determined by encoder settings such as GOP size. One wants this as large as possible for good performance, but this is something where the encoder can be tuned without "interoperability" constraints since the decoder behavior is required to be flexible.

If one wants to get fancy about PSNR one may need to take into account things like "is this a screen sharing track" (higher resolution, lower frame rate and wholly different content) or frame rate (which may depend on BWE). Worth the effort for statistical analysis? Unlikely.

15 seconds was arbitrary too, I think, intended as "I don't think this makes sense anymore". The one second picked in the code is 3x the value from the paper (which has a parameter study) and hence 1/3rd as expensive in terms of power impact.

Member:

A SHOULD implies implementation-defined. Is there a value that would be satisfactory?

@alvestrand (Contributor) commented Mar 13, 2025

From the editors' meeting: the main reason for a number seems to be feature detection - people want to know that if they have waited this long between calls to getStats, and the frame counter has increased by X, and the stat doesn't show up, then the feature is off.

However, feature detection can be done by insisting that the stat be visible with value zero if supported.

@youennf (Contributor) commented Mar 13, 2025

If we have the PSNR measurement frequency be implementation-defined, the spec should help the UA developer select a good frequency value. The spec should add some guidelines, for instance doing measurements every second or every 30 frames or so.

@fippo (Contributor, Author) commented Mar 13, 2025

From the editors' meeting: the main reason for a number seems to be feature detection - people want to know that if they have waited this long between calls to getStats, and the frame counter has increased by X, and the stat doesn't show up, then the feature is off.

Huh? This is a counter, it cannot disappear, it can only stop increasing.

@sprangerik (Contributor):

If we have the PSNR measurement frequency be implementation-defined, the spec should help the UA developer select a good frequency value. The spec should add some guidelines, for instance doing measurements every second or every 30 frames or so.

I think the gist of it is that we want the frequency to be as high as possible as long as the performance impact can be kept negligible. Perhaps we can add some text to that effect. Trying to give guidelines in terms of the expected frequency doesn't seem helpful to UA developers - but on the other hand those developers are probably in the best position to determine at which frequency the performance hit becomes non-negligible for their particular encoder implementations.

@jan-ivar (Member):

I think the gist of it is that we want the frequency to be as high as possible as long as the performance impact can be kept negligible.

That's great, because I don't think anyone is proposing a ceiling on frequency. A floor would be nice though.

I disagree that giving guidance is bad, since what you wrote sounds like good guidance already. Let's write it down.

@fippo (Contributor, Author) commented Mar 15, 2025

As we say in German, "Guter Rat ist teuer" ("good advice is hard to come by"). The guidance would be "talk to your video encoding guys". That is what "implementation-defined" is for, no?

@youennf (Contributor) commented Mar 17, 2025

"Talk to your video encoding guys" seems to imply that the rate selection may be encoder-specific.

This naturally leads to privacy questions if the selection of the rate depends on the encoder, the system load, the CPU...
I only see benefits in the spec defining the implementor playground.

@fippo (Contributor, Author) commented Mar 17, 2025

if the selection of the rate depends on the encoder, the system load, the CPU...

Who is proposing this?

This naturally leads to privacy questions

How is this a new concern, given that video codecs in WebRTC already adapt to CPU, thermal state, etc.? That adaptation is not described by any specification either. Note that PSNR should already be possible to compute with JS and existing APIs.

And you are most welcome to invite privacy guys into your meetings obviously.

9 participants