compute_loudness · Issue #361
The loudness calculation has been puzzling me too. @nielsrolf the potential bug you discovered looks like it could be the culprit. I think one could also just change the mean calculation to average in the linear domain: convert the per-bin decibel values back to power, take the mean, and convert the result back to decibels. I assumed we'd want to average in the linear domain and not in the log (decibel) domain.
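Concretely, the two averaging strategies can be compared side by side (a minimal sketch; the function names are mine, not from the codebase):

```python
import numpy as np

def mean_over_db(loudness_db):
    # Current behavior: average the per-bin decibel values directly.
    return np.mean(loudness_db, axis=-1)

def mean_over_power(loudness_db):
    # Proposed fix: convert dB -> power, average in the linear domain,
    # then convert back to dB.
    power = np.power(10.0, loudness_db / 10.0)
    mean_power = np.mean(power, axis=-1)
    return 10.0 * np.log10(np.maximum(1e-20, mean_power))

# Example: two bins at -10 dB and -70 dB.
bins_db = np.array([-10.0, -70.0])
print(mean_over_db(bins_db))     # -40.0
print(mean_over_power(bins_db))  # ~ -13.01: the louder bin dominates
```

Averaging in dB splits the difference between the bins, while averaging in the power domain lets the energetic bin dominate, which is closer to how loud the frame actually is.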
I tried the version that @voodoohop proposed, without padding and with a shorter FFT size. The shorter FFT size really helps to distinguish separate notes. Before that, I played around a little and implemented a less theoretically backed version, but one that in my opinion also describes the perceived loudness really well:

```python
s = stft_fn(audio, frame_size=n_fft, overlap=overlap, pad_end=False)

# Compute power.
amplitude = np.abs(s)
power = amplitude**2

# Apply A-weighting in the linear (power) domain, not in dB.
frequencies = librosa.fft_frequencies(sr=sample_rate, n_fft=n_fft)
a_weighting = librosa.A_weighting(frequencies)[np.newaxis, np.newaxis, :]
weighting = 10**(a_weighting / 10)
power = power * weighting

# Average over frequency bins, then compress with a log.
power = np.mean(power, axis=-1)
loudness = np.log(power * 100 + 1)
```
I agree with what's been said so far. The way the current code adds the A-weighting and then averages over decibels means that the loudness values are only shifted a little. Furthermore, take a look at this frequency sweep from 0 Hz to 8 kHz: this is a linear spectrogram in Audacity with an FFT size of 2048, Hann windows, and zero padding 1. Notice the ringing around the curve (I believe this is spectral leakage). Now look at the loudness features for this sweep: the blue line uses the NumPy loudness code. The orange line is the same code, but with

```python
loudness = np.mean(loudness, axis=-1)
```

replaced by

```python
loudness = np.mean(np.power(10, loudness / 10.0), axis=-1)
loudness = 10.0 * np.log10(np.maximum(1e-20, loudness))
```

so that instead of taking the mean over decibels, we convert to power, take the mean, and then convert back to decibels. This gets rid of the artifacts and produces a smooth curve, which seems to better match the smooth rise in loudness of the sweep. I suppose there's a chance that the fluctuations of the blue line help the model by giving it extra information. But that seems unlikely, because DDSP is supposed to extract loudness in the same way as "Fast and Flexible Neural Audio Synthesis", and the FFNAS appendix says (emphasis mine):
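For anyone who wants to reproduce the effect without the repo, here is a self-contained NumPy sketch of the sweep comparison (my own framing code stands in for the library's STFT, so the exact numbers will differ from the plots above):

```python
import numpy as np

# Linear sweep 0 Hz -> 8 kHz over 2 s, mirroring the Audacity example.
sr = 16000
t = np.arange(2 * sr) / sr
audio = np.sin(2 * np.pi * 2000.0 * t**2)  # instantaneous frequency 4000*t Hz

# Hann-windowed power spectrogram (FFT size 2048, hop 512).
n_fft, hop = 2048, 512
frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop]
power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1)) ** 2
db = 10.0 * np.log10(np.maximum(1e-20, power))

# "Blue line": mean over decibels (current behavior).
loud_db_mean = np.mean(db, axis=-1)

# "Orange line": mean over power, then back to decibels (the fix).
loud_pow_mean = 10.0 * np.log10(np.maximum(1e-20, np.mean(power, axis=-1)))

def jitter(x):
    # Frame-to-frame fluctuation of a loudness curve.
    return np.std(np.diff(x))
```

Comparing `jitter(loud_db_mean)` against `jitter(loud_pow_mean)` shows the difference the plots illustrate: the dB-averaged curve jitters with the leakage sidelobes, while the power-averaged curve stays smooth for this constant-amplitude sweep.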
Shall we put together a pull request from this?
The question is how to do it without breaking old models, or whether it's fine to break old models (maybe for v2.0?). Some input from the core devs would be nice. @jesseengel
We could add a gin parameter that enables the "fixed" loudness calculation, so as not to break old models.
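A sketch of what that could look like, written here as a plain keyword argument (in DDSP it would be exposed as a gin-configurable parameter; the flag name `use_fixed_loudness` is my invention):

```python
import numpy as np

def compute_loudness_mean(db_per_bin, use_fixed_loudness=False):
    """Collapse per-bin dB values to one loudness value per frame.

    use_fixed_loudness=False reproduces the old behavior (mean over dB),
    so existing checkpoints keep working; True averages in the power
    domain as discussed above.
    """
    if use_fixed_loudness:
        power = 10.0 ** (db_per_bin / 10.0)
        return 10.0 * np.log10(np.maximum(1e-20, np.mean(power, axis=-1)))
    return np.mean(db_per_bin, axis=-1)
```

Old models would simply never set the flag, while new configs opt in through gin.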
I implemented the fix as proposed by @PluieElectrique in pollinations/ddsp. A demo of the new method can be found in this colab. I added a
Hi all, sorry for the long delay on the reply. Thanks for looking into this and keeping up on it. I've caught up on the discussion and I think it makes a lot of sense. This loudness calculation was always an unnecessary headache (power seems to work just as well), and I think I got confused in the initial implementation. The colab makes this pretty clear and the curves look good. Thanks @nielsrolf!

So I'd like to incorporate the changes, and the question just becomes what is the best way to do it? It's a bit complicated because there are the existing trained DDSP models for use in the colab (trained with loudness), the models in magenta.js (trained with power), and we're trying to finish up the work on this VST plugin (also using power). There are actually 3 separate bugs in the power/loudness that I'd like to fix, so now seems as good a time as any to make a breaking change and do them all together.
Rather than create a tangled web of deprecated kwargs and constants, I think I'd prefer a breaking change (for sanity's sake). My plan is to incorporate all those changes in a PR, train some new reference models to make sure everything is okay, and swap those in for the colab notebook. Then merge the PR and push v2.0.0 to mark the breaking change. That still leaves updating the magenta.js base models and code, but I can follow up on that. If anyone has any big objections, let me know; otherwise I'm just going to go ahead with it. Old models can of course still be used, but will be pinned to older versions.
Sounds good!
Breaking Change (-> v2.0.0):

* Perform loudness averaging in linear scale, not dB.
* Remove extra square from compute power.
* Set reference db to 0.0 to align with librosa instead of magic 20.7 number for white noise reference. That way for power, dB=0.0 corresponds to amplitude = 1.0. Correspondingly, set dB range to [-80, 0] instead of [-120, 0] to align with librosa.
* Provide convenience functions for `power_to_db` and `db_to_power` that are aligned to librosa. Clearer docstrings.
* Moved db functions to core with other scaling functions.
* Tests to make sure the db functions match librosa.
* Default to `use_tf=True`.
* Moved diff() to core (not spectral specific).
* Default loudness, power, and rms to a much more reasonable frame size default of 512 (instead of 1024/2048) [better time resolution].
* Compute loudness on the fly by default (it's very fast and doesn't hurt training speed).

Still TODO: Retrain and swap out models in Colab demo.

PiperOrigin-RevId: 391661045
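The librosa-aligned conversion helpers mentioned in the change list could look roughly like this (a sketch of the convention described above, not the actual DDSP implementation):

```python
import numpy as np

def power_to_db(power, range_db=80.0):
    # With a 0.0 dB reference, power = 1.0 (amplitude = 1.0) maps to 0 dB,
    # matching librosa.power_to_db(power, ref=1.0). Values are floored
    # at -range_db, i.e. the [-80, 0] range mentioned in the change list.
    floor = 10.0 ** (-range_db / 10.0)
    return 10.0 * np.log10(np.maximum(floor, power))

def db_to_power(db):
    # Inverse of power_to_db (exact away from the floor).
    return 10.0 ** (db / 10.0)
```

So `power_to_db(1.0)` gives 0.0, very small powers clip to -80.0, and `db_to_power(power_to_db(x))` round-trips for any `x` inside the range.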
Hello,

I was looking at the loudness curves that `spectral_ops.compute_loudness` produces and found that they don't describe the perceived loudness very well. An example is shown below: the loudness has a small peak where there is no audible sound, and the curve does not show the four distinct notes as clearly as I would expect.
When looking into the code, I think I found the issue: `loudness` is a decibel value per frequency bin before taking the mean, but I believe it does not make sense to take the mean over decibel values. Instead, the A-weighting should be applied to the power spectrum (not in dB), then summed over the frequency bins, and only then converted to dB scale. As currently implemented, the A-weighting changes nothing besides adding its mean value to every time step. I plotted the result of `compute_loudness` with and without A-weighting (in the "without" case I added the weighting's mean value), and this is the result (on another sample). I also tried to look into how librosa computes loudness, but found that they don't offer this function, apparently because it is so complex; however, in this discussion there is some helpful code.
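The pipeline just described (A-weight the power spectrum, sum over bins, then convert to dB) can be sketched without librosa by computing the IEC 61672 analog A-weighting curve directly; this is a sketch with my own framing code, whereas DDSP itself uses `librosa.A_weighting` and its own STFT:

```python
import numpy as np

def a_weighting_db(frequencies):
    # IEC 61672 analog A-weighting curve in dB (the same curve
    # librosa.A_weighting evaluates), normalized to ~0 dB at 1 kHz.
    f2 = np.asarray(frequencies, dtype=np.float64) ** 2
    const = np.array([20.6, 107.7, 737.9, 12194.0]) ** 2
    num = const[3] * f2**2
    den = (f2 + const[0]) * np.sqrt((f2 + const[1]) * (f2 + const[2])) * (f2 + const[3])
    return 2.0 + 20.0 * np.log10(np.maximum(1e-30, num / den))

def loudness_db(audio, sr=16000, n_fft=2048, hop=512):
    frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    weighting = 10.0 ** (a_weighting_db(freqs) / 10.0)  # apply in the power domain
    total = np.sum(power * weighting, axis=-1)          # sum over bins first...
    return 10.0 * np.log10(np.maximum(1e-20, total))    # ...then convert to dB
```

Because the weighting is applied to linear power before the reduction over bins, it actually changes the shape of the loudness curve instead of just shifting it.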
Another unrelated small issue is that the STFT is computed with padding, which makes the loudness drop sharply at the end of every sample.
Let me know if you agree that these are issues, and whether I should work on a PR for them.