date | tags |
---|---|
2020-03-23 |
paper, gan, deep-learning, vocoder, tts, speech, generative |
Congyi Wang, Yu Chen, Bin Wang, Yi Shi
Report
Year: 2021
Samples: https://anonymous1086.github.io/prlsgan-vocoder/
This paper presents a GAN-based vocoder based on MelGAN, that introduces the Pointwise Relativistic Least Square GAN (PRLSGAN) to improve the performance of the GAN-based vocoder. It is based in Relativistic GAN and it shows improvement mainly in the artifacts normally produced by GAN-based vocoders.
The authors make a great overview of different types of vocoder, to motivate the use of GAN vocoders which, among other things, are able to generate audio in parallel from a set of features (e.g. spectrogram).
After, they introduce the basic methods:
- Parallel WaveGAN (PWGAN): consists of a generator, a discriminator and a multi-resolution STFT auxiliary loss
- Multi-resolution STFT: consists of a conventional loss applied over the STFT of the signal produced by the generator.
- MelGAN: consists of a generator implementing transposed convolutions, and a multi-scale discriminator that handles audio at different levels. As loss function, it implements a feature-matching loss between generated and real audio, as well as an adversarial loss.
PRLSGAN builds upon MelGAN and implements a different loss: a combination of the MSE loss and a pointwise relative discrepancy loss that seems to increase the minimax difficulty, forcing the generator to refine the output.
The method