---
layout: post
title: "GBO notes: Mask estimation for GSS"
tags:
mathjax: true
---
Guided source separation (GSS) is an unsupervised algorithm for target speech extraction, first proposed in the Paderborn submission to the CHiME-5 challenge. Given a noisy (and reverberant) multi-channel recording containing multiple speakers, and a time-annotated segment where a desired speaker is active, GSS solves the task of extracting (relatively) clean audio of the desired speaker in the segment, while removing interference in the form of background noise or overlapping speakers.
Note: I have recently implemented a GPU-accelerated version of GSS which can be used for datasets other than CHiME-5 (such as LibriCSS or AMI). It can be found here.
The overall GSS method contains 3 stages:
- Dereverberation using WPE
- Mask estimation using CACGMMs
- Mask-based MVDR beamforming
In this note, I will focus specifically on stage 2, i.e., mask estimation using CACGMMs. We will look at the WPE and MVDR components in other notes.
Let
$$ \mathbf{Y}_{t,f} = \sum_k \mathbf{X}_{t,f,k}^{\mathrm{early}} + \sum_k \mathbf{X}_{t,f,k}^{\mathrm{tail}} + \mathbf{N}_{t,f}, $$
where $\mathbf{Y}_{t,f}$ is the multi-channel STFT of the recording at time frame $t$ and frequency bin $f$, $\mathbf{X}_{t,f,k}^{\mathrm{early}}$ and $\mathbf{X}_{t,f,k}^{\mathrm{tail}}$ are the early and late (tail) reverberation components of speaker $k$, and $\mathbf{N}_{t,f}$ is the background noise. The WPE component (stage 1 of the method) estimates and removes the late reverberation term $\sum_k \mathbf{X}_{t,f,k}^{\mathrm{tail}}$, so mask estimation effectively operates on the sum of the early components and the noise.
The mask estimation technique is based on the "sparsity assumption", which says that only one speaker is active in each time-frequency (T-F) bin. Under this assumption, the observation vector in each T-F bin can be modeled as being generated from a mixture model, where each component of the mixture belongs to a different speaker.
In the case of GSS, each mixture component is a complex angular central Gaussian (CACG). This can seem like a loaded term, but let us break it down. It is similar to a standard multivariate normal distribution, except for 2 things: (i) it models complex-valued random variables instead of real-valued ones (which is useful for us since STFTs are complex-valued), and (ii) it distributes the random variable on a unit hypersphere, i.e., it only models the direction of the vector, not its magnitude.
Recall that a standard multivariate Gaussian is characterized by a mean vector $\boldsymbol{\mu}$ and a covariance matrix $\boldsymbol{\Sigma}$. In the case of a CACG, since it is zero-centered, we only have one parameter: a Hermitian positive-definite matrix $\mathbf{B}$, which plays a role analogous to the covariance matrix.
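For reference, the CACG density for a unit-norm vector $\mathbf{z} \in \mathbb{C}^D$ (where $D$ is the number of microphones) has the following standard form from the literature; only the $|\mathbf{B}|^{-1}(\mathbf{z}^{\mathrm{H}}\mathbf{B}^{-1}\mathbf{z})^{-D}$ part will matter for the EM updates below:

$$ \mathcal{A}(\mathbf{z}; \mathbf{B}) = \frac{(D-1)!}{2\pi^{D}\,|\mathbf{B}|}\left(\mathbf{z}^{\mathrm{H}}\mathbf{B}^{-1}\mathbf{z}\right)^{-D}. $$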
The CACGMM is then given as a mixture of CACG components as follows:
$$ p(\tilde{\mathbf{Y}}_{t,f}) = \sum_k \pi_{f,k}\, \mathcal{A}(\tilde{\mathbf{Y}}_{t,f}; \mathbf{B}_{f,k}), $$
where $\tilde{\mathbf{Y}}_{t,f} = \mathbf{Y}_{t,f}/\|\mathbf{Y}_{t,f}\|$ is the observation normalized to unit length, $\pi_{f,k}$ are the mixture weights, and $\mathbf{B}_{f,k}$ is the CACG parameter of component $k$ in frequency bin $f$.
At this point, it may seem like we can simply run the EM algorithm on the CACGMM independently in each frequency bin to estimate its parameters. But there are two problems:
- The same mixture component $k$ may correspond to different speakers in different frequency bins, leading to the well-known permutation problem.
- We do not know the number of mixture components.
This is where the "guided" part of GSS comes in: if we have external guidance in the form of speaker-level time annotations (either oracle or computed using a diarizer), we can use it to (i) fix the global speaker order, and (ii) fix the number of mixture components. We denote the speaker-time annotations as $a_{t,k} \in \{0, 1\}$, indicating whether speaker $k$ is active at time $t$. These activities make the mixture weights time-varying, i.e., we use $\pi_{t,f,k}$ instead of $\pi_{f,k}$, and set $\pi_{t,f,k} = 0$ whenever speaker $k$ is inactive at time $t$, which resolves both problems at once.
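To make this concrete, here is a minimal NumPy sketch of how time-varying weights could be built from the annotations. The array shapes and the always-active noise class here are my own assumptions, mirroring the usual GSS setup rather than any specific implementation:

```python
import numpy as np

# Speaker activity from annotations or a diarizer: K classes x T frames.
# The last row is a noise class, which is kept active in every frame.
activity = np.array([
    [1, 1, 1, 0, 0],   # speaker 0
    [0, 0, 1, 1, 1],   # speaker 1
    [1, 1, 1, 1, 1],   # noise class
], dtype=float)

# Time-varying mixture weights: uniform over the classes active in each
# frame, and exactly zero for inactive speakers.
pi = activity / activity.sum(axis=0, keepdims=True)  # shape (K, T)
```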
Now we are ready to apply the EM algorithm to learn the CACGMM. The E-step involves computing the state posteriors in each time-frequency bin as:
$$ \gamma_{t,f,k} = \frac{\pi_{t,f,k}\,|\mathbf{B}_{f,k}|^{-1}\left(\tilde{\mathbf{Y}}_{t,f}^{\mathrm{H}} \mathbf{B}_{f,k}^{-1} \tilde{\mathbf{Y}}_{t,f}\right)^{-D}}{\sum_{k'} \pi_{t,f,k'}\,|\mathbf{B}_{f,k'}|^{-1}\left(\tilde{\mathbf{Y}}_{t,f}^{\mathrm{H}} \mathbf{B}_{f,k'}^{-1} \tilde{\mathbf{Y}}_{t,f}\right)^{-D}}, $$

where $D$ is the number of channels.
And the M-step is:
$$ \mathbf{B}_{f,k} = D\, \frac{\sum_{t} \gamma_{t,f,k} \frac{\tilde{\mathbf{Y}}_{t,f}\, \tilde{\mathbf{Y}}_{t,f}^{\mathrm{H}}}{\tilde{\mathbf{Y}}_{t,f}^{\mathrm{H}} \mathbf{B}_{f,k}^{-1} \tilde{\mathbf{Y}}_{t,f}}}{\sum_{t} \gamma_{t,f,k}}. $$
The E and M steps are repeated for a specified number of iterations, and the final posteriors $\gamma_{t,f,k}$ are then used as the time-frequency masks.
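For concreteness, here is a minimal NumPy sketch of these EM iterations for a single frequency bin. This is an illustration under my own simplifying assumptions (identity initialization for $\mathbf{B}_{f,k}$, a small epsilon for numerical flooring, and weights kept fixed to their activity-derived values), not the actual code of any GSS implementation:

```python
import numpy as np

def cacgmm_em_single_freq(Y, pi, n_iter=20, eps=1e-10):
    """EM for a CACGMM in one frequency bin (illustrative sketch).

    Y:  (T, D) complex STFT observations for this frequency bin.
    pi: (T, K) time-varying mixture weights built from speaker activity
        (e.g., the transpose of the weights constructed earlier).
    Returns gamma: (T, K) posteriors, used as T-F masks.
    """
    T, D = Y.shape
    K = pi.shape[1]
    # Project observations onto the unit hypersphere (only the direction
    # of the vector matters for a CACG).
    Y_tilde = Y / np.maximum(np.linalg.norm(Y, axis=-1, keepdims=True), eps)
    # Initialize each B_k to the identity matrix.
    B = np.tile(np.eye(D, dtype=complex), (K, 1, 1))

    for _ in range(n_iter):
        # E-step: log-numerator of the posterior for each component,
        #   log pi_{t,k} - log|B_k| - D * log(y^H B_k^{-1} y).
        log_num = np.empty((T, K))
        quad = np.empty((T, K))
        for k in range(K):
            B_inv = np.linalg.inv(B[k])
            q = np.einsum('td,de,te->t', Y_tilde.conj(), B_inv, Y_tilde).real
            quad[:, k] = np.maximum(q, eps)
            _, logdet = np.linalg.slogdet(B[k])
            log_num[:, k] = (np.log(np.maximum(pi[:, k], eps))
                             - logdet - D * np.log(quad[:, k]))
        # Normalize over components (in the log domain for stability).
        log_num -= log_num.max(axis=1, keepdims=True)
        gamma = np.exp(log_num)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: weighted average of rank-1 outer products y y^H, each
        # scaled by its quadratic form, then multiplied by D.
        for k in range(K):
            w = gamma[:, k] / quad[:, k]
            outer = np.einsum('t,td,te->de', w, Y_tilde, Y_tilde.conj())
            B[k] = D * outer / np.maximum(gamma[:, k].sum(), eps)

    return gamma
```

Running this independently for every frequency bin (or vectorizing over $f$) produces the masks $\gamma_{t,f,k}$ that feed into the MVDR beamformer in stage 3.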
In subsequent notes, we will see how these masks can be used for target-speaker extraction from the noisy multi-speaker mixture.