---
layout: post
title: "GBO notes: MVDR beamforming"
tags: [""]
mathjax: true
---
In a previous note, we described the process of mask estimation using complex angular central
Gaussian mixture models (CACGMMs), which are used in guided source separation (GSS). Mask
estimation means computing the activity $\gamma_{t,f,k}$ of each speaker $k$ at each time-frequency bin $(t,f)$.
Of course, CACGMMs are not the only mask estimation method. Recently, it has become quite popular to use neural networks to estimate the masks. However, such networks require a fixed number of speakers at training and test time. In subsequent notes, we will see continuous speech separation methods which circumvent this problem and make neural network based mask estimation feasible.
In any case, once the speaker-specific masks have been estimated, we still need to extract each speaker's audio from the mixture (which was the task in the first place). In this note, we will describe a popular method for doing this, known as mask-based MVDR (minimum variance distortionless response) beamforming. This discussion is based on Erdogan et al.
Again, our problem formulation is the same as the [previous post]({% post_url 2022-04-12-gbo-gss %}),
where we have (unit normalized and dereverberated) mixture STFT features
$$ \tilde{\mathbf{Y}}_{t,f} = \sum_k \mathbf{X}_{t,f,k} + \mathbf{N}'_{t,f}, $$
where $\mathbf{X}_{t,f,k}$ is the STFT of speaker $k$'s source signal and $\mathbf{N}'_{t,f}$ is the background noise. From the perspective of a target speaker $k$, we can rewrite this as
$$ \begin{aligned} \tilde{\mathbf{Y}}_{t,f} &= \mathbf{X}_{t,f,k} + \sum_{k'\neq k} \mathbf{X}_{t,f,k'} + \mathbf{N}'_{t,f} \\ &= \mathbf{X}_{t,f,k} + \mathbf{N}_{t,f}, \end{aligned} $$
where we have combined the interfering speakers and the background noise into a single noise term $\mathbf{N}_{t,f}$.
We currently have the masks $\gamma_{t,f,k}$, and we want to use them to estimate the target signal $\mathbf{X}_{t,f,k}$ from the mixture.
This beamforming technique is similar to the simpler delay-and-sum method, which I will first summarize.
The idea is that if the signal from the same source is captured by several receivers, they all
record the same waveform, differing only in delay and phase. So, if a different delay is applied
to each input signal depending on the location of the microphone, the source signals captured at
all the microphones become "in-phase", while the additive noise stays out of phase. Adding all the
signals and normalizing by the number of channels will then suppress the noise in the signal.
Mathematically, for a linear microphone array consisting of $D$ microphones, the beamformed output is

$$ \hat{x}(t) = \frac{1}{D} \sum_{d=1}^{D} w_d\, y_d(t - \tau_d), $$

where $y_d$ is the signal captured at microphone $d$, $\tau_d$ is the delay applied to channel $d$ (determined by the array geometry and the source direction), and $w_d$ is a per-channel amplitude weight.
Note that in the above equation, the amplitude weight appears in the time domain; in practice, it is computed in the frequency domain and multiplied with the signal in that domain.
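To make this concrete, here is a minimal numpy sketch of a frequency-domain delay-and-sum beamformer. The function name, the `(D, T, F)` tensor layout, and the assumption that the per-channel delays and weights are known in advance are mine, for illustration only:

```python
import numpy as np

def delay_and_sum(Y, delays, weights, sample_rate, n_fft):
    """Frequency-domain delay-and-sum beamformer (minimal sketch).

    Y       : (D, T, F) complex STFT of the D-channel mixture,
              with F = n_fft // 2 + 1 (one-sided spectrum)
    delays  : (D,) per-channel delays tau_d in seconds, assumed known
              from the array geometry and the source direction
    weights : (D,) per-channel amplitude weights w_d
    """
    D = Y.shape[0]
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)  # (F,)
    # A delay tau_d in time is the phase factor exp(-2j*pi*f*tau_d) in
    # frequency, so multiplying by the conjugate phase re-aligns the channels.
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])  # (D, F)
    aligned = weights[:, None, None] * steering[:, None, :] * Y       # (D, T, F)
    # The in-phase source sums coherently; out-of-phase noise averages out.
    return aligned.sum(axis=0) / D                                    # (T, F)
```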
Given the masks $\gamma_{t,f,k}$, we can compute the spatial covariance matrix of each source as
$$ \mathbf{\Phi}_k(f) = \frac{1}{T} \sum_{t} \gamma_{t,f,k} \tilde{\mathbf{Y}}_{t,f} \tilde{\mathbf{Y}}_{t,f}^H, $$
for the target speaker $k$. The noise covariance $\mathbf{\Phi}_N(f)$ is computed analogously using the noise mask, i.e., the combined activity of the interfering speakers and the background noise. The MVDR filter is then given by
$$ \mathbf{h}(f) = \frac{\mathbf{\Phi}_N^{-1}(f) \mathbf{\Phi}_k(f) \mathbf{e}_{\mathrm{ref}}}{\mathrm{tr}\left(\mathbf{\Phi}_N^{-1}(f) \mathbf{\Phi}_k(f)\right)}, $$
where $\mathbf{e}_{\mathrm{ref}}$ is a one-hot vector selecting a reference channel and $\mathrm{tr}(\cdot)$ denotes the matrix trace. Finally, the enhanced signal is obtained by applying the filter to the mixture:
$$ \hat{\mathbf{X}}_{t,f} = \tilde{\mathbf{Y}}_{t,f} \cdot \mathbf{h}(f), $$
where $\cdot$ is the dot product. It is clear that the filter vector is constant across all time frames and varies only with frequency, i.e., the beamformer is time-invariant.
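Putting the last three equations together, here is a minimal numpy sketch of the mask-based MVDR beamformer. The tensor layout, the diagonal loading term `eps`, and the default noise mask `1 - mask_k` are assumptions for illustration, not part of the formulation above:

```python
import numpy as np

def mvdr_beamform(Y, mask_k, mask_n=None, ref_channel=0, eps=1e-8):
    """Mask-based MVDR beamformer (minimal sketch of the equations above).

    Y       : (D, T, F) complex STFT of the multichannel mixture
    mask_k  : (T, F) target-speaker mask gamma_{t,f,k}
    mask_n  : (T, F) noise mask; defaults to 1 - mask_k (an assumption)
    """
    D, T, F = Y.shape
    if mask_n is None:
        mask_n = 1.0 - mask_k

    def spatial_cov(mask):
        # Phi(f) = (1/T) sum_t mask_{t,f} * Y_{t,f} Y_{t,f}^H -> (F, D, D)
        return np.einsum("tf,atf,btf->fab", mask, Y, Y.conj()) / T

    Phi_k = spatial_cov(mask_k)  # target covariance
    Phi_n = spatial_cov(mask_n)  # noise covariance
    # Diagonal loading keeps Phi_N invertible (a numerical safeguard,
    # not part of the equations in the post).
    Phi_n = Phi_n + eps * np.eye(D)[None]

    # Phi_N^{-1}(f) Phi_k(f) for every frequency, shape (F, D, D).
    num = np.linalg.solve(Phi_n, Phi_k)
    trace = np.trace(num, axis1=1, axis2=2)      # (F,)
    # Multiplying by the one-hot e_ref just selects one column.
    h = num[:, :, ref_channel] / trace[:, None]  # (F, D)

    # Apply the time-invariant filter as a dot product over channels,
    # following the post's convention X_hat_{t,f} = Y_{t,f} . h(f).
    X_hat = np.einsum("fd,dtf->tf", h, Y)
    return X_hat, h
```

Note that `np.linalg.solve` is used instead of forming $\mathbf{\Phi}_N^{-1}(f)$ explicitly, which is both faster and more numerically stable.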