---
layout: post
title: "GBO notes: MVDR beamforming"
tags: [""]
mathjax: true
---
In a previous note, we described the process of mask estimation using complex angular central
Gaussian mixture models (CACGMMs), which are used in guided source separation (GSS). Mask
estimation means computing the activity $\gamma_{t,f,k}$ of each speaker $k$ at each time-frequency bin $(t,f)$.
Of course, CACGMMs are not the only mask estimation method. Recently, it has become quite popular to use neural networks to estimate the masks. However, such networks require a fixed number of speakers at training and test time. In subsequent notes, we will see continuous speech separation methods which circumvent this problem and make neural network based mask estimation feasible.
In any case, once the speaker-specific masks have been estimated, we still need to extract each speaker's audio from the mixture (which was the task in the first place). In this note, we will describe a popular method for doing this, known as mask-based MVDR (minimum variance distortionless response) beamforming. This discussion is based on Erdogan et al.
Again, our problem formulation is the same as the [previous post]({% post_url 2022-04-12-gbo-gss %}),
where we have (unit normalized and dereverberated) mixture STFT features
$$ \tilde{\mathbf{Y}}_{t,f} = \sum_k \mathbf{X}_{t,f,k} + \mathbf{N}'_{t,f}, $$
where $\mathbf{X}_{t,f,k}$ is the STFT of speaker $k$'s source signal and $\mathbf{N}'_{t,f}$ is the background noise. From the perspective of a target speaker $k$, we can rewrite this as
$$ \begin{aligned} \tilde{\mathbf{Y}}_{t,f} &= \mathbf{X}_{t,f,k} + \sum_{k'\neq k} \mathbf{X}_{t,f,k'} + \mathbf{N}'_{t,f} \\ &= \mathbf{X}_{t,f,k} + \mathbf{N}_{t,f}, \end{aligned} $$
where we have combined the interfering speakers and the background noise into a single noise term $\mathbf{N}_{t,f}$.
We currently have the masks $\gamma_{t,f,k}$, and we want to use them to estimate the target signal $\mathbf{X}_{t,f,k}$ from the mixture.
This beamforming technique is similar to the simpler delay-and-sum method, which I will first summarize.
The idea is that if the signal from the same source is captured by several receivers, they all
record the same waveform, differing only in delay and phase. So, if a different delay is applied
to each input signal depending on the location of the microphone, the source signals captured at
all the microphones become "in-phase", while the additive noise stays out of phase. Adding all the
signals and normalizing by the number of channels will then suppress the noise in the signal.
Mathematically, for a linear microphone array consisting of $D$ microphones, the beamformed output is

$$ \hat{x}(t) = \frac{1}{D} \sum_{d=1}^{D} w_d\, y_d(t - \tau_d), $$

where $y_d$ is the signal captured at microphone $d$, $\tau_d$ is the delay applied to channel $d$ (determined by the array geometry and the source direction), and $w_d$ is a per-channel amplitude weight.
Note that in the above equation, the amplitude weight appears in the time domain; in practice, it is computed in the frequency domain and multiplied with the signal in that domain.
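To make this concrete, here is a minimal numpy sketch of a frequency-domain delay-and-sum beamformer. The function name, the `(D, T, F)` tensor layout, and the assumption that the per-channel delays and weights are known in advance are mine, for illustration only:

```python
import numpy as np

def delay_and_sum(Y, delays, weights, sample_rate, n_fft):
    """Frequency-domain delay-and-sum beamformer (minimal sketch).

    Y       : (D, T, F) complex STFT of the D-channel mixture,
              with F = n_fft // 2 + 1 (one-sided spectrum)
    delays  : (D,) per-channel delays tau_d in seconds, assumed known
              from the array geometry and the source direction
    weights : (D,) per-channel amplitude weights w_d
    """
    D = Y.shape[0]
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)  # (F,)
    # A delay tau_d in time is the phase factor exp(-2j*pi*f*tau_d) in
    # frequency, so multiplying by the conjugate phase re-aligns the channels.
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])  # (D, F)
    aligned = weights[:, None, None] * steering[:, None, :] * Y       # (D, T, F)
    # The in-phase source sums coherently; out-of-phase noise averages out.
    return aligned.sum(axis=0) / D                                    # (T, F)
```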
Given the masks $\gamma_{t,f,k}$, we can compute the spatial covariance matrix of each source as
$$ \mathbf{\Phi}_k(f) = \frac{1}{T} \sum_{t} \gamma_{t,f,k} \tilde{\mathbf{Y}}_{t,f} \tilde{\mathbf{Y}}_{t,f}^H, $$
for the target speaker $k$. The noise covariance $\mathbf{\Phi}_N(f)$ is computed analogously using the noise mask, i.e., the combined activity of the interfering speakers and the background noise. The MVDR filter is then given by
$$ \mathbf{h}(f) = \frac{\mathbf{\Phi}_N^{-1}(f) \mathbf{\Phi}_k(f) \mathbf{e}_{\mathrm{ref}}}{\mathrm{tr}\left(\mathbf{\Phi}_N^{-1}(f) \mathbf{\Phi}_k(f)\right)}, $$
where $\mathbf{e}_{\mathrm{ref}}$ is a one-hot vector selecting a reference channel and $\mathrm{tr}(\cdot)$ denotes the matrix trace. Finally, the enhanced signal is obtained by applying the filter to the mixture:
$$ \hat{\mathbf{X}}_{t,f} = \tilde{\mathbf{Y}}_{t,f} \cdot \mathbf{h}(f), $$
where $\cdot$ is the dot product. It is clear that the filter vector is constant across all time frames and varies only with frequency, i.e., the beamformer is time-invariant.
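Putting the last three equations together, here is a minimal numpy sketch of the mask-based MVDR beamformer. The tensor layout, the diagonal loading term `eps`, and the default noise mask `1 - mask_k` are assumptions for illustration, not part of the formulation above:

```python
import numpy as np

def mvdr_beamform(Y, mask_k, mask_n=None, ref_channel=0, eps=1e-8):
    """Mask-based MVDR beamformer (minimal sketch of the equations above).

    Y       : (D, T, F) complex STFT of the multichannel mixture
    mask_k  : (T, F) target-speaker mask gamma_{t,f,k}
    mask_n  : (T, F) noise mask; defaults to 1 - mask_k (an assumption)
    """
    D, T, F = Y.shape
    if mask_n is None:
        mask_n = 1.0 - mask_k

    def spatial_cov(mask):
        # Phi(f) = (1/T) sum_t mask_{t,f} * Y_{t,f} Y_{t,f}^H -> (F, D, D)
        return np.einsum("tf,atf,btf->fab", mask, Y, Y.conj()) / T

    Phi_k = spatial_cov(mask_k)  # target covariance
    Phi_n = spatial_cov(mask_n)  # noise covariance
    # Diagonal loading keeps Phi_N invertible (a numerical safeguard,
    # not part of the equations in the post).
    Phi_n = Phi_n + eps * np.eye(D)[None]

    # Phi_N^{-1}(f) Phi_k(f) for every frequency, shape (F, D, D).
    num = np.linalg.solve(Phi_n, Phi_k)
    trace = np.trace(num, axis1=1, axis2=2)      # (F,)
    # Multiplying by the one-hot e_ref just selects one column.
    h = num[:, :, ref_channel] / trace[:, None]  # (F, D)

    # Apply the time-invariant filter as a dot product over channels,
    # following the post's convention X_hat_{t,f} = Y_{t,f} . h(f).
    X_hat = np.einsum("fd,dtf->tf", h, Y)
    return X_hat, h
```

Note that `np.linalg.solve` is used instead of forming $\mathbf{\Phi}_N^{-1}(f)$ explicitly, which is both faster and more numerically stable.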