---
layout: post
title: "GBO notes: Variational Bayes and the VBx algorithm"
tags:
mathjax: true
---
Speaker diarization is often formulated as a clustering of speaker embeddings. Conventional clustering methods such as k-means or spectral clustering ignore the sequential nature of turn-taking and cluster purely based on the similarity of the embeddings. BUT's VBx is a robust and mathematically principled alternative: it performs the clustering by modeling speakers as the latent states of an HMM and the embeddings as its emissions.
In this note, we will describe variational Bayes and the VBx algorithm. It is based on Landini et al.'s paper and on online resources about VB methods (such as this post).
Suppose there is some latent variable $\mathbf{Z}$ that generates the observed data $\mathbf{X}$ through a model $p(\mathbf{X},\mathbf{Z}) = p(\mathbf{X}\mid\mathbf{Z})p(\mathbf{Z})$.

In most such problems, we are actually looking to compute the posterior $p(\mathbf{Z}\mid\mathbf{X})$, which by Bayes' rule is

$$ p(\mathbf{Z}\mid\mathbf{X}) = \frac{p(\mathbf{X}\mid\mathbf{Z})p(\mathbf{Z})}{p(\mathbf{X})}. $$

Here, the denominator term is the marginal $p(\mathbf{X}) = \int p(\mathbf{X}\mid\mathbf{Z})p(\mathbf{Z})\,d\mathbf{Z}$, which is usually intractable since it requires integrating (or summing) over all possible configurations of $\mathbf{Z}$.

For this, we approximate the posterior by a distribution $q(\mathbf{Z})$ chosen from a tractable family so as to minimize the KL divergence from the true posterior,

$$ q^*(\mathbf{Z}) = \arg\min_{q} \text{KL}(q(\mathbf{Z})||p(\mathbf{Z}\mid\mathbf{X})), $$

where
$$ \begin{aligned} \text{KL} (q(\mathbf{Z})||p(\mathbf{Z}\mid\mathbf{X})) &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log \frac{q(\mathbf{Z})}{p(\mathbf{Z}\mid\mathbf{X})} \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z}\mid\mathbf{X}) \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z},\mathbf{X}) \right] + \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{X}) \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z},\mathbf{X}) \right] + \log p(\mathbf{X}), \end{aligned} $$
where the last step is because $\log p(\mathbf{X})$ does not depend on $\mathbf{Z}$, so its expectation under $q(\mathbf{Z})$ is simply $\log p(\mathbf{X})$. Since $\log p(\mathbf{X})$ is constant with respect to $q$, minimizing the KL divergence is equivalent to maximizing the negated first two terms, which define the evidence lower bound (ELBO):
$$ \begin{aligned} \text{ELBO}(q) &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z},\mathbf{X}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{X}\mid\mathbf{Z}) \right] + \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log q(\mathbf{Z}) \right] \\ &= \mathbb{E}_{q(\mathbf{Z})} \left[ \log p(\mathbf{X}\mid\mathbf{Z}) \right] - \text{KL}(q(\mathbf{Z})||p(\mathbf{Z})). \end{aligned} $$
So maximizing this objective means maximizing the expected log-likelihood of the data while ensuring that $q(\mathbf{Z})$ stays close to the prior $p(\mathbf{Z})$.
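As a quick sanity check of these identities, here is a small numerical sketch on a toy discrete model (my own illustration, not from the paper): it verifies that $\text{ELBO}(q) = \log p(\mathbf{X}) - \text{KL}(q(\mathbf{Z})||p(\mathbf{Z}\mid\mathbf{X}))$, so the ELBO indeed lower-bounds the log-evidence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: Z takes K discrete values; X is a single observed outcome.
K = 4
p_z = rng.dirichlet(np.ones(K))            # prior p(Z)
p_x_given_z = rng.uniform(0.1, 0.9, K)     # likelihood p(X=x | Z=k) for the observed x

p_joint = p_x_given_z * p_z                # p(X=x, Z=k)
p_x = p_joint.sum()                        # evidence p(X=x)
p_z_given_x = p_joint / p_x                # true posterior p(Z | X=x)

q = rng.dirichlet(np.ones(K))              # an arbitrary approximate posterior q(Z)

# ELBO(q) = E_q[log p(X, Z)] - E_q[log q(Z)]
elbo = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))
kl_q_post = np.sum(q * np.log(q / p_z_given_x))   # KL(q || p(Z|X))

print(np.isclose(elbo, np.log(p_x) - kl_q_post))  # True, up to float error
print(elbo <= np.log(p_x))                        # True: ELBO lower-bounds the evidence
```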
The objective is called "variational" because we are trying to find a function (the distribution $q$), instead of a single value, that maximizes it.
Since optimizing over all possible distributions $q(\mathbf{Z})$ is itself intractable, we restrict $q$ to a factorized family, $q(\mathbf{Z}) = \prod_i q_i(\mathbf{Z}_i)$, where the latent variables are split into groups that are assumed to be mutually independent under $q$. This is known as the mean-field approximation.

We can then optimize the ELBO by considering one of the factors $q_i$ at a time: holding the other factors fixed, the optimal $q_i$ has a closed form, and we cycle through the factors until the ELBO converges. This coordinate-ascent procedure is illustrated on a toy example below.
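Here is a minimal sketch of coordinate-ascent variational inference (CAVI) for a toy model that is not from the paper: a 1-D Gaussian mixture with unit observation variance, where the latent variables are the component means and the per-point assignments. All names and hyperparameters are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Generate toy data from a 1-D mixture of Gaussians with unit variance ---
true_means = np.array([-4.0, 0.0, 5.0])
z_true = rng.integers(0, 3, size=300)
x = rng.normal(true_means[z_true], 1.0)

K, N = 3, len(x)
sigma2_prior = 10.0               # prior variance of the component means

# --- Initialize the variational factors ---
m = rng.normal(0.0, 1.0, K)       # q(mu_k) = N(m_k, s2_k)
s2 = np.ones(K)
phi = np.full((N, K), 1.0 / K)    # q(c_i) = Categorical(phi_i)

for _ in range(50):
    # Update q(c_i): phi_ik is proportional to exp(E[mu_k] x_i - E[mu_k^2] / 2)
    log_phi = np.outer(x, m) - 0.5 * (s2 + m ** 2)
    log_phi -= log_phi.max(axis=1, keepdims=True)   # for numerical stability
    phi = np.exp(log_phi)
    phi /= phi.sum(axis=1, keepdims=True)

    # Update q(mu_k): closed-form Gaussian given the current responsibilities
    s2 = 1.0 / (1.0 / sigma2_prior + phi.sum(axis=0))
    m = s2 * (phi * x[:, None]).sum(axis=0)

print(np.sort(m))   # should be close to the true means [-4, 0, 5]
```

Each of the two updates can only increase the ELBO, so the procedure converges, though possibly to a local optimum.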
Now that we have some idea of how variational inference works, let us turn to VBx, which applies it to the task of inferring a sequence of speaker states given a sequence of x-vectors.
The algorithm assumes that the sequence of x-vectors is generated by a Bayesian HMM whose latent variables are discrete speaker states. Transition probabilities are simple and are given as $p(z_t = s \mid z_{t-1} = s') = (1 - P_{\text{loop}})\pi_s + \delta(s = s')P_{\text{loop}}$: with probability $P_{\text{loop}}$ the model stays in the same speaker state, and with probability $1 - P_{\text{loop}}$ it draws the next state from the speaker prior $\boldsymbol{\pi}$.
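As a small illustration (the variable names are mine), the full transition matrix for this parameterization can be built as:

```python
import numpy as np

def vbx_transition_matrix(pi, p_loop):
    """Transition matrix A[s_prev, s] = (1 - p_loop) * pi[s] + p_loop * [s == s_prev]."""
    S = len(pi)
    return (1.0 - p_loop) * np.tile(pi, (S, 1)) + p_loop * np.eye(S)

pi = np.array([0.25, 0.25, 0.25, 0.25])   # speaker priors (uniform here)
A = vbx_transition_matrix(pi, p_loop=0.9)
print(A.sum(axis=1))                       # each row sums to 1
```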
The emission probabilities are where the "Bayesian" part of the HMM comes in. Suppose the x-vectors are transformed to a space with zero mean. For each speaker state $s$, the distribution of x-vectors is modeled as a Gaussian

$$ p(\mathbf{x}_t \mid s) = \mathcal{N}(\mathbf{x}_t; \mathbf{m}_s, \boldsymbol{\Sigma}_{\text{wc}}), $$

where $\mathbf{m}_s$ is the mean for speaker $s$ and $\boldsymbol{\Sigma}_{\text{wc}}$ is a within-class covariance shared by all speakers. By taking a standard normally distributed variable $\mathbf{y}_s \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, the speaker mean can be written as $\mathbf{m}_s = \mathbf{V}\mathbf{y}_s$, where the columns of $\mathbf{V}$ are bases of the across-speaker variability (so the across-class covariance is $\mathbf{V}\mathbf{V}^\top$); these parameters come from a pretrained PLDA model. The above distribution is fully characterized by the latent speaker vector $\mathbf{y}_s$. Collecting $\mathbf{Y} = \{\mathbf{y}_s\}$, the joint distribution of observed and latent variables factorizes as
$$ \begin{aligned} p(\mathbf{X},\mathbf{Z},\mathbf{Y}) &= p(\mathbf{X}\mid\mathbf{Z},\mathbf{Y}) p(\mathbf{Z}) p(\mathbf{Y}) \\ &= \prod_t p(\mathbf{x}_t|z_t) \prod_t p(z_t|z_{t-1}) \prod_s p(\mathbf{y}_s). \end{aligned} $$
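For intuition, here is a minimal sketch of sampling from this generative model (assuming, purely for illustration, an identity within-class covariance and a diagonal $\mathbf{V}$; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)

S, T, D = 3, 200, 16                      # speakers, frames, x-vector dimension
p_loop = 0.9                              # probability of staying with the same speaker
pi = np.full(S, 1.0 / S)                  # speaker priors
V = np.diag(rng.uniform(0.5, 2.0, D))     # across-speaker variability bases (diagonal here)

# p(Y): one standard-normal latent vector per speaker
Y = rng.standard_normal((S, D))

# p(Z): first-order Markov chain over speaker states
A = (1.0 - p_loop) * np.tile(pi, (S, 1)) + p_loop * np.eye(S)
z = np.empty(T, dtype=int)
z[0] = rng.choice(S, p=pi)
for t in range(1, T):
    z[t] = rng.choice(S, p=A[z[t - 1]])

# p(X | Z, Y): emissions are Gaussian around the speaker mean V @ y_s
X = (V @ Y[z].T).T + rng.standard_normal((T, D))
print(X.shape, np.bincount(z, minlength=S))
```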
For inference, we need to compute the posterior over the latent variables, $p(\mathbf{Z},\mathbf{Y}\mid\mathbf{X})$; in particular, the per-frame speaker posteriors come from the marginal over $\mathbf{Z}$.

As we saw before, the posterior is intractable to compute exactly because of the marginal $p(\mathbf{X})$ in the denominator, so we approximate it with a factorized distribution $q(\mathbf{Z},\mathbf{Y}) = q(\mathbf{Z})q(\mathbf{Y})$.
We can write this as an optimization problem similar to the one described in the previous section, and solve it by maximizing the ELBO, which is given as
$$ \mathrm{ELBO}(q) = \mathbb{E}_{q(\mathbf{Z},\mathbf{Y})}\left[ \log p(\mathbf{X}\mid\mathbf{Y},\mathbf{Z}) \right] - \mathbb{E}_{q(\mathbf{Y})} \left[ \log \frac{q(\mathbf{Y})}{p(\mathbf{Y})}\right] - \mathbb{E}_{q(\mathbf{Z})} \left[ \log \frac{q(\mathbf{Z})}{p(\mathbf{Z})}\right], $$
which follows from the factorization $q(\mathbf{Z},\mathbf{Y}) = q(\mathbf{Z})q(\mathbf{Y})$, similar to the simplification done earlier.
As per the mean-field approximation, we solve the above optimization problem by iteratively solving for one factor while keeping the other fixed: with $q(\mathbf{Z})$ fixed, the optimal $q(\mathbf{y}_s)$ for each speaker is a Gaussian with a closed-form update based on the current frame responsibilities, and with $q(\mathbf{Y})$ fixed, the optimal $q(\mathbf{Z})$ is obtained by running the forward-backward algorithm on the HMM using the expected emission log-probabilities. Iterating these two updates until the ELBO converges yields the per-frame speaker posteriors used for diarization. A simplified sketch of this loop is shown below.
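The following is a simplified, illustrative sketch of this alternating optimization, not the reference implementation: it assumes the x-vectors have been whitened so that the within-class covariance is identity, omits the scaling factors and speaker pruning used in the actual VBx code, and all function and variable names are mine.

```python
import numpy as np
from scipy.special import logsumexp

def update_q_y(X, gamma, V):
    """Closed-form Gaussian update of q(y_s) given frame responsibilities gamma[t, s]."""
    R = V.shape[1]
    VtV = V.T @ V
    alphas, Linvs = [], []
    for s in range(gamma.shape[1]):
        L = np.eye(R) + gamma[:, s].sum() * VtV          # precision of q(y_s)
        Linv = np.linalg.inv(L)
        alpha = Linv @ V.T @ (gamma[:, s] @ X)           # mean of q(y_s)
        alphas.append(alpha)
        Linvs.append(Linv)
    return np.array(alphas), np.array(Linvs)

def expected_log_emission(X, V, alphas, Linvs):
    """log_emit[t, s] = E_{q(y_s)}[log N(x_t; V y_s, I)]."""
    T, D = X.shape
    S = len(alphas)
    VtV = V.T @ V
    log_emit = np.empty((T, S))
    for s in range(S):
        second_moment = Linvs[s] + np.outer(alphas[s], alphas[s])  # E[y_s y_s^T]
        const = -0.5 * np.trace(VtV @ second_moment)
        log_emit[:, s] = X @ (V @ alphas[s]) - 0.5 * (X ** 2).sum(axis=1) + const
    return log_emit - 0.5 * D * np.log(2 * np.pi)

def update_q_z(log_emit, pi, p_loop):
    """Forward-backward pass over the HMM to get per-frame responsibilities gamma[t, s]."""
    T, S = log_emit.shape
    logA = np.log((1.0 - p_loop) * np.tile(pi, (S, 1)) + p_loop * np.eye(S))
    log_alpha = np.empty((T, S))
    log_beta = np.zeros((T, S))
    log_alpha[0] = np.log(pi) + log_emit[0]
    for t in range(1, T):
        log_alpha[t] = log_emit[t] + logsumexp(log_alpha[t - 1][:, None] + logA, axis=0)
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(logA + log_emit[t + 1] + log_beta[t + 1], axis=1)
    log_gamma = log_alpha + log_beta
    return np.exp(log_gamma - logsumexp(log_gamma, axis=1, keepdims=True))

def vbx_iterations(X, V, S, pi, p_loop, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    gamma = rng.dirichlet(np.ones(S), size=X.shape[0])   # random initial responsibilities
    for _ in range(n_iters):
        alphas, Linvs = update_q_y(X, gamma, V)          # update q(Y) given q(Z)
        log_emit = expected_log_emission(X, V, alphas, Linvs)
        gamma = update_q_z(log_emit, pi, p_loop)         # update q(Z) given q(Y)
    return gamma.argmax(axis=1)                          # per-frame speaker labels
```

In practice, the responsibilities are not initialized randomly but from an initial clustering of the x-vectors (agglomerative hierarchical clustering in the VBx recipe).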