---
layout: post
title: "GBO notes: Expectation Maximization"
tags:
mathjax: true
---
In this note, we will describe how to estimate the parameters of GMM and HMM models using the expectation-maximization (EM) method. The equations and discussion are heavily based on Jeff Bilmes' paper.
A popular method to estimate the parameters of a statistical model is maximum likelihood. Given a set of observations $X = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$ assumed to be drawn i.i.d. from a distribution $p(\mathbf{x}\mid\Theta)$ with parameters $\Theta$, we want to find the $\Theta$ that maximizes

$$ \mathcal{L}(\Theta \mid X) = p(X \mid \Theta) = \prod_{i=1}^N p(\mathbf{x}_i \mid \Theta). $$

This is also called the likelihood of $\Theta$ given the data, and the maximum-likelihood estimate is

$$ \Theta^{\ast} = \arg\max_{\Theta}\, \log \mathcal{L}(\Theta \mid X), $$

where we usually work with the log-likelihood since it is easier to handle analytically. For simple models such as a single Gaussian, this maximization is straightforward. In fact, we can show that setting the derivative of the log-likelihood w.r.t. the parameters to zero yields a closed-form solution (the sample mean and variance, in the Gaussian case).
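As a concrete illustration of the closed-form case, the following NumPy sketch (my own, not from the Bilmes paper) recovers the mean and variance of a single 1-D Gaussian directly from the sample statistics:

```python
import numpy as np

# Hypothetical 1-D data drawn from a single Gaussian.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Setting the derivative of the log-likelihood to zero gives the
# familiar closed-form MLEs: the sample mean and (biased) sample variance.
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()

print(mu_hat, sigma2_hat)  # close to 2.0 and 1.5**2
```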
It is not always so simple to maximize the likelihood function since the derivative may
not have an analytical solution. However, we can often simplify such cases by assuming that
the data $X$ is incomplete, i.e., that there is some hidden (latent) variable $Z$ such that $(X, Z)$ together form the complete data.
We can then define a "complete" data likelihood as $\mathcal{L}(\Theta \mid X, Z) = p(X, Z \mid \Theta)$, and, since $Z$ is not observed, take its expectation under the posterior of $Z$ given $X$ and the current parameter estimate $\Theta^{(t)}$:

$$ Q(\Theta, \Theta^{(t)}) = \mathbb{E}_{Z \mid X, \Theta^{(t)}}\left[ \log p(X, Z \mid \Theta) \right]. $$

This is called the E-step of the EM algorithm. Once we have the expected complete-data likelihood,
we can maximize it w.r.t. $\Theta$ to obtain the new estimate

$$ \Theta^{(t+1)} = \arg\max_{\Theta}\, Q(\Theta, \Theta^{(t)}), $$
which is known as the M-step. Note that this is a very abstract characterization of the method, and the exact equations would depend on the underlying statistical model. In the remainder of this post, we will see how the EM algorithm works in the case of GMM and HMM models.
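Before specializing to those models, here is a minimal, model-agnostic sketch of the loop we will be instantiating; `e_step`, `m_step`, and `log_likelihood` are hypothetical callables that a specific model would supply, not part of any library:

```python
# A schematic EM loop, parameterized by model-specific callables.
def em(x, theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=200):
    prev_ll = -float("inf")
    for _ in range(max_iter):
        expectations = e_step(x, theta)   # E-step: posteriors over latent variables
        theta = m_step(x, expectations)   # M-step: maximize expected complete-data likelihood
        ll = log_likelihood(x, theta)
        if ll - prev_ll < tol:            # log-likelihood is non-decreasing
            break
        prev_ll = ll
    return theta
```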
Often, a single Gaussian is not sufficient to describe the set of observations, in which case
we turn to mixture models. A Gaussian mixture model (GMM) is a mixture of a fixed number of
Gaussians. The probability of generating the observations from a GMM containing $K$ Gaussians is given by

$$ p(X \mid \Theta) = \prod_{i=1}^N \sum_{k=1}^K \pi_k\, \mathcal{N}(\mathbf{x}_i; \mu_k, \sigma_k), $$

where $\pi_k$ are the mixture weights (with $\sum_{k=1}^K \pi_k = 1$), and $\mu_k$ and $\sigma_k$ are the mean and covariance of the $k$-th Gaussian, so that $\Theta = \{\pi_k, \mu_k, \sigma_k\}_{k=1}^K$.
If we try to directly apply the MLE on this likelihood function, we will run into a problem
since the log cannot be taken inside the sum, and so there is no analytical solution (unlike
the case of a single Gaussian). In fact, the derivative of the GMM log-likelihood (w.r.t.
$\mu_k$, say) contains the posterior term

$$ \gamma_{z_i}(k) = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_i;\mu_k,\sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal{N}(\mathbf{x}_i;\mu_j,\sigma_j)}, $$

where $\gamma_{z_i}(k)$ is the posterior probability (or "responsibility") that the $k$-th component generated the observation $\mathbf{x}_i$.
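For a 1-D GMM, this posterior can be computed with a few lines of NumPy. This is a sketch under my own naming (`gmm_responsibilities`), assuming scalar variances `sigma2`:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # 1-D Gaussian density N(x; mu, sigma2), evaluated elementwise.
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def gmm_responsibilities(x, pi, mu, sigma2):
    # x: (N,) observations; pi, mu, sigma2: (K,) mixture parameters.
    # Returns gamma: (N, K), where gamma[i, k] = p(z_i = k | x_i).
    weighted = pi[None, :] * gaussian_pdf(x[:, None], mu[None, :], sigma2[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)
```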
Now, in this derivative there is a latent variable ($z_i$, the identity of the component that generated $\mathbf{x}_i$), whose posterior $\gamma_{z_i}(k)$ itself depends on the parameters. Treating these posteriors as fixed and setting the derivative to zero gives

$$ \mu_k = \frac{\sum_{i=1}^N \gamma_{z_i}(k)\, \mathbf{x}_i}{\sum_{i=1}^N \gamma_{z_i}(k)}, $$
which is essentially a weighted average of the observations, with the posteriors as the weights.
Similarly, we can obtain update equations for the covariances and the mixture weights:

$$ \sigma_k = \frac{\sum_{i=1}^N \gamma_{z_i}(k)\, (\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^{\top}}{\sum_{i=1}^N \gamma_{z_i}(k)}, \qquad \pi_k = \frac{1}{N} \sum_{i=1}^N \gamma_{z_i}(k). $$
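These update equations translate directly into code. Again a 1-D sketch with hypothetical names, taking the responsibilities computed in the E-step as input:

```python
import numpy as np

def gmm_m_step(x, gamma):
    # x: (N,) observations; gamma: (N, K) posteriors from the E-step.
    Nk = gamma.sum(axis=0)                          # effective count per component
    mu = (gamma * x[:, None]).sum(axis=0) / Nk      # weighted means
    sigma2 = (gamma * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk  # weighted variances
    pi = Nk / x.shape[0]                            # mixture weights
    return pi, mu, sigma2
```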
At this point, it seems like we are done. But remember that these update equations involve the posteriors $\gamma_{z_i}(k)$, which in turn depend on the very parameters we are trying to estimate. We resolve this circularity by iterating between two steps:

- E-step: Compute the posterior probabilities $\gamma_{z_i}(k)$ using the current values of the parameters $\mu_k$, $\sigma_k$, and $\pi_k$.
- M-step: Estimate new values for the parameters using the computed values of $\gamma_{z_i}(k)$.
We repeat this iterative process until the change in log-likelihood falls below some threshold. Note that the log-likelihood is guaranteed not to decrease at any iteration. A compact sketch of the full procedure is given below.
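Putting the two steps together, a self-contained 1-D sketch of the whole procedure might look like this (the initialization choices here are my own, not prescribed by the algorithm):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def fit_gmm(x, K, n_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Crude initialization: random means, global variance, uniform weights.
    mu = rng.choice(x, size=K, replace=False)
    sigma2 = np.full(K, x.var())
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posteriors gamma[i, k] = p(z_i = k | x_i).
        weighted = pi[None, :] * gaussian_pdf(x[:, None], mu[None, :], sigma2[None, :])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the posteriors.
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma2 = (gamma * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
        pi = Nk / x.shape[0]
        # Stop when the log-likelihood improvement falls below the threshold.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma2
```

Calling `fit_gmm(x, K=3)` on a 1-D array `x` returns the estimated weights, means, and variances.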
## Hidden Markov Model
Note: Some of the material in this section is modified from this blog post.
An HMM is a stochastic model which assumes an underlying sequence of latent variables $Z = \{\mathbf{z}_1,\ldots,\mathbf{z}_N\}$ that generates the observed sequence $X = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$. The latent states are assumed to form a first-order Markov chain, and each observation depends only on the current state, so the joint probability factorizes as

$$\begin{aligned} P(X,Z) &= p(\mathbf{x}_1,\ldots,\mathbf{x}_N,\mathbf{z}_1,\ldots,\mathbf{z}_N) \\ &= p(\mathbf{z}_1)\,p(\mathbf{x}_1\mid \mathbf{z}_1)\prod_{i=2}^N p(\mathbf{z}_i\mid \mathbf{z}_1^{i-1})\, p(\mathbf{x}_i\mid \mathbf{z}_1^i, \mathbf{x}_1^{i-1}) \\ &= p(\mathbf{z}_1)\,p(\mathbf{x}_1\mid \mathbf{z}_1)\prod_{i=2}^N p(\mathbf{z}_i\mid \mathbf{z}_{i-1})\, p(\mathbf{x}_i\mid \mathbf{z}_i). \end{aligned}$$
The joint probability derived as such can be factorized into transition probabilities for the
latent variables,

$$ p(Z) = p(\mathbf{z}_1) \prod_{i=2}^N p(\mathbf{z}_i \mid \mathbf{z}_{i-1}), $$

and emission probabilities for generating the observations given the latent states,

$$ p(X \mid Z) = \prod_{i=1}^N p(\mathbf{x}_i \mid \mathbf{z}_i). $$
For example, the emission probability may be modeled using a single Gaussian or a mixture model. The EM algorithm for computing the MLE parameters of an HMM is also known as the Baum-Welch algorithm.
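To make the factorization concrete, the sketch below scores a particular observation/state sequence pair under an HMM. For simplicity it assumes discrete observation symbols with an emission matrix `B`; as noted above, the emissions could just as well be Gaussians or mixtures.

```python
import numpy as np

def hmm_joint_log_prob(obs, states, pi, A, B):
    # obs:    (N,) observation symbol indices
    # states: (N,) latent state indices
    # pi:     (m,)   initial state probabilities p(z_1)
    # A:      (m, m) transition probabilities A[j, k] = p(z_i = k | z_{i-1} = j)
    # B:      (m, V) emission probabilities  B[k, v] = p(x_i = v | z_i = k)
    log_p = np.log(pi[states[0]]) + np.log(B[states[0], obs[0]])
    for i in range(1, len(obs)):
        log_p += np.log(A[states[i - 1], states[i]])  # transition term
        log_p += np.log(B[states[i], obs[i]])         # emission term
    return log_p
```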
The parameters we need to estimate are:
- the transition matrix $A$, which is an $m \times m$ matrix, where $m$ is the number of possible states that $Z$ can take,
- the initial probabilities of the states, $\pi$,
- and the parameters $\phi$ of the emission distribution,

i.e., $\theta = (A, \pi, \phi)$.
Again, we solve this problem by alternating between an E-step and an M-step, where the E-step computes the expected complete-data likelihood and the M-step maximizes it to obtain updated parameter estimates. However, in this case the E-step is a little more involved, since the expectation is over an exponential number of possible state sequences. It is therefore computed efficiently using the forward-backward algorithm (and so Baum-Welch = EM with forward-backward).
In particular, in the E-step we need to compute 2 posteriors: the single-state posterior $\gamma_i(k) = p(\mathbf{z}_i = k \mid X)$ and the pairwise posterior $\xi_i(j, k) = p(\mathbf{z}_i = j, \mathbf{z}_{i+1} = k \mid X)$, both of which are obtained from the forward and backward probabilities.
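The following is a sketch of a scaled forward-backward pass that produces both posteriors, again assuming discrete emissions with a matrix `B`; the function name and the scaling details are my own choices:

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    # obs: (N,) integer array of observation symbols; pi: (m,); A: (m, m); B: (m, V).
    N, m = len(obs), len(pi)
    alpha = np.zeros((N, m))   # scaled forward probabilities
    beta = np.zeros((N, m))    # scaled backward probabilities
    scale = np.zeros(N)

    # Forward pass.
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for i in range(1, N):
        alpha[i] = (alpha[i - 1] @ A) * B[:, obs[i]]
        scale[i] = alpha[i].sum()
        alpha[i] /= scale[i]

    # Backward pass (reusing the same scaling factors).
    beta[-1] = 1.0
    for i in range(N - 2, -1, -1):
        beta[i] = A @ (B[:, obs[i + 1]] * beta[i + 1]) / scale[i + 1]

    # Posteriors: gamma[i, k] = p(z_i = k | X),
    # xi[i, j, k] = p(z_i = j, z_{i+1} = k | X).
    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / scale[1:, None, None]
    return gamma, xi
```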
In the M-step, the posteriors computed above are used to re-estimate the parameters of the model, thus increasing the complete-data likelihood.
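With $\gamma$ and $\xi$ in hand, the re-estimation itself is straightforward. A sketch for the discrete-emission case (the helper name is mine):

```python
import numpy as np

def baum_welch_m_step(obs, gamma, xi, V):
    # obs: (N,) observation symbols; gamma: (N, m); xi: (N-1, m, m);
    # V: number of distinct observation symbols (discrete emissions assumed).
    obs = np.asarray(obs)
    pi = gamma[0]                                          # initial-state probabilities
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # expected transitions / expected visits
    B = np.zeros((gamma.shape[1], V))
    for v in range(V):
        B[:, v] = gamma[obs == v].sum(axis=0)              # expected emissions of symbol v per state
    B /= gamma.sum(axis=0)[:, None]                        # normalize per state
    return pi, A, B
```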
I skip over the details of the derivation here since they can be found in the sources linked above.