Table of Contents
- Problem Setting
- Evidence Lower Bound
- Expectation Maximization
- Example: Coin Flips
- Example: Gaussian Mixture Model
- Advice for Deriving EM Algorithms
These notes provide a theoretical treatment of Expectation Maximization, an iterative parameter estimation algorithm used to find local maxima of the likelihood function in the presence of hidden variables. Introductory textbooks [murphy:mlapp, bishop:prml] typically state the algorithm without explanation and expect students to work blindly through derivations. We find this approach to be unsatisfying, and instead choose to tackle the theory head-on, followed by plenty of examples. Following [neal1998:em], we view expectation-maximization as coordinate ascent on the Evidence Lower Bound. This perspective takes much of the mystery out of the algorithm and allows us to easily derive variants like Hard EM and Variational EM.
Problem Setting
Suppose we observe data $x$ generated from a model with true parameters $\theta^\star$ in the presence of hidden variables $z$. As usual, we wish to compute the maximum likelihood estimate
$$
\hat\theta_{\mathrm{ML}} = \arg\max_\theta \log p(x \mid \theta)
$$
of the parameters given our observed data. In some cases, we also seek to infer the values of the hidden variables $z$. In the Bayesian spirit, we will treat the parameter $\theta$ as a realization of some random variable $\Theta$.
The observed data log-likelihood $\ell(\theta) = \log p(x \mid \theta)$ is useful for both inference and parameter estimation, in which we must grapple with uncertainty about the hidden variables. Working directly with this quantity is often difficult in latent variable models because the inner sum cannot be brought out of the logarithm when we marginalize over the latent variables:
$$
\ell(\theta) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta).
$$
In general, this likelihood is non-convex with many local maxima. In contrast, [murphy:mlapp] shows that when $p(x, z \mid \theta)$ belongs to an exponential family, the complete data log-likelihood $\log p(x, z \mid \theta)$ is concave, so learning from fully observed data is much easier. Expectation maximization exploits the fact that learning is easy when we observe all variables: we will alternate between inferring the values of the latent variables and re-estimating the parameters as if we had complete data.
Evidence Lower Bound
Our general approach will be to reason about the hidden variables through a proxy distribution $q(z)$, which we use to compute a lower bound on the log-likelihood. This section is devoted to deriving one such bound, called the Evidence Lower Bound (ELBO).
We can expand the data log-likelihood by marginalizing over the hidden variables:
$$
\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)}.
$$
Through Jensen's inequality, we obtain the following bound, valid for any distribution $q(z)$:
$$
\log p(x \mid \theta) \geq \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \equiv \mathcal{L}(q, \theta).
$$
The lower bound $\mathcal{L}(q, \theta)$ can be rewritten as follows:
$$
\mathcal{L}(q, \theta) = \mathbb{E}_q\left[\log p(x, z \mid \theta)\right] + H(q),
$$
where $H(q) = -\sum_z q(z) \log q(z)$ is the entropy of the proxy distribution.
Relationship to Relative Entropy
The first term in the last line above closely resembles the cross entropy between $q(z)$ and the joint distribution of the observed and hidden variables. However, the variables $x$ are fixed to our observations, and so $p(x, z \mid \theta)$ is an unnormalized [ref]In this case, $\sum_z p(x, z \mid \theta) = p(x \mid \theta) \neq 1$ in general.[/ref] distribution over $z$. It is easy to see that this does not set us back too far; in fact, the lower bound differs from a Kullback-Leibler divergence only by a constant with respect to $q$:
$$
\mathcal{L}(q, \theta) = -\mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big) + \log p(x \mid \theta).
$$
This yields a second proof of the evidence lower bound, following from the nonnegativity of relative entropy. In fact, this is the proof given in [tzikas2008:variational] and [murphy:mlapp].
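To make the decomposition concrete, here is a quick numerical check (a sketch, not part of the original notes) on a toy model with a single observation and three hidden states; the particular numbers and the proxy distribution are arbitrary choices for illustration.

```python
# Numerically verify  log p(x | theta) = ELBO(q, theta) + KL(q || p(z | x, theta))
# on a toy discrete model with three hidden states (made-up numbers).
import numpy as np

p_xz = np.array([0.10, 0.25, 0.05])      # joint p(x, z | theta); sums to p(x | theta)
p_x = p_xz.sum()                          # evidence p(x | theta)
p_z_given_x = p_xz / p_x                  # hidden posterior p(z | x, theta)

q = np.array([0.5, 0.3, 0.2])             # an arbitrary proxy distribution q(z)

elbo = np.sum(q * np.log(p_xz / q))       # E_q[log p(x, z | theta)] + H(q)
kl = np.sum(q * np.log(q / p_z_given_x))  # KL(q || p(z | x, theta))

print(np.log(p_x), elbo + kl)             # equal up to floating point error
print(elbo <= np.log(p_x))                # the bound holds: True
```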
Selecting a Proxy Distribution
The quality of our lower bound $\mathcal{L}(q, \theta)$ depends heavily on the choice of proxy distribution $q(z)$. We now show that the evidence lower bound is tight in the sense that equality holds when the proxy distribution is chosen to be the hidden posterior $p(z \mid x, \theta)$. This will be useful later for proving that the Expectation Maximization algorithm converges.
Maximizing $\mathcal{L}(q, \theta)$ with respect to $q$ is equivalent to minimizing the relative entropy between $q(z)$ and the hidden posterior $p(z \mid x, \theta)$. Hence, the optimal choice for $q$ is exactly the hidden posterior, for which $\mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big) = 0$, and
$$
\mathcal{L}(q, \theta) = \log p(x \mid \theta).
$$
In cases where the hidden posterior is intractable to compute, we choose the best proxy $q$ from within some tractable family of distributions; this is the idea behind variational inference, which we will discuss later.
Expectation Maximization
Recall that the maximum likelihood estimate of the parameters $\theta$ given observed data $x$ in the presence of hidden variables $z$ is
$$
\hat\theta_{\mathrm{ML}} = \arg\max_\theta \log p(x \mid \theta) = \arg\max_\theta \log \sum_z p(x, z \mid \theta).
$$
Unfortunately, when reasoning about hidden variables, finding a global maximum is difficult. Instead, the Expectation Maximization algorithm is an iterative procedure for computing a local maximum of the likelihood function, under the assumption that the hidden posterior $p(z \mid x, \theta)$ is tractable. We will take advantage of the evidence lower bound
$$
\log p(x \mid \theta) \geq \mathcal{L}(q, \theta) = \mathbb{E}_q\left[\log p(x, z \mid \theta)\right] + H(q)
$$
on the data likelihood. Consider only proxy distributions of the form $q(z) = p(z \mid x, \psi)$, where $\psi$ is some fixed configuration of the parameters, possibly different from our current estimate $\theta$. The optimal value for $\psi$, in the sense that $\mathcal{L}(q, \theta)$ is maximized, depends on the particular choice of $\theta$. Similarly, the optimal value for $\theta$ depends on the choice of $\psi$. This suggests an iterative scheme in which we alternate between maximizing $\mathcal{L}(q, \theta)$ with respect to $\psi$ and with respect to $\theta$, gradually improving the log-likelihood.
Suppose at time $t$ we have an estimate $\theta^{(t)}$ of the parameters. To improve our estimate, we perform two steps of coordinate ascent on $\mathcal{L}(q, \theta)$, as described in [neal1998:em]:
E-Step: Compute a new lower bound on the observed log-likelihood, with
$$
q^{(t+1)}(z) = p(z \mid x, \theta^{(t)}).
$$
M-Step: Estimate new parameters by optimizing the lower bound,
$$
\theta^{(t+1)} = \arg\max_\theta \mathcal{L}(q^{(t+1)}, \theta) = \arg\max_\theta \mathbb{E}_{q^{(t+1)}}\left[\log p(x, z \mid \theta)\right].
$$
In the M-Step, the expectation is taken with respect to the fixed proxy $q^{(t+1)}(z) = p(z \mid x, \theta^{(t)})$, not the posterior under the new parameters $\theta$.
Note that the entropy term $H(q^{(t+1)})$ of the evidence lower bound does not depend on $\theta$ and can be dropped from the M-Step. The remaining term,
$$
Q(\theta, \theta^{(t)}) = \mathbb{E}_{q^{(t+1)}}\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta),
$$
is sometimes called the auxiliary function or Q-function. To us, this is the expected complete data log-likelihood.
Proof of Convergence
To prove convergence of this algorithm, we show that the observed data likelihood does not decrease after each update.
After a single iteration of Expectation Maximization, the observed data likelihood of the estimated parameters has not decreased, that is,
$$
\log p(x \mid \theta^{(t+1)}) \geq \log p(x \mid \theta^{(t)}).
$$
This result is a simple consequence of all the hard work we have put in so far:
$$
\log p(x \mid \theta^{(t+1)}) \geq \mathcal{L}(q^{(t+1)}, \theta^{(t+1)}) \geq \mathcal{L}(q^{(t+1)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)}).
$$
The first inequality is the evidence lower bound, the second holds because $\theta^{(t+1)}$ maximizes $\mathcal{L}(q^{(t+1)}, \cdot)$ in the M-Step, and the final equality holds because the bound is tight for the proxy $q^{(t+1)}(z) = p(z \mid x, \theta^{(t)})$ chosen in the E-Step.
It is also possible to show that Expectation-Maximization converges to something useful.
(Neal & Hinton 1998, Thm. 2) Every local maximum of the evidence lower bound is a local maximum of the data likelihood .
Starting from an initial guess $\theta^{(0)}$, we run this procedure until some stopping criterion is met and obtain a sequence of parameter estimates $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots$; a sketch of the resulting loop is given below.
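The following sketch (not from the original notes) shows how the pieces fit together. The functions e_step, m_step, and log_likelihood are hypothetical placeholders for the model-specific derivations in the examples below, and the tolerance-based stopping criterion is just one reasonable choice.

```python
# A minimal sketch of the outer EM loop with a stopping criterion based on the
# change in observed data log-likelihood, which EM guarantees never decreases.
def expectation_maximization(x, theta_init, e_step, m_step, log_likelihood,
                             tol=1e-6, max_iters=1000):
    theta = theta_init
    prev_ll = log_likelihood(x, theta)
    for _ in range(max_iters):
        q = e_step(x, theta)           # E-Step: q(z) = p(z | x, theta)
        theta = m_step(x, q)           # M-Step: maximize expected complete data log-likelihood
        ll = log_likelihood(x, theta)  # monotonically non-decreasing
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta
```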
Example: Coin Flips
Now that we have a good grasp on the theory behind Expectation Maximization, let's get some intuition by means of a simple example. As usual, the simplest possible example involves coin flips!
Suppose we have two coins, each with a different probability of heads, $\theta_A$ and $\theta_B$, unknown to us. We collect data from a series of $n$ trials in order to estimate the bias of each coin. Each trial consists of choosing one of the two coins uniformly at random, flipping that same coin a total of $m$ times, and recording only the total number of heads.
This situation is best described by the following generative probabilistic model, which precisely describes our assumptions about how the data was generated: for each trial $i = 1, \ldots, n$,
$$
z_i \sim \mathrm{Uniform}\{A, B\}, \qquad x_i \mid z_i \sim \mathrm{Binomial}(m, \theta_{z_i}).
$$
The corresponding graphical model and a set of sample data are shown in the accompanying figure.
Complete Data Log-Likelihood
The complete data log-likelihood for a single trial is
$$
\log p(x_i, z_i \mid \theta) = \log p(z_i) + \log p(x_i \mid z_i, \theta).
$$
In this model, $p(z_i)$ is uniform, so $p(z_i = A) = p(z_i = B) = \tfrac{1}{2}$. The remaining term is binomial,
$$
p(x_i \mid z_i, \theta) = \binom{m}{x_i} \theta_{z_i}^{x_i} (1 - \theta_{z_i})^{m - x_i}.
$$
Now that we have specified the probabilistic model and worked out all relevant probabilities, we are ready to derive an Expectation Maximization algorithm.
The E-Step is straightforward: since the prior over coins is uniform and the binomial coefficients cancel, the hidden posterior for trial $i$ is
$$
q_i(k) = p(z_i = k \mid x_i, \theta^{(t)}) = \frac{(\theta_k^{(t)})^{x_i} (1 - \theta_k^{(t)})^{m - x_i}}{\sum_{k' \in \{A, B\}} (\theta_{k'}^{(t)})^{x_i} (1 - \theta_{k'}^{(t)})^{m - x_i}}.
$$
The M-Step computes a new parameter estimate by optimizing the lower bound found in the E-Step. Let $q(z) = \prod_{i=1}^n q_i(z_i)$. Then,
$$
\theta^{(t+1)} = \arg\max_\theta \mathbb{E}_q\left[\log p(x, z \mid \theta)\right].
$$
Now, because each trial is conditionally independent of the others given the parameters,
$$
\mathbb{E}_q\left[\log p(x, z \mid \theta)\right] = \sum_{i=1}^n \mathbb{E}_{q_i}\left[\log p(x_i, z_i \mid \theta)\right] = \sum_{i=1}^n \sum_{k \in \{A, B\}} q_i(k) \log p(x_i, z_i = k \mid \theta).
$$
Let $a_i = q_i(A)$ and $b_i = q_i(B)$. Note $a_i + b_i = 1$. To maximize the above expression with respect to the parameters, we take derivatives with respect to $\theta_A$ and $\theta_B$ and set them to zero. Dropping terms that do not depend on $\theta_A$,
$$
\frac{\partial}{\partial \theta_A} \sum_{i=1}^n a_i \big[ x_i \log \theta_A + (m - x_i) \log (1 - \theta_A) \big] = \sum_{i=1}^n a_i \left( \frac{x_i}{\theta_A} - \frac{m - x_i}{1 - \theta_A} \right) = 0,
$$
and similarly for $\theta_B$. Solving for $\theta_A$ and $\theta_B$, we obtain
$$
\theta_A^{(t+1)} = \frac{\sum_{i=1}^n a_i x_i}{m \sum_{i=1}^n a_i}, \qquad
\theta_B^{(t+1)} = \frac{\sum_{i=1}^n b_i x_i}{m \sum_{i=1}^n b_i}.
$$
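These updates have a pleasing interpretation: each coin's new bias is its expected number of heads divided by its expected number of flips. Below is a minimal sketch of the resulting algorithm (not part of the original notes); the sample data, the initial guesses for theta_A and theta_B, and the fixed iteration count are arbitrary choices for illustration. Note that the two initial guesses must differ, or the coins remain indistinguishable.

```python
# EM for the two-coin example: x[i] counts heads out of m flips in trial i,
# and the coin for each trial is chosen uniformly at random.
import numpy as np

def coin_em(x, m, theta_A=0.6, theta_B=0.4, n_iters=50):
    x = np.asarray(x, dtype=float)
    for _ in range(n_iters):
        # E-Step: responsibilities a_i = q_i(A), b_i = q_i(B). The uniform
        # prior and the binomial coefficients cancel in the ratio.
        like_A = theta_A**x * (1 - theta_A)**(m - x)
        like_B = theta_B**x * (1 - theta_B)**(m - x)
        a = like_A / (like_A + like_B)
        b = 1.0 - a
        # M-Step: expected heads divided by expected flips for each coin.
        theta_A = np.sum(a * x) / (m * np.sum(a))
        theta_B = np.sum(b * x) / (m * np.sum(b))
    return theta_A, theta_B

# Example usage with made-up data: 10 flips per trial.
print(coin_em([9, 8, 2, 1, 7, 3], m=10))
```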
Example: Gaussian Mixture Model
In a Gaussian Mixture Model, samples are drawn from a random cluster, each normally distributed with its own mean and variance. Our goal will be to estimate the following parameters: the mixing weights $\pi_k$, cluster means $\mu_k$, and cluster variances $\sigma_k^2$, for $k = 1, \ldots, K$, collected into $\theta = (\pi, \mu, \sigma^2)$. The full model specification is below: for each data point $i = 1, \ldots, n$,
$$
z_i \sim \mathrm{Categorical}(\pi), \qquad x_i \mid z_i = k \sim \mathcal{N}(\mu_k, \sigma_k^2).
$$
A graphical model is shown in the accompanying figure.
Complete Data Log-Likelihood
The complete data log-likelihood for a single datapoint is
$$
\log p(x_i, z_i \mid \theta) = \sum_{k=1}^K \mathbb{1}[z_i = k] \big( \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \big).
$$
Similarly, the complete data log-likelihood over all points is
$$
\log p(x, z \mid \theta) = \sum_{i=1}^n \sum_{k=1}^K \mathbb{1}[z_i = k] \big( \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \big).
$$
The hidden posterior for a single point can be found using Bayes' rule:
$$
p(z_i = k \mid x_i, \theta) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{k'=1}^K \pi_{k'} \, \mathcal{N}(x_i \mid \mu_{k'}, \sigma_{k'}^2)}.
$$
Our derivation will follow that of [murphy:mlapp], adapted to our notation.
Before the E-Step, we have an estimate $\theta^{(t)}$ of the parameters, and seek to compute a new lower bound on the observed log-likelihood. Earlier, we showed that the optimal lower bound is
$$
\mathcal{L}(q^{(t+1)}, \theta) = \mathbb{E}_{q^{(t+1)}}\left[\log p(x, z \mid \theta)\right] + H(q^{(t+1)}),
$$
where $q^{(t+1)}(z) = p(z \mid x, \theta^{(t)})$ and the second term is constant with respect to $\theta$. The E-Step requires us to derive an expression for the first term. Using the complete data log-likelihood above, the expected complete data log-likelihood is given by
$$
\mathbb{E}_{q^{(t+1)}}\left[\log p(x, z \mid \theta)\right] = \sum_{i=1}^n \sum_{k=1}^K r_{ik} \big( \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \big),
$$
where $r_{ik} = p(z_i = k \mid x_i, \theta^{(t)})$ is the responsibility that cluster $k$ takes for data point $x_i$ after step $t$. During the E-Step, we compute these values explicitly with Bayes' rule:
$$
r_{ik} = \frac{\pi_k^{(t)} \, \mathcal{N}\big(x_i \mid \mu_k^{(t)}, (\sigma_k^{(t)})^2\big)}{\sum_{k'=1}^K \pi_{k'}^{(t)} \, \mathcal{N}\big(x_i \mid \mu_{k'}^{(t)}, (\sigma_{k'}^{(t)})^2\big)}.
$$
During the M-Step, we optimize our lower bound with respect to the parameters $\theta = (\pi, \mu, \sigma^2)$. For the mixing weights $\pi$, we use Lagrange multipliers to maximize the ELBO subject to the constraint $\sum_{k=1}^K \pi_k = 1$. Keeping only the terms that depend on $\pi$, the Lagrangian is
$$
\Lambda(\pi, \lambda) = \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k + \lambda \Big( 1 - \sum_{k=1}^K \pi_k \Big).
$$
Carrying out the optimization, we find that $\lambda = n$. The correct update for the mixing weights is
$$
\pi_k^{(t+1)} = \frac{n_k}{n}, \qquad n_k = \sum_{i=1}^n r_{ik},
$$
where $n_k$ is the effective number of points assigned to cluster $k$. For the cluster centers $\mu_k$ and variances $\sigma_k^2$, you should verify that the correct updates are
$$
\mu_k^{(t+1)} = \frac{1}{n_k} \sum_{i=1}^n r_{ik} \, x_i, \qquad
\big(\sigma_k^{(t+1)}\big)^2 = \frac{1}{n_k} \sum_{i=1}^n r_{ik} \big( x_i - \mu_k^{(t+1)} \big)^2.
$$
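The updates above translate directly into a few lines of numpy. The sketch below (not from the original notes) assumes one-dimensional data and a fixed number of clusters K, initializes the means at random data points, and computes responsibilities in log space for numerical stability.

```python
# EM for a one-dimensional Gaussian mixture, following the derived updates.
import numpy as np

def gmm_em(x, K=2, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Initialization: uniform weights, random data points as means, overall variance.
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, np.var(x))
    for _ in range(n_iters):
        # E-Step: responsibilities r[i, k] = p(z_i = k | x_i, theta).
        log_lik = (-0.5 * np.log(2 * np.pi * var)
                   - 0.5 * (x[:, None] - mu) ** 2 / var)  # shape (n, K)
        log_r = np.log(pi) + log_lik
        log_r -= log_r.max(axis=1, keepdims=True)          # for numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-Step: closed-form updates for weights, means, and variances.
        n_k = r.sum(axis=0)                                 # effective counts
        pi = n_k / n
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return pi, mu, var

# Example usage with synthetic data drawn from two well-separated clusters.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 300)])
print(gmm_em(data, K=2))
```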
Advice for Deriving EM Algorithms
The previous two examples suggest a general approach for deriving a new algorithm.
- Specify the probabilistic model. Identify the observed variables, hidden variables, and parameters. Draw the corresponding graphical model to help determine the underlying independence structure.
- Identify the complete data likelihood $p(x, z \mid \theta)$. For exponential family models, the complete data log-likelihood will be concave and easy to optimize. In other models, more work may be required.
- Identify the hidden posterior $p(z \mid x, \theta)$. If this distribution is not tractable, you may want to consider variational inference, which we will discuss later.
- Derive the E-Step. Write down an expression for the expected complete data log-likelihood $Q(\theta, \theta^{(t)}) = \mathbb{E}_{q^{(t+1)}}\left[\log p(x, z \mid \theta)\right]$.
- Derive the M-Step. Try taking derivatives of $Q(\theta, \theta^{(t)})$ with respect to each parameter and setting them to zero. If this does not yield closed-form updates, you may need to resort to gradient-based methods or variational inference.