Expectation Maximization



These notes provide a theoretical treatment of Expectation Maximization, an iterative parameter estimation algorithm used to find local maxima of the likelihood function in the presence of hidden variables. Introductory textbooks [murphy:mlapp, bishop:prml] typically state the algorithm without explanation and expect students to work blindly through derivations. We find this approach to be unsatisfying, and instead choose to tackle the theory head-on, followed by plenty of examples. Following [neal1998:em], we view expectation-maximization as coordinate ascent on the Evidence Lower Bound. This perspective takes much of the mystery out of the algorithm and allows us to easily derive variants like Hard EM and Variational EM.

Problem Setting

Suppose we observe data X\X generated from a model pp with true parameters θ\theta^* in the presence of hidden variables ZZ. As usual, we wish to compute the maximum likelihood estimate θ^ML=argmaxθ(θX)=argmaxθlogp(Xθ) \hat\theta_{ML} = \arg\max_\theta \ell(\theta|\X) = \arg\max_\theta \log p(\X | \theta)

of the parameters given our observed data. In some cases, we also seek to infer the values Z\mathcal{Z} of the hidden variables ZZ. In the Bayesian spirit, we will treat the parameter θ\theta^* as a realization of some random variable Θ\Theta.

The observed data log-likelihood (θX)=logp(Xθ)\ell(\theta|\X) = \log p(\X | \theta) of the parameters given the observed data is useful for both inference and parameter estimation, in which we must grapple with uncertainty about the hidden variables. Working directly with this quantity is often difficult in latent variable models because the inner sum cannot be brought out of the logarithm when we marginalize over the latent variables:

(θX)=logp(Xθ)=logzp(X,zθ) \ell(\theta|\X) = \log p(\X | \theta) = \log \sum_z p(\X, z | \theta)

In general, this likelihood is non-convex with many local maxima. In contrast, [murphy:mlapp] shows that when p(xn,znθ)p(x_n, z_n | \theta) are exponential family distributions, the likelihood is convex, so learning is much easier. Expectation maximization exploits the fact that learning is easy when we observe all variables. We will alternate between inferring the values of the latent variables and re-estimating the parameters, assuming we have complete data.

Evidence Lower Bound

Our general approach will be to reason about the hidden variables through a proxy distribution qq, which we use to compute a lower-bound on the log-likelihood. This section is devoted to deriving one such bound, called the Evidence Lower Bound (ELBO).

We can expand the data log-likelihood by marginalizing over the hidden variables:

(θX)=logp(Xθ)=logzp(X,zθ) \ell(\theta|\X) = \log p(\X|\theta) = \log \sum_z p(\X, z | \theta)

Through Jensen's inequality, we obtain the following bound, valid for any qq:

(θX)=logzp(X,zθ)=logzq(z)p(X,zθ)q(z)zq(z)logp(X,zθ)q(z)L(q,θ)\begin{aligned} \ell(\theta|\X) &= \log \sum_z p(\X, z | \theta) \\ &= \log \sum_z q(z) \frac{p(\X, z | \theta)}{q(z)} \\ &\geq \sum_z q(z) \log \frac{p(\X, z | \theta)}{q(z)} \equiv \mathcal{L}(q,\theta) \end{aligned}

The lower bound L(q,θ)\mathcal{L}(q,\theta) can be rewritten as follows:

(θX)L(q,θ)=zq(z)logp(X,zθ)q(z)=zq(z)logp(X,zθ)zq(z)logq(z)=Eq[logp(X,Zθ)]Eq[logq(z)]=Eq[logp(X,Zθ)]+H(q)\begin{aligned} \ell(\theta|\X) \geq \mathcal{L}(q,\theta) &= \sum_z q(z) \log \frac{p(\X, z | \theta)}{q(z)} \\ &= \sum_z q(z) \log p(\X, z | \theta) -\sum_z q(z) \log q(z) \\ &= E_q[ \log p(\X, Z | \theta)] -E_q[ \log q(z) ] \\ &= E_q[ \log p(\X, Z | \theta)] + H(q) \end{aligned}

Relationship to Relative Entropy

The first term in the last line above closely resembles the cross entropy between q(Z)q(Z) and the joint distribution p(X,Z)p(X, Z) of the observed and hidden variables. However, the variables XX are fixed to our observations X=XX=\X and so p(X,Z)p(\X,Z) is an unnormalized [ref]In this case, p(X,z)dz1\int p(\X, z)\, dz \neq 1.[/ref] distribution over ZZ. It is easy to see that this does not set us back too far; in fact, the lower bound L(q,θ)\mathcal{L}(q,\theta) differs from a Kullback-Liebler divergence only by a constant with respect to ZZ:

DKL(qp(ZX,θ))=H(q,p(ZX,θ))H(q)=Eq[logp(ZX,θ)]H(q)=Eq[logp(X,Zθ)]Eq[logp(Xθ)]H(q)=Eq[logp(X,Zθ)]+logp(Xθ)H(q)=L(q,θ)+const.\begin{aligned} D_{KL}(q || p(Z|\X,\theta)) &= H(q, p(Z|\X,\theta)) - H(q) \\ &= E_q[ -\log p(Z|\X,\theta) ] - H(q) \\ &= E_q[ -\log p(\X,Z | \theta) ] - E_q[ -\log p(\X|\theta) ] - H(q) \\ &= E_q[ -\log p(\X,Z | \theta) ] + \log p(\X|\theta) - H(q) \\ &= -\mathcal{L}(q,\theta) + \mathrm{const.} \end{aligned}

This yields a second proof of the evidence lower bound, following from the nonnegativity of relative entropy. In fact, this is the proof given in [tzikas2008:variational] and [murphy:mlapp].

logp(Xθ)=DKL(qp(ZX,θ))+L(q,θ)L(q,θ) \log p(\X | \theta) = D_{KL}(q || p(Z|\X, \theta)) + \mathcal{L}(q,\theta) \geq \mathcal{L}(q,\theta)

Selecting a Proxy Distribution

The quality of our lower bound L(q,θ)\mathcal{L}(q,\theta) depends heavily on the choice of proxy distribution q(Z)q(Z). We now show that the evidence lower bound is tight in the sense that equality holds when the proxy distribution q(Z)q(Z) is chosen to be the hidden posterior p(ZX,θ)p(Z|\X,\theta). This will be useful later for proving that the Expectation Maximization algorithm converges.

Maximizing L(q,θ)\mathcal{L}(q,\theta) with respect to qq is equivalent to minimizing the relative entropy between qq and the hidden posterior p(ZX,θ)p(Z|\X,\theta). Hence, the optimal choice for qq is exactly the hidden posterior, for which DKL(qp(ZX,θ))=0D_{KL}(q || p(Z|\X,\theta)) = 0, and logp(Xθ)=Eq[logp(X,Zθ)]+H(q)=L(q,θ) \log p(\X | \theta) = E_q[ \log p(\X,Z | \theta) ] + H(q) = \mathcal{L}(q,\theta)

In cases where the hidden posterior is intractable to compute, we choo

Expectation Maximization

Recall that the maximum likelihood estimate of the parameters θ\theta given observed data X\X in the presence of hidden variables ZZ is θ^ML=argmaxθ(θX)=argmaxθlogp(Xθ) \hat\theta_{ML} = \arg\max_\theta \ell(\theta|\X) = \arg\max_\theta \log p(\X | \theta)

Unfortunately, when reasoning about hidden variables, finding a global maximum is difficult. Instead, the Expectation Maximization algorithm is an iterative procedure for computing a local maximum of the likelihood function, under the assumption that the hidden posterior p(ZX,θ)p(Z|\X,\theta) is tractable. We will take advantage of the evidence lower bound (θX)L(q,θ) \ell(\theta|\X) \geq \mathcal{L}(q,\theta)

on the data likelihood. Consider only proxy distributions of the form qϑ(Z)=p(ZX,ϑ)q_\vartheta(Z) = p(Z|\X,\vartheta), where ϑ\vartheta is some fixed configuration of the variables Θ\Theta, possibly different from our estimate θ\theta. The optimal value for ϑ\vartheta, in the sense that L(qϑ,θ)\mathcal{L}(q_\vartheta, \theta) is maximum, depends on the particular choice of θ\theta. Similarly, the optimal value for θ\theta depends on the choice of ϑ\vartheta. This suggests an iterative scheme in which we alternate between maximizing with respect to ϑ\vartheta and with respect to θ\theta, gradually improving the log-likelihood.

Iterative Procedure

Suppose at time tt we have an estimate θt\theta_t of the parameters. To improve our estimate, we perform two steps of coordinate ascent on L(ϑ,θ)L(qϑ,θ)\L(\vartheta, \theta) \equiv \L(q_\vartheta, \theta), as described in [neal1998:em],


Compute a new lower bound on the observed log-likelihood, with ϑt+1=argmaxϑL(ϑ,θt)=θt \vartheta_{t+1} = \arg\max_\vartheta \mathcal{L}(\vartheta, \theta_t) = \theta_t


Estimate new parameters by optimizing over the lower bound, θt+1=argmaxθL(ϑt+1,θ)=argmaxθEq[logp(X,Zθ)] \theta_{t+1} = \arg\max_\theta \mathcal{L}(\vartheta_{t+1}, \theta) = \arg\max_\theta E_q[ \log p(\X,Z|\theta) ]

In the M-Step, the expectation is taken with respect to qϑt+1q_{\vartheta_{t+1}}.

Alternate Formulation

In the M-Step, the entropy term of the evidence lower bound L(ϑt+1,θ)\mathcal{L}(\vartheta_{t+1}, \theta) does not depend on θ\theta. The remaining term Q(θt,θ)=Eq[logp(X,Zθ)]Q(\theta_t, \theta)=E_q[\log p(\X,Z|\theta)] is sometimes called the auxiliary function or Q-function. To us, this is the expected complete-data log-likelihood.

Proof of Convergence

To prove convergence of this algorithm, we show that the data likelihood (θX)\ell(\theta|\X) increases after each update.

After a single iteration of Expectation Maximization, the observed data likelihood of the estimated parameters has not decreased, that is, (θtX)(θt+1X) \ell(\theta_t | \X) \leq \ell(\theta_{t+1} | \X)

This result is a simple consequence of all the hard work we have put in so far:

(θtX)=L(qϑt+1,θt)(tightness)L(qϑt+1,θt+1)(M-Step)(θt+1X)(ELBO)\begin{aligned} \ell(\theta_t | \X) &= \mathcal{L}(q_{\vartheta_{t+1}}, \theta_t) & \text{(tightness)} \\ &\leq \mathcal{L}(q_{\vartheta_{t+1}}, \theta_{t+1}) & \text{(M-Step)} \\ &\leq \ell(\theta_{t+1} | \X) & \text{(ELBO)} \end{aligned}

It is also possible to show that Expectation-Maximization converges to something useful.

(Neal & Hinton 1998, Thm. 2) Every local maximum of the evidence lower bound L(q,θ)\mathcal{L}(q, \theta) is a local maximum of the data likelihood (θX)\ell(\theta | \X).

Starting from an initial guess θ0\theta_0, We run this procedure until some stopping criterion is met and obtain a sequence {(ϑt,θt)}t=1T\{ (\vartheta_t, \theta_t) \}_{t=1}^T of parameter estimates.

Example: Coin Flips

Now that we have a good grasp on the theory behind Expectation Maximization, let's get some intuition by means of a simple example. As usual, the simplest possible example involves coin flips!

Probabilistic Model

Suppose we have two coins, each with a different probability of heads, θA\theta_A and θB\theta_B, unknown to us. We collect data from a series of NN trials in order to estimate the bias of each coin. Each trial kk consists of flipping the same random coin ZkZ_k a total of MM times and recording only the total number XkX_k of heads.

This situation is best described by the following generative probabilistic model, which precisely describes our assumptions about how the data was generated. The corresponding graphical model and a set of sample data are shown in Figure .\ θ=(θA,θB)fixed coin biasesZnUniform{A,B}n=1,,Ncoin indicatorsXnZn,θBin[θZn,M]n=1,,Nhead count \begin{aligned} \theta &= (\theta_A, \theta_B) & &&\text{fixed coin biases} \\ Z_n &\sim \mathrm{Uniform}\{A, B\} & \forall\, n=1,\dots,N &&\text{coin indicators} \\ X_n | Z_n, \theta &\sim \mathrm{Bin}[\theta_{Z_n}, M] & \forall\, n=1,\dots,N &&\text{head count} \end{aligned}

Complete Data Log-Likelihood

The complete data log-likelihood for a single trial (xn,zn)(x_n, z_n) is logp(xn,znθ)=logp(zn)+logp(xnzn,θ) \log p(x_n, z_n | \theta) = \log p(z_n) + \log p(x_n | z_n, \theta) In this model, P(zn)=12P(z_n) = \frac{1}{2} is uniform. The remaining term is logp(xnzn,θ)=log(Mxn)θznxn(1θzn)Mxn=log(Mxn)+xnlogθzn+(Mxn)log(1θzn)\begin{aligned} \log p(x_n | z_n, \theta) &= \log \binom{M}{x_n} \theta_{z_n}^{x_n} (1-\theta_{z_n})^{M-x_n} \\ &= \log \binom{M}{x_n} + x_n \log\theta_{z_n} + (M-x_n)\log(1-\theta_{z_n}) \end{aligned}

Expectation Maximization

Now that we have specified the probabilistic model and worked out all relevant probabilities, we are ready to derive an Expectation Maximization algorithm.

The E-Step is straightforward. The M-Step computes a new parameter estimate θt+1\theta_{t+1} by optimizing over the lower bound found in the E-Step. Let ϑ=ϑt+1=θt\vartheta = \vartheta_{t+1} = \theta_t. Then, θt+1=argmaxθL(θ,qϑ)=argmaxθEq[logp(X,Zθ)]=argmaxθEq[logp(XZ,θ)p(Z)]=argmaxθEq[logp(XZ,θ)]+logp(Z)=argmaxθEq[logp(XZ,θ)]\begin{aligned} \theta_{t+1} = \arg\max_\theta \L(\theta, q_\vartheta) &= \arg\max_\theta E_q[\log p(\X, Z | \theta )] \\ &= \arg\max_\theta E_q[\log p(\X | Z, \theta) p(Z)] \\ &= \arg\max_\theta E_q[\log p(\X | Z, \theta)] + \log p(Z) \\ &= \arg\max_\theta E_q[\log p(\X | Z, \theta)] \end{aligned}

Now, because each trial is conditionally independent of the others, given the parameters, Eq[logp(XZ,θ)]=Eq[logn=1Np(xnZn,θ)]=n=1NEq[logp(xnZn,θ)]=n=1NEq[xnlogθzn+(Mxn)log(1θzn)]+n=1Nlog(Mxn)=n=1NEq[xnlogθzn+(Mxn)log(1θzn)]+const. w.r.t. θ=n=1Nqϑ(zn=A)[xnlogθA+(Mxn)logθA]+n=1Nqϑ(zn=B)[xnlogθB+(Mxn)logθB]+const. w.r.t. θ\begin{aligned} E_q[\log p(\X | Z, \theta)] &= E_q\left[ \log \prod_{n=1}^N p(x_n | Z_n, \theta) \right] = \sum_{n=1}^N E_q[\log p(x_n | Z_n , \theta)] \\ &= \sum_{n=1}^N E_q \bigg[ x_n \log \theta_{z_n} + (M-x_n) \log (1-\theta_{z_n}) \bigg] + \sum_{n=1}^N \log \binom{M}{x_n} \\ &= \sum_{n=1}^N E_q \bigg[ x_n \log \theta_{z_n} + (M-x_n) \log (1-\theta_{z_n}) \bigg] + \text{const. w.r.t. } \theta \\ &= \sum_{n=1}^N q_\vartheta(z_n=A) \bigg[ x_n \log \theta_A + (M-x_n) \log \theta_A \bigg] \\ &+ \sum_{n=1}^N q_\vartheta(z_n=B) \bigg[ x_n \log \theta_B + (M-x_n) \log \theta_B \bigg] + \text{const. w.r.t. } \theta \end{aligned}

Let ak=q(zk=A)a_k = q(z_k = A) and bk=q(zk=B)b_k = q(z_k = B). Note k=1Nak=k=1Nbk=1\sum_{k=1}^N a_k = \sum_{k=1}^N b_k = 1. To maximize the above expression with respect to the parameters, we take derivatives with respect to θA\theta_A and θB\theta_B and set to zero: θA[Eq[logp(XZ,θ)]]=1θAn=1Nanxn+11θAn=1Nan(Mxn)=0θB[Eq[logp(XZ,θ)]]=1θBn=1Nbnxn+11θBn=1Nbn(Mxn)=0\begin{aligned} \frac{\partial}{\partial \theta_A} \bigg[ E_q[\log p(\X | Z, \theta)] \bigg] &= \frac{1}{\theta_A} \sum_{n=1}^N a_n x_n + \frac{1}{1-\theta_A} \sum_{n=1}^N a_n (M-x_n) = 0 \\ % \frac{\partial}{\partial \theta_B} \bigg[ E_q[\log p(\X | Z, \theta)] \bigg] &= \frac{1}{\theta_B} \sum_{n=1}^N b_n x_n + \frac{1}{1-\theta_B} \sum_{n=1}^N b_n (M-x_n) = 0 \\ \end{aligned}

Solving for θA\theta_A and θB\theta_B, we obtain θA=n=1Nanxnn=1NanMθB=n=1Nbnxnn=1NbnM \theta_A = \frac{\sum_{n=1}^N a_n x_n}{\sum_{n=1}^N a_n M} \qquad \theta_B = \frac{\sum_{n=1}^N b_n x_n}{\sum_{n=1}^N b_n M}

Example: Gaussian Mixture Model

Probabilistic Model

In a Gaussian Mixture Model, samples are drawn from a random cluster, each normally distributed with its own mean and variance. Our goal will be to estimate the following parameters: π=(π1,,πK)mixing weightsμ=(μ1,,μK)cluster centersΣ=(Σ1,,ΣK)cluster variance \begin{aligned} \vec\pi &= (\pi_1, \dots, \pi_K) && \text{mixing weights} \\ \vec\mu &= (\mu_1, \dots, \mu_K) && \text{cluster centers} \\ \vec\Sigma &= (\Sigma_1, \dots, \Sigma_K) && \text{cluster variance} \end{aligned} The full model specification is below. A graphical model is shown in Figure . θ=(π,μ,Σ)model parametersznCat[π]cluster indicatorsxnzn,θN(μzn,Σzn)base distribution \begin{aligned} \theta &= (\vec\pi, \vec\mu, \vec\Sigma) && \text{model parameters} \\ z_n &\sim \mathrm{Cat}[\pi] && \text{cluster indicators} \\ x_n | z_n, \theta &\sim \mathcal{N}(\mu_{z_n}, \Sigma_{z_n}) && \text{base distribution} \end{aligned}

Complete Data Log-Likelihood

The complete data log-likelihood for a single datapoint (xn,zn)(x_n, z_n) is logp(xn,znθ)=logk=1KπkN(xnμk,Σk)I(zn=k)=k=1KI(zn=k)logπkN(xnμk,Σk)\begin{aligned} \log p(x_n, z_n | \theta) &= \log \prod_{k=1}^K \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{\mathbb{I}(z_n = k)} \\ &= \sum_{k=1}^K \mathbb{I}(z_n = k) \log \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \end{aligned} Similarly, the complete data log-likelihood over all points {(xn,zn)}n=1N\{ (x_n, z_n) \}_{n=1}^N is logp(X,Zθ)=n=1Nlogp(xn,znθ)=n=1Nk=1KI(zn=k)logπkN(xnμk,Σk) \log p(X,Z | \theta) = \sum_{n=1}^N \log p(x_n, z_n | \theta) = \sum_{n=1}^N \sum_{k=1}^K \mathbb{I}(z_n = k) \log \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

Hidden Posterior

The hidden posterior for a single point (xn,zn)(x_n, z_n) can be found using Bayes' rule: p(zn=kxn,θ)=P(zn=kθ)p(xnzn=k,θ)p(xNθ)=πkN(xnμk,Σk)k=1KπkN(xnμk,Σk)\begin{aligned} p(z_n = k | x_n, \theta) &= \frac{P(z_n=k | \theta) p(x_n | z_n=k, \theta)}{p(x_N | \theta)} \\ &= \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{k'=1}^K \pi_{k'} \mathcal{N}(x_n | \mu_{k'}, \Sigma_{k'})} \end{aligned}

Expectation Maximization

Our derivation will follow that of [murphy:mlapp], adapted to our notation.


Before the E-step, we have an estimate θt\theta_t of the parameters, and seek to compute a new lower bound on the observed log-likelihood. Earlier, we showed that the optimal lower bound is L(qθt,θ)=Eq[logp(X,Zθ)]+const. \L(q_{\theta_t}, \theta) = E_q[ \log p(\X,Z|\theta)] + \text{const.} where qθt(z)p(zX,θt)q_{\theta_t}(z) \equiv p(z|\X,\theta_t) and the second term is constant with respect to θ\theta. The E-Step requires us to derive an expression for the first term. Using , the expected complete data log-likelihood is given by Q(θt,θ)=Eq[logp(X,Zθ)]=n=1Nk=1KEq[I(zn=k)logπkN(xnμk,Σk)]=n=1Nk=1KEq[I(zn=k)]logπkN(xnμk,Σk)=n=1Nk=1Kp(zn=kxn,θt)logπkN(xnμk,Σk)=n=1Nk=1Krnklogπk+n=1Nk=1KrnklogN(xnμk,Σk)\begin{aligned} Q(\theta_t, \theta) = E_q[ \log p(\X,Z|\theta)] &= \sum_{n=1}^N \sum_{k=1}^K E_q\big[ \mathbb{I}(z_n = k) \log \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big] \\ &= \sum_{n=1}^N \sum_{k=1}^K E_q\big[ \mathbb{I}(z_n = k) \big] \log \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \\ &= \sum_{n=1}^N \sum_{k=1}^K p(z_n=k \mid x_n, \theta_t) \log \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \\ &= \sum_{n=1}^N \sum_{k=1}^K r_{nk} \log \pi_k + \sum_{n=1}^N \sum_{k=1}^K r_{nk} \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \end{aligned} where rnkp(zn=kxn,θt)r_{nk} \equiv p(z_n = k \mid x_n, \theta_t) is the responsibility that cluster kk takes for data point xnx_n after step tt. During the E-Step, we compute these values explicitly with .


During the M-Step, we optimize our lower bound with respect to the parameters θ=(π,μ,Σ)\theta = (\vec\pi, \vec\mu, \vec\Sigma). For the mixing weights π\vec\pi, we use Lagrange multipliers to maximize the ELBO subject to the constraint k=1Kπk=1\sum_{k=1}^K \pi_k = 1. The Lagrangian is Λ(π,λ)=Q(θt,θ)+λ(k=1Kπk1) \Lambda(\pi, \lambda) = Q(\theta_t, \theta) + \lambda \left( \sum_{k=1}^K \pi_k - 1 \right) Carrying out the optimization, we find that λ=N\lambda = -N. The correct update for the mixing weights is πk=1Nn=1Nrnk=rkN \boxed{ \pi_k = \frac{1}{N} \sum_{n=1}^N r_{nk} = \frac{r_k}{N} } where rkn=1nrnkr_k \equiv \sum_{n=1}^n r_{nk} is the effective number of points assigned to cluster kk. For the cluster centers μ\vec\mu and variance Σ\vec\Sigma, you should verify that the correct updates are μk=n=1NrnkxnrkΣk=n=1NrnkxnxnTrkμkμkT \boxed{ \mu_k = \frac{\sum_{n=1}^N r_{nk} x_n}{r_k} } \qquad \boxed{ \Sigma_k = \frac{\sum_{n=1}^N r_{nk} x_n x_n^T}{r_k} - \mu_k \mu_k^T }

Advice for Deriving EM Algorithms

The previous two examples suggest a general approach for deriving a new algorithm.

  • Specify the probabilistic model. Identify the observed variables, hidden variables, and parameters. Draw the corresponding graphical model to help determine the underlying independence structure.
  • Identify the complete-data likelihood P(X,Zθ)P(X,Z|\theta). For exponential family models, the complete-data likelihood will be convex and easy to optimize. In other models, other work may be required.
  • Identify the hidden posterior P(ZX,θ)P(Z|X,\theta). If this distribution is not tractable, you may want to consider variational inference, which we will discuss later.
  • Derive the E-Step. Write down an expression for Eq[logp(XZ,θ)]E_q[ \log p(\X | Z,\theta)].
  • Derive the M-Step. Try taking derivatives and setting to zero. If this doesn't work, you may need to resort to gradient-based methods or variational inference.