Definition

Let be observed data and be unobserved (latent) variables. We define as the pdf of , as the joint pdf of , and as the conditional pdf of given .

The goal of the EM algorithm is maximizing the observed likelihood by utilizing the joint density .

By the definition of a Conditional Distribution, holds. We can derive the decomposition of the log-likelihood for an arbitrary but fixed :

\ln L(\boldsymbol{\theta}|\mathbf{X}) &= \int \ln[h(\mathbf{X}, \mathbf{Z} | \boldsymbol{\theta})] k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta}_{0}) d \mathbf{Z} - \int \ln[k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta})]k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta}_{0})d\mathbf{Z}\\ &= \mathbb{E}_{\mathbf{Z} \sim k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta}_{0})}[\ln h(\mathbf{X}, \mathbf{Z} | \boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z} \sim k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta}_{0})}[\ln k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta})] \end{aligned}$$ Let the first term of the RHS be the **quasi-likelihood function** (or Q-function): $$Q(\boldsymbol{\theta} | \boldsymbol{\theta}_{0}, \mathbf{X}) := \mathbb{E}_{\mathbf{Z} \sim k(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta}_{0})}[\ln h(\mathbf{X}, \mathbf{Z} | \boldsymbol{\theta})]$$ EM algorithm maximizes $Q(\boldsymbol{\theta} | \boldsymbol{\theta}_{0}, \mathbf{X})$ iteratively as a proxy for maximizing $\ln L(\boldsymbol{\theta}|\mathbf{X})$. # Algorithm 1. Expectation Step (E-step): Close the Gap Compute the expected complete log-likelihood under the current posterior: $$Q(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}^{(m)}, \mathbf{X}) := \mathbb{E}_{\mathbf{Z} \sim k(\mathbf{Z} | \mathbf{X}, \hat{\boldsymbol{\theta}}^{(m)})} [\ln h(\mathbf{X}, \mathbf{Z} | \boldsymbol{\theta})]$$ where $m = 0, 1, \dots$ denotes the iteration index. ** 2. Maximization Step (M-step): Lift the Lower Bound Update the parameters by maximizing the Q-function: $$\hat{\boldsymbol{\theta}}^{(m+1)} = \underset{\boldsymbol{\theta}}{\operatorname{arg max}} Q(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}^{(m)}, \mathbf{X})$$ # Properties ## Convergence The sequence of estimates $\hat{\boldsymbol{\theta}}^{(m)}$ satisfies the monotonicity property: $$L(\hat{\boldsymbol{\theta}}^{(m+1)}|\mathbf{X}) \geq L(\hat{\boldsymbol{\theta}}^{(m)}|\mathbf{X})$$ Therefore, the sequence of EM estimates converges to at least a local optimum of the observed likelihood. # Facts > EM algorithm is an ideal case of variational inference where the approximate posterior $q(z)$ can perfectly match the true posterior $p(z|x)$, allowing us to eliminate the gap completely at each E-step.