# Definition

A variational autoencoder (VAE) uses the decoder of an autoencoder as a generator. Unlike a traditional autoencoder, a VAE models the latent space as a probability distribution, typically a multivariate normal distribution.

# Architecture

Instead of outputting a single point in latent space, the encoder of a VAE produces the parameters of a probability distribution over the latent space. A latent vector is sampled from this distribution, and the decoder reconstructs the input from it.
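
The encode, sample, decode flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the linear `encode` and `decode` functions, the weight matrices, and the dimensions (3-dimensional input, 2-dimensional latent) are all hypothetical stand-ins for the neural networks a real VAE would use.

```python
import math
import random

def encode(x, W_mu, W_logvar):
    """Hypothetical linear encoder: maps x to the mean and log-variance of q(z|x)."""
    mu = [sum(w * xi for w, xi in zip(row, x)) for row in W_mu]
    log_var = [sum(w * xi for w, xi in zip(row, x)) for row in W_logvar]
    return mu, log_var

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))), one coordinate at a time."""
    return [rng.gauss(m, math.exp(0.5 * lv)) for m, lv in zip(mu, log_var)]

def decode(z, W_dec):
    """Hypothetical linear decoder: maps a latent vector back to data space."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W_dec]

rng = random.Random(0)
x = [1.0, -0.5, 2.0]                               # 3-dimensional input
W_mu = [[0.2, 0.1, -0.3], [0.0, 0.5, 0.1]]         # 2 x 3 (made-up weights)
W_logvar = [[-0.1, 0.0, 0.2], [0.3, -0.2, 0.0]]    # 2 x 3 (made-up weights)
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]      # 3 x 2 (made-up weights)

mu, log_var = encode(x, W_mu, W_logvar)   # distribution parameters, not a point
z = sample_latent(mu, log_var, rng)       # stochastic latent vector
x_hat = decode(z, W_dec)                  # reconstruction of x
```

The encoder returns a log-variance rather than a standard deviation so that its output is unconstrained; this is a common convention, not something this note prescribes.
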

## Latent Variable Model

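A VAE assumes a latent variable model in which each observation $x$ is generated from an unobserved latent variable $z$:

$$p_{\theta}(x, z) = p_{\theta}(x|z)p(z)$$

where $p(z)$ is the prior over the latent space, and $p_{\theta}(x|z)$ is the decoder (a neural network parameterized by $\theta$).

The training objective is to find the $\theta$ that maximizes the log-likelihood of the observed data:

$$\theta^{*} = \arg\max_{\theta} \log p_{\theta}(x), \qquad p_{\theta}(x) = \int p_{\theta}(x|z)p(z)dz$$

However, this is intractable, because computing $p_{\theta}(x)$ requires integrating over all possible $z$, which has no closed form for a neural network decoder.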

### Motivation via Exact Decomposition

For any valid distribution $r(z)$ over the latent variable, the log-likelihood $\log p_{\theta}(x)$ decomposes exactly into the ELBO (evidence lower bound) and a KL divergence term:

$$\begin{aligned}
\log p_{\theta}(x) &= \int \log (p_{\theta}(x)) r(z) dz \quad \left( \because \int r(z) dz = 1 \right) \\
&= \int \log \left( \frac{p_{\theta}(x, z)}{p_{\theta}(z | x)} \right) r(z) dz \\
&= \int \log \left( \frac{p_{\theta}(x, z)}{r(z)} \cdot \frac{r(z)}{p_{\theta}(z | x)} \right) r(z) dz \\
&= \underbrace{\int \log \left( \frac{p_{\theta}(x, z)}{r(z)} \right) r(z) dz}_{\text{ELBO}} + \underbrace{\int \log \left( \frac{r(z)}{p_{\theta}(z | x)} \right) r(z) dz}_{D_{KL}(r(z)\|p_{\theta}(z|x)) \geq 0}
\end{aligned}$$

Since the KL divergence is non-negative, $ELBO \leq \log p_{\theta}(x)$ holds, hence "lower bound."

The KL term is intractable because it requires the true posterior $p_\theta(z|x) = \cfrac{p_{\theta}(x|z)p(z)}{p_{\theta}(x)}$, whose denominator $p_\theta(x) = \int p_\theta(x|z)p(z)dz$ is itself intractable. This makes it impossible to directly track whether optimizing $\theta$ is actually improving $\log p_\theta(x)$.

### Strategy (Variational Inference)

If we choose an $r(z) \approx p_{\theta}(z|x)$, the KL term becomes close to zero, so $\log p_{\theta}(x) \approx ELBO$. Maximizing the ELBO w.r.t. $\theta$ then serves as a tractable surrogate for maximizing $\log p_\theta(x)$.

We parameterize $r(z)$ with a neural network with parameters $\phi$ that takes $x$ as input: $r(z) \rightarrow q_\phi(z|x)$. This is why the method is called **Variational**; the optimization variable is a distribution (function), not just a vector.

### Alternative Perspective: Importance Sampling

The VAE objective can be interpreted from the perspective of [[Importance Sampling]]. Since the marginal likelihood $p_{\theta}(x)$ is intractable to compute directly, we use the encoder $q_{\phi}(z|x)$ as a proposal distribution.

$$p_{\theta}(x) = \int p_\theta(x|z)p(z) dz = \int p_{\theta}(x|z) \frac{p(z)}{q_{\phi}(z|x)} q_{\phi}(z|x) dz = \mathbb{E}_{z \sim q_{\phi}(z|x)} \left[ p_{\theta}(x|z)\frac{p(z)}{q_{\phi}(z|x)} \right]$$

where the ratio $\cfrac{p(z)}{q_{\phi}(z|x)}$ acts as the importance weight, correcting for the fact that samples are drawn from $q_\phi$ instead of the prior.

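
The importance-sampling identity can be checked numerically on a model where $p_\theta(x)$ is known in closed form. The sketch below assumes a toy one-dimensional linear-Gaussian model, $z \sim N(0, 1)$ and $x|z \sim N(z, s^2)$, which is not part of this note; the names `S2`, `normal_pdf`, and `estimate_marginal` are illustrative. For this model $p(x) = N(x; 0, 1 + s^2)$ and the exact posterior is $N\left(\frac{x}{1+s^2}, \frac{s^2}{1+s^2}\right)$.

```python
import math
import random

def normal_pdf(v, mean, var):
    """Density of N(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Toy model (an assumption for illustration): z ~ N(0, 1), x | z ~ N(z, S2).
S2 = 0.1   # decoder noise variance
x = 2.0    # a single observation

# For this model the marginal is known exactly: p(x) = N(x; 0, 1 + S2).
exact = normal_pdf(x, 0.0, 1.0 + S2)

def estimate_marginal(q_mean, q_var, n, rng):
    """Importance-sampling estimate of p(x) with proposal q(z) = N(q_mean, q_var)."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        weight = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, q_mean, q_var)  # p(z) / q(z)
        total += normal_pdf(x, z, S2) * weight                           # p(x|z) * weight
    return total / n

rng = random.Random(0)

# Proposal = prior: every weight is 1, i.e. naive Monte Carlo.
naive = estimate_marginal(0.0, 1.0, 2000, rng)

# Proposal = exact posterior p(z|x) = N(x / (1 + S2), S2 / (1 + S2)).
post_mean, post_var = x / (1.0 + S2), S2 / (1.0 + S2)
with_posterior = estimate_marginal(post_mean, post_var, 2000, rng)

print(f"exact p(x)         = {exact:.6f}")
print(f"prior proposal     = {naive:.6f}")
print(f"posterior proposal = {with_posterior:.6f}")
```

When the proposal equals the exact posterior, every weighted sample equals $p_\theta(x|z)p(z)/p_\theta(z|x) = p_\theta(x)$, so the estimate matches the closed form up to floating-point error, while the prior-based estimate is noisy.
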
The encoder $q_\phi(z|x)$ concentrates samples on regions of the latent space most likely to have generated $x$, making the estimate far more efficient than naive sampling from the prior $p(z)$. Because $\log p_\theta(x)$ does not depend on $\phi$, the exact decomposition implies that maximizing the ELBO w.r.t. $\phi$ is equivalent to minimizing $D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$, which drives the proposal $q_\phi(z|x)$ toward $p_{\theta}(z|x) \propto p_{\theta}(x,z) = p_{\theta}(x|z)p(z)$. This is precisely the optimal proposal that minimizes the variance of the importance-weighted estimate.

## Loss Function (ELBO)

![[Pasted image 20250412013209.webp|800]]

Now substituting $q_{\phi}(z|x)$ for $r(z)$, we can expand the ELBO explicitly using the factorization $p_\theta(x,z) = p_\theta(x|z)p(z)$:

$$\begin{aligned}
ELBO &:= \int \log \left( \frac{p_{\theta}(x, z)}{r(z)} \right) r(z) dz \\
&= \int \log \left( \frac{p_{\theta}(x, z)}{q_{\phi}(z|x)} \right) q_{\phi}(z|x) dz\quad (r(z) \rightarrow q_\phi(z|x))\\
&= \int \log \left( \frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)} \right) q_{\phi}(z|x) dz \\
&= \int \log(p_{\theta}(x|z)) q_{\phi}(z|x) dz - \int \log \left( \frac{q_{\phi}(z|x)}{p(z)} \right) q_{\phi}(z|x) dz \\
&= \underbrace{\mathbb{E}_{z\sim q_{\phi}(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction term}} - \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{regularization term}}
\end{aligned}$$

where $q_\phi(z|x)$ is the encoder distribution, $p_\theta(x|z)$ is the decoder distribution, and $p(z)$ is the prior distribution. In practice, the training loss is the negative ELBO, which is minimized.

![[Variational Autoencoder_3f10c5d1.webp|800]]

In this formulation, the first term is the expected log-likelihood, which encourages the decoder to reconstruct the input accurately, while the second term is the [[Kullback-Leibler Divergence|KL-Divergence]], which acts as a regularizer that pulls the approximate posterior toward the prior $p(z)$. This tension keeps the model from collapsing into a standard autoencoder and encourages a latent space that remains continuous and meaningful for sampling.

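
The two ELBO terms can be evaluated by hand in a case where everything has a closed form. The sketch below assumes a toy one-dimensional model that is not part of this note: prior $z \sim N(0, 1)$, decoder $x|z \sim N(z, s^2)$, and a Gaussian encoder $q(z|x) = N(m, v)$. It uses the standard closed form $D_{KL}(N(m, v) \| N(0, 1)) = \frac{1}{2}(m^2 + v - 1 - \log v)$; the function names are illustrative.

```python
import math

S2 = 0.1   # decoder noise variance (toy assumption)
x = 2.0    # a single observation

def kl_to_standard_normal(m, v):
    """Closed-form D_KL(N(m, v) || N(0, 1)) for a one-dimensional Gaussian."""
    return 0.5 * (m * m + v - 1.0 - math.log(v))

def expected_log_likelihood(m, v):
    """E_{z ~ N(m, v)}[log N(x; z, S2)], which has a closed form for this model."""
    return -0.5 * math.log(2.0 * math.pi * S2) - ((x - m) ** 2 + v) / (2.0 * S2)

def elbo(m, v):
    """Reconstruction term minus regularization term."""
    return expected_log_likelihood(m, v) - kl_to_standard_normal(m, v)

# Exact log-marginal for this model: log N(x; 0, 1 + S2).
log_px = -0.5 * math.log(2.0 * math.pi * (1.0 + S2)) - x * x / (2.0 * (1.0 + S2))

# Encoder equal to the exact posterior N(x / (1 + S2), S2 / (1 + S2)).
post_mean, post_var = x / (1.0 + S2), S2 / (1.0 + S2)
elbo_at_posterior = elbo(post_mean, post_var)

# Encoder equal to the prior: zero regularization cost, poor reconstruction.
elbo_at_prior = elbo(0.0, 1.0)

print(f"log p(x)            = {log_px:.6f}")
print(f"ELBO, q = posterior = {elbo_at_posterior:.6f}")
print(f"ELBO, q = prior     = {elbo_at_prior:.6f}")
```

In this toy setting the ELBO equals $\log p(x)$ when the encoder matches the exact posterior and falls strictly below it otherwise, which is the lower-bound property in miniature.
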
![[Pasted image 20240912194621.webp|800]]

The regularization term encourages continuity and completeness of the latent space.

### Why Maximizing ELBO Leads to Minimizing the Gap

By applying [[Bayes Theorem]] to the ELBO, we can see the role of $\phi$ in the training process:

$$\begin{aligned}
ELBO &= \int \log \left( \frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)} \right) q_{\phi}(z|x) dz \\
&= \int \log \left( \frac{p_{\theta}(z|x)p_{\theta}(x)}{q_{\phi}(z|x)} \right) q_{\phi}(z|x) dz\quad(\because p_{\theta}(x|z)p(z) = p_\theta(z|x)p_{\theta}(x)) \\
&= \int \log \left( \frac{p_{\theta}(z|x)}{q_{\phi}(z|x)} \right) q_{\phi}(z|x) dz + \int \log(p_{\theta}(x)) q_{\phi}(z|x) dz \\
&= \underbrace{-D_{KL}(q_{\phi}(z|x) \| p_{\theta}(z|x))}_{\text{Negative Gap}} + \underbrace{\log p_{\theta}(x)}_{\text{Constant for } \phi}\quad \left( \because \int q_{\phi}(z|x) dz = 1 \right)
\end{aligned}$$

Since $\log p_{\theta}(x)$ does not depend on $\phi$, the only way for $\phi$ to increase the ELBO is to decrease $D_{KL}(q_{\phi}(z|x) \| p_{\theta}(z|x))$. This shows that even though the loss function is written as reconstruction plus prior regularization, optimizing it over $\phi$ minimizes the gap between the approximate and true posterior.

## Reparameterization Trick

![[Pasted image 20240912191141.webp|600]]

Sampling $z \sim q_{\phi}(z|x)$ is a stochastic operation with no well-defined gradient w.r.t. $\phi$, so backpropagation cannot flow through it directly. Instead of sampling directly from $N(\mu, \sigma^2)$, we sample $\epsilon \sim N(0, 1)$ and compute the latent vector as $z = \mu + \sigma \epsilon$, where $\mu$ and $\sigma$ are the encoder outputs. During backpropagation, the sampled $\epsilon$ is treated as a constant, so the gradients are well defined: $\cfrac{\partial z}{\partial \mu} = 1$ and $\cfrac{\partial z}{\partial \sigma} = \epsilon$.

# Facts

> ![[Pasted image 20240910130423.webp|800]]
> Data encoded by a VAE are semantically well separated in the low-dimensional latent space.