IWAE: Importance Weighted Autoencoders

Paper Information

Citation: Burda, Y., Grosse, R., & Salakhutdinov, R. (2016). Importance Weighted Autoencoders. International Conference on Learning Representations (ICLR) 2016. https://arxiv.org/abs/1509.00519

Publication: ICLR 2016

What kind of paper is this?

This paper introduces the Importance Weighted Autoencoder (IWAE), a generative model that shares the architecture of the Variational Autoencoder (VAE) but trains it with a different objective function. The key innovation is using importance weighting over $k$ samples to derive a log-likelihood lower bound that is at least as tight as the standard VAE objective and tightens further as $k$ grows.

What is the motivation?

The standard VAE has several limitations that motivated this work:

Strong assumptions: VAEs typically assume the posterior distribution is simple (e.g., approximately factorial) and that its parameters can be easily approximated from observations.
Simplified representations: The VAE objective can force models to learn overly simplified representations that don’t utilize the network’s full modeling capacity.
Harsh penalization: The VAE objective harshly penalizes approximate posterior samples that explain the data poorly, which can be overly restrictive.
Inactive units (posterior collapse): VAEs tend to learn latent spaces with effective dimensions far below their capacity, a phenomenon later termed posterior collapse. The authors wanted to investigate whether a new objective could address this issue.

What is the novelty here?

The core novelty is the IWAE objective function, denoted as $\mathcal{L}_{k}$.

VAE ($\mathcal{L}_{1}$ Bound): The standard VAE maximizes $\mathcal{L}(x)=\mathbb{E}_{q(h|x)}[\log\frac{p(x,h)}{q(h|x)}]$. This is equivalent to the new bound when $k=1$.
IWAE ($\mathcal{L}_{k}$ Bound): The IWAE maximizes a tighter bound that uses $k$ samples drawn from the recognition model $q(h|x)$:

$$\mathcal{L}_{k}(x)=\mathbb{E}_{h_{1},\ldots,h_{k}\sim q(h|x)}\left[\log\frac{1}{k}\sum_{i=1}^{k}\frac{p(x,h_{i})}{q(h_{i}|x)}\right]$$

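Tighter Bound: The authors prove that the bound never decreases as $k$ grows ($\log p(x) \geq \mathcal{L}_{k+1} \geq \mathcal{L}_{k}$), so it is always at least as tight as the VAE bound $\mathcal{L}_{1}$. They also show that $\mathcal{L}_{k}$ approaches the true log-likelihood $\log p(x)$ as $k$ approaches infinity, provided the importance weights $p(x,h)/q(h|x)$ are bounded.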
Increased Flexibility: Using multiple samples gives the IWAE additional flexibility to learn generative models whose posterior distributions are complex and don’t fit the VAE’s simplifying assumptions.
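
Why $\mathcal{L}_{k}$ is a lower bound at all follows from a short argument, sketched here under the assumption that $q(h|x) > 0$ wherever $p(x,h) > 0$. Each importance weight is an unbiased estimate of the marginal likelihood, so the average of $k$ weights is too:

$$\mathbb{E}_{h \sim q(h|x)}\left[\frac{p(x,h)}{q(h|x)}\right] = \int q(h|x)\,\frac{p(x,h)}{q(h|x)}\,dh = p(x)$$

Applying Jensen’s inequality to the concave logarithm then gives:

$$\mathcal{L}_{k}(x)=\mathbb{E}\left[\log\frac{1}{k}\sum_{i=1}^{k}\frac{p(x,h_{i})}{q(h_{i}|x)}\right] \leq \log\mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k}\frac{p(x,h_{i})}{q(h_{i}|x)}\right] = \log p(x)$$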

Key Concept: Averaging Inside vs. Outside the Log

A crucial distinction exists between how VAE and IWAE utilize $k$ samples. Understanding this difference explains why increasing $k$ in IWAE actually improves the bound, while in VAE it merely reduces variance.

Figure: VAE vs IWAE computation flow. The key difference is where the log operation occurs: VAE computes $\log w_i$ for each sample and then averages (average of logs), while IWAE averages the weights first and then applies the log to the result (log of average).

VAE (Average of Logs):

For a VAE, averaging $k$ single-sample estimates leaves the expected value unchanged at the ELBO:

$$\mathbb{E}\left[ \frac{1}{k} \sum_{i=1}^k \log w_i \right] = \mathcal{L}_1 = \text{ELBO}$$

where $w_i = p(x, h_i) / q(h_i | x)$. Increasing $k$ here merely reduces the variance of the gradient estimator. The model still targets the same ELBO bound, so performance gains saturate quickly.

IWAE (Log of Average):

IWAE performs the averaging inside the logarithm:

$$\mathbb{E}\left[ \log \left( \frac{1}{k} \sum_{i=1}^k w_i \right) \right] = \mathcal{L}_k$$

By Jensen’s inequality ($\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$, because $\log$ is concave), this bound is guaranteed to be at least as tight as the VAE bound. Each increase in $k$ defines a new lower bound on the log-likelihood that is never looser than the previous one.
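
The difference between the two estimators can be checked numerically. The sketch below is not from the paper: it assumes a toy model chosen so that $\log p(x)$ is known in closed form ($h \sim \mathcal{N}(0,1)$, $x|h \sim \mathcal{N}(h,1)$, hence $x \sim \mathcal{N}(0,2)$) and a deliberately mismatched proposal $q(h|x)=\mathcal{N}(x/2, 1)$. All function names are illustrative.

```python
import math
import random


def log_normal(value, mean, var):
    """Log density of a univariate Gaussian N(mean, var) evaluated at value."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)


def log_weight(x, h, q_mean, q_var):
    """log w = log p(x, h) - log q(h | x) for the toy model."""
    log_joint = log_normal(h, 0.0, 1.0) + log_normal(x, h, 1.0)
    return log_joint - log_normal(h, q_mean, q_var)


def log_mean_exp(values):
    """Numerically stable log((1/k) * sum_i exp(values[i]))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def estimate_bounds(x, k, n_repeats=5000, seed=0):
    """Monte Carlo estimates of (average of logs, log of average) using k samples."""
    rng = random.Random(seed)
    # Assumed proposal: wider than the true posterior N(x/2, 1/2).
    q_mean, q_var = x / 2.0, 1.0
    avg_of_logs = 0.0
    log_of_avg = 0.0
    for _ in range(n_repeats):
        log_w = [
            log_weight(x, rng.gauss(q_mean, math.sqrt(q_var)), q_mean, q_var)
            for _ in range(k)
        ]
        avg_of_logs += sum(log_w) / k        # VAE-style: log first, then average
        log_of_avg += log_mean_exp(log_w)    # IWAE-style: average first, then log
    return avg_of_logs / n_repeats, log_of_avg / n_repeats


x = 1.0
true_log_px = log_normal(x, 0.0, 2.0)  # marginal of the toy model is N(0, 2)
elbo_k10, iwae_k10 = estimate_bounds(x, k=10)
```

With these assumptions, `elbo_k10` should sit a fixed distance below `true_log_px` (the KL divergence between the proposal and the true posterior), while `iwae_k10` should land much closer to it. Working in log space with `log_mean_exp` avoids underflow when the weights are tiny, which is the usual situation for image data.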

Why This Matters for Gradients:

In IWAE, the gradient weights are normalized importance weights $\tilde{w}_i = w_i / \sum_j w_j$. This means “bad” samples (those with low $w_i$) contribute very little to the gradient update since they essentially vanish from the weighted sum. In contrast, VAE uses unweighted samples, so a single sample with extremely low probability produces a massive negative log value that can dominate the loss and harshly penalize the model. IWAE’s formulation allows the model to focus learning on the samples that actually explain the data well.
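
To make the weighting concrete, here is a minimal sketch of how normalized importance weights can be computed from log-weights. The log-weight values are made up for illustration, and `normalized_weights` is a hypothetical helper name, not something defined in the paper.

```python
import math


def normalized_weights(log_w):
    """Return w_i / sum_j w_j from log-weights, computed stably in log space."""
    m = max(log_w)
    unnormalized = [math.exp(v - m) for v in log_w]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical log-weights for k = 3 samples; the third sample explains x poorly.
log_w = [-90.0, -91.0, -140.0]

# IWAE: each sample's gradient is scaled by its normalized importance weight.
iwae_coeffs = normalized_weights(log_w)

# VAE with k samples: every sample's gradient gets the same coefficient 1/k.
vae_coeffs = [1.0 / len(log_w)] * len(log_w)
```

In this example the poorly explaining third sample receives a vanishingly small coefficient under IWAE, whereas under the VAE estimator it carries the same coefficient as the other two.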

What experiments were performed?

The authors compared VAE and IWAE on density estimation tasks:

Datasets:

MNIST: $28 \times 28$ binarized handwritten digits (60,000 training / 10,000 test)
Omniglot: $28 \times 28$ binarized handwritten characters from various alphabets (24,345 training / 8,070 test)
Binarization: Dynamic sampling where binary values are sampled with expectations equal to the real pixel intensities (following Salakhutdinov & Murray, 2008)
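
Dynamic binarization can be sketched in a few lines, assuming pixel intensities have already been scaled to $[0, 1]$. The function name and the example intensities are illustrative only.

```python
import random


def binarize_dynamic(pixels, rng):
    """Sample one binary image: each pixel is 1 with probability equal to its intensity."""
    return [1 if rng.random() < p else 0 for p in pixels]


rng = random.Random(0)
image = [0.0, 0.25, 1.0]  # hypothetical intensities in [0, 1]
sample = binarize_dynamic(image, rng)
```

Calling `binarize_dynamic` afresh each time an image is used gives the model a different binary sample every time, in contrast to a fixed binarization that is sampled once and reused.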

Architectures: Two main network architectures were tested:

One stochastic layer (50 units) with two deterministic layers (200 units each)
Two stochastic layers (100 and 50 units) with deterministic layers in between

Architecture specifics:

Activations: tanh for deterministic layers; exp applied to variance predictions to ensure positivity
Distributions: Gaussian latent layers with diagonal covariance; Bernoulli observation layer
Initialization: Glorot & Bengio (2010) heuristic

Training: VAE and IWAE models were trained with $k \in \{1, 5, 50\}$ samples.

Optimization details:

Optimizer: Adam ($\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-4}$)
Batch Size: 20
Learning Rate Schedule: Annealed rate of $0.001 \cdot 10^{-i/7}$ for $3^i$ epochs (where $i = 0, \ldots, 7$), totaling 3,280 passes over the data
Hardware: GPU-based implementation using mini-batch replication to parallelize the $k$ samples
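
The learning rate schedule above can be written out explicitly to confirm the total number of epochs. This is a small sketch; the function name and its parameters are illustrative.

```python
def learning_rate_schedule(base_lr=0.001, n_stages=8):
    """Return a list of (learning_rate, n_epochs) pairs for stages i = 0..n_stages-1."""
    return [(base_lr * 10 ** (-i / 7), 3 ** i) for i in range(n_stages)]


schedule = learning_rate_schedule()
total_epochs = sum(n_epochs for _, n_epochs in schedule)  # 1 + 3 + 9 + ... + 2187
```

The stage lengths form a geometric series, so the final stage (2,187 epochs at a learning rate of $10^{-4}$) accounts for about two thirds of all training.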

Metrics:

Test Log-Likelihood: Primary measure of generative performance, estimated as the mean of $\mathcal{L}_{5000}$ (5000 samples) on the test set
Active Units: To quantify latent space richness, the authors measured “active” latent dimensions. A unit $u$ was defined as active if its activity statistic $A_{u}=\text{Cov}_{x}(\mathbb{E}_{u\sim q(u|x)}[u])$ exceeded $10^{-2}$
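
The active-unit count can be sketched as follows, assuming the posterior means $\mathbb{E}_{u\sim q(u|x)}[u]$ have already been computed for every data point and stored as a list of rows. For a scalar unit the covariance over $x$ reduces to a variance; using the population variance here is an assumption, as are the function names and the toy data.

```python
def activity_statistics(posterior_means):
    """posterior_means[n][u] is the posterior mean of unit u for data point n.

    Returns, for each unit, the variance of that mean across data points.
    """
    n_points = len(posterior_means)
    n_units = len(posterior_means[0])
    stats = []
    for u in range(n_units):
        column = [row[u] for row in posterior_means]
        mean = sum(column) / n_points
        stats.append(sum((v - mean) ** 2 for v in column) / n_points)
    return stats


def count_active_units(posterior_means, threshold=1e-2):
    """Count units whose activity statistic exceeds the threshold."""
    return sum(1 for a in activity_statistics(posterior_means) if a > threshold)


# Toy example: unit 0 responds to the input, unit 1 barely moves.
toy_means = [[-1.0, 0.001], [1.0, -0.001], [-1.0, 0.001], [1.0, -0.001]]
n_active = count_active_units(toy_means)
```

A unit whose posterior mean is almost the same for every input carries essentially no information about $x$, which is why it is counted as inactive.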

What outcomes/conclusions?

Figure: Active latent units for VAE vs IWAE (1 stochastic layer) across different $k$ values on MNIST and Omniglot. VAE active units remain flat while IWAE active units increase with $k$. Data from Table 1 of Burda et al. (2016).

Better Performance: IWAE achieved significantly higher log-likelihoods than VAEs. IWAE performance improved with increasing $k$, while VAE performance benefited only slightly from using more samples ($k>1$).
Richer Representations: In all experiments with $k>1$, IWAE learned more active latent dimensions than VAE, suggesting richer latent representations.
Objective Drives Representation: The authors found that latent dimension inactivation is driven by the objective function rather than optimization issues. They demonstrated this through an “objective swap” experiment:

Figure: Objective swap experiment on MNIST (1 stochastic layer), showing the change in active units and NLL when switching between the VAE and IWAE objectives. Switching a trained VAE to the IWAE objective improves both metrics, while switching IWAE to VAE degrades them. Data from Table 2 of Burda et al. (2016).

This experiment provides causal evidence that the objective function itself determines latent utilization:

VAE → IWAE: A converged VAE model, when fine-tuned with the IWAE objective ($k=50$), gained 3 active units (19 → 22) and improved test NLL from 86.76 to 84.88
IWAE → VAE: Conversely, a converged IWAE model fine-tuned with the VAE objective lost 2 active units (25 → 23) and worsened test NLL from 84.78 to 86.02

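Because the architecture is held fixed and each model starts from an already converged solution, the opposite directions of these changes argue against explanations based on optimization dynamics or initialization. The objective function appears to be the determining factor.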

Conclusion: IWAEs learn richer latent representations and achieve better generative performance than VAEs with equivalent architectures and training time.

Reproducibility Details

Variance Control: A common concern with importance sampling is high variance. The authors prove that the Mean Absolute Deviation of their estimator is bounded by $2 + 2\delta$, where $\delta$ is the gap between the bound and true log-likelihood. As the bound tightens, variance remains controlled.
Active Unit Threshold: The $10^{-2}$ threshold for classifying units as “active” is justified by a bimodal distribution of the log activity statistic, showing clear separation between active and inactive units.
Fixed Binarization: Results on a fixed binarization of MNIST (Larochelle, 2011) confirm that IWAE outperforms VAE regardless of preprocessing method, though with notably more overfitting compared to dynamic sampling.

Related Notes

Auto-Encoding Variational Bayes (VAE) - The foundational VAE paper that IWAE builds upon

Publication Details
Authors	Yuri Burda, Roger Grosse, Ruslan Salakhutdinov
Paper Title	Importance Weighted Autoencoders
Category	Generative Modeling
Date	November 2025
Links	📄 Paper
