Paper Summary

Citation: Burda, Y., Grosse, R., & Salakhutdinov, R. (2016). Importance Weighted Autoencoders. International Conference on Learning Representations (ICLR) 2016.

Publication: ICLR 2016

What kind of paper is this?

This paper introduces the Importance Weighted Autoencoder (IWAE), a generative model that shares the same architecture as the Variational Autoencoder (VAE) but uses a different, tighter objective function. The key innovation is using importance weighting to derive a strictly tighter log-likelihood lower bound than the standard VAE objective.

What is the motivation?

The standard VAE has several limitations that motivated this work:

  • Strong assumptions: VAEs typically assume the posterior distribution is simple (e.g., approximately factorial) and that its parameters can be easily approximated from observations
  • Simplified representations: The VAE objective can force models to learn overly simplified representations that don’t utilize the network’s full modeling capacity
  • Harsh penalization: The VAE objective harshly penalizes approximate posterior samples that are poor explanations for the data, which can be overly restrictive
  • Inactive units: VAEs tend to learn latent spaces with effective dimensions far below their capacity, and the authors wanted to investigate if a new objective could address this issue

What is the novelty?

The core novelty is the IWAE objective function, denoted as $\mathcal{L}_{k}$.

  • VAE ($\mathcal{L}_{1}$ Bound): The standard VAE maximizes $\mathcal{L}(x)=\mathbb{E}_{q(h|x)}[\log\frac{p(x,h)}{q(h|x)}]$. This is equivalent to the new bound when $k=1$.

  • IWAE ($\mathcal{L}_{k}$ Bound): The IWAE maximizes a tighter bound that uses $k$ samples drawn from the recognition model $q(h|x)$ (a code sketch of this estimator follows the list): $$\mathcal{L}_{k}(x)=\mathbb{E}_{h_{1},\dots,h_{k}\sim q(h|x)}\left[\log\frac{1}{k}\sum_{i=1}^{k}\frac{p(x,h_{i})}{q(h_{i}|x)}\right]$$

  • Tighter Bound: The authors prove that the bounds are monotonically non-decreasing in $k$ ($\mathcal{L}_{k+1} \geq \mathcal{L}_{k}$), so every $\mathcal{L}_{k}$ is at least as tight as the VAE bound $\mathcal{L}_{1}$, and that $\mathcal{L}_{k}$ converges to the true log-likelihood $\log p(x)$ as $k \to \infty$, provided the importance weights $p(x,h)/q(h|x)$ are bounded. Each $\mathcal{L}_{k}$ is a valid lower bound because, by Jensen's inequality, $\mathbb{E}\left[\log\frac{1}{k}\sum_{i}w_{i}\right] \leq \log\mathbb{E}\left[\frac{1}{k}\sum_{i}w_{i}\right]=\log p(x)$, where $w_{i}=p(x,h_{i})/q(h_{i}|x)$.

  • Increased Flexibility: Using multiple samples gives the IWAE additional flexibility to learn generative models whose posterior distributions are complex and don’t fit the VAE’s simplifying assumptions.
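
To make the objective concrete, here is a minimal sketch (not from the paper) of a Monte Carlo estimator of $\mathcal{L}_{k}$ in PyTorch. The interfaces `make_q(x)` (returning a reparameterizable distribution for $q(h|x)$) and `log_joint(x, h)` (returning $\log p(x, h)$) are hypothetical placeholders; the log-mean-exp is computed with `torch.logsumexp` for numerical stability.

```python
import math
import torch

def iwae_bound(x, make_q, log_joint, k=50):
    """Monte Carlo estimate of the IWAE bound L_k, averaged over a batch x.

    Hypothetical interfaces (not from the paper):
      make_q(x)       -> a torch.distributions object for q(h|x) supporting
                         reparameterized sampling (rsample) and log_prob
      log_joint(x, h) -> log p(x, h) under the generative model, evaluated for
                         h of shape (k, batch, latent_dim), returning (k, batch)
    """
    q = make_q(x)
    h = q.rsample((k,))                              # (k, batch, latent_dim)
    # Log importance weights: log w_i = log p(x, h_i) - log q(h_i | x);
    # the sum over the last axis assumes a factorized (diagonal) q.
    log_w = log_joint(x, h) - q.log_prob(h).sum(-1)  # (k, batch)
    # L_k = E[ log (1/k) sum_i w_i ], computed stably via logsumexp.
    return (torch.logsumexp(log_w, dim=0) - math.log(k)).mean()
```

Setting `k=1` recovers the standard VAE bound, and evaluating the same estimator with `k=5000` corresponds to the $\mathcal{L}_{5000}$ test log-likelihood estimate used in the experiments below.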

What experiments were performed?

The authors compared VAE and IWAE on density estimation tasks:

Datasets:

  • MNIST: $28 \times 28$ binarized handwritten digits
  • Omniglot: $28 \times 28$ binarized handwritten characters from various alphabets

Architectures: Two main network architectures were tested:

  1. One stochastic layer (50 units) with two deterministic layers (200 units each); see the sketch following this list
  2. Two stochastic layers (100 and 50 units) with deterministic layers in between
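
As an illustration, here is a minimal PyTorch sketch of the recognition network for the first architecture. The layer sizes (two 200-unit deterministic layers feeding a 50-unit stochastic layer) follow the description above, while the tanh nonlinearity, the diagonal-Gaussian parameterization, and all names are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Recognition network q(h|x) for the first architecture: two deterministic
    layers of 200 units feeding a 50-unit diagonal-Gaussian stochastic layer.
    Layer sizes follow the paper; other details are illustrative assumptions."""

    def __init__(self, x_dim=784, h_dim=50):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(x_dim, 200), nn.Tanh(),
            nn.Linear(200, 200), nn.Tanh(),
        )
        self.mean = nn.Linear(200, h_dim)
        self.log_std = nn.Linear(200, h_dim)

    def forward(self, x):
        t = self.body(x)
        # Diagonal Gaussian q(h|x); rsample() gives reparameterized samples.
        return torch.distributions.Normal(self.mean(t), self.log_std(t).exp())
```

The generative network would mirror this structure in reverse, ending in Bernoulli probabilities over the binarized pixels.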

Training: VAE and IWAE models were trained with $k \in \{1, 5, 50\}$ samples.

Metrics:

  1. Test Log-Likelihood: Primary measure of generative performance, estimated using the $\mathcal{L}_{5000}$ bound (5000 samples) on the test set
  2. Active Units: To quantify latent space richness, the authors measured “active” latent dimensions. A unit $u$ was defined as active if its activity statistic $A_{u}=\mathrm{Cov}_{x}\!\left(\mathbb{E}_{u\sim q(u|x)}[u]\right)$ exceeded $10^{-2}$ (a sketch of this computation follows the list)
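
Here is a minimal sketch of how this count could be computed, assuming the posterior means $\mathbb{E}_{q(h|x)}[h]$ for all test points have already been collected into a single tensor (the function name and interface are hypothetical):

```python
import torch

def count_active_units(posterior_means, threshold=1e-2):
    """Count active latent units following the paper's criterion.

    posterior_means: tensor of shape (num_datapoints, latent_dim) whose rows
    are E_{q(h|x)}[h] for each test point x (assumed precomputed; how they are
    obtained depends on the model).

    For a single unit u, Cov_x(E_{q(u|x)}[u]) is simply the variance of its
    posterior mean across the dataset; the unit is active if this exceeds 1e-2.
    """
    activity = posterior_means.var(dim=0)        # A_u for each unit u
    return int((activity > threshold).sum().item())
```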

What were the outcomes and conclusions?

  • Better Performance: IWAE achieved significantly higher log-likelihoods than VAEs. IWAE performance improved with increasing $k$, while VAE performance benefited only slightly from using more samples ($k>1$).

  • Richer Representations: In all experiments with $k>1$, IWAE learned more active latent dimensions than VAE, suggesting richer latent representations.

  • Objective Drives Representation: The authors found that the inactivation of latent dimensions is driven by the objective function rather than by optimization issues. They demonstrated this by:

    • Further training a trained VAE with the IWAE objective, which increased the number of active units and the log-likelihood
    • Further training a trained IWAE with the VAE objective, which decreased both
  • Conclusion: IWAEs learn richer latent representations and achieve better generative performance than VAEs with equivalent architectures and training time.
