Paper Summary
Citation: Aneja, J., Schwing, A., Kautz, J., & Vahdat, A. (2021). A contrastive learning approach for training variational autoencoder priors. Advances in Neural Information Processing Systems, 34, 29604-29616.
Publication: NeurIPS 2021
What kind of paper is this?
This is a method paper that introduces a new training approach for the prior of Variational Autoencoders (VAEs), addressing a fundamental limitation in their generative quality through improved prior learning.
What is the motivation?
The work is motivated by a well-known limitation of Variational Autoencoders, the “prior hole” problem: the prior distribution p(z) fails to match the aggregate approximate posterior q(z). This mismatch leaves regions of latent space with high density under the prior that do not decode to realistic data samples, which is one reason VAE sample quality lags behind GANs and other generative models.
What is the novelty here?
The authors propose a novel energy-based model (EBM) prior that is trained using Noise Contrastive Estimation (NCE), which they term a Noise Contrastive Prior (NCP). The key innovations are:
- Two-Stage Training Process: First, a standard VAE is trained with a simple base prior. Then, the VAE weights are frozen and a binary classifier is trained to distinguish samples from the aggregate posterior q(z) from samples from the base prior p(z) (see the sketch after this list).
- Reweighting Strategy: The core idea is to reweight the base prior p(z) with a learned reweighting factor r(z), giving $p_{NCP}(z) \propto r(z)\,p(z)$, so that the resulting prior better matches the aggregate posterior q(z).
- NCE for EBM Training: The method avoids the expensive MCMC sampling typically required to train EBMs by casting prior learning as binary classification; at the optimum, the classifier's odds $D(z)/(1-D(z))$ recover the density ratio q(z)/p(z), which is exactly the reweighting factor r(z).
- Scalability to Hierarchical Models: For hierarchical VAEs with multiple latent groups, the NCP approach can be applied independently and in parallel to each group’s conditional prior.
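Putting these pieces together, here is a minimal PyTorch-style sketch of the second training stage. The names (`frozen_encoder`, `classifier`, `latent_dim`, the MLP architecture, and the optimizer settings) are illustrative assumptions rather than the authors' code; the ingredients taken from the paper are only the frozen VAE, the binary classification objective, and the reweighted prior $p_{NCP}(z) \propto r(z)\,p(z)$.

```python
import torch
import torch.nn as nn

# Stage 2 of NCP training (sketch): with the VAE frozen, a binary classifier
# learns to tell aggregate-posterior latents apart from base-prior latents.
# At the NCE optimum its logit estimates log r(z) = log q(z) - log p(z),
# and the reweighted prior is p_NCP(z) ∝ r(z) p(z).

latent_dim = 32  # illustrative size, not taken from the paper

classifier = nn.Sequential(          # D(z): logit that z came from q(z)
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
opt = torch.optim.Adam(classifier.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def nce_step(x, frozen_encoder):
    """One NCE update: posterior samples get label 1, base-prior samples label 0."""
    with torch.no_grad():                  # stage-1 VAE weights stay frozen
        z_post = frozen_encoder(x)         # z ~ q(z|x); pooled over the batch this is q(z)
    z_prior = torch.randn_like(z_post)     # z ~ p(z), the N(0, I) base prior
    logits = torch.cat([classifier(z_post), classifier(z_prior)])
    labels = torch.cat([torch.ones(len(z_post), 1), torch.zeros(len(z_prior), 1)])
    loss = bce(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def log_reweighting_factor(z):
    """log r(z) ≈ classifier logit; defines the (negative) energy of the NCP prior."""
    return classifier(z).squeeze(-1)
```

Because the classifier only ever sees latent vectors, this stage is cheap relative to retraining the VAE itself; for hierarchical models the same recipe is repeated per latent group, one classifier per conditional prior.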
What experiments were performed?
The method was evaluated on several standard image generation benchmarks to demonstrate its broad applicability:
- MNIST (dynamically binarized): Likelihood evaluation on a controlled, small-latent-space task
- CIFAR-10: Standard computer vision benchmark for generative modeling
- CelebA 64x64: Applied both to standard VAE architectures and to regularized autoencoders with GMM priors (RAE model)
- CelebA HQ 256x256: High-resolution face generation task
The experiments compared FID scores, likelihood metrics, and qualitative sample quality between baseline VAEs and NCP-enhanced versions, with particular focus on state-of-the-art hierarchical VAEs (NVAE).
What were the outcomes and conclusions drawn?
The proposed NCP method demonstrated significant improvements in generative quality across all evaluated datasets:
- CelebA-64: NCP improved FID scores from 48.12 to 41.28 for standard VAEs, and from 40.95 to 39.00 for RAE models with GMM priors.
- Hierarchical Models (NVAE): The impact was particularly pronounced on state-of-the-art hierarchical VAEs:
  - CIFAR-10: FID improved from 51.71 to 24.08
  - CelebA-64: FID improved dramatically from 13.48 to 5.25, making it competitive with GANs
  - CelebA HQ 256x256: FID reduced from 40.26 to 24.79
- Likelihood Performance: On MNIST, NCP-VAE achieved 78.10 nats NLL vs. baseline NVAE’s 78.67 nats
The key conclusions are that two-stage training with noise contrastive estimation provides an effective framework for learning expressive energy-based priors that:
- Addresses the prior hole problem by aligning priors with aggregate posteriors
- Scales to hierarchical models through parallel training of reweighting factors
- Avoids expensive MCMC sampling by framing EBM training as binary classification (see the sampling sketch after this list)
- Significantly improves sample quality while maintaining computational efficiency
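To make the last two bullets concrete: training never touches MCMC because the EBM prior is fit by classification, and drawing samples from the reweighted prior at generation time can also be done cheaply, e.g. by sampling-importance-resampling (SIR) with r(z) as the importance weight. The sketch below reuses the hypothetical `log_reweighting_factor` from the earlier snippet and illustrates this idea; it is not necessarily the authors' exact sampling procedure.

```python
def sample_ncp_prior(num_samples, latent_dim, num_proposals=1024):
    """Approximate samples from p_NCP(z) ∝ r(z) p(z) via
    sampling-importance-resampling over base-prior proposals."""
    with torch.no_grad():
        proposals = torch.randn(num_proposals, latent_dim)  # z ~ p(z)
        log_w = log_reweighting_factor(proposals)           # log r(z) as importance weight
        probs = torch.softmax(log_w, dim=0)                 # self-normalized weights
        idx = torch.multinomial(probs, num_samples, replacement=True)
    return proposals[idx]                                   # decode these with the frozen decoder
```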
Additional Resources
- Paper at NeurIPS 2021
Note: This is a personal learning note and may be incomplete or evolving.