Consistency Models: Fast One-Step Diffusion Generation

What kind of paper is this?

This is a Method paper. It proposes consistency models, a new class of generative models designed for fast one-step (or few-step) generation. The models can be trained either by distilling pretrained diffusion models (consistency distillation) or as standalone generative models from scratch (consistency training). The paper provides theoretical analysis of both training modes and achieves FID 3.55 on CIFAR-10 for single-step non-adversarial generation (state of the art at the time of publication).

The Slow Sampling Problem in Diffusion

Diffusion models produce high-quality samples but require iterating through many denoising steps (often tens to hundreds), making generation slow compared to GANs or VAEs. Previous approaches to speed up sampling include faster ODE/SDE solvers (DDIM, DPM-Solver) and progressive distillation. These either still require multiple steps or depend on a complex multi-stage distillation pipeline. The goal is a model that can generate high-quality samples in a single forward pass while optionally allowing more steps for better quality.

Core Innovation: The Self-Consistency Property

The key idea builds on the Probability Flow (PF) ODE from the score-based SDE framework. The PF ODE describes a deterministic trajectory that converts noise into data, governed by the learned score function. For the VE-SDE parameterization used by EDM (Karras et al., 2022), this takes the form:

$$\frac{d\mathbf{x}_t}{dt} = -t \, s_\phi(\mathbf{x}_t, t)$$

where $s_\phi$ is a pretrained score model. A consistency function $f(\mathbf{x}_t, t)$ maps any point on an ODE trajectory to the trajectory's origin $\mathbf{x}_\epsilon$. The defining property is self-consistency:

$$f(\mathbf{x}_t, t) = f(\mathbf{x}_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T]$$

for any points $\mathbf{x}_t$ and $\mathbf{x}_{t'}$ on the same PF ODE trajectory.
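
As a worked illustration (a toy case chosen here, not taken from the paper), assume the data distribution is $\mathcal{N}(0, \sigma_d^2 I)$ and the score is exact. Under these assumptions the consistency function has a closed form, which makes the self-consistency property easy to check by hand:

$$p_t = \mathcal{N}\left(0, (\sigma_d^2 + t^2) I\right), \qquad s(\mathbf{x}, t) = -\frac{\mathbf{x}}{\sigma_d^2 + t^2}$$

$$\frac{d\mathbf{x}_t}{dt} = -t \, s(\mathbf{x}_t, t) = \frac{t \, \mathbf{x}_t}{\sigma_d^2 + t^2} \quad\Longrightarrow\quad \mathbf{x}_t = \mathbf{x}_\epsilon \sqrt{\frac{\sigma_d^2 + t^2}{\sigma_d^2 + \epsilon^2}}$$

$$f(\mathbf{x}_t, t) = \sqrt{\frac{\sigma_d^2 + \epsilon^2}{\sigma_d^2 + t^2}} \, \mathbf{x}_t = \mathbf{x}_\epsilon$$

Every point on the trajectory maps to the same $\mathbf{x}_\epsilon$, and setting $t = \epsilon$ gives the identity map.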

Parameterization. The model enforces the boundary condition $f(\mathbf{x}_\epsilon, \epsilon) = \mathbf{x}_\epsilon$ using skip connections:

$$f_\theta(\mathbf{x}, t) = c_{\text{skip}}(t) \, \mathbf{x} + c_{\text{out}}(t) \, F_\theta(\mathbf{x}, t)$$

where $c_{\text{skip}}(\epsilon) = 1$ and $c_{\text{out}}(\epsilon) = 0$, ensuring the boundary condition is satisfied by construction.
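
The parameterization can be sketched in a few lines. This is a minimal illustration, not the official implementation: the coefficient formulas are the EDM-style choices we believe the paper uses, the values `EPS = 0.002` and `SIGMA_DATA = 0.5` are assumed from EDM conventions, and `toy_network` is a hypothetical stand-in for $F_\theta$.

```python
import numpy as np

EPS = 0.002        # smallest time epsilon (assumed, EDM convention)
SIGMA_DATA = 0.5   # data standard deviation (assumed, EDM convention)

def c_skip(t):
    # Equals 1 at t = EPS, so the input passes through unchanged there.
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)

def c_out(t):
    # Equals 0 at t = EPS, so the network output is switched off there.
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)

def consistency_fn(F, x, t):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F(x, t)."""
    return c_skip(t) * x + c_out(t) * F(x, t)

def toy_network(x, t):
    # Hypothetical placeholder for the free-form network F_theta.
    return np.tanh(x) + t

x = np.array([0.3, -1.2, 2.0])
at_boundary = consistency_fn(toy_network, x, EPS)
print(at_boundary)  # matches x regardless of what toy_network returns
```

Because the boundary condition is built into the coefficients, it holds for any network weights and never needs to be enforced by the loss.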

Consistency Distillation (CD). Given a pretrained diffusion model, CD trains a consistency model by enforcing self-consistency between adjacent timesteps:

$$\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) = \mathbb{E}\left[\lambda(t_n) \, d\!\left(f_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}), \, f_{\theta^-}(\hat{\mathbf{x}}_{t_n}^\phi, t_n)\right)\right]$$

where $\hat{\mathbf{x}}_{t_n}^\phi$ is obtained by running one step of the ODE solver using the pretrained score model, $\theta^-$ is an exponential moving average (EMA) of $\theta$, and $d(\cdot, \cdot)$ is a distance metric. The use of a target network $\theta^-$ (updated via EMA) parallels techniques from deep Q-learning and momentum contrastive learning.
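
The CD objective can be illustrated on a toy problem. The sketch below makes several assumptions that are not from the paper: the data are $\mathcal{N}(0, \sigma_d^2 I)$ so that an analytic score can stand in for the pretrained $s_\phi$, the "model" is a hypothetical one-parameter family `f(theta, x, t)` in which `theta = 1` is the exact consistency function, the distance $d$ is squared $\ell_2$, and $\lambda(t_n) = 1$.

```python
import numpy as np

EPS, SIGMA_D = 0.002, 0.5  # assumed values for the toy setup

def score(x, t):
    # Exact score of N(0, (SIGMA_D^2 + t^2) I); stands in for s_phi.
    return -x / (SIGMA_D**2 + t**2)

def euler_step(x, t_from, t_to):
    # One Euler step of the PF ODE dx/dt = -t * s_phi(x, t).
    return x + (t_to - t_from) * (-t_from * score(x, t_from))

def f(theta, x, t):
    # Hypothetical one-parameter model; f(theta, x, EPS) = x for every theta.
    ratio = np.sqrt(SIGMA_D**2 + EPS**2) / np.sqrt(SIGMA_D**2 + t**2)
    return x * ratio**theta

def cd_loss(theta, theta_ema, x0, z, t_n, t_np1):
    x_np1 = x0 + t_np1 * z                    # noisy sample at t_{n+1}
    x_n_hat = euler_step(x_np1, t_np1, t_n)   # solver estimate at t_n
    online = f(theta, x_np1, t_np1)           # gradients would flow here
    target = f(theta_ema, x_n_hat, t_n)       # target network, no gradient
    return np.mean(np.sum((online - target) ** 2, axis=-1))

def ema_update(theta_ema, theta, mu):
    # theta^- <- mu * theta^- + (1 - mu) * theta
    return mu * theta_ema + (1.0 - mu) * theta

rng = np.random.default_rng(0)
x0 = SIGMA_D * rng.standard_normal((256, 4))
z = rng.standard_normal((256, 4))
loss_exact = cd_loss(1.0, 1.0, x0, z, t_n=0.9, t_np1=1.0)
loss_identity = cd_loss(0.0, 0.0, x0, z, t_n=0.9, t_np1=1.0)
print(loss_exact, loss_identity)
```

In this toy case the exact consistency function leaves only a small residual loss, which comes from the Euler discretization error, while the identity map (`theta = 0`) is penalized much more heavily.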

Consistency Training (CT). CT eliminates the need for a pretrained diffusion model. It replaces the ODE solver step with a score estimate derived from the denoising score matching identity:

$$\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) = \mathbb{E}\left[\frac{\mathbf{x} - \mathbf{x}_t}{t^2} \,\middle|\, \mathbf{x}_t\right]$$

Because this identity expresses the score as a conditional expectation over clean data, a single training sample $\mathbf{x}$ and its noisy version $\mathbf{x}_t$ give an unbiased score estimate without a pretrained model. Substituting that estimate into the ODE update lets CT train directly on data pairs $(\mathbf{x}, \mathbf{x} + t\mathbf{z})$ where $\mathbf{z} \sim \mathcal{N}(0, I)$.
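
A short numerical check makes the substitution concrete. In this sketch (random arrays stand in for training data, and the timesteps are arbitrary), plugging the single-sample score estimate $(\mathbf{x} - \mathbf{x}_{t_{n+1}})/t_{n+1}^2$ into one Euler step of the PF ODE lands exactly on $\mathbf{x} + t_n \mathbf{z}$, so the adjacent point can be formed by reusing the same noise $\mathbf{z}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))   # stand-in for clean training samples
z = rng.standard_normal(x.shape)  # shared Gaussian noise
t_n, t_np1 = 0.7, 0.9             # arbitrary adjacent timesteps

x_np1 = x + t_np1 * z             # noisy sample at t_{n+1}

# Single-sample estimate of the score at (x_np1, t_np1).
score_est = (x - x_np1) / t_np1**2

# Euler step of dx/dt = -t * score from t_{n+1} down to t_n.
x_n_euler = x_np1 - (t_n - t_np1) * t_np1 * score_est

# Same point built directly from the clean sample and the same noise.
x_n_direct = x + t_n * z

print(np.allclose(x_n_euler, x_n_direct))  # True
```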

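Theoretical guarantee. If CD achieves zero loss, the gap between the learned consistency model and the true consistency function of the PF ODE is bounded by $O((\Delta t)^p)$, where $\Delta t$ is the maximum timestep gap and $p$ is the order of the ODE solver. A finer discretization therefore brings the distilled model closer to the true consistency function.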

Experiments and Benchmarks

Datasets: CIFAR-10 (32x32), ImageNet 64x64, LSUN Bedroom 256x256, LSUN Cat 256x256.

Architecture: All models use the NCSN++/EDM architecture. CD distills from pretrained EDM models.

Key results for consistency distillation (CD):

Dataset	Steps	FID
CIFAR-10	1	3.55
CIFAR-10	2	2.93
ImageNet 64x64	1	6.20
ImageNet 64x64	2	4.70
LSUN Bedroom 256	1	7.80
LSUN Bedroom 256	2	5.22
LSUN Cat 256	1	11.0
LSUN Cat 256	2	8.84

CD outperforms progressive distillation (PD) across all datasets and sampling steps, with the exception of single-step generation on Bedroom 256x256 where CD with $\ell_2$ slightly underperforms PD with $\ell_2$.

Key results for consistency training (CT):

Dataset	Steps	FID
CIFAR-10	1	8.70
CIFAR-10	2	5.83
ImageNet 64x64	1	13.0
ImageNet 64x64	2	11.1
LSUN Bedroom 256	1	16.0
LSUN Cat 256	1	20.7

CT outperforms existing single-step non-adversarial models (VAEs, normalizing flows), e.g., improving over DC-VAE’s FID of 17.90 on CIFAR-10. Samples from CT share structural similarity with EDM samples from the same initial noise, suggesting CT does not suffer from mode collapse.

Zero-shot editing: Consistency models support colorization, super-resolution, inpainting, stroke-guided generation, interpolation, and denoising at test time without task-specific training, by modifying the multi-step sampling algorithm.
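
The multi-step sampler that these editing procedures modify alternates between denoising with the consistency model and re-injecting noise at a smaller time. The sketch below is illustrative only: `T = 80` and `EPS = 0.002` are assumed from EDM conventions, the time points `taus` are arbitrary, and `toy_f` is a hypothetical closed-form consistency function for data concentrated at a single point `MU`, standing in for a trained network.

```python
import numpy as np

EPS, T = 0.002, 80.0  # assumed time range [EPS, T]

def multistep_sample(f, shape, taus, rng):
    """Denoise from pure noise, then alternate re-noising and denoising."""
    x_T = T * rng.standard_normal(shape)   # initial noise ~ N(0, T^2 I)
    x = f(x_T, T)                          # one-step sample
    for tau in taus:                       # decreasing times in (EPS, T)
        z = rng.standard_normal(shape)
        x_tau = x + np.sqrt(tau**2 - EPS**2) * z  # re-noise to time tau
        x = f(x_tau, tau)                         # denoise again
    return x

MU = np.array([1.0, -2.0])

def toy_f(x, t):
    # Exact consistency function when all data equal MU (toy assumption).
    return MU + (x - MU) * EPS / t

rng = np.random.default_rng(0)
one_step = multistep_sample(toy_f, (2000, 2), taus=[], rng=rng)
three_step = multistep_sample(toy_f, (2000, 2), taus=[10.0, 1.0], rng=rng)
print(one_step.mean(axis=0), three_step.mean(axis=0))
```

With an empty `taus` list this reduces to single-step generation; each extra time point costs one more network evaluation.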

Findings and Limitations

Consistency distillation achieves state-of-the-art FID for one-step generation (3.55 on CIFAR-10, 6.20 on ImageNet 64x64).
Multi-step sampling provides a smooth quality-compute tradeoff: more steps yield better FID.
CT produces competitive results without any pretrained diffusion model, making consistency models a standalone generative model family.
The LPIPS distance metric $d(\cdot, \cdot)$ generally outperforms $\ell_1$ and $\ell_2$ for training consistency models.
At higher resolutions (LSUN 256x256), the gap between CD/CT and full EDM sampling widens.
CT currently underperforms CD, suggesting room for improvement in the standalone training paradigm.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Primary benchmark	CIFAR-10	32x32, 50K train	FID on 50K samples
Scaling benchmark	ImageNet 64x64	64x64, 1.28M	Unconditional generation
High-res benchmark	LSUN Bedroom, Cat	256x256	Unconditional generation

Algorithms

ODE solver for CD: Euler and Heun (2nd order) solvers on the empirical PF ODE
EMA for target network: Decay rate $\mu$ scheduled as a function of training step
Schedule functions: $N$ (number of discretization steps) and $\mu$ (EMA rate) increase over training following specific schedules (see Appendix C of the paper)
Distance metric: LPIPS performs best; $\ell_2$ and $\ell_1$ also evaluated
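
The $N$ discretization points can be generated as follows. This sketch assumes the time grid of Karras et al. (2022) with $\rho = 7$, $\epsilon = 0.002$, and $T = 80$; the function name `karras_timesteps` and the choice `N = 18` are only for illustration.

```python
import numpy as np

def karras_timesteps(N, eps=0.002, T=80.0, rho=7.0):
    """Return t_1 < ... < t_N with t_1 = eps and t_N = T (assumed EDM grid)."""
    i = np.arange(N)
    lo, hi = eps ** (1.0 / rho), T ** (1.0 / rho)
    return (lo + i / (N - 1) * (hi - lo)) ** rho

ts = karras_timesteps(18)
gaps = np.diff(ts)
print(ts[0], ts[-1], gaps.max())
```

The grid is much denser near $\epsilon$ than near $T$, so the largest gap $\Delta t$ between adjacent timesteps sits at the high-noise end.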

Models

Architecture: NCSN++/EDM architecture from Karras et al. (2022)
CD teacher: Pretrained EDM models
Parameterization: Skip-connection formulation with $c_{\text{skip}}(t)$ and $c_{\text{out}}(t)$ from EDM

Evaluation

Metric	Dataset	CD 1-step	CT 1-step	EDM (full)
FID	CIFAR-10	3.55	8.70	2.04
FID	ImageNet 64	6.20	13.0	2.44
FID	LSUN Bedroom	7.80	16.0	3.57
FID	LSUN Cat	11.0	20.7	6.69

Hardware

Training details follow EDM conventions
CD and CT use the same batch sizes and learning rate schedules as EDM training

Artifacts

Artifact	Type	License	Notes
openai/consistency_models	Code	MIT	Official implementation with pretrained checkpoints

Paper Information

Citation: Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency Models. ICML 2023. https://arxiv.org/abs/2303.01469

Publication: ICML 2023

@inproceedings{song2023consistency,
  title     = {Consistency Models},
  author    = {Song, Yang and Dhariwal, Prafulla and Chen, Mark and Sutskever, Ilya},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  url       = {https://arxiv.org/abs/2303.01469}
}
