<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Diffusion on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/tags/diffusion/</link><description>Recent content in Diffusion on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Thu, 09 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/diffusion/index.xml" rel="self" type="application/rss+xml"/><item><title>PharMolixFM: Multi-Modal All-Atom Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/pharmolixfm-all-atom-foundation-models/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/pharmolixfm-all-atom-foundation-models/</guid><description>PharMolixFM unifies diffusion, flow matching, and Bayesian flow networks for all-atom molecular modeling and generation with task-specific denoising priors.</description><content:encoded><![CDATA[<h2 id="a-unified-framework-for-all-atom-molecular-foundation-models">A Unified Framework for All-Atom Molecular Foundation Models</h2>
<p>PharMolixFM is a <strong>Method</strong> paper that introduces a unified framework for constructing all-atom foundation models for molecular modeling and generation. The primary contribution is the systematic implementation of three multi-modal generative model variants (diffusion, flow matching, and Bayesian flow networks) within a single architecture, along with a task-unifying denoising formulation that enables training on multiple structural biology tasks simultaneously. The framework achieves competitive performance on protein-small-molecule docking and structure-based drug design while providing the first empirical analysis of inference scaling laws for molecular generative models.</p>
<h2 id="challenges-in-multi-modal-atomic-modeling">Challenges in Multi-Modal Atomic Modeling</h2>
<p>Existing all-atom foundation models such as AlphaFold3, RoseTTAFold All-Atom, and ESM-AA face two core challenges that limit their generalization across molecular modeling and generation tasks.</p>
<p>First, atomic data is inherently multi-modal: each atom comprises both a discrete atom type and continuous 3D coordinates. This poses challenges for structure models that need to jointly capture and predict both modalities. Unlike text or image data that exhibit a single modality, molecular structures require generative models that can handle discrete categorical variables (atom types, bond types) and continuous variables (coordinates) simultaneously.</p>
<p>Second, there has been no comprehensive analysis of how different training objectives and sampling strategies impact the performance of all-atom foundation models. Prior work has focused on individual model architectures without systematically comparing generative frameworks or studying how inference-time compute scaling affects prediction quality.</p>
<p>PharMolixFM addresses both challenges by providing a unified framework that implements three state-of-the-art multi-modal generative models and formulates all downstream tasks as a generalized denoising process with task-specific priors.</p>
<h2 id="multi-modal-denoising-with-task-specific-priors">Multi-Modal Denoising with Task-Specific Priors</h2>
<p>The core innovation of PharMolixFM is the formulation of molecular tasks as a generalized denoising process where task-specific priors control which parts of the molecular system are noised during training. The framework decomposes a biomolecular system into $N$ atoms represented as a triplet $\bar{\mathbf{S}}_0 = \langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle$, where $\mathbf{X}_0 \in \mathbb{R}^{N \times 3}$ are atom coordinates, $\mathbf{A}_0 \in \mathbb{Z}^{N \times D_1}$ are one-hot atom types, and $\mathbf{E}_0 \in \mathbb{Z}^{N \times N \times D_2}$ are one-hot bond types.</p>
<p>The generative model estimates the density $p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)$ subject to SE(3) invariance:</p>
<p>$$
p_\theta(\langle \mathbf{R}\mathbf{X}_0 + \mathbf{t}, \mathbf{A}_0, \mathbf{E}_0 \rangle) = p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)
$$</p>
<p>The variational lower bound is optimized over latent variables $S_1, \ldots, S_T$ obtained by adding independent noise to different modalities and atoms:</p>
<p>$$
q(S_{1:T} \mid S_0) = \prod_{i=1}^{T} \prod_{j=1}^{N} q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}, \sigma_{i,j}^{(\mathbf{X})}) \, q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}, \sigma_{i,j}^{(\mathbf{A})}) \, q(\mathbf{E}_{i,j} \mid \mathbf{E}_{0,j}, \sigma_{i,j}^{(\mathbf{E})})
$$</p>
<p>A key design choice is the noise schedule $\sigma_{i,j}^{(\mathcal{M})} = \frac{i}{T} \cdot \text{fix}_j^{(\mathcal{M})}$, where $\text{fix}_j^{(\mathcal{M})}$ is a scaling factor between 0 and 1 that controls which atoms and modalities receive noise. This &ldquo;Fix&rdquo; mechanism enables multiple training tasks:</p>
<ul>
<li><strong>Docking</strong> ($\text{Fix} = 1$ for protein and molecular graph, $\text{Fix} = 0$ for molecule coordinates): predicts binding pose given known atom/bond types.</li>
<li><strong>Structure-based drug design</strong> ($\text{Fix} = 1$ for protein, $\text{Fix} = 0$ for all molecule properties): generates novel molecules for a given pocket.</li>
<li><strong>Robustness augmentation</strong> ($\text{Fix} = 0.7$ for 15% randomly selected atoms, $\text{Fix} = 0$ for rest): simulates partial structure determination.</li>
</ul>
<h3 id="three-generative-model-variants">Three Generative Model Variants</h3>
<p><strong>Multi-modal diffusion (PharMolixFM-Diff)</strong> uses a Markovian forward process. Continuous coordinates follow Gaussian diffusion while discrete variables use a D3PM categorical transition:</p>
<p>$$
q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\sqrt{\alpha_{i,j}} \, \mathbf{X}_{0,j}, (1 - \alpha_{i,j}) \mathbf{I}), \quad \alpha_{i,j} = \prod_{k=1}^{i}(1 - \sigma_{k,j}^{(\mathbf{X})})
$$</p>
<p>$$
q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}) = \text{Cat}(\mathbf{A}_{0,j} \bar{Q}_{i,j}^{(\mathbf{A})}), \quad Q_{i,j}^{(\mathbf{A})} = (1 - \sigma_{i,j}^{(\mathbf{A})}) \mathbf{I} + \frac{\sigma_{i,j}^{(\mathbf{A})}}{D_1} \mathbb{1}\mathbb{1}^T
$$</p>
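<p>The two forward kernels above can be sketched for a single atom as follows. This is a minimal illustration, not the paper&rsquo;s implementation: the function names and the per-step noise levels are hypothetical, the cumulative signal level is taken as the product of $(1 - \sigma)$ over the steps so far, and the cumulative categorical matrix uses the closed form that holds for uniform kernels.</p>

```python
import math
import random

def alpha_bar(sigmas):
    """Cumulative signal level: product of (1 - sigma_k) over the steps taken so far."""
    result = 1.0
    for sigma in sigmas:
        result *= 1.0 - sigma
    return result

def noise_coordinates(x0, sigmas, rng):
    """Sample X_i ~ N(sqrt(alpha) * X_0, (1 - alpha) * I) for one atom."""
    alpha = alpha_bar(sigmas)
    return [math.sqrt(alpha) * x + math.sqrt(1.0 - alpha) * rng.gauss(0.0, 1.0) for x in x0]

def uniform_kernel_probs(one_hot, sigma):
    """Row vector one_hot @ Q with Q = (1 - sigma) * I + (sigma / D) * ones."""
    d = len(one_hot)
    return [(1.0 - sigma) * v + sigma / d for v in one_hot]

def noise_atom_type(one_hot, sigmas, rng):
    """Sample A_i from Cat(A_0 @ Q_bar) for a chain of uniform kernels."""
    # A product of uniform kernels is again a uniform kernel whose
    # (1 - sigma_bar) equals the product of the per-step (1 - sigma_k).
    sigma_bar = 1.0 - alpha_bar(sigmas)
    probs = uniform_kernel_probs(one_hot, sigma_bar)
    index = rng.choices(range(len(one_hot)), weights=probs, k=1)[0]
    return [1 if k == index else 0 for k in range(len(one_hot))]

rng = random.Random(0)
sigmas = [0.1, 0.2, 0.3]  # hypothetical per-step noise levels
x_noisy = noise_coordinates([1.0, -2.0, 0.5], sigmas, rng)
a_noisy = noise_atom_type([0, 1, 0, 0], sigmas, rng)
```

<p>With an empty list of noise levels both functions return their input unchanged, which mirrors an atom or modality that receives no noise.</p>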
<p>The training loss combines coordinate MSE with cross-entropy for discrete variables:</p>
<p>$$
\mathcal{L} = \mathbb{E}_{S_0, i, S_i} \left[ \lambda_i^{(\mathbf{X})} \| \tilde{\mathbf{X}}_0 - \mathbf{X}_0 \|_2^2 + \lambda_i^{(\mathbf{A})} \mathcal{L}_{CE}(\tilde{\mathbf{A}}_0, \mathbf{A}_0) + \lambda_i^{(\mathbf{E})} \mathcal{L}_{CE}(\tilde{\mathbf{E}}_0, \mathbf{E}_0) \right]
$$</p>
<p><strong>Multi-modal flow matching (PharMolixFM-Flow)</strong> constructs a direct mapping between data and prior distributions using conditional vector fields. For coordinates, the conditional flow uses a Gaussian path $q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}((1 - \sigma_{i,j}^{(\mathbf{X})}) \mathbf{X}_{0,j}, (\sigma_{i,j}^{(\mathbf{X})})^2 \mathbf{I})$, while discrete variables use the same D3PM Markov chain. Sampling proceeds by solving an ODE via Euler integration.</p>
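<p>As a rough sketch of Euler sampling along this Gaussian path, assume a network that predicts the clean coordinates $\mathbf{X}_0$ from a noisy sample. On the path $\mathbf{X} = (1 - \sigma)\mathbf{X}_0 + \sigma \epsilon$ the velocity with respect to $\sigma$ is $(\mathbf{X} - \mathbf{X}_0)/\sigma$, so one plausible integrator steps from $\sigma = 1$ down to $\sigma = 0$. The function <code>euler_sample</code> and the <code>predict_x0</code> callback are hypothetical names, and the paper&rsquo;s exact parameterization may differ.</p>

```python
import random

def euler_sample(predict_x0, x_start, num_steps):
    """Integrate from sigma = 1 (prior) to sigma = 0 (data) with fixed Euler steps."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        sigma = 1.0 - k * dt
        x0_hat = predict_x0(x, sigma)
        # On x = (1 - sigma) * x0 + sigma * eps, dx/dsigma = (x - x0) / sigma.
        velocity = [(xi - x0i) / sigma for xi, x0i in zip(x, x0_hat)]
        # Step toward smaller sigma, i.e. toward the data.
        x = [xi - dt * vi for xi, vi in zip(x, velocity)]
    return x

rng = random.Random(0)
target = [1.0, -2.0, 0.5]                      # stand-in for the true coordinates
prior_sample = [rng.gauss(0.0, 1.0) for _ in target]
# An oracle predictor replaces the trained network in this toy example.
result = euler_sample(lambda x, sigma: target, prior_sample, num_steps=20)
```

<p>With the oracle predictor the path is linear, so the integration lands on the target up to floating-point error; with a learned predictor the step count trades accuracy against compute.</p>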
<p><strong>Bayesian flow networks (PharMolixFM-BFN)</strong> perform generative modeling in the parameter space of the data distribution rather than the data space. The Bayesian flow distribution for coordinates is:</p>
<p>$$
p_F(\tilde{\mathbf{X}}_{i,j}^{(\theta)} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\gamma_{i,j} \mathbf{X}_{0,j}, \gamma_{i,j}(1 - \gamma_{i,j}) \mathbf{I}), \quad \gamma_{i,j} = 1 - \alpha^{2(1 - \sigma_{i,j}^{(\mathbf{X})})}
$$</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>The architecture follows PocketXMol with a dual-branch SE(3)-equivariant graph neural network. A protein branch (4-layer GNN with kNN graph) processes pocket atoms, then representations are passed to a molecule branch (6-layer GNN) that captures protein-molecule interactions. Independent prediction heads reconstruct atom coordinates, atom types, and bond types, with additional confidence heads for self-ranking during inference.</p>
<h2 id="docking-and-drug-design-experiments">Docking and Drug Design Experiments</h2>
<h3 id="protein-small-molecule-docking">Protein-Small-Molecule Docking</h3>
<p>PharMolixFM is evaluated on the PoseBusters benchmark (428 protein-small-molecule complexes) using the holo docking setting with a known protein structure and 10 Angstrom binding pocket. The metric is the ratio of predictions with RMSD &lt; 2 Angstrom.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Self-Ranking (%)</th>
          <th>Oracle-Ranking (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffDock</td>
          <td>38.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>RFAA</td>
          <td>42.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Vina</td>
          <td>52.3</td>
          <td>-</td>
      </tr>
      <tr>
          <td>UniMol-Docking V2</td>
          <td>77.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>SurfDock</td>
          <td>78.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>AlphaFold3</td>
          <td>90.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>PocketXMol (50 repeats)</td>
          <td>82.2</td>
          <td>95.3</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (50 repeats)</td>
          <td>83.4</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow (50 repeats)</td>
          <td>73.4</td>
          <td>93.7</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN (50 repeats)</td>
          <td>78.5</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (500 repeats)</td>
          <td>83.9</td>
          <td>98.1</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM-Diff achieves the second-best self-ranking result (83.4%), outperforming PocketXMol by 1.2 percentage points but trailing AlphaFold3 (90.4%). The key advantage is inference speed: approximately 4.6 seconds per complex on a single A800 GPU compared to approximately 249.0 seconds for AlphaFold3 (a 54x speedup). Under oracle-ranking with 500 repeats, PharMolixFM-Diff reaches 98.1%, suggesting that better ranking strategies could further improve practical performance.</p>
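<p>To make the two ranking columns concrete, the sketch below computes a success rate under self-ranking (keep the pose with the highest predicted confidence) and oracle-ranking (keep the pose with the lowest RMSD). The pose records are made-up toy values, and the dictionary keys and function names are hypothetical. The last line reproduces the speedup arithmetic from the reported timings.</p>

```python
def success_rate(complexes, pick):
    """Percentage of complexes whose selected pose has RMSD below 2 Angstrom."""
    hits = sum(1 for poses in complexes if pick(poses)["rmsd"] < 2.0)
    return 100.0 * hits / len(complexes)

def self_ranked(poses):
    """Select the pose the model itself is most confident about."""
    return max(poses, key=lambda p: p["confidence"])

def oracle_ranked(poses):
    """Select the best pose using the ground-truth RMSD (an upper bound)."""
    return min(poses, key=lambda p: p["rmsd"])

# Toy data: four complexes with two sampled poses each (not from the paper).
complexes = [
    [{"rmsd": 1.2, "confidence": 0.9}, {"rmsd": 3.5, "confidence": 0.4}],
    [{"rmsd": 1.8, "confidence": 0.3}, {"rmsd": 4.1, "confidence": 0.8}],
    [{"rmsd": 2.9, "confidence": 0.7}, {"rmsd": 5.0, "confidence": 0.2}],
    [{"rmsd": 0.8, "confidence": 0.6}, {"rmsd": 1.5, "confidence": 0.5}],
]

self_rate = success_rate(complexes, self_ranked)
oracle_rate = success_rate(complexes, oracle_ranked)
speedup = 249.0 / 4.6  # reported seconds per complex: AlphaFold3 vs. PharMolixFM-Diff
```

<p>The second toy complex contains a good pose that the confidence score ranks last, which is the kind of case that separates the two columns.</p>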
<h3 id="structure-based-drug-design">Structure-Based Drug Design</h3>
<p>Evaluation uses the CrossDocked test set (100 protein pockets, 100 molecules generated per pocket), measuring Vina binding affinity scores and drug-likeness properties (QED and SA).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score (Avg/Med)</th>
          <th>QED</th>
          <th>SA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-5.14 / -4.70</td>
          <td>0.57</td>
          <td>0.76</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-5.47 / -6.30</td>
          <td>0.48</td>
          <td>0.58</td>
      </tr>
      <tr>
          <td>DecompDiff</td>
          <td>-5.67 / -6.04</td>
          <td>0.45</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>MolCRAFT</td>
          <td>-6.61 / -8.14</td>
          <td>0.46</td>
          <td>0.62</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff</td>
          <td>-6.18 / -6.44</td>
          <td>0.50</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow</td>
          <td>-6.34 / -6.47</td>
          <td>0.49</td>
          <td>0.74</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN</td>
          <td>-6.38 / -6.45</td>
          <td>0.48</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM achieves a better balance between binding affinity and drug-like properties than most baselines. While MolCRAFT achieves the best Vina scores, the PharMolixFM-Diff and PharMolixFM-Flow variants show higher QED (0.49-0.50 vs. 0.45-0.48) and SA (0.73-0.74 vs. 0.58-0.62) than TargetDiff, DecompDiff, and MolCRAFT, properties that are important for downstream validation and in-vivo application. Pocket2Mol retains the highest QED (0.57) and SA (0.76) in the table but has the weakest Vina scores.</p>
<h3 id="inference-scaling-law">Inference Scaling Law</h3>
<p>The paper explores whether inference-time scaling holds for molecular generative models, fitting the relationship:</p>
<p>$$
\text{Acc} = a \log(bR + c) + d
$$</p>
<p>where $R$ is the number of sampling repeats. All three PharMolixFM variants exhibit logarithmic improvement in docking accuracy with increased sampling repeats, analogous to inference scaling laws observed in NLP. Performance plateaus eventually due to distributional differences between training and test sets.</p>
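<p>The functional form above is easy to explore numerically. The sketch below evaluates it with hypothetical coefficients chosen only to show the shape of the curve; they are not fitted values from the paper.</p>

```python
import math

def scaling_law(repeats, a, b, c, d):
    """Acc = a * log(b * R + c) + d, with R the number of sampling repeats."""
    return a * math.log(b * repeats + c) + d

# Hypothetical coefficients, for illustration only.
params = dict(a=2.0, b=1.0, c=0.0, d=75.0)
repeats = [1, 10, 100, 1000]
accs = [scaling_law(r, **params) for r in repeats]
# Gain in accuracy for each tenfold increase in the sampling budget.
gains = [later - earlier for earlier, later in zip(accs, accs[1:])]
```

<p>With $c = 0$, every tenfold increase in $R$ adds the same $a \ln 10$ to the accuracy, so the gain per additional sample shrinks roughly as $a / R$.</p>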
<h2 id="competitive-docking-with-faster-inference-but-limited-task-scope">Competitive Docking with Faster Inference, but Limited Task Scope</h2>
<p>PharMolixFM demonstrates that multi-modal generative models can achieve competitive all-atom molecular modeling with substantial inference speed advantages over AlphaFold3. The key findings are:</p>
<ol>
<li><strong>Diffusion outperforms flow matching and BFN</strong> for docking under standard sampling budgets. The stochastic nature of diffusion sampling appears beneficial compared to the deterministic ODE integration of flow matching.</li>
<li><strong>Oracle-ranking reveals untapped potential</strong>: the gap between self-ranking (83.9%) and oracle-ranking (98.1%) at 500 repeats indicates that confidence-based ranking is a bottleneck. Better ranking methods could close the gap with AlphaFold3.</li>
<li><strong>The three variants show similar performance for drug design</strong>, suggesting that model architecture and training data may matter more than the generative framework for generation tasks.</li>
<li><strong>Inference scaling laws hold</strong> for molecular generative models, paralleling findings in NLP.</li>
</ol>
<p>The framework is evaluated on only two tasks (docking and SBDD); the paper does not address protein structure prediction, protein-protein interactions, or nucleic acid modeling, all of which are within AlphaFold3&rsquo;s scope. The BFN variant underperforms the diffusion model, which the authors attribute to smaller noise scales at early sampling steps making training less challenging. The paper also does not compare against concurrent work on inference-time scaling for molecular models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PDBBind, Binding MOAD, CrossDocked2020, PepBDB</td>
          <td>Not specified</td>
          <td>Filtered by PocketXMol criteria</td>
      </tr>
      <tr>
          <td>Docking eval</td>
          <td>PoseBusters benchmark</td>
          <td>428 complexes</td>
          <td>Holo docking with known protein</td>
      </tr>
      <tr>
          <td>SBDD eval</td>
          <td>CrossDocked test set</td>
          <td>100 pockets</td>
          <td>100 molecules per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Three generative variants: multi-modal diffusion (D3PM), flow matching, Bayesian flow networks</li>
<li>Task-specific noise via Fix mechanism (0, 0.7, or 1.0)</li>
<li>Training tasks selected with equal probability per sample</li>
<li>AdamW optimizer: weight decay 0.001, $\beta_1 = 0.99$, $\beta_2 = 0.999$</li>
<li>Linear warmup to learning rate 0.001 over 1000 steps</li>
<li>180K training steps with batch size 40</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Dual-branch SE(3)-equivariant GNN (protein: 4-layer, molecule: 6-layer)</li>
<li>kNN graph construction for protein and protein-molecule interactions</li>
<li>Independent prediction heads for coordinates, atom types, bond types</li>
<li>Confidence heads for self-ranking during inference</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PharMolixFM-Diff</th>
          <th>AlphaFold3</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSD &lt; 2A self-ranking</td>
          <td>83.4% (50 rep)</td>
          <td>90.4%</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>RMSD &lt; 2A oracle-ranking</td>
          <td>98.1% (500 rep)</td>
          <td>-</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>Inference time (per complex)</td>
          <td>~4.6s</td>
          <td>~249.0s</td>
          <td>Single A800 GPU</td>
      </tr>
      <tr>
          <td>Vina score (avg)</td>
          <td>-6.18</td>
          <td>-</td>
          <td>CrossDocked SBDD</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 4x 80GB A800 GPUs</li>
<li>Inference benchmarked on single A800 GPU</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/PharMolix/OpenBioMed">OpenBioMed (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Wang, J., Fan, S., &amp; Nie, Z. (2025). PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation. <em>arXiv preprint arXiv:2503.21788</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2025pharmolixfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Wang, Jiashuo and Fan, Siqi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2503.21788}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Latent Diffusion Models for High-Res Image Synthesis</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/latent-diffusion-models/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/latent-diffusion-models/</guid><description>Latent Diffusion Models train diffusion in a compressed latent space, enabling high-res image synthesis with cross-attention conditioning at reduced compute.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It introduces Latent Diffusion Models (LDMs), which train denoising diffusion models in the latent space of pretrained autoencoders rather than directly in pixel space. The key insight is that separating perceptual compression from generative learning enables high-resolution image synthesis at a fraction of the computational cost of pixel-based diffusion. The paper also introduces a cross-attention conditioning mechanism for flexible multi-modal generation.</p>
<h2 id="computational-cost-of-pixel-space-diffusion">Computational Cost of Pixel-Space Diffusion</h2>
<p>Training diffusion models directly in pixel space is computationally expensive (150 to 1000 V100 GPU-days for leading models at the time) because the model must process high-dimensional RGB data at every denoising step. Much of this compute is spent modeling imperceptible high-frequency details. The authors observe that learning can be split into two stages: a perceptual compression stage that removes high-frequency detail, and a semantic compression stage where the generative model learns the conceptual composition. Prior two-stage approaches (VQGAN, DALL-E) relied on aggressive compression and autoregressive modeling in discrete latent spaces, trading off reconstruction quality for tractability.</p>
<h2 id="core-innovation-diffusion-in-latent-space">Core Innovation: Diffusion in Latent Space</h2>
<p>LDMs decompose image synthesis into two phases:</p>
<p><strong>Phase 1: Perceptual Compression.</strong> A pretrained autoencoder (encoder $\mathcal{E}$, decoder $\mathcal{D}$) maps images $x \in \mathbb{R}^{H \times W \times 3}$ to a lower-dimensional latent representation $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$ with spatial downsampling factor $f = H/h$. The autoencoder is trained with a perceptual loss (matching deep features from a pretrained VGG network) and a patch-based adversarial objective, with either KL or VQ regularization on the latent space.</p>
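<p>A quick way to see what the downsampling factor buys is to compute latent shapes for a fixed image size. In the sketch below the channel count of 4 is an assumed value for illustration, and both helper names are hypothetical.</p>

```python
def latent_shape(height, width, factor, channels):
    """Shape (h, w, c) of z = E(x) for an H x W x 3 image with f = H / h."""
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the downsampling factor")
    return height // factor, width // factor, channels

def spatial_reduction(factor):
    """Number of pixel positions that map onto one latent position."""
    return factor * factor

for f in (1, 2, 4, 8, 16, 32):
    # channels=4 is an assumption, not a value stated in the text above
    print(f, latent_shape(256, 256, f, channels=4), spatial_reduction(f))
```

<p>For a 256x256 image, $f = 8$ leaves a 32x32 latent grid, so the diffusion model processes 64 times fewer spatial positions per denoising step.</p>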
<p><strong>Phase 2: Latent Diffusion.</strong> A standard denoising diffusion model operates in this latent space. The training objective becomes:</p>
<p>$$L_{\text{LDM}} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0,1), t} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|_2^2 \right]$$</p>
<p>where $z_t$ is the noised latent at timestep $t$, and $\epsilon_\theta$ is a time-conditional UNet.</p>
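<p>A single-sample estimate of this objective can be sketched as follows. The sketch assumes the standard DDPM forward process $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with a linear $\beta$ schedule; the schedule endpoints, the function names, and the placeholder model are illustrative assumptions, and the latent is flattened to a plain list.</p>

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alphas_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alphas_bar.append(running)
    return alphas_bar

def ldm_loss(z0, t, alphas_bar, eps_model, rng):
    """Squared L2 error between the sampled noise and eps_model(z_t, t)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a = alphas_bar[t]
    # Noise the latent in closed form, then ask the model to recover the noise.
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    pred = eps_model(z_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

rng = random.Random(0)
alphas_bar = alpha_bar_schedule(1000)
z0 = [0.5, -1.0, 2.0]                          # toy flattened latent
# A predictor that always outputs zero stands in for the UNet here.
loss = ldm_loss(z0, 500, alphas_bar, lambda z_t, t: [0.0] * len(z_t), rng)
```

<p>In training, the expectation is approximated by averaging this quantity over minibatches of encoded images, noise draws, and timesteps.</p>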
<p><strong>Cross-Attention Conditioning.</strong> To enable conditioning on text, semantic maps, or other modalities, the authors introduce cross-attention layers into the UNet. A domain-specific encoder $\tau_\theta$ maps conditioning input $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which interacts with the UNet features via:</p>
<p>$$Q = W_Q^{(i)} \cdot \varphi_i(z_t), \quad K = W_K^{(i)} \cdot \tau_\theta(y), \quad V = W_V^{(i)} \cdot \tau_\theta(y)$$</p>
<p>The conditional objective then becomes:</p>
<p>$$L_{\text{LDM}} := \mathbb{E}_{\mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0,1), t} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\|_2^2 \right]$$</p>
<p>Both $\tau_\theta$ and $\epsilon_\theta$ are optimized jointly.</p>
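<p>The projections $Q$, $K$, and $V$ defined above feed standard scaled dot-product attention, $\text{softmax}(QK^T / \sqrt{d})\, V$. The sketch below spells this out with plain Python lists. It uses a row-vector convention (features times weights) instead of the column convention in the formulas, and all matrix values and function names are made up for illustration.</p>

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(features, context, w_q, w_k, w_v):
    """Queries come from UNet features, keys and values from the encoded condition."""
    q = matmul(features, w_q)                              # (N, d)
    k = matmul(context, w_k)                               # (M, d)
    v = matmul(context, w_v)                               # (M, d)
    d = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])     # (N, M), equals Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)                              # (N, d)

features = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0]]             # N=2 flattened UNet positions
context = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 2.0]]  # M=4 condition tokens
w_q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention(features, context, w_q, w_k, w_v)
```

<p>Each output row is a convex combination of the value vectors, so every spatial position of the UNet can draw on every token of the condition.</p>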
<h2 id="experimental-setup-and-results">Experimental Setup and Results</h2>
<p>The authors evaluate across multiple tasks and datasets:</p>
<p><strong>Perceptual compression tradeoffs.</strong> Downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$ are compared on ImageNet class-conditional generation. LDM-1 (pixel-based) trains slowly; LDM-32 loses too much information. LDM-4 and LDM-8 achieve the best balance, with LDM-8 outperforming pixel-based diffusion by 38 FID points after 2M training steps on a single A100.</p>
<p><strong>Unconditional image synthesis</strong> on CelebA-HQ 256, FFHQ 256, LSUN Churches/Bedrooms 256: LDM-4 achieves FID 5.11 on CelebA-HQ (state of the art at the time), outperforming LSGM, GANs, and other likelihood-based models. On LSUN-Bedrooms, LDM-4 achieves FID 2.95, close to ADM (1.90) with half the parameters and roughly 4x less training compute (see Appendix E.3.5).</p>
<p><strong>Text-to-image synthesis</strong> on MS-COCO: A 1.45B parameter LDM-KL-8 model trained on LAION-400M achieves FID 12.63 with classifier-free guidance at scale s=1.5, on par with GLIDE (FID 12.24, 6B params) and Make-A-Scene (FID 11.84, 4B params) with substantially fewer parameters. Classifier-free guidance amplifies the conditioning signal at the cost of diversity by extrapolating from the unconditional prediction toward, and for scales above 1 beyond, the conditional prediction.</p>
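<p>The guidance step itself is a one-line combination of two noise predictions. The sketch below uses one common parameterization, $\epsilon_{\text{uncond}} + s\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$; treating this as the exact form behind the reported scale is an assumption, and the function name is hypothetical.</p>

```python
def guided_eps(eps_cond, eps_uncond, scale):
    """Combine predictions as eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [2.0, -1.0]     # toy conditional noise prediction
eps_uncond = [1.0, 0.0]    # toy unconditional noise prediction
guided = guided_eps(eps_cond, eps_uncond, scale=1.5)
```

<p>A scale of 0 returns the unconditional prediction and a scale of 1 returns the conditional one; larger scales push the result past the conditional prediction, which strengthens the conditioning and reduces sample diversity.</p>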
<p><strong>Class-conditional ImageNet 256:</strong> LDM-4-G achieves FID 3.60, IS 247.67, outperforming ADM-G (FID 4.59) with fewer parameters and less compute.</p>
<p><strong>Super-resolution:</strong> LDM-4 (big) achieves FID 2.4 on ImageNet 64-to-256 upscaling (validation split), outperforming SR3 in FID.</p>
<p><strong>Inpainting</strong> on Places: LDM-4 (big, w/ ft) achieves FID 1.50, setting a new state of the art on image inpainting.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<ul>
<li>LDM-4 and LDM-8 offer the best tradeoff between perceptual compression and generation quality.</li>
<li>The autoencoder only needs to be trained once and can be reused across different diffusion models and tasks.</li>
<li>Cross-attention conditioning generalizes to text, semantic layouts, and bounding boxes without architecture changes.</li>
<li>Convolutional sampling enables generation at resolutions higher than the training resolution (up to 1024x1024).</li>
<li>Sequential sampling remains slower than GANs. The autoencoder reconstruction can become a bottleneck for tasks requiring pixel-level precision.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unconditional</td>
          <td>CelebA-HQ, FFHQ, LSUN</td>
          <td>256x256</td>
          <td>Standard benchmarks</td>
      </tr>
      <tr>
          <td>Class-conditional</td>
          <td>ImageNet</td>
          <td>256x256</td>
          <td>1000 classes</td>
      </tr>
      <tr>
          <td>Text-to-image</td>
          <td>LAION-400M</td>
          <td>256x256</td>
          <td>400M image-text pairs</td>
      </tr>
      <tr>
          <td>Inpainting</td>
          <td>Places</td>
          <td>256x256, 512x512</td>
          <td>Following LaMa protocol</td>
      </tr>
      <tr>
          <td>Super-resolution</td>
          <td>ImageNet</td>
          <td>64 to 256</td>
          <td>Following SR3 pipeline</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Autoencoder regularization</strong>: KL-reg (KL penalty toward standard normal, weighted by ~$10^{-6}$) or VQ-reg (vector quantization layer on the latent space with a learned codebook)</li>
<li><strong>Diffusion</strong>: Standard DDPM denoising with reweighted objective</li>
<li><strong>Sampling</strong>: DDIM sampler with configurable steps (100 to 500 depending on task)</li>
<li><strong>Guidance</strong>: Classifier-free diffusion guidance with scale $s$ (1.5 for class-conditional and text-to-image quantitative evaluation; 10.0 for qualitative text-to-image samples)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Autoencoder</strong>: Based on VQGAN architecture with perceptual + adversarial loss</li>
<li><strong>UNet backbone</strong>: Time-conditional with cross-attention layers at multiple resolutions</li>
<li><strong>Text encoder</strong>: BERT-tokenizer with transformer $\tau_\theta$ for LAION text-to-image model</li>
<li><strong>LDM-4-G</strong>: 400M parameters, $f=4$ downsampling</li>
<li><strong>LDM-KL-8 (text)</strong>: 1.45B parameters, $f=8$ downsampling, KL-regularized</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FID</td>
          <td>CelebA-HQ unconditional</td>
          <td>5.11</td>
          <td>500 DDIM steps</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>ImageNet class-conditional</td>
          <td>3.60</td>
          <td>LDM-4-G, cfg s=1.5</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>MS-COCO text-to-image</td>
          <td>12.63</td>
          <td>LDM-KL-8-G, 250 steps, cfg s=1.5</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>Places inpainting</td>
          <td>1.50</td>
          <td>LDM-4 big, w/ ft</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>ImageNet 4x super-resolution</td>
          <td>2.4</td>
          <td>LDM-4 big, 100 steps</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Perceptual compression tradeoff experiments: single NVIDIA A100</li>
<li>Inpainting model trained on eight V100</li>
<li>Training at least 2.7x faster than pixel-based diffusion at equal parameters</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CompVis/latent-diffusion">CompVis/latent-diffusion</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained models</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rombach, R., Blattmann, A., Lorenz, D., Esser, P., &amp; Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. <em>CVPR 2022</em>. <a href="https://arxiv.org/abs/2112.10752">https://arxiv.org/abs/2112.10752</a></p>
<p><strong>Publication</strong>: CVPR 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{rombach2022highresolution,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{High-Resolution Image Synthesis with Latent Diffusion Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\&#34;o}rn}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>     = <span style="color:#e6db74">{10684--10695}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/CompVis/latent-diffusion">GitHub Repository</a></li>
<li><a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">Score-Based Generative Modeling with SDEs</a></li>
</ul>
]]></content:encoded></item><item><title>D3PM: Discrete Denoising Diffusion Probabilistic Models</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/discrete-diffusion-models/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/discrete-diffusion-models/</guid><description>D3PMs extend diffusion models to discrete data with structured transition matrices, connecting diffusion to masked language models.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It extends denoising diffusion probabilistic models (DDPMs) from continuous to discrete state-spaces by introducing structured Markov transition matrices for the corruption process. The paper unifies several corruption strategies, draws a formal connection between absorbing-state diffusion and masked language models, and demonstrates competitive results on both image and text generation.</p>
<h2 id="diffusion-beyond-continuous-spaces">Diffusion Beyond Continuous Spaces</h2>
<p>Standard DDPMs operate in continuous state-spaces (e.g., pixel values treated as real numbers) and use Gaussian noise for corruption. Many important data types are inherently discrete: text (tokens from a vocabulary), quantized images (discrete pixel values), molecular structures, and segmentation maps. Prior work by Hoogeboom et al. extended binary diffusion to multinomial diffusion with uniform transition probabilities, but this limits the structure of the corruption process. D3PMs generalize this by allowing arbitrary transition matrices that encode domain-specific inductive biases.</p>
<h2 id="core-innovation-structured-transition-matrices">Core Innovation: Structured Transition Matrices</h2>
<p>D3PMs define a forward corruption process over discrete variables $\mathbf{x} \in \{1, \ldots, K\}^D$ using transition matrices $\mathbf{Q}_t \in \mathbb{R}^{K \times K}$:</p>
<p>$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \text{Cat}(\mathbf{x}_t; \mathbf{p} = \mathbf{x}_{t-1} \mathbf{Q}_t)$$</p>
<p>where $\mathbf{x}_{t-1}$ is a one-hot row vector. The cumulative transition after $t$ steps is $\overline{\mathbf{Q}}_t = \mathbf{Q}_1 \mathbf{Q}_2 \cdots \mathbf{Q}_t$, giving:</p>
<p>$$q(\mathbf{x}_t | \mathbf{x}_0) = \text{Cat}(\mathbf{x}_t; \mathbf{p} = \mathbf{x}_0 \overline{\mathbf{Q}}_t)$$</p>
<p>The paper explores several transition matrix designs:</p>
<p><strong>Uniform diffusion:</strong> $[\mathbf{Q}_t]_{ij} = (1 - \beta_t) \mathbf{1}_{i=j} + \beta_t / K$. Transitions with equal probability to any state. Stationary distribution is uniform.</p>
<p><strong>Absorbing state:</strong> $[\mathbf{Q}_t]_{ij} = (1-\beta_t)\mathbf{1}_{i=j} + \beta_t \mathbf{1}_{j=m}$. Each non-mask token transitions to a designated absorbing state $m$ (e.g., [MASK] for text, gray pixel for images) with probability $\beta_t$ per step, while tokens already at $m$ remain there (the two terms sum to 1 when $i=j=m$). This establishes a direct connection to masked language models like BERT.</p>
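<p>To make the two simplest designs concrete, the following is a minimal pure-Python sketch (not the paper's JAX implementation). It builds uniform and absorbing matrices for a toy $K=4$ vocabulary, using the matrix form $\mathbf{Q}_t = (1-\beta_t)\mathbf{I} + \beta_t \mathbf{1}\mathbf{e}_m^\top$ for the absorbing case, and multiplies them into $\overline{\mathbf{Q}}_t$. The helper names and the constant $\beta_t = 0.1$ schedule are illustrative assumptions.</p>

```python
def uniform_Q(K, beta):
    """Uniform design: keep the state with prob 1 - beta, else resample uniformly."""
    return [[(1.0 - beta) * (i == j) + beta / K for j in range(K)] for i in range(K)]


def absorbing_Q(K, beta, m):
    """Absorbing design: jump to state m with prob beta; state m never leaves."""
    return [[(1.0 - beta) * (i == j) + beta * (j == m) for j in range(K)] for i in range(K)]


def matmul(A, B):
    """Plain-Python matrix product, adequate for tiny K."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def cumulative_Q(make_Q, betas):
    """Q_bar_t = Q_1 Q_2 ... Q_t for a list of per-step betas."""
    Q_bar = make_Q(betas[0])
    for beta in betas[1:]:
        Q_bar = matmul(Q_bar, make_Q(beta))
    return Q_bar


K, m = 4, 3            # toy vocabulary; the last index plays the role of [MASK]
betas = [0.1] * 50     # hypothetical constant schedule, for illustration only

Q_abs = absorbing_Q(K, 0.1, m)
Q_bar_abs = cumulative_Q(lambda b: absorbing_Q(K, b, m), betas)
Q_bar_uni = cumulative_Q(lambda b: uniform_Q(K, b), betas)

# q(x_t | x_0 = 0) is row 0 of the cumulative matrix.
q_abs = Q_bar_abs[0]
q_uni = Q_bar_uni[0]
print("absorbing:", [round(p, 4) for p in q_abs])
print("uniform:  ", [round(p, 4) for p in q_uni])
```

<p>Row 0 of each cumulative matrix is $q(\mathbf{x}_t | \mathbf{x}_0 = 0)$: under the absorbing design nearly all mass ends up on the mask state, while under the uniform design it spreads almost evenly over the $K$ states.</p>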
<p><strong>Discretized Gaussian:</strong> Transition probabilities decay as a function of the distance $|i-j|$ between states, mimicking Gaussian diffusion on ordinal data like pixel values.</p>
<p><strong>Embedding-based nearest neighbor:</strong> For text, transitions are weighted by proximity in a pretrained word embedding space, so corruption preferentially swaps words with semantically similar ones.</p>
<p><strong>Training objective.</strong> The reverse process $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ is parameterized by predicting $\tilde{p}_\theta(\tilde{\mathbf{x}}_0 | \mathbf{x}_t)$ and computing the posterior:</p>
<p>$$p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \propto \sum_{\tilde{\mathbf{x}}_0} q(\mathbf{x}_{t-1}, \mathbf{x}_t | \tilde{\mathbf{x}}_0) \, \tilde{p}_\theta(\tilde{\mathbf{x}}_0 | \mathbf{x}_t)$$</p>
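<p>A minimal sketch of this computation, assuming the factorization $q(\mathbf{x}_{t-1}, \mathbf{x}_t | \tilde{\mathbf{x}}_0) = q(\mathbf{x}_t | \mathbf{x}_{t-1}) \, q(\mathbf{x}_{t-1} | \tilde{\mathbf{x}}_0)$ implied by the Markov forward process. The function name <code>reverse_probs</code>, the toy sizes, and the hand-picked prediction <code>p_x0</code> are hypothetical; a real model would supply <code>p_x0</code> from a network.</p>

```python
def reverse_probs(x_t, p_x0, Q_t, Q_bar_prev):
    """p_theta(x_{t-1} | x_t) for one token, given a predicted distribution over x_0.

    x_t        : observed state index at step t
    p_x0       : list of K probabilities, the predicted distribution over x_0
    Q_t        : K x K single-step transition matrix at step t
    Q_bar_prev : K x K cumulative matrix Q_bar_{t-1}
    """
    K = len(Q_t)
    # q(x_t | x_{t-1} = k) for every candidate previous state k.
    fact1 = [Q_t[k][x_t] for k in range(K)]
    # sum over x_0 of p(x_0) * q(x_{t-1} = k | x_0).
    fact2 = [sum(p_x0[i] * Q_bar_prev[i][k] for i in range(K)) for k in range(K)]
    unnorm = [a * b for a, b in zip(fact1, fact2)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]


# Toy absorbing-state example: K = 3 states, state 2 is the mask, beta = 0.5.
K, m, beta = 3, 2, 0.5
Q = [[(1.0 - beta) * (i == j) + beta * (j == m) for j in range(K)] for i in range(K)]
p_x0 = [0.7, 0.3, 0.0]   # hypothetical network prediction over clean tokens

# Take t = 2 with a constant schedule, so Q_bar_{t-1} equals Q.
from_mask = reverse_probs(m, p_x0, Q, Q)    # x_t is masked
from_token = reverse_probs(0, p_x0, Q, Q)   # x_t is the visible token 0
print(from_mask, from_token)
```

<p>When $\mathbf{x}_t$ is a visible token, the result puts all mass on that same token, since an absorbing process can only have arrived there by staying put. This is the sparsity pattern that the $\mathbf{x}_0$-parameterization inherits from $\mathbf{Q}_t$.</p>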
<p>The training loss $L_\lambda$ combines the variational lower bound (VLB) with an auxiliary cross-entropy term:</p>
<p>$$L_\lambda = L_{\text{VLB}} + \lambda \, L_{\text{CE}}$$</p>
<p>where $L_{\text{CE}}$ is a reweighted cross-entropy loss on the $\mathbf{x}_0$ prediction that stabilizes training and improves sample quality. The VLB decomposes into per-timestep KL divergences between the true and predicted reverse transitions.</p>
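<p>For reference, the per-timestep decomposition has the standard diffusion form. It is written here in this note's notation as a sketch; the grouping of terms follows the usual DDPM convention and the expectation subscripts are spelled out for clarity:</p>
<p>$$L_{\text{VLB}} = \mathbb{E}_{q(\mathbf{x}_0)}\Big[ D_{\text{KL}}\big[q(\mathbf{x}_T | \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big] + \sum_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} D_{\text{KL}}\big[q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\big] - \mathbb{E}_{q(\mathbf{x}_1 | \mathbf{x}_0)} \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) \Big]$$</p>
<p>$$L_{\text{CE}} = \mathbb{E}_{q(\mathbf{x}_0)} \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\big[-\log \tilde{p}_\theta(\mathbf{x}_0 | \mathbf{x}_t)\big]$$</p>
<p>Because both $q$ and $p_\theta$ are categorical, each KL term is a finite sum over the $K$ states and can be evaluated exactly.</p>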
<h2 id="experiments-and-results">Experiments and Results</h2>
<p><strong>Image generation (CIFAR-10):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Loss</th>
          <th>IS</th>
          <th>FID</th>
          <th>NLL (bpd)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>D3PM uniform</td>
          <td>$L_{\text{VLB}}$</td>
          <td>5.99</td>
          <td>51.27</td>
          <td>5.08</td>
      </tr>
      <tr>
          <td>D3PM absorbing</td>
          <td>$L_\lambda$ ($\lambda{=}0.001$)</td>
          <td>6.78</td>
          <td>30.97</td>
          <td>4.40</td>
      </tr>
      <tr>
          <td>D3PM Gauss</td>
          <td>$L_{\text{VLB}}$</td>
          <td>7.75</td>
          <td>15.30</td>
          <td>3.97</td>
      </tr>
      <tr>
          <td>D3PM Gauss</td>
          <td>$L_\lambda$ ($\lambda{=}0.001$)</td>
          <td>8.54</td>
          <td>8.34</td>
          <td>3.98</td>
      </tr>
      <tr>
          <td>D3PM Gauss + logistic</td>
          <td>$L_\lambda$ ($\lambda{=}0.001$)</td>
          <td>8.56</td>
          <td>7.34</td>
          <td>3.44</td>
      </tr>
      <tr>
          <td>DDPM $L_{\text{simple}}$ (continuous)</td>
          <td>&ndash;</td>
          <td>9.46</td>
          <td>3.17</td>
          <td>3.75</td>
      </tr>
  </tbody>
</table>
<p>The best discrete D3PM variant is D3PM Gauss + logistic, which achieves FID 7.34 and NLL 3.44 bpd using the combined $L_\lambda$ loss with a truncated logistic parameterization. The truncated logistic parameterization replaces the standard softmax output with a discretized logistic distribution over pixel values, assigning probability mass to each discrete bin based on a continuous logistic CDF. This provides a smoother output distribution that better captures the ordinal structure of pixel intensities. This variant exceeds the continuous DDPM in log-likelihood (3.44 vs. 3.75 bpd) while approaching its sample quality (FID 7.34 vs. 3.17).</p>
<p><strong>Text generation (text8, character-level, 1000 steps):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>bpc</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>D3PM absorbing ($L_\lambda$)</td>
          <td>1.45</td>
      </tr>
      <tr>
          <td>D3PM NN ($L_{\text{VLB}}$)</td>
          <td>1.59</td>
      </tr>
      <tr>
          <td>D3PM uniform</td>
          <td>1.61</td>
      </tr>
      <tr>
          <td>Discrete Flow (Tran et al.)</td>
          <td>1.23</td>
      </tr>
  </tbody>
</table>
<p>Among the D3PM variants, the absorbing-state model achieves the best bpc on text8 (1.45), though it still trails Discrete Flow (Tran et al., 2019) at 1.23. On LM1B (sentencepiece vocabulary of 8192 tokens), D3PM absorbing achieves a perplexity of 76.9 at 1000 steps, compared to 137.9 for D3PM uniform and 43.6 for a comparable autoregressive transformer. This shows that discrete diffusion can be trained with large vocabularies, although it does not yet match the autoregressive baseline.</p>
<p><strong>Ablation findings:</strong></p>
<ul>
<li>The auxiliary cross-entropy term in $L_\lambda$ is critical: for D3PM Gauss, it improves FID from 15.30 ($L_{\text{VLB}}$) to 8.34 ($L_\lambda$, $\lambda{=}0.001$). Adding the truncated logistic parameterization further improves FID to 7.34.</li>
<li>Discretized Gaussian transitions outperform both uniform and absorbing-state transitions on CIFAR-10 across all metrics.</li>
<li>For text, the absorbing-state (mask) model outperforms uniform and nearest-neighbor models. Nearest-neighbor diffusion provides only marginal improvement over uniform, a surprising negative result.</li>
<li>The $\mathbf{x}_0$-parameterization ensures the learned reverse distribution has the correct sparsity pattern dictated by the transition matrix $\mathbf{Q}_t$.</li>
</ul>
<h2 id="findings-and-limitations">Findings and Limitations</h2>
<ul>
<li>The choice of transition matrix is an important design decision that encodes domain-specific inductive biases. Discretized Gaussian transitions work best for ordinal image data; absorbing-state transitions work best for text.</li>
<li>D3PMs formally unify diffusion models and masked language models: absorbing-state diffusion with a [MASK] token is equivalent to a reweighted BERT-style training objective.</li>
<li>The combined VLB + auxiliary loss ($L_\lambda$) achieves better density estimation (3.44 bpd) than continuous DDPMs (3.75 bpd) while producing competitive samples.</li>
<li>Sample quality (best FID 7.34 for D3PM Gauss + logistic) still lags behind continuous-space DDPMs (FID 3.17) on CIFAR-10, though the gap narrows with structured transitions and the auxiliary loss.</li>
<li>Scaling to very large numbers of categories $K$ requires special techniques (low-rank corruption or matrix exponentials) to manage the $O(K^2 T)$ memory cost of storing transition matrices.</li>
</ul>
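<p>The $O(K^2 T)$ storage cost mentioned above can be made concrete with a short back-of-the-envelope calculation. This sketch assumes dense float32 matrices (4 bytes per entry) and $T = 1000$; the function name is hypothetical.</p>

```python
def dense_transition_bytes(K, T, bytes_per_entry=4):
    """Bytes needed to store T dense K x K transition matrices."""
    return K * K * T * bytes_per_entry


vocabularies = [
    ("text8 characters", 27),
    ("CIFAR-10 pixel values", 256),
    ("LM1B sentencepiece", 8192),
]

for name, K in vocabularies:
    gib = dense_transition_bytes(K, T=1000) / 2**30
    print(f"{name:24s} K={K:5d}  {gib:10.3f} GiB")
```

<p>Under these assumptions the 8192-token vocabulary needs 250 GiB for the dense matrices, which is why closed-form or low-rank representations matter at that scale.</p>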
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Image generation</td>
          <td>CIFAR-10</td>
          <td>32x32, 256 categories</td>
          <td>Quantized to 256 ordinal values per channel</td>
      </tr>
      <tr>
          <td>Text generation</td>
          <td>text8</td>
          <td>Character-level</td>
          <td>27 character vocabulary, sequences of length 256</td>
      </tr>
      <tr>
          <td>Text generation</td>
          <td>LM1B</td>
          <td>Subword-level</td>
          <td>Sentencepiece vocabulary of 8192 tokens, sequence length 128</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Noise schedules</strong>: Linear schedule for D3PM Gauss, cosine schedule for D3PM uniform, and a novel mutual information schedule for absorbing and nearest-neighbor models</li>
<li><strong>Reverse parameterization</strong>: $\mathbf{x}_0$-parameterization with posterior computation via Bayes&rsquo; rule</li>
<li><strong>Loss</strong>: $L_{\text{VLB}} + \lambda L_{\text{CE}}$ with $\lambda = 0.001$ for images and $\lambda = 0.01$ for text absorbing models</li>
<li><strong>Scaling</strong>: Low-rank corruption (absorbing, uniform) scales as $O(r^2 T)$; matrix exponentials for nearest-neighbor transitions</li>
</ul>
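<p>As an illustration of the mutual information schedule, the sketch below assumes the closed form $\beta_t = (T - t + 1)^{-1}$, which is what the schedule reduces to for absorbing-state diffusion according to the paper. With that choice, the probability that a token is still unmasked after $t$ steps decays linearly as $1 - t/T$. Function names are hypothetical.</p>

```python
def absorbing_mi_betas(T):
    """beta_t = 1 / (T - t + 1) for t = 1..T (assumed closed form)."""
    return [1.0 / (T - t + 1) for t in range(1, T + 1)]


def keep_probability(betas):
    """Cumulative probability that a token has not yet been masked after each step."""
    alpha, out = 1.0, []
    for beta in betas:
        alpha *= 1.0 - beta
        out.append(alpha)
    return out


T = 1000
keep = keep_probability(absorbing_mi_betas(T))
for t in (250, 500, 750, 1000):
    print(t, round(keep[t - 1], 6))
```

<p>The linear decay means information about $\mathbf{x}_0$ is removed at a constant rate across timesteps, and every token is masked by step $T$.</p>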
<h3 id="models">Models</h3>
<ul>
<li><strong>Image models</strong>: Modified U-Net architecture from Ho et al. (2020) adapted for categorical output via softmax over $K$ classes</li>
<li><strong>Text models</strong>: 12-layer <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style transformer encoder with 70M parameters (12 heads, MLP dim 3072, QKV dim 768)</li>
<li><strong>Timesteps</strong>: $T = 1000$ for both images and text, though text models can be evaluated with fewer steps (e.g., 256 or 20)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>Best D3PM</th>
          <th>Continuous DDPM</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FID</td>
          <td>CIFAR-10</td>
          <td>7.34 (Gauss + logistic)</td>
          <td>3.17</td>
      </tr>
      <tr>
          <td>NLL (bpd)</td>
          <td>CIFAR-10</td>
          <td>3.44 (Gauss + logistic)</td>
          <td>3.75</td>
      </tr>
      <tr>
          <td>BPC</td>
          <td>text8 (char)</td>
          <td>1.45 (absorbing, $L_\lambda$)</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Perplexity</td>
          <td>LM1B</td>
          <td>76.9 (absorbing)</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained for 1M steps with batch size 512 on TPUv2 or TPUv3</li>
<li>Text models: 12-layer transformer encoder (T5 architecture), 70M parameters</li>
<li>Image models: Modified U-Net architecture from Ho et al. (2020)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/google-research/google-research/tree/master/d3pm">google-research/d3pm</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official JAX/Flax implementation for image and text experiments</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Austin, J., Johnson, D. D., Ho, J., Tarlow, D., &amp; van den Berg, R. (2021). Structured Denoising Diffusion Models in Discrete State-Spaces. <em>NeurIPS 2021</em>. <a href="https://arxiv.org/abs/2107.03006">https://arxiv.org/abs/2107.03006</a></p>
<p><strong>Publication</strong>: NeurIPS 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{austin2021structured,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{Structured Denoising Diffusion Models in Discrete State-Spaces}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Austin, Jacob and Johnson, Daniel D. and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>    = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2021}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">Score-Based Generative Modeling with SDEs</a></li>
</ul>
]]></content:encoded></item><item><title>Consistency Models: Fast One-Step Diffusion Generation</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/consistency-models/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/consistency-models/</guid><description>Consistency models enable one-step generation by learning to map any point on a diffusion ODE trajectory to its origin, achieving FID 3.55 on CIFAR-10.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It proposes consistency models, a new class of generative models designed for fast one-step (or few-step) generation. The models can be trained either by distilling pretrained diffusion models (consistency distillation) or as standalone generative models from scratch (consistency training). The paper provides theoretical analysis of both training modes and achieves FID 3.55 on CIFAR-10 for single-step non-adversarial generation (state of the art at the time of publication).</p>
<h2 id="the-slow-sampling-problem-in-diffusion">The Slow Sampling Problem in Diffusion</h2>
<p>Diffusion models produce high-quality samples but require iterating through many denoising steps (often tens to hundreds), making generation slow compared to GANs or VAEs. Previous approaches to speed up sampling include faster ODE/SDE solvers (DDIM, DPM-Solver) and progressive distillation. These either still require multiple steps or depend on a complex multi-stage distillation pipeline. The goal is a model that can generate high-quality samples in a single forward pass while optionally allowing more steps for better quality.</p>
<h2 id="core-innovation-the-self-consistency-property">Core Innovation: The Self-Consistency Property</h2>
<p>The key idea builds on the Probability Flow (PF) ODE from the score-based SDE framework. The PF ODE describes a deterministic trajectory that converts noise into data, governed by the learned score function. For the VE-SDE parameterization used by EDM (Karras et al., 2022), this takes the form:</p>
<p>$$\frac{d\mathbf{x}_t}{dt} = -t \, s_\phi(\mathbf{x}_t, t)$$</p>
<p>where $s_\phi$ is a pretrained score model. A <strong>consistency function</strong> $f(\mathbf{x}_t, t)$ maps any point on an ODE trajectory to the trajectory&rsquo;s origin $\mathbf{x}_\epsilon$. The defining property is self-consistency:</p>
<p>$$f(\mathbf{x}_t, t) = f(\mathbf{x}_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T]$$</p>
<p>for any points $\mathbf{x}_t$ and $\mathbf{x}_{t'}$ on the same PF ODE trajectory.</p>
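<p>Self-consistency can be checked by hand on a toy problem. The sketch below assumes one-dimensional Gaussian data $\mathcal{N}(0, \sigma_{\text{data}}^2)$, for which the score is known exactly and the PF ODE has the closed-form solution $x_t \propto \sqrt{\sigma_{\text{data}}^2 + t^2}$. The constants and function names are illustrative and not taken from the paper's code.</p>

```python
import math

SIGMA_DATA = 0.5   # std of the toy 1-D Gaussian data (illustrative)
EPS = 0.002        # smallest time on the trajectory (illustrative)


def score(x, t):
    """Exact score of the noised marginal N(0, SIGMA_DATA^2 + t^2)."""
    return -x / (SIGMA_DATA**2 + t**2)


def f_exact(x, t):
    """Closed-form consistency function for this toy: map (x_t, t) to x_eps."""
    return x * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))


def integrate_pf_ode(x, t_start, t_end, n_steps=2000):
    """Euler integration of dx/dt = -t * score(x, t) from t_start to t_end."""
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x += dt * (-t * score(x, t))
        t += dt
    return x


x_T = 3.0                                  # a point at t = 80
x_mid = integrate_pf_ode(x_T, 80.0, 10.0)  # same trajectory, at t = 10

origin_from_T = f_exact(x_T, 80.0)
origin_from_mid = f_exact(x_mid, 10.0)
print(origin_from_T, origin_from_mid)
```

<p>Both evaluations land on the same origin up to the Euler integration error, which is exactly the property a learned consistency model is trained to satisfy.</p>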
<p><strong>Parameterization.</strong> The model enforces the boundary condition $f(\mathbf{x}_\epsilon, \epsilon) = \mathbf{x}_\epsilon$ using skip connections:</p>
<p>$$f_\theta(\mathbf{x}, t) = c_{\text{skip}}(t) \, \mathbf{x} + c_{\text{out}}(t) \, F_\theta(\mathbf{x}, t)$$</p>
<p>where $c_{\text{skip}}(\epsilon) = 1$ and $c_{\text{out}}(\epsilon) = 0$, ensuring the boundary condition is satisfied by construction.</p>
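<p>One way to satisfy these two conditions is sketched below. The specific coefficient formulas and the values $\sigma_{\text{data}} = 0.5$ and $\epsilon = 0.002$ are assumptions modeled on the EDM-style choice; consult the paper's appendix for the exact settings. The free-form network is replaced by an arbitrary stand-in function.</p>

```python
import math

SIGMA_DATA = 0.5   # assumed data std, EDM-style
EPS = 0.002        # assumed smallest time


def c_skip(t):
    """Weight on the input; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)


def consistency_model(F, x, t):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F(x, t)."""
    return c_skip(t) * x + c_out(t) * F(x, t)


def arbitrary_network(x, t):
    """Stand-in for F_theta; its value is irrelevant at the boundary."""
    return 123.0


boundary_value = consistency_model(arbitrary_network, 0.7, EPS)
print(boundary_value)
```

<p>Whatever the network outputs, the model returns its input unchanged at $t = \epsilon$, so the boundary condition never has to be learned.</p>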
<p><strong>Consistency Distillation (CD).</strong> Given a pretrained diffusion model, CD trains a consistency model by enforcing self-consistency between adjacent timesteps:</p>
<p>$$\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) = \mathbb{E}\left[\lambda(t_n) \, d\!\left(f_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}), \, f_{\theta^-}(\hat{\mathbf{x}}_{t_n}^\phi, t_n)\right)\right]$$</p>
<p>where $\hat{\mathbf{x}}_{t_n}^\phi$ is obtained by running one step of the ODE solver using the pretrained score model, $\theta^-$ is an exponential moving average (EMA) of $\theta$, and $d(\cdot, \cdot)$ is a distance metric. The use of a target network $\theta^-$ (updated via EMA) parallels techniques from deep Q-learning and momentum contrastive learning.</p>
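<p>A single term of this loss can be sketched on a toy problem. The example assumes one-dimensional Gaussian data (so the "pretrained" score is available in closed form), an Euler solver, squared $\ell_2$ distance for $d$, and $\lambda(t_n) = 1$. All names and constants are illustrative.</p>

```python
import math

SIGMA_DATA, EPS = 0.5, 0.002   # illustrative constants


def score(x, t):
    """Stand-in for the pretrained score model: exact score of N(0, SIGMA_DATA^2 + t^2)."""
    return -x / (SIGMA_DATA**2 + t**2)


def euler_step(x_next, t_next, t_cur):
    """One Euler step of dx/dt = -t * score(x, t), from t_next down to t_cur."""
    return x_next + (t_cur - t_next) * (-t_next * score(x_next, t_next))


def f_exact(x, t):
    """Closed-form consistency function for the toy Gaussian."""
    return x * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))


def cd_pair_loss(f_student, f_target, x_next, t_next, t_cur, weight=1.0):
    """Squared distance between student at (x_{n+1}, t_{n+1}) and target at (x_hat_n, t_n)."""
    x_hat = euler_step(x_next, t_next, t_cur)
    diff = f_student(x_next, t_next) - f_target(x_hat, t_cur)
    return weight * diff * diff


# A student that already equals the true consistency function has near-zero loss;
# an identity "student" that ignores t does not.
loss_good = cd_pair_loss(f_exact, f_exact, x_next=2.0, t_next=1.0, t_cur=0.9)
loss_bad = cd_pair_loss(lambda x, t: x, f_exact, x_next=2.0, t_next=1.0, t_cur=0.9)
print(loss_good, loss_bad)
```

<p>The small residual in <code>loss_good</code> comes from the Euler step, which is the discretization error that the theoretical bound on CD accounts for.</p>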
<p><strong>Consistency Training (CT).</strong> CT eliminates the need for a pretrained diffusion model. It replaces the ODE solver step with a score estimate derived from the denoising score matching identity:</p>
<p>$$\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) = \mathbb{E}\left[\frac{\mathbf{x} - \mathbf{x}_t}{t^2} \,\middle|\, \mathbf{x}_t\right]$$</p>
<p>Because this identity gives an unbiased score estimate from a clean sample and its noised copy (without a pretrained model), the ODE update can be computed directly from training samples. This allows training directly on data pairs $(\mathbf{x}, \mathbf{x} + t\mathbf{z})$ where $\mathbf{z} \sim \mathcal{N}(0, I)$.</p>
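<p>To see why this works, substitute the single-sample estimate $(\mathbf{x} - \mathbf{x}_{t_{n+1}})/t_{n+1}^2$ into one Euler step of the PF ODE, with $\mathbf{x}_{t_{n+1}} = \mathbf{x} + t_{n+1}\mathbf{z}$. This is a short derivation under the same VE parameterization used above:</p>
<p>$$\hat{\mathbf{x}}_{t_n} = \mathbf{x}_{t_{n+1}} + (t_n - t_{n+1}) \cdot (-t_{n+1}) \cdot \frac{\mathbf{x} - \mathbf{x}_{t_{n+1}}}{t_{n+1}^2} = \mathbf{x}_{t_{n+1}} + (t_n - t_{n+1}) \, \mathbf{z} = \mathbf{x} + t_n \mathbf{z}$$</p>
<p>The solver step therefore collapses to re-noising the same clean sample with the same $\mathbf{z}$ at the smaller noise level, so the CT loss compares $f_\theta(\mathbf{x} + t_{n+1}\mathbf{z}, t_{n+1})$ with $f_{\theta^-}(\mathbf{x} + t_n\mathbf{z}, t_n)$.</p>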
<p><strong>Theoretical guarantee.</strong> If CD achieves zero loss, the consistency model error is bounded by $O((\Delta t)^p)$ where $\Delta t$ is the maximum timestep gap and $p$ is the order of the ODE solver.</p>
<h2 id="experiments-and-benchmarks">Experiments and Benchmarks</h2>
<p><strong>Datasets:</strong> CIFAR-10 (32x32), ImageNet 64x64, LSUN Bedroom 256x256, LSUN Cat 256x256.</p>
<p><strong>Architecture:</strong> All models use the NCSN++/EDM architecture. CD distills from pretrained EDM models.</p>
<p><strong>Key results for consistency distillation (CD):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Steps</th>
          <th>FID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CIFAR-10</td>
          <td>1</td>
          <td>3.55</td>
      </tr>
      <tr>
          <td>CIFAR-10</td>
          <td>2</td>
          <td>2.93</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>1</td>
          <td>6.20</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>2</td>
          <td>4.70</td>
      </tr>
      <tr>
          <td>LSUN Bedroom 256</td>
          <td>1</td>
          <td>7.80</td>
      </tr>
      <tr>
          <td>LSUN Bedroom 256</td>
          <td>2</td>
          <td>5.22</td>
      </tr>
      <tr>
          <td>LSUN Cat 256</td>
          <td>1</td>
          <td>11.0</td>
      </tr>
      <tr>
          <td>LSUN Cat 256</td>
          <td>2</td>
          <td>8.84</td>
      </tr>
  </tbody>
</table>
<p>CD outperforms progressive distillation (PD) across all datasets and sampling steps, with the exception of single-step generation on Bedroom 256x256 where CD with $\ell_2$ slightly underperforms PD with $\ell_2$.</p>
<p><strong>Key results for consistency training (CT):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Steps</th>
          <th>FID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CIFAR-10</td>
          <td>1</td>
          <td>8.70</td>
      </tr>
      <tr>
          <td>CIFAR-10</td>
          <td>2</td>
          <td>5.83</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>1</td>
          <td>13.0</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>2</td>
          <td>11.1</td>
      </tr>
      <tr>
          <td>LSUN Bedroom 256</td>
          <td>1</td>
          <td>16.0</td>
      </tr>
      <tr>
          <td>LSUN Cat 256</td>
          <td>1</td>
          <td>20.7</td>
      </tr>
  </tbody>
</table>
<p>CT outperforms existing single-step non-adversarial models (VAEs, normalizing flows), e.g., improving over DC-VAE&rsquo;s FID of 17.90 on CIFAR-10. Samples from CT share structural similarity with EDM samples from the same initial noise, suggesting CT does not suffer from mode collapse.</p>
<p><strong>Zero-shot editing:</strong> Consistency models support colorization, super-resolution, inpainting, stroke-guided generation, interpolation, and denoising at test time without task-specific training, by modifying the multi-step sampling algorithm.</p>
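<p>The multi-step sampler that these editing procedures modify alternates between denoising with $f$ and re-noising to a smaller time. The sketch below follows that pattern on a toy one-dimensional Gaussian, with the exact consistency function standing in for a trained model. The intermediate times in <code>taus</code>, the constants, and the function names are illustrative assumptions.</p>

```python
import math
import random

SIGMA_DATA, EPS, T_MAX = 0.5, 0.002, 80.0   # illustrative constants


def f_toy(x, t):
    """Exact consistency function for data ~ N(0, SIGMA_DATA^2); stands in for f_theta."""
    return x * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))


def multistep_sample(f, taus, rng):
    """Denoise from pure noise, then repeatedly re-noise to each tau and denoise again."""
    x = f(rng.gauss(0.0, T_MAX), T_MAX)               # one-step sample from x_T ~ N(0, T^2)
    for tau in taus:                                  # taus are in decreasing order
        z = rng.gauss(0.0, 1.0)
        x_tau = x + math.sqrt(tau**2 - EPS**2) * z    # re-noise to level tau
        x = f(x_tau, tau)                             # map back to the origin
    return x


rng = random.Random(0)
samples = [multistep_sample(f_toy, taus=[20.0, 2.0, 0.2], rng=rng) for _ in range(4000)]
sample_std = math.sqrt(sum(s * s for s in samples) / len(samples))
print(sample_std)   # should be close to SIGMA_DATA
```

<p>Passing an empty <code>taus</code> list recovers one-step generation. Editing tasks such as inpainting intervene inside the loop, for example by overwriting known pixels before each re-noising step.</p>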
<h2 id="findings-and-limitations">Findings and Limitations</h2>
<ul>
<li>Consistency distillation achieves state-of-the-art FID for one-step generation (3.55 on CIFAR-10, 6.20 on ImageNet 64x64).</li>
<li>Multi-step sampling provides a smooth quality-compute tradeoff: more steps yield better FID.</li>
<li>CT produces competitive results without any pretrained diffusion model, making consistency models a standalone generative model family.</li>
<li>The LPIPS distance metric $d(\cdot, \cdot)$ generally outperforms $\ell_1$ and $\ell_2$ for training consistency models.</li>
<li>At higher resolutions (LSUN 256x256), the gap between CD/CT and full EDM sampling widens.</li>
<li>CT currently underperforms CD, suggesting room for improvement in the standalone training paradigm.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Primary benchmark</td>
          <td>CIFAR-10</td>
          <td>32x32, 50K train</td>
          <td>FID on 50K samples</td>
      </tr>
      <tr>
          <td>Scaling benchmark</td>
          <td>ImageNet 64x64</td>
          <td>64x64, 1.28M</td>
          <td>Class-conditional generation</td>
      </tr>
      <tr>
          <td>High-res benchmark</td>
          <td>LSUN Bedroom, Cat</td>
          <td>256x256</td>
          <td>Unconditional generation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>ODE solver for CD</strong>: Euler and Heun (2nd order) solvers on the empirical PF ODE</li>
<li><strong>EMA for target network</strong>: Decay rate $\mu$ scheduled as a function of training step</li>
<li><strong>Schedule functions</strong>: $N$ (number of discretization steps) and $\mu$ (EMA rate) increase over training following specific schedules (see Appendix C of the paper)</li>
<li><strong>Distance metric</strong>: LPIPS performs best; $\ell_2$ and $\ell_1$ also evaluated</li>
</ul>
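<p>The $N$ discretization steps are placed on a non-uniform time grid. A minimal sketch, assuming the EDM-style spacing with $\rho = 7$, $\epsilon = 0.002$, and $T = 80$ (treat these defaults and the function name as assumptions rather than confirmed settings):</p>

```python
def karras_times(N, eps=0.002, T=80.0, rho=7.0):
    """Return N increasing times from eps to T, uniformly spaced in t**(1/rho)."""
    lo, hi = eps ** (1.0 / rho), T ** (1.0 / rho)
    return [(lo + i / (N - 1) * (hi - lo)) ** rho for i in range(N)]


times = karras_times(18)
print([round(t, 4) for t in times])
```

<p>With $\rho > 1$ the grid is much denser near $\epsilon$ than near $T$, which concentrates adjacent-timestep pairs at low noise levels where the trajectory changes fastest relative to the signal.</p>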
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: NCSN++/EDM architecture from Karras et al. (2022)</li>
<li><strong>CD teacher</strong>: Pretrained EDM models</li>
<li><strong>Parameterization</strong>: Skip-connection formulation with $c_{\text{skip}}(t)$ and $c_{\text{out}}(t)$ from EDM</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>CD 1-step</th>
          <th>CT 1-step</th>
          <th>EDM (full)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FID</td>
          <td>CIFAR-10</td>
          <td>3.55</td>
          <td>8.70</td>
          <td>2.04</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>ImageNet 64</td>
          <td>6.20</td>
          <td>13.0</td>
          <td>2.44</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>LSUN Bedroom</td>
          <td>7.80</td>
          <td>16.0</td>
          <td>3.57</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>LSUN Cat</td>
          <td>11.0</td>
          <td>20.7</td>
          <td>6.69</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training details follow EDM conventions</li>
<li>CD and CT use the same batch sizes and learning rate schedules as EDM training</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/openai/consistency_models">openai/consistency_models</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained checkpoints</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, Y., Dhariwal, P., Chen, M., &amp; Sutskever, I. (2023). Consistency Models. <em>ICML 2023</em>. <a href="https://arxiv.org/abs/2303.01469">https://arxiv.org/abs/2303.01469</a></p>
<p><strong>Publication</strong>: ICML 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{song2023consistency,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{Consistency Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Song, Yang and Dhariwal, Prafulla and Chen, Mark and Sutskever, Ilya}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>    = <span style="color:#e6db74">{202}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>       = <span style="color:#e6db74">{https://arxiv.org/abs/2303.01469}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/openai/consistency_models">GitHub Repository</a></li>
<li><a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">Score-Based Generative Modeling with SDEs</a></li>
</ul>
]]></content:encoded></item><item><title>Score-Based Generative Modeling with SDEs (Song 2021)</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/score-based-generative-modeling-sde/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/score-based-generative-modeling-sde/</guid><description>Unified SDE framework for score-based generative models, introducing Predictor-Corrector samplers and setting CIFAR-10 records with FID 2.20 and 2.99 bits/dim.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper. It proposes a unified framework that generalizes previous discrete score-based models (SMLD and DDPM) into continuous-time Stochastic Differential Equations (SDEs). The paper introduces algorithms for sampling (Predictor-Corrector) and likelihood computation (Probability Flow ODE), validated by setting new records on CIFAR-10 (FID 2.20, IS 9.89 at the time of publication). It also contains elements of <strong>Systematization</strong> by showing how existing methods are special cases of this broader framework.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Prior successful generative models, specifically Score Matching with Langevin Dynamics (SMLD) and Denoising Diffusion Probabilistic Models (DDPM), operate by sequentially corrupting data with slowly increasing noise and learning to reverse the process. Both methods treat the noise scales as a finite set of discrete steps. The authors aim to generalize this to a continuum of noise scales by modeling the diffusion process as a Stochastic Differential Equation (SDE). This continuous formulation enables:</p>
<ul>
<li><strong>Flexible sampling:</strong> Use of general-purpose SDE solvers.</li>
<li><strong>Exact likelihood computation:</strong> Via connection to Neural ODEs.</li>
<li><strong>Controllable generation:</strong> Solving inverse problems (inpainting, colorization) without retraining.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>SDE framework</strong> for score-based generative modeling:</p>
<ul>
<li><strong>Continuous Generalization:</strong> Proving that SMLD and DDPM noise perturbations correspond to discretizations of Variance Exploding (VE) SDEs and Variance Preserving (VP) SDEs, respectively.</li>
<li><strong>Reverse-Time SDE:</strong> Leveraging Anderson&rsquo;s (1982) result on the time reversal of diffusion processes: the reverse of a diffusion is itself a diffusion, with the forward drift reversed and a correction term involving the score (the gradient of the log marginal density).</li>
<li><strong>Predictor-Corrector (PC) Samplers:</strong> A hybrid sampling strategy where a numerical SDE solver (Predictor) estimates the next step, and a score-based MCMC approach (Corrector) corrects the marginal distribution.</li>
<li><strong>Probability Flow ODE:</strong> Deriving a deterministic ODE that shares the same marginal densities as the SDE, enabling near-exact likelihood computation (accuracy is limited by both numerical ODE solver discretization and variance of the unbiased Hutchinson trace estimator) and latent space manipulation.</li>
<li><strong>Sub-VP SDE:</strong> A new SDE class proposed to improve likelihoods by bounding variance tighter than the VP SDE.</li>
</ul>
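<p>The Predictor-Corrector idea above can be sketched on a toy problem. The example assumes a VE SDE with a geometric noise schedule, one-dimensional Gaussian data with an exact score, a reverse-diffusion predictor, and one Langevin corrector step per noise level. The corrector step size used here is a simplified fixed fraction of $\sigma_i^2$, not the paper's signal-to-noise rule, and all constants are illustrative.</p>

```python
import math
import random

SIGMA_DATA = 0.5                     # toy data distribution: N(0, SIGMA_DATA^2)
SIGMA_MIN, SIGMA_MAX = 0.01, 10.0    # VE noise range (illustrative values)


def score(x, sigma):
    """Exact score of the perturbed marginal N(0, SIGMA_DATA^2 + sigma^2)."""
    return -x / (SIGMA_DATA**2 + sigma**2)


def pc_sample(n_steps, rng, corrector_scale=0.05):
    """Draw one sample by alternating predictor and corrector updates."""
    ratio = SIGMA_MAX / SIGMA_MIN
    sigmas = [SIGMA_MIN * ratio ** (i / (n_steps - 1)) for i in range(n_steps)]
    x = rng.gauss(0.0, SIGMA_MAX)    # prior sample at the largest noise level
    for i in range(n_steps - 2, -1, -1):
        hi, lo = sigmas[i + 1], sigmas[i]
        # Predictor: one reverse-diffusion step of the discretized VE SDE.
        delta = hi**2 - lo**2
        x = x + delta * score(x, hi) + math.sqrt(delta) * rng.gauss(0.0, 1.0)
        # Corrector: one Langevin MCMC step targeting the marginal at level lo.
        step = corrector_scale * lo**2   # simplified step size (assumption)
        x = x + step * score(x, lo) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [pc_sample(100, rng) for _ in range(2000)]
sample_std = math.sqrt(sum(s * s for s in samples) / len(samples))
print(sample_std)   # should be close to SIGMA_DATA
```

<p>Setting <code>corrector_scale</code> to zero gives a predictor-only sampler, which is a convenient way to compare the two on the same noise schedule.</p>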
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the framework on standard image benchmarks:</p>
<ul>
<li><strong>Datasets:</strong> CIFAR-10 (32x32), CelebA (64x64), LSUN (Bedroom, Church), and CelebA-HQ (256x256 and 1024x1024).</li>
<li><strong>Ablation Studies:</strong> Comparing samplers (Ancestral vs. Reverse Diffusion vs. Probability Flow vs. PC) and SDE types (VE, VP, sub-VP).</li>
<li><strong>Architecture Search:</strong> Exploring improvements like FIR up/downsampling, rescaling skip connections, and increasing depth (leading to NCSN++ and DDPM++ architectures).</li>
<li><strong>Likelihood Evaluation:</strong> Computing Negative Log-Likelihood (NLL) in bits/dim using the Probability Flow ODE.</li>
<li><strong>Inverse Problems:</strong> Testing class-conditional generation, inpainting, and colorization using the conditional reverse-time SDE.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Record Performance:</strong> The <strong>NCSN++ cont. (deep, VE)</strong> model achieved an Inception Score of 9.89 and FID of 2.20 on CIFAR-10 (as of ICLR 2021).</li>
<li><strong>High-Fidelity Generation:</strong> First score-based model to generate 1024x1024 images (CelebA-HQ).</li>
<li><strong>Competitive Likelihoods:</strong> The <strong>DDPM++ cont. (deep, sub-VP)</strong> model achieved 2.99 bits/dim on uniformly dequantized CIFAR-10, a record at the time.</li>
<li><strong>Sampling Efficiency:</strong> PC samplers consistently outperformed predictor-only methods (like standard ancestral sampling) for the same computational cost.</li>
<li><strong>Controllable Generation:</strong> Successful application to inpainting and colorization using a single unconditional model.</li>
<li><strong>Limitations:</strong> Sampling remains slower than GANs on the same datasets. The breadth of available samplers introduces many hyperparameters (SDE type, predictor, corrector, signal-to-noise ratio, number of steps) that require tuning.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>CIFAR-10</strong>: Used for main benchmarking (FID, Inception Score, NLL).</li>
<li><strong>CelebA-HQ</strong>: Used for high-resolution experiments at 256x256 and 1024x1024.</li>
<li><strong>LSUN</strong>: Bedroom and Church Outdoor categories (256x256) used for sampler comparison and controllable generation (inpainting, colorization).</li>
<li><strong>Preprocessing</strong>: CIFAR-10 images are 32x32; CelebA pre-processed to 64x64 following Song &amp; Ermon (2020). Data is typically scaled to $[0, 1]$ or standardized depending on the specific SDE config.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Forward SDEs</strong>:</p>
<p>Here $dw$ denotes a Wiener process increment (a small, independent Gaussian noise burst at each timestep).</p>
<ul>
<li><strong>VE SDE (Variance Exploding)</strong>: $dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}} dw$. Corresponds to SMLD. Used with $\sigma_{\min}=0.01$ and $\sigma_{\max}$ chosen via heuristics.</li>
<li><strong>VP SDE (Variance Preserving)</strong>: $dx = -\frac{1}{2}\beta(t)x dt + \sqrt{\beta(t)} dw$. Corresponds to DDPM.</li>
<li><strong>Sub-VP SDE</strong>: $dx = -\frac{1}{2}\beta(t)x dt + \sqrt{\beta(t)(1 - e^{-2\int_0^t \beta(s)ds})} dw$. Bounded variance, good for likelihoods.</li>
</ul>
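<p>A minimal sketch of how the noise level grows under each forward SDE, using the closed-form standard deviation of the Gaussian perturbation kernel. The schedule forms (geometric $\sigma(t)$, linear $\beta(t)$) and the constants <code>sigma_max</code>, <code>beta_min</code>, and <code>beta_max</code> are illustrative assumptions, not values stated in these notes.</p>

```python
import math

def ve_std(t, sigma_min=0.01, sigma_max=50.0):
    """Std of the VE perturbation kernel, assuming a geometric sigma(t)."""
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t
    return math.sqrt(sigma_t ** 2 - sigma_min ** 2)

def beta_integral(t, beta_min=0.1, beta_max=20.0):
    """Integral of an assumed linear beta(s) from 0 to t."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def vp_std(t):
    """Std of the VP perturbation kernel: sqrt(1 - exp(-int beta))."""
    return math.sqrt(1.0 - math.exp(-beta_integral(t)))

def subvp_std(t):
    """Std of the sub-VP perturbation kernel: 1 - exp(-int beta)."""
    return 1.0 - math.exp(-beta_integral(t))

# Tabulate the three noise levels on a coarse time grid.
for step in range(0, 11, 2):
    t = step / 10
    print(f"t={t:.1f}  VE={ve_std(t):8.3f}  VP={vp_std(t):.3f}  subVP={subvp_std(t):.3f}")
```

<p>Under these assumptions the VE noise level grows without bound as <code>sigma_max</code> increases, the VP level saturates at 1, and the sub-VP level never exceeds the VP level.</p>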
<p><strong>Reverse-Time SDE Solver (Predictor)</strong>:</p>
<ul>
<li>Discretized via <strong>Reverse Diffusion Sampling</strong>, which matches the forward discretization.</li>
<li><strong>Euler-Maruyama</strong> solver used for continuously-trained models.</li>
</ul>
<p><strong>Corrector Algorithm</strong>:</p>
<ul>
<li><strong>Langevin MCMC</strong>: Applies annealed Langevin dynamics: adds noise and takes a score-guided gradient step to correct the marginal distribution at each timestep.</li>
<li><strong>PC Sampling</strong>: Alternates between one step of the Predictor and one step of the Corrector.</li>
<li><strong>Signal-to-Noise Ratio ($r$)</strong>: A hyperparameter for the corrector step size. Tuned values: $r \approx 0.16$ for VE SDEs on CIFAR-10.</li>
</ul>
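<p>The predictor-corrector loop can be sketched on a toy problem where the score is known in closed form. Everything below is an illustrative assumption rather than the authors&rsquo; implementation: the data are zero-mean Gaussian with standard deviation <code>data_std</code>, the noise schedule is geometric, the analytic score replaces a trained network, and the helper names are hypothetical. The corrector step size is set from the signal-to-noise ratio $r$ using batch-averaged norms.</p>

```python
import numpy as np

def pc_sample_ve(score_fn, sigmas, shape, snr=0.16, seed=0):
    """Toy predictor-corrector sampler for a VE-type noise schedule.

    score_fn(x, sigma) returns the score of the data perturbed at level sigma.
    sigmas is an increasing array of noise levels.
    """
    rng = np.random.default_rng(seed)
    x = sigmas[-1] * rng.standard_normal(shape)  # start from the widest prior
    for i in range(len(sigmas) - 1, 0, -1):
        s_hi, s_lo = sigmas[i], sigmas[i - 1]
        delta = s_hi ** 2 - s_lo ** 2
        # Predictor: one reverse-diffusion step from s_hi down to s_lo.
        noise = rng.standard_normal(shape)
        x = x + delta * score_fn(x, s_hi) + np.sqrt(delta) * noise
        # Corrector: one Langevin step at level s_lo, step size set by snr.
        grad = score_fn(x, s_lo)
        noise = rng.standard_normal(shape)
        grad_norm = np.linalg.norm(grad, axis=-1).mean()
        noise_norm = np.linalg.norm(noise, axis=-1).mean()
        step = 2.0 * (snr * noise_norm / grad_norm) ** 2
        x = x + step * grad + np.sqrt(2.0 * step) * noise
    return x

data_std = 1.0  # assumed toy data distribution: N(0, data_std^2 I)

def gaussian_score(x, sigma):
    """Exact score of N(0, (data_std^2 + sigma^2) I)."""
    return -x / (data_std ** 2 + sigma ** 2)

sigmas = np.geomspace(0.01, 10.0, 200)
samples = pc_sample_ve(gaussian_score, sigmas, shape=(2000, 16))
print("sample std:", samples.std())
```

<p>With an exact score the samples should end up with a standard deviation close to <code>data_std</code>; in practice <code>gaussian_score</code> would be replaced by the trained score network.</p>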
<h3 id="models">Models</h3>
<ul>
<li><strong>NCSN++</strong>: Optimized architecture for VE SDEs. Key features:
<ul>
<li>4 residual blocks per resolution.</li>
<li>BigGAN-type residual blocks.</li>
<li>Rescaling skip connections by $1/\sqrt{2}$.</li>
<li>FIR (Finite Impulse Response) up/downsampling.</li>
<li>&ldquo;Residual&rdquo; progressive architecture for input, no progressive growing for output.</li>
</ul>
</li>
<li><strong>DDPM++</strong>: Optimized architecture for VP/sub-VP SDEs. Similar to NCSN++ but without FIR up/downsampling and without progressive growing.</li>
<li><strong>Deep Variants</strong>: &ldquo;cont. (deep)&rdquo; models double the depth (from 4 to 8 blocks per resolution) for the best reported results.</li>
<li><strong>Conditioning</strong>: Continuous models are conditioned on time $t$ via random Fourier feature embeddings (scale 16).</li>
</ul>
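<p>A minimal sketch of a random Fourier feature time embedding with scale 16. The embedding width (128 frequencies), the $2\pi$ factor, and the function name are assumptions for illustration; the frequencies are drawn once and then kept fixed.</p>

```python
import numpy as np

def fourier_time_embedding(t, weights):
    """Map times of shape (batch,) to [sin, cos] features of shape (batch, 2 * n_freq)."""
    t = np.asarray(t, dtype=float)
    proj = 2.0 * np.pi * t[:, None] * weights[None, :]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

rng = np.random.default_rng(0)
scale = 16.0                                # scale quoted in the notes
weights = scale * rng.standard_normal(128)  # fixed random frequencies (width assumed)

emb = fourier_time_embedding(np.array([0.0, 0.5, 1.0]), weights)
print(emb.shape)
```

<p>A larger scale spreads the frequencies more widely, which lets the network distinguish nearby time values.</p>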
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>FID (Fréchet Inception Distance)</strong>: Computed on 50k samples.</li>
<li><strong>Inception Score</strong>: Reported for CIFAR-10.</li>
<li><strong>NLL (Negative Log-Likelihood)</strong>: Reported in bits/dim on uniformly dequantized data using the Probability Flow ODE.</li>
</ul>
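<p>The NLL evaluation rests on the instantaneous change-of-variables formula applied to the probability flow ODE. As a sketch, with $\tilde{f}_\theta$ denoting the ODE drift built from the learned score, $D$ the data dimension, and $\epsilon$ a random probe vector with zero mean and identity covariance (all notation assumed here):</p>
<p>$$\log p_0(x(0)) = \log p_T(x(T)) + \int_0^T \nabla \cdot \tilde{f}_\theta(x(t), t)\, dt$$</p>
<p>$$\nabla \cdot \tilde{f}_\theta(x, t) = \mathbb{E}_{\epsilon}\left[\epsilon^\top \nabla_x \tilde{f}_\theta(x, t)\, \epsilon\right]$$</p>
<p>The second line is the Hutchinson trace estimator, which avoids forming the full Jacobian. An NLL in nats converts to bits/dim by dividing by $D \ln 2$, up to constant offsets from data rescaling.</p>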
<p><strong>Denoising</strong>: A single denoising step using Tweedie&rsquo;s formula is applied at the end of sampling to remove residual noise, which significantly improves FID.</p>
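<p>As a sketch of that final step: if $\tilde{x} = x + \sigma z$ with $z \sim \mathcal{N}(0, I)$ and $p_\sigma$ is the density of $\tilde{x}$, Tweedie&rsquo;s formula gives the posterior mean of the clean sample. The symbols here are assumed notation, with the learned score standing in for $\nabla \log p_\sigma$ in practice.</p>
<p>$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$$</p>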
<h3 id="hardware">Hardware</h3>
<p><strong>Training</strong>:</p>
<ul>
<li><strong>Batch size</strong>: 128 for CIFAR-10, 64 for LSUN, 8 for high-res CelebA-HQ.</li>
<li><strong>Iterations</strong>: Discrete-objective models trained for 1.3M iterations during architecture exploration. Continuous-objective models (cont.) trained for 0.95M iterations. High-res CelebA-HQ (1024x1024) trained for approximately 2.4M iterations.</li>
<li><strong>EMA</strong>: Exponential Moving Average rate of 0.999 used for VE models, 0.9999 for VP models.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/yang-song/score_sde">yang-song/score_sde</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official JAX and PyTorch implementation with pretrained checkpoints</td>
      </tr>
  </tbody>
</table>
<p>All datasets used (CIFAR-10, CelebA-HQ, LSUN) are publicly available. Pretrained model checkpoints for CIFAR-10, CelebA-HQ, and FFHQ are provided in the repository. Specific hardware requirements (GPU type, training time) are not detailed in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., &amp; Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. <em>ICLR 2021</em>. <a href="https://arxiv.org/abs/2011.13456">https://arxiv.org/abs/2011.13456</a></p>
<p><strong>Publication</strong>: ICLR 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{song2021scorebased,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{Score-Based Generative Modeling through Stochastic Differential Equations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Song, Yang and Sohl-Dickstein, Jascha and Kingma, Diederik P and Kumar, Abhishek and Ermon, Stefano and Poole, Ben}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>       = <span style="color:#e6db74">{https://openreview.net/forum?id=PxTIG12RRHS}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/yang-song/score_sde">GitHub Repository</a></li>
<li><a href="/notes/machine-learning/generative-models/score-matching-denoising-autoencoders/">Score Matching and Denoising Autoencoders</a></li>
</ul>
]]></content:encoded></item><item><title>Rectified Flow: Learning to Generate and Transfer Data</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/rectified-flow/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/rectified-flow/</guid><description>A unified ODE-based framework for generative modeling and domain transfer that learns straight paths for fast 1-step generation.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, with a significant <strong>Theory</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes &ldquo;Rectified Flow,&rdquo; a novel generative framework that learns ordinary differential equations (ODEs) to transport distributions via straight paths. It introduces the &ldquo;Reflow&rdquo; algorithm to iteratively straighten these paths.</li>
<li><strong>Theory</strong>: It provides rigorous proofs connecting the method to Optimal Transport, showing that the rectification process yields a coupling with non-increasing convex transport costs and that recursive reflow reduces the curvature of trajectories.</li>
</ul>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The work addresses two main challenges in unsupervised learning: generative modeling (generating data from noise) and domain transfer (mapping between two observed distributions).</p>
<ul>
<li><strong>Inefficiency of ODE/SDE Models</strong>: Continuous-time models (like Score-based Generative Models and DDPMs) require simulating diffusions over many steps, resulting in high computational costs during inference.</li>
<li><strong>Complexity of GANs</strong>: GANs provide fast (one-step) generation but suffer from training instability and mode collapse.</li>
<li><strong>Disconnection</strong>: Generative modeling and domain transfer are often treated as separate tasks requiring different techniques.</li>
</ul>
<p>The authors aim to unify these tasks into a single &ldquo;transport mapping&rdquo; problem while bridging the gap between high-quality continuous models and fast one-step models.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Rectified Flow</strong> framework and the <strong>Reflow</strong> procedure.</p>
<ul>
<li><strong>Straight-Line ODEs</strong>: Rectified Flow learns an ODE drift $v$ to follow the straight line connecting data pairs $(X_0, X_1)$, providing an alternative to diffusion models that rely on stochastic paths or specific forward processes. This is achieved via a simple least-squares optimization problem.</li>
<li><strong>Reflow (Iterative Straightening)</strong>: The authors introduce a recursive training procedure where a new flow is trained on the data pairs $(Z_0, Z_1)$ generated by the previous flow. Theoretical analysis shows this reduces the &ldquo;transport cost&rdquo; and straightens the trajectories, allowing for accurate 1-step simulation (effectively converting the ODE into a one-step model).</li>
<li><strong>Unified Framework</strong>: The method uses the exact same algorithm for generation ($\pi_0$ is Gaussian) and domain transfer ($\pi_0$ is a source dataset), removing the need for adversarial losses or cycle-consistency constraints.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the method across image generation, translation, and domain adaptation tasks.</p>
<ul>
<li><strong>Unconditional Image Generation</strong>:
<ul>
<li><strong>Dataset</strong>: CIFAR-10 ($32\times32$).</li>
<li><strong>Baselines</strong>: Compared against GANs (StyleGAN2, TDPM), Diffusion/SDE Models (VP SDE, sub-VP SDE, VE SDE), ODE methods (VP ODE, sub-VP ODE, VE ODE), and distilled methods (DDIM Distillation).</li>
<li><strong>High-Res</strong>: Validated on LSUN Bedroom/Church, CelebA-HQ, and AFHQ ($256\times256$).</li>
</ul>
</li>
<li><strong>Image-to-Image Translation</strong>:
<ul>
<li><strong>Datasets</strong>: AFHQ (Cat $\leftrightarrow$ Dog/Wild), MetFace $\leftrightarrow$ CelebA-HQ.</li>
<li><strong>Setup</strong>: Transferring styles while preserving semantic identity (using a classifier-based feature mapping metric).</li>
</ul>
</li>
<li><strong>Domain Adaptation</strong>:
<ul>
<li><strong>Datasets</strong>: DomainNet, Office-Home.</li>
<li><strong>Metric</strong>: Classification accuracy on the transferred testing data.</li>
</ul>
</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Superior 1-Step Generation</strong>: On CIFAR-10 with a single Euler step (as of ICLR 2023), the distilled 2-Rectified Flow achieved an FID of <strong>4.85</strong>, beating the best one-step U-Net model TDPM (FID 8.91, a truncated diffusion model using a GAN). The distilled 3-Rectified Flow reached a Recall of <strong>0.51</strong>, beating the GAN baseline StyleGAN2+ADA (Recall 0.49).</li>
<li><strong>Straightening Effect</strong>: The &ldquo;Reflow&rdquo; procedure was empirically shown to reduce both the straightness error and transport costs, validating the theoretical claims. Straightness is measured as $S(Z) = \mathbb{E}\left[\int_0^1 \lVert \dot{Z}_t - (Z_1 - Z_0) \rVert^2 \, dt\right]$ (zero means perfectly straight); transport cost is $\mathbb{E}[c(Z_1 - Z_0)]$ for a convex cost $c$, and Reflow does not increase it for any convex cost.</li>
<li><strong>High-Quality Transfer</strong>: The model successfully performed image translation (e.g., Cat to Wild Animal) without paired data or cycle-consistency losses.</li>
<li><strong>Strong Full-Simulation Results</strong>: With RK45 adaptive ODE solving, 1-Rectified Flow achieves FID 2.58 and Recall 0.57 on CIFAR-10 (Table 1a), the best among ODE methods and comparable to fully simulated SDEs (VP SDE: FID 2.55).</li>
<li><strong>Fast Simulation</strong>: The method allows for extremely coarse time discretization (e.g., $N=1$) without significant quality loss after reflow, effectively solving the slow inference speed of standard ODE models.</li>
<li><strong>Domain Adaptation</strong>: On Office-Home, Rectified Flow achieves 69.2% accuracy, outperforming Deep CORAL (68.7%) and other baselines. On DomainNet, it achieves 41.4%, comparable to Deep CORAL (41.5%) and MLDG (41.2%).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper utilizes several standard computer vision benchmarks.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size/Resolution</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generation</td>
          <td><strong>CIFAR-10</strong></td>
          <td>32x32</td>
          <td>Standard split</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td><strong>LSUN</strong> (Bedroom, Church)</td>
          <td>256x256</td>
          <td>High-res evaluation</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td><strong>CelebA-HQ</strong></td>
          <td>256x256</td>
          <td>High-res evaluation</td>
      </tr>
      <tr>
          <td>Gen/Transfer</td>
          <td><strong>AFHQ</strong> (Cat, Dog, Wild)</td>
          <td>512x512</td>
          <td>256x256 for generation, 512x512 for transfer</td>
      </tr>
      <tr>
          <td>Transfer</td>
          <td><strong>MetFace</strong></td>
          <td>1024x1024</td>
          <td>Resized to 512x512 for experiments</td>
      </tr>
      <tr>
          <td>Adaptation</td>
          <td><strong>DomainNet</strong></td>
          <td>Mixed</td>
          <td>345 categories, 6 domains</td>
      </tr>
      <tr>
          <td>Adaptation</td>
          <td><strong>Office-Home</strong></td>
          <td>Mixed</td>
          <td>65 categories, 4 domains</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>
<p><strong>Objective Function</strong>:
The drift $v$ of the ODE $dZ_t = v(Z_t, t)\,dt$ is trained by minimizing a least-squares regression objective:
$$\min_{v} \int_{0}^{1} \mathbb{E}\left[\lVert (X_1 - X_0) - v(X_t, t) \rVert^2\right] dt$$
where $X_t = tX_1 + (1-t)X_0$ is the linear interpolation.</p>
</li>
<li>
<p><strong>Reflow Procedure</strong>:
Iteratively updates the flow. Let $Z^i$ be the $i$-th rectified flow.</p>
<ol>
<li>Generate 4 million data pairs $(Z_0^i, Z_1^i)$ by simulating the current flow.</li>
<li>Fine-tune the $i$-rectified flow model for 300,000 steps on these pairs to obtain the $(i+1)$-rectified flow.</li>
</ol>
</li>
<li>
<p><strong>Distillation</strong>:
For 1-step distillation ($k=1$), the L2 loss is replaced with LPIPS perceptual similarity, which empirically yields better image quality. For multi-step ($k$-step) distillation, training samples $t$ from $\{0, 1/k, \ldots, (k-1)/k\}$ rather than from the full $[0, 1]$ interval.</p>
</li>
<li>
<p><strong>ODE Solver</strong>:</p>
<ul>
<li>Training: Analytical linear interpolation.</li>
<li>Inference: Euler method (constant step size $1/N$) or RK45 (adaptive).</li>
</ul>
</li>
</ul>
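<p>The objective, Euler sampling, and one round of Reflow can be sketched on a 1-D toy problem. This is an illustration under stated assumptions, not the paper&rsquo;s setup: the drift is a per-time-step linear model fitted by least squares instead of a neural network, both distributions are Gaussian, and the helper names <code>fit_drift</code>, <code>euler_sample</code>, and <code>one_step</code> are hypothetical.</p>

```python
import numpy as np

def fit_drift(x0, x1, steps):
    """Least-squares fit of v(x, t_k) = a_k * x + b_k to the target X1 - X0."""
    coefs = []
    for k in range(steps):
        t = k / steps
        xt = t * x1 + (1.0 - t) * x0          # linear interpolation X_t
        slope, intercept = np.polyfit(xt, x1 - x0, 1)
        coefs.append((slope, intercept))
    return coefs

def euler_sample(z0, coefs):
    """Integrate dZ = v(Z, t) dt with constant step size 1 / len(coefs)."""
    z = z0.copy()
    dt = 1.0 / len(coefs)
    for slope, intercept in coefs:
        z = z + dt * (slope * z + intercept)
    return z

def one_step(z0, coefs):
    """Single Euler step of size 1 using the drift at t = 0."""
    slope, intercept = coefs[0]
    return z0 + slope * z0 + intercept

rng = np.random.default_rng(0)
n, steps = 20000, 50
x0 = rng.standard_normal(n)                   # source samples, N(0, 1)
x1 = 3.0 + 0.5 * rng.standard_normal(n)       # target samples, N(3, 0.25)

flow1 = fit_drift(x0, x1, steps)              # 1-rectified flow
z0 = rng.standard_normal(n)
z1 = euler_sample(z0, flow1)                  # coupled pairs (z0, z1)

flow2 = fit_drift(z0, z1, steps)              # reflow: refit on the flow's own pairs

print("full simulation mean/std:", z1.mean(), z1.std())
print("1-step std, 1-rectified:", one_step(z0, flow1).std())
print("1-step std, 2-rectified:", one_step(z0, flow2).std())
```

<p>In this toy setting a single Euler step of the first flow collapses every sample toward the target mean, while the reflowed drift reproduces the full simulation in one step, which is the straightening effect that Reflow targets.</p>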
<h3 id="models">Models</h3>
<ul>
<li>
<p><strong>Architecture</strong>:</p>
<ul>
<li>Uses the <strong>DDPM++ U-Net</strong> architecture (from Song et al., 2020) across experiments. Implementation is modified from the open-source code of Song et al.</li>
</ul>
</li>
<li>
<p><strong>Optimization</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: Adam (CIFAR-10) or AdamW (Transfer/Adaptation).</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>LR: $2 \times 10^{-4}$ (CIFAR), Grid search for transfer.</li>
<li>EMA: 0.999999 (CIFAR), 0.9999 (Transfer).</li>
<li>Batch Size: 4 (Transfer), 16 (Domain Adaptation).</li>
<li>Dropout: 0.15 (CIFAR), 0.1 (Transfer).</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (CIFAR-10, N=1)</th>
          <th>Baseline (Best 1-step)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>FID</strong></td>
          <td><strong>4.85</strong> (2-Rectified + Distill)</td>
          <td>8.91 (TDPM)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td><strong>Recall</strong></td>
          <td><strong>0.51</strong> (3-Rectified + Distill)</td>
          <td>0.49 (StyleGAN2+ADA)</td>
          <td>Higher is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU models or training times. The DDPM++ U-Net architecture used in the experiments typically requires multi-GPU setups for training on high-resolution datasets.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gnobitab/RectifiedFlow">RectifiedFlow (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation with CIFAR-10 and high-res training code, plus pre-trained checkpoints</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Gong, C., &amp; Liu, Q. (2023). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. <em>International Conference on Learning Representations (ICLR)</em>. <a href="https://openreview.net/forum?id=XVjTT1nw5z">https://openreview.net/forum?id=XVjTT1nw5z</a></p>
<p><strong>Publication</strong>: ICLR 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liuFlowStraightFast2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Flow {{Straight}} and {{Fast}}: {{Learning}} to {{Generate}} and {{Transfer Data}} with {{Rectified Flow}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Liu, Xingchao and Gong, Chengyue and Liu, Qiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://openreview.net/forum?id=XVjTT1nw5z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/gnobitab/RectifiedFlow">Official Code Repository</a></li>
<li><a href="https://openreview.net/forum?id=XVjTT1nw5z">OpenReview Page</a></li>
</ul>
]]></content:encoded></item><item><title>Building Normalizing Flows with Stochastic Interpolants</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/stochastic-interpolants/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/stochastic-interpolants/</guid><description>A continuous-time normalizing flow using stochastic interpolants and quadratic loss to bypass costly ODE backpropagation.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, with significant <strong>Theory</strong> contributions.</p>
<p>The authors propose a specific algorithm (&ldquo;InterFlow&rdquo;) for constructing generative models based on continuous-time normalizing flows. Its central contribution is the derivation of a new training objective (a simple quadratic loss) that bypasses the computational bottlenecks of previous methods. It includes baseline comparisons against continuous flow methods (FFJORD, OT-Flow) and diffusion models. The theoretical component shows that the interpolant density satisfies the continuity equation (a conservation law governing how probability mass flows) and bounds the Wasserstein-2 distance (a measure of transport cost between distributions, penalizing squared displacement) of the resulting transport.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The primary motivation is to overcome the computational inefficiency of training Continuous Normalizing Flows (CNFs) using Maximum Likelihood Estimation (MLE). Standard CNF training requires backpropagating through numerical ODE solvers, which is costly and limits scalability.</p>
<p>Additionally, while score-based diffusion models (SDEs) have achieved high sample quality, they theoretically require infinite time integration and rely on specific noise schedules. The authors aim to establish a method that works strictly with Probability Flow ODEs on finite time intervals, retaining the flexibility to connect arbitrary densities without the complexity of SDEs or the cost of standard ODE adjoint methods.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Stochastic Interpolant</strong> framework:</p>
<ul>
<li><strong>Explicit Interpolant Construction</strong>: The method defines a time-dependent interpolant $x_t = I_t(x_0, x_1)$ (e.g., trigonometric interpolation) that connects samples from the base density $\rho_0$ and target $\rho_1$.</li>
<li><strong>Simulation-Free Training</strong>: The velocity field $v_t(x)$ of the probability flow is learned by minimizing a simple quadratic objective: $G(\hat{v}) = \mathbb{E}[\lVert \hat{v}_t(x_t) \rVert^2 - 2\partial_t I_t(x_0, x_1) \cdot \hat{v}_t(x_t)]$. Because $\partial_t I_t$ is known analytically from the interpolant definition, the expectation can be estimated by sampling $(x_0, x_1, t)$ directly. This avoids ODE integration during training (ODE integration is still required at inference).</li>
<li><strong>Decoupling Path and Optimization</strong>: The choice of path (interpolant) is separated from the optimization of the velocity field. MLE methods couple the path and objective.</li>
<li><strong>Connection to Score-Based Models</strong>: The authors show that for Gaussian base densities and trigonometric interpolants, the learned velocity field is explicitly related to the score function $\nabla \log \rho_t$, providing a theoretical bridge between CNFs and diffusion models.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors performed validation across synthetic, tabular, and image domains:</p>
<ul>
<li><strong>2D Density Estimation</strong>: Benchmarked on &ldquo;Checkerboard&rdquo;, &ldquo;8 Gaussians&rdquo;, and anisotropic curved densities to visualize mode coverage and transport smoothness.</li>
<li><strong>High-Dimensional Tabular Data</strong>: Evaluated on standard benchmarks (POWER, GAS, HEPMASS, MINIBOONE, BSDS300) comparing Negative Log Likelihood (NLL) against FFJORD, OT-Flow, and others.</li>
<li><strong>Image Generation</strong>: Trained models on CIFAR-10 ($32 \times 32$), ImageNet ($32 \times 32$), and Oxford Flowers ($128 \times 128$) to test scalability.</li>
<li><strong>Ablations</strong>: Investigated optimizing the interpolant path itself (e.g., learning Fourier coefficients for the path) to approach optimal transport and minimize path length.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Performance</strong>: The method matches or surpasses conventional ODE flows (like FFJORD) in terms of NLL while being significantly cheaper to train.</li>
<li><strong>Efficiency</strong>: The training cost per epoch is constant (simulation-free), whereas MLE-based ODE methods see growing costs as the dynamics become more complex.</li>
<li><strong>Scalability</strong>: The method successfully scales to $128 \times 128$ resolution on a single GPU, a resolution that prior ab-initio ODE flows had not demonstrated.</li>
<li><strong>Flexibility</strong>: The framework can connect <em>any</em> two arbitrary densities (e.g., connecting two different complex 2D distributions) without needing one to be Gaussian.</li>
<li><strong>Optimal Transport</strong>: For a fixed interpolant, minimizing $G(\hat{v})$ over the velocity field recovers the velocity for that specific path. Additionally optimizing over the interpolant family yields a solution to the Benamou-Brenier optimal transport problem.</li>
<li><strong>Limitations</strong>: The authors acknowledge that image FID scores trail dedicated diffusion models, noting that InterFlow was not optimized with standard training tricks such as exponential moving averages, truncation, or learning rate warm-ups. The framework&rsquo;s sample quality could likely improve with these additions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Tabular Datasets</strong>: POWER (6D), GAS (8D), HEPMASS (21D), MINIBOONE (43D), BSDS300 (63D).
<ul>
<li>Training points range from ~30k (MINIBOONE) to ~1.6M (POWER).</li>
</ul>
</li>
<li><strong>Image Datasets</strong>:
<ul>
<li>CIFAR-10 ($32 \times 32$, 50k training points).</li>
<li>ImageNet ($32 \times 32$, ~1.28M training points).</li>
<li>Oxford Flowers ($128 \times 128$, ~315k training points).</li>
</ul>
</li>
<li><strong>Time Sampling</strong>: Time $t$ is sampled from a Beta distribution during training (reweighting) to focus learning near the target.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Interpolant</strong>: The primary interpolant used is trigonometric: $I_t(x_0, x_1) = \cos(\frac{\pi t}{2})x_0 + \sin(\frac{\pi t}{2})x_1$.
<ul>
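<li>Alternative linear interpolant: $I_t = a_t x_0 + b_t x_1$, where the boundary conditions $a_0 = b_1 = 1$ and $a_1 = b_0 = 0$ ensure $I_0 = x_0$ and $I_1 = x_1$ (e.g., $a_t = 1 - t$, $b_t = t$).</li>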
</ul>
</li>
<li><strong>Loss Function</strong>:
$$G(\hat{v}) = \mathbb{E}_{t, x_0, x_1}[|\hat{v}_t(x_t)|^2 - 2\partial_t I_t(x_0, x_1) \cdot \hat{v}_t(x_t)]$$
<ul>
<li>The expectation is amenable to empirical estimation using batches of $x_0, x_1, t$.</li>
</ul>
</li>
<li><strong>Sampling</strong>: Numerical integration using Dormand-Prince (Runge-Kutta 4/5).</li>
<li><strong>Optimization</strong>: SGD/Adam variants used for optimization.</li>
</ul>
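<p>A minimal Monte Carlo sketch of the quadratic objective with the trigonometric interpolant, on a 1-D toy problem where the minimizer is known. The assumptions are loud ones: the base is $N(0, 1)$, the target is $N(2, 1)$, time is sampled uniformly rather than from a Beta distribution, and the candidate velocity fields are hand-written functions, not trained networks. For this pair of Gaussians the optimal velocity works out to $\pi \cos(\pi t / 2)$, independent of $x$.</p>

```python
import numpy as np

def interpolant(x0, x1, t):
    """Trigonometric interpolant I_t(x0, x1)."""
    return np.cos(0.5 * np.pi * t) * x0 + np.sin(0.5 * np.pi * t) * x1

def interpolant_dt(x0, x1, t):
    """Analytic time derivative of the trigonometric interpolant."""
    return 0.5 * np.pi * (-np.sin(0.5 * np.pi * t) * x0 + np.cos(0.5 * np.pi * t) * x1)

def interflow_loss(v_hat, x0, x1, t):
    """Batch estimate of G(v_hat) = E[|v_hat|^2 - 2 * dI/dt * v_hat]."""
    xt = interpolant(x0, x1, t)
    v = v_hat(xt, t)
    return np.mean(v ** 2 - 2.0 * interpolant_dt(x0, x1, t) * v)

rng = np.random.default_rng(0)
n = 200_000
x0 = rng.standard_normal(n)           # base samples
x1 = 2.0 + rng.standard_normal(n)     # target samples
t = rng.uniform(0.0, 1.0, n)          # uniform time sampling (assumption)

def v_true(x, t):
    """Closed-form minimizer for this Gaussian toy problem."""
    return np.pi * np.cos(0.5 * np.pi * t)

def v_zero(x, t):
    """Trivial candidate: no motion."""
    return np.zeros_like(x)

def v_off(x, t):
    """Perturbed candidate: the minimizer plus a spurious linear term."""
    return v_true(x, t) + 0.5 * x

loss_true = interflow_loss(v_true, x0, x1, t)
loss_off = interflow_loss(v_off, x0, x1, t)
loss_zero = interflow_loss(v_zero, x0, x1, t)
print(loss_true, loss_off, loss_zero)
```

<p>No ODE is solved anywhere in this computation: every term comes from sampled triples $(x_0, x_1, t)$, which is what makes training simulation-free. The loss is lowest for the closed-form velocity, as expected of a regression onto $\partial_t I_t$.</p>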
<h3 id="models">Models</h3>
<ul>
<li><strong>Tabular Architectures</strong>:
<ul>
<li>Feed-forward networks with 4-5 hidden layers.</li>
<li>Hidden widths: 512 (POWER, GAS, HEPMASS, MINIBOONE) or 1024 (BSDS300).</li>
<li>Activation: ReLU (general) or ELU (BSDS300).</li>
</ul>
</li>
<li><strong>Image Architectures</strong>:
<ul>
<li>U-Net based on the DDPM implementation.</li>
<li>Hidden dimension: 256.</li>
<li>Sinusoidal time embeddings used.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Negative Log Likelihood (NLL) in nats (tabular) or bits per dim (images), Fréchet Inception Distance (FID) for images.</li>
<li><strong>Baselines</strong>: FFJORD, Glow, Real NVP, OT-Flow, ScoreFlow, DDPM.</li>
</ul>
<p><strong>Tabular NLL</strong> (nats, lower is better; Table 2 Left):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>POWER</th>
          <th>GAS</th>
          <th>HEPMASS</th>
          <th>MINIBOONE</th>
          <th>BSDS300</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MADE</td>
          <td>3.08</td>
          <td>-3.56</td>
          <td>20.98</td>
          <td>15.59</td>
          <td>-148.85</td>
      </tr>
      <tr>
          <td>Real NVP</td>
          <td>-0.17</td>
          <td>-8.33</td>
          <td>18.71</td>
          <td>13.55</td>
          <td>-153.28</td>
      </tr>
      <tr>
          <td>Glow</td>
          <td>-0.17</td>
          <td>-8.15</td>
          <td>18.92</td>
          <td>11.35</td>
          <td>-155.07</td>
      </tr>
      <tr>
          <td>CPF</td>
          <td>-0.52</td>
          <td>-10.36</td>
          <td>16.93</td>
          <td>10.58</td>
          <td>-154.99</td>
      </tr>
      <tr>
          <td>NSP</td>
          <td>-0.64</td>
          <td>-13.09</td>
          <td>14.75</td>
          <td>9.67</td>
          <td>-157.54</td>
      </tr>
      <tr>
          <td>FFJORD</td>
          <td>-0.46</td>
          <td>-8.59</td>
          <td>14.92</td>
          <td>10.43</td>
          <td>-157.40</td>
      </tr>
      <tr>
          <td>OT-Flow</td>
          <td>-0.30</td>
          <td>-9.20</td>
          <td>17.32</td>
          <td>10.55</td>
          <td>-154.20</td>
      </tr>
      <tr>
          <td><strong>Ours</strong></td>
          <td><strong>-0.57</strong></td>
          <td><strong>-12.35</strong></td>
          <td><strong>14.85</strong></td>
          <td><strong>10.42</strong></td>
          <td><strong>-156.22</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation NLL and FID</strong> (Table 2 Right; NLL in bits per dim, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>CIFAR-10 NLL</th>
          <th>CIFAR-10 FID</th>
          <th>ImageNet-32 NLL</th>
          <th>ImageNet-32 FID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FFJORD</td>
          <td>3.40</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Glow</td>
          <td>3.35</td>
          <td>-</td>
          <td>4.09</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DDPM</td>
          <td>≤3.75</td>
          <td>3.17</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DDPM++ (Song et al., 2021)</td>
          <td>≤3.37</td>
          <td>2.90</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>ScoreSDE (Song et al., 2021)</td>
          <td>2.99</td>
          <td>2.92</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>VDM</td>
          <td>≤2.65</td>
          <td>7.41</td>
          <td>≤3.72</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Soft Truncation</td>
          <td>2.88</td>
          <td>3.45</td>
          <td>3.85</td>
          <td>8.42</td>
      </tr>
      <tr>
          <td>ScoreFlow</td>
          <td>2.81</td>
          <td>5.40</td>
          <td>3.76</td>
          <td>10.18</td>
      </tr>
      <tr>
          <td><strong>Ours</strong></td>
          <td><strong>2.99</strong></td>
          <td><strong>10.27</strong></td>
          <td><strong>3.48</strong></td>
          <td><strong>8.49</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: DDPM++ is from Song et al. (2021), the same work as ScoreSDE (it is the architecture optimized for VP/sub-VP SDEs). InterFlow matches ScoreSDE on CIFAR-10 NLL (2.99 bits per dim) while being simulation-free. FID is weaker than dedicated image models (10.27 vs 2.92 for ScoreSDE), reflecting the paper&rsquo;s primary focus on tractable likelihood rather than sample quality.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: All models were trained on a single NVIDIA A100 GPU.</li>
<li><strong>Training Time</strong>:
<ul>
<li>Tabular: $10^5$ steps.</li>
<li>Images: $1.5 \times 10^5$ to $6 \times 10^5$ steps.</li>
<li>Speedup: approximately 400x faster than FFJORD on the MiniBooNE dataset.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>lucidrains/denoising-diffusion-pytorch (link defunct)</td>
          <td>Code</td>
          <td>MIT</td>
          <td>Base U-Net architecture used for image experiments; original GitHub account no longer available</td>
      </tr>
  </tbody>
</table>
<p>No official code release accompanies this paper. All tabular datasets (POWER, GAS, HEPMASS, MINIBOONE, BSDS300) are publicly available from prior work, and CIFAR-10, ImageNet, and Oxford Flowers 102 are standard public benchmarks. Hyperparameters and architectures are fully specified in Tables 3 and 4 of the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Albergo, M. S., &amp; Vanden-Eijnden, E. (2023). Building Normalizing Flows with Stochastic Interpolants. <em>The Eleventh International Conference on Learning Representations</em>.</p>
<p><strong>Publication</strong>: ICLR 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{albergoBuildingNormalizingFlows2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Building {{Normalizing Flows}} with {{Stochastic Interpolants}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{The {{Eleventh International Conference}} on {{Learning Representations}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Albergo, Michael Samuel and {Vanden-Eijnden}, Eric}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://openreview.net/forum?id=li7qeBbCR1t}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=li7qeBbCR1t">OpenReview</a></li>
<li><a href="https://arxiv.org/abs/2209.15571">arXiv</a></li>
</ul>
]]></content:encoded></item><item><title>OCSAug: Diffusion-Based Augmentation for Hand-Drawn OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</guid><description>A diffusion-based data augmentation pipeline (OCSAug) using DDPM and RePaint to improve optical chemical structure recognition on hand-drawn images.</description><content:encoded><![CDATA[<h2 id="document-taxonomy-ocsaug-as-a-novel-method">Document Taxonomy: OCSAug as a Novel Method</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>. It proposes a novel data augmentation pipeline (<strong>OCSAug</strong>) that integrates Denoising Diffusion Probabilistic Models (DDPM) and the RePaint algorithm to address the data scarcity problem in hand-drawn optical chemical structure recognition (OCSR). The contribution is validated through systematic benchmarking against existing augmentation techniques (RDKit, Randepict) and ablation studies on mask design.</p>
<h2 id="expanding-hand-drawn-training-data-for-ocsr">Expanding Hand-Drawn Training Data for OCSR</h2>
<p>A vast amount of molecular structure data exists in analog formats, such as hand-drawn diagrams in research notes or older literature. While OCSR models perform well on digitally rendered images, they struggle with hand-drawn images due to noise, varying handwriting styles, and distortions. Current datasets for hand-drawn images (e.g., DECIMER) are too small to train effective models, and existing augmentation tools (RDKit, Randepict) fail to generate sufficiently realistic hand-drawn variations.</p>
<h2 id="ocsaug-pipeline-masked-repaint-via-generative-ai">OCSAug Pipeline: Masked RePaint via Generative AI</h2>
<p>The core novelty is <strong>OCSAug</strong>, a three-phase pipeline that uses generative AI to synthesize training data:</p>
<ol>
<li><strong>DDPM + RePaint</strong>: It utilizes a DDPM to learn the distribution of hand-drawn images and the RePaint algorithm for inpainting.</li>
<li><strong>Structural Masking</strong>: It introduces <strong>vertical and horizontal stripe pattern masks</strong>. These masks selectively obscure parts of atoms or bonds, forcing the diffusion model to reconstruct them with irregular &ldquo;hand-drawn&rdquo; styles while preserving the underlying chemical topology.</li>
<li><strong>Label Transfer</strong>: Because the chemical structure is preserved during inpainting, the SMILES label from the original image is directly transferred to the augmented image, bypassing the need for re-annotation.</li>
</ol>
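<p>The stripe masks from step 2 can be illustrated with a minimal sketch. The 4-pixel thickness matches the value reported as optimal in these notes; the stripe period, the 1 = keep / 0 = inpaint convention, and the function name <code>stripe_mask</code> are assumptions made for illustration and are not taken from the paper or its repository.</p>

```python
def stripe_mask(height, width, thickness=4, period=16, orientation="vertical"):
    """Return a height x width grid: 1 = keep pixel, 0 = pixel to be inpainted.

    Every `period` pixels, a band of `thickness` pixels is marked for inpainting.
    """
    if orientation not in ("vertical", "horizontal"):
        raise ValueError("orientation must be 'vertical' or 'horizontal'")
    if not 0 < thickness < period:
        raise ValueError("thickness must be positive and smaller than period")
    mask = []
    for row in range(height):
        line = []
        for col in range(width):
            # Vertical stripes depend on the column index, horizontal on the row.
            pos = col if orientation == "vertical" else row
            masked = (pos % period) < thickness
            line.append(0 if masked else 1)
        mask.append(line)
    return mask


vertical = stripe_mask(256, 256, thickness=4, orientation="vertical")
horizontal = stripe_mask(256, 256, thickness=4, orientation="horizontal")

# Fraction of pixels the diffusion model would have to regenerate.
masked_fraction = 1 - sum(map(sum, vertical)) / (256 * 256)
print(masked_fraction)  # 0.25
```

<p>With the assumed period of 16 pixels, a quarter of each image is regenerated while the remaining pixels anchor the original structure, which is what allows the SMILES label to be reused.</p>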
<h2 id="benchmarking-diffusion-augmentations-on-decimer">Benchmarking Diffusion Augmentations on DECIMER</h2>
<p>The authors evaluated OCSAug using the <strong>DECIMER dataset</strong>, specifically a &ldquo;drug-likeness&rdquo; subset filtered by Lipinski&rsquo;s and Veber&rsquo;s rules.</p>
<ul>
<li><strong>Baselines</strong>: The method was compared against <strong>RDKit</strong> (digital generation) and <strong>Randepict</strong> (rule-based augmentation).</li>
<li><strong>Models</strong>: Four recent OCSR models were fine-tuned: <strong>MolScribe</strong>, <strong>DECIMER 1.0 (I2S)</strong>, <strong>MolNexTR</strong>, and <strong>MPOCSR</strong>.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Tanimoto Similarity</strong>: To measure prediction accuracy against ground truth.</li>
<li><strong>Fréchet Inception Distance (FID)</strong>: To measure the distributional similarity between generated and real hand-drawn images.</li>
<li><strong>RMSE</strong>: To quantify pixel-level structural preservation across different mask thicknesses.</li>
</ul>
</li>
</ul>
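<p>As a reminder of what the Tanimoto metric measures, here is a minimal sketch over fingerprints represented as sets of on-bit indices. The fingerprint type used in the paper is not specified in these notes, so the bit sets below are purely illustrative.</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a.union(b)
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a.intersection(b)) / len(union)


# Hypothetical fingerprints of a predicted and a ground-truth molecule.
predicted = {1, 5, 9, 12}
reference = {1, 5, 9, 20}
print(tanimoto(predicted, reference))  # 3 shared bits / 5 distinct bits = 0.6
```

<p>A score of 1.0 means the two fingerprints are identical, which is the exact-match criterion used in the real-world evaluation.</p>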
<h2 id="improved-generalization-capabilities-and-fid-scores">Improved Generalization Capabilities and FID Scores</h2>
<ul>
<li><strong>Performance Boost</strong>: OCSAug improved recognition accuracy (Tanimoto similarity) by <strong>1.918 to 3.820 times</strong> compared to non-fine-tuned baselines (Improvement Ratio), outperforming traditional augmentation techniques such as RDKit and Randepict (1.570-3.523x).</li>
<li><strong>Data Quality</strong>: OCSAug achieved the lowest FID score (0.471) compared to Randepict (4.054) and RDKit (10.581), indicating its generated images are much closer to the real hand-drawn distribution.</li>
<li><strong>Generalization</strong>: The method showed improved generalization on a newly collected real-world dataset of 463 images from 6 volunteers.</li>
<li><strong>Resolution Mixing</strong>: Training MolScribe and MolNexTR with a mix of $128 \times 128$, $256 \times 256$, and $512 \times 512$ resolution images improved Tanimoto similarity (e.g., MolScribe from 0.585 to 0.640), though this strategy did not help I2S or MPOCSR.</li>
<li><strong>Real-World Evaluation</strong>: On a newly collected dataset of 463 hand-drawn images from 6 volunteers (88 drug compounds), the MPOCSR model fine-tuned with OCSAug achieved 0.367 exact-match accuracy (Tanimoto = 1.0), compared to 0.365 for non-augmented fine-tuning and 0.037 for no fine-tuning. The area under the accuracy curve showed a more notable improvement in reducing misrecognition.</li>
<li><strong>Limitations</strong>: The generation process is slow (3 weeks for 10k images on a single GPU). The fixed stripe masks may struggle with highly complex, non-drug-like geometries: when evaluated on the full DECIMER dataset (without drug-likeness filtering), OCSAug did not yield uniform improvements across all models.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jjjabcd/OCSAug">OCSAug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using guided-diffusion and RePaint</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6456306">DECIMER Hand-Drawn Dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY 4.0</td>
          <td>5,088 hand-drawn molecular structure images from 24 individuals</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: DECIMER dataset (hand-drawn images).</li>
<li><strong>Filtering</strong>: A &ldquo;drug-likeness&rdquo; filter was applied (Lipinski&rsquo;s rule of 5 + Veber&rsquo;s rules) along with an atom filter (C, H, O, S, F, Cl, Br, N, P only).</li>
<li><strong>Final Size</strong>: 3,194 samples, split into:
<ul>
<li><strong>Training</strong>: 2,604 samples.</li>
<li><strong>Validation</strong>: 290 samples.</li>
<li><strong>Test</strong>: 300 samples.</li>
</ul>
</li>
<li><strong>Resolution</strong>: All images resized to $256 \times 256$ pixels.</li>
</ul>
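<p>The filtering step can be sketched as follows. The thresholds are the commonly cited textbook values for Lipinski's rule of 5 and Veber's rules; whether the authors tolerate any violations, and how they compute the descriptors, is not stated in these notes. The sketch assumes zero violations and precomputed descriptors in a hypothetical dictionary schema.</p>

```python
ALLOWED_ELEMENTS = {"C", "H", "O", "S", "F", "Cl", "Br", "N", "P"}


def passes_filters(mol):
    """Check one molecule, given as a dict of precomputed descriptors.

    The key names are hypothetical; any cheminformatics toolkit could fill them.
    """
    lipinski = (
        mol["mol_weight"] <= 500      # molecular weight in Da
        and mol["logp"] <= 5          # octanol-water partition coefficient
        and mol["h_donors"] <= 5      # hydrogen-bond donors
        and mol["h_acceptors"] <= 10  # hydrogen-bond acceptors
    )
    veber = mol["rotatable_bonds"] <= 10 and mol["tpsa"] <= 140  # TPSA in square angstroms
    atoms_ok = set(mol["elements"]).issubset(ALLOWED_ELEMENTS)
    return lipinski and veber and atoms_ok


# Illustrative descriptor values, not taken from the dataset.
small_molecule = {
    "mol_weight": 180.2, "logp": 1.2, "h_donors": 1, "h_acceptors": 4,
    "rotatable_bonds": 3, "tpsa": 63.6, "elements": ["C", "H", "O"],
}
silicon_compound = dict(small_molecule, elements=["C", "H", "Si"])

print(passes_filters(small_molecule))    # True
print(passes_filters(silicon_compound))  # False
```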
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DDPM implemented using <code>guided-diffusion</code>.</li>
<li><strong>RePaint Settings</strong>:
<ul>
<li>Total time steps: 250.</li>
<li>Jump length: 10.</li>
<li>Resampling counts: 10.</li>
</ul>
</li>
<li><strong>Masking Strategy</strong>:
<ul>
<li><strong>Vertical Stripes</strong>: Obscure atom symbols to vary handwriting style.</li>
<li><strong>Horizontal Stripes</strong>: Obscure bonds to vary length/thickness/alignment.</li>
<li><strong>Optimal Thickness</strong>: A stripe thickness of <strong>4 pixels</strong> was found to be optimal for balancing diversity and structural preservation.</li>
</ul>
</li>
</ul>
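<p>To see how the three RePaint settings interact, the following sketch builds a jump-and-resample time schedule in the spirit of RePaint. The function name, the bookkeeping, and the mapping of "resampling counts" to the number of passes over each segment are assumptions; the authors' exact schedule may differ.</p>

```python
def resampling_schedule(total_steps=250, jump_length=10, resample_count=10):
    """Return the sequence of diffusion time indices visited during inpainting.

    The sampler walks from total_steps - 1 down to 0. Whenever it reaches a
    multiple of jump_length that lies below total_steps - jump_length, it
    re-noises forward by jump_length steps and denoises that stretch again,
    until the stretch has been denoised resample_count times in total.
    """
    remaining = {
        t: resample_count - 1
        for t in range(0, total_steps - jump_length, jump_length)
    }
    t = total_steps
    schedule = []
    while t >= 1:
        t -= 1
        schedule.append(t)  # one reverse (denoising) step
        if remaining.get(t, 0) > 0:
            remaining[t] -= 1
            for _ in range(jump_length):
                t += 1
                schedule.append(t)  # one forward (re-noising) step
    return schedule


schedule = resampling_schedule()
print(schedule[0], schedule[-1], len(schedule))  # 249 0 4570
```

<p>Under these assumptions the sampler visits 4,570 time indices for the settings listed above, compared with 250 for a single plain pass, which is consistent with the long generation times reported for this method.</p>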
<h3 id="models">Models</h3>
<p>The OCSR models were pretrained on PubChem (digital images) and then fine-tuned on the OCSAug dataset.</p>
<ul>
<li><strong>MolScribe</strong>: Swin Transformer encoder, Transformer decoder. Fine-tuned (all layers) for 30 epochs, batch size 16-128, LR 2e-5.</li>
<li><strong>I2S (DECIMER 1.0)</strong>: Inception V3 encoder (frozen), FC/Decoder fine-tuned. 25 epochs, batch size 64, LR 1e-5.</li>
<li><strong>MolNexTR</strong>: Dual-stream encoder (Swin + CNN). Fine-tuned (all layers) for 30 epochs, batch size 16-64, LR 2e-5.</li>
<li><strong>MPOCSR</strong>: MPViT backbone. Fine-tuned (all layers) for 25 epochs, batch size 16-32, LR 4e-5.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>
<p><strong>Metric</strong>: Improvement Ratio (IR) of Tanimoto Similarity (TS), defined as the ratio of the fine-tuned model's score to the non-fine-tuned model's score:</p>
<p>$$
\text{IR} = \frac{\text{TS}_{\text{finetuned}}}{\text{TS}_{\text{non-finetuned}}}
$$</p>
</li>
<li>
<p><strong>Validation</strong>: Cross-validation on the split DECIMER dataset.</p>
</li>
</ul>
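<p>A worked instance of the Improvement Ratio, using hypothetical scores rather than values from the paper. Whether TS is averaged over the test set before taking the ratio is an assumption here.</p>

```python
def improvement_ratio(ts_finetuned, ts_non_finetuned):
    """Ratio of fine-tuned to non-fine-tuned mean Tanimoto similarity."""
    if ts_non_finetuned <= 0:
        raise ValueError("baseline Tanimoto similarity must be positive")
    return ts_finetuned / ts_non_finetuned


# Hypothetical mean Tanimoto similarities, for illustration only.
print(improvement_ratio(0.50, 0.25))  # 2.0
```

<p>An IR of 2.0 means the fine-tuned model doubles the baseline similarity; values above 1 indicate that fine-tuning helped.</p>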
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: NVIDIA GeForce RTX 4090.</li>
<li><strong>Training Time</strong>: DDPM training took ~6 days.</li>
<li><strong>Generation Time</strong>: Generating 2,600 augmented images took ~70 hours.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, J. H., &amp; Choi, J. (2025). OCSAug: diffusion-based optical chemical structure data augmentation for improved hand-drawn chemical structure image recognition. <em>The Journal of Supercomputing</em>, 81, 926.</p>
<p><strong>Publication</strong>: The Journal of Supercomputing 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/jjjabcd/OCSAug">Official Repository</a></li>
<li><a href="https://zenodo.org/records/6456306">DECIMER Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimOCSAugDiffusionbasedOptical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSAug: Diffusion-Based Optical Chemical Structure Data Augmentation for Improved Hand-Drawn Chemical Structure Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSAug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kim, Jin Hyuk and Choi, Jonghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Supercomputing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{81}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{926}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11227-025-07406-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Platinum Adatom Diffusion on Pt(100): LAMMPS Simulation</title><link>https://hunterheidenreich.com/videos/pt-adatom-diffusion/</link><pubDate>Wed, 27 Sep 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/videos/pt-adatom-diffusion/</guid><description>LAMMPS molecular dynamics simulation of platinum adatom diffusion on a Pt(100) surface, showing atomic mobility mechanisms.</description><content:encoded><![CDATA[<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/1hhf5cQh56w?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>Details on the simulation can be found in the <a href="/posts/adatom-cu-diffusion/">LAMMPS Tutorial: Copper and Platinum Adatom Diffusion</a> post and the <a href="/projects/lammps-adatom-diffusion/">Automated Adatom Diffusion Workflow</a> project page.</p>
]]></content:encoded></item><item><title>Copper Adatom Diffusion on Cu(100): LAMMPS Simulation</title><link>https://hunterheidenreich.com/videos/cu-adatom-diffusion/</link><pubDate>Wed, 27 Sep 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/videos/cu-adatom-diffusion/</guid><description>LAMMPS molecular dynamics simulation of copper adatom diffusion on a Cu(100) surface, showing atomic mobility mechanisms.</description><content:encoded><![CDATA[<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/nIdbNqEEPys?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>Details on the simulation can be found in the <a href="/posts/adatom-cu-diffusion/">Cu Adatom Diffusion on Cu(100)</a> post and the <a href="/projects/lammps-adatom-diffusion/">Automated Adatom Diffusion Workflow</a> project page.</p>
]]></content:encoded></item><item><title>Automated Adatom Diffusion Workflow</title><link>https://hunterheidenreich.com/projects/lammps-adatom-diffusion/</link><pubDate>Thu, 21 Sep 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/lammps-adatom-diffusion/</guid><description>Python-wrapped reference implementation for surface diffusion simulations using LAMMPS and EAM potentials, with automated analysis pipelines.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This project provides a complete &ldquo;input-to-analysis&rdquo; workflow for simulating adatom diffusion on FCC metal surfaces. It demonstrates how to set up surface diffusion simulations in LAMMPS, manage EAM potentials, and automatically parse trajectory data into publication-ready visualizations using Python.</p>
<p>The workflow covers two material systems, copper (Cu) and platinum (Pt), providing comparative datasets that highlight how atomic mass and bonding strength affect surface dynamics.</p>
<h2 id="features">Features</h2>
<h3 id="simulation-architecture">Simulation Architecture</h3>
<p>The project separates simulation logic from analysis code:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Directory</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong><code>/adatom_cu</code></strong></td>
          <td style="text-align: left">Copper adatom diffusion on Cu(100)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong><code>/adatom_pt</code></strong></td>
          <td style="text-align: left">Platinum adatom diffusion on Pt(100)</td>
      </tr>
  </tbody>
</table>
<p>Each directory contains:</p>
<ul>
<li><strong>LAMMPS input scripts</strong> (<code>.in</code> files) defining the physics</li>
<li><strong>EAM potential files</strong> for accurate metallic bonding</li>
<li><strong>Python analysis scripts</strong> for trajectory and energy parsing</li>
</ul>
<h3 id="key-features">Key Features</h3>
<ul>
<li><strong>EAM Potentials</strong>: Uses Embedded Atom Method (EAM) alloy potentials to model metallic bonding and surface energies more accurately than simple pair potentials such as Lennard-Jones</li>
<li><strong>Automated Analysis</strong>: Python pipeline (<code>plot_energy.py</code>, <code>plot_xy.py</code>) that parses raw thermodynamic logs and trajectory dumps to generate &ldquo;health check&rdquo; dashboards</li>
<li><strong>Workflow Orchestration</strong>: Demonstrates the &ldquo;Input → Simulation → Analysis&rdquo; loop, automating the transition from raw <code>.lammpstrj</code> files to publication-ready plots</li>
<li><strong>Kokkos Support</strong>: Includes high-performance execution commands for GPU/multi-threaded runs</li>
</ul>
<h3 id="simulation-parameters">Simulation Parameters</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Ensemble</strong></td>
          <td style="text-align: left">NVT → NVE</td>
          <td style="text-align: left">Equilibration followed by energy conservation checks</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Potential</strong></td>
          <td style="text-align: left">EAM/alloy</td>
          <td style="text-align: left">Accurate metallic bonding for surface dynamics</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Minimization</strong></td>
          <td style="text-align: left">CG (1.0e-4)</td>
          <td style="text-align: left">Remove steric overlaps before dynamics</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Output</strong></td>
          <td style="text-align: left">5 fs resolution</td>
          <td style="text-align: left">High-fidelity trajectory capture</td>
      </tr>
  </tbody>
</table>
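<p>The energy conservation check in the NVE stage can be quantified as a relative drift of the total energy. The following is a minimal sketch with hypothetical energy values; the function name and the choice of drift measure are assumptions, not taken from the repository.</p>

```python
def relative_energy_drift(total_energies):
    """Largest deviation from the initial total energy, relative to its magnitude."""
    if not total_energies:
        raise ValueError("need at least one energy sample")
    e0 = total_energies[0]
    if e0 == 0:
        raise ValueError("initial energy is zero; relative drift is undefined")
    return max(abs(e - e0) for e in total_energies) / abs(e0)


# Hypothetical NVE total energies in eV, for illustration only.
nve_energies = [-1000.0, -1000.25, -999.5, -1000.5]
print(relative_energy_drift(nve_energies))  # 0.0005
```

<p>A drift that keeps growing over the NVE segment usually points to a timestep that is too large or to insufficient equilibration.</p>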
<h2 id="usage">Usage</h2>
<p>The repository includes LAMMPS input scripts and Python analysis scripts. Run the LAMMPS scripts to generate trajectory data, then use the Python scripts to visualize the results.</p>
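<p>As an illustration of the analysis step, the sketch below extracts the in-plane position of one atom from LAMMPS text-dump content. It is not the repository's <code>plot_xy.py</code>; the function name, the adatom ID, and the assumption that the dump writes <code>id</code>, <code>x</code>, and <code>y</code> columns are all hypothetical.</p>

```python
def adatom_xy(dump_text, atom_id):
    """Return [(timestep, x, y), ...] for one atom from LAMMPS text-dump content."""
    lines = iter(dump_text.splitlines())
    track = []
    timestep = None
    n_atoms = 0
    for line in lines:
        if line.startswith("ITEM: TIMESTEP"):
            timestep = int(next(lines))
        elif line.startswith("ITEM: NUMBER OF ATOMS"):
            n_atoms = int(next(lines))
        elif line.startswith("ITEM: ATOMS"):
            columns = line.split()[2:]  # column names, e.g. id type x y z
            i_id, i_x, i_y = (columns.index(c) for c in ("id", "x", "y"))
            for _ in range(n_atoms):
                fields = next(lines).split()
                if int(fields[i_id]) == atom_id:
                    track.append((timestep, float(fields[i_x]), float(fields[i_y])))
    return track


# Two tiny hand-written frames standing in for a real .lammpstrj file.
sample = """ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
2
ITEM: BOX BOUNDS pp pp ff
0.0 10.0
0.0 10.0
0.0 20.0
ITEM: ATOMS id type x y z
1 1 0.0 0.0 0.0
2 2 5.0 5.0 8.0
ITEM: TIMESTEP
5
ITEM: NUMBER OF ATOMS
2
ITEM: BOX BOUNDS pp pp ff
0.0 10.0
0.0 10.0
0.0 20.0
ITEM: ATOMS id type x y z
1 1 0.0 0.0 0.0
2 2 5.5 4.5 8.0
"""
print(adatom_xy(sample, atom_id=2))  # [(0, 5.0, 5.0), (5, 5.5, 4.5)]
```

<p>For a real run, the file contents would be read from disk and the resulting track passed to a plotting library of choice.</p>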
<h2 id="results">Results</h2>
<p>This workflow is documented in detail in companion blog posts:</p>
<ul>
<li><a href="/posts/adatom-cu-diffusion/">LAMMPS Tutorial: Copper and Platinum Adatom Diffusion</a> - Complete setup walkthrough with line-by-line script explanation and comparison of how heavier atoms behave differently on surfaces</li>
</ul>
]]></content:encoded></item></channel></rss>