<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Method Papers: New Algorithms, Architectures, and Mechanisms on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/paper-types/method/</link><description>Recent content in Method Papers: New Algorithms, Architectures, and Mechanisms on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 12 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/paper-types/method/index.xml" rel="self" type="application/rss+xml"/><item><title>MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/</link><pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/</guid><description>MB-nrg decomposes polyalanine into n-mer building blocks fit to DLPNO-CCSD(T) references, reaching coupled-cluster accuracy for gas-phase peptide dynamics.</description><content:encoded><![CDATA[<h2 id="a-modular-mb-nrg-method-for-biomolecular-potentials">A Modular MB-nrg Method for Biomolecular Potentials</h2>
<p>This is a <strong>Method</strong> paper. Zhou and colleagues extend the MB-nrg (many-body energy) formalism to covalently bonded biomolecules and build the first coupled-cluster-accurate potential energy function (PEF) for polyalanine in the gas phase. The contribution has three parts: a generalization of the MB-nrg decomposition from whole-molecule 1-mers to functional-group &ldquo;natural building blocks,&rdquo; a DLPNO-CCSD(T)/aug-cc-pVTZ training protocol driven by parallel-bias metadynamics sampling, and a demonstration that the resulting PEF reproduces alanine dipeptide energetics and AceAla$_9$Nme secondary-structure dynamics more faithfully than the Amber ff14SB and ff19SB force fields.</p>
<h2 id="why-empirical-force-fields-fall-short-for-protein-dynamics">Why Empirical Force Fields Fall Short for Protein Dynamics</h2>
<p>Protein dynamics span femtosecond vibrations to millisecond conformational changes, and capturing them at atomic resolution is central to understanding catalysis, allostery, and ligand binding. Classical force fields such as CHARMM, OPLS, and Amber approximate the potential energy surface with pairwise-additive analytical terms. This functional form struggles with the many-body interactions that shape disordered regions of proteins, including exchange-repulsion, charge transfer, charge penetration, and cooperative hydrogen bonding. Polarizable force fields add induced dipoles but remain empirically parameterized and fail to capture short-range many-body effects from electron-density overlap.</p>
<p>Quantum-mechanical methods avoid this, but <a href="https://en.wikipedia.org/wiki/Coupled_cluster">coupled cluster theory</a> scales as $\mathcal{O}(N^7)$ in the number of electrons and even DFT remains $\mathcal{O}(N^3)$ to $\mathcal{O}(N^4)$, ruling out direct ab initio molecular dynamics for biomolecules. Fragmentation methods like molecular fractionation with conjugate caps (MFCC) mitigate the cost, but they truncate the many-body expansion at two bodies and miss long-range hydrogen bonding. <a href="/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/">Machine-learned force fields (MLFFs)</a> reach near-QM accuracy at lower cost, yet they typically train on DFT data (inheriting delocalization errors and poor dispersion), struggle with interpretability, and extrapolate unreliably. Existing permutationally invariant polynomial (PIP) approaches scale factorially in the number of atoms, capping direct applicability at roughly ten to fifteen atoms per fragment.</p>
<p>MB-nrg PEFs based on the many-body expansion and PIPs have successfully modeled water, halides in water, carbon dioxide, methane, ammonia, dinitrogen pentoxide, and N-methylacetamide. Extending them to covalently bonded biomolecules requires rethinking what counts as a &ldquo;body.&rdquo;</p>
<h2 id="building-polyalanine-from-functional-group-n-mers">Building Polyalanine from Functional-Group n-mers</h2>
<p>The MB-nrg formalism starts from the many-body expansion of the total energy,</p>
<p>$$
E_N(1, \dots, N) = \sum_{i=1}^{N} \varepsilon^{1\mathrm{B}}(i) + \sum_{i&lt;j}^{N} \varepsilon^{2\mathrm{B}}(i,j) + \sum_{i&lt;j&lt;k}^{N} \varepsilon^{3\mathrm{B}}(i,j,k) + \dots + \varepsilon^{N\mathrm{B}}(1, \dots, N)
$$</p>
<p>where each $n$-body contribution is defined recursively as the $n$-mer energy minus all lower-order terms. The full PEF combines physics-based and data-driven components,</p>
<p>$$
V_{\mathrm{MB\text{-}nrg}} = V_{\mathrm{ML}} + V_{\mathrm{phys}}
$$</p>
<p>with $V_{\mathrm{ML}} = V_{\mathrm{ML}}^{1\mathrm{B}} + V_{\mathrm{ML}}^{2\mathrm{B}} + V_{\mathrm{ML}}^{3\mathrm{B}}$ capturing short-range quantum-mechanical interactions, and $V_{\mathrm{phys}} = V_{\mathrm{elec}} + V_{\mathrm{disp}} + V_{\mathrm{rep}}$ supplying electrostatics, dispersion, and repulsion. Dispersion follows a Tang-Toennies damped $C_6/R^6$ form with XDM-derived coefficients; electrostatics uses a Thole-modified self-consistent polarization model inherited from MB-pol; the repulsion term is a Lennard-Jones $R^{-12}$ contribution borrowed from Amber ff14SB, activated only for non-bonded atom pairs not covered by a PIP.</p>
<p>Each data-driven $n$-body term is expressed as</p>
<p>$$
V_{\mathrm{ML}}^{n\mathrm{B}} = \sum_{\mathrm{M}_1 &lt; \dots &lt; \mathrm{M}_n}^{N} s^{n\mathrm{B}}(\mathrm{M}_1, \dots, \mathrm{M}_n) \, V_{\mathrm{PIP}}^{n\mathrm{B}}(\mathrm{M}_1, \dots, \mathrm{M}_n)
$$</p>
<p>where $V_{\mathrm{PIP}}^{n\mathrm{B}}$ is a permutationally invariant polynomial in Morse-like variables $\xi_{ij} = \exp(-k_{\tau(ij)} R_{ij})$ and $s^{n\mathrm{B}}$ is a switching function.</p>
<p>The key extension in this paper, building on earlier work on linear alkanes, is to treat functional groups (not whole molecules) as 1-mers. An Ace-capped, Nme-capped polyalanine chain decomposes into three distinct 1-mer types (-CH-, CH$_3$-, -CONH-), five distinct 2-mer types, and six distinct 3-mer types, for 14 unique PIPs that cover every $n$-mer appearing in any AceAla$_n$Nme chain. Cleaving covalent bonds between 1-mers would produce radicals, so the authors cap dangling valences with &ldquo;ghost&rdquo; hydrogen atoms at fixed C-H (1.14 Å) and N-H (1.09 Å) distances. Each $n$-mer energy is then referenced to its own optimized H-capped structure,</p>
<p>$$
E_n(1, \dots, n) = E_n^{\mathrm{H\text{-}capped}}(1, \dots, n) - E_n^{\mathrm{H\text{-}capped,opt}}(1, \dots, n).
$$</p>
<p>In the current implementation, only covalently bonded $n$-mers receive PIPs, the 2-body contribution from a dimer with one intervening 1-mer is folded into the corresponding 3-body term, and non-bonded 1-mers interact through the Lennard-Jones repulsion alone. Crucially, no whole-chain polyalanine data enters any stage of training: every PIP is parameterized on isolated $n$-mer configurations, and the total energy is reconstructed through the many-body expansion.</p>
<h2 id="training-on-dlpno-ccsdt-with-metadynamics-sampling">Training on DLPNO-CCSD(T) with Metadynamics Sampling</h2>
<p>Training sets are generated for each of the 14 $n$-mer types using <a href="https://en.wikipedia.org/wiki/Metadynamics">parallel-bias metadynamics (PBMetaD)</a> with partitioned families, biasing heavy-atom bonds, angles, and dihedrals across 300 K, 500 K, and 700 K in LAMMPS interfaced with PLUMED and modified OPLS/CM1A and Amber ff14SB force fields. For each $n$-mer, 200,000 candidate configurations are sampled, then reduced to roughly 10,000-20,000 training configurations (and about 1,000 test configurations) through Mini-batch K-means clustering on chemically equivalent pairwise distances. Reference energies are computed at the DLPNO-CCSD(T)/aug-cc-pVTZ level in ORCA.</p>
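<p>The pool-reduction step can be pictured with a minimal mini-batch k-means over per-configuration descriptor vectors, followed by keeping the configuration nearest each centroid. This is an illustrative sketch, not the authors' pipeline; the initialization and parameters are assumptions:</p>

```python
import numpy as np

def minibatch_kmeans_select(X, n_clusters, batch=256, iters=50, seed=0):
    """Reduce a configuration pool to one representative per cluster:
    minimal mini-batch k-means on descriptor vectors (e.g. chemically
    equivalent pairwise distances), then pick the nearest pool member
    to each centroid."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # greedy farthest-point initialization keeps initial centers spread out
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    counts = np.zeros(n_clusters)
    for _ in range(iters):
        B = X[rng.choice(len(X), min(batch, len(X)), replace=False)]
        assign = np.argmin(((B[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for x, c in zip(B, assign):
            counts[c] += 1
            centers[c] += (x - centers[c]) / counts[c]  # per-center learning rate
    # representative configuration = nearest pool member to each centroid
    return np.unique(np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=0))
```

<p>In the paper's setting the 200,000-configuration pools are reduced this way to the 10,000-20,000 centroids' nearest neighbors that form the training sets.</p>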
<p>Each PIP minimizes a weighted, ridge-regularized sum of squared errors,</p>
<p>$$
\chi^2 = \sum_{k \in \mathcal{S}} w_k \left[ V^{n\mathrm{B}}(k) - \varepsilon^{n\mathrm{B}}(k) \right]^2 + \Gamma^2 \sum_l c_l^2
$$</p>
<p>with $\Gamma = 0.0005$ throughout and low-energy bias weights</p>
<p>$$
w_k = \left( \frac{\delta E}{\varepsilon^{n\mathrm{B}}(k) - \varepsilon^{n\mathrm{B}}_{\min} + \delta E} \right)^2.
$$</p>
<p>MB-Fit handles the fit, combining simplex optimization for non-linear parameters $k_{\tau(ij)}$ with ridge regression for the linear coefficients $c_l$.</p>
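<p>With the non-linear decay constants $k_{\tau(ij)}$ held fixed by the simplex outer loop, the inner linear solve is a weighted ridge regression. A minimal sketch (illustrative; the $\delta E$ value is a placeholder, since Paper I does not state it in this summary and Paper II reports 40 kcal/mol for the water dimers):</p>

```python
import numpy as np

def fit_pip_coeffs(A, energies, gamma=0.0005, delta_e=40.0):
    """Solve the weighted, ridge-regularized least squares for the linear
    PIP coefficients c_l. Rows of A hold the symmetrized monomials
    evaluated at each training configuration."""
    A = np.asarray(A, dtype=float)
    eps = np.asarray(energies, dtype=float)
    w = (delta_e / (eps - eps.min() + delta_e)) ** 2  # low-energy bias weights
    sqw = np.sqrt(w)
    Aw, bw = A * sqw[:, None], eps * sqw
    # normal equations with Tikhonov term Gamma^2 * I
    return np.linalg.solve(Aw.T @ Aw + gamma**2 * np.eye(A.shape[1]), Aw.T @ bw)
```

<p>The weights down-weight high-energy configurations smoothly rather than discarding them, so repulsive walls still constrain the fit.</p>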
<p>Table 1 in the paper reports, for each of the 14 PIPs, the polynomial degree (5 for the smaller -CH- and CH$_3$- 1-mers, 3 for the larger -CONH- 1-mer and for all 2-mers and 3-mers), the number of symmetrized monomials (ranging from 635 for the -CH- and CH$_3$- 1-mers to 2871 for the -CONH-CH-CONH- 3-mer), the training-set size, and RMSDs for the train and test splits. All training RMSDs stay below 0.4 kcal/mol and all test RMSDs below 0.5 kcal/mol, with the smallest errors for the -CH- and CH$_3$- 1-mers (0.05 kcal/mol train, 0.14 kcal/mol test) and the largest test RMSD (0.47 kcal/mol) for the -CONH-CH- 2-mer.</p>
<p>MD validations run in LAMMPS interfaced with MBX and PLUMED. For alanine dipeptide metadynamics, bias potentials on the backbone $\varphi$ and $\psi$ angles are deposited every 500 steps with a 1.0 kJ/mol height and 11.46° width over 10 ns trajectories in the NVT ensemble, using the velocity-Verlet integrator with a 0.5 fs time step. Analogous MetaD runs with Amber ff14SB and ff19SB are performed in Amber23. The longer AceAla$_9$Nme trajectories start from fully extended structures and run in a 100 Å × 100 Å × 100 Å gas-phase box.</p>
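<p>As a rough illustration, the stated MetaD settings map onto a PLUMED input along these lines. This is a hypothetical fragment, not the authors' actual file: the atom indices are placeholders for the real topology, and the 11.46° Gaussian width becomes 0.2 rad in PLUMED's radian units for torsions.</p>

```
# hypothetical PLUMED input mirroring the stated MetaD parameters
phi: TORSION ATOMS=5,7,9,15
psi: TORSION ATOMS=7,9,15,17
METAD ARG=phi,psi PACE=500 HEIGHT=1.0 SIGMA=0.2,0.2 FILE=HILLS
PRINT ARG=phi,psi STRIDE=500 FILE=COLVAR
```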
<h2 id="ccsdt-energy-landscapes-free-energy-surfaces-and-helix-dynamics">CCSD(T) Energy Landscapes, Free-Energy Surfaces, and Helix Dynamics</h2>
<p><strong>Alanine dipeptide 2D PES.</strong> Alanine dipeptide geometries are optimized on a <a href="https://en.wikipedia.org/wiki/Ramachandran_plot">Ramachandran</a> grid with 10° spacing at the RI-MP2/def2-TZVP level and then evaluated at DLPNO-CCSD(T)/aug-cc-pVTZ. Despite never seeing whole alanine dipeptide in training, MB-nrg closely matches the reference locations and relative energies of four minima ($m_1$ to $m_4$), three maxima ($M_1$ to $M_3$), and one saddle point ($X$). Amber ff14SB and ff19SB capture the minima reasonably but badly overshoot the barriers: at $M_1$, MB-nrg undershoots the reference by only 2.41 kcal/mol, while ff14SB and ff19SB overshoot it by 7.50 and 7.83 kcal/mol. The authors also note that ff19SB incorrectly orders the secondary minima by predicting $m_3$ lower than $m_2$.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>RMSD overall (kcal/mol)</th>
          <th>RMSD $\leq 10$ kcal/mol</th>
          <th>RMSD $&gt; 10$ kcal/mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MB-nrg</td>
          <td>1.27</td>
          <td>1.18</td>
          <td>1.59</td>
      </tr>
      <tr>
          <td>Amber ff14SB</td>
          <td>6.33</td>
          <td>5.72</td>
          <td>8.44</td>
      </tr>
      <tr>
          <td>Amber ff19SB</td>
          <td>5.23</td>
          <td>4.79</td>
          <td>6.81</td>
      </tr>
  </tbody>
</table>
<p>The authors attribute MB-nrg&rsquo;s residual high-energy error to terminal methyl groups approaching the backbone in conformations where non-bonded 1-mer interactions are modeled by the simple LJ repulsion rather than an explicit PIP.</p>
<p><strong>Harmonic vibrations.</strong> Normal modes for the $m_1$ and $m_4$ alanine dipeptide conformers, computed by diagonalizing the Hessian, match RI-MP2/def2-TZVP references with mean deviations of 17.41 cm$^{-1}$ and 21.07 cm$^{-1}$ across all 60 modes. The authors acknowledge that some of this discrepancy reflects differences in theoretical levels (MB-nrg is trained to CCSD(T)/aug-cc-pVTZ, while the reference normal modes are computed at RI-MP2/def2-TZVP).</p>
<p><strong>Free-energy surfaces.</strong> Well-tempered metadynamics at 300 K produces 2D free-energy surfaces over $(\varphi, \psi)$. MB-nrg yields a smoother FES whose extrema line up with the DLPNO-CCSD(T) reference PES. Amber ff14SB and ff19SB remain reasonable near the low-energy $m_1$ and $m_2$ minima but systematically overestimate the barriers near $M_1$, $M_2$, and $M_3$, which the authors argue artificially confines the dipeptide and suppresses conformational transitions.</p>
<p><strong>Secondary structure in AceAla$_9$Nme.</strong> In 600 ps NVT MD starting from a fully extended structure, the <a href="https://en.wikipedia.org/wiki/STRIDE_(algorithm)">STRIDE algorithm</a> tracks residue-level secondary structures. Amber ff14SB and ff19SB collapse into $\alpha$-helices at roughly 40 ps and 80 ps, respectively, with ff19SB remaining especially rigid. MB-nrg takes about 100 ps before helices begin to form and then exhibits continuous oscillations between $3_{10}$- and $\alpha$-helical conformations. Ramachandran plots over the nine alanine residues show MB-nrg exploring the &ldquo;bridge&rdquo; region ($\varphi &lt; 0°$, $-20° \leq \psi \leq 20°$) associated with $3_{10}$-helices and sampling the left-handed $\alpha_L$ region that Amber rarely visits. The authors tie this flexibility to experimental observations of alanine-rich peptides in the gas phase and to similar predictions from GEMS and MACE-OFF.</p>
<h2 id="transferability-without-whole-chain-training-data">Transferability Without Whole-Chain Training Data</h2>
<p>The paper demonstrates that a modular, bottom-up PEF built from functional-group $n$-mers can reach CCSD(T) accuracy for polyalanine in the gas phase without ever training on whole-chain data. Truncating explicit data-driven terms at the 3-body level appears to balance cost and fidelity, with long-range effects handled by many-body polarization in $V_{\mathrm{elec}}$ and by Amber-derived repulsion between distant 1-mers. The 2D PES, harmonic frequencies, free-energy surface, and secondary-structure dynamics each validate a different facet of the model.</p>
<p>The authors are explicit about limitations. The current PEF applies only to gas-phase polyalanine; solvent effects and other amino acids remain open. The Lennard-Jones repulsion for non-bonded 1-mers is a placeholder for eventual 2-body PIPs that should capture short-range interactions during folding. Long-range hydrogen bonding in compact secondary structures (π-helices, $3_{10}$-helices, $\alpha$-helices) may produce non-negligible higher-order many-body contributions that the current 3-body truncation omits. The 2-body contribution from a dimer with one intervening monomer is currently folded into the 3-body term because of steric conflicts between capping hydrogens, and a systematic fix is flagged for future work. The authors position this paper as the first in a series (the &ldquo;I.&rdquo; in the title refers to &ldquo;Polyalanine in the Gas Phase&rdquo;) that will extend MB-nrg to broader biomolecular systems under physiological conditions. The follow-up, <a href="/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-water/">MB-nrg in Solution: Polyalanine in Water with CCSD(T) PEFs</a>, adds explicit 1-mer/water 2-body PIPs and benchmarks alanine dipeptide solvation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Per $n$-mer pools from PBMetaD in LAMMPS/PLUMED</td>
          <td>200,000 configurations each, reduced to ~10-20k via Mini-batch K-means</td>
          <td>OPLS/CM1A and Amber ff14SB sampled at 300 K, 500 K, 700 K</td>
      </tr>
      <tr>
          <td>Training labels</td>
          <td>DLPNO-CCSD(T)/aug-cc-pVTZ in ORCA</td>
          <td>14 unique $n$-mer types</td>
          <td>Domain-based local pair natural orbital approximation to canonical CCSD(T)</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>Held-out $n$-mer configurations</td>
          <td>~1,000 per $n$-mer</td>
          <td>Same clustering protocol</td>
      </tr>
      <tr>
          <td>Alanine dipeptide benchmark</td>
          <td>Ramachandran grid at 10° spacing, RI-MP2/def2-TZVP geometries</td>
          <td>1,296 grid points (approximate)</td>
          <td>Single-point energies at DLPNO-CCSD(T)/aug-cc-pVTZ, ff14SB, ff19SB, MB-nrg</td>
      </tr>
      <tr>
          <td>AceAla$_9$Nme dynamics</td>
          <td>600 ps NVT MD from fully extended start</td>
          <td>Single trajectory per model</td>
          <td>STRIDE for secondary-structure assignment</td>
      </tr>
  </tbody>
</table>
<p>Per the Data Availability statement, &ldquo;any data generated and analyzed in this study are available from the authors upon request.&rdquo; No public release is announced in the text.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Many-body expansion of the energy with 1-, 2-, and 3-body data-driven terms.</li>
<li>Permutationally invariant polynomials in Morse-exponential variables $\xi_{ij} = \exp(-k_{\tau(ij)} R_{ij})$, symmetrized over chemically equivalent atoms.</li>
<li>&ldquo;Ghost&rdquo; H-capping at cleaved covalent bonds, with fixed C-H (1.14 Å) and N-H (1.09 Å) bond lengths and per-$n$-mer optimized-structure referencing.</li>
<li>Non-linear parameters fit by simplex minimization, linear coefficients by ridge regression with $\Gamma = 0.0005$.</li>
<li>Low-energy weighting in the loss through $w_k = (\delta E / (\varepsilon^{n\mathrm{B}}(k) - \varepsilon^{n\mathrm{B}}_{\min} + \delta E))^2$.</li>
<li>Tang-Toennies damped dispersion with XDM-derived $C_6$ and damping parameters, Thole-modified many-body polarization, and LJ repulsion borrowed from Amber ff14SB.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>14 PIPs total covering three 1-mer types, five 2-mer types, and six 3-mer types. Polynomial degree is 5 for the -CH- and CH$_3$- 1-mers, and 3 for the -CONH- 1-mer together with all 2-mers and 3-mers. Term counts range from 635 (-CH-, CH$_3$-) to 2871 (-CONH-CH-CONH-).</li>
<li>MB-nrg PEF implemented in the MBX code and exercised through LAMMPS and PLUMED.</li>
<li>Training set sizes per $n$-mer range from roughly 12,000 to 47,000 configurations (the -CONH- 1-mer dataset is the largest at 47,438).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MB-nrg</th>
          <th>Amber ff14SB</th>
          <th>Amber ff19SB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$n$-mer training RMSD</td>
          <td>$\leq 0.35$ kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>$n$-mer test RMSD</td>
          <td>$\leq 0.47$ kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>Alanine dipeptide 2D PES RMSD (overall)</td>
          <td>1.27 kcal/mol</td>
          <td>6.33 kcal/mol</td>
          <td>5.23 kcal/mol</td>
      </tr>
      <tr>
          <td>Same, $\leq 10$ kcal/mol region</td>
          <td>1.18 kcal/mol</td>
          <td>5.72 kcal/mol</td>
          <td>4.79 kcal/mol</td>
      </tr>
      <tr>
          <td>Same, $&gt; 10$ kcal/mol region</td>
          <td>1.59 kcal/mol</td>
          <td>8.44 kcal/mol</td>
          <td>6.81 kcal/mol</td>
      </tr>
      <tr>
          <td>Alanine dipeptide $m_1$ normal-mode mean deviation vs RI-MP2/def2-TZVP</td>
          <td>17.41 cm$^{-1}$</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>Alanine dipeptide $m_4$ normal-mode mean deviation vs RI-MP2/def2-TZVP</td>
          <td>21.07 cm$^{-1}$</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>AceAla$_9$Nme helix-formation onset (from extended start)</td>
          <td>~100 ps ($\alpha$/$3_{10}$ mix)</td>
          <td>~40 ps ($\alpha$)</td>
          <td>~80 ps ($\alpha$)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Computational resources came from the Air Force Office of Scientific Research (FA9550-20-1-0351), NSF award 2311260, the DoD High Performance Computing Modernization Program, the San Diego Supercomputer Center via ACCESS allocation CHE240114, and NERSC (contract DE-AC02-05CH11231, award BES-ERCAP0030920). Specific wall-clock and node-hour figures are not reported in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhou, R., Bull-Vulpe, E. F., Pan, Y., &amp; Paesani, F. (2025). Data-Driven Many-Body Simulations of Biomolecules with CCSD(T) Accuracy: I. Polyalanine in the Gas Phase. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv-2025-b05k5">https://doi.org/10.26434/chemrxiv-2025-b05k5</a></p>
<p><strong>Publication</strong>: ChemRxiv preprint, 25 March 2025.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/paesanilab/MBX">MBX software (Paesani group)</a></li>
<li><a href="https://github.com/paesanilab/MB-Fit">MB-Fit (training pipeline)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{zhou2025data,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data-Driven Many-Body Simulations of Biomolecules with CCSD(T) Accuracy: I. Polyalanine in the Gas Phase}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhou, Ruihan and Bull-Vulpe, Ethan F. and Pan, Yuanhui and Paesani, Francesco}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv-2025-b05k5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">howpublished</span>=<span style="color:#e6db74">{\url{https://doi.org/10.26434/chemrxiv-2025-b05k5}}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MB-nrg in Solution: Polyalanine in Water with CCSD(T) PEFs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-water/</link><pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-water/</guid><description>Zhou and Paesani extend MB-nrg to peptide-water interactions, training 1-mer-water 2-body PIPs on DLPNO-CCSD(T) and benchmarking alanine dipeptide solvation.</description><content:encoded><![CDATA[<h2 id="extending-mb-nrg-from-gas-phase-polyalanine-to-aqueous-solution">Extending MB-nrg from Gas-Phase Polyalanine to Aqueous Solution</h2>
<p>This is a <strong>Method</strong> paper, the second installment in Zhou and Paesani&rsquo;s MB-nrg-for-biomolecules series. Paper I (covered in <a href="/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/">MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine</a>) decomposed gas-phase polyalanine into functional-group $n$-mers and fit permutationally invariant polynomials (PIPs) to DLPNO-CCSD(T)/aug-cc-pVTZ reference data. This sequel adds the missing piece: explicit, machine-learned 2-body interactions between every polyalanine functional-group 1-mer and a water molecule, trained on the same <a href="https://en.wikipedia.org/wiki/Coupled_cluster">coupled-cluster</a> reference. The resulting PEF couples the gas-phase intramolecular MB-nrg term, the MB-pol water model, and a new MB-nrg ala-water cross term within a single modular many-body decomposition.</p>
<h2 id="why-empirical-force-fields-struggle-with-hydrated-peptides">Why Empirical Force Fields Struggle with Hydrated Peptides</h2>
<p>Biomolecular function in water emerges from a coupling of intramolecular flexibility with solvent-mediated interactions, including hydrogen-bond networks, cooperative polarization, dispersion, and short-range exchange-repulsion. Empirical force fields such as AMBER, CHARMM, and OPLS approximate the multidimensional PES with pairwise-additive analytical terms whose parameters are tuned to experimental observables or low-level quantum data. The authors note that this functional form leads to systematic errors in predicted conformational ensembles for short peptides and <a href="https://en.wikipedia.org/wiki/Intrinsically_disordered_proteins">intrinsically disordered proteins (IDPs)</a>, with reported overpopulation of polyproline II (pPII) basins and antiparallel $\beta$ regions for alanine residues, plus underrepresentation of the transitional $\beta$ basin compared to experiment.</p>
<p>Polarizable force fields recover dielectric and hydration trends through induced dipoles, but still lean on empirical functional forms and miss short-range quantum effects (charge transfer, charge penetration, exchange-repulsion) that arise from electron-density overlap. <a href="/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/">Machine-learned force fields</a> like MACE-OFF, GEMS, and FeNNix-Bio1 have improved bio-organic accuracy, but they still depend critically on the diversity and quality of training data, struggle to decompose energies into physically interpretable components, and most rely on DFT references that inherit delocalization errors and incomplete long-range correlation. Local descriptors common to MLFFs also limit treatment of long-range electrostatics and many-body correlations, both essential for biomolecular solvation.</p>
<p>The MB-nrg formalism, originally developed for water and small molecules and recently extended to alkanes and gas-phase polyalanine, offers an alternative: a rigorous many-body expansion (MBE) of the energy combined with both data-driven $n$-body PIPs and physics-based long-range terms. Paper II asks whether this modular gas-phase scaffold can be cleanly extended to aqueous environments by adding only short-range peptide-water 2-body PIPs.</p>
<h2 id="a-modular-mb-nrg-pef-for-polyalanine-in-water">A Modular MB-nrg PEF for Polyalanine in Water</h2>
<p>The MBE writes the total energy of a system of $N$ 1-mers as</p>
<p>$$
E_N(1, \dots, N) = \sum_{i=1}^{N} \varepsilon^{1\mathrm{B}}(i) + \sum_{i&lt;j}^{N} \varepsilon^{2\mathrm{B}}(i,j) + \sum_{i&lt;j&lt;k}^{N} \varepsilon^{3\mathrm{B}}(i,j,k) + \dots + \varepsilon^{N\mathrm{B}}(1, \dots, N)
$$</p>
<p>with each $n$-body term defined recursively as the $n$-mer energy minus all lower-order contributions. The MBE converges quickly for insulating molecular systems with large electronic band gaps (such as water and peptides), so explicit PIP corrections are typically truncated at $n \leq 4$, with higher-order effects absorbed into many-body polarization.</p>
<p>For polyalanine in water, the total potential is partitioned into three modular blocks:</p>
<p>$$
V_{\mathrm{MB\text{-}nrg}}^{\mathrm{tot}} = V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala}} + V_{\mathrm{MB\text{-}pol}}^{\mathrm{wat}} + V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}}
$$</p>
<p>where $V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala}}$ is the gas-phase intramolecular polyalanine PEF from Paper I, $V_{\mathrm{MB\text{-}pol}}^{\mathrm{wat}}$ is the MB-pol water model, and $V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}}$ is the new peptide-water cross term. The cross term itself follows the MB-nrg recipe of splitting machine-learned and physics-based contributions:</p>
<p>$$
V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}} = V_{\mathrm{ML}} + V_{\mathrm{phys}}
$$</p>
<p>with $V_{\mathrm{ML}} = V_{\mathrm{ML}}^{2\mathrm{B}}$ (only 2-body PIPs in this implementation) and $V_{\mathrm{phys}} = V_{\mathrm{elec}} + V_{\mathrm{disp}}$. The 2-body machine-learned term sums switched PIPs over every (1-mer, water) dimer:</p>
<p>$$
V_{\mathrm{ML}}^{2\mathrm{B}} = \sum_{i=1}^{N} s^{2\mathrm{B}}(\mathrm{M}_i, \mathrm{WAT}) \, V_{\mathrm{PIP}}^{2\mathrm{B}}(\mathrm{M}_i, \mathrm{WAT})
$$</p>
<p>where $\mathrm{M}_i$ is the $i$-th polyalanine functional-group 1-mer (-CH-, CH$_3$-, or -CONH-), WAT is a water molecule, and $s^{2\mathrm{B}}$ is a cosine switching function</p>
<p>$$
s^{2\mathrm{B}}(x) = \begin{cases} 1 &amp; x &lt; 0 \\ \left(1 + \cos(\pi x)\right)/2 &amp; 0 \leq x &lt; 1 \\ 0 &amp; 1 \leq x \end{cases}, \quad x = \frac{R - R_{\mathrm{in}}}{R_{\mathrm{out}} - R_{\mathrm{in}}}
$$</p>
<p>that smoothly attenuates the short-range PIP beyond a defined distance to preserve energy conservation in MD. The physics-based block uses a Thole-modified self-consistent polarization model (inherited from MB-pol) for $V_{\mathrm{elec}}$ and a Tang-Toennies damped dispersion sum</p>
<p>$$
V_{\mathrm{disp}} = -\sum_{\substack{\alpha \in 1\text{-mers} \\ \beta \in \mathrm{water}}} f(\mathrm{b}_{\alpha\beta} R_{\alpha\beta}) \, \frac{C_{6, \alpha\beta}}{R_{\alpha\beta}^{6}}
$$</p>
<p>with $C_{6, \alpha\beta}$ coefficients and atomic polarizabilities derived from the exchange-hole dipole moment (XDM) method, and atomic charges fit to reproduce the permanent multipole moments of each $n$-mer&rsquo;s optimized structure.</p>
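<p>Both short-range attenuation pieces above are simple closed forms. A minimal sketch, with illustrative parameters (the production values live in MBX):</p>

```python
import math

def switch_2b(R, R_in, R_out):
    """Cosine switching function: 1 below R_in, 0 beyond R_out, and a
    smooth (1 + cos(pi * x)) / 2 ramp in between."""
    x = (R - R_in) / (R_out - R_in)
    if x < 0.0:
        return 1.0
    if x < 1.0:
        return 0.5 * (1.0 + math.cos(math.pi * x))
    return 0.0

def tang_toennies_f6(b, R):
    """Tang-Toennies damping factor of order 6:
    f6(bR) = 1 - exp(-bR) * sum_{k=0}^{6} (bR)^k / k!"""
    bR = b * R
    partial = sum(bR**k / math.factorial(k) for k in range(7))
    return 1.0 - math.exp(-bR) * partial

def damped_dispersion_pair(c6, b, R):
    """One damped -C6/R^6 pair term of V_disp."""
    return -tang_toennies_f6(b, R) * c6 / R**6
```

<p>The damping factor goes to 1 at large separations (recovering the bare $-C_6/R^6$ tail) and to 0 as $R \to 0$, which removes the unphysical short-range divergence.</p>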
<p>The authors stress that explicit 3-body and higher peptide-water PIPs are deliberately omitted in this first implementation; their effects are absorbed into the classical polarization term. They flag that strongly hydrogen-bonded or cooperative configurations may benefit from adding higher-body corrections in future work, following the precedent of MB-pol(2023) for water.</p>
<h2 id="training-set-generation-and-dlpno-ccsdt-reference-data">Training Set Generation and DLPNO-CCSD(T) Reference Data</h2>
<p>Training pools for the three 1-mer-water dimers (CH$_3$-H$_2$O, -CH&ndash;H$_2$O, -CONH&ndash;H$_2$O) extend the <a href="https://en.wikipedia.org/wiki/Metadynamics">parallel-bias metadynamics with partitioned families (PBMetaD+PFs)</a> protocol from Paper I. Covalent boundaries are capped with &ldquo;ghost&rdquo; hydrogens at fixed C-H (1.14 Å) and N-H (1.09 Å) distances to preserve closed-shell character; each 2-body energy is referenced to the corresponding optimized capped 1-mer-water geometry to remove constant offsets.</p>
<p>PBMetaD simulations are run in LAMMPS interfaced with PLUMED, using Amber ff14SB for the alanine 1-mers and TIP4P/2005f for water. Collective variables span all heavy-atom bonds, angles, and dihedrals in each dimer. To target distinct interaction regimes, three separate biased runs apply upper and lower walls on the 1-mer/water center-of-mass distance: 0-4 Å (short-range repulsion), 4-7 Å (mid-range attraction), and 7-10 Å (long-range orientation-dependent interactions). Each dimer yields about 600,000 configurations, reduced to roughly 40,000 training and 2,000 test configurations per type by K-means clustering.</p>
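<p>The distance-windowed runs can be pictured as a restrained PLUMED fragment like the following. This is hypothetical, not the authors' input: the atom group indices are placeholders, the wall strength is illustrative, and with PLUMED's default nm units the 4-7 Å window becomes 0.4-0.7 nm.</p>

```
# hypothetical fragment for the 4-7 A mid-range window
m: COM ATOMS=1-6      # capped functional-group 1-mer
w: COM ATOMS=7-9      # water molecule
d: DISTANCE ATOMS=m,w
LOWER_WALLS ARG=d AT=0.4 KAPPA=1000.0
UPPER_WALLS ARG=d AT=0.7 KAPPA=1000.0
```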
<p>Reference 2-body energies are computed at the DLPNO-CCSD(T)/aug-cc-pVTZ level in ORCA, using the aug-cc-pVTZ/C auxiliary basis, the RIJCOSX approximation, TightSCF, TightPNO, and the PModel pair-selection option. The counterpoise method corrects every 2-body energy for <a href="https://en.wikipedia.org/wiki/Basis_set_superposition_error">basis set superposition error</a>.</p>
<p>Each PIP minimizes a weighted, ridge-regularized least-squares objective:</p>
<p>$$
\chi^2 = \sum_{k \in \mathcal{S}} w_k \left[ V^{2\mathrm{B}}(k) - \varepsilon^{2\mathrm{B}}(k) \right]^2 + \Gamma^2 \sum_l c_l^2
$$</p>
<p>with $\Gamma = 0.0005$ throughout. Training weights bias the fit toward low-energy configurations,</p>
<p>$$
w_k = \left( \frac{\delta E}{\varepsilon^{2\mathrm{B}}(k) - \varepsilon_{\mathrm{min}}^{2\mathrm{B}} + \delta E} \right)^2
$$</p>
<p>with $\delta E = 40$ kcal/mol for all 1-mer-water pairs. MB-Fit handles the optimization, combining simplex minimization for non-linear parameters (Morse decay constants) with ridge regression for the linear coefficients.</p>
<p>Table 1 reports the PIP specifications. All three PIPs use polynomial degree 3 with a complete, unscreened basis. The -CH- and CH$_3$- dimers each require 710 symmetrized terms; the chemically richer -CONH- dimer requires 1,267 terms to capture its dipolar character and directional hydrogen bonding. Training-set sizes range from 41,781 to 43,174 configurations.</p>
<table>
  <thead>
      <tr>
          <th>1-mer type</th>
          <th>PIP degree</th>
          <th>PIP terms</th>
          <th>Training configs</th>
          <th>Train RMSD (kcal/mol)</th>
          <th>Test RMSD (kcal/mol)</th>
          <th>Train MAE</th>
          <th>Test MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>-CH-</td>
          <td>3</td>
          <td>710</td>
          <td>43,174</td>
          <td>0.07</td>
          <td>0.08</td>
          <td>0.06</td>
          <td>0.06</td>
      </tr>
      <tr>
          <td>CH$_3$-</td>
          <td>3</td>
          <td>710</td>
          <td>43,172</td>
          <td>0.08</td>
          <td>0.08</td>
          <td>0.05</td>
          <td>0.05</td>
      </tr>
      <tr>
          <td>-CONH-</td>
          <td>3</td>
          <td>1,267</td>
          <td>41,781</td>
          <td>0.18</td>
          <td>0.20</td>
          <td>0.13</td>
          <td>0.16</td>
      </tr>
  </tbody>
</table>
<p>All RMSDs sit below 0.20 kcal/mol on both train and test splits, comfortably inside the 1 kcal/mol chemical-accuracy threshold.</p>
<h2 id="validation-dimer-scans-free-energy-surfaces-and-hydration">Validation: Dimer Scans, Free-Energy Surfaces, and Hydration</h2>
<p>The authors stage four validation studies of increasing complexity, each touching a distinct facet of the new PEF.</p>
<p><strong>Alanine dipeptide-water dimer scans.</strong> One-dimensional scans probe the interaction energy along four hydrogen-bonding coordinates of an alanine dipeptide-water dimer: O$_1$-H$_w$, H$_1$-O$_w$, O$_2$-H$_w$, and H$_2$-O$_w$, where subscripts 1 and 2 mark the acetyl and N-methyl termini. The dipeptide is constrained to four representative <a href="https://en.wikipedia.org/wiki/Ramachandran_plot">Ramachandran</a> conformations: C5 ($\varphi = -150°$, $\psi = 150°$), pPII ($\varphi = -80°$, $\psi = 150°$), C7$_{\mathrm{eq}}$ ($\varphi = -80°$, $\psi = 70°$), and right-handed $\alpha$-helix $\alpha_R$ ($\varphi = -80°$, $\psi = -30°$). MB-nrg closely tracks the DLPNO-CCSD(T)/aug-cc-pVTZ reference curves across all 16 (4 conformation $\times$ 4 site) scans, despite never seeing the full dipeptide-water surface during training. Amber ff14SB/TIP3P and ff19SB/OPC underestimate hydrogen-bond depths and miss curvature near equilibrium, with the ff14SB/TIP3P combination yielding slightly better overall agreement than ff19SB/OPC even though TIP3P is the less accurate water model.</p>
<p>Two specific failure modes of the empirical force fields stand out. In the pPII conformation, both ff14SB and ff19SB predict significantly deeper interaction wells than the reference, overstabilizing several hydrogen bonds. In the H$_2$-O$_w$ scan of the $\alpha_R$ conformation, both empirical FFs exhibit a spurious 2.5-4.0 Å energy barrier that the authors trace to the simple Lennard-Jones repulsion between the acetyl carbonyl oxygen and water; MB-nrg and DLPNO-CCSD(T) instead show a smoothly decaying profile. The one MB-nrg deviation noted is the C5 H$_1$-O$_w$ scan in the 1.5-2.5 Å range, where MB-nrg predicts a slightly more attractive interaction than the reference. Here the H$_1$-O$_2$ distance is 2.3 Å and water acts simultaneously as acceptor at H$_1$ and donor to O$_2$, a cooperative pattern the authors expect would require explicit 2-mer-water or 3-mer-water terms to fully reproduce.</p>
<p><strong>Free-energy surface in explicit MB-pol water.</strong> Four-walker well-tempered metadynamics (WT-MetaD) simulations explore the conformational landscape of alanine dipeptide as a function of $(\varphi, \psi)$, biasing the central alanine residue&rsquo;s backbone dihedrals every 500 steps with 1.0 kJ/mol Gaussians of 11.46° width. The free-energy section reports 2.5 ns per replica across four parallel walkers (10 ns aggregate, matching the Figure 6 caption); the methods section states 8 ns total, an internal inconsistency in the paper. The MB-nrg FES recovers all major low-energy conformers identified by NMR and prior MP2/DFT studies: a global minimum at $\alpha_R$, additional local minima in C5, $\beta_2$, and $\alpha_L$, and a metastable pPII basin. The C7$_{\mathrm{eq}}$ minimum that dominates the gas-phase Ramachandran surface in Paper I is significantly destabilized in solution, consistent with experiment.</p>
<p>Quantitatively, MB-nrg predicts $\alpha_R$ and $\beta_2$ as isoenergetic global minima, with C5 about 3 kcal/mol higher in free energy. Prior DFT-with-implicit-solvation studies (Mironov et al., Yang and Honig) report C5, $\alpha_R$, and $\beta_2$ as nearly isoenergetic, and the authors note that the discrepancy may reflect the explicit MB-pol water treatment, residual DFT errors in the reference, or both. They flag a planned systematic benchmarking of MB-nrg PEFs for diverse polypeptides against both DFT and DLPNO-CCSD(T) data in future work. The Amber FESs over-stabilize pPII relative to C5/$\alpha_R$, contradicting experimental and DFT benchmarks; ff19SB/OPC also exhibits a spurious C7$_{\mathrm{eq}}$ minimum that is absent from MB-nrg.</p>
<p><strong>Hydration radial distribution functions.</strong> Site-site RDFs at 300 K for the same hydrogen-bond contacts (O$_1$-H$_w$, O$_2$-H$_w$, H$_1$-O$_w$, H$_2$-O$_w$) are computed from NVT MD trajectories. All three models reproduce well-defined first-shell peaks near 2.0 Å. For the O-H$_w$ pairs, MB-nrg shows a broader, slightly right-shifted second-shell peak, indicating less rigid water structure beyond the first shell. The amide-hydrogen RDFs are nearly identical between ff14SB/TIP3P and ff19SB/OPC, while MB-nrg reveals subtle first-shell shifts (shorter H$_1$-O$_w$, longer H$_2$-O$_w$) and weaker, less-defined second-shell features near 3.7-3.8 Å that are absent from the empirical force fields and consistent with prior ab initio MD on alanine dipeptide.</p>
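<p>The site-site RDFs follow the standard histogram-over-ideal-gas normalization, sketched below with toy inputs; periodic imaging and the paper's actual trajectory handling are assumed upstream:</p>

```python
import numpy as np

def site_site_rdf(d_frames, n_pairs, box_volume, r_max=6.0, n_bins=60):
    """Site-site g(r) from per-frame site-site distances.

    d_frames: list of 1D arrays of, e.g., O..Hw distances, one per frame.
    Counts are normalized by the ideal-gas expectation
    n_frames * n_pairs * V_shell / V_box.  Illustrative, not the paper's code.
    """
    edges = np.linspace(0.0, r_max, n_bins + 1)
    hist = np.zeros(n_bins)
    for d in d_frames:
        h, _ = np.histogram(d, bins=edges)
        hist += h
    shell_vol = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    ideal = len(d_frames) * n_pairs * shell_vol / box_volume
    r = 0.5 * (edges[:-1] + edges[1:])
    return r, hist / ideal

# one toy "frame" with three hydrogen-bond-range distances (angstrom)
r, g = site_site_rdf([np.array([2.0, 2.05, 3.9])], n_pairs=3, box_volume=1000.0)
```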
<h2 id="a-modular-path-to-chemically-accurate-biomolecular-simulations">A Modular Path to Chemically Accurate Biomolecular Simulations</h2>
<p>Across the four benchmarks, the same picture emerges: a modular, bottom-up MB-nrg PEF built from functional-group $n$-mers and trained only on isolated 1-mer-water dimers can reach DLPNO-CCSD(T) accuracy for both energetic and structural observables of alanine dipeptide in explicit water. The decomposition into a gas-phase intramolecular term, an MB-pol water model, and an MB-nrg cross term keeps each piece interpretable and individually replaceable; the gas-phase polyalanine PEF from Paper I drops in unchanged, and the new ala-water PIPs were fit without ever seeing the full alanine dipeptide-water PES.</p>
<p>The authors are explicit about limitations:</p>
<ul>
<li>The cross term currently includes only 2-body PIPs (one 1-mer with one water). Higher-body peptide-water terms ($n &gt; 2$) are folded into the classical polarization, which the authors expect will be inadequate for strongly cooperative configurations such as the C5 H$_1$-O$_w$ scan where one water bridges H$_1$ and O$_2$.</li>
<li>Quantitative differences between the MB-nrg FES and prior implicit-solvation DFT studies (relative depths of $\alpha_R$, $\beta_2$, and C5) remain to be reconciled through systematic benchmarking against higher-level reference data.</li>
<li>Only polyalanine is considered. The framework is designed to generalize to other amino acids and side-chain-water interactions, but sequence- and side-chain-specific PIPs are still to be fit.</li>
<li>No public release of the parameterized PEF or training data is announced; the data availability statement says &ldquo;available from the authors upon request.&rdquo;</li>
</ul>
<p>The paper positions MB-nrg as a transferable, interpretable strategy for chemically accurate biomolecular simulations in solution, with future work aimed at heteropolypeptides and explicit higher-order many-body cross terms.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training pools</td>
          <td>PBMetaD+PFs in LAMMPS/PLUMED</td>
          <td>~600,000 configs per dimer, reduced to ~40,000</td>
          <td>ff14SB for alanine 1-mers, TIP4P/2005f for water; 300 K, 500 K, 700 K</td>
      </tr>
      <tr>
          <td>Distance regimes</td>
          <td>Walls on 1-mer/water COM distance</td>
          <td>0-4, 4-7, and 7-10 Å</td>
          <td>Short-range repulsion, mid-range attraction, long-range orientation</td>
      </tr>
      <tr>
          <td>Training labels</td>
          <td>DLPNO-CCSD(T)/aug-cc-pVTZ in ORCA</td>
          <td>3 unique 1-mer-water dimer types</td>
          <td>RIJCOSX, TightSCF, TightPNO, PModel; counterpoise BSSE correction</td>
      </tr>
      <tr>
          <td>Test sets</td>
          <td>Held-out clustered configs</td>
          <td>~2,000 per dimer</td>
          <td>Same K-means clustering protocol</td>
      </tr>
      <tr>
          <td>Alanine dipeptide-water scans</td>
          <td>1D scans along 4 H-bond coordinates in 4 conformations</td>
          <td>16 scans total</td>
          <td>C5, pPII, C7$_{\mathrm{eq}}$, and $\alpha_R$ conformations</td>
      </tr>
      <tr>
          <td>Alanine dipeptide FES</td>
          <td>WT-MetaD on $\varphi$, $\psi$ in MB-pol water</td>
          <td>4 walkers, 2.5 ns each (10 ns total per the results section and Figure 6 caption; methods section states 8 ns)</td>
          <td>1.0 kJ/mol height, 11.46° width, deposition every 500 steps</td>
      </tr>
      <tr>
          <td>Hydration RDFs</td>
          <td>NVT MD at 300 K</td>
          <td>Single trajectory per model</td>
          <td>Same H-bond sites as the dimer scans</td>
      </tr>
  </tbody>
</table>
<p>Per the data availability statement, &ldquo;any data generated and analyzed in this study, including the MB-nrg PEF, are available from the authors upon request.&rdquo; The MBX engine is publicly available on <a href="https://github.com/paesanilab/MBX">GitHub</a> under a UC Regents custom license that grants free use for educational, research, and non-profit purposes but restricts commercial use. No public release of the new ala-water PIPs is announced in the text.</p>
<h4 id="artifacts-table">Artifacts table</h4>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/paesanilab/MBX">MBX</a></td>
          <td>Code</td>
          <td>UC Regents custom (academic/non-profit only; no SPDX-recognized OSS license)</td>
          <td>C++ many-body potential engine; runs the MB-nrg PEF via LAMMPS and PLUMED</td>
      </tr>
      <tr>
          <td><a href="https://github.com/paesanilab/MB-Fit">MB-Fit</a></td>
          <td>Code</td>
          <td>Check repo</td>
          <td>Training pipeline for PIP fitting; used to fit the new 1-mer-water PIPs</td>
      </tr>
      <tr>
          <td>MB-nrg ala-water PIPs (this paper)</td>
          <td>Model</td>
          <td>Not released</td>
          <td>&ldquo;Available from the authors upon request&rdquo; per the data availability statement</td>
      </tr>
      <tr>
          <td>DLPNO-CCSD(T) training/test sets</td>
          <td>Dataset</td>
          <td>Not released</td>
          <td>Same statement; ~600,000 raw configs per dimer reduced to ~40,000 train + ~2,000 test</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Many-body expansion of the energy partitioned into three modular blocks: $V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala}} + V_{\mathrm{MB\text{-}pol}}^{\mathrm{wat}} + V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}}$.</li>
<li>Cross term split into $V_{\mathrm{ML}}^{2\mathrm{B}}$ (PIPs over every 1-mer-water dimer) and $V_{\mathrm{phys}} = V_{\mathrm{elec}} + V_{\mathrm{disp}}$.</li>
<li>Permutationally invariant polynomials in Morse-exponential variables $\xi_{ij} = \exp(-k_{\tau(ij)} R_{ij})$, symmetrized over chemically equivalent atoms; same construction as the NMA-water PIPs.</li>
<li>Cosine switching function $s^{2\mathrm{B}}$ smoothly attenuates short-range PIPs between user-defined inner and outer cutoffs.</li>
<li>Dispersion: Tang-Toennies damped $C_6/R^6$ with XDM-derived coefficients and damping parameters.</li>
<li>Electrostatics: modified Thole model with self-consistent induced dipoles for many-body polarization; per-atom charges fit to reproduce permanent multipole moments of each $n$-mer&rsquo;s optimized structure.</li>
<li>Ghost-H capping at cleaved covalent boundaries with fixed C-H (1.14 Å) and N-H (1.09 Å) distances; per-dimer optimized-structure referencing.</li>
<li>Training with simplex minimization for non-linear parameters and ridge regression for linear coefficients via MB-Fit, with low-energy weighting and $\Gamma = 0.0005$, $\delta E = 40$ kcal/mol.</li>
<li>WT-MetaD with four parallel walkers for the alanine dipeptide FES.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Three new 1-mer-water 2-body PIPs covering -CH-/H$_2$O, CH$_3$-/H$_2$O, and -CONH-/H$_2$O dimers.</li>
<li>All three PIPs use polynomial degree 3 with a complete, unscreened basis (no term screening).</li>
<li>Term counts: 710 for -CH-/H$_2$O and CH$_3$-/H$_2$O, 1,267 for -CONH-/H$_2$O.</li>
<li>Combined with the gas-phase polyalanine MB-nrg PEF from Paper I and the MB-pol water model, exercised through MBX, LAMMPS, and PLUMED.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MB-nrg</th>
          <th>Amber ff14SB/TIP3P</th>
          <th>Amber ff19SB/OPC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>-CH-/H$_2$O 2-body train/test RMSD</td>
          <td>0.07 / 0.08 kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>CH$_3$-/H$_2$O 2-body train/test RMSD</td>
          <td>0.08 / 0.08 kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>-CONH-/H$_2$O 2-body train/test RMSD</td>
          <td>0.18 / 0.20 kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>Alanine dipeptide-water 1D scans (qualitative)</td>
          <td>Tracks DLPNO-CCSD(T) curves across 16 scans</td>
          <td>Underestimates H-bond depths; spurious $\alpha_R$ H$_2$-O$_w$ barrier</td>
          <td>Same shape as ff14SB/TIP3P</td>
      </tr>
      <tr>
          <td>Alanine dipeptide FES global minima</td>
          <td>Isoenergetic $\alpha_R$ and $\beta_2$; C5 ~3 kcal/mol higher</td>
          <td>Over-stabilizes pPII</td>
          <td>Over-stabilizes pPII; spurious C7$_{\mathrm{eq}}$ minimum</td>
      </tr>
      <tr>
          <td>O-H$_w$ second shell</td>
          <td>Broader, right-shifted; finer detail consistent with prior AIMD</td>
          <td>Sharper, less detail</td>
          <td>Sharper, less detail</td>
      </tr>
      <tr>
          <td>H-O$_w$ second shell</td>
          <td>Weak features near 3.7-3.8 Å</td>
          <td>Absent</td>
          <td>Absent</td>
      </tr>
  </tbody>
</table>
<p>Quantitative RMSD or KL-divergence values for the FES and RDF benchmarks are not reported in the main text.</p>
<h3 id="hardware">Hardware</h3>
<p>The authors acknowledge support from the Air Force Office of Scientific Research (FA9550-20-1-0351, theoretical development) and NSF (award 2311260, MBX implementation). Computational resources came from the DoD High Performance Computing Modernization Program, the San Diego Supercomputer Center via ACCESS allocation CHE240114, and NERSC (contract DE-AC02-05CH11231, award BES-ERCAP0030920). Specific wall-clock and node-hour figures are not reported in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhou, R., &amp; Paesani, F. (2025). Toward Chemical Accuracy in Biomolecular Simulations through Data-Driven Many-Body Potentials: II. Polyalanine in Water. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv-2025-j6cwv-v2">https://doi.org/10.26434/chemrxiv-2025-j6cwv-v2</a></p>
<p><strong>Publication</strong>: ChemRxiv preprint (version 2), 10 October 2025.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/paesanilab/MBX">MBX software (Paesani group)</a></li>
<li><a href="https://github.com/paesanilab/MB-Fit">MB-Fit (training pipeline)</a></li>
<li>Companion paper: <a href="/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/">MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine</a> (Paper I)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhou2025toward,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Toward Chemical Accuracy in Biomolecular Simulations through Data-Driven Many-Body Potentials: II. Polyalanine in Water}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhou, Ruihan and Paesani, Francesco}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ChemRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv-2025-j6cwv-v2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph Grammar and ILP for Carbon Fixation Pathways</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/carbon-fixation-pathway-design/</link><pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/carbon-fixation-pathway-design/</guid><description>Graph-based chemical space expansion with ILP flow queries discovers novel autocatalytic carbon fixation pathways competitive with CETCH and rTCA.</description><content:encoded><![CDATA[<h2 id="a-graph-grammar-and-ilp-framework-for-pathway-discovery">A Graph-Grammar and ILP Framework for Pathway Discovery</h2>
<p>Abel et al. present a Method paper that couples generative chemical space expansion with <a href="https://en.wikipedia.org/wiki/Integer_programming">integer linear programming</a> (ILP) pathway queries to systematically propose artificial carbon fixation pathways. The workflow uses the cheminformatics package MØD to iteratively expand a reaction hypergraph from a seed set of metabolites and rule-based enzyme reactions, then queries the resulting network for autocatalytic flows producing a chosen target molecule. Post-hoc annotation with eQuilibrator Gibbs energies and cofactor accounting ranks candidates by thermodynamic feasibility. Applied to the Acetyl-CoA-Succinyl-CoA pathway family plus selected synthetic and theoretical pathways, the framework recovers the natural pathways and proposes two new theoretical autocatalytic cycles (an 11-step Acetyl-CoA cycle and a 12-step Malate cycle) whose efficiency, measured in ATP and redox cofactors per fixed carbon, is comparable to the synthetic CETCH cycle and the natural <a href="https://en.wikipedia.org/wiki/Reverse_Krebs_cycle">rTCA</a>.</p>
<h2 id="why-computational-pathway-design-for-carbon-fixation">Why Computational Pathway Design for Carbon Fixation</h2>
<p>Fixing atmospheric CO$_2$ or bicarbonate into value-added chemicals is a thermodynamically unfavorable process that nature solves through enzymatic cascades coupled to cofactor-driven reactions. Seven natural carbon fixation pathways are known, along with several artificial proposals, and the Acetyl-CoA-Succinyl-CoA family is particularly appealing as a design template because each member overlaps structurally with at least one other and each exhibits <a href="https://en.wikipedia.org/wiki/Autocatalysis">autocatalysis</a>. Prior approaches to artificial pathway design (e.g., Erb Lab CETCH, HOPAC) rely heavily on manual heuristics, database searches, and extensive in-vitro optimization including <a href="https://en.wikipedia.org/wiki/Directed_evolution">directed evolution</a>. Earlier computational work (Löwe and Kremling, 2021) uses <a href="https://en.wikipedia.org/wiki/Flux_balance_analysis">flux balance analysis</a> and expert curation that requires complete kinetic parameterization, making generative exploration infeasible. Abel et al. target the design stage directly: a computational approach that can quickly enumerate many topologically distinct pathway candidates without requiring a priori kinetic parameters.</p>
<h2 id="generative-chemical-space-expansion-with-graph-grammar-rules">Generative Chemical Space Expansion with Graph-Grammar Rules</h2>
<p>The core innovation is treating the chemical reaction network (CRN) as a <a href="https://en.wikipedia.org/wiki/Hypergraph">directed multi-hypergraph</a> $H = (V, E)$ where vertices in $V$ are molecules and each hyperedge $e \in E$ is a directed pair $(e_{tail}, e_{head})$ of multisets representing reactants and products. This hyperedge formalization directly captures the many-to-many nature of biochemical reactions.</p>
<p>Reactions are specified as graph transformation rules written in the Graph Modeling Language (GML). A rule defines the bond rewiring at a reaction center plus a tunable molecular context around that center. A rule with no context is fully promiscuous (every oxidoreductase class reaction, say); a rule with rich context mimics a specific enzyme. This rule-based formalism lets one rule represent an entire reaction class, so the CRN can be expanded without enumerating every possible enzyme-substrate pair in advance. Expansion proceeds iteratively: the rules act on the current molecule pool, producing new molecules and new hyperedges, until a user-defined step count is reached. Two biochemical sanity constraints bound the combinatorial explosion: molecules are restricted to at most 6 carbon atoms in the backbone (excluding the CoA moiety), and at most one CoA group per molecule.</p>
<p>Pathway discovery is then an ILP flow query over the CRN. A pathway is a hyperflow: an assignment of integer flow values to hyperedges such that internal molecules balance between production and consumption, leaving only designated source and sink molecules with net flow. The main optimization objective minimizes the number of reactions used and, as a tiebreaker, the magnitude of flow on those reactions:</p>
<p>$$
\min \sum_{e \in E} \left( w \, z_e + x_e \right)
$$</p>
<p>where $z_e$ is a boolean indicator that hyperedge $e$ carries flow, $x_e$ is the integer flow on $e$, and the weight $w = 1000$ prioritizes minimizing the edge count over the total flow magnitude. Autocatalysis is encoded as a constraint on the autocatalyst molecule $a$: its inflow and outflow are both positive, with outflow strictly exceeding inflow so the cycle nets at least one additional molecule of the autocatalyst.</p>
<p>$$
0 &lt; x_a^{in} &lt; x_a^{out}
$$</p>
<p>Only the autocatalyst itself, cofactors, and CO$_2$/HCO$_3^-$ are permitted as sources and sinks, so any valid flow represents a net reaction that fixes carbon and regenerates the autocatalyst. Unlike classical flux balance analysis, which optimizes continuous flux distributions at steady state, the integer-valued ILP formulation emphasizes pathway structure (which reactions are active) rather than flux magnitude.</p>
<p>Solutions are post-annotated with two feasibility measures. The first is cofactor accounting, split into ATP/ADP as an energy proxy and reduced redox cofactors (NAD(P)H, ubiquinone, Ferredoxin) as an electron proxy. The second is the standard Gibbs free energy of the net reaction computed via the eQuilibrator 3.0 component-contribution method at pH 7 and ionic strength 0.1 M using the eQuilibrator API 0.6.0:</p>
<p>$$
\Delta_r G'^{\circ} = \sum \Delta_f G'^{\circ}_{\text{products}} - \sum \Delta_f G'^{\circ}_{\text{reactants}}
$$</p>
<h2 id="experimental-setup-queries-and-comparison-to-literature">Experimental Setup, Queries, and Comparison to Literature</h2>
<p>The seed pool for expansion contains 49 intermediates drawn from the Acetyl-CoA-Succinyl-CoA family (rTCA, DC/4-HB, 3-HP/4-HB, 3-HP bicycle), the synthetic CETCH cycle, and theoretical pathways proposed by Bar-Even et al., plus 20 helper molecules (cofactors, water, CO$_2$). Rule contexts were derived from <a href="https://en.wikipedia.org/wiki/KEGG">KEGG</a> enzyme entries. The <a href="https://en.wikipedia.org/wiki/Calvin_cycle">Calvin-Benson-Basham cycle</a> and the non-autocatalytic <a href="https://en.wikipedia.org/wiki/Wood%E2%80%93Ljungdahl_pathway">Wood-Ljungdahl</a> and reductive glycine pathways were excluded.</p>
<p>Expansion statistics (Table 4 in the paper):</p>
<table>
  <thead>
      <tr>
          <th>Expansion steps</th>
          <th>Molecules (vertices)</th>
          <th>Reactions (hyperedges)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>165</td>
          <td>220</td>
      </tr>
      <tr>
          <td>2</td>
          <td>318</td>
          <td>942</td>
      </tr>
      <tr>
          <td>5</td>
          <td>996</td>
          <td>29,266</td>
      </tr>
  </tbody>
</table>
<p>At one expansion step, flow queries recover only the input pathways, with no recombinations. Two expansion steps produce enough novelty for recombined pathways while keeping ILP runtimes tractable. At five steps, flow queries become computationally prohibitive without adding biological insight. All reported analyses therefore use the two-step CRN.</p>
<p>Three benchmark flow queries target autocatalytic pathways producing Acetyl-CoA, Malate, and Propionyl-CoA. Each query is run to return 1000 topologically distinct optimal solutions (under the ILP objective, solutions with equal length are equally optimal). All flow queries were solved with Gurobi 11.0.3 under an academic license on a consumer laptop (AMD Ryzen 7 5700U, 16 GB RAM, Windows 11). The full 1000-solution search took just under 18 hours.</p>
<h2 id="two-novel-autocatalytic-cycles-competitive-with-synthetic-pathways">Two Novel Autocatalytic Cycles Competitive with Synthetic Pathways</h2>
<p>The shortest-pathway queries yield two novel theoretical autocatalytic cycles: an 11-step Acetyl-CoA cycle and a 12-step Malate cycle. Comparison to natural, theoretical, and synthetic pathways on the four standard measures (steps, ATP units, cofactors, carbon units fixed per cycle):</p>
<table>
  <thead>
      <tr>
          <th>Pathway</th>
          <th>Status</th>
          <th>Steps</th>
          <th>ATP</th>
          <th>Cofactors</th>
          <th>C fixed</th>
          <th>ATP/C</th>
          <th>Cof/C</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Shortest Acetyl-CoA (this work)</td>
          <td>Theoretical</td>
          <td>11</td>
          <td>2</td>
          <td>5</td>
          <td>2</td>
          <td>1</td>
          <td>2.5</td>
      </tr>
      <tr>
          <td>Shortest Malate (this work)</td>
          <td>Theoretical</td>
          <td>12</td>
          <td>3</td>
          <td>8</td>
          <td>4</td>
          <td>0.75</td>
          <td>2</td>
      </tr>
      <tr>
          <td>CETCH</td>
          <td>Synthetic</td>
          <td>11</td>
          <td>1</td>
          <td>4</td>
          <td>2</td>
          <td>0.5</td>
          <td>2</td>
      </tr>
      <tr>
          <td>rGPS-MCG</td>
          <td>Synthetic</td>
          <td>18</td>
          <td>4</td>
          <td>6</td>
          <td>3</td>
          <td>1.33</td>
          <td>2</td>
      </tr>
      <tr>
          <td>C4-glyoxylate / alanine</td>
          <td>Theoretical</td>
          <td>9</td>
          <td>2</td>
          <td>2</td>
          <td>2</td>
          <td>1</td>
          <td>1</td>
      </tr>
      <tr>
          <td>rTCA</td>
          <td>Natural</td>
          <td>12</td>
          <td>4</td>
          <td>7</td>
          <td>4</td>
          <td>1</td>
          <td>1.75</td>
      </tr>
      <tr>
          <td>3HP/4HB</td>
          <td>Natural</td>
          <td>16</td>
          <td>4</td>
          <td>6</td>
          <td>2</td>
          <td>2</td>
          <td>3</td>
      </tr>
      <tr>
          <td>DC/4HB</td>
          <td>Natural</td>
          <td>14</td>
          <td>4</td>
          <td>7</td>
          <td>2</td>
          <td>2</td>
          <td>3.5</td>
      </tr>
      <tr>
          <td>3HP-bicycle</td>
          <td>Natural</td>
          <td>19</td>
          <td>3</td>
          <td>4</td>
          <td>2</td>
          <td>1.5</td>
          <td>2</td>
      </tr>
  </tbody>
</table>
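<p>The per-carbon columns follow directly from the raw counts, which makes the table easy to sanity-check:</p>

```python
# ATP and cofactor cost per fixed carbon, from the comparison table above.
pathways = {
    "Shortest Acetyl-CoA": dict(atp=2, cof=5, c=2),
    "Shortest Malate":     dict(atp=3, cof=8, c=4),
    "CETCH":               dict(atp=1, cof=4, c=2),
    "rTCA":                dict(atp=4, cof=7, c=4),
}

per_carbon = {name: (p["atp"] / p["c"], p["cof"] / p["c"])
              for name, p in pathways.items()}
```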
<p>The 11-step Acetyl-CoA cycle matches CETCH in length and carbon units fixed while using one more ATP and one more redox cofactor. The Malate cycle is the same length as rTCA (12 steps) but uses one fewer ATP and one fewer cofactor while fixing the same four carbons.</p>
<p>Across the 1000-solution benchmarks (Table 2 of the paper), the Acetyl-CoA cycle is the most cofactor-efficient per step (0.69 cofactors/step; average 7.6 total), while Propionyl-CoA and Malate average 0.89 and 0.88 cofactors/step. Gibbs energies average $\Delta_r G'^{\circ} = -150.66$ kJ/mol for Acetyl-CoA, $-165.82$ for Propionyl-CoA, and $-196.98$ for Malate, making the Malate query the most thermodynamically driven even after accounting for its higher cofactor count. Three specific Acetyl-CoA solutions inspected in detail share a common rTCA-like core with a glyoxylate shunt and differ mainly along the oxaloacetate-to-malyl-CoA branch; their totals range from $\Delta_r G'^{\circ}_{\mathrm{total}} = -80$ kJ/mol (the one-ATP solution) to $-168$ kJ/mol.</p>
<p>All solutions rely on <a href="https://en.wikipedia.org/wiki/Ferredoxin">Ferredoxin</a>-dependent carboxylating enzymes (pyruvate:ferredoxin oxidoreductase and 2-ketoglutarate:ferredoxin oxidoreductase), which have higher reduction potentials than NAD(P) but are oxygen-sensitive and would restrict wet-lab implementation to anaerobic conditions or engineered anaerobic strains.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The workflow produces pathway candidates whose efficiency approaches the best synthetic designs while running on a consumer laptop, and it generalizes to any chemical space that can be formalized by graph-transformation rules. Because the ILP returns many equally optimal solutions, a downstream filtering step can select candidates matching user criteria (oxygen sensitivity, specific cofactor preference, enzyme availability).</p>
<p>Acknowledged limitations include: the topology-only search ignores enzyme kinetics, so candidates that look thermodynamically favorable might be bottlenecked in practice; the carbon-count and CoA restrictions are necessary to bound combinatorial blow-up but also constrain the discoverable space; reliance on Ferredoxin complicates implementation; and enzyme availability varies across organisms, which matters for recombination-based designs. The authors point to kinetic modeling, cofactor-recycling system inclusion, and incorporation of metabolic reactions outside the canonical carbon fixation space as future directions.</p>
<p>The paper positions itself as a design-stage tool rather than an end-to-end in-vitro pipeline. The authors frame the contribution as idea generation that complements, not replaces, the subsequent experimental optimization (enzyme engineering, directed evolution) that has carried prior synthetic pathway work from theory to in-vitro success.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Seed molecules</td>
          <td>Curated Acetyl-CoA-Succinyl-CoA family + CETCH + Bar-Even theoretical</td>
          <td>49 metabolites + 20 cofactors</td>
          <td>Tables S1-S2</td>
      </tr>
      <tr>
          <td>Reaction rules</td>
          <td>KEGG enzyme entries, GML-encoded</td>
          <td>Rules listed in Figure S1</td>
          <td>Conservative context</td>
      </tr>
      <tr>
          <td>CRN (2-step expansion)</td>
          <td>Generated by MØD</td>
          <td>318 molecules, 942 reactions</td>
          <td>Primary analysis space</td>
      </tr>
      <tr>
          <td>Thermodynamic data</td>
          <td>eQuilibrator 3.0 component-contribution</td>
          <td>All molecules in space</td>
          <td>pH 7, ionic strength 0.1 M</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Graph-grammar rule expansion via MØD 1.0.0 with a 6-carbon backbone cap and at most one CoA moiety per molecule. ILP flow queries formulated with the edge-minimization objective in Equation (1) and the autocatalysis constraint in Equation (2). Natural pathway presence first verified via set operations on the CRN, then reconfirmed by constraining the ILP to pass through core intermediates. The pathway solution enumeration is structural: 1000 topologically distinct solutions per query at the optimal objective value.</p>
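<p>The MØD/Gurobi pipeline itself is not reproduced here, but the shape of the edge-minimization query with an autocatalysis constraint can be illustrated with a brute-force stand-in on a toy reaction network. All species and reaction names below are invented for illustration; the real system solves this as an ILP over hypergraph flows.</p>

```python
from itertools import combinations

# Toy reaction network: each reaction maps a set of consumed species to a set
# of produced species. This is an illustrative stand-in for the paper's ILP
# edge-minimization query over the MOD-generated CRN, not their code.
reactions = {
    "r1": ({"A", "CO2"}, {"B"}),
    "r2": ({"B", "CO2"}, {"C"}),
    "r3": ({"C"}, {"A", "product"}),   # regenerates seed A: autocatalysis
    "r4": ({"A"}, {"D"}),              # dead end
}

def find_min_pathway(reactions, seed, target):
    """Smallest reaction set (the edge-minimization objective) whose inputs
    are covered by the seed, CO2, and its own outputs, that regenerates the
    seed (autocatalysis constraint), and that yields the target."""
    names = list(reactions)
    for k in range(1, len(names) + 1):          # increasing pathway length
        for subset in combinations(names, k):
            consumed = set().union(*(reactions[r][0] for r in subset))
            produced = set().union(*(reactions[r][1] for r in subset))
            feasible = consumed <= produced | {seed, "CO2"}
            if feasible and seed in produced and target in produced:
                return subset
    return None

pathway = find_min_pathway(reactions, "A", "product")
```

<p>Here the minimal autocatalytic pathway is the three-reaction cycle r1&ndash;r3; the dead-end reaction r4 is correctly excluded.</p>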
<h3 id="models">Models</h3>
<p>No machine-learning models. The pipeline is symbolic: graph transformations, hypergraph flow constraints, and component-contribution free energy estimates.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Acetyl-CoA</th>
          <th>Propionyl-CoA</th>
          <th>Malate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Avg steps</td>
          <td>11</td>
          <td>15</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Avg cofactors</td>
          <td>7.6</td>
          <td>13.3</td>
          <td>10.6</td>
      </tr>
      <tr>
          <td>Cofactors/step</td>
          <td>0.69</td>
          <td>0.89</td>
          <td>0.88</td>
      </tr>
      <tr>
          <td>Avg $\Delta_r G&rsquo;^{\circ}$ (kJ/mol)</td>
          <td>$-150.66$</td>
          <td>$-165.82$</td>
          <td>$-196.98$</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Gurobi 11.0.3 (academic license) on a consumer laptop: AMD Ryzen 7 5700U, 16 GB RAM, Windows 11. Full 1000-solution runs for the three benchmark queries completed in just under 18 hours total.</p>
<h3 id="artifacts-and-licensing">Artifacts and licensing</h3>
<ul>
<li>Code and output pathways: <a href="https://github.com/anne-susann/C_fixation_pathway_design">github.com/anne-susann/C_fixation_pathway_design</a> (MIT License)</li>
<li>MØD cheminformatics package (version 1.0.0)</li>
<li>eQuilibrator API version 0.6.0</li>
<li>Gurobi 11.0.3</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Abel, A.-S., Lauber, N., Andersen, J. L., Fagerberg, R., Merkle, D. E., &amp; Flamm, C. (2026). Computational approaches in chemical space exploration for carbon fixation pathways. <em>npj Systems Biology and Applications</em>, 12(1), 17. <a href="https://doi.org/10.1038/s41540-025-00641-8">https://doi.org/10.1038/s41540-025-00641-8</a></p>
<p><strong>Publication</strong>: npj Systems Biology and Applications, 2026</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{abel2026computational,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Computational approaches in chemical space exploration for carbon fixation pathways}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Abel, Anne-Susann and Lauber, Nino and Andersen, Jakob Lykke and Fagerberg, Rolf and Merkle, Daniel Elmar and Flamm, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{npj Systems Biology and Applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17--17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Portfolio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41540-025-00641-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Surge: Fastest Open-Source Chemical Graph Generator</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/surge-chemical-graph-generator/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/surge-chemical-graph-generator/</guid><description>McKay et al. present Surge, an open-source constitutional isomer generator that outperforms MOLGEN by orders of magnitude in speed.</description><content:encoded><![CDATA[<h2 id="a-three-stage-canonical-generation-path">A Three-Stage Canonical Generation Path</h2>
<p>Surge is an open-source constitutional isomer generator that enumerates all possible molecular structures for a given molecular formula. It is built on the <a href="/notes/interdisciplinary/graph-theory/nauty-traces-graph-isomorphism/">nauty</a> package for <a href="https://en.wikipedia.org/wiki/Graph_automorphism">graph automorphism</a> computation and uses a three-stage canonical generation path method that decomposes the enumeration problem into progressively refined graph operations. Surge outperforms the previous state-of-the-art (MOLGEN 5.0) by orders of magnitude in speed while running in under 5 MB of RAM regardless of molecule size.</p>
<h2 id="motivation-the-need-for-fast-open-structure-generators">Motivation: The Need for Fast, Open Structure Generators</h2>
<p>Chemical structure generators are essential for <a href="https://en.wikipedia.org/wiki/Computer-assisted_structure_elucidation">computer-assisted structure elucidation</a> (CASE), virtual library creation, and chemical space enumeration (e.g., <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>&rsquo;s 166.4 billion molecules). MOLGEN had been the gold standard for decades but is closed-source. The previous best open-source alternative, MAYGEN, was roughly 3x slower than MOLGEN. Reymond&rsquo;s lab used an in-house nauty-based generator for GDB-17 but did not release it publicly. Surge fills this gap as a fast, open-source, and extensible alternative.</p>
<h2 id="the-three-stage-algorithm">The Three-Stage Algorithm</h2>
<p>Given a molecular formula (e.g., $\text{C}_9\text{H}_{18}\text{N}_2\text{O}_4$), Surge proceeds through three stages:</p>
<p><strong>Stage 1 (geng): Simple graph generation.</strong> Computes all connected simple graphs with the appropriate number of non-hydrogen atoms and edges, subject to maximum degree constraints from the molecular formula. These graphs represent molecular topologies without atom types or bond orders. For Lysopine ($\text{C}_9\text{H}_{18}\text{N}_2\text{O}_4$), this produces 534,493 graphs in 1.3 seconds.</p>
<p><strong>Stage 2 (vcolg): Vertex coloring (atom assignment).</strong> Assigns element types (C, N, O, S, etc.) to vertices in all distinct ways, using the automorphism group of each simple graph to avoid generating equivalent assignments. Given a fixed ordering of elements (e.g., $\text{C} &lt; \text{O} &lt; \text{S}$), element assignments are represented as lists $L$ and compared lexicographically. Exactly one representative from each equivalence class is selected by computing the canonical (lexicographically maximal) list:</p>
<p>$$
\text{canon}(L) = \max\{\gamma(L) \mid \gamma \in \text{Aut}(G)\}
$$</p>
<p>A list $L$ is accepted if and only if $\text{canon}(L) = L$, i.e., no automorphism produces a lexicographically larger list. For Lysopine, this expands to 3.0 billion vertex-labeled graphs in 90 seconds.</p>
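<p>The canonicity test can be illustrated in a few lines. This brute-force sketch (our own toy code, not Surge&rsquo;s C implementation) enumerates element assignments on a 3-vertex path and keeps exactly one representative per orbit of the automorphism group:</p>

```python
from itertools import permutations, product

# Path graph 0-1-2; Aut(G) = {identity, reversal}.
edges = {(0, 1), (1, 2)}

def automorphisms(n, edges):
    """All vertex permutations that preserve the edge set (brute force)."""
    norm = {tuple(sorted(e)) for e in edges}
    return [perm for perm in permutations(range(n))
            if {tuple(sorted((perm[u], perm[v]))) for u, v in norm} == norm]

def is_canonical(L, auts):
    """Accept L iff no automorphism produces a lexicographically larger list,
    i.e. canon(L) = L."""
    return all(L >= tuple(L[g[i]] for i in range(len(L))) for g in auts)

auts = automorphisms(3, edges)
# All C/O assignments to the three vertices; exactly one representative per
# equivalence class under Aut(G) survives the canonicity test.
accepted = [L for L in product("CO", repeat=3) if is_canonical(L, auts)]
```

<p>Of the $2^3 = 8$ assignments, two pairs are related by the path reversal, so 6 canonical representatives survive.</p>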
<p><strong>Stage 3 (multig): Edge multiplicity (bond orders).</strong> Assigns bond multiplicities (single, double, triple) to edges, again using automorphism group factorization to avoid duplicates. For Lysopine, this produces 6.0 billion completed molecules in an additional 100 seconds.</p>
<h2 id="efficient-automorphism-handling-via-group-factorization">Efficient Automorphism Handling via Group Factorization</h2>
<p>The key algorithmic innovation is the factorization of the automorphism group:</p>
<p>$$
\text{Aut}(G) = NM = \{\gamma\delta \mid \gamma \in N,\; \delta \in M\}
$$</p>
<p>where $M$ is the &ldquo;minor subgroup&rdquo; generated by transpositions of leaves sharing a common neighbor (&ldquo;flowers&rdquo;), and $N$ is a complete set of coset representatives computed by nauty. A flower is a maximal set of degree-1 vertices (leaves) with the same neighbor. The minor subgroup $M$ is normal in $\text{Aut}(G)$, making the factorization well-defined.</p>
<p><strong>Theorem.</strong> A list $L$ satisfies $L = \text{canon}(L)$ if and only if $L = \max\{\delta(L) \mid \delta \in M\}$ and $L = \max\{\gamma(L) \mid \gamma \in N\}$.</p>
<p>This factorization enables efficient canonicity testing. Maximality under $M$ reduces to enforcing decreasing element order within each flower (simple inequality constraints during recursive assignment). Maximality under $N$ requires explicit testing against the $N$ generators, but $N$ is trivial (identity only) 58% of the time in Stage 2 and 98% of the time in Stage 3.</p>
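<p>The practical payoff is that maximality under $M$ needs no group enumeration at all: it reduces to an ordering constraint inside each flower. A minimal sketch of that check (data layout and names are ours):</p>

```python
# Maximality under the minor subgroup M: within each "flower" (leaves sharing
# a common neighbor), element labels must appear in non-increasing order.
# This replaces explicit enumeration of M with simple inequality checks.
def max_under_minor(L, flowers):
    """L: element label per vertex; flowers: lists of leaf indices that share
    a common neighbor."""
    return all(
        all(L[a] >= L[b] for a, b in zip(f, f[1:]))
        for f in flowers
    )

# Leaves 1, 2, 3 all attached to vertex 0: one flower of size three.
flowers = [[1, 2, 3]]
ok = max_under_minor(("C", "O", "O", "C"), flowers)       # O >= O >= C holds
bad = max_under_minor(("C", "C", "O", "O"), flowers)      # C < O violates order
```
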
<h2 id="benchmark-results">Benchmark Results</h2>
<p>Benchmarked against MOLGEN 5.0 on 30 natural product molecular formulas from the COCONUT database on a compute-optimized c2-standard-4 Google Cloud VM, Surge achieves 7-22 million molecules per second with a memory footprint of at most 5 MB regardless of molecule size. Representative results:</p>
<table>
  <thead>
      <tr>
          <th>Formula</th>
          <th>Isomers</th>
          <th>Surge (s)</th>
          <th>MOLGEN (s)</th>
          <th>Speedup</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$\text{C}_{10}\text{H}_{16}\text{O}_5$</td>
          <td>1.1B</td>
          <td>69</td>
          <td>5,146</td>
          <td>75x</td>
      </tr>
      <tr>
          <td>$\text{C}_9\text{H}_{18}\text{N}_2\text{O}_4$</td>
          <td>6.0B</td>
          <td>289</td>
          <td>27,250</td>
          <td>94x</td>
      </tr>
      <tr>
          <td>$\text{C}_{11}\text{H}_{12}\text{O}_4$</td>
          <td>31.6B</td>
          <td>2,179</td>
          <td>181,725</td>
          <td>83x</td>
      </tr>
      <tr>
          <td>$\text{C}_{10}\text{H}_{13}\text{NO}_5$</td>
          <td>552B</td>
          <td>54,372</td>
          <td>6,325,646</td>
          <td>116x</td>
      </tr>
      <tr>
          <td>$\text{C}_{10}\text{H}_{10}\text{N}_2\text{O}_3$</td>
          <td>1.5T</td>
          <td>83,186</td>
          <td>8,292,585</td>
          <td>100x</td>
      </tr>
      <tr>
          <td>$\text{C}_9\text{H}_{12}\text{N}_2\text{O}_5$</td>
          <td>1.8T</td>
          <td>180,727</td>
          <td>13,983,652</td>
          <td>77x</td>
      </tr>
  </tbody>
</table>
<p>MOLGEN hit its built-in limit of $2^{31} - 1$ structures for most formulas; reported times were linearly extrapolated. Both generators were instructed to generate but not output structures. MOLGEN was run with <code>-noaromaticity</code> for fair comparison since Surge v1.0 lacks aromaticity detection.</p>
<p>Surge supports output in both <a href="https://en.wikipedia.org/wiki/Chemical_table_file">SDfile</a> and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> formats. SMILES output is produced efficiently by constructing a template for each simple graph at Stage 1, so that only atom types and bond multiplicities must be filled in before output.</p>
<p>Surge also supports built-in filters applied during generation (more efficient than post-hoc filtering):</p>
<ul>
<li><code>-p0:1</code>: at most one cycle of length 5</li>
<li><code>-P</code>: the molecule must be planar</li>
<li><code>-B5</code>: no atom has two double bonds and otherwise only hydrogen neighbors</li>
<li><code>-B9</code>: no atom lies on more than one cycle of length 3 or 4</li>
</ul>
<p>These filter options are inspired by corresponding features in MOLGEN. Surge&rsquo;s open-source design also supports a plugin mechanism: users can write small code snippets to insert custom filters into any of the three stages, enabling efficient pruning of the generation tree.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Version 1.0 does not perform <a href="https://en.wikipedia.org/wiki/H%C3%BCckel%27s_rule">Hückel aromaticity</a> detection, so it generates duplicate <a href="https://en.wikipedia.org/wiki/Aromaticity">Kekulé structures</a> for aromatic rings that are graph-theoretically distinct</li>
<li>Benchmarking against MOLGEN required disabling MOLGEN&rsquo;s aromaticity detection (<code>-noaromaticity</code>) for fair comparison</li>
<li>Written in C (from the nauty suite), which limits accessibility compared to Python-based tools, though this is also the source of its speed</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/structuregenerator/surge">Surge on GitHub</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official C implementation from the nauty suite</td>
      </tr>
      <tr>
          <td><a href="https://structuregenerator.github.io">Surge project page</a></td>
          <td>Other</td>
          <td>Apache 2.0</td>
          <td>Project homepage with documentation and binaries</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Status</strong>: Highly Reproducible. Source code, build instructions, and benchmark formulas are all publicly available.</li>
<li><strong>Hardware</strong>: Benchmarks used a compute-optimized c2-standard-4 Google Cloud VM. Surge runs in at most 5 MB of RAM regardless of molecule size.</li>
<li><strong>Build</strong>: Standard Unix Configure/Make scheme producing a standalone command-line executable. Written in portable C from the nauty suite.</li>
<li><strong>Dependencies</strong>: Requires the nauty package (bundled).</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Published</strong>: Journal of Cheminformatics, Volume 14, Article 24, April 23, 2022</li>
<li><strong>Preprint</strong>: ChemRxiv, December 7, 2021</li>
<li><strong>License</strong>: Apache 2.0 (software), Open Access (paper)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mckay2022surge,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Surge: a fast open-source chemical graph generator}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{McKay, Brendan D. and Yirik, Mehmet Aziz and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00604-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SpeechT5: Unified Speech-Text Pre-Training Framework</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</guid><description>SpeechT5 introduces a shared encoder-decoder framework with cross-modal vector quantization for joint speech and text pre-training across six tasks.</description><content:encoded><![CDATA[<h2 id="a-unified-encoder-decoder-for-spoken-language-processing">A Unified Encoder-Decoder for Spoken Language Processing</h2>
<p>SpeechT5 is a <strong>Method</strong> paper that introduces a shared encoder-decoder pre-training framework for spoken language processing. Inspired by <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5&rsquo;s</a> text-to-text paradigm, SpeechT5 reformulates all spoken language tasks as &ldquo;speech/text to speech/text&rdquo; problems. The framework uses modal-specific pre-nets and post-nets to interface between raw speech or text and a shared Transformer encoder-decoder, enabling a single pre-trained model to handle six downstream tasks: automatic speech recognition (ASR), text-to-speech synthesis (TTS), speech translation (ST), voice conversion (VC), speech enhancement (SE), and speaker identification (SID).</p>
<h2 id="bridging-the-gap-between-speech-and-text-pre-training">Bridging the Gap Between Speech and Text Pre-Training</h2>
<p>Prior speech pre-training work (wav2vec 2.0, HuBERT) suffered from two key limitations. First, these models learned speech representations from unlabeled audio alone, ignoring the complementary information in text data that is critical for cross-modal tasks like ASR and TTS. Second, they relied on encoder-only architectures with task-specific prediction heads, leaving the decoder un-pretrained for sequence-to-sequence generation tasks.</p>
<p>SpeechT5 addresses both gaps by (1) jointly pre-training on unlabeled speech and text data, and (2) using a full encoder-decoder architecture that benefits generation tasks directly. The approach builds on the observation that speech and text, despite their surface differences, share underlying semantic structure that a unified representation can capture.</p>
<h2 id="cross-modal-vector-quantization-for-alignment">Cross-Modal Vector Quantization for Alignment</h2>
<p>The core innovation in SpeechT5 is a cross-modal <a href="https://en.wikipedia.org/wiki/Vector_quantization">vector quantization</a> (VQ) mechanism that aligns speech and text representations into a shared semantic space. The architecture consists of three components:</p>
<p><strong>Shared encoder-decoder backbone.</strong> A Transformer with 12 encoder blocks and 6 decoder blocks (768-dim, 12 heads), using relative position embeddings.</p>
<p><strong>Modal-specific pre/post-nets.</strong> Six specialized networks handle the conversion between raw modalities and the shared representation space:</p>
<ul>
<li>Speech-encoder pre-net: a convolutional feature extractor (from wav2vec 2.0) downsampling raw waveforms</li>
<li>Speech-decoder pre-net: three FC layers with ReLU, processing 80-dimensional log Mel-filterbank features</li>
<li>Speech-decoder post-net: a linear layer predicting Mel features plus five 1D conv layers (256 channels) for residual refinement, with an x-vector speaker embedding concatenated for multi-speaker support</li>
<li>Text pre/post-nets: shared embedding layers mapping between character-level token indices and hidden states (768-dim)</li>
</ul>
<p><strong>Cross-modal vector quantization.</strong> A shared codebook $\mathbf{C}^{K}$ with $K$ learnable embeddings bridges the two modalities. Encoder outputs $\mathbf{u}_i$ are quantized via nearest-neighbor lookup:</p>
<p>$$
\mathbf{c}_i = \arg\min_{j \in [K]} \| \mathbf{u}_i - \mathbf{c}_j \|_2
$$</p>
<p>A proportion (10%) of contextual representations are randomly replaced with these quantized latent units before being fed to the decoder&rsquo;s cross-attention. This mixing forces the quantizer to capture cross-modal features. A diversity loss encourages full codebook utilization:</p>
<p>$$
\mathcal{L}_d = \frac{1}{K} \sum_{k=1}^{K} p_k \log p_k
$$</p>
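<p>A minimal sketch of the quantization step and diversity loss in the notation above (a pure-Python stand-in, not the released implementation):</p>

```python
import math
import random

# Toy codebook of K entries in D dimensions; dimensions chosen for the sketch.
random.seed(0)
K, D = 4, 3
codebook = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

def quantize(u):
    """Nearest-neighbor lookup: index of the codebook entry minimizing L2 distance."""
    return min(range(K),
               key=lambda j: sum((ui - cj) ** 2 for ui, cj in zip(u, codebook[j])))

def diversity_loss(indices):
    """(1/K) * sum_k p_k log p_k: most negative when codebook usage is uniform,
    so minimizing it encourages full codebook utilization."""
    p = [indices.count(k) / len(indices) for k in range(K)]
    return sum(pk * math.log(pk) for pk in p if pk > 0) / K

encoder_outputs = [[random.gauss(0, 1) for _ in range(D)] for _ in range(32)]
codes = [quantize(u) for u in encoder_outputs]
```

<p>Note that uniform usage of all $K$ entries gives the smallest loss value, which is the behavior the $\gamma$-weighted term rewards during pre-training.</p>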
<h3 id="pre-training-objectives">Pre-Training Objectives</h3>
<p>SpeechT5 combines three pre-training objectives:</p>
<p><strong>Speech pre-training</strong> uses two tasks. A bidirectional masked prediction loss $\mathcal{L}_{mlm}^{s}$ follows HuBERT&rsquo;s approach, masking 8% of timesteps in 10-step spans and predicting frame-level targets from an acoustic unit discovery model:</p>
<p>$$
\mathcal{L}_{mlm}^{s} = \sum_{n \in \mathcal{M}} \log p(\mathbf{z}_n \mid \hat{\mathbf{H}}, n)
$$</p>
<p>A reconstruction loss $\mathcal{L}_{1}^{s}$ minimizes the $L_1$ distance between predicted and original Mel-filterbank features, plus a binary cross-entropy stop-token loss $\mathcal{L}_{bce}^{s}$.</p>
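<p>The masking scheme can be sketched as follows; the 8% rate and 10-step spans are from the paper, while the exact sampling procedure here is an assumption:</p>

```python
import random

# Span masking: each timestep is independently chosen as a span start with
# probability p_start, and each start is expanded to a fixed-length span.
def sample_mask(T, p_start=0.08, span=10, rng=random):
    """Return the sorted set of masked timestep indices for a T-step utterance."""
    masked = set()
    for t in range(T):
        if rng.random() < p_start:
            masked.update(range(t, min(t + span, T)))  # clip at sequence end
    return sorted(masked)

random.seed(0)
mask = sample_mask(1000)
```

<p>Overlapping spans simply merge, so the effective masked fraction is somewhat below <code>p_start * span</code>.</p>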
<p><strong>Text pre-training</strong> uses BART-style denoising, masking 30% of text spans (Poisson $\lambda = 3.5$) and training with maximum likelihood estimation:</p>
<p>$$
\mathcal{L}_{mle}^{t} = \sum_{n=1}^{N^t} \log p(\mathbf{y}_n^t \mid \mathbf{y}_{&lt; n}^t, \hat{\mathbf{X}}^t)
$$</p>
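<p>A rough stand-in for the text corruption step: span lengths drawn from Poisson($\lambda = 3.5$) are replaced by a single mask token until roughly 30% of tokens are covered. The sampling procedure below is our reconstruction, not the released code:</p>

```python
import math
import random

def poisson(lam, rng):
    """Poisson sampling via Knuth's algorithm (stdlib only)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def corrupt(tokens, mask_ratio=0.3, lam=3.5, rng=random):
    """Replace Poisson-length spans with a single <mask> token until the
    masking budget (mask_ratio of the original length) is spent."""
    tokens = list(tokens)
    budget = int(mask_ratio * len(tokens))
    while budget > 0:
        length = max(1, min(poisson(lam, rng), budget))
        start = rng.randrange(len(tokens) - length + 1)
        tokens[start:start + length] = ["<mask>"]
        budget -= length
    return tokens

random.seed(1)
out = corrupt(list("abcdefghijklmnopqrst"))  # 20 tokens, budget = 6
```

<p>Adjacent or overlapping spans can merge into one mask token, as in BART, so the corrupted sequence is shorter than the original.</p>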
<p>The full pre-training loss combines all components:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{mlm}^{s} + \mathcal{L}_{1}^{s} + \mathcal{L}_{bce}^{s} + \mathcal{L}_{mle}^{t} + \gamma \mathcal{L}_d
$$</p>
<p>where $\gamma = 0.1$.</p>
<h2 id="evaluation-across-six-spoken-language-tasks">Evaluation Across Six Spoken Language Tasks</h2>
<p>SpeechT5 was evaluated on six downstream tasks, each using a different combination of the shared encoder-decoder and task-appropriate pre/post-nets:</p>
<h3 id="automatic-speech-recognition-asr">Automatic Speech Recognition (ASR)</h3>
<p>Fine-tuned on LibriSpeech 100h with joint <a href="https://en.wikipedia.org/wiki/Connectionist_temporal_classification">CTC</a>/attention decoding. The decoding objective maximizes a combination of decoder, CTC, and language model log-probabilities:</p>
<p>$$
\alpha \log P_{Dec} + (1 - \alpha) \log P_{CTC} + \beta \log P_{LM}
$$</p>
<p>where $\alpha = 0.5$ and $\beta = 1.0$ for the 100h setting (beam size 30). Results on the test sets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>LM</th>
          <th>test-clean</th>
          <th>test-other</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>-</td>
          <td>6.1</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td>HuBERT BASE</td>
          <td>-</td>
          <td>5.8</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>-</strong></td>
          <td><strong>4.4</strong></td>
          <td><strong>10.4</strong></td>
      </tr>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>Transf.</td>
          <td>2.6</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>Transf.</strong></td>
          <td><strong>2.4</strong></td>
          <td><strong>5.8</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="text-to-speech-synthesis-tts">Text-to-Speech Synthesis (TTS)</h3>
<p>Fine-tuned on LibriTTS 460h clean sets with HiFi-GAN vocoder:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Naturalness</th>
          <th>MOS</th>
          <th>CMOS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ground Truth</td>
          <td>-</td>
          <td>3.87 ± 0.04</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Baseline</td>
          <td>2.76</td>
          <td>3.56 ± 0.05</td>
          <td>0</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>2.91</strong></td>
          <td><strong>3.65 ± 0.04</strong></td>
          <td><strong>+0.290</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-translation-st">Speech Translation (ST)</h3>
<p>Evaluated on MUST-C English-to-German and English-to-French:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>EN-DE</th>
          <th>EN-FR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fairseq ST</td>
          <td>22.70</td>
          <td>32.90</td>
      </tr>
      <tr>
          <td>Adapter Tuning</td>
          <td>24.63</td>
          <td>34.98</td>
      </tr>
      <tr>
          <td>Baseline (HuBERT init)</td>
          <td>23.43</td>
          <td>33.76</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>25.18</strong></td>
          <td><strong>35.30</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="voice-conversion-vc">Voice Conversion (VC)</h3>
<p>Evaluated on CMU Arctic:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WER (bdl→slt)</th>
          <th>MCD (bdl→slt)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VTN w/ TTS</td>
          <td>7.6%</td>
          <td>6.33</td>
      </tr>
      <tr>
          <td>Many-to-many VTN</td>
          <td>-</td>
          <td>6.13</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>7.8%</strong></td>
          <td><strong>5.93</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-enhancement-se">Speech Enhancement (SE)</h3>
<p>On the WHAM! dataset, SpeechT5 reduced WER from 76.1% (noisy input) to 8.9%, improving on the baseline&rsquo;s 10.9%.</p>
<h3 id="speaker-identification-sid">Speaker Identification (SID)</h3>
<p>On VoxCeleb1, SpeechT5 achieved 96.49% accuracy, outperforming HuBERT LARGE at 90.33% (from SUPERB) and SpeechNet multi-task at 87.90%.</p>
<h2 id="ablation-study-and-key-findings">Ablation Study and Key Findings</h2>
<p>The ablation study reveals the contribution of each pre-training component:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ASR (clean)</th>
          <th>ASR (other)</th>
          <th>VC (MCD)</th>
          <th>SID (ACC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SpeechT5</td>
          <td>4.4</td>
          <td>10.7</td>
          <td>5.93</td>
          <td>96.49%</td>
      </tr>
      <tr>
          <td>w/o Speech PT</td>
          <td>-</td>
          <td>-</td>
          <td>6.49</td>
          <td>38.61%</td>
      </tr>
      <tr>
          <td>w/o Text PT</td>
          <td>5.4</td>
          <td>12.8</td>
          <td>6.03</td>
          <td>95.60%</td>
      </tr>
      <tr>
          <td>w/o Joint PT</td>
          <td>4.6</td>
          <td>11.3</td>
          <td>6.18</td>
          <td>95.54%</td>
      </tr>
      <tr>
          <td>w/o $\mathcal{L}_{mlm}^{s}$</td>
          <td>7.6</td>
          <td>22.4</td>
          <td>6.29</td>
          <td>90.91%</td>
      </tr>
  </tbody>
</table>
<p>Key findings:</p>
<ol>
<li><strong>Speech pre-training is critical</strong>: without it, ASR fails to converge entirely, and SID accuracy drops to 38.61%.</li>
<li><strong>Text pre-training complements speech</strong>: removing it degrades ASR by ~20% relative, confirming that textual knowledge transfers to speech tasks.</li>
<li><strong>Joint pre-training enables cross-modal transfer</strong>: the vector quantization approach is essential for modality-bridging tasks like ASR.</li>
<li><strong>The masked prediction loss $\mathcal{L}_{mlm}^{s}$ is the most important single component</strong>, responsible for learning strong acoustic features.</li>
</ol>
<p>The authors note limitations in the current scope (English-only, BASE model size) and propose scaling to larger models and multilingual settings as future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Speech pre-training</td>
          <td>LibriSpeech</td>
          <td>960 hours</td>
          <td>Full training set</td>
      </tr>
      <tr>
          <td>Text pre-training</td>
          <td>LibriSpeech LM text</td>
          <td>400M sentences</td>
          <td>Normalized language model text</td>
      </tr>
      <tr>
          <td>ASR fine-tuning</td>
          <td>LibriSpeech</td>
          <td>100h / 960h subsets</td>
          <td></td>
      </tr>
      <tr>
          <td>TTS fine-tuning</td>
          <td>LibriTTS</td>
          <td>460h clean sets</td>
          <td></td>
      </tr>
      <tr>
          <td>ST fine-tuning</td>
          <td>MUST-C</td>
          <td>EN-DE, EN-FR</td>
          <td></td>
      </tr>
      <tr>
          <td>VC fine-tuning</td>
          <td>CMU Arctic</td>
          <td>4 speakers</td>
          <td>bdl, clb, slt, rms</td>
      </tr>
      <tr>
          <td>SE fine-tuning</td>
          <td>WHAM!</td>
          <td>16 kHz max</td>
          <td>enhance-single task</td>
      </tr>
      <tr>
          <td>SID fine-tuning</td>
          <td>VoxCeleb1</td>
          <td>100k+ utterances</td>
          <td>1,251 speakers</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam with warmup (8% of steps) to peak LR $2 \times 10^{-4}$, then linear decay</li>
<li>Speech masking: 8% of timesteps, 10-step spans</li>
<li>Text masking: 30% of spans, Poisson $\lambda = 3.5$</li>
<li>Vector quantization: 2 codebooks × 100 entries each, giving $100^2 = 10^4$ possible code combinations</li>
<li>CTC/attention joint decoding for ASR (beam size 30)</li>
<li>HiFi-GAN vocoder for TTS and SE waveform generation</li>
<li>Parallel WaveGAN vocoder for VC</li>
</ul>
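<p>As a rough sketch of the pre-training schedule above (linear warmup over the first 8% of steps to a peak of $2 \times 10^{-4}$, then linear decay to zero; the actual fairseq scheduler may differ in details such as a non-zero end learning rate):</p>

```python
def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_frac=0.08):
    """Warmup-then-linear-decay schedule: ramp linearly to peak_lr over the
    first warmup_frac of steps, then decay linearly to zero at total_steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # linear decay from peak_lr (end of warmup) down to 0 (end of training)
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

<p>With the paper's 500k pre-training steps, the peak is reached at step 40k and the rate falls back to zero at step 500k.</p>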
<h3 id="fine-tuning-hyperparameters">Fine-Tuning Hyperparameters</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GPUs</th>
          <th>Steps</th>
          <th>Peak LR</th>
          <th>Batch (per GPU)</th>
          <th>Schedule</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ASR (100h)</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>6e-5</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>ASR (960h)</td>
          <td>8×V100</td>
          <td>320k</td>
          <td>1.3e-4</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>TTS</td>
          <td>8×V100</td>
          <td>120k</td>
          <td>4e-4</td>
          <td>45k tokens</td>
          <td>Warmup 10k steps, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>ST</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>-</td>
          <td>-</td>
          <td>Warmup 10k steps</td>
      </tr>
      <tr>
          <td>VC</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>1e-4</td>
          <td>20k tokens</td>
          <td>6k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SE</td>
          <td>8×V100</td>
          <td>100k</td>
          <td>1e-4</td>
          <td>16k tokens</td>
          <td>10k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SID</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>5e-4</td>
          <td>64 segments (3s each)</td>
          <td>Triangular cyclical (1e-8 to 5e-4)</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<ul>
<li>Encoder: 12 Transformer blocks (768-dim, 3072 FFN, 12 heads)</li>
<li>Decoder: 6 Transformer blocks (same dimensions)</li>
<li>Speech-encoder pre-net: 7 conv blocks (512 channels, strides [5,2,2,2,2,2,2], kernels [10,3,3,3,3,2,2])</li>
<li>Code and pre-trained models available at <a href="https://github.com/microsoft/SpeechT5">github.com/microsoft/SpeechT5</a> (MIT license)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/microsoft/SpeechT5">microsoft/SpeechT5</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official Fairseq-based implementation</td>
      </tr>
      <tr>
          <td>Pre-trained models (via repo)</td>
          <td>Model</td>
          <td>MIT</td>
          <td>SpeechT5 BASE encoder-decoder checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/12">LibriSpeech</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>960h speech pre-training and ASR fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/60">LibriTTS</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>460h TTS fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://ict.fbk.eu/must-c/">MUST-C</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-ND-4.0</td>
          <td>Speech translation fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://www.festvox.org/cmu_arctic/">CMU Arctic</a></td>
          <td>Dataset</td>
          <td>Free</td>
          <td>Voice conversion fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://wham.whisper.ai/">WHAM!</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-4.0</td>
          <td>Speech enhancement fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html">VoxCeleb1</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA-4.0</td>
          <td>Speaker identification fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 32 NVIDIA V100 GPUs</li>
<li>Batch: ~90s speech per GPU + 12k text tokens per GPU, gradient accumulation 2</li>
<li>Pre-training steps: 500k</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., Wei, Z., Qian, Y., Li, J., &amp; Wei, F. (2022). SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. <em>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, 5723-5738.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ao2022speecht5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5723--5738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2022.acl-long.393}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>nauty and Traces: Graph Isomorphism Algorithms</title><link>https://hunterheidenreich.com/notes/interdisciplinary/graph-theory/nauty-traces-graph-isomorphism/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/graph-theory/nauty-traces-graph-isomorphism/</guid><description>nauty and Traces use individualization-refinement with search tree pruning for graph isomorphism testing and canonical labeling.</description><content:encoded><![CDATA[<h2 id="a-method-paper-on-practical-graph-isomorphism">A Method Paper on Practical Graph Isomorphism</h2>
<p>This is a <strong>Method</strong> paper that brings the published description of nauty (version 2.5) up to date and introduces Traces (version 2.0), a new program for graph isomorphism testing and canonical labeling. The paper provides a unified theoretical framework for the individualization-refinement paradigm that underpins all leading graph isomorphism programs, then details the distinct implementation strategies of nauty and Traces. Extensive benchmarks compare both programs against saucy, Bliss, and conauto across graph families ranging from easy to extremely difficult.</p>
<h2 id="the-graph-isomorphism-problem-in-practice">The Graph Isomorphism Problem in Practice</h2>
<p>An isomorphism between two graphs is a bijection between their vertex sets that preserves adjacency. The graph isomorphism problem (GI) asks whether such a bijection exists. While GI is in NP, it is neither known to be in co-NP nor proven NP-complete. NP-completeness is considered unlikely, as it would imply collapse of the <a href="https://en.wikipedia.org/wiki/Polynomial_hierarchy">polynomial-time hierarchy</a>. The best proven worst-case running time has stood for three decades at $e^{O(\sqrt{n \log n})}$.</p>
<p>In practice, direct isomorphism testing is poorly suited for common tasks like removing duplicates from large graph collections or looking up graphs in databases. The standard approach is <strong>canonical labeling</strong>: relabeling a graph so that isomorphic graphs become identical after relabeling. This allows sorting algorithms and standard data structures to handle isomorph rejection and retrieval.</p>
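<p>To make the canonical-labeling idea concrete, here is a deliberately naive sketch that defines the canonical form as the lexicographically smallest relabeled edge set. It enumerates all $n!$ permutations, which is exactly the cost that individualization-refinement programs like nauty and Traces avoid:</p>

```python
from itertools import permutations

def canonical_form(n, edges):
    """Toy canonical form of an n-vertex graph: the lexicographically smallest
    sorted edge set over all relabelings. Isomorphic graphs map to the same
    value. Exponential in n -- for illustration only."""
    best = None
    for perm in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((perm[u], perm[v])))
                                 for u, v in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best

# Two labelings of the 3-vertex path collapse to one key; the triangle does not.
path_a = [(0, 1), (1, 2)]
path_b = [(2, 0), (0, 1)]
triangle = [(0, 1), (1, 2), (0, 2)]
seen = {canonical_form(3, g) for g in (path_a, path_b, triangle)}
```

<p>With such a key, isomorph rejection reduces to set or dictionary membership, which is exactly the database workflow the paragraph above describes.</p>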
<p>The dominant practical approach is the <strong>individualization-refinement paradigm</strong>, introduced by Parris and Read (1969) and developed by Corneil and Gotlieb (1970). McKay&rsquo;s nauty (1978, 1980) was the first program to handle both structurally regular graphs with hundreds of vertices and graphs with large <a href="https://en.wikipedia.org/wiki/Automorphism_group">automorphism groups</a>. Its key innovation was using discovered automorphisms to prune the search tree. nauty dominated the field for decades until competitors like saucy (2004), Bliss (2007), and conauto (2009) introduced sparse data structures, early refinement abort, and other improvements.</p>
<h2 id="the-individualization-refinement-framework">The Individualization-Refinement Framework</h2>
<p>The paper provides a general formal framework encompassing all leading graph isomorphism algorithms. The core idea has three components: vertex colorings, a search tree built by individualizing vertices, and pruning via node invariants and automorphisms.</p>
<h3 id="colorings-and-refinement">Colorings and Refinement</h3>
<p>A <strong>colouring</strong> of vertex set $V$ is a surjective function $\pi: V \to \{1, 2, \ldots, k\}$. A colouring is <strong>equitable</strong> if any two vertices of the same colour are adjacent to the same number of vertices of each colour. Given any colouring $\pi$, there exists a unique coarsest equitable colouring $\pi'$ with $\pi' \preceq \pi$ (meaning $\pi'$ is finer than or equal to $\pi$). Computing this equitable refinement is the primary computational bottleneck.</p>
<p><strong>Individualization</strong> gives a single vertex a unique colour, then refines:</p>
<p>$$
I(\pi, v)(w) = \begin{cases} \pi(w), &amp; \text{if } \pi(w) &lt; \pi(v) \text{ or } w = v \\ \pi(w) + 1, &amp; \text{otherwise} \end{cases}
$$</p>
<p>The refinement function $R(G, \pi_0, \nu)$ applies equitable refinement after each individualization step for a sequence of vertices $\nu = (v_1, v_2, \ldots)$.</p>
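<p>The individualization operator transcribes directly into code; a minimal sketch, assuming a colouring stored as a vertex-to-colour dictionary with colours $1, \ldots, k$:</p>

```python
def individualize(pi, v):
    """I(pi, v): give vertex v a colour of its own by shifting the colour of
    every other vertex with colour >= pi[v] up by one, per the definition
    I(pi, v)(w) = pi(w) if pi(w) < pi(v) or w == v, else pi(w) + 1."""
    return {w: c if (c < pi[v] or w == v) else c + 1 for w, c in pi.items()}
```

<p>For example, individualizing <code>a</code> in the colouring <code>{"a": 1, "b": 1, "c": 2}</code> leaves <code>a</code> alone in colour 1 and shifts <code>b</code> and <code>c</code> up, yielding <code>{"a": 1, "b": 2, "c": 3}</code>; individualizing a vertex that is already a singleton changes nothing.</p>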
<h3 id="search-tree-and-canonical-forms">Search Tree and Canonical Forms</h3>
<p>The search tree $\mathcal{T}(G, \pi_0)$ is a rooted tree whose nodes are vertex sequences. Starting from the empty sequence at the root, each node extends the sequence by choosing a vertex from a <strong>target cell</strong> (a non-singleton cell of the current colouring). Leaves correspond to discrete colourings (permutations of $V$).</p>
<p>A <strong>canonical form</strong> is a function $C: \mathcal{G} \times \Pi \to \mathcal{G} \times \Pi$ satisfying:</p>
<ul>
<li>$C(G, \pi) \cong (G, \pi)$ (the canonical form is isomorphic to the input)</li>
<li>$C(G^g, \pi^g) = C(G, \pi)$ for all $g \in S_n$ (label-invariance)</li>
</ul>
<p>The canonical form is computed by finding the leaf $\nu^*$ maximizing the node invariant $\phi(G, \pi_0, \nu)$, then applying the corresponding discrete colouring.</p>
<h3 id="tree-pruning">Tree Pruning</h3>
<p>Three pruning operations keep the search tractable:</p>
<ul>
<li><strong>$P_A(\nu, \nu')$</strong>: Remove subtree at $\nu'$ if $\phi(G, \pi_0, \nu) &gt; \phi(G, \pi_0, \nu')$ (invariant comparison)</li>
<li><strong>$P_B(\nu, \nu')$</strong>: Remove subtree at $\nu'$ if $\phi(G, \pi_0, \nu) \neq \phi(G, \pi_0, \nu')$ (inequivalence)</li>
<li><strong>$P_C(\nu, g)$</strong>: Remove subtree at $\nu^g$ if $g \in \text{Aut}(G, \pi_0)$ and $\nu &lt; \nu^g$ (automorphism pruning)</li>
</ul>
<p>Theorem 5 in the paper guarantees that after any sequence of these pruning operations, at least one canonical leaf survives and the discovered automorphisms generate the full automorphism group.</p>
<h2 id="implementation-nauty-vs-traces">Implementation: nauty vs. Traces</h2>
<p>While both programs operate within the same individualization-refinement framework, their implementation strategies differ substantially.</p>
<h3 id="refinement-strategies">Refinement Strategies</h3>
<p>Both nauty and Traces compute equitable colourings using Algorithm 1, which iteratively splits cells based on adjacency counts. For regular graphs (where all vertices have equal degree), the initial colouring is trivially equitable, making these graphs difficult. nauty addresses this with a library of stronger partitioning functions (e.g., triangle counting), which require user expertise to select. Traces instead uses a richer node invariant that often makes stronger refinements unnecessary.</p>
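<p>As a rough illustration of this cell-splitting process (not the optimized implementations in either program), the coarsest equitable refinement can be sketched as repeatedly splitting any cell whose members have differing neighbour counts into some other cell:</p>

```python
def equitable_refinement(adj, cells):
    """Naive equitable refinement in the style of Algorithm 1.
    adj: vertex -> set of neighbours; cells: ordered partition as a list of
    lists. Splits cells by neighbour counts into each cell until stable."""
    changed = True
    while changed:
        changed = False
        for splitter in list(cells):
            new_cells = []
            for cell in cells:
                # group cell members by number of neighbours in the splitter
                groups = {}
                for v in cell:
                    k = len(adj[v] & set(splitter))
                    groups.setdefault(k, []).append(v)
                if len(groups) > 1:
                    changed = True
                new_cells.extend(groups[k] for k in sorted(groups))
            cells = new_cells
            if changed:
                break  # restart with the refined partition
    return cells
```

<p>On the 4-vertex path this separates endpoints from interior vertices; on a cycle (regular, so trivially equitable) it returns the partition unchanged, which is precisely why regular graphs are hard for plain refinement.</p>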
<h3 id="target-cell-selection">Target Cell Selection</h3>
<p>nauty has two strategies: take the first non-singleton cell regardless of size, or take the first cell with the most non-trivial joins to other cells (a join is non-trivial when the two cells share more than zero but fewer than the maximum possible number of edges). An earlier version of nauty preferred the smallest non-singleton cell, hypothesizing that it would more likely correspond to a group orbit, but experiments showed the first non-singleton cell performs better in most cases. Traces prefers <strong>large</strong> target cells, which produce shallower search trees. Specifically, Traces selects the first largest non-singleton cell that is a subset of the parent node&rsquo;s target cell. If no non-singleton cell qualifies, it falls back to the grandparent node&rsquo;s target cell, and so on.</p>
<h3 id="node-invariants-the-trace">Node Invariants: The Trace</h3>
<p>The most consequential difference is in node invariants. nauty computes a single integer $f(\nu)$ at each node, forming a vector $(f([\nu]_0), f([\nu]_1), \ldots, f(\nu))$ for lexicographic comparison. Traces defines $f(\nu)$ as a <strong>vector</strong> encoding the sizes and positions of cells in the order they are created during refinement. This vector-of-vectors structure (the &ldquo;trace,&rdquo; hence the program&rsquo;s name) enables comparison while refinement is still incomplete. For many difficult graph families, only a fraction of refinement operations need to finish before pruning can occur.</p>
<h3 id="tree-scanning-order">Tree Scanning Order</h3>
<p>This is the fundamental architectural difference. nauty uses <strong>depth-first</strong> search, keeping the lexicographically least leaf $\nu_1$ and the leaf $\nu^*$ with the greatest invariant discovered so far. Pruning applies when a node&rsquo;s invariant matches neither.</p>
<p>Traces uses <strong>breadth-first</strong> search, processing all nodes at each level $k$ and retaining only those with the greatest invariant value. By property $(\phi 1)$, the best nodes at level $k$ are children of the best nodes at level $k-1$, so no backtracking is needed. This maximizes pruning operation $P_A$.</p>
<p>To compensate for the fact that breadth-first search delays automorphism discovery (which requires leaves), Traces generates <strong>experimental paths</strong>: random paths from each node down to a leaf. Random experimental paths tend to find automorphisms generating larger subgroups, making more of the group available early for pruning. Both programs maintain discovered automorphisms using the <a href="https://en.wikipedia.org/wiki/Schreier%E2%80%93Sims_algorithm">random Schreier method</a> for efficient orbit computation.</p>
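<p>Discovered automorphisms feed into pruning $P_C$ mainly through the orbits they induce: vertices in the same orbit need not all be explored. A minimal union-find sketch of that orbit computation, standing in for the random Schreier machinery the real programs use:</p>

```python
from collections import defaultdict

def orbits(n, generators):
    """Vertex orbits under the group generated by `generators` (each
    permutation given as a sequence mapping i -> g(i)), via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # union each vertex with its image under every generator
    for g in generators:
        for i in range(n):
            ri, rg = find(i), find(g[i])
            if ri != rg:
                parent[ri] = rg
    classes = defaultdict(list)
    for i in range(n):
        classes[find(i)].append(i)
    return sorted(classes.values())
```

<p>For the 4-vertex path, the single reversal automorphism $(0\,3)(1\,2)$ yields orbits $\{0,3\}$ and $\{1,2\}$; adding a 4-cycle rotation fuses everything into one orbit.</p>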
<h3 id="low-degree-vertex-handling">Low-Degree Vertex Handling</h3>
<p>Traces includes special handling for vertices of degree 0, 1, 2, or $n-1$. After the initial refinement, vertices with equal colours also have equal degrees. The target cell selector never selects cells containing vertices of these low degrees, and nodes whose non-trivial cells consist only of such vertices are not expanded further. Instead, special-purpose code produces generators for the automorphism group fixed by that node and, if needed, a unique discrete colouring. This technique is effective for graphs with many small components and tree-like structures (as in constraint satisfaction problems), though the authors note that such graphs could also benefit from preprocessing that factors out tree-like appendages and replaces vertices with identical neighborhoods.</p>
<h3 id="automorphism-detection">Automorphism Detection</h3>
<p>Beyond leaf comparison, saucy introduced early detection of automorphisms higher in the search tree by checking whether partial mappings between equivalent colourings extend trivially. Traces extends this idea with a heuristic that attempts non-trivial extensions. When computing only the automorphism group (not canonical labeling), Traces employs a strategy where it finds all discrete children of one node and then checks each remaining node for a single matching discrete child, further reducing search effort.</p>
<h2 id="performance-benchmarks">Performance Benchmarks</h2>
<p>The authors compare nauty 2.5, Traces 2.0, saucy 3.0, Bliss 0.72, and conauto 2.0.1 on a MacBook Pro with a 2.66 GHz Intel i7 processor. All graphs were randomly labeled before processing to avoid artifacts from input ordering. The benchmark covers both automorphism group computation and canonical labeling.</p>
<table>
  <thead>
      <tr>
          <th>Graph Family</th>
          <th>Best Program(s)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Random graphs ($p = 1/2$)</td>
          <td>nauty, Traces</td>
          <td>All programs fast; easy class</td>
      </tr>
      <tr>
          <td>Random graphs ($p = n^{-1/2}$)</td>
          <td>nauty</td>
          <td>Sparse random graphs</td>
      </tr>
      <tr>
          <td>Random cubic graphs</td>
          <td>nauty (with invariant)</td>
          <td>nauty benefits from distance invariant</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Hypercube_graph">Hypercubes</a></td>
          <td>Traces</td>
          <td>Vertex-transitive; Traces dramatically faster</td>
      </tr>
      <tr>
          <td>Misc. vertex-transitive</td>
          <td>Traces</td>
          <td>Large automorphism groups</td>
      </tr>
      <tr>
          <td>Unions of tripartite graphs</td>
          <td>conauto, Bliss</td>
          <td>Special handling for disjoint components</td>
      </tr>
      <tr>
          <td>Small strongly-regular graphs</td>
          <td>Traces, nauty</td>
          <td>Both competitive</td>
      </tr>
      <tr>
          <td>Large strongly-regular graphs</td>
          <td>Traces</td>
          <td>Orders of magnitude faster</td>
      </tr>
      <tr>
          <td>Hadamard matrix graphs</td>
          <td>Traces</td>
          <td>Among the hardest known classes</td>
      </tr>
      <tr>
          <td>Random trees</td>
          <td>nauty</td>
          <td>Low-degree preprocessing helps</td>
      </tr>
      <tr>
          <td>Cai-Furer-Immerman graphs</td>
          <td>Traces</td>
          <td>Designed to defeat refinement; Traces still efficient</td>
      </tr>
      <tr>
          <td>Miyazaki graphs</td>
          <td>Traces</td>
          <td>Another hard class; dramatic advantage</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Projective_plane">Projective planes</a> (order 16)</td>
          <td>Traces</td>
          <td>Large automorphism groups on bipartite graphs</td>
      </tr>
      <tr>
          <td>Combinatorial graphs</td>
          <td>Mixed</td>
          <td>Performance varies by instance; Traces generally competitive</td>
      </tr>
  </tbody>
</table>
<p>The results show that nauty is generally fastest for small graphs and some easier families, while Traces dominates on most difficult graph classes, sometimes by orders of magnitude. The breadth-first tree scanning strategy of Traces, combined with its richer node invariant, provides the largest gains on graphs with complex symmetry structure (<a href="https://en.wikipedia.org/wiki/Strongly_regular_graph">strongly-regular graphs</a>, <a href="https://en.wikipedia.org/wiki/Hadamard_matrix">Hadamard matrix</a> graphs, <a href="https://en.wikipedia.org/wiki/Vertex-transitive_graph">vertex-transitive graphs</a>). The exception is graph families with many disjoint or minimally-overlapping components, where conauto and Bliss have specialized handling that nauty and Traces lack.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>The paper establishes several findings:</p>
<ol>
<li>The breadth-first tree scanning approach in Traces, combined with experimental paths for early automorphism discovery, provides large efficiency gains on difficult graph classes.</li>
<li>Traces&rsquo; richer node invariant (the trace) enables early pruning during incomplete refinement, reducing dependence on user-selected invariant functions compared to nauty.</li>
<li>No single program dominates all graph classes. nauty remains preferred for mass processing of small graphs.</li>
<li>The random Schreier method for maintaining the automorphism group is effective in both programs, enabling more complete pruning via orbit computation.</li>
</ol>
<p>Limitations acknowledged by the authors include: nauty and Traces lack specialized code for graphs consisting of disjoint or minimally-overlapping components (where conauto and Bliss excel), and the choice of refinement function in nauty still requires user expertise for certain difficult graph classes.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmarking</td>
          <td>Bliss test collection</td>
          <td>Multiple families</td>
          <td>Graphs ranging from easy to very difficult</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>nauty/Traces website collection</td>
          <td>Multiple families</td>
          <td>All test graphs available at the project website</td>
      </tr>
  </tbody>
</table>
<p>All test graphs are publicly available at the nauty and Traces website. Graphs were randomly labeled before processing to avoid non-typical behavior from input labeling.</p>
<h3 id="algorithms">Algorithms</h3>
<p>The core algorithms are described formally with proofs of correctness (Theorem 5 guarantees pruning validity). Key implementation choices:</p>
<ul>
<li><strong>Refinement</strong>: Equitable colouring via Algorithm 1 (iterated cell splitting by adjacency counts)</li>
<li><strong>Target cell selection</strong>: nauty uses first non-singleton or most non-trivially joined cell; Traces uses first largest cell within parent&rsquo;s target</li>
<li><strong>Tree scanning</strong>: nauty uses depth-first; Traces uses breadth-first with experimental paths</li>
<li><strong>Group maintenance</strong>: Random Schreier method for orbit computation in both programs</li>
</ul>
<h3 id="software">Software</h3>
<table>
  <thead>
      <tr>
          <th>Program</th>
          <th>Version</th>
          <th>Canonical Labeling</th>
          <th>Open Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>nauty</td>
          <td>2.5</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Traces</td>
          <td>2.0</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>saucy</td>
          <td>3.0</td>
          <td>No (v3.0)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Bliss</td>
          <td>0.72</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>conauto</td>
          <td>2.0.1</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://pallini.di.uniroma1.it/">nauty and Traces</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official distribution (v2.9.3 as of 2026); includes gtools graph utilities</td>
      </tr>
      <tr>
          <td><a href="http://pallini.di.uniroma1.it/">Test graphs</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>All benchmark graphs from the paper, available at the project website</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Benchmarks run on a MacBook Pro with 2.66 GHz Intel i7 processor, compiled with gcc 4.7, single-threaded execution.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McKay, B. D., &amp; Piperno, A. (2013). Practical graph isomorphism, II. <em>Journal of Symbolic Computation</em>, 60, 94-112.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mckay2013practical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Practical graph isomorphism, {II}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{McKay, Brendan D. and Piperno, Adolfo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Symbolic Computation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{94--112}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier BV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.jsc.2013.09.003}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Complexity from the GDB Chemical Space</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/gdb-molecular-complexity/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/gdb-molecular-complexity/</guid><description>Buehler &amp; Reymond propose MC1 and MC2, simple graph-based molecular complexity measures derived from GDB chemical space enumeration.</description><content:encoded><![CDATA[<h2 id="molecular-complexity-as-branching-in-the-molecular-graph">Molecular Complexity as Branching in the Molecular Graph</h2>
<p>This paper proposes two simple, interpretable measures of molecular complexity grounded in the observation that most GDB-enumerated molecules are synthetically challenging despite containing only standard functional groups and ring systems. The core insight is that branching points (non-divalent nodes) in the molecular graph correspond to synthesis difficulty: each additional branching point implies a new ring or substituent requiring extra synthetic steps, possible protecting groups, potential stereogenic centers, and increased steric hindrance.</p>
<h2 id="motivation-why-most-gdb-molecules-are-hard-to-make">Motivation: Why Most GDB Molecules Are Hard to Make</h2>
<p>The Generated DataBases (GDBs) enumerate billions of hypothetical small organic molecules by exhaustively substituting atoms and bonds in mathematical graphs. Despite applying filters for ring strain, functional group diversity, <a href="/notes/chemistry/datasets/fdb-17/">fragment-likeness</a>, drug-likeness, and ChEMBL-likeness, most enumerated molecules remain daunting to synthesize. Even in the most restrictive subset (GDB-13s, 99.4 million molecules from the 977 million in GDB-13), practical synthesis remains challenging for most entries. This motivated the search for a complexity measure that captures why these molecules are hard, without relying on reaction databases or machine learning.</p>
<h2 id="mc1-and-mc2-two-graph-based-complexity-measures">MC1 and MC2: Two Graph-Based Complexity Measures</h2>
<p>The two proposed measures are:</p>
<p><strong>MC1</strong> (size-independent): the fraction of non-divalent nodes in the molecular graph.</p>
<p>$$
\text{MC1} = 1 - \text{FDV}
$$</p>
<p>where FDV is the fraction of divalent nodes (e.g., $-\text{CH}_2-$, $=\text{CH}-$, $=\text{C}=$, $-\text{O}-$, $-\text{NH}-$, $=\text{N}-$, $-\text{S}-$) in the molecular graph. The graph is computed by treating the molecule as if all bonds were single and all heavy atoms were carbon. MC1 is independent of molecule size, making it useful for comparing molecules of different sizes.</p>
<p><strong>MC2</strong> (size-dependent): the count of non-divalent nodes, excluding carbonyl carbons in standard carboxyl derivatives.</p>
<p>$$
\text{MC2} = \text{NDV}
$$</p>
<p>where NDV is the number of non-divalent nodes, not counting $\text{C}{=}\text{O}$ in $(\text{X}-\text{C}{=}\text{O})$ for $\text{X} = \text{N}$ or $\text{O}$ (acids, esters, amides, carbonates, carbamates, ureas). MC2 grows with molecule size only when branching increases. Linear extensions (adding divalent atoms to chains or enlarging rings) do not increase MC2.</p>
<p>The rationale for excluding carboxyl groups from MC2 is that their chemistry (amide bond formation, esterification) is well-established and straightforward. Functional groups like amidines, guanidines, thioesters, thiones, sulfoxides, sulfinates, sulfones, and sulfonamides, as well as phosphorus-containing groups, are still counted because their synthesis is less routine.</p>
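<p>A minimal sketch of both measures on a bare heavy-atom adjacency list; note that it deliberately omits the carboxyl-carbonyl exclusion in MC2, which requires element and bond-order information beyond the plain graph:</p>

```python
def mc1_mc2(adj):
    """MC1 and (simplified) MC2 from a heavy-atom adjacency list
    (vertex -> set of neighbours). MC1 = 1 - FDV, the fraction of
    non-divalent (degree != 2) nodes; MC2 here is the raw non-divalent
    count NDV, without the paper's carboxyl-carbonyl exclusion."""
    non_divalent = sum(1 for v in adj if len(adj[v]) != 2)
    return non_divalent / len(adj), non_divalent
```

<p>Cyclohexane (a 6-ring of divalent carbons) scores $\text{MC1} = 0$, $\text{MC2} = 0$; attaching one methyl group creates two non-divalent nodes (the branch carbon and the terminal methyl), giving $\text{MC1} = 2/7 \approx 0.29$ and $\text{MC2} = 2$, matching the intuition that each branch point adds synthetic work.</p>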
<h2 id="design-choices-and-limitations">Design Choices and Limitations</h2>
<p>MC1 and MC2 deliberately do not distinguish between $\text{sp}^2$ and $\text{sp}^3$ branching points or count chiral centers. This choice is motivated by the observation that unusual substitution patterns on aromatic rings in GDB molecules are also synthetically difficult, and that functionalization of aromatic/heteroaromatic rings and control of <a href="https://en.wikipedia.org/wiki/Atropisomer">atropisomerism</a> in biaryls are both challenging. A consequence is that carbohydrates and polyphenols receive high complexity scores despite being abundant in biomass.</p>
<p>MC1 gives uninformative values for very small molecules (trifluoroacetic acid and tert-butanol both score $\text{MC1} = 1$) and for polymers (where the repeating unit dominates). MC2 similarly cannot give useful values for polymers due to its size dependence.</p>
<h2 id="comparison-with-existing-complexity-measures">Comparison with Existing Complexity Measures</h2>
<p>The authors compare MC1 and MC2 against six molecular complexity scores and two synthetic accessibility scores across four databases: GDB-13s, <a href="/notes/chemistry/datasets/zinc-22/">ZINC</a>, ChEMBL, and COCONUT.</p>
<table>
  <thead>
      <tr>
          <th>Measure</th>
          <th>Category</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCFP4</td>
          <td>Complexity</td>
          <td>Number of on-bits in a binary 2048-bit FCFP4 fingerprint</td>
      </tr>
      <tr>
          <td>DataWarrior</td>
          <td>Complexity</td>
          <td>Fractal complexity via Minkowski-Bouligand (box-counting) dimension of distinct substructures up to 7 bonds</td>
      </tr>
      <tr>
          <td>Böttcher</td>
          <td>Complexity</td>
          <td>Shannon entropy using additive atom contributions (valence electrons, atom environment, chirality, symmetry)</td>
      </tr>
      <tr>
          <td>Proudfoot</td>
          <td>Complexity</td>
          <td>Shannon entropy using additive atom contributions (atomic number, connections, paths up to length 2)</td>
      </tr>
      <tr>
          <td>SPS/nSPS</td>
          <td>Complexity</td>
          <td>Spacial score summing heavy atom contributions (hybridization, stereochemistry, nonaromaticity, neighbor count); nSPS normalizes by HAC</td>
      </tr>
      <tr>
          <td>SAscore</td>
          <td>Synthesizability</td>
          <td>Fragment frequency from PubChem combined with complexity penalty (ring types, stereochemistry, size)</td>
      </tr>
      <tr>
          <td>SCS</td>
          <td>Synthesizability</td>
          <td>Machine-learned score from 12 million Reaxys reactions predicting synthesis steps from ECFP4 fingerprint (max value 5)</td>
      </tr>
  </tbody>
</table>
<p>Key findings from the correlation analysis:</p>
<ul>
<li>For GDB-13s (where nearly all molecules have HAC = 13), complexity measures generally do not correlate with each other ($r^2 &lt; 0.6$), except MC1 with MC2 and SPS with nSPS (expected, since each pair differs only in size normalization).</li>
<li>For ZINC, ChEMBL, and COCONUT (spanning a broad range of molecular sizes), several complexity measures correlate with heavy atom count (HAC) and therefore with each other.</li>
<li>Size-independent measures (DataWarrior, nSPS, SCS, SAscore, MC1) are unaffected by molecule size across datasets, while Böttcher and Proudfoot scores are strongly size-dependent. FCFP4 and SPS show partial size dependence.</li>
<li>SPS and nSPS also correlate with SAscore.</li>
</ul>
<p>The analysis is supported by interactive TMAP visualizations (tree-maps organized by MAP4C molecular fingerprint similarity) for 30,000 random molecules from each database, color-coded by each complexity measure. The interactive TMAPs are available online for <a href="https://tm.gdb.tools/MAP4C/GDB-13s_complexity">GDB-13s</a>, <a href="https://tm.gdb.tools/MAP4C/ZINC_complexity">ZINC</a>, <a href="https://tm.gdb.tools/MAP4C/ChEMBL_complexity">ChEMBL</a>, and <a href="https://tm.gdb.tools/MAP4C/COCONUT_complexity">COCONUT</a>.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Ye-Buehler/Molecular_Complexity">Molecular_Complexity</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Python implementation of MC1, MC2, and eight comparison metrics with Jupyter notebooks</td>
      </tr>
  </tbody>
</table>
<p>The paper is open access (hybrid). The GitHub repository provides Python code for computing MC1 and MC2 along with Jupyter notebooks demonstrating all ten complexity and synthesizability measures from Table 1. The four databases used (GDB-13s, ZINC, ChEMBL, COCONUT) are all publicly available. No model training or specialized hardware is involved, as MC1 and MC2 are deterministic graph computations.</p>
<p><strong>Reproducibility status</strong>: Highly Reproducible.</p>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Journal</strong>: Journal of Chemical Information and Modeling, Vol. 65, No. 16, pp. 8405-8410</li>
<li><strong>Published</strong>: May 15, 2025</li>
<li><strong>Part of</strong>: Special issue &ldquo;Chemical Compound Space Exploration by Multiscale High-Throughput Screening and Machine Learning&rdquo;</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{buehler2025view,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A View on Molecular Complexity from the GDB Chemical Space}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Buehler, Ye and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{65}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8405--8410}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.5c00334}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LSTNet: Long- and Short-Term Time Series Network</title><link>https://hunterheidenreich.com/notes/time-series/lstnet-multivariate-time-series/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/time-series/lstnet-multivariate-time-series/</guid><description>LSTNet combines CNNs, recurrent-skip connections, and autoregressive models to capture both short-term and long-term patterns in multivariate time series.</description><content:encoded><![CDATA[<h2 id="a-deep-learning-framework-for-multivariate-forecasting">A Deep Learning Framework for Multivariate Forecasting</h2>
<p>This is a <strong>Method</strong> paper that introduces the Long- and Short-term Time-series Network (LSTNet), a deep learning architecture specifically designed for multivariate time series forecasting. LSTNet combines convolutional neural networks (CNNs), recurrent neural networks (RNNs) with a novel skip-connection structure, and a traditional autoregressive (AR) component into a unified framework. The architecture targets the challenge of simultaneously capturing both short-term local dependencies and long-term periodic patterns in temporal data.</p>
<h2 id="why-short-term-and-long-term-patterns-need-separate-treatment">Why Short-Term and Long-Term Patterns Need Separate Treatment</h2>
<p>Real-world multivariate time series often exhibit a mixture of repeating patterns at different time scales. Highway traffic, for example, shows daily peaks (morning vs. evening commutes) alongside weekly patterns (weekday vs. weekend behavior). Solar energy output varies with cloud movements on short time scales and with seasonal daylight changes on longer ones. Electricity consumption follows similar daily and weekly cycles.</p>
<p>Traditional autoregressive methods (<a href="https://en.wikipedia.org/wiki/Vector_autoregression">VAR</a>, <a href="https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average">ARIMA</a>) and <a href="https://en.wikipedia.org/wiki/Gaussian_process">Gaussian Process</a> models struggle to distinguish and jointly model these two kinds of recurring patterns. Standard RNNs, including LSTM and <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> variants, theoretically handle long-range dependencies but in practice suffer from <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">gradient vanishing</a> when the period length is large (e.g., 24 hours at hourly resolution, or 168 time steps for weekly patterns). The authors also identify a scale sensitivity problem: neural network models can fail when the magnitude of the input signal changes in non-periodic ways, such as sudden shifts in electricity consumption due to holidays or weather events.</p>
<h2 id="combining-cnns-recurrent-skip-connections-and-autoregression">Combining CNNs, Recurrent-Skip Connections, and Autoregression</h2>
<p>The LSTNet architecture consists of four main components that work together.</p>
<h3 id="convolutional-component">Convolutional Component</h3>
<p>The first layer applies 1D convolution without pooling across the multivariate input. Each filter has width $\omega$ (in the time dimension) and height $n$ (spanning all variables), producing feature maps that capture short-term local dependency patterns among variables:</p>
<p>$$h_k = \text{RELU}(W_k * X + b_k)$$</p>
<p>where $*$ denotes convolution and the input is zero-padded so each output vector has length $T$. The output is a $d_c \times T$ matrix where $d_c$ is the number of filters.</p>
<h3 id="recurrent-component">Recurrent Component</h3>
<p>The CNN output feeds into a GRU-based recurrent layer that uses RELU (rather than the standard tanh) as the hidden update activation:</p>
<p>$$\begin{aligned}
r_t &amp;= \sigma(x_t W_{xr} + h_{t-1} W_{hr} + b_r) \\
u_t &amp;= \sigma(x_t W_{xu} + h_{t-1} W_{hu} + b_u) \\
c_t &amp;= \text{RELU}(x_t W_{xc} + r_t \odot (h_{t-1} W_{hc}) + b_c) \\
h_t &amp;= (1 - u_t) \odot h_{t-1} + u_t \odot c_t
\end{aligned}$$</p>
<h3 id="recurrent-skip-component">Recurrent-Skip Component</h3>
<p>The key architectural innovation is a recurrent structure with temporal skip connections. Instead of connecting to the immediately preceding hidden state $h_{t-1}$, skip links connect to the hidden state from $p$ steps ago ($h_{t-p}$), where $p$ corresponds to the period length of the data (e.g., $p = 24$ for hourly data with daily periodicity):</p>
<p>$$\begin{aligned}
r_t &amp;= \sigma(x_t W_{xr} + h_{t-p} W_{hr} + b_r) \\
u_t &amp;= \sigma(x_t W_{xu} + h_{t-p} W_{hu} + b_u) \\
c_t &amp;= \text{RELU}(x_t W_{xc} + r_t \odot (h_{t-p} W_{hc}) + b_c) \\
h_t &amp;= (1 - u_t) \odot h_{t-p} + u_t \odot c_t
\end{aligned}$$</p>
<p>This design shortens the effective path length for learning periodic dependencies, making optimization easier. A dense layer combines outputs from both recurrent components:</p>
<p>$$h_t^D = W^R h_t^R + \sum_{i=0}^{p-1} W_i^S h_{t-i}^S + b$$</p>
<h3 id="temporal-attention-alternative">Temporal Attention Alternative</h3>
<p>For datasets without clear periodicity, LSTNet offers an attention-based variant (LSTNet-Attn) as an alternative to the recurrent-skip component. The attention mechanism learns to weight hidden representations across the input window adaptively. The attention weights $\alpha_t \in \mathbb{R}^q$ at time $t$ are computed as:</p>
<p>$$\alpha_t = \text{AttnScore}(H_t^R, h_{t-1}^R)$$</p>
<p>where $H_t^R = [h_{t-q}^R, \dots, h_{t-1}^R]$ stacks the RNN hidden representations column-wise and AttnScore is a similarity function (dot product, cosine, or a parameterized MLP). The weighted context vector and final output are:</p>
<p>$$\begin{aligned}
c_t &amp;= H_t \alpha_t \\
h_t^D &amp;= W[c_t; h_{t-1}^R] + b
\end{aligned}$$</p>
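<p>One common reading of this variant, sketched in numpy below, uses dot-product scores softmaxed into weights; the softmax normalization is an assumption on my part, since the text only specifies AttnScore as a similarity function.</p>

```python
import numpy as np

# Sketch of the attention alternative: score each of the q past hidden
# states against the current one, form weights, and mix a context vector.

def attn_context(H, h_last):
    """H: (d, q) past hidden states as columns; h_last: (d,)."""
    scores = H.T @ h_last                 # (q,) dot-product AttnScore
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # attention weights alpha_t (assumed softmax)
    return H @ alpha                      # context vector c_t

def attn_output(H, h_last, W_out, b):
    """Final output W[c_t; h_{t-1}] + b over the concatenated pair."""
    c = attn_context(H, h_last)
    return W_out @ np.concatenate([c, h_last]) + b
```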
<h3 id="autoregressive-component">Autoregressive Component</h3>
<p>To address the scale insensitivity of neural networks, LSTNet adds a classical autoregressive model in parallel:</p>
<p>$$h_{t,i}^L = \sum_{k=0}^{q^{ar}-1} W_k^{ar} y_{t-k,i} + b^{ar}$$</p>
<p>The final prediction integrates both the neural network and AR outputs:</p>
<p>$$\hat{Y}_t = h_t^D + h_t^L$$</p>
<p>This decomposition separates the prediction into a linear part (handling local scale changes) and a non-linear part (capturing recurring patterns).</p>
<h3 id="objective-function">Objective Function</h3>
<p>LSTNet supports two loss functions, selected via validation performance. The default is the squared (L2) loss:</p>
<p>$$\underset{\Theta}{\text{minimize}} \sum_{t \in \Omega_{\text{Train}}} \left| Y_t - \hat{Y}_{t-h} \right|_F^2$$</p>
<p>Motivated by the strong performance of Linear SVR baselines, LSTNet also supports the absolute (L1) loss, which is more robust to anomalies in real time series data:</p>
<p>$$\underset{\Theta}{\text{minimize}} \sum_{t \in \Omega_{\text{Train}}} \sum_{i=0}^{n-1} \left| Y_{t,i} - \hat{Y}_{t-h,i} \right|$$</p>
<p>where $\Theta$ is the full parameter set, $\Omega_{\text{Train}}$ is the set of training time stamps, $|\cdot|_F$ is the Frobenius norm, and $h$ is the forecast horizon.</p>
<h2 id="evaluation-on-four-benchmark-datasets">Evaluation on Four Benchmark Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Length</th>
          <th>Variables</th>
          <th>Sample Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Traffic</td>
          <td>17,544</td>
          <td>862</td>
          <td>1 hour</td>
      </tr>
      <tr>
          <td>Solar-Energy</td>
          <td>52,560</td>
          <td>137</td>
          <td>10 minutes</td>
      </tr>
      <tr>
          <td>Electricity</td>
          <td>26,304</td>
          <td>321</td>
          <td>1 hour</td>
      </tr>
      <tr>
          <td>Exchange-Rate</td>
          <td>7,588</td>
          <td>8</td>
          <td>1 day</td>
      </tr>
  </tbody>
</table>
<p>All datasets are split 60/20/20 (train/validation/test) in chronological order. Traffic, Solar-Energy, and Electricity exhibit clear periodic patterns (daily and weekly), while Exchange-Rate shows only short-term local continuity.</p>
<h3 id="baselines">Baselines</h3>
<p>The authors compare against seven methods: AR (univariate autoregression), LRidge (VAR with L2 regularization), LSVR (VAR with SVR objective), TRMF (temporal regularized matrix factorization), GP (Gaussian Process), VAR-MLP (hybrid MLP-autoregressive), and RNN-GRU (standard GRU).</p>
<h3 id="metrics">Metrics</h3>
<p>Two evaluation metrics are used:</p>
<ul>
<li><strong>Root Relative Squared Error (RSE)</strong> (lower is better): A scaled RMSE that normalizes by the standard deviation of the test data, making comparisons meaningful across datasets regardless of data scale:</li>
</ul>
<p>$$\text{RSE} = \frac{\sqrt{\sum_{(i,t) \in \Omega_{\text{Test}}} (Y_{it} - \hat{Y}_{it})^2}}{\sqrt{\sum_{(i,t) \in \Omega_{\text{Test}}} (Y_{it} - \text{mean}(Y))^2}}$$</p>
<ul>
<li><strong>Empirical Correlation Coefficient (CORR)</strong> (higher is better): The average Pearson correlation between predicted and true time series across all $n$ variables:</li>
</ul>
<p>$$\text{CORR} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_t (Y_{it} - \text{mean}(Y_i))(\hat{Y}_{it} - \text{mean}(\hat{Y}_i))}{\sqrt{\sum_t (Y_{it} - \text{mean}(Y_i))^2 \sum_t (\hat{Y}_{it} - \text{mean}(\hat{Y}_i))^2}}$$</p>
<h3 id="main-results">Main Results</h3>
<p>The models are evaluated at horizons $h \in \{3, 6, 12, 24\}$, corresponding to 3-24 hours for Traffic and Electricity, 30-240 minutes for Solar-Energy, and 3-24 days for Exchange-Rate.</p>
<p>LSTNet-Skip achieved the best result in 17 out of 32 (dataset, metric, horizon) combinations, and LSTNet-Attn won 7 more. No other method won more than 3. At horizon 24, the best LSTNet variant improved over RNN-GRU by 9.2% RSE on Solar-Energy (LSTNet-Attn), 11.7% on Traffic (LSTNet-Skip), and 22.2% on Electricity (LSTNet-Skip). On the Exchange-Rate dataset, which lacks periodic patterns, LSTNet performed comparably to AR and LRidge, as expected.</p>
<h3 id="ablation-study">Ablation Study</h3>
<p>Removing each component individually revealed:</p>
<ul>
<li><strong>Without AR</strong>: The largest performance drops across most datasets, confirming the AR component&rsquo;s role in handling scale changes. Visualization showed that LSTNet-Skip successfully tracks sudden magnitude shifts in electricity consumption around the 1000th hour, while the model without AR fails.</li>
<li><strong>Without Skip/CNN</strong>: Significant drops on datasets with periodic patterns, though less consistent than removing AR.</li>
<li><strong>Full LSTNet</strong>: The most robust configuration across all datasets and horizons.</li>
</ul>
<p>A simulation experiment with synthetic autoregressive data confirmed that standard RNN-GRU fails to track non-periodic scale changes, while LSTNet with its AR component adapts properly.</p>
<h2 id="robust-performance-through-architectural-complementarity">Robust Performance Through Architectural Complementarity</h2>
<p>LSTNet&rsquo;s main strength is the complementarity of its components. The CNN captures short-term local patterns, the recurrent-skip layer captures long-term periodic dependencies, and the AR component provides robustness to scale changes. On datasets with strong periodicity (Traffic, Solar-Energy, Electricity), the skip connections provide large gains. On datasets without periodicity (Exchange-Rate), the AR component prevents degradation below competitive baselines.</p>
<p>The primary limitation is that the skip length $p$ in the recurrent-skip component must be manually specified or tuned. For datasets with known periodicity (e.g., hourly data with daily cycles), $p$ is straightforward to set. For datasets without clear periodicity, $p$ must be tuned as a hyperparameter, and the attention-based variant (LSTNet-Attn) offers an alternative that avoids this requirement. Future work directions include automatic period detection and incorporating variable-level attribute information into the convolutional layer.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>Traffic</td>
          <td>17,544 x 862</td>
          <td>California DoT highway occupancy, hourly, 2015-2016</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Solar-Energy</td>
          <td>52,560 x 137</td>
          <td>Solar power from 137 PV plants in Alabama, 10-min intervals, 2006</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Electricity</td>
          <td>26,304 x 321</td>
          <td>kWh consumption for 321 clients, hourly, 2012-2014</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Exchange-Rate</td>
          <td>7,588 x 8</td>
          <td>Daily exchange rates for 8 countries, 1990-2016</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available via the <a href="https://github.com/laiguokun/LSTNet">GitHub repository</a>.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam</li>
<li>Dropout: 0.1 or 0.2 after each layer except input and output</li>
<li>Window size $q$: grid search over $\{2^0, 2^1, \ldots, 2^9\}$</li>
<li>Skip length $p$: set to 24 for Traffic/Electricity; tuned from $2^1$ to $2^6$ for Solar-Energy/Exchange-Rate</li>
<li>Objective: L2 loss (Eq. 7) or L1 loss (Eq. 9), selected via validation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Hidden dimensions (Recurrent/CNN): $\{50, 100, 200\}$</li>
<li>Hidden dimensions (Recurrent-skip): $\{20, 50, 100\}$</li>
<li>AR regularization: $\{0.1, 1, 10\}$</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best LSTNet RSE</th>
          <th>Baseline (RNN-GRU)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Solar-Energy (h=24)</td>
          <td>0.4403 (Attn)</td>
          <td>0.4852</td>
          <td>9.2%</td>
      </tr>
      <tr>
          <td>Traffic (h=24)</td>
          <td>0.4973 (Skip)</td>
          <td>0.5633</td>
          <td>11.7%</td>
      </tr>
      <tr>
          <td>Electricity (h=24)</td>
          <td>0.1007 (Skip)</td>
          <td>0.1295</td>
          <td>22.2%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/laiguokun/LSTNet">LSTNet (laiguokun/LSTNet)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation (Python 2.7, PyTorch 0.3.0)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/laiguokun/multivariate-time-series-data">Multivariate Time Series Data (laiguokun/multivariate-time-series-data)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Preprocessed benchmark datasets (Traffic, Solar-Energy, Electricity, Exchange-Rate)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code and all four benchmark datasets are publicly available. Hyperparameter search ranges are fully specified.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lai, G., Chang, W.-C., Yang, Y., &amp; Liu, H. (2018). Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. <em>The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval (SIGIR &lsquo;18)</em>, 95-104. <a href="https://doi.org/10.1145/3209978.3210006">https://doi.org/10.1145/3209978.3210006</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lai2018modeling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lai, Guokun and Chang, Wei-Cheng and Yang, Yiming and Liu, Hanxiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The 41st International ACM SIGIR Conference on Research \&amp; Development in Information Retrieval}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{95--104}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3209978.3210006}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DGCNN: Dynamic Graph CNN for Point Cloud Learning</title><link>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/dgcnn-dynamic-graph-point-clouds/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/dgcnn-dynamic-graph-point-clouds/</guid><description>EdgeConv module learns point cloud features on dynamically recomputed k-NN graphs in feature space, achieving strong classification and segmentation results.</description><content:encoded><![CDATA[<h2 id="a-general-purpose-edge-convolution-module-for-point-cloud-learning">A General-Purpose Edge Convolution Module for Point Cloud Learning</h2>
<p>This is a <strong>Method</strong> paper that introduces EdgeConv, a neural network module for learning on point clouds. The key idea is to construct a local graph structure and define convolution-like operations over edges connecting neighboring points. Unlike prior <a href="/notes/machine-learning/model-architectures/relational-inductive-biases-deep-learning-graph-networks/">graph neural network approaches</a> that operate on a fixed graph, DGCNN (Dynamic Graph CNN) recomputes the graph at each layer using k-nearest neighbors in feature space. This dynamic graph update allows the network to learn semantic groupings that differ from spatial proximity, enabling information propagation across long distances in the original point cloud. The model achieves strong results on classification (ModelNet40), part segmentation (ShapeNetPart), and semantic segmentation (S3DIS) benchmarks.</p>
<h2 id="why-point-clouds-need-topology-recovery">Why Point Clouds Need Topology Recovery</h2>
<p>Point clouds are the raw output of most 3D acquisition devices (<a href="https://en.wikipedia.org/wiki/Lidar">LiDAR</a>, stereo reconstruction) and serve as the simplest geometric representation for countless applications in graphics, robotics, and autonomous driving. However, point clouds inherently lack topological information: they are unordered sets of points with no connectivity structure.</p>
<p>Standard CNNs require grid-structured input, making them incompatible with irregular point cloud data. Volumetric approaches that discretize point clouds onto 3D grids introduce quantization artifacts and excessive memory usage. PointNet addressed this by operating on each point independently and aggregating with a symmetric function (max pooling), achieving permutation invariance. However, this independence means PointNet cannot capture local geometric structure.</p>
<p>PointNet++ partially addresses this by applying PointNet hierarchically in local neighborhoods, but it constructs neighborhoods based on Euclidean distances in the input space and does not update the graph structure during processing. The fundamental limitation is that treating points independently, even locally, prevents the model from learning the geometric relationships between points that carry important structural and semantic information.</p>
<h2 id="edgeconv-combining-local-geometry-with-global-structure">EdgeConv: Combining Local Geometry with Global Structure</h2>
<p>Given an $F$-dimensional point cloud $\mathbf{X} = \lbrace \mathbf{x}_1, \ldots, \mathbf{x}_n \rbrace \subseteq \mathbb{R}^F$, DGCNN constructs a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ as the $k$-nearest neighbor graph in $\mathbb{R}^F$, including self-loops so each node also points to itself. Edge features are defined as:</p>
<p>$$
\mathbf{x}_i' = \square_{j:(i,j) \in \mathcal{E}} h_\Theta(\mathbf{x}_i, \mathbf{x}_j)
$$</p>
<p>where $h_\Theta$ is a learnable nonlinear function and $\square$ denotes a channel-wise symmetric aggregation operation (e.g., max or sum).</p>
<p>The choice of edge function $h_\Theta$ determines the model&rsquo;s properties. The authors analyze several options:</p>
<table>
  <thead>
      <tr>
          <th>Choice</th>
          <th>Edge function</th>
          <th>Properties</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard convolution</td>
          <td>$\theta_m \cdot \mathbf{x}_j$</td>
          <td>Requires fixed grid structure</td>
      </tr>
      <tr>
          <td>PointNet</td>
          <td>$h_\Theta(\mathbf{x}_i)$</td>
          <td>Global only, ignores local structure</td>
      </tr>
      <tr>
          <td>PointNet++</td>
          <td>$h_\Theta(\mathbf{x}_j)$</td>
          <td>Local only, loses global context</td>
      </tr>
      <tr>
          <td>Local difference</td>
          <td>$h_\Theta(\mathbf{x}_j - \mathbf{x}_i)$</td>
          <td>Local patches without global positioning</td>
      </tr>
      <tr>
          <td><strong>EdgeConv (this work)</strong></td>
          <td>$\bar{h}_\Theta(\mathbf{x}_i, \mathbf{x}_j - \mathbf{x}_i)$</td>
          <td><strong>Both local geometry and global structure</strong></td>
      </tr>
  </tbody>
</table>
<p>The concrete EdgeConv operation uses an asymmetric edge function that combines the point&rsquo;s own features $\mathbf{x}_i$ (global shape structure) with the relative difference $\mathbf{x}_j - \mathbf{x}_i$ (local neighborhood information):</p>
<p>$$
e'_{ijm} = \text{ReLU}(\boldsymbol{\theta}_m \cdot (\mathbf{x}_j - \mathbf{x}_i) + \boldsymbol{\phi}_m \cdot \mathbf{x}_i)
$$</p>
<p>$$
x'_{im} = \max_{j:(i,j) \in \mathcal{E}} e'_{ijm}
$$</p>
<p>where $\boldsymbol{\Theta} = (\theta_1, \ldots, \theta_M, \phi_1, \ldots, \phi_M)$ are learnable parameters. This formulation can be implemented as a shared MLP followed by max pooling over neighbors.</p>
<h3 id="dynamic-graph-recomputation">Dynamic Graph Recomputation</h3>
<p>The defining feature of DGCNN is that the graph $\mathcal{G}^{(l)}$ is recomputed at each layer $l$ using k-NN in the current feature space, rather than being fixed based on input coordinates. This means:</p>
<ul>
<li>The receptive field grows to be as large as the diameter of the point cloud while remaining sparse.</li>
<li>Points that are far apart in Euclidean space but semantically similar (e.g., the two wings of an airplane) become neighbors in deeper feature spaces.</li>
<li>The model learns to construct the graph itself, rather than taking it as a fixed input.</li>
</ul>
<h3 id="permutation-and-translation-invariance">Permutation and Translation Invariance</h3>
<p>EdgeConv is permutation invariant because the max aggregation is a symmetric function. It has a &ldquo;partial&rdquo; translation invariance property: the local difference term $\mathbf{x}_j - \mathbf{x}_i$ is fully translation invariant, while the global term $\boldsymbol{\phi}_m \cdot \mathbf{x}_i$ is translation-dependent. Setting $\boldsymbol{\phi}_m = 0$ yields full translation invariance but loses global positioning information.</p>
<h2 id="benchmarks-classification-part-segmentation-and-scene-segmentation">Benchmarks: Classification, Part Segmentation, and Scene Segmentation</h2>
<h3 id="classification-on-modelnet40">Classification on ModelNet40</h3>
<p>The classification architecture uses four EdgeConv layers with output dimensions (64, 64, 128, 256), $k = 20$ nearest neighbors, and shortcut connections that concatenate all EdgeConv outputs into a $64 + 64 + 128 + 256 = 512$-dimensional per-point feature. A shared fully-connected layer (1024) aggregates these multi-scale features. Global max and sum pooling produce a 1D descriptor, followed by two fully-connected layers (512, 256) with dropout (probability 0.5). All layers use LeakyReLU and batch normalization. Input point clouds are rescaled to fit into the unit sphere.</p>
<p>Training uses SGD with momentum 0.9, initial learning rate 0.1, cosine annealing to 0.001, and batch size 32. Batch normalization momentum is 0.9 with no BN decay. Data augmentation includes random scaling and perturbation of object and point locations. The value of $k$ is selected using an 80/20 train/validation split, then the model is retrained on the full training set.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Class Acc. (%)</th>
          <th>Overall Acc. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PointNet</td>
          <td>86.0</td>
          <td>89.2</td>
      </tr>
      <tr>
          <td>PointNet++</td>
          <td>&ndash;</td>
          <td>90.7</td>
      </tr>
      <tr>
          <td>PointCNN</td>
          <td>88.1</td>
          <td>92.2</td>
      </tr>
      <tr>
          <td>PCNN</td>
          <td>&ndash;</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td><strong>DGCNN (baseline, fixed graph)</strong></td>
          <td><strong>88.9</strong></td>
          <td><strong>91.7</strong></td>
      </tr>
      <tr>
          <td><strong>DGCNN</strong></td>
          <td><strong>90.2</strong></td>
          <td><strong>92.9</strong></td>
      </tr>
      <tr>
          <td><strong>DGCNN (2048 points)</strong></td>
          <td><strong>90.7</strong></td>
          <td><strong>93.5</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="model-complexity">Model Complexity</h3>
<p>DGCNN achieves a favorable tradeoff between model size, inference speed, and accuracy:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Model Size (MB)</th>
          <th>Time (ms)</th>
          <th>Accuracy (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PointNet (baseline)</td>
          <td>9.4</td>
          <td>6.8</td>
          <td>87.1</td>
      </tr>
      <tr>
          <td>PointNet</td>
          <td>40</td>
          <td>16.6</td>
          <td>89.2</td>
      </tr>
      <tr>
          <td>PointNet++</td>
          <td>12</td>
          <td>163.2</td>
          <td>90.7</td>
      </tr>
      <tr>
          <td>PCNN</td>
          <td>94</td>
          <td>117.0</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td>DGCNN (baseline)</td>
          <td>11</td>
          <td>19.7</td>
          <td>91.7</td>
      </tr>
      <tr>
          <td>DGCNN</td>
          <td>21</td>
          <td>27.2</td>
          <td>92.9</td>
      </tr>
  </tbody>
</table>
<p>The DGCNN baseline outperforms PointNet++ by 1.0% while being 7x faster. The full DGCNN outperforms PCNN by 0.6% while being 4x faster with 4.5x fewer parameters.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: center">Centralization</th>
          <th style="text-align: center">Dynamic Graph</th>
          <th style="text-align: center">2048 Points</th>
          <th>Mean Class (%)</th>
          <th>Overall (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td>88.9</td>
          <td>91.7</td>
      </tr>
      <tr>
          <td style="text-align: center">x</td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td>89.3</td>
          <td>92.2</td>
      </tr>
      <tr>
          <td style="text-align: center">x</td>
          <td style="text-align: center">x</td>
          <td style="text-align: center"></td>
          <td>90.2</td>
          <td>92.9</td>
      </tr>
      <tr>
          <td style="text-align: center">x</td>
          <td style="text-align: center">x</td>
          <td style="text-align: center">x</td>
          <td>90.7</td>
          <td>93.5</td>
      </tr>
  </tbody>
</table>
<p>The choice of $k$ also matters:</p>
<table>
  <thead>
      <tr>
          <th>$k$</th>
          <th>Mean Class Acc. (%)</th>
          <th>Overall Acc. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>5</td>
          <td>88.0</td>
          <td>90.5</td>
      </tr>
      <tr>
          <td>10</td>
          <td>88.9</td>
          <td>91.4</td>
      </tr>
      <tr>
          <td>20</td>
          <td>90.2</td>
          <td>92.9</td>
      </tr>
      <tr>
          <td>40</td>
          <td>89.4</td>
          <td>92.4</td>
      </tr>
  </tbody>
</table>
<p>$k = 20$ performs best on 1024 points. Larger $k$ (e.g., 40) degrades performance because Euclidean distance poorly approximates geodesic distance at larger scales for a given point density.</p>
<h3 id="part-segmentation-on-shapenetpart">Part Segmentation on ShapeNetPart</h3>
<p>On the ShapeNetPart dataset (16,881 shapes, 16 categories, 50 part labels), DGCNN achieves 85.2% mean IoU, comparable to PointNet++ (85.1%) and PointCNN (86.1%). The model also demonstrates robustness to partial data, maintaining reasonable segmentation quality even when half of the points are removed.</p>
<h3 id="indoor-scene-segmentation-on-s3dis">Indoor Scene Segmentation on S3DIS</h3>
<p>On the Stanford Large-Scale 3D Indoor Spaces Dataset (6 indoor areas, 272 rooms, 13 semantic categories), DGCNN achieves 56.1% mean IoU and 84.1% overall accuracy using 6-fold cross-validation over the areas, outperforming PointNet (47.6% / 78.5%) and producing smoother segmentation boundaries. Each point is represented as a 9D vector (XYZ, RGB, and normalized spatial coordinates), with 4,096 points sampled per $1\text{m} \times 1\text{m}$ block during training.</p>
<h2 id="semantic-feature-spaces-and-future-directions">Semantic Feature Spaces and Future Directions</h2>
<p>A key qualitative finding is that the feature spaces learned by DGCNN in deeper layers capture semantic similarity rather than spatial proximity. Visualizations show that semantically similar structures (e.g., all legs of a table, or all wings of an airplane) are brought close together in feature space, even when they are far apart in the original 3D embedding. This property also transfers across shapes: features from one airplane&rsquo;s wing are close to the wing features of a different airplane in the learned feature space.</p>
<p>The authors identify several directions for future work:</p>
<ul>
<li><strong>Efficiency</strong>: Incorporating fast data structures (e.g., KD-trees) instead of computing pairwise distances for k-NN queries.</li>
<li><strong>Higher-order relationships</strong>: Considering tuples of points rather than only pairwise relationships.</li>
<li><strong>Non-shared transformations</strong>: Applying different transformations to different local patches rather than using shared weights.</li>
<li><strong>Abstract point clouds</strong>: Extending the approach to non-geometric applications like document retrieval and image processing, where the role of geometry in abstract feature spaces may provide new insights.</li>
</ul>
<p>The model has some limitations. On S3DIS, PointCNN achieves notably higher mean IoU (65.39% vs. 56.1%), suggesting room for improvement on large-scale scene segmentation. The dynamic k-NN computation adds overhead relative to fixed-graph approaches, though the overall model remains efficient.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification</td>
          <td>ModelNet40</td>
          <td>12,311 CAD models (40 categories)</td>
          <td>1,024 points uniformly sampled per model</td>
      </tr>
      <tr>
          <td>Part Segmentation</td>
          <td>ShapeNetPart</td>
          <td>16,881 shapes (16 categories, 50 parts)</td>
          <td>2,048 points per shape</td>
      </tr>
      <tr>
          <td>Scene Segmentation</td>
          <td>S3DIS</td>
          <td>272 rooms (13 categories)</td>
          <td>4,096 points per 1m x 1m block</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>k-NN graph construction</strong>: Pairwise distance matrix in feature space, $k = 20$ (classification) or $k = 40$ (2048 points).</li>
<li><strong>EdgeConv</strong>: Shared MLP on concatenated $[\mathbf{x}_i, \mathbf{x}_j - \mathbf{x}_i]$ features, followed by channel-wise max pooling over neighbors.</li>
<li><strong>Dynamic graph update</strong>: Graph recomputed from k-NN in feature space at each EdgeConv layer.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Classification</strong>: 4 EdgeConv layers (64, 64, 128, 256) + shortcut concatenation (512-dim) + shared FC (1024) + global max/sum pooling + FC (512, 256). 21 MB.</li>
<li><strong>Segmentation</strong>: Spatial transformer + 3 EdgeConv layers + shared FC (1024) aggregation + shortcut connections + FC (256, 256, 128).</li>
<li>All layers use LeakyReLU and batch normalization. Dropout 0.5 in final FC layers.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>DGCNN</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ModelNet40 Classification</td>
          <td>Overall Accuracy</td>
          <td>92.9%</td>
          <td>92.3% (PCNN)</td>
      </tr>
      <tr>
          <td>ShapeNetPart Segmentation</td>
          <td>Mean IoU</td>
          <td>85.2%</td>
          <td>86.1% (PointCNN)</td>
      </tr>
      <tr>
          <td>S3DIS Scene Segmentation</td>
          <td>Mean IoU</td>
          <td>56.1%</td>
          <td>65.39% (PointCNN)</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/WangYueFt/dgcnn">WangYueFt/dgcnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official TensorFlow and PyTorch implementations</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training used NVIDIA TITAN X GPUs. Distributed training (2 GPUs) for part segmentation.</li>
<li>Forward pass time: 27.2 ms per sample (1,024 points) on a single GPU.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., &amp; Solomon, J. M. (2019). Dynamic Graph CNN for Learning on Point Clouds. <em>ACM Transactions on Graphics</em>, 38(5), Article 146. <a href="https://doi.org/10.1145/3326362">https://doi.org/10.1145/3326362</a></p>
<p><strong>Code</strong>: <a href="https://github.com/WangYueFt/dgcnn">github.com/WangYueFt/dgcnn</a> (MIT License)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wang2019dynamic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dynamic Graph CNN for Learning on Point Clouds}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Yue and Sun, Yongbin and Liu, Ziwei and Sarma, Sanjay E. and Bronstein, Michael M. and Solomon, Justin M.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACM Transactions on Graphics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">articleno</span>=<span style="color:#e6db74">{146}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3326362}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Conformation Autoencoder for 3D Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/conformation-autoencoder/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/conformation-autoencoder/</guid><description>An autoencoder that maps 3D molecular conformations to a continuous latent space using internal coordinates and graph attention networks.</description><content:encoded><![CDATA[<h2 id="a-method-for-learning-conformation-embeddings">A Method for Learning Conformation Embeddings</h2>
<p>This is a <strong>Method</strong> paper that introduces an autoencoder architecture for molecular conformations. The model converts the discrete 3D spatial arrangement of atoms (a conformation) in a given molecular graph into a continuous, fixed-size latent representation and back. The approach uses <a href="https://en.wikipedia.org/wiki/Z-matrix_(chemistry)">internal coordinates</a> (bond lengths, bond angles, dihedral angles) as input rather than Cartesian coordinates, making the representation inherently invariant to rigid translations and rotations.</p>
<h2 id="why-3d-structure-matters-for-molecular-modeling">Why 3D Structure Matters for Molecular Modeling</h2>
<p>Most deep learning methods for molecules operate on 2D representations: molecular graphs (atoms as nodes, bonds as edges) or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. These representations capture connectivity and atom types but do not encode the 3D spatial arrangement of atoms. Many important molecular properties, such as the ability to fit inside a protein binding pocket or the shape-dependent pharmacological effect, depend on the molecule&rsquo;s possible energetically stable spatial arrangements (conformations).</p>
<p>Prior work has addressed either property prediction from fixed conformations (SchNet, Schütt et al., 2018) or conformation generation for a given molecular graph (Mansimov et al., 2019; Simm and Hernández-Lobato, 2019). This paper addresses a different gap: learning a continuous, fixed-size embedding of a conformation that is independent of molecule size and atom ordering, enabling both reconstruction and generation.</p>
<h2 id="internal-coordinates-and-set-based-encoding">Internal Coordinates and Set-Based Encoding</h2>
<p>The core innovation is a two-part architecture: a conformation-independent graph neural network and a conformation-dependent encoder/decoder that operates on internal coordinates.</p>
<h3 id="internal-coordinate-representation">Internal Coordinate Representation</h3>
<p>Instead of Cartesian coordinates, conformations are represented as a set of internal coordinates:</p>
<p>$$
\Xi = (\mathcal{D}, \Phi, \Psi)
$$</p>
<p>where $\mathcal{D} = \{d_1, \ldots, d_{N_\mathcal{D}}\}$ are bond lengths, $\Phi = \{\phi_1, \ldots, \phi_{N_\Phi}\}$ are bond angles, and $\Psi = \{\psi_1, \ldots, \psi_{N_\Psi}\}$ are dihedral angles. This representation is invariant to rotations and rigid translations and can always be converted to and from Cartesian coordinates.</p>
<h3 id="molecular-graph-encoder">Molecular Graph Encoder</h3>
<p>A Graph Neural Network extracts conformation-independent node embeddings from the molecular graph. The molecular graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ uses node features $v_i \in \mathbb{R}^{F_v}$ encoding atom properties (element type, charge) and edge features $\mathbf{e}_{i,j} \in \mathbb{R}^{F_e}$ encoding bond type (single, double, triple, or aromatic). The architecture combines an edge-conditioned convolution (EConv) layer to encode bond-type information with multiple Graph Attention Network (GAT) layers:</p>
<p>$$
\mathbf{h}_i^l = \mathbf{GAT}^{l-1} \circ \cdots \circ \mathbf{GAT}^1 \circ \text{EConv}(\mathbf{h}_i^0)
$$</p>
<p>where $\mathbf{h}_i^0 = v_i \in \mathbb{R}^{F_v}$ are the initial atom features. The GAT attention coefficients are:</p>
<p>$$
\alpha_{i,j} = \frac{\exp\left(\sigma\left(\mathbf{a}^T [\boldsymbol{\Theta}\mathbf{h}_i | \boldsymbol{\Theta}\mathbf{h}_j]\right)\right)}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp\left(\sigma\left(\mathbf{a}^T [\boldsymbol{\Theta}\mathbf{h}_i | \boldsymbol{\Theta}\mathbf{h}_k]\right)\right)}
$$</p>
<p>Each GAT layer updates node embeddings using the attention weights:</p>
<p>$$
\mathbf{h}'_i = \alpha_{i,i}\boldsymbol{\Theta}\mathbf{h}_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\boldsymbol{\Theta}\mathbf{h}_j
$$</p>
<p>The EConv layer incorporates edge (bond-type) information via a learned filter:</p>
<p>$$
\mathbf{h}'_i = \boldsymbol{\Theta}\mathbf{h}_i + \sum_{j \in \mathcal{N}(i)} \mathbf{h}_j \cdot \mathrm{f}_{\boldsymbol{\Theta}}(\mathbf{e}_{i,j})
$$</p>
<p>where $\mathrm{f}_{\boldsymbol{\Theta}}$ is a multi-layer perceptron.</p>
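<p>A single-head NumPy sketch of the GAT update above (with $\sigma$ as GAT&rsquo;s LeakyReLU, and the softmax taken over $\mathcal{N}(i) \cup \{i\}$) may help; this is a toy re-derivation from the formulas, not the authors&rsquo; code:</p>

```python
import numpy as np

def gat_layer(h, adj, theta, a, negative_slope=0.2):
    """One single-head GAT layer: attention over neighbors plus self.

    h: (n, f) node embeddings; adj: (n, n) boolean adjacency (no self-loops);
    theta: (f, f_out) shared linear map; a: (2 * f_out,) attention vector.
    """
    n = h.shape[0]
    th = h @ theta                                        # Θh for all nodes
    # e_ij = LeakyReLU(a^T [Θh_i | Θh_j]) for every ordered pair (i, j)
    e = np.concatenate([np.repeat(th[:, None, :], n, axis=1),
                        np.repeat(th[None, :, :], n, axis=0)], axis=-1) @ a
    e = np.where(e > 0, e, negative_slope * e)
    mask = adj | np.eye(n, dtype=bool)                    # N(i) ∪ {i}
    e = np.where(mask, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))      # stable softmax rows
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ th                                     # h'_i = Σ_j α_ij Θh_j

rng = np.random.default_rng(2)
h = rng.normal(size=(5, 4))
adj = np.zeros((5, 5), dtype=bool)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:  # a 5-atom chain
    adj[i, j] = adj[j, i] = True
out = gat_layer(h, adj, rng.normal(size=(4, 8)), rng.normal(size=16))
assert out.shape == (5, 8)
```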
<h3 id="permutation-invariant-conformation-encoder">Permutation-Invariant Conformation Encoder</h3>
<p>The conformation encoder uses a Deep Sets-style architecture (Zaheer et al., 2017) to achieve permutation invariance. Three separate neural networks encode each type of internal coordinate, conditioned on the corresponding node embeddings:</p>
<p>$$
z_\Xi = \frac{1}{N_\mathcal{D} + N_\Phi + N_\Psi} \left(\sum_{d \in \mathcal{D}} \rho_\Theta^{(\mathcal{D})}(\mathcal{H}, d) + \sum_{\phi \in \Phi} \rho_\Theta^{(\Phi)}(\mathcal{H}, \phi) + \sum_{\psi \in \Psi} \rho_\Theta^{(\Psi)}(\mathcal{H}, \psi)\right)
$$</p>
<p>Each encoding function $\rho_\Theta$ takes both the internal coordinate value and the node embeddings of the involved atoms as input. The resulting conformation embedding $z_\Xi \in \mathbb{R}^{F_z}$ has a fixed dimensionality regardless of molecule size.</p>
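<p>The size-independence of $z_\Xi$ follows directly from the sum-then-average structure. A deliberately crude sketch (a fixed nonlinearity stands in for each $\rho_\Theta$, and the conditioning on node embeddings $\mathcal{H}$ is omitted):</p>

```python
import numpy as np

F_Z = 32  # fixed embedding size (the exact value is unspecified in the paper)
rng = np.random.default_rng(3)

# Stand-ins for the three encoders ρ_Θ: each maps one scalar coordinate to R^F_Z.
encoders = {kind: rng.normal(size=F_Z) for kind in ("bond", "angle", "dihedral")}

def encode_conformation(coords):
    """Deep Sets pooling: mean of per-coordinate encodings -> fixed-size z."""
    parts = [np.tanh(v * encoders[kind])
             for kind, values in coords.items() for v in values]
    return np.mean(parts, axis=0)

small = {"bond": [1.09, 1.53], "angle": [1.91], "dihedral": [3.05]}
big = {"bond": rng.uniform(1.0, 1.8, 40),
       "angle": rng.uniform(1.5, 2.2, 60),
       "dihedral": rng.uniform(-np.pi, np.pi, 30)}

# Embedding dimensionality is independent of molecule size, and summation makes
# the encoding invariant to the order in which coordinates are listed:
assert encode_conformation(small).shape == encode_conformation(big).shape == (F_Z,)
```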
<h3 id="conformation-decoder-and-loss">Conformation Decoder and Loss</h3>
<p>Three decoder networks $\delta_\Theta^{(\mathcal{D})}$, $\delta_\Theta^{(\Phi)}$, and $\delta_\Theta^{(\Psi)}$ reconstruct internal coordinates from the conformation embedding, conditioned on the node embeddings. The reconstruction loss is:</p>
<p>$$
\mathcal{C}_\Xi = \frac{1}{N_\mathcal{D}} \sum_{d \in \mathcal{D}} |d - \hat{d}|_2^2 + \frac{1}{N_\Phi} \sum_{\phi \in \Phi} |\phi - \hat{\phi}|_2^2 + \frac{1}{N_\Psi} \sum_{\psi \in \Psi} \min\left(|\psi - \hat{\psi}|, 2\pi - |\psi - \hat{\psi}|\right)^2
$$</p>
<p>The dihedral angle loss uses a periodic distance to account for angular periodicity. The model can be extended to a <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder (VAE)</a> by applying the reparameterization trick from Kingma and Welling (2013).</p>
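<p>The periodic dihedral term can be sketched directly (a minimal NumPy version; the point is measuring the angular error the short way around the circle before squaring):</p>

```python
import numpy as np

def dihedral_loss(psi, psi_hat):
    """Periodic squared error between two dihedral angles in radians."""
    d = np.abs(psi - psi_hat) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d) ** 2

# Two angles 0.1 rad apart across the ±π wrap-around incur a small loss,
# not the huge naive squared difference:
a, b = np.pi - 0.05, -np.pi + 0.05
assert np.isclose(dihedral_loss(a, b), 0.1 ** 2)
assert (a - b) ** 2 > 9.0  # naive squared error would dominate the loss
```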
<h2 id="conformer-generation-and-spatial-optimization-experiments">Conformer Generation and Spatial Optimization Experiments</h2>
<h3 id="dataset-and-training">Dataset and Training</h3>
<p>The model was trained on the PubChem3D dataset (Bolton et al., 2011), which contains organic molecules with up to 50 heavy atoms, each with multiple conformations generated by the OMEGA conformer-generation software (Hawkins et al., 2010).</p>
<h3 id="reconstruction-quality">Reconstruction Quality</h3>
<p>Upon convergence, the model reconstructs conformations with low RMSD to the input. The median energetic difference between input and reconstructed conformations is approximately 80 kcal/mol (evaluated using the <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94</a> forcefield via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>), corresponding to small deviations from local minima without atom clashes.</p>
<h3 id="latent-space-structure">Latent Space Structure</h3>
<p>The learned latent space exhibits meaningful clustering: similar conformations map to nearby points, while distinct conformations separate. Principal component analysis of 200 conformations of a small molecule reveals clear conformational clusters in the first two principal components.</p>
<h3 id="conformer-generation-via-vae">Conformer Generation via VAE</h3>
<p>The variational autoencoder variant can sample diverse conformers from the learned distribution. Comparing the average inter-conformer RMSD (icRMSD) for 200 sampled conformers per molecule against the ETKDG algorithm (Riniker and Landrum, 2015) implemented in RDKit, the model achieves comparable diversity, with an average icRMSD only 0.07 Angstrom higher than ETKDG&rsquo;s.</p>
<h3 id="multi-objective-molecular-optimization">Multi-Objective Molecular Optimization</h3>
<p>By combining the conformation embedding with a continuous molecular structure embedding (<a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a>, Winter et al., 2019), the model enables joint optimization over both molecular graph and conformation. Using <a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">particle swarm optimization</a> (Kennedy and Eberhart, 1995) to maximize QED (drug-likeness, values between 0 and 1) and asphericity (deviation from spherical shape, values between 0 and 1), starting from aspirin (combined score 0.76), the method finds molecules with a combined score of 1.82 after 50 iterations.</p>
<h2 id="compact-conformation-encoding-with-practical-applications">Compact Conformation Encoding with Practical Applications</h2>
<p>The conformation autoencoder produces fixed-size latent representations of molecular 3D structures that are invariant to molecule size, atom ordering, and rigid transformations. The key findings are:</p>
<ol>
<li><strong>Meaningful latent space</strong>: Conformational similarity is preserved in the embedding space, enabling clustering and interpolation.</li>
<li><strong>Diverse conformer generation</strong>: The VAE variant generates conformer ensembles with diversity comparable to established force-field-based methods.</li>
<li><strong>Joint optimization</strong>: Combining conformation and structure embeddings enables multi-objective optimization over both molecular graph and spatial arrangement.</li>
</ol>
<p>Limitations include the limited scope of the energy evaluation (MMFF94 only), the lack of comparison with quantum mechanical references, and the proof-of-concept nature of the spatial optimization experiments. The approach also relies on the quality of the internal coordinate representation, which may lose information about ring conformations and other constrained geometries.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PubChem3D</td>
          <td>Multiple conformations per molecule</td>
          <td>Organic molecules, up to 50 heavy atoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>PubChem3D holdout</td>
          <td>Subset</td>
          <td>Same distribution as training</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Graph Neural Network: EConv + multiple GAT layers</li>
<li>Conformation encoder: Deep Sets architecture with three coordinate-specific encoders</li>
<li>VAE: Reparameterization trick for probabilistic sampling</li>
<li>Optimization: Particle Swarm Optimization for multi-objective design</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Conformation-independent: EConv + GAT layers for node embeddings</li>
<li>Conformation-dependent: Three encoder/decoder feed-forward networks per coordinate type</li>
<li>Latent dimension $F_z$ is fixed (exact value not specified in the workshop paper)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Median energy difference</td>
          <td>~80 kcal/mol</td>
          <td>Input conformations</td>
          <td>MMFF94 forcefield</td>
      </tr>
      <tr>
          <td>icRMSD difference vs ETKDG</td>
          <td>+0.07 Angstrom</td>
          <td>ETKDG (RDKit)</td>
          <td>200 conformers per molecule</td>
      </tr>
      <tr>
          <td>Combined QED+asphericity</td>
          <td>1.82</td>
          <td>0.76 (aspirin)</td>
          <td>After 50 optimization iterations</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the workshop paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem3D</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>NIH public database; conformations generated by OMEGA (Hawkins et al., 2010)</td>
      </tr>
      <tr>
          <td><a href="https://arxiv.org/abs/2101.01618">arXiv preprint</a></td>
          <td>Paper</td>
          <td>arXiv license</td>
          <td>6-page workshop paper, open access</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status: Partially Reproducible.</strong> The training dataset (PubChem3D) is publicly available, and the architecture is described in sufficient detail for reimplementation. No source code, pre-trained weights, or exact hyperparameters (latent dimension $F_z$, learning rate, number of GAT layers) are released. The workshop paper format (6 pages) limits the level of experimental detail provided.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Noé, F., &amp; Clevert, D.-A. (2020). Auto-Encoding Molecular Conformations. <em>Machine Learning for Molecules Workshop, NeurIPS 2020</em>.</p>
<p><strong>Publication</strong>: Machine Learning for Molecules Workshop at NeurIPS 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{winter2021auto,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Auto-Encoding Molecular Conformations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and No\&#39;{e}, Frank and Clevert, Djork-Arn\&#39;{e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2101.01618}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AllChem: Generating and Searching 10^20 Structures</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/allchem-synthetically-accessible-structures/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/allchem-synthetically-accessible-structures/</guid><description>AllChem generates and searches 10^20 synthetically accessible structures by combining synthons from recursive reaction application.</description><content:encoded><![CDATA[<h2 id="combinatorial-synthon-assembly-at-scale">Combinatorial Synthon Assembly at Scale</h2>
<p>AllChem is a computer-aided molecular design system that generates and searches an unprecedentedly large space of synthetically accessible structures (on the order of $10^{20}$). Rather than enumerating molecules from mathematical graphs (as in the <a href="/notes/chemistry/datasets/gdb-17/">GDB databases</a>), AllChem builds its chemical space from real synthetic chemistry: it recursively applies known reactions to commercial building blocks, producing <a href="https://en.wikipedia.org/wiki/Synthon">synthons</a> (structures with open valences of defined reactivity) that combinatorially assemble into complete molecules. Every structure found by a search comes paired with a proposed synthetic route.</p>
<h2 id="motivation-costs-and-benefits-together">Motivation: Costs and Benefits Together</h2>
<p>Most computer-aided molecular design methods focus on predicting biological activity (the benefit) while leaving synthesis feasibility (the cost) to the laboratory chemist. AllChem addresses both simultaneously. Its predecessor, ChemSpace, accessed $\sim 10^{14}$ structures built from simple <a href="https://en.wikipedia.org/wiki/Combinatorial_chemistry">combinatorial libraries</a> (chemist-proposed scaffolds plus commercial side chains), but only about 5% of structures in the medicinal chemistry literature fit that template. AllChem aims to cover roughly 50% of published structures by allowing multi-step synthon generation that produces more complex, non-trivial scaffolds.</p>
<h2 id="the-gensyn-synthon-generator">The gensyn Synthon Generator</h2>
<p>The core component is <code>gensyn</code>, a program that recursively applies a curated set of approximately 100 reactions to approximately 7,000 commercially available building blocks. Each product becomes a new building block for subsequent reaction steps, with recursion bounded primarily by a cumulative synthesis &ldquo;cost&rdquo; limit (roughly five AllChem-type steps per sequence). Structures bearing open valences are collected as synthons. A typical run produces around $5 \times 10^6$ synthons, which combinatorially represent $(5 \times 10^6)^3 \approx 10^{20}$ complete structures with an A-B-C topology.</p>
<p>Key design decisions in gensyn:</p>
<ul>
<li><strong>Reaction curation</strong>: All reactions come from external human-readable text files, based on reactions already practiced by laboratory chemists. Scope constraints are calibrated so that at least 90% of randomly sampled reaction applications appear unchallengeable to synthetic chemists.</li>
<li><strong>Reactive intermediates</strong>: Explicitly represented. For example, amide formation requires three steps: acid chloride to electrophilic synthon, amine to nucleophilic synthon, then coupling.</li>
<li><strong>Protective groups</strong>: Addition and removal are treated as standard reactions.</li>
<li><strong>Concerted cyclizations</strong>: Represented by splitting the ring formation across two complementary synthons with specially labeled open valences.</li>
<li><strong>Bimolecular reactions</strong>: In addition to unimolecular transformations, gensyn performs reactions that combine selected synthons with other synthons, increasing overall structural diversity.</li>
<li><strong>Constraints</strong>: Maximum of one prochiral center (to avoid diastereomeric mixtures), heavy atom count limits for lead-likeness, and a cumulative cost bound on synthetic routes. Each reaction step has a default cost of $-5$, with a cumulative cost floor of $-25$ (roughly five steps per sequence).</li>
</ul>
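<p>The recursion and cost bookkeeping described above can be sketched in a few lines. This is a hypothetical Python toy (gensyn itself was proprietary, and real structures are not strings): each reaction application costs $-5$, and a route is abandoned once its cumulative cost would pass $-25$.</p>

```python
# Hypothetical sketch of gensyn's recursion (toy string "structures", not
# the proprietary Tripos code). Each reaction application costs -5 and a
# route is abandoned past a cumulative cost of -25 (roughly five steps).

STEP_COST = -5
COST_LIMIT = -25

def generate_synthons(building_blocks, reactions):
    """Recursively apply reactions; every product becomes a new substrate."""
    synthons = set()
    stack = [(bb, 0) for bb in building_blocks]   # (structure, cumulative cost)
    while stack:
        structure, cost = stack.pop()
        for react in reactions:
            product = react(structure)
            if product is None:                   # pattern did not match
                continue
            new_cost = cost + STEP_COST
            if new_cost < COST_LIMIT:             # route too expensive
                continue
            synthons.add(product)                 # "*" marks an open valence
            stack.append((product, new_cost))     # feed back as a substrate
    return synthons

# Toy reactions: acid -> acid chloride synthon, amine -> nucleophilic synthon.
chlorinate = lambda s: s + "-COCl*" if s.endswith("COOH") else None
aminate = lambda s: s + "-NH*" if s.endswith("NH2") else None

print(sorted(generate_synthons({"R-COOH", "R-NH2"}, [chlorinate, aminate])))
```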
<h2 id="reaction-description-language">Reaction Description Language</h2>
<p>Reactions are described using an extension of Sybyl Line Notation (SLN), a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-like notation. Each reaction description specifies the structural pattern required in the substrate, the transformation to apply, the reactivity class of resulting open valences, the relative cost, incompatible functional groups, and rules for handling multiple equivalent reactive sites. A separate reactivity table defines which valence classes can react with each other (e.g., nucleophilic with electrophilic).</p>
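<p>As an illustration only (the field names below are invented; the real format is SLN-based, human-readable text), the information carried by one reaction description and by the reactivity table might look like:</p>

```python
# Invented field names illustrating what one reaction description specifies;
# the real system stores these in SLN-based, human-readable text files.
amide_coupling = {
    "substrate_pattern": "C(=O)Cl",         # structural pattern required
    "transform": "acid_chloride_to_amide",  # transformation to apply
    "valence_class": "electrophilic",       # reactivity class of open valence
    "cost": -5,                             # relative cost of the step
    "incompatible_groups": ["OH", "SH"],    # functional groups that interfere
    "multi_site_rule": "first_match_only",  # equivalent reactive sites
}

# A separate reactivity table defines which open-valence classes may join.
REACTIVITY = {("nucleophilic", "electrophilic"),
              ("electrophilic", "nucleophilic")}

def can_join(class_a, class_b):
    return (class_a, class_b) in REACTIVITY

print(can_join("nucleophilic", "electrophilic"))  # True
print(can_join("nucleophilic", "nucleophilic"))   # False
```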
<h2 id="topomer-similarity-search">Topomer Similarity Search</h2>
<p>Searching among $10^{20}$ complete structures relies on topomer shape similarity as a branch-and-bound filter. A query structure is fragmented by breaking acyclic single bonds (individually and pairwise), each fragment is converted to a topomer (a canonical 3D shape), and the topomer is compared against all stored synthons. Topomer comparisons run at tens of thousands per second. Because the vast majority of synthons are individually shape-dissimilar enough to eliminate every complete structure containing them, the search space collapses rapidly. To be acceptable, a product must also have been formed by joining open valences with complementary reactivity.</p>
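<p>A minimal sketch of the branch-and-bound pruning, with an invented scalar stand-in for the topomer shape comparison: if a single synthon fails the per-fragment shape filter, every complete structure containing it is eliminated without ever being assembled.</p>

```python
from itertools import product

# Invented scalar "topomer" comparison standing in for real 3D shape overlap.
def shape_difference(topomer_a, topomer_b):
    return abs(topomer_a - topomer_b)

def search(query_fragments, synthon_db, threshold):
    """Prune per position first, then assemble only the survivors."""
    survivors = {}
    for position, query_topomer in query_fragments.items():
        survivors[position] = [
            s for s in synthon_db[position]
            if shape_difference(s, query_topomer) <= threshold
        ]
    # Only surviving synthons are combined into complete A-B-C structures.
    return list(product(*(survivors[p] for p in sorted(survivors))))

query = {"A": 1.0, "B": 2.0, "C": 3.0}
db = {"A": [0.9, 5.0], "B": [2.1, 9.0], "C": [3.05, -4.0]}
print(search(query, db, threshold=0.2))  # most candidates are never assembled
```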
<p>Validation used repeated &ldquo;self-searches,&rdquo; in which a query structure is assembled from randomly chosen synthons and searched for in the database. On the 250,000-synthon leadhopping database, average self-search time was 7.1 minutes; complete searches of the full-scale database take several hours on standard hardware.</p>
<h2 id="applications-lead-hopping-and-scaffold-generation">Applications: Lead Hopping and Scaffold Generation</h2>
<p><strong>Lead hopping</strong>: Finding structurally novel molecules that are shape-similar (and therefore likely biologically similar) to a query lead. Using a 250,000-synthon leadhopping database, 18 of 19 self-search queries recovered the query structure perfectly (shape difference of 0 topomer units). The remaining query also recovered itself as the closest hit.</p>
<p><strong>Scaffold idea generation</strong>: Filtering the synthon collection for small ($\leq$ 14 heavy atoms), low-chirality scaffolds with at least two diversification sites (primarily through nucleophilic heteroatom reactions on activated carbon electrophiles or <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki-type couplings</a>), UV chromophores, minimal freely rotatable bonds (especially between diversification sites and rings), a ring, and short synthetic paths (all branches fewer than about six AllChem steps). Over 20% of gensyn-proposed synthons pass these scaffold filters, suggesting on the order of $10^6$ accessible and structurally distinct scaffolds, compared to the few thousand scaffolds typically represented in large screening collections.</p>
<h2 id="compute-and-infrastructure">Compute and Infrastructure</h2>
<p>Full-scale synthon database recreation takes approximately one week using two standard workstations (one Oracle database server, one compute engine). The codebase was rewritten from Java to Python for portability and performance. All data is managed through an Oracle relational database, including synthons, intermediates, and a reactions table recording every gensyn conversion.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Variable reactivity of open valences (e.g., weakly nucleophilic amines may not form the implied bond readily) is handled only approximately via reagent class annotations.</li>
<li>Stereospecificity and most aromatic electrophilic substitution reactions are omitted.</li>
<li>The system was described as under active development at the time of publication, giving the paper the character of an interim progress report.</li>
<li>Drug-likeness of 3-synthon products (average MW ~800, CLOGP ~8.0) requires careful filtering of the synthon distribution toward smaller, less lipophilic components.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>AllChem was developed as proprietary software at Tripos Inc. (Tripos Discovery Research, Bude, Cornwall, UK). No source code, synthon databases, or reaction files have been publicly released. The paper functions as a description of the system&rsquo;s architecture and early results rather than a reproducibility-oriented publication.</p>
<ul>
<li><strong>Code</strong>: Not publicly available. The system was proprietary to Tripos Inc.</li>
<li><strong>Data</strong>: Synthon databases and reaction description files are not shared.</li>
<li><strong>Hardware</strong>: Two standard workstations (one Oracle server, one compute engine); no specialized hardware required.</li>
<li><strong>Funding</strong>: NIH/GMS SBIR grant 2 R44 GM068359-02.</li>
</ul>
<p><strong>Reproducibility status</strong>: Closed.</p>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Journal</strong>: Journal of Computer-Aided Molecular Design, Vol. 21, No. 6, pp. 341-350</li>
<li><strong>Published</strong>: January 25, 2007</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cramer2007allchem,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AllChem: generating and searching 10^{20} synthetically accessible structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cramer, Richard D. and Soltanshahi, Farhad and Jilek, Robert J. and Campbell, Brian}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Computer-Aided Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{21}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{341--350}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Science+Business Media}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1007/s10822-006-9093-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ACSESS: Diverse Optimal Molecules in the SMU</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/acsess-diverse-optimal-molecules/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/acsess-diverse-optimal-molecules/</guid><description>Rupakheti et al. extend ACSESS to find diverse molecules with favorable properties without exhaustive enumeration of chemical space.</description><content:encoded><![CDATA[<h2 id="diversity-biased-search-of-the-small-molecule-universe">Diversity-Biased Search of the Small Molecule Universe</h2>
<p>The small molecule universe (SMU), estimated at over $10^{60}$ synthetically feasible organic molecules under ~500 Da, is far too large for exhaustive enumeration and evaluation. This paper extends the ACSESS (Algorithm for Chemical Space Exploration with Stochastic Search) framework to simultaneously optimize molecular diversity and a targeted physical property. The key insight is that enforcing diversity at each iteration prevents the search from collapsing into local optima, a failure mode common in standard <a href="/notes/chemistry/molecular-design/generation/search-based/genetic-algorithms-molecule-generation-baselines/">genetic algorithms</a>.</p>
<h2 id="motivation-diversity-vs-fitness">Motivation: Diversity vs. Fitness</h2>
<p>Standard genetic algorithms optimize fitness effectively but sacrifice diversity: they converge to a few high-fitness regions while ignoring equally good solutions elsewhere. Exhaustive enumeration guarantees completeness but is computationally infeasible beyond ~20 heavy atoms. ACSESS bridges this gap by maintaining a maximally diverse library throughout the optimization process, ensuring coverage of multiple fitness peaks without needing to evaluate every candidate.</p>
<h2 id="the-property-optimizing-acsess-algorithm">The Property-Optimizing ACSESS Algorithm</h2>
<p>The method has four iterative steps:</p>
<ol>
<li><strong>Initialize</strong> a library (from a single molecule or a seed collection)</li>
<li><strong>Breed</strong> new compounds via mutations and crossovers</li>
<li><strong>Filter</strong> by property threshold, removing compounds below a cutoff</li>
<li><strong>Select</strong> a maximally diverse subset of qualifying structures</li>
</ol>
<p>The property threshold increases linearly with each iteration, starting low (to prevent population collapse) and gradually rising until the desired fitness level is reached. Diversity is enforced via either a maximin algorithm (maximizing nearest-neighbor distance) or cell-based partitioning (linear scaling for large libraries).</p>
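<p>The four-step loop with a linearly rising threshold and greedy maximin selection can be sketched on an abstract population (the breed, fitness, and distance functions below are toy stand-ins, not the paper's implementation):</p>

```python
import random

def maximin_select(pool, distance, k):
    """Greedy maximin: repeatedly add the candidate farthest from the set."""
    selected = [max(pool, key=lambda x: sum(distance(x, y) for y in pool))]
    while len(selected) < k and len(selected) < len(pool):
        selected.append(max(
            (x for x in pool if x not in selected),
            key=lambda x: min(distance(x, s) for s in selected),
        ))
    return selected

def acsess(seed, breed, fitness, distance, k, iters, final_threshold):
    library = list(seed)
    for t in range(1, iters + 1):
        threshold = final_threshold * t / iters              # rises linearly
        pool = library + [breed(random.choice(library)) for _ in range(5 * k)]
        pool = [x for x in pool if fitness(x) >= threshold]  # property filter
        if pool:                                             # diverse subset
            library = maximin_select(pool, distance, k)
    return library

random.seed(0)
library = acsess(
    seed=[0.1, 0.2],
    breed=lambda x: min(1.0, x + random.uniform(0.0, 0.2)),  # toy mutation
    fitness=lambda x: x,
    distance=lambda a, b: abs(a - b),
    k=3, iters=30, final_threshold=0.8,
)
print(library)  # a small, diverse set of high-fitness values
```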
<p>Molecules are represented in a 40-dimensional chemical space using Moreau-Broto autocorrelation descriptors. The descriptor encodes correlations of atomic properties as a function of topological distance (bond distance) $d$:</p>
<p>$$
AC(d, p) = \sum_{i \leq j} p_{i} \, p_{j} \, \delta(d_{ij} - d)
$$</p>
<p>where $p_{i}$ is an atomic property of atom $i$ and $d_{ij}$ is the shortest bond path between atoms $i$ and $j$. Five atomic properties are used: atomic number, Gasteiger-Marsili partial charge, atomic polarizability, topological steric index, and unity ($p_{i} = 1$ for all $i$, effectively counting atom pairs at each distance). Topological distance $d$ ranges from 0 to 7, yielding $5 \times 8 = 40$ descriptor components. Descriptors are mean-centered and normalized to unit variance before computing distances.</p>
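<p>A small sketch of the descriptor for a single atomic property, using BFS for shortest bond paths on a toy adjacency list (the full descriptor stacks five such profiles over $d = 0$ to $7$ to give 40 components):</p>

```python
from collections import deque

def bond_distances(adj):
    """All-pairs shortest bond paths via BFS on an adjacency list."""
    n = len(adj)
    dist = [[None] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist

def autocorrelation(adj, props, max_d=7):
    """AC(d, p) = sum over pairs i <= j at bond distance d of p_i * p_j."""
    dist = bond_distances(adj)
    ac = [0.0] * (max_d + 1)
    for i in range(len(adj)):
        for j in range(i, len(adj)):
            d = dist[i][j]
            if d is not None and d <= max_d:
                ac[d] += props[i] * props[j]
    return ac

# Propane-like chain C0-C1-C2 with the unity property (p_i = 1):
# d=0 counts atoms, d=1 counts bonds, d=2 counts 1,3-pairs.
print(autocorrelation([[1], [0, 2], [1]], [1.0, 1.0, 1.0]))
```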
<p>Chemical space distance is the Euclidean distance between descriptor vectors:</p>
<p>$$
D_{ij} = \sqrt{\sum_{k=1}^{N} (d_{ik} - d_{jk})^2}
$$</p>
<p>Library diversity is measured as the average nearest-neighbor distance:</p>
<p>$$
D_{\min} = \frac{1}{M} \sqrt{\sum_{i=1}^{M} \min_{j \neq i} (D_{ij}^2)}
$$</p>
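<p>Implementing the two equations above directly (plain Python, no chemistry dependencies):</p>

```python
import math

def euclidean(a, b):
    """Chemical-space distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diversity(vectors):
    """D_min: average nearest-neighbor distance over a library of M vectors."""
    m = len(vectors)
    total = sum(
        min(euclidean(vi, vj) ** 2 for j, vj in enumerate(vectors) if j != i)
        for i, vi in enumerate(vectors)
    )
    return math.sqrt(total) / m

# Three points on a line in descriptor space, nearest neighbors 5.0 apart:
print(diversity([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]))
```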
<h2 id="validation-on-nkp-fitness-landscapes">Validation on NKp Fitness Landscapes</h2>
<p>The <a href="https://en.wikipedia.org/wiki/NK_model">NKp model</a> maps binary strings of length $N$ to fitness values in $[0, 1]$. The fitness of a string $g$ is:</p>
<p>$$
\Phi(g) = \frac{1}{N} \sum_{i=1}^{N} \varphi_{i}(g)
$$</p>
<p>where each $\varphi_{i} \in [0, 1]$ is a randomly drawn fitness contribution. Ruggedness is controlled by $K$ (the number of inter-bit associations per position) and $p$ (the probability that a fitness contribution is zero, which tunes the landscape&rsquo;s neutrality). Using $N = 19$, $K = 9$, $p = 0.9$ (524,288 total strings, comparable to GDB-9 size), the global maximum was ~0.3. Both ACSESS and a standard genetic algorithm (SGA) were initialized with the same diverse subset and run for 30 iterations across 10 independent runs:</p>
<ul>
<li>ACSESS found the global optimum in 100% of runs (vs. 60% for SGA)</li>
<li>ACSESS discovered ~15 of 19 globally optimal strings on average (vs. ~3 for SGA)</li>
<li>ACSESS solutions had higher average fitness than SGA solutions</li>
</ul>
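<p>For intuition, a toy NKp landscape can be sketched as follows. This follows the common convention that $p$ is the probability a fitness contribution is zeroed; it is not the paper&rsquo;s exact implementation, whose conventions may differ:</p>

```python
import random

def make_nkp(n, k, p, seed=0):
    """Toy NKp landscape: each phi_i depends on bit i plus k neighbor bits,
    and each lookup-table entry is zeroed with probability p (neutrality)."""
    rng = random.Random(seed)
    neighbors = [rng.sample([j for j in range(n) if j != i], k)
                 for i in range(n)]
    tables = [[0.0 if rng.random() < p else rng.random()
               for _ in range(2 ** (k + 1))] for _ in range(n)]
    def fitness(bits):
        total = 0.0
        for i in range(n):
            idx = bits[i]
            for j in neighbors[i]:
                idx = (idx << 1) | bits[j]   # index into phi_i's lookup table
            total += tables[i][idx]
        return total / n                     # Phi(g) = mean of contributions
    return fitness

f = make_nkp(n=19, k=9, p=0.9)
print(f([0] * 19), f([1] * 19))  # fitness values in [0, 1], mostly small
```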
<h2 id="validation-on-gdb-9-dipole-moments">Validation on GDB-9 Dipole Moments</h2>
<p>The method was tested on all ~300,000 molecules in GDB-9 (up to 9 heavy atoms; allowed atom types: C, N, O, S, Cl). For each molecule, the Boltzmann-averaged dipole moment was computed at the <a href="https://en.wikipedia.org/wiki/Austin_Model_1">AM1 level</a> (Gaussian 09):</p>
<p>$$
D = \frac{\sum_{i \in C} \mu_{i} \, e^{-\beta E_{i}}}{\sum_{i \in C} e^{-\beta E_{i}}}
$$</p>
<p>where $\mu_{i}$ and $E_{i}$ are the dipole moment and internal energy of conformation $i$, and $\beta = 1 / (k_{\text{B}} T)$ at $T = 298$ K. Conformations (including stereoisomers) were generated using OpenEye OMEGA. The target was molecules with dipole moments $\geq 5.5$ D (the 90th percentile). ACSESS first generated a maximally diverse seed set, then ran 60 iterations of fitness-biased optimization. All methods were initialized from the same diverse seed and compared over multiple runs.</p>
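<p>The Boltzmann average itself is straightforward once per-conformer dipoles and energies are in hand; a sketch assuming dipoles in Debye and energies in kcal/mol:</p>

```python
import math

KB_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def boltzmann_dipole(dipoles, energies, temperature=298.0):
    """Boltzmann-weighted average dipole over conformations."""
    beta = 1.0 / (KB_KCAL * temperature)
    e0 = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e0)) for e in energies]
    return sum(m * w for m, w in zip(dipoles, weights)) / sum(weights)

# Two conformers 2 kcal/mol apart: the lower-energy one dominates the average.
print(boltzmann_dipole([6.0, 2.0], [0.0, 2.0]))
```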
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Dipole Moment (D)</th>
          <th>Diversity (eq. 4)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GA-Roulette</td>
          <td>5.8 $\pm$ 0.03</td>
          <td>6.5 $\pm$ 0.7</td>
      </tr>
      <tr>
          <td>GA-Tournament</td>
          <td>6.4 $\pm$ 0.08</td>
          <td>3.5 $\pm$ 0.7</td>
      </tr>
      <tr>
          <td>GA-Elitism</td>
          <td>6.74 $\pm$ 0.08</td>
          <td>5.4 $\pm$ 0.4</td>
      </tr>
      <tr>
          <td><strong>ACSESS</strong></td>
          <td><strong>6.05 $\pm$ 0.05</strong></td>
          <td><strong>9.7 $\pm$ 0.6</strong></td>
      </tr>
  </tbody>
</table>
<p>ACSESS achieved nearly double the diversity of the best SGA variant while maintaining competitive fitness. Its diversity (~9.7) approached the diversity of the full enumerated high-fitness subset of GDB-9 (~12). <a href="https://en.wikipedia.org/wiki/Self-organizing_map">Self-organizing map</a> (SOM) visualizations confirmed that ACSESS covered high-activity regions that SGAs missed entirely.</p>
<p>Only ~30,000 fitness evaluations were needed to locate diverse optimal regions in the 300,000-molecule space, a 10x efficiency gain over exhaustive enumeration.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Tested only on relatively small chemical spaces (GDB-9 with ~300k molecules and 19-bit NKp with ~500k strings); scaling to the full SMU ($10^{60}$) remains a research direction</li>
<li>Property evaluation (AM1 dipole moments with conformer generation) is the computational bottleneck, not the ACSESS algorithm itself</li>
<li>The 40-dimensional autocorrelation descriptor space may not capture all relevant structural features for every optimization target</li>
<li>Comparison is limited to simple genetic algorithms; more sophisticated evolutionary strategies were not benchmarked</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The ACSESS algorithm relies on proprietary software, limiting full reproducibility.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.1038/sdata.2014.22">GDB-9</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Publicly available enumerated chemical universe (~300k molecules)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Code</strong>: No public source code was released. The implementation depends on OpenEye OEChem TK (molecule generation), OpenEye MolProp TK (filtering), and OpenEye OMEGA TK (conformer generation), all of which require commercial licenses.</li>
<li><strong>Property calculations</strong>: Dipole moments were computed at the AM1 level using Gaussian 09, also commercial software.</li>
<li><strong>NKp landscape</strong>: Fully specified by parameters ($N = 19$, $K = 9$, $p = 0.9$) and standard NKp model equations, making this portion independently reproducible.</li>
<li><strong>Hardware</strong>: No specific compute requirements reported.</li>
<li><strong>Reproducibility status</strong>: Partially Reproducible. The algorithm is well-described and the NKp experiments could be reimplemented, but the molecular experiments require OpenEye and Gaussian 09 licenses, and no reference implementation was released.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Journal</strong>: Journal of Chemical Information and Modeling, Vol. 55, No. 3, pp. 529-537</li>
<li><strong>Published</strong>: January 16, 2015</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rupakheti2015strategy,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Strategy To Discover Diverse Optimal Molecules in the Small Molecule Universe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rupakheti, Chetan and Virshup, Aaron M. and Yang, Weitao and Beratan, David N.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{55}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{529--537}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/ci500749q}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DoReMi: Optimizing Data Mixtures for LM Pretraining</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</guid><description>DoReMi uses a small proxy model with distributionally robust optimization to learn domain weights that speed up large-scale language model pretraining by 2.6x.</description><content:encoded><![CDATA[<h2 id="a-method-for-automatic-domain-reweighting">A method for automatic domain reweighting</h2>
<p>This is a <strong>method paper</strong> that introduces Domain Reweighting with Minimax Optimization (DoReMi), an algorithm for automatically tuning the mixture proportions of pretraining data domains. Rather than relying on heuristics or expensive downstream-task-based tuning, DoReMi uses a small proxy model trained with <a href="https://en.wikipedia.org/wiki/Robust_optimization">group distributionally robust optimization (Group DRO)</a> to produce domain weights that transfer to much larger models.</p>
<h2 id="why-data-mixture-proportions-matter">Why data mixture proportions matter</h2>
<p>Language model pretraining datasets combine text from many domains: web crawls, Wikipedia, books, code, academic papers, and others. The mixture proportions (how much of each domain to include) significantly affect downstream performance, but existing approaches either set them by hand (<a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)">The Pile</a> uses heuristic weights) or tune them against downstream tasks (GLaM/PaLM), which is expensive and risks overfitting to a specific evaluation set. No principled, task-agnostic method existed for determining mixture proportions.</p>
<h2 id="minimax-optimization-over-domain-excess-loss">Minimax optimization over domain excess loss</h2>
<p>DoReMi&rsquo;s core insight is to frame data mixture optimization as a minimax problem: find domain weights that minimize the worst-case excess loss across all domains. The algorithm has three steps.</p>
<p><strong>Step 1</strong>: Train a small reference model (280M parameters) on some default domain weights $\alpha_{\text{ref}}$ (e.g., proportional to raw token count).</p>
<p><strong>Step 2</strong>: Train a small proxy model $p_{\theta}$ using Group DRO, which solves the minimax objective:</p>
<p>$$
\min_{\theta} \max_{\alpha \in \Delta^{k}} \sum_{i=1}^{k} \alpha_{i} \cdot \frac{1}{\sum_{x \in D_{i}} |x|} \sum_{x \in D_{i}} \left[ \ell_{\theta}(x) - \ell_{\text{ref}}(x) \right]
$$</p>
<p>where $\ell_{\theta}(x) = -\log p_{\theta}(x)$ and $\ell_{\text{ref}}(x) = -\log p_{\text{ref}}(x)$. The excess loss $\ell_{\theta}(x) - \ell_{\text{ref}}(x)$ measures how much headroom the proxy has to improve on each example relative to the reference. The inner maximization upweights domains with high excess loss via exponentiated gradient ascent, while the outer minimization trains the proxy on those upweighted domains.</p>
<p>At each training step, the domain weights update as:</p>
<p>$$
\alpha_{t}' \leftarrow \alpha_{t-1} \exp(\eta \lambda_{t})
$$</p>
<p>where $\lambda_{t}[i]$ is the per-domain excess loss (clipped at zero), followed by renormalization and smoothing with a uniform component: $\alpha_{t} \leftarrow (1-c)\frac{\alpha_{t}'}{\sum_{i} \alpha_{t}'[i]} + c u$, with $c = 10^{-3}$.</p>
<p>The final domain weights are the average over all training steps: $\bar{\alpha} = \frac{1}{T}\sum_{t=1}^{T} \alpha_{t}$.</p>
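<p>The update, clipping, smoothing, and averaging steps above can be sketched in plain Python (toy excess-loss values, not the paper&rsquo;s code):</p>

```python
import math

def update_weights(alpha, excess_losses, eta=1.0, c=1e-3):
    """alpha'_t = alpha_{t-1} * exp(eta * lambda_t), then smooth with uniform."""
    lam = [max(0.0, ell) for ell in excess_losses]  # clip excess loss at zero
    raw = [a * math.exp(eta * l) for a, l in zip(alpha, lam)]
    z = sum(raw)
    k = len(alpha)
    return [(1 - c) * r / z + c / k for r in raw]   # renormalize + c * uniform

def doremi_weights(init_alpha, excess_loss_stream, eta=1.0, c=1e-3):
    """Run updates over per-step excess losses; return the time average."""
    alpha = list(init_alpha)
    history = []
    for losses in excess_loss_stream:
        alpha = update_weights(alpha, losses, eta, c)
        history.append(alpha)
    k = len(init_alpha)
    return [sum(step[i] for step in history) / len(history) for i in range(k)]

# Three domains; domain 0 consistently shows the largest excess loss.
stream = [[0.5, 0.1, -0.2]] * 10
print(doremi_weights([1 / 3] * 3, stream))  # weight shifts toward domain 0
```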
<p><strong>Step 3</strong>: Resample data according to $\bar{\alpha}$ and train the full-scale model using standard procedures.</p>
<p><strong>Iterated DoReMi</strong> extends this by running multiple rounds, using the previous round&rsquo;s optimized weights as the next round&rsquo;s reference weights. This converges within 3 rounds on the GLaM dataset.</p>
<h2 id="experiments-across-the-pile-and-glam-datasets">Experiments across The Pile and GLaM datasets</h2>
<p><strong>Datasets.</strong> The Pile (22 domains, 800GB) and the GLaM dataset (8 domains, also used for PaLM). On The Pile, baseline weights come from the dataset defaults. On GLaM, baseline weights are uniform, with downstream-tuned oracle weights available for comparison.</p>
<p><strong>Setup.</strong> Transformer decoder-only LMs trained with next-token prediction. All models use batch size 512 and sequence length 1024. Proxy and reference models are 280M parameters. Main models are 8B parameters (30x larger). Training runs: 200K steps (Pile) or 300K steps (GLaM). The domain weight optimization cost (training two 280M models) is 8% of the compute for the 8B main model.</p>
<p><strong>Evaluation.</strong> Per-domain held-out perplexity and one-shot generative accuracy on five tasks: TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, and LAMBADA.</p>
<h3 id="key-domain-weight-shifts">Key domain weight shifts</h3>
<p>On The Pile, DoReMi (280M) dramatically upweights diverse web text (Pile-CC: 0.112 to 0.606) while downweighting specialized domains like ArXiv (0.105 to 0.004), PubMed Central (0.107 to 0.005), and StackExchange (0.093 to 0.015). Smaller, underrepresented domains like YouTubeSubtitles and PhilPapers receive proportionally large increases.</p>
<h3 id="scaling-behavior">Scaling behavior</h3>
<p>DoReMi was tested with matched proxy/main model sizes (280M through 1B) and with varying proxy sizes (70M through 1B) feeding into an 8B main model.</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Speedup to baseline accuracy</th>
          <th>Downstream improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DoReMi (280M to 280M)</td>
          <td>4x</td>
          <td>+2% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (280M to 8B)</td>
          <td>2.6x</td>
          <td>+6.5% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (150M to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
      <tr>
          <td>DoReMi (1B to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
  </tbody>
</table>
<p>Improvements are consistent across all tested model scales (280M to 1B matched), with no sign of diminishing returns at larger sizes.</p>
<h2 id="perplexity-improves-everywhere-even-on-downweighted-domains">Perplexity improves everywhere, even on downweighted domains</h2>
<p>The most striking finding is that DoReMi improves perplexity on all 22 domains in The Pile, including domains it downweights. The proposed explanation: the lowest-entropy domains need few samples to learn (they&rsquo;re statistically simple), while the highest-entropy domains have token distributions close to the uniform initialization and also need fewer samples. Reallocating weight to medium-entropy domains generates positive transfer that lifts all domains.</p>
<p>On The Pile, DoReMi reaches the baseline&rsquo;s downstream accuracy in 75K steps versus 200K for the baseline (2.6x speedup) and achieves a 6.5% absolute improvement in average one-shot accuracy at 200K steps.</p>
<p>On the GLaM dataset, iterated DoReMi (round 2) matches the performance of domain weights that were tuned directly on downstream task performance, despite having no knowledge of downstream tasks. Domain weights converge within 3 iterations.</p>
<h3 id="ablations">Ablations</h3>
<p>Using only the proxy model&rsquo;s loss (prefer hardest domains) or only the negative reference loss (prefer easiest domains) both underperform the full excess loss formulation. Both components are necessary: the excess loss identifies domains where the proxy has room to improve relative to what is learnable.</p>
<p>The proxy model itself typically underperforms the main model trained on its weights, and this gap grows at larger proxy scales. A 1B proxy model underperforms the 1B baseline, yet its domain weights still improve 1B main model training by over 2x. This suggests the domain weight signal is robust even when the proxy model itself is not well-trained.</p>
<h3 id="limitations">Limitations</h3>
<p>The domain weight landscape may have multiple local optima: a 280M proxy puts most weight on Pile-CC, while a 1B proxy favors OpenWebText2 instead. Both configurations improve over baseline, but the optimal weights are not unique.</p>
<p>The granularity of &ldquo;domains&rdquo; matters. DoReMi works better with more domains (22 on The Pile versus 8 on GLaM). Domains are defined by data provenance, which is coarse-grained. Fine-grained domain definitions (e.g., via clustering) could improve results but also risk DRO putting all weight on a small set of worst-case examples.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>800 GB, 22 domains</td>
          <td>Default heuristic weights as baseline</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>GLaM dataset</td>
          <td>8 domains</td>
          <td>Uniform weights as baseline; downstream-tuned oracle available</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, LAMBADA</td>
          <td>Standard splits</td>
          <td>One-shot generative evaluation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Group DRO with exponentiated gradient ascent for domain weight updates. Step size $\eta = 1$, smoothing $c = 10^{-3}$. Per-token excess loss clipped at zero. Domain weights averaged over all training steps. Iterated DoReMi converges when $|\bar{\alpha} - \alpha_{\text{ref}}|_{\infty} &lt; 10^{-3}$.</p>
<h3 id="models">Models</h3>
<p>Vanilla Transformer decoder-only models with 256K vocabulary. Sizes: 70M (3 layers), 150M (6 layers), 280M (12 layers), 510M (12 layers), 760M (12 layers), 1B (16 layers), 8B (32 layers). All use 64-dim attention heads except 8B (128-dim).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>DoReMi (280M to 8B)</th>
          <th>Baseline (8B)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Avg one-shot accuracy</td>
          <td>+6.5% over baseline</td>
          <td>Reference</td>
          <td>5 generative tasks</td>
      </tr>
      <tr>
          <td>Worst-case log-perplexity</td>
          <td>1.46</td>
          <td>1.71</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Avg log-perplexity</td>
          <td>1.40</td>
          <td>1.64</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Domains beating baseline</td>
          <td>22/22</td>
          <td>0/22</td>
          <td>Per-domain perplexity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Proxy and reference models (under 1B) trained on TPUv3. Models at 1B and 8B trained on TPUv4. Domain weight optimization (two 280M runs) costs 8% of 8B training FLOPs.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xie2023doremi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xie, Sang Michael and Pham, Hieu and Dong, Xuanyi and Du, Nan and Liu, Hanxiao and Lu, Yifeng and Liang, Percy and Le, Quoc V. and Ma, Tengyu and Yu, Adams Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RWKV: Linear-Cost RNN with Transformer Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</guid><description>RWKV combines parallelizable transformer training with constant-cost RNN inference using linear attention and channel-wise decay.</description><content:encoded><![CDATA[<h2 id="a-new-architecture-bridging-rnns-and-transformers">A New Architecture Bridging RNNs and Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces RWKV (Receptance Weighted Key Value), a novel sequence model architecture that combines the parallelizable training of Transformers with the efficient $O(Td)$ inference of RNNs. RWKV can be formulated equivalently as either a Transformer (for parallel training) or an RNN (for sequential inference), achieving the lowest computational and memory complexity among comparable architectures while matching Transformer-level performance. The authors scale RWKV to 14 billion parameters, making it the largest dense RNN ever trained at the time of publication.</p>
<h2 id="the-quadratic-cost-of-self-attention">The Quadratic Cost of Self-Attention</h2>
<p>Transformers have become the dominant architecture for NLP, powering models like GPT-3, LLaMA, and Chinchilla. Their self-attention mechanism captures both local and long-range dependencies while supporting parallelized training. However, self-attention scales quadratically with sequence length in both time ($O(T^2d)$) and space ($O(T^2 + Td)$), making it computationally and memory intensive for long sequences and resource-constrained deployment.</p>
<p>RNNs, by contrast, offer linear scaling in memory and computation, but suffer from the vanishing gradient problem and cannot parallelize across the time dimension during training. This limits their scalability and makes them unable to match Transformer performance in practice.</p>
<p>Prior work on efficient Transformers (Reformer, Performer, Linformer, AFT, MEGA) has attempted to reduce this quadratic cost, often at the expense of model expressivity. RWKV aims to combine the best of both worlds: Transformer-grade training efficiency with RNN-grade inference cost, without any approximation to the attention mechanism.</p>
<h2 id="linear-attention-via-channel-wise-decay">Linear Attention via Channel-Wise Decay</h2>
<p>RWKV is built on four core vectors that interact multiplicatively at each timestep:</p>
<ul>
<li><strong>R</strong> (Receptance): receives past information, acting as a gating signal</li>
<li><strong>W</strong> (Weight): a trainable positional weight decay vector</li>
<li><strong>K</strong> (Key): analogous to keys in standard attention</li>
<li><strong>V</strong> (Value): analogous to values in standard attention</li>
</ul>
<p>The architecture consists of stacked residual blocks, each containing a <strong>time-mixing</strong> sub-block and a <strong>channel-mixing</strong> sub-block.</p>
<h3 id="token-shift">Token Shift</h3>
<p>All linear projection vectors are produced by interpolating between the current input $x_t$ and the previous input $x_{t-1}$, creating a token shift mechanism:</p>
<p>$$
r_t = W_r \cdot (\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1})
$$</p>
<p>$$
k_t = W_k \cdot (\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1})
$$</p>
<p>$$
v_t = W_v \cdot (\mu_v \odot x_t + (1 - \mu_v) \odot x_{t-1})
$$</p>
<p>where $\mu_r$, $\mu_k$, $\mu_v$ are learnable interpolation parameters. This is implemented efficiently as a simple offset in the temporal dimension.</p>
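<p>As a concrete sketch (my own illustration, not the official implementation), token shift amounts to a one-position temporal offset followed by per-channel interpolation and a linear projection; treating the predecessor of the first token as zero is an assumption:</p>

```python
import numpy as np

def token_shift(x, mu, W):
    """Compute W @ (mu * x_t + (1 - mu) * x_{t-1}) for every timestep.

    x: (T, d) inputs; mu: (d,) learned interpolation weights; W: (d_out, d).
    The predecessor of the first token is taken to be zero (an assumption).
    """
    x_prev = np.roll(x, shift=1, axis=0)
    x_prev[0] = 0.0  # no token before t = 0
    mixed = mu * x + (1.0 - mu) * x_prev  # channel-wise interpolation
    return mixed @ W.T

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
mu, W = rng.uniform(size=d), rng.normal(size=(d, d))
r = token_shift(x, mu, W)  # shape (5, 8)
```

<p>With $\mu = 1$ this reduces to an ordinary per-token projection, which is a quick sanity check on the implementation.</p>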
<h3 id="the-wkv-operator">The WKV Operator</h3>
<p>The core attention-like computation replaces standard dot-product attention with a channel-wise weighted sum using exponential decay:</p>
<p>$$
wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \odot v_i + e^{u + k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}
$$</p>
<p>Here $w$ is the channel-wise time decay vector and $u$ is a separate bonus vector that attends specifically to the current token. Unlike AFT where $W$ is a pairwise matrix, RWKV treats $W$ as a channel-wise vector modified by relative position, enabling the recurrent formulation.</p>
<h3 id="output-gating">Output Gating</h3>
<p>The receptance vector gates the WKV output through a sigmoid:</p>
<p>$$
o_t = W_o \cdot (\sigma(r_t) \odot wkv_t)
$$</p>
<p>The channel-mixing block uses a similar gating mechanism with squared ReLU activation:</p>
<p>$$
o&rsquo;_t = \sigma(r&rsquo;_t) \odot (W&rsquo;_v \cdot \max(k&rsquo;_t, 0)^2)
$$</p>
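<p>A minimal sketch of the channel-mixing computation (shapes and names are my own; $r&rsquo;$ and $k&rsquo;$ come from their own token-shift projections):</p>

```python
import numpy as np

def channel_mix(r, k, Wv):
    """Gate a squared-ReLU value projection of k with sigmoid(r)."""
    gate = 1.0 / (1.0 + np.exp(-r))        # sigma(r'_t)
    squared_relu = np.maximum(k, 0.0) ** 2  # max(k'_t, 0)^2
    return gate * (squared_relu @ Wv.T)     # sigma(r') gates W'_v projection

rng = np.random.default_rng(2)
r, k = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
Wv = rng.normal(size=(8, 8))
out = channel_mix(r, k, Wv)  # shape (4, 8)
```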
<h3 id="dual-mode-operation">Dual-Mode Operation</h3>
<p>During <strong>training</strong>, RWKV operates in time-parallel mode. The matrix multiplications ($W_\lambda$ for $\lambda \in \{r, k, v, o\}$) dominate at $O(BTd^2)$ and parallelize identically to standard Transformers. The element-wise WKV computation is $O(BTd)$ and parallelizes along batch and channel dimensions.</p>
<p>During <strong>inference</strong>, RWKV switches to time-sequential mode. Each timestep updates a fixed-size state vector, giving constant $O(d)$ memory and $O(Td)$ total time for generating $T$ tokens, compared to $O(T^2d)$ for standard Transformers.</p>
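<p>The equivalence of the two modes can be checked directly. The sketch below (my own, with one decay $w$ and bonus $u$ per channel) evaluates the WKV definition both as the explicit sum and as a recurrence over two running state vectors, one for the numerator and one for the denominator:</p>

```python
import numpy as np

def wkv_direct(k, v, w, u):
    """Direct evaluation of the WKV definition at each t (O(T^2) total)."""
    T, d = k.shape
    out = np.zeros((T, d))
    for t in range(T):
        # weights e^{-(t-1-i)w + k_i} for i < t (0-indexed)
        weights = np.exp(-(t - 1 - np.arange(t))[:, None] * w + k[:t])
        num = (weights * v[:t]).sum(axis=0) + np.exp(u + k[t]) * v[t]
        den = weights.sum(axis=0) + np.exp(u + k[t])
        out[t] = num / den
    return out

def wkv_recurrent(k, v, w, u):
    """Sequential mode: two running sums updated per step (O(T) total)."""
    T, d = k.shape
    a = np.zeros(d)  # numerator state:   sum of decayed e^{k_i} v_i
    b = np.zeros(d)  # denominator state: sum of decayed e^{k_i}
    out = np.zeros((T, d))
    for t in range(T):
        e = np.exp(u + k[t])
        out[t] = (a + e * v[t]) / (b + e)
        # decay old state and absorb the current token for the next step
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]
        b = np.exp(-w) * b + np.exp(k[t])
    return out

rng = np.random.default_rng(1)
T, d = 6, 4
k, v = rng.normal(size=(T, d)), rng.normal(size=(T, d))
w, u = rng.uniform(0.1, 1.0, size=d), rng.normal(size=d)
assert np.allclose(wkv_direct(k, v, w, u), wkv_recurrent(k, v, w, u))
```

<p>The recurrent form is why generation needs only a fixed-size state: each step decays the two sums by $e^{-w}$ and absorbs the current token.</p>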
<h3 id="optimizations">Optimizations</h3>
<p>Three additional design choices improve training:</p>
<ol>
<li><strong>Custom CUDA kernels</strong> for the sequential WKV computation, fusing it into a single kernel on training accelerators</li>
<li><strong>Small init embedding</strong>: initializing the embedding matrix with small values plus an additional LayerNorm, accelerating convergence</li>
<li><strong>Custom initialization</strong>: most weights initialized to zero with no biases, following identity-mapping principles from residual network design</li>
</ol>
<h2 id="scaling-to-14b-parameters-and-benchmark-evaluation">Scaling to 14B Parameters and Benchmark Evaluation</h2>
<h3 id="model-scaling">Model Scaling</h3>
<p>The authors train six RWKV models from 169M to 14B parameters, all for one epoch (330B tokens) on the Pile:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Dimension</th>
          <th>Parameters</th>
          <th>FLOP/Token</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>12</td>
          <td>768</td>
          <td>$1.69 \times 10^8$</td>
          <td>$2.61 \times 10^8$</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>24</td>
          <td>1024</td>
          <td>$4.30 \times 10^8$</td>
          <td>$7.57 \times 10^8$</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>24</td>
          <td>2048</td>
          <td>$1.52 \times 10^9$</td>
          <td>$2.82 \times 10^9$</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>32</td>
          <td>2560</td>
          <td>$2.99 \times 10^9$</td>
          <td>$5.71 \times 10^9$</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>32</td>
          <td>4096</td>
          <td>$7.39 \times 10^9$</td>
          <td>$1.44 \times 10^{10}$</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>40</td>
          <td>5120</td>
          <td>$1.42 \times 10^{10}$</td>
          <td>$2.78 \times 10^{10}$</td>
      </tr>
  </tbody>
</table>
<p>The parameter count follows: $\text{params} = 2VD + 13D^2L + D(11L + 4)$, where $V = 50277$ is the vocabulary size, $D$ is the model dimension, and $L$ is the number of layers. The FLOP/token column counts the forward pass through the dense matrix multiplications, $2(13D^2L + VD)$; total training compute then follows the standard $\text{FLOP} = 6 \cdot [\text{tokens}] \cdot [\text{params}]$ rule for forward plus backward passes.</p>
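<p>Both closed forms can be checked against the table rows. In the sketch below, the FLOP/token column is reproduced as the forward-pass cost of the dense matrix multiplications, $2(13D^2L + VD)$; this reading is my own interpretation, verified numerically against the table:</p>

```python
def rwkv_params(D, L, V=50277):
    """Embedding + head (2VD), block matmuls (13 D^2 L), per-layer vectors."""
    return 2 * V * D + 13 * D * D * L + D * (11 * L + 4)

def rwkv_flop_per_token(D, L, V=50277):
    """Forward FLOPs per token: 2 FLOPs per weight in the dense matmuls.

    An interpretation of the table's FLOP/token column, not a quoted formula.
    """
    return 2 * (13 * D * D * L + V * D)

assert rwkv_params(768, 12) == 169_342_464            # 169M row
assert rwkv_flop_per_token(768, 12) == 261_250_560    # ~2.61e8
assert rwkv_params(1024, 24) == 430_397_440           # 430M row
assert rwkv_flop_per_token(1024, 24) == 757_278_720   # ~7.57e8
```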
<h3 id="scaling-laws">Scaling Laws</h3>
<p>Training 45 RWKV models across varied (dataset size, parameter count) pairs, the authors find that RWKV follows the same log-log linear scaling law established for Transformers. The linear fit to the Pareto-optimal points achieves $r^2 = 0.994$, and extrapolating an additional order of magnitude still yields $r^2 = 0.875$. This contrasts with prior claims that LSTMs do not follow transformer-like scaling.</p>
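<p>For intuition, a power law $\text{loss} = a \cdot C^{b}$ is a straight line in log-log space, so the fit reduces to linear least squares on logged coordinates. A toy sketch on synthetic points (not the paper&rsquo;s data):</p>

```python
import numpy as np

# Synthetic compute/loss points lying exactly on a power law
C = np.logspace(18, 22, 10)   # compute budgets (FLOPs)
loss = 2.5 * C ** -0.05       # loss = a * C^b with a=2.5, b=-0.05

# Fit log(loss) = log(a) + b * log(C) by ordinary least squares
b, log_a = np.polyfit(np.log(C), np.log(loss), 1)
# Recovers b = -0.05 and a = 2.5 up to floating-point error
```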
<h3 id="nlp-benchmarks">NLP Benchmarks</h3>
<p>RWKV is compared against similarly-sized models trained on comparable token budgets: Pythia, OPT, and BLOOM (all FLOP-matched). Results span twelve benchmarks: ARC (Easy/Challenge), BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, and Winogrande.</p>
<p>RWKV performs competitively with Transformers across all model sizes. On average across benchmarks, RWKV tracks closely with Pythia and outperforms OPT and BLOOM at comparable scales.</p>
<h3 id="long-context-and-extended-finetuning">Long Context and Extended Finetuning</h3>
<p>RWKV can extend its context length after pretraining through progressive finetuning: doubling from 1024 to 2048 (10B tokens), then to 4096 (100B tokens), and finally to 8192 (100B tokens). Each doubling reduces test loss on the Pile, indicating effective use of longer context.</p>
<p>On the Long Range Arena (LRA) benchmark, which tests sequences from 1,000 to 16,000 tokens, RWKV performs second only to S4 across the five datasets.</p>
<h3 id="inference-efficiency">Inference Efficiency</h3>
<p>Benchmarking text generation on CPU (x86) and GPU (NVIDIA A100 80GB) at float32 precision shows that RWKV exhibits linear scaling in generation time, while Transformers scale quadratically. This advantage grows with sequence length: for long outputs, RWKV completes generation substantially faster at equivalent model sizes.</p>
<h2 id="competitive-performance-with-key-caveats">Competitive Performance with Key Caveats</h2>
<p>RWKV demonstrates that RNN-class models can match Transformer performance at scale, while maintaining $O(Td)$ time and $O(d)$ memory during inference. The key findings are:</p>
<ol>
<li><strong>Scaling laws hold</strong>: RWKV follows the same compute-optimal scaling as Transformers ($r^2 = 0.994$), contradicting earlier claims about RNN scaling behavior</li>
<li><strong>Competitive NLP performance</strong>: Across twelve benchmarks, RWKV matches similarly-sized Transformers trained on comparable data</li>
<li><strong>Linear inference cost</strong>: Generation time scales linearly rather than quadratically, with constant memory regardless of sequence length</li>
<li><strong>Context extension</strong>: Progressive finetuning effectively extends the context window post-training</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors identify two primary limitations:</p>
<p><strong>Information compression</strong>: Linear attention funnels all past information through a single fixed-size state vector. For tasks requiring recall of specific details over very long contexts, this is mechanistically more constrained than full self-attention, which maintains direct access to all previous tokens.</p>
<p><strong>Prompt sensitivity</strong>: RWKV is more sensitive to prompt engineering than standard Transformers. The linear attention mechanism limits how much prompt information carries forward, making the order of information in the prompt particularly important. Reordering prompts improved F1 from 44.2% to 74.8% on one task.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest several avenues: applying parallel scan to reduce WKV cost to $O(B \log(T) d)$, extending RWKV to encoder-decoder and multimodal architectures, leveraging hidden states for interpretability and safety, and increasing internal state size to improve long-range recall.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch training and inference implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/BlinkDL/rwkv-4-pile-14b">Pre-trained weights (169M to 14B)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>All six Pile-trained sizes on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>)</td>
      </tr>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>Mixed</td>
          <td>825 GiB pretraining corpus; component licenses vary by source</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Training code (Apache-2.0), pre-trained weights for all six model sizes, the full training corpus, and complete hyperparameters (Appendix G) are all publicly available. The only missing detail is the specific GPU cluster configuration used for pretraining.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>330B tokens</td>
          <td>One full epoch for all model sizes</td>
      </tr>
      <tr>
          <td>Context extension</td>
          <td>The Pile</td>
          <td>210B additional tokens</td>
          <td>Progressive doubling: 1024 to 8192</td>
      </tr>
      <tr>
          <td>NLP evaluation</td>
          <td>ARC, BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, Winogrande</td>
          <td>Various</td>
          <td>Zero-shot evaluation</td>
      </tr>
      <tr>
          <td>Long-range evaluation</td>
          <td>Long Range Arena (LRA)</td>
          <td>1K-16K tokens</td>
          <td>Five sub-tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam ($\beta = (0.9, 0.99)$), no weight decay</li>
<li>Precision: bfloat16</li>
<li>Training context length: 1024 tokens</li>
<li>Learning rate: constant warmup, then exponential decay</li>
<li>Auxiliary loss from PaLM (softmax normalizer regularization)</li>
<li>Batch size: 128 or 256 sequences (dynamically switched)</li>
<li>Training organized into mini-epochs of 40,320 samples each (8,043 mini-epochs per Pile epoch)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Init LR</th>
          <th>Warmup Mini-Epochs</th>
          <th>End LR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>6e-4</td>
          <td>361</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>4e-4</td>
          <td>411</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>3e-4</td>
          <td>443</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>1.5e-4</td>
          <td>451</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>1.5e-4</td>
          <td>465</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>1e-4</td>
          <td>544</td>
          <td>7e-6</td>
      </tr>
  </tbody>
</table>
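<p>The schedule itself is described only qualitatively (constant learning rate through warmup, then exponential decay to the end value). One way to realize it, with the decay horizon and interpolation as assumptions of mine, is:</p>

```python
def lr_schedule(step, init_lr, end_lr, warmup_steps, total_steps):
    """Constant LR during warmup, then exponential decay to end_lr.

    Only the init/end LRs and warmup lengths are specified per model
    (table above); the decay horizon here is an assumption.
    """
    if step <= warmup_steps:
        return init_lr
    frac = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return init_lr * (end_lr / init_lr) ** frac

# 169M model: 6e-4 -> 1e-5, 361 warmup mini-epochs, ~8043 total
lr_start = lr_schedule(100, 6e-4, 1e-5, 361, 8043)   # still 6e-4
lr_final = lr_schedule(8043, 6e-4, 1e-5, 361, 8043)  # decays to 1e-5
```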
<p>All pretrained models (169M to 14B) are publicly released on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>) under Apache-2.0. Training code is at <a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a> (Apache-2.0).</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>All NLP benchmarks evaluated in zero-shot setting</li>
<li>FLOP-matched comparison against Pythia, OPT, BLOOM</li>
<li>Inference benchmarked on CPU (x86) and GPU (NVIDIA A100 80GB) at float32</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference experiments: NVIDIA A100 80GB GPU</li>
<li>Training hardware details not fully specified; FLOP budgets reported per model</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., &hellip; &amp; Zhu, R.-J. (2023). RWKV: Reinventing RNNs for the Transformer Era. In <em>Findings of the Association for Computational Linguistics: EMNLP 2023</em>, pp. 14048-14077.</p>
<p><strong>Publication</strong>: Findings of EMNLP 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BlinkDL/RWKV-LM">GitHub Repository (Apache-2.0)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{peng2023rwkv,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{RWKV: Reinventing RNNs for the Transformer Era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Peng, Bo and Alcaide, Eric and Anthony, Quentin and Albalak, Alon and Arcadinho, Samuel and Biderman, Stella and Cao, Huanqi and Cheng, Xin and Chung, Michael and Derczynski, Leon and Du, Xingjian and Grella, Matteo and GV, Kranthi Kiran and He, Xuzheng and Hou, Haowen and Kazienko, Przemys{\l}aw and Koco{\&#39;n}, Jan and Kong, Jiaming and Koptyra, Bart{\l}omiej and Lau, Hayden and Lin, Jiaju and Mantri, Krishna Sri Ipsit and Mom, Ferdinand and Saito, Atsushi and Song, Guangyu and Tang, Xiangru and Wind, Johan S. and Wo{\&#39;z}niak, Stanis{\l}aw and Zhang, Zhenyuan and Zhou, Qinghua and Zhu, Jian and Zhu, Rui-Jie}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Findings of the Association for Computational Linguistics: EMNLP 2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{14048--14077}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.findings-emnlp.936}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Liquid-S4: Input-Dependent State-Space Models</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/liquid-s4-state-space-models/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/liquid-s4-state-space-models/</guid><description>Liquid-S4 combines liquid time-constant networks with structured state-space models, adding input-dependent kernels for long-range sequence modeling.</description><content:encoded><![CDATA[<h2 id="a-method-for-input-adaptive-sequence-modeling">A Method for Input-Adaptive Sequence Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces Liquid-S4, a new state-space model combining the structured state-space framework (S4) with liquid time-constant (LTC) networks. The primary contribution is an input-dependent state transition mechanism that allows the model to adapt its dynamics based on incoming inputs, while retaining the efficient convolutional kernel computation of S4.</p>
<h2 id="scaling-liquid-networks-to-long-sequences">Scaling Liquid Networks to Long Sequences</h2>
<p>Liquid time-constant (LTC) networks are continuous-time neural networks with input-dependent state transitions, giving them strong generalization and causal modeling properties. However, LTCs rely on ODE solvers that limit their scalability to long sequences. Structured state-space models (S4) solve this scalability problem through HiPPO initialization, diagonal plus low-rank (DPLR) parameterization, and efficient Cauchy kernel computation in the frequency domain, but they use fixed (input-independent) state transitions.</p>
<p>The key question this paper addresses: can the expressivity of LTC networks be combined with the efficiency and scalability of S4 to improve long-range sequence modeling?</p>
<h2 id="the-liquid-kernel-input-dependent-convolutions">The Liquid Kernel: Input-Dependent Convolutions</h2>
<p>The core innovation is a linearized LTC state-space model that replaces the standard SSM dynamics:</p>
<p>$$\dot{x}(t) = \mathbf{A}x(t) + \mathbf{B}u(t)$$</p>
<p>with an input-dependent formulation:</p>
<p>$$\dot{x}(t) = \left[\mathbf{A} + \mathbf{B}u(t)\right]x(t) + \mathbf{B}u(t)$$</p>
<p>where $u(t)$ now modulates the state transition matrix itself. After discretization via the <a href="https://en.wikipedia.org/wiki/Bilinear_transform">bilinear transform</a>, the recurrence becomes:</p>
<p>$$x_{k} = \left(\overline{\mathbf{A}} + \overline{\mathbf{B}}u_{k}\right)x_{k-1} + \overline{\mathbf{B}}u_{k}$$</p>
<p>Unrolling this recurrence reveals that the output $y_{k}$ decomposes into two parts:</p>
<p>$$y = \overline{\mathbf{K}} * u + \overline{\mathbf{K}}_{\text{liquid}} * u_{\text{correlations}}$$</p>
<p>The first term is the standard S4 convolutional kernel $\overline{\mathbf{K}}$, mapping individual input time steps independently. The second term is a new &ldquo;liquid kernel&rdquo; $\overline{\mathbf{K}}_{\text{liquid}}$ that operates on <a href="https://en.wikipedia.org/wiki/Autocorrelation">auto-correlation</a> terms of the input signal (products $u_{i}u_{j}$, $u_{i}u_{j}u_{k}$, etc., up to a chosen order $\mathcal{P}$).</p>
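<p>A scalar toy run (my own illustration, with arbitrary $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$) makes this decomposition concrete: unrolling two steps of the recurrence from $x_0 = 0$ produces exactly the standard convolution terms plus a $u_1 u_2$ correlation term:</p>

```python
import numpy as np

A, B = 0.9, 0.5              # scalar stand-ins for A_bar, B_bar
u = np.array([0.3, -0.7])    # two input steps u_1, u_2

# Input-dependent recurrence: x_k = (A + B*u_k) * x_{k-1} + B*u_k
x = 0.0
for uk in u:
    x = (A + B * uk) * x + B * uk

# Hand expansion of x_2:
#   A*B*u_1 + B*u_2      <- standard S4 kernel terms (linear in u)
#   + B**2 * u_1 * u_2   <- liquid correlation term
expected = A * B * u[0] + B * u[1] + B**2 * u[0] * u[1]
assert np.isclose(x, expected)
```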
<p><strong>Proposition 1</strong> shows that each liquid kernel of order $p$ can be computed from the precomputed S4 kernel via a <a href="https://en.wikipedia.org/wiki/Hadamard_product_(matrices)">Hadamard product</a> with $\overline{\mathbf{B}}^{p-1}$ followed by an anti-diagonal transformation (flip):</p>
<p>$$\overline{\mathbf{K}}_{\text{liquid}=p} = \left[\overline{\mathbf{K}}_{(L-\tilde{L},L)} \odot \overline{\mathbf{B}}_{(L-\tilde{L},L)}^{p-1}\right] * \mathbf{J}_{\tilde{L}}$$</p>
<p>This is the KB (Kernel $\times$ B) mode. The authors also propose a simplified PB (Powers of B) mode that sets the transition matrix $\overline{\mathbf{A}}$ to identity for the correlation terms:</p>
<p>$$\overline{\mathbf{K}}_{\text{liquid}=p} = \overline{\mathbf{C}} \odot \overline{\mathbf{B}}^{p-1}$$</p>
<p>The PB kernel is cheaper to compute and performs equally well or better in practice.</p>
<p>The computational complexity is $\tilde{\mathcal{O}}(N + L + p_{\text{max}}\tilde{L})$, where $N$ is the state size, $L$ the sequence length, $p_{\text{max}}$ the maximum liquid order, and $\tilde{L}$ the liquid kernel length (typically two orders of magnitude smaller than $L$).</p>
<h2 id="benchmarks-across-long-range-sequence-tasks">Benchmarks Across Long-Range Sequence Tasks</h2>
<p>Liquid-S4 is evaluated on four benchmark suites with the PB kernel using the S4-LegS (scaled <a href="https://en.wikipedia.org/wiki/Legendre_polynomials">Legendre</a>) parameterization.</p>
<h3 id="long-range-arena-lra">Long Range Arena (LRA)</h3>
<p>The LRA benchmark contains six tasks with sequence lengths from 1K to 16K. Liquid-S4 achieves state-of-the-art on all six tasks with an average accuracy of 87.32%:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input Length</th>
          <th>Liquid-S4</th>
          <th>S4-LegS</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ListOps</td>
          <td>2048</td>
          <td>62.75%</td>
          <td>59.60%</td>
          <td>+3.15%</td>
      </tr>
      <tr>
          <td>Text (IMDB)</td>
          <td>2048</td>
          <td>89.02%</td>
          <td>86.82%</td>
          <td>+2.20%</td>
      </tr>
      <tr>
          <td>Retrieval (AAN)</td>
          <td>4000</td>
          <td>91.20%</td>
          <td>90.90%</td>
          <td>+0.30%</td>
      </tr>
      <tr>
          <td>Image (CIFAR)</td>
          <td>1024</td>
          <td>89.50%</td>
          <td>88.65%</td>
          <td>+0.85%</td>
      </tr>
      <tr>
          <td>Pathfinder</td>
          <td>1024</td>
          <td>94.80%</td>
          <td>94.20%</td>
          <td>+0.60%</td>
      </tr>
      <tr>
          <td>Path-X</td>
          <td>16384</td>
          <td>96.66%</td>
          <td>96.35%</td>
          <td>+0.31%</td>
      </tr>
      <tr>
          <td><strong>Average</strong></td>
          <td></td>
          <td><strong>87.32%</strong></td>
          <td><strong>86.09%</strong></td>
          <td><strong>+1.23%</strong></td>
      </tr>
  </tbody>
</table>
<p>Liquid orders $p$ range from 2 to 6 across tasks.</p>
<h3 id="bidmc-vital-signs">BIDMC Vital Signs</h3>
<p>On medical time-series regression (heart rate, respiratory rate, <a href="https://en.wikipedia.org/wiki/Oxygen_saturation_(medicine)">SpO2</a> prediction from length-4000 biomarker signals):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Liquid-S4 (RMSE)</th>
          <th>S4-LegS (RMSE)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heart Rate</td>
          <td>0.303</td>
          <td>0.332</td>
          <td>8.7%</td>
      </tr>
      <tr>
          <td>Respiratory Rate</td>
          <td>0.158</td>
          <td>0.247</td>
          <td>36.0%</td>
      </tr>
      <tr>
          <td>SpO2</td>
          <td>0.066</td>
          <td>0.090</td>
          <td>26.7%</td>
      </tr>
  </tbody>
</table>
<h3 id="sequential-cifar-scifar">Sequential CIFAR (sCIFAR)</h3>
<p>Liquid-S4 with $p=3$ achieves 92.02% accuracy on 1-D pixel-level image classification, improving over S4-LegS (91.80%).</p>
<h3 id="speech-commands-full-35-labels">Speech Commands (Full 35 Labels)</h3>
<p>On the raw 16kHz speech recognition task, Liquid-S4 achieves 96.78% accuracy with only 224K parameters, roughly 27% fewer than S4&rsquo;s 307K. On the zero-shot 8kHz experiment, performance drops to 90.00% (vs. 91.32% for S4-LegS), which the authors attribute to the liquid kernel&rsquo;s sensitivity to input covariance structure at different sampling rates.</p>
<h2 id="consistent-improvements-with-smaller-models">Consistent Improvements with Smaller Models</h2>
<p>Liquid-S4 achieves state-of-the-art performance on every benchmark evaluated: all six LRA tasks (87.32% average), all three BIDMC vital signs tasks, sCIFAR, and full Speech Commands recognition. The gains are particularly large on tasks where input correlation structure matters (ListOps +3.15%, IMDB +2.20%, respiratory rate RMSE improvement of 36%).</p>
<p>A practical advantage is that Liquid-S4 works well with smaller state sizes (as low as 7 units for some tasks), reducing parameter counts. The PB kernel is recommended over KB for its simplicity and competitive performance. Higher liquid orders ($p$) consistently improve performance, though $p=3$ is recommended as a default.</p>
<p>Limitations include degraded performance in zero-shot frequency transfer (8kHz Speech Commands), suggesting the liquid kernel&rsquo;s input covariance terms may not generalize well across sampling rate changes. The paper also does not compare against non-SSM approaches beyond the LRA benchmark. The causal (unidirectional) configuration works better than bidirectional for Liquid-S4, which may limit applicability to tasks that benefit from bidirectional context.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Classification: Partially Reproducible.</strong> Code and all benchmark datasets are publicly available, with complete hyperparameters documented. No pre-trained weights are released and hardware requirements are not specified.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/raminmh/liquid-s4">raminmh/liquid-s4</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch implementation; fork of the S4 repo with KB and PB kernels added</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Long Range Arena (LRA)</td>
          <td>6 tasks, 1K-16K seq length</td>
          <td>ListOps, IMDB, AAN, CIFAR, Pathfinder, Path-X</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BIDMC Vital Signs</td>
          <td>4000-length biomarker signals</td>
          <td>Heart rate, respiratory rate, SpO2</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>sCIFAR</td>
          <td>1024-length flattened images</td>
          <td>10-class classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Speech Commands</td>
          <td>16kHz raw audio, 35 labels</td>
          <td>Full dataset with zero-shot 8kHz test</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The Liquid-S4 kernel computation builds on the S4 kernel pipeline:</p>
<ol>
<li>Initialize $\mathbf{A}$ with HiPPO (scaled Legendre) matrix in DPLR form</li>
<li>Compute S4 kernel $\overline{\mathbf{K}}$ via Cauchy kernel and iFFT</li>
<li>For each liquid order $p \in {2, \ldots, \mathcal{P}}$, compute $\overline{\mathbf{K}}_{\text{liquid}=p}$ using either KB or PB mode</li>
<li>Convolve $\overline{\mathbf{K}}_{\text{liquid}}$ with input correlation vector $u_{\text{correlations}}$</li>
</ol>
<p>The PB kernel mode is used in all reported experiments. The PyKeops package is used for large tensor computations.</p>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Depth</th>
          <th>Features</th>
          <th>State Size</th>
          <th>Norm</th>
          <th>LR</th>
          <th>Epochs</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ListOps</td>
          <td>9</td>
          <td>128</td>
          <td>7</td>
          <td>BN</td>
          <td>0.002</td>
          <td>30</td>
      </tr>
      <tr>
          <td>IMDB</td>
          <td>4</td>
          <td>128</td>
          <td>7</td>
          <td>BN</td>
          <td>0.003</td>
          <td>50</td>
      </tr>
      <tr>
          <td>AAN</td>
          <td>6</td>
          <td>256</td>
          <td>64</td>
          <td>BN</td>
          <td>0.005</td>
          <td>20</td>
      </tr>
      <tr>
          <td>CIFAR (LRA)</td>
          <td>6</td>
          <td>512</td>
          <td>512</td>
          <td>LN</td>
          <td>0.01</td>
          <td>200</td>
      </tr>
      <tr>
          <td>Pathfinder</td>
          <td>6</td>
          <td>256</td>
          <td>64</td>
          <td>BN</td>
          <td>0.0004</td>
          <td>200</td>
      </tr>
      <tr>
          <td>Path-X</td>
          <td>6</td>
          <td>320</td>
          <td>64</td>
          <td>BN</td>
          <td>0.001</td>
          <td>60</td>
      </tr>
      <tr>
          <td>Speech Commands</td>
          <td>6</td>
          <td>128</td>
          <td>7</td>
          <td>BN</td>
          <td>0.008</td>
          <td>50</td>
      </tr>
      <tr>
          <td>BIDMC (HR)</td>
          <td>6</td>
          <td>128</td>
          <td>256</td>
          <td>LN</td>
          <td>0.005</td>
          <td>500</td>
      </tr>
      <tr>
          <td>BIDMC (RR)</td>
          <td>6</td>
          <td>128</td>
          <td>256</td>
          <td>LN</td>
          <td>0.01</td>
          <td>500</td>
      </tr>
      <tr>
          <td>BIDMC (SpO2)</td>
          <td>6</td>
          <td>128</td>
          <td>256</td>
          <td>LN</td>
          <td>0.01</td>
          <td>500</td>
      </tr>
      <tr>
          <td>sCIFAR</td>
          <td>6</td>
          <td>512</td>
          <td>512</td>
          <td>LN</td>
          <td>0.01</td>
          <td>200</td>
      </tr>
  </tbody>
</table>
<p>Liquid-S4 generally requires smaller learning rates than S4/S4D. $\Delta t_{\text{max}} = 0.2$ for all experiments; $\Delta t_{\text{min}} \propto 1/\text{seq\_length}$.</p>
<h3 id="evaluation">Evaluation</h3>
<p>All results report validation accuracy (except BIDMC, which reports test RMSE). Experiments use 2-3 random seeds with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hasani, R., Lechner, M., Wang, T.-H., Chahine, M., Amini, A., &amp; Rus, D. (2022). Liquid Structural State-Space Models. <em>arXiv preprint arXiv:2209.12951</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hasani2022liquid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Liquid Structural State-Space Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hasani, Ramin and Lechner, Mathias and Wang, Tsun-Hsuan and Chahine, Makram and Amini, Alexander and Rus, Daniela}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2209.12951}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lagrangian Neural Networks for Physics</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/lagrangian-neural-networks/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/lagrangian-neural-networks/</guid><description>LNNs parameterize arbitrary Lagrangians with neural networks, learning energy-conserving dynamics without requiring canonical coordinates.</description><content:encoded><![CDATA[<h2 id="a-method-for-learning-arbitrary-lagrangians">A Method for Learning Arbitrary Lagrangians</h2>
<p>This is a <strong>Method</strong> paper that introduces Lagrangian Neural Networks (LNNs), a neural network architecture that parameterizes arbitrary Lagrangians to learn energy-conserving dynamics from data. The key contribution is showing that neural networks can learn Lagrangian functions directly, and that the Euler-Lagrange equation can be solved numerically using automatic differentiation to produce physically consistent dynamics. The approach is strictly more general than prior methods: it does not require canonical coordinates (unlike Hamiltonian Neural Networks) and does not restrict the functional form of kinetic energy (unlike Deep Lagrangian Networks).</p>
<h2 id="why-standard-neural-networks-fail-at-conservation-laws">Why Standard Neural Networks Fail at Conservation Laws</h2>
<p>Neural networks struggle to learn fundamental symmetries and conservation laws from data. A standard neural network trained on trajectories of a <a href="https://en.wikipedia.org/wiki/Double_pendulum">double pendulum</a> will gradually dissipate energy over long rollouts, producing physically implausible behavior. This happens because unconstrained function approximators have no inductive bias toward conservation.</p>
<p>Hamiltonian Neural Networks (HNNs) addressed this by learning a Hamiltonian function, which automatically enforces energy conservation. However, the <a href="https://en.wikipedia.org/wiki/Hamiltonian_mechanics">Hamiltonian formalism</a> requires inputs in <a href="https://en.wikipedia.org/wiki/Canonical_coordinates">canonical coordinates</a> $(q, p)$ satisfying strict <a href="https://en.wikipedia.org/wiki/Poisson_bracket">Poisson bracket</a> relations:</p>
<p>$$
p_i \equiv \frac{\partial \mathcal{L}}{\partial \dot{q}_i} \quad \Longleftrightarrow \quad \{q_i, q_j\} = 0, \quad \{p_i, p_j\} = 0, \quad \{q_i, p_j\} = \delta_{ij}
$$</p>
<p>In many real-world settings, the canonical momenta are unknown or difficult to compute. For example, in special relativity the canonical momentum $\dot{q}(1 - \dot{q}^2)^{-3/2}$ is a complicated nonlinear function of the velocity. Deep Lagrangian Networks (DeLaNs) partially addressed this by learning Lagrangians, but they assumed the kinetic energy takes the rigid-body form $T = \frac{1}{2}\dot{q}^{\top} M(q)\dot{q}$, which excludes relativistic and other non-standard systems.</p>
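<p>As a quick check, this momentum follows from the definition $p \equiv \partial \mathcal{L} / \partial \dot{q}$ applied to the relativistic Lagrangian used in the paper's experiment (with $c = m = 1$). The short snippet below, with illustrative parameter values, confirms the derivative numerically:</p>

```python
g = 1.0  # uniform potential strength (illustrative value)

# Relativistic Lagrangian from the paper's experiment, with c = m = 1
L = lambda q, qd: (1.0 - qd**2) ** -0.5 - 1.0 + g * q
# Claimed canonical momentum dL/d(qdot)
p_claimed = lambda qd: qd * (1.0 - qd**2) ** -1.5

qd, eps = 0.4, 1e-6  # arbitrary test velocity, finite-difference step
p_numeric = (L(0.0, qd + eps) - L(0.0, qd - eps)) / (2.0 * eps)
print(p_numeric, p_claimed(qd))  # the two values agree
```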
<h2 id="solving-euler-lagrange-for-a-black-box-lagrangian">Solving Euler-Lagrange for a Black-Box Lagrangian</h2>
<p>The core innovation of LNNs is a method for computing accelerations from a neural network that represents an arbitrary Lagrangian $\mathcal{L}(q, \dot{q})$. Starting from the <a href="https://en.wikipedia.org/wiki/Euler%E2%80%93Lagrange_equation">Euler-Lagrange equation</a>:</p>
<p>$$
\frac{d}{dt} \nabla_{\dot{q}} \mathcal{L} = \nabla_{q} \mathcal{L}
$$</p>
<p>The authors expand the time derivative using the chain rule, yielding:</p>
<p>$$
\left(\nabla_{\dot{q}} \nabla_{\dot{q}}^{\top} \mathcal{L}\right) \ddot{q} + \left(\nabla_{q} \nabla_{\dot{q}}^{\top} \mathcal{L}\right) \dot{q} = \nabla_{q} \mathcal{L}
$$</p>
<p>Solving for the accelerations gives:</p>
<p>$$
\ddot{q} = \left(\nabla_{\dot{q}} \nabla_{\dot{q}}^{\top} \mathcal{L}\right)^{-1} \left[ \nabla_{q} \mathcal{L} - \left(\nabla_{q} \nabla_{\dot{q}}^{\top} \mathcal{L}\right) \dot{q} \right]
$$</p>
<p>This requires computing the Hessian of the neural network with respect to $\dot{q}$ and then inverting it (using a pseudoinverse for numerical stability). JAX&rsquo;s automatic differentiation makes this feasible in just a few lines of code, despite the seemingly complex chain of second-order derivatives. The matrix inverse scales as $\mathcal{O}(d^3)$ with the number of coordinates $d$.</p>
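<p>The acceleration solve above can be sketched in a few lines. This NumPy sketch substitutes central finite differences for the autodiff the paper uses, and a toy harmonic-oscillator Lagrangian for the learned network; all names and values here are illustrative:</p>

```python
import numpy as np

def accelerations(L, q, qdot, eps=1e-4):
    """Solve the expanded Euler-Lagrange system for qddot:
        (grad_qd grad_qd^T L) qddot = grad_q L - (grad_q grad_qd^T L) qdot.
    Central finite differences stand in for the paper's JAX autodiff."""
    d = len(q)
    I = np.eye(d)

    def g_qd(qq, qd):  # gradient of L with respect to qdot at (qq, qd)
        return np.array([(L(qq, qd + eps * I[k]) - L(qq, qd - eps * I[k])) / (2 * eps)
                         for k in range(d)])

    grad_q = np.array([(L(q + eps * I[k], qdot) - L(q - eps * I[k], qdot)) / (2 * eps)
                       for k in range(d)])
    # Hessian w.r.t. qdot and the mixed (q, qdot) Jacobian, column by column
    H = np.stack([(g_qd(q, qdot + eps * I[k]) - g_qd(q, qdot - eps * I[k])) / (2 * eps)
                  for k in range(d)], axis=1)
    J = np.stack([(g_qd(q + eps * I[k], qdot) - g_qd(q - eps * I[k], qdot)) / (2 * eps)
                  for k in range(d)], axis=1)
    # Pseudoinverse for numerical stability, as in the paper
    return np.linalg.pinv(H) @ (grad_q - J @ qdot)

# Toy stand-in for a learned Lagrangian: harmonic oscillator, true qddot = -q
L_toy = lambda q, qd: 0.5 * qd @ qd - 0.5 * q @ q
print(accelerations(L_toy, np.array([0.3]), np.array([0.0])))  # ≈ [-0.3]
```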
<p>A critical implementation detail is the choice of activation function. Since the method takes second-order derivatives of the network, ReLU is unsuitable (its second derivative is zero everywhere). After a hyperparameter search over ReLU$^2$, ReLU$^3$, tanh, sigmoid, and softplus, the authors found <a href="https://en.wikipedia.org/wiki/Softplus">softplus</a> performed best.</p>
<p>The authors also developed a custom initialization scheme, using symbolic regression to find initialization variances that maintain well-conditioned gradients through the Hessian computation:</p>
<p>$$
\sigma = \frac{1}{\sqrt{n}} \begin{cases} 2.2 &amp; \text{First layer} \\ 0.58i &amp; \text{Hidden layer } i \\ n &amp; \text{Output layer} \end{cases}
$$</p>
<h2 id="extension-to-graphs-and-continuous-systems">Extension to Graphs and Continuous Systems</h2>
<p>LNNs extend naturally to graph-structured and continuous systems via Lagrangian <a href="/notes/machine-learning/model-architectures/relational-inductive-biases-deep-learning-graph-networks/">Graph Networks</a>. For a system with $n$ gridpoints, the total Lagrangian is decomposed into local densities:</p>
<p>$$
\mathcal{L} = \sum_{i=1}^{n} \mathcal{L}_i, \quad \text{where} \quad \mathcal{L}_i = \mathcal{L}_{\text{density}}\left(\{\phi_j, \dot{\phi}_j\}_{j \in \mathcal{I}_i}\right)
$$</p>
<p>Here $\mathcal{I}_i$ defines the neighborhood of node $i$ (e.g., $\{i-1, i, i+1\}$ for a 1D grid). The Lagrangian density is modeled as an MLP. The resulting Hessian matrix is sparse, with non-zero entries only at &ldquo;neighbor of neighbor&rdquo; positions, enabling efficient computation: in 1D, only 5 forward-over-backward autodiff passes are needed, and the tridiagonal inverse runs in linear time.</p>
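<p>A minimal sketch of this decomposition, assuming a periodic 1D grid and a hand-written wave-equation density in place of the learned MLP (the grid spacing <code>dx</code> and function names are illustrative):</p>

```python
import numpy as np

def total_lagrangian(phi, phidot, density, dx):
    """Total L as a sum of local densities over {i-1, i, i+1} neighborhoods
    on a periodic 1D grid; `density` stands in for the learned MLP."""
    n = len(phi)
    total = 0.0
    for i in range(n):
        idx = [(i - 1) % n, i, (i + 1) % n]  # neighborhood I_i with wraparound
        total += density(phi[idx], phidot[idx], dx)
    return total

def wave_density(p, pd, dx):
    # Hand-written discretization of the wave Lagrangian density:
    # 1/2 * phidot_i^2 - 1/2 * ((phi_{i+1} - phi_{i-1}) / (2 dx))^2
    return 0.5 * pd[1] ** 2 - 0.5 * ((p[2] - p[0]) / (2 * dx)) ** 2

phi = np.sin(np.linspace(0, 2 * np.pi, 8, endpoint=False))
phidot = np.zeros_like(phi)
print(total_lagrangian(phi, phidot, wave_density, dx=0.1))
```

Because each density only touches a fixed neighborhood, differentiating this sum with respect to $(\phi_i, \dot{\phi}_i)$ yields exactly the banded Hessian structure described above.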
<h2 id="experiments-double-pendulum-relativity-and-waves">Experiments: Double Pendulum, Relativity, and Waves</h2>
<p>All models used 4-layer MLPs with 500 hidden units, softplus activations, a decaying learning rate starting at $10^{-3}$, and batch size 32.</p>
<h3 id="double-pendulum">Double Pendulum</h3>
<p>The LNN and baseline achieved similar instantaneous acceleration losses ($7.3 \times 10^{-2}$ vs. $7.4 \times 10^{-2}$). The key difference appeared in long-term energy conservation: averaged over 40 random initial conditions with 100 time steps, the mean energy discrepancy was 8% of the maximum potential energy for the baseline but only 0.4% for the LNN.</p>
<h3 id="relativistic-particle">Relativistic Particle</h3>
<p>For a particle with Lagrangian $\mathcal{L} = ((1 - \dot{q}^2)^{-1/2} - 1) + gq$, the canonical momentum $\dot{q}(1 - \dot{q}^2)^{-3/2}$ is non-trivial. An HNN trained on non-canonical coordinates $(q, \dot{q})$ failed to learn the dynamics. The LNN succeeded using the same non-canonical coordinates, matching the performance of an HNN given the correct canonical coordinates.</p>
<h3 id="1d-wave-equation">1D Wave Equation</h3>
<p>The Lagrangian Graph Network learned the wave equation dynamics ($\ddot{\phi} = \frac{\partial^2 \phi}{\partial x^2}$ with $c = 1$) on a 100-gridpoint domain with periodic boundary conditions. The network learned the Lagrangian density corresponding to the continuum form $\mathcal{L} = \int (\dot{\phi}^2 - (\partial \phi / \partial x)^2) dx$, accurately modeling wave propagation and conserving energy across the material.</p>
<table>
  <thead>
      <tr>
          <th>Experiment</th>
          <th>Model</th>
          <th>Energy Error (% of max PE)</th>
          <th>Canonical Coords Required</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Double Pendulum</td>
          <td>Baseline</td>
          <td>8%</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Double Pendulum</td>
          <td>LNN</td>
          <td>0.4%</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Relativistic Particle</td>
          <td>HNN (non-canonical)</td>
          <td>Failed</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Relativistic Particle</td>
          <td>HNN (canonical)</td>
          <td>Succeeded</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Relativistic Particle</td>
          <td>LNN</td>
          <td>Succeeded</td>
          <td>No</td>
      </tr>
      <tr>
          <td>1D Wave Equation</td>
          <td>LGN</td>
          <td>Energy conserved</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
<h2 id="findings-and-comparison-to-prior-approaches">Findings and Comparison to Prior Approaches</h2>
<p>LNNs combine several desirable properties that no single prior method offers:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Neural Net</th>
          <th>Neural ODE</th>
          <th>HNN</th>
          <th>DeLaN</th>
          <th>LNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Models dynamical systems</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Learns differential equations</td>
          <td></td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Learns exact conservation laws</td>
          <td></td>
          <td></td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Learns from arbitrary coordinates</td>
          <td>Yes</td>
          <td>Yes</td>
          <td></td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Learns arbitrary Lagrangians</td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p>The main limitation is computational cost: the Hessian computation and inversion scale as $\mathcal{O}(d^3)$ in the number of coordinates. The Lagrangian Graph Network partially mitigates this for spatially extended systems through the sparsity of the resulting Hessian. The method also assumes access to state derivatives ($\dot{q}$) during training, which may not always be directly available from observations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Double pendulum</td>
          <td>600,000 random initial conditions</td>
          <td>Simulated with masses and lengths set to 1</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Relativistic particle</td>
          <td>Random initial conditions and $g$ values</td>
          <td>$c = 1$, mass = 1, uniform potential</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>1D wave equation</td>
          <td>100 gridpoints</td>
          <td>Periodic boundary conditions, $c = 1$</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Forward model: Euler-Lagrange equation solved via Equation 6 using JAX autodiff</li>
<li>Pseudoinverse used for Hessian inversion to handle potential singular matrices</li>
<li>Custom initialization scheme (Equation 16) derived via symbolic regression with eureqa</li>
<li>Softplus activation selected via hyperparameter search</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4-layer MLP with 500 hidden units for all experiments</li>
<li>Softplus activation function</li>
<li>Code: <a href="https://github.com/MilesCranmer/lagrangian_nns">github.com/MilesCranmer/lagrangian_nns</a> (Apache-2.0)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LNN</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Acceleration loss (double pendulum)</td>
          <td>$7.3 \times 10^{-2}$</td>
          <td>$7.4 \times 10^{-2}$</td>
          <td>Similar short-term accuracy</td>
      </tr>
      <tr>
          <td>Energy error (double pendulum)</td>
          <td>0.4%</td>
          <td>8%</td>
          <td>Percentage of max potential energy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. JAX-based implementation supports CPU and GPU execution.</p>
<hr>
<p><strong>Reproducibility Status</strong>: Highly Reproducible</p>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MilesCranmer/lagrangian_nns">lagrangian_nns</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official JAX implementation with notebooks for all experiments</td>
      </tr>
      <tr>
          <td>Training data</td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Generated procedurally; simulation code included in repository</td>
      </tr>
      <tr>
          <td>Trained models</td>
          <td>Model</td>
          <td>N/A</td>
          <td>Not provided</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cranmer, M., Greydanus, S., Hoyer, S., Battaglia, P., Spergel, D., &amp; Ho, S. (2020). Lagrangian Neural Networks. <em>ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations</em>. arXiv: <a href="https://arxiv.org/abs/2003.04630">2003.04630</a></p>
<p><strong>Publication</strong>: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{cranmer2020lagrangian,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Lagrangian Neural Networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cranmer, Miles and Greydanus, Sam and Hoyer, Stephan and Battaglia, Peter and Spergel, David and Ho, Shirley}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2003.04630}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Ewald Message Passing for Molecular Graphs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/ewald-message-passing-molecular-graphs/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/ewald-message-passing-molecular-graphs/</guid><description>Ewald message passing augments GNNs with Fourier-space long-range interactions, improving energy predictions by 10-16% on OC20 and OE62 benchmarks.</description><content:encoded><![CDATA[<h2 id="a-fourier-space-long-range-correction-for-molecular-gnns">A Fourier-Space Long-Range Correction for Molecular GNNs</h2>
<p>This is a <strong>Method</strong> paper that introduces Ewald message passing (Ewald MP), a general framework for incorporating long-range interactions into message passing neural networks (MPNNs) for molecular <a href="/notes/chemistry/molecular-simulation/ml-potentials/learning-smooth-interatomic-potentials/">potential energy surface</a> prediction. The key contribution is a nonlocal Fourier-space message passing scheme, grounded in the classical <a href="https://en.wikipedia.org/wiki/Ewald_summation">Ewald summation</a> technique from computational physics, that complements the short-range message passing of existing GNN architectures.</p>
<h2 id="the-long-range-interaction-problem-in-molecular-gnns">The Long-Range Interaction Problem in Molecular GNNs</h2>
<p>Standard MPNNs for molecular property prediction rely on a spatial distance cutoff to define atomic neighborhoods. While this locality assumption enables favorable scaling with system size and provides a useful inductive bias, it fundamentally limits the model&rsquo;s ability to capture long-range interactions such as electrostatic forces and van der Waals (<a href="https://en.wikipedia.org/wiki/London_dispersion_force">London dispersion</a>) interactions. These interactions decay slowly with distance (e.g., electrostatic energy follows a $1/r$ power law), and truncating them with a distance cutoff can introduce severe artifacts in thermochemical predictions.</p>
<p>This problem is well-known in molecular dynamics, where empirical force fields explicitly separate bonded (short-range) and non-bonded (long-range) energy terms. The Ewald summation technique addresses this by decomposing interactions into a short-range part that converges quickly with a distance cutoff and a long-range part whose Fourier transform converges quickly with a frequency cutoff. The authors propose bringing this same strategy into the GNN paradigm.</p>
<h2 id="from-ewald-summation-to-learnable-fourier-space-messages">From Ewald Summation to Learnable Fourier-Space Messages</h2>
<p>The core insight is a formal analogy between the continuous-filter convolution used in MPNNs and the electrostatic potential computation in Ewald summation. In a standard continuous-filter convolution, the message sum for atom $i$ is:</p>
<p>$$
M_i^{(l+1)} = \sum_{j \in \mathcal{N}(i)} h_j^{(l)} \cdot \Phi^{(l)}(| \mathbf{x}_i - \mathbf{x}_j |)
$$</p>
<p>where $h_j^{(l)}$ are atom embeddings and $\Phi^{(l)}$ is a learned radial filter. Comparing this to the electrostatic potential $V_i^{\text{es}}(\mathbf{x}_i) = \sum_{j \neq i} q_j \cdot \Phi^{\text{es}}(| \mathbf{x}_i - \mathbf{x}_j |)$ reveals a direct correspondence: atom embeddings play the role of partial charges, and learned filters replace the $1/r$ kernel.</p>
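<p>A literal NumPy sketch of this short-range message sum, with a hypothetical <code>radial_filter</code> standing in for the learned filter network $\Phi^{(l)}$ and illustrative coordinates and embeddings:</p>

```python
import numpy as np

def continuous_filter_messages(x, h, radial_filter, cutoff):
    """M_i = sum over neighbors j of h_j * Phi(|x_i - x_j|), with the
    neighborhood N(i) defined by a distance cutoff."""
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # (N, N)
    M = np.zeros_like(h)
    for i in range(len(x)):
        for j in range(len(x)):
            if i != j and dists[i, j] < cutoff:
                M[i] += h[j] * radial_filter(dists[i, j])
    return M

# Three atoms on a line; the third lies beyond the cutoff of the first two
x = np.array([[0.0, 0, 0], [1.0, 0, 0], [10.0, 0, 0]])
h = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(continuous_filter_messages(x, h, lambda r: 1.0 / r, cutoff=2.0))
```

Everything outside the cutoff contributes nothing to $M_i$, which is precisely the truncation that Ewald MP's long-range branch is designed to compensate for.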
<p>Ewald MP decomposes the learned filter into short-range and long-range components. The short-range part is handled by any existing GNN architecture with a distance cutoff. The long-range part is computed as a sum over Fourier frequencies:</p>
<p>$$
M^{\text{lr}}(\mathbf{x}_i) = \sum_{\mathbf{k}} \exp(i \mathbf{k}^T \mathbf{x}_i) \cdot s_{\mathbf{k}} \cdot \hat{\Phi}^{\text{lr}}(| \mathbf{k} |)
$$</p>
<p>where $s_{\mathbf{k}}$ are <strong><a href="https://en.wikipedia.org/wiki/Structure_factor">structure factor</a> embeddings</strong>, computed as:</p>
<p>$$
s_{\mathbf{k}} = \sum_{j \in \mathcal{S}} h_j \exp(-i \mathbf{k}^T \mathbf{x}_j)
$$</p>
<p>These structure factor embeddings are a Fourier-space representation of the atom embedding distribution, and truncating to low frequencies effectively coarse-grains the hidden model state while preserving long-range information. The frequency filters $\hat{\Phi}^{\text{lr}}$ are learned, making the entire scheme data-driven rather than tied to a fixed physical functional form.</p>
<p>The method handles both <strong>periodic</strong> systems (where the <a href="https://en.wikipedia.org/wiki/Reciprocal_lattice">reciprocal lattice</a> provides a natural frequency discretization) and <strong>aperiodic</strong> systems (where the Fourier domain is discretized using a cubic voxel grid with SVD-based rotation alignment to preserve rotation invariance). The combined embedding update becomes:</p>
<p>$$
h_i^{(l+1)} = \frac{1}{\sqrt{3}} \left[ h_i^{(l)} + f_{\text{upd}}^{\text{sr}}(M_i^{\text{sr}}) + f_{\text{upd}}^{\text{lr}}(M_i^{\text{lr}}) \right]
$$</p>
<p>The computational complexity is $\mathcal{O}(N_{\text{at}} N_{\text{k}})$, and by fixing the number of frequency vectors $N_{\text{k}}$, linear scaling $\mathcal{O}(N_{\text{at}})$ is achievable.</p>
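<p>The two-step computation — structure factors first, then per-atom synthesis — can be sketched in NumPy as follows, with <code>freq_filter</code> a stand-in for the learned $\hat{\Phi}^{\text{lr}}$ (in the actual method the filter is learned per interaction block):</p>

```python
import numpy as np

def ewald_long_range_messages(x, h, kvecs, freq_filter):
    """Long-range message sum via structure factors, O(N_at * N_k).
    `freq_filter` is a stand-in for the learned frequency filter."""
    phase = np.exp(-1j * x @ kvecs.T)               # (N_at, N_k): e^{-i k.x_j}
    s = phase.T @ h                                 # (N_k, D) structure factors s_k
    w = freq_filter(np.linalg.norm(kvecs, axis=1))  # (N_k,) filter values
    # M_i = sum_k e^{+i k.x_i} * w_k * s_k; conj(phase) supplies e^{+i k.x_i}
    return (np.conj(phase) * w) @ s                 # (N_at, D) complex messages
```

Expanding the matrix products recovers the pairwise sum $\sum_j h_j \sum_{\mathbf{k}} e^{i\mathbf{k}^\top(\mathbf{x}_i - \mathbf{x}_j)} \hat{\Phi}^{\text{lr}}(|\mathbf{k}|)$ without ever forming it explicitly, which is where the linear scaling in $N_{\text{at}}$ comes from once $N_{\text{k}}$ is fixed.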
<h2 id="experiments-across-four-gnn-architectures-and-two-datasets">Experiments Across Four GNN Architectures and Two Datasets</h2>
<p>The authors test Ewald MP as an augmentation on four baseline architectures: <a href="/notes/chemistry/datasets/marcel/">SchNet, PaiNN, DimeNet++, and GemNet-T</a>. Two datasets are used:</p>
<ul>
<li><strong>OC20</strong> (Chanussot et al., 2021): ~265M periodic structures of adsorbate-catalyst systems with DFT-computed energies and forces. The OC20-2M subsplit is used for training.</li>
<li><strong>OE62</strong> (Stuke et al., 2020): ~62,000 large aperiodic organic molecules with DFT-computed energies that include a DFT-D3 dispersion correction for London dispersion interactions.</li>
</ul>
<p>All baselines use a 6 Å distance cutoff and 50 maximum neighbors. The Ewald modification is minimal: the long-range message sum is added as an additional skip connection term in each interaction block. Comparison studies include: (1) increasing the distance cutoff to match the computational cost of Ewald MP, (2) replacing the Ewald block with a SchNet interaction block at increased cutoff, and (3) increasing atom embedding dimensions to match Ewald MP&rsquo;s parameter count.</p>
<h3 id="key-energy-mae-results-on-oe62">Key Energy MAE Results on OE62</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Baseline (meV)</th>
          <th>Ewald MP (meV)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>133.5</td>
          <td>79.2</td>
          <td>40.7%</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>61.4</td>
          <td>57.9</td>
          <td>5.7%</td>
      </tr>
      <tr>
          <td>DimeNet++</td>
          <td>51.2</td>
          <td>46.5</td>
          <td>9.2%</td>
      </tr>
      <tr>
          <td>GemNet-T</td>
          <td>51.5</td>
          <td>47.4</td>
          <td>8.0%</td>
      </tr>
  </tbody>
</table>
<h3 id="key-energy-mae-results-on-oc20-averaged-across-test-splits">Key Energy MAE Results on OC20 (Averaged Across Test Splits)</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Baseline (meV)</th>
          <th>Ewald MP (meV)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>895</td>
          <td>830</td>
          <td>7.3%</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>448</td>
          <td>393</td>
          <td>12.3%</td>
      </tr>
      <tr>
          <td>DimeNet++</td>
          <td>496</td>
          <td>445</td>
          <td>10.4%</td>
      </tr>
      <tr>
          <td>GemNet-T</td>
          <td>346</td>
          <td>307</td>
          <td>11.3%</td>
      </tr>
  </tbody>
</table>
<h2 id="robust-long-range-improvements-and-dispersion-recovery">Robust Long-Range Improvements and Dispersion Recovery</h2>
<p>Ewald MP achieves consistent improvements across all models and both datasets, averaging 16.1% on OE62 and 10.3% on OC20. Several findings stand out:</p>
<ol>
<li>
<p><strong>Robustness</strong>: Unlike the increased-cutoff and SchNet-LR alternatives, Ewald MP never produces detrimental effects in any tested configuration. The increased cutoff setting hurts SchNet and PaiNN on OE62, and the SchNet-LR block fails to improve DimeNet++ and GemNet-T.</p>
</li>
<li>
<p><strong>Long-range specificity</strong>: A binning analysis on OE62 groups molecules by the magnitude of their DFT-D3 dispersion correction. Ewald MP shows an outsize improvement for structures with large long-range energy contributions. It recovers or surpasses a &ldquo;cheating&rdquo; baseline that receives the exact DFT-D3 ground truth as an additional input.</p>
</li>
<li>
<p><strong>Efficiency on periodic systems</strong>: Ewald MP achieves similar relative improvements on OC20 at roughly half the relative computational cost compared to OE62, suggesting periodic structures as a particularly attractive application domain.</p>
</li>
<li>
<p><strong>Force predictions</strong>: Improvements in <a href="/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/">force MAEs</a> are consistent but small, which is expected since the frequency truncation removes high-frequency contributions to the potential energy surface.</p>
</li>
<li>
<p><strong>Ablation studies</strong>: Results are robust across different frequency cutoffs, voxel resolutions, and filtering strategies, with the non-radial periodic filtering scheme outperforming radial alternatives on out-of-distribution generalization.</p>
</li>
</ol>
<p>Limitations include the current focus on scalar (invariant) embeddings only (PaiNN&rsquo;s equivariant vector embeddings are not augmented), and the potential for a &ldquo;gap&rdquo; of medium-range interactions when $N_{\text{k}}$ is fixed for linear scaling. The authors suggest adapting more efficient Ewald summation variants (e.g., particle mesh Ewald with $\mathcal{O}(N \log N)$ scaling) as future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (periodic)</td>
          <td>OC20-2M</td>
          <td>~2M structures</td>
          <td>Subsplit of OC20; PBC; DFT energies and forces</td>
      </tr>
      <tr>
          <td>Training (aperiodic)</td>
          <td>OE62</td>
          <td>~62,000 molecules</td>
          <td>Large organic molecules; DFT energies with D3 correction</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>OC20-test (4 splits: ID, OOD-ads, OOD-cat, OOD-both)</td>
          <td>Varies</td>
          <td>Evaluated via submission to OC20 evaluation server</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>OE62-val, OE62-test</td>
          <td>~6,000 each</td>
          <td>Direct evaluation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Ewald message passing is integrated as an additional skip connection term in each interaction block</li>
<li>For periodic systems: non-radial filtering with fixed reciprocal lattice positions ($N_x, N_y, N_z$ hyperparameters)</li>
<li>For aperiodic systems: radial Gaussian basis function filtering with frequency cutoff $c_k$ and voxel resolution $\Delta = 0.2$ Å$^{-1}$</li>
<li>SVD-based coordinate alignment for rotation invariance in the aperiodic case</li>
<li>Bottleneck dimension $N_\downarrow = 16$ (GemNet-T) or $N_\downarrow = 8$ (others)</li>
<li>Update function: dense layer + $N_{\text{hidden}}$ residual layers ($N_{\text{hidden}} = 3$, except PaiNN with $N_{\text{hidden}} = 0$)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Embedding Size (OE62)</th>
          <th>Interaction Blocks</th>
          <th>Ewald Params (OE62)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>512</td>
          <td>4</td>
          <td>12.2M total</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>512</td>
          <td>4</td>
          <td>15.7M total</td>
      </tr>
      <tr>
          <td>DimeNet++</td>
          <td>256</td>
          <td>3</td>
          <td>4.8M total</td>
      </tr>
      <tr>
          <td>GemNet-T</td>
          <td>256</td>
          <td>3</td>
          <td>16.1M total</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Primary metric: Energy mean absolute error (EMAE) in meV</li>
<li>Secondary metric: Force MAE in meV/Å (OC20 only)</li>
<li>Loss: Linear combination of energy and force MAEs (Eq. 15) with model-specific force multipliers</li>
<li>Optimizer: Adam with weight decay ($\lambda = 0.01$)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All runtime measurements on NVIDIA A100 GPUs</li>
<li>Runtimes measured after 50 warmup batches, averaged over 500 batches, minimum of 3 repetitions</li>
<li>Code: <a href="https://github.com/arthurkosmala/EwaldMP">EwaldMP</a> (Hippocratic License 3.0)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/arthurkosmala/EwaldMP">EwaldMP</a></td>
          <td>Code</td>
          <td>Hippocratic License 3.0 (new files) / MIT (OC20 base)</td>
          <td>Official implementation built on the Open Catalyst Project codebase</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md">OC20</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>~265M periodic adsorbate-catalyst structures with DFT energies and forces</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.1038/s41597-020-0385-y">OE62</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>~62,000 large organic molecules with DFT energies including D3 correction</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Source code, both datasets, and detailed hyperparameters (including per-model learning rates, batch sizes, and Ewald-specific settings) are all publicly available. Pre-trained model weights are not provided.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kosmala, A., Gasteiger, J., Gao, N., &amp; Günnemann, S. (2023). Ewald-based Long-Range Message Passing for Molecular Graphs. In <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>.</p>
<p><strong>Publication</strong>: ICML 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kosmala2023ewald,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Ewald-based Long-Range Message Passing for Molecular Graphs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kosmala, Arthur and Gasteiger, Johannes and Gao, Nicholas and G{\&#34;u}nnemann, Stephan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 40th International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{202}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Block-Recurrent Transformers for Long Sequences</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</guid><description>Block-Recurrent Transformers combine attention and recurrence for linear-complexity language modeling on long documents like books and code.</description><content:encoded><![CDATA[<h2 id="a-method-for-combining-attention-with-block-level-recurrence">A Method for Combining Attention with Block-Level Recurrence</h2>
<p>This is a <strong>Method</strong> paper that introduces the Block-Recurrent Transformer, a model architecture that integrates recurrence into the transformer framework at the block level. Rather than processing tokens one at a time (as in traditional RNNs) or attending over entire sequences (as in standard transformers), this approach applies a transformer layer recurrently across blocks of tokens. The result is a model with linear complexity in sequence length that maintains the parallelism benefits of transformers during training. A related approach, <a href="/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/">RWKV</a>, later explored similar ideas using linear attention with channel-wise decay.</p>
<h2 id="why-transformers-struggle-with-long-documents">Why Transformers Struggle with Long Documents</h2>
<p>Transformers have largely replaced RNNs for sequence modeling tasks, but their quadratic self-attention cost limits the length of sequences they can process. A transformer with a window size of 512 tokens cannot see information beyond that window, making it blind to long-range dependencies in books, technical papers, or source code repositories.</p>
<p>Prior approaches to this problem fall into several categories: sparse attention patterns (BigBird, Routing Transformers, Reformer), sequence compression (Linformer, Funnel Transformers), and linearized attention approximations. These methods either sacrifice the expressiveness of full softmax attention or introduce implementation complexity.</p>
<p>Traditional RNNs like LSTMs offer linear complexity but suffer from three key limitations: sequential processing prevents parallelism on modern hardware, a single state vector bottlenecks information capacity, and vanishing gradients limit effective memory to a few hundred tokens.</p>
<h2 id="block-level-recurrence-with-lstm-style-gates">Block-Level Recurrence with LSTM-Style Gates</h2>
<p>The core innovation is applying a standard transformer layer in a recurrent fashion along the sequence, operating on blocks of $W$ tokens rather than individual tokens. The recurrent cell maintains $S$ state vectors (typically $S = W = 512$) that are updated at each block boundary.</p>
<h3 id="the-recurrent-cell">The Recurrent Cell</h3>
<p>The cell has two processing directions:</p>
<ul>
<li><strong>Vertical direction</strong>: An ordinary transformer layer with self-attention over input tokens and cross-attention to recurrent states, producing output embeddings.</li>
<li><strong>Horizontal direction</strong>: Self-attention over current state vectors and cross-attention to input tokens, producing updated state vectors. Residual connections are replaced with gates.</li>
</ul>
<p>Self-attention and cross-attention are computed in parallel (not sequentially), with results concatenated and fed into a linear projection. Keys and values are shared between directions, while queries are separate, yielding four query sets: $Q_e^v$, $Q_s^v$ (vertical) and $Q_s^h$, $Q_e^h$ (horizontal).</p>
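<p>The two-direction update above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper&rsquo;s implementation: query/key/value projections are identity maps, both directions share one output projection, and multi-head structure and gating are omitted.</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    # Scaled dot-product attention for 2-D (length, dim) arrays.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def recurrent_cell_sketch(tokens, states, W_proj):
    """One block step. tokens: (W, D) embeddings for the current block;
    states: (S, D) recurrent state vectors; W_proj: (2D, D) projection.
    Keys/values are shared between directions; the four query sets are
    identity projections here, purely for illustration."""
    # Vertical: token self-attention + cross-attention to states,
    # computed in parallel, concatenated, then projected.
    vertical = np.concatenate(
        [attend(tokens, tokens, tokens), attend(tokens, states, states)],
        axis=-1) @ W_proj
    # Horizontal: state self-attention + cross-attention to tokens.
    horizontal = np.concatenate(
        [attend(states, states, states), attend(states, tokens, tokens)],
        axis=-1) @ W_proj
    return vertical, horizontal  # output embeddings, state-update candidate
```

<p>In the actual architecture the horizontal output is not added residually but fed through the gates described next.</p>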
<h3 id="gating-mechanisms">Gating Mechanisms</h3>
<p>Two gate types are explored. The <strong>fixed gate</strong> uses a learned convex combination:</p>
<p>$$
g = \sigma(b_g)
$$</p>
<p>$$
c_{t+1} = c_t \odot g + z_t \odot (1 - g)
$$</p>
<p>where $g$ is constant after training, implementing an <a href="https://en.wikipedia.org/wiki/Moving_average">exponential moving average</a>.</p>
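<p>A minimal sketch of the fixed gate (the bias values are hypothetical): because $g$ depends only on a learned bias, the state update reduces to a per-channel exponential moving average of the cell&rsquo;s candidate updates.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

b_g = np.array([2.0, 0.0, -2.0])  # hypothetical learned gate biases
g = sigmoid(b_g)                  # per-channel EMA decay, constant after training

c = np.zeros(3)                   # recurrent state
for z in [np.ones(3), np.ones(3)]:  # candidate updates z_t from the cell
    c = c * g + z * (1.0 - g)     # convex combination: old state vs. new input
```
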
<p>The <strong>LSTM gate</strong> uses input and forget gates:</p>
<p>$$
i_t = \sigma(W_i h_t + b_i - 1)
$$</p>
<p>$$
f_t = \sigma(W_f h_t + b_f + 1)
$$</p>
<p>$$
c_{t+1} = c_t \odot f_t + z_t \odot i_t
$$</p>
<p>The bias offsets ($-1$ for input, $+1$ for forget) initialize the model to &ldquo;remember&rdquo; by default, which is critical for training stability. Without careful initialization, the model can fall into a local optimum where it ignores the recurrent state entirely. This echoes the <a href="/notes/machine-learning/model-architectures/can-recurrent-neural-networks-warp-time/">gate initialization challenges studied by Tallec and Ollivier</a>, who derived chrono initialization for LSTMs from time-warping invariance.</p>
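<p>The effect of the bias offsets can be seen in a small sketch (weights at zero, as at initialization; shapes are illustrative): the forget gate starts near $\sigma(+1) \approx 0.73$ and the input gate near $\sigma(-1) \approx 0.27$, so the cell leans toward remembering its state.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_gate_step(c, h, z, W_i, W_f, b_i, b_f):
    """One gated state update with the paper's bias offsets."""
    i = sigmoid(W_i @ h + b_i - 1.0)  # input gate, biased toward "ignore input"
    f = sigmoid(W_f @ h + b_f + 1.0)  # forget gate, biased toward "remember"
    return c * f + z * i

# At initialization (all weights and biases zero) the state decays slowly,
# which keeps long-range gradients alive early in training.
D = 4
c_next = lstm_gate_step(np.ones(D), np.zeros(D), np.zeros(D),
                        np.zeros((D, D)), np.zeros((D, D)),
                        np.zeros(D), np.zeros(D))
```
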
<h3 id="gate-configurations">Gate Configurations</h3>
<p>Three configurations are tested: <strong>dual</strong> (gates on both attention and MLP outputs), <strong>single</strong> (gate only on MLP output), and <strong>skip</strong> (gate only on attention output, no MLP). The skip configuration removes the large MLP from the recurrent layer entirely.</p>
<h3 id="learned-state-ids">Learned State IDs</h3>
<p>Since the same weights are applied to all state vectors, learned &ldquo;state IDs&rdquo; (analogous to position embeddings) are added so each state vector can issue distinct queries. <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style relative position bias is used for token self-attention, with no position bias for state-token cross-attention.</p>
<h2 id="language-modeling-on-pg19-arxiv-and-github">Language Modeling on PG19, arXiv, and GitHub</h2>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>The base model is a 12-layer transformer with 150M parameters (8 heads of size 128, embedding dimension 1024, MLP hidden size 4096). The recurrent layer is placed at layer 10 with segment length $N = 4096$ and window size $W = 512$. The architecture is evaluated on three long-document datasets:</p>
<ul>
<li><strong>PG19</strong>: Full-length books from <a href="https://en.wikipedia.org/wiki/Project_Gutenberg">Project Gutenberg</a> (pre-1919)</li>
<li><strong>arXiv</strong>: Mathematics papers in LaTeX</li>
<li><strong>GitHub</strong>: Concatenated source code from open-source repositories</li>
</ul>
<p>All models report bits-per-token ($\log_2$ perplexity, lower is better).</p>
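<p>Bits-per-token and perplexity convert directly, which helps when comparing tables across papers: a quick check using the Rec:fixed:skip PG19 number from the results below.</p>

```python
import math

# Bits-per-token is log2 of token-level perplexity, so ppl = 2 ** bpt.
bpt = 3.53       # Rec:fixed:skip, PG19 token-level bits-per-token
ppl = 2 ** bpt   # corresponding token-level perplexity
```
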
<h3 id="baselines">Baselines</h3>
<p>Five baselines are compared: Transformer-XL with window sizes of 512, 1024, and 2048, plus 12-layer and 13-layer sliding-window models. The 13-layer sliding window (Slide:13L) is the primary comparison, since it matches the recurrent models in computation cost and parameter count.</p>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Step Time</th>
          <th>PG19 (bytes)</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XL:512</td>
          <td>0.88</td>
          <td>1.01</td>
          <td>3.62</td>
          <td>1.45</td>
          <td>1.21</td>
      </tr>
      <tr>
          <td>XL:2048</td>
          <td>2.11</td>
          <td>0.990</td>
          <td>3.58</td>
          <td>1.31</td>
          <td>1.01</td>
      </tr>
      <tr>
          <td>Slide:13L</td>
          <td>1.00</td>
          <td>0.989</td>
          <td>3.58</td>
          <td>1.42</td>
          <td>1.17</td>
      </tr>
      <tr>
          <td>Rec:fixed:skip</td>
          <td>0.99</td>
          <td>0.952</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Rec:fixed:dual</td>
          <td>1.01</td>
          <td>0.957</td>
          <td>3.52</td>
          <td>1.27</td>
          <td>0.991</td>
      </tr>
      <tr>
          <td>Feedback:fixed:skip</td>
          <td>1.35</td>
          <td>0.935</td>
          <td>3.49</td>
          <td>1.24</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Memorizing Trans. 64k</td>
          <td>1.94</td>
          <td>0.950</td>
          <td>3.53</td>
          <td>1.22</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The Rec:fixed:skip configuration achieves the best overall results while being slightly faster than the 13-layer baseline. It outperforms XL:2048, which runs over 2x slower. The block feedback variant (allowing all layers to cross-attend to recurrent states) improves perplexity further at ~35-40% higher step time.</p>
<h3 id="scaling-behavior">Scaling Behavior</h3>
<p>Models from 40M to 1.3B parameters show that the benefit of recurrence is <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">consistent across scales</a> and increases with model size. At larger sizes, adding recurrence provides a benefit greater than doubling the number of parameters. The 1.3B parameter model achieves 26.50 word-level perplexity on PG19, setting a new state of the art at the time of publication.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>PG19 Perplexity</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compressive Transformer</td>
          <td>36</td>
          <td>33.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Routing Transformer</td>
          <td>22</td>
          <td>33.2</td>
          <td>490M</td>
      </tr>
      <tr>
          <td>Perceiver AR</td>
          <td>60</td>
          <td>28.9</td>
          <td>974.6M</td>
      </tr>
      <tr>
          <td>Block-Recurrent Transformer</td>
          <td>24</td>
          <td>26.50</td>
          <td>1.3B</td>
      </tr>
  </tbody>
</table>
<h3 id="ablations">Ablations</h3>
<ul>
<li><strong>Multiple recurrent layers</strong>: Two adjacent layers (9, 10) provide no benefit. Two separated layers (4, 10) help but no more than adding another non-recurrent layer.</li>
<li><strong>Number of states</strong>: Improvement up to 1024 states, degradation at 2048.</li>
<li><strong>Window size reduction</strong>: Reducing the sliding window hurts Transformer-XL dramatically but has smaller impact on the recurrent model, which compensates via recurrence.</li>
<li><strong>Gate type</strong>: The fixed gate consistently outperforms the LSTM gate despite being theoretically less expressive.</li>
</ul>
<h3 id="qualitative-analysis">Qualitative Analysis</h3>
<p>Comparing per-token predictions against Transformer-XL on PG19 books, the recurrent model&rsquo;s advantage comes overwhelmingly from predicting proper names (17 of the 20 most-improved tokens). In 19 of 20 cases, the predicted word occurred outside the attention window, confirming that the information was carried in the recurrent state. The model can remember book titles and authors across 60,000+ tokens.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The Block-Recurrent Transformer demonstrates that recurrence at the block level is a cost-effective way to improve language modeling on long sequences. The fixed:skip configuration (the simplest variant) performs best, suggesting the model primarily uses recurrence for long-range name lookup rather than complex reasoning. The fact that removing the MLP from the recurrent layer has minimal impact further supports this interpretation.</p>
<p>Key limitations include: the model was only evaluated on language modeling perplexity (no downstream tasks), the LSTM gate underperforms the simpler fixed gate (suggesting untapped potential for more expressive recurrence), and the authors acknowledge that training the recurrent layer to fully exploit its capacity for knowledge extraction will require further advances.</p>
<p>The authors note that evaluating on downstream tasks requiring long-range context (book summarization, long-document QA, code completion) is an important direction for future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>PG19</td>
          <td>~29k books</td>
          <td>Public domain, freely available</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>arXiv</td>
          <td>Mathematics papers</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>GitHub</td>
          <td>Open-source repos</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adafactor</li>
<li>Learning rate: 1.0 with inverse square root decay (initial experiments), cosine decay with max 0.01 (scaling experiments)</li>
<li>Warmup: 1000 steps</li>
<li>Dropout: 0.05</li>
<li>Vocabulary: 32k SentencePiece (T5 pretrained for initial, custom for scaling)</li>
<li>Gate initialization: bias of $+1$ for forget gate, $-1$ for input gate to ensure initial &ldquo;remember&rdquo; behavior</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Layers</th>
          <th>Parameters</th>
          <th>Recurrent Layers</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>12 (+1 recurrent)</td>
          <td>~151-164M</td>
          <td>Layer 10</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>24 (+2 recurrent)</td>
          <td>650M</td>
          <td>Layers 10, 20</td>
      </tr>
      <tr>
          <td>XL</td>
          <td>24 (+2 recurrent)</td>
          <td>1.3B</td>
          <td>Layers 10, 20</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bits-per-token</td>
          <td>Rec:fixed:skip</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Word-level PPL</td>
          <td>1.3B model</td>
          <td>26.50</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Error bars on PG19 are between 0.002 and 0.007 (3 runs with different seeds).</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 32 TPU v4 replicas (Google Cloud)</li>
<li>Training time: ~48 hours for 500k steps on PG19</li>
<li>Batch size: 32 (segment length 4096) or 256 (segment length 512), adjusted so each model sees the same tokens per step</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Available</th>
          <th>License</th>
          <th>URL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Code (Meliad)</td>
          <td>Yes</td>
          <td>Apache 2.0</td>
          <td><a href="https://github.com/google-research/meliad">github.com/google-research/meliad</a></td>
      </tr>
      <tr>
          <td>PG19 Dataset</td>
          <td>Yes</td>
          <td>Public Domain</td>
          <td>Public</td>
      </tr>
      <tr>
          <td>arXiv Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>GitHub Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>Pretrained Models</td>
          <td>No</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Assessment</strong>: Partially Reproducible. Source code is available under Apache 2.0 and the PG19 dataset is public. However, two of three evaluation datasets (arXiv, GitHub) were obtained via private channels and are not redistributable. No pretrained model checkpoints are released.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hutchins, D., Schlag, I., Wu, Y., Dyer, E., &amp; Neyshabur, B. (2022). Block-Recurrent Transformers. <em>Advances in Neural Information Processing Systems 35 (NeurIPS 2022)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hutchins2022block,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Block-Recurrent Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hutchins, DeLesley and Schlag, Imanol and Wu, Yuhuai and Dyer, Ethan and Neyshabur, Behnam}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2203.07852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>NaViT: Native Resolution Vision Transformer</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</guid><description>NaViT uses sequence packing to train Vision Transformers on images at native resolution and aspect ratio, improving efficiency and flexibility.</description><content:encoded><![CDATA[<h2 id="a-method-for-flexible-resolution-vision-transformers">A Method for Flexible-Resolution Vision Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces NaViT (Native Resolution ViT), a Vision Transformer trained using sequence packing to handle images of arbitrary resolution and aspect ratio. The core idea, called &ldquo;Patch n&rsquo; Pack,&rdquo; borrows example packing from NLP and applies it to vision: patches from multiple images of different sizes are concatenated into a single sequence, enabling native-resolution processing without resizing or padding.</p>
<h2 id="why-fixed-resolution-pipelines-are-suboptimal">Why Fixed-Resolution Pipelines Are Suboptimal</h2>
<p>Standard computer vision pipelines resize all images to a fixed square resolution before processing. This practice originates from convolutional neural network constraints, where fixed spatial dimensions were architecturally required. Even with Vision Transformers, which operate on sequences of patches and could in principle handle variable lengths, the convention of fixed-resolution input persists.</p>
<p>This approach has clear drawbacks. Most images are not square: analysis of ImageNet, LVIS, and WebLI shows that most images deviate more than 20% from a 1:1 aspect ratio. Resizing distorts content and discards information, while padding wastes computation. Prior work like FlexiViT addressed variable patch sizes and Pix2Struct introduced aspect-ratio-preserving patching, but neither fully solved the problem of training efficiently on images at their original resolution.</p>
<h2 id="patch-n-pack-sequence-packing-for-vision">Patch n&rsquo; Pack: Sequence Packing for Vision</h2>
<p>The key insight is that ViT already processes images as sequences of patch tokens, and NLP has long used example packing to handle variable-length sequences efficiently. NaViT applies this directly: patches from multiple images (each at its native resolution and aspect ratio) are packed into a single fixed-length sequence.</p>
<h3 id="architectural-modifications">Architectural Modifications</h3>
<p>Three changes enable Patch n&rsquo; Pack:</p>
<ol>
<li>
<p><strong>Masked self-attention and masked pooling</strong>: Attention masks prevent patches from different images from attending to each other. Masked pooling extracts a single representation per image from the packed sequence.</p>
</li>
<li>
<p><strong>Factorized positional embeddings</strong>: Standard 1D positional embeddings cannot handle arbitrary resolutions. NaViT decomposes position into separate $x$ and $y$ embeddings $\phi_{x}$ and $\phi_{y}$, which are summed together. Two schemes are considered:</p>
<ul>
<li>Absolute embeddings: $\phi(p): [0, \text{maxLen}] \to \mathbb{R}^{D}$, a function of the absolute patch index</li>
<li>Fractional embeddings: $\phi(r): [0, 1] \to \mathbb{R}^{D}$, where $r = p / \text{side-length}$ is the relative position along the image</li>
</ul>
</li>
<li>
<p><strong>Chunked contrastive loss</strong>: For contrastive pretraining, the $\mathcal{O}(n^{2})$ loss computation is handled via chunked computation across device subsets to support the high number of examples per sequence.</p>
</li>
</ol>
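<p>The factorized embedding idea (item 2 above) is simple to sketch for the absolute scheme: every patch looks up an $x$-embedding and a $y$-embedding from two shared tables and sums them, so images of any aspect ratio reuse the same tables. Table sizes and dimensions below are illustrative.</p>

```python
import numpy as np

def factorized_pos_emb(h_patches, w_patches, emb_x, emb_y):
    """Summed x- and y-embeddings for every patch of one image.
    emb_x, emb_y: (max_len, D) lookup tables (the 'absolute' scheme)."""
    ys, xs = np.meshgrid(np.arange(h_patches), np.arange(w_patches),
                         indexing="ij")
    return emb_x[xs.ravel()] + emb_y[ys.ravel()]  # (h_patches * w_patches, D)

rng = np.random.default_rng(0)
max_len, D = 32, 16
emb_x = rng.standard_normal((max_len, D))
emb_y = rng.standard_normal((max_len, D))
# Two images with different aspect ratios share the same tables.
pe_a = factorized_pos_emb(6, 10, emb_x, emb_y)   # 6x10 patch grid
pe_b = factorized_pos_emb(12, 4, emb_x, emb_y)   # 12x4 patch grid
```

<p>The fractional scheme would instead map $p / \text{side-length}$ into $[0, 1]$ before the lookup, decoupling the tables from absolute image size.</p>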
<h3 id="training-innovations">Training Innovations</h3>
<p>Packing enables two techniques that were previously impractical:</p>
<ul>
<li>
<p><strong>Continuous token dropping</strong>: Instead of dropping the same proportion of tokens from every image, the drop rate varies per image. Some images keep all tokens while others have aggressive dropping, reducing the train/inference discrepancy. The drop rate can follow a schedule that decreases over training.</p>
</li>
<li>
<p><strong>Resolution sampling</strong>: Each image&rsquo;s resolution is sampled from a distribution (e.g., $R \sim \mathcal{U}(64, R_{\text{max}})$) while preserving aspect ratio. This mixes the throughput benefits of small images with the detail of large ones.</p>
</li>
</ul>
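<p>Both techniques can be sketched in a few lines. This is an assumption-laden illustration, not the paper&rsquo;s code: the Beta(2, 5) drop-rate parameters are hypothetical, and resolution sampling here draws an area-equivalent side length uniformly while preserving aspect ratio.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_resolution(h, w, r_min=64, r_max=512):
    """Resample an image's size: draw an area-equivalent side length
    R ~ U(r_min, r_max) and rescale both sides, preserving aspect ratio."""
    R = rng.uniform(r_min, r_max)
    scale = R / np.sqrt(h * w)
    return max(1, round(h * scale)), max(1, round(w * scale))

def drop_tokens(n_tokens, a=2.0, b=5.0):
    """Continuous token dropping: a per-image drop rate from Beta(a, b),
    so some images keep nearly all tokens while others drop aggressively."""
    rate = rng.beta(a, b)
    keep = max(1, int(round(n_tokens * (1.0 - rate))))
    return rng.choice(n_tokens, size=keep, replace=False)
```
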
<h3 id="computational-overhead">Computational Overhead</h3>
<p>A natural concern is the $\mathcal{O}(n^{2})$ attention cost of longer packed sequences. In practice, as the transformer hidden dimension scales, attention becomes an increasingly small fraction of total compute (the MLP dominates). With a simple greedy bin-packing algorithm, padding typically accounts for less than 2% of tokens.</p>
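<p>One plausible greedy scheme is first-fit-decreasing; the paper does not spell out its exact packing variant, so treat this as a sketch of the idea rather than the reference algorithm. It packs per-image patch counts into fixed-length sequences and reports the resulting padding fraction.</p>

```python
def greedy_pack(lengths, seq_len):
    """First-fit-decreasing packing of per-image patch counts into
    fixed-length sequences; returns the packs and the padding fraction."""
    packs = []
    for n in sorted(lengths, reverse=True):
        for pack in packs:                 # first pack with enough room
            if sum(pack) + n <= seq_len:
                pack.append(n)
                break
        else:                              # no pack fits: open a new one
            packs.append([n])
    total = len(packs) * seq_len
    return packs, (total - sum(lengths)) / total
```
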
<h2 id="pretraining-and-downstream-evaluation">Pretraining and Downstream Evaluation</h2>
<p>NaViT is evaluated in two pretraining setups:</p>
<ul>
<li><strong>Classification pretraining</strong> on JFT-4B with sigmoid cross-entropy loss, evaluated via linear probing (10 examples per class)</li>
<li><strong>Contrastive pretraining</strong> on WebLI using image-text contrastive loss, evaluated on zero-shot ImageNet classification and COCO retrieval</li>
</ul>
<h3 id="training-efficiency">Training Efficiency</h3>
<p>At fixed compute budget, NaViT consistently outperforms ViT across model scales. The top-performing ViT can be matched by NaViT with 4x less compute. The primary driver is throughput: packing with variable resolution and token dropping enables NaViT-L/16 to process approximately 5x more images during training.</p>
<h3 id="variable-resolution-results">Variable Resolution Results</h3>
<p>Models trained with variable resolution ($R \sim \mathcal{U}(64, R_{\text{max}})$) outperform fixed-resolution models even when evaluated at the fixed models&rsquo; own training resolution. Sampling side lengths from a truncated normal biased toward lower values gives the best cost-performance trade-off.</p>
<p>For fine-tuning on ImageNet-1k, a single NaViT fine-tuned with variable resolutions (64 to 512) matches the performance of models fine-tuned at each specific resolution individually.</p>
<h3 id="positional-embedding-comparison">Positional Embedding Comparison</h3>
<p>Factorized embeddings outperform both standard ViT 1D embeddings (with interpolation) and Pix2Struct&rsquo;s learned 2D embeddings. The factorized approach generalizes to resolutions outside the training range, while 2D embeddings fail because they require seeing all $(x, y)$ coordinate pairs during training. Additive combination of $\phi_{x}$ and $\phi_{y}$ works best.</p>
<h3 id="token-dropping-strategies">Token Dropping Strategies</h3>
<p>Variable token dropping with Beta-distributed rates consistently outperforms constant rates. Resolution-dependent dropping (higher rates for higher-resolution images) further improves performance. Scheduling the drop rate to decrease over training provides additional gains.</p>
<h3 id="downstream-tasks">Downstream Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Setup</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Semantic segmentation</td>
          <td>ADE20k, L/16, linear decoder</td>
          <td>NaViT at $R_{384}$ beats ViT at $R_{512}$ while being 2x faster</td>
      </tr>
      <tr>
          <td>Object detection</td>
          <td>OWL-ViT-L/14 backbone</td>
          <td>NaViT: 28.3% LVIS AP vs. ViT: 23.3%</td>
      </tr>
      <tr>
          <td>Video classification</td>
          <td>Kinetics-400, tubelet extraction</td>
          <td>NaViT-L matches ViViT-L (80.4%) in ~6x fewer epochs</td>
      </tr>
      <tr>
          <td>Fairness annotation</td>
          <td>FairFace, CelebA linear probes</td>
          <td>Statistically significant accuracy improvements ($p = 3 \times 10^{-4}$)</td>
      </tr>
  </tbody>
</table>
<h3 id="out-of-distribution-robustness">Out-of-Distribution Robustness</h3>
<p>NaViT shows strong gains on ImageNet-A (which contains many extreme aspect ratios) when evaluated without center cropping. Performance on ObjectNet is also competitive. The model maintains stable calibration (ECE between 0.045 and 0.047) across a wide range of token counts per image (128 to 1024).</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>NaViT demonstrates that sequence packing, when applied to Vision Transformers, yields substantial improvements in training efficiency, inference flexibility, and downstream performance. The approach processes images at their native resolution without the information loss from resizing or the waste from padding.</p>
<p>Key takeaways:</p>
<ul>
<li>4x compute reduction to match top ViT performance</li>
<li>A single model works across a continuous range of resolutions at inference time</li>
<li>Variable-resolution training and token dropping provide complementary efficiency gains</li>
<li>Factorized positional embeddings generalize to unseen resolutions</li>
<li>Benefits transfer to detection, segmentation, video, and fairness tasks</li>
</ul>
<p>Limitations: The paper does not release model weights or code. All experiments use Google-internal datasets (JFT-4B, WebLI) and infrastructure (TPUs, JAX/Scenic), making direct reproduction difficult. The attention masking approach for packing assumes that cross-image attention is undesirable, which may not hold for all tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification pretraining</td>
          <td>JFT-4B</td>
          <td>~4B labeled images</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Contrastive pretraining</td>
          <td>WebLI</td>
          <td>Large-scale web data</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Classification fine-tuning</td>
          <td>ImageNet-1k</td>
          <td>1.28M images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Segmentation</td>
          <td>ADE20k</td>
          <td>20K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Detection</td>
          <td>LVIS</td>
          <td>164K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Video</td>
          <td>Kinetics-400</td>
          <td>~240K videos</td>
          <td>Publicly available (partial)</td>
      </tr>
      <tr>
          <td>Fairness</td>
          <td>FairFace, CelebA</td>
          <td>108K / 200K images</td>
          <td>Publicly available</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Greedy bin-packing for sequence construction (less than 2% padding tokens)</li>
<li>Resolution sampling: side length from truncated normal $\mathcal{N}_{t}(-0.5, 1)$ mapped to $[64, R_{\text{max}}]$</li>
<li>Token dropping: Beta-distributed per-image rates, optionally resolution-dependent</li>
<li>Factorized positional embeddings with additive combination</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>NaViT variants: B/16, L/16, L/14</li>
<li>Based on vanilla ViT with query-key normalization, no biases, attention pooling</li>
<li>Implemented in JAX/FLAX within the Scenic framework</li>
<li>No public model checkpoints available</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NaViT</th>
          <th>ViT Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JFT linear probe (L/16)</td>
          <td>Matches top ViT</td>
          <td>4x more compute</td>
          <td>Compute-matched comparison</td>
      </tr>
      <tr>
          <td>ImageNet zero-shot (L/14)</td>
          <td>72.9%</td>
          <td>68.3%</td>
          <td>Contrastive pretraining</td>
      </tr>
      <tr>
          <td>LVIS AP (L/14)</td>
          <td>28.3%</td>
          <td>23.3%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>LVIS AP rare (L/14)</td>
          <td>24.3%</td>
          <td>17.2%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>ADE20k mIoU (L/16, 384)</td>
          <td>Beats ViT@512</td>
          <td>At 2x cost</td>
          <td>Segmenter linear decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training on Cloud TPUs (specific configuration not detailed)</li>
<li>Inference latency measured on Cloud TPUv3</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I., Oliver, A., Padlewski, P., Gritsenko, A., Lučić, M., &amp; Houlsby, N. (2023). Patch n&rsquo; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. <em>Advances in Neural Information Processing Systems 36 (NeurIPS 2023)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{dehghani2023patch,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Patch n&#39; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lučić, Mario and Houlsby, Neil}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2307.06304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MarkushGrapher-2: End-to-End Markush Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</guid><description>MarkushGrapher-2 fuses vision, text, and layout encoders with a dedicated OCR module for end-to-end Markush structure recognition from patent images.</description><content:encoded><![CDATA[<h2 id="a-multimodal-method-for-markush-structure-recognition">A Multimodal Method for Markush Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper that introduces MarkushGrapher-2, a universal encoder-decoder model for recognizing both standard molecular structures and multimodal Markush structures from chemical images. The primary contribution is a dual-encoder architecture that fuses a pretrained OCSR (Optical Chemical Structure Recognition) vision encoder with a Vision-Text-Layout (VTL) encoder, connected through a dedicated ChemicalOCR module for end-to-end processing. The paper also introduces two new resources: a large-scale training dataset (USPTO-MOL-M) of real-world Markush structures extracted from USPTO patent MOL files, and IP5-M, a manually annotated benchmark of 1,000 Markush structures from five major patent offices.</p>
<h2 id="why-markush-structure-recognition-remains-challenging">Why Markush Structure Recognition Remains Challenging</h2>
<p><a href="https://en.wikipedia.org/wiki/Markush_structure">Markush structures</a> are compact representations used in patent documents to describe families of related molecules. They combine a visual backbone (atoms, bonds, variable regions) with textual definitions of substituents that can replace those variable regions. This multimodal nature makes them harder to parse than standard molecular diagrams.</p>
<p>Three factors limit automatic Markush recognition. First, visual styles vary across patent offices and publication years. Second, textual definitions lack standardization and often contain conditional or recursive descriptions. Third, real-world training data with comprehensive annotations is scarce. As a result, Markush structures are currently indexed only in two proprietary, manually curated databases: MARPAT and DWPIM.</p>
<p>Prior work, including the original <a href="/notes/chemistry/optical-structure-recognition/markush/markushgrapher/">MarkushGrapher</a>, required pre-annotated OCR outputs at inference time, limiting practical deployment. General-purpose models like GPT-5 and DeepSeek-OCR produce mostly chemically invalid outputs on Markush images, suggesting these lie outside their training distribution.</p>
<h2 id="dual-encoder-architecture-with-dedicated-chemicalocr">Dual-Encoder Architecture with Dedicated ChemicalOCR</h2>
<p>MarkushGrapher-2 uses two complementary encoding pipelines:</p>
<ol>
<li>
<p><strong>Vision encoder pipeline</strong>: The input image passes through a Swin-B Vision Transformer (taken from <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a>) pretrained for OCSR. This encoder extracts visual features representing molecular structures and remains frozen during training.</p>
</li>
<li>
<p><strong>Vision-Text-Layout (VTL) pipeline</strong>: The same image goes through ChemicalOCR, a compact 256M-parameter vision-language model fine-tuned from SmolDocling for OCR on chemical images. ChemicalOCR extracts character-level text and bounding boxes. These, combined with image patches, feed into a T5-base VTL encoder following the UDOP fusion paradigm, where visual and textual tokens are spatially aligned by bounding box overlap.</p>
</li>
</ol>
<p>The VTL encoder output is concatenated with projected embeddings from the vision encoder. This joint representation feeds a text decoder that auto-regressively generates a CXSMILES (ChemAxon Extended <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) string describing the backbone structure and a substituent table listing variable group definitions.</p>
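<p>The UDOP-style spatial alignment can be made concrete with a small sketch. This is an assumption-laden illustration (function names and the max-overlap rule are mine, not from the paper): each OCR text token, carrying a bounding box, is matched to the image patch whose box overlaps it most, so the two token streams can be fused.</p>

```python
# Hypothetical sketch of bounding-box-based token alignment: each OCR text
# token is assigned to the image patch with the largest box overlap.
# Boxes are (x0, y0, x1, y1) in pixel coordinates.

def overlap_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def align_tokens(text_boxes, patch_boxes):
    """For each text token, return the index of the best-overlapping patch,
    or None if it overlaps no patch at all."""
    assignment = []
    for tb in text_boxes:
        areas = [overlap_area(tb, pb) for pb in patch_boxes]
        best = max(range(len(patch_boxes)), key=lambda i: areas[i])
        assignment.append(best if areas[best] > 0 else None)
    return assignment

# a 2x2 grid of 16x16 image patches
patches = [(0, 0, 16, 16), (16, 0, 32, 16), (0, 16, 16, 32), (16, 16, 32, 32)]
print(align_tokens([(2, 2, 10, 10), (20, 20, 30, 30), (14, 2, 20, 10)], patches))
# [0, 3, 1] -- the third box straddles two patches and goes to the larger overlap
```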
<h3 id="two-stage-training-strategy">Two-Stage Training Strategy</h3>
<p>Training proceeds in two phases:</p>
<ul>
<li>
<p><strong>Phase 1 (Adaptation)</strong>: The vision encoder is frozen. The MLP projector and text decoder train on 243K real-world image-SMILES pairs from MolScribe&rsquo;s USPTO dataset (3 epochs). This aligns the decoder to the pretrained OCSR feature space.</p>
</li>
<li>
<p><strong>Phase 2 (Fusion)</strong>: The vision encoder, projector, and ChemicalOCR are all frozen. The VTL encoder and text decoder train on a mix of 235K synthetic and 145K real-world Markush samples (2 epochs). The VTL encoder learns the features needed for CXSMILES and substituent table prediction without disrupting the established OCSR representations.</p>
</li>
</ul>
<p>The total model has 831M parameters, of which 744M are trainable.</p>
<h2 id="datasets-and-evaluation-benchmarks">Datasets and Evaluation Benchmarks</h2>
<h3 id="training-data">Training Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical structures</td>
          <td>235K</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> SMILES augmented to CXSMILES, rendered with annotations</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>Manual OCR annotations</td>
          <td>7K</td>
          <td>IP5 patent document crops</td>
      </tr>
      <tr>
          <td>Phase 1 (OCSR)</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>Synthetic CXSMILES</td>
          <td>235K</td>
          <td>Same as OCR pretraining set</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>MolParser dataset</td>
          <td>91K</td>
          <td>Real-world Markush, converted to CXSMILES</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>USPTO-MOL-M</td>
          <td>54K</td>
          <td>Real-world, auto-extracted from USPTO MOL files (2010-2025)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-benchmarks">Evaluation Benchmarks</h3>
<p><strong>Markush benchmarks</strong>: M2S (103 samples), USPTO-M (74), WildMol-M (10K, semi-manual), and the new IP5-M (1,000 manually annotated from USPTO, JPO, KIPO, CNIPA, and EPO patents, 1980-2025).</p>
<p><strong>OCSR benchmarks</strong>: USPTO (5,719), JPO (450), UOB (5,740), WildMol (10K).</p>
<p>The primary metric is <strong>CXSMILES Accuracy (A)</strong>: a prediction is correct when (1) the predicted SMILES matches the ground truth by <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChIKey</a> equivalence, and (2) all Markush features (variable groups, positional and frequency variation indicators) are correctly represented. Stereochemistry is ignored during evaluation.</p>
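<p>The two-condition metric can be expressed as a short sketch. In practice the InChIKeys would be computed from the predicted and ground-truth SMILES with a cheminformatics toolkit such as RDKit; here they are treated as precomputed strings, and the feature representation is a hypothetical simplification.</p>

```python
# Illustrative computation of CXSMILES Accuracy: a prediction is correct
# only if (1) the backbone InChIKeys match and (2) every Markush feature
# (variable groups, positional/frequency variation) matches. InChIKeys are
# assumed precomputed; real pipelines derive them from SMILES via RDKit.

def cxsmiles_correct(pred, truth):
    return (pred["inchikey"] == truth["inchikey"]
            and pred["markush_features"] == truth["markush_features"])

def cxsmiles_accuracy(predictions, ground_truths):
    correct = sum(cxsmiles_correct(p, t)
                  for p, t in zip(predictions, ground_truths))
    return correct / len(ground_truths)

preds = [
    {"inchikey": "AAA", "markush_features": {"R1": {"alkyl", "aryl"}}},
    {"inchikey": "BBB", "markush_features": {"R1": {"alkyl"}}},
]
truths = [
    {"inchikey": "AAA", "markush_features": {"R1": {"alkyl", "aryl"}}},
    {"inchikey": "BBB", "markush_features": {"R1": {"alkyl", "halo"}}},
]
print(cxsmiles_accuracy(preds, truths))  # 0.5: second backbone matches, but a feature is missing
```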
<h3 id="results-markush-structure-recognition">Results: Markush Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S</th>
          <th>USPTO-M</th>
          <th>WildMol-M</th>
          <th>IP5-M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>39</td>
          <td>30</td>
          <td>38.1</td>
          <td>47.7</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>21</td>
          <td>7</td>
          <td>28.1</td>
          <td>22.3</td>
      </tr>
      <tr>
          <td>GPT-5</td>
          <td>3</td>
          <td>0</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>0</td>
          <td>0</td>
          <td>1.9</td>
          <td>0.0</td>
      </tr>
      <tr>
          <td>MarkushGrapher-1</td>
          <td>38</td>
          <td>10</td>
          <td>32</td>
          <td>-</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td><strong>56</strong></td>
          <td>13</td>
          <td><strong>55</strong></td>
          <td><strong>48.0</strong></td>
      </tr>
  </tbody>
</table>
<p>On M2S, MarkushGrapher-2 achieves 56% CXSMILES accuracy vs. 38% for MarkushGrapher-1, a relative improvement of 47%. On WildMol-M (the largest benchmark at 10K samples), MarkushGrapher-2 reaches 55% vs. 38.1% for MolParser-Base and 32% for MarkushGrapher-1. GPT-5 and DeepSeek-OCR generate mostly chemically invalid outputs on Markush images: only 30% and 15% of their predictions are valid CXSMILES on M2S, respectively.</p>
<h3 id="results-standard-molecular-structure-recognition">Results: Standard Molecular Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WildMol</th>
          <th>JPO</th>
          <th>UOB</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>76.9</td>
          <td>78.9</td>
          <td>91.8</td>
          <td>93.0</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>66.4</td>
          <td>76.2</td>
          <td>87.4</td>
          <td>93.1</td>
      </tr>
      <tr>
          <td>DECIMER 2.7</td>
          <td>56.0</td>
          <td>64.0</td>
          <td>88.3</td>
          <td>59.9</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/">MolGrapher</a></td>
          <td>45.5</td>
          <td>67.5</td>
          <td>94.9</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>25.8</td>
          <td>31.6</td>
          <td>78.7</td>
          <td>36.9</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td>68.4</td>
          <td>71.0</td>
          <td><strong>96.6</strong></td>
          <td>89.8</td>
      </tr>
  </tbody>
</table>
<p>MarkushGrapher-2 achieves the highest score on UOB (96.6%) and remains competitive on other OCSR benchmarks, despite being primarily optimized for Markush recognition.</p>
<h3 id="chemicalocr-vs-general-ocr">ChemicalOCR vs. General OCR</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S F1</th>
          <th>USPTO-M F1</th>
          <th>IP5-M F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PaddleOCR v5</td>
          <td>7.7</td>
          <td>1.2</td>
          <td>1.9</td>
      </tr>
      <tr>
          <td>EasyOCR</td>
          <td>10.2</td>
          <td>18.0</td>
          <td>18.4</td>
      </tr>
      <tr>
          <td><strong>ChemicalOCR</strong></td>
          <td><strong>87.2</strong></td>
          <td><strong>93.0</strong></td>
          <td><strong>86.5</strong></td>
      </tr>
  </tbody>
</table>
<p>General-purpose OCR tools fail on chemical images because they misinterpret bonds as characters and cannot parse chemical abbreviations. ChemicalOCR outperforms both by a large margin.</p>
<h2 id="ablation-results-and-key-findings">Ablation Results and Key Findings</h2>
<p><strong>OCR input is critical for Markush features.</strong> Without OCR, CXSMILES accuracy drops from 56% to 4% on M2S, and from 53.7% to 15.4% on IP5-M. The backbone structure accuracy ($A_{\text{InChIKey}}$) also drops substantially (from 80% to 39% on M2S), though the vision encoder alone can still recover some structural information. This confirms that textual cues (brackets, indices, variable definitions) are essential for Markush feature prediction.</p>
<p><strong>Two-phase training improves both tasks.</strong> Compared to single-phase (fusion only) training, the two-phase strategy improves CXSMILES accuracy from 44% to 50% on M2S and from 53.0% to 61.5% on JPO after the same number of epochs. Adapting the decoder to OCSR features before introducing the VTL encoder prevents the fusion process from degrading learned visual representations.</p>
<p><strong>Frequency variation indicators remain the hardest feature.</strong> On IP5-M, the per-feature breakdown shows 73.3% accuracy for backbone InChI, 74.8% for variable groups, 78.8% for positional variation, but only 30.7% for frequency variation (Sg groups). These repeating structural units are particularly challenging to represent and predict.</p>
<p><strong>Limitations</strong>: The model relies on accurate OCR as a prerequisite. Performance on USPTO-M (13% CXSMILES accuracy) lags behind other benchmarks, likely due to the older patent styles in that dataset. The paper does not report inference latency.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical images</td>
          <td>235K</td>
          <td>Generated from PubChem SMILES, augmented to CXSMILES</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>IP5 patent crops</td>
          <td>7K</td>
          <td>Manually annotated</td>
      </tr>
      <tr>
          <td>Phase 1 training</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Public, real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 training</td>
          <td>Synthetic + MolParser + USPTO-MOL-M</td>
          <td>380K</td>
          <td>Mix of synthetic (235K), MolParser (91K), USPTO-MOL-M (54K)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>M2S, USPTO-M, WildMol-M, IP5-M</td>
          <td>103 to 10K</td>
          <td>Markush benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>WildMol, JPO, UOB, USPTO</td>
          <td>450 to 10K</td>
          <td>OCSR benchmarks</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vision encoder</td>
          <td>Swin-B ViT (from MolScribe)</td>
          <td>~87M</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td>VTL encoder + decoder</td>
          <td>T5-base</td>
          <td>~744M trainable</td>
          <td>Trained</td>
      </tr>
      <tr>
          <td>ChemicalOCR</td>
          <td>SmolDocling-based VLM</td>
          <td>256M</td>
          <td>Fine-tuned, frozen in Phase 2</td>
      </tr>
      <tr>
          <td>MLP projector</td>
          <td>Linear projection</td>
          <td>-</td>
          <td>Trained in Phase 1, frozen in Phase 2</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>831M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CXSMILES Accuracy (A)</td>
          <td>Percentage of samples where InChIKey matches AND all Markush features correct</td>
      </tr>
      <tr>
          <td>$A_{\text{InChIKey}}$</td>
          <td>Backbone structure accuracy only (ignoring Markush features)</td>
      </tr>
      <tr>
          <td>Table Accuracy</td>
          <td>Percentage of correctly predicted substituent tables</td>
      </tr>
      <tr>
          <td>Markush Accuracy</td>
          <td>Joint CXSMILES + Table accuracy</td>
      </tr>
      <tr>
          <td>OCR F1</td>
          <td>Bounding-box-level precision/recall at IoU &gt; 0.5</td>
      </tr>
  </tbody>
</table>
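<p>The bounding-box matching behind the OCR F1 row can be sketched as follows. This is an illustrative greedy one-to-one matcher under the stated IoU &gt; 0.5 criterion; the paper's exact matching procedure is not spelled out, so treat the details as assumptions.</p>

```python
# Sketch of IoU-thresholded box matching for OCR F1: a predicted box is a
# true positive if it overlaps some still-unmatched ground-truth box with
# IoU > 0.5. Greedy matching; boxes are (x0, y0, x1, y1).

def iou(a, b):
    w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ocr_f1(pred_boxes, gt_boxes, thresh=0.5):
    matched = set()
    tp = 0
    for p in pred_boxes:
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) > thresh:
                matched.add(j)
                tp += 1
                break
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(ocr_f1([(0, 0, 10, 10), (50, 50, 60, 60)],
             [(1, 1, 11, 11), (100, 100, 110, 110)]))  # 0.5
```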
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: NVIDIA A100 GPU</li>
<li>Phase 1: 3 epochs, Adam optimizer, lr 5e-4, 1000 warmup steps, batch size 10, weight decay 1e-3</li>
<li>Phase 2: 2 epochs, batch size 8</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation of MarkushGrapher-2 with models and datasets</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Code, models, and datasets are all publicly released under an MIT license with documented training hyperparameters and a single A100 GPU requirement.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Strohmeyer, T., Morin, L., Meijer, G. I., Weber, V., Nassar, A., &amp; Staar, P. (2026). MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>.</p>
<p><strong>Publication</strong>: CVPR 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository (MIT License)</a></li>
<li><a href="https://arxiv.org/abs/2603.28550">arXiv Preprint</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{strohmeyer2026markushgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Val\&#39;{e}ry and Nassar, Ahmed and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2603.28550}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>REINVENT: Reinforcement Learning for Mol. Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/</guid><description>REINVENT uses augmented episodic likelihood to fine-tune a SMILES-based RNN via reinforcement learning for goal-directed molecular generation.</description><content:encoded><![CDATA[<h2 id="augmented-episodic-likelihood-for-goal-directed-generation">Augmented Episodic Likelihood for Goal-Directed Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces REINVENT, a policy-based reinforcement learning framework for molecular de novo design. The primary contribution is a novel cost function, the <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented episodic likelihood</a>, that fine-tunes a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based recurrent neural network (RNN) pre-trained on ChEMBL toward generating molecules satisfying user-defined property objectives. The method anchors the agent to the prior distribution of valid drug-like molecules, addressing failure modes of standard REINFORCE algorithms (reward exploitation and <a href="/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/">mode collapse</a> to trivially simple structures).</p>
<h2 id="de-novo-design-needs-flexible-data-driven-approaches">De Novo Design Needs Flexible, Data-Driven Approaches</h2>
<p>Traditional de novo design methods fall into three categories, each with limitations:</p>
<ol>
<li><strong>Structure-based approaches</strong> grow ligands to fit binding pockets but often produce molecules with poor DMPK profiles and synthetic intractability.</li>
<li><strong>Ligand-based virtual library</strong> approaches generate large libraries and score them, but are constrained by pre-defined reaction or transformation rules that limit chemical diversity.</li>
<li><strong><a href="/notes/chemistry/molecular-design/property-prediction/">Inverse QSAR</a></strong> methods attempt to map favorable activity regions back to molecular structures, but require descriptors suitable for both forward prediction and inverse mapping.</li>
</ol>
<p>RNN-based generative models trained on SMILES offer a data-driven alternative that can learn the underlying distribution of drug-like chemical space without rigid rules. Segler et al. (2017) showed that fine-tuning a pre-trained RNN on focused actives yields high fractions of predicted actives. However, this maximum likelihood fine-tuning cannot use negative or continuous scores and risks catastrophic forgetting.</p>
<p>Prior RL approaches had significant issues. Jaques et al. (2016) used Deep Q-learning with prior likelihood regularization for sequence generation, but reported dependence on hand-written rules to penalize undesirable sequences and still observed reward exploitation producing unrealistically simple molecules. Standard REINFORCE algorithms tend to converge on trivial solutions (e.g., generating only &ldquo;C&rdquo; to satisfy a scoring function).</p>
<h2 id="the-augmented-episodic-likelihood-framework">The Augmented Episodic Likelihood Framework</h2>
<p>The core innovation is a formulation where the agent learns a policy that minimizes the squared difference between its own log-likelihood and an augmented target likelihood.</p>
<p>The RNN is first pre-trained on 1.5 million canonical SMILES from ChEMBL via maximum likelihood estimation:</p>
<p>$$
J(\Theta) = -\sum_{t=1}^{T} \log P(x^{t} \mid x^{t-1}, \dots, x^{1})
$$</p>
<p>The pre-trained model (the Prior) is then used as the starting point for the Agent. For a generated SMILES sequence $A = a_1, a_2, \dots, a_T$, the model likelihood is $P(A) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$, and a scoring function $S(A) \in [-1, 1]$ rates desirability.</p>
<p>The augmented likelihood combines prior likelihood with the score:</p>
<p>$$
\log P(A)_{\mathbb{U}} = \log P(A)_{Prior} + \sigma S(A)
$$</p>
<p>where $\sigma$ is a scalar coefficient controlling the trade-off between prior fidelity and score optimization.</p>
<p>The return is defined as the negative squared difference between the augmented likelihood and the agent&rsquo;s likelihood:</p>
<p>$$
G(A) = -\left[\log P(A)_{\mathbb{U}} - \log P(A)_{\mathbb{A}}\right]^{2}
$$</p>
<p>The agent minimizes $J(\Theta) = -G$, effectively learning a policy whose sequence likelihoods match the prior modulated by the scoring function. The authors show in supplementary material that this is equivalent to a REINFORCE algorithm with a specific final-step reward formulation.</p>
<p>This design has three key advantages over standard REINFORCE:</p>
<ul>
<li>The target policy is explicitly stochastic, preserving diversity in generated molecules</li>
<li>The prior anchoring prevents catastrophic forgetting of SMILES syntax and chemical space coverage</li>
<li>No hand-written rules are needed to penalize degenerate solutions</li>
</ul>
<p>The Agent is trained on-policy with batches of 128 generated sequences, using SGD with learning rate 0.0005 and gradient clipping to $[-3, 3]$.</p>
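<p>Numerically, the training objective combines the three equations above into a single squared error. A minimal sketch (plain floats standing in for the RNN log-likelihoods, which in the actual method come from the Prior and Agent networks and are optimized by SGD):</p>

```python
# Minimal numeric sketch of the augmented episodic likelihood loss:
#   J(Theta) = [log P(A)_Prior + sigma * S(A) - log P(A)_Agent]^2
# i.e. the negative of the return G(A). Log-likelihoods are plain floats
# here; in REINVENT they are computed by the Prior and Agent RNNs.

def augmented_episodic_loss(log_p_prior, log_p_agent, score, sigma):
    augmented = log_p_prior + sigma * score   # log P(A)_U
    return (augmented - log_p_agent) ** 2     # J = -G(A)

# A well-scored sequence (score = 1) pulls the agent's likelihood ABOVE
# the prior's; a poorly scored one (score = -1) pushes it below.
print(augmented_episodic_loss(-20.0, -20.0, 1.0, 15))  # 225.0: agent should raise its likelihood
print(augmented_episodic_loss(-20.0, -5.0, 1.0, 15))   # 0.0: agent already matches the target
```

<p>The prior term acts as the anchor: if the agent drifts far from the prior distribution, the squared error grows regardless of the score, which is what suppresses degenerate solutions.</p>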
<h2 id="three-experiments-sulphur-avoidance-celecoxib-analogues-and-drd2-activity">Three Experiments: Sulphur Avoidance, Celecoxib Analogues, and DRD2 Activity</h2>
<h3 id="prior-network-architecture">Prior Network Architecture</h3>
<p>The Prior is a 3-layer RNN with 1024 Gated Recurrent Units per layer, trained on RDKit canonical SMILES from ChEMBL (molecules with 10-50 heavy atoms, elements from $\{H, B, C, N, O, F, Si, P, S, Cl, Br, I\}$). Training used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) for 50,000 steps with batch size 128 and learning rate decay of 0.02 every 100 steps. The Prior generates 94% valid SMILES, of which 90% are novel.</p>
<h3 id="experiment-1-learning-to-avoid-sulphur">Experiment 1: Learning to Avoid Sulphur</h3>
<p>A proof-of-principle task where the scoring function assigns $S(A) = 1$ for valid sulphur-free molecules, $S(A) = 0$ for invalid SMILES, and $S(A) = -1$ for sulphur-containing molecules.</p>
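<p>The ternary scoring function is straightforward to sketch. In the naive version below, the parsing step is abstracted away: <code>elements</code> is the set of element symbols from a parsed molecule, or <code>None</code> when the SMILES failed to parse (as RDKit's <code>Chem.MolFromSmiles</code> signals by returning <code>None</code>); this framing is my simplification, not the paper's code.</p>

```python
# Naive sketch of the sulphur-avoidance scoring function:
#   +1 for a valid sulphur-free molecule, 0 for an invalid SMILES,
#   -1 when sulphur is present. Parsing/element extraction is assumed
# done upstream (e.g. with RDKit); `elements` is that parsed result.

def sulphur_score(elements):
    """elements: set of element symbols, or None if parsing failed."""
    if elements is None:
        return 0              # invalid SMILES
    return -1 if "S" in elements else 1

print(sulphur_score({"C", "N", "O"}))  # 1
print(sulphur_score({"C", "S"}))       # -1
print(sulphur_score(None))             # 0
```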
<p>The Agent method was compared against three alternatives:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Fraction No S</th>
          <th>Avg MW</th>
          <th>Avg cLogP</th>
          <th>Avg RotBonds</th>
          <th>Avg AromRings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior</td>
          <td>0.94</td>
          <td>0.66</td>
          <td>371</td>
          <td>3.36</td>
          <td>5.39</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Agent</td>
          <td>0.95</td>
          <td>0.98</td>
          <td>367</td>
          <td>3.37</td>
          <td>5.41</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Action basis</td>
          <td>0.95</td>
          <td>0.92</td>
          <td>372</td>
          <td>3.39</td>
          <td>6.08</td>
          <td>2.09</td>
      </tr>
      <tr>
          <td>REINFORCE</td>
          <td>0.98</td>
          <td>0.98</td>
          <td>585</td>
          <td>11.3</td>
          <td>30.0</td>
          <td>0.57</td>
      </tr>
      <tr>
          <td>REINFORCE + Prior</td>
          <td>0.98</td>
          <td>0.92</td>
          <td>232</td>
          <td>3.05</td>
          <td>2.8</td>
          <td>2.11</td>
      </tr>
  </tbody>
</table>
<p>Standard REINFORCE exploited the reward by generating sequences of predominantly &ldquo;C&rdquo; (average MW 585, cLogP 11.3). REINFORCE + Prior avoided this but collapsed to small, simplistic structures (MW 232). The Agent achieved 98% sulphur-free structures while maintaining molecular properties nearly identical to the Prior, demonstrating that augmented episodic likelihood preserves the prior distribution.</p>
<h3 id="experiment-2-similarity-guided-generation-celecoxib-analogues">Experiment 2: Similarity-Guided Generation (Celecoxib Analogues)</h3>
<p>The scoring function uses <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> on FCFP4 fingerprints:</p>
<p>$$
S(A) = -1 + 2 \times \frac{\min\{J_{i,j}, k\}}{k}
$$</p>
<p>where $k$ caps the rewarded similarity. With $k = 1$ and $\sigma = 15$, the Agent recovers <a href="https://en.wikipedia.org/wiki/Celecoxib">Celecoxib</a> itself within 200 training steps. Even when all structures with $J &gt; 0.5$ to Celecoxib (1,804 molecules) were removed from the Prior training set, the Agent still found Celecoxib after 400 steps, despite a 700-fold reduction in prior likelihood ($\log_e P$ from $-12.7$ to $-19.2$).</p>
<p>With moderate similarity targets ($k = 0.7$, $\sigma = 12$), the Agent generates diverse analogues including scaffold hops where functional groups are rearranged.</p>
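<p>The capped score from the equation above can be illustrated directly. The paper computes Jaccard (Tanimoto) similarity on FCFP4 fingerprints (e.g. via RDKit); the sketch below substitutes plain feature sets purely for illustration.</p>

```python
# Sketch of the capped similarity score S(A) = -1 + 2 * min(J, k) / k.
# Real usage computes J as Tanimoto similarity on FCFP4 fingerprints
# (e.g. with RDKit); here J is Jaccard similarity on plain Python sets.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def capped_similarity_score(features, target_features, k):
    j = jaccard(features, target_features)
    return -1 + 2 * min(j, k) / k

target = {1, 2, 3, 4}
print(capped_similarity_score({1, 2, 3, 4}, target, k=0.7))  # 1.0: at or above the cap
print(capped_similarity_score(set(range(100)), target, k=0.7))  # negative: very dissimilar
```

<p>Capping at $k &lt; 1$ means any molecule at least $k$-similar to the target receives the maximum score, so the Agent is rewarded for analogues rather than for reproducing the target exactly.</p>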
<h3 id="experiment-3-target-activity-drd2">Experiment 3: Target Activity (DRD2)</h3>
<p>The most drug-discovery-relevant task: generating molecules predicted active against the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2 (DRD2)</a>. An SVM classifier (Gaussian kernel, $C = 2^7$, $\gamma = 2^{-6}$) was trained on bioactivity data from ExCAPE-DB (7,218 actives with pIC50 &gt; 5, 100,000 sampled inactives). The actives were split by Butina clustering (ECFP6, cutoff 0.4) to decrease nearest-neighbor similarity between train and test sets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Prior</th>
          <th>Agent</th>
          <th>Prior (reduced)</th>
          <th>Agent (reduced)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid SMILES</td>
          <td>0.94</td>
          <td>0.99</td>
          <td>0.94</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Fraction predicted actives</td>
          <td>0.03</td>
          <td>0.97</td>
          <td>0.02</td>
          <td>0.96</td>
      </tr>
      <tr>
          <td>Fraction similar to train active</td>
          <td>0.02</td>
          <td>0.79</td>
          <td>0.02</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>Fraction similar to test active</td>
          <td>0.01</td>
          <td>0.46</td>
          <td>0.01</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>Test actives recovered (x10^-3)</td>
          <td>13.5</td>
          <td>126</td>
          <td>2.85</td>
          <td>72.6</td>
      </tr>
  </tbody>
</table>
<p>The Agent increased the fraction of predicted actives from 2-3% (Prior) to 96-97%, representing a 250-fold enrichment in the probability of generating a test set active. The Agent based on the reduced Prior (DRD2 actives removed from ChEMBL) still recovered 7% of test actives, meaning it generated experimentally confirmed actives that appeared in neither the generative model nor the activity prediction model training data.</p>
<h2 id="anchored-policy-learning-prevents-reward-exploitation">Anchored Policy Learning Prevents Reward Exploitation</h2>
<p>The key finding is that augmented episodic likelihood successfully balances score optimization with prior distribution preservation. The Agent achieves task objectives (sulphur avoidance, similarity targets, activity prediction) while maintaining the molecular property distributions learned from ChEMBL. This is a significant improvement over standard REINFORCE, which either exploits rewards trivially or collapses to simple structures.</p>
<p>Analysis of the conditional probability distributions between the Prior and Agent (for DRD2 active generation) shows that the policy changes are not drastic: most trends learned by the Prior carry over, with targeted modifications at specific steps that substantially alter sequence likelihoods and generated structure types.</p>
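The anchoring mechanism reduces to the paper's augmented episodic likelihood: the Agent's sequence log-likelihood is pulled toward the Prior's log-likelihood plus a scaled score. A minimal sketch (the σ value and the stand-in log-likelihoods are illustrative):

```python
def augmented_loss(prior_loglik: float, agent_loglik: float,
                   score: float, sigma: float = 60.0) -> float:
    """Squared distance between the Agent's episodic log-likelihood and the
    augmented likelihood log pi_prior(seq) + sigma * S(seq)."""
    augmented = prior_loglik + sigma * score
    return (augmented - agent_loglik) ** 2
```

A zero score leaves the Prior itself as the optimum, so the Agent cannot drift toward degenerate high-reward structures without paying a likelihood penalty.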
<p>Limitations acknowledged by the authors:</p>
<ul>
<li>All experiments use single-parameter scoring functions; multi-parametric optimization (activity + DMPK + synthetic accessibility) is left for future work</li>
<li>The quality of generated structures depends heavily on the Prior&rsquo;s coverage of chemical space</li>
<li>The activity model (SVM) has limited domain of applicability, and structures outside this domain may be falsely scored</li>
<li>No exhaustive study of how Prior training set size, model size, and regularization affect generation quality</li>
</ul>
<p>Future directions include multi-parametric scoring functions, exploration of token embeddings, and adversarial training where the scoring function is replaced by a discriminator network (GAN-style training).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>1.5M structures</td>
          <td>10-50 heavy atoms, filtered elements</td>
      </tr>
      <tr>
          <td>DRD2 activity model</td>
          <td>ExCAPE-DB</td>
          <td>7,218 actives + 100K inactives</td>
          <td>Butina clustering split (ECFP6, cutoff 0.4)</td>
      </tr>
      <tr>
          <td>Similarity target</td>
          <td>Celecoxib</td>
          <td>1 query structure</td>
          <td>FCFP4 fingerprints for Jaccard similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prior</strong>: 3-layer GRU RNN (1024 units/layer), Adam optimizer, 50K steps, batch size 128, LR 0.001 with 0.02 decay/100 steps</li>
<li><strong>Agent</strong>: Same architecture, SGD with LR 0.0005, gradient clipping [-3, 3], on-policy batches of 128</li>
<li><strong>DRD2 model</strong>: SVM with Gaussian kernel ($C = 2^7$, $\gamma = 2^{-6}$), grid search on validation set</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MarcusOlivecrona/REINVENT">REINVENT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation in TensorFlow/Python 2.7</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.572576">Archived version</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Zenodo archive (DOI: 10.5281/zenodo.572576)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>SMILES validity rate (RDKit parsing)</li>
<li>Fraction of structures satisfying scoring function</li>
<li>Molecular property distributions (MW, cLogP, rotatable bonds, aromatic rings)</li>
<li>Jaccard similarity on ECFP6/FCFP4 fingerprints</li>
<li>Recovery rate of known actives from test set</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. The implementation uses TensorFlow 1.0.1 with Python 2.7, RDKit, and Scikit-learn.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Olivecrona, M., Blaschke, T., Engkvist, O., &amp; Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. <em>Journal of Cheminformatics</em>, 9(1), 48.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{olivecrona2017molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular de-novo design through deep reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Olivecrona, Marcus and Blaschke, Thomas and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{48}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-017-0235-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ReactionT5: Pre-trained T5 for Reaction Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</guid><description>ReactionT5 uses two-stage pretraining on ZINC and the Open Reaction Database to enable competitive reaction and yield prediction with minimal fine-tuning data.</description><content:encoded><![CDATA[<h2 id="a-two-stage-pre-trained-transformer-for-chemical-reactions">A Two-Stage Pre-trained Transformer for Chemical Reactions</h2>
<p>ReactionT5 is a <strong>Method</strong> paper that proposes a T5-based pre-trained model for chemical reaction tasks, specifically product prediction and yield prediction. The primary contribution is a two-stage pretraining pipeline: first on a compound library (ZINC, 23M molecules) to learn molecular representations, then on a large-scale reaction database (the Open Reaction Database, 1.5M reactions) to learn reaction-level patterns. The key result is that this pre-trained model can be fine-tuned with very limited target-domain data (as few as 30 reactions) and still achieve competitive performance against models trained on full datasets.</p>
<h2 id="bridging-the-gap-between-single-molecule-and-multi-molecule-pretraining">Bridging the Gap Between Single-Molecule and Multi-Molecule Pretraining</h2>
<p>While transformer-based models pre-trained on compound libraries (e.g., <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, MolGPT) have seen substantial development, most focus on single-molecule inputs and outputs. Pretraining for multi-molecule contexts, such as chemical reactions involving reactants, reagents, catalysts, and products, remains underexplored. T5Chem supports multi-task reaction prediction but focuses on building a single multi-task model rather than investigating the effectiveness of pre-trained models for fine-tuning on limited in-house data.</p>
<p>The authors identify two key gaps:</p>
<ol>
<li>Most pre-trained chemical models do not account for reaction-level interactions between multiple molecules.</li>
<li>In practical settings, target-domain reaction data is often scarce, making transfer learning from large public datasets essential.</li>
</ol>
<h2 id="two-stage-pretraining-with-compound-restoration">Two-Stage Pretraining with Compound Restoration</h2>
<p>The core innovation is a two-stage pretraining procedure built on the <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5 (text-to-text transfer transformer)</a> architecture:</p>
<p><strong>Stage 1: Compound Pretraining (CompoundT5)</strong>. A randomly initialized T5 model is trained on 23M <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> from the ZINC database using span-masked language modeling. The model learns to predict masked subsequences of SMILES tokens. A SentencePiece unigram tokenizer is trained on this compound library, allowing more compact representations than character-level or atom-level tokenizers. After this stage, new tokens are added to the tokenizer to cover metal atoms and other characters present in the reaction database but absent from ZINC.</p>
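A toy version of the span-masking objective on a tokenized SMILES string (single span with T5-style sentinel tokens; the real pretraining masks 15% of tokens in spans of average length 3, and uses SentencePiece rather than character tokens):

```python
import random

def span_corrupt(tokens, span_len=3, seed=0):
    """Replace one contiguous span with a sentinel token; the target
    reconstructs the span (toy, single-span T5 span-masked LM)."""
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span_len + 1)
    inputs = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
    target = ["<extra_id_0>"] + tokens[start:start + span_len] + ["<extra_id_1>"]
    return inputs, target
```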
<p><strong>Stage 2: Reaction Pretraining (ReactionT5)</strong>. CompoundT5 is further pretrained on 1.5M reactions from the Open Reaction Database (ORD) on both product prediction and yield prediction tasks. Reactions are formulated as text-to-text tasks using special tokens:</p>
<ul>
<li><code>REACTANT:</code>, <code>REAGENT:</code>, and <code>PRODUCT:</code> tokens delimit the role of each molecule in the reaction string.</li>
<li>For product prediction, the model takes reactants and reagents as input and generates product SMILES.</li>
<li>For yield prediction, the model takes the full reaction (including products) and outputs a numerical yield value.</li>
</ul>
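A reaction input assembled with these role tokens might look like the following sketch. The role tokens themselves are from the paper; the exact delimiter placement and the '.'-joining of multiple molecules are assumptions for illustration:

```python
def format_reaction(reactants, reagents, product=""):
    """Assemble a reaction string with the ReactionT5 role tokens.
    ('.'-joining of molecules is an assumption, not confirmed by the paper.)"""
    return ("REACTANT:" + ".".join(reactants)
            + "REAGENT:" + ".".join(reagents)
            + "PRODUCT:" + product)
```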
<p><strong>Compound Restoration</strong>. A notable methodological detail is the handling of uncategorized compounds in the ORD. About 31.8% of ORD reactions contain compounds with unknown roles. Simply discarding these reactions introduces severe product bias (only 447 unique products remain vs. 439,898 with uncategorized data included). The authors develop RestorationT5, a binary classifier built from CompoundT5, that assigns uncategorized compounds to either reactant or reagent roles. This classifier uses a sigmoid output layer and achieves an F1 score of 0.1564 at a threshold of 0.97, outperforming a random forest baseline (F1 = 0.1136). The restored dataset (&ldquo;ORD(restored)&rdquo;) is then used for reaction pretraining.</p>
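For reference, the reported F1 follows directly from the restoration classifier's precision and recall (the numeric values plugged in below are the paper's):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

With the reported precision (0.0878) and recall (0.7212), this gives roughly 0.156, matching the stated F1 of 0.1564.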
<p>For yield prediction, the loss function is mean squared error:</p>
<p>$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$</p>
<p>where $y_i$ is the true yield (normalized to [0, 1]) and $\hat{y}_i$ is the predicted yield.</p>
<h2 id="experimental-setup-product-and-yield-prediction-benchmarks">Experimental Setup: Product and Yield Prediction Benchmarks</h2>
<h3 id="product-prediction">Product Prediction</h3>
<p>The USPTO dataset (479K reactions) is used for evaluation, with standard train/val/test splits (409K/30K/40K). Reactions overlapping with the ORD (18%) are removed during evaluation. Beam search with beam size 10 is used for decoding, and minimum/maximum output length constraints are set based on the training data distribution. Top-k accuracy (k = 1, 2, 3, 5) and invalidity rate are reported.</p>
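Top-k accuracy over beam outputs is straightforward to compute; the sketch below assumes exact string matching on canonical SMILES (the paper's matching procedure is not spelled out here):

```python
def top_k_accuracy(references, beam_outputs, k):
    """Fraction of reactions whose reference product appears among the
    first k beam-search candidates (exact string match on canonical SMILES)."""
    hits = sum(ref in beams[:k] for ref, beams in zip(references, beam_outputs))
    return hits / len(references)
```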
<p>Baselines include Seq-to-seq, WLDN (graph neural network), <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, and T5Chem.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Train</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Seq-to-seq</td>
          <td>USPTO</td>
          <td>80.3</td>
          <td>84.7</td>
          <td>86.2</td>
          <td>87.5</td>
          <td>-</td>
      </tr>
      <tr>
          <td>WLDN</td>
          <td>USPTO</td>
          <td>85.6</td>
          <td>90.5</td>
          <td>92.8</td>
          <td>93.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Molecular Transformer</td>
          <td>USPTO</td>
          <td>88.8</td>
          <td>92.6</td>
          <td>-</td>
          <td>94.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>USPTO</td>
          <td>90.4</td>
          <td>94.2</td>
          <td>-</td>
          <td>96.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>USPTO</td>
          <td>88.0</td>
          <td>92.4</td>
          <td>93.9</td>
          <td>95.0</td>
          <td>7.5</td>
      </tr>
      <tr>
          <td>ReactionT5 (restored ORD)</td>
          <td>USPTO200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<p>A critical finding: ReactionT5 pre-trained on ORD achieves 0% accuracy on USPTO without fine-tuning due to domain mismatch (ORD includes byproducts; USPTO lists only the main product). Fine-tuning on just 200 USPTO reactions with the restored ORD model produces competitive results.</p>
<p>The few-shot fine-tuning analysis shows rapid performance scaling:</p>
<table>
  <thead>
      <tr>
          <th>Samples</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>10</td>
          <td>9.0</td>
          <td>12.5</td>
          <td>15.3</td>
          <td>19.1</td>
          <td>12.4</td>
      </tr>
      <tr>
          <td>30</td>
          <td>80.5</td>
          <td>87.3</td>
          <td>89.8</td>
          <td>92.0</td>
          <td>17.2</td>
      </tr>
      <tr>
          <td>50</td>
          <td>83.7</td>
          <td>89.9</td>
          <td>92.2</td>
          <td>94.0</td>
          <td>14.8</td>
      </tr>
      <tr>
          <td>100</td>
          <td>85.1</td>
          <td>91.0</td>
          <td>92.8</td>
          <td>94.4</td>
          <td>14.0</td>
      </tr>
      <tr>
          <td>200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<h3 id="yield-prediction">Yield Prediction</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling dataset (3,955 reactions) is used with random 7:3 splits (repeated 10 times) plus four out-of-sample test sets (Tests 1-4) designed so that similar reactions do not appear in both train and test.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Random 7:3</th>
          <th>Test 1</th>
          <th>Test 2</th>
          <th>Test 3</th>
          <th>Test 4</th>
          <th>Avg. Tests 1-4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DFT</td>
          <td>0.92</td>
          <td>0.80</td>
          <td>0.77</td>
          <td>0.64</td>
          <td>0.54</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MFF</td>
          <td>0.927</td>
          <td>0.851</td>
          <td>0.713</td>
          <td>0.635</td>
          <td>0.184</td>
          <td>0.596</td>
      </tr>
      <tr>
          <td>Yield-BERT</td>
          <td>0.951</td>
          <td>0.838</td>
          <td>0.836</td>
          <td>0.738</td>
          <td>0.538</td>
          <td>0.738</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>0.970</td>
          <td>0.811</td>
          <td>0.907</td>
          <td>0.789</td>
          <td>0.627</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>0.971</td>
          <td>0.855</td>
          <td>0.852</td>
          <td>0.712</td>
          <td>0.547</td>
          <td>0.741</td>
      </tr>
      <tr>
          <td>ReactionT5</td>
          <td>0.966</td>
          <td>0.914</td>
          <td>0.940</td>
          <td>0.819</td>
          <td>0.896</td>
          <td>0.892</td>
      </tr>
      <tr>
          <td>ReactionT5 (zero-shot)</td>
          <td>0.904</td>
          <td>0.919</td>
          <td>0.927</td>
          <td>0.847</td>
          <td>0.909</td>
          <td>0.900</td>
      </tr>
  </tbody>
</table>
<p>ReactionT5 achieves the highest average $R^2$ across Tests 1-4 (0.892), with the zero-shot variant performing even better (0.900). The improvement is most dramatic on Test 4, the hardest split, where ReactionT5 achieves $R^2 = 0.896$ versus T5Chem&rsquo;s 0.627 and Yield-BERT&rsquo;s 0.538.</p>
<p>In a low-data regime (30% train / 70% test), ReactionT5 ($R^2 = 0.927$) substantially outperforms a random forest baseline ($R^2 = 0.853$), and even zero-shot ReactionT5 ($R^2 = 0.898$) exceeds the random forest.</p>
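For completeness, the coefficient of determination used throughout these comparisons can be sketched as:

```python
import numpy as np

def r_squared(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot: 1.0 for perfect prediction,
    0.0 for predicting the mean of y_true everywhere."""
    y = np.asarray(y_true, dtype=float)
    f = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - f) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```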
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Two-stage pretraining is effective</strong>: Compound pretraining followed by reaction pretraining produces models with strong generalization, particularly on out-of-distribution test sets.</li>
<li><strong>Few-shot transfer works</strong>: With as few as 30 fine-tuning reactions, ReactionT5 achieves over 80% Top-1 accuracy on product prediction, competitive with models trained on the full USPTO dataset.</li>
<li><strong>Compound restoration matters</strong>: Restoring uncategorized compounds in the ORD is essential for product prediction. Without restoration, fine-tuning on 200 USPTO reactions yields 0% accuracy; with restoration, the same fine-tuning yields 85.5% Top-1.</li>
<li><strong>Zero-shot yield prediction is surprisingly effective</strong>: ReactionT5 achieves $R^2 = 0.900$ on the out-of-sample yield tests without any task-specific fine-tuning, outperforming all fine-tuned baselines.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Product prediction shows a high invalidity rate (12.0% for the best ReactionT5 variant) compared to CompoundT5 (7.5%), suggesting the reaction pretraining may introduce some noise.</li>
<li>The 0% accuracy without fine-tuning on product prediction reveals a significant domain gap between ORD and USPTO annotation conventions (byproducts vs. main products).</li>
<li>The RestorationT5 classifier has low precision (0.0878) despite high recall (0.7212), meaning many compounds are incorrectly assigned roles. The paper does not investigate how this impacts downstream performance.</li>
<li>The paper does not report training times, computational costs, or model sizes, making resource requirements unclear.</li>
<li>Only two downstream tasks (product prediction on USPTO, yield prediction on Buchwald-Hartwig) are evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compound pretraining</td>
          <td>ZINC</td>
          <td>22,992,522 compounds</td>
          <td>SMILES canonicalized with <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a></td>
      </tr>
      <tr>
          <td>Reaction pretraining</td>
          <td>ORD (restored)</td>
          <td>1,505,916 reactions</td>
          <td>Atom mapping removed, compounds canonicalized</td>
      </tr>
      <tr>
          <td>Product prediction eval</td>
          <td>USPTO</td>
          <td>479,035 reactions</td>
          <td>409K/30K/40K train/val/test split</td>
      </tr>
      <tr>
          <td>Yield prediction eval</td>
          <td>Buchwald-Hartwig C-N</td>
          <td>3,955 reactions</td>
          <td>Random 7:3 split (10 repeats) + 4 OOS tests</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Base architecture: T5 (text-to-text transfer transformer)</li>
<li>Tokenizer: SentencePiece unigram, trained on ZINC, extended with special reaction tokens</li>
<li>Compound pretraining: Span-masked language modeling (15% masking rate, average span length 3)</li>
<li>Beam search: size 10 for product prediction</li>
<li>Output length constraints: min/max from training data distribution</li>
<li>Yield normalization: clipped to [0, 100], then scaled to [0, 1]</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>CompoundT5: T5 pretrained on ZINC</li>
<li>RestorationT5: CompoundT5 fine-tuned for binary classification (reactant vs. reagent)</li>
<li>ReactionT5: CompoundT5 pretrained on ORD for product and yield prediction</li>
<li>Pre-trained weights available on Hugging Face</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>Product prediction</td>
          <td>85.5%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>Top-5 accuracy</td>
          <td>Product prediction</td>
          <td>94.9%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (random)</td>
          <td>0.966</td>
          <td>ReactionT5 fine-tuned</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (OOS avg.)</td>
          <td>0.900</td>
          <td>ReactionT5 zero-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Training times and GPU requirements are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sagawatatsuya/ReactionT5v2">ReactionT5v2 (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/sagawa">ReactionT5 models (Hugging Face)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sagawa, T. &amp; Kojima, R. (2023). ReactionT5: a large-scale pre-trained model towards application of limited reaction data. <em>arXiv preprint arXiv:2311.06708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sagawa2023reactiont5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ReactionT5: a large-scale pre-trained model towards application of limited reaction data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sagawa, Tatsuya and Kojima, Ryosuke}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2311.06708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2311.06708}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PharMolixFM: Multi-Modal All-Atom Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/pharmolixfm-all-atom-foundation-models/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/pharmolixfm-all-atom-foundation-models/</guid><description>PharMolixFM unifies diffusion, flow matching, and Bayesian flow networks for all-atom molecular modeling and generation with task-specific denoising priors.</description><content:encoded><![CDATA[<h2 id="a-unified-framework-for-all-atom-molecular-foundation-models">A Unified Framework for All-Atom Molecular Foundation Models</h2>
<p>PharMolixFM is a <strong>Method</strong> paper that introduces a unified framework for constructing all-atom foundation models for molecular modeling and generation. The primary contribution is the systematic implementation of three multi-modal generative model variants (diffusion, flow matching, and Bayesian flow networks) within a single architecture, along with a task-unifying denoising formulation that enables training on multiple structural biology tasks simultaneously. The framework achieves competitive performance on protein-small-molecule docking and structure-based drug design while providing the first empirical analysis of inference scaling laws for molecular generative models.</p>
<h2 id="challenges-in-multi-modal-atomic-modeling">Challenges in Multi-Modal Atomic Modeling</h2>
<p>Existing all-atom foundation models such as AlphaFold3, RoseTTAFold All-Atom, and ESM-AA face two core challenges that limit their generalization across molecular modeling and generation tasks.</p>
<p>First, atomic data is inherently multi-modal: each atom comprises both a discrete atom type and continuous 3D coordinates. This poses challenges for structure models that need to jointly capture and predict both modalities. Unlike text or image data that exhibit a single modality, molecular structures require generative models that can handle discrete categorical variables (atom types, bond types) and continuous variables (coordinates) simultaneously.</p>
<p>Second, there has been no comprehensive analysis of how different training objectives and sampling strategies impact the performance of all-atom foundation models. Prior work has focused on individual model architectures without systematically comparing generative frameworks or studying how inference-time compute scaling affects prediction quality.</p>
<p>PharMolixFM addresses both challenges by providing a unified framework that implements three state-of-the-art multi-modal generative models and formulates all downstream tasks as a generalized denoising process with task-specific priors.</p>
<h2 id="multi-modal-denoising-with-task-specific-priors">Multi-Modal Denoising with Task-Specific Priors</h2>
<p>The core innovation of PharMolixFM is the formulation of molecular tasks as a generalized denoising process where task-specific priors control which parts of the molecular system are noised during training. The framework decomposes a biomolecular system into $N$ atoms represented as a triplet $\bar{\mathbf{S}}_0 = \langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle$, where $\mathbf{X}_0 \in \mathbb{R}^{N \times 3}$ are atom coordinates, $\mathbf{A}_0 \in \mathbb{Z}^{N \times D_1}$ are one-hot atom types, and $\mathbf{E}_0 \in \mathbb{Z}^{N \times N \times D_2}$ are one-hot bond types.</p>
<p>The generative model estimates the density $p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)$ subject to SE(3) invariance:</p>
<p>$$
p_\theta(\langle \mathbf{R}\mathbf{X}_0 + \mathbf{t}, \mathbf{A}_0, \mathbf{E}_0 \rangle) = p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)
$$</p>
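This invariance can be exercised numerically: a random rigid motion (rotation $\mathbf{R}$ plus translation $\mathbf{t}$) preserves all interatomic distances, and the density must not change under it. A sketch of such a transform:

```python
import numpy as np

def random_rigid_transform(X: np.ndarray, rng) -> np.ndarray:
    """Apply a random rotation R (QR decomposition of a Gaussian matrix,
    sign-corrected to det(R) = +1) and a random translation t to coordinates."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0  # flip one axis to obtain a proper rotation
    t = rng.standard_normal(3)
    return X @ Q.T + t
```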
<p>The variational lower bound is optimized over latent variables $S_1, \ldots, S_T$ obtained by adding independent noise to different modalities and atoms:</p>
<p>$$
q(S_{1:T} \mid S_0) = \prod_{i=1}^{T} \prod_{j=1}^{N} q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}, \sigma_{i,j}^{(\mathbf{X})})\, q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}, \sigma_{i,j}^{(\mathbf{A})})\, q(\mathbf{E}_{i,j} \mid \mathbf{E}_{0,j}, \sigma_{i,j}^{(\mathbf{E})})
$$</p>
<p>A key design choice is the noise schedule $\sigma_{i,j}^{(\mathcal{M})} = \frac{i}{T} \cdot \text{fix}_j^{(\mathcal{M})}$, where $\text{fix}_j^{(\mathcal{M})}$ is a scaling factor between 0 and 1 that controls which atoms and modalities receive noise. This &ldquo;Fix&rdquo; mechanism enables multiple training tasks:</p>
<ul>
<li><strong>Docking</strong> ($\text{Fix} = 1$ for protein and molecular graph, $\text{Fix} = 0$ for molecule coordinates): predicts binding pose given known atom/bond types.</li>
<li><strong>Structure-based drug design</strong> ($\text{Fix} = 1$ for protein, $\text{Fix} = 0$ for all molecule properties): generates novel molecules for a given pocket.</li>
<li><strong>Robustness augmentation</strong> ($\text{Fix} = 0.7$ for 15% randomly selected atoms, $\text{Fix} = 0$ for rest): simulates partial structure determination.</li>
</ul>
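As written, the schedule ramps each atom's noise level linearly from 0 at step 0 to its per-atom scaling factor at step $T$. A direct transcription (array shapes are illustrative):

```python
import numpy as np

def noise_schedule(T: int, fix) -> np.ndarray:
    """sigma[i-1, j] = (i / T) * fix_j for steps i = 1..T: components with
    fix_j = 0 are never perturbed; fix_j = 1 reaches full noise at step T."""
    steps = np.arange(1, T + 1)[:, None] / T               # shape (T, 1)
    return steps * np.asarray(fix, dtype=float)[None, :]   # shape (T, n_atoms)
```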
<h3 id="three-generative-model-variants">Three Generative Model Variants</h3>
<p><strong>Multi-modal diffusion (PharMolixFM-Diff)</strong> uses a Markovian forward process. Continuous coordinates follow Gaussian diffusion while discrete variables use a D3PM categorical transition:</p>
<p>$$
q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\sqrt{\alpha_{i,j}}\, \mathbf{X}_{0,j}, (1 - \alpha_{i,j}) \mathbf{I}), \quad \alpha_{i,j} = \prod_{k=1}^{i}(1 - \sigma_{k,j}^{(\mathbf{X})})
$$</p>
<p>$$
q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}) = \text{Cat}(\mathbf{A}_{0,j} \bar{Q}_{i,j}^{(\mathbf{A})}), \quad \bar{Q}_{i,j}^{(\mathbf{A})} = \prod_{k=1}^{i} Q_{k,j}^{(\mathbf{A})}, \quad Q_{i,j}^{(\mathbf{A})} = (1 - \sigma_{i,j}^{(\mathbf{A})}) \mathbf{I} + \frac{\sigma_{i,j}^{(\mathbf{A})}}{D_1} \mathbb{1}\mathbb{1}^T
$$</p>
<p>The training loss combines coordinate MSE with cross-entropy for discrete variables:</p>
<p>$$
\mathcal{L} = \mathbb{E}_{S_0, i, S_i} \left[ \lambda_i^{(\mathbf{X})} \| \tilde{\mathbf{X}}_0 - \mathbf{X}_0 \|_2^2 + \lambda_i^{(\mathbf{A})} \mathcal{L}_{CE}(\tilde{\mathbf{A}}_0, \mathbf{A}_0) + \lambda_i^{(\mathbf{E})} \mathcal{L}_{CE}(\tilde{\mathbf{E}}_0, \mathbf{E}_0) \right]
$$</p>
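The coordinate forward process under this schedule can be sketched with the reparameterization trick, accumulating $\alpha_{i,j}$ as a running product of $(1 - \sigma_{k,j})$ (shapes and step values are illustrative):

```python
import numpy as np

def diffuse_coords(X0: np.ndarray, sigmas, rng) -> np.ndarray:
    """Sample X_i ~ N(sqrt(alpha_i) * X0, (1 - alpha_i) I), where alpha_i
    is the running product of (1 - sigma_k) over the noising steps so far."""
    alpha = float(np.prod(1.0 - np.asarray(sigmas, dtype=float)))
    noise = rng.standard_normal(X0.shape)
    return np.sqrt(alpha) * X0 + np.sqrt(1.0 - alpha) * noise
```

With all sigmas zero (a fixed component), alpha stays 1 and the coordinates pass through unchanged.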
<p><strong>Multi-modal flow matching (PharMolixFM-Flow)</strong> constructs a direct mapping between data and prior distributions using conditional vector fields. For coordinates, the conditional flow uses a Gaussian path $q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}((1 - \sigma_{i,j}^{(\mathbf{X})}) \mathbf{X}_{0,j}, (\sigma_{i,j}^{(\mathbf{X})})^2 \mathbf{I})$, while discrete variables use the same D3PM Markov chain. Sampling proceeds by solving an ODE via Euler integration.</p>
<p><strong>Bayesian flow networks (PharMolixFM-BFN)</strong> perform generative modeling in the parameter space of the data distribution rather than the data space. The Bayesian flow distribution for coordinates is:</p>
<p>$$
p_F(\tilde{\mathbf{X}}_{i,j}^{(\theta)} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\gamma_{i,j} \mathbf{X}_{0,j}, \gamma_{i,j}(1 - \gamma_{i,j}) \mathbf{I}), \quad \gamma_{i,j} = 1 - \alpha^{2(1 - \sigma_{i,j}^{(\mathbf{X})})}
$$</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>The architecture follows PocketXMol with a dual-branch SE(3)-equivariant graph neural network. A protein branch (4-layer GNN with kNN graph) processes pocket atoms, then representations are passed to a molecule branch (6-layer GNN) that captures protein-molecule interactions. Independent prediction heads reconstruct atom coordinates, atom types, and bond types, with additional confidence heads for self-ranking during inference.</p>
<h2 id="docking-and-drug-design-experiments">Docking and Drug Design Experiments</h2>
<h3 id="protein-small-molecule-docking">Protein-Small-Molecule Docking</h3>
<p>PharMolixFM is evaluated on the PoseBusters benchmark (428 protein-small-molecule complexes) in the holo docking setting, with a known protein structure and a 10 Angstrom binding pocket. The metric is the fraction of predictions with RMSD &lt; 2 Angstrom.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Self-Ranking (%)</th>
          <th>Oracle-Ranking (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffDock</td>
          <td>38.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>RFAA</td>
          <td>42.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Vina</td>
          <td>52.3</td>
          <td>-</td>
      </tr>
      <tr>
          <td>UniMol-Docking V2</td>
          <td>77.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>SurfDock</td>
          <td>78.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>AlphaFold3</td>
          <td>90.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>PocketXMol (50 repeats)</td>
          <td>82.2</td>
          <td>95.3</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (50 repeats)</td>
          <td>83.4</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow (50 repeats)</td>
          <td>73.4</td>
          <td>93.7</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN (50 repeats)</td>
          <td>78.5</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (500 repeats)</td>
          <td>83.9</td>
          <td>98.1</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM-Diff achieves the second-best self-ranking result (83.4%), outperforming PocketXMol by 1.2 percentage points but trailing AlphaFold3 (90.4%). The key advantage is inference speed: approximately 4.6 seconds per complex on a single A800 GPU compared to approximately 249.0 seconds for AlphaFold3 (a 54x speedup). Under oracle-ranking with 500 repeats, PharMolixFM-Diff reaches 98.1%, suggesting that better ranking strategies could further improve practical performance.</p>
<h3 id="structure-based-drug-design">Structure-Based Drug Design</h3>
<p>Evaluation uses the CrossDocked test set (100 protein pockets, 100 molecules generated per pocket), measuring Vina binding affinity scores and drug-likeness properties (QED and SA).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score (Avg/Med)</th>
          <th>QED</th>
          <th>SA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-5.14 / -4.70</td>
          <td>0.57</td>
          <td>0.76</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-5.47 / -6.30</td>
          <td>0.48</td>
          <td>0.58</td>
      </tr>
      <tr>
          <td>DecompDiff</td>
          <td>-5.67 / -6.04</td>
          <td>0.45</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>MolCRAFT</td>
          <td>-6.61 / -8.14</td>
          <td>0.46</td>
          <td>0.62</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff</td>
          <td>-6.18 / -6.44</td>
          <td>0.50</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow</td>
          <td>-6.34 / -6.47</td>
          <td>0.49</td>
          <td>0.74</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN</td>
          <td>-6.38 / -6.45</td>
          <td>0.48</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM strikes a better balance between binding affinity and drug-like properties than the diffusion-based baselines. While MolCRAFT achieves the best Vina scores, the PharMolixFM-Diff and -Flow variants show notably higher QED (0.49-0.50 vs. 0.45-0.48) and SA (0.73-0.74 vs. 0.58-0.62) than TargetDiff, DecompDiff, and MolCRAFT, properties that matter for downstream validation and in vivo application. Pocket2Mol still leads on QED and SA, but with the weakest Vina scores.</p>
<h3 id="inference-scaling-law">Inference Scaling Law</h3>
<p>The paper explores whether inference-time scaling holds for molecular generative models, fitting the relationship:</p>
<p>$$
\text{Acc} = a \log(bR + c) + d
$$</p>
<p>where $R$ is the number of sampling repeats. All three PharMolixFM variants exhibit logarithmic improvement in docking accuracy with increased sampling repeats, analogous to inference scaling laws observed in NLP. Performance plateaus eventually due to distributional differences between training and test sets.</p>
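<p>The shape of this law is easy to see numerically. With illustrative coefficients (not the paper's fitted values), equal increments in the repeat budget $R$ buy progressively smaller accuracy gains:</p>

```python
import math

def scaling_law(R, a=4.0, b=0.5, c=1.0, d=70.0):
    """Acc = a * log(b*R + c) + d, with made-up coefficients for illustration."""
    return a * math.log(b * R + c) + d

repeats = [100, 200, 300, 400, 500]
accs = [scaling_law(r) for r in repeats]
# Each extra 100 repeats yields a smaller gain than the last: the curve saturates.
deltas = [y2 - y1 for y1, y2 in zip(accs, accs[1:])]
```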
<h2 id="competitive-docking-with-faster-inference-but-limited-task-scope">Competitive Docking with Faster Inference, but Limited Task Scope</h2>
<p>PharMolixFM demonstrates that multi-modal generative models can achieve competitive all-atom molecular modeling with substantial inference speed advantages over AlphaFold3. The key findings are:</p>
<ol>
<li><strong>Diffusion outperforms flow matching and BFN</strong> for docking under standard sampling budgets. The stochastic nature of diffusion sampling appears beneficial compared to the deterministic ODE integration of flow matching.</li>
<li><strong>Oracle-ranking reveals untapped potential</strong>: the gap between self-ranking (83.4%) and oracle-ranking (98.1%) at 500 repeats indicates that confidence-based ranking is a bottleneck. Better ranking methods could close the gap with AlphaFold3.</li>
<li><strong>The three variants show similar performance for drug design</strong>, suggesting that model architecture and training data may matter more than the choice of generative framework for generation tasks.</li>
<li><strong>Inference scaling laws hold</strong> for molecular generative models, paralleling findings in NLP.</li>
</ol>
<p>Limitations include that the framework is only evaluated on two tasks (docking and SBDD), and the paper does not address protein structure prediction, protein-protein interactions, or nucleic acid modeling, which are part of AlphaFold3&rsquo;s scope. The BFN variant underperforms the diffusion model, which the authors attribute to smaller noise scales at early sampling steps making training less challenging. The paper also does not compare against concurrent work on inference-time scaling for molecular models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PDBBind, Binding MOAD, CrossDocked2020, PepBDB</td>
          <td>Not specified</td>
          <td>Filtered by PocketXMol criteria</td>
      </tr>
      <tr>
          <td>Docking eval</td>
          <td>PoseBusters benchmark</td>
          <td>428 complexes</td>
          <td>Holo docking with known protein</td>
      </tr>
      <tr>
          <td>SBDD eval</td>
          <td>CrossDocked test set</td>
          <td>100 pockets</td>
          <td>100 molecules per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Three generative variants: multi-modal diffusion (D3PM), flow matching, Bayesian flow networks</li>
<li>Task-specific noise via Fix mechanism (0, 0.7, or 1.0)</li>
<li>Training tasks selected with equal probability per sample</li>
<li>AdamW optimizer: weight decay 0.001, $\beta_1 = 0.99$, $\beta_2 = 0.999$</li>
<li>Linear warmup to learning rate 0.001 over 1000 steps</li>
<li>180K training steps with batch size 40</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Dual-branch SE(3)-equivariant GNN (protein: 4-layer, molecule: 6-layer)</li>
<li>kNN graph construction for protein and protein-molecule interactions</li>
<li>Independent prediction heads for coordinates, atom types, bond types</li>
<li>Confidence heads for self-ranking during inference</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PharMolixFM-Diff</th>
          <th>AlphaFold3</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSD &lt; 2A self-ranking</td>
          <td>83.4% (50 rep)</td>
          <td>90.4%</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>RMSD &lt; 2A oracle-ranking</td>
          <td>98.1% (500 rep)</td>
          <td>-</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>Inference time (per complex)</td>
          <td>~4.6s</td>
          <td>~249.0s</td>
          <td>Single A800 GPU</td>
      </tr>
      <tr>
          <td>Vina score (avg)</td>
          <td>-6.18</td>
          <td>-</td>
          <td>CrossDocked SBDD</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 4x 80GB A800 GPUs</li>
<li>Inference benchmarked on single A800 GPU</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/PharMolix/OpenBioMed">OpenBioMed (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Wang, J., Fan, S., &amp; Nie, Z. (2025). PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation. <em>arXiv preprint arXiv:2503.21788</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2025pharmolixfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Wang, Jiashuo and Fan, Siqi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2503.21788}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PharmaGPT: Domain-Specific LLMs for Pharma and Chem</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/pharmagpt-domain-specific-llms-biopharmaceutical/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/pharmagpt-domain-specific-llms-biopharmaceutical/</guid><description>PharmaGPT introduces 13B and 70B parameter LLMs trained on biopharmaceutical and chemical corpora, outperforming GPT-3.5 and rivaling GPT-4 on pharmacy exams.</description><content:encoded><![CDATA[<h2 id="a-domain-specific-llm-suite-for-biopharmaceuticals-and-chemistry">A Domain-Specific LLM Suite for Biopharmaceuticals and Chemistry</h2>
<p>This is a <strong>Method</strong> paper that introduces PharmaGPT, a suite of domain-specific large language models with 13 billion and 70 billion parameters. The models are built on the LLaMA architecture and undergo continued pretraining on a curated corpus of biopharmaceutical and chemical literature, followed by instruction fine-tuning and reinforcement learning from human feedback (RLHF). The primary contribution is demonstrating that domain-specific continued pretraining on a general-purpose LLM backbone can produce models that outperform much larger general-purpose models on pharmaceutical knowledge tasks, using only a fraction of the parameters.</p>
<h2 id="bridging-the-gap-between-general-purpose-llms-and-specialized-pharmaceutical-knowledge">Bridging the Gap Between General-Purpose LLMs and Specialized Pharmaceutical Knowledge</h2>
<p>General-purpose LLMs like GPT-3.5 and GPT-4 show impressive broad capabilities but often fall short in specialized domains requiring precise terminology, deep domain knowledge, and high accuracy. The biopharmaceutical and chemical sectors present particular challenges: intricate terminologies, specialized regulatory knowledge, and a demand for precision that general models cannot consistently deliver. Most state-of-the-art LLMs are proprietary, English-centric, and lack depth in vertical domains. The authors identify a gap in the availability of domain-specific LLMs for biomedicine and chemistry, particularly multilingual models that can handle both English and Chinese pharmaceutical content.</p>
<h2 id="continued-pretraining-with-domain-specific-data-and-weighted-instruction-tuning">Continued Pretraining with Domain-Specific Data and Weighted Instruction Tuning</h2>
<p>PharmaGPT&rsquo;s core innovation lies in its training pipeline, which adapts the LLaMA backbone through three stages:</p>
<p><strong>Extended Tokenizer</strong>: The authors develop a new tokenizer using <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">byte-pair encoding (BPE)</a> from SentencePiece, trained on their pretraining data and merged with the LLaMA2 tokenizer. This extends the vocabulary from 32,000 to 55,296 tokens, improving compression efficiency for Chinese text and specialized domain terminology. The embedding and output layers are resized from $V \times H$ to $V' \times H$ where $V = 32{,}000$ and $V' = 55{,}296$.</p>
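<p>The embedding resize amounts to growing the $V \times H$ matrix while preserving the original rows. A minimal sketch (the initialization scheme for the new rows is an assumption; the paper does not specify it):</p>

```python
import numpy as np

def resize_embeddings(emb: np.ndarray, new_vocab: int, rng=None) -> np.ndarray:
    """Grow a (V, H) embedding matrix to (V', H): old rows are kept so the
    original LLaMA2 token embeddings survive; new rows (for added domain and
    Chinese tokens) get a small random init -- an assumed scheme."""
    rng = rng or np.random.default_rng(0)
    old_vocab, hidden = emb.shape
    new_rows = rng.normal(scale=0.02, size=(new_vocab - old_vocab, hidden))
    return np.concatenate([emb, new_rows], axis=0)

emb = np.random.default_rng(1).normal(size=(32_000, 8))  # toy hidden size
emb2 = resize_embeddings(emb, new_vocab=55_296)
```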
<p><strong>Two-Stage Continued Pretraining</strong>: The models consume 153 billion tokens in Stage 1 (primarily web, news, patents, and papers) and 43 billion tokens in Stage 2 (research reports, exams, books, chats, code, and supervised data). The data distribution shifts between stages to move from general domain knowledge toward specialized biopharmaceutical tasks.</p>
<p><strong>Weighted Instruction Fine-tuning</strong>: Inspired by OpenChat, the authors use a weighted autoregressive objective that zeros out loss on user instruction tokens. The loss function is:</p>
<p>$$\mathcal{L}_{SFT}(\Theta) = \mathbb{E}_{x \sim \mathcal{D}_{SFT}} \left[ -\alpha \sum_{i \in \text{output}} \log p(x_i \mid x_0, x_1, \dots, x_{i-1}; \Theta) \right]$$</p>
<p>where the weight $\alpha$ is set to 1 for expert-curated domain-specific instructions ($\mathcal{D}_{\exp}$) and 0.1 for generic instructions ($\mathcal{D}_{\text{gen}}$). This differential weighting ensures domain-relevant instructions receive higher priority during training.</p>
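<p>Concretely, the weighted objective masks out instruction tokens and scales the remaining negative log-likelihood by $\alpha$. A toy sketch with made-up log-probabilities:</p>

```python
def weighted_sft_loss(token_logprobs, is_output, alpha):
    """Negative log-likelihood over response tokens only, scaled by alpha
    (1.0 for expert-curated examples, 0.1 for generic ones)."""
    nll = -sum(lp for lp, out in zip(token_logprobs, is_output) if out)
    return alpha * nll

# Instruction tokens (is_output=False) contribute no loss.
logprobs = [-0.1, -0.2, -0.5, -0.3]   # per-token log p(x_i | x_<i), toy values
mask = [False, False, True, True]     # last two tokens are the response
expert_loss = weighted_sft_loss(logprobs, mask, alpha=1.0)
generic_loss = weighted_sft_loss(logprobs, mask, alpha=0.1)
```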
<p><strong>RLHF with PPO</strong>: A reward model is initialized from the pretrained PharmaGPT-70B and enhanced with two MLPs to output a scalar preference score. The reward model is trained with a binary ranking loss:</p>
<p>$$\mathcal{L}_{\text{ranking}} = -\log\left(\sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)\right)$$</p>
<p>where $r_\theta(x, y_c)$ is the score for the preferred response and $r_\theta(x, y_r)$ is the score for the rejected response. The RLHF dataset consists of 50,000 human preference expert-annotated instructions with responses from PharmaGPT variants and commercial LLMs (GPT-4, ChatGPT-3.5). <a href="https://en.wikipedia.org/wiki/Proximal_policy_optimization">Proximal Policy Optimization (PPO)</a> is used for the RL training, selecting the highest-scoring response from four generated candidates at each step.</p>
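<p>The ranking loss itself is a one-liner: a wider margin between the preferred and rejected scores yields a lower loss, and a zero margin gives $-\log(0.5) = \log 2$.</p>

```python
import math

def ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Binary ranking loss: -log(sigmoid(r(x, y_c) - r(x, y_r)))."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

close = ranking_loss(1.0, 0.9)   # small margin -> higher loss
wide = ranking_loss(3.0, -1.0)   # large margin -> lower loss
```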
<h2 id="evaluation-on-pharmacy-licensing-exams-translation-and-mmlu">Evaluation on Pharmacy Licensing Exams, Translation, and MMLU</h2>
<p>The evaluation covers four main benchmarks:</p>
<p><strong>NAPLEX (North American Pharmacist Licensure Examination)</strong>: PharmaGPT is tested across three NAPLEX sections. Results show consistent improvement across model iterations:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>NAPLEX I</th>
          <th>NAPLEX II</th>
          <th>NAPLEX III</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PharmaGPT 0.1</td>
          <td>5.0</td>
          <td>2.5</td>
          <td>3.5</td>
      </tr>
      <tr>
          <td>PharmaGPT 0.3</td>
          <td>42.0</td>
          <td>48.0</td>
          <td>46.5</td>
      </tr>
      <tr>
          <td>PharmaGPT 0.5</td>
          <td>57.0</td>
          <td>59.0</td>
          <td>58.0</td>
      </tr>
      <tr>
          <td>PharmaGPT 0.7</td>
          <td>66.0</td>
          <td>68.0</td>
          <td>76.0</td>
      </tr>
  </tbody>
</table>
<p>PharmaGPT 0.7 scores in the 66-76% range across all three NAPLEX sections, outperforming GPT-3.5-turbo by considerable margins.</p>
<p><strong>Chinese Pharmacist Examination</strong>: PharmaGPT achieves scores in the 70% range across all four exam categories, outperforming both GPT-3.5-turbo and GPT-4 in all categories. This result is notable given GPT-4&rsquo;s much larger scale.</p>
<p><strong>Biomedical Translation</strong>: PharmaGPT 0.7 outperforms GPT-3.5, Claude 3, and Google Translate on biomedical paper translation (English-Chinese), achieving <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a> scores of 30 (paragraph-level), 18 (sentence-level), and 10 (word-level).</p>
<p><strong>MMLU</strong>: On the general Massive Multitask Language Understanding benchmark, PharmaGPT achieves scores in the 80% range across most biomedical and life science tasks, surpassing GPT-3.5-turbo and performing comparably to GPT-4 in areas such as physiology, health sciences, and biology.</p>
<h2 id="strong-domain-performance-with-smaller-scale-but-limited-reproducibility">Strong Domain Performance with Smaller Scale, but Limited Reproducibility</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Domain-specific continued pretraining enables a 70B parameter model to match or exceed GPT-4 on pharmaceutical knowledge tasks, despite having a fraction of GPT-4&rsquo;s parameters</li>
<li>Iterative post-training (versions 0.1 through 0.7) shows consistent improvement, with the largest gains occurring between versions 0.3 and 0.5</li>
<li>The two-stage pretraining strategy, shifting from general domain data to more specialized exam and report data, appears effective for building domain expertise</li>
<li>Scaling laws hold within the PharmaGPT family: larger parameter counts consistently produce better performance on both NAPLEX and Chinese pharmaceutical exams</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>Potential biases in the training data</li>
<li>Model dependency on the quality and diversity of input prompts</li>
<li>Challenges in accurately assessing performance on highly specialized tasks without domain expert evaluation</li>
<li>Interpretability concerns for use in sensitive healthcare and pharmaceutical applications</li>
<li>The 3B model is trained from scratch while the 13B and 70B models use LLaMA as a backbone, making direct comparison across model sizes less straightforward</li>
</ul>
<p><strong>Missing details</strong>: The paper does not release model weights, training code, or the proprietary training dataset. No ablation studies isolate the contribution of each training stage (continued pretraining vs. instruction tuning vs. RLHF). The evaluation is limited to multiple-choice exams and translation, without testing on molecular property prediction, reaction prediction, or other computational chemistry tasks common in this domain.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining Stage 1</td>
          <td>Web, News, Patents, Papers</td>
          <td>153B tokens</td>
          <td>Proprietary corpus; not publicly available</td>
      </tr>
      <tr>
          <td>Pretraining Stage 2</td>
          <td>Research Reports, Exams, Books, Chats, Code</td>
          <td>43B tokens</td>
          <td>Proprietary corpus; not publicly available</td>
      </tr>
      <tr>
          <td>Instruction Tuning</td>
          <td>Manually labeled + synthesized data</td>
          <td>Several hundred thousand instructions</td>
          <td>Includes expert Q&amp;A, patent data, ShareGPT</td>
      </tr>
      <tr>
          <td>RLHF</td>
          <td>Human preference annotations</td>
          <td>50,000 annotated instructions</td>
          <td>Expert annotators ranked responses</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>NAPLEX, Chinese Pharmacist Exam, MMLU, MT</td>
          <td>Not specified</td>
          <td>Exam datasets sourced from public exams</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Base architecture</strong>: LLaMA (13B and 70B variants); 3B model trained from scratch</li>
<li><strong>Tokenizer</strong>: Extended BPE tokenizer (55,296 vocab size) merged with LLaMA2 tokenizer</li>
<li><strong>Training objective</strong>: Standard autoregressive LM (pretraining), weighted autoregressive with $\alpha \in \{0.1, 1.0\}$ (SFT), PPO (RLHF)</li>
<li><strong>Reward model</strong>: Initialized from PharmaGPT-70B with two additional MLPs</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Base</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PharmaGPT-3B</td>
          <td>3B</td>
          <td>Trained from scratch</td>
          <td>Not evaluated in main results</td>
      </tr>
      <tr>
          <td>PharmaGPT-13B</td>
          <td>13B</td>
          <td>LLaMA-13B</td>
          <td>Post-trained</td>
      </tr>
      <tr>
          <td>PharmaGPT-70B</td>
          <td>70B</td>
          <td>LLaMA-70B</td>
          <td>Primary model; versions 0.1-0.7 reported</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PharmaGPT 0.7</th>
          <th>GPT-3.5</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NAPLEX I</td>
          <td>66%</td>
          <td>~50%</td>
          <td>Estimated from figures</td>
      </tr>
      <tr>
          <td>NAPLEX II</td>
          <td>68%</td>
          <td>~50%</td>
          <td>Estimated from figures</td>
      </tr>
      <tr>
          <td>NAPLEX III</td>
          <td>76%</td>
          <td>~50%</td>
          <td>Estimated from figures</td>
      </tr>
      <tr>
          <td>Chinese Pharmacist Exam</td>
          <td>~70% range</td>
          <td>Lower</td>
          <td>Outperforms GPT-4</td>
      </tr>
      <tr>
          <td>Biomedical Translation (paragraph BLEU)</td>
          <td>30</td>
          <td>27</td>
          <td>English-Chinese</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify the hardware used for training. Training hyperparameters for the 70B model include tensor parallelism (TP=8) and pipeline parallelism (PP=16) during pretraining, suggesting multi-node GPU training, likely on at least 128 GPUs.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PharmaGPT models</td>
          <td>Model</td>
          <td>Not released</td>
          <td>No public weights or API access</td>
      </tr>
      <tr>
          <td>Training data</td>
          <td>Dataset</td>
          <td>Proprietary</td>
          <td>PatSnap internal data</td>
      </tr>
      <tr>
          <td>Training code</td>
          <td>Code</td>
          <td>Not released</td>
          <td>No public repository</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: <strong>Closed</strong>. Neither the model weights, training data, nor training code are publicly available. The proprietary nature of both the data pipeline and the models makes independent reproduction infeasible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, L., Wang, W., Bai, Z., Xu, P., Fang, Y., Fang, J., &hellip; &amp; Tu, C. (2024). PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry. <em>arXiv preprint arXiv:2406.18045</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chen2024pharmagpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Linqing and Wang, Weilei and Bai, Zilong and Xu, Peng and Fang, Yan and Fang, Jie and Wu, Wentao and Zhou, Lizhi and Zhang, Ruiji and Xia, Yubin and Xu, Chaobo and Hu, Ran and Xu, Licong and Cai, Qijun and Hua, Haoran and Sun, Jing and Liu, Jin and Qiu, Tian and Liu, Haowen and Hu, Meng and Li, Xiuwen and Gao, Fei and Wang, Yufu and Tie, Lin and Wang, Chaochao and Lu, Jianping and Sun, Cheng and Wang, Yixin and Yang, Shengjie and Li, Yuancheng and Jin, Lu and Zhang, Lisha and Bian, Fu and Ye, Zhongkai and Pei, Lidong and Tu, Changyang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2406.18045}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2406.18045}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ORGAN: Objective-Reinforced GANs for Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/</guid><description>ORGAN combines GANs with reinforcement learning to steer SMILES-based molecular generation toward drug-likeness, solubility, and synthesizability objectives.</description><content:encoded><![CDATA[<h2 id="combining-gans-and-reinforcement-learning-for-goal-directed-sequence-generation">Combining GANs and Reinforcement Learning for Goal-Directed Sequence Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces ORGAN (Objective-Reinforced Generative Adversarial Network), a framework for generating sequences that are both realistic (close to the training distribution) and optimized for domain-specific objectives. ORGAN extends SeqGAN by adding external reward functions to the reinforcement learning signal, with a tunable parameter $\lambda$ controlling the balance between adversarial (discriminator) and objective-based rewards. The authors demonstrate ORGAN on two domains: molecular generation using <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings (optimizing druglikeness, solubility, and synthesizability) and musical melody generation (optimizing tonality and step ratios).</p>
<h2 id="exposure-bias-and-mode-collapse-in-discrete-sequence-generation">Exposure Bias and Mode Collapse in Discrete Sequence Generation</h2>
<p>Generating discrete sequences with desirable properties presents two intertwined challenges. First, RNNs trained via maximum likelihood estimation (MLE) suffer from exposure bias, where the model sees only ground-truth prefixes during training but must condition on its own (potentially erroneous) outputs at generation time. Second, while <a href="/posts/what-is-a-gan/">GANs</a> can address some of these issues through adversarial training, they were not initially applicable to discrete data due to non-differentiability of the sampling step. SeqGAN resolved this by framing the generator as an RL agent, but it optimizes only for distributional fidelity (fooling the discriminator) without any mechanism to steer generation toward specific property targets.</p>
<p>In drug discovery, simply generating valid, drug-like molecules is insufficient. Practitioners need to optimize for particular pharmaceutical properties (e.g., solubility, synthesizability, druglikeness) while maintaining structural diversity. Naive RL approaches can optimize properties effectively but tend to collapse onto trivial solutions (e.g., repeating &ldquo;CCCCCCC&rdquo; to maximize solubility). The challenge is to combine the distributional regularization of adversarial training with the goal-directedness of RL.</p>
<h2 id="mixed-reward-interpolating-between-adversarial-and-objective-signals">Mixed Reward: Interpolating Between Adversarial and Objective Signals</h2>
<p>ORGAN&rsquo;s core innovation is a reward function that linearly interpolates between the discriminator score and domain-specific objectives:</p>
<p>$$R(Y_{1:T}) = \lambda \cdot D_{\phi}(Y_{1:T}) + (1 - \lambda) \cdot O_{i}(Y_{1:T})$$</p>
<p>When $\lambda = 1$, the model reduces to SeqGAN (pure adversarial training). When $\lambda = 0$, it becomes naive RL optimizing only the objective. Intermediate values allow the adversarial component to regularize the generator, keeping samples within the distribution while the objective component steers toward desired properties.</p>
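<p>The reward interpolation is simple enough to state directly in code, which makes the two limiting cases explicit:</p>

```python
def organ_reward(d_score: float, objective: float, lam: float) -> float:
    """R(Y) = lambda * D(Y) + (1 - lambda) * O(Y).
    lam=1 recovers SeqGAN; lam=0 is naive objective-only RL."""
    return lam * d_score + (1.0 - lam) * objective

seqgan = organ_reward(0.8, 0.3, lam=1.0)    # discriminator reward only
naive_rl = organ_reward(0.8, 0.3, lam=0.0)  # objective reward only
mixed = organ_reward(0.8, 0.3, lam=0.5)     # the paper's default setting
```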
<p>The generator $G_{\theta}$ is an LSTM-based RNN that produces sequences token-by-token. Training follows the REINFORCE algorithm, where the expected long-term reward is:</p>
<p>$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_{0}, \theta\right] = \sum_{y_{1} \in Y} G_{\theta}(y_{1} \mid s_{0}) \cdot Q(s_{0}, y_{1})$$</p>
<p>For intermediate timesteps (partial sequences), the action-value function $Q$ is estimated via $N$-time Monte Carlo rollouts:</p>
<p>$$Q(Y_{1:t-1}, y_{t}) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} R(Y_{1:T}^{n}), &amp; \text{if } t &lt; T \\ R(Y_{1:T}), &amp; \text{if } t = T \end{cases}$$</p>
<p>where $Y_{1:T}^{n}$ are completions sampled by rolling out the current policy $G_{\theta}$ from state $Y_{1:t}$.</p>
<p>The policy gradient is:</p>
<p>$$\nabla_{\theta} J(\theta) \simeq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_{t} \sim G_{\theta}(y_{t} \mid Y_{1:t-1})} \left[\nabla_{\theta} \log G_{\theta}(y_{t} \mid Y_{1:t-1}) \cdot Q(Y_{1:t-1}, y_{t})\right]$$</p>
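<p>The Monte Carlo rollout estimate of $Q$ can be sketched with a toy sequence problem (tokens are bits, the reward is the fraction of ones, and a uniform policy stands in for $G_\theta$; none of these specifics come from the paper):</p>

```python
import random

def mc_q_value(prefix, policy_rollout, reward_fn, n_rollouts=8, T=10):
    """Estimate Q(prefix) by completing the sequence N times with the current
    policy and averaging terminal rewards; terminal states use the exact reward."""
    if len(prefix) == T:
        return reward_fn(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        total += reward_fn(policy_rollout(prefix, T))
    return total / n_rollouts

rng = random.Random(0)
rollout = lambda p, T: p + [rng.randint(0, 1) for _ in range(T - len(p))]
reward = lambda seq: sum(seq) / len(seq)        # fraction of 1s
q = mc_q_value([1, 1, 1], rollout, reward, n_rollouts=200, T=10)
# Expected value: (3 ones + 7 * 0.5) / 10 = 0.65
```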
<p>Two additional mechanisms improve training:</p>
<ol>
<li><strong>Diversity penalty</strong>: Repeated sequences have their reward divided by their copy count, providing diminishing returns for non-unique outputs.</li>
<li><strong>Wasserstein distance</strong>: The authors also implement a variant (OR(W)GAN) that replaces the standard GAN discriminator loss with the Wasserstein-1 distance via Kantorovich-Rubinstein duality, which can improve training stability and diversity.</li>
</ol>
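<p>The diversity penalty in particular is a simple per-batch bookkeeping step, sketched here with toy SMILES strings and reward values:</p>

```python
from collections import Counter

def diversity_penalized_rewards(samples, rewards):
    """Divide each sequence's reward by its copy count in the batch, so
    duplicated outputs earn diminishing returns."""
    counts = Counter(samples)
    return [r / counts[s] for s, r in zip(samples, rewards)]

batch = ["CCO", "CCO", "CCO", "c1ccccc1"]
raw = [0.9, 0.9, 0.9, 0.6]
penalized = diversity_penalized_rewards(batch, raw)
# The triplicated "CCO" drops to 0.3 each; the unique molecule keeps 0.6.
```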
<h2 id="molecular-and-musical-melody-generation-experiments">Molecular and Musical Melody Generation Experiments</h2>
<h3 id="architecture">Architecture</h3>
<p>The generator $G_{\theta}$ is an RNN with LSTM cells. The discriminator $D_{\phi}$ is a CNN for text classification following Kim (2014), with 75% dropout and L2 regularization. All optimization uses Adam. Molecular metrics are computed with RDKit.</p>
<h3 id="molecular-generation-setup">Molecular Generation Setup</h3>
<p>Training data consists of 5,000 random molecules from the <a href="/notes/chemistry/datasets/qm9/">QM9</a> dataset (134k stable small molecules with up to 9 heavy atoms), encoded as SMILES strings with maximum sequence length 51 and alphabet size 43. Each generator is pre-trained for 250 MLE epochs, with the discriminator trained for 10 epochs. Adversarial/RL training then proceeds for up to 100 additional epochs. The default $\lambda$ is 0.5.</p>
<p>Three molecular objectives are evaluated:</p>
<ul>
<li><strong>Solubility (LogP)</strong>: water-octanol partition coefficient via RDKit&rsquo;s Crippen function</li>
<li><strong>Synthesizability</strong>: SA score estimating ease of synthesis (0 = hard, 1 = easy)</li>
<li><strong>Druglikeness</strong>: QED score capturing medicinal chemistry aesthetics</li>
</ul>
<p>Diversity is measured using average Jaccard distance of molecular fingerprints relative to a random training subset.</p>
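<p>The Jaccard-distance diversity metric is simple to compute. A pure-Python sketch using sets of on-bit indices in place of RDKit fingerprint objects (the paper uses RDKit molecular fingerprints; the averaging helper is hypothetical):</p>

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard (Tanimoto) distance between two fingerprint bit sets."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / union

def avg_diversity(gen_fps, ref_fps):
    """Mean Jaccard distance of generated fingerprints to a reference subset."""
    pairs = [(g, r) for g in gen_fps for r in ref_fps]
    return sum(jaccard_distance(g, r) for g, r in pairs) / len(pairs)
```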
<h3 id="molecular-generation-results">Molecular Generation Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Validity (%)</th>
          <th>Diversity</th>
          <th>Druglikeness</th>
          <th>Synthesizability</th>
          <th>Solubility</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>75.9</td>
          <td>0.64</td>
          <td>0.48 (0%)</td>
          <td>0.23 (0%)</td>
          <td>0.30 (0%)</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>80.3</td>
          <td>0.61</td>
          <td>0.49 (+2%)</td>
          <td>0.25 (+6%)</td>
          <td>0.31 (+3%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>ORGAN</td>
          <td>88.2</td>
          <td>0.55</td>
          <td>0.52 (+8%)</td>
          <td>0.32 (+38%)</td>
          <td>0.35 (+18%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>OR(W)GAN</td>
          <td>85.0</td>
          <td>0.95</td>
          <td>0.60 (+25%)</td>
          <td>0.54 (+130%)</td>
          <td>0.47 (+57%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Naive RL</td>
          <td>97.1</td>
          <td>0.80</td>
          <td>0.57 (+19%)</td>
          <td>0.53 (+126%)</td>
          <td>0.50 (+67%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>ORGAN</td>
          <td>96.5</td>
          <td>0.92</td>
          <td>0.51 (+6%)</td>
          <td>0.83 (+255%)</td>
          <td>0.45 (+52%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>OR(W)GAN</td>
          <td>97.6</td>
          <td>1.00</td>
          <td>0.20 (-59%)</td>
          <td>0.75 (+223%)</td>
          <td>0.84 (+184%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>ORGAN</td>
          <td>94.7</td>
          <td>0.76</td>
          <td>0.50 (+4%)</td>
          <td>0.63 (+171%)</td>
          <td>0.55 (+85%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>OR(W)GAN</td>
          <td>94.1</td>
          <td>0.90</td>
          <td>0.42 (-12%)</td>
          <td>0.66 (+185%)</td>
          <td>0.54 (+81%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>Naive RL</td>
          <td>92.7</td>
          <td>0.75</td>
          <td>0.49 (+3%)</td>
          <td>0.70 (+200%)</td>
          <td>0.78 (+162%)</td>
      </tr>
      <tr>
          <td>All (alternated)</td>
          <td>ORGAN</td>
          <td>96.1</td>
          <td>0.92</td>
          <td>0.52 (+9%)</td>
          <td>0.71 (+206%)</td>
          <td>0.53 (+79%)</td>
      </tr>
  </tbody>
</table>
<p>Key observations: OR(W)GAN consistently achieves higher diversity than standard ORGAN. Naive RL often reaches higher raw objective scores, but at the cost of trivial solutions (e.g., simple atom chains for solubility). Multi-objective training, alternating objectives across epochs, achieves gains comparable to those of individually optimized models.</p>
<h3 id="music-generation-setup">Music Generation Setup</h3>
<p>The music experiments use 1,000 melodies from the EsAC folk dataset, each encoded as a 36-token sequence in which tokens represent sixteenth-note events spanning three octaves (C3-B5). Two metrics are optimized: tonality (the proportion of perfect fifths) and ratio of steps (the proportion of conjunct melodic motion). Diversity is measured as the average pairwise edit distance.</p>
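<p>Both music metrics reduce to counting intervals between consecutive notes. A hedged sketch assuming MIDI-style integer pitches; the paper&rsquo;s exact token handling (e.g., holds and rests) is not reproduced here:</p>

```python
def _intervals(pitches):
    """Absolute semitone intervals between consecutive notes."""
    return [abs(b - a) for a, b in zip(pitches, pitches[1:])]

def tonality(pitches):
    """Fraction of intervals that are perfect fifths (7 semitones)."""
    ivs = _intervals(pitches)
    return sum(i == 7 for i in ivs) / len(ivs) if ivs else 0.0

def ratio_of_steps(pitches):
    """Fraction of intervals that are conjunct (1 or 2 semitones)."""
    ivs = _intervals(pitches)
    return sum(i in (1, 2) for i in ivs) / len(ivs) if ivs else 0.0
```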
<h3 id="music-results">Music Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Diversity</th>
          <th>Tonality</th>
          <th>Ratio of Steps</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>0.221</td>
          <td>0.007</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>0.187</td>
          <td>0.005</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Naive RL</td>
          <td>0.100</td>
          <td>0.478</td>
          <td>2.9E-05</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>ORGAN</td>
          <td>0.268</td>
          <td>0.372</td>
          <td>1.78E-04</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>OR(W)GAN</td>
          <td>0.268</td>
          <td>0.177</td>
          <td>2.4E-04</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Naive RL</td>
          <td>0.321</td>
          <td>0.001</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>ORGAN</td>
          <td>0.433</td>
          <td>0.001</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>OR(W)GAN</td>
          <td>0.134</td>
          <td>5.95E-05</td>
          <td>0.622</td>
      </tr>
  </tbody>
</table>
<p>ORGAN outperforms SeqGAN and MLE on all metrics. Naive RL achieves higher raw scores but with lower diversity, producing simpler, less interesting outputs.</p>
<h2 id="capacity-ceilings-trade-offs-and-future-directions">Capacity Ceilings, Trade-offs, and Future Directions</h2>
<p>The authors identify several limitations and findings:</p>
<p><strong>Capacity ceiling</strong>: GAN-based models tend to generate sequences matching the training set&rsquo;s average length (15.42 characters). RL-only approaches can break this constraint, generating shorter (9.4) or longer (21.3) sequences depending on the objective. The upper bound of optimized properties also matches the training data&rsquo;s maximum, suggesting dataset-dependent limits.</p>
<p><strong>Lambda trade-off</strong>: Varying $\lambda$ reveals an optimal balance between objective optimization and distributional fidelity. This optimum depends on the model, dataset, and metric, suggesting that hyperparameter search over $\lambda$ is important in practice.</p>
<p><strong>Tonality vs. steps inverse relationship</strong>: In the music task, optimizing for tonality (perfect fifths) inherently conflicts with optimizing for step ratios (consecutive notes), since consecutive scale notes do not form perfect fifths.</p>
<p><strong>Limitations</strong>: The paper evaluates on relatively small datasets (5k molecules, 1k melodies) and short sequences. The molecular experiments use QM9 (small molecules with up to 9 heavy atoms), which limits the scope of conclusions for drug-like chemical space. The Wasserstein variant sometimes lags behind the standard GAN loss in raw metric scores, though it offers better diversity.</p>
<p><strong>Future directions</strong>: The authors propose extending ORGAN to non-sequential data (images, audio) by framing GANs as RL problems more broadly, and investigating how different heuristic choices affect performance. They also suggest exploring other discrete GAN formulations (MaliGAN, BGAN) with RL extensions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecular training</td>
          <td>QM9 subset</td>
          <td>5,000 molecules</td>
          <td>Random subset from 134k stable small molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Music training</td>
          <td>EsAC folk dataset</td>
          <td>1,000 melodies</td>
          <td>36-token sequences, processed following Chen et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Generator pre-trained for 250 epochs via MLE; discriminator for 10 epochs</li>
<li>Adversarial/RL training for up to 100 epochs</li>
<li>Default $\lambda = 0.5$ for reward mixing</li>
<li>Monte Carlo rollouts for intermediate reward estimation</li>
<li>Duplicate penalty: reward divided by copy count</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Generator</strong>: RNN with LSTM cells</li>
<li><strong>Discriminator</strong>: CNN for text classification (Kim, 2014) with 75% dropout, L2 regularization</li>
<li><strong>Optimizer</strong>: Adam for all gradient descent steps</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Domain</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (%)</td>
          <td>Fraction of generated SMILES that decode to valid molecules</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Average Jaccard distance of fingerprints to training subset</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Druglikeness (QED)</td>
          <td>Quantitative Estimate of Drug-likeness</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Synthesizability (SA)</td>
          <td>Synthetic accessibility score</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Solubility (LogP)</td>
          <td>Water-octanol partition coefficient</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Proportion of perfect fifths</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Proportion of conjunct melodic intervals</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Diversity (edit)</td>
          <td>Average pairwise edit distance</td>
          <td>Music</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gablg1/ORGAN">ORGAN</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Official implementation including metrics for molecules and music</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guimaraes, G. L., Sánchez-Lengeling, B., Outeiral, C., Farias, P. L. C., &amp; Aspuru-Guzik, A. (2017). Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. <em>arXiv preprint arXiv:1705.10843</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guimaraes2017organ,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guimaraes, Gabriel Lima and Sanchez-Lengeling, Benjamin and Outeiral, Carlos and Farias, Pedro Luis Cunha and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1705.10843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Machine Translation for Reaction Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/</guid><description>Nam and Kim apply a GRU-based seq2seq model with attention to predict organic reaction products from SMILES, pioneering the NMT approach to chemistry.</description><content:encoded><![CDATA[<h2 id="pioneering-seq2seq-translation-for-reaction-prediction">Pioneering Seq2Seq Translation for Reaction Prediction</h2>
<p>This is a <strong>Method</strong> paper. It introduces the idea of applying neural machine translation (NMT) to organic chemistry reaction prediction by framing product prediction as a sequence-to-sequence translation problem from reactant/reagent <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> to product SMILES. This was one of the earliest works to demonstrate that a data-driven encoder-decoder model could predict reaction products without any hand-coded reaction rules or SMARTS transformations.</p>
<h2 id="limitations-of-existing-reaction-prediction-methods">Limitations of Existing Reaction Prediction Methods</h2>
<p>Prior computational approaches to reaction prediction fell into three categories, each with significant drawbacks:</p>
<ol>
<li>
<p><strong>Rule-based methods</strong> (e.g., CAMEO, EROS) relied on manually encoded reaction rules. They performed well on reactions covered by the rules but required continuous manual encoding as new reaction types were discovered. Many older systems became outdated for this reason.</p>
</li>
<li>
<p><strong>Physical calculation methods</strong> computed energies of transition states from plausible reaction pathways using quantum mechanics. While principled, these approaches carried high computational cost. Simplified approaches (ToyChem, ROBIA) traded accuracy for speed.</p>
</li>
<li>
<p><strong>Machine learning methods</strong> at the time either predicted individual mechanistic steps (requiring tree search for multi-step reactions) or classified reaction types and applied SMARTS transformations to generate products. The classification-based approach of Wei et al. still required manual encoding of SMARTS transformations for new reaction types and struggled with ambiguous reaction classes.</p>
</li>
</ol>
<p>The key gap was the absence of a method that could predict reaction products directly from input molecules, learn from data alone, and generalize to new reaction types without manual rule encoding.</p>
<h2 id="core-innovation-reactions-as-machine-translation">Core Innovation: Reactions as Machine Translation</h2>
<p>The central insight is that SMILES strings can be treated as a language with its own grammar: predicting reaction products then becomes a problem of translating &ldquo;reactant and reagent&rdquo; sentences into &ldquo;product&rdquo; sentences.</p>
<p>The model uses a <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a>-based encoder-decoder architecture with attention:</p>
<ul>
<li><strong>Encoder</strong>: 3 layers of GRU cells that process the reversed, tokenized SMILES string of reactants and reagents</li>
<li><strong>Decoder</strong>: 3 layers of GRU cells that generate product SMILES tokens autoregressively</li>
<li><strong>Attention mechanism</strong>: allows the decoder to attend to relevant encoder states at each generation step</li>
<li><strong>Embedding dimension</strong>: 600</li>
<li><strong>Vocabulary</strong>: 311 input tokens (reactants/reagents), 180 output tokens (products)</li>
<li><strong>Bucketed sequences</strong>: four bucket sizes handle variable-length inputs and outputs: (54, 54), (70, 60), (90, 65), (150, 80)</li>
</ul>
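<p>Bucketing pads each example to the smallest bucket that fits it, as in the TensorFlow seq2seq tutorial the paper follows. A minimal sketch of the selection rule (the helper name is hypothetical):</p>

```python
# Bucket sizes (source_len, target_len) from the paper.
BUCKETS = [(54, 54), (70, 60), (90, 65), (150, 80)]

def pick_bucket(src_len, tgt_len, buckets=BUCKETS):
    """Return the smallest bucket that fits both sequence lengths;
    sequences exceeding all buckets are filtered out (None)."""
    for b in buckets:
        if src_len <= b[0] and tgt_len <= b[1]:
            return b
    return None
```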
<p>The SMILES tokenization uses a <a href="https://en.wikipedia.org/wiki/Parsing_expression_grammar">PEG</a>-based parser that splits SMILES strings into atoms, bonds, branching symbols, and ring closure numbers. Input sequences are reversed before feeding to the encoder, following standard practice in NMT at the time.</p>
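<p>A simple regex tokenizer approximates this behavior; the paper&rsquo;s actual PEG grammar may differ in details:</p>

```python
import re

# Approximate SMILES tokenizer: bracket atoms first, then two-letter
# elements, ring closures, and finally single-character tokens.
_SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|\d|[A-Za-z]|[-=#$:/\\().+])"
)

def tokenize_smiles(smiles):
    tokens = _SMILES_TOKEN.findall(smiles)
    # Sanity check: tokens must reconstruct the input exactly.
    assert "".join(tokens) == smiles, "unrecognized character in SMILES"
    return tokens
```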
<p>The translation objective finds the product sequence $\mathbf{y}$ that maximizes the conditional probability:</p>
<p>$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x})$$</p>
<p>where $\mathbf{x}$ is the tokenized reactant/reagent sequence and $T$ is the product sequence length.</p>
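<p>Decoding under this factorization picks tokens step by step. A minimal greedy sketch (beam search is the usual alternative; <code>step_fn</code> is a hypothetical callable returning next-token probabilities, and the 80-token cap mirrors the largest product bucket):</p>

```python
def greedy_decode(step_fn, x, eos, max_len=80):
    """Greedy argmax decoding of p(y | x) factorized over timesteps.

    step_fn : callable(x, prefix) -> dict mapping token -> probability
    """
    y = []
    for _ in range(max_len):
        probs = step_fn(x, y)
        tok = max(probs, key=probs.get)   # most probable next token
        y.append(tok)
        if tok == eos:
            break
    return y
```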
<h2 id="training-data-and-experimental-evaluation">Training Data and Experimental Evaluation</h2>
<h3 id="training-sets">Training Sets</h3>
<p>Two training sets were constructed:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Patent reactions (&ldquo;real&rdquo;)</td>
          <td style="text-align: left">1,094,235</td>
          <td style="text-align: left">USPTO patent applications (2001-2013), filtered by length</td>
      </tr>
      <tr>
          <td style="text-align: left">Generated reactions (&ldquo;gen&rdquo;)</td>
          <td style="text-align: left">865,118</td>
          <td style="text-align: left">75 reaction types from Wade&rsquo;s organic chemistry textbook, applied to <a href="/notes/chemistry/datasets/gdb-11/">GDB-11</a> molecules (1-10 atoms)</td>
      </tr>
  </tbody>
</table>
<p>The &ldquo;real&rdquo; set was filtered to exclude reactions with reactant/reagent strings longer than 150 characters, product strings longer than 80 characters, or more than four products. The &ldquo;gen&rdquo; set was constructed by applying reaction templates (encoded as SMARTS) to small molecules from GDB-11, covering five substrate types: acid derivatives, alcohols, aldehydes/ketones, alkenes, and alkynes.</p>
<p>Two models were compared: a &ldquo;gen&rdquo; model (trained only on generated reactions) and a &ldquo;real+gen&rdquo; model (trained on both sets).</p>
<h3 id="textbook-problem-evaluation">Textbook Problem Evaluation</h3>
<p>The models were tested on 10 problem sets from Wade&rsquo;s textbook, following the evaluation approach of Wei et al. Each problem set contained 6-15 reactions. Evaluation metrics included the ratio of fully correct predictions and the average <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between Morgan fingerprints of predicted and actual products.</p>
<p>The &ldquo;real+gen&rdquo; model outperformed the &ldquo;gen&rdquo; model on most problem sets. On problem set 17-44 (aromatic compound reactions, only present in the &ldquo;real&rdquo; training set), the &ldquo;real+gen&rdquo; model correctly answered 4 out of 11 problems while the &ldquo;gen&rdquo; model answered 2. The &ldquo;gen&rdquo; model&rsquo;s ability to correctly predict some aromatic reactions despite never being trained on them suggests the model can extrapolate to unseen reaction patterns.</p>
<p>For <a href="https://en.wikipedia.org/wiki/Diels%E2%80%93Alder_reaction">Diels-Alder reactions</a> (problem set 15-30), neither model achieved fully correct predictions for all problems, though the &ldquo;real+gen&rdquo; model showed better Tanimoto scores, indicating partially correct structural predictions even when the exact product was missed.</p>
<h3 id="scalability-testing">Scalability Testing</h3>
<p>A scalability test used generated reactions with substrate molecules containing 11-16 atoms (larger than the training set molecules with fewer than 11 atoms). Results showed:</p>
<ul>
<li>The &ldquo;real+gen&rdquo; model maintained Tanimoto scores around 0.7 and error rates around 0.4 as substrate atom count increased</li>
<li>The ratio of fully correct predictions decreased as atom count increased, revealing that the recurrent network struggled with longer input sequences</li>
<li>The &ldquo;real+gen&rdquo; model produced fewer invalid SMILES strings than the &ldquo;gen&rdquo; model, likely because training on more reactions improved the decoder&rsquo;s ability to generate syntactically valid SMILES</li>
</ul>
<h3 id="attention-analysis">Attention Analysis</h3>
<p>Visualization of attention weights revealed a limitation: the decoder cells predominantly attended to the first few encoder cells rather than distributing attention across the full input sequence. This means the attention mechanism was not learning meaningful &ldquo;alignment&rdquo; between reactant atoms and product atoms. The authors note that if decoder cells generating tokens for unreactive sites could attend to the corresponding encoder cells (analogous to atom mapping), prediction quality on longer sequences could improve.</p>
<h3 id="token-embedding-analysis">Token Embedding Analysis</h3>
<p>t-SNE visualization of the learned token embeddings showed that encoder and decoder tokens clustered primarily by syntactic similarity rather than chemical properties. The model did not learn chemically meaningful embeddings, which the authors identify as an area for future improvement.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>Treating reaction prediction as NMT is viable: the seq2seq model can predict products without any hand-coded rules</li>
<li>Training on real patent data significantly improves prediction over generated data alone</li>
<li>The model can extrapolate to reaction types not seen during training (e.g., the &ldquo;gen&rdquo; model predicting aromatic reactions)</li>
<li>Compared to the fingerprint-based approach of Wei et al., this method performed better on textbook problems and eliminated the need for manual SMARTS encoding</li>
</ul>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Invalid SMILES generation</strong>: the token-by-token generation process can produce syntactically invalid SMILES (e.g., mismatched parentheses), which the authors scored as zero</li>
<li><strong>Sequence length degradation</strong>: prediction accuracy dropped for longer SMILES strings, a known limitation of RNN-based seq2seq models at the time</li>
<li><strong>Poor attention alignment</strong>: attention weights collapsed to the first encoder positions rather than learning meaningful reactant-product correspondences</li>
<li><strong>Chemically naive embeddings</strong>: token embeddings did not capture chemical properties</li>
<li><strong>Multiple reaction pathways</strong>: reactions with competing pathways (e.g., substitution vs. elimination) were difficult for the model to handle</li>
</ul>
<h3 id="historical-significance">Historical Significance</h3>
<p>This paper is historically significant as one of the first (alongside concurrent work) to propose the NMT framing for reaction prediction. This framing was later adopted and refined by the <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> (Schwaller et al., 2019), which replaced GRUs with the Transformer architecture and achieved over 90% top-1 accuracy on standard benchmarks. The conceptual contribution of treating SMILES-to-SMILES translation as machine translation became the foundation of an entire subfield.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Training (real)</td>
          <td style="text-align: left">USPTO patent reactions</td>
          <td style="text-align: left">1,094,235</td>
          <td style="text-align: left">2001-2013 applications, filtered by length</td>
      </tr>
      <tr>
          <td style="text-align: left">Training (gen)</td>
          <td style="text-align: left">Generated from Wade textbook templates</td>
          <td style="text-align: left">865,118</td>
          <td style="text-align: left">75 reaction types, GDB-11 substrates</td>
      </tr>
      <tr>
          <td style="text-align: left">Testing (textbook)</td>
          <td style="text-align: left">Wade textbook problems</td>
          <td style="text-align: left">~100</td>
          <td style="text-align: left">10 problem sets, 6-15 reactions each</td>
      </tr>
      <tr>
          <td style="text-align: left">Testing (scalability)</td>
          <td style="text-align: left">Generated from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td style="text-align: left">2,400</td>
          <td style="text-align: left">400 per atom count (11-16)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GRU-based encoder-decoder with attention mechanism</li>
<li>PEG-based SMILES tokenizer</li>
<li>Input sequence reversal</li>
<li>Bucketed training with four bucket sizes</li>
<li>TensorFlow seq2seq tutorial implementation with default learning rate</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">GRU layers</td>
          <td style="text-align: left">3</td>
      </tr>
      <tr>
          <td style="text-align: left">Embedding size</td>
          <td style="text-align: left">600</td>
      </tr>
      <tr>
          <td style="text-align: left">Input vocabulary</td>
          <td style="text-align: left">311 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left">Output vocabulary</td>
          <td style="text-align: left">180 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left">Buckets</td>
          <td style="text-align: left">(54,54), (70,60), (90,65), (150,80)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">gen Model</th>
          <th style="text-align: left">real+gen Model</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Textbook correct ratio</td>
          <td style="text-align: left">Variable by set</td>
          <td style="text-align: left">Higher on most sets</td>
          <td style="text-align: left">10 problem sets</td>
      </tr>
      <tr>
          <td style="text-align: left">Average Tanimoto similarity</td>
          <td style="text-align: left">Variable</td>
          <td style="text-align: left">~0.7 on scalability test</td>
          <td style="text-align: left">Morgan fingerprint based</td>
      </tr>
      <tr>
          <td style="text-align: left">Invalid SMILES ratio</td>
          <td style="text-align: left">Higher</td>
          <td style="text-align: left">~0.4 on scalability test</td>
          <td style="text-align: left">Decreases with more training data</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nam, J. &amp; Kim, J. (2016). Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions. <em>arXiv preprint</em>, arXiv:1612.09529. <a href="https://arxiv.org/abs/1612.09529">https://arxiv.org/abs/1612.09529</a></p>
<p><strong>Publication</strong>: arXiv preprint 2016</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{nam2016linking,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nam, Juno and Kim, Jurae}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1612.09529}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1612.09529}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoMu: Bridging Molecular Graphs and Natural Language</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/</guid><description>MoMu bridges molecular graphs and natural language via contrastive pre-training, enabling cross-modal retrieval, captioning, and property prediction.</description><content:encoded><![CDATA[<h2 id="bridging-molecular-graphs-and-natural-language-through-contrastive-learning">Bridging Molecular Graphs and Natural Language Through Contrastive Learning</h2>
<p>MoMu (Molecular Multimodal foundation model) is a <strong>Method</strong> paper that proposes a multimodal pre-training approach to associate molecular graphs with natural language descriptions. The primary contribution is a dual-encoder architecture, consisting of a Graph Isomorphism Network (GIN) for molecular graphs and a BERT-based text encoder, jointly trained through contrastive learning on weakly-correlated graph-text pairs collected from scientific literature. The pre-trained model supports four downstream capabilities: cross-modal retrieval (graph-to-text and text-to-graph), molecule captioning, zero-shot text-to-graph molecule generation, and molecular property prediction.</p>
<h2 id="why-single-modality-models-are-insufficient-for-molecular-understanding">Why Single-Modality Models Are Insufficient for Molecular Understanding</h2>
<p>Existing AI models for molecular tasks generally operate on a single modality and learn a single cognitive ability. Language-based models process <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings or natural language texts and handle tasks like property prediction from strings, literature comprehension, or SMILES-based generation. Graph-based models use molecular graph representations and handle graph-level property prediction or graph generation. Neither category connects structural information from molecular graphs with the rich semantic knowledge encoded in scientific texts.</p>
<p>Prior work by Zeng et al. (KV-PLM) jointly modeled molecule-related texts and SMILES strings, but SMILES representations have inherent drawbacks: they are one-dimensional and may lose structural information, they cannot capture structural similarities between molecules, and a single molecule can have multiple valid SMILES representations. Molecular graphs, by contrast, are more intuitive and better reveal functional structures. Human experts learn molecular knowledge by associating both graphical representations and textual descriptions, yet no prior model bridged these two modalities directly.</p>
<p>The key challenge is the scarcity of paired molecular graph-text data compared to general image-text datasets. Additionally, learning specialized molecular knowledge requires foundational cognitive abilities in both the graph and text domains, making training from scratch infeasible with limited data.</p>
<h2 id="contrastive-pre-training-with-inter-modal-and-intra-modal-objectives">Contrastive Pre-Training with Inter-Modal and Intra-Modal Objectives</h2>
<p>MoMu consists of two encoders initialized from pre-trained unimodal models: a GIN graph encoder initialized from GraphCL self-supervised weights, and a BERT text encoder initialized from either Sci-BERT (yielding MoMu-S) or KV-PLM (yielding MoMu-K).</p>
<h3 id="data-collection">Data Collection</h3>
<p>The authors collect 15,613 molecular graph-document pairs by:</p>
<ol>
<li>Gathering names, synonyms, and SMILES for the top 50K compounds in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></li>
<li>Converting SMILES to molecular graphs using the OGB <code>smiles2graph</code> function</li>
<li>Retrieving related text from the S2ORC corpus (136M+ papers) by querying with molecule names, filtering to Medicine, Biology, Chemistry, and Computer Science fields</li>
<li>Restricting retrieval to abstract, introduction, and conclusion sections to avoid experimental data artifacts</li>
</ol>
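<p>The retrieval step (step 3) can be sketched with toy data; the dictionaries, field names, and helper below are illustrative stand-ins, not the authors' pipeline:</p>

```python
def retrieve_texts(compound_names, paragraphs, allowed_fields):
    """Pair each compound with paragraphs that mention any of its names,
    keeping only the allowed fields and the abstract/intro/conclusion
    sections, as described above."""
    pairs = {}
    for cid, names in compound_names.items():
        matches = [
            p["text"]
            for p in paragraphs
            if p["field"] in allowed_fields
            and p["section"] in {"abstract", "introduction", "conclusion"}
            and any(n.lower() in p["text"].lower() for n in names)
        ]
        if matches:
            pairs[cid] = matches
    return pairs

# toy compound list and corpus (stand-ins for PubChem names and S2ORC)
compounds = {"cid1": ["aspirin", "acetylsalicylic acid"]}
corpus = [
    {"text": "Aspirin inhibits COX enzymes.", "field": "Medicine", "section": "abstract"},
    {"text": "Aspirin tablets, 100 mg.", "field": "Medicine", "section": "methods"},
]
pairs = retrieve_texts(compounds, corpus,
                       {"Medicine", "Biology", "Chemistry", "Computer Science"})
```

<p>The methods-section paragraph is filtered out, mirroring the restriction to abstract, introduction, and conclusion sections.</p>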
<h3 id="contrastive-training-objective">Contrastive Training Objective</h3>
<p>For each graph-text pair in a mini-batch of $N$ pairs, MoMu applies two graph augmentations (node dropping and subgraph extraction) to create two augmented graphs, and randomly samples two sentences from the document. This produces $2N$ graph representations $\{z_1^G, \tilde{z}_1^G, \ldots, z_N^G, \tilde{z}_N^G\}$ and $2N$ text representations $\{z_1^T, \tilde{z}_1^T, \ldots, z_N^T, \tilde{z}_N^T\}$.</p>
<p>The cross-modal contrastive loss for a pair $(z_i^G, z_i^T)$ is:</p>
<p>$$
\ell_i^{(z_i^G, z_i^T)} = -\log \frac{\exp(\text{sim}(z_i^G, z_i^T) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, z_j^T) / \tau)}
$$</p>
<p>where $\tau$ is the temperature parameter and $\text{sim}(\cdot, \cdot)$ projects both representations into a shared 256-dimensional space before computing cosine similarity. The total cross-modal loss includes four contrastive terms for each pair: $(z_i^G, z_i^T)$, $(\tilde{z}_i^G, z_i^T)$, $(z_i^G, \tilde{z}_i^T)$, and $(\tilde{z}_i^G, \tilde{z}_i^T)$.</p>
<p>An intra-modal graph contrastive loss further strengthens the graph encoder:</p>
<p>$$
\ell_i^{(z_i^G, \tilde{z}_i^G)} = -\log \frac{\exp(\text{sim}(z_i^G, \tilde{z}_i^G) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, \tilde{z}_j^G) / \tau)}
$$</p>
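<p>A minimal NumPy sketch of these InfoNCE terms; the batch size, dimensionality, and toy embeddings are illustrative, not the authors' implementation:</p>

```python
import numpy as np

def info_nce(anchors, targets, tau=0.1):
    """Per-row InfoNCE: row i of `anchors` matches row i of `targets`;
    all other rows of `targets` serve as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ t.T / tau                          # N x N cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # -log softmax on the diagonal

rng = np.random.default_rng(0)
z_g = rng.normal(size=(8, 256))   # graph projections in the 256-dim shared space
z_t = rng.normal(size=(8, 256))   # text projections
z_g_aug = z_g + 0.01 * rng.normal(size=z_g.shape)  # augmented-graph views

cross_modal = info_nce(z_g, z_t)       # one of the four cross-modal terms
intra_modal = info_nce(z_g, z_g_aug)   # the intra-modal graph-graph term
```

<p>In the paper each pair contributes four cross-modal terms (over the original and augmented views) plus the intra-modal graph term; the sketch computes one representative of each.</p>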
<h3 id="zero-shot-text-to-graph-generation">Zero-Shot Text-to-Graph Generation</h3>
<p>MoMu enables a zero-shot generation pipeline by combining the pre-trained MoMu encoders with MoFlow, a flow-based molecular generator. Given an input text description $x^T$, the method:</p>
<ol>
<li>Samples a latent variable $q$ from MoFlow&rsquo;s Gaussian prior $P(q)$</li>
<li>Generates a molecular graph through MoFlow&rsquo;s reverse flows: $\hat{E} = f_g^{-1}(q_e)$ and $\hat{V} = f_c^{-1}(q_v \mid GN(\hat{E}))$</li>
<li>Feeds $\hat{V}$ (using soft atom type probabilities instead of hard assignments) into MoMu&rsquo;s graph encoder</li>
<li>Optimizes $q$ to maximize the cosine similarity between the resulting graph and text representations:</li>
</ol>
<p>$$
\ell_q = -\text{sim}(z^G, z^T) / \tau
$$</p>
<p>All MoMu and MoFlow parameters are frozen; only $q$ is updated via Adam for up to 500 iterations. The final molecule is obtained by applying argmax to the optimized probability matrices $\hat{V}$ and $\hat{E}$.</p>
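<p>The frozen-network optimization loop can be sketched as follows. Here a single fixed linear map <code>M</code> is a toy stand-in for MoFlow's decoder composed with MoMu's graph encoder, and plain gradient ascent (with an analytic cosine gradient) replaces Adam; only the latent <code>q</code> is updated, as in the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(64, 16))     # frozen stand-in for decode-then-encode
z_t = rng.normal(size=64)         # text representation of the prompt
q = rng.normal(size=16)           # latent sampled from the prior

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

init_sim = cosine(M @ q, z_t)
lr = 0.1
for _ in range(500):                        # up to 500 iterations, as in the paper
    u = M @ q                               # "graph" representation for current q
    nu, nv = np.linalg.norm(u), np.linalg.norm(z_t)
    grad_u = z_t / (nu * nv) - (u @ z_t) * u / (nu**3 * nv)   # d cos(u, z_t) / d u
    q += lr * (M.T @ grad_u)                # chain rule through the linear map
final_sim = cosine(M @ q, z_t)
```

<p>After optimization the similarity exceeds its initial value; in the real method the decoded probability matrices are finally discretized via argmax.</p>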
<h2 id="evaluation-across-four-downstream-tasks">Evaluation Across Four Downstream Tasks</h2>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>MoMu is evaluated on the PCdes dataset (15K SMILES-description pairs from PubChem, split 10,500/1,500/3,000 for train/val/test). Retrieval is performed in mini-batches of 64 pairs, reporting top-1 accuracy and Recall@20.</p>
<p><strong>Graph-to-Text Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.38</td>
          <td>62.11</td>
          <td>62.57</td>
          <td>60.67</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>53.79</td>
          <td>66.63</td>
          <td>64.81</td>
          <td>63.87</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.92</td>
          <td>68.59</td>
          <td>77.92</td>
          <td>75.93</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>58.64</td>
          <td>80.59</td>
          <td>80.62</td>
          <td>79.11</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>58.74</td>
          <td>81.29</td>
          <td>81.09</td>
          <td>80.15</td>
      </tr>
  </tbody>
</table>
<p><strong>Text-to-Graph Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.12</td>
          <td>68.02</td>
          <td>61.75</td>
          <td>60.77</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>54.22</td>
          <td>71.80</td>
          <td>64.95</td>
          <td>64.27</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.61</td>
          <td>74.77</td>
          <td>77.03</td>
          <td>75.47</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>55.44</td>
          <td>76.92</td>
          <td>80.22</td>
          <td>79.02</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>54.94</td>
          <td>78.29</td>
          <td>81.45</td>
          <td>80.62</td>
      </tr>
  </tbody>
</table>
<p>In zero-shot retrieval (on a separate test set of 5,562 pairs not seen during pre-training), MoMu achieves approximately 39-46% accuracy compared to below 2% for Sci-BERT and KV-PLM, demonstrating strong generalization.</p>
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>MoMu&rsquo;s graph features are appended to MolT5&rsquo;s encoder inputs through a learned MLP mapping module on the ChEBI-20 dataset. Results show improvements in BLEU, METEOR, and Text2Mol scores when incorporating graph features, though ROUGE-L slightly drops. The graph structural information leads to more accurate captions for complex molecular structures.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The pre-trained graph encoder from MoMu is fine-tuned on eight <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> datasets using scaffold splitting and ROC-AUC evaluation (10 runs).</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>No Pre-Train</th>
          <th>GraphCL</th>
          <th>MoMu-S</th>
          <th>MoMu-K</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>65.8</td>
          <td>69.7</td>
          <td><strong>70.5</strong></td>
          <td>70.1</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>74.0</td>
          <td>73.9</td>
          <td>75.6</td>
          <td>75.6</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>63.4</td>
          <td>62.4</td>
          <td>63.4</td>
          <td>63.0</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>57.3</td>
          <td>60.5</td>
          <td>60.5</td>
          <td>60.4</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>58.0</td>
          <td>76.0</td>
          <td><strong>79.9</strong></td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>71.8</td>
          <td>69.8</td>
          <td>70.5</td>
          <td>71.1</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>75.3</td>
          <td><strong>78.5</strong></td>
          <td>75.9</td>
          <td>76.2</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>70.1</td>
          <td>75.4</td>
          <td>76.7</td>
          <td>77.1</td>
      </tr>
      <tr>
          <td><strong>Average</strong></td>
          <td>66.96</td>
          <td>70.78</td>
          <td><strong>71.63</strong></td>
          <td>71.36</td>
      </tr>
  </tbody>
</table>
<p>MoMu-S achieves the best average ROC-AUC (71.63%) across all eight datasets, outperforming GraphCL (70.78%), the self-supervised method used to initialize MoMu&rsquo;s graph encoder. MoMu outperforms GraphCL on six of eight datasets. Notably, MoMu-S and MoMu-K perform comparably, indicating that KV-PLM&rsquo;s SMILES-based knowledge does not transfer well to graph-based representations.</p>
<h3 id="zero-shot-text-to-graph-generation-1">Zero-Shot Text-to-Graph Generation</h3>
<p>The method generates molecules from three types of text descriptions:</p>
<ol>
<li><strong>High-level vague descriptions</strong> (e.g., &ldquo;The molecule is beautiful&rdquo;): MoMu generates diverse, interpretable molecules where &ldquo;beautiful&rdquo; tends to produce locally symmetric and stretched graphs, &ldquo;versatile&rdquo; produces molecules with varied elements and functional groups, and &ldquo;strange&rdquo; produces cluttered, irregular structures.</li>
<li><strong>Functional descriptions</strong> (e.g., &ldquo;fluorescent molecules&rdquo;, &ldquo;high water solubility and barrier permeability with low toxicity&rdquo;): MoMu successfully generates molecules with appropriate functional groups and properties. For the solubility/permeability/toxicity query, MoMu generates molecules that satisfy all three evaluable properties.</li>
<li><strong>Structural descriptions</strong> (e.g., &ldquo;molecules containing <a href="https://en.wikipedia.org/wiki/Nucleophile">nucleophilic</a> groups&rdquo;): MoMu generates diverse molecules with appropriate functional groups (amino, hydroxyl, carbonyl, halogen atoms).</li>
</ol>
<h2 id="promising-multimodal-transfer-with-clear-data-limitations">Promising Multimodal Transfer with Clear Data Limitations</h2>
<p>MoMu demonstrates that contrastive pre-training on weakly-correlated graph-text data can bridge molecular graphs and natural language in a shared representation space. The key findings are:</p>
<ol>
<li><strong>Cross-modal alignment works with limited data</strong>: With only 15K graph-text pairs (far fewer than the millions used in vision-language models like CLIP), MoMu achieves meaningful cross-modal retrieval and enables zero-shot generation.</li>
<li><strong>Multimodal supervision improves graph representations</strong>: The graph encoder supervised by text descriptions outperforms self-supervised methods (GraphCL, AttrMasking, ContextPred) on average across molecular property prediction benchmarks.</li>
<li><strong>SMILES knowledge does not transfer to graphs</strong>: MoMu-S and MoMu-K perform comparably across all tasks, showing that structural information learned from one-dimensional SMILES strings does not readily generalize to graph neural networks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several important limitations:</p>
<ul>
<li><strong>Data scarcity</strong>: 15K graph-text pairs is substantially smaller than general image-text datasets, potentially leaving the common space insufficiently aligned.</li>
<li><strong>Noisy supervision</strong>: Retrieved texts may mention a molecule by name without describing its properties or structure, leading to spurious correlations.</li>
<li><strong>Generator constraints</strong>: The zero-shot generation method is limited by MoFlow&rsquo;s capacity (maximum 38 atoms, 9 element types from ZINC250K training).</li>
<li><strong>Property coverage</strong>: Generation quality degrades for molecular properties that appear infrequently or not at all in the training texts.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose four avenues: (1) collecting larger-scale multimodal molecular data including 3D conformations, (2) using strongly-correlated paired data with more advanced generators, (3) developing interpretable tools for the learned cross-modal space, and (4) wet-lab validation of generated molecules.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Collected graph-text pairs (PubChem + S2ORC)</td>
          <td>15,613 pairs</td>
          <td>~37M paragraphs total; top 50K PubChem compounds</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>15K pairs (10.5K/1.5K/3K split)</td>
          <td>SMILES-description pairs from PubChem</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>ChEBI-20</td>
          <td>~33K pairs</td>
          <td>Used with MolT5</td>
      </tr>
      <tr>
          <td>Text-to-graph generation</td>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC250K</a> (MoFlow)</td>
          <td>250K molecules</td>
          <td>Pre-trained generator, max 38 atoms</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>Varies</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Graph augmentations</strong>: Node dropping (10% ratio) and subgraph extraction (80% of original size via random walk)</li>
<li><strong>Contrastive learning</strong>: InfoNCE loss with temperature $\tau = 0.1$, following the DeClip paradigm with both inter-modal and intra-modal objectives</li>
<li><strong>Zero-shot generation</strong>: Adam optimizer on latent variable $q$ for up to 500 iterations; formal charges prohibited in output</li>
</ul>
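<p>The two graph augmentations can be sketched in plain Python on an edge-list graph; the ratios match the paper, while the function signatures and the toy ring graph are illustrative:</p>

```python
import random

def node_drop(nodes, edges, ratio=0.1, rng=random.Random(0)):
    """Drop ~`ratio` of the nodes and all incident edges (paper uses 0.1)."""
    k = max(1, int(len(nodes) * ratio))
    dropped = set(rng.sample(sorted(nodes), k))
    kept = [n for n in nodes if n not in dropped]
    return kept, [(u, v) for (u, v) in edges if u not in dropped and v not in dropped]

def random_walk_subgraph(nodes, edges, keep=0.8, rng=random.Random(0)):
    """Grow a connected subgraph by random walk until ~`keep` of the nodes
    are covered (paper keeps 80% of the original size)."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    target = max(1, int(len(nodes) * keep))
    cur = rng.choice(sorted(nodes))
    visited = {cur}
    while len(visited) < target:
        nbrs = adj[cur]
        if not nbrs:
            break                     # walk is stuck on an isolated node
        cur = rng.choice(nbrs)
        visited.add(cur)
    return sorted(visited), [(u, v) for (u, v) in edges if u in visited and v in visited]

# toy 10-node ring graph
nodes = list(range(10))
edges = [(i, (i + 1) % 10) for i in range(10)]
kept_nodes, kept_edges = node_drop(nodes, edges)
sub_nodes, sub_edges = random_walk_subgraph(nodes, edges)
```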
<h3 id="models">Models</h3>
<ul>
<li><strong>Graph encoder</strong>: GIN with 5 layers, 300-dimensional hidden size, initialized from GraphCL checkpoint</li>
<li><strong>Text encoder</strong>: BERT-base (768 hidden size), initialized from Sci-BERT or KV-PLM</li>
<li><strong>Projection heads</strong>: Two MLPs projecting graph (300-dim) and text (768-dim) features to 256-dimensional shared space</li>
<li><strong>Optimizer</strong>: AdamW, learning rate 0.0001, weight decay 1e-5, 300 epochs, batch size 256</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Best Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>G-T Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.09 / 80.15 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>T-G Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.45 / 80.62 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>Zero-shot G-T Retrieval</td>
          <td>Accuracy</td>
          <td>~46%</td>
          <td>vs. ~1.4% for baselines</td>
      </tr>
      <tr>
          <td>Property Prediction</td>
          <td>ROC-AUC (avg)</td>
          <td>71.63%</td>
          <td>MoMu-S, 8 MoleculeNet datasets</td>
      </tr>
      <tr>
          <td>Molecule Captioning</td>
          <td>Text2Mol</td>
          <td>Improved over MolT5</td>
          <td>MoMu + MolT5-large</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x NVIDIA Tesla V100 PCIe 32GB GPUs</li>
<li>Framework: PyTorch</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BingSu12/MoMu">MoMu code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Pre-training and downstream task code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/yangzhao1230/GraphTextRetrieval">GraphTextRetrieval</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Data collection and cross-modal retrieval code</td>
      </tr>
      <tr>
          <td><a href="https://pan.baidu.com/s/1aHJoYTTZWDHPCcRuu9I7Fg">Pre-training dataset</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Hosted on Baidu Pan (Chinese cloud storage)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Su, B., Du, D., Yang, Z., Zhou, Y., Li, J., Rao, A., Sun, H., Lu, Z., &amp; Wen, J.-R. (2022). A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language. arXiv preprint arXiv:2209.05481.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{su2022momu,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Su, Bing and Du, Dazhao and Yang, Zhao and Zhou, Yujie and Li, Jiangmeng and Rao, Anyi and Sun, Hao and Lu, Zhiwu and Wen, Ji-Rong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2209.05481}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolFM: Trimodal Molecular Foundation Pre-training</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/molfm-multimodal-molecular-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/molfm-multimodal-molecular-foundation/</guid><description>MolFM fuses molecular graphs, biomedical text, and knowledge graphs via cross-modal attention for joint molecular representation learning.</description><content:encoded><![CDATA[<h2 id="trimodal-pre-training-for-molecular-understanding">Trimodal Pre-training for Molecular Understanding</h2>
<p>MolFM is a <strong>Method</strong> paper that introduces a multimodal molecular foundation model integrating three distinct sources of molecular knowledge: 2D molecular graphs, biomedical text, and knowledge graphs. The primary contribution is a pre-training framework that uses fine-grained cross-modal attention to fuse information across all three modalities, combined with theoretical justification from a deep metric learning perspective. MolFM achieves the best reported results (at time of publication) on cross-modal retrieval, molecule captioning, text-based molecule generation, and molecular property prediction.</p>
<h2 id="why-existing-molecular-models-fall-short">Why Existing Molecular Models Fall Short</h2>
<p>Prior multimodal molecular foundation models operate on at most two modalities (structures and text) and suffer from two key limitations. First, generative approaches like KV-PLM and MolT5 rely on 1D <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, which cannot capture complex topological and spatial molecular properties such as macrocycles. Contrastive approaches like <a href="/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/">MoMu</a> and MoleculeSTM learn global alignment between molecule graphs and text but overlook fine-grained connections between specific substructures and textual descriptions.</p>
<p>Second, and more fundamentally, no prior model incorporates <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a> as a third modality. Knowledge graphs encode global-level relationships among molecules, target ligands, diseases, and other biomedical entities. These relationships capture functional and structural similarity patterns that cannot be learned from individual molecule-text pairs alone. MolFM addresses both gaps by introducing cross-modal attention across all three modalities and providing theoretical guarantees about what the pre-training objectives learn.</p>
<h2 id="cross-modal-attention-and-metric-learning-guarantees">Cross-Modal Attention and Metric Learning Guarantees</h2>
<h3 id="architecture">Architecture</h3>
<p>MolFM uses three pre-trained single-modal encoders:</p>
<ul>
<li><strong>Molecular graph encoder</strong>: A 5-layer GIN (1.8M parameters) initialized from GraphMVP, producing atom-level features $h_{SA}$ and a graph-level feature $h_{SM}$</li>
<li><strong>Text encoder</strong>: A 6-layer transformer (61.8M parameters) initialized from KV-PLM&rsquo;s first 6 layers, producing token features $h_T$</li>
<li><strong>Knowledge graph encoder</strong>: A TransE model (12.6M parameters) trained on the knowledge graph for 500 epochs, producing entity features $h_K$</li>
</ul>
<p>A multimodal encoder (61.8M parameters, 6 transformer layers with cross-attention) fuses the three modalities. The cross-attention uses text token features as queries and the concatenation of atom features and knowledge graph neighbor features as keys and values. For each molecule, the knowledge graph input is the molecule&rsquo;s entity and $N=4$ randomly sampled one-hop neighbors.</p>
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>MolFM combines four losses:</p>
<p><strong>Structure-text contrastive (STC)</strong> aligns the global feature spaces of structure and text encoders using a symmetric InfoNCE loss:</p>
<p>$$\mathcal{L}_{stc} = -\frac{1}{2} \left[ \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{S' \in B} \exp(s(z_{S'}, z_T) / \tau)} + \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{T' \in B} \exp(s(z_S, z_{T'}) / \tau)} \right]$$</p>
<p>where $s(\cdot, \cdot)$ is cosine similarity and $\tau = 0.1$ is a temperature parameter.</p>
<p><strong>Cross-modal matching (CMM)</strong> predicts whether a structure-text-knowledge triplet corresponds to the same molecule, using cross-entropy over the multimodal encoder&rsquo;s CLS token:</p>
<p>$$\mathcal{L}_{cmm} = \sum_{(\tilde{S}, \tilde{T}, \tilde{K}) \in \tilde{B}} H\left[y_{cmm}(\tilde{S}, \tilde{T}, \tilde{K}),\; p_{cmm}\left(\mathcal{M}_\theta(h_{\tilde{S}}, h_{\tilde{T}}, h_{\tilde{K}})\right)\right]$$</p>
<p><strong>Masked language modeling (MLM)</strong> predicts masked text tokens conditioned on all three modalities:</p>
<p>$$\mathcal{L}_{mlm} = H\left[y_{mlm}(\hat{T}),\; p_{mlm}\left(\mathcal{M}_\theta(h_S, h_{\hat{T}}, h_K)\right)\right]$$</p>
<p><strong>Knowledge graph embedding (KGE)</strong> regularizes entity embeddings with a max-margin TransE loss:</p>
<p>$$\mathcal{L}_{kge} = \sum_{h \in K} \left[\max(0, d(h,r,t) - d(h,r,\tilde{t}) + \Delta) + \max(0, d(h,r,t) - d(\tilde{h},r,t) + \Delta)\right]$$</p>
<p>where $d(h,r,t) = \| f(h) + g(r) - f(t) \|_2$ and $\Delta = 0.2$.</p>
<p>The total pre-training loss is:</p>
<p>$$\mathcal{L} = \mathbb{E}_{(S,T,K)}\left[\mathcal{L}_{stc} + \mathcal{L}_{cmm} + \mathcal{L}_{mlm} + \mathcal{L}_{kge}\right]$$</p>
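<p>Of the four terms, the KGE loss is the most self-contained to sketch. Below is a minimal NumPy version of the max-margin TransE loss for one triplet with head- and tail-corrupted negatives; the toy embeddings are illustrative, and a perfectly satisfied triplet with distant negatives incurs zero loss:</p>

```python
import numpy as np

def transe_loss(f_h, g_r, f_t, f_h_neg, f_t_neg, margin=0.2):
    """Max-margin TransE loss for one (h, r, t) triplet, matching the
    L_kge term above with Delta = 0.2."""
    d = lambda h, r, t: np.linalg.norm(h + r - t)   # d(h,r,t) = ||f(h)+g(r)-f(t)||_2
    pos = d(f_h, g_r, f_t)
    return (max(0.0, pos - d(f_h, g_r, f_t_neg) + margin)     # tail-corrupted hinge
            + max(0.0, pos - d(f_h_neg, g_r, f_t) + margin))  # head-corrupted hinge

rng = np.random.default_rng(2)
h, r = rng.normal(size=16), rng.normal(size=16)
t = h + r                                  # an exactly-satisfied triplet
h_neg, t_neg = rng.normal(size=16), rng.normal(size=16)
loss = transe_loss(h, r, t, h_neg, t_neg)
```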
<h3 id="theoretical-justifications">Theoretical Justifications</h3>
<p>The authors provide metric learning interpretations for each objective. For CMM, they show that minimizing the loss amounts to assigning higher scores to matched triplets and lower scores to unmatched ones, aligning the feature space across all three modalities.</p>
<p>For KGE, two lemmas provide guarantees about structurally and functionally similar molecules:</p>
<p><strong>Lemma 1</strong> (Structural similarity): For a symmetric structural-similarity relation $r_s$, the KGE loss satisfies:</p>
<p>$$\mathcal{L}_{kge}(h, r_s, t) \propto 2\|f(h) - f(t)\| - \mathbb{E}_{\tilde{t}}\|f(h) - f(\tilde{t})\| - \mathbb{E}_{\tilde{h}}\|f(\tilde{h}) - f(t)\|$$</p>
<p>This shows KGE pulls structurally similar molecules closer while pushing dissimilar ones apart.</p>
<p><strong>Lemma 2</strong> (Functional similarity): For molecules $h$ and $t$ that interact with a common entity $o$, the distance between their embeddings is upper-bounded:</p>
<p>$$\|f(h) - f(t)\| \leq \alpha\,\mathbb{E}_{(e_1, r, e_2) \sim \mathcal{I}}\left[\mathcal{L}_{kge}(e_1, r, e_2)\right] + C$$</p>
<p>where $\alpha \approx 1$ and $C \approx 0$. This guarantees that minimizing KGE also brings functionally similar molecules closer in the embedding space.</p>
<h2 id="experiments-across-four-downstream-tasks">Experiments Across Four Downstream Tasks</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>MolFM pre-trains on 15K molecules from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> paired with 37M paragraphs from S2ORC. The knowledge graph contains 49K entities and 3.2M relations, constructed from <a href="https://en.wikipedia.org/wiki/DrugBank">DrugBank</a>, <a href="https://en.wikipedia.org/wiki/BindingDB">BindingDB</a>, and additional public databases with heuristic augmentation.</p>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>Evaluated on PCdes (paragraph-level) in zero-shot and fine-tuning settings. MolFM uses a re-ranking strategy that linearly combines cosine similarity with CMM logits over the top-$k$ retrieved candidates.</p>
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>Model</th>
          <th>S-T MRR</th>
          <th>S-T R@1</th>
          <th>S-T R@10</th>
          <th>T-S MRR</th>
          <th>T-S R@1</th>
          <th>T-S R@10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Zero-shot</td>
          <td>MoMu</td>
          <td>9.89</td>
          <td>5.08</td>
          <td>18.93</td>
          <td>10.33</td>
          <td>4.90</td>
          <td>20.69</td>
      </tr>
      <tr>
          <td>Zero-shot</td>
          <td>MolFM</td>
          <td>21.42</td>
          <td>13.90</td>
          <td>36.21</td>
          <td>23.63</td>
          <td>16.14</td>
          <td>39.54</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MoMu</td>
          <td>34.29</td>
          <td>24.47</td>
          <td>53.84</td>
          <td>34.53</td>
          <td>24.87</td>
          <td>54.25</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MolFM</td>
          <td>39.56</td>
          <td>29.76</td>
          <td>58.63</td>
          <td>39.34</td>
          <td>29.39</td>
          <td>58.49</td>
      </tr>
  </tbody>
</table>
<p>MolFM achieves 12.13% and 5.04% absolute gains over MoMu under zero-shot and fine-tuning settings, respectively.</p>
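<p>The re-ranking strategy described above can be sketched as a two-stage ranker for one query; the weight <code>alpha</code> and candidate count <code>k</code> are stand-ins, since the paper's exact combination rule may differ:</p>

```python
import numpy as np

def rerank(cos_sim, cmm_logits, k=8, alpha=0.5):
    """Cheap first stage ranks all candidates by contrastive cosine similarity;
    the top-k are re-scored by a linear combination with CMM matching logits."""
    top_k = np.argsort(-cos_sim)[:k]
    combined = alpha * cos_sim[top_k] + (1 - alpha) * cmm_logits[top_k]
    return top_k[np.argsort(-combined)]     # top-k candidates, re-ordered

rng = np.random.default_rng(3)
cos_sim = rng.normal(size=100)      # one query's similarity to 100 candidates
cmm_logits = rng.normal(size=100)   # CMM match scores from the multimodal encoder
order = rerank(cos_sim, cmm_logits)
```

<p>This keeps the expensive multimodal encoder off the full candidate set while letting its matching head refine the final ordering.</p>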
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>Evaluated on ChEBI-20 using MolT5 decoders. MolFM&rsquo;s structure encoder features are concatenated with the MolT5 encoder outputs.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>BLEU-4</th>
          <th>ROUGE-L</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.457</td>
          <td>0.578</td>
          <td>0.569</td>
          <td>0.547</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.462</td>
          <td>0.575</td>
          <td>0.576</td>
          <td>0.558</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>GraphMVP</td>
          <td>0.491</td>
          <td>0.592</td>
          <td>0.599</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.498</td>
          <td>0.594</td>
          <td>0.607</td>
          <td>0.576</td>
      </tr>
  </tbody>
</table>
<h3 id="text-based-molecule-generation">Text-Based Molecule Generation</h3>
<p>Also on ChEBI-20 with MolT5 decoders. MolFM&rsquo;s text features are projected and fed to the decoder.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>Exact</th>
          <th>Valid</th>
          <th>Morgan FTS</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.082</td>
          <td>0.786</td>
          <td>0.601</td>
          <td>0.543</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.183</td>
          <td>0.863</td>
          <td>0.678</td>
          <td>0.580</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.210</td>
          <td>0.892</td>
          <td>0.697</td>
          <td>0.583</td>
      </tr>
  </tbody>
</table>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (8 classification datasets), MolFM concatenates the structure feature and the multimodal encoder&rsquo;s CLS feature to predict properties.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>Avg</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GraphMVP</td>
          <td>72.4</td>
          <td>74.4</td>
          <td>77.5</td>
          <td>77.0</td>
          <td>81.2</td>
          <td>73.07</td>
      </tr>
      <tr>
          <td>DeepEIK</td>
          <td>72.1</td>
          <td>72.4</td>
          <td>89.7</td>
          <td>75.0</td>
          <td>80.5</td>
          <td>73.27</td>
      </tr>
      <tr>
          <td>MolFM (w/o T+K)</td>
          <td>72.2</td>
          <td>76.6</td>
          <td>78.6</td>
          <td>78.2</td>
          <td>82.6</td>
          <td>73.95</td>
      </tr>
      <tr>
          <td>MolFM (w/ T+K)</td>
          <td>72.9</td>
          <td>77.2</td>
          <td>79.7</td>
          <td>78.8</td>
          <td>83.9</td>
          <td>74.62</td>
      </tr>
  </tbody>
</table>
<p>With multimodal inputs, MolFM averages 74.62% ROC-AUC, a 1.55% absolute gain over GraphMVP.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Zero-shot retrieval ablations reveal that cross-modal attention to atoms and CMM are the most critical components. Removing either causes a sharp drop (approximately 3% on S-T retrieval). Knowledge graph incorporation yields a 1.5% average improvement, with both attention to neighbors and KGE contributing marginally.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>MolFM demonstrates that incorporating knowledge graphs as a third modality provides consistent improvements across all evaluated tasks. The theoretical analysis connecting pre-training objectives to deep metric learning provides interpretability for why the model works: STC and CMM align representations of the same molecule across modalities, while KGE pulls structurally and functionally similar molecules closer in the embedding space.</p>
<p>The cross-modal attention visualizations show that MolFM learns to associate specific atom substructures with relevant text tokens and knowledge graph entities. For example, the model correctly attends to functional groups mentioned in textual descriptions.</p>
<p>The authors acknowledge several limitations:</p>
<ol>
<li><strong>Data quality</strong>: The pre-training dataset (15K molecules) is small and may introduce biases</li>
<li><strong>Cold-start problem</strong>: MolFM provides limited benefit for newly emerged molecules lacking text and knowledge graph information</li>
<li><strong>Entity scope</strong>: The model focuses on molecules and does not incorporate proteins, genes, or cell lines, which could further improve biomedical understanding</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (molecules)</td>
          <td>PubChem</td>
          <td>15K molecules</td>
          <td>Follows MoMu&rsquo;s pre-training data</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>S2ORC</td>
          <td>37M paragraphs</td>
          <td>Biomedical literature paragraphs</td>
      </tr>
      <tr>
          <td>Knowledge graph</td>
          <td>DrugBank, BindingDB, public DBs</td>
          <td>49K entities, 3.2M relations</td>
          <td>Constructed with heuristics from MoCL</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>Paragraph-level</td>
          <td>Test split</td>
      </tr>
      <tr>
          <td>Captioning/Generation</td>
          <td>ChEBI-20</td>
          <td>-</td>
          <td>Following MolT5 splits</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet</td>
          <td>8 datasets</td>
          <td>Classification tasks, ROC-AUC metric</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: AdamW with weight decay $1 \times 10^{-4}$</li>
<li>Learning rate: linear warmup to $1 \times 10^{-4}$ over 2,000 iterations, cosine annealing to $1 \times 10^{-5}$</li>
<li>Batch size: 128</li>
<li>Pre-training epochs: 300</li>
<li>Knowledge graph neighbors per molecule: $N = 4$</li>
<li>Temperature: $\tau = 0.1$</li>
<li>Margin: $\Delta = 0.2$</li>
</ul>
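<p>The reported schedule (linear warmup to $1 \times 10^{-4}$ over 2,000 iterations, cosine annealing to $1 \times 10^{-5}$) can be sketched as follows; <code>total_steps</code> is a hypothetical training length, not a value from the paper:</p>

```python
import math

def lr_at(step, total_steps, warmup_steps=2000,
          peak_lr=1e-4, final_lr=1e-5):
    """Linear warmup to peak_lr over warmup_steps, then cosine
    annealing down to final_lr over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine
```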
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Initialization</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph encoder</td>
          <td>5-layer GIN</td>
          <td>1.8M</td>
          <td>GraphMVP</td>
      </tr>
      <tr>
          <td>Text encoder</td>
          <td>6-layer Transformer</td>
          <td>61.8M</td>
          <td>KV-PLM (first 6 layers)</td>
      </tr>
      <tr>
          <td>Knowledge encoder</td>
          <td>TransE</td>
          <td>12.6M</td>
          <td>Trained 500 epochs on KG</td>
      </tr>
      <tr>
          <td>Multimodal encoder</td>
          <td>6-layer Transformer + cross-attention</td>
          <td>61.8M</td>
          <td>KV-PLM (last 6 layers)</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>~138M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>MRR, Recall@1/5/10</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>BLEU-2/4, ROUGE-1/2/L, METEOR, Text2Mol</td>
      </tr>
      <tr>
          <td>Text-to-molecule generation</td>
          <td>BLEU, Exact ratio, Validity, Levenshtein, Fingerprint Tanimoto (MACCS/RDKit/Morgan), Text2Mol</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>ROC-AUC per dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 NVIDIA A100 GPUs for pre-training</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BioFM/OpenBioMed">OpenBioMed</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation including MolFM</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Yang, K., Hong, M., Liu, X. Y., &amp; Nie, Z. (2023). MolFM: A Multimodal Molecular Foundation Model. <em>arXiv preprint arXiv:2307.09484</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2023molfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolFM: A Multimodal Molecular Foundation Model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Yang, Kai and Hong, Massimo and Liu, Xing Yi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2307.09484}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolecularRNN: Graph-Based Molecular Generation and RL</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</guid><description>MolecularRNN extends GraphRNN with atom and bond type predictions, valency-based rejection sampling, and policy gradient optimization for molecular generation.</description><content:encoded><![CDATA[<h2 id="a-graph-recurrent-model-for-molecular-generation-with-property-optimization">A Graph Recurrent Model for Molecular Generation with Property Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces MolecularRNN, a graph-based recurrent generative model for molecular structures. The model extends GraphRNN to handle typed nodes (atom types) and typed edges (bond types), enabling direct generation of molecular graphs rather than working through string representations like SMILES. Three key contributions are combined: (1) the MolecularRNN architecture for autoregressive graph generation, (2) valency-based rejection sampling for guaranteed 100% validity at inference, and (3) policy gradient reinforcement learning for shifting molecular property distributions toward desired ranges.</p>
<h2 id="why-generate-molecules-as-graphs-rather-than-strings">Why Generate Molecules as Graphs Rather Than Strings</h2>
<p>Computational de novo molecular design aims to create novel molecules with desired properties, a task central to drug discovery. At the time of this work, most deep generative models for molecules operated on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, inheriting the complications of SMILES grammar and the problem that structurally similar molecules can have very different string representations. Graph-based representations are more natural for molecules, with atoms mapping to nodes and bonds to edges, and they allow direct enforcement of chemical constraints during generation.</p>
<p>Existing graph-based methods had their own limitations. Junction tree VAE (JT-VAE) generates molecules from structural fragments, which introduces ambiguity when converting junction trees back to molecules, particularly problematic during property optimization since molecules sharing a junction tree can have very different property values. The GCPN model uses graph convolutional networks with reinforcement learning but was evaluated only on top-3 generated molecules, making it difficult to assess overall distribution quality. Prior atom-level graph generation models like Li et al. (2018a) were restricted to molecules with at most 20 heavy atoms, limiting practical applicability.</p>
<h2 id="core-innovation-extending-graphrnn-with-chemical-constraints-and-rl">Core Innovation: Extending GraphRNN with Chemical Constraints and RL</h2>
<p>MolecularRNN builds on the GraphRNN architecture by introducing atom type predictions alongside edge type predictions. The model generates molecular graphs sequentially: at each step, a NodeRNN predicts the type of the next atom, then an EdgeRNN predicts bond types to all preceding atoms within a BFS-ordered window.</p>
<h3 id="autoregressive-graph-generation">Autoregressive Graph Generation</h3>
<p>The joint likelihood over atom types $C^{\pi}$ and adjacency vectors $S^{\pi}$ under BFS ordering $\pi$ is factorized as:</p>
<p>$$
p\left(S^{\pi}, C^{\pi}\right) = \prod_{i=1}^{n+1} p\left(C_{i}^{\pi} \mid S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right) p\left(S_{i}^{\pi} \mid C_{i}^{\pi}, S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right)
$$</p>
<p>NodeRNN processes embeddings of previous atom types and adjacency vectors to produce a hidden state, from which a two-layer MLP with softmax predicts the next atom type $\psi_{i}$:</p>
<p>$$
h_{i}^{\text{node}} = \text{NodeRNN}\left(h_{i-1}^{\text{node}}, \left[\text{emb}(S_{i-1}^{\pi}), \text{emb}(C_{i-1}^{\pi})\right]\right)
$$</p>
<p>$$
\psi_{i} = \text{NodeMLP}\left(h_{i}^{\text{node}}\right)
$$</p>
<p>EdgeRNN then unrolls across preceding atoms to predict bond types $\phi_{i,j}$, initialized with the NodeRNN hidden state:</p>
<p>$$
h_{i,j}^{\text{edge}} = \text{EdgeRNN}\left(h_{i,j-1}^{\text{edge}}, \text{emb}(S_{i,j-1}^{\pi})\right), \quad h_{i,0}^{\text{edge}} = h_{i}^{\text{node}}
$$</p>
<p>$$
\phi_{i,j} = \text{EdgeMLP}\left(h_{i,j}^{\text{edge}}\right)
$$</p>
<p>Bond types are categorical over {no bond, single, double, triple}, and molecules are represented in kekulized form. BFS ordering limits the EdgeRNN window to $M = 12$ preceding atoms.</p>
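<p>The two-level generation loop has the control flow sketched below, with the learned NodeRNN/EdgeRNN distributions replaced by uniform random choices; <code>sample_graph</code> and its fixed-atom-count interface are illustrative stand-ins, not the paper&rsquo;s implementation:</p>

```python
import random

ATOM_TYPES = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]
BOND_TYPES = [0, 1, 2, 3]  # no bond, single, double, triple
M = 12  # BFS window: bonds only to the M most recent atoms

def sample_graph(n_atoms, seed=0):
    """Toy autoregressive loop with MolecularRNN's structure: at
    each step a 'NodeRNN' picks the next atom type, then an
    'EdgeRNN' picks bond orders to the preceding atoms inside the
    BFS window.  Uniform sampling replaces the learned models."""
    rng = random.Random(seed)
    atoms, bonds = [], {}
    for i in range(n_atoms):
        atoms.append(rng.choice(ATOM_TYPES))      # node step
        for j in range(max(0, i - M), i):         # edge steps
            order = rng.choice(BOND_TYPES)
            if order:
                bonds[(j, i)] = order
    return atoms, bonds

atoms, bonds = sample_graph(20)
```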
<h3 id="valency-based-rejection-sampling">Valency-Based Rejection Sampling</h3>
<p>During inference, each proposed bond of order $k$ between atoms $i$ and $j$ is accepted only if both atoms remain within their allowed valencies:</p>
<p>$$
\sum_{j} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{i}^{\pi}} \quad \text{and} \quad \sum_{i} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{j}^{\pi}}
$$</p>
<p>Atoms that do not fill their valencies are complemented with hydrogens. This constraint can be enforced directly on graphs (unlike SMILES, where intermediate substrings are not chemically meaningful), yielding 100% valid molecules.</p>
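<p>The acceptance rule can be written directly as a predicate. The valency table below is a simplified single-valence-per-element sketch (multi-valent elements such as S and P are not handled as the paper would):</p>

```python
VALENCY = {"C": 4, "N": 3, "O": 2, "F": 1, "P": 3,
           "S": 2, "Cl": 1, "Br": 1, "I": 1}  # simplified table

def accept_bond(order, i, j, atoms, bond_orders):
    """Valency-based rejection: accept a proposed bond of the given
    order between atoms i and j only if neither atom would exceed
    its allowed valency.  bond_orders[k] holds the current total
    bond order on atom k; unfilled valencies are later completed
    with hydrogens."""
    ok_i = bond_orders[i] + order <= VALENCY[atoms[i]]
    ok_j = bond_orders[j] + order <= VALENCY[atoms[j]]
    return ok_i and ok_j

# Example: a C=O double bond is fine, a C#O triple bond is rejected.
atoms = ["C", "O"]
ok_double = accept_bond(2, 0, 1, atoms, [0, 0])  # True
ok_triple = accept_bond(3, 0, 1, atoms, [0, 0])  # False: O exceeds 2
```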
<h3 id="property-optimization-via-policy-gradient">Property Optimization via Policy Gradient</h3>
<p>For property optimization, MolecularRNN is formulated as a policy network in a Markov Decision Process. The loss function uses REINFORCE with a discounted final reward:</p>
<p>$$
L(\theta) = -\sum_{i=1}^{N} r(s_{N}) \cdot \gamma^{i} \cdot \log p(s_{i} \mid s_{i-1}; \theta)
$$</p>
<p>where $r(s_{N})$ is the reward from a property critic and $\gamma$ is a discount factor. The authors also introduce a structural penalty during RL training that assigns a penalty of $-10$ to atoms violating valency constraints, providing a learning signal from invalid intermediate molecules.</p>
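<p>The loss reduces to a short sum over the generation trajectory; a direct transcription of the formula (illustrative inputs, not trained model outputs):</p>

```python
def reinforce_loss(final_reward, step_log_probs, gamma=0.97):
    """L(theta) = -sum_i r(s_N) * gamma**i * log p(s_i | s_{i-1}):
    every step shares the final reward, discounted by gamma**i."""
    return -sum(final_reward * gamma ** i * lp
                for i, lp in enumerate(step_log_probs, start=1))

loss = reinforce_loss(2.0, [-1.0], gamma=0.5)  # -(2.0 * 0.5 * -1.0) = 1.0
```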
<h2 id="experimental-setup-pretraining-and-property-optimization">Experimental Setup: Pretraining and Property Optimization</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MolecularRNN is pretrained on three datasets: ChEMBL (~1.5M bioactive molecules), <a href="/notes/chemistry/datasets/zinc-22/">ZINC 250k</a> (250K randomly selected commercially available compounds), and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> (~1.9M drug-like molecules from ZINC). The model considers 9 atom types (C, N, O, F, P, S, Cl, Br, I), 3 bond types (single, double, triple), and molecules with 10-50 heavy atoms. Architecture: NodeRNN with 4 GRU layers (hidden size 256), EdgeRNN with 4 GRU layers (hidden size 128), node embedding size 128, edge embedding size 16. Training uses Adam with learning rate 0.001 and multiplicative decay on 4 GPUs with batch size 512 per GPU for 250 epochs.</p>
<h3 id="generation-quality-at-scale">Generation Quality at Scale</h3>
<p>The pretrained model generates 1 million molecules per dataset (far larger than prior work: JT-VAE used 5K samples, Li et al. used 100K). Results with valency-based rejection sampling:</p>
<table>
  <thead>
      <tr>
          <th>Training Set</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>IntDiv (p=1)</th>
          <th>IntDiv (p=2)</th>
          <th>SA Score</th>
          <th>QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>100%</td>
          <td>99.2%</td>
          <td>99.3%</td>
          <td>0.895</td>
          <td>0.890</td>
          <td>3.67 +/- 1.20</td>
          <td>0.56 +/- 0.20</td>
      </tr>
      <tr>
          <td>ZINC 250k</td>
          <td>100%</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>0.892</td>
          <td>0.887</td>
          <td>3.60 +/- 1.01</td>
          <td>0.68 +/- 0.16</td>
      </tr>
      <tr>
          <td>MOSES</td>
          <td>100%</td>
          <td>99.4%</td>
          <td>100%</td>
          <td>0.881</td>
          <td>0.876</td>
          <td>3.24 +/- 0.97</td>
          <td>0.74 +/- 0.14</td>
      </tr>
  </tbody>
</table>
<p>Comparison with baselines on ZINC 250k (30K samples):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>SA Score</th>
          <th>QED</th>
          <th>IntDiv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>100%</td>
          <td>3.37</td>
          <td>0.76</td>
          <td>0.85</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>100%</td>
          <td>99.97%</td>
          <td>100%</td>
          <td>4.62</td>
          <td>0.61</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>100%</td>
          <td>99.89%</td>
          <td>100%</td>
          <td>3.59</td>
          <td>0.68</td>
          <td>0.89</td>
      </tr>
  </tbody>
</table>
<p>GCPN generates overly complex molecules (high SA score of 4.62), while MolecularRNN produces more realistic structures with higher internal diversity than JT-VAE.</p>
<h3 id="property-optimization-results">Property Optimization Results</h3>
<p>Policy gradient optimization is run for 300 iterations with batch size 512 and constant learning rate $10^{-5}$, discount factor $\gamma = 0.97$. Top-3 scores for penalized logP and QED:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>logP 1st</th>
          <th>logP 2nd</th>
          <th>logP 3rd</th>
          <th>QED 1st</th>
          <th>QED 2nd</th>
          <th>QED 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></td>
          <td>3.63</td>
          <td>3.49</td>
          <td>3.44</td>
          <td>0.896</td>
          <td>0.824</td>
          <td>0.820</td>
      </tr>
      <tr>
          <td>JT-VAE</td>
          <td>5.30</td>
          <td>4.93</td>
          <td>4.49</td>
          <td>0.925</td>
          <td>0.911</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>7.98</td>
          <td>7.85</td>
          <td>7.80</td>
          <td>0.948</td>
          <td>0.947</td>
          <td>0.946</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>10.34</td>
          <td>10.19</td>
          <td>10.14</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.947</td>
      </tr>
  </tbody>
</table>
<p>MolecularRNN achieves the highest penalized logP scores (10.34 vs. GCPN&rsquo;s 7.98) while matching GCPN on QED. The authors also demonstrate melting temperature optimization using a GCN-based property predictor as the critic (RMSE 39.5 degrees C), showing that the RL framework generalizes to properties that cannot be computed directly from molecular graphs.</p>
<h2 id="distribution-level-evaluation-and-learned-chemical-patterns">Distribution-Level Evaluation and Learned Chemical Patterns</h2>
<p>The authors emphasize that reporting only top-3 scores is not informative, and they compare full property distributions. MolecularRNN shifts the QED distribution further toward maximum values compared to GCPN. They also note that during melting temperature optimization, the model rediscovered two chemical phenomena: fusing aromatic rings increases melting point, and the presence of polar groups (C=O, OH, NH2, heterocyclic nitrogens) enhances dipole-dipole interactions and raises melting temperature.</p>
<p>Without valency-based rejection sampling, the pretrained model achieves 65% validity. After structural penalty training (assigning -10 to valency-violating atoms and optimizing with policy gradient), validity increases to 90%. Enabling rejection sampling then achieves 100%.</p>
<p>Several limitations are worth noting. The BFS ordering introduces an arbitrary sequencing over equivalent graph traversals (the node order permutation problem is not addressed). The evaluation uses top-3 scores for property optimization, though the authors do advocate for distributional evaluation. The molecule size is capped at 50 heavy atoms. The paper does not report training time or wall-clock generation speed. Future directions mentioned include multi-objective property optimization and scaffold completion (graph completion from a given core structure).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>~1.5M molecules</td>
          <td>Bioactive molecules with experimental measurements</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC 250k</td>
          <td>250K molecules</td>
          <td>Random subset of ZINC database</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>MOSES</td>
          <td>~1.9M molecules</td>
          <td>Drug-like subset of ZINC</td>
      </tr>
      <tr>
          <td>Melting point critic</td>
          <td>Custom split</td>
          <td>37,940 train / 9,458 test</td>
          <td>Melting temperatures from -196 to 517 degrees C</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Maximum likelihood with Adam optimizer, learning rate 0.001 with multiplicative decay to $10^{-5}$, 250 epochs</li>
<li><strong>Structural penalty</strong>: Policy gradient with -10 penalty per valency-violating atom</li>
<li><strong>Property optimization</strong>: REINFORCE (policy gradient), 300 iterations, batch size 512, learning rate $10^{-5}$, discount factor $\gamma = 0.97$</li>
<li><strong>Melting point critic</strong>: GCN regression (4 layers, hidden size 128), Adam with learning rate 0.001, exponential decay $\gamma = 0.8$, 30 epochs, batch size 32</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>NodeRNN</strong>: 4 GRU layers, hidden size 256, node embedding 128</li>
<li><strong>EdgeRNN</strong>: 4 GRU layers, hidden size 128, edge embedding 16</li>
<li><strong>NodeMLP/EdgeMLP</strong>: 2-layer MLP with 128 hidden units, ReLU activation, softmax output</li>
<li><strong>BFS window</strong>: $M = 12$ preceding atoms</li>
<li><strong>Atom types</strong>: 9 (C, N, O, F, P, S, Cl, Br, I)</li>
<li><strong>Bond types</strong>: 3 (single, double, triple) + no bond</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>% chemically valid molecules (RDKit)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>% unique in generated pool (up to 1M)</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>% not in training set</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Average pairwise Tanimoto distance</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility (2-4 optimal range)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Drug-likeness score (0-1)</td>
      </tr>
      <tr>
          <td>Penalized logP</td>
          <td>Lipophilicity with ring and SA penalties</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 GPUs (NVIDIA, specific model not stated)</li>
<li>Per-GPU batch size of 512 for pretraining</li>
<li>Training time not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Popova, M., Shvets, M., Oliva, J., &amp; Isayev, O. (2019). MolecularRNN: Generating realistic molecular graphs with optimized properties. <em>arXiv preprint arXiv:1905.13372</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{popova2019molecularrnn,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolecularRNN: Generating realistic molecular graphs with optimized properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Popova, Mariya and Shvets, Mykhailo and Oliva, Junier and Isayev, Olexandr}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1905.13372}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Memory-Assisted RL for Diverse De Novo Mol. Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</guid><description>A memory unit for REINVENT-based RL that tracks generated scaffolds and penalizes repeated solutions, increasing molecular diversity up to fourfold.</description><content:encoded><![CDATA[<h2 id="a-memory-module-for-diverse-molecular-generation-via-rl">A Memory Module for Diverse Molecular Generation via RL</h2>
<p>This is a <strong>Method</strong> paper that introduces a memory unit for reinforcement learning (RL)-based molecular generation. The primary contribution is a hash-table-based memory mechanism that integrates into the REINVENT framework&rsquo;s scoring function. By tracking previously generated high-scoring molecules and penalizing the reward when new molecules are too similar to those already stored, the memory unit forces the generative model to explore different regions of chemical space rather than collapsing onto a single scaffold family.</p>
<h2 id="policy-collapse-limits-rl-based-de-novo-design">Policy Collapse Limits RL-Based De Novo Design</h2>
<p>Recurrent neural networks (RNNs) trained with reinforcement learning can generate novel molecules optimized for desired properties. The <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> algorithm and related approaches (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, GENTRL) demonstrated the viability of coupling a pretrained SMILES-based generative model with a scoring function via RL. However, a persistent problem is <strong>policy collapse</strong> (also called mode collapse): once the model discovers a high-scoring region of chemical space, it continues to exploit that region, producing structurally similar compounds with minor substitution differences. This severely limits the practical utility of RL-based generation in drug design, where medicinal chemists need diverse scaffolds to explore structure-activity relationships and manage intellectual property concerns.</p>
<p>Prior work by Liu et al. [31] attempted to address this by engineering an explorative RNN alongside the standard generative RNN, but it did not substantially increase diversity compared to standard REINVENT. Other approaches like Generative Examination Networks (GEN) performed statistical analysis during training but were not evaluated in optimization scenarios.</p>
<h2 id="core-innovation-hash-table-memory-unit-for-reward-modification">Core Innovation: Hash-Table Memory Unit for Reward Modification</h2>
<p>The key insight is to dynamically modify the reward surface during RL by maintaining a memory of previously explored chemical space. The memory unit is a hash table of index-bucket pairs. Each bucket stores up to a fixed number of high-scoring molecules (default: 25) that are chemically similar to a seed molecule (the index).</p>
<h3 id="integration-with-reinvent">Integration with REINVENT</h3>
<p>The memory unit modifies the augmented likelihood used in REINVENT. For a generated compound $c$, the augmented log-likelihood becomes:</p>
<p>$$
\log P(c)_{Aug} = \log P(c)_{PriorNetwork} + \sigma \times S(c) \times M(c)
$$</p>
<p>where $\sigma$ is a scalar coefficient, $S(c)$ is the scoring function output, and $M(c)$ is the memory unit output (either 0 or 1). The reward is:</p>
<p>$$
R(c) = \left(\log P(c)_{Aug} - \log P(c)_{AgentNetwork}\right)^2
$$</p>
<p>and the loss is $\text{loss} = -R(c)$.</p>
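<p>These two equations can be sketched as plain functions. The default <code>sigma=60.0</code> is an illustrative REINVENT-style value, not a setting reported in this summary:</p>

```python
def augmented_loglik(logp_prior, score, memory_out, sigma=60.0):
    """log P_aug = log P_prior + sigma * S(c) * M(c).  When the
    memory returns M(c) = 0, this collapses back to the prior
    likelihood, removing the scoring-function bonus."""
    return logp_prior + sigma * score * memory_out

def reward(logp_prior, logp_agent, score, memory_out, sigma=60.0):
    """R(c) = (log P_aug - log P_agent)^2; training minimizes -R(c)."""
    diff = augmented_loglik(logp_prior, score, memory_out, sigma) - logp_agent
    return diff * diff
```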
<h3 id="memory-unit-operation">Memory Unit Operation</h3>
<p>When a high-scoring molecule is generated:</p>
<ol>
<li>Its fingerprint or scaffold is compared against all index structures in the memory</li>
<li>If it is similar to an index (above a Tanimoto cutoff, default 0.6) and the corresponding bucket is not full, $M(c) = 1$ and the molecule is added to the bucket</li>
<li>If the bucket is full, $M(c) = 0$, effectively zeroing the reward contribution and discouraging the model from generating similar molecules</li>
<li>If no similar index exists, a new index-bucket pair is created</li>
</ol>
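<p>A minimal sketch of this bookkeeping, with a pluggable similarity function standing in for the paper&rsquo;s fingerprint and scaffold comparisons:</p>

```python
class MemoryUnit:
    """Hash-table memory sketch: each entry pairs an index (seed)
    molecule with a bucket of similar high-scoring molecules."""

    def __init__(self, similarity, cutoff=0.6, bucket_size=25):
        self.similarity = similarity  # any [0, 1] similarity function
        self.cutoff = cutoff
        self.bucket_size = bucket_size
        self.entries = []  # list of (index_mol, bucket) pairs

    def __call__(self, mol):
        """Return M(c) in {0, 1} and update the memory."""
        for index_mol, bucket in self.entries:
            if self.similarity(mol, index_mol) >= self.cutoff:
                if len(bucket) < self.bucket_size:
                    bucket.append(mol)
                    return 1
                return 0  # bucket full: zero out the reward term
        self.entries.append((mol, [mol]))  # new index-bucket pair
        return 1

# Toy demo: exact-match similarity, bucket size 2.
mem = MemoryUnit(lambda a, b: 1.0 if a == b else 0.0, bucket_size=2)
outputs = [mem("A"), mem("A"), mem("A"), mem("B")]
```

<p>In the toy run, the third &ldquo;A&rdquo; finds its bucket full and receives $M(c) = 0$, while the dissimilar &ldquo;B&rdquo; opens a fresh index-bucket pair.</p>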
<h3 id="four-similarity-criteria">Four Similarity Criteria</h3>
<p>The authors evaluate four criteria for grouping molecules in the memory:</p>
<ol>
<li><strong>Compound similarity</strong>: ECFP4 Tanimoto similarity at the whole-molecule level</li>
<li><strong>Identical Bemis-Murcko (BM) scaffold</strong>: exact match of Bemis-Murcko frameworks</li>
<li><strong>Identical carbon skeleton</strong>: exact match of carbon skeletons (BM scaffolds with all heteroatoms replaced by carbon and bonds set to single)</li>
<li><strong>Scaffold similarity</strong>: atom pair fingerprint Tanimoto similarity between carbon skeletons (fuzzy matching)</li>
</ol>
<h3 id="alternative-output-modes">Alternative Output Modes</h3>
<p>Beyond the binary output ($M(c) \in {0, 1}$), the authors also explored smooth output functions. The linear mode:</p>
<p>$$
M(c) = 1 - \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>And the sigmoid mode:</p>
<p>$$
M(c) = 1 - \frac{1}{1 + e^{-\left(\frac{\frac{\text{compounds in bucket}}{\text{bucket size}} \times 2 - 1}{0.15}\right)}}
$$</p>
<p>Both smooth modes yielded slightly fewer analogs than the binary mode and were not pursued further.</p>
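<p>For reference, the two smooth modes transcribe directly from the formulas above:</p>

```python
import math

def m_linear(n_in_bucket, bucket_size=25):
    """Linear memory output: decays from 1 to 0 as the bucket fills."""
    return 1.0 - n_in_bucket / bucket_size

def m_sigmoid(n_in_bucket, bucket_size=25, width=0.15):
    """Sigmoid memory output: near 1 for an empty bucket, near 0 for
    a full one, with the transition centered at a half-full bucket."""
    x = (n_in_bucket / bucket_size) * 2.0 - 1.0
    return 1.0 - 1.0 / (1.0 + math.exp(-x / width))
```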
<h2 id="experimental-setup-logp-optimization-and-target-activity-prediction">Experimental Setup: LogP Optimization and Target Activity Prediction</h2>
<h3 id="case-study-1-logp-optimization">Case Study 1: LogP Optimization</h3>
<p>As a proof of concept, the authors optimized LogP values for known DRD2 inhibitors. Starting from 487 DRD2 compounds with LogP &gt;= 5 (from ExCAPE-DB), they applied transfer learning to the prior model for 20 epochs, then ran RL for 150 iterations (100 compounds per iteration, 15,000 total). The scoring function was:</p>
<p>$$
S = 1 - \tanh\left(\min\left(|2 - \text{AlogP}|, |3 - \text{AlogP}|\right)\right)
$$</p>
<p>targeting LogP values between 2.0 and 3.0.</p>
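<p>The scoring function is a one-liner. Note its shape: $S$ is maximal (1.0) when AlogP is exactly 2 or 3 and decays smoothly with distance to the nearer of the two targets:</p>

```python
import math

def logp_score(alogp):
    """S = 1 - tanh(min(|2 - AlogP|, |3 - AlogP|)): rewards LogP
    values near the [2, 3] target window, penalizing the distance
    to whichever boundary value is closer."""
    return 1.0 - math.tanh(min(abs(2.0 - alogp), abs(3.0 - alogp)))
```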
<h3 id="case-study-2-htr1a-and-drd2-activity-prediction">Case Study 2: HTR1A and DRD2 Activity Prediction</h3>
<p>For a more complex scenario, the authors trained SVM classifiers (with <a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt scaling</a> for probabilistic output) on bioactivity data from ExCAPE-DB to predict activity against two neurotransmitter receptors:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/5-HT1A_receptor">HTR1A</a></strong>: 3,599 actives (pIC50 &gt;= 7) and 66,684 inactives</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a></strong>: 2,981 actives (pIC50 &gt;= 7) and 346,206 inactives (100,000 sampled)</li>
</ul>
<p>Data was split using Butina clustering on ECFP6 at a 0.4 Tanimoto cutoff (60/20/20 train/val/test). The SVM models achieved excellent performance:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Set</th>
          <th>Balanced Accuracy</th>
          <th>ROC AUC</th>
          <th>F1</th>
          <th>MCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>Test</td>
          <td>0.96</td>
          <td>0.99</td>
          <td>0.75</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Test</td>
          <td>0.95</td>
          <td>0.99</td>
          <td>0.71</td>
          <td>0.72</td>
      </tr>
  </tbody>
</table>
<p>RL was run for 300 iterations (100 compounds each, 30,000 total). Compounds with predicted activity &gt;= 0.7 were considered active.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN prior model followed the REINVENT architecture: an embedding layer, three GRU layers with 256 dimensions, and a linear output layer. It was pretrained on ~1.5 million ChEMBL 25 compounds (filtered to remove known HTR1A actives and DRD2 analogs) for 10 epochs using Adam with a learning rate of 0.01.</p>
<h3 id="comparisons">Comparisons</h3>
<p>The authors compared memory-assisted RL against:</p>
<ul>
<li>Standard REINVENT RL (no memory)</li>
<li>Experience replay (re-presenting 8 high-scoring compounds per iteration)</li>
<li>Temperature scaling (values from 1.0 to 10.0)</li>
<li>Memory + experience replay combined</li>
</ul>
<h2 id="results-up-to-fourfold-increase-in-diverse-active-compounds">Results: Up to Fourfold Increase in Diverse Active Compounds</h2>
<h3 id="logp-optimization-results">LogP Optimization Results</h3>
<p>Memory-assisted RL increased the number of optimized compounds (LogP 2-3) by roughly threefold:</p>
<table>
  <thead>
      <tr>
          <th>Memory Type</th>
          <th>Optimized Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No memory</td>
          <td>938</td>
          <td>727</td>
          <td>396</td>
      </tr>
      <tr>
          <td>Compound similarity</td>
          <td>3,451</td>
          <td>2,963</td>
          <td>1,472</td>
      </tr>
      <tr>
          <td>Identical BM Scaffold</td>
          <td>3,428</td>
          <td>2,865</td>
          <td>1,398</td>
      </tr>
      <tr>
          <td>Identical Carbon Skeleton</td>
          <td>3,315</td>
          <td>3,002</td>
          <td>1,799</td>
      </tr>
      <tr>
          <td>Scaffold Similarity</td>
          <td>3,591</td>
          <td>3,056</td>
          <td>1,538</td>
      </tr>
  </tbody>
</table>
<p>The memory unit also increased the generation of relevant analogs: ECFP6 analogs (Tanimoto &gt;= 0.4 to the training set) rose from 145 to as many as 549, and shared MMP cores from 5 to as many as 19, confirming that the memory unit promoted exploration of chemically relevant space rather than random drift.</p>
<h3 id="htr1a-and-drd2-activity-optimization-results">HTR1A and DRD2 Activity Optimization Results</h3>
<p>The improvements were even more pronounced for target activity optimization:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Memory Type</th>
          <th>Active Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>No memory</td>
          <td>9,323</td>
          <td>7,312</td>
          <td>5,446</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Compound similarity</td>
          <td>16,779</td>
          <td>13,304</td>
          <td>9,887</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Identical Carbon Skeleton</td>
          <td>17,597</td>
          <td>15,531</td>
          <td>12,408</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>No memory</td>
          <td>5,143</td>
          <td>2,635</td>
          <td>1,949</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Compound similarity</td>
          <td>21,486</td>
          <td>17,844</td>
          <td>12,749</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Scaffold Similarity</td>
          <td>22,784</td>
          <td>20,712</td>
          <td>16,434</td>
      </tr>
  </tbody>
</table>
<p>For DRD2, the effect was particularly striking: standard RL showed clear policy collapse with only 576 ECFP6 analogs to the training set, while memory-assisted RL generated up to 6,315. The compound similarity memory unit produced the most MMP analogs (217 to the training set vs. 7 without memory).</p>
<h3 id="parameter-sensitivity">Parameter Sensitivity</h3>
<p>Bucket size had a modest effect: larger buckets (allowing more compounds before penalization) slightly increased analog generation. The Tanimoto similarity threshold of 0.6 was near-optimal for the scaffold similarity memory; higher thresholds reduced diversity gains. The compound similarity memory showed increasing analogs with higher thresholds, but BM scaffold and carbon skeleton counts plateaued above 0.6.</p>
<h3 id="comparison-with-experience-replay-and-temperature-scaling">Comparison with Experience Replay and Temperature Scaling</h3>
<ul>
<li><strong>Experience replay alone</strong> increased diversity compared to vanilla RL but was less effective than the memory unit alone</li>
<li><strong>Memory + experience replay</strong> achieved the best results overall, as experience replay provided the model with diverse starting points for exploration after the memory unit altered the reward landscape</li>
<li><strong>Temperature scaling</strong> was largely ineffective: only a value of 1.25 showed improvement, and even then it achieved only about 50% of the analogs generated by memory-assisted RL. Temperatures above 2.0 degraded SMILES validity, and above 4.0 prevented valid molecule generation entirely</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>All evaluations are retrospective; no synthesized compounds were experimentally tested</li>
<li>The SVM activity models, while accurate, may have applicability domain limitations for highly novel scaffolds</li>
<li>The binary memory output mode was found to work best, but the transition from exploration to exploitation is abrupt</li>
<li>The method was only tested with two biological targets and one physicochemical property</li>
<li>Computational overhead of the memory unit is not discussed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior model training</td>
          <td>ChEMBL 25</td>
          <td>~1.5M compounds</td>
          <td>Filtered: max 50 heavy atoms, no stereochemistry, removed HTR1A actives and DRD2 analogs</td>
      </tr>
      <tr>
          <td>HTR1A activity data</td>
          <td>ExCAPE-DB</td>
          <td>3,599 actives + 66,684 inactives</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
      <tr>
          <td>DRD2 activity data</td>
          <td>ExCAPE-DB</td>
          <td>2,981 actives + 100,000 inactives (sampled)</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Generative model</strong>: RNN with embedding + 3 GRU layers (256 dim) + linear output (REINVENT architecture)</li>
<li><strong>RL</strong>: Augmented likelihood formulation with sigma scaling coefficient</li>
<li><strong>SVM classifiers</strong>: Non-linear SVM with MinMax kernel, Platt scaling, ECFP6 count-based fingerprints (2048 dim)</li>
<li><strong>Butina clustering</strong>: ECFP6 Tanimoto cutoff 0.4 for train/val/test splitting</li>
</ul>
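<p>The MinMax kernel named above generalizes Tanimoto similarity to count-based fingerprints; a minimal sketch (not the authors' code):</p>

```python
def minmax_kernel(x, y):
    """MinMax kernel on count fingerprints:
    sum(min(x_i, y_i)) / sum(max(x_i, y_i)).
    On binary vectors this reduces to Tanimoto similarity."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 0.0
```

<p>scikit-learn's <code>SVC</code> accepts a callable kernel, so a function like this can be plugged in directly, and Platt scaling corresponds to fitting with <code>probability=True</code>.</p>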
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unique compounds</td>
          <td>Number of distinct valid SMILES generated</td>
      </tr>
      <tr>
          <td>Unique BM scaffolds</td>
          <td>Bemis-Murcko framework diversity</td>
      </tr>
      <tr>
          <td>Unique carbon skeletons</td>
          <td>Carbon skeleton diversity (stripped BM scaffolds)</td>
      </tr>
      <tr>
          <td>ECFP6 analogs</td>
          <td>Compounds with Tanimoto &gt;= 0.4 to known actives</td>
      </tr>
      <tr>
          <td>MMP analogs</td>
          <td>Matched molecular pair relationships with known actives</td>
      </tr>
      <tr>
          <td>Shared MMP cores</td>
          <td>Scaffold cores shared between generated and known compounds</td>
      </tr>
  </tbody>
</table>
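<p>The analog metrics reduce to Tanimoto comparisons against the known actives. A pure-Python sketch with fingerprints represented as sets of on-bit indices (in practice these would be RDKit ECFP6 fingerprints):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def count_analogs(generated, known, threshold=0.4):
    """Count generated fingerprints with Tanimoto >= threshold
    to at least one known active."""
    return sum(
        any(tanimoto(g, k) >= threshold for k in known)
        for g in generated
    )
```
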
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tblaschke/reinvent-memory">reinvent-memory</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with prepared datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blaschke, T., Engkvist, O., Bajorath, J., &amp; Chen, H. (2020). Memory-assisted reinforcement learning for diverse molecular de novo design. <em>Journal of Cheminformatics</em>, 12, 68. <a href="https://doi.org/10.1186/s13321-020-00473-0">https://doi.org/10.1186/s13321-020-00473-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blaschke2020memory,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Memory-assisted reinforcement learning for diverse molecular de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blaschke, Thomas and Engkvist, Ola and Bajorath, J{\&#34;u}rgen and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00473-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LSTM Neural Network for Drug-Like Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/</guid><description>An LSTM neural network trained on 509K ChEMBL SMILES generates one million novel drug-like molecules with realistic substructures and bioactivity profiles.</description><content:encoded><![CDATA[<h2 id="an-early-method-for-lstm-based-molecular-generation">An Early Method for LSTM-Based Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that applies character-level LSTM networks to the task of de novo drug-like molecule generation. The primary contribution is demonstrating that an LSTM trained on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings from a large bioactive compound database (ChEMBL) can produce novel, diverse molecules whose chemical properties closely match those of known drug-like compounds. The paper also validates the generated molecules through virtual screening with profile QSAR models, showing comparable predicted bioactivity to the training set.</p>
<h2 id="the-challenge-of-exploring-drug-like-chemical-space">The Challenge of Exploring Drug-Like Chemical Space</h2>
<p>The theoretical space of drug-like molecules is astronomically large. Brute-force enumeration approaches such as <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> (which catalogued 166 billion molecules) are feasible only for small molecules, and full enumeration of molecules with 25-30 heavy atoms (the typical size of drug molecules) remains computationally intractable. Traditional cheminformatics approaches to sampling this space rely on fragment combination, evolutionary algorithms, or particle swarm optimization.</p>
<p>The authors position LSTM networks as a viable alternative. LSTMs had already demonstrated the ability to learn sequential structure in domains like text and music generation, making them natural candidates for learning SMILES grammar and generating novel valid molecular strings. At the time of writing (late 2017), several groups were exploring this direction, including Bjerrum and Threlfall (ZINC-based generation), <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> (VAE-based latent space design), <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">Olivecrona et al.</a> (RL-guided generation), and Segler et al. (focused library design). This paper contributes a large-scale empirical study with detailed analysis of the generated molecules&rsquo; chemical quality.</p>
<h2 id="character-level-lstm-with-temperature-based-sampling">Character-Level LSTM with Temperature-Based Sampling</h2>
<p>The core approach is straightforward: train an LSTM to predict the next character in a SMILES string, then sample from the trained model to generate new molecules character by character.</p>
<p>The network architecture consists of:</p>
<ul>
<li>Two stacked LSTM layers (which learn the SMILES grammar)</li>
<li>A dropout layer for regularization</li>
<li>A dense output layer with 23 neurons (one per character in the reduced SMILES alphabet) and softmax activation</li>
</ul>
<p>Training used the RMSProp optimizer, with the learning rate gradually decreased from 0.01 to 0.0002 over the course of training. At generation time, a temperature parameter controls the randomness of character sampling, producing more diverse structures rather than reproducing training molecules too closely.</p>
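<p>Temperature sampling divides the logits before the softmax: low temperatures concentrate probability on the most likely character, high temperatures flatten the distribution. A minimal sketch of such a sampler (the paper does not spell out the exact mechanics, so this is an assumption about the standard formulation):</p>

```python
import math
import random

def sample_char(logits, temperature=1.0):
    """Sample an index from a temperature-scaled softmax over logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # stabilise the exponentials
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Inverse-CDF sampling over the softmax probabilities.
    r = random.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1
```
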
<p>A key preprocessing step reduces the SMILES alphabet to 23 characters. Multi-character atom tokens are replaced with single characters (<code>Cl</code> → <code>L</code>, <code>Br</code> → <code>R</code>, <code>[nH]</code> → <code>A</code>). Only the organic atom subset (<code>H</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>S</code>, <code>P</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, <code>I</code>) is retained. Charged molecules, stereo information, and molecules with more than 5 ring closures are excluded. The training corpus totals 23,664,668 characters, with 40-character windows used as input sequences during training.</p>
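<p>The alphabet reduction itself is a handful of string substitutions. The token mapping below comes from the paper; the roundtrip helper is an illustrative addition:</p>

```python
# Single-character substitutions used to shrink the SMILES alphabet.
SUBSTITUTIONS = [("[nH]", "A"), ("Cl", "L"), ("Br", "R")]

def reduce_smiles(smiles: str) -> str:
    """Map multi-character SMILES tokens to single characters."""
    for token, char in SUBSTITUTIONS:
        smiles = smiles.replace(token, char)
    return smiles

def expand_smiles(reduced: str) -> str:
    """Invert the reduction before handing strings to a SMILES parser."""
    for token, char in SUBSTITUTIONS:
        reduced = reduced.replace(char, token)
    return reduced
```
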
<h2 id="training-on-chembl-and-generating-one-million-molecules">Training on ChEMBL and Generating One Million Molecules</h2>
<h3 id="training-data">Training Data</h3>
<p>The training set consists of 509,000 bioactive molecules from ChEMBL with reported activity below 10 micromolar on any target.</p>
<h3 id="generation-and-filtering">Generation and Filtering</h3>
<p>The LSTM generates SMILES strings character by character, and each generated string passes through a two-stage validation:</p>
<ol>
<li><strong>Bracket and ring closure check</strong> (fast, text-based): 54% of generated SMILES are discarded for unpaired brackets or ring closures</li>
<li><strong>Full chemical parsing with RDKit</strong>: a further 14% fail due to unrealistic aromatic systems or incorrect valences</li>
</ol>
<p>The final yield is 32%: roughly one in three generated SMILES strings corresponds to a valid molecule.</p>
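<p>The first stage needs no chemistry at all. A sketch of such a text-level pre-filter (the paper's exact check may differ in detail): balanced parentheses and brackets, and ring-closure digits occurring in pairs:</p>

```python
def quick_smiles_check(smiles: str) -> bool:
    """Cheap text-level filter run before full chemical parsing.

    Rejects strings with unbalanced parentheses/brackets or
    ring-closure digits that do not come in pairs.
    """
    for open_c, close_c in ("()", "[]"):
        depth = 0
        for c in smiles:
            if c == open_c:
                depth += 1
            elif c == close_c:
                depth -= 1
            if depth < 0:          # closing before opening
                return False
        if depth != 0:             # unclosed bracket
            return False
    for digit in "123456789":
        if smiles.count(digit) % 2:
            return False
    return True
```
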
<p>One million valid molecules were generated in under 2 hours on 300 CPUs.</p>
<h3 id="novelty-and-diversity">Novelty and Diversity</h3>
<p>Out of one million generated molecules, only 2,774 (0.28%) were identical to molecules in the training ChEMBL set. The generated set contained 627,000 unique scaffolds compared to 172,000 in ChEMBL, with an overlap of only 18,000 scaffolds. This demonstrates substantial novelty and diversity.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Calculated molecular descriptors (molecular weight, logP, and topological polar surface area) for the generated molecules closely matched the distributions of the ChEMBL training set. The synthetic accessibility score distributions were also practically identical, indicating comparable molecular complexity.</p>
<h3 id="substructure-feature-comparison">Substructure Feature Comparison</h3>
<p>The paper compares substructure features across three molecule sets: ChEMBL training data, LSTM-generated molecules, and a naive SMILES baseline generator. The naive generator uses only character frequency statistics and basic SMILES syntax rules, producing primarily macrocycles with very few fused aromatic systems.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>ChEMBL (%)</th>
          <th>LSTM Generated (%)</th>
          <th>Naive Baseline (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No rings</td>
          <td>0.4</td>
          <td>0.4</td>
          <td>0.1</td>
      </tr>
      <tr>
          <td>1 ring</td>
          <td>2.8</td>
          <td>4.3</td>
          <td>13.2</td>
      </tr>
      <tr>
          <td>2 rings</td>
          <td>14.8</td>
          <td>23.1</td>
          <td>17.7</td>
      </tr>
      <tr>
          <td>3 rings</td>
          <td>32.2</td>
          <td>43.5</td>
          <td>27.3</td>
      </tr>
      <tr>
          <td>4 rings</td>
          <td>32.7</td>
          <td>23.9</td>
          <td>25.2</td>
      </tr>
      <tr>
          <td>&gt;4 rings</td>
          <td>17.2</td>
          <td>4.8</td>
          <td>16.5</td>
      </tr>
      <tr>
          <td>Fused aromatic rings</td>
          <td>38.8</td>
          <td>30.9</td>
          <td>0.2</td>
      </tr>
      <tr>
          <td>Large rings (&gt;8)</td>
          <td>0.4</td>
          <td>1.8</td>
          <td>75.9</td>
      </tr>
      <tr>
          <td>Spiro rings</td>
          <td>1.9</td>
          <td>0.6</td>
          <td>0.6</td>
      </tr>
      <tr>
          <td>Contains N</td>
          <td>96.5</td>
          <td>96.1</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td>Contains O</td>
          <td>93.0</td>
          <td>92.0</td>
          <td>85.5</td>
      </tr>
      <tr>
          <td>Contains S</td>
          <td>35.6</td>
          <td>27.9</td>
          <td>39.6</td>
      </tr>
      <tr>
          <td>Contains halogen</td>
          <td>40.7</td>
          <td>38.8</td>
          <td>49.4</td>
      </tr>
  </tbody>
</table>
<p>The LSTM-generated molecules closely mirror the ChEMBL distributions, while the naive generator fails to capture drug-like structural patterns. The LSTM tends to slightly over-represent 2-3 ring systems and under-represent 4+ ring systems relative to ChEMBL. Functional group distributions also closely matched between ChEMBL and the LSTM output.</p>
<h3 id="virtual-screening-validation">Virtual Screening Validation</h3>
<p>The generated molecules were evaluated using profile QSAR models for 159 ChEMBL kinase assays. The six best models (with realistic test set R-squared &gt; 0.75) were used to predict pIC50 values for both actual ChEMBL compounds and generated compounds. The cumulative frequency distributions of predicted activity were nearly identical between the two sets.</p>
<p>Kolmogorov-Smirnov (KS) tests on random samples of 1,000 compounds confirmed this quantitatively:</p>
<table>
  <thead>
      <tr>
          <th>Assay</th>
          <th>KS D</th>
          <th>Distributions Differ?</th>
          <th>Mean (Real)</th>
          <th>Mean (Gen)</th>
          <th>Stdev (Real)</th>
          <th>Stdev (Gen)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>688395</td>
          <td>6.01%</td>
          <td>No</td>
          <td>4.66</td>
          <td>4.69</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>668624</td>
          <td>3.60%</td>
          <td>No</td>
          <td>4.86</td>
          <td>4.86</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>9.90%</td>
          <td>Yes</td>
          <td>5.33</td>
          <td>5.26</td>
          <td>0.34</td>
          <td>0.30</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>4.30%</td>
          <td>No</td>
          <td>5.18</td>
          <td>5.13</td>
          <td>0.47</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>688781</td>
          <td>2.20%</td>
          <td>No</td>
          <td>4.83</td>
          <td>4.82</td>
          <td>0.26</td>
          <td>0.25</td>
      </tr>
      <tr>
          <td>809170</td>
          <td>8.70%</td>
          <td>Yes</td>
          <td>5.12</td>
          <td>5.07</td>
          <td>0.51</td>
          <td>0.46</td>
      </tr>
  </tbody>
</table>
<p>For 4 of 6 models, the null hypothesis that the distributions are the same could not be rejected at the 95% confidence level (critical D = 6.04%). Even for the two assays where the KS test rejected the null hypothesis, the maximum vertical distance between distributions was below 10%.</p>
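<p>The two-sample KS D statistic is simply the largest vertical gap between the two empirical CDFs; a pure-Python sketch:</p>

```python
def ks_d_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the maximum vertical distance
    between the two empirical cumulative distribution functions."""
    d = 0.0
    for v in set(sample_a) | set(sample_b):
        # Empirical CDF = fraction of points <= v.
        cdf_a = sum(x <= v for x in sample_a) / len(sample_a)
        cdf_b = sum(x <= v for x in sample_b) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

<p>For two samples of 1,000 compounds each, the standard large-sample approximation puts the 95% critical value near 6%, consistent with the 6.04% threshold quoted above.</p>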
<h2 id="generated-molecules-are-novel-drug-like-and-potentially-bioactive">Generated Molecules Are Novel, Drug-Like, and Potentially Bioactive</h2>
<p>The key findings of this study are:</p>
<ol>
<li><strong>High novelty</strong>: Only 0.28% of generated molecules match training compounds; 627K novel scaffolds were produced versus 172K in ChEMBL</li>
<li><strong>Drug-like quality</strong>: Physicochemical properties, substructure features, functional group distributions, and synthetic accessibility scores all closely match the ChEMBL training distribution, without these being explicit constraints</li>
<li><strong>Predicted bioactivity</strong>: Virtual screening with profile QSAR models shows the generated molecules have comparable predicted activity profiles to known bioactive compounds</li>
<li><strong>Scalability</strong>: One million valid molecules in under 2 hours on 300 CPUs, with the potential to scale to billions with GPU acceleration</li>
<li><strong>LSTM superiority over naive baselines</strong>: A simple statistical SMILES generator using only character frequencies produces chemically unrealistic molecules (mostly macrocycles), demonstrating that the LSTM genuinely learns drug-like chemical patterns</li>
</ol>
<p>The main limitations are the 32% validity rate (68% of generated SMILES are invalid), the exclusion of stereochemistry and charged molecules from the training set, and the lack of any goal-directed generation capability (the model produces unconditional samples from the training distribution). The code was described as &ldquo;available on request&rdquo; from the corresponding author rather than publicly released.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL bioactive molecules</td>
          <td>509,000 molecules</td>
          <td>Activity &lt; 10 uM on any target; organic atoms only; no charges or stereo</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Double-stacked LSTM layers with dropout</li>
<li>Softmax output over 23-character reduced SMILES alphabet</li>
<li>RMSProp optimizer with learning rate annealed from 0.01 to 0.0002</li>
<li>Temperature-based sampling at generation time</li>
<li>40-character input windows during training</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture consists of two LSTM layers, a dropout layer, and a 23-neuron dense output layer. Exact hidden unit counts and dropout rates are not specified in the paper.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES rate</td>
          <td>32%</td>
          <td>After bracket check and RDKit parsing</td>
      </tr>
      <tr>
          <td>Novelty (vs. training)</td>
          <td>99.72%</td>
          <td>Only 2,774 of 1M match ChEMBL</td>
      </tr>
      <tr>
          <td>Unique scaffolds</td>
          <td>627,000</td>
          <td>vs. 172,000 in ChEMBL</td>
      </tr>
      <tr>
          <td>KS test (4/6 assays)</td>
          <td>Not significantly different</td>
          <td>At 95% confidence</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Generation: 300 CPUs for under 2 hours (1 million valid molecules)</li>
<li>Training hardware not specified</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ertl, P., Lewis, R., Martin, E., &amp; Polyakov, V. (2017). In silico generation of novel, drug-like chemical matter using the LSTM neural network. <em>arXiv preprint</em>, arXiv:1712.07449.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ertl2017silico,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{In silico generation of novel, drug-like chemical matter using the LSTM neural network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ertl, Peter and Lewis, Richard and Martin, Eric and Polyakov, Valery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.07449}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LatentGAN: Latent-Space GAN for Molecular Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/</guid><description>LatentGAN combines a SMILES heteroencoder with a Wasserstein GAN to generate novel drug-like molecules in latent space, avoiding SMILES syntax issues.</description><content:encoded><![CDATA[<h2 id="a-gan-operating-in-learned-latent-space-for-molecular-design">A GAN Operating in Learned Latent Space for Molecular Design</h2>
<p>LatentGAN is a <strong>Method</strong> paper that introduces a two-stage architecture for de novo molecular generation. The first stage trains a heteroencoder to map SMILES strings into a continuous latent vector space. The second stage trains a Wasserstein GAN with gradient penalty (WGAN-GP) to generate new latent vectors that, when decoded, produce valid and novel molecular structures. The key contribution is decoupling the GAN from direct SMILES string generation, allowing the adversarial training to focus on learning the distribution of molecular latent representations rather than character-level sequence generation.</p>
<h2 id="limitations-of-direct-smiles-generation-with-gans">Limitations of Direct SMILES Generation with GANs</h2>
<p>Prior GAN-based molecular generation methods such as ORGAN and ORGANIC operated directly on SMILES strings. This created a fundamental challenge: the generator had to simultaneously learn valid SMILES syntax and the distribution of chemically meaningful molecules. ORGAN struggled with optimizing discrete molecular properties like Lipinski&rsquo;s Rule of Five, while ORGANIC showed limited success beyond the QED drug-likeness score. Other approaches (RANC, ATNC) substituted more advanced recurrent architectures but still operated in the discrete SMILES space.</p>
<p>Meanwhile, variational autoencoders (VAEs) demonstrated that working in continuous latent space could enable molecular generation, but they relied on forcing the latent distribution to match a Gaussian prior through KL divergence. This assumption is not necessarily appropriate for chemical space, which is inherently discontinuous.</p>
<p>RNN-based methods with transfer learning offered an alternative for target-biased generation, but the authors hypothesized that combining GANs with learned latent representations could produce complementary chemical space coverage.</p>
<h2 id="heteroencoder-plus-wasserstein-gan-architecture">Heteroencoder Plus Wasserstein GAN Architecture</h2>
<p>The core innovation of LatentGAN is separating molecular representation learning from adversarial generation through a two-component pipeline.</p>
<h3 id="heteroencoder">Heteroencoder</h3>
<p>The heteroencoder is an autoencoder trained on pairs of different non-canonical (randomized) SMILES representations of the same molecule. This is distinct from a standard autoencoder because the input and target SMILES are different representations of the same structure.</p>
<p>The encoder uses a two-layer bidirectional LSTM with 512 units per layer (256 forward, 256 backward). The concatenated output feeds into a 512-dimensional feed-forward layer. During training, zero-centered Gaussian noise with $\sigma = 0.1$ is added to the latent vector as regularization. The decoder is a four-layer unidirectional LSTM with a softmax output layer. Batch normalization with momentum 0.9 is applied to all hidden layers except the noise layer.</p>
<p>Training uses teacher forcing with categorical cross-entropy loss for 100 epochs. The learning rate starts at $10^{-3}$ for the first 50 epochs and decays exponentially to $10^{-6}$ by the final epoch. After training, the noise layer is deactivated for deterministic encoding and decoding.</p>
<p>An important design choice is that the heteroencoder makes no assumption about the latent space distribution (unlike VAEs with their KL divergence term). The latent space is shaped purely by reconstruction loss, and the GAN later learns to sample from this unconstrained distribution.</p>
<h3 id="wasserstein-gan-with-gradient-penalty">Wasserstein GAN with Gradient Penalty</h3>
<p>The GAN uses the WGAN-GP formulation. The critic (discriminator) consists of three feed-forward layers of 256 dimensions each with leaky ReLU activations (no activation on the final layer). The generator has five feed-forward layers of 256 dimensions each with batch normalization and leaky ReLU between layers.</p>
<p>The training ratio is 5:1, with five critic updates for every generator update. The generator takes random vectors sampled from a uniform distribution and learns to produce latent vectors indistinguishable from the real encoded molecular latent vectors.</p>
<p>The WGAN-GP loss for the critic is:</p>
<p>$$L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$</p>
<p>where $\lambda$ is the gradient penalty coefficient, $\mathbb{P}_r$ is the real data distribution (encoded latent vectors), $\mathbb{P}_g$ is the generator distribution, and $\mathbb{P}_{\hat{x}}$ samples uniformly along straight lines between pairs of real and generated points.</p>
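<p>To make the three loss terms concrete, here is a toy numerical sketch using a linear critic, whose input-gradient is constant and therefore available in closed form; real implementations obtain $\nabla_{\hat{x}} D(\hat{x})$ by automatic differentiation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.normal(size=8)                 # toy linear critic D(x) = w @ x
D = lambda x: x @ w

real = rng.normal(size=(16, 8))        # encoded latent vectors ~ P_r
fake = rng.normal(size=(16, 8))        # generator outputs ~ P_g

# Sample uniformly along straight lines between real/generated pairs (P_xhat)
eps = rng.uniform(size=(16, 1))
x_hat = eps * real + (1 - eps) * fake

lam = 10.0                             # gradient penalty coefficient lambda
# For a linear critic the input-gradient at every x_hat is just w
grad_norms = np.full(len(x_hat), np.linalg.norm(w))
penalty = lam * np.mean((grad_norms - 1.0) ** 2)

critic_loss = D(fake).mean() - D(real).mean() + penalty
print(critic_loss)
```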
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>At inference time, the full pipeline operates as: (1) sample a random vector, (2) pass through the trained generator to produce a latent vector, (3) decode the latent vector into a SMILES string using the pretrained heteroencoder decoder.</p>
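<p>A minimal sketch of these three steps (the function bodies below are placeholders, not the trained networks):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def generator(z):
    """Placeholder for the trained WGAN-GP generator: random vector -> 512-d latent."""
    return np.tanh(z).mean() * np.ones(512)

def decode(latent):
    """Placeholder for the pretrained heteroencoder decoder: latent -> SMILES."""
    return "CCO" if latent.mean() > 0 else "c1ccccc1"

z = rng.uniform(-1, 1, size=128)  # (1) sample a random vector
latent = generator(z)             # (2) generator produces a latent vector
smiles = decode(latent)           # (3) decoder emits a SMILES string
print(smiles)
```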
<h2 id="experiments-on-drug-like-and-target-biased-generation">Experiments on Drug-Like and Target-Biased Generation</h2>
<h3 id="datasets">Datasets</h3>
<p>The heteroencoder was trained on 1,347,173 SMILES from ChEMBL 25, standardized with MolVS and restricted to molecules with atoms from {H, C, N, O, S, Cl, Br} and at most 50 heavy atoms.</p>
<p>For general drug-like generation, a random subset of 100,000 ChEMBL compounds was used to train the GAN model for 30,000 epochs.</p>
<p>For target-biased generation, three datasets were extracted from ExCAPE-DB for EGFR, HTR1A, and S1PR1 targets. These were clustered into training and test sets to ensure chemical series were not split across sets.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Training Set</th>
          <th>Test Set</th>
          <th>SVM ROC-AUC</th>
          <th>SVM Kappa</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>2,949</td>
          <td>2,326</td>
          <td>0.850</td>
          <td>0.56</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>48,283</td>
          <td>23,048</td>
          <td>0.993</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>49,381</td>
          <td>23,745</td>
          <td>0.995</td>
          <td>0.91</td>
      </tr>
  </tbody>
</table>
<p>SVM target prediction models using 2048-bit FCFP6 fingerprints were built with scikit-learn to evaluate generated compounds.</p>
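<p>A minimal version of this evaluation setup, with random bit vectors standing in for the FCFP6 fingerprints (the real features would come from a cheminformatics toolkit such as RDKit):</p>

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Random 2048-bit vectors stand in for FCFP6 fingerprints of molecules
X = rng.integers(0, 2, size=(200, 2048)).astype(float)
y = rng.integers(0, 2, size=200)          # active / inactive labels

clf = SVC(probability=True).fit(X, y)     # scikit-learn SVM, as in the paper
proba = clf.predict_proba(X[:5])[:, 1]    # predicted probability of activity
print(proba.shape)
```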
<h3 id="baselines">Baselines</h3>
<p>RNN-based generative models with transfer learning served as the primary baseline. A prior RNN model was trained on the same ChEMBL set, then fine-tuned on each target dataset. The LatentGAN was also benchmarked on the MOSES platform against VAE, JTN-VAE, and AAE architectures.</p>
<h3 id="heteroencoder-performance">Heteroencoder Performance</h3>
<p>The heteroencoder achieved 99% valid SMILES on the training set and 98% on the test set. Reconstruction error (decoding to a different molecule) was 18% on training and 20% on test. Notably, decoding to a different valid SMILES of the same molecule is not counted as an error.</p>
<h3 id="target-biased-generation-results">Target-Biased Generation Results</h3>
<p>From 50,000 sampled SMILES per target model:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Arch.</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>Active (%)</th>
          <th>Recovered Actives (%)</th>
          <th>Recovered Neighbors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>GAN</td>
          <td>86</td>
          <td>56</td>
          <td>97</td>
          <td>71</td>
          <td>5.26</td>
          <td>196</td>
      </tr>
      <tr>
          <td>EGFR</td>
          <td>RNN</td>
          <td>96</td>
          <td>46</td>
          <td>95</td>
          <td>65</td>
          <td>7.74</td>
          <td>238</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>GAN</td>
          <td>86</td>
          <td>66</td>
          <td>95</td>
          <td>71</td>
          <td>5.05</td>
          <td>284</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>RNN</td>
          <td>96</td>
          <td>50</td>
          <td>90</td>
          <td>81</td>
          <td>7.28</td>
          <td>384</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>GAN</td>
          <td>89</td>
          <td>31</td>
          <td>98</td>
          <td>44</td>
          <td>0.93</td>
          <td>24</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>RNN</td>
          <td>97</td>
          <td>35</td>
          <td>97</td>
          <td>65</td>
          <td>3.72</td>
          <td>43</td>
      </tr>
  </tbody>
</table>
<h3 id="moses-benchmark">MOSES Benchmark</h3>
<p>On the MOSES benchmark (trained on a ZINC subset of 1,584,663 compounds, sampled 30,000 SMILES), LatentGAN showed comparable or better results than JTN-VAE and AAE on Fréchet ChemNet Distance (FCD), Fragment similarity, and Scaffold similarity, while producing slightly worse nearest-neighbor cosine similarity (SNN). The standard VAE showed signs of mode collapse with high test metric overlap and low novelty.</p>
<h2 id="complementary-generation-and-drug-likeness-preservation">Complementary Generation and Drug-Likeness Preservation</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Validity and novelty</strong>: LatentGAN achieved 86-89% validity on target-biased tasks (lower than RNN&rsquo;s 96-97%) but produced higher uniqueness on two of three targets and comparable or higher novelty (95-98%).</p>
<p><strong>Complementary chemical space</strong>: The overlap between LatentGAN-generated and RNN-generated active compounds was very small at both the compound and scaffold levels. A probabilistic analysis showed that, even with extended sampling, the RNN model would be unlikely to cover the LatentGAN output space. This suggests the two architectures can work complementarily in de novo design campaigns.</p>
<p><strong>Drug-likeness</strong>: QED score distributions of LatentGAN-generated compounds closely matched training set distributions across all three targets, with training compounds showing only slightly higher drug-likeness. SA score distributions were similarly well-preserved.</p>
<p><strong>Chemical space coverage</strong>: PCA analysis using MQN fingerprints confirmed that generated compounds occupy most of the chemical space of the training sets. Some regions of the PCA plots contained compounds predicted as inactive, which corresponded to non-drug-like outliers in the training data.</p>
<p><strong>Novel scaffolds</strong>: About 14% of scaffolds in the sampled sets had similarity below 0.4 to the training set across all three targets, indicating LatentGAN can generate genuinely novel chemical scaffolds. Around 5% of generated compounds were identical to training set compounds, while 21-25% had Tanimoto similarity below 0.4.</p>
<h3 id="limitations">Limitations</h3>
<p>The paper acknowledges several limitations. The 18-20% heteroencoder reconstruction error means a non-trivial fraction of encoded molecules decode to different structures. Validity rates (86-89%) are lower than RNN baselines (96-97%). The S1PR1 target showed notably lower uniqueness (31%) and predicted activity (44%) compared to the other targets, possibly due to the smaller effective training set of active compounds. The paper does not report specific hardware requirements or training times. No wet-lab experimental validation of generated compounds was performed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors envision LatentGAN as a complementary tool to existing RNN-based generative models, with the two architectures covering different regions of chemical space. The approach of operating in learned latent space rather than directly on SMILES strings offers a general framework that could be extended to other molecular representations or generation objectives.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heteroencoder training</td>
          <td>ChEMBL 25 (subset)</td>
          <td>1,347,173 SMILES</td>
          <td>Standardized with MolVS; atoms restricted to H, C, N, O, S, Cl, Br; max 50 heavy atoms</td>
      </tr>
      <tr>
          <td>General GAN training</td>
          <td>ChEMBL 25 (random subset)</td>
          <td>100,000</td>
          <td>Subset of heteroencoder training set</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (EGFR)</td>
          <td>2,949 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (HTR1A)</td>
          <td>48,283 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (S1PR1)</td>
          <td>49,381 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>ZINC (MOSES subset)</td>
          <td>1,584,663</td>
          <td>Canonical SMILES</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Heteroencoder</strong>: Bidirectional LSTM encoder (2 layers, 512 units) + unidirectional LSTM decoder (4 layers), trained with teacher forcing and categorical cross-entropy for 100 epochs</li>
<li><strong>GAN</strong>: WGAN-GP with 5:1 critic-to-generator training ratio. General model trained 30,000 epochs; target models trained 10,000 epochs</li>
<li><strong>Evaluation</strong>: SVM classifiers with FCFP6 fingerprints (2048 bits) for activity prediction; MQN fingerprints for PCA-based chemical space analysis; Murcko scaffolds for scaffold-level analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Heteroencoder: 512-dim latent space, bidirectional LSTM encoder, unidirectional LSTM decoder</li>
<li>Generator: 5 feed-forward layers of 256 dims with batch norm and leaky ReLU</li>
<li>Critic: 3 feed-forward layers of 256 dims with leaky ReLU</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LatentGAN (EGFR)</th>
          <th>RNN Baseline (EGFR)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>86%</td>
          <td>96%</td>
          <td>Percent valid SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>56%</td>
          <td>46%</td>
          <td>Percent unique among valid</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>97%</td>
          <td>95%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Predicted active</td>
          <td>71%</td>
          <td>65%</td>
          <td>By SVM model</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Dierme/latent-gan">LatentGAN source code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Includes trained heteroencoder model and training sets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Prykhodko, O., Johansson, S.V., Kotsias, P.-C., Arús-Pous, J., Bjerrum, E.J., Engkvist, O., &amp; Chen, H. (2019). A de novo molecular generation method using latent vector based generative adversarial network. <em>Journal of Cheminformatics</em>, 11(1), 74. <a href="https://doi.org/10.1186/s13321-019-0397-9">https://doi.org/10.1186/s13321-019-0397-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{prykhodko2019latentgan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A de novo molecular generation method using latent vector based generative adversarial network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Prykhodko, Oleksii and Johansson, Simon Viet and Kotsias, Panagiotis-Christos and Ar{\&#39;u}s-Pous, Josep and Bjerrum, Esben Jannik and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{74}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0397-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Grammar VAE: Generating Valid Molecules via CFGs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/</guid><description>The Grammar VAE encodes and decodes molecular parse trees from context-free grammars, guaranteeing syntactically valid SMILES outputs during generation.</description><content:encoded><![CDATA[<h2 id="a-grammar-constrained-vae-for-discrete-data-generation">A Grammar-Constrained VAE for Discrete Data Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Grammar Variational Autoencoder (GVAE), a variant of the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder</a> that operates directly on parse trees from context-free grammars (CFGs) rather than on raw character sequences. The primary contribution is a decoding mechanism that uses a stack and grammar-derived masks to restrict the output at every timestep to only syntactically valid production rules. This guarantees that every decoded output is a valid string under the grammar, addressing a fundamental limitation of character-level VAEs when applied to structured discrete data such as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> molecular strings and arithmetic expressions.</p>
<h2 id="why-character-level-vaes-fail-on-structured-discrete-data">Why Character-Level VAEs Fail on Structured Discrete Data</h2>
<p>Generative models for continuous data (images, audio) had achieved impressive results by 2017, but generating structured discrete data remained difficult. The key challenge is that string representations of molecules and mathematical expressions are brittle: small perturbations to a character sequence often produce invalid outputs. <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> demonstrated a character-level VAE (CVAE) for SMILES strings that could encode molecules into a continuous latent space and decode them back, enabling latent-space optimization for molecular design. However, the CVAE frequently decoded latent points into strings that were not valid SMILES, particularly when exploring regions of latent space far from training data.</p>
<p>The fundamental issue is that character-level decoders must implicitly learn the syntactic rules of the target language from data alone. For SMILES, this includes matching parentheses, valid atom types, proper bonding, and ring closure notation. The GVAE addresses this by giving the decoder explicit knowledge of the grammar, so it can focus entirely on learning the semantic structure of the data.</p>
<h2 id="core-innovation-stack-based-grammar-masking-in-the-decoder">Core Innovation: Stack-Based Grammar Masking in the Decoder</h2>
<p>The GVAE encodes and decodes sequences of production rules from a context-free grammar rather than sequences of characters.</p>
<p><strong>Encoding.</strong> Given an input string (e.g., a SMILES molecule), the encoder first parses it into a parse tree using the CFG, then performs a left-to-right pre-order traversal of the tree to extract an ordered sequence of production rules. Each rule is represented as a one-hot vector of dimension $K$ (total number of production rules in the grammar). The resulting $T(\mathbf{X}) \times K$ matrix is processed by a convolutional neural network to produce the mean and variance of a Gaussian posterior $q_{\phi}(\mathbf{z} \mid \mathbf{X})$.</p>
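<p>The rule-sequence encoding amounts to building a zero-padded one-hot matrix; a toy grammar size and rule sequence are used below for illustration:</p>

```python
import numpy as np

K, T_max = 4, 10              # number of production rules, max sequence length
rule_seq = [0, 1, 3, 2]       # pre-order rule indices from the parse tree (toy)

X = np.zeros((T_max, K))
X[np.arange(len(rule_seq)), rule_seq] = 1.0   # one row per rule, zero-padded
print(X.shape, int(X.sum()))
```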
<p><strong>Decoding with grammar masks.</strong> The decoder maps a latent vector $\mathbf{z}$ through an RNN to produce a matrix of logits $\mathbf{F} \in \mathbb{R}^{T_{max} \times K}$. The key innovation is a last-in first-out (LIFO) stack that tracks the current parsing state. At each timestep $t$, the decoder:</p>
<ol>
<li>Pops the top non-terminal $\alpha$ from the stack</li>
<li>Applies a fixed binary mask $\mathbf{m}_{\alpha} \in \{0, 1\}^K$ that zeros out all production rules whose left-hand side is not $\alpha$</li>
<li>Samples a production rule from the masked softmax distribution:</li>
</ol>
<p>$$
p(\mathbf{x}_{t} = k \mid \alpha, \mathbf{z}) = \frac{m_{\alpha,k} \exp(f_{tk})}{\sum_{j=1}^{K} m_{\alpha,j} \exp(f_{tj})}
$$</p>
<ol start="4">
<li>Pushes the right-hand-side non-terminals of the selected rule onto the stack (right-to-left, so the leftmost is on top)</li>
</ol>
<p>This process continues until the stack is empty or $T_{max}$ timesteps are reached. Because the mask restricts selection to only those rules applicable to the current non-terminal, every generated sequence of production rules is guaranteed to be a valid derivation under the grammar.</p>
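<p>Steps 1&ndash;4 can be sketched directly. The toy grammar and random logits below are illustrative; only the stack-and-mask logic follows the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CFG: each rule is (left-hand side, right-hand-side non-terminals)
rules = [("S", ["S", "T"]),   # S -> S + T
         ("S", ["T"]),        # S -> T
         ("T", []),           # T -> 'x'
         ("T", [])]           # T -> '1'
K = len(rules)

def masked_sample(logits, alpha):
    mask = np.array([1.0 if lhs == alpha else 0.0 for lhs, _ in rules])
    p = mask * np.exp(logits)
    p /= p.sum()                             # masked softmax over rules
    return rng.choice(K, p=p)

def decode(logits_seq):
    stack, derivation = ["S"], []
    for logits in logits_seq:
        if not stack:                        # empty stack: derivation complete
            break
        alpha = stack.pop()                  # (1) pop top non-terminal
        k = masked_sample(logits, alpha)     # (2)-(3) mask logits, sample rule
        derivation.append(k)
        stack.extend(reversed(rules[k][1]))  # (4) push RHS, leftmost on top
    return derivation, stack                 # non-empty stack at T_max => invalid

derivation, leftover = decode(rng.normal(size=(20, K)))
print(derivation, "complete" if not leftover else "incomplete")
```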
<p><strong>Training.</strong> The model is trained by maximizing the ELBO:</p>
<p>$$
\mathcal{L}(\phi, \theta; \mathbf{X}) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{X})} \left[ \log p_{\theta}(\mathbf{X}, \mathbf{z}) - \log q_{\phi}(\mathbf{z} \mid \mathbf{X}) \right]
$$</p>
<p>where the likelihood factorizes as:</p>
<p>$$
p(\mathbf{X} \mid \mathbf{z}) = \prod_{t=1}^{T(\mathbf{X})} p(\mathbf{x}_{t} \mid \mathbf{z})
$$</p>
<p>During training, the masks at each timestep are determined by the ground-truth production rule sequence, so no stack simulation is needed. The stack-based decoding is only required at generation time.</p>
<p><strong>Syntactic vs. semantic validity.</strong> The grammar guarantees syntactic validity but not semantic validity. The GVAE can still produce chemically implausible molecules (e.g., an oxygen atom with three bonds) because such constraints are not context-free. SMILES ring-bond digit matching is also not context-free, so the grammar cannot enforce it. Additionally, sequences that have not emptied the stack by $T_{max}$ are marked invalid.</p>
<h2 id="experiments-on-symbolic-regression-and-molecular-optimization">Experiments on Symbolic Regression and Molecular Optimization</h2>
<p>The authors evaluate the GVAE on two domains: arithmetic expressions and molecules. Both use Bayesian optimization (BO) over the learned latent space.</p>
<p><strong>Setup.</strong> After training each VAE, the authors encode training data into latent vectors and train a sparse Gaussian process (SGP) with 500 inducing points to predict properties from latent representations. They then run batch BO with expected improvement, selecting 50 candidates per iteration.</p>
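<p>Expected improvement has a closed form under the GP&rsquo;s Gaussian predictive distribution. The sketch below states it for minimization (as in the expression task); this is the standard formula, not code from the paper:</p>

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI for minimization: E[max(f_best - f, 0)] with f ~ N(mu, sigma^2)."""
    if sigma <= 0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))         # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # standard normal PDF
    return (f_best - mu) * Phi + sigma * phi

# Lower predicted mean and higher uncertainty both raise the acquisition value
print(expected_improvement(mu=0.5, sigma=0.2, f_best=1.0))
```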
<h3 id="arithmetic-expressions">Arithmetic Expressions</h3>
<ul>
<li><strong>Data</strong>: 100,000 randomly generated univariate expressions from a simple grammar (3 binary operators, 2 unary operators, 3 constants), each with at most 15 production rules</li>
<li><strong>Target</strong>: Find an expression minimizing $\log(1 + \text{MSE})$ against the true function $1/3 + x + \sin(x \cdot x)$</li>
<li><strong>BO iterations</strong>: 5, averaged over 10 repetitions</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.99 +/- 0.01</td>
          <td>3.47 +/- 0.24</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.86 +/- 0.06</td>
          <td>4.75 +/- 0.25</td>
      </tr>
  </tbody>
</table>
<p>The GVAE&rsquo;s best expression ($x/1 + \sin(3) + \sin(x \cdot x)$, score 0.04) nearly exactly recovers the true function, while the CVAE&rsquo;s best ($x \cdot 1 + \sin(3) + \sin(3/1)$, score 0.39) misses the sinusoidal component.</p>
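<p>Because the two expressions differ only by the constant $\sin(3) - 1/3$, the reported score is easy to verify numerically (the evaluation grid below is an assumption, though the score is grid-independent in this case):</p>

```python
import numpy as np

x = np.linspace(-10, 10, 1000)              # evaluation grid (assumed)
true = 1/3 + x + np.sin(x * x)              # target function
best = x/1 + np.sin(3) + np.sin(x * x)      # GVAE's best expression

score = np.log(1 + np.mean((true - best) ** 2))
print(round(score, 2))  # matches the reported 0.04
```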
<h3 id="molecular-optimization">Molecular Optimization</h3>
<ul>
<li><strong>Data</strong>: 250,000 SMILES strings from the ZINC database</li>
<li><strong>Target</strong>: Maximize penalized logP (water-octanol partition coefficient penalized for ring size and synthetic accessibility)</li>
<li><strong>BO iterations</strong>: 10, averaged over 5 trials</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.31 +/- 0.07</td>
          <td>-9.57 +/- 1.77</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.17 +/- 0.05</td>
          <td>-54.66 +/- 2.66</td>
      </tr>
  </tbody>
</table>
<p>The GVAE produces roughly twice as many valid molecules as the CVAE and finds molecules with substantially better penalized logP scores (best: 2.94 vs. 1.98).</p>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>Interpolation experiments show that the GVAE produces valid outputs at every intermediate point when linearly interpolating between two encoded expressions, while the CVAE passes through invalid strings. Grid searches around encoded molecules in the GVAE latent space show smooth transitions where neighboring points differ by single atoms.</p>
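<p>The interpolation itself is a simple linear walk in latent space; each intermediate vector would then be decoded (decoder omitted in this sketch):</p>

```python
import numpy as np

def interpolate(z_a, z_b, steps=7):
    """Evenly spaced points on the segment between two encoded latent vectors."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1 - t) * z_a + t * z_b for t in ts]

z_a, z_b = np.zeros(56), np.ones(56)   # 56-dim molecular latent space
path = interpolate(z_a, z_b)
print(len(path))
```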
<h3 id="predictive-performance">Predictive Performance</h3>
<p>Sparse GP models trained on GVAE latent features achieve better test RMSE and log-likelihood than those trained on CVAE features for both expressions and molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE (Expressions)</th>
          <th>CVAE (Expressions)</th>
          <th>GVAE (Molecules)</th>
          <th>CVAE (Molecules)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Test LL</td>
          <td>-1.320 +/- 0.001</td>
          <td>-1.397 +/- 0.003</td>
          <td>-1.739 +/- 0.004</td>
          <td>-1.812 +/- 0.004</td>
      </tr>
      <tr>
          <td>Test RMSE</td>
          <td>0.884 +/- 0.002</td>
          <td>0.975 +/- 0.004</td>
          <td>1.404 +/- 0.006</td>
          <td>1.504 +/- 0.006</td>
      </tr>
  </tbody>
</table>
<h3 id="reconstruction-and-prior-sampling">Reconstruction and Prior Sampling</h3>
<p>On held-out molecules, the GVAE achieves 53.7% reconstruction accuracy vs. 44.6% for the CVAE. When sampling from the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$, 7.2% of GVAE samples are valid molecules vs. 0.7% for the CVAE.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<p><strong>Key findings.</strong> Incorporating grammar structure into the VAE decoder consistently improves validity rates, latent space smoothness, downstream predictive performance, and Bayesian optimization outcomes across both domains. The approach is general: any domain with a context-free grammar can benefit.</p>
<p><strong>Limitations acknowledged by the authors.</strong></p>
<ul>
<li>The GVAE guarantees syntactic but not semantic validity. For molecules, invalid ring-bond patterns and chemically implausible structures can still be generated.</li>
<li>The molecular validity rate during BO (31%) is substantially higher than the CVAE (17%) but still means most decoded molecules are invalid, largely due to non-context-free constraints in SMILES.</li>
<li>The approach requires a context-free grammar for the target domain, which limits applicability to well-defined formal languages.</li>
<li>Sequences that do not complete parsing within $T_{max}$ timesteps are discarded as invalid.</li>
</ul>
<p><strong>Impact.</strong> The GVAE was an influential early contribution to constrained molecular generation. It directly inspired the Syntax-Directed VAE (SD-VAE) by Dai et al. (2018), which uses attribute grammars for tighter semantic constraints, and contributed to the broader movement toward structured molecular generation methods including graph-based approaches. The paper demonstrated that encoding domain knowledge into the decoder architecture is more effective than relying on the model to learn structural constraints from data alone.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (expressions)</td>
          <td>Generated arithmetic expressions</td>
          <td>100,000</td>
          <td>Up to 15 production rules each</td>
      </tr>
      <tr>
          <td>Training (molecules)</td>
          <td>ZINC database subset</td>
          <td>250,000 SMILES</td>
          <td>Same subset as <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 1D convolutional neural network over one-hot rule sequences</li>
<li>Decoder: RNN with stack-based grammar masking</li>
<li>Latent space: 56 dimensions (molecules), isotropic Gaussian prior</li>
<li>Property predictor: Sparse Gaussian process with 500 inducing points</li>
<li>Optimization: Batch Bayesian optimization with expected improvement, 50 candidates per iteration, Kriging Believer for batch selection</li>
</ul>
<h3 id="models">Models</h3>
<p>Architecture details follow <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> with modifications for grammar-based encoding/decoding. Specific layer sizes and hyperparameters are described in the supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE</th>
          <th>CVAE</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid (expressions)</td>
          <td>0.99</td>
          <td>0.86</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Fraction valid (molecules)</td>
          <td>0.31</td>
          <td>0.17</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Best penalized logP</td>
          <td>2.94</td>
          <td>1.98</td>
          <td>Best molecule found</td>
      </tr>
      <tr>
          <td>Reconstruction accuracy</td>
          <td>53.7%</td>
          <td>44.6%</td>
          <td>On held-out molecules</td>
      </tr>
      <tr>
          <td>Prior validity</td>
          <td>7.2%</td>
          <td>0.7%</td>
          <td>Sampling from N(0,I)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mkusner/grammarVAE">grammarVAE</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kusner, M. J., Paige, B., &amp; Hernández-Lobato, J. M. (2017). Grammar Variational Autoencoder. <em>Proceedings of the 34th International Conference on Machine Learning (ICML)</em>, 1945-1954.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kusner2017grammar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Grammar Variational Autoencoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kusner, Matt J. and Paige, Brooks and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 34th International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1945--1954}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v2: Pareto Multi-Objective RL for Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</guid><description>DrugEx v2 extends RNN-based de novo drug design with Pareto ranking and evolutionary exploration for multi-objective molecule generation.</description><content:encoded><![CDATA[<h2 id="multi-objective-de-novo-drug-design-with-pareto-optimization">Multi-Objective De Novo Drug Design with Pareto Optimization</h2>
<p>This is a <strong>Method</strong> paper that extends the DrugEx framework (v1) to handle multi-objective optimization in de novo drug design. The primary contribution is integrating Pareto-based ranking with evolutionary algorithm concepts (crossover and mutation) into an RNN-based reinforcement learning pipeline. The system generates <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based molecules optimized simultaneously for activity toward multiple protein targets while avoiding off-targets, addressing polypharmacology scenarios where drugs must bind multiple specific receptors.</p>
<h2 id="polypharmacology-and-the-limits-of-single-objective-generation">Polypharmacology and the Limits of Single-Objective Generation</h2>
<p>Traditional drug discovery follows the &ldquo;one drug, one target, one disease&rdquo; paradigm, but drug molecules interact with an average of six protein targets. Off-target binding causes side effects that remain a leading cause of clinical failure and post-approval drug withdrawals (over 500 drugs withdrawn due to fatal toxicity). Complex diseases often require modulating multiple targets simultaneously, making polypharmacology an important design objective.</p>
<p>Prior deep learning approaches for de novo design, including DrugEx v1, focused on generating molecules active against a single target. Extending these methods to multiple objectives introduces fundamental challenges: objectives are often contradictory (high affinity for one target may correlate with high affinity for an undesired off-target), and naive weighted-sum approaches can collapse diversity by over-optimizing a single dominant objective. The authors specifically target the <a href="https://en.wikipedia.org/wiki/Adenosine_receptor">adenosine receptor</a> system, where $A_1AR$ and $A_{2A}AR$ selectivity profiles matter for therapeutic efficacy, and <a href="https://en.wikipedia.org/wiki/HERG">hERG</a> channel binding must be avoided to prevent cardiac toxicity.</p>
<h2 id="evolutionary-exploration-and-pareto-ranking-in-rl">Evolutionary Exploration and Pareto Ranking in RL</h2>
<p>The core innovation of DrugEx v2 has two components: an evolutionary exploration strategy and Pareto-based reward assignment.</p>
<h3 id="evolutionary-exploration-strategy">Evolutionary Exploration Strategy</h3>
<p>The generation process uses three RNN networks with identical LSTM architectures:</p>
<ul>
<li><strong>Agent net</strong> ($G_A$): the primary generator, updated at each training epoch via policy gradient</li>
<li><strong>Crossover net</strong> ($G_C$): initialized from the fine-tuned model, updated iteratively from $G_A$ after each convergence period</li>
<li><strong>Mutation net</strong> ($G_M$): initialized from the pre-trained model, parameters fixed throughout training</li>
</ul>
<p>At each token-generation step, a random number determines whether the token probability comes from the combination of $G_A$ and $G_C$ (with probability $1 - \varepsilon$) or from $G_M$ (with probability $\varepsilon$). This mirrors crossover and mutation operations from evolutionary algorithms, maintaining diversity while steering toward desired properties.</p>
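<p>A minimal sketch of this ε-mixed sampling step in pure Python (the equal-weight average of the agent and crossover distributions, and the <code>sample_token</code> helper name, are assumptions for illustration):</p>

```python
import random

def sample_token(p_agent, p_cross, p_mut, eps=1e-2, rng=random):
    """Pick the next token: with probability eps sample from the mutation
    net's distribution, otherwise from a combination of the agent and
    crossover nets (a simple average here -- an illustrative assumption)."""
    if rng.random() < eps:
        probs = p_mut                                   # "mutation"
    else:
        probs = [(a + c) / 2.0 for a, c in zip(p_agent, p_cross)]  # "crossover"
    # sample an index from the chosen categorical distribution
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

Small ε keeps generation close to the policy being optimized while occasionally injecting tokens from the fixed pre-trained net, which is what preserves diversity.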
<h3 id="pareto-front-reward-scheme">Pareto Front Reward Scheme</h3>
<p>For $n$ objectives (three in this study: $A_1AR$, $A_{2A}AR$, hERG), each molecule receives a score $R_i$ based on its predicted bioactivity:</p>
<p>$$
R_{i} = \begin{cases} \text{minmax}(pX_{i}), &amp; \text{if high affinity required} \\ 1 - \text{minmax}(pX_{i}), &amp; \text{if low affinity required} \\ 0, &amp; \text{if SMILES invalid} \end{cases}
$$</p>
<p>where $pX_i$ is the predicted bioactivity (range 3.0 to 10.0), normalized to [0, 1].</p>
<p>For the multi-target case, high affinity is required for both $A_1AR$ and $A_{2A}AR$ while low affinity is required for hERG. For the target-specific case, high affinity is required only for $A_{2A}AR$ while low affinity is required for both $A_1AR$ and hERG.</p>
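<p>The per-objective score is simple to express directly (the <code>objective_score</code> helper name is illustrative, not from the paper):</p>

```python
def minmax(px, lo=3.0, hi=10.0):
    """Normalize a predicted pX in [3.0, 10.0] to [0, 1]."""
    return (px - lo) / (hi - lo)

def objective_score(px, want_high, valid=True):
    """Score R_i from the case equation above: high-affinity objectives use
    minmax(pX), low-affinity objectives use 1 - minmax(pX), invalid SMILES get 0."""
    if not valid:
        return 0.0
    s = minmax(px)
    return s if want_high else 1.0 - s
```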
<p>Molecules are ranked using a <a href="https://en.wikipedia.org/wiki/Multi-objective_optimization">non-dominated sorting</a> algorithm to construct Pareto fronts. Within each front, molecules are ranked by average Tanimoto distance (using ECFP6 fingerprints) rather than crowding distance, favoring chemically diverse solutions. The final reward is:</p>
<p>$$
R_i^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>where $k$ is the molecule&rsquo;s index in the Pareto rank. Rewards for undesired and desired solutions are distributed in $(0, 0.5]$ and $(0.5, 1.0]$, respectively.</p>
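<p>A compact sketch of this reward assignment (pure Python; the paper additionally orders molecules <em>within</em> each front by average Tanimoto distance, which is omitted here, and the sketch assumes both groups are non-empty):</p>

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every objective, better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_rank_rewards(scores, desired):
    """Rank-based rewards from the case equation above.
    scores  : per-objective score tuples, oriented so higher is better
    desired : True if the molecule meets all property thresholds"""
    n = len(scores)
    remaining, front_of, f = set(range(n)), {}, 0
    while remaining:                      # peel off non-dominated fronts
        front = {i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)}
        for i in front:
            front_of[i] = f
        remaining -= front
        f += 1
    n_des = sum(desired)
    n_und = n - n_des
    # global index k: undesired before desired, worse fronts first,
    # so k grows with solution quality
    order = sorted(range(n), key=lambda i: (desired[i], -front_of[i]))
    rewards = [0.0] * n
    for k, i in enumerate(order, start=1):
        rewards[i] = (0.5 + (k - n_und) / (2 * n_des) if desired[i]
                      else k / (2 * n_und))
    return rewards
```

Undesired molecules land in (0, 0.5] and desired molecules in (0.5, 1.0], matching the equation above.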
<p>The agent is trained via policy gradient:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) \middle|\theta\right] = \sum_{t=1}^{T} \log G(y_t | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
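<p>The effect of this objective can be seen with a toy memoryless softmax policy (the real generator is an LSTM over SMILES tokens; <code>theta</code> and <code>lr</code> here are illustrative):</p>

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def reinforce_step(theta, tokens, reward, lr=0.5):
    """One ascent step on J for a categorical policy, using the identity
    d log softmax(theta)[y] / d theta_i = 1{i=y} - p_i."""
    for y in tokens:
        p = softmax(theta)
        theta = [t + lr * reward * ((1.0 if i == y else 0.0) - p_i)
                 for i, (t, p_i) in enumerate(zip(theta, p))]
    return theta
```

A positive sequence reward pushes probability mass toward the sampled tokens; a reward near zero leaves the policy essentially unchanged.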
<h3 id="weighted-sum-alternative">Weighted Sum Alternative</h3>
<p>The authors also implement a weighted sum (WS) scheme with dynamic weights proportional to the ratio of undesired to desired molecules per objective:</p>
<p>$$
w_i = \frac{r_i}{\sum_{k=1}^{M} r_k}, \quad R^{*} = \sum_{i=1}^{n} w_i R_i
$$</p>
<p>This auto-adjusts importance toward under-performing objectives during training.</p>
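<p>The dynamic weighting is a one-liner per equation (function names are illustrative):</p>

```python
def dynamic_weights(undesired_ratio):
    """w_i = r_i / sum_k r_k, where r_i is the fraction of generated molecules
    still undesired on objective i, so lagging objectives get more weight."""
    total = sum(undesired_ratio)
    return [r / total for r in undesired_ratio]

def weighted_sum_reward(scores, weights):
    """R* = sum_i w_i R_i."""
    return sum(w * r for w, r in zip(weights, scores))
```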
<h3 id="molecular-diversity-metric">Molecular Diversity Metric</h3>
<p>Diversity is measured using the Solow-Polasky metric adapted from ecological biodiversity:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\top} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
<p>where $F(\mathbf{s})$ is a distance matrix with entries $f(d_{ij}) = e^{-\theta d_{ij}}$ and $d_{ij}$ is the Tanimoto distance between ECFP6 fingerprints of molecules $s_i$ and $s_j$.</p>
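<p>The metric can be computed directly from a pairwise distance matrix; a self-contained sketch with a small Gauss-Jordan solver so no external libraries are needed:</p>

```python
import math

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def solow_polasky_diversity(D, theta=1.0):
    """I(A) = (1/|A|) e^T F^{-1} e with F_ij = exp(-theta * d_ij),
    D holding pairwise Tanimoto distances between molecules."""
    F = [[math.exp(-theta * d) for d in row] for row in D]
    x = solve(F, [1.0] * len(F))   # F^{-1} e
    return sum(x) / len(F)
```

Mutually distant molecules drive the normalized value toward 1; near-duplicates drive it toward 1/|A| (and F becomes singular for exact duplicates).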
<h2 id="multi-target-and-target-specific-experiments">Multi-Target and Target-Specific Experiments</h2>
<h3 id="qsar-environment">QSAR Environment</h3>
<p>Four ML algorithms were benchmarked for the bioactivity prediction environment: Random Forest (RF), SVM, PLS, and Multi-task DNN (MT-DNN). Input features combined 2048-bit ECFP6 fingerprints with 19 physicochemical descriptors (2067D total). The training data came from ChEMBL v26: 25,731 ligands with bioactivity measurements toward $A_1AR$, $A_{2A}AR$, and hERG. RF was selected as the final predictor based on superior performance in temporal-split independent testing ($R^2$ and RMSE), prioritizing robustness over cross-validation metrics.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN generator uses six layers: input, embedding (128D), three LSTM recurrent layers (512 hidden units), and output. LSTM was chosen over GRU based on higher valid SMILES rates (97.5% vs. 93.1% for pre-trained, 97.9% vs. 95.7% for fine-tuned). Pre-training used 1.7M molecules from ChEMBL; fine-tuning used the 25,731 LIGAND set molecules.</p>
<h3 id="baselines">Baselines</h3>
<p>DrugEx v2 was compared against DrugEx v1, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, and <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, all using the same RNN architecture and pre-trained/fine-tuned models, with only the RL framework differing. Both Pareto front (PF) and weighted sum (WS) reward schemes were tested.</p>
<h3 id="multi-target-results">Multi-Target Results</h3>
<p>In the multi-target case (high affinity for $A_1AR$ and $A_{2A}AR$, low affinity for hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.57%</td>
          <td>80.81%</td>
          <td>87.29%</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.80%</td>
          <td><strong>97.45%</strong></td>
          <td>89.08%</td>
          <td>0.49</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>PF</td>
          <td>99.54%</td>
          <td>57.43%</td>
          <td><strong>98.84%</strong></td>
          <td><strong>0.77</strong></td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.84%</td>
          <td>66.01%</td>
          <td>82.67%</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>DrugEx v1</td>
          <td>PF</td>
          <td>98.28%</td>
          <td>43.27%</td>
          <td>88.96%</td>
          <td>0.71</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 achieved the highest desirability under both schemes. The WS scheme maximized desirability (97.45%) but at the cost of diversity (0.49). The PF scheme maintained higher diversity (0.70) with still-strong desirability (80.81%).</p>
<h3 id="target-specific-results">Target-Specific Results</h3>
<p>In the target-specific case (high $A_{2A}AR$, low $A_1AR$ and hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.53%</td>
          <td><strong>89.49%</strong></td>
          <td>90.55%</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.62%</td>
          <td><strong>97.86%</strong></td>
          <td>90.54%</td>
          <td>0.31</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>WS</td>
          <td>99.55%</td>
          <td>81.27%</td>
          <td>98.87%</td>
          <td>0.34</td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.29%</td>
          <td>86.98%</td>
          <td>80.30%</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 with PF achieved high desirability (89.49%) while maintaining diversity (0.73), outperforming both the WS scheme&rsquo;s diversity collapse (0.31) and competing methods.</p>
<h3 id="chemical-space-coverage">Chemical Space Coverage</h3>
<p>t-SNE visualization with ECFP6 descriptors showed that the PF scheme guided generators to cover chemical space more broadly than the WS scheme. DrugEx v1 and v2 covered nearly all of the chemical space occupied by known active ligands, while REINVENT and ORGANIC covered only partial regions in the target-specific case.</p>
<h3 id="substructure-distribution">Substructure Distribution</h3>
<p>Generated molecules were evaluated for purine ring, furan ring, and benzene ring frequencies. DrugEx v2 with PF produced substructure distributions closest to the LIGAND set, suggesting it better preserves the chemical characteristics of known active molecules compared to REINVENT (which over-represented benzene rings) and ORGANIC.</p>
<h3 id="guacamol-benchmark">GuacaMol Benchmark</h3>
<p>DrugEx v2 was tested on 20 goal-directed tasks from the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark, achieving the best score on 12 of the 20 tasks and placing second overall. The method struggled with tasks requiring contradictory objectives in narrow chemical spaces (e.g., the Sitagliptin MPO task), reflecting its emphasis on diverse feasible molecules rather than optimal individual solutions.</p>

<h2 id="diversity-desirability-trade-off-and-limitations">Diversity-Desirability Trade-off and Limitations</h2>
<p>The key finding is that the Pareto front scheme and weighted sum scheme offer complementary strengths: PF produces molecules with higher diversity and more realistic substructure distributions, while WS achieves higher raw desirability scores. The Pareto front scheme is preferred for polypharmacology applications where chemical diversity matters for lead optimization.</p>
<p>The mutation rate $\varepsilon$ controls the diversity-desirability trade-off: higher $\varepsilon$ increases diversity at the cost of desirability. The authors tested $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$ and found performance sensitive to this choice, making careful tuning important.</p>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The method is less effective for tasks with contradictory objectives in narrow chemical spaces</li>
<li>Emphasis is on generating diverse feasible molecules rather than individual optimal solutions</li>
<li>REINVENT 2.0 did not converge with the PF scheme, suggesting the Pareto approach may not be universally compatible with all RL frameworks</li>
<li>Bioactivity predictions rely on QSAR models (RF), which may not generalize perfectly to novel chemical scaffolds</li>
</ul>
<p>Future directions mentioned include adopting newer architectures (BERT, Transformer, GPT-2), handling graph and fragment representations, and integrating additional objectives like stability and synthesizability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v26 (ChEMBL set)</td>
          <td>1.7M molecules</td>
          <td>SMILES syntax learning, drug-like molecules</td>
      </tr>
      <tr>
          <td>Fine-tuning / Environment</td>
          <td>LIGAND set</td>
          <td>25,731 ligands</td>
          <td>Bioactivities for $A_1AR$, $A_{2A}AR$, hERG from ChEMBL</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>GuacaMol</td>
          <td>20 tasks</td>
          <td>Goal-directed generation tasks</td>
      </tr>
  </tbody>
</table>
<p>Active/inactive thresholds: $pX \geq 6.5$ (active), $pX &lt; 6.5$ (inactive). Low-quality data points without an exact pX value were assigned $pX = 3.99$ with a sample weight of 0.1.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>QSAR predictor</strong>: Random Forest, 1000 trees, Gini criterion. Input: 2048-bit ECFP6 + 19 physicochemical properties (2067D). MinMax normalization.</li>
<li><strong>Generator</strong>: 6-layer RNN with LSTM cells (512 hidden units), embedding dim 128, vocabulary 84 tokens. Adam optimizer, lr $10^{-3}$, batch size 512, 1000 epochs.</li>
<li><strong>RL training</strong>: Policy gradient with Pareto-based or weighted-sum reward. Mutation rates tested: $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$.</li>
<li><strong>Pareto ranking</strong>: GPU-accelerated non-dominated sorting via PyTorch. Within each front, molecules are ranked by average Tanimoto distance on ECFP6 fingerprints (in place of crowding distance).</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generator</td>
          <td>LSTM (3 layers, 512 hidden)</td>
          <td>Embedding 128D, vocab 84</td>
      </tr>
      <tr>
          <td>Predictor</td>
          <td>Random Forest</td>
          <td>1000 trees, 2067D input</td>
      </tr>
      <tr>
          <td>MT-DNN (alternative)</td>
          <td>3 hidden layers (4000, 2000, 1000)</td>
          <td>ReLU, 20% dropout</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated SMILES that parse to valid molecules</td>
      </tr>
      <tr>
          <td>Desirability</td>
          <td>Fraction of molecules meeting all activity thresholds ($pX \geq 6.5$ on-targets, $pX &lt; 6.5$ off-targets)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of non-duplicate molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Solow-Polasky metric on ECFP6 Tanimoto distances</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Synthetic accessibility (1-10, lower is easier)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative estimate of drug-likeness (0-1, higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>GPU acceleration was used for Pareto optimization via PyTorch. Specific hardware details (GPU model, training time) are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XuhanLiu/DrugEx">DrugEx GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Python, PyTorch)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v26</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source of training molecules and bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., Emmerich, M. T. M., IJzerman, A. P., &amp; van Westen, G. J. P. (2021). DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. <em>Journal of Cheminformatics</em>, 13(1), 85. <a href="https://doi.org/10.1186/s13321-021-00561-9">https://doi.org/10.1186/s13321-021-00561-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2021drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and Emmerich, Michael T. M. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{85}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-021-00561-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugChat: Conversational QA on Drug Molecule Graphs</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/drugchat-chatgpt-drug-molecule-graphs/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/drugchat-chatgpt-drug-molecule-graphs/</guid><description>DrugChat connects a GNN molecular encoder with Vicuna-13B via a linear adaptor, enabling multi-turn conversational QA about drug compound graphs.</description><content:encoded><![CDATA[<h2 id="a-prototype-for-conversational-drug-compound-analysis">A Prototype for Conversational Drug Compound Analysis</h2>
<p><strong>Method ($\Psi_{\text{Method}}$)</strong></p>
<p>DrugChat is a prototype system that enables ChatGPT-like conversational interaction with drug molecule graphs. Users upload a compound&rsquo;s molecular graph and ask free-form, multi-turn questions about its properties, mechanism of action, or therapeutic applications. The system generates natural language answers by combining a graph neural network (GNN) encoder, a large language model (LLM), and a lightweight linear adaptor that bridges the two modalities. The primary contribution is the architecture and the accompanying instruction tuning datasets (10,834 drug compounds, 143,517 QA pairs) that make this graph-to-language interaction possible.</p>
<h2 id="why-conversational-interfaces-for-drug-molecules">Why Conversational Interfaces for Drug Molecules?</h2>
<p>Drug discovery is time-intensive and expensive, often requiring years and billions of dollars to bring a single compound to market. Traditional computational chemistry tools provide specialized outputs but lack the ability to support open-ended, interactive exploration of molecular properties. Researchers working with drug compound data frequently need quick answers to diverse questions: What is the mechanism of action? Are there known drug interactions? What structural modifications could improve efficacy?</p>
<p>At the time of this work, large language models had demonstrated strong conversational capabilities for text, and multimodal extensions (MiniGPT-4, LLaVA) had connected vision encoders to LLMs. However, no system had bridged graph-structured molecular data with LLMs for interactive dialogue. DrugChat addresses this gap by proposing the first system (to the authors&rsquo; knowledge) that connects molecular graph representations directly to an LLM for multi-turn question answering.</p>
<h2 id="architecture-gnn-adaptor-llm-pipeline">Architecture: GNN-Adaptor-LLM Pipeline</h2>
<p>The core innovation is the three-component architecture and its training strategy:</p>
<p><strong>Graph Neural Network (GNN)</strong>: A pre-trained GNN from Hu et al. (2020) processes the compound&rsquo;s molecular graph. At each layer $k$, node representations are updated by aggregating features from neighboring nodes:</p>
<p>$$
h_{v}^{k} = \sigma\left(h_{v}^{k-1}, \text{AGG}\left(\left\{h_{u}^{k-1}, u \in \mathcal{N}(v)\right\}\right)\right)
$$</p>
<p>A permutation-invariant pooling function produces the graph-level representation:</p>
<p>$$
h_{G} = f\left(\left\{h_{v}^{K}, v \in G\right\}\right)
$$</p>
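<p>In sketch form, one such layer and readout look like the following (a generic mean-aggregation stand-in; the actual AGG and $\sigma$ of the pre-trained GNN from Hu et al. differ):</p>

```python
def gnn_layer(h, adj):
    """One message-passing layer: each node combines its own features with the
    mean of its neighbors' features (sigma here is simple addition)."""
    out = []
    for v, h_v in enumerate(h):
        nbrs = [h[u] for u in adj[v]] or [h_v]   # isolated node: use itself
        agg = [sum(col) / len(nbrs) for col in zip(*nbrs)]
        out.append([x + y for x, y in zip(h_v, agg)])
    return out

def readout(h):
    """Permutation-invariant mean pooling to the graph-level vector h_G."""
    return [sum(col) / len(h) for col in zip(*h)]
```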
<p><strong>Linear Adaptor</strong>: A single linear transformation matrix converts the GNN graph representation into a soft prompt vector compatible with the LLM&rsquo;s input space. This is the only component whose weights are updated during training.</p>
<p><strong>Large Language Model (Vicuna-13B)</strong>: The pre-trained Vicuna-13B model takes the transformed graph prompt vector along with user questions and generates answers. Both the GNN and LLM weights remain frozen during training.</p>
<p>The prompt template follows the Vicuna conversational format:</p>
<p>$$
\mathbf{Q}: \langle\text{Graph}\rangle\langle\text{GraphFeature}\rangle\langle/\text{Graph}\rangle\langle\text{Instruction}\rangle \quad \mathbf{A}: \langle\text{Desc}\rangle
$$</p>
<p>During training, the system minimizes a negative log-likelihood loss between generated and ground-truth answers. The entire training procedure updates only the adaptor&rsquo;s parameters, making the approach computationally lightweight compared to full fine-tuning.</p>
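<p>The trainable surface area is tiny; a sketch of the adaptor as a bare linear map (the dimensions and initialization are illustrative assumptions, since the paper does not report them):</p>

```python
import random

class LinearAdaptor:
    """The only trainable component: a single matrix W mapping a GNN graph
    embedding (d_g) into the LLM's input-embedding space (d_llm) as a soft
    prompt vector."""
    def __init__(self, d_g, d_llm, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_g)]
                  for _ in range(d_llm)]

    def __call__(self, h_graph):
        return [sum(w * h for w, h in zip(row, h_graph)) for row in self.W]

adaptor = LinearAdaptor(d_g=4, d_llm=6)
h_graph = [0.1, -0.3, 0.5, 0.2]   # stand-in for the frozen GNN's output h_G
soft_prompt = adaptor(h_graph)    # fills the <GraphFeature> slot in the prompt
```

Because the GNN and Vicuna-13B stay frozen, gradients flow only into <code>W</code>, which is why the approach is so much cheaper than full fine-tuning.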
<h2 id="instruction-tuning-datasets-from-chembl-and-pubchem">Instruction Tuning Datasets from ChEMBL and PubChem</h2>
<p>The authors constructed two instruction tuning datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Drug Compounds</th>
          <th>QA Pairs</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>3,892</td>
          <td>129,699</td>
          <td>ChEMBL database (Feb 2023)</td>
      </tr>
      <tr>
          <td>PubChem</td>
          <td>6,942</td>
          <td>13,818</td>
          <td>PubChem (May 2023)</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>10,834</strong></td>
          <td><strong>143,517</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p><strong>ChEMBL Dataset</strong>: Starting from 2,354,965 compounds in <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, the authors identified 14,816 with drug information and filtered to 3,892 with sufficient descriptive content. For each drug, they gathered <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, molecular features (formula, acid/base classification), and drug-specific properties (mechanism of action, therapeutic applications). They manually crafted QA pairs covering topics like rotatable bond count, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski rule</a> violations, <a href="https://en.wikipedia.org/wiki/Chirality_(chemistry)">chirality</a>, <a href="https://en.wikipedia.org/wiki/Polar_surface_area">polar surface area</a>, development stage, approval year, and <a href="https://en.wikipedia.org/wiki/United_States_Adopted_Name">USAN</a> classification.</p>
<p><strong>PubChem Dataset</strong>: From 66,469,244 compounds in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, 19,319 had drug information, and 6,942 were retained after filtering for detailed descriptions. Descriptions were sourced from <a href="https://en.wikipedia.org/wiki/ChEBI">ChEBI</a>, LOTUS, and YMDB databases, yielding 13,818 QA pairs primarily asking for drug descriptions.</p>
<p>The QA pairs are formulaic: the ChEMBL set covers up to 34 question types per drug (an example drug in the paper shows all 34), while PubChem questions ask for descriptive summaries from different source databases.</p>
<h2 id="qualitative-demonstrations-only">Qualitative Demonstrations Only</h2>
<p>The paper presents only qualitative results. Two demonstration examples show DrugChat answering multi-turn questions about test compounds not seen during training. Questions like &ldquo;what makes this compound unique?&rdquo; and &ldquo;what diseases can this compound potentially treat?&rdquo; are answered in natural language.</p>
<p>No systematic quantitative evaluation is reported. The authors state they &ldquo;will perform a systematic quantitative evaluation by collaborating with pharmaceutical scientists,&rdquo; but this evaluation is not included in the technical report.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors identify <strong>language hallucination</strong> as the primary limitation. Since DrugChat incorporates an LLM, it may produce convincing but incorrect text descriptions about drugs, which could mislead decision-makers in real drug discovery pipelines.</p>
<p>Proposed mitigations include:</p>
<ul>
<li>Higher-quality training data and filtering strategies</li>
<li>More advanced GNN encoders and LLMs</li>
<li>Reinforcement learning from human feedback (RLHF) as the user base grows</li>
</ul>
<p>Several additional limitations are worth noting:</p>
<ul>
<li>The QA pairs are largely factoid-style questions with short, formulaic answers, which may not capture the nuanced reasoning needed for real drug discovery tasks</li>
<li>The evaluation is entirely qualitative, with no comparison to baselines or quantitative metrics</li>
<li>The linear adaptor is a minimal alignment mechanism; it remains unclear how much molecular structural information is preserved through this single linear transformation</li>
<li>The training data covers only a small fraction of known chemical space (10,834 compounds out of millions)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL Drug Instruction Tuning</td>
          <td>3,892 drugs, 129,699 QA pairs</td>
          <td>From ChEMBL (Feb 2023 dump)</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem Drug Instruction Tuning</td>
          <td>6,942 drugs, 13,818 QA pairs</td>
          <td>From PubChem (May 2023)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>GNN</strong>: Pre-trained model from Hu et al. (2020), &ldquo;Strategies for Pre-training Graph Neural Networks&rdquo;</li>
<li><strong>Adaptor</strong>: Single linear transformation matrix (only trainable component)</li>
<li><strong>Loss</strong>: Negative log-likelihood between generated and ground-truth answers</li>
<li><strong>Training</strong>: Only adaptor weights updated; GNN and LLM weights frozen</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Model</th>
          <th>Parameters</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GNN Encoder</td>
          <td>Pre-trained GNN (Hu et al., 2020)</td>
          <td>Not specified</td>
          <td>Frozen during training</td>
      </tr>
      <tr>
          <td>LLM</td>
          <td>Vicuna-13B</td>
          <td>~13B</td>
          <td>Frozen during training</td>
      </tr>
      <tr>
          <td>Adaptor</td>
          <td>Linear projection</td>
          <td>Not specified</td>
          <td>Trained</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>No quantitative evaluation metrics are reported. The paper provides only qualitative demonstrations on unseen compounds.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware specifications are reported for training or inference.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/UCSD-AI4H/drugchat">DrugChat Code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation (repository returned 404 as of March 2026)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liang, Y., Zhang, R., Zhang, L., &amp; Xie, P. (2023). DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs. <em>arXiv preprint arXiv:2309.03907</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liang2023drugchat,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liang, Youwei and Zhang, Ruiyi and Zhang, Li and Xie, Pengtao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.03907}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugAssist: Interactive LLM Molecule Optimization</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/drugassist-llm-molecule-optimization/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/drugassist-llm-molecule-optimization/</guid><description>DrugAssist fine-tunes Llama2-7B-Chat for interactive molecule optimization via natural language dialogue, releasing the MolOpt-Instructions dataset.</description><content:encoded><![CDATA[<h2 id="an-interactive-llm-for-molecule-optimization">An Interactive LLM for Molecule Optimization</h2>
<p>DrugAssist is a <strong>Method</strong> paper that proposes an interactive molecule optimization model built by fine-tuning Llama2-7B-Chat with LoRA on a newly constructed instruction dataset. The primary contribution is twofold: (1) the MolOpt-Instructions dataset containing over one million molecule pairs with six molecular properties and three optimization task categories, and (2) a dialogue-based molecule optimization system that allows domain experts to iteratively refine molecular modifications through multi-turn natural language conversations.</p>
<h2 id="why-interactive-molecule-optimization-matters">Why Interactive Molecule Optimization Matters</h2>
<p>Molecule optimization is a core step in the drug discovery pipeline, where lead compounds must be modified to improve specific pharmacological properties while maintaining structural similarity. Existing approaches fall into sequence-based methods (treating <a href="/notes/chemistry/molecular-representations/">SMILES</a> optimization as machine translation) and graph-based methods (graph-to-graph translation), but they share a critical limitation: they are non-interactive. These models learn patterns from chemical structure data without incorporating expert feedback.</p>
<p>The drug discovery process is inherently iterative and requires integrating domain expertise. Medicinal chemists typically refine candidates through repeated cycles of suggestion, evaluation, and adjustment. Prior LLM-based approaches like <a href="/notes/chemistry/llm-applications/chatdrug-conversational-drug-editing/">ChatDrug</a> relied on prompt engineering with general-purpose models (GPT-3.5-turbo) rather than fine-tuning, limiting their optimization accuracy. Additionally, most existing molecule optimization benchmarks focus on single-property optimization with vague objectives (e.g., &ldquo;maximize QED&rdquo;), while real-world drug design requires optimizing property values within specific ranges across multiple properties simultaneously.</p>
<h2 id="instruction-based-fine-tuning-with-molopt-instructions">Instruction-Based Fine-Tuning with MolOpt-Instructions</h2>
<p>The core innovation has two components: the MolOpt-Instructions dataset construction pipeline and the multi-task instruction tuning strategy.</p>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>MolOpt-Instructions is built from one million molecules randomly sampled from the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a>. The construction workflow uses mmpdb (an open-source Matched Molecular Pair platform) to generate structurally similar molecule pairs through <a href="https://en.wikipedia.org/wiki/Matched_molecular_pair_analysis">Matched Molecular Pair Analysis (MMPA)</a>. Pairs are filtered to satisfy two criteria: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> greater than 0.65 and <a href="https://en.wikipedia.org/wiki/Partition_coefficient">logP</a> difference greater than 2.5. Property values for six properties (Solubility, BBBP, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG</a> inhibition, QED, hydrogen bond donor count, and hydrogen bond acceptor count) are computed using Tencent&rsquo;s iDrug platform. The final dataset contains 1,029,949 unique pairs covering 1,595,839 unique molecules, with mean similarity of 0.69 and mean logP difference of 2.82.</p>
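<p>The two pair-filtering criteria are easy to state concretely. Below is a minimal sketch (function names are ours, and fingerprints are represented as plain bit sets; in practice these would come from standard cheminformatics tooling rather than being hand-built):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def keep_pair(fp_a: set, fp_b: set, logp_a: float, logp_b: float,
              sim_min: float = 0.65, dlogp_min: float = 2.5) -> bool:
    """Apply the paper's two filtering criteria to a candidate MMPA pair:
    Tanimoto similarity above 0.65 and logP difference above 2.5."""
    return (tanimoto(fp_a, fp_b) > sim_min
            and abs(logp_a - logp_b) > dlogp_min)
```

<p>The dataset-level statistics quoted above (mean similarity 0.69, mean logP difference 2.82) are consistent with these thresholds.</p>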
<p>Three categories of optimization tasks are defined:</p>
<ul>
<li><strong>Loose</strong>: Increase or decrease a given property value (no threshold)</li>
<li><strong>Strict</strong>: Increase or decrease by at least a specified threshold</li>
<li><strong>Range</strong>: Optimize the property value to fall within a given interval</li>
</ul>
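<p>The three categories reduce to a single predicate on the old and new property values. A sketch (the function and argument names are ours, not the paper's):</p>

```python
def meets_criterion(old: float, new: float, task: str,
                    direction: str = "increase",
                    threshold: float = 0.0,
                    interval: tuple = (0.0, 0.0)) -> bool:
    """Check an optimized property value against one of the three
    MolOpt-Instructions task categories."""
    # Signed change in the requested direction
    delta = new - old if direction == "increase" else old - new
    if task == "loose":    # any movement in the requested direction
        return delta > 0
    if task == "strict":   # movement by at least a specified threshold
        return delta >= threshold
    if task == "range":    # final value must land inside a target interval
        lo, hi = interval
        return lo <= new <= hi
    raise ValueError(f"unknown task category: {task}")
```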
<p>Instruction templates are generated with ChatGPT assistance and manually refined. To ensure balance, source and target molecules are swapped for some pairs to maintain a roughly 1:1 ratio of property increases to decreases.</p>
<p>Murcko scaffold analysis confirms chemical diversity: scaffolds contain 2.95 molecules on average, and 93.7% of scaffolds contain no more than five molecules.</p>
<h3 id="multi-task-instruction-tuning">Multi-Task Instruction Tuning</h3>
<p>The model is fine-tuned on Llama2-7B-Chat using LoRA (rank 64, alpha 128). To prevent catastrophic forgetting of general language capabilities, the training data combines MolOpt-Instructions with the Stanford Alpaca dataset (52k instruction-following examples, replicated 5x to balance the mixture). The training objective minimizes the negative log-likelihood over the response tokens:</p>
<p>$$L(R; \boldsymbol{\theta}) = -\sum_{u_i \in R} \log \Phi(u_i \mid u_{&lt;i}, I)$$</p>
<p>where $I$ is the instruction, $R$ is the response, and $\Phi$ is the model&rsquo;s conditional probability.</p>
<p>Training runs for 10 epochs with batch size 512, using AdamW ($\beta = (0.9, 0.999)$), learning rate 1e-4, 3% warm-up steps with cosine decay, and no weight decay. The data is split 90/5/5 for train/validation/test.</p>
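<p>In code, this objective amounts to masking out instruction tokens and summing negative log-probabilities over the response tokens only. A sketch, assuming per-token log-probabilities have already been computed by the model:</p>

```python
def response_nll(token_logprobs: list, is_response: list) -> float:
    """Sum -log p(u_i | u_<i, I) over response tokens, matching the
    instruction-tuning objective above (instruction tokens are masked out
    and contribute nothing to the loss)."""
    return -sum(lp for lp, in_resp in zip(token_logprobs, is_response)
                if in_resp)
```

<p>With log-probabilities <code>[-0.1, -0.2, -0.3]</code> and only the last two tokens belonging to the response, the loss is 0.5.</p>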
<h2 id="experimental-setup-and-multi-property-optimization-results">Experimental Setup and Multi-Property Optimization Results</h2>
<h3 id="comparison-with-traditional-approaches">Comparison with Traditional Approaches</h3>
<p>DrugAssist is compared against Mol-Seq2Seq and Mol-Transformer (He et al., 2021) on simultaneous Solubility and BBBP optimization with range constraints. The evaluation prompt asks the model to generate an optimized molecule with solubility within a given range and BBBP category changed from one level to another.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Solubility</th>
          <th>BBBP</th>
          <th>Both</th>
          <th>Valid Rate</th>
          <th>Similarity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mol-Seq2Seq</td>
          <td>0.46</td>
          <td>0.55</td>
          <td>0.35</td>
          <td>0.76</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>Mol-Transformer</td>
          <td>0.70</td>
          <td>0.78</td>
          <td>0.59</td>
          <td>0.96</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugAssist</td>
          <td>0.74</td>
          <td>0.80</td>
          <td>0.62</td>
          <td>0.98</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugAssist achieves the highest success rates in both single-property and multi-property optimization while maintaining high validity (0.98) and comparable structural similarity (0.69).</p>
<h3 id="comparison-with-llms">Comparison with LLMs</h3>
<p>DrugAssist is compared against Llama2-7B-Chat, GPT-3.5-turbo (via ChatDrug), and BioMedGPT-LM-7B on 16 tasks covering all three optimization categories. These comparisons use multi-turn dialogues following the ChatDrug protocol: if the model&rsquo;s output fails to meet requirements, a database-retrieved molecule meeting the criteria and similar to the model&rsquo;s output is provided as a hint for iterative refinement.</p>
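<p>The hint protocol can be sketched as a small feedback loop. Here <code>generate</code>, <code>evaluate</code>, and <code>retrieve_hint</code> are hypothetical stand-ins for the model call, the property checker, and the database lookup; the real protocol also tracks full dialogue history:</p>

```python
def optimize_with_hints(molecule, generate, evaluate, retrieve_hint,
                        max_turns=3):
    """ChatDrug-style multi-turn loop (sketch): if the model's proposal
    fails the criteria, retrieve a similar compliant molecule from a
    database and feed it back as a hint for the next turn."""
    prompt = f"Optimize: {molecule}"
    for _ in range(max_turns):
        proposal = generate(prompt)
        if evaluate(proposal):
            return proposal
        hint = retrieve_hint(proposal)  # compliant neighbour of the proposal
        prompt = (f"Optimize: {molecule}. Your answer {proposal} failed; "
                  f"consider {hint}.")
    return None  # no compliant molecule found within the turn budget
```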
<p>Selected results on single-property tasks (valid ratio / correct ratio, loose/strict):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Llama2-7B-Chat</th>
          <th>GPT-3.5-turbo</th>
          <th>BioMedGPT-LM</th>
          <th>DrugAssist</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED+</td>
          <td>0.17 / 0.16</td>
          <td>0.15 / 0.15</td>
          <td>0.15 / 0.09</td>
          <td>0.76 / 0.63</td>
      </tr>
      <tr>
          <td>Acceptor+</td>
          <td>0.08 / 0.08</td>
          <td>0.04 / 0.06</td>
          <td>0.18 / 0.13</td>
          <td>0.71 / 0.67</td>
      </tr>
      <tr>
          <td>Donor+</td>
          <td>0.15 / 0.08</td>
          <td>0.10 / 0.04</td>
          <td>0.17 / 0.09</td>
          <td>0.72 / 0.76</td>
      </tr>
      <tr>
          <td>Solubility+</td>
          <td>0.36 / 0.20</td>
          <td>0.16 / 0.05</td>
          <td>0.18 / 0.09</td>
          <td>0.80 / 0.41</td>
      </tr>
      <tr>
          <td>BBBP+</td>
          <td>0.19 / 0.14</td>
          <td>0.10 / 0.10</td>
          <td>0.16 / 0.07</td>
          <td>0.82 / 0.61</td>
      </tr>
      <tr>
          <td>hERG-</td>
          <td>0.39 / 0.31</td>
          <td>0.13 / 0.15</td>
          <td>0.13 / 0.12</td>
          <td>0.71 / 0.67</td>
      </tr>
  </tbody>
</table>
<p>Multi-property tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Llama2-7B-Chat</th>
          <th>GPT-3.5-turbo</th>
          <th>BioMedGPT-LM</th>
          <th>DrugAssist</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sol+ &amp; Acc+</td>
          <td>0.15 / 0.04</td>
          <td>0.09 / 0.02</td>
          <td>0.10 / 0.07</td>
          <td>0.50 / 0.27</td>
      </tr>
      <tr>
          <td>QED+ &amp; BBBP+</td>
          <td>0.14 / 0.09</td>
          <td>0.09 / 0.06</td>
          <td>0.16 / 0.11</td>
          <td>0.65 / 0.41</td>
      </tr>
  </tbody>
</table>
<p>DrugAssist outperforms the baselines on every task. BioMedGPT-LM frequently misunderstands the task, generating guidance text rather than molecules, while GPT-3.5-turbo achieves high validity but often returns the input molecule unchanged.</p>
<h2 id="transferability-iterative-refinement-and-limitations">Transferability, Iterative Refinement, and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Zero-shot transferability</strong>: Although DrugAssist trains on single-property optimization data, it successfully handles multi-property optimization requests at inference time. In a case study, the model simultaneously increased both BBBP and QED by at least 0.1 while maintaining structural similarity, without any multi-property training examples.</p>
<p><strong>Few-shot generalization</strong>: DrugAssist optimizes properties not seen during training (e.g., logP) when provided with a few in-context examples of successful optimizations, a capability that traditional sequence-based or graph-based models cannot achieve without retraining.</p>
<p><strong>Iterative optimization</strong>: When an initial optimization fails to meet requirements, DrugAssist can incorporate feedback (a database-retrieved hint molecule) and modify different functional groups in a second attempt to produce a compliant molecule.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge that DrugAssist has a relatively low success rate on the most challenging task category, strict range-constrained solubility optimization (0.41 success rate under strict criteria vs. 0.80 under loose criteria). The model also relies on iDrug for property prediction of Solubility, BBBP, and hERG inhibition, so its optimization quality is bounded by the accuracy of those property predictors. The LLM comparisons use only 500 test molecules, a relatively small evaluation set, and the paper reports no statistical significance tests or confidence intervals.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors plan to improve multimodal data handling to reduce hallucination problems and to further enhance DrugAssist&rsquo;s interactive capabilities for better understanding of user needs and feedback.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>MolOpt-Instructions</td>
          <td>1,029,949 molecule pairs</td>
          <td>Sourced from ZINC via mmpdb; 6 properties</td>
      </tr>
      <tr>
          <td>Training (auxiliary)</td>
          <td>Stanford Alpaca</td>
          <td>52k instructions (5x replicated)</td>
          <td>Mitigates catastrophic forgetting</td>
      </tr>
      <tr>
          <td>Evaluation (traditional)</td>
          <td>From He et al. (2021)</td>
          <td>Not specified</td>
          <td>Multi-property optimization test</td>
      </tr>
      <tr>
          <td>Evaluation (LLM)</td>
          <td>ZINC subset</td>
          <td>500 molecules</td>
          <td>Randomly selected</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Base model</strong>: Llama2-7B-Chat</li>
<li><strong>Fine-tuning</strong>: LoRA with rank 64, alpha 128</li>
<li><strong>Optimizer</strong>: AdamW, $\beta = (0.9, 0.999)$, lr = 1e-4, no weight decay</li>
<li><strong>Schedule</strong>: 3% warm-up, cosine decay</li>
<li><strong>Epochs</strong>: 10</li>
<li><strong>Batch size</strong>: 512</li>
<li><strong>Property calculation</strong>: iDrug (Solubility, BBBP, hERG); RDKit (H-bond donors/acceptors, QED)</li>
<li><strong>Molecular pairs</strong>: mmpdb for Matched Molecular Pair Analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Fine-tuned Llama2-7B-Chat with LoRA adapters</li>
<li>No fine-tuned weights released (code and data available)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success rate</td>
          <td>Fraction of molecules meeting optimization criteria</td>
      </tr>
      <tr>
          <td>Valid rate</td>
          <td>Fraction of generated SMILES that parse as valid molecules</td>
      </tr>
      <tr>
          <td>Similarity</td>
          <td>Tanimoto similarity between input and optimized molecules</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8 NVIDIA Tesla A100-SXM4-40GB GPUs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/blazerye/DrugAssist">DrugAssist Code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/blazerye/DrugAssist">MolOpt-Instructions</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>1M+ molecule pairs, 6 properties</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ye, G., Cai, X., Lai, H., Wang, X., Huang, J., Wang, L., Liu, W., &amp; Zeng, X. (2024). DrugAssist: A Large Language Model for Molecule Optimization. <em>Briefings in Bioinformatics</em>, 26(1), bbae693.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ye2024drugassist,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugAssist: A Large Language Model for Molecule Optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ye, Geyan and Cai, Xibao and Lai, Houtim and Wang, Xing and Huang, Junhong and Wang, Longyue and Liu, Wei and Zeng, Xiangxiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{26}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae693}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae693}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Coscientist: Autonomous Chemistry with LLM Agents</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/autonomous-chemical-research-coscientist/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/autonomous-chemical-research-coscientist/</guid><description>Coscientist uses GPT-4 to autonomously design, plan, and execute chemical experiments including Pd-catalysed cross-coupling optimization.</description><content:encoded><![CDATA[<h2 id="an-llm-powered-agent-for-autonomous-chemical-experimentation">An LLM-Powered Agent for Autonomous Chemical Experimentation</h2>
<p>This is a <strong>Method</strong> paper that introduces Coscientist, an AI system driven by GPT-4 that autonomously designs, plans, and performs complex chemical experiments. The primary contribution is a modular multi-LLM agent architecture that integrates internet search, documentation retrieval, code execution, and robotic experimentation APIs into a unified system capable of end-to-end experimental chemistry with minimal human intervention.</p>
<h2 id="bridging-llm-capabilities-and-laboratory-automation">Bridging LLM Capabilities and Laboratory Automation</h2>
<p>Transformer-based large language models had demonstrated strong capabilities in natural language processing, biology, chemistry, and code generation by early 2023. Simultaneously, laboratory automation had progressed with autonomous reaction discovery, automated flow systems, and mobile robotic platforms. However, these two threads remained largely separate: LLMs could reason about chemistry in text, but could not act on that reasoning by controlling physical experiments.</p>
<p>The gap this work addresses is the integration of LLM reasoning with laboratory automation in a closed-loop system. Prior automated chemistry systems relied on traditional optimization algorithms or narrow AI components. The question was whether GPT-4&rsquo;s general reasoning capabilities could be combined with tool access to produce a system that autonomously designs experiments, writes instrument code, executes reactions, and interprets results, all from natural language prompts.</p>
<p>This work was developed independently and in parallel with other autonomous agent efforts (AutoGPT, BabyAGI, LangChain), with <a href="/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/">ChemCrow</a> serving as another chemistry-specific example.</p>
<h2 id="a-modular-multi-llm-architecture-with-tool-access">A Modular Multi-LLM Architecture with Tool Access</h2>
<p>The core innovation is Coscientist&rsquo;s modular architecture, centered on a &ldquo;Planner&rdquo; module (a GPT-4 chat completion instance) that orchestrates four command types:</p>
<ol>
<li><strong>GOOGLE</strong>: A Web Searcher module (itself an LLM) that transforms prompts into search queries, browses results, and funnels answers back to the Planner.</li>
<li><strong>PYTHON</strong>: A Code Execution module running in an isolated Docker container for calculations and data analysis, with no LLM dependency.</li>
<li><strong>DOCUMENTATION</strong>: A Docs Searcher module that retrieves and summarizes technical documentation (e.g., Opentrons Python API, Emerald Cloud Lab Symbolic Lab Language) using ada embeddings and distance-based vector search.</li>
<li><strong>EXPERIMENT</strong>: An Automation module that executes generated code on laboratory hardware or provides synthetic procedures.</li>
</ol>
<p>The system prompts are engineered in a modular fashion, with the Planner receiving initial user input and command outputs as messages. The Planner can iteratively call commands, fix software errors, and refine its approach. This design allows natural language instructions (e.g., &ldquo;perform multiple Suzuki reactions&rdquo;) to be translated into complete experimental protocols.</p>
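<p>The orchestration pattern can be sketched as a dispatch loop. This is a simplified sketch, not the paper's implementation: the real system prompts, error recovery, and message handling are considerably more involved:</p>

```python
def run_planner(task, llm, tools, max_steps=10):
    """Sketch of the Planner loop: the LLM emits one command per step
    (e.g. GOOGLE / PYTHON / DOCUMENTATION / EXPERIMENT), and each
    command's output is appended to the conversation before the next
    step, until the Planner signals completion."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        command, payload = llm(messages)    # e.g. ("PYTHON", "2 + 2")
        if command == "DONE":
            return payload
        result = tools[command](payload)    # dispatch to the matching module
        messages.append({"role": "tool", "content": f"{command}: {result}"})
    return None  # step budget exhausted
```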
<p>For documentation retrieval, all sections of the OT-2 API documentation were embedded using OpenAI&rsquo;s ada model, and relevant sections are retrieved via cosine similarity search. For the Emerald Cloud Lab, the system learned to program in a symbolic lab language (SLL) that was completely unknown to GPT-4 at training time, demonstrating effective in-context learning from supplied documentation.</p>
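<p>The retrieval step amounts to ranking pre-embedded documentation sections by cosine similarity to the query embedding. A pure-Python sketch (the embedding calls themselves are assumed to have happened already; in the paper they use OpenAI's ada model):</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, k=1):
    """Return the names of the k documentation sections whose embeddings
    are most similar to the query embedding."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```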
<h2 id="six-tasks-demonstrating-autonomous-chemistry-capabilities">Six Tasks Demonstrating Autonomous Chemistry Capabilities</h2>
<p>The paper evaluates Coscientist across six tasks of increasing complexity.</p>
<h3 id="task-1-chemical-synthesis-planning">Task 1: Chemical Synthesis Planning</h3>
<p>A benchmark of seven compounds was used to compare synthesis planning across models (GPT-4, GPT-3.5, Claude 1.3, Falcon-40B-Instruct) with and without web search. Outputs were scored on a 1-5 scale:</p>
<table>
  <thead>
      <tr>
          <th>Score</th>
          <th>Meaning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>5</td>
          <td>Very detailed and chemically accurate procedure</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Detailed and accurate but without reagent quantities</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Correct chemistry but no step-by-step procedure</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Extremely vague or unfeasible</td>
      </tr>
      <tr>
          <td>1</td>
          <td>Incorrect or failure to follow instructions</td>
      </tr>
  </tbody>
</table>
<p>The GPT-4-powered Web Searcher achieved maximum scores for acetaminophen, aspirin, nitroaniline, and phenolphthalein. It was the only approach to achieve acceptable scores (3+) for ibuprofen, which all non-browsing models synthesized incorrectly. These results highlight the importance of grounding LLMs to avoid hallucinations.</p>
<h3 id="task-2-documentation-search">Task 2: Documentation Search</h3>
<p>The system correctly identified relevant ECL functions from documentation and generated valid SLL code that was successfully executed at ECL, including an <a href="https://en.wikipedia.org/wiki/High-performance_liquid_chromatography">HPLC</a> experiment on a caffeine standard sample.</p>
<h3 id="task-3-cloud-laboratory-execution">Task 3: Cloud Laboratory Execution</h3>
<p>Using prompt-to-function and prompt-to-SLL pipelines, Coscientist generated executable code for the Emerald Cloud Lab. It also searched a catalogue of 1,110 model samples to identify relevant stock solutions from simple search terms.</p>
<h3 id="task-4-liquid-handler-control">Task 4: Liquid Handler Control</h3>
<p>Using the Opentrons OT-2, Coscientist translated natural language prompts (e.g., &ldquo;colour every other line with one colour of your choice,&rdquo; &ldquo;draw a red cross&rdquo;) into accurate liquid handling protocols.</p>
<h3 id="task-5-integrated-multi-module-experiment">Task 5: Integrated Multi-Module Experiment</h3>
<p>The most complex demonstration combined web search, code execution, documentation retrieval, and hardware control to design and execute <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki-Miyaura</a> and <a href="https://en.wikipedia.org/wiki/Sonogashira_coupling">Sonogashira</a> <a href="https://en.wikipedia.org/wiki/Cross-coupling_reaction">cross-coupling</a> reactions. Coscientist:</p>
<ul>
<li>Searched the internet for reaction conditions and stoichiometries</li>
<li>Selected correct coupling partners (never misassigning <a href="https://en.wikipedia.org/wiki/Phenylboronic_acid">phenylboronic acid</a> to Sonogashira)</li>
<li>Calculated reagent volumes and wrote OT-2 protocols</li>
<li>Self-corrected when using an incorrect heater-shaker method by consulting documentation</li>
<li>Successfully produced target products confirmed by <a href="https://en.wikipedia.org/wiki/Gas_chromatography%E2%80%93mass_spectrometry">GC-MS</a> analysis (biphenyl at 9.53 min for Suzuki, diphenylacetylene at 12.92 min for Sonogashira)</li>
</ul>
<h3 id="task-6-reaction-optimization">Task 6: Reaction Optimization</h3>
<p>Coscientist was tested on two fully mapped reaction datasets:</p>
<ol>
<li><strong>Suzuki reaction flow dataset</strong> (Perera et al.): varying ligands, reagents/bases, and solvents</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N coupling dataset</strong> (Doyle et al.): varying ligands, additives, and bases</li>
</ol>
<p>Performance was evaluated using a normalized advantage metric:</p>
<p>$$\text{Normalized Advantage} = \frac{\text{yield}_i - \overline{\text{yield}}}{\text{yield}_{\max} - \overline{\text{yield}}}$$</p>
<p>A value of 1 indicates maximum yield reached, 0 indicates random performance, and negative values indicate worse than random. The normalized maximum advantage (NMA) tracks the best result achieved up to each iteration.</p>
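<p>Both metrics are straightforward to compute when the reaction space is fully mapped, since the mean and maximum yields are known in advance. A sketch:</p>

```python
def normalized_advantage(yield_i, yields):
    """Normalized advantage of one observed yield against the fully
    mapped reaction space: 1 = best condition found, 0 = random-guess
    mean, negative = worse than random."""
    mean_y = sum(yields) / len(yields)
    return (yield_i - mean_y) / (max(yields) - mean_y)

def nma_trajectory(observed, yields):
    """Normalized maximum advantage (NMA): the best advantage reached
    up to and including each iteration."""
    best, out = float("-inf"), []
    for y in observed:
        best = max(best, normalized_advantage(y, yields))
        out.append(best)
    return out
```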
<p>Key findings from the optimization experiments:</p>
<ul>
<li>GPT-4 with prior information (10 random data points) produced better initial guesses than GPT-4 without prior information</li>
<li>Both GPT-4 approaches converged to similar NMA values at the limit</li>
<li>Both GPT-4 approaches outperformed standard <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian optimization</a> in NMA and normalized advantage</li>
<li>GPT-3.5 largely failed due to inability to output correct JSON schemas</li>
<li>On the Buchwald-Hartwig dataset, GPT-4 performed comparably whether given compound names or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, and could reason about electronic properties from SMILES representations</li>
</ul>
<p>All experiments used a maximum of 20 iterations (5.2% and 6.9% of the total reaction space for the two datasets).</p>
<h2 id="demonstrated-versatility-with-safety-considerations">Demonstrated Versatility with Safety Considerations</h2>
<p>Coscientist demonstrated that GPT-4, when equipped with appropriate tool access, can autonomously handle the full experimental chemistry workflow from literature search to reaction execution and data interpretation. The system showed chemical reasoning capabilities, including selecting appropriate reagents, providing justifications for choices based on reactivity and selectivity, and using experimental data to guide subsequent iterations.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li>The experimental setup was not yet fully automated (plates were moved manually between instruments), though no human decision-making was involved</li>
<li>GPT-3.5 consistently underperformed due to inability to follow formatting instructions</li>
<li>The synthesis planning evaluation scale is inherently subjective</li>
<li>It is unclear whether GPT-4&rsquo;s training data contained information from the optimization datasets</li>
<li>The comparison with Bayesian optimization may reflect different exploration/exploitation balances rather than pure capability differences</li>
</ul>
<p>The authors raise safety concerns about dual-use potential and note that full code and prompts were withheld pending development of US AI regulations. A simplified implementation was released for reproducibility purposes.</p>
<p>Future directions include extending the system with reaction databases (Reaxys, SciFinder), implementing advanced prompting strategies (ReAct, Chain of Thought, Tree of Thoughts), and developing automated quality control for cloud laboratory experiments.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Synthesis benchmark</td>
          <td>7 compound set</td>
          <td>7 compounds</td>
          <td>Acetaminophen, aspirin, ibuprofen, nitroaniline, etc.</td>
      </tr>
      <tr>
          <td>Optimization</td>
          <td>Perera et al. Suzuki flow dataset</td>
          <td>Fully mapped condition space</td>
          <td>Varying ligands, bases, solvents</td>
      </tr>
      <tr>
          <td>Optimization</td>
          <td>Doyle Buchwald-Hartwig dataset</td>
          <td>Fully mapped condition space</td>
          <td>Varying ligands, additives, bases</td>
      </tr>
      <tr>
          <td>Reagent selection</td>
          <td><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> compound database</td>
          <td>Not specified</td>
          <td>Used for computational experiments</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Planner</strong>: GPT-4 chat completion with modular system prompts</li>
<li><strong>Web Searcher</strong>: GPT-4 or GPT-3.5-turbo for query generation and result parsing</li>
<li><strong>Documentation embedding</strong>: OpenAI ada model with distance-based vector search</li>
<li><strong>Code execution</strong>: Isolated Docker container (no LLM dependency)</li>
<li><strong>Baseline</strong>: Bayesian optimization with varying initial sample sizes (1-10)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-4 (primary)</li>
<li>GPT-3.5-turbo (baseline)</li>
<li>Claude 1.3 (baseline for synthesis planning)</li>
<li>Falcon-40B-Instruct (baseline for synthesis planning)</li>
<li>OpenAI ada (for documentation embedding)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Context</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Synthesis score (1-5)</td>
          <td>7-compound benchmark</td>
          <td>Subjective expert grading</td>
      </tr>
      <tr>
          <td>Normalized advantage</td>
          <td>Optimization tasks</td>
          <td>Measures improvement over random</td>
      </tr>
      <tr>
          <td>NMA</td>
          <td>Optimization tasks</td>
          <td>Maximum advantage achieved through iteration N</td>
      </tr>
      <tr>
          <td>GC-MS confirmation</td>
          <td>Cross-coupling reactions</td>
          <td>Product formation verified experimentally</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Opentrons OT-2 liquid handler with heater-shaker module</li>
<li>UV-Vis plate reader</li>
<li>Emerald Cloud Lab (cloud-based automation)</li>
<li>Computational requirements not specified (relies on OpenAI API calls)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gomesgroup/coscientist">gomesgroup/coscientist</a></td>
          <td>Code</td>
          <td>Apache-2.0 with Commons Clause</td>
          <td>Simplified implementation; full code withheld for safety</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Boiko, D. A., MacKnight, R., Kline, B. &amp; Gomes, G. (2023). Autonomous chemical research with large language models. <em>Nature</em>, 624(7992), 570-578. <a href="https://doi.org/10.1038/s41586-023-06792-0">https://doi.org/10.1038/s41586-023-06792-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{boiko2023autonomous,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Autonomous chemical research with large language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Boiko, Daniil A. and MacKnight, Robert and Kline, Ben and Gomes, Gabriel dos Passos}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{624}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{7992}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{570--578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41586-023-06792-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGE: Molecule Generation via Grammatical Evolution</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/chemge-grammatical-evolution-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/chemge-grammatical-evolution-molecule-generation/</guid><description>ChemGE applies grammatical evolution to SMILES strings for population-based de novo molecule generation with inherent parallelism and diversity.</description><content:encoded><![CDATA[<h2 id="grammatical-evolution-for-de-novo-molecular-design">Grammatical Evolution for De Novo Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemGE, a population-based molecular generation approach built on grammatical evolution. Rather than using deep neural networks, ChemGE evolves populations of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings through a context-free grammar, enabling concurrent evaluation by multiple molecular simulators and producing diverse molecular libraries. The method represents an alternative paradigm for de novo drug design: evolutionary optimization over formal grammars rather than learned latent spaces or autoregressive neural models.</p>
<h2 id="limitations-of-sequential-deep-learning-generators">Limitations of Sequential Deep Learning Generators</h2>
<p>At the time of publication, the dominant approaches to de novo molecular generation included Bayesian optimization over VAE latent spaces (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GVAE</a>), reinforcement learning with recurrent neural networks (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>), sequential Monte Carlo search, and Monte Carlo tree search (ChemTS). These methods share two practical limitations:</p>
<ol>
<li>
<p><strong>Simulation concurrency</strong>: Most methods generate one molecule at a time, making it difficult to run multiple molecular simulations (e.g., <a href="https://en.wikipedia.org/wiki/Molecular_docking">docking</a>) in parallel. This wastes computational resources in high-throughput virtual screening settings.</p>
</li>
<li>
<p><strong>Molecular diversity</strong>: Deep learning generators tend to exploit narrow regions of chemical space. Deep reinforcement learning methods in particular often generate very similar molecules, requiring special countermeasures to maintain diversity. Since drug discovery is a multi-stage pipeline, limited diversity reduces survival rates in downstream <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> screening.</p>
</li>
</ol>
<p>ChemGE addresses both problems by maintaining a large population of molecules that are evolved and evaluated concurrently.</p>
<h2 id="core-innovation-chromosome-to-smiles-mapping-via-grammar-rules">Core Innovation: Chromosome-to-SMILES Mapping via Grammar Rules</h2>
<p>ChemGE encodes each molecule as a chromosome: a sequence of $N$ integers that deterministically maps to a SMILES string through a context-free grammar. The mapping process works as follows:</p>
<ol>
<li>Start with the grammar&rsquo;s start symbol</li>
<li>At each step $k$, look up the $k$-th integer $c = C[k]$ from the chromosome</li>
<li>Identify the leftmost non-terminal symbol and count its $r$ applicable production rules</li>
<li>Apply the $((c \bmod r) + 1)$-th rule</li>
<li>Repeat until no non-terminal symbols remain or the chromosome is exhausted</li>
</ol>
<p>The context-free grammar is a subset of the OpenSMILES specification, defined formally as $G = (V, \Sigma, R, S)$ where $V$ is the set of non-terminal symbols, $\Sigma$ is the set of terminal symbols, $R$ is the set of production rules, and $S$ is the start symbol.</p>
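<p>The decoding scheme can be sketched in a few lines of Python. The grammar below is a tiny illustrative fragment, not the paper&rsquo;s full OpenSMILES subset, and <code>decode</code> is a hypothetical helper following the modulo rule-selection described above:</p>

```python
# Toy grammar fragment: maps non-terminals to lists of productions.
# This is an illustration only, not the paper's OpenSMILES grammar.
TOY_GRAMMAR = {
    "smiles": [["chain"]],
    "chain": [["atom"], ["atom", "chain"]],
    "atom": [["C"], ["N"], ["O"]],
}

def decode(chromosome, grammar, start="smiles", max_steps=100):
    """Map a list of integers to a string via leftmost derivation.

    At step k, codon c = chromosome[k] selects the ((c mod r) + 1)-th
    of the r productions for the leftmost non-terminal. Returns None
    if the chromosome is exhausted before the derivation terminates.
    """
    symbols = [start]
    k = 0
    for _ in range(max_steps):
        # Find the leftmost non-terminal symbol.
        idx = next((i for i, s in enumerate(symbols) if s in grammar), None)
        if idx is None:
            return "".join(symbols)   # fully terminal: a SMILES string
        if k >= len(chromosome):
            return None               # chromosome exhausted -> invalid
        rules = grammar[symbols[idx]]
        choice = rules[chromosome[k] % len(rules)]
        symbols[idx:idx + 1] = choice
        k += 1
    return None

print(decode([0, 1, 0, 0, 0], TOY_GRAMMAR))  # prints CC
```

<p>Because the mapping is deterministic, mutating a single integer yields a small, reproducible perturbation of the derived molecule.</p>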
<p>Evolution follows the $(\mu + \lambda)$ evolution strategy:</p>
<ol>
<li>Create $\lambda$ new chromosomes by drawing random chromosomes from the population and mutating one integer at a random position</li>
<li>Translate each chromosome to a SMILES string and evaluate fitness (e.g., docking score). Invalid molecules receive fitness $-\infty$</li>
<li>Select the top $\mu$ molecules from the merged pool of $\mu + \lambda$ candidates</li>
</ol>
<p>The authors did not use crossover, as it did not improve performance. Diversity is inherently maintained because a large fraction of molecules are mutated in each generation.</p>
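<p>The selection loop above can be sketched as follows; this is a minimal illustration with a toy fitness function standing in for docking or druglikeness scores, not the authors&rsquo; implementation:</p>

```python
import random

def evolve(population, fitness, mu, lam, generations, mutate):
    """(mu + lambda) evolution strategy over integer chromosomes.

    `fitness` maps a chromosome to a score (invalid molecules would
    receive -inf); `mutate` changes one integer at a random position.
    """
    pop = list(population)
    for _ in range(generations):
        # Draw lam parents at random and mutate each once.
        offspring = [mutate(random.choice(pop)) for _ in range(lam)]
        # Select the top mu from the merged (mu + lam) pool.
        merged = pop + offspring
        merged.sort(key=fitness, reverse=True)
        pop = merged[:mu]
    return pop

def point_mutation(chromosome, n_values=256):
    child = list(chromosome)
    child[random.randrange(len(child))] = random.randrange(n_values)
    return child

# Toy fitness: maximize the sum of codons (stands in for a real score).
best = evolve([[0] * 8 for _ in range(10)], sum, mu=10, lam=20,
              generations=50, mutate=point_mutation)
```

<p>Because the $\lambda$ offspring are independent, their fitness evaluations (e.g., docking runs) can be dispatched to separate workers in parallel.</p>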
<h2 id="experimental-setup-and-benchmark-comparisons">Experimental Setup and Benchmark Comparisons</h2>
<h3 id="druglikeness-score-benchmark">Druglikeness Score Benchmark</h3>
<p>The first experiment optimized the penalized logP score $J^{\log P}$, an indicator of druglikeness defined as:</p>
<p>$$
J^{\log P}(m) = \log P(m) - \text{SA}(m) - \text{ring-penalty}(m)
$$</p>
<p>where $\log P(m)$ is the <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">octanol-water partition coefficient</a>, $\text{SA}(m)$ is the synthetic accessibility score, and $\text{ring-penalty}(m)$ penalizes carbon rings larger than size 6. All terms are normalized to zero mean and unit standard deviation. Initial populations were randomly sampled from the ZINC database (35 million compounds), with fitness set to $-\infty$ for molecules with molecular weight above 500 or duplicate structures.</p>
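<p>The score composition can be illustrated with precomputed components; in practice the raw $\log P$ and SA values come from cheminformatics tooling (e.g., RDKit&rsquo;s Crippen $\log P$ and the SA-score contrib script), and the normalization statistics below are hypothetical placeholders:</p>

```python
def penalized_logp(logp, sa, ring_penalty, stats):
    """J = z(logP) - z(SA) - z(ring_penalty), each term standardized
    with (mean, std) pairs computed over a reference set and passed in
    via `stats`. The raw component values are assumed precomputed."""
    z = lambda x, key: (x - stats[key][0]) / stats[key][1]
    return z(logp, "logp") - z(sa, "sa") - z(ring_penalty, "ring")

# Hypothetical (mean, std) normalization statistics for illustration:
stats = {"logp": (2.46, 1.44), "sa": (3.05, 0.83), "ring": (0.04, 0.29)}
score = penalized_logp(logp=3.1, sa=2.5, ring_penalty=0.0, stats=stats)
```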
<p>ChemGE was compared against CVAE, GVAE, and ChemTS across population sizes $(\mu, \lambda) \in \{(10, 20), (100, 200), (1000, 2000), (10000, 20000)\}$.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>2h</th>
          <th>4h</th>
          <th>6h</th>
          <th>8h</th>
          <th>Mol/Min</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemGE (10, 20)</td>
          <td>4.46 ± 0.34</td>
          <td>4.46 ± 0.34</td>
          <td>4.46 ± 0.34</td>
          <td>4.46 ± 0.34</td>
          <td>14.5</td>
      </tr>
      <tr>
          <td>ChemGE (100, 200)</td>
          <td>5.17 ± 0.26</td>
          <td>5.17 ± 0.26</td>
          <td>5.17 ± 0.26</td>
          <td>5.17 ± 0.26</td>
          <td>135</td>
      </tr>
      <tr>
          <td>ChemGE (1000, 2000)</td>
          <td>4.45 ± 0.24</td>
          <td>5.32 ± 0.43</td>
          <td>5.73 ± 0.33</td>
          <td>5.88 ± 0.34</td>
          <td>527</td>
      </tr>
      <tr>
          <td>ChemGE (10000, 20000)</td>
          <td>4.20 ± 0.33</td>
          <td>4.28 ± 0.28</td>
          <td>4.40 ± 0.27</td>
          <td>4.53 ± 0.26</td>
          <td>555</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>-30.18 ± 26.91</td>
          <td>-1.39 ± 2.24</td>
          <td>-0.61 ± 1.08</td>
          <td>-0.006 ± 0.92</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>GVAE</td>
          <td>-4.34 ± 3.14</td>
          <td>-1.29 ± 1.67</td>
          <td>-0.17 ± 0.96</td>
          <td>0.25 ± 1.31</td>
          <td>1.38</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.91 ± 0.38</td>
          <td>5.41 ± 0.51</td>
          <td>5.49 ± 0.44</td>
          <td>5.58 ± 0.50</td>
          <td>40.89</td>
      </tr>
  </tbody>
</table>
<p>At $(\mu, \lambda) = (1000, 2000)$, ChemGE achieved the highest final score of 5.88 and generated 527 unique molecules per minute, roughly 13x faster than ChemTS and 3700x faster than CVAE. The small population (10, 20) converged prematurely with insufficient diversity, while the overly large population (10000, 20000) could not run enough generations to optimize effectively.</p>
<h3 id="docking-experiment-with-thymidine-kinase">Docking Experiment with Thymidine Kinase</h3>
<p>The second experiment applied ChemGE to generate molecules with high predicted binding affinity for <a href="https://en.wikipedia.org/wiki/Thymidine_kinase">thymidine kinase</a> (KITH), a well-known antiviral drug target. The authors used rDock for docking simulation, taking the best intermolecular score $S_{\text{inter}}$ from three runs with different initial conformations. Fitness was defined as $-S_{\text{inter}}$ (lower scores indicate higher affinity). The protein structure was taken from PDB ID 2B8T.</p>
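<p>As a minimal sketch, the fitness definition amounts to negating the best (lowest) of the three rDock intermolecular scores:</p>

```python
def docking_fitness(inter_scores):
    """Fitness for the docking experiment: rDock is run three times
    with different initial conformations, the best (lowest) score
    S_inter is kept, and fitness is its negation so that stronger
    predicted binding yields higher fitness."""
    return -min(inter_scores)

fitness = docking_fitness([-12.3, -15.1, -14.0])
```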
<p>With 32 parallel cores and $(\mu, \lambda) = (32, 64)$, ChemGE completed 1000 generations in approximately 26 hours, generating 9466 molecules total. Among these, 349 molecules achieved intermolecular scores better than the best known inhibitor in the DUD-E database.</p>
<h3 id="diversity-analysis">Diversity Analysis</h3>
<p>Molecular diversity was measured using internal diversity based on Morgan fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|^2} \sum_{(x,y) \in A \times A} T_d(x, y)
$$</p>
<p>where $T_d(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance">Tanimoto distance</a>.</p>
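<p>A direct transcription of this metric, using fingerprints represented as plain sets of on-bit indices (in the paper these are Morgan fingerprints, which would be computed with a toolkit such as RDKit):</p>

```python
def tanimoto_distance(x, y):
    """Tanimoto distance between fingerprints given as sets of on-bit
    indices: 1 - |x ∩ y| / |x ∪ y|."""
    return 1.0 - len(x & y) / len(x | y)

def internal_diversity(fps):
    """Mean pairwise Tanimoto distance over all ordered pairs,
    including self-pairs, matching I(A) = (1/|A|^2) * sum T_d(x, y)."""
    n = len(fps)
    return sum(tanimoto_distance(x, y) for x in fps for y in fps) / n**2

# Three toy fingerprints as bit-index sets:
fps = [{1, 2, 3}, {2, 3, 4}, {1, 4, 5}]
diversity = internal_diversity(fps)
```

<p>Self-pairs contribute zero distance, so the metric is bounded below by $0$ and approaches $1$ only for large, mutually disjoint sets.</p>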
<p>The 349 &ldquo;ChemGE-active&rdquo; molecules (those scoring better than the best known inhibitor) had an internal diversity of 0.55, compared to 0.46 for known inhibitors and 0.65 for the whole ZINC database. This is a substantial improvement over known actives, achieved without any explicit diversity-promoting mechanism.</p>
<p>ISOMAP visualizations showed that ChemGE populations migrated away from known inhibitors over generations, ultimately occupying a completely different region of chemical space by generation 1000. This suggests ChemGE discovered a novel structural class of potential binders.</p>
<h2 id="high-throughput-and-diversity-without-deep-learning">High Throughput and Diversity Without Deep Learning</h2>
<p>ChemGE demonstrates several notable findings:</p>
<ol>
<li>
<p><strong>Deep learning is not required</strong> for competitive de novo molecular generation. Grammatical evolution over SMILES achieves higher throughput and comparable or better optimization scores than VAE- and RNN-based methods.</p>
</li>
<li>
<p><strong>Population size matters significantly</strong>. Too small a population leads to premature convergence. Too large a population prevents sufficient per-molecule optimization within the computational budget. The $(\mu, \lambda) = (1000, 2000)$ setting provided the best balance.</p>
</li>
<li>
<p><strong>Inherent diversity</strong> is a key advantage of evolutionary methods. Without any explicit diversity loss or penalty, ChemGE maintains diversity comparable to the ZINC database and exceeds that of known active molecules.</p>
</li>
<li>
<p><strong>Parallel evaluation</strong> is naturally supported. Each generation produces $\lambda$ independent molecules that can be evaluated by separate docking simulators simultaneously.</p>
</li>
</ol>
<p>The authors acknowledge several limitations. Synthetic routes and ADMET properties were not evaluated for the generated molecules. The docking scores, while favorable, require confirmation through biological assays. The authors also note that incorporating probabilistic or neural models into the evolutionary process might further improve performance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial population</td>
          <td>ZINC</td>
          <td>~35M compounds</td>
          <td>Randomly sampled starting molecules</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2B8T</td>
          <td>1 structure</td>
          <td>Thymidine kinase-ligand complex</td>
      </tr>
      <tr>
          <td>Baseline actives</td>
          <td>DUD-E (KITH)</td>
          <td>57 inhibitors</td>
          <td>Known thymidine kinase inhibitors</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Grammatical evolution with $(\mu + \lambda)$ evolution strategy</li>
<li>Mutation only (no crossover)</li>
<li>Context-free grammar subset of OpenSMILES specification</li>
<li>Chromosome length: $N$ integers per molecule</li>
<li>Fitness set to $-\infty$ for invalid SMILES, MW &gt; 500, or duplicate molecules</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. ChemGE is purely evolutionary.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max $J^{\log P}$ (8h)</td>
          <td>5.88 ± 0.34</td>
          <td>ChemTS: 5.58 ± 0.50</td>
          <td>ChemGE (1000, 2000)</td>
      </tr>
      <tr>
          <td>Molecules/min</td>
          <td>527</td>
          <td>ChemTS: 40.89</td>
          <td>~13x throughput improvement</td>
      </tr>
      <tr>
          <td>Docking hits</td>
          <td>349</td>
          <td>Best DUD-E inhibitor</td>
          <td>Molecules with better $S_{\text{inter}}$</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.55</td>
          <td>Known inhibitors: 0.46</td>
          <td>Morgan fingerprint Tanimoto distance</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>CPU: Intel Xeon E5-2630 v3 (benchmark experiments, single core)</li>
<li>Docking: 32 cores in parallel (thymidine kinase experiment, ~26 hours for 1000 generations)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tsudalab/ChemGE">ChemGE</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoshikawa, N., Terayama, K., Sumita, M., Homma, T., Oono, K., &amp; Tsuda, K. (2018). Population-based de novo molecule generation, using grammatical evolution. <em>Chemistry Letters</em>, 47(11), 1431-1434. <a href="https://doi.org/10.1246/cl.180665">https://doi.org/10.1246/cl.180665</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yoshikawa2018chemge,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Population-based De Novo Molecule Generation, Using Grammatical Evolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yoshikawa, Naruki and Terayama, Kei and Sumita, Masato and Homma, Teruki and Oono, Kenta and Tsuda, Koji}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1431--1434}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1246/cl.180665}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemCrow: Augmenting LLMs with 18 Chemistry Tools</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/</guid><description>ChemCrow integrates 18 expert-designed chemistry tools with GPT-4 to enable autonomous synthesis planning, drug discovery, and materials design tasks.</description><content:encoded><![CDATA[<h2 id="an-llm-powered-chemistry-agent">An LLM-Powered Chemistry Agent</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemCrow, an LLM chemistry agent that augments GPT-4 with 18 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. Rather than relying on the LLM&rsquo;s internal knowledge (which is often inaccurate for chemistry), ChemCrow uses the LLM as a reasoning engine that iteratively calls specialized tools to gather information, plan actions, and execute experiments. The system successfully planned and executed real-world chemical syntheses on a robotic platform, demonstrating one of the first chemistry-related LLM agent interactions with the physical world.</p>
<h2 id="bridging-llm-reasoning-and-chemical-expertise">Bridging LLM Reasoning and Chemical Expertise</h2>
<p>Large language models have transformed many domains, but they struggle with chemistry-specific problems. GPT-4 cannot reliably perform basic operations like multiplying large numbers, converting <a href="https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_chemistry">IUPAC names</a> to molecular structures, or predicting reaction outcomes. These limitations stem from the models&rsquo; token-prediction design, which does not encode chemical reasoning or factual chemical knowledge reliably.</p>
<p>Meanwhile, the chemistry community has developed numerous specialized computational tools for reaction prediction, <a href="/notes/chemistry/molecular-design/reaction-prediction/">retrosynthesis</a> planning, molecular property prediction, and de novo molecular generation. These tools exist in isolated environments with steep learning curves, making them difficult for experimental chemists to integrate and use together. The gap between LLM reasoning capabilities and specialized chemistry tools presents an opportunity: augmenting LLMs with these tools could compensate for the models&rsquo; chemical knowledge deficiencies while providing a natural language interface to specialized computational chemistry capabilities.</p>
<h2 id="tool-augmented-reasoning-via-react">Tool-Augmented Reasoning via ReAct</h2>
<p>ChemCrow builds on the ReAct (Reasoning and Acting) framework, where the LLM follows an iterative Thought &rarr; Action &rarr; Action Input &rarr; Observation loop. At each step, the model reasons about the current state of the task, selects an appropriate tool, provides input, pauses while the tool executes, and then incorporates the observation before deciding on the next step. This continues until a final answer is reached.</p>
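<p>A stripped-down sketch of such a loop is shown below. The <code>llm</code> callable and the tool registry are stand-ins (in ChemCrow these are GPT-4 via LangChain and the 18 chemistry tools), and the step format is simplified to a dictionary:</p>

```python
def react_loop(question, llm, tools, max_steps=10):
    """Iterate Thought -> Action -> Action Input -> Observation until
    the model emits a final answer or the step budget is exhausted."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)           # one Thought/Action decision
        transcript += f"Thought: {step['thought']}\n"
        if step["action"] == "Final Answer":
            return step["input"]
        # Run the chosen tool and feed its output back as an observation.
        observation = tools[step["action"]](step["input"])
        transcript += (f"Action: {step['action']}\n"
                       f"Action Input: {step['input']}\n"
                       f"Observation: {observation}\n")
    return None

# Toy run: one tool call, then a final answer.
tools = {"Name2SMILES": lambda name: "CCO" if name == "ethanol" else "?"}
steps = iter([
    {"thought": "Look up the structure.", "action": "Name2SMILES",
     "input": "ethanol"},
    {"thought": "I have the SMILES.", "action": "Final Answer",
     "input": "CCO"},
])
answer = react_loop("What is the SMILES of ethanol?",
                    lambda _transcript: next(steps), tools)
```

<p>In the real system the safety tools are invoked before any <code>ReactionExecute</code> call, which in this sketch would correspond to a guard step inserted before the tool dispatch.</p>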
<p>The system integrates 18 tools organized into four categories:</p>
<p><strong>General tools</strong> include web search (via SerpAPI), literature search (using paper-qa with OpenAI embeddings and FAISS), a Python REPL for arbitrary code execution, and a human interaction interface.</p>
<p><strong>Molecule tools</strong> cover Name2SMILES (converting molecule names to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> via Chem-Space, PubChem, and OPSIN), SMILES2Price (checking purchasability via molbloom and ZINC20), Name2CAS (CAS number lookup via PubChem), molecular Similarity (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> with ECFP2 fingerprints), ModifyMol (local chemical space exploration via SynSpace), PatentCheck (bloom filter patent lookup via molbloom), FuncGroups (functional group identification via SMARTS patterns), and SMILES2Weight (molecular weight calculation via RDKit).</p>
<p><strong>Safety tools</strong> include ControlledChemicalCheck (screening against chemical weapons lists from <a href="https://en.wikipedia.org/wiki/Organisation_for_the_Prohibition_of_Chemical_Weapons">OPCW</a> and the Australia Group), ExplosiveCheck (GHS explosive classification via PubChem), and SafetySummary (comprehensive safety overview from PubChem data).</p>
<p><strong>Chemical reaction tools</strong> include NameRXN (reaction classification via NextMove Software), ReactionPredict (product prediction via IBM&rsquo;s RXN4Chemistry API using the <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a>), ReactionPlanner (multi-step synthesis planning via RXN4Chemistry), and ReactionExecute (direct synthesis execution on IBM&rsquo;s RoboRXN robotic platform).</p>
<p>A key design feature is that safety checks are automatically invoked before synthesis execution. If a molecule is flagged as a controlled chemical or precursor, execution stops immediately.</p>
<h2 id="experimental-validation-and-evaluation">Experimental Validation and Evaluation</h2>
<h3 id="autonomous-synthesis">Autonomous Synthesis</h3>
<p>ChemCrow autonomously planned and executed four real-world syntheses on the IBM RoboRXN cloud-connected robotic platform:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/DEET">DEET</a></strong> (insect repellent), from the prompt &ldquo;Plan and execute the synthesis of an insect repellent&rdquo;</li>
<li><strong>Three <a href="https://en.wikipedia.org/wiki/Thiourea">thiourea</a> <a href="https://en.wikipedia.org/wiki/Organocatalysis">organocatalysts</a></strong> (Schreiner&rsquo;s, Ricci&rsquo;s, and Takemoto&rsquo;s catalysts), from a prompt asking to find and synthesize a thiourea organocatalyst that accelerates the <a href="https://en.wikipedia.org/wiki/Diels%E2%80%93Alder_reaction">Diels-Alder reaction</a></li>
</ul>
<p>All four syntheses yielded the anticipated compounds. ChemCrow demonstrated the ability to autonomously adapt synthesis procedures when the RoboRXN platform flagged issues (such as insufficient solvent or invalid purification actions), iteratively modifying the procedure until it was valid.</p>
<h3 id="novel-chromophore-discovery">Novel Chromophore Discovery</h3>
<p>In a human-AI collaboration scenario, ChemCrow was instructed to train a machine learning model to screen candidate <a href="https://en.wikipedia.org/wiki/Chromophore">chromophores</a>. The system loaded and cleaned data from a chromophore database, trained and evaluated a random forest model, and suggested a molecule with a target absorption maximum of 369 nm. The proposed molecule was subsequently synthesized and characterized, revealing a measured absorption maximum of 336 nm, confirming the discovery of a new chromophore.</p>
<h3 id="expert-vs-llm-evaluation">Expert vs. LLM Evaluation</h3>
<p>The evaluation used 14 use cases spanning synthesis planning, molecular design, and chemical logic. Both ChemCrow and standalone GPT-4 (without tools) were evaluated by:</p>
<ol>
<li><strong>Expert human evaluators</strong> (n=4): Assessed correctness of chemistry, quality of reasoning, and degree of task completion</li>
<li><strong>EvaluatorGPT</strong>: An LLM evaluator prompted to assess responses</li>
</ol>
<p>Key findings from the evaluation:</p>
<table>
  <thead>
      <tr>
          <th>Evaluator</th>
          <th>Preferred System</th>
          <th>Reasoning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Human experts</td>
          <td>ChemCrow</td>
          <td>Better chemical accuracy and task completeness, especially on complex tasks</td>
      </tr>
      <tr>
          <td>EvaluatorGPT</td>
          <td>GPT-4</td>
          <td>Favored fluent, complete-sounding responses despite factual errors</td>
      </tr>
  </tbody>
</table>
<p>Human experts preferred ChemCrow across most tasks, with the exception of very simple tasks where GPT-4 could answer from memorized training data (e.g., synthesis of well-known molecules like paracetamol). GPT-4 without tools consistently produced hallucinations that appeared convincing but were factually incorrect upon expert inspection.</p>
<p>An important finding is that LLM-based evaluation (EvaluatorGPT) cannot replace expert human assessment for scientific tasks. The LLM evaluator lacks the domain knowledge needed to distinguish fluent but incorrect answers from accurate ones, rendering it unsuitable for benchmarking factuality in chemistry.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>ChemCrow demonstrates that augmenting LLMs with expert-designed tools transforms them from &ldquo;hyperconfident, typically wrong information sources&rdquo; into reasoning engines that can gather and act on accurate chemical information. The system lowers the barrier for non-experts to access computational chemistry tools through natural language while serving as an assistant to expert chemists.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li><strong>Tool dependency</strong>: ChemCrow&rsquo;s performance is bounded by the quality and coverage of its tools. Improved synthesis engines would directly improve synthesis planning capabilities.</li>
<li><strong>Reasoning failures</strong>: Tools become useless if the LLM&rsquo;s reasoning about when and how to use them is flawed, or if garbage inputs are provided.</li>
<li><strong>Reproducibility</strong>: The API-based approach to closed-source LLMs (GPT-4) limits reproducibility of individual results. The authors note that open-source models could address this, potentially at the cost of reasoning quality.</li>
<li><strong>Evaluation scope</strong>: The 14 evaluation tasks, while diverse, represent a limited test set. Standardized benchmarks for LLM-based chemistry tools did not exist at the time of publication.</li>
<li><strong>Safety considerations</strong>: While safety tools prevent execution of controlled chemical syntheses, risks remain from inaccurate reasoning or tool outputs leading to suboptimal conclusions.</li>
</ul>
<p>The authors emphasize that ChemCrow&rsquo;s modular design allows easy extension with new tools, and that future integration of image-processing tools, additional language-based tools, and other capabilities could substantially enhance the system.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chromophore screening</td>
          <td>DB for chromophore (Joung et al.)</td>
          <td>Not specified</td>
          <td>Used for training random forest model</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>14 expert-designed tasks</td>
          <td>14 tasks</td>
          <td>Spanning synthesis, molecular design, and chemical logic</td>
      </tr>
      <tr>
          <td>Chemical safety</td>
          <td>OPCW Schedules 1-3, Australia Group lists</td>
          <td>Not specified</td>
          <td>Used for controlled chemical screening</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>LLM</strong>: GPT-4 with temperature 0.1</li>
<li><strong>Framework</strong>: LangChain for tool integration</li>
<li><strong>Reasoning</strong>: ReAct (Reasoning + Acting) framework with chain-of-thought prompting</li>
<li><strong>Synthesis planning</strong>: IBM RXN4Chemistry API (Molecular Transformer-based)</li>
<li><strong>Molecule similarity</strong>: Tanimoto similarity with ECFP2 fingerprints via RDKit</li>
<li><strong>Chemical space exploration</strong>: SynSpace with 50 robust medicinal chemistry reactions</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-4 (OpenAI, closed-source) for reasoning</li>
<li>Random forest for chromophore screening (trained on the fly)</li>
<li>Molecular Transformer via RXN4Chemistry API for reaction prediction and retrosynthesis</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Human evaluation</strong>: 4 expert chemists rated responses on chemistry correctness, reasoning quality, and task completion</li>
<li><strong>LLM evaluation</strong>: EvaluatorGPT assessed responses (found unreliable for factuality)</li>
<li><strong>Experimental validation</strong>: 4 syntheses on RoboRXN platform, 1 novel chromophore characterization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The system relies primarily on API calls to GPT-4 and RXN4Chemistry, so local compute requirements are minimal.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-public">chemcrow-public</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source implementation with 12 of 18 tools</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-runs">chemcrow-runs</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>All experiment outputs and evaluation data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884639">Zenodo release (code)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release v0.3.24</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884645">Zenodo release (runs)</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>Archived experiment runs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., &amp; Schwaller, P. (2024). Augmenting large language models with chemistry tools. <em>Nature Machine Intelligence</em>, 6(5), 525-535. <a href="https://doi.org/10.1038/s42256-024-00832-8">https://doi.org/10.1038/s42256-024-00832-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{bran2024augmenting,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmenting large language models with chemistry tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bran, Andres M. and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D. and Schwaller, Philippe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{525--535}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-024-00832-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChatDrug: Conversational Drug Editing with ChatGPT</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chatdrug-conversational-drug-editing/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chatdrug-conversational-drug-editing/</guid><description>ChatDrug uses ChatGPT with retrieval and domain feedback for drug editing across small molecules, peptides, and proteins on 39 tasks.</description><content:encoded><![CDATA[<h2 id="a-framework-for-conversational-drug-editing-with-llms">A Framework for Conversational Drug Editing with LLMs</h2>
<p>ChatDrug is a <strong>Method</strong> paper that introduces a parameter-free framework for drug editing using conversational large language models (specifically ChatGPT/GPT-3.5). The primary contribution is a three-module pipeline that combines prompt engineering, retrieval-augmented domain feedback, and iterative conversation to perform text-guided editing of small molecules, peptides, and proteins. The paper also establishes a benchmark of 39 drug editing tasks spanning these three drug types.</p>
<h2 id="bridging-conversational-ai-and-drug-discovery">Bridging Conversational AI and Drug Discovery</h2>
<p>Drug editing (known as <a href="https://en.wikipedia.org/wiki/Hit_to_lead">lead optimization</a> for small molecules and protein design for proteins) is a critical step in the drug discovery pipeline, in which molecular substructures are modified to achieve desired properties. Traditional approaches rely on domain experts for manual editing, which can be subjective and biased. Recent multi-modal approaches such as MoleculeSTM and ProteinDT have begun exploring text-guided drug editing, but they are domain-specific (limited to a single drug type) and lack the conversational capabilities needed for iterative refinement.</p>
<p>The authors identify three properties of conversational LLMs that make them suitable for drug discovery: (1) pretraining on comprehensive knowledge bases covering drug-related concepts, (2) strong few-shot adaptation and generalization abilities, and (3) interactive communication enabling iterative feedback incorporation. However, directly applying LLMs to drug editing yields suboptimal results because the models do not fully utilize prior domain knowledge. ChatDrug addresses this gap through structured retrieval and feedback mechanisms.</p>
<h2 id="three-module-pipeline-pdds-redf-and-conversation">Three-Module Pipeline: PDDS, ReDF, and Conversation</h2>
<p>ChatDrug consists of three modules that operate sequentially without any parameter learning.</p>
<h3 id="pdds-module-prompt-design-for-domain-specific">PDDS Module (Prompt Design for Domain-Specific)</h3>
<p>The PDDS module constructs domain-specific prompts for ChatGPT. Given an input drug $\pmb{x}_{\text{in}}$ and a text prompt $\pmb{x}_t$ describing the desired property change, the goal is:</p>
<p>$$
\pmb{x}_{\text{out}} = \text{ChatDrug}(\pmb{x}_{\text{in}}, \pmb{x}_t)
$$</p>
<p>The prompts are designed around high-level property descriptions (e.g., &ldquo;more soluble in water&rdquo;) rather than exact substructure replacements. The authors argue that ChatDrug is better suited to &ldquo;fuzzy searching&rdquo; (property-based editing with non-deterministic answers) than to &ldquo;exact searching&rdquo; (precise substructure replacement, which experts can perform directly).</p>
<h3 id="redf-module-retrieval-and-domain-feedback">ReDF Module (Retrieval and Domain Feedback)</h3>
<p>The ReDF module retrieves structurally similar examples from a domain-specific database and injects them into the conversation as demonstrations. For an input drug $\pmb{x}_{\text{in}}$, a candidate drug $\tilde{\pmb{x}}$ that failed the desired property change, and a retrieval database, ReDF returns:</p>
<p>$$
\pmb{x}_R = \text{ReDF}(\pmb{x}_{\text{in}}, \tilde{\pmb{x}}; \pmb{x}_t) = \underset{\pmb{x}'_R \in \text{RetrievalDB}}{\arg\max} \langle \tilde{\pmb{x}}, \pmb{x}'_R \rangle \wedge D(\pmb{x}_{\text{in}}, \pmb{x}'_R; \pmb{x}_t)
$$</p>
<p>where $D(\cdot, \cdot; \cdot) \in {\text{True}, \text{False}}$ is a domain feedback function checking whether the retrieved drug satisfies the desired property change, and $\langle \tilde{\pmb{x}}, \pmb{x}&rsquo;_R \rangle$ is a similarity function (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> for small molecules, <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> for peptides and proteins).</p>
<p>The retrieved example $\pmb{x}_R$ is injected into the prompt as: &ldquo;Your provided sequence [$\tilde{\pmb{x}}$] is not correct. We find a sequence [$\pmb{x}_R$] which is correct and similar to the molecule you provided. Can you give me a new molecule?&rdquo;</p>
<h3 id="conversation-module">Conversation Module</h3>
<p>The conversation module enables iterative refinement over $C$ rounds. At each round $c$, if the edited drug $\pmb{x}_c$ does not satisfy the evaluation condition, ChatDrug retrieves a new example via ReDF using $\tilde{\pmb{x}} = \pmb{x}_c$ and continues the conversation. This aligns with the iterative nature of real drug discovery workflows.</p>
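<p>The round structure can be sketched as follows; the <code>llm</code>, <code>evaluate</code>, and <code>redf</code> callables are hypothetical stand-ins for the ChatGPT call, the property oracle, and the ReDF module:</p>

```python
def chatdrug_edit(x_in, prompt, llm, evaluate, redf, rounds=2):
    """Iterative ChatDrug-style loop (sketch): query the LLM, check the
    property with the oracle, and on failure inject a retrieved example."""
    candidate = llm(f"{prompt} Input: {x_in}")
    for _ in range(rounds):
        if evaluate(x_in, candidate):
            return candidate  # property change satisfied
        x_r = redf(x_in, candidate)  # retrieved correct, similar drug
        candidate = llm(
            f"Your provided sequence [{candidate}] is not correct. "
            f"We find a sequence [{x_r}] which is correct and similar. "
            "Can you give me a new molecule?"
        )
    return candidate if evaluate(x_in, candidate) else None
```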
<h2 id="experiments-across-39-drug-editing-tasks">Experiments Across 39 Drug Editing Tasks</h2>
<h3 id="task-design">Task Design</h3>
<p>The benchmark includes 39 tasks across three drug types:</p>
<ul>
<li><strong>Small molecules</strong> (28 tasks): 16 single-objective (tasks 101-108, each with loose and strict thresholds) and 12 multi-objective tasks (tasks 201-206, each with two thresholds). Properties include solubility (<a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>), drug-likeness (QED), permeability (<a href="https://en.wikipedia.org/wiki/Polar_surface_area">tPSA</a>), <a href="https://en.wikipedia.org/wiki/Hydrogen_bond">hydrogen bond</a> acceptors/donors.</li>
<li><strong>Peptides</strong> (9 tasks): 6 single-objective and 3 multi-objective tasks for editing <a href="https://en.wikipedia.org/wiki/Major_histocompatibility_complex">peptide-MHC binding</a> affinity across different <a href="https://en.wikipedia.org/wiki/Human_leukocyte_antigen">HLA allele</a> types.</li>
<li><strong>Proteins</strong> (2 tasks): Editing protein sequences to increase <a href="https://en.wikipedia.org/wiki/Alpha_helix">alpha-helix</a> or <a href="https://en.wikipedia.org/wiki/Beta_sheet">beta-strand</a> secondary structures.</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>For small molecules, baselines include Random, PCA, High-Variance, and GS-Mutate (all based on MegaMolBART), plus MoleculeSTM with <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and Graph representations. For peptides and proteins, random mutation baselines with 1-3 mutated positions are used.</p>
<h3 id="main-results">Main Results</h3>
<p>ChatDrug achieves the best performance on 33 out of 39 tasks. Key results for small molecule editing (hit ratio):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Property</th>
          <th>ChatDrug (loose)</th>
          <th>Best Baseline (loose)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>101</td>
          <td>More soluble</td>
          <td>94.13</td>
          <td>67.86 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>102</td>
          <td>Less soluble</td>
          <td>96.86</td>
          <td>64.79 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>106</td>
          <td>Lower permeability</td>
          <td>77.35</td>
          <td>34.13 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>107</td>
          <td>More HBA</td>
          <td>95.35</td>
          <td>54.01 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>108</td>
          <td>More HBD</td>
          <td>96.54</td>
          <td>60.97 (MoleculeSTM-Graph)</td>
      </tr>
  </tbody>
</table>
<p>ChatDrug underperforms on tasks 104 (less like a drug) and 105 (higher permeability) and most multi-objective tasks involving permeability (205), where MoleculeSTM variants perform better.</p>
<p>For peptide editing, ChatDrug achieves 41-69% hit ratios compared to 0.4-14.4% for random mutation baselines. For protein editing, ChatDrug reaches 34.79% and 51.38% hit ratios on helix and strand tasks respectively, compared to 26.90% and 21.44% for the best random mutation baseline.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Conversation rounds</strong>: Performance increases with more rounds, converging around $C = 2$. For example, on task 101 (loose threshold), zero-shot achieves 78.26%, $C = 1$ reaches 89.56%, and $C = 2$ reaches 93.37%.</p>
<p><strong>ReDF threshold</strong>: Using a stricter threshold in the domain feedback function $D$ (matching the evaluation threshold) yields substantially higher performance than using a loose threshold. For example, on task 107 with strict evaluation, the strict-threshold ReDF achieves 72.60% vs. 14.96% for the loose-threshold ReDF.</p>
<p><strong>Similarity analysis</strong>: Retrieved molecules $\pmb{x}_R$ tend to have lower similarity to input molecules than the intermediate outputs $\pmb{x}_1$, yet they have higher hit ratios. This suggests the ReDF module explores the chemical space effectively, and the conversation module balances similarity preservation with property optimization.</p>
<p><strong>Knowledge extraction</strong>: ChatDrug can articulate domain-specific reasoning for its edits (e.g., summarizing rules for increasing water solubility by introducing polar functional groups), though the extracted knowledge shows some redundancy.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>ChatDrug demonstrates that conversational LLMs can serve as useful tools for drug editing, achieving strong results across diverse drug types with a parameter-free approach. The framework exhibits open vocabulary and compositional properties, allowing it to handle novel drug concepts and multi-objective tasks through natural language.</p>
<p>The authors acknowledge two main limitations. First, ChatDrug struggles with understanding complex 3D drug geometries, which would require deeper geometric modeling. Second, the framework requires multiple conversation rounds to achieve strong performance, adding computational cost through repeated API calls. The authors suggest that knowledge summarization capabilities of LLMs could help reduce this cost.</p>
<p>The evaluation relies entirely on computational oracles (RDKit for small molecules, MHCflurry2.0 for peptides, ProteinCLAP for proteins) rather than wet-lab validation. The hit ratio metric also excludes invalid outputs from the denominator, so the effective success rate on all attempted edits may be lower than reported.</p>
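<p>The denominator effect is easy to illustrate with a toy <code>hit_ratio</code> in the reported style (a hypothetical helper, assuming validity and property checks are available as callables): if 6 of 8 valid outputs succeed but 2 of 10 attempts were invalid, the reported ratio is 75% even though only 60% of all attempts succeeded:</p>

```python
def hit_ratio(outputs, is_valid, satisfies):
    """Hit ratio as reported: successes over *valid* outputs only,
    so invalid generations do not count against the score."""
    valid = [o for o in outputs if is_valid(o)]
    if not valid:
        return 0.0
    return sum(satisfies(o) for o in valid) / len(valid)
```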
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Small molecule inputs</td>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a></td>
          <td>200 molecules</td>
          <td>Sampled SMILES strings</td>
      </tr>
      <tr>
          <td>Small molecule retrieval DB</td>
          <td>ZINC</td>
          <td>10K molecules</td>
          <td>For ReDF similarity search</td>
      </tr>
      <tr>
          <td>Peptide inputs</td>
          <td>Peptide-MHC binding dataset</td>
          <td>500 peptides per task</td>
          <td>From 30 common MHC alleles</td>
      </tr>
      <tr>
          <td>Peptide retrieval DB</td>
          <td>Experimental binding data</td>
          <td>Varies by allele</td>
          <td>Target allele experimental data</td>
      </tr>
      <tr>
          <td>Protein inputs</td>
          <td>TAPE test set</td>
          <td>Varies</td>
          <td>Secondary structure prediction test data</td>
      </tr>
      <tr>
          <td>Protein retrieval DB</td>
          <td>TAPE training set</td>
          <td>Varies</td>
          <td>Secondary structure prediction training data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-3.5-turbo via OpenAI ChatCompletion API, temperature=0, frequency_penalty=0.2</li>
<li>System prompt: &ldquo;You are an expert in the field of molecular chemistry.&rdquo;</li>
<li>$C = 2$ conversation rounds for main results</li>
<li>5 random seeds (0-4) for small molecule main results, seed 0 for ablations</li>
</ul>
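<p>Assembled as a request payload, these settings look roughly as follows. This is a sketch mirroring the OpenAI ChatCompletion parameters reported above; <code>build_chatdrug_request</code> is a hypothetical helper and no API call is made:</p>

```python
def build_chatdrug_request(task_prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    """Assemble the request payload with the settings reported in the
    paper (temperature=0, frequency_penalty=0.2). The dict mirrors the
    shape of an OpenAI ChatCompletion request."""
    return {
        "model": model,
        "temperature": 0,
        "frequency_penalty": 0.2,
        "messages": [
            {
                "role": "system",
                "content": "You are an expert in the field of molecular chemistry.",
            },
            {"role": "user", "content": task_prompt},
        ],
    }
```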
<h3 id="models">Models</h3>
<ul>
<li>ChatGPT (GPT-3.5-turbo): used as-is, no fine-tuning</li>
<li>MHCflurry 2.0: pseudo-oracle for peptide binding affinity evaluation</li>
<li>ProteinCLAP-EBM-NCE from ProteinDT: protein secondary structure prediction</li>
<li>ESMFold: protein folding for visualization</li>
<li>RDKit: molecular property calculations for small molecules</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit Ratio</td>
          <td>Fraction of valid edits satisfying property requirements</td>
          <td>Invalid sequences excluded from denominator</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All experiments were conducted on a single NVIDIA RTX A6000 GPU (used only for peptide and protein evaluation). Total OpenAI API cost was less than $100.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/chao1224/ChatDrug">ChatDrug GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, S., Wang, J., Yang, Y., Wang, C., Liu, L., Guo, H., &amp; Xiao, C. (2024). Conversational Drug Editing Using Retrieval and Domain Feedback. <em>ICLR 2024</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liu2024chatdrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Conversational Drug Editing Using Retrieval and Domain Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Shengchao and Wang, Jiongxiao and Yang, Yijin and Wang, Chengpeng and Liu, Ling and Guo, Hongyu and Xiao, Chaowei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BioT5: Cross-Modal Integration of Biology and Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/</guid><description>BioT5 is a T5-based pretraining framework that jointly models molecules, proteins, and natural language using SELFIES for robust molecular generation.</description><content:encoded><![CDATA[<h2 id="a-unified-pretraining-framework-for-molecules-proteins-and-text">A Unified Pretraining Framework for Molecules, Proteins, and Text</h2>
<p>BioT5 is a <strong>Method</strong> paper that introduces a comprehensive <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-based pretraining framework for cross-modal integration of molecules, proteins, and natural language. The primary contribution is a multi-task pretraining approach that uses <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (instead of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) for 100% valid molecular representations, separate tokenization for each modality, and a combination of masked language modeling and translation objectives to connect structured biological data with unstructured scientific text. After fine-tuning, BioT5 (252M parameters) achieves state-of-the-art performance on 10 out of 15 downstream tasks spanning molecule property prediction, protein property prediction, drug-target interaction, protein-protein interaction, molecule captioning, and text-based molecule generation.</p>
<h2 id="bridging-the-gap-between-molecular-sequences-and-scientific-knowledge">Bridging the Gap Between Molecular Sequences and Scientific Knowledge</h2>
<p>Prior cross-modal models in computational biology face three recurring challenges. First, models like MolT5 and MolXPT rely on SMILES to represent molecules, but SMILES strings are syntactically fragile: random perturbations or model-generated sequences frequently produce invalid molecular structures. Edwards et al. (2022) and Li et al. (2023) both highlight this validity problem as a bottleneck for text-to-molecule generation. Second, the contextual information surrounding molecular and protein names in scientific literature (e.g., mentions in <a href="https://en.wikipedia.org/wiki/PubMed">PubMed</a> abstracts that describe properties, interactions, and experimental results) remains underutilized. Most models either ignore this context or treat it identically to structured database entries. Third, existing approaches like MolT5 and <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a> share a single tokenizer and embedding space across molecules, proteins, and text. This leads to chemically incorrect tokenization: the bromine atom &ldquo;Br&rdquo; in SMILES gets split into &ldquo;B&rdquo; (boron) and &ldquo;r&rdquo;, producing erroneous downstream predictions.</p>
<p>BioT5 addresses all three issues simultaneously by adopting SELFIES for molecular representation, extracting entity-linked contextual knowledge from PubMed, and employing separate vocabularies for each modality.</p>
<h2 id="selfies-separate-tokenization-and-multi-task-pretraining">SELFIES, Separate Tokenization, and Multi-Task Pretraining</h2>
<p>The core innovations of BioT5 center on three design decisions:</p>
<h3 id="selfies-for-robust-molecular-representation">SELFIES for Robust Molecular Representation</h3>
<p>BioT5 replaces SMILES with SELFIES (Self-referencing Embedded Strings) for all molecular representations. Every permutation of symbols within the SELFIES alphabet generates a chemically valid molecular structure, guaranteeing 100% validity in generation tasks. Molecules from ZINC20 are converted from SMILES to SELFIES during data preprocessing.</p>
<h3 id="modality-specific-tokenization">Modality-Specific Tokenization</h3>
<p>Rather than sharing a single SentencePiece vocabulary across modalities, BioT5 maintains three separate dictionaries:</p>
<ul>
<li><strong>Molecules</strong>: Each SELFIES token corresponds to a chemically meaningful atom group enclosed in brackets (e.g., <code>[C]</code>, <code>[=C]</code>, <code>[Br]</code>).</li>
<li><strong>Proteins</strong>: Amino acids are prefixed with a special <code>&lt;p&gt;</code> token to distinguish them from text characters (e.g., <code>&lt;p&gt;M</code>, <code>&lt;p&gt;K</code>, <code>&lt;p&gt;R</code>).</li>
<li><strong>Text</strong>: The standard T5 vocabulary is retained.</li>
</ul>
<p>This prevents semantic conflation across modalities. The total vocabulary size is 35,073, and the model comprises 252M parameters using the T5-v1.1-base architecture.</p>
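<p>A toy illustration of the modality-aware tokenization (illustrative only, not the released BioT5 tokenizer): SELFIES strings split on brackets, amino acids receive the <code>&lt;p&gt;</code> prefix, and text falls through to a subword vocabulary, here approximated by whitespace splitting:</p>

```python
import re


def tokenize(sequence: str, modality: str):
    """Toy modality-aware tokenizer in the spirit of BioT5's separate
    vocabularies (a sketch, not the released tokenizer)."""
    if modality == "molecule":
        # SELFIES tokens are bracket-enclosed atom groups, e.g. [C], [=C], [Br]
        return re.findall(r"\[[^\]]*\]", sequence)
    if modality == "protein":
        # amino acids get a <p> prefix so 'M' (Met) never collides with text 'M'
        return [f"<p>{aa}" for aa in sequence]
    # stand-in for the standard T5 subword vocabulary
    return sequence.split()
```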
<h3 id="multi-task-pretraining-objectives">Multi-Task Pretraining Objectives</h3>
<p>BioT5 uses six pretraining tasks organized into three categories:</p>
<ol>
<li><strong>Single-modal T5 objective</strong>: Standard span corruption and recovery applied independently to molecule SELFIES (task 1), protein <a href="https://en.wikipedia.org/wiki/FASTA_format">FASTA</a> (task 2), and general text from C4 (task 3).</li>
<li><strong>Wrapped text T5 objective</strong> (task 4): Applied to PubMed articles where molecular names are replaced with corresponding SELFIES strings and gene names are appended with protein FASTA sequences, using BERN2 for named entity recognition and entity linking.</li>
<li><strong>Bidirectional translation</strong> (tasks 5 and 6): Molecule SELFIES to text description and vice versa (using 339K pairs from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>), and protein FASTA to text description and vice versa (using 569K pairs from <a href="https://en.wikipedia.org/wiki/UniProt">Swiss-Prot</a>).</li>
</ol>
<p>The translation direction is randomly sampled with probability 0.5 for each example. For downstream tasks, BioT5 uses prompt-based fine-tuning to cast all tasks into a sequence generation format, reducing the gap between pretraining and fine-tuning.</p>
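<p>The single-modal T5 objective (tasks 1-3) can be sketched as span corruption over modality tokens: contiguous spans are replaced by sentinels in the input, and the target reconstructs the dropped spans. This is a simplified deterministic version taking explicit span indices; the actual objective samples spans randomly:</p>

```python
def span_corrupt(tokens, spans):
    """Build a T5 span-corruption (input, target) pair, where each span
    is a (start, end) half-open index range into `tokens` (sketch)."""
    inp, tgt = [], []
    prev = 0
    for k, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp += tokens[prev:s] + [sentinel]   # drop the span, keep a marker
        tgt += [sentinel] + tokens[s:e]      # target restores it after the marker
        prev = e
    inp += tokens[prev:]
    tgt.append(f"<extra_id_{len(spans)}>")   # closing sentinel, T5 convention
    return inp, tgt
```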
<h2 id="evaluation-across-15-downstream-tasks">Evaluation Across 15 Downstream Tasks</h2>
<p>BioT5 is evaluated on 15 tasks organized into three categories: single-instance prediction, multi-instance prediction, and cross-modal generation.</p>
<h3 id="molecule-property-prediction-moleculenet">Molecule Property Prediction (MoleculeNet)</h3>
<p>BioT5 is evaluated on six binary classification tasks from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> using scaffold splitting: BBBP, Tox21, ClinTox, HIV, BACE, and SIDER. Results are averaged over three random runs.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GEM</th>
          <th>MolXPT</th>
          <th>BioT5</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>72.4</td>
          <td>80.0</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>78.1</td>
          <td>77.1</td>
          <td>77.9</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>90.1</td>
          <td>95.3</td>
          <td>95.4</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>78.1</td>
          <td><strong>81.0</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>85.6</td>
          <td>88.4</td>
          <td><strong>89.4</strong></td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>67.2</td>
          <td>71.7</td>
          <td><strong>73.2</strong></td>
      </tr>
      <tr>
          <td><strong>Avg</strong></td>
          <td>79.0</td>
          <td>81.9</td>
          <td><strong>82.4</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best average AUROC (82.4) across all six datasets, surpassing both GNN-based methods (GEM) and language model baselines (MolXPT).</p>
<h3 id="protein-property-prediction-peer-benchmark">Protein Property Prediction (PEER Benchmark)</h3>
<p>On the PEER benchmark, BioT5 is evaluated on protein solubility and subcellular localization prediction:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>Solubility (Acc)</th>
          <th>Localization (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESM-1b</td>
          <td>652.4M</td>
          <td>70.23</td>
          <td><strong>92.40</strong></td>
      </tr>
      <tr>
          <td>ProtBert</td>
          <td>419.9M</td>
          <td>68.15</td>
          <td>91.32</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252.1M</td>
          <td><strong>74.65</strong></td>
          <td>91.69</td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best solubility prediction accuracy (74.65%) despite being 2-3x smaller than dedicated protein language models like ESM-1b and ProtBert.</p>
<h3 id="drug-target-interaction-prediction">Drug-Target Interaction Prediction</h3>
<p>BioT5 is evaluated on three DTI datasets (BioSNAP, Human, BindingDB) with five random runs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BioSNAP AUROC</th>
          <th>Human AUROC</th>
          <th>BindingDB AUROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugBAN</td>
          <td>0.903</td>
          <td>0.982</td>
          <td>0.960</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.989</strong></td>
          <td><strong>0.963</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 consistently outperforms DrugBAN and other specialized DTI models across all three datasets.</p>
<h3 id="molecule-captioning-and-text-based-molecule-generation">Molecule Captioning and Text-Based Molecule Generation</h3>
<p>On the ChEBI-20 dataset, BioT5 outperforms all baselines in molecule captioning:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>BLEU-4</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-large</td>
          <td>783M</td>
          <td>0.508</td>
          <td>0.614</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>MolXPT</td>
          <td>350M</td>
          <td>0.505</td>
          <td>0.626</td>
          <td>0.594</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td><strong>0.556</strong></td>
          <td><strong>0.656</strong></td>
          <td><strong>0.603</strong></td>
      </tr>
  </tbody>
</table>
<p>For text-based molecule generation, BioT5 achieves an exact match score of 0.413 (vs. 0.311 for MolT5-large) while maintaining 100% validity, compared to 90.5% for MolT5-large. This demonstrates the direct benefit of SELFIES: every generated sequence is a valid molecule.</p>
<h3 id="protein-protein-interaction-prediction">Protein-Protein Interaction Prediction</h3>
<p>On the PEER PPI benchmarks (Yeast and Human), BioT5 achieves competitive results, outperforming fully fine-tuned ProtBert and ESM-1b on the Yeast dataset (64.89% vs. 63.72% for ProtBert) and placing second on Human (86.22% vs. 88.06% for ESM-1b with frozen weights).</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BioT5 demonstrates that integrating molecular, protein, and textual modalities within a single pretraining framework yields consistent improvements across diverse biological tasks. Three factors drive BioT5&rsquo;s performance: (1) SELFIES guarantees 100% molecular validity in generation tasks, eliminating a persistent failure mode of SMILES-based models; (2) separate tokenization preserves the semantic integrity of each modality; (3) wrapped text pretraining on PubMed provides contextual biological knowledge that pure sequence models miss.</p>
<p>The authors acknowledge several limitations. BioT5 requires full-parameter fine-tuning for each downstream task because instruction-tuning does not generalize across tasks, and combining datasets via instructions causes data leakage (the authors note overlaps between BindingDB training data and BioSNAP/Human test sets). The model only handles sequence-format bio-entities and does not incorporate 2D or 3D structural information. Additional biological modalities such as DNA/RNA sequences and cell-level data are also left for future work.</p>
<p>The authors also note risks: BioT5 could potentially be misused to generate dangerous molecules, and it may fail to generate effective therapeutic molecules or produce compounds with adverse side effects.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (molecules)</td>
          <td>ZINC20</td>
          <td>~300M molecules</td>
          <td>Converted from SMILES to SELFIES</td>
      </tr>
      <tr>
          <td>Pretraining (proteins)</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniRef50</a></td>
          <td>27M proteins</td>
          <td>Filtered by length</td>
      </tr>
      <tr>
          <td>Pretraining (text)</td>
          <td>C4</td>
          <td>Large</td>
          <td>Standard T5 corpus</td>
      </tr>
      <tr>
          <td>Pretraining (wrapped text)</td>
          <td>PubMed</td>
          <td>33M articles</td>
          <td>Entity linking via BERN2</td>
      </tr>
      <tr>
          <td>Pretraining (molecule-text pairs)</td>
          <td>PubChem</td>
          <td>339K pairs</td>
          <td>Excludes ChEBI-20 molecules</td>
      </tr>
      <tr>
          <td>Pretraining (protein-text pairs)</td>
          <td>Swiss-Prot</td>
          <td>569K pairs</td>
          <td>High-quality annotations</td>
      </tr>
      <tr>
          <td>Evaluation (molecular properties)</td>
          <td>MoleculeNet</td>
          <td>6 datasets</td>
          <td>Scaffold splitting</td>
      </tr>
      <tr>
          <td>Evaluation (protein properties)</td>
          <td>PEER</td>
          <td>2 tasks</td>
          <td>Solubility and localization</td>
      </tr>
      <tr>
          <td>Evaluation (DTI)</td>
          <td>BioSNAP, Human, BindingDB</td>
          <td>3 datasets</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Evaluation (PPI)</td>
          <td>Yeast, Human</td>
          <td>2 datasets</td>
          <td>From PEER benchmark</td>
      </tr>
      <tr>
          <td>Evaluation (generation)</td>
          <td>ChEBI-20</td>
          <td>33K pairs</td>
          <td>Molecule captioning and text-to-molecule</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5-v1.1-base (encoder-decoder transformer)</li>
<li>Optimizer: AdamW with RMS scaling</li>
<li>Learning rate: cosine annealing, base $1 \times 10^{-2}$, minimum $1 \times 10^{-5}$</li>
<li>Warmup steps: 10,000</li>
<li>Dropout: 0.0</li>
<li>Maximum input length: 512 tokens</li>
<li>Pretraining steps: 350K</li>
<li>Batch size: 96 per GPU (6 data types per batch)</li>
<li>Prompt-based fine-tuning for all downstream tasks</li>
</ul>
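<p>The cosine-annealed schedule with linear warmup described above can be sketched in a few lines. This is an illustrative reconstruction from the listed hyperparameters (base $1 \times 10^{-2}$, minimum $1 \times 10^{-5}$, 10,000 warmup steps, 350K total steps); the exact warmup shape in the BioT5/nanoT5 codebase may differ.</p>

```python
import math

def lr_schedule(step, warmup=10_000, total=350_000,
                base_lr=1e-2, min_lr=1e-5):
    """Cosine annealing with linear warmup (illustrative sketch)."""
    if step < warmup:
        # linear warmup from 0 to base_lr
        return base_lr * step / warmup
    # cosine decay from base_lr down to min_lr over the remaining steps
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

<p>At step 10,000 the schedule peaks at the base rate and then decays smoothly to the minimum by step 350K.</p>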
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Vocabulary Size</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td>35,073</td>
          <td>T5-v1.1-base</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecule property prediction: AUROC on 6 MoleculeNet tasks (scaffold split, 3 runs)</li>
<li>Protein property prediction: accuracy on PEER benchmark (3 runs)</li>
<li>Drug-target interaction: AUROC, AUPRC, accuracy on 3 DTI datasets (5 runs)</li>
<li>Protein-protein interaction: accuracy on 2 PPI datasets (3 runs)</li>
<li>Molecule captioning: BLEU, ROUGE, METEOR, Text2Mol on ChEBI-20</li>
<li>Text-based molecule generation: BLEU, exact match, fingerprint similarities, FCD, validity on ChEBI-20</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8x NVIDIA A100 80GB GPUs for pretraining</li>
<li>Codebase: nanoT5</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/QizhiPei/BioT5">BioT5 Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Pei, Q., Zhang, W., Zhu, J., Wu, K., Gao, K., Wu, L., Xia, Y., &amp; Yan, R. (2023). BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations. <em>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</em>, 1102-1123. <a href="https://doi.org/10.18653/v1/2023.emnlp-main.70">https://doi.org/10.18653/v1/2023.emnlp-main.70</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{pei2023biot5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1102--1123}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.emnlp-main.70}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MTL-BERT: Multitask BERT for Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</guid><description>MTL-BERT combines BERT pretraining, multitask learning, and SMILES enumeration for molecular property prediction across 60 drug discovery datasets.</description><content:encoded><![CDATA[<h2 id="a-multitask-bert-framework-for-molecular-property-prediction">A Multitask BERT Framework for Molecular Property Prediction</h2>
<p>MTL-BERT is a <strong>Method</strong> paper that introduces a multitask learning framework built on BERT for predicting molecular properties from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a>. The primary contribution is the combination of three strategies to address data scarcity in drug discovery: (1) masked token pretraining on 1.7 million unlabeled molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, (2) multitask fine-tuning across 60 property prediction datasets simultaneously, and (3) <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> as a data augmentation technique applied during pretraining, fine-tuning, and inference. The model achieves strong performance across 60 <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> and molecular property datasets (44 classification and 16 regression), outperforming baselines including GNNs, XGBoost with molecular fingerprints, and prior <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a> approaches.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p>Deep learning methods for molecular property prediction face a fundamental tension: they require large amounts of labeled data to learn effectively, but labeled bioactivity data is scarce due to the cost and time of laboratory experiments. Existing approaches at the time of publication addressed this in isolation. Graph neural networks (GNNs) learn from molecular graphs but are typically shallow (2-3 layers) and prone to overfitting on small datasets. The original SMILES-BERT model applied masked language modeling to SMILES strings but fine-tuned separately for each task, missing opportunities to share information across related properties. Fixed molecular representations like <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (continuous and data-driven descriptors) cannot be further optimized for specific downstream tasks.</p>
<p>The authors identify three specific gaps: (1) single-task fine-tuning wastes the correlations between related ADMET properties (e.g., <a href="https://en.wikipedia.org/wiki/Lipophilicity">lipophilicity</a> relates to many ADMET endpoints), (2) using only canonical SMILES limits the model&rsquo;s ability to learn robust molecular features, and (3) no prior work had combined pretraining, multitask learning, and SMILES enumeration into a unified framework.</p>
<h2 id="three-strategies-combined-pretraining-multitask-learning-and-smiles-enumeration">Three Strategies Combined: Pretraining, Multitask Learning, and SMILES Enumeration</h2>
<p>The core innovation of MTL-BERT is the synergistic combination of three strategies in a single pipeline.</p>
<h3 id="masked-smiles-pretraining">Masked SMILES Pretraining</h3>
<p>Following the BERT paradigm, MTL-BERT pretrains on 1.7 million unlabeled molecules from ChEMBL using a masked token recovery task. For each SMILES string, 15% of tokens are randomly selected: 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% remain unchanged. The loss is computed only at masked positions. Unlike the original BERT, MTL-BERT omits the next-sentence prediction task since there is no sequential relationship between SMILES strings (following the RoBERTa finding that this task is unnecessary).</p>
<p>SMILES strings are tokenized with a regular expression that captures multi-character tokens (e.g., Si, Br, Cl) and common SMILES syntax. The model uses positional encoding to capture token order.</p>
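<p>The tokenization-plus-masking pipeline can be sketched as follows. The regex is a simplified illustration of the multi-character tokenizer described above, not the paper's exact pattern, and the 80/10/10 split mirrors the BERT masking recipe.</p>

```python
import random
import re

# Simplified SMILES token pattern (illustrative): bracket atoms,
# two-letter elements (Br, Cl, Si), ring closures, bonds, branches.
TOKEN_RE = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|%\d{2}|[BCNOPSFI]|[bcnops]|[=#\-+()/\\.@:\d]"
)

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking: of the selected 15%, 80% become [MASK],
    10% a random token, 10% stay unchanged. Returns (inputs, labels);
    labels are None at unmasked positions, so loss is computed only
    at masked ones."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            labels.append(tok)            # prediction target
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")
            elif r < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)        # kept, but still predicted
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

toks = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
inputs, labels = mask_tokens(toks, vocab=sorted(set(toks)))
```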
<h3 id="transformer-architecture">Transformer Architecture</h3>
<p>The model uses a standard Transformer encoder with multihead self-attention. The scaled dot-product attention computes:</p>
<p>$$\mathbf{O}_h = \text{softmax}\left(\frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d_k}}\right) \mathbf{V}_h$$</p>
<p>where $\mathbf{Q}_h$, $\mathbf{K}_h$, and $\mathbf{V}_h$ are the query, key, and value matrices for head $h$, and the scaling factor $\sqrt{d_k}$ is the square root of the key dimension. The outputs from all heads are concatenated and projected. Each attention sublayer is followed by a position-wise feedforward network with GELU activation, layer normalization, and residual connections.</p>
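<p>A minimal single-head version of the attention formula above, written with plain lists so the arithmetic is explicit (real implementations use batched tensor libraries):</p>

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head, with matrices
    given as lists of row vectors."""
    d_k = len(K[0])
    KT = [list(col) for col in zip(*K)]            # transpose of K
    scores = matmul(Q, KT)                         # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row])
               for row in scores]                  # row-wise softmax
    return matmul(weights, V)

out = attention([[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

<p>Each output row is a convex combination of the value rows, so with one-hot values the rows sum to one, and each query attends most strongly to its matching key.</p>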
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
          <th>Fine-tuning Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MTL-BERT_SMALL</td>
          <td>4</td>
          <td>4</td>
          <td>128</td>
          <td>512</td>
          <td>0.931</td>
          <td>0.826</td>
      </tr>
      <tr>
          <td>MTL-BERT_MEDIUM</td>
          <td>8</td>
          <td>8</td>
          <td>256</td>
          <td>1,024</td>
          <td>0.962</td>
          <td>0.852</td>
      </tr>
      <tr>
          <td>MTL-BERT_LARGE</td>
          <td>12</td>
          <td>12</td>
          <td>576</td>
          <td>2,304</td>
          <td>0.974</td>
          <td>0.848</td>
      </tr>
  </tbody>
</table>
<p>The medium model was selected for its best fine-tuning performance with lower computational cost, despite the large model achieving higher pretraining recovery accuracy. The slight performance drop for the large model suggests mild overfitting.</p>
<h3 id="multitask-fine-tuning-with-task-tokens">Multitask Fine-tuning with Task Tokens</h3>
<p>During fine-tuning, task tokens ([T0], [T1], &hellip;) are prepended to each input SMILES string. The Transformer output at each task token position is passed through a task-specific two-layer feedforward network for the corresponding prediction task. An attention mask prevents direct information exchange between task tokens, allowing each task to learn directly from SMILES tokens without interference. This design also reduces the discrepancy between pretraining (no task tokens visible) and fine-tuning.</p>
<p>Cross-entropy loss is used for classification tasks and mean squared error for regression tasks. The total multitask loss is a simple sum of per-task losses without learned weighting.</p>
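<p>The task-token attention mask can be sketched as below. The paper states only that task tokens cannot exchange information directly; the treatment of SMILES-to-task-token attention here is one plausible reading (SMILES tokens attend only to each other, matching what they saw in pretraining), labeled as an assumption.</p>

```python
def task_token_attention_mask(n_tasks, n_smiles):
    """Boolean attention mask (True = attention allowed) for the
    sequence [T0, ..., T{k-1}, s1, ..., sn]. Each task token attends
    to itself and to all SMILES tokens, but not to other task tokens;
    SMILES tokens attend only to SMILES tokens (assumed)."""
    size = n_tasks + n_smiles
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            i_task, j_task = i < n_tasks, j < n_tasks
            if i_task:
                # task token: self plus every SMILES position
                mask[i][j] = (i == j) or not j_task
            else:
                # SMILES token: only other SMILES positions
                mask[i][j] = not j_task
    return mask

m = task_token_attention_mask(2, 3)
```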
<h3 id="smiles-enumeration-as-data-augmentation">SMILES Enumeration as Data Augmentation</h3>
<p>A molecule can be represented by multiple valid SMILES strings by varying starting atoms and traversal orders. MTL-BERT applies SMILES enumeration at all three stages:</p>
<ol>
<li><strong>Pretraining</strong>: Enumerated SMILES increase diversity of the self-supervised training data.</li>
<li><strong>Fine-tuning</strong>: Each dataset is augmented 20x with random SMILES variants, increasing data diversity and helping the model learn position-invariant features.</li>
<li><strong>Inference</strong>: Multiple SMILES are generated per test molecule, and their predictions are fused (averaged) into a more robust final prediction.</li>
</ol>
<p>The 20x augmentation factor was chosen based on prior work showing diminishing returns beyond this level while significantly increasing computational cost.</p>
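<p>The inference-time fusion step reduces to averaging a model's outputs across SMILES variants of the same molecule. A minimal sketch, where <code>model</code> is a hypothetical stand-in for MTL-BERT (any callable mapping a SMILES string to a prediction):</p>

```python
def fuse_predictions(model, enumerated_smiles):
    """Average the model's predictions over multiple enumerated
    SMILES of the same molecule (inference-time fusion)."""
    preds = [model(smi) for smi in enumerated_smiles]
    return sum(preds) / len(preds)

def noisy_model(smi):
    # Toy stand-in whose output depends on the string form of the
    # SMILES, so fusion across variants actually changes the result.
    return len(smi) % 5 / 10 + 0.5

variants = ["CCO", "OCC", "C(O)C"]  # three SMILES for ethanol
fused = fuse_predictions(noisy_model, variants)
```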
<h2 id="experimental-evaluation-across-60-datasets">Experimental Evaluation Across 60 Datasets</h2>
<h3 id="setup">Setup</h3>
<p>MTL-BERT was evaluated on 60 datasets (44 classification, 16 regression) covering ADMET properties and common molecular benchmarks. Datasets were sourced from ADMETlab and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>. Each dataset was split 8:1:1 (train/validation/test), and experiments were repeated 10 times with random splits, reporting mean and standard deviation.</p>
<p>Classification tasks were evaluated with <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> and accuracy; regression tasks with $R^2$ and RMSE.</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ul>
<li><strong>ECFP4-XGBoost</strong>: Extended-connectivity fingerprints (diameter 4) with gradient boosting</li>
<li><strong>Graph Attention Network (GAT)</strong></li>
<li><strong>Graph Convolutional Network (GCN)</strong></li>
<li><strong>AttentiveFP</strong>: A GNN with attention for molecular property prediction</li>
<li><strong>CDDD</strong>: Continuous and data-driven descriptors from a pretrained RNN auto-encoder</li>
</ul>
<h3 id="ablation-study">Ablation Study</h3>
<p>Three model variants were compared to isolate contributions:</p>
<ul>
<li><strong>MTL-BERT</strong>: Full model (pretraining + multitask + SMILES enumeration)</li>
<li><strong>STL-BERT</strong>: Single-task fine-tuning with SMILES enumeration (no multitask)</li>
<li><strong>Cano-BERT</strong>: Canonical SMILES only, single-task fine-tuning (equivalent to SMILES-BERT)</li>
</ul>
<p>Cano-BERT showed more than 10% degradation on several datasets (CL, Fu, LC50DM) compared to STL-BERT, demonstrating the importance of SMILES enumeration. MTL-BERT outperformed STL-BERT on most datasets, with improvements exceeding 5% on $F_{20\%}$, SR-ARE, and SR-ATAD5, confirming that multitask learning provides additional benefit on top of enumeration.</p>
<h3 id="results-vs-baselines">Results vs. Baselines</h3>
<p>MTL-BERT outperformed all baselines on nearly all 60 datasets. Specific findings:</p>
<ul>
<li>ECFP4-XGBoost performed inconsistently, doing well on some tasks (e.g., $F_{30\%}$, BACE, CL) but poorly on others, reflecting the limitation of fixed-length fingerprint representations.</li>
<li>GNNs generally improved over fingerprints but still suffered from data scarcity, falling behind ECFP4-XGBoost by more than 3% on $F_{30\%}$, Carcinogenicity, CL, and VD.</li>
<li>MTL-BERT surpassed all baselines except on CYP2C19-sub and BACE (by less than 1.1%).</li>
<li>On 14 tasks (NR-ER, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, Bioconcentration Factor, Fu, LC50FM, Lipophilicity, CL, PPB, VD, LC50DM), MTL-BERT exceeded the best baseline by more than 5-10%.</li>
<li>Improvements were statistically significant at the 95% confidence level (paired t-test, $P \leq 0.001$).</li>
</ul>
<h3 id="representation-analysis">Representation Analysis</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pretrained token embeddings (from 1,000 randomly selected molecules, approximately 35,000 tokens) showed that:</p>
<ul>
<li>Tokens of the same type cluster together (capturing atomic type information).</li>
<li>Within type clusters, sub-groups correspond to different chemical environments (e.g., oxygen atoms in nitrate groups vs. carbonyl groups).</li>
<li>Nearby embeddings share similar molecular neighborhood environments.</li>
</ul>
<h3 id="attention-based-interpretability">Attention-based Interpretability</h3>
<p>The model&rsquo;s attention weights provide interpretability for predictions:</p>
<ul>
<li>For a solubility task (LogS/LogD), attention concentrated on polar groups, which are known determinants of aqueous solubility.</li>
<li>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> (mutagenicity), attention focused on <a href="https://en.wikipedia.org/wiki/Azide">azide</a>, nitrosamide, <a href="https://en.wikipedia.org/wiki/Acyl_chloride">acylchloride</a>, and nitrite groups, which are known mutagenic structural alerts.</li>
</ul>
<h2 id="performance-gains-from-combined-strategies-with-interpretable-attention">Performance Gains from Combined Strategies with Interpretable Attention</h2>
<p>MTL-BERT demonstrates that the combination of pretraining, multitask learning, and SMILES enumeration is more effective than any individual strategy for molecular property prediction. The ablation study provides clear evidence for the additive benefit of each component.</p>
<p>Key strengths include the breadth of evaluation (60 datasets covering diverse ADMET endpoints), the consistent improvement over multiple baseline types (fingerprints, GNNs, pretrained representations), and the interpretable attention mechanism that highlights chemically meaningful substructures.</p>
<p>Limitations to note: the simple sum of multitask losses (no learned task weighting) may not be optimal when tasks have very different scales or when some tasks are unrelated. The authors observe slight degradation on a few datasets (AMES, CYP1A2-Sub, FreeSolv), suggesting negative transfer in those cases. The 20x SMILES enumeration significantly increases computational cost during fine-tuning and inference. The paper does not report wall-clock training times or GPU hours, making it difficult to assess the practical cost of the enumeration strategy. Hardware details are not specified beyond acknowledgment of the High-Performance Computing Center at Central South University.</p>
<p>The hierarchical clustering of task representations reveals meaningful task groupings (e.g., LogD and LogP cluster together due to their shared relationship with water solubility), supporting the premise that multitask learning captures cross-task correlations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>1.7M molecules</td>
          <td>Unlabeled SMILES; 10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning/Evaluation</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>60 datasets (44 classification, 16 regression)</td>
          <td>8:1:1 train/val/test split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Masked token prediction (15% masking rate: 80% [MASK], 10% random, 10% unchanged). Adam optimizer, learning rate 1e-4, batch size 512, 50 epochs.</li>
<li><strong>Fine-tuning</strong>: Adam optimizer, learning rate 5e-5, batch size 64, dropout 0.1. Cross-entropy for classification, MSE for regression. Early stopping with patience 20, max 200 epochs.</li>
<li><strong>SMILES enumeration</strong>: 20x augmentation; generation is retried up to 100 times when an enumerated SMILES duplicates one already produced.</li>
<li><strong>Inference fusion</strong>: Predictions from multiple enumerated SMILES are averaged.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>MTL-BERT_MEDIUM (selected model): 8 layers, 8 attention heads, 256 embedding size, 1,024 FFN size</li>
<li>Pretraining recovery accuracy: 0.962</li>
<li>1,000 task tokens pre-allocated for future tasks</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Classification</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Regression</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Secondary metric</td>
      </tr>
  </tbody>
</table>
<p>All experiments repeated 10 times with random splits; mean and standard deviation reported.</p>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The authors acknowledge the High-Performance Computing Center of Central South University.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/MTL-BERT">MTL-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Pretraining data source</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Fine-tuning benchmark</td>
      </tr>
      <tr>
          <td><a href="https://admetmesh.scbdd.com/">ADMETlab</a></td>
          <td>Dataset</td>
          <td>Free for academic use</td>
          <td>ADMET property datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yi, J.-C., Zeng, X.-X., Yang, C.-Q., Lu, A.-P., Hou, T.-J., &amp; Cao, D.-S. (2022). Pushing the boundaries of molecular property prediction for drug discovery with multitask learning BERT enhanced by SMILES enumeration. <em>Research</em>, 2022, Article 0004. <a href="https://doi.org/10.34133/research.0004">https://doi.org/10.34133/research.0004</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2022mtlbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yi, Jia-Cai and Zeng, Xiang-Xiang and Yang, Can-Qun and Lu, Ai-Ping and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{Article 0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.34133/research.0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science (AAAS)}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mol2vec: Unsupervised ML with Chemical Intuition</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/</guid><description>Mol2vec applies Word2vec to Morgan substructures, learning dense vector representations of molecules that capture chemical similarity for property prediction.</description><content:encoded><![CDATA[<h2 id="word2vec-meets-cheminformatics">Word2vec Meets Cheminformatics</h2>
<p>Mol2vec is a <strong>Method</strong> paper that introduces an unsupervised approach for learning dense vector representations of molecular substructures. The core idea is a direct analogy to <a href="/notes/machine-learning/model-architectures/distributed-representations/">Word2vec</a> from natural language processing: molecular substructures (derived from the Morgan algorithm) are treated as &ldquo;words,&rdquo; and entire molecules are treated as &ldquo;sentences.&rdquo; By training on a large unlabeled corpus of 19.9 million compounds, Mol2vec produces embeddings where chemically related substructures occupy nearby regions of vector space. Compound-level vectors are then obtained by summing constituent substructure vectors, and these can serve as features for downstream supervised learning tasks.</p>
<h2 id="sparse-fingerprints-and-their-limitations">Sparse Fingerprints and Their Limitations</h2>
<p>Molecular fingerprints, particularly Morgan fingerprints (extended-connectivity fingerprints, ECFP), are among the most widely used molecular representations in cheminformatics. They perform well for similarity searching, virtual screening, and activity prediction. However, they suffer from several practical drawbacks:</p>
<ul>
<li><strong>High dimensionality and sparsity</strong>: Morgan fingerprints are typically hashed to fixed-length binary vectors (e.g., 2048 or 4096 bits), resulting in very sparse representations.</li>
<li><strong>Bit collisions</strong>: The hashing step can map distinct substructures to the same bit position, losing structural information.</li>
<li><strong>No learned relationships</strong>: Each bit is independent, so the representation does not encode any notion of chemical similarity between substructures.</li>
</ul>
<p>At the time of this work (2017), NLP techniques had started to appear in cheminformatics. The <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> method had been applied to Morgan fingerprints for compound-protein interaction prediction, and <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a> had been used for chemical topic modeling. The Word2vec concept had been adapted for protein sequences (ProtVec) but had not yet been applied to small molecules. Mol2vec fills this gap.</p>
<h2 id="from-substructure-identifiers-to-dense-embeddings">From Substructure Identifiers to Dense Embeddings</h2>
<p>The central insight of Mol2vec is that the Morgan algorithm already produces a natural &ldquo;vocabulary&rdquo; of molecular substructures, and the order in which these substructures appear in a molecule provides local context, analogous to word order in a sentence.</p>
<h3 id="corpus-construction">Corpus Construction</h3>
<p>The training corpus was assembled from <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> v15 and <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v23, merged and deduplicated, then filtered by molecular weight (12-600), heavy atom count (3-50), clogP (-5 to 7), and allowed elements (H, B, C, N, O, F, P, S, Cl, Br). This yielded 19.9 million compounds.</p>
<h3 id="sentence-generation">Sentence Generation</h3>
<p>For each molecule, the Morgan algorithm generates atom identifiers at radius 0 and radius 1. Each atom contributes two identifiers (one per radius), ordered according to the atom order in the canonical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>. This sequence of identifiers forms a &ldquo;sentence&rdquo; for Word2vec training.</p>
<h3 id="word2vec-training">Word2vec Training</h3>
<p>The model was trained using the gensim implementation of Word2vec. After evaluating both CBOW and Skip-gram architectures with window sizes of 5, 10, and 20, and embedding dimensions of 100 and 300, the best configuration was:</p>
<ul>
<li><strong>Architecture</strong>: Skip-gram</li>
<li><strong>Window size</strong>: 10</li>
<li><strong>Embedding dimension</strong>: 300</li>
</ul>
<p>Rare identifiers appearing fewer than 3 times in the corpus were replaced with a special &ldquo;UNSEEN&rdquo; token, which learns a near-zero vector. This allows the model to handle novel substructures at inference time.</p>
<h3 id="compound-vector-generation">Compound Vector Generation</h3>
<p>The final vector for a molecule is the sum of all its substructure vectors:</p>
<p>$$\mathbf{v}_{\text{mol}} = \sum_{i=1}^{N} \mathbf{v}_{s_i}$$</p>
<p>where $\mathbf{v}_{s_i}$ is the 300-dimensional embedding of the $i$-th substructure identifier in the molecule. Because repeated substructures are summed, the result implicitly encodes substructure counts and importance through vector magnitude.</p>
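<p>The summation with UNSEEN fallback can be sketched as below. The identifiers and 3-dimensional vectors are toy values for illustration (real Mol2vec uses 300-dimensional Skip-gram embeddings keyed by Morgan identifiers):</p>

```python
def compound_vector(identifiers, embeddings, dim=300):
    """Sum substructure embeddings into a molecule-level vector.
    Identifiers absent from the training vocabulary fall back to the
    near-zero 'UNSEEN' vector, as in the paper."""
    unseen = embeddings.get("UNSEEN", [0.0] * dim)
    vec = [0.0] * dim
    for ident in identifiers:
        emb = embeddings.get(ident, unseen)
        vec = [v + e for v, e in zip(vec, emb)]
    return vec

# Toy 3-d vocabulary with made-up identifier strings:
emb = {"id_a": [1.0, 0.0, 0.5],
       "id_b": [0.0, 1.0, 0.5],
       "UNSEEN": [0.0, 0.0, 0.0]}
v = compound_vector(["id_a", "id_b", "never_seen"], emb, dim=3)
```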
<h2 id="benchmarking-across-regression-and-classification-tasks">Benchmarking Across Regression and Classification Tasks</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors evaluated Mol2vec on four datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>Regression</td>
          <td>1,144</td>
          <td>Aqueous solubility prediction</td>
      </tr>
      <tr>
          <td>Ames</td>
          <td>Classification</td>
          <td>6,511</td>
          <td><a href="https://en.wikipedia.org/wiki/Mutagen">Mutagenicity</a> (balanced: 3,481 positive, 2,990 negative)</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Classification</td>
          <td>8,192</td>
          <td>12 human toxicity targets (imbalanced)</td>
      </tr>
      <tr>
          <td>Kinase</td>
          <td>Classification</td>
          <td>284 kinases</td>
          <td>Bioactivity from ChEMBL v23</td>
      </tr>
  </tbody>
</table>
<h3 id="machine-learning-methods">Machine Learning Methods</h3>
<p>Three ML methods were compared using both Mol2vec and Morgan FP features:</p>
<ul>
<li><strong>Random Forest (RF)</strong>: scikit-learn, 500 estimators</li>
<li><strong>Gradient Boosting Machine (GBM)</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>Deep Neural Network (DNN)</strong>: Keras/TensorFlow, 4 hidden layers with 2000 neurons each for Mol2vec; 1 hidden layer with 512 neurons for Morgan FP</li>
</ul>
<p>All models were validated using 20x 5-fold cross-validation with the <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for statistical comparison.</p>
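<p>The 20x 5-fold validation protocol amounts to reshuffling the data twenty times and taking five folds per shuffle, yielding 100 train/test splits per model. A stdlib-only sketch of the split generator (the statistical comparison itself would then apply the Wilcoxon signed-rank test to the paired per-split scores):</p>

```python
import random

def repeated_kfold(n_samples, k=5, repeats=20, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold CV,
    reshuffling the indices before each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k near-equal folds
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test

splits = list(repeated_kfold(100))  # 20 repeats x 5 folds = 100 splits
```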
<h3 id="esol-regression-results">ESOL Regression Results</h3>
<table>
  <thead>
      <tr>
          <th>Features</th>
          <th>Method</th>
          <th>$R^2_{\text{ext}}$</th>
          <th>MSE</th>
          <th>MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>MLR</td>
          <td>0.81 +/- 0.01</td>
          <td>0.82</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>Molecular Graph</td>
          <td>CNN</td>
          <td>0.93</td>
          <td>0.31 +/- 0.03</td>
          <td>0.40 +/- 0.00</td>
      </tr>
      <tr>
          <td>Morgan FP</td>
          <td>GBM</td>
          <td>0.66 +/- 0.00</td>
          <td>1.43 +/- 0.00</td>
          <td>0.88 +/- 0.00</td>
      </tr>
      <tr>
          <td>Mol2vec</td>
          <td>GBM</td>
          <td>0.86 +/- 0.00</td>
          <td>0.62 +/- 0.00</td>
          <td>0.60 +/- 0.00</td>
      </tr>
  </tbody>
</table>
<p>Mol2vec substantially outperformed Morgan FP ($R^2_{\text{ext}}$ 0.86 vs. 0.66) but did not match the best graph convolution methods ($R^2_{\text{ext}}$ ~0.93).</p>
<h3 id="classification-results-ames-and-tox21">Classification Results (Ames and Tox21)</h3>
<p>On the Ames dataset, Mol2vec and Morgan FP performed comparably (AUC 0.87 vs. 0.88), both matching or exceeding prior SVM and Naive Bayes results. On Tox21, both achieved an average AUC of 0.83, outperforming literature results from graph convolution (0.71) and DNN/SVM approaches (0.71-0.72).</p>
<h3 id="proteochemometric-pcm-extension">Proteochemometric (PCM) Extension</h3>
<p>Mol2vec was combined with ProtVec (protein sequence embeddings using the same Word2vec approach on 3-grams) by concatenating vectors, forming PCM2vec. This was evaluated using a rigorous 4-level cross-validation scheme:</p>
<ul>
<li><strong>CV1</strong>: New compound-target pairs</li>
<li><strong>CV2</strong>: New targets</li>
<li><strong>CV3</strong>: New compounds</li>
<li><strong>CV4</strong>: New compounds and targets</li>
</ul>
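<p>The four levels differ only in which entities are withheld. A minimal sketch, assuming the data arrives as (compound, target) pairs; the hold-out fractions and the function name are illustrative choices, not the paper's exact protocol:</p>

```python
import random

def pcm_splits(pairs, seed=0):
    """Illustrative CV1-CV4 test sets for proteochemometric data."""
    rng = random.Random(seed)
    compounds = sorted({c for c, _ in pairs})
    targets = sorted({t for _, t in pairs})
    held_c = set(rng.sample(compounds, max(1, len(compounds) // 5)))
    held_t = set(rng.sample(targets, max(1, len(targets) // 5)))

    cv1 = rng.sample(pairs, max(1, len(pairs) // 5))               # new pairs only
    cv2 = [p for p in pairs if p[1] in held_t]                     # new targets
    cv3 = [p for p in pairs if p[0] in held_c]                     # new compounds
    cv4 = [p for p in pairs if p[0] in held_c and p[1] in held_t]  # both new
    return cv1, cv2, cv3, cv4
```

<p>By construction, CV4 is the hardest setting: every pair there also appears in the CV2 and CV3 hold-outs.</p>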
<p>On Tox21, PCM2vec improved predictions for new compound-target pairs (CV1: AUC 0.87 vs. 0.79 for Morgan FP) and new compounds (CV3: AUC 0.85 vs. 0.78). On the kinase dataset, PCM2vec approached the performance of classical PCM (Morgan + z-scales) while being alignment-independent, meaning it can be applied to proteins with low sequence similarity.</p>
<h2 id="chemical-intuition-and-practical-value">Chemical Intuition and Practical Value</h2>
<h3 id="embedding-quality">Embedding Quality</h3>
<p>The learned substructure embeddings capture meaningful chemical relationships. Hierarchical clustering of the 25 most common substructures shows expected groupings: aromatic carbons cluster together, aliphatic ring carbons form a separate group, and carbonyl carbons and oxygens are closely related. Similarly, t-SNE projections of amino acid vectors encoded by Mol2vec reproduce known amino acid relationships (e.g., similar distances between Glu/Gln and Asp/Asn pairs, reflecting the carboxylic acid to amide transition).</p>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Skip-gram with 300-dimensional embeddings</strong> provides the best Mol2vec representations, consistent with NLP best practices.</li>
<li><strong>Mol2vec excels at regression tasks</strong>, substantially outperforming Morgan FP on ESOL solubility prediction ($R^2_{\text{ext}}$ 0.86 vs. 0.66).</li>
<li><strong>Classification performance is competitive</strong> with Morgan FP across Ames and Tox21 datasets.</li>
<li><strong>PCM2vec enables alignment-independent proteochemometrics</strong>, extending PCM approaches to diverse protein families with low sequence similarity.</li>
<li><strong>Tree-based methods (RF, GBM) outperformed DNNs</strong> on these tasks, though the authors note further DNN tuning could help.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The compound vector is a simple sum of substructure vectors, which discards information about substructure arrangement and molecular topology.</li>
<li>Only Morgan identifiers at radii 0 and 1 were used. Larger radii might capture more context but would increase vocabulary size.</li>
<li>DNN architectures were not extensively optimized, leaving open the question of how well Mol2vec pairs with deep learning.</li>
<li>The approach was benchmarked against Morgan FP but not against other learned representations such as graph neural networks in a controlled comparison.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC v15 + ChEMBL v23</td>
          <td>19.9M compounds</td>
          <td>Filtered by MW, atom count, clogP, element types</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,144 compounds</td>
          <td>Aqueous solubility regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Ames</td>
          <td>6,511 compounds</td>
          <td>Mutagenicity classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>8,192 compounds</td>
          <td>12 toxicity targets, retrieved via DeepChem</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Kinase (ChEMBL v23)</td>
          <td>284 kinases</td>
          <td>IC50/Kd/Ki binding assays</td>
      </tr>
      <tr>
          <td>Protein corpus</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a></td>
          <td>554,241 sequences</td>
          <td>For ProtVec training</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Word2vec</strong>: Skip-gram, window size 10, 300-dimensional embeddings, min count 3</li>
<li><strong>Morgan algorithm</strong>: Radii 0 and 1 (119 and 19,831 unique identifiers, respectively)</li>
<li><strong>UNSEEN token</strong>: Replaces identifiers occurring fewer than 3 times</li>
<li><strong>Compound vector</strong>: Sum of all substructure vectors</li>
</ul>
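<p>The compound-vector readout is just a sum with an UNSEEN fallback. A toy sketch: the real pipeline derives substructure identifiers from RDKit's Morgan algorithm and looks them up in 300-dimensional gensim Word2vec embeddings, whereas the identifiers and 3-dimensional vectors below are made up:</p>

```python
def compound_vector(substructure_ids, embeddings):
    """Mol2vec readout: a compound's vector is the sum of its substructure
    vectors; identifiers absent from the vocabulary (seen < 3 times during
    training) map to the shared UNSEEN vector."""
    dim = len(embeddings["UNSEEN"])
    vec = [0.0] * dim
    for ident in substructure_ids:
        emb = embeddings.get(ident, embeddings["UNSEEN"])
        vec = [v + e for v, e in zip(vec, emb)]
    return vec

# Hypothetical identifiers and tiny embeddings, for illustration only.
emb = {"UNSEEN": [0.0, 0.0, 1.0], "2245384272": [1.0, 0.0, 0.0]}
compound_vector(["2245384272", "rare_id"], emb)  # -> [1.0, 0.0, 1.0]
```

<p>This sum is what makes the representation order-invariant, and also what discards topology, as noted under Limitations.</p>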
<h3 id="models">Models</h3>
<ul>
<li><strong>RF</strong>: scikit-learn, 500 estimators, sqrt features, balanced class weights</li>
<li><strong>GBM</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>DNN</strong>: Keras/TensorFlow, 4 layers x 2000 neurons (Mol2vec) or 1 layer x 512 neurons (Morgan FP), ReLU activation, dropout 0.1</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Mol2vec Best</th>
          <th>Morgan FP Best</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2_{\text{ext}}$</td>
          <td>0.86 (GBM)</td>
          <td>0.66 (GBM)</td>
          <td>ESOL regression</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.87 (RF)</td>
          <td>0.88 (RF)</td>
          <td>Ames classification</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.83 (RF)</td>
          <td>0.83 (RF)</td>
          <td>Tox21 classification</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/samoturk/mol2vec">mol2vec</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Python package with pre-trained model</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jaeger, S., Fulle, S., &amp; Turk, S. (2018). Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. <em>Journal of Chemical Information and Modeling</em>, 58(1), 27-35. <a href="https://doi.org/10.1021/acs.jcim.7b00616">https://doi.org/10.1021/acs.jcim.7b00616</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jaeger2018mol2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jaeger, Sabrina and Fulle, Simone and Turk, Samo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{58}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27--35}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.7b00616}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MG-BERT: Graph BERT for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/mg-bert-molecular-graph-bert/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/mg-bert-molecular-graph-bert/</guid><description>MG-BERT integrates graph neural network message passing into BERT with masked atom pretraining on 1.7M molecules for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-aware-bert-for-molecular-property-prediction">A Graph-Aware BERT for Molecular Property Prediction</h2>
<p>MG-BERT is a <strong>Method</strong> paper that adapts the BERT pretraining paradigm from NLP to molecular graphs. The primary contribution is a modified Transformer architecture that replaces global self-attention with bond-based local attention, allowing atoms to exchange information only through chemical bonds. This creates a deep message-passing network that avoids the oversmoothing problem of conventional graph neural networks (GNNs). Combined with a masked atom prediction pretraining strategy on 1.7 million unlabeled molecules from ChEMBL, MG-BERT learns context-sensitive atomic representations that transfer effectively to downstream property prediction tasks.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p><a href="/notes/chemistry/molecular-design/property-prediction/">Molecular property prediction</a> is central to drug discovery, particularly for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) endpoints. While deep learning has advanced many domains, molecular property prediction faces a persistent challenge: labeled data scarcity. ADMET measurements require expensive, time-consuming experiments, and typical datasets contain only hundreds to thousands of examples.</p>
<p>Prior approaches fall into three categories, each with limitations:</p>
<ol>
<li><strong>Feature engineering</strong> (molecular fingerprints, descriptors): Requires expert design, suffers from low scalability, and fixed representations cannot be optimized for specific tasks.</li>
<li><strong>SMILES-based deep learning</strong> (CNNs, LSTMs, Transformers on SMILES strings): Must learn to parse molecular information from complex string syntax, increasing learning difficulty. Autoencoder-based methods (e.g., <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a>) learn fixed representations that cannot be fine-tuned.</li>
<li><strong>Graph neural networks</strong> (GAT, GCN): Can learn directly from molecular topology, but are limited to 2-3 layers due to oversmoothing, restricting their capacity to capture deep-level patterns.</li>
</ol>
<p>The BERT model from NLP demonstrated that self-supervised pretraining on large unlabeled corpora followed by fine-tuning on small labeled datasets can substantially improve downstream performance. <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a> applied this idea to SMILES strings directly, but suffered from interpretability issues due to auxiliary characters in the SMILES syntax. MG-BERT addresses these limitations by operating directly on molecular graphs.</p>
<h2 id="bond-based-local-attention-and-masked-atom-pretraining">Bond-Based Local Attention and Masked Atom Pretraining</h2>
<p>The core innovation of MG-BERT has two components: a modified Transformer architecture for molecular graphs and a self-supervised pretraining strategy.</p>
<h3 id="architecture-modifications">Architecture Modifications</h3>
<p>The original BERT model uses three components: an embedding layer, Transformer encoder layers, and a task-specific output layer. MG-BERT makes three key modifications:</p>
<ol>
<li>
<p><strong>Atom embeddings replace word embeddings.</strong> The dictionary contains 16 tokens: 13 common atom types ([H], [C], [N], [O], [F], [S], [Cl], [P], [Br], [B], [I], [Si], [Se]), plus [UNK] for rare atoms, [MASK] for pretraining, and [GLOBAL] for graph-level readout.</p>
</li>
<li>
<p><strong>No positional encoding.</strong> Unlike sequential text, atoms in a molecular graph have no inherent ordering, so positional embeddings are removed.</p>
</li>
<li>
<p><strong>Local attention replaces global attention.</strong> The adjacency matrix of the molecular graph is used as a visibility matrix to modulate the attention scores. Each atom can only attend to atoms connected by chemical bonds. Formally, the attention is constrained so that:</p>
</li>
</ol>
<p>$$A'_{ij} = \begin{cases} A_{ij} &amp; \text{if a bond exists between } i \text{ and } j \\ -\infty &amp; \text{otherwise} \end{cases}$$</p>
<p>where $A_{ij}$ is the standard scaled dot-product attention score. This local message passing makes MG-BERT a GNN variant, but one that can stack many more layers (six in the medium configuration) without oversmoothing, thanks to the residual connections inherited from the Transformer architecture.</p>
<ol start="4">
<li><strong>Supernode for graph-level readout.</strong> A [GLOBAL] supernode is added to each molecular graph, connected to all atoms. This node aggregates information from the entire molecule and serves as the molecular representation for downstream prediction.</li>
</ol>
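<p>The visibility-matrix mechanism can be sketched in a few lines of NumPy. This single head uses identity Q/K/V projections and skips the multi-head and feed-forward machinery, so it illustrates only the bond-based masking and the supernode convention:</p>

```python
import numpy as np

def local_attention(x, adj):
    """One bond-local self-attention head (MG-BERT-style sketch).

    x:   (n, d) atom features; adj: (n, n) 0/1 adjacency matrix.
    Row 0 is assumed to be the [GLOBAL] supernode, bonded to every atom.
    Returns the updated features and the attention weights."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                    # scaled dot products
    visible = adj + np.eye(n)                        # bonded neighbors + self
    scores = np.where(visible > 0, scores, -np.inf)  # mask non-bonded pairs
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)             # row-wise softmax
    return w @ x, w
```

<p>Because masked entries receive $-\infty$ before the softmax, their weights are exactly zero, so information flows only along chemical bonds (and through the supernode).</p>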
<h3 id="masked-atom-prediction">Masked Atom Prediction</h3>
<p>The pretraining strategy mirrors BERT&rsquo;s masked language model but operates on atoms:</p>
<ul>
<li>15% of atoms in each molecule are randomly selected (at least one atom per molecule)</li>
<li>Of selected atoms: 80% are replaced with [MASK], 10% are randomly replaced with another atom type, and 10% remain unchanged</li>
<li>The model is trained to predict the original atom type at masked positions</li>
<li>Loss is computed only at masked positions</li>
</ul>
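<p>The corruption step mirrors BERT's recipe closely and can be sketched as follows; <code>mask_atoms</code> is our name and the vocabulary below is truncated for brevity:</p>

```python
import random

def mask_atoms(atoms, rng, mask_rate=0.15):
    """Masked-atom corruption: select ~15% of positions (at least one),
    then apply the 80/10/10 [MASK]/random/keep rule. Returns the
    corrupted sequence and the positions where loss is computed."""
    vocab = ["[H]", "[C]", "[N]", "[O]", "[F]", "[S]", "[Cl]"]  # truncated
    n_pick = max(1, round(mask_rate * len(atoms)))
    picked = rng.sample(range(len(atoms)), n_pick)
    corrupted = list(atoms)
    for i in picked:
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # random type (may equal original)
        # else: keep the original atom, but still predict it
    return corrupted, set(picked)
```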
<h3 id="model-configurations">Model Configurations</h3>
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MG-BERT Small</td>
          <td>3</td>
          <td>2</td>
          <td>128</td>
          <td>256</td>
          <td>95.27%</td>
      </tr>
      <tr>
          <td>MG-BERT Medium</td>
          <td>6</td>
          <td>4</td>
          <td>256</td>
          <td>512</td>
          <td>98.31%</td>
      </tr>
      <tr>
          <td>MG-BERT Large</td>
          <td>12</td>
          <td>8</td>
          <td>576</td>
          <td>1152</td>
          <td>98.35%</td>
      </tr>
  </tbody>
</table>
<p>The medium configuration was selected for all experiments because it achieved the best downstream performance, despite the large model having slightly higher pretraining recovery accuracy. The authors attribute this to overfitting risk with the larger model.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MG-BERT was pretrained on 1.7 million compounds randomly selected from ChEMBL, with 10% held out for evaluation (1.53M training molecules). Molecules were converted to 2D undirected graphs using RDKit, with hydrogen atoms explicitly included. The model was pretrained for 10 epochs using Adam with learning rate 1e-4 and batch size 256.</p>
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<p>Sixteen datasets covering ADMET endpoints and common molecular properties were collected from ADMETlab and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Dataset</th>
          <th>Category</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Regression</td>
          <td>Caco2</td>
          <td>Absorption</td>
          <td>979</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logD</td>
          <td>Physicochemical</td>
          <td>10,354</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logS</td>
          <td>Physicochemical</td>
          <td>5,045</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PPB</td>
          <td>Distribution</td>
          <td>1,480</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>tox</td>
          <td>Toxicity</td>
          <td>7,295</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>Physicochemical</td>
          <td>1,128</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>Physicochemical</td>
          <td>642</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipo</td>
          <td>Physicochemical</td>
          <td>4,200</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Ames</td>
          <td>Toxicity</td>
          <td>6,719</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBB</td>
          <td>Distribution</td>
          <td>1,855</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>FDAMDD</td>
          <td>Toxicity</td>
          <td>795</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>H_HT</td>
          <td>Toxicity</td>
          <td>2,170</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_inh</td>
          <td>Absorption</td>
          <td>2,125</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_sub</td>
          <td>Absorption</td>
          <td>1,210</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>Biophysics</td>
          <td>1,513</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
      </tr>
  </tbody>
</table>
<p>Datasets were split 8:1:1 (train:validation:test) with stratified sampling by SMILES length. Each experiment was repeated 10 times with random splits, reporting mean and standard deviation. Regression was evaluated by R-squared, classification by ROC-AUC. Early stopping with a maximum of 100 epochs was used.</p>
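<p>One plausible reading of the stratified 8:1:1 split (the authors' exact stratification code is not specified): sort by SMILES length and deal each consecutive block of ten molecules as eight train, one validation, one test:</p>

```python
import random

def stratified_split(smiles, seed=0):
    """8:1:1 train/valid/test split, stratified by SMILES length.
    A sketch of the paper's protocol, not the authors' code."""
    rng = random.Random(seed)
    order = sorted(smiles, key=len)           # group similar lengths together
    train, valid, test = [], [], []
    for i in range(0, len(order), 10):
        block = order[i:i + 10]
        rng.shuffle(block)                    # randomize within each length block
        train += block[:8]
        valid += block[8:9]
        test += block[9:10]
    return train, valid, test
```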
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ol>
<li><strong>ECFP4-XGBoost</strong>: Extended connectivity fingerprints (diameter 4) with gradient-boosted trees</li>
<li><strong>GAT</strong>: Graph Attention Network</li>
<li><strong>GCN</strong>: Graph Convolutional Network</li>
<li><strong>CDDD</strong>: Continuous and Data-Driven Descriptors (pretrained RNN encoder on SMILES with a fully connected network)</li>
<li><strong>SMILES-BERT</strong>: Original BERT applied directly to SMILES strings</li>
</ol>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Two ablation studies were conducted:</p>
<ol>
<li><strong>Pretraining effectiveness</strong>: Comparing pretrained vs. non-pretrained MG-BERT under identical hyperparameters</li>
<li><strong>Hydrogen atoms</strong>: Comparing MG-BERT with and without explicit hydrogen atoms in the molecular graph</li>
</ol>
<h2 id="consistent-improvements-across-admet-benchmarks">Consistent Improvements Across ADMET Benchmarks</h2>
<h3 id="main-results">Main Results</h3>
<p>MG-BERT consistently outperformed all baselines across all 16 datasets. Key results on the 11 ADMET datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ECFP4-XGBoost</th>
          <th>GAT</th>
          <th>GCN</th>
          <th>CDDD</th>
          <th>SMILES-BERT</th>
          <th>MG-BERT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caco2 (R2)</td>
          <td>61.41</td>
          <td>69.16</td>
          <td>67.15</td>
          <td>73.42</td>
          <td>72.39</td>
          <td><strong>74.68</strong></td>
      </tr>
      <tr>
          <td>logD (R2)</td>
          <td>70.84</td>
          <td>84.62</td>
          <td>86.22</td>
          <td>85.85</td>
          <td>86.31</td>
          <td><strong>87.46</strong></td>
      </tr>
      <tr>
          <td>logS (R2)</td>
          <td>73.73</td>
          <td>84.06</td>
          <td>83.47</td>
          <td>84.01</td>
          <td>85.20</td>
          <td><strong>87.66</strong></td>
      </tr>
      <tr>
          <td>PPB (R2)</td>
          <td>55.11</td>
          <td>59.96</td>
          <td>57.34</td>
          <td>54.12</td>
          <td>62.37</td>
          <td><strong>65.94</strong></td>
      </tr>
      <tr>
          <td>Ames (AUC)</td>
          <td>87.21</td>
          <td>86.38</td>
          <td>87.04</td>
          <td>86.82</td>
          <td>87.69</td>
          <td><strong>89.33</strong></td>
      </tr>
      <tr>
          <td>BBB (AUC)</td>
          <td>94.62</td>
          <td>93.03</td>
          <td>92.67</td>
          <td>94.44</td>
          <td>94.02</td>
          <td><strong>95.41</strong></td>
      </tr>
      <tr>
          <td>BBBP (AUC)</td>
          <td>89.16</td>
          <td>90.33</td>
          <td>90.74</td>
          <td>91.12</td>
          <td>91.32</td>
          <td><strong>92.08</strong></td>
      </tr>
  </tbody>
</table>
<p>The overall improvement across all datasets was 28.1% (7.02% on classification, 21.28% on regression). Improvements were statistically significant at the 95% confidence level (paired t-test, P &lt;= 0.001).</p>
<h3 id="pretraining-ablation">Pretraining Ablation</h3>
<p>Pretraining improved performance by more than 2% on all datasets. The benefit was largest for small datasets: Caco2 improved by approximately 10 percentage points (64.79 to 74.68 R2), and FDAMDD improved by about 7.5 points (80.76 to 88.23 AUC). This confirms that self-supervised pretraining effectively addresses the labeled data scarcity problem.</p>
<h3 id="hydrogen-atom-ablation">Hydrogen Atom Ablation</h3>
<p>Including explicit hydrogen atoms improved pretraining recovery accuracy from 92.25% to 98.31% and consistently improved downstream performance. The authors provide an intuitive explanation: hydrogen atoms help determine bond counts for neighboring atoms, which is critical for the masked atom recovery task. They also show that removing hydrogens can make structurally distinct molecules (e.g., benzene and cyclohexane) indistinguishable at the graph level.</p>
<h3 id="interpretability-via-attention-visualization">Interpretability via Attention Visualization</h3>
<p>The authors provide two forms of interpretability analysis:</p>
<ol>
<li>
<p><strong>t-SNE visualization of atomic representations</strong>: Pretrained atomic representations cluster by atom type and, more specifically, by local chemical environment (e.g., aromatic carbons separate from aliphatic carbons, C-N bonds from C-O bonds). This demonstrates that pretraining captures neighborhood context beyond simple atom identity.</p>
</li>
<li>
<p><strong>Attention weight visualization</strong>: On the logD task, the supernode&rsquo;s attention focuses on polar groups (which govern lipophilicity). On the Ames mutagenicity task, attention concentrates on known mutagenic structural alerts (acylchloride, nitrosamide, azide groups). This provides chemically meaningful explanations for predictions.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The paper does not extensively discuss limitations, but several can be identified:</p>
<ul>
<li>The model uses only 2D molecular topology (atom types and bonds) without 3D conformational information or bond-type features</li>
<li>The atom dictionary is limited to 13 common types plus [UNK], which may lose information for molecules containing rarer elements</li>
<li>Evaluation is limited to ADMET-focused datasets; broader chemical spaces (e.g., materials, catalysts) are not tested</li>
<li>The comparison baselines do not include other graph-based pretraining methods (e.g., the contemporaneous Strategies for Pre-training Graph Neural Networks by Hu et al.)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL (random subset)</td>
          <td>1.7M molecules (1.53M train)</td>
          <td>10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>16 datasets (642-10,354 molecules)</td>
          <td>8:1:1 splits, stratified by SMILES length</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam (pretraining: lr=1e-4, batch=256; fine-tuning: lr from {1e-5, 5e-5, 1e-4}, batch from {16, 32, 64})</li>
<li><strong>Pretraining epochs</strong>: 10</li>
<li><strong>Fine-tuning</strong>: Up to 100 epochs with early stopping</li>
<li><strong>Dropout</strong>: Optimized per task in range [0.0, 0.5]</li>
<li><strong>Masking</strong>: 15% of atoms (80% [MASK], 10% random, 10% unchanged)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: MG-BERT Medium (6 layers, 4 heads, embedding size 256, FFN size 512)</li>
<li><strong>Molecule processing</strong>: RDKit for graph conversion with explicit hydrogens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>R-squared (R2)</td>
          <td>Regression</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Accuracy, RMSE</td>
          <td>Both</td>
          <td>Reported in supplementary Table S1</td>
      </tr>
  </tbody>
</table>
<p>All results averaged over 10 random splits with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements (GPU type, training time, or memory usage).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/Molecular-graph-BERT">Molecular-graph-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation; last code push August 2021</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yang, Z.-J., Wu, Z.-X., Yi, J.-C., Hsieh, C.-Y., Hou, T.-J., &amp; Cao, D.-S. (2021). MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. <em>Briefings in Bioinformatics</em>, 22(6), bbab152. <a href="https://doi.org/10.1093/bib/bbab152">https://doi.org/10.1093/bib/bbab152</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2021mgbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{MG-BERT}: leveraging unsupervised atomic representation learning for molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yang, Zhi-Jiang and Wu, Zhen-Xing and Yi, Jia-Cai and Hsieh, Chang-Yu and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbab152}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbab152}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MAT: Graph-Augmented Transformer for Molecules (2020)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/molecule-attention-transformer/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/molecule-attention-transformer/</guid><description>MAT augments the Transformer self-attention mechanism with inter-atomic distances and molecular graph adjacency for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-augmented-transformer-for-molecular-property-prediction">A Graph-Augmented Transformer for Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that proposes the Molecule Attention Transformer (MAT), a Transformer-based architecture adapted for molecular property prediction. The primary contribution is a modified self-attention mechanism that incorporates inter-atomic distances and molecular graph structure alongside the standard query-key attention. Combined with self-supervised pretraining on 2 million molecules from ZINC15, MAT achieves competitive performance across seven diverse molecular property prediction tasks while requiring minimal hyperparameter tuning.</p>
<h2 id="challenges-in-deep-learning-for-molecular-properties">Challenges in Deep Learning for Molecular Properties</h2>
<p>Predicting molecular properties is central to drug discovery and materials design, yet deep neural networks have struggled to consistently outperform shallow methods like random forests and SVMs on these tasks. Wu et al. (2018) demonstrated through the MoleculeNet benchmark that graph neural networks do not reliably beat classical models. Two recurring problems compound this:</p>
<ol>
<li><strong>Underfitting</strong>: Graph neural networks tend to underfit training data, with performance failing to scale with model complexity (Ishiguro et al., 2019).</li>
<li><strong>Hyperparameter sensitivity</strong>: Deep models for molecule property prediction require extensive hyperparameter search (often 500+ configurations) to achieve competitive results, making them impractical for many practitioners.</li>
</ol>
<p>Concurrent work explored using vanilla Transformers on SMILES string representations of molecules (Honda et al., 2019; Wang et al., 2019), but these approaches discard the explicit structural information encoded in molecular graphs and 3D conformations. The motivation for MAT is to combine the flexibility of the Transformer architecture with domain-specific inductive biases from molecular structure.</p>
<h2 id="molecule-self-attention-combining-attention-distance-and-graph-structure">Molecule Self-Attention: Combining Attention, Distance, and Graph Structure</h2>
<p>The core innovation is the Molecule Self-Attention layer, which replaces standard Transformer self-attention. In a standard Transformer, head $i$ computes:</p>
<p>$$
\mathcal{A}^{(i)} = \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) \mathbf{V}_{i}
$$</p>
<p>MAT augments this with two additional information sources. Let $\mathbf{A} \in \{0, 1\}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the molecular graph adjacency matrix and $\mathbf{D} \in \mathbb{R}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the inter-atomic distance matrix. The modified attention becomes:</p>
<p>$$
\mathcal{A}^{(i)} = \left(\lambda_{a}\, \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) + \lambda_{d}\, g(\mathbf{D}) + \lambda_{g}\, \mathbf{A}\right) \mathbf{V}_{i}
$$</p>
<p>where $\lambda_{a}$, $\lambda_{d}$, and $\lambda_{g}$ are scalar hyperparameters weighting each component, and $g$ is either a row-wise softmax or an element-wise exponential decay $g(d) = \exp(-d)$.</p>
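<p>The combined attention can be sketched in plain Python for a single head. This is a toy illustration under the formula above, not the authors' implementation; the function and variable names are my own:</p>

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def molecule_self_attention(Q, K, V, D, A, lam_a, lam_d, lam_g, d_k):
    """Single-head MAT-style attention on an N-atom molecule (sketch).

    Q, K, V: N x d lists of per-atom vectors; D: inter-atomic distance
    matrix; A: 0/1 adjacency matrix. Combines softmax(QK^T / sqrt(d_k)),
    the distance kernel g(d) = exp(-d), and the adjacency matrix, each
    weighted by its scalar lambda, then applies the result to V.
    """
    n = len(Q)
    # standard scaled dot-product attention scores with row-wise softmax
    attn = [softmax([sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
                     for j in range(n)]) for i in range(n)]
    # weighted combination of the three information sources
    W = [[lam_a * attn[i][j] + lam_d * math.exp(-D[i][j]) + lam_g * A[i][j]
          for j in range(n)] for i in range(n)]
    # weighted sum over value vectors
    return [[sum(W[i][j] * V[j][k] for j in range(n))
             for k in range(len(V[0]))] for i in range(n)]
```

<p>Setting $\lambda_g = 1$ and the other weights to zero, for example, reduces each atom's output to a sum over its graph neighbors' value vectors, which makes the role of each term easy to inspect.</p>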
<p>Key architectural details:</p>
<ul>
<li><strong>Atom embedding</strong>: Each atom is represented as a 26-dimensional vector encoding atomic identity (one-hot over B, N, C, O, F, P, S, Cl, Br, I, dummy, other), number of heavy neighbors, number of hydrogens, formal charge, ring membership, and aromaticity.</li>
<li><strong>Dummy node</strong>: An artificial disconnected node (distance $10^{6}$ from all atoms) is added to each molecule, allowing the model to &ldquo;skip&rdquo; attention heads when no relevant pattern exists, similar to how BERT uses the separation token.</li>
<li><strong>3D conformers</strong>: Distance matrices are computed from RDKit-generated 3D conformers using the Universal Force Field (UFF).</li>
<li><strong>Pretraining</strong>: Node-level masked atom prediction on 2 million ZINC15 molecules (following Hu et al., 2019), where 15% of atom features are masked and the model predicts them.</li>
</ul>
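<p>The atom embedding described above can be sketched as a small feature builder. The 12 + 6 + 5 + 1 + 1 + 1 = 26 slot layout follows the description here, but the exact ordering within the vector is an assumption (the paper's Table 1 fixes the true layout):</p>

```python
ATOM_TYPES = ["B", "N", "C", "O", "F", "P", "S", "Cl", "Br", "I", "dummy", "other"]

def atom_features(symbol, n_heavy, n_hydrogens, formal_charge, in_ring, aromatic):
    """Build a 26-dim atom feature vector (sketch; slot ordering assumed)."""
    v = [0.0] * 26
    idx = ATOM_TYPES.index(symbol) if symbol in ATOM_TYPES else ATOM_TYPES.index("other")
    v[idx] = 1.0                       # one-hot atom identity (12 dims)
    v[12 + min(n_heavy, 5)] = 1.0      # one-hot heavy-neighbor count (6 dims)
    v[18 + min(n_hydrogens, 4)] = 1.0  # one-hot hydrogen count (5 dims)
    v[23] = float(formal_charge)       # formal charge (1 dim)
    v[24] = 1.0 if in_ring else 0.0    # ring membership (1 dim)
    v[25] = 1.0 if aromatic else 0.0   # aromaticity (1 dim)
    return v
```

<p>In practice the per-atom attributes would come from a toolkit such as RDKit rather than being passed by hand.</p>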
<h2 id="benchmark-evaluation-and-ablation-studies">Benchmark Evaluation and Ablation Studies</h2>
<h3 id="experimental-setup">Experimental setup</h3>
<p>MAT is evaluated on seven molecular property prediction datasets spanning regression and classification:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Size</th>
          <th>Metric</th>
          <th>Split</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FreeSolv</td>
          <td>Regression (hydration free energy)</td>
          <td>642</td>
          <td>RMSE</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Regression (log solubility)</td>
          <td>1,128</td>
          <td>RMSE</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Classification (BBB permeability)</td>
          <td>2,039</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>Estrogen-alpha</td>
          <td>Classification (receptor activity)</td>
          <td>2,398</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>Estrogen-beta</td>
          <td>Classification (receptor activity)</td>
          <td>1,961</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>MetStab-high</td>
          <td>Classification (metabolic stability)</td>
          <td>2,127</td>
          <td>ROC AUC</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>MetStab-low</td>
          <td>Classification (metabolic stability)</td>
          <td>2,127</td>
          <td>ROC AUC</td>
          <td>Random</td>
      </tr>
  </tbody>
</table>
<p>Baselines include GCN, Weave, EAGCN, Random Forest (RF), and SVM. Each model receives the same hyperparameter search budget (150 or 500 evaluations). Results are averaged over 6 random train/validation/test splits.</p>
<h3 id="main-results">Main results</h3>
<p>MAT achieves the best average rank across all seven tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Avg. Rank (500 budget)</th>
          <th>Avg. Rank (150 budget)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MAT</td>
          <td>2.42</td>
          <td>2.71</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>3.14</td>
          <td>3.14</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>3.57</td>
          <td>3.28</td>
      </tr>
      <tr>
          <td>GCN</td>
          <td>3.57</td>
          <td>3.71</td>
      </tr>
      <tr>
          <td>Weave</td>
          <td>3.71</td>
          <td>3.57</td>
      </tr>
      <tr>
          <td>EAGCN</td>
          <td>4.14</td>
          <td>4.14</td>
      </tr>
  </tbody>
</table>
<p>With self-supervised pretraining, Pretrained MAT achieves an average rank of 1.57, outperforming both Pretrained EAGCN (4.0) and SMILES Transformer (4.29). Pretrained MAT requires tuning only the learning rate (7 values tested), compared to 500 hyperparameter combinations for the non-pretrained models.</p>
<h3 id="ablation-results">Ablation results</h3>
<p>Ablation studies on BBBP, ESOL, and FreeSolv reveal:</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>BBBP (AUC)</th>
          <th>ESOL (RMSE)</th>
          <th>FreeSolv (RMSE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MAT (full)</td>
          <td>.723</td>
          <td>.286</td>
          <td>.250</td>
      </tr>
      <tr>
          <td>- Graph</td>
          <td>.716</td>
          <td>.316</td>
          <td>.276</td>
      </tr>
      <tr>
          <td>- Distance</td>
          <td>.729</td>
          <td>.281</td>
          <td>.281</td>
      </tr>
      <tr>
          <td>- Attention</td>
          <td>.692</td>
          <td>.306</td>
          <td>.329</td>
      </tr>
      <tr>
          <td>- Dummy node</td>
          <td>.714</td>
          <td>.317</td>
          <td>.249</td>
      </tr>
      <tr>
          <td>+ Edge features</td>
          <td>.683</td>
          <td>.314</td>
          <td>.358</td>
      </tr>
  </tbody>
</table>
<p>Removing any single component degrades performance on at least one task, supporting the value of combining all three information sources. Adding edge features does not help, suggesting the adjacency and distance matrices already capture sufficient bond-level information.</p>
<h3 id="interpretability-analysis">Interpretability analysis</h3>
<p>Individual attention heads in the first layer learn chemically meaningful functions. Six heads were identified that focus on specific chemical patterns: 2-neighbored aromatic carbons, sulfur atoms, non-ring nitrogens, carbonyl oxygens, 3-neighbored aromatic atoms (substitution positions), and aromatic ring nitrogens. Statistical validation using Kruskal-Wallis tests confirmed that atoms matching these SMARTS patterns receive significantly higher attention weights ($p &lt; 0.001$ for all patterns).</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>MAT demonstrates that augmenting Transformer self-attention with molecular graph structure and 3D distance information produces a model that performs consistently well across diverse property prediction tasks. The key practical finding is that self-supervised pretraining dramatically reduces the hyperparameter tuning burden: Pretrained MAT matches or exceeds the performance of extensively tuned models while requiring only learning rate selection.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li><strong>Fingerprint-based models still win on some tasks</strong>: RF and SVM with extended-connectivity fingerprints outperform MAT on metabolic stability and Estrogen-beta tasks, suggesting that incorporating fingerprint representations could improve MAT further.</li>
<li><strong>Single conformer</strong>: Only one pre-computed 3D conformer is used per molecule. More sophisticated conformer sampling or ensemble strategies were not explored.</li>
<li><strong>Limited pretraining exploration</strong>: Only the masked atom prediction task from Hu et al. (2019) was used. The authors note that exploring additional pretraining objectives is a promising direction.</li>
<li><strong>Scalability</strong>: The pretrained model uses 1024-dimensional embeddings with 8 layers and 16 attention heads, chosen as the largest configuration that fits in GPU memory.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15</td>
          <td>2M molecules</td>
          <td>Sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Log solubility regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Blood-brain barrier classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Estrogen-alpha/beta</td>
          <td>2,398 / 1,961</td>
          <td>Receptor activity classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MetStab-high/low</td>
          <td>2,127 each</td>
          <td>Metabolic stability classification</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam with Noam learning rate scheduler (warmup then inverse square root decay)</li>
<li>Pretraining: 8 epochs, learning rate 0.001, batch size 256, binary cross-entropy loss</li>
<li>Fine-tuning: 100 epochs, batch size 32, learning rate selected from {1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6}</li>
<li>Distance kernel: exponential decay $g(d) = \exp(-d)$ for pretrained model</li>
<li>Lambda weights: $\lambda_{a} = \lambda_{d} = 0.33$ for pretrained model</li>
</ul>
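<p>The Noam scheduler mentioned above can be sketched with the canonical formula from Vaswani et al. (2017): linear warmup followed by inverse-square-root decay. The <code>warmup</code> and <code>factor</code> defaults below are assumptions, not values reported by the paper:</p>

```python
def noam_lr(step, d_model=1024, warmup=8000, factor=1.0):
    """Noam schedule: lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).

    Rises linearly for the first `warmup` steps, peaks at step == warmup,
    then decays proportionally to the inverse square root of the step.
    """
    step = max(step, 1)  # avoid 0^-0.5 at the first update
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```
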
<h3 id="models">Models</h3>
<ul>
<li>Pretrained MAT: 1024-dim embeddings, 8 layers, 16 attention heads, 1 feed-forward layer per block</li>
<li>Dropout: 0.0, weight decay: 0.0 for pretrained model</li>
<li>Atom featurization: 26-dimensional one-hot encoding (Table 1 in paper)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: RMSE (FreeSolv, ESOL)</li>
<li>Classification: ROC AUC (BBBP, Estrogen-alpha/beta, MetStab-high/low)</li>
<li>All experiments repeated 6 times with different train/validation/test splits</li>
<li>Scaffold split for BBBP, Estrogen, random split for others</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify exact hardware details. The pretrained model is described as &ldquo;the largest model that still fits the GPU memory.&rdquo;</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gmum/MAT">gmum/MAT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Maziarka, Ł., Danel, T., Mucha, S., Rataj, K., Tabor, J., &amp; Jastrzębski, S. (2020). Molecule Attention Transformer. <em>arXiv preprint arXiv:2002.08264</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{maziarka2020molecule,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecule Attention Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Maziarka, {\L}ukasz and Danel, Tomasz and Mucha, S{\l}awomir and Rataj, Krzysztof and Tabor, Jacek and Jastrz{\k{e}}bski, Stanis{\l}aw}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2002.08264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DMP: Dual-View Molecule Pre-training (SMILES+GNN)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/</guid><description>DMP pre-trains molecular encoders using both SMILES Transformer and GNN branches with a BYOL-style dual-view consistency loss for property prediction.</description><content:encoded><![CDATA[<h2 id="a-dual-branch-pre-training-method-for-molecular-property-prediction">A Dual-Branch Pre-training Method for Molecular Property Prediction</h2>
<p>DMP (Dual-view Molecule Pre-training) is a <strong>Method</strong> paper that introduces a pre-training framework combining two complementary molecular encoders: a Transformer operating on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings and a Graph Neural Network (GNN) operating on molecular graphs. The two branches are trained jointly with masked language modeling (MLM) objectives plus a BYOL-style dual-view consistency loss. After pre-training on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> molecules, either branch (or both) can be fine-tuned for downstream tasks. The authors recommend the Transformer branch based on empirical results. DMP achieves the best reported performance on 7 of 9 <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks and 3 retrosynthesis benchmarks (at the time of the 2021 arXiv version).</p>
<h2 id="why-combine-smiles-and-graph-views-for-molecules">Why Combine SMILES and Graph Views for Molecules</h2>
<p>Prior molecule pre-training methods used either graph representations with GNNs or SMILES representations with Transformers, but not both. The authors observe that the two views are complementary: Transformers handle molecules with large atom distances (long chains) well, while GNNs handle molecules with many concatenated rings better. Neither model alone captures the full range of molecular structures effectively.</p>
<p>Existing GNN-based pre-training methods (Hu et al. 2020, MolCLR, GROVER) and SMILES-based methods (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>) each have blind spots dictated by their input representation. DMP addresses this by pre-training both views simultaneously and enforcing representation consistency between them, so each branch benefits from the structural knowledge of the other.</p>
<h2 id="dual-view-consistency-with-byol-style-training">Dual-View Consistency with BYOL-Style Training</h2>
<p>The core innovation is the dual-view consistency objective, inspired by Bootstrap Your Own Latent (BYOL). Given a molecule $M$ with SMILES representation $M_s$ and graph representation $M_g$, DMP obtains high-level features from each branch:</p>
<ul>
<li><strong>Transformer branch</strong>: A RoBERTa-base model encodes the SMILES sequence. The [CLS] token output serves as the molecule representation $f_s$.</li>
<li><strong>GNN branch</strong>: A DeeperGCN network encodes the molecular graph. Mean+max pooling over atom representations yields $f_g$.</li>
</ul>
<p>The dual-view consistency loss uses nonlinear projection heads $\psi_g, \psi_s$ and prediction heads $\rho_g, \rho_s$:</p>
<p>$$
p_g = \psi_g(f_g), \quad q_g = \rho_g(p_g); \quad p_s = \psi_s(f_s), \quad q_s = \rho_s(p_s)
$$</p>
<p>The consistency loss maximizes cross-view <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> with stop-gradient (SG) on the target:</p>
<p>$$
\ell_{\text{dual}}(\tilde{M}_g, \tilde{M}_s) = -\cos(q_s, \text{SG}(p_g)) - \cos(q_g, \text{SG}(p_s))
$$</p>
<p>where $\cos(p, q) = \frac{p^\top q}{\|p\|_2 \|q\|_2}$ and $\tilde{M}_g, \tilde{M}_s$ are the masked versions of the inputs. The stop-gradient prevents representation collapse without requiring negative samples or a momentum encoder.</p>
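<p>The loss above can be sketched in a few lines of plain Python. Stop-gradient is only meaningful under autograd (e.g. <code>tensor.detach()</code> in PyTorch), so it appears here as a labeled no-op; all names are illustrative:</p>

```python
import math

def cos_sim(p, q):
    """Cosine similarity between two vectors given as lists."""
    num = sum(a * b for a, b in zip(p, q))
    return num / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def dual_view_loss(q_s, p_g, q_g, p_s):
    """BYOL-style dual-view consistency loss: each branch's prediction is
    pulled toward the other branch's (gradient-stopped) projection."""
    sg = lambda x: x  # stand-in for stop-gradient; detach() under autograd
    return -cos_sim(q_s, sg(p_g)) - cos_sim(q_g, sg(p_s))
```

<p>When both cross-view pairs are perfectly aligned the loss reaches its minimum of $-2$.</p>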
<p>The full training objective combines three losses:</p>
<ol>
<li><strong>MLM on Transformer</strong>: Recover masked tokens in SMILES sequences</li>
<li><strong>MLM on GNN</strong>: Recover masked atoms in molecular graphs</li>
<li><strong>Dual-view consistency</strong>: The BYOL-style loss above</li>
</ol>
<p>Both MLM objectives and the consistency loss are necessary. Ablations show that removing MLM (using only dual-view loss) degrades performance, and using two branches of the same type (two Transformers or two GNNs) is less effective than the heterogeneous Transformer+GNN combination.</p>
<h2 id="experiments-on-moleculenet-and-retrosynthesis">Experiments on MoleculeNet and Retrosynthesis</h2>
<h3 id="pre-training-setup">Pre-training Setup</h3>
<p>DMP is pre-trained on 10M molecules from PubChem (matching prior work). The Transformer branch uses RoBERTa-base (12 layers, hidden dim 768, 87M parameters). The GNN branch uses DeeperGCN (12 layers, hidden dim 384, 7.4M parameters). Combined, DMP has 104.1M parameters. Training runs for 200K iterations on 8 V100 GPUs over 3.8 days with Adam optimizer (lr = 5e-4, weight decay 0.01).</p>
<h3 id="molecular-property-prediction-moleculenet">Molecular Property Prediction (MoleculeNet)</h3>
<p>DMP is evaluated on 6 binary classification tasks (BBBP, Tox21, ClinTox, HIV, BACE, SIDER) using official DeepChem splits, and on 3 additional tasks (BBBP, SIDER, ClinTox classification + ESOL, QM7, QM8 regression) using scaffold splits from GROVER.</p>
<p>Key results on DeepChem splits (ROC-AUC %):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolCLR</th>
          <th>TF (MLM)</th>
          <th>DMP_TF</th>
          <th>DMP_TF+GNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>73.6</td>
          <td>74.9</td>
          <td><strong>78.1</strong></td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>79.8</td>
          <td>77.6</td>
          <td><strong>78.8</strong></td>
          <td>79.1</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>93.2</td>
          <td>92.9</td>
          <td><strong>95.0</strong></td>
          <td>95.6</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>80.2</td>
          <td><strong>81.0</strong></td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>89.0</td>
          <td>88.0</td>
          <td><strong>89.3</strong></td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>68.0</td>
          <td>68.4</td>
          <td><strong>69.2</strong></td>
          <td>69.8</td>
      </tr>
  </tbody>
</table>
<p>On scaffold splits (comparison with GROVER and MPG):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GROVER</th>
          <th>MPG</th>
          <th>DMP_TF</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP (AUC)</td>
          <td>0.940</td>
          <td>0.922</td>
          <td><strong>0.945</strong></td>
      </tr>
      <tr>
          <td>SIDER (AUC)</td>
          <td>0.658</td>
          <td>0.661</td>
          <td><strong>0.695</strong></td>
      </tr>
      <tr>
          <td>ClinTox (AUC)</td>
          <td>0.944</td>
          <td>0.963</td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>ESOL (RMSE)</td>
          <td>0.831</td>
          <td>0.741</td>
          <td><strong>0.700</strong></td>
      </tr>
      <tr>
          <td>QM7 (MAE)</td>
          <td>72.6</td>
          <td>-</td>
          <td><strong>69.6</strong></td>
      </tr>
      <tr>
          <td>QM8 (MAE)</td>
          <td>0.0125</td>
          <td>-</td>
          <td><strong>0.0124</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis">Retrosynthesis</h3>
<p>DMP is tested on USPTO-50K (reaction type known/unknown) and USPTO-full. Using a &ldquo;DMP fusion&rdquo; approach (fusing pre-trained representations into a Transformer encoder-decoder for <a href="/notes/chemistry/molecular-design/reaction-prediction/">retrosynthesis</a>), DMP improves top-1 accuracy by 2-3 points over the baseline Transformer across all settings:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Transformer</th>
          <th>ChemBERTa fusion</th>
          <th>DMP fusion</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-50K (unknown)</td>
          <td>42.3</td>
          <td>43.9</td>
          <td><strong>46.1</strong></td>
      </tr>
      <tr>
          <td>USPTO-50K (known)</td>
          <td>54.2</td>
          <td>56.4</td>
          <td><strong>57.5</strong></td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>42.9</td>
          <td>-</td>
          <td><strong>45.0</strong></td>
      </tr>
  </tbody>
</table>
<p>For GNN-based retrosynthesis, replacing GLN&rsquo;s GNN modules with DMP&rsquo;s pre-trained GNN branch improves top-1 accuracy from 52.5% to 54.2% (unknown type) and from 64.2% to 66.5% (known type).</p>
<h3 id="representation-quality">Representation Quality</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pre-trained representations shows that DMP produces better scaffold-based clustering than either GNN-only or Transformer-only pre-training. The <a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> improves from 3.56 (GNN) and 3.59 (Transformer) to 2.19 (DMP), indicating much tighter within-scaffold clusters.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Combining heterogeneous views (SMILES + graph) during pre-training is more effective than using two branches of the same type. TF(x2) and GNN(x2) variants show smaller gains.</li>
<li>Both MLM and dual-view consistency loss contribute. Removing MLM (dual-view only) hurts performance, especially on BBBP (71.1 vs 78.1 with both losses).</li>
<li>The Transformer branch alone is recommended for downstream tasks, as it achieves strong results without adding GNN parameters at inference time.</li>
<li>Scaling pre-training data from 10M to 100M compounds yields marginal additional improvement.</li>
</ul>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ol>
<li>Training cost is higher than single-branch methods (3.8 days vs 2.5 days for TF-only on 8 V100s), since both branches must be trained jointly.</li>
<li>A fixed branch selection strategy is used at inference time. The authors note that a meta-controller for dynamic branch selection per molecule would be preferable.</li>
<li>The GNN branch uses simple atom masking without bond deletion or subgraph removal, leaving room for stronger graph-level pre-training objectives.</li>
</ol>
<p><strong>Relation to co-training:</strong> The authors clarify that DMP differs from classical <a href="https://en.wikipedia.org/wiki/Co-training">co-training</a> (Blum and Mitchell 1998) in that it does not require conditional independence between views and produces a pre-trained model rather than additional labeled data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>10M compounds</td>
          <td>Same subset as MolCLR and ChemBERTa</td>
      </tr>
      <tr>
          <td>Pre-training (large)</td>
          <td>PubChem subset</td>
          <td>100M compounds</td>
          <td>Additional scale experiment</td>
      </tr>
      <tr>
          <td>Evaluation (classification)</td>
          <td>MoleculeNet (BBBP, Tox21, ClinTox, HIV, BACE, SIDER)</td>
          <td>1.5K-41K molecules</td>
          <td>Official DeepChem splits</td>
      </tr>
      <tr>
          <td>Evaluation (regression)</td>
          <td>MoleculeNet (ESOL, QM7, QM8)</td>
          <td>Varies</td>
          <td>Scaffold splits from GROVER</td>
      </tr>
      <tr>
          <td>Evaluation (retrosynthesis)</td>
          <td>USPTO-50K, USPTO-full</td>
          <td>50K / 950K reactions</td>
          <td>Splits from Dai et al. (2019)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Transformer branch</strong>: RoBERTa-base with MLM. SMILES tokenized using regex from Schwaller et al. (2019).</li>
<li><strong>GNN branch</strong>: DeeperGCN with 12 layers, atom masking for MLM.</li>
<li><strong>Dual-view loss</strong>: BYOL-style with 3-layer MLP projection heads and 2-layer MLP prediction heads, stop-gradient on targets.</li>
<li><strong>Optimizer</strong>: Adam (lr=5e-4, beta1=0.9, beta2=0.98, epsilon=1e-6), weight decay 0.01, 10K warmup steps, linear decay.</li>
</ul>
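<p>The learning-rate schedule above (peak 5e-4, 10K warmup steps, linear decay over 200K total iterations) can be sketched as follows; the assumption that the decay reaches exactly zero at the final step is mine:</p>

```python
def linear_warmup_decay_lr(step, peak_lr=5e-4, warmup=10_000, total=200_000):
    """Linear warmup to peak_lr over `warmup` steps, then linear decay
    to zero at `total` steps (endpoint assumed)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))
```
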
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transformer branch</td>
          <td>RoBERTa-base (12L, 768H, 12 heads)</td>
          <td>87M</td>
      </tr>
      <tr>
          <td>GNN branch</td>
          <td>DeeperGCN (12L, 384H)</td>
          <td>7.4M</td>
      </tr>
      <tr>
          <td>DMP (total)</td>
          <td>Transformer + GNN + projection/prediction heads</td>
          <td>104.1M</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC, averaged over 3 random seeds</li>
<li>Regression: RMSE (ESOL) or MAE (QM7, QM8)</li>
<li>Retrosynthesis: Top-k exact match accuracy (k=1,3,5,10,20,50)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8 NVIDIA V100 GPUs, batch size 12288 tokens, gradient accumulation 16x</li>
<li>Pre-training time: 3.8 days (DMP), 2.5 days (TF-only), 1.7 days (GNN-only)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained model weights were identified for this paper. The paper references GLN&rsquo;s code repository (<a href="https://github.com/Hanjun-Dai/GLN">https://github.com/Hanjun-Dai/GLN</a>) for the retrosynthesis baseline but does not release DMP-specific code.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Hanjun-Dai/GLN">GLN (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Retrosynthesis baseline, not DMP code</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhu, J., Xia, Y., Wu, L., Xie, S., Zhou, W., Qin, T., Li, H., &amp; Liu, T.-Y. (2023). Dual-view Molecular Pre-training. In <em>Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</em> (pp. 3615-3627). <a href="https://doi.org/10.1145/3580305.3599317">https://doi.org/10.1145/3580305.3599317</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2023dualview,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-view Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhu, Jinhua and Xia, Yingce and Wu, Lijun and Xie, Shufang and Zhou, Wengang and Qin, Tao and Li, Houqiang and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3615--3627}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3580305.3599317}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>X-MOL: Pre-training on 1.1B Molecules for SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/</guid><description>X-MOL pre-trains a shared encoder-decoder Transformer on 1.1 billion molecules, then fine-tunes for property prediction, reaction analysis, and generation.</description><content:encoded><![CDATA[<h2 id="a-unified-molecular-pre-training-framework">A Unified Molecular Pre-training Framework</h2>
<p>X-MOL is a <strong>Method</strong> paper that introduces a large-scale pre-training framework for <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based molecular understanding. The primary contribution is a Transformer encoder-decoder model pre-trained on 1.1 billion molecules from <a href="/notes/chemistry/datasets/zinc-22/">ZINC15</a>, which is then fine-tuned across five distinct molecular analysis tasks: molecular property prediction (classification and regression), chemical reaction productivity prediction, <a href="https://en.wikipedia.org/wiki/Drug_interaction">drug-drug interaction</a> (DDI) prediction, de novo molecule generation (distribution learning and goal-directed), and molecule optimization. The paper demonstrates that a single pre-trained model can serve as a universal foundation for diverse downstream chemistry tasks.</p>
<h2 id="bridging-scale-and-understanding-in-molecular-smiles">Bridging Scale and Understanding in Molecular SMILES</h2>
<p>Prior to X-MOL, most molecular analysis tasks were investigated individually with task-specific models. SMILES-based deep learning methods existed but lacked the benefit of large-scale pre-training that had proven transformative in NLP (BERT, RoBERTa, ERNIE, XLNet, <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>). Two challenges motivated this work:</p>
<ol>
<li><strong>SMILES sacrifices structural information for simplicity.</strong> While SMILES is a convenient linear representation, it does not directly encode molecular topology, making it harder for models to learn 3D structure from string input.</li>
<li><strong>Labelled molecular data is scarce.</strong> Most benchmark datasets (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>) contain only thousands of labelled examples, making it difficult to train large models from scratch without overfitting.</li>
</ol>
<p>The authors hypothesized that massive-scale pre-training on unlabelled SMILES could teach a model the grammar rules and implicit structural information in SMILES, providing a strong initialization for multiple downstream tasks.</p>
<h2 id="generative-pre-training-with-random-smiles">Generative Pre-training with Random SMILES</h2>
<p>The core innovation in X-MOL is a <strong>generative pre-training strategy</strong> that exploits the non-uniqueness of SMILES. A single molecule can be represented by many valid SMILES strings (<a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">random SMILES</a>), depending on the starting atom, main chain selection, and ring-opening position. X-MOL trains the model to generate a valid alternative SMILES given an input SMILES of the same molecule, forcing the model to:</p>
<ol>
<li>Reconstruct the molecular structure from the input SMILES</li>
<li>Generate a valid output SMILES following SMILES grammar rules</li>
</ol>
<p>The architecture uses a shared-parameter encoder-decoder based on the Transformer. Unlike standard encoder-decoder models (e.g., for machine translation), X-MOL shares all parameters between encoder and decoder, forcing both encoding and decoding to occur in the same semantic space. The output SMILES is fully masked during training, and only unidirectional attention is permitted within the output sequence.</p>
<p>The self-attention mechanism computes attention for each character $i$ as:</p>
<p>$$
Z_{i} = \text{SoftMax}\left(\frac{Q_{i} \cdot K^{T}}{\sqrt{D}}\right) \cdot V
$$</p>
<p>where $Q_{i}$ is the query vector for character $i$, $K$ and $V$ are the key and value matrices, and $D$ is the feature dimension. The model uses 12 attention heads to capture different relational patterns.</p>
<h3 id="model-architecture">Model Architecture</h3>
<ul>
<li>12 Transformer encoder layers</li>
<li>768-dimensional hidden units</li>
<li>12 attention heads</li>
<li>Character-level SMILES tokenization (108 chemical characters plus 5 special tokens: [PAD], [CLS], [SEP], [MASK], [UNK])</li>
<li>Characters within square brackets and two-digit ring numbers preceded by &ldquo;%&rdquo; are treated as single tokens</li>
</ul>
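<p>The bracket and &ldquo;%&rdquo; tokenization rules above can be sketched with a single regular expression; this is an illustration of the stated rule, not the paper's exact 108+5-token vocabulary:</p>

```python
import re

# Bracketed atoms such as [NH3+] and two-digit ring numbers such as %10
# are kept as single tokens; every other character is its own token.
TOKEN_RE = re.compile(r"\[[^\]]+\]|%\d{2}|.")

def tokenize_smiles(smiles: str) -> list[str]:
    return TOKEN_RE.findall(smiles)

tokens = tokenize_smiles("c1ccccc1[NH3+]%10")
```

<p>Alternation tries the bracket and &ldquo;%&rdquo; patterns before falling through to single characters, which is what makes the exceptions work.</p>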
<h3 id="data-augmentation-in-pre-training">Data Augmentation in Pre-training</h3>
<p>Because a molecule has multiple valid random SMILES, a generated output can be chemically correct yet differ from the predefined target string. To handle this, X-MOL generates multiple training samples per molecule with the same input SMILES but different target random SMILES, and places these samples in the same mini-batch.</p>
<h2 id="experimental-setup-across-five-tasks">Experimental Setup Across Five Tasks</h2>
<p>X-MOL is fine-tuned with task-specific strategies organized into two categories: prediction tasks and generation tasks.</p>
<h3 id="prediction-tasks">Prediction Tasks</h3>
<p>For prediction tasks, the [CLS] token&rsquo;s output representation is passed through a fully connected network to produce predictions. The input format varies by task:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input Format</th>
          <th>Loss Function</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Property prediction (classification)</td>
          <td>Single SMILES</td>
          <td>Cross-entropy</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Property prediction (regression)</td>
          <td>Single SMILES</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Reaction productivity prediction</td>
          <td>Four SMILES (reactant, additive, base, ligand)</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Two SMILES (drug pair)</td>
          <td>Cross-entropy</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecular Property Prediction (Classification):</strong> Four MoleculeNet benchmarks were used: HIV (41,127 compounds), BACE (1,513), <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBBP</a> (2,039), and ClinTox (1,484). Data were randomly split 20 times, and average ROC-AUC is reported.</p>
<p><strong>Molecular Property Prediction (Regression):</strong> Three MoleculeNet benchmarks: ESOL (1,128), FreeSolv (642), and Lipophilicity (4,200). Data augmentation with random SMILES was applied to the training set. Average RMSE over 20 random splits is reported.</p>
<p><strong>Chemical Reaction Productivity Prediction:</strong> The <a href="https://en.wikipedia.org/wiki/Cross-coupling_reaction">C-N cross-coupling</a> dataset (3,956 reactions) from Ahneman et al. was used with 10-fold cross-validation.</p>
<p><strong>DDI Prediction:</strong> The DeepDDI dataset (192,284 DDI pairs, 86 interaction types) was used as the benchmark.</p>
<h3 id="generation-tasks">Generation Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Generation Source</th>
          <th>Sampling Strategy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Distribution learning (DL) generation</td>
          <td>Fixed initial symbol ([CLS])</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Goal-directed (GD) generation</td>
          <td>Unfixed initial symbol</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Molecule optimization</td>
          <td>Input molecule</td>
          <td>Beam search (beam size = 4)</td>
      </tr>
  </tbody>
</table>
<p><strong>DL-based Generation:</strong> Evaluated on ZINC250K (249,456 molecules) using validity, uniqueness, and novelty.</p>
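<p>The three distribution-learning metrics (validity, uniqueness, novelty) reduce to set operations; in this sketch the validity check is injected as a callable, since real validity checking needs a chemistry toolkit such as RDKit:</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity / uniqueness / novelty as used for DL-based generation.

    is_valid: callable SMILES -> bool (in practice an RDKit parse check;
    injected here so the sketch stays toolkit-free).
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)                       # deduplicate valid molecules
    novel = unique - set(training_set)        # not seen during training
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

# toy usage: 4 generated strings, one invalid, one duplicate, one in training
m = generation_metrics(
    generated=["CCO", "CCO", "CCN", "xx"],
    training_set={"CCO"},
    is_valid=lambda s: s != "xx",
)
```

<p>Here validity is 3/4, uniqueness 2/3 (among valid molecules), and novelty 1/2 (among unique valid molecules), matching the usual convention for these metrics.</p>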
<p><strong>GD Generation:</strong> Also on ZINC250K, using QED as the goal property with target QED = 0.948 (the dataset maximum). 10,000 molecules were generated for evaluation.</p>
<p><strong>Molecule Optimization:</strong> Evaluated on ZINC250K with QED as the optimization goal. Molecular pairs were constructed by selecting pairs with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> in [0.6, 0.8], where the lower-QED molecule serves as input and the higher-QED molecule as target.</p>
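<p>The pair-construction rule can be sketched with Tanimoto (Jaccard) similarity over fingerprint bit sets; representing fingerprints as plain Python sets is an assumption for illustration, as the paper's exact fingerprint is not restated here:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def build_optimization_pairs(molecules, lo=0.6, hi=0.8):
    """Pair molecules with Tanimoto similarity in [lo, hi]; the lower-QED
    molecule becomes the input, the higher-QED molecule the target.

    molecules: list of (smiles, fingerprint_set, qed) tuples.
    """
    pairs = []
    for i in range(len(molecules)):
        for j in range(i + 1, len(molecules)):
            (s1, f1, q1), (s2, f2, q2) = molecules[i], molecules[j]
            if lo <= tanimoto(f1, f2) <= hi:
                src, tgt = (s1, s2) if q1 < q2 else (s2, s1)
                pairs.append((src, tgt))
    return pairs

# toy molecules: A and B are similar (Tanimoto 4/6), C is dissimilar
mols = [("A", {1, 2, 3, 4, 5}, 0.5),
        ("B", {1, 2, 3, 4, 6}, 0.7),
        ("C", {9}, 0.9)]
pairs = build_optimization_pairs(mols)
```
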
<h3 id="key-results">Key Results</h3>
<p><strong>Classification (ROC-AUC, higher is better):</strong> X-MOL achieved state-of-the-art on all four datasets, outperforming both shallow learning methods and deep learning baselines including graph convolutional models.</p>
<p><strong>Regression (RMSE, lower is better):</strong> X-MOL achieved the best RMSE on ESOL, FreeSolv, and Lipophilicity.</p>
<p><strong>Reaction Productivity:</strong> X-MOL obtained an average RMSE of 0.0626, compared to the random forest baseline of 0.078.</p>
<p><strong>DDI Prediction:</strong> X-MOL achieved accuracy of 0.952, improving over DeepDDI&rsquo;s 0.924.</p>
<p><strong>DL-based Generation:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>20%</td>
          <td>99.97%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>MRNN</td>
          <td>65%</td>
          <td>99.89%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>GraphAF</td>
          <td>68%</td>
          <td>99.10%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td><strong>X-MOL</strong></td>
          <td><strong>85.28%</strong></td>
          <td><strong>99.91%</strong></td>
          <td><strong>100%</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>GD Generation:</strong> X-MOL&rsquo;s top-3 generated molecules all reached QED = 0.948, matching the dataset maximum. GraphAF reached 0.948/0.948/0.947, while JT-VAE and MRNN fell further behind.</p>
<h3 id="knowledge-embedding-ablation">Knowledge Embedding Ablation</h3>
<p>The paper tested three additional embedding strategies to inject structural information into the model:</p>
<ul>
<li><strong>Link embedding:</strong> Encodes connection information between atoms (position of the previous connected atom)</li>
<li><strong>Ring embedding:</strong> Encodes ring structure information from SMILES number pairs</li>
<li><strong>Type embedding:</strong> Categorizes characters into 9 types (atoms, bonds, structural symbols)</li>
</ul>
<p>None of these additional embeddings improved performance on the HIV or DDI tasks, with or without pre-training. The authors conclude that SMILES already contains sufficient information for molecular understanding and that pre-training effectively extracts it, a finding they summarize as &ldquo;SMILES is all you need.&rdquo;</p>
<h3 id="attention-visualization">Attention Visualization</h3>
<p>The authors provide attention heatmap analysis demonstrating that:</p>
<ul>
<li>Middle layers (e.g., layer 9) reconstruct molecular structure by correctly identifying atom connectivity and ring closures</li>
<li>Later layers abstract higher-level features for property prediction</li>
<li>In multi-input prediction tasks (reaction productivity), attention reveals which reaction components are most important (e.g., the ligand receives highest cross-attention)</li>
<li>In generation tasks, attention patterns differ between DL (self-focused), GD (source-constrained), and optimization (gradual shift from input to output)</li>
</ul>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>X-MOL demonstrates that large-scale pre-training on SMILES can produce a single model that achieves competitive or state-of-the-art performance across five distinct molecular analysis tasks. The key findings are:</p>
<ol>
<li><strong>Scale enables SMILES understanding.</strong> Pre-training on 1.1 billion molecules allows the model to learn SMILES grammar rules well enough to outperform graph-based methods on molecule generation validity.</li>
<li><strong>Unified framework.</strong> A single pre-trained backbone serves classification, regression, reaction prediction, DDI prediction, and generative tasks through different fine-tuning strategies.</li>
<li><strong>SMILES is sufficient.</strong> Additional knowledge embeddings (link, ring, type) do not improve performance, suggesting pre-training extracts the necessary structural information from SMILES alone.</li>
<li><strong>Interpretable attention.</strong> Attention visualization confirms that the model reconstructs molecular structure internally.</li>
</ol>
<p><strong>Limitations</strong> (observed):</p>
<ul>
<li>Property prediction is evaluated on relatively few datasets, all drawn from MoleculeNet. No scaffold or temporal splits are used; all splits are random, which can overestimate performance on structurally novel compounds.</li>
<li>Comparison baselines are somewhat dated (2018-2019 era methods), and the paper does not compare against concurrent SMILES pre-training methods.</li>
<li>The molecule generation validity (85.28%) is much higher than graph baselines like GCPN (20%), but later work achieved near 100% validity with constrained SMILES grammars.</li>
<li>No code or model weights have been publicly released, limiting independent verification.</li>
<li>The paper remains a bioRxiv preprint and has not been published in a peer-reviewed venue.</li>
</ul>
<p><strong>Future directions</strong> proposed by the authors include: better pre-training strategies, extension to graph-based representations, and fine-tuning on additional downstream tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC15</td>
          <td>1.1 billion molecules</td>
          <td>Random SMILES augmentation</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV (MoleculeNet)</td>
          <td>41,127</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE (MoleculeNet)</td>
          <td>1,513</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP (MoleculeNet)</td>
          <td>2,039</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>ClinTox (MoleculeNet)</td>
          <td>1,484</td>
          <td>Two sub-datasets, averaged</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL (MoleculeNet)</td>
          <td>1,128</td>
          <td>Water solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv (MoleculeNet)</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity (MoleculeNet)</td>
          <td>4,200</td>
          <td>logD at pH 7.4</td>
      </tr>
      <tr>
          <td>Reaction</td>
          <td>C-N cross-coupling</td>
          <td>3,956</td>
          <td>From Ahneman et al. (2018)</td>
      </tr>
      <tr>
          <td>DDI</td>
          <td>DeepDDI</td>
          <td>192,284 DDI pairs</td>
          <td>86 interaction types</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>ZINC250K</td>
          <td>249,456</td>
          <td>For DL, GD, and optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Generative SMILES-to-SMILES with shared encoder-decoder Transformer</li>
<li>Fine-tuning prediction tasks: [CLS] token passed through fully connected layers</li>
<li>Fine-tuning generation tasks: Autoregressive generation with random sampling (DL, GD) or beam search (optimization)</li>
<li>Data augmentation: Random SMILES augmentation for regression tasks</li>
<li>Repeated training: 20 random splits with averaged results for classification/regression</li>
<li>10-fold cross-validation for reaction productivity</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>12-layer Transformer, 768 hidden dimensions, 12 attention heads</li>
<li>Character-level tokenization: 108 chemical characters + 5 special tokens</li>
<li>Implemented in PaddlePaddle framework</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>X-MOL</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BACE (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BBBP (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ClinTox (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ESOL (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>FreeSolv (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>Lipophilicity (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>C-N coupling</td>
          <td>RMSE</td>
          <td>0.0626</td>
          <td>0.078 (random forest)</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Accuracy</td>
          <td>0.952</td>
          <td>0.924 (DeepDDI)</td>
      </tr>
      <tr>
          <td>DL generation</td>
          <td>Validity</td>
          <td>85.28%</td>
          <td>68% (GraphAF)</td>
      </tr>
      <tr>
          <td>GD generation</td>
          <td>Top-3 QED</td>
          <td>All 0.948</td>
          <td>0.948/0.948/0.947 (GraphAF)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8-16 Tesla P40 GPUs (24 GB each), approximately 4 days</li>
<li>Data pre-processing: Over 1,000 CPUs with Hadoop</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No code, model weights, or pre-trained checkpoints have been publicly released. The model was implemented in Baidu&rsquo;s PaddlePaddle framework, but no repository is available.</p>
<p><strong>Reproducibility status: Closed.</strong> While the datasets are all publicly available (ZINC15, MoleculeNet, ZINC250K, DeepDDI, C-N coupling), the model implementation, pre-trained weights, and fine-tuning code are not released. The computational requirements (1,000+ CPUs for data processing, 8-16 GPUs for 4 days of pre-training) are substantial.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xue, D., Zhang, H., Xiao, D., Gong, Y., Chuai, G., Sun, Y., Tian, H., Wu, H., Li, Y., &amp; Liu, Q. (2020). X-MOL: Large-scale pre-training for molecular understanding and diverse molecular analysis. <em>bioRxiv</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xue2020xmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xue, Dongyu and Zhang, Han and Xiao, Dongling and Gong, Yukang and Chuai, Guohui and Sun, Yu and Tian, Hao and Wu, Hua and Li, Yukun and Liu, Qi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2020.12.23.424259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Cold Spring Harbor Laboratory}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>VAE for Automatic Chemical Design (2018 Seminal)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/</guid><description>A variational autoencoder maps SMILES strings to a continuous latent space, enabling gradient-based optimization for molecular design and generation.</description><content:encoded><![CDATA[<h2 id="a-foundational-method-for-continuous-molecular-representation">A Foundational Method for Continuous Molecular Representation</h2>
<p>This is a <strong>Method</strong> paper that introduces a variational autoencoder (VAE) framework for mapping discrete molecular representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings) into a continuous latent space. The primary contribution is demonstrating that this continuous representation enables three key capabilities: (1) automatic generation of novel molecules by decoding random or perturbed latent vectors, (2) smooth interpolation between molecules in latent space, and (3) gradient-based optimization of molecular properties using a jointly trained property predictor. This work is widely regarded as one of the earliest and most influential applications of deep generative models to molecular design.</p>
<h2 id="the-challenge-of-searching-discrete-chemical-space">The Challenge of Searching Discrete Chemical Space</h2>
<p>Molecular design is fundamentally an optimization problem: identify molecules that maximize some set of desirable properties. The search space is enormous (estimated $10^{23}$ to $10^{60}$ drug-like molecules) and discrete, making systematic exploration difficult. Prior approaches fell into two categories, each with significant limitations:</p>
<ol>
<li><strong>Virtual screening</strong> over fixed libraries: effective but monolithic, costly to enumerate, and requiring hand-crafted rules to avoid impractical chemistries.</li>
<li><strong>Discrete local search</strong> (e.g., genetic algorithms): requires manual specification of mutation and crossover heuristics, and cannot leverage gradient information to guide the search.</li>
</ol>
<p>The core insight is that mapping molecules into a continuous vector space sidesteps these problems entirely. In a continuous space, new compounds can be generated by vector perturbation (no hand-crafted mutation rules), optimization can follow property gradients (enabling larger and more directed jumps), and large unlabeled chemical databases can be leveraged through unsupervised representation learning.</p>
<h2 id="a-vae-architecture-for-smiles-strings-with-joint-property-prediction">A VAE Architecture for SMILES Strings with Joint Property Prediction</h2>
<p>The architecture consists of three coupled neural networks trained jointly:</p>
<ol>
<li>
<p><strong>Encoder</strong>: Converts SMILES character strings into fixed-dimensional continuous vectors (the latent representation). Uses three 1D convolutional layers followed by a fully connected layer. For ZINC molecules, the latent space has 196 dimensions; for <a href="/notes/chemistry/datasets/qm9/">QM9</a>, 156 dimensions.</p>
</li>
<li>
<p><strong>Decoder</strong>: Converts latent vectors back into SMILES strings character by character using three layers of gated recurrent units (GRUs). The output is stochastic, as each character is sampled from a probability distribution over the SMILES alphabet.</p>
</li>
<li>
<p><strong>Property Predictor</strong>: A multilayer perceptron that predicts molecular properties directly from the latent representation. Joint training with the autoencoder reconstruction loss organizes the latent space so that molecules with similar properties cluster together.</p>
</li>
</ol>
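<p>The stochastic character-by-character decoding can be sketched as repeated sampling from a softmax over the SMILES alphabet; the toy alphabet, end symbol, and logit function below are assumptions for illustration, not the paper's model:</p>

```python
import numpy as np

def sample_smiles(step_logits, alphabet, max_len=50, eos="$", seed=0):
    """Decode character by character, sampling each symbol from a softmax
    over the alphabet -- stochastic, so repeated calls with different
    seeds can yield different strings for the same latent point."""
    rng = np.random.default_rng(seed)
    out = []
    for t in range(max_len):
        logits = np.asarray(step_logits(out, t), dtype=float)
        p = np.exp(logits - logits.max())
        p /= p.sum()  # categorical distribution over the alphabet
        ch = rng.choice(alphabet, p=p)
        if ch == eos:
            break
        out.append(str(ch))
    return "".join(out)

# toy "decoder": strongly prefer 'C' for two steps, then the end symbol
logits_fn = lambda out, t: [10.0, -10.0, -10.0] if t < 2 else [-10.0, -10.0, 10.0]
smi = sample_smiles(logits_fn, ["C", "O", "$"])
```
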
<h3 id="the-vae-objective">The VAE Objective</h3>
<p>The model uses the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder framework of Kingma and Welling</a>. The training objective combines three terms:</p>
<p>$$\mathcal{L} = \mathcal{L}_{recon} + \beta \cdot D_{KL}(q(z|x) \| p(z)) + \lambda \cdot \mathcal{L}_{prop}$$</p>
<p>where $\mathcal{L}_{recon}$ is the reconstruction loss (cross-entropy over SMILES characters), $D_{KL}$ is the KL divergence regularizer that encourages the latent distribution $q(z|x)$ to match a standard Gaussian prior $p(z)$, and $\mathcal{L}_{prop}$ is the property prediction regression loss. Both the variational loss and the property-prediction loss are annealed in via a sigmoid schedule beginning at epoch 29 of the 120 training epochs.</p>
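<p>A sketch of such a sigmoid annealing schedule; the midpoint at epoch 29 comes from the text, while the slope is an assumed free parameter:</p>

```python
import math

def anneal_weight(epoch: int, start: int = 29, slope: float = 1.0) -> float:
    """Sigmoid schedule ramping a loss weight from ~0 to ~1 around `start`.

    Applied to both the KL weight (beta) and the property-prediction
    weight (lambda); only the start epoch is stated in the text.
    """
    return 1.0 / (1.0 + math.exp(-slope * (epoch - start)))
```

<p>Early in training the weight is effectively zero (pure reconstruction), crosses 0.5 at epoch 29, and saturates near 1 well before epoch 120.</p>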
<p>The KL regularization is critical: it forces the decoder to handle a wider variety of latent points, preventing &ldquo;dead areas&rdquo; in latent space that would decode to invalid molecules.</p>
<h3 id="gradient-based-optimization">Gradient-Based Optimization</h3>
<p>After training, a Gaussian process (GP) surrogate model is fit on top of the latent representations to predict the target property. Optimization proceeds by:</p>
<ol>
<li>Encoding a seed molecule into the latent space</li>
<li>Using the GP model to define a smooth property surface over the latent space</li>
<li>Optimizing the latent vector $z$ to maximize the predicted property via gradient ascent</li>
<li>Decoding the optimized $z$ back into a SMILES string</li>
</ol>
<p>The objective used for demonstration is $5 \times \text{QED} - \text{SAS}$, balancing drug-likeness (QED) against synthetic accessibility (SAS).</p>
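<p>The optimization loop can be sketched with a toy differentiable surrogate standing in for the GP posterior mean over latent space; the surrogate shape, step size, and iteration count are assumptions, and the decode step is omitted:</p>

```python
import numpy as np

def gradient_ascent(z0, surrogate, lr=0.1, steps=100, eps=1e-4):
    """Maximize a scalar surrogate over the latent space by gradient
    ascent, using central finite differences so any smooth surrogate
    (e.g. a GP mean) can be plugged in."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(steps):
        grad = np.zeros_like(z)
        for i in range(z.size):
            dz = np.zeros_like(z)
            dz[i] = eps
            grad[i] = (surrogate(z + dz) - surrogate(z - dz)) / (2 * eps)
        z += lr * grad  # ascend the predicted-property surface
    return z

# toy surrogate for 5*QED - SAS over a 2D latent space:
# a smooth bump peaked at z* = (1, -2)
z_star = np.array([1.0, -2.0])
surrogate = lambda z: -np.sum((z - z_star) ** 2)
z_opt = gradient_ascent(np.zeros(2), surrogate)
```

<p>In the full method, the final $z$ would then be decoded back into a SMILES string by the trained decoder.</p>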
<h2 id="experiments-on-zinc-and-qm9-datasets">Experiments on ZINC and QM9 Datasets</h2>
<p>Two autoencoder systems were trained:</p>
<ul>
<li><strong>ZINC</strong>: 250,000 drug-like molecules from the ZINC database, with a 196-dimensional latent space. Properties predicted: logP, QED, SAS.</li>
<li><strong>QM9</strong>: 108,000 molecules with fewer than 9 heavy atoms, with a 156-dimensional latent space. Properties predicted: HOMO energy, LUMO energy, electronic spatial extent ($\langle R^2 \rangle$).</li>
</ul>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>The encoded latent dimensions follow approximately normal distributions as enforced by the variational regularizer. Decoding is stochastic: sampling the same latent point multiple times yields different SMILES strings, with the most frequent decoding tending to be closest to the original point in latent space. Decoding validity rates are 73-79% for points near known molecules but only 4% for randomly selected latent points.</p>
<p>Spherical interpolation (slerp) between molecules in latent space produces smooth structural transitions, accounting for the geometry of high-dimensional Gaussian distributions where linear interpolation would pass through low-probability regions.</p>
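<p>A standard slerp sketch (not the paper's code) makes the geometric point concrete: spherical interpolation stays in the typical-probability shell of a high-dimensional Gaussian, whereas linear interpolation would cut through low-density interior regions:</p>

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between latent vectors z0 and z1."""
    z0, z1 = np.asarray(z0, float), np.asarray(z1, float)
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):  # nearly parallel: fall back to lerp
        return (1 - t) * z0 + t * z1
    s = np.sin(omega)
    return np.sin((1 - t) * omega) / s * z0 + np.sin(t * omega) / s * z1

# midpoint between two orthogonal unit vectors stays on the unit sphere,
# while the linear midpoint (0.5, 0.5) would have norm ~0.707
z_mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```
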
<h3 id="molecular-generation-comparison">Molecular Generation Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Dataset</th>
          <th>Samples</th>
          <th>logP</th>
          <th>SAS</th>
          <th>QED</th>
          <th>% in ZINC</th>
          <th>% in eMolecules</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data</td>
          <td>ZINC</td>
          <td>249k</td>
          <td>2.46 (1.43)</td>
          <td>3.05 (0.83)</td>
          <td>0.73 (0.14)</td>
          <td>100</td>
          <td>12.9</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>ZINC</td>
          <td>5303</td>
          <td>2.84 (1.86)</td>
          <td>3.80 (1.01)</td>
          <td>0.57 (0.20)</td>
          <td>6.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>ZINC</td>
          <td>8728</td>
          <td>2.67 (1.46)</td>
          <td>3.18 (0.86)</td>
          <td>0.70 (0.14)</td>
          <td>5.8</td>
          <td>7.0</td>
      </tr>
      <tr>
          <td>Data</td>
          <td>QM9</td>
          <td>134k</td>
          <td>0.30 (1.00)</td>
          <td>4.25 (0.94)</td>
          <td>0.48 (0.07)</td>
          <td>0.0</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>QM9</td>
          <td>5470</td>
          <td>0.96 (1.53)</td>
          <td>4.47 (1.01)</td>
          <td>0.53 (0.13)</td>
          <td>0.018</td>
          <td>3.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>QM9</td>
          <td>2839</td>
          <td>0.30 (0.97)</td>
          <td>4.34 (0.98)</td>
          <td>0.47 (0.08)</td>
          <td>0.0</td>
          <td>8.9</td>
      </tr>
  </tbody>
</table>
<p>The VAE generates molecules whose property distributions closely match the training data, outperforming a genetic algorithm baseline that biases toward higher chemical complexity and decreased drug-likeness. Only 5.8% of VAE-generated ZINC molecules were found in the original ZINC database, indicating genuine novelty.</p>
<h3 id="property-prediction">Property Prediction</h3>
<table>
  <thead>
      <tr>
          <th>Dataset/Property</th>
          <th>Mean Baseline</th>
          <th>ECFP</th>
          <th>Graph Conv.</th>
          <th>1-hot SMILES</th>
          <th>Encoder Only</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC/logP</td>
          <td>1.14</td>
          <td>0.38</td>
          <td>0.05</td>
          <td>0.16</td>
          <td>0.13</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>ZINC/QED</td>
          <td>0.112</td>
          <td>0.045</td>
          <td>0.017</td>
          <td>0.041</td>
          <td>0.037</td>
          <td>0.054</td>
      </tr>
      <tr>
          <td>QM9/HOMO (eV)</td>
          <td>0.44</td>
          <td>0.20</td>
          <td>0.12</td>
          <td>0.12</td>
          <td>0.13</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/LUMO (eV)</td>
          <td>1.05</td>
          <td>0.20</td>
          <td>0.15</td>
          <td>0.11</td>
          <td>0.14</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/Gap (eV)</td>
          <td>1.07</td>
          <td>0.30</td>
          <td>0.18</td>
          <td>0.16</td>
          <td>0.18</td>
          <td>0.21</td>
      </tr>
  </tbody>
</table>
<p>The VAE latent representation achieves property prediction accuracy comparable to graph convolutions for some properties, though graph convolutions generally perform best. The primary purpose of joint training is not to maximize prediction accuracy but to organize the latent space for optimization.</p>
<h3 id="optimization-results">Optimization Results</h3>
<p>Bayesian optimization with a GP model on the jointly trained latent space consistently produces molecules with higher percentile scores on the $5 \times \text{QED} - \text{SAS}$ objective compared to both random Gaussian search and genetic algorithm baselines. Starting from molecules in the bottom 10th percentile of the ZINC dataset, the optimizer reliably discovers molecules in regions of high objective value. Training the GP with 1000 molecules (vs. 2000) produces a wider diversity of solutions by optimizing to multiple local optima rather than a single global optimum.</p>
<h2 id="key-findings-limitations-and-legacy">Key Findings, Limitations, and Legacy</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>A continuous latent representation of molecules enables gradient-based search through chemical space, a qualitatively different approach from discrete enumeration or genetic algorithms.</li>
<li>Joint training with property prediction organizes the latent space by property values, creating smooth gradients that optimization can follow.</li>
<li>The VAE generates novel molecules with realistic property distributions, and the latent space encodes an estimated 7.5 million molecules despite training on only 250,000.</li>
</ul>
<h3 id="acknowledged-limitations">Acknowledged Limitations</h3>
<ul>
<li>The SMILES-based decoder sometimes produces formally valid but chemically undesirable molecules (acid chlorides, anhydrides, cyclopentadienes, aziridines, etc.) because the grammar of valid SMILES does not capture all synthetic or stability constraints.</li>
<li>Character-level SMILES generation is fragile: the decoder must implicitly learn which strings are valid SMILES, making the learning problem harder than necessary.</li>
<li>Decoding validity drops to only 4% for random latent points far from training data, limiting the ability to explore truly novel regions of chemical space.</li>
</ul>
<h3 id="directions-identified">Directions Identified</h3>
<p>The authors point to several extensions that were already underway at the time of publication:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></strong>: Using an explicitly defined SMILES grammar instead of forcing the model to learn one (Kusner et al., 2017).</li>
<li><strong>Graph-based decoders</strong>: Directly outputting molecular graphs to avoid the SMILES validity problem.</li>
<li><strong>Adversarial training</strong>: Using GANs for molecular generation (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN, ORGANIC</a>).</li>
<li><strong>LSTM/RNN generators</strong>: Applying recurrent networks directly to SMILES for generation and reaction prediction.</li>
</ul>
<p>This paper has been cited over 2,900 times and launched a large body of follow-up work in VAE-based, GAN-based, and reinforcement learning-based molecular generation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ZINC (drug-like subset)</td>
          <td>250,000 molecules</td>
          <td>Randomly sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>QM9</td>
          <td>108,000 molecules</td>
          <td>Molecules with fewer than 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ZINC held-out set</td>
          <td>5,000 molecules</td>
          <td>For latent space analysis</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Encoder</strong>: 3 x 1D convolutional layers (ZINC: filters 9,9,10 with kernels 9,9,11; QM9: filters 2,2,1 with kernels 5,5,4), followed by a fully connected layer</li>
<li><strong>Decoder</strong>: 3 x GRU layers (ZINC: hidden dim 488; QM9: hidden dim 500), trained with teacher forcing</li>
<li><strong>Property Predictor</strong>: 2 fully connected layers of 1000 neurons (dropout 0.20) for prediction; smaller 3-layer MLP of 67 neurons (dropout 0.15) for latent space shaping</li>
<li><strong>Variational loss annealing</strong>: Sigmoid schedule after 29 epochs, total 120 epochs</li>
<li><strong>SMILES validation</strong>: Post-hoc filtering with RDKit; invalid outputs discarded</li>
<li><strong>Optimization</strong>: Gaussian process surrogate model trained on 2000 maximally diverse molecules from latent space</li>
</ul>
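<p>The sigmoid annealing schedule for the variational loss can be sketched as below. The 29-epoch delay and 120-epoch total come from the paper; the steepness value and exact parameterization are assumptions.</p>

```python
import math

def kl_weight(epoch, center=29, steepness=0.5):
    """Sigmoid annealing weight for the KL term of the VAE loss.

    The weight stays near 0 early in training (letting the model learn
    reconstruction first) and ramps toward 1 around `center` epochs.
    `steepness` is a hypothetical value, not reported in the paper.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (epoch - center)))
```

With these values the KL weight is essentially zero for the first ~20 epochs, crosses 0.5 at epoch 29, and saturates near 1 well before epoch 120.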
<h3 id="models">Models</h3>
<p>Built with Keras and TensorFlow. Latent dimensions: 196 (ZINC), 156 (QM9). SMILES alphabet: 35 characters (ZINC), 22 characters (QM9). Maximum string length: 120 (ZINC), 34 (QM9). Only canonicalized SMILES used for training.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>Water-octanol partition coefficient</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimation of Drug-likeness (0-1)</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>Synthetic Accessibility Score</td>
      </tr>
      <tr>
          <td>HOMO/LUMO (eV)</td>
          <td>Frontier orbital energies (QM9)</td>
      </tr>
      <tr>
          <td>Decoding validity</td>
          <td>Fraction of latent points producing valid SMILES</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on the Harvard FAS Odyssey Cluster. Specific GPU types and training times are not reported. The Gaussian process optimization requires only minutes to train on a few thousand molecules.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/chemical_vae">chemical_vae</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with training scripts and pre-trained models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., &amp; Aspuru-Guzik, A. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. <em>ACS Central Science</em>, 4(2), 268-276. <a href="https://doi.org/10.1021/acscentsci.7b00572">https://doi.org/10.1021/acscentsci.7b00572</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gomez2018automatic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{G{\&#39;o}mez-Bombarelli, Rafael and Wei, Jennifer N. and Duvenaud, David and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and S{\&#39;a}nchez-Lengeling, Benjam{\&#39;i}n and Sheberla, Dennis and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{268--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acscentsci.7b00572}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer-CNN: SMILES Embeddings for QSAR Modeling</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</guid><description>Transformer-CNN uses SMILES embeddings from a canonicalization Transformer with a CNN head for interpretable QSAR property prediction.</description><content:encoded><![CDATA[<h2 id="transformer-based-smiles-embeddings-for-property-prediction">Transformer-Based SMILES Embeddings for Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces Transformer-CNN, a two-stage architecture for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> (Quantitative Structure-Activity Relationship) modeling. The primary contribution is a transfer learning approach: a Transformer model is first trained on the task of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> canonicalization (mapping non-canonical SMILES to canonical forms), and the encoder&rsquo;s internal representations are then used as &ldquo;dynamic SMILES embeddings&rdquo; for downstream property prediction via a convolutional neural network (TextCNN). The authors also contribute an interpretability framework based on Layer-wise Relevance Propagation (LRP) that traces predictions back to individual atom contributions.</p>
<h2 id="from-descriptors-to-learned-embeddings-in-qsar">From Descriptors to Learned Embeddings in QSAR</h2>
<p>Traditional QSAR methods rely on hand-engineered molecular descriptors (fragment counts, physicochemical features) coupled with feature selection and classical ML algorithms. While deep learning approaches that operate on raw SMILES strings or molecular graphs have reduced the need for manual feature engineering, they typically require large training datasets to learn effective representations from scratch. QSAR datasets, in contrast, often contain only hundreds of molecules, making it difficult to train end-to-end deep models.</p>
<p>The authors identify two specific gaps. First, existing SMILES-based autoencoders such as <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (Continuous and Data-Driven molecular Descriptors) produce fixed-length latent vectors, discarding positional information that could be useful for property prediction and interpretation. Second, QSAR models built on deep architectures generally lack interpretability, making it hard to verify that predictions rely on chemically meaningful structural features rather than spurious correlations.</p>
<h2 id="dynamic-smiles-embeddings-via-canonicalization-pre-training">Dynamic SMILES Embeddings via Canonicalization Pre-training</h2>
<p>The core insight is that training a Transformer to perform SMILES canonicalization (a Seq2Seq task mapping non-canonical SMILES to canonical SMILES) produces an encoder whose internal states serve as information-rich, position-dependent molecular embeddings.</p>
<h3 id="pre-training-on-smiles-canonicalization">Pre-training on SMILES Canonicalization</h3>
<p>The Transformer encoder-decoder is trained on approximately 17.7 million canonicalization pairs derived from the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database (SMILES with length up to 110 characters). Each molecule is augmented 10 times by generating non-canonical SMILES variants, plus one identity pair where both sides are canonical. The training uses character-level tokenization with a 66-symbol vocabulary covering drug-like molecules including stereochemistry, charges, and inorganic ions.</p>
<p>The Transformer architecture follows Vaswani et al. with 3 layers and 10 self-attention heads. The learning rate schedule follows:</p>
<p>$$\lambda = \text{factor} \cdot \min(1.0,\; \text{step} / \text{warmup}) / \max(\text{step},\; \text{warmup})$$</p>
<p>where factor = 20, warmup = 16,000 steps, and $\lambda$ is clipped at a minimum of $10^{-4}$. Training runs for 10 epochs (275,907 batches per epoch) without early stopping.</p>
<p>On validation with 500,000 generated ChEMBL-like SMILES, the model correctly canonicalizes 83.6% of all samples. Performance drops for stereochemistry (37.2% for @-containing SMILES) and cis/trans notation (73.9%).</p>
<h3 id="from-encoder-states-to-qsar-predictions">From Encoder States to QSAR Predictions</h3>
<p>After pre-training, the encoder&rsquo;s output for a molecule with $N$ characters is a matrix of dimensions $(N, \text{EMBEDDINGS})$. Unlike fixed-length CDDD descriptors, these &ldquo;dynamic embeddings&rdquo; preserve positional information, meaning equivalent characters receive different embedding values depending on their context and position.</p>
<p>To handle variable-length embeddings, the authors use a TextCNN architecture (from DeepChem) with 1D convolutional filters at kernel sizes (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20) producing (100, 200, 200, 200, 200, 100, 100, 100, 100, 100, 160, 160) filters respectively. After GlobalMaxPool and concatenation, the features pass through Dropout (rate = 0.25), a Dense layer ($N = 512$), a Highway layer, and finally an output layer (1 neuron for regression, 2 for classification).</p>
<p>The Transformer weights are frozen during QSAR training. The Adam optimizer is used with a fixed learning rate of $10^{-4}$ and early stopping on a 10% held-out validation set. Critically, SMILES augmentation ($n = 10$) is applied during both training and inference, with the final prediction being the average over augmented SMILES for each molecule.</p>
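<p>The inference-time averaging can be sketched as follows; <code>predict_fn</code> stands in for the trained Transformer-CNN model, and returning the spread alongside the mean reflects the variance-as-confidence idea the authors flag as future work.</p>

```python
from statistics import mean, pstdev

def consensus_predict(predict_fn, smiles_variants):
    """Average a property prediction over n augmented SMILES strings of
    one molecule; the spread among variants is a rough confidence signal.
    `predict_fn` is a placeholder for the trained model."""
    preds = [predict_fn(s) for s in smiles_variants]
    return mean(preds), pstdev(preds)
```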
<h3 id="interpretability-via-layer-wise-relevance-propagation">Interpretability via Layer-wise Relevance Propagation</h3>
<p>The LRP algorithm propagates relevance scores from the output back through the CNN layers to the Transformer encoder output (which is position-wise). The relevance conservation property holds:</p>
<p>$$y = R = f(x) = \sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} = \cdots = \sum_{l \in (1)} R_{l}$$</p>
<p>In practice, biases absorb some relevance, so the total propagated to the input is less than the output:</p>
<p>$$\sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} + B$$</p>
<p>For gated connections in the Highway block, the authors implement the signal-take-all redistribution rule. The interpretation algorithm generates one SMILES per non-hydrogen atom (each drawn starting from that atom), runs LRP on each, and averages contributions. If more than 50% of relevance dissipates on biases, the interpretation may be unreliable, serving as an applicability domain indicator.</p>
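<p>For a single bias-free dense layer, a generic epsilon-rule LRP step looks like the sketch below. This illustrates the general technique and the conservation property; it is not the paper's exact implementation (in particular, the signal-take-all rule for Highway gates is not shown).</p>

```python
import numpy as np

def lrp_dense(x, W, R_out, eps=1e-9):
    """Redistribute output relevance R_out to the inputs of a bias-free
    dense layer, proportionally to each contribution z_ij = x_i * W_ij.
    With no bias term, total relevance is conserved: sum(R_in) == sum(R_out)."""
    z = x[:, None] * W                                     # (n_in, n_out)
    denom = z.sum(axis=0)
    denom = denom + eps * np.where(denom >= 0, 1.0, -1.0)  # stabilizer
    return (z / denom * R_out).sum(axis=1)
```

With a nonzero bias, part of the relevance is absorbed by the bias term, which is exactly the dissipation the authors use as an applicability-domain indicator.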
<h2 id="benchmarks-across-18-regression-and-classification-datasets">Benchmarks Across 18 Regression and Classification Datasets</h2>
<p>The authors evaluate on the same 18 datasets (9 regression, 9 classification) used in their previous SMILES augmentation study, enabling direct comparison. All experiments use five-fold cross-validation.</p>
<h3 id="regression-results-r2">Regression Results ($r^2$)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MP (19,104)</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center"><strong>0.86</strong></td>
          <td style="text-align: center">0.85</td>
      </tr>
      <tr>
          <td>BP (11,893)</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.97</td>
          <td style="text-align: center"><strong>0.98</strong></td>
          <td style="text-align: center">0.98</td>
      </tr>
      <tr>
          <td>BCF (378)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center"><strong>0.85</strong></td>
          <td style="text-align: center">0.81</td>
      </tr>
      <tr>
          <td>FreeSolv (642)</td>
          <td style="text-align: center"><strong>0.94</strong></td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
      </tr>
      <tr>
          <td>LogS (1,311)</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.91</td>
      </tr>
      <tr>
          <td>Lipo (4,200)</td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.60</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center"><strong>0.74</strong></td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.66</td>
          <td style="text-align: center"><strong>0.76</strong></td>
          <td style="text-align: center">0.75</td>
      </tr>
      <tr>
          <td>DHFR (739)</td>
          <td style="text-align: center">0.62</td>
          <td style="text-align: center">0.63</td>
          <td style="text-align: center">0.46</td>
          <td style="text-align: center"><strong>0.67</strong></td>
          <td style="text-align: center">0.61</td>
      </tr>
      <tr>
          <td>LEL (483)</td>
          <td style="text-align: center">0.19</td>
          <td style="text-align: center">0.25</td>
          <td style="text-align: center">0.20</td>
          <td style="text-align: center"><strong>0.27</strong></td>
          <td style="text-align: center">0.23</td>
      </tr>
  </tbody>
</table>
<h3 id="classification-results-auc">Classification Results (AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (41,127)</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.74</td>
      </tr>
      <tr>
          <td>AMES (6,542)</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center"><strong>0.89</strong></td>
          <td style="text-align: center">0.86</td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center"><strong>0.91</strong></td>
          <td style="text-align: center">0.90</td>
      </tr>
      <tr>
          <td>ClinTox (1,478)</td>
          <td style="text-align: center"><strong>0.77</strong></td>
          <td style="text-align: center">0.76</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center">0.77</td>
          <td style="text-align: center">0.73</td>
      </tr>
      <tr>
          <td>Tox21 (7,831)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.82</td>
      </tr>
      <tr>
          <td>BBBP (2,039)</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.89</td>
      </tr>
      <tr>
          <td>JAK3 (886)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.80</strong></td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.76</td>
      </tr>
      <tr>
          <td>BioDeg (1,737)</td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center"><strong>0.93</strong></td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.92</td>
      </tr>
      <tr>
          <td>RP AR (930)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center"><strong>0.87</strong></td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.87</td>
          <td style="text-align: center">0.86</td>
      </tr>
  </tbody>
</table>
<h3 id="key-comparisons">Key Comparisons</h3>
<p>Baselines include descriptor-based methods (the best from LibSVM, Random Forest, XGBoost, ASNN, and DNNs), direct SMILES-based models with augmentation, and CDDD descriptors analyzed by the same classical ML methods. CDDD descriptors come from the Sml2canSml autoencoder approach, which produces fixed 512-dimensional vectors.</p>
<p>Transformer-CNN with augmentation matches or exceeds all baselines on 14 of 18 datasets. The effect of augmentation is dramatic: without it, Transformer-CNN underperforms substantially (e.g., BCF drops from 0.85 to 0.71, JAK3 from 0.78 to 0.70). This confirms that the internal consensus from multiple SMILES representations is essential to the method&rsquo;s effectiveness.</p>
<p>A practical advantage over CDDD is that Transformer-CNN imposes no constraints on molecular properties (CDDD requires logP in (-5, 7), molecular weight between 12 and 600 Da, 3-50 heavy atoms, and organic molecules only), since the Transformer was trained on the full diversity of ChEMBL.</p>
<h3 id="interpretability-case-studies">Interpretability Case Studies</h3>
<p>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> mutagenicity, the LRP analysis of 1-Bromo-4-nitrobenzene correctly identifies the nitro group and halogen as structural alerts, consistent with known mutagenicity rules. For aqueous solubility of <a href="https://en.wikipedia.org/wiki/Haloperidol">haloperidol</a>, the model assigns positive contributions to hydroxyl, carbonyl, and aliphatic nitrogen groups (which increase solubility) and negative contributions to aromatic carbons (which decrease it). Both cases align with established chemical knowledge, supporting the trustworthiness of the model.</p>
<h2 id="effective-transfer-learning-for-small-qsar-datasets">Effective Transfer Learning for Small QSAR Datasets</h2>
<p>Transformer-CNN achieves competitive or superior QSAR performance across 18 diverse benchmarks by combining three ingredients: (1) Transformer-based pre-training via SMILES canonicalization, (2) SMILES augmentation during training and inference, and (3) a lightweight CNN head. The method requires minimal hyperparameter tuning, as the Transformer weights are frozen and the CNN architecture is fixed.</p>
<p>The authors acknowledge several limitations and future directions:</p>
<ul>
<li>Stereochemistry canonicalization accuracy is low (37.2%), which could impact models for stereo-sensitive properties</li>
<li>The LRP interpretability depends on sufficient relevance propagation (at least 50% reaching the input layer)</li>
<li>The variance among augmented SMILES predictions could serve as a confidence estimate, but this is left to future work</li>
<li>Applicability domain assessment based on SMILES reconstruction quality is proposed but not fully developed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (SMILES &lt;= 110 chars)</td>
          <td>17.7M pairs</td>
          <td>10x augmentation + 1 identity pair per molecule</td>
      </tr>
      <tr>
          <td>Validation (canon.)</td>
          <td>Generated ChEMBL-like SMILES</td>
          <td>500,000</td>
          <td>From a molecular generator</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>9 regression + 9 classification</td>
          <td>378-41,127</td>
          <td>Available on OCHEM (<a href="https://ochem.eu">https://ochem.eu</a>)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer: 3 layers, 10 self-attention heads, character-level tokenization (66 symbols)</li>
<li>TextCNN: 12 kernel sizes (1-10, 15, 20) with 100-200 filters each, GlobalMaxPool, Dense(512), Highway, Dropout(0.25)</li>
<li>Augmentation: n=10 non-canonical SMILES per molecule during training and inference</li>
<li>LRP: signal-take-all redistribution for Highway gates, standard LRP for Dense and Conv layers</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Transformer encoder weights pre-trained on canonicalization task (frozen during QSAR training)</li>
<li>QSAR CNN trained with Adam optimizer, learning rate $10^{-4}$, early stopping</li>
<li>Pre-trained embeddings and standalone prediction models available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: coefficient of determination $r^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$</li>
<li>Classification: Area Under the ROC Curve (AUC)</li>
<li>Five-fold cross-validation with bootstrap standard errors</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P6000, Titan Xp, and Titan V GPUs (donated by NVIDIA)</li>
<li>TensorFlow v1.12.0, RDKit v2018.09.2</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bigchem/transformer-cnn">transformer-cnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Source code, pre-trained embeddings, standalone prediction models</td>
      </tr>
      <tr>
          <td><a href="https://ochem.eu">OCHEM</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online platform hosting the method, training datasets, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Karpov, P., Godin, G., &amp; Tetko, I. V. (2020). Transformer-CNN: Swiss knife for QSAR modeling and interpretation. <em>Journal of Cheminformatics</em>, 12, 17. <a href="https://doi.org/10.1186/s13321-020-00423-w">https://doi.org/10.1186/s13321-020-00423-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{karpov2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-{CNN}: Swiss knife for {QSAR} modeling and interpretation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Karpov, Pavel and Godin, Guillaume and Tetko, Igor V.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00423-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer Name-to-SMILES with Atom Count Losses</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/</guid><description>A Transformer seq2seq model translates chemical compound names to SMILES, using atom-count constraints and SMILES/InChI multi-task learning.</description><content:encoded><![CDATA[<h2 id="translating-chemical-names-to-structures-with-transformers">Translating Chemical Names to Structures with Transformers</h2>
<p>This is a <strong>Method</strong> paper that proposes using Transformer-based sequence-to-sequence models to predict chemical compound structures (represented as SMILES strings) from chemical compound names. The primary contribution is the application of neural machine translation techniques to the name-to-structure problem, along with two domain-specific improvements: an atom-count constraint loss function and a multi-task learning approach that jointly predicts SMILES and InChI strings.</p>
<h2 id="why-rule-based-name-to-structure-fails-for-synonyms">Why Rule-Based Name-to-Structure Fails for Synonyms</h2>
<p>Chemical compound names come in several varieties. IUPAC names follow systematic nomenclature and are well handled by rule-based parsers like OPSIN. Database IDs (e.g., CAS registry numbers) can be resolved by dictionary lookup. The third category, Synonyms (which covers abbreviations, common names, and other informal designations), is problematic because naming patterns are complex and highly variable.</p>
<p>In preliminary experiments, rule-based tools achieved F-measures of 0.878 to 0.960 on IUPAC names but only 0.719 to 0.758 on Synonyms. This performance gap motivates a data-driven approach. The authors frame name-to-SMILES prediction as a machine translation problem: the source language is the chemical compound name and the target language is the SMILES string. A neural model trained on millions of name-SMILES pairs can learn patterns that rule-based systems miss, particularly for non-systematic nomenclature.</p>
<h2 id="atom-count-constraints-and-multi-task-learning">Atom-Count Constraints and Multi-Task Learning</h2>
<p>The paper introduces two improvements over a vanilla Transformer seq2seq model.</p>
<h3 id="atom-count-constraint-loss">Atom-Count Constraint Loss</h3>
<p>A correct structure prediction must contain the right number of atoms of each element. The authors add an auxiliary loss that penalizes the squared difference between the predicted and true atom counts for each element. The predicted atom counts are obtained by summing Gumbel-softmax outputs across all decoded positions.</p>
<p>For the $i$-th output token, the Gumbel-softmax probability vector is:</p>
<p>$$
y_{ij} = \frac{\exp\left((\log(\pi_{ij}) + g_{ij}) / \tau\right)}{\sum_{k=1}^{|\mathcal{V}|} \exp\left((\log(\pi_{ik}) + g_{ik}) / \tau\right)}
$$</p>
<p>where $\pi_{ij}$ is the model&rsquo;s softmax output, $g_{ij}$ is a Gumbel noise sample, and $\tau = 0.1$ is the temperature. The predicted token frequency vector is $\mathbf{y}^{pred} = \sum_{i=1}^{m} \mathbf{y}_i$, and the atom-count loss is:</p>
<p>$$
\mathcal{L}_{atom} = \frac{1}{|A|} \sum_{a \in A} \left(N_a(T) - y_{idx(a)}^{pred}\right)^2
$$</p>
<p>where $A$ is the set of chemical elements in the vocabulary, $N_a(T)$ returns the number of atoms of element $a$ in the correct SMILES string $T$, and $idx(a)$ returns the vocabulary index of element $a$. Only element tokens (e.g., &ldquo;C&rdquo;, &ldquo;O&rdquo;) are counted; bond symbols (e.g., &ldquo;=&rdquo;, &ldquo;#&rdquo;) are excluded.</p>
<p>The combined objective is:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{atom} \mathcal{L}_{atom}
$$</p>
<p>with $\lambda_{atom} = 0.7$.</p>
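<p>The loss above can be sketched in plain Python (a dependency-free illustration, not the authors' implementation; in practice the $\pi_{ij}$ come from the decoder softmax and gradients flow through the Gumbel-softmax relaxation):</p>

```python
import math
import random

def gumbel_softmax(logit_rows, tau=0.1, seed=0):
    """Per-position relaxation: y_ij = softmax((log(pi_ij) + g_ij) / tau)."""
    rng = random.Random(seed)
    rows = []
    for row in logit_rows:
        # Gumbel(0, 1) samples via inverse transform: g = -log(-log U)
        z = [(l - math.log(-math.log(max(rng.random(), 1e-12)))) / tau
             for l in row]
        m = max(z)  # subtract max for numerical stability
        exps = [math.exp(v - m) for v in z]
        s = sum(exps)
        rows.append([e / s for e in exps])
    return rows

def atom_count_loss(logit_rows, true_counts, element_idx, tau=0.1, seed=0):
    """L_atom: mean squared error between soft and true per-element counts.

    true_counts maps element -> N_a(T); element_idx maps element -> idx(a).
    """
    y = gumbel_softmax(logit_rows, tau, seed)
    # y_pred[j] = sum_i y_ij: expected frequency of vocabulary token j
    y_pred = [sum(row[j] for row in y) for j in range(len(y[0]))]
    sq_errs = [(true_counts[a] - y_pred[element_idx[a]]) ** 2
               for a in true_counts]
    return sum(sq_errs) / len(sq_errs)
```

With a sharply peaked distribution at each step, the soft counts approach the true counts and the loss approaches zero; a wrong target count yields a large penalty.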
<h3 id="multi-task-smilesinchi-prediction">Multi-Task SMILES/InChI Prediction</h3>
<p>SMILES and InChI strings encode the same chemical structure in different formats. The authors hypothesize that jointly predicting both representations can improve the shared encoder. The multi-task model shares the encoder between a SMILES decoder and an InChI decoder, minimizing:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{inchi} \mathcal{L}_{inchi}
$$</p>
<p>where $\mathcal{L}_{inchi} = -\log P(I | X; \boldsymbol{\theta}_{enc}, \boldsymbol{\theta}_{inchi})$ and $\lambda_{inchi} = 0.3$.</p>
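<p>Both training objectives are simple weighted sums. A minimal helper with the paper's $\lambda$ values (illustrative only; note the paper trains the atomnum and inchigen variants separately rather than combining both auxiliary losses in one model):</p>

```python
def training_loss(l_smiles, l_atom=None, l_inchi=None,
                  lam_atom=0.7, lam_inchi=0.3):
    """Weighted objective: L_smiles plus whichever auxiliary loss is active."""
    loss = l_smiles
    if l_atom is not None:
        loss += lam_atom * l_atom    # 'atomnum' variant
    if l_inchi is not None:
        loss += lam_inchi * l_inchi  # 'inchigen' variant
    return loss
```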
<h2 id="experimental-setup-and-evaluation">Experimental Setup and Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The dataset was constructed from PubChem dump data (97M compound records). Chemical compound names categorized as Synonyms were paired with canonical SMILES strings (converted via RDKit). Database-like IDs were filtered out using regular expressions. Duplicate names mapping to different CIDs were removed.</p>
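<p>The paper does not reproduce its exact filtering expressions; as an illustration of the idea, a couple of hypothetical regular expressions that flag registry-ID-like names:</p>

```python
import re

# Hypothetical patterns (not the paper's): names that look like database
# registry IDs rather than chemical names.
ID_PATTERNS = [
    re.compile(r"^\d{2,7}-\d{2}-\d$"),       # CAS-style registry numbers
    re.compile(r"^[A-Z]{2,10}[-_ ]?\d+$"),   # e.g. "CHEMBL25", "AC-1234"
]

def looks_like_database_id(name: str) -> bool:
    """True if the name matches any registry-ID-like pattern."""
    return any(p.match(name) for p in ID_PATTERNS)
```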
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>5,000,000</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>1,113</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>11,194</td>
      </tr>
  </tbody>
</table>
<h3 id="model-configuration">Model Configuration</h3>
<p>The Transformer uses 6 encoder/decoder layers, 8 attention heads, 512-dimensional embeddings, and 0.1 dropout. Training used label-smoothing cross-entropy ($\epsilon = 0.1$), Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$), and a warmup schedule with peak learning rate 0.0005 over 4,000 steps followed by inverse square root decay. Models were trained for 300,000 update steps. Final predictions averaged the last 10 checkpoints and used beam search (beam size 4, length penalty $\alpha = 0.6$, max output length 200).</p>
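<p>The warmup schedule is the standard one from Vaswani et al.; a sketch with the paper's settings (linear warmup to the peak over 4,000 steps, then inverse-square-root decay):</p>

```python
def learning_rate(step, peak_lr=5e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then inverse-square-root decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```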
<h3 id="tokenization">Tokenization</h3>
<p>Three tokenization strategies were compared:</p>
<ul>
<li><strong>BPE</strong>: Byte pair encoding learned on chemical compound names (500 merge operations) via fastBPE</li>
<li><strong>OPSIN-TK</strong>: The OPSIN rule-based tokenizer</li>
<li><strong>OPSIN-TK+BPE</strong>: A hybrid where OPSIN handles tokenizable names and BPE handles the rest</li>
</ul>
<p>SMILES tokens were identified by regular expressions (elements as single tokens, remaining symbols as characters). InChI strings were tokenized by SentencePiece (vocabulary size 1,000).</p>
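<p>The paper's exact expressions are not given; a representative regex tokenizer in the same spirit (bracket atoms and two-letter elements as single tokens, everything else as characters):</p>

```python
import re

# Illustrative pattern: bracket atoms, two-letter elements, common organic
# subset atoms, ring-closure percent codes, then any single character
# (bonds, digits, branch parentheses).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|[BCNOPSFIbcnops]|%\d{2}|.)"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into element and symbol tokens."""
    return SMILES_TOKEN.findall(smiles)
```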
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>OPSIN</strong>: Open-source rule-based parser</li>
<li><strong>Tool A</strong> and <strong>Tool B</strong>: Two commercially available name-to-structure tools</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Tokenizer</th>
          <th>Recall</th>
          <th>Precision</th>
          <th>F-measure</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OPSIN</td>
          <td>Rule-based</td>
          <td>0.693</td>
          <td>0.836</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Tool A</td>
          <td>Rule-based</td>
          <td>0.711</td>
          <td>0.797</td>
          <td>0.752</td>
      </tr>
      <tr>
          <td>Tool B</td>
          <td>Rule-based</td>
          <td>0.653</td>
          <td>0.800</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>BPE</td>
          <td>0.793</td>
          <td>0.806</td>
          <td>0.799</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>BPE</td>
          <td>0.798</td>
          <td>0.808</td>
          <td>0.803</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>BPE</td>
          <td>0.810</td>
          <td>0.819</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.763</td>
          <td>0.873</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.768</td>
          <td>0.876</td>
          <td>0.818</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.779</td>
          <td>0.886</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK</td>
          <td>0.755</td>
          <td>0.868</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK</td>
          <td>0.757</td>
          <td>0.867</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK</td>
          <td>0.754</td>
          <td>0.869</td>
          <td>0.807</td>
      </tr>
  </tbody>
</table>
<p>The best configuration (inchigen with OPSIN-TK+BPE) achieved an F-measure of 0.829, surpassing OPSIN by 0.071 points. The multi-task learning approach (inchigen) consistently outperformed the atom-count constraint alone (atomnum) across all tokenizer settings.</p>
<h2 id="key-findings-and-error-analysis">Key Findings and Error Analysis</h2>
<p>The Transformer-based approach produced grammatically correct SMILES strings (parseable by RDKit) for 99% of test examples, compared to 81.6-88.4% for the rule-based tools. Even when predictions were incorrect, they tended to be structurally similar to the correct answer. Using MACCS fingerprints and Jaccard (Tanimoto) similarity, the average similarity between incorrectly predicted and correct structures was 0.753.</p>
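<p>In RDKit, this error analysis would use MACCS keys (<code>MACCSkeys.GenMACCSKeys</code>) with <code>DataStructs.TanimotoSimilarity</code>; as a dependency-free sketch, the Jaccard computation on fingerprint on-bit sets:</p>

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Jaccard (Tanimoto) similarity between two fingerprint on-bit sets."""
    if not bits_a and not bits_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(bits_a & bits_b) / len(bits_a | bits_b)
```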
<p>The OPSIN-TK tokenizer yielded higher precision than BPE because approximately 11.5% of test compounds (1,293 of 11,194) could not be tokenized by OPSIN, so the model emitted fewer, but more reliable, outputs. BPE-based tokenizers achieved higher recall by covering all inputs. The hybrid OPSIN-TK+BPE approach balanced both, achieving the highest overall F-measure.</p>
<p><strong>Limitations</strong>: The paper does not evaluate on IUPAC names separately with the Transformer models (only comparing rule-based tools on IUPAC). The atom-count constraint and multi-task learning are not combined in a single model. The dataset is released but the training code is not. Hardware details and training times are not reported. The evaluation uses only exact-match F-measure and Jaccard similarity, without measuring partial credit for nearly-correct structures.</p>
<p><strong>Future work</strong>: The authors plan to explore additional tokenization methods, combine the atom-count constraint with multi-task learning, and apply the constraint loss to other chemistry problems including chemical reaction prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>5,000,000 pairs</td>
          <td>Chemical compound names to canonical SMILES</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>1,113 pairs</td>
          <td>Filtered for duplicates</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>11,194 pairs</td>
          <td>Filtered for duplicates; released as benchmark</td>
      </tr>
  </tbody>
</table>
<p>The authors state the dataset is released for future research. The data was constructed from the PubChem dump (97M compound records) using RDKit for SMILES canonicalization. Database-like IDs were removed with regular expressions and duplicate names across CIDs were filtered.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq2seq (6 layers, 8 heads, 512-dim embeddings)</li>
<li>BPE tokenization via fastBPE (500 merge operations)</li>
<li>SentencePiece for InChI tokenization (vocabulary size 1,000)</li>
<li>Gumbel-softmax atom-count constraint ($\tau = 0.1$, $\lambda_{atom} = 0.7$)</li>
<li>Multi-task SMILES/InChI loss ($\lambda_{inchi} = 0.3$)</li>
<li>Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-8}$)</li>
<li>Label smoothing ($\epsilon = 0.1$), 300K training steps</li>
<li>Beam search (beam size 4, length penalty $\alpha = 0.6$)</li>
</ul>
<h3 id="models">Models</h3>
<p>Standard Transformer architecture following Vaswani et al. (2017). No pre-trained weights or model checkpoints are released.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Model</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>F-measure</td>
          <td>0.829</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Precision</td>
          <td>0.886</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>0.810</td>
          <td>inchigen (BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Grammatical correctness</td>
          <td>99%</td>
          <td>inchigen (BPE)</td>
          <td>SMILES parseable by RDKit</td>
      </tr>
      <tr>
          <td>Avg. Jaccard similarity (errors)</td>
          <td>0.753</td>
          <td>inchigen (BPE)</td>
          <td>On incorrect predictions only</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Omote, Y., Matsushita, K., Iwakura, T., Tamura, A., &amp; Ninomiya, T. (2020). Transformer-based Approach for Predicting Chemical Compound Structures. <em>Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</em>, 154-162. <a href="https://doi.org/10.18653/v1/2020.aacl-main.19">https://doi.org/10.18653/v1/2020.aacl-main.19</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{omote2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based Approach for Predicting Chemical Compound Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Omote, Yutaro and Matsushita, Kyoumoto and Iwakura, Tomoya and Tamura, Akihiro and Ninomiya, Takashi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{154--162}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2020.aacl-main.19}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>t-SMILES: Tree-Based Fragment Molecular Encoding</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</guid><description>t-SMILES encodes fragmented molecules as SMILES-type strings via breadth-first traversal of full binary trees, reducing nesting depth and improving generation.</description><content:encoded><![CDATA[<h2 id="a-fragment-based-molecular-representation-method">A Fragment-Based Molecular Representation Method</h2>
<p>This is a <strong>Method</strong> paper that proposes t-SMILES (tree-based SMILES), a framework for representing molecules as SMILES-type strings derived from fragment-based decompositions. The primary contribution is an encoding algorithm that converts fragmented molecular graphs into full binary trees (FBTs) and then traverses them breadth-first to produce linear strings. Three coding variants are introduced: TSSA (shared atom), TSDY (dummy atom without ID), and TSID (dummy atom with ID). The framework achieves 100% theoretical validity, higher novelty scores, and improved distribution-learning metrics compared to classical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> across ChEMBL, ZINC, and <a href="/notes/chemistry/datasets/qm9/">QM9</a> benchmarks.</p>
<h2 id="why-fragment-based-representations-matter-for-molecular-generation">Why Fragment-Based Representations Matter for Molecular Generation</h2>
<p>Classical SMILES encodes molecules via depth-first traversal of the molecular graph, requiring parentheses and ring identifiers to appear in matched pairs with deep nesting. When generative models (LSTM, Transformer) are trained on SMILES, they frequently produce chemically invalid strings, particularly on small datasets, because they struggle to learn these long-range pairing constraints. DeepSMILES addresses some syntactic issues but still permits semantic violations (e.g., oxygen with three bonds). SELFIES guarantees 100% valid strings, but at the cost of readability and, as the authors show, lower <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a> scores, indicating that generated molecules diverge from the training distribution.</p>
<p>Fragment-based approaches reduce the search space compared to atom-level methods and can provide insights into molecular recognition (e.g., protein-ligand interactions). However, existing fragment-based deep learning methods rely on fixed dictionaries of candidate fragments, creating in-vocabulary/out-of-vocabulary problems and high-dimensional sparse representations. The encoding of fragments as SMILES-type strings, rather than dictionary IDs, had not been systematically explored before this work.</p>
<p>The authors draw on the observation that fragments in organic molecules follow a <a href="https://en.wikipedia.org/wiki/Zipf's_law">Zipf-like</a> rank distribution similar to words in natural language, motivating the use of NLP techniques for fragment-based molecular modeling.</p>
<h2 id="core-innovation-binary-tree-encoding-of-fragmented-molecules">Core Innovation: Binary Tree Encoding of Fragmented Molecules</h2>
<p>The t-SMILES algorithm proceeds in three steps:</p>
<ol>
<li><strong>Fragmentation</strong>: A molecule is decomposed into valid chemical fragments using a chosen algorithm (JTVAE, BRICS, <a href="https://en.wikipedia.org/wiki/Matched_molecular_pair_analysis">MMPA</a>, or Scaffold), producing a fragmented molecular graph.</li>
<li><strong>Tree construction</strong>: The fragmented graph is converted into an Acyclic Molecular Tree (AMT), which is a reduced graph where nodes represent fragments and edges represent bonds between them. The AMT is then transformed into a Full Binary Tree (FBT), where every internal node has exactly two children.</li>
<li><strong>String generation</strong>: The FBT is traversed using breadth-first search (BFS) to produce the t-SMILES string.</li>
</ol>
<p>The framework introduces only two new symbols beyond standard SMILES: <code>&amp;</code> marks empty tree nodes (branch terminators providing global structural information), and <code>^</code> separates adjacent substructure segments (analogous to spaces between words in English).</p>
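<p>The BFS serialization can be sketched on a toy full binary tree (a simplified illustration of the traversal only; the authors' encoder handles fragment connectivity details this sketch omits, and the exact placement of <code>&amp;</code> and <code>^</code> in real t-SMILES strings differs):</p>

```python
from collections import deque

class Node:
    """FBT node holding one fragment's SMILES; missing children are empty."""
    def __init__(self, smiles, left=None, right=None):
        self.smiles, self.left, self.right = smiles, left, right

def bfs_serialize(root):
    """Breadth-first serialization of a full binary tree.

    '&' marks empty tree nodes (branch terminators); '^' separates
    adjacent fragment segments.
    """
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            out.append("&")
            continue
        out.append(node.smiles)
        queue.append(node.left)
        queue.append(node.right)
    return "^".join(out)
```

Decoding reverses the process: the string is consumed level by level to rebuild the tree before fragment assembly.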
<h3 id="three-coding-variants">Three Coding Variants</h3>
<ul>
<li><strong>TSSA</strong> (shared atom): Two fragments share a real atom at their connection point. Produces the highest novelty scores and is recommended for goal-directed tasks.</li>
<li><strong>TSDY</strong> (dummy atom, no ID): Uses dummy atoms (marked with <code>*</code>) to indicate bonding points. Provides a balanced choice between novelty and distribution fidelity.</li>
<li><strong>TSID</strong> (dummy atom with ID): Uses numbered dummy atoms (<code>[n*]</code>) for unambiguous reconstruction. Produces the most faithful distribution reproduction and is recommended for distribution-learning tasks.</li>
</ul>
<h3 id="structural-advantages">Structural Advantages</h3>
<p>The key structural benefit is a dramatic reduction in nesting depth. For TSDY_M on ChEMBL, the proportion of tokens at nesting depth 0-1-2 increases from 68.0% (SMILES) to 99.3%, while depth 3-4-5 drops from 31.9% to 0.7%, and depth 6-11 drops from 0.1% to 0.0002%. The <code>&amp;</code> symbol, which encodes molecular topology, does not need to appear in pairs (unlike parentheses in SMILES), and its high frequency means it does not create a scarcity problem for learning.</p>
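<p>Nesting-depth statistics of this kind can be reproduced with a simple counter over parentheses (treating each character as a token, which approximates the paper's tokenization):</p>

```python
def depth_profile(s: str) -> dict:
    """Fraction of characters at each parenthesis-nesting depth."""
    depth, counts = 0, {}
    for ch in s:
        if ch == ")":
            depth -= 1          # closing parens count at the outer depth
        counts[depth] = counts.get(depth, 0) + 1
        if ch == "(":
            depth += 1
    return {d: c / len(s) for d, c in counts.items()}
```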
<p>The framework also supports a multi-code system where classical SMILES can be integrated as a special case called TS_Vanilla, and multiple fragmentation-based codes can be combined into hybrid models.</p>
<h3 id="reconstruction-and-data-augmentation">Reconstruction and Data Augmentation</h3>
<p>Molecules can be reconstructed from t-SMILES strings by reversing the process: rebuilding the FBT from the string, converting to AMT, and assembling fragments into a molecular graph. This reconstruction process can itself generate novel molecules without any model training by randomly assembling fragments. On ChEMBL, TSSA reconstruction achieves uniqueness above 0.98 and novelty above 0.68 for all four fragmentation algorithms, with 100% validity.</p>
<p>Data augmentation in t-SMILES operates at four levels: (1) different decomposition algorithms, (2) reconstruction, (3) enumeration of fragment strings, and (4) enumeration of FBTs. Unlike <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> (which only produces different strings for the same molecule), t-SMILES reconstruction generates genuinely different molecules from the same fragment set.</p>
<h2 id="systematic-evaluation-across-multiple-benchmarks">Systematic Evaluation Across Multiple Benchmarks</h2>
<p>All experiments use MolGPT (a Transformer-decoder model) as the primary generative model. Three types of metrics are employed: distribution-learning benchmarks, goal-directed benchmarks, and Wasserstein distance metrics for physicochemical properties.</p>
<h3 id="low-resource-datasets-jnk3-and-aid1706">Low-Resource Datasets (JNK3 and AID1706)</h3>
<p>On <a href="https://en.wikipedia.org/wiki/MAPK10">JNK3</a> (923 active molecules), the authors investigate overfitting behavior across training epochs:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Novelty</th>
          <th>FCD</th>
          <th>Active Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES [R200]</td>
          <td>0.795</td>
          <td>0.120</td>
          <td>0.584</td>
          <td>0.072</td>
      </tr>
      <tr>
          <td>SMILES [R2000]</td>
          <td>1.000</td>
          <td>0.001</td>
          <td>0.765</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>SELFIES [R200]</td>
          <td>1.000</td>
          <td>0.238</td>
          <td>0.544</td>
          <td>0.148</td>
      </tr>
      <tr>
          <td>SELFIES [R2000]</td>
          <td>1.000</td>
          <td>0.008</td>
          <td>0.767</td>
          <td>0.050</td>
      </tr>
      <tr>
          <td>TSSA_S [R300]</td>
          <td>1.000</td>
          <td>0.833</td>
          <td>0.564</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>TSSA_S [R5000]</td>
          <td>1.000</td>
          <td>0.817</td>
          <td>0.608</td>
          <td>0.564</td>
      </tr>
      <tr>
          <td>TF_TSSA_S [R5]</td>
          <td>1.000</td>
          <td>0.932</td>
          <td>0.483</td>
          <td>0.710</td>
      </tr>
      <tr>
          <td>TSSA_S_Rec50 [R10]</td>
          <td>1.000</td>
          <td>0.962</td>
          <td>0.389</td>
          <td>0.829</td>
      </tr>
  </tbody>
</table>
<p>Key findings: SMILES and DeepSMILES novelty scores collapse to near zero after 200 epochs, while t-SMILES novelty stabilizes around 0.8. The highest active-novel score of 0.829 comes from t-SMILES with reconstruction-based data augmentation. Transfer learning with t-SMILES maintains novelty of 0.710 at 5 epochs versus 0.526 for SMILES, and at 100 epochs the gap widens dramatically (0.569 vs. 0.023).</p>
<h3 id="distribution-learning-on-chembl">Distribution Learning on ChEMBL</h3>
<p>t-SMILES models outperform graph baselines (Graph MCTS, hG2G, MGM) and fragment-based methods (FASMIFRA). TSID_B and TSID_S achieve FCD scores of 0.909 while maintaining novelty of 0.941 and 0.933, surpassing SMILES (FCD 0.906, novelty 0.907) in both dimensions. TSDY and TSID models consistently outperform TSSA on distribution fidelity for larger molecules.</p>
<h3 id="goal-directed-tasks-on-chembl">Goal-Directed Tasks on ChEMBL</h3>
<p>On 20 <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> subtasks, different fragmentation algorithms excel at different tasks. The goal-directed reconstruction algorithm significantly outperforms random reconstruction. On the <a href="https://en.wikipedia.org/wiki/Sitagliptin">Sitagliptin</a> MPO task (T16.SMPO), the TSDY_M model with goal-directed reconstruction achieves a score of 0.930, compared to 0.598 for SMILES and 0.708 for CReM. On <a href="https://en.wikipedia.org/wiki/Valsartan">Valsartan</a> SMARTS (T18.VS), t-SMILES models reach 0.997 versus 0.985 for SMILES.</p>
<h3 id="distribution-learning-on-zinc-and-qm9">Distribution Learning on ZINC and QM9</h3>
<p>On ZINC, t-SMILES models significantly outperform existing fragment-based baselines (JTVAE, FragDgm). Seven t-SMILES models achieve both higher FCD and novelty scores than SELFIES. On QM9 (smaller molecules), all string-based models achieve high FCD scores (above 0.960), with t-SMILES performing better than existing string and graph approaches.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Across ChEMBL and ZINC, TSDY and TSID models capture physicochemical property distributions (MolWt, LogP, SAScore, N_Atoms, N_Rings, etc.) more faithfully than TSSA models. Multiple t-SMILES models outperform SMILES in more than four out of nine property categories. Baseline models hG2G and JTVAE show the weakest pattern learning, producing molecules with fewer atoms and rings than the training data.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<ol>
<li>t-SMILES achieves 100% theoretical validity by fragmenting molecules into chemically valid pieces before encoding.</li>
<li>The framework avoids the overfitting problem on low-resource datasets, maintaining stable novelty scores where SMILES, DeepSMILES, and SELFIES collapse.</li>
<li>The multi-code system allows different coding algorithms to complement each other, with hybrid models accessing broader chemical space.</li>
<li>Goal-directed reconstruction significantly outperforms all baselines on targeted optimization tasks.</li>
<li>TSDY and TSID provide better distribution fidelity than TSSA on larger molecules, while TSSA excels at novelty generation for goal-directed tasks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Whether the tree structure of t-SMILES can be effectively learned by Large Language Models remains unexplored.</li>
<li>Only published fragmentation algorithms were tested; custom fragmentation schemes were not investigated.</li>
<li>Experiments on more complex (larger) molecules were not performed.</li>
<li>The reconstruction algorithm uses simple rules for fragment assembly; more sophisticated assembly methods (Monte Carlo tree search, CReM) could improve quality.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest exploring advanced reconstruction and optimization algorithms, improved generative models, evolutionary techniques, and extending t-SMILES to property prediction, retrosynthesis, and reaction prediction tasks. The framework is also extensible to other string representations (t-DSMILES, t-SELFIES) by changing how fragments are encoded.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Low-resource evaluation</td>
          <td>JNK3</td>
          <td>923 active molecules</td>
          <td>Kinase inhibitors</td>
      </tr>
      <tr>
          <td>Low-resource evaluation</td>
          <td>AID1706</td>
          <td>329 active molecules</td>
          <td>SARS 3CLPro inhibitors</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ChEMBL</td>
          <td>Standard split</td>
          <td>Large drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ZINC</td>
          <td>250K subset</td>
          <td>Medium drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>QM9</td>
          <td>~134K molecules</td>
          <td>Small organic molecules</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: JTVAE, BRICS, MMPA, Scaffold (all via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>)</li>
<li><strong>Tree construction</strong>: AMT from reduced graph, then FBT transformation</li>
<li><strong>Traversal</strong>: Breadth-first search on FBT</li>
<li><strong>Generative model</strong>: MolGPT (Transformer decoder)</li>
<li><strong>Discriminative model</strong>: AttentiveFP for activity prediction on JNK3/AID1706</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated strings that decode to valid molecules</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of distinct molecules among valid generations</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
      <tr>
          <td>KLD</td>
          <td>Kullback-Leibler divergence for physicochemical property distributions</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>Frechet ChemNet Distance measuring chemical similarity to training set</td>
      </tr>
      <tr>
          <td>Active Novel</td>
          <td>Novel molecules predicted active by AttentiveFP</td>
      </tr>
  </tbody>
</table>
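<p>The three count-based metrics reduce to set arithmetic; a minimal sketch where <code>is_valid</code> stands in for an RDKit parse check (<code>Chem.MolFromSmiles</code>):</p>

```python
def distribution_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty over a sample of generated SMILES."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated) if generated else 0.0
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = (len(unique - set(training_set)) / len(unique)
               if unique else 0.0)
    return validity, uniqueness, novelty
```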
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/juanniwu/t-SMILES">t-SMILES GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with training/generation scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/ZENODO.10991703">Zenodo deposit</a></td>
          <td>Code + Data</td>
          <td>CC-BY-4.0</td>
          <td>Archived code and data</td>
      </tr>
      <tr>
          <td><a href="https://codeocean.com/capsule/3034546/tree">Code Ocean capsule</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Certified reproducible compute capsule</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions limited computational resources but does not specify exact GPU types or training times.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, J.-N., Wang, T., Chen, Y., Tang, L.-J., Wu, H.-L., &amp; Yu, R.-Q. (2024). t-SMILES: a fragment-based molecular representation framework for de novo ligand design. <em>Nature Communications</em>, 15, 4993.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{t-SMILES: a fragment-based molecular representation framework for de novo ligand design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Juan-Ni and Wang, Tong and Chen, Yue and Tang, Li-Juan and Wu, Hai-Long and Yu, Ru-Qin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4993}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-49388-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPMM: A Bidirectional Molecular Foundation Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/</guid><description>SPMM is a multimodal molecular foundation model that aligns SMILES structures with property vectors for bidirectional generation and prediction tasks.</description><content:encoded><![CDATA[<h2 id="a-multimodal-foundation-model-for-structure-property-comprehension">A Multimodal Foundation Model for Structure-Property Comprehension</h2>
<p>This is a <strong>Method</strong> paper that introduces the Structure-Property Multi-Modal foundation model (SPMM), a transformer-based architecture that treats SMILES strings and molecular property vectors (PVs) as two separate modalities and learns to align them in a shared embedding space. The primary contribution is enabling bidirectional generation through a single pre-trained model: given a property vector, SPMM can generate molecules (inverse-QSAR), and given a SMILES string, it can predict all 53 properties simultaneously. The model also transfers to unimodal downstream tasks including MoleculeNet benchmarks and reaction prediction.</p>
<h2 id="bridging-the-gap-between-molecular-structure-and-properties">Bridging the Gap Between Molecular Structure and Properties</h2>
<p>Existing chemical pre-trained models typically learn representations from a single modality (SMILES, graphs, or fingerprints) and fine-tune for specific downstream tasks. While some approaches have attempted multimodal learning by combining SMILES with graph representations or InChI strings, these modalities encode nearly identical structural information, limiting the potential for emergent cross-modal knowledge.</p>
<p>The key gap SPMM addresses is the lack of multimodal pre-training that incorporates genuinely complementary modalities. Prior conditional molecule generation methods could typically control only a small number of properties simultaneously and required retraining when target properties changed. The authors draw on successes in vision-language pre-training (VLP), where aligning image and text modalities has enabled rich bidirectional understanding, and apply similar ideas to molecular structure and property domains.</p>
<h2 id="treating-property-vectors-as-a-language">Treating Property Vectors as a Language</h2>
<p>The core innovation in SPMM is treating a collection of 53 RDKit-computed molecular properties as a &ldquo;language&rdquo; where each property value is analogous to a word token. This design allows the model to attend to individual properties independently rather than treating the entire property vector as a single fixed-length condition.</p>
<h3 id="dual-stream-architecture">Dual-Stream Architecture</h3>
<p>SPMM follows the dual-stream VLP architecture. The model has three components:</p>
<ol>
<li><strong>SMILES Encoder</strong>: 6 BERT-base layers that encode tokenized SMILES (using a 300-subword BPE vocabulary) via self-attention</li>
<li><strong>PV Encoder</strong>: 6 BERT-base layers that encode the 53 normalized property values (each passed through a linear layer) with learnable positional embeddings</li>
<li><strong>Fusion Encoder</strong>: 6 BERT-base layers with cross-attention that combines both modalities, using one modality&rsquo;s features as queries and the other as keys/values</li>
</ol>
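<p>As an illustrative sketch (not the authors' implementation), a single cross-attention step of the fusion encoder can be written in NumPy, with SMILES token features as queries and PV token features as keys/values; all names and dimensions below are assumptions for illustration:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d) features from one modality (e.g., SMILES tokens)
    keys, values: (n_kv, d) features from the other modality (e.g., the 53 PV tokens)
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_kv) attention logits
    return softmax(scores) @ values          # (n_q, d) fused features

rng = np.random.default_rng(0)
smiles_feats = rng.standard_normal((20, 64))  # hypothetical: 20 SMILES tokens
pv_feats = rng.standard_normal((53, 64))      # hypothetical: 53 property tokens
fused = cross_attention(smiles_feats, pv_feats, pv_feats)
assert fused.shape == (20, 64)
```

In the actual model each of the 6 fusion layers interleaves self-attention, cross-attention, and feed-forward sublayers in the BERT-base configuration; the sketch shows only the modality-mixing step.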
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>The model is pre-trained with four complementary losses:</p>
<p><strong>Contrastive Learning</strong> aligns SMILES and PV features in a shared embedding space. For [CLS] token outputs $\mathbf{S}_{cls}$ and $\mathbf{P}_{cls}$:</p>
<p>$$
\text{sim}(\mathbf{S}, \mathbf{P}) = \left(h_{S}(\mathbf{S}_{cls})\right)^{\top} h_{P}(\mathbf{P}_{cls})
$$</p>
<p>These similarities are converted into softmax-normalized distributions with a learnable temperature $\tau$; for the SMILES-to-PV direction:</p>
<p>$$
s_{s2p} = \frac{\exp(\text{sim}(\mathbf{S}, \mathbf{P}) / \tau)}{\sum_{n=1}^{N} \exp(\text{sim}(\mathbf{S}, \mathbf{P}_{n}) / \tau)}
$$</p>
<p>The contrastive loss applies cross-entropy $H$ with one-hot labels (1 for same-molecule pairs) to both the intermodal directions (s2p, p2s) and the intramodal ones (s2s, p2p):</p>
<p>$$
L_{\text{contrastive}} = \frac{1}{2}\left(H(y_{s2p}, s_{s2p}) + H(y_{p2s}, s_{p2s}) + H(y_{s2s}, s_{s2s}) + H(y_{p2p}, s_{p2p})\right)
$$</p>
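<p>A minimal NumPy sketch of the intermodal half of this objective, assuming the projection heads $h_S$, $h_P$ have already been applied and using plain one-hot labels (i.e., without the momentum distillation described below):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def info_nce(s_cls, p_cls, tau=0.07):
    """Symmetric cross-entropy over in-batch similarities.

    s_cls, p_cls: (N, d) projected [CLS] embeddings; row i of each
    comes from the same molecule, so the target is the diagonal.
    """
    s = s_cls / np.linalg.norm(s_cls, axis=1, keepdims=True)
    p = p_cls / np.linalg.norm(p_cls, axis=1, keepdims=True)
    sim = s @ p.T / tau                  # (N, N) temperature-scaled similarities
    s2p = softmax(sim, axis=1)           # SMILES -> PV distributions
    p2s = softmax(sim.T, axis=1)         # PV -> SMILES distributions
    diag = np.arange(len(s))
    return -0.5 * (np.log(s2p[diag, diag]).mean() + np.log(p2s[diag, diag]).mean())

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16))
# Aligned pairs should incur a lower loss than misaligned ones
assert info_nce(x, x) < info_nce(x, np.roll(x, 1, axis=0))
```

The value of $\tau$ here is a conventional default, not one reported in the paper.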
<p><strong>Next Word Prediction (NWP)</strong> trains autoregressive SMILES generation conditioned on the PV:</p>
<p>$$
L_{NWP} = \sum_{i=1}^{n} H\left(y_{i}^{NWP}, p^{NWP}(s_{i} \mid s_{0:i-1}, \mathbf{P})\right)
$$</p>
<p><strong>Next Property Prediction (NPP)</strong> applies the same autoregressive concept to property values, using mean-square-error loss:</p>
<p>$$
L_{NPP} = \sum_{i=1}^{n} \left(p_{i} - \hat{p}_{i}(p_{0:i-1}, \mathbf{S})\right)^{2}
$$</p>
<p><strong>SMILES-PV Matching (SPM)</strong> is a binary classification loss predicting whether a SMILES-PV pair originated from the same molecule, trained with hard-negative mining.</p>
<p>The overall pre-training loss combines all four:</p>
<p>$$
L = \widetilde{L}_{\text{contrastive}} + \widetilde{L}_{NWP} + L_{NPP} + L_{SPM}
$$</p>
<p>where tildes indicate the use of momentum teacher distillation to soften one-hot labels, acknowledging that multiple valid SMILES-PV pairings may exist.</p>
<h3 id="random-property-masking">Random Property Masking</h3>
<p>During pre-training, 50% of property values are randomly replaced with a special [UNK] token. This serves three purposes: preventing overfitting to specific properties, augmenting data, and enabling flexible inference where users can specify any subset of the 53 properties as generation conditions. The model can handle all $2^{53}$ possible property combinations at inference time despite never seeing most of them during training.</p>
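<p>The masking scheme amounts to replacing each property independently with probability 0.5 during training, and letting the user fix an arbitrary subset at inference; a sketch (the helper names are illustrative, not from the paper's code):</p>

```python
import random

UNK = "[UNK]"

def mask_properties(pv, p_mask=0.5, rng=None):
    """Randomly replace property values with [UNK] during pre-training."""
    rng = rng or random
    return [UNK if rng.random() < p_mask else v for v in pv]

def condition_on(subset, n_props=53):
    """Inference-time PV controlling only the given {index: value} subset;
    all unspecified properties are left as [UNK]."""
    return [subset.get(i, UNK) for i in range(n_props)]

pv = condition_on({0: 150.0})  # e.g., fix one property, leave 52 free
assert pv[0] == 150.0 and pv.count(UNK) == 52
```
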
<h2 id="experiments-across-bidirectional-and-unimodal-tasks">Experiments Across Bidirectional and Unimodal Tasks</h2>
<h3 id="pv-to-smiles-generation-conditional-molecule-design">PV-to-SMILES Generation (Conditional Molecule Design)</h3>
<p>The authors evaluate SPMM on multiple generation scenarios using 1000 unseen PubChem PVs:</p>
<table>
  <thead>
      <tr>
          <th>Sampling</th>
          <th>Input PV</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Norm. RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deterministic</td>
          <td>1000 unseen PVs</td>
          <td>0.995</td>
          <td>0.999</td>
          <td>0.961</td>
          <td>0.216</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Full PV (molecule 1)</td>
          <td>0.974</td>
          <td>0.905</td>
          <td>0.998</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Molar mass = 150</td>
          <td>0.974</td>
          <td>0.945</td>
          <td>0.872</td>
          <td>0.192</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>4 properties controlled</td>
          <td>0.998</td>
          <td>0.981</td>
          <td>0.952</td>
          <td>0.257</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>No control (all [UNK])</td>
          <td>0.971</td>
          <td>0.991</td>
          <td>0.950</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The normalized RMSE of 0.216 across 53 properties indicates that generated molecules closely match the input property conditions. The model can also perform unconditional generation (all properties masked) where outputs follow the pre-training distribution. The authors report that SPMM outperforms benchmark models including MolGAN, GraphVAE, and scaffold-based graph generative models in both conditional and unconditional settings (Supplementary Table 1).</p>
<h3 id="smiles-to-pv-generation-multi-property-prediction">SMILES-to-PV Generation (Multi-Property Prediction)</h3>
<p>When given 1000 unseen ZINC15 molecules, SPMM predicts all 53 properties autoregressively with a mean $r^{2}$ of 0.924 across all properties.</p>
<h3 id="moleculenet-benchmarks">MoleculeNet Benchmarks</h3>
<p>Using only the SMILES encoder (6 BERT layers), SPMM achieves best or competitive performance on 9 <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SPMM</th>
          <th>Best Baseline</th>
          <th>Baseline Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.817</td>
          <td>0.798</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>LIPO</td>
          <td>RMSE</td>
          <td>0.681</td>
          <td>0.660</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>1.868</td>
          <td>1.877</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>BACE (reg)</td>
          <td>RMSE</td>
          <td>1.041</td>
          <td>1.047</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Clearance</td>
          <td>RMSE</td>
          <td>42.607</td>
          <td>43.175</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>AUROC</td>
          <td>75.1%</td>
          <td>73.6%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BACE (cls)</td>
          <td>AUROC</td>
          <td>84.4%</td>
          <td>86.3%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>AUROC</td>
          <td>92.7%</td>
          <td>91.2%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>AUROC</td>
          <td>66.9%</td>
          <td>67.2%</td>
          <td>ChemRL-GEM</td>
      </tr>
  </tbody>
</table>
<p>SPMM achieved best performance on 5 of 9 tasks, with notable gains on BBBP (75.1% vs. 73.6%) and ClinTox (92.7% vs. 91.2%). Without pre-training, all scores dropped substantially.</p>
<h3 id="dili-classification">DILI Classification</h3>
<p>On Drug-Induced Liver Injury prediction, SPMM achieved 92.6% AUROC, outperforming the 5-ensemble model of Ai et al. (90.4% AUROC) while using a single model.</p>
<h3 id="reaction-prediction">Reaction Prediction</h3>
<p>On USPTO-480k forward reaction prediction, SPMM achieved 91.5% top-1 accuracy, the highest among all models tested (including <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> at 91.3%). On USPTO-50k retro-reaction prediction, SPMM reached 53.4% top-1 accuracy, second only to Chemformer (54.3%) among string-based models.</p>
<h2 id="bidirectional-generation-from-a-single-pre-trained-model">Bidirectional Generation From a Single Pre-trained Model</h2>
<p>SPMM demonstrates that multimodal pre-training with genuinely complementary modalities (structure and properties, rather than structurally redundant representations) enables a single foundation model to handle both generation directions and downstream unimodal tasks. Key findings include:</p>
<ol>
<li><strong>Flexible conditional generation</strong>: The [UNK] masking strategy allows controlling any subset of 53 properties at inference time without retraining, a capability not demonstrated by prior methods.</li>
<li><strong>Interpretable cross-attention</strong>: Attention visualizations show that the model learns chemically meaningful structure-property relationships (e.g., hydrogen bonding properties attend to oxygen and nitrogen atoms; ring count properties attend to ring tokens).</li>
<li><strong>Competitive unimodal transfer</strong>: Despite using only 6 BERT layers and 50M pre-training molecules (smaller than <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>&rsquo;s 77M or Chemformer&rsquo;s 100M), the SMILES encoder alone achieves best or second-best results on 5 of 9 MoleculeNet tasks and the highest forward reaction prediction accuracy among tested models.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>SMILES representation constraints</strong>: Implicit connectivity information in SMILES means small structural changes can cause drastic string changes. Graph representations could be a complementary alternative.</li>
<li><strong>Stereochemistry blindness</strong>: All 53 RDKit properties used are invariant to stereochemistry, meaning different stereoisomers produce identical PVs. The contrastive loss then forces their SMILES encoder outputs to converge, which the authors identify as the primary factor limiting MoleculeNet performance on stereo-sensitive tasks.</li>
<li><strong>No wet-lab validation</strong>: Generated molecules and predicted properties are not experimentally verified.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>50M molecules</td>
          <td>SMILES + 53 RDKit properties</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>642-4200 per task</td>
          <td>Scaffold split via DeepChem (8:1:1)</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>Ai et al. dataset</td>
          <td>Not specified</td>
          <td>Following published preparation</td>
      </tr>
      <tr>
          <td>Forward reaction</td>
          <td>USPTO-480k</td>
          <td>479,035 pairs</td>
          <td>Reactant-product pairs</td>
      </tr>
      <tr>
          <td>Retro reaction</td>
          <td>USPTO-50k</td>
          <td>50,037 pairs</td>
          <td>Product-reactant pairs, no reaction types used</td>
      </tr>
      <tr>
          <td>SMILES-to-PV test</td>
          <td>ZINC15</td>
          <td>1000 molecules</td>
          <td>Not in pre-training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: BPE with 300-subword dictionary</li>
<li><strong>Property masking</strong>: 50% random replacement with [UNK] during pre-training</li>
<li><strong>Momentum distillation</strong>: EMA parameter $\lambda = 0.995$, soft-label mixing $\alpha$ linearly warmed from 0 to 0.4 over first epoch</li>
<li><strong>Contrastive queue</strong>: Size $k = 24{,}576$ for storing recent SMILES and PV instances</li>
<li><strong>Beam search</strong>: $k = 2$ for PV-to-SMILES generation</li>
<li><strong>SMILES augmentation</strong>: Random non-canonical augmentation with probability 0.5 for reaction tasks</li>
</ul>
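<p>The momentum-distillation bookkeeping listed above follows a standard EMA-teacher recipe; a sketch with the stated $\lambda = 0.995$ and label-mixing weight $\alpha$ (function names are illustrative):</p>

```python
def ema_update(teacher, student, lam=0.995):
    """Exponential moving average of student parameters into the teacher."""
    return {k: lam * teacher[k] + (1 - lam) * student[k] for k in teacher}

def soft_labels(one_hot, teacher_probs, alpha):
    """Mix one-hot targets with teacher predictions; alpha warms from 0 to 0.4."""
    return [(1 - alpha) * y + alpha * q for y, q in zip(one_hot, teacher_probs)]

t = ema_update({"w": 1.0}, {"w": 0.0})
assert abs(t["w"] - 0.995) < 1e-12
assert soft_labels([1.0, 0.0], [0.5, 0.5], alpha=0.4) == [0.8, 0.2]
```
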
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: 6 BERT-base encoder layers each for SMILES encoder, PV encoder, and fusion encoder (18 total layers)</li>
<li><strong>Vocabulary</strong>: 300 BPE subwords for SMILES; 53 property tokens for PV</li>
<li><strong>Pre-trained weights</strong>: Available via GitHub</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Validity</td>
          <td>99.5%</td>
          <td>1000 unseen PubChem PVs</td>
      </tr>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Normalized RMSE</td>
          <td>0.216</td>
          <td>Across 53 properties</td>
      </tr>
      <tr>
          <td>SMILES-to-PV</td>
          <td>Mean $r^{2}$</td>
          <td>0.924</td>
          <td>1000 ZINC15 molecules</td>
      </tr>
      <tr>
          <td>Forward reaction (USPTO-480k)</td>
          <td>Top-1 accuracy</td>
          <td>91.5%</td>
          <td>Best among all tested models</td>
      </tr>
      <tr>
          <td>Retro reaction (USPTO-50k)</td>
          <td>Top-1 accuracy</td>
          <td>53.4%</td>
          <td>Second-best string-based</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>AUROC</td>
          <td>92.6%</td>
          <td>Single model vs. 5-ensemble</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Pre-training</strong>: 8 NVIDIA A100 GPUs, approximately 52,000 batch iterations, roughly 12 hours</li>
<li><strong>Batch size</strong>: 96</li>
<li><strong>Optimizer</strong>: AdamW with weight decay 0.02</li>
<li><strong>Learning rate</strong>: Warmed up to $10^{-4}$, cosine decay to $10^{-5}$</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jinhojsk515/SPMM">SPMM Source Code</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with experimental scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10567599">SPMM Zenodo Archive</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Archived version for reproducibility</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>50M molecules for pre-training</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>Varies</td>
          <td>Benchmark datasets via DeepChem</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, J., &amp; Ye, J. C. (2024). Bidirectional generation of structure and properties through a single molecular foundation model. <em>Nature Communications</em>, 15, 2323. <a href="https://doi.org/10.1038/s41467-024-46440-3">https://doi.org/10.1038/s41467-024-46440-3</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chang2024bidirectional,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Bidirectional generation of structure and properties through a single molecular foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chang, Jinho and Ye, Jong Chul}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2323}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-46440-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPE: Data-Driven SMILES Substructure Tokenization</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</guid><description>SMILES Pair Encoding adapts byte pair encoding to learn chemically meaningful substructure tokens from SMILES, improving generation and QSAR prediction.</description><content:encoded><![CDATA[<h2 id="a-data-driven-tokenization-method-for-chemical-deep-learning">A Data-Driven Tokenization Method for Chemical Deep Learning</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Pair Encoding (SPE), a tokenization algorithm adapted from <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">byte pair encoding (BPE)</a> in natural language processing. The primary contribution is a data-driven approach that learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset and then uses that vocabulary to tokenize SMILES for downstream deep learning tasks. The authors provide an open-source Python package (SmilesPE) and demonstrate improvements on both molecular generation and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> prediction benchmarks.</p>
<h2 id="limitations-of-atom-level-smiles-tokenization">Limitations of Atom-Level SMILES Tokenization</h2>
<p>SMILES-based deep learning models require tokenization to convert molecular strings into sequences of discrete units. The standard approaches have well-known drawbacks:</p>
<ul>
<li><strong>Character-level tokenization</strong> breaks SMILES character by character, splitting chemically meaningful multi-character atoms. For example, <code>[C@@H]</code> becomes six separate tokens (<code>[</code>, <code>C</code>, <code>@</code>, <code>@</code>, <code>H</code>, <code>]</code>), losing the stereochemistry information of a single carbon.</li>
<li><strong>Atom-level tokenization</strong> addresses some of these issues by treating multi-character element symbols (Cl, Br) and bracketed atoms ([nH], [O-]) as single tokens. However, these tokens still encode only individual atoms, not substructures.</li>
<li><strong>k-mer tokenization</strong> (sequences of k consecutive overlapping characters) captures some connectivity information but suffers from the out-of-vocabulary problem: the model cannot represent k-mers not seen during training.</li>
</ul>
<p>All three approaches produce relatively long input sequences (mean ~40 tokens per molecule on ChEMBL at the atom level), which increases computational cost for sequential architectures like RNNs and exacerbates long-range dependency issues.</p>
<h2 id="core-innovation-adapting-byte-pair-encoding-for-smiles">Core Innovation: Adapting Byte Pair Encoding for SMILES</h2>
<p>SPE adapts the byte pair encoding algorithm, originally developed for data compression and later adopted for subword tokenization in NLP, to the domain of chemical strings. The algorithm has two phases:</p>
<p><strong>Vocabulary training:</strong></p>
<ol>
<li>Tokenize SMILES from a large dataset (ChEMBL) at the atom level</li>
<li>Initialize the vocabulary with all unique atom-level tokens</li>
<li>Iteratively count the frequency of all adjacent token pairs, merge the most frequent pair into a new token, and add it to the vocabulary</li>
<li>Stop when either the maximum vocabulary size (MVS) or a minimum frequency threshold (FT) is reached</li>
</ol>
<p><strong>Tokenization:</strong> Given a trained SPE vocabulary, a new SMILES string is first tokenized at the atom level, then token pairs are iteratively merged according to their frequency rank in the vocabulary until no further merges are possible.</p>
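<p>Both phases can be sketched in a few lines of Python; this is a simplified illustration of the pair-merging idea, not the SmilesPE package (the atom-level regex is deliberately minimal, and a merge count stands in for the MVS/FT stopping criteria):</p>

```python
import re
from collections import Counter

# Simplified atom-level tokenizer: bracketed atoms, two-letter halogens, then single chars
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def atom_tokenize(smiles):
    return ATOM_PATTERN.findall(smiles)

def merge_pair(toks, pair):
    """Merge every occurrence of an adjacent token pair into one token."""
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
            out.append(toks[i] + toks[i + 1])
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

def train_spe(corpus, num_merges):
    """Learn an ordered list of merges from most frequent adjacent pairs."""
    tokenized = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in tokenized:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        tokenized = [merge_pair(toks, best) for toks in tokenized]
    return merges

def spe_tokenize(smiles, merges):
    """Apply learned merges, in training order, to a new SMILES string."""
    toks = atom_tokenize(smiles)
    for pair in merges:
        toks = merge_pair(toks, pair)
    return toks

merges = train_spe(["CCO", "CCN", "CCC"], num_merges=1)
assert spe_tokenize("CCO", merges) == ["CC", "O"]
```
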
<p>The key hyperparameters are MVS and FT. In the reported experiments, MVS was set to 30,000 and FT was set to 2,000. The vocabulary was trained on ~3.4 million SMILES (both canonical and one non-canonical variant per molecule) from ChEMBL25. The resulting vocabulary contained 3,002 unique SMILES substrings with lengths ranging from 1 to 22 atom-level characters.</p>
<p>The trained SPE vocabulary produces tokens that are human-readable and correspond to chemically meaningful substructures and functional groups. SPE tokenization reduces the mean sequence length from approximately 40 tokens (atom-level) to approximately 6 tokens on ChEMBL, a roughly 6-7x compression. This shorter representation directly reduces computational cost for RNN-based and other sequential models.</p>
<p>The algorithm is also compatible with other text-based molecular representations such as <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, since these share atom-level character structures that can serve as the starting point for pair merging.</p>
<h2 id="molecular-generation-and-qsar-prediction-experiments">Molecular Generation and QSAR Prediction Experiments</h2>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>The authors trained AWD-LSTM language models with SPE and atom-level tokenization on 9 million SMILES (1 canonical + 5 non-canonical per compound from ChEMBL25). Each model sampled 1 million SMILES for evaluation. The AWD-LSTM architecture used an embedding size of 400, three LSTM layers with 1,152 hidden units each, and various dropout settings (embedding: 0.1, input: 0.6, weight: 0.5, hidden: 0.2). Models were trained for 10 epochs with a base learning rate of 0.008 using one-cycle scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SPE</th>
          <th>Atom-level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>0.941</td>
          <td>0.970</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.994</td>
          <td>0.992</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.983</td>
          <td>0.978</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.897</td>
          <td>0.886</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>0.391</td>
          <td>0.386</td>
      </tr>
  </tbody>
</table>
<p>The SPE model generated a more diverse population of novel molecules at the cost of slightly lower validity (94.1% vs. 97.0%). Internal diversity is defined as:</p>
<p>$$
\text{Internal diversity} = 1 - \frac{1}{|G|^{2}} \sum_{(x_1, x_2) \in G \times G} T(x_1, x_2)
$$</p>
<p>where $T(x_1, x_2)$ is the Tanimoto similarity between molecules $x_1$ and $x_2$ using 1024-bit ECFP6 fingerprints. Nearest neighbor similarity (SNN) measures how well the generated set resembles the reference set:</p>
<p>$$
\text{SNN} = \frac{1}{|G|} \sum_{x_G \in G} \max_{x_R \in R} T(x_G, x_R)
$$</p>
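<p>Both metrics reduce to set operations on fingerprint on-bits; a sketch using Python sets in place of 1024-bit ECFP6 vectors (the fingerprints themselves would come from RDKit, which is assumed here rather than shown):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def internal_diversity(gen):
    """One minus the mean pairwise similarity over all ordered pairs in the generated set."""
    n = len(gen)
    total = sum(tanimoto(x1, x2) for x1 in gen for x2 in gen)
    return 1.0 - total / (n * n)

def snn(gen, ref):
    """Mean over generated molecules of the max similarity to any reference molecule."""
    return sum(max(tanimoto(g, r) for r in ref) for g in gen) / len(gen)

G = [{1, 2, 3}, {3, 4, 5}]                    # two toy "fingerprints"
assert abs(internal_diversity(G) - 0.4) < 1e-9  # pairwise sims: 1, 0.2, 0.2, 1
```
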
<p>Substructure coverage analysis showed both models recovered the same top-1000 BRICS fragments (100% coverage), but SPE consistently outperformed atom-level tokenization on top-5000 coverage across all four substructure types: BRICS fragments (0.997 vs. 0.987), functional groups (0.688 vs. 0.659), scaffolds (0.872 vs. 0.825), and ring systems (0.781 vs. 0.761).</p>
<h3 id="qsar-prediction">QSAR Prediction</h3>
<p>QSAR models were built using the <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT transfer learning framework</a>, which pre-trains a language model on ChEMBL and then fine-tunes it for specific prediction tasks. The evaluation used 24 regression benchmarks (pIC50 values) from Cortes-Ciriano et al., covering targets ranging from 199 molecules (alpha-2a adrenergic receptor) to 5,010 molecules (<a href="https://en.wikipedia.org/wiki/KCNH2">hERG</a>). Models were evaluated on 10 random 80:10:10 splits using RMSE, R-squared, and MAE. Random forest models with 1024-bit ECFP6 were included as baseline comparisons.</p>
<p><a href="https://en.wikipedia.org/wiki/Effect_size">Cohen&rsquo;s d</a> effect sizes were computed to quantify performance differences between tokenization methods. SPE performed comparably or better than atom-level tokenization on 23 out of 24 datasets. Notable results with medium or large effect sizes favoring SPE included <a href="https://en.wikipedia.org/wiki/Cannabinoid_receptor_1">cannabinoid CB1 receptor</a> (large effect), A2a adrenergic receptor, LCK, estrogen receptor, and <a href="https://en.wikipedia.org/wiki/Aurora_kinase_A">Aurora-A kinase</a> (all medium effects). Against k-mer tokenization, SPE matched or outperformed on 22 out of 24 datasets.</p>
<p>Cohen&rsquo;s d is defined as:</p>
<p>$$
\text{Cohen&rsquo;s } d = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(\text{SD}_1^2 + \text{SD}_2^2) / 2}}
$$</p>
<p>where $\bar{x}_1, \bar{x}_2$ are the group means and $\text{SD}_1, \text{SD}_2$ are the standard deviations. Thresholds of 0.2 (small), 0.5 (medium), and 0.8 (large) were used following standard recommendations.</p>
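<p>This definition and the thresholds translate directly into code; a minimal implementation using the standard library:</p>

```python
import statistics

def cohens_d(x1, x2):
    """Cohen's d with the average-variance pooled standard deviation."""
    m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
    sd1, sd2 = statistics.stdev(x1), statistics.stdev(x2)
    return (m1 - m2) / (((sd1 ** 2 + sd2 ** 2) / 2) ** 0.5)

def effect_label(d):
    """Standard 0.2 / 0.5 / 0.8 thresholds for small / medium / large effects."""
    d = abs(d)
    return ("large" if d >= 0.8 else
            "medium" if d >= 0.5 else
            "small" if d >= 0.2 else
            "negligible")

assert cohens_d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
assert effect_label(0.6) == "medium"
```
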
<p>SMILES-based deep learning models generally performed on par with or better than the RF baseline, with particularly strong advantages on the four largest datasets (<a href="https://en.wikipedia.org/wiki/Cyclooxygenase-2">COX-2</a>, <a href="https://en.wikipedia.org/wiki/Acetylcholinesterase">acetylcholinesterase</a>, erbB1, and hERG).</p>
<p>In addition to performance gains, SPE-based models trained on average 5 times faster than atom-level models due to the shorter input sequences.</p>
<h2 id="results-summary-and-future-directions">Results Summary and Future Directions</h2>
<p>The main findings of this study are:</p>
<ol>
<li>
<p><strong>SPE produces chemically meaningful tokens.</strong> The learned vocabulary contains human-readable SMILES substrings that correspond to common substructures and functional groups, making model interpretations more accessible.</p>
</li>
<li>
<p><strong>SPE compresses input sequences by ~6-7x.</strong> Mean token sequence length drops from ~40 (atom-level) to ~6 (SPE) on ChEMBL, yielding a ~5x training speedup.</p>
</li>
<li>
<p><strong>SPE improves molecular generation diversity.</strong> The SPE-based generative model produces molecules with higher novelty (98.3% vs. 97.8%), internal diversity (0.897 vs. 0.886), and substructure coverage, at the cost of slightly lower validity (94.1% vs. 97.0%).</p>
</li>
<li>
<p><strong>SPE matches or outperforms atom-level and k-mer tokenization on QSAR prediction.</strong> Across 24 benchmarks, SPE showed comparable or better performance in 23/24 comparisons against atom-level and 22/24 against k-mer tokenization.</p>
</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The SPE vocabulary is trained on a specific dataset (ChEMBL25) and may not optimally represent chemical spaces that differ significantly from drug-like compounds.</li>
<li>The validity rate for molecular generation is slightly lower than atom-level tokenization (94.1% vs. 97.0%), since longer substructure tokens can introduce invalid fragments.</li>
<li>The k-mer tokenization suffers from an out-of-vocabulary problem, which the authors address by replacing unseen 4-mers with <code>[UNK]</code> tokens, but this is a limitation of the comparison rather than of SPE itself.</li>
</ul>
<p><strong>Future directions:</strong> The authors suggest SPE could serve as a general tokenization method for SMILES-based deep learning, applicable to any task where SMILES strings are used as input (<a href="/notes/chemistry/molecular-design/generation/">generation</a>, <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/">reaction prediction</a>, retrosynthesis). The algorithm can also be applied to DeepSMILES and SELFIES representations without modification.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SPE vocabulary training</td>
          <td>ChEMBL25</td>
          <td>~3.4M SMILES</td>
          <td>1 canonical + 1 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Language model training</td>
          <td>ChEMBL25 augmented</td>
          <td>~9M SMILES</td>
          <td>1 canonical + 5 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Molecular generation evaluation</td>
          <td>Sampled from model</td>
          <td>1M SMILES per model</td>
          <td>Validated with RDKit</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>Cortes-Ciriano et al.</td>
          <td>24 datasets, 199-5010 molecules</td>
          <td>pIC50 regression tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SPE vocabulary training: iterative pair merging with a maximum vocabulary size (MVS) of 30,000 and a token frequency threshold (FT) of 2,000</li>
<li>Language model: AWD-LSTM with embedding size 400, 3 LSTM layers with 1,152 hidden units</li>
<li>Dropout: embedding=0.1, input=0.6, weight=0.5, hidden=0.2</li>
<li>Training: 10 epochs, base learning rate 0.008, one-cycle policy</li>
<li>QSAR: MolPMoFiT transfer learning with 25x training augmentation and 15x validation augmentation</li>
<li>Test time augmentation: average of canonical + 4 augmented SMILES predictions</li>
<li>RF baseline: 500 trees, 1024-bit ECFP6, default scikit-learn parameters</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>AWD-LSTM architecture from Merity et al. (2018)</li>
<li>MolPMoFiT framework from Li and Fourches (2020) for transfer learning QSAR</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity, Uniqueness, Novelty</td>
          <td>Generation</td>
          <td>Basic quality metrics</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>Generation</td>
          <td>1 - mean pairwise Tanimoto (ECFP6)</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>Generation</td>
          <td>Mean max Tanimoto to reference set</td>
      </tr>
      <tr>
          <td>Substructure coverage</td>
          <td>Generation</td>
          <td>BRICS, functional groups, scaffolds, ring systems</td>
      </tr>
      <tr>
          <td>RMSE, R-squared, MAE</td>
          <td>QSAR regression</td>
          <td>10 random 80:10:10 splits</td>
      </tr>
      <tr>
          <td>Cohen&rsquo;s d</td>
          <td>QSAR comparison</td>
          <td>Effect size between tokenization methods</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/SmilesPE">SmilesPE</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>SPE tokenization Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transfer learning QSAR framework</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2021). SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 61(4), 1560-1569. <a href="https://doi.org/10.1021/acs.jcim.0c01127">https://doi.org/10.1021/acs.jcim.0c01127</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2021smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1560--1569}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01127}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Smirk: Complete Tokenization for Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</guid><description>Smirk tokenizer achieves full OpenSMILES coverage with 165 tokens by decomposing bracketed atoms into glyphs, validated via n-gram proxy models.</description><content:encoded><![CDATA[<h2 id="a-method-for-complete-chemical-tokenization">A Method for Complete Chemical Tokenization</h2>
<p>This is a <strong>Method</strong> paper that introduces two new tokenizers for molecular foundation models: Smirk and Smirk-GPE. The primary contribution is a tokenization scheme that achieves complete coverage of the OpenSMILES specification using only 165 tokens, addressing the vocabulary gaps present in existing atom-wise tokenizers. The paper also proposes n-gram language models as low-cost proxy evaluators for tokenizer quality and validates these proxies against 18 transformer-based models across multiple benchmarks.</p>
<h2 id="vocabulary-gaps-in-molecular-tokenization">Vocabulary Gaps in Molecular Tokenization</h2>
<p>Molecular foundation models overwhelmingly use &ldquo;atom-wise&rdquo; tokenization, where SMILES strings are split at atom boundaries using a regular expression first proposed by Schwaller et al. A key pattern in this regex treats all &ldquo;bracketed atoms&rdquo; (e.g., <code>[C@@H]</code>, <code>[18F]</code>, <code>[Au+]</code>) as single, irreducible tokens. Since bracketed atoms encode isotopes, chirality, charge, hydrogen count, and element identity, the number of possible permutations under the OpenSMILES specification exceeds 28 trillion. In practice, existing atom-wise tokenizers maintain vocabularies of fewer than 3,000 tokens, leaving large portions of chemical space unrepresentable.</p>
<p>This gap has real consequences. Many chemistry-specific tokenizers emit the unknown token <code>[UNK]</code> at non-negligible frequencies, particularly on datasets with diverse elements and stereochemistry. For example, <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SPE and APE</a> tokenizers produce <code>[UNK]</code> for roughly 19% of tokens on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and approximately 50% on the tmQM transition metal complex dataset. Even models like <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a> and <a href="/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/">ReactionT5</a> lack tokens for elements such as copper, ruthenium, gold, and uranium.</p>
<p>The authors also note a subtler issue: some open-vocabulary tokenizers (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa&rsquo;s</a> BPE) conflate chemically distinct entities. The same <code>Sc</code> token may represent both a sulfur-carbon bond (in organic SMILES) and the element scandium (in <code>[Sc]</code>), creating ambiguity in downstream analysis.</p>
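<p>As a concrete illustration (not the exact production regex of any particular model), an atom-wise tokenizer in the style of Schwaller et al. can be written with a single regular expression. Note how the entire bracketed atom survives as one token, so any bracketed atom absent from a finite vocabulary collapses to <code>[UNK]</code>; the toy <code>vocab</code> below is invented for the example.</p>

```python
import re

# Atom-wise SMILES tokenizer in the style of the Schwaller et al. regex.
# Each bracketed atom ([...]) is captured as a single, irreducible token.
ATOMWISE = re.compile(
    r"\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d"
)

def tokenize(smiles):
    return ATOMWISE.findall(smiles)

tokens = tokenize("C[C@@H](O)[18F]")
# ['C', '[C@@H]', '(', 'O', ')', '[18F]']

# A finite vocabulary without '[18F]' has no choice but to emit [UNK].
vocab = {"C", "O", "(", ")", "[C@@H]"}
mapped = [t if t in vocab else "[UNK]" for t in tokens]
# ['C', '[C@@H]', '(', 'O', ')', '[UNK]']
```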
<h2 id="smirk-glyph-level-decomposition-of-smiles">Smirk: Glyph-Level Decomposition of SMILES</h2>
<p>The core insight behind Smirk is to fully decompose bracketed atoms into their constituent &ldquo;glyphs,&rdquo; the primitive symbols defined by the OpenSMILES specification (element symbols, chirality markers, charges, isotope numbers, hydrogen counts, and brackets themselves). This transforms tokenization from a word-level scheme (one token per bracketed atom) to a character-level scheme over chemically meaningful glyphs.</p>
<p>Smirk uses a two-stage tokenization process:</p>
<ol>
<li><strong>Atom decomposition</strong>: Split a SMILES string into atom-level units using a regex (e.g., <code>OC[C@@H][OH]</code> becomes <code>O C [C@@H] [OH]</code>).</li>
<li><strong>Glyph decomposition</strong>: Further split each unit into its constituent glyphs (e.g., <code>[C@@H]</code> becomes <code>[ C @@ H ]</code>).</li>
</ol>
<p>The two-stage process is necessary to resolve ambiguities. For example, <code>Sc</code> in an unbracketed context represents a sulfur-carbon bond, while <code>[Sc]</code> denotes scandium. This ambiguity occurs over half a million times in PubChem&rsquo;s compound dataset.</p>
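<p>A toy version of the two-stage split can be written in a few lines of Python (the real Smirk tokenizer is implemented in Rust, and the regexes below cover only a fragment of OpenSMILES; they are invented for illustration):</p>

```python
import re

# Stage 1: split into atom-level units (bracketed atoms kept whole for now).
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|[()=#+\-\\/%.\d@]")
# Stage 2: split a bracketed atom into glyphs -- isotope digits, element
# symbol, chirality markers, hydrogen count, charge, and the brackets.
GLYPH_RE = re.compile(r"\d+|@@?|[A-Z][a-z]?|[a-z]|[\[\]+\-]")

def smirk_like_tokenize(smiles):
    tokens = []
    for unit in ATOM_RE.findall(smiles):
        if unit.startswith("["):
            tokens.extend(GLYPH_RE.findall(unit))  # glyph decomposition
        else:
            tokens.append(unit)
    return tokens

# The two-stage order resolves the Sc ambiguity from the text:
smirk_like_tokenize("Sc")    # ['S', 'c']       sulfur bonded to aromatic carbon
smirk_like_tokenize("[Sc]")  # ['[', 'Sc', ']'] the element scandium
```

<p>Running the atom split first is what lets the glyph pass treat <code>Sc</code> inside brackets as a single element symbol while leaving the unbracketed sulfur-carbon pair as two tokens.</p>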
<p>The resulting vocabulary contains only 165 tokens, requires no training, and by construction can faithfully tokenize any molecule that conforms to the OpenSMILES specification. The implementation is written in Rust using HuggingFace&rsquo;s Tokenizers library and is available on PyPI.</p>
<p><strong>Smirk-GPE</strong> (Glyph Pair Encoding) extends Smirk with a <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">BPE</a>-like compression step. After Smirk tokenization, adjacent tokens are merged using learned rules, reducing sequence length. Unlike standard BPE, merges operate on token IDs rather than character strings, preserving the distinction between chemically different entities that happen to share the same characters. Smirk-GPE was trained on 262 million molecules from Enamine REAL Space with a target vocabulary of 50,000 tokens, though training terminated at 2,300 tokens after exhausting all possible merges.</p>
<h2 id="evaluation-framework-intrinsic-metrics-n-gram-proxies-and-transformer-benchmarks">Evaluation Framework: Intrinsic Metrics, N-Gram Proxies, and Transformer Benchmarks</h2>
<p>The evaluation covers 34 tokenizers across three datasets (Enamine REALSpace, MoleculeNet, and tmQM) using both intrinsic and extrinsic metrics.</p>
<h3 id="intrinsic-metrics">Intrinsic Metrics</h3>
<p>Four intrinsic metrics are computed for each tokenizer:</p>
<p><strong>Fertility</strong> measures the mean tokenized sequence length. Higher fertility increases computational cost due to the quadratic scaling of attention:</p>
<p>$$
\text{cost} \propto \text{fertility}^2
$$</p>
<p><strong>Normalized entropy</strong> quantifies how close a tokenizer comes to the information-theoretic ideal where all tokens are equally probable:</p>
<p>$$
\eta = \frac{-1}{\log |V|} \sum_{x \in V} p(x) \log p(x)
$$</p>
<p>where $V$ is the vocabulary and $p(x)$ is the observed token probability. Higher normalized entropy correlates with better downstream performance.</p>
<p><strong>Token imbalance</strong> measures the distance between observed token frequencies and a uniform distribution:</p>
<p>$$
D = \frac{1}{2} \sum_{x \in V} \left| p(x) - |V|^{-1} \right|
$$</p>
<p><strong>Unknown token frequency</strong> captures the fraction of emitted tokens that are <code>[UNK]</code>. This metric is particularly revealing: all existing chemistry-specific tokenizers (SPE/APE, atom-wise, BPE, and Unigram variants) emit <code>[UNK]</code> at non-negligible rates, while NLP tokenizers, Smirk, and Smirk-GPE do not.</p>
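<p>The distribution-level metrics are straightforward to compute from token counts. A minimal sketch of fertility, normalized entropy, and token imbalance as defined above (the function name and toy inputs are invented for illustration):</p>

```python
import math
from collections import Counter

def intrinsic_metrics(token_sequences, vocab):
    """Fertility, normalized entropy, and token imbalance for a tokenizer's
    output on a corpus, following the definitions above."""
    counts = Counter(t for seq in token_sequences for t in seq)
    total = sum(counts.values())
    fertility = total / len(token_sequences)  # mean tokens per molecule
    p = {t: c / total for t, c in counts.items()}
    entropy = -sum(px * math.log(px) for px in p.values())
    normalized_entropy = entropy / math.log(len(vocab))
    uniform = 1.0 / len(vocab)
    # Sum over the whole vocabulary: unseen tokens contribute |0 - 1/|V||.
    imbalance = 0.5 * sum(abs(p.get(t, 0.0) - uniform) for t in vocab)
    return fertility, normalized_entropy, imbalance

# Perfectly balanced usage of a 2-token vocabulary:
intrinsic_metrics([["A", "B"], ["B", "A"]], ["A", "B"])  # (2.0, 1.0, 0.0)
```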
<h3 id="n-gram-proxy-language-models">N-Gram Proxy Language Models</h3>
<p>The paper proposes using n-gram models as low-cost proxies for transformer-based evaluation. An n-gram estimates token likelihood with <a href="https://en.wikipedia.org/wiki/Additive_smoothing">add-one smoothing</a>:</p>
<p>$$
P_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}) = \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|}
$$</p>
<p>where $C$ is the count function and $|V|$ is the vocabulary size. N-grams were &ldquo;pretrained&rdquo; on 1.6 billion SMILES from Enamine REAL Space and evaluated on validation splits. Cross-entropy loss and information loss from unknown tokens were computed.</p>
<p>To quantify information lost to <code>[UNK]</code> tokens, the authors compute the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a> between token distributions with and without unknown tokens, using a bidirectional character n-gram model:</p>
<p>$$
B_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}, x_{i+1}, \dots, x_{i+n-1}) \propto \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|} \times \frac{C(x_{i}, \dots, x_{i+n-1}) + 1}{C(x_{i+1}, \dots, x_{i+n-1}) + |V|}
$$</p>
<h3 id="transformer-experiments">Transformer Experiments</h3>
<p>Eighteen encoder-only RoBERTa models (25M parameters each, excluding embeddings) were pretrained from scratch using masked language modeling on Enamine REAL Space (245M molecules, 30,000 steps). Each model used a different tokenizer, isolating the tokenizer&rsquo;s effect on performance. Finetuning was conducted on six regression and seven classification tasks from MoleculeNet and tmQM.</p>
<p>Linear fixed-effects models were used to estimate the standardized effect of each tokenization scheme relative to an atom-wise SMILES baseline.</p>
<h2 id="key-findings-and-practical-implications">Key Findings and Practical Implications</h2>
<h3 id="tokenizer-performance">Tokenizer Performance</h3>
<ul>
<li><strong>Smirk</strong> shows a positive effect on pretraining quality and downstream performance on tmQM (the dataset with the most bracketed atoms), but performs comparably to atom-wise tokenization on MoleculeNet tasks.</li>
<li><strong>SPE and APE</strong> tokenizers have a negative impact on both pretraining and downstream performance relative to the atom-wise baseline, likely due to their high <code>[UNK]</code> rates.</li>
<li><strong>Molecular encoding choice</strong> (<a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">SMILES vs. SELFIES</a>) has a negligible effect on performance.</li>
<li><strong>NLP tokenizers</strong> (GPT-4o, LLaMA, Gemma) score comparably to chemistry-specific tokenizers on intrinsic metrics and do not emit unknown tokens.</li>
</ul>
<h3 id="n-gram-proxy-validation">N-Gram Proxy Validation</h3>
<p>N-gram cross-entropy and information loss metrics show strong rank correlation (Spearman&rsquo;s $\rho$) with downstream transformer performance, validating their use as low-cost evaluation proxies. The effect sizes from n-gram and transformer experiments are directionally consistent.</p>
<h3 id="information-loss-from-unknown-tokens">Information Loss from Unknown Tokens</h3>
<p>Information loss is minimal for tokenizers with robust coverage but substantial for tokenizers with limited vocabularies on chemically diverse datasets. <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> incurs only 0.1 nats/molecule on MoleculeNet but 40.3 nats/molecule on tmQM. Open-vocabulary tokenizers (Smirk, Smirk-GPE, NLP tokenizers) mitigate this degradation.</p>
<h3 id="practical-recommendations">Practical Recommendations</h3>
<p>The authors argue that molecular foundation models must encode the entire breadth of chemical space or risk obscuring critical features. Bracketed atoms encode information essential to clinically relevant pharmaceuticals (e.g., <a href="https://en.wikipedia.org/wiki/Amoxicillin">Amoxicillin</a>), industrial compounds (e.g., Tricalcium Silicate), and foundational chemistry (e.g., <a href="https://en.wikipedia.org/wiki/Cisplatin">Cisplatin</a>, where omitting the chiral marker erases medically relevant stereochemical information). The paper encourages the community to adopt open-vocabulary tokenizers and develop more chemically diverse benchmarks.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The analysis uses a single-point evaluation for transformer experiments, which may underestimate performance achievable with additional hyperparameter tuning.</li>
<li>Smirk-GPE&rsquo;s learned merges from REALSpace did not fully generalize to tmQM, as indicated by the token imbalance metric.</li>
<li>Current benchmarks (MoleculeNet) lack sufficient diversity to evaluate tokenizer robustness across the full periodic table, isotopes, charged species, and uncommon bond types.</li>
<li>The downstream impact of token ambiguities in BPE-based tokenizers (e.g., ChemBERTa&rsquo;s conflation of <code>Sc</code> as both sulfur-carbon and scandium) remains unclear.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>Enamine REAL Space</td>
          <td>1.6B SMILES (n-gram), 245M molecules (transformer)</td>
          <td>80/10/10 train/val/test split</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet</td>
          <td>Multiple tasks</td>
          <td>6 regression + 7 classification tasks</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>tmQM</td>
          <td>108K transition metal complexes</td>
          <td>OpenSMILES molecular encodings</td>
      </tr>
      <tr>
          <td>Smirk-GPE training</td>
          <td>Enamine REAL Space (subset)</td>
          <td>262M molecules</td>
          <td>Training split only</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Smirk</strong>: Two-stage regex-based tokenization (atom decomposition, then glyph decomposition). No training required. Vocabulary: 165 tokens.</li>
<li><strong>Smirk-GPE</strong>: BPE-like compression on top of Smirk. Operates on token IDs (not strings) to preserve chemical disambiguation. Final vocabulary: 2,300 tokens.</li>
<li><strong>N-gram models</strong>: Add-one smoothing, bidirectional context ($2n - 2$ total context window). Implemented in Julia with exact integer arithmetic.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa-PreLayerNorm, 8 layers, 8 attention heads, hidden size 512, intermediate size 2048, max sequence length 2048. ~25M parameters (excluding embeddings).</li>
<li><strong>Pretraining</strong>: Masked language modeling, 30,000 steps, effective batch size 8192, FusedLamb optimizer, learning rate $1.6 \times 10^{-4}$.</li>
<li><strong>Finetuning</strong>: 100,000 steps, AdamW optimizer, effective batch size 128, learning rate $1.6 \times 10^{-4}$.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>MoleculeNet preferred metrics per task (AUROC for classification, MAE/RMSE for regression)</li>
<li>Fixed-effects models for standardized effect size estimation</li>
<li>Spearman&rsquo;s rank correlation between n-gram and transformer metrics</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pretraining: 2x NVIDIA A100 GPUs (Delta system at NCSA)</li>
<li>Finetuning: 1x NVIDIA A40 GPU</li>
<li>N-gram models: CPU-based (Julia implementation)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BattModels/Smirk">Smirk tokenizer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Rust implementation with Python bindings, available on PyPI</td>
      </tr>
      <tr>
          <td>Model checkpoints</td>
          <td>Model</td>
          <td>Not specified</td>
          <td>Pretrained and finetuned checkpoints included in data release</td>
      </tr>
      <tr>
          <td>N-gram code</td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Julia implementation included in data release</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wadell, A., Bhutani, A., &amp; Viswanathan, V. (2026). Tokenization for Molecular Foundation Models. <em>Journal of Chemical Information and Modeling</em>, 66(3), 1384-1393. <a href="https://doi.org/10.1021/acs.jcim.5c01856">https://doi.org/10.1021/acs.jcim.5c01856</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wadell2026tokenization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tokenization for Molecular Foundation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wadell, Alexius and Bhutani, Anoushka and Viswanathan, Venkatasubramanian}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{66}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1384--1393}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.5c01856}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES2Vec: Interpretable Chemical Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/</guid><description>SMILES2Vec uses a Bayesian-optimized CNN-GRU architecture to predict chemical properties directly from SMILES strings with an interpretable explanation mask.</description><content:encoded><![CDATA[<h2 id="a-general-purpose-rnn-for-chemical-property-prediction-from-smiles">A General-Purpose RNN for Chemical Property Prediction from SMILES</h2>
<p>SMILES2Vec is a <strong>Method</strong> paper that introduces a deep recurrent neural network architecture for predicting chemical properties directly from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> text representations. The primary contributions are: (1) a Bayesian-optimized CNN-<a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> architecture that serves as a general-purpose predictor for diverse chemical properties (toxicity, activity, solubility, <a href="https://en.wikipedia.org/wiki/Solvation">solvation</a> energy), (2) an explanation mask mechanism that provides interpretable predictions by identifying which SMILES characters drive the network&rsquo;s decisions, and (3) evidence that representation learning from raw SMILES can match or outperform models using hand-crafted molecular descriptors.</p>
<h2 id="motivation-beyond-engineered-features-in-chemical-modeling">Motivation: Beyond Engineered Features in Chemical Modeling</h2>
<p>At the time of writing (2017), deep learning models in chemistry relied heavily on engineered <a href="https://en.wikipedia.org/wiki/Molecular_descriptor">molecular descriptors</a> and fingerprints as input features. Over 5,000 molecular descriptors had been developed since the late 1940s, and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a>/QSPR modeling remained the dominant paradigm. The authors identified two key limitations with this approach:</p>
<ol>
<li><strong>Restricted search space</strong>: Engineered features limit the neural network&rsquo;s ability to discover potentially useful representations that domain experts have not anticipated.</li>
<li><strong>Incomplete domain knowledge</strong>: For complex properties where first-principles understanding is incomplete, the lack of appropriate descriptors constrains model performance.</li>
</ol>
<p>In contrast, computer vision and NLP had shown that deep learning models trained on raw data (unaltered images, raw text) could learn powerful representations without feature engineering. The chemical SMILES notation, a text-based encoding of molecular structure that serves as the standard interchange format in cheminformatics, provided a natural analog to text data for NLP-style modeling.</p>
<p>A secondary motivation was interpretability. Most ML and DL models for chemistry operated as black boxes, which posed particular problems for regulated applications like FDA drug approval where mechanistic explanations are required.</p>
<h2 id="core-innovation-cnn-gru-architecture-with-explanation-masks">Core Innovation: CNN-GRU Architecture with Explanation Masks</h2>
<h3 id="architecture-design-via-bayesian-optimization">Architecture Design via <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a></h3>
<p>SMILES2Vec treats SMILES strings as character-level text input. The network processes one-hot encoded characters (padded to length 250, covering 99.9% of the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database) through three stages:</p>
<ol>
<li><strong>Embedding layer</strong>: Maps one-hot character vectors to a learned embedding space (size 50)</li>
<li><strong>1D convolutional layer</strong>: 192 filters with kernel size 3, stride 1</li>
<li><strong>Bidirectional GRU layers</strong>: Two layers with 224 and 384 units respectively</li>
</ol>
<p>The authors explored four architectural classes (GRU, LSTM, CNN-GRU, CNN-LSTM) using Bayesian optimization via SigOpt. Each class was evaluated over 60 trials, optimizing embedding size, convolutional filter count, and RNN layer widths. The CNN-GRU class was selected as the best compromise: CNN-LSTM performed best on classification (Tox21), while GRU-based networks excelled at regression (FreeSolv). The final architecture is summarized by the hyperparameters:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Parameter</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Embedding</td>
          <td>Size</td>
          <td>50</td>
      </tr>
      <tr>
          <td>Conv1D</td>
          <td>Filters</td>
          <td>192</td>
      </tr>
      <tr>
          <td>BiGRU Layer 1</td>
          <td>Units</td>
          <td>224</td>
      </tr>
      <tr>
          <td>BiGRU Layer 2</td>
          <td>Units</td>
          <td>384</td>
      </tr>
  </tbody>
</table>
<h3 id="explanation-mask-for-interpretability">Explanation Mask for Interpretability</h3>
<p>The explanation mask is a post-hoc interpretability mechanism. Given a trained (frozen) SMILES2Vec base model, a separate explanation network learns to produce a per-character mask over the input SMILES string. The mask is trained to preserve the base model&rsquo;s output while masking as much input as possible. The loss function for a single sample is:</p>
<p>$$
\text{Loss}_i = \left\lVert f(\text{SMILES}_i, \theta) - \text{Sol}(\text{SMILES}_i) \right\rVert_2 + 10^{-6} \left\lVert \text{MASK}_i \right\rVert_2 + 0.05\, H(\text{MASK}_i)
$$</p>
<p>where $f(\text{SMILES}_i, \theta)$ is the base network prediction, $\text{Sol}(\text{SMILES}_i)$ is the ground truth solubility, $H$ is the entropy of the normalized mask, and $\text{MASK}_i$ is the per-character mask vector. The L2 term encourages sparsity and the entropy term penalizes uniform attention distributions.</p>
<p>The explanation network itself is a 20-layer residual network with SELU activations, ending in a 1D convolution with kernel size 1, batch normalization, and a softplus activation. Because the softplus output ranges from 0 (fully masked) upward without bound (amplified attention), the mask can both suppress and emphasize specific SMILES characters.</p>
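<p>The objective can be made concrete with a small numeric sketch in plain Python. The coefficients come from the formula above, but the function itself is illustrative, not the authors' implementation:</p>

```python
import math

def mask_loss(pred, target, mask, l2_coef=1e-6, ent_coef=0.05):
    """Schematic explanation-mask objective: prediction fidelity plus an
    L2 penalty on the mask plus the entropy of the normalized mask.
    `pred` is the frozen base model's output on the masked input and
    `mask` is the per-character attention vector (all entries >= 0)."""
    fidelity = abs(pred - target)             # scalar regression error
    l2 = math.sqrt(sum(m * m for m in mask))  # ||MASK||_2
    total = sum(mask)                         # assumed > 0: some attention kept
    probs = [m / total for m in mask if m > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return fidelity + l2_coef * l2 + ent_coef * entropy

# A mask concentrated on one character vs. spread uniformly:
mask_loss(0.0, 0.0, [1.0, 0.0, 0.0, 0.0])  # ~1e-6 (entropy term vanishes)
mask_loss(0.0, 0.0, [1.0, 1.0, 1.0, 1.0])  # pays the full entropy penalty
```

<p>With prediction error held at zero, the concentrated mask is cheaper than the uniform one, which is exactly the pressure that drives the explanation toward a small set of decisive SMILES characters.</p>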
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="datasets">Datasets</h3>
<p>The model was evaluated on four datasets from the <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark and the ESOL solubility dataset:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Property</th>
          <th>Task</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>Toxicity</td>
          <td>Multi-task classification</td>
          <td>8,014</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Activity</td>
          <td>Single-task classification</td>
          <td>41,193</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Solvation energy</td>
          <td>Single-task regression</td>
          <td>643</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Solubility</td>
          <td>Single-task regression</td>
          <td>1,128</td>
      </tr>
  </tbody>
</table>
<p>SMILES strings longer than 250 characters were excluded. Classification datasets (Tox21, HIV) used a 1/6 test split with minority-class oversampling; regression datasets (FreeSolv, ESOL) used a 1/10 test split. All experiments used 5-fold cross-validation.</p>
<h3 id="training-protocol">Training Protocol</h3>
<ul>
<li><strong>Optimizer</strong>: RMSprop with learning rate $10^{-3}$, $\rho = 0.9$, $\epsilon = 10^{-8}$</li>
<li><strong>Batch size</strong>: 32</li>
<li><strong>Epochs</strong>: 250 with early stopping (patience of 25 epochs based on validation loss)</li>
<li><strong>Classification loss</strong>: Binary cross-entropy</li>
<li><strong>Regression loss</strong>: Mean absolute error</li>
<li><strong>Metrics</strong>: AUC for classification, RMSE for regression</li>
</ul>
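<p>The early-stopping rule (stop after 25 epochs without validation improvement, keep the best weights) can be sketched as a small helper; the class name and interface here are illustrative, not from the paper:</p>

```python
class EarlyStopping:
    """Stop training once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=25):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy run: loss improves for 5 epochs, then plateaus; training halts 25 epochs later.
stopper = EarlyStopping(patience=25)
stopped_at = None
for epoch in range(250):
    val_loss = 1.0 - 0.1 * min(epoch, 5)  # plateaus at epoch 5
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break
```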
<h3 id="baselines">Baselines</h3>
<p>SMILES2Vec was compared against:</p>
<ul>
<li><strong>MLP with engineered features</strong>: Standard multi-layer perceptron using molecular fingerprints (from MoleculeNet)</li>
<li><strong>Molecular graph convolutions</strong>: Graph-based neural network from MoleculeNet</li>
<li><strong>Chemception</strong>: CNN operating on 2D chemical images</li>
</ul>
<h3 id="bayesian-optimization-protocol">Bayesian Optimization Protocol</h3>
<p>Only two datasets were used for architecture optimization: the nr-ahr toxicity task from Tox21 (classification) and FreeSolv (regression). The remaining datasets (full Tox21, HIV, ESOL) served purely for generalization evaluation. A fixed test set was held out during optimization, and the correlation between validation and test metrics (0.54 for Tox21, 0.78 for FreeSolv) suggested limited overfitting to the validation set.</p>
<h2 id="results-competitive-accuracy-with-interpretable-predictions">Results: Competitive Accuracy with Interpretable Predictions</h2>
<h3 id="property-prediction-performance">Property Prediction Performance</h3>
<p>SMILES2Vec achieved the following validation metrics (with a pre-training approach from ChemNet improving performance slightly):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SMILES2Vec</th>
          <th>SMILES2Vec + Pre-training</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>AUC</td>
          <td>0.80</td>
          <td>0.81</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>AUC</td>
          <td>0.78</td>
          <td>0.80</td>
          <td>0.80</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE (kcal/mol)</td>
          <td>1.4</td>
          <td>1.2</td>
          <td>1.3</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.63</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Numbers for the MLP and Chemception baselines appear only in a bar chart (Figure 6) and were not reported as precise values. The paper states that MLP with fingerprints performed worst across all tasks, and that Chemception fell between MLP and the graph/SMILES methods.</p>
<p>Key findings:</p>
<ul>
<li>SMILES2Vec outperformed MLP models using engineered features across all tasks, despite using no feature engineering.</li>
<li>Against graph convolutions (the state-of-the-art at the time), SMILES2Vec matched on classification (Tox21: 0.81 vs 0.81, HIV: 0.80 vs 0.80) and outperformed on regression (FreeSolv: 1.2 vs 1.3).</li>
<li>SMILES2Vec outperformed Chemception (2D image CNN) on classification tasks but slightly underperformed on regression, which the authors attributed to SMILES lacking explicit atomic number information.</li>
</ul>
<h3 id="interpretability-evaluation">Interpretability Evaluation</h3>
<p>On the ESOL solubility dataset, the explanation mask was evaluated against first-principles chemical knowledge. The authors separated compounds into soluble (log solubility &gt; 1.0) and insoluble (log solubility &lt; -5.0) categories and defined a ground truth: soluble compounds should attend to hydrophilic atoms (O, N), while insoluble compounds should attend to hydrophobic atoms (C, F, Cl, Br, I). The top-3 character accuracy was 88%, confirming that SMILES2Vec learned representations consistent with known functional group chemistry.</p>
<p>Qualitative analysis of the masks showed that for low-solubility molecules, characters corresponding to hydrophobic groups (c, C, Cl) received high attention, while high-solubility molecules showed attention focused on hydrophilic groups (O, N).</p>
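<p>One plausible reading of the top-3 metric is sketched below; the character sets follow the paper&rsquo;s ground-truth definition, while the mask scores and the exact hit criterion are illustrative assumptions:</p>

```python
HYDROPHILIC = {"O", "N", "o", "n"}
HYDROPHOBIC = {"C", "c", "F", "Cl", "Br", "I"}

def top3_hit(tokens, mask_scores, soluble):
    """Check whether any of the 3 highest-scoring characters belongs to the
    expected atom set (hydrophilic for soluble molecules, hydrophobic otherwise)."""
    expected = HYDROPHILIC if soluble else HYDROPHOBIC
    ranked = sorted(range(len(tokens)), key=lambda i: mask_scores[i], reverse=True)
    return any(tokens[i] in expected for i in ranked[:3])

# Ethanol (CCO) with the mask attending most to the oxygen: a hit.
assert top3_hit(["C", "C", "O"], [0.1, 0.2, 0.9], soluble=True)
```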
<h3 id="limitations">Limitations</h3>
<ul>
<li>The interpretability evaluation was limited to solubility, a well-understood property with simple first-principles rules. The authors acknowledged that quantifying interpretability for complex properties (toxicity, activity) where no simple ground truth exists is nontrivial.</li>
<li>The Bayesian optimization used only a subset of datasets, so the architecture may not be globally optimal across all chemical tasks.</li>
<li>SMILES strings lack explicit atomic number information, which may limit performance on physical property prediction compared to image or graph representations.</li>
<li>The explanation mask approach requires training a separate 20-layer network per property, adding computational overhead.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Architecture optimization</td>
          <td>Tox21 (nr-ahr task)</td>
          <td>8,014</td>
          <td>Single toxicity task for Bayesian optimization</td>
      </tr>
      <tr>
          <td>Architecture optimization</td>
          <td>FreeSolv</td>
          <td>643</td>
          <td>Solvation free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21 (full, 12 tasks)</td>
          <td>8,014</td>
          <td>Multi-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,193</td>
          <td>Single-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Solubility regression, also used for interpretability</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet. The ESOL dataset is from Delaney (2004).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Bayesian optimization via SigOpt (60 trials per architectural class, 4 classes, 6 manually seeded initial designs per class)</li>
<li>RMSprop optimizer with standard settings</li>
<li>Explanation mask trained with Adam, learning rate annealed from $10^{-2}$ to $10^{-6}$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Final architecture: Embedding(50) -&gt; Conv1D(192, kernel=3, stride=1) -&gt; BiGRU(224) -&gt; BiGRU(384)</li>
<li>Explanation network: 20-layer residual network with SELU activations</li>
<li>No pre-trained weights or code were released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC</td>
          <td>Tox21</td>
          <td>0.81</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>HIV</td>
          <td>0.80</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>FreeSolv</td>
          <td>1.2 kcal/mol</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>ESOL</td>
          <td>0.63</td>
          <td>Base model</td>
      </tr>
      <tr>
          <td>Top-3 accuracy</td>
          <td>ESOL interpretability</td>
          <td>88%</td>
          <td>Explanation mask</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The authors report using TensorFlow with GPU acceleration via NVIDIA cuDNN libraries. Specific GPU models and training times were not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No code, models, or data artifacts were released by the authors. The datasets used are publicly available through MoleculeNet.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Goh, G. B., Hodas, N. O., Siegel, C., &amp; Vishnu, A. (2017). SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties. <em>arXiv preprint arXiv:1712.02034</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{goh2017smiles2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Goh, Garrett B. and Hodas, Nathan O. and Siegel, Charles and Vishnu, Abhinav}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.02034}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1712.02034}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES-BERT: BERT-Style Pre-Training for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-bert/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-bert/</guid><description>SMILES-BERT applies BERT-style masked pre-training to SMILES strings for molecular property prediction, using Transformer encoders fine-tuned on labeled data.</description><content:encoded><![CDATA[<h2 id="pre-training-transformers-on-smiles-for-molecular-properties">Pre-Training Transformers on SMILES for Molecular Properties</h2>
<p>SMILES-BERT is a <strong>Method</strong> paper that introduces a BERT-inspired pre-training and fine-tuning framework for molecular property prediction. The primary contribution is adapting the masked language model paradigm from NLP to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a>, enabling a Transformer encoder to learn molecular representations from large-scale unlabeled data before fine-tuning on smaller labeled datasets.</p>
<h2 id="limited-labels-in-molecular-property-prediction">Limited Labels in Molecular Property Prediction</h2>
<p>Molecular property prediction is central to drug discovery and chemical design, but obtaining labeled data requires expensive biological assays. Deep learning methods for this task fall into three categories: manually designed fingerprints (e.g., ECFP), graph-based methods (GCNs operating on molecular graphs), and sequence-based methods (RNNs or CNNs operating on SMILES strings).</p>
<p>Prior unsupervised approaches like <a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Seq2seq Fingerprint</a> used an encoder-decoder architecture to learn representations from unlabeled SMILES, but the decoder acts as scaffolding that consumes GPU memory during pre-training without contributing to downstream prediction. The semi-supervised Seq3seq Fingerprint improved on this by incorporating labeled data, but retained the encoder-decoder inefficiency. RNN-based methods are also difficult to train in parallel and require careful tuning (gradient clipping, early stopping) to converge.</p>
<p>The authors identify two motivations: (1) building a semi-supervised model that effectively leverages large pools of unlabeled SMILES to improve prediction with limited labels, and (2) designing an architecture where the entire pre-trained model participates in fine-tuning (no wasted decoder parameters) and naturally supports parallel training.</p>
<h2 id="masked-smiles-recovery-with-transformer-encoders">Masked SMILES Recovery with Transformer Encoders</h2>
<p>The core innovation is the Masked SMILES Recovery pre-training task, directly analogous to BERT&rsquo;s masked language modeling. The model architecture is a stack of Transformer encoder layers, which contains no recurrence and is therefore fully parallelizable across sequence positions during training.</p>
<h3 id="architecture">Architecture</h3>
<p>SMILES-BERT uses 6 Transformer encoder layers, each with 4-head multi-head self-attention and feed-forward dimension of 1024. Each Transformer layer contains three components: a pre-attention feed-forward network, a self-attention layer, and a post-attention feed-forward network, all followed by layer normalization with residual connections.</p>
<p>The self-attention mechanism uses scaled dot-product attention:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^{Q})(XW^{K})^{T}}{\sqrt{d_{k}}}\right) XW^{V}
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^{Q}$, $W^{K}$, $W^{V} \in \mathbb{R}^{M \times d_{k}}$ are the query, key, and value weight matrices, and $\sqrt{d_{k}}$ is the scaling factor.</p>
<p>Input SMILES are tokenized at the character level with token embeddings and positional embeddings. A special <code>&lt;GO&gt;</code> token is prepended to each SMILES, and its output representation is used for downstream classification/regression after fine-tuning.</p>
<h3 id="pre-training-masked-smiles-recovery">Pre-training: Masked SMILES Recovery</h3>
<p>Following BERT&rsquo;s masking strategy, 15% of tokens in each SMILES are selected for masking (minimum one per SMILES). Of the selected tokens:</p>
<ul>
<li>85% are replaced with a <code>&lt;MASK&gt;</code> token</li>
<li>10% are replaced with a random token from the vocabulary</li>
<li>5% are kept unchanged</li>
</ul>
<p>The model is trained to recover the original tokens at masked positions. The loss is computed only on the masked token outputs.</p>
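<p>A minimal sketch of this corruption step (the vocabulary and helper name are illustrative, not from the paper):</p>

```python
import random

VOCAB = ["C", "c", "O", "N", "Cl", "(", ")", "=", "1", "2"]

def mask_smiles_tokens(tokens, rng, mask_rate=0.15):
    """Corrupt a tokenized SMILES for Masked SMILES Recovery.

    Selects ~15% of positions (at least one); of those, 85% become <MASK>,
    10% become a random vocabulary token, and 5% are left unchanged.
    Returns the corrupted sequence and the positions to predict.
    """
    n_select = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_select)
    corrupted = list(tokens)
    for pos in positions:
        r = rng.random()
        if r < 0.85:
            corrupted[pos] = "<MASK>"
        elif r < 0.95:
            corrupted[pos] = rng.choice(VOCAB)
        # else: keep the original token (5% of selections)
    return corrupted, sorted(positions)

rng = random.Random(0)
corrupted, positions = mask_smiles_tokens(list("CCOC(=O)C"), rng)
```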
<h3 id="fine-tuning">Fine-tuning</h3>
<p>After pre-training, a classifier or regressor head is added to the <code>&lt;GO&gt;</code> token output. The entire model (all Transformer layers plus the new head) is fine-tuned on the labeled dataset.</p>
<p>Key differences from the original BERT:</p>
<ol>
<li>Only the Masked SMILES Recovery task is used (BERT&rsquo;s next sentence prediction is dropped since SMILES have no consecutive-sentence structure)</li>
<li>Segment embeddings are removed</li>
<li>The architecture is smaller (6 layers, 4 heads, 1024 FFN dim) since SMILES have a much smaller vocabulary and shorter sequences than natural language</li>
</ol>
<p>The authors compared this configuration against a larger BERT-base setup (12 layers, 12 heads, 3072 FFN dim) and found no meaningful performance difference, confirming that the smaller model is sufficient for SMILES.</p>
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>SMILES-BERT was pre-trained on the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a> with 18,671,355 training SMILES, 10,000 for validation, and 10,000 for evaluation. Pre-training ran for 10 epochs using the Adam optimizer with a warm-up strategy (learning rate from $10^{-9}$ to $10^{-4}$ over 4,000 steps, then inverse-square-root decay). Batch size was 256 and dropout was 0.1. The pre-training masked SMILES exact recovery rate reached 82.85% on the validation set.</p>
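<p>A sketch of the learning-rate schedule; the paper states only the endpoints and the decay family, so the linear warm-up ramp and exact decay constant are assumptions:</p>

```python
def smiles_bert_lr(step, peak=1e-4, floor=1e-9, warmup=4000):
    """Learning rate: linear warm-up from `floor` to `peak` over `warmup`
    steps, then inverse-square-root decay (Transformer-style).
    """
    if step <= warmup:
        return floor + (peak - floor) * step / warmup
    return peak * (warmup / step) ** 0.5

lr_start = smiles_bert_lr(0)      # floor of the ramp
lr_peak = smiles_bert_lr(4000)    # end of warm-up
lr_later = smiles_bert_lr(16000)  # decayed by sqrt(4000/16000) = 1/2
```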
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a></td>
          <td>NCATS/NIH</td>
          <td>10,850</td>
          <td>Classification (threshold 1.88)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PM2</td>
          <td>NCATS/NIH</td>
          <td>323,242</td>
          <td>Classification (threshold 0.024896)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PCBA-686978</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></td>
          <td>302,175</td>
          <td>Classification</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p>All datasets were split 80/10/10 for train/validation/test. Fine-tuning used Adam with a fixed learning rate for 50 epochs, selecting the best model on validation data.</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>Circular Fingerprint (CircularFP)</strong>: Manually designed hash-based fingerprint (ECFP family)</li>
<li><strong>Neural Fingerprint (NeuralFP)</strong>: Graph-based neural network replacing hash functions with learned layers</li>
<li><strong>Seq2seq Fingerprint (Seq2seqFP)</strong>: Unsupervised encoder-decoder model on SMILES</li>
<li><strong>Seq3seq Fingerprint (Seq3seqFP)</strong>: Semi-supervised encoder-decoder model on SMILES</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP</th>
          <th>PM2</th>
          <th>PCBA-686978</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CircularFP</td>
          <td>~0.90</td>
          <td>0.6858</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>NeuralFP</td>
          <td>~0.90</td>
          <td>0.6802</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>Seq2seqFP</td>
          <td>~0.87</td>
          <td>0.6112</td>
          <td>~0.80</td>
      </tr>
      <tr>
          <td>Seq3seqFP</td>
          <td>~0.90</td>
          <td>0.7038</td>
          <td>~0.84</td>
      </tr>
      <tr>
          <td><strong>SMILES-BERT</strong></td>
          <td><strong>0.9154</strong></td>
          <td><strong>0.7589</strong></td>
          <td><strong>0.8784</strong></td>
      </tr>
  </tbody>
</table>
<p>SMILES-BERT outperformed all baselines on all three datasets. The improvement over Seq3seqFP was approximately 2% on LogP, 5.5% on PM2, and 3.8% on PCBA-686978. The results on PM2 (the largest labeled dataset) show that pre-training benefits persist even with substantial labeled data.</p>
<h3 id="structure-study">Structure Study</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>FFN Dim</th>
          <th>LogP Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES-BERT</td>
          <td>6</td>
          <td>4</td>
          <td>1024</td>
          <td>0.9154</td>
      </tr>
      <tr>
          <td>SMILES-BERT (large)</td>
          <td>12</td>
          <td>12</td>
          <td>3072</td>
          <td>0.9147</td>
      </tr>
  </tbody>
</table>
<p>The larger configuration provided no improvement, supporting the choice of the smaller, more efficient architecture.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>SMILES-BERT demonstrated that BERT-style masked pre-training on SMILES strings produces transferable molecular representations that improve property prediction across datasets of varying sizes and property types.</p>
<p>Key findings:</p>
<ul>
<li>The Masked SMILES Recovery pre-training task transfers effectively to molecular property prediction</li>
<li>The full model participates in fine-tuning (no wasted decoder), making SMILES-BERT more parameter-efficient than encoder-decoder alternatives</li>
<li>A smaller Transformer configuration (6 layers, 4 heads) matches the performance of a BERT-base-sized model for SMILES data</li>
<li>Pre-training on ~18.7M SMILES from ZINC provides robust initialization across different downstream tasks</li>
</ul>
<p><strong>Limitations</strong>: The evaluation uses only classification accuracy as the metric, without reporting AUC-ROC, F1, or other metrics common in molecular property prediction. The comparison is limited to four baselines, and two of the three evaluation datasets (LogP, PM2) are non-public NIH datasets. The paper does not explore different pre-training dataset sizes or ablate the masking strategy. Only classification tasks are evaluated, though the architecture supports regression.</p>
<p><strong>Future work</strong>: The authors propose incorporating Quantitative Estimate of Druglikeness (QED) prediction as an additional pre-training task to warm up the model&rsquo;s classification capability, analogous to BERT&rsquo;s next sentence prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC</td>
          <td>18,671,355 SMILES</td>
          <td>Publicly available database</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LogP</td>
          <td>10,850</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PM2</td>
          <td>323,242</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PCBA-686978</td>
          <td>302,175</td>
          <td>Public, from PubChem BioAssay</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Adam optimizer, warm-up for 4,000 steps ($10^{-9}$ to $10^{-4}$), inverse-square-root LR schedule, batch size 256, dropout 0.1, 10 epochs</li>
<li>Fine-tuning: Adam optimizer, fixed LR (insensitive to choice among $10^{-5}$, $10^{-6}$, $10^{-7}$), 50 epochs, best model on validation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>6 Transformer encoder layers, 4-head multi-head attention, FFN dim 1024</li>
<li>Token embedding + positional embedding, <code>&lt;GO&gt;</code> special token</li>
<li>Implemented with FairSeq (Facebook AI Research Sequence-to-Sequence Toolkit)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILES-BERT</th>
          <th>Best Baseline (Seq3seqFP)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP Accuracy</td>
          <td>0.9154</td>
          <td>~0.90</td>
          <td>~2% improvement</td>
      </tr>
      <tr>
          <td>PM2 Accuracy</td>
          <td>0.7589</td>
          <td>0.7038</td>
          <td>~5.5% improvement</td>
      </tr>
      <tr>
          <td>PCBA Accuracy</td>
          <td>0.8784</td>
          <td>~0.84</td>
          <td>~3.8% improvement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions GPU training and NVIDIA GPU donation in acknowledgments but does not specify the exact GPU model or training time beyond noting that pre-training on a single GPU takes over a week for 10 epochs.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No public code or model release identified</td>
          <td>-</td>
          <td>-</td>
          <td>Paper does not provide a GitHub link or model checkpoint</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The ZINC pre-training data is public and the architecture is described in detail, but no code or pre-trained weights are released. Two of three evaluation datasets (LogP, PM2) are non-public.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, S., Guo, Y., Wang, Y., Sun, H., &amp; Huang, J. (2019). SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In <em>Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB &lsquo;19)</em>, 429-436. <a href="https://doi.org/10.1145/3307339.3342186">https://doi.org/10.1145/3307339.3342186</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2019smilesbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Sheng and Guo, Yuzhi and Wang, Yuhong and Sun, Hongmao and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{429--436}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3307339.3342186}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES vs SELFIES Tokenization for Chemical LMs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</guid><description>Atom Pair Encoding (APE) tokenizer outperforms BPE on SMILES and SELFIES in RoBERTa-based chemical language models across MoleculeNet classification tasks.</description><content:encoded><![CDATA[<h2 id="atom-pair-encoding-for-chemical-language-modeling">Atom Pair Encoding for Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom Pair Encoding (APE), a tokenization algorithm designed specifically for chemical string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The primary contribution is demonstrating that a chemistry-aware tokenizer, which preserves atomic identity during subword merging, leads to improved molecular property classification accuracy in transformer-based models compared to the standard Byte Pair Encoding (BPE) approach.</p>
<h2 id="why-tokenization-matters-for-chemical-strings">Why Tokenization Matters for Chemical Strings</h2>
<p>Existing chemical language models based on BERT/RoBERTa architectures have typically relied on BPE for tokenizing SMILES and SELFIES strings. <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte Pair Encoding (BPE)</a> was originally designed for natural language and data compression, where it excels at breaking words into meaningful subword units. When applied to chemical strings, BPE operates at the character level without understanding chemical semantics, leading to several problems:</p>
<ul>
<li><strong>Stray characters</strong>: BPE may create tokens like &ldquo;C)(&rdquo; that have no chemical meaning.</li>
<li><strong>Element splitting</strong>: Multi-character elements like chlorine (&ldquo;Cl&rdquo;) can be split into &ldquo;C&rdquo; and &ldquo;l&rdquo;, so the model misreads chlorine as a carbon atom followed by a dangling character.</li>
<li><strong>Lost structural context</strong>: BPE compresses sequences without considering how character position encodes molecular structure.</li>
</ul>
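<p>The element-splitting problem is avoided by matching two-letter elements before single letters, as in this sketch of an atom-level SMILES tokenizer (the regex covers only the common organic-subset symbols; a full grammar also needs bracket atoms, stereo markers, and more):</p>

```python
import re

# Two-letter halogens first, so "Cl"/"Br" are never split into "C"+"l" / "B"+"r".
SMILES_ATOM_RE = re.compile(r"Cl|Br|[BCNOSPFI]|[bcnops]|[\(\)=#\-\+\[\]@/\\%]|\d")

def atom_tokenize(smiles):
    """Split a SMILES string into atom-level units (APE's starting point)."""
    tokens = SMILES_ATOM_RE.findall(smiles)
    assert "".join(tokens) == smiles, "untokenized characters remain"
    return tokens

# Chlorobenzene: naive character splitting would yield a spurious 'C' + 'l'.
assert atom_tokenize("c1ccccc1Cl") == ["c", "1", "c", "c", "c", "c", "c", "1", "Cl"]
```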
<p>Previous work on <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a> attempted to address this by iteratively merging SMILES substrings into chemically meaningful tokens. However, SPE had practical limitations: its Python implementation did not support SELFIES, and it produced a smaller vocabulary (~3000 tokens) than what the data could support. These gaps motivated the development of APE.</p>
<h2 id="the-ape-tokenizer-chemistry-aware-subword-merging">The APE Tokenizer: Chemistry-Aware Subword Merging</h2>
<p>APE draws inspiration from both BPE and SPE but addresses their shortcomings. The key design decisions are:</p>
<ol>
<li>
<p><strong>Atom-level initialization</strong>: Instead of starting from individual characters (as BPE does), APE begins with chemically valid atomic units. For SMILES, this means recognizing multi-character elements (e.g., &ldquo;Cl&rdquo;, &ldquo;Br&rdquo;) as single tokens. For SELFIES, each bracketed string (e.g., [C], [Ring1], [=O]) serves as the fundamental unit.</p>
</li>
<li>
<p><strong>Iterative pair merging</strong>: Like BPE, APE iteratively merges the most frequent adjacent token pairs. The difference is that the initial tokenization preserves atomic boundaries, so merged tokens always represent valid chemical substructures.</p>
</li>
<li>
<p><strong>Larger vocabulary</strong>: Using the same minimum frequency threshold of 2000, APE generates approximately 5300 unique tokens from the PubChem dataset, compared to SPE&rsquo;s approximately 3000. This richer vocabulary provides more expressive power for representing chemical substructures.</p>
</li>
<li>
<p><strong>SELFIES compatibility</strong>: APE natively supports both SMILES and SELFIES, using the bracketed token structure of SELFIES as its starting point for that representation.</p>
</li>
</ol>
<p>The tokenizer was trained on a subset of 2 million molecules from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> (10 million SMILES total). This produced four tokenizer variants: SMILES-BPE, SMILES-APE, SELFIES-BPE, and SELFIES-APE.</p>
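<p>The iterative merge loop shared by BPE, SPE, and APE can be sketched as follows; APE&rsquo;s distinction is that the corpus starts from atom-level tokens rather than raw characters (toy corpus, illustrative helper names, and a tiny frequency threshold in place of the paper&rsquo;s 2000):</p>

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent token pairs across all tokenized molecules."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def ape_merges(corpus, n_merges, min_freq=2):
    """Greedily add the most frequent adjacent pairs to the vocabulary."""
    vocab = []
    for _ in range(n_merges):
        best = most_frequent_pair(corpus)
        if best is None or best[1] < min_freq:
            break
        vocab.append(best[0][0] + best[0][1])
        corpus = [merge_pair(t, best[0]) for t in corpus]
    return vocab, corpus

# Atom-level start: the ("C", "C") pair dominates this toy corpus and is merged;
# no remaining pair clears the threshold, so merging stops after one step.
corpus = [["C", "C", "O"], ["C", "C", "Cl"], ["C", "C", "C", "O"]]
vocab, merged = ape_merges(corpus, n_merges=2)
```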
<h2 id="pre-training-and-evaluation-on-moleculenet-benchmarks">Pre-training and Evaluation on MoleculeNet Benchmarks</h2>
<h3 id="model-architecture">Model architecture</h3>
<p>All four models use the RoBERTa architecture with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads. Pre-training used masked language modeling (MLM) with 15% token masking on 1 million molecules from PubChem, with a validation set of 100,000 molecules. Each model was pre-trained for 20 epochs using AdamW, with hyperparameter optimization via Optuna.</p>
<h3 id="downstream-tasks">Downstream tasks</h3>
<p>The models were fine-tuned on three <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Category</th>
          <th>Compounds</th>
          <th>Tasks</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Biophysics</td>
          <td>41,127</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Physiology</td>
          <td>7,831</td>
          <td>12</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p>Data was split 80/10/10 (train/validation/test) following MoleculeNet recommendations. Models were fine-tuned for 5 epochs with early stopping based on validation ROC-AUC.</p>
<h3 id="baselines">Baselines</h3>
<p>Results were compared against two text-based models (<a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a> MTR-77M and <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a>) and two graph-based models (D-MPNN from Chemprop and MoleculeNet Graph-Conv).</p>
<h3 id="main-results">Main results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>HIV ROC</th>
          <th>Tox21 ROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILYAPE-1M</td>
          <td>0.754 +/- 0.006</td>
          <td>0.772 +/- 0.010</td>
          <td>0.838 +/- 0.002</td>
      </tr>
      <tr>
          <td>SMILYBPE-1M</td>
          <td>0.746 +/- 0.006</td>
          <td>0.754 +/- 0.015</td>
          <td>0.849 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYAPE-1M</td>
          <td>0.735 +/- 0.015</td>
          <td>0.768 +/- 0.012</td>
          <td>0.842 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYBPE-1M</td>
          <td>0.676 +/- 0.014</td>
          <td>0.709 +/- 0.012</td>
          <td>0.825 +/- 0.001</td>
      </tr>
      <tr>
          <td>ChemBERTa-2-MTR-77M</td>
          <td>0.698 +/- 0.014</td>
          <td>0.735 +/- 0.008</td>
          <td>0.790 +/- 0.003</td>
      </tr>
      <tr>
          <td>SELFormer</td>
          <td>0.716 +/- 0.021</td>
          <td>0.769 +/- 0.010</td>
          <td>0.838 +/- 0.005</td>
      </tr>
      <tr>
          <td>MoleculeNet-Graph-Conv</td>
          <td>0.690</td>
          <td>0.763</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.737</td>
          <td>0.776</td>
          <td>0.851</td>
      </tr>
  </tbody>
</table>
<p>APE outperforms BPE for both SMILES and SELFIES on BBBP and HIV. SMILYAPE achieves the best BBBP score (0.754), beating D-MPNN (0.737). On HIV, SMILYAPE (0.772) is competitive with D-MPNN (0.776). On Tox21, D-MPNN (0.851) leads, with SMILYBPE (0.849) and SELFYAPE (0.842) close behind; notably, SMILYBPE edges out SMILYAPE here.</p>
<h3 id="statistical-significance">Statistical significance</h3>
<p><a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U tests</a> confirmed statistically significant differences between SMILYAPE and SMILYBPE (p &lt; 0.05 on all datasets). Cliff&rsquo;s delta values indicate large effect sizes: 0.74 (BBBP), 0.70 (HIV), and -1.00 (Tox21, favoring BPE). For SELFIES models, SELFYAPE achieved Cliff&rsquo;s delta of 1.00 across all three datasets, indicating complete separation from SELFYBPE.</p>
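<p>Cliff&rsquo;s delta is the normalized difference between the number of pairwise wins and losses across the two samples, ranging from &minus;1 to 1, with |&delta;| = 1 meaning complete separation. A direct sketch (not the authors&rsquo; code):</p>

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#{x > y} - #{x < y}) / (len(xs) * len(ys)).

    +1.0 means every x exceeds every y (complete separation); -1.0 the
    reverse; 0.0 means wins and losses balance out.
    """
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```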
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="ape-outperforms-bpe-by-preserving-atomic-identity">APE outperforms BPE by preserving atomic identity</h3>
<p>APE&rsquo;s general advantage over BPE stems from its atom-level initialization. By starting with chemically valid units rather than individual characters, APE avoids creating nonsensical tokens that split chemical elements or mix structural delimiters with atoms.</p>
<h3 id="smiles-outperforms-selfies-with-ape-tokenization">SMILES outperforms SELFIES with APE tokenization</h3>
<p>SMILYAPE generally outperforms SELFYAPE across tasks. Attention weight analysis revealed that SMILYAPE assigns more weight to immediate neighboring tokens (0.108 vs. 0.096) and less to distant tokens (0.030 vs. 0.043). This pattern aligns with chemical intuition: bonding is primarily determined by directly connected atoms. SMILYAPE also produces more compact tokenizations (8.6 tokens per molecule vs. 11.9 for SELFYAPE), potentially allowing more efficient attention allocation.</p>
<h3 id="selfies-models-show-higher-inter-tokenizer-agreement">SELFIES models show higher inter-tokenizer agreement</h3>
<p>On the BBBP dataset, all true positives identified by SELFYBPE were also captured by SELFYAPE, with SELFYAPE achieving higher recall (61.68% vs. 55.14%). In contrast, SMILES-based models shared only 29.3% of true positives between APE and BPE variants, indicating that tokenization choice has a larger impact on SMILES models.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Pre-training used only 1 million molecules, compared to 77 million for ChemBERTa-2. Despite this, APE models were competitive or superior, but scaling effects remain unexplored.</li>
<li>Evaluation was limited to three binary classification tasks from MoleculeNet. Regression tasks, molecular generation, and reaction prediction were not tested.</li>
<li>The Tox21 result is notable: SMILYBPE outperforms SMILYAPE (0.849 vs. 0.838), suggesting APE&rsquo;s advantage may be task-dependent.</li>
<li>No comparison with recent atom-level tokenizers like <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES</a> or newer approaches beyond SPE.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tokenizer training</td>
          <td>PubChem subset</td>
          <td>2M molecules</td>
          <td>SMILES strings converted to SELFIES via selfies library</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>1M molecules</td>
          <td>100K validation set</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>7,831 compounds</td>
          <td>80/10/10 split, 12 tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Tokenizers: BPE (via Hugging Face), APE (custom implementation, minimum frequency 2000)</li>
<li>Pre-training: Masked Language Modeling (15% masking) for 20 epochs</li>
<li>Optimizer: AdamW with Optuna hyperparameter search</li>
<li>Fine-tuning: 5 epochs with early stopping on validation ROC-AUC</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Architecture: RoBERTa with 6 layers, hidden size 768, intermediate size 1536, 12 attention heads</li>
<li>Four variants: SMILYAPE, SMILYBPE, SELFYAPE, SELFYBPE</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILYAPE</th>
          <th>SMILYBPE</th>
          <th>SELFYAPE</th>
          <th>SELFYBPE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP ROC-AUC</td>
          <td>0.754</td>
          <td>0.746</td>
          <td>0.735</td>
          <td>0.676</td>
      </tr>
      <tr>
          <td>HIV ROC-AUC</td>
          <td>0.772</td>
          <td>0.754</td>
          <td>0.768</td>
          <td>0.709</td>
      </tr>
      <tr>
          <td>Tox21 ROC-AUC</td>
          <td>0.838</td>
          <td>0.849</td>
          <td>0.842</td>
          <td>0.825</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA RTX 3060 GPU with 12 GiB VRAM</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mikemayuare/apetokenizer">APE Tokenizer</a></td>
          <td>Code</td>
          <td>Other (unspecified SPDX)</td>
          <td>Official APE tokenizer implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/mikemayuare/PubChem10M_SMILES_SELFIES">PubChem10M SMILES/SELFIES</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>10M SMILES with SELFIES conversions</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/mikemayuare">Pre-trained and fine-tuned models</a></td>
          <td>Model</td>
          <td>Not specified</td>
          <td>All four model variants on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leon, M., Perezhohin, Y., Peres, F., Popovič, A., &amp; Castelli, M. (2024). Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. <em>Scientific Reports</em>, 14(1), 25016. <a href="https://doi.org/10.1038/s41598-024-76440-8">https://doi.org/10.1038/s41598-024-76440-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leon2024comparing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leon, Miguelangel and Perezhohin, Yuriy and Peres, Fernando and Popovi{\v{c}}, Ale{\v{s}} and Castelli, Mauro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{25016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-024-76440-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES Transformer: Low-Data Molecular Fingerprints</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-transformer/</guid><description>SMILES Transformer uses unsupervised Transformer pre-training on SMILES strings to produce molecular fingerprints that excel in low-data drug discovery tasks.</description><content:encoded><![CDATA[<h2 id="a-transformer-approach-to-learned-molecular-fingerprints">A Transformer Approach to Learned Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Transformer (ST), a Transformer-based sequence-to-sequence model pre-trained on unlabeled SMILES strings to produce continuous, data-driven molecular fingerprints. The primary contribution is demonstrating that unsupervised pre-training on chemical text representations yields fingerprints that generalize well under low-data conditions, outperforming both rule-based fingerprints (ECFP) and graph convolution models on several <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. A secondary contribution is the Data Efficiency Metric (DEM), a scalar metric for evaluating model performance across varying training set sizes.</p>
<h2 id="the-low-data-problem-in-molecular-property-prediction">The Low-Data Problem in Molecular Property Prediction</h2>
<p>Machine learning for drug discovery depends on molecular representations, but labeled datasets of experimentally validated properties are typically small. Conventional approaches fall into two camps: rule-based fingerprints like ECFP that hash substructures into sparse binary vectors, and graph-based methods like GraphConv that learn representations end-to-end. Rule-based fingerprints perform poorly with shallow models or limited data, while graph-based methods are designed for large fully-labeled settings.</p>
<p>Pre-training on unlabeled data had shown strong results in NLP (ELMo, BERT, XLNet), and prior work in cheminformatics had explored RNN-based and VAE-based pre-training on SMILES (<a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Seq2Seq fingerprints</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>, heteroencoders). However, none of these studies systematically evaluated performance in small-data settings. Honda et al. fill this gap by applying Transformer-based pre-training to SMILES and measuring data efficiency explicitly.</p>
<h2 id="transformer-pre-training-on-smiles-with-pooled-fingerprint-extraction">Transformer Pre-training on SMILES with Pooled Fingerprint Extraction</h2>
<p>The core innovation is a Transformer encoder-decoder architecture pre-trained as an autoencoder on SMILES strings, with a specific fingerprint extraction strategy that pools the encoder outputs into a fixed-length vector.</p>
<h3 id="architecture">Architecture</h3>
<p>The model uses 4 Transformer blocks for both the encoder and decoder, each with 4-head attention, 256 embedding dimensions, and a two-layer feed-forward sublayer. Input SMILES are tokenized at the symbol level (e.g., &lsquo;c&rsquo;, &lsquo;Br&rsquo;, &lsquo;=&rsquo;, &lsquo;(&rsquo;, &lsquo;2&rsquo;) and one-hot encoded. Following Vaswani et al. (2017), the input is the sum of the token encoding and a positional encoding.</p>
<h3 id="pre-training">Pre-training</h3>
<p>The model is pre-trained on 861,000 unlabeled SMILES sampled from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL24</a> to minimize cross-entropy between input and output SMILES (i.e., reconstruction). <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> (Bjerrum, 2017) randomly generates non-canonical SMILES at each epoch to reduce representation bias. Training runs for 5 epochs with Adam optimization, reaching a perplexity of 1.0 (perfect decoding).</p>
<h3 id="fingerprint-extraction">Fingerprint Extraction</h3>
<p>Since the Transformer outputs symbol-level (atom-level) representations, a pooling strategy produces molecule-level fingerprints. Four vectors are concatenated:</p>
<ol>
<li>Mean-pooled output of the last encoder layer</li>
<li>Max-pooled output of the last encoder layer</li>
<li>First output token of the last encoder layer</li>
<li>First output token of the penultimate encoder layer</li>
</ol>
<p>This produces a 1024-dimensional fingerprint, matching the dimensionality of ECFP for fair comparison.</p>
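<p>As an illustrative, pure-Python stand-in for the tensor operations, the concatenation of the four pooled vectors looks like:</p>

```python
def st_fingerprint(last_layer, penult_layer):
    """Concatenate four pooled views of per-token encoder outputs.

    last_layer / penult_layer: lists of per-token vectors (T x d) from the
    final and penultimate encoder layers. Output has dimension 4 * d
    (1024 when d = 256, matching ECFP).
    """
    d = len(last_layer[0])
    mean_pool = [sum(tok[i] for tok in last_layer) / len(last_layer) for i in range(d)]
    max_pool = [max(tok[i] for tok in last_layer) for i in range(d)]
    return mean_pool + max_pool + list(last_layer[0]) + list(penult_layer[0])
```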
<h3 id="data-efficiency-metric">Data Efficiency Metric</h3>
<p>The paper proposes DEM to measure how well a model performs across different training set sizes:</p>
<p>$$
M_{DE}(f, m) = \frac{1}{|I|} \sum_{i \in I} m(f_i, X_i, Y_i)
$$</p>
<p>where $f_i$ is the model trained on the fraction $i$ of training data, $m$ is the task metric, and $I = {0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8}$ doubles the training percentage at each step. This captures average performance across a range of data availability, giving a single scalar that balances accuracy and data efficiency.</p>
<h2 id="benchmarking-across-moleculenet-with-data-efficiency-focus">Benchmarking Across MoleculeNet with Data Efficiency Focus</h2>
<h3 id="datasets">Datasets</h3>
<p>The evaluation uses 10 datasets from MoleculeNet spanning three categories:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Molecules</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>Regression</td>
          <td>1,128</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>Regression</td>
          <td>643</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>1</td>
          <td>Regression</td>
          <td>4,200</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>MUV</td>
          <td>17</td>
          <td>Classification</td>
          <td>93,127</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>HIV</td>
          <td>1</td>
          <td>Classification</td>
          <td>41,913</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>BACE</td>
          <td>1</td>
          <td>Classification</td>
          <td>1,522</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>Classification</td>
          <td>2,053</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>Tox21</td>
          <td>12</td>
          <td>Classification</td>
          <td>8,014</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>SIDER</td>
          <td>27</td>
          <td>Classification</td>
          <td>1,427</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>ClinTox</td>
          <td>2</td>
          <td>Classification</td>
          <td>1,491</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>ECFP4</strong>: Rule-based extended-connectivity fingerprint with 1024 dimensions</li>
<li><strong>RNNS2S</strong>: RNN-based Seq2Seq pre-trained fingerprint (3-layer bidirectional GRU, same pre-training data as ST)</li>
<li><strong>GraphConv</strong>: Graph convolution network trained end-to-end on labeled data</li>
</ul>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>All fingerprint methods use a simple MLP classifier/regressor from scikit-learn with default hyperparameters to isolate the fingerprint quality from model capacity. Datasets are randomly split (stratified for classification), and results are averaged over 20 trials. Note that random splits are used rather than scaffold splits for the DEM experiments.</p>
<h3 id="data-efficiency-results-dem">Data Efficiency Results (DEM)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ST+MLP</th>
          <th>ECFP+MLP</th>
          <th>RNNS2S+MLP</th>
          <th>GraphConv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL (RMSE, lower is better)</td>
          <td><strong>1.144</strong></td>
          <td>1.741</td>
          <td>1.317</td>
          <td>1.673</td>
      </tr>
      <tr>
          <td>FreeSolv (RMSE, lower is better)</td>
          <td><strong>2.246</strong></td>
          <td>3.043</td>
          <td>2.987</td>
          <td>3.476</td>
      </tr>
      <tr>
          <td>Lipophilicity (RMSE, lower is better)</td>
          <td>1.169</td>
          <td><strong>1.090</strong></td>
          <td>1.219</td>
          <td><strong>1.062</strong></td>
      </tr>
      <tr>
          <td>MUV (PRC-AUC, higher is better)</td>
          <td>0.009</td>
          <td><strong>0.036</strong></td>
          <td>0.010</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>HIV (ROC-AUC, higher is better)</td>
          <td>0.683</td>
          <td>0.697</td>
          <td>0.682</td>
          <td><strong>0.723</strong></td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC, higher is better)</td>
          <td>0.719</td>
          <td><strong>0.769</strong></td>
          <td>0.717</td>
          <td>0.744</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC, higher is better)</td>
          <td><strong>0.900</strong></td>
          <td>0.760</td>
          <td>0.884</td>
          <td>0.795</td>
      </tr>
      <tr>
          <td>Tox21 (ROC-AUC, higher is better)</td>
          <td><strong>0.706</strong></td>
          <td>0.616</td>
          <td>0.702</td>
          <td>0.687</td>
      </tr>
      <tr>
          <td>SIDER (ROC-AUC, higher is better)</td>
          <td>0.559</td>
          <td><strong>0.588</strong></td>
          <td>0.558</td>
          <td>0.557</td>
      </tr>
      <tr>
          <td>ClinTox (ROC-AUC, higher is better)</td>
          <td><strong>0.963</strong></td>
          <td>0.515</td>
          <td>0.904</td>
          <td>0.936</td>
      </tr>
  </tbody>
</table>
<p>ST achieves the best DEM in 5 of 10 datasets (ESOL, FreeSolv, BBBP, Tox21, ClinTox), with particularly strong margins on ClinTox (+0.027 over GraphConv) and BBBP (+0.016 over RNNS2S).</p>
<h3 id="linear-model-experiments">Linear Model Experiments</h3>
<p>To further isolate fingerprint quality, the authors replace the MLP with linear models: ridge regression for the regression tasks and L2-penalized logistic regression for classification. On 8 datasets (excluding MUV and SIDER due to class imbalance issues), ST achieves the best DEM in 5 of 8, confirming that the fingerprint quality holds regardless of the downstream model.</p>
<h3 id="stratified-analysis-by-molecule-size">Stratified Analysis by Molecule Size</h3>
<p>On BBBP stratified by SMILES length, ST&rsquo;s ROC-AUC increases with longer SMILES, similar to RNNS2S but unlike GraphConv, whose performance is stable across lengths. This suggests that text-based models extract richer information from longer sequences.</p>
<h3 id="comparison-with-record-scores-large-data">Comparison with Record Scores (Large Data)</h3>
<p>Under the large-data setting (80/10/10 train/val/test split with hyperparameter tuning via Optuna), ST achieves first place only in ClinTox (0.954) but performs comparably to ECFP and graph-based models on the other datasets. This confirms that ST&rsquo;s main advantage is in the low-data regime.</p>
<h2 id="strong-low-data-performance-with-caveats-on-scalability">Strong Low-Data Performance with Caveats on Scalability</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Transformer-based unsupervised pre-training on SMILES produces fingerprints that excel in low-data molecular property prediction, achieving best data efficiency on 5 of 10 MoleculeNet tasks.</li>
<li>The advantage is most pronounced on small datasets (ESOL with 1,128 molecules, FreeSolv with 643, BBBP with 2,053, ClinTox with 1,491) where pre-training enables good generalization.</li>
<li>With sufficient labeled data and hyperparameter tuning, ST fingerprints perform comparably to (but do not surpass) graph-based methods.</li>
<li>Longer SMILES provide richer information for text-based models, as shown by the stratified analysis on BBBP.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Random splits are used for most DEM experiments rather than scaffold splits, which may inflate performance estimates for drug discovery applications where training and test molecules are structurally distinct.</li>
<li>The pre-training corpus (861K SMILES from ChEMBL24) is relatively small by modern standards.</li>
<li>MUV performance is poor across all methods (PRC-AUC near zero), suggesting the DEM framework may not be informative for extremely imbalanced or noisy datasets.</li>
<li>No comparison with BERT-style masked language model pre-training, which later work (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>) would show as a viable alternative.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose three directions: (1) replacing the Transformer with Transformer-XL to handle longer SMILES, (2) multi-task pre-training that jointly predicts molecular descriptors (e.g., molecular weight, <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>) alongside SMILES reconstruction, and (3) better exploitation of enumerated SMILES to constrain the latent space.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL24</td>
          <td>861,000 SMILES</td>
          <td>Unlabeled, randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MoleculeNet (10 datasets)</td>
          <td>643 to 93,127 molecules</td>
          <td>See Table 1 for per-dataset details</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder: 4 blocks each, 4-head attention, 256 embedding dimensions</li>
<li>Pre-training: 5 epochs, Adam optimizer, cross-entropy loss, SMILES enumeration for augmentation</li>
<li>Fingerprint: 1024 dimensions from concatenated mean pool, max pool, and first-token outputs</li>
<li>Downstream: scikit-learn MLP (default hyperparameters) for DEM experiments; ridge/logistic regression for linear model experiments; Optuna for hyperparameter search in large-data comparison</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DSPsleeporg/smiles-transformer">smiles-transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Jupyter notebooks)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>DEM averaged over 7 training fractions (1.25% to 80%), 20 trials each</li>
<li>Random splits for DEM; scaffold splits for HIV, BACE, BBBP in large-data comparison</li>
<li>Metrics: RMSE (regression), ROC-AUC or PRC-AUC (classification) per MoleculeNet conventions</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU type or training time for the pre-training phase.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Honda, S., Shi, S., &amp; Ueda, H. R. (2019). SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery. <em>arXiv preprint arXiv:1911.04738</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{honda2019smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Honda, Shion and Shi, Shoi and Ueda, Hiroki R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1911.04738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI+AIS: Hybridizing SMILES with Environment Tokens</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</guid><description>SMI+AIS hybridizes SMILES with Atom-In-SMILES tokens encoding local chemical environments, improving molecular generation binding affinity and synthesizability.</description><content:encoded><![CDATA[<h2 id="a-hybrid-molecular-representation-combining-smiles-and-chemical-environment-tokens">A Hybrid Molecular Representation Combining SMILES and Chemical-Environment Tokens</h2>
<p>This is a <strong>Method</strong> paper that introduces SMI+AIS(N), a hybrid molecular string representation combining standard <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> tokens with <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-In-SMILES (AIS)</a> tokens. AIS tokens encode local chemical environment information (central atom, ring membership, and neighboring atoms) into a single token. The key contribution is a systematic hybridization strategy that selectively replaces the most frequent SMILES tokens with AIS equivalents, preserving SMILES grammar compatibility while enriching token diversity. The method is validated on molecular structure generation via latent space optimization for drug design.</p>
<h2 id="limitations-of-standard-smiles-for-machine-learning">Limitations of Standard SMILES for Machine Learning</h2>
<p>SMILES is the most widely adopted string-based molecular representation, used in major databases like ZINC and PubChem. Despite this ubiquity, SMILES has several well-known limitations for machine learning applications:</p>
<ol>
<li><strong>Non-unique representations</strong>: The same molecule can be encoded as multiple distinct SMILES strings.</li>
<li><strong>Invalid string generation</strong>: Generative models can produce syntactically invalid SMILES that do not correspond to any molecule.</li>
<li><strong>Limited token diversity</strong>: SMILES tokens map one-to-one to atoms or bonds, so the token vocabulary is restricted to the available atom and bond types.</li>
<li><strong>Insufficient chemical context</strong>: Individual SMILES tokens carry no information about the local chemical environment of an atom.</li>
</ol>
<p>Alternative representations like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (guaranteeing validity) and <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> (guaranteeing uniqueness) address some of these issues but share the same fundamental limitation of low token diversity. The Atom-In-SMILES (AIS) representation (Ucak et al., 2023) enriches tokens with neighboring atom and ring information, but using AIS exclusively produces a large vocabulary with many infrequent tokens that can cause data sparsity problems. The authors aim to find a middle ground: adding chemical context to the most common tokens while keeping the vocabulary manageable.</p>
<h2 id="core-innovation-selective-token-hybridization-with-ais">Core Innovation: Selective Token Hybridization with AIS</h2>
<p>The SMI+AIS(N) representation hybridizes standard SMILES with AIS tokens through a frequency-based selection process:</p>
<h3 id="ais-token-structure">AIS Token Structure</h3>
<p>Each AIS token encodes three pieces of information about an atom, delimited by semicolons:</p>
<p>$$
\lbrack \text{central atom} ; \text{ring info} ; \text{neighbor atoms} \rbrack
$$</p>
<p>For example, the oxygen in a carboxyl group of benzoic acid is represented as <code>[O;!R;C]</code>, meaning: oxygen atom, not in a ring, bonded to carbon. In standard SMILES, this would simply be <code>O</code>.</p>
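<p>The three fields split cleanly on the semicolons. A minimal, hypothetical parser (real AIS ring fields may carry more detail than the bare <code>R</code>/<code>!R</code> flag assumed here):</p>

```python
def parse_ais_token(token):
    """Split an AIS token '[central;ring;neighbors]' into its three fields.

    Ring info is '!R' for non-ring atoms, so the flag is True only for 'R'.
    """
    central, ring, neighbors = token.strip('[]').split(';')
    return central, ring == 'R', neighbors
```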
<h3 id="hybridization-procedure">Hybridization Procedure</h3>
<ol>
<li>Convert all SMILES strings in the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a> to their full AIS representations.</li>
<li>Count the frequency of each AIS token across the database.</li>
<li>Select the top-N most frequent AIS tokens to form the hybrid vocabulary.</li>
<li>In the hybrid representation, atoms matching these top-N AIS tokens are written in AIS notation; all other atoms use standard SMILES notation.</li>
</ol>
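<p>Steps 2-4 reduce to a frequency count plus a per-atom fallback. A minimal sketch, assuming the AIS and SMILES atom tokens for each molecule have already been aligned one-to-one (the toy corpus below is purely illustrative):</p>

```python
from collections import Counter

def build_hybrid_vocab(ais_corpus, n):
    """Steps 2-3: count AIS token frequencies and keep the top-N."""
    counts = Counter(tok for mol in ais_corpus for tok in mol)
    return {tok for tok, _ in counts.most_common(n)}

def hybridize(ais_atoms, smi_atoms, vocab):
    """Step 4: keep an atom's AIS token if it made the top-N cut,
    otherwise fall back to its plain SMILES token."""
    return [a if a in vocab else s for a, s in zip(ais_atoms, smi_atoms)]

corpus = [["[C;R;CC]", "[C;R;CC]", "[O;!R;C]"], ["[C;R;CC]", "[N;!R;CC]"]]
vocab = build_hybrid_vocab(corpus, n=1)   # only the most frequent token survives
hyb = hybridize(["[O;!R;C]", "[C;R;CC]"], ["O", "c"], vocab)
print(hyb)  # ['O', '[C;R;CC]']: rare oxygen token falls back to SMILES
```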
<p>For benzoic acid, the hybridization produces:</p>
<p>$$
\text{SMI}: \texttt{O=C(O)c1ccccc1}
$$</p>
<p>$$
\text{SMI+AIS}: \texttt{\lbrack O;!R;C\rbrack=\lbrack C;!R;COO\rbrack(\lbrack OH;!R;C\rbrack)c1ccccc1}
$$</p>
<p>The parameter N controls vocabulary size. The authors test N = 50, 100, 150, and 200, finding that N = 100-150 provides the best balance for the ZINC database.</p>
<h3 id="token-frequency-rebalancing">Token Frequency Rebalancing</h3>
<p>A key benefit of hybridization is mitigating the severe token frequency imbalance in standard SMILES. Carbon (C), the most frequent element with ~184 million occurrences in ZINC, is represented by only 16 token types in SMILES. With SMI+AIS(200), carbon is distinguished into 145 token types based on chemical environment, with 74% of carbon occurrences represented by AIS tokens. Less common elements like halogens see minimal change (only 2% AIS representation), which avoids introducing unnecessarily rare tokens.</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Frequency</th>
          <th>SMILES Types</th>
          <th>SMI+AIS(100) Types (AIS %)</th>
          <th>SMI+AIS(200) Types (AIS %)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>C</td>
          <td>183,860,954</td>
          <td>16</td>
          <td>78 (73%)</td>
          <td>145 (74%)</td>
      </tr>
      <tr>
          <td>O</td>
          <td>27,270,229</td>
          <td>8</td>
          <td>16 (11%)</td>
          <td>24 (11%)</td>
      </tr>
      <tr>
          <td>N</td>
          <td>26,022,928</td>
          <td>11</td>
          <td>32 (1%)</td>
          <td>46 (10%)</td>
      </tr>
      <tr>
          <td>X (halogens)</td>
          <td>6,137,030</td>
          <td>7</td>
          <td>10 (2%)</td>
          <td>11 (2%)</td>
      </tr>
      <tr>
          <td>S</td>
          <td>4,581,307</td>
          <td>12</td>
          <td>17 (2%)</td>
          <td>24 (2%)</td>
      </tr>
  </tbody>
</table>
<h2 id="latent-space-optimization-for-molecular-generation">Latent Space Optimization for Molecular Generation</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p>The evaluation uses a <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">conditional variational autoencoder (CVAE)</a> with:</p>
<ul>
<li><strong>Encoder</strong>: BERT-style architecture with entity and positional embeddings, 4 multi-head attention layers (8 heads each), producing mean and standard deviation vectors in latent space.</li>
<li><strong>Decoder</strong>: 4 stacked gated recurrent unit (GRU) layers that transform sampled latent vectors (conditioned) back into token sequences.</li>
<li><strong>Training</strong>: 20 epochs on 9 million compounds from the ZINC database (8:1:1 train/valid/test split) under identical conditions for all representations.</li>
</ul>
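<p>The encoder's mean and standard-deviation outputs feed the standard VAE reparameterization step before decoding. A minimal NumPy sketch (variable names and the 128-dimensional latent are illustrative, not from the paper):</p>

```python
import numpy as np

def sample_latent(mu, log_sigma, rng=None):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through the sampling step during training."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

mu = np.zeros(128)             # latent mean from the BERT-style encoder
log_sigma = np.full(128, -2.0) # log standard-deviation head
z = sample_latent(mu, log_sigma)
print(z.shape)  # (128,)
```

The conditioned latent vector is then passed to the stacked GRU decoder for token-by-token generation.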
<h3 id="optimization-setup">Optimization Setup</h3>
<p><a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a> (BO) via BoTorch is applied to the CVAE <a href="/notes/chemistry/molecular-design/generation/latent-space/">latent space</a>, maximizing a multi-objective function:</p>
<p>$$
\text{Obj} = -\text{BA} - 0.5 \times \text{SA}^2
$$</p>
<p>where BA is binding affinity (docking score from QuickVina 2, lower is stronger) and SA is synthetic accessibility score (from RDKit, lower is more synthesizable). Each BO iteration generates 800 candidate latent vectors. Invalid strings receive a penalty objective value of -100.</p>
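<p>The scoring rule, including the invalid-string penalty, can be written directly (a sketch; BA and SA are plain floats produced upstream by the docking and RDKit pipelines):</p>

```python
def objective(ba: float, sa: float, valid: bool = True) -> float:
    """BO target to maximize: reward strong binding (more negative BA)
    and easy synthesis (low SA); fixed penalty for invalid strings."""
    if not valid:
        return -100.0
    return -ba - 0.5 * sa ** 2

print(objective(-9.5, 2.0))              # 7.5  (strong binder, easy to make)
print(objective(-9.5, 4.0))              # 1.5  (same binder, harder synthesis)
print(objective(0.0, 0.0, valid=False))  # -100.0
```

The quadratic SA term means synthesizability penalties grow quickly, which matches the paper's emphasis on avoiding exotic, hard-to-make structures.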
<h3 id="protein-targets">Protein Targets</h3>
<p>Four diverse targets were used to assess generalizability:</p>
<ul>
<li><strong>PDK4</strong> (<a href="https://en.wikipedia.org/wiki/Pyruvate_dehydrogenase_kinase">Pyruvate Dehydrogenase Kinase</a> 4): narrow, deep binding pocket</li>
<li><strong>5-HT1B</strong> (<a href="https://en.wikipedia.org/wiki/5-HT1B_receptor">Serotonin Receptor 1B</a>): shallow, open <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> conformation</li>
<li><strong>PARP1</strong> (<a href="https://en.wikipedia.org/wiki/PARP1">Poly ADP-ribose Polymerase 1</a>): small, flexible molecule binding site</li>
<li><strong>CK1d</strong> (<a href="https://en.wikipedia.org/wiki/Casein_kinase_1">Casein Kinase I</a> Delta): broad, accessible conformation</li>
</ul>
<p>Protein structures were obtained from the <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a> (PDB IDs: 4V26, 4IAQ, 6I8M, 4TN6). Each optimization was run 10 times independently from the same 5 initial compounds selected from BindingDB.</p>
<h3 id="key-results">Key Results</h3>
<p>SMI+AIS(100) consistently achieved the highest objective values across protein targets.</p>
<p><strong>PDK4 Optimization</strong> (Top-1 results over 10 independent runs):</p>
<ul>
<li>SMI+AIS(100) achieved approximately 12% improvement over standard SMILES and 28% improvement over SELFIES based on median Top-1 objective values.</li>
<li>Generated structures exhibited BA scores between -10 and -9 and SA scores between 2.0 and 2.3.</li>
<li>Molecular weights clustered around 400 amu, consistent with the CVAE conditioning.</li>
</ul>
<p><strong>Validity Ratios</strong>: Standard SMILES produced approximately 40% valid structures. SMI+AIS validity improved substantially as N increased, though SMI+AIS(200) saturated slightly, likely because its infrequent tokens were insufficiently trained.</p>
<p><strong>SELFIES</strong>: Despite achieving the highest validity ratio, SELFIES failed to generate chemically meaningful structures with desirable BA and SA scores. The authors attribute this to SELFIES grammar where token meaning is highly context-dependent, causing minor latent space variations to produce large structural changes.</p>
<p><strong>Cross-target consistency</strong>: Improvements held across all four protein targets, with some variation: 5-HT1B showed a smaller Top-1 gap between SMI and SMI+AIS(100), while the other three targets showed clear gains.</p>
<h2 id="improved-molecular-generation-through-chemical-context-enrichment">Improved Molecular Generation Through Chemical Context Enrichment</h2>
<p>The SMI+AIS(N) representation achieves consistent improvements in molecular generation quality compared to both standard SMILES and SELFIES. The core findings are:</p>
<ol>
<li><strong>Binding affinity improvement</strong>: Approximately 7% improvement over standard SMILES for the PDK4 target.</li>
<li><strong>Synthesizability improvement</strong>: Approximately 6% improvement in synthetic accessibility (lower SA scores).</li>
<li><strong>Target independence</strong>: Performance gains transfer across four structurally diverse protein targets.</li>
<li><strong>Preserved structural motifs</strong>: The generative model retains chemically meaningful fragments (e.g., acetamide and <a href="https://en.wikipedia.org/wiki/Piperidine">piperidine</a>) from initial compounds without explicit fragment constraints.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Stereochemistry</strong>: SMI+AIS inherits the limited stereochemistry handling of standard SMILES.</li>
<li><strong>Evaluation scope</strong>: Only molecular generation was tested; property prediction and other ML tasks remain unexplored.</li>
<li><strong>Compute constraints</strong>: Limited computing resources and time confined the study to the generation task.</li>
<li><strong>Single optimization strategy</strong>: Only latent space optimization with Bayesian optimization was evaluated; other generative approaches were not compared.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest extending SMI+AIS to diverse benchmarking tests including molecular property prediction, experimental validation, and broader applications of chemical language models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Vocab</td>
          <td>ZINC Database</td>
          <td>9M compounds</td>
          <td>Canonicalized, deduplicated, split 8:1:1</td>
      </tr>
      <tr>
          <td>Binding targets</td>
          <td>BindingDB</td>
          <td>5 initial compounds per target</td>
          <td>Selected for each protein target</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td>PDB</td>
          <td>4 structures</td>
          <td>IDs: 4V26, 4IAQ, 6I8M, 4TN6</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: AIS token frequency counting on full ZINC database, top-N selection</li>
<li><strong>Generative model</strong>: Conditional VAE with BERT encoder (4 layers, 8 heads) and GRU decoder (4 layers)</li>
<li><strong>Optimization</strong>: Bayesian Optimization via BoTorch (800 candidates per iteration)</li>
<li><strong>Docking</strong>: QuickVina 2 with 25 &Aring; pocket size, 10 docking simulations per ligand</li>
<li><strong>SA scoring</strong>: RDKit SA score</li>
<li><strong>Training</strong>: 20 epochs for all representations under identical conditions</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>CVAE architecture details in supplementary (Fig. S9, Tables S2, S4)</li>
<li>No pre-trained weights released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMI+AIS(100) vs SMILES</th>
          <th>SMI+AIS(100) vs SELFIES</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Median Top-1 Obj. Value</td>
          <td>+12%</td>
          <td>+28%</td>
          <td>PDK4 target</td>
      </tr>
      <tr>
          <td>Validity Ratio</td>
          <td>Higher than ~40% (SMILES)</td>
          <td>Lower than SELFIES</td>
          <td>SMI+AIS improves with N</td>
      </tr>
      <tr>
          <td>BA (binding affinity)</td>
          <td>~7% improvement</td>
          <td>Substantial</td>
          <td>Lower (more negative) is better</td>
      </tr>
      <tr>
          <td>SA (synthesizability)</td>
          <td>~6% improvement</td>
          <td>Substantial</td>
          <td>Lower is more synthesizable</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the main text. Optimization wall times are reported in supplementary Table S5.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/herim-han/AIS-Drug-Opt">AIS-Drug-Opt</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Source code and datasets for reproduction</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. Code and processed data are publicly available on GitHub, but no pre-trained model weights are released, the license is unspecified, and hardware requirements are not documented in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Han, H., Yeom, M. S., &amp; Choi, S. (2025). Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation. <em>Scientific Reports</em>, 15, 16892. <a href="https://doi.org/10.1038/s41598-025-01890-7">https://doi.org/10.1038/s41598-025-01890-7</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{han2025hybridization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Han, Herim and Yeom, Min Sun and Choi, Sunghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{16892}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-025-01890-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI-TED: Encoder-Decoder Foundation Models for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/</guid><description>SMI-TED is a family of encoder-decoder transformer models pre-trained on 91M PubChem molecules for molecular property prediction and generation.</description><content:encoded><![CDATA[<h2 id="an-encoder-decoder-chemical-foundation-model-family">An Encoder-Decoder Chemical Foundation Model Family</h2>
<p>SMI-TED is a <strong>Method</strong> paper that introduces a family of encoder-decoder transformer-based foundation models for chemistry. The primary contribution is the SMI-TED289M architecture, a 289-million parameter model pre-trained on 91 million curated SMILES from PubChem, along with a Mixture-of-Experts variant (MoE-OSMI) that scales to 8x289M parameters. The models support molecular property prediction, molecule reconstruction, reaction yield prediction, and few-shot reasoning over molecular embeddings. All model weights and code are open-sourced under an Apache 2.0 license.</p>
<h2 id="bridging-encoding-and-decoding-for-molecular-representations">Bridging Encoding and Decoding for Molecular Representations</h2>
<p>Chemical language models based on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> have gained traction for molecular property prediction and generation. Most existing models, such as <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> and <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, are encoder-only architectures that produce molecular embeddings through mean pooling. While effective for downstream classification and regression, this encoder-only approach has a limitation: mean pooling has no natural inverse, meaning the model cannot reconstruct the input molecule from its latent representation. This restricts the model&rsquo;s utility for generative tasks and limits the interpretability of the learned latent space.</p>
<p>The authors argue that adding a decoder with a reconstruction objective forces the model to encode a more complete set of structural features. Prior work has shown that the quality of pre-training data matters more than the choice of SMILES vs. <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and that large-scale pre-training can yield useful chemical representations. SMI-TED builds on these observations by combining an encoder-decoder architecture with a carefully curated 91-million molecule dataset from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>.</p>
<h2 id="invertible-pooling-and-two-phase-pre-training">Invertible Pooling and Two-Phase Pre-Training</h2>
<p>The core architectural innovation in SMI-TED is a learned pooling mechanism that replaces standard mean or max pooling with an invertible projection. Given token embeddings $\mathbf{x} \in \mathbb{R}^{D \times L}$ (where $D = 202$ is the maximum token count and $L = 768$ is the embedding dimension), the submersion into the latent space $\mathbf{z} \in \mathbb{R}^{L}$ is computed as:</p>
<p>$$
\mathbf{z} = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{W}_1^T \mathbf{x} + \mathbf{b}_1\right)\right)\right) \mathbf{W}_2
$$</p>
<p>where $\mathbf{W}_1 \in \mathbb{R}^{D \times L}$, $\mathbf{b}_1 \in \mathbb{R}^{L}$, and $\mathbf{W}_2 \in \mathbb{R}^{L \times L}$. The immersion (inverse mapping) back to the token space is:</p>
<p>$$
\tilde{\mathbf{x}}^T = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{z} \mathbf{W}_3 + \mathbf{b}_3\right)\right)\right) \mathbf{W}_4
$$</p>
<p>where $\mathbf{W}_3 \in \mathbb{R}^{L \times L}$, $\mathbf{b}_3 \in \mathbb{R}^{L}$, and $\mathbf{W}_4 \in \mathbb{R}^{L \times D}$. A decoder language model then predicts the next token from $\tilde{\mathbf{x}}$.</p>
<p>The encoder uses a modified RoFormer attention mechanism with rotary position embeddings:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>where $R_m$ are position-dependent rotation matrices and $\varphi$ is a random feature map.</p>
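<p>Setting aside the rotary rotations $R_m$, the kernelized attention above can be sketched in NumPy. The feature map below uses $\varphi(x) = \mathrm{ELU}(x) + 1$ as an illustrative positive stand-in for the paper's random feature map:</p>

```python
import numpy as np

def phi(x):
    """Positive feature map (ELU + 1), standing in for random features."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """out_m = sum_n <phi(q_m), phi(k_n)> v_n / sum_n <phi(q_m), phi(k_n)>,
    computed without an N x N matrix by associating (phi(K)^T V) first."""
    qp, kp = phi(q), phi(k)       # (N, d_k)
    num = qp @ (kp.T @ v)         # (N, d_v)
    den = qp @ kp.sum(axis=0)     # (N,)
    return num / den[:, None]

rng = np.random.default_rng(0)
q, k = rng.standard_normal((5, 8)), rng.standard_normal((5, 8))
out = linear_attention(q, k, np.ones((5, 3)))
print(np.allclose(out, 1.0))  # True: each output row is a convex combination
```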
<p><strong>Two-phase pre-training strategy:</strong></p>
<ul>
<li><strong>Phase 1</strong>: The token encoder is pre-trained on 95% of the data using masked language modeling (15% token selection, of which 80% masked, 10% random, 10% unchanged). The remaining 5% trains the encoder-decoder layer, preventing convergence issues from unstable early embeddings.</li>
<li><strong>Phase 2</strong>: After the token embeddings converge, both the encoder and decoder train on 100% of the data jointly.</li>
</ul>
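<p>Phase 1's corruption scheme is the standard BERT recipe. A sketch (integer token IDs and the seeded RNG are illustrative):</p>

```python
import random

def mlm_corrupt(tokens, mask_id, vocab, p_select=0.15, seed=0):
    """Select ~15% of tokens as prediction targets; of those,
    80% become [MASK], 10% a random vocab token, 10% stay unchanged."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [-1] * len(tokens)  # -1 means not a target
    for i, t in enumerate(tokens):
        if rng.random() < p_select:
            labels[i] = t            # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: leave the token unchanged (10% case)
    return corrupted, labels

tokens = list(range(100))
corrupted, labels = mlm_corrupt(tokens, mask_id=-2, vocab=list(range(100)))
print(sum(l != -1 for l in labels), "prediction targets")
```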
<p><strong><a href="https://en.wikipedia.org/wiki/Mixture_of_experts">Mixture-of-Experts</a> (MoE-OSMI):</strong> The MoE variant composes 8 fine-tuned SMI-TED289M expert models with a gating network. Given an input embedding $x$, the output is:</p>
<p>$$
y = \sum_{i=1}^{n} G(x)_i E_i(\hat{x})
$$</p>
<p>where $G(x) = \text{Softmax}(\text{TopK}(x \cdot W_g))$ selects the top $k = 2$ experts per input, setting all other gate values to zero.</p>
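<p>The gating rule keeps only the two largest logits before the softmax, zeroing every other expert. A NumPy sketch (the 16-dimensional input and 8-expert weight shapes are illustrative):</p>

```python
import numpy as np

def top_k_gate(x, W_g, k=2):
    """G(x) = Softmax(TopK(x . W_g)): softmax over the k largest gate
    logits; every other expert's weight is exactly zero."""
    logits = x @ W_g                   # (n_experts,)
    keep = np.argsort(logits)[-k:]     # indices of the k largest logits
    gates = np.zeros_like(logits)
    e = np.exp(logits[keep] - logits[keep].max())  # stable softmax
    gates[keep] = e / e.sum()
    return gates

rng = np.random.default_rng(0)
x, W_g = rng.standard_normal(16), rng.standard_normal((16, 8))
g = top_k_gate(x, W_g)                 # 8 experts, top-2 routing
print(np.count_nonzero(g))             # 2

# The MoE output is then y = sum_i g[i] * expert_i(x_hat), summed over
# the two active fine-tuned SMI-TED experts.
```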
<h2 id="benchmarks-across-property-prediction-generation-and-reaction-yield">Benchmarks Across Property Prediction, Generation, and Reaction Yield</h2>
<h3 id="moleculenet-classification-6-datasets-roc-auc"><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification (6 datasets, ROC-AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BBBP</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
          <th>Tox21</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>73.6 +/- 0.8</td>
          <td>91.2 +/- 1.4</td>
          <td>80.5 +/- 1.65</td>
          <td>86.3 +/- 0.6</td>
          <td>65.5 +/- 0.2</td>
          <td>80.46 +/- 0.2</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>72.9 +/- 0.6</td>
          <td>91.9 +/- 1.8</td>
          <td>80.8 +/- 0.3</td>
          <td>85.7 +/- 0.2</td>
          <td>65.9 +/- 1.3</td>
          <td>79.6 +/- 0.5</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>72.4 +/- 0.4</td>
          <td>90.1 +/- 1.3</td>
          <td>80.6 +/- 0.9</td>
          <td>85.6 +/- 1.1</td>
          <td>67.2 +/- 0.4</td>
          <td>78.1 +/- 0.1</td>
      </tr>
      <tr>
          <td>SMI-TED289M (pre-trained)</td>
          <td>91.46 +/- 0.47</td>
          <td>93.49 +/- 0.85</td>
          <td>80.51 +/- 1.34</td>
          <td>85.58 +/- 0.92</td>
          <td>66.01 +/- 0.88</td>
          <td>81.53 +/- 0.45</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>92.26 +/- 0.57</strong></td>
          <td><strong>94.27 +/- 1.83</strong></td>
          <td>76.85 +/- 0.89</td>
          <td><strong>88.24 +/- 0.50</strong></td>
          <td>65.68 +/- 0.45</td>
          <td><strong>81.85 +/- 1.42</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED achieves the best results in 4 of 6 classification tasks. Notably, the pre-trained version (without fine-tuning) already matches or exceeds many baselines on BBBP, ClinTox, and Tox21.</p>
<h3 id="moleculenet-regression-5-datasets-mae-for-qm9qm8-rmse-for-esolfreesolvlipophilicity">MoleculeNet regression (5 datasets, MAE for QM9/QM8, RMSE for ESOL/FreeSolv/Lipophilicity)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>1.5894</td>
          <td>0.0102</td>
          <td>0.880</td>
          <td>2.342</td>
          <td>0.700</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>3.241</td>
          <td>0.0143</td>
          <td>0.98</td>
          <td>2.18</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>1.3246</strong></td>
          <td><strong>0.0095</strong></td>
          <td><strong>0.6112</strong></td>
          <td><strong>1.2233</strong></td>
          <td><strong>0.5522</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED289M achieves the best results across all 5 regression tasks when fine-tuned. The improvements are substantial on ESOL (0.61 vs. 0.82 for next best) and FreeSolv (1.22 vs. 1.91 for next best).</p>
<h3 id="reaction-yield-prediction-buchwald-hartwig-c-n-cross-coupling">Reaction yield prediction (<a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling)</h3>
<p>The model was tested on Pd-catalyzed Buchwald-Hartwig reactions with 3,955 reactions across varying train/test splits. Selected $R^2$ results:</p>
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Yield-BERT (Aug)</th>
          <th>DRFP</th>
          <th>SMI-TED289M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>70/30</td>
          <td>0.97</td>
          <td>0.95</td>
          <td><strong>0.984</strong></td>
      </tr>
      <tr>
          <td>10/90</td>
          <td>0.81</td>
          <td>0.81</td>
          <td><strong>0.961</strong></td>
      </tr>
      <tr>
          <td>2.5/97.5</td>
          <td>0.61</td>
          <td>0.62</td>
          <td><strong>0.875</strong></td>
      </tr>
      <tr>
          <td>Test 1-4 avg</td>
          <td>0.58</td>
          <td>0.71</td>
          <td><strong>0.983</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED shows particularly strong performance in low-data regimes. With only 2.5% training data, it achieves $R^2 = 0.875$, compared to 0.61-0.62 for competing methods.</p>
<h3 id="moses-molecular-generation-benchmarks"><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> molecular generation benchmarks</h3>
<p>SMI-TED is competitive with baselines including CharRNN, SMILES VAE, JT-VAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen-7b</a>, and <a href="/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/">GP-MoLFormer</a> on standard metrics (validity, uniqueness, novelty, FCD, internal diversity). It achieves superior scaffold cosine similarity (Scaf) and nearest-neighbor similarity (SNN) scores.</p>
<h3 id="latent-space-compositionality">Latent space compositionality</h3>
<p>Using six families of carbon chains ($\mathcal{F} = \{\text{CC}, \text{CO}, \text{CN}, \text{CS}, \text{CF}, \text{CP}\}$), the authors test whether the embedding space respects hierarchical distance structures. A linear regression on SMI-TED embeddings yields $R^2 = 0.99$ and $MSE = 0.002$, compared to $R^2 = 0.55$ and $MSE = 0.237$ for MoLFormer. This indicates that the SMI-TED latent space captures compositional chemical relationships far more faithfully.</p>
<p>For structure-property analysis on <a href="/notes/chemistry/datasets/qm9/">QM9</a>, nitrogen-containing molecules represent 9.10% of the dataset but account for 32.81% of the top 10% by HOMO energy. In the SMI-TED latent space, these molecules cluster distinctly (<a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> of 2.82 vs. 4.28 for MoLFormer), suggesting the decoder objective encourages encoding of functional group information.</p>
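<p>The compositionality probe amounts to fitting a linear map from embeddings to the target structure and scoring it with $R^2$. A generic NumPy sketch (the synthetic data is illustrative, standing in for the actual embeddings):</p>

```python
import numpy as np

def linear_r2(X, y):
    """Least-squares fit y ~ X w + b, returning the coefficient of
    determination R^2 = 1 - SS_res / SS_tot."""
    A = np.column_stack([X, np.ones(len(X))])  # append intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ w) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))              # stand-in embeddings
y = X @ rng.standard_normal(8) + 0.01 * rng.standard_normal(200)
print(linear_r2(X, y) > 0.99)  # True: near-linear structure is recovered
```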
<h2 id="strong-performance-with-a-compositional-latent-space">Strong Performance with a Compositional Latent Space</h2>
<p>SMI-TED289M demonstrates competitive or superior performance across molecular property prediction, reaction yield prediction, and molecular generation benchmarks. The key findings include:</p>
<ol>
<li><strong>Broad applicability</strong>: The single pre-trained model achieves strong results across classification (4/6 best), regression (5/5 best), reaction yield, and generation tasks.</li>
<li><strong>Low-data robustness</strong>: The pre-training on 91M molecules provides chemical knowledge that transfers well to small training sets, as shown by the reaction yield experiments where SMI-TED maintains high accuracy even at 2.5% training data.</li>
<li><strong>Compositional embeddings</strong>: The encoder-decoder architecture produces a latent space where molecular similarity follows chemical intuition, with near-perfect linear relationships between functional group families ($R^2 = 0.99$).</li>
<li><strong>Structure-property capture</strong>: The reconstruction objective appears to enforce encoding of chemically meaningful features like nitrogen substituent effects on <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO</a> energy, outperforming encoder-only models in latent space organization.</li>
</ol>
<p><strong>Limitations</strong>: The paper evaluates on MoleculeNet benchmarks, which are well-studied but may not reflect performance on more diverse chemical tasks. The BBBP classification result (92.26) shows a large jump from prior methods (73.6 for MoLFormer), which is worth scrutinizing. The MoE variant is evaluated only in supplementary materials, and scaling behavior beyond 8 experts is not explored.</p>
<p><strong>Future directions</strong>: The authors note that compositionality of the learned representations suggests potential for reasoning applications, though they acknowledge that stronger claims require further studies following compositionality analysis methodologies from natural language processing. The model has been integrated into the dZiner agent for inverse molecular design.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (curated)</td>
          <td>91M molecules, 4B tokens</td>
          <td>Deduplicated, canonicalized, validity-checked</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (BBBP, ClinTox, HIV, BACE, SIDER, Tox21)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (QM9, QM8, ESOL, FreeSolv, Lipophilicity)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>MOSES</td>
          <td>1.94M molecules</td>
          <td>Train/test/scaffold test splits</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig HTE</td>
          <td>3,955 reactions</td>
          <td>3x 1536-well plates</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Masked language modeling for token encoder (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li>Two-phase pre-training (95/5 split then 100% joint training)</li>
<li>RoFormer attention with rotary position embeddings</li>
<li>Vocabulary: 2,993 tokens (2,988 molecular + 5 special)</li>
<li>Maximum sequence length: 202 tokens (covers 99.4% of PubChem)</li>
<li>Learning rate: 1.6e-4, batch size: 288 molecules</li>
<li>40 epochs over the full PubChem corpus</li>
<li>10 random seeds per experiment for robustness</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Parameters</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMI-TED289M base</td>
          <td>289M</td>
          <td>47M</td>
          <td>242M</td>
          <td>12 layers, 12 attention heads, hidden size 768, dropout 0.2</td>
      </tr>
      <tr>
          <td>MoE-OSMI</td>
          <td>8x289M</td>
          <td>-</td>
          <td>-</td>
          <td>8 experts, top-k=2 routing, gating network</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC</li>
<li>Regression: MAE (QM9, QM8), RMSE (ESOL, FreeSolv, Lipophilicity)</li>
<li>Reaction yield: $R^2$</li>
<li>Generation: Validity, uniqueness, novelty, FCD, IntDiv, Scaf, SNN (MOSES metrics)</li>
<li>Latent space: Linear regression $R^2$, MSE, Davies-Bouldin index, t-SNE visualization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>24 NVIDIA V100 GPUs (16GB)</li>
<li>4 nodes with DDP (Distributed Data Parallel)</li>
<li>Pre-training: 40 epochs on 91M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/materials/tree/main/models/smi_ted">IBM/materials (smi_ted)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Training, fine-tuning scripts, Jupyter notebooks</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/materials.smi-ted">ibm/materials.smi-ted</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.15603701">Zenodo archive</a></td>
          <td>Code + Data</td>
          <td>Apache-2.0</td>
          <td>Archival copy of scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Soares, E., Vital Brazil, E., Shirasuna, V., Zubarev, D., Cerqueira, R., &amp; Schmidt, K. (2025). An open-source family of large encoder-decoder foundation models for chemistry. <em>Communications Chemistry</em>, 8(1). <a href="https://doi.org/10.1038/s42004-025-01585-0">https://doi.org/10.1038/s42004-025-01585-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{soares2025smited,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{An open-source family of large encoder-decoder foundation models for chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Communications Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42004-025-01585-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Seq2seq Fingerprint: Unsupervised Molecular Embedding</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/</guid><description>Seq2seq fingerprint uses a GRU encoder-decoder trained on SMILES self-translation to produce unsupervised molecular embeddings for property prediction.</description><content:encoded><![CDATA[<h2 id="an-unsupervised-seq2seq-method-for-molecular-fingerprints">An Unsupervised Seq2seq Method for Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces seq2seq fingerprint, an unsupervised molecular embedding approach based on sequence-to-sequence learning. The core idea is to train a <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> encoder-decoder network to translate <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings to themselves, then extract the intermediate fixed-length vector as a molecular fingerprint. These fingerprints are then used with standard supervised classifiers for downstream property prediction tasks such as lipophilicity (LogP) classification and promiscuity prediction.</p>
<h2 id="the-labeled-data-bottleneck-in-drug-discovery">The Labeled Data Bottleneck in Drug Discovery</h2>
<p>Machine learning approaches to molecular property prediction depend on fixed-length feature vectors as inputs. Traditional molecular fingerprints fall into two categories: hash-based methods like Extended-Connectivity Fingerprints (ECFP) that are fast but lossy and non-invertible, and biologist-guided local-feature fingerprints that require domain expertise and are task-specific. Supervised deep learning fingerprints (e.g., neural fingerprints) can learn representations from data but require large amounts of labeled data, which is expensive to obtain in drug discovery due to the cost of biological experiments.</p>
<p>The authors identify three limitations of existing approaches:</p>
<ol>
<li>Hash-based fingerprints discard information during the hashing process and cannot reconstruct the original molecule</li>
<li>Local-feature fingerprints require expert knowledge and generalize poorly across tasks</li>
<li>Supervised deep learning fingerprints are data-hungry and fail when labeled data is limited</li>
</ol>
<h2 id="self-translation-as-unsupervised-molecular-encoding">Self-Translation as Unsupervised Molecular Encoding</h2>
<p>The key insight is to adapt the <a href="https://en.wikipedia.org/wiki/Seq2seq">sequence-to-sequence</a> learning framework from machine translation (originally English-to-French) to molecular representation learning by setting both the input and output to the same SMILES string. Since the intermediate vector must contain enough information to reconstruct the original SMILES, it serves as a rich, task-agnostic molecular fingerprint.</p>
<p>The architecture consists of two components:</p>
<ul>
<li><strong>Perceiver network</strong>: A multi-layer GRU encoder that reads the SMILES string and compresses it into a fixed-length vector</li>
<li><strong>Interpreter network</strong>: A multi-layer GRU decoder that reconstructs the original SMILES from the fingerprint vector</li>
</ul>
<p>The GRU cell computes a sequence of outputs $(s_1, \ldots, s_T)$ from input sequences $(x_1, \ldots, x_T)$ by iterating:</p>
<p>$$
z_t = \sigma_g(W_z x_t + U_z s_{t-1} + b_z)
$$</p>
<p>$$
r_t = \sigma_r(W_r x_t + U_r s_{t-1} + b_r)
$$</p>
<p>$$
h_t = \tanh(W_h x_t + U_h(s_{t-1} \circ r_t) + b_h)
$$</p>
<p>$$
s_t = (1 - z_t) \circ h_t + z_t \circ s_{t-1}
$$</p>
<p>where $z_t$ is the update gate, $r_t$ is the reset gate, $\circ$ denotes element-wise multiplication, and $W$, $U$, $b$ are trainable parameters.</p>
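<p>As a concrete reference, the gate equations above can be sketched in a few lines of NumPy. This is a minimal, illustrative single-cell implementation with random weights and toy dimensions, not the paper's trained perceiver network.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, s_prev, p):
    """One GRU update following the gate equations above."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ s_prev + p["bz"])  # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ s_prev + p["br"])  # reset gate
    h_t = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (s_prev * r_t) + p["bh"])  # candidate
    return (1.0 - z_t) * h_t + z_t * s_prev  # interpolate candidate and old state

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16  # toy sizes; the paper uses a latent dimension of 256
p = {}
for g in ("z", "r", "h"):
    p["W" + g] = rng.normal(scale=0.5, size=(d_hid, d_in))
    p["U" + g] = rng.normal(scale=0.5, size=(d_hid, d_hid))
    p["b" + g] = np.zeros(d_hid)

s = np.zeros(d_hid)                     # s_0 = 0
for x_t in rng.normal(size=(5, d_in)):  # five embedded input tokens
    s = gru_step(x_t, s, p)
```

<p>Because the state is always a convex combination of a $\tanh$ candidate and the previous state, every entry of the hidden vector stays bounded in $(-1, 1)$.</p>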
<p>Several adaptations to the original seq2seq framework make this work for molecular data:</p>
<ol>
<li><strong>GRU instead of LSTM</strong>: GRU provides comparable performance with faster training, which is important given the large training data pool</li>
<li><strong>Attention mechanism</strong>: Establishes a stronger connection between the perceiver and interpreter networks via soft alignment, addressing the challenge of passing information through hidden memory for long sequences (SMILES can be up to 250 characters)</li>
<li><strong>Dropout layers</strong>: Added to input and output gates (but not hidden memory transfer) following the approach of Zaremba et al. to combat overfitting when training on large datasets</li>
<li><strong>Fingerprint extraction layer</strong>: A fixed-unit fully connected layer combined with a GRU cell state concatenation layer is inserted between encoder and decoder to explicitly output the fingerprint vector</li>
<li><strong>Reverse target sequence</strong>: Following Sutskever et al., the target sequence is reversed to improve SGD optimization</li>
<li><strong>Bucket training</strong>: Sequences are distributed into buckets by length and padded to enable GPU parallelization</li>
</ol>
<h2 id="classification-experiments-on-logp-and-pm2-datasets">Classification Experiments on LogP and PM2 Datasets</h2>
<h3 id="training-setup">Training Setup</h3>
<p>The unsupervised training used 334,092 valid SMILES representations from combined LogP and PM2-full datasets obtained from the National Center for Advancing Translational Sciences (NCATS) at NIH. Three model variants were trained with fingerprint dimensions of 512, 768, and 1024, differing in the number of GRU layers (2, 3, and 4 respectively) while keeping the latent dimension at 256. Each model was trained for 24 hours on a workstation with an Intel i7-6700K CPU, 16 GB RAM, and an NVIDIA GTX 1080 GPU.</p>
<h3 id="reconstruction-performance">Reconstruction Performance</h3>
<p>The models were evaluated on their ability to reconstruct SMILES strings from their fingerprints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>GRU Layers</th>
          <th>Latent Dim</th>
          <th>Perplexity</th>
          <th>Exact Match Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>seq2seq-512</td>
          <td>2</td>
          <td>256</td>
          <td>1.00897</td>
          <td>94.24%</td>
      </tr>
      <tr>
          <td>seq2seq-768</td>
          <td>3</td>
          <td>256</td>
          <td>1.00949</td>
          <td>92.92%</td>
      </tr>
      <tr>
          <td>seq2seq-1024</td>
          <td>4</td>
          <td>256</td>
          <td>1.01472</td>
          <td>90.26%</td>
      </tr>
  </tbody>
</table>
<p>Deeper models showed lower reconstruction accuracy, possibly because larger fingerprint spaces contain more unused (null) regions and require longer training to converge.</p>
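<p>The perplexities in the table follow directly from per-token cross-entropy: perplexity is the exponential of the mean per-token loss. The token losses below are hypothetical values chosen to land near the table's ~1.009 figures; a perplexity this close to 1 means the decoder is nearly certain of each reconstructed token.</p>

```python
import math

# Hypothetical per-token cross-entropy losses in nats (not from the paper)
token_losses = [0.0090, 0.0085, 0.0095, 0.0089]
perplexity = math.exp(sum(token_losses) / len(token_losses))
```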
<h3 id="classification-results">Classification Results</h3>
<p>Two labeled datasets were used for downstream classification:</p>
<ul>
<li><strong>LogP</strong>: 10,850 samples with <a href="https://en.wikipedia.org/wiki/Partition_coefficient">water-octanol partition coefficient</a> values, binarized at a threshold of 1.88</li>
<li><strong>PM2-10k</strong>: 10,000 samples with binary promiscuity class labels</li>
</ul>
<p>The seq2seq fingerprints were evaluated with three ensemble classifiers (<a href="https://en.wikipedia.org/wiki/AdaBoost">AdaBoost</a>, <a href="https://en.wikipedia.org/wiki/Gradient_boosting">GradientBoost</a>, RandomForest) against circular fingerprints (ECFP) and neural fingerprints. Results are 100-run averages of 5-fold cross-validation accuracy.</p>
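<p>The evaluation protocol (fixed fingerprint features fed to an ensemble classifier, scored by k-fold cross-validation accuracy) can be sketched with scikit-learn. The features and labels here are synthetic stand-ins; a real run would use the 512/768/1024-dimensional encoder outputs and LogP labels thresholded at 1.88, and average 100 such runs.</p>

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for fingerprints (X) and binarized property labels (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 32))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
fold_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
mean_acc = fold_acc.mean()  # one of the 100 runs the paper averages
```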
<p><strong>LogP classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3674</td>
          <td>0.0074</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.6080</td>
          <td>0.0135</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.7664</strong></td>
          <td>0.0043</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.7342</td>
          <td>0.0042</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.7350</td>
          <td>0.0060</td>
      </tr>
  </tbody>
</table>
<p><strong>PM2-10k classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3938</td>
          <td>0.0114</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.5227</td>
          <td>0.0112</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.6206</strong></td>
          <td>0.0198</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.6036</td>
          <td>0.0147</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.5741</td>
          <td>0.0086</td>
      </tr>
  </tbody>
</table>
<p>The seq2seq fingerprint outperformed both baselines across all configurations. Despite the seq2seq-1024 model having lower reconstruction accuracy, it provided the best classification performance, suggesting that the longer fingerprint captures more discriminative information for downstream tasks even if the reconstruction is less exact.</p>
<h2 id="unsupervised-transfer-learning-for-molecular-properties">Unsupervised Transfer Learning for Molecular Properties</h2>
<p>The results demonstrate that unsupervised pretraining on large unlabeled molecular datasets can produce fingerprints that transfer well to supervised property prediction with limited labels. The key advantages confirmed by the experiments are:</p>
<ol>
<li><strong>Label-free training</strong>: The unsupervised approach uses essentially unlimited SMILES data, avoiding the expensive label collection process</li>
<li><strong>Task-agnostic representations</strong>: The same fingerprints work across different classification tasks (lipophilicity and promiscuity) without retraining</li>
<li><strong>Invertibility</strong>: The fingerprints contain enough information to reconstruct the original SMILES (up to 94.24% exact match), unlike hash-based methods</li>
</ol>
<p><strong>Limitations</strong> acknowledged by the authors include:</p>
<ul>
<li>Long training times (24 hours per model variant), motivating future work on distributed training</li>
<li>The relationship between fingerprint dimensionality and downstream performance is non-monotonic (768-dim underperforms 512-dim on some tasks), suggesting sensitivity to hyperparameter choices</li>
<li>Only classification tasks were evaluated; regression performance was not assessed</li>
<li>The comparison baselines are limited to ECFP and neural fingerprints from 2015</li>
</ul>
<p><strong>Future directions</strong> proposed include distributed training strategies, hyperparameter optimization methods, and semi-supervised extensions that incorporate label information into the fingerprint training.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unsupervised training</td>
          <td>LogP + PM2-full (combined)</td>
          <td>334,092 SMILES</td>
          <td>Obtained from NCATS at NIH</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>LogP</td>
          <td>10,850 samples</td>
          <td>Binary labels at LogP threshold 1.88</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>PM2-10k</td>
          <td>10,000 samples</td>
          <td>Binary promiscuity labels</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder-decoder: Multi-layer GRU with attention mechanism and dropout</li>
<li>Fingerprint dimensions: 512, 768, 1024 (with 2, 3, 4 GRU layers respectively)</li>
<li>Latent dimension: 256 for all variants</li>
<li>Downstream classifiers: AdaBoost, GradientBoost, RandomForest</li>
<li>Evaluation: 5-fold cross-validation, 100-run averages</li>
<li>Baselines: ECFP via RDKit, Neural Fingerprint from HIPS/neural-fingerprint</li>
</ul>
<h3 id="models">Models</h3>
<p>Three model variants trained for 24 hours each. The paper states code would become publicly available after acceptance, but no public repository has been confirmed.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Task</th>
          <th>Configuration</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification accuracy</td>
          <td>0.7664</td>
          <td>LogP</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Classification accuracy</td>
          <td>0.6206</td>
          <td>PM2-10k</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Exact match reconstruction</td>
          <td>94.24%</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
      <tr>
          <td>Perplexity</td>
          <td>1.00897</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Intel i7-6700K @ 4.00 GHz, 16 GB RAM, NVIDIA GTX 1080 GPU</li>
<li>Hyperparameter search and classifier training: TACC Lonestar 5 cluster</li>
<li>Training time: 24 hours per model variant</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HIPS/neural-fingerprint">Neural Fingerprint (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Baseline comparison code</td>
      </tr>
  </tbody>
</table>
<p>The authors indicated the seq2seq fingerprint code would be released after acceptance, but no public repository has been found as of this writing. The datasets were sourced from NCATS/NIH.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Wang, S., Zhu, F., &amp; Huang, J. (2017). Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery. <em>Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB &lsquo;17)</em>, 285-294. <a href="https://doi.org/10.1145/3107411.3107424">https://doi.org/10.1145/3107411.3107424</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xu2017seq2seq,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Zheng and Wang, Sheng and Zhu, Feiyun and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{285--294}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3107411.3107424}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>S4 Structured State Space Models for De Novo Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/s4-chemical-language-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/s4-chemical-language-modeling/</guid><description>S4 state space models are applied to chemical language modeling for de novo drug design, outperforming LSTMs and GPTs in bioactivity learning from SMILES.</description><content:encoded><![CDATA[<h2 id="structured-state-spaces-meet-chemical-language-modeling">Structured State Spaces Meet Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces structured state space sequence (S4) models to chemical language modeling (CLM) for de novo drug design. S4 models have a dual formulation: they process entire input sequences via convolution during training (like Transformers) and generate sequences element-by-element via recurrence during inference (like LSTMs). The authors benchmark S4 against LSTM and GPT architectures across multiple drug discovery tasks, including drug-like molecule generation, bioactivity learning, chemical space exploration, natural product design, and prospective kinase inhibitor design validated by molecular dynamics simulations.</p>
<h2 id="bridging-the-lstm-transformer-gap-in-molecular-generation">Bridging the LSTM-Transformer Gap in Molecular Generation</h2>
<p>Chemical language models (CLMs) generate molecules by learning the &ldquo;chemical language&rdquo; of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string representations. The two dominant architectures for CLMs are LSTMs and GPTs, each with complementary strengths and limitations:</p>
<ul>
<li><strong>LSTMs</strong> generate sequences recurrently (element-by-element), which enables efficient generation and good learning of local/short-range dependencies. However, their sequential information bottleneck limits learning of global sequence properties.</li>
<li><strong>GPTs</strong> (Transformer decoders) process the entire input at once, better capturing global properties like bioactivity. However, they become increasingly compute-intensive for longer SMILES strings and struggle with chemical space exploration at higher sampling temperatures.</li>
</ul>
<p>Complex molecular properties like bioactivity can emerge from separated portions of a SMILES string (e.g., distant functional groups in the linear notation). Neither architecture fully addresses the need to learn these long-range dependencies while maintaining efficient, robust generation. The chemical space, estimated at up to $10^{60}$ small molecules, demands models that can both capture complex property relationships and explore diverse scaffolds efficiently.</p>
<h2 id="the-dual-nature-of-s4-convolution-meets-recurrence">The Dual Nature of S4: Convolution Meets Recurrence</h2>
<p>S4 models are built on discrete <a href="https://en.wikipedia.org/wiki/State-space_model">state space models</a>, which map an input sequence $\mathbf{u}$ to an output sequence $\mathbf{y}$ through learnable parameters $\overline{\mathbf{A}} \in \mathbb{R}^{N \times N}$, $\overline{\mathbf{B}} \in \mathbb{R}^{N \times 1}$, $\overline{\mathbf{C}} \in \mathbb{R}^{1 \times N}$, and $\overline{\mathbf{D}} \in \mathbb{R}^{1 \times 1}$:</p>
<p>$$
x_{k} = \overline{\mathbf{A}} x_{k-1} + \overline{\mathbf{B}} u_{k}
$$</p>
<p>$$
y_{k} = \overline{\mathbf{C}} x_{k} + \overline{\mathbf{D}} u_{k}
$$</p>
<p>This linear recurrence can equivalently be &ldquo;unrolled&rdquo; into a global convolution:</p>
<p>$$
\mathbf{y} = \mathbf{u} * \overline{\mathbf{K}}
$$</p>
<p>where $\overline{\mathbf{K}}$ is a convolution filter parameterized by $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, and $\overline{\mathbf{C}}$. This duality is the core innovation for CLMs:</p>
<ul>
<li><strong>Training</strong>: S4 uses the convolutional formulation to learn from entire SMILES sequences simultaneously, capturing global molecular properties.</li>
<li><strong>Generation</strong>: S4 switches to the recurrent formulation, producing SMILES tokens one at a time for efficient, robust chemical space exploration.</li>
</ul>
<p>S4 addresses the numerical instabilities of naive state space models through high-order polynomial projection operators (HiPPO) and reduction to the stable Cauchy kernel computation, enabling effective learning of long-range dependencies.</p>
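<p>The recurrence/convolution duality above can be verified numerically: stepping the state equations one input at a time and convolving with the unrolled filter $\overline{K}_j = \overline{\mathbf{C}}\,\overline{\mathbf{A}}^{j}\,\overline{\mathbf{B}}$ (plus the $\overline{\mathbf{D}}$ feedthrough) give the same output. This is a toy linear SSM with random matrices, not S4's HiPPO-initialized parameterization.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 20
A = rng.normal(size=(N, N))
A /= 2 * np.linalg.norm(A, 2)  # spectral norm 0.5, so the recurrence is stable
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
D = rng.normal(size=(1, 1))
u = rng.normal(size=T)

# Recurrent view (generation): step the hidden state one input at a time
x = np.zeros((N, 1))
y_rec = np.empty(T)
for k in range(T):
    x = A @ x + B * u[k]
    y_rec[k] = (C @ x + D * u[k]).item()

# Convolutional view (training): unroll the recurrence into a filter K
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(T)])
y_conv = np.array([sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(T)])
y_conv += D.item() * u  # feedthrough term
```

<p>In S4 itself, materializing $\overline{\mathbf{A}}^{j}$ this way would be unstable and slow; the HiPPO structure and Cauchy-kernel reduction exist precisely to compute $\overline{K}$ efficiently.</p>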
<p>For molecular ranking after fine-tuning, the log-likelihood score subtracts the pre-training likelihood to isolate target-specific information:</p>
<p>$$
\mathcal{L}_{\text{score}}(\mathbf{M}) = \mathcal{L}(\mathbf{M}_{\text{ft}}) - \mathcal{L}(\mathbf{M}_{\text{pt}})
$$</p>
<p>where $\mathcal{L}(\mathbf{M}_{\text{ft}})$ and $\mathcal{L}(\mathbf{M}_{\text{pt}})$ are the fine-tuned and pre-trained model log-likelihoods, respectively.</p>
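<p>The ranking step is then a one-liner: subtract the pre-training log-likelihood per molecule and sort descending. The molecule names and likelihood values below are hypothetical, purely to illustrate the scoring.</p>

```python
# Hypothetical per-molecule log-likelihoods under the fine-tuned (ft) and
# pre-trained (pt) models; higher score = more target-specific signal.
mols = ["mol_a", "mol_b", "mol_c"]
ll_ft = {"mol_a": -12.0, "mol_b": -9.5, "mol_c": -15.0}
ll_pt = {"mol_a": -11.0, "mol_b": -14.0, "mol_c": -15.5}

scores = {m: ll_ft[m] - ll_pt[m] for m in mols}
ranked = sorted(mols, key=lambda m: scores[m], reverse=True)
# mol_b gains the most likelihood from fine-tuning, so it ranks first
```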
<h2 id="benchmarking-s4-across-drug-discovery-tasks">Benchmarking S4 Across Drug Discovery Tasks</h2>
<h3 id="drug-like-molecule-generation">Drug-like molecule generation</h3>
<p>All three CLMs (S4, LSTM, GPT) were pre-trained on 1.9M canonical SMILES from ChEMBL v31 (molecules with fewer than 100 tokens). Each model generated 102,400 SMILES strings de novo.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>S4</td>
          <td>99,268 (97%)</td>
          <td>98,712 (96%)</td>
          <td>95,552 (93%)</td>
      </tr>
      <tr>
          <td>LSTM</td>
          <td>97,151 (95%)</td>
          <td>96,618 (94%)</td>
          <td>82,988 (81%)</td>
      </tr>
      <tr>
          <td>GPT</td>
          <td>93,580 (91%)</td>
          <td>93,263 (91%)</td>
          <td>91,590 (89%)</td>
      </tr>
  </tbody>
</table>
<p>S4 produces the most valid, unique, and novel molecules. Error analysis reveals that each architecture shows different failure modes: LSTMs struggle most with branching errors, GPTs with ring and bond assignment errors, while S4 generates fewer branching and ring errors but more bond assignment errors than LSTM. This pattern supports the hypothesis that S4 captures long-range dependencies (branching, ring opening/closure) better while local dependencies (bond assignment) are handled better by recurrent processing.</p>
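<p>The three headline metrics in the table reduce to simple set logic once a validity check is available. In the sketch below, a toy balanced-branches check stands in for a real SMILES parser (in practice one would parse with a cheminformatics toolkit such as RDKit), and the tiny generated batch and training set are illustrative only.</p>

```python
# Toy generated batch and training set; in practice these would be 102,400
# sampled SMILES and the ChEMBL pre-training corpus.
generated = ["CCO", "c1ccccc1", "CCO", "C1CC1", "CC(C", "CCN"]
training_set = {"CCO", "CCN"}

def is_valid(smi):
    # Stand-in validity check: branch parentheses must balance.
    depth = 0
    for ch in smi:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

valid = [s for s in generated if is_valid(s)]  # validity: parseable strings
unique = set(valid)                            # uniqueness: distinct valid strings
novel = unique - training_set                  # novelty: unseen during training
```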
<h3 id="bioactivity-learning-via-transfer-learning">Bioactivity learning via transfer learning</h3>
<p>Five fine-tuning campaigns were conducted on targets from the LIT-PCBA dataset: PKM2, <a href="https://en.wikipedia.org/wiki/Mitogen-activated_protein_kinase_1">MAPK1</a>, GBA, mTORC1, and TP53. After fine-tuning, models ranked held-out test molecules by learned log-likelihoods to evaluate bioactive compound prioritization.</p>
<p>S4 outperformed both benchmarks across targets. Wilcoxon signed-rank tests on pooled scores confirmed statistically significant superiority:</p>
<ul>
<li>S4 vs. LSTM: $p$ [top 10] = 8.41e-6, $p$ [top 50] = 2.93e-7, $p$ [top 100] = 1.45e-7</li>
<li>S4 vs. GPT: $p$ [top 10] = 2.33e-3, $p$ [top 50] = 3.72e-3, $p$ [top 100] = 2.61e-2</li>
</ul>
<p>TP53 was the most challenging target: no model consistently retrieved actives in the top 10, possibly due to <a href="/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/">activity cliffs</a> in the test set.</p>
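<p>A Wilcoxon signed-rank comparison of this kind can be reproduced with SciPy on paired per-run scores. The score vectors below are hypothetical (fractions of actives retrieved per run), not the paper's pooled data; the point is only the shape of the test.</p>

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired retrieval scores per fine-tuning run; the paper pools
# such scores across targets before testing.
s4 = np.array([0.42, 0.38, 0.51, 0.47, 0.44, 0.40, 0.49, 0.45])
lstm = np.array([0.35, 0.33, 0.44, 0.41, 0.36, 0.37, 0.43, 0.39])

stat, p_value = wilcoxon(s4, lstm, alternative="greater")  # one-sided: S4 > LSTM
```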
<h3 id="chemical-space-exploration-with-temperature-sampling">Chemical space exploration with temperature sampling</h3>
<p>Models were evaluated across sampling temperatures from $T = 1.0$ to $T = 2.0$ on three metrics: SMILES validity, rediscovery rate of known actives, and scaffold diversity. Key findings:</p>
<ul>
<li><strong>Validity</strong>: S4 and LSTM maintain higher validity than GPT at elevated temperatures (GPT median validity drops below 40% at high T).</li>
<li><strong>Rediscovery</strong>: S4 outperforms LSTM in rediscovering bioactive molecules at all temperatures.</li>
<li><strong>Scaffold diversity</strong>: LSTM achieves the highest number of unique scaffold clusters (median 6,602 at $T = 1.75$), with S4 a close second (6,520 clusters).</li>
</ul>
<p>S4 provides the best balance between bioactivity capture and structural diversity.</p>
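<p>Temperature sampling itself is a small transform on the next-token logits: dividing by $T$ before the softmax flattens the distribution as $T$ grows, which is what trades validity for exploration. A minimal sketch with toy logits (not a trained model's outputs):</p>

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Sample one token index from a temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([3.0, 1.0, 0.0, -1.0])  # toy next-token logits
rng = np.random.default_rng(0)
top_frac = {}
for T in (1.0, 2.0):
    draws = [sample_token(logits, T, rng) for _ in range(2000)]
    top_frac[T] = draws.count(0) / len(draws)  # share of the greedy token
# Higher T flattens the distribution, so the greedy token is picked less often.
```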
<h3 id="natural-product-design">Natural product design</h3>
<p>Models were trained on 32,360 large natural product SMILES (length &gt; 100 tokens) from the COCONUT database and used to generate 102,400 designs each.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>S4</th>
          <th>LSTM</th>
          <th>GPT</th>
          <th>Training Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>82,633 (81%)</td>
          <td>76,264 (74%)</td>
          <td>70,117 (68%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Unique</td>
          <td>53,293 (52%)</td>
          <td>51,326 (50%)</td>
          <td>50,487 (49%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Novel</td>
          <td>40,897 (40%)</td>
          <td>43,245 (42%)</td>
          <td>43,168 (42%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>NP-likeness</td>
          <td>1.6 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.6 +/- 0.7</td>
      </tr>
  </tbody>
</table>
<p>S4 designs the most valid molecules (6,000 to 12,000 more than benchmarks) and achieves significantly higher NP-likeness ($p = 1.41 \times 10^{-53}$ vs. LSTM, $p = 1.02 \times 10^{-82}$ vs. GPT). S4 also achieves the lowest Kolmogorov-Smirnov distances to the training/test distributions across multiple structural properties (sp3 carbons, aliphatic rings, spiro atoms, molecular weight, fused ring size, heavy atoms).</p>
<p>In terms of computational efficiency, S4 trains as fast as GPT (both approximately 1.3x faster than LSTM) and is the fastest of the three architectures at generation.</p>
<h3 id="prospective-mapk1-inhibitor-design">Prospective MAPK1 inhibitor design</h3>
<p>The pre-trained S4 model was fine-tuned on 68 manually curated MAPK1 inhibitors ($K_i &lt; 1 \mu M$) from ChEMBL v33. The last five fine-tuning epochs generated 256K molecules across five temperature values. After ranking and filtering by log-likelihood score and scaffold similarity, the top 10 designs were evaluated via <a href="/notes/chemistry/molecular-simulation/classical-methods/umbrella-sampling/">Umbrella Sampling</a> <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> simulations.</p>
<p>Eight out of ten designs showed high predicted affinity, with $\Delta G$ values ranging from $-10.3 \pm 0.6$ to $-23 \pm 4$ kcal/mol. These affinities are comparable to or exceed those of the closest known active neighbors ($\Delta G = -9.1 \pm 0.8$ to $-13 \pm 2$ kcal/mol). The most potent predicted design (molecule 2, $\Delta G = -23 \pm 4$ kcal/mol) engages extensively with the MAPK1 binding pocket, though synthetic accessibility may be limited. Several designs incorporate halogen substitutions favorable for MAPK1 inhibition, consistent with known structure-activity relationships.</p>
<h2 id="s4-combines-the-best-of-lstms-and-gpts-for-molecular-design">S4 Combines the Best of LSTMs and GPTs for Molecular Design</h2>
<p>The main findings of this study are:</p>
<ol>
<li><strong>S4 outperforms both LSTM and GPT</strong> in learning complex molecular properties like bioactivity, while maintaining competitive or superior performance in syntax learning and chemical space exploration.</li>
<li><strong>The dual formulation is key</strong>: holistic training (convolution) enables better capture of global molecular properties, while recurrent generation preserves robust chemical syntax and diverse scaffold exploration.</li>
<li><strong>S4 is especially strong for longer sequences</strong>: natural product design (SMILES &gt; 100 tokens) shows the largest advantages over benchmarks in validity and property matching.</li>
<li><strong>Prospective validation</strong>: 8/10 S4-designed MAPK1 inhibitors are predicted as highly active by molecular dynamics, with affinities comparable to or exceeding known actives.</li>
</ol>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>All evaluations are computational; no wet-lab experimental validation is reported.</li>
<li>Bioactivity evaluation relies on likelihood-based ranking, which is an indirect proxy.</li>
<li>The MD simulations, while more rigorous than simple docking, still represent in silico predictions.</li>
<li>SMILES augmentation and improved ranking protocols could further boost performance.</li>
</ul>
<p><strong>Future directions</strong> include application to macrocyclic peptides and protein sequences, organic reaction planning, structure-based drug design, and integration with wet-lab experimental validation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v31</td>
          <td>1.9M SMILES</td>
          <td>Molecules with SMILES length &lt;= 100 tokens</td>
      </tr>
      <tr>
          <td>Fine-tuning (bioactivity)</td>
          <td>LIT-PCBA (5 targets)</td>
          <td>11-56 actives + ~10K inactives per target</td>
          <td>PKM2, MAPK1, GBA, mTORC1, TP53</td>
      </tr>
      <tr>
          <td>Natural product training</td>
          <td>COCONUT</td>
          <td>32,360 SMILES</td>
          <td>SMILES length &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>Prospective fine-tuning</td>
          <td>ChEMBL v33 (MAPK1)</td>
          <td>68 inhibitors</td>
          <td>$K_i &lt; 1 \mu M$, target ID CHEMBL4040</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: next-token prediction on SMILES strings</li>
<li>Fine-tuning: transfer learning with early stopping (patience 5, tolerance $10^{-5}$)</li>
<li>Molecule ranking: log-likelihood scoring with pre-training bias subtraction (Eq. 5)</li>
<li>Temperature sampling: $T$ from 1.0 to 2.0 (step 0.25) for chemical space exploration</li>
</ul>
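<p>The temperature sweep can be illustrated with a minimal softmax sketch. This is a didactic, pure-Python illustration (the logits are made up, not from the paper); it only shows how raising $T$ flattens the next-token distribution to broaden chemical space exploration.</p>

```python
import math

def temperature_sample_probs(logits, temperature):
    """Convert raw token logits into a sampling distribution at a given temperature.

    Higher temperatures flatten the distribution, encouraging more exploratory
    SMILES generation; T = 1.0 recovers the plain softmax.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Sweep T from 1.0 to 2.0 in steps of 0.25, as in the exploration protocol.
logits = [2.0, 1.0, 0.1]
distributions = {t / 4: temperature_sample_probs(logits, t / 4)
                 for t in range(4, 9)}
```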
<h3 id="models">Models</h3>
<ul>
<li><strong>S4</strong>: Structured state space sequence model with HiPPO initialization; hyperparameter search over 242 + 108 configurations</li>
<li><strong>LSTM</strong>: 40 configurations optimized via random search</li>
<li><strong>GPT</strong>: 35 configurations optimized via random search</li>
<li>All models share the same pre-training data and fine-tuning protocol for fair comparison</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (ChEMBL)</td>
          <td>S4</td>
          <td>97%</td>
          <td>Out of 102,400 generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness (ChEMBL)</td>
          <td>S4</td>
          <td>96%</td>
          <td>Among valid designs</td>
      </tr>
      <tr>
          <td>Novelty (ChEMBL)</td>
          <td>S4</td>
          <td>93%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Bioactivity ranking (top 10)</td>
          <td>S4</td>
          <td>Significant (p = 8.41e-6 vs LSTM)</td>
          <td>Wilcoxon signed-rank test</td>
      </tr>
      <tr>
          <td>NP validity</td>
          <td>S4</td>
          <td>81%</td>
          <td>COCONUT, SMILES &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>MAPK1 inhibitor success</td>
          <td>S4</td>
          <td>8/10 designs active</td>
          <td>Validated by MD (Umbrella Sampling)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Hyperparameter search: NVIDIA A100 40GB GPUs</li>
<li>LSTM/GPT search: 5 days on a single A100</li>
<li>S4 search: 10 days on multiple A100 GPUs</li>
<li>MD simulations: Dutch supercomputer Snellius; 1.2-1.6 microseconds per ligand (<a href="/notes/chemistry/molecular-simulation/classical-methods/umbrella-sampling/">Umbrella Sampling</a>)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/s4-for-de-novo-drug-design">S4 for de novo drug design</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with data and trained models</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.12666371">Zenodo archive</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Source data and molecule designs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ozcelik, R., de Ruiter, S., Criscuolo, E., &amp; Grisoni, F. (2024). Chemical language modeling with structured state space sequence models. <em>Nature Communications</em>, 15, 6176.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ozcelik2024chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language modeling with structured state space sequence models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{{\&#34;O}z{\c{c}}elik, R{\i}za and de Ruiter, Sarah and Criscuolo, Emanuele and Grisoni, Francesca}</span>,

</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6176}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-50469-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Protein-to-Drug Molecule Translation via Transformer</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/</guid><description>A Transformer model frames protein-targeted drug generation as machine translation from amino acid sequences to SMILES molecular strings.</description><content:encoded><![CDATA[<h2 id="protein-targeted-drug-generation-as-machine-translation">Protein-Targeted Drug Generation as Machine Translation</h2>
<p>This is a <strong>Method</strong> paper that proposes using the Transformer neural network architecture for protein-specific de novo drug generation. The primary contribution is framing the problem of generating molecules that bind to a target protein as a machine translation task: translating from the &ldquo;language&rdquo; of amino acid sequences to the SMILES representation of candidate drug molecules. The model takes only a protein&rsquo;s amino acid sequence as input and generates novel molecules with predicted binding affinity, requiring no prior knowledge of active ligands, physicochemical descriptors, or the protein&rsquo;s three-dimensional structure.</p>
<h2 id="limitations-of-existing-generative-drug-design-approaches">Limitations of Existing Generative Drug Design Approaches</h2>
<p>Existing deep learning methods for de novo molecule generation suffer from several limitations. Most RNN-based approaches require a library of known active compounds against the target protein to fine-tune the generator or train a reward predictor for reinforcement learning. Structure-based drug design methods require the three-dimensional structure of the target protein, which can be costly and technically difficult to obtain through protein expression, purification, and crystallization. Autoencoder-based approaches (variational and adversarial) similarly depend on prior knowledge of protein binders or their physicochemical characteristics.</p>
<p>The estimated drug-like molecule space is on the order of $10^{60}$, while only around $10^{8}$ compounds have been synthesized. High-throughput screening is expensive and time-consuming, and virtual screening operates only on known molecules. Computational de novo design methods often generate molecules that are hard to synthesize or restrict accessible chemical space through coded rules. A method that requires only a protein&rsquo;s amino acid sequence would substantially simplify the initial stages of drug discovery, particularly for targets with limited or no information about inhibitors and 3D structure.</p>
<h2 id="sequence-to-sequence-translation-with-self-attention">Sequence-to-Sequence Translation with Self-Attention</h2>
<p>The core insight is to treat protein-targeted drug generation as a translation problem between two &ldquo;languages,&rdquo; applying the Transformer architecture that had demonstrated strong results in neural machine translation. The encoder maps a protein amino acid sequence $(a_1, \ldots, a_n)$ to continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$, and the decoder autoregressively generates a SMILES string conditioned on $\mathbf{z}$.</p>
<p>The self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimensionality of the keys and $\sqrt{d_k}$ acts as a scaling factor. Multihead attention runs $h$ parallel attention heads:</p>
<p>$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$</p>
<p>$$
\text{Multihead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$</p>
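<p>The attention formula above can be sketched directly. This is a didactic pure-Python version for a single head and small lists of vectors (production implementations batch this with tensor libraries, and the projection matrices $W^Q, W^K, W^V, W^O$ are omitted here):</p>

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats); each output row is a
    convex combination of the rows of V.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```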
<p>Positional encoding uses sinusoidal functions:</p>
<p>$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
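<p>The sinusoidal encoding is easy to reproduce for a single position; the sketch below follows the two formulas above (sine on even dimensions, cosine on odd dimensions):</p>

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding for one sequence position.

    Even indices 2i get sin(pos / 10000^(2i/d_model)); odd indices 2i+1
    get cos of the same angle.
    """
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe
```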
<p>The self-attention mechanism is particularly well-suited for this task for two reasons. First, protein sequences can be much longer than SMILES strings (dozens of times longer), making the ability to capture long-range dependencies essential. Second, three-dimensional structural features of the binding pocket may be formed by amino acid residues far apart in the linear sequence, and multihead attention can jointly attend to different positional aspects simultaneously.</p>
<h2 id="data-model-architecture-and-docking-evaluation">Data, Model Architecture, and Docking Evaluation</h2>
<h3 id="data">Data</h3>
<p>The training data was retrieved from BindingDB, filtering for interactions between proteins from Homo sapiens, Rattus norvegicus, Mus musculus, and Bos taurus with binding affinity below 100 nM (IC50, Kd, or EC50). After filtering for valid PubChem CIDs, SMILES representations, UniProt IDs, molecular weight under 1000 Da, and amino acid sequence lengths between 80 and 2050, the final dataset contained 238,147 records with 1,613 unique proteins and 154,924 unique ligand SMILES strings.</p>
<p>Five Monte Carlo cross-validation splits were created, with the constraint that test set proteins share less than 20% sequence similarity with training set proteins (measured via <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment).</p>
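<p>A score-only Needleman-Wunsch sketch shows the dynamic program behind the split constraint. This is a minimal illustration with unit match/mismatch/gap scores; the paper's EMBOSS-based alignment uses a substitution matrix and affine gap penalties, and normalizes the result into a percent similarity.</p>

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two sequences via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Boundary: aligning a prefix against the empty sequence costs all gaps.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]
```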
<h3 id="model-configuration">Model Configuration</h3>
<p>The model uses the original Transformer implementation via the tensor2tensor library with:</p>
<ul>
<li>4 encoder/decoder layers of size 128</li>
<li>4 attention heads</li>
<li>Adam optimizer with learning rate decay from the original Transformer paper</li>
<li>Batch size of 4,096 tokens</li>
<li>Training for 600K epochs on a single GPU in Google Colaboratory</li>
<li>Vocabulary of 71 symbols (character-level tokenization)</li>
</ul>
<p>Beam search decoding was used with two modes: beam size 4 keeping only the top-1 result (&ldquo;one per one&rdquo; mode) and beam size 10 keeping all 10 results (&ldquo;ten per one&rdquo; mode).</p>
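<p>The two decoding modes differ only in the beam width and how many finished beams are kept. A minimal beam-search sketch over an arbitrary next-token log-probability function (the <code>^</code>/<code>$</code> start and end markers and the toy scoring interface are illustrative assumptions, not the paper's tokenization):</p>

```python
import math

def beam_search(next_token_logprobs, beam_size, max_len, bos="^", eos="$"):
    """Keep the `beam_size` highest log-probability partial sequences per step.

    `next_token_logprobs(tokens)` returns a dict mapping each candidate next
    token to its log-probability; finished beams end with `eos`.
    """
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                candidates.append((tokens, score))  # keep finished beams as-is
                continue
            for tok, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams
```

With <code>beam_size=4</code> and only the top beam kept, this corresponds to the &ldquo;one per one&rdquo; mode; keeping all 10 beams at <code>beam_size=10</code> gives &ldquo;ten per one.&rdquo;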
<h3 id="chemical-validity-and-uniqueness">Chemical Validity and Uniqueness</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>One per One (avg)</th>
          <th>Ten per One (avg)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES (%)</td>
          <td>90.2</td>
          <td>82.6</td>
      </tr>
      <tr>
          <td>Unique SMILES (%)</td>
          <td>92.3</td>
          <td>81.7</td>
      </tr>
      <tr>
          <td>ZINC15 match (%)</td>
          <td>30.6</td>
          <td>17.1</td>
      </tr>
  </tbody>
</table>
<h3 id="docking-evaluation">Docking Evaluation</h3>
<p>To assess binding affinity, the authors selected two receptor tyrosine kinases from the test set (IGF-1R and VEGFR2) and performed molecular docking with <a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">SMINA</a>. Four sets of ligands were compared: known binders, randomly selected compounds, molecules generated for the target protein, and molecules generated for other targets (cross-docking control).</p>
<p>ROC-AUC analysis showed that the docking tool classified generated molecules for the correct target as binders at rates comparable to known binders. For the best-discriminating structures (PDB 3O23 for IGF-1R, PDB 3BE2 for VEGFR2), Mann-Whitney U tests confirmed statistically significant differences between generated-for-target molecules and random compounds, while the difference between generated-for-target and known binders was not significant (p = 0.40 and 0.26 respectively), suggesting the model generates plausible binders.</p>
<h3 id="drug-likeness-properties">Drug-Likeness Properties</h3>
<p>Generated molecules were evaluated against <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and other drug-likeness criteria:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Constraint</th>
          <th>One per One (%)</th>
          <th>Ten per One (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>&lt; 5</td>
          <td>84.4</td>
          <td>85.6</td>
      </tr>
      <tr>
          <td>Molecular weight</td>
          <td>&lt; 500 Da</td>
          <td>95.8</td>
          <td>88.9</td>
      </tr>
      <tr>
          <td>H-bond donors</td>
          <td>&lt; 5</td>
          <td>95.8</td>
          <td>91.9</td>
      </tr>
      <tr>
          <td>H-bond acceptors</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>Rotatable bonds</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>91.2</td>
      </tr>
      <tr>
          <td>TPSA</td>
          <td>&lt; 140</td>
          <td>98.0</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>&lt; 6</td>
          <td>99.9</td>
          <td>100.0</td>
      </tr>
  </tbody>
</table>
<p>Mean QED values were 0.66 +/- 0.19 (one per one) and 0.58 +/- 0.21 (ten per one).</p>
<h3 id="structural-novelty">Structural Novelty</h3>
<p>Tanimoto similarity analysis showed that only 8% of generated structures had similarity above the threshold (&gt; 0.85) to training compounds. The majority (51%) had Tanimoto scores below 0.5. The mean nearest-neighbor Tanimoto similarity of generated molecules to the training set (0.54 +/- 0.17 in one-per-one mode) was substantially lower than the mean within-training-set similarity (0.74 +/- 0.14), indicating the model generates structurally diverse molecules outside the training distribution.</p>
<h2 id="generated-molecules-show-drug-like-properties-and-predicted-binding">Generated Molecules Show Drug-Like Properties and Predicted Binding</h2>
<p>The model generates roughly 90% chemically valid SMILES in one-per-one mode, with 92% uniqueness. Docking simulations on IGF-1R and VEGFR2 suggest that generated molecules for the correct target are statistically indistinguishable from known binders, while molecules generated for other targets behave more like random compounds. Drug-likeness properties fall within acceptable ranges for the vast majority of generated compounds.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Only two protein targets were analyzed via docking due to computational constraints, and the analysis was limited to proteins with a single well-known druggable binding pocket.</li>
<li>Beam search produces molecules that differ only slightly; diverse beam search or coupling with variational/adversarial autoencoders could improve diversity.</li>
<li>The fraction of molecules matching the ZINC15 database (30.6% in one-per-one mode) could potentially be reduced by pretraining on a larger compound set (e.g., ChEMBL&rsquo;s 1.5 million molecules).</li>
<li>Model interpretability remains limited and is identified as important future work.</li>
<li>The approach is a proof of concept and requires further validation via in vitro assays across diverse protein targets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Test</td>
          <td>BindingDB (filtered)</td>
          <td>238,147 records</td>
          <td>1,613 unique proteins, 154,924 unique SMILES; IC50/Kd/EC50 &lt; 100 nM</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>11 (IGF-1R), 20 (VEGFR2)</td>
          <td>SMINA docking with default settings</td>
      </tr>
      <tr>
          <td>Database matching</td>
          <td>ZINC15</td>
          <td>N/A</td>
          <td>Used for novelty assessment</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer (encoder-decoder) via tensor2tensor library</li>
<li>Beam search decoding (beam sizes 4 and 10)</li>
<li>Needleman-Wunsch global alignment for protein sequence similarity (EMBOSS)</li>
<li>SMINA for molecular docking</li>
<li>RDKit for validity checking, property calculation, and canonicalization</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4 layers, 128 hidden size, 4 attention heads</li>
<li>Character-level tokenization with 71-symbol vocabulary</li>
<li>5-fold Monte Carlo cross-validation with &lt; 20% sequence similarity between train/test proteins</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES</td>
          <td>90.2% (1-per-1), 82.6% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>Unique SMILES</td>
          <td>92.3% (1-per-1), 81.7% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>ZINC15 match</td>
          <td>30.6% (1-per-1), 17.1% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.66 +/- 0.19 (1-per-1), 0.58 +/- 0.21 (10-per-1)</td>
          <td>Drug-likeness score</td>
      </tr>
      <tr>
          <td>SAS compliance</td>
          <td>99.9% (1-per-1), 100% (10-per-1)</td>
          <td>SAS &lt; 6</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Google Colaboratory with one GPU</li>
<li>Training for 600K epochs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dariagrechishnikova/molecule_structure_generation">molecule_structure_generation</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation using tensor2tensor</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grechishnikova, D. (2021). Transformer neural network for protein-specific de novo drug generation as a machine translation problem. <em>Scientific Reports</em>, 11, 321. <a href="https://doi.org/10.1038/s41598-020-79682-4">https://doi.org/10.1038/s41598-020-79682-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grechishnikova2021transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer neural network for protein-specific de novo drug generation as a machine translation problem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grechishnikova, Daria}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{321}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-020-79682-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PrefixMol: Prefix Embeddings for Drug Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/prefixmol-target-chemistry-aware-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/prefixmol-target-chemistry-aware-generation/</guid><description>PrefixMol uses prefix embeddings in a GPT SMILES generator to jointly condition on protein pockets and chemical properties for drug design.</description><content:encoded><![CDATA[<h2 id="unified-multi-conditional-molecular-generation">Unified Multi-Conditional Molecular Generation</h2>
<p>PrefixMol is a <strong>Method</strong> paper that introduces a unified generative model for structure-based drug design that simultaneously conditions on protein binding pockets and multiple chemical properties. The primary contribution is a prefix-embedding mechanism, borrowed from NLP multi-task learning, that represents each condition (pocket geometry, Vina score, QED, SA, LogP, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski</a>) as a learnable feature vector prepended to the input sequence of a GPT-based <a href="/notes/chemistry/molecular-representations/notations/smiles-original-paper/">SMILES</a> generator. This allows a single model to handle customized multi-conditional generation without the negative transfer that typically arises from merging separate task-specific models.</p>
<h2 id="bridging-target-aware-and-chemistry-aware-molecular-design">Bridging Target-Aware and Chemistry-Aware Molecular Design</h2>
<p>Prior structure-based drug design methods (e.g., Pocket2Mol, GraphBP) generate molecules conditioned on protein binding pockets but impose no constraints on the chemical properties of the output. Conversely, controllable molecule generation methods (e.g., <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a>, CMG) can steer chemical properties but ignore protein-ligand interactions. Merging these two objectives into a single model is difficult for two reasons:</p>
<ol>
<li><strong>Data scarcity</strong>: Few datasets contain both protein-ligand binding affinity data and comprehensive molecular property annotations.</li>
<li><strong>Negative transfer</strong>: Treating each condition as a separate task in a multi-task framework can hurt overall performance when tasks conflict.</li>
</ol>
<p>PrefixMol addresses both problems by extending the CrossDocked dataset with molecular property labels and using a parameter-efficient prefix conditioning strategy that decouples task-specific knowledge from the shared generative backbone.</p>
<h2 id="prefix-conditioning-in-attention-layers">Prefix Conditioning in Attention Layers</h2>
<p>The core innovation adapts prefix-tuning from NLP to molecular generation. Given a GPT transformer that generates SMILES token-by-token, PrefixMol prepends $n_c$ learnable condition vectors $\mathbf{p}_{\phi} \in \mathbb{R}^{n_c \times d}$ to the left of the sequence embedding $\mathbf{x} \in \mathbb{R}^{l \times d}$, forming an extended input $\mathbf{x}' = [\text{PREFIX}; \mathbf{x}]$.</p>
<p>The output of each position is:</p>
<p>$$
h_i = \begin{cases} p_{\phi,i}, &amp; \text{if } i &lt; n_c \\ \text{LM}_\theta(x_i', h_{&lt;i}), &amp; \text{otherwise} \end{cases}
$$</p>
<p>Because the prefix features always sit to the left, the causal attention mask ensures they influence all subsequent token predictions. The key insight is that the attention mechanism decomposes into a weighted sum of self-attention and prefix attention:</p>
<p>$$
\begin{aligned}
\text{head} &amp;= (1 - \lambda(\mathbf{x})) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{c}\mathbf{W}_k, \mathbf{c}\mathbf{W}_v)}_{\text{self-attention}} \\
&amp;\quad + \lambda(\mathbf{x}) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{p}_\phi\mathbf{W}_k, \mathbf{p}_\phi\mathbf{W}_v)}_{\text{prefix attention}}
\end{aligned}
$$</p>
<p>where $\lambda(\mathbf{x})$ is a scalar representing the normalized attention weight on the prefix positions. This decomposition shows that conditions modulate generation through an additive attention pathway, and the activation map $\text{softmax}(\mathbf{x}\mathbf{W}_q \mathbf{W}_k^\top \mathbf{p}_\phi^\top)$ directly reveals how each condition steers model behavior.</p>
<p><strong>Condition correlation</strong> is similarly revealed. For the prefix features themselves, the causal mask zeros out the cross-attention to the sequence, leaving only the prefix self-correlation term:</p>
<p>$$
\text{head} = \text{Attn}(\mathbf{p}_\phi \mathbf{W}_q, \mathbf{p}_\phi \mathbf{W}_k, \mathbf{p}_\phi \mathbf{W}_v)
$$</p>
<p>The attention map $\mathbf{A}(\mathbf{p}_\phi)$ from this term encodes how conditions relate to one another.</p>
<h3 id="condition-encoders">Condition Encoders</h3>
<p>Each condition has a dedicated encoder:</p>
<ul>
<li><strong>3D Pocket</strong>: A Geometric Vector Transformer (GVF) processes the binding pocket as a 3D graph with SE(3)-equivariant node and edge features. GVF extends GVP-GNN with a global attention module over geometric features. A position-aware attention mechanism with radial basis functions produces the pocket embedding.</li>
<li><strong>Chemical properties</strong>: Separate MLPs embed each scalar property (Vina, QED, SA, LogP, Lipinski) into the shared $d$-dimensional space.</li>
</ul>
<h3 id="training-objective">Training Objective</h3>
<p>PrefixMol is trained with two losses. The auto-regressive loss is:</p>
<p>$$
\mathcal{L}_{AT} = -\sum_{1 &lt; i \leq t} \log p_{\phi, \theta}(x_i \mid \mathbf{x}_{&lt;i}, \mathbf{p}_\phi)
$$</p>
<p>A triplet property prediction loss encourages generated molecules to match desired properties:</p>
<p>$$
\mathcal{L}_{Pred} = \max\left((\hat{\mathbf{c}} - \mathbf{c})^2 - (\hat{\mathbf{c}} - \dot{\mathbf{c}})^2, 0\right)
$$</p>
<p>where $\mathbf{c}$ is the input condition, $\hat{\mathbf{c}}$ is predicted by an MLP head, and $\dot{\mathbf{c}}$ is computed by RDKit from the generated SMILES (gradient is propagated through $\hat{\mathbf{c}}$ since RDKit is non-differentiable).</p>
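<p>The hinge structure of this loss is easy to see element-wise: it is zero whenever the prediction $\hat{\mathbf{c}}$ is at least as close to the RDKit-computed value $\dot{\mathbf{c}}$ as to the input condition $\mathbf{c}$, and positive otherwise. A minimal per-condition sketch (the vectorization over conditions is an assumption for illustration):</p>

```python
def triplet_property_loss(c, c_hat, c_dot):
    """Element-wise hinge max((c_hat - c)^2 - (c_hat - c_dot)^2, 0).

    c: requested condition values, c_hat: MLP-head predictions,
    c_dot: property values computed from the generated SMILES.
    """
    return [max((h - t) ** 2 - (h - d) ** 2, 0.0)
            for t, h, d in zip(c, c_hat, c_dot)]
```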
<h2 id="experimental-setup-and-controllability-evaluation">Experimental Setup and Controllability Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use the CrossDocked dataset (22.5 million protein-ligand structures) with chemical properties appended for each ligand. Data splitting and evaluation follow Pocket2Mol and Masuda et al.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li><strong>Vina score</strong> (binding affinity, computed by QVina after UFF refinement)</li>
<li><strong>QED</strong> (quantitative estimate of drug-likeness, 0-1)</li>
<li><strong>SA</strong> (synthetic accessibility, 0-1)</li>
<li><strong>LogP</strong> (octanol-water partition coefficient)</li>
<li><strong>Lipinski</strong> (rule-of-five compliance count)</li>
<li><strong>High Affinity</strong> (fraction of pockets where generated molecules match or exceed test set affinities)</li>
<li><strong>Diversity</strong> (average pairwise Tanimoto distance over Morgan fingerprints)</li>
<li><strong>Sim.Train</strong> (maximum Tanimoto similarity to training set)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>Unconditional comparison against CVAE, AR (Luo et al. 2021a), and Pocket2Mol.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Unconditional generation</strong> (Table 1): PrefixMol without conditions achieves sub-optimal results on Vina (-6.532), QED (0.551), SA (0.750), and LogP (1.415) compared to Pocket2Mol. However, it substantially outperforms all baselines on diversity (0.856 vs. 0.688 for Pocket2Mol) and novelty (Sim.Train of 0.239 vs. 0.376), indicating it generates genuinely novel molecules rather than memorizing training data.</p>
<p><strong>Single-property control</strong> (Table 2): Molecular properties are positively correlated with conditional inputs across VINA, QED, SA, LogP, and Lipinski. With favorable control scales, PrefixMol surpasses Pocket2Mol on QED (0.767 vs. 0.563), SA (0.924 vs. 0.765), and LogP. The Vina score also improves when QED or LogP conditions are increased (e.g., -7.733 at QED control scale +2), revealing coupling between conditions.</p>
<p><strong>Multi-property control</strong> (Table 3): Jointly adjusting all five conditions shows consistent positive relationships. For example, at control scale +4, QED reaches 0.722, SA reaches 0.913, and Lipinski saturates at 5.0. Joint QED+SA control at +2.0 achieves Lipinski = 5.0, confirming that certain properties are coupled.</p>
<h3 id="condition-relation-analysis">Condition Relation Analysis</h3>
<p>By computing partial derivatives of the prefix attention map with respect to each condition, the authors construct a relation matrix $\mathbf{R} = \sum_{i=2}^{6} |\partial \mathbf{A} / \partial c_i|$. Key findings:</p>
<ul>
<li><strong>Vina is weakly self-controllable</strong> but strongly influenced by QED, LogP, and SA, explaining why multi-condition control improves binding affinity even when Vina alone responds poorly.</li>
<li><strong>LogP and QED</strong> are the most correlated property pair.</li>
<li><strong>Lipinski is coupled to QED and SA</strong>, saturating at 5.0 when both QED and SA control scales reach +2.</li>
</ul>
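<p>The relation matrix is approximated with first-order finite differences ($\Delta = 1$), as noted in the limitations. A minimal sketch of one row of that computation, with a toy linear attention function standing in for the model (the real $\mathbf{A}$ comes from the prefix attention map):</p>

```python
def relation_row(attention_fn, conditions, delta=1.0):
    """Finite-difference |dA/dc_i| for each condition c_i, summed over
    the entries of the attention map. `attention_fn` maps a condition
    vector to a flat attention map (list of floats)."""
    base = attention_fn(conditions)
    row = []
    for i in range(len(conditions)):
        bumped = list(conditions)
        bumped[i] += delta
        pert = attention_fn(bumped)
        # Sum of absolute first-order differences over the map.
        row.append(sum(abs(p - b) for p, b in zip(pert, base)) / delta)
    return row


# Toy "attention map": two entries depending linearly on c0 and c1.
toy = lambda c: [0.5 * c[0] + 0.1 * c[1], 0.2 * c[1]]
print([round(v, 3) for v in relation_row(toy, [1.0, 1.0])])  # → [0.5, 0.3]
```

<p>Because the toy map is linear, the finite difference recovers the true derivatives exactly; for the real network it is only a first-order approximation, which is the limitation the authors flag.</p>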
<h2 id="key-findings-limitations-and-interpretability-insights">Key Findings, Limitations, and Interpretability Insights</h2>
<p>PrefixMol demonstrates that prefix embedding is an effective strategy for unifying target-aware and chemistry-aware molecular generation. The main findings are:</p>
<ol>
<li>A single prefix-conditioned GPT model can control multiple chemical properties simultaneously while targeting specific protein pockets.</li>
<li>Multi-conditional generation outperforms unconditional baselines in drug-likeness metrics, and the controllability enables PrefixMol to surpass Pocket2Mol on QED, SA, and LogP.</li>
<li>The attention mechanism provides interpretable coupling relationships between conditions, offering practical guidance (e.g., improving QED indirectly improves Vina).</li>
</ol>
<p><strong>Limitations</strong>: The paper does not report validity rates for generated SMILES. The unconditional model underperforms Pocket2Mol on binding affinity (Vina), suggesting that generating 2D SMILES strings and relying on post hoc 3D conformer generation may be less effective than direct atom-by-atom 3D generation for binding affinity optimization. The condition relation analysis uses a first-order finite difference approximation ($\Delta = 1$), which may not capture nonlinear interactions. No external validation on prospective drug discovery tasks is provided. Hardware and training time details are not reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training / Evaluation</td>
          <td>CrossDocked (extended)</td>
          <td>22.5M protein-ligand structures</td>
          <td>Extended with molecular properties (QED, SA, LogP, Lipinski, Vina)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-based auto-regressive SMILES generation with prefix conditioning</li>
<li>GVF (Geometric Vector Transformer) for 3D pocket encoding, extending GVP-GNN with global attention</li>
<li>Separate MLP encoders for each chemical property</li>
<li>Triplet property prediction loss with non-differentiable RDKit-computed properties</li>
<li>QVina for Vina score computation with UFF refinement</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT transformer backbone for SMILES generation</li>
<li>6 prefix condition vectors ($n_c = 6$): Pocket, Vina, QED, SA, LogP, Lipinski</li>
<li>Specific architectural hyperparameters (hidden dimension, number of layers, heads) not reported in the paper</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PrefixMol (unconditional)</th>
          <th>Pocket2Mol</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vina (kcal/mol)</td>
          <td>-6.532</td>
          <td>-7.288</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.551</td>
          <td>0.563</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>SA</td>
          <td>0.750</td>
          <td>0.765</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.856</td>
          <td>0.688</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Sim.Train</td>
          <td>0.239</td>
          <td>0.376</td>
          <td>Lower is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/A4Bio/PrefixMol">PrefixMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, Z., Hu, Y., Tan, C., &amp; Li, S. Z. (2023). PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. <em>arXiv preprint arXiv:2302.07120</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gao2023prefixmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Zhangyang and Hu, Yuqi and Tan, Cheng and Li, Stan Z.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2302.07120}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PASITHEA: Gradient-Based Molecular Design via Dreaming</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/deep-molecular-dreaming-pasithea/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/deep-molecular-dreaming-pasithea/</guid><description>PASITHEA applies inceptionism to molecular design, using gradient-based optimization on SELFIES representations to generate molecules with target properties.</description><content:encoded><![CDATA[<h2 id="inceptionism-applied-to-molecular-inverse-design">Inceptionism Applied to Molecular Inverse Design</h2>
<p>This is a <strong>Method</strong> paper that introduces PASITHEA, a gradient-based approach to de-novo molecular design inspired by inceptionism (deep dreaming) techniques from computer vision. The core contribution is a direct optimization framework that modifies molecular structures by backpropagating through a trained property-prediction network, with the molecular input (rather than weights) serving as the optimizable variable. PASITHEA is enabled by SELFIES, a surjective molecular string representation that guarantees 100% validity of generated molecules.</p>
<h2 id="the-need-for-direct-gradient-based-molecular-optimization">The Need for Direct Gradient-Based Molecular Optimization</h2>
<p>Existing inverse molecular design methods, including variational autoencoders (VAEs), generative adversarial networks (GANs), reinforcement learning (RL), and genetic algorithms (GAs), share a common characteristic: they optimize molecules indirectly. VAEs and GANs learn distributions and scan latent spaces. RL agents learn policies from environmental rewards. GAs iteratively apply mutations and selections. None of these approaches directly maximize an objective function in a gradient-based manner with respect to the molecular representation itself.</p>
<p>This indirection has several consequences. VAE-based methods require learning a latent space, and the optimization happens in that space rather than directly on molecular structures. RL and GA methods require expensive function evaluations for each candidate molecule. The authors identify an opportunity to exploit gradients more directly by reversing the learning process of a neural network trained to predict molecular properties, thereby sidestepping latent spaces, policies, and population-based search entirely.</p>
<p>A second motivation is interpretability. By operating directly on the molecular representation (rather than a learned latent space), PASITHEA can reveal what a regression network has learned about structure-property relationships, a capability the authors frame as analogous to how deep dreaming reveals what image classifiers have learned about visual features.</p>
<h2 id="core-innovation-inverting-regression-networks-on-selfies">Core Innovation: Inverting Regression Networks on SELFIES</h2>
<p>PASITHEA&rsquo;s key insight is a two-phase training procedure that repurposes the standard neural network training loop for molecule generation.</p>
<p><strong>Phase 1: Prediction training.</strong> A fully connected neural network is trained to predict a real-valued chemical property (logP) from one-hot encoded SELFIES strings. The standard feedforward and backpropagation process updates the network weights to minimize mean squared error between predicted and ground-truth property values:</p>
<p>$$
\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} (f_{\theta}(\mathbf{x}_i) - y_i)^2
$$</p>
<p>where $f_{\theta}$ is the neural network with parameters $\theta$, $\mathbf{x}_i$ is the one-hot encoded SELFIES input, and $y_i$ is the target logP value.</p>
<p><strong>Phase 2: Inverse training (deep dreaming).</strong> The network weights $\theta$ are frozen. For a given input molecule $\mathbf{x}$ and a desired target property value $y_{\text{target}}$, the gradients are computed with respect to the input representation rather than the weights:</p>
<p>$$
\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} \mathcal{L}(f_{\theta}(\mathbf{x}), y_{\text{target}})
$$</p>
<p>This gradient descent on the input incrementally modifies the one-hot encoding of the molecular string, transforming it toward a structure whose predicted property matches the target value. At each step, the argmax function converts the continuous one-hot encoding back to a discrete SELFIES string, which always maps to a valid molecular graph due to the surjective property of SELFIES.</p>
<p><strong>The role of SELFIES.</strong> The surjective mapping from strings to molecular graphs is essential. With SMILES, intermediate strings during optimization can become syntactically invalid (e.g., an unclosed ring like &ldquo;CCCC1CCCCC&rdquo;), producing no valid molecule. SELFIES enforces constraints that guarantee every string maps to a valid molecular graph, making the continuous gradient-based optimization feasible.</p>
<p><strong>Input noise injection.</strong> Because inverse training transforms a one-hot encoding from binary values to real numbers, the discrete-to-continuous transition can cause convergence problems. The authors address this by initializing the input with noise: every zero in the one-hot encoding is replaced by a random number in $[0, k]$, where $k$ is a hyperparameter between 0.5 and 0.95. This smooths the optimization landscape and enables incremental molecular modifications rather than abrupt changes.</p>
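<p>The two phases can be sketched end-to-end on a toy differentiable model: a frozen linear &ldquo;property predictor&rdquo; stands in for the trained network, noise injection replaces the zeros of a one-hot input, and gradient descent then runs on the input rather than the weights. This is purely illustrative, not the authors&rsquo; implementation:</p>

```python
import random

# Frozen "predictor": f(x) = w . x, standing in for the trained network.
W = [0.8, -0.3, 0.5]
predict = lambda x: sum(wi * xi for wi, xi in zip(W, x))


def dream(x0, y_target, lr=0.05, steps=200):
    """Phase 2: gradient descent on the INPUT with frozen weights.
    For a linear f, d/dx (f(x) - y)^2 = 2 (f(x) - y) w."""
    x = list(x0)
    for _ in range(steps):
        err = predict(x) - y_target
        x = [xi - lr * 2 * err * wi for xi, wi in zip(x, W)]
    return x


# Noise injection: replace each zero of the one-hot input with U(0, k),
# with k in [0.5, 0.95] as in the paper (k = 0.9 here).
random.seed(0)
k = 0.9
x0 = [xi if xi == 1.0 else random.uniform(0, k) for xi in [1.0, 0.0, 0.0]]

x_star = dream(x0, y_target=2.0)
print(round(predict(x_star), 3))  # → 2.0: the input now "has" the target property
```

<p>In PASITHEA the continuous $\mathbf{x}$ is additionally snapped back to a discrete SELFIES string via argmax at each step, which this scalar toy omits.</p>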
<h2 id="experimental-setup-on-qm9-with-logp-optimization">Experimental Setup on QM9 with LogP Optimization</h2>
<h3 id="dataset-and-property">Dataset and Property</h3>
<p>The experiments use a random subset of 10,000 molecules from the <a href="/notes/chemistry/datasets/qm9/">QM9</a> dataset. The target property is the logarithm of the partition coefficient (logP), computed using RDKit. LogP measures lipophilicity, an important drug-likeness indicator that follows an approximately normal distribution in QM9 and has a nearly continuous range, making it suitable for gradient-based optimization.</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>PASITHEA uses a fully connected neural network with four layers, each containing 500 nodes with ReLU activation. The loss function is mean squared error. Data is split 85%/15% for training/testing. The prediction model trains for approximately 1,500 epochs with an Adam optimizer and a learning rate of $1 \times 10^{-6}$.</p>
<p>For inverse training, the authors select a noise upper-bound of 0.9 and a learning rate of 0.01, chosen from hyperparameter tuning experiments that evaluate the percentage of molecules optimized toward the target property.</p>
<h3 id="optimization-targets">Optimization Targets</h3>
<p>Two extreme logP targets are used: $+6$ (high lipophilicity) and $-6$ (low lipophilicity). These values exceed the range of logP values in the QM9 dataset (minimum: $-2.19$, maximum: $3.08$), testing whether the model can extrapolate beyond the training distribution.</p>
<h2 id="distribution-shifts-and-interpretable-molecular-transformations">Distribution Shifts and Interpretable Molecular Transformations</h2>
<h3 id="distribution-level-results">Distribution-Level Results</h3>
<p>Applying deep dreaming to the full set of 10,000 molecules produces a clear shift in the logP distribution:</p>
<table>
  <thead>
      <tr>
          <th>Statistic</th>
          <th>QM9 Original</th>
          <th>Optimized (target +6)</th>
          <th>Optimized (target -6)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean logP</td>
          <td>0.3909</td>
          <td>1.8172</td>
          <td>-0.3360</td>
      </tr>
      <tr>
          <td>Min logP</td>
          <td>-2.1903</td>
          <td>-0.8240</td>
          <td>-2.452</td>
      </tr>
      <tr>
          <td>Max logP</td>
          <td>3.0786</td>
          <td>4.2442</td>
          <td>0.9018</td>
      </tr>
  </tbody>
</table>
<p>The optimized distributions extend beyond the original dataset&rsquo;s property range. The right-shifted distribution (target +6) produces molecules with logP values up to 4.24, exceeding the original maximum of 3.08. The left-shifted distribution (target -6) reaches -2.45, below the original minimum. This indicates that PASITHEA can generate molecules with properties outside the training data bounds.</p>
<p>Additionally, 97.2% of the generated molecules do not exist in the original training set, indicating that the network is not memorizing data but rather using structural features to guide optimization. Some generated molecules contain more heavy atoms than the QM9 maximum of 9, since the SELFIES string length allows for larger structures.</p>
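<p>Novelty checks of this kind reduce to set membership over canonical string representations (canonicalization, e.g. via RDKit, is assumed upstream); a minimal sketch:</p>

```python
def novelty(generated, training):
    """Fraction of generated molecules absent from the training set,
    comparing canonical string representations."""
    train = set(training)
    return sum(1 for m in generated if m not in train) / len(generated)


# Toy canonical strings: 2 of the 3 generated molecules are novel.
print(novelty(["CCO", "CCN", "CCC"], ["CCO", "C=O"]))
```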
<h3 id="molecule-level-interpretability">Molecule-Level Interpretability</h3>
<p>The stepwise molecular transformations reveal interpretable &ldquo;strategies&rdquo; the network employs:</p>
<ol>
<li>
<p><strong>Nitrogen appendage</strong>: When optimizing for lower logP, the network repeatedly appends nitrogen atoms to the molecule. The authors observe this as a consistent pattern across multiple test molecules, reflecting the known relationship between nitrogen content and reduced lipophilicity.</p>
</li>
<li>
<p><strong>Length modulation</strong>: When optimizing for higher logP, the network tends to increase molecular chain length (e.g., extending a carbon chain). When optimizing for lower logP, it shortens chains. This captures the intuition that larger, more carbon-heavy molecules tend to be more lipophilic.</p>
</li>
<li>
<p><strong>Bond order changes</strong>: The network replaces single bonds with double or triple bonds during optimization, demonstrating an understanding of the relationship between bonding patterns and logP.</p>
</li>
<li>
<p><strong>Consistency across trials</strong>: Because the input initialization includes random noise, repeated trials with the same molecule produce different transformation sequences. Despite this stochasticity, the network applies consistent strategies across trials (e.g., always shortening chains for negative optimization), validating that it has learned genuine structure-property relationships.</p>
</li>
</ol>
<h3 id="thermodynamic-stability">Thermodynamic Stability</h3>
<p>The authors probe thermodynamic stability, a rough proxy for synthesizability, by computing heats of formation using MOPAC2016 at the PM7 level of theory. Some optimization trajectories move toward thermodynamically stable molecules (negative heats of formation), while others produce less stable structures. The authors acknowledge this limitation and propose multi-objective optimization incorporating stability as a future direction.</p>
<h3 id="comparison-to-vaes">Comparison to VAEs</h3>
<p>The key distinction from VAEs is where gradient computation occurs. In VAEs, a latent space is learned through encoding and decoding, and property optimization happens in that latent space. In PASITHEA, gradients are computed directly with respect to the molecular representation (SELFIES one-hot encoding). The authors argue this makes the approach more interpretable, since we can probe what the network learned about molecular structure without the &ldquo;detour&rdquo; through a latent space.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors are forthright about the preliminary nature of these results:</p>
<ul>
<li>The method is demonstrated only on a small subset of QM9 with a single, computationally inexpensive property (logP).</li>
<li>The simple four-layer architecture may not scale to larger molecular spaces or more complex properties.</li>
<li>Generated molecules are not always thermodynamically stable, requiring additional optimization objectives.</li>
<li>The approach has not been benchmarked against established methods (VAEs, GANs, RL) on standard generative benchmarks.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>QM9 (random subset)</td>
          <td>10,000 molecules</td>
          <td>logP values computed via RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prediction training</strong>: 4-layer fully connected NN, 500 nodes/layer, ReLU activation, MSE loss, Adam optimizer, LR $1 \times 10^{-6}$, ~1,500 epochs, 85/15 train/test split</li>
<li><strong>Inverse training</strong>: Frozen weights, Adam optimizer, LR 0.01, noise upper-bound 0.9, logP targets of +6 and -6</li>
<li><strong>Heats of formation</strong>: MOPAC2016, PM7 level, geometry optimization with eigenvector following (EF)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture is a simple 4-layer MLP. No pre-trained weights are distributed, but the full code is available.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Novel molecules</td>
          <td>97.2%</td>
          <td>Generated molecules not in training set</td>
      </tr>
      <tr>
          <td>Max logP (target +6)</td>
          <td>4.2442</td>
          <td>Exceeds QM9 max of 3.0786</td>
      </tr>
      <tr>
          <td>Min logP (target -6)</td>
          <td>-2.452</td>
          <td>Below QM9 min of -2.1903</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Pasithea">Pasithea</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shen, C., Krenn, M., Eppel, S., &amp; Aspuru-Guzik, A. (2021). Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations. <em>Machine Learning: Science and Technology</em>, 2(3), 03LT02. <a href="https://doi.org/10.1088/2632-2153/ac09d6">https://doi.org/10.1088/2632-2153/ac09d6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2021deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Cynthia and Krenn, Mario and Eppel, Sagi and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{03LT02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/ac09d6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Machine Translation of Chemical Nomenclature</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/nmt-chemical-nomenclature-en-zh/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/nmt-chemical-nomenclature-en-zh/</guid><description>Xu et al. apply CNN and LSTM seq2seq models to translate chemical nomenclature between English and Chinese, outperforming rule-based tools.</description><content:encoded><![CDATA[<h2 id="a-method-for-neural-translation-of-chemical-names">A Method for Neural Translation of Chemical Names</h2>
<p>This is a <strong>Method</strong> paper that introduces deep learning approaches for translating chemical nomenclature between English and Chinese. The primary contribution is demonstrating that character-level sequence-to-sequence neural networks (both CNN-based and LSTM-based) can serve as viable alternatives to hand-crafted rule-based translation systems for chemical names. The work compares two neural architectures against an existing rule-based tool on bilingual chemical name datasets.</p>
<h2 id="bridging-the-english-chinese-chemical-nomenclature-gap">Bridging the English-Chinese Chemical Nomenclature Gap</h2>
<p>English and Chinese are the two most widely used languages for chemical nomenclature worldwide. Translation between them is important for chemical data processing, especially for converting Chinese chemical names extracted via named entity recognition into English names that existing name-to-structure tools can parse. Rule-based translation between these languages faces considerable challenges:</p>
<ol>
<li>Chinese chemical names lack word boundaries (no spaces), making segmentation difficult.</li>
<li>Word order is often reversed between English and Chinese chemical names (e.g., &ldquo;ethyl acetate&rdquo; maps to characters meaning &ldquo;acetate-ethyl&rdquo; in Chinese).</li>
<li>The same English morpheme can map to different Chinese characters depending on chemical context (e.g., &ldquo;ethyl&rdquo; translates differently in &ldquo;ethyl acetate&rdquo; vs. &ldquo;ethyl alcohol&rdquo;).</li>
<li>Trivial names, especially for natural products, follow irregular translation patterns or are transliterations.</li>
</ol>
<p>Building comprehensive rule sets requires a formally trained chemist fluent in both languages, making rule-based approaches expensive and fragile.</p>
<h2 id="character-level-sequence-to-sequence-translation">Character-Level Sequence-to-Sequence Translation</h2>
<p>The core idea is to treat chemical name translation as a character-level machine translation task, applying encoder-decoder architectures with attention mechanisms. Two architectures are proposed:</p>
<p><strong>CNN-based architecture</strong>: Three 1D convolutional layers encode the input character sequence. A decoder with three 1D convolutional layers processes the target sequence offset by one timestep, combined with attention mechanism layers that connect encoder and decoder outputs. Two additional 1D convolutional layers produce the final decoded output sequence.</p>
<p><strong>LSTM-based architecture</strong>: An LSTM encoder converts the input sequence into two state vectors. An LSTM decoder is trained with teacher forcing, using the encoder&rsquo;s state vectors as its initial state, and generating the target sequence offset by one timestep.</p>
<p>Both models operate at the character level. Input chemical name strings are transformed into embedding vectors, with the vocabulary size equal to the number of unique characters in the respective language (100 unique characters for English names, 2,056 unique characters for Chinese names).</p>
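<p>The character-level preprocessing amounts to building a per-language character vocabulary and one-hot encoding each name into a fixed-length matrix. A minimal sketch (the actual Keras code ships as supplementary material with the paper):</p>

```python
def build_vocab(names):
    """Map each unique character across the corpus to an integer index."""
    chars = sorted({ch for name in names for ch in name})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot(name, vocab, max_len):
    """Encode a name as a max_len x |vocab| one-hot matrix, zero-padded."""
    mat = [[0.0] * len(vocab) for _ in range(max_len)]
    for t, ch in enumerate(name[:max_len]):
        mat[t][vocab[ch]] = 1.0
    return mat


names = ["ethyl acetate", "ethyl alcohol"]
vocab = build_vocab(names)
x = one_hot(names[0], vocab, max_len=16)
print(len(vocab), len(x), len(x[0]))  # → 9 16 9
```

<p>In the paper the vocabulary sizes are 100 unique characters for English and 2,056 for Chinese, so the Chinese one-hot matrices are far wider than the English ones.</p>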
<h2 id="experimental-setup-and-comparison-with-rule-based-tool">Experimental Setup and Comparison with Rule-Based Tool</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors built two directional datasets from a manually curated corpus of scientific literature maintained at their institution:</p>
<ul>
<li><strong>En2Ch (English to Chinese)</strong>: 30,394 name pairs after deduplication</li>
<li><strong>Ch2En (Chinese to English)</strong>: 37,207 name pairs after deduplication</li>
</ul>
<p>The datasets cover systematic compound names through trivial names. For names with multiple valid translations, the most commonly used translation was selected. Each dataset was split 80/20 for training and validation.</p>
<h3 id="model-configuration">Model Configuration</h3>
<p>Both neural network models used the following hyperparameters:</p>
<ul>
<li>Batch size: 64</li>
<li>Epochs: 100</li>
<li>Latent dimensionality: 256 (encoding and decoding space)</li>
<li>Implementation: Python 3.7 with Keras 2.3 and TensorFlow backend</li>
</ul>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The models were evaluated on five metrics across both translation directions:</p>
<ul>
<li><strong>Success Rate</strong>: Percentage of inputs that produced any output</li>
<li><strong>String Matching Accuracy</strong>: Exact match with the single target name</li>
<li><strong>Data Matching Accuracy</strong>: Exact match allowing any valid translation from the corpus</li>
<li><strong>Manual Spot Check</strong>: Blind evaluation of 100 random samples per approach</li>
<li><strong>Running Time</strong>: Wall-clock time on the same hardware</li>
</ul>
<h3 id="baseline">Baseline</h3>
<p>The rule-based comparison system operates in three steps: disassemble the input name into word fragments, translate each fragment, and reassemble into the target language. This tool had been deployed as an online service with over one million uses at the time of publication.</p>
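<p>The three-step rule-based pipeline can be illustrated with a toy fragment dictionary; the fragments, the single reversal rule, and the outputs below are hypothetical stand-ins, not the deployed tool&rsquo;s rules:</p>

```python
# Toy English-to-Chinese fragment dictionary (illustrative only).
FRAGMENTS = {"ethyl": "乙基", "acetate": "乙酸", "methyl": "甲基"}


def disassemble(name):
    """Step 1: split an English name into word fragments."""
    return name.split()


def translate_fragments(frags):
    """Step 2: look up each fragment; return None on any unknown
    fragment, mirroring how the rule-based tool produces no output."""
    out = [FRAGMENTS.get(f) for f in frags]
    return None if None in out else out


def reassemble(frags):
    """Step 3: Chinese ester names put the acid part first, so reverse
    the English fragment order (a simplified stand-in for real rules)."""
    return "".join(reversed(frags))


def rule_translate(name):
    frags = translate_fragments(disassemble(name))
    return None if frags is None else reassemble(frags)


print(rule_translate("ethyl acetate"))   # toy translation with reversed order
print(rule_translate("benzyl acetate"))  # → None: unknown fragment, no output
```

<p>The failure mode in the second call is exactly what the success-rate metric captures: the neural models always emit some output, while the rule-based tool fails outright on fragments (or segmentations) it does not know.</p>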
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>CNN</th>
          <th>LSTM</th>
          <th>Rule-based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate En2Ch</td>
          <td>100%</td>
          <td>100%</td>
          <td>75.97%</td>
      </tr>
      <tr>
          <td>Success Rate Ch2En</td>
          <td>100%</td>
          <td>100%</td>
          <td>59.90%</td>
      </tr>
      <tr>
          <td>String Match En2Ch</td>
          <td>82.92%</td>
          <td>89.64%</td>
          <td>39.81%</td>
      </tr>
      <tr>
          <td>String Match Ch2En</td>
          <td>78.11%</td>
          <td>55.44%</td>
          <td>43.77%</td>
      </tr>
      <tr>
          <td>Data Match En2Ch</td>
          <td>84.44%</td>
          <td>90.82%</td>
          <td>45.15%</td>
      </tr>
      <tr>
          <td>Data Match Ch2En</td>
          <td>80.22%</td>
          <td>57.40%</td>
          <td>44.91%</td>
      </tr>
      <tr>
          <td>Manual Check En2Ch</td>
          <td>90.00%</td>
          <td>89.00%</td>
          <td>80.00%</td>
      </tr>
      <tr>
          <td>Manual Check Ch2En</td>
          <td>82.00%</td>
          <td>61.00%</td>
          <td>78.00%</td>
      </tr>
      <tr>
          <td>Time En2Ch (s)</td>
          <td>1423</td>
          <td>190</td>
          <td>288</td>
      </tr>
      <tr>
          <td>Time Ch2En (s)</td>
          <td>1876</td>
          <td>303</td>
          <td>322</td>
      </tr>
  </tbody>
</table>
<p>Both neural approaches achieved 100% success rate (always producing output), while the rule-based tool failed on 24% and 40% of inputs for En2Ch and Ch2En respectively. The rule-based tool&rsquo;s failures were concentrated on Chinese names lacking word boundaries and on trivial names of natural products.</p>
<p>For English-to-Chinese translation, LSTM performed best at 89.64% string matching accuracy (90.82% data matching), followed by CNN at 82.92%. For Chinese-to-English, CNN substantially outperformed LSTM (78.11% vs. 55.44% string matching), suggesting that LSTM had difficulty with long-term dependencies in Chinese character sequences. The authors observed that many LSTM errors appeared at the ends of chemical names.</p>
<h3 id="analysis-by-name-type">Analysis by Name Type</h3>
<p>The CNN-based approach outperformed LSTM on CAS names (80% vs. 52% in manual checks) and was more robust for longer names. The rule-based tool&rsquo;s performance was consistent regardless of name length: it handled regular systematic names well but struggled with the diversity of real-world chemical nomenclature, trivial names in particular.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Performance depends heavily on training data quality and quantity.</li>
<li>Neither neural approach was validated on an external test set outside the institution&rsquo;s corpus.</li>
<li>The CNN model was considerably slower than the other two approaches (roughly 5-8x; e.g., 1,423 s vs. 190 s for En2Ch).</li>
<li>No comparison against modern transformer-based NMT architectures (the study predates widespread adoption of transformers for this task).</li>
<li>The dataset is relatively small by modern NMT standards (30-37K pairs).</li>
<li>The authors noted that some neural translations were actually better than the target labels, suggesting the evaluation metrics understate true performance.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest that combining CNN and LSTM architectures could yield further improvements, and that the approach has practical applications in scientific publishing (Chinese journals requiring English abstracts) and chemical database interoperability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Validation (En2Ch)</td>
          <td>Curated bilingual corpus</td>
          <td>30,394 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Training/Validation (Ch2En)</td>
          <td>Curated bilingual corpus</td>
          <td>37,207 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Testing (En2Ch)</td>
          <td>Held-out validation split</td>
          <td>6,079 records</td>
          <td>Same source</td>
      </tr>
      <tr>
          <td>Testing (Ch2En)</td>
          <td>Held-out validation split</td>
          <td>7,441 records</td>
          <td>Same source</td>
      </tr>
  </tbody>
</table>
<p>Training data, Python code for both models, and result data are provided as supplementary files with the paper.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Character-level CNN encoder-decoder with attention (3+3+2 conv layers)</li>
<li>Character-level LSTM encoder-decoder with teacher forcing</li>
<li>Batch size: 64, epochs: 100, latent dim: 256</li>
</ul>
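<p>A minimal sketch of the character-level setup implied above: build a character vocabulary over the bilingual name pairs and produce the shifted decoder inputs/targets used for teacher forcing. The start/end markers and function names are illustrative, not taken from the paper's released code.</p>

```python
START, END = "\t", "\n"  # common seq2seq start/end markers for char models

def build_vocab(names):
    """Map every character in the corpus to an integer id."""
    vocab = {START: 0, END: 1}
    for c in sorted({c for name in names for c in name}):
        vocab[c] = len(vocab)
    return vocab

def encode_pair(src, tgt, vocab):
    """Encoder input, decoder input, decoder target (shifted by one step)."""
    enc_in = [vocab[c] for c in src]
    dec_in = [vocab[START]] + [vocab[c] for c in tgt]  # teacher forcing feed
    dec_out = [vocab[c] for c in tgt] + [vocab[END]]   # next-char targets
    return enc_in, dec_in, dec_out
```

<p>During training the decoder sees the ground-truth previous character (<code>dec_in</code>) and is scored against the next character (<code>dec_out</code>), which is what "teacher forcing" refers to.</p>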
<h3 id="models">Models</h3>
<p>Both models implemented in Python 3.7 with Keras 2.3 / TensorFlow. No pre-trained weights are released separately, but the training code is provided as supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value (En2Ch)</th>
          <th>Best Value (Ch2En)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate</td>
          <td>100% (both DL)</td>
          <td>100% (both DL)</td>
          <td>Rule-based: 75.97% / 59.90%</td>
      </tr>
      <tr>
          <td>String Matching</td>
          <td>89.64% (LSTM)</td>
          <td>78.11% (CNN)</td>
          <td>Best neural model per direction</td>
      </tr>
      <tr>
          <td>Data Matching</td>
          <td>90.82% (LSTM)</td>
          <td>80.22% (CNN)</td>
          <td>Allows multiple valid translations</td>
      </tr>
      <tr>
          <td>Manual Spot Check</td>
          <td>90.00% (CNN)</td>
          <td>82.00% (CNN)</td>
          <td>Blind evaluation of 100 samples</td>
      </tr>
  </tbody>
</table>
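<p>A hedged sketch of the two accuracy notions in the table: string matching counts only exact matches against the single reference translation, while data matching credits any prediction that appears in a set of accepted alternatives (hence data matching &ge; string matching). The function names are illustrative.</p>

```python
def string_matching(preds, refs):
    """Fraction of predictions exactly equal to the single reference."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def data_matching(preds, accepted):
    """Fraction of predictions found in the accepted-translation set.

    accepted[i] is the set of valid translations for example i.
    """
    return sum(p in a for p, a in zip(preds, accepted)) / len(preds)
```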
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper: running times are reported, but no hardware details are provided.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.1186/s13321-020-00457-0">Supplementary files</a></td>
          <td>Code + Data</td>
          <td>CC-BY 4.0</td>
          <td>Training data, CNN/LSTM code, results (Additional files 1-6)</td>
      </tr>
      <tr>
          <td><a href="https://www.organchem.csdb.cn/translate">SIOC Translation Tool</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Rule-based baseline tool, online service</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, T., Chen, W., Zhou, J., Dai, J., Li, Y., &amp; Zhao, Y. (2020). Neural machine translation of chemical nomenclature between English and Chinese. <em>Journal of Cheminformatics</em>, 12, 50. <a href="https://doi.org/10.1186/s13321-020-00457-0">https://doi.org/10.1186/s13321-020-00457-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xu2020neural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Neural machine translation of chemical nomenclature between English and Chinese}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Tingjun and Chen, Weiming and Zhou, Junhong and Dai, Jingfang and Li, Yingyong and Zhao, Yingli}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00457-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>nach0: A Multimodal Chemical and NLP Foundation Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/</guid><description>nach0 is a T5-based encoder-decoder model pre-trained on SMILES, scientific text, and patents, then instruction-tuned for chemical and NLP tasks.</description><content:encoded><![CDATA[<h2 id="a-multi-domain-encoder-decoder-for-chemistry-and-nlp">A Multi-Domain Encoder-Decoder for Chemistry and NLP</h2>
<p>nach0 is a <strong>Method</strong> paper that introduces a unified encoder-decoder foundation model capable of handling both natural language processing (NLP) tasks and chemistry tasks within a single architecture. The primary contribution is demonstrating that a T5-based model pre-trained on scientific text, patents, and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> molecular strings can be instruction-tuned to perform molecular property prediction, reaction prediction, molecular generation, named entity recognition, question answering, and cross-domain translation (text-to-molecule and molecule-to-text) simultaneously. The model is available in base (250M parameters) and large (780M parameters) configurations.</p>
<h2 id="bridging-chemical-and-linguistic-representations">Bridging Chemical and Linguistic Representations</h2>
<p>Most existing biomedical language models (BioBERT, SciFive, BioMegatron) are trained exclusively on natural language text from sources like PubMed, omitting chemical structure information encoded in SMILES strings. Conversely, chemistry-specific models trained on SMILES data often lack the ability to process natural language instructions or perform NLP tasks. Models like <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a> and MolT5 attempted to bridge this gap by training on both natural language and chemical data, but they were not fine-tuned on a diverse set of chemical tasks using instruction tuning in a multi-task fashion.</p>
<p>nach0 addresses this by creating a shared representation space for both modalities and fine-tuning across a comprehensive set of tasks spanning three domains: NLP-only tasks, chemistry-only tasks, and cross-domain tasks that require translating between natural language and molecular representations.</p>
<h2 id="unified-text-to-text-framework-with-smiles-tokenization">Unified Text-to-Text Framework with SMILES Tokenization</h2>
<p>The core innovation in nach0 is formulating all chemical and linguistic tasks as text-to-text problems within a single encoder-decoder transformer, combined with a specialized SMILES tokenization strategy.</p>
<h3 id="smiles-token-integration">SMILES Token Integration</h3>
<p>Rather than treating SMILES as plain text, nach0 extends the T5 vocabulary with dedicated SMILES tokens. Each SMILES token is annotated with special symbols in the format <code>&lt;sm_{token}&gt;</code>, creating a distinct vocabulary space for molecular representations while preserving the natural language vocabulary from FLAN-T5. The embedding matrix is initialized by reusing learned embeddings from the pre-trained model for original tokens, with new chemical tokens initialized from the first embeddings.</p>
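<p>An illustrative sketch of the <code>&lt;sm_{token}&gt;</code> annotation scheme described above: split a SMILES string into chemically meaningful tokens and wrap each one so it occupies a vocabulary space distinct from the natural-language tokens. The regex is a common community pattern for SMILES tokenization, not nach0's exact tokenizer.</p>

```python
import re

# Bracket atoms, two-letter elements, aromatic/organic-subset atoms,
# bonds, ring-closure digits, and branch/charge symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|[BCNOPSFIbcnops]|[=#\+\-\(\)\/\\%@\.]|\d)"
)

def annotate_smiles(smiles):
    """Wrap each SMILES token in the <sm_...> marker format."""
    return [f"<sm_{t}>" for t in SMILES_TOKEN.findall(smiles)]
```

<p>For example, acetic acid (<code>CC(=O)O</code>) becomes seven dedicated chemical tokens rather than a stream of ordinary characters.</p>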
<h3 id="architecture">Architecture</h3>
<p>Both model sizes use the standard <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a> encoder-decoder architecture:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Parameters</th>
          <th>Layers</th>
          <th>Hidden Size</th>
          <th>FFN Size</th>
          <th>Attention Heads</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>250M</td>
          <td>12</td>
          <td>768</td>
          <td>3072</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>780M</td>
          <td>24</td>
          <td>1024</td>
          <td>4096</td>
          <td>16</td>
      </tr>
  </tbody>
</table>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>The model is pre-trained with a language modeling objective on three data sources:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Documents</th>
          <th>Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubMed abstracts (chemistry-filtered)</td>
          <td>13M</td>
          <td>355M</td>
      </tr>
      <tr>
          <td>USPTO patent descriptions</td>
          <td>119K</td>
          <td>2.9B</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a> molecular database</td>
          <td>~100M</td>
          <td>4.7B</td>
      </tr>
  </tbody>
</table>
<h3 id="instruction-tuning">Instruction Tuning</h3>
<p>Following the approach of Raffel et al. and Chung et al., nach0 uses natural language prompts to formulate each task. For example, a retrosynthesis task might be phrased as &ldquo;What reactants could be used to synthesize [SMILES]?&rdquo; and a property prediction task as &ldquo;Can [SMILES] penetrate the <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a>?&rdquo; This enables multi-task training across all domains with a single loss function and shared hyperparameters.</p>
<p>Training uses a batch size of 1024, learning rate of $1 \times 10^{-4}$, and weight decay of 0.01. Pre-training runs for one epoch, and fine-tuning for 10 epochs. Data mixing follows the examples-proportional mixing strategy from T5.</p>
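<p>The text-to-text formulation above can be sketched as reducing every task to a (prompt, target) string pair, so a single seq2seq loss covers property prediction, retrosynthesis, and NLP alike. The template wordings here paraphrase the two examples in the text and are not nach0's full prompt set.</p>

```python
# Hypothetical task templates paraphrasing the examples above.
TEMPLATES = {
    "retrosynthesis": "What reactants could be used to synthesize {}?",
    "bbbp": "Can {} penetrate the BBB?",
}

def make_example(task, molecule, target):
    """Format one training example as a (prompt, target) string pair."""
    return TEMPLATES[task].format(molecule), target
```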
<h2 id="multi-task-evaluation-across-nlp-and-chemistry-benchmarks">Multi-Task Evaluation Across NLP and Chemistry Benchmarks</h2>
<p>nach0 is evaluated on a comprehensive set of benchmarks spanning three task categories.</p>
<h3 id="task-categories">Task Categories</h3>
<p><strong>NLP tasks</strong>: Named entity recognition (BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, JNLPBA), PICO extraction (EBM PICO), textual entailment (MedNLI, SciTail), relation extraction (ChemProt, DDI, GAD), sentence similarity (BIOSSES), document classification (HoC), and question answering (PubMedQA, BioASQ, MedMCQA, MMLU).</p>
<p><strong>Chemistry tasks</strong>: Molecular property prediction (ESOL, FreeSolv, Lipophilicity, BBBP, HIV, BACE from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>; <a href="/notes/chemistry/datasets/qm9/">QM9</a> from Mol-Instructions), molecular generation (<a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>), forward reaction prediction, reagent prediction, and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> (from Mol-Instructions/USPTO).</p>
<p><strong>Cross-domain tasks</strong>: Description-guided molecule design and molecular description generation (from Mol-Instructions).</p>
<h3 id="baselines">Baselines</h3>
<p>nach0 is compared against FLAN-T5 (250M), SciFive (220M), and MolT5 (220M), all trained in multi-task fashion.</p>
<h3 id="key-results">Key Results</h3>
<p>On chemistry and cross-domain tasks, nach0 base consistently outperforms all base-sized baselines. Selected highlights from Table 3:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>MolT5</th>
          <th>SciFive</th>
          <th>FLAN</th>
          <th>nach0 Base</th>
          <th>nach0 Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Forward reaction</td>
          <td>Acc@1</td>
          <td>27.0%</td>
          <td>60.0%</td>
          <td>59.0%</td>
          <td>88.0%</td>
          <td>89.9%</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>Acc@1</td>
          <td>15.0%</td>
          <td>31.0%</td>
          <td>31.0%</td>
          <td>53.0%</td>
          <td>56.3%</td>
      </tr>
      <tr>
          <td>Reagent prediction</td>
          <td>Acc@1</td>
          <td>1.1%</td>
          <td>3.8%</td>
          <td>4.0%</td>
          <td>6.3%</td>
          <td>13.1%</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>BA</td>
          <td>0.58</td>
          <td>0.65</td>
          <td>0.65</td>
          <td>0.74</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>BA</td>
          <td>0.55</td>
          <td>0.66</td>
          <td>0.60</td>
          <td>0.67</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>HFE (FreeSolv)</td>
          <td>R2</td>
          <td>-0.36</td>
          <td>0.51</td>
          <td>0.55</td>
          <td>0.77</td>
          <td>0.78</td>
      </tr>
      <tr>
          <td>MOSES (FCD)</td>
          <td>FCD/Test</td>
          <td>0.521</td>
          <td>0.578</td>
          <td>0.529</td>
          <td>0.311</td>
          <td>0.304</td>
      </tr>
      <tr>
          <td>Description-guided mol. design</td>
          <td>BLEU-2</td>
          <td>30.3%</td>
          <td>44.2%</td>
          <td>43.6%</td>
          <td>49.0%</td>
          <td>48.8%</td>
      </tr>
      <tr>
          <td>Mol. description gen.</td>
          <td>BLEU-2</td>
          <td>35.6%</td>
          <td>39.6%</td>
          <td>38.6%</td>
          <td>43.9%</td>
          <td>41.7%</td>
      </tr>
  </tbody>
</table>
<p>On NLP tasks, nach0 base performs comparably to FLAN base, with the two models trading wins across different tasks. nach0 large improves substantially over nach0 base on most tasks.</p>
<h3 id="ablation-study">Ablation Study</h3>
<p>The ablation study (Table 4) examines the impact of multi-task training across chemical task groups. Key findings:</p>
<ul>
<li>nach0 trained on all chemical tasks jointly outperforms models trained on individual task groups (prediction-only, reaction-only, or generation-only) on the total set of metrics</li>
<li>The joint model shows lower novelty scores on MOSES compared to the generation-only model, but this reflects less overfitting to training data rather than worse performance</li>
<li>nach0 consistently outperforms MolT5 across all chemical task configurations, demonstrating the benefit of pre-training on both natural language and chemical data with specialized SMILES tokens</li>
</ul>
<h3 id="case-studies">Case Studies</h3>
<p>Two applied case studies demonstrate nach0 in drug discovery scenarios:</p>
<ol>
<li>
<p><strong>End-to-end drug discovery for <a href="https://en.wikipedia.org/wiki/Diabetes">diabetes mellitus</a></strong>: Using a sequence of prompts, nach0 identifies biological targets, analyzes mechanisms of action, generates molecular structures, proposes synthesis routes, and predicts molecular properties.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Janus_kinase_3">JAK3</a> inhibitor generation with Chemistry42</strong>: nach0 replaces 42 specialized generative models in Insilico Medicine&rsquo;s Chemistry42 platform. In 45 minutes, nach0 generates 8 molecules satisfying all 2D and 3D requirements (hinge binding, active site binding), compared to a 0.04% discovery rate from a combinatorial generator over 24 hours. Chemistry42&rsquo;s full pipeline (72 hours) still produces better structures since it uses reinforcement learning feedback and explicit structural constraints.</p>
</li>
</ol>
<h3 id="comparison-with-chatgpt">Comparison with ChatGPT</h3>
<p>On a subset evaluation, fine-tuned nach0 base outperforms GPT-3.5-turbo on all tested tasks: EBM PICO (F1: 67.6% vs. 64.4%), MedMCQA-Open (BLEU-2: 6.3% vs. 1.7%), and molecular description generation (BLEU-2: 42.8% vs. 2.2%).</p>
<h2 id="competitive-multi-task-performance-with-clear-limitations">Competitive Multi-Task Performance with Clear Limitations</h2>
<p>nach0 demonstrates that a single encoder-decoder model can achieve competitive results across both chemical and NLP tasks when pre-trained on mixed-modality data and fine-tuned with instruction tuning. The model&rsquo;s strongest advantages appear on chemistry tasks (reaction prediction, property prediction, molecular generation), where specialized SMILES tokenization and chemical pre-training provide clear benefits over general-purpose models of similar scale.</p>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ol>
<li>
<p><strong>Not at chemist expert level</strong>: Human evaluations indicate the model does not match domain expert performance. Key gaps include chemical reasoning, knowledge alignment with domain-specific knowledge graphs, and the ability to learn from expert feedback.</p>
</li>
<li>
<p><strong>SMILES-only molecular representation</strong>: The model lacks 3D geometric information. SMILES notation is not one-to-one with molecular structures, and the model does not incorporate molecular graphs or 3D coordinates. The authors suggest <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> as a potential alternative representation.</p>
</li>
<li>
<p><strong>Prompt sensitivity</strong>: Performance depends on prompt quality and specificity. Over-reliance on domain-specific prompts may limit response diversity.</p>
</li>
<li>
<p><strong>Limited chemical diversity</strong>: Cross-domain datasets from Mol-Instructions primarily cover known drugs and chemical probes from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, representing only a fraction of predicted chemical space.</p>
</li>
</ol>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending nach0 with protein sequence modalities (using <a href="/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/">Group SELFIES</a>), expanding zero-shot evaluation capabilities, and integrating knowledge graph information through self-supervised approaches.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (text)</td>
          <td>PubMed abstracts</td>
          <td>13M docs, 355M tokens</td>
          <td>Filtered for chemistry-related content</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>USPTO patents</td>
          <td>119K docs, 2.9B tokens</td>
          <td>Patent descriptions</td>
      </tr>
      <tr>
          <td>Pre-training (chemical)</td>
          <td>ZINC</td>
          <td>~100M docs, 4.7B tokens</td>
          <td>Molecular SMILES strings</td>
      </tr>
      <tr>
          <td>Fine-tuning (NLP)</td>
          <td>17 NLP datasets</td>
          <td>Varies</td>
          <td>See Table 1 in paper</td>
      </tr>
      <tr>
          <td>Fine-tuning (chemistry)</td>
          <td>MoleculeNet, MOSES, Mol-Instructions</td>
          <td>Varies</td>
          <td>Predefined or random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5 encoder-decoder (base: 250M, large: 780M parameters)</li>
<li>Pre-training objective: Language modeling (masked span prediction)</li>
<li>Fine-tuning: Multi-task instruction tuning with examples-proportional mixing</li>
<li>Hyperparameters: batch size 1024, learning rate $1 \times 10^{-4}$, weight decay 0.01</li>
<li>Pre-training: 1 epoch; fine-tuning: 10 epochs</li>
</ul>
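<p>The examples-proportional mixing referenced above (from the T5 paper) can be sketched as sampling each fine-tuning dataset in proportion to its size, capped at an artificial limit <code>K</code> so the largest datasets do not drown out the small ones. The value of <code>K</code> below is a placeholder, not the one used by nach0.</p>

```python
def mixing_rates(sizes, K):
    """Sampling probability per dataset: min(size, K) / sum of capped sizes."""
    capped = {name: min(n, K) for name, n in sizes.items()}
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}
```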
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_base">nach0 Base (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>250M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_large">nach0 Large (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>780M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://github.com/insilicomedicine/nach0">nach0 GitHub Repository</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation spans 17+ NLP benchmarks and 10+ chemistry benchmarks. Metrics include F1 (NER, RE, classification), accuracy (QA, entailment, reaction prediction), balanced accuracy (molecular property classification), R2/RMSE (regression), BLEU-2 (generation), and FCD/SNN/validity/novelty (molecular generation via MOSES).</p>
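<p>Of these metrics, balanced accuracy (BA, reported for the molecular property classifications) is worth spelling out: it is the unweighted mean of per-class recall, which is robust to the class imbalance common in datasets like HIV. A minimal sketch:</p>

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of recall over the classes present in y_true."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```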
<h3 id="hardware">Hardware</h3>
<ul>
<li>Base models: NVIDIA A4000 and A5000 GPUs</li>
<li>Large models: NVIDIA DGX cloud platform</li>
<li>Training used tensor and pipeline parallelism via NeMo toolkit</li>
<li>Specific GPU counts and training times not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Livne, M., Miftahutdinov, Z., Tutubalina, E., Kuznetsov, M., Polykovskiy, D., Brundyn, A., Jhunjhunwala, A., Costa, A., Aliper, A., Aspuru-Guzik, A., &amp; Zhavoronkov, A. (2024). nach0: Multimodal Natural and Chemical Languages Foundation Model. <em>Chemical Science</em>, 15(22), 8380-8389. <a href="https://doi.org/10.1039/D4SC00966E">https://doi.org/10.1039/D4SC00966E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{livne2024nach0,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{nach0: multimodal natural and chemical languages foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Livne, Micha and Miftahutdinov, Zulfat and Tutubalina, Elena and Kuznetsov, Maksim and Polykovskiy, Daniil and Brundyn, Annika and Jhunjhunwala, Aastha and Costa, Anthony and Aliper, Alex and Aspuru-Guzik, Al{\&#39;a}n and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8380--8389}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D4SC00966E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolPMoFiT: Inductive Transfer Learning for QSAR</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/</guid><description>MolPMoFiT adapts ULMFiT for QSAR by pre-training an LSTM language model on 1M ChEMBL SMILES and fine-tuning on small molecular property datasets.</description><content:encoded><![CDATA[<h2 id="transfer-learning-meets-molecular-property-prediction">Transfer Learning Meets Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces MolPMoFiT (Molecular Prediction Model Fine-Tuning), a transfer learning approach for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSPR/QSAR</a> modeling. The primary contribution is adapting the ULMFiT framework from NLP to molecular property prediction by treating <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> as a chemical language. A general-purpose molecular structure prediction model (MSPM) is pre-trained on one million <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> molecules via self-supervised next-token prediction, then fine-tuned for specific QSAR endpoints. The approach achieves competitive or superior results to graph neural networks and descriptor-based methods across four benchmark datasets, with particular benefits for small datasets.</p>
<h2 id="the-small-data-problem-in-qsar-modeling">The Small Data Problem in QSAR Modeling</h2>
<p>Deep learning models for molecular property prediction typically require large labeled training sets to learn useful structural representations. While methods like graph convolutional neural networks and SMILES-based models have achieved strong results on well-studied endpoints, they must be trained from scratch for each new task. This presents a challenge for small chemical datasets with limited labeled data, which remain common in drug discovery for specialized endpoints like <a href="https://en.wikipedia.org/wiki/Allosteric_regulation">allosteric inhibition</a>, renal clearance, and inhibitor residence times.</p>
<p>Transfer learning had already shown transformative impact in computer vision (ImageNet pre-training) and NLP (ELMo, BERT, ULMFiT). In chemistry, prior transfer learning efforts included ChemNet (supervised pre-training on computed descriptors), <a href="/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/">Mol2vec</a> (unsupervised substructure embeddings), and pre-trained graph neural networks. However, a systematic application of the ULMFiT self-supervised pre-training pipeline to SMILES-based molecular models had not been explored. MolPMoFiT fills this gap by treating the vast corpus of unlabeled molecular structures as a self-supervised training signal, analogous to how language models learn from unlabeled text.</p>
<h2 id="core-innovation-ulmfit-adapted-for-smiles">Core Innovation: ULMFiT Adapted for SMILES</h2>
<p>MolPMoFiT adapts ULMFiT&rsquo;s three-stage transfer learning pipeline to molecular property prediction:</p>
<p><strong>Stage 1: General-Domain MSPM Pre-training.</strong> A molecular structure prediction model is trained on one million curated ChEMBL molecules to predict the next token in a SMILES string. This is purely self-supervised: the SMILES string provides its own labels. The model learns general chemical syntax and structural patterns.</p>
<p><strong>Stage 2: Task-Specific MSPM Fine-tuning (Optional).</strong> The general MSPM is further fine-tuned on the unlabeled SMILES of the target task dataset. This adapts the language model to the specific chemical distribution of interest (e.g., HIV inhibitors vs. general bioactive molecules). Discriminative fine-tuning adjusts learning rates per layer:</p>
<p>$$\eta^{layer-1} = \eta^{layer} / 2.6$$</p>
<p>where higher layers (containing more task-specific features) receive higher learning rates.</p>
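<p>The rule above can be sketched directly: starting from the top layer's base rate, each lower layer's rate is divided by 2.6, so deeper (more general) layers move more slowly.</p>

```python
def layer_learning_rates(base_lr, n_layers, factor=2.6):
    """Per-layer rates; index 0 = lowest layer, last = top layer (base_lr)."""
    return [base_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]
```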
<p><strong>Stage 3: QSAR/QSPR Model Fine-tuning.</strong> The embedding and encoder weights from the pre-trained MSPM are transferred to a new model with a task-specific classifier head. Fine-tuning uses three key techniques from ULMFiT:</p>
<ul>
<li><strong>Discriminative fine-tuning</strong>: Different learning rates per layer group</li>
<li><strong>Gradual unfreezing</strong>: Layers are unfrozen sequentially (classifier first, then progressively deeper LSTM layers)</li>
<li><strong>One cycle policy</strong>: Learning rate scheduling following Smith&rsquo;s approach</li>
</ul>
<p>The model architecture is AWD-LSTM (ASGD Weight-Dropped LSTM) with an embedding dimension of 400, three LSTM layers with 1152 hidden units, and dropouts applied at every layer (embedding, input, weights, hidden). The QSAR classifier concatenates max pooling, mean pooling, and the last hidden state $h_T$ from the final LSTM layer, feeding this into two feedforward layers.</p>
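<p>The classifier-head input described above can be sketched with NumPy: concatenate the max pool, mean pool, and last hidden state $h_T$ over the final LSTM layer's output sequence (shape: time steps &times; hidden units).</p>

```python
import numpy as np

def pooled_features(hidden_seq):
    """Concat [max pool, mean pool, h_T] over a (T, hidden) output sequence."""
    h_max = hidden_seq.max(axis=0)
    h_mean = hidden_seq.mean(axis=0)
    h_last = hidden_seq[-1]
    return np.concatenate([h_max, h_mean, h_last])
```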
<p><strong>SMILES Augmentation.</strong> Since multiple valid SMILES can represent the same molecule through different atom orderings, the authors use <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> as data augmentation. For regression tasks, Gaussian noise ($\sigma_{noise}$) is added to labels of augmented SMILES to simulate experimental error. Test-time augmentation (TTA) averages predictions across the canonical SMILES and four randomized SMILES.</p>
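<p>A hedged sketch of the augmentation and TTA scheme above: each training molecule contributes several randomized-SMILES copies (the enumeration itself would use RDKit and is abstracted away here), each regression copy's label gets Gaussian noise of scale $\sigma_{noise}$, and at test time predictions over the canonical plus randomized SMILES are averaged.</p>

```python
import random

def augment_labels(label, n_copies, sigma_noise, rng):
    """Noisy label per augmented SMILES copy (regression tasks only)."""
    return [label + rng.gauss(0.0, sigma_noise) for _ in range(n_copies)]

def tta_predict(predict, smiles_variants):
    """Average a model's predictions over canonical + randomized SMILES."""
    preds = [predict(s) for s in smiles_variants]
    return sum(preds) / len(preds)
```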
<h2 id="benchmarks-across-four-qsar-datasets">Benchmarks Across Four QSAR Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,200</td>
          <td>Regression (logD)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Solvation">solvation energy</a>)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>Classification (replication inhibition)</td>
          <td>AUROC</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Classification (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>)</td>
          <td>AUROC</td>
      </tr>
  </tbody>
</table>
<p>All datasets use the same ten 80:10:10 splits from <a href="/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/">Yang et al. (2019)</a> for fair comparison. Both random and scaffold splits were evaluated, with scaffold splits representing a more realistic test of generalization to novel chemical scaffolds.</p>
<h3 id="baselines">Baselines</h3>
<p>Models were compared against results reported by Yang et al. (2019): directed message passing neural network (D-MPNN), D-MPNN with RDKit features, random forest on Morgan fingerprints, feed-forward networks on Morgan fingerprints, and feed-forward networks on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> descriptors.</p>
<h3 id="hyperparameters">Hyperparameters</h3>
<p>The same set of fine-tuning hyperparameters was used across all four tasks (tuned on the HIV dataset):</p>
<table>
  <thead>
      <tr>
          <th>Layer Group</th>
          <th>Base Learning Rate</th>
          <th>Epochs</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Linear head only</td>
          <td>3e-2</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final LSTM layer</td>
          <td>5e-3</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final two LSTM layers</td>
          <td>5e-4</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Full model</td>
          <td>5e-5</td>
          <td>6</td>
      </tr>
  </tbody>
</table>
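Within each unfreezing stage, ULMFiT applies discriminative learning rates: the topmost layer group trains at the stage's base rate and each earlier group at the previous rate divided by 2.6. A minimal sketch of that schedule; function and variable names are hypothetical, values are taken from the table above:

```python
def discriminative_lrs(base_lr, n_groups, factor=2.6):
    """Per-layer-group learning rates for one unfreezing stage:
    the last (topmost) group trains at base_lr, each earlier group
    at the next group's rate divided by 2.6 (ULMFiT heuristic)."""
    lrs = [base_lr]
    for _ in range(n_groups - 1):
        lrs.append(lrs[-1] / factor)
    return list(reversed(lrs))  # ordered earliest -> topmost group

# Gradual unfreezing schedule from the table: (unfrozen groups, base LR, epochs)
schedule = [("head", 3e-2, 4), ("+lstm3", 5e-3, 4),
            ("+lstm2-3", 5e-4, 4), ("full", 5e-5, 6)]
```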
<p>Data augmentation settings were task-specific: lipophilicity training SMILES augmented 25x ($\sigma_{noise} = 0.3$); FreeSolv augmented 50x ($\sigma_{noise} = 0.5$); HIV active class augmented 60x and inactive 2x; BBBP positive class 10x and negative 30x.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="benchmark-results">Benchmark Results</h3>
<p><strong>Lipophilicity (random split):</strong> MolPMoFiT achieved RMSE of $0.565 \pm 0.037$ with TTA and $0.625 \pm 0.032$ without, outperforming D-MPNN and other baselines.</p>
<p><strong>FreeSolv (random split):</strong> RMSE of $1.197 \pm 0.127$ with TTA. The small dataset size (642 compounds) led to high variance across splits.</p>
<p><strong>BBBP (random split):</strong> AUROC of $0.950 \pm 0.020$, outperforming all comparison models. Task-specific MSPM fine-tuning showed no clear benefit over the general MSPM.</p>
<p><strong>HIV (random split):</strong> General MolPMoFiT achieved AUROC of $0.828 \pm 0.029$ with TTA. Task-specific fine-tuning yielded a slightly higher $0.834 \pm 0.025$ with TTA.</p>
<p>Scaffold splits consistently produced lower performance than random splits across all datasets, as expected for out-of-distribution generalization.</p>
<h3 id="transfer-learning-impact">Transfer Learning Impact</h3>
<p>Across all four datasets and varying training set sizes, MolPMoFiT consistently outperformed models trained from scratch with the same architecture. The improvement was most pronounced at smaller training set sizes, confirming the utility of pre-trained representations for low-data regimes.</p>
<h3 id="smiles-augmentation-analysis">SMILES Augmentation Analysis</h3>
<p>Training data augmentation provided significant improvements across all tasks. For classification (HIV, BBBP), augmentation improved performance regardless of whether class re-balancing was applied. For regression (lipophilicity, FreeSolv), both SMILES augmentation and label noise were beneficial, with optimal noise levels varying by dataset.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note a fundamental limitation: the model learns mappings from individual SMILES strings to properties rather than from molecular structures to properties. SMILES augmentation acts as a regularization technique that mitigates this by making the model more robust to different SMILES representations of the same molecule. The task-specific MSPM fine-tuning stage did not consistently improve results and warrants further investigation. Finally, all hyperparameters were tuned on a single dataset (HIV) and applied uniformly, which may not be optimal for every endpoint.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (curated)</td>
          <td>1M molecules</td>
          <td>Filtered: no mixtures, max 50 heavy atoms, standardized with MolVS, canonicalized with RDKit</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>MoleculeNet benchmark</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>AWD-LSTM architecture with embedding dim 400, three LSTM layers (1152 hidden units), dropouts at all layers</li>
<li>ULMFiT fine-tuning: discriminative learning rates ($\eta^{layer-1} = \eta^{layer}/2.6$), gradual unfreezing, one cycle policy</li>
<li>SMILES character-level tokenization with special handling for two-character tokens (Cl, Br) and bracket-enclosed tokens</li>
<li>SMILES enumeration for data augmentation with optional Gaussian label noise for regression</li>
</ul>
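The tokenization rules in the list above can be captured with a single regular expression: bracket-enclosed atoms and the two-character halogens become single tokens, everything else splits per character. A minimal sketch, not the authors' exact tokenizer:

```python
import re

# Bracket atoms like [nH] or [O-] first, then two-character
# tokens Cl and Br, then any single character.
SMILES_TOKEN = re.compile(r"(\[[^\]]+\]|Cl|Br|.)")

def tokenize_smiles(smiles):
    """Character-level SMILES tokenization with multi-character atoms
    kept whole, as described in the algorithms list above."""
    return SMILES_TOKEN.findall(smiles)
```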
<h3 id="models">Models</h3>
<ul>
<li>General-domain MSPM pre-trained on 1M ChEMBL molecules (10 epochs)</li>
<li>Task-specific MSPMs fine-tuned per dataset (optional stage)</li>
<li>QSAR models fine-tuned with transferred embeddings and encoder</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Split</th>
          <th>Metric</th>
          <th>MolPMoFiT (TTA)</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lipophilicity</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$0.565 \pm 0.037$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$0.635 \pm 0.031$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$1.197 \pm 0.127$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$2.082 \pm 0.460$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.950 \pm 0.020$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.931 \pm 0.025$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.828 \pm 0.029$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.816 \pm 0.022$</td>
          <td>D-MPNN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P4000 GPU (single GPU)</li>
<li>General-domain MSPM pre-training: approximately 1 day</li>
<li>Pre-training needs to be done only once; fine-tuning is fast per task</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>PyTorch + fastai v1 implementation with curated datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2020). Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. <em>Journal of Cheminformatics</em>, 12, 27. <a href="https://doi.org/10.1186/s13321-020-00430-x">https://doi.org/10.1186/s13321-020-00430-x</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2020molpmofit,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00430-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolBERT: Auxiliary Tasks for Molecular BERT Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/</guid><description>MolBERT applies BERT to SMILES with domain-relevant auxiliary tasks like physicochemical property prediction, improving virtual screening and QSAR.</description><content:encoded><![CDATA[<h2 id="bert-based-molecular-representations-with-auxiliary-pre-training-tasks">BERT-Based Molecular Representations with Auxiliary Pre-Training Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces MolBERT, a bidirectional Transformer (BERT) architecture applied to SMILES-based molecular representations for drug discovery. The primary contribution is a systematic study of how different domain-relevant self-supervised pre-training tasks affect the quality of learned molecular embeddings, paired with a model that achieves state-of-the-art performance on <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a> and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-activity relationship (QSAR)</a> benchmarks.</p>
<h2 id="why-domain-relevant-pre-training-matters-for-molecular-language-models">Why Domain-Relevant Pre-Training Matters for Molecular Language Models</h2>
<p>Molecular representations are foundational for predictive, generative, and analytical tasks in drug discovery. Language models applied to text-based molecular representations like SMILES have demonstrated strong performance across property prediction, reaction prediction, and molecular generation. However, several open questions remained at the time of this work:</p>
<ol>
<li><strong>Task selection for pre-training</strong>: Prior work explored masked token prediction, input translation, and property concatenation, but there was no systematic comparison of how different self-supervised tasks affect downstream performance.</li>
<li><strong>SMILES ambiguity</strong>: The same molecule can be encoded as many different SMILES strings depending on how the molecular graph is traversed. Canonicalization algorithms address this but introduce their own artifacts that may distract the model.</li>
<li><strong>Domain knowledge integration</strong>: Standard NLP pre-training objectives (e.g., masked language modeling) do not explicitly encode chemical knowledge. It was unclear whether incorporating chemistry-specific supervision during pre-training could improve representation quality.</li>
</ol>
<p>MolBERT addresses these gaps by evaluating three pre-training tasks, including a novel physicochemical property prediction objective, and measuring their individual and combined effects on downstream drug discovery benchmarks.</p>
<h2 id="three-auxiliary-tasks-for-chemistry-aware-pre-training">Three Auxiliary Tasks for Chemistry-Aware Pre-Training</h2>
<p>MolBERT uses the BERT-Base architecture (12 attention heads, 12 layers, 768-dimensional hidden states, approximately 85M parameters) and explores three self-supervised pre-training tasks:</p>
<p><strong>Masked Language Modeling (MaskedLM)</strong>: The standard BERT objective where 15% of input tokens are masked and the model predicts their identity. The loss is cross-entropy between predicted and true tokens.</p>
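The masking step can be sketched as follows. This is an illustrative simplification with hypothetical names; BERT's 80/10/10 mask/random/keep refinement is omitted:

```python
import random

MASK, MASK_FRAC = "[MASK]", 0.15

def mask_tokens(tokens, seed=0):
    """Mask 15% of input tokens for the MaskedLM objective; returns the
    corrupted sequence and (position, original token) prediction targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(MASK_FRAC * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = []
    for i in positions:
        targets.append((i, corrupted[i]))
        corrupted[i] = MASK
    return corrupted, targets
```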
<p><strong>SMILES Equivalence (SMILES-Eq)</strong>: A binary classification task where the model receives two SMILES strings and predicts whether they represent the same molecule. The second string is either a random permutation of the first (same molecule, different traversal) or a randomly sampled molecule. This is optimized with cross-entropy loss.</p>
<p><strong>Physicochemical Property Prediction (PhysChemPred)</strong>: Using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, a set of 200 real-valued molecular descriptors are computed for each molecule. The model predicts these normalized descriptors from the SMILES input using mean squared error:</p>
<p>$$\mathcal{L}_{\text{PhysChemPred}} = \frac{1}{D} \sum_{d=1}^{D} (y_d - \hat{y}_d)^2$$</p>
<p>where $D = 200$ is the number of descriptors, $y_d$ is the true normalized descriptor value, and $\hat{y}_d$ is the model&rsquo;s prediction.</p>
<p>The final training loss is the arithmetic mean of all active task losses:</p>
<p>$$\mathcal{L}_{\text{total}} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathcal{L}_t$$</p>
<p>where $\mathcal{T}$ is the set of active pre-training tasks.</p>
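The two equations above translate directly into code. A minimal sketch (plain Python for clarity; names are hypothetical):

```python
def physchem_loss(y_true, y_pred):
    """MSE over the D normalized RDKit descriptors (D = 200 in the paper)."""
    assert len(y_true) == len(y_pred)
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)

def total_loss(task_losses):
    """Arithmetic mean over the active pre-training tasks,
    e.g. {"masked_lm": ..., "physchem": ..., "smiles_eq": ...}."""
    return sum(task_losses.values()) / len(task_losses)
```

Equal weighting means no per-task loss scaling is tuned; each active task contributes identically to the gradient signal.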
<p>Additionally, MolBERT supports SMILES permutation augmentation during training, where each input molecule is represented by a randomly sampled non-canonical SMILES string rather than the canonical form. The model uses a fixed vocabulary of 42 tokens, a sequence length of 128, and relative positional embeddings (from Transformer-XL) to support arbitrary-length SMILES at inference time.</p>
<h2 id="ablation-study-and-benchmark-evaluation">Ablation Study and Benchmark Evaluation</h2>
<h3 id="pre-training-setup">Pre-Training Setup</h3>
<p>All models were pre-trained on the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol benchmark dataset</a>, consisting of approximately 1.6M compounds curated from ChEMBL, using an 80%/5% train/validation split. Training used the Adam optimizer with a learning rate of $3 \times 10^{-5}$ for 20 epochs (ablation) or 100 epochs (final model).</p>
<h3 id="ablation-impact-of-task-combinations-on-virtual-screening">Ablation: Impact of Task Combinations on Virtual Screening</h3>
<p>The ablation study evaluated all seven possible task combinations on the RDKit virtual screening benchmark (69 datasets, 5 query molecules per target). Results measured by AUROC and BEDROC20 (an early enrichment metric with $\alpha = 20$):</p>
<table>
  <thead>
      <tr>
          <th style="text-align: center">MaskedLM</th>
          <th style="text-align: center">PhysChemPred</th>
          <th style="text-align: center">SMILES-Eq</th>
          <th style="text-align: center">AUROC (w/ perm)</th>
          <th style="text-align: center">BEDROC20 (w/ perm)</th>
          <th style="text-align: center">AUROC (w/o perm)</th>
          <th style="text-align: center">BEDROC20 (w/o perm)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.685 +/- 0.069</td>
          <td style="text-align: center">0.246 +/- 0.041</td>
          <td style="text-align: center">0.707 +/- 0.059</td>
          <td style="text-align: center">0.280 +/- 0.042</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.738 +/- 0.060</td>
          <td style="text-align: center">0.323 +/- 0.071</td>
          <td style="text-align: center">0.740 +/- 0.066</td>
          <td style="text-align: center">0.322 +/- 0.065</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.483 +/- 0.092</td>
          <td style="text-align: center">0.092 +/- 0.069</td>
          <td style="text-align: center">0.493 +/- 0.068</td>
          <td style="text-align: center">0.108 +/- 0.070</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.476 +/- 0.077</td>
          <td style="text-align: center">0.064 +/- 0.034</td>
          <td style="text-align: center">0.514 +/- 0.165</td>
          <td style="text-align: center">0.084 +/- 0.014</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.696 +/- 0.058</td>
          <td style="text-align: center">0.283 +/- 0.077</td>
          <td style="text-align: center">0.676 +/- 0.060</td>
          <td style="text-align: center">0.250 +/- 0.073</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.719 +/- 0.057</td>
          <td style="text-align: center">0.293 +/- 0.071</td>
          <td style="text-align: center">0.716 +/- 0.061</td>
          <td style="text-align: center">0.290 +/- 0.076</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.129 +/- 0.067</td>
          <td style="text-align: center">0.005 +/- 0.037</td>
          <td style="text-align: center">0.508 +/- 0.068</td>
          <td style="text-align: center">0.048 +/- 0.035</td>
      </tr>
  </tbody>
</table>
<p>Key findings from the ablation:</p>
<ul>
<li>PhysChemPred had the highest individual impact (average BEDROC20 of 0.292 alone vs. 0.266 for MaskedLM alone).</li>
<li>Combining MaskedLM + PhysChemPred achieved the best performance (BEDROC20 of 0.323), though the additive gain from MaskedLM was modest (+0.031).</li>
<li>The SMILES-Eq task consistently decreased performance when added to other task combinations.</li>
</ul>
<p>A further sub-ablation on PhysChemPred descriptor groups showed that surface descriptors alone (49 of 200 descriptors) achieved nearly the same performance as the full set, suggesting molecular surface properties provide particularly informative supervision.</p>
<h3 id="virtual-screening-results">Virtual Screening Results</h3>
<p>Using the best task combination (MaskedLM + PhysChemPred) trained for 100 epochs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>AUROC</th>
          <th>BEDROC20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolBERT (100 epochs)</td>
          <td>0.743 +/- 0.062</td>
          <td>0.344 +/- 0.062</td>
      </tr>
      <tr>
          <td>CDDD</td>
          <td>0.725 +/- 0.057</td>
          <td>0.310 +/- 0.080</td>
      </tr>
      <tr>
          <td>RDKit descriptors</td>
          <td>0.633 +/- 0.027</td>
          <td>0.217 +/- 0.000</td>
      </tr>
      <tr>
          <td>ECFC4</td>
          <td>0.603 +/- 0.056</td>
          <td>0.170 +/- 0.079</td>
      </tr>
  </tbody>
</table>
<p>MolBERT outperformed all baselines including <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (the prior state of the art), RDKit calculated descriptors, and extended-connectivity fingerprints (ECFC4).</p>
<h3 id="qsar-results">QSAR Results</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> regression tasks (RMSE, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td style="text-align: center">0.687 +/- 0.08</td>
          <td style="text-align: center">0.902 +/- 0.06</td>
          <td style="text-align: center">0.567 +/- 0.06</td>
          <td style="text-align: center">0.552 +/- 0.07</td>
          <td style="text-align: center"><strong>0.531 +/- 0.04</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td style="text-align: center">1.671 +/- 0.45</td>
          <td style="text-align: center">2.876 +/- 0.38</td>
          <td style="text-align: center">1.456 +/- 0.43</td>
          <td style="text-align: center">1.523 +/- 0.66</td>
          <td style="text-align: center"><strong>0.948 +/- 0.33</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td style="text-align: center">0.738 +/- 0.04</td>
          <td style="text-align: center">0.770 +/- 0.03</td>
          <td style="text-align: center">0.669 +/- 0.02</td>
          <td style="text-align: center">0.602 +/- 0.01</td>
          <td style="text-align: center"><strong>0.561 +/- 0.03</strong></td>
      </tr>
  </tbody>
</table>
<p>On MoleculeNet classification tasks (AUROC, higher is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BACE</td>
          <td style="text-align: center">0.831</td>
          <td style="text-align: center">0.845</td>
          <td style="text-align: center">0.833</td>
          <td style="text-align: center">0.849</td>
          <td style="text-align: center"><strong>0.866</strong></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td style="text-align: center">0.696</td>
          <td style="text-align: center">0.678</td>
          <td style="text-align: center">0.761</td>
          <td style="text-align: center">0.750</td>
          <td style="text-align: center"><strong>0.762</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td style="text-align: center">0.708</td>
          <td style="text-align: center">0.714</td>
          <td style="text-align: center">0.753</td>
          <td style="text-align: center">0.747</td>
          <td style="text-align: center"><strong>0.783</strong></td>
      </tr>
  </tbody>
</table>
<p>Fine-tuned MolBERT achieved the best performance on all six QSAR datasets. When used as a fixed feature extractor with an SVM, MolBERT embeddings outperformed other representations on three of six tasks.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Pre-training task selection matters significantly.</strong> The choice of auxiliary tasks during pre-training has a large effect on downstream performance. PhysChemPred provides the strongest individual signal.</li>
<li><strong>Domain-relevant auxiliary tasks improve representation quality.</strong> Predicting physicochemical properties during pre-training encodes chemical knowledge directly into the embeddings, outperforming purely linguistic objectives.</li>
<li><strong>The SMILES equivalence task hurts performance.</strong> Despite being chemically motivated, the SMILES-Eq task consistently degraded results, suggesting it may introduce conflicting learning signals.</li>
<li><strong>PhysChemPred organizes the embedding space.</strong> Analysis of pairwise cosine similarities showed that models trained with PhysChemPred assign high similarity to permutations of the same molecule and low similarity to different molecules, creating a more semantically meaningful representation space.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The paper evaluates only SMILES-based representations, inheriting all limitations of string-based molecular encodings (inability to capture 3D structure, sensitivity to tokenization).</li>
<li>The virtual screening evaluation uses a fixed number of query molecules ($n = 5$), which may not reflect realistic screening scenarios.</li>
<li>Cross-validation splits from ChemBench were used for QSAR evaluation rather than scaffold splits, which may overestimate performance on structurally novel compounds.</li>
<li>The model&rsquo;s 128-token sequence length limit may truncate larger molecules, though relative positional embeddings partially address this at inference time.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending MolBERT to learn representations for other biological entities such as proteins, and developing more advanced pre-training strategies.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>GuacaMol (ChEMBL)</td>
          <td>~1.6M compounds</td>
          <td>80% train / 5% validation split</td>
      </tr>
      <tr>
          <td>Virtual Screening</td>
          <td>RDKit benchmark v1.2</td>
          <td>69 target datasets</td>
          <td>Filtered subset with active/decoy compounds</td>
      </tr>
      <tr>
          <td>QSAR (Regression)</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
      <tr>
          <td>QSAR (Classification)</td>
          <td>BACE, BBBP, HIV</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BERT-Base (12 heads, 12 layers, 768-dim hidden, ~85M params)</li>
<li>Optimizer: Adam, learning rate $3 \times 10^{-5}$</li>
<li>Vocabulary: 42 tokens, sequence length 128</li>
<li>Masking: 15% of tokenized input</li>
<li>Positional encoding: relative positional embeddings (Transformer-XL)</li>
<li>Fine-tuning SVM: $C = 5.0$, RBF kernel (from Winter et al.)</li>
<li>Fine-tuning head: single linear layer on pooled output</li>
<li>Embeddings: pooled output (or average sequence output when only MaskedLM is used)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BERT-Base with ~85M parameters</li>
<li>Pre-trained weights available at <a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>Virtual Screening, Classification QSAR</td>
          <td>Standard area under ROC curve</td>
      </tr>
      <tr>
          <td>BEDROC20</td>
          <td>Virtual Screening</td>
          <td>Early enrichment metric, $\alpha = 20$</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression QSAR</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>2 GPUs, 16 CPUs</li>
<li>Pre-training time: ~40 hours (20 epochs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Official implementation with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fabian, B., Edlich, T., Gaspar, H., Segler, M., Meyers, J., Fiscato, M., &amp; Ahmed, M. (2020). Molecular representation learning with language models and domain-relevant auxiliary tasks. <em>arXiv preprint arXiv:2011.13230</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fabian2020molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular representation learning with language models and domain-relevant auxiliary tasks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fabian, Benedek and Edlich, Thomas and Gaspar, H{\&#39;e}l{\&#39;e}na and Segler, Marwin and Meyers, Joshua and Fiscato, Marco and Ahmed, Mohamed}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2011.13230}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LMs Generate 3D Molecules from XYZ, CIF, PDB Files</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</guid><description>Transformer language models trained on XYZ, CIF, and PDB sequences generate valid 3D molecules, crystals, and protein binding sites.</description><content:encoded><![CDATA[<h2 id="language-models-as-3d-chemical-structure-generators">Language Models as 3D Chemical Structure Generators</h2>
<p>This is a <strong>Method</strong> paper that demonstrates transformer-based language models can generate molecules, crystalline materials, and protein binding sites directly in three dimensions by training on sequences derived from standard chemical file formats (XYZ, CIF, PDB). The key contribution is showing that unmodified autoregressive language models, using only next-token prediction, achieve performance comparable to domain-specific 3D generative models that incorporate SE(3) equivariance and other geometric inductive biases.</p>
<h2 id="beyond-graphs-and-strings-the-need-for-3d-chemical-generation">Beyond Graphs and Strings: The Need for 3D Chemical Generation</h2>
<p>Molecular design with deep learning has largely relied on two representation paradigms: molecular graphs (processed with graph neural networks) and linearized string representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (processed with sequence models). Both approaches have proven effective for drug-like organic molecules, but they share a fundamental limitation: they cannot represent structures whose identity depends on 3D spatial arrangement.</p>
<p>Crystalline materials, for example, have periodic lattice structures that cannot be reduced to simple graphs. Protein binding sites are defined by the 3D arrangement of hundreds of atoms across multiple residues. For tasks like catalysis design or structure-based drug discovery, the geometric positions of atoms are essential information that graphs and strings discard entirely.</p>
<p>Existing 3D generative models address this gap but typically require specialized architectures with SE(3) equivariance to handle rotational and translational symmetries. This work asks whether the general-purpose sequence modeling capability of transformers is sufficient to learn 3D chemical structure distributions without any domain-specific architectural modifications.</p>
<h2 id="direct-tokenization-of-chemical-file-formats">Direct Tokenization of Chemical File Formats</h2>
<p>The core insight is straightforward: any 3D molecule, crystal, or biomolecule is already stored as text in standard file formats (<a href="https://en.wikipedia.org/wiki/XYZ_file_format">XYZ</a>, <a href="https://en.wikipedia.org/wiki/Crystallographic_Information_File">CIF</a>, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)">PDB</a>). These files encode atom types and their Cartesian coordinates as sequences of characters and numbers. Rather than designing specialized architectures for point cloud generation, the authors simply tokenize these files and train a standard GPT-style transformer to predict the next token.</p>
<p>A molecule with $n$ atoms is represented as:</p>
<p>$$
\mathcal{M} = (e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>where $e_i$ is the element type and $(x_i, y_i, z_i)$ are Cartesian coordinates. Crystals additionally include lattice parameters:</p>
<p>$$
\mathcal{C} = (\ell_a, \ell_b, \ell_c, \alpha, \beta, \gamma, e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>Protein binding sites use residue-atom indicators (e.g., HIS-C, CYS-N) instead of bare element symbols:</p>
<p>$$
\mathcal{P} = (a_1, x_1, y_1, z_1, \dots, a_n, x_n, y_n, z_n)
$$</p>
<p>The language model learns the joint distribution over the resulting $N$-token sequence via the standard autoregressive factorization:</p>
<p>$$
p(x) = \prod_{i=1}^{N} p(t_i \mid t_{i-1}, \dots, t_1)
$$</p>
<p>Two tokenization strategies are explored:</p>
<ol>
<li><strong>Character-level (LM-CH)</strong>: Every character in the file is a token, including digits, minus signs, spaces, and newlines. This produces long sequences but uses a small vocabulary (~30 tokens).</li>
<li><strong>Atom+coordinate-level (LM-AC)</strong>: Each atom placement requires exactly 4 tokens: one element/residue token and three coordinate tokens (e.g., &lsquo;-1.98&rsquo;). The vocabulary is larger (~100-10K tokens) but sequences are shorter.</li>
</ol>
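<p>The difference between the two strategies can be sketched on a single toy XYZ line. The token boundaries below are illustrative; the exact delimiter handling and vocabularies used in the paper may differ:</p>

```python
def tokenize_char(xyz_line: str) -> list[str]:
    """LM-CH: every character, including digits, signs, and spaces, is a token."""
    return list(xyz_line)

def tokenize_atom_coord(xyz_line: str) -> list[str]:
    """LM-AC: one element token plus three coordinate tokens per atom."""
    element, x, y, z = xyz_line.split()
    return [element, x, y, z]

line = "C -1.98 0.38 2.00"
print(len(tokenize_char(line)))   # -> 17 tokens for this single atom line
print(tokenize_atom_coord(line))  # -> ['C', '-1.98', '0.38', '2.00']
```

<p>The 4x reduction in sequence length per atom is the main reason LM-AC scales to larger structures than LM-CH.</p>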
<p>Numerical precision is controlled by rounding coordinates to 1, 2, or 3 decimal places. Since the model lacks rotation and translation invariance, random rotation augmentation during training improves performance.</p>
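<p>A minimal, stdlib-only sketch of that augmentation step, using composed Euler-angle rotations (this sampling is not uniform over SO(3), and the paper does not specify its exact rotation scheme):</p>

```python
import math
import random

def rotation_matrix(ax: float, ay: float, az: float) -> list[list[float]]:
    """Compose rotations about the x, y, and z axes (Euler angles)."""
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    rx = [[1, 0, 0], [0, cx, -sx], [0, sx, cx]]
    ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]
    rz = [[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]
    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]
    return matmul(rz, matmul(ry, rx))

def augment(coords, decimals=2, rng=random):
    """Randomly rotate a structure, then re-round to the token precision."""
    r = rotation_matrix(*(rng.uniform(0, 2 * math.pi) for _ in range(3)))
    return [tuple(round(sum(r[i][k] * p[k] for k in range(3)), decimals)
                  for i in range(3)) for p in coords]
```

<p>Re-rounding after rotation keeps the augmented coordinates on the same discretized grid as the coordinate vocabulary.</p>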
<h2 id="experiments-across-molecules-crystals-and-protein-binding-sites">Experiments Across Molecules, Crystals, and Protein Binding Sites</h2>
<h3 id="molecular-generation-zinc">Molecular Generation (ZINC)</h3>
<p>The model is evaluated on 250K commercially available molecules from the ZINC dataset, with an average of 23 heavy atoms. XYZ files are generated using RDKit&rsquo;s conformer tools. Coordinates use 2 decimal places of precision. The authors generate 10K molecules and evaluate both 3D geometry quality and standard generative metrics.</p>
<p>For 3D geometry assessment, the root mean squared deviation (RMSD) between language-model-generated and RDKit-generated conformers falls between 1.0 and 2.0 for most molecules, with a heavy tail extending to 4.0.</p>
<p>Standard metrics include validity, uniqueness, novelty, and the Wasserstein (earth mover&rsquo;s) distance (WA) between molecular property distributions (QED, SA score, molecular weight).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>3D</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>WA MW</th>
          <th>WA SA</th>
          <th>WA QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Train</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>0.816</td>
          <td>0.013</td>
          <td>0.002</td>
      </tr>
      <tr>
          <td>SM-LM</td>
          <td>No</td>
          <td>98.35</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.640</td>
          <td>0.049</td>
          <td>0.005</td>
      </tr>
      <tr>
          <td>SF-LM</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.772</td>
          <td>0.085</td>
          <td>0.006</td>
      </tr>
      <tr>
          <td>JTVAE</td>
          <td>No</td>
          <td>100.0</td>
          <td>98.56</td>
          <td>100.0</td>
          <td>22.63</td>
          <td>0.126</td>
          <td>0.023</td>
      </tr>
      <tr>
          <td>ENF</td>
          <td>Yes</td>
          <td>1.05</td>
          <td>96.37</td>
          <td>99.72</td>
          <td>168.5</td>
          <td>1.886</td>
          <td>0.160</td>
      </tr>
      <tr>
          <td>G-SchNet</td>
          <td>Yes</td>
          <td>1.20</td>
          <td>55.96</td>
          <td>98.33</td>
          <td>152.7</td>
          <td>1.126</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>Yes</td>
          <td>77.51</td>
          <td>96.40</td>
          <td>95.30</td>
          <td>101.2</td>
          <td>0.939</td>
          <td>0.093</td>
      </tr>
      <tr>
          <td>LM-CH</td>
          <td>Yes</td>
          <td>90.13</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.912</td>
          <td>2.608</td>
          <td>0.077</td>
      </tr>
      <tr>
          <td>LM-AC</td>
          <td>Yes</td>
          <td>98.51</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>1.811</td>
          <td>0.026</td>
          <td>0.004</td>
      </tr>
  </tbody>
</table>
<p>The atom+coordinate tokenization model (LM-AC) achieves 98.51% validity with 100% uniqueness and novelty. Its WA scores for molecular weight (1.811) and QED (0.004) are substantially better than all other 3D generative baselines and competitive with SMILES/SELFIES language models. The character-level model (LM-CH) at 90.13% validity performs comparably to graph-based models but falls short of the string-based language models.</p>
<h3 id="crystal-generation-perov-5-and-mp-20">Crystal Generation (Perov-5 and MP-20)</h3>
<p>Crystal generation uses CIF-derived sequences with 3 decimal places of precision. Two datasets are used: Perov-5 (18,928 <a href="https://en.wikipedia.org/wiki/Perovskite_(structure)">perovskite</a> materials, 5 atoms per unit cell, 56 elements) and MP-20 (45,231 diverse materials, 1-20 atoms per unit cell, 89 elements).</p>
<p>Evaluation metrics include structural validity (minimum interatomic distance &gt; 0.5 angstrom), compositional validity (charge neutrality via SMACT), coverage (recall and precision between generated and test sets), and earth mover&rsquo;s distance for density and number of unique elements.</p>
<table>
  <thead>
      <tr>
          <th>Data</th>
          <th>Model</th>
          <th>Struc. Valid (%)</th>
          <th>Comp. Valid (%)</th>
          <th>COV-R (%)</th>
          <th>COV-P (%)</th>
          <th>WA density</th>
          <th>WA elements</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perov-5</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>98.59</td>
          <td>99.45</td>
          <td>98.46</td>
          <td>0.126</td>
          <td>0.063</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-CH</td>
          <td>100.0</td>
          <td>98.51</td>
          <td>99.60</td>
          <td>99.42</td>
          <td>0.071</td>
          <td>0.036</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-AC</td>
          <td>100.0</td>
          <td>98.79</td>
          <td>98.78</td>
          <td>99.36</td>
          <td>0.089</td>
          <td>0.028</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>86.70</td>
          <td>99.15</td>
          <td>99.49</td>
          <td>0.688</td>
          <td>1.432</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-CH</td>
          <td>84.81</td>
          <td>83.55</td>
          <td>99.25</td>
          <td>97.89</td>
          <td>0.864</td>
          <td>0.132</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-AC</td>
          <td>95.81</td>
          <td>88.87</td>
          <td>99.60</td>
          <td>98.55</td>
          <td>0.696</td>
          <td>0.092</td>
      </tr>
  </tbody>
</table>
<p>On Perov-5, both language models outperform CDVAE across most metrics. On the more diverse MP-20 dataset, LM-AC achieves the best scores on 3 of 6 metrics and remains competitive on the others. LM-CH struggles more with structural validity on MP-20 (84.81%).</p>
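<p>The structural-validity criterion above (minimum interatomic distance &gt; 0.5 angstrom) can be sketched directly. Note that a faithful check for crystals would also consider periodic images of the atoms under the unit cell, which this sketch omits:</p>

```python
import math

MIN_DIST = 0.5  # angstrom threshold for structural validity

def structurally_valid(coords: list[tuple[float, float, float]]) -> bool:
    """True if no two atoms are closer than MIN_DIST.

    Sketch only: periodic boundary conditions are ignored, so this
    understates clashes near unit-cell edges in real crystals.
    """
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < MIN_DIST:
                return False
    return True
```
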
<h3 id="protein-binding-site-generation-pdb">Protein Binding Site Generation (PDB)</h3>
<p>The most challenging task involves generating protein binding sites (~200-250 atoms each) from PDB-derived sequences. The dataset contains approximately 180K protein-ligand pairs. Residue-atom tokenization is used (e.g., CYS-C, CYS-N), with 2 decimal places of precision.</p>
<p>Validity is assessed per-residue using xyz2mol, with an additional check for inter-residue atomic overlap (atoms from different residues closer than the minimum bond distance). Approximately 99% of generated pockets pass the residue validity check, while about 5% fail the overlap check. Of generated pockets, 89.8% have unique residue orderings, and 83.6% have novel orderings not seen in training, indicating the model is generating novel binding site structures rather than memorizing.</p>
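<p>The uniqueness and novelty figures can be computed as simple fractions over residue orderings; the paper&rsquo;s exact normalization (e.g., whether novelty is measured over unique orderings only) is an assumption here:</p>

```python
def ordering_stats(generated: list[tuple[str, ...]],
                   training: set[tuple[str, ...]]) -> tuple[float, float]:
    """Return (fraction of generated orderings that are distinct,
    fraction of generated orderings absent from the training set)."""
    unique = len(set(generated)) / len(generated)
    novel = sum(1 for g in generated if g not in training) / len(generated)
    return unique, novel
```
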
<h2 id="competitive-3d-generation-without-geometric-inductive-biases">Competitive 3D Generation Without Geometric Inductive Biases</h2>
<p>The central finding is that standard transformer language models, without any equivariance or geometric inductive biases, can generate valid 3D chemical structures across three substantially different domains. The atom+coordinate tokenization (LM-AC) consistently outperforms character-level tokenization (LM-CH), likely because it produces shorter sequences and reduces the number of sequential decisions needed per atom placement.</p>
<p>Several limitations are worth noting. The model generates atoms using absolute Cartesian coordinates, which means it must learn rotation and translation invariance purely from data augmentation rather than having it built into the architecture. The authors acknowledge this becomes increasingly difficult as structure size grows. The vocabulary size also scales with coordinate precision and structure complexity, which could become prohibitive for very large systems.</p>
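<p>The precision/vocabulary tradeoff is easy to quantify: if LM-AC assigns one token per distinct rounded coordinate value, each additional decimal place multiplies the coordinate vocabulary by roughly 10. For an illustrative (assumed) &plusmn;10 angstrom box:</p>

```python
def coord_vocab_size(lo: float, hi: float, decimals: int) -> int:
    """Number of distinct coordinate tokens for range [lo, hi] at a
    given decimal precision, assuming one token per rounded value."""
    step = 10 ** decimals
    return int(round(hi * step)) - int(round(lo * step)) + 1

for d in (1, 2, 3):
    print(d, coord_vocab_size(-10.0, 10.0, d))  # 201, 2001, 20001 tokens
```
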
<p>The paper does not include computational cost comparisons with baseline models, making it difficult to assess the practical tradeoff between the simplicity of the language modeling approach and the efficiency of specialized architectures. The authors also note that further validation through computational simulation and experiment is needed to confirm the physical plausibility of generated structures.</p>
<p>Future directions identified include inverse design of molecules and materials conditioned on target properties, extension to more complex structures (metal-organic frameworks), and exploration of alternative tokenization strategies to handle larger systems.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC</td>
          <td>250K molecules</td>
          <td>~23 heavy atoms avg; XYZ files via RDKit conformer generation</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Perov-5</td>
          <td>18,928 perovskites</td>
          <td>5 atoms/unit cell, 56 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>MP-20</td>
          <td>45,231 materials</td>
          <td>1-20 atoms/unit cell, 89 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Protein binding sites</td>
          <td>~180K protein-ligand pairs</td>
          <td>Processed to 200-250 atoms per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-style transformer with ~1M to 100M parameters</li>
<li><strong>Layers</strong>: 12</li>
<li><strong>Embedding size</strong>: 128 to 1024</li>
<li><strong>Attention heads</strong>: 4 to 12</li>
<li><strong>Batch size</strong>: 4 to 32 structures</li>
<li><strong>Learning rate</strong>: $10^{-4}$ to $10^{-5}$, decayed to $9 \times 10^{-6}$</li>
<li><strong>Data augmentation</strong>: Random rotation of training structures at each epoch</li>
<li><strong>Numerical precision</strong>: 2 decimal places (molecules, proteins), 3 decimal places (crystals)</li>
</ul>
<h3 id="models">Models</h3>
<p>No pre-trained model weights are publicly available. The paper mentions &ldquo;Example code can be found at&rdquo; but the URL appears to be missing from the published version.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Molecules</td>
          <td>xyz2mol produces valid RDKit Mol object</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Crystals</td>
          <td>Structural (min distance &gt; 0.5 angstrom) and compositional (charge neutral)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>All</td>
          <td>Fraction of distinct generated structures</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>All</td>
          <td>Fraction not in training set</td>
      </tr>
      <tr>
          <td>Earth mover&rsquo;s distance</td>
          <td>All</td>
          <td>Distribution match for domain-specific properties</td>
      </tr>
      <tr>
          <td>RMSD</td>
          <td>Molecules</td>
          <td>Deviation from RDKit conformer geometries</td>
      </tr>
      <tr>
          <td>Coverage</td>
          <td>Crystals</td>
          <td>Recall and precision between generated and test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Models were trained on Compute Canada systems. Specific GPU types, counts, and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository, model weights, or datasets specific to this work were found. The ZINC, Perov-5, and MP-20 datasets used for evaluation are publicly available from their original sources.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D. &amp; Aspuru-Guzik, A. (2023). Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files. <em>arXiv preprint arXiv:2305.05708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2023language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can generate molecules, materials, and protein binding sites directly in three dimensions as {XYZ}, {CIF}, and {PDB} files}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2305.05708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLM4Mol: ChatGPT Captions as Molecular Representations</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/llm4mol-captions-as-representations/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/llm4mol-captions-as-representations/</guid><description>LLM4Mol uses ChatGPT to generate text explanations for SMILES strings and fine-tunes RoBERTa on these captions for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="llm-generated-text-as-molecular-representations">LLM-Generated Text as Molecular Representations</h2>
<p>This is a <strong>Method</strong> paper that proposes using large language models (specifically ChatGPT) to generate natural language explanations for molecules represented as SMILES strings, and then using those explanations as input representations for downstream molecular property prediction. The approach is called <strong>Captions as new Representations (CaR)</strong>. The authors also evaluate ChatGPT directly on zero-shot and few-shot molecular classification to gauge in-context learning ability on chemical data.</p>
<h2 id="bridging-molecular-data-and-natural-language-understanding">Bridging Molecular Data and Natural Language Understanding</h2>
<p>Molecular property prediction is central to <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a>, drug discovery, and materials design. Molecules are typically represented either as graphs (processed by GNNs) or as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> (processed by NLP-based methods). While both paradigms have shown success, they do not directly use the broad world knowledge embedded in large language models.</p>
<p>LLMs such as ChatGPT demonstrate strong capabilities in text understanding and can generate informative descriptions when given SMILES strings, including functional groups, chemical properties, and potential pharmaceutical applications. The question motivating this work is whether LLM-generated textual descriptions can serve as better molecular representations than raw SMILES or graph encodings for property prediction tasks.</p>
<p>Prior work had not systematically explored two directions: (1) whether LLMs can perform molecular classification via in-context learning, and (2) whether LLM-generated captions can serve as transferable representations for small downstream models.</p>
<h2 id="captions-as-representations-car">Captions as Representations (CaR)</h2>
<p>The core contribution is the CaR framework, which operates in two stages:</p>
<ol>
<li>
<p><strong>Caption generation</strong>: Given a molecule&rsquo;s SMILES string, ChatGPT is prompted to produce a detailed textual explanation covering functional groups, chemical properties, and potential applications.</p>
</li>
<li>
<p><strong>Fine-tuning a small LM</strong>: The generated text explanations replace the original SMILES as input to a pre-trained language model (e.g., RoBERTa). This small LM is then fine-tuned on downstream classification or regression tasks.</p>
</li>
</ol>
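<p>The data-preparation side of stage 1 can be sketched as below. <code>query_chatgpt</code> is a stand-in for the actual API call, and the prompt wording is an assumption, not the authors&rsquo; exact prompt:</p>

```python
# CaR stage 1 sketch: replace each SMILES with a generated caption.
PROMPT = ("You are an expert chemist. Describe the molecule with SMILES "
          "{smiles}: its functional groups, chemical properties, and "
          "potential applications.")

def query_chatgpt(prompt: str) -> str:
    """Placeholder for the real ChatGPT API call (hypothetical stub)."""
    return "Contains a nitro group; potentially mutagenic."

def build_caption_dataset(smiles_labels: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Produce the (caption, label) pairs that the small LM is fine-tuned on."""
    return [(query_chatgpt(PROMPT.format(smiles=s)), y) for s, y in smiles_labels]
```
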
<p>The insight is that ChatGPT&rsquo;s world knowledge can enrich the molecular representation with semantically meaningful features that raw SMILES lack. For example, on the PTC (Predictive Toxicology Challenge) dataset, the authors performed keyword searches for terms like &ldquo;toxicity&rdquo;, &ldquo;cancer&rdquo;, and &ldquo;harmful&rdquo; in the ChatGPT-generated explanations and found that these keywords appeared predominantly in entries labeled as toxic, indicating that the generated captions carry predictive signal.</p>
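<p>That keyword analysis is straightforward to reproduce in spirit; the term list comes from the paper, but the matching rule (lowercase substring search) is an assumption:</p>

```python
KEYWORDS = ("toxicity", "cancer", "harmful")  # terms the authors searched for

def keyword_hit_rate(captions: list[str], labels: list[int]) -> dict[int, float]:
    """Fraction of captions per label containing any keyword, as a rough
    proxy for the caption/label correlation reported on PTC."""
    hits = {0: 0, 1: 0}
    counts = {0: 0, 1: 0}
    for text, y in zip(captions, labels):
        counts[y] += 1
        if any(k in text.lower() for k in KEYWORDS):
            hits[y] += 1
    return {y: hits[y] / counts[y] for y in counts if counts[y]}
```
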
<p>The authors also explore <strong>in-context molecular classification</strong>, where ChatGPT is directly prompted with zero or few examples to classify molecules. This serves as a preliminary evaluation of LLM reasoning capabilities on molecular data.</p>
<h2 id="experimental-setup-and-benchmarks">Experimental Setup and Benchmarks</h2>
<h3 id="datasets">Datasets</h3>
<p>The evaluation spans 9 datasets across classification and regression:</p>
<ul>
<li><strong>Classification (TUDataset)</strong>: MUTAG, PTC, AIDS</li>
<li><strong>Classification (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>)</strong>: SIDER, ClinTox, BACE, BBBP</li>
<li><strong>Regression (MoleculeNet)</strong>: ESOL, <a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>Baselines include GNN-based methods (GCN, GIN, ChebyNet, D-MPNN, GraphMVP, InfoGraph, G-Motif, Mole-BERT) and SMILES-based methods (ECFP4-MLP, <a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">SMILES-Transformer</a>, MolR, <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, MolKD).</p>
<h3 id="splitting-strategies">Splitting Strategies</h3>
<ul>
<li><strong>Random splitting</strong>: 8/1/1 train/validate/test with 10-fold cross-validation</li>
<li><strong>Scaffold splitting</strong>: 5 random seeds, reported as mean and standard deviation</li>
</ul>
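<p>A minimal sketch of the random 8/1/1 split; scaffold splitting (not shown) instead groups molecules by their Bemis-Murcko scaffolds, typically via RDKit, so that test-set scaffolds never appear in training:</p>

```python
import random

def random_split(items: list, seed: int = 0,
                 frac=(0.8, 0.1, 0.1)) -> tuple[list, list, list]:
    """Shuffle and partition into train/validation/test (8/1/1 by default)."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(frac[0] * n)
    n_val = int(frac[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```
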
<h3 id="key-results-random-splitting">Key Results: Random Splitting</h3>
<p>Under random splitting, CaR-RoBERTa achieves the best results on almost all datasets:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>MUTAG (ACC)</th>
          <th>PTC (ACC)</th>
          <th>AIDS (ACC)</th>
          <th>SIDER (AUC)</th>
          <th>ClinTox (AUC)</th>
          <th>ESOL (RMSE)</th>
          <th>Lipo (RMSE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCN</td>
          <td>90.00</td>
          <td>62.57</td>
          <td>78.68</td>
          <td>64.24</td>
          <td>91.88</td>
          <td>0.77</td>
          <td>0.80</td>
      </tr>
      <tr>
          <td>GIN</td>
          <td>89.47</td>
          <td>58.29</td>
          <td>78.01</td>
          <td>66.19</td>
          <td>92.08</td>
          <td>0.67</td>
          <td>0.79</td>
      </tr>
      <tr>
          <td>ECFP4-MLP</td>
          <td>96.84</td>
          <td>85.71</td>
          <td>94.64</td>
          <td>90.19</td>
          <td>95.81</td>
          <td>0.60</td>
          <td>0.60</td>
      </tr>
      <tr>
          <td>CaR-RoBERTa</td>
          <td>91.05</td>
          <td>93.14</td>
          <td>94.37</td>
          <td>88.81</td>
          <td>99.80</td>
          <td>0.45</td>
          <td>0.47</td>
      </tr>
  </tbody>
</table>
<p>CaR-RoBERTa improves over the best GNN by up to 53% on PTC and reduces RMSE by 35-37% on regression tasks. However, ECFP4-MLP outperforms CaR on MUTAG (96.84 vs. 91.05).</p>
<h3 id="key-results-scaffold-splitting">Key Results: Scaffold Splitting</h3>
<p>Under the more challenging scaffold splitting:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>SIDER (AUC)</th>
          <th>ClinTox (AUC)</th>
          <th>BACE (AUC)</th>
          <th>BBBP (AUC)</th>
          <th>ESOL (RMSE)</th>
          <th>Lipo (RMSE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GraphMVP-C</td>
          <td>63.90</td>
          <td>77.50</td>
          <td>81.20</td>
          <td>72.40</td>
          <td>1.03</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>Mole-BERT</td>
          <td>62.80</td>
          <td>78.90</td>
          <td>80.80</td>
          <td>71.90</td>
          <td>1.02</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>MolKD</td>
          <td>61.30</td>
          <td>83.80</td>
          <td>80.10</td>
          <td>74.80</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>CaR-RoBERTa</td>
          <td>58.06</td>
          <td>84.16</td>
          <td>80.73</td>
          <td>81.99</td>
          <td>0.96</td>
          <td>1.02</td>
      </tr>
  </tbody>
</table>
<p>Results are more mixed under scaffold splitting. CaR achieves the best performance on ClinTox (+30% over GNNs) and BBBP (+15%), but underperforms on SIDER and Lipophilicity.</p>
<h3 id="few-shot-classification-with-chatgpt">Few-Shot Classification with ChatGPT</h3>
<p>Direct few-shot classification with ChatGPT shows mixed results. On MUTAG, ChatGPT underperforms classical methods across all shot counts; on PTC, it outperforms GNNs in the few-shot regime. Performance generally improves as the number of shots increases, but results are inconsistent across different prompts.</p>
<h3 id="replacing-the-small-lm">Replacing the Small LM</h3>
<p>The authors test CaR with different downstream models: RoBERTa, DeBERTa, and an adaptive language model for molecules. Pre-trained models all perform similarly, and all outperform a DeBERTa trained from scratch, validating that CaR&rsquo;s effectiveness comes from the caption quality rather than the specific choice of downstream model.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>ChatGPT-generated text explanations serve as effective molecular representations, outperforming GNNs and SMILES-based methods on most benchmarks under random splitting.</li>
<li>ChatGPT has some capacity for few-shot molecular classification, but performance is inconsistent and prompt-sensitive.</li>
<li>The CaR approach is model-agnostic: different pre-trained small LMs achieve similar results when fine-tuned on the generated captions.</li>
<li>Under scaffold splitting, CaR shows strong results on some datasets (ClinTox, BBBP) but underperforms on others (SIDER, Lipophilicity).</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ul>
<li><strong>Single LLM</strong>: Only ChatGPT was used. Other LLMs (GPT-4, domain-specific models like MolReGPT) were not evaluated.</li>
<li><strong>No graph structure integration</strong>: CaR treats molecular prediction purely as an NLP task and does not incorporate structural graph information, which is known to be important for molecular properties.</li>
<li><strong>Limited to small molecules</strong>: The approach works only for molecules representable as SMILES. Proteins, antibodies, and other large biomolecules with 3D structure are not addressed.</li>
</ul>
<h3 id="additional-considerations">Additional Considerations</h3>
<p>The random splitting results are notably strong, but random splits tend to overestimate performance compared to scaffold splits, which test generalization to structurally novel molecules. The high variance on some scaffold-split results (e.g., ClinTox with 17.63 standard deviation) suggests instability. The reliance on a proprietary API (ChatGPT) also limits reproducibility and introduces cost constraints for large-scale applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification</td>
          <td>MUTAG (TUDataset)</td>
          <td>188 molecules</td>
          <td>Mutagenicity prediction</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>PTC (TUDataset)</td>
          <td>344 molecules</td>
          <td>Predictive toxicology</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>AIDS (TUDataset)</td>
          <td>2,000 molecules</td>
          <td>HIV activity</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>SIDER (MoleculeNet)</td>
          <td>1,427 molecules</td>
          <td>Side effect prediction</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>ClinTox (MoleculeNet)</td>
          <td>1,478 molecules</td>
          <td>Clinical trial toxicity</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE (MoleculeNet)</td>
          <td>1,513 molecules</td>
          <td><a href="https://en.wikipedia.org/wiki/Beta-secretase_1">Beta-secretase</a> inhibition</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP (MoleculeNet)</td>
          <td>2,039 molecules</td>
          <td>Blood-brain barrier penetration</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL (MoleculeNet)</td>
          <td>1,128 molecules</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity (MoleculeNet)</td>
          <td>4,200 molecules</td>
          <td>Lipophilicity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>ChatGPT (GPT-3.5) generates textual explanations for SMILES strings</li>
<li>RoBERTa is fine-tuned on generated captions using HuggingFace Transformers with default parameters</li>
<li>10-fold cross-validation for random split; 5 random seeds for scaffold split</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>ChatGPT (GPT-3.5) for caption generation</li>
<li>RoBERTa-base for downstream fine-tuning (default HuggingFace parameters)</li>
<li>DeBERTa and adaptive-lm-molecules tested as alternatives</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: accuracy (ACC) and ROC-AUC</li>
<li>Regression: RMSE</li>
<li>Mean and standard deviation reported across folds/seeds</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ChnQ/LLM4Mol">LLM4Mol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, C., Tang, H., Yang, Z., Liang, H., &amp; Liu, Y. (2023). Can Large Language Models Empower Molecular Property Prediction? <em>arXiv preprint arXiv:2307.07443</em>. <a href="https://arxiv.org/abs/2307.07443">https://arxiv.org/abs/2307.07443</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2023can,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Can Large Language Models Empower Molecular Property Prediction?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Chen and Tang, Huayi and Yang, Zhirui and Liang, Hong and Liu, Yong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2307.07443}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2307.07443}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLM-Prop: Predicting Crystal Properties from Text</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</guid><description>LLM-Prop fine-tunes the T5 encoder on crystal text descriptions to predict band gap, volume, and other properties, outperforming GNN baselines.</description><content:encoded><![CDATA[<h2 id="text-based-crystal-property-prediction-with-llms">Text-Based Crystal Property Prediction with LLMs</h2>
<p>LLM-Prop is a <strong>Method</strong> paper that proposes using the encoder portion of <a href="https://en.wikipedia.org/wiki/T5_(language_model)">T5</a> (a general-purpose language model) fine-tuned on crystal text descriptions to predict physical and electronic properties of crystalline materials. The primary contribution is demonstrating that text-based representations of crystals, generated by Robocrystallographer, can serve as effective inputs for <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, outperforming graph neural network (GNN) baselines on several tasks despite using a non-domain-specific pre-trained model with fewer parameters.</p>
<h2 id="why-text-instead-of-crystal-graphs">Why Text Instead of Crystal Graphs?</h2>
<p>Graph neural networks have been the dominant approach for crystal property prediction. Models like CGCNN, MEGNet, and ALIGNN represent crystals as graphs where atoms are nodes and bonds are edges. However, GNNs face several fundamental challenges for crystals:</p>
<ol>
<li><strong>Periodicity encoding</strong>: Crystals have repetitive unit cell arrangements that are distinct from standard molecular graphs, and GNNs struggle to encode this periodicity efficiently.</li>
<li><strong>Information incorporation</strong>: Critical structural information like bond angles, <a href="https://en.wikipedia.org/wiki/Space_group">space group</a> symmetry, and <a href="https://en.wikipedia.org/wiki/Wyckoff_positions">Wyckoff sites</a> is difficult to incorporate into graph representations.</li>
<li><strong>Expressiveness</strong>: Graphs may lack the expressiveness needed to convey complex crystal information relevant to property prediction.</li>
</ol>
<p>Meanwhile, textual descriptions of crystals (generated by tools like Robocrystallographer) naturally encode space group information, bond geometries, coordination environments, and symmetry details in human-readable form. Despite this richness, text-based approaches for crystal property prediction had been largely unexplored.</p>
<h2 id="core-innovation-t5-encoder-with-careful-fine-tuning">Core Innovation: T5 Encoder with Careful Fine-Tuning</h2>
<p>The key insight of LLM-Prop is to take a pre-trained encoder-decoder model (<a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-small) and discard the decoder entirely, using only the encoder with a linear prediction head. This design has several advantages:</p>
<ul>
<li>Discarding the decoder roughly halves the parameter count (from ~60M to ~37M), freeing memory to process longer input sequences</li>
<li>Longer sequences mean more crystal information can be included</li>
<li>The encoder-only approach avoids T5&rsquo;s known weakness at regression in text-to-text format</li>
</ul>
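<p>A toy sketch of this design (the class, layer sizes, and readout below are illustrative stand-ins, not the authors' code; in practice the encoder is HuggingFace's pre-trained <code>T5EncoderModel</code> for <code>t5-small</code>):</p>

```python
import torch
import torch.nn as nn

# Illustrative stand-in for LLM-Prop's architecture: an encoder whose output at
# the prepended [CLS] position feeds a single linear regression head.
class EncoderRegressor(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)  # scalar property, e.g. band gap in eV

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))
        return self.head(hidden[:, 0, :]).squeeze(-1)  # read out [CLS] at position 0

model = EncoderRegressor()
pred = model(torch.randint(0, 32_000, (4, 888)))  # batch of 4 crystal descriptions
print(pred.shape)  # torch.Size([4])
```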
<p>The framework applies several preprocessing strategies to the crystal text descriptions:</p>
<ol>
<li><strong>Stopword removal</strong>: Standard English stopwords are removed, except digits and symbols carrying chemical information</li>
<li><strong>Numerical token replacement</strong>: Bond distances are replaced with a <code>[NUM]</code> token and bond angles with <code>[ANG]</code>, reducing sequence length while preserving structural cues</li>
<li><strong>[CLS] token prepending</strong>: A classification token is added at the start, and its learned embedding is used as input to the prediction layer</li>
<li><strong>Label scaling</strong>: For regression tasks, targets are normalized using z-score, min-max, or log normalization</li>
</ol>
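<p>A hedged sketch of the numerical-token replacement and [CLS] prepending (the regular expressions are assumptions about Robocrystallographer's phrasing, not the paper's exact pipeline):</p>

```python
import re

def preprocess(description: str) -> str:
    # Replace bond angles with [ANG] and bond lengths with [NUM], then
    # prepend the [CLS] token whose embedding feeds the prediction head.
    text = re.sub(r"\d+(\.\d+)?\s*(degrees|°)", "[ANG]", description)
    text = re.sub(r"\d+(\.\d+)?\s*(Å|A)\b", "[NUM]", text)
    return "[CLS] " + text

print(preprocess("All Si-O bond lengths are 1.61 Å. The O-Si-O angles are 109 degrees."))
# [CLS] All Si-O bond lengths are [NUM]. The O-Si-O angles are [ANG].
```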
<p>The normalization schemes are defined as:</p>
<p>$$
\hat{Y}_{i}(\text{z-score}) = \frac{Y_{i} - \mu}{\sigma}
$$</p>
<p>$$
\hat{Y}_{i}(\text{min-max}) = \frac{Y_{i} - Y_{\min}}{Y_{\max} - Y_{\min}}
$$</p>
<p>$$
\hat{Y}_{i}(\text{log-norm}) = \log(Y_{i} + 1)
$$</p>
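<p>These three label-scaling schemes translate directly into NumPy (the example targets are made up):</p>

```python
import numpy as np

def z_score(y):
    return (y - y.mean()) / y.std()

def min_max(y):
    return (y - y.min()) / (y.max() - y.min())

def log_norm(y):
    return np.log(y + 1.0)

y = np.array([0.0, 1.0, 3.0])  # e.g. band gaps in eV
print(min_max(y))  # scales the targets into [0, 1]
```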
<p>The tokenizer is also retrained on the crystal text corpus with a vocabulary size of 32k, and the special tokens <code>[NUM]</code>, <code>[ANG]</code>, and <code>[CLS]</code> are added to the vocabulary.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="dataset-textedge">Dataset: TextEdge</h3>
<p>The authors collected data from the <a href="https://en.wikipedia.org/wiki/Materials_Project">Materials Project</a> database (as of November 2022), yielding 144,931 crystal structure-description pairs split into 125,098 training, 9,945 validation, and 9,888 test samples. Crystal text descriptions were generated using Robocrystallographer. The dataset covers six prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Type</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Band gap (eV)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Unit cell volume (Å³/cell)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Formation energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy above hull (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Is-gap-direct</td>
          <td>Classification</td>
          <td>AUC (higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<p>Seven baselines were compared:</p>
<ul>
<li><strong>GNN-based</strong>: CGCNN, MEGNet, ALIGNN, DeeperGATGNN</li>
<li><strong>Classic ML</strong>: XGBoost, Random Forest (on Robocrystallographer features)</li>
<li><strong>Text-based</strong>: MatBERT (domain-specific pre-trained BERT, ~110M parameters)</li>
</ul>
<p>All models were trained and evaluated on the same dataset splits for fair comparison. GNN models were retrained on the new data rather than using results from older, smaller Materials Project versions.</p>
<h3 id="main-results-llm-prop-vs-gnn-baselines">Main Results: LLM-Prop vs. GNN Baselines</h3>
<p>When using crystal text descriptions as input, LLM-Prop achieved:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>0.293</td>
          <td>188.834</td>
          <td>0.046</td>
          <td>0.082</td>
          <td>0.040</td>
          <td>0.830</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>0.304</td>
          <td>297.948</td>
          <td>0.077</td>
          <td>0.056</td>
          <td>0.051</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>0.250</td>
          <td>129.580</td>
          <td>0.027</td>
          <td>0.059</td>
          <td>0.028</td>
          <td>0.678</td>
      </tr>
      <tr>
          <td>DeeperGATGNN</td>
          <td>0.291</td>
          <td>111.857</td>
          <td>0.081</td>
          <td>0.116</td>
          <td>0.045</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>LLM-Prop (Descr.)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.252</strong></td>
          <td>0.056</td>
          <td>0.067</td>
          <td>0.047</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>LLM-Prop outperformed the best GNN baseline on each task by approximately 8% on <a href="https://en.wikipedia.org/wiki/Band_gap">band gap</a> prediction (vs. ALIGNN), 65% on volume prediction (vs. DeeperGATGNN), and 3% on band gap classification, Is-gap-direct (vs. CGCNN). For formation energy per atom, energy per atom, and energy above hull, ALIGNN retained the advantage.</p>
<h3 id="llm-prop-vs-matbert">LLM-Prop vs. MatBERT</h3>
<p>LLM-Prop also outperformed MatBERT (a domain-specific pre-trained BERT) across all tasks despite having roughly 3x fewer parameters. The table below shows the best result for each model across the three input preprocessing strategies (w/ Numbers, w/o Numbers, w/ [NUM]&amp;[ANG]):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MatBERT (best)</td>
          <td>0.258</td>
          <td>54.969</td>
          <td>0.071</td>
          <td>0.098</td>
          <td>0.050</td>
          <td>0.722</td>
      </tr>
      <tr>
          <td>LLM-Prop (best)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.138</strong></td>
          <td><strong>0.056</strong></td>
          <td><strong>0.067</strong></td>
          <td><strong>0.047</strong></td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: LLM-Prop&rsquo;s best band gap (0.231) comes from the &ldquo;w/o Numbers&rdquo; configuration, while the best volume (39.138) comes from &ldquo;w/ Numbers&rdquo;. The best Is-gap-direct AUC (0.857) uses the &ldquo;[NUM]&amp;[ANG]&rdquo; configuration.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The contribution of each preprocessing strategy was evaluated:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Band gap</th>
          <th>Volume</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LLM-Prop (baseline)</td>
          <td>0.256</td>
          <td>69.352</td>
          <td>0.796</td>
      </tr>
      <tr>
          <td>+ modified tokenizer</td>
          <td>0.247</td>
          <td>78.632</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>+ label scaling</td>
          <td>0.242</td>
          <td>44.515</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>+ [CLS] token</td>
          <td>0.231</td>
          <td>39.520</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>+ [NUM] token</td>
          <td>0.251</td>
          <td>86.090</td>
          <td>0.793</td>
      </tr>
      <tr>
          <td>+ [ANG] token</td>
          <td>0.242</td>
          <td>64.965</td>
          <td>0.810</td>
      </tr>
      <tr>
          <td>- stopwords</td>
          <td>0.252</td>
          <td>56.593</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>LLM-Prop+all (no space group)</td>
          <td>0.235</td>
          <td>97.457</td>
          <td>0.705</td>
      </tr>
      <tr>
          <td>LLM-Prop+all</td>
          <td><strong>0.229</strong></td>
          <td>42.259</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>The [CLS] token provided the single largest improvement across all tasks. Label scaling was critical for volume prediction (reducing MAE from 69.352 to 44.515). Removing space group information from descriptions degraded volume prediction dramatically (from 42.259 to 97.457), confirming that space group symmetry is a key factor.</p>
<h3 id="data-efficiency-and-transfer-learning">Data Efficiency and Transfer Learning</h3>
<p>LLM-Prop achieved SOTA results on band gap and volume prediction with only about 90k training samples (35k fewer than baselines). For volume prediction specifically, LLM-Prop outperformed all GNN baselines with just 30k training samples.</p>
<p>Transfer learning experiments showed that LLM-Prop transferred well between band gap and volume prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Volume-to-Band gap (Test)</th>
          <th>Band gap-to-Volume (Test)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN-transfer</td>
          <td>0.295</td>
          <td>182.997</td>
      </tr>
      <tr>
          <td>ALIGNN-transfer</td>
          <td>0.322</td>
          <td>136.164</td>
      </tr>
      <tr>
          <td>MatBERT-transfer</td>
          <td>0.266</td>
          <td>54.289</td>
      </tr>
      <tr>
          <td>LLM-Prop-transfer</td>
          <td><strong>0.244</strong></td>
          <td><strong>50.753</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Text descriptions of crystals carry rich structural information (space groups, Wyckoff sites, coordination geometries) that is difficult to encode in graphs but naturally expressed in text</li>
<li>A carefully fine-tuned general-purpose LLM encoder can outperform domain-specific pre-trained models, challenging the assumption that in-domain pre-training is always necessary</li>
<li>Removing numerical information (bond distances and angles) from descriptions often improves performance, because current LLMs treat numbers as regular tokens without understanding their quantitative meaning</li>
<li>Longer input sequences correlate with better performance, with 888 tokens as the default maximum on the hardware used</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The origin of LLM-Prop&rsquo;s performance advantage over GNNs is not fully understood. It remains unclear whether the boost comes from additional structured information in text or from the different data modality itself</li>
<li>LLM-Prop cannot perform zero-shot predictions since T5 was not pre-trained on materials science data</li>
<li>The approach depends on Robocrystallographer to generate text descriptions, adding a preprocessing dependency</li>
<li>Current LLMs&rsquo; inability to reason about numerical values limits the use of quantitative information in descriptions</li>
</ul>
<p><strong>Future directions</strong> suggested by the authors include investigating techniques to use <a href="/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">CIF files</a> directly as LLM inputs, developing new GNN architectures that incorporate space group and Wyckoff site information, and further exploring which information in crystal descriptions contributes most to each property prediction task.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>TextEdge</td>
          <td>144,931 crystals</td>
          <td>From Materials Project (Nov 2022), text generated by Robocrystallographer</td>
      </tr>
      <tr>
          <td>Training split</td>
          <td>TextEdge</td>
          <td>125,098</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Validation split</td>
          <td>TextEdge</td>
          <td>9,945</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Test split</td>
          <td>TextEdge</td>
          <td>9,888</td>
          <td>Random split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam with one-cycle learning rate scheduler</li>
<li><strong>Learning rate</strong>: 1e-3 for LLM-Prop, 5e-5 for MatBERT</li>
<li><strong>Dropout</strong>: 0.2 for LLM-Prop, 0.5 for MatBERT</li>
<li><strong>Batch size</strong>: 64 (888 tokens) or 16 (2000 tokens) for LLM-Prop</li>
<li><strong>Epochs</strong>: 200-300 depending on task</li>
<li><strong>Loss</strong>: MAE for regression, BCE for classification</li>
<li><strong>Evaluation</strong>: MAE for regression, AUC for classification</li>
<li><strong>Runs</strong>: each model evaluated 5 times on the test set, with averaged MAE reported</li>
</ul>
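<p>A minimal sketch of the Adam + one-cycle + MAE combination described above (the toy model, data, and step count are placeholders):</p>

```python
import torch

# Toy stand-in for the encoder + head; only the optimizer/scheduler/loss
# wiring mirrors the reported setup.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, total_steps=100)

for _ in range(100):
    pred = model(torch.randn(64, 10))
    loss = torch.nn.functional.l1_loss(pred, torch.zeros(64, 1))  # MAE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # lr warms up to max_lr, then anneals toward ~0
```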
<h3 id="models">Models</h3>
<ul>
<li><strong>Base model</strong>: T5-small encoder (~60M parameters total, ~37M after discarding decoder and adding prediction head)</li>
<li><strong>Vocabulary size</strong>: 32k (retrained tokenizer)</li>
<li><strong>Max input tokens</strong>: 888 (default) or 2000</li>
<li><strong>Special tokens</strong>: [CLS], [NUM], [ANG]</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/vertaix/LLM-Prop">LLM-Prop</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://drive.google.com/drive/folders/1YCDBzwjwNRIc1FRkB662G3Y5AOWaokUG">TextEdge + Checkpoints</a></td>
          <td>Dataset + Model</td>
          <td>Not specified</td>
          <td>Benchmark dataset and trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: NVIDIA RTX A6000</li>
<li><strong>Training time</strong>: ~40 minutes per epoch for LLM-Prop</li>
<li><strong>Inference</strong>: ~1 minute for 10,000 materials on one GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rubungo, A. N., Arnold, C. B., Rand, B. P., &amp; Dieng, A. B. (2025). LLM-Prop: predicting the properties of crystalline materials using large language models. <em>npj Computational Materials</em>, 11, 186. <a href="https://doi.org/10.1038/s41524-025-01536-2">https://doi.org/10.1038/s41524-025-01536-2</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rubungo2025llmprop,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LLM-Prop: predicting the properties of crystalline materials using large language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rubungo, Andre Niyongabo and Arnold, Craig B. and Rand, Barry P. and Dieng, Adji Bousso}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{npj Computational Materials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{186}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41524-025-01536-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Link-INVENT: RL-Driven Molecular Linker Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/link-invent-generative-linker-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/link-invent-generative-linker-design/</guid><description>Link-INVENT extends REINVENT for molecular linker design using RNN-based generation and reinforcement learning with flexible multi-parameter scoring.</description><content:encoded><![CDATA[<h2 id="a-method-for-generative-linker-design-with-reinforcement-learning">A Method for Generative Linker Design with Reinforcement Learning</h2>
<p>Link-INVENT is a <strong>Method</strong> paper that introduces a generative model for molecular linker design built on the <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo design platform. The primary contribution is an encoder-decoder recurrent neural network (RNN) architecture that generates SMILES-based linkers connecting two molecular subunits, combined with a flexible multi-parameter optimization (MPO) scoring function and reinforcement learning (RL) to steer generation toward desired properties. Link-INVENT targets three practical drug discovery tasks: fragment linking, scaffold hopping, and <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">proteolysis targeting chimera</a> (PROTAC) design.</p>
<h2 id="why-linker-design-needs-flexible-multi-parameter-optimization">Why Linker Design Needs Flexible Multi-Parameter Optimization</h2>
<p>Generating suitable chemical linkers between molecular subunits is a central challenge in <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-based drug discovery</a> (FBDD), scaffold hopping, and PROTAC design. Traditional computational approaches rely on database searches, which inherently restricts proposed linkers to a pre-defined collection. Recent deep learning methods (DeLinker, SyntaLinker, 3DLinker, DiffLinker) can generate novel linkers but offer limited support for optimizing specific physicochemical properties. Users can typically control only linker length and a few properties like hydrogen-bond donor count.</p>
<p>The key gaps that Link-INVENT addresses are:</p>
<ol>
<li><strong>Conditioning on both subunits</strong>: Prior RNN-based approaches (SAMOA) generate linkers conditioned only on the SMILES sequence seen so far, which may not account for the second molecular subunit. Link-INVENT conditions on both warheads simultaneously.</li>
<li><strong>Flexible scoring</strong>: Existing DL-based linker design tools lack the ability to define tailored MPO objectives. Link-INVENT inherits <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 4&rsquo;s</a> full scoring infrastructure and adds linker-specific properties.</li>
<li><strong>Generalizability</strong>: A single trained prior handles fragment linking, scaffold hopping, and PROTAC tasks without retraining.</li>
</ol>
<h2 id="core-innovation-conditional-linker-generation-with-augmented-likelihood-rl">Core Innovation: Conditional Linker Generation with Augmented Likelihood RL</h2>
<p>Link-INVENT&rsquo;s architecture is an encoder-decoder RNN adapted from the Lib-INVENT library design model. The encoder processes a pair of warheads (molecular subunits with defined exit vectors), and the decoder generates a linker token by token, yielding a connected molecule in SMILES format. The model uses three hidden layers of 512 LSTM cells with an embedding size of 256.</p>
<h3 id="training">Training</h3>
<p>The prior is trained on ChEMBL v27 data processed through reaction-based slicing to generate (linker, warheads pair, full molecule) tuples. <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES randomization</a> augments the training data at each epoch, improving chemical space generalizability. The prior is trained by maximizing the likelihood of generating a linker conditioned on the input warhead pair, with teacher forcing for stability.</p>
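<p>Under teacher forcing, this objective reduces to a negative log-likelihood over linker tokens; a toy sketch with made-up per-step probabilities:</p>

```python
import math

def teacher_forced_nll(step_probs):
    # NLL of one linker under teacher forcing: at each step the ground-truth
    # prefix is fed back in, and we score the model's probability of the
    # true next token.
    return -sum(math.log(p) for p in step_probs)

probs = [0.9, 0.7, 0.95, 0.8]  # made-up P(true token | warheads, true prefix)
print(round(teacher_forced_nll(probs), 3))  # 0.736
```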
<h3 id="multi-parameter-optimization-via-rl">Multi-Parameter Optimization via RL</h3>
<p>The scoring function $S(x)$ is a weighted geometric mean of individual component scores:</p>
<p>$$
S(x) = \left(\prod_{i=1}^{n} C_{i}(x)^{w_{i}}\right)^{\frac{1}{\sum_{i=1}^{n} w_{i}}}
$$</p>
<p>where $x$ is a sampled linked molecule, $C_{i}(x)$ is the score for the $i$-th component, and $w_{i}$ is its weight.</p>
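<p>A direct implementation of this weighted geometric mean (the component scores and weights below are illustrative):</p>

```python
import math

def mpo_score(component_scores, weights):
    # Weighted geometric mean of per-component scores C_i(x) in [0, 1],
    # computed in log space for numerical stability.
    log_sum = sum(w * math.log(max(c, 1e-12)) for c, w in zip(component_scores, weights))
    return math.exp(log_sum / sum(weights))

# Illustrative components: a docking-score transform at 0.8 (weight 2), QED at 0.5 (weight 1).
print(round(mpo_score([0.8, 0.5], [2, 1]), 3))  # 0.684
```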
<p>The agent (initialized as a copy of the prior) is updated via the Difference of Augmented and Posterior likelihoods (DAP) loss. The <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented log likelihood</a> is:</p>
<p>$$
\log \pi_{\text{augmented}} = \log \pi_{\text{prior}} + \sigma \cdot S(x)
$$</p>
<p>where $\pi$ denotes a policy (token sampling probabilities conditioned on the sequence so far) and $\sigma$ is a scalar factor. The loss function is:</p>
<p>$$
J(\theta) = \left(\log \pi_{\text{augmented}} - \log \pi_{\text{agent}}\right)^{2}
$$</p>
<p>Minimizing $J(\theta)$ steers the agent to generate molecules that satisfy the scoring function while remaining anchored to the prior&rsquo;s chemical space.</p>
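<p>Per sampled molecule, the DAP update can be sketched as follows (the default $\sigma$ below is an illustrative placeholder, not a value from the paper):</p>

```python
def dap_loss(log_p_prior, log_p_agent, score, sigma=120.0):
    # DAP loss for one sampled linked molecule; sigma sets how strongly
    # the reward reshapes the prior likelihood.
    log_p_augmented = log_p_prior + sigma * score
    return (log_p_augmented - log_p_agent) ** 2

# An agent that already matches the reward-reshaped prior incurs zero loss:
print(dap_loss(log_p_prior=-40.0, log_p_agent=-40.0, score=0.0))  # 0.0
```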
<h3 id="diversity-filters">Diversity Filters</h3>
<p>Link-INVENT uses Diversity Filters (DFs) to balance exploration and exploitation. Buckets of limited size track unique <a href="/notes/chemistry/molecular-design/generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/">Bemis-Murcko scaffolds</a>. When a bucket is full, further sampling of that scaffold receives a score of zero, encouraging the agent to explore diverse chemical space regions.</p>
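<p>A minimal sketch of such a bucket-based filter (the bucket size and string scaffold key are placeholders; Link-INVENT keys buckets on Bemis-Murcko scaffolds, e.g. via RDKit):</p>

```python
from collections import defaultdict

class DiversityFilter:
    # Once a scaffold's bucket is full, further molecules with that
    # scaffold receive a score of zero, pushing the agent elsewhere.
    def __init__(self, bucket_size=25):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(int)

    def filter_score(self, scaffold, score):
        self.buckets[scaffold] += 1
        if self.buckets[scaffold] > self.bucket_size:
            return 0.0
        return score

df = DiversityFilter(bucket_size=2)
print([df.filter_score("c1ccccc1", 0.9) for _ in range(3)])  # [0.9, 0.9, 0.0]
```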
<h3 id="linker-specific-scoring-components">Linker-Specific Scoring Components</h3>
<p>New scoring components provide direct control over linker properties:</p>
<ul>
<li><strong>Linker effective length</strong>: number of bonds between attachment atoms</li>
<li><strong>Linker maximum graph length</strong>: bonds in the longest graph traversal path</li>
<li><strong>Linker length ratio</strong>: effective length divided by maximum graph length (controls branching)</li>
<li><strong>Linker ratio of rotatable bonds</strong>: rotatable bonds over total bonds (controls flexibility)</li>
<li><strong>Linker number of rings</strong>: controls linearity vs. cyclicity</li>
<li><strong>Linker number of HBDs</strong>: hydrogen-bond donors in the linker itself</li>
</ul>
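<p>The first three components can be sketched on a plain adjacency-list graph standing in for an RDKit molecule (the atom indices and toy linker below are hypothetical):</p>

```python
from collections import deque

def bfs_dist(adj, start):
    # Breadth-first shortest path lengths (in bonds) from one atom.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def linker_length_scores(adj, attach_a, attach_b):
    # Effective length: bonds between the two attachment atoms.
    # Maximum graph length: longest shortest path (graph diameter).
    # Length ratio: effective / maximum, so branching lowers the ratio.
    effective = bfs_dist(adj, attach_a)[attach_b]
    maximum = max(d for node in adj for d in bfs_dist(adj, node).values())
    return effective, effective / maximum

# Hypothetical 6-atom linker with a two-atom branch hanging off atom 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
print(linker_length_scores(adj, attach_a=0, attach_b=3))  # (3, 0.75)
```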
<h2 id="experimental-evaluation-across-three-drug-discovery-tasks">Experimental Evaluation Across Three Drug Discovery Tasks</h2>
<p>Link-INVENT was evaluated through four experiments across three drug discovery applications, all using the same pre-trained prior.</p>
<h3 id="illustrative-example-two-benzene-rings">Illustrative Example: Two Benzene Rings</h3>
<p>A simple experiment linked two benzene rings with the objectives of limiting HBDs and requiring exactly one ring in the linker. Over 20 epochs, the agent learned to satisfy both objectives, demonstrating the basic RL-guided generation process.</p>
<h3 id="experiment-1a-fragment-linking-ck2-alpha-inhibitors">Experiment 1a: Fragment Linking (CK2 alpha Inhibitors)</h3>
<p>Based on the <a href="https://en.wikipedia.org/wiki/Casein_kinase_2">casein kinase 2</a> (CK2 alpha) fragment linking campaign by Fusco and Brear et al., Link-INVENT was tasked with linking two fragment hits while retaining the Lys68 hydrogen-bond interaction via a DockStream docking constraint (Glide/LigPrep backend). The scoring function also enforced linker length ratio &gt;= 70 and linker MW &lt;= 200 Da.</p>
<p>Over 100 epochs in triplicate, the agent generated molecules with gradually improving docking scores. Key results:</p>
<ul>
<li>Docking score distributions across triplicates were nearly identical, demonstrating reproducibility</li>
<li>Some generated molecules achieved more favorable docking scores than the reference ligand CAM4066 (-15.20 kcal/mol)</li>
<li>More than 5000 unique Bemis-Murcko scaffolds were generated, with minimal overlap across replicates</li>
<li>Binding pose analysis showed the generated linker closely resembled the ground-truth linker, retaining the Lys68 interaction</li>
</ul>
<h3 id="experiment-1b-comparison-fragment-linking-impdh-inhibitors">Experiment 1b: Comparison Fragment Linking (IMPDH Inhibitors)</h3>
<p>Using the IMPDH inhibitor fragment linking case study from Trapero et al., this experiment applied core-constrained docking (fragment pose within 0.3 Å of reference) and compared results to DeLinker and SyntaLinker. The scoring function enforced linker effective length in [3, 5], length ratio &gt;= 70, and linker MW &lt;= 150 Da.</p>
<p>Link-INVENT generated 8960 SMILES across 70 epochs (comparable to DeLinker&rsquo;s 9000 molecular graphs). Results:</p>
<ul>
<li>Link-INVENT generated molecules with more favorable docking scores than the reference ligand across triplicate runs</li>
<li>Of the 20 DeLinker and 3 SyntaLinker example molecules, none of the DeLinker molecules and only one SyntaLinker molecule (the recovered reference) docked as well as or better than the reference ligand</li>
<li>Approximately 3000 unique Bemis-Murcko scaffolds were generated from 5000 total molecules</li>
<li>Link-INVENT&rsquo;s advantage comes from including docking explicitly as a learning objective rather than applying it post hoc</li>
</ul>
<h3 id="experiment-2-scaffold-hopping-dlk-inhibitor-cns-optimization">Experiment 2: Scaffold Hopping (DLK Inhibitor CNS Optimization)</h3>
<p>Based on Patel et al.&rsquo;s <a href="https://en.wikipedia.org/wiki/MAP3K12">dual leucine zipper kinase</a> (DLK) inhibitor campaign, Link-INVENT generated new scaffold ideas to improve CNS penetration while retaining potency. The scoring function included a Cys193 docking constraint plus CNS-compatible properties (HBDs &lt; 2, tPSA &lt;= 90 Å², 3 &lt;= SlogP &lt;= 4, MW &lt;= 450 Da, 1-2 aromatic rings in linker).</p>
<p>The solution space was significantly narrower than fragment linking. The agent still generated diverse scaffolds with favorable docking scores, though fewer exceeded the reference ligand&rsquo;s score. Binding pose analysis confirmed retained Cys193 interactions and predicted additional Gln195 hydrogen bonds.</p>
<h3 id="experiment-3-protac-design-bcl-2mcl-1-dual-degradation">Experiment 3: PROTAC Design (Bcl-2/Mcl-1 Dual Degradation)</h3>
<p>Three sub-experiments demonstrated linker-specific scoring components for PROTAC design based on Wang et al.&rsquo;s Bcl-2/Mcl-1 dual degradation strategy:</p>
<table>
  <thead>
      <tr>
          <th>Sub-Experiment</th>
          <th>Objective</th>
          <th>Key Finding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sub-Exp 1: Linker length</td>
          <td>Generate linkers within specified length intervals [4,6], [7,9], [10,12], [13,15]</td>
          <td>Clear enrichment within target intervals vs. baseline broad distribution</td>
      </tr>
      <tr>
          <td>Sub-Exp 2: Linearity</td>
          <td>Control linear vs. cyclic linkers at fixed length [7,9]</td>
          <td>Baseline ratio ~1:2 linear:cyclic; enforcing linearity or cyclicity achieved strong enrichment</td>
      </tr>
      <tr>
          <td>Sub-Exp 3: Flexibility</td>
          <td>Generate linkers with Low [0,30], Moderate [40,60], or High [70,100] rotatable bond ratios</td>
          <td>Agent learned that rings and sp2 atoms yield rigidity; linear sp3 chains yield flexibility</td>
      </tr>
  </tbody>
</table>
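<p>As a concrete illustration of Sub-Experiment 3, the flexibility objective can be sketched as a desirability function over the rotatable-bond percentage of the linker. The linear 10-point fall-off outside the target interval is an assumption for illustration; Link-INVENT&rsquo;s actual score transformations may be shaped differently:</p>

```python
def flexibility_score(n_rotatable: int, n_bonds: int,
                      low: float, high: float) -> float:
    """Score 1.0 when the rotatable-bond percentage of the linker falls
    inside [low, high]; decay linearly to 0 over a 10-point margin
    outside the interval (the margin width is an illustrative choice)."""
    if n_bonds == 0:
        return 0.0
    pct = 100.0 * n_rotatable / n_bonds
    if low <= pct <= high:
        return 1.0
    dist = (low - pct) if pct < low else (pct - high)
    return max(0.0, 1.0 - dist / 10.0)

# A linker with 2 of 10 rotatable bonds (20%) satisfies the Low [0, 30] target.
print(flexibility_score(2, 10, 0, 30))
```

Under this scheme, rigid ring-rich linkers score high against the Low interval and flexible sp3 chains score high against the High interval, matching the enrichment behavior reported in the table.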
<h2 id="key-findings-and-practical-implications-for-drug-discovery">Key Findings and Practical Implications for Drug Discovery</h2>
<p>Link-INVENT demonstrates several practical advantages for molecular linker design:</p>
<ol>
<li><strong>Single prior, multiple tasks</strong>: The same pre-trained model handles fragment linking, scaffold hopping, and PROTAC design without retraining.</li>
<li><strong>Docking as a learning signal</strong>: Including molecular docking explicitly in the scoring function (via DockStream) during RL yields molecules with more favorable docking scores than approaches that apply docking post hoc.</li>
<li><strong>Implicit 3D awareness</strong>: The docking constraint guides the agent toward 3D structural awareness without explicit 3D coordinate inputs, as demonstrated by the overlap between generated and reference binding poses.</li>
<li><strong>Diverse and reproducible output</strong>: Diversity filters ensure exploration of multiple chemical space regions, and triplicate experiments show consistent docking score distributions with minimal scaffold overlap.</li>
</ol>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The linker flexibility metric (ratio of rotatable bonds) is agnostic to intra-molecular hydrogen bonds and does not account for all rigidity factors</li>
<li>Molecular docking is an approximation that can be exploited (e.g., excessive HBDs achieving favorable scores at the expense of permeability)</li>
<li>Experiments 1a and 1b require a proprietary Schrodinger license for Glide/LigPrep docking</li>
<li>No direct experimental (wet-lab) validation was performed in this study</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL v27 (reaction-sliced)</td>
          <td>Not specified</td>
          <td>Filtered for drug-like compounds, then reaction-based slicing with SMIRKS</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Held-out Bemis-Murcko scaffolds</td>
          <td>287 scaffolds</td>
          <td>Held out from training set</td>
      </tr>
      <tr>
          <td>SMILES augmentation</td>
          <td>Randomized SMILES per epoch</td>
          <td>Same tuples, different representations</td>
          <td>Improves generalizability</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder RNN with 3 hidden layers of 512 LSTM cells, embedding size 256</li>
<li><strong>RL loss</strong>: DAP (Difference of Augmented and Posterior likelihoods)</li>
<li><strong>Batch size</strong>: 128 molecules per epoch</li>
<li><strong>Diversity filter</strong>: Bemis-Murcko scaffold buckets of size 25</li>
<li><strong>Score threshold</strong>: 0 (to store all molecules for analysis)</li>
<li><strong>Scoring function</strong>: Weighted geometric mean of component scores</li>
</ul>
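<p>The weighted geometric mean aggregation in the scoring function can be sketched in a few lines. This is an illustrative implementation, not the REINVENT source; one notable property is that a zero in any component zeroes the aggregate, giving hard-filter behavior:</p>

```python
import math

def weighted_geometric_mean(scores, weights):
    """Aggregate per-component scores in [0, 1] into a single reward.
    Any component at 0 drives the aggregate to 0 (hard-filter behavior)."""
    assert len(scores) == len(weights) and weights
    if any(s <= 0.0 for s in scores):
        return 0.0
    total_w = sum(weights)
    # Compute in log space for numerical stability with many components.
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights))
    return math.exp(log_sum / total_w)

# Equal weights on a perfect docking score and a mediocre property score.
print(weighted_geometric_mean([1.0, 0.25], [1.0, 1.0]))
```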
<h3 id="models">Models</h3>
<ul>
<li>Single pre-trained prior used across all experiments</li>
<li>Agent initialized as copy of prior, updated via RL</li>
<li>Pre-trained prior available at GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecular docking via DockStream with Glide/LigPrep backend</li>
<li>Triplicate runs for all experiments</li>
<li>Metrics: docking scores, unique Bemis-Murcko scaffold counts, binding pose overlap</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT (Link-INVENT code)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Main codebase for Link-INVENT</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity">ReinventCommunity (data + tutorial)</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training/validation data, reaction SMIRKS, pre-trained prior, Jupyter tutorial</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code, training data, and pre-trained prior are publicly available. However, reproducing the docking-based experiments (1a, 1b, and 2) requires a proprietary Schrodinger license for Glide and LigPrep. The PROTAC experiments (Experiment 3) that use only physicochemical scoring are fully reproducible with the open-source code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Knuth, F., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2023). Link-INVENT: generative linker design with reinforcement learning. <em>Digital Discovery</em>, 2, 392-408. <a href="https://doi.org/10.1039/D2DD00115B">https://doi.org/10.1039/D2DD00115B</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2023link,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Link-INVENT: generative linker design with reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Knuth, Franziska and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--408}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D2DD00115B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lingo3DMol: Language Model for 3D Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/lingo3dmol-3d-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/lingo3dmol-3d-molecule-generation/</guid><description>Lingo3DMol combines language models with geometric deep learning for structure-based 3D molecule generation using a fragment-based SMILES representation.</description><content:encoded><![CDATA[<h2 id="a-language-model-approach-to-structure-based-drug-design">A Language Model Approach to Structure-Based Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces Lingo3DMol, a pocket-based 3D molecule generation model combining transformer language models with geometric deep learning. The primary contribution is threefold: (1) a new molecular representation called FSMILES (fragment-based SMILES) that encodes both 2D topology and 3D spatial coordinates, (2) a dual-decoder architecture that jointly predicts molecular topology and atomic positions, and (3) an auxiliary non-covalent interaction (NCI) predictor that guides molecule generation toward favorable binding modes.</p>
<h2 id="limitations-of-existing-3d-molecular-generative-models">Limitations of Existing 3D Molecular Generative Models</h2>
<p>Existing approaches to structure-based drug design fall into two categories, each with notable limitations. Graph-based autoregressive methods (e.g., Pocket2Mol) represent molecules as 3D graphs and use GNNs for generation, but frequently produce non-drug-like structures: large rings (seven or more atoms), honeycomb-like ring arrays, and molecules with either too many or too few rings. The autoregressive sampling process tends to get stuck in local optima early in generation and accumulates errors at each step. Diffusion-based methods (e.g., TargetDiff) avoid autoregressive generation but still produce a notable proportion of undesirable structures due to weak perception of molecular topology, since they do not directly encode or predict bonds. Both approaches struggle with metrics like QED (quantitative estimate of drug-likeness) and SAS (synthetic accessibility score), and neither reliably reproduces known active compounds when evaluated on protein pockets.</p>
<h2 id="fsmiles-fragment-based-smiles-with-dual-coordinate-systems">FSMILES: Fragment-Based SMILES with Dual Coordinate Systems</h2>
<p>The core innovation of Lingo3DMol is a new molecular sequence representation called FSMILES that addresses the topology problem inherent in atom-by-atom generation. FSMILES reorganizes a molecule into fragments using a ring-first, depth-first traversal. Each fragment is represented using standard SMILES syntax, and the full molecule is assembled by combining fragments with a specific connection syntax. Ring size information is encoded directly in atom tokens (e.g., <code>C_6</code> for a carbon in a six-membered ring), providing the autoregressive decoder with critical context about local topology before it needs to close the ring.</p>
<p>The model integrates two coordinate systems. Local spherical coordinates encode bond length ($r$), bond angle ($\theta$), and dihedral angle ($\phi$) relative to three reference atoms (root1, root2, root3). These are predicted using separate MLP heads:</p>
<p>$$r = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_1\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}\right]\right)\right)\right)$$</p>
<p>$$\theta = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_2\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}\right]\right)\right)\right)$$</p>
<p>$$\phi = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_3\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}, h_{\text{root3}}\right]\right)\right)\right)$$</p>
<p>Global Euclidean coordinates ($x, y, z$) are predicted by a separate 3D decoder ($D_{\text{3D}}$). During inference, the model defines a search space around the predicted local coordinates ($r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$) and selects the global position with the highest joint probability within that space. This fusion strategy exploits the rigidity of bond lengths and angles (which makes local prediction easier) while maintaining global spatial awareness.</p>
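<p>The predicted local spherical coordinates can be turned into a candidate global position with a standard internal-to-Cartesian (NeRF-style) placement against the three reference atoms. The frame and sign conventions below are assumptions for illustration and may differ from the paper&rsquo;s exact construction:</p>

```python
import math

def _sub(u, v):  return [u[i] - v[i] for i in range(3)]
def _norm(u):
    n = math.sqrt(sum(x * x for x in u))
    return [x / n for x in u]
def _cross(u, v):
    return [u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0]]

def place_atom(root1, root2, root3, r, theta, phi):
    """Place the new atom at bond length r from root1, bond angle theta
    at root1 (w.r.t. root2), and dihedral phi w.r.t. the root3-root2-root1
    plane. Angles in radians; a NeRF-style construction."""
    bc = _norm(_sub(root1, root2))          # direction the chain extends
    ab = _sub(root2, root3)
    n = _norm(_cross(ab, bc))               # normal of the reference plane
    m = _cross(n, bc)                       # completes a right-handed frame
    d = [-r * math.cos(theta),
         r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi)]
    return [root1[i] + d[0]*bc[i] + d[1]*m[i] + d[2]*n[i] for i in range(3)]
```

In this setup, the inference-time search over $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$ amounts to calling <code>place_atom</code> on a small grid of perturbed internal coordinates and scoring each resulting global position.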
<h3 id="ncianchor-prediction-model">NCI/Anchor Prediction Model</h3>
<p>A separately trained NCI/anchor prediction model identifies potential non-covalent interaction sites and anchor points in the protein pocket. This model shares the transformer architecture of the generation model and is initialized from pretrained parameters. It predicts whether each pocket atom will form hydrogen bonds, <a href="https://en.wikipedia.org/wiki/Halogen_bond">halogen bonds</a>, salt bridges, or <a href="https://en.wikipedia.org/wiki/Pi_stacking">pi-pi stacking</a> interactions with the ligand, and whether it lies within 4 A of any ligand atom (anchor points). The predicted NCI sites serve two purposes: they are incorporated as input features to the encoder, and they provide starting positions for molecule generation (the first atom is placed within 4.5 A of a sampled NCI site).</p>
<h3 id="pretraining-and-architecture">Pretraining and Architecture</h3>
<p>The model uses a denoising pretraining strategy inspired by BART. During pretraining on 12 million drug-like molecules, the model receives perturbed molecules (with 25% of atoms deleted, coordinates perturbed by $\pm 0.5$ A, and 25% of carbon element types corrupted) and learns to reconstruct the original structure. The architecture is transformer-based with graph structural information encoded through distance and edge vector bias terms in the attention mechanism:</p>
<p>$$A_{\text{biased}} = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_D + B_J\right)V$$</p>
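<p>The biased attention update can be read off directly from the equation above. <code>biased_attention</code> below is an illustrative single-head implementation with plain Python lists; in the model, $B_D$ and $B_J$ would be learned projections of pairwise distances and edge vectors:</p>

```python
import math

def biased_attention(Q, K, V, B_D, B_J):
    """Scaled dot-product attention with additive structural bias terms
    (distance bias B_D and edge-vector bias B_J added to the logits)."""
    d_k = len(K[0])
    out = []
    for i in range(len(Q)):
        logits = [
            sum(Q[i][t] * K[j][t] for t in range(d_k)) / math.sqrt(d_k)
            + B_D[i][j] + B_J[i][j]
            for j in range(len(K))
        ]
        mx = max(logits)                      # stabilized softmax
        w = [math.exp(l - mx) for l in logits]
        z = sum(w)
        probs = [x / z for x in w]
        out.append([sum(probs[j] * V[j][t] for j in range(len(V)))
                    for t in range(len(V[0]))])
    return out
```

A strongly negative distance bias between two tokens effectively masks their interaction, which is how spatial structure reshapes the attention pattern.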
<p>The overall loss combines FSMILES token prediction, absolute coordinate prediction, and local coordinate predictions ($r$, $\theta$, $\phi$) with their auxiliary counterparts:</p>
<p>$$L = L_{\text{FSMILES}} + L_{\text{abs-coord}} + L_r + L_\theta + L_\phi + L_{r,\text{aux}} + L_{\theta,\text{aux}} + L_{\phi,\text{aux}}$$</p>
<p>Fine-tuning is performed on 11,800 protein-ligand complex samples from PDBbind 2020, with the first three encoder layers frozen to prevent overfitting.</p>
<h2 id="evaluation-on-dud-e-with-drug-likeness-filtering">Evaluation on DUD-E with Drug-Likeness Filtering</h2>
<p>The evaluation uses the DUD-E dataset (101 targets, 20,000+ active compounds), comparing Lingo3DMol against Pocket2Mol and TargetDiff. A key methodological contribution is the emphasis on filtering generated molecules for drug-likeness (QED &gt;= 0.3 and SAS &lt;= 5) before evaluating binding metrics, as the authors demonstrate that molecules with good docking scores can still be poor drug candidates.</p>
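<p>The drug-likeness gate is a simple filter applied before any binding metric is computed. Assuming QED and SAS values have already been computed per molecule (e.g. with RDKit), it can be sketched as:</p>

```python
def druglike_fraction(mols, qed_min=0.3, sas_max=5.0):
    """Split precomputed per-molecule scores into the drug-like subset
    (QED >= 0.3 and SAS <= 5, per the paper's filter) and its fraction."""
    keep = [m for m in mols if m["qed"] >= qed_min and m["sas"] <= sas_max]
    frac = len(keep) / len(mols) if mols else 0.0
    return keep, frac

# One molecule passes both thresholds; the others fail QED or SAS.
mols = [{"qed": 0.6, "sas": 3.0}, {"qed": 0.2, "sas": 2.0},
        {"qed": 0.5, "sas": 6.0}]
kept, frac = druglike_fraction(mols)
```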
<p><strong>Molecular properties and binding mode (Table 1, drug-like molecules only):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Pocket2Mol</th>
          <th>TargetDiff</th>
          <th>Lingo3DMol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (% of total)</td>
          <td>61%</td>
          <td>49%</td>
          <td><strong>82%</strong></td>
      </tr>
      <tr>
          <td>Mean QED</td>
          <td>0.56</td>
          <td>0.60</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Mean SAS</td>
          <td>3.5</td>
          <td>4.0</td>
          <td><strong>3.1</strong></td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% of targets)</td>
          <td>8%</td>
          <td>3%</td>
          <td><strong>33%</strong></td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td>-6.7</td>
          <td>-6.2</td>
          <td><strong>-6.8</strong></td>
      </tr>
      <tr>
          <td>Mean GlideSP redocking</td>
          <td>-7.5</td>
          <td>-7.0</td>
          <td><strong>-7.8</strong></td>
      </tr>
      <tr>
          <td>Mean RMSD vs. low-energy conformer (A)</td>
          <td>1.1</td>
          <td>1.1</td>
          <td><strong>0.9</strong></td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.84</td>
          <td><strong>0.88</strong></td>
          <td>0.82</td>
      </tr>
  </tbody>
</table>
<p>Lingo3DMol generates substantially more drug-like molecules (82% vs. 61% and 49%) and finds similar-to-active compounds for 33% of targets compared to 8% (Pocket2Mol) and 3% (TargetDiff). The model also achieves the best min-in-place GlideSP scores and lowest RMSD versus low-energy conformers, indicating higher quality binding poses and more realistic 3D geometries.</p>
<p><strong>Molecular geometry:</strong> Lingo3DMol demonstrated the lowest Jensen-Shannon divergence for all atom-atom distance distributions and produced significantly fewer molecules with large rings (0.23% with 7-membered rings vs. 2.59% for Pocket2Mol and 11.70% for TargetDiff).</p>
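<p>The Jensen-Shannon divergence used to compare atom-atom distance distributions operates on normalized histograms; a minimal base-2 implementation (so the divergence is bounded by 1) looks like:</p>

```python
import math

def _kl(p, q):
    """Kullback-Leibler divergence in bits; zero-probability bins in p
    contribute nothing by convention."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two normalized
    histograms, e.g. binned atom-atom distance distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

Identical distributions score 0 and fully disjoint ones score 1, so a lower divergence against reference geometries indicates more realistic bond-length and distance statistics.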
<p><strong>Information leakage analysis:</strong> The authors controlled for information leakage by excluding proteins with &gt;30% sequence identity to DUD-E targets from training. When DUD-E targets were stratified by sequence identity to Pocket2Mol&rsquo;s training set, Lingo3DMol&rsquo;s advantage widened as leakage decreased, suggesting the performance gap is genuine rather than an artifact of training overlap.</p>
<p><strong>Ablation studies (Table 2):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard</th>
          <th>Random NCI</th>
          <th>No Pretraining</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like (%)</td>
          <td><strong>82%</strong></td>
          <td>47%</td>
          <td>71%</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5</td>
          <td><strong>33%</strong></td>
          <td>6%</td>
          <td>3%</td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td><strong>-6.8</strong></td>
          <td>-5.8</td>
          <td>-4.9</td>
      </tr>
      <tr>
          <td>Dice score</td>
          <td><strong>0.25</strong></td>
          <td>0.15</td>
          <td>0.13</td>
      </tr>
  </tbody>
</table>
<p>Both pretraining and the NCI predictor are essential. Removing pretraining reduces the number of valid molecules and binding quality. Replacing the trained NCI predictor with random NCI site selection severely degrades drug-likeness and the ability to generate active-like compounds.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>Lingo3DMol demonstrates that combining language model sequence generation with geometric deep learning can produce drug-like 3D molecules that outperform graph-based and diffusion-based alternatives in binding mode quality, drug-likeness, and similarity to known actives. The FSMILES representation successfully constrains generated molecules to realistic topologies by encoding ring size information and using fragment-level generation.</p>
<p>Several limitations are acknowledged. Capturing all non-covalent interactions within a single molecule remains difficult with autoregressive generation. The model does not enforce equivariance (SE(3) invariance is approximated via rotation/translation augmentation and invariant features rather than built into the architecture). The pretraining dataset is partially proprietary (12M molecules from a commercial library, of which 1.4M from public sources are shared). Diversity of generated drug-like molecules is slightly lower than baselines, though the authors argue that baseline diversity explores chemical space away from known active regions. A comprehensive evaluation of drug-like properties beyond QED and SAS metrics is identified as an important next step.</p>
<p>Future directions include investigating electron density representations for molecular interactions, incorporating SE(3) equivariant architectures (e.g., GVP, Vector Neurons), and developing more systematic drug-likeness evaluation frameworks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>In-house commercial library</td>
          <td>12M molecules (1.4M public)</td>
          <td>Filtered for drug-likeness; conformers via ConfGen</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PDBbind 2020 (general set)</td>
          <td>11,800 samples (8,201 PDB IDs)</td>
          <td>Filtered for &lt;30% sequence identity to DUD-E targets</td>
      </tr>
      <tr>
          <td>NCI labels</td>
          <td>PDBbind 2020</td>
          <td>Same as fine-tuning</td>
          <td>Labeled using ODDT for H-bonds, halogen bonds, salt bridges, pi-pi stacking</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>DUD-E</td>
          <td>101 targets, 20,000+ active compounds</td>
          <td>Standard benchmark for structure-based drug design</td>
      </tr>
      <tr>
          <td>Geometry evaluation</td>
          <td>CrossDocked2020</td>
          <td>100 targets</td>
          <td>Used for bond length and atom distance distribution comparisons</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer-based encoder-decoder with graph structural bias terms (distance matrix $B_D$, edge vector matrix $B_J$)</li>
<li>Denoising pretraining: 25% atom deletion, coordinate perturbation ($\pm 0.5$ A), 25% carbon element type corruption</li>
<li>Depth-first search sampling with reward function combining model confidence and anchor fulfillment</li>
<li>Fine-tuning: first three encoder layers frozen</li>
<li>Local-global coordinate fusion during inference with search space: $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Generation model: transformer encoder-decoder with dual decoders ($D_{\text{2D}}$ for topology, $D_{\text{3D}}$ for global coordinates)</li>
<li>NCI/anchor prediction model: same architecture, initialized from pretrained parameters</li>
<li>Pretrained, fine-tuned, and NCI model checkpoints available on GitHub and figshare</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Lingo3DMol</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (%)</td>
          <td>82%</td>
          <td>61% (P2M)</td>
          <td>QED &gt;= 0.3, SAS &lt;= 5</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% targets)</td>
          <td>33%</td>
          <td>8% (P2M)</td>
          <td>Tanimoto similarity to known actives</td>
      </tr>
      <tr>
          <td>Min-in-place GlideSP</td>
          <td>-6.8</td>
          <td>-6.7 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>GlideSP redocking</td>
          <td>-7.8</td>
          <td>-7.5 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>RMSD vs. low-energy conformer</td>
          <td>0.9 A</td>
          <td>1.1 A (both)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Generation speed (100 mol)</td>
          <td>874 +/- 401 s</td>
          <td>962 +/- 622 s (P2M)</td>
          <td>NVIDIA Tesla V100</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference benchmarked on NVIDIA Tesla V100 GPUs</li>
<li>Generation of 100 valid molecules per target: 874 +/- 401 seconds</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/stonewiseAIDrugDesign/Lingo3DMol">Lingo3DMol</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Inference code and model architecture</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/software/Code_for_Lingo3DMo/24633084">Model checkpoints</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and NCI checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Data_for_Lingo3DMol/24550351">Training data</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Partial pretraining data (1.4M public molecules), fine-tuning complexes, evaluation molecules</td>
      </tr>
      <tr>
          <td><a href="https://sw3dmg.stonewise.cn">Online service</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Web interface for molecule generation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., Bai, R., Wang, H., Zhou, J., Peng, W., Huang, B., &amp; Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. <em>Nature Machine Intelligence</em>, 6(1), 62-73. <a href="https://doi.org/10.1038/s42256-023-00775-6">https://doi.org/10.1038/s42256-023-00775-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{feng2024generation,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generation of 3D molecules in pockets via a language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Feng, Wei and Wang, Lvwei and Lin, Zaiyun and Zhu, Yanhao and Wang, Han and Dong, Jianqiang and Bai, Rong and Wang, Huting and Zhou, Jielong and Peng, Wei and Huang, Bo and Zhou, Wenbiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{62--73}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00775-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Group SELFIES: Fragment-Based Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</guid><description>Group SELFIES extends SELFIES with fragment-based group tokens for chemically robust molecular string representations that improve distribution learning.</description><content:encoded><![CDATA[<h2 id="a-fragment-aware-extension-of-selfies">A Fragment-Aware Extension of SELFIES</h2>
<p>This is a <strong>Method</strong> paper that introduces Group SELFIES, a molecular string representation extending SELFIES by incorporating group tokens that represent functional groups or entire substructures. The primary contribution is a representation that maintains the 100% chemical validity guarantee of SELFIES while enabling fragment-level molecular encoding. Group SELFIES is shorter, more human-readable, and produces better distribution learning compared to both SMILES and standard SELFIES.</p>
<h2 id="from-atoms-to-fragments-in-molecular-strings">From Atoms to Fragments in Molecular Strings</h2>
<p>Molecular string representations underpin nearly all string-based molecular generation, from chemical language models and VAEs to genetic algorithms. SMILES, the dominant representation, suffers from validity issues: generated strings frequently contain syntax errors or violate valency constraints. SELFIES solved this by guaranteeing that every string decodes to a valid molecule, but both SMILES and SELFIES operate at the atomic level. Human chemists, by contrast, think about molecules in terms of functional groups and substructures.</p>
<p>Fragment-based generative models exploit this inductive bias by constructing custom representations amenable to fragment-based molecular design. However, these approaches are typically graph-based, losing the desirable properties of string representations: easy manipulation and direct input into established language models. Historical string representations like Wiswesser Line Notation (WLN), Hayward Notation, and SYBYL Line Notation (SLN) did use non-atomic tokens, but none provided chemical robustness guarantees.</p>
<p>The gap is clear: no existing string representation combines the chemical robustness of SELFIES with the fragment-level abstraction that captures meaningful chemical motifs.</p>
<h2 id="group-tokens-with-chemical-robustness-guarantees">Group Tokens with Chemical Robustness Guarantees</h2>
<p>The core innovation is the introduction of <strong>group tokens</strong> into the SELFIES framework. Each group token represents a predefined molecular fragment (such as a benzene ring, carboxyl group, or any user-specified substructure) and is treated as a single unit during encoding and decoding.</p>
<h3 id="group-definition">Group Definition</h3>
<p>Each group is defined as a set of atoms and bonds with labeled <strong>attachment points</strong> that specify how the group participates in bonding. Each attachment point has a specified maximum valency, allowing the decoder to continue tracking available valency during string construction. Group tokens take the form <code>[:S&lt;group-name&gt;]</code>, where <code>S</code> is the starting attachment index.</p>
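<p>A token of this form can be recognized with a small regular expression. The precise grammar of group names is not specified here, so the pattern below is a hypothetical approximation for illustration:</p>

```python
import re

# Hypothetical grammar for the [:S<group-name>] token form: a starting
# attachment index S (digits) followed by a group name beginning with a
# letter. The reference implementation's grammar may differ in detail.
GROUP_TOKEN = re.compile(r"\[:(\d+)([A-Za-z][\w\-]*)\]")

def parse_group_token(token):
    """Return (starting_attachment_index, group_name), or None if the
    token is not a group token (e.g. an ordinary atom token)."""
    m = GROUP_TOKEN.fullmatch(token)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)
```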
<h3 id="encoding">Encoding</h3>
<p>To encode a molecule, the encoder first recognizes and replaces substructure matches from the group set. By default, the encoder processes larger groups first, but users can override this with priority values. The encoder then traverses the molecular graph similarly to standard SELFIES encoding, inserting tokens that track attachment indices for entering and exiting groups.</p>
<h3 id="decoding">Decoding</h3>
<p>When the decoder encounters a group token, it looks up the corresponding group in the group set dictionary, places all atoms of the group, and connects the main chain to the starting attachment point. Navigation between attachment points is handled by reading subsequent tokens as relative indices. If an attachment point is occupied, the next available one is used. If all attachment points are exhausted, the group is immediately popped from the stack.</p>
<h3 id="chemical-robustness">Chemical Robustness</h3>
<p>The key property preserved from SELFIES is that <strong>any arbitrary Group SELFIES string decodes to a molecule with valid valency</strong>. This is achieved by maintaining the same two SELFIES decoder features within the group framework:</p>
<ol>
<li>Token overloading: every token can be interpreted as a number when needed (for branch lengths, ring targets, or attachment indices).</li>
<li>Valency tracking: if adding a bond would exceed available valency, the decoder adjusts the bond order or skips the bond.</li>
</ol>
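<p>The valency-tracking rule can be illustrated with a deliberately simplified pure-Python sketch. The atom set, token format, and chain-only topology here are toy assumptions for illustration, not the actual Group SELFIES implementation:</p>

```python
# Toy illustration of SELFIES-style valency tracking: when a requested
# bond order would exceed an atom's remaining valence, the decoder
# lowers the bond order (or skips the bond) instead of failing.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def decode(tokens):
    """Decode (atom, requested_bond_order) pairs into a linear chain,
    clamping each bond so neither endpoint exceeds its valence."""
    atoms, bonds = [], []
    remaining = []  # free valence per placed atom
    for symbol, order in tokens:
        if not atoms:
            atoms.append(symbol)
            remaining.append(MAX_VALENCE[symbol])
            continue
        # Clamp the bond order to what both endpoints can support.
        order = min(order, remaining[-1], MAX_VALENCE[symbol])
        if order == 0:
            continue  # previous atom is saturated: skip this token
        bonds.append((len(atoms) - 1, len(atoms), order))
        remaining[-1] -= order
        atoms.append(symbol)
        remaining.append(MAX_VALENCE[symbol] - order)
    return atoms, bonds

# A triple bond to oxygen is impossible; it is demoted to a double bond.
atoms, bonds = decode([("C", 0), ("O", 3), ("F", 2)])
```

<p>Because every token is interpreted relative to the remaining valence, no input sequence can produce an invalid structure; this is the same guarantee the group framework preserves.</p>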
<p>The authors verified robustness by encoding and decoding 25 million molecules from the eMolecules database.</p>
<h3 id="chirality-handling">Chirality Handling</h3>
<p>Group SELFIES handles chirality differently from SMILES and SELFIES. Rather than using <code>@</code>-notation for tetrahedral chirality, all chiral centers must be specified as groups. An &ldquo;essential set&rdquo; of 23 groups covers all relevant chiral centers in the eMolecules database. This approach also supports extended chirality (axial, helical, planar) by abstracting the entire chiral substructure into a group token.</p>
<h3 id="fragment-selection">Fragment Selection</h3>
<p>The group set is a user-defined dictionary that maps group names to molecular fragments. Users can specify groups manually using SMILES-like syntax, extract them from fragment libraries, or use fragmentation algorithms such as matched molecular pair analysis. The authors tested several approaches, including a naive method that cleaves side chains from rings and methods based on cheminformatics fragmentation tools. A useful group set typically contains fragments that appear in many molecules and replace many atoms, with similar fragments merged to reduce redundancy.</p>
<h2 id="experiments-on-compactness-generation-and-distribution-learning">Experiments on Compactness, Generation, and Distribution Learning</h2>
<h3 id="compactness-section-41">Compactness (Section 4.1)</h3>
<p>Using 53 groups (30 extracted from ZINC-250k plus 23 from the essential set), Group SELFIES strings are shorter than their SMILES and SELFIES equivalents. Despite Group SELFIES having a larger alphabet, the compressed file size of the ZINC-250k dataset is smallest for Group SELFIES, indicating lower information-theoretic complexity.</p>
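<p>The compressed-size comparison can be mimicked with the standard library. The corpora below are placeholders, not actual ZINC-250k encodings; they only illustrate why collapsing a recurring motif into a single group token lowers both raw and compressed size:</p>

```python
import zlib

def compressed_size(strings):
    """Total zlib-compressed size of a newline-joined corpus, a rough
    proxy for the information-theoretic complexity of a representation."""
    return len(zlib.compress("\n".join(strings).encode("utf-8"), level=9))

# Placeholder corpora: a group token collapses a recurring motif
# (here, an atom-level benzene encoding) into one symbol.
atomic = ["[C][C][=C][C][=C][C][=C][Ring1][=Branch1]"] * 1000
grouped = ["[:0benzene]"] * 1000

assert compressed_size(grouped) < compressed_size(atomic)
```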
<h3 id="random-molecular-generation-section-42">Random Molecular Generation (Section 4.2)</h3>
<p>To isolate the effect of the representation from the generative model, the authors use a primitive generative model: sample a random string length from the dataset, draw tokens uniformly from a bag of all tokens, and concatenate. From 100,000 ZINC-250k molecules:</p>
<ul>
<li>Randomly sampled Group SELFIES strings produce molecules whose SAScore and QED distributions more closely overlap with the original ZINC dataset than molecules from randomly sampled SELFIES strings.</li>
<li>The Wasserstein distances to the ZINC distribution are consistently lower for Group SELFIES.</li>
<li>On a nonfullerene acceptor (NFA) dataset, Group SELFIES preserves aromatic rings while SELFIES rarely does.</li>
</ul>
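<p>The primitive generator described above is easy to state in code. This is a sketch with a tiny placeholder dataset; the real experiment draws from the tokenized ZINC-250k strings and decodes the results with the Group SELFIES decoder:</p>

```python
import random

def random_strings(dataset_tokens, n_samples, seed=0):
    """Bag-of-tokens baseline: sample a string length from the dataset's
    empirical length distribution, draw that many tokens uniformly from
    the pooled bag of all tokens, and concatenate them."""
    rng = random.Random(seed)
    bag = [tok for mol in dataset_tokens for tok in mol]
    lengths = [len(mol) for mol in dataset_tokens]
    return ["".join(rng.choice(bag) for _ in range(rng.choice(lengths)))
            for _ in range(n_samples)]

# Tiny illustrative "dataset" of token lists (placeholders, not ZINC).
data = [["[C]", "[O]"], ["[:0benzene]", "[C]", "[N]"]]
samples = random_strings(data, 5)
```

<p>Because Group SELFIES tokens carry whole fragments, even this uniform sampler lands closer to the dataset's property distributions than the same procedure over atomic SELFIES tokens.</p>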
<h3 id="distribution-learning-with-vaes-section-43">Distribution Learning with VAEs (Section 4.3)</h3>
<p>Using the MOSES benchmarking framework, VAEs were trained for 125 epochs on both Group SELFIES and SELFIES representations. The Group SELFIES VAE used 300 groups extracted from the MOSES training set. Results from 100,000 generated molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Group-VAE-125</th>
          <th>SELFIES-VAE-125</th>
          <th>Train (Reference)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>1.0 (0)</td>
          <td>1.0 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@1k</td>
          <td>1.0 (0)</td>
          <td>0.9996 (5)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@10k</td>
          <td>0.9985 (4)</td>
          <td>0.9986 (4)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>FCD (Test)</td>
          <td>0.1787 (29)</td>
          <td>0.6351 (43)</td>
          <td>0.008</td>
      </tr>
      <tr>
          <td>FCD (TestSF)</td>
          <td>0.734 (109)</td>
          <td>1.3136 (128)</td>
          <td>0.4755</td>
      </tr>
      <tr>
          <td>SNN (Test)</td>
          <td>0.6051 (4)</td>
          <td>0.6014 (3)</td>
          <td>0.6419</td>
      </tr>
      <tr>
          <td>Frag (Test)</td>
          <td>0.9995 (0)</td>
          <td>0.9989 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Scaf (Test)</td>
          <td>0.9649 (21)</td>
          <td>0.9588 (15)</td>
          <td>0.9907</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>0.8587 (1)</td>
          <td>0.8579 (1)</td>
          <td>0.8567</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.9623 (7)</td>
          <td>0.96 (4)</td>
          <td>1.0</td>
      </tr>
  </tbody>
</table>
<p>The most notable improvement is in Fréchet ChemNet Distance (FCD), where Group SELFIES achieves 0.1787 versus 0.6351 for SELFIES on the test set. FCD measures the distance between the distributions of penultimate-layer activations of ChemNet, which encode a mixture of biological and chemical properties relevant to drug-likeness. Most other metrics are comparable, with Group SELFIES matching or slightly outperforming SELFIES.</p>
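<p>FCD is the Fréchet distance between two Gaussians fit to ChemNet activations. Under a simplifying diagonal-covariance assumption it reduces to the closed form sketched below; the real metric uses full covariance matrices and a matrix square root:</p>

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)):
    d^2 = ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    d2 += sum(v1 + v2 - 2 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return d2

# Identical activation distributions give zero distance.
assert frechet_distance_diag([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]) == 0.0
```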
<h2 id="advantages-limitations-and-future-directions">Advantages, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p>Group SELFIES provides three main advantages over standard SELFIES:</p>
<ol>
<li><strong>Substructure control</strong>: Important scaffolds, chiral centers, and charged groups can be preserved during molecular optimization.</li>
<li><strong>Compactness</strong>: Group tokens represent multiple atoms, yielding shorter strings with lower information-theoretic complexity.</li>
<li><strong>Improved distribution learning</strong>: The FCD metric shows substantial improvement, indicating generated molecules better capture biological and chemical properties of the training set.</li>
</ol>
<p>Both SELFIES and Group SELFIES achieve 100% validity, eliminating the validity issues associated with SMILES-based generation.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational speed</strong>: Encoding and decoding are slower than SELFIES due to RDKit overhead, particularly for the encoder, which performs substructure matching for every group in the set.</li>
<li><strong>No group overlap</strong>: Groups cannot overlap in the current formulation, which limits expressiveness for polycyclic compounds.</li>
<li><strong>Group set design</strong>: Choosing an effective group set remains an open design choice that may require domain expertise or fragmentation algorithm tuning.</li>
<li><strong>Limited generative model evaluation</strong>: The paper focuses on random sampling and VAEs; evaluation with more sophisticated models (GANs, reinforcement learning, genetic algorithms) is left to future work.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose several extensions: flexible scaffold tokens that preserve topology while allowing atom-type variation, representations based on cellular complexes or hypergraphs to handle overlapping groups, and integration with genetic algorithms like JANUS for molecular optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compactness / Generation</td>
          <td>ZINC-250k</td>
          <td>250,000 molecules</td>
          <td>Random subset of 10,000 for fragment extraction; 100,000 for generation</td>
      </tr>
      <tr>
          <td>Distribution Learning</td>
          <td>MOSES benchmark</td>
          <td>~1.9M molecules</td>
          <td>Standard train/test split from MOSES framework</td>
      </tr>
      <tr>
          <td>Robustness Verification</td>
          <td>eMolecules</td>
          <td>25M molecules</td>
          <td>Full database encode-decode round trip</td>
      </tr>
      <tr>
          <td>NFA Generation</td>
          <td>NFA dataset</td>
          <td>Not specified</td>
          <td>Nonfullerene acceptors from Lopez et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: Naive ring-sidechain cleavage, matched molecular pair analysis, and diversity-based selection of 300 groups for VAE experiments.</li>
<li><strong>Essential set</strong>: 23 chiral groups covering all relevant chiral centers in eMolecules.</li>
<li><strong>Random generation</strong>: Bag-of-tokens sampling with length matched to dataset distribution.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>VAE</strong>: Trained for 125 epochs on MOSES dataset using both SELFIES and Group SELFIES tokenizations.</li>
<li>Architecture details follow the MOSES benchmark VAE configuration.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCD</td>
<td>Fréchet ChemNet Distance (penultimate-layer activations)</td>
      </tr>
      <tr>
          <td>SNN</td>
          <td>Average Tanimoto similarity to nearest neighbor in reference set</td>
      </tr>
      <tr>
          <td>Frag</td>
          <td>Cosine similarity of BRICS fragment distributions</td>
      </tr>
      <tr>
          <td>Scaf</td>
          <td>Cosine similarity of Bemis-Murcko scaffold distributions</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>Internal diversity via Tanimoto similarity</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Percentage passing RDKit parsing</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Percentage of non-duplicate generated molecules</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Robustness verification performed on the Niagara supercomputer (SciNet HPC Consortium).</li>
<li>VAE training hardware not specified.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/group-selfies">group-selfies</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Open-source Python implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cheng, A. H., Cai, A., Miret, S., Malkomes, G., Phielipp, M., &amp; Aspuru-Guzik, A. (2023). Group SELFIES: A robust fragment-based molecular string representation. <em>Digital Discovery</em>, 2(3), 748-758. <a href="https://doi.org/10.1039/D3DD00012E">https://doi.org/10.1039/D3DD00012E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cheng2023group,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Group SELFIES: A Robust Fragment-Based Molecular String Representation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cheng, Austin H. and Cai, Andy and Miret, Santiago and Malkomes, Gustavo and Phielipp, Mariano and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{748--758}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00012E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Evolutionary Molecular Design via Deep Learning + GA</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</guid><description>Kwon et al. combine an RNN decoder for SMILES reconstruction with a genetic algorithm operating on ECFP fingerprints for goal-directed molecular design.</description><content:encoded><![CDATA[<h2 id="fingerprint-based-evolutionary-molecular-design">Fingerprint-Based Evolutionary Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces an evolutionary design methodology (EDM) for goal-directed molecular optimization. The primary contribution is a four-component framework in which (1) molecules are encoded as <a href="https://en.wikipedia.org/wiki/Chemical_similarity">extended-connectivity fingerprint</a> (ECFP) vectors, (2) a genetic algorithm evolves these fingerprint vectors through mutation and crossover, (3) a recurrent neural network (RNN) decodes the evolved fingerprints back into valid SMILES strings, and (4) a deep neural network (DNN) evaluates molecular fitness. The key advantage over prior evolutionary approaches is that no hand-crafted chemical rules or fragment libraries are needed, as the RNN learns valid molecular reconstruction from data.</p>
<h2 id="challenges-in-evolutionary-molecular-optimization">Challenges in Evolutionary Molecular Optimization</h2>
<p>Evolutionary algorithms for molecular design face two core challenges. First, maintaining chemical validity of evolved molecules is difficult when operating on graph or string representations directly. Prior methods rely on predefined chemical rules and fragment libraries to constrain structural modifications (atom/bond additions, deletions, substitutions), but these introduce bias and risk convergence to local optima. Each new application domain requires specifying new chemical rules, which may not exist for emerging areas. Second, fitness evaluation must be both efficient and accurate. Simple evaluation methods like structural similarity indices or semi-empirical quantum chemistry calculations reduce computational cost but may not capture complex property relationships.</p>
<p>High-throughput computational screening (HTCS) is a common alternative, but it depends on the quality of predefined virtual chemical libraries and often requires multiple iterative enumerations, limiting its ability to explore novel chemical space.</p>
<h2 id="core-innovation-evolving-fingerprints-with-neural-decoding">Core Innovation: Evolving Fingerprints with Neural Decoding</h2>
<p>The key insight is to perform genetic operations in fingerprint space rather than in molecular graph or SMILES string space. The framework comprises three learned functions:</p>
<p><strong>Encoding function</strong> $e(\cdot)$: Converts a SMILES string $\mathbf{m}$ into a 5000-dimensional ECFP vector $\mathbf{x}$ using Morgan fingerprints with a neighborhood radius of 6. This is a deterministic hash-based encoding (not learned).</p>
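<p>Conceptually, ECFP hashes each atom's circular neighborhood into a fixed-length bit vector. A toy folding sketch is below; the paper uses RDKit Morgan fingerprints (radius 6, 5000 bits), and the neighborhood strings here are illustrative stand-ins for the canonical environments RDKit enumerates:</p>

```python
def toy_folded_fingerprint(neighborhoods, n_bits=5000):
    """Fold hashed substructure identifiers into a fixed-length bit
    vector, as ECFP does with circular atom neighborhoods."""
    bits = [0] * n_bits
    for env in neighborhoods:
        bits[hash(env) % n_bits] = 1
    return bits

# Illustrative neighborhood descriptors for a small molecule.
fp = toy_folded_fingerprint(["C(C)(N)", "N(C)", "C(N)=O"], n_bits=64)
```

<p>Folding is lossy (distinct neighborhoods can collide on one bit), which is one reason the RNN decoder cannot always recover the original molecule.</p>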
<p><strong>Decoding function</strong> $d(\cdot)$: An RNN with three hidden layers of 500 LSTM units that reconstructs a SMILES string from an ECFP vector. The RNN generates SMILES as a sequence of three-character substrings, conditioning each prediction on the current substring and the input ECFP vector:</p>
<p>$$d(\mathbf{x}) = \mathbf{m}, \qquad \mathbf{m}_{t+1} \sim p(\mathbf{m}_{t+1} \mid \mathbf{m}_{t}, \mathbf{x})$$</p>
<p>The three-character substring approach reduces the ratio of invalid SMILES by imposing additional constraints on subsequent characters.</p>
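<p>The substring chunking itself is simple; a minimal sketch follows (the pad character is an assumed detail, not specified in the paper):</p>

```python
def chunk_smiles(smiles, k=3, pad="_"):
    """Split a SMILES string into fixed-length substrings, padding the
    final chunk so every token has exactly k characters."""
    if len(smiles) % k:
        smiles += pad * (k - len(smiles) % k)
    return [smiles[i:i + k] for i in range(0, len(smiles), k)]

# "c1ccccc1" (benzene) -> ['c1c', 'ccc', 'c1_']
tokens = chunk_smiles("c1ccccc1")
```

<p>Predicting three characters per step forces each output to be locally consistent with two of its neighbors, which is why the substring vocabulary lowers the invalid-SMILES rate relative to character-by-character generation.</p>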
<p><strong>Property prediction function</strong> $f(\cdot)$: A five-layer DNN with 250 hidden units per layer that predicts molecular properties from ECFP vectors:</p>
<p>$$\mathbf{t} = f(e(\mathbf{m}))$$</p>
<p>The RNN is trained by minimizing cross-entropy loss between the softmax output and the target SMILES string $\mathbf{m}_{i}$, learning the relationship $d(e(\mathbf{m}_{i})) = \mathbf{m}_{i}$. The DNN is trained by minimizing mean squared error between predicted and computed property values. Both use the Adam optimizer with mini-batch size 100, 500 training epochs, and dropout rate 0.5.</p>
<h3 id="genetic-algorithm-operations">Genetic Algorithm Operations</h3>
<p>The GA evolves ECFP vectors using the DEAP library with the following parameters:</p>
<ul>
<li><strong>Population size</strong>: 50</li>
<li><strong>Crossover rate</strong>: 0.7 (uniform crossover, mixing ratio 0.2)</li>
<li><strong>Mutation rate</strong>: 0.3 (Gaussian mutation, $N(0, 0.2^{2})$, applied to 1% of elements)</li>
<li><strong>Selection</strong>: Tournament selection with size 3, top 3 individuals as parents</li>
<li><strong>Termination</strong>: 500 generations or 30 consecutive generations without fitness improvement</li>
</ul>
<p>The evolutionary loop proceeds as follows: a seed molecule $\mathbf{m}_{0}$ is encoded to $\mathbf{x}_{0}$, mutated to generate a population $\mathbf{P}^{0} = \{\mathbf{z}_{1}, \mathbf{z}_{2}, \ldots, \mathbf{z}_{L}\}$, each vector is decoded via the RNN, validity is checked with RDKit, fitness is evaluated via the DNN, and the top parents produce the next generation through crossover and mutation.</p>
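<p>The GA operations above can be sketched without DEAP. Everything below beyond the stated rates (the fitness function, vector length, generation count, and initialization noise) is illustrative, not the paper's configuration; in the real pipeline the fitness call wraps RNN decoding, RDKit validity checking, and the DNN predictor:</p>

```python
import random

rng = random.Random(0)

def tournament(pop, fitness, k=3):
    """Tournament selection: best of k randomly drawn individuals."""
    return max(rng.sample(pop, k), key=fitness)

def uniform_crossover(p1, p2, mix=0.2):
    """Swap each element with probability `mix` (as in DEAP's cxUniform)."""
    return [b if rng.random() < mix else a for a, b in zip(p1, p2)]

def gaussian_mutation(x, rate=0.01, sigma=0.2):
    """Add N(0, sigma^2) noise to a small fraction of vector elements."""
    return [v + rng.gauss(0, sigma) if rng.random() < rate else v for v in x]

def evolve(seed_vec, fitness, pop_size=50, generations=10,
           cx_rate=0.7, mut_rate=0.3):
    """Minimal GA loop over real-valued vectors standing in for ECFPs."""
    pop = [gaussian_mutation(seed_vec, rate=0.05) for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[:3]
        nxt = []
        while len(nxt) < pop_size:
            child = tournament(pop, fitness)
            if rng.random() < cx_rate:
                child = uniform_crossover(child, rng.choice(parents))
            if rng.random() < mut_rate:
                child = gaussian_mutation(child)
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Placeholder fitness: negative squared distance to an arbitrary target.
target = [1.0] * 8
fit = lambda x: -sum((a - b) ** 2 for a, b in zip(x, target))
best = evolve([0.0] * 8, fit)
```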
<h2 id="experimental-setup-light-absorbing-wavelength-optimization">Experimental Setup: Light-Absorbing Wavelength Optimization</h2>
<h3 id="training-data-and-deep-learning-performance">Training Data and Deep Learning Performance</h3>
<p>The models were trained on 10,000 to 100,000 molecules randomly sampled from PubChem (molecular weight 200-600 g/mol). Each molecule was labeled with DFT-computed excitation energy ($S_{1}$), <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO, and LUMO</a> energies using B3LYP/6-31G.</p>
<table>
  <thead>
      <tr>
          <th>Training Data</th>
          <th>Validity (%)</th>
          <th>Reconstructability (%)</th>
          <th>$S_{1}$ (R, MAE)</th>
          <th>HOMO (R, MAE)</th>
          <th>LUMO (R, MAE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>100,000</td>
          <td>88.8</td>
          <td>62.4</td>
          <td>0.977, 0.185 eV</td>
          <td>0.948, 0.168 eV</td>
          <td>0.960, 0.195 eV</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>86.7</td>
          <td>60.1</td>
          <td>0.973, 0.198 eV</td>
          <td>0.945, 0.172 eV</td>
          <td>0.955, 0.209 eV</td>
      </tr>
      <tr>
          <td>30,000</td>
          <td>85.3</td>
          <td>59.8</td>
          <td>0.930, 0.228 eV</td>
          <td>0.934, 0.191 eV</td>
          <td>0.945, 0.224 eV</td>
      </tr>
      <tr>
          <td>10,000</td>
          <td>83.2</td>
          <td>55.7</td>
          <td>0.913, 0.278 eV</td>
          <td>0.885, 0.244 eV</td>
          <td>0.917, 0.287 eV</td>
      </tr>
  </tbody>
</table>
<p>Validity refers to the proportion of chemically valid SMILES after RDKit inspection. Reconstructability measures how often the RNN can reproduce the original molecule from its ECFP (62.4% at 100k training samples by matching canonical SMILES among 10,000 generated strings).</p>
<h3 id="design-task-1-unconstrained-s1-modification">Design Task 1: Unconstrained S1 Modification</h3>
<p>Fifty seed molecules with $S_{1}$ values between 3.8 eV and 4.2 eV were evolved in both increasing and decreasing directions. With 50,000 training samples, $S_{1}$ increased by approximately 60% on average in the increasing direction and showed slightly lower rates of change in the decreasing direction. The asymmetry is attributed to the skewed $S_{1}$ distribution of training data (average $S_{1}$ of 4.3-4.4 eV, higher than the seed median of 4.0 eV). Performance saturated at approximately 50,000 training samples.</p>
<h3 id="design-task-2-s1-modification-with-homolumo-constraints">Design Task 2: S1 Modification with HOMO/LUMO Constraints</h3>
<p>The same 50 seeds were evolved with constraints: $-7.0 \text{ eV} &lt; \text{HOMO} &lt; -5.0 \text{ eV}$ and $\text{LUMO} &lt; 0.0 \text{ eV}$. In the increasing $S_{1}$ direction, constraints suppressed the rate of change because both HOMO and LUMO bounds limit the achievable HOMO-LUMO gap. In the decreasing direction, constraints had minimal effect because LUMO could freely decrease while HOMO had sufficient room to rise within the allowed range.</p>
<h3 id="design-task-3-extrapolation-beyond-training-data">Design Task 3: Extrapolation Beyond Training Data</h3>
<p>To generate molecules with $S_{1}$ values below 1.77 eV (outside the training distribution, which had mean $S_{1}$ of 4.91 eV), the authors introduced iterative &ldquo;phases&rdquo;: generate molecules, compute their properties via DFT, retrain the models, and repeat. Starting from the 30 lowest-$S_{1}$ seed molecules with 300 generation runs per phase:</p>
<ul>
<li>Phase 1: Average $S_{1}$ = 2.20 eV, 12 molecules below 1.77 eV</li>
<li>Phase 2: Average $S_{1}$ = 2.22 eV, 37 molecules below 1.77 eV</li>
<li>Phase 3: Average $S_{1}$ = 2.31 eV, 58 molecules below 1.77 eV</li>
</ul>
<p>While the average $S_{1}$ rose slightly across phases, variance decreased (from 1.40 to 1.36), indicating the model concentrated its outputs closer to the target range. This active-learning-like loop demonstrates the framework can extend beyond the training distribution.</p>
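<p>The phase structure can be summarized as a short skeleton. The three helper functions below are placeholders for the paper's GA + RNN generator, the B3LYP/6-31G DFT labeling step, and the RNN/DNN retraining; they are stubbed out here purely to show the control flow:</p>

```python
# Placeholder hooks standing in for the paper's actual components.
def run_evolution(train_set):
    return [f"mol_{len(train_set)}_{i}" for i in range(3)]

def compute_dft_s1(molecule):
    return 1.5  # stand-in for a DFT-computed excitation energy (eV)

def retrain_models(train_set):
    pass

def extrapolation_phases(seeds, n_phases=3, target_s1=1.77):
    """Iterative loop: generate candidates, label them with DFT, collect
    hits below the target, fold new data back in, and retrain."""
    train_set, hits = list(seeds), []
    for _ in range(n_phases):
        candidates = run_evolution(train_set)
        labeled = [(m, compute_dft_s1(m)) for m in candidates]
        hits += [m for m, s1 in labeled if s1 < target_s1]
        train_set += [m for m, _ in labeled]
        retrain_models(train_set)
    return hits

hits = extrapolation_phases(["seed_a", "seed_b"])
```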
<h3 id="design-task-4-guacamol-benchmarks">Design Task 4: GuacaMol Benchmarks</h3>
<p>The method was evaluated on the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> goal-directed benchmark suite using the ChEMBL25 training dataset. The RNN model was retrained with three-character substrings.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th><a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a></th>
          <th>SMILES GA</th>
          <th><a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a></th>
          <th><a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph MCTS</a></th>
          <th>cRNN</th>
          <th>EDM (ours)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.607</td>
          <td>1.000</td>
          <td>0.378</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Troglitazone rediscovery</td>
          <td>0.419</td>
          <td>1.000</td>
          <td>0.558</td>
          <td>1.000</td>
          <td>0.312</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Thiothixene rediscovery</td>
          <td>0.456</td>
          <td>1.000</td>
          <td>0.495</td>
          <td>1.000</td>
          <td>0.308</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(-1.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.980</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(8.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.979</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>TPSA(150.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>CNS MPO</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.944</td>
          <td>0.948</td>
          <td>0.948</td>
      </tr>
  </tbody>
</table>
<p>The EDM achieves maximum scores on all eight tasks, matching the cRNN baseline. The 256 highest-scoring molecules from the ChEMBL25 test set were used as seeds, with 500 SMILES strings generated per seed.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="results">Results</h3>
<p>The evolutionary design framework successfully evolved seed molecules toward target properties across all four design tasks. The RNN decoder maintained 88.8% chemical validity at 100k training samples, and the DNN property predictor achieved correlation coefficients above 0.94 for $S_{1}$, HOMO, and LUMO prediction. The iterative retraining procedure enabled exploration outside the training data distribution, generating 58 molecules with $S_{1}$ below 1.77 eV after three phases. On GuacaMol benchmarks, the method achieved maximum scores on all eight tasks, matching <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a>, <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, and cRNN baselines.</p>
<h3 id="limitations">Limitations</h3>
<p>Several limitations are worth noting:</p>
<ol>
<li><strong>Reconstructability ceiling</strong>: Only 62.4% of molecules could be reconstructed from their ECFP vectors, meaning the RNN decoder fails to recover the original molecule approximately 38% of the time. This information loss in the ECFP encoding is a fundamental bottleneck.</li>
<li><strong>Data dependence</strong>: Performance is sensitive to the training data distribution. The asymmetric evolution rates for increasing vs. decreasing $S_{1}$ directly reflect the skewed training data.</li>
<li><strong>Structural constraints</strong>: Three heuristic constraints (fused ring sizes, number of fused rings, alkyl chain lengths) were still needed to maintain reasonable molecular structures, partially undermining the claim of a fully data-driven approach.</li>
<li><strong>DFT reliance</strong>: The extrapolation experiment requires DFT calculations in the loop, which are computationally expensive and may limit scalability.</li>
<li><strong>Limited benchmark scope</strong>: Only 8 GuacaMol tasks were tested, and all achieved perfect scores, making it difficult to differentiate from competing methods. The paper does not report on harder multi-objective benchmarks.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>PubChem random sample</td>
          <td>10,000-100,000 molecules</td>
          <td>MW 200-600 g/mol, labeled with DFT-computed $S_{1}$, HOMO, LUMO</td>
      </tr>
      <tr>
          <td>GuacaMol Benchmark</td>
          <td>ChEMBL25</td>
          <td>Standard split</td>
          <td>Used for retraining RNN; 256 top-scoring seeds</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Genetic algorithm</strong>: DEAP library; population 50, crossover rate 0.7, mutation rate 0.3, tournament size 3</li>
<li><strong>RNN decoder</strong>: 3 hidden layers, 500 LSTM units each, three-character substring generation</li>
<li><strong>DNN predictor</strong>: 5 layers, 250 hidden units, sigmoid activations, linear output</li>
<li><strong>Training</strong>: Adam optimizer, mini-batch 100, 500 epochs, dropout 0.5</li>
</ul>
<h3 id="models">Models</h3>
<p>All neural networks were implemented using Keras with the Theano backend (GPU-accelerated). No pre-trained model weights are publicly available.</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>RNN validity</strong>: Proportion of chemically valid SMILES (RDKit check)</li>
<li><strong>Reconstructability</strong>: Fraction of seed molecules recoverable from ECFP (canonical SMILES match in 10,000 generated strings)</li>
<li><strong>DNN accuracy</strong>: Correlation coefficient (R) and MAE via 10-fold cross-validation</li>
<li><strong>Evolutionary performance</strong>: Average rate of $S_{1}$ change across 50 seeds; molecule count in target range</li>
<li><strong>GuacaMol</strong>: Standard rediscovery and property satisfaction benchmarks</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU models, training times, or computational requirements for the evolutionary runs. DFT calculations used the Gaussian 09 program suite with B3LYP/6-31G.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained models are available. The paper is published under a CC-BY 4.0 license as open access in Scientific Reports.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.nature.com/articles/s41598-021-96812-8">Paper (Nature)</a></td>
          <td>Paper</td>
          <td>CC-BY 4.0</td>
          <td>Open access</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Partially Reproducible. The method is described in sufficient detail for reimplementation, but no code, trained models, or preprocessed datasets are released. The DFT calculations require Gaussian 09, a commercial software package.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kwon, Y., Kang, S., Choi, Y.-S., &amp; Kim, I. (2021). Evolutionary design of molecules based on deep learning and a genetic algorithm. <em>Scientific Reports</em>, 11, 17304. <a href="https://doi.org/10.1038/s41598-021-96812-8">https://doi.org/10.1038/s41598-021-96812-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kwon2021evolutionary,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evolutionary design of molecules based on deep learning and a genetic algorithm}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kwon, Youngchun and Kang, Seokho and Choi, Youn-Suk and Kim, Inkoo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-021-96812-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v3: Scaffold-Constrained Graph Transformer</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</guid><description>DrugEx v3 proposes a Graph Transformer with novel positional encoding for scaffold-constrained molecular generation via multi-objective reinforcement learning.</description><content:encoded><![CDATA[<h2 id="a-graph-transformer-method-for-scaffold-constrained-drug-design">A Graph Transformer Method for Scaffold-Constrained Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces DrugEx v3, a Graph Transformer model for scaffold-constrained de novo drug design. The primary contribution is a novel positional encoding scheme for molecular graphs that allows a Transformer architecture to operate on graph-structured molecular data rather than <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. The model takes user-provided scaffold fragments as input and generates complete molecules through growing and connecting operations, trained with multi-objective reinforcement learning to optimize for both target affinity and drug-likeness.</p>
<h2 id="from-fixed-objectives-to-user-guided-scaffold-design">From Fixed Objectives to User-Guided Scaffold Design</h2>
<p>Prior versions of DrugEx (v1 and <a href="/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">v2</a>) used RNN-based generators trained with reinforcement learning for de novo drug design, but they operated under fixed objectives and could not accept user-provided structural priors. If a medicinal chemist wanted to explore analogs of a specific scaffold, the model needed retraining from scratch. Meanwhile, SMILES-based molecular generators face inherent limitations for scaffold-constrained design: SMILES is a linear notation, so inserting fragments at multiple positions of a scaffold requires complex grammar handling, and small token changes can produce invalid molecules.</p>
<p>Several approaches had been proposed for scaffold-based generation, including graph generative models (Lim et al., 2019), DeepScaffold (Li et al., 2020), SMILES-based scaffold decorators (Arus-Pous et al., 2020), and SyntaLinker for fragment linking (Yang et al., 2020). DrugEx v3 aims to combine the advantages of graph representations (validity guarantees, local invariance, flexible extension) with the Transformer architecture&rsquo;s ability to handle complex dependencies, while maintaining the multi-objective reinforcement learning framework from DrugEx v2.</p>
<h2 id="graph-positional-encoding-for-molecular-transformers">Graph Positional Encoding for Molecular Transformers</h2>
<p>The core innovation is adapting the Transformer architecture to work directly with molecular graph representations. Two key modifications make this possible.</p>
<p><strong>Graph word encoding.</strong> Since atoms and bonds cannot be processed simultaneously in a graph, the authors combine them into a single index:</p>
<p>$$
W = T_{atom} \times 4 + T_{bond}
$$</p>
<p>where $T_{atom}$ is the atom type index and $T_{bond}$ is the bond type index (four bond types: single, double, triple, and none).</p>
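<p>The combined word index follows directly from the formula. A minimal sketch, where the atom and bond vocabularies are illustrative assumptions rather than the paper's actual token tables:</p>

```python
# Sketch of the graph "word" index W = T_atom * 4 + T_bond.
# The vocabularies below are assumed for illustration only.
ATOM_TYPES = {"C": 0, "N": 1, "O": 2, "F": 3}
BOND_TYPES = {"none": 0, "single": 1, "double": 2, "triple": 3}

def graph_word(atom: str, bond: str) -> int:
    # With exactly four bond types, multiplying the atom index by 4
    # gives every (atom, bond) pair a unique integer index.
    return ATOM_TYPES[atom] * 4 + BOND_TYPES[bond]

print(graph_word("O", "double"))  # -> 10 (2 * 4 + 2)
```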
<p><strong>Graph positional encoding.</strong> Standard sequential position encoding does not capture molecular topology. The authors propose an adjacency-matrix-based positional encoding:</p>
<p>$$
P = I_{Atom} \times L_{max} + I_{Connected}
$$</p>
<p>where $I_{Atom}$ is the current atom index, $L_{max}$ is the maximum sequence length, and $I_{Connected}$ is the index of the atom connected by the current bond. This encoding is then processed through the standard sinusoidal positional encoding:</p>
<p>$$
PE_{(p, 2i)} = \sin(p / 10000^{2i / d_{m}})
$$</p>
<p>$$
PE_{(p, 2i+1)} = \cos(p / 10000^{2i / d_{m}})
$$</p>
<p>with $d_{m} = 512$.</p>
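<p>A minimal sketch of the two-step encoding. The $L_{max} = 100$ value here is an assumption chosen for readability; only $d_m = 512$ comes from the paper:</p>

```python
import math

def graph_position(i_atom: int, i_connected: int, l_max: int = 100) -> int:
    # P = I_atom * L_max + I_connected: one integer encoding both the
    # current atom and the atom it bonds to.
    return i_atom * l_max + i_connected

def sinusoidal_pe(pos: int, d_model: int = 512) -> list:
    # Standard Transformer sinusoidal encoding applied to the graph position:
    # even dimensions use sin, odd dimensions use cos, sharing the frequency.
    return [
        math.sin(pos / 10000 ** (2 * (i // 2) / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** (2 * (i // 2) / d_model))
        for i in range(d_model)
    ]

pe = sinusoidal_pe(graph_position(3, 1))  # atom 3 bonded back to atom 1
print(len(pe))  # -> 512
```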
<p><strong>Molecule generation procedure.</strong> Each molecule in the training data is represented as a five-row matrix encoding atom type, bond type, connected atom index, current atom index, and fragment index. The columns are divided into three sections: fragment (the scaffold), growing (new atoms added to fragments), and linking (bonds connecting grown fragments). The decoder uses a GRU-based recurrent layer to sequentially output atom type, bond type, connected atom index, and current atom index at each step, with chemical valence rules enforced at every generation step to guarantee valid molecules.</p>
<p><strong>Multi-objective reinforcement learning.</strong> The generator is trained with a policy gradient objective:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) | \theta\right] = \sum_{t=1}^{T} \log G(y_{t} | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
<p>where $R^{*}$ is a Pareto-based reward combining target affinity and QED drug-likeness score:</p>
<p>$$
R^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>with $k$ being the solution&rsquo;s index in the Pareto rank. An exploration strategy uses two networks: an exploitation network $G_{\theta}$ (updated by policy gradient) and an exploration network $G_{\phi}$ (fixed, pre-trained on ChEMBL), with an exploration rate $\varepsilon$ controlling how many scaffolds are routed to $G_{\phi}$ during training.</p>
<h2 id="experimental-setup-architecture-comparison-and-rl-optimization">Experimental Setup: Architecture Comparison and RL Optimization</h2>
<h3 id="data">Data</h3>
<p>The ChEMBL set (version 27) contained approximately 1.7 million molecules for pre-training, preprocessed via RDKit (charge neutralization, metal/fragment removal). The LIGAND set comprised 10,828 adenosine receptor ligands for fine-tuning. Each molecule was decomposed into fragments using the BRICS algorithm, creating scaffold-molecule pairs (up to 15 pairs per molecule with four fragments). The ChEMBL set yielded 9.3 million training pairs, and the LIGAND set produced 53,888 training pairs.</p>
<h3 id="architecture-comparison">Architecture comparison</h3>
<p>Four architectures were compared:</p>
<ol>
<li><strong>Graph Transformer</strong>: graph input with novel positional encoding</li>
<li><strong>Sequential Transformer</strong>: SMILES input with standard Transformer</li>
<li><strong>LSTM-BASE</strong>: SMILES encoder-decoder with three recurrent layers</li>
<li><strong>LSTM+ATTN</strong>: LSTM-BASE with an attention mechanism between encoder and decoder</li>
</ol>
<p>All models were pre-trained on ChEMBL and fine-tuned on the LIGAND set. The bioactivity predictor was a random forest regression model using 2048D ECFP6 fingerprints and 19D physicochemical descriptors, with an activity threshold of pX = 6.5 for the A2A adenosine receptor.</p>
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>Five metrics were used: validity (parseable molecules), accuracy (scaffold containment), desirability (meeting all objectives), uniqueness, and novelty (not in ChEMBL). Diversity was measured using the Solow-Polasky index with Tanimoto distance on ECFP6 fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\intercal} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
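<p>The index needs no ML machinery: build the similarity matrix $F$, solve $F\mathbf{x} = \mathbf{e}$, and average, since $\mathbf{e}^{\intercal} F^{-1} \mathbf{e} = \sum_i x_i$. A pure-Python sketch in which the decay parameter <code>theta</code> and the toy distance matrices are assumptions (the paper uses Tanimoto distances on ECFP6 fingerprints):</p>

```python
import math

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for small dense systems.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def solow_polasky(dists, theta=1.0):
    # I(A) = (1/|A|) e^T F^{-1} e with F_ij = exp(-theta * d_ij).
    n = len(dists)
    F = [[math.exp(-theta * dists[i][j]) for j in range(n)] for i in range(n)]
    x = solve(F, [1.0] * n)          # F x = e, so sum(x) = e^T F^{-1} e
    return sum(x) / n

# Two maximally distant molecules vs. two near-duplicates:
print(round(solow_polasky([[0, 1], [1, 0]]), 3))        # -> 0.731
print(solow_polasky([[0, 0.01], [0.01, 0]]))            # close to 0.5
```

<p>As the example shows, the normalized index approaches 1 for mutually distant molecules and $1/|A|$ for near-duplicates, matching the 0.84&ndash;0.88 diversity values reported later.</p>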
<h3 id="hardware">Hardware</h3>
<p>Models were benchmarked on a server with NVIDIA Tesla P100 GPUs.</p>
<h2 id="key-results-graph-representation-advantages-and-rl-trade-offs">Key Results: Graph Representation Advantages and RL Trade-offs</h2>
<h3 id="pre-training-and-fine-tuning-performance">Pre-training and fine-tuning performance</h3>
<p>The Graph Transformer achieved the best overall performance across all metrics:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity (PT)</th>
          <th>Accuracy (PT)</th>
          <th>Validity (FT)</th>
          <th>Accuracy (FT)</th>
          <th>Novelty (FT)</th>
          <th>Uniqueness (FT)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph Transformer (512)</td>
          <td>100.0%</td>
          <td>99.3%</td>
          <td>100.0%</td>
          <td>99.2%</td>
          <td>68.9%</td>
          <td>82.9%</td>
      </tr>
      <tr>
          <td>Seq. Transformer (512)</td>
          <td>96.7%</td>
          <td>74.0%</td>
          <td>99.3%</td>
          <td>92.7%</td>
          <td>8.9%</td>
          <td>28.9%</td>
      </tr>
      <tr>
          <td>LSTM+ATTN (512)</td>
          <td>94.3%</td>
          <td>72.8%</td>
          <td>96.9%</td>
          <td>85.9%</td>
          <td>6.3%</td>
          <td>20.7%</td>
      </tr>
      <tr>
          <td>LSTM-BASE (512)</td>
          <td>93.9%</td>
          <td>52.4%</td>
          <td>98.7%</td>
          <td>81.6%</td>
          <td>3.9%</td>
          <td>19.2%</td>
      </tr>
  </tbody>
</table>
<p>PT = pre-trained, FT = fine-tuned. The Graph Transformer achieved 100% validity due to its explicit valence checking at each generation step. It also produced substantially more novel and unique molecules after fine-tuning compared to SMILES-based methods.</p>
<p>The authors identified four advantages of the graph representation over SMILES: (1) local invariance, where fragment ordering does not affect output; (2) global extendibility, where new atoms can be appended without restructuring existing data; (3) freedom from grammar constraints; and (4) direct accessibility of chemical valence rules for validity enforcement.</p>
<h3 id="reinforcement-learning-results">Reinforcement learning results</h3>
<p>With multi-objective RL (affinity + QED), 74.6% of generated molecules were predicted active at $\varepsilon = 0.0$. The exploration rate $\varepsilon$ trades off desirability against uniqueness:</p>
<table>
  <thead>
      <tr>
          <th>$\varepsilon$</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.0</td>
          <td>74.6%</td>
          <td>60.7%</td>
          <td>60.6%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.1</td>
          <td>66.8%</td>
          <td>75.0%</td>
          <td>74.6%</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>0.2</td>
          <td>61.6%</td>
          <td>80.2%</td>
          <td>79.4%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.3</td>
          <td>56.8%</td>
          <td>89.8%</td>
          <td>88.8%</td>
          <td>0.874</td>
      </tr>
  </tbody>
</table>
<p>The authors report that $\varepsilon = 0.3$ produced the best balance between desirability and uniqueness, with 56.8% desired molecules and 89.8% uniqueness. Diversity remained above 0.84 across all settings.</p>
<h3 id="limitations">Limitations</h3>
<p>The Graph Transformer produced molecules with worse synthetic accessibility (SA) scores than the SMILES-based methods, particularly after fine-tuning on the smaller LIGAND set. The authors attribute this to uncommon ring systems generated when the model handles long-distance dependencies. A kekulization issue also causes a small fraction of molecules to fail scaffold matching: aromatic bond inference during sanitization can alter the scaffold substructure. Under single-objective RL on affinity alone (without the QED objective), the model generates molecules with molecular weight exceeding 500 Da, reducing drug-likeness. All bioactivity predictions rely on a random forest model rather than experimental validation, and the t-SNE analysis suggests some generated molecules fall outside the model&rsquo;s applicability domain.</p>
<h3 id="future-directions">Future directions</h3>
<p>The authors propose extending the Graph Transformer to accept protein information as input via proteochemometric modeling, enabling design of ligands for targets without known ligands. Lead optimization, where a &ldquo;hit&rdquo; serves as input to generate improved analogs, is also identified as a natural extension.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v27</td>
          <td>~1.7M molecules (9.3M scaffold-molecule pairs)</td>
          <td>Preprocessed via RDKit</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LIGAND set (A2A AR ligands from ChEMBL)</td>
          <td>10,828 ligands (53,888 pairs)</td>
          <td>Split 8:1:1 train/val/test</td>
      </tr>
      <tr>
          <td>Bioactivity labels</td>
          <td>ChEMBL A2A AR activity data</td>
          <td>pX threshold = 6.5</td>
          <td>Average pChEMBL values</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fragment decomposition: BRICS algorithm via RDKit (max 4 fragments per molecule)</li>
<li>Optimizer: Adam with learning rate $10^{-4}$, batch size 256</li>
<li>Pre-training: 20 epochs; fine-tuning: up to 1,000 epochs with early stopping (patience: 100 epochs)</li>
<li>Bioactivity predictor: random forest regression (scikit-learn) with 2048D ECFP6 + 19D physicochemical descriptors</li>
<li>Pareto-based multi-objective ranking with GPU acceleration</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Graph Transformer: 512 hidden units, 8 attention heads, $d_{k} = d_{v} = 64$</li>
<li>Sequential Transformer: same hidden size, sinusoidal positional encoding</li>
<li>LSTM-BASE / LSTM+ATTN: 128 embedding units, 512 hidden units, 3 recurrent layers</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Graph Transformer</th>
          <th>Best SMILES Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (fine-tuned)</td>
          <td>100.0%</td>
          <td>99.6% (LSTM-BASE 1024)</td>
          <td>Valence checking guarantees validity</td>
      </tr>
      <tr>
          <td>Accuracy (fine-tuned)</td>
          <td>99.2%</td>
          <td>94.3% (Seq. Transformer 1024)</td>
          <td>Scaffold containment</td>
      </tr>
      <tr>
          <td>Desirability (RL, $\varepsilon$=0.0)</td>
          <td>74.6%</td>
          <td>N/A</td>
          <td>Only Graph Transformer used for RL</td>
      </tr>
      <tr>
          <td>Diversity (RL)</td>
          <td>0.879</td>
          <td>N/A</td>
          <td>Solow-Polasky index</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware-1">Hardware</h3>
<p>NVIDIA Tesla P100 GPUs. Specific training times were not reported, but the Transformer models trained faster than LSTM models with the same hidden layer size.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CDDLeiden/DrugEx">CDDLeiden/DrugEx</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (v1, v2, v3)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v27</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Pre-training data source</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P., &amp; van Westen, G. J. P. (2023). DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. <em>Journal of Cheminformatics</em>, 15, 24. <a href="https://doi.org/10.1186/s13321-023-00694-z">https://doi.org/10.1186/s13321-023-00694-z</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2023drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00694-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DeepSMILES: Adapting SMILES Syntax for Machine Learning</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</guid><description>DeepSMILES modifies SMILES syntax to eliminate unbalanced parentheses and unpaired ring closures, reducing invalid outputs from generative molecular models.</description><content:encoded><![CDATA[<h2 id="a-new-molecular-string-notation-for-generative-models">A New Molecular String Notation for Generative Models</h2>
<p>This is a <strong>Method</strong> paper that introduces DeepSMILES, a modified SMILES syntax designed to reduce the rate of syntactically invalid strings produced by machine-learning generative models. The primary contribution is a pair of string-level transformations (for ring closures and for branches) that can be applied independently and interconverted with standard SMILES without loss of information, including stereochemistry.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p>Deep neural networks for de novo molecular design commonly operate on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational autoencoders</a> (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al., 2018</a>), recurrent neural networks with LSTM (<a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al., 2018</a>; Olivecrona et al., 2017), and grammar-based approaches (<a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Kusner et al., 2017</a>) all generate molecules by sampling character sequences. A persistent problem is that many generated strings are syntactically invalid SMILES, with reported validity rates ranging from 7% to 80%.</p>
<p>Two structural features of SMILES syntax are responsible for most invalid strings:</p>
<ol>
<li><strong>Balanced parentheses</strong>: Branches require matched open/close parenthesis pairs. A generative model must track nesting state across long sequences to produce valid brackets.</li>
<li><strong>Paired ring closure symbols</strong>: Rings require two identical digit tokens at corresponding positions. The model must remember which digits are &ldquo;open&rdquo; and close them appropriately.</li>
</ol>
<p>Grammar-based approaches (e.g., <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>) can enforce balanced parentheses through a context-free grammar, but they cannot enforce the ring closure pairing constraint because that constraint is context-sensitive. Syntax-directed approaches (Dai et al., 2018) add explicit ring closure constraints but at the cost of significantly more complex decoder architectures.</p>
<h2 id="core-innovation-postfix-branch-notation-and-single-ring-closure-symbols">Core Innovation: Postfix Branch Notation and Single Ring Closure Symbols</h2>
<p>DeepSMILES addresses both syntax problems through two independent string transformations.</p>
<h3 id="ring-closure-transformation">Ring closure transformation</h3>
<p>Standard SMILES uses a pair of identical digits to mark ring openings and closings (e.g., <code>c1ccccc1</code> for benzene). DeepSMILES eliminates the ring-opening digit and replaces the ring-closing digit with the ring size, counting back along the tree path to the ring-opening atom. Benzene becomes <code>cccccc6</code>, where <code>6</code> means &ldquo;connect to the atom 6 positions back.&rdquo;</p>
<p>This transformation has three key properties:</p>
<ul>
<li>Every ring of a given size always uses the same digit, regardless of context. A phenyl ring is always <code>cccccc6</code> in DeepSMILES, whereas in SMILES it might be <code>c1ccccc1</code>, <code>c2ccccc2</code>, <code>c3ccccc3</code>, etc.</li>
<li>A single symbol cannot be &ldquo;unmatched&rdquo; since there is no corresponding opening symbol.</li>
<li>For double-digit ring sizes, the <code>%N</code> notation is used (and <code>%(N)</code> for sizes above 99).</li>
</ul>
<p>Bond stereochemistry is preserved by moving any explicit or stereo bond from the eliminated ring-opening symbol to the ring-closing symbol, with direction adjusted as needed.</p>
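<p>The counting rule is easy to see in code. The sketch below is a toy, not the reference implementation (the <code>deepsmiles</code> package): it handles only single-digit ring closures in branchless SMILES with single-character atoms.</p>

```python
def rings_to_deepsmiles(smiles: str) -> str:
    # Toy encoder: drop the ring-opening digit and replace the closing
    # digit with the ring size, i.e. how many atoms back along the
    # chain the ring-opening atom sits.
    open_at = {}        # ring-closure digit -> index of its opening atom
    atoms, out = [], []
    for ch in smiles:
        if ch.isdigit():
            if ch in open_at:
                out.append(str(len(atoms) - open_at.pop(ch)))
            else:
                open_at[ch] = len(atoms) - 1  # digit follows its atom
        else:
            atoms.append(ch)
            out.append(ch)
    return "".join(out)

print(rings_to_deepsmiles("c1ccccc1"))  # -> cccccc6 (benzene)
print(rings_to_deepsmiles("C1CC1"))     # -> CCC3 (cyclopropane)
```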
<h3 id="branch-parenthesis-transformation">Branch (parenthesis) transformation</h3>
<p>Standard SMILES uses matched open/close parenthesis pairs for branches (e.g., <code>C(OC)(SC)F</code>). DeepSMILES replaces this with a postfix notation inspired by Reverse Polish Notation (RPN). Only close parentheses are used, and the number of consecutive close parentheses indicates how far back on the current branch the next atom attaches.</p>
<p>For example, <code>C(OC)(SC)F</code> becomes <code>COC))SC))F</code>. The interpretation uses a stack: atoms are pushed onto the stack as they are read, each close parenthesis pops one atom from the stack, and the next atom connects to whatever is on top of the stack.</p>
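<p>The stack interpretation can be sketched directly. This toy decoder (an illustration, not the reference implementation) recovers the bond list for the example above, assuming single-character atoms:</p>

```python
def decode_branches(deepsmiles: str):
    # Toy decoder for DeepSMILES branch notation: atoms push onto a
    # stack, each ')' pops one atom, and each new atom bonds to the
    # atom currently on top of the stack.
    stack, atoms, bonds = [], [], []
    for ch in deepsmiles:
        if ch == ")":
            stack.pop()              # step back one atom on the current path
        else:
            idx = len(atoms)
            atoms.append(ch)
            if stack:
                bonds.append((stack[-1], idx))
            stack.append(idx)
    return atoms, bonds

atoms, bonds = decode_branches("COC))SC))F")
print(bonds)  # -> [(0, 1), (1, 2), (0, 3), (3, 4), (0, 5)]
```

<p>The recovered connectivity matches <code>C(OC)(SC)F</code>: the central carbon (index 0) bonds to the O, S, and F branches.</p>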
<h3 id="stereochemistry-preservation">Stereochemistry preservation</h3>
<p>Tetrahedral stereochemistry is fully preserved through the transformations. When ring closure symbol reordering would change the stereo configuration, the <code>@</code>/<code>@@</code> annotation is inverted during encoding to compensate.</p>
<h3 id="independence-of-transformations">Independence of transformations</h3>
<p>The two transformations are independent and can be applied separately or together. Any application of DeepSMILES should specify which transformations were applied.</p>
<h2 id="roundtrip-validation-on-chembl-23">Roundtrip Validation on ChEMBL 23</h2>
<p>The authors validated DeepSMILES by roundtripping all entries in the ChEMBL 23 database through SMILES-to-DeepSMILES-to-SMILES conversion. Canonical SMILES (including stereochemistry) were generated by four independent cheminformatics toolkits: CDK, OEChem, Open Babel, and RDKit. Using multiple toolkits ensures coverage of different traversal orders and ring closure ordering conventions.</p>
<p>All SMILES strings roundtripped without error across all three configurations (branches only, rings only, both). The exact string representation may differ in ring closure digit assignment or digit ordering, sometimes with an associated stereo inversion at tetrahedral centers, but the canonical SMILES of the original and roundtripped molecules are identical.</p>
<h3 id="performance-characteristics">Performance characteristics</h3>
<p>The following table shows the effect of DeepSMILES conversion on string length and throughput, measured on canonical SMILES from Open Babel for ChEMBL 23:</p>
<table>
  <thead>
      <tr>
          <th>Transformation</th>
          <th>Mean % change in length</th>
          <th>Encoding (per sec)</th>
          <th>Decoding (per sec)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Branches only</td>
          <td>+8.2%</td>
          <td>32,000</td>
          <td>16,000</td>
      </tr>
      <tr>
          <td>Rings only</td>
          <td>-6.4%</td>
          <td>26,000</td>
          <td>24,000</td>
      </tr>
      <tr>
          <td>Both</td>
          <td>+1.9%</td>
          <td>26,000</td>
          <td>17,500</td>
      </tr>
  </tbody>
</table>
<p>The ring transformation slightly shortens strings (by removing one digit per ring), while the branch transformation slightly lengthens them (additional close parentheses). Combined, the net effect is a small increase of about 2%. Throughput is in the tens of thousands of conversions per second in pure Python.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>DeepSMILES does not eliminate all invalid strings. Invalid DeepSMILES can still be generated, for example when there are more close parentheses than atoms on the stack, or when a ring size exceeds the number of available atoms. The reference implementation raises a <code>DecodeError</code> in these cases, though the authors note that a more tolerant decoder (ignoring extra parentheses or defaulting to the first atom for oversized rings) could be used during generation.</p>
<p>The paper assumes that input SMILES are generated by a standard cheminformatics toolkit as a depth-first traversal of the molecular graph. Non-standard SMILES (e.g., <code>CC(C1)CCCC1</code>) cannot be directly encoded.</p>
<p>The authors suggest several directions for future work:</p>
<ul>
<li>Investigating whether a preferred traversal order (e.g., shorter branches first) would make DeepSMILES even easier for models to learn.</li>
<li>Exploring notations where atoms in the organic subset explicitly list their hydrogen count, which would allow a fully parenthesis-free representation.</li>
<li>Using SMILES augmentation with random traversal orders (as explored by Bjerrum and Threlfall, 2017) in combination with DeepSMILES.</li>
<li>Designing entirely new line notations optimized for ML, where every string maps to a valid molecule, there are few duplicate representations, small string changes produce small structural changes, and string length correlates with pharmaceutical relevance.</li>
</ul>
<p>The fused ring case presents additional complexity: a bicyclic system has three cycles, and depending on traversal order, the ring size digit may not directly correspond to the ring size of any individual ring. This is an inherent limitation of depth-first traversal-based notations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation</td>
          <td>ChEMBL 23</td>
          <td>~1.7M compounds</td>
          <td>Canonical SMILES from CDK, OEChem, Open Babel, RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The DeepSMILES encoder and decoder are pure string-processing algorithms with no machine-learning components. The transformations operate on SMILES syntax tokens (atoms, bonds, parentheses, ring closure digits) without chemical interpretation.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Roundtrip accuracy</td>
          <td>100%</td>
          <td>All ChEMBL 23 entries across 4 toolkits</td>
      </tr>
      <tr>
          <td>Encoding throughput</td>
          <td>26,000-32,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
      <tr>
          <td>Decoding throughput</td>
          <td>16,000-24,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>No specific hardware requirements. The implementation is a pure Python module with no GPU dependencies.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/nextmovesoftware/deepsmiles">deepsmiles</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Pure Python encoder/decoder</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: O&rsquo;Boyle, N. M., &amp; Dalke, A. (2018). DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv.7097960.v1">https://doi.org/10.26434/chemrxiv.7097960.v1</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oboyle2018deepsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{O&#39;Boyle, Noel M. and Dalke, Andrew}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ChemRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv.7097960.v1}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Curriculum Learning for De Novo Drug Design (REINVENT)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/curriculum-learning-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/curriculum-learning-molecular-design/</guid><description>Curriculum learning applied to REINVENT accelerates convergence on complex multi-parameter drug design objectives compared to standard reinforcement learning.</description><content:encoded><![CDATA[<h2 id="curriculum-learning-as-a-method-for-molecular-generation">Curriculum Learning as a Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces curriculum learning (CL) into the <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo molecular design platform. The primary contribution is a training strategy that decomposes complex multi-parameter optimization (MPO) objectives into sequences of simpler tasks with increasing complexity. The agent learns each simpler task before progressing to the full production objective, accelerating convergence and improving the quality and diversity of generated molecules compared to standard policy-based reinforcement learning (RL).</p>
<h2 id="the-computational-cost-of-complex-reward-functions">The Computational Cost of Complex Reward Functions</h2>
<p>Policy-based RL for molecular design works by training a generative model (the agent) to produce molecules that maximize a reward function. In practice, drug design reward functions often include computationally expensive components such as molecular docking. When the reward landscape is complex and minima are difficult to find, the agent may spend many epochs sampling molecules far from the desired objective. The resulting small gradients cause minimal policy updates, leading to long periods of non-productivity. This is particularly wasteful when each reward evaluation involves expensive physics-based computations.</p>
<p>The core problem is that standard RL treats the full MPO objective as a monolithic task. If the agent cannot find any rewarding molecules early in training, it receives near-zero gradients and makes negligible progress. This creates a bootstrapping problem: the agent needs to already be sampling from favorable regions of chemical space to receive useful learning signals, but it has no guidance on how to get there.</p>
<p>Curriculum learning, originally proposed by Bengio et al. (2009), addresses this by arranging training tasks in order of increasing difficulty. When constituent tasks are correlated with the final objective, the gradients from simpler tasks provide more effective traversal of the optimization landscape.</p>
<h2 id="formalized-curriculum-strategy-for-reinvent">Formalized Curriculum Strategy for REINVENT</h2>
<p>The key innovation is a two-phase training protocol with formal definitions for curriculum progression.</p>
<p>A scoring function maps SMILES strings to desirability scores in $[0, 1]$ using a weighted geometric mean:</p>
<p>$$S(x) = \left(\prod_{i=1}^{n} c_{i}(x)^{w_{i}}\right)^{1 / \sum_{i=1}^{n} w_{i}}$$</p>
<p>where $x$ is a sampled compound in SMILES format, $c_{i}$ is the $i$-th scoring component, and $w_{i}$ is its weight.</p>
<p>A Curriculum $C$ consists of a sequence of Objectives $O = \{O_{C_1}, \ldots, O_{C_n}, O_{P}\}$, where subscripts $C$ and $P$ denote Curriculum and Production Objectives respectively. Each Objective has a corresponding scoring function. Progression is controlled by Curriculum Progression Criteria $P = \{P_{1}, \ldots, P_{n}\}$, where each $P_{i}$ defines a score threshold the agent must achieve before advancing to the next objective.</p>
<p><strong>Curriculum Phase</strong>: The agent trains on sequential Curriculum Objectives with increasing complexity. A diversity filter is not applied during this phase, as it could be counterproductive to guiding the agent toward favorable chemical space. No computationally expensive components (e.g., docking) are used in Curriculum Objectives.</p>
<p><strong>Production Phase</strong>: Activated only when the final Curriculum Progression Criterion $P_{n}$ is satisfied. The agent now optimizes the full Production Objective, which may include expensive components like molecular docking. A new inception memory is initialized (clearing Curriculum Phase compounds), and a Bemis-Murcko scaffold diversity filter is applied to encourage exploration across multiple local minima.</p>
<p>The implementation builds on REINVENT&rsquo;s RNN architecture: three hidden layers of 512 LSTM cells with an embedding size of 256 and a linear layer with softmax activation, pretrained on ChEMBL to learn SMILES syntax.</p>
<h2 id="three-experiments-on-pdk1-inhibitor-design">Three Experiments on PDK1 Inhibitor Design</h2>
<p>The authors evaluate CL on three molecular design tasks of increasing complexity, all centered on designing <a href="https://en.wikipedia.org/wiki/PDPK1">3-phosphoinositide-dependent protein kinase-1</a> (PDK1) inhibitors.</p>
<h3 id="experiment-1-target-scaffold-construction">Experiment 1: Target Scaffold Construction</h3>
<p>The goal is to generate compounds possessing a dihydro-pyrazoloquinazoline scaffold with a phenyl substituent, a scaffold not present in the prior&rsquo;s training set. Standard RL fails entirely over 2000 epochs because the probability of randomly sampling a compound with this scaffold is negligibly small, producing binary rewards (1.0 if scaffold present, 0.5 otherwise) that never rise above 0.5.</p>
<p>CL decomposes the target scaffold into 5 progressively complex substructures. Each Curriculum Progression Criterion threshold is set to 0.8. The agent learns to generate compounds with each substructure before advancing. CL finds the target scaffold within 1750 epochs, while baseline RL cannot find it in the same timeframe.</p>
<h3 id="experiments-2-and-3-molecular-docking-constraints">Experiments 2 and 3: Molecular Docking Constraints</h3>
<p>These experiments use a Production Objective combining a molecular docking constraint (retaining two hydrogen-bonding interactions with Ala 162 of PDK1, PDB ID: 2XCH) and QED (Quantitative Estimate of Druglikeness). Both experiments limit computational cost by capping production epochs at 300.</p>
<p><strong>Experiment 2</strong> uses Tanimoto (2D) similarity to a reference ligand as the Curriculum Objective. Two scenarios are tested: &ldquo;Low&rdquo; (threshold 0.5) and &ldquo;High&rdquo; (threshold 0.8).</p>
<p><strong>Experiment 3</strong> uses ROCS (3D) shape-based similarity to the reference ligand as the Curriculum Objective, with &ldquo;Low&rdquo; (0.5) and &ldquo;High&rdquo; (0.75) thresholds.</p>
<p>All experiments are run in triplicate. The baseline includes both standard RL and RL with Tanimoto/ROCS components added directly to the scoring function (not sequentially), to control for the presence of these components.</p>
<p>Across all CL experiments, CL generates between 2,941 and 9,068 more compounds with docking scores better than the reference ligand (-10.907 kcal/mol) compared to baseline RL, corresponding to 12.42-23.79% improvement in the fraction of compounds exceeding the reference. Comparing the two Curriculum Objective scenarios, the &ldquo;High&rdquo; threshold outperforms the &ldquo;Low&rdquo; threshold by 316-3,415 additional compounds (with percentage improvements ranging from -0.4% to 10.57%).</p>
<p>Baseline RL produces essentially no compounds satisfying the docking constraint for the first 100 epochs. CL agents achieve immediate productivity: in the &ldquo;High&rdquo; Tanimoto scenario, the initial docking score already exceeds the maximum score achieved by baseline RL over 300 epochs.</p>
<h3 id="scaffold-diversity-analysis">Scaffold Diversity Analysis</h3>
<p>CL generates more unique Bemis-Murcko scaffolds than baseline RL in all experiments. The &ldquo;High&rdquo; scenarios produce more unique scaffolds than the &ldquo;Low&rdquo; scenarios. CL also produces a higher fraction of &ldquo;favorable&rdquo; scaffolds (those with better docking scores than the reference ligand).</p>
<h2 id="accelerated-convergence-with-a-diversity-trade-off">Accelerated Convergence with a Diversity Trade-off</h2>
<p>The results demonstrate three consistent findings across all experiments:</p>
<ol>
<li>
<p><strong>Accelerated productivity</strong>: CL agents reach productive sampling of favorable compounds substantially faster than baseline RL. Even a single Curriculum Objective with a computationally inexpensive metric can guide the agent to regions of chemical space where expensive Production Objectives are readily satisfied.</p>
</li>
<li>
<p><strong>Improved output quality</strong>: CL generates more compounds with favorable docking scores, more unique scaffolds, and a higher fraction of scaffolds that outperform the reference ligand.</p>
</li>
<li>
<p><strong>Controllable trade-off</strong>: The Curriculum Progression Criterion threshold provides direct control over agent policy. Higher thresholds produce better Production Objective optimization but reduce intra-set diversity (higher cross-Tanimoto similarities among generated compounds). UMAP visualizations confirm that &ldquo;Low&rdquo; and &ldquo;High&rdquo; scenarios sample from nearby but distinct regions of chemical space.</p>
</li>
</ol>
<p>The authors note that even moderate optimization of similarity-based Curriculum Objectives (the &ldquo;Low&rdquo; scenarios) already substantially narrows the agent&rsquo;s perceived solution space. This suggests that CL inherently regularizes the agent policy, trading some diversity for convergence speed.</p>
<p><strong>Limitations</strong>: The authors acknowledge that data supporting the findings are available only upon request rather than as public deposits. The approach is demonstrated on a single target (PDK1), and the curriculum design requires domain expertise to decompose objectives appropriately. The inverse relationship between Curriculum Objective optimization and solution diversity means practitioners must carefully tune thresholds for their specific applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>Not specified</td>
          <td>Used to pretrain the RNN on SMILES syntax</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 receptor crystal structure</td>
      </tr>
  </tbody>
</table>
<p>Raw data supporting the findings are available from the corresponding author upon request.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>REINVENT platform with LSTM-based RNN (3 hidden layers, 512 cells, embedding size 256)</li>
<li>Scoring function: weighted geometric mean of components</li>
<li>Curriculum Progression Criteria: score thresholds (0.5 or 0.75-0.8 depending on scenario)</li>
<li>Diversity filter: Identical Murcko Scaffold with bucket size 25 (Production Phase only)</li>
<li>Inception (experience replay) for both phases, reset at phase transition</li>
<li>Batch size: 128, learning rate: 0.0001, sigma: 128, Adam optimizer</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Prior: RNN pretrained on ChEMBL SMILES</li>
<li>Agent: Initialized from prior, focused via RL/CL</li>
<li>No pretrained model weights are publicly released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score (Glide SP)</td>
          <td>Predicted binding affinity (kcal/mol)</td>
          <td>Lower is better; reference ligand: -10.907</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimate of Druglikeness</td>
          <td>Range [0, 1]</td>
      </tr>
      <tr>
          <td>Unique Bemis-Murcko scaffolds</td>
          <td>Scaffold diversity measure</td>
          <td>Averaged over triplicates</td>
      </tr>
      <tr>
          <td>Cross-Tanimoto similarity</td>
          <td>Intra-set compound diversity</td>
          <td>Calculated on pooled triplicates</td>
      </tr>
      <tr>
          <td>Tanimoto/ROCS similarity</td>
          <td>Curriculum Objective metrics</td>
          <td>2D fingerprint and 3D shape similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>GPU: NVIDIA Tesla V100 (32 GB)</li>
<li>Docking: AWS p3.8xlarge instance</li>
<li>LigPrep parallelized over 8 CPU cores</li>
<li>Glide docking parallelized over 48 CPU cores via DockStream</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>De novo molecular design platform</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity/blob/master/notebooks/Automated_Curriculum_Learning_Demo.ipynb">CL Tutorial Notebook</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Jupyter notebook tutorial for curriculum learning</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Fialková, V., Arango, J. D., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2022). Improving de novo molecular design with curriculum learning. <em>Nature Machine Intelligence</em>, 4, 555-563. <a href="https://doi.org/10.1038/s42256-022-00494-4">https://doi.org/10.1038/s42256-022-00494-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2022curriculum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving de novo molecular design with curriculum learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Fialkov{\&#39;a}, Vendy and Arango, Juan Diego and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{555--563}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00494-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CogMol: Controlled Molecule Generation for COVID-19</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/cogmol-target-specific-drug-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/cogmol-target-specific-drug-design/</guid><description>CogMol combines a SMILES VAE with controlled latent space sampling to generate drug-like molecules with target specificity for novel viral proteins.</description><content:encoded><![CDATA[<h2 id="a-controlled-generation-framework-for-target-specific-drug-design">A Controlled Generation Framework for Target-Specific Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces CogMol (Controlled Generation of Molecules), an end-to-end framework for de novo drug design. The primary contribution is a pipeline that combines a SMILES-based Variational Autoencoder (VAE) with multi-attribute controlled latent space sampling (CLaSS) to generate novel drug-like molecules with high binding affinity to specified protein targets, off-target selectivity, and favorable drug-likeness properties. The framework operates on protein sequence embeddings, allowing it to generalize to unseen target proteins without model retraining.</p>
<h2 id="multi-constraint-drug-design-for-novel-viral-targets">Multi-Constraint Drug Design for Novel Viral Targets</h2>
<p>Traditional drug discovery costs 2-3 billion USD and takes over a decade with less than 10% success rate. Generating drug molecules requires satisfying multiple competing objectives simultaneously: target binding affinity, off-target selectivity, synthetic accessibility, drug-likeness, and low toxicity. Prior generative approaches using reinforcement learning or Bayesian optimization are computationally expensive and typically require fine-tuning on target-specific ligand libraries, making them unable to generalize to unseen protein targets.</p>
<p>The emergence of SARS-CoV-2 in 2020 created an urgent need for antiviral drug candidates targeting novel viral proteins. Because no binding affinity data existed for these new targets, and the viral proteins were not closely related to proteins in existing databases like BindingDB, existing target-specific generative frameworks could not be directly applied. CogMol addresses this by using pre-trained protein sequence embeddings from UniRep (trained on 24 million UniRef50 sequences) rather than learning protein representations from the limited BindingDB training set.</p>
<h2 id="controlled-latent-space-sampling-with-pre-trained-protein-embeddings">Controlled Latent Space Sampling with Pre-trained Protein Embeddings</h2>
<p>CogMol&rsquo;s core innovation is a three-component architecture that enables multi-constraint molecule generation for unseen targets:</p>
<p><strong>1. SMILES VAE with adaptive pre-training.</strong> A Variational Autoencoder is first trained unsupervised on the MOSES/ZINC dataset (1.6M molecules), then jointly fine-tuned with QED and SA property predictors on BindingDB molecules. The standard VAE objective is:</p>
<p>$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{p(x)} \left\{ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z)) \right\}$$</p>
<p>where $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$ specifies a diagonal Gaussian encoder distribution.</p>
<p><strong>2. Protein-molecule binding affinity predictor.</strong> A regression model takes pre-trained UniRep protein sequence embeddings and molecule latent embeddings $z$ as input and predicts pIC50 binding affinity ($= -\log(\text{IC50})$). Because UniRep embeddings capture sequence, structural, and functional relationships from a large unsupervised corpus, the predictor can estimate binding affinity for novel target sequences not present in the training data.</p>
<p><strong>3. CLaSS controlled sampling.</strong> The Conditional Latent attribute Space Sampling scheme generates molecules satisfying multiple constraints (affinity, QED, selectivity) through rejection sampling in the VAE latent space:</p>
<p>$$p(\mathbf{x} | \mathbf{a}) = \mathbb{E}_{\mathbf{z}} [p(\mathbf{z} | \mathbf{a}) \, p(\mathbf{x} | \mathbf{z})] \approx \mathbb{E}_{\mathbf{z}} [\hat{p}_\xi(\mathbf{z} | \mathbf{a}) \, p_\theta(\mathbf{x} | \mathbf{z})]$$</p>
<p>where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ is a set of independent attribute constraints. The conditional density $\hat{p}_\xi(\mathbf{z} | \mathbf{a})$ is approximated using a Gaussian mixture model $Q_\xi(\mathbf{z})$ and per-attribute classifiers $q_\xi(a_i | \mathbf{z})$, combined via Bayes&rsquo; rule under a conditional independence assumption. The acceptance probability equals the product of all attribute predictor scores, enabling efficient multi-constraint sampling without training a surrogate model or a policy.</p>
<p><strong>Selectivity modeling.</strong> Off-target selectivity for a molecule $m$ against target $T$ is defined as:</p>
<p>$$\text{Sel}_{T,m} = \text{BA}(T, m) - \frac{1}{k} \sum_{i=1}^{k} \text{BA}(T_i, m)$$</p>
<p>where $\text{BA}(T, m)$ is binding affinity to the target and $T_i$ are $k$ randomly selected off-targets. This selectivity score is incorporated as a control attribute during CLaSS sampling.</p>
<h2 id="experimental-setup-covid-19-targets-and-in-silico-screening">Experimental Setup: COVID-19 Targets and In Silico Screening</h2>
<p><strong>Target proteins.</strong> CogMol was applied to three SARS-CoV-2 targets not present in BindingDB: NSP9 Replicase dimer, Main Protease (Mpro), and the Receptor-Binding Domain (RBD) of the spike protein. A cancer target (human HDAC1) with low ligand coverage in the training data was also evaluated.</p>
<p><strong>Training data.</strong> The SMILES VAE was trained on the MOSES benchmark (1.6M molecules from ZINC). The binding affinity predictor used curated IC50 data from BindingDB as reported in DeepAffinity, with all protein classes included in training.</p>
<p><strong>CLaSS controlled generation.</strong> Molecules were generated with simultaneous constraints on binding affinity (&gt; 0.5 normalized), QED (&gt; 0.8 normalized), and selectivity (&gt; 0.5 normalized). Approximately 1000 molecules per target were selected for downstream evaluation.</p>
<p><strong>In silico screening pipeline.</strong> Generated molecules underwent:</p>
<ul>
<li>Toxicity prediction via a multi-task deep neural network (MT-DNN) on 12 Tox21 in vitro endpoints and ClinTox clinical trial failure</li>
<li>Binding affinity rescoring with a higher-accuracy SMILES-level predictor</li>
<li>Blind docking (5 independent runs per molecule) using AutoDock Vina against target protein structures</li>
<li>Synthetic feasibility assessment using a retrosynthetic algorithm based on the Molecular Transformer trained on patent reaction data</li>
</ul>
<p><strong>Baselines.</strong> VAE performance was benchmarked against models from the MOSES platform. CLaSS-accepted molecules were compared against randomly sampled molecules from the latent space. Generated molecules were compared against FDA-approved drugs for toxicity and synthesizability.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>CLaSS enrichment (Table 1).</strong> CLaSS consistently produced higher fractions of molecules meeting all criteria compared to random sampling. For the triple constraint (affinity &gt; 0.5, QED &gt; 0.8, selectivity &gt; 0.5), the enrichment was substantial: 6.9% vs. 0.7% for NSP9, 9.0% vs. 0.9% for RBD, and 10.4% vs. 1.1% for Mpro.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>CLaSS (Aff+QED+Sel)</th>
          <th>Random (Aff+QED+Sel)</th>
          <th>Enrichment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NSP9</td>
          <td>6.9%</td>
          <td>0.7%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>RBD</td>
          <td>9.0%</td>
          <td>0.9%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>Mpro</td>
          <td>10.4%</td>
          <td>1.1%</td>
          <td>~9.5x</td>
      </tr>
  </tbody>
</table>
<p><strong>Docking results (Table 3).</strong> 87-95% of high-affinity generated molecules showed docking binding free energy (BFE) below -6 kcal/mol, with minimum BFEs reaching -8.6 to -9.5 kcal/mol depending on the target.</p>
<p><strong>Novelty.</strong> The likelihood of generating an exact duplicate of a training molecule was 2% or less. Against the full PubChem database (~103M molecules), exact matches ranged from 3.7% to 9.5%. Generated molecules also showed novel chemical scaffolds as confirmed by high Fr&eacute;chet ChemNet Distance.</p>
<p><strong>Synthesizability.</strong> Generated molecules for COVID-19 targets showed 85-90% synthetic feasibility using retrosynthetic analysis, exceeding the ~78% rate of FDA-approved drugs.</p>
<p><strong>Toxicity.</strong> Approximately 70% of generated parent molecules and ~80% of predicted metabolites were predicted toxic in at most 1 of the 13 endpoints, comparable to FDA-approved drugs.</p>
<h2 id="generated-molecules-show-favorable-binding-and-drug-like-properties">Generated Molecules Show Favorable Binding and Drug-Like Properties</h2>
<p>CogMol demonstrates that controlled latent space sampling with pre-trained protein embeddings can generate novel, drug-like molecules for unseen viral targets. The key findings are:</p>
<ol>
<li>CLaSS provides roughly 10x enrichment over random latent space sampling for molecules satisfying all three constraints (affinity, QED, selectivity).</li>
<li>Generated molecules bind favorably to druggable pockets in target protein 3D structures, even though the generation model uses only 1D sequence information.</li>
<li>Some generated SMILES matched existing PubChem molecules with known biological activity, suggesting the model identifies chemically relevant regions of molecular space.</li>
<li>The framework generalizes across targets of varying novelty, with Mpro (more similar to training proteins) yielding easier generation than NSP9 or RBD.</li>
</ol>
<p><strong>Limitations.</strong> The authors note that no wet-lab validation was performed on generated candidates. There may be divergence between ML-predicted properties and experimental measurements. The binding affinity predictor&rsquo;s accuracy is bounded by the quality and coverage of BindingDB training data. Selectivity modeling uses a random sample of off-targets rather than a pharmacologically curated panel.</p>
<p><strong>Future directions.</strong> The authors propose incorporating additional contexts beyond target protein (e.g., metabolic pathways), adding more pharmacologically relevant controls, and weighting objectives by relative importance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VAE pre-training</td>
          <td>MOSES/ZINC</td>
          <td>1.6M train, 176K test</td>
          <td>Publicly available benchmark</td>
      </tr>
      <tr>
          <td>VAE adaptive training</td>
          <td>BindingDB (DeepAffinity split)</td>
          <td>~27K protein-ligand pairs</td>
          <td>Curated IC50 data</td>
      </tr>
      <tr>
          <td>Protein embeddings</td>
          <td>UniRef50 via UniRep</td>
          <td>24M sequences</td>
          <td>Pre-trained, publicly available</td>
      </tr>
      <tr>
          <td>Toxicity prediction</td>
          <td>Tox21 + ClinTox</td>
          <td>12 in vitro + clinical endpoints</td>
          <td>Public benchmark datasets</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>3 SARS-CoV-2 targets</td>
          <td>Public crystal structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>VAE architecture: SMILES encoder-decoder with diagonal Gaussian latent space, jointly trained with QED and SA regressors</li>
<li>CLaSS: rejection sampling from Gaussian mixture model of latent space with per-attribute classifiers</li>
<li>Binding affinity: regression on concatenated UniRep protein embeddings and VAE molecule embeddings</li>
<li>Selectivity: excess binding affinity over average of $k$ random off-targets</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>SMILES VAE with adaptive pre-training (ZINC then BindingDB)</li>
<li>Multi-task toxicity classifier (MT-DNN) for Tox21 and ClinTox endpoints</li>
<li>Binding affinity predictor (latent-level for generation, SMILES-level for screening)</li>
<li>Retrosynthetic predictor based on Molecular Transformer</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>90%</td>
          <td>-</td>
          <td>Generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99%</td>
          <td>-</td>
          <td>Among valid molecules</td>
      </tr>
      <tr>
          <td>Filter pass</td>
          <td>95%</td>
          <td>-</td>
          <td>Relevant chemical filters</td>
      </tr>
      <tr>
          <td>Docking BFE &lt; -6 kcal/mol</td>
          <td>87-95%</td>
          <td>-</td>
          <td>Varies by target</td>
      </tr>
      <tr>
          <td>Synthetic feasibility</td>
          <td>85-90%</td>
          <td>78% (FDA drugs)</td>
          <td>COVID-19 targets</td>
      </tr>
      <tr>
          <td>Low toxicity (0-1 endpoints)</td>
          <td>~70% parent, ~80% metabolite</td>
          <td>Comparable to FDA drugs</td>
          <td>MT-DNN prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU types or training times. The work was funded internally by IBM Research.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">CogMol (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">~3500 generated molecules</a></td>
          <td>Dataset</td>
          <td>Open license</td>
          <td>For three SARS-CoV-2 targets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chenthamarakshan, V., Das, P., Hoffman, S. C., Strobelt, H., Padhi, I., Lim, K. W., Hoover, B., Manica, M., Born, J., Laino, T., &amp; Mojsilovic, A. (2020). CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. <em>Advances in Neural Information Processing Systems</em>, 33, 4320-4332.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chenthamarakshan2020cogmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chenthamarakshan, Vijil and Das, Payel and Hoffman, Samuel C. and Strobelt, Hendrik and Padhi, Inkit and Lim, Kar Wai and Hoover, Benjamin and Manica, Matteo and Born, Jannis and Laino, Teodoro and Mojsilovi{\&#39;c}, Aleksandra}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4320--4332}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CDDD: Learning Descriptors by Translating SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/</guid><description>CDDD learns continuous molecular descriptors by translating between SMILES and InChI representations, outperforming fingerprints in virtual screening.</description><content:encoded><![CDATA[<h2 id="a-translation-based-method-for-learned-molecular-descriptors">A Translation-Based Method for Learned Molecular Descriptors</h2>
<p>This is a <strong>Method</strong> paper that introduces Continuous and Data-Driven Descriptors (CDDD), a neural machine translation approach for learning fixed-size, continuous molecular representations. Rather than training an autoencoder to reconstruct <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, Winter et al. train an encoder-decoder model to translate between semantically equivalent but syntactically different molecular representations (e.g., randomized SMILES to canonical SMILES, or <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> to canonical SMILES). The bottleneck latent vector serves as a general-purpose molecular descriptor. Pretrained on approximately 72 million compounds from <a href="/notes/chemistry/datasets/zinc-22/">ZINC15</a> and PubChem, CDDD produces 512-dimensional descriptors that achieve competitive QSAR performance and significantly outperform all tested molecular fingerprints in ligand-based virtual screening.</p>
<h2 id="why-translation-instead-of-reconstruction">Why Translation Instead of Reconstruction?</h2>
<p>Molecular descriptors are central to cheminformatics. Traditional approaches rely on human-engineered fingerprints like ECFPs, which encode structural features as fixed-length bit vectors. While effective, these representations are constrained by predefined feature extraction rules.</p>
<p>Recent work applied deep neural networks directly to molecular graphs or SMILES strings to learn task-specific representations. However, these end-to-end approaches must learn features from scratch for each new dataset, making them prone to overfitting on the small bioactivity datasets typical in drug discovery.</p>
<p>Unsupervised approaches based on autoencoders (notably <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.&rsquo;s VAE</a> and <a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Xu et al.&rsquo;s seq2seq model</a>) offered a path toward general-purpose learned descriptors. These models reconstruct SMILES strings through an information bottleneck, forcing the latent space to capture molecular information. The concern with reconstruction, however, is that the model may focus on syntactic patterns of the string representation rather than the underlying chemical semantics. A model that memorizes SMILES syntax shortcuts can achieve low reconstruction error without truly encoding chemical meaning.</p>
<p>Winter et al. address this by drawing on the analogy to neural machine translation: a translator must understand the meaning of a sentence to produce a correct translation in another language. By training the model to translate between different molecular representations (which share chemical semantics but differ in syntax), the latent space is forced to capture the chemical information common to both representations, rather than representation-specific syntactic artifacts.</p>
<h2 id="translation-as-semantic-compression">Translation as Semantic Compression</h2>
<p>The core insight is that translating between two syntactically different but semantically equivalent representations forces the encoder to capture only the chemical meaning shared by both. The model architecture follows the standard encoder-decoder framework from neural machine translation.</p>
<p>The encoder reads a source molecular string (e.g., a randomized SMILES or InChI) and compresses it into a fixed-size latent vector. The decoder takes this latent vector and generates the target molecular string (canonical SMILES). The model is trained to minimize character-level cross-entropy between the decoder output and the target sequence.</p>
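<p>The character-level training objective can be sketched numerically as follows (toy distributions; a real decoder emits a softmax over the full SMILES vocabulary at every step):</p>

```python
import math

def char_cross_entropy(probs_per_step, target):
    """Character-level cross-entropy between decoder outputs and the target
    SMILES (toy distributions; a real decoder emits a softmax over the
    SMILES vocabulary at each step)."""
    return -sum(math.log(p[ch]) for p, ch in zip(probs_per_step, target)) / len(target)

# One toy probability distribution per decoding step, for the target "CO".
vocab_probs = [
    {"C": 0.7, "O": 0.2, ")": 0.1},
    {"C": 0.1, "O": 0.8, ")": 0.1},
]
loss = char_cross_entropy(vocab_probs, "CO")
print(round(loss, 4))  # 0.2899, the mean of -ln(0.7) and -ln(0.8)
```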
<p>Four translation tasks were evaluated:</p>
<ol>
<li><strong>Randomized SMILES to canonical SMILES</strong> (best performing)</li>
<li><strong>InChI to canonical SMILES</strong></li>
<li><strong>Canonical SMILES to canonical SMILES</strong> (autoencoding baseline)</li>
<li><strong>Canonical SMILES to InChI</strong> (failed to learn)</li>
</ol>
<p>The final model uses an RNN encoder with 3 stacked GRU layers (512, 1024, and 2048 units). The concatenated cell states pass through a fully connected layer with tanh activation to produce a 512-dimensional latent vector. The decoder mirrors this architecture, initializing its GRU states from the latent vector via separate fully connected layers. Teacher forcing is used during training, and left-to-right beam search is used at inference.</p>
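<p>The dimensions above imply the following bottleneck computation, shown here as an illustrative sketch with random weights (not the trained model):</p>

```python
import numpy as np

# Illustrative sketch of the CDDD bottleneck. The GRU layer sizes
# (512, 1024, 2048) and the 512-dim tanh latent follow the paper;
# the weights here are random placeholders, not the trained model.
rng = np.random.default_rng(0)

state_dims = [512, 1024, 2048]
cell_states = [rng.standard_normal(d) for d in state_dims]

concat = np.concatenate(cell_states)            # shape (3584,)
W = rng.standard_normal((concat.size, 512)) * 0.01
b = np.zeros(512)

latent = np.tanh(concat @ W + b)                # 512-dim descriptor-like vector
print(latent.shape)
```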
<p>An auxiliary property prediction network takes the latent vector as input and predicts nine molecular properties (logP, partial charges, valence electrons, H-bond donors/acceptors, Balaban&rsquo;s J, <a href="https://en.wikipedia.org/wiki/Molar_refractivity">molar refractivity</a>, TPSA). This multi-task signal encourages the latent space to encode physically meaningful information. The full training objective combines the translation cross-entropy loss with the property prediction mean squared error:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{translation}} + \mathcal{L}_{\text{properties}}$$</p>
<p>To ensure invariance to input SMILES representation at inference time, the model uses randomized SMILES as input half the time and canonical SMILES the other half during training. Input dropout (15% at the character level) and Gaussian noise (standard deviation 0.05) are applied for regularization.</p>
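<p>The character-level input dropout can be sketched as follows (the mask token is an illustrative assumption; the paper specifies only the 15% rate, and the Gaussian noise is applied separately to the continuous inputs):</p>

```python
import random

def char_dropout(smiles: str, p: float = 0.15, mask: str = "*", seed: int = 0) -> str:
    """Mask roughly 15% of input characters, as in CDDD's input regularization.
    The mask character itself is an illustrative assumption."""
    rng = random.Random(seed)
    return "".join(mask if rng.random() < p else ch for ch in smiles)

noisy = char_dropout("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(noisy)
```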
<h2 id="qsar-benchmarks-virtual-screening-and-latent-space-exploration">QSAR Benchmarks, Virtual Screening, and Latent Space Exploration</h2>
<h3 id="pretraining">Pretraining</h3>
<p>The model was pretrained on approximately 72 million compounds from ZINC15 and PubChem (merged, deduplicated, filtered for organic molecules with MW 12-600, &gt;3 heavy atoms, logP between -7 and 5). All evaluation compounds were removed from the pretraining set.</p>
<h3 id="qsar-experiments">QSAR Experiments</h3>
<p>Ten QSAR datasets were used, spanning classification (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames mutagenicity</a>, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG inhibition</a>, <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB penetration</a>, BACE inhibition, bee toxicity) and regression (EGFR inhibition, <a href="https://en.wikipedia.org/wiki/Plasmodium_falciparum">Plasmodium falciparum</a> inhibition, lipophilicity, aqueous solubility, melting point). Two datasets (Ames and lipophilicity) served as validation for architecture selection; the remaining eight were held out for final evaluation.</p>
<p>CDDD descriptors with an SVM were benchmarked against:</p>
<ul>
<li>Nine circular fingerprint variants (Morgan fingerprints, radius 1-3, folded to 512/1024/2048 bits) with RF, SVM, and GB</li>
<li>Graph convolution models (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">DeepChem</a>)</li>
</ul>
<p>Both random-split and cluster-split (K-means on MACCS fingerprints, K=5) cross-validation were performed.</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Split</th>
          <th>CDDD + SVM</th>
          <th>Best Fingerprint</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ames (ROC-AUC)</td>
          <td>Random</td>
          <td>0.89</td>
          <td>0.89 (ecfc2, RF)</td>
          <td>0.88</td>
      </tr>
      <tr>
          <td>hERG (ROC-AUC)</td>
          <td>Random</td>
          <td>0.86</td>
          <td>0.85 (ecfc4, RF)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC)</td>
          <td>Random</td>
          <td>0.93</td>
          <td>0.93 (ecfc2, RF)</td>
          <td>0.92</td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC)</td>
          <td>Random</td>
          <td>0.90</td>
          <td>0.91 (ecfc2, RF)</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Bee toxicity (ROC-AUC)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.91 (ecfc6, RF)</td>
          <td>0.89</td>
      </tr>
      <tr>
          <td>Lipophilicity ($r^2$)</td>
          <td>Random</td>
          <td>0.72</td>
          <td>0.69 (ecfc2, SVM)</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>ESOL ($r^2$)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.58 (ecfc6, SVM)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>Melting point ($r^2$)</td>
          <td>Random</td>
          <td>0.42</td>
          <td>0.38 (ecfc2, SVM)</td>
          <td>0.39</td>
      </tr>
  </tbody>
</table>
<p>CDDD descriptors showed competitive or better performance across all tasks. Notably, CDDD achieved substantially higher $r^2$ on aqueous solubility (0.92 vs. 0.58 for the best fingerprint). The authors emphasize that CDDD&rsquo;s feature extraction was fixed based on two validation tasks, while baseline methods selected the best fingerprint/model combination per task, making the comparison conservative for CDDD.</p>
<h3 id="virtual-screening">Virtual Screening</h3>
<p>Ligand-based virtual screening experiments followed the Riniker et al. benchmarking protocol on 40 DUD targets and 17 MUV targets. Five active compounds were randomly selected per target, and remaining compounds were ranked by similarity (cosine similarity for CDDD, <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto</a> for fingerprints). This process was repeated 50 times per target.</p>
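<p>The two similarity measures used for ranking can be written in a few lines (toy vectors for illustration; the active-set fusion and ranking details follow the Riniker et al. protocol):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity, used to rank compounds by CDDD descriptors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity on the 'on' bits of a binary fingerprint."""
    return len(a & b) / len(a | b)

# Toy query: a descriptor derived from the actives vs. a candidate (illustrative values).
actives_mean = [0.2, -0.1, 0.7]
candidate = [0.1, -0.2, 0.9]
print(round(cosine(actives_mean, candidate), 3))  # 0.983
print(tanimoto({1, 4, 9, 16}, {1, 4, 25}))        # 2 shared of 5 total bits -> 0.4
```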
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>CDDD (ROC-AUC)</th>
          <th>Second Best</th>
          <th>p-value (Wilcoxon)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DUD</td>
          <td>0.949</td>
          <td>0.899 (laval)</td>
          <td>$5 \times 10^{-38}$</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>0.679</td>
          <td>0.677 (ap)</td>
          <td>0.04</td>
      </tr>
  </tbody>
</table>
<p>CDDD significantly outperformed all 14 baseline fingerprints on both databases. The DUD improvement was particularly large (+0.050 ROC-AUC over the next best). On MUV, which is designed to be harder, the advantage was smaller but still statistically significant. Importantly, while the best baseline fingerprint varied between DUD and MUV (laval vs. ap), CDDD ranked first on both, demonstrating consistent performance.</p>
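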
<h3 id="latent-space-exploration">Latent Space Exploration</h3>
<p>The continuous, reversible nature of CDDD enables chemical space navigation. Shifting a molecule&rsquo;s embedding along the first principal component of the pretraining data correlates with molecular size (Spearman $r = 0.947$, $p = 0.00048$), while the second principal component correlates with polarity/logP ($r = -0.916$, $p = 0.00015$).</p>
<p>When shifting 1000 compounds along 100 random directions, the model maintained high valid SMILES generation rates (&gt;97% for the top beam search output, &gt;99% when considering the top 3 outputs). Euclidean distance in the descriptor space correlated smoothly with Tanimoto distance in fingerprint space, confirming that the latent space supports meaningful interpolation.</p>
<h2 id="consistent-learned-descriptors-for-chemistry">Consistent Learned Descriptors for Chemistry</h2>
<p>CDDD demonstrated that translation between molecular representations produces more informative latent spaces than autoencoder reconstruction. The key findings are:</p>
<ol>
<li><strong>Translation outperforms reconstruction</strong>: Models trained on translating between different representations consistently produced better downstream descriptors than autoencoding models, despite autoencoding being an easier task.</li>
<li><strong>Auxiliary property prediction helps</strong>: The additional regression task for molecular properties improved descriptor quality, particularly for physicochemical endpoints correlated with the predicted properties.</li>
<li><strong>Consistent performance</strong>: Unlike baseline methods where the best fingerprint varies by task, CDDD showed consistent performance across all QSAR and VS experiments.</li>
<li><strong>Smooth latent space</strong>: The continuous descriptor space supports meaningful interpolation and chemical space exploration with high valid SMILES rates.</li>
</ol>
<p>The authors acknowledge several limitations. The InChI-to-SMILES translation worked but produced inferior descriptors compared to SMILES-to-SMILES, and SMILES-to-InChI translation failed entirely, likely due to InChI&rsquo;s complex syntax (counting, arithmetic). The approach was only tested with string-based representations; translation between conceptually different representations (e.g., 3D structures) remains future work. The QSAR evaluation, while extensive, used relatively standard datasets, and the method&rsquo;s advantage over graph convolution models was modest on tasks where end-to-end learning had sufficient data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15 + PubChem (merged)</td>
          <td>~72M compounds</td>
          <td>Filtered: organic, MW 12-600, &gt;3 heavy atoms, logP -7 to 5</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Ames mutagenicity</td>
          <td>6,130</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Lipophilicity</td>
          <td>3,817</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>hERG, BBBP, BACE, bee toxicity</td>
          <td>188-3,440</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>EGFR, Plasmodium, ESOL, melting point</td>
          <td>184-4,451</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>DUD</td>
          <td>40 targets</td>
          <td>Ligand-based virtual screening</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>MUV</td>
          <td>17 targets</td>
          <td>Maximum unbiased validation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 3 stacked GRU layers (512, 1024, 2048 units) with tanh bottleneck to 512-dim latent space</li>
<li>Decoder: Matching 3 stacked GRU layers, initialized from latent space</li>
<li>Auxiliary classifier: 3 FC layers (512, 128, 9) predicting molecular properties</li>
<li>Optimizer: Adam, initial LR $5 \times 10^{-4}$, decayed by 0.9 every 50,000 steps</li>
<li>Batch size: 64 with bucketing by sequence length</li>
<li>Input regularization: 15% character dropout + Gaussian noise (std 0.05)</li>
<li>Beam search for decoding at inference</li>
</ul>
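<p>The staircase learning-rate schedule above can be written directly (values from the paper; the function name is ours):</p>

```python
def cddd_lr(step: int, base_lr: float = 5e-4, decay: float = 0.9, every: int = 50_000) -> float:
    """Staircase schedule used to train CDDD: multiply the learning rate
    by 0.9 every 50,000 steps."""
    return base_lr * decay ** (step // every)

print(cddd_lr(0))                   # 0.0005
print(round(cddd_lr(100_000), 8))   # 0.000405
```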
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jrwnter/cddd">CDDD (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pretrained model and extraction code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>QSAR: 5-fold random CV and 5-fold cluster CV (K-means on MACCS, K=5)</li>
<li>Classification metric: ROC-AUC</li>
<li>Regression metric: $r^2$</li>
<li>VS: ROC-AUC averaged over 50 random active set selections per target</li>
<li>Statistical test: <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for VS comparisons</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Framework: TensorFlow 1.4.1</li>
<li>CDDD descriptor extraction on GPU is comparable in speed to RDKit fingerprint extraction on CPU</li>
<li>SVM training on 512-dim CDDD descriptors takes seconds (vs. minutes for 2048-dim fingerprints)</li>
<li>Graph convolution training: ~30 minutes per task on GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Montanari, F., Noe, F., &amp; Clevert, D.-A. (2019). Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. <em>Chemical Science</em>, 10(6), 1692-1701. <a href="https://doi.org/10.1039/C8SC04175J">https://doi.org/10.1039/C8SC04175J</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{winter2019learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and Montanari, Floriane and No{\&#39;e}, Frank and Clevert, Djork-Arn{\&#39;e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1692--1701}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C8SC04175J}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BindGPT: GPT for 3D Molecular Design and Docking</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/bindgpt-3d-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/bindgpt-3d-molecular-design/</guid><description>BindGPT applies GPT-style language modeling to 3D molecular generation using SMILES+XYZ tokenization, pre-training, and RL-based docking optimization.</description><content:encoded><![CDATA[<h2 id="a-language-model-for-joint-3d-molecular-graph-and-conformation-generation">A Language Model for Joint 3D Molecular Graph and Conformation Generation</h2>
<p>BindGPT is a <strong>Method</strong> paper that introduces a GPT-based language model for generating 3D molecular structures. The primary contribution is a unified framework that jointly produces molecular graphs (via SMILES) and 3D coordinates (via XYZ tokens) within a single autoregressive model. This eliminates the need for external graph reconstruction tools like OpenBabel, which are error-prone when applied to noisy atom positions. The same pre-trained model serves as a 3D molecular generative model, a conformer generator conditioned on molecular graphs, and a pocket-conditioned 3D molecule generator.</p>
<h2 id="the-graph-reconstruction-problem-in-3d-molecular-generation">The Graph Reconstruction Problem in 3D Molecular Generation</h2>
<p>Most existing 3D molecular generators focus on predicting atom types and positions, relying on supplementary software (e.g., OpenBabel or RDKit) to reconstruct molecular bonds from predicted coordinates. This introduces a fragile dependency: small positional errors can drastically change the reconstructed molecular graph or produce disconnected structures. Additionally, while diffusion models and equivariant GNNs have shown strong results for 3D molecular generation, they often depend on SE(3) equivariance inductive biases and are computationally expensive at sampling time (on the order of $10^6$ seconds to produce 1000 valid molecules for EDM). The pocket-conditioned generation task is further limited by the small size of available 3D binding pose datasets (e.g., CrossDocked), making it difficult for specialized models to generalize without large-scale pre-training.</p>
<h2 id="smilesxyz-tokenization-jointly-encoding-graphs-and-coordinates">SMILES+XYZ Tokenization: Jointly Encoding Graphs and Coordinates</h2>
<p>The core innovation in BindGPT is coupling SMILES notation with XYZ coordinate format in a single token sequence. The sequence starts with a <code>&lt;LIGAND&gt;</code> token, followed by character-level SMILES tokens encoding the molecular graph, then an <code>&lt;XYZ&gt;</code> token marking the transition to coordinate data. Each 3D atom position is encoded using 6 tokens (integer and fractional parts for each of the three coordinates). The atom ordering is synchronized between SMILES and XYZ, so atom symbols from SMILES are not repeated in the coordinate section.</p>
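<p>A sketch of the coordinate tokenization (the 6-tokens-per-atom structure is from the paper; the sign handling and fractional precision here are illustrative assumptions):</p>

```python
def tokenize_xyz(coords, frac_digits=2):
    """Sketch of BindGPT-style coordinate tokenization: each atom position becomes
    6 tokens (integer part and fractional part for each of x, y, z). The sign
    handling and number of fractional digits are illustrative assumptions."""
    tokens = []
    for x, y, z in coords:
        for value in (x, y, z):
            sign = "-" if value < 0 else ""
            whole, frac = divmod(round(abs(value) * 10**frac_digits), 10**frac_digits)
            tokens.append(f"{sign}{whole}")
            tokens.append(f".{frac:0{frac_digits}d}")
    return tokens

toks = tokenize_xyz([(1.25, -0.50, 3.04)])
print(toks)  # ['1', '.25', '-0', '.50', '3', '.04']
```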
<p>For protein pockets, sequences begin with a <code>&lt;POCKET&gt;</code> token followed by atom names and coordinates. Following AlphaFold&rsquo;s approach, only alpha-carbon coordinates are retained to keep pocket representations compact.</p>
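<p>The alpha-carbon-only pocket representation amounts to a simple filter (the input tuple format here is an illustrative assumption, not the paper's exact schema):</p>

```python
def pocket_tokens(atoms):
    """Keep only alpha-carbon (CA) atoms of a pocket, as BindGPT does, so the
    pocket token sequence stays compact. The (atom_name, residue, (x, y, z))
    input format is illustrative."""
    return [(name, res, xyz) for name, res, xyz in atoms if name == "CA"]

atoms = [
    ("N",  "ALA1", (0.0, 0.0, 0.0)),
    ("CA", "ALA1", (1.5, 0.0, 0.0)),
    ("C",  "ALA1", (2.1, 1.3, 0.0)),
    ("CA", "GLY2", (4.9, 1.4, 0.2)),
]
print(pocket_tokens(atoms))  # only the two CA entries remain
```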
<p>The model uses the GPT-NeoX architecture with rotary position embeddings (RoPE), which enables length generalization between pre-training and fine-tuning where sequence lengths differ substantially. The pre-trained model has 108M parameters with 15 layers, 12 attention heads, and a hidden dimension of 768.</p>
<h3 id="pre-training-on-large-scale-3d-data">Pre-training on Large-Scale 3D Data</h3>
<p>Pre-training uses the Uni-Mol dataset containing 208M conformations for 12M molecules and 3.2M protein pocket structures. Each training batch contains either ligand sequences or pocket sequences (not mixed within a sequence). Since pockets are far fewer than ligands, the training schedule runs 5 pocket epochs per ligand epoch, resulting in roughly 8% pocket tokens overall. Training uses large batches of 1.6M tokens per step with Flash Attention and DeepSpeed optimizations.</p>
<h3 id="supervised-fine-tuning-with-augmentation">Supervised Fine-Tuning with Augmentation</h3>
<p>For pocket-conditioned generation, BindGPT is fine-tuned on CrossDocked 2020, which contains aligned pocket-ligand pairs. Unlike prior work that subsamples less than 1% of the best pairs, BindGPT uses all intermediate ligand poses (including lower-quality ones), yielding approximately 27M pocket-ligand pairs. To combat overfitting on the limited diversity (14k unique molecules, 3k pockets), two augmentation strategies are applied:</p>
<ol>
<li><strong><a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES randomization</a></strong>: Each molecule can yield 100-1000 different valid SMILES strings</li>
<li><strong>Random 3D rotation</strong>: The same rotation matrix is applied to both pocket and ligand coordinates</li>
</ol>
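<p>The rotation augmentation hinges on applying one matrix to both structures (a z-axis rotation is shown for brevity; in practice a uniformly random 3D rotation would be sampled):</p>

```python
import math

def rotation_z(theta: float):
    """3x3 rotation matrix about the z-axis (illustrative; any random
    rotation works, z-axis chosen here for brevity)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotate(points, R):
    return [tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3)) for p in points]

# The SAME matrix is applied to pocket and ligand, so their relative pose
# (and hence the binding geometry) is preserved.
R = rotation_z(math.pi / 2)
pocket = [(1.0, 0.0, 0.0)]
ligand = [(0.0, 2.0, 1.0)]
pocket_rot, ligand_rot = rotate(pocket, R), rotate(ligand, R)
```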
<p>During fine-tuning, the pocket token sequence is concatenated before the ligand token sequence. An optional variant conditions on binding energy scores from the CrossDocked dataset, enabling contrastive learning between good and bad binding examples.</p>
<h3 id="reinforcement-learning-with-docking-feedback">Reinforcement Learning with Docking Feedback</h3>
<p>BindGPT applies REINFORCE (not PPO or <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, which were found less stable) to further optimize pocket-conditioned generation. On each RL step, the model generates 3D ligand structures for a batch of random protein pockets, computes binding energy rewards using QVINA docking software, and updates model parameters. A KL-penalty between the current model and the SFT initialization stabilizes training.</p>
<p>The RL objective can be written as:</p>
<p>$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \pi_\theta}\left[ R(x) \right] + \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$$</p>
<p>where $R(x)$ is the docking reward from QVINA and $\beta$ controls the strength of the KL regularization.</p>
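<p>A minimal numeric sketch of this objective (toy values; in practice the gradient flows through the sequence log-probabilities, and the reward sign convention here, a negated QVINA score, is an assumption):</p>

```python
def reinforce_kl_loss(log_probs, rewards, kl, beta=0.1):
    """Sketch of the BindGPT RL objective: REINFORCE on docking rewards plus a
    KL penalty toward the SFT model. All numbers below are illustrative, not
    from the paper."""
    n = len(rewards)
    pg = -sum(r * lp for r, lp in zip(rewards, log_probs)) / n
    return pg + beta * kl

# Toy batch: QVINA scores are negated so that lower binding energy
# (more negative score) yields a higher reward.
vina_scores = [-8.6, -5.4, -7.2]
rewards = [-s for s in vina_scores]
log_probs = [-12.0, -15.0, -13.5]
loss = reinforce_kl_loss(log_probs, rewards, kl=0.02, beta=0.1)
print(round(loss, 3))  # 93.802
```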
<h2 id="experimental-evaluation-across-three-3d-generation-tasks">Experimental Evaluation Across Three 3D Generation Tasks</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations (12M molecules) + 3.2M pockets</td>
          <td>Large-scale 3D molecular dataset</td>
      </tr>
      <tr>
          <td>Fine-tuning (SFT)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>14k molecules x 3k pockets, includes all pose qualities</td>
      </tr>
      <tr>
          <td>Fine-tuning (conformer)</td>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM-DRUGS</a></td>
          <td>27M conformations for 300k molecules</td>
          <td>Standard benchmark for 3D conformer generation</td>
      </tr>
      <tr>
          <td>Evaluation (conformer)</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot evaluation holdout</td>
      </tr>
      <tr>
          <td>Evaluation (pocket)</td>
          <td>CrossDocked holdout</td>
          <td>100 pockets</td>
          <td>Held out from training</td>
      </tr>
  </tbody>
</table>
<h3 id="task-1-3d-molecule-generation-pre-training">Task 1: 3D Molecule Generation (Pre-training)</h3>
<p>Compared against XYZ-Transformer (the only other model capable of large-scale pre-training), BindGPT achieves 98.58% validity (vs. 12.87% for XYZ-TF without hydrogens), higher SA (0.77 vs. 0.21), QED (0.59 vs. 0.30), and Lipinski scores (4.86 vs. 4.79). BindGPT also produces conformations with RMSD of 0.89 (XYZ-TF&rsquo;s RMSD calculation failed to converge). Generation is 12x faster (13s vs. 165s for 1000 molecules).</p>
<h3 id="task-2-3d-molecule-generation-fine-tuned-on-geom-drugs">Task 2: 3D Molecule Generation (Fine-tuned on GEOM-DRUGS)</h3>
<p>Against EDM and MolDiff (diffusion baselines), BindGPT outperforms on nearly all 3D distributional metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>EDM</th>
          <th>MolDiff</th>
          <th>BindGPT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JS bond lengths</td>
          <td>0.246</td>
          <td>0.365</td>
          <td><strong>0.029</strong></td>
      </tr>
      <tr>
          <td>JS bond angles</td>
          <td>0.282</td>
          <td>0.155</td>
          <td><strong>0.075</strong></td>
      </tr>
      <tr>
          <td>JS dihedral angles</td>
          <td>0.328</td>
          <td>0.162</td>
          <td><strong>0.098</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond types</td>
          <td>0.378</td>
          <td>0.163</td>
          <td><strong>0.045</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond pairs</td>
          <td>0.396</td>
          <td>0.136</td>
          <td><strong>0.043</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond triplets</td>
          <td>0.449</td>
          <td>0.125</td>
          <td><strong>0.042</strong></td>
      </tr>
      <tr>
          <td>Time (1000 molecules)</td>
          <td>1.4e6 s</td>
          <td>7500 s</td>
          <td><strong>200 s</strong></td>
      </tr>
  </tbody>
</table>
<p>BindGPT is two orders of magnitude faster than diffusion baselines while producing more accurate 3D geometries. MolDiff achieves better drug-likeness scores (QED, SA), but the authors argue 3D distributional metrics are more relevant for evaluating 3D structure fidelity.</p>
<h3 id="task-3-pocket-conditioned-molecule-generation">Task 3: Pocket-Conditioned Molecule Generation</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score</th>
          <th>SA</th>
          <th>QED</th>
          <th>Lipinski</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-7.15 +/- 4.89</td>
          <td>0.75</td>
          <td>0.57</td>
          <td>4.88</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-7.80 +/- 3.61</td>
          <td>0.58</td>
          <td>0.48</td>
          <td>4.51</td>
      </tr>
      <tr>
          <td>BindGPT-FT</td>
          <td>-5.44 +/- 2.09</td>
          <td>0.78</td>
          <td>0.50</td>
          <td>4.72</td>
      </tr>
      <tr>
          <td>BindGPT-RFT</td>
          <td>-7.24 +/- 1.68</td>
          <td>0.74</td>
          <td>0.48</td>
          <td>4.32</td>
      </tr>
      <tr>
          <td>BindGPT-RL</td>
          <td><strong>-8.60 +/- 1.90</strong></td>
          <td><strong>0.84</strong></td>
          <td>0.43</td>
          <td>4.81</td>
      </tr>
  </tbody>
</table>
<p>The RL-fine-tuned model achieves the best Vina binding scores (-8.60 vs. -7.80 for TargetDiff) with lower variance and the highest SA score (0.84). The SFT-only model (BindGPT-FT) underperforms baselines on binding score, demonstrating that RL is essential for strong pocket-conditioned generation. QED is lower for BindGPT-RL, but the authors note that QED could be included in the RL reward and was excluded for fair comparison.</p>
<h3 id="conformer-generation">Conformer Generation</h3>
<p>On the Platinum dataset (zero-shot), BindGPT matches the performance of Torsional Diffusion (the specialized state of the art) when assisted by RDKit, with a small gap without RDKit assistance. Uni-Mol fails to generalize to this dataset despite being pre-trained on the same Uni-Mol data as BindGPT.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BindGPT demonstrates that a simple autoregressive language model without equivariance inductive biases can match or surpass specialized diffusion models and GNNs across multiple 3D molecular generation tasks. The key findings include:</p>
<ol>
<li><strong>Joint SMILES+XYZ generation eliminates graph reconstruction errors</strong>, achieving 98.58% validity compared to 12.87% for XYZ-Transformer</li>
<li><strong>Large-scale pre-training is critical for pocket-conditioned generation</strong>, as none of the baselines use pre-training and instead rely on heavy inductive biases</li>
<li><strong>RL fine-tuning with docking feedback substantially improves binding affinity</strong> beyond what SFT alone achieves</li>
<li><strong>Sampling is orders of magnitude faster</strong> than diffusion baselines (200 s vs. 1.4M s for EDM, a ~7,000x speedup)</li>
</ol>
<p>Limitations include the relatively modest model size (108M parameters); the authors find it sufficient for the current tasks but do not explore larger scales. The RL optimization uses only the Vina score as reward; multi-objective optimization incorporating SA, QED, and other properties is left as future work. The model also relies on character-level SMILES tokenization rather than more sophisticated chemical tokenizers. Finally, although BindGPT is the first model to explicitly generate hydrogens at scale, validity drops from 98.58% to 77.33% when hydrogens are included.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations, 12M molecules, 3.2M pockets</td>
          <td>From Zhou et al. (2023)</td>
      </tr>
      <tr>
          <td>SFT (pocket)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>Full version including low-quality poses</td>
      </tr>
      <tr>
          <td>SFT (conformer)</td>
          <td>GEOM-DRUGS</td>
          <td>27M conformations, 300k molecules</td>
          <td>Standard benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot holdout</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-NeoX with rotary position embeddings (RoPE)</li>
<li><strong>Pre-training</strong>: Causal language modeling with 1.6M tokens per batch</li>
<li><strong>SFT augmentation</strong>: SMILES randomization + random 3D rotation</li>
<li><strong>RL</strong>: REINFORCE with KL-penalty from SFT initialization; QVINA docking as reward</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Size</strong>: 108M parameters, 15 layers, 12 heads, hidden size 768</li>
<li><strong>Vocabulary</strong>: Character-level SMILES tokens + special tokens (<code>&lt;LIGAND&gt;</code>, <code>&lt;POCKET&gt;</code>, <code>&lt;XYZ&gt;</code>) + coordinate tokens (6 per 3D position)</li>
</ul>
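<p>As an illustration of the coordinate vocabulary, here is one plausible way a 3D position could be split into 6 tokens (two per coordinate: a signed integer part and a two-decimal fraction). The exact split is an assumption for illustration, not the paper&rsquo;s published scheme:</p>

```python
def coord_to_tokens(x: float, y: float, z: float) -> list[str]:
    """Split a 3D position into 6 tokens, two per coordinate:
    a signed integer-part token and a two-decimal fraction token.
    (Illustrative guess at the '6 tokens per position' layout.)"""
    tokens = []
    for v in (x, y, z):
        sign = "-" if v < 0 else ""
        a = abs(v)
        int_part = int(a)
        frac_part = round((a - int_part) * 100)
        if frac_part == 100:  # rounding overflow, e.g. 1.999 -> 2.00
            int_part, frac_part = int_part + 1, 0
        tokens.append(f"{sign}{int_part}")
        tokens.append(f".{frac_part:02d}")
    return tokens

print(coord_to_tokens(1.23, -0.5, 2.0))  # ['1', '.23', '-0', '.50', '2', '.00']
```

<p>Any scheme like this keeps the vocabulary small while letting the autoregressive decoder emit coordinates token by token alongside the SMILES string.</p>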
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Validity, SA, QED, Lipinski</strong>: Standard drug-likeness metrics</li>
<li><strong>Jensen-Shannon divergences</strong>: Distribution-level 3D structural metrics (bond lengths, angles, dihedrals, bond types)</li>
<li><strong>RMSD</strong>: Alignment quality of generated conformations vs. RDKit reference</li>
<li><strong>RMSD-Coverage</strong>: CDF of RMSD between generated and reference conformers</li>
<li><strong>Vina score</strong>: Binding energy from QVINA docking software</li>
</ul>
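<p>The RMSD-Coverage metric can be sketched directly, assuming a precomputed matrix of RMSDs between reference and generated conformers (the function name and matrix layout here are illustrative):</p>

```python
def rmsd_coverage(rmsd, threshold):
    """Fraction of reference conformers matched by at least one
    generated conformer within `threshold`; sweeping the threshold
    traces out the coverage CDF.
    rmsd[i][j] = RMSD between reference i and generated conformer j."""
    matched = sum(1 for row in rmsd if min(row) <= threshold)
    return matched / len(rmsd)

# toy matrix: 3 reference conformers x 2 generated conformers
rmsd = [[0.4, 1.2],
        [0.9, 0.7],
        [2.0, 1.8]]
```

<p>Here <code>rmsd_coverage(rmsd, 1.0)</code> is 2/3: two of the three references have a generated conformer within the threshold.</p>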
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training and fine-tuning use Flash Attention and DeepSpeed for efficiency</li>
<li>Specific GPU counts and training times are described in Appendix G (not available in the main text)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://bindgpt.github.io/">Project Page</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Project website with additional details</td>
      </tr>
  </tbody>
</table>
<p>No public code repository or pre-trained model weights were identified; as of this writing, the project website exists but no source code has been released.</p>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The paper provides detailed architecture specs and hyperparameters, but no public code or model weights are available. All training datasets (Uni-Mol, CrossDocked, GEOM-DRUGS) are publicly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zholus, A., Kuznetsov, M., Schutski, R., Shayakhmetov, R., Polykovskiy, D., Chandar, S., &amp; Zhavoronkov, A. (2025). BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(24), 26083-26091. <a href="https://doi.org/10.1609/aaai.v39i24.34804">https://doi.org/10.1609/aaai.v39i24.34804</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zholus2025bindgpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zholus, Artem and Kuznetsov, Maksim and Schutski, Roman and Shayakhmetov, Rim and Polykovskiy, Daniil and Chandar, Sarath and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{26083--26091}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i24.34804}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Augmented Hill-Climb for RL-Based Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</guid><description>Augmented Hill-Climb combines REINVENT and Hill-Climb RL strategies to improve sample efficiency ~45-fold for SMILES-based de novo molecule generation.</description><content:encoded><![CDATA[<h2 id="a-hybrid-rl-strategy-for-de-novo-molecule-generation">A Hybrid RL Strategy for De Novo Molecule Generation</h2>
<p>This is a <strong>Method</strong> paper that proposes Augmented Hill-Climb (AHC), a reinforcement learning strategy for conditioning SMILES-based language models during de novo molecule generation. The primary contribution is a simple hybrid between the <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and Hill-Climb (HC) RL strategies that computes the REINVENT loss function only on the top-k highest-scoring molecules per batch (as in HC), thereby removing the counterproductive regularization effect of low-scoring molecules. The authors demonstrate that AHC improves optimization ability ~1.5-fold and sample efficiency ~45-fold compared to REINVENT across docking tasks against four <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> targets, and that the approach generalizes to transformer architectures.</p>
<h2 id="sample-efficiency-bottleneck-in-rl-guided-molecular-generation">Sample Efficiency Bottleneck in RL-Guided Molecular Generation</h2>
<p>Recurrent neural networks trained on SMILES have become a standard approach for de novo molecule generation, with RL strategies like REINVENT and Hill-Climb achieving top performance on benchmarks such as <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>. However, RL-guided generation can be highly <a href="/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/">sample-inefficient</a>, often requiring $10^5$ or more molecules to optimize complex objectives. This is acceptable for cheap scoring functions (e.g., QSAR models, property calculators) but becomes a practical bottleneck when using computationally expensive scoring functions like molecular docking or computer-aided synthesis planning.</p>
<p>The REINVENT strategy regularizes the agent by computing a loss based on the difference between the agent&rsquo;s policy and an &ldquo;augmented likelihood&rdquo; that combines the prior policy with a scaled reward. When low-scoring molecules are sampled ($R_T \approx 0$), the augmented likelihood reduces to the prior likelihood, causing the agent to drift back toward the prior policy. This negates useful learning, especially early in training or when the objective is difficult. Meanwhile, Hill-Climb simply fine-tunes the RNN on the top-k molecules per batch, which is sample-efficient but lacks explicit regularization, leading to mode collapse and generation of invalid SMILES.</p>
<p>Previous work by Neil et al. compared RL strategies but did not clearly quantify sample-efficiency differences, and modifications to the REINVENT loss function by Fialkova et al. showed no significant improvement. The best agent reminder (BAR) mechanism offered modest gains but was originally tested on graph-based models.</p>
<h2 id="core-innovation-filtering-low-scoring-molecules-from-the-reinvent-loss">Core Innovation: Filtering Low-Scoring Molecules from the REINVENT Loss</h2>
<p>Augmented Hill-Climb combines the loss formulation of REINVENT with the top-k selection mechanism of Hill-Climb. The agent samples a batch of molecules, ranks them by reward, and computes the REINVENT loss only on the top-k molecules. This removes the counterproductive regularization caused by low-scoring molecules while retaining the prior-based regularization for high-scoring molecules.</p>
<p>The REINVENT loss defines an augmented likelihood:</p>
<p>$$
\log P_{\mathbb{U}}(A) = \log P_{prior}(A) + \sigma R_T
$$</p>
<p>where $\sigma$ is a scaling coefficient controlling the reward contribution. The agent loss is the squared difference between the augmented likelihood and the agent&rsquo;s log-likelihood:</p>
<p>$$
L(\theta) = \left[\log P_{\mathbb{U}}(A) - \log P_{agent}(A)\right]^2
$$</p>
<p>In standard REINVENT, this loss is computed over all molecules in the batch. When $R_T \approx 0$, the augmented likelihood collapses to the prior likelihood, pushing the agent back toward the prior. AHC avoids this by computing the loss only on the top-k molecules ranked by reward, exactly as Hill-Climb selects molecules for fine-tuning.</p>
<p>The key insight is that high-scoring molecules are still regularized by the prior component of the augmented likelihood ($\log P_{prior}(A)$), preventing catastrophic forgetting. Low-scoring molecules, which would otherwise pull the agent back toward the prior, are simply excluded from the loss computation.</p>
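<p>The resulting update is short enough to sketch in full, assuming per-molecule agent and prior log-likelihoods and rewards are already computed (variable names and the top-k fraction are illustrative, not the authors&rsquo; exact code):</p>

```python
def ahc_loss(logp_agent, logp_prior, rewards, sigma=60.0, topk_frac=0.5):
    """Augmented Hill-Climb loss: rank the batch by reward, keep the
    top-k (as in Hill-Climb), and apply the REINVENT squared-difference
    loss between augmented and agent likelihoods to those only."""
    k = max(1, int(len(rewards) * topk_frac))
    # indices of the top-k molecules by reward
    top = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)[:k]
    loss = 0.0
    for i in top:
        augmented = logp_prior[i] + sigma * rewards[i]  # log P_U(A)
        loss += (augmented - logp_agent[i]) ** 2        # REINVENT loss term
    return loss / k
```

<p>Setting <code>topk_frac=1.0</code> recovers the standard REINVENT loss over the full batch.</p>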
<h3 id="diversity-filters-to-prevent-mode-collapse">Diversity Filters to Prevent Mode Collapse</h3>
<p>AHC is more susceptible to mode collapse than REINVENT because it focuses learning on high-scoring molecules. The authors address this with diversity filters (DFs) that penalize the reward of molecules similar to previously generated ones. Through a hyperparameter search over 825 configurations on three GuacaMol tasks, they identify an optimal DF configuration (DF2) with:</p>
<ul>
<li>Minimum score threshold of 0.5 (lower than DF1&rsquo;s 0.8)</li>
<li>Linear penalization output mode (softer than binary)</li>
<li>Bin size of 50 (larger than DF1&rsquo;s 25)</li>
<li>Scaffold similarity based on ECFP4 fingerprints</li>
</ul>
<p>The authors find that stricter DFs (lower thresholds, smaller bins) better prevent mode collapse but reduce optimization performance, while more lenient DFs enable better learning of chemotype-reward associations. DF2 represents a compromise.</p>
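<p>A minimal sketch of such a scaffold-memory diversity filter with DF2-like settings (score threshold, linear penalization, bin size). The bookkeeping and penalty schedule here are assumptions for illustration, not the authors&rsquo; exact implementation:</p>

```python
class DiversityFilter:
    """Scaffold-memory diversity filter (sketch): molecules scoring
    above `min_score` fill a per-scaffold bin; as a bin fills toward
    `bin_size`, rewards for that scaffold are scaled down linearly."""
    def __init__(self, min_score=0.5, bin_size=50):
        self.min_score = min_score
        self.bin_size = bin_size
        self.bins = {}  # scaffold -> count of high-scoring occurrences

    def penalize(self, scaffold, score):
        if score < self.min_score:
            return score  # low scorers neither fill bins nor get penalized
        n = self.bins.get(scaffold, 0)
        self.bins[scaffold] = n + 1
        # linear output mode: fade the reward as the bin fills
        return score * max(0.0, 1.0 - n / self.bin_size)
```

<p>Lowering <code>min_score</code> or shrinking <code>bin_size</code> makes the filter stricter, matching the trade-off the hyperparameter search explored.</p>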
<h2 id="experimental-setup-docking-tasks-and-benchmark-comparisons">Experimental Setup: Docking Tasks and Benchmark Comparisons</h2>
<p>The evaluation spans five experiments:</p>
<p><strong>Experiment 1</strong>: AHC vs. REINVENT on DRD2 docking over 100 RL updates (6,400 samples), varying $\sigma$ from 30 to 240. RNN trained on the MOSESn dataset (MOSES with neutralized charges, 2.45M molecules).</p>
<p><strong>Experiment 2</strong>: AHC + DF2 vs. REINVENT on four GPCR targets (DRD2, OPRM1, AGTR1, OX1R) over 500 RL updates. Docking performed with Glide-SP after ligand preparation with LigPrep.</p>
<p><strong>Experiment 3</strong>: Diversity filter hyperparameter search (825 configurations) on three GuacaMol tasks (<a href="https://en.wikipedia.org/wiki/Aripiprazole">Aripiprazole</a> similarity, C11H24 isomers, <a href="https://en.wikipedia.org/wiki/Osimertinib">Osimertinib</a> MPO) using the GuacaMol training set (1.27M molecules from ChEMBL24).</p>
<p><strong>Experiment 4</strong>: Benchmark of AHC against REINFORCE, REINVENT (v1 and v2), BAR, and Hill-Climb (with and without KL regularization) on six tasks of varying difficulty:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Difficulty</th>
          <th>Objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heavy atoms</td>
          <td>Easy</td>
          <td>Maximize number of heavy atoms</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Risperidone">Risperidone</a> similarity</td>
          <td>Easy</td>
          <td>Maximize Tanimoto similarity to Risperidone</td>
      </tr>
      <tr>
          <td>DRD2 activity</td>
          <td>Medium</td>
          <td>Maximize QSAR-predicted DRD2 activity</td>
      </tr>
      <tr>
          <td>DRD2 docking</td>
          <td>Medium</td>
          <td>Minimize Glide-SP docking score</td>
      </tr>
      <tr>
          <td>DRD2-DRD3 dual</td>
          <td>Hard</td>
          <td>Maximize predicted activity against both targets</td>
      </tr>
      <tr>
          <td>DRD2/DRD3 selective</td>
          <td>Hard</td>
          <td>Maximize selective DRD2 activity over DRD3</td>
      </tr>
  </tbody>
</table>
<p><strong>Experiment 5</strong>: AHC vs. REINVENT on transformer (Tr) and gated transformer (GTr) architectures on the same six benchmark tasks. The GTr implements a GRU-style gate in place of residual connections to stabilize RL training.</p>
<h3 id="rnn-and-transformer-architectures">RNN and Transformer Architectures</h3>
<p>Three RNN configurations were used: (1) embedding 128 + 3 GRU layers of 512 (REINVENT v1), (2) embedding 256 + 3 LSTM layers of 512 (REINVENT 2.0), (3) 3 LSTM layers of 512 with dropout 0.2 (GuacaMol). Transformers used 4 encoder layers with hidden dimension 512, 8 attention heads, and feed-forward dimension 1024.</p>
<p>QSAR models for DRD2 and DRD3 activity were random forest classifiers trained on ExCAPE-DB data with GHOST threshold identification for handling class imbalance.</p>
<h2 id="key-findings-45-fold-sample-efficiency-improvement">Key Findings: 45-Fold Sample Efficiency Improvement</h2>
<h3 id="experiment-1-ahc-consistently-outperforms-reinvent">Experiment 1: AHC Consistently Outperforms REINVENT</h3>
<p>AHC improved optimization ability by 1.39-fold over REINVENT averaged across all $\sigma$ values, with maximum optimization of 205% at $\sigma = 240$ (compared to 128% for REINVENT). AHC required ~80 fewer RL steps to match REINVENT&rsquo;s mean docking score at 100 steps. With DF1 applied, the improvement was 1.45-fold.</p>
<p>AHC showed greater sensitivity to $\sigma$, giving practitioners more control over the regularization-optimization trade-off. At $\sigma = 60$ (heavily regularized), AHC still improved 1.47-fold over REINVENT while maintaining property space defined by the MOSESn training set. At higher $\sigma$ values, AHC extrapolated further outside the training distribution, which can be favorable (novel chemical space) or unfavorable (scoring function exploitation, e.g., larger molecules getting better docking scores due to the additive nature of scoring functions).</p>
<h3 id="experiment-2-improvement-across-four-gpcr-targets">Experiment 2: Improvement Across Four GPCR Targets</h3>
<p>Across DRD2, OPRM1, AGTR1, and OX1R, AHC + DF2 required on average 7.4-fold fewer training steps and 45.5-fold fewer samples to reach optimization thresholds. The improvement was largest early in training: 19.8-fold fewer steps to reach 120% optimization, and 71.8-fold fewer samples to first produce a molecule exceeding 160% optimization.</p>
<p>AHC + DF2 surpassed the 80% retrospective precision threshold within 100 RL updates for all targets except the challenging OX1R. DF2 successfully stabilized learning, avoiding the convergence-to-threshold failure mode observed with DF1.</p>
<p>Scaffold analysis showed AHC generates similar chemistry to REINVENT. The top 500 scaffolds produced by REINVENT were also generated by AHC, but typically much sooner.</p>
<h3 id="experiment-4-benchmark-against-all-rl-strategies">Experiment 4: Benchmark Against All RL Strategies</h3>
<p>AHC outperformed all other RL strategies on all six benchmark tasks except maximizing heavy atoms (an extrapolation task of limited practical relevance). AHC was particularly superior during early-stage optimization and for harder objectives (dual activity, selective activity).</p>
<p>Hill-Climb with a smaller batch size (HC*) showed improved early-stage sample efficiency similar to AHC, but rapidly underwent mode collapse. KL regularization did not rescue mode collapse in any case and sometimes worsened performance. BAR performed poorly in most tasks, possibly because the best-agent memory acts as a second regularizer that inhibits learning.</p>
<p>In terms of wall time for the DRD2 docking task, AHC reached 140% optimization in 16 CPU hours vs. 202 CPU hours for <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 2.0</a>. AHC was the only strategy to reach 200% optimization within the allotted time (216 CPU hours). Parallelized over 10 CPUs, this corresponds to ~21.6 hours, making docking-guided generation feasible on local machines.</p>
<h3 id="experiment-5-generalization-to-transformers">Experiment 5: Generalization to Transformers</h3>
<p>AHC outperformed REINVENT on both the standard transformer and the gated transformer architectures. The standard transformer was unstable under RL, readily undergoing mode collapse. The gated transformer (with GRU-style gating replacing residual connections) stabilized RL training. AHC&rsquo;s efficiency gains generalized to both architectures.</p>
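<p>The GRU-style gate can be sketched in scalar form; it combines a sublayer&rsquo;s input $x$ and output $y$ where a vanilla transformer would compute the residual $x + y$. The fixed weights (and the omission of the reset gate) are simplifications for illustration; the real layer is vector-valued with learned matrices:</p>

```python
import math

def gru_gate(x, y, w_z=1.0, u_z=1.0, b_z=2.0, w=1.0, u=1.0):
    """GRU-style gate replacing the residual connection x + y
    (scalar sketch with fixed, illustrative weights)."""
    z = 1.0 / (1.0 + math.exp(-(w_z * y + u_z * x - b_z)))  # update gate
    h = math.tanh(w * y + u * x)                            # candidate value
    return (1.0 - z) * x + z * h                            # gated mixture
```

<p>A positive bias like <code>b_z</code> is commonly used to initialize the gate near the identity map (output close to $x$), which is the usual rationale for why gating stabilizes early RL training.</p>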
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Chemistry quality evaluation is complicated by the interaction between RL strategy and scoring function suitability. Greater optimization may lead to unreasonable chemistry due to scoring function exploitation rather than the RL strategy itself.</li>
<li>The diversity filter hyperparameter search was conducted on GuacaMol toy tasks, which may not fully transfer to docking-based objectives.</li>
<li>The docking scoring function was system-dependent: DRD2 and OPRM1 were optimized effectively, while AGTR1 and OX1R proved more challenging (especially AGTR1, where the docking algorithm targeted the wrong sub-pocket).</li>
<li>KL regularization proved ineffective for HC and REINFORCE, suggesting it is not a sufficient regularization method in this context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN pretraining</td>
          <td>MOSESn (MOSES neutralized)</td>
          <td>2,454,087 molecules</td>
          <td>ZINC15 clean leads with neutralized charges</td>
      </tr>
      <tr>
          <td>RNN pretraining</td>
          <td>GuacaMol train</td>
          <td>1,273,104 molecules</td>
          <td>ChEMBL24 with property filters</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD2)</td>
          <td>4,609 actives / 343,026 inactives</td>
          <td>Random forest with GHOST thresholds</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD3)</td>
          <td>2,758 actives / 402,524 inactives</td>
          <td>Unique subsets for dual/selective tasks</td>
      </tr>
      <tr>
          <td>DF parameter search</td>
          <td>GuacaMol benchmark tasks</td>
          <td>3 tasks</td>
          <td>825 configurations tested</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>AHC</strong>: REINVENT loss computed on top-k molecules per batch, ranked by reward</li>
<li><strong>Baselines</strong>: REINFORCE, REINVENT (v1, v2), BAR, Hill-Climb, Hill-Climb + KL regularization</li>
<li><strong>Hyperparameters</strong>: Default values from each original publication (listed in Supplementary Table S3)</li>
<li><strong>Docking</strong>: Glide-SP with Schrodinger Protein Preparation Wizard, LigPrep for ligand preparation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RNNs</strong>: 3 configurations (GRU/LSTM, 512 hidden units, trained 5-10 epochs)</li>
<li><strong>Transformer</strong>: 4 encoder layers, 512 hidden dim, 8 heads, 1024 FFN dim</li>
<li><strong>Gated Transformer</strong>: Same architecture with GRU-style gating replacing residual connections</li>
<li><strong>QSAR</strong>: Random forest classifiers (100 estimators, max depth 15, min leaf 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AHC + DF2</th>
          <th>REINVENT</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization fold-improvement</td>
          <td>1.45x</td>
          <td>baseline</td>
          <td>DRD2 docking, averaged across sigma values</td>
      </tr>
      <tr>
          <td>Sample efficiency</td>
          <td>45.5x fewer samples</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>Step efficiency</td>
          <td>7.4x fewer steps</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>CPU hours to 140% (DRD2 docking)</td>
          <td>16h</td>
          <td>202h (REINVENT 2.0)</td>
          <td>AMD Threadripper 1920 + RTX 2060 Super</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>AMD Threadripper 1920 CPU</li>
<li>Nvidia GeForce RTX 2060 Super GPU</li>
<li>DRD2 docking benchmark: 216 CPU hours for AHC to reach 200% optimization (~21.6h parallelized over 10 CPUs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/SMILES-RNN">SMILES-RNN</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>RNN and transformer generative model code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/molscore-scoring-benchmarking-framework/">Scoring function platform</a></td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.6084/m9.figshare.19591024.v1">Figshare datasets</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Supporting data (published under same license as paper)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. <em>Journal of Cheminformatics</em>, 14, 68.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2022augmented,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00646-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Atom-in-SMILES: Better Tokens for Chemical Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</guid><description>Atom-in-SMILES replaces generic SMILES tokens with environment-aware atomic tokens, reducing token degeneration and improving chemical translation accuracy.</description><content:encoded><![CDATA[<h2 id="a-new-tokenization-method-for-chemical-language-models">A New Tokenization Method for Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom-in-SMILES (AIS), a tokenization scheme for SMILES strings that replaces generic atomic tokens with environment-aware tokens encoding each atom&rsquo;s local chemical neighborhood. The primary contribution is demonstrating that tokenization quality has a significant impact on chemical language model outcomes across multiple tasks: SMILES canonicalization, <a href="/notes/chemistry/molecular-design/reaction-prediction/">single-step retrosynthesis</a>, and <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a>.</p>
<h2 id="why-standard-smiles-tokenization-falls-short">Why Standard SMILES Tokenization Falls Short</h2>
<p>Standard atom-wise SMILES tokenization treats all atoms of the same element identically. Every carbon is tokenized as &ldquo;C&rdquo; regardless of whether it is part of an aromatic ring, a carbonyl group, or a methyl chain. This creates a highly degenerate token space where chemically distinct atoms share the same representation.</p>
<p>The authors draw an analogy between natural language and chemical language. A typical SMILES sequence is about three times longer than a natural language sentence, yet the token vocabulary is roughly 1000 times smaller. This mismatch leads to extreme token repetition: the same tokens (C, c, N, O) appear many times within a single sequence. In natural language processing, token degeneration (where models repeatedly predict the same token) is a known failure mode of autoregressive decoders. The repetitive nature of SMILES tokens exacerbates this problem in chemical language models.</p>
<p>SMILES also lacks a one-to-one correspondence between tokens and chemical meaning. Two molecules that differ in only one atom substitution (e.g., swapping a carbon for a nitrogen in a ring) produce identical token sets under atom-wise tokenization, making it harder for models to distinguish structurally similar molecules.</p>
<h2 id="core-innovation-encoding-atom-environments-into-tokens">Core Innovation: Encoding Atom Environments into Tokens</h2>
<p>The key insight is to replace each atomic token with a richer token that encodes the atom&rsquo;s local chemical environment, inspired by the <a href="https://en.wikipedia.org/wiki/Atoms_in_molecules">atoms-in-molecules (AIM)</a> concept from quantum chemistry. For a given SMILES string, the AIS mapping function $f$ operates on the token space:</p>
<p>$$
f(X) = \begin{cases} AE|_{X_{\text{central}}} &amp; \text{if } X \text{ is an atom} \\ X &amp; \text{otherwise} \end{cases}
$$</p>
<p>where $AE|_{X_{\text{central}}}$ denotes the atomic environment centered on atom $X$. Non-atomic tokens (brackets, bond symbols, ring closures) pass through unchanged.</p>
<p>Each AIS token is formatted as <code>[Sym;Ring;Neighbors]</code> where:</p>
<ul>
<li><strong>Sym</strong> is the atomic symbol with chirality, aromaticity (lowercase for aromatic), hydrogen count, and formal charge</li>
<li><strong>Ring</strong> indicates whether the atom is in a ring (<code>R</code>) or not (<code>!R</code>)</li>
<li><strong>Neighbors</strong> lists the neighboring atoms interacting with the central atom</li>
</ul>
<p>This mapping is bijective: SMILES strings can be fully recovered from AIS strings via an inverse projection. The algorithm iterates over atoms in a molecule, computes their local environments using RDKit, and produces environment-aware token variants.</p>
<p>As a concrete example, in glycine the two carbons and two oxygens are indistinguishable under atom-wise tokenization. Under AIS, each receives a unique token reflecting its bonding environment (e.g., the carboxyl carbon is distinguished from the alpha carbon).</p>
<p>The AIS tokenization also exhibits a fingerprint-like property. Because each token encodes local structural information, the set of AIS tokens for a molecule functions similarly to circular fingerprints like ECFP2. The authors show that pairwise <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarities</a> computed from AIS token sets have resolution comparable to ECFP2 and HashAP fingerprints, and better resolution than MACCS, Avalon, and RDKit fingerprints.</p>
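<p>The Tanimoto similarity over token sets is simply the Jaccard index; a minimal sketch:</p>

```python
# Jaccard/Tanimoto similarity over AIS token sets. The token set acts
# like a structural fingerprint because each token encodes a local
# atomic environment.
def tanimoto(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two molecules sharing one of three distinct environment tokens:
sim = tanimoto(["[C;!R;CO]", "[O;!R;C]"], ["[C;!R;CO]", "[N;!R;C]"])  # 1/3
```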
<p>Token repetition can be quantified as:</p>
<p>$$
\text{rep-}l = \sum_{t=1}^{|s|} \mathbb{1}[s_t \in s_{t-w-1:t-1}]
$$</p>
<p>where $s$ is the predicted sequence, $|s|$ is the token count, and $w$ is the window size. AIS tokens exhibit consistently lower normalized repetition rates compared to SMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> across diverse molecular datasets (drugs, natural products, steroids, lipids, metal complexes, octane isomers).</p>
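<p>A direct reading of the rep-$l$ definition (a 0-indexed sketch; the exact window boundary convention is an assumption):</p>

```python
# rep-l degeneration statistic: count positions whose token already
# appears in the preceding window of w tokens. Normalizing by sequence
# length gives the repetition rates compared across notations.
def rep_l(tokens, w):
    count = 0
    for t, tok in enumerate(tokens):
        window = tokens[max(0, t - w):t]  # up to w preceding tokens
        if tok in window:
            count += 1
    return count

rate = rep_l(["C", "C", "O"], 4) / 3  # one repeated token out of three
```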
<h2 id="experimental-evaluation-across-three-chemical-tasks">Experimental Evaluation Across Three Chemical Tasks</h2>
<h3 id="input-output-equivalent-mapping-smiles-canonicalization">Input-Output Equivalent Mapping (SMILES Canonicalization)</h3>
<p>The first task tests whether a model can translate non-canonical SMILES enumerations into canonical form. The authors constructed deliberately challenging datasets from <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> subsets with cumulative structural constraints (no cyclic heteroatom-heteroatom bonds, stable functional groups only, fragment-like, scaffold-like, etc.), generating training sets of 1M molecules augmented with 150K molecules from the most restrictive subset at 10x, 30x, and 50x augmentation levels.</p>
<table>
  <thead>
      <tr>
          <th>GDB-13 Subset</th>
          <th>Atom-wise (x10)</th>
          <th>Atom-wise (x50)</th>
          <th>AIS (x10)</th>
          <th>AIS (x50)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ab</td>
          <td>34.2%</td>
          <td>33.2%</td>
          <td>37.3%</td>
          <td>34.1%</td>
      </tr>
      <tr>
          <td>abc</td>
          <td>31.0%</td>
          <td>29.6%</td>
          <td>33.7%</td>
          <td>30.4%</td>
      </tr>
      <tr>
          <td>abcde</td>
          <td>48.7%</td>
          <td>45.5%</td>
          <td>53.6%</td>
          <td>47.0%</td>
      </tr>
      <tr>
          <td>abcdef</td>
          <td>41.8%</td>
          <td>39.1%</td>
          <td>52.5%</td>
          <td>46.9%</td>
      </tr>
      <tr>
          <td>abcdefg</td>
          <td>50.9%</td>
          <td>50.0%</td>
          <td>59.9%</td>
          <td>56.8%</td>
      </tr>
  </tbody>
</table>
<p>AIS outperformed atom-wise tokenization on all subsets and augmentation levels. The performance gap widened for the more restrictive (more structurally similar) subsets, reaching 10.7 percentage points on the abcdef subset (52.5% vs. 41.8% at 10x augmentation). This suggests AIS is particularly effective when molecules are structurally similar and therefore harder to distinguish.</p>
<h3 id="single-step-retrosynthesis">Single-Step Retrosynthesis</h3>
<p>The second task uses the USPTO-50K benchmark for single-step <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthetic prediction</a> via a template-free transformer encoder-decoder model. The model was trained for 200,000 steps with Adam optimizer, negative log-likelihood loss, and cyclic learning rate scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Tokenization</th>
          <th>rep-l(Pred) - rep-l(GT) &gt;= 2 (count)</th>
          <th>String Exact (%)</th>
          <th>Tc Exact (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Atom-wise baseline</td>
          <td>&ndash;</td>
          <td>42.00</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Atom-wise (reproduced)</td>
          <td>801</td>
          <td>42.05</td>
          <td>44.72</td>
      </tr>
      <tr>
          <td>SmilesPE</td>
          <td>821</td>
          <td>19.82</td>
          <td>22.74</td>
      </tr>
      <tr>
          <td>SELFIES</td>
          <td>886</td>
          <td>28.82</td>
          <td>30.76</td>
      </tr>
      <tr>
          <td>DeepSMILES</td>
          <td>902</td>
          <td>38.63</td>
          <td>41.20</td>
      </tr>
      <tr>
          <td><strong>Atom-in-SMILES</strong></td>
          <td><strong>727</strong></td>
          <td><strong>46.32</strong></td>
          <td><strong>47.62</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved 46.32% string exact accuracy (4.3 percentage points above the atom-wise baseline) and 47.62% Tanimoto exact accuracy (2.9 points above). AIS also produced the fewest degenerate token repetitions (727 vs. 801 for atom-wise), roughly a 10% reduction. DeepSMILES had the highest repetition count (902) despite reasonable overall accuracy. SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SmilesPE</a> both performed substantially worse than the atom-wise baseline on this task.</p>
<p>The authors identified six common token repetition patterns in retrosynthetic predictions: long head repetitions, long tail repetitions, repetitive rings, repetitive chains, and halogen repetitions on aliphatic and on aromatic carbons (the last two counted as separate patterns).</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The third task evaluates tokenization schemes on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks using Random Forest models with 5-fold cross-validation. AIS tokens were converted to fingerprint-like feature vectors.</p>
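<p>One way to realize the token-to-feature conversion (an illustrative sketch; the token strings and vectorization details are assumptions, not the paper's exact pipeline):</p>

```python
# Turn per-molecule AIS token lists into count vectors that a Random
# Forest (or any tabular model) can consume.
def build_vocab(token_lists):
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}

def featurize(tokens, vocab):
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:          # unseen tokens are dropped
            vec[vocab[tok]] += 1
    return vec

mols = [["[C;!R;CO]", "[O;!R;C]"], ["[C;!R;CO]", "[N;!R;C]"]]
vocab = build_vocab(mols)
X = [featurize(toks, vocab) for toks in mols]
```

<p>The resulting matrix <code>X</code> would then be passed to the Random Forest under 5-fold cross-validation.</p>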
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>SMILES</th>
          <th>DeepSMILES</th>
          <th>SELFIES</th>
          <th>SmilesPE</th>
          <th>AIS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Regression (RMSE, lower is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>0.628</td>
          <td>0.631</td>
          <td>0.675</td>
          <td>0.689</td>
          <td><strong>0.553</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>0.545</td>
          <td>0.544</td>
          <td>0.564</td>
          <td>0.761</td>
          <td><strong>0.441</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>0.924</td>
          <td>0.895</td>
          <td>0.938</td>
          <td>0.800</td>
          <td><strong>0.683</strong></td>
      </tr>
      <tr>
          <td><strong>Classification (ROC-AUC, higher is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>0.758</td>
          <td>0.777</td>
          <td>0.799</td>
          <td>0.847</td>
          <td><strong>0.885</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>0.740</td>
          <td>0.774</td>
          <td>0.746</td>
          <td>0.837</td>
          <td><strong>0.835</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>0.649</td>
          <td>0.648</td>
          <td>0.653</td>
          <td>0.739</td>
          <td><strong>0.729</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved the best performance on all three regression datasets and two of three classification datasets. On ESOL, the RMSE improvement over standard SMILES was 12%. On lipophilicity, the improvement was 26%.</p>
<h2 id="key-findings-better-tokens-yield-better-chemical-models">Key Findings: Better Tokens Yield Better Chemical Models</h2>
<p>The main findings of this work are:</p>
<ol>
<li>
<p><strong>Tokenization significantly impacts chemical language model quality.</strong> The choice of tokenization scheme can change prediction accuracy by over 10 percentage points on equivalent mapping tasks.</p>
</li>
<li>
<p><strong>AIS reduces token degeneration by approximately 10%</strong> compared to atom-wise SMILES tokenization, with consistently lower normalized repetition rates across diverse molecular datasets.</p>
</li>
<li>
<p><strong>AIS outperforms all compared tokenization schemes</strong> (atom-wise SMILES, SmilesPE, SELFIES, DeepSMILES) on canonicalization, retrosynthesis, and property prediction.</p>
</li>
<li>
<p><strong>The fingerprint-like nature of AIS tokens</strong> enables direct use as molecular features for property prediction and provides resolution comparable to established circular fingerprints.</p>
</li>
<li>
<p><strong>The mapping is invertible</strong>, so AIS strings can always be converted back to valid SMILES. This is a practical advantage over approaches that may lose structural information.</p>
</li>
</ol>
<p><strong>Limitations</strong>: AIS cannot distinguish environmentally identical substructures or atoms related by a molecular symmetry plane, since it only considers nearest-neighbor environments. Performance on long-chain molecules (e.g., lipids) is similar across all tokenization schemes, suggesting that local environment encoding is less informative for repetitive linear structures.</p>
<p><strong>Future directions</strong>: The authors suggest AIS has potential for broader adoption in molecular generative models, chemical translation, and property prediction tasks across the cheminformatics community.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonicalization training</td>
          <td>GDB-13 subsets</td>
          <td>1M + 150K augmented</td>
          <td>Cumulative structural constraints a-h</td>
      </tr>
      <tr>
          <td>Canonicalization testing</td>
          <td>GDB-13 disjoint test sets</td>
          <td>20K per subset</td>
          <td>Various restriction levels</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50K</td>
          <td>~50K reactions</td>
          <td>Sequences &gt; 150 tokens removed</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipophilicity, BBBP, BACE, HIV)</td>
          <td>Varies</td>
          <td>Standard benchmark splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder architecture for canonicalization and retrosynthesis tasks</li>
<li>200,000 training steps with Adam optimizer, negative log-likelihood loss, cyclic learning rate scheduler</li>
<li>Random Forest with 5-fold cross-validation for property prediction</li>
<li>AIS tokenization implemented via RDKit for atom environment extraction</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>String exact match (%)</td>
          <td>Canonicalization, Retrosynthesis</td>
          <td>Exact SMILES match</td>
      </tr>
      <tr>
          <td>Tanimoto exactness (Tc)</td>
          <td>Retrosynthesis</td>
          <td>Morgan FP radius 3, 2048 bits</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression property prediction</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification property prediction</td>
          <td>BBBP, BACE, HIV</td>
      </tr>
      <tr>
          <td>rep-l</td>
          <td>Token degeneration</td>
          <td>Single-token repetition count</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/snu-lcbc/atom-in-SMILES">atom-in-SMILES</a></td>
          <td>Code</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>AIS tokenization implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ucak, U. V., Ashyrmamatov, I., &amp; Lee, J. (2023). Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. <em>Journal of Cheminformatics</em>, 15, 55. <a href="https://doi.org/10.1186/s13321-023-00725-9">https://doi.org/10.1186/s13321-023-00725-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ucak2023improving,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00725-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AlphaDrug: MCTS-Guided Target-Specific Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/alphadrug-protein-target-molecular-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/alphadrug-protein-target-molecular-generation/</guid><description>AlphaDrug combines a modified transformer with Monte Carlo tree search and docking rollouts for target-specific de novo molecular generation.</description><content:encoded><![CDATA[<h2 id="target-conditioned-molecular-generation-via-transformer-and-mcts">Target-Conditioned Molecular Generation via Transformer and MCTS</h2>
<p>AlphaDrug is a <strong>Method</strong> paper that proposes a target-specific de novo molecular generation framework. The primary contribution is the combination of two components: (1) an Lmser Transformer (LT) that embeds protein-ligand context through hierarchical skip connections from encoder to decoder, and (2) a Monte Carlo tree search (MCTS) procedure guided by both the LT&rsquo;s predicted probabilities and docking scores from the <a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">SMINA</a> program. The method generates SMILES strings autoregressively, with each symbol selection informed by look-ahead search over potential binding affinities.</p>
<h2 id="bridging-the-gap-between-molecular-generation-and-protein-targeting">Bridging the Gap Between Molecular Generation and Protein Targeting</h2>
<p>Most deep learning methods for de novo molecular generation optimize physicochemical properties (LogP, QED, SA) without conditioning on a specific protein target. Virtual screening approaches rely on existing compound databases and are computationally expensive. The few methods that do consider protein targets, such as LiGANN and the <a href="/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/">transformer-based approach of Grechishnikova (2021)</a>, show limited docking performance. The core challenge is twofold: the search space of drug-like molecules is estimated at $10^{60}$ compounds, and learning protein-ligand interaction patterns from sequence data is difficult because proteins and ligands have very different structures and sequence lengths.</p>
<p>AlphaDrug addresses these gaps by proposing a method that jointly learns protein-ligand representations and uses docking-guided search to navigate the vast chemical space.</p>
<h2 id="lmser-transformer-and-docking-guided-mcts">Lmser Transformer and Docking-Guided MCTS</h2>
<p>The key innovations are the Lmser Transformer architecture and the MCTS search strategy.</p>
<h3 id="lmser-transformer-lt">Lmser Transformer (LT)</h3>
<p>The standard transformer for sequence-to-sequence tasks passes information from the encoder&rsquo;s top layer to the decoder through cross-attention. AlphaDrug identifies an information transfer bottleneck: deep protein features from the encoder&rsquo;s final layer must serve all decoder layers. Inspired by the Lmser (least mean squared error reconstruction) network, the authors add hierarchical skip connections from each encoder layer to the corresponding decoder layer.</p>
<p>Each decoder layer receives protein features at the matching level of abstraction through a cross-attention mechanism:</p>
<p>$$f_{ca}(Q_m, K_S, V_S) = \text{softmax}\left(\frac{Q_m K_S^T}{\sqrt{d_k}}\right) V_S$$</p>
<p>where $Q_m$ comes from the ligand-molecule decoder and $(K_S, V_S)$ are passed through skip connections from the protein encoder. This lets each decoder layer attend to protein features at its own level of abstraction, rather than all layers sharing the same top-level encoding.</p>
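<p>A NumPy sketch of $f_{ca}$ (shapes and values are illustrative):</p>

```python
import numpy as np

def f_ca(Q_m, K_S, V_S):
    """Scaled dot-product cross-attention: ligand-side queries attend
    over protein keys/values delivered by the skip connection from the
    matching encoder layer."""
    d_k = Q_m.shape[-1]
    scores = Q_m @ K_S.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # softmax stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # rows sum to 1
    return w @ V_S

# Toy shapes: 3 ligand tokens attending over 5 protein residues.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
ctx = f_ca(Q, K, V)   # shape (3, 8)
```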
<p>The multi-head attention follows the standard formulation:</p>
<p>$$\text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \dots, H_h) W^O$$</p>
<p>$$H_i = f_{ca}(Q W_i^Q, K W_i^K, V W_i^V)$$</p>
<h3 id="mcts-for-molecular-generation">MCTS for Molecular Generation</h3>
<p>The molecular generation process models SMILES construction as a sequential decision problem. At each step $\tau$, the context $C_\tau = \{S, a_1 a_2 \cdots a_\tau\}$ consists of the protein sequence $S$ and the intermediate SMILES string. MCTS runs a fixed number of simulations per step, each consisting of four phases:</p>
<p><strong>Select</strong>: Starting from the current root node, child nodes are selected using a variant of the PUCT algorithm:</p>
<p>$$\tilde{a}_{\tau+t} = \underset{a \in A}{\arg\max}\left(Q(\tilde{C}_{\tau+t-1}, a) + U(\tilde{C}_{\tau+t-1}, a)\right)$$</p>
<p>where $Q(\tilde{C}, a) = W_a / N_a$ is the average reward over the $N_a$ visits of action $a$, and $U(\tilde{C}, a) = c_{puct} \cdot P(a \mid \tilde{C}) \cdot \sqrt{N} / (1 + N_a)$ is an exploration bonus weighted by the LT&rsquo;s predicted probability, with $N$ the parent node&rsquo;s total visit count.</p>
<p>The Q-values are normalized to $[0, 1]$ using the range of docking scores in the tree:</p>
<p>$$Q(\tilde{C}, a) \leftarrow \frac{Q(\tilde{C}, a) - \min_{m \in \mathcal{M}} f_d(S, m)}{\max_{m \in \mathcal{M}} f_d(S, m) - \min_{m \in \mathcal{M}} f_d(S, m)}$$</p>
<p><strong>Expand</strong>: At a leaf node, the LT computes next-symbol probabilities and adds child nodes to the tree.</p>
<p><strong>Rollout</strong>: A complete molecule is generated greedily using LT probabilities. Valid molecules are scored with SMINA docking; invalid molecules receive the minimum observed docking score.</p>
<p><strong>Backup</strong>: Docking values propagate back up the tree, updating visit counts and cumulative rewards.</p>
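<p>The select rule can be sketched as follows (toy statistics; each child record holds the LT prior $P$, visit count $N$, and cumulative reward $W$, and the parent visit count is approximated by the sum over children):</p>

```python
import math

# Toy PUCT selection over a node's children:
# argmax_a  Q(C, a) + c_puct * P(a|C) * sqrt(N_parent) / (1 + N_a)
def select(children, c_puct=1.5):
    total_visits = sum(ch["N"] for ch in children.values())
    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0   # average reward
        u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])
        return q + u
    return max(children, key=lambda a: score(children[a]))

children = {
    "C": {"P": 0.6, "N": 10, "W": 7.0},  # well-explored, decent reward
    "N": {"P": 0.3, "N": 1,  "W": 0.9},  # lightly explored, high reward
    "O": {"P": 0.1, "N": 0,  "W": 0.0},  # unvisited: pure exploration term
}
chosen = select(children)
```

<p>Here the lightly explored but high-reward symbol wins over the heavily visited one, which is exactly the exploration-exploitation trade-off PUCT encodes.</p>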
<h3 id="training-objective">Training Objective</h3>
<p>The LT is trained on known protein-ligand pairs using cross-entropy loss:</p>
<p>$$J(\Theta) = -\sum_{(S,m) \in \mathcal{D}} \sum_{\tau=1}^{L_m} \sum_{a \in \mathcal{A}} y_a \ln P(a \mid C_\tau(S, m))$$</p>
<p>MCTS is only activated during inference, not during training.</p>
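<p>The per-pair term of $J(\Theta)$ reduces to a sum of negative log-probabilities of the target tokens; a toy sketch:</p>

```python
import math

# Sequence NLL: sum over positions of -ln P(target token | context).
# `probs` is a toy stand-in for the LT's per-step output distribution.
def sequence_nll(probs, target):
    return -sum(math.log(p[tok]) for p, tok in zip(probs, target))

probs = [{"C": 0.7, "N": 0.2, "O": 0.1},
         {"C": 0.5, "N": 0.4, "O": 0.1}]
nll = sequence_nll(probs, ["C", "N"])   # -ln 0.7 - ln 0.4
```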
<h2 id="experiments-on-diverse-protein-targets">Experiments on Diverse Protein Targets</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use BindingDB, filtered to 239,455 protein-ligand pairs across 981 unique proteins. Filtering criteria include: human proteins only, IC50 &lt; 100 nM, molecular weight &lt; 1000 Da, and single-chain targets. Proteins are clustered at 30% sequence identity using MMseqs2, with 25 clusters held out for testing (100 proteins), and the remainder split 90/10 for training (192,712 pairs) and validation (17,049 pairs).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>T+BS10</strong>: Standard transformer with beam search (K=10) from <a href="/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/">Grechishnikova (2021)</a></li>
<li><strong>LT+BS10</strong>: The proposed Lmser Transformer with beam search</li>
<li><strong>LiGANN</strong>: 3D pocket-to-ligand shape generation via BicycleGAN</li>
<li><strong>SBMolGen</strong>: ChemTS-based method with docking constraints</li>
<li><strong>SBDD-3D</strong>: 3D autoregressive graph-based generation</li>
<li><strong>Decoys</strong>: Random compounds from ZINC database</li>
<li><strong>Known ligands</strong>: Original binding partners from the database</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Docking</th>
          <th>Uniqueness</th>
          <th>LogP</th>
          <th>QED</th>
          <th>SA</th>
          <th>NP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Decoys</td>
          <td>7.3</td>
          <td>-</td>
          <td>2.4</td>
          <td>0.8</td>
          <td>2.4</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>Known ligands</td>
          <td>9.8</td>
          <td>-</td>
          <td>2.2</td>
          <td>0.5</td>
          <td>3.3</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>LiGANN</td>
          <td>6.7</td>
          <td>94.7%</td>
          <td>2.9</td>
          <td>0.6</td>
          <td>3.0</td>
          <td>-1.1</td>
      </tr>
      <tr>
          <td>SBMolGen</td>
          <td>7.7</td>
          <td>100%</td>
          <td>2.6</td>
          <td>0.7</td>
          <td>2.8</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>SBDD-3D</td>
          <td>7.7</td>
          <td>99.3%</td>
          <td>1.5</td>
          <td>0.6</td>
          <td>4.0</td>
          <td>0.3</td>
      </tr>
      <tr>
          <td>T+BS10</td>
          <td>8.5</td>
          <td>90.6%</td>
          <td>3.8</td>
          <td>0.5</td>
          <td>2.8</td>
          <td>-0.8</td>
      </tr>
      <tr>
          <td>LT+BS10</td>
          <td>8.5</td>
          <td>98.1%</td>
          <td>4.0</td>
          <td>0.5</td>
          <td>2.7</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (freq)</td>
          <td>10.8</td>
          <td>99.5%</td>
          <td>4.9</td>
          <td>0.4</td>
          <td>2.9</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (max)</td>
          <td>11.6</td>
          <td>100%</td>
          <td>5.2</td>
          <td>0.4</td>
          <td>2.7</td>
          <td>-0.8</td>
      </tr>
  </tbody>
</table>
<p>AlphaDrug (max) achieves the highest average docking score (11.6), surpassing known ligands (9.8). Statistical significance is confirmed with two-tailed t-test P-values below 0.01 for all comparisons.</p>
<h3 id="mcts-vs-beam-search-under-equal-compute">MCTS vs. Beam Search Under Equal Compute</h3>
<p>When constrained to the same number of docking evaluations, MCTS consistently outperforms beam search:</p>
<table>
  <thead>
      <tr>
          <th>Docking times (N)</th>
          <th>BS</th>
          <th>MCTS</th>
          <th>P-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>N = 105 (S=10)</td>
          <td>8.4 (10.9)</td>
          <td>10.9 (11.5)</td>
          <td>1.8e-34 (4.5e-3)</td>
      </tr>
      <tr>
          <td>N = 394 (S=50)</td>
          <td>8.3 (11.4)</td>
          <td>11.6 (12.2)</td>
          <td>1.4e-31 (1.8e-3)</td>
      </tr>
      <tr>
          <td>N = 1345 (S=500)</td>
          <td>8.4 (11.9)</td>
          <td>12.4 (13.2)</td>
          <td>2.2e-39 (8.2e-6)</td>
      </tr>
  </tbody>
</table>
<p>Values in parentheses are average top-1 scores per protein.</p>
<h3 id="ablation-effect-of-protein-sequence-input">Ablation: Effect of Protein Sequence Input</h3>
<p>Replacing the full transformer (T) or LT with a transformer encoder only (TE, no protein input) demonstrates that protein conditioning improves both uniqueness and docking score per symbol (SpS):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Uniqueness</th>
          <th>SpS</th>
          <th>Molecular length</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TE + MCTS (S=50)</td>
          <td>81.0%</td>
          <td>0.1926</td>
          <td>62.93</td>
      </tr>
      <tr>
          <td>T + MCTS (S=50)</td>
          <td>98.0%</td>
          <td>0.2149</td>
          <td>55.63</td>
      </tr>
      <tr>
          <td>LT + MCTS (S=50)</td>
          <td>100.0%</td>
          <td>0.2159</td>
          <td>56.54</td>
      </tr>
  </tbody>
</table>
<p>The SpS metric (docking score normalized by molecule length) isolates the quality improvement from the tendency of longer molecules to score higher.</p>
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p>A docking lookup table caches previously computed protein-molecule docking scores, reducing actual docking calls by 81-86% compared to the theoretical maximum ($L \times S$ calls per molecule). With $S = 10$, AlphaDrug generates molecules in about 52 minutes per protein; with $S = 50$, about 197 minutes per protein.</p>
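<p>The lookup table amounts to memoizing the docking function on the (protein, SMILES) pair; a sketch with a dummy scorer standing in for the actual SMINA call:</p>

```python
# Memoize docking scores on the (protein, canonical SMILES) pair so
# repeated rollouts that regenerate the same molecule never re-dock it.
class DockingCache:
    def __init__(self, dock_fn):
        self.dock_fn = dock_fn   # the expensive docking call
        self.table = {}
        self.calls = 0

    def score(self, protein, smiles):
        key = (protein, smiles)
        if key not in self.table:
            self.calls += 1
            self.table[key] = self.dock_fn(protein, smiles)
        return self.table[key]

cache = DockingCache(lambda p, s: float(len(s)))  # dummy scorer
cache.score("1abc", "CCO")
cache.score("1abc", "CCO")   # cache hit: no second docking call
```

<p>Canonicalizing the SMILES before keying would further collapse duplicate molecules generated under different enumerations.</p>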
<h2 id="docking-gains-with-acknowledged-limitations">Docking Gains with Acknowledged Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>86% of AlphaDrug-generated molecules have higher docking scores than known ligands for their respective targets.</li>
<li>The LT architecture with hierarchical skip connections improves uniqueness (from 90.6% to 98.1% with beam search) and provides slight SpS gains over the vanilla transformer.</li>
<li>MCTS is the dominant factor in performance improvement: even with only 10 simulations, it boosts docking scores by 31.3% over greedy LT decoding.</li>
<li>Case studies on three proteins (3gcs, 3eig, 4o28) show that generated molecules share meaningful substructures with known ligands, suggesting chemical plausibility.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors identify three areas for improvement:</p>
<ol>
<li><strong>Sequence-only representation</strong>: AlphaDrug uses amino acid sequences rather than 3D protein structures. While it outperforms existing 3D methods (SBDD-3D), incorporating 3D pocket geometry could further improve performance.</li>
<li><strong>External docking as value function</strong>: SMINA docking calls are computationally expensive and become a bottleneck during MCTS. A learnable end-to-end value network would reduce this cost and allow joint policy-value training.</li>
<li><strong>Full rollout requirement</strong>: Every MCTS simulation requires generating a complete molecule for docking evaluation. Estimating binding affinity from partial molecules remains an open challenge.</li>
</ol>
<p>The physicochemical properties (QED, SA) of AlphaDrug&rsquo;s outputs are comparable to baselines but not explicitly optimized. LogP values trend toward the upper end of the Ghose filter range (4.9-5.2 vs. the 5.6 limit), which may indicate lipophilicity bias.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BindingDB (filtered)</td>
          <td>192,712 protein-ligand pairs</td>
          <td>Human proteins, IC50 &lt; 100 nM, MW &lt; 1000 Da</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>BindingDB (filtered)</td>
          <td>17,049 pairs</td>
          <td>Same filtering criteria</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>BindingDB (filtered)</td>
          <td>100 proteins from 25 clusters</td>
          <td>Clustered at 30% sequence identity via MMseqs2</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>MCTS with PUCT selection criterion, $c_{puct} = 1.5$</li>
<li>$S = 50$ simulations per step (default), $S = 10$ for fast variant</li>
<li>Greedy rollout policy using LT probabilities</li>
<li>Docking lookup table for efficiency (caches SMINA results)</li>
<li>Two generation modes: max (deterministic, highest visit count) and freq (stochastic, proportional to visit counts)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Lmser Transformer with hierarchical encoder-to-decoder skip connections</li>
<li>Sinusoidal positional encoding</li>
<li>Multi-head cross-attention at each decoder layer</li>
<li>Detailed hyperparameters (embedding dimensions, number of layers/heads) are in the supplementary material (Table S1)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AlphaDrug (max)</th>
          <th>Known ligands</th>
          <th>Best baseline (T+BS10)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score</td>
          <td>11.6</td>
          <td>9.8</td>
          <td>8.5</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>100%</td>
          <td>-</td>
          <td>90.6%</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>100%</td>
          <td>-</td>
          <td>Not reported</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not explicitly reported in the paper. Generation time is reported as approximately 52 minutes per protein ($S = 10$) and 197 minutes per protein ($S = 50$), with docking (via SMINA) being the dominant cost.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CMACH508/AlphaDrug">CMACH508/AlphaDrug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation, includes data processing and generation scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, H., Lin, C., Zhao, D., Tu, S., &amp; Xu, L. (2022). AlphaDrug: protein target specific de novo molecular generation. <em>PNAS Nexus</em>, 1(4), pgac227. <a href="https://doi.org/10.1093/pnasnexus/pgac227">https://doi.org/10.1093/pnasnexus/pgac227</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2022alphadrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AlphaDrug: protein target specific de novo molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Hao and Lin, Cheng and Zhao, Dengwei and Tu, Shikui and Xu, Lei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{PNAS Nexus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{pgac227}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/pnasnexus/pgac227}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>TamGen: GPT-Based Target-Aware Drug Design and Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/tamgen-target-aware-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/tamgen-target-aware-molecule-generation/</guid><description>TamGen combines a GPT-like chemical language model with protein pocket encoding and VAE refinement to generate drug candidates with experimental validation.</description><content:encoded><![CDATA[<h2 id="a-method-for-target-conditioned-molecular-generation">A Method for Target-Conditioned Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces TamGen (Target-aware molecular generation), a three-module architecture for generating drug-like compounds conditioned on protein binding pocket structures. The primary contribution is a GPT-like chemical language model pre-trained on 10 million SMILES from PubChem, combined with a Transformer-based protein encoder and a VAE-based contextual encoder for compound refinement. The authors validate TamGen on the CrossDocked2020 benchmark and apply it through a Design-Refine-Test pipeline to discover 14 novel inhibitors of the Mycobacterium tuberculosis ClpP protease, with $\text{IC}_{50}$ values ranging from 1.88 to 35.2 $\mu$M.</p>
<h2 id="bridging-generative-ai-and-practical-drug-discovery">Bridging Generative AI and Practical Drug Discovery</h2>
<p>Target-based generative drug design aims to create novel compounds with desired pharmacological properties from scratch, exploring the estimated $10^{60}$ feasible compounds in chemical space rather than screening existing libraries of $10^{4}$ to $10^{8}$ molecules. Prior approaches using diffusion models, GANs, VAEs, and autoregressive models have demonstrated the feasibility of generating compounds conditioned on target proteins. However, most generated compounds lack satisfactory physicochemical properties for drug-likeness, and validations with biophysical or biochemical assays are largely missing.</p>
<p>The key limitations of existing 3D generation methods (TargetDiff, Pocket2Mol, ResGen, 3D-AR) include:</p>
<ul>
<li>Generated compounds frequently contain multiple fused rings, leading to poor synthetic accessibility</li>
<li>High cellular toxicity and decreased developability associated with excessive fused ring counts</li>
<li>Slow generation speeds (tens of minutes to hours per 100 compounds)</li>
<li>Limited real-world experimental validation of generated candidates</li>
</ul>
<p>TamGen addresses these issues by operating in 1D SMILES space rather than 3D coordinate space, leveraging pre-training on natural compound distributions to produce more drug-like molecules.</p>
<h2 id="three-module-architecture-with-pre-training-and-refinement">Three-Module Architecture with Pre-Training and Refinement</h2>
<p>TamGen consists of three components: a compound decoder, a protein encoder, and a contextual encoder.</p>
<h3 id="compound-decoder-chemical-language-model">Compound Decoder (Chemical Language Model)</h3>
<p>The compound decoder is a GPT-style autoregressive model pre-trained on 10 million SMILES randomly sampled from PubChem. The pre-training objective follows standard next-token prediction:</p>
<p>$$
\min -\sum_{y \in \mathcal{D}_0} \frac{1}{M_y} \sum_{i=1}^{M_y} \log P(y_i \mid y_{i-1}, y_{i-2}, \ldots, y_1)
$$</p>
<p>where $M_y$ is the SMILES sequence length. This enables both unconditional and conditional generation. The decoder uses 12 Transformer layers with hidden dimension 768.</p>
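<p>The length-normalized objective for a single sequence can be sketched in a few lines; the per-token probabilities below are hypothetical stand-ins for model outputs:</p>

```python
import math

def smiles_nll(token_probs):
    """Length-normalized negative log-likelihood of one SMILES sequence,
    mirroring the (1/M_y) * sum over i of log P(y_i | y_1..y_{i-1}) term.
    `token_probs` holds the model's probability for each observed token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for a three-token SMILES like "CCO".
loss = smiles_nll([0.9, 0.8, 0.7])  # ≈ 0.2284
```

<p>Averaging this quantity over the 10 million pre-training SMILES gives the full objective above.</p>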
<h3 id="protein-encoder-with-distance-aware-attention">Protein Encoder with Distance-Aware Attention</h3>
<p>The protein encoder processes binding pocket residues using both sequential and geometric information. Given amino acids $\mathbf{a} = (a_1, \ldots, a_N)$ with 3D coordinates $\mathbf{r} = (r_1, \ldots, r_N)$, the input representation combines amino acid embeddings with coordinate embeddings:</p>
<p>$$
h_i^{(0)} = E_a a_i + E_r \rho\left(r_i - \frac{1}{N}\sum_{j=1}^{N} r_j\right)
$$</p>
<p>where $\rho$ denotes a random roto-translation operation applied as data augmentation, and coordinates are centered to the origin.</p>
<p>The encoder uses a distance-aware self-attention mechanism that weights attention scores by spatial proximity:</p>
<p>$$
\begin{aligned}
\hat{\alpha}_{ij} &amp;= \exp\left(-\frac{\lVert r_i - r_j \rVert^2}{\tau}\right)\left(h_i^{(l)\top} W h_j^{(l)}\right) \\
\alpha_{ij} &amp;= \frac{\exp \hat{\alpha}_{ij}}{\sum_{k=1}^{N} \exp \hat{\alpha}_{ik}} \\
h_i^{(l+1)} &amp;= \sum_{j=1}^{N} \alpha_{ij} \left(W_v h_j^{(l)}\right)
\end{aligned}
$$</p>
<p>where $\tau$ is a temperature hyperparameter and $W$, $W_v$ are learnable parameters. The encoder uses 4 layers with hidden dimension 256. Outputs are passed to the compound decoder via cross-attention.</p>
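<p>A minimal single-head NumPy sketch of this distance-weighted attention (the encoder itself is multi-head; the matrix shapes and random inputs here are illustrative):</p>

```python
import numpy as np

def distance_aware_attention(h, r, W, Wv, tau=1.0):
    """One head of distance-aware self-attention: content scores are scaled
    by exp(-||r_i - r_j||^2 / tau) before the row-wise softmax.
    Shapes: h (N, d) hidden states, r (N, 3) coordinates, W and Wv (d, d)."""
    scores = h @ W @ h.T                            # (N, N) content scores
    d2 = np.sum((r[:, None] - r[None, :])**2, -1)   # squared pairwise distances
    scores = np.exp(-d2 / tau) * scores             # down-weight distant residues
    scores -= scores.max(axis=-1, keepdims=True)    # stability; softmax unchanged
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)      # row-wise softmax
    return alpha @ (h @ Wv)                         # (N, d) updated states

rng = np.random.default_rng(0)
N, d = 5, 8
out = distance_aware_attention(rng.normal(size=(N, d)),
                               rng.normal(size=(N, 3)),
                               rng.normal(size=(d, d)) * 0.1,
                               rng.normal(size=(d, d)) * 0.1)
```

<p>The Gaussian factor biases each residue toward spatially nearby residues, injecting 3D geometry into an otherwise standard attention layer.</p>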
<h3 id="vae-based-contextual-encoder">VAE-Based Contextual Encoder</h3>
<p>A VAE-based contextual encoder determines the mean $\mu$ and standard deviation $\sigma$ of the latent variable $z$ for any (compound, protein) pair. During training, the model reconstructs the input compound; at application time, a seed compound conditions the latent code, enabling compound refinement. The full training objective combines reconstruction loss with KL regularization:</p>
<p>$$
\min_{\Theta, q} \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} -\log P(\mathbf{y} \mid \mathbf{x}, z; \Theta) + \beta \, \mathcal{D}_{\text{KL}}\left(q(z \mid \mathbf{x}, \mathbf{y}) \,\|\, p(z)\right)
$$</p>
<p>where $\beta$ is a hyperparameter controlling the KL divergence weight, and $p(z)$ is a standard Gaussian prior.</p>
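<p>Assuming a diagonal Gaussian posterior, the KL term has a closed form and the objective reduces to a reconstruction loss plus a weighted penalty; a minimal sketch with hypothetical values:</p>

```python
import numpy as np

def beta_vae_loss(recon_nll, mu, log_sigma, beta):
    """Sketch of the training objective: reconstruction NLL plus a
    beta-weighted KL between the diagonal Gaussian q(z|x,y) = N(mu, sigma^2)
    and the standard Gaussian prior p(z), computed in closed form."""
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma)
    return recon_nll + beta * kl

# When q matches the prior exactly, the KL term vanishes.
loss = beta_vae_loss(recon_nll=1.25, mu=np.zeros(4), log_sigma=np.zeros(4), beta=0.1)
# → 1.25
```

<p>Sampling $z$ from $\mathcal{N}(\mu, \sigma^2)$ at generation time is what allows a single seed compound to yield many distinct refinements.</p>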
<h2 id="benchmark-evaluation-and-tuberculosis-drug-discovery">Benchmark Evaluation and Tuberculosis Drug Discovery</h2>
<h3 id="crossdocked2020-benchmark">CrossDocked2020 Benchmark</h3>
<p>TamGen was evaluated against five baselines (liGAN, 3D-AR, Pocket2Mol, ResGen, TargetDiff) on the CrossDocked2020 dataset (~100k drug-target pairs for training, 100 test binding pockets). For each target, 100 compounds were generated per method. Evaluation metrics included:</p>
<ul>
<li><strong>Docking score</strong> (AutoDock-Vina): binding affinity estimate</li>
<li><strong>QED</strong>: quantitative estimate of drug-likeness</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a></strong>: physicochemical property compliance</li>
<li><strong>SAS</strong>: synthetic accessibility score</li>
<li><strong>LogP</strong>: lipophilicity (optimal range 0-5 for oral administration)</li>
<li><strong>Molecular diversity</strong>: Tanimoto similarity between Morgan fingerprints</li>
</ul>
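<p>As an illustration, the diversity metric can be sketched with fingerprints represented as sets of on-bits; the paper uses RDKit Morgan fingerprints, so these toy sets are stand-ins:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    inter = len(fp_a.intersection(fp_b))
    return inter / (len(fp_a) + len(fp_b) - inter)

def diversity(fps):
    """Molecular diversity as 1 minus the mean pairwise Tanimoto similarity."""
    pairs = [(i, j) for i in range(len(fps)) for j in range(i + 1, len(fps))]
    return 1.0 - sum(tanimoto(fps[i], fps[j]) for i, j in pairs) / len(pairs)

# Toy on-bit sets standing in for Morgan fingerprints of three molecules.
div = diversity([{1, 2, 3}, {2, 3, 4}, {7, 8}])
```

<p>Higher values indicate a more structurally varied set of generated compounds.</p>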
<p>TamGen ranked first or second on 5 of 6 metrics and achieved the best overall score using mean reciprocal rank (MRR) across all metrics. On synthetic accessibility for high-affinity compounds, TamGen performed best. The generated compounds averaged 1.78 fused rings, closely matching FDA-approved drugs, while competing 3D methods produced compounds with significantly more fused rings.</p>
<p>TamGen was also 85x to 394x faster than competing methods, generating 100 compounds per target in an average of 9 seconds on a single A6000 GPU, compared with tens of minutes to hours for the baselines.</p>
<h3 id="design-refine-test-pipeline-for-clpp-inhibitors">Design-Refine-Test Pipeline for ClpP Inhibitors</h3>
<p>The practical application targeted ClpP protease of Mycobacterium tuberculosis, an emerging antibiotic target with no documented advanced inhibitors beyond <a href="https://en.wikipedia.org/wiki/Bortezomib">Bortezomib</a>.</p>
<p><strong>Design stage</strong>: Using the ClpP binding pocket from PDB structure 5DZK, TamGen generated 2,612 unique compounds. Compounds were filtered by molecular docking (retaining those with better scores than Bortezomib) and Ligandformer phenotypic activity prediction. Peptidomimetic compounds were excluded for poor ADME properties. Four seed compounds were selected.</p>
<p><strong>Refine stage</strong>: Using the 4 seed compounds plus 3 weakly active compounds ($\text{IC}_{50}$ 100-200 $\mu$M) from prior experiments, TamGen generated 8,635 unique compounds conditioned on both the target and seeds. After filtering, 296 compounds were selected for testing.</p>
<p><strong>Test stage</strong>: From a 446k commercial compound library, 159 analogs (MCS similarity &gt; 0.55) were identified. Five analogs showed significant inhibitory effects. Dose-response experiments revealed $\text{IC}_{50}$ values below 20 $\mu$M for all five, with Analog-005 achieving $\text{IC}_{50}$ of 1.9 $\mu$M. Three additional novel compounds were synthesized for SAR analysis:</p>
<table>
  <thead>
      <tr>
          <th>Compound</th>
          <th>Series</th>
          <th>Source</th>
          <th>$\text{IC}_{50}$ ($\mu$M)</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Analog-005</td>
          <td>II</td>
          <td>Commercial library</td>
          <td>1.9</td>
          <td>Most potent analog</td>
      </tr>
      <tr>
          <td>Analog-003</td>
          <td>I</td>
          <td>Commercial library</td>
          <td>&lt; 20</td>
          <td>Strongest single-dose inhibition</td>
      </tr>
      <tr>
          <td>Syn-A003-01</td>
          <td>I</td>
          <td>TamGen (synthesized)</td>
          <td>&lt; 20</td>
          <td>Diphenylurea scaffold</td>
      </tr>
  </tbody>
</table>
<p>Both compound series (diphenylurea and benzenesulfonamide scaffolds) represent novel ClpP inhibitor chemotypes distinct from Bortezomib. Additionally, 6 out of 8 directly synthesized TamGen compounds demonstrated $\text{IC}_{50}$ below 40 $\mu$M, confirming TamGen&rsquo;s ability to produce viable hits without the library search step.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Four ablation experiments clarified the contributions of TamGen&rsquo;s components:</p>
<ol>
<li><strong>Without pre-training</strong>: Significantly worse docking scores and simpler structures. The optimal decoder depth dropped from 12 to 4 layers without pre-training due to overfitting.</li>
<li><strong>Shuffled pocket-ligand pairs (TamGen-r)</strong>: Substantially worse docking scores, confirming TamGen learns meaningful pocket-ligand interactions rather than generic compound distributions.</li>
<li><strong>Without distance-aware attention</strong>: Significant decline in docking scores when removing the geometric attention term from Eq. 2.</li>
<li><strong>Without coordinate augmentation</strong>: Performance degradation when removing the roto-translation augmentation $\rho$, highlighting the importance of geometric invariance.</li>
</ol>
<h2 id="validated-drug-like-generation-with-practical-limitations">Validated Drug-Like Generation with Practical Limitations</h2>
<p>TamGen demonstrates that 1D SMILES-based generation with pre-training on natural compounds produces molecules with better drug-likeness properties than 3D generation methods. The experimental validation against ClpP is a notable strength, as most generative drug design methods lack biochemical assay confirmation.</p>
<p>Key limitations acknowledged by the authors include:</p>
<ul>
<li><strong>Insufficient sensitivity to minor target differences</strong>: TamGen cannot reliably distinguish targets with point mutations or protein isoforms, limiting applicability for cancer-related proteins</li>
<li><strong>Requires known structure and pocket</strong>: As a structure-based method, TamGen needs the 3D structure of the target protein and binding pocket information</li>
<li><strong>Limited cellular validation</strong>: The study focuses on hit identification; cellular activities and toxicities of proposed compounds were not extensively tested</li>
<li><strong>1D generation trade-off</strong>: SMILES-based generation does not fully exploit 3D protein-ligand geometric interactions available in coordinate space</li>
</ul>
<p>Future directions include integrating insights from 3D autoregressive methods, using Monte Carlo Tree Search or reinforcement learning to guide generation for better docking scores and ADME/T properties, and property-guided generation as explored in <a href="/notes/chemistry/molecular-design/generation/target-aware/prefixmol-target-chemistry-aware-generation/">PrefixMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (random sample)</td>
          <td>10M SMILES</td>
          <td>Compound decoder pre-training</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>CrossDocked2020</td>
          <td>~100k pairs</td>
          <td>Filtered pocket-ligand pairs</td>
      </tr>
      <tr>
          <td>Extended fine-tuning</td>
          <td>CrossDocked + PDB</td>
          <td>~300k pairs</td>
          <td>Used for TB compound generation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>CrossDocked2020 test</td>
          <td>100 pockets</td>
          <td>Same split as TargetDiff/Pocket2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Compound decoder</strong>: 12-layer GPT with hidden dimension 768, pre-trained for 200k steps</li>
<li><strong>Protein encoder</strong>: 4-layer Transformer with hidden dimension 256, distance-aware attention</li>
<li><strong>VAE encoder</strong>: 4-layer standard Transformer encoder with hidden dimension 256</li>
<li><strong>Optimizer</strong>: Adam with initial learning rate $3 \times 10^{-5}$</li>
<li><strong>VAE $\beta$</strong>: 0.1 or 1.0 depending on generation stage</li>
<li><strong>Beam search</strong>: beam sizes of 4, 10, or 20 depending on stage</li>
<li><strong>Pocket definition</strong>: residues within 10 or 15 Angstrom distance cutoff from ligand center</li>
</ul>
<h3 id="models">Models</h3>
<p>Pre-trained model weights are available via Zenodo at <a href="https://doi.org/10.5281/zenodo.13751391">https://doi.org/10.5281/zenodo.13751391</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>TamGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Overall MRR</td>
          <td>Best</td>
          <td>TargetDiff (2nd)</td>
          <td>Ranked across 6 metrics</td>
      </tr>
      <tr>
          <td>Fused rings (avg)</td>
          <td>1.78</td>
          <td>~3-5 (others)</td>
          <td>Matches FDA-approved drug average</td>
      </tr>
      <tr>
          <td>Generation speed</td>
          <td>9 sec/100 compounds</td>
          <td>~13 min (ResGen)</td>
          <td>Single A6000 GPU</td>
      </tr>
      <tr>
          <td>ClpP hit rate</td>
          <td>6/8 synthesized</td>
          <td>N/A</td>
          <td>$\text{IC}_{50}$ &lt; 40 $\mu$M</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x V100 GPUs for 200k steps</li>
<li>Inference benchmarking: 1x A6000 GPU</li>
<li>Generation time: ~9 seconds per 100 compounds per target</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SigmaGenX/TamGen">TamGen code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13751391">Model weights and data</a></td>
          <td>Model + Data</td>
          <td>CC-BY-4.0</td>
          <td>Pre-trained weights, source data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, K., Xia, Y., Deng, P., Liu, R., Zhang, Y., Guo, H., Cui, Y., Pei, Q., Wu, L., Xie, S., Chen, S., Lu, X., Hu, S., Wu, J., Chan, C.-K., Chen, S., Zhou, L., Yu, N., Chen, E., Liu, H., Guo, J., Qin, T., &amp; Liu, T.-Y. (2024). TamGen: drug design with target-aware molecule generation through a chemical language model. <em>Nature Communications</em>, 15, 9360. <a href="https://doi.org/10.1038/s41467-024-53632-4">https://doi.org/10.1038/s41467-024-53632-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tamgen,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{TamGen: drug design with target-aware molecule generation through a chemical language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{9360}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-53632-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>STONED: Training-Free Molecular Design with SELFIES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/</guid><description>STONED uses string mutations in the SELFIES representation for training-free molecular generation, interpolation, and chemical space exploration.</description><content:encoded><![CDATA[<h2 id="a-training-free-algorithm-for-molecular-generation">A Training-Free Algorithm for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces STONED (Superfast Traversal, Optimization, Novelty, Exploration and Discovery), a suite of algorithms for molecular generation and chemical space exploration. STONED operates entirely through string manipulations on the <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> molecular representation, avoiding the need for deep learning models, training data, or GPU resources. The key claim is that simple character-level mutations and interpolations in SELFIES can achieve results competitive with state-of-the-art deep generative models on standard benchmarks.</p>
<h2 id="why-deep-generative-models-may-be-overkill">Why Deep Generative Models May Be Overkill</h2>
<p>Deep generative models (VAEs, GANs, RNNs, reinforcement learning) have become popular for <a href="/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/">inverse molecular design</a>, but they come with practical costs: large training datasets, expensive GPU compute, and long training times. Fragile representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> compound the problem, since large portions of a latent space can map to invalid molecules. Even with the introduction of SELFIES (a 100% valid string representation), prior work still embedded it within neural network architectures.</p>
<p>The authors argue that for tasks like local chemical space exploration and molecular interpolation, the guarantees of SELFIES alone may be sufficient. Because every SELFIES string maps to a valid molecule, random character mutations always produce valid structures. This observation eliminates the need for learned generation procedures entirely.</p>
<h2 id="core-innovation-selfies-string-mutations-as-molecular-operators">Core Innovation: SELFIES String Mutations as Molecular Operators</h2>
<p>STONED relies on four key techniques built on SELFIES string manipulations:</p>
<p><strong>1. Random character mutations.</strong> A point mutation in SELFIES (character replacement, deletion, or addition) always yields a valid molecule. The position of mutations serves as a hyperparameter controlling exploration vs. exploitation: terminal character mutations preserve more structural similarity to the seed, while random mutations explore more broadly.</p>
<p><strong>2. Multiple SMILES orderings.</strong> A single molecule has many valid SMILES strings, each mapping to a different SELFIES. Generating 50,000 SMILES orderings and converting each to SELFIES before mutation substantially increases the diversity of generated structures.</p>
<p><strong>3. Deterministic interpolation.</strong> Given two SELFIES strings (padded to equal length), characters at equivalent positions can be successively replaced from the start molecule to the target molecule. Every intermediate string is a valid molecule. A chemical path is extracted by keeping only those intermediates that increase fingerprint similarity to the target.</p>
<p><strong>4. Fingerprint-based filtering.</strong> Since edit distance in SELFIES does not reflect molecular similarity, STONED uses fingerprint comparisons (ECFP4, FCFP4, atom-pair) to enforce structural similarity constraints.</p>
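<p>The deterministic interpolation of technique 3 can be sketched on token lists. Real use would encode and decode through the <code>selfies</code> package, and STONED additionally keeps only intermediates that increase fingerprint similarity to the target; the tokens and padding symbol here are illustrative:</p>

```python
def interpolate_tokens(start, target, pad="[nop]"):
    """Character-by-character interpolation between two token sequences:
    pad to equal length, then replace one position at a time from the
    start molecule toward the target. With real SELFIES tokens, every
    intermediate string decodes to a valid molecule."""
    n = max(len(start), len(target))
    cur = start + [pad] * (n - len(start))
    tgt = target + [pad] * (n - len(target))
    path = [cur[:]]
    for i in range(n):
        if cur[i] != tgt[i]:
            cur[i] = tgt[i]
            path.append(cur[:])
    return path

# Each step along the path changes exactly one token position.
path = interpolate_tokens(["[C]", "[C]", "[O]"], ["[C]", "[N]"])
```
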
<p>The authors also propose a revised joint molecular similarity metric for evaluating median molecules. Given $n$ reference molecules $M = \{m_1, m_2, \ldots, m_n\}$, the joint similarity of a candidate molecule $m$ is:</p>
<p>$$
F(m) = \frac{1}{n} \sum_{i=1}^{n} \text{sim}(m_i, m) - \left[\max_{i} \text{sim}(m_i, m) - \min_{i} \text{sim}(m_i, m)\right]
$$</p>
<p>This penalizes candidates that are similar to only a subset of references, unlike the geometric mean metric used in GuacaMol which can yield high scores even with lopsided similarities.</p>
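<p>The metric itself is a one-liner over precomputed similarity values; the numbers below are hypothetical:</p>

```python
def joint_similarity(sims):
    """Revised joint similarity F(m): mean similarity to the n references
    minus the spread (max - min), penalizing candidates that resemble
    only a subset of the references. `sims` holds sim(m_i, m) values."""
    return sum(sims) / len(sims) - (max(sims) - min(sims))

balanced = joint_similarity([0.5, 0.5, 0.5])   # 0.5 - 0.0 = 0.5
lopsided = joint_similarity([0.9, 0.2, 0.4])   # ≈ 0.5 - 0.7 = -0.2
```

<p>Both candidates have the same mean similarity, but the spread penalty ranks the balanced one far higher, which is the behavior the geometric-mean metric fails to enforce.</p>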
<h2 id="experimental-setup-and-applications">Experimental Setup and Applications</h2>
<h3 id="local-chemical-subspace-formation">Local chemical subspace formation</h3>
<p>Starting from a single seed molecule (<a href="https://en.wikipedia.org/wiki/Aripiprazole">aripiprazole</a>, albuterol, mestranol, or <a href="https://en.wikipedia.org/wiki/Celecoxib">celecoxib</a>), the algorithm generates 50,000 SMILES orderings and performs 1-5 point mutations per ordering, producing 250,000 candidate strings. Unique valid molecules are filtered by fingerprint similarity thresholds.</p>
<table>
  <thead>
      <tr>
          <th>Starting structure</th>
          <th>Fingerprint</th>
          <th>Molecules at $\delta &gt; 0.75$</th>
          <th>Molecules at $\delta &gt; 0.60$</th>
          <th>Molecules at $\delta &gt; 0.40$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Aripiprazole (SELFIES, random)</td>
          <td>ECFP4</td>
          <td>513 (0.25%)</td>
          <td>4,206 (2.15%)</td>
          <td>34,416 (17.66%)</td>
      </tr>
      <tr>
          <td>Albuterol (SELFIES, random)</td>
          <td>FCFP4</td>
          <td>587 (0.32%)</td>
          <td>4,156 (2.33%)</td>
          <td>16,977 (9.35%)</td>
      </tr>
      <tr>
          <td>Mestranol (SELFIES, random)</td>
          <td>AP</td>
          <td>478 (0.22%)</td>
          <td>4,079 (1.90%)</td>
          <td>45,594 (21.66%)</td>
      </tr>
      <tr>
          <td>Celecoxib (SELFIES, random)</td>
          <td>ECFP4</td>
          <td>198 (0.10%)</td>
          <td>1,925 (1.00%)</td>
          <td>18,045 (9.44%)</td>
      </tr>
      <tr>
          <td>Celecoxib (SELFIES, terminal 10%)</td>
          <td>ECFP4</td>
          <td>864 (2.02%)</td>
          <td>9,407 (21.99%)</td>
          <td>34,187 (79.91%)</td>
      </tr>
  </tbody>
</table>
<p>Key finding: restricting mutations to terminal characters yields a 20x increase in high-similarity molecules compared to random positions. Compared to SMILES mutations (0.30% valid) and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> (1.44% valid), SELFIES mutations are all valid by construction.</p>
<p>A two-step expansion (mutating all unique first-round neighbors) produced over 17 million unique molecules, with 120,000 having similarity greater than 0.4 to celecoxib.</p>
<h3 id="chemical-path-formation-and-drug-design">Chemical path formation and drug design</h3>
<p>Deterministic SELFIES interpolation between <a href="https://en.wikipedia.org/wiki/Tadalafil">tadalafil</a> and <a href="https://en.wikipedia.org/wiki/Sildenafil">sildenafil</a> generated paths where <a href="https://en.wikipedia.org/wiki/Partition_coefficient">logP</a> and QED values varied smoothly. A more challenging application docked intermediates between <a href="https://en.wikipedia.org/wiki/Dihydroergotamine">dihydroergotamine</a> (<a href="https://en.wikipedia.org/wiki/5-HT1B_receptor">5-HT1B</a> binder) and prinomastat (<a href="https://en.wikipedia.org/wiki/CYP2D6">CYP2D6</a> binder), finding molecules with non-trivial binding affinity to both proteins without any optimization routine.</p>
<h3 id="median-molecules-for-photovoltaics">Median molecules for photovoltaics</h3>
<p>Using 100 triplets from the Harvard Clean Energy (HCE) dataset, each with one molecule optimized for high LUMO energy, one for high dipole moment, and one for high <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO-LUMO gap</a>, generalized chemical paths produced median molecules. These were evaluated with GFN2-xTB semiempirical calculations. The generated medians matched or exceeded the best molecules available in the HCE database in both structural similarity and target properties.</p>
<h3 id="guacamol-benchmarks">GuacaMol benchmarks</h3>
<p>Without any training, STONED achieved an overall <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> score of 14.70, competitive with several deep generative models. The approach simply identifies the single best molecule in the benchmark&rsquo;s training set and generates its local chemical subspace. 38% of the top-100 molecules from each benchmark passed compound quality filters, comparable to <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a> and SMILES GA.</p>
<h2 id="results-summary-and-limitations">Results Summary and Limitations</h2>
<p>STONED demonstrates that SELFIES string mutations can match or approach deep generative models on standard molecular design benchmarks while being orders of magnitude faster and requiring no training. The most expensive benchmark (aripiprazole subspace) completed in 500 seconds on a laptop CPU.</p>
<p>The method comparison table from the paper highlights STONED&rsquo;s unique position:</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Expert Systems</th>
          <th>VAE</th>
          <th>GAN</th>
          <th>RL</th>
          <th>STONED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Expert rule-free</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Structure coverage</td>
          <td>Partial</td>
          <td>Partial</td>
          <td>Partial</td>
          <td>Partial</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Interpolatability</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Property-based navigation</td>
          <td>Partial</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Partial</td>
      </tr>
      <tr>
          <td>Training-free</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Data independence</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>STONED lacks property-based navigation (gradient-guided optimization toward specific property targets). It can only do stochastic property optimization when wrapped in a genetic algorithm.</li>
<li>The success rate of mutations leading to structurally similar molecules is relatively low (0.1-2% at high similarity thresholds), though speed compensates.</li>
<li>Chemical paths can contain molecules with unstable functional groups or <a href="https://en.wikipedia.org/wiki/Tautomer">tautomerization</a> issues, requiring post-hoc filtering with domain-specific rules.</li>
<li>Fingerprint similarity does not capture all aspects of chemical similarity (3D geometry, reactivity, synthesizability).</li>
<li>The penalized logP and QED benchmarks used by GuacaMol do not represent the full complexity of practical molecular design.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Photovoltaics</td>
          <td>Harvard Clean Energy (HCE) database</td>
          <td>~2.3M molecules</td>
          <td>Used for median molecule triplet experiments</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>GuacaMol benchmark suite</td>
          <td>Varies per task</td>
          <td>Standard benchmarks for generative molecular design</td>
      </tr>
      <tr>
          <td>Comparison</td>
          <td>ChEMBL (SCScore &lt;= 2.5 subset)</td>
          <td>Fragment database</td>
          <td>Used for CReM comparison experiments</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Local subspace formation</strong>: 50,000 SMILES orderings per seed molecule, 1-5 SELFIES point mutations each, totaling 250,000 candidates per experiment.</li>
<li><strong>Chemical paths</strong>: Deterministic character-by-character interpolation between padded SELFIES strings, with monotonic fingerprint similarity filtering.</li>
<li><strong>Median molecules</strong>: Generalized paths between 3+ reference molecules using 10,000 paths per triplet with randomized SMILES orderings.</li>
<li><strong>Docking</strong>: <a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">SMINA</a> with crystal structures from PDB (4IAQ for 5-HT1B, 3QM4 for CYP2D6). Top-5 binding poses averaged.</li>
<li><strong>Quantum chemistry</strong>: GFN2-xTB for dipole moments, LUMO energies, and HOMO-LUMO gaps.</li>
</ul>
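<p>As a rough illustration of the local-subspace step, the sketch below applies random insert/replace/delete point mutations to a token list. This is a pure-Python toy, not the authors' code: the tiny <code>ALPHABET</code>, the mutation operators, and the candidate count are stand-ins for the real implementation, which mutates SELFIES strings drawn from the full robust alphabet of the <code>selfies</code> package and randomizes SMILES orderings with RDKit.</p>

```python
import random

# Toy SELFIES-like alphabet (assumption); STONED draws mutations from the
# full semantically robust SELFIES alphabet instead.
ALPHABET = ["[C]", "[N]", "[O]", "[=C]", "[Branch1]", "[Ring1]"]

def point_mutations(tokens, n_mutations, rng):
    """Apply random insert/replace/delete point mutations to a token list."""
    tokens = list(tokens)
    for _ in range(n_mutations):
        op = rng.choice(["insert", "replace", "delete"])
        if op == "insert":
            tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(ALPHABET))
        elif op == "replace" and tokens:
            tokens[rng.randrange(len(tokens))] = rng.choice(ALPHABET)
        elif tokens:  # delete
            tokens.pop(rng.randrange(len(tokens)))
    return tokens

def local_subspace(seed_tokens, n_candidates=1000, max_mutations=5, seed=0):
    """Enumerate a local chemical subspace around one seed molecule by
    repeatedly applying 1..max_mutations point mutations."""
    rng = random.Random(seed)
    out = set()
    for _ in range(n_candidates):
        k = rng.randint(1, max_mutations)
        out.add("".join(point_mutations(seed_tokens, k, rng)))
    return out

space = local_subspace(["[C]", "[C]", "[O]"], n_candidates=200)
```

<p>In the full method, each candidate string would be decoded back to a molecule (SELFIES guarantees validity) and filtered by fingerprint similarity to the seed.</p>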
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GuacaMol overall score</td>
          <td>14.70</td>
          <td>Varies by model</td>
          <td>Competitive with deep generative models</td>
      </tr>
      <tr>
          <td>Quality filter pass rate</td>
          <td>38%</td>
          <td>Graph GA/SMILES GA comparable</td>
          <td>Top-100 molecules per benchmark</td>
      </tr>
      <tr>
          <td>Celecoxib neighbors ($\delta &gt; 0.75$)</td>
          <td>198-864</td>
          <td>CReM: 239</td>
          <td>Depends on mutation position strategy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All experiments were run on a laptop with an Intel i7-8750H CPU at 2.20 GHz; no GPU is required. The most expensive single experiment (the aripiprazole subspace) completed in 500 seconds.</p>

<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/stoned-selfies">stoned-selfies</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation of STONED algorithms</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nigam, A. K., Pollice, R., Krenn, M., dos Passos Gomes, G., &amp; Aspuru-Guzik, A. (2021). Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES. <em>Chemical Science</em>, 12(20), 7079-7090. <a href="https://doi.org/10.1039/d1sc00231g">https://doi.org/10.1039/d1sc00231g</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{nigam2021stoned,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery ({STONED}) algorithm for molecules using {SELFIES}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nigam, AkshatKumar and Pollice, Robert and Krenn, Mario and dos Passos Gomes, Gabriel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{7079--7090}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d1sc00231g}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPECTRA: Evaluating Generalizability of Molecular AI</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/spectra-evaluating-generalizability-molecular-ai/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/spectra-evaluating-generalizability-molecular-ai/</guid><description>SPECTRA evaluates ML model generalizability on molecular datasets by plotting performance across a spectrum of train-test overlap levels.</description><content:encoded><![CDATA[<h2 id="a-spectral-framework-for-evaluating-molecular-ml-generalizability">A Spectral Framework for Evaluating Molecular ML Generalizability</h2>
<p>This is a <strong>Method</strong> paper that introduces SPECTRA (SPECtral framework for model evaluaTion on moleculaR dAtasets), a systematic approach for evaluating how well machine learning models generalize on molecular sequencing data. The primary contribution is a framework that generates train-test splits with controlled, decreasing levels of overlap, producing a spectral performance curve (SPC) and a single summary metric, the area under the spectral performance curve (AUSPC), for comparing model generalizability across tasks and architectures.</p>
<h2 id="why-existing-molecular-benchmarks-overestimate-generalizability">Why Existing Molecular Benchmarks Overestimate Generalizability</h2>
<p>Deep learning has achieved high performance on molecular sequencing benchmarks, but a persistent gap exists between benchmark performance and real-world deployment. The authors identify the root cause: existing evaluation approaches use either metadata-based (MB) splits or similarity-based (SB) splits, both of which provide an incomplete picture of generalizability.</p>
<p>MB splits partition data by metadata properties (e.g., temporal splits, random splits) without controlling sequence similarity between train and test sets. This means high train-test similarity can inflate performance metrics. SB splits control similarity at a single threshold, but the model&rsquo;s behavior at other similarity levels remains unknown.</p>
<p>For example, the TAPE benchmark&rsquo;s remote homology family split has 97% cross-split overlap, while the superfamily split has 71%. Model accuracy drops by 50% between these two points, yet the full curve of performance degradation is never characterized. This gap between evaluated and real-world overlap levels leads to overoptimistic deployment expectations, as demonstrated by the case of <a href="https://en.wikipedia.org/wiki/Rifampicin">rifampicin</a> resistance prediction in <em>M. tuberculosis</em>, where commercial genotypic assays later proved unreliable in specific geographic regions.</p>
<h2 id="the-spectra-framework-spectral-properties-graphs-and-performance-curves">The SPECTRA Framework: Spectral Properties, Graphs, and Performance Curves</h2>
<p>SPECTRA takes three inputs: a molecular sequencing dataset, a machine learning model, and a spectral property definition. A spectral property (SP) is a molecular sequence property expected to influence model generalizability for a specific task. For sequence-to-sequence datasets, the spectral property is typically sequence identity (proportion of aligned positions &gt; 0.3). For mutational scan datasets, it is defined by sample barcodes (string representations of mutations present in each sample).</p>
<h3 id="spectral-property-graph-construction">Spectral Property Graph Construction</h3>
<p>SPECTRA constructs a spectral property graph (SPG) where nodes represent samples and edges connect samples that share the spectral property. The goal is to generate train-test splits with controlled levels of cross-split overlap by finding approximate <a href="https://en.wikipedia.org/wiki/Maximal_independent_set">maximal independent sets</a> of this graph.</p>
<p>Finding a maximum independent set exactly is NP-hard, so SPECTRA uses a greedy randomized approximation parameterized by a spectral parameter $\mathbf{SP} \in [0, 1]$:</p>
<ol>
<li>Randomly order SPG vertices</li>
<li>Select the first vertex and delete each neighbor with probability equal to $\mathbf{SP}$</li>
<li>Continue until no vertices remain</li>
</ol>
<p>When $\mathbf{SP} = 0$, this produces a random split (maximum cross-split overlap). When $\mathbf{SP} = 1$, it approximates the maximal independent set (minimum cross-split overlap). For each spectral parameter value (incremented by 0.05 from 0 to 1), three splits with different random seeds are generated.</p>
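<p>The three-step greedy procedure above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: <code>adjacency</code> is the SPG as a dict of neighbor sets, and the function returns only the selected vertex set, omitting the downstream assignment of samples to train and test.</p>

```python
import random

def spectra_split(adjacency, sp, seed=0):
    """Greedy randomized independent-set selection on a spectral property
    graph. sp in [0, 1]: sp=0 deletes no neighbors (maximum cross-split
    overlap); sp=1 deletes all neighbors, approximating a maximal
    independent set (minimum cross-split overlap)."""
    rng = random.Random(seed)
    remaining = set(adjacency)
    order = list(adjacency)
    rng.shuffle(order)            # step 1: randomly order SPG vertices
    selected = set()
    for v in order:
        if v not in remaining:
            continue
        selected.add(v)           # step 2: select the next surviving vertex
        remaining.discard(v)
        for u in adjacency[v]:    # ...and delete each neighbor w.p. sp
            if u in remaining and rng.random() < sp:
                remaining.discard(u)
    return selected               # step 3: done when no vertices remain

# Toy SPG: a triangle (mutually overlapping samples) plus an isolated node.
spg = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
mis = spectra_split(spg, sp=1.0)  # one triangle vertex plus the isolated node
```

<p>At <code>sp=1.0</code> every neighbor of a selected vertex is deleted, so the triangle contributes exactly one vertex; at <code>sp=0.0</code> nothing is deleted and all vertices survive.</p>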
<h3 id="the-spectral-performance-curve-and-auspc">The Spectral Performance Curve and AUSPC</h3>
<p>The model is trained and evaluated on each split. Plotting test performance against the spectral parameter produces the spectral performance curve (SPC). The area under this curve, the AUSPC, serves as a single summary metric for model generalizability that captures behavior across the full spectrum of train-test overlap.</p>
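<p>A minimal way to collapse an SPC into the AUSPC is numerical quadrature over the sampled spectral parameters. Trapezoidal integration, as below, is an assumption of this sketch rather than a detail taken from the paper; any standard quadrature gives a comparable summary.</p>

```python
def auspc(spectral_params, performances):
    """Area under the spectral performance curve via the trapezoidal rule.
    Inputs are parallel lists: spectral parameter values and the test
    performance measured on the corresponding split."""
    pairs = sorted(zip(spectral_params, performances))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area

# A flat curve at 0.8 over the full [0, 1] range integrates to 0.8.
sps = [i * 0.05 for i in range(21)]  # 0.05 increments, as in the paper
flat = auspc(sps, [0.8] * 21)
```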
<h3 id="handling-mutational-scan-datasets">Handling Mutational Scan Datasets</h3>
<p>For mutational scan datasets where sample barcodes map to multiple samples, SPECTRA introduces two modifications: (1) weighting nodes in the SPG by the number of samples they represent, and (2) running a subset sum algorithm to ensure 80/20 train-test splits by sample count.</p>
<h2 id="evaluation-across-18-datasets-and-19-models">Evaluation Across 18 Datasets and 19 Models</h2>
<p>The authors apply SPECTRA to 18 molecular sequencing datasets spanning three benchmarks (TAPE, PEER, ProteinGym) plus PDBBind, evaluating 19 models including CNNs, LSTMs, GNNs (GearNet), LLMs (ESM2), diffusion models (DiffDock), variational autoencoders (EVE), and logistic regression.</p>
<h3 id="benchmark-datasets">Benchmark Datasets</h3>
<p>The core evaluation covers five primary tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Dataset</th>
          <th>Type</th>
          <th>Metric</th>
          <th>Samples</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Rifampicin resistance (RIF)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>17,474</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Isoniazid">Isoniazid</a> resistance (INH)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>26,574</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Pyrazinamide">Pyrazinamide</a> resistance (PZA)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>12,146</td>
      </tr>
      <tr>
          <td>Fluorescence prediction</td>
          <td><a href="https://en.wikipedia.org/wiki/Green_fluorescent_protein">GFP</a> variants</td>
          <td>MSD</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>54,024</td>
      </tr>
      <tr>
          <td>Vaccine escape</td>
          <td>SARS-CoV-2 RBD</td>
          <td>MSD</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>438,046</td>
      </tr>
  </tbody>
</table>
<p>Additional benchmarks include remote homology detection, secondary structure prediction, subcellular localization, and protein-ligand binding (PDBBind, Astex diverse set, Posebusters).</p>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>Eight models were evaluated in depth across the five primary tasks: logistic regression, CNN, ESM2 (pretrained), ESM2-Finetuned, GearNet, GearNet-Finetuned, EVE, and SeqDesign. Additional models (LSTM, ResNet, DeepSF, Transformer, HHblits, Equibind, DiffDock, TankBind, Transception, MSA Transformer, ESM1v, Progen2) were evaluated on specific benchmark tasks.</p>
<h3 id="existing-splits-as-points-on-the-spc">Existing Splits as Points on the SPC</h3>
<p>SPECTRA reveals that existing benchmark splits correspond to specific points on the spectral performance curve. For instance:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Benchmark Split</th>
          <th>Cross-Split Overlap</th>
          <th>Spectral Parameter</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Remote homology</td>
          <td>TAPE family</td>
          <td>97%</td>
          <td>0.025</td>
      </tr>
      <tr>
          <td>Remote homology</td>
          <td>TAPE superfamily</td>
          <td>71%</td>
          <td>0.475</td>
      </tr>
      <tr>
          <td>Secondary structure</td>
          <td>CASP12</td>
          <td>48%</td>
          <td>0.5</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>Equibind temporal</td>
          <td>76%</td>
          <td>0.55</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>LPPDBind similarity</td>
          <td>91%</td>
          <td>0.275</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>Posebusters</td>
          <td>70%</td>
          <td>0.575</td>
      </tr>
  </tbody>
</table>
<h2 id="performance-degradation-and-foundation-model-insights">Performance Degradation and Foundation Model Insights</h2>
<h3 id="universal-performance-decline">Universal Performance Decline</h3>
<p>All evaluated models demonstrate decreased performance as cross-split overlap decreases. Logistic regression drops from AUROC &gt; 0.9 to 0.5 for rifampicin resistance. ESM2-Finetuned decreases from Spearman&rsquo;s $\rho &gt; 0.9$ to less than 0.4 for GFP fluorescence prediction.</p>
<p>No single model achieves the highest AUSPC across all tasks. CNN maintains AUSPC &gt; 0.6 across all tasks but is surpassed by ESM2-Finetuned and ESM2 on rifampicin resistance. Some models retain reasonable performance even at $\mathbf{SP} = 1$ (minimal overlap): ESM2, ESM2-Finetuned, and CNN maintain AUROC &gt; 0.7 for RIF and PZA at this extreme.</p>
<h3 id="uncovering-hidden-spectral-properties">Uncovering Hidden Spectral Properties</h3>
<p>SPECTRA can detect unconsidered spectral properties through high variance in model performance at fixed spectral parameters. For rifampicin resistance, the CNN shows high variance at $\mathbf{SP} = 0.9$, $0.95$, and $1.0$ (standard deviations of 0.09, 0.10, and 0.08 respectively).</p>
<p>The authors trace this to the rifampicin resistance determining region (RRDR), a 26-amino-acid region of the rpoB gene. They define diff-RRDR as:</p>
<p>$$
\text{diff-RRDR} = \left(\max\left(\text{position}_{\text{train}}\right) - \max\left(\text{position}_{\text{test}}\right)\right) + \left(\min\left(\text{position}_{\text{train}}\right) - \min\left(\text{position}_{\text{test}}\right)\right)
$$</p>
<p>diff-RRDR correlates with CNN performance variance (Spearman&rsquo;s $\rho = -0.51$, p-value $= 1.79 \times 10^{-5}$) but not with ESM2 performance. The authors attribute this to ESM2&rsquo;s larger context window (512 positions vs. CNN&rsquo;s 12), making it more invariant to positional shifts in resistance-determining mutations.</p>
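<p>The definition translates directly into code; <code>train_positions</code> and <code>test_positions</code> below are hypothetical lists of mutated RRDR positions present in each split.</p>

```python
def diff_rrdr(train_positions, test_positions):
    """diff-RRDR as defined above: the offset between the ranges of
    mutated positions covered by the train and test splits."""
    return ((max(train_positions) - max(test_positions))
            + (min(train_positions) - min(test_positions)))

d = diff_rrdr([5, 10, 20], [8, 12])  # (20 - 12) + (5 - 8)
```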
<h3 id="foundation-model-generalizability">Foundation Model Generalizability</h3>
<p>For protein foundation models, SPECTRA reveals that AUSPC correlates with the similarity between task-specific datasets and the pretraining dataset. ESM2&rsquo;s AUSPC varies from 0.91 (RIF) to 0.26 (SARS-CoV-2). The correlation between UniRef50 overlap and AUSPC is strong (Spearman&rsquo;s $\rho = 0.9$, p-value $= 1.4 \times 10^{-27}$).</p>
<p>This finding holds across multiple foundation models (Transception, MSA Transformer, ESM1v, Progen2) evaluated on five ProteinGym datasets (Spearman&rsquo;s $\rho = 0.9$, p-value $= 0.04$). Fine-tuning improves AUSPC for tasks with low pretraining overlap (PZA, SARS-CoV-2, GFP).</p>
<h3 id="computational-cost">Computational Cost</h3>
<p>Generating SPECTRA splits ranges from 5 minutes (amyloid beta aggregation) to 9 hours (PDBBind). Generating spectral performance curves ranges from 1 hour (logistic regression) to 5 days (ESM2-Finetuned). The authors recommend releasing SPECTRA splits alongside new benchmarks to amortize this cost.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Spectral property selection is pivotal</strong>: The choice of spectral property must be biologically informed and task-specific. Standardized definitions across the community are needed.</li>
<li><strong>Computational cost</strong>: Running SPECTRA is expensive, especially for large models. The authors mitigate this with multi-core CPU parallelization and multi-GPU training.</li>
<li><strong>Not a model ranking tool</strong>: SPECTRA is designed for understanding generalizability patterns, not for ranking models. Proper ranking requires averaging AUSPCs across many tasks in a standardized benchmark.</li>
<li><strong>Spectral parameter vs. cross-split overlap</strong>: The minimal achievable cross-split overlap varies across tasks, so SPECTRA plots performance against the spectral parameter rather than against overlap directly. As a result, the AUSPC reflects the impact on performance per unit change in the spectral parameter, not per unit decrease in overlap.</li>
</ul>
<p>The authors envision SPECTRA as a foundation for next-generation molecular benchmarks that explicitly characterize generalizability across the full spectrum of distribution shift, applicable beyond molecular data to small molecule therapeutics, inverse protein folding, and patient-level clinical datasets.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All data used in this study is publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>TB RIF resistance</td>
          <td>17,474 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TB INH resistance</td>
          <td>26,574 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TB PZA resistance</td>
          <td>12,146 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>GFP fluorescence</td>
          <td>54,024 samples</td>
          <td>From Sarkisyan et al. (2016)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>SARS-CoV-2 escape</td>
          <td>438,046 samples</td>
          <td>From Greaney et al. (2021)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>TAPE (remote homology, secondary structure)</td>
          <td>Various</td>
          <td>From Rao et al. (2019)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>PEER (subcellular localization)</td>
          <td>13,949 samples</td>
          <td>From Xu et al. (2022)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>ProteinGym (amyloid, RRM)</td>
          <td>Various</td>
          <td>From Notin et al. (2022)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>PDBBind (protein-ligand binding)</td>
          <td>14,993-16,742 complexes</td>
          <td>From Wang et al. (2005)</td>
      </tr>
  </tbody>
</table>
<p>Data is also available on <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN">Harvard Dataverse</a>.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Spectral property comparison uses Biopython pairwise alignment (match=1, mismatch=-2, gap=-2.5) with a 0.3 similarity threshold for sequence-to-sequence datasets</li>
<li>Greedy randomized maximal independent set approximation for split generation</li>
<li>Spectral parameter incremented in 0.05 steps from 0 to 1</li>
<li>Three random seeds per spectral parameter value</li>
<li>80/20 train-test split ratio enforced via subset sum for mutational scan datasets</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>ESM2: 650M parameter version from Lin et al. (2023)</li>
<li>ESM2-Finetuned: First 30 layers frozen, masked language head replaced with linear prediction layer</li>
<li>GearNet and GearNet-Finetuned: Protein structures generated via ESMFold</li>
<li>CNN: Architecture from Green et al. (2022), one-hot encoded sequences</li>
<li>Logistic regression: One-hot encoded mutational barcodes</li>
<li>EVE and SeqDesign: MSAs constructed via Jackhmmer against UniRef100</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>TB resistance (RIF, INH, PZA)</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>GFP fluorescence, SARS-CoV-2 escape</td>
          <td>Regression tasks</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Remote homology, secondary structure, subcellular localization</td>
          <td>Per-label/class accuracy</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Protein-ligand binding</td>
          <td>Predicted vs. actual complex</td>
      </tr>
      <tr>
          <td>AUSPC</td>
          <td>All tasks</td>
          <td>Area under spectral performance curve</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Most models: 1x Tesla A10 GPU</li>
<li>ESM2-Finetuned: 4x Tesla A100 GPUs on Azure cluster</li>
<li>Hyperparameter optimization: Weights &amp; Biases random search over learning rate</li>
<li>All code in PyTorch</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mims-harvard/SPECTRA">SPECTRA Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework implementation and reproduction scripts</td>
      </tr>
      <tr>
          <td><a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN">Harvard Dataverse</a></td>
          <td>Dataset</td>
          <td>CC0 1.0</td>
          <td>All datasets and generated splits</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ektefaie, Y., Shen, A., Bykova, D., Marin, M. G., Zitnik, M., &amp; Farhat, M. (2024). Evaluating generalizability of artificial intelligence models for molecular datasets. <em>Nature Machine Intelligence</em>, 6(12), 1512-1524. <a href="https://doi.org/10.1038/s42256-024-00931-6">https://doi.org/10.1038/s42256-024-00931-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ektefaie2024evaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evaluating generalizability of artificial intelligence models for molecular datasets}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin, Maximillian G. and Zitnik, Marinka and Farhat, Maha}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1512--1524}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-024-00931-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Perplexity for Molecule Ranking and CLM Bias Detection</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/perplexity-molecule-ranking-bias-clms/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/perplexity-molecule-ranking-bias-clms/</guid><description>Perplexity scoring enables intrinsic molecule ranking and pretraining bias detection in chemical language models for de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-method-for-intrinsic-scoring-and-bias-detection-in-chemical-language-models">A Method for Intrinsic Scoring and Bias Detection in Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces two contributions to the chemical language model (CLM) pipeline for <a href="/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/">de novo molecular design</a>. First, the authors propose using perplexity as a model-intrinsic score to rank generated <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings by how well they match the design objectives encoded in the fine-tuning data. Second, they introduce a &ldquo;delta score&rdquo; that compares molecule rankings from pretrained and fine-tuned CLMs to detect pretraining bias, where molecules are generated primarily based on generic pretraining knowledge rather than task-specific fine-tuning objectives.</p>
<h2 id="the-ranking-and-bias-problem-in-clm-based-molecule-generation">The Ranking and Bias Problem in CLM-Based Molecule Generation</h2>
<p>Chemical language models generate new molecules as SMILES strings by iteratively predicting the next character based on learned probability distributions. After training, CLMs can produce large virtual libraries of candidate molecules via multinomial sampling. However, two key challenges remain: (1) the generated molecules lack a natural ranking, requiring external scoring methods such as similarity assessment or activity prediction for prioritization, and (2) <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">transfer learning</a> (pretraining on a large corpus followed by fine-tuning on a small target set) can introduce &ldquo;pretraining bias,&rdquo; where some generated molecules reflect generic chemical knowledge from pretraining rather than the specific design objectives of the fine-tuning data.</p>
<p>Beam search offers an alternative sampling approach that produces inherently ranked molecules by greedily selecting the most probable SMILES strings. However, beam search explores only a narrow portion of chemical space. The authors sought to combine the ranking advantage of beam search with the chemical space exploration of multinomial sampling by applying perplexity scoring as a post-hoc ranking criterion.</p>
<h2 id="perplexity-scoring-and-the-delta-score-for-bias-estimation">Perplexity Scoring and the Delta Score for Bias Estimation</h2>
<p>The core innovation is the application of <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a>, a standard evaluation metric from natural language processing, to score SMILES strings generated by CLMs. For a SMILES string of length $N$ with character probabilities $p_i$ assigned by the CLM, perplexity is computed as:</p>
<p>$$
\text{perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_{2}(p_{i})}
$$</p>
<p>Low perplexity indicates that the CLM assigns high probability to each character in the SMILES string, suggesting the molecule closely matches the learned distribution of the fine-tuning data. The metric is normalized by string length, making it comparable across molecules of different sizes.</p>
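<p>In code, the length-normalized perplexity of a single SMILES string is a short reduction over the per-character probabilities. This sketch assumes base-2 logarithms to match the $2^{(\cdot)}$ form of the formula:</p>

```python
import math

def smiles_perplexity(char_probs):
    """Length-normalized perplexity of one SMILES string, given the
    probability the CLM assigned to each of its characters."""
    n = len(char_probs)
    return 2 ** (-sum(math.log2(p) for p in char_probs) / n)

# A string whose every character receives probability 0.5 has perplexity 2:
# the model is, on average, choosing between two equally likely characters.
pp = smiles_perplexity([0.5, 0.5, 0.5, 0.5])
```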
<p>To address pretraining bias, the authors introduce a delta score. For each generated molecule, the perplexity-based rank from the fine-tuned model ($\text{rank}_{ft}$) is compared against the rank from the pretrained model ($\text{rank}_{pt}$):</p>
<p>$$
\text{delta} = \text{rank}_{ft} - \text{rank}_{pt}
$$</p>
<p>A positive delta score indicates that the fine-tuned model ranks the molecule higher than the pretrained model, suggesting the molecule was generated based on task-specific fine-tuning knowledge. A negative delta score flags molecules that may have been generated primarily from pretraining information, which do not necessarily match the design objectives.</p>
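<p>A minimal sketch of the delta score: rank the same set of generated molecules by perplexity under each model, then subtract the ranks per molecule. The paper does not pin down the rank orientation here, so the convention below (rank 1 = lowest perplexity) and the resulting signs are illustrative assumptions.</p>

```python
def ranks(values):
    """Rank positions for a list of scores (rank 1 = lowest value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def delta_scores(ppl_finetuned, ppl_pretrained):
    """delta = rank_ft - rank_pt per molecule, ranking by perplexity
    under the fine-tuned and pretrained CLMs respectively."""
    rf, rp = ranks(ppl_finetuned), ranks(ppl_pretrained)
    return [a - b for a, b in zip(rf, rp)]

# Molecule 0 is the fine-tuned model's best match but the pretrained
# model's worst, so its rank shifts the most between the two models.
d = delta_scores([1.2, 3.0, 5.0], [9.0, 2.0, 4.0])
```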
<p>The multinomial sampling probability for each character is computed via the softmax function:</p>
<p>$$
p_{i} = \frac{e^{z_{i}/T}}{\sum_{j} e^{z_{j}/T}}
$$</p>
<p>where $z_{i}$ is the CLM output logit for the $i$th character, $j$ runs over all dictionary characters, and $T$ is the temperature parameter (set to $T = 1$ in this study).</p>
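<p>The temperature-scaled softmax can be sketched as below. The max-subtraction is a standard numerical-stability trick, not something specified in the paper; raising $T$ above 1 flattens the distribution and makes sampling more exploratory.</p>

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Sampling distribution over the character dictionary from CLM logits."""
    m = max(z / T for z in logits)                 # stabilize the exponent
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax_with_temperature([2.0, 1.0, 0.0], T=1.0)
```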
<h2 id="experimental-setup-10-protein-targets-across-four-data-regimes">Experimental Setup: 10 Protein Targets Across Four Data Regimes</h2>
<p>The authors systematically evaluated perplexity scoring across 10 macromolecular targets and four low-data fine-tuning regimes (5, 10, 20, and 40 molecules per target).</p>
<p><strong>Model architecture</strong>: A four-layer LSTM-based RNN (5,820,515 parameters) with batch normalization layers, LSTM layers of 1024 and 256 units, trained using the Adam optimizer with a learning rate of $10^{-4}$.</p>
<p><strong>Pretraining</strong>: The model was pretrained on 1,683,181 molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> (version 28), encoded as canonical SMILES (20-90 characters), for 90 epochs.</p>
<p><strong>Fine-tuning</strong>: For each of 10 randomly selected protein targets (Table 1), bioactive ligands with pChEMBL &gt; 6 were selected. Fine-tuning sets of 5, 10, 20, and 40 molecules were compiled for each target. Fine-tuning ran for 100 epochs, with 1,000 SMILES strings sampled every second epoch via multinomial sampling ($T = 1$).</p>
<table>
  <thead>
      <tr>
          <th>CHEMBL ID</th>
          <th>Target</th>
          <th>Protein Classification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CHEMBL1836</td>
          <td>Prostanoid EP4 receptor</td>
          <td><a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">G protein-coupled receptor</a></td>
      </tr>
      <tr>
          <td>CHEMBL1945</td>
          <td>Melatonin receptor 1A</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL1983</td>
          <td>Serotonin 1D (5-HT1D) receptor</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL202</td>
          <td><a href="https://en.wikipedia.org/wiki/Dihydrofolate_reductase">Dihydrofolate reductase</a></td>
          <td>Oxidoreductase</td>
      </tr>
      <tr>
          <td>CHEMBL3522</td>
          <td><a href="https://en.wikipedia.org/wiki/Cytochrome_P450">Cytochrome P450</a> 17A1</td>
          <td>Cytochrome P450</td>
      </tr>
      <tr>
          <td>CHEMBL4029</td>
          <td>Interleukin-8 receptor A</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL5073</td>
          <td>CaM kinase I delta</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5137</td>
          <td>Metabotropic glutamate receptor 2</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL5408</td>
          <td>Serine/threonine-protein kinase TBK1</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5608</td>
          <td>NT-3 growth factor receptor</td>
          <td>Kinase</td>
      </tr>
  </tbody>
</table>
<p><strong>Sampling comparison</strong>: Beam search sampling was performed with beam widths $k = 10$ and $k = 50$ for comparison against multinomial sampling.</p>
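<p>For intuition about the comparison, beam search over a character model can be sketched as below. Here <code>next_probs</code> is a toy stand-in for the CLM's conditional distribution, not the authors' implementation:</p>

```python
import math

def beam_search(next_probs, k=10, max_len=5, eos="$"):
    """Toy beam search over a character model.

    next_probs(prefix) must return {char: probability} for the next
    character given a prefix. Returns finished sequences sorted by
    cumulative log-probability, best first.
    """
    beams = [("", 0.0)]  # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for ch, p in next_probs(prefix).items():
                cand = (prefix + ch, logp + math.log(p))
                (finished if ch == eos else candidates).append(cand)
        # Keep only the k most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)
```

Because beam search deterministically keeps the highest-probability continuations, it concentrates on a few modes; multinomial sampling instead draws diverse sequences in proportion to their probability.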
<p><strong>Molecular similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> was computed using Morgan fingerprints (radius 2, length 1024) and 2D <a href="https://en.wikipedia.org/wiki/Pharmacophore">pharmacophore</a> fingerprints via RDKit (2019.03.2).</p>
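<p>The Tanimoto similarity itself reduces to a Jaccard index over fingerprint bits. A dependency-free sketch over sets of on-bit indices (the paper computes the bits with RDKit Morgan fingerprints, which are not reproduced here):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given
    as sets of on-bit indices: |intersection| / |union|.

    In the paper the bit sets come from RDKit Morgan fingerprints
    (radius 2, 1024 bits); plain Python sets stand in for them here.
    """
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention for two empty fingerprints
    return len(a & b) / len(a | b)
```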
<h2 id="key-findings-multinomial-sampling-outperforms-beam-search">Key Findings: Multinomial Sampling Outperforms Beam Search</h2>
<p><strong>Perplexity correlates with molecular similarity.</strong> The Pearson correlation between perplexity and Tanimoto distance to the fine-tuning set stabilized at approximately 0.5 across all data regimes. This correlation emerged earlier with larger fine-tuning sets. The result confirms that perplexity captures both substructural and pharmacophore features while also incorporating additional CLM-learned information.</p>
<p><strong>Multinomial sampling produces better-ranked molecules than beam search.</strong> With the smallest fine-tuning sets (5 molecules), the top 50 molecules from multinomial sampling consistently exhibited lower (better) perplexity values than beam search at $k = 10$ or $k = 50$. Increasing the beam width from 10 to 50 did not markedly improve beam search performance. For novel molecules (Tanimoto similarity below 50% to the nearest fine-tuning compound), multinomial sampling identified lower-perplexity molecules in 72% of cases with the smallest fine-tuning sets.</p>
<p><strong>Perplexity scoring narrows the quality distribution.</strong> The top 50 molecules selected by perplexity from multinomial sampling spanned a narrower range of perplexity values compared to beam search, suggesting a more consistent pool of high-quality candidates for follow-up synthesis.</p>
<p><strong>Pretraining bias is substantial.</strong> The delta score analysis revealed that more than 40% of sampled molecules had negative delta scores during the first 20 fine-tuning epochs, meaning they were ranked higher by the pretrained model than the fine-tuned model. This fraction remained above 10% even at the end of 100 fine-tuning epochs across all data regimes, confirming that 10-40% of generated molecules reflect &ldquo;generic&rdquo; pretraining rather than task-focused fine-tuning.</p>
<p><strong>Perplexity alone partially mitigates bias.</strong> Among the top 50 molecules selected by perplexity from multinomial sampling, only up to 3% had negative delta scores, compared to 10-40% in the unfiltered population. This suggests that perplexity-based ranking already reduces pretraining bias, though the delta score provides additional filtering power.</p>
<p><strong>SMILES validity remained high.</strong> Mean SMILES string validity consistently exceeded 90% across all fine-tuned models and fine-tuning epochs.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note several limitations and future directions. The study used a fixed temperature of $T = 1$ for multinomial sampling; combining perplexity with temperature tuning or <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">SMILES augmentation</a> remains unexplored. The evaluation focused on 10 protein targets, and broader validation across diverse target classes would strengthen the conclusions. The authors also suggest that combining CLMs with perplexity scoring could be applied to screen large collections of commercially available compounds, which has not yet been tested.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v28</td>
          <td>1,683,181 molecules</td>
          <td>Canonical SMILES, 20-90 characters, salts and duplicates removed</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>ChEMBL v28 (split)</td>
          <td>84,160 molecules</td>
          <td>Random split from pretraining set</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ChEMBL v28 (per target)</td>
          <td>5, 10, 20, or 40 molecules</td>
          <td>pChEMBL &gt; 6, 10 targets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>LSTM-based CLM with character-level SMILES prediction</li>
<li>Multinomial sampling at $T = 1$</li>
<li>Beam search at $k = 10$ and $k = 50$</li>
<li>Perplexity computed per Equation 1; delta score per Equation 2</li>
<li>Adam optimizer, learning rate $10^{-4}$, 90 pretraining epochs, 100 fine-tuning epochs</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4-layer LSTM RNN: batch normalization, LSTM (1024 units), LSTM (256 units), batch normalization</li>
<li>5,820,515 parameters total</li>
<li>One-hot encoded SMILES input</li>
<li>Pretrained weights available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perplexity</td>
          <td>Model confidence in generated SMILES</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Delta score</td>
          <td>Rank difference between fine-tuned and pretrained models</td>
          <td>Positive indicates task-relevant generation</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Morgan and pharmacophore fingerprints</td>
          <td>Compared to fine-tuning set</td>
      </tr>
      <tr>
          <td>Pearson correlation</td>
          <td>Perplexity vs. Tanimoto distance</td>
          <td>Stabilizes at ~0.5</td>
      </tr>
      <tr>
          <td>SMILES validity</td>
          <td>Fraction of valid SMILES strings</td>
          <td>Consistently &gt; 90%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The implementation uses Keras (v2.2.0) with TensorFlow GPU backend (v1.9.0).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ETHmodlab/CLM_perplexity">CLM_perplexity</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework, pretrained weights, and training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ETHmodlab/molecular_design_with_beam_search">Beam search implementation</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Referenced beam search implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Moret, M., Grisoni, F., Katzberger, P., &amp; Schneider, G. (2022). Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models. <em>Journal of Chemical Information and Modeling</em>, 62(5), 1199-1206. <a href="https://doi.org/10.1021/acs.jcim.2c00079">https://doi.org/10.1021/acs.jcim.2c00079</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ETHmodlab/CLM_perplexity">GitHub: CLM_perplexity (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{moret2022perplexity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Moret, Michael and Grisoni, Francesca and Katzberger, Paul and Schneider, Gisbert}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1199--1206}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c00079}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph-Based GA and MCTS Generative Model for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/</guid><description>Jensen introduces a graph-based genetic algorithm and generative model with MCTS that outperforms ML methods for penalized logP optimization.</description><content:encoded><![CDATA[<h2 id="a-graph-based-approach-to-molecular-optimization">A Graph-Based Approach to Molecular Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces two graph-based approaches for exploring chemical space: a genetic algorithm (GB-GA) and a generative model combined with <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte Carlo tree search</a> (GB-GM-MCTS). The primary contribution is demonstrating that these non-ML, graph-based methods can match or exceed the performance of contemporary ML-based generative models for molecular property optimization, while being several orders of magnitude faster. The paper provides open-source implementations built on the RDKit cheminformatics package. The two approaches explore <a href="https://en.wikipedia.org/wiki/Chemical_space">chemical space</a> using direct graph manipulations rather than string-based representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</p>
<h2 id="why-compare-simple-baselines-to-ml-generative-models">Why Compare Simple Baselines to ML Generative Models?</h2>
<p>By 2018, several ML-based generative models for molecules had been published, including VAEs, RNNs, and graph convolutional policy networks. However, these models were rarely compared against traditional optimization approaches such as genetic algorithms. Jensen identifies this gap explicitly: while ML generative model performance had been impressive, the lack of comparison to simpler baselines made it difficult to assess whether the complexity of ML approaches was justified.</p>
<p>A practical barrier to such comparisons was the absence of free, open-source GA implementations for molecular optimization (the existing ACSESS algorithm required proprietary OpenEye toolkits). This paper fills that gap by providing RDKit-based implementations of both the GB-GA and GB-GM-MCTS.</p>
<h2 id="graph-based-crossovers-mutations-and-monte-carlo-tree-search">Graph-Based Crossovers, Mutations, and Monte Carlo Tree Search</h2>
<h3 id="gb-ga-crossovers-and-mutations-on-molecular-graphs">GB-GA: Crossovers and Mutations on Molecular Graphs</h3>
<p>The GB-GA operates directly on molecular graph representations (not string representations like SMILES). It combines ideas from Brown et al. (2004) and the ACSESS algorithm of Virshup et al. (2013).</p>
<p><strong>Crossovers</strong> can occur at two types of positions with equal probability:</p>
<ul>
<li>Non-ring bonds: a molecule is cut at a non-ring bond, and fragments from two parent molecules are recombined</li>
<li>Ring bonds: adjacent bonds or bonds separated by one bond are cut, and fragments are mated using single or double bonds</li>
</ul>
<p><strong>Mutations</strong> include seven operation types, each with specified probabilities:</p>
<ul>
<li>Append atom (15%): adds an atom with a single, double, or triple bond</li>
<li>Insert atom (15%): inserts an atom into an existing bond</li>
<li>Delete atom (14%): removes an atom, reconnecting neighbors</li>
<li>Change atom type (14%): swaps element identity (C, N, O, F, S, Cl, Br)</li>
<li>Change bond order (14%): toggles between single, double, and triple bonds</li>
<li>Delete ring bond (14%): opens a ring</li>
<li>Add ring bond (14%): closes a new ring</li>
</ul>
<p>Molecules with macrocycles (seven or more atoms), allene centers in rings, fewer than five heavy atoms, incorrect valences, or more non-H atoms than the target size are discarded. The target size is sampled from a normal distribution with mean 39.15 and standard deviation 3.50 non-H atoms, calibrated to match the molecules found by Yang et al. (2017).</p>
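<p>The mutation-type probabilities and target-size distribution above can be sketched as follows (illustrative helper names, not the paper's code):</p>

```python
import random

# The seven GB-GA mutation types with the probabilities from the
# paper (0.15 + 0.15 + 5 * 0.14 = 1.00).
MUTATIONS = [("append_atom", 0.15), ("insert_atom", 0.15),
             ("delete_atom", 0.14), ("change_atom", 0.14),
             ("change_bond", 0.14), ("delete_ring_bond", 0.14),
             ("add_ring_bond", 0.14)]

def pick_mutation(rng=random):
    """Choose one mutation type according to its probability."""
    names, weights = zip(*MUTATIONS)
    return rng.choices(names, weights=weights, k=1)[0]

def sample_target_size(rng=random):
    """Target molecule size (non-H atoms) drawn from the paper's
    normal distribution, clamped at the 5-heavy-atom minimum."""
    return max(5, round(rng.gauss(39.15, 3.50)))
```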
<h3 id="gb-gm-mcts-a-probabilistic-growth-model-with-tree-search">GB-GM-MCTS: A Probabilistic Growth Model with Tree Search</h3>
<p>The GB-GM grows molecules one atom at a time, with the choice of bond order and atom type determined probabilistically from a bonding analysis of a reference dataset (the first 1000 molecules from ZINC). Since 63% of atoms in the reference set are ring atoms, ring-creation or ring-insertion mutations are chosen 63% of the time.</p>
<p>The generative model is combined with a <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte Carlo tree search</a> where:</p>
<ul>
<li>Each node corresponds to an atom addition step</li>
<li>Leaf parallelization uses a maximum of 25 leaf nodes</li>
<li>The exploration factor is $1 / \sqrt{2}$</li>
<li>Rollout terminates if the molecule exceeds the target size</li>
<li>The reward function returns 1 if the predicted $J(\mathbf{m})$ value exceeds the largest value found so far, and 0 otherwise</li>
</ul>
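<p>The selection step implied by these settings can be sketched with a standard UCT score, assuming the common UCB1-style variant with exploration factor $c = 1/\sqrt{2}$ (the exact formula in the paper's implementation may differ):</p>

```python
import math

def uct_score(child_value, child_visits, parent_visits,
              c=1 / math.sqrt(2)):
    """UCT selection score for one child node.

    child_value accumulates the binary rewards (1 only when a rollout
    beat the best J(m) found so far), so exploitation favors branches
    that have repeatedly produced a new best molecule.
    """
    if child_visits == 0:
        return float("inf")  # unvisited children are tried first
    exploit = child_value / child_visits
    explore = c * math.sqrt(2 * math.log(parent_visits) / child_visits)
    return exploit + explore
```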
<h3 id="the-penalized-logp-objective">The Penalized logP Objective</h3>
<p>Both methods optimize the penalized logP score $J(\mathbf{m})$:</p>
<p>$$
J(\mathbf{m}) = \log P(\mathbf{m}) - \text{SA}(\mathbf{m}) - \text{RingPenalty}(\mathbf{m})
$$</p>
<p>where $\log P(\mathbf{m})$ is the <a href="https://en.wikipedia.org/wiki/Partition_coefficient">octanol-water partition coefficient</a> predicted by RDKit, $\text{SA}(\mathbf{m})$ is a synthetic accessibility score, and $\text{RingPenalty}(\mathbf{m})$ penalizes unrealistically large rings by reducing the score by $\text{RingSize} - 6$ for each oversized ring. Each property is normalized to zero mean and unit standard deviation across the ZINC dataset.</p>
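<p>A minimal sketch of this objective, assuming the per-term normalization statistics are supplied as (mean, std) pairs rather than computed from ZINC, and with logP and SA passed in as plain numbers (in the paper they come from RDKit):</p>

```python
def penalized_logp(logp, sa, ring_sizes, stats):
    """J(m) = logP - SA - RingPenalty, each term z-normalized.

    stats maps "logp", "sa", and "ring" to assumed (mean, std) pairs;
    in the paper these are computed across the ZINC dataset.
    """
    # Each oversized ring (> 6 atoms) contributes (size - 6).
    ring_pen = sum(size - 6 for size in ring_sizes if size > 6)

    def z(value, key):
        mean, std = stats[key]
        return (value - mean) / std

    return z(logp, "logp") - z(sa, "sa") - z(ring_pen, "ring")
```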
<h2 id="experimental-setup-and-comparisons-to-ml-methods">Experimental Setup and Comparisons to ML Methods</h2>
<h3 id="gb-ga-experiments">GB-GA Experiments</h3>
<p>Ten GA simulations were performed with a population size of 20 over 50 generations (1000 $J(\mathbf{m})$ evaluations per run). The initial mating pool was 20 random molecules from the first 1000 molecules in ZINC. Two mutation rates were tested: 50% and 1%.</p>
<h3 id="gb-gm-mcts-experiments">GB-GM-MCTS Experiments</h3>
<p>Ten simulations used ethane as a seed molecule with 1000 tree traversals per run. Additional experiments used 5000 traversals and an adjusted probability of generating $\text{C}=\text{C}-\text{C}$ ring patterns (increased from 62% to 80%).</p>
<h3 id="baselines">Baselines</h3>
<p>Results were compared to those compiled by Yang et al. (2017):</p>
<ul>
<li>ChemTS (RNN + MCTS)</li>
<li>RNN with and without Bayesian optimization</li>
<li><a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Continuous VAE (CVAE)</a></li>
<li><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE (GVAE)</a></li>
<li>Graph convolutional policy network (GCPN, from You et al. 2018)</li>
</ul>
<h3 id="key-results">Key Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Average $J(\mathbf{m})$</th>
          <th>Molecules Evaluated</th>
          <th>CPU Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GB-GA (50% mutation)</td>
          <td>6.8 +/- 0.7</td>
          <td>1000</td>
          <td>30 seconds</td>
      </tr>
      <tr>
          <td>GB-GA (1% mutation)</td>
          <td>7.4 +/- 0.9</td>
          <td>1000</td>
          <td>30 seconds</td>
      </tr>
      <tr>
          <td>GB-GM-MCTS (62%)</td>
          <td>2.6 +/- 0.6</td>
          <td>1000</td>
          <td>90 seconds</td>
      </tr>
      <tr>
          <td>GB-GM-MCTS (80%)</td>
          <td>3.4 +/- 0.6</td>
          <td>1000</td>
          <td>90 seconds</td>
      </tr>
      <tr>
          <td>GB-GM-MCTS (80%)</td>
          <td>4.3 +/- 0.6</td>
          <td>5000</td>
          <td>9 minutes</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.9 +/- 0.5</td>
          <td>~5000</td>
          <td>2 hours</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>5.6 +/- 0.5</td>
          <td>~20000</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>RNN + BO</td>
          <td>4.5 +/- 0.2</td>
          <td>~4000</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>Only RNN</td>
          <td>4.8 +/- 0.2</td>
          <td>~20000</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>CVAE + BO</td>
          <td>0.0 +/- 0.9</td>
          <td>~100</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>GVAE + BO</td>
          <td>0.2 +/- 1.3</td>
          <td>~1000</td>
          <td>8 hours</td>
      </tr>
  </tbody>
</table>
<p>The GB-GA with 1% mutation rate achieved an average maximum $J(\mathbf{m})$ of 7.4, which is 1.8 units higher than the best ML result (ChemTS at 5.6) while using 20x fewer evaluations and completing in 30 seconds versus 8 hours. The two highest-scoring individual molecules found by GB-GA had $J(\mathbf{m})$ scores of 8.8 and 8.5, exceeding the 7.8-8.0 range found by the GCPN approach. These molecules bore little resemblance to the initial mating pool (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarities</a> of 0.27 and 0.12 to the most similar ZINC molecules), indicating that the GA traversed a large distance in chemical space in just 50 generations.</p>
<p>The GB-GM-MCTS performed below ChemTS at equal evaluations (4.3 vs. 4.9 at 5000 evaluations) but was roughly an order of magnitude faster (9 minutes vs. 2 hours). The MCTS approach tended to extract the dominant hydrophobic structural motif (benzene rings) from the training set, making it more dependent on training set composition than the GA.</p>
<h2 id="simple-methods-set-a-high-bar-for-molecular-optimization">Simple Methods Set a High Bar for Molecular Optimization</h2>
<p>The central finding is that a simple graph-based genetic algorithm outperforms all tested ML-based generative models on penalized logP optimization, both in terms of solution quality and computational efficiency. The GB-GA achieves higher $J(\mathbf{m})$ scores with 1000 evaluations in 30 seconds than ML methods achieve with 20,000 evaluations over 8 hours.</p>
<p>Several additional observations emerge:</p>
<ol>
<li><strong>Chemical space traversal</strong>: The GB-GA can reach high-scoring molecules that are structurally distant from the starting population, with Tanimoto similarity as low as 0.12 to the nearest ZINC molecule.</li>
<li><strong>Mutation rate matters</strong>: A 1% mutation rate outperformed a 50% rate (7.4 vs. 6.8), suggesting that preserving more parental structure during crossover is beneficial for this objective.</li>
<li><strong>Training set dependence</strong>: The GB-GM-MCTS is more sensitive to training set composition than the GA. Its preference for benzene-ring-containing molecules (the dominant ZINC motif) limits its ability to discover alternative structural solutions like the long aliphatic chains favored by the GA.</li>
<li><strong>Generalizability caveat</strong>: Jensen explicitly notes that these comparisons cover only one property (penalized logP) and that similar comparisons for other properties are needed before drawing general conclusions.</li>
</ol>
<p>The paper&rsquo;s influence has been substantial: it helped establish the expectation that new molecular generative models should be benchmarked against genetic algorithm baselines, a position subsequently reinforced by Brown et al. (2019) in <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and by <a href="/notes/chemistry/molecular-design/generation/search-based/genetic-algorithms-molecule-generation-baselines/">Tripp and Hernandez-Lobato (2023)</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial mating pool / reference set</td>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a> (subset)</td>
          <td>First 1000 molecules</td>
          <td>Same subset used in previous studies (Gomez-Bombarelli et al., Yang et al.)</td>
      </tr>
      <tr>
          <td>Target molecule size</td>
          <td>Derived from Yang et al. results</td>
          <td>20 molecules</td>
          <td>Mean 39.15, SD 3.50 non-H atoms</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>GB-GA</strong>: Population size 20, 50 generations, mutation rates of 1% and 50% tested. Crossovers at ring and non-ring bonds with equal probability. Seven mutation types with specified probabilities. Molecules selected from mating pool based on normalized logP scores.</li>
<li><strong>GB-GM</strong>: Atom-by-atom growth using probabilistic rules derived from ZINC bonding analysis. Ring creation probability 63% (matching ZINC), with 80% variant also tested. Seed molecule: ethane.</li>
<li><strong>MCTS</strong>: Modified from haroldsultan/MCTS Python implementation. Leaf parallelization with max 25 leaf nodes. Exploration factor $1/\sqrt{2}$. Binary reward function (1 if new best, 0 otherwise).</li>
<li><strong>Property calculation</strong>: logP, SA score, and ring penalty all computed via RDKit. Each property normalized to zero mean and unit standard deviation across ZINC.</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. The GB-GA and GB-GM are purely algorithmic approaches parameterized by bonding statistics from the ZINC dataset.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GB-GA (1%)</th>
          <th>Best ML (ChemTS)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Average max $J(\mathbf{m})$</td>
          <td>7.4 +/- 0.9</td>
          <td>5.6 +/- 0.5</td>
          <td>Over 10 runs</td>
      </tr>
      <tr>
          <td>Single best $J(\mathbf{m})$</td>
          <td>8.8</td>
          <td>~8.0 (GCPN)</td>
          <td>GB-GA vs. You et al.</td>
      </tr>
      <tr>
          <td>Evaluations per run</td>
          <td>1000</td>
          <td>~20,000</td>
          <td>20x fewer for GB-GA</td>
      </tr>
      <tr>
          <td>CPU time per run</td>
          <td>30 seconds</td>
          <td>8 hours</td>
          <td>~960x faster</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All GB-GA and GB-GM experiments were run on a laptop. No GPU required. The GB-GA completes in 30 seconds per run and the GB-GM-MCTS in 90 seconds (1000 traversals) to 9 minutes (5000 traversals).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jensengroup/GB-GA/tree/v0.0">GB-GA (v0.0)</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Graph-based genetic algorithm, RDKit dependency only</td>
      </tr>
      <tr>
          <td><a href="https://github.com/jensengroup/GB-GM/tree/v0.0">GB-GM (v0.0)</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Graph-based generative model + MCTS, RDKit dependency only</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jensen, J. H. (2019). A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. <em>Chemical Science</em>, 10(12), 3567-3572. <a href="https://doi.org/10.1039/c8sc05372c">https://doi.org/10.1039/c8sc05372c</a></p>
<p><strong>Publication</strong>: Chemical Science (Royal Society of Chemistry), 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/jensengroup/GB-GA">GB-GA Code (GitHub)</a></li>
<li><a href="https://github.com/jensengroup/GB-GM">GB-GM Code (GitHub)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jensen2019graph,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jensen, Jan H.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3567--3572}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/c8sc05372c}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Frechet ChemNet Distance for Molecular Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/</guid><description>FCD uses ChemNet activations and the Wasserstein-2 distance to evaluate molecular generative models for chemical validity, biological relevance, and diversity.</description><content:encoded><![CDATA[<h2 id="a-unified-evaluation-metric-for-molecular-generation">A Unified Evaluation Metric for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Frechet ChemNet Distance (FCD), a single scalar metric for evaluating generative models that produce molecules for drug discovery. FCD adapts the Frechet Inception Distance (FID) from image generation to the molecular domain. By comparing distributions of learned representations from a drug-activity prediction network (ChemNet), FCD simultaneously captures whether generated molecules are chemically valid, biologically relevant, and structurally diverse.</p>
<h2 id="inconsistent-evaluation-of-molecular-generative-models">Inconsistent Evaluation of Molecular Generative Models</h2>
<p>At the time of this work (2018), deep generative models for molecules were proliferating: RNNs combined with <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoders</a>, reinforcement learning, and <a href="/posts/what-is-a-gan/">GANs</a> all produced <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings representing novel molecules. The evaluation landscape was fragmented. Different papers reported different metrics: percentage of valid SMILES, mean logP, druglikeness, synthetic accessibility (SA) scores, or internal diversity via Tanimoto distance.</p>
<p>This inconsistency created several problems. First, method comparison across publications was difficult because no common metric existed. Second, simple metrics like &ldquo;fraction of valid SMILES&rdquo; could be trivially maximized by generating short, simple molecules (e.g., &ldquo;CC&rdquo; or &ldquo;CCC&rdquo;). Third, individual property metrics (logP, druglikeness) each captured only one dimension of quality. A model could score well on logP but produce molecules that were not diverse or not biologically meaningful.</p>
<p>The authors argued that a good metric should capture three properties simultaneously: (1) chemical validity and similarity to real drug-like molecules, (2) biological relevance, and (3) diversity within the generated set.</p>
<h2 id="core-innovation-frechet-distance-over-chemnet-activations">Core Innovation: Frechet Distance over ChemNet Activations</h2>
<p>The key insight is to use a neural network trained on biological activity prediction as a feature extractor for molecules, then compare distributions of these features using the Frechet (Wasserstein-2) distance.</p>
<h3 id="chemnet-architecture">ChemNet Architecture</h3>
<p>ChemNet is a multi-task neural network trained to predict bioactivities across approximately 6,000 assays from three major drug discovery databases (ChEMBL, ZINC, PubChem). The architecture processes one-hot encoded SMILES strings through:</p>
<ol>
<li>Two 1D convolutional layers with SELU activations</li>
<li>A max-pooling layer</li>
<li>Two stacked LSTM layers</li>
<li>A fully connected output layer</li>
</ol>
<p>The penultimate layer (the second LSTM&rsquo;s hidden state after processing the full input sequence) serves as the molecular representation. Because ChemNet was trained to predict drug activities, its internal representations encode both chemical structure (from the input side) and biological function (from the output side).</p>
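<p>As a concrete illustration of the input side, a minimal one-hot SMILES encoder might look like the following. The character vocabulary and maximum length here are illustrative stand-ins, not the paper's actual tokenization:</p>

```python
import numpy as np

# Toy character set for illustration only -- ChemNet's real vocabulary
# covers the full SMILES alphabet used in its training databases.
VOCAB = sorted(set("CNOc1()=#[]+-Slnos"))
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_smiles(smiles, max_len=120):
    """One-hot encode a SMILES string to a (max_len, |vocab|) matrix,
    truncating long strings and zero-padding short ones."""
    x = np.zeros((max_len, len(VOCAB)), dtype=np.float32)
    for t, ch in enumerate(smiles[:max_len]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:  # skip characters outside the toy vocabulary
            x[t, idx] = 1.0
    return x
```

<p>The resulting matrix is what the convolutional front end of such a network would consume, one row per character position.</p>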
<h3 id="the-fcd-formula">The FCD Formula</h3>
<p>Given a set of real molecules and a set of generated molecules, FCD is computed as follows:</p>
<ol>
<li>Pass each molecule (as a SMILES string) through ChemNet and extract penultimate-layer activations.</li>
<li>Fit a multivariate Gaussian to each set by computing the mean $\mathbf{m}$ and covariance $\mathbf{C}$ for the generated set, and mean $\mathbf{m}_w$ and covariance $\mathbf{C}_w$ for the real set.</li>
<li>Compute the squared Frechet distance:</li>
</ol>
<p>$$
d^{2}\left((\mathbf{m}, \mathbf{C}), (\mathbf{m}_w, \mathbf{C}_w)\right) = \lVert \mathbf{m} - \mathbf{m}_w \rVert_2^{2} + \mathrm{Tr}\left(\mathbf{C} + \mathbf{C}_w - 2(\mathbf{C}\mathbf{C}_w)^{1/2}\right)
$$</p>
<p>The Gaussian assumption is justified by the maximum entropy principle: the Gaussian is the maximum-entropy distribution for given mean and covariance. A lower FCD indicates that the generated distribution is closer to the real distribution.</p>
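<p>The distance itself is simple enough to sketch in a few lines of NumPy. The helper below fits the two Gaussians and evaluates the formula above, obtaining the trace of $(\mathbf{C}\mathbf{C}_w)^{1/2}$ from the eigenvalues of $\mathbf{C}\mathbf{C}_w$ rather than an explicit matrix square root. This is a generic sketch, not the official FCD implementation, which additionally requires ChemNet to produce the activations:</p>

```python
import numpy as np

def frechet_distance(act_gen, act_real):
    """Squared Frechet distance between Gaussians fitted to two
    activation matrices of shape (n_samples, n_features)."""
    m, mw = act_gen.mean(axis=0), act_real.mean(axis=0)
    C = np.cov(act_gen, rowvar=False)
    Cw = np.cov(act_real, rowvar=False)
    # Tr((C Cw)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of C @ Cw, which are real and non-negative for
    # positive semi-definite C and Cw (clip guards numerical noise).
    eigvals = np.linalg.eigvals(C @ Cw)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((m - mw) ** 2)
                 + np.trace(C) + np.trace(Cw) - 2.0 * tr_sqrt)
```

<p>Identical activation sets give a distance of zero, and the mean-shift term alone makes clearly displaced distributions score high.</p>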
<h3 id="why-not-just-fingerprints">Why Not Just Fingerprints?</h3>
<p>The authors also define a Frechet Fingerprint Distance (FFD) that replaces ChemNet activations with 2048-bit ECFP_4 fingerprints. FFD captures chemical structure but not biological function. The experimental comparison shows that FCD produces more distinct separations between biased and unbiased molecule sets, particularly for biologically meaningful biases.</p>
<h2 id="detecting-flaws-in-generative-models">Detecting Flaws in Generative Models</h2>
<p>The experiments evaluate whether FCD can detect specific failure modes in generative models. The authors simulate five types of biased generators by selecting molecules from real databases that exhibit particular properties, then compare FCD against individual metrics (logP, druglikeness, SA score, internal diversity) and FFD.</p>
<h3 id="simulated-bias-experiments">Simulated Bias Experiments</h3>
<p>All experiments use 5,000 molecules drawn 5 times each. The reference distribution is 200,000 randomly drawn real molecules not used for ChemNet training.</p>
<table>
  <thead>
      <tr>
          <th>Bias Type</th>
          <th>logP</th>
          <th>Druglikeness</th>
          <th>SA Score</th>
          <th>Int. Diversity</th>
          <th>FFD</th>
          <th>FCD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Low druglikeness (&lt;5th pct)</td>
          <td>-</td>
          <td>Detects</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>High logP (&gt;95th pct)</td>
          <td>Detects</td>
          <td>Detects</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Low SA score (&lt;5th pct)</td>
          <td>-</td>
          <td>Partial</td>
          <td>-</td>
          <td>Partial</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Mode collapse (cluster)</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Kinase inhibitors (PLK1)</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
  </tbody>
</table>
<p>FCD is the only metric that detects all five bias types. The biological bias test (kinase inhibitors for PLK1-PBD from PubChem AID 720504) is particularly notable: only FFD and FCD detect this bias, and FCD provides a more distinct separation. This validates the hypothesis that incorporating biological information through ChemNet activations improves evaluation beyond purely chemical descriptors.</p>
<h3 id="sample-size-requirements">Sample Size Requirements</h3>
<p>The authors tested FCD convergence with varying sample sizes (5 to 300,000 molecules). Mean FCD values for samples drawn from the real distribution:</p>
<table>
  <thead>
      <tr>
          <th>Sample Size</th>
          <th>Mean FCD</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>5</td>
          <td>76.46</td>
          <td>5.03</td>
      </tr>
      <tr>
          <td>50</td>
          <td>31.86</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>500</td>
          <td>4.41</td>
          <td>0.03</td>
      </tr>
      <tr>
          <td>5,000</td>
          <td>0.42</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>0.05</td>
          <td>0.00</td>
      </tr>
      <tr>
          <td>300,000</td>
          <td>0.02</td>
          <td>0.00</td>
      </tr>
  </tbody>
</table>
<p>A sample size of 5,000 molecules is sufficient for reliable estimation, with the mean FCD approaching zero and negligible variance.</p>
<h3 id="benchmarking-published-generative-models">Benchmarking Published Generative Models</h3>
<p>The authors computed FCD for several published generative methods:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>FCD</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Random real molecules</td>
          <td>0.22</td>
          <td>Baseline (near zero as expected)</td>
      </tr>
      <tr>
          <td>Segler et al. (LSTM)</td>
          <td>1.62</td>
          <td>Trained to approximate full ChEMBL distribution</td>
      </tr>
      <tr>
          <td>DRD2-targeted methods</td>
          <td>24.14 to 47.85</td>
          <td>Olivecrona, RL, and ORGAN agents</td>
      </tr>
      <tr>
          <td>Rule-based baseline</td>
          <td>58.76</td>
          <td>Random concatenation of C, N, O atoms</td>
      </tr>
  </tbody>
</table>
<p>The ranking matches expectations. The Segler model, trained to approximate the overall molecule distribution, achieves the lowest FCD (1.62). Models optimized for a specific target (DRD2), including the Olivecrona RL agents, the RL method by Benhenda, and ORGAN, produce higher FCD values (24.14 to 47.85) against the general distribution. More training iterations push these models further from the general distribution, as they become increasingly DRD2-specific. The canonical and reduced Olivecrona agents learn similar chemical spaces, consistent with the original authors&rsquo; conclusions. The rule-based system scores worst (58.76), confirming FCD as a meaningful quality metric.</p>
<h2 id="conclusions-and-impact">Conclusions and Impact</h2>
<p>FCD provides a single metric that unifies the evaluation of chemical validity, biological relevance, and diversity for molecular generative models. Its main advantages are:</p>
<ol>
<li>It captures multiple quality dimensions in one score, simplifying method comparison.</li>
<li>It detects biases that no single existing metric can catch alone.</li>
<li>It requires only SMILES strings as input, making it applicable to any generative method (including graph-based approaches via SMILES conversion).</li>
<li>It incorporates biological information through ChemNet, distinguishing it from purely chemical metrics like FFD.</li>
</ol>
<p><strong>Limitations</strong>: The metric depends on the ChemNet model, which was trained on a specific set of bioactivity assays. Molecules outside the training distribution of ChemNet may not be well-represented. The Gaussian assumption for the activation distributions may not hold perfectly. FCD measures distance to a reference set, so it evaluates how well a generator approximates a given distribution rather than the absolute quality of individual molecules. When using FCD for targeted generation (e.g., molecules active against a specific protein), the reference set should be chosen accordingly, not the general drug-like molecule distribution.</p>
<p>FCD has since become a standard evaluation metric in the molecular generation community, adopted by benchmarking platforms like <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemNet training</td>
          <td>ChEMBL, ZINC, PubChem</td>
          <td>~6,000 assays</td>
          <td>Two-thirds for training, one-third for testing</td>
      </tr>
      <tr>
          <td>Reference distribution</td>
          <td>Combined databases</td>
          <td>200,000 molecules</td>
          <td>Excluded from ChemNet training</td>
      </tr>
      <tr>
          <td>Bias simulations</td>
          <td>Subsets of combined databases</td>
          <td>5,000 per experiment</td>
          <td>5 repetitions each</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>ChemNet: 2x 1D-conv (SELU), max-pool, 2x stacked LSTM, FC output</li>
<li>FCD: Squared Frechet distance between Gaussian-fitted ChemNet penultimate-layer activations</li>
<li>FFD: Same as FCD but using 2048-bit ECFP_4 fingerprints instead of ChemNet activations</li>
<li>Molecular property calculations: RDKit (logP, druglikeness, SA score, Morgan fingerprints with radius 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCD</td>
          <td>Frechet distance over ChemNet activations (lower = closer to reference)</td>
      </tr>
      <tr>
          <td>FFD</td>
          <td>Frechet distance over ECFP_4 fingerprints</td>
      </tr>
      <tr>
          <td>logP</td>
          <td>Mean partition coefficient</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Geometric mean of desired molecular properties (QED)</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility score</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Tanimoto distance within generated set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not provided in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bioinf-jku/FCD">FCD Implementation</a></td>
          <td>Code</td>
          <td>LGPL-3.0</td>
          <td>Official Python implementation; requires only SMILES input</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., &amp; Klambauer, G. (2018). Fréchet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery. <em>Journal of Chemical Information and Modeling</em>, 58(9), 1736-1741.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{preuer2018frechet,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Fr{\&#39;e}chet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Preuer, Kristina and Renz, Philipp and Unterthiner, Thomas and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{58}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1736--1741}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.8b00234}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Back Translation for Semi-Supervised Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/back-translation-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/back-translation-molecule-generation/</guid><description>A semi-supervised method adapting NLP back translation to molecule generation, improving property optimization and retrosynthesis with unlabeled ZINC data.</description><content:encoded><![CDATA[<h2 id="semi-supervised-data-augmentation-for-molecular-tasks">Semi-Supervised Data Augmentation for Molecular Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces back translation, a semi-supervised technique from neural machine translation, to the domain of molecular generation. The primary contribution is a general-purpose data augmentation strategy that leverages large pools of unlabeled molecules (from databases like ZINC) to improve the performance of both sequence-based and graph-based models on molecule optimization and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> prediction tasks.</p>
<h2 id="bridging-the-labeled-data-gap-in-molecular-generation">Bridging the Labeled Data Gap in Molecular Generation</h2>
<p>Molecular generation tasks, such as property optimization and retrosynthesis, require paired training data: an input molecule (or property specification) mapped to a desired output molecule. Obtaining these labeled pairs is expensive and labor-intensive. Meanwhile, enormous databases of unlabeled molecules exist. ZINC alone contains over 750 million compounds, and PubChem has 109 million.</p>
<p>Prior approaches to using unlabeled molecular data include <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">variational autoencoders (VAEs)</a> for learning latent representations, conditional recurrent neural networks for inverse design, and pretraining techniques borrowed from NLP. However, these methods either focus on representation learning rather than direct generation, or require task-specific architectural modifications. The authors identify back translation, a well-established technique in machine translation, as a natural fit for molecular generation tasks that can be treated as sequence-to-sequence mappings.</p>
<h2 id="back-translation-as-molecular-data-augmentation">Back Translation as Molecular Data Augmentation</h2>
<p>The core idea is straightforward. Given a main task that maps from source domain $\mathcal{X}$ to target domain $\mathcal{Y}$ (e.g., mapping low-QED molecules to high-QED molecules), the method trains a reverse model $g$ that maps from $\mathcal{Y}$ back to $\mathcal{X}$. This reverse model then &ldquo;back translates&rdquo; unlabeled molecules from $\mathcal{Y}$ to generate synthetic source molecules, creating pseudo-labeled training pairs.</p>
<p>The theoretical motivation comes from maximizing the reconstruction probability. Given an unlabeled molecule $y_u \in \mathcal{U}_y$, the logarithmic reconstruction probability through the reverse model $g$ and forward model $f$ is:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) = \log \sum_{\hat{x}_u \in \mathcal{X}} P(\hat{x}_u \mid y_u; g) P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>Since summing over the exponentially large space $\mathcal{X}$ is intractable, the authors apply Jensen&rsquo;s inequality to obtain a lower bound:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) \geq \mathbb{E}_{\hat{x}_u \sim P(\cdot \mid y_u; g)} \log P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>This lower bound is optimized via Monte Carlo sampling in three steps:</p>
<p><strong>Step 1</strong>: Train both forward model $f$ and reverse model $g$ on the labeled data $\mathcal{L}$:</p>
<p>$$
\begin{aligned}
\min_{\theta_f} \sum_{(x,y) \in \mathcal{L}} -\log P(y \mid x; \theta_f) \\
\min_{\theta_g} \sum_{(x,y) \in \mathcal{L}} -\log P(x \mid y; \theta_g)
\end{aligned}
$$</p>
<p><strong>Step 2</strong>: Use the trained reverse model $g$ to back translate each unlabeled molecule $y_u \in \mathcal{U}_y$, producing synthetic pairs:</p>
<p>$$
\hat{\mathcal{L}} = \{ (\hat{x}_u, y_u) \mid y_u \in \mathcal{U}_y,\ \hat{x}_u \text{ sampled from } P(\cdot \mid y_u; \theta_g) \}
$$</p>
<p><strong>Step 3</strong>: Retrain the forward model $f$ on the combined labeled and synthetic data $\mathcal{L} \cup \hat{\mathcal{L}}$, warm-starting from the parameters obtained in Step 1:</p>
<p>$$
\min_{\theta_f^{*}} \sum_{(x,y) \in \mathcal{L} \cup \hat{\mathcal{L}}} -\log P(y \mid x; \theta_f^{*})
$$</p>
<p>A key practical finding is that data filtration matters. When using large amounts of unlabeled data (1M molecules), keeping only the synthetic pairs that satisfy the same constraints as the labeled data (e.g., similarity thresholds and property ranges) significantly improves performance over using all back-translated data unfiltered.</p>
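<p>The three steps, plus the filtration heuristic, can be sketched as a generic training loop. Everything below is a toy stand-in: the <code>train</code>, <code>sample_reverse</code>, and <code>keep</code> callables are placeholders for the actual model training, reverse-model sampling, and constraint check, not the authors' code:</p>

```python
def back_translation_round(labeled, unlabeled_y, train, sample_reverse, keep):
    """One round of back-translation data augmentation.

    labeled        : list of (x, y) pairs
    unlabeled_y    : list of target-domain molecules y_u
    train          : callable(pairs) -> model        (stand-in for SGD training)
    sample_reverse : callable(g, y) -> x_hat         (samples from P(x | y; g))
    keep           : callable(x_hat, y) -> bool      (task-specific filter)
    """
    # Step 1: train forward f on (x, y) and reverse g on (y, x).
    f = train(labeled)
    g = train([(y, x) for x, y in labeled])
    # Step 2: back-translate unlabeled targets into synthetic pairs,
    # then keep only pairs satisfying the labeled data's constraints.
    synthetic = [(sample_reverse(g, y), y) for y in unlabeled_y]
    synthetic = [(x, y) for x, y in synthetic if keep(x, y)]
    # Step 3: retrain f on labeled + filtered synthetic data
    # (the paper warm-starts from Step 1's parameters; omitted here).
    f = train(labeled + synthetic)
    return f, synthetic
```

<p>The structure makes the paper's ablations easy to read off: a weak reverse model in Step 1 produces noisy pairs in Step 2, which is why small labeled sets see no benefit, and why <code>keep</code> matters at the 1M scale.</p>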
<h2 id="experiments-on-property-optimization-and-retrosynthesis">Experiments on Property Optimization and Retrosynthesis</h2>
<h3 id="molecular-property-improvement">Molecular Property Improvement</h3>
<p>The authors evaluate on four tasks from Jin et al. (2019, 2020), each requiring the model to improve a specific molecular property while maintaining structural similarity (measured by Dice similarity on Morgan fingerprints):</p>
<ul>
<li><strong>LogP</strong> (penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">partition coefficient</a>): two settings with similarity thresholds $\delta \geq 0.4$ and $\delta \geq 0.6$</li>
<li><strong>QED</strong> (quantitative estimation of drug-likeness): translate molecules from QED range [0.7, 0.8] to [0.9, 1.0]</li>
<li><strong>DRD2</strong> (<a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine type 2 receptor</a> activity): translate inactive ($P &lt; 0.5$) to active ($P \geq 0.5$)</li>
</ul>
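<p>The similarity constraint is cheap to evaluate once fingerprints are in hand. Below is a minimal Dice coefficient over fingerprints represented as sets of on-bit indices; generating the Morgan bits themselves would require RDKit, which is omitted here:</p>

```python
def dice_similarity(fp_a, fp_b):
    """Dice coefficient between two binary fingerprints given as sets
    of on-bit indices: 2|A intersect B| / (|A| + |B|)."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return 2 * len(fp_a & fp_b) / (len(fp_a) + len(fp_b))
```

<p>A generated molecule is accepted for a task like LogP ($\delta \geq 0.4$) only if this value against the input molecule meets the threshold.</p>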
<p>Two backbone architectures are tested: a Transformer (6 layers, 4 heads, 128-dim embeddings, 512-dim FFN) and HierG2G, a hierarchical graph-to-graph translation model. Unlabeled molecules are sampled from ZINC at 250K and 1M scales.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP ($\delta \geq 0.6$)</th>
          <th>LogP ($\delta \geq 0.4$)</th>
          <th>QED (%)</th>
          <th>DRD2 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.28</td>
          <td>1.03</td>
          <td>8.8</td>
          <td>3.4</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>0.79</td>
          <td>2.49</td>
          <td>9.4</td>
          <td>4.4</td>
      </tr>
      <tr>
          <td>JTNN</td>
          <td>2.33</td>
          <td>3.55</td>
          <td>59.9</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Transformer baseline</td>
          <td>2.45</td>
          <td>3.69</td>
          <td>71.9</td>
          <td>60.2</td>
      </tr>
      <tr>
          <td>+BT (1M, filtered)</td>
          <td>2.86</td>
          <td>4.41</td>
          <td>82.9</td>
          <td>67.4</td>
      </tr>
      <tr>
          <td>HierG2G baseline</td>
          <td>2.49</td>
          <td>3.98</td>
          <td>76.9</td>
          <td>85.9</td>
      </tr>
      <tr>
          <td>+BT (250K, filtered)</td>
          <td>2.75</td>
          <td>4.24</td>
          <td>79.1</td>
          <td>87.3</td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis-prediction">Retrosynthesis Prediction</h3>
<p>On the USPTO-50K benchmark (50K reactions, 10 reaction types, 80/10/10 train/val/test split), the method is applied to Transformer and GLN (Graph Logic Network) backbones. For other approaches to this benchmark, see <a href="/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a> and <a href="/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">Data Transfer for Retrosynthesis</a>. Unlabeled reactant sets are constructed by sampling molecules from ZINC and concatenating them following the training data&rsquo;s reactant count distribution ($N_1 : N_2 : N_3 = 29.3\% : 70.4\% : 0.3\%$).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Top-1</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Reaction type given</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>64.2</td>
          <td>79.1</td>
          <td>85.2</td>
          <td>90.0</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>67.9</td>
          <td>82.5</td>
          <td>87.3</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>52.2</td>
          <td>68.2</td>
          <td>72.7</td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>55.9</td>
          <td>72.8</td>
          <td>77.8</td>
          <td>79.7</td>
      </tr>
      <tr>
          <td><strong>Reaction type unknown</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>54.7</td>
          <td>70.2</td>
          <td>77.0</td>
          <td>84.4</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>37.9</td>
          <td>57.3</td>
          <td>62.7</td>
          <td>68.1</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>43.5</td>
          <td>58.8</td>
          <td>64.6</td>
          <td>69.7</td>
      </tr>
  </tbody>
</table>
<p>The improvements are largest at lower $k$ values (top-1 and top-3), suggesting that back translation helps the model make more precise high-confidence predictions.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Effect of unlabeled data size</strong>: On retrosynthesis with Transformer, performance improves as unlabeled data increases from 50K to 250K, then plateaus or declines beyond 250K. The authors attribute this to noise in the back-translated data outweighing the benefits at larger scales.</p>
<p><strong>Effect of labeled data size</strong>: With only 5K labeled samples, adding back-translated data hurts performance because the reverse model is too weak to generate useful synthetic data. As labeled data increases (10K, 25K, 50K), the benefit of back translation grows. This confirms that the method requires a reasonably well-trained reverse model to be effective.</p>
<p><strong>Data filtration</strong>: Unfiltered 1M back-translated data can underperform: on QED it reaches only 75.1%, barely above the 71.9% supervised baseline, versus 82.9% when filtering enforces the same constraints as the labeled data, which recovers and exceeds the 250K filtered results.</p>
<h2 id="consistent-gains-across-architectures-and-tasks">Consistent Gains Across Architectures and Tasks</h2>
<p>The method achieves state-of-the-art results on all four molecular property improvement tasks and the USPTO-50K retrosynthesis benchmark at time of publication. Several observations stand out:</p>
<ol>
<li><strong>Architecture agnosticism</strong>: Back translation improves both sequence-based (Transformer) and graph-based (HierG2G, GLN) models, confirming that the approach is independent of the underlying architecture.</li>
<li><strong>Filtration is essential at scale</strong>: Unfiltered 1M back-translated data can degrade performance, but filtered data at the same scale consistently outperforms smaller unfiltered sets.</li>
<li><strong>Training overhead is moderate</strong>: On the DRD2 task, back translation with Transformer adds about 2.5 hours over supervised training alone (11.0h total vs. 8.5h), with the back-translation step itself taking under 1 hour.</li>
<li><strong>Diversity and novelty increase</strong>: Back translation improves both diversity (average pairwise distance among generated molecules) and novelty (fraction of generated molecules not seen in training) across QED and DRD2 tasks.</li>
</ol>
<p>The authors acknowledge limitations: the method does not form a closed loop between forward and reverse models (as in dual learning approaches), and the data filtration strategy is rule-based rather than learned. They suggest joint training of forward and reverse models and learned filtration as future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (property improvement)</td>
          <td>Jin et al. (2019, 2020) datasets</td>
          <td>34K-99K pairs</td>
          <td>LogP, QED, DRD2 tasks</td>
      </tr>
      <tr>
          <td>Training (retrosynthesis)</td>
          <td>USPTO-50K</td>
          <td>40K reactions</td>
          <td>80/10/10 split from Dai et al. (2019)</td>
      </tr>
      <tr>
          <td>Unlabeled molecules</td>
          <td>ZINC</td>
          <td>250K or 1M</td>
          <td>Randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Same as training</td>
          <td>800-1000 test samples</td>
          <td>Per-task test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Back translation with optional data filtration</li>
<li>Beam search with $k=20$ for inference</li>
<li>Random sampling for back-translation step (Equation 5)</li>
<li>Dice similarity on Morgan fingerprints for similarity constraint</li>
</ul>
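<p>Beam search itself is model-agnostic. A generic sketch over a toy <code>expand</code> function that returns scored next tokens is shown below; the paper uses beam width $k=20$ over SMILES tokens, while the stopping logic here is simplified to a fixed number of steps:</p>

```python
import heapq

def beam_search(expand, start, k=20, steps=5):
    """Generic beam search: keep the k highest-scoring partial sequences.

    expand(seq) -> iterable of (next_token, log_prob) pairs; in the real
    setting this would query the decoder's next-token distribution.
    """
    beam = [(0.0, start)]  # (cumulative log-prob, token sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beam:
            for tok, logp in expand(seq):
                candidates.append((score + logp, seq + [tok]))
        # Prune to the k best-scoring extensions, sorted descending.
        beam = heapq.nlargest(k, candidates)
    return beam
```

<p>With $k=1$ this reduces to greedy decoding; larger beams trade compute for a better chance of ranking the true reactants within the top-$k$ predictions.</p>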
<h3 id="models">Models</h3>
<ul>
<li><strong>Transformer</strong>: 6 layers, 4 attention heads, 128-dim embeddings, 512-dim FFN (for property improvement); 4 layers, 8 heads, 256-dim embeddings, 2048-dim FFN (for retrosynthesis)</li>
<li><strong>HierG2G</strong>: Settings from Jin et al. (2020)</li>
<li><strong>GLN</strong>: Settings from Dai et al. (2019)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.6$)</td>
          <td>2.86</td>
          <td>2.49 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.4$)</td>
          <td>4.41</td>
          <td>3.98 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>QED</td>
          <td>82.9%</td>
          <td>76.9% (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>DRD2</td>
          <td>87.3%</td>
          <td>85.9% (HierG2G)</td>
          <td>HierG2G + BT(250K, filtered)</td>
      </tr>
      <tr>
          <td>Top-1 accuracy</td>
          <td>USPTO-50K (known type)</td>
          <td>67.9%</td>
          <td>64.2% (GLN)</td>
          <td>Ours + GLN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper reports training times (8.5h for Transformer, 16.8h for HierG2G on DRD2 with 1M unlabeled data) but does not specify the GPU hardware used.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fyabc/BT4MolGen">BT4MolGen</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation in Python</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, Y., Xia, Y., Zhu, J., Wu, L., Xie, S., &amp; Qin, T. (2021). Back translation for molecule generation. <em>Bioinformatics</em>, 38(5), 1244-1251. <a href="https://doi.org/10.1093/bioinformatics/btab817">https://doi.org/10.1093/bioinformatics/btab817</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fan2022back,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Back translation for molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fan, Yang and Xia, Yingce and Zhu, Jinhua and Wu, Lijun and Xie, Shufang and Qin, Tao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1244--1251}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bioinformatics/btab817}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Tied Two-Way Transformers for Diverse Retrosynthesis</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/</guid><description>Tied two-way transformers with cycle consistency and multinomial latent variables improve retrosynthetic prediction validity, plausibility, and diversity.</description><content:encoded><![CDATA[<h2 id="bridging-forward-and-backward-reaction-prediction">Bridging Forward and Backward Reaction Prediction</h2>
<p>This is a <strong>Method</strong> paper that addresses three key limitations of template-free <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> models: invalid <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> outputs, chemically implausible predictions, and lack of diversity in reactant candidates. The solution combines three techniques: (1) cycle consistency checks using a paired forward reaction transformer, (2) parameter tying between the forward and backward transformers, and (3) multinomial latent variables with a learned prior to capture multiple reaction pathways.</p>
<h2 id="three-problems-in-template-free-retrosynthesis">Three Problems in Template-Free Retrosynthesis</h2>
<p>Template-free retrosynthesis models cast retrosynthesis as a <a href="/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">sequence-to-sequence</a> translation problem (product SMILES to reactant SMILES). While these models avoid the cost of hand-coded reaction templates, they suffer from:</p>
<ol>
<li><strong>Invalid SMILES</strong>: predicted reactant strings that contain grammatical errors and cannot be parsed into molecules</li>
<li><strong>Implausibility</strong>: predicted reactants that are valid molecules but cannot actually synthesize the target product</li>
<li><strong>Lack of diversity</strong>: beam search produces duplicate or near-duplicate candidates, reducing the number of useful suggestions</li>
</ol>
<p>Prior work addressed these individually (SCROP adds a syntax corrector for validity, Chen et al. use latent variables for diversity), but this paper tackles all three simultaneously.</p>
<h2 id="model-architecture">Model Architecture</h2>
<h3 id="tied-two-way-transformers">Tied Two-Way Transformers</h3>
<p>The model pairs a retrosynthesis transformer $p(y|z, x)$ (product to reactants) with a forward reaction transformer $p(\tilde{x}|z, y)$ (reactants to product). Both use the standard encoder-decoder transformer architecture with 6 layers, 8 attention heads, and 256-dimensional embeddings.</p>
<p>The key architectural innovation is aggressive parameter tying: the two transformers share the entire encoder and all decoder parameters except layer normalization. This means the two-transformer system has approximately the same parameter count as a single transformer (17.5M vs. 17.4M). The shared parameters force the model to learn bidirectional reaction patterns from both forward and backward training data simultaneously, improving grammar learning and reducing invalid outputs.</p>
<h3 id="multinomial-latent-variables">Multinomial Latent Variables</h3>
<p>A discrete latent variable $z \in \{1, \ldots, K\}$ is introduced to capture multiple reaction modes. Each latent value conditions a different decoding path, encouraging diverse reactant predictions. The decoder initializes with a latent-class-specific start token (e.g., &ldquo;&lt;CLS2&gt;&rdquo;) and then decodes autoregressively.</p>
<p>The prior $p(z|x)$ is a learned multinomial distribution parametrized by a two-layer feed-forward network with tanh activation, taking the mean-pooled encoder output as input. This learned prior outperforms the uniform prior used by Chen et al., producing a smaller trade-off between top-1 and top-10 accuracy as $K$ increases.</p>
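<p>As a shape-level sketch (untrained random weights stand in for the learned parameters, and the hidden dimension is illustrative), the learned prior reduces to a mean-pool, a two-layer tanh network, and a softmax:</p>

```python
import numpy as np

def latent_prior(encoder_out, W1, b1, W2, b2):
    """Learned prior p(z|x): mean-pool the encoder output over the
    sequence axis, apply a two-layer feed-forward network with tanh,
    and softmax over the K latent classes.
    Shapes: encoder_out (L, D), W1 (D, H), W2 (H, K)."""
    h = np.tanh(encoder_out.mean(axis=0) @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
L, D, H, K = 12, 256, 128, 5            # H is an illustrative choice
p_z = latent_prior(rng.normal(size=(L, D)),
                   0.05 * rng.normal(size=(D, H)), np.zeros(H),
                   0.05 * rng.normal(size=(H, K)), np.zeros(K))
# p_z is a length-K probability vector over latent reaction modes
```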
<h3 id="training-with-hard-em">Training with Hard EM</h3>
<p>Since the latent variable $z$ is unobserved during training, the model is trained with the online <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">hard-EM algorithm</a>. The loss function is:</p>
<p>$$\mathcal{L}(\theta) = \mathbb{E}_{(x,y) \sim \text{data}} \left[ \min_{z} \mathcal{L}_h(x, y, z; \theta) \right]$$</p>
<p>where $\mathcal{L}_h = -(\log p(z|x) + \log p(y|z,x) + \log p(\tilde{x}=x|z,y))$. The E-step selects the best $z$ for each training pair (with dropout disabled), and the M-step updates parameters given the complete data.</p>
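<p>The E-step is an argmin over the $K$ latent values. A minimal sketch, with toy log-probabilities standing in for the tied transformers&rsquo; outputs:</p>

```python
import math

def hard_em_select(log_prior, log_retro, log_cycle):
    """E-step of online hard EM: pick the latent z minimizing
    L_h = -(log p(z|x) + log p(y|z,x) + log p(x_tilde=x|z,y)).
    Each argument is a list of K per-latent log-probabilities."""
    losses = [-(lp + lr + lc)
              for lp, lr, lc in zip(log_prior, log_retro, log_cycle)]
    z_star = min(range(len(losses)), key=lambda z: losses[z])
    return z_star, losses[z_star]

# Toy example with K = 3 latent values (log-probabilities are made up).
log_prior = [math.log(0.5), math.log(0.3), math.log(0.2)]
log_retro = [-4.0, -2.0, -6.0]
log_cycle = [-3.0, -1.5, -5.0]

z_star, loss = hard_em_select(log_prior, log_retro, log_cycle)
# z = 1 has the highest total log-likelihood, so it is selected
```

<p>The M-step then backpropagates $\mathcal{L}_h$ for the selected $z$ only, treating $(x, y, z^\*)$ as complete data.</p>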
<h3 id="inference-with-cycle-consistency-reranking">Inference with Cycle Consistency Reranking</h3>
<p>At inference, the model: (1) generates $K$ sets of beam search hypotheses from the retrosynthesis transformer (one per latent value), (2) scores each candidate with the forward reaction transformer for cycle consistency $p(\tilde{x}=x|z,y)$, and (3) reranks candidates by the full likelihood $p(z|x) \cdot p(y|z,x) \cdot p(\tilde{x}=x|z,y)$. This pushes chemically plausible predictions to higher ranks.</p>
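<p>The reranking step amounts to sorting candidates by the sum of the three log-likelihood terms. A minimal sketch with made-up scores:</p>

```python
def rerank(candidates):
    """Rerank candidate reactant sets by the full log-likelihood
    log p(z|x) + log p(y|z,x) + log p(x_tilde=x|z,y).
    Each candidate is (reactants, log_prior, log_retro, log_cycle)."""
    scored = [(lp + lr + lc, y) for y, lp, lr, lc in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [y for _, y in scored]

# A candidate with a mediocre retrosynthesis score but a strong cycle
# consistency term can be promoted to the top rank.
candidates = [
    ("CC(=O)Cl.OCC", -1.0, -2.0, -9.0),   # poor cycle consistency
    ("CC(=O)O.OCC",  -1.2, -2.5, -0.5),   # strong cycle consistency
]
ranked = rerank(candidates)
```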
<h2 id="results-on-uspto-50k">Results on USPTO-50K</h2>
<p>All results are averaged over 5 random seeds with beam size 10.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-5 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Top-1 Invalid</th>
          <th>Top-10 Invalid</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Liu-LSTM</td>
          <td>37.4%</td>
          <td>57.0%</td>
          <td>61.7%</td>
          <td>12.2%</td>
          <td>22.0%</td>
      </tr>
      <tr>
          <td>SCROP</td>
          <td>43.7%</td>
          <td>65.2%</td>
          <td>68.7%</td>
          <td>0.7%</td>
          <td>2.3%</td>
      </tr>
      <tr>
          <td>Lin-TF</td>
          <td>42.0%</td>
          <td>71.3%</td>
          <td>77.6%</td>
          <td>2.2%</td>
          <td>7.8%</td>
      </tr>
      <tr>
          <td>Base transformer</td>
          <td>44.3%</td>
          <td>68.4%</td>
          <td>72.7%</td>
          <td>1.7%</td>
          <td>12.1%</td>
      </tr>
      <tr>
          <td>Proposed ($K$=5)</td>
          <td>46.8%</td>
          <td>73.5%</td>
          <td>78.5%</td>
          <td>0.1%</td>
          <td>2.6%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model achieves a +3.1% top-1 accuracy improvement over the best previous template-free method and reduces the top-1 invalid rate to 0.1%.</p>
<h3 id="ablation-analysis">Ablation Analysis</h3>
<p>The ablation study isolates the contribution of each component:</p>
<ul>
<li><strong>Base+CC</strong> (cycle consistency only): reranks candidates to improve top-1/3/5 accuracy and validity, but top-10 stays the same since the candidate set is unchanged. Parameter count doubles (34.8M).</li>
<li><strong>Base+PT</strong> (parameter tying only): improves accuracy and validity at all top-$k$ levels with negligible parameter increase. Parameter tying during training improves the retrosynthesis transformer itself, even without cycle consistency at inference.</li>
<li><strong>Proposed ($K$=1)</strong>: combines tying with cycle consistency reranking.</li>
<li><strong>Proposed ($K$=5)</strong>: adds latent diversity, further improving top-10 accuracy (+2.2%) and reducing top-10 invalid rate (from 10.2% to 2.6%).</li>
</ul>
<h3 id="diversity-unique-rate">Diversity: Unique Rate</h3>
<p>As $K$ increases from 1 to 5, the unique molecule rate among 10 predictions rises substantially, confirming that latent modeling produces more diverse candidates. The learned prior reduces the top-1/top-10 accuracy trade-off compared to Chen et al.&rsquo;s uniform prior.</p>
<h2 id="results-on-in-house-multi-pathway-dataset">Results on In-House Multi-Pathway Dataset</h2>
<p>The in-house dataset (162K reactions from <a href="https://en.wikipedia.org/wiki/Reaxys">Reaxys</a>) contains multiple ground-truth reactions per product, enabling direct evaluation of pathway diversity through coverage (proportion of ground-truth pathways correctly predicted in the top-10 candidates).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Unique Rate</th>
          <th>Coverage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>64.2%</td>
          <td>91.6%</td>
          <td>76.1%</td>
          <td>84.4%</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>66.0%</td>
          <td>92.8%</td>
          <td>93.2%</td>
          <td>87.3%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model covers 87.3% of ground-truth reaction pathways on average, compared to 84.4% for the baseline. The unique rate jumps from 76.1% to 93.2%, confirming that the latent variables effectively encourage diverse predictions.</p>
<h2 id="limitations">Limitations</h2>
<p>The model uses SMILES string representation, which linearizes molecules and does not exploit the inherently rich chemical graph structure. Graph-based retrosynthesis models (e.g., GraphRetro at 63.8% top-1) substantially outperform template-free string-based models. The USPTO-50K dataset provides only one ground-truth pathway per product, making diversity evaluation limited on this benchmark. The in-house dataset is not publicly available. The model also does not predict reaction conditions (solvents, catalysts, temperature) or reagents.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ejklike/tied-twoway-transformer">ejklike/tied-twoway-transformer</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: USPTO-50K dataset (public, 50K reactions from USPTO patents). In-house dataset (162K reactions from Reaxys, not publicly available).</p>
<p><strong>Hardware</strong>: 4 NVIDIA Tesla M40 GPUs. Checkpoints saved every 5000 steps, last 5 averaged.</p>
<p><strong>Training</strong>: Adam optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.98), initial learning rate 2 with 8000 warm-up steps, dropout 0.3, gradient accumulation over 4 batches. Label smoothing set to 0.</p>
<p><strong>Inference</strong>: Beam size 10, generating 10 candidates per product.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, E., Lee, D., Kwon, Y., Park, M. S., &amp; Choi, Y.-S. (2021). Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables. <em>Journal of Chemical Information and Modeling</em>, 61(1), 123-133. <a href="https://doi.org/10.1021/acs.jcim.0c01074">https://doi.org/10.1021/acs.jcim.0c01074</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ejklike/tied-twoway-transformer">GitHub: ejklike/tied-twoway-transformer</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kim2021valid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kim, Eunji and Lee, Dongseon and Kwon, Youngchun and Park, Min Sik and Choi, Youn-Suk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{123--133}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01074}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>UnCorrupt SMILES: Post Hoc Correction for De Novo Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</guid><description>A transformer-based SMILES corrector that fixes invalid outputs from molecular generators, recovering 60-95% of erroneous SMILES strings.</description><content:encoded><![CDATA[<h2 id="a-transformer-based-smiles-error-corrector">A Transformer-Based SMILES Error Corrector</h2>
<p>This is a <strong>Method</strong> paper that proposes a post hoc approach to fixing invalid SMILES produced by de novo molecular generators. Rather than trying to prevent invalid outputs through alternative representations (<a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or constrained architectures (graph models), the authors train a transformer model to translate invalid SMILES into valid ones. The corrector is framed as a sequence-to-sequence translation task, drawing on techniques from grammatical error correction (GEC) in natural language processing.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based generative models produce some percentage of invalid outputs that cannot be converted to molecules. The invalidity rate varies substantially across model types:</p>
<ul>
<li><strong>RNN models</strong> (DrugEx): 5.7% invalid (pretrained) and 4.7% invalid (target-directed)</li>
<li><strong>GANs</strong> (ORGANIC): 9.5% invalid</li>
<li><strong>VAEs</strong> (GENTRL): 88.9% invalid</li>
</ul>
<p>These invalid outputs represent wasted computation and potentially introduce bias toward molecules that are easier to generate correctly. Previous approaches to this problem include using alternative representations (<a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or graph-based models, but these either limit the search space or increase computational cost. The authors propose a complementary strategy: fix the errors after generation.</p>
<h2 id="error-taxonomy-across-generator-types">Error Taxonomy Across Generator Types</h2>
<p>The paper classifies invalid SMILES errors into six categories based on RDKit error messages:</p>
<ol>
<li><strong>Syntax errors</strong>: malformed SMILES grammar</li>
<li><strong>Unclosed rings</strong>: unmatched ring closure digits</li>
<li><strong>Parentheses errors</strong>: unbalanced open/close parentheses</li>
<li><strong>Bond already exists</strong>: duplicate bonds between the same atoms</li>
<li><strong>Aromaticity errors</strong>: atoms incorrectly marked as aromatic or kekulization failures</li>
<li><strong>Valence errors</strong>: atoms exceeding their maximum bond count</li>
</ol>
<p>The distribution of error types differs across generators. RNN-based models primarily produce aromaticity errors, suggesting they learn SMILES grammar well but struggle with chemical validity. The GAN (ORGANIC) produces mostly valence errors. The VAE (GENTRL) produces more grammar-level errors (syntax, parentheses, unclosed rings), indicating that sampling from the continuous latent space often produces sequences that violate basic SMILES structure.</p>
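<p>In the paper these categories are assigned from RDKit parser error messages; as a toy illustration only, two of the grammar-level classes can be detected with a few lines of plain Python:</p>

```python
def find_grammar_errors(smiles: str):
    """Toy detector for two grammar-level SMILES error classes:
    unbalanced parentheses and unclosed ring-closure digits.
    (The paper's real classification uses RDKit error messages;
    this ignores multi-digit %nn ring closures, bracket atoms, etc.)"""
    errors = []
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:                 # ")" with no matching "("
                errors.append("parentheses")
                break
    if depth > 0 and "parentheses" not in errors:
        errors.append("parentheses")      # "(" never closed
    # Each single-digit ring closure must appear an even number of times.
    open_rings = set()
    for ch in smiles:
        if ch.isdigit():
            open_rings ^= {ch}            # toggle: open, then close
    if open_rings:
        errors.append("unclosed ring")
    return errors

# "c1ccccc1" is fine; "c1cccc(c1" has an extra "("; "C1CC" never closes ring 1
```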
<h2 id="architecture-and-training">Architecture and Training</h2>
<p>The SMILES corrector uses a standard encoder-decoder transformer architecture based on Vaswani et al., with learned positional encodings. Key specifications:</p>
<ul>
<li>Embedding dimension: 256</li>
<li>Encoder/decoder layers: 3 each</li>
<li>Attention heads: 8</li>
<li>Feed-forward dimension: 512</li>
<li>Dropout: 0.1</li>
<li>Optimizer: Adam (learning rate 0.0005)</li>
<li>Training: 20 epochs, batch size 16</li>
</ul>
<p>Since no dataset of manually corrected invalid-valid SMILES pairs exists, the authors create synthetic training data by introducing errors into valid SMILES from the Papyrus bioactivity dataset (approximately 1.3M pairs). Errors are introduced through random perturbations following SMILES syntax rules: character substitutions, bond order changes, fragment additions from the <a href="/notes/chemistry/datasets/gdb-11/">GDB</a>-8 database to atoms with full valence, and other structural modifications.</p>
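<p>As a loose sketch of the pairing idea only (the delete/duplicate/substitute operations below are simplified stand-ins for the paper&rsquo;s syntax-aware perturbation rules), synthetic (invalid, valid) training pairs can be built like this:</p>

```python
import random

def corrupt(smiles: str, n_errors: int = 1, seed: int = 0) -> str:
    """Introduce n_errors random perturbations into a valid SMILES
    string. The operations here (delete / duplicate / substitute a
    character) are illustrative; the paper uses syntax-aware rules
    such as bond-order changes and fragment additions from GDB-8."""
    rng = random.Random(seed)
    chars = list(smiles)
    vocab = "CNOc()=#123"        # illustrative substitution alphabet
    for _ in range(n_errors):
        op = rng.choice(["delete", "duplicate", "substitute"])
        i = rng.randrange(len(chars))
        if op == "delete" and len(chars) > 1:
            del chars[i]
        elif op == "duplicate":
            chars.insert(i, chars[i])
        else:
            chars[i] = rng.choice(vocab)
    return "".join(chars)

# One synthetic training pair: (corrupted input, valid target).
target = "CC(=O)Oc1ccccc1C(=O)O"          # aspirin
pair = (corrupt(target, n_errors=3), target)
```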
<h2 id="training-with-multiple-errors-improves-correction">Training with Multiple Errors Improves Correction</h2>
<p>A key finding is that training the corrector on inputs with multiple errors per SMILES substantially improves performance on real generator outputs. The baseline model (1 error per input) fixes 35-80% of invalid outputs depending on the generator. Increasing errors per training input to 12 raises this to 62-95%:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>1 error/input</th>
          <th>12 errors/input</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN (DrugEx)</td>
          <td>~60% fixed</td>
          <td>62% fixed</td>
      </tr>
      <tr>
          <td>Target-directed RNN</td>
          <td>~60% fixed</td>
          <td>68% fixed</td>
      </tr>
      <tr>
          <td>GAN (ORGANIC)</td>
          <td>~80% fixed</td>
          <td>95% fixed</td>
      </tr>
      <tr>
          <td>VAE (GENTRL)</td>
          <td>~35% fixed</td>
          <td>80% fixed</td>
      </tr>
  </tbody>
</table>
<p>Training beyond 12 errors per input yields diminishing returns (80% average at 20 errors vs. 78% at 12). The improvement from multi-error training is consistent with GEC literature, where models learn to &ldquo;distrust&rdquo; inputs more when exposed to higher error rates.</p>
<p>The model also shows low overcorrection: only 14% of valid SMILES are altered during translation, comparable to overcorrection rates in spelling correction systems.</p>
<h2 id="fixed-molecules-are-comparable-to-generator-outputs">Fixed Molecules Are Comparable to Generator Outputs</h2>
<p>The corrected molecules are evaluated against both the training set and the readily generated (valid) molecules from each generator:</p>
<ul>
<li><strong>Uniqueness</strong>: 97% of corrected molecules are unique</li>
<li><strong>Novelty vs. generated</strong>: 97% of corrected molecules are novel compared to the valid generator outputs</li>
<li><strong>Similarity to nearest neighbor (SNN)</strong>: 0.45 between fixed and generated sets, indicating the corrected molecules explore different parts of chemical space</li>
<li><strong>Property distributions</strong>: KL divergence scores between fixed molecules and the training set are comparable to those between generated molecules and the training set</li>
</ul>
<p>This demonstrates that SMILES correction produces molecules that are as chemically reasonable as the generator&rsquo;s valid outputs while exploring complementary regions of chemical space.</p>
<h2 id="local-chemical-space-exploration-via-error-introduction">Local Chemical Space Exploration via Error Introduction</h2>
<p>Beyond fixing generator errors, the authors propose using the SMILES corrector for analog generation. The workflow is:</p>
<ol>
<li>Take a known active molecule</li>
<li>Introduce random errors into its SMILES (repeated 1000 times)</li>
<li>Correct the errors using the trained corrector</li>
</ol>
<p>This &ldquo;local sequence exploration&rdquo; generates novel analogs with 97% validity. The uniqueness (39%) and novelty (16-37%) are lower than for generator correction because the corrector often regenerates the original molecule. However, the approach produces molecules that are structurally similar to the starting compound (SNN of 0.85 to known ligands).</p>
<p>The authors demonstrate this on selective <a href="https://en.wikipedia.org/wiki/Aurora_kinase_B">Aurora kinase B</a> (AURKB) inhibitors. The generated analogs occupy the same binding site region as the co-crystallized ligand VX-680 in docking studies, with predicted bioactivities similar to known compounds. Compared to target-directed RNN generation, SMILES exploration produces molecules closer to known actives (higher SNN, scaffold similarity, and KL divergence scores).</p>
<h2 id="limitations">Limitations</h2>
<p>The corrector performance drops when applied to real generator outputs compared to synthetic test data, because the synthetic error distribution does not perfectly match the errors that generators actually produce. Generator-specific correctors trained on actual invalid outputs could improve performance. The local exploration approach has limited novelty since the corrector frequently regenerates the original molecule. The evaluation uses predicted rather than experimental bioactivities for the Aurora kinase case study.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">LindeSchoenmaker/SMILES-corrector</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training code, synthetic error generation, and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Synthetic training pairs derived from the Papyrus bioactivity dataset (v5.5). Approximately 1.3M invalid-valid pairs per error-count setting.</p>
<p><strong>Code</strong>: Transformer implemented in PyTorch, adapted from Ben Trevett&rsquo;s seq2seq tutorial. Generative model baselines use DrugEx, GENTRL, and ORGANIC.</p>
<p><strong>Evaluation</strong>: Validity assessed with RDKit. Similarity metrics (SNN, fragment, scaffold) and KL divergence computed following <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark protocols.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schoenmaker, L., Béquignon, O. J. M., Jespers, W., &amp; van Westen, G. J. P. (2023). UnCorrupt SMILES: a novel approach to de novo design. <em>Journal of Cheminformatics</em>, 15(1), 22. <a href="https://doi.org/10.1186/s13321-023-00696-x">https://doi.org/10.1186/s13321-023-00696-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">GitHub: LindeSchoenmaker/SMILES-corrector</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schoenmaker2023uncorrupt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{UnCorrupt SMILES: a novel approach to de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Schoenmaker, Linde and B{\&#39;e}quignon, Olivier J. M. and Jespers, Willem and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00696-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RetMol: Retrieval-Based Controllable Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/</guid><description>RetMol uses retrieval-augmented generation to steer a pre-trained molecular model toward desired properties using only a handful of exemplar molecules.</description><content:encoded><![CDATA[<h2 id="retrieval-augmented-generation-for-molecules">Retrieval-Augmented Generation for Molecules</h2>
<p>This is a <strong>Method</strong> paper that introduces RetMol, a retrieval-based framework for controllable molecule generation. The key idea is to guide a pre-trained generative model using a small set of exemplar molecules that partially satisfy the desired design criteria, retrieved from a task-specific database. The approach requires no task-specific fine-tuning of the generative backbone and works effectively with very few exemplar molecules (as few as 23).</p>
<h2 id="limitations-of-existing-controllable-generation">Limitations of Existing Controllable Generation</h2>
<p>Existing approaches to controllable molecule generation fall into three categories, each with drawbacks:</p>
<ol>
<li><strong>Reinforcement learning (RL)-based methods</strong> require task-specific fine-tuning of the generative model for each new objective</li>
<li><strong>Supervised learning (SL)-based methods</strong> need molecules with desired properties as training data, which may be scarce</li>
<li><strong>Latent optimization-based methods</strong> require training property predictors in the latent space, which is challenging with limited active molecules and incompatible with variable-length latent spaces like those in transformers</li>
</ol>
<p>RetMol addresses all three issues by keeping the generative backbone frozen and using a lightweight, task-agnostic retrieval module that can be applied to new tasks simply by swapping the retrieval database.</p>
<h2 id="the-retmol-framework">The RetMol Framework</h2>
<p>RetMol consists of four components built around a pre-trained encoder-decoder backbone (<a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a>, a BART variant trained on ZINC):</p>
<h3 id="retrieval-database">Retrieval Database</h3>
<p>A task-specific collection of exemplar molecules that at least partially satisfy the design criteria. The database can be very small (e.g., 23 known inhibitors for the SARS-CoV-2 task) and is dynamically updated during inference with newly generated molecules.</p>
<h3 id="molecule-retriever">Molecule Retriever</h3>
<p>A heuristic-based module that selects the $K$ most relevant exemplar molecules (default $K = 10$). It first constructs a feasible set of molecules satisfying all constraints, then selects those with the best property scores. If too few molecules satisfy all constraints, it progressively relaxes constraints until enough candidates are available.</p>
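<p>A minimal sketch of this filter-rank-relax heuristic (the predicate ordering, property names, and scores are made up for illustration):</p>

```python
def retrieve(database, constraints, score, k=10):
    """Heuristic retriever sketch: keep molecules satisfying all
    constraints, rank by property score, and progressively drop
    constraints from the end of the list if too few candidates remain.
    `database` maps molecule -> dict of property values; `constraints`
    is an ordered list of predicates over that dict."""
    active = list(constraints)
    while True:
        feasible = [m for m, props in database.items()
                    if all(c(props) for c in active)]
        if len(feasible) >= k or not active:
            break
        active.pop()       # relax the last (least important) constraint
    feasible.sort(key=lambda m: score(database[m]), reverse=True)
    return feasible[:k]

# Toy database with made-up QED / synthetic-accessibility values.
db = {
    "mol_a": {"qed": 0.92, "sa": 2.1},
    "mol_b": {"qed": 0.55, "sa": 3.9},
    "mol_c": {"qed": 0.71, "sa": 5.2},
}
top = retrieve(db, [lambda p: p["qed"] >= 0.6, lambda p: p["sa"] <= 4],
               score=lambda p: p["qed"], k=2)
# only mol_a satisfies both constraints, so the SA constraint is
# relaxed and mol_c enters the feasible set
```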
<h3 id="information-fusion-via-cross-attention">Information Fusion via Cross-Attention</h3>
<p>The core trainable component. Retrieved exemplar embeddings are fused with the input molecule embedding using cross-attention:</p>
<p>$$\boldsymbol{e} = f_{\text{CA}}(\boldsymbol{e}_{\text{in}}, \boldsymbol{E}_r; \theta) = \text{Attn}(\text{Query}(\boldsymbol{e}_{\text{in}}), \text{Key}(\boldsymbol{E}_r)) \cdot \text{Value}(\boldsymbol{E}_r)$$</p>
<p>where $\boldsymbol{e}_{\text{in}} = \text{Enc}(x_{\text{in}}) \in \mathbb{R}^{L \times D}$ is the input embedding and $\boldsymbol{E}_r = [\boldsymbol{e}_r^1, \ldots, \boldsymbol{e}_r^K]$ are the retrieved exemplar embeddings. This module adds less than 5% parameter overhead (460K parameters over the 10M base model).</p>
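<p>In shapes, the fusion reduces to a single cross-attention. A NumPy sketch, with random projections standing in for the learned Query/Key/Value maps and illustrative dimensions:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(e_in, E_r, Wq, Wk, Wv):
    """Cross-attention fusion e = Attn(Query(e_in), Key(E_r)) . Value(E_r).
    e_in: (L, D) input embedding; E_r: (N, D) concatenated exemplar
    token embeddings; Wq/Wk/Wv: (D, D) projection matrices."""
    Q, K, V = e_in @ Wq, E_r @ Wk, E_r @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (L, N)
    return attn @ V                                           # (L, D)

rng = np.random.default_rng(0)
L, D, n_exemplar_tokens = 8, 16, 30     # illustrative sizes
e_in = rng.normal(size=(L, D))
E_r = rng.normal(size=(n_exemplar_tokens, D))
Wq, Wk, Wv = (0.1 * rng.normal(size=(D, D)) for _ in range(3))
e = fuse(e_in, E_r, Wq, Wk, Wv)         # fused embedding, shape (L, D)
```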
<h3 id="self-supervised-training-nearest-neighbor-prediction">Self-Supervised Training: Nearest Neighbor Prediction</h3>
<p>Rather than reconstructing the input molecule (which would make the retrieval module unnecessary), RetMol trains the fusion module to predict the nearest neighbor of the input:</p>
<p>$$\mathcal{L}(\theta) = \sum_{i=1}^{B} \text{CE}\left(\text{Dec}\left(f_{\text{CA}}(\boldsymbol{e}_{\text{in}}^{(i)}, \boldsymbol{E}_r^{(i)}; \theta)\right), x_{\text{1NN}}^{(i)}\right)$$</p>
<p>The remaining $K - 1$ nearest neighbors serve as the retrieved exemplar molecules. This forces the fusion module to learn how to use exemplar molecules to transform the input toward a related target. Only the fusion module parameters are updated; the encoder and decoder remain frozen.</p>
<h2 id="iterative-refinement-at-inference">Iterative Refinement at Inference</h2>
<p>During inference, RetMol uses an iterative process:</p>
<ol>
<li>Encode the input molecule and retrieved exemplars</li>
<li>Fuse embeddings via cross-attention</li>
<li>Perturb the fused embedding $M$ times with Gaussian noise</li>
<li>Greedily decode $M$ candidate molecules</li>
<li>Replace the input with the best candidate if it improves upon the current score</li>
<li>Add remaining good candidates to the retrieval database</li>
<li>Repeat until convergence or a maximum number of iterations</li>
</ol>
<p>The dynamic update of the retrieval database is critical for extrapolating beyond the initial set of exemplar molecules.</p>
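<p>The loop above can be sketched on a toy problem where "molecules" are scalars; all of <code>encode</code>, <code>fuse</code>, <code>decode</code>, <code>retrieve</code>, and <code>score</code> are placeholders for the real components, not the paper's implementation:</p>

```python
import numpy as np

def iterative_refine(x0, score, encode, fuse, decode, retrieve, db,
                     n_iter=10, M=4, sigma=1.0, seed=0):
    """Sketch of the inference loop: fuse, perturb M times, decode,
    keep the best improving candidate, and grow the retrieval database."""
    rng = np.random.default_rng(seed)
    x, best = x0, score(x0)
    for _ in range(n_iter):
        e = fuse(encode(x), retrieve(x, db))              # steps 1-2
        cands = [decode(e + sigma * rng.normal(size=e.shape))
                 for _ in range(M)]                       # steps 3-4
        top = max(cands, key=score)
        if score(top) > best:                             # step 5
            x, best = top, score(top)
            db.extend(c for c in cands if c is not top)   # step 6
    return x, best
```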
<h2 id="experiments-and-results">Experiments and Results</h2>
<p>RetMol is evaluated on four tasks of increasing difficulty:</p>
<h3 id="qed-optimization-under-similarity-constraint">QED Optimization Under Similarity Constraint</h3>
<p>Goal: generate molecules with QED $\geq$ 0.9 while maintaining <a href="https://en.wikipedia.org/wiki/Tanimoto_coefficient">Tanimoto similarity</a> $\geq$ 0.4 to the input. RetMol achieves 94.5% success rate, compared to 92.8% for the previous best (QMO).</p>
<h3 id="penalized-logp-optimization">Penalized LogP Optimization</h3>
<p>Goal: maximize penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">LogP</a> while maintaining structural similarity. At $\delta = 0.4$, RetMol achieves 11.55 average improvement, compared to 7.71 for QMO.</p>
<h3 id="gsk3beta--jnk3-dual-inhibitor-design"><a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>$\beta$ + <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> Dual Inhibitor Design</h3>
<p>Goal: simultaneously satisfy four constraints (GSK3$\beta$ inhibition $\geq$ 0.5, JNK3 inhibition $\geq$ 0.5, QED $\geq$ 0.6, SA $\leq$ 4). Results:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Success %</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>47.9</td>
          <td>0.561</td>
          <td>0.621</td>
      </tr>
      <tr>
          <td>RationaleRL</td>
          <td>74.8</td>
          <td>0.568</td>
          <td>0.701</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>92.3</td>
          <td>0.824</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>MolEvol</td>
          <td>93.0</td>
          <td>0.757</td>
          <td>0.681</td>
      </tr>
      <tr>
          <td>RetMol</td>
          <td>96.9</td>
          <td>0.862</td>
          <td>0.732</td>
      </tr>
  </tbody>
</table>
<p>RetMol achieves this without task-specific fine-tuning and requires only 80 iterations compared to MARS&rsquo;s 550.</p>
<h3 id="sars-cov-2-main-protease-inhibitor-optimization"><a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 Main Protease</a> Inhibitor Optimization</h3>
<p>A real-world task using only 23 known inhibitors as the retrieval database and optimizing 8 weakly binding drugs. Under the milder similarity constraint ($\delta = 0.4$), RetMol achieves 2.84 kcal/mol average binding affinity improvement versus 1.67 for Graph GA. Under the stricter constraint ($\delta = 0.6$), RetMol succeeds on 5/8 molecules versus 3/8 for Graph GA.</p>
<h2 id="key-analysis-findings">Key Analysis Findings</h2>
<ul>
<li><strong>Database size</strong>: Strong performance even with 100 molecules, already outperforming baselines on success rate</li>
<li><strong>Database quality</strong>: Molecules satisfying all four constraints give the best results (96.9%), but partial satisfaction still works reasonably (84.7% with two properties)</li>
<li><strong>Training objective</strong>: The nearest neighbor prediction objective outperforms conventional reconstruction on validity (0.902 vs. 0.834) and uniqueness (0.922 vs. 0.665)</li>
<li><strong>Dynamic database update</strong>: Essential for extrapolating beyond the initial retrieval database, generating molecules with property values exceeding the best in the original database</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>RetMol requires exemplar molecules that at least partially satisfy the design criteria. When such molecules are entirely unavailable, the framework cannot be applied. The method also relies on property predictors (for scoring and retrieval), whose accuracy directly affects generation quality. The iterative refinement process adds computational overhead at inference time, and the results depend on the Chemformer backbone&rsquo;s generation capabilities.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol</a></td>
          <td>Code</td>
          <td>NVIDIA Source Code License-NC</td>
          <td>Full training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol (checkpoints)</a></td>
          <td>Model</td>
          <td>CC BY-NC-SA 4.0</td>
          <td>Pre-trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k and ChEMBL datasets for training. Task-specific retrieval databases constructed from these datasets. COVID-19 task uses 23 known SARS-CoV-2 Mpro inhibitors.</p>
<p><strong>Training</strong>: Information fusion module trained on 4x V100 GPUs (16GB each) for approximately 2 hours. Batch size of 256 per GPU, 50K iterations.</p>
<p><strong>Inference</strong>: Single V100 GPU. Greedy decoding with Gaussian perturbation ($\sigma = 1$) for sampling multiple candidates per iteration.</p>
<p><strong>Backbone</strong>: Chemformer (BART variant) pre-trained on ZINC. Frozen during RetMol training and inference.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, Z., Nie, W., Qiao, Z., Xiao, C., Baraniuk, R. G., &amp; Anandkumar, A. (2023). Retrieval-based Controllable Molecule Generation. <em>Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023)</em>.</p>
<p><strong>Publication</strong>: International Conference on Learning Representations (ICLR) 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/NVlabs/RetMol">GitHub: NVlabs/RetMol</a></li>
<li><a href="https://openreview.net/forum?id=vDFA1tpuLvk">OpenReview</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2023retrieval,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Retrieval-based Controllable Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Zichao and Nie, Weili and Qiao, Zhuoran and Xiao, Chaowei and Baraniuk, Richard G. and Anandkumar, Anima}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=vDFA1tpuLvk}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Regression Transformer: Prediction Meets Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</guid><description>The Regression Transformer unifies property prediction and conditional generation in one multitask model by casting regression as sequence modelling.</description><content:encoded><![CDATA[<h2 id="a-multitask-model-that-unifies-regression-and-generation">A Multitask Model That Unifies Regression and Generation</h2>
<p>The Regression Transformer (RT) is a <strong>Method</strong> paper. It introduces a single model architecture that can both predict continuous molecular properties and conditionally generate molecules with desired property values. The core idea is to reformulate regression as a sequence modelling task: instead of training a dedicated regression head, continuous property values are tokenized into sequences of digits and predicted alongside molecular tokens using a cross-entropy loss.</p>
<h2 id="closing-the-gap-between-predictors-and-generators">Closing the Gap Between Predictors and Generators</h2>
<p>Existing transformer-based approaches in computational chemistry develop property predictors and generative models as separate systems. Even when a single architecture like <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> (Irwin et al., 2022) addresses both tasks, it does so through task-specific heads. This means the two capabilities remain disjoint, and the generative model cannot use its own property prediction ability during generation.</p>
<p>The RT addresses three specific gaps:</p>
<ol>
<li><strong>No true multitask entanglement</strong>: Prior work either tunes separate heads for prediction and generation or limits communication between modules to a reward signal.</li>
<li><strong>No inductive bias for continuous properties</strong>: Molecular generative models lack mechanisms to condition generation on floating-point property values.</li>
<li><strong>Disconnected workflows</strong>: Property predictors cannot generate molecules, and generators cannot assess whether their outputs satisfy property constraints.</li>
</ol>
<h2 id="core-innovation-regression-as-conditional-sequence-modelling">Core Innovation: Regression as Conditional Sequence Modelling</h2>
<p>The RT&rsquo;s key insight is that regression can be cast as sequential classification over digit tokens while preserving predictive accuracy. This is achieved through three components:</p>
<h3 id="numerical-tokenization">Numerical Tokenization</h3>
<p>Floating-point property values are split into individual digit tokens that preserve decimal order. Each token $t_{v,p}$ encodes a digit value $v \in [0, 9]$ and its decimal place $p \in \mathbb{Z}$. For example, the value 12.3 becomes the token sequence <code>[1_1, 2_0, 3_-1]</code>.</p>
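<p>A sketch of that tokenization, returning (digit, decimal place) pairs rather than the RT's actual token strings:</p>

```python
def tokenize_float(value, precision=1):
    """Split a non-negative float into (digit, decimal place) pairs,
    e.g. 12.3 -> [(1, 1), (2, 0), (3, -1)]."""
    int_part, frac_part = f"{value:.{precision}f}".split(".")
    tokens = [(int(d), len(int_part) - 1 - i) for i, d in enumerate(int_part)]
    tokens += [(int(d), -(i + 1)) for i, d in enumerate(frac_part)]
    return tokens
```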
<h3 id="numerical-encodings">Numerical Encodings</h3>
<p>To provide an inductive bias about the semantic proximity of digit tokens (which cross-entropy loss cannot convey), the RT introduces Numerical Encodings (NEs), analogous to positional encodings. For a token $t_{v,p}$ at embedding dimension $j$:</p>
<p>$$
\text{NE}_{\text{Float}}(v, p, j) = (-1)^j \cdot \frac{v \cdot 10^p}{j + 1}
$$</p>
<p>These encodings ensure that pairwise distances between digit tokens decay monotonically with their floating-point proximity. The model can also learn digit orderings from data alone, but NEs provide a useful inductive bias.</p>
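<p>A direct transcription of the formula, which also makes the distance-decay property easy to check:</p>

```python
import numpy as np

def numerical_encoding(v, p, dim):
    """NE_Float(v, p, j) = (-1)^j * v * 10^p / (j + 1) for j = 0..dim-1."""
    j = np.arange(dim)
    return (-1.0) ** j * (v * 10.0 ** p) / (j + 1)
```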
<h3 id="alternating-training-with-self-consistency">Alternating Training with Self-Consistency</h3>
<p>The RT uses an <a href="https://en.wikipedia.org/wiki/XLNet">XLNet</a> backbone trained with permutation language modelling (PLM). The key is that the same model serves two roles depending on which tokens are masked:</p>
<ul>
<li><strong>Mask numerical tokens</strong>: the model performs property prediction (regression)</li>
<li><strong>Mask textual tokens</strong>: the model performs conditional sequence generation</li>
</ul>
<p>The base PLM objective is:</p>
<p>$$
\mathcal{L}_{\text{PLM}} = \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{i=c+1}^{T} \log p_\theta(x_{z_i} \mid \mathbf{x}_{\mathbf{z}_{&lt; i}}) \right]
$$</p>
<p>This is refined into two specialized objectives: a property prediction objective $\mathcal{L}_P$ that masks only numerical tokens, and a generation objective $\mathcal{L}_G$ that masks only textual tokens. Training alternates between these every 50 steps.</p>
<p>The self-consistency (SC) loss adds a critical feedback loop. After generating a candidate molecule $\hat{\mathbf{x}}$, the model re-evaluates it by predicting the property of the generated sequence:</p>
<p>$$
\mathcal{L}_{\text{SC}} = \mathcal{L}_G(\mathbf{x}) + \alpha \cdot \mathcal{L}_P(\hat{\mathbf{x}})
$$</p>
<p>This rewards generating molecules whose predicted properties match the primed property value, exploiting the RT&rsquo;s dual capability as both predictor and generator.</p>
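<p>Schematically, one training step alternates between the two objectives; here <code>model</code> is a placeholder exposing the two losses and a generate call (not the actual RT API), and <code>alpha</code> weights the self-consistency term:</p>

```python
def training_step(step, x, model, alpha=1.0, switch_every=50):
    """Alternate objectives every `switch_every` steps: property prediction
    (mask numerical tokens) vs. generation with the self-consistency loss."""
    if (step // switch_every) % 2 == 0:
        return model.loss_P(x)                   # L_P: predict the property
    x_hat = model.generate(x)                    # decode a candidate molecule
    return model.loss_G(x) + alpha * model.loss_P(x_hat)   # L_SC
```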
<h2 id="experiments-across-molecules-proteins-and-reactions">Experiments Across Molecules, Proteins, and Reactions</h2>
<h3 id="drug-likeness-qed">Drug Likeness (QED)</h3>
<p>Initial validation on a synthetic QED dataset (~1.4M molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>) demonstrated that the RT can simultaneously learn to predict QED scores (RMSE &lt; 0.06) and generate novel molecules conditioned on desired QED values (Spearman&rsquo;s $\rho$ up to 0.517 between primers and generated molecule properties). Novelty exceeded 99% across all configurations. The alternating training scheme with SC loss outperformed both single-task models and the vanilla PLM objective.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> representations proved comparable to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> for property prediction and far superior for generation (~100% validity vs. ~40% for SMILES).</p>
<h3 id="moleculenet-regression-benchmarks">MoleculeNet Regression Benchmarks</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks ESOL, FreeSolv, and Lipophilicity, the RT outperformed XGBoost and MPNN baselines despite using only a classification loss. It performed on par with XLNet using a conventional regression head, and was only mildly inferior to models like BERT and BART that used large-scale self-supervised pre-training with regression losses.</p>
<p>Critically, only the RT could also conditionally generate molecules for these tasks. External validation with Grover (a self-supervised Graph Transformer) confirmed high correlation with the RT&rsquo;s own property predictions (0.86, 0.84, and 0.75 for ESOL, FreeSolv, and Lipophilicity respectively).</p>
<h3 id="constrained-property-optimization">Constrained Property Optimization</h3>
<p>On the penalized logP (plogP) benchmark with similarity constraints, the RT outperformed JT-VAE and GCPN by large margins. At similarity threshold $\delta = 0.4$, the RT achieved 3.16 average improvement with 97.1% success rate, while also predicting plogP with PCC of 0.92. Competing methods cannot perform property prediction at all.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Improvement ($\delta$=0.4)</th>
          <th>Success</th>
          <th>Property Prediction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.84</td>
          <td>83.6%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>2.49</td>
          <td>100%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>4.71</td>
          <td>85.7%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td><strong>RT</strong></td>
          <td><strong>3.16</strong></td>
          <td><strong>97.1%</strong></td>
          <td><strong>PCC = 0.92</strong></td>
      </tr>
  </tbody>
</table>
<p>The comparison is not strictly fair: all competing methods are trained specifically to maximize plogP, and some (GCPN, JT-VAE) apply gradient optimization at inference time. The RT is only trained to reconstruct molecules with similar predicted plogP to the seed, so its training objective is property-agnostic rather than directly optimizing for higher plogP values.</p>
<h3 id="protein-language-modelling">Protein Language Modelling</h3>
<p>On the TAPE benchmark, the RT matched or outperformed conventional transformers on fluorescence and stability prediction tasks, despite those baselines being pre-trained on 24-106 million protein sequences (vs. 2.6 million for the RT). The RT also performed conditional protein generation, a task that none of the TAPE baselines can address.</p>
<h3 id="chemical-reaction-modelling">Chemical Reaction Modelling</h3>
<p>The RT was applied to reaction yield prediction on <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig amination</a> and <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki coupling</a> datasets. It matched Yield-BERT performance ($R^2$ = 0.939 and 0.81 respectively) while also enabling novel capabilities: reconstructing missing precursors from partial reactions and decorating existing reactions to achieve higher predicted yields. Across both datasets, over 40% of top-five predicted sequences contained reactions with novel precursors and higher predicted yield.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Regression can be successfully reformulated as sequential classification over digit tokens without losing predictive accuracy compared to models using regression losses.</li>
<li>The alternating training scheme with self-consistency loss enables cross-task benefits, where the model outperforms single-task variants at both prediction and generation.</li>
<li>A single ~27M parameter model handles property prediction, conditional molecular generation, conditional protein generation, and reaction yield prediction with precursor generation.</li>
<li>The model learns the natural ordering of digits from data: 47% of embedding dimensions for the tenths place directly encode digit ordering even without explicit numerical encodings.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ol>
<li><strong>No large-scale pre-training</strong>: The RT uses ~27M parameters trained from scratch on task-specific datasets, unlike <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a> or MoLFormer which pre-train on billions of molecules. Scaling up could improve results.</li>
<li><strong>Fine-grained regression precision</strong>: The model sometimes struggles with intra-mode precision (e.g., on the fluorescence dataset where predictions cluster around bright/dark modes rather than capturing continuous variation).</li>
<li><strong>Single-property focus</strong>: All reported experiments use a single continuous property, though the framework naturally extends to multi-property settings.</li>
<li><strong>SELFIES validity caveats</strong>: While SELFIES are always syntactically valid, they can produce degenerate short molecules (~1.9% defective generations where the output has less than 50% of the seed&rsquo;s atoms).</li>
<li><strong>XLNet backbone limitations</strong>: Results on MoleculeNet regression are slightly below models using BART or BERT backbones with large-scale pre-training, suggesting the RT framework could benefit from stronger base models.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/regression-transformer">Regression Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/GT4SD/gt4sd-core">GT4SD Integration</a></td>
          <td>Code + Models</td>
          <td>MIT</td>
          <td>Pre-trained model inference pipelines</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></td>
          <td>Demo</td>
          <td>-</td>
          <td>Interactive inference webapp</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug likeness</td>
          <td>ChEMBL (QED)</td>
          <td>~1.4M molecules</td>
          <td>Synthetic QED labels computed with RDKit</td>
      </tr>
      <tr>
          <td>Regression benchmark</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipo)</td>
          <td>642-4,200 compounds</td>
          <td>16x SMILES augmentation, 3 random splits</td>
      </tr>
      <tr>
          <td>Property optimization</td>
          <td>ZINC (plogP)</td>
          <td>215,381 train / 799 test</td>
          <td>Fixed split from Jin et al. (2018)</td>
      </tr>
      <tr>
          <td>Protein pre-training</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a> (Boman)</td>
          <td>2,648,205 peptides</td>
          <td>15-45 amino acid peptides</td>
      </tr>
      <tr>
          <td>Protein benchmarks</td>
          <td>TAPE (Fluorescence, Stability)</td>
          <td>21,446-53,416 samples</td>
          <td>Fixed splits</td>
      </tr>
      <tr>
          <td>Reaction pre-training</td>
          <td>USPTO</td>
          <td>2,830,616 reactions</td>
          <td>Molecular weight as numerical property</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig / Suzuki</td>
          <td>3,955 / 5,760 reactions</td>
          <td>Ten 70/30 random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: XLNet (32 hidden layers, 256 hidden dim, 1024 FFN dim, 16 attention heads, 20% dropout)</li>
<li>Parameters: ~27 million</li>
<li>Training: Permutation language modelling pre-training, then alternating objectives (property prediction + conditional generation with SC loss)</li>
<li>Decoding: Greedy for property prediction, beam search for sequence generation</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>RT Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED prediction</td>
          <td>RMSE</td>
          <td>0.037</td>
          <td>Best config (NE + SC)</td>
      </tr>
      <tr>
          <td>QED generation</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.517</td>
          <td>Between primers and generated QED</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>Comparable to XLNet</td>
          <td>Within s.d. of regression-loss XLNet</td>
      </tr>
      <tr>
          <td>plogP optimization ($\delta$=0.4)</td>
          <td>Improvement</td>
          <td>3.16</td>
          <td>Outperforms JT-VAE, GCPN</td>
      </tr>
      <tr>
          <td>Protein fluorescence</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.72</td>
          <td>Outperforms TAPE baselines</td>
      </tr>
      <tr>
          <td>BH yield prediction</td>
          <td>$R^2$</td>
          <td>0.939</td>
          <td>Near Yield-BERT (0.951)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained on single GPUs (NVIDIA A100 or V100)</li>
<li>Training time: ~4 days for pre-training, ~1 day for fine-tuning</li>
<li>Framework: PyTorch 1.3.1 with HuggingFace Transformers 3.1.0</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Born, J. &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <em>Nature Machine Intelligence</em>, 5(4), 432-444. <a href="https://doi.org/10.1038/s42256-023-00639-z">https://doi.org/10.1038/s42256-023-00639-z</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence, April 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/regression-transformer">Regression Transformer GitHub Repository</a></li>
<li><a href="https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer">GT4SD Integration</a></li>
<li><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{born2023regression,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Regression Transformer enables concurrent sequence regression and generation for molecular language modelling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Born, Jannis and Manica, Matteo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{432--444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LIMO: Latent Inceptionism for Targeted Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/</guid><description>LIMO uses gradient-based optimization through a VAE latent space and stacked property predictor to generate drug-like molecules with high binding affinity.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eckmann, P., Sun, K., Zhao, B., Feng, M., Gilson, M. K., &amp; Yu, R. (2022). LIMO: Latent Inceptionism for Targeted Molecule Generation. <em>Proceedings of the 39th International Conference on Machine Learning (ICML 2022)</em>, PMLR 162, 5777&ndash;5792.</p>
<p><strong>Publication</strong>: ICML 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Rose-STL-Lab/LIMO">GitHub: Rose-STL-Lab/LIMO</a></li>
<li><a href="https://arxiv.org/abs/2206.09010">arXiv: 2206.09010</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{eckmann2022limo,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LIMO: Latent Inceptionism for Targeted Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Eckmann, Peter and Sun, Kunyang and Zhao, Bo and Feng, Mudong and Gilson, Michael K and Yu, Rose}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5777--5792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">organization</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="gradient-based-reverse-optimization-in-molecular-latent-space">Gradient-Based Reverse Optimization in Molecular Latent Space</h2>
<p>This is a <strong>Method</strong> paper that introduces LIMO, a framework for generating molecules with desired properties using gradient-based optimization on a VAE latent space. The key innovation is a stacked architecture where a property predictor operates on the decoded molecular representation rather than directly on the latent space, combined with an inceptionism-like technique that backpropagates through the frozen decoder and predictor to optimize the latent code. This approach is 6-8x faster than RL baselines and 12x faster than sampling-based approaches while producing molecules with higher binding affinities.</p>
<h2 id="slow-property-optimization-in-existing-methods">Slow Property Optimization in Existing Methods</h2>
<p>Generating molecules with high binding affinity to target proteins is a central goal of early drug discovery, but existing computational approaches are slow when optimizing for properties that are expensive to evaluate (such as docking-based binding affinity). RL-based methods require many calls to the property function during training. Sampling-based approaches like MARS need hundreds of iterations. Latent optimization methods that predict properties directly from the latent space suffer from poor prediction accuracy because the mapping from latent space to molecular properties is difficult to learn.</p>
<h2 id="the-limo-framework">The LIMO Framework</h2>
<p>LIMO consists of three components: a VAE for learning a molecular latent space, a property predictor with a novel stacked architecture, and a gradient-based reverse optimization procedure.</p>
<h3 id="selfies-based-vae">SELFIES-Based VAE</h3>
<p>The VAE encodes molecules represented as SELFIES strings into a 1024-dimensional latent space $\mathbf{z} \in \mathbb{R}^m$ and decodes to probability distributions over SELFIES symbols. Since all SELFIES strings correspond to valid molecules, this guarantees 100% chemical validity. The output molecule is obtained by taking the argmax at each position:</p>
<p>$$\hat{x}_i = s_{d_i^*}, \quad d_i^* = \operatorname{argmax}_{d} \{y_{i,1}, \ldots, y_{i,d}\}$$</p>
<p>The VAE uses fully-connected layers (not recurrent), with a 64-dimensional embedding layer, four batch-normalized linear layers (2000-dimensional first layer, 1000-dimensional for the rest) with ReLU activation, and is trained with ELBO loss (0.9 weight on reconstruction, 0.1 on KL divergence).</p>
<h3 id="stacked-property-predictor">Stacked Property Predictor</h3>
<p>The critical architectural choice: the property predictor $g_\theta$ takes the decoded molecular representation $\hat{\mathbf{x}}$ as input rather than the latent code $\mathbf{z}$. The predictor is trained after the VAE is frozen by minimizing MSE on VAE-generated molecules:</p>
<p>$$\ell_0(\theta) = \left\| g_\theta\left(f_{\text{dec}}(\mathbf{z})\right) - \pi\left(f_{\text{dec}}(\mathbf{z})\right) \right\|^2$$</p>
<p>where $\pi$ is the ground-truth property function. This stacking improves prediction accuracy from $r^2 = 0.04$ (predicting from $\mathbf{z}$) to $r^2 = 0.38$ (predicting from $\hat{\mathbf{x}}$) on an unseen test set. The improvement comes because the mapping from molecular space to property is easier to learn than the mapping from latent space to property.</p>
<h3 id="reverse-optimization-inceptionism">Reverse Optimization (Inceptionism)</h3>
<p>After training, the decoder and predictor weights are frozen and $\mathbf{z}$ becomes the trainable parameter. For multiple properties with weights $(w_1, \ldots, w_k)$, the optimization minimizes:</p>
<p>$$\ell_1(\mathbf{z}) = -\sum_{i=1}^{k} w_i \cdot g^i\left(f_{\text{dec}}(\mathbf{z})\right)$$</p>
<p>Since both the decoder and predictor are neural networks, gradients flow through the entire chain, enabling efficient optimization with Adam. This is analogous to the &ldquo;inceptionism&rdquo; (DeepDream) technique from computer vision, where network inputs are optimized to maximize specific outputs.</p>
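<p>The mechanics can be seen with a toy stand-in: freeze two maps, treat $\mathbf{z}$ as the parameter, and follow the chained gradient. The linear &ldquo;decoder&rdquo; and &ldquo;predictor&rdquo; below are hypothetical, and plain gradient ascent stands in for the paper&rsquo;s Adam updates:</p>

```python
import numpy as np

# Toy illustration of reverse optimization: decoder f_dec and predictor
# g are frozen (here, hypothetical linear maps, not LIMO's networks),
# z is the only trainable parameter, and gradient ascent increases
# the predicted property g(f_dec(z)).
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(8, 4))   # frozen "decoder": z (4-d) -> x-hat (8-d)
w_pred = rng.normal(size=8)       # frozen "predictor": x-hat -> property

def predicted_property(z):
    return float(w_pred @ (W_dec @ z))

z = rng.normal(size=4)
before = predicted_property(z)
grad = W_dec.T @ w_pred           # chain rule through both frozen maps
for _ in range(50):
    z = z + 0.1 * grad            # ascend the predicted property
after = predicted_property(z)
print(f"{before:.2f} -> {after:.2f}")
```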
<h3 id="substructure-constrained-optimization">Substructure-Constrained Optimization</h3>
<p>For lead optimization, LIMO can fix a molecular substructure during optimization by adding a regularization term:</p>
<p>$$\ell_2(\mathbf{z}) = \lambda \sum_{i=1}^{n} \sum_{j=1}^{d} \left(M_{i,j} \cdot \left(f_{\text{dec}}(\mathbf{z})_{i,j} - (\hat{\mathbf{x}}_{\text{start}})_{i,j}\right)\right)^2$$</p>
<p>where $M$ is a binary mask specifying which SELFIES positions must remain unchanged and $\lambda = 1000$. This capability is enabled by the intermediate decoded representation, which most VAE-based methods lack.</p>
<h2 id="experiments-and-results">Experiments and Results</h2>
<h3 id="benchmark-tasks-qed-and-penalized-logp">Benchmark Tasks (QED and Penalized LogP)</h3>
<p>LIMO achieves competitive results with deep generative and RL-based models in 1 hour, compared to 8-24 hours for baselines. Top QED score: 0.947 (maximum possible: 0.948). Top penalized LogP: 10.5 (among length-limited models, comparable to MolDQN&rsquo;s 11.8).</p>
<p>The ablation study (&ldquo;LIMO on z&rdquo;) confirms the value of the stacked predictor architecture: predicting from $\hat{\mathbf{x}}$ yields a top p-logP of 10.5 versus 6.52 when predicting directly from $\mathbf{z}$.</p>
<h3 id="binding-affinity-maximization">Binding Affinity Maximization</h3>
<p>The primary contribution. LIMO generates molecules with substantially higher computed binding affinities (lower $K_D$) than baselines against two protein targets:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>ESR1 best $K_D$ (nM)</th>
          <th>ACAA1 best $K_D$ (nM)</th>
          <th>Time (hrs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>6.4</td>
          <td>75</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MolDQN</td>
          <td>373</td>
          <td>240</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>17</td>
          <td>163</td>
          <td>6</td>
      </tr>
      <tr>
          <td>GraphDF</td>
          <td>25</td>
          <td>370</td>
          <td>12</td>
      </tr>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>37</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p>For ESR1, LIMO&rsquo;s best molecule has a $K_D$ of 0.72 nM from docking, nearly 10x better than the next method (GCPN at 6.4 nM). When corroborated with more rigorous absolute binding free energy (ABFE) calculations, one LIMO compound achieved a predicted $K_D$ of $6 \times 10^{-14}$ M (0.00006 nM), far exceeding the affinities of approved drugs tamoxifen ($K_D$ = 1.5 nM) and raloxifene ($K_D$ = 0.03 nM).</p>
<h3 id="multi-objective-optimization">Multi-Objective Optimization</h3>
<p>Single-objective optimization produces molecules with high affinity but problematic structures (polyenes, large rings). Multi-objective optimization simultaneously targeting binding affinity, QED ($&gt;$ 0.4), and SA ($&lt;$ 5.5) produces drug-like, synthesizable molecules that still have nanomolar binding affinities. Generated molecules satisfy Lipinski&rsquo;s rule of 5 with zero PAINS alerts.</p>
<h2 id="limitations">Limitations</h2>
<p>The LIMO property predictor achieves only moderate prediction accuracy ($r^2$ = 0.38), meaning the optimization relies on the gradient direction being correct rather than on the absolute predictions being accurate. AutoDock-GPU docking scores do not correlate well with the more accurate ABFE results, a known limitation of docking. The fully-connected VAE architecture limits molecular diversity compared to recurrent or attention-based alternatives (an LSTM decoder produced a max QED of only 0.3). The greedy fine-tuning step (replacing carbons with heteroatoms) is a heuristic rather than a learned procedure.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Rose-STL-Lab/LIMO">Rose-STL-Lab/LIMO</a></td>
          <td>Code</td>
          <td>UC San Diego Custom (non-commercial)</td>
          <td>Full training, optimization, and evaluation code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k dataset for optimization tasks. MOSES dataset for random generation evaluation. Binding affinities computed with AutoDock-GPU.</p>
<p><strong>Hardware</strong>: Two GTX 1080 Ti GPUs (one for PyTorch, one for AutoDock-GPU), 4 CPU cores, 32 GB memory.</p>
<p><strong>Training</strong>: VAE trained for 18 epochs with learning rate 0.0001. Property predictor uses 3 layers of 1000 units, trained for 5 epochs. Reverse optimization uses learning rate 0.1 for 10 epochs.</p>
<p><strong>Targets</strong>: Human estrogen receptor (ESR1, PDB 1ERR) and human peroxisomal acetyl-CoA acyl transferase 1 (ACAA1, PDB 2IIK).</p>
]]></content:encoded></item><item><title>BARTSmiles: BART Pre-Training for Molecular SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/</guid><description>BARTSmiles applies BART-style denoising pre-training to 1.7B SMILES from ZINC20, achieving top results on 11 molecular property and reaction tasks.</description><content:encoded><![CDATA[<h2 id="a-bart-based-method-for-molecular-self-supervised-learning">A BART-Based Method for Molecular Self-Supervised Learning</h2>
<p>BARTSmiles is a <strong>Method</strong> paper. It introduces a self-supervised pre-training approach for molecular representations based on the BART (Bidirectional and Auto-Regressive Transformers) architecture from Lewis et al. (2019). The primary contribution is a pre-training strategy, discovered through systematic ablations, that trains a BART-large model on 1.7 billion deduplicated <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> from the <a href="/notes/chemistry/datasets/zinc-22/">ZINC20 dataset</a>. BARTSmiles achieves the best reported results on 11 tasks spanning molecular property classification, regression, and chemical reaction generation.</p>
<h2 id="scaling-self-supervised-molecular-representations-beyond-prior-work">Scaling Self-Supervised Molecular Representations Beyond Prior Work</h2>
<p>At the time of publication, large-scale self-supervised representation learning had produced significant improvements in NLP, computer vision, and speech, but molecular representation learning had not benefited from comparable scale. Previous SMILES-based pre-trained models such as <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> (Chithrananda et al., 2020) and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a> (Irwin et al., 2022) used encoder-only or encoder-decoder architectures with substantially less compute. ChemFormer, the most closely related prior work, also trained a BART-like model but with a fraction of the compute and data.</p>
<p>The paper argues that three gaps needed to be addressed:</p>
<ol>
<li><strong>Scale</strong>: Prior molecular pre-training used orders of magnitude less compute than NLP pre-training.</li>
<li><strong>Architecture choice</strong>: Encoder-only models like ChemBERTa cannot perform generative fine-tuning (retrosynthesis, reaction prediction), limiting their applicability.</li>
<li><strong>Pre-training recipe</strong>: Standard BART hyperparameters (e.g., 30% mask token budget) were tuned for natural language and had not been validated for molecular SMILES strings.</li>
</ol>
<h2 id="core-innovation-ablation-driven-pre-training-recipe-for-smiles">Core Innovation: Ablation-Driven Pre-Training Recipe for SMILES</h2>
<p>The key insight of BARTSmiles is that the BART denoising objective, when carefully tuned for the molecular domain, learns representations that implicitly encode downstream task information. The authors discover this through a systematic three-stage ablation:</p>
<h3 id="tokenization">Tokenization</h3>
<p>Rather than using hand-crafted tokenization rules that separate individual atoms (C, N, H) and bond symbols (#, =), BARTSmiles uses a learned SentencePiece unigram tokenizer trained on 10 million random SMILES with a vocabulary size of 1,021. On matched compute budgets, learned tokenization achieves 0.801 average AUC-ROC vs. 0.779 for hand-crafted tokenization on the ablation benchmark (HIV, BBBP, ClinTox).</p>
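<p>For contrast with the learned tokenizer, a hand-crafted baseline can be sketched as a regex split. The pattern below is a common SMILES tokenization idiom used purely for illustration, not the paper&rsquo;s exact rule set:</p>

```python
import re

# Hand-crafted baseline tokenization: split SMILES into bracket atoms,
# two-letter elements, single atoms, bond/branch symbols, and ring
# digits. Illustrative pattern only; BARTSmiles replaces this with a
# learned SentencePiece unigram tokenizer.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|[BCNOPSFIbcnops]|[=#/\\()+\-.%@]|\d)"
)

def hand_crafted_tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)

print(hand_crafted_tokenize("CC(=O)Oc1ccccc1"))  # aspirin fragment, 15 tokens
```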
<h3 id="masking-strategy">Masking Strategy</h3>
<p>The BART denoising objective has three main hyperparameters: the mask token budget (fraction of tokens masked), random mask probability, and the Poisson $\lambda$ controlling mask span length. The ablation results show:</p>
<ul>
<li><strong>Mask token budget</strong>: The standard BART value of 0.30 is suboptimal for molecules. A budget of 0.20 performs best (0.821 AUC-ROC), with performance degrading at both lower (0.10: 0.753) and higher (0.40: 0.701) budgets.</li>
<li><strong>Span masking</strong>: The choice of random mask probability and $\lambda$ has a minor effect once the budget is set to 0.20. Values of random mask = 0.10 and $\lambda$ = 2.5 or 3.5 all yield 0.821.</li>
<li><strong>Token randomization</strong>: Disabling the randomize-tokens noise (where some tokens are replaced with random tokens rather than masked) improves performance from 0.821 to 0.835.</li>
</ul>
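<p>The recipe the ablations settle on (20% budget, Poisson spans, no token randomization) can be sketched as follows. Character-level tokens and the fixed seed are purely illustrative; the real pipeline operates on learned tokenizer output:</p>

```python
import numpy as np

# Sketch of the selected noising recipe: mask roughly 20% of tokens in
# Poisson-length spans (lambda = 2.5) and skip random token
# replacement. Illustrative only, not the BARTSmiles implementation.
def span_mask(tokens, budget=0.20, lam=2.5, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    n_to_mask = int(round(budget * len(tokens)))
    masked = 0
    while n_to_mask - masked > 0:
        span = max(1, int(rng.poisson(lam)))
        span = min(span, n_to_mask - masked)
        start = int(rng.integers(0, len(tokens) - span + 1))
        for i in range(start, start + span):
            tokens[i] = "[MASK]"   # spans may overlap earlier masks
        masked += span
    return tokens

noised = span_mask(list("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, char-tokenized
print("".join(noised))
```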
<h3 id="scale">Scale</h3>
<p>Training on the full 1.7 billion molecule ZINC20 dataset (20 hours on 1,024 A100 GPUs, totaling 20,480 A100 GPU-hours) improves performance by 5 absolute AUC-ROC points over the same model trained on 100 million samples. The previous most compute-intensive molecular pre-training used 3,330 V100-hours (Ross et al., 2021).</p>
<h3 id="implicit-task-encoding">Implicit Task Encoding</h3>
<p>The paper provides a quantitative demonstration that frozen BARTSmiles representations encode task-specific information. Using L1-regularized logistic regression on frozen 1,024-dimensional mean-pooled representations, just 7 neurons are sufficient to achieve 0.987 AUC-ROC on ClinTox (within 2 percentage points of full fine-tuning). Even a single neuron achieves 0.77 AUC-ROC on ClinTox subtask 1.</p>
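<p>The probe itself is ordinary sparse logistic regression on frozen features. The toy below uses synthetic 64-d features with 3 informative dimensions and a hand-rolled proximal gradient loop; the paper probes real 1,024-d mean-pooled BARTSmiles representations with L1-regularized logistic regression:</p>

```python
import numpy as np

# Toy sparse-probe: L1-regularized logistic regression via proximal
# gradient descent on synthetic "frozen features". Only 3 of 64
# dimensions carry signal, so the L1 penalty should zero out most
# weights while keeping accuracy high.
rng = np.random.default_rng(1)
n, d = 400, 64
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [3.0, -2.0, 1.5]              # only 3 informative dims
y = (X @ true_w + 0.1 * rng.normal(size=n) > 0).astype(float)

w, lam, lr = np.zeros(d), 0.2, 0.1
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w = w - lr * (X.T @ (p - y) / n)       # logistic gradient step
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # L1 soft-threshold

active = int(np.count_nonzero(w))
accuracy = float(np.mean(((X @ w) > 0) == (y > 0.5)))
print(f"{active} active neurons, train accuracy {accuracy:.2f}")
```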
<h2 id="experimental-setup-moleculenet-toxicology-and-generative-benchmarks">Experimental Setup: MoleculeNet, Toxicology, and Generative Benchmarks</h2>
<h3 id="classification-tasks">Classification Tasks</h3>
<p>BARTSmiles is evaluated on 7 classification datasets from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (SIDER, ClinTox, Tox21, ToxCast, HIV, BACE, BBBP) plus 2 toxicology datasets (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames</a>, <a href="https://en.wikipedia.org/wiki/Micronucleus_test">Micronucleus Assay</a>). All classification tasks use AUC-ROC. Baselines include both supervised graph models (D-MPNN, Attentive FP, 3D InfoMax) and self-supervised methods (ChemBERTa, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer-XL</a>, GROVER-large, MolCLR, iMolCLR).</p>
<p>Selected classification results (AUC-ROC):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td><strong>0.997</strong></td>
          <td>0.954</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td><strong>0.825</strong></td>
          <td>0.805</td>
          <td>Attentive FP</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td><strong>0.705</strong></td>
          <td>0.699</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>0.851</td>
          <td>0.858</td>
          <td>Attentive FP</td>
      </tr>
  </tbody>
</table>
<p>The authors note that three scaffold-split datasets (HIV, BACE, BBBP) are highly sensitive to the specific split used, and they suspect some baseline results use different or random splits. These results are marked with caveats in the paper.</p>
<h3 id="regression-tasks">Regression Tasks</h3>
<p>All three MoleculeNet regression tasks (ESOL, FreeSolv, Lipophilicity) are evaluated using RMSE:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td><strong>0.095</strong></td>
          <td>0.279</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td><strong>0.114</strong></td>
          <td>0.231</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td><strong>0.292</strong></td>
          <td>0.529</td>
          <td>MoLFormer-XL</td>
      </tr>
  </tbody>
</table>
<p>BARTSmiles achieves substantial improvements on all three regression tasks.</p>
<h3 id="generative-tasks">Generative Tasks</h3>
<p><strong><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></strong> (USPTO-50k): BARTSmiles achieves 55.6% Top-1 accuracy using a sample-128 + perplexity re-ranking strategy, compared to 55.3% for Dual-TF and 54.3% for ChemFormer. Top-5 and Top-10 results are 74.2% and 80.9% respectively.</p>
<p><strong>Chemical Reaction Prediction</strong> (USPTO MIT/LEF/STEREO): BARTSmiles with beam search outperforms the <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> baseline across all six evaluation settings. On USPTO-MIT (split), BARTSmiles achieves 91.8% vs. 90.4% for the Transformer baseline.</p>
<h3 id="fine-tuning-recipe">Fine-Tuning Recipe</h3>
<p>The fine-tuning approach is designed to minimize hyperparameter tuning:</p>
<ul>
<li>Batch size 16, 10 epochs, polynomial decay learning rate schedule with warmup at 16% of training</li>
<li>Grid search over dropout (0.1, 0.2, 0.3) and learning rate ($5 \times 10^{-6}$, $1 \times 10^{-5}$, $3 \times 10^{-5}$)</li>
<li>Stochastic Weight Averaging (SWA) over three sets of four checkpoints</li>
<li>For generative tasks: R3F regularization (Aghajanyan et al., 2020a) and full fp32 precision</li>
<li>For generation: beam search (beam size 10) or sample 128 sequences with perplexity re-ranking</li>
</ul>
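<p>The SWA step reduces to element-wise averaging of checkpoint weights. A minimal sketch with toy 2-parameter &ldquo;checkpoints&rdquo; (real SWA averages full model state dicts):</p>

```python
import numpy as np

# Stochastic Weight Averaging in miniature: average the weights of
# several late-training checkpoints element-wise. Toy values only.
checkpoints = [np.array([0.9, 1.1]), np.array([1.1, 0.9]),
               np.array([1.0, 1.0]), np.array([1.0, 1.0])]
swa_weights = sum(checkpoints) / len(checkpoints)
print(swa_weights)
```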
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Scale matters for molecular pre-training</strong>: Training on 1.7B molecules with 20,480 A100 GPU-hours yields 5 absolute points of AUC-ROC improvement over training on 100M molecules.</li>
<li><strong>Domain-specific ablation is necessary</strong>: The optimal BART masking configuration for molecules (20% budget, no token randomization) differs from the standard NLP configuration (30% budget, with randomization).</li>
<li><strong>Frozen representations capture task structure</strong>: A small number of neurons from the frozen model can nearly match full fine-tuning performance on certain tasks, suggesting the pre-training objective implicitly encodes molecular properties.</li>
<li><strong>Interpretability aligns with domain knowledge</strong>: Integrated Gradients attribution on fine-tuned BARTSmiles highlights known structural alerts (e.g., <a href="https://en.wikipedia.org/wiki/Nitro_compound">nitro groups</a> in mutagenic compounds, hydroxyl groups in soluble compounds).</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Scaffold split sensitivity</strong>: Results on HIV, BACE, and BBBP are sensitive to the specific scaffold split, making direct comparison with baselines difficult.</li>
<li><strong>Pre-training data distribution</strong>: The <a href="https://en.wikipedia.org/wiki/Fr%C3%A9chet_distance">Fréchet distance</a> analysis shows that some downstream datasets (BBBP, SIDER) are far from ZINC20 in representation space, which may explain weaker performance on those tasks.</li>
<li><strong>Fingerprints carry complementary information</strong>: On the Ames and Micronucleus Assay datasets, BARTSmiles alone does not beat fingerprint-based baselines. Combining BARTSmiles with ECFP4 fingerprints closes the gap, implying that SMILES-based pre-training does not fully capture all structural information.</li>
<li><strong>Compute requirements</strong>: Pre-training requires 1,024 A100 GPUs, which limits accessibility.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest investigating the impact of pre-training data composition, noting that ZINC20 contains over a billion molecules but its distribution may be irrelevant for many downstream tasks. They also propose further collaboration between ML and chemistry experts to discover new molecular substructure-property relationships.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pre-training, fine-tuning, and evaluation scripts with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20 (deduplicated)</td>
          <td>~1.7B molecules</td>
          <td>Canonicalized SMILES, 10K validation holdout</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (7 datasets)</td>
          <td>1,427-41,127 compounds</td>
          <td>AUC-ROC metric</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (3 datasets)</td>
          <td>642-4,200 compounds</td>
          <td>RMSE metric</td>
      </tr>
      <tr>
          <td>Toxicology</td>
          <td>Ames, MN Assay</td>
          <td>6,512 / 641 compounds</td>
          <td>Cross-validation for Ames; external test for MN</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50k</td>
          <td>Standard split</td>
          <td>Top-K accuracy</td>
      </tr>
      <tr>
          <td>Reaction prediction</td>
          <td>USPTO (MIT/LEF/STEREO)</td>
          <td>Standard splits</td>
          <td>Top-1 accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BART-Large (pre-layer norm Transformer encoder-decoder)</li>
<li>Tokenizer: SentencePiece unigram, vocabulary size 1,021, max sequence length 128</li>
<li>Pre-training objective: BART denoising (mask token budget 0.20, Poisson span masking with $\lambda$ = 2.5, no token randomization)</li>
<li>Fine-tuning: polynomial decay LR, SWA, grid search over dropout and LR</li>
<li>Generative fine-tuning: R3F regularization, fp32 precision, Adam initialized from pre-training moving averages</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BART-Large architecture (exact parameter count not specified in paper)</li>
<li>Pre-trained checkpoint released on GitHub</li>
<li>Maximum sequence length: 128 tokens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>BARTSmiles</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td>AUC-ROC</td>
          <td>0.997</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>AUC-ROC</td>
          <td>0.825</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.095</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>0.114</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>RMSE</td>
          <td>0.292</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>USPTO-50k Retro (Top-1)</td>
          <td>Accuracy</td>
          <td>55.6%</td>
          <td>New SOTA (sample + re-rank)</td>
      </tr>
      <tr>
          <td>USPTO-MIT Rxn (Split)</td>
          <td>Accuracy</td>
          <td>91.8%</td>
          <td>New SOTA (beam-10)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 1,024 NVIDIA A100 GPUs for 20 hours (20,480 A100 GPU-hours)</li>
<li>Ablation runs: 128 A100 GPUs per run</li>
<li>Framework: FairSeq with FairScale (fully sharded data parallel), automatic mixed precision</li>
<li>Experiment tracking: Aim</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chilingaryan, G., Tamoyan, H., Tevosyan, A., Babayan, N., Khondkaryan, L., Hambardzumyan, K., Navoyan, Z., Khachatrian, H., &amp; Aghajanyan, A. (2024). BARTSmiles: Generative Masked Language Models for Molecular Representations. <em>Journal of Chemical Information and Modeling</em>, 64(15), 5832-5843. <a href="https://doi.org/10.1021/acs.jcim.4c00512">https://doi.org/10.1021/acs.jcim.4c00512</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2024 (preprint: arXiv 2022)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles GitHub Repository (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chilingaryan2024bartsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BARTSmiles: Generative Masked Language Models for Molecular Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chilingaryan, Gayane and Tamoyan, Hovhannes and Tevosyan, Ani and Babayan, Nelly and Khondkaryan, Lusine and Hambardzumyan, Karen and Navoyan, Zaven and Khachatrian, Hrant and Aghajanyan, Armen}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5832--5843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00512}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGen: Molecular Generation with Chemical Feedback</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/</guid><description>MolGen pre-trains on SELFIES molecules and uses chemical feedback to align generated molecules with real-world chemical preferences across domains.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-method-for-molecular-generation">A SELFIES-Based Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces MolGen, a pre-trained molecular language model for generating molecules with desired chemical properties. The primary contribution is a three-part framework: (1) pre-training on 100M+ molecular SELFIES to learn structural and grammatical knowledge, (2) domain-agnostic molecular prefix tuning for cross-domain knowledge transfer, and (3) a chemical feedback paradigm that aligns the model&rsquo;s generative probabilities with real-world chemical preferences. MolGen is the first language model pre-trained on SELFIES rather than SMILES, which guarantees 100% syntactic validity of generated molecules.</p>
<h2 id="challenges-in-language-model-based-molecule-generation">Challenges in Language Model-Based Molecule Generation</h2>
<p>Generating novel molecules with desirable properties is a central task in drug discovery and chemical design. The molecular space is estimated at $10^{33}$ possible structures, making exhaustive search impractical. Prior deep generative approaches face several limitations:</p>
<ol>
<li><strong>Syntactic invalidity</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based language models frequently generate strings that do not correspond to valid molecular graphs. A single random mutation of a SMILES string has only a 9.9% chance of remaining valid.</li>
<li><strong>Narrow domain focus</strong>: Most existing models focus exclusively on synthetic molecules and neglect <a href="https://en.wikipedia.org/wiki/Natural_product">natural products</a>, which have distinct structural complexity and scaffold diversity.</li>
<li><strong>Molecular hallucinations</strong>: Generated molecules may satisfy chemical structural rules yet fail to exhibit anticipated chemical activity in practical applications. The authors formally define this as molecules that &ldquo;comply with chemical structural rules, yet fail to exhibit practical utility or the anticipated properties.&rdquo;</li>
<li><strong>Limited optimization signals</strong>: Existing approaches rely on reinforcement learning (high variance), fixed-dimensional latent spaces, or expert-provided generation rules, all of which impede efficient exploration of chemical space.</li>
</ol>
<h2 id="core-innovation-pre-training-with-selfies-and-chemical-feedback">Core Innovation: Pre-training with SELFIES and Chemical Feedback</h2>
<p>MolGen&rsquo;s novelty rests on three interconnected components.</p>
<h3 id="selfies-based-pre-training">SELFIES-Based Pre-training</h3>
<p>MolGen uses <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (Self-Referencing Embedded Strings) instead of SMILES. SELFIES guarantees that every possible combination of symbols in the alphabet corresponds to a chemically valid molecular graph. The model uses a compact vocabulary of 185 tokens.</p>
<p>The first pre-training stage uses a BART-style encoder-decoder. Tokens from a SELFIES string $S = \{s_1, \ldots, s_l\}$ are randomly replaced with [MASK], then the corrupted input is encoded bidirectionally and decoded left-to-right. The reconstruction loss is:</p>
<p>$$
\mathcal{L}_{\text{ce}}(S) = -\sum_{j=1}^{l} \sum_{s} p_{\text{true}}(s \mid S, S_{&lt; j}) \log p_{\theta}(s \mid S, S_{&lt; j}; \theta)
$$</p>
<p>where $S_{&lt; j}$ denotes the partial sequence ${s_0, \ldots, s_{j-1}}$ and $p_{\text{true}}$ is the one-hot distribution under standard maximum likelihood estimation.</p>
<h3 id="domain-agnostic-molecular-prefix-tuning">Domain-Agnostic Molecular Prefix Tuning</h3>
<p>The second pre-training stage introduces shared prefix vectors $P_k, P_v \in \mathbb{R}^{m \times d}$ prepended to the keys and values of multi-head attention at each layer. Unlike conventional prefix tuning that freezes model parameters, MolGen updates the entire model. The attention output becomes:</p>
<p>$$
\text{head} = \text{Attn}\left(xW_q, [P_k, XW_k], [P_v, XW_v]\right)
$$</p>
<p>This decomposes into a linear interpolation between prefix attention and standard attention:</p>
<p>$$
\text{head} = \lambda(x) \cdot \text{Attn}(xW_q, P_k, P_v) + (1 - \lambda(x)) \cdot \text{Attn}(xW_q, XW_k, XW_v)
$$</p>
<p>where $\lambda(x)$ is a scalar representing the sum of normalized attention weights on the prefixes. The prefixes are trained simultaneously across synthetic and natural product domains, acting as a domain instructor.</p>
<h3 id="chemical-feedback-paradigm">Chemical Feedback Paradigm</h3>
<p>To address molecular hallucinations, MolGen aligns the model&rsquo;s probabilistic rankings with chemical preference rankings. Given a molecule $S$ and a set of candidate outputs $\mathcal{S}^*$ with distinct property scores $\text{Ps}(\cdot)$, the target distribution should rank higher-scoring candidates above lower-scoring ones:</p>
<p>$$
p_{\text{true}}(S_i \mid S) &gt; p_{\text{true}}(S_j \mid S), \quad \forall S_i, S_j \in \mathcal{S}^*, \text{Ps}(S_i) &gt; \text{Ps}(S_j)
$$</p>
<p>This is enforced via a rank loss:</p>
<p>$$
\mathcal{L}_{\text{rank}}(S) = \sum_{i} \sum_{j &gt; i} \max\left(0, f(S_j) - f(S_i) + \gamma_{ij}\right)
$$</p>
<p>where $\gamma_{ij} = (j - i) \cdot \gamma$ is a margin scaled by rank difference and $f(S) = \sum_{t=1}^{l} \log p_{\theta}(s_t \mid S, S_{&lt; t}; \theta)$ is the estimated log-probability. The overall training objective combines cross-entropy and rank loss:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{\text{ce}} + \alpha \mathcal{L}_{\text{rank}}
$$</p>
<p>Label smoothing is applied to the target distribution in $\mathcal{L}_{\text{ce}}$, allocating probability mass $\beta$ to non-target tokens to maintain generative diversity.</p>
<h2 id="experiments-across-distribution-learning-and-property-optimization">Experiments Across Distribution Learning and Property Optimization</h2>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>Stage 1 pre-training</strong>: 100M+ unlabeled molecules from ZINC-15 (molecular weight $\leq$ 500 Da, LogP $\leq$ 5)</li>
<li><strong>Stage 2 pre-training</strong>: 2.22M molecules spanning synthetic (ZINC, MOSES) and natural product (NPASS, 30,926 compounds) domains</li>
<li><strong>Downstream evaluation</strong>: MOSES synthetic dataset, ZINC250K, and natural product molecules</li>
</ul>
<h3 id="molecular-distribution-learning">Molecular Distribution Learning</h3>
<p>MolGen generates 10,000 synthetic and 80,000 natural product molecules, evaluated on seven metrics (Validity, Fragment similarity, Scaffold similarity, SNN, Internal Diversity, <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, and Novelty). Baselines include AAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>, CharRNN, VAE, JT-VAE, LIMO, and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a>.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Validity</th>
          <th>Frag</th>
          <th>Scaf</th>
          <th>SNN</th>
          <th>IntDiv</th>
          <th>FCD</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemformer</td>
          <td>.9843</td>
          <td>.9889</td>
          <td>.9248</td>
          <td>.5622</td>
          <td>.8553</td>
          <td>.0061</td>
          <td>.9581</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>1.000</td>
          <td>.9999</td>
          <td>.9999</td>
          <td>.9996</td>
          <td>.8567</td>
          <td>.0015</td>
          <td>1.000</td>
      </tr>
  </tbody>
</table>
<p>On synthetic molecules, MolGen achieves 100% validity, near-perfect fragment and scaffold similarity, and the lowest FCD (0.0015). For natural products, MolGen achieves FCD of 0.6519 compared to Chemformer&rsquo;s 0.8346.</p>
<h3 id="targeted-molecule-discovery">Targeted Molecule Discovery</h3>
<p>For penalized logP maximization (top-3 scores):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1st</th>
          <th>2nd</th>
          <th>3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MARS (no length limit)</td>
          <td>44.99</td>
          <td>44.32</td>
          <td>43.81</td>
      </tr>
      <tr>
          <td>MolGen (no length limit)</td>
          <td>80.30</td>
          <td>74.70</td>
          <td>69.85</td>
      </tr>
      <tr>
          <td>MolGen (length-limited)</td>
          <td>30.51</td>
          <td>28.98</td>
          <td>28.95</td>
      </tr>
  </tbody>
</table>
<p>For QED maximization, MolGen achieves the maximum score of 0.948 across the top-3.</p>
<h3 id="molecular-docking">Molecular Docking</h3>
<p>MolGen optimizes binding affinity for two protein targets (<a href="https://en.wikipedia.org/wiki/Estrogen_receptor_alpha">ESR1</a> and ACAA1), measured by <a href="https://en.wikipedia.org/wiki/Dissociation_constant">dissociation constant</a> $K_D$ (lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESR1 1st</th>
          <th>ESR1 2nd</th>
          <th>ESR1 3rd</th>
          <th>ACAA1 1st</th>
          <th>ACAA1 2nd</th>
          <th>ACAA1 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>0.89</td>
          <td>1.4</td>
          <td>37</td>
          <td>37</td>
          <td>41</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>0.13</td>
          <td>0.35</td>
          <td>0.47</td>
          <td>3.36</td>
          <td>3.98</td>
          <td>8.50</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the lowest dissociation constants across both targets. Optimization of the 1,000 worst-affinity molecules yields 96.7% relative improvement for ESR1 and 70.4% for ACAA1.</p>
<h3 id="constrained-molecular-optimization">Constrained Molecular Optimization</h3>
<p>Optimizing 800 molecules from ZINC250K with lowest p-logP scores under Tanimoto similarity constraints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$\delta = 0.6$</th>
          <th>$\delta = 0.4$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a></td>
          <td>3.78 (3.29)</td>
          <td>11.55 (11.27)</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>12.08 (0.82)</td>
          <td>12.35 (1.21)</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the highest mean improvement with the lowest standard deviation under both constraints.</p>
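The Tanimoto similarity constraint $\delta$ used in this experiment can be sketched on binary fingerprints. In practice the fingerprints would be Morgan fingerprints from RDKit; here they are represented as plain Python sets of &ldquo;on&rdquo; bit indices, which is an illustrative simplification.

```python
# Tanimoto (Jaccard) similarity on binary fingerprints, modeled as
# sets of "on" bit indices. Real pipelines would compute Morgan
# fingerprints with RDKit; that dependency is assumed, not shown.

def tanimoto(fp_a, fp_b):
    """|A ∩ B| / |A ∪ B| for two binary fingerprints."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def satisfies_constraint(fp_orig, fp_new, delta):
    """Accept an optimized molecule only if it stays within the
    similarity ball sim(m, m') >= delta around the original."""
    return tanimoto(fp_orig, fp_new) >= delta

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(tanimoto(a, b))                    # 2 shared bits of 6 total
print(satisfies_constraint(a, b, 0.4))   # fails the delta = 0.4 ball
```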
<h3 id="ablation-studies">Ablation Studies</h3>
<ul>
<li><strong>Chemical feedback</strong>: Without it, the model generates molecules with property scores similar to initial molecules. With it ($\alpha = 3$), property scores increase progressively across generation rounds.</li>
<li><strong>Prefix tuning</strong>: Removing prefix tuning reduces constrained optimization improvement by 0.45 at $\delta = 0.6$ and 2.12 at $\delta = 0.4$.</li>
<li><strong>Label smoothing</strong>: Enhances diversity of generated molecules as measured by Internal Diversity.</li>
<li><strong>Substructure attention</strong>: MolGen focuses attention on chemically meaningful functional groups (fluoro, phenyl, hydroxyl), while SMILES-based PLMs scatter attention across syntactic tokens. The Substructure Attention Level (SAL) metric confirms MolGen&rsquo;s superior focus.</li>
</ul>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>SELFIES pre-training guarantees 100% molecular validity, eliminating the need for external valency checks.</li>
<li>Domain-agnostic prefix tuning enables effective knowledge transfer between synthetic and natural product domains.</li>
<li>The chemical feedback paradigm aligns model outputs with chemical preferences without requiring external annotated data or reference databases.</li>
<li>MolGen achieves the best or competitive results across all evaluated tasks: distribution learning, targeted molecule discovery, constrained optimization, and molecular docking.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational cost</strong>: Training and fine-tuning on large datasets is computationally intensive.</li>
<li><strong>Model interpretability</strong>: The transformer architecture makes it difficult to understand explicit rationale behind decisions.</li>
<li><strong>Single-target optimization only</strong>: The chemical feedback paradigm handles single-target optimization; multiple conflicting objectives could create ambiguous optimization trajectories.</li>
<li><strong>Task specificity</strong>: MolGen is designed for 2D molecular generation; 3D conformation information is not incorporated.</li>
<li><strong>Reaction prediction</strong>: When applied to reaction prediction (an off-target task), MolGen achieves only 71.4% accuracy on 39,990 reaction samples.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest applying MolGen to retrosynthesis and reaction prediction, exploring multimodal pre-training, and incorporating additional knowledge sources.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stage 1 pre-training</td>
          <td>ZINC-15</td>
          <td>100M+ molecules</td>
          <td>MW $\leq$ 500 Da, LogP $\leq$ 5</td>
      </tr>
      <tr>
          <td>Stage 2 pre-training</td>
          <td>ZINC + <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> + NPASS</td>
          <td>2.22M molecules</td>
          <td>Synthetic and natural product domains</td>
      </tr>
      <tr>
          <td>Distribution learning (synthetic)</td>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></td>
          <td>~1.9M molecules</td>
          <td>Standard benchmark split</td>
      </tr>
      <tr>
          <td>Distribution learning (natural)</td>
          <td>NPASS</td>
          <td>30,926 compounds</td>
          <td>30,126 train / 800 test</td>
      </tr>
      <tr>
          <td>Constrained optimization</td>
          <td>ZINC250K</td>
          <td>800 molecules</td>
          <td>Lowest p-logP scores</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: BART-based encoder-decoder with SELFIES vocabulary (185 tokens)</li>
<li><strong>Prefix length</strong>: 5 tunable vectors per layer</li>
<li><strong>Optimizer</strong>: LAMB (pre-training), AdamW (fine-tuning)</li>
<li><strong>Pre-training</strong>: 600M steps with linear warm-up (180,000 steps) followed by linear decay</li>
<li><strong>Rank loss weight</strong> ($\alpha$): Recommended values of 3 or 5</li>
<li><strong>Candidate generation</strong>: 30 candidates per molecule (synthetic), 8 candidates (natural products)</li>
</ul>
<h3 id="models">Models</h3>
<p>MolGen is publicly available on Hugging Face. The model uses a vocabulary of 185 SELFIES tokens and is comparable in size to Chemformer-large.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>MolGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a> (lower is better)</td>
          <td>Synthetic</td>
          <td>0.0015</td>
          <td>0.0061 (<a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a>)</td>
          <td>Distribution learning</td>
      </tr>
      <tr>
          <td>p-logP top-1 (no limit)</td>
          <td>Synthetic</td>
          <td>80.30</td>
          <td>44.99 (MARS)</td>
          <td>Targeted discovery</td>
      </tr>
      <tr>
          <td>QED top-1</td>
          <td>Synthetic</td>
          <td>0.948</td>
          <td>0.948 (several)</td>
          <td>Tied at maximum</td>
      </tr>
      <tr>
          <td>ESR1 $K_D$ top-1</td>
          <td>Docking</td>
          <td>0.13</td>
          <td>0.72 (LIMO)</td>
          <td>Binding affinity</td>
      </tr>
      <tr>
          <td>p-logP improvement ($\delta=0.4$)</td>
          <td>Synthetic</td>
          <td>12.35 (1.21)</td>
          <td>11.55 (11.27) (RetMol)</td>
          <td>Constrained optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>6 NVIDIA V100 GPUs</li>
<li>Pre-training batch size: 256 molecules per GPU</li>
<li>Fine-tuning batch size: 6 (synthetic and natural product)</li>
<li>Training: 100 epochs for fine-tuning tasks</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zjunlp/MolGen">zjunlp/MolGen</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/zjunlp">zjunlp/MolGen-large</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained weights on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, Y., Zhang, N., Chen, Z., Guo, L., Fan, X., &amp; Chen, H. (2024). Domain-Agnostic Molecular Generation with Chemical Feedback. <em>Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)</em>.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zjunlp/MolGen">GitHub: zjunlp/MolGen</a></li>
<li><a href="https://huggingface.co/zjunlp">Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2024domain,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Domain-Agnostic Molecular Generation with Chemical Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Guo, Lingbing and Fan, Xiaohui and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=9rPyHyjfwP}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Transformer: Calibrated Reaction Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/</guid><description>A Transformer seq2seq model for chemical reaction prediction achieving 90.4% top-1 accuracy on USPTO_MIT with calibrated uncertainty estimation.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Method</strong> paper. It adapts the Transformer architecture to chemical reaction prediction, treating it as a machine translation problem from reactant <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> to product SMILES. The key contributions are (1) demonstrating that a fully attention-based model outperforms all prior template-based, graph-based, and RNN-based methods, (2) showing the model works without separating reactants from reagents, and (3) introducing calibrated uncertainty estimation for ranking synthesis pathways.</p>
<h2 id="motivation-limitations-of-existing-reaction-prediction">Motivation: Limitations of Existing Reaction Prediction</h2>
<p>Prior approaches to reaction prediction fell into two broad groups, template-based and template-free, each with fundamental limitations:</p>
<ul>
<li><strong>Template-based methods</strong> rely on libraries of reaction rules, either handcrafted or automatically extracted from atom-mapped data. Automatic template extraction itself depends on atom mapping, which depends on templates, creating a circular dependency.</li>
<li><strong>Graph-based template-free methods</strong> (e.g., WLDN, ELECTRO) avoid explicit templates but still require atom-mapped training data and cannot handle stereochemistry.</li>
<li><strong><a href="/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/">RNN-based seq2seq models</a></strong> (also template-free) treat reactions as SMILES translation but impose a positional inductive bias: tokens far apart in the SMILES string are assumed to be less related. This is incorrect because SMILES position has no relationship to 3D spatial distance.</li>
</ul>
<h2 id="core-innovation-transformer-for-reaction-prediction">Core Innovation: Transformer for Reaction Prediction</h2>
<p>The Molecular Transformer adapts the Transformer architecture to chemical reactions by treating SMILES strings of reactants and reagents as source sequences and product SMILES as target sequences.</p>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder Transformer with 4 layers, 256-dimensional hidden states, 8 attention heads, and 12M parameters (reduced from the original 65M NMT model).</li>
<li><strong>Tokenization</strong>: Atom-wise regex tokenization of SMILES strings, applied uniformly to both reactants and reagents (no special reagent tokens).</li>
<li><strong>Data augmentation</strong>: Training data is doubled by generating <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">random (non-canonical) SMILES</a> for each reaction, which improves top-1 accuracy by roughly 1%.</li>
<li><strong>Weight averaging</strong>: Final model weights are averaged over the last 20 checkpoints, providing a further accuracy boost without the inference cost of ensembling.</li>
<li><strong>Mixed input</strong>: Unlike all prior work that separates reactants from reagents (which implicitly assumes knowledge of the product), the Molecular Transformer operates on mixed inputs where no distinction is made.</li>
</ul>
<p>The multihead attention mechanism is the key architectural advantage over RNNs. It allows the model to attend to any pair of tokens regardless of their position in the SMILES string, correctly capturing long-range chemical relationships that RNNs miss.</p>
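The atom-wise regex tokenization mentioned above can be sketched with the pattern published by Schwaller et al. (2018); the `tokenize` helper and the lossless-reconstruction check are illustrative additions.

```python
import re

# Atom-wise SMILES tokenizer using the regex published by
# Schwaller et al. (2018). Multi-character atoms (Cl, Br, bracket
# atoms like [NH4+]) become single tokens.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|"
    r"\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_REGEX.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(tokenize("C[Br]Cl"))                 # bracket atom kept whole
```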
<h2 id="uncertainty-estimation">Uncertainty Estimation</h2>
<p>A central contribution is calibrated uncertainty scoring. The product of predicted token probabilities serves as a confidence score for each prediction. This score achieves 0.89 AUC-ROC for classifying whether a prediction is correct.</p>
<p>An important finding: <strong>label smoothing hurts uncertainty calibration</strong>. While label smoothing (as used in the original Transformer) marginally improves top-1 accuracy (87.44% vs 87.28%), it destroys the model&rsquo;s ability to distinguish correct from incorrect predictions. Setting the label smoothing parameter to 0.0 preserves calibration.</p>
<p>The confidence score shows no correlation with SMILES length (Pearson $r = 0.06$), confirming it is not biased against predictions of larger molecules.</p>
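The confidence score is just the product of the per-token probabilities of the top-1 prediction; a minimal sketch (with made-up token probabilities, not model outputs) accumulates it in log space for numerical stability:

```python
import math

# Confidence score of a predicted product SMILES: the product of its
# per-token probabilities, computed in log space to avoid underflow
# on long sequences. The probability values below are illustrative.

def confidence(token_probs):
    return math.exp(sum(math.log(p) for p in token_probs))

certain = [0.99, 0.98, 0.99, 0.97]   # sharply peaked predictions
uncertain = [0.6, 0.5, 0.7, 0.55]    # diffuse predictions

print(confidence(certain))
print(confidence(uncertain))
```

Thresholding this score is what yields the 0.89 AUC-ROC reported above for separating correct from incorrect predictions.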
<h2 id="experimental-results">Experimental Results</h2>
<h3 id="forward-synthesis-prediction">Forward Synthesis Prediction</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Setting</th>
          <th style="text-align: left">Top-1 (%)</th>
          <th style="text-align: left">Top-2 (%)</th>
          <th style="text-align: left">Top-5 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">90.4</td>
          <td style="text-align: left">93.7</td>
          <td style="text-align: left">95.3</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">88.6</td>
          <td style="text-align: left">92.4</td>
          <td style="text-align: left">94.2</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">78.1</td>
          <td style="text-align: left">84.0</td>
          <td style="text-align: left">87.1</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">76.2</td>
          <td style="text-align: left">82.4</td>
          <td style="text-align: left">85.8</td>
      </tr>
  </tbody>
</table>
<p>The mixed-input model (88.6%) outperforms all prior methods that used separated inputs (best previous: WLDN5 at 85.6%).</p>
<h3 id="comparison-with-quantum-chemistry">Comparison with Quantum Chemistry</h3>
<p>On <a href="https://en.wikipedia.org/wiki/Regioselectivity">regioselectivity</a> of <a href="https://en.wikipedia.org/wiki/Electrophilic_aromatic_substitution">electrophilic aromatic substitution</a> in heteroaromatics, the Molecular Transformer achieves 83% top-1 accuracy vs 81% for RegioSQM (a quantum-chemistry-based predictor), at a fraction of the computational cost.</p>
<h3 id="comparison-with-human-chemists">Comparison with Human Chemists</h3>
<p>On 80 reactions sampled across rarity bins, the Molecular Transformer achieves 87.5% top-1 accuracy vs 76.5% for the best human chemist and 72.5% for the best graph-based model (WLDN5).</p>
<h3 id="chemically-constrained-beam-search">Chemically Constrained Beam Search</h3>
<p>Constraining beam search to only predict atoms present in the reactants (preventing &ldquo;alchemy&rdquo;) produces no change in accuracy, confirming the model has learned conservation of atoms from data alone.</p>
<h2 id="trade-offs-and-limitations">Trade-offs and Limitations</h2>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/Stereochemistry">Stereochemistry</a></strong>: Accuracy drops significantly on USPTO_STEREO (76-78% vs 88-90% on USPTO_MIT), indicating stereochemical prediction remains challenging.</li>
<li><strong>Resolution reactions</strong>: Accuracy drops to 28.6% on resolution reactions, where reagent information is often missing from patent data.</li>
<li><strong>Unclassified reactions</strong>: Accuracy on &ldquo;unrecognized&rdquo; reaction classes is 46.3%, likely reflecting noisy or mistranscribed data.</li>
<li><strong>No atom mapping</strong>: The model provides no explicit atom mapping between reactants and products, which limits interpretability for understanding reaction mechanisms.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Primary benchmark</strong></td>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">479K</td>
          <td style="text-align: left">Filtered by Jin et al., no stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>LEF subset</strong></td>
          <td style="text-align: left">USPTO_LEF</td>
          <td style="text-align: left">350K</td>
          <td style="text-align: left">Subset of MIT with linear electron flow only</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stereo benchmark</strong></td>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">1.0M</td>
          <td style="text-align: left">Patent reactions through Sept 2016, includes stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Time-split test</strong></td>
          <td style="text-align: left">Pistachio_2017</td>
          <td style="text-align: left">15.4K</td>
          <td style="text-align: left">Non-public, reactions from 2017</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>: SMILES canonicalized with RDKit. Regex tokenization from Schwaller et al. (2018). Two input modes: &ldquo;separated&rdquo; (reactants &gt; reagents) and &ldquo;mixed&rdquo; (all molecules concatenated).</p>
<h3 id="model">Model</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">4</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model dimension</strong></td>
          <td style="text-align: left">256</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention heads</strong></td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~12M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Label smoothing</strong></td>
          <td style="text-align: left">0.0</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimizer</strong></td>
          <td style="text-align: left">Adam</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Warm-up steps</strong></td>
          <td style="text-align: left">8000</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Batch size</strong></td>
          <td style="text-align: left">~4096 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Beam width</strong></td>
          <td style="text-align: left">5</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (sep)</td>
          <td style="text-align: left"><strong>90.4%</strong></td>
          <td style="text-align: left">85.6% (WLDN5)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (mixed)</td>
          <td style="text-align: left"><strong>88.6%</strong></td>
          <td style="text-align: left">80.3% (S2S RNN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>AUC-ROC</strong></td>
          <td style="text-align: left">Uncertainty calibration</td>
          <td style="text-align: left"><strong>0.89</strong></td>
          <td style="text-align: left">N/A</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Regioselectivity</td>
          <td style="text-align: left"><strong>83%</strong></td>
          <td style="text-align: left">81% (RegioSQM)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Human comparison</td>
          <td style="text-align: left"><strong>87.5%</strong></td>
          <td style="text-align: left">76.5% (best human)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Single Nvidia P100 GPU, 48h for best single model</li>
<li>Inference: 20 min for 40K reactions on single P100</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schwaller, P., Laino, T., Gaudin, T., Bolgar, P., Hunter, C. A., Bekas, C., &amp; Lee, A. A. (2019). Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. <em>ACS Central Science</em>, 5(9), 1572-1583. <a href="https://doi.org/10.1021/acscentsci.9b00576">https://doi.org/10.1021/acscentsci.9b00576</a></p>
<p><strong>Publication</strong>: ACS Central Science 2019</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schwallerMolecularTransformerModel2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Schwaller, Philippe and Laino, Teodoro and Gaudin, Th{\&#39;e}ophile and Bolgar, Peter and Hunter, Christopher A. and Bekas, Costas and Lee, Alpha A.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2019</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1572--1583}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acscentsci.9b00576}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Umeyama's Method: Corrected SVD for Point Alignment</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/umeyama-similarity-transformation/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/umeyama-similarity-transformation/</guid><description>Umeyama (1991) fixes the SVD-based point set alignment method to always produce proper rotations, jointly solving for rotation, translation, and scale.</description><content:encoded><![CDATA[<h2 id="fixing-the-reflection-problem-in-svd-based-alignment">Fixing the Reflection Problem in SVD-Based Alignment</h2>
<p>This <strong>Method</strong> paper addresses a specific failure mode in prior SVD-based solutions to the point set registration problem. Both <a href="/notes/biology/computational-biology/arun-svd-point-fitting/">Arun et al. (1987)</a> and <a href="/notes/biology/computational-biology/horn-orthonormal-matrices/">Horn, Hilden, and Negahdaripour (1988)</a> presented SVD-based methods for finding the optimal rotation between two point patterns. (Note: this is a different paper from <a href="/notes/biology/computational-biology/horn-absolute-orientation/">Horn&rsquo;s 1987 quaternion method</a>, which does not suffer from this issue.) These SVD-based methods can produce a reflection ($\det(R) = -1$) instead of a proper rotation when the data is severely corrupted. Umeyama provides a corrected formulation that always yields a proper rotation matrix.</p>
<h2 id="the-similarity-transformation-problem">The Similarity Transformation Problem</h2>
<p>Given two point sets ${\mathbf{x}_i}$ and ${\mathbf{y}_i}$ ($i = 1, \ldots, n$) in $m$-dimensional space, find the similarity transformation parameters (rotation $R$, translation $\mathbf{t}$, and scale $c$) minimizing the mean squared error:</p>
<p>$$
e^2(R, \mathbf{t}, c) = \frac{1}{n} \sum_{i=1}^{n} \lVert \mathbf{y}_i - (cR\mathbf{x}_i + \mathbf{t}) \rVert^2
$$</p>
<p>This generalizes the <a href="/notes/biology/computational-biology/kabsch-algorithm/">Kabsch problem</a> (rotation only) and the <a href="/notes/biology/computational-biology/horn-absolute-orientation/">absolute orientation problem</a> (rotation + translation + scale) to arbitrary dimensions $m$.</p>
<h2 id="the-core-lemma-corrected-svd-rotation">The Core Lemma: Corrected SVD Rotation</h2>
<p>The key contribution is a lemma for finding the rotation $R$ minimizing $\lVert A - RB \rVert^2$. Given the SVD of $AB^T = UDV^T$ (with $d_1 \geq d_2 \geq \cdots \geq d_m \geq 0$), define the correction matrix:</p>
<p>$$
S = \begin{cases} I &amp; \text{if } \det(AB^T) \geq 0 \\ \operatorname{diag}(1, 1, \ldots, 1, -1) &amp; \text{if } \det(AB^T) &lt; 0 \end{cases}
$$</p>
<p>The minimum value is:</p>
<p>$$
\min_{R} \lVert A - RB \rVert^2 = \lVert A \rVert^2 + \lVert B \rVert^2 - 2\operatorname{tr}(DS)
$$</p>
<p>When $\operatorname{rank}(AB^T) \geq m - 1$, the optimal rotation is uniquely determined as:</p>
<p>$$
R = USV^T
$$</p>
<p>The critical insight is that when $\det(AB^T) = 0$ (i.e., $\operatorname{rank}(AB^T) = m - 1$), the matrix $S$ must instead be chosen based on $\det(U)\det(V)$:</p>
<p>$$
S = \begin{cases} I &amp; \text{if } \det(U)\det(V) = 1 \\ \operatorname{diag}(1, 1, \ldots, 1, -1) &amp; \text{if } \det(U)\det(V) = -1 \end{cases}
$$</p>
<p>This handles the degenerate case where the sign of $\det(AB^T)$ is unreliable.</p>
<h2 id="complete-similarity-transformation-solution">Complete Similarity Transformation Solution</h2>
<p>Umeyama derives the full solution using centered coordinates and the covariance matrix $\Sigma_{xy} = \frac{1}{n} \sum_i (\mathbf{y}_i - \boldsymbol{\mu}_y)(\mathbf{x}_i - \boldsymbol{\mu}_x)^T$.</p>
<p>Given the SVD $\Sigma_{xy} = UDV^T$:</p>
<p><strong>Rotation</strong>:</p>
<p>$$
R = USV^T
$$</p>
<p><strong>Scale</strong>:</p>
<p>$$
c = \frac{1}{\sigma_x^2} \operatorname{tr}(DS)
$$</p>
<p><strong>Translation</strong>:</p>
<p>$$
\mathbf{t} = \boldsymbol{\mu}_y - cR\boldsymbol{\mu}_x
$$</p>
<p><strong>Minimum error</strong>:</p>
<p>$$
\varepsilon^2 = \sigma_y^2 - \frac{\operatorname{tr}(DS)^2}{\sigma_x^2}
$$</p>
<p>where $\sigma_x^2$ and $\sigma_y^2$ are the variances of the respective point sets around their centroids.</p>
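<p>The closed-form recipe above is short enough to implement directly. Below is a minimal NumPy sketch (function name, argument layout, and comments are mine, not the paper&rsquo;s); it uses the $\det(U)\det(V)$ sign test so that both the generic and the rank-deficient case are handled by the same branch:</p>

```python
import numpy as np

def umeyama(x, y):
    """Closed-form similarity transform (R, t, c) minimizing
    mean ||y_i - (c R x_i + t)||^2, per the formulas above.

    x, y: (n, m) arrays of corresponding points.
    """
    n, m = x.shape
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mu_x, y - mu_y
    sigma_x2 = (xc ** 2).sum() / n          # variance of x about its centroid
    cov = yc.T @ xc / n                     # Sigma_xy, an (m, m) matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(m)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0                    # flip the last singular direction
    R = U @ S @ Vt                          # always a proper rotation
    c = (D * np.diag(S)).sum() / sigma_x2   # tr(DS) / sigma_x^2
    t = mu_y - c * R @ mu_x
    return R, t, c
```

<p>Because the sign test reads $\det(U)\det(V)$ rather than $\det(\Sigma_{xy})$, it remains well-defined even when $\Sigma_{xy}$ is singular, matching the lemma&rsquo;s degenerate-case prescription.</p>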
<h2 id="why-prior-methods-fail">Why Prior Methods Fail</h2>
<p>The methods of Arun et al. and Horn et al. use $R = UV^T$ directly from the SVD. This works when $\det(UV^T) = 1$ (proper rotation). When $\det(UV^T) = -1$, these methods either produce a reflection or apply an ad hoc correction (flipping the sign of the last column of $U$). Umeyama shows that the correct fix depends on $\det(\Sigma_{xy})$:</p>
<ul>
<li>If $\det(\Sigma_{xy}) \geq 0$: set $S = I$, so $R = UV^T$</li>
<li>If $\det(\Sigma_{xy}) &lt; 0$: set $S = \operatorname{diag}(1, \ldots, 1, -1)$, flipping the last singular value&rsquo;s contribution</li>
</ul>
<p>This distinction matters because corrupted data can make $\det(UV^T) = -1$ even when the true transformation is a proper rotation. Simply flipping a column of $U$ does not always yield the correct least-squares solution.</p>
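<p>The failure is easy to reproduce. In the toy 2D example below (data chosen for illustration, not taken from the paper), the point set is matched against its mirror image: the naive $R = UV^T$ returns a reflection, while the corrected $R = USV^T$ is a proper rotation:</p>

```python
import numpy as np

# A 2D pattern and its mirror image across the vertical axis
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y = np.array([[0.0, 0.0], [-1.0, 0.0], [0.0, 2.0]])

xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
cov = yc.T @ xc / len(x)                 # Sigma_xy; det(cov) < 0 here
U, D, Vt = np.linalg.svd(cov)

R_naive = U @ Vt                         # uncorrected SVD rotation: a reflection
S = np.eye(2)
if np.linalg.det(U) * np.linalg.det(Vt) < 0:
    S[-1, -1] = -1.0
R = U @ S @ Vt                           # Umeyama's correction: det(R) = +1
```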
<h2 id="generality">Generality</h2>
<p>The formulation works for any dimension $m$, covering both 2D and 3D registration problems. The proof uses Lagrange multipliers with explicit enforcement of both orthogonality ($R^T R = I$) and the proper rotation constraint ($\det(R) = 1$), which prior methods enforced only partially.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Umeyama, S. (1991). Least-squares estimation of transformation parameters between two point patterns. <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, 13(4), 376-380. <a href="https://doi.org/10.1109/34.88573">https://doi.org/10.1109/34.88573</a></p>
<p><strong>Publication</strong>: IEEE TPAMI, 1991</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with implementations including the Kabsch-Umeyama scaling extension)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{umeyama1991least,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Least-squares estimation of transformation parameters between two point patterns}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Umeyama, Shinji}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{IEEE Transactions on Pattern Analysis and Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{376--380}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1991}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1109/34.88573}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SELFormer: A SELFIES-Based Molecular Language Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/selformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/selformer/</guid><description>A SELFIES-based RoBERTa model pretrained on 2M ChEMBL molecules for molecular property prediction on MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-chemical-language-model">A SELFIES-Based Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$) with a secondary <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>SELFormer applies the RoBERTa transformer architecture to <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">SELFIES</a> molecular string representations instead of the <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> notation used by prior chemical language models. The model is pretrained via masked language modeling (MLM) on 2M drug-like compounds from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> and fine-tuned for molecular property prediction tasks on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. The authors release pretrained models, fine-tuning code, and datasets as open-source resources.</p>
<h2 id="why-selfies-over-smiles-for-pretraining">Why SELFIES Over SMILES for Pretraining?</h2>
<p>Existing chemical language models, including <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a>, all use SMILES as their input representation. SMILES has well-documented validity and robustness issues: arbitrary perturbations to a SMILES string frequently produce syntactically invalid outputs. This means a pretrained model must spend capacity learning SMILES grammar rules rather than chemical semantics.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">SELFIES</a> addresses this by construction: every possible SELFIES string decodes to a valid molecule. Despite this theoretical advantage and SELFIES&rsquo; growing adoption in generative chemistry, no prior work had systematically evaluated SELFIES as input for large-scale transformer pretraining. SELFormer fills this gap by providing a direct comparison between SELFIES-based and SMILES-based chemical language models on standard benchmarks.</p>
<h2 id="masked-language-modeling-on-guaranteed-valid-molecular-strings">Masked Language Modeling on Guaranteed-Valid Molecular Strings</h2>
<p>SELFormer uses byte-level Byte-Pair Encoding (BPE) to tokenize SELFIES strings, then pretrains a RoBERTa encoder using the standard MLM objective. 15% of input tokens are masked, and the model minimizes the cross-entropy loss over the masked positions:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $\mathcal{M}$ is the set of masked token indices, $x_i$ is the true token at position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted input context, and $\theta$ are the model parameters.</p>
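<p>As a concrete sketch of this objective (pure NumPy; the function name and array layout are mine), the loss is just an average negative log-likelihood taken over masked positions only:</p>

```python
import numpy as np

def mlm_loss(logits, targets, mask):
    """Masked-LM cross-entropy, matching the formula above.

    logits:  (T, V) unnormalized scores per sequence position
    targets: (T,)   true token ids
    mask:    (T,)   boolean, True at masked positions
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of the true tokens, masked positions only
    return -log_p[mask, targets[mask]].mean()
```

<p>Unmasked positions contribute nothing to the loss, which is why only predictions at masked tokens drive learning.</p>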
<p>The key insight is that because SELFIES guarantees 100% validity, every masked token prediction corresponds to a valid molecular fragment. The model never wastes capacity predicting invalid chemistry. For fine-tuning, a two-layer classification or regression head is added on top of the encoder&rsquo;s output embedding.</p>
<p>Two model sizes were trained. Notably, the larger SELFormer uses fewer attention heads (4) but more hidden layers (12) than SELFormer-Lite (12 heads, 8 layers). This counterintuitive configuration emerged from the authors&rsquo; hyperparameter search over ~100 models, where deeper architectures with fewer heads outperformed wider, shallower ones:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>SELFormer-Lite</th>
          <th>SELFormer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Attention Heads</td>
          <td>12</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Hidden Layers</td>
          <td>8</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Batch Size</td>
          <td>16</td>
          <td>16</td>
      </tr>
      <tr>
          <td>Learning Rate</td>
          <td>5e-5</td>
          <td>5e-5</td>
      </tr>
      <tr>
          <td>Weight Decay</td>
          <td>0.01</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>Pretraining Epochs</td>
          <td>100</td>
          <td>100</td>
      </tr>
      <tr>
          <td>Parameters</td>
          <td>58.3M</td>
          <td>86.7M</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarking-against-smiles-transformers-and-graph-models">Benchmarking Against SMILES Transformers and Graph Models</h2>
<p>SELFormer was pretrained on 2.08M drug-like compounds from ChEMBL v30 (converted from SMILES to SELFIES), then fine-tuned on nine MoleculeNet tasks. All evaluations use scaffold splitting via the Chemprop library.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BACE</th>
          <th>BBBP</th>
          <th>HIV</th>
          <th>Tox21</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td>0.832</td>
          <td><strong>0.902</strong></td>
          <td>0.681</td>
          <td>0.653</td>
          <td><strong>0.745</strong></td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>0.799</td>
          <td>0.728</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolBERT</td>
          <td><strong>0.866</strong></td>
          <td>0.762</td>
          <td><strong>0.783</strong></td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.809</td>
          <td>0.710</td>
          <td>0.771</td>
          <td>0.759</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td><strong>0.890</strong></td>
          <td>0.736</td>
          <td><strong>0.806</strong></td>
          <td><strong>0.787</strong></td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.856</td>
          <td>0.724</td>
          <td><strong>0.806</strong></td>
          <td>0.781</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.855</td>
          <td><strong>0.908</strong></td>
          <td>-</td>
          <td><strong>0.848</strong></td>
          <td>0.649</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE, scaffold split, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>PDBbind</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td><strong>0.682</strong></td>
          <td>2.797</td>
          <td>0.735</td>
          <td>1.488</td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>-</td>
          <td>-</td>
          <td>0.986</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>1.050</td>
          <td><strong>2.082</strong></td>
          <td><strong>0.683</strong></td>
          <td><strong>1.397</strong></td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.798</td>
          <td><strong>1.877</strong></td>
          <td>0.660</td>
          <td>-</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.803</td>
          <td>2.121</td>
          <td><strong>0.600</strong></td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The ablation study compared SELFormer vs. SELFormer-Lite across pretrained-only, 25-epoch, and 50-epoch fine-tuning configurations on randomly split datasets. SELFormer consistently outperformed SELFormer-Lite, confirming the benefit of the deeper (12-layer) architecture.</p>
<h2 id="strong-classification-performance-with-compact-pretraining">Strong Classification Performance with Compact Pretraining</h2>
<p>SELFormer&rsquo;s strongest results come on tasks where molecular substructure matters, most of them classification:</p>
<ul>
<li><strong>SIDER</strong>: Best overall ROC-AUC (0.745), outperforming the next best method (MolCLR at 0.652) by 9.3 percentage points. The authors attribute this to SELFIES&rsquo; ability to capture subtle structural differences relevant to drug side effects.</li>
<li><strong>BBBP</strong>: Second best (0.902), behind only KPGT (0.908). SELFormer scored 17.4 percentage points above ChemBERTa-2 (0.728) on this task.</li>
<li><strong>BACE/HIV vs. ChemBERTa-2</strong>: SELFormer outperformed ChemBERTa-2 by 3.3 points on BACE (0.832 vs 0.799), 17.4 on BBBP, and 5.9 on HIV (0.681 vs 0.622). Since both models use similar RoBERTa architectures, this comparison is suggestive of a SELFIES advantage, though differences in pretraining corpus (ChEMBL vs PubChem), corpus size, and training procedure confound a clean attribution to the input representation alone.</li>
<li><strong>ESOL regression</strong>: Best RMSE (0.682) vs GEM (0.798), a 14.5% relative improvement.</li>
</ul>
<p>Limitations are also apparent:</p>
<ul>
<li><strong>HIV and Tox21</strong>: SELFormer underperforms graph-based methods (MolCLR, GEM, KPGT) on these larger datasets. The authors attribute this to insufficient hyperparameter search given computational constraints.</li>
<li><strong>FreeSolv and Lipophilicity regression</strong>: D-MPNN and graph-based methods maintain an edge, suggesting that explicit 2D/3D structural inductive biases remain valuable for certain property types.</li>
<li><strong>Small pretraining corpus</strong>: At 2M molecules, SELFormer&rsquo;s corpus is orders of magnitude smaller than MolFormer&rsquo;s 1.1B. Despite this, SELFormer outperforms MolFormer on SIDER (0.745 vs 0.690), highlighting SELFIES&rsquo; representational advantage.</li>
<li><strong>Single-task ablation scope</strong>: Some architectural claims rest on limited task coverage, and broader benchmarking would strengthen the conclusions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v30</td>
          <td>2,084,725 compounds (2,084,472 after SELFIES conversion)</td>
          <td>Drug-like bioactive small molecules</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>1,513</td>
          <td><a href="https://en.wikipedia.org/wiki/Beta-secretase_1">Beta-secretase 1</a> inhibitor binding</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td><a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">Blood-brain barrier</a> permeability</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>HIV replication inhibition</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>SIDER</td>
          <td>1,427</td>
          <td>Drug side effects (27 classes)</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Tox21</td>
          <td>7,831</td>
          <td>Toxicity (12 targets)</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>Octanol/water distribution coefficient</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PDBbind</td>
          <td>11,908</td>
          <td>Binding affinity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (MLM), 15% token masking</li>
<li><strong>Tokenization</strong>: Byte-level Byte-Pair Encoding (BPE) on SELFIES strings</li>
<li><strong>SMILES to SELFIES conversion</strong>: SELFIES API with Pandaral.lel for parallelization</li>
<li><strong>Splitting</strong>: Scaffold splitting via Chemprop library (80/10/10 train/validation/test)</li>
<li><strong>Fine-tuning</strong>: Two-layer classification/regression head on encoder output; up to 200 epochs with hyperparameter search</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa (HuggingFace Transformers)</li>
<li><strong>SELFormer</strong>: 12 hidden layers, 4 attention heads, 86.7M parameters</li>
<li><strong>SELFormer-Lite</strong>: 8 hidden layers, 12 attention heads, 58.3M parameters</li>
<li><strong>Hyperparameter search</strong>: Sequential search over ~100 configurations on 100K molecule subset</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Area under receiver operating characteristic curve</td>
      </tr>
      <tr>
          <td>PRC-AUC</td>
          <td>Classification</td>
          <td>Area under precision-recall curve (reported for random splits)</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<p>Results reported on scaffold split and random split datasets.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 2x NVIDIA A5000 GPUs</li>
<li><strong>Hyperparameter optimization time</strong>: ~11 days</li>
<li><strong>Full pretraining</strong>: 100 epochs on 2.08M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HUBioDataLab/SELFormer">SELFormer GitHub</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/HUBioDataLab/SELFormer">SELFormer on HuggingFace</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretrained SELFormer weights</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v30</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source pretraining data</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Benchmark</td>
          <td>Unknown</td>
          <td>Downstream evaluation tasks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yüksel, A., Ulusoy, E., Ünlü, A., &amp; Doğan, T. (2023). SELFormer: Molecular Representation Learning via SELFIES Language Models. <em>Machine Learning: Science and Technology</em>, 4(2), 025035. <a href="https://doi.org/10.1088/2632-2153/acdb30">https://doi.org/10.1088/2632-2153/acdb30</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/HUBioDataLab/SELFormer">GitHub Repository (SELFormer)</a></li>
<li><a href="https://huggingface.co/HUBioDataLab/SELFormer">HuggingFace Model Hub (SELFormer)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yuksel2023selformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{SELFormer}: Molecular Representation Learning via {SELFIES} Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Y{\&#34;u}ksel, Atakan and Ulusoy, Erva and {\&#34;U}nl{\&#34;u}, Atabey and Do{\u{g}}an, Tunca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{025035}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/acdb30}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoLFormer: Large-Scale Chemical Language Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molformer/</guid><description>A linear-attention transformer pretrained on 1.1B SMILES from PubChem and ZINC for molecular property prediction across MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-billion-scale-chemical-language-model">A Billion-Scale Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>MoLFormer is a transformer encoder pretrained via masked language modeling on 1.1 billion <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> and <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a>. The key architectural choices are linear attention (for $O(N)$ complexity instead of $O(N^2)$) and rotary positional embeddings (RoPE). The resulting model, MoLFormer-XL, produces molecular embeddings that outperform or match GNN baselines across a wide range of <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification and regression tasks, including quantum-chemical property prediction from SMILES alone.</p>
<h2 id="bridging-the-gap-between-molecular-languages-and-graph-neural-networks">Bridging the Gap Between Molecular Languages and Graph Neural Networks</h2>
<p>Prior chemical language models like <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> were pretrained on relatively small datasets (10M-77M molecules) and generally underperformed GNNs on molecular property prediction. The core question: does a transformer trained on a sufficiently large SMILES corpus learn enough chemical structure to compete with graph-based methods that have explicit topological inductive biases?</p>
<p>Two specific challenges motivated this work:</p>
<ul>
<li><strong>Scale</strong>: The chemical space spans $10^{60}$ to $10^{100}$ plausible molecules, yet labeled property data is scarce. Self-supervised pretraining on the ~1.1B unlabeled molecules available in public databases could provide a general-purpose representation.</li>
<li><strong>Efficiency</strong>: Standard transformer attention is $O(N^2)$ in sequence length, making billion-scale pretraining impractical without architectural modifications.</li>
</ul>
<h2 id="linear-attention-with-rotary-positional-embeddings">Linear Attention with Rotary Positional Embeddings</h2>
<p>MoLFormer&rsquo;s two key architectural choices are its attention mechanism and positional encoding scheme.</p>
<p><strong>Standard attention</strong> computes:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle) v_n}{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle)}
$$</p>
<p>MoLFormer replaces this with <strong>linear attention</strong> using a generalized feature map $\varphi$, combined with <strong>rotary positional embeddings</strong> $R_m$ applied before the feature map:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>This differs from the original RoFormer formulation, which applies the rotation after the feature map. The authors found that rotating the raw queries and keys before projection led to faster convergence and lower validation loss. The combination of linear attention and adaptive sequence-length bucketing reduces GPU requirements from ~1000 to 16 for training on the full 1.1B corpus.</p>
<p>The model uses masked language modeling (15% token masking, following BERT conventions) with a vocabulary of 2,362 SMILES tokens. Sequence length is capped at 202 tokens, covering 99.4% of all molecules.</p>
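<p>The rotate-then-featurize attention above can be sketched compactly in NumPy. Here $\varphi(x) = \mathrm{elu}(x) + 1$ is used as an illustrative positive feature map (a common choice for linear attention; the paper&rsquo;s exact $\varphi$ may differ), and the RoPE variant shown rotates half-split dimension pairs:</p>

```python
import numpy as np

def rope(x):
    """Rotary positional embedding: rotate paired feature dims by an
    angle proportional to the position index (half-split variant)."""
    T, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    ang = np.outer(np.arange(T), freqs)          # (T, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def linear_attention_rope(Q, K, V):
    """Linear attention with RoPE applied *before* the feature map,
    as in the formula above. O(N d^2) instead of O(N^2 d)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1 > 0
    Qf, Kf = phi(rope(Q)), phi(rope(K))
    KV = Kf.T @ V                                # accumulate sum_n phi(k_n) v_n^T
    Ksum = Kf.sum(axis=0)                        # accumulate sum_n phi(k_n)
    return (Qf @ KV) / (Qf @ Ksum)[:, None]      # normalized readout per query
```

<p>Because $\varphi$ is strictly positive, each output row is a convex combination of the rows of $V$, mirroring the normalization in the softmax case while keeping the cost linear in sequence length.</p>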
<h2 id="broad-moleculenet-benchmarking-with-scaling-ablations">Broad MoleculeNet Benchmarking with Scaling Ablations</h2>
<p>MoLFormer-XL was evaluated on 11 MoleculeNet tasks against supervised GNNs, self-supervised GNNs, and prior language models.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split; values reported as percentages in the original paper, converted to proportions here for consistency):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.847</strong></td>
          <td><strong>0.948</strong></td>
          <td>0.822</td>
          <td>0.882</td>
          <td><strong>0.690</strong></td>
      </tr>
      <tr>
          <td>N-Gram</td>
          <td>0.912</td>
          <td>0.769</td>
          <td>0.855</td>
          <td>0.830</td>
          <td>0.876</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td>0.736</td>
          <td>0.798</td>
          <td>0.932</td>
          <td>0.806</td>
          <td><strong>0.890</strong></td>
          <td>0.680</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.724</td>
          <td>0.781</td>
          <td>0.901</td>
          <td>0.806</td>
          <td>0.856</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>Hu et al.</td>
          <td>0.708</td>
          <td>0.787</td>
          <td>0.789</td>
          <td>0.802</td>
          <td>0.859</td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GeomGCL</td>
          <td>-</td>
          <td>0.850</td>
          <td>0.919</td>
          <td>-</td>
          <td>-</td>
          <td>0.648</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>0.643</td>
          <td>-</td>
          <td>0.906</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE for ESOL/FreeSolv/Lipophilicity, avg MAE for QM9/QM8):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>1.5894</strong></td>
          <td><strong>0.0102</strong></td>
          <td><strong>0.2787</strong></td>
          <td><strong>0.2308</strong></td>
          <td><strong>0.5289</strong></td>
      </tr>
      <tr>
          <td>A-FP</td>
          <td>2.6355</td>
          <td>0.0282</td>
          <td>0.5030</td>
          <td>0.736</td>
          <td>0.578</td>
      </tr>
      <tr>
          <td>MPNN</td>
          <td>3.1898</td>
          <td>0.0143</td>
          <td>0.58</td>
          <td>1.150</td>
          <td>0.7190</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>4.3536</td>
          <td>0.0148</td>
          <td>0.970</td>
          <td>1.40</td>
          <td>0.655</td>
      </tr>
  </tbody>
</table>
<p>MoLFormer-XL also outperforms geometry-aware GNNs (DimeNet, GeomGCL, GEM) on ESOL (0.279 vs 0.575), FreeSolv (0.231 vs 0.866), and Lipophilicity (0.529 vs 0.541).</p>
<p><strong>Key ablation findings</strong>:</p>
<ul>
<li><strong>Data scale matters</strong>: Performance improves monotonically from 10% subsets through the full 1.1B corpus. Training on 100% ZINC alone performed worst, likely due to its smaller vocabulary and less diverse molecule lengths.</li>
<li><strong>Model depth matters</strong>: MoLFormer-Base (6 layers) underperforms MoLFormer-XL (12 layers) on most tasks.</li>
<li><strong>Fine-tuning &raquo; frozen</strong>: Fine-tuning the full encoder consistently outperforms using frozen embeddings with a downstream classifier.</li>
<li><strong>Rotary &gt; absolute at scale</strong>: Rotary embeddings underperform absolute embeddings on smaller pretraining sets but overtake them once the corpus exceeds 1B molecules.</li>
</ul>
<h2 id="smiles-transformers-learn-molecular-geometry">SMILES Transformers Learn Molecular Geometry</h2>
<p>The most striking finding is that MoLFormer&rsquo;s attention patterns correlate with 3D interatomic distances, despite training only on 1D SMILES strings.</p>
<p>Using <a href="/notes/chemistry/datasets/qm9/">QM9</a> molecules with known 3D geometries, the authors computed cosine similarity between attention maps and spatial distance matrices across three distance categories:</p>
<table>
  <thead>
      <tr>
          <th>Distance Category</th>
          <th>Range</th>
          <th>Linear Attention (Rotary)</th>
          <th>Full Attention (Rotary)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Short</td>
          <td>$\leq$ 2 Å</td>
          <td>0.594-0.602</td>
          <td>0.598-0.615</td>
      </tr>
      <tr>
          <td>Medium</td>
          <td>2-4 Å</td>
          <td>0.724-0.730</td>
          <td>0.716-0.727</td>
      </tr>
      <tr>
          <td>Long</td>
          <td>4-10 Å</td>
          <td>0.209-0.211</td>
          <td>0.204-0.210</td>
      </tr>
  </tbody>
</table>
<p>The strong correlation in the short and medium categories indicates the model captures covalent bond connectivity and near-neighbor spatial relationships. Linear attention shows marginally higher cosine similarity than full attention on medium-range distances (0.724-0.730 vs 0.716-0.727).</p>
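<p>A minimal sketch of this probing computation, assuming per-molecule arrays of atom-level attention weights and interatomic distances; the paper's exact construction of the compared matrices may differ, so treat this as illustrative only:</p>

```python
import numpy as np

def masked_cosine(attn, dist, lo, hi):
    """Cosine similarity between an attention map and an interatomic
    distance matrix, restricted to atom pairs whose distance lies in
    (lo, hi] Angstroms.

    attn: (n_atoms, n_atoms) attention weights for atom tokens
    dist: (n_atoms, n_atoms) pairwise 3D distances
    """
    mask = (dist > lo) & (dist <= hi)
    a, d = attn[mask], dist[mask]
    if a.size == 0:          # no pairs fall in this distance category
        return np.nan
    return float(a @ d / (np.linalg.norm(a) * np.linalg.norm(d) + 1e-12))
```

Averaging this quantity over molecules and attention heads per distance category yields numbers comparable to the table above.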
<p>MoLFormer-XL embeddings also correlate more strongly with molecular fingerprint similarity (0.64 vs 0.48 for ChemBERTa) and maximum common subgraph size (-0.60 vs -0.44), confirming that the representations encode structural information.</p>
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Quantum-chemical energies</strong>: SchNet and DimeNet (which encode explicit 3D geometry) outperform MoLFormer-XL on QM9 atomization energy tasks, with DimeNet achieving roughly 10x lower MAE on U0_atom (0.008 vs 0.083 eV). 3D information remains important for these properties.</li>
<li><strong>Sequence length cap</strong>: The 202-token limit excludes 0.6% of molecules, potentially limiting applicability to larger structures.</li>
<li><strong>SMILES canonicalization</strong>: The model depends on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> canonical SMILES; sensitivity to non-canonical forms is not evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>PubChem</td>
          <td>111M molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC</td>
          <td>~1B molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining (combined)</td>
          <td>PubChem + ZINC</td>
          <td>~1.1B molecules</td>
          <td>MoLFormer-XL training set</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP, Tox21, ClinTox, HIV, BACE, SIDER</td>
          <td>1,427-41,127</td>
          <td>MoleculeNet scaffold splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>QM9, QM8, ESOL, FreeSolv, Lipophilicity</td>
          <td>642-133,885</td>
          <td>MoleculeNet random splits (QM9/QM8), scaffold (others)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li><strong>Tokenization</strong>: SMILES tokenizer from Schwaller et al., vocabulary of 2,362 tokens</li>
<li><strong>Sequence length</strong>: 1-202 tokens (99.4% coverage)</li>
<li><strong>Optimizer</strong>: Fused LAMB (via APEX), chosen for stability with large batch sizes and no need for learning rate warm-up</li>
<li><strong>Adaptive bucketing</strong>: Sequences grouped by length into buckets to minimize padding waste</li>
</ul>
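<p>The 15%/80/10/10 masking scheme above can be sketched in NumPy; the <code>[MASK]</code> id, the <code>-100</code> ignore label, and the function name are illustrative conventions, not the paper's implementation:</p>

```python
import numpy as np

def mlm_mask(tokens, vocab_size, mask_id, rng, select_p=0.15):
    """BERT-style masking as used for SMILES pretraining: select 15% of
    positions; of those, 80% -> [MASK], 10% -> random token, 10% kept
    unchanged. Returns (corrupted tokens, labels with -100 elsewhere)."""
    tokens = np.asarray(tokens)
    corrupted = tokens.copy()
    labels = np.full_like(tokens, -100)

    selected = rng.random(tokens.shape) < select_p
    labels[selected] = tokens[selected]          # loss computed here only

    roll = rng.random(tokens.shape)
    corrupted[selected & (roll < 0.8)] = mask_id # 80%: replace with [MASK]
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[rand_pos] = rng.integers(          # 10%: random vocab token
        0, vocab_size, size=int(rand_pos.sum()))
    return corrupted, labels                     # remaining 10% unchanged
```

For MoLFormer the vocabulary size would be 2,362 and sequences would be 1-202 tokens, per the details above.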
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Transformer encoder with linear attention and rotary positional embeddings</li>
<li><strong>MoLFormer-XL</strong>: 12 layers, 12 attention heads, hidden size 768</li>
<li><strong>MoLFormer-Base</strong>: 6 layers (ablation only)</li>
<li><strong>Feature map size</strong>: 32 (generalized feature map for linear attention)</li>
<li><strong>Frozen head</strong>: Fully connected model with hyperparameter sweep (learning rate, batch size, hidden dim, number of layers)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Scaffold splits per MoleculeNet</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression (ESOL, FreeSolv, Lipophilicity)</td>
          <td>Scaffold splits</td>
      </tr>
      <tr>
          <td>Avg MAE</td>
          <td>Regression (QM9, QM8)</td>
          <td>Random splits per MoleculeNet</td>
      </tr>
  </tbody>
</table>
<p>QM9 results are also reported with 5-fold cross-validation for robustness.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: GPU cluster with nodes containing either 8 NVIDIA Tesla V100 (32GB) or 8 Ampere A100 (40GB) GPUs connected via NVLink and InfiniBand</li>
<li><strong>GPU reduction</strong>: Linear attention + bucketing reduced GPU requirements from ~1000 to 16</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/molformer">IBM/molformer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Pretraining, fine-tuning, and attention visualization</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">MoLFormer-XL (HuggingFace)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pretrained weights (46.8M parameters)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>111M molecules</td>
      </tr>
      <tr>
          <td><a href="https://zinc.docking.org/">ZINC</a></td>
          <td>Dataset</td>
          <td>See ZINC terms</td>
          <td>~1B molecules</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Chenthamarakshan, V., Padhi, I., Mroueh, Y., &amp; Das, P. (2022). Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. <em>Nature Machine Intelligence</em>, 4, 1256-1264. <a href="https://doi.org/10.1038/s42256-022-00580-7">https://doi.org/10.1038/s42256-022-00580-7</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/molformer">GitHub Repository (MoLFormer)</a></li>
<li><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">HuggingFace Models</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2022molformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Large-Scale Chemical Language Representations Capture Molecular Structure and Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1256--1264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00580-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Horn et al.: Absolute Orientation Using Orthonormal Matrices</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/horn-orthonormal-matrices/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/horn-orthonormal-matrices/</guid><description>Horn, Hilden, and Negahdaripour (1988) solve absolute orientation using matrix square roots, providing an orthonormal matrix alternative to quaternions.</description><content:encoded><![CDATA[<h2 id="a-matrix-based-companion-to-the-quaternion-method">A Matrix-Based Companion to the Quaternion Method</h2>
<p>This <strong>Method</strong> paper presents a closed-form solution to the absolute orientation problem using $3 \times 3$ orthonormal matrices directly, complementing <a href="/notes/biology/computational-biology/horn-absolute-orientation/">Horn&rsquo;s earlier quaternion-based solution</a> (1987). The authors note that while quaternions are more elegant, orthonormal matrices are more widely used in photogrammetry, graphics, and robotics. The solution relies on the polar decomposition of the cross-covariance matrix via its matrix square root.</p>
<p>The paper also compares two approaches: (1) directly finding the best-fit orthonormal matrix (the main result), and (2) finding an unconstrained best-fit linear transformation and then projecting it onto the nearest orthonormal matrix. These give different results, and only the first approach has the desired symmetry property.</p>
<h2 id="the-rotation-via-polar-decomposition">The Rotation via Polar Decomposition</h2>
<p>As in the quaternion paper, the problem reduces to finding the orthonormal matrix $R$ maximizing $\operatorname{Tr}(R^T M)$, where $M = \sum_{i=1}^{n} \mathbf{r}'_{r,i} (\mathbf{r}'_{l,i})^T$ is the cross-covariance matrix of the centered point sets.</p>
<p>The key insight is the polar decomposition: any matrix $M$ can be written as:</p>
<p>$$
M = U S
$$</p>
<p>where $U$ is orthonormal and $S = (M^T M)^{1/2}$ is positive semidefinite. When $M$ is nonsingular:</p>
<p>$$
U = M (M^T M)^{-1/2}
$$</p>
<p>The matrix square root $(M^T M)^{1/2}$ is computed via eigendecomposition. If $M^T M$ has eigenvalues $\lambda_1, \lambda_2, \lambda_3$ and eigenvectors $\hat{\mathbf{u}}_1, \hat{\mathbf{u}}_2, \hat{\mathbf{u}}_3$:</p>
<p>$$
(M^T M)^{1/2} = \sqrt{\lambda_1}\, \hat{\mathbf{u}}_1 \hat{\mathbf{u}}_1^T + \sqrt{\lambda_2}\, \hat{\mathbf{u}}_2 \hat{\mathbf{u}}_2^T + \sqrt{\lambda_3}\, \hat{\mathbf{u}}_3 \hat{\mathbf{u}}_3^T
$$</p>
<p>The sign of $\det(U)$ equals the sign of $\det(M)$, so $U$ is a proper rotation when $\det(M) &gt; 0$ and a reflection when $\det(M) &lt; 0$.</p>
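<p>A minimal NumPy sketch of this construction for the nonsingular case (function name is ours, not from the paper):</p>

```python
import numpy as np

def rotation_from_polar(M):
    """Best-fit orthonormal matrix U from the polar decomposition M = U S,
    following Horn et al. (1988): S = (M^T M)^{1/2} via eigendecomposition.
    Assumes M is nonsingular (non-coplanar point sets)."""
    lam, V = np.linalg.eigh(M.T @ M)        # eigenvalues of M^T M, all > 0
    S_inv_sqrt = (V / np.sqrt(lam)) @ V.T   # (M^T M)^{-1/2}
    return M @ S_inv_sqrt                   # U = M (M^T M)^{-1/2}
```

When $\det(M) &gt; 0$ the returned $U$ is a proper rotation, consistent with the determinant discussion below.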
<h2 id="handling-the-coplanar-case">Handling the Coplanar Case</h2>
<p>When one set of measurements is coplanar, $M$ is singular ($\operatorname{rank}(M) = 2$) and one eigenvalue of $M^T M$ is zero. The matrix square root still exists (positive semidefinite rather than positive definite), but $S$ is no longer invertible.</p>
<p>In this case, only two of the three columns of $U$ are determined directly. The third column (corresponding to the zero eigenvalue) is fixed by the orthonormality constraint, with a sign ambiguity resolved by requiring $\det(U) = +1$ (proper rotation).</p>
<h2 id="the-nearest-orthonormal-matrix-alternative-approach">The Nearest Orthonormal Matrix (Alternative Approach)</h2>
<p>The paper also derives a closed-form solution for finding the orthonormal matrix nearest to an arbitrary matrix $A$ (minimizing $\lVert A - R \rVert^2$). This uses the same polar decomposition machinery: if $A = U_A S_A$, then $U_A$ is the nearest orthonormal matrix.</p>
<p>This approach (find unconstrained best-fit transform, then project to nearest orthonormal matrix) was used by some earlier methods. Horn et al. show it gives a different result from the direct least-squares solution and lacks the symmetry property: the inverse transformation from right-to-left is generally not the exact inverse of the left-to-right solution.</p>
<h2 id="relationship-to-other-methods">Relationship to Other Methods</h2>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Rotation representation</th>
          <th>Core computation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/biology/computational-biology/kabsch-algorithm/">Kabsch (1976)</a></td>
          <td>Orthogonal matrix</td>
          <td>Eigendecomposition of $\tilde{R}R$ ($3 \times 3$)</td>
      </tr>
      <tr>
          <td><a href="/notes/biology/computational-biology/horn-absolute-orientation/">Horn (1987)</a></td>
          <td>Unit quaternion</td>
          <td>Eigenvector of $N$ ($4 \times 4$)</td>
      </tr>
      <tr>
          <td>Horn et al. (1988)</td>
          <td>Orthonormal matrix</td>
          <td>Square root of $M^T M$ ($3 \times 3$)</td>
      </tr>
      <tr>
          <td><a href="/notes/biology/computational-biology/arun-svd-point-fitting/">Arun et al. (1987)</a></td>
          <td>Orthonormal matrix</td>
          <td>SVD of $H$ ($3 \times 3$)</td>
      </tr>
  </tbody>
</table>
<p>The polar decomposition approach (this paper) and the SVD approach (<a href="/notes/biology/computational-biology/arun-svd-point-fitting/">Arun et al.</a>) are closely related: the SVD $M = U \Lambda V^T$ gives the polar decomposition as $M = (UV^T)(V \Lambda V^T)$ where $UV^T$ is the orthonormal factor and $V \Lambda V^T$ is the positive semidefinite factor. Both methods can produce reflections under noisy data, which <a href="/notes/biology/computational-biology/umeyama-similarity-transformation/">Umeyama (1991)</a> later addressed.</p>
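<p>This equivalence is easy to verify numerically; the snippet below is an illustrative check, not code from either paper:</p>

```python
import numpy as np

# Random 3x3 matrix standing in for the cross-covariance matrix M
rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))

# SVD route: M = U diag(s) V^T  ->  factors (U V^T) and (V diag(s) V^T)
U, s, Vt = np.linalg.svd(M)
ortho = U @ Vt                      # orthonormal factor
psd = Vt.T @ (s[:, None] * Vt)      # V diag(s) V^T, positive semidefinite

# Direct route: (M^T M)^{1/2} via eigendecomposition, as in Horn et al.
lam, W = np.linalg.eigh(M.T @ M)
psd_direct = (W * np.sqrt(lam)) @ W.T

assert np.allclose(ortho @ psd, M)     # the two factors reassemble M
assert np.allclose(psd, psd_direct)    # same PSD factor either way
```

The positive semidefinite factor is unique, so both routes agree regardless of eigenvector ordering or sign conventions.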
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Horn, B. K. P., Hilden, H. M., &amp; Negahdaripour, S. (1988). Closed-form solution of absolute orientation using orthonormal matrices. <em>Journal of the Optical Society of America A</em>, 5(7), 1127-1135. <a href="https://doi.org/10.1364/josaa.5.001127">https://doi.org/10.1364/josaa.5.001127</a></p>
<p><strong>Publication</strong>: Journal of the Optical Society of America A, 1988</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with differentiable implementations)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{horn1988closed,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Closed-form solution of absolute orientation using orthonormal matrices}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Horn, Berthold K. P. and Hilden, Hugh M. and Negahdaripour, Shahriar}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the Optical Society of America A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1127--1135}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1988}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Optica Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1364/josaa.5.001127}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Arun et al.: SVD-Based Least-Squares Fitting of 3D Points</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/arun-svd-point-fitting/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/arun-svd-point-fitting/</guid><description>Arun, Huang, and Blostein (1987) introduce an SVD-based algorithm for least-squares rotation and translation between two 3D point sets.</description><content:encoded><![CDATA[<h2 id="svd-for-3d-point-set-registration">SVD for 3D Point Set Registration</h2>
<p>This <strong>Method</strong> paper presents a concise algorithm for finding the least-squares rotation and translation between two 3D point sets using the singular value decomposition (SVD) of a $3 \times 3$ cross-covariance matrix. The approach is closely related to the earlier <a href="/notes/biology/computational-biology/kabsch-algorithm/">Kabsch algorithm</a> (1976), which used eigendecomposition, and was developed independently of <a href="/notes/biology/computational-biology/horn-absolute-orientation/">Horn&rsquo;s quaternion method</a> (1987). The paper also identifies a reflection degeneracy that <a href="/notes/biology/computational-biology/umeyama-similarity-transformation/">Umeyama</a> later provided a complete fix for.</p>
<h2 id="problem-formulation">Problem Formulation</h2>
<p>Given two 3D point sets $\{p_i\}$ and $\{p'_i\}$ ($i = 1, \ldots, N$) related by:</p>
<p>$$
p'_i = R p_i + T + N_i
$$</p>
<p>where $R$ is a rotation matrix, $T$ is a translation vector, and $N_i$ is noise, find $\hat{R}$ and $\hat{T}$ minimizing:</p>
<p>$$
\Sigma^2 = \sum_{i=1}^{N} \lVert p'_i - (R p_i + T) \rVert^2
$$</p>
<h2 id="decoupling-translation-and-rotation">Decoupling Translation and Rotation</h2>
<p>The translation is eliminated by centering both point sets at their centroids $p$ and $p'$. Defining centered coordinates $q_i = p_i - p$ and $q'_i = p'_i - p'$, the problem reduces to:</p>
<p>$$
\Sigma^2 = \sum_{i=1}^{N} \lVert q'_i - R q_i \rVert^2
$$</p>
<p>Once $\hat{R}$ is found, the translation follows as $\hat{T} = p' - \hat{R} p$.</p>
<h2 id="the-svd-algorithm">The SVD Algorithm</h2>
<p>The algorithm proceeds in five steps:</p>
<ol>
<li>Center both point sets by subtracting centroids</li>
<li>Compute the $3 \times 3$ cross-covariance matrix: $H = \sum_{i=1}^{N} q_i (q'_i)^t$</li>
<li>Compute the SVD: $H = U \Lambda V^t$</li>
<li>Form the candidate rotation: $X = V U^t$</li>
<li>Check $\det(X)$: if $+1$, then $\hat{R} = X$; if $-1$, the result is a reflection</li>
</ol>
<p>The key insight is that minimizing $\Sigma^2$ is equivalent to maximizing $\operatorname{Trace}(RH)$. Using a lemma based on the Cauchy-Schwarz inequality, Arun et al. show that $X = VU^t$ maximizes this trace over all orthonormal matrices.</p>
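<p>The five steps translate directly into NumPy. This sketch (our naming) also includes the sign-flip fix for the coplanar case discussed below:</p>

```python
import numpy as np

def arun_fit(P, P_prime):
    """Least-squares R, T such that P' ~ R P + T (Arun et al., 1987).
    P, P_prime: (N, 3) arrays of corresponding points."""
    p, p_prime = P.mean(axis=0), P_prime.mean(axis=0)
    Q, Q_prime = P - p, P_prime - p_prime       # step 1: center

    H = Q.T @ Q_prime                           # step 2: sum_i q_i (q'_i)^t
    U, _, Vt = np.linalg.svd(H)                 # step 3: H = U Lambda V^t
    X = Vt.T @ U.T                              # step 4: candidate X = V U^t
    if np.linalg.det(X) < 0:                    # step 5: reflection detected
        Vt[-1, :] *= -1                         # flip column of V for the
        X = Vt.T @ U.T                          # smallest singular value

    R = X
    T = p_prime - R @ p
    return R, T
```

Note that the sign flip is only guaranteed to be correct in the coplanar (zero singular value) case; for noisy non-coplanar data with $\det = -1$, no rotation attains the reflection's error, as the paper discusses below.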
<h2 id="the-reflection-problem">The Reflection Problem</h2>
<p>When $\det(VU^t) = -1$, the SVD produces a reflection rather than a proper rotation. Arun et al. analyze three cases:</p>
<p><strong>Noiseless, non-coplanar points</strong>: The SVD always gives a proper rotation ($\det = +1$). No issue arises.</p>
<p><strong>Coplanar points</strong> (including $N = 3$): One singular value of $H$ is zero. Both a rotation and a reflection achieve $\Sigma^2 = 0$. The fix is to flip the sign of the column of $V$ corresponding to the zero singular value:</p>
<p>$$
V' = [v_1, v_2, -v_3], \quad X' = V' U^t
$$</p>
<p><strong>Noisy, non-coplanar points with $\det = -1$</strong>: The paper acknowledges this case cannot be handled by the algorithm. The reflection genuinely minimizes $\Sigma^2$ over all orthonormal matrices, meaning no rotation achieves a lower error. The authors suggest this only occurs with very large noise and recommend RANSAC-like approaches.</p>
<p>This last case is precisely what <a href="/notes/biology/computational-biology/umeyama-similarity-transformation/">Umeyama (1991)</a> later resolved with a corrected formulation using a sign matrix $S$ conditioned on $\det(\Sigma_{xy})$.</p>
<h2 id="computational-comparison">Computational Comparison</h2>
<p>The paper includes VAX 11/780 benchmarks comparing three methods:</p>
<table>
  <thead>
      <tr>
          <th>Points</th>
          <th>SVD (ms)</th>
          <th>Quaternion (ms)</th>
          <th>Iterative (ms)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>3</td>
          <td>54.6</td>
          <td>26.6</td>
          <td>126.8</td>
      </tr>
      <tr>
          <td>11</td>
          <td>37.0</td>
          <td>41.0</td>
          <td>105.2</td>
      </tr>
      <tr>
          <td>30</td>
          <td>44.2</td>
          <td>48.3</td>
          <td>111.0</td>
      </tr>
  </tbody>
</table>
<p>The SVD and quaternion methods have comparable speed, and both are significantly faster than the iterative approach. The SVD method overtakes the quaternion method beyond the smallest point sets; its core decomposition operates on a $3 \times 3$ matrix independent of $N$, versus the quaternion method's $4 \times 4$ eigenproblem.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Arun, K. S., Huang, T. S., &amp; Blostein, S. D. (1987). Least-Squares Fitting of Two 3-D Point Sets. <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, PAMI-9(5), 698-700. <a href="https://doi.org/10.1109/TPAMI.1987.4767965">https://doi.org/10.1109/TPAMI.1987.4767965</a></p>
<p><strong>Publication</strong>: IEEE TPAMI, 1987</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with differentiable implementations)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{arun1987least,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Least-Squares Fitting of Two 3-D Point Sets}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Arun, K. S. and Huang, T. S. and Blostein, S. D.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{IEEE Transactions on Pattern Analysis and Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{PAMI-9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{698--700}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1987}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1109/TPAMI.1987.4767965}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Uni-Parser: Industrial-Grade Multi-Modal PDF Parsing (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</guid><description>Uni-Parser is a modular, multi-expert PDF parsing engine for scientific documents with integrated OCSR and chemical structure recognition.</description><content:encoded><![CDATA[<h2 id="an-industrial-grade-multi-modal-document-parser">An Industrial-Grade Multi-Modal Document Parser</h2>
<p>Uni-Parser is a modular, loosely coupled PDF parsing engine built for scientific literature and patents. It routes different content types (text, equations, tables, figures, chemical structures) to specialized expert models, then reassembles the parsed outputs into structured formats (JSON, Markdown, HTML) for downstream consumption by LLMs and other applications.</p>
<p>The system processes up to 20 PDF pages per second on 8 NVIDIA RTX 4090D GPUs and supports over 80 languages for OCR.</p>
<h2 id="a-five-stage-pipeline-architecture">A Five-Stage Pipeline Architecture</h2>
<p>The system is organized into five sequential stages:</p>
<ol>
<li><strong>Document Pre-Processing</strong>: Validates PDFs, extracts metadata, checks text accessibility, and identifies language.</li>
<li><strong>Group-based Layout Detection</strong>: Locates semantic blocks and identifies their categories using a novel tree-structured layout representation. Groups naturally paired elements (image-caption, table-title, molecule-identifier).</li>
<li><strong>Semantic Contents Parsing</strong>: Routes each block to a specialized model: OCR for text, formula recognition for equations, table structure recognition, OCSR for chemical structures, reaction extraction, and chart parsing. Over ten sub-models operate in parallel.</li>
<li><strong>Semantic Contents Gathering</strong>: Filters non-essential elements, reconstructs reading order, merges cross-page and multi-column content, and reintegrates inline multimodal elements.</li>
<li><strong>Output Formatting and Semantic Chunking</strong>: Exports parsed documents in task-specific formats with proper chunking for RAG and other downstream tasks.</li>
</ol>
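<p>The stage-3 routing idea can be illustrated with a toy dispatcher. All category names and expert stubs here are hypothetical, since the Uni-Parser sub-models and their interfaces are not publicly released:</p>

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Block:
    category: str   # e.g. "text", "equation", "table", "molecule"
    payload: bytes  # cropped image region of the layout block

# Hypothetical expert registry mirroring the multi-expert routing idea:
# each content type maps to a specialized recognizer.
EXPERTS: dict[str, Callable[[bytes], str]] = {
    "text": lambda img: "<ocr text>",       # OCR stand-in
    "equation": lambda img: "<latex>",      # formula recognition stand-in
    "table": lambda img: "<table html>",    # table structure stand-in
    "molecule": lambda img: "<smiles>",     # OCSR stand-in
}

def parse_blocks(blocks: list[Block]) -> list[str]:
    """Route each layout block to its expert; unknown categories fall
    back to plain OCR."""
    return [EXPERTS.get(b.category, EXPERTS["text"])(b.payload)
            for b in blocks]
```

In the real system more than ten such sub-models run in parallel, and their outputs are reassembled in stages 4 and 5.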
<h2 id="group-based-layout-detection">Group-Based Layout Detection</h2>
<p>A key contribution is the group-based layout detection model (Uni-Parser-LD), which uses a hierarchical tree structure to represent page layouts. Elements are organized into a bottom layer (parent nodes like paragraphs, tables, images) and a top layer (child nodes like captions, footnotes, identifiers). This preserves semantic associations between paired elements, such as molecules and their identifiers.</p>
<p>The model is trained on 500k pages, including 220k human-annotated pages from scientific journals and patents across 85 languages. A modified DETR-based architecture was selected as the backbone after finding that RT-DETRv2, YOLOv12, and D-FINE exhibited training instability for this task.</p>
<h2 id="chemical-structure-recognition-with-molparser-15">Chemical Structure Recognition with MolParser 1.5</h2>
<p>Uni-Parser integrates MolParser 1.5 for OCSR, an end-to-end model that directly generates molecular representations from images. The authors explicitly note that graph-based (atom-bond) methods were the first direction they explored but ultimately abandoned because of:</p>
<ul>
<li>Strong reliance on rigid, hand-crafted rules that limit scalability</li>
<li>Substantially higher annotation costs (over 20x compared to end-to-end approaches)</li>
<li>Lower performance ceilings despite increasing training data</li>
</ul>
<h3 id="molecule-localization">Molecule Localization</h3>
<p>Uni-Parser-LD achieves strong molecule detection performance:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>mAP@50</th>
          <th>mAP@50-95</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (Uni-Parser Bench)</td>
          <td><strong>0.994</strong></td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.983</td>
          <td>0.919</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.974</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (BioVista Bench)</td>
          <td><strong>0.981</strong></td>
          <td><strong>0.844</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.961</td>
          <td>0.871</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.945</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td>BioMiner</td>
          <td>0.929</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.899</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h3 id="ocsr-accuracy">OCSR Accuracy</h3>
<p>MolParser 1.5 consistently outperforms prior methods across molecule types:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Full</th>
          <th>Chiral</th>
          <th>Markush</th>
          <th>All</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser 1.5</strong> (Uni-Parser Bench)</td>
          <td><strong>0.979</strong></td>
          <td><strong>0.809</strong></td>
          <td><strong>0.805</strong></td>
          <td><strong>0.886</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.953</td>
          <td>0.676</td>
          <td>0.664</td>
          <td>0.800</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.617</td>
          <td>0.274</td>
          <td>0.168</td>
          <td>0.417</td>
      </tr>
      <tr>
          <td><strong>MolParser 1.5</strong> (BioVista Bench)</td>
          <td><strong>0.795</strong></td>
          <td><strong>0.604</strong></td>
          <td><strong>0.761</strong></td>
          <td><strong>0.780</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.669</td>
          <td>0.352</td>
          <td>0.733</td>
          <td>0.703</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.774</td>
          <td>0.497</td>
          <td>0.185</td>
          <td>0.507</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.703</td>
          <td>0.481</td>
          <td>0.156</td>
          <td>0.455</td>
      </tr>
      <tr>
          <td>MolNexTR</td>
          <td>0.695</td>
          <td>0.419</td>
          <td>0.045</td>
          <td>0.401</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>0.545</td>
          <td>0.326</td>
          <td>0.000</td>
          <td>0.298</td>
      </tr>
  </tbody>
</table>
<p>Chiral molecule recognition remains a significant challenge and is identified as a key area for future work.</p>
<h2 id="document-parsing-benchmarks">Document Parsing Benchmarks</h2>
<p>On the Uni-Parser Benchmark (150 PDFs, 2,887 pages from patents and scientific articles), Uni-Parser (HQ mode) achieves an overall score of 89.74 (excluding molecules), outperforming both pipeline tools (MinerU, PP-StructureV3) and specialized VLMs (MinerU2-VLM, DeepSeek-OCR, PaddleOCR-VL). Competing systems score zero on molecule localization and OCSR because they lack molecular recognition capabilities.</p>
<p>On the general-document OmniDocBench-1.5, a variant (Uni-Parser-G) using a swapped layout module achieves 89.75 overall, competitive with top-performing specialized VLMs.</p>
<h2 id="comparison-with-ocsr-enabled-pdf-parsers">Comparison with OCSR-Enabled PDF Parsers</h2>
<p>On a controlled test set of 141 simple molecules, Uni-Parser outperforms other PDF parsing systems with OCSR support:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Recall</th>
          <th>OCSR Success</th>
          <th>OCSR Acc</th>
          <th>Id Match</th>
          <th>Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>96.5%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>1.8s</strong></td>
      </tr>
      <tr>
          <td>MathPix</td>
          <td>100%</td>
          <td>75.9%</td>
          <td>59.6%</td>
          <td>-</td>
          <td>66.1s</td>
      </tr>
      <tr>
          <td>MinerU.Chem</td>
          <td>66.7%</td>
          <td>63.1%</td>
          <td>22.7%</td>
          <td>-</td>
          <td>~7 min</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/UniParser">HuggingFace Models</a></td>
          <td>Model/Dataset</td>
          <td>Unknown</td>
          <td>MolDet models and MolParser-7M dataset available</td>
      </tr>
      <tr>
          <td><a href="https://uni-parser.github.io">Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Project website with documentation</td>
      </tr>
  </tbody>
</table>
<p>The Uni-Parser system is deployed on a cluster of 240 NVIDIA L40 GPUs (48 GB each) with 22 CPU cores and 90 GB of host memory per GPU. The reference throughput benchmark (20 pages/second) uses 8 NVIDIA RTX 4090D GPUs. The HuggingFace organization hosts MolDet detection models and several datasets (MolParser-7M, RxnBench, OmniScience), but the full Uni-Parser system code and end-to-end inference pipeline do not appear to be publicly released. MolParser 1.5 model weights are not publicly available as of this writing.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li>Chiral molecule recognition remains a challenge for end-to-end OCSR models</li>
<li>Chemical reaction understanding in real-world literature has substantial room for improvement</li>
<li>Layout models are primarily tailored to scientific and patent documents, with plans to expand to newspapers, slides, books, and financial statements</li>
<li>Chart parsing falls short of industrial-level requirements across the diversity of chart types in scientific literature</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Tao, H., Yang, S., Huang, C., Zhong, S., Lu, H., Lyu, H., Li, X., Zhang, L., &amp; Ke, G. (2025). Uni-Parser Technical Report. <em>arXiv preprint arXiv:2512.15098</em>. <a href="https://arxiv.org/abs/2512.15098">https://arxiv.org/abs/2512.15098</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://uni-parser.github.io">Project Page</a></li>
<li><a href="https://huggingface.co/UniParser">HuggingFace Models</a></li>
</ul>
]]></content:encoded></item><item><title>Latent Diffusion Models for High-Res Image Synthesis</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/latent-diffusion-models/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/latent-diffusion-models/</guid><description>Latent Diffusion Models train diffusion in a compressed latent space, enabling high-res image synthesis with cross-attention conditioning at reduced compute.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It introduces Latent Diffusion Models (LDMs), which train denoising diffusion models in the latent space of pretrained autoencoders rather than directly in pixel space. The key insight is that separating perceptual compression from generative learning enables high-resolution image synthesis at a fraction of the computational cost of pixel-based diffusion. The paper also introduces a cross-attention conditioning mechanism for flexible multi-modal generation.</p>
<h2 id="computational-cost-of-pixel-space-diffusion">Computational Cost of Pixel-Space Diffusion</h2>
<p>Training diffusion models directly in pixel space is computationally expensive (150 to 1000 V100 GPU-days for leading models at the time) because the model must process high-dimensional RGB data at every denoising step. Much of this compute is spent modeling imperceptible high-frequency details. The authors observe that learning can be split into two stages: a perceptual compression stage that removes high-frequency detail, and a semantic compression stage where the generative model learns the conceptual composition. Prior two-stage approaches (VQGAN, DALL-E) relied on aggressive compression and autoregressive modeling in discrete latent spaces, trading off reconstruction quality for tractability.</p>
<h2 id="core-innovation-diffusion-in-latent-space">Core Innovation: Diffusion in Latent Space</h2>
<p>LDMs decompose image synthesis into two phases:</p>
<p><strong>Phase 1: Perceptual Compression.</strong> A pretrained autoencoder (encoder $\mathcal{E}$, decoder $\mathcal{D}$) maps images $x \in \mathbb{R}^{H \times W \times 3}$ to a lower-dimensional latent representation $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$ with spatial downsampling factor $f = H/h$. The autoencoder is trained with a perceptual loss (matching deep features from a pretrained VGG network) and a patch-based adversarial objective, with either KL or VQ regularization on the latent space.</p>
<p><strong>Phase 2: Latent Diffusion.</strong> A standard denoising diffusion model operates in this latent space. The training objective becomes:</p>
<p>$$L_{\text{LDM}} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0,1), t} \left[ \left| \epsilon - \epsilon_\theta(z_t, t) \right|_2^2 \right]$$</p>
<p>where $z_t$ is the noised latent at timestep $t$, and $\epsilon_\theta$ is a time-conditional UNet.</p>
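<p>As a toy illustration of this objective, the sketch below draws one Monte Carlo sample of $L_{\text{LDM}}$ in NumPy. The zero-predicting stand-in for the UNet $\epsilon_\theta$, the latent shape, and the linear DDPM beta schedule are assumptions for illustration, not the paper's configuration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the time-conditional UNet eps_theta; a trained model
# would predict the added noise from (z_t, t).
def eps_theta(z_t, t):
    return np.zeros_like(z_t)

# Illustrative linear DDPM beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # \bar{alpha}_t

def ldm_loss(z0):
    """One Monte Carlo sample of L_LDM = E[ ||eps - eps_theta(z_t, t)||^2 ]."""
    t = rng.integers(T)
    eps = rng.standard_normal(z0.shape)
    # Forward diffusion applied in *latent* space, not pixel space.
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_theta(z_t, t)) ** 2)

# e.g. a 4-channel f=8 latent of a 256x256 image would be 4 x 32 x 32
z0 = rng.standard_normal((4, 32, 32))
print(ldm_loss(z0))
```

<p>Because the placeholder predicts zero noise, the sampled loss sits near 1 (the variance of $\epsilon$); training $\epsilon_\theta$ drives it toward zero.</p>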
<p><strong>Cross-Attention Conditioning.</strong> To enable conditioning on text, semantic maps, or other modalities, the authors introduce cross-attention layers into the UNet. A domain-specific encoder $\tau_\theta$ maps conditioning input $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which interacts with the UNet features via:</p>
<p>$$Q = W_Q^{(i)} \cdot \varphi_i(z_t), \quad K = W_K^{(i)} \cdot \tau_\theta(y), \quad V = W_V^{(i)} \cdot \tau_\theta(y)$$</p>
<p>The conditional objective then becomes:</p>
<p>$$L_{\text{LDM}} := \mathbb{E}_{\mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0,1), t} \left[ \left| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right|_2^2 \right]$$</p>
<p>Both $\tau_\theta$ and $\epsilon_\theta$ are optimized jointly.</p>
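<p>A minimal single-head sketch of this cross-attention layer in NumPy (the sizes and random weights are toy assumptions; in the paper these layers sit inside a multi-resolution UNet):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, ctx, W_Q, W_K, W_V):
    """phi: (N, d_model) flattened UNet features phi_i(z_t);
    ctx: (M, d_tau) conditioning tokens tau_theta(y)."""
    Q = phi @ W_Q                      # (N, d)
    K = ctx @ W_K                      # (M, d)
    V = ctx @ W_V                      # (M, d)
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # (N, M): each location attends over tokens
    return A @ V                       # (N, d)

rng = np.random.default_rng(0)
N, M, d_model, d_tau, d = 64, 7, 32, 16, 8   # toy sizes
phi = rng.standard_normal((N, d_model))
ctx = rng.standard_normal((M, d_tau))
out = cross_attention(phi, ctx,
                      rng.standard_normal((d_model, d)),
                      rng.standard_normal((d_tau, d)),
                      rng.standard_normal((d_tau, d)))
print(out.shape)  # (64, 8)
```

<p>The key point is that $K$ and $V$ come from the conditioning input while $Q$ comes from the latent features, so the same mechanism works for text, layouts, or semantic maps by swapping $\tau_\theta$.</p>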
<h2 id="experimental-setup-and-results">Experimental Setup and Results</h2>
<p>The authors evaluate across multiple tasks and datasets:</p>
<p><strong>Perceptual compression tradeoffs.</strong> Downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$ are compared on ImageNet class-conditional generation. LDM-1 (pixel-based) trains slowly; LDM-32 loses too much information. LDM-4 and LDM-8 achieve the best balance, with LDM-8 outperforming pixel-based diffusion by 38 FID points after 2M training steps on a single A100.</p>
<p><strong>Unconditional image synthesis</strong> on CelebA-HQ 256, FFHQ 256, LSUN Churches/Bedrooms 256: LDM-4 achieves FID 5.11 on CelebA-HQ (state of the art at the time), outperforming LSGM, GANs, and other likelihood-based models. On LSUN-Bedrooms, LDM-4 achieves FID 2.95, close to ADM (1.90) with half the parameters and roughly 4x less training compute (see Appendix E.3.5).</p>
<p><strong>Text-to-image synthesis</strong> on MS-COCO: A 1.45B parameter LDM-KL-8 model trained on LAION-400M achieves FID 12.63 with classifier-free guidance (a technique that amplifies the conditioning signal at the cost of diversity, by interpolating between conditional and unconditional predictions) at scale s=1.5, on par with GLIDE (FID 12.24, 6B params) and Make-A-Scene (FID 11.84, 4B params) with substantially fewer parameters.</p>
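<p>The parenthetical guidance rule can be made concrete in a few lines. This is the standard classifier-free guidance combination (the function name and toy vectors are illustrative):</p>

```python
import numpy as np

def cfg(eps_cond, eps_uncond, s):
    """Classifier-free guidance: move the noise prediction along the
    conditional direction; s = 1 recovers the plain conditional model."""
    return eps_uncond + s * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 0.0])   # toy conditional prediction
eps_u = np.array([0.0, 0.0])   # toy unconditional prediction
print(cfg(eps_c, eps_u, 1.5))
```

<p>With $s &gt; 1$ the conditional direction is amplified, which sharpens adherence to the prompt at the cost of sample diversity.</p>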
<p><strong>Class-conditional ImageNet 256:</strong> LDM-4-G achieves FID 3.60, IS 247.67, outperforming ADM-G (FID 4.59) with fewer parameters and less compute.</p>
<p><strong>Super-resolution:</strong> LDM-4 (big) achieves FID 2.4 on ImageNet 64-to-256 upscaling (validation split), outperforming SR3 in FID.</p>
<p><strong>Inpainting</strong> on Places: LDM-4 (big, w/ ft) achieves FID 1.50, setting a new state of the art on image inpainting.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<ul>
<li>LDM-4 and LDM-8 offer the best tradeoff between perceptual compression and generation quality.</li>
<li>The autoencoder only needs to be trained once and can be reused across different diffusion models and tasks.</li>
<li>Cross-attention conditioning generalizes to text, semantic layouts, and bounding boxes without architecture changes.</li>
<li>Convolutional sampling enables generation at resolutions higher than the training resolution (up to 1024x1024).</li>
<li>Sequential sampling remains slower than GANs. The autoencoder reconstruction can become a bottleneck for tasks requiring pixel-level precision.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unconditional</td>
          <td>CelebA-HQ, FFHQ, LSUN</td>
          <td>256x256</td>
          <td>Standard benchmarks</td>
      </tr>
      <tr>
          <td>Class-conditional</td>
          <td>ImageNet</td>
          <td>256x256</td>
          <td>1000 classes</td>
      </tr>
      <tr>
          <td>Text-to-image</td>
          <td>LAION-400M</td>
          <td>256x256</td>
          <td>400M image-text pairs</td>
      </tr>
      <tr>
          <td>Inpainting</td>
          <td>Places</td>
          <td>256x256, 512x512</td>
          <td>Following LaMa protocol</td>
      </tr>
      <tr>
          <td>Super-resolution</td>
          <td>ImageNet</td>
          <td>64 to 256</td>
          <td>Following SR3 pipeline</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Autoencoder regularization</strong>: KL-reg (KL penalty toward standard normal, weighted by ~$10^{-6}$) or VQ-reg (vector quantization layer on the latent space with a learned codebook)</li>
<li><strong>Diffusion</strong>: Standard DDPM denoising with reweighted objective</li>
<li><strong>Sampling</strong>: DDIM sampler with configurable steps (100 to 500 depending on task)</li>
<li><strong>Guidance</strong>: Classifier-free diffusion guidance with scale $s$ (1.5 for class-conditional and text-to-image quantitative evaluation; 10.0 for qualitative text-to-image samples)</li>
</ul>
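<p>As a sketch of the sampler listed above, a single deterministic DDIM update ($\eta = 0$) can be written as follows; the schedule values and shapes are illustrative assumptions:</p>

```python
import numpy as np

def ddim_step(z_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0): predict z_0 from the
    current noise estimate, then re-noise to the previous marginal."""
    z0_pred = (z_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * z0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

# Sanity check: given the exact noise, stepping to abar_prev = 1 recovers z0.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(z0.shape)
abar_t = 0.5
z_t = np.sqrt(abar_t) * z0 + np.sqrt(1.0 - abar_t) * eps
print(np.allclose(ddim_step(z_t, eps, abar_t, 1.0), z0))  # True
```

<p>Chaining this update over a coarse subsequence of timesteps is what lets DDIM trade step count (100 to 500 here) against sample quality.</p>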
<h3 id="models">Models</h3>
<ul>
<li><strong>Autoencoder</strong>: Based on VQGAN architecture with perceptual + adversarial loss</li>
<li><strong>UNet backbone</strong>: Time-conditional with cross-attention layers at multiple resolutions</li>
<li><strong>Text encoder</strong>: BERT-tokenizer with transformer $\tau_\theta$ for LAION text-to-image model</li>
<li><strong>LDM-4-G</strong>: 400M parameters, $f=4$ downsampling</li>
<li><strong>LDM-KL-8 (text)</strong>: 1.45B parameters, $f=8$ downsampling, KL-regularized</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FID</td>
          <td>CelebA-HQ unconditional</td>
          <td>5.11</td>
          <td>500 DDIM steps</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>ImageNet class-conditional</td>
          <td>3.60</td>
          <td>LDM-4-G, cfg s=1.5</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>MS-COCO text-to-image</td>
          <td>12.63</td>
          <td>LDM-KL-8-G, 250 steps, cfg s=1.5</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>Places inpainting</td>
          <td>1.50</td>
          <td>LDM-4 big, w/ ft</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>ImageNet 4x super-resolution</td>
          <td>2.4</td>
          <td>LDM-4 big, 100 steps</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Perceptual compression tradeoff experiments: single NVIDIA A100</li>
<li>Inpainting model trained on eight NVIDIA V100 GPUs</li>
<li>Training at least 2.7x faster than pixel-based diffusion at equal parameters</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CompVis/latent-diffusion">CompVis/latent-diffusion</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained models</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rombach, R., Blattmann, A., Lorenz, D., Esser, P., &amp; Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. <em>CVPR 2022</em>. <a href="https://arxiv.org/abs/2112.10752">https://arxiv.org/abs/2112.10752</a></p>
<p><strong>Publication</strong>: CVPR 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{rombach2022highresolution,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{High-Resolution Image Synthesis with Latent Diffusion Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\&#34;o}rn}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>     = <span style="color:#e6db74">{10684--10695}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/CompVis/latent-diffusion">GitHub Repository</a></li>
<li><a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">Score-Based Generative Modeling with SDEs</a></li>
</ul>
]]></content:encoded></item><item><title>Kabsch Algorithm: Optimal Rotation for Point Set Alignment</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/kabsch-algorithm/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/kabsch-algorithm/</guid><description>Kabsch (1976) derives a closed-form solution for the optimal rotation aligning two weighted vector sets by minimizing squared deviations.</description><content:encoded><![CDATA[<h2 id="a-closed-form-solution-for-optimal-rotation">A Closed-Form Solution for Optimal Rotation</h2>
<p>This short communication presents a <strong>Method</strong> paper: a direct, analytical solution to a constrained optimization problem. Given two sets of vectors, Kabsch derives the orthogonal matrix (rotation) that best superimposes one set onto the other by minimizing a weighted sum of squared deviations. Prior approaches either solved an unconstrained problem and factorized the result (Diamond, 1976) or used iterative methods (McLachlan, 1972). Kabsch shows that a direct, non-iterative solution exists despite the non-linear nature of the orthogonality constraint.</p>
<h2 id="the-superposition-problem">The Superposition Problem</h2>
<p>The core problem arises frequently in crystallography and structural biology: given two sets of corresponding points (e.g., atomic coordinates from a known structure and experimentally measured coordinates), find the rigid rotation that best aligns them. Translations can be removed by centering both point sets at the origin, leaving only the rotational component.</p>
<p>Formally, given vector sets $\mathbf{x}_n$ and $\mathbf{y}_n$ ($n = 1, 2, \ldots, N$) with weights $w_n$, find the orthogonal matrix $\mathsf{U}$ minimizing:</p>
<p>$$
E = \frac{1}{2} \sum_{n} w_n (\mathsf{U} \mathbf{x}_n - \mathbf{y}_n)^2
$$</p>
<p>subject to orthogonality: $\tilde{\mathsf{U}} \mathsf{U} = \mathsf{I}$.</p>
<h2 id="derivation-via-lagrange-multipliers">Derivation via Lagrange Multipliers</h2>
<p>Kabsch introduces a symmetric matrix $\mathsf{L}$ of Lagrange multipliers to enforce orthogonality, forming the Lagrangian:</p>
<p>$$
G = E + \frac{1}{2} \sum_{i,j} l_{ij} \left( \sum_{k} u_{ki} u_{kj} - \delta_{ij} \right)
$$</p>
<p>Setting $\partial G / \partial u_{ij} = 0$ and defining two key matrices:</p>
<p>$$
r_{ij} = \sum_{n} w_n \, y_{ni} \, x_{nj} \qquad s_{ij} = \sum_{n} w_n \, x_{ni} \, x_{nj}
$$</p>
<p>where $\mathsf{R} = (r_{ij})$ is the weighted cross-covariance matrix and $\mathsf{S} = (s_{ij})$ is the weighted auto-covariance matrix, the stationarity condition becomes:</p>
<p>$$
\mathsf{U} \cdot (\mathsf{S} + \mathsf{L}) = \mathsf{R}
$$</p>
<h2 id="eigendecomposition-solution">Eigendecomposition Solution</h2>
<p>The key insight is that multiplying both sides by their transposes eliminates the unknown $\mathsf{U}$:</p>
<p>$$
(\mathsf{S} + \mathsf{L})(\mathsf{S} + \mathsf{L}) = \tilde{\mathsf{R}} \mathsf{R}
$$</p>
<p>Since $\tilde{\mathsf{R}} \mathsf{R}$ is symmetric positive definite, it has positive eigenvalues $\mu_k$ and eigenvectors $\mathbf{a}_k$. The matrix $\mathsf{S} + \mathsf{L}$ shares the same eigenvectors with eigenvalues $\sqrt{\mu_k}$.</p>
<p>From the eigenvectors $\mathbf{a}_k$, a second set of unit vectors $\mathbf{b}_k$ is defined:</p>
<p>$$
\mathbf{b}_k = \frac{1}{\sqrt{\mu_k}} \mathsf{R} \, \mathbf{a}_k
$$</p>
<p>The optimal rotation matrix is then constructed directly:</p>
<p>$$
u_{ij} = \sum_{k} b_{ki} \, a_{kj}
$$</p>
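<p>In practice this construction is usually implemented through the SVD of $\mathsf{R}$, which yields the same optimal rotation; a minimal NumPy sketch under that equivalent formulation, including the commonly added determinant guard that forces a proper rotation rather than a reflection (a detail beyond the 1976 communication itself):</p>

```python
import numpy as np

def kabsch(X, Y, w=None):
    """Optimal rotation U minimizing sum_n w_n ||U x_n - y_n||^2.

    X, Y : (N, 3) centered point sets; w : (N,) optional weights.
    Uses the SVD of the weighted cross-covariance R, the standard
    equivalent of Kabsch's eigendecomposition of R~R.
    """
    if w is None:
        w = np.ones(len(X))
    R = (w[:, None] * Y).T @ X            # r_ij = sum_n w_n y_ni x_nj
    V, S, Wt = np.linalg.svd(R)
    # Determinant guard: ensure det(U) = +1 (proper rotation).
    d = np.sign(np.linalg.det(V @ Wt))
    return V @ np.diag([1.0, 1.0, d]) @ Wt

# Check: recover a known rotation from noiseless correspondences.
rng = np.random.default_rng(0)
theta = 0.7
U_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
X = rng.standard_normal((10, 3))
X -= X.mean(axis=0)          # translations removed by centering, as in the paper
Y = X @ U_true.T
U = kabsch(X, Y)
print(np.allclose(U, U_true))  # True
```

<p>On noiseless data the recovered $\mathsf{U}$ matches the generating rotation to floating-point precision; with noise it returns the least-squares optimum.</p>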
<h2 id="handling-degeneracies-and-generalizations">Handling Degeneracies and Generalizations</h2>
<p>Kabsch addresses two extensions:</p>
<ol>
<li>
<p><strong>Planar point sets</strong>: When all vectors lie in a plane, one eigenvalue of $\tilde{\mathsf{R}} \mathsf{R}$ is zero. The missing eigenvectors are recovered via cross products: $\mathbf{a}_3 = \mathbf{a}_1 \times \mathbf{a}_2$ and $\mathbf{b}_3 = \mathbf{b}_1 \times \mathbf{b}_2$.</p>
</li>
<li>
<p><strong>General metric constraints</strong>: The orthogonality constraint $\tilde{\mathsf{U}} \mathsf{U} = \mathsf{I}$ can be replaced by $\tilde{\mathsf{U}} \mathsf{U} = \mathsf{M}$ for any symmetric positive definite $\mathsf{M}$. By finding any specific solution $\mathsf{B}$ and transforming the input vectors as $\mathbf{x}'_n = \mathsf{B} \mathbf{x}_n$, the problem reduces back to the standard orthogonal case.</p>
</li>
</ol>
<p>The method generalizes naturally to vector spaces of arbitrary dimension.</p>
<h2 id="legacy-and-impact">Legacy and Impact</h2>
<p>This two-page communication became one of the most cited papers in structural biology. The &ldquo;Kabsch algorithm&rdquo; (or &ldquo;Kabsch rotation&rdquo;) is the standard method for computing the root-mean-square deviation (RMSD) between two molecular structures after optimal superposition. It underpins structure comparison tools across crystallography, NMR spectroscopy, cryo-EM, and computational chemistry.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. <em>Acta Crystallographica Section A</em>, 32(5), 922-923. <a href="https://doi.org/10.1107/s0567739476001873">https://doi.org/10.1107/s0567739476001873</a></p>
<p><strong>Publication</strong>: Acta Crystallographica Section A, 1976</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with differentiable implementations)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kabsch1976solution,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A solution for the best rotation to relate two sets of vectors}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kabsch, Wolfgang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{32}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{922--923}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1976}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{International Union of Crystallography}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1107/s0567739476001873}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Horn's Method: Absolute Orientation via Unit Quaternions</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/horn-absolute-orientation/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/horn-absolute-orientation/</guid><description>Horn (1987) presents a closed-form quaternion solution for absolute orientation, finding optimal rotation, translation, and scale between two point sets.</description><content:encoded><![CDATA[<h2 id="a-quaternion-approach-to-point-set-registration">A Quaternion Approach to Point Set Registration</h2>
<p>This <strong>Method</strong> paper presents a closed-form solution to the absolute orientation problem: given corresponding points measured in two different coordinate systems, find the optimal rotation, translation, and scale that maps one set onto the other. While the <a href="/notes/biology/computational-biology/kabsch-algorithm/">Kabsch algorithm</a> (1976) solved the rotation subproblem via eigendecomposition of $\tilde{\mathsf{R}}\mathsf{R}$, Horn&rsquo;s approach uses unit quaternions to represent rotation, reducing the problem to finding the eigenvector of a $4 \times 4$ symmetric matrix associated with its largest eigenvalue.</p>
<h2 id="the-absolute-orientation-problem">The Absolute Orientation Problem</h2>
<p>Given $n$ point pairs $\{\mathbf{r}_{l,i}\}$ and $\{\mathbf{r}_{r,i}\}$ measured in &ldquo;left&rdquo; and &ldquo;right&rdquo; coordinate systems, find the transformation:</p>
<p>$$
\mathbf{r}_r = s \, R(\mathbf{r}_l) + \mathbf{r}_0
$$</p>
<p>where $s$ is a scale factor, $R$ is a rotation, and $\mathbf{r}_0$ is a translation, minimizing the sum of squared residual errors:</p>
<p>$$
\sum_{i=1}^{n} \lVert \mathbf{r}_{r,i} - s \, R(\mathbf{r}_{l,i}) - \mathbf{r}_0 \rVert^2
$$</p>
<p>Prior methods either used iterative numerical procedures or selectively discarded constraints (e.g., Thompson&rsquo;s and Schut&rsquo;s three-point methods). Horn derives a direct solution that uses all available information from all points simultaneously.</p>
<h2 id="decoupling-translation-scale-and-rotation">Decoupling Translation, Scale, and Rotation</h2>
<p>Horn shows that the three components of the transformation can be solved sequentially.</p>
<p><strong>Translation</strong>: After centering both point sets at their centroids ($\bar{\mathbf{r}}_l$ and $\bar{\mathbf{r}}_r$), the optimal translation is:</p>
<p>$$
\mathbf{r}_0 = \bar{\mathbf{r}}_r - s \, R(\bar{\mathbf{r}}_l)
$$</p>
<p><strong>Scale</strong>: Horn derives three formulations (asymmetric left, asymmetric right, and symmetric). The symmetric version, which ensures the inverse transformation yields the reciprocal scale, is:</p>
<p>$$
s = \left( \frac{\sum_{i=1}^{n} \lVert \mathbf{r}'_{r,i} \rVert^2}{\sum_{i=1}^{n} \lVert \mathbf{r}'_{l,i} \rVert^2} \right)^{1/2}
$$</p>
<p>the ratio of root-mean-square deviations from the respective centroids.</p>
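<p>A minimal NumPy sketch of this symmetric scale (the toy points are illustrative; translation is removed by the centroid subtraction):</p>

```python
import numpy as np

def symmetric_scale(P, Q):
    """Horn's symmetric scale: RMS deviation of the right set from its
    centroid over that of the left set. Swapping P and Q gives the
    reciprocal, so the inverse transform carries the inverse scale."""
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    return np.sqrt((Qc ** 2).sum() / (Pc ** 2).sum())

P = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
Q = 3.0 * P + np.array([5.0, -1.0, 2.0])   # scaled and translated copy
print(symmetric_scale(P, Q))  # 3.0
```

<p>Unlike the asymmetric formulations, this estimate does not depend on which coordinate system is treated as the reference.</p>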
<p><strong>Rotation</strong>: After removing translation and scale, the remaining problem is to find the rotation $R$ that maximizes:</p>
<p>$$
\sum_{i=1}^{n} \mathbf{r}'_{r,i} \cdot R(\mathbf{r}'_{l,i})
$$</p>
<h2 id="the-quaternion-eigenvector-solution">The Quaternion Eigenvector Solution</h2>
<p>Horn represents rotation using unit quaternions $\dot{q} = q_0 + i q_x + j q_y + k q_z$ with $\lVert \dot{q} \rVert = 1$. A rotation acts on a vector (represented as a purely imaginary quaternion $\dot{r}$) via the composite product:</p>
<p>$$
\dot{r}' = \dot{q} \, \dot{r} \, \dot{q}^*
$$</p>
<p>Using the $4 \times 4$ matrix representations of quaternion products, the objective function becomes a quadratic form:</p>
<p>$$
\dot{q}^T N \dot{q}
$$</p>
<p>where $N$ is a real symmetric $4 \times 4$ matrix whose elements are combinations of the sums of products $S_{xx}, S_{xy}, \ldots, S_{zz}$ from the $3 \times 3$ cross-covariance matrix $M = \sum_i \mathbf{r}'_{l,i} (\mathbf{r}'_{r,i})^T$:</p>
<p>$$
N = \begin{bmatrix} (S_{xx} + S_{yy} + S_{zz}) &amp; S_{yz} - S_{zy} &amp; S_{zx} - S_{xz} &amp; S_{xy} - S_{yx} \\ S_{yz} - S_{zy} &amp; (S_{xx} - S_{yy} - S_{zz}) &amp; S_{xy} + S_{yx} &amp; S_{zx} + S_{xz} \\ S_{zx} - S_{xz} &amp; S_{xy} + S_{yx} &amp; (-S_{xx} + S_{yy} - S_{zz}) &amp; S_{yz} + S_{zy} \\ S_{xy} - S_{yx} &amp; S_{zx} + S_{xz} &amp; S_{yz} + S_{zy} &amp; (-S_{xx} - S_{yy} + S_{zz}) \end{bmatrix}
$$</p>
<p>The trace of $N$ is always zero. The unit quaternion maximizing $\dot{q}^T N \dot{q}$ is the eigenvector corresponding to the most positive eigenvalue of $N$.</p>
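<p>The whole rotation step can be sketched in NumPy: assemble $N$ from the cross-covariance sums, take the eigenvector of the most positive eigenvalue, and convert the unit quaternion to a rotation matrix. The helper name and test data are illustrative assumptions:</p>

```python
import numpy as np

def horn_rotation(P, Q):
    """Rotation mapping centered 'left' points P onto 'right' points Q,
    via the eigenvector of N for its most positive eigenvalue."""
    M = P.T @ Q                        # S_ab = sum_i P[i, a] * Q[i, b]
    Sxx, Sxy, Sxz = M[0]
    Syx, Syy, Syz = M[1]
    Szx, Szy, Szz = M[2]
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])
    vals, vecs = np.linalg.eigh(N)
    q0, qx, qy, qz = vecs[:, -1]       # eigh sorts ascending; take the largest
    # Unit quaternion -> rotation matrix (the sign of q is irrelevant).
    return np.array([
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
        [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
        [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz]])

# Sanity check on noiseless correspondences with a random proper rotation.
rng = np.random.default_rng(1)
R_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
P = rng.standard_normal((8, 3))
P -= P.mean(axis=0)
R = horn_rotation(P, P @ R_true.T)
print(np.allclose(R, R_true))  # True
```

<p>Because the rotation matrix is quadratic in the quaternion components, the sign ambiguity of the eigenvector has no effect on the result.</p>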
<h2 id="the-characteristic-polynomial">The Characteristic Polynomial</h2>
<p>The eigenvalues satisfy a quartic $\lambda^4 + c_3 \lambda^3 + c_2 \lambda^2 + c_1 \lambda + c_0 = 0$ where:</p>
<ul>
<li>$c_3 = 0$ (trace of $N$ is zero, so the four roots sum to zero)</li>
<li>$c_2 = -2 \operatorname{Tr}(M^T M)$ (always negative, guaranteeing both positive and negative roots)</li>
<li>$c_1 = -8 \det(M)$</li>
<li>$c_0 = \det(N)$</li>
</ul>
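<p>These coefficient identities are straightforward to check numerically against NumPy's characteristic polynomial; in the sketch below, $M$ is an arbitrary random stand-in for the cross-covariance matrix:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))        # arbitrary 3x3 cross-covariance
Sxx, Sxy, Sxz = M[0]
Syx, Syy, Syz = M[1]
Szx, Szy, Szz = M[2]
N = np.array([
    [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
    [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
    [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
    [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])

# np.poly(N) returns [1, c3, c2, c1, c0] for det(lambda*I - N).
c = np.poly(N)
print(np.isclose(c[1], 0.0),                       # c3 = 0 (zero trace)
      np.isclose(c[2], -2.0 * np.trace(M.T @ M)),  # c2
      np.isclose(c[3], -8.0 * np.linalg.det(M)),   # c1
      np.isclose(c[4], np.linalg.det(N)))          # c0
```

<p>All four checks hold for any $M$, since the identities are algebraic consequences of how $N$ is assembled.</p>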
<p>When points are coplanar (including the common case of exactly three points), $\det(M) = 0$, so $c_1 = 0$ and the quartic reduces to a biquadratic solvable in closed form.</p>
<h2 id="coplanar-points-and-the-three-point-case">Coplanar Points and the Three-Point Case</h2>
<p>For coplanar measurements, the quartic simplifies to $\lambda^4 + c_2 \lambda^2 + c_0 = 0$, yielding:</p>
<p>$$
\lambda_m = \left[ \frac{1}{2} \left( (c_2^2 - 4c_0)^{1/2} - c_2 \right) \right]^{1/2}
$$</p>
<p>Horn also provides a geometric interpretation for the coplanar case: first rotate one plane into the other (about their line of intersection), then solve a 2D least-squares rotation within the shared plane.</p>
<h2 id="comparison-with-the-kabsch-algorithm">Comparison with the Kabsch Algorithm</h2>
<p>Both methods solve the same underlying optimization problem but approach it differently:</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>Kabsch (1976)</th>
          <th>Horn (1987)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Rotation representation</td>
          <td>Orthogonal matrix</td>
          <td>Unit quaternion</td>
      </tr>
      <tr>
          <td>Core computation</td>
          <td>SVD or eigendecomposition of $\tilde{R}R$ ($3 \times 3$)</td>
          <td>Eigenvector of $N$ ($4 \times 4$)</td>
      </tr>
      <tr>
          <td>Scale estimation</td>
          <td>Not addressed</td>
          <td>Three formulations (including symmetric)</td>
      </tr>
      <tr>
          <td>Constraint enforcement</td>
          <td>Lagrange multipliers</td>
          <td>Unit quaternion norm</td>
      </tr>
      <tr>
          <td>Symmetry guarantee</td>
          <td>Not addressed</td>
          <td>Proven for symmetric scale</td>
      </tr>
      <tr>
          <td>Degenerate cases</td>
          <td>Cross-product fallback</td>
          <td>Biquadratic closed form</td>
      </tr>
  </tbody>
</table>
<p>Horn emphasizes a symmetry property: the inverse transformation should yield exactly the inverse parameters. This holds automatically for the quaternion rotation but requires a specific (symmetric) choice of scale formula.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Horn, B. K. P. (1987). Closed-form solution of absolute orientation using unit quaternions. <em>Journal of the Optical Society of America A</em>, 4(4), 629-642. <a href="https://doi.org/10.1364/JOSAA.4.000629">https://doi.org/10.1364/JOSAA.4.000629</a></p>
<p><strong>Publication</strong>: Journal of the Optical Society of America A, 1987</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with differentiable implementations of the related SVD-based method)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{horn1987closed,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Closed-form solution of absolute orientation using unit quaternions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Horn, Berthold K. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the Optical Society of America A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{629--642}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1987}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Optica Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1364/josaa.4.000629}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GraSP: Graph Recognition via Subgraph Prediction (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</guid><description>GraSP is a general image-to-graph framework using sequential subgraph prediction, applied to OCSR with 67.5% accuracy on QM9.</description><content:encoded><![CDATA[<h2 id="a-general-framework-for-visual-graph-recognition">A General Framework for Visual Graph Recognition</h2>
<p>GraSP (Graph Recognition via Subgraph Prediction) addresses a fundamental limitation in image-to-graph methods: existing solutions are task-specific and do not transfer between domains. Whether the task is OCSR, scene graph recognition, music notation parsing, or road network extraction, each domain has developed independent solutions despite solving the same conceptual problem of extracting a graph from an image.</p>
<p>The key insight is that graph recognition can be reformulated as sequential subgraph prediction using a binary classifier, sidestepping two core difficulties of using graphs as neural network outputs:</p>
<ol>
<li><strong>Graph isomorphism</strong>: An uncolored graph with $n$ nodes has up to $n!$ equivalent representations, making direct output comparison intractable</li>
<li><strong>Compositional outputs</strong>: Nodes, edges, and features are interdependent, so standard i.i.d. loss functions are insufficient</li>
</ol>
<h2 id="sequential-subgraph-prediction-as-an-mdp">Sequential Subgraph Prediction as an MDP</h2>
<p>GraSP formulates graph recognition as a Markov Decision Process. Starting from an empty graph, the method iteratively expands the current graph by adding one edge at a time (connecting either a new node or two existing nodes). At each step, a binary classifier predicts whether each candidate successor graph is a subgraph of the target graph shown in the image.</p>
<p>The critical observation is that the optimal value function $V^{\pi^*}$ satisfies:</p>
<p>$$V^{\pi^*}(\mathcal{G}_t | \mathcal{I}) = 1 \iff \mathcal{G}_t \subseteq \mathcal{G}_{\mathcal{I}}$$</p>
<p>This means the value function reduces to a subgraph membership test, which can be learned as a binary classifier rather than requiring reinforcement learning. Greedy decoding then suffices: at each step, select any successor that the classifier predicts is a valid subgraph, and terminate when the classifier indicates the current graph is complete.</p>
<p>This formulation decouples <strong>decision</strong> (what to add) from <strong>generation</strong> (in what order), making the same model applicable across different graph types without modification.</p>
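<p>The decoding loop can be sketched in a few lines. This is an illustrative mock, not the authors' code: <code>predict</code> stands in for the learned classifier, and graphs are simplified to edge sets over fixed node IDs, ignoring isomorphism and node/edge features.</p>

```python
def greedy_decode(predict, candidate_edges):
    """Grow a graph one edge at a time, guided by a subgraph classifier.

    predict(edges, terminal) -> probability that `edges` is a subgraph of
    the target (terminal=False) or equals the target (terminal=True).
    """
    graph = frozenset()
    while True:
        if predict(graph, terminal=True) > 0.5:
            return graph                  # classifier says: target reached
        for e in candidate_edges:         # try each one-edge successor
            succ = graph | {e}
            if succ != graph and predict(succ, terminal=False) > 0.5:
                graph = succ
                break
        else:
            return graph                  # no valid successor found

# Mock classifier with a known target, to show the control flow.
target = frozenset({(0, 1), (1, 2), (1, 3)})
oracle = lambda g, terminal: float(g == target if terminal else g <= target)
assert greedy_decode(oracle, sorted(target | {(2, 3)})) == target
```

The invalid candidate edge (2, 3) is rejected at every step, so decoding terminates with exactly the target graph.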
<h2 id="architecture-gnn--film-conditioned-cnn">Architecture: GNN + FiLM-Conditioned CNN</h2>
<p>The architecture has three components:</p>
<ol>
<li>
<p><strong>GNN encoder</strong>: A Message Passing Neural Network processes the candidate subgraph, producing a graph embedding. Messages are constructed as concatenations of source node features, target node features, and connecting edge features.</p>
</li>
<li>
<p><strong>FiLM-conditioned CNN</strong>: A ResNet-v2 processes the image, with FiLM layers placed after every normalization layer within each block. The graph embedding conditions the image processing, producing a joint graph-image representation.</p>
</li>
<li>
<p><strong>MLP classification head</strong>: Takes the conditioned image embedding plus a binary terminal flag (indicating whether this is a termination check) and predicts subgraph membership.</p>
</li>
</ol>
<p>The model uses only 7.25M parameters. Group Normalization (8 groups per layer) is used in the CNN, and Layer Normalization in the GNN and MLP.</p>
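<p>The FiLM mechanism itself is just a feature-wise affine modulation of the CNN channels by the graph embedding. A minimal NumPy sketch, where the shapes and the linear projections are illustrative rather than the paper's exact head:</p>

```python
import numpy as np

def film(feature_maps, graph_embedding, W_gamma, W_beta):
    """FiLM: scale and shift each CNN channel using the graph embedding.

    feature_maps: (C, H, W); graph_embedding: (E,);
    W_gamma, W_beta: (C, E) projections producing per-channel gamma/beta.
    """
    gamma = W_gamma @ graph_embedding     # (C,) multiplicative modulation
    beta = W_beta @ graph_embedding       # (C,) additive modulation
    return gamma[:, None, None] * feature_maps + beta[:, None, None]
```

In GraSP this modulation sits after every normalization layer inside each ResNet-v2 block, so the graph hypothesis conditions the image features at every depth.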
<h2 id="training-via-streaming-data-generation">Training via Streaming Data Generation</h2>
<p>Training uses a streaming architecture rather than a fixed dataset:</p>
<ul>
<li>For each iteration, a target graph $\mathcal{G}_T$ is sampled and rendered as an image</li>
<li><strong>Positive samples</strong> are generated by deleting edges that do not disconnect the graph (yielding valid subgraphs)</li>
<li><strong>Negative samples</strong> are generated by expanding successor states and checking via approximate subgraph matching</li>
<li>Two FIFO buffers (one for positives, one for negatives), each holding up to 25,000 images, maintain diverse and balanced mini-batches of 1024 samples</li>
<li>Training uses the RAdam optimizer with a cosine learning rate schedule (warmup over 50M samples, cycle of 250M samples) on 4 A100 GPUs with a 24h budget</li>
</ul>
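<p>The buffer scheme above can be sketched with two bounded deques. Buffer capacity and batch size follow the text; the sampling details are our own illustration:</p>

```python
import random
from collections import deque

pos_buf = deque(maxlen=25_000)   # FIFO: oldest positives are evicted first
neg_buf = deque(maxlen=25_000)

def balanced_minibatch(batch_size=1024):
    """Draw half the batch from each buffer so classes stay balanced."""
    half = batch_size // 2
    batch = [(x, 1) for x in random.choices(pos_buf, k=half)]
    batch += [(x, 0) for x in random.choices(neg_buf, k=half)]
    random.shuffle(batch)
    return batch
```

A producer process keeps appending freshly rendered (graph, image) samples to the buffers while the trainer repeatedly calls <code>balanced_minibatch</code>, which is what makes the data distribution streaming rather than fixed.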
<h2 id="synthetic-benchmarks-on-colored-trees">Synthetic Benchmarks on Colored Trees</h2>
<p>GraSP is evaluated on increasingly complex synthetic tasks involving colored tree graphs:</p>
<ul>
<li><strong>Small trees (6-9 nodes)</strong>: Tasks with varying numbers of node colors (1, 3, 5) and edge colors (1, 3, 5). The model works well across all configurations, with simpler tasks (fewer colors) converging faster.</li>
<li><strong>Larger trees (10-15 nodes)</strong>: The same trends hold but convergence is slower due to increased structural complexity.</li>
<li><strong>Out-of-distribution generalization</strong>: Models trained on 6-9 node trees show zero-shot generalization to 10-node trees, indicating learned patterns are size-independent.</li>
</ul>
<h2 id="ocsr-evaluation-on-qm9">OCSR Evaluation on QM9</h2>
<p>For the real-world OCSR evaluation, GraSP is applied to <a href="/notes/chemistry/datasets/qm9/">QM9</a> molecular images (grayscale, no stereo-bonds) with a 10,000-molecule held-out test set:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OSRA</td>
          <td>45.61%</td>
      </tr>
      <tr>
          <td>GraSP</td>
          <td>67.51%</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>88.36%</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>92.08%</td>
      </tr>
  </tbody>
</table>
<p>GraSP does not match state-of-the-art OCSR tools, but the authors emphasize that the same model architecture and training procedure transfers directly from synthetic tree tasks to molecular graphs with no task-specific modifications. The only domain knowledge incorporated is a simple chemistry rule: not extending nodes that already have degree four.</p>
<p>The method highlights the practical advantage of decoupling decision from generation. Functional groups can be represented at different granularities (as single nodes to reduce trajectory depth, or expanded to reduce trajectory breadth) without changing the model.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/c72bcbf4/grasp">GraSP Code</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation with pre-trained models</td>
      </tr>
  </tbody>
</table>
<p>The repository includes pre-trained models and example trajectories for interactive exploration. Training requires 4 A100 GPUs with a 24h time budget. The QM9 dataset used for OCSR evaluation is publicly available. No license file is included in the repository.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li><strong>Finite type assumption</strong>: The current framework assumes a finite set of node and edge types, limiting applicability to open-vocabulary tasks like scene graph recognition</li>
<li><strong>Scaling to large graphs</strong>: For very large graphs, the branching factor of successor states becomes expensive. Learned filters to prune irrelevant successor states could help</li>
<li><strong>OCSR performance gap</strong>: While GraSP demonstrates transferability, it falls short of specialized OCSR tools that use domain-specific encodings (SMILES) or pixel-level supervision</li>
<li><strong>Modality extension</strong>: The framework could extend beyond images to other input modalities, such as vector embeddings of graphs</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eberhard, A., Neumann, G., &amp; Friederich, P. (2026). Graph Recognition via Subgraph Prediction. <em>arXiv preprint arXiv:2601.15133</em>. <a href="https://arxiv.org/abs/2601.15133">https://arxiv.org/abs/2601.15133</a></p>
<p><strong>Publication</strong>: arXiv 2026</p>
]]></content:encoded></item><item><title>GraphReco: Probabilistic Structure Recognition (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</guid><description>GraphReco is a rule-based OCSR system using Markov networks for probabilistic atom/bond ambiguity resolution during graph assembly.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, H., Yu, Y., &amp; Liu, J.-C. (2026). GraphReco: Probabilistic Structure Recognition for Chemical Molecules. <em>ChemistryOpen</em>, e202500537. <a href="https://doi.org/10.1002/open.202500537">https://doi.org/10.1002/open.202500537</a></p>
<p><strong>Publication</strong>: ChemistryOpen 2026 (Open Access)</p>
<h2 id="a-rule-based-ocsr-system-with-probabilistic-graph-assembly">A Rule-Based OCSR System with Probabilistic Graph Assembly</h2>
<p>GraphReco tackles a challenge that is rarely addressed explicitly in rule-based OCSR: the ambiguity that arises during graph assembly when lower-level component extraction results are imprecise. Small deviations in bond endpoint locations, false positive detections, and spatial proximity between elements all create uncertainty about which atoms and bonds should be connected, merged, or discarded.</p>
<p>The system introduces two main contributions:</p>
<ol>
<li><strong>Fragment Merging (FM) line detection</strong>: An adaptive three-stage algorithm for precise bond line identification across images of variable resolution</li>
<li><strong>Probabilistic ambiguity resolution</strong>: A Markov network that infers the most likely existence and merging state of atom and bond candidates</li>
</ol>
<h2 id="three-stage-pipeline">Three-Stage Pipeline</h2>
<p>GraphReco follows a three-stage workflow:</p>
<ol>
<li>
<p><strong>Component Extraction</strong>: Detects circles (aromatic bonds), bond lines (via the FM algorithm), and chemical symbols (via Tesseract OCR). Includes detection of solid wedge, dashed wedge, dashed line, and wavy bond styles. A semi-open-loop correction step resolves cases where symbols are misclassified as bonds and vice versa.</p>
</li>
<li>
<p><strong>Atom and Bond Ambiguity Resolution</strong>: Creates atom and bond candidates from detected components, builds a Markov network to infer their most probable states, and resolves candidates through existence and merging decisions.</p>
</li>
<li>
<p><strong>Graph Reconstruction</strong>: Assembles resolved atoms and bonds into a molecule graph, selects the largest connected component, and exports as MDL Molfile.</p>
</li>
</ol>
<h2 id="fragment-merging-line-detection">Fragment Merging Line Detection</h2>
<p>Classical Line Hough Transform (LHT) struggles with chemical structure images because bond lines suffer from pixelization, and algorithm parameters that work for one image resolution fail at others. The FM algorithm addresses this with three stages:</p>
<ol>
<li>
<p><strong>Fragment extraction</strong>: Apply LHT with fine-grained parameters (distance resolution $r = 2$, angular resolution $\theta = 2°$) to detect fine line fragments. Walk along detected theoretical lines to find actual black pixels and group them by connectivity.</p>
</li>
<li>
<p><strong>Fragment grouping</strong>: Pair fragments that share similar angles, are close in the perpendicular direction, and are either overlapping or connected by a path of black pixels.</p>
</li>
<li>
<p><strong>Fragment merging</strong>: Merge grouped fragments into single line segments using the two border pixels farthest from the centroid.</p>
</li>
</ol>
<p>The FM algorithm effectively handles the tradeoff that plagues standard LHT: coarse parameters miss short lines and produce overlaps, while fine parameters return many fragments shorter than actual bonds.</p>
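<p>A toy version of the final merging stage can be written directly. This is a simplification: it fuses an already-grouped set of fragments by keeping the two endpoints farthest apart, a stand-in for the paper's "border pixels farthest from the centroid" rule, and it omits the angle/overlap checks of stage 2.</p>

```python
import numpy as np

def merge_fragments(fragments):
    """Merge a group of roughly collinear fragments into one segment.

    fragments: list of ((x1, y1), (x2, y2)) endpoint pairs that the
    grouping stage already decided belong to the same bond line.
    """
    pts = np.array([p for seg in fragments for p in seg], dtype=float)
    # Pairwise endpoint distances; keep the farthest-apart pair.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    return tuple(pts[i]), tuple(pts[j])
```

Three overlapping horizontal fragments spanning x = 0..3, 2..7, and 6..10, for instance, collapse to the single segment from (0, 0) to (10, 0).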
<h2 id="probabilistic-ambiguity-resolution-via-markov-network">Probabilistic Ambiguity Resolution via Markov Network</h2>
<p>After component extraction, GraphReco creates atom and bond candidates rather than directly assembling the graph. Each bond endpoint generates an atom candidate with a circular bounding area of radius:</p>
<p>$$r_b = \min(l_{\text{bond}}, l_{\text{med}}) / 4$$</p>
<p>where $l_{\text{bond}}$ is the bond length and $l_{\text{med}}$ is the median bond length.</p>
<p>A Markov network is constructed with four types of nodes:</p>
<ul>
<li><strong>Atom nodes</strong>: Boolean existence variables for each atom candidate</li>
<li><strong>Bond nodes</strong>: Boolean existence variables for each bond candidate</li>
<li><strong>Atom merge nodes</strong>: Boolean variables for pairs of overlapping atom candidates</li>
<li><strong>Bond merge nodes</strong>: Boolean variables for pairs of nearby bond candidates</li>
</ul>
<p>Potential functions encode rules about when candidates should exist or merge, with merging likelihood between two bond-ending atom candidates defined as a piecewise function of center distance $d$:</p>
<p>$$P(a_1, a_2) = \begin{cases} 0.9, &amp; \text{if } d \leq Q \\ 0.7 - 0.4(d - Q)/(R - Q), &amp; \text{if } Q &lt; d \leq R \\ 0.1, &amp; \text{if } d &gt; R \end{cases}$$</p>
<p>where $Q = \max(r_1, r_2)$ and $R = \min(1.5Q, r_1 + r_2)$. MAP inference determines the final state of all candidates.</p>
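<p>For concreteness, the piecewise merging likelihood transcribes directly into code, with the breakpoints and values exactly as given above:</p>

```python
def merge_likelihood(d, r1, r2):
    """P(a1, a2): likelihood that two bond-ending atom candidates merge,
    as a piecewise function of their center distance d and radii r1, r2."""
    Q = max(r1, r2)
    R = min(1.5 * Q, r1 + r2)
    if d <= Q:
        return 0.9                              # strongly overlapping: merge
    if d <= R:
        return 0.7 - 0.4 * (d - Q) / (R - Q)    # linear fall-off
    return 0.1                                  # too far apart: keep separate
```

For two candidates of radius 4, the likelihood is 0.9 up to distance 4, falls linearly to 0.3 at distance 6, and drops to 0.1 beyond that.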
<h2 id="evaluation-results">Evaluation Results</h2>
<p>GraphReco is evaluated on USPTO benchmarks with InChI string comparison (stereochemistry removed):</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td><strong>94.2</strong></td>
          <td><strong>86.7</strong></td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>92.4</td>
          <td>70.3</td>
          <td>89.1</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>89.9</td>
          <td>63.0</td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>89.7</td>
          <td>63.9</td>
          <td>89.3</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>93.3</td>
          <td>82.8</td>
          <td><strong>91.5</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>35.4</td>
          <td>13.8</td>
          <td>25.2</td>
      </tr>
  </tbody>
</table>
<p>GraphReco outperforms all rule-based systems and most ML systems, with a particularly large margin on USPTO-10K-Abb (abbreviation-heavy molecules). MolGrapher achieves slightly higher accuracy on the USPTO dataset.</p>
<h3 id="robustness-on-perturbed-images">Robustness on Perturbed Images</h3>
<p>On USPTO-perturbed (rotation and shearing applied), rule-based methods degrade substantially:</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-perturbed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolGrapher</td>
          <td><strong>86.7</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>42.3</td>
      </tr>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td>40.6</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>30.7</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>6.4</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>5.1</td>
      </tr>
  </tbody>
</table>
<p>GraphReco performs better than other rule-based systems on perturbed inputs (40.6% vs. under 31%) thanks to its probabilistic assembly, but still falls far behind MolGrapher (86.7%), demonstrating the robustness advantage of learned approaches.</p>
<h2 id="ablation-study">Ablation Study</h2>
<p>Each component contributes substantially to overall performance on USPTO-10K:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full system</td>
          <td>94.2</td>
          <td>86.7</td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>Without FM line detection</td>
          <td>2.9</td>
          <td>5.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>Without atom candidates</td>
          <td>9.8</td>
          <td>0.4</td>
          <td>5.0</td>
      </tr>
      <tr>
          <td>Without bond candidates</td>
          <td>79.1</td>
          <td>75.8</td>
          <td>75.0</td>
      </tr>
      <tr>
          <td>Without Markov network</td>
          <td>88.2</td>
          <td>81.4</td>
          <td>84.2</td>
      </tr>
  </tbody>
</table>
<p>The FM algorithm and atom candidate mechanism are both critical (accuracy drops below 10% without either). Bond candidates provide a moderate improvement (~15 percentage points), and the Markov network adds ~6 points over hard-threshold alternatives.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Deterministic expert rules limit robustness on perturbed or noisy images, as evidenced by the large accuracy gap with MolGrapher on USPTO-perturbed</li>
<li>The system relies on Tesseract OCR for symbol recognition, which may struggle with unusual fonts or degraded image quality</li>
<li>Only handles single 2D molecule structures per image</li>
<li>Stereochemistry is removed during evaluation, so performance on stereo-bond recognition is not assessed</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>GraphReco is implemented in Python and relies on Tesseract OCR, OpenCV, and RDKit. The authors provided an online demo for testing, but the source code and a public repository have not been released.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Online Demo</td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Google Cloud Run deployment (no longer available)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components for full reproduction:</strong></p>
<ul>
<li>Source code is not publicly available</li>
<li>No pre-built package or installable library</li>
<li>Hyperparameters for Markov network potential functions are given in the paper (Equations 8-11), but full implementation details are not released</li>
</ul>
<p><strong>Hardware/compute requirements:</strong> Not specified in the paper. The system uses classical computer vision (Hough transforms, thinning) and probabilistic inference (Markov networks), so GPU hardware is likely not required.</p>
]]></content:encoded></item><item><title>D3PM: Discrete Denoising Diffusion Probabilistic Models</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/discrete-diffusion-models/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/discrete-diffusion-models/</guid><description>D3PMs extend diffusion models to discrete data with structured transition matrices, connecting diffusion to masked language models.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It extends denoising diffusion probabilistic models (DDPMs) from continuous to discrete state-spaces by introducing structured Markov transition matrices for the corruption process. The paper unifies several corruption strategies, draws a formal connection between absorbing-state diffusion and masked language models, and demonstrates competitive results on both image and text generation.</p>
<h2 id="diffusion-beyond-continuous-spaces">Diffusion Beyond Continuous Spaces</h2>
<p>Standard DDPMs operate in continuous state-spaces (e.g., pixel values treated as real numbers) and use Gaussian noise for corruption. Many important data types are inherently discrete: text (tokens from a vocabulary), quantized images (discrete pixel values), molecular structures, and segmentation maps. Prior work by Hoogeboom et al. extended binary diffusion to multinomial diffusion with uniform transition probabilities, but this limits the structure of the corruption process. D3PMs generalize this by allowing arbitrary transition matrices that encode domain-specific inductive biases.</p>
<h2 id="core-innovation-structured-transition-matrices">Core Innovation: Structured Transition Matrices</h2>
<p>D3PMs define a forward corruption process over discrete variables $\mathbf{x} \in \{1, \ldots, K\}^D$ using transition matrices $\mathbf{Q}_t \in \mathbb{R}^{K \times K}$:</p>
<p>$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \text{Cat}(\mathbf{x}_t; \mathbf{p} = \mathbf{x}_{t-1} \mathbf{Q}_t)$$</p>
<p>where $\mathbf{x}_{t-1}$ is a one-hot row vector. The cumulative transition after $t$ steps is $\overline{\mathbf{Q}}_t = \mathbf{Q}_1 \mathbf{Q}_2 \cdots \mathbf{Q}_t$, giving:</p>
<p>$$q(\mathbf{x}_t | \mathbf{x}_0) = \text{Cat}(\mathbf{x}_t; \mathbf{p} = \mathbf{x}_0 \overline{\mathbf{Q}}_t)$$</p>
<p>The paper explores several transition matrix designs:</p>
<p><strong>Uniform diffusion:</strong> $[\mathbf{Q}_t]_{ij} = (1 - \beta_t) \mathbf{1}_{i=j} + \beta_t / K$. Transitions with equal probability to any state. Stationary distribution is uniform.</p>
<p><strong>Absorbing state:</strong> $[\mathbf{Q}_t]_{ij} = (1-\beta_t)\mathbf{1}_{i=j} + \beta_t \mathbf{1}_{j=m}$. Each token transitions to a designated absorbing state $m$ (e.g., [MASK] for text, a gray pixel for images) with probability $\beta_t$ per step, and tokens already at the absorbing state remain there. This establishes a direct connection to masked language models like BERT.</p>
<p><strong>Discretized Gaussian:</strong> Transition probabilities decay as a function of the distance $|i-j|$ between states, mimicking Gaussian diffusion on ordinal data like pixel values.</p>
<p><strong>Embedding-based nearest neighbor:</strong> For text, transitions are weighted by proximity in a pretrained word embedding space, so corruption preferentially swaps words with semantically similar ones.</p>
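<p>The uniform and absorbing-state matrices can be written down explicitly. A small NumPy sketch of the definitions above (function names are our own):</p>

```python
import numpy as np

def uniform_Q(beta, K):
    """Uniform diffusion: stay with prob. (1 - beta) + beta/K, else jump anywhere."""
    return (1 - beta) * np.eye(K) + (beta / K) * np.ones((K, K))

def absorbing_Q(beta, K, m):
    """Absorbing-state diffusion: move to mask state m with prob. beta; m is a trap."""
    Q = (1 - beta) * np.eye(K)
    Q[:, m] += beta          # row m then sums to (1 - beta) + beta = 1
    return Q

def q_xt_given_x0(x0_onehot, Qs):
    """q(x_t | x_0) via the cumulative product Qbar_t = Q_1 Q_2 ... Q_t."""
    Qbar = np.linalg.multi_dot(Qs) if len(Qs) > 1 else Qs[0]
    return x0_onehot @ Qbar
```

Iterating the absorbing matrix drives all probability mass onto the mask state, which is exactly the stationary distribution the text describes.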
<p><strong>Training objective.</strong> The reverse process $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ is parameterized by predicting $\tilde{p}_\theta(\tilde{\mathbf{x}}_0 | \mathbf{x}_t)$ and computing the posterior:</p>
<p>$$p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \propto \sum_{\tilde{\mathbf{x}}_0} q(\mathbf{x}_{t-1} | \mathbf{x}_t, \tilde{\mathbf{x}}_0)\, \tilde{p}_\theta(\tilde{\mathbf{x}}_0 | \mathbf{x}_t)$$</p>
<p>The loss $L_\lambda$ combines the variational lower bound (VLB) with an auxiliary cross-entropy term:</p>
<p>$$L_\lambda = L_{\text{VLB}} + \lambda \, L_{\text{CE}}$$</p>
<p>where $L_{\text{CE}}$ is a reweighted cross-entropy loss on the $\mathbf{x}_0$ prediction that stabilizes training and improves sample quality. The VLB decomposes into per-timestep KL divergences between the true and predicted reverse transitions.</p>
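<p>The posterior in the sum above is itself categorical and follows from Bayes' rule; a sketch (variable names are ours, and the quantities assume precomputed transition matrices):</p>

```python
import numpy as np

def reverse_posterior(Qt, Qbar_prev, x0_onehot, xt):
    """q(x_{t-1} | x_t, x_0) ∝ q(x_t | x_{t-1}) q(x_{t-1} | x_0).

    Qt: (K, K) one-step matrix at time t; Qbar_prev: cumulative product
    Q_1 ... Q_{t-1}; x0_onehot: (K,); xt: integer state observed at time t.
    """
    unnorm = Qt[:, xt] * (x0_onehot @ Qbar_prev)   # elementwise over x_{t-1}
    return unnorm / unnorm.sum()
```

With a nearly noiseless uniform matrix ($\beta_t$ small) and $x_t = x_0$, the posterior concentrates on that same state, as expected.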
<h2 id="experiments-and-results">Experiments and Results</h2>
<p><strong>Image generation (CIFAR-10):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Loss</th>
          <th>IS</th>
          <th>FID</th>
          <th>NLL (bpd)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>D3PM uniform</td>
          <td>$L_{\text{VLB}}$</td>
          <td>5.99</td>
          <td>51.27</td>
          <td>5.08</td>
      </tr>
      <tr>
          <td>D3PM absorbing</td>
          <td>$L_\lambda$ ($\lambda{=}0.001$)</td>
          <td>6.78</td>
          <td>30.97</td>
          <td>4.40</td>
      </tr>
      <tr>
          <td>D3PM Gauss</td>
          <td>$L_{\text{VLB}}$</td>
          <td>7.75</td>
          <td>15.30</td>
          <td>3.97</td>
      </tr>
      <tr>
          <td>D3PM Gauss</td>
          <td>$L_\lambda$ ($\lambda{=}0.001$)</td>
          <td>8.54</td>
          <td>8.34</td>
          <td>3.98</td>
      </tr>
      <tr>
          <td>D3PM Gauss + logistic</td>
          <td>$L_\lambda$ ($\lambda{=}0.001$)</td>
          <td>8.56</td>
          <td>7.34</td>
          <td>3.44</td>
      </tr>
      <tr>
          <td>DDPM $L_{\text{simple}}$ (continuous)</td>
          <td>&ndash;</td>
          <td>9.46</td>
          <td>3.17</td>
          <td>3.75</td>
      </tr>
  </tbody>
</table>
<p>The best discrete D3PM variant is D3PM Gauss + logistic, which uses the combined $L_\lambda$ loss with a truncated logistic parameterization: the standard softmax output is replaced by a discretized logistic distribution over pixel values, which assigns probability mass to each discrete bin via a continuous logistic CDF and thus better captures the ordinal structure of pixel intensities. This variant exceeds the continuous DDPM in log-likelihood (3.44 vs. 3.75 bpd) while approaching its sample quality (FID 7.34 vs. 3.17).</p>
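<p>The discretization idea is the same one used in PixelCNN++-style output heads; the sketch below shows the per-bin mass computation under an assumed mean and scale, not the paper's exact network head:</p>

```python
import numpy as np

def discretized_logistic(mu, scale, K=256):
    """Probability mass for each of K ordinal bins under a logistic CDF.

    Each bin k receives sigma((k + 0.5 - mu)/scale) - sigma((k - 0.5 - mu)/scale),
    with the edge bins absorbing the remaining tails (the 'truncation').
    """
    def sigma(z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))  # clip avoids overflow
    k = np.arange(K)
    upper = np.where(k == K - 1, 1e3, (k + 0.5 - mu) / scale)
    lower = np.where(k == 0, -1e3, (k - 0.5 - mu) / scale)
    return sigma(upper) - sigma(lower)
```

Because adjacent bins share their CDF boundary, the mass automatically decays smoothly with distance from the mean, respecting the ordering of pixel intensities in a way an unconstrained softmax does not.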
<p><strong>Text generation (text8, character-level, 1000 steps):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>bpc</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>D3PM absorbing ($L_\lambda$)</td>
          <td>1.45</td>
      </tr>
      <tr>
          <td>D3PM NN ($L_{\text{VLB}}$)</td>
          <td>1.59</td>
      </tr>
      <tr>
          <td>D3PM uniform</td>
          <td>1.61</td>
      </tr>
      <tr>
          <td>Discrete Flow (Tran et al.)</td>
          <td>1.23</td>
      </tr>
  </tbody>
</table>
<p>Among the D3PM variants and baselines evaluated, D3PM absorbing achieves the best bpc on text8 apart from Discrete Flow (Tran et al., 2019). On LM1B (sentencepiece vocabulary of 8192 tokens), D3PM absorbing achieves a perplexity of 76.9 at 1000 steps, compared to 137.9 for D3PM uniform and 43.6 for a comparable autoregressive transformer, demonstrating that discrete diffusion scales to large vocabularies.</p>
<p><strong>Ablation findings:</strong></p>
<ul>
<li>The combined loss $L_\lambda$ (VLB plus an auxiliary cross-entropy term) is critical: for D3PM Gauss, it improves FID from 15.30 ($L_{\text{VLB}}$) to 8.34 ($L_\lambda$, $\lambda{=}0.001$). Adding the truncated logistic parameterization further improves FID to 7.34.</li>
<li>Discretized Gaussian transitions outperform both uniform and absorbing-state transitions on CIFAR-10 across all metrics.</li>
<li>For text, the absorbing-state (mask) model outperforms uniform and nearest-neighbor models. Nearest-neighbor diffusion provides only marginal improvement over uniform, a surprising negative result.</li>
<li>The $\mathbf{x}_0$-parameterization ensures the learned reverse distribution has the correct sparsity pattern dictated by the transition matrix $\mathbf{Q}_t$.</li>
</ul>
<h2 id="findings-and-limitations">Findings and Limitations</h2>
<ul>
<li>The choice of transition matrix is an important design decision that encodes domain-specific inductive biases. Discretized Gaussian transitions work best for ordinal image data; absorbing-state transitions work best for text.</li>
<li>D3PMs formally unify diffusion models and masked language models: absorbing-state diffusion with a [MASK] token is equivalent to a reweighted BERT-style training objective.</li>
<li>The combined VLB + auxiliary loss ($L_\lambda$) achieves better density estimation (3.44 bpd) than continuous DDPMs (3.75 bpd) while producing competitive samples.</li>
<li>Sample quality (best FID 7.34 for D3PM Gauss + logistic) still lags behind continuous-space DDPMs (FID 3.17) on CIFAR-10, though the gap narrows with structured transitions and the auxiliary loss.</li>
<li>Scaling to very large numbers of categories $K$ requires special techniques (low-rank corruption or matrix exponentials) to manage the $O(K^2 T)$ memory cost of storing transition matrices.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Image generation</td>
          <td>CIFAR-10</td>
          <td>32x32, 256 categories</td>
          <td>Quantized to 256 ordinal values per channel</td>
      </tr>
      <tr>
          <td>Text generation</td>
          <td>text8</td>
          <td>Character-level</td>
          <td>27 character vocabulary, sequences of length 256</td>
      </tr>
      <tr>
          <td>Text generation</td>
          <td>LM1B</td>
          <td>Word-level</td>
          <td>Sentencepiece vocabulary of 8192 tokens, sequence length 128</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Noise schedules</strong>: Linear schedule for D3PM Gauss, cosine schedule for D3PM uniform, and a novel mutual information schedule for absorbing and nearest-neighbor models</li>
<li><strong>Reverse parameterization</strong>: $\mathbf{x}_0$-parameterization with posterior computation via Bayes&rsquo; rule</li>
<li><strong>Loss</strong>: $L_{\text{VLB}} + \lambda L_{\text{CE}}$ with $\lambda = 0.001$ for images and $\lambda = 0.01$ for text absorbing models</li>
<li><strong>Scaling</strong>: Low-rank corruption (absorbing, uniform) scales as $O(r^2 T)$; matrix exponentials for nearest-neighbor transitions</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Image models</strong>: Modified U-Net architecture from Ho et al. (2020) adapted for categorical output via softmax over $K$ classes</li>
<li><strong>Text models</strong>: 12-layer <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style transformer encoder with 70M parameters (12 heads, MLP dim 3072, QKV dim 768)</li>
<li><strong>Timesteps</strong>: $T = 1000$ for both images and text, though text models can be evaluated with fewer steps (e.g., 256 or 20)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>Best D3PM</th>
          <th>Continuous DDPM</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FID</td>
          <td>CIFAR-10</td>
          <td>7.34 (Gauss + logistic)</td>
          <td>3.17</td>
      </tr>
      <tr>
          <td>NLL (bpd)</td>
          <td>CIFAR-10</td>
          <td>3.44 (Gauss + logistic)</td>
          <td>3.75</td>
      </tr>
      <tr>
          <td>BPC</td>
          <td>text8 (char)</td>
          <td>1.45 (absorbing, $L_\lambda$)</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Perplexity</td>
          <td>LM1B</td>
          <td>76.9 (absorbing)</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained for 1M steps with batch size 512 on TPUv2 or TPUv3</li>
<li>Text models: 12-layer transformer encoder (T5 architecture), 70M parameters</li>
<li>Image models: Modified U-Net architecture from Ho et al. (2020)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/google-research/google-research/tree/master/d3pm">google-research/d3pm</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official JAX/Flax implementation for image and text experiments</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Austin, J., Johnson, D. D., Ho, J., Tarlow, D., &amp; van den Berg, R. (2021). Structured Denoising Diffusion Models in Discrete State-Spaces. <em>NeurIPS 2021</em>. <a href="https://arxiv.org/abs/2107.03006">https://arxiv.org/abs/2107.03006</a></p>
<p><strong>Publication</strong>: NeurIPS 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{austin2021structured,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{Structured Denoising Diffusion Models in Discrete State-Spaces}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Austin, Jacob and Johnson, Daniel D. and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>    = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2021}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">Score-Based Generative Modeling with SDEs</a></li>
</ul>
]]></content:encoded></item><item><title>Consistency Models: Fast One-Step Diffusion Generation</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/consistency-models/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/consistency-models/</guid><description>Consistency models enable one-step generation by learning to map any point on a diffusion ODE trajectory to its origin, achieving FID 3.55 on CIFAR-10.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It proposes consistency models, a new class of generative models designed for fast one-step (or few-step) generation. The models can be trained either by distilling pretrained diffusion models (consistency distillation) or as standalone generative models from scratch (consistency training). The paper provides theoretical analysis of both training modes and achieves FID 3.55 on CIFAR-10 for single-step non-adversarial generation (state of the art at the time of publication).</p>
<h2 id="the-slow-sampling-problem-in-diffusion">The Slow Sampling Problem in Diffusion</h2>
<p>Diffusion models produce high-quality samples but require iterating through many denoising steps (often tens to hundreds), making generation slow compared to GANs or VAEs. Previous approaches to speed up sampling include faster ODE/SDE solvers (DDIM, DPM-Solver) and progressive distillation. These either still require multiple steps or depend on a complex multi-stage distillation pipeline. The goal is a model that can generate high-quality samples in a single forward pass while optionally allowing more steps for better quality.</p>
<h2 id="core-innovation-the-self-consistency-property">Core Innovation: The Self-Consistency Property</h2>
<p>The key idea builds on the Probability Flow (PF) ODE from the score-based SDE framework. The PF ODE describes a deterministic trajectory that converts noise into data, governed by the learned score function. For the VE-SDE parameterization used by EDM (Karras et al., 2022), this takes the form:</p>
<p>$$\frac{d\mathbf{x}_t}{dt} = -t \, s_\phi(\mathbf{x}_t, t)$$</p>
<p>where $s_\phi$ is a pretrained score model. A <strong>consistency function</strong> $f(\mathbf{x}_t, t)$ maps any point on an ODE trajectory to the trajectory&rsquo;s origin $\mathbf{x}_\epsilon$. The defining property is self-consistency:</p>
<p>$$f(\mathbf{x}_t, t) = f(\mathbf{x}_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T]$$</p>
<p>for any points $\mathbf{x}_t$ and $\mathbf{x}_{t'}$ on the same PF ODE trajectory.</p>
<p><strong>Parameterization.</strong> The model enforces the boundary condition $f(\mathbf{x}_\epsilon, \epsilon) = \mathbf{x}_\epsilon$ using skip connections:</p>
<p>$$f_\theta(\mathbf{x}, t) = c_{\text{skip}}(t)\, \mathbf{x} + c_{\text{out}}(t)\, F_\theta(\mathbf{x}, t)$$</p>
<p>where $c_{\text{skip}}(\epsilon) = 1$ and $c_{\text{out}}(\epsilon) = 0$, ensuring the boundary condition is satisfied by construction.</p>
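<p>A minimal sketch of this boundary-respecting parameterization, assuming EDM-style coefficient forms with $\sigma_{\text{data}} = 0.5$ and $\epsilon = 0.002$ (both values are assumptions here, not taken from the paper):</p>

```python
import numpy as np

SIGMA_DATA = 0.5   # EDM data-std constant (assumed value)
EPS = 0.002        # smallest timestep epsilon (assumed value)

def c_skip(t):
    """Equals 1 at t = EPS, decays toward 0 as t grows."""
    return SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)

def c_out(t):
    """Equals 0 at t = EPS, so the network cannot move f at the boundary."""
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)

def consistency_fn(F, x, t):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t)."""
    return c_skip(t) * x + c_out(t) * F(x, t)

F = lambda x, t: np.tanh(x)      # stand-in for the trained network F_theta
x = np.array([0.3, -1.2, 2.0])
f_eps = consistency_fn(F, x, EPS)
```

<p>At $t = \epsilon$ the output equals $\mathbf{x}$ exactly for any network $F_\theta$, which is the point of the construction.</p>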
<p><strong>Consistency Distillation (CD).</strong> Given a pretrained diffusion model, CD trains a consistency model by enforcing self-consistency between adjacent timesteps:</p>
<p>$$\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) = \mathbb{E}\left[\lambda(t_n)\, d\!\left(f_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}(\hat{\mathbf{x}}_{t_n}^\phi, t_n)\right)\right]$$</p>
<p>where $\hat{\mathbf{x}}_{t_n}^\phi$ is obtained by running one step of the ODE solver using the pretrained score model, $\theta^-$ is an exponential moving average (EMA) of $\theta$, and $d(\cdot, \cdot)$ is a distance metric. The use of a target network $\theta^-$ (updated via EMA) parallels techniques from deep Q-learning and momentum contrastive learning.</p>
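<p>A single-sample sketch of one CD loss evaluation, with squared $\ell_2$ standing in for LPIPS and a closed-form toy score that is exact for standard-normal data under $\mathbf{x}_t = \mathbf{x} + t\mathbf{z}$ (with the origin taken at $t = 0$ for simplicity); everything here is illustrative, not the paper&rsquo;s implementation:</p>

```python
import numpy as np

def euler_ode_step(x, t_next, t, score):
    """One Euler step of the empirical PF ODE dx/dt = -t * s_phi(x, t),
    integrating from t down to t_next < t."""
    return x + (t_next - t) * (-t * score(x, t))

def cd_loss(f_student, f_target, score, x0, z, t, t_next):
    """Consistency distillation loss for one sample: compare the student
    at (x_t, t) to the EMA target at the solver-estimated (x_hat, t_next)."""
    x_t = x0 + t * z
    x_hat = euler_ode_step(x_t, t_next, t, score)
    return float(np.sum((f_student(x_t, t) - f_target(x_hat, t_next)) ** 2))

# Toy: data ~ N(0, I), so the marginal score at level t is -x / (1 + t^2)
# and the exact consistency function is x / sqrt(1 + t^2).
score = lambda x, t: -x / (1.0 + t**2)
f_true = lambda x, t: x / np.sqrt(1.0 + t**2)
ident = lambda x, t: x
rng = np.random.default_rng(0)
x0, z = rng.normal(size=3), rng.normal(size=3)
loss_true = cd_loss(f_true, f_true, score, x0, z, 1.0, 0.8)
```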
<p><strong>Consistency Training (CT).</strong> CT eliminates the need for a pretrained diffusion model. It replaces the ODE solver step with a score estimate derived from the denoising score matching identity:</p>
<p>$$\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) = \mathbb{E}\left[\frac{\mathbf{x} - \mathbf{x}_t}{t^2} \,\middle|\, \mathbf{x}_t\right]$$</p>
<p>Because this identity lets us estimate the score from noisy data alone (without a pretrained model), we can compute the ODE update directly from training samples. This allows training directly on data pairs $(\mathbf{x}, \mathbf{x} + t\mathbf{z})$ where $\mathbf{z} \sim \mathcal{N}(0, I)$.</p>
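<p>The resulting CT objective is just two noisy views of the same clean sample sharing one Gaussian draw $\mathbf{z}$, sketched below with squared $\ell_2$ standing in for the paper&rsquo;s LPIPS distance:</p>

```python
import numpy as np

def ct_loss(f_student, f_target, x0, z, t, t_next):
    """Consistency training loss: both points x0 + t*z and x0 + t_next*z
    use the SAME Gaussian draw z, so no pretrained score model or ODE
    solver is needed."""
    a = f_student(x0 + t * z, t)
    b = f_target(x0 + t_next * z, t_next)
    return float(np.sum((a - b) ** 2))

rng = np.random.default_rng(1)
x0, z = rng.normal(size=3), rng.normal(size=3)
f = lambda x, t: x / np.sqrt(1.0 + t**2)   # exact map for N(0, I) toy data
loss_equal = ct_loss(f, f, x0, z, 0.7, 0.7)
```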
<p><strong>Theoretical guarantee.</strong> If CD achieves zero loss, the consistency model error is bounded by $O((\Delta t)^p)$ where $\Delta t$ is the maximum timestep gap and $p$ is the order of the ODE solver.</p>
<h2 id="experiments-and-benchmarks">Experiments and Benchmarks</h2>
<p><strong>Datasets:</strong> CIFAR-10 (32x32), ImageNet 64x64, LSUN Bedroom 256x256, LSUN Cat 256x256.</p>
<p><strong>Architecture:</strong> All models use the NCSN++/EDM architecture. CD distills from pretrained EDM models.</p>
<p><strong>Key results for consistency distillation (CD):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Steps</th>
          <th>FID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CIFAR-10</td>
          <td>1</td>
          <td>3.55</td>
      </tr>
      <tr>
          <td>CIFAR-10</td>
          <td>2</td>
          <td>2.93</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>1</td>
          <td>6.20</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>2</td>
          <td>4.70</td>
      </tr>
      <tr>
          <td>LSUN Bedroom 256</td>
          <td>1</td>
          <td>7.80</td>
      </tr>
      <tr>
          <td>LSUN Bedroom 256</td>
          <td>2</td>
          <td>5.22</td>
      </tr>
      <tr>
          <td>LSUN Cat 256</td>
          <td>1</td>
          <td>11.0</td>
      </tr>
      <tr>
          <td>LSUN Cat 256</td>
          <td>2</td>
          <td>8.84</td>
      </tr>
  </tbody>
</table>
<p>CD outperforms progressive distillation (PD) across all datasets and sampling steps, with the exception of single-step generation on Bedroom 256x256 where CD with $\ell_2$ slightly underperforms PD with $\ell_2$.</p>
<p><strong>Key results for consistency training (CT):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Steps</th>
          <th>FID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CIFAR-10</td>
          <td>1</td>
          <td>8.70</td>
      </tr>
      <tr>
          <td>CIFAR-10</td>
          <td>2</td>
          <td>5.83</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>1</td>
          <td>13.0</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>2</td>
          <td>11.1</td>
      </tr>
      <tr>
          <td>LSUN Bedroom 256</td>
          <td>1</td>
          <td>16.0</td>
      </tr>
      <tr>
          <td>LSUN Cat 256</td>
          <td>1</td>
          <td>20.7</td>
      </tr>
  </tbody>
</table>
<p>CT outperforms existing single-step non-adversarial models (VAEs, normalizing flows), e.g., improving over DC-VAE&rsquo;s FID of 17.90 on CIFAR-10. Samples from CT share structural similarity with EDM samples from the same initial noise, suggesting CT does not suffer from mode collapse.</p>
<p><strong>Zero-shot editing:</strong> Consistency models support colorization, super-resolution, inpainting, stroke-guided generation, interpolation, and denoising at test time without task-specific training, by modifying the multi-step sampling algorithm.</p>
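<p>The multi-step sampler behind these edits alternates one-step denoising with partial re-noising to a lower noise level; the editing variants constrain $\mathbf{x}$ between steps. A sketch with the constraint logic omitted (the decreasing schedule <code>taus</code> is illustrative):</p>

```python
import numpy as np

def multistep_sample(f, x_T, taus, eps=0.002, seed=0):
    """Multistep consistency sampling: one-step denoise from the highest
    level, then for each lower level tau, re-noise to tau and denoise."""
    rng = np.random.default_rng(seed)
    x = f(x_T, taus[0])                      # first jump from pure noise
    for tau in taus[1:]:
        z = rng.normal(size=np.shape(x))
        x_tau = x + np.sqrt(tau**2 - eps**2) * z   # re-noise to level tau
        x = f(x_tau, tau)                          # denoise again
    return x

f_demo = lambda x, t: np.zeros_like(x)   # stand-in consistency model
sample = multistep_sample(f_demo, np.ones(4), [80.0, 10.0, 1.0])
```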
<h2 id="findings-and-limitations">Findings and Limitations</h2>
<ul>
<li>Consistency distillation achieves state-of-the-art FID for one-step generation (3.55 on CIFAR-10, 6.20 on ImageNet 64x64).</li>
<li>Multi-step sampling provides a smooth quality-compute tradeoff: more steps yield better FID.</li>
<li>CT produces competitive results without any pretrained diffusion model, making consistency models a standalone generative model family.</li>
<li>The LPIPS distance metric $d(\cdot, \cdot)$ generally outperforms $\ell_1$ and $\ell_2$ for training consistency models.</li>
<li>At higher resolutions (LSUN 256x256), the gap between CD/CT and full EDM sampling widens.</li>
<li>CT currently underperforms CD, suggesting room for improvement in the standalone training paradigm.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Primary benchmark</td>
          <td>CIFAR-10</td>
          <td>32x32, 50K train</td>
          <td>FID on 50K samples</td>
      </tr>
      <tr>
          <td>Scaling benchmark</td>
          <td>ImageNet 64x64</td>
          <td>64x64, 1.28M</td>
          <td>Unconditional generation</td>
      </tr>
      <tr>
          <td>High-res benchmark</td>
          <td>LSUN Bedroom, Cat</td>
          <td>256x256</td>
          <td>Unconditional generation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>ODE solver for CD</strong>: Euler and Heun (2nd order) solvers on the empirical PF ODE</li>
<li><strong>EMA for target network</strong>: Decay rate $\mu$ scheduled as a function of training step</li>
<li><strong>Schedule functions</strong>: $N$ (number of discretization steps) and $\mu$ (EMA rate) increase over training following specific schedules (see Appendix C of the paper)</li>
<li><strong>Distance metric</strong>: LPIPS performs best; $\ell_2$ and $\ell_1$ also evaluated</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: NCSN++/EDM architecture from Karras et al. (2022)</li>
<li><strong>CD teacher</strong>: Pretrained EDM models</li>
<li><strong>Parameterization</strong>: Skip-connection formulation with $c_{\text{skip}}(t)$ and $c_{\text{out}}(t)$ from EDM</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>CD 1-step</th>
          <th>CT 1-step</th>
          <th>EDM (full)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FID</td>
          <td>CIFAR-10</td>
          <td>3.55</td>
          <td>8.70</td>
          <td>2.04</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>ImageNet 64</td>
          <td>6.20</td>
          <td>13.0</td>
          <td>2.44</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>LSUN Bedroom</td>
          <td>7.80</td>
          <td>16.0</td>
          <td>3.57</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>LSUN Cat</td>
          <td>11.0</td>
          <td>20.7</td>
          <td>6.69</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training details follow EDM conventions</li>
<li>CD and CT use the same batch sizes and learning rate schedules as EDM training</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/openai/consistency_models">openai/consistency_models</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained checkpoints</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, Y., Dhariwal, P., Chen, M., &amp; Sutskever, I. (2023). Consistency Models. <em>ICML 2023</em>. <a href="https://arxiv.org/abs/2303.01469">https://arxiv.org/abs/2303.01469</a></p>
<p><strong>Publication</strong>: ICML 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{song2023consistency,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{Consistency Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Song, Yang and Dhariwal, Prafulla and Chen, Mark and Sutskever, Ilya}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>    = <span style="color:#e6db74">{202}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>       = <span style="color:#e6db74">{https://arxiv.org/abs/2303.01469}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/openai/consistency_models">GitHub Repository</a></li>
<li><a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">Score-Based Generative Modeling with SDEs</a></li>
</ul>
]]></content:encoded></item><item><title>AdaptMol: Domain Adaptation for Molecular OCSR (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</guid><description>AdaptMol is an image-to-graph OCSR model using MMD-based domain adaptation and self-training for hand-drawn molecule recognition.</description><content:encoded><![CDATA[<h2 id="bridging-the-synthetic-to-real-gap-in-graph-based-ocsr">Bridging the Synthetic-to-Real Gap in Graph-Based OCSR</h2>
<p>Most OCSR methods are trained on synthetic molecular images and evaluated on high-quality literature figures, both exhibiting relatively uniform styles. Hand-drawn molecules represent a particularly challenging domain with irregular bond lengths, variable stroke widths, and inconsistent atom symbols. Prior graph reconstruction methods like MolScribe and MolGrapher drop below 15% accuracy on hand-drawn images, despite achieving over 65% on literature datasets.</p>
<p>AdaptMol addresses this with a three-stage pipeline that enables effective transfer from synthetic to real-world data without requiring graph annotations in the target domain:</p>
<ol>
<li><strong>Base model training</strong> on synthetic data with comprehensive augmentation and dual position representation</li>
<li><strong>MMD alignment</strong> of bond-level features between source and target domains</li>
<li><strong>Self-training</strong> with SMILES-validated pseudo-labels on unlabeled target images</li>
</ol>
<h2 id="end-to-end-graph-reconstruction-architecture">End-to-End Graph Reconstruction Architecture</h2>
<p>AdaptMol builds on MolScribe&rsquo;s architecture, using a Swin Transformer base encoder ($384 \times 384$ input) with a 6-layer Transformer decoder (8 heads, hidden dim 256). The model jointly predicts atoms and bonds:</p>
<p><strong>Atom prediction</strong> follows the Pix2Seq approach, autoregressively generating a sequence of atom tokens:</p>
<p>$$S_N = [l_1, x_1, y_1, l_2, x_2, y_2, \dots, l_n, x_n, y_n]$$</p>
<p>where $l_i$ is the atom label and $(x_i, y_i)$ are discretized coordinate bin indices.</p>
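<p>A rough sketch of this flattening (the bin count and tuple layout are illustrative, not AdaptMol&rsquo;s exact tokenizer):</p>

```python
def atoms_to_sequence(atoms, bins=64):
    """Flatten atoms into a Pix2Seq-style token sequence
    [l_1, x_1, y_1, l_2, x_2, y_2, ...], with coordinates in [0, 1)
    discretized into `bins` index bins."""
    seq = []
    for label, x, y in atoms:
        seq += [label, min(int(x * bins), bins - 1), min(int(y * bins), bins - 1)]
    return seq

seq = atoms_to_sequence([("C", 0.10, 0.50), ("O", 0.90, 0.25)], bins=64)
```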
<p><strong>Dual position representation</strong> adds a 2D spatial heatmap on top of token-based coordinate prediction. The heatmap aggregates joint spatial distributions of all atoms:</p>
<p>$$\mathbf{H} = \text{Upsample}\left(\sum_{i=1}^{n} P_y^{(i)} \otimes P_x^{(i)}\right)$$</p>
<p>where $P_x^{(i)}$ and $P_y^{(i)}$ are coordinate probability distributions from the softmax logits. During training, this heatmap is supervised with Gaussian kernels at ground-truth atom positions. This reduces false positive atom predictions substantially (from 356 to 33 false positives at IoU 0.05).</p>
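<p>A sketch of the heatmap aggregation before upsampling, assuming each row of $P_x$ and $P_y$ is already a softmax distribution over coordinate bins:</p>

```python
import numpy as np

def position_heatmap(P_x: np.ndarray, P_y: np.ndarray) -> np.ndarray:
    """Sum of per-atom outer products P_y^(i) (x) P_x^(i).
    P_x, P_y have shape (n_atoms, bins); output is (bins, bins)."""
    return np.einsum('iy,ix->yx', P_y, P_x)

# Two atoms with delta coordinate distributions
P_x = np.zeros((2, 8)); P_y = np.zeros((2, 8))
P_x[0, 2] = P_y[0, 5] = 1.0    # atom 0 at (x-bin 2, y-bin 5)
P_x[1, 6] = P_y[1, 1] = 1.0    # atom 1 at (x-bin 6, y-bin 1)
H = position_heatmap(P_x, P_y)
```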
<p><strong>Bond prediction</strong> extracts atom-level features from decoder hidden states and enriches them with encoder visual features via multi-head attention with a learnable residual weight $\alpha$:</p>
<p>$$\mathbf{F}_{\text{enriched}} = \text{LayerNorm}(\mathbf{F}_{\text{atom}} + \alpha \cdot \text{MHA}(\mathbf{F}_{\text{atom}}, \mathbf{E}_{\text{vis}}))$$</p>
<p>A feed-forward network then predicts bond types between all atom pairs.</p>
<h2 id="bond-level-domain-adaptation-via-mmd">Bond-Level Domain Adaptation via MMD</h2>
<p>The key insight is that bond features are domain-invariant: they encode structural relationships (single, double, triple, aromatic) independent of visual style. Atom-level alignment is problematic due to class imbalance (carbon dominates), multi-token spanning (functional groups), and position-dependent features.</p>
<p>AdaptMol aligns bond-level feature distributions via class-conditional Maximum Mean Discrepancy:</p>
<p>$$L_{\text{MMD}} = \frac{1}{|\mathcal{C}'|} \sum_{c \in \mathcal{C}'} \text{MMD}(F_c^{\text{src}}, F_c^{\text{tgt}})$$</p>
<p>where $\mathcal{C}'$ contains classes with sufficient samples in both domains. Confidence-based filtering retains only high-confidence predictions (confidence &gt; 0.95, entropy &lt; 0.1) for alignment, tightening to 0.98 and 0.05 after the first epoch. Progressive loss weighting follows a schedule of 0.1 (epoch 0), 0.075 (epoch 1), and 0.05 thereafter.</p>
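<p>A dependency-light sketch of the class-conditional MMD with a single RBF kernel (the paper&rsquo;s kernel choice and bandwidth are not reproduced here; this is a biased estimator for illustration):</p>

```python
import numpy as np

def mmd2(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased squared MMD with RBF kernel k(a, b) = exp(-gamma ||a - b||^2)."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean())

def class_conditional_mmd(src: dict, tgt: dict, gamma: float = 1.0) -> float:
    """Average MMD over the bond classes C' present in both domains."""
    shared = sorted(set(src) & set(tgt))
    return float(np.mean([mmd2(src[c], tgt[c], gamma) for c in shared]))

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))
loss_same = mmd2(X, X)            # identical feature sets
loss_shifted = mmd2(X, X + 3.0)   # domain-shifted feature sets
```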
<p>An important side effect: MMD alignment improves inter-class bond discrimination, reducing confusion between visually similar bond types (e.g., jagged double bonds vs. aromatic bonds).</p>
<h2 id="self-training-with-smiles-validation">Self-Training with SMILES Validation</h2>
<p>After MMD alignment, the model generates predictions on unlabeled target images. Predicted molecular graphs are converted to SMILES and validated against ground-truth SMILES annotations. Only exact matches are retained as pseudo-labels, providing complete graph supervision (atom coordinates, element types, bond types) that was previously unavailable in the target domain.</p>
<p>This approach is far more data-efficient than alternatives: AdaptMol uses only 4,080 real hand-drawn images vs. DECIMER-Handdraw&rsquo;s 38 million synthetic hand-drawn images.</p>
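<p>A sketch of the pseudo-label filter; in practice canonicalization would round-trip through RDKit (e.g. <code>Chem.MolToSmiles(Chem.MolFromSmiles(s))</code>), stubbed out below as a pluggable hook so the sketch stays dependency-free:</p>

```python
def filter_pseudo_labels(predictions, references, canonicalize=lambda s: s):
    """Keep a predicted graph only when its SMILES exactly matches the
    ground-truth SMILES after canonicalization; matches then supply full
    graph supervision (coordinates, elements, bonds) for self-training."""
    return [graph
            for (graph, smiles), ref in zip(predictions, references)
            if canonicalize(smiles) == canonicalize(ref)]

preds = [({"atoms": 2}, "CCO"), ({"atoms": 3}, "CCN")]
refs = ["CCO", "CCC"]
kept = filter_pseudo_labels(preds, refs)
```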
<h2 id="comprehensive-data-augmentation">Comprehensive Data Augmentation</h2>
<p>Two categories of augmentation are applied during synthetic data generation:</p>
<ul>
<li><strong>Structure-rendering augmentation</strong>: Functional group abbreviation substitution, bond type conversions (single to wavy/aromatic, Kekule to aromatic rings), R-group insertion, and rendering parameter randomization (font family/size, bond width/spacing)</li>
<li><strong>Image-level augmentation</strong>: Geometric operations, quality degradation, layout variations, and chemical document artifacts (caption injection, arrows, marginal annotations)</li>
</ul>
<p>Structure-rendering augmentation provides the larger benefit, contributing ~20% accuracy improvement on JPO and ~30% on ACS benchmarks.</p>
<h2 id="results">Results</h2>
<h3 id="hand-drawn-molecule-recognition">Hand-Drawn Molecule Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>DECIMER test (Acc)</th>
          <th>ChemPix (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>AdaptMol</strong></td>
          <td><strong>82.6</strong></td>
          <td><strong>60.5</strong></td>
      </tr>
      <tr>
          <td>DECIMER v2.2</td>
          <td>71.9</td>
          <td>51.4</td>
      </tr>
      <tr>
          <td>AtomLenz</td>
          <td>30.0</td>
          <td>48.4</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>10.1</td>
          <td>26.1</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>10.7</td>
          <td>14.5</td>
      </tr>
  </tbody>
</table>
<h3 id="literature-and-synthetic-benchmarks">Literature and Synthetic Benchmarks</h3>
<p>AdaptMol achieves state-of-the-art on 4 of 6 literature benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>AdaptMol</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>DECIMER v2.2</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CLEF</td>
          <td><strong>92.7</strong></td>
          <td>87.5</td>
          <td>57.2</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td><strong>88.2</strong></td>
          <td>78.8</td>
          <td>73.0</td>
          <td>75.7</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td><strong>89.3</strong></td>
          <td>88.2</td>
          <td>85.1</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td><strong>75.5</strong></td>
          <td>72.8</td>
          <td>41.0</td>
          <td>37.7</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>90.9</td>
          <td><strong>92.6</strong></td>
          <td>74.9</td>
          <td>59.6</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>84.0</td>
          <td><strong>84.4</strong></td>
          <td>0.0</td>
          <td>66.3</td>
      </tr>
  </tbody>
</table>
<p>MolScribe edges out AdaptMol on USPTO and Staker. The authors attribute this to MolScribe training directly on all 680K USPTO samples, which may cause it to specialize to that distribution.</p>
<h3 id="pipeline-ablation">Pipeline Ablation</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Hand-drawn</th>
          <th>ChemDraw</th>
          <th>JPO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base model</td>
          <td>10.4</td>
          <td>92.3</td>
          <td>82.7</td>
      </tr>
      <tr>
          <td>+ Font augmentation</td>
          <td>30.2</td>
          <td>92.5</td>
          <td>82.8</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD</td>
          <td>42.1</td>
          <td>94.0</td>
          <td>83.0</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD + Self-training</td>
          <td><strong>82.6</strong></td>
          <td><strong>95.9</strong></td>
          <td><strong>88.2</strong></td>
      </tr>
  </tbody>
</table>
<p>Each component contributes meaningfully: font augmentation (+19.8), MMD alignment (+11.9), and self-training (+40.5) on hand-drawn accuracy.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fffh1/AdaptMol">AdaptMol Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">Model + Data</a></td>
          <td>Model/Dataset</td>
          <td>MIT</td>
          <td>Pretrained checkpoint and datasets</td>
      </tr>
  </tbody>
</table>
<p>Training uses 2 NVIDIA A100 GPUs (40GB each). Base model trains for 30 epochs on 1M synthetic samples. Domain adaptation involves 3 steps: USPTO self-training (3 iterations of 3 epochs), MMD alignment on hand-drawn data (5 epochs), and hand-drawn self-training (5 iterations).</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Sequence length constraints prevent accurate prediction of very large molecules (&gt;120 atoms), where resizing causes significant information loss</li>
<li>Cannot recognize Markush structures with repeating unit notation (parentheses/brackets), as synthetic training data lacks such cases</li>
<li>Stereochemistry information is lost when stereo bonds connect to abbreviated functional groups due to RDKit post-processing limitations</li>
<li>The retrained baseline (30 epochs from scratch on synthetic + pseudo-labels) achieves higher hand-drawn accuracy (87.2%) but at the cost of cross-domain robustness on literature benchmarks</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, F., He, E., &amp; Verspoor, K. (2026). AdaptMol: Domain Adaptation for Molecular Image Recognition with Limited Supervision. <em>Research Square preprint</em>. <a href="https://doi.org/10.21203/rs.3.rs-8365561/v1">https://doi.org/10.21203/rs.3.rs-8365561/v1</a></p>
<p><strong>Publication</strong>: Research Square preprint, February 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/fffh1/AdaptMol">GitHub</a></li>
<li><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">HuggingFace (model + data)</a></li>
</ul>
]]></content:encoded></item><item><title>Spherical CNNs: Rotation-Equivariant Networks on the Sphere</title><link>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/spherical-cnns/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/spherical-cnns/</guid><description>Cohen et al. introduce rotation-equivariant spherical CNNs that define cross-correlation on SO(3), computed via generalized FFT from harmonic analysis.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>method paper</strong> that introduces the theory and implementation of convolutional neural networks on the sphere. The key contribution is defining spherical cross-correlation that is SO(3)-equivariant and can be computed efficiently using generalized Fast Fourier Transforms from non-commutative harmonic analysis.</p>
<h2 id="why-planar-convolutions-fail-on-spherical-data">Why planar convolutions fail on spherical data</h2>
<p>Many problems require analyzing spherical signals: omnidirectional vision for robots and autonomous vehicles, molecular regression, and global weather modeling. A naive approach of projecting spherical data to a plane introduces space-varying distortions that break translational weight sharing. Rotating a spherical signal cannot be emulated by translating its planar projection.</p>
<p>The fundamental issue is geometric: patterns on a plane move via translations, but patterns on a sphere move via 3D rotations. A spherical CNN should detect patterns regardless of how they are rotated over the sphere. The relevant symmetry group is SO(3) (the group of all 3D rotations).</p>
<h2 id="spherical-cross-correlation-and-the-so3-output-space">Spherical cross-correlation and the SO(3) output space</h2>
<p>The paper defines spherical cross-correlation by replacing filter translations with rotations. For spherical signals $f$ on $S^2$ (the unit sphere) and filter $\psi$, the correlation is:</p>
<p>$$\lbrack\psi \star f\rbrack(R) = \langle L_R \psi, f \rangle = \int_{S^2} \sum_{k=1}^{K} \psi_k(R^{-1}x) f_k(x) \, dx$$</p>
<p>where $L_R$ is the rotation operator $\lbrack L_R f\rbrack(x) = f(R^{-1}x)$.</p>
<p>A crucial subtlety: whereas the space of moves for the plane (2D translations) is isomorphic to the plane itself, the space of moves for the sphere (3D rotations) is SO(3), a different three-dimensional manifold. The output of a spherical correlation is therefore a function on SO(3), not on $S^2$. This means subsequent layers must use SO(3) correlation:</p>
<p>$$\lbrack\psi \star f\rbrack(R) = \int_{\text{SO}(3)} \sum_{k=1}^{K} \psi_k(R^{-1}Q) f_k(Q) \, dQ$$</p>
<h3 id="equivariance-proof">Equivariance proof</h3>
<p>Equivariance follows from the unitarity of $L_R$ in a single line:</p>
<p>$$\lbrack\psi \star \lbrack L_Q f\rbrack\rbrack(R) = \langle L_R \psi, L_Q f \rangle = \langle L_{Q^{-1}R} \psi, f \rangle = \lbrack\psi \star f\rbrack(Q^{-1}R) = \lbrack L_Q\lbrack\psi \star f\rbrack\rbrack(R)$$</p>
<p>This holds for both $S^2$ and SO(3) correlation.</p>
<h2 id="efficient-computation-via-generalized-fft">Efficient computation via generalized FFT</h2>
<p>A naive SO(3) correlation is $O(n^6)$. The paper addresses this using the generalized Fourier transform (GFT) from non-commutative harmonic analysis.</p>
<p>The GFT projects functions onto orthogonal basis functions: spherical harmonics $Y_m^l(x)$ for $S^2$, and Wigner D-functions $D_{mn}^l(R)$ for SO(3). Both satisfy generalized Fourier theorems:</p>
<ul>
<li><strong>SO(3) convolution theorem</strong>: $\widehat{\psi \star f}^l = \hat{f}^l \cdot \hat{\psi}^{l\dagger}$ (blockwise matrix multiplication of SO(3) Fourier coefficients)</li>
<li><strong>$S^2$ convolution theorem</strong>: $\widehat{\psi \star f}^l = \hat{f}^l \cdot \hat{\psi}^{l\dagger}$ (outer product of $S^2$ Fourier coefficient vectors)</li>
</ul>
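<p>The block shapes make the distinction concrete. An illustrative sketch with random real coefficients (the actual transforms are complex-valued; this only shows the block structure, not a real Fourier transform):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
L = 4  # bandwidth: degrees l = 0, ..., L-1

# S^2 signals carry one coefficient vector of length 2l+1 per degree l
f_hat = [rng.normal(size=2 * l + 1) for l in range(L)]
p_hat = [rng.normal(size=2 * l + 1) for l in range(L)]

# S^2 theorem: the correlation lives on SO(3), so each degree-l block of
# its Fourier transform is the OUTER product of the coefficient vectors
s2_blocks = [np.outer(fl, pl) for fl, pl in zip(f_hat, p_hat)]
assert [blk.shape for blk in s2_blocks] == [(2 * l + 1, 2 * l + 1) for l in range(L)]

# SO(3) theorem: inputs already carry (2l+1)x(2l+1) matrix blocks, and the
# blockwise product is an ordinary matrix multiplication
F = [rng.normal(size=(2 * l + 1, 2 * l + 1)) for l in range(L)]
P = [rng.normal(size=(2 * l + 1, 2 * l + 1)) for l in range(L)]
so3_blocks = [Fl @ Pl.T for Fl, Pl in zip(F, P)]
assert [blk.shape for blk in so3_blocks] == [(2 * l + 1, 2 * l + 1) for l in range(L)]
```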
<p>The SO(3) FFT works in two steps: (1) standard 2D FFT over the $\alpha$ and $\gamma$ Euler angles, then (2) linear contraction of the $\beta$ axis with precomputed Wigner-d function samples, implemented as a custom GPU kernel.</p>
<h2 id="experiments">Experiments</h2>
<h3 id="equivariance-error">Equivariance error</h3>
<p>Since the theory applies to continuous functions while the implementation is discretized, the authors measure the equivariance error empirically. The approximation error grows with resolution and depth but stays manageable at practical bandwidths. With ReLU activations the error is higher but stays flat across layers, indicating that it stems from feature-map rotation (exact only for bandlimited functions) rather than accumulating through the network.</p>
<h3 id="spherical-mnist">Spherical MNIST</h3>
<p>MNIST digits projected onto the sphere, tested in non-rotated (NR) and rotated (R) settings with ~165K parameters per model:</p>
<table>
  <thead>
      <tr>
          <th>Train / Test</th>
          <th>Planar CNN</th>
          <th>Spherical CNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NR / NR</td>
          <td>99%</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>R / R</td>
          <td>45%</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>NR / R</td>
          <td>9%</td>
          <td>85%</td>
      </tr>
  </tbody>
</table>
<p>The planar CNN collapses to chance when trained on non-rotated data and tested on rotated data. The spherical CNN maintains strong performance across all settings.</p>
<h3 id="3d-shape-recognition-shrec17">3D shape recognition (SHREC17)</h3>
<p>3D meshes projected onto an enclosing sphere via ray casting. For each point on the sphere, a ray is cast toward the origin, collecting three channels from the intersection: the ray length and the cos/sin of the angle between the ray and the surface normal. The same three channels are computed for the mesh&rsquo;s convex hull, giving six channels total. The network (~1.4M parameters) placed 2nd on recall, mAP, and NDCG, and 3rd on precision and F1 in the SHREC17 competition, competing against methods with highly task-specialized architectures.</p>
<h3 id="molecular-atomization-energy-qm7">Molecular atomization energy (QM7)</h3>
<p>Molecules represented as spherical potential functions around each atom (generalizing the Coulomb matrix). A deep ResNet-style $S^2$CNN with DeepSets-style permutation-invariant aggregation over atoms achieved 8.47 RMSE, outperforming all kernel-based approaches and sorted Coulomb matrix methods.</p>
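<p>The key property of the aggregation can be illustrated in a few lines (a generic DeepSets-style sketch, not the paper's exact readout head): sum-pooling per-atom embeddings makes the molecule-level representation independent of atom ordering.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
atom_features = rng.normal(size=(7, 8))  # 7 atoms, 8-dim embeddings each

# DeepSets-style aggregation: sum-pool over atoms, then the result would be
# passed through a shared network; the pooled vector is permutation-invariant
pooled = atom_features.sum(axis=0)
perm = rng.permutation(7)
assert np.allclose(pooled, atom_features[perm].sum(axis=0))
```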
<h2 id="discussion-and-future-directions">Discussion and future directions</h2>
<p>The authors highlight several avenues for future work. For volumetric tasks like 3D model recognition, extending beyond SO(3) to the roto-translation group SE(3) could improve results. They also note that a Steerable CNN for the sphere would enable analysis of vector fields (e.g., global wind directions). Omnidirectional vision is mentioned as a compelling application as 360-degree sensors become more prevalent.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>The official PyTorch implementation is publicly available, though it does not support recent PyTorch versions because of changes in the FFT interface.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jonkhler/s2cnn">s2cnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation (deprecated for modern PyTorch)</td>
      </tr>
  </tbody>
</table>
<p>Hardware requirements from the paper: the SHREC17 model uses 8GB GPU memory at batch size 16 and takes 50 hours to train. The QM7 model uses 7GB at batch size 20 and takes 3 hours to train. Datasets used (Spherical MNIST, SHREC17, QM7) are all publicly available.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cohen, T. S., Geiger, M., Köhler, J., &amp; Welling, M. (2018). Spherical CNNs. <em>International Conference on Learning Representations</em>. <a href="https://arxiv.org/abs/1801.10130">https://arxiv.org/abs/1801.10130</a></p>
<p><strong>Publication</strong>: ICLR 2018</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=Hkbd5xZRb">OpenReview</a></li>
<li><a href="https://arxiv.org/abs/1801.10130">arXiv</a></li>
<li><a href="https://github.com/jonkhler/s2cnn">GitHub</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{cohen2018spherical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Spherical {CNNs}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cohen, Taco S. and Geiger, Mario and K{\&#34;o}hler, Jonas and Welling, Max}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SE(3)-Transformers: Equivariant Attention for 3D Data</title><link>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/se3-transformers/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/se3-transformers/</guid><description>Fuchs et al. combine self-attention with SE(3)-equivariance for 3D point clouds using invariant attention weights and equivariant value messages.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>method paper</strong> that introduces the SE(3)-Transformer, a self-attention mechanism for 3D point clouds and graphs that is equivariant under continuous 3D rotations and translations. It builds on tensor field networks (TFNs) by adding data-dependent attention weights, resolving a known expressiveness limitation of equivariant convolutions.</p>
<h2 id="why-equivariant-attention-for-point-clouds">Why equivariant attention for point clouds?</h2>
<p>Point cloud data appears in 3D object scans, molecular structures, and particle simulations. Two properties are essential: handling varying numbers of irregularly sampled points, and invariance to global changes in pose (rotations and translations).</p>
<p>Self-attention handles variable-size inputs naturally and has proven effective across many domains. Tensor field networks provide SE(3)-equivariant convolutions but suffer from a key limitation: their filter kernels are decomposed into learnable radial functions and fixed angular components (spherical harmonics). The angular dependence is completely constrained by the equivariance condition, leaving no learnable degrees of freedom in the angular direction. This has been identified in the literature as severely limiting performance.</p>
<p>The SE(3)-Transformer resolves this by introducing data-dependent attention weights that modulate the angular profile of the kernels while maintaining equivariance.</p>
<h2 id="architecture-invariant-attention-meets-equivariant-values">Architecture: invariant attention meets equivariant values</h2>
<p>The core layer combines three components:</p>
<p>$$\mathbf{f}_{\text{out},i}^{\ell} = \underbrace{\mathbf{W}_V^{\ell\ell} \mathbf{f}_{\text{in},i}^{\ell}}_{\text{self-interaction}} + \sum_{k \geq 0} \sum_{j \in \mathcal{N}_i \setminus i} \underbrace{\alpha_{ij}}_{\text{attention}} \underbrace{\mathbf{W}_V^{\ell k}(\mathbf{x}_j - \mathbf{x}_i) \mathbf{f}_{\text{in},j}^k}_{\text{value message}}$$</p>
<h3 id="invariant-attention-weights">Invariant attention weights</h3>
<p>The attention weights use dot-product attention between equivariant queries and keys:</p>
<p>$$\alpha_{ij} = \frac{\exp(\mathbf{q}_i^\top \mathbf{k}_{ij})}{\sum_{j' \in \mathcal{N}_i \setminus i} \exp(\mathbf{q}_i^\top \mathbf{k}_{ij'})}$$</p>
<p>Both $\mathbf{q}_i$ and $\mathbf{k}_{ij}$ are constructed using TFN-type linear embeddings, making them SE(3)-equivariant. Their inner product is invariant because SO(3) representations are orthogonal: $\mathbf{q}^\top \mathbf{S}_g^\top \mathbf{S}_g \mathbf{k} = \mathbf{q}^\top \mathbf{k}$.</p>
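<p>For type-1 features this invariance is easy to verify numerically. A small illustrative sketch (higher-type features would use Wigner-D matrices in place of the 3x3 rotation):</p>

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
S = Rotation.from_euler("xyz", [0.3, -1.2, 2.5]).as_matrix()  # orthogonal representation

q = rng.normal(size=3)        # a type-1 (vector) query
ks = rng.normal(size=(5, 3))  # type-1 keys from 5 neighbours

def attention(q, ks):
    logits = ks @ q
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

alpha = attention(q, ks)
alpha_rot = attention(S @ q, ks @ S.T)  # rotate every query and key
assert np.allclose(alpha, alpha_rot)    # attention weights are rotation-invariant
```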
<h3 id="equivariant-value-messages">Equivariant value messages</h3>
<p>The value messages use the same TFN kernel structure as tensor field networks: weight kernels $\mathbf{W}_V^{\ell k}(\mathbf{x})$ decomposed into learnable radial functions and Clebsch-Gordan/spherical harmonic angular components. Features are typed by irreducible representation degree $\ell$ (the independent matrix blocks into which SO(3) group actions decompose): type-0 vectors are rotation-invariant scalars, type-1 vectors transform as 3D vectors, and so on.</p>
<h3 id="angular-modulation">Angular modulation</h3>
<p>The attention weights $\alpha_{ij}$ multiply the value messages, creating data-dependent kernels $\alpha_{ij} \mathbf{W}_V^{\ell k}(\mathbf{x})$. This effectively modulates the angular profile of the fixed spherical harmonic components, adding learnable angular degrees of freedom while preserving equivariance. The authors describe this as one of the first examples of a nonlinear equivariant layer.</p>
<h3 id="attentive-self-interaction">Attentive self-interaction</h3>
<p>The paper also introduces attentive self-interaction as an alternative to the standard linear self-interaction (analogous to 1x1 convolutions). Instead of fixed learned weights across all points, the weights are generated by an MLP operating on invariant inner products of the input features:</p>
<p>$$w_{i,c'c}^{\ell\ell} = \text{MLP}\left(\bigoplus_{c,c'} \mathbf{f}_{\text{in},i,c'}^{\ell\top} \mathbf{f}_{\text{in},i,c}^{\ell}\right)$$</p>
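<p>A toy numeric check of why this construction stays equivariant (illustrative only; the paper uses a learned MLP per feature degree, and the channel counts here are arbitrary): the Gram matrix of channel inner products is invariant, so any weights computed from it yield an equivariant self-interaction.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
c, dim = 4, 3                      # 4 channels of type-1 (3-dim) features
F = rng.normal(size=(c, dim))
D = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # a random orthogonal matrix

# Gram matrix of channel inner products is invariant under rotating each channel
G = F @ F.T
assert np.allclose(G, (F @ D.T) @ (F @ D.T).T)

# Toy stand-in for the MLP: weights depend only on the invariant G, so the
# resulting self-interaction commutes with the rotation
W = rng.normal(size=(c * c, c * c))
w = np.tanh(G.reshape(-1) @ W).reshape(c, c)
out, out_rot = w @ F, w @ (F @ D.T)
assert np.allclose(out_rot, out @ D.T)  # output transforms like the input
```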
<h2 id="experiments">Experiments</h2>
<h3 id="n-body-particle-simulation">N-body particle simulation</h3>
<p>Five particles carry positive or negative charges and exert attractive or repulsive forces on one another. The task is predicting positions and velocities 500 timesteps ahead. The SE(3)-Transformer achieves 0.0076 MSE on position (vs. 0.0139 for Set Transformer and 0.0151 for TFN), with equivariance error on the order of $10^{-7}$, confirming exact equivariance up to numerical precision.</p>
<h3 id="scanobjectnn-real-world-3d-object-classification">ScanObjectNN (real-world 3D object classification)</h3>
<p>2902 real-world scanned objects across 15 categories. This task is only SO(2)-invariant (gravity axis matters), so the authors provide the z-component as an additional scalar input. With only 128 input points, the SE(3)-Transformer+z achieves 85.0% accuracy, competitive with methods using 1024 points and task-specific architectures. The model learns to ignore the symmetry-breaking z-input when trained on rotation-augmented data.</p>
<h3 id="qm9-molecular-property-regression"><a href="/notes/chemistry/datasets/qm9/">QM9</a> molecular property regression</h3>
<p>134k molecules with up to 29 atoms, predicting 6 quantum chemical properties. The SE(3)-Transformer achieves competitive results against other equivariant models (TFN, Cormorant), with improvements over TFN on all six targets. Across all three experiments, the SE(3)-Transformer outperforms both a non-equivariant attention baseline (Set Transformer) and equivariant models without attention (TFN).</p>
<h3 id="practical-contributions">Practical contributions</h3>
<p>The paper includes a PyTorch spherical harmonics implementation that is 10x faster than Scipy on CPU and 100-1000x faster on GPU. For a ScanObjectNN model, this yields roughly 22x speedup of the forward pass compared to the lie-learn library, directly addressing a major bottleneck of TFN-based architectures.</p>
<h2 id="conclusions-and-limitations">Conclusions and limitations</h2>
<p>Adding attention to a roto-translation-equivariant model consistently led to higher accuracy and increased training stability across all three experiments. For large neighbourhoods, attention proved essential for model convergence. The equivariance constraints also improved performance compared to conventional (non-equivariant) attention in all experiments.</p>
<p>The authors note that the SE(3)-Transformer is inherently suited for classification and regression on molecular data and discuss applications in drug research, including early-stage suitability classification of molecules for inhibiting viral reproductive cycles.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/FabianFuchsML/se3-transformer-public">se3-transformer-public</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch + DGL implementation</td>
      </tr>
  </tbody>
</table>
<p>The repository includes code for N-body simulations and QM9 experiments. Hyperparameters and architecture details are provided in the paper&rsquo;s appendix (4 equivariant layers, representation degrees, channels per degree, learning rates, batch sizes). Hardware requirements are not explicitly stated in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fuchs, F. B., Worrall, D. E., Fischer, V., &amp; Welling, M. (2020). SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks. <em>Advances in Neural Information Processing Systems</em>, 33. <a href="https://arxiv.org/abs/2006.10503">https://arxiv.org/abs/2006.10503</a></p>
<p><strong>Publication</strong>: NeurIPS 2020</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2006.10503">arXiv</a></li>
<li><a href="https://github.com/FabianFuchsML/se3-transformer-public">GitHub</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fuchs2020se3,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{SE(3)-Transformers}: 3D Roto-Translation Equivariant Attention Networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fuchs, Fabian B. and Worrall, Daniel E. and Fischer, Volker and Welling, Max}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OCSU: Optical Chemical Structure Understanding (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</guid><description>OCSU task for translating molecular images into multi-level descriptions. Introduces Vis-CheBI20 dataset and DoubleCheck/Mol-VL for molecular understanding.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, S., Xie, Y., Cai, B., Xie, A., Liu, G., Qiao, M., Xing, J., &amp; Nie, Z. (2025). OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery. <em>arXiv preprint arXiv:2501.15415</em>. <a href="https://doi.org/10.48550/arXiv.2501.15415">https://doi.org/10.48550/arXiv.2501.15415</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/PharMolix/OCSU">Code and Dataset (GitHub)</a></li>
</ul>
<h2 id="multi-level-chemical-understanding-method-and-resource">Multi-Level Chemical Understanding (Method and Resource)</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution.</p>
<ul>
<li><strong>Methodological</strong>: It proposes two novel architectures, <strong>DoubleCheck</strong> (an enhanced recognition model) and <strong>Mol-VL</strong> (an end-to-end vision-language model), to solve the newly formulated OCSU task.</li>
<li><strong>Resource</strong>: It constructs and releases <strong>Vis-CheBI20</strong>, the first large-scale dataset specifically designed for optical chemical structure understanding, containing 29.7K images and 117.7K image-text pairs.</li>
</ul>
<h2 id="the-motivation-for-ocsu-beyond-basic-graph-recognition">The Motivation for OCSU Beyond Basic Graph Recognition</h2>
<p>Existing methods for processing molecular images focus narrowly on <strong>Optical Chemical Structure Recognition (OCSR)</strong>, which translates an image solely into a machine-readable graph or SMILES string. However, SMILES strings are not chemist-friendly and lack high-level semantic context.</p>
<ul>
<li><strong>Gap</strong>: There is a lack of systems that can translate chemical diagrams into human-readable descriptions (e.g., functional groups, IUPAC names) alongside the graph structure.</li>
<li><strong>Goal</strong>: To enable <strong>Optical Chemical Structure Understanding (OCSU)</strong>, bridging the gap between visual representations and both machine/chemist-readable descriptions to support drug discovery and property prediction.</li>
</ul>
<h2 id="key-innovations-doublecheck-mol-vl-and-the-vis-chebi20-dataset">Key Innovations: DoubleCheck, Mol-VL, and the Vis-CheBI20 Dataset</h2>
<p>The paper introduces the <strong>OCSU task</strong>, enabling multi-level understanding (motif, molecule, and abstract levels). To solve this, it introduces two distinct paradigms:</p>
<ol>
<li><strong>DoubleCheck (OCSR-based)</strong>: An enhancement to standard OCSR models (like MolScribe) that performs a &ldquo;second look&rdquo; at locally ambiguous atoms. It uses attentive feature enhancement to fuse global molecular features with local features from ambiguous regions.</li>
<li><strong>Mol-VL (OCSR-free)</strong>: An end-to-end Vision-Language Model (VLM) based on Qwen2-VL. It uses multi-task learning to directly generate text descriptions from molecular images without an intermediate SMILES step.</li>
<li><strong>Vis-CheBI20 Dataset</strong>: A new benchmark specifically constructed for OCSU, deriving captions and functional group data from ChEBI-20 and PubChem.</li>
</ol>
<h2 id="methodology-and-experimental-evaluation">Methodology and Experimental Evaluation</h2>
<p>The authors evaluated both paradigms on <strong>Vis-CheBI20</strong> and existing benchmarks (USPTO, ACS) across four subtasks:</p>
<ol>
<li><strong>Functional Group Caption</strong>: Retrieval/F1 score evaluation.</li>
<li><strong>Molecule Description</strong>: Natural language generation metrics (BLEU, ROUGE, METEOR).</li>
<li><strong>IUPAC Naming</strong>: Text generation metrics (BLEU, ROUGE).</li>
<li><strong>SMILES Naming (OCSR)</strong>: Exact matching accuracy ($Acc_s$).</li>
</ol>
<p><strong>Baselines</strong>:</p>
<ul>
<li><strong>Task-Specific</strong>: MolScribe, MolVec, OSRA.</li>
<li><strong>LLM/VLM</strong>: Qwen2-VL, BioT5+, Mol-Instructions.</li>
<li><strong>Ablation</strong>: DoubleCheck vs. MolScribe backbone to test the &ldquo;feature enhancement&rdquo; mechanism.</li>
</ul>
<h2 id="results-and-conclusions-paradigm-trade-offs">Results and Conclusions: Paradigm Trade-Offs</h2>
<ul>
<li><strong>DoubleCheck Superiority</strong>: DoubleCheck outperformed MolScribe on OCSR tasks across all benchmarks. On USPTO, it achieved <strong>92.85%</strong> $Acc_s$ (vs. 92.57%), and on the ACS dataset it showed a <strong>+3.12%</strong> gain on chiral molecules. On Vis-CheBI20, DoubleCheck improved over MolScribe by an average of 2.27% across all metrics.</li>
<li><strong>Paradigm Trade-offs</strong>:
<ul>
<li><strong>Mol-VL (OCSR-free)</strong> excelled at semantic tasks like <strong>Functional Group Captioning</strong>, achieving <strong>97.32%</strong> F1 (vs. 93.63% for DoubleCheck &amp; RDKit and 89.60% for MolScribe &amp; RDKit). It benefits from end-to-end learning of structural context.</li>
<li><strong>DoubleCheck (OCSR-based)</strong> performed better on <strong>IUPAC naming recall</strong> and exact SMILES recovery, as explicit graph reconstruction is more precise for rigid nomenclature than VLM generation.</li>
</ul>
</li>
<li><strong>Conclusion</strong>: Enhancing submodules improves OCSR-based paradigms, while end-to-end VLMs offer stronger semantic understanding but struggle with exact syntax generation (SMILES/IUPAC).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Vis-CheBI20 Dataset</strong></p>
<ul>
<li><strong>Source</strong>: Derived from ChEBI-20 and PubChem.</li>
<li><strong>Size</strong>: 29,700 molecular diagrams, 117,700 image-text pairs.</li>
<li><strong>Generation</strong>: Images generated from SMILES using RDKit to simulate real-world journal/patent styles.</li>
<li><strong>Splits</strong> (vary by task, see table below):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Train Size</th>
          <th style="text-align: left">Test Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Functional Group</td>
          <td style="text-align: left">26,144</td>
          <td style="text-align: left">3,269</td>
      </tr>
      <tr>
          <td style="text-align: left">Description</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
      <tr>
          <td style="text-align: left">IUPAC Naming</td>
          <td style="text-align: left">26,200</td>
          <td style="text-align: left">2,680</td>
      </tr>
      <tr>
          <td style="text-align: left">SMILES Naming</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>DoubleCheck (Attentive Feature Enhancement)</strong></p>
<ol>
<li><strong>Ambiguity Detection</strong>: Uses atom prediction confidence to identify &ldquo;ambiguous atoms&rdquo;.</li>
<li><strong>Masking</strong>: Applies a 2D Gaussian mask to the image centered on the ambiguous atom.</li>
<li><strong>Local Encoding</strong>: A Swin-B encoder ($\Phi_l$) encodes the masked image region.</li>
<li><strong>Fusion</strong>: Aligns local features ($\mathcal{F}_l$) with global features ($\mathcal{F}_g$) using a 2-layer MLP and fuses them via weighted summation.</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{F}_e = \mathcal{F}_g + \text{MLP}(\mathcal{F}_g \oplus \hat{\mathcal{F}}_l) \cdot \hat{\mathcal{F}}_l
\end{aligned}
$$</p>
<ol start="5">
<li><strong>Two-Stage Training</strong>:
<ul>
<li>Stage 1: Train atom/bond predictors (30 epochs).</li>
<li>Stage 2: Train alignment/fusion modules with random Gaussian mask noise (10 epochs).</li>
</ul>
</li>
</ol>
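<p>The fusion step above can be sketched as follows (a schematic with made-up layer sizes and single-linear stand-ins for the alignment and gating MLPs; names are illustrative, not from the released code):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_align = rng.normal(scale=0.1, size=(d, d))       # stands in for the 2-layer alignment MLP
W_gate = rng.normal(scale=0.1, size=(2 * d, 1))    # stands in for the gating MLP

def fuse(f_g, f_l):
    f_hat = np.tanh(f_l @ W_align)                 # align local features to the global space
    gate = np.concatenate([f_g, f_hat], axis=-1) @ W_gate  # MLP(F_g concat F_hat_l)
    return f_g + gate * f_hat                      # F_e = F_g + gate * F_hat_l

f_e = fuse(rng.normal(size=(2, d)), rng.normal(size=(2, d)))
assert f_e.shape == (2, d)
```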
<p><strong>Mol-VL (Multi-Task VLM)</strong></p>
<ul>
<li><strong>Prompting</strong>: System prompt: &ldquo;You are working as an excellent assistant in chemistry&hellip;&rdquo;</li>
<li><strong>Tokens</strong>: Uses <code>&lt;image&gt;</code> and <code>&lt;/image&gt;</code> special tokens.</li>
<li><strong>Auxiliary Task</strong>: Functional group recognition (identifying highlighted groups) added to training to improve context learning.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>DoubleCheck</strong>:
<ul>
<li><strong>Backbone</strong>: MolScribe architecture.</li>
<li><strong>Encoders</strong>: Swin-B for both global and local atom encoding.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>:
<ul>
<li><strong>Base Model</strong>: Qwen2-VL (2B and 7B versions).</li>
<li><strong>Vision Encoder</strong>: ViT with naive dynamic resolution and M-RoPE.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Metrics</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Exact Match Accuracy ($Acc_s$), Chiral Accuracy ($Acc_c$).</li>
<li><strong>Functional Groups</strong>: F1 Score (Information Retrieval task).</li>
<li><strong>Text Generation</strong>: BLEU-2/4, METEOR, ROUGE-L.</li>
</ul>
<p><strong>Selected Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left"><strong>92.85%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>MolScribe</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left">92.57%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mol-VL-7B</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left"><strong>97.32%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck &amp; RDKit</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left">93.63%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>DoubleCheck</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>4 days</strong>.
<ul>
<li>Max LR: 4e-4.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>10 days</strong>.
<ul>
<li>Max LR: 1e-5, 50 epochs.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/PharMolix/OCSU">PharMolix/OCSU (GitHub)</a></td>
          <td style="text-align: left">Code, Model, Dataset</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation, Mol-VL-7B weights, and Vis-CheBI20 dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>The long-tail distribution of functional groups in training data limits performance on uncommon chemical structures.</li>
<li>Mol-VL struggles with exact syntax generation (SMILES and IUPAC) compared to explicit graph-reconstruction approaches.</li>
<li>Vis-CheBI20 images are synthetically generated via RDKit, which may not fully capture the diversity of real-world journal and patent images.</li>
<li>The authors note that OCSU technologies should be restricted to research purposes, as downstream molecule discovery applications could potentially generate harmful molecules.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{fanOCSUOpticalChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSU}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fan, Siqi and Xie, Yuguang and Cai, Bowen and Xie, Ailin and Liu, Gaochao and Qiao, Mu and Xing, Jie and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GTR-CoT: Graph Traversal Chain-of-Thought for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</guid><description>GTR-VL uses graph traversal chain-of-thought and two-stage training to improve optical chemical structure recognition on printed and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, J., He, Y., Yang, H., Wu, J., Ge, L., Wei, X., Wang, Y., Li, L., Ao, H., Liu, C., Wang, B., Wu, L., &amp; He, C. (2025). GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition (arXiv:2506.07553). arXiv. <a href="https://doi.org/10.48550/arXiv.2506.07553">https://doi.org/10.48550/arXiv.2506.07553</a></p>
<p><strong>Publication</strong>: arXiv preprint (2025)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2506.07553">Paper on arXiv</a></li>
</ul>
<h2 id="contribution-vision-language-modeling-for-ocsr">Contribution: Vision-Language Modeling for OCSR</h2>
<p>This is a <strong>method paper</strong> that introduces GTR-VL, a Vision-Language Model for Optical Chemical Structure Recognition (OCSR). The work addresses the persistent challenge of converting molecular structure images into machine-readable formats, with a particular focus on handling chemical abbreviations that cause errors in existing systems.</p>
<h2 id="motivation-the-abbreviation-bottleneck">Motivation: The Abbreviation Bottleneck</h2>
<p>The work tackles a long-standing bottleneck in chemical informatics: most existing OCSR systems produce incorrect structures when they encounter abbreviated functional groups. When a chemist draws &ldquo;Ph&rdquo; for phenyl or &ldquo;Et&rdquo; for ethyl, current models fail because they have been trained on data where images contain abbreviations but the ground-truth labels contain fully expanded molecular graphs.</p>
<p>This creates a fundamental mismatch. The model sees &ldquo;Ph&rdquo; in the image but is told the &ldquo;correct&rdquo; answer is a full benzene ring. The supervision signal is inconsistent with what is actually visible.</p>
<p>Beyond this data problem, existing graph-parsing methods use a two-stage approach: predict all atoms first, then predict all bonds. This is inefficient and ignores the structural constraints that could help during prediction. The authors argue that mimicking how humans analyze molecular structures - following bonds from atom to atom in a connected traversal - would be more effective.</p>
<h2 id="novelty-graph-traversal-as-visual-chain-of-thought">Novelty: Graph Traversal as Visual Chain-of-Thought</h2>
<p>The novelty lies in pairing faithful data annotation with a traversal-based architecture for OCSR. The main contributions are:</p>
<ol>
<li>
<p><strong>Graph Traversal as Visual Chain of Thought</strong>: GTR-VL generates molecular graphs by traversing them sequentially, predicting an atom, then its connected bond, then the next atom, and so on. This mimics how a human chemist would trace through a structure and allows the model to use previously predicted atoms and bonds as context for subsequent predictions.</p>
<p>Formally, the model output sequence for image $I_m$ is generated as:</p>
<p>$$ R_m = \text{concat}(CoT_m, S_m) $$</p>
<p>where $CoT_m$ represents the deterministic graph traversal steps (atoms and bonds) and $S_m$ is the final SMILES representation. This intermediate reasoning step makes the model more interpretable and helps it learn the structural logic of molecules.</p>
</li>
<li>
<p><strong>&ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; Principle</strong>: This addresses the abbreviation problem head-on. The authors correct the ground-truth annotations to match what&rsquo;s actually visible in the image.</p>
<p>They treat abbreviations like &ldquo;Ph&rdquo; as single &ldquo;superatoms&rdquo; and build a pipeline to automatically detect and correct training data. Using OCR to extract visible text from molecular images, they replace the corresponding expanded substructures in the ground-truth with the appropriate abbreviation tokens. This ensures the supervision signal is consistent with the visual input.</p>
</li>
<li>
<p><strong>Large-Scale Dataset (GTR-1.3M)</strong>: To support this approach, the authors created a large-scale dataset combining 1M synthetic molecules from PubChem with 351K corrected real-world patent images from USPTO. The key innovation is the correction pipeline that identifies abbreviations in patent images and fixes the inconsistent ground-truth labels.</p>
</li>
<li>
<p><strong>GRPO for Hand-Drawn OCSR</strong>: Hand-drawn molecular data lacks fine-grained atom/bond coordinate annotations, making SFT-based graph parsing inapplicable. The authors use Group Relative Policy Optimization (GRPO) with a composite reward function that combines format, SMILES, and graph-level rewards. The graph reward computes the maximum common subgraph (MCS) between predicted and ground-truth molecular graphs:</p>
<p>$$ R_{\text{graph}} = \frac{|N_m^a|}{|N_g^a| + |N_p^a|} + \frac{|N_m^b|}{|N_g^b| + |N_p^b|} $$</p>
<p>where $N_m^a$, $N_g^a$, $N_p^a$ are atom counts in the MCS, ground truth, and prediction, and $N_m^b$, $N_g^b$, $N_p^b$ are the corresponding bond counts.</p>
</li>
<li>
<p><strong>Two-Stage Training</strong>: Stage 1 performs SFT on GTR-1.3M for printed molecule recognition. Stage 2 applies GRPO on a mixture of printed data (GTR-USPTO-4K) and hand-drawn data (DECIMER-HD-Train, 4,070 samples) to extend capabilities to hand-drawn structures.</p>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: Traditional SMILES-based evaluation fails for molecules with abbreviations because canonicalization breaks down. The authors created a new benchmark that evaluates graph structure directly, providing three metrics: direct SMILES generation, graph-derived SMILES, and exact graph matching.</p>
</li>
</ol>
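<p>The GRPO graph reward above reduces to simple arithmetic once the MCS has been computed. Here is a minimal Python sketch that assumes the atom and bond counts of the maximum common subgraph are already available (in practice they would come from a cheminformatics toolkit); only the reward formula itself is shown.</p>

```python
def graph_reward(mcs_atoms, gt_atoms, pred_atoms, mcs_bonds, gt_bonds, pred_bonds):
    """Graph-level reward from maximum-common-subgraph (MCS) overlap.

    Each term reaches 1/2 when the prediction matches the ground truth
    exactly (the MCS then equals both graphs), so the reward peaks at 1.0.
    """
    atom_term = mcs_atoms / (gt_atoms + pred_atoms)
    bond_term = mcs_bonds / (gt_bonds + pred_bonds)
    return atom_term + bond_term

# Perfect prediction: a benzene-like graph with 6 atoms and 6 bonds.
assert graph_reward(6, 6, 6, 6, 6, 6) == 1.0
```

<p>Because partial overlap still earns partial credit, the reward gives a useful gradient signal even when the predicted SMILES string is not an exact match.</p>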
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The evaluation focused on demonstrating that GTR-VL&rsquo;s design principles solve real problems that plague existing OCSR systems:</p>
<ol>
<li>
<p><strong>Comprehensive Baseline Comparison</strong>: GTR-VL was tested against three categories of models:</p>
<ul>
<li><strong>Specialist OCSR systems</strong>: MolScribe and MolNexTR</li>
<li><strong>Chemistry-focused VLMs</strong>: ChemVLM, ChemDFM-X, OCSU</li>
<li><strong>General-purpose VLMs</strong>: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: The new benchmark includes two subsets of patent images:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 standard patent images similar to existing benchmarks</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<p>This design directly tests whether models can handle the abbreviation problem that breaks existing systems.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Systematic experiments isolated the contribution of key design choices:</p>
<ul>
<li><strong>Chain-of-Thought vs. Direct</strong>: Comparing graph traversal CoT against direct SMILES prediction</li>
<li><strong>Traversal Strategy</strong>: Graph traversal vs. the traditional &ldquo;atoms-then-bonds&rdquo; approach</li>
<li><strong>Dataset Quality</strong>: Training on corrected vs. uncorrected data</li>
</ul>
</li>
<li>
<p><strong>Retraining Experiments</strong>: Existing specialist models (MolScribe, MolNexTR) were retrained from scratch on the corrected GTR-1.3M dataset to isolate the effect of data quality from architectural improvements.</p>
</li>
<li>
<p><strong>Hand-Drawn OCSR Evaluation</strong>: GTR-VL was also evaluated on the DECIMER Hand-drawn test set and ChemPix dataset, comparing against DECIMER and AtomLenz+EditKT baselines.</p>
</li>
<li>
<p><strong>Qualitative Analysis</strong>: Visual inspection of predictions on challenging cases with heavy abbreviation usage, complex structures, and edge cases to understand failure modes.</p>
</li>
</ol>
<h2 id="results--conclusions-resolving-the-abbreviation-bottleneck">Results &amp; Conclusions: Resolving the Abbreviation Bottleneck</h2>
<ul>
<li>
<p><strong>Performance Gains on Abbreviations</strong>: On MolRec-Abb, GTR-VL-Stage1 achieves 85.49% Graph accuracy, while the original checkpoints of specialist models such as MolScribe and MolNexTR drop below 20% when abbreviations are present. On MolRec-USPTO, GTR-VL-Stage1 reaches 93.45% Graph accuracy.</p>
</li>
<li>
<p><strong>Data Correction is Critical</strong>: When MolScribe and MolNexTR were retrained on GTR-1.3M, their MolRec-Abb Graph accuracy jumped from around 20% to 70.60% and 71.85% respectively. GTR-VL-Stage1 still outperformed these retrained baselines at 85.49%, confirming that both data correction and the graph traversal approach contribute.</p>
</li>
<li>
<p><strong>Chain-of-Thought Helps</strong>: Ablation on GTR-USPTO-351K shows that CoT yields 68.85% Gen-SMILES vs. 66.54% without CoT, a 2.31 percentage point improvement.</p>
</li>
<li>
<p><strong>Graph Traversal Beats Traditional Parsing</strong>: Graph traversal achieves 83.26% Graph accuracy vs. 80.15% for the atoms-then-bonds approach, and 81.88% vs. 79.02% on Gra-SMILES.</p>
</li>
<li>
<p><strong>General VLMs Still Struggle</strong>: General-purpose VLMs like GPT-4o scored near 0% on MolRec-Bench across all metrics, highlighting the importance of domain-specific training for OCSR.</p>
</li>
<li>
<p><strong>Hand-Drawn Recognition via GRPO</strong>: GTR-VL-Stage1 (SFT only) achieves only 9.53% Graph accuracy on DECIMER-HD-Test, but after GRPO training in Stage 2, performance jumps to 75.44%. On ChemPix, Graph accuracy rises from 22.02% to 86.13%. The graph reward is essential: GRPO without graph supervision achieves only 11.00% SMILES on DECIMER-HD-Test, while adding graph reward reaches 75.64%.</p>
</li>
<li>
<p><strong>Evaluation Methodology Matters</strong>: The new graph-based evaluation metrics revealed problems with traditional SMILES-based evaluation that previous work had missed. Many &ldquo;failures&rdquo; in existing benchmarks were actually correct graph predictions that got marked wrong due to canonicalization issues with abbreviations.</p>
</li>
</ul>
<p>The work establishes that addressing the abbreviation problem requires both correcting the training data and rethinking the model architecture. The combination of faithful data annotation and sequential graph generation improves OCSR performance on molecules with abbreviations by a large margin over previous methods.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Base Model</strong>: GTR-VL fine-tunes <strong>Qwen2.5-VL</strong>.</p>
<p><strong>Input/Output Mechanism</strong>:</p>
<ul>
<li><strong>Input</strong>: The model takes an image $I_m$ and a text prompt</li>
<li><strong>Output</strong>: The model generates $R_m = \text{concat}(CoT_m, S_m)$, where it first produces the Chain-of-Thought (the graph traversal steps) followed immediately by the final SMILES string</li>
<li><strong>Traversal Strategy</strong>: Uses <strong>depth-first traversal</strong> to alternately predict atoms and bonds</li>
</ul>
<p><strong>Prompt Structure</strong>: The model is prompted to &ldquo;list the types of atomic elements&hellip; the coordinates&hellip; and the chemical bonds&hellip; then&hellip; output a canonical SMILES&rdquo;. The CoT output is formatted as a JSON list of atoms (with coordinates) and bonds (with indices referring to previous atoms), interleaved.</p>
<h3 id="data">Data</h3>
<p><strong>Training Dataset (GTR-1.3M)</strong>:</p>
<ul>
<li><strong>Synthetic Component</strong>: 1 million molecular SMILES from PubChem, converted to images using Indigo</li>
<li><strong>Real Component</strong>: 351,000 samples from USPTO patents (filtered from an original 680,000)
<ul>
<li>Processed using an OCR pipeline to detect abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Et&rdquo;)</li>
<li>Ground truth expanded structures replaced with superatoms to match visible abbreviations in images</li>
<li>This &ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; correction ensures training supervision matches visual input</li>
</ul>
</li>
</ul>
<p><strong>Evaluation Dataset (MolRec-Bench)</strong>:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 molecular images from USPTO patents</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Graph Traversal Algorithm</strong>:</p>
<ul>
<li>Depth-first traversal strategy</li>
<li>Alternating atom-bond prediction sequence</li>
<li>Each step uses previously predicted atoms and bonds as context</li>
</ul>
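<p>The traversal can be sketched as a standard depth-first search that emits tokens in atom, bond, atom, bond order. This is an illustrative Python reconstruction, not the paper's code: the adjacency representation and token tuples here are assumptions, and the actual model emits a JSON list that also carries atom coordinates.</p>

```python
def traverse(atoms, bonds, start=0):
    """Depth-first traversal emitting an alternating atom/bond sequence.

    `atoms` maps index -> element symbol; `bonds` maps (i, j) -> bond type
    (stored once per undirected edge). Each new atom is emitted right after
    the bond that reaches it, mirroring how a chemist traces a structure.
    """
    adj = {}
    for (i, j), order in bonds.items():
        adj.setdefault(i, []).append((j, order))
        adj.setdefault(j, []).append((i, order))
    seq, seen, stack = [("atom", start, atoms[start])], {start}, [start]
    while stack:
        node = stack.pop()
        for nbr, order in sorted(adj.get(node, [])):
            if nbr not in seen:
                seen.add(nbr)
                seq.append(("bond", (node, nbr), order))
                seq.append(("atom", nbr, atoms[nbr]))
                stack.append(nbr)
    return seq

# Ethanol skeleton: C-C-O
tokens = traverse({0: "C", 1: "C", 2: "O"}, {(0, 1): "single", (1, 2): "single"})
```

<p>Each bond token names the atom it extends from, so every prediction step can condition on the partial graph built so far.</p>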
<p><strong>Two-Stage Training</strong>:</p>
<ul>
<li><strong>Stage 1 (SFT)</strong>: Train on GTR-1.3M to learn visual CoT mechanism for printed molecules (produces GTR-VL-Stage1)</li>
<li><strong>Stage 2 (GRPO)</strong>: Apply GRPO on GTR-USPTO-4K + DECIMER-HD-Train (4,070 samples) for hand-drawn recognition (produces GTR-VL-Stage2, i.e., GTR-VL)</li>
</ul>
<p><strong>Training Procedure</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate (SFT)</strong>: Peak learning rate of $1.6 \times 10^{-4}$ with cosine decay</li>
<li><strong>Learning Rate (GRPO)</strong>: Peak learning rate of $1 \times 10^{-5}$ with cosine decay</li>
<li><strong>Warm-up</strong>: Linear warm-up for the first 10% of iterations</li>
<li><strong>Batch Size (SFT)</strong>: 2 per GPU with gradient accumulation over 16 steps, yielding <strong>effective batch size of 1024</strong></li>
<li><strong>Batch Size (GRPO)</strong>: 4 per GPU with gradient accumulation of 1, yielding <strong>effective batch size of 128</strong></li>
</ul>
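<p>A sketch of the schedule described above: linear warm-up over the first 10% of iterations followed by cosine decay. The decay-to-zero floor is an assumption; the paper specifies the peak rates and schedule shapes but not a final learning-rate floor.</p>

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.10):
    """Linear warm-up for the first `warmup_frac` of steps, then cosine decay.

    Assumes decay to zero at `total_steps`; the true floor is unspecified.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# SFT peak of 1.6e-4: rises linearly, peaks at the end of warm-up, decays to ~0.
assert lr_at(0, 1000, 1.6e-4) == 0.0
```
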
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong> (three complementary measures to handle abbreviation issues):</p>
<ul>
<li><strong>Gen-SMILES</strong>: Exact match ratio of SMILES strings directly generated by the VLM (image-captioning style)</li>
<li><strong>Gra-SMILES</strong>: Exact match ratio of SMILES strings derived from the predicted graph structure (graph-parsing style)</li>
<li><strong>Graph</strong>: Exact match ratio between ground truth and predicted graphs (node/edge comparison, bypassing SMILES canonicalization issues)</li>
</ul>
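<p>The Graph metric can be illustrated with a simple labeled-graph comparison. This sketch assumes both graphs use a consistent atom indexing; a faithful implementation would test graph isomorphism. It shows why comparing nodes and edges directly sidesteps SMILES canonicalization entirely.</p>

```python
def graphs_match(atoms_a, bonds_a, atoms_b, bonds_b):
    """Exact match between two labeled molecular graphs, assuming a shared
    atom indexing. Bonds are normalized to undirected edges, so (0, 1) and
    (1, 0) refer to the same bond."""
    norm = lambda bonds: {(min(i, j), max(i, j), t) for (i, j), t in bonds.items()}
    return atoms_a == atoms_b and norm(bonds_a) == norm(bonds_b)

# Same molecule with bond direction flipped still matches.
assert graphs_match({0: "C", 1: "O"}, {(0, 1): "single"},
                    {0: "C", 1: "O"}, {(1, 0): "single"})
```
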
<p><strong>Baselines Compared</strong>:</p>
<ul>
<li>Specialist OCSR systems: MolScribe, MolNexTR</li>
<li>Chemistry-focused VLMs: ChemVLM, ChemDFM-X, OCSU</li>
<li>General-purpose VLMs: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute</strong>: Training performed on <strong>32 NVIDIA A100 GPUs</strong></p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Status</strong>: Closed. As of the paper&rsquo;s publication, no source code, pre-trained model weights, or dataset downloads (GTR-1.3M, MolRec-Bench) have been publicly released. The paper does not mention plans for open-source release. The training data pipeline relies on PubChem SMILES (public), USPTO patent images (publicly available through prior work), the Indigo rendering tool (open-source), and an unspecified OCR system for abbreviation detection. Without the released code and data corrections, reproducing the full pipeline would require substantial re-implementation effort.</p>
]]></content:encoded></item><item><title>ChemDFM-R: Chemical Reasoning LLM with Atomized Knowledge</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-r/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-r/</guid><description>A 14B-parameter chemical reasoning LLM enhanced with atomized functional group knowledge and mix-sourced distillation strategy.</description><content:encoded><![CDATA[<h2 id="method-and-resource-contributions">Method and Resource Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper with significant <strong>Resource</strong> contributions.</p>
<ul>
<li><strong>Methodological Basis</strong>: The paper introduces a training pipeline (&ldquo;mix-sourced distillation&rdquo;) and domain-specific reinforcement learning to improve reasoning capabilities in chemical LLMs. It validates the approach through ablation studies across training stages.</li>
<li><strong>Resource Contribution</strong>: The authors constructed <strong>ChemFG</strong>, a 101 billion-token corpus annotated with &ldquo;atomized&rdquo; knowledge regarding functional groups and reaction centers.</li>
</ul>
<h2 id="bridging-the-chemical-reasoning-gap">Bridging the Chemical Reasoning Gap</h2>
<p>Current chemical LLMs struggle to reason logically for two main reasons:</p>
<ol>
<li><strong>Shallow Domain Understanding</strong>: Models generally learn molecule-level properties directly, bypassing the intermediate &ldquo;atomized&rdquo; characteristics (e.g., <a href="https://en.wikipedia.org/wiki/Functional_group">functional groups</a>) that ultimately dictate chemical behavior.</li>
<li><strong>Specialized Reasoning Logic</strong>: Chemical logic differs fundamentally from math or code. Distilling reasoning from general teacher models like DeepSeek-R1 frequently fails because the teachers lack the domain intuition required to generate valid chemical rationales.</li>
</ol>
<h2 id="atomized-knowledge-and-mixed-source-distillation">Atomized Knowledge and Mixed-Source Distillation</h2>
<p>The authors introduce three structural innovations to solve the reasoning gap:</p>
<ol>
<li><strong>Atomized Knowledge Enhancement (ChemFG)</strong>: A toolkit was built leveraging SMARTS notations to identify functional group changes during reactions. A critique of this approach is that it relies heavily on 2D cheminformatics abstractions, potentially missing deeper 3D stereochemical interactions.</li>
<li><strong>Mix-Sourced Distillation</strong>: General models (DeepSeek-R1/o3-mini) are fed &ldquo;pseudo-reasoning&rdquo; prompts that include ground truth answers and functional group data. While this forces the teacher to generate high-quality rationales for the student to learn, it introduces a layer of hindsight bias into the generated reasoning chains. During inference, the student model lacks both the pre-calculated functional group metadata and the ground truth, forcing it to bridge an artificially steep generalization gap.</li>
<li><strong>Chemical Reinforcement Learning</strong>: The intermediate model undergoes domain-specific reinforcement learning. The RL details are described in the paper&rsquo;s Appendix D, with the authors citing the open-source DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) framework. The optimization relies on rule-based rewards (format adherence and canonicalized <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> accuracy) across a variety of chemical tasks.</li>
</ol>
<h2 id="benchmark-evaluation-and-ablation-studies">Benchmark Evaluation and Ablation Studies</h2>
<p>The model was evaluated on comprehensive chemical benchmarks: <strong>SciKnowEval</strong> (19 tasks) and <strong><a href="/notes/chemistry/llm-applications/chemeval-multilevel-chemical-evaluation/">ChemEval</a></strong> (36 tasks).</p>
<ul>
<li><strong>Baselines</strong>: Compared against similarly sized open models (Qwen2.5-14B-Instruct, Qwen3-14B), domain models (<a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM</a>, MolInst), and frontier models (GPT-4o, DeepSeek-R1).</li>
<li><strong>Ablation</strong>: Evaluated across training stages (Base → ChemDFM-I → ChemDFM-R) to measure the specific impact of the instruction tuning versus the reasoning stages.</li>
<li><strong>Qualitative Analysis</strong>: The paper includes case studies demonstrating the model&rsquo;s step-by-step chemical reasoning and its potential for human-AI collaboration (Sections 4.2 and 4.3).</li>
</ul>
<h2 id="performance-outcomes-and-numerical-limitations">Performance Outcomes and Numerical Limitations</h2>
<ul>
<li><strong>Performance vs. Baselines</strong>: ChemDFM-R outperforms similarly sized open models and domain models on molecule-centric and reaction-centric tasks, and surpasses the much larger DeepSeek-R1 on ChemEval (0.78 vs. 0.58 overall). It shows competitive results relative to o4-mini, though o4-mini leads on SciKnowEval (0.74 vs. 0.70).</li>
<li><strong>Reasoning Interactivity</strong>: The model generates readable rationales that allow users to catch structural errors or identify reaction mechanisms accurately. Section 4.3 of the paper demonstrates human-AI collaboration scenarios.</li>
<li><strong>Quantitative Limitations</strong>: The model struggles with tasks involving numerical prediction and calculation (e.g., yield extraction, molecular property calculation). The paper notes that all molecule-centric and reaction-centric tasks where ChemDFM-R falls short of Qwen2.5-14B-Instruct involve numerical reasoning.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is constructed in three phases:</p>
<p><strong>1. Domain Pre-training (ChemFG)</strong>:</p>
<ul>
<li><strong>Size</strong>: 101 billion tokens</li>
<li><strong>Composition</strong>:
<ul>
<li>12M literature documents (79B tokens)</li>
<li>30M molecules from PubChem/PubChemQC</li>
<li>7M reactions from USPTO-FULL</li>
</ul>
</li>
<li><strong>Augmentation</strong>: SMILES augmentation (10x) using R-SMILES</li>
<li><strong>Atomized Features</strong>: Annotated with a custom &ldquo;Functional Group Identification Toolkit&rdquo; that identifies 241 functional group types and tracks changes in reaction centers. <em>Note: Data and toolkit are partially reproduced; while the toolkit (<a href="https://github.com/OpenDFM/ChemFG-Tool">ChemFG-Tool</a>) was open-sourced on GitHub, the 101 billion-token ChemFG dataset itself has not been publicly released.</em></li>
</ul>
<p><strong>2. Instruction Tuning</strong>:</p>
<ul>
<li><strong>Sources</strong>: Molecule-centric (<a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>), Reaction-centric (USPTO), and Knowledge-centric (Exams, Literature QA) tasks</li>
<li><strong>Mixing</strong>: Mixed with general instruction data in a 1:2 ratio</li>
</ul>
<p><strong>3. Distillation Dataset</strong>:</p>
<ul>
<li><strong>Sources</strong>:
<ul>
<li>~70% ChemDFM-R instruction data</li>
<li>~22% constructed pseudo-reasoning (functional group descriptions)</li>
<li>~8% teacher rationales (from DeepSeek-R1/o3-mini)</li>
</ul>
</li>
<li><strong>Mixing</strong>: Mixed with general data (including AM-Deepseek-R1-Distill-1.4M) in a 1:2 ratio</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Identification</strong>:</p>
<ul>
<li>Extends the <code>thermo</code> library&rsquo;s SMARTS list</li>
<li>For reactions, identifies &ldquo;reacting functional groups&rdquo; by finding reactants containing atoms involved in bond changes (reaction centers) that do not appear in the product</li>
</ul>
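<p>The reacting-group logic reduces to set operations once SMARTS matching has assigned atoms to groups. This is a set-based sketch under that assumption; the real toolkit does the pattern matching with a cheminformatics library, and the group names and indices below are illustrative.</p>

```python
def reacting_groups(groups, center_atoms, product_groups):
    """Identify 'reacting functional groups': groups in a reactant that
    contain at least one reaction-center atom (an atom whose bonds change)
    and that do not survive into the product.

    `groups` maps a group name to the set of reactant atom indices it covers;
    `product_groups` is the set of group names present in the product.
    """
    return {name for name, atoms in groups.items()
            if atoms & center_atoms and name not in product_groups}

# An ester whose carbonyl carbon sits in the reaction center, absent from the product.
hit = reacting_groups({"ester": {3, 4, 5}, "phenyl": {7, 8}}, {4}, {"phenyl"})
```
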
<p><strong>Mix-Sourced Distillation</strong>:</p>
<ul>
<li>Teacher models (DeepSeek-R1, o3-mini) are prompted with Question + Ground Truth + Functional Group Info to generate high-quality &ldquo;Thoughts&rdquo;</li>
<li>These rationales are distilled into the student model using a supervised fine-tuning loss across target tokens $y_t$:
$$ \mathcal{L}_{\text{SFT}} = - \sum_{t=1}^T \log P_\theta(y_t \mid x, y_{&lt;t}) $$</li>
</ul>
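<p>The SFT objective above is standard token-level negative log-likelihood. A minimal sketch, taking the model's probability for each ground-truth token given its prefix as input (a real implementation computes these from logits):</p>

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood summed over target tokens y_t, matching
    L_SFT = -sum_t log P(y_t | x, y_<t). `token_probs` holds the model's
    probability assigned to each ground-truth token given its prefix."""
    return -sum(math.log(p) for p in token_probs)

# A perfectly confident model incurs zero loss.
assert sft_loss([1.0, 1.0]) == 0.0
assert sft_loss([0.5]) == -math.log(0.5)
```
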
<p><strong>Reinforcement Learning</strong>:</p>
<ul>
<li><strong>Algorithm</strong>: The paper cites DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) as the RL framework; full details are in Appendix D of the paper. <em>Note: While the underlying DAPO framework is open-source, the specific chemistry-oriented RL pipeline and environment used for ChemDFM-R has not been publicly released.</em></li>
<li><strong>Hyperparameters</strong> (from paper appendix): Learning rate <code>5e-7</code>, rollout batch size <code>512</code>, training batch size <code>128</code></li>
<li><strong>Rewards</strong>: The reward system applies rule-based constraints focusing on physical form and chemical validity. The total reward $R(y, y^*)$ for a generated response $y$ given target $y^*$ combines a format adherence reward ($R_{\text{format}}$) and an accuracy reward ($R_{\text{acc}}$) evaluated on canonicalized SMILES:
$$ R(y, y^*) = R_{\text{format}}(y) + R_{\text{acc}}(\text{canonicalize}(y), \text{canonicalize}(y^*)) $$</li>
</ul>
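<p>A sketch of the rule-based reward. Binary 0/1 components are an assumption (the exact magnitudes are in the paper's Appendix D), and `canonicalize` and `format_ok` are stand-ins: a real pipeline would use a cheminformatics toolkit such as RDKit for SMILES canonicalization and a parser for the required output format.</p>

```python
def total_reward(response, target, canonicalize, format_ok):
    """R(y, y*) = R_format(y) + R_acc(canon(y), canon(y*)).

    Format adherence and canonicalized-SMILES accuracy each contribute a
    binary reward in this sketch.
    """
    r_format = 1.0 if format_ok(response) else 0.0
    r_acc = 1.0 if canonicalize(response) == canonicalize(target) else 0.0
    return r_format + r_acc

# Toy stand-ins: uppercasing as "canonicalization", non-empty as format check.
reward = total_reward("cco", "CCO", canonicalize=str.upper, format_ok=bool)
```
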
<h3 id="models">Models</h3>
<ul>
<li><strong>Base Model</strong>: Qwen2.5-14B</li>
<li><strong>ChemDFM-I</strong>: Result of instruction tuning the domain-pretrained model for 2 epochs</li>
<li><strong>ChemDFM-R</strong>: Result of applying mix-sourced distillation (1 epoch) followed by RL on ChemDFM-I. <em>Note: Model weights are publicly available on <a href="https://huggingface.co/OpenDFM/ChemDFM-R-14B">Hugging Face</a>.</em></li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware and training time details are described in the paper&rsquo;s appendices, which are not available in the extracted text. The details below are reported from the paper but could not be independently cross-verified against the main text:</p>
<ul>
<li><strong>Compute</strong>: NVIDIA A800 Tensor Core GPUs</li>
<li><strong>Training Time</strong>: 30,840 GPU hours total (Domain Pretraining: 24,728 hours; Instruction Tuning: 3,785 hours; Distillation: 2,059 hours; Reinforcement Learning: 268 hours)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>SciKnowEval</strong>: 19 tasks (text-centric, molecule-centric, reaction-centric)</li>
<li><strong>ChemEval</strong>: 36 tasks, categorized similarly</li>
</ul>
<p><strong>Key Metrics</strong>: Accuracy, F1 Score, BLEU score (with PRS normalization for ChemEval)</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>SciKnowEval (all)</th>
          <th>ChemEval* (all)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen2.5-14B-Instruct</td>
          <td>0.61</td>
          <td>0.57</td>
          <td>General-domain baseline</td>
      </tr>
      <tr>
          <td>ChemDFM-I</td>
          <td>0.69</td>
          <td>0.72</td>
          <td>After domain pretraining + instruction tuning</td>
      </tr>
      <tr>
          <td>ChemDFM-R</td>
          <td><strong>0.70</strong></td>
          <td><strong>0.78</strong></td>
          <td>After distillation + RL</td>
      </tr>
      <tr>
          <td>DeepSeek-R1</td>
          <td>0.62</td>
          <td>0.58</td>
          <td>General-domain reasoning model</td>
      </tr>
      <tr>
          <td>o4-mini</td>
          <td><strong>0.74</strong></td>
          <td>0.69</td>
          <td>Frontier reasoning model</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/OpenDFM/ChemDFM-R-14B">ChemDFM-R-14B</a></td>
          <td>Model</td>
          <td>AGPL-3.0</td>
          <td>Final reasoning model weights on Hugging Face</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OpenDFM/ChemFG-Tool">ChemFG-Tool</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Functional group identification toolkit (241 groups)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components</strong>: The 101B-token ChemFG pretraining dataset is not publicly released. The chemistry-oriented RL pipeline and training code are not open-sourced. The instruction tuning and distillation datasets are not available.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhao, Z., Chen, B., Wan, Z., Chen, L., Lin, X., Yu, S., Zhang, S., Ma, D., Zhu, Z., Zhang, D., Wang, H., Dai, Z., Wen, L., Chen, X., &amp; Yu, K. (2025). ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge. <em>arXiv preprint arXiv:2507.21990</em>. <a href="https://doi.org/10.48550/arXiv.2507.21990">https://doi.org/10.48550/arXiv.2507.21990</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{zhao2025chemdfmr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zihan Zhao and Bo Chen and Ziping Wan and Lu Chen and Xuanze Lin and Shiyang Yu and Situo Zhang and Da Ma and Zichen Zhu and Danyang Zhang and Huayang Wang and Zhongyang Dai and Liyang Wen and Xin Chen and Kai Yu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2507.21990}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2507.21990}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GP-MoLFormer: Molecular Generation via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/</guid><description>A 46.8M parameter transformer for molecular generation trained on 1.1B SMILES, introducing pair-tuning for efficient property optimization.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-focus">Contribution and Taxonomic Focus</h2>
<p>This is primarily a <strong>Methodological</strong> paper, as it proposes a specific neural architecture (GP-MoLFormer) and a novel fine-tuning algorithm (Pair-tuning) for molecular generation. It validates these contributions against standard baselines (e.g., JT-VAE, <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b).</p>
<p>It also contains a secondary <strong>Theoretical</strong> contribution by establishing an empirical <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">scaling law</a> that relates inference compute (generation size) to the novelty of the generated molecules.</p>
<h2 id="motivation-data-scale-and-prompt-based-optimization">Motivation: Data Scale and Prompt-Based Optimization</h2>
<p>While large language models (LLMs) have transformed text generation, the impact of training data scale and memorization on <em>molecular</em> generative models remains under-explored. Specifically, there is a need to understand how training on billion-scale datasets affects the novelty of generated molecules and whether biases in public databases (like ZINC and PubChem) perpetuate memorization. Furthermore, existing optimization methods often require computationally expensive property predictors or reinforcement learning loops; there is a practical need for more efficient &ldquo;prompt-based&rdquo; optimization techniques.</p>
<h2 id="core-innovations-architecture-and-pair-tuning">Core Innovations: Architecture and Pair-Tuning</h2>
<ol>
<li><strong>Architecture</strong>: The application of a linear-attention transformer decoder with Rotary Positional Embeddings (RoPE) to generative chemistry, allowing for efficient training on 1.1 billion SMILES.</li>
<li><strong>Pair-Tuning</strong>: A novel, parameter-efficient fine-tuning method that uses property-ordered molecular pairs to learn &ldquo;soft prompts&rdquo; for optimization without updating the base model weights.</li>
<li><strong>Scaling Analysis</strong>: An extensive empirical investigation mapping the trade-off between inference compute (up to 10B generations) and chemical novelty, fitting an exponential decay curve that demonstrates how novelty declines as generation volume grows.</li>
</ol>
<h2 id="experimental-methodology-and-downstream-tasks">Experimental Methodology and Downstream Tasks</h2>
<p>The authors evaluated GP-MoLFormer on three distinct tasks, though the comparisons highlight the difficulty of evaluating foundation models against classical baselines:</p>
<ol>
<li><strong>De Novo Generation</strong>: Comparing validity, uniqueness, and novelty against baselines (CharRNN, VAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a>, MolGen-7b) on a held-out test set. Notably, this is an unequal comparison; most baselines were trained on the 1.6M molecule <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> dataset, whereas GP-MoLFormer uses up to 1.1B molecules, meaning performance gains are heavily driven by data scale.</li>
<li><strong>Scaffold-Constrained Decoration</strong>: Generating molecules from DRD2 active binder scaffolds and measuring the hit rate of active compounds against specialized scaffold decorators.</li>
<li><strong>Property-Guided Optimization</strong>: Using Pair-tuning to optimize for Drug-likeness (QED), Penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">logP</a>, and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> binding activity, comparing the results to graph-based and reinforcement learning benchmarks.</li>
</ol>
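<p>As a sketch of how the distribution-level generation metrics in the first task are computed, the snippet below derives validity, uniqueness, and novelty from plain string sets. The <code>is_valid</code> check is a placeholder assumption; in practice the MOSES suite defines validity via RDKit parsing.</p>

```python
def is_valid(smiles: str) -> bool:
    # Placeholder validity check: non-empty string.
    # A real implementation would attempt an RDKit parse.
    return bool(smiles)

def generation_metrics(generated, training_set):
    """Return (validity, uniqueness, novelty) as fractions in [0, 1]."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    unique = set(valid)                      # uniqueness is over valid outputs
    uniqueness = len(unique) / len(valid)
    novel = unique - set(training_set)       # novelty is over unique outputs
    novelty = len(novel) / len(unique)
    return validity, uniqueness, novelty

gen = ["CCO", "CCO", "c1ccccc1", "CCN", ""]
train = {"CCO", "CCN"}
print(generation_metrics(gen, train))  # → (0.8, 0.75, 0.3333333333333333)
```

All three quantities are computed over nested subsets (valid ⊇ unique ⊇ novel), which is why a model can score high validity yet low novelty, as reported for GP-MoLFormer.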
<p>Additionally, they performed a <strong>Scaling Study</strong>:</p>
<ul>
<li>Comparing models trained on raw (1.1B) vs. de-duplicated (650M) data.</li>
<li>Generating up to 10 billion molecules to fit empirical scaling laws for novelty.</li>
</ul>
<h2 id="key-findings-and-scaling-laws">Key Findings and Scaling Laws</h2>
<ul>
<li><strong>Scale-Driven Performance</strong>: GP-MoLFormer achieves high internal diversity and validity on generation metrics. However, its baseline novelty percentage (~32%) is considerably lower than classical models. The authors attribute this to the massive training scale forcing the model to heavily prioritize matching real-world molecule frequencies over pure exploration. GP-MoLFormer&rsquo;s advantage in generation metrics over LLM baselines like <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b likely stems heavily from its 10x larger training dataset rather than fundamental architectural superiority.</li>
<li><strong>Pair-Tuning Efficacy</strong>: The proposed pair-tuning method effectively optimizes properties (e.g., improving DRD2 activity scores) without requiring full model fine-tuning or external reward loops. While successful, the text-based generation yields ~94.5% validity during optimization, which lags behind graph and SELFIES-based baselines that guarantee 100% structural validity.</li>
<li><strong>Memorization vs. Novelty</strong>: Training on de-duplicated data (GP-MoLFormer-UNIQ) yields higher novelty (approx. 5-8% higher) than training on raw data, confirming that duplication bias in public databases leads directly to memorization.</li>
<li><strong>Inference Scaling Law</strong>: Novelty decays exponentially with generation size ($y = ae^{-bx}$), yet the model maintains generative capability (~16.7% novelty) even after generating an unprecedented 10 billion molecules.</li>
</ul>
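<p>The novelty scaling law $y = ae^{-bx}$ can be fit by ordinary least squares after log-linearizing ($\log y = \log a - bx$). The sketch below uses synthetic data lying exactly on such a curve, not the paper&rsquo;s measurements:</p>

```python
import math

def fit_exp_decay(xs, ys):
    """Fit y = a * exp(-b * x) via least squares on log y = log a - b x."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(logs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (l - my) for x, l in zip(xs, logs))
    slope = sxy / sxx             # equals -b
    intercept = my - slope * mx   # equals log a
    return math.exp(intercept), -slope

# Synthetic illustration: novelty fractions at increasing generation volume.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.32 * math.exp(-0.1 * x) for x in xs]
a, b = fit_exp_decay(xs, ys)
print(round(a, 3), round(b, 3))  # → 0.32 0.1
```

On noisy real data, fitting in log space down-weights large-novelty points; a nonlinear least-squares fit (e.g. <code>scipy.optimize.curve_fit</code>) would weight residuals directly.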
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Sources</strong>: A combination of <strong>PubChem</strong> (111M SMILES) and <strong>ZINC</strong> (1B SMILES) databases. Downloading and pre-training instructions are located in the repository&rsquo;s <code>data/README.md</code>.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>All SMILES were canonicalized using RDKit (no isomeric information).</li>
<li><strong>GP-MoLFormer (Base)</strong>: Trained on the full 1.1B dataset (includes duplicates).</li>
<li><strong>GP-MoLFormer-UNIQ</strong>: Trained on a de-duplicated subset of 650M SMILES.</li>
</ul>
</li>
<li><strong>Tokenization</strong>: Uses the tokenizer from Schwaller et al. (2019) with a vocabulary size of <strong>2,362 tokens</strong>.</li>
<li><strong>Filtering</strong>: Sequences restricted to a maximum length of <strong>202 tokens</strong>.</li>
</ul>
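<p>A minimal sketch of the de-duplication and length filtering described above, assuming canonicalization has already been done upstream with RDKit; the character-level <code>tokenize</code> is a stand-in for the Schwaller et al. regex tokenizer:</p>

```python
MAX_TOKENS = 202  # sequence-length cap reported in the paper

def tokenize(smiles: str):
    """Character-level stand-in for the Schwaller et al. SMILES tokenizer."""
    return list(smiles)

def preprocess(raw_smiles, deduplicate=True):
    """Length-filter (and optionally de-duplicate) a canonical SMILES corpus."""
    seen = set()
    kept = []
    for s in raw_smiles:
        if len(tokenize(s)) > MAX_TOKENS:
            continue  # drop over-length sequences
        if deduplicate:
            if s in seen:
                continue  # UNIQ variant: keep first occurrence only
            seen.add(s)
        kept.append(s)
    return kept

corpus = ["CCO", "CCO", "c1ccccc1", "C" * 300]
print(preprocess(corpus))                     # → ['CCO', 'c1ccccc1']
print(preprocess(corpus, deduplicate=False))  # → ['CCO', 'CCO', 'c1ccccc1']
```

The <code>deduplicate</code> flag mirrors the Base (1.1B, duplicates retained) vs. UNIQ (650M, de-duplicated) training variants.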
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pair-Tuning (Algorithm 1)</strong>:</p>
<ul>
<li><strong>Objective</strong>: Learn task-specific soft prompts $\phi_T$ to maximize the conditional probability of target molecule $b$ given a seed molecule $a$, where pair $(a, b)$ satisfies the property condition $b &gt; a$. The base model parameters $\theta$ remain frozen.</li>
<li><strong>Prompt Structure</strong>: Autoregressive training optimizes the continuous embeddings of $n$ enhancement tokens against the cross-entropy loss of the target sequence:
$$ \mathcal{L}(\phi_T) = - \sum_{i=1}^{|b|} \log P_{\theta}(b_i | \phi_T, a, b_{&lt;i}) $$</li>
<li><strong>Hyperparameters</strong>: Trained for 1,000 epochs with a batch size of 35 and a fixed learning rate of $3 \times 10^{-2}$.</li>
<li><strong>Inference</strong>: The learned prompt $\phi_T$ and seed molecule $a$ are prepended as context, and candidates are sampled autoregressively until a termination token is produced.</li>
</ul>
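<p>To make the objective concrete, the toy sketch below evaluates the pair-tuning loss with the frozen base model stubbed out as a uniform distribution over the 2,362-token vocabulary. <code>frozen_model_logprob</code> and its signature are illustrative assumptions, not the authors&rsquo; API; a real implementation would run the transformer decoder and backpropagate only into the prompt embeddings.</p>

```python
import math

VOCAB_SIZE = 2362  # tokenizer vocabulary size reported for GP-MoLFormer

def frozen_model_logprob(token, prompt, seed, prefix):
    """Stub for log P_theta(b_i | phi_T, a, b_<i) from the frozen base model.
    Here it ignores its context and returns a uniform log-probability."""
    return -math.log(VOCAB_SIZE)

def pair_tuning_loss(prompt, seed_smiles, target_smiles):
    """Cross-entropy of target molecule b given soft prompt + seed a (Eq. above)."""
    return -sum(
        frozen_model_logprob(tok, prompt, seed_smiles, target_smiles[:i])
        for i, tok in enumerate(target_smiles)
    )

# With a uniform stub, the loss reduces to |b| * log(VOCAB_SIZE).
loss = pair_tuning_loss(prompt=[0.0] * 8, seed_smiles="CCO", target_smiles="CCNO")
print(loss)
```

Only <code>prompt</code> would receive gradients during training; the stub makes explicit that $\theta$ never changes.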
<h3 id="models">Models</h3>
<ul>
<li><strong>Availability</strong>: The model trained on deduplicated data (GP-MoLFormer-UNIQ) is publicly available on <a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">Hugging Face</a>. The full 1.1B base model is not explicitly hosted. The source code repository includes a disclosure that IBM will not maintain the code going forward.</li>
<li><strong>Architecture</strong>: Transformer decoder (~47M parameters: 12 layers, 12 heads, hidden size 768).</li>
<li><strong>Attention Mechanism</strong>: Combines Linear Attention (Generalized Random Feature map, $\phi$) with Rotary Positional Embeddings (RoPE). To avoid the quadratic complexity of standard attention while maintaining relative positional awareness, RoPE is applied to queries ($Q$) and keys ($K$) prior to the random feature mapping:
$$ \text{Attention}(Q, K, V)_m = \frac{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle v_n}{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle} $$</li>
<li><strong>Inference Speed</strong>: ~3ms per forward pass on a single A100 GPU.</li>
</ul>
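<p>A self-contained sketch of the attention formula above, with $\mathrm{elu}(x)+1$ standing in for the paper&rsquo;s generalized random feature map $\phi$ (an assumption borrowed from standard linear attention) and a conventional rotary embedding for $R$; all dimensions and inputs are toy values.</p>

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply a rotary position embedding R_pos to a vector, pair by pair."""
    out = []
    for i in range(0, len(vec), 2):
        theta = pos / base ** (i / len(vec))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def phi(vec):
    """Positive feature map: elu(x)+1, a stand-in for random features."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_attention(Q, K, V):
    """out_m = sum_n <phi(R_m q_m), phi(R_n k_n)> v_n / sum_n <...>."""
    fK = [phi(rope(k, n)) for n, k in enumerate(K)]
    out = []
    for m, q in enumerate(Q):
        fq = phi(rope(q, m))
        weights = [dot(fq, fk) for fk in fK]  # positive, unnormalized
        z = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, V)) / z
                    for d in range(len(V[0]))])
    return out

Q = [[0.1, 0.2, -0.3, 0.4], [0.0, 1.0, 0.5, -0.5]]
K = [[0.2, 0.1, 0.0, -0.1], [0.3, -0.2, 0.4, 0.1]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(linear_attention(Q, K, V))  # each row is a convex combination of V rows
```

Because $\phi$ is strictly positive, the weights form a proper convex combination; in a causal decoder the sums over $n$ run only up to $m$ and can be accumulated incrementally, which is the source of the linear complexity.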
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Generation Quality Metrics</strong>: Validity, Uniqueness, Novelty (<a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> suite), <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance (FCD)</a>, Scaffold similarity (Scaf), and Similarity to Nearest Neighbor (SNN).</li>
<li><strong>MoLFormer-Based Metrics</strong>: The authors introduce Fréchet <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> Distance (FMD) and MoLFormer-space IntDiv2 to measure distributional similarity using their own pre-trained continuous embeddings instead of standard fingerprints.</li>
<li><strong>Optimization Metrics</strong>: Penalized logP (calculated as $\text{logP} - \text{SA} - \max(\text{maxRingSize} - 6, 0)$), Drug-likeness (QED), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> activity scores.</li>
<li><strong>Scaling Metrics</strong>: Empirical fit for novelty decay: $y = ae^{-bx}$.</li>
</ul>
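<p>For reference, the penalized logP combination reduces to a one-line function; its inputs (logP, synthetic accessibility score, largest ring size) would come from a cheminformatics toolkit such as RDKit:</p>

```python
def penalized_logp(logp: float, sa_score: float, max_ring_size: int) -> float:
    """Penalized logP = logP - SA - max(maxRingSize - 6, 0).
    Penalizes hard-to-synthesize molecules and rings larger than 6 atoms."""
    return logp - sa_score - max(max_ring_size - 6, 0)

print(penalized_logp(2.5, 3.0, 8))  # → -2.5  (ring penalty of 2 applies)
```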
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 16 x NVIDIA A100 (80 GB) GPUs across 2 nodes connected via EDR Infiniband.</li>
<li><strong>Training Time</strong>:
<ul>
<li>GP-MoLFormer (1.1B data): ~115 hours total (28.75 hours/epoch for 4 epochs).</li>
<li>GP-MoLFormer-UNIQ (650M data): ~80 hours total.</li>
</ul>
</li>
<li><strong>Hyperparameters</strong>: Used a batch size of 1,600 molecules per GPU with a fixed learning rate of $1.6 \times 10^{-4}$ (scaled by up to an $8\times$ factor as the number of GPUs increased).</li>
<li><strong>Optimization</strong>: Used distributed data-parallel training and adaptive bucketing by sequence length to handle scale.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/gp-molformer/">GP-MoLFormer (GitHub)</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation; IBM will not maintain going forward</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">GP-MoLFormer-Uniq (Hugging Face)</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained on 650M de-duplicated SMILES</td>
      </tr>
  </tbody>
</table>
<p>The full 1.1B base model weights are not publicly hosted. The training data (PubChem and ZINC) is publicly available, and instructions for downloading and pre-processing are in the repository&rsquo;s <code>data/README.md</code>.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Hoffman, S. C., Chenthamarakshan, V., Navratil, J., Mroueh, Y., &amp; Das, P. (2025). GP-MoLFormer: A Foundation Model For Molecular Generation. <em>Digital Discovery</em>, 4(10), 2684&ndash;2696. <a href="https://doi.org/10.1039/D5DD00122F">https://doi.org/10.1039/D5DD00122F</a></p>
<p><strong>Publication</strong>: Digital Discovery, vol. 4, no. 10, pp. 2684&ndash;2696 (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2025gpmolformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GP-MoLFormer: a foundation model for molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Hoffman, Samuel C and Chenthamarakshan, Vijil and Navratil, Jiri and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2684--2696}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D5DD00122F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-2: Scaling Molecular Transformers to 77M</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-2/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-2/</guid><description>Optimizing transformer pretraining for molecules using MLM vs MTR objectives, scaling to 77M compounds from PubChem for improved property prediction.</description><content:encoded><![CDATA[<h2 id="classifying-chemberta-2s-methodological-contributions">Classifying ChemBERTa-2&rsquo;s Methodological Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper with a secondary <strong>Resource</strong> contribution.</p>
<p>It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine &ldquo;how well&rdquo; these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models and establish a benchmark on a massive 77M compound dataset.</p>
<p><strong>Key methodological indicators</strong>:</p>
<ul>
<li><strong>Baseline comparison</strong>: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa-1</a>) with prominent benchmark tables</li>
<li><strong>Ablation studies</strong>: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size</li>
<li><strong>Scaling analysis</strong>: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance</li>
</ul>
<h2 id="motivations-for-scaling-molecular-transformers">Motivations for Scaling Molecular Transformers</h2>
<p>The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a &ldquo;chemical foundation model&rdquo;.</p>
<p><strong>Key motivations</strong>:</p>
<ul>
<li><strong>Label scarcity</strong>: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant</li>
<li><strong>Scaling hypothesis</strong>: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP</li>
<li><strong>Efficiency</strong>: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and weakly supervised (MTR, using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> computed properties as labels) approaches</li>
</ul>
<h2 id="novelty-in-multi-task-regression-objectives">Novelty in Multi-Task Regression Objectives</h2>
<p><strong>Scale</strong>: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>).</p>
<p><strong>Pipeline optimization</strong>: A direct, controlled comparison of <strong>Masked Language Modeling (MLM)</strong> vs. <strong>Multi-Task Regression (MTR)</strong> pretraining objectives on identical datasets.</p>
<p><strong>Proxy selection</strong>: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.</p>
<h2 id="experimental-pretraining-setup-on-77m-compounds">Experimental Pretraining Setup on 77M Compounds</h2>
<h3 id="pretraining-setup">Pretraining Setup</h3>
<p><strong>Datasets</strong>: Subsets of <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> containing 5M, 10M, and 77M unique SMILES.</p>
<p><strong>Tasks</strong>:</p>
<ul>
<li><strong>MLM</strong>: Masking 15% of tokens (following RoBERTa procedure). The model is optimized by minimizing the cross-entropy loss over the predicted masked tokens:
$$ \mathcal{L}_{MLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}) $$
where $\mathcal{M}$ represents the set of masked token indices.</li>
<li><strong>MTR</strong>: Predicting 200 calculated molecular properties (via RDKit) simultaneously using a mean squared error objective:
$$ \mathcal{L}_{MTR} = \frac{1}{200} \sum_{j=1}^{200} \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_{ij} - y_{ij} \right)^2 $$
Continuous target labels $y_{ij}$ are mean-normalized prior to training to equilibrate the disparate scales of different chemical properties.</li>
</ul>
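<p>A minimal sketch of the MTR objective above, averaging per-property MSEs over a toy batch (targets assumed already mean-normalized; real training would use 200 properties, not 2):</p>

```python
def mtr_loss(preds, targets):
    """Multi-task regression loss: mean over properties of per-property MSE.
    preds/targets are N molecules x T properties (lists of lists)."""
    n = len(targets)
    t = len(targets[0])
    total = 0.0
    for j in range(t):
        # MSE for property j across the batch of N molecules
        total += sum((preds[i][j] - targets[i][j]) ** 2 for i in range(n)) / n
    return total / t

preds = [[0.1, 0.0], [0.2, 1.0]]
targets = [[0.0, 0.0], [0.0, 1.0]]
# Per-property MSEs are 0.025 and 0.0; their mean is 0.0125.
print(mtr_loss(preds, targets))
```

Mean-normalizing the targets beforehand keeps any single large-scale property (e.g. molecular weight) from dominating this average.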
<p><strong>Hyperparameter search</strong>: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.</p>
<h3 id="downstream-validation">Downstream Validation</h3>
<p><strong>Finetuning</strong>: Evaluated on 8 tasks from <strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).</p>
<p><strong>Analysis</strong>: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.</p>
<h2 id="key-performance-outcomes-and-scaling-realities">Key Performance Outcomes and Scaling Realities</h2>
<p><strong>Highly competitive performance</strong>: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks, though the margins demonstrate that task-specific baselines remain notably robust.</p>
<p><strong>MTR superiority</strong>: Models pretrained on Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained on MLM on every finetuning task evaluated. MTR is substantially slower than MLM due to the larger input size from the 200-element label vector, but MLM loss serves as a reliable proxy for MTR loss, enabling cheaper architecture search before committing to full MTR pretraining.</p>
<p><strong>Scaling laws versus downstream utility</strong>: Pretraining loss improved by 25-35% when increasing the dataset from 5M to 77M compounds. However, this improvement in pretraining loss does not uniformly transfer to downstream tasks. For MTR models, SR-p53 ROC-AUC decreases monotonically from 0.834 (5M) to 0.827 (10M) to 0.817 (77M), and Lipophilicity RMSE is worse at 77M (0.798) than at 5M (0.758), despite a dip at 10M (0.744). This variability in transfer challenges the assumption that pretraining improvements always yield downstream gains.</p>
<p><strong>Transfer learning</strong>: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The pretraining corpus is derived from <strong>PubChem</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>PubChem</td>
          <td>77M SMILES</td>
          <td>Canonicalized and globally shuffled. Subsets of 5M and 10M used. <strong>Note: Exact splits and datasets are not published.</strong></td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>PubChem</td>
          <td>100k SMILES</td>
          <td>A fixed set held out from the 77M corpus. <strong>Note: Exact 100k subset is not published.</strong></td>
      </tr>
      <tr>
          <td><strong>MTR Labels</strong></td>
          <td>RDKit</td>
          <td>200 props</td>
          <td>200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. <strong>Note: Calculated labels are not published and must be re-computed.</strong></td>
      </tr>
      <tr>
          <td><strong>Finetuning</strong></td>
          <td>MoleculeNet</td>
          <td>1.5k - 8k</td>
          <td>Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pretraining Objectives:</strong></p>
<ol>
<li><strong>Masked Language Modeling (MLM)</strong>: Follows RoBERTa procedure. Masks 15% of tokens. Max sequence length 512.</li>
<li><strong>Multi-Task Regression (MTR)</strong>: Predicting 200 RDKit properties. Labels are mean-normalized.</li>
</ol>
<p><strong>Tokenizer:</strong></p>
<ul>
<li>Dictionary of common SMILES characters</li>
<li>Maximum vocabulary size: <strong>591 tokens</strong></li>
</ul>
<p><strong>Optimization:</strong></p>
<ul>
<li><strong>Patience</strong>: Early-stopping patience set to one full pass through the dataset, ensuring every training example is seen at least once</li>
<li><strong>Hyperparameter search</strong>: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate. <strong>Note: The precise configuration of the winning models that were scaled to 77M is absent from the paper.</strong></li>
</ul>
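<p>The random search can be sketched as sampling 50 independent configurations from a discrete space. The candidate values below are illustrative assumptions, since, as noted, the paper does not publish its exact grid or winning configurations:</p>

```python
import random

# Hypothetical search space over the dimensions the paper says were varied.
SPACE = {
    "hidden_size": [384, 512, 768],
    "num_attention_heads": [6, 8, 12],
    "dropout": [0.1, 0.15, 0.2],
    "intermediate_size": [1024, 2048, 3072],
    "num_hidden_layers": [3, 6, 12],
    "learning_rate": [1e-5, 5e-5, 1e-4],
}

def sample_configs(n, seed=0):
    """Draw n random configurations, one choice per hyperparameter."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in SPACE.items()} for _ in range(n)]

configs = sample_configs(50)
print(len(configs), configs[0]["hidden_size"] in SPACE["hidden_size"])  # → 50 True
```

Each sampled configuration would be pretrained on the 5M subset with MLM (the cheap proxy), and the top 5 by validation loss scaled to 10M and 77M.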
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Based on <strong>RoBERTa</strong> (HuggingFace implementation)</li>
<li><strong>Parameter scale</strong>: Models ranged between <strong>5M and 46M parameters</strong></li>
<li><strong>Selection</strong>: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset</li>
<li><strong>Checkpoints</strong>: Pre-trained weights are hosted by DeepChem on <a href="https://huggingface.co/DeepChem">Hugging Face</a>. Direct links include <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MTR">DeepChem/ChemBERTa-77M-MTR</a> and <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MLM">DeepChem/ChemBERTa-77M-MLM</a> (Note: Model cards are currently empty).</li>
<li><strong>Code Reference</strong>: While the <a href="https://github.com/deepchem/deepchem">DeepChem</a> repository is referenced for code, isolated training scripts tailored to recreate ChemBERTa-2&rsquo;s exact pipeline are not separated from the generalized deepchem library tooling.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Benchmarks were performed on <strong>MoleculeNet</strong> using DeepChem.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Tasks</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RMSE</strong> ($\downarrow$)</td>
          <td>Delaney, Lipo, BACE (Reg), Clearance</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.8).</td>
      </tr>
      <tr>
          <td><strong>ROC-AUC</strong> ($\uparrow$)</td>
          <td>BBBP, ClinTox, HIV, Tox21, BACE (Cls)</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: AWS EC2 instances with <strong>Nvidia T4 GPUs</strong></li>
<li><strong>Strategy</strong>: AWS Spot instances were used to reduce cost; implemented frequent checkpointing to handle interruptions.</li>
<li><strong>Note</strong>: For MTR, they wrote a custom data loader wrapper around HuggingFace&rsquo;s text loader to handle CSV parsing efficiency, as the default CSV loader was a major bottleneck for the 200-element target vectors.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. <em>arXiv preprint arXiv:2209.01712</em>. <a href="https://doi.org/10.48550/arXiv.2209.01712">https://doi.org/10.48550/arXiv.2209.01712</a></p>
<p><strong>Publication</strong>: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa-1 Paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{ahmadChemBERTa2ChemicalFoundation2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}: {{Towards Chemical Foundation Models}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemformer: A Pre-trained Transformer for Comp Chem</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/chemformer/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/chemformer/</guid><description>BART-based Transformer pre-trained on 100M molecules using self-supervision to accelerate convergence on chemical sequence tasks.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It proposes an architecture adaptation (Chemformer based on BART) and a specific pre-training strategy (&ldquo;Combined&rdquo; masking and augmentation). The paper validates this method by benchmarking against established models on multiple tasks, including direct synthesis, retrosynthesis, and molecular optimization. It also includes a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution by making the pre-trained models and code available.</p>
<h2 id="motivation-computational-bottlenecks-in-cheminformatics">Motivation: Computational Bottlenecks in Cheminformatics</h2>
<p>Existing Transformer models for cheminformatics are often developed for single applications and are computationally expensive to train from scratch. For example, training a Molecular Transformer for reaction prediction can take days, limiting hyperparameter exploration. Self-supervised pre-training (like BERT or T5) has significantly advanced NLP by reducing fine-tuning time and improving performance. In chemistry, applications have traditionally focused on task-specific datasets or encoder-only architectures, which perform poorly on sequence generation tasks. The authors aim to use transfer learning on a large unlabelled dataset to create a model that converges quickly and performs well across diverse sequence-to-sequence and discriminative tasks.</p>
<h2 id="core-innovation-bart-architecture-and-combined-pre-training">Core Innovation: BART Architecture and Combined Pre-training</h2>
<p>The primary insight lies in the adaptation of the <strong>BART architecture</strong> for chemistry and the introduction of a <strong>&ldquo;Combined&rdquo; self-supervised pre-training task</strong>.</p>
<ul>
<li><strong>Architecture</strong>: Chemformer uses the BART encoder-decoder structure, allowing it to handle both discriminative (property prediction) and generative (reaction prediction) tasks efficiently. This provides an alternative to encoder-only (BERT) or decoder-only (GPT) models.</li>
<li><strong>Combined Pre-training</strong>: The authors introduce a task that applies both <strong>Span Masking</strong> (randomly replacing tokens with <code>&lt;mask&gt;</code>) and <strong><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> Augmentation</strong> (permuting atom order, see <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">Randomized SMILES</a>) simultaneously. Formally, given a canonical SMILES sequence $x$, a corrupted sequence $\tilde{x} = \text{Mask}(\text{Augment}(x))$ is generated. The model is trained using an autoregressive cross-entropy loss to reconstruct the canonical sequence from the corrupted input:
$$ \mathcal{L}_{\text{pre-train}} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{&lt;t}, \tilde{x}) $$</li>
<li><strong>Tunable Augmentation</strong>: A downstream augmentation strategy is proposed where the probability of augmenting the input/output SMILES ($p_{aug}$) is a tunable hyperparameter, performed on-the-fly.</li>
</ul>
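<p>The corruption pipeline $\tilde{x} = \text{Mask}(\text{Augment}(x))$ can be sketched as below. Span masking is simplified to per-token masking for brevity, and <code>augment</code> is an identity placeholder, since true atom-order permutation requires generating randomized SMILES with RDKit:</p>

```python
import random

def span_mask(tokens, mask_rate=0.15, rng=None):
    """Replace a random subset of tokens with <mask>
    (span masking simplified to independent token masking)."""
    rng = rng or random.Random(0)
    return [t if rng.random() > mask_rate else "<mask>" for t in tokens]

def augment(smiles):
    """Placeholder for SMILES augmentation (atom-order permutation);
    a real implementation would emit a randomized SMILES via RDKit."""
    return smiles  # identity stand-in

def corrupt(smiles, rng=None):
    """Build the corrupted input x~ = Mask(Augment(x))."""
    return span_mask(list(augment(smiles)), rng=rng)

x = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
x_tilde = corrupt(x, rng=random.Random(42))
print("".join(x_tilde))
```

The decoder is then trained to reconstruct the canonical sequence $x$ from $\tilde{x}$, so the model must simultaneously undo masking and re-canonicalize atom order.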
<h2 id="experimental-setup-and-pre-training-tasks">Experimental Setup and Pre-training Tasks</h2>
<p>The authors pre-trained Chemformer on <strong>100 million molecules</strong> from ZINC-15 and fine-tuned it on three distinct task types:</p>
<ol>
<li><strong>Seq2Seq Reaction Prediction</strong>:
<ul>
<li><em>Direct Synthesis</em>: USPTO-MIT dataset (Mixed and Separated).</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: USPTO-50K dataset (see also <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a>).</li>
</ul>
</li>
<li><strong>Molecular Optimization</strong>: Generating molecules with improved properties (<a href="https://en.wikipedia.org/wiki/Distribution_coefficient">LogD</a>, solubility, clearance) starting from ChEMBL matched molecular pairs.</li>
<li><strong>Discriminative Tasks</strong>:
<ul>
<li><em><a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a></em>: Predicting properties (ESOL, FreeSolv, Lipophilicity) from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
<li><em>Bioactivity</em>: Predicting pXC50 values for 133 genes using ExCAPE data.</li>
</ul>
</li>
</ol>
<p>Ablation studies compared three pre-training strategies (Masking, Augmentation, Combined) against a randomly initialized baseline.</p>
<h2 id="results-trade-offs-and-conclusions">Results, Trade-offs, and Conclusions</h2>
<ul>
<li><strong>Performance</strong>: Chemformer achieved <strong>competitive top-1 accuracy</strong> on USPTO-MIT (91.3% Mixed) and USPTO-50K (53.6-54.3%), outperforming the Augmented Transformer and graph-based models (GLN, GraphRetro).</li>
<li><strong>Convergence Speed</strong>: Pre-training markedly accelerated convergence; fine-tuning for just 20 epochs (~30 minutes) outperformed previous baselines trained for significantly longer.</li>
<li><strong>Pre-training Tasks</strong>: The &ldquo;Combined&rdquo; task generally performed best for reaction prediction and bioactivity, while &ldquo;Masking&rdquo; was superior for molecular optimization.</li>
<li><strong>Augmentation Trade-off</strong>: The augmentation strategy improved top-1 accuracy but significantly degraded top-5/10 accuracy because beam search outputs became populated with augmented versions of the same molecule. This presents a considerable limitation for practical applications like retrosynthesis mapping, where retrieving a diverse set of candidate reactions is often critical.</li>
<li><strong>Discriminative Evaluation Caveats</strong>: Chemformer underperformed specialized baselines (like D-MPNN or <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>) on small discriminative datasets. The authors note that direct comparison is difficult: Chemformer was trained simultaneously on multiple subtasks (multi-task learning), while the literature baselines were trained and tuned on each subtask separately. Additionally, the Chemformer encoder uses fewer than 20M parameters compared to MolBERT&rsquo;s approximately 85M, and Chemformer&rsquo;s pre-training does not include molecular property objectives. For other transfer learning approaches to QSAR, see <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a>.</li>
<li><strong>Pre-training Data Scope</strong>: The 100M pre-training dataset from ZINC-15 was selected with constraints on molecular weight ($\le 500$ Da) and LogP ($\le 5$), focusing the learned representations on small, drug-like molecules.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><em>Note: The primary GitHub repository for Chemformer was officially archived on February 11, 2026. Pre-trained weights and datasets used in the paper are still hosted externally on <a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Box</a>. Active development of Chemformer models has moved to the <a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels</a> repository.</em></p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/Chemformer">Chemformer (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Archived; original PyTorch implementation</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Active successor repository</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Pre-trained weights (Box)</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Base and Large model checkpoints</td>
      </tr>
  </tbody>
</table>
<p>The following datasets were used for pre-training and benchmarking.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Pre-training</strong></td>
          <td style="text-align: left">ZINC-15</td>
          <td style="text-align: left">100M</td>
          <td style="text-align: left">Selected subset (reactive, annotated purchasability, MW $\le 500$, LogP $\le 5$). Split: 99% Train / 0.5% Val / 0.5% Test.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Direct Synthesis</strong></td>
          <td style="text-align: left">USPTO-MIT</td>
          <td style="text-align: left">~470k</td>
          <td style="text-align: left">Evaluated on &ldquo;Mixed&rdquo; and &ldquo;Separated&rdquo; variants.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Retrosynthesis</strong></td>
          <td style="text-align: left">USPTO-50K</td>
          <td style="text-align: left">~50k</td>
          <td style="text-align: left">Standard benchmark for retrosynthesis.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimization</strong></td>
          <td style="text-align: left">ChEMBL MMPs</td>
          <td style="text-align: left">~160k Train</td>
          <td style="text-align: left">Matched Molecular Pairs for LogD, solubility, and clearance optimization.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Properties</strong></td>
          <td style="text-align: left">MoleculeNet</td>
          <td style="text-align: left">Small</td>
          <td style="text-align: left">ESOL (1128), FreeSolv (642), Lipophilicity (4200).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Bioactivity</strong></td>
          <td style="text-align: left">ExCAPE</td>
          <td style="text-align: left">~312k</td>
          <td style="text-align: left">133 gene targets; &gt;1200 compounds per gene.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenization (523 tokens total) derived from ChEMBL 27 canonical SMILES.</li>
<li><strong>Augmentation</strong>: SMILES enumeration (permuting atom order) used for pre-training and on-the-fly during fine-tuning ($p_{aug}=0.5$ for Seq2Seq, $p_{aug}=1.0$ for discriminative).</li>
</ul>
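<p>The on-the-fly augmentation reduces to a per-example coin flip against $p_{aug}$. A sketch of that control flow; <code>enumerate_smiles</code> here is a hypothetical stub standing in for a real SMILES enumerator (in practice an RDKit-style randomized atom traversal would be used):</p>

```python
import random

def enumerate_smiles(smiles, rng):
    # Hypothetical stub: a real implementation would rewrite the SMILES
    # with a randomized atom traversal; reversal is only a placeholder.
    return smiles[::-1]

def maybe_augment(smiles, p_aug, rng):
    """Return an augmented SMILES with probability p_aug, else the original."""
    if rng.random() < p_aug:
        return enumerate_smiles(smiles, rng)
    return smiles

rng = random.Random(0)
# p_aug = 0.5 for Seq2Seq fine-tuning; p_aug = 1.0 always augments
# (the discriminative setup); p_aug = 0.0 disables augmentation.
example = maybe_augment("CCO", 0.5, rng)
```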
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pre-training Tasks</strong>:
<ol>
<li><em>Masking</em>: Span masking (BART style).</li>
<li><em>Augmentation</em>: Input is a randomized SMILES; target is canonical SMILES.</li>
<li><em>Combined</em>: Input is augmented <em>then</em> masked; target is canonical SMILES.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$).</li>
<li>Schedule: Linear warm-up (8000 steps) for pre-training; One-cycle schedule for fine-tuning.</li>
</ul>
</li>
<li><strong>Inference</strong>: <a href="https://en.wikipedia.org/wiki/Beam_search">Beam search</a> with width 10 for Seq2Seq tasks. Used <code>molbart/inference_score.py</code> and <code>molbart/retrosynthesis/round_trip_inference.py</code> for standard and round-trip validation.</li>
</ul>
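<p>The masking half of the corruption schemes can be illustrated concretely. A sketch of BART-style span masking over SMILES tokens, assuming character-level tokens for readability; the span-length draw is a simplification (BART samples Poisson-distributed span lengths), so this is a sketch of the idea, not the paper's exact procedure:</p>

```python
import random

def span_mask(tokens, mask_token="<MASK>", p=0.15, max_span=3, rng=None):
    """Replace contiguous token spans with a single mask token until
    roughly a fraction p of the input tokens has been covered."""
    rng = rng or random.Random()
    out, i, masked = [], 0, 0
    budget = max(1, int(p * len(tokens)))
    while i < len(tokens):
        if masked < budget and rng.random() < p:
            span = 1 + rng.randrange(max_span)  # crude span-length draw
            out.append(mask_token)              # whole span -> one token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("CC(=O)Oc1ccccc1")
corrupted = span_mask(tokens, rng=random.Random(7))
```

<p>Because a multi-token span collapses to a single mask token, the model must also infer how many tokens are missing, which is what makes the BART objective harder than single-token masking.</p>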
<h3 id="models">Models</h3>
<p>Two model sizes were trained. Both use the Pre-Norm Transformer layout with GELU activation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Chemformer (Base)</th>
          <th style="text-align: left">Chemformer-Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">6</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model Dimension</strong></td>
          <td style="text-align: left">512</td>
          <td style="text-align: left">1024</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Feed-forward Dim</strong></td>
          <td style="text-align: left">2048</td>
          <td style="text-align: left">4096</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention Heads</strong></td>
          <td style="text-align: left">8</td>
          <td style="text-align: left">16</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~45M</td>
          <td style="text-align: left">~230M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Pre-training Task</strong></td>
          <td style="text-align: left">All 3 variants</td>
          <td style="text-align: left">Combined only</td>
      </tr>
  </tbody>
</table>
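<p>The parameter counts in the table can be sanity-checked from the other columns. A rough back-of-the-envelope estimate, ignoring biases, layer norms, and positional embeddings, and assuming the 523-token vocabulary described above with tied embeddings:</p>

```python
def approx_bart_params(layers, d, ff, vocab=523):
    """Rough parameter estimate for a BART-style encoder-decoder.
    Encoder layer: self-attention (4*d*d) + feed-forward (2*d*ff).
    Decoder layer: adds cross-attention (another 4*d*d).
    The number of attention heads does not change the count, since
    the Q/K/V/output projections are d x d regardless of head split."""
    enc = layers * (4 * d * d + 2 * d * ff)
    dec = layers * (8 * d * d + 2 * d * ff)
    emb = vocab * d  # token embeddings (tied with the output projection)
    return enc + dec + emb

base = approx_bart_params(6, 512, 2048)    # ~44M vs reported ~45M
large = approx_bart_params(8, 1024, 4096)  # ~235M vs reported ~230M
```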
<h3 id="evaluation">Evaluation</h3>
<p>Comparisons relied on Top-N accuracy for reaction tasks and validity metrics for optimization.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Direct Synthesis (Sep)</td>
          <td style="text-align: left"><strong>92.8%</strong> (Large)</td>
          <td style="text-align: left">91.1% (Aug Transformer)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Retrosynthesis</td>
          <td style="text-align: left"><strong>54.3%</strong> (Large)</td>
          <td style="text-align: left">53.7% (GraphRetro) / 52.5% (GLN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Desirable %</strong></td>
          <td style="text-align: left">Mol Optimization</td>
          <td style="text-align: left"><strong>75.0%</strong> (Base-Mask)</td>
          <td style="text-align: left">70.2% (Transformer-R)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>RMSE</strong></td>
          <td style="text-align: left">Lipophilicity</td>
          <td style="text-align: left">0.598 (Combined)</td>
          <td style="text-align: left">0.555 (D-MPNN)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 NVIDIA V100 GPUs (batch size 128 per GPU).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Pre-training: 2.5 days (Base) / 6 days (Large) for 1M steps.</li>
<li>Fine-tuning: ~20-40 epochs for reaction prediction (&lt;12 hours).</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Irwin, R., Dimitriadis, S., He, J., &amp; Bjerrum, E. J. (2022). Chemformer: a pre-trained transformer for computational chemistry. <em>Machine Learning: Science and Technology</em>, 3(1), 015022. <a href="https://doi.org/10.1088/2632-2153/ac3ffb">https://doi.org/10.1088/2632-2153/ac3ffb</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{irwinChemformerPretrainedTransformer2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemformer: A Pre-Trained Transformer for Computational Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Irwin, Ross and Dimitriadis, Spyridon and He, Jiazhen and Bjerrum, Esben Jannik}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{015022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2632-2153}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/ac3ffb}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa: Molecular Property Prediction via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</guid><description>A systematic evaluation of RoBERTa transformers pretrained on 77M PubChem SMILES for molecular property prediction tasks.</description><content:encoded><![CDATA[<h2 id="taxonomy-and-paper-contributions">Taxonomy and Paper Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$), with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>It is a methodological investigation because it systematically evaluates a specific architecture (Transformers/RoBERTa) against established State-of-the-Art (SOTA) baselines like directed Message Passing Neural Networks (D-MPNNs) to determine &ldquo;how well does this work?&rdquo; in the chemical domain. It ablates dataset size, tokenization, and input representation.</p>
<p>It is also a resource paper as it introduces &ldquo;PubChem-77M,&rdquo; a curated dataset of 77 million SMILES strings designed to facilitate large-scale self-supervised pretraining for the community.</p>
<h2 id="overcoming-data-scarcity-in-property-prediction">Overcoming Data Scarcity in Property Prediction</h2>
<p>The primary motivation is <strong>data scarcity</strong> in molecular property prediction. Graph Neural Networks (GNNs) achieve strong performance on property prediction tasks when provided with sufficient labeled data. Generating these labels requires costly and time-consuming laboratory testing, leading to severe data scarcity in specialized chemical domains.</p>
<p>Massive quantities of <strong>unlabeled chemical structure data</strong> exist in the form of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. Inspired by the success of Transformers in NLP, where self-supervised pretraining on large corpora yields strong transfer learning, the authors aim to use these unlabeled datasets to learn effective molecular representations. Additionally, Transformers benefit from a mature software ecosystem (HuggingFace) that offers efficiency advantages over GNNs.</p>
<h2 id="pretraining-scaling-laws-and-novelty">Pretraining Scaling Laws and Novelty</h2>
<p>Previous works applied Transformers to SMILES strings. This paper advances the field by systematically evaluating scaling laws and architectural components for this domain. Specifically:</p>
<ul>
<li><strong>Scaling Analysis</strong>: It explicitly tests how pretraining dataset size (100K to 10M) impacts downstream performance.</li>
<li><strong>Tokenizer Comparison</strong>: It compares standard NLP <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte-Pair Encoding (BPE)</a> against a chemically-aware &ldquo;SmilesTokenizer&rdquo;.</li>
<li><strong>Representation Comparison</strong>: It evaluates if the robust <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representation offers advantages over standard SMILES in a Transformer context.</li>
</ul>
<h2 id="experimental-setup-pretraining-and-finetuning">Experimental Setup: Pretraining and Finetuning</h2>
<p>The authors trained <strong>ChemBERTa</strong> (based on RoBERTa) using Masked Language Modeling (MLM) on subsets of the <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> dataset. The core training objective minimizes the cross-entropy loss over a corrupted input in which a subset of input tokens, denoted $\mathcal{M}$, is masked:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $x_i$ is the original token at a masked position, $x_{\setminus \mathcal{M}}$ is the corrupted SMILES context string, and $\theta$ denotes the network parameters.</p>
<ul>
<li><strong>Pretraining</strong>: Models were pretrained on dataset sizes of 100K, 250K, 1M, and 10M compounds.</li>
<li><strong>Baselines</strong>: Performance was compared against D-MPNN (Graph Neural Network), Random Forest (RF), and SVM using 2048-bit Morgan Fingerprints.</li>
<li><strong>Downstream Tasks</strong>: Finetuning was performed individually on small <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks: BBBP (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>), ClinTox (clinical toxicity), HIV, and Tox21 (p53 stress-response). This poses a transfer learning challenge, as the model must adapt from pretraining on 10 million molecules to classifying datasets ranging from ~1.5K to ~41K examples.</li>
<li><strong>Ablations</strong>:
<ul>
<li><strong>Tokenization</strong>: BPE vs. SmilesTokenizer on the 1M dataset, evaluated on Tox21.</li>
<li><strong>Input</strong>: SMILES vs. SELFIES strings on the Tox21 task.</li>
</ul>
</li>
</ul>
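<p>The MLM corruption selects positions independently, in contrast to BART-style span masking. A sketch of the 15% masking step, with character-level tokens for illustration; the full BERT recipe (80% mask / 10% random token / 10% unchanged among selected positions) is omitted for brevity:</p>

```python
import random

def mlm_mask(tokens, mask_token="[MASK]", p=0.15, rng=None):
    """Independently select ~p of the positions, replace them with a
    mask token, and return the corrupted sequence plus the set M of
    masked positions (the positions the loss is computed over)."""
    rng = rng or random.Random()
    corrupted, masked_positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            corrupted.append(mask_token)
            masked_positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, masked_positions

tokens = list("c1ccccc1O")  # phenol, character-tokenized for illustration
corrupted, M = mlm_mask(tokens, rng=random.Random(3))
```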
<h2 id="results-vs-graph-neural-network-baselines">Results vs. Graph Neural Network Baselines</h2>
<p>The main comparison between ChemBERTa (pretrained on 10M compounds) and Chemprop baselines on MoleculeNet tasks is summarized below (Table 1 from the paper):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>BBBP PRC</th>
          <th>ClinTox ROC</th>
          <th>ClinTox PRC</th>
          <th>HIV ROC</th>
          <th>HIV PRC</th>
          <th>Tox21 ROC</th>
          <th>Tox21 PRC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemBERTa 10M</td>
          <td>0.643</td>
          <td>0.620</td>
          <td>0.733</td>
          <td>0.975</td>
          <td>0.622</td>
          <td>0.119</td>
          <td>0.728</td>
          <td>0.207</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.708</td>
          <td>0.697</td>
          <td>0.906</td>
          <td>0.993</td>
          <td>0.752</td>
          <td>0.152</td>
          <td>0.688</td>
          <td>0.429</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>0.681</td>
          <td>0.692</td>
          <td>0.693</td>
          <td>0.968</td>
          <td>0.780</td>
          <td>0.383</td>
          <td>0.724</td>
          <td>0.335</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>0.702</td>
          <td>0.724</td>
          <td>0.833</td>
          <td>0.986</td>
          <td>0.763</td>
          <td>0.364</td>
          <td>0.708</td>
          <td>0.345</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Scaling Improvements &amp; Training Dynamics</strong>: Performance scales predictably with pretraining data size. Increasing data from 100K to 10M improved ROC-AUC by +0.110 and PRC-AUC by +0.059 on average across BBBP, ClinTox, and Tox21 (HIV was omitted due to resource constraints). Notably, the authors halted pretraining on the 10M subset after just 3 epochs due to overfitting, suggesting that simple 15% token masking might not provide a sufficiently challenging learning signal for large-scale chemical representation learning.</li>
<li><strong>Performance Limits vs. GNNs</strong>: ChemBERTa generally performs below the D-MPNN baseline. On the Tox21 dataset, ChemBERTa-10M achieved a higher ROC-AUC (0.728) than D-MPNN (0.688); nonetheless, it recorded a substantially lower PRC-AUC (0.207 vs 0.429). This gap indicates that current Transformer iterations lack the explicit inductive biases of graph algorithms and struggle with the severe class imbalances typical of chemical datasets.</li>
<li><strong>Ablation Limitations (Tokenization &amp; SELFIES)</strong>: The authors&rsquo; ablation studies for tokenization (SmilesTokenizer narrowly beating BPE) and input representation (SELFIES performing comparably to SMILES) were evaluated exclusively on the single Tox21 task. Deriving broad architectural conclusions regarding &ldquo;semantically-aware tokenization&rdquo; or string robustness from an $N=1$ empirical evaluation is a significant limitation of the study. Broader benchmarking is required to validate these findings.</li>
<li><strong>Interpretability</strong>: Attention heads learn, without explicit supervision, to track chemically relevant substructures (such as specific functional groups and aromatic rings), mimicking the inductive biases of graph convolutions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors curated a massive dataset for pretraining and utilized standard benchmarks for evaluation.</p>
<ul>
<li><strong>Pretraining Data</strong>: <strong>PubChem-77M</strong>.
<ul>
<li>Source: 77 million unique SMILES from PubChem.</li>
<li>Preprocessing: Canonicalized and globally shuffled.</li>
<li>Subsets used: 100K, 250K, 1M, and 10M subsets.</li>
<li><em>Availability Note</em>: The authors provided a direct link to the <a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">canonicalized 10M compound subset</a> used for their largest experiments. Full reproducibility of the smaller (100K, 250K, 1M) or full 77M sets may require re-extracting from PubChem.</li>
</ul>
</li>
<li><strong>Evaluation Data</strong>: <strong>MoleculeNet</strong>.
<ul>
<li>Tasks: BBBP (2,039), ClinTox (1,478), HIV (41,127), Tox21 (7,831).</li>
<li>Splitting: 80/10/10 train/valid/test split using a <strong>scaffold splitter</strong>, so that molecules sharing a core scaffold do not leak across splits and generalization to structurally novel compounds is tested.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The core training methodology mirrors standard BERT/RoBERTa procedures adapted for chemical strings.</p>
<ul>
<li><strong>Objective</strong>: Masked Language Modeling (MLM) with <strong>15% token masking</strong>.</li>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>BPE</strong>: Byte-Pair Encoder (vocab size 52K).</li>
<li><strong>SmilesTokenizer</strong>: Regex-based custom tokenizer available in DeepChem (documented <a href="https://deepchem.readthedocs.io/en/latest/tokenizers.html#smilestokenizer">here</a>).</li>
</ul>
</li>
<li><strong>Sequence Length</strong>: Maximum sequence length of <strong>512 tokens</strong>.</li>
<li><strong>Finetuning</strong>: Appended a linear classification layer; backpropagated through the base model for up to 25 epochs with early stopping on ROC-AUC.</li>
</ul>
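<p>The contrast between generic BPE and a chemically aware tokenizer is easiest to see with the regex approach. A sketch using a pattern in the style of the DeepChem SmilesTokenizer (the production tokenizer's exact regex and vocabulary handling differ):</p>

```python
import re

# Multi-character atoms (Cl, Br) and bracket atoms ([NH4+]) are kept as
# single tokens instead of being split character-wise or by BPE merges.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_REGEX.findall(smiles)

# 'Cl' survives as one token, unlike a naive character-level split.
tokens = tokenize_smiles("CC(=O)Nc1ccc(Cl)cc1")
```

<p>Because every character is covered by some alternative, joining the tokens reconstructs the original string, which is a useful invariant to test a tokenizer against.</p>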
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: <strong>RoBERTa</strong> (via HuggingFace).
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 12 per layer (72 heads across the 6 layers).</li>
<li><em>Implementation Note</em>: The original training notebooks and scripts are maintained in the authors&rsquo; <a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry repository</a>, alongside the primary downstream tasks integrated into DeepChem. A <a href="https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Transfer_Learning_With_ChemBERTa_Transformers.ipynb">full Tox21 transfer learning tutorial</a> has been incorporated into the DeepChem repository.</li>
</ul>
</li>
<li><strong>Baselines</strong> (via Chemprop library):
<ul>
<li><strong>D-MPNN</strong>: Directed Message Passing Neural Network with default hyperparameters.</li>
<li><strong>RF/SVM</strong>: Scikit-learn Random Forest and SVM using 2048-bit Morgan fingerprints (<a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured using dual metrics to account for class imbalance common in toxicity datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ROC-AUC</strong></td>
          <td>Area Under Receiver Operating Characteristic Curve</td>
      </tr>
      <tr>
          <td><strong>PRC-AUC</strong></td>
          <td>Area Under Precision-Recall Curve (vital for imbalanced data)</td>
      </tr>
  </tbody>
</table>
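<p>ROC-AUC has a clean rank interpretation that makes it computable without a library: it is the probability that a randomly chosen positive is scored above a randomly chosen negative. A self-contained sketch with toy scores (illustrating how an imbalanced dataset can show a high ROC-AUC while precision-recall behavior remains modest):</p>

```python
def roc_auc(labels, scores):
    """ROC-AUC via the rank (Mann-Whitney U) formulation:
    the fraction of positive/negative pairs ranked correctly,
    counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced set: 2 positives among 10 examples.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.2, 0.3, 0.35, 0.4, 0.5, 0.8, 0.7, 0.9]
auc = roc_auc(labels, scores)  # 15 of 16 pairs ranked correctly
```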
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Single <strong>NVIDIA V100 GPU</strong>.</li>
<li><strong>Training Time</strong>: Approximately <strong>48 hours</strong> for the 10M compound subset.</li>
<li><strong>Carbon Footprint</strong>: Estimated 17.1 kg $\text{CO}_2\text{eq}$ (offset by Google Cloud).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training notebooks and finetuning scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration of ChemBERTa and SmilesTokenizer</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">ChemBERTa-zinc-base-v1</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained RoBERTa on 100K ZINC SMILES</td>
      </tr>
      <tr>
          <td><a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">PubChem-10M subset</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Canonicalized 10M compound subset used for largest experiments</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code and pre-trained models are available, and the 10M pretraining subset is downloadable. However, smaller subsets (100K, 250K, 1M) may need re-extraction from PubChem, and exact hyperparameter details for finetuning (learning rate, batch size) are not fully specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. <em>arXiv preprint arXiv:2010.09885</em>. <a href="https://doi.org/10.48550/arXiv.2010.09885">https://doi.org/10.48550/arXiv.2010.09885</a></p>
<p><strong>Publication</strong>: arXiv 2020 (Preprint)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">HuggingFace Model Hub (ChemBERTa-zinc-base-v1)</a> - <em>Additional pre-trained variations on PubChem &amp; ZINC datasets are available on the author&rsquo;s <a href="https://huggingface.co/seyonec">seyonec</a> HF profile.</em></li>
<li><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry GitHub Repository</a> - <em>Notebooks and scripts used for MLM pretraining and finetuning evaluations.</em></li>
</ul>
<h3 id="bibtex">BibTeX</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{chithranandaChemBERTaLargeScaleSelfSupervised2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa}}: {{Large-Scale Self-Supervised Pretraining}} for {{Molecular Property Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Score-Based Generative Modeling with SDEs (Song 2021)</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/score-based-generative-modeling-sde/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/score-based-generative-modeling-sde/</guid><description>Unified SDE framework for score-based generative models, introducing Predictor-Corrector samplers and setting CIFAR-10 records with FID 2.20 and 2.99 bits/dim.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper. It proposes a unified framework that generalizes previous discrete score-based models (SMLD and DDPM) into continuous-time Stochastic Differential Equations (SDEs). The paper introduces algorithms for sampling (Predictor-Corrector) and likelihood computation (Probability Flow ODE), validated by setting new records on CIFAR-10 (FID 2.20, IS 9.89 at the time of publication). It also contains elements of <strong>Systematization</strong> by showing how existing methods are special cases of this broader framework.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Prior successful generative models, specifically Score Matching with Langevin Dynamics (SMLD) and Denoising Diffusion Probabilistic Models (DDPM), operate by sequentially corrupting data with slowly increasing noise and learning to reverse the process. Both methods treat the noise scales as a finite set of discrete steps. The authors aim to generalize this to a continuum of noise scales by modeling the diffusion process as a Stochastic Differential Equation (SDE). This continuous formulation enables:</p>
<ul>
<li><strong>Flexible sampling:</strong> Use of general-purpose SDE solvers.</li>
<li><strong>Exact likelihood computation:</strong> Via connection to Neural ODEs.</li>
<li><strong>Controllable generation:</strong> Solving inverse problems (inpainting, colorization) without retraining.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>SDE framework</strong> for score-based generative modeling:</p>
<ul>
<li><strong>Continuous Generalization:</strong> Proving that SMLD and DDPM noise perturbations correspond to discretizations of Variance Exploding (VE) SDEs and Variance Preserving (VP) SDEs, respectively.</li>
<li><strong>Reverse-Time SDE:</strong> Leveraging Anderson&rsquo;s result (Anderson, 1982: a result on time-reversal of diffusion processes showing that the reverse is also a diffusion, with the forward drift reversed and a correction term involving the score of the marginal density) that the reverse of a diffusion process is also a diffusion process, governed by the score (gradient of log density).</li>
<li><strong>Predictor-Corrector (PC) Samplers:</strong> A hybrid sampling strategy where a numerical SDE solver (Predictor) estimates the next step, and a score-based MCMC approach (Corrector) corrects the marginal distribution.</li>
<li><strong>Probability Flow ODE:</strong> Deriving a deterministic ODE that shares the same marginal densities as the SDE, enabling near-exact likelihood computation (accuracy is limited by both numerical ODE solver discretization and variance of the unbiased Hutchinson trace estimator) and latent space manipulation.</li>
<li><strong>Sub-VP SDE:</strong> A new SDE class proposed to improve likelihoods by bounding variance tighter than the VP SDE.</li>
</ul>
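<p>The reverse-time SDE and the probability flow ODE differ only in how they weight the score term. A minimal sketch, assuming a generic forward SDE $dx = f(x,t)dt + g(t)dw$ and a user-supplied <code>score</code> function (all three callables here are illustrative placeholders):</p>

```python
def reverse_sde_drift(f, g, score, x, t):
    # Anderson (1982): reverse diffusion drift = f(x, t) - g(t)^2 * score(x, t)
    return f(x, t) - g(t) ** 2 * score(x, t)

def probability_flow_drift(f, g, score, x, t):
    # Deterministic ODE with the same marginals: halve the score term, drop the noise
    return f(x, t) - 0.5 * g(t) ** 2 * score(x, t)
```

<p>Plugging an Ornstein-Uhlenbeck drift and a standard-normal score into these two functions makes the factor-of-two difference between the stochastic and deterministic reverse processes concrete.</p>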
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the framework on standard image benchmarks:</p>
<ul>
<li><strong>Datasets:</strong> CIFAR-10 (32x32), CelebA (64x64), LSUN (Bedroom, Church), and CelebA-HQ (256x256 and 1024x1024).</li>
<li><strong>Ablation Studies:</strong> Comparing samplers (Ancestral vs. Reverse Diffusion vs. Probability Flow vs. PC) and SDE types (VE, VP, sub-VP).</li>
<li><strong>Architecture Search:</strong> Exploring improvements like FIR up/downsampling, rescaling skip connections, and increasing depth (leading to NCSN++ and DDPM++ architectures).</li>
<li><strong>Likelihood Evaluation:</strong> Computing Negative Log-Likelihood (NLL) in bits/dim using the Probability Flow ODE.</li>
<li><strong>Inverse Problems:</strong> Testing class-conditional generation, inpainting, and colorization using the conditional reverse-time SDE.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Record Performance:</strong> The <strong>NCSN++ cont. (deep, VE)</strong> model achieved an Inception Score of 9.89 and FID of 2.20 on CIFAR-10 (as of ICLR 2021).</li>
<li><strong>High-Fidelity Generation:</strong> First score-based model to generate 1024x1024 images (CelebA-HQ).</li>
<li><strong>Competitive Likelihoods:</strong> The <strong>DDPM++ cont. (deep, sub-VP)</strong> model achieved 2.99 bits/dim on uniformly dequantized CIFAR-10, a record at the time.</li>
<li><strong>Sampling Efficiency:</strong> PC samplers consistently outperformed predictor-only methods (like standard ancestral sampling) for the same computational cost.</li>
<li><strong>Controllable Generation:</strong> Successful application to inpainting and colorization using a single unconditional model.</li>
<li><strong>Limitations:</strong> Sampling remains slower than GANs on the same datasets. The breadth of available samplers introduces many hyperparameters (SDE type, predictor, corrector, signal-to-noise ratio, number of steps) that require tuning.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>CIFAR-10</strong>: Used for main benchmarking (FID, Inception Score, NLL).</li>
<li><strong>CelebA-HQ</strong>: Used for high-resolution experiments at 256x256 and 1024x1024.</li>
<li><strong>LSUN</strong>: Bedroom and Church Outdoor categories (256x256) used for sampler comparison and controllable generation (inpainting, colorization).</li>
<li><strong>Preprocessing</strong>: CIFAR-10 images are 32x32; CelebA pre-processed to 64x64 following Song &amp; Ermon (2020). Data is typically scaled to $[0, 1]$ or standardized depending on the specific SDE config.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Forward SDEs</strong>:</p>
<p>Here $dw$ denotes a Wiener process increment (a small, independent Gaussian noise burst at each timestep).</p>
<ul>
<li><strong>VE SDE (Variance Exploding)</strong>: $dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}} dw$. Corresponds to SMLD. Used with $\sigma_{\min}=0.01$ and $\sigma_{\max}$ chosen via heuristics.</li>
<li><strong>VP SDE (Variance Preserving)</strong>: $dx = -\frac{1}{2}\beta(t)x dt + \sqrt{\beta(t)} dw$. Corresponds to DDPM.</li>
<li><strong>Sub-VP SDE</strong>: $dx = -\frac{1}{2}\beta(t)x dt + \sqrt{\beta(t)(1 - e^{-2\int_0^t \beta(s)ds})} dw$. Bounded variance, good for likelihoods.</li>
</ul>
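<p>Because all three forward processes have Gaussian perturbation kernels, training pairs $(x_0, x_t)$ can be sampled in closed form. A sketch for the VP SDE with a linear $\beta(t)$ schedule; the $\beta$ endpoints shown are the commonly used CIFAR-10 values, treated here as assumptions:</p>

```python
import numpy as np

def vp_perturb(x0, t, beta_min=0.1, beta_max=20.0, rng=None):
    """Sample x_t ~ p_t(x_t | x_0) in closed form for the VP SDE with a
    linear schedule beta(t) = beta_min + t * (beta_max - beta_min)."""
    rng = rng or np.random.default_rng()
    # log of the mean coefficient: -0.5 * int_0^t beta(s) ds for the linear schedule
    log_mean_coeff = -0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min
    mean = np.exp(log_mean_coeff) * np.asarray(x0, dtype=float)
    std = np.sqrt(1.0 - np.exp(2.0 * log_mean_coeff))
    return mean + std * rng.standard_normal(np.shape(x0))
```

<p>At $t=0$ the sample equals $x_0$; at $t=1$ the marginal is approximately standard normal, which is what makes the Gaussian prior a valid starting point for reverse-time sampling.</p>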
<p><strong>Reverse-Time SDE Solver (Predictor)</strong>:</p>
<ul>
<li>Discretized via <strong>Reverse Diffusion Sampling</strong>, which matches the forward discretization.</li>
<li><strong>Euler-Maruyama</strong> solver used for continuously-trained models.</li>
</ul>
<p><strong>Corrector Algorithm</strong>:</p>
<ul>
<li><strong>Langevin MCMC</strong>: Applies annealed Langevin dynamics: adds noise and takes a score-guided gradient step to correct the marginal distribution at each timestep.</li>
<li><strong>PC Sampling</strong>: Alternates between one step of the Predictor and one step of the Corrector.</li>
<li><strong>Signal-to-Noise Ratio ($r$)</strong>: A hyperparameter for the corrector step size. Tuned values: $r \approx 0.16$ for VE SDEs on CIFAR-10.</li>
</ul>
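<p>One PC iteration for a VE SDE can be sketched as follows. The <code>score_fn(x, sigma)</code> interface is an assumption, and the Langevin step size follows the SNR-based heuristic with ratio $r$:</p>

```python
import numpy as np

def pc_sample_step(x, sigma_t, sigma_next, score_fn, r=0.16, rng=None):
    """One Predictor-Corrector iteration for a VE SDE (sketch)."""
    rng = rng or np.random.default_rng()
    # Corrector: one annealed Langevin MCMC step; step size eps set by the SNR r
    grad = score_fn(x, sigma_t)
    z = rng.standard_normal(x.shape)
    eps = 2.0 * (r * np.linalg.norm(z) / np.linalg.norm(grad)) ** 2
    x = x + eps * grad + np.sqrt(2.0 * eps) * z
    # Predictor: reverse-diffusion discretization of the VE reverse SDE
    tau = sigma_t**2 - sigma_next**2
    x = x + tau * score_fn(x, sigma_t) + np.sqrt(tau) * rng.standard_normal(x.shape)
    return x
```

<p>A full sampler would loop this step over a decreasing noise schedule from $\sigma_{\max}$ down to $\sigma_{\min}$.</p>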
<h3 id="models">Models</h3>
<ul>
<li><strong>NCSN++</strong>: Optimized architecture for VE SDEs. Key features:
<ul>
<li>4 residual blocks per resolution.</li>
<li>BigGAN-type residual blocks.</li>
<li>Rescaling skip connections by $1/\sqrt{2}$.</li>
<li>FIR (Finite Impulse Response) up/downsampling.</li>
<li>&ldquo;Residual&rdquo; progressive architecture for input, no progressive growing for output.</li>
</ul>
</li>
<li><strong>DDPM++</strong>: Optimized architecture for VP/sub-VP SDEs. Similar to NCSN++ but without FIR upsampling and no progressive growing.</li>
<li><strong>Deep Variants</strong>: &ldquo;cont. (deep)&rdquo; models double the depth (from 4 to 8 blocks per resolution) for the best reported results.</li>
<li><strong>Conditioning</strong>: Time $t$ is conditioned via random Fourier feature embeddings (scale 16) for continuous models.</li>
</ul>
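<p>The random Fourier feature conditioning on $t$ can be sketched as below; the frequency vector is drawn once at initialization and kept frozen (the embedding dimension here is illustrative):</p>

```python
import numpy as np

def fourier_time_embedding(t, dim=8, scale=16.0, seed=0):
    """Random Fourier features of continuous time t; the Gaussian frequency
    vector W is drawn once and frozen (seed fixed here for reproducibility)."""
    W = np.random.default_rng(seed).standard_normal(dim // 2) * scale
    angles = 2.0 * np.pi * np.atleast_1d(t)[:, None] * W[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

<p>The large scale (16) spreads the frequencies widely so the network can resolve small differences in $t$ across the whole $[0, 1]$ interval.</p>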
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>FID (Fréchet Inception Distance)</strong>: Computed on 50k samples.</li>
<li><strong>Inception Score</strong>: Reported for CIFAR-10.</li>
<li><strong>NLL (Negative Log-Likelihood)</strong>: Reported in bits/dim on uniformly dequantized data using the Probability Flow ODE.</li>
</ul>
<p><strong>Denoising</strong>: A single denoising step using Tweedie&rsquo;s formula is applied at the end of sampling to remove residual noise, which significantly improves FID.</p>
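<p>For a VE model this final step is a one-liner, $\mathbb{E}[x_0 \mid x_t] = x_t + \sigma^2 \nabla_x \log p(x_t)$. A minimal sketch, with <code>score_fn</code> an assumed interface:</p>

```python
def tweedie_denoise(x, sigma, score_fn):
    """Tweedie's formula for a VE model: E[x_0 | x_t] = x_t + sigma^2 * score(x_t)."""
    return x + sigma**2 * score_fn(x, sigma)
```

<p>When the score is exact, this recovers the posterior mean of the clean sample; e.g. for data concentrated at a single point, one step lands exactly on that point.</p>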
<h3 id="hardware">Hardware</h3>
<p><strong>Training</strong>:</p>
<ul>
<li>Batch size: 128 for CIFAR-10, 64 for LSUN, 8 for high-res CelebA-HQ.</li>
<li>Iterations: Discrete-objective models trained for 1.3M iterations during architecture exploration. Continuous-objective models (cont.) trained for 0.95M iterations. High-res CelebA-HQ (1024x1024) trained for approximately 2.4M iterations.</li>
<li><strong>EMA</strong>: Exponential Moving Average rate of 0.999 used for VE models, 0.9999 for VP models.</li>
</ul>
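<p>The EMA update used for evaluation weights is a one-liner per parameter; a plain-Python sketch over flat lists (frameworks apply the same rule per tensor):</p>

```python
def ema_update(ema_params, params, rate=0.999):
    """One EMA step: ema <- rate * ema + (1 - rate) * current.
    rate = 0.999 for VE models, 0.9999 for VP models."""
    return [rate * e + (1.0 - rate) * p for e, p in zip(ema_params, params)]
```

<p>Higher rates average over more iterations, which smooths out optimization noise at the cost of lagging further behind the live weights.</p>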
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/yang-song/score_sde">yang-song/score_sde</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official JAX and PyTorch implementation with pretrained checkpoints</td>
      </tr>
  </tbody>
</table>
<p>All datasets used (CIFAR-10, CelebA-HQ, LSUN) are publicly available. Pretrained model checkpoints for CIFAR-10, CelebA-HQ, and FFHQ are provided in the repository. Specific hardware requirements (GPU type, training time) are not detailed in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., &amp; Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. <em>ICLR 2021</em>. <a href="https://arxiv.org/abs/2011.13456">https://arxiv.org/abs/2011.13456</a></p>
<p><strong>Publication</strong>: ICLR 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{song2021scorebased,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{Score-Based Generative Modeling through Stochastic Differential Equations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Song, Yang and Sohl-Dickstein, Jascha and Kingma, Diederik P and Kumar, Abhishek and Ermon, Stefano and Poole, Ben}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>       = <span style="color:#e6db74">{https://openreview.net/forum?id=PxTIG12RRHS}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/yang-song/score_sde">GitHub Repository</a></li>
<li><a href="/notes/machine-learning/generative-models/score-matching-denoising-autoencoders/">Score Matching and Denoising Autoencoders</a></li>
</ul>
]]></content:encoded></item><item><title>Rectified Flow: Learning to Generate and Transfer Data</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/rectified-flow/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/rectified-flow/</guid><description>A unified ODE-based framework for generative modeling and domain transfer that learns straight paths for fast 1-step generation.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, with a significant <strong>Theory</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes &ldquo;Rectified Flow,&rdquo; a novel generative framework that learns ordinary differential equations (ODEs) to transport distributions via straight paths. It introduces the &ldquo;Reflow&rdquo; algorithm to iteratively straighten these paths.</li>
<li><strong>Theory</strong>: It provides rigorous proofs connecting the method to Optimal Transport, showing that the rectification process yields a coupling with non-increasing convex transport costs and that recursive reflow reduces the curvature of trajectories.</li>
</ul>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The work addresses two main challenges in unsupervised learning: generative modeling (generating data from noise) and domain transfer (mapping between two observed distributions).</p>
<ul>
<li><strong>Inefficiency of ODE/SDE Models</strong>: Continuous-time models (like Score-based Generative Models and DDPMs) require simulating diffusions over many steps, resulting in high computational costs during inference.</li>
<li><strong>Complexity of GANs</strong>: GANs offer fast (one-step) generation but suffer from training instability and mode collapse.</li>
<li><strong>Disconnection</strong>: Generative modeling and domain transfer are often treated as separate tasks requiring different techniques.</li>
</ul>
<p>The authors aim to unify these tasks into a single &ldquo;transport mapping&rdquo; problem while bridging the gap between high-quality continuous models and fast one-step models.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Rectified Flow</strong> framework and the <strong>Reflow</strong> procedure.</p>
<ul>
<li><strong>Straight-Line ODEs</strong>: Rectified Flow learns an ODE drift $v$ to follow the straight line connecting data pairs $(X_0, X_1)$, providing an alternative to diffusion models that rely on stochastic paths or specific forward processes. This is achieved via a simple least-squares optimization problem.</li>
<li><strong>Reflow (Iterative Straightening)</strong>: The authors introduce a recursive training procedure where a new flow is trained on the data pairs $(Z_0, Z_1)$ generated by the previous flow. Theoretical analysis shows this reduces the &ldquo;transport cost&rdquo; and straightens the trajectories, allowing for accurate 1-step simulation (effectively converting the ODE into a one-step model).</li>
<li><strong>Unified Framework</strong>: The method uses the exact same algorithm for generation ($\pi_0$ is Gaussian) and domain transfer ($\pi_0$ is a source dataset), removing the need for adversarial losses or cycle-consistency constraints.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the method across image generation, translation, and domain adaptation tasks.</p>
<ul>
<li><strong>Unconditioned Image Generation</strong>:
<ul>
<li><strong>Dataset</strong>: CIFAR-10 ($32\times32$).</li>
<li><strong>Baselines</strong>: Compared against GANs (StyleGAN2, TDPM), Diffusion/SDE Models (VP SDE, sub-VP SDE, VE SDE), ODE methods (VP ODE, sub-VP ODE, VE ODE), and distilled methods (DDIM Distillation).</li>
<li><strong>High-Res</strong>: Validated on LSUN Bedroom/Church, CelebA-HQ, and AFHQ ($256\times256$).</li>
</ul>
</li>
<li><strong>Image-to-Image Translation</strong>:
<ul>
<li><strong>Datasets</strong>: AFHQ (Cat $\leftrightarrow$ Dog/Wild), MetFace $\leftrightarrow$ CelebA-HQ.</li>
<li><strong>Setup</strong>: Transferring styles while preserving semantic identity (using a classifier-based feature mapping metric).</li>
</ul>
</li>
<li><strong>Domain Adaptation</strong>:
<ul>
<li><strong>Datasets</strong>: DomainNet, Office-Home.</li>
<li><strong>Metric</strong>: Classification accuracy on the transferred testing data.</li>
</ul>
</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Superior 1-Step Generation</strong>: On CIFAR-10 with a single Euler step (as of ICLR 2023), the distilled 2-Rectified Flow achieved an FID of <strong>4.85</strong>, beating TDPM (FID 8.91, a truncated diffusion model with a GAN component), the best prior one-step U-Net-based model. The distilled 3-Rectified Flow reached a Recall of <strong>0.51</strong>, beating the GAN baseline StyleGAN2+ADA (Recall 0.49).</li>
<li><strong>Straightening Effect</strong>: The &ldquo;Reflow&rdquo; procedure was empirically shown to reduce the &ldquo;straightness&rdquo; error and transport costs, validating the theoretical claims. &ldquo;Straightness&rdquo; is measured as $S(Z) = \mathbb{E}[\int_0^1 \|\dot{Z}_t - (Z_1 - Z_0)\|^2 \, dt]$ (zero means perfectly straight); &ldquo;transport cost&rdquo; is $\mathbb{E}[c(Z_1 - Z_0)]$ for a convex cost $c$, and Reflow reduces this for all convex costs.</li>
<li><strong>High-Quality Transfer</strong>: The model successfully performed image translation (e.g., Cat to Wild Animal) without paired data or cycle-consistency losses.</li>
<li><strong>Strong Full-Simulation Results</strong>: With RK45 adaptive ODE solving, 1-Rectified Flow achieves FID 2.58 and Recall 0.57 on CIFAR-10 (Table 1a), the best among ODE methods and comparable to fully simulated SDEs (VP SDE: FID 2.55).</li>
<li><strong>Fast Simulation</strong>: The method allows for extremely coarse time discretization (e.g., $N=1$) without significant quality loss after reflow, effectively solving the slow inference speed of standard ODE models.</li>
<li><strong>Domain Adaptation</strong>: On Office-Home, Rectified Flow achieves 69.2% accuracy, outperforming Deep CORAL (68.7%) and other baselines. On DomainNet, it achieves 41.4%, comparable to Deep CORAL (41.5%) and MLDG (41.2%).</li>
</ul>
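<p>The straightness measure can be estimated from any discretized trajectory by finite differences; a sketch assuming a trajectory array sampled at increasing times:</p>

```python
import numpy as np

def straightness(traj, ts):
    """Finite-difference estimate of S(Z) = int_0^1 ||Zdot_t - (Z_1 - Z_0)||^2 dt
    from a trajectory `traj` of shape (T, d) sampled at times `ts`."""
    disp = traj[-1] - traj[0]                  # total displacement Z_1 - Z_0
    dt = np.diff(ts)
    vel = np.diff(traj, axis=0) / dt[:, None]  # finite-difference velocity
    return float(np.sum(np.sum((vel - disp) ** 2, axis=-1) * dt))
```

<p>A perfectly straight path scores exactly zero, since its velocity is constant and equal to the displacement; any curvature contributes positively.</p>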
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper utilizes several standard computer vision benchmarks.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size/Resolution</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generation</td>
          <td><strong>CIFAR-10</strong></td>
          <td>32x32</td>
          <td>Standard split</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td><strong>LSUN</strong> (Bedroom, Church)</td>
          <td>256x256</td>
          <td>High-res evaluation</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td><strong>CelebA-HQ</strong></td>
          <td>256x256</td>
          <td>High-res evaluation</td>
      </tr>
      <tr>
          <td>Gen/Transfer</td>
          <td><strong>AFHQ</strong> (Cat, Dog, Wild)</td>
          <td>512x512</td>
          <td>256x256 for generation, 512x512 for transfer</td>
      </tr>
      <tr>
          <td>Transfer</td>
          <td><strong>MetFace</strong></td>
          <td>1024x1024</td>
          <td>Resized to 512x512 for experiments</td>
      </tr>
      <tr>
          <td>Adaptation</td>
          <td><strong>DomainNet</strong></td>
          <td>Mixed</td>
          <td>345 categories, 6 domains</td>
      </tr>
      <tr>
          <td>Adaptation</td>
          <td><strong>Office-Home</strong></td>
          <td>Mixed</td>
          <td>65 categories, 4 domains</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>
<p><strong>Objective Function</strong>:
The drift $v(Z_t, t)$ is trained by minimizing a least-squares regression objective:
$$\min_{v} \int_{0}^{1} \mathbb{E}\left[\|(X_1 - X_0) - v(X_t, t)\|^2\right] dt$$
where $X_t = tX_1 + (1-t)X_0$ is the linear interpolation.</p>
</li>
<li>
<p><strong>Reflow Procedure</strong>:
Iteratively updates the flow. Let $Z^k$ be the $k$-th rectified flow.</p>
<ol>
<li>Generate 4 million data pairs $(Z_0^k, Z_1^k)$ by simulating the current flow.</li>
<li>Fine-tune the $k$-rectified flow model for 300,000 steps on these pairs to obtain the $(k+1)$-rectified flow.</li>
</ol>
</li>
<li>
<p><strong>Distillation</strong>:
For 1-step distillation ($k=1$), the L2 loss is replaced with LPIPS perceptual similarity, which empirically yields better image quality. For multi-step distillation, the training time $t$ is sampled from $\{0, 1/k, \ldots, (k-1)/k\}$ rather than the full $[0, 1]$ interval.</p>
</li>
<li>
<p><strong>ODE Solver</strong>:</p>
<ul>
<li>Training: Analytical linear interpolation.</li>
<li>Inference: Euler method (constant step size $1/N$) or RK45 (adaptive).</li>
</ul>
</li>
</ul>
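<p>The objective and Euler simulation above reduce to a few lines; a sketch assuming the drift model is exposed as a callable <code>v(x, t)</code>:</p>

```python
import numpy as np

def rf_loss_batch(v, x0, x1, rng=None):
    """Monte-Carlo estimate of E_t ||(X1 - X0) - v(X_t, t)||^2 with
    X_t = t*X1 + (1-t)*X0 the linear interpolation."""
    rng = rng or np.random.default_rng()
    t = rng.uniform(size=(x0.shape[0], 1))
    xt = t * x1 + (1.0 - t) * x0
    resid = (x1 - x0) - v(xt, t)
    return float(np.mean(np.sum(resid**2, axis=-1)))

def euler_sample(v, x0, n_steps=1):
    """Simulate dZ_t = v(Z_t, t) dt with constant step 1/N; N = 1 is the
    one-step regime that reflow makes accurate."""
    x, dt = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        x = x + v(x, np.full((x.shape[0], 1), i * dt)) * dt
    return x
```

<p>If the learned drift is exactly constant along each path (a perfectly straight flow), the loss is zero and a single Euler step is exact, which is the intuition behind reflow.</p>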
<h3 id="models">Models</h3>
<ul>
<li>
<p><strong>Architecture</strong>:</p>
<ul>
<li>Uses the <strong>DDPM++ U-Net</strong> architecture (from Song et al., 2020) across experiments. Implementation is modified from the open-source code of Song et al.</li>
</ul>
</li>
<li>
<p><strong>Optimization</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: Adam (CIFAR-10) or AdamW (Transfer/Adaptation).</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>LR: $2 \times 10^{-4}$ (CIFAR), Grid search for transfer.</li>
<li>EMA: 0.999999 (CIFAR), 0.9999 (Transfer).</li>
<li>Batch Size: 4 (Transfer), 16 (Domain Adaptation).</li>
<li>Dropout: 0.15 (CIFAR), 0.1 (Transfer).</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (CIFAR-10, N=1)</th>
          <th>Baseline (Best 1-step)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>FID</strong></td>
          <td><strong>4.85</strong> (2-Rectified + Distill)</td>
          <td>8.91 (TDPM)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td><strong>Recall</strong></td>
          <td><strong>0.51</strong> (3-Rectified + Distill)</td>
          <td>0.49 (StyleGAN2+ADA)</td>
          <td>Higher is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU models or training times. The DDPM++ U-Net architecture used in the experiments typically requires multi-GPU setups for training on high-resolution datasets.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gnobitab/RectifiedFlow">RectifiedFlow (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation with CIFAR-10 and high-res training code, plus pre-trained checkpoints</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Gong, C., &amp; Liu, Q. (2023). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. <em>International Conference on Learning Representations (ICLR)</em>. <a href="https://openreview.net/forum?id=XVjTT1nw5z">https://openreview.net/forum?id=XVjTT1nw5z</a></p>
<p><strong>Publication</strong>: ICLR 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liuFlowStraightFast2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Flow {{Straight}} and {{Fast}}: {{Learning}} to {{Generate}} and {{Transfer Data}} with {{Rectified Flow}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Liu, Xingchao and Gong, Chengyue and Liu, Qiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://openreview.net/forum?id=XVjTT1nw5z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/gnobitab/RectifiedFlow">Official Code Repository</a></li>
<li><a href="https://openreview.net/forum?id=XVjTT1nw5z">OpenReview Page</a></li>
</ul>
]]></content:encoded></item><item><title>Neural ODEs: Continuous-Depth Deep Learning Models</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/neural-odes/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/neural-odes/</guid><description>Introduces ODE-Nets, a continuous-depth neural network model parameterized by ODEs, enabling constant memory backpropagation and adaptive computation.</description><content:encoded><![CDATA[<blockquote>
<p><strong>Key Prerequisites</strong>: Before diving in, note that for the ODE solver to guarantee a unique solution, the neural network $f(h(t), t, \theta)$ parameterizing the dynamics must be <a href="https://en.wikipedia.org/wiki/Lipschitz_continuity">Lipschitz continuous</a>. This ensures the <a href="https://en.wikipedia.org/wiki/Picard%E2%80%93Lindel%C3%B6f_theorem">Picard-Lindelöf theorem</a> holds, preventing trajectories from crossing and guaranteeing a well-defined backward pass.</p></blockquote>
<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong secondary <strong>Theory</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel family of deep neural network models where the derivative of the hidden state is parameterized by a neural network. It provides specific algorithms (Algorithm 1) for training these models scalably.</li>
<li><strong>Theory</strong>: It derives the adjoint sensitivity method for backpropagating through black-box ODE solvers and proves the &ldquo;Instantaneous Change of Variables&rdquo; theorem (Theorem 1) for continuous normalizing flows.</li>
</ul>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The authors aim to address limitations in discrete deep learning architectures:</p>
<ul>
<li><strong>Discrete vs. Continuous</strong>: Existing models like Residual Networks build transformations by composing discrete steps, which can be seen as an Euler discretization of a continuous transformation. The authors investigate the limit as step sizes go to zero.</li>
<li><strong>Memory Efficiency</strong>: Backpropagating through deep discrete networks requires storing intermediate activations, leading to linear memory cost in terms of depth, which is a major bottleneck.</li>
<li><strong>Irregular Data</strong>: Recurrent Neural Networks (RNNs) struggle with data arriving at arbitrary times, typically requiring discretization into fixed bins.</li>
<li><strong>Normalizing Flow Costs</strong>: Standard normalizing flows have a bottleneck in computing the determinant of the Jacobian, which is computationally expensive ($O(D^3)$).</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core contribution is the <strong>Neural ODE</strong> formulation:
$$\frac{dh(t)}{dt} = f(h(t), t, \theta)$$
where the output is computed using a black-box differential equation solver.</p>
<p>Key technical innovations include:</p>
<ol>
<li><strong>Adjoint Sensitivity Method for Backprop</strong>: The authors treat the solver as a black box and compute gradients by solving a second, augmented ODE backwards in time. This allows for <strong>constant memory cost</strong> regardless of depth.</li>
<li><strong>Adaptive Computation</strong>: The model uses modern ODE solvers that adapt evaluation steps based on error tolerance, allowing the model to trade precision for speed explicitly.</li>
<li><strong>Continuous Normalizing Flows (CNF)</strong>: By moving to continuous time, the change of variables formula simplifies from a log-determinant (cubic cost) to a trace operation (linear cost), enabling scalable generative modeling.</li>
<li><strong>Latent ODEs</strong>: A generative time-series model that represents time-series as latent trajectories determined by a local initial state and global shared dynamics, handling irregular sampling naturally.</li>
</ol>
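<p>The core idea can be sketched without any autograd machinery: replace a stack of residual blocks with a solver integrating learned dynamics. Here a fixed-step RK4 integrator stands in for the paper's adaptive black-box solver, and the dynamics function is an illustrative placeholder:</p>

```python
import numpy as np

def rk4_step(f, h, t, dt, theta):
    """One classical Runge-Kutta 4 step for dh/dt = f(h, t, theta)."""
    k1 = f(h, t, theta)
    k2 = f(h + 0.5 * dt * k1, t + 0.5 * dt, theta)
    k3 = f(h + 0.5 * dt * k2, t + 0.5 * dt, theta)
    k4 = f(h + dt * k3, t + dt, theta)
    return h + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def odenet_forward(f, h0, theta, t0=0.0, t1=1.0, n_steps=10):
    """'Depth' becomes an integration interval [t0, t1]: more solver steps
    act like a deeper network at no additional parameter cost."""
    h, t, dt = h0, t0, (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = rk4_step(f, h, t, dt, theta)
        t += dt
    return h
```

<p>With linear dynamics $f(h) = \theta h$ and $\theta = 1$, the forward pass approximates $h(1) = e \cdot h(0)$, so solver accuracy directly controls model output accuracy.</p>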
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the method across three distinct domains:</p>
<ol>
<li><strong>Supervised Learning (MNIST)</strong>:
<ul>
<li>Compared <strong>ODE-Net</strong> against a standard <strong>ResNet</strong> and a Runge-Kutta network (<strong>RK-Net</strong>).</li>
<li>Measured test error, parameter count, and memory usage.</li>
<li>Analyzed the trade-off between numerical precision (tolerance) and speed (NFE).</li>
</ul>
</li>
<li><strong>Continuous Normalizing Flows (Generative)</strong>:
<ul>
<li>Compared CNF against standard Normalizing Flows (NF) on density matching and maximum likelihood estimation tasks using toy 2D datasets (Two Circles, Two Moons, and other target distributions).</li>
<li>Evaluated training loss (KL divergence) and maximum likelihood estimation.</li>
</ul>
</li>
<li><strong>Time-Series Modeling (Latent ODE)</strong>:
<ul>
<li>Tested on a dataset of bi-directional spirals with irregular timestamps and Gaussian noise.</li>
<li>Compared Latent ODEs against an RNN baseline on predictive RMSE. A second RNN variant with time-difference concatenation was also trained.</li>
</ul>
</li>
</ol>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Efficiency</strong>: ODE-Nets achieved roughly equivalent accuracy to ResNets on MNIST (0.42% vs 0.41% error) but with <strong>constant memory cost</strong> ($O(1)$) compared to ResNet&rsquo;s linear cost ($O(L)$).</li>
<li><strong>Adaptive Depth</strong>: The number of function evaluations (NFE) in ODE-Nets increases with training epoch, suggesting the model adapts its complexity as it learns. The backward pass NFE is roughly half the forward pass NFE, indicating that the adjoint method is also more computationally efficient than direct backpropagation through the integrator.</li>
<li><strong>Generative Performance</strong>: Continuous Normalizing Flows (CNF) achieved lower KL divergence loss than standard Normalizing Flows (NF), trained with only 10,000 iterations (Adam) compared to 500,000 iterations (RMSprop) for NF. Note that the two models used different optimizers, so the comparison is not fully controlled. CNF can also expand capacity by increasing width ($M$) without architectural constraints.</li>
<li><strong>Irregular Time-Series</strong>: Latent ODEs significantly outperformed RNNs across all observation counts on irregular spiral data. The advantage is most pronounced with sparse observations (0.1642 vs 0.3937 RMSE at 30 obs), and the model learns interpretable latent trajectories that switch direction smoothly.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>MNIST</strong>: Standard handwritten digit dataset used for supervised learning benchmarks.</li>
<li><strong>Toy 2D Densities</strong>: &ldquo;Two Circles&rdquo; and &ldquo;Two Moons&rdquo; distributions used for visualizing normalizing flows.</li>
<li><strong>Bi-directional Spirals</strong>: A generated dataset of 1,000 2D spirals (half clockwise, half counter-clockwise). Each spiral is sampled at 100 equally-spaced timesteps with added Gaussian noise. For training, each spiral is then subsampled without replacement to $n \in \{30, 50, 100\}$ irregularly-spaced observations, simulating realistic missing data.</li>
</ul>
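<p>A sketch of the spiral data generation; the exact radii and time range are assumptions, while the structure (half clockwise, equally-spaced sampling, Gaussian noise, irregular subsampling) follows the description above:</p>

```python
import numpy as np

def make_spirals(n=1000, n_t=100, noise=0.1, n_obs=30, rng=None):
    """Bi-directional spirals: half clockwise, half counter-clockwise,
    n_t equally-spaced timesteps, Gaussian noise, then irregular
    subsampling to n_obs points per spiral."""
    rng = rng or np.random.default_rng(0)
    t = np.linspace(0.5, 3 * np.pi, n_t)
    data = []
    for i in range(n):
        direction = 1.0 if i < n // 2 else -1.0   # cw vs. ccw
        r = 0.5 * t                                # illustrative radius growth
        xy = np.stack([r * np.cos(direction * t), r * np.sin(direction * t)], axis=-1)
        xy = xy + noise * rng.standard_normal(xy.shape)
        idx = np.sort(rng.choice(n_t, size=n_obs, replace=False))
        data.append((t[idx], xy[idx]))
    return data
```

<p>Subsampling without replacement and then sorting yields strictly increasing, irregularly spaced observation times per spiral, which is exactly the setting where Latent ODEs have an advantage over fixed-bin RNNs.</p>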
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Adjoint Sensitivity Method (Backpropagation)</strong></p>
<p>To optimize the parameters of the ODE-Net, the authors use the adjoint sensitivity method to compute gradients. Standard backpropagation would require storing the activations at every step of the ODE solver, incurring a high memory cost that scales linearly with the number of steps.</p>
<p>Instead, this method treats the ODE solver as a &ldquo;black box&rdquo; and computes gradients by solving a second, <strong>augmented ODE</strong> backwards in time from the final state $t_1$ to the initial state $t_0$.</p>
<p>The augmented state contains three components that are solved simultaneously:</p>
<ol>
<li><strong>The State</strong>: The original hidden state $z(t)$, which is reconstructed backwards.</li>
<li><strong>The Adjoint</strong>: The sensitivity of the loss with respect to the state, $a(t) = \partial L / \partial z(t)$.</li>
<li><strong>The Gradient</strong>: The accumulating gradients with respect to parameters, $\partial L / \partial \theta$.</li>
</ol>
<p>The dynamics of this augmented system are defined as:
$$\frac{d}{dt}\begin{bmatrix} z(t) \\ a(t) \\ \partial L/\partial \theta \end{bmatrix} = \begin{bmatrix} f(z(t), t, \theta) \\ -a(t)^T \frac{\partial f}{\partial z} \\ -a(t)^T \frac{\partial f}{\partial \theta} \end{bmatrix}$$</p>
<p>Using this approach, the vector-Jacobian products (e.g., $a(t)^T \frac{\partial f}{\partial z}$) are evaluated efficiently using automatic differentiation.</p>
<blockquote>
<p><strong>Why:</strong> Reconstructing $z(t)$ backwards avoids storing the forward pass, enabling <strong>constant memory cost</strong> ($O(1)$) regardless of depth.</p>
<p><strong>Origin:</strong> Adapted from Pontryagin&rsquo;s maximum principle (1962) for optimal control.</p></blockquote>
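<p>For intuition, one backward-in-time Euler step of this augmented system can be written directly with autograd vector-Jacobian products (an illustrative sketch with a toy $f$ and a single scalar parameter, not the adaptive-solver implementation):</p>

```python
import torch

def adjoint_euler_step(f, z, a, grad_theta, t, dt, theta):
    """One backward Euler step of the augmented adjoint ODE.
    z: state, a: adjoint dL/dz, grad_theta: accumulated dL/dtheta."""
    z = z.detach().requires_grad_(True)
    fz = f(z, t, theta)
    # Vector-Jacobian products a^T df/dz and a^T df/dtheta in one autograd call
    a_dfdz, a_dfdth = torch.autograd.grad(fz, (z, theta), grad_outputs=a)
    return (z.detach() - dt * fz.detach(),   # reconstruct z backwards: dz/dt = f
            a + dt * a_dfdz,                 # da/dt = -a^T df/dz
            grad_theta + dt * a_dfdth)       # d(dL/dtheta)/dt = -a^T df/dtheta

# Toy dynamics f(z, t, theta) = theta * z, so df/dz = theta*I and df/dtheta = z
theta = torch.tensor(2.0, requires_grad=True)
z, a = torch.ones(3), torch.ones(3)
z1, a1, g1 = adjoint_euler_step(lambda z, t, th: th * z, z, a,
                                torch.tensor(0.0), 1.0, 0.1, theta)
```

<p>The plus signs arise because the step moves from $t$ to $t - dt$, so the minus signs in the augmented dynamics flip.</p>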
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> torch
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> torch.nn <span style="color:#66d9ef">as</span> nn
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> torchdiffeq <span style="color:#f92672">import</span> odeint_adjoint
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ODEFunc</span>(nn<span style="color:#f92672">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__init__</span>(self, dim):
</span></span><span style="display:flex;"><span>        super(ODEFunc, self)<span style="color:#f92672">.</span><span style="color:#a6e22e">__init__</span>()
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>net <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Sequential(
</span></span><span style="display:flex;"><span>            nn<span style="color:#f92672">.</span>Linear(dim, <span style="color:#ae81ff">50</span>),
</span></span><span style="display:flex;"><span>            nn<span style="color:#f92672">.</span>Tanh(),
</span></span><span style="display:flex;"><span>            nn<span style="color:#f92672">.</span>Linear(<span style="color:#ae81ff">50</span>, dim),
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, t, y):
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Defines dy/dt = f(y, t)</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>net(y)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Usage with adjoint method for O(1) memory backprop</span>
</span></span><span style="display:flex;"><span>func <span style="color:#f92672">=</span> ODEFunc(dim<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>y0 <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>tensor([[<span style="color:#ae81ff">1.</span>, <span style="color:#ae81ff">0.</span>]]) <span style="color:#75715e"># Initial state</span>
</span></span><span style="display:flex;"><span>t <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>linspace(<span style="color:#ae81ff">0.</span>, <span style="color:#ae81ff">1.</span>, <span style="color:#ae81ff">10</span>) <span style="color:#75715e"># Time points to solve for</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># &#39;odeint_adjoint&#39; automatically handles the augmented state backward pass</span>
</span></span><span style="display:flex;"><span>out <span style="color:#f92672">=</span> odeint_adjoint(func, y0, t, method<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;dopri5&#39;</span>)
</span></span></code></pre></div><p><strong>2. Instantaneous Change of Variables (CNF)</strong></p>
<p>For generative modeling, the authors introduce <strong>Continuous Normalizing Flows (CNF)</strong>. In discrete normalizing flows, the probability density of a transformed variable is calculated using the change of variables theorem, which requires computing the log-determinant of the Jacobian: $\log p(z_1) = \log p(z_0) - \log |\det \frac{\partial z_1}{\partial z_0}|$. This operation is computationally expensive ($O(D^3)$) and often restricts model architectures to ensure the Jacobian is easy to compute (e.g., triangular).</p>
<p>Moving to continuous time simplifies this requirement. The paper proves that if the transformation is defined by an ODE, the change in log-probability follows a differential equation determined by the <strong>trace</strong> of the Jacobian:
$$\frac{\partial \log p(z(t))}{\partial t} = -\text{tr}\left( \frac{\partial f}{\partial z(t)} \right)$$</p>
<p>The total change in log-density is obtained by integrating this value over time.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_trace</span>(y, f):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes trace of Jacobian df/dy.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    For high dimensions, use Hutchinson&#39;s trace estimator (approximate).
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    tr <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(y<span style="color:#f92672">.</span>size(<span style="color:#ae81ff">1</span>)):
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Gradients of f&#39;s i-th component w.r.t y&#39;s i-th component</span>
</span></span><span style="display:flex;"><span>        tr <span style="color:#f92672">+=</span> torch<span style="color:#f92672">.</span>autograd<span style="color:#f92672">.</span>grad(f[:, i]<span style="color:#f92672">.</span>sum(), y, create_graph<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)[<span style="color:#ae81ff">0</span>][:, i]
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> tr
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># In the ODE function:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># d(log_p)/dt = -trace(df/dy)</span>
</span></span></code></pre></div><blockquote>
<p><strong>Why:</strong> The trace operator has <strong>linear cost</strong> ($O(D)$), whereas the determinant has cubic cost ($O(D^3)$). This allows for unrestricted, &ldquo;wide&rdquo; architectures that are automatically bijective.</p>
<p><strong>Origin:</strong> This is the &ldquo;Instantaneous Change of Variables&rdquo; theorem (Theorem 1), derived in Appendix A of the paper.</p></blockquote>
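<p>The Hutchinson estimator mentioned in the docstring replaces the per-dimension loop of gradient calls with a single vector-Jacobian product per noise sample (a sketch assuming Rademacher noise; not code from the paper):</p>

```python
import torch

def hutchinson_trace(f, y, n_samples=10):
    """Unbiased estimate of tr(df/dy) via tr(J) = E[eps^T J eps],
    using Rademacher noise eps (entries +/-1, so E[eps eps^T] = I)."""
    tr = 0.
    for _ in range(n_samples):
        eps = (torch.randint(0, 2, y.shape) * 2 - 1).to(y.dtype)
        # One vector-Jacobian product eps^T (df/dy) instead of D gradient calls
        vjp = torch.autograd.grad(f, y, grad_outputs=eps, create_graph=True)[0]
        tr = tr + (vjp * eps).sum(dim=1)
    return tr / n_samples

y = torch.randn(2, 4, requires_grad=True)
f = 3.0 * y               # Jacobian is 3I, so the exact trace is 12
est = hutchinson_trace(f, y, n_samples=3)
```

<p>For this linear map the estimator is exact on every draw, since $\epsilon^T (cI) \epsilon = cD$ for Rademacher $\epsilon$; for general $f$ it is unbiased, with variance shrinking as samples are added.</p>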
<h3 id="models">Models</h3>
<p><strong>ODE-Net (MNIST Classification)</strong>:</p>
<ul>
<li><strong>Input</strong>: Two downsampling layers reduce the spatial resolution before the continuous-depth core.</li>
<li><strong>Core</strong>: 6 standard residual blocks replaced by a single <strong>ODESolve</strong> module.</li>
<li><strong>Output</strong>: Global average pooling + Fully connected layer.</li>
<li><strong>Solver</strong>: Implicit Adams method.</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ODEBlock</span>(nn<span style="color:#f92672">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__init__</span>(self, odefunc):
</span></span><span style="display:flex;"><span>        super(ODEBlock, self)<span style="color:#f92672">.</span><span style="color:#a6e22e">__init__</span>()
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>odefunc <span style="color:#f92672">=</span> odefunc
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>integration_time <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>tensor([<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>])<span style="color:#f92672">.</span>float()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, x):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>integration_time <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>integration_time<span style="color:#f92672">.</span>type_as(x)
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Returns [x(t0), x(t1)]; we only want final state x(t1)</span>
</span></span><span style="display:flex;"><span>        out <span style="color:#f92672">=</span> odeint_adjoint(self<span style="color:#f92672">.</span>odefunc, x, self<span style="color:#f92672">.</span>integration_time)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> out[<span style="color:#ae81ff">1</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Conv-based dynamics: the Linear-based ODEFunc above operates on</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># flat vectors, so image feature maps need a convolutional variant</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ConvODEFunc</span>(nn<span style="color:#f92672">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__init__</span>(self, channels):
</span></span><span style="display:flex;"><span>        super(ConvODEFunc, self)<span style="color:#f92672">.</span><span style="color:#a6e22e">__init__</span>()
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>conv1 <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Conv2d(channels, channels, <span style="color:#ae81ff">3</span>, padding<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>conv2 <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Conv2d(channels, channels, <span style="color:#ae81ff">3</span>, padding<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, t, y):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>conv2(torch<span style="color:#f92672">.</span>tanh(self<span style="color:#f92672">.</span>conv1(y)))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># ResNet-like architecture with ODE block</span>
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Sequential(
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>Conv2d(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">64</span>, <span style="color:#ae81ff">3</span>, <span style="color:#ae81ff">1</span>),
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>ReLU(inplace<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>),
</span></span><span style="display:flex;"><span>    ODEBlock(ConvODEFunc(<span style="color:#ae81ff">64</span>)), <span style="color:#75715e"># Continuous-depth layer replacement</span>
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>BatchNorm2d(<span style="color:#ae81ff">64</span>),
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>AdaptiveAvgPool2d((<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>)),
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>Flatten(),
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>Linear(<span style="color:#ae81ff">64</span>, <span style="color:#ae81ff">10</span>)
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p><strong>Latent ODE (Time-Series)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: RNN with 25 hidden units that processes the observations in reverse time order to produce $q(z_0|x)$; the final RNN state therefore summarizes the entire sequence at $t_0$, parameterizing the initial latent state $z_0$ for the forward-running ODE.</li>
<li><strong>Latent Space</strong>: 4-dimensional latent state $z_0$.</li>
<li><strong>Dynamics ($f$)</strong>: Neural network with one hidden layer of 20 units.</li>
<li><strong>Decoder</strong>: Neural network with one hidden layer of 20 units computing $p(x_{t_i}|z_{t_i})$.</li>
<li><strong>Likelihood</strong>: Gaussian log-likelihood for the spiral reconstruction task. The paper also describes an optional Poisson process likelihood $\lambda(z(t))$ for event-time data (e.g., medical records), but this is not used in the spiral experiment.</li>
</ul>
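<p>Putting these components together, a compact sketch of the model (layer sizes from the bullets above; class and method names are illustrative, and a fixed-step Euler integrator stands in for the adaptive solver):</p>

```python
import torch
import torch.nn as nn

def euler_odeint(f, z0, t):
    """Fixed-step Euler stand-in for the adaptive solver used in the paper."""
    zs = [z0]
    for i in range(len(t) - 1):
        zs.append(zs[-1] + (t[i + 1] - t[i]) * f(t[i], zs[-1]))
    return torch.stack(zs)                                  # (T, batch, latent)

class LatentODE(nn.Module):
    def __init__(self, obs_dim=2, latent_dim=4, rnn_hidden=25, hidden=20):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, rnn_hidden, batch_first=True)
        self.to_q = nn.Linear(rnn_hidden, 2 * latent_dim)   # mean and log-variance
        self.f = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, latent_dim))
        self.decode = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, obs_dim))

    def forward(self, x, t):
        # Run the RNN over the sequence in reverse so its final state
        # summarizes all observations at t0
        _, h = self.rnn(torch.flip(x, dims=[1]))
        mean, logvar = self.to_q(h[-1]).chunk(2, dim=-1)
        z0 = mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # reparameterize
        zt = euler_odeint(lambda ti, z: self.f(z), z0, t)   # latent trajectory
        return self.decode(zt), mean, logvar
```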
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Experiment</th>
          <th>Metric</th>
          <th>Baseline (ResNet/RNN)</th>
          <th>ODE Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MNIST</td>
          <td>Test Error</td>
          <td>0.41%</td>
          <td>0.42%</td>
      </tr>
      <tr>
          <td>MNIST</td>
          <td>Parameters</td>
          <td>0.60 M</td>
          <td>0.22 M</td>
      </tr>
      <tr>
          <td>MNIST</td>
          <td>Memory</td>
          <td>$O(L)$</td>
          <td>$O(1)$</td>
      </tr>
      <tr>
          <td>Spirals (30 obs)</td>
          <td>RMSE</td>
          <td>0.3937</td>
          <td><strong>0.1642</strong></td>
      </tr>
      <tr>
          <td>Spirals (50 obs)</td>
          <td>RMSE</td>
          <td>0.3202</td>
          <td><strong>0.1502</strong></td>
      </tr>
      <tr>
          <td>Spirals (100 obs)</td>
          <td>RMSE</td>
          <td>0.1813</td>
          <td><strong>0.1346</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Implementation</strong>: Hidden state dynamics evaluated on GPU using <strong>TensorFlow</strong>.</li>
<li><strong>Solvers</strong>: Fortran ODE solvers (LSODE, VODE) from <code>scipy.integrate</code> were used for the actual integration.</li>
<li><strong>Note</strong>: While the original paper used TensorFlow/Scipy, the authors later released <code>torchdiffeq</code> (PyTorch), which has become the standard implementation for this architecture. The code samples above reflect this modern standard.</li>
<li><strong>Interface</strong>: Python&rsquo;s <code>autograd</code> framework bridged the TensorFlow dynamics and Scipy solvers.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The paper identifies several practical limitations of Neural ODEs:</p>
<ul>
<li><strong>Minibatching</strong>: Batching requires concatenating states of each batch element into a combined ODE of dimension $D \times K$. Controlling error on all batch elements together can require more evaluations than solving each system individually, though in practice this overhead was not substantial.</li>
<li><strong>Tolerance tuning</strong>: Users must choose error tolerances for both the forward and reverse passes. The paper used 1.5e-8 for sequence modeling, 1e-3 for classification, and 1e-5 for density estimation.</li>
<li><strong>Backward trajectory reconstruction</strong>: Running the dynamics backwards to reconstruct the forward state trajectory can introduce extra numerical error if the reconstructed trajectory diverges from the original. Checkpointing (storing intermediate states) can address this, though the authors did not find it necessary in practice.</li>
<li><strong>Uniqueness requirements</strong>: The neural network $f$ must be Lipschitz continuous (e.g., using tanh or ReLU activations with finite weights) to guarantee a unique solution via the Picard&ndash;Lindel&ouml;f existence and uniqueness theorem.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/rtqichen/torchdiffeq">torchdiffeq</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with GPU-based ODE solvers</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, R. T. Q., Rubanova, Y., Bettencourt, J., &amp; Duvenaud, D. (2018). Neural ordinary differential equations. <em>Proceedings of the 32nd International Conference on Neural Information Processing Systems</em>, 6572-6583.</p>
<p><strong>Publication</strong>: NeurIPS 2018</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chen2018neural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Neural ordinary differential equations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Ricky T. Q. and Rubanova, Yulia and Bettencourt, Jesse and Duvenaud, David}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 32nd International Conference on Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6572--6583}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/rtqichen/torchdiffeq">Official PyTorch Implementation</a></li>
</ul>
]]></content:encoded></item><item><title>Flow Matching for Generative Modeling: Scalable CNFs</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/</guid><description>A simulation-free framework for training Continuous Normalizing Flows using Conditional Flow Matching and Optimal Transport paths.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, as it introduces &ldquo;Flow Matching&rdquo; (FM), a novel simulation-free paradigm for training Continuous Normalizing Flows (CNFs) at scale. It is supported by a strong <strong>Theory</strong> basis, providing formal theorems that allow the intractable marginal vector field regression to be solved via a tractable conditional objective. It also touches on <strong>Systematization</strong> by showing that existing diffusion paths are specific instances of the proposed Gaussian probability path framework.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The paper aims to overcome the scaling limitations of Continuous Normalizing Flows (CNFs).</p>
<ul>
<li><strong>Problem</strong>: Standard Maximum Likelihood training for CNFs requires expensive numerical ODE simulations during training, which scales poorly. Existing simulation-free methods often involve intractable integrals or result in biased gradients.</li>
<li><strong>Gap</strong>: Diffusion models scale well, yet they are restricted to specific, curved probability paths (e.g., VP, VE) that can result in slow sampling and long training times.</li>
<li><strong>Goal</strong>: To develop an efficient, simulation-free training method for CNFs that supports arbitrary probability paths, specifically allowing for straighter, more efficient trajectories like those from Optimal Transport.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is <strong>Flow Matching (FM)</strong> and specifically the <strong>Conditional Flow Matching (CFM)</strong> objective.</p>
<ul>
<li><strong>Direct Vector Field Regression</strong>: The model regresses a target vector field $u_t$ that generates a desired probability path $p_t$.</li>
<li><strong>Conditional Flow Matching (CFM)</strong>: The authors prove that regressing the vector field of <em>conditional</em> paths (e.g., $p_t(x|x_1)$ given a single data point) yields the same gradients as regressing the intractable marginal vector field. This bypasses the need to know the marginal score or vector field.</li>
<li><strong>Optimal Transport Paths</strong>: The framework enables the use of <strong>Optimal Transport (OT)</strong> displacement interpolation for probability paths. OT paths are straight lines with constant speed, leading to faster training and easier sampling.</li>
</ul>
<p><strong>Concurrent work note</strong>: Rectified Flow (Liu et al., 2023) and Stochastic Interpolants (Albergo &amp; Vanden-Eijnden, 2023) were published concurrently at ICLR 2023 with structurally similar contributions under different names. All three independently propose simulation-free training of continuous flows via direct vector field regression; the differences lie in the specific interpolation schemes, theoretical framing, and experimental focus.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<ul>
<li><strong>Domains</strong>: 2D Checkerboard data, CIFAR-10, and ImageNet at resolutions $32 \times 32$, $64 \times 64$, and $128 \times 128$.</li>
<li><strong>Task</strong>: Unconditional generative modeling (density estimation and sample quality) and conditional super-resolution ($64 \times 64 \to 256 \times 256$).</li>
<li><strong>Baselines</strong>: Compared against Diffusion-based methods on the same architecture (U-Net): DDPM, Score Matching (SM), and ScoreFlow.</li>
<li><strong>Ablations</strong>: Specifically compared <strong>FM with Diffusion paths</strong> vs. <strong>FM with Optimal Transport (OT) paths</strong> to isolate the benefit of the training objective vs. the path choice.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Outperforms diffusion baselines</strong>: FM-OT consistently outperforms all diffusion-based methods (DDPM, Score Matching, ScoreFlow) in both Likelihood (NLL) and Sample Quality (FID) across CIFAR-10 and ImageNet, using the same U-Net architecture and training budget. Selected rows from Table 1 (NLL in bits per dimension, BPD; lower is better for all three metrics; &ldquo;FM w/ OT&rdquo; and &ldquo;FM w/ Diffusion&rdquo; refer to FM trained with OT paths and Diffusion paths respectively):</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Method</th>
          <th>NLL (BPD) ↓</th>
          <th>FID ↓</th>
          <th>NFE ↓</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CIFAR-10</td>
          <td>DDPM</td>
          <td>3.12</td>
          <td>7.48</td>
          <td>274</td>
      </tr>
      <tr>
          <td>CIFAR-10</td>
          <td>FM w/ OT</td>
          <td><strong>2.99</strong></td>
          <td><strong>6.35</strong></td>
          <td><strong>142</strong></td>
      </tr>
      <tr>
          <td>ImageNet 64×64</td>
          <td>ScoreFlow</td>
          <td>3.36</td>
          <td>24.95</td>
          <td>601</td>
      </tr>
      <tr>
          <td>ImageNet 64×64</td>
          <td>FM w/ OT</td>
          <td><strong>3.31</strong></td>
          <td><strong>14.45</strong></td>
          <td><strong>138</strong></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Training stability</strong>: FM with diffusion paths (FM w/ Diffusion) is itself a more stable alternative to diffusion training than DDPM and Score Matching, as shown by training curves in the paper (Figure 5), even before switching to OT paths. The OT path then provides further gains.</li>
<li><strong>Sampling speed</strong>: The straight trajectories of OT paths allow accurate sampling with significantly fewer function evaluations (NFE) compared to diffusion paths.</li>
<li><strong>Generality</strong>: Diffusion is a specific instance of Gaussian probability paths within FM. OT paths are a better-optimized alternative available within the same framework.</li>
<li><strong>Downstream adoption</strong>: Flow matching has been adopted beyond image generation. <a href="/notes/biology/computational-biology/dynamicflow/">DynamicFlow</a> uses it as the generative backbone for simultaneously generating ligand molecules and transforming protein pockets, extending flow matching to structure-based drug design.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Datasets</strong>: CIFAR-10, ImageNet ($32 \times 32$, $64 \times 64$, $128 \times 128$).</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>Images are center-cropped and resized.</li>
<li>For $32 \times 32$ and $64 \times 64$, the preprocessing follows Chrabaszcz et al. (2017).</li>
<li>Data is transformed via $\varphi(y) = 2^7(y+1)$ mapping $[-1, 1]$ pixel values to $[0, 256]$ for BPD computation.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Conditional Flow Matching (CFM) Objective</strong></p>
<p>The practical training objective used is the CFM loss, which bypasses intractable marginalization:</p>
<p>$$\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, q(x_1), p(x_0)} \left\| v_t(\psi_t(x_0)) - u_t(\psi_t(x_0) \mid x_1) \right\|^2$$</p>
<p>Where $t \sim \mathcal{U}[0,1]$, $x_1 \sim q(x_1)$ (data), and $x_0 \sim p(x_0)$ (noise).</p>
<p><strong>2. Optimal Transport (OT) Probability Path</strong></p>
<p>The authors recommend the OT path for efficiency.</p>
<ul>
<li><strong>Mean/Std Schedule</strong>: $\mu_t(x) = t x_1$ and $\sigma_t(x) = 1 - (1 - \sigma_{min})t$.</li>
<li><strong>Conditional Flow Map</strong>: $\psi_t(x) = (1 - (1 - \sigma_{min})t)x + t x_1$.</li>
<li><strong>Target Vector Field</strong>: The closed-form regression target for OT is:
$$u_t(x|x_1) = \frac{x_1 - (1 - \sigma_{min})x}{1 - (1 - \sigma_{min})t}$$</li>
</ul>
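<p>These formulas combine into a short training objective; the sketch below assumes flat inputs of shape <code>(batch, dim)</code> and an arbitrary vector-field network <code>v_theta(x, t)</code> (an illustrative name, not the paper&rsquo;s code):</p>

```python
import torch

def cfm_ot_loss(v_theta, x1, sigma_min=1e-5):
    """Conditional Flow Matching loss with the OT path."""
    x0 = torch.randn_like(x1)                 # noise x0 ~ p(x0) = N(0, I)
    t = torch.rand(x1.size(0), 1)             # t ~ U[0, 1]
    coef = 1 - (1 - sigma_min) * t
    psi = coef * x0 + t * x1                  # conditional flow map psi_t(x0)
    u = (x1 - (1 - sigma_min) * psi) / coef   # closed-form OT target u_t(psi | x1)
    return ((v_theta(psi, t) - u) ** 2).mean()

loss = cfm_ot_loss(lambda x, t: torch.zeros_like(x), torch.randn(8, 2))
```

<p>Evaluated at $x = \psi_t(x_0)$, the OT target simplifies to $x_1 - (1 - \sigma_{min})x_0$: the regression target is constant along each conditional path.</p>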
<p><strong>3. Sampling</strong></p>
<p>Sampling is performed by solving the ODE $\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x))$ from $t=0$ to $t=1$ using the learned vector field $v_t$.</p>
<ul>
<li><strong>Solver</strong>: <code>dopri5</code> (adaptive) is used for robust evaluation. Fixed-step solvers (Euler, Midpoint) are used for low-NFE efficiency tests.</li>
</ul>
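<p>A fixed-step Euler sampler of the kind used in the low-NFE efficiency tests can be sketched as follows (assuming a learned network <code>v_theta(x, t)</code> on flat inputs):</p>

```python
import torch

@torch.no_grad()
def sample_euler(v_theta, n, dim, steps=100):
    """Integrate dx/dt = v_t(x) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = torch.randn(n, dim)                   # x0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n, 1), i * dt)
        x = x + dt * v_theta(x, t)
    return x
```

<p>Because OT trajectories are straight with constant speed, even a handful of Euler steps traces them accurately, which is where the NFE savings come from.</p>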
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: U-Net architecture from Dhariwal &amp; Nichol (2021) is used for all image experiments.</li>
<li><strong>Toy Data</strong>: 5-layer MLP with 512 neurons.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$, weight decay=0.0).</li>
<li>Learning Rate: Polynomial decay or constant (see Table 3 in paper).</li>
<li>$\sigma_{min}$: Set to a small value (e.g., $10^{-5}$).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><strong>NLL (BPD)</strong>: Computed using the continuous change of variables formula, estimated via the Hutchinson trace estimator to bypass $O(d^3)$ divergence computation.</li>
<li><strong>FID</strong>: Fr&eacute;chet Inception Distance for sample quality.</li>
<li><strong>NFE</strong>: Number of Function Evaluations required by the solver.</li>
</ul>
</li>
<li><strong>Likelihood Computation</strong>: Requires solving an augmented ODE to track the log-density change:
$$\frac{d}{dt} \begin{bmatrix} \phi_t(x) \\ f(t) \end{bmatrix} = \begin{bmatrix} v_t(\phi_t(x)) \\ -\text{div}(v_t(\phi_t(x))) \end{bmatrix}$$</li>
</ul>
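<p>A minimal sketch of this likelihood computation (fixed-step Euler with exact divergence, practical for 2D toys; the paper uses an adaptive solver and the Hutchinson estimator instead):</p>

```python
import torch

def log_prob(v, x, steps=200):
    """Flow samples x backwards from t=1 to t=0, accumulating div(v) along the
    trajectory, then add the standard-normal base log-density at t=0."""
    dt = 1.0 / steps
    int_div = torch.zeros(x.size(0))
    for i in range(steps, 0, -1):
        t = torch.full((x.size(0), 1), i * dt)
        x = x.detach().requires_grad_(True)
        vx = v(x, t)
        # Exact divergence: one gradient call per dimension
        div = sum(torch.autograd.grad(vx[:, d].sum(), x, retain_graph=True)[0][:, d]
                  for d in range(x.size(1)))
        int_div = int_div + dt * div.detach()
        x = x.detach() - dt * vx.detach()         # backwards Euler step of dx/dt = v
    base = torch.distributions.Normal(0., 1.).log_prob(x).sum(dim=1)
    return base - int_div                         # log p1(x) = log p0(x0) - int div dt
```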
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>CIFAR-10</strong>: 2 GPUs.</li>
<li><strong>ImageNet-32</strong>: 4 GPUs.</li>
<li><strong>ImageNet-64</strong>: 16 GPUs.</li>
<li><strong>ImageNet-128</strong>: 32 GPUs.</li>
<li><strong>Precision</strong>: Full 32-bit for CIFAR/IM-32; 16-bit mixed precision for IM-64/128.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/facebookresearch/flow_matching">flow_matching (PyTorch library)</a></td>
          <td>Code</td>
          <td>CC BY-NC 4.0</td>
          <td>Later official library from Meta; not the original experiment code</td>
      </tr>
  </tbody>
</table>
<p>The paper does not release the original training code or model weights used in the experiments. The <code>facebookresearch/flow_matching</code> library was released later as a general-purpose PyTorch implementation of flow matching algorithms. Standard benchmark datasets (CIFAR-10, ImageNet) are publicly available.</p>
<hr>
<h2 id="theoretical-notes-why-cfm-works">Theoretical Notes: Why CFM Works</h2>
<p>The paper relies on three key theorems to make training tractable.</p>
<p><strong>Theorem 1 (Marginal Generation)</strong>:</p>
<p>Marginalizing conditional vector fields $u_t(x|x_1)$ yields the correct marginal vector field $u_t(x)$ that generates the marginal probability path $p_t(x)$.</p>
<p>$$u_t(x) = \int u_t(x|x_1) \frac{p_t(x|x_1)q(x_1)}{p_t(x)} dx_1$$</p>
<blockquote>
<p><strong>Understanding the Proof:</strong></p>
<p>To understand why this theorem holds, we have to look at the <strong>Continuity Equation</strong>, which is the fundamental partial differential equation (PDE) that links a probability density path $p_t$ to a vector field $u_t$.</p>
<p>A vector field $u_t$ is said to &ldquo;generate&rdquo; a probability path $p_t$ if and only if they satisfy the continuity equation:</p>
<p>$$\frac{\partial p_t(x)}{\partial t} + \nabla \cdot (p_t(x) u_t(x)) = 0$$</p>
<p>The proof of Theorem 1 relies on substituting the definitions of the marginal path and vector field into this equation to see if they balance out.</p>
<p><strong>Step-by-Step Proof:</strong></p>
<ol>
<li>
<p><strong>Start with the time derivative of the marginal path</strong>: We begin by differentiating the marginal probability path $p_t(x)$ with respect to time. By definition, the marginal path is the integral of the conditional paths over the data distribution:
$$\frac{\partial p_t(x)}{\partial t} = \frac{\partial}{\partial t} \int p_t(x|x_1) q(x_1) dx_1$$</p>
</li>
<li>
<p><strong>Swap derivative and integral</strong>: Assuming standard regularity conditions (Leibniz Rule), we can move the time derivative inside the integral:
$$\frac{\partial p_t(x)}{\partial t} = \int \frac{\partial p_t(x|x_1)}{\partial t} q(x_1) dx_1$$</p>
</li>
<li>
<p><strong>Apply the Conditional Continuity Equation</strong>: This is the critical step. We know that the conditional vector field $u_t(x|x_1)$ generates the conditional path $p_t(x|x_1)$. Therefore, for every single sample $x_1$, the pair satisfies the continuity equation:
$$\frac{\partial p_t(x|x_1)}{\partial t} = -\nabla \cdot (p_t(x|x_1) u_t(x|x_1))$$</p>
<p>Substituting this into our integral gives:
$$\frac{\partial p_t(x)}{\partial t} = -\int \nabla \cdot (p_t(x|x_1) u_t(x|x_1)) q(x_1) dx_1$$</p>
</li>
<li>
<p><strong>Pull the Divergence out</strong>: Since the divergence operator ($\nabla \cdot$) acts on $x$ and the integral is over $x_1$, we can pull the divergence operator outside the integral (by linearity):
$$\frac{\partial p_t(x)}{\partial t} = -\nabla \cdot \left( \int p_t(x|x_1) u_t(x|x_1) q(x_1) dx_1 \right)$$</p>
</li>
<li>
<p><strong>Match with the Marginal Vector Field Definition</strong>: Now, look at the term inside the parentheses. The paper defines the marginal vector field $u_t(x)$ specifically to make this term simpler. Rearranging the definition of $u_t(x)$ provided in the theorem:
$$p_t(x) u_t(x) = \int p_t(x|x_1) u_t(x|x_1) q(x_1) dx_1$$</p>
<p>Substitute $p_t(x) u_t(x)$ back into our equation from Step 4:
$$\frac{\partial p_t(x)}{\partial t} = -\nabla \cdot (p_t(x) u_t(x))$$</p>
</li>
</ol>
<p><strong>Conclusion</strong>: We have just shown that $\frac{\partial p_t(x)}{\partial t} + \nabla \cdot (p_t(x) u_t(x)) = 0$. This is exactly the continuity equation. Because the marginal path and the aggregated marginal vector field satisfy this equation, the vector field is proven to generate the path.</p></blockquote>
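The theorem can also be checked numerically. The sketch below is a toy 1D setup of my own (not from the paper): two equally likely data points with OT-style Gaussian conditional paths $\mu_t(x_1) = t x_1$, $\sigma_t = 1 - (1-\sigma_{\min})t$. It forms the marginal field as the posterior-weighted average of conditional fields and verifies by finite differences that the marginal pair satisfies the continuity equation:

```python
import numpy as np

# Toy 1D check of Theorem 1 (assumed setup, not from the paper):
# two equally likely data points, OT-style Gaussian conditional paths.
SIGMA_MIN = 0.01
DATA = np.array([-2.0, 1.5])   # data points x_1 with q(x_1) = 1/2 each
W = np.array([0.5, 0.5])

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

def sigma(t):
    return 1 - (1 - SIGMA_MIN) * t

def p_marginal(x, t):
    # p_t(x) = sum_i q(x_1^i) * p_t(x | x_1^i)
    return sum(w * gauss(x, t * x1, sigma(t)) for w, x1 in zip(W, DATA))

def u_marginal(x, t):
    # Theorem 1: posterior-weighted average of the conditional fields
    post = np.array([w * gauss(x, t * x1, sigma(t)) for w, x1 in zip(W, DATA)])
    post /= post.sum(axis=0)
    cond = np.array([(x1 - (1 - SIGMA_MIN) * x) / sigma(t) for x1 in DATA])
    return (post * cond).sum(axis=0)

# residual of the continuity equation dp/dt + d(p*u)/dx via finite differences
x, t, h = np.linspace(-3.0, 3.0, 2001), 0.5, 1e-4
dp_dt = (p_marginal(x, t + h) - p_marginal(x, t - h)) / (2 * h)
div = np.gradient(p_marginal(x, t) * u_marginal(x, t), x)
residual = np.abs(dp_dt + div).max()
print(residual)  # small: the marginal pair satisfies the continuity equation
```

The residual is zero up to discretization error, which is exactly the statement proven above.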
<p><strong>Theorem 2 (Gradient Equivalence)</strong>:</p>
<p>The intractable Flow Matching objective $\mathcal{L}_{FM}$ (which requires $u_t(x)$) has the <strong>same gradients</strong> as the tractable Conditional Flow Matching objective $\mathcal{L}_{CFM}$.</p>
<p>$$\nabla_\theta \mathcal{L}_{FM}(\theta) = \nabla_\theta \mathcal{L}_{CFM}(\theta)$$</p>
<p>This allows the model to learn the marginal vector field by only seeing conditional sample paths.</p>
<blockquote>
<p><strong>Understanding the Proof:</strong></p>
<p>The reason Theorem 2 holds is that the &ldquo;Conditional Flow Matching&rdquo; (CFM) objective is essentially an unbiased estimator of the &ldquo;Flow Matching&rdquo; (FM) objective (up to a constant). When we average over all the conditional data points $x_1$, the &ldquo;cross-term&rdquo; in the loss function aligns perfectly with the marginal vector field.</p>
<p><strong>1. Expand the Loss Functions</strong></p>
<p>First, let&rsquo;s look at the squared error in both objectives. Recall that $v_t$ is our neural network (parameterized by $\theta$), $u_t$ is the intractable marginal target, and $u_t(x|x_1)$ is the tractable conditional target.</p>
<p>Expanding the squared norms:</p>
<ul>
<li>
<p><strong>FM Objective</strong>:
$$\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t, p_t(x)} \left[ |v_t(x)|^2 - 2v_t(x) \cdot u_t(x) + |u_t(x)|^2 \right]$$</p>
</li>
<li>
<p><strong>CFM Objective</strong>:
$$\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ |v_t(x)|^2 - 2v_t(x) \cdot u_t(x|x_1) + |u_t(x|x_1)|^2 \right]$$</p>
</li>
</ul>
<p><strong>Key Insight</strong>: When we take the gradient $\nabla_\theta$, the last term in both equations disappears because the targets ($u_t$) are independent of the network weights $\theta$. We only need to show that the expectations of the first two terms match.</p>
<p><strong>2. Matching the First Term ($|v_t(x)|^2$)</strong></p>
<p>This part is straightforward. The expectation of $|v_t(x)|^2$ is the same in both cases because of how the marginal density $p_t(x)$ is defined.</p>
<ul>
<li><strong>FM</strong>: averages over $p_t(x)$.</li>
<li><strong>CFM</strong>: averages over $p_t(x|x_1)q(x_1)$.</li>
</ul>
<p>Since $p_t(x) = \int p_t(x|x_1) q(x_1) dx_1$ (by definition), averaging over the joint distribution is mathematically identical to averaging over the marginal $p_t(x)$.</p>
<p><strong>3. Matching the Cross Term (The &ldquo;Trick&rdquo;)</strong></p>
<p>This is the critical part of the proof. We need to show that the interaction between the network and the marginal field equals the interaction between the network and the conditional field.</p>
<p><strong>The Goal</strong>: Show $\mathbb{E}_{t, p_t(x)} [v_t(x) \cdot u_t(x)] = \mathbb{E}_{t, q(x_1), p_t(x|x_1)} [v_t(x) \cdot u_t(x|x_1)]$.</p>
<p><strong>The Proof</strong>:</p>
<ol>
<li>
<p>Start with the <strong>FM cross-term</strong> (marginal):
$$\mathbb{E}_{t, p_t(x)} [v_t(x) \cdot u_t(x)]$$</p>
</li>
<li>
<p>Substitute the definition of the marginal vector field $u_t(x)$ derived in <strong>Theorem 1</strong>:
$$u_t(x) = \int u_t(x|x_1) \frac{p_t(x|x_1) q(x_1)}{p_t(x)} dx_1$$</p>
</li>
<li>
<p>Plug this into the expectation, written out as an integral (the $p_t(x)$ factors will cancel in the next step):
$$\mathbb{E}_{t, p_t(x)} [v_t(x) \cdot u_t(x)] = \int_t \int_x p_t(x) v_t(x) \cdot \left[ \int_{x_1} u_t(x|x_1) \frac{p_t(x|x_1) q(x_1)}{p_t(x)} dx_1 \right] dx \, dt$$</p>
</li>
<li>
<p>This simplifies to:
$$= \int_t \int_x \int_{x_1} v_t(x) \cdot u_t(x|x_1) p_t(x|x_1) q(x_1) dx_1 dx dt$$</p>
</li>
<li>
<p>This is exactly the definition of the expectation in the <strong>CFM objective</strong>:
$$= \mathbb{E}_{t, q(x_1), p_t(x|x_1)} [v_t(x) \cdot u_t(x|x_1)]$$</p>
</li>
</ol>
<p><strong>Conclusion</strong>: Because the expectations of all terms involving $\theta$ are identical, the gradients must be identical.</p>
<p>Intuitively, this works like <strong>Denoising Score Matching</strong> or <strong>Stochastic Gradient Descent</strong>: even though each individual conditional vector field $u_t(x|x_1)$ points to a specific data point $x_1$ (which may differ from the true marginal direction), the <em>average</em> of all these pulls equals the true marginal vector field $u_t(x)$.</p></blockquote>
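This can be made concrete with quadrature. In the toy setup below (my own, not the paper's; a fixed $t$ for simplicity, a linear model $v_\theta(x) = \theta x$, and the same OT-style path as above), the gap $\mathcal{L}_{CFM} - \mathcal{L}_{FM}$ comes out as the same constant for every $\theta$, so the gradients coincide:

```python
import numpy as np

# Numerical check of Theorem 2 at a fixed t (assumed toy setup, not from
# the paper): L_CFM(theta) - L_FM(theta) is a theta-independent constant.
SIGMA_MIN, T = 0.01, 0.5
DATA, W = np.array([-1.0, 2.0]), np.array([0.5, 0.5])
X = np.linspace(-6.0, 6.0, 4001)
DX = X[1] - X[0]
SIG = 1 - (1 - SIGMA_MIN) * T

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

cond_p = np.array([gauss(X, T * x1, SIG) for x1 in DATA])             # p_t(x|x1)
cond_u = np.array([(x1 - (1 - SIGMA_MIN) * X) / SIG for x1 in DATA])  # u_t(x|x1)
p = (W[:, None] * cond_p).sum(0)               # marginal path p_t(x)
u = (W[:, None] * cond_p * cond_u).sum(0) / p  # marginal field u_t(x)

def L_FM(theta):   # model v(x) = theta * x regressed onto the marginal field
    return (p * (theta * X - u) ** 2).sum() * DX

def L_CFM(theta):  # same model regressed onto the conditional fields
    return sum(w * (cp * (theta * X - cu) ** 2).sum() * DX
               for w, cp, cu in zip(W, cond_p, cond_u))

gaps = [L_CFM(th) - L_FM(th) for th in (-1.2, 0.3, 2.0)]
print(gaps)  # one theta-independent constant, repeated
```

The constant gap is the expected conditional variance of the targets, which is why CFM behaves like denoising score matching: noisier per-sample targets, identical gradients in expectation.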
<p><strong>Theorem 3 (Gaussian Conditional VFs)</strong>:</p>
<p>For any Gaussian probability path $p_t(x|x_1) = \mathcal{N}(x | \mu_t(x_1), \sigma_t(x_1)^2 I)$, the unique vector field generating it is available in closed form:</p>
<p>$$u_t(x|x_1) = \frac{\sigma'_t(x_1)}{\sigma_t(x_1)}(x - \mu_t(x_1)) + \mu'_t(x_1)$$</p>
<p>This theorem allows explicitly defining targets for both Diffusion (curved) and Optimal Transport (straight) paths.</p>
<blockquote>
<p><strong>Understanding the Proof:</strong></p>
<p>The derivation of Theorem 3 comes from the direct relationship between a flow map $\psi_t$ and its generating vector field. Because we chose a specific, simple path (Gaussian), we can invert the flow map to find the vector field in closed form.</p>
<p><strong>1. Define the Flow Map $\psi_t$</strong></p>
<p>We start by defining the conditional probability path as a Gaussian:</p>
<p>$$p_t(x|x_1) = \mathcal{N}(x | \mu_t(x_1), \sigma_t(x_1)^2 I)$$</p>
<p>The simplest way to &ldquo;push&rdquo; a standard normal distribution (noise) $p_0 = \mathcal{N}(0, I)$ to this Gaussian is using an affine transformation (scaling and shifting). We define the flow map $\psi_t$ as:</p>
<p>$$\psi_t(x_0) = \sigma_t(x_1) x_0 + \mu_t(x_1)$$</p>
<p>This map takes a noise sample $x_0$ and transforms it into a sample $x$ at time $t$.</p>
<p><strong>2. The Definition of a Generating Vector Field</strong></p>
<p>By definition, a vector field $u_t$ generates a flow $\psi_t$ if the vector field describes the instantaneous velocity of the flow at any point. Mathematically:</p>
<p>$$u_t(\psi_t(x_0)) = \frac{d}{dt}\psi_t(x_0)$$</p>
<p>Let $x = \psi_t(x_0)$ be the position of the particle at time $t$. We want to find $u_t(x)$.</p>
<p><strong>3. Invert the Flow Map</strong></p>
<p>To find $u_t(x)$, we must express the equation in terms of $x$ rather than $x_0$. Since our flow map is a simple affine transformation (multiply and add), it is easily invertible (assuming $\sigma_t(x_1) \neq 0$):</p>
<p>$$x_0 = \frac{x - \mu_t(x_1)}{\sigma_t(x_1)}$$</p>
<p>We will call this inverse map $\psi_t^{-1}(x)$.</p>
<p><strong>4. Differentiate the Flow Map</strong></p>
<p>Now we calculate the left side of our definition equation (velocity): $\frac{d}{dt}\psi_t(x_0)$.</p>
<p>Taking the time derivative of $\psi_t(x_0) = \sigma_t(x_1) x_0 + \mu_t(x_1)$:</p>
<p>$$\frac{d}{dt}\psi_t(x_0) = \sigma'_t(x_1) x_0 + \mu'_t(x_1)$$</p>
<p>(Note: $\sigma'_t$ and $\mu'_t$ denote time derivatives).</p>
<p><strong>5. Substitute and Solve</strong></p>
<p>Now we combine everything. We know $u_t(\psi_t(x_0)) = \frac{d}{dt}\psi_t(x_0)$.</p>
<p>Substitute the result from Step 4 into this equation:</p>
<p>$$u_t(\psi_t(x_0)) = \sigma'_t(x_1) x_0 + \mu'_t(x_1)$$</p>
<p>This expresses the vector field in terms of the initial point $x_0$. We must express it in terms of the current point $x$. So, we plug in the inverse formula for $x_0$ derived in Step 3:</p>
<p>$$u_t(x|x_1) = \sigma'_t(x_1) \frac{x - \mu_t(x_1)}{\sigma_t(x_1)} + \mu'_t(x_1)$$</p>
<p>Rearranging terms gives the final closed form:</p>
<p>$$u_t(x|x_1) = \frac{\sigma'_t(x_1)}{\sigma_t(x_1)}(x - \mu_t(x_1)) + \mu'_t(x_1)$$</p>
<p><strong>Why is this useful?</strong></p>
<p>This formula means that as long as you can define a mean schedule $\mu_t(x_1)$ and a standard deviation schedule $\sigma_t(x_1)$ (which is easy to do for both Diffusion and Optimal Transport), you immediately get the exact vector field target $u_t(x|x_1)$ needed to train your neural network, bypassing complex ODE solving or score matching approximations.</p></blockquote>
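As a concrete sketch (with an assumed OT-style schedule $\mu_t(x_1) = t x_1$, $\sigma_t = 1 - (1-\sigma_{\min})t$, not the paper's code), the closed-form conditional field can be integrated as an ODE from a noise sample, and the endpoint lands exactly where the flow map $\psi_1(x_0) = \sigma_{\min} x_0 + x_1$ says it should:

```python
# Theorem 3 in action (assumed OT-style schedule):
# mu_t(x1) = t * x1, sigma_t = 1 - (1 - sigma_min) * t.
SIGMA_MIN = 0.01

def sigma(t):
    return 1 - (1 - SIGMA_MIN) * t

def u_cond(x, t, x1):
    # closed form: (sigma'_t / sigma_t) * (x - mu_t) + mu'_t
    dsigma, dmu = -(1 - SIGMA_MIN), x1
    return (dsigma / sigma(t)) * (x - t * x1) + dmu

# integrate dx/dt = u_t(x | x1) from a noise sample with fixed-step RK4
x1, x = 2.0, -0.7
x0, n = x, 1000
for i in range(n):
    t, h = i / n, 1.0 / n
    k1 = u_cond(x, t, x1)
    k2 = u_cond(x + h / 2 * k1, t + h / 2, x1)
    k3 = u_cond(x + h / 2 * k2, t + h / 2, x1)
    k4 = u_cond(x + h * k3, t + h, x1)
    x += (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)

print(x, SIGMA_MIN * x0 + x1)  # endpoint equals the flow map psi_1(x0)
```

No score estimation or complex solving is needed: the schedule alone determines the target field.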
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., &amp; Le, M. (2023). Flow Matching for Generative Modeling. <em>International Conference on Learning Representations (ICLR)</em>.</p>
<p><strong>Publication</strong>: ICLR 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lipmanFlowMatchingGenerative2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Flow Matching for Generative Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Lipman, Yaron and Chen, Ricky T. Q. and Ben-Hamu, Heli and Nickel, Maximilian and Le, Matt}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2210.02747">ArXiv</a></li>
</ul>
]]></content:encoded></item><item><title>Building Normalizing Flows with Stochastic Interpolants</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/stochastic-interpolants/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/stochastic-interpolants/</guid><description>A continuous-time normalizing flow using stochastic interpolants and quadratic loss to bypass costly ODE backpropagation.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, with significant <strong>Theory</strong> contributions.</p>
<p>The authors propose a specific algorithm (&ldquo;InterFlow&rdquo;) for constructing generative models based on continuous-time normalizing flows. The work is characterized by the derivation of a new training objective (a simple quadratic loss) that bypasses the computational bottlenecks of previous methods. It includes prominent baseline comparisons against continuous flow methods (FFJORD, OT-Flow) and diffusion models. The theoretical component establishes the validity of the interpolant density satisfying the continuity equation (a conservation law governing how probability mass flows) and bounds the Wasserstein-2 distance (a measure of transport cost between distributions, penalizing squared displacement) of the transport.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The primary motivation is to overcome the computational inefficiency of training Continuous Normalizing Flows (CNFs) using Maximum Likelihood Estimation (MLE). Standard CNF training requires backpropagating through numerical ODE solvers, which is costly and limits scalability.</p>
<p>Additionally, while score-based diffusion models (SDEs) have achieved high sample quality, they theoretically require infinite time integration and rely on specific noise schedules. The authors aim to establish a method that works strictly with Probability Flow ODEs on finite time intervals, retaining the flexibility to connect arbitrary densities without the complexity of SDEs or the cost of standard ODE adjoint methods.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Stochastic Interpolant</strong> framework:</p>
<ul>
<li><strong>Explicit Interpolant Construction</strong>: The method defines a time-dependent interpolant $x_t = I_t(x_0, x_1)$ (e.g., trigonometric interpolation) that connects samples from the base density $\rho_0$ and target $\rho_1$.</li>
<li><strong>Simulation-Free Training</strong>: The velocity field $v_t(x)$ of the probability flow is learned by minimizing a simple quadratic objective: $G(\hat{v}) = \mathbb{E}[|\hat{v}_t(x_t)|^2 - 2\partial_t x_t \cdot \hat{v}_t(x_t)]$. Because $\partial_t I_t$ is known analytically from the interpolant definition, the expectation can be estimated by sampling $(x_0, x_1, t)$ directly. This avoids ODE integration during training (ODE integration is still required at inference).</li>
<li><strong>Decoupling Path and Optimization</strong>: The choice of path (interpolant) is separated from the optimization of the velocity field. MLE methods couple the path and objective.</li>
<li><strong>Connection to Score-Based Models</strong>: The authors show that for Gaussian base densities and trigonometric interpolants, the learned velocity field is explicitly related to the score function $\nabla \log \rho_t$, providing a theoretical bridge between CNFs and diffusion models.</li>
</ul>
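The simulation-free property follows directly from the objective: one Monte Carlo estimate needs only samples $(x_0, x_1, t)$ and the analytic $\partial_t I_t$. A minimal NumPy sketch with toy densities and a stand-in linear model (my own illustration, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(0)

# One Monte Carlo estimate of G(v) = E[|v_t(x_t)|^2 - 2 dI_t/dt . v_t(x_t)]
# with the trigonometric interpolant I_t = cos(pi t/2) x0 + sin(pi t/2) x1.
def interp(t, x0, x1):
    return np.cos(np.pi * t / 2)[:, None] * x0 + np.sin(np.pi * t / 2)[:, None] * x1

def dinterp_dt(t, x0, x1):
    # analytic time derivative of the interpolant -- no ODE solve needed
    return (np.pi / 2) * (-np.sin(np.pi * t / 2)[:, None] * x0
                          + np.cos(np.pi * t / 2)[:, None] * x1)

def v_model(x, t, theta):
    return theta * x  # toy stand-in for the velocity network

def loss_batch(theta, n=4096, d=2):
    x0 = rng.standard_normal((n, d))         # base samples from rho_0
    x1 = rng.standard_normal((n, d)) + 3.0   # toy "data" samples from rho_1
    t = rng.uniform(0.0, 1.0, n)
    xt = interp(t, x0, x1)
    v = v_model(xt, t, theta)
    return np.mean(np.sum(v ** 2 - 2 * dinterp_dt(t, x0, x1) * v, axis=1))

print(loss_batch(0.5))
```

Each training step costs one batch of interpolant evaluations, which is why the per-epoch cost stays constant as the learned dynamics grow more complex.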
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors performed validation across synthetic, tabular, and image domains:</p>
<ul>
<li><strong>2D Density Estimation</strong>: Benchmarked on &ldquo;Checkerboard&rdquo;, &ldquo;8 Gaussians&rdquo;, and anisotropic curved densities to visualize mode coverage and transport smoothness.</li>
<li><strong>High-Dimensional Tabular Data</strong>: Evaluated on standard benchmarks (POWER, GAS, HEPMASS, MINIBOONE, BSDS300) comparing Negative Log Likelihood (NLL) against FFJORD, OT-Flow, and others.</li>
<li><strong>Image Generation</strong>: Trained models on CIFAR-10 ($32 \times 32$), ImageNet ($32 \times 32$), and Oxford Flowers ($128 \times 128$) to test scalability.</li>
<li><strong>Ablations</strong>: Investigated optimizing the interpolant path itself (e.g., learning Fourier coefficients for the path) to approach optimal transport and minimize path length.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Performance</strong>: The method matches or supersedes conventional ODE flows (like FFJORD) in terms of NLL while being significantly cheaper to train.</li>
<li><strong>Efficiency</strong>: The training cost per epoch is constant (simulation-free), whereas MLE-based ODE methods see growing costs as the dynamics become more complex.</li>
<li><strong>Scalability</strong>: The method successfully scales to $128 \times 128$ resolution on a single GPU, a resolution that prior ab-initio ODE flows had not demonstrated.</li>
<li><strong>Flexibility</strong>: The framework can connect <em>any</em> two arbitrary densities (e.g., connecting two different complex 2D distributions) without needing one to be Gaussian.</li>
<li><strong>Optimal Transport</strong>: For a fixed interpolant, minimizing $G(\hat{v})$ over the velocity field recovers the velocity for that specific path. Additionally optimizing over the interpolant family yields a solution to the Benamou-Brenier optimal transport problem.</li>
<li><strong>Limitations</strong>: The authors acknowledge that image FID scores trail dedicated diffusion models, noting that InterFlow was not optimized with standard training tricks such as exponential moving averages, truncation, or learning rate warm-ups. The framework&rsquo;s sample quality could likely improve with these additions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Tabular Datasets</strong>: POWER (6D), GAS (8D), HEPMASS (21D), MINIBOONE (43D), BSDS300 (63D).
<ul>
<li>Training points range from ~30k (MINIBOONE) to ~1.6M (POWER).</li>
</ul>
</li>
<li><strong>Image Datasets</strong>:
<ul>
<li>CIFAR-10 ($32 \times 32$, 50k training points).</li>
<li>ImageNet ($32 \times 32$, ~1.28M training points).</li>
<li>Oxford Flowers ($128 \times 128$, ~315k training points).</li>
</ul>
</li>
<li><strong>Time Sampling</strong>: Time $t$ is sampled from a Beta distribution during training (reweighting) to focus learning near the target.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Interpolant</strong>: The primary interpolant used is trigonometric: $I_t(x_0, x_1) = \cos(\frac{\pi t}{2})x_0 + \sin(\frac{\pi t}{2})x_1$.
<ul>
<li>Alternative linear interpolant: $I_t = a_t x_0 + b_t x_1$.</li>
</ul>
</li>
<li><strong>Loss Function</strong>:
$$G(\hat{v}) = \mathbb{E}_{t, x_0, x_1}[|\hat{v}_t(x_t)|^2 - 2\partial_t I_t(x_0, x_1) \cdot \hat{v}_t(x_t)]$$
<ul>
<li>The expectation is amenable to empirical estimation using batches of $x_0, x_1, t$.</li>
</ul>
</li>
<li><strong>Sampling</strong>: Numerical integration using Dormand-Prince (Runge-Kutta 4/5).</li>
<li><strong>Optimization</strong>: SGD/Adam variants used for optimization.</li>
</ul>
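Inference then reduces to a single ODE solve. A sketch of the sampling step with SciPy's adaptive RK45 solver (SciPy's "RK45" is the Dormand-Prince pair named above); the velocity field here is a hypothetical constant-drift stand-in so the endpoint is checkable, not a trained network:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sampling sketch: integrate dx/dt = v_t(x) from t=0 to t=1.
# `velocity` is a hypothetical stand-in for the trained velocity network.
def velocity(t, x):
    return 3.0 * np.ones_like(x)   # constant drift (toy field)

x0 = np.random.default_rng(1).standard_normal(2)   # draw from the base rho_0
sol = solve_ivp(velocity, (0.0, 1.0), x0, method="RK45", rtol=1e-6, atol=1e-8)
sample = sol.y[:, -1]   # approximate sample from rho_1
print(sample - x0)      # the constant field shifts every coordinate by 3
```

With a real network the call is identical; only `velocity` changes.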
<h3 id="models">Models</h3>
<ul>
<li><strong>Tabular Architectures</strong>:
<ul>
<li>Feed-forward networks with 4-5 hidden layers.</li>
<li>Hidden widths: 512 (POWER, GAS, HEPMASS, MINIBOONE) or 1024 (BSDS300).</li>
<li>Activation: ReLU (general) or ELU (BSDS300).</li>
</ul>
</li>
<li><strong>Image Architectures</strong>:
<ul>
<li>U-Net based on the DDPM implementation.</li>
<li>Dimensions: 256 hidden dimension.</li>
<li>Sinusoidal time embeddings used.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Negative Log Likelihood (NLL) in nats (tabular) or bits per dim (images), Frechet Inception Distance (FID) for images.</li>
<li><strong>Baselines</strong>: FFJORD, Glow, Real NVP, OT-Flow, ScoreFlow, DDPM.</li>
</ul>
<p><strong>Tabular NLL</strong> (nats, lower is better; Table 2 Left):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>POWER</th>
          <th>GAS</th>
          <th>HEPMASS</th>
          <th>MINIBOONE</th>
          <th>BSDS300</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MADE</td>
          <td>3.08</td>
          <td>-3.56</td>
          <td>20.98</td>
          <td>15.59</td>
          <td>-148.85</td>
      </tr>
      <tr>
          <td>Real NVP</td>
          <td>-0.17</td>
          <td>-8.33</td>
          <td>18.71</td>
          <td>13.55</td>
          <td>-153.28</td>
      </tr>
      <tr>
          <td>Glow</td>
          <td>-0.17</td>
          <td>-8.15</td>
          <td>18.92</td>
          <td>11.35</td>
          <td>-155.07</td>
      </tr>
      <tr>
          <td>CPF</td>
          <td>-0.52</td>
          <td>-10.36</td>
          <td>16.93</td>
          <td>10.58</td>
          <td>-154.99</td>
      </tr>
      <tr>
          <td>NSP</td>
          <td>-0.64</td>
          <td>-13.09</td>
          <td>14.75</td>
          <td>9.67</td>
          <td>-157.54</td>
      </tr>
      <tr>
          <td>FFJORD</td>
          <td>-0.46</td>
          <td>-8.59</td>
          <td>14.92</td>
          <td>10.43</td>
          <td>-157.40</td>
      </tr>
      <tr>
          <td>OT-Flow</td>
          <td>-0.30</td>
          <td>-9.20</td>
          <td>17.32</td>
          <td>10.55</td>
          <td>-154.20</td>
      </tr>
      <tr>
          <td><strong>Ours</strong></td>
          <td><strong>-0.57</strong></td>
          <td><strong>-12.35</strong></td>
          <td><strong>14.85</strong></td>
          <td><strong>10.42</strong></td>
          <td><strong>-156.22</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation NLL and FID</strong> (Table 2 Right; NLL in bits per dim, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>CIFAR-10 NLL</th>
          <th>CIFAR-10 FID</th>
          <th>ImageNet-32 NLL</th>
          <th>ImageNet-32 FID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FFJORD</td>
          <td>3.40</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Glow</td>
          <td>3.35</td>
          <td>-</td>
          <td>4.09</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DDPM</td>
          <td>≤3.75</td>
          <td>3.17</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DDPM++ (Song et al., 2021)</td>
          <td>≤3.37</td>
          <td>2.90</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>ScoreSDE (Song et al., 2021)</td>
          <td>2.99</td>
          <td>2.92</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>VDM</td>
          <td>≤2.65</td>
          <td>7.41</td>
          <td>≤3.72</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Soft Truncation</td>
          <td>2.88</td>
          <td>3.45</td>
          <td>3.85</td>
          <td>8.42</td>
      </tr>
      <tr>
          <td>ScoreFlow</td>
          <td>2.81</td>
          <td>5.40</td>
          <td>3.76</td>
          <td>10.18</td>
      </tr>
      <tr>
          <td><strong>Ours</strong></td>
          <td><strong>2.99</strong></td>
          <td><strong>10.27</strong></td>
          <td><strong>3.48</strong></td>
          <td><strong>8.49</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: DDPM++ is from Song et al. (2021), the same work as ScoreSDE (it is the architecture optimized for VP/sub-VP SDEs). InterFlow matches ScoreSDE on CIFAR-10 NLL (2.99 bits per dim) while being simulation-free. FID is weaker than dedicated image models (10.27 vs 2.92 for ScoreSDE), reflecting the paper&rsquo;s primary focus on tractable likelihood rather than sample quality.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: All models were trained on a single NVIDIA A100 GPU.</li>
<li><strong>Training Time</strong>:
<ul>
<li>Tabular: $10^5$ steps.</li>
<li>Images: $1.5 \times 10^5$ to $6 \times 10^5$ steps.</li>
<li>Speedup: ~400x faster training than FFJORD on the MINIBOONE dataset.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>lucidrains/denoising-diffusion-pytorch (link defunct)</td>
          <td>Code</td>
          <td>MIT</td>
          <td>Base U-Net architecture used for image experiments; original GitHub account no longer available</td>
      </tr>
  </tbody>
</table>
<p>No official code release accompanies this paper. All tabular datasets (POWER, GAS, HEPMASS, MINIBOONE, BSDS300) are publicly available from prior work. CIFAR-10 and ImageNet are standard public benchmarks. Oxford Flowers 102 is also publicly available. Hyperparameters and architectures are fully specified in Tables 3 and 4 of the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Albergo, M. S., &amp; Vanden-Eijnden, E. (2023). Building Normalizing Flows with Stochastic Interpolants. <em>The Eleventh International Conference on Learning Representations</em>.</p>
<p><strong>Publication</strong>: ICLR 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{albergoBuildingNormalizingFlows2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Building {{Normalizing Flows}} with {{Stochastic Interpolants}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{The {{Eleventh International Conference}} on {{Learning Representations}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Albergo, Michael Samuel and {Vanden-Eijnden}, Eric}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://openreview.net/forum?id=li7qeBbCR1t}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=li7qeBbCR1t">OpenReview</a></li>
<li><a href="https://arxiv.org/abs/2209.15571">arXiv</a></li>
</ul>
]]></content:encoded></item><item><title>Translating InChI to IUPAC Names with Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/</guid><description>Sequence-to-sequence Transformer translating InChI identifiers to IUPAC names with 91% accuracy on organic compounds.</description><content:encoded><![CDATA[<h2 id="primary-contribution-a-transformer-based-method">Primary Contribution: A Transformer-Based Method</h2>
<p>This is primarily a <strong>Method</strong> paper. It adapts a specific architecture (Transformer) to a specific task (InChI-to-IUPAC translation) and evaluates its performance against both machine learning and commercial baselines. It also has a secondary <strong>Resource</strong> contribution, as the trained model and scripts are released as open-source software.</p>
<h2 id="motivation-the-bottleneck-in-algorithmic-iupac-nomenclature">Motivation: The Bottleneck in Algorithmic IUPAC Nomenclature</h2>
<p>Generating correct IUPAC names is difficult due to the comprehensive but complex rules defined by the International Union of Pure and Applied Chemistry. Commercial software generates names from structures but remains closed-source with opaque methodologies and frequent inter-package disagreements. Open identifiers like InChI and SMILES lack direct human readability. This creates a need for an open, automated method to generate informative IUPAC names from standard identifiers like InChI, which are ubiquitous in online chemical databases.</p>
<h2 id="novelty-treating-chemical-translation-as-a-character-level-sequence">Novelty: Treating Chemical Translation as a Character-Level Sequence</h2>
<p>The key novelty is treating chemical nomenclature translation as a character-level sequence-to-sequence problem using a Transformer architecture, specifically using <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> as the source language.</p>
<ul>
<li>Standard Neural Machine Translation (NMT) uses sub-word tokenization. This model processes InChI and predicts IUPAC names character-by-character.</li>
<li>It demonstrates that character-level tokenization outperforms byte-pair encoding or unigram models for this specific chemical task.</li>
<li>It uses InChI&rsquo;s standardization to avoid the canonicalization issues inherent in SMILES-based approaches.</li>
<li>The attention mechanism allows the decoder to align specific parts of the generated IUPAC name with corresponding structural features in the source InChI string, operating via the standard scaled dot-product attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$</li>
</ul>
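The attention operation above is compact in code. A minimal NumPy version of scaled dot-product attention (single head, no batching or masking, for illustration only):

```python
import numpy as np

# Minimal scaled dot-product attention, matching the formula above.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))   # 3 decoder positions, d_k = 8
K = rng.standard_normal((5, 8))   # 5 encoder (InChI) positions
V = rng.standard_normal((5, 8))
out = attention(Q, K, V)
print(out.shape)  # (3, 8)
```

The softmax weights are what the authors visualize when aligning generated name fragments to InChI substrings.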
<h2 id="methodology--experimental-validation">Methodology &amp; Experimental Validation</h2>
<ul>
<li><strong>Training:</strong> The model was trained on 10 million InChI/IUPAC pairs sampled from PubChem using a character-level objective. The model is supervised using categorical cross-entropy loss across the vocabulary of characters:
$$ \mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) $$</li>
<li><strong>Ablation Studies:</strong> The authors experimentally validated architecture choices, finding that LSTM models and sub-word tokenization (BPE) performed worse than the Transformer with character tokenization. They also optimized dropout rates.</li>
<li><strong>Performance Benchmarking:</strong> The model was evaluated on a held-out test set of 200,000 samples. Performance was quantified primarily by Whole-Name Accuracy and Normalized Edit Distance (based on the Damerau-Levenshtein distance, scaled by the maximum string length).</li>
<li><strong>Commercial Comparison:</strong> The authors compared their model against four major commercial packages (ACD/I-Labs, ChemAxon, Mestrelab, and PubChem&rsquo;s Lexichem). However, this evaluation used a highly limited test set of only 100 molecules, restricting the statistical confidence of the external baseline.</li>
<li><strong>Error Analysis:</strong> They analyzed performance across different chemical classes (organics, charged species, macrocycles, inorganics) and visualized attention coefficients to interpret model focus.</li>
</ul>
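The normalized edit-distance metric can be sketched as follows. This uses the restricted Damerau-Levenshtein (optimal string alignment) variant, which is an assumption on my part; the normalization by the longer string's length is as described above:

```python
# Sketch of the normalized edit-distance metric: Damerau-Levenshtein
# distance (restricted / optimal string alignment variant, assumed here)
# scaled by the maximum string length.
def dl_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def normalized_edit_distance(pred, ref):
    return dl_distance(pred, ref) / max(len(pred), len(ref), 1)

# one substituted locant out of 19 characters
print(normalized_edit_distance("2-methylpropan-1-ol", "2-methylpropan-2-ol"))
```

A single wrong locant thus costs only ~5% edit distance while failing whole-name accuracy entirely, which is why the two metrics are reported together.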
<h2 id="key-results-and-the-inorganic-challenge">Key Results and the Inorganic Challenge</h2>
<ul>
<li><strong>High Accuracy on Organics:</strong> The model achieved 91% whole-name accuracy on the test set, performing particularly well on organic compounds.</li>
<li><strong>Comparable to Commercial Tools:</strong> On the limited 100-molecule benchmark, the edit distance between the model&rsquo;s predictions and commercial packages (15-23%) was similar to the variation found <em>between</em> the commercial packages themselves (16-21%).</li>
<li><strong>Limitations on Inorganics:</strong> The model performed poorly on inorganic (14% accuracy) and organometallic compounds (20% accuracy). This is attributed to inherent data limitations in the standard InChI format (which deliberately disconnects metal atoms from their ligands) and low training data coverage for those classes.</li>
<li><strong>Character-Level Superiority:</strong> Character-level tokenization was found to be essential; byte-pair encoding reduced accuracy significantly.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was derived from <a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem&rsquo;s public FTP server</a> (<code>CID-SMILES.gz</code> and <code>CID-IUPAC.gz</code>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Raw</strong></td>
          <td>PubChem</td>
          <td>100M pairs</td>
          <td>Filtered for length (InChI &lt; 200 chars, IUPAC &lt; 150 chars). 132k unparseable SMILES dropped.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Subsampled</td>
          <td>10M pairs</td>
          <td>Random sample from the filtered set.</td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>Held-out</td>
          <td>10,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>Held-out</td>
          <td>200,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Tokenization</strong></td>
          <td>Vocab</td>
          <td>InChI: 66 chars<br>IUPAC: 70 chars</td>
          <td>Character-level tokenization. Spaces treated as tokens.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: OpenNMT-py 2.0.0 (using PyTorch). Training scripts and vocabularies are available as supplementary files to the original publication. Pre-trained model weights are hosted on <a href="https://doi.org/10.5281/zenodo.5081159">Zenodo</a>.</li>
<li><strong>Architecture Type</strong>: Transformer Encoder-Decoder.</li>
<li><strong>Optimization</strong>: ADAM optimizer ($\beta_1=0.9, \beta_2=0.998$).</li>
<li><strong>Learning Rate</strong>: Linear warmup over 8,000 steps to a peak of 0.0005, then decayed in proportion to the inverse square root of the step number.</li>
<li><strong>Regularization</strong>:
<ul>
<li>Dropout: 0.1 (applied to dense and attention layers).</li>
<li>Label Smoothing: Magnitude 0.1.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Teacher forcing used for both training and validation.</li>
<li><strong>Gradient Accumulation</strong>: Gradients accumulated over 4 batches before updating parameters.</li>
<li><strong>Inference</strong>: Beam search with width 10 and length penalty 1.0.</li>
</ul>
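<p>The warmup-then-decay rule is a &ldquo;Noam&rdquo;-style schedule; a minimal sketch follows, assuming only the stated 8,000-step warmup and 0.0005 peak (any other constants OpenNMT-py applies are not reproduced here).</p>

```python
import math

PEAK_LR = 5e-4   # rate reached at the end of warmup
WARMUP = 8000    # linear warmup steps

def learning_rate(step):
    """Linear warmup to PEAK_LR over WARMUP steps, then decay
    proportional to 1/sqrt(step)."""
    step = max(step, 1)
    return PEAK_LR * min(step / WARMUP, math.sqrt(WARMUP / step))
```

The two branches meet exactly at step 8,000, where the schedule attains its peak of 0.0005.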
<h3 id="models">Models</h3>
<ul>
<li><strong>Structure</strong>: 6 layers in encoder, 6 layers in decoder.</li>
<li><strong>Attention</strong>: 8 heads per attention sub-layer.</li>
<li><strong>Dimensions</strong>:
<ul>
<li>Feed-forward hidden state size: 2048.</li>
<li>Embedding vector length: 512.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Glorot&rsquo;s method.</li>
<li><strong>Position</strong>: Positional encodings added to the token embeddings.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported include <strong>Whole-Name Accuracy</strong> (percentage of exact matches) and <strong>Normalized Edit Distance</strong> (Damerau-Levenshtein, scale 0-1).</p>
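<p>A minimal sketch of the normalized edit distance, assuming the restricted (optimal string alignment) Damerau-Levenshtein variant normalized by the longer string&rsquo;s length; the paper does not specify which variant it uses.</p>

```python
def normalized_edit_distance(a, b):
    """Restricted Damerau-Levenshtein distance divided by the longer
    string's length, giving a score in [0, 1] (0 = identical)."""
    if not a and not b:
        return 0.0
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n] / max(m, n)
```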
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy (All)</td>
          <td>91%</td>
          <td>N/A</td>
          <td>Test set of 200k samples.</td>
      </tr>
      <tr>
          <td>Accuracy (Inorganic)</td>
          <td>14%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Organometallic)</td>
          <td>20%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Charged)</td>
          <td>79%</td>
          <td>N/A</td>
          <td>Test set subset.</td>
      </tr>
      <tr>
          <td>Accuracy (Rajan)</td>
          <td>72%</td>
          <td>N/A</td>
          <td>Comparative ML model (STOUT).</td>
      </tr>
      <tr>
          <td>Edit Dist (Organic)</td>
          <td>$0.02 \pm 0.03$</td>
          <td>N/A</td>
          <td>Very high similarity for organics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Inorganic)</td>
          <td>$0.32 \pm 0.20$</td>
          <td>N/A</td>
          <td>Poor performance on inorganics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Organometallic)</td>
          <td>$0.37 \pm 0.24$</td>
          <td>N/A</td>
          <td>Poor performance on organometallics.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla K80.</li>
<li><strong>Training Time</strong>: 7 days.</li>
<li><strong>Throughput</strong>: ~6000 tokens/sec (InChI) and ~3800 tokens/sec (IUPAC).</li>
<li><strong>Batch Size</strong>: 4096 tokens (approx. 30 compounds).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5081159">InChI to IUPAC model</a></td>
          <td>Model</td>
          <td>CC BY 4.0</td>
          <td>Pre-trained Transformer weights (551 MB), requires OpenNMT-py 2.0.0</td>
      </tr>
      <tr>
          <td><a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem FTP</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source data: CID-SMILES.gz and CID-IUPAC.gz</td>
      </tr>
      <tr>
          <td>Training scripts &amp; vocabularies</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Included as supplementary files with the publication</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Handsel, J., Matthews, B., Knight, N. J., &amp; Coles, S. J. (2021). Translating the InChI: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier. <em>Journal of Cheminformatics</em>, 13(1), 79. <a href="https://doi.org/10.1186/s13321-021-00535-x">https://doi.org/10.1186/s13321-021-00535-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{handselTranslatingInChIAdapting2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Translating the {{InChI}}: Adapting Neural Machine Translation to Predict {{IUPAC}} Names from a Chemical Identifier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Translating the {{InChI}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Handsel, Jennifer and Matthews, Brian and Knight, Nicola J. and Coles, Simon J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00535-x}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine&#39;s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91\%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention,GPU,InChI,IUPAC,seq2seq,Transformer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Struct2IUPAC: Translating SMILES to IUPAC via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/</guid><description>A Transformer-based model for translating between SMILES strings and IUPAC names, trained on 47M PubChem examples, achieving 98.9% accuracy with verification.</description><content:encoded><![CDATA[<h2 id="struct2iupac-as-a-methodological-shift">Struct2IUPAC as a Methodological Shift</h2>
<p>This is primarily a <strong>Method</strong> paper with significant elements of <strong>Position</strong>.</p>
<ul>
<li><strong>Method</strong>: The authors propose a specific neural architecture (Transformer with custom tokenization) and a verification pipeline (round-trip check) to solve the SMILES $\leftrightarrow$ IUPAC translation task. They rigorously benchmark this against rule-based baselines (OPSIN).</li>
<li><strong>Position</strong>: The authors explicitly argue for a paradigm shift, suggesting that &ldquo;heavy&rdquo; neural architectures should replace complex, costly rule-based legacy systems even for &ldquo;exact&rdquo; algorithmic tasks.</li>
</ul>
<h2 id="the-cost-of-rule-based-chemical-naming">The Cost of Rule-Based Chemical Naming</h2>
<ul>
<li><strong>Complexity of Naming</strong>: Generating IUPAC names manually is error-prone and requires deep algorithmic knowledge.</li>
<li><strong>Lack of Open Source Tools</strong>: While open-source tools exist for Name-to-Structure (e.g., OPSIN), there were no open-source tools for the inverse &ldquo;Structure-to-Name&rdquo; conversion at the time of writing.</li>
<li><strong>Cost of Development</strong>: Developing rule-based converters &ldquo;from scratch&rdquo; is prohibitively expensive and time-consuming compared to training a neural model on existing data.</li>
</ul>
<h2 id="struct2iupac-core-innovation">Struct2IUPAC Core Innovation</h2>
<ul>
<li><strong>Struct2IUPAC</strong>: The first effective open-source neural model for <a href="/notes/chemistry/molecular-representations/name-translation/stout-v2/">converting SMILES to IUPAC names</a>, treating chemical translation as a Neural Machine Translation (NMT) problem.</li>
<li><strong>Verification Loop</strong>: A novel inference pipeline that generates multiple candidates via beam search and validates them using a reverse converter (OPSIN) to ensure the generated name maps back to the original structure.</li>
<li><strong>Custom Tokenization</strong>: A manually curated rule-based tokenizer for IUPAC names that handles specific chemical suffixes, prefixes, and stereochemical markers.</li>
</ul>
<h2 id="experimental-setup-and-stress-testing">Experimental Setup and Stress Testing</h2>
<ul>
<li><strong>Accuracy Benchmarking</strong>: The models were tested on a held-out subset of 100,000 molecules from PubChem. The authors measured accuracy across different beam sizes (1, 3, 5).</li>
<li><strong>Comparison to Rules</strong>: The neural IUPAC2Struct model was compared directly against the rule-based OPSIN tool.</li>
<li><strong>Stress Testing</strong>:
<ul>
<li><strong>Sequence Length</strong>: Evaluated performance across varying token lengths, identifying a &ldquo;sweet spot&rdquo; (10-60 tokens) and failure modes for very short (e.g., methane) or long molecules.</li>
<li><strong>Stereochemistry</strong>: Tested on &ldquo;stereo-dense&rdquo; compounds. The authors define a &ldquo;stereo-density&rdquo; index ($I$) as the ratio of stereocenters ($S$) to total tokens ($N$):
$$I = \frac{S}{N}$$
They observed a performance drop for these dense molecules, though the model still handled many stereocenters robustly.</li>
<li><strong>Tautomers</strong>: Verified the model&rsquo;s ability to handle different tautomeric forms (e.g., Guanine and Uracil variants).</li>
</ul>
</li>
<li><strong>Latency Analysis</strong>: Benchmarked inference speeds on CPU vs. GPU relative to output sequence length.</li>
</ul>
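<p>The stereo-density index $I = S/N$ can be computed directly from a token list. Counting the SMILES chirality tags <code>@</code>/<code>@@</code> as stereocenters is my illustrative stand-in; the paper counts stereocenters on the molecule itself.</p>

```python
def stereo_density(tokens, stereo_markers=("@", "@@")):
    """Stereo-density index I = S / N: number of stereocenter markers S
    over the total token count N. Marker-counting is a stand-in for a
    proper stereocenter count."""
    s = sum(1 for t in tokens if t in stereo_markers)
    return s / len(tokens)
```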
<h2 id="benchmarks-and-outcomes">Benchmarks and Outcomes</h2>
<ul>
<li><strong>High Accuracy</strong>: The Struct2IUPAC model achieved <strong>98.9% accuracy</strong> (Beam 5 with verification). The reverse model (IUPAC2Struct) achieved <strong>99.1%</strong>, comparable to OPSIN&rsquo;s 99.4%.</li>
<li><strong>Distribution Modeling vs. Intuition</strong>: The authors claim the model infers &ldquo;chemical logic&rdquo; because it correctly generates multiple valid IUPAC names for single molecules where naming ambiguity exists (e.g., parent group selection). However, this more likely reflects the Transformer successfully modeling the high-frequency conditional probability distribution of synonymous names present in the PubChem training data, rather than learning intrinsic chemical rules.</li>
<li><strong>Production Readiness</strong>: Inference on GPU takes less than 0.5 seconds even for long names, making it viable for production use.</li>
<li><strong>Paradigm Shift</strong>: The authors conclude that neural networks are a viable, cost-effective alternative to developing rule-based algorithms for legacy notation conversion.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized the PubChem database.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total</strong></td>
          <td>PubChem</td>
          <td>~95M</td>
          <td>Filtered for RDKit compatibility</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Split A</td>
          <td>47,312,235</td>
          <td>Random 50% split</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Split B</td>
          <td>47,413,850</td>
          <td>Random 50% split</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Cleaning</strong>: Molecules that could not be processed by RDKit were removed. Molecules containing tokens not in the tokenizer (e.g., aromatic selenium) were excluded.</li>
<li><strong>Availability</strong>: A subset of 100,000 test molecules is available on GitHub (<code>data/test_100000.csv</code>) and Zenodo. The full train/test splits are not explicitly provided.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SMILES</strong>: Character-based tokenization.</li>
<li><strong>IUPAC</strong>: Custom rule-based tokenizer splitting suffixes (<code>-one</code>, <code>-al</code>), prefixes (<code>-oxy</code>, <code>-di</code>), and special symbols (<code>(</code>, <code>)</code>, <code>R(S)</code>).</li>
</ul>
</li>
<li><strong>Verification Step</strong>:
<ol>
<li>Generate $N$ names using Beam Search ($N=5$).</li>
<li>Reverse translate the candidate name using OPSIN.</li>
<li>Check if the OPSIN structure matches the original input SMILES.</li>
<li>Display the first verified match; otherwise, report failure.</li>
</ol>
</li>
</ul>
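<p>The four-step verification loop above can be sketched as follows. The three callables are stand-ins: <code>generate_candidates</code> plays the neural Struct2IUPAC model, <code>name_to_smiles</code> plays OPSIN, and <code>canonicalize</code> would be an RDKit canonicalizer; none of these names come from the paper&rsquo;s code.</p>

```python
def verified_name(smiles, generate_candidates, name_to_smiles, canonicalize, beam=5):
    """Round-trip verification: return the first beam-search candidate
    whose back-translation matches the input structure, else None."""
    target = canonicalize(smiles)
    for name in generate_candidates(smiles, beam):
        back = name_to_smiles(name)          # OPSIN's role
        if back is not None and canonicalize(back) == target:
            return name
    return None
```

With stub callables this reproduces the control flow: an unverifiable first candidate is skipped, and failure is reported only when no candidate round-trips.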
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Standard Transformer with 6 encoder layers and 6 decoder layers.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Attention Heads: 8</li>
<li>Attention Dimension ($d_{\text{model}}$): 512</li>
<li>Feed-Forward Dimension ($d_{\text{ff}}$): 2048</li>
</ul>
</li>
<li><strong>Training Objective</strong>: The models were trained using standard autoregressive cross-entropy loss over the target token sequence $y$ given the input string $x$:
$$\mathcal{L} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x)$$</li>
<li><strong>Training</strong>: Two separate models were trained: <code>Struct2IUPAC</code> (SMILES $\to$ IUPAC) and <code>IUPAC2Struct</code> (IUPAC $\to$ SMILES).</li>
<li><strong>Availability</strong>: Code for model architecture is provided in the GitHub repository. Pre-trained weights for the IUPAC2Struct model are available, but the Struct2IUPAC model weights are not publicly released, meaning researchers would need to retrain that model on their own PubChem data to reproduce those results.</li>
</ul>
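<p>Under teacher forcing, the objective above reduces to summing the negative log-probabilities the model assigns to each ground-truth token; a minimal sketch:</p>

```python
import math

def sequence_nll(token_probs):
    """Autoregressive cross-entropy L = -sum_t log P(y_t | y_<t, x),
    where token_probs holds the model's probability for each
    ground-truth target token in order."""
    return -sum(math.log(p) for p in token_probs)
```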
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a random subset of 100,000 molecules from the test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Beam Size</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>1</td>
          <td>96.1%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>5</td>
          <td>98.9%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>1</td>
          <td>96.6%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>5</td>
          <td>99.1%</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Robustness</strong>: Accuracy drops significantly for augmented (non-canonical) SMILES (37.16%) and stereo-enriched compounds (66.52%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Infrastructure</strong>: 4 $\times$ Tesla V100 GPUs and 36 CPUs.</li>
<li><strong>Training Time</strong>: Approximately 10 days under full load.</li>
<li><strong>Inference Speed</strong>: &lt;0.5 s per molecule on GPU; latency scales linearly with the output token length.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sergsb/IUPAC2Struct">IUPAC2Struct (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Transformer code and pre-trained IUPAC2Struct model</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4280814">Test data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>100k test molecules, OPSIN failure cases, model failure cases</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/smiles2iupac">Struct2IUPAC web demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online interface for SMILES to IUPAC conversion</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krasnov, L., Khokhlov, I., Fedorov, M. V., &amp; Sosnin, S. (2021). Transformer-based artificial neural networks for the conversion between chemical notations. <em>Scientific Reports</em>, 11(1), 14798. <a href="https://doi.org/10.1038/s41598-021-94082-y">https://doi.org/10.1038/s41598-021-94082-y</a></p>
<p><strong>Publication</strong>: Scientific Reports 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krasnovTransformerbasedArtificialNeural2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Transformer-Based Artificial Neural Networks for the Conversion between Chemical Notations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krasnov, Lev and Khokhlov, Ivan and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14798}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-021-94082-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/sergsb/IUPAC2Struct">GitHub Repository</a></li>
<li><a href="https://app.syntelly.com/smiles2iupac">Web Demo</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT: SMILES to IUPAC Names via Neural Machine Translation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout/</guid><description>A deep-learning neural machine translation approach to translate between SMILES strings and IUPAC names using the STOUT model.</description><content:encoded><![CDATA[<h2 id="contribution-translating-chemistry-as-a-language">Contribution: Translating Chemistry as a Language</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong secondary contribution as a <strong>Resource</strong> paper.</p>
<ul>
<li><strong>Method</strong>: It proposes a neural machine translation (NMT) architecture to approximate the complex, rule-based algorithm of IUPAC naming, treating it as a language translation task.</li>
<li><strong>Resource</strong>: It provides an open-source tool and trained models to the community, addressing a gap where such functionality was previously limited to proprietary software.</li>
</ul>
<h2 id="motivation-democratizing-iupac-nomenclature">Motivation: Democratizing IUPAC Nomenclature</h2>
<p>The International Union of Pure and Applied Chemistry (IUPAC) naming scheme is universally accepted but algorithmically complex. Generating these names correctly is challenging for humans, and automated generation is largely missing from major open-source toolkits like CDK, RDKit, or Open Babel. While reliable commercial tools exist (e.g., ChemAxon&rsquo;s <code>molconvert</code>), there was a lack of open-source alternatives for the scientific community. STOUT aims to fill this gap using a data-driven approach.</p>
<h2 id="core-innovation-sequence-to-sequence-naming">Core Innovation: Sequence-to-Sequence Naming</h2>
<ul>
<li><strong>Language Translation Approach</strong>: The authors treat chemical representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>/<a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) and IUPAC names as two different languages, applying Neural Machine Translation (NMT) to translate between them.</li>
<li><strong>Use of SELFIES</strong>: The work establishes SELFIES (Self-Referencing Embedded Strings) as a robust choice over SMILES for deep learning tokenization in this specific task, capitalizing on its syntactic robustness.</li>
<li><strong>Hardware Acceleration</strong>: The paper benchmarks GPU versus TPU training and highlights the practical necessity of Tensor Processing Units (TPUs) for training large-scale chemical language models, reducing training time by an order of magnitude.</li>
</ul>
<h2 id="methodology--translation-validation">Methodology &amp; Translation Validation</h2>
<ul>
<li><strong>Data Scale</strong>: The model was trained on datasets of 30 million and 60 million molecules derived from PubChem.</li>
<li><strong>Hardware Benchmarking</strong>: Training efficiency was compared between an nVidia Tesla V100 GPU and Google TPU v3-8/v3-32 units.</li>
<li><strong>Bidirectional Translation</strong>: The system was tested on two distinct tasks:
<ol>
<li><strong>Forward</strong>: SELFIES → IUPAC names</li>
<li><strong>Reverse</strong>: IUPAC names → SELFIES</li>
</ol>
</li>
<li><strong>Validation</strong>: Performance was evaluated on a held-out test set of 2.2 million molecules.</li>
</ul>
<h2 id="translation-accuracy--hardware-scaling">Translation Accuracy &amp; Hardware Scaling</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved an average BLEU score of ~90% and a Tanimoto similarity index &gt; 0.9 for both translation directions.</li>
<li><strong>Generalization</strong>: Even when predictions were textually mismatched (low BLEU score), the underlying chemical structures often remained highly similar (high Tanimoto similarity), suggesting the system captures fundamental chemical semantics rather than merely memorizing strings.</li>
<li><strong>Impact of Data Size</strong>: Expanding training from 30 million to 60 million molecules yielded consistent performance gains without saturating.</li>
<li><strong>Hardware Necessity</strong>: Training on TPUs proved up to 54 times faster than a standard GPU baseline (Tesla V100), making large-scale training computationally tractable.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Current repo hosts STOUT V2.0 transformer models; V1 RNN code available in earlier commits</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Public Domain</td>
          <td style="text-align: left">Source of 111M molecules; 30M/60M training subsets not directly provided</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The dataset was curated from PubChem (111 million molecules). Note that the specific 30M and 60M subsets are not directly linked in the publication repository, which means a user would have to reconstruct the filtering process.</p>
<p><strong>Preprocessing &amp; Filtering</strong>:</p>
<ul>
<li>Explicit hydrogens removed; converted to canonical SMILES.</li>
<li><strong>Filtering Rules</strong>: MW &lt; 1500 Da, no counter ions, limited element set (C, H, O, N, P, S, F, Cl, Br, I, Se, B), no hydrogen isotopes, 3-40 bonds, no charged groups.</li>
<li><strong>Ground Truth Generation</strong>: ChemAxon&rsquo;s <code>molconvert</code> (Marvin Suite 20.15) was used to generate target IUPAC names for training.</li>
<li><strong>Representation</strong>: All SMILES were converted to SELFIES for training.</li>
</ul>
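<p>The published filtering rules can be expressed as one predicate. The sketch below applies them to a plain dict of precomputed properties; in practice these would be derived from an RDKit <code>Mol</code>, and the dict keys are my invention.</p>

```python
# Element set permitted by the STOUT filtering rules.
ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_filters(mol):
    """Apply the paper's filtering rules to a dict of precomputed
    molecular properties (a stand-in for a real Mol object)."""
    return (mol["mw"] < 1500                              # MW < 1500 Da
            and not mol["has_counter_ion"]
            and set(mol["elements"]) <= ALLOWED_ELEMENTS
            and not mol["has_h_isotopes"]
            and 3 <= mol["n_bonds"] <= 40
            and not mol["has_charged_groups"])
```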
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">PubChem Filtered</td>
          <td style="text-align: left">30M &amp; 60M</td>
          <td style="text-align: left">Two distinct training sets created.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">PubChem Held-out</td>
          <td style="text-align: left">2.2M</td>
          <td style="text-align: left">Molecules not present in training sets; uniform token frequency.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SELFIES</strong>: Split iteratively by brackets <code>[</code> and <code>]</code>.</li>
<li><strong>IUPAC</strong>: Split via punctuation (<code>(</code>, <code>)</code>, <code>{</code>, <code>}</code>, <code>[</code>, <code>]</code>, <code>-</code>, <code>.</code>, <code>,</code>) and a discrete set of sub-word chemical morphemes (e.g., <code>methyl</code>, <code>benzene</code>, <code>fluoro</code>).</li>
<li><strong>Padding</strong>: SELFIES padded to 48 tokens; IUPAC padded to 78 tokens. Start and end markers are appended to every sequence.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a learning rate of $0.0005$.</li>
<li><strong>Objective Function</strong>: Sparse categorical cross-entropy over the vocabulary $V$, with one-hot target $y$ and predicted distribution $\hat{y}$ for each token:
$$ \mathcal{L} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i) $$</li>
</ul>
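<p>The two tokenization rules can be sketched with regular expressions. The morpheme list here is a tiny toy subset; STOUT&rsquo;s actual sub-word inventory is much larger, and the fallback patterns are my assumptions.</p>

```python
import re

def tokenize_selfies(s):
    """Split a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", s)

# Toy morpheme inventory for illustration only.
MORPHEMES = ["methyl", "benzene", "fluoro"]
PUNCT = r"[(){}\[\]\-.,]"

def tokenize_iupac(name):
    """Split an IUPAC name on punctuation, known chemical morphemes,
    locant numbers, and (as a fallback) runs of letters."""
    pattern = "|".join(map(re.escape, MORPHEMES)) + "|" + PUNCT + r"|\d+|[a-z]+"
    return re.findall(pattern, name)
```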
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder sequence-to-sequence network with a Bahdanau attention mechanism for context weighting.</li>
<li><strong>Components</strong>:
<ul>
<li><strong>Encoder/Decoder</strong>: Recurrent Neural Networks (RNN) constructed using Gated Recurrent Units (GRU).</li>
<li><strong>Attention</strong>: Bahdanau (additive) soft attention, which computes alignment scores between the previous decoder state $s_{t-1}$ and each encoder hidden state $h_j$ to softly weight the encoder states:
$$ e_{tj} = v_a^\top \tanh(W_a s_{t-1} + U_a h_j) $$</li>
<li><strong>Embedding</strong>: Decoder output passes through a continuous embedding layer before being concatenated with the attention context vector.</li>
</ul>
</li>
<li><strong>Implementation</strong>: Python 3 backend using TensorFlow 2.3.0. <em>Note: The linked GitHub repository currently defaults to the STOUT V2.0 transformer models, so researchers aiming to reproduce this specific V1 RNN paper should reference the older tag/commit history.</em></li>
</ul>
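<p>The additive scoring above can be sketched in NumPy (shapes and dimension names are illustrative; the paper's implementation uses TensorFlow layers):</p>

```python
import numpy as np

def bahdanau_attention(s_prev, H, W_a, U_a, v_a):
    """e_tj = v_a^T tanh(W_a s_{t-1} + U_a h_j), softly weighting encoder states.

    s_prev: (d_s,)   previous decoder hidden state s_{t-1}
    H:      (T, d_h) encoder hidden states h_1..h_T
    """
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a  # (T,) alignment scores e_tj
    a = np.exp(e - e.max())
    a /= a.sum()                                   # softmax -> attention weights
    context = a @ H                                # (d_h,) context vector
    return a, context

rng = np.random.default_rng(0)
a, ctx = bahdanau_attention(rng.normal(size=8), rng.normal(size=(5, 8)),
                            rng.normal(size=(16, 8)), rng.normal(size=(16, 8)),
                            rng.normal(size=16))
```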
<h3 id="evaluation">Evaluation</h3>
<p>The metrics cover both linguistic accuracy and cheminformatic structural correctness:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Details</th>
          <th style="text-align: left">Result (60M Model)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU Score</strong></td>
          <td style="text-align: left">NLTK sentence BLEU (unigram to 4-gram)</td>
          <td style="text-align: left">0.94 (IUPAC $\to$ SELFIES)</td>
          <td style="text-align: left">Exact text overlap. Serves as a strictly syntactic proxy.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto Similarity</strong></td>
          <td style="text-align: left">PubChem fingerprints via CDK</td>
          <td style="text-align: left">0.98 (Valid IUPAC names)</td>
          <td style="text-align: left">Evaluates substructure alignment over bit vectors, $T(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert}$.</td>
      </tr>
  </tbody>
</table>
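<p>The Tanimoto computation over fingerprint bits can be sketched in pure Python; the paper uses CDK PubChem fingerprints, while the bit sets below are toy stand-ins:</p>

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """T(A, B) = |A intersect B| / |A union B| over the indices of set bits."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Toy fingerprints: indices of the bits set for each molecule.
fp_true = {1, 4, 9, 16, 25}
fp_pred = {1, 4, 9, 16, 36}
sim = tanimoto(fp_true, fp_pred)  # 4 shared bits / 6 union bits ~ 0.667
```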
<h3 id="hardware">Hardware</h3>
<p>Comparison of hardware efficiency for training large chemical language models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hardware</th>
          <th style="text-align: left">Batch Size</th>
          <th style="text-align: left">Time per Epoch (15M subset)</th>
          <th style="text-align: left">Speedup Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>GPU (Tesla V100)</strong></td>
          <td style="text-align: left">256</td>
          <td style="text-align: left">~27 hours</td>
          <td style="text-align: left">1x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-8</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~2 hours</td>
          <td style="text-align: left">13x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-32</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~0.5 hours</td>
          <td style="text-align: left">54x</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. <em>Journal of Cheminformatics</em>, 13(1), 34. <a href="https://doi.org/10.1186/s13321-021-00512-4">https://doi.org/10.1186/s13321-021-00512-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTSMILESIUPAC2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{STOUT: SMILES to IUPAC Names Using Neural Machine Translation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{STOUT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00512-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-09-22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention mechanism,Chemical language,Deep neural network,DeepSMILES,IUPAC names,Neural machine translation,Recurrent neural network,SELFIES,SMILES}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/stout-v2/">STOUT V2.0 Note</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/">Struct2IUPAC Note</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/">HandSEL Note (InChI to IUPAC)</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT V2.0: Transformer-Based SMILES to IUPAC Translation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout-v2/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout-v2/</guid><description>A Transformer-based model for translating SMILES to IUPAC names, trained on ~1 billion molecules, achieving ~0.99 BLEU score on benchmarks.</description><content:encoded><![CDATA[<h2 id="paper-contribution--methodological-scope">Paper Contribution &amp; Methodological Scope</h2>
<p><strong>Method (Primary) / Resource (Secondary)</strong></p>
<p>This paper presents a <strong>Methodological</strong> contribution by developing and validating a Transformer-based neural machine translation model (STOUT V2) for bidirectional chemical nomenclature (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> $\leftrightarrow$ IUPAC). It systematically compares this new architecture against previous RNN-based baselines (<a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT V1</a>) and performs ablation studies on tokenization strategies.</p>
<p>It also serves as a significant <strong>Resource</strong> contribution by generating a massive training dataset of nearly 1 billion SMILES-IUPAC pairs (curated via commercial Lexichem software) and releasing the resulting models and code as open-source tools for chemical naming.</p>
<h2 id="the-need-for-robust-open-source-iupac-nomenclature-rules">The Need for Robust Open-Source IUPAC Nomenclature Rules</h2>
<p>Assigning systematic IUPAC names to chemical structures requires adherence to a complex ruleset, making consistent manual assignment difficult. Deterministic, rule-based commercial tools such as OpenEye Lexichem and ChemAxon handle the task reliably, while existing open-source tools like OPSIN address only the reverse direction: parsing names into structures.</p>
<p>The previous version of STOUT (V1), based on RNNs/GRUs, achieved ~90% BLEU accuracy, with known limitations in capturing long-distance dependencies required for stereochemistry handling. This work uses the sequence-learning capabilities of Transformers combined with large-scale datasets to create a competitive open-source IUPAC naming tool.</p>
<h2 id="architectural-shift-and-billion-scale-training">Architectural Shift and Billion-Scale Training</h2>
<p>The primary advancements over previous iterations address both architecture and dataset scale:</p>
<ol>
<li><strong>Architecture Shift</strong>: Moving from an RNN-based Seq2Seq model to a <strong>Transformer-based architecture</strong> (4 layers, 8 heads), which captures intricate chemical patterns better than GRUs.</li>
<li><strong>Billion-Scale Training</strong>: Training on a dataset of nearly <strong>1 billion molecules</strong> (combining PubChem and ZINC15), significantly larger than the 60 million used for STOUT V1.</li>
<li><strong>Tokenization Strategy</strong>: Determining that <strong>character-wise tokenization</strong> for IUPAC names is superior to word-wise tokenization in terms of both accuracy and training efficiency (15% faster).</li>
</ol>
<h2 id="experimental-validation-and-scaling-limits">Experimental Validation and Scaling Limits</h2>
<p>The authors conducted three primary experiments to validate bidirectional translation (SMILES $\rightarrow$ IUPAC and IUPAC $\rightarrow$ SMILES):</p>
<ul>
<li><strong>Experiment 1 (Optimization)</strong>: Assessed the impact of dataset size (1M vs 10M vs 50M) and tokenization strategy on SMILES-to-IUPAC performance.</li>
<li><strong>Experiment 2 (Scaling)</strong>: Trained models on 110 million PubChem molecules for <strong>both</strong> forward and reverse translation tasks to test performance on longer sequences.</li>
<li><strong>Experiment 3 (Generalization)</strong>: Trained on the full ~1 billion dataset (PubChem + ZINC15) for both translation directions.</li>
<li><strong>External Validation</strong>: Benchmarked against an external dataset from ChEBI (1,485 molecules) and ChEMBL34 to test generalization to unseen data.</li>
</ul>
<p><strong>Evaluation Metrics</strong>:</p>
<ul>
<li><strong>Textual Accuracy</strong>: BLEU scores (1-4) and Exact String Match.</li>
<li><strong>Chemical Validity</strong>: Retranslation of generated names back to SMILES using OPSIN, followed by Tanimoto similarity checks (PubChem fingerprints) against the original input.</li>
</ul>
<h2 id="translation-accuracy-and-structural-validity">Translation Accuracy and Structural Validity</h2>
<ul>
<li><strong>Superior Performance</strong>: STOUT V2 achieved an average BLEU score of <strong>0.99</strong> (vs 0.94 for V1). While exact string matches varied by experiment (83-89%), the model notably achieved a perfect BLEU score (1.0) on <strong>97.49%</strong> of a specific test set where STOUT V1 only reached 66.65%.</li>
<li><strong>Structural Validity (&ldquo;Near Misses&rdquo;)</strong>: When the generated name differed from the ground-truth string, the re-generated structure often remained chemically valid. Across these divergent names, the average Tanimoto similarity between bit-vector fingerprints $A$ and $B$ was <strong>0.68</strong>, with
$$ T(A,B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert} $$
<em>Critique</em>: Note that an average Tanimoto coefficient of 0.68 typically suggests moderate structural similarity/drift, not an almost-identical &ldquo;near miss&rdquo; (which would be $&gt;0.85$). This implies the model constructs chemically related but structurally distinct outputs when it fails exact string matching.</li>
<li><strong>Tokenization</strong>: Character-level splitting for IUPAC names outperformed word-level splitting and was more computationally efficient.</li>
<li><strong>Data Imbalance &amp; Generalization</strong>: The model&rsquo;s drop in performance for sequences &gt;600 characters highlights a systemic issue in open chemical databases: long, highly complex SMILES strings are significantly underrepresented. Even billion-scale training datasets are still bound by the chemical diversity of their source material.</li>
<li><strong>Limitations</strong>:
<ul>
<li><strong>Preferred Names (PINs)</strong>: The model mimics Lexichem&rsquo;s naming conventions, generating valid IUPAC names distinct from strict <em>Preferred IUPAC Names</em> (PINs).</li>
<li><strong>Sequence Length</strong>: Performance degrades for very long SMILES (&gt;600 characters) due to scarcity in the training data.</li>
<li><strong>Algorithmic Distillation Bottleneck</strong>: Because the 1 billion training pairs were generated entirely by OpenEye&rsquo;s Lexichem, STOUT V2 acts as a knowledge distillation of that specific commercial algorithm. The model learns Lexichem’s heuristic mapping, specific dialects, and potential systematic errors, rather than deriving true nomenclature rules from first principles.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was derived from PubChem and ZINC15. Ground truth IUPAC names were generated using OpenEye Lexichem TK 2.8.1 to ensure consistency.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Exp 1)</strong></td>
          <td>PubChem Subset</td>
          <td>1M, 10M, 50M</td>
          <td>Selected via MaxMin algorithm for diversity</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 2)</strong></td>
          <td>PubChem</td>
          <td>110M</td>
          <td>Filtered for SMILES length &lt; 600</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 3)</strong></td>
          <td>PubChem + ZINC15</td>
          <td>~1 Billion</td>
          <td>999,637,326 molecules total</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChEBI</td>
          <td>1,485</td>
          <td>External validation set, non-overlapping with training</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Canonicalized, isomeric, and kekulized using RDKit (v2023.03.1).</li>
<li><strong>Formatting</strong>: Converted to TFRecord format in 100 MB chunks for TPU efficiency.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting. Atoms (e.g., &ldquo;Cl&rdquo;, &ldquo;Au&rdquo;), bonds, brackets, and digits are separate tokens.</li>
<li><strong>IUPAC Tokenization</strong>: <strong>Character-wise split</strong> was selected as the optimal strategy (treating every character as a token).</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler based on model dimensions.</li>
<li><strong>Loss Function</strong>: Trained to minimize a masked sparse categorical cross-entropy $L$ over the $N$ positions of the target sequence, where $p_{i, y_i}$ is the predicted probability of the target token $y_i$ at position $i$:
$$ L = - \sum_{i=1}^{N} m_i \log(p_{i, y_i}) $$
where $m_i \in \{0, 1\}$ masks padded positions.</li>
<li><strong>Code Availability</strong>: The <a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">main STOUT V2 repository</a> contains the inference package. The training pipeline/instructions (originally linked to a separate repo that is currently a 404) can still be found within the <a href="https://doi.org/10.5281/zenodo.6559438">Zenodo archive release</a>.</li>
</ul>
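<p>A NumPy sketch of the masked loss: the paper trains with TensorFlow, so the function below is only a minimal illustration of the masking logic, with made-up probabilities:</p>

```python
import numpy as np

def masked_sparse_ce(probs, targets, pad_id=0):
    """L = -sum_i m_i * log(p_{i, y_i}), normalized over non-padded positions.

    probs:   (N, V) predicted probabilities per sequence position
    targets: (N,)   integer target token ids y_i
    """
    mask = (targets != pad_id).astype(float)          # m_i: 0 at padding
    picked = probs[np.arange(len(targets)), targets]  # p_{i, y_i}
    return -(mask * np.log(picked)).sum() / mask.sum()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.9, 0.05, 0.05]])  # last position is padding
targets = np.array([1, 1, 0])          # token id 0 = padding
loss = masked_sparse_ce(probs, targets)
```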
<h3 id="models">Models</h3>
<p>The model follows the standard Transformer architecture from &ldquo;Attention is All You Need&rdquo; (Vaswani et al.).</p>
<ul>
<li><strong>Architecture</strong>: 4 Transformer layers (encoder/decoder stack).</li>
<li><strong>Attention</strong>: Multi-head attention with <strong>8 heads</strong>.</li>
<li><strong>Dimensions</strong>: Embedding size ($d_{model}$) = 512; Feed-forward dimension ($d_{ff}$) = 2048.</li>
<li><strong>Regularization</strong>: Dropout rate of 0.1.</li>
<li><strong>Context Window</strong>: Max input length (SMILES) = 600; Max output length (IUPAC) = 700-1000.</li>
<li><strong>Weights</strong>: Model weights for forward and reverse architectures are <a href="https://doi.org/10.5281/zenodo.13318286">available via Zenodo (v3)</a>.</li>
</ul>
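<p>The &ldquo;custom learning rate scheduler based on model dimensions&rdquo; mentioned under Algorithms is presumably the warmup schedule from Vaswani et al.; a sketch assuming that formula with $d_{model} = 512$ (the warmup step count is an assumption):</p>

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

<p>The rate rises linearly during warmup, peaks at <code>warmup_steps</code>, then decays proportionally to the inverse square root of the step number.</p>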
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on both string similarity and chemical structural integrity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BLEU Score</strong></td>
          <td>N-gram overlap</td>
          <td>Compared predicted IUPAC string to Ground Truth.</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Accuracy</td>
          <td>Binary 1/0 check for identical strings.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Structural Similarity</td>
          <td>Predicted Name $\rightarrow$ OPSIN $\rightarrow$ SMILES $\rightarrow$ Fingerprint comparison to input.</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT V2 GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Inference package (PyPI: STOUT-pypi)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13318286">Model Weights (Zenodo v3)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Forward and reverse translation weights</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6559438">Code Snapshot (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training pipeline archive</td>
      </tr>
      <tr>
          <td><a href="https://stout.decimer.ai">Web Application</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo with Ketcher, bulk submission, DECIMER integration</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was conducted entirely on Google Cloud Platform (GCP) TPUs.</p>
<ul>
<li><strong>STOUT V1</strong>: Trained on TPU v3-8.</li>
<li><strong>STOUT V2</strong>: Trained on <strong>TPU v4-128 pod slices</strong> (128 nodes).</li>
<li><strong>Large Scale (Exp 3)</strong>: Trained on <strong>TPU v4-256 pod slice</strong> (256 nodes).</li>
<li><strong>Training Time</strong>: Average of <strong>15 hours and 2 minutes per epoch</strong> for the 1 billion dataset.</li>
<li><strong>Framework</strong>: TensorFlow 2.15.0-pjrt with Keras.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2024). STOUT V2.0: SMILES to IUPAC name conversion using transformer models. <em>Journal of Cheminformatics</em>, 16(146). <a href="https://doi.org/10.1186/s13321-024-00941-x">https://doi.org/10.1186/s13321-024-00941-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTV20SMILES2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{STOUT V2}}.0: {{SMILES}} to {{IUPAC}} Name Conversion Using Transformer Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{STOUT V2}}.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{146}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00941-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://stout.decimer.ai">Web Application</a> (Includes Ketcher drawing, bulk submission, and DECIMER integration)</li>
<li><a href="https://decimer.ai">DECIMER Project</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT V1 Note</a></li>
<li><a href="https://zenodo.org/records/6559438">Zenodo Archive (Code Snapshot)</a></li>
</ul>
]]></content:encoded></item><item><title>OCSAug: Diffusion-Based Augmentation for Hand-Drawn OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</guid><description>A diffusion-based data augmentation pipeline (OCSAug) using DDPM and RePaint to improve optical chemical structure recognition on hand-drawn images.</description><content:encoded><![CDATA[<h2 id="document-taxonomy-ocsaug-as-a-novel-method">Document Taxonomy: OCSAug as a Novel Method</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>. It proposes a novel data augmentation pipeline (<strong>OCSAug</strong>) that integrates Denoising Diffusion Probabilistic Models (DDPM) and the RePaint algorithm to address the data scarcity problem in hand-drawn optical chemical structure recognition (OCSR). The contribution is validated through systematic benchmarking against existing augmentation techniques (RDKit, Randepict) and ablation studies on mask design.</p>
<h2 id="expanding-hand-drawn-training-data-for-ocsr">Expanding Hand-Drawn Training Data for OCSR</h2>
<p>A vast amount of molecular structure data exists in analog formats, such as hand-drawn diagrams in research notes or older literature. While OCSR models perform well on digitally rendered images, they struggle with hand-drawn images due to noise, varying handwriting styles, and distortions. Current datasets for hand-drawn images (e.g., DECIMER) are too small to train effective models, and existing augmentation tools (RDKit, Randepict) fail to generate sufficiently realistic hand-drawn variations.</p>
<h2 id="ocsaug-pipeline-masked-repaint-via-generative-ai">OCSAug Pipeline: Masked RePaint via Generative AI</h2>
<p>The core novelty is <strong>OCSAug</strong>, a three-phase pipeline that uses generative AI to synthesize training data:</p>
<ol>
<li><strong>DDPM + RePaint</strong>: It utilizes a DDPM to learn the distribution of hand-drawn images and the RePaint algorithm for inpainting.</li>
<li><strong>Structural Masking</strong>: It introduces <strong>vertical and horizontal stripe pattern masks</strong>. These masks selectively obscure parts of atoms or bonds, forcing the diffusion model to reconstruct them with irregular &ldquo;hand-drawn&rdquo; styles while preserving the underlying chemical topology.</li>
<li><strong>Label Transfer</strong>: Because the chemical structure is preserved during inpainting, the SMILES label from the original image is directly transferred to the augmented image, bypassing the need for re-annotation.</li>
</ol>
<h2 id="benchmarking-diffusion-augmentations-on-decimer">Benchmarking Diffusion Augmentations on DECIMER</h2>
<p>The authors evaluated OCSAug using the <strong>DECIMER dataset</strong>, specifically a &ldquo;drug-likeness&rdquo; subset filtered by Lipinski&rsquo;s and Veber&rsquo;s rules.</p>
<ul>
<li><strong>Baselines</strong>: The method was compared against <strong>RDKit</strong> (digital generation) and <strong>Randepict</strong> (rule-based augmentation).</li>
<li><strong>Models</strong>: Four recent OCSR models were fine-tuned: <strong>MolScribe</strong>, <strong>DECIMER 1.0 (I2S)</strong>, <strong>MolNexTR</strong>, and <strong>MPOCSR</strong>.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Tanimoto Similarity</strong>: To measure prediction accuracy against ground truth.</li>
<li><strong>Fréchet Inception Distance (FID)</strong>: To measure the distributional similarity between generated and real hand-drawn images.</li>
<li><strong>RMSE</strong>: To quantify pixel-level structural preservation across different mask thicknesses.</li>
</ul>
</li>
</ul>
<h2 id="improved-generalization-capabilities-and-fid-scores">Improved Generalization Capabilities and FID Scores</h2>
<ul>
<li><strong>Performance Boost</strong>: OCSAug improved recognition accuracy (Tanimoto similarity) by <strong>1.918 to 3.820 times</strong> compared to non-fine-tuned baselines (Improvement Ratio), outperforming traditional augmentation techniques such as RDKit and Randepict (1.570-3.523x).</li>
<li><strong>Data Quality</strong>: OCSAug achieved the lowest FID score (0.471) compared to Randepict (4.054) and RDKit (10.581), indicating its generated images are much closer to the real hand-drawn distribution.</li>
<li><strong>Generalization</strong>: The method showed improved generalization on a newly collected real-world dataset of 463 images from 6 volunteers.</li>
<li><strong>Resolution Mixing</strong>: Training MolScribe and MolNexTR with a mix of $128 \times 128$, $256 \times 256$, and $512 \times 512$ resolution images improved Tanimoto similarity (e.g., MolScribe from 0.585 to 0.640), though this strategy did not help I2S or MPOCSR.</li>
<li><strong>Real-World Evaluation</strong>: On a newly collected dataset of 463 hand-drawn images from 6 volunteers (88 drug compounds), the MPOCSR model fine-tuned with OCSAug achieved 0.367 exact-match accuracy (Tanimoto = 1.0), compared to 0.365 for non-augmented fine-tuning and 0.037 for no fine-tuning. While the exact-match gain is marginal, the area under the accuracy curve improved more noticeably, indicating reduced misrecognition overall.</li>
<li><strong>Limitations</strong>: The generation process is slow (3 weeks for 10k images on a single GPU). The fixed stripe masks may struggle with highly complex, non-drug-like geometries: when evaluated on the full DECIMER dataset (without drug-likeness filtering), OCSAug did not yield uniform improvements across all models.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jjjabcd/OCSAug">OCSAug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using guided-diffusion and RePaint</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6456306">DECIMER Hand-Drawn Dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY 4.0</td>
          <td>5,088 hand-drawn molecular structure images from 24 individuals</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: DECIMER dataset (hand-drawn images).</li>
<li><strong>Filtering</strong>: A &ldquo;drug-likeness&rdquo; filter was applied (Lipinski&rsquo;s rule of 5 + Veber&rsquo;s rules) along with an atom filter (C, H, O, S, F, Cl, Br, N, P only).</li>
<li><strong>Final Size</strong>: 3,194 samples, split into:
<ul>
<li><strong>Training</strong>: 2,604 samples.</li>
<li><strong>Validation</strong>: 290 samples.</li>
<li><strong>Test</strong>: 300 samples.</li>
</ul>
</li>
<li><strong>Resolution</strong>: All images resized to $256 \times 256$ pixels.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DDPM implemented using <code>guided-diffusion</code>.</li>
<li><strong>RePaint Settings</strong>:
<ul>
<li>Total time steps: 250.</li>
<li>Jump length: 10.</li>
<li>Resampling counts: 10.</li>
</ul>
</li>
<li><strong>Masking Strategy</strong>:
<ul>
<li><strong>Vertical Stripes</strong>: Obscure atom symbols to vary handwriting style.</li>
<li><strong>Horizontal Stripes</strong>: Obscure bonds to vary length/thickness/alignment.</li>
<li><strong>Optimal Thickness</strong>: A stripe thickness of <strong>4 pixels</strong> was found to be optimal for balancing diversity and structural preservation.</li>
</ul>
</li>
</ul>
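<p>The stripe masks can be sketched with NumPy; the $256 \times 256$ size and 4-pixel thickness follow the paper, while the inter-stripe spacing is an assumption:</p>

```python
import numpy as np

def stripe_mask(size=256, thickness=4, spacing=8, vertical=True):
    """Binary mask: 1 marks stripe pixels for RePaint to inpaint, 0 keeps the original."""
    mask = np.zeros((size, size), dtype=np.uint8)
    for start in range(0, size, thickness + spacing):
        if vertical:
            mask[:, start:start + thickness] = 1  # vertical stripes obscure atom symbols
        else:
            mask[start:start + thickness, :] = 1  # horizontal stripes obscure bonds
    return mask

v = stripe_mask()                # vertical-stripe mask
h = stripe_mask(vertical=False)  # horizontal-stripe mask
```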
<h3 id="models">Models</h3>
<p>The OCSR models were pretrained on PubChem (digital images) and then fine-tuned on the OCSAug dataset.</p>
<ul>
<li><strong>MolScribe</strong>: Swin Transformer encoder, Transformer decoder. Fine-tuned (all layers) for 30 epochs, batch size 16-128, LR 2e-5.</li>
<li><strong>I2S (DECIMER 1.0)</strong>: Inception V3 encoder (frozen), FC/Decoder fine-tuned. 25 epochs, batch size 64, LR 1e-5.</li>
<li><strong>MolNexTR</strong>: Dual-stream encoder (Swin + CNN). Fine-tuned (all layers) for 30 epochs, batch size 16-64, LR 2e-5.</li>
<li><strong>MPOCSR</strong>: MPViT backbone. Fine-tuned (all layers) for 25 epochs, batch size 16-32, LR 4e-5.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>
<p><strong>Metric</strong>: Improvement Ratio (IR) of Tanimoto Similarity (TS), defined as the ratio of fine-tuned to non-fine-tuned TS:</p>
<p>$$
\text{IR} = \frac{\text{TS}_{\text{finetuned}}}{\text{TS}_{\text{non-finetuned}}}
$$</p>
</li>
<li>
<p><strong>Validation</strong>: Cross-validation on the split DECIMER dataset.</p>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: NVIDIA GeForce RTX 4090.</li>
<li><strong>Training Time</strong>: DDPM training took ~6 days.</li>
<li><strong>Generation Time</strong>: Generating 2,600 augmented images took ~70 hours.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, J. H., &amp; Choi, J. (2025). OCSAug: diffusion-based optical chemical structure data augmentation for improved hand-drawn chemical structure image recognition. <em>The Journal of Supercomputing</em>, 81(8), 926.</p>
<p><strong>Publication</strong>: The Journal of Supercomputing 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/jjjabcd/OCSAug">Official Repository</a></li>
<li><a href="https://zenodo.org/records/6456306">DECIMER Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimOCSAugDiffusionbasedOptical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSAug: Diffusion-Based Optical Chemical Structure Data Augmentation for Improved Hand-Drawn Chemical Structure Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSAug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kim, Jin Hyuk and Choi, Jonghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Supercomputing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{81}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{926}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11227-025-07406-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Multimodal Search in Chemical Documents and Reactions</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/shah-multimodal-search-2025/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/shah-multimodal-search-2025/</guid><description>A multimodal search engine that integrates text passages, molecular diagrams, and reaction data to enable passage-level retrieval in chemical literature.</description><content:encoded><![CDATA[<h2 id="contribution-multimodal-synthesis-retrieval">Contribution: Multimodal Synthesis Retrieval</h2>
<p>This paper represents a $\Psi_{\text{Method}}$ projection that proposes a novel architectural pipeline for indexing and searching chemical literature. The framework unifies text, molecular diagrams, and structured reaction records. It also contains a secondary $\Psi_{\text{Resource}}$ projection, providing a functional demonstration tool and curating a specific benchmark dataset for Suzuki coupling reactions.</p>
<h2 id="the-gap-in-passage-level-chemical-retrieval">The Gap in Passage-Level Chemical Retrieval</h2>
<p>Scientific literature documents chemical reactions through a combination of text and visual diagrams. Textual descriptions detail parameters like yield and reaction temperature, whereas diagrams depict the structural transformations graphically. Existing tools such as SciFinder or <a href="https://en.wikipedia.org/wiki/Reaxys">Reaxys</a> perform document-level or individual-compound retrieval and fail to explicitly link molecular figures to localized textual descriptions. This disconnect prevents researchers from retrieving a reaction diagram alongside the exact textual protocol; passage-level retrieval of synthesis protocols is needed for efficient access to complete reaction conditions.</p>
<h2 id="core-innovation-unified-multimodal-indexing">Core Innovation: Unified Multimodal Indexing</h2>
<p>The core methodological innovation is a multimodal passage-level indexing and linking pipeline.</p>
<ul>
<li><strong>Unified Indexing:</strong> The framework processes text and diagrams in parallel and directly links them into a single index structure. This architecture supports search queries utilizing raw text, discrete <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, or multimodal combinations.</li>
<li><strong>Compound-Passage Linking:</strong> The mechanism applies conflict-resolution logic linking chemical diagrams to specific text citations using two parallel heuristics:
<ol>
<li><strong>Token-based Alignment:</strong> Matching parsed diagram labels against documented text strings (e.g., &ldquo;compound 5&rdquo;) using normalized <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a>.</li>
<li><strong>Fingerprint-based Alignment:</strong> Matching chemical structures against generated SMILES strings via structural <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto Similarity</a>.</li>
</ol>
</li>
<li><strong>ReactionMiner Integration:</strong> The pipeline parses and incorporates formatted reaction records (reactants, products, catalysts, quantitative yields) directly derived from segmented text passages.</li>
</ul>
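<p>The token-based alignment heuristic can be sketched with a normalized edit distance (an illustrative implementation; the casefolding and exact normalization are assumptions):</p>

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def token_alignment_score(diagram_label, text_mention):
    """Normalized similarity in [0, 1]; 1.0 is an exact match."""
    a, b = diagram_label.lower(), text_mention.lower()
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

score = token_alignment_score("compound 5", "Compound 5")  # case-insensitive match
```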
<h2 id="methodology--expert-evaluation">Methodology &amp; Expert Evaluation</h2>
<p>The authors evaluated the system utilizing a chemical case study targeting specific synthesis domains alongside qualitative expert assessment.</p>
<ul>
<li><strong>Dataset:</strong> Evaluators processed a corpus of 7 research manuscripts and 6 supplementary data documents detailing <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki coupling</a> reactions.</li>
<li><strong>Volume:</strong> The resulting index processed 1,282 extracted passages (indexing 538), extracted 383 unique SMILES, and logged 219 parsed reactions.</li>
<li><strong>Qualitative Evaluation:</strong> Practicing chemists developed real-world queries (such as combining the keyword &ldquo;Burke group&rdquo; with an explicit structural SMARTS pattern) to gauge retrieval capability.</li>
</ul>
<h2 id="key-findings--system-limitations">Key Findings &amp; System Limitations</h2>
<ul>
<li><strong>Diagram-to-Text Linking:</strong> The pipeline accurately paired visual molecular diagrams with structurally derived text details, permitting testers to navigate directly from a molecule query card to the exact origin passage within the source PDF.</li>
<li><strong>Contextual Insight Extraction:</strong> Chemists found the parsed reaction representations (yields, isolated catalysts) practically useful as high-level extractive summaries.</li>
<li><strong>Extrapolative Retrieval:</strong> The architecture permitted the effective retrieval of targeted chemical derivatives (such as benzo[b]thiophen-2-ylboronic acid) via structurally related input queries (dibenzothiophene).</li>
</ul>
<p>The system evaluation highlights several architectural restrictions:</p>
<ul>
<li><strong>Domain-Restricted Validation:</strong> The initial validation is entirely qualitative and bounded to the specific subclass of Suzuki coupling reactions. The evaluation omits standardized quantitative retrieval baselines (e.g., MAP, NDCG) and lacks systematic ablation data for the fusion scoring mechanism.</li>
<li><strong>Algorithmic Transparency:</strong> The multimodal query routing does not indicate which feature dominated a retrieval, obscuring whether keyword text or structural similarity drove the final ranking and limiting operator control.</li>
<li><strong>Optical Processing Brittleness:</strong> The vision inference and primitive-parsing pipelines are brittle, intermittently failing to associate text passages with correctly parsed molecular diagrams.</li>
<li><strong>Metadata Logging Incompleteness:</strong> Practicing chemists requested additional structured metadata (such as molar equivalents and mol% values) to bridge the extracted data directly into electronic lab notebooks.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.cs.rit.edu/~dprl/reactionminer-demo-landing/">ReactionMiner Demo</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Online demo landing page; source code repository not publicly linked</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source:</strong> The corpus comprises 7 research papers and 6 supplementary information documents on Suzuki coupling reactions, sourced from practicing chemists at UIUC. This evaluation dataset is internal and not publicly available.</li>
<li><strong>Preprocessing:</strong>
<ul>
<li>Source PDFs are converted to full-page raster images.</li>
<li>The system extracts page layout and raw text via <strong>PyTesseract</strong>.</li>
<li>The pipeline segments passages, prioritizing reaction-related sentences identified via product-indicative lexicons and topic modeling.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Diagram Extraction:</strong> A <strong>YOLOv8</strong> model identifies and segments molecular regions within structured PDF pages.</li>
<li><strong>Diagram Parsing:</strong> The architecture relies on <strong>ChemScraper</strong> to infer structural semantics from raw diagrams:
<ul>
<li><em>Born-digital PDFs:</em> <strong>SymbolScraper</strong> extracts vector lines and polygons directly from bounding box definitions.</li>
<li><em>Raster images:</em> The system employs the <strong>Line Segment Detector (LSD)</strong> and watershed bounding algorithms to isolate native geometric primitives.</li>
</ul>
</li>
<li><strong>Text Entity Extraction:</strong> The framework deploys <strong>ChemDataExtractor 2.0</strong> to extract explicit molecular aliases. A translation layer maps these entities to string representations via <strong>OPSIN</strong>.</li>
<li><strong>Linking Logic (Fusion Score):</strong>
<ul>
<li><strong>Text Link:</strong> The algorithm computes a normalized Levenshtein ratio between visual diagram labels and nearby text mentions.</li>
<li><strong>Structure Link:</strong> The algorithm computes the Tanimoto Similarity between 2048-bit Morgan fingerprints generated from parsed diagram structures and from text-derived SMILES queries:
$$ T(A, B) = \frac{A \cdot B}{|A|^{2} + |B|^{2} - A \cdot B} $$
where $A$ and $B$ represent the boolean bit vectors of the respective fingerprint pairs.</li>
<li><strong>Conflict Resolution Protocol:</strong> The system fuses the structural and token-based scores, keeping whichever link yields the higher similarity. At retrieval time, the candidate set is re-ranked using a hybrid of the <a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a> score and the count of exact SMILES pattern hits in each passage.</li>
</ul>
</li>
</ul>
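<p>The structure-link score can be illustrated with a pure-Python bit-vector Tanimoto that mirrors the formula above (in practice the 2048-bit Morgan fingerprints would come from RDKit; short vectors are used here for clarity):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity for 0/1 bit vectors.

    For boolean vectors, A.B counts shared on-bits, while |A|^2 and |B|^2
    count the on-bits of each vector, giving A.B / (|A|^2 + |B|^2 - A.B).
    """
    if len(a) != len(b):
        raise ValueError("fingerprints must have equal length")
    shared = sum(x & y for x, y in zip(a, b))
    denom = sum(a) + sum(b) - shared
    return shared / denom if denom else 1.0

sim = tanimoto([1, 1, 0, 0], [1, 0, 1, 0])  # one shared bit out of three set
```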
<h3 id="models">Models</h3>
<ul>
<li><strong>Reaction Extraction Parameters:</strong> A <strong>LLaMA-3.1-8b</strong> model is fine-tuned with <strong>LoRA</strong> to emit custom tokens for reaction entities (compounds, reagents, temperatures) extracted from text sub-chunks. Exact prompts, the fine-tuning dataset, and the specific LoRA hyperparameters are omitted from the source text.</li>
<li><strong>Diagram Processing:</strong> ChemScraper incorporates a segmentation-aware multi-task neural network for low-level raster image parsing.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Search Engine Base:</strong> The authors built their indexing framework on top of <strong>PyTerrier</strong>.</li>
<li><strong>Text Feature Ranking:</strong> Keyword relevance is scored with standalone <strong>BM25</strong>.</li>
<li><strong>Structure Feature Operations:</strong> <strong>RDKit</strong> provides substructure matching and exact molecular similarity search.</li>
<li><strong>Multimodal Fusion Processing:</strong>
<ul>
<li>Candidate passages are filtered by combining structural matches (SMILES queries) with document-wide lexical relevance (BM25 scores).</li>
<li>The final fusion assigns the strongest weight to passages containing dense local clusters of exactly matched SMILES patterns.</li>
</ul>
</li>
</ul>
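<p>Absent published fusion weights, the re-ranking step can be caricatured as a linear combination (the <code>hit_weight</code> parameter is an assumption, not a value from the paper):</p>

```python
def fused_score(bm25_score, smiles_hits, hit_weight=2.0):
    """Hybrid ranking score: lexical BM25 relevance plus a bonus per
    exact SMILES pattern hit in the passage."""
    return bm25_score + hit_weight * smiles_hits

passages = [
    {"id": "p1", "bm25": 7.2, "smiles_hits": 0},
    {"id": "p2", "bm25": 5.1, "smiles_hits": 3},
]
ranked = sorted(passages,
                key=lambda p: fused_score(p["bm25"], p["smiles_hits"]),
                reverse=True)
# Passages dense in exact SMILES matches outrank purely lexical hits.
```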
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Infrastructure:</strong> The hardware and parameter requirements to host the multi-stage vision extractors (YOLOv8, ChemScraper) alongside a local 8B LLM are entirely unspecified in the paper.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shah, A. K., et al. (2025). Multimodal Search in Chemical Documents and Reactions. In <em>Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR &lsquo;25)</em>. ACM. <a href="https://doi.org/10.48550/arXiv.2502.16865">https://doi.org/10.48550/arXiv.2502.16865</a></p>
<p><strong>Publication</strong>: SIGIR &lsquo;25 (Demo Track), 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{shahMultimodalSearchChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Multimodal {{Search}} in {{Chemical Documents}} and {{Reactions}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Shah, Ayush Kumar and Dey, Abhisek and Luo, Leo and Amador, Bryan and Philippy, Patrick and Zhong, Ming and Ouyang, Siru and Friday, David Mark and Bianchi, David and Jackson, Nick and Zanibbi, Richard and Han, Jiawei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2502.16865}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2502.16865}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2502.16865}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.cs.rit.edu/~dprl/reactionminer-demo-landing/">Online Demo</a> (Note: While the landing page advertises the system as open-source, the exact repository URL and installation prerequisites are omitted from the official manuscript.)</li>
</ul>
]]></content:encoded></item><item><title>MOFFlow: Flow Matching for MOF Structure Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mofflow/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mofflow/</guid><description>A Riemannian flow matching framework for generating Metal-Organic Framework structures by treating building blocks as rigid bodies.</description><content:encoded><![CDATA[<h2 id="methodological-contribution-mofflow-architecture">Methodological Contribution: MOFFlow Architecture</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$).</p>
<p>It introduces <strong>MOFFlow</strong>, a generative architecture and training framework designed specifically for the structure prediction of Metal-Organic Frameworks (MOFs). The paper focuses on the algorithmic innovation of decomposing the problem into rigid-body assembly on a Riemannian manifold, validates this through comparison against existing baselines, and performs ablation studies to justify architectural choices. While it leverages the theory of flow matching, its primary contribution is the application-specific architecture and the handling of modular constraints.</p>
<h2 id="motivation-scaling-limits-of-atom-level-generation">Motivation: Scaling Limits of Atom-Level Generation</h2>
<p>The primary motivation is to overcome the scalability and accuracy limitations of existing methods for MOF structure prediction.</p>
<ul>
<li><strong>Computational Cost of DFT:</strong> Conventional approaches rely on <em>ab initio</em> calculations (DFT) combined with random search, which are computationally prohibitive for large, complex systems like MOFs.</li>
<li><strong>Failure of General CSP:</strong> Existing deep generative models for general Crystal Structure Prediction (CSP) operate on an atom-by-atom basis. They fail to scale to MOFs, which often contain hundreds or thousands of atoms per unit cell, and do not exploit the inherent modular nature (building blocks) of MOFs.</li>
<li><strong>Tunability:</strong> MOFs have applications in carbon capture and drug delivery due to their tunable porosity, making automated design tools valuable.</li>
</ul>
<h2 id="core-innovation-rigid-body-flow-matching-on-se3">Core Innovation: Rigid-Body Flow Matching on SE(3)</h2>
<p>MOFFlow introduces a <strong>hierarchical, rigid-body flow matching framework</strong> tailored for MOFs.</p>
<ul>
<li><strong>Rigid Body Decomposition:</strong> MOFFlow treats metal nodes and organic linkers as rigid bodies, reducing the search space from $3N$ (atoms) to $6M$ (roto-translation of $M$ blocks) compared to atom-based methods.</li>
<li><strong>Riemannian Flow Matching on $SE(3)$:</strong> It is the first end-to-end model to jointly generate block-level rotations ($SO(3)$), translations ($\mathbb{R}^3$), and lattice parameters using <a href="/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/">Riemannian flow matching</a>.</li>
<li><strong>MOFAttention:</strong> A custom attention module designed to encode the geometric relationships between building blocks, lattice parameters, and rotational constraints.</li>
<li><strong>Constraint Handling:</strong> It incorporates domain knowledge by operating on a mean-free system for translation invariance and using canonicalized coordinates for rotation invariance.</li>
</ul>
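<p>The rigid-body reduction can be made concrete: each block stores canonical atom coordinates, and the model predicts only one rotation and translation per block (a hedged sketch; the quaternion convention <code>(w, x, y, z)</code> is an assumption):</p>

```python
import math

def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    t = (2 * (y * v[2] - z * v[1]),   # t = 2 * cross(q_vec, v)
         2 * (z * v[0] - x * v[2]),
         2 * (x * v[1] - y * v[0]))
    cross_qt = (y * t[2] - z * t[1], z * t[0] - x * t[2], x * t[1] - y * t[0])
    return tuple(v[i] + w * t[i] + cross_qt[i] for i in range(3))

def place_block(coords, q, tau):
    """Pose one rigid building block: rotate its canonical coordinates by q,
    then translate by tau -- 6 DoF per block instead of 3 per atom."""
    return [tuple(c + t for c, t in zip(rotate(q, v), tau)) for v in coords]

s = math.sqrt(0.5)  # 90-degree rotation about the z-axis
block = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
posed = place_block(block, (s, 0.0, 0.0, s), (0.0, 0.0, 1.0))
```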
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors evaluated MOFFlow on structure prediction accuracy, physical property preservation, and scalability.</p>
<ul>
<li><strong>Dataset:</strong> The <strong>Boyd et al. (2019)</strong> dataset consisting of 324,426 hypothetical MOF structures, decomposed into building blocks using the <strong>MOFid</strong> algorithm. Filtered to structures with &lt;200 blocks, yielding 308,829 structures (247,066 train / 30,883 val / 30,880 test). Structures contain up to approximately 2,400 atoms per unit cell.</li>
<li><strong>Baselines:</strong>
<ul>
<li><em>Optimization-based:</em> Random Search (RS) and Evolutionary Algorithm (EA) using CrySPY and CHGNet.</li>
<li><em>Deep Learning:</em> DiffCSP (deep generative model for general crystals).</li>
<li><em>Self-Assembly:</em> A heuristic algorithm used in MOFDiff (adapted for comparison).</li>
</ul>
</li>
<li><strong>Metrics:</strong>
<ul>
<li><strong>Match Rate (MR):</strong> Percentage of generated structures matching ground truth within tolerance.</li>
<li><strong>RMSE:</strong> Root mean squared displacement normalized by average free length per atom.</li>
<li><strong>Structural Properties:</strong> Volumetric/Gravimetric Surface Area (VSA/GSA), Pore Limiting Diameter (PLD), Void Fraction, etc., calculated via Zeo++.</li>
<li><strong>Scalability:</strong> Performance vs. number of atoms and building blocks.</li>
</ul>
</li>
</ul>
<h2 id="results-and-generative-performance">Results and Generative Performance</h2>
<p>MOFFlow outperformed all baselines in accuracy and efficiency, particularly for large structures.</p>
<ul>
<li><strong>Accuracy:</strong> With a single sample, MOFFlow achieved a <strong>31.69% match rate</strong> (stol=0.5) and <strong>87.46%</strong> (stol=1.0) on the full test set (30,880 structures). With 5 samples, these rose to <strong>44.75%</strong> (stol=0.5) and <strong>100.0%</strong> (stol=1.0). RS and EA (tested on 100 and 15 samples respectively due to computational cost, generating 20 candidates each) achieved 0.00% MR at both tolerance levels. DiffCSP reached 0.09% (stol=0.5) and 23.12% (stol=1.0) with 1 sample.</li>
<li><strong>Speed:</strong> Inference took <strong>1.94 seconds</strong> per structure, compared to 5.37s for DiffCSP, 332s for RS, and 1,959s for EA.</li>
<li><strong>Scalability:</strong> MOFFlow preserved high match rates across all system sizes, while DiffCSP&rsquo;s match rate dropped sharply beyond 200 atoms.</li>
<li><strong>Property Preservation:</strong> The distributions of physical properties (e.g., surface area, void fraction) for MOFFlow-generated structures closely matched the ground truth. DiffCSP frequently reduced volumetric surface area and void fraction to zero.</li>
<li><strong>Self-Assembly Comparison:</strong> In a controlled comparison where the self-assembly (SA) algorithm received MOFFlow&rsquo;s predicted translations and lattice, MOFFlow (MR=31.69%, RMSE=0.2820) outperformed SA (MR=30.04%, RMSE=0.3084), confirming the value of the learned rotational vector fields. In an extended scalability comparison, SA scaled better for structures with many building blocks, but MOFFlow achieved higher overall match rate (31.69% vs. 27.14%).</li>
<li><strong>Batch Implementation:</strong> A refactored Batch version achieves improved results: <strong>32.73% MR</strong> (stol=0.5), RMSE of 0.2743, inference in <strong>0.19s</strong> per structure (10x faster), and training in roughly 1/3 the GPU hours.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The paper identifies three key limitations:</p>
<ol>
<li><strong>Hypothetical-only evaluation:</strong> All experiments use the Boyd et al. hypothetical database. Evaluation on more challenging real-world datasets remains needed.</li>
<li><strong>Rigid-body assumption:</strong> The model assumes that local building block structures are known, which may be impractical for rare building blocks whose structural information is missing from existing libraries or is inaccurate.</li>
<li><strong>Periodic invariance:</strong> The model is not invariant to periodic transformations of the input. Explicitly modeling periodic invariance could further improve performance.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source:</strong> MOF dataset by Boyd et al. (2019).</li>
<li><strong>Preprocessing:</strong> Structures were decomposed using the metal-oxo decomposition algorithm from <strong>MOFid</strong>.</li>
<li><strong>Filtering:</strong> Structures with fewer than 200 building blocks were used, yielding 308,829 structures.</li>
<li><strong>Splits:</strong> Train/Validation/Test ratio of 8:1:1 (247,066 / 30,883 / 30,880).</li>
<li><strong>Availability:</strong> Pre-processed dataset is available on <a href="https://zenodo.org/records/15187230">Zenodo</a>.</li>
<li><strong>Representations:</strong>
<ul>
<li><em>Atom-level:</em> Tuple $(X, a, l)$ (coordinates, types, lattice).</li>
<li><em>Block-level:</em> Tuple $(\mathcal{B}, q, \tau, l)$ (blocks, rotations, translations, lattice).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework:</strong> Riemannian Flow Matching.</li>
<li><strong>Objective:</strong> Conditional Flow Matching (CFM) loss regressing to clean data $q_1, \tau_1, l_1$.
$$
\begin{aligned}
\mathcal{L}(\theta) = \mathbb{E}_{t, \mathcal{S}^{(1)}} \left[ \frac{1}{(1-t)^2} \left( \lambda_1 |\log_{q_t}(\hat{q}_1) - \log_{q_t}(q_1)|^2 + \dots \right) \right]
\end{aligned}
$$</li>
<li><strong>Priors:</strong>
<ul>
<li>Rotations ($q$): Uniform on $SO(3)$.</li>
<li>Translations ($\tau$): Standard normal on $\mathbb{R}^3$.</li>
<li>Lattice ($l$): Log-normal for lengths, Uniform(60, 120) for angles (Niggli reduced).</li>
</ul>
</li>
<li><strong>Inference:</strong> ODE solver with <strong>50 integration steps</strong>.</li>
<li><strong>Local Coordinates:</strong> Defined using PCA axes, corrected for symmetry to ensure consistency.</li>
</ul>
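<p>For the Euclidean components (translations and lattice), 50-step inference reduces to fixed-step Euler integration of the learned vector field (a sketch; the rotational component additionally needs $SO(3)$ exponential/log maps and is omitted here):</p>

```python
def euler_integrate(x0, vector_field, steps=50):
    """Integrate dx/dt = v(x, t) from t=0 (prior sample) to t=1 (structure)
    with a fixed-step Euler solver."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = [xi + dt * vi for xi, vi in zip(x, vector_field(x, t))]
    return x

# Under the CFM objective, the target field transporting x_t toward clean
# data x1 along straight paths is v(x, t) = (x1 - x_t) / (1 - t).
x1 = [1.0, -2.0, 0.5]
out = euler_integrate([0.0, 0.0, 0.0],
                      lambda x, t: [(a - b) / (1.0 - t) for a, b in zip(x1, x)])
```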
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture:</strong> Hierarchical structure with two key modules.
<ul>
<li><strong>Atom-level Update Layers:</strong> 4-layer EGNN-like structure to encode building block features $h_m$ from atomic graphs (cutoff 5Å).</li>
<li><strong>Block-level Update Layers:</strong> 6 layers that iteratively update $q, \tau, l$ using the <strong>MOFAttention</strong> module.</li>
</ul>
</li>
<li><strong>MOFAttention:</strong> Modified Invariant Point Attention (IPA) that incorporates lattice parameters as offsets to the attention matrix.</li>
<li><strong>Hyperparameters:</strong>
<ul>
<li>Node dimension: 256 (block-level), 64 (atom-level).</li>
<li>Attention heads: 24.</li>
<li>Loss coefficients: $\lambda_1=1.0$ (rot), $\lambda_2=2.0$ (trans), $\lambda_3=0.1$ (lattice).</li>
</ul>
</li>
<li><strong>Checkpoints:</strong> Pre-trained weights and models are openly provided on <a href="https://zenodo.org/records/15187230">Zenodo</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics:</strong>
<ul>
<li><strong>Match Rate:</strong> Using <code>StructureMatcher</code> from <code>pymatgen</code>. Tolerances: <code>stol=0.5/1.0</code>, <code>ltol=0.3</code>, <code>angle_tol=10.0</code>.</li>
<li><strong>RMSE:</strong> Normalized by average free length per atom.</li>
</ul>
</li>
<li><strong>Tools:</strong> <strong>Zeo++</strong> for structural property calculations (Surface Area, Pore Diameter, etc.).</li>
</ul>
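<p>The RMSE normalization can be sketched as follows (assuming the common crystal-structure-prediction convention that the average free length per atom is $(V/N)^{1/3}$ for unit-cell volume $V$ and atom count $N$):</p>

```python
import math

def normalized_rmse(displacements, volume, n_atoms):
    """RMS displacement between matched structures, normalized by the
    average free length per atom, (V / N)^(1/3)."""
    rms = math.sqrt(sum(d * d for d in displacements) / len(displacements))
    return rms / (volume / n_atoms) ** (1.0 / 3.0)

# Example: RMS displacement 1.0 in a cell with free length 2.0 per atom.
nr = normalized_rmse([1.0, 1.0], volume=8.0, n_atoms=1)
```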
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">MOFFlow</th>
          <th style="text-align: left">DiffCSP</th>
          <th style="text-align: left">RS (20 cands)</th>
          <th style="text-align: left">EA (20 cands)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">MR (stol=0.5, k=1)</td>
          <td style="text-align: left"><strong>31.69%</strong></td>
          <td style="text-align: left">0.09%</td>
          <td style="text-align: left">0.00%</td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left">MR (stol=1.0, k=1)</td>
          <td style="text-align: left"><strong>87.46%</strong></td>
          <td style="text-align: left">23.12%</td>
          <td style="text-align: left">0.00%</td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left">MR (stol=0.5, k=5)</td>
          <td style="text-align: left"><strong>44.75%</strong></td>
          <td style="text-align: left">0.34%</td>
          <td style="text-align: left">-</td>
          <td style="text-align: left">-</td>
      </tr>
      <tr>
          <td style="text-align: left">MR (stol=1.0, k=5)</td>
          <td style="text-align: left"><strong>100.0%</strong></td>
          <td style="text-align: left">38.94%</td>
          <td style="text-align: left">-</td>
          <td style="text-align: left">-</td>
      </tr>
      <tr>
          <td style="text-align: left">RMSE (stol=0.5, k=1)</td>
          <td style="text-align: left"><strong>0.2820</strong></td>
          <td style="text-align: left">0.3961</td>
          <td style="text-align: left">-</td>
          <td style="text-align: left">-</td>
      </tr>
      <tr>
          <td style="text-align: left">Avg. time per structure</td>
          <td style="text-align: left"><strong>1.94s</strong></td>
          <td style="text-align: left">5.37s</td>
          <td style="text-align: left">332s</td>
          <td style="text-align: left">1,959s</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Hardware:</strong> 8 $\times$ NVIDIA RTX 3090 (24GB VRAM).</li>
<li><strong>Training Time:</strong>
<ul>
<li><em>TimestepBatch version (main paper):</em> ~5 days 15 hours.</li>
<li><em>Batch version:</em> ~1 day 17 hours (332.74 GPU hours). The authors also release this refactored implementation, which achieves comparable performance with faster convergence.</li>
</ul>
</li>
<li><strong>Batch Size:</strong> 160 (capped by $N^2$ where $N$ is the number of atoms, for memory management).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/nayoung10/MOFFlow">MOFFlow (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Official implementation built on DiffDock, EGNN, MOFDiff, and protein-frame-flow</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://zenodo.org/records/15187230">Pre-processed dataset and checkpoints (Zenodo)</a></td>
          <td style="text-align: left">Dataset / Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Includes pre-processed MOF structures and trained model weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, N., Kim, S., Kim, M., Park, J., &amp; Ahn, S. (2025). MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks. <em>International Conference on Learning Representations (ICLR)</em>.</p>
<p><strong>Publication</strong>: ICLR 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kimMOFFlowFlowMatching2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kim, Nayoung and Kim, Seongsu and Kim, Minsu and Park, Jinkyoo and Ahn, Sungsoo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Thirteenth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=dNT3abOsLo}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=dNT3abOsLo">OpenReview Discussion</a></li>
<li><a href="https://github.com/nayoung10/MOFFlow">Official Code Repository</a></li>
</ul>
]]></content:encoded></item><item><title>MERMaid: Multimodal Chemical Reaction Mining from PDFs</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/mermaid/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/mermaid/</guid><description>Vision-language pipeline extracting chemical reaction data from PDF figures and tables into structured knowledge graphs with 87% accuracy.</description><content:encoded><![CDATA[<h2 id="methodological-and-resource-contributions">Methodological and Resource Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper ($\Psi_{\text{Method}}$) that introduces a novel pipeline (MERMaid) for extracting structured chemical data from unstructured PDF documents. It proposes a specific architecture combining fine-tuned vision models (VisualHeist) with vision-language models (DataRaider) and a retrieval-augmented generation system (KGWizard) to solve the problem of multimodal data ingestion.</p>
<p>Secondarily, it is a <strong>Resource</strong> paper ($\Psi_{\text{Resource}}$) as it releases the source code, prompts, and a new benchmark dataset (<strong>MERMaid-100</strong>) consisting of annotated reaction data across three chemical domains.</p>
<h2 id="the-inaccessibility-of-diagrammatic-reaction-data">The Inaccessibility of Diagrammatic Reaction Data</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A significant volume of chemical knowledge currently resides in &ldquo;print-optimized&rdquo; PDF formats, specifically within graphical elements like figures, schemes, and tables, which resist standard text mining.</li>
<li><strong>Limitations of Prior Work</strong>: Existing tools (e.g., ChemDataExtractor, <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/">OpenChemIE</a>) focus primarily on text, struggle with multimodal parsing, or lack the &ldquo;contextual awareness&rdquo; needed to interpret implicit information (e.g., &ldquo;standard conditions&rdquo; with modifications in optimization tables).</li>
<li><strong>Need for Structured Data</strong>: To enable <a href="/notes/chemistry/llm-applications/autonomous-chemical-research-coscientist/">self-driving laboratories</a> and data-driven discovery, this unstructured literature must be converted into machine-actionable formats like <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a>.</li>
</ul>
<h2 id="the-mermaid-pipeline-vision-models-and-llm-rag">The MERMaid Pipeline: Vision Models and LLM RAG</h2>
<ul>
<li><strong>VisualHeist (Fine-tuned Segmentation)</strong>: A custom fine-tuned model based on Microsoft&rsquo;s Florence-2 that accurately segments figures, captions, and footnotes, even in messy supplementary materials.</li>
<li><strong>DataRaider (Context-Aware Extraction)</strong>: A VLM-powered module (using GPT-4o) with a <strong>two-step prompt framework</strong> that performs &ldquo;self-directed context completion.&rdquo; It can infer missing reaction parameters from context and resolve footnote labels (e.g., linking &ldquo;condition a&rdquo; in a table to its footnote description).</li>
<li><strong>KGWizard (Schema-Adaptive Graph Construction)</strong>: A text-to-graph engine that uses LLMs as higher-order functions to synthesize parsers dynamically. It employs <strong>Retrieval-Augmented Generation (RAG)</strong> to check for existing nodes during creation, implicitly resolving coreferences (e.g., unifying &ldquo;MeCN&rdquo; and &ldquo;Acetonitrile&rdquo;).</li>
<li><strong>Topic-Agnostic Design</strong>: MERMaid features a flexible design that works across three distinct domains: <a href="https://en.wikipedia.org/wiki/Electrosynthesis">organic electrosynthesis</a>, <a href="https://en.wikipedia.org/wiki/Photocatalysis">photocatalysis</a>, and organic synthesis.</li>
</ul>
<h2 id="benchmarking-segmentation-and-extraction-accuracy">Benchmarking Segmentation and Extraction Accuracy</h2>
<ul>
<li><strong>Segmentation Benchmarking</strong>: The authors compared VisualHeist against OpenChemIE (LayoutParser) and PDFigCapX using a dataset of 121 PDFs from 5 publishers.</li>
<li><strong>End-to-End Extraction</strong>: Evaluated the full pipeline on <strong>MERMaid-100</strong>, a curated dataset of 100 articles across three domains (organic electrosynthesis, photocatalysis, organic synthesis).
<ul>
<li>Extraction of specific parameters (e.g., catalysts, solvents, yields) was validated using &ldquo;hard-match&rdquo; accuracy.</li>
</ul>
</li>
<li><strong>Knowledge Graph Construction</strong>: Automatically generated knowledge graphs for the three domains and assessed the structural integrity and <a href="https://en.wikipedia.org/wiki/Coreference">coreference resolution</a> accuracy.</li>
</ul>
<h2 id="end-to-end-extraction-performance">End-to-End Extraction Performance</h2>
<ul>
<li><strong>Segmentation Results</strong>: VisualHeist achieved &gt;93% F1 score across all document types (including pre-2000 papers and supplementary materials), outperforming OpenChemIE by 15-75% and PDFigCapX by 28-75% across all metrics.</li>
<li><strong>Extraction Accuracy</strong>: DataRaider achieved &gt;92% accuracy for VLM-based parameter extraction and near-unity accuracy for domain-specific reaction parameters (e.g., anode, cathode, photocatalyst).</li>
<li><strong>Graph Building</strong>: KGWizard achieved 96% accuracy in node creation and coreference resolution.</li>
<li><strong>Overall Performance</strong>: The pipeline demonstrated an 87% end-to-end overall accuracy.</li>
<li><strong>Limitations</strong>: The architecture relies heavily on closed-weight models (GPT-4o) for reasoning and graph construction, which risks future reproducibility if API snapshots are deprecated. Additionally, the system remains vulnerable to cumulative error propagation from upstream OCR/OCSR tools like <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">RxnScribe</a>.</li>
<li><strong>Availability</strong>: The authors provide a modular, extensible framework that can be adapted to other scientific domains.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training Data (VisualHeist)</strong>:
<ul>
<li>Dataset of <strong>3,435 figures</strong> and <strong>1,716 tables</strong> annotated from 3,518 PDF pages.</li>
<li>Includes main text, supplementary materials, and unformatted archive papers.</li>
</ul>
</li>
<li><strong>Evaluation Data (MERMaid-100)</strong>:
<ul>
<li><strong>100 PDF articles</strong> curated from three domains: organic electrosynthesis, photocatalysis, and organic synthesis.</li>
<li>Includes 104 image-caption/table-heading pairs relevant to reaction optimization.</li>
<li>Available for download at Zenodo (DOI: 10.5281/zenodo.14917752).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Two-Step Prompt Framework (DataRaider)</strong>:
<ul>
<li><em>Step 1</em>: Generic base prompt + domain keys to extract &ldquo;reaction dictionaries&rdquo; and &ldquo;footnote dictionaries&rdquo;. Uses &ldquo;fill-in-the-blank&rdquo; inference for missing details.</li>
<li><em>Step 2</em>: Safety check prompt where the VLM updates the reaction dictionary using the footnote dictionary to resolve entry-specific modifications.</li>
</ul>
</li>
<li><strong>LLM-Synthesized Parsers (KGWizard)</strong>:
<ul>
<li>Uses LLM as a function $g_{A,B}: A \times B \rightarrow (X \rightarrow Y)$ to generate Python code (parsers) dynamically based on input schema instructions.</li>
</ul>
</li>
<li><strong>RAG for Coreference</strong>:
<ul>
<li>During graph construction, the system queries the existing database for matching values (e.g., &ldquo;MeCN&rdquo;) before creating new nodes to prevent duplication.</li>
</ul>
</li>
<li><strong>Batching</strong>:
<ul>
<li>Articles processed in dynamic batch sizes (starting at 1, increasing to 30) to balance speed and redundancy checks.</li>
</ul>
</li>
</ul>
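<p>The check-before-create behavior of the RAG step can be sketched as a simple store that canonicalizes values before minting nodes. This is an illustrative sketch, not the authors&rsquo; code: the class, the hard-coded alias table, and the dict-backed store are assumptions standing in for KGWizard&rsquo;s LLM/RAG lookup against a real graph database.</p>

```python
# Hypothetical alias table; KGWizard resolves synonyms via retrieval, not a fixed dict.
ALIASES = {"mecn": "acetonitrile", "ch3cn": "acetonitrile"}

class NodeStore:
    def __init__(self):
        self.nodes = {}  # canonical value -> node id

    def _canonical(self, value):
        key = value.strip().lower()
        return ALIASES.get(key, key)

    def get_or_create(self, value):
        # Retrieval step: reuse an existing node when the canonical form matches,
        # so "MeCN" and "Acetonitrile" collapse to a single node.
        key = self._canonical(value)
        if key not in self.nodes:
            self.nodes[key] = len(self.nodes)
        return self.nodes[key]

store = NodeStore()
id_a = store.get_or_create("MeCN")
id_b = store.get_or_create("Acetonitrile")
id_c = store.get_or_create("DMF")
```

<p>Here <code>id_a</code> and <code>id_b</code> resolve to the same node, while <code>DMF</code> creates a new one; the real system performs the same dedup check during batched graph construction.</p>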
<h3 id="models">Models</h3>
<ul>
<li><strong>VisualHeist</strong>: Fine-tuned <strong>Florence-2-large</strong> (Microsoft vision foundation model).
<ul>
<li><em>Hyperparameters</em>: 12 epochs, learning rate $5 \times 10^{-6}$, batch size 4.</li>
</ul>
</li>
<li><strong>DataRaider &amp; KGWizard</strong>: <strong>GPT-4o</strong> (version <code>gpt-4o-2024-08-06</code>). Note: Requires an active OpenAI API key. The pipeline&rsquo;s long-term reproducibility is currently tied to the continued availability of this specific closed-source endpoint.</li>
<li><strong>RxnScribe</strong>: Used for <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">Optical Chemical Structure Recognition (OCSR)</a> to convert reactant/product images to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><em>Segmentation</em>: Precision, Recall, F1, Accuracy.</li>
<li><em>Caption Extraction</em>: Evaluated via <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a>, mapping predicted token sets $A$ and true token sets $B$ to a threshold condition: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \ge 0.70$$</li>
<li><em>Data Extraction</em>: Evaluated via Hard-Match accuracy, requiring exact correspondence between predicted sets ($\hat{Y}$) and ground-truth parameters ($Y$) for specific roles (e.g., anode vs. cathode): $$\text{HMA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i = \hat{y}_i]$$</li>
</ul>
</li>
<li><strong>Baselines</strong>: OpenChemIE (LayoutParser + EasyOCR) and PDFigCapX.</li>
</ul>
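<p>The two scoring rules above are straightforward to implement. The sketch below is illustrative (function names and toy token sets are mine, not the authors&rsquo; evaluation code), but it follows the stated definitions: Jaccard similarity thresholded at 0.70, and exact per-role matching for hard-match accuracy.</p>

```python
def jaccard_match(pred_tokens, true_tokens, threshold=0.70):
    """J(A, B) = |A ∩ B| / |A ∪ B|, passed if it meets the 0.70 threshold."""
    a, b = set(pred_tokens), set(true_tokens)
    j = len(a & b) / len(a | b) if (a | b) else 1.0
    return j, j >= threshold

def hard_match_accuracy(pred, true):
    """Fraction of roles whose predicted value exactly equals the ground truth."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

# Toy values for illustration only:
sim, passed = jaccard_match(["pt", "anode", "mecn"], ["pt", "anode", "acetonitrile"])
hma = hard_match_accuracy(["Pt", "RVC"], ["Pt", "graphite"])
```

<p>Note that hard-match is role-sensitive: predicting the anode material in the cathode slot scores zero for both roles, which makes the reported near-unity accuracies on electrode parameters a strict result.</p>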
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training (VisualHeist)</strong>: 2x NVLINK Nvidia RTX A6000 GPUs (48GB VRAM) + Intel Xeon w7-2495X CPU (48 cores).</li>
<li><strong>DataRaider Evaluation</strong>: 13th Gen Intel Core i7-1360P CPU (12 cores).</li>
<li><strong>Inference Costs</strong>:
<ul>
<li>DataRaider: ~$0.051 per image.</li>
<li>KGWizard: ~$0.40 per JSON.</li>
</ul>
</li>
<li><strong>Timing</strong>:
<ul>
<li>VisualHeist inference: ~4.5 seconds/image.</li>
<li>DataRaider inference: ~41.3 seconds/image.</li>
<li>KGWizard processing: ~110.6 seconds/file.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leong, S. X., Pablo-García, S., Wong, B., &amp; Aspuru-Guzik, A. (2025). MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models. <em>Matter</em>, 8(12), 102331. <a href="https://doi.org/10.1016/j.matt.2025.102331">https://doi.org/10.1016/j.matt.2025.102331</a></p>
<p><strong>Publication</strong>: Matter, 2025</p>
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/MERMaid">GitHub Repository</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (VisualHeist, DataRaider, KGWizard)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.14917752">Zenodo Data/Prompts</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>MERMaid-100 benchmark, prompts, and raw VLM responses</td>
      </tr>
  </tbody>
</table>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leong2025mermaid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leong, Shi Xuan and Pablo-Garc{\&#39;i}a, Sergio and Wong, Brandon and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Matter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{102331}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.matt.2025.102331}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InvMSAFold: Generative Inverse Folding with Potts Models</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/invmsafold/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/invmsafold/</guid><description>InvMSAFold generates diverse protein sequences from structure by predicting Potts model parameters, enabling orders-of-magnitude faster sampling.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It introduces a novel architecture, <strong>InvMSAFold</strong>, which hybridizes deep learning encoders with statistical physics-based decoders (Potts models). The rhetorical structure focuses on architectural innovation (low-rank parameter generation), ablation of speed/diversity against baselines (ESM-IF1), and algorithmic efficiency.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Standard inverse folding models (like ESM-IF1 or ProteinMPNN) solve a &ldquo;one-to-one&rdquo; mapping: given a structure, predict the <em>single</em> native sequence. However, in nature, folding is &ldquo;many-to-one&rdquo;: many homologous sequences fold into the same structure.</p>
<p>The authors identify two key gaps:</p>
<ol>
<li><strong>Lack of Diversity</strong>: Standard autoregressive models maximize probability for the ground truth sequence, often failing to capture the broad evolutionary landscape of viable homologs.</li>
<li><strong>Slow Inference</strong>: Autoregressive sampling requires a full neural network pass for <em>every amino acid</em>, making high-throughput screening (e.g., millions of candidates) computationally prohibitive.</li>
</ol>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is shifting the learning objective from predicting <em>sequences</em> to predicting <em>probability distributions</em>.</p>
<p>InvMSAFold outputs the parameters (couplings $\mathbf{J}$ and fields $\mathbf{h}$) of a <strong>Potts Model</strong> (a pairwise Markov Random Field).</p>
<ul>
<li><strong>Low-Rank Decomposition</strong>: To handle the massive parameter space of pairwise couplings ($L \times L \times q \times q$), the model predicts a low-rank approximation $\mathbf{V}$ ($L \times K \times q$), reducing complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$.</li>
<li><strong>One-Shot Generation</strong>: The deep network runs only <em>once</em> to generate the Potts parameters. Sampling sequences from this Potts model is then performed on CPU via MCMC (for the PW variant) or direct autoregressive sampling (for the AR variant), which is orders of magnitude faster than running a Transformer decoder for every step.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the model on three CATH-based test sets (Inter-cluster, Intra-cluster, MSA) to test generalization at varying levels of homology.</p>
<ul>
<li><strong>Speed Benchmarking</strong>: Compared wall-clock sampling time vs. ESM-IF1 on CPU/GPU.</li>
<li><strong>Covariance Reconstruction</strong>: Checked if generated sequences recover the evolutionary correlations found in natural MSAs (Pearson correlation of covariance matrices).</li>
<li><strong>Structural Fidelity</strong>: Generated sequences with high Hamming distance from native, folded them with AlphaFold 2 (no templates), and measured RMSD to the target structure.</li>
<li><strong>Property Profiling</strong>: Analyzed the distribution of predicted solubility (Protein-Sol) and thermostability (Thermoprot) to show that sequence diversity translates into a wider range of biochemical properties.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Massive Speedup</strong>: InvMSAFold is orders of magnitude faster than ESM-IF1 (CPU vs. GPU; the comparison is not hardware-matched). Because the &ldquo;heavy lifting&rdquo; (generating Potts parameters) happens once, sampling millions of sequences becomes trivial on CPUs.</li>
<li><strong>Better Diversity</strong>: The model captures evolutionary covariances significantly better than ESM-IF1 and ProteinMPNN (whose covariance recovery is comparable to ESM-IF1&rsquo;s). A PCA-based KL-divergence analysis (lower is better; 0 means a perfect match to the natural MSA distribution) shows InvMSAFold-AR scores of $0.49$ (Inter-cluster) and $0.67$ (Intra-cluster), compared to $15.8$ and $11.9$ for ESM-IF1, demonstrating that the generated sequences occupy a distribution much closer to natural MSAs.</li>
<li><strong>Robust Folding</strong>: Sequences generated far from the native sequence (high Hamming distance) still fold into the correct structure (low RMSD), whereas ESM-IF1 struggles to produce diverse valid sequences.</li>
<li><strong>Property Expansion</strong>: The method generates a wider spread of predicted biochemical properties (solubility/thermostability), which could be useful for virtual screening in protein design.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Source</strong>: CATH database (40% non-redundant dataset).</p>
<p><strong>Splits</strong>:</p>
<ul>
<li><strong>Training</strong>: ~22k domains.</li>
<li><strong>Inter-cluster Test</strong>: 10% of sequence clusters held out (unseen clusters, many with superfamilies absent from training).</li>
<li><strong>Intra-cluster Test</strong>: Unseen domains from seen clusters.</li>
<li><strong>Augmentation</strong>: MSAs generated using <strong>MMseqs2</strong> against the Uniprot50 database. Training uses random subsamples of these MSAs ($|M_X| = 64$ for PW, $|M_X| = 32$ for AR) to teach the model evolutionary variance.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Architecture</strong>:</p>
<ul>
<li><strong>Encoder</strong>: Pre-trained <strong>ESM-IF1</strong> encoder (GVP-GNN architecture). The encoder is used to pre-compute structure embeddings, with independent Gaussian noise (std = 5% of the embedding std) added during training.</li>
<li><strong>Decoder</strong>: 6-layer Transformer (8 heads) that outputs a latent tensor.</li>
<li><strong>Projection</strong>: Linear layers project latent tensor to fields $\mathbf{h}$ ($L \times q$) and low-rank tensor $\mathbf{V}$ ($L \times K \times q$).</li>
</ul>
<p><strong>Coupling Construction</strong>:
The full coupling tensor $\mathbf{J}$ is approximated via:
$$\mathbf{J}_{i,a,j,b} = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \mathbf{V}_{i,k,a} \mathbf{V}_{j,k,b}$$
Rank $K=48$ was used.</p>
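<p>The low-rank construction can be checked numerically with a single einsum (toy sizes below; only the rank $K=48$ and alphabet size $q=21$ match the paper, and materializing the full tensor is done here purely for verification):</p>

```python
import numpy as np

L, K, q = 8, 48, 21  # length, rank, amino-acid alphabet (20 residues + gap)
rng = np.random.default_rng(0)
V = rng.normal(size=(L, K, q))

# J[i,a,j,b] = (1/sqrt(K)) * sum_k V[i,k,a] * V[j,k,b]
J = np.einsum("ika,jkb->iajb", V, V) / np.sqrt(K)
```

<p>In practice the model never forms $\mathbf{J}$ explicitly: energies and pseudo-likelihood terms are contracted directly against $\mathbf{V}$, which is what brings the cost down from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$.</p>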
<p><strong>Loss Functions</strong>:
Two variants were trained:</p>
<ol>
<li><strong>InvMSAFold-PW</strong>: Trained via <strong>Pseudo-Likelihood (PL)</strong>. Computation is optimized to $\mathcal{O}(L)$ time using the low-rank property.</li>
<li><strong>InvMSAFold-AR</strong>: Trained via <strong>Autoregressive Likelihood</strong>. Couplings are masked ($J_{ij} = 0$ if $i &lt; j$) to allow exact likelihood computation and direct sampling without MCMC.</li>
</ol>
<h3 id="models">Models</h3>
<ul>
<li><strong>InvMSAFold-PW</strong>: Requires MCMC sampling (Metropolis-Hastings) at inference.</li>
<li><strong>InvMSAFold-AR</strong>: Allows direct, fast autoregressive sampling.</li>
<li><strong>Hyperparameters</strong>: AdamW optimizer, lr=$10^{-4}$ (PW) / $3.4 \times 10^{-4}$ (AR), 94 epochs. L2 regularization: $\lambda_h = \lambda_J = 10^{-4}$ (PW); $\lambda_J = 3.2 \times 10^{-6}$, $\lambda_h = 5.0 \times 10^{-5}$ (AR).</li>
</ul>
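<p>The MCMC inference used by the PW variant amounts to single-site Metropolis-Hastings over the Potts distribution. The sketch below is a minimal illustration under the low-rank parameterization, not the authors&rsquo; optimized sampler; function names and step counts are assumptions.</p>

```python
import numpy as np

def potts_energy(seq, h, V):
    """E(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, s_i, j, s_j], with
    J[i,a,j,b] = (1/sqrt(K)) * sum_k V[i,k,a] * V[j,k,b]."""
    L, K, _ = V.shape
    v = V[np.arange(L), :, seq]                       # v[i] = V[i, :, s_i], shape (L, K)
    tot = v.sum(axis=0)
    # sum_{i<j} v_i . v_j = ((sum_i v_i)^2 - sum_i |v_i|^2) / 2
    pair = (tot @ tot - np.einsum("ik,ik->", v, v)) / (2.0 * np.sqrt(K))
    return -(h[np.arange(L), seq].sum() + pair)

def mh_sample(h, V, n_steps=2000, seed=0):
    """Single-site Metropolis-Hastings chain targeting p(s) ∝ exp(-E(s))."""
    rng = np.random.default_rng(seed)
    L, _, q = V.shape
    seq = rng.integers(q, size=L)
    E = potts_energy(seq, h, V)
    for _ in range(n_steps):
        prop = seq.copy()
        prop[rng.integers(L)] = rng.integers(q)       # propose one mutated site
        E_new = potts_energy(prop, h, V)
        if rng.random() < np.exp(min(0.0, E - E_new)):  # accept with min(1, e^{-ΔE})
            seq, E = prop, E_new
    return seq
```

<p>Because each energy evaluation uses only $\mathbf{V}$, a sweep costs $\mathcal{O}(LK)$ per proposal and runs comfortably on a single CPU core, consistent with the inference setup reported below. The AR variant skips this loop entirely via direct autoregressive sampling.</p>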
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>RMSD</strong>: Structure fidelity (AlphaFold2 prediction vs. native structure).</li>
<li><strong>Covariance Pearson Correlation</strong>: Measures recovery of evolutionary pairwise statistics.</li>
<li><strong>KL Divergence</strong>: Between PCA-projected densities of natural and synthetic sequences (Gaussian KDE, kernel size 1.0).</li>
<li><strong>Sampling Speed</strong>: Wall-clock time vs. sequence length/batch size.</li>
</ul>
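<p>The covariance-recovery metric can be sketched as follows: one-hot encode each MSA, form the column covariance matrix, and correlate the off-diagonal entries of the two matrices. This is an illustrative implementation under those assumptions, not the authors&rsquo; evaluation script.</p>

```python
import numpy as np

def one_hot(msa, q):
    """(N, L) integer MSA -> (N, L*q) one-hot matrix."""
    n, length = msa.shape
    x = np.zeros((n, length * q))
    x[np.arange(n)[:, None], np.arange(length) * q + msa] = 1.0
    return x

def covariance_pearson(msa_a, msa_b, q=21):
    """Pearson correlation between the covariance matrices of two MSAs."""
    ca = np.cov(one_hot(msa_a, q), rowvar=False)
    cb = np.cov(one_hot(msa_b, q), rowvar=False)
    off = ~np.eye(ca.shape[0], dtype=bool)  # drop trivial self-variance entries
    return np.corrcoef(ca[off], cb[off])[0, 1]
```

<p>A generated sequence set that reproduces the evolutionary pairwise statistics of the natural MSA scores close to 1; an identical MSA compared with itself scores exactly 1.</p>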
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: Not specified in the paper. The GitHub repository reports testing on an NVIDIA RTX 3090, with training taking 10-24 hours depending on model variant.</li>
<li><strong>Inference</strong>:
<ul>
<li><strong>ESM-IF1</strong>: NVIDIA GeForce RTX 4060 Laptop (8GB).</li>
<li><strong>InvMSAFold</strong>: Single core of Intel i9-13905H CPU.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/luchinoprince/Potts_Inverse_Folding">Potts_Inverse_Folding</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and inference code (PyTorch)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Silva, L. A., Meynard-Piganeau, B., Lucibello, C., &amp; Feinauer, C. (2025). Fast Uncovering of Protein Sequence Diversity from Structure. <em>International Conference on Learning Representations (ICLR)</em>. <a href="https://arxiv.org/abs/2406.11975">https://arxiv.org/abs/2406.11975</a></p>
<p><strong>Publication</strong>: ICLR 2025 (Spotlight)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{silvaFastUncoveringProtein2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Fast Uncovering of Protein Sequence Diversity from Structure}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Silva, Luca Alessandro and {Meynard-Piganeau}, Barthelemy and Lucibello, Carlo and Feinauer, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Statistical Mechanics: Theory and Experiment}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{084003}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/1742-5468/adf0e7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://openreview.net/forum?id=1iuaxjssVp}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=1iuaxjssVp">OpenReview Page</a></li>
<li><a href="https://github.com/luchinoprince/Potts_Inverse_Folding">GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>InstructMol: Multi-Modal Molecular LLM for Drug Discovery</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/instructmol/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/instructmol/</guid><description>A multi-modal LLM aligning 2D molecular graphs with text via two-stage instruction tuning for drug discovery tasks.</description><content:encoded><![CDATA[<h2 id="instructmol-framework-overview">InstructMol Framework Overview</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This work proposes <strong>InstructMol</strong>, a novel multi-modal architecture and training paradigm. It focuses on engineering a system that aligns a pre-trained molecular graph encoder with a general-purpose Large Language Model (LLM). The paper&rsquo;s primary contribution is the <strong>Two-Stage Instruction Tuning</strong> strategy (Alignment Pre-training + Task-Specific Tuning) designed to bridge the modality gap between 2D molecular graphs and natural language.</p>
<h2 id="bridging-specialist-and-generalist-models">Bridging Specialist and Generalist Models</h2>
<p>Current AI approaches in drug discovery typically fall into two categories. Specialist models deliver high accuracy on specific tasks (such as property prediction) but require extensive labeled datasets and lack conversational adaptability. Conversely, generalist LLMs offer strong reasoning and dialogue capabilities but struggle to natively interpret complex structural data, often relying on brittle 1D text representations of molecules like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</p>
<p>There is a practical need for a unified &ldquo;Molecular Assistant&rdquo; capable of visually interpreting molecular graphs, reasoning about structure in natural language, and adapting across tasks like synthesis planning and property analysis without training from scratch.</p>
<h2 id="two-stage-modality-alignment">Two-Stage Modality Alignment</h2>
<p>The core novelty lies in the architecture and the <strong>two-stage training pipeline</strong> designed to align differing modalities efficiently:</p>
<ol>
<li><strong>MoleculeSTM Integration</strong>: InstructMol initializes its graph encoder with <strong>MoleculeSTM</strong>, which is already pre-aligned with text via contrastive learning, facilitating easier downstream alignment.</li>
<li><strong>Two-Stage Alignment Strategy</strong>:
<ul>
<li><strong>Stage 1 (Alignment Pre-training)</strong>: Freezes both the LLM and Graph Encoder; trains <em>only</em> a linear projector using a massive dataset of molecule-description pairs to map graph features into the LLM&rsquo;s token space.</li>
<li><strong>Stage 2 (Task-Specific Instruction Tuning)</strong>: Freezes the Graph Encoder; fine-tunes the Projector and the LLM (using <strong>LoRA</strong>) on specific downstream tasks. This allows the model to adapt its reasoning capabilities while preserving the structural understanding gained in Stage 1.</li>
</ul>
</li>
</ol>
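<p>The stage-wise freezing schedule above can be summarized as a simple selector. This is a schematic pure-Python sketch; the component names are illustrative placeholders, not identifiers from the authors&rsquo; implementation.</p>

```python
def trainable_components(stage):
    """Stage 1: train only the projector (graph encoder and LLM frozen).
    Stage 2: train the projector plus LoRA adapters on the LLM (encoder frozen)."""
    if stage == 1:
        return {"projector"}
    if stage == 2:
        return {"projector", "llm_lora_adapters"}
    raise ValueError(f"unknown stage: {stage}")

stage1_train = trainable_components(1)
stage2_train = trainable_components(2)
```

<p>The design keeps the expensive components frozen throughout: only the lightweight projector (Stage 1) and low-rank LoRA adapters (Stage 2) receive gradients, so the structural knowledge in the graph encoder is never overwritten.</p>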
<h2 id="task-evaluation-in-drug-discovery">Task Evaluation in Drug Discovery</h2>
<p>The authors evaluated InstructMol across three distinct categories of drug discovery tasks, comparing it against generalist LLMs (Vicuna, LLaMA, <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a>) and specialist models (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, MolT5):</p>
<ol>
<li><strong>Property Prediction</strong>:
<ul>
<li><em>Regression</em>: Predicting quantum mechanical properties (HOMO, LUMO, Gap) using the <a href="/notes/chemistry/datasets/qm9/">QM9</a> dataset.</li>
<li><em>Classification</em>: Predicting biological activity (BACE, BBBP, HIV) using <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
</ul>
</li>
<li><strong>Molecule Description Generation</strong>: Generating natural language descriptions of molecules using the ChEBI-20 dataset.</li>
<li><strong>Chemical Reaction Analysis</strong>:
<ul>
<li><em>Forward Reaction Prediction</em>: Predicting products from reactants.</li>
<li><em>Reagent Prediction</em>: Identifying necessary reagents.</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: Suggesting reactants for a given product.</li>
</ul>
</li>
</ol>
<p><strong>Ablation Studies</strong> tested the impact of the projector type (Linear vs. MLP), LLM scale (7B vs 13B), and the necessity of the two-stage training approach.</p>
<h2 id="core-findings-and-limitations">Core Findings and Limitations</h2>
<ul>
<li><strong>Improvement Over Baseline Generalists</strong>: InstructMol significantly outperformed generalist LLMs (like LLaMA and Galactica) on all tasks, demonstrating the value of incorporating explicit graph modalities.</li>
<li><strong>Reducing the Gap with Specialists</strong>: While InstructMol brings versatile reasoning capabilities, it still trails highly optimized specialist models (such as Uni-Mol and MolT5) on tasks like molecule description generation. This remaining gap likely stems from its reliance on a relatively small alignment pre-training dataset (~264K PubChem pairs) and the information bottleneck of using a simple linear projector, compared to the millions of structures used to train expert foundational models.</li>
<li><strong>Importance of Alignment</strong>: Ablation studies confirmed that skipping Stage 1 (Alignment Pre-training) degraded performance, proving that a dedicated phase for projecting graph features into text space is crucial.</li>
<li><strong>Limitation</strong>: The model struggles with highly imbalanced datasets (e.g., HIV) and complex reaction mixtures where mapping multiple graph tokens to text becomes ambiguous.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training pipeline utilizes distinct datasets for the two stages. <strong>Note:</strong> As of the latest repository update, the finely processed instruction-tuning datasets (e.g., the filtered ~264K PubChem pairs and instruction-formatted subset pairs) are listed as &ldquo;coming soon&rdquo;, requiring manual recreation for full reproduction.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Stage 1</strong> (Alignment)</td>
          <td style="text-align: left"><strong><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></strong></td>
          <td style="text-align: left">~264K pairs</td>
          <td style="text-align: left">Molecule-text pairs. Filtered from 330K to remove invalid descriptions and overlap with the ChEBI-20 test set.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Prop. Reg.)</td>
          <td style="text-align: left"><strong>QM9</strong></td>
          <td style="text-align: left">362K samples</td>
          <td style="text-align: left">Quantum mechanics properties (HOMO, LUMO, Gap).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Prop. Class.)</td>
          <td style="text-align: left"><strong>MoleculeNet</strong></td>
          <td style="text-align: left">35K samples</td>
          <td style="text-align: left">BACE, BBBP, HIV datasets. Converted to instruction format (Yes/No answer).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Generation)</td>
          <td style="text-align: left"><strong>ChEBI-20</strong></td>
          <td style="text-align: left">26.5K samples</td>
          <td style="text-align: left">Molecule description generation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Reactions)</td>
          <td style="text-align: left"><strong>USPTO</strong></td>
          <td style="text-align: left">~380K samples</td>
          <td style="text-align: left">Combined datasets for Forward (125K), Retrosynthesis (130K), and Reagent (125K) prediction.</td>
      </tr>
  </tbody>
</table>
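<p>The Stage 1 filtering step (dropping pairs with invalid descriptions or test-set overlap) can be sketched as follows; the predicate shown here is a minimal illustration, not the paper&rsquo;s exact cleaning pipeline:</p>

```python
def filter_pairs(pairs, chebi_test_smiles):
    """Keep only molecule-text pairs with a non-empty description whose
    SMILES does not appear in the ChEBI-20 test set (illustrative sketch;
    the paper's actual validity checks may be stricter)."""
    return [(smi, desc) for smi, desc in pairs
            if desc and smi not in chebi_test_smiles]
```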
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Two-Stage Training</strong>:
<ol>
<li><strong>Alignment Pre-training</strong>: Updates only the Projector. The objective maximizes the probability of generating the target description token sequence $\mathbf{X}_A$ given the molecule input $\mathbf{X}_M$ and instruction $\mathbf{X}_I$:
$$p(\mathbf{X}_A | \mathbf{X}_M, \mathbf{X}_I) = \prod_{i=1}^L p_\theta(x_i | \mathbf{X}_G \parallel \mathbf{X}_S, \mathbf{X}_I, \mathbf{X}_{A,&lt;i})$$
where the molecule input $\mathbf{X}_M$ is the concatenation ($\parallel$) of the projected graph tokens $\mathbf{X}_G$ and the SMILES token sequence $\mathbf{X}_S$.</li>
<li><strong>Instruction Tuning</strong>: Updates Projector + LLM (via LoRA) using standard autoregressive language modeling on task-specific instructions. The objective minimizes the negative log-likelihood of generating the target response $R$ of length $L$:
$$\mathcal{L}(\theta) = -\sum_{i=1}^L \log p(R_i | I, M, R_{&lt;i}; \theta)$$
where $I$ represents the instruction and $M$ is the multi-modal molecular input.</li>
</ol>
</li>
<li><strong>LoRA (Low-Rank Adaptation)</strong>: Applied to the LLM in Stage 2. Rank $r=64$, Scaling $\alpha=16$.</li>
<li><strong>Optimization</strong>: AdamW optimizer. Learning rate starts at 2e-3 (Stage 1) and 8e-5 (Stage 2) with cosine decay. Warm-up ratio 0.03.</li>
</ul>
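<p>The low-rank update in Stage 2 can be illustrated with a pure-Python sketch of the LoRA forward pass (the paper uses $r=64$, $\alpha=16$; the tiny matrices below are hypothetical and only demonstrate the $h = xW + \frac{\alpha}{r}\,xAB$ decomposition):</p>

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=16, r=64):
    """LoRA sketch: h = x W + (alpha / r) * x A B, where the low-rank
    factors A (d x r) and B (r x k) are the only trainable weights and
    the base weight W stays frozen."""
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * d for b, d in zip(br, dr)]
            for br, dr in zip(base, delta)]
```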
<h3 id="models">Models</h3>
<p><strong>Note:</strong> The official repository currently lists the final fine-tuned <strong>InstructMol weights</strong> as &ldquo;coming soon.&rdquo; Consequently, one must fine-tune the components using the provided scripts. Base model weights (Vicuna-7B and MoleculeSTM) are publicly available via Hugging Face.</p>
<ul>
<li><strong>Graph Encoder ($f_g$)</strong>:
<ul>
<li>Architecture: Graph Isomorphism Network (GIN) with 5 layers.</li>
<li>Hidden Dimension: 300.</li>
<li>Initialization: <strong>MoleculeSTM</strong> checkpoint (pre-trained via contrastive learning).</li>
<li>Status: <strong>Frozen</strong> during Stage 2.</li>
</ul>
</li>
<li><strong>LLM</strong>:
<ul>
<li>Base: <strong>Vicuna-v1.3-7B</strong>.</li>
<li>Status: Frozen in Stage 1; LoRA fine-tuned in Stage 2.</li>
</ul>
</li>
<li><strong>Projector</strong>:
<ul>
<li>Architecture: Linear Layer.</li>
<li>Function: Maps the node-level graph representation $Z_G \in \mathbb{R}^{N \times d}$ to the dimension of the LLM&rsquo;s word-embedding space.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric Libraries</strong>: RDKit for validity/fingerprints, standard NLP libraries for BLEU/ROUGE.</li>
<li><strong>Reaction Metrics</strong>: Fingerprint <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto Similarity</a> (FTS), Exact Match, Levenshtein distance, and validity (via RDKit).</li>
<li><strong>Description Metrics</strong>: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR.</li>
</ul>
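<p>The fingerprint Tanimoto similarity used for the reaction metrics is the Jaccard index over fingerprint bit sets; a minimal sketch (RDKit computes this over actual molecular fingerprints, the bare-set version here is just the underlying formula):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A intersect B| / |A union B|."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```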
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA RTX A6000 (48GB VRAM).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Stage 1: 5 epochs.</li>
<li>Stage 2: 20-50 epochs (Description Generation), 10 epochs (Properties/Reactions).</li>
</ul>
</li>
<li><strong>Batch Size</strong>: 128 for both stages.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/IDEA-XL/InstructMol">InstructMol (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache 2.0 (code), CC BY-NC 4.0 (data)</td>
          <td style="text-align: left">Training/evaluation scripts provided; fine-tuned weights listed as &ldquo;coming soon&rdquo;</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/lmsys/vicuna-7b-v1.3">Vicuna-7B v1.3</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Non-commercial (LLaMA license)</td>
          <td style="text-align: left">Base LLM; must be downloaded separately</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/chao1224/MoleculeSTM">MoleculeSTM</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Pre-trained graph encoder checkpoint</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cao, H., Liu, Z., Lu, X., Yao, Y., &amp; Li, Y. (2025). InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery. <em>Proceedings of the 31st International Conference on Computational Linguistics</em>, 354-379.</p>
<p><strong>Publication</strong>: COLING 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{caoInstructMolMultiModalIntegration2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{InstructMol}}: {{Multi-Modal Integration}} for {{Building}} a {{Versatile}} and {{Reliable Molecular Assistant}} in {{Drug Discovery}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{InstructMol}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 31st {{International Conference}} on {{Computational Linguistics}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Cao, He and Liu, Zijing and Lu, Xingyu and Yao, Yuan and Li, Yu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and {Al-Khalifa}, Hend and Eugenio, Barbara Di and Schockaert, Steven}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{354--379}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://aclanthology.org/2025.coling-main.25/}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Abu Dhabi, UAE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialists, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IDEA-XL/InstructMol">Official Repository</a></li>
</ul>
]]></content:encoded></item><item><title>DynamicFlow: Integrating Protein Dynamics into Drug Design</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/dynamicflow/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/dynamicflow/</guid><description>Flow matching model that co-generates ligands and flexible protein pockets, addressing rigid-receptor limitations in structure-based drug design.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) with a strong <strong>Resource</strong> ($\Psi_{\text{Resource}}$) component.</p>
<ul>
<li><strong>Method</strong>: It proposes <strong>DynamicFlow</strong>, a novel multiscale architecture combining atom-level SE(3)-equivariant GNNs (SE(3) is the special Euclidean group in 3D, comprising all rigid rotations and translations; equivariance means model outputs transform consistently when the inputs are rotated or translated) and residue-level Transformers within a <a href="/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/">flow matching</a> framework to model the joint distribution of ligand generation and protein conformational change.</li>
<li><strong>Resource</strong>: It curates a significant dataset derived from MISATO, pairing AlphaFold2-predicted apo structures with multiple MD-simulated holo states, specifically filtered for flow matching tasks.</li>
</ul>
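<p>The equivariance property is easy to see on a toy function: the centroid of a point cloud commutes with any rotation, which is exactly the behavior an SE(3)-equivariant layer must satisfy for its position outputs. A minimal self-check (illustrative only, not the paper&rsquo;s architecture):</p>

```python
import math

def rotate_z(p, theta):
    """Rotate a 3D point about the z-axis by angle theta."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def centroid(points):
    """Mean position: a trivially SE(3)-equivariant function of a point cloud."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

<p>Rotating the inputs and then taking the centroid gives the same result as rotating the centroid, which is the equivariance condition $f(R\,x) = R\,f(x)$.</p>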
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Traditional Structure-Based Drug Design (SBDD) methods typically assume the protein target is rigid, which limits their applicability because proteins are dynamic and undergo conformational changes (induced fit) upon ligand binding.</p>
<ul>
<li><strong>Biological Reality</strong>: Proteins exist as ensembles of states; binding often involves transitions from &ldquo;apo&rdquo; (unbound) to &ldquo;holo&rdquo; (bound) <a href="/posts/geom-conformer-generation-dataset/">conformational changes</a>, sometimes revealing cryptic pockets.</li>
<li><strong>Computational Bottleneck</strong>: <a href="/notes/chemistry/molecular-simulation/">Molecular Dynamics (MD)</a> simulates these changes but incurs high computational costs due to energy barriers.</li>
<li><strong>Gap</strong>: <a href="/notes/machine-learning/generative-models/">Existing generative models</a> for SBDD mostly condition on a fixed pocket structure, ignoring the co-adaptation of the protein and ligand.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>simultaneous modeling of ligand generation and protein conformational dynamics</strong> using a unified flow matching framework.</p>
<ul>
<li><strong>DynamicFlow Architecture</strong>: A multiscale model that treats the protein as both full-atom (for interaction) and residue-level frames (for large-scale dynamics), utilizing separate flow matching objectives for backbone frames, side-chain torsions, and ligand atoms.</li>
<li><strong>Stochastic Flow (SDE)</strong>: Introduction of a <a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">stochastic variant</a> (DynamicFlow-SDE) that improves robustness and diversity compared to the deterministic ODE flow.</li>
<li><strong>Coupled Generation</strong>: The model learns to transport the <em>apo</em> pocket distribution to the <em>holo</em> pocket distribution while simultaneously denoising the ligand, advancing beyond rigid pocket docking methods.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the method on a curated dataset of 5,692 protein-ligand complexes.</p>
<ul>
<li><strong>Baselines</strong>: Compared against rigid-pocket SBDD methods: Pocket2Mol, TargetDiff, and IPDiff (adapted as TargetDiff* and IPDiff* for fair comparison of atom numbers). Also compared against conformation sampling baselines (Str2Str).</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Ligand Quality</strong>: Vina Score (binding affinity), QED (drug-likeness), SA (synthesizability), Lipinski&rsquo;s rule of 5.</li>
<li><strong>Pocket Quality</strong>: RMSD between generated and ground-truth holo pockets, Cover Ratio (percentage of holo states successfully retrieved), and Pocket Volume distributions.</li>
<li><strong>Interaction</strong>: Protein-Ligand Interaction Profiler (PLIP) to measure specific non-covalent interactions.</li>
</ul>
</li>
<li><strong>Ablations</strong>: Tested the impact of the interaction loss, residue-level Transformer, and SDE vs. ODE formulations.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Improved Affinity</strong>: DynamicFlow-SDE achieved the best (lowest) Vina scores ($-7.65$) compared to baselines like TargetDiff ($-5.09$) and Pocket2Mol ($-5.50$). Note that Vina scores are a computational proxy and do not directly predict experimental binding affinity. Moreover, Vina score optimization is gameable: molecules can achieve strong computed binding energies while remaining synthetically inaccessible. QED and SA scores, which assess drug-likeness and synthesizability respectively, were reported but were not primary optimization targets in the paper, which limits the strength of this affinity claim.</li>
<li><strong>Realistic Dynamics</strong>: The model successfully generated holo-like pocket conformations with volume distributions and interaction profiles closer to ground-truth MD simulations than the initial apo structures.</li>
<li><strong>Enhancing Rigid Methods</strong>: Holo pockets generated by DynamicFlow served as better inputs for rigid-SBDD baselines (e.g., TargetDiff improved from $-5.09$ to $-9.00$ and IPDiff improved from $-7.55$ to $-11.04$ when using &ldquo;Our Pocket&rdquo;), suggesting the method can act as a &ldquo;pocket refiner&rdquo;.</li>
<li><strong>ODE vs. SDE Trade-off</strong>: The deterministic ODE variant achieves better pocket RMSD, while the stochastic SDE variant achieves better Cover Ratio (diversity of holo states captured) and binding affinity. Neither dominates uniformly.</li>
<li><strong>Conformation Sampling Baseline</strong>: Str2Str, a dedicated conformation sampling baseline, performed worse than simply perturbing the apo structure with noise. One interpretation is that this highlights the difficulty of the apo-to-holo prediction task; another is that Str2Str was not designed specifically for apo-to-holo prediction, making it a limited test of its capabilities.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset is derived from <strong>MISATO</strong>, which contains MD trajectories for PDBbind complexes.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td>Curated MISATO</td>
          <td>5,692 complexes</td>
          <td>Filtered for valid MD (<a href="/posts/kabsch-algorithm/">RMSD</a> $&lt; 3\text{\AA}$), clustered to remove redundancy. Contains 46,235 holo-ligand conformations total.</td>
      </tr>
      <tr>
          <td><strong>Apo Structures</strong></td>
          <td>AlphaFold2</td>
          <td>N/A</td>
          <td>Apo structures were obtained by mapping PDB IDs to UniProt and retrieving AlphaFold2 predictions, then aligning to MISATO structures.</td>
      </tr>
      <tr>
          <td><strong>Splits</strong></td>
          <td>Standard</td>
          <td>50 test complexes</td>
          <td>50 complexes with no overlap with the training set selected for testing. Note: 50 is a small held-out set; results should be interpreted cautiously.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Clustering</strong>: Holo-ligand conformations clustered with RMSD threshold $1.0\text{\AA}$; top 10 clusters kept per complex.</li>
<li><strong>Pocket Definition</strong>: Residues within $7\text{\AA}$ of the ligand.</li>
<li><strong>Alignment</strong>: AlphaFold predicted structures (apo) aligned to MISATO holo structures using sequence alignment (Smith-Waterman) to identify pocket residues.</li>
</ul>
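<p>The 7&thinsp;&Aring; pocket definition above amounts to a distance cutoff between residue atoms and ligand atoms; a minimal sketch (the residue/atom data layout is assumed for illustration):</p>

```python
import math

def pocket_residues(residue_atoms, ligand_atoms, cutoff=7.0):
    """Return IDs of residues having any atom within `cutoff` angstroms of
    any ligand atom (sketch of the 7 A pocket definition).
    residue_atoms: dict mapping residue ID -> list of (x, y, z) coordinates."""
    return [rid for rid, atoms in residue_atoms.items()
            if any(math.dist(a, l) <= cutoff
                   for a in atoms for l in ligand_atoms)]
```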
<h3 id="algorithms">Algorithms</h3>
<p><strong>Flow Matching Framework</strong>:</p>
<ul>
<li><strong>Continuous Variables</strong> (Pocket translation/rotation/torsions, Ligand positions): Modeled using <strong>Conditional Flow Matching (CFM)</strong>.
<ul>
<li><em>Prior</em>: Apo state for pocket; Normal distribution for ligand positions.</li>
<li><em>Target</em>: Holo state from MD; Ground truth ligand.</li>
<li><em>Interpolant</em>: Linear interpolation for Euclidean variables; Geodesic for rotations ($SO(3)$, the rotation-only subgroup of SE(3) containing all 3D rotations but not translations); Wrapped linear interpolation for torsions (Torus).</li>
</ul>
</li>
<li><strong>Discrete Variables</strong> (Ligand atom/bond types): Modeled using <strong>Discrete Flow Matching</strong> based on Continuous-Time Markov Chains (CTMC).
<ul>
<li><em>Rate Matrix</em>: Interpolates between mask token and data distribution.</li>
</ul>
</li>
<li><strong>Loss Function</strong>: Weighted sum of 7 losses:
<ol>
<li>Translation CFM (Eq 5)</li>
<li>Rotation CFM (Eq 7)</li>
<li>Torsion CFM (Eq 11)</li>
<li>Ligand Position CFM</li>
<li>Ligand Atom Type CTMC (Eq 14)</li>
<li>Ligand Bond Type CTMC</li>
<li><strong>Interaction Loss</strong> (Eq 18): Explicitly penalizes deviations in pairwise distances between protein and ligand atoms for pairs $\leq 3.5\text{\AA}$.</li>
</ol>
</li>
</ul>
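<p>The interpolants for the continuous variables can be sketched concretely: linear interpolation for Euclidean quantities (with constant target velocity $x_1 - x_0$), and wrapped linear interpolation on the torus for torsions, which follows the shorter angular arc. This is a generic CFM sketch, not the paper&rsquo;s exact parameterization:</p>

```python
import math

def linear_interpolant(x0, x1, t):
    """Euclidean CFM: x_t = (1 - t) x_0 + t x_1; target velocity x_1 - x_0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v

def wrapped_interpolant(phi0, phi1, t):
    """Torsion CFM on the torus: interpolate along the shorter arc by
    wrapping the angular difference into (-pi, pi]."""
    d = (phi1 - phi0 + math.pi) % (2 * math.pi) - math.pi
    return phi0 + t * d, d
```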
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: <strong>DynamicFlow</strong> is a multiscale model with 15.9M parameters.</p>
<ol>
<li><strong>Atom-Level SE(3)-Equivariant GNN</strong>:
<ul>
<li><em>Input</em>: Complex graph (k-NN) and Ligand graph (fully connected).</li>
<li><em>Layers</em>: 6 EGNN blocks modified to maintain node and edge hidden states.</li>
<li><em>Function</em>: Updates ligand positions and predicts ligand atom/bond types.</li>
</ul>
</li>
<li><strong>Residue-Level Transformer</strong>:
<ul>
<li><em>Input</em>: Aggregated atom features from the GNN + Residue frames/torsions.</li>
<li><em>Layers</em>: 4 Transformer blocks with <strong>Invariant Point Attention (IPA)</strong>.</li>
<li><em>Function</em>: Updates protein residue frames (translation/rotation) and predicts side-chain torsions.</li>
</ul>
</li>
</ol>
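<p>The k-NN graph construction feeding the atom-level GNN can be sketched as a brute-force nearest-neighbor edge list (real implementations would use a spatial index; this illustrative version just shows the directed-edge semantics):</p>

```python
import math

def knn_edges(positions, k):
    """Directed k-NN edge list over 3D positions, as used to build a
    complex graph (brute-force sketch)."""
    edges = []
    for i, p in enumerate(positions):
        nearest = sorted((math.dist(p, q), j)
                         for j, q in enumerate(positions) if j != i)
        edges.extend((i, j) for _, j in nearest[:k])
    return edges
```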
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Vina Score</strong>: <code>vina_minimize</code> mode used for binding affinity.</li>
<li><strong>RMSD</strong>: Minimum RMSD between generated pocket and ground-truth holo conformations.</li>
<li><strong>Cover Ratio</strong>: % of ground-truth holo conformations covered by at least one generated sample (threshold $1.42\text{\AA}$).</li>
<li><strong>POVME 3</strong>: For pocket volume calculation.</li>
</ul>
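<p>The Cover Ratio metric reduces to a threshold test over a generated-vs-ground-truth RMSD matrix; a minimal sketch with the paper&rsquo;s $1.42\text{\AA}$ threshold:</p>

```python
def cover_ratio(rmsd_matrix, threshold=1.42):
    """Fraction of ground-truth holo conformations with at least one
    generated sample within `threshold` angstroms RMSD.
    rmsd_matrix[i][j]: RMSD of generated sample j to ground-truth
    conformation i."""
    covered = sum(1 for row in rmsd_matrix if min(row) <= threshold)
    return covered / len(rmsd_matrix)
```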
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Inference Benchmark</strong>: 1x Tesla V100-SXM2-32GB.</li>
<li><strong>Speed</strong>: Generates 10 ligands in ~35-36 seconds (100 NFE), significantly faster than baselines such as Pocket2Mol (980s) or the diffusion-based TargetDiff (156s).</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhou, X., Xiao, Y., Lin, H., He, X., Guan, J., Wang, Y., Liu, Q., Zhou, F., Wang, L., &amp; Ma, J. (2025). Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows. <em>International Conference on Learning Representations (ICLR)</em>. <a href="https://arxiv.org/abs/2503.03989">https://arxiv.org/abs/2503.03989</a></p>
<p><strong>Publication</strong>: ICLR 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhouIntegratingProteinDynamics2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhou, Xiangxin and Xiao, Yi and Lin, Haowei and He, Xinheng and Guan, Jiaqi and Wang, Yang and Liu, Qiang and Zhou, Feng and Wang, Liang and Ma, Jianzhu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://arxiv.org/abs/2503.03989}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2503.03989">arXiv Page</a></li>
<li>Code: no public repository available at time of writing</li>
</ul>
]]></content:encoded></item><item><title>ChemDFM-X: Multimodal Foundation Model for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-x/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-x/</guid><description>Multimodal chemical model integrating 5 modalities (2D graphs, 3D conformations, images, MS2/IR spectra) trained on 7.6M instructions.</description><content:encoded><![CDATA[<h2 id="chemdfm-x-contribution-and-architecture">ChemDFM-X Contribution and Architecture</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> contribution.</p>
<p><strong>Method</strong>: The paper proposes a novel &ldquo;Cross-modal Dialogue Foundation Model&rdquo; architecture that aligns five distinct chemical modalities (2D graphs, 3D conformations, images, MS2 spectra, IR spectra) to a single LLM decoder using separate encoders and projection modules. It establishes strong baseline performance across multiple modalities compared against current generalist models.</p>
<p><strong>Resource</strong>: The paper addresses the scarcity of multimodal chemical data by constructing a <strong>7.6M instruction-tuning dataset</strong>. This dataset is largely synthesized from seed SMILES strings using approximate calculations (MMFF94, CFM-ID, Chemprop-IR) and specialist model predictions.</p>
<h2 id="bridging-experimental-data-and-llms">Bridging Experimental Data and LLMs</h2>
<p>Existing chemical AI models generally fall into two distinct categories. Task-specific specialist models achieve high accuracy on singular objectives, such as property prediction or molecular generation, but require strict formatting and lack conversational flexibility. Conversely, early chemical large language models provide natural language interaction but are restricted to text and SMILES strings. ChemDFM-X addresses this gap by enabling large multimodal models to process the experimental characterization data (<a href="https://en.wikipedia.org/wiki/Tandem_mass_spectrometry">MS2 spectra</a> and <a href="https://en.wikipedia.org/wiki/Infrared_spectroscopy">IR spectra</a>) and visual data routinely used in practical chemistry workflows.</p>
<h2 id="synthetic-data-scaling-for-modality-alignment">Synthetic Data Scaling for Modality Alignment</h2>
<p>The core novelty lies in the <strong>&ldquo;Any-to-Text&rdquo; alignment strategy via synthetic data scaling</strong>:</p>
<ol>
<li>
<p><strong>Comprehensive Modality Support</strong>: ChemDFM-X incorporates experimental characterization data (MS2 and IR spectra) alongside 2D graphs, 3D conformations, and images. The data representations are formally defined mathematically rather than as raw pixels:</p>
<ul>
<li><strong>Molecular Graph</strong>: An undirected graph $G = (\textbf{V}, \textbf{E})$ with atom set $\textbf{V}$ and bond set $\textbf{E}$.</li>
<li><strong>Molecular Conformation</strong>: An undirected graph $G = (\textbf{V}', \textbf{E})$ whose nodes store spatial coordinates: $\textbf{v}_i = (x_i, y_i, z_i, a_i)$.</li>
<li><strong>MS2 Spectrum</strong>: Treated as a point sequence of discrete mass-to-charge ratios and intensities, tokenized via a discrete codebook: $\textbf{M} = ((r_1, I_1), (r_2, I_2), \dots, (r_n, I_n))$.</li>
<li><strong>IR Spectrum</strong>: Treated as a dense sequence of continuous wavelengths and absorption intensities, directly reshaped for feature extraction: $\textbf{R} = ((w_1, t_1), (w_2, t_2), \dots, (w_l, t_l))$.</li>
</ul>
<p>The authors trained new Sequence Transformer encoders from scratch for the MS2 and IR modalities since suitable pre-trained models did not exist.</p>
</li>
<li>
<p><strong>Synthetic Data Generation Pipeline</strong>: The authors generated a 7.6M sample dataset by starting with 1.3M seed SMILES and using &ldquo;approximate calculations&rdquo; to generate missing modalities:</p>
<ul>
<li>3D conformations via <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94</a> force field optimization</li>
<li>MS2 spectra via CFM-ID 4.0 (Competitive Fragmentation Modeling)</li>
<li>IR spectra via Chemprop-IR (Message Passing Neural Network)</li>
</ul>
</li>
<li>
<p><strong>Cross-Modal Synergy</strong>: The model demonstrates that training on reaction images improves recognition performance by leveraging semantic chemical knowledge (reaction rules) to correct visual recognition errors, an emergent capability from multimodal training.</p>
</li>
</ol>
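<p>The MS2 tokenization described above (a point sequence mapped through a discrete codebook) can be sketched as simple m/z binning plus intensity quantization; the bin width and level count here are assumptions for illustration, and the paper&rsquo;s actual codebook construction may differ:</p>

```python
def tokenize_ms2(peaks, mz_bin=1.0, n_levels=10):
    """Map (m/z, relative intensity) peaks to discrete
    (m/z-bin, intensity-level) codebook tokens (illustrative sketch)."""
    tokens = []
    for mz, intensity in peaks:
        mz_id = int(mz // mz_bin)                          # m/z bin index
        level = min(int(intensity * n_levels), n_levels - 1)  # quantized intensity
        tokens.append((mz_id, level))
    return tokens
```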
<h2 id="multimodal-benchmarking-with-chemllmbench">Multimodal Benchmarking with ChemLLMBench</h2>
<p>The model was evaluated using a customized version of <strong><a href="/notes/chemistry/llm-applications/chemllmbench-eight-chemistry-tasks/">ChemLLMBench</a></strong> and <strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> across three modality categories:</p>
<ol>
<li>
<p><strong>Structural Modalities</strong> (2D Graphs &amp; 3D Conformations):</p>
<ul>
<li>Molecule recognition and captioning</li>
<li>Property prediction (MoleculeNet: BACE, BBBP, ClinTox, HIV, Tox21)</li>
<li>Compared against specialist models (Mole-BERT, Uni-Mol, MolXPT, MolCA) and generalist models (3D-MoLM, ChemDFM, <a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM</a>)</li>
</ul>
</li>
<li>
<p><strong>Visual Modalities</strong> (Images):</p>
<ul>
<li>Single molecule image recognition</li>
<li>Reaction image recognition</li>
<li>Compared against GPT-4o, Gemini 1.5 Pro, Qwen-VL, LLaVA, and specialist models <a href="/notes/chemistry/optical-structure-recognition/vision-language/molnextr/">MolNextr</a> and <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a></li>
</ul>
</li>
<li>
<p><strong>Characterization Modalities</strong> (MS2 &amp; IR Spectra):</p>
<ul>
<li>Spectral analysis tasks (identifying molecules from spectra)</li>
<li>Contextualized spectral interpretation (combining spectra with reaction context)</li>
<li>Novel evaluation requiring integration of spectroscopic data with reaction knowledge</li>
</ul>
</li>
</ol>
<h2 id="cross-modal-synergy-and-generalist-performance">Cross-Modal Synergy and Generalist Performance</h2>
<p><strong>Key Findings</strong>:</p>
<ol>
<li>
<p><strong>Leading Generalist Performance</strong>: ChemDFM-X establishes a new benchmark among existing generalist models (such as 3D-MoLM and ChemLLM), achieving performance metrics that match dedicated specialist models across several multimodal tasks.</p>
</li>
<li>
<p><strong>Failure of General LMMs</strong>: General vision models (GPT-4o, Gemini 1.5 Pro, Qwen-VL, LLaVA, InternLM-XComposer2, DocOwl) failed significantly on chemical image recognition tasks (0% accuracy for most models on molecule and reaction recognition, Table 9), demonstrating that chemical domain knowledge cannot be assumed from general pre-training.</p>
</li>
<li>
<p><strong>Cross-Modal Error Correction</strong>: In reaction image recognition, ChemDFM-X achieved higher accuracy (53.0%) than on single molecules (46.0%) (Table 9). The authors conclude the model uses its internal knowledge of chemical reaction rules to correct recognition errors in the visual modality, an emergent capability from multimodal training.</p>
</li>
<li>
<p><strong>Reliance on Reaction Context for Spectra</strong>: In zero-shot scenarios, ChemDFM-X essentially fails at pure spectral recognition (achieving 0% and 1% top-1 accuracy on MS2 and IR spectra alone, Table 11). However, when SMILES-based reaction context is included, performance rises to 45% (MS2) and 64% (IR) on the reaction prediction task, and 29% (MS2) and 60% (IR) on <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> (Table 11). This indicates the model uses spectral data as a soft prior to constrain textual deductions. Furthermore, the paper compares ChemDFM-X’s spectral identification performance exclusively against text-only LLMs that cannot process spectra, omitting comparisons against established specialist tools.</p>
</li>
<li>
<p><strong>Surrogate Distillation Trade-offs</strong>: Because the spectral training data relies entirely on outputs from CFM-ID 4.0 and Chemprop-IR, ChemDFM-X effectively distills these surrogate models. Any inherent predictive biases or inaccuracies from these underlying tools are permanently embedded in the new ChemDFM-X encoders.</p>
</li>
</ol>
<p><strong>Main Conclusion</strong>: The &ldquo;separate encoders + unified decoder&rdquo; architecture with synthetic data generation enables effective multimodal chemical understanding, bridging the gap between specialist and generalist AI systems for chemistry.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors constructed a <strong>7.6M sample instruction-tuning dataset</strong> derived from <strong>1.3M seed SMILES</strong> (sourced from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> and USPTO). <strong>Note</strong>: The final 7.6M multimodal tuning dataset itself isn&rsquo;t publicly available.</p>
<p><strong>Generation Pipeline</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Generation Method</th>
          <th>Tool/Model</th>
          <th>Sample Count</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Graphs</strong></td>
          <td>Direct extraction from SMILES</td>
          <td>RDKit</td>
          <td>1.1M</td>
      </tr>
      <tr>
          <td><strong>3D Conformations</strong></td>
          <td>Force field optimization</td>
          <td>RDKit + MMFF94</td>
          <td>1.3M (pseudo-optimal)</td>
      </tr>
      <tr>
          <td><strong>Molecule Images</strong></td>
          <td>Rendering with augmentation</td>
          <td>RDKit, Indigo, <a href="/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/">ChemPix</a></td>
          <td>~1M (including handwritten style)</td>
      </tr>
      <tr>
          <td><strong>Reaction Images</strong></td>
          <td>Rendering from reaction SMILES</td>
          <td>RDKit</td>
          <td>300K</td>
      </tr>
      <tr>
          <td><strong>MS2 Spectra</strong></td>
          <td>Computational prediction</td>
          <td>CFM-ID 4.0</td>
          <td>~700K</td>
      </tr>
      <tr>
          <td><strong>IR Spectra</strong></td>
          <td>Computational prediction</td>
          <td>Chemprop-IR</td>
          <td>~1M</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Augmentation</strong>:</p>
<ul>
<li>Molecule images augmented with &ldquo;handwritten&rdquo; style using the ChemPix pipeline</li>
<li>Multiple rendering styles (RDKit default, Indigo clean)</li>
<li>Spectra generated at multiple energy levels (10eV, 20eV, 40eV for MS2)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Architecture</strong>: &ldquo;Separate Encoders + Unified Decoder&rdquo;</p>
<p><strong>Code Availability</strong>: The authors have released only the inference code; the cross-modal projector training scripts and the synthetic data-generation pipeline remain closed.</p>
<p><strong>Modality Alignment</strong>:</p>
<ul>
<li>Each modality has a dedicated encoder (frozen pre-trained models where available)</li>
<li>For graph, conformation, MS2, and IR modalities: <strong>2-layer MLP projector</strong> (Linear, GELU, Linear) maps encoder features to LLM input space</li>
<li>For images: <strong>H-Reducer</strong> module compresses image tokens by factor of $n=8$ to handle high-resolution chemical images, then projects to LLM input space</li>
<li>All projected features are concatenated and fed to the unified LLM decoder</li>
</ul>
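<p>As a concrete sketch, the Linear&ndash;GELU&ndash;Linear projector is small enough to write out in full. The dimensions below are toy values for illustration (the real projectors map, e.g., 300-d Mole-BERT features into the 13B LLM&rsquo;s embedding space), and the random weights are placeholders, not trained parameters:</p>

```python
import math
import random

def gelu(x):
    # Exact GELU via the Gaussian CDF: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    # weight has shape (out_dim, in_dim); x is a flat feature vector
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def project(features, w1, b1, w2, b2):
    """2-layer MLP projector: Linear -> GELU -> Linear."""
    hidden = [gelu(h) for h in linear(features, w1, b1)]
    return linear(hidden, w2, b2)

# Toy dimensions: a 4-d encoder feature mapped into an 8-d "LLM" space.
random.seed(0)
in_dim, hid_dim, out_dim = 4, 8, 8
w1 = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hid_dim)]
b1 = [0.0] * hid_dim
w2 = [[random.uniform(-0.1, 0.1) for _ in range(hid_dim)] for _ in range(out_dim)]
b2 = [0.0] * out_dim

token = project([0.5, -1.2, 0.3, 2.0], w1, b1, w2, b2)
assert len(token) == out_dim
```

<p>The projected vector is then treated as one (or more) soft input tokens for the unified decoder, alongside ordinary text embeddings.</p>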
<h3 id="models">Models</h3>
<p><strong>Base LLM</strong>:</p>
<ul>
<li><strong>ChemDFM (13B)</strong>: LLaMA-based model pre-trained on chemical text and SMILES</li>
</ul>
<p><strong>Modality Encoders</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Encoder</th>
          <th>Pre-training Data</th>
          <th>Parameter Count</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Graph</strong></td>
          <td>Mole-BERT</td>
          <td>2M molecules</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>3D Conformation</strong></td>
          <td>Uni-Mol</td>
          <td>209M conformations</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>Image</strong></td>
          <td>CLIP (ViT)</td>
          <td>General domain</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>MS2 Spectrum</strong></td>
          <td>Transformer (SeqT)</td>
          <td>Trained from scratch</td>
          <td>-</td>
          <td><strong>Trainable</strong></td>
      </tr>
      <tr>
          <td><strong>IR Spectrum</strong></td>
          <td>Transformer (SeqT)</td>
          <td>Trained from scratch</td>
          <td>-</td>
          <td><strong>Trainable</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Design Rationale</strong>: The MS2 and IR encoders are trained from scratch as sequence Transformers (SeqT) that treat spectral peaks as token sequences, since no suitable pre-trained encoders exist for chemical spectra.</p>
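<p>One plausible reading of &ldquo;peaks as token sequences&rdquo; is a simple quantization of (position, intensity) pairs into discrete tokens. The binning scheme below is an illustration under that assumption, not the paper&rsquo;s exact tokenizer:</p>

```python
def peaks_to_tokens(peaks, mz_bin=1.0, intensity_levels=10):
    """Quantize (m/z, intensity) peaks into a discrete token sequence
    for a sequence Transformer. Binning granularity is illustrative."""
    tokens = []
    for mz, inten in sorted(peaks):  # order peaks by position
        tokens.append(f"MZ_{int(mz // mz_bin)}")
        tokens.append(f"I_{min(intensity_levels - 1, int(inten * intensity_levels))}")
    return tokens

# Two MS2 peaks become a four-token sequence.
seq = peaks_to_tokens([(77.0, 0.95), (105.0, 0.40)])
assert seq == ["MZ_77", "I_9", "MZ_105", "I_4"]
```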
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Accuracy (Acc)</strong> for recognition tasks</li>
<li><strong>BLEU-2/4</strong> and <strong>METEOR</strong> for captioning tasks</li>
<li><strong>AUC-ROC</strong> for property prediction (classification)</li>
</ul>
<p><strong>Code Availability</strong>: The adapted code for evaluating on ChemLLMBench and their custom spectral recognition tasks is closed-source.</p>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>ChemLLMBench</strong>: Adapted for multimodal inputs across molecule captioning, property prediction, and reaction understanding</li>
<li><strong>MoleculeNet</strong>: Standard molecular property prediction tasks (BACE, BBBP, ClinTox, HIV, Tox21)</li>
<li><strong>USPTO</strong>: Reaction prediction and retrosynthesis tasks</li>
<li><strong>Custom Spectral Tasks</strong>: Novel evaluations requiring spectral interpretation</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Note</strong>: The type and quantity of GPUs used, along with the total training wall-time, were not published.</p>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Total Batch Size</strong>: 256</li>
<li><strong>Epochs</strong>: 3</li>
<li><strong>Optimizer</strong>: AdamW</li>
</ul>
<p><strong>Modality-Specific Learning Rates (Peak)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Learning Rate</th>
          <th>Feature Dimension</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph</td>
          <td>1e-5</td>
          <td>300</td>
      </tr>
      <tr>
          <td>Conformation</td>
          <td>2e-4</td>
          <td>512</td>
      </tr>
      <tr>
          <td>Image</td>
          <td>2e-3</td>
          <td>1024</td>
      </tr>
      <tr>
          <td>MS2 / IR</td>
          <td>2e-4</td>
          <td>768</td>
      </tr>
  </tbody>
</table>
<p><strong>Note</strong>: Different learning rates reflect the varying degrees of domain adaptation required. Images (general CLIP) need more adaptation than graphs (chemical Mole-BERT).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/OpenDFM/ChemDFM-X">ChemDFM-X (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Inference code only; training and data generation scripts are closed</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/OpenDFM/ChemDFM-X-v1.0-13B">ChemDFM-X-v1.0-13B (HuggingFace)</a></td>
          <td>Model</td>
          <td>AGPL-3.0</td>
          <td>13B parameter multimodal model weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhao, Z., Chen, B., Li, J., Chen, L., Wen, L., Wang, P., Zhu, Z., Zhang, D., Wan, Z., Li, Y., Dai, Z., Chen, X., &amp; Yu, K. (2024). ChemDFM-X: Towards Large Multimodal Model for Chemistry. <em>Science China Information Sciences</em>, 67(12), 220109. <a href="https://doi.org/10.1007/s11432-024-4243-0">https://doi.org/10.1007/s11432-024-4243-0</a></p>
<p><strong>Publication</strong>: Science China Information Sciences, December 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2409.13194">arXiv Version</a></li>
<li><a href="https://github.com/OpenDFM/ChemDFM-X">Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhaoChemDFMXLargeMultimodal2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemDFM-X}}: {{Towards Large Multimodal Model}} for {{Chemistry}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhao, Zihan and Chen, Bo and Li, Jingpiao and Chen, Lu and Wen, Liyang and Wang, Pengyu and Zhu, Zichen and Zhang, Danyang and Wan, Ziping and Li, Yansi and Dai, Zhongyang and Chen, Xin and Yu, Kai}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Science China Information Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{220109}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11432-024-4243-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2409.13194}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolSight: OCSR with RL and Multi-Granularity Learning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</guid><description>A three-stage OCSR framework using SMILES pretraining, auxiliary bond/coordinate tasks, and reinforcement learning to master stereochemistry recognition.</description><content:encoded><![CDATA[<h2 id="contribution-a-framework-for-optical-chemical-structure-recognition">Contribution: A Framework for Optical Chemical Structure Recognition</h2>
<p>This is primarily a <strong>Method</strong> paper. It proposes a novel three-stage training framework (Pretraining → Fine-tuning → RL Post-training) to improve Optical Chemical Structure Recognition (OCSR). Specifically, it introduces Group Relative Policy Optimization (GRPO) to directly optimize non-differentiable chemical-validity objectives.</p>
<p>It also has a <strong>Resource</strong> component, as the authors construct and release <em>Stereo-200k</em>, a dataset specifically designed to train models on challenging stereoisomeric molecules.</p>
<h2 id="motivation-resolving-stereochemical-cues">Motivation: Resolving Stereochemical Cues</h2>
<p>Existing OCSR systems struggle to accurately recognize stereochemical information (e.g., chirality, geometric isomerism) because the visual cues distinguishing stereoisomers (such as wedge and dash bonds) are subtle. Current methods often fail to capture the geometric relationships required to distinguish molecules with identical connectivity but different spatial arrangements. Accurate recognition is critical for downstream tasks like drug discovery where stereochemistry determines pharmacological effects.</p>
<h2 id="core-innovations-grpo-and-multi-granularity-learning">Core Innovations: GRPO and Multi-Granularity Learning</h2>
<p>MolSight introduces three key technical innovations:</p>
<ol>
<li><strong>Reinforcement Learning for OCSR</strong>: It is the first OCSR system to incorporate RL (specifically GRPO) to directly optimize for chemical semantic correctness.</li>
<li><strong>Multi-Granularity Learning</strong>: It employs auxiliary heads for chemical bond classification and atom localization. Unlike previous approaches that optimize these jointly, MolSight decouples the coordinate head to prevent interference with SMILES generation.</li>
<li><strong>SMILES-M Notation</strong>: A lightweight extension to SMILES to handle Markush structures (common in patents) without significant sequence length increase.</li>
</ol>
<h2 id="experimental-methodology">Experimental Methodology</h2>
<p>The authors evaluated MolSight using a rigorous mix of real and synthetic benchmarks:</p>
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec, Imago) and deep learning methods (MolScribe, MolGrapher, DECIMER).</li>
<li><strong>Benchmarks</strong>: Evaluated on real-world datasets (USPTO, Maybridge UoB, CLEF-2012, JPO) and synthetic datasets (Staker, ChemDraw, Indigo, Stereo-2K).</li>
<li><strong>Ablation Studies</strong>: Tested the impact of the bond head, coordinate head, and RL stages separately.</li>
<li><strong>Transfer Learning</strong>: Assessed the quality of learned representations by using the frozen encoder for molecular property prediction on MoleculeNet.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>SOTA Performance</strong>: MolSight achieved 85.1% stereochemical accuracy on the USPTO dataset, significantly outperforming the previous SOTA (MolScribe), which achieved 69.0%.</li>
<li><strong>RL Effectiveness</strong>: Reinforcement learning post-training specifically improved performance on stereoisomers, raising Tanimoto similarity and exact match rates on the Stereo-2k test set.</li>
<li><strong>Robustness</strong>: On perturbed USPTO images (random rotations and shearing), MolSight achieved 92.3% exact match accuracy (vs. the original 92.0%), while rule-based methods like OSRA dropped from 83.5% to 6.7%. On the low-resolution Staker dataset, MolSight reached 82.1% exact match.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training pipeline uses three distinct data sources:</p>
<ol>
<li><strong>Pre-training</strong>: <em>MolParser-7M</em>. Contains diverse images but requires the <strong>SMILES-M</strong> extension to handle Markush structures.</li>
<li><strong>Fine-tuning</strong>: <em>PubChem-1M</em> and <em>USPTO-680K</em>. Used for multi-granularity learning with bond and coordinate labels.</li>
<li><strong>RL Post-training</strong>: <em>Stereo-200k</em>. A self-collected dataset drawn from the first 2M compounds in PubChem, filtered for chirality (&lsquo;@&rsquo;) and cis-trans isomerism (&lsquo;/&rsquo;, &lsquo;\&rsquo;). It uses 5 different RDKit drawing styles to ensure robustness.</li>
</ol>
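<p>The stereochemistry filter amounts to scanning SMILES strings for the markers named above. A minimal sketch (the example molecules are illustrative; a production filter would also parse the strings, e.g., with RDKit, rather than matching raw characters):</p>

```python
def has_stereo_markers(smiles: str) -> bool:
    """Flag SMILES carrying chirality ('@') or cis/trans ('/', '\\') markers.

    Simplified sketch: raw character matching only, with no SMILES parsing.
    """
    return any(ch in smiles for ch in ("@", "/", "\\"))

pool = [
    "CCO",              # ethanol: no stereochemistry
    "C[C@H](N)C(=O)O",  # L-alanine: chiral center
    "C/C=C/C",          # trans-2-butene: cis/trans markers
]
stereo_subset = [s for s in pool if has_stereo_markers(s)]
assert stereo_subset == ["C[C@H](N)C(=O)O", "C/C=C/C"]
```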
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Reinforcement Learning</strong>: Uses <strong>GRPO (Group Relative Policy Optimization)</strong>.
<ul>
<li><strong>Reward Function</strong>: A linear combination of Tanimoto similarity and a graded stereochemistry reward.
$$ R = w_t \cdot r_{\text{tanimoto}} + w_s \cdot r_{\text{stereo}} $$
where $w_t=0.4$ and $w_s=0.6$. The stereochemistry reward $r_{\text{stereo}}$ is 1.0 for an InChIKey exact match, 0.3 if the atom count matches, and 0.1 otherwise.</li>
<li><strong>Sampling</strong>: Samples 4 completions per image with temperature 1.0 during RL training.</li>
</ul>
</li>
<li><strong>Auxiliary Tasks</strong>:
<ul>
<li><strong>Bond Classification</strong>: Concatenates hidden states of two atom queries to predict bond type via MLP.</li>
<li><strong>Atom Localization</strong>: Treated as a classification task (SimCC) but optimized using <strong>Maximum Likelihood Estimation (MLE)</strong> to account for uncertainty.</li>
</ul>
</li>
</ul>
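<p>A minimal sketch of the reward, assuming fingerprints are represented as bit-index sets and InChIKeys as plain strings (both simplifications; in practice these come from standard cheminformatics tooling such as RDKit):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def stereo_reward(pred_key: str, ref_key: str,
                  pred_atoms: int, ref_atoms: int) -> float:
    # Graded reward: InChIKey exact match > atom-count match > everything else.
    if pred_key == ref_key:
        return 1.0
    if pred_atoms == ref_atoms:
        return 0.3
    return 0.1

def grpo_reward(fp_pred, fp_ref, pred_key, ref_key, pred_atoms, ref_atoms,
                w_t=0.4, w_s=0.6) -> float:
    """R = w_t * r_tanimoto + w_s * r_stereo, with the paper's weights."""
    return (w_t * tanimoto(fp_pred, fp_ref)
            + w_s * stereo_reward(pred_key, ref_key, pred_atoms, ref_atoms))

# A fully correct prediction earns the maximum reward of 1.0.
assert grpo_reward({1, 2, 3}, {1, 2, 3}, "KEY", "KEY", 9, 9) == 1.0
# Same atom count but wrong stereochemistry: partial stereo credit (0.3).
r = grpo_reward({1, 2, 3}, {1, 2, 4}, "PRED", "REF", 9, 9)
assert abs(r - (0.4 * 0.5 + 0.6 * 0.3)) < 1e-9
```

<p>During GRPO training, this scalar is computed for each of the 4 sampled completions per image and normalized within the group to form the advantage.</p>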
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer. Input images are preprocessed to $512 \times 512$ resolution.
<ul>
<li><strong>Encoder</strong>: <strong>EfficientViT-L1</strong> (~53M params), chosen for linear attention efficiency.</li>
<li><strong>Decoder</strong>: 6-layer Transformer with <strong>RoPE</strong>, <strong>SwiGLU</strong>, and <strong>RMSNorm</strong>. Randomly initialized (no LLM weights) due to vocabulary mismatch.</li>
<li><strong>Coordinate Head</strong>: Separated from the main decoder. It adds 2 extra Transformer layers to process atom queries before prediction to improve accuracy.</li>
</ul>
</li>
<li><strong>Parameter Tuning</strong>:
<ul>
<li>Stage 3 (RL) uses <strong>LoRA</strong> (Rank=8, Alpha=16) to optimize the decoder.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Exact Match</strong>: Exact recognition accuracy for the full molecular structure.</li>
<li><strong>Tanimoto Coefficient</strong>: Fingerprint similarity for chemical semantics.</li>
<li><strong>OKS (Object Keypoint Similarity)</strong>: Used specifically for evaluating atom localization accuracy.</li>
</ul>
</li>
<li><strong>Perturbation</strong>: Robustness tested with random rotations [-5°, 5°] and xy-shearing [-0.1, 0.1].</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training and inference performed on a single node.</li>
<li><strong>Processors</strong>: Intel Xeon Silver 4210R CPU.</li>
<li><strong>Accelerators</strong>: 4x <strong>NVIDIA GeForce RTX 3090/4090</strong> GPUs.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Stage 1: Batch size 512, LR $4 \times 10^{-4}$.</li>
<li>Stage 2: Batch size 256, Bond head LR $4 \times 10^{-4}$, Coord head LR $4 \times 10^{-5}$.</li>
<li>Stage 3 (RL): Batch size 64, Base LR $1 \times 10^{-4}$.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/hustvl/MolSight">MolSight (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch implementation with training and inference code</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, W., Wang, X., Feng, B., &amp; Liu, W. (2025). MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning. In <em>Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026)</em>. <a href="https://doi.org/10.48550/arXiv.2511.17300">https://doi.org/10.48550/arXiv.2511.17300</a></p>
<p><strong>Publication</strong>: AAAI 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/hustvl/MolSight">Official Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2025molsight,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wenrui Zhang and Xinggang Wang and Bin Feng and Wenyu Liu}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2511.17300}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2511.17300}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolScribe: Robust Image-to-Graph Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</guid><description>Image-to-graph generation model for OCSR that predicts atoms, bonds, and coordinates jointly to better handle stereochemistry and abbreviations.</description><content:encoded><![CDATA[<h2 id="contribution-generative-image-to-graph-modelling">Contribution: Generative Image-to-Graph Modelling</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) with a secondary contribution to Resources ($\Psi_{\text{Resource}}$).</p>
<p>It proposes a novel architecture (image-to-graph generation) to solve the Optical Chemical Structure Recognition (OCSR) task, validating it through extensive ablation studies and comparisons against strong baselines like MolVec and DECIMER. It also contributes a new benchmark dataset of annotated images from ACS journals.</p>
<h2 id="motivation-limitations-in-existing-ocsr-pipelines">Motivation: Limitations in Existing OCSR Pipelines</h2>
<p>Translating molecular images into machine-readable graphs (OCSR) is challenging due to the high variance in drawing styles, stereochemistry conventions, and abbreviated structures found in literature.</p>
<p>Existing solutions face structural bottlenecks:</p>
<ul>
<li><strong>Rule-based systems</strong> (e.g., OSRA) rely on rigid heuristics that fail on diverse styles.</li>
<li><strong>Image-to-SMILES neural models</strong> treat the problem as captioning. They struggle with the geometric reasoning that chirality strictly requires, and because they omit explicit atom locations they cannot easily incorporate chemical constraints or verify correctness.</li>
</ul>
<h2 id="core-innovation-joint-graph-and-coordinate-prediction">Core Innovation: Joint Graph and Coordinate Prediction</h2>
<p>MolScribe introduces an <strong>Image-to-Graph</strong> generation paradigm that combines the flexibility of neural networks with the precision of symbolic constraints. It frames the task probabilistically as:</p>
<p>$$
P(G | I) = P(A | I) P(B | A, I)
$$</p>
<p>Where the model predicts a sequence of atoms $A$ given an image $I$, followed by the bonds $B$ given both the atoms and the image.</p>
<ol>
<li><strong>Explicit Graph Prediction</strong>: It predicts a sequence of atoms (with 2D coordinates) and then predicts bonds between them.</li>
<li><strong>Symbolic Constraints</strong>: It uses the predicted graph structure and coordinates to strictly determine chirality and cis/trans isomerism.</li>
<li><strong>Abbreviation Expansion</strong>: It employs a greedy algorithm to parse and expand &ldquo;superatoms&rdquo; (e.g., &ldquo;CO2Et&rdquo;) into their full atomic structure.</li>
<li><strong>Dynamic Augmentation</strong>: It introduces a data augmentation strategy that randomly substitutes functional groups with abbreviations and adds R-groups during training to improve generalization.</li>
</ol>
<h2 id="methodology-autoregressive-atoms-and-pairwise-bonds">Methodology: Autoregressive Atoms and Pairwise Bonds</h2>
<p>The authors evaluate MolScribe on synthetic and real-world datasets, focusing on <strong>Exact Match Accuracy</strong> of the canonical SMILES string. The model generates atom sequences autoregressively:</p>
<p>$$
P(A | I) = \prod_{i=1}^n P(a_i | A_{&lt;i}, I)
$$</p>
<p>To handle continuous spatial locations, atom coordinates are mapped to discrete bins (e.g., $\hat{x}_i = \lfloor \frac{x_i}{W} \times n_{\text{bins}} \rfloor$) and decoded alongside the element labels. Bonds are then predicted by a pairwise classifier over the hidden states of every atom pair:</p>
<p>$$
P(B | A, I) = \prod_{i=1}^n \prod_{j=1}^n P(b_{i,j} | A, I)
$$</p>
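<p>The coordinate discretization is a one-line quantization; a minimal sketch (the clamp to the last bin for boundary pixels is an assumption, not stated in the paper):</p>

```python
def bin_coordinate(x: float, width: float, n_bins: int = 64) -> int:
    """Map a continuous pixel coordinate to a discrete bin token index,
    following x_hat = floor(x / W * n_bins), clamped to the last bin."""
    return min(n_bins - 1, int(x / width * n_bins))

# With MolScribe's 384-px input and 64 bins, each bin covers 6 pixels.
assert bin_coordinate(0.0, 384) == 0
assert bin_coordinate(191.0, 384) == 31
assert bin_coordinate(383.9, 384) == 63
```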
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (MolVec, OSRA) and neural (Img2Mol, DECIMER, SwinOCSR) systems.</li>
<li><strong>Benchmarks</strong>:
<ul>
<li><strong>Synthetic</strong>: Indigo (in-domain) and ChemDraw (out-of-domain).</li>
<li><strong>Realistic</strong>: Five public benchmarks (CLEF, JPO, UOB, USPTO, Staker).</li>
<li><strong>New Dataset</strong>: 331 images from ACS Publications (journal articles).</li>
</ul>
</li>
<li><strong>Ablations</strong>: Tested performance without data augmentation, with continuous vs. discrete coordinates, and without non-atom tokens.</li>
<li><strong>Human Eval</strong>: Measured the time reduction for chemists using MolScribe to digitize molecules vs. drawing from scratch.</li>
</ul>
<h2 id="results-robust-exact-match-accuracy">Results: Robust Exact Match Accuracy</h2>
<ul>
<li><strong>Strong Performance</strong>: MolScribe achieved <strong>76-93% accuracy</strong> across public benchmarks, outperforming baselines on most datasets. On the ACS dataset of journal article images, MolScribe achieved 71.9% compared to the next best 55.3% (OSRA). On the large Staker patent dataset, MolScribe achieved 86.9%, surpassing MSE-DUDL (77.0%) while using far less training data (1.68M vs. 68M examples).</li>
<li><strong>Chirality Verification</strong>: Explicit geometric reasoning allowed MolScribe to predict chiral molecules significantly better than image-to-SMILES baselines. When chirality is ignored, the performance gap narrows (e.g., on Indigo, baseline accuracy rises from 94.1% to 96.3%), isolating MolScribe&rsquo;s primary advantage to geometric reasoning for stereochemistry.</li>
<li><strong>Hand-Drawn Generalization</strong>: The model achieved <strong>11.2% exact match accuracy</strong> on the DECIMER-HDM dataset despite having no hand-drawn images in its training set, with many errors limited to a few atom-level mismatches.</li>
<li><strong>Robustness</strong>: The model maintained high performance on perturbed images (rotation/shear), whereas rule-based systems degraded severely.</li>
<li><strong>Usability</strong>: The atom-level alignment allows for confidence visualization, and human evaluation showed it reduced digitization time from <strong>137s to 20s</strong> per molecule.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a mix of synthetic and patent data with extensive dynamic augmentation:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem (Synthetic)</strong></td>
          <td>1M</td>
          <td>Molecules randomly sampled from PubChem and rendered via Indigo toolkit; includes atom coords.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO (Patents)</strong></td>
          <td>680K</td>
          <td>Patent data lacks exact atom coordinates; relative coordinates normalized from MOLfiles to image dimensions (often introduces coordinate shifts).</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecule Augmentation</strong>:</p>
<ul>
<li><strong>Functional Groups</strong>: Randomly substituted using 53 common substitution rules (e.g., replacing substructures with &ldquo;Et&rdquo; or &ldquo;Ph&rdquo;).</li>
<li><strong>R-Groups</strong>: Randomly added using vocabulary: <code>[R, R1...R12, Ra, Rb, Rc, Rd, X, Y, Z, A, Ar]</code>.</li>
<li><strong>Styles</strong>: Random variation of aromaticity (circle vs. bonds) and explicit hydrogens.</li>
</ul>
<p><strong>Image Augmentation</strong>:</p>
<ul>
<li><strong>Rendering</strong>: Randomized font (Arial, Times, Courier, Helvetica), line width, and label modes during synthetic generation.</li>
<li><strong>Perturbations</strong>: Applied rotation ($\pm 90^{\circ}$), cropping ($1\%$), padding ($40\%$), downscaling, blurring, and Salt-and-Pepper/Gaussian noise.</li>
</ul>
<p><strong>Preprocessing</strong>: Input images are resized to $384 \times 384$.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Atom Prediction (Pix2Seq-style)</strong>:
<ul>
<li>The model generates a sequence of tokens: $S^A = [l_1, \hat{x}_1, \hat{y}_1, \dots, l_n, \hat{x}_n, \hat{y}_n]$.</li>
<li><strong>Discretization</strong>: Coordinates are binned into integer tokens ($n_{bins} = 64$).</li>
<li><strong>Tokenizer</strong>: Atom-wise tokenizer splits SMILES into atoms; non-atom tokens (parentheses, digits) are kept to help structure learning.</li>
</ul>
</li>
<li><strong>Bond Prediction</strong>:
<ul>
<li>Format: Pairwise classification for every pair of predicted atoms.</li>
<li>Symmetry: For symmetric bonds (single/double), the probability is averaged as:
$$
\hat{P}(b_{i,j} = t) = \frac{1}{2} \big( P(b_{i,j} = t) + P(b_{j,i} = t) \big)
$$
For wedge bonds, which are directional, the directed prediction is used instead of averaging.</li>
</ul>
</li>
<li><strong>Abbreviation Expansion (Algorithm 1)</strong>:
<ul>
<li>A greedy algorithm connects atoms within an expanded abbreviation (e.g., &ldquo;COOH&rdquo;) until valences are full, avoiding the need for a fixed dictionary.</li>
<li><strong>Carbon Chains</strong>: Splits condensed chains like $C_aX_b$ into explicit sequences ($CX_q \dots CX_{q+r}$).</li>
<li><strong>Nested Formulas</strong>: Recursively parses nested structures like $N(CH_3)_2$ by treating them as superatoms attached to the current backbone.</li>
<li><strong>Valence Handling</strong>: Iterates through common valences first to resolve ambiguities.</li>
</ul>
</li>
</ul>
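<p>The symmetric averaging rule can be sketched as follows; the bond-type names and dictionary representation are illustrative simplifications, not MolScribe&rsquo;s actual tensor layout:</p>

```python
def symmetrize_bond_probs(p_ij: dict, p_ji: dict,
                          symmetric_types=("none", "single", "double",
                                           "triple", "aromatic")):
    """Average the two directed distributions for order-independent bond
    types; directional types (wedge/dash) keep their directed probability.
    Returns the highest-probability bond type after merging."""
    merged = {}
    for t in set(p_ij) | set(p_ji):
        if t in symmetric_types:
            merged[t] = 0.5 * (p_ij.get(t, 0.0) + p_ji.get(t, 0.0))
        else:
            merged[t] = p_ij.get(t, 0.0)
    return max(merged, key=merged.get)

# The two directed predictions disagree on confidence but agree on "single".
p_ij = {"single": 0.6, "double": 0.3, "wedge": 0.1}
p_ji = {"single": 0.8, "double": 0.1, "wedge": 0.1}
assert symmetrize_bond_probs(p_ij, p_ji) == "single"
```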
<h3 id="models">Models</h3>
<p>The architecture is an encoder-decoder with a classification head:</p>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer (Swin-B)</strong>, pre-trained on ImageNet-22K (88M params).</li>
<li><strong>Decoder</strong>: 6-layer Transformer, 8 heads, hidden dimension 256.</li>
<li><strong>Bond Predictor</strong>: 2-layer MLP (Feedforward) with ReLU, taking concatenated atom hidden states as input.</li>
<li><strong>Training</strong>: Teacher forcing, Cross-Entropy Loss, Batch size 128, 30 epochs.</li>
</ul>
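<p>The bond predictor above amounts to a plain feedforward pass over an atom pair. The sketch below uses NumPy with illustrative dimensions and random weights; the paper specifies only a 2-layer ReLU MLP over concatenated atom hidden states, so everything else here is an assumption.</p>

```python
import numpy as np

def bond_predictor(h_i, h_j, W1, b1, W2, b2):
    """2-layer MLP with ReLU over the concatenated hidden states of an atom pair."""
    x = np.concatenate([h_i, h_j])          # (2 * d_model,)
    hidden = np.maximum(0.0, W1 @ x + b1)   # ReLU
    logits = W2 @ hidden + b2               # one logit per bond class
    return logits

rng = np.random.default_rng(0)
d_model, d_hidden, n_bond_classes = 256, 256, 5  # illustrative sizes
W1 = rng.standard_normal((d_hidden, 2 * d_model)) * 0.01
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((n_bond_classes, d_hidden)) * 0.01
b2 = np.zeros(n_bond_classes)

logits = bond_predictor(rng.standard_normal(d_model),
                        rng.standard_normal(d_model), W1, b1, W2, b2)
```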
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Exact Match of Canonical SMILES.</p>
<ul>
<li>Stereochemistry: Must match tetrahedral chirality; cis-trans ignored.</li>
<li>R-groups: Replaced with wildcards <code>*</code> or <code>[d*]</code> for evaluation.</li>
</ul>
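<p>The exact-match metric reduces to string equality over canonical SMILES. In practice canonicalization is done with a toolkit such as RDKit; the sketch below assumes both lists already hold canonicalized strings, and the function name is illustrative.</p>

```python
def exact_match_accuracy(pred_smiles, gt_smiles):
    """Fraction of predictions whose canonical SMILES exactly equals the ground truth.
    Assumes both lists already hold canonicalized strings (e.g., via RDKit)."""
    assert len(pred_smiles) == len(gt_smiles)
    hits = sum(p == g for p, g in zip(pred_smiles, gt_smiles))
    return hits / len(gt_smiles)
```

<p>Note that two chemically identical molecules written in different (non-canonical) forms would not match, which is exactly why canonicalization must happen first.</p>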
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training performed on a Linux server with <strong>96 CPUs</strong> and <strong>500 GB RAM</strong>.</li>
<li><strong>GPUs</strong>: <strong>4x NVIDIA A100 GPUs</strong>.</li>
<li><strong>Training Time</strong>: Unspecified; comparative models on large datasets took &ldquo;more than one day&rdquo;.</li>
<li><strong>Inference</strong>: Requires autoregressive decoding for atoms, followed by a single forward pass for bonds.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/thomas0809/MolScribe">MolScribe (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training, inference, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/yujieq/MolScribe">MolScribe (Hugging Face)</a></td>
          <td>Demo</td>
          <td>MIT</td>
          <td>Interactive web demo for molecular image recognition</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Scoped to single-molecule images only; does not handle multi-molecule diagrams or reaction schemes.</li>
<li>Hand-drawn molecule recognition remains weak (the model was not trained on hand-drawn data).</li>
<li>Complex Markush structures (positional variation, frequency variation) are not supported, as these cannot be represented in SMILES or MOLfiles.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, Y., Guo, J., Tu, Z., Li, Z., Coley, C. W., &amp; Barzilay, R. (2023). MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation. <em>Journal of Chemical Information and Modeling</em>, 63(7), 1925-1934. <a href="https://doi.org/10.1021/acs.jcim.2c01480">https://doi.org/10.1021/acs.jcim.2c01480</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/spaces/yujieq/MolScribe">Hugging Face Space</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qianMolScribeRobustMolecular2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolScribe}}: {{Robust Molecular Structure Recognition}} with {{Image-To-Graph Generation}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolScribe}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Qian, Yujie and Guo, Jiang and Tu, Zhengkai and Li, Zhening and Coley, Connor W. and Barzilay, Regina}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1925--1934}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.2c01480}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolMole: Unified Vision Pipeline for Molecule Mining</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</guid><description>A vision-based deep learning framework that unifies molecule detection, reaction parsing, and OCSR for page-level chemical data extraction.</description><content:encoded><![CDATA[<h2 id="molmoles-dual-contribution-unified-ocsr-method-and-page-level-benchmarks">MolMole&rsquo;s Dual Contribution: Unified OCSR Method and Page-Level Benchmarks</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong <strong>Resource</strong> contribution.</p>
<p>It functions as a <strong>Method</strong> paper because it introduces &ldquo;MolMole,&rdquo; a unified deep learning framework that integrates molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline. It validates this method through extensive comparisons against state-of-the-art baselines like DECIMER and OpenChemIE.</p>
<p>It also serves as a <strong>Resource</strong> paper because the authors construct and release a novel page-level benchmark dataset of 550 annotated pages (patents and articles) to address the lack of standardized evaluation metrics for full-page chemical extraction.</p>
<h2 id="addressing-the-limitations-of-fragmented-processing">Addressing the Limitations of Fragmented Processing</h2>
<p>The rapid accumulation of chemical literature has trapped valuable molecular and reaction data in unstructured formats like images and PDFs. Extracting this manually is time-consuming, while existing AI frameworks have significant limitations:</p>
<ul>
<li><strong>DECIMER</strong>: Lacks the ability to process reaction diagrams entirely.</li>
<li><strong>OpenChemIE</strong>: Relies on external layout parser models to crop elements before processing. This dependence often leads to detection failures in documents with complex layouts.</li>
<li><strong>Generative Hallucination</strong>: Existing generative OCSR models (like MolScribe) are prone to &ldquo;hallucinating&rdquo; structures or failing on complex notations like polymers.</li>
</ul>
<h2 id="a-unified-vision-pipeline-for-layout-aware-detection">A Unified Vision Pipeline for Layout-Aware Detection</h2>
<p>MolMole introduces several architectural and workflow innovations:</p>
<ul>
<li><strong>Direct Page-Level Processing</strong>: Unlike OpenChemIE, MolMole processes full document pages directly without requiring an external layout parser, which improves robustness on complex layouts like two-column patents.</li>
<li><strong>Unified Vision Pipeline</strong>: It integrates three specialized vision models into one workflow:
<ul>
<li><strong>ViDetect</strong>: A DINO-based object detector for identifying molecular regions.</li>
<li><strong>ViReact</strong>: An RxnScribe-based model adapted for full-page reaction parsing.</li>
<li><strong>ViMore</strong>: A detection-based OCSR model that explicitly predicts atoms and bonds.</li>
</ul>
</li>
<li><strong>Hallucination Mitigation</strong>: By using a detection-based approach (ViMore), the model avoids hallucinating chemical structures and provides confidence scores.</li>
<li><strong>Advanced Notation Support</strong>: The system explicitly handles &ldquo;wavy bonds&rdquo; (variable attachments in patents) and polymer bracket notations, which confuse standard SMILES-based models.</li>
</ul>
<h2 id="page-level-benchmark-evaluation-and-unified-metrics">Page-Level Benchmark Evaluation and Unified Metrics</h2>
<p>The authors evaluated the framework on both a newly curated benchmark and existing public datasets:</p>
<ul>
<li><strong>New Benchmark Creation</strong>: They curated 550 pages (300 patents, 250 articles) fully annotated with bounding boxes, reaction roles (reactant, product, condition), and MOLfiles.</li>
<li><strong>Baselines</strong>: MolMole was compared against <strong>DECIMER 2.0</strong>, <strong>OpenChemIE</strong>, and <strong>ReactionDataExtractor 2.0</strong>.</li>
<li><strong>OCSR Benchmarking</strong>: ViMore was evaluated against DECIMER, MolScribe, and MolGrapher on four public datasets: <strong>USPTO</strong>, <strong>UOB</strong>, <strong>CLEF</strong>, and <strong>JPO</strong>.</li>
<li><strong>Metric Proposal</strong>: They introduced a combined &ldquo;End-to-End&rdquo; metric that modifies standard object detection Precision/Recall to strictly require correct SMILES conversion for a &ldquo;True Positive&rdquo;.</li>
</ul>
<p>$$ \text{True Positive (End-to-End)} = ( \text{IoU} \geq 0.5 ) \land ( \text{SMILES}_{\text{gt}} == \text{SMILES}_{\text{pred}} ) $$</p>
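<p>A minimal sketch of this combined criterion, assuming axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ form and pre-canonicalized SMILES strings (both function names are illustrative):</p>

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def end_to_end_true_positive(pred_box, gt_box, pred_smiles, gt_smiles, iou_thresh=0.5):
    """True positive only if the box overlaps (IoU >= threshold) AND the SMILES match."""
    return iou(pred_box, gt_box) >= iou_thresh and pred_smiles == gt_smiles
```

<p>The conjunction is the point: a perfectly localized molecule with a wrong SMILES counts as a failure, as does a correct SMILES attached to a poorly localized box.</p>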
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Page-Level Performance</strong>: On the new benchmark, MolMole achieved F1 scores of <strong>89.1%</strong> (Patents) and <strong>86.8%</strong> (Articles) for the combined detection-to-conversion task, compared to 73.8% and 67.3% for DECIMER and 68.8% and 70.6% for OpenChemIE (Table 4).</li>
<li><strong>Reaction Parsing</strong>: ViReact achieved soft-match F1 scores of <strong>98.0%</strong> on patents and <strong>97.0%</strong> on articles, compared to 82.2% and 82.9% for the next best model, RxnScribe (w/o LP). Hard-match F1 scores were 92.5% (patents) and 84.6% (articles).</li>
<li><strong>Public Benchmarks</strong>: ViMore outperformed competitors on 3 out of 4 public OCSR datasets (CLEF, JPO, USPTO).</li>
<li><strong>Layout Handling</strong>: The authors demonstrated that MolMole successfully handles multi-column reaction diagrams where cropping-based models fail and faithfully preserves layout geometry in generated MOLfiles.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://lgai-ddu.github.io/molmole/">MolMole Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo and project information</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training Data</strong>: The models (ViDetect and ViMore) were trained on <strong>private/proprietary datasets</strong>, which is a limitation for full reproducibility from scratch.</li>
<li><strong>Benchmark Data</strong>: The authors introduce a test set of <strong>550 pages</strong> (3,897 molecules, 1,022 reactions) derived from patents and scientific articles. This dataset is stated to be made &ldquo;publicly available&rdquo;.</li>
<li><strong>Public Evaluation Data</strong>: Standard OCSR datasets used include USPTO (5,719 images), UOB (5,740 images), CLEF (992 images), and JPO (450 images).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pipeline Workflow</strong>: PDF → PNG Images → Parallel execution of <strong>ViDetect</strong> and <strong>ViReact</strong> → Cropping of molecular regions → <strong>ViMore</strong> conversion → Output (JSON/Excel).</li>
<li><strong>Post-Processing</strong>:
<ul>
<li><em>ViDetect</em>: Removes overlapping proposals based on confidence scores and size constraints.</li>
<li><em>ViReact</em>: Refines predictions by correcting duplicates and removing empty entities.</li>
<li><em>ViMore</em>: Assembles detected atom/bond information into structured representations (MOLfile).</li>
</ul>
</li>
</ul>
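<p>The pipeline workflow above can be sketched as a simple orchestration function. The three stage callables are hypothetical stand-ins for the actual ViDetect, ViReact, and ViMore models (which are not publicly released), and <code>crop</code> is a toy helper over a nested-list image.</p>

```python
def run_molmole_page(page_image, vi_detect, vi_react, vi_more):
    """Sketch of the MolMole page-level workflow: detect molecules and parse
    reactions on the full page, then run OCSR on each cropped molecular region."""
    mol_boxes = vi_detect(page_image)              # molecule bounding boxes
    reactions = vi_react(page_image)               # full-page reaction parsing
    crops = [crop(page_image, box) for box in mol_boxes]
    molecules = [vi_more(c) for c in crops]        # OCSR per cropped region
    return {"molecules": molecules, "reactions": reactions}

def crop(image, box):
    """Toy crop: slice a nested-list 'image' by (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]
```

<p>The key design point mirrored here is that detection and reaction parsing both see the full page, so no external layout parser is needed before cropping.</p>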
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture Basis</th>
          <th>Task</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ViDetect</strong></td>
          <td>DINO (DETR-based)</td>
          <td>Molecule Detection</td>
          <td>End-to-end training; avoids slow autoregressive methods.</td>
      </tr>
      <tr>
          <td><strong>ViReact</strong></td>
          <td>RxnScribe</td>
          <td>Reaction Parsing</td>
          <td>Operates on full pages; autoregressive decoder for structured sequence generation.</td>
      </tr>
      <tr>
          <td><strong>ViMore</strong></td>
          <td>Custom Vision Model</td>
          <td>OCSR</td>
          <td>Detection-based (predicts atom/bond regions).</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Molecule Detection</strong>: Evaluated using COCO metrics (AP, AR, F1) at IoU thresholds 0.50-0.95.</li>
<li><strong>Molecule Conversion</strong>: Evaluated using SMILES exact match accuracy and Tanimoto similarity.</li>
<li><strong>Combined Metric</strong>: A custom metric where a True Positive requires both IoU $\geq 0.5$ and an exact SMILES match ($\text{SMILES}_{\text{gt}} = \text{SMILES}_{\text{pred}}$).</li>
<li><strong>Reaction Parsing</strong>: Evaluated using <strong>Hard Match</strong> (all components correct) and <strong>Soft Match</strong> (molecular entities only, ignoring text labels).</li>
</ul>
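<p>Tanimoto similarity on bit fingerprints is just intersection over union of the on bits. The actual evaluation uses Morgan fingerprints computed with a toolkit like RDKit; here, purely for illustration, a fingerprint is represented as a set of on-bit indices.</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```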
<h3 id="missing-components">Missing Components</h3>
<ul>
<li><strong>Source code</strong>: Not publicly released. The paper states the toolkit &ldquo;will be accessible soon through an interactive demo on the LG AI Research website.&rdquo; For commercial use, the authors direct inquiries to <a href="mailto:ddu@lgresearch.ai">ddu@lgresearch.ai</a>.</li>
<li><strong>Training data</strong>: ViDetect and ViMore are trained on proprietary datasets. Training code and data are not available.</li>
<li><strong>Hardware requirements</strong>: Not specified in the paper.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chun, S., Kim, J., Jo, A., Jo, Y., Oh, S., et al. (2025). MolMole: Molecule Mining from Scientific Literature. <em>arXiv preprint arXiv:2505.03777</em>. <a href="https://doi.org/10.48550/arXiv.2505.03777">https://doi.org/10.48550/arXiv.2505.03777</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://lgai-ddu.github.io/molmole/">Project Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chun2025molmole,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolMole: Molecule Mining from Scientific Literature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chun, Sehyun and Kim, Jiye and Jo, Ahra and Jo, Yeonsik and Oh, Seungyul and Lee, Seungjun and Ryoo, Kwangrok and Lee, Jongmin and Kim, Seung Hwan and Kang, Byung Jun and Lee, Soonyoung and Park, Jun Ha and Moon, Chanwoo and Ham, Jiwon and Lee, Haein and Han, Heejae and Byun, Jaeseung and Do, Soojong and Ha, Minju and Kim, Dongyun and Bae, Kyunghoon and Lim, Woohyung and Lee, Edward Hwayoung and Park, Yongmin and Yu, Jeongsang and Jo, Gerrard Jeongwon and Hong, Yeonjung and Yoo, Kyungjae and Han, Sehui and Lee, Jaewan and Park, Changyoung and Jeon, Kijeong and Yi, Sihyuk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2505.03777}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGrapher: Graph-based Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</guid><description>A graph-based deep learning approach for optical chemical structure recognition that outperforms image captioning methods.</description><content:encoded><![CDATA[<h2 id="1-contribution--type">1. Contribution / Type</h2>
<p>This is primarily a <strong>Methodological</strong> paper that proposes a novel neural architecture (MolGrapher), shifting the paradigm of Optical Chemical Structure Recognition (OCSR) from image captioning back to graph reconstruction. It also has a significant <strong>Resource</strong> component, releasing a synthetic data generation pipeline and a new large-scale benchmark (USPTO-30K) to address the scarcity of annotated real-world data.</p>
<h2 id="2-motivation">2. Motivation</h2>
<p>The automatic analysis of chemical literature is critical for accelerating drug and material discovery, but much of this information is locked in 2D images of molecular structures.</p>
<ul>
<li><strong>Problem:</strong> Existing rule-based methods are rigid, while recent deep learning methods based on &ldquo;image captioning&rdquo; (predicting <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings) struggle with complex molecules and fail to exploit the natural graph structure of molecules.</li>
<li><strong>Gap:</strong> There is a lack of diverse, annotated real-world training data, and captioning models suffer from &ldquo;hallucinations&rdquo; where they predict valid SMILES that do not match the image.</li>
</ul>
<h2 id="3-novelty--core-innovation">3. Novelty / Core Innovation</h2>
<p>MolGrapher introduces a <strong>graph-based deep learning pipeline</strong> that explicitly models the molecule&rsquo;s geometry and topology.</p>
<ul>
<li><strong>Supergraph Concept:</strong> It first detects all atom keypoints and builds a &ldquo;supergraph&rdquo; of all plausible bonds.</li>
<li><strong>Hybrid Approach:</strong> It combines a ResNet-based keypoint detector with a Graph Neural Network (GNN) that classifies both atom nodes and bond nodes within the supergraph context. Both atoms and bonds are represented as nodes, with edges only connecting atom nodes to bond nodes.</li>
<li><strong>Synthetic Pipeline:</strong> A data generation pipeline that renders molecules with varying styles (fonts, bond widths) and augmentations (pepper patches, random lines, captions) to simulate real document noise.</li>
</ul>
<p>At the core of the Keypoint Detector&rsquo;s performance is the <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss. Since pixels without an atom drastically outnumber pixels containing an atom, WAHR loss is designed to counter the class imbalance. For ground-truth heatmap $y$ and prediction $p$:</p>
<p>$$ L_{\text{WAHR}}(p, y) = \sum_i \alpha_{y_i} (p_i - y_i)^2 $$</p>
<p>where the per-pixel weight $\alpha_{y_i}$ dynamically down-weights easily classified background pixels.</p>
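<p>A simplified stand-in for this idea, assuming heatmap values in $[0, 1]$: the weight grows where the target (or prediction) is large, so the abundant easy-background pixels contribute little. The exact weighting follows the WAHR formulation; this weighting scheme and the small floor term are illustrative choices, not the paper's.</p>

```python
import numpy as np

def wahr_like_loss(pred, target, gamma=2.0):
    """Illustrative weight-adaptive heatmap regression loss (simplified WAHR stand-in).
    Pixels near atom centers (or confidently mispredicted ones) are up-weighted;
    the small floor keeps background gradients from vanishing entirely."""
    alpha = np.maximum(pred, target) ** gamma + 1e-3
    return float(np.sum(alpha * (pred - target) ** 2))
```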
<h2 id="4-methodology--experiments">4. Methodology &amp; Experiments</h2>
<p>The authors evaluated MolGrapher against both rule-based (OSRA, MolVec) and deep learning baselines (DECIMER, Img2Mol, Image2Graph).</p>
<ul>
<li><strong>Benchmarks:</strong> Evaluated on standard datasets: USPTO, Maybridge UoB, CLEF-2012, and JPO.</li>
<li><strong>New Benchmark:</strong> Introduced and tested on <strong>USPTO-30K</strong>, split into clean, abbreviated, and large molecule subsets.</li>
<li><strong>Ablations:</strong> Analyzed the impact of synthetic augmentations, keypoint loss functions, supergraph connectivity radius, and GNN layers.</li>
<li><strong>Robustness:</strong> Tested on perturbed images (rotations, shearing) to mimic scanned patent quality.</li>
</ul>
<p>The GNN iteratively updates node embeddings through layers $\{g^k\}_{k \in [1, N]}$, where $e^{k+1} = g^k(e^k)$. Final predictions are obtained via two MLPs (one for atoms, one for bonds): $p_i = \mathrm{MLP}_t(e_i^N)$, where $p_i \in \mathbb{R}^{C_t}$ contains the logits for atom or bond classes.</p>
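<p>One such layer can be sketched as a single message-passing step over the atom&ndash;bond supergraph. The mean aggregation, ReLU, and weight shapes below are illustrative assumptions; the paper's custom GNN is only described at the level of the update $e^{k+1} = g^k(e^k)$.</p>

```python
import numpy as np

def gnn_layer(embeddings, adjacency, W_self, W_neigh):
    """One illustrative message-passing step: each node mixes its own embedding
    with the mean of its neighbors' embeddings, followed by ReLU."""
    n = len(embeddings)
    out = np.zeros_like(embeddings)
    for i in range(n):
        neigh = [embeddings[j] for j in range(n) if adjacency[i][j]]
        agg = np.mean(neigh, axis=0) if neigh else np.zeros_like(embeddings[i])
        out[i] = np.maximum(0.0, W_self @ embeddings[i] + W_neigh @ agg)
    return out
```

<p>In the supergraph, edges only connect atom nodes to bond nodes, so each message step lets bonds see their endpoint atoms and vice versa.</p>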
<h2 id="5-results--conclusions">5. Results &amp; Conclusions</h2>
<p>MolGrapher achieved the highest accuracy among synthetic-only deep learning methods on most benchmarks tested.</p>
<ul>
<li><strong>Accuracy:</strong> It achieved <strong>91.5%</strong> accuracy on USPTO, outperforming all other synthetic-only deep learning methods including ChemGrapher (80.9%), Graph Generation (67.0%), and DECIMER 2.0 (61.0%).</li>
<li><strong>Large Molecules:</strong> It demonstrated superior scaling, correctly recognizing large molecules (USPTO-10K-L) where image captioning methods like Img2Mol failed completely (0.0% accuracy).</li>
<li><strong>Generalization:</strong> The method proved robust to image perturbations and style variations without requiring fine-tuning on real data. The paper acknowledges that MolGrapher cannot recognize Markush structures (depictions of sets of molecules with positional and frequency variation indicators).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model relies on synthetic data for training due to the scarcity of annotated real-world images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic Data</td>
          <td>300,000 images</td>
          <td>Generated from PubChem SMILES using RDKit. Augmentations include pepper patches, random lines, and variable bond styles.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>USPTO-30K</td>
          <td>30,000 images</td>
          <td>Created by authors from USPTO patents (2001-2020). Subsets: 10K clean, 10K abbreviated, 10K large (&gt;70 atoms).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Standard Benchmarks</td>
          <td>Various</td>
          <td>USPTO (5,719), Maybridge UoB (5,740), CLEF-2012 (992), JPO (450).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of three distinct algorithmic stages:</p>
<ol>
<li>
<p><strong>Keypoint Detection</strong>:</p>
<ul>
<li>Predicts a heatmap of atom locations using a CNN.</li>
<li>Thresholds heatmaps at the bottom 10th percentile and uses a $5\times5$ window for local maxima.</li>
<li>Uses <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss to handle class imbalance (background vs. atoms).</li>
</ul>
</li>
<li>
<p><strong>Supergraph Construction</strong>:</p>
<ul>
<li>Connects every detected keypoint to neighbors within a radius of $3 \times$ the estimated bond length.</li>
<li>Prunes edges with no filled pixels or if obstructed by a third keypoint.</li>
<li>Keeps a maximum of 6 bond candidates per atom.</li>
</ul>
</li>
<li>
<p><strong>Superatom Recognition</strong>:</p>
<ul>
<li>Detects &ldquo;superatom&rdquo; nodes (abbreviations like <code>COOH</code>).</li>
<li>Uses <strong>PP-OCR</strong> to transcribe the text at these node locations.</li>
</ul>
</li>
</ol>
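<p>The supergraph construction step (stage 2) can be sketched as a radius query with a per-atom candidate cap. The pixel-based pruning (empty or obstructed edges) is omitted for brevity, and the function and variable names are invented for illustration; only the $3\times$ bond-length radius and the cap of 6 candidates come from the paper.</p>

```python
import math

def build_supergraph(keypoints, bond_length, radius_factor=3.0, max_candidates=6):
    """Connect each detected atom keypoint to nearby keypoints, keeping at most
    `max_candidates` bond candidates per atom (nearest first)."""
    radius = radius_factor * bond_length
    edges = {}
    for i, p in enumerate(keypoints):
        cands = sorted(
            (j for j, q in enumerate(keypoints)
             if j != i and math.dist(p, q) <= radius),
            key=lambda j: math.dist(p, keypoints[j]),
        )
        edges[i] = cands[:max_candidates]
    return edges

kps = [(0, 0), (1, 0), (2, 0), (10, 10)]  # three chained atoms plus an outlier
graph = build_supergraph(kps, bond_length=1.0)
```

<p>The GNN then classifies every candidate edge, including into a &ldquo;No Bond&rdquo; class, so over-connecting here is harmless while under-connecting is not.</p>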
<h3 id="models">Models</h3>
<p>The architecture utilizes standard backbones tailored for specific sub-tasks:</p>
<ul>
<li><strong>Keypoint Detector</strong>: <strong>ResNet-18</strong> backbone with $8\times$ dilation to preserve spatial resolution.</li>
<li><strong>Node Classifier</strong>: <strong>ResNet-50</strong> backbone with $2\times$ dilation for extracting visual features at node locations.</li>
<li><strong>Graph Neural Network</strong>: A custom GNN that updates node embeddings based on visual features and neighborhood context. The initial node embedding combines the visual feature vector $v_i$ and a learnable type encoding $w_{t_i}$.</li>
<li><strong>Readout</strong>: MLPs classify nodes into atom types (e.g., C, O, N) and bond types (No Bond, Single, Double, Triple).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Accuracy is defined strictly: the predicted molecule must have an identical <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> string to the ground truth. Stereochemistry and Markush structures are excluded from evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>MolGrapher Score</th>
          <th>Best DL Baseline (Synthetic)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>USPTO</td>
          <td><strong>91.5%</strong></td>
          <td>80.9% (ChemGrapher)</td>
          <td>Full USPTO benchmark</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>USPTO-10K-L</td>
          <td><strong>31.4%</strong></td>
          <td>0.0% (Img2Mol)</td>
          <td>Large molecules (&gt;70 atoms)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>JPO</td>
          <td><strong>67.5%</strong></td>
          <td>64.0% (DECIMER 2.0)</td>
          <td>Challenging, low-quality images</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: Trained on 3 NVIDIA A100 GPUs.</li>
<li><strong>Training Time</strong>: 20 epochs.</li>
<li><strong>Optimization</strong>: ADAM optimizer, learning rate 0.0001, decayed by 0.8 after 5000 iterations.</li>
<li><strong>Loss Weighting</strong>: Atom classifier loss weighted by 1; bond classifier loss weighted by 3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MolGrapher">DS4SD/MolGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training and inference scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Title</strong>: MolGrapher: Graph-based Visual Recognition of Chemical Structures</p>
<p><strong>Authors</strong>: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valéry Weber, Ingmar Meijer, Peter Staar, Fisher Yu</p>
<p><strong>Citation</strong>: Morin, L., Danelljan, M., Agea, M. I., Nassar, A., Weber, V., Meijer, I., Staar, P., &amp; Yu, F. (2023). MolGrapher: Graph-based Visual Recognition of Chemical Structures. <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, 19552-19561.</p>
<p><strong>Publication</strong>: ICCV 2023</p>
<p><strong>Links</strong>:</p>
<ul>
<li><a href="https://openaccess.thecvf.com/content/ICCV2023/html/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.html">Paper</a></li>
<li><a href="https://github.com/DS4SD/MolGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMolGrapherGraphbasedVisual2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolGrapher}}: {{Graph-based Visual Recognition}} of {{Chemical Structures}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolGrapher}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{IEEE}}/{{CVF International Conference}} on {{Computer Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valéry and Meijer, Ingmar and Staar, Peter and Yu, Fisher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{19552--19561}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICCV51070.2023.01791}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-18}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MMSSC-Net: Multi-Stage Sequence Cognitive Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</guid><description>A deep learning model for Optical Chemical Structure Recognition (OCSR) using SwinV2 and GPT-2 to convert molecular images to SMILES.</description><content:encoded><![CDATA[<h2 id="contribution-a-multi-stage-architectural-pipeline">Contribution: A Multi-Stage Architectural Pipeline</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>.
The paper proposes a deep learning architecture (<strong>MMSSC-Net</strong>) for Optical Chemical Structure Recognition (OCSR). It focuses on architectural innovation, specifically combining a SwinV2 visual encoder with a GPT-2 decoder, and validates this method through extensive benchmarking against existing rule-based and deep-learning baselines. It includes ablation studies to justify the choice of the visual encoder.</p>
<h2 id="motivation-addressing-noise-and-rigid-image-recognition">Motivation: Addressing Noise and Rigid Image Recognition</h2>
<ul>
<li><strong>Data Usage Gap</strong>: Drug discovery relies heavily on scientific literature, but molecular structures are often locked in vector graphics or images that computers cannot easily process.</li>
<li><strong>Limitations of Prior Work</strong>: Existing rule-based methods are rigid and sensitive to noise. Previous deep learning approaches (encoder-decoder &ldquo;image captioning&rdquo; styles) often lack precision and interpretability, and struggle with varying image resolutions or large molecules.</li>
<li><strong>Need for &ldquo;Cognition&rdquo;</strong>: The authors argue that treating the image as a single isolated whole is insufficient; a model needs to &ldquo;perceive&rdquo; fine-grained details (atoms and bonds) to handle noise and varying pixel qualities effectively.</li>
</ul>
<h2 id="novelty-a-fine-grained-perception-pipeline">Novelty: A Fine-Grained Perception Pipeline</h2>
<ul>
<li><strong>Multi-Stage Cognitive Architecture</strong>: MMSSC-Net splits the task into stages:
<ol>
<li><strong>Fine-grained Perception</strong>: Detecting atom and bond sequences (including spatial coordinates) using SwinV2.</li>
<li><strong>Graph Construction</strong>: Assembling these into a molecular graph.</li>
<li><strong>Sequence Evolution</strong>: Converting the graph into a machine-readable format (SMILES).</li>
</ol>
</li>
<li><strong>Hybrid Transformer Model</strong>: It combines a hierarchical vision transformer (<strong>SwinV2</strong>) for encoding with a generative pre-trained transformer (<strong>GPT-2</strong>) and MLPs for decoding atomic and bond targets.</li>
<li><strong>Robustness Mechanisms</strong>: Random noise sequences are included during training to improve generalization to new molecular targets.</li>
</ul>
<h2 id="methodology-and-benchmarks">Methodology and Benchmarks</h2>
<ul>
<li><strong>Baselines</strong>: Compared against 8 other tools:
<ul>
<li><em>Rule-based</em>: MolVec, OSRA.</li>
<li><em>Image-Smiles (DL)</em>: ABC-Net, Img2Mol, MolMiner.</li>
<li><em>Image-Graph-Smiles (DL)</em>: Image-To-Graph, MolScribe, ChemGrapher.</li>
</ul>
</li>
<li><strong>Datasets</strong>: Evaluated on 5 diverse datasets: STAKER (synthetic), USPTO, CLEF, JPO, and UOB (real-world).</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Accuracy</strong>: Exact string match of the predicted SMILES.</li>
<li><strong>Tanimoto Similarity</strong>: Chemical similarity using Morgan fingerprints.</li>
</ul>
</li>
<li><strong>Ablation Study</strong>: Tested different visual encoders (Swin Transformer, ViT-B, ResNet-50) to validate the choice of SwinV2.</li>
<li><strong>Resolution Sensitivity</strong>: Tested model performance across image resolutions from 256px to 2048px.</li>
</ul>
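<p>Both metrics can be sketched in plain Python. Here Tanimoto similarity is computed over precomputed fingerprint bit sets; the paper uses RDKit Morgan fingerprints, whose on-bits can be collected into a Python <code>set</code> (that extraction step is assumed, not shown):</p>

```python
def exact_match_accuracy(predicted, reference):
    """Fraction of predicted SMILES strings that match the reference exactly."""
    assert len(predicted) == len(reference)
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

<p>With RDKit installed, a bit set for a molecule would plausibly be obtained via something like <code>set(AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048).GetOnBits())</code>; radius and bit width here are illustrative assumptions, since the paper does not report them.</p>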
<h2 id="results-and-core-outcomes">Results and Core Outcomes</h2>
<ul>
<li><strong>Strong Performance</strong>: MMSSC-Net achieved 75-98% accuracy across datasets, outperforming baselines on most benchmarks; on the Indigo, RDKit, and USPTO sets it exceeded 94% accuracy.</li>
<li><strong>Resolution Robustness</strong>: The model maintained relatively stable accuracy across varying image resolutions, whereas baselines like Img2Mol showed greater sensitivity to resolution changes (Fig. 4 in the paper).</li>
<li><strong>Efficiency</strong>: The SwinV2 encoder was noted to be more efficient than ViT-B in this context.</li>
<li><strong>Limitations</strong>: The model struggles with stereochemistry, specifically confusing dashed wedge bonds with solid wedge bonds and misclassifying single bonds as solid wedge bonds. It also has difficulty with &ldquo;irrelevant text&rdquo; noise (e.g., unexpected symbols in JPO and DECIMER datasets).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combination of PubChem and USPTO data, augmented to handle visual variability.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>PubChem</strong></td>
          <td>1,000,000</td>
          <td>Converted from <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> to SMILES; random sampling.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO</strong></td>
          <td>600,000</td>
          <td>Patent images; converted from MOL to SMILES.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>STAKER</strong></td>
          <td>40,000</td>
          <td>Synthetic; Avg res $256 \times 256$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>USPTO</strong></td>
          <td>4,862</td>
          <td>Real; Avg res $721 \times 432$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>CLEF</strong></td>
          <td>881</td>
          <td>Real; Avg res $1245 \times 412$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>JPO</strong></td>
          <td>380</td>
          <td>Real; Avg res $614 \times 367$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>UOB</strong></td>
          <td>5,720</td>
          <td>Real; Avg res $759 \times 416$.</td>
      </tr>
  </tbody>
</table>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Image</strong>: Random perturbations using RDKit/Indigo (rotation, filling, cropping, bond thickness/length, font size, Gaussian noise).</li>
<li><strong>Molecular</strong>: Introduction of functional group abbreviations and R-substituents (dummy atoms) using SMARTS templates.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Target Sequence Formulation</strong>: The model predicts a sequence containing bounding box coordinates and type labels: $\{y_{\text{min}}, x_{\text{min}}, y_{\text{max}}, x_{\text{max}}, C_{n}\}$.</li>
<li><strong>Loss Function</strong>: Cross-entropy loss with maximum likelihood estimation.
$$ \max \sum_{i=1}^{N} \sum_{j=1}^{L} \omega_{j} \log P(t_{j}^{i} \mid x_{1}^{i}, x_{2}^{i}, \dots, x_{M}^{i}, t_{1}^{i}, \dots, t_{j-1}^{i}) $$</li>
<li><strong>Noise Injection</strong>: A random sequence $T_r$ is appended to the target sequence during training to improve generalization to new goals.</li>
<li><strong>Graph Construction</strong>: Atoms ($v$) and bonds ($e$) are recognized separately; bonds are defined by connecting spatial atomic coordinates.</li>
</ul>
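<p>A minimal sketch of the target-sequence construction and the weighted objective above. The vocabulary, integer-quantized coordinates, and per-token weights $\omega_j$ here are illustrative assumptions, not the paper's exact values:</p>

```python
import math


def make_target_sequence(detections, vocab):
    """Flatten (bbox, label) detections into the training target:
    [y_min, x_min, y_max, x_max, class_id] per detected atom/bond."""
    seq = []
    for (y0, x0, y1, x1), label in detections:
        seq.extend([y0, x0, y1, x1, vocab[label]])
    return seq


def weighted_token_nll(probs, targets, weights):
    """Weighted negative log-likelihood for one target sequence:
    -sum_j w_j * log P(t_j | context).  `probs` holds the model's
    per-step distributions as {token_id: probability} dicts."""
    return -sum(w * math.log(p[t]) for p, t, w in zip(probs, targets, weights))
```

<p>The weights would let training down-weight the appended random noise tokens $T_r$ relative to real targets, though the paper's exact weighting scheme is not spelled out here.</p>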
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer V2</strong>.
<ul>
<li>Pre-trained on ImageNet-1K.</li>
<li>Window size: $16 \times 16$.</li>
<li>Parameters: 88M.</li>
<li>Input resolution: $256 \times 256$.</li>
<li>Features: Scaled cosine attention; log-space continuous position bias.</li>
</ul>
</li>
<li><strong>Decoder</strong>: <strong>GPT-2</strong> + <strong>MLP</strong>.
<ul>
<li><strong>GPT-2</strong>: Used for recognizing atom types.
<ul>
<li>Layers: 24.</li>
<li>Attention Heads: 12.</li>
<li>Hidden Dimension: 768.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
<li><strong>MLP</strong>: Used for classifying bond types (single, double, triple, aromatic, solid wedge, dashed wedge).</li>
</ul>
</li>
<li><strong>Vocabulary</strong>:
<ul>
<li>Standard: 95 common numbers/characters ([0], [C], [=], etc.).</li>
<li>Extended: 2000 SMARTS-based characters for isomers/groups (e.g., &ldquo;[C2F5]&rdquo;, &ldquo;[halo]&rdquo;).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ol>
<li><strong>Accuracy</strong>: Exact match of the generated SMILES string.</li>
<li><strong>Tanimoto Similarity</strong>: Similarity of Morgan fingerprints between predicted and ground truth molecules.</li>
</ol>
<p><strong>Key Results (Accuracy)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MMSSC-Net</th>
          <th>MolVec (Rule)</th>
          <th>ABC-Net (DL)</th>
          <th>MolScribe (DL)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Indigo</strong></td>
          <td>98.14</td>
          <td>95.63</td>
          <td>96.4</td>
          <td>97.5</td>
      </tr>
      <tr>
          <td><strong>RDKit</strong></td>
          <td>94.91</td>
          <td>86.7</td>
          <td>98.3</td>
          <td>93.8</td>
      </tr>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>94.24</td>
          <td>88.47</td>
          <td>*</td>
          <td>92.6</td>
      </tr>
      <tr>
          <td><strong>CLEF</strong></td>
          <td>91.26</td>
          <td>81.61</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>92.71</td>
          <td>81.32</td>
          <td>96.1</td>
          <td>87.9</td>
      </tr>
      <tr>
          <td><strong>Staker</strong></td>
          <td>89.44</td>
          <td>4.49</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>75.48</td>
          <td>66.8</td>
          <td>*</td>
          <td>76.2</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch Size: 128.</li>
<li>Learning Rate: $4 \times 10^{-5}$.</li>
<li>Epochs: 40.</li>
</ul>
</li>
<li><strong>Inference Speed</strong>: The SwinV2 encoder demonstrated higher efficiency (faster inference time) compared to ViT-B and ResNet-50 baselines during ablation.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Wzew5Lp/MMSSCNet">MMSSCNet (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation; includes training and prediction scripts</td>
      </tr>
  </tbody>
</table>
<p>The paper is published in RSC Advances (open access). Source code is available on GitHub, though the repository has minimal documentation and no explicit license. The training data comes from PubChem (public) and USPTO (public patent data). Pre-trained model weights do not appear to be released. No specific GPU hardware or training time is reported in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, D., Zhao, D., Wang, Z., Li, J., &amp; Li, J. (2024). MMSSC-Net: multi-stage sequence cognitive networks for drug molecule recognition. <em>RSC Advances</em>, 14(26), 18182-18191. <a href="https://doi.org/10.1039/D4RA02442G">https://doi.org/10.1039/D4RA02442G</a></p>
<p><strong>Publication</strong>: RSC Advances 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangMMSSCNetMultistageSequence2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MMSSC-Net: Multi-Stage Sequence Cognitive Networks for Drug Molecule Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MMSSC-Net}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Dehai and Zhao, Di and Wang, Zhengwu and Li, Junhui and Li, Jin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{RSC Advances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{26}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{18182--18191}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D4RA02442G}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.rsc.org/en/content/articlelanding/2024/ra/d4ra02442g}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MarkushGrapher: Multi-modal Markush Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</guid><description>Multi-modal transformer combining vision, text, and layout encoding to extract complex Markush structures from patent documents with OCSR.</description><content:encoded><![CDATA[<h2 id="overcoming-unimodal-limitations-for-markush-structures">Overcoming Unimodal Limitations for Markush Structures</h2>
<p>The automated analysis of chemical literature, particularly patents, is critical for drug discovery and material science. A major bottleneck is the extraction of <strong>Markush structures</strong>, which are complex chemical templates that represent families of molecules using a core backbone image and textual variable definitions. Existing methods are limited because they either rely solely on images (OCSR) and miss the textual context, or focus solely on text and miss the structural backbone. This creates a practical need for a unified, multi-modal approach that jointly interprets visual and textual data to accurately extract these structures for prior-art search and database construction. This paper proposes a <strong>Method</strong> and introduces a new <strong>Resource</strong> (M2S dataset) to bridge this gap.</p>
<h2 id="markushgrapher-the-multi-modal-architecture">MarkushGrapher: The Multi-Modal Architecture</h2>
<p>The core innovation is <strong>MarkushGrapher</strong>, a multi-modal architecture that jointly encodes image, text, and layout information. Key contributions include:</p>
<ul>
<li><strong>Dual-Encoder Architecture</strong>: Combines a Vision-Text-Layout (VTL) encoder (based on UDOP) with a specialized, pre-trained Optical Chemical Structure Recognition (OCSR) encoder (MolScribe). Let $E_{\text{VTL}}$ represent the combined sequence embedding and $E_{\text{OCSR}}$ represent the domain-specific visual embeddings.</li>
<li><strong>Joint Recognition</strong>: The model autoregressively generates a sequential graph representation (Optimized CXSMILES) and a substituent table simultaneously. It uses cross-modal dependencies, allowing text to clarify ambiguous visual details like bond types.</li>
<li><strong>Synthetic Data Pipeline</strong>: A comprehensive pipeline generates realistic synthetic Markush structures (images and text) from PubChem data, overcoming the lack of labeled training data.</li>
<li><strong>Optimized Representation</strong>: A compacted version of CXSMILES moves variable groups into the SMILES string and adds explicit atom indexing to handle complex &ldquo;frequency&rdquo; and &ldquo;position&rdquo; variation indicators.</li>
</ul>
<h2 id="experimental-validation-on-the-new-m2s-benchmark">Experimental Validation on the New M2S Benchmark</h2>
<p>The authors validated their approach using the following setup:</p>
<ul>
<li><strong>Baselines</strong>: Compared against image-only chemistry models (DECIMER, MolScribe) and general-purpose multi-modal models (Uni-SMART, GPT-4o, Pixtral, Llama-3.2).</li>
<li><strong>Datasets</strong>: Evaluated on three benchmarks:
<ol>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 generated samples.</li>
<li><strong>M2S</strong>: A new benchmark of 103 manually annotated real-world patent images.</li>
<li><strong>USPTO-Markush</strong>: 74 Markush backbone images from USPTO patents.</li>
</ol>
</li>
<li><strong>Ablation Studies</strong>: Analyzed the impact of the OCSR encoder, late fusion strategies, and the optimized CXSMILES format. Late fusion improved USPTO-Markush EM from 23% (VTL only) to 32% (Table 3). Removing R-group compression dropped M2S EM from 38% to 30%, and removing atom indexing dropped USPTO-Markush EM from 32% to 24% (Table 4).</li>
</ul>
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Performance</strong>: MarkushGrapher outperformed all baselines. On the M2S benchmark, it achieved 38% Exact Match on CXSMILES (compared to 21% for MolScribe) and 29% Exact Match on tables. On USPTO-Markush, it reached 32% CXSMILES EM versus 7% for MolScribe.</li>
<li><strong>Markush Feature Recognition</strong>: The model can recognize complex Markush features like frequency variation (&lsquo;Sg&rsquo;) and position variation (&lsquo;m&rsquo;) indicators. DECIMER and MolScribe scored 0% on both &lsquo;m&rsquo; and &lsquo;Sg&rsquo; sections (Table 2), while MarkushGrapher achieved 76% on &lsquo;m&rsquo; and 31% on &lsquo;Sg&rsquo; sections on M2S.</li>
<li><strong>Cross-Modal Reasoning</strong>: Qualitative analysis showed the model can correctly infer visual details (such as bond order) that appear ambiguous in the image but become apparent with the text description.</li>
<li><strong>Robustness</strong>: The model generalizes well to real-world data despite being trained purely on synthetic data. On augmented versions of M2S and USPTO-Markush simulating low-quality scanned documents, it maintained 31% and 32% CXSMILES EM respectively (Table 6).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>The authors note several limitations:</p>
<ul>
<li>MarkushGrapher does not currently handle abbreviations in chemical structures (e.g., &lsquo;OG&rsquo; for oxygen connected to a variable group).</li>
<li>The model relies on ground-truth OCR cells as input, requiring an external OCR model for practical deployment.</li>
<li>Substituent definitions that combine text with interleaved chemical structure drawings are not supported.</li>
<li>The model is trained to predict &lsquo;m&rsquo; sections connecting to all atoms in a cycle, which can technically violate valence constraints, though the output contains enough information to reconstruct only valid connections.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong></p>
<ul>
<li><strong>Source</strong>: Synthetic dataset generated from PubChem SMILES.</li>
<li><strong>Size</strong>: 210,000 synthetic images.</li>
<li><strong>Pipeline</strong>:
<ol>
<li><strong>Selection</strong>: Sampled SMILES from PubChem based on substructure diversity.</li>
<li><strong>Augmentation</strong>: SMILES augmented to artificial CXSMILES using RDKit (inserting variable groups, frequency indicators).</li>
<li><strong>Rendering</strong>: Images rendered using Chemistry Development Kit (CDK) with randomized drawing parameters (font, bond width, spacing).</li>
<li><strong>Text Generation</strong>: Textual definitions generated using manual templates extracted from patents; 10% were paraphrased using Mistral-7B-Instruct-v0.3 to increase diversity.</li>
<li><strong>OCR</strong>: Bounding boxes extracted via a custom SVG parser aligned with MOL files.</li>
</ol>
</li>
</ul>
<p><strong>Evaluation Data</strong></p>
<ul>
<li><strong>M2S Dataset</strong>: 103 images from USPTO, EPO, and WIPO patents (1999-2023), manually annotated with CXSMILES and substituent tables.</li>
<li><strong>USPTO-Markush</strong>: 74 images from USPTO patents (2010-2016).</li>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 samples generated via the pipeline.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimized CXSMILES</strong>:
<ul>
<li><strong>Compression</strong>: Variable groups moved from the extension block to the main SMILES string as special atoms to reduce sequence length.</li>
<li><strong>Indexing</strong>: Atom indices appended to each atom (e.g., <code>C:1</code>) to explicitly link the graph to the extension block (crucial for <code>m</code> and <code>Sg</code> sections).</li>
<li><strong>Vocabulary</strong>: Specific tokens used for atoms and bonds.</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Standard image augmentations (shift, scale, blur, pepper noise, random lines) and OCR text augmentations (character substitution/insertion/deletion).</li>
</ul>
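<p>The atom-indexing step can be illustrated with a toy parser. This sketch handles only single-letter organic-subset atoms, bonds, and branches (no ring closures, bracket atoms, or two-letter elements), and the 1-based index format is an assumption based on the <code>C:1</code> example above:</p>

```python
def index_atoms(smiles):
    """Append an explicit atom index to each atom of a simple SMILES string,
    e.g. 'CC(=O)N' -> 'C:1C:2(=O:3)N:4'.  Toy sketch: single-letter atoms,
    bonds, and branches only; ring closures are out of scope."""
    out, idx = [], 0
    for ch in smiles:
        out.append(ch)
        if ch.isalpha():      # a single-letter atom symbol
            idx += 1
            out.append(f":{idx}")
    return "".join(out)
```

<p>With explicit indices, the <code>m</code> and <code>Sg</code> extension-block sections can reference atoms unambiguously instead of relying on implicit SMILES atom ordering.</p>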
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer.
<ul>
<li><strong>VTL Encoder</strong>: T5-large encoder (initialized from UDOP) that processes image patches, text tokens, and layout (bounding boxes).</li>
<li><strong>OCSR Encoder</strong>: Vision encoder from MolScribe (Swin Transformer), frozen during training.</li>
<li><strong>Text Decoder</strong>: T5-large decoder.</li>
</ul>
</li>
<li><strong>Fusion Strategy</strong>: <strong>Late Fusion</strong>. The VTL output $e_1(v, t, l)$, which jointly encodes vision, text, and layout, is concatenated with the MLP-projected OCSR output $e_2(v)$ before decoding:
$$ e = e_1(v, t, l) \oplus \text{MLP}(e_2(v)) $$</li>
<li><strong>Parameters</strong>: 831M total (744M trainable).</li>
</ul>
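<p>The late-fusion step is just a concatenation of the two embeddings after a learned projection. A dependency-free sketch on Python lists, with a single linear layer standing in for the MLP (the real projection's depth and dimensions are not reproduced here):</p>

```python
def mlp_project(e2, weight, bias):
    """Single linear layer standing in for the MLP projection of the
    OCSR embedding e2 (a stand-in, not the paper's exact MLP)."""
    return [sum(w * x for w, x in zip(row, e2)) + b
            for row, b in zip(weight, bias)]


def late_fusion(e1, e2, weight, bias):
    """Fused representation e = e1 (+) MLP(e2): the VTL embedding
    concatenated with the projected OCSR embedding."""
    return e1 + mlp_project(e2, weight, bias)
```

<p>The T5 decoder then attends over this fused sequence, so chemically specialized visual features sit alongside the layout-aware tokens.</p>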
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>CXSMILES Exact Match (EM)</strong>: Requires perfect match of SMILES string, variable groups, <code>m</code> sections, and <code>Sg</code> sections (ignoring stereochemistry).</li>
<li><strong>Tanimoto Score</strong>: Similarity of RDKit DayLight fingerprints (Markush features removed).</li>
<li><strong>Table Exact Match</strong>: All variable groups and substituents must match.</li>
<li><strong>Table F1-Score</strong>: Aggregated recall and precision of substituents per variable group.</li>
</ul>
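<p>The table F1-score can be sketched as a micro-averaged F1 over substituent sets per variable group. Micro-averaging is an assumption here; the paper's exact aggregation is not reproduced:</p>

```python
def table_f1(pred_table, gold_table):
    """Micro-averaged F1 over substituent sets per variable group.
    Tables map variable-group name -> set of substituent strings."""
    tp = fp = fn = 0
    for g in set(pred_table) | set(gold_table):
        pred = pred_table.get(g, set())
        gold = gold_table.get(g, set())
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```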
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Trained on a single NVIDIA H100 GPU.</li>
<li><strong>Training Config</strong>: 10 epochs, batch size of 10, ADAM optimizer, learning rate 5e-4, 100 warmup steps, weight decay 1e-3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Weber, V., Nassar, A., Meijer, G. I., Van Gool, L., Li, Y., &amp; Staar, P. (2025). MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures. <em>2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 14505-14515. <a href="https://doi.org/10.1109/CVPR52734.2025.01352">https://doi.org/10.1109/CVPR52734.2025.01352</a></p>
<p><strong>Publication</strong>: CVPR 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMarkushGrapherJointVisual2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MarkushGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Weber, Valéry and Nassar, Ahmed and Meijer, Gerhard Ingmar and Van Gool, Luc and Li, Yawei and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14505--14515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/CVPR52734.2025.01352}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2InChI: SwinTransformer for Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</guid><description>Deep learning model using improved SwinTransformer encoder and attention-based feature fusion to convert molecular images to InChI strings.</description><content:encoded><![CDATA[<h2 id="image2inchi-as-a-methodological-innovation">Image2InChI as a Methodological Innovation</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>. It proposes a specific new deep learning architecture (&ldquo;Image2InChI&rdquo;) to solve the task of Optical Chemical Structure Recognition (OCSR). The rhetorical focus is on engineering a system that outperforms baselines on specific metrics (InChI accuracy, MCS accuracy) and providing a valuable reference for future algorithmic work.</p>
<h2 id="bottlenecks-in-chemical-literature-digitization">Bottlenecks in Chemical Literature Digitization</h2>
<p>The accurate digitization of chemical literature is a bottleneck in AI-driven drug discovery. Chemical structures in patents and papers exist as optical images (pixels), but machine learning models require machine-readable string representations (like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>). Efficiently and automatically bridging this gap is a prerequisite for large-scale data mining in chemistry.</p>
<h2 id="hierarchical-swintransformer-and-attention-integration">Hierarchical SwinTransformer and Attention Integration</h2>
<p>The core novelty is the <strong>Image2InChI</strong> architecture, which integrates:</p>
<ol>
<li><strong>Improved SwinTransformer Encoder</strong>: Uses a hierarchical vision transformer to capture image features.</li>
<li><strong>Feature Fusion with Attention</strong>: A novel network designed to integrate image patch features with InChI prediction steps.</li>
<li><strong>End-to-End InChI Prediction</strong>: The architecture frames the problem as a direct image-to-sequence translation targeting InChI strings directly, diverging from techniques predicting independent graph components. The model is optimized using a standard Cross-Entropy Loss over the token vocabulary:
$$ \mathcal{L}_{\text{CE}} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{X}) $$
where $\mathbf{X}$ represents the input image features, $y_t$ is the predicted token, and $T$ is the sequence length.</li>
</ol>
<h2 id="benchmarking-on-the-bms-dataset">Benchmarking on the BMS Dataset</h2>
<ul>
<li><strong>Benchmark Validation</strong>: The model was trained and tested on the <strong>BMS1000 (Bristol-Myers Squibb)</strong> dataset from a Kaggle competition.</li>
<li><strong>Ablation/Comparative Analysis</strong>: The authors compared their method against other models in the supplement.</li>
<li><strong>Preprocessing Validation</strong>: They justified their choice of denoising algorithms (8-neighborhood vs. Gaussian/Mean) to ensure preservation of bond lines while removing &ldquo;spiky point noise&rdquo;.</li>
</ul>
<h2 id="high-inchi-recognition-metrics">High InChI Recognition Metrics</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved <strong>99.8% InChI accuracy</strong>, 94.8% Maximum Common Substructure (MCS) accuracy, and 96.2% Longest Common Subsequence (LCS) accuracy on the benchmarked dataset. It remains to be seen how well these models generalize to heavily degraded real-world patent images.</li>
<li><strong>Effective Denoising</strong>: The authors concluded that <strong>eight-neighborhood filtering</strong> is superior to mean or Gaussian filtering for this specific domain because it removes isolated noise points without blurring the fine edges of chemical bonds.</li>
<li><strong>Open Source</strong>: The authors stated their intention to release the code, though no public repository has been identified.</li>
</ul>
<hr>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition</td>
          <td>Bristol-Myers Squibb Molecular Translation competition dataset</td>
      </tr>
  </tbody>
</table>
<p>No public code repository has been identified for Image2InChI despite the authors&rsquo; stated intent to release it.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The primary dataset used is the <strong>BMS (Bristol-Myers Squibb) Dataset</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>Kaggle Competition (BMS-Molecular-Translation)</td>
      </tr>
      <tr>
          <td><strong>Total Size</strong></td>
          <td>2.4 million images</td>
      </tr>
      <tr>
          <td><strong>Training Set</strong></td>
          <td>1.8 million images</td>
      </tr>
      <tr>
          <td><strong>Test Set</strong></td>
          <td>0.6 million images</td>
      </tr>
      <tr>
          <td><strong>Content</strong></td>
          <td>Each image corresponds to a unique International Chemical Identifier (<a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>)</td>
      </tr>
  </tbody>
</table>
<p><strong>Other Datasets</strong>: The authors also utilized JPO (Japanese Patent Office), CLEF (CLEF-IP 2012), UOB (MolrecUOB), and USPTO datasets for broader benchmarking.</p>
<p><strong>Preprocessing Pipeline</strong>:</p>
<ol>
<li><strong>Denoising</strong>: <strong>Eight-neighborhood filtering</strong> (threshold &lt; 4 non-white pixels) is used to remove salt-and-pepper noise while preserving bond lines. Mean and Gaussian filtering were rejected due to blurring.</li>
<li><strong>Sequence Padding</strong>:
<ul>
<li>Analysis showed max InChI length &lt; 270.</li>
<li>Fixed sequence length set to <strong>300</strong>.</li>
<li>Tokens: <code>&lt;sos&gt;</code> (190), <code>&lt;eos&gt;</code> (191), <code>&lt;pad&gt;</code> (192) used for padding/framing.</li>
</ul>
</li>
<li><strong>Numerization</strong>: Characters are mapped to integers based on a fixed vocabulary (e.g., &lsquo;C&rsquo; -&gt; 178, &lsquo;H&rsquo; -&gt; 182).</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Eight-Neighborhood Filtering (Denoising)</strong>:</p>
<p>Pseudocode logic:</p>
<ul>
<li>Iterate through every pixel.</li>
<li>Count non-white neighbors in the 3x3 grid (8 neighbors).</li>
<li>If count &lt; threshold (default 4), treat as noise and remove.</li>
</ul>
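<p>The filtering logic above can be sketched in a few lines of Python (an illustrative reimplementation of the pseudocode, not the authors&rsquo; code); pixels are 0 for white and 1 for non-white:</p>

```python
def denoise(img, threshold=4):
    """Eight-neighborhood filter: drop non-white pixels with fewer than
    `threshold` non-white neighbors in their 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] == 0:
                continue  # white pixels are left untouched
            neighbors = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors < threshold:
                out[y][x] = 0  # isolated pixel: treat as noise
    return out
```

<p>An isolated pixel (0 non-white neighbors) is removed, while pixels embedded in thick strokes survive.</p>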
<p><strong>InChI Tokenization</strong>:</p>
<ul>
<li>InChI strings are split into character arrays.</li>
<li>Example: Vitamin C <code>InChI=1S/C6H8O6...</code> becomes <code>[&lt;sos&gt;, C, 6, H, 8, O, 6, ..., &lt;eos&gt;, &lt;pad&gt;...]</code>.</li>
<li>Mapped to integer tensor for model input.</li>
</ul>
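<p>A minimal sketch of this tokenization and numerization step (only the <code>&lt;sos&gt;</code>/<code>&lt;eos&gt;</code>/<code>&lt;pad&gt;</code> ids and the &lsquo;C&rsquo;/&lsquo;H&rsquo; mappings come from the paper; the rest of the vocabulary and the function name are illustrative):</p>

```python
SOS, EOS, PAD = 190, 191, 192          # special-token ids from the paper
VOCAB = {"C": 178, "H": 182}           # one integer id per character (partial)

def encode_inchi(inchi, vocab=VOCAB, max_len=300):
    """Map an InChI string to a fixed-length integer sequence."""
    ids = [SOS] + [vocab[ch] for ch in inchi] + [EOS]
    ids += [PAD] * (max_len - len(ids))  # pad to the fixed length of 300
    return ids[:max_len]
```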
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image2InChI</p>
<ul>
<li><strong>Encoder</strong>: Improved SwinTransformer (Hierarchical Vision Transformer).</li>
<li><strong>Decoder</strong>: Transformer Decoder with patch embedding.</li>
<li><strong>Fusion</strong>: A novel &ldquo;feature fusion network with attention&rdquo; integrates the visual tokens with the sequence generation process.</li>
<li><strong>Framework</strong>: PyTorch 1.8.1.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>InChI Acc</strong>: Exact match accuracy of the predicted InChI string (Reported: 99.8%).</li>
<li><strong>MCS Acc</strong>: Maximum Common Substructure accuracy (structural similarity) (Reported: 94.8%).</li>
<li><strong>LCS Acc</strong>: Longest Common Subsequence accuracy (string similarity) (Reported: 96.2%).</li>
<li><strong>Morgan FP</strong>: Morgan Fingerprint similarity (Reported: 94.1%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Specification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GPU</strong></td>
          <td>NVIDIA Tesla P100 (16GB VRAM)</td>
      </tr>
      <tr>
          <td><strong>Platform</strong></td>
          <td>MatPool cloud platform</td>
      </tr>
      <tr>
          <td><strong>CPU</strong></td>
          <td>Intel Xeon Gold 6271</td>
      </tr>
      <tr>
          <td><strong>RAM</strong></td>
          <td>32GB System Memory</td>
      </tr>
      <tr>
          <td><strong>Driver</strong></td>
          <td>NVIDIA-SMI 440.100</td>
      </tr>
      <tr>
          <td><strong>OS</strong></td>
          <td>Ubuntu 18.04</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, D., Xu, X., Pan, J., Gao, W., &amp; Zhang, S. (2024). Image2InChI: Automated Molecular Optical Image Recognition. <em>Journal of Chemical Information and Modeling</em>, 64(9), 3640-3649. <a href="https://doi.org/10.1021/acs.jcim.3c02082">https://doi.org/10.1021/acs.jcim.3c02082</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></li>
</ul>
<p><strong>Note</strong>: These notes are based on the Abstract and Supporting Information files only.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2024image2inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Image2InChI: Automated Molecular Optical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Da-zhou and Xu, Xin and Pan, Jia-heng and Gao, Wei and Zhang, Shi-rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3640--3649}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.3c02082}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Enhanced DECIMER for Hand-Drawn Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</guid><description>An improved encoder-decoder model (EfficientNetV2 + Transformer) converts hand-drawn chemical structures into SMILES strings using synthetic training data.</description><content:encoded><![CDATA[<h2 id="method-contribution-architectural-optimization">Method Contribution: Architectural Optimization</h2>
<p>This is a <strong>Method</strong> paper. It proposes an enhanced neural network architecture (EfficientNetV2 + Transformer) specifically designed to solve the problem of recognizing hand-drawn chemical structures. The primary contribution is architectural optimization and a data-driven training strategy, validated through ablation studies (comparing encoders) and benchmarked against existing rule-based and deep learning tools.</p>
<h2 id="motivation-digitizing-dark-chemical-data">Motivation: Digitizing &ldquo;Dark&rdquo; Chemical Data</h2>
<p>Chemical information in legacy laboratory notebooks and modern tablet-based inputs often exists as hand-drawn sketches.</p>
<ul>
<li><strong>Gap:</strong> Existing Optical Chemical Structure Recognition (OCSR) tools (particularly rule-based ones) lack robustness and fail when images have variability in style, line thickness, or noise.</li>
<li><strong>Need:</strong> There is a critical need for automated tools to digitize this &ldquo;dark data&rdquo; effectively to preserve it and make it machine-readable and searchable.</li>
</ul>
<h2 id="core-innovation-decoder-only-design-and-synthetic-scaling">Core Innovation: Decoder-Only Design and Synthetic Scaling</h2>
<p>The core novelty is the <strong>architectural enhancement</strong> and <strong>synthetic training strategy</strong>:</p>
<ol>
<li><strong>Decoder-Only Transformer:</strong> Using only the decoder part of the Transformer (instead of a full encoder-decoder Transformer) improved average accuracy across OCSR benchmarks from 61.28% to 69.27% (Table 3 in the paper).</li>
<li><strong>EfficientNetV2 Integration:</strong> Replacing standard CNNs or EfficientNetV1 with <strong>EfficientNetV2-M</strong> provided better feature extraction and 2x faster training speeds.</li>
<li><strong>Scale of Synthetic Data:</strong> The authors demonstrate that scaling synthetic training data (up to 152 million images generated by RanDepict) directly correlates with improved generalization to real-world hand-drawn images, without ever training on real hand-drawn data.</li>
</ol>
<h2 id="experimental-setup-ablation-and-real-world-baselines">Experimental Setup: Ablation and Real-World Baselines</h2>
<ul>
<li><strong>Model Selection (Ablation):</strong> Tested three architectures (EfficientNetV2-M + Full Transformer, EfficientNetV1-B7 + Decoder-only, EfficientNetV2-M + Decoder-only) on standard benchmarks (JPO, CLEF, USPTO, UOB).</li>
<li><strong>Data Scaling:</strong> Trained the best model on four progressively larger datasets (from 4M to 152M images) to measure performance gains.</li>
<li><strong>Real-World Benchmarking:</strong> Validated the final model on the <strong>DECIMER Hand-drawn dataset</strong> (5088 real images drawn by volunteers) and compared against 9 other tools (OSRA, MolVec, Img2Mol, MolScribe, etc.).</li>
</ul>
<h2 id="results-and-conclusions-strong-accuracy-on-hand-drawn-scans">Results and Conclusions: Strong Accuracy on Hand-Drawn Scans</h2>
<ul>
<li><strong>Strong Performance:</strong> The final DECIMER model achieved <strong>99.72% valid predictions</strong> and <strong>73.25% exact accuracy</strong> on the hand-drawn benchmark. The next best non-DECIMER tool was MolGrapher at 10.81% accuracy, followed by MolScribe at 7.65%.</li>
<li><strong>Robustness:</strong> Deep learning methods outperform rule-based methods, which scored at most 3% exact accuracy, on hand-drawn data.</li>
<li><strong>Data Saturation:</strong> Quadrupling the dataset from 38M to 152M images yielded only marginal gains (about 3 percentage points in accuracy), suggesting current synthetic data strategies may be hitting a plateau.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official TensorFlow implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained hand-drawn model weights</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/decimer/">DECIMER PyPi Package</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Installable Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic hand-drawn image generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The model was trained entirely on <strong>synthetic data</strong> generated using the <a href="https://github.com/OBrink/RanDepict">RanDepict</a> toolkit. No real hand-drawn images were used for training.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Molecules</th>
          <th>Total Images</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>4,375,338</td>
          <td>1 augmented + 1 clean per molecule</td>
      </tr>
      <tr>
          <td>2</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>13,126,014</td>
          <td>2 augmented + 4 clean per molecule</td>
      </tr>
      <tr>
          <td>3</td>
          <td>PubChem</td>
          <td>9,510,000</td>
          <td>38,040,000</td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
      <tr>
          <td>4</td>
          <td>PubChem</td>
          <td>38,040,000</td>
          <td><strong>152,160,000</strong></td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
  </tbody>
</table>
<p>A separate <strong>model selection</strong> experiment used a 1,024,000-molecule subset of ChEMBL to compare the three architectures (Table 1 in the paper). The <strong>DECIMER Hand-Drawn</strong> evaluation dataset consists of 5,088 real hand-drawn images from 23 volunteers.</p>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings length &lt; 300 characters.</li>
<li>Images resized to $512 \times 512$.</li>
<li>Images generated with and without &ldquo;hand-drawn style&rdquo; augmentations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization:</strong> SMILES split by heavy atoms, brackets, bond symbols, and special characters. Start <code>&lt;start&gt;</code> and end <code>&lt;end&gt;</code> tokens added; padded with <code>&lt;pad&gt;</code>.</li>
<li><strong>Optimization:</strong> Adam optimizer with a custom learning rate schedule (as specified in the original Transformer paper). A dropout rate of 0.1 was used.</li>
<li><strong>Loss Function:</strong> Trained using focal loss to address class imbalance for rare tokens. The focal loss formulation reduces the relative loss for well-classified examples:
$$
\text{FL}(p_{\text{t}}) = -\alpha_{\text{t}} (1 - p_{\text{t}})^\gamma \log(p_{\text{t}})
$$</li>
<li><strong>Augmentations:</strong> RanDepict applied synthetic distortions to mimic handwriting (wobbly lines, variable thickness, etc.).</li>
</ul>
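<p>The focal-loss formula transcribes directly for a single token probability (a sketch; the $\alpha_{\text{t}}$ and $\gamma$ defaults here are the values common in the literature, not necessarily the paper&rsquo;s):</p>

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    With gamma = 0 and alpha_t = 1 this reduces to cross-entropy."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

<p>The $(1 - p_{\text{t}})^\gamma$ factor shrinks the loss for well-classified tokens (e.g. $p_{\text{t}} = 0.9$) by orders of magnitude relative to hard, rare tokens ($p_{\text{t}} = 0.1$), which is what counteracts token-frequency imbalance.</p>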
<h3 id="models">Models</h3>
<p>The final architecture (Model 3) is an Encoder-Decoder structure:</p>
<ul>
<li><strong>Encoder:</strong> <strong>EfficientNetV2-M</strong> (pretrained ImageNet backbone).
<ul>
<li>Input: $512 \times 512 \times 3$ image.</li>
<li>Output Features: $16 \times 16 \times 512$ (reshaped to sequence length 256, dimension 512).</li>
<li><em>Note:</em> The final fully connected layer of the CNN is removed.</li>
</ul>
</li>
<li><strong>Decoder:</strong> <strong>Transformer (Decoder-only)</strong>.
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 8</li>
<li>Embedding Dimension: 512</li>
</ul>
</li>
<li><strong>Output:</strong> Predicted SMILES string token by token.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used for evaluation:</p>
<ol>
<li><strong>Valid Predictions (%):</strong> Percentage of outputs that are syntactically valid SMILES.</li>
<li><strong>Exact Match Accuracy (%):</strong> Canonical SMILES string identity.</li>
<li><strong>Tanimoto Similarity:</strong> Fingerprint similarity (PubChem fingerprints) between ground truth and prediction.</li>
</ol>
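<p>Of these, Tanimoto similarity is the only non-trivial computation; on fingerprints represented as sets of &ldquo;on&rdquo; bits it is simply intersection over union (the paper uses PubChem fingerprints computed by a cheminformatics toolkit; integer sets stand in for them in this sketch):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two bit sets."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)
```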
<p><strong>Data Scaling Results (Hand-Drawn Dataset, Table 4 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Training Images</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (ChEMBL)</td>
          <td>4,375,338</td>
          <td>96.21%</td>
          <td>5.09%</td>
          <td>0.490</td>
      </tr>
      <tr>
          <td>2 (ChEMBL)</td>
          <td>13,126,014</td>
          <td>97.41%</td>
          <td>26.08%</td>
          <td>0.690</td>
      </tr>
      <tr>
          <td>3 (PubChem)</td>
          <td>38,040,000</td>
          <td>99.67%</td>
          <td>70.34%</td>
          <td>0.939</td>
      </tr>
      <tr>
          <td>4 (PubChem)</td>
          <td>152,160,000</td>
          <td>99.72%</td>
          <td>73.25%</td>
          <td>0.942</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with Other Tools (Hand-Drawn Dataset, Table 5 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>OCSR Tool</th>
          <th>Method</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER (Ours)</strong></td>
          <td>Deep Learning</td>
          <td><strong>99.72%</strong></td>
          <td><strong>73.25%</strong></td>
          <td><strong>0.94</strong></td>
      </tr>
      <tr>
          <td>DECIMER.ai</td>
          <td>Deep Learning</td>
          <td>96.07%</td>
          <td>26.98%</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>Deep Learning</td>
          <td>99.94%</td>
          <td>10.81%</td>
          <td>0.51</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>Deep Learning</td>
          <td>95.66%</td>
          <td>7.65%</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>Deep Learning</td>
          <td>98.96%</td>
          <td>5.25%</td>
          <td>0.52</td>
      </tr>
      <tr>
          <td>SwinOCSR</td>
          <td>Deep Learning</td>
          <td>97.37%</td>
          <td>5.11%</td>
          <td>0.64</td>
      </tr>
      <tr>
          <td>ChemGrapher</td>
          <td>Deep Learning</td>
          <td>69.56%</td>
          <td>N/A</td>
          <td>0.09</td>
      </tr>
      <tr>
          <td>Imago</td>
          <td>Rule-based</td>
          <td>43.14%</td>
          <td>2.99%</td>
          <td>0.22</td>
      </tr>
      <tr>
          <td>MolVec</td>
          <td>Rule-based</td>
          <td>71.86%</td>
          <td>1.30%</td>
          <td>0.23</td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td>Rule-based</td>
          <td>54.66%</td>
          <td>0.57%</td>
          <td>0.17</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Google Cloud TPU v4-128 pod slice.</li>
<li><strong>Training Time:</strong>
<ul>
<li>EfficientNetV2-M model trained ~2x faster than EfficientNetV1-B7.</li>
<li>Average training time per epoch: 34 minutes (for Model 3 on 1M dataset subset).</li>
</ul>
</li>
<li><strong>Epochs:</strong> Models trained for 25 epochs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H.O., Zielesny, A. et al. (2024). Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. <em>Journal of Cheminformatics</em>, 16(78). <a href="https://doi.org/10.1186/s13321-024-00872-7">https://doi.org/10.1186/s13321-024-00872-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://pypi.org/project/decimer/">PyPi Package</a></li>
<li><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanAdvancementsHanddrawnChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Advancements in Hand-Drawn Chemical Structure Recognition through an Enhanced {{DECIMER}} Architecture}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{78}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00872-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Dual-Path Global Awareness Transformer (DGAT) for OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</guid><description>A Transformer-based OCSR model introducing dual-path modules (CGFE and SDGLA) to improve global context awareness and complex motif recognition.</description><content:encoded><![CDATA[<h2 id="contribution-type-deep-learning-method-for-ocsr">Contribution Type: Deep Learning Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>The classification is based on the proposal of a novel deep learning architecture (DGAT) designed to address specific limitations in existing Optical Chemical Structure Recognition (OCSR) systems. The contribution is validated through benchmarking against external baselines (DeepOCSR, DECIMER, SwinOCSR) and ablation studies that isolate the impact of the new modules.</p>
<h2 id="motivation-addressing-global-context-loss">Motivation: Addressing Global Context Loss</h2>
<p>Existing multimodal fusion methods for OCSR suffer from limited awareness of global context.</p>
<ul>
<li><strong>Problem</strong>: Models often generate erroneous sequences when processing complex motifs, such as rings or long chains, due to a disconnect between local feature extraction and global structural understanding.</li>
<li><strong>Gap</strong>: Current architectures struggle to capture the &ldquo;fine-grained differences between global and local features,&rdquo; leading to topological errors.</li>
<li><strong>Practical Need</strong>: Accurate translation of chemical images to machine-readable sequences (SMILES/SELFIES) is critical for materials science and AI-guided chemical research.</li>
</ul>
<h2 id="core-innovation-dual-path-global-awareness-transformer">Core Innovation: Dual-Path Global Awareness Transformer</h2>
<p>The authors propose the <strong>Dual-Path Global Awareness Transformer (DGAT)</strong>, which redesigns the decoder with two novel mechanisms to better handle global context:</p>
<ol>
<li>
<p><strong>Cascaded Global Feature Enhancement (CGFE)</strong>: This module bridges cross-modal gaps by emphasizing global context. It concatenates global visual features with sequence features and processes them through a Cross-Modal Assimilation MLP and an Adaptive Alignment MLP to align multimodal representations. The feature enhancement conceptually computes:</p>
<p>$$ f_{\text{enhanced}} = \text{MLP}_{\text{align}}(\text{MLP}_{\text{assimilate}}([f_{\text{global}}, f_{\text{seq}}])) $$</p>
</li>
<li>
<p><strong>Sparse Differential Global-Local Attention (SDGLA)</strong>: A module that dynamically captures fine-grained differences between global and local features. It uses sequence features (embedded with global info) as queries, while utilizing local and global visual features as keys/values in parallel attention heads to generate initial multimodal features.</p>
</li>
</ol>
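<p>The CGFE equation can be sketched without any framework (hidden sizes, single-layer MLPs, and the ReLU nonlinearity are assumptions for illustration, not the paper&rsquo;s specification):</p>

```python
def linear(x, weights, bias):
    # weights: out_dim rows of in_dim values each
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def cgfe(f_global, f_seq, assim, align):
    """f_enhanced = MLP_align(MLP_assim([f_global, f_seq]))."""
    fused = list(f_global) + list(f_seq)  # concatenate [f_global, f_seq]
    h = relu(linear(fused, *assim))       # Cross-Modal Assimilation MLP
    return linear(h, *align)              # Adaptive Alignment MLP
```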
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was evaluated on a newly constructed dataset and compared against five major baselines.</p>
<ul>
<li><strong>Baselines</strong>: DeepOCSR, DECIMER 1.0, DECIMER V2, SwinOCSR, and MPOCSR.</li>
<li><strong>Ablation Studies</strong>:
<ul>
<li><strong>Layer Depth</strong>: Tested Transformer depths from 1 to 5 layers; 3 layers proved optimal for balancing gradient flow and parameter sufficiency.</li>
<li><strong>Beam Size</strong>: Tested inference beam sizes 1-5; size 3 achieved the best balance between search depth and redundancy.</li>
<li><strong>Module Contribution</strong>: Validated that removing CGFE results in a drop in structural similarity (Tanimoto), proving the need for pre-fusion alignment.</li>
</ul>
</li>
<li><strong>Robustness Analysis</strong>: Performance broken down by molecule complexity (atom count, ring count, bond count).</li>
<li><strong>Chirality Validation</strong>: Qualitative analysis of attention maps on chiral molecules to verify the model learns stereochemical cues implicitly.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Performance Over Baselines</strong>: DGAT outperformed the MPOCSR baseline across all metrics:
<ul>
<li><strong>BLEU-4</strong>: 84.0% (+5.3 percentage points)</li>
<li><strong>ROUGE</strong>: 90.8% (+1.9 percentage points)</li>
<li><strong>Tanimoto Similarity</strong>: 98.8% (+1.2 percentage points)</li>
<li><strong>Exact Match Accuracy</strong>: 54.6% (+10.9 percentage points over SwinOCSR)</li>
</ul>
</li>
<li><strong>Chiral Recognition</strong>: The model implicitly recognizes chiral centers (e.g., generating <code>[C@@H1]</code> tokens correctly) based on 2D wedge cues without direct stereochemical supervision.</li>
<li><strong>Limitations</strong>: Performance drops for extreme cases, such as molecules with 4+ rings or 4+ double/triple bonds, due to dataset imbalance. The model still hallucinates branches in highly complex topologies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is primarily drawn from PubChem and augmented to improve robustness.</p>
<ul>
<li><strong>Augmentation Strategy</strong>: Each sequence generates three images with random rendering parameters.
<ul>
<li><strong>Rotation</strong>: 0, 90, 180, 270, or random [0, 360)</li>
<li><strong>Bond Width</strong>: 1, 2, or 3 pixels</li>
<li><strong>Bond Offset</strong>: Sampled from 0.08-0.18 (inherited from Image2SMILES)</li>
<li><strong>CoordGen</strong>: Enabled with 20% probability</li>
</ul>
</li>
<li><strong>Evaluation Set</strong>: A newly constructed benchmark dataset was used for final reporting.</li>
</ul>
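<p>As a sketch, one draw from the augmentation recipe above looks like the following (function and key names are illustrative, not from the released code):</p>

```python
import random

def sample_render_params(rng=random):
    """Sample one set of rendering parameters per the recipe above."""
    rotation = rng.choice([0, 90, 180, 270, rng.uniform(0, 360)])
    return {
        "rotation_deg": rotation,
        "bond_width_px": rng.choice([1, 2, 3]),
        "bond_offset": rng.uniform(0.08, 0.18),  # inherited from Image2SMILES
        "use_coordgen": rng.random() < 0.20,     # enabled with 20% probability
    }
```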
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Encoder LR</strong>: $5 \times 10^{-5}$ (Pretrained ResNet-101)</li>
<li><strong>Decoder LR</strong>: $1 \times 10^{-4}$ (Randomly initialized Transformer)</li>
<li><strong>Optimizer</strong>: Not named explicitly; the reported momentum of 0.9 and weight decay of 0.0001 suggest SGD with momentum</li>
<li><strong>Batch Size</strong>: 256</li>
</ul>
</li>
<li><strong>Inference</strong>:
<ul>
<li><strong>Beam Search</strong>: A beam size of <strong>3</strong> is used. Larger beam sizes (4-5) degraded BLEU/ROUGE scores due to increased redundancy.</li>
</ul>
</li>
</ul>
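<p>The beam-search decoding described above can be sketched generically (a toy decoder, not the DGAT implementation; <code>step_logprobs</code> is a stand-in for the model&rsquo;s next-token log-probability function):</p>

```python
def beam_search(step_logprobs, beam_size=3, eos=1, max_len=64):
    """Keep the `beam_size` highest-scoring partial sequences per step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished beams pass through
                continue
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]  # highest-scoring finished sequence
```

<p>Larger beams keep more near-duplicate branches alive, which is consistent with the observation that sizes 4&ndash;5 degraded BLEU/ROUGE through redundancy.</p>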
<h3 id="models">Models</h3>
<ul>
<li><strong>Visual Encoder</strong>:
<ul>
<li><strong>Backbone</strong>: ResNet-101 initialized with ImageNet weights</li>
<li><strong>Structure</strong>: Convolutional layers preserved up to the final module. Classification head removed.</li>
<li><strong>Pooling</strong>: A $7 \times 7$ average pooling layer is used to extract global visual features.</li>
</ul>
</li>
<li><strong>Sequence Decoder</strong>:
<ul>
<li><strong>Architecture</strong>: Transformer-based with CGFE and SDGLA modules.</li>
<li><strong>Depth</strong>: 3 Transformer layers</li>
<li><strong>Dropout</strong>: Not utilized</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is reported using sequence-level and structure-level metrics.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">DGAT Score</th>
          <th style="text-align: left">Baseline (MPOCSR)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU-4</strong></td>
          <td style="text-align: left"><strong>84.0%</strong></td>
          <td style="text-align: left">78.7%</td>
          <td style="text-align: left">Measures n-gram precision</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ROUGE</strong></td>
          <td style="text-align: left"><strong>90.8%</strong></td>
          <td style="text-align: left">88.9%</td>
          <td style="text-align: left">Sequence recall metric</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">97.6%</td>
          <td style="text-align: left">Structural similarity fingerprint</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Accuracy</strong></td>
          <td style="text-align: left"><strong>54.6%</strong></td>
          <td style="text-align: left">35.7%</td>
          <td style="text-align: left">Exact structure match rate</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/Drwr97/DGAT">DGAT</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation with training and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, R., Ji, Y., Li, Y., &amp; Lee, S.-T. (2025). Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition. <em>The Journal of Physical Chemistry Letters</em>, 16(50), 12787-12795. <a href="https://doi.org/10.1021/acs.jpclett.5c03057">https://doi.org/10.1021/acs.jpclett.5c03057</a></p>
<p><strong>Publication</strong>: The Journal of Physical Chemistry Letters 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Drwr97/DGAT">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wang2025dgat,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Rui and Ji, Yujin and Li, Youyong and Lee, Shuit-Tong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{The Journal of Physical Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{12787--12795}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jpclett.5c03057}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemVLM: A Multimodal Large Language Model for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemvlm/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemvlm/</guid><description>A 26B parameter multimodal LLM for chemistry, combining InternViT-6B and ChemLLM-20B for molecular structure recognition, property prediction, and reasoning.</description><content:encoded><![CDATA[<h2 id="paper-classification-method-and-resource">Paper Classification: Method and Resource</h2>
<p>This paper is a combination of <strong>Method</strong> (primary) and <strong>Resource</strong> (secondary).</p>
<p>It is primarily a <strong>Method</strong> paper because it proposes <strong>ChemVLM</strong>, a novel multimodal architecture specifically tailored for the chemical domain, utilizing a &ldquo;ViT-MLP-LLM&rdquo; framework. The authors introduce a specific two-stage training strategy to align visual features with chemical text representations.</p>
<p>Secondarily, it is a <strong>Resource</strong> paper as it introduces a comprehensive suite of three new datasets: <strong>ChemOCR</strong>, <strong>MMCR-Bench</strong>, and <strong>MMChemBench</strong>, developed to rigorously evaluate multimodal capabilities in chemistry, covering OCR, reasoning, and property prediction.</p>
<h2 id="bridging-the-visual-gap-in-chemical-llms">Bridging the Visual Gap in Chemical LLMs</h2>
<p>The primary motivation is that existing models struggle to handle the inherently multimodal nature of chemistry.</p>
<ul>
<li><strong>Visual Data Gap</strong>: Chemical tasks heavily rely on visual information (molecular structures, reactions) which purely text-based chemical LLMs cannot process.</li>
<li><strong>Limitations of Generalist Models</strong>: General multimodal models (like GPT-4V or LLaVA) lack specialized chemical domain knowledge, leading to hallucinations or misinterpretations.</li>
<li><strong>Inadequacy of OCR Tools</strong>: Traditional <a href="/notes/chemistry/optical-structure-recognition/">chemical OCR</a> tools (like <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a>) excel at modality conversion (Image-to-<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) but fail at complex reasoning tasks.</li>
</ul>
<h2 id="domain-specific-data-curation-and-benchmarking">Domain-Specific Data Curation and Benchmarking</h2>
<ul>
<li><strong>Data-Driven Alignment</strong>: The underlying &ldquo;ViT-MLP-LLM&rdquo; framework is standard in multimodal modeling, paralleling architectures like LLaVA. The core innovation here is the rigorous creation of a bilingual multimodal dataset spanning hand-drawn molecules, reactions, and exam questions augmented with style transfers. The training data pipeline heavily relies on generating synthetic variance using tools like RanDepict and <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> to introduce distortions, rotations, and handwritten styles, alongside GPT-4 generated prompts to ensure linguistic diversity.</li>
<li><strong>Model Integration</strong>: ChemVLM merges <strong>InternViT-6B</strong> (a large-scale vision transformer) with <strong><a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM-20B</a></strong> (a chemical language model). Visual features $X_v$ are mapped into the linguistic embedding space via an MLP projector, producing aligned token sequences alongside text instructions $X_q$. The joint multimodal sequence is trained using standard autoregressive next-token prediction:
$$ \mathcal{L} = -\sum_{i} \log P(y_i \mid X_v, X_q, y_{&lt;i}) $$</li>
<li><strong>Three Custom Benchmarks</strong>: The authors introduce tailored benchmarks to assess distinct competencies:
<ul>
<li><strong>ChemOCR</strong>: For image-to-SMILES conversion.</li>
<li><strong>MMCR-Bench</strong>: College entrance exam questions testing complex logical reasoning.</li>
<li><strong>MMChemBench</strong>: For molecule captioning and zero-shot property prediction.</li>
</ul>
</li>
</ul>
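<p>The training objective above is ordinary next-token cross-entropy over the concatenated multimodal sequence. A toy sketch of that computation in pure Python (the per-token probabilities here are invented for illustration; this is not the actual ChemVLM training code):</p>

```python
import math

def autoregressive_nll(step_probs):
    """L = -sum_i log P(y_i | X_v, X_q, y_<i).

    `step_probs` holds, for each target token y_i, the model's predicted
    probability of that token given the visual tokens X_v, the text
    instruction X_q, and all previous targets y_<i.
    """
    return -sum(math.log(p) for p in step_probs)

# Hypothetical per-token probabilities for a 4-token target sequence.
probs = [0.9, 0.7, 0.8, 0.95]
loss = autoregressive_nll(probs)
print(round(loss, 4))  # -> 0.7365
```

<p>Higher confidence on every target token drives the sum toward zero; a single near-zero probability dominates the loss, which is what pushes the model to align visual features with the correct chemical text.</p>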
<h2 id="evaluating-chemical-ocr-and-reasoning">Evaluating Chemical OCR and Reasoning</h2>
<p>The authors benchmarked ChemVLM against both open-source (LLaVA, Qwen-VL, InternVL) and proprietary (GPT-4V) models across three primary domains:</p>
<ol>
<li><strong>Chemical OCR</strong>: Evaluated on 1,000 image-text pairs from ChemOCR. The primary metric is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between the Morgan fingerprints of the generated structure ($A$) and the ground-truth SMILES ($B$):
$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$
They report both the average Tanimoto similarity and the strict exact-match rate (<code>Tanimoto@1.0</code>).</li>
<li><strong>Multimodal Chemical Reasoning (MMCR)</strong>: Tested on MMCR-Bench (1,000 exam questions), ScienceQA, and CMMU. Performance was scored based on accuracy for multiple-choice and fill-in-the-blank questions.</li>
<li><strong>Multimodal Molecule Understanding</strong>: Evaluated on MMChemBench for molecule captioning and property prediction.</li>
<li><strong>Text-Only Reasoning</strong>: Tested on SciBench, a text-only benchmark for university-level science, to ensure the model retains fundamental linguistic reasoning.</li>
<li><strong>Generalization</strong>: Tested on non-chemistry subjects within the CMMU framework (Biology, Physics, Math) to assess cross-domain competence.</li>
</ol>
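<p>The Tanimoto metric in item 1 reduces to a set operation over fingerprint bits. A minimal sketch using plain Python sets as stand-ins for the Morgan fingerprints the paper computes with RDKit:</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity T(A, B) = |A & B| / (|A| + |B| - |A & B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Toy "fingerprints": sets of on-bit indices standing in for Morgan bits.
pred = {1, 2, 3, 5, 8}
truth = {1, 2, 3, 5, 13}
sim = tanimoto(pred, truth)   # 4 shared bits, 6 distinct bits total
print(round(sim, 3))          # -> 0.667
exact = sim == 1.0            # the strict Tanimoto@1.0 criterion
```

<p>The average of <code>sim</code> over the test set gives the reported mean Tanimoto similarity, while the fraction of pairs with <code>sim == 1.0</code> gives <code>Tanimoto@1.0</code>.</p>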
<h2 id="performance-gains-and-existing-limitations">Performance Gains and Existing Limitations</h2>
<ul>
<li><strong>Multimodal Reasoning Leadership</strong>: ChemVLM achieved state-of-the-art results on MMCR-Bench (41.7%), surpassing generalist models like GPT-4V (40.1%). However, scoring for portions of these benchmarks relied heavily on an LLM-as-a-judge (the Qwen-max API), which can introduce bias as LLM evaluators often favor structural characteristics and verbosity produced by similar autoregressive models. Furthermore, the model was fine-tuned on 200,000 exam questions and tested on MMCR-Bench (also derived from Chinese college entrance exams). While the authors state the data was deduplicated, the potential for data leakage remains a significant unaddressed confounder.</li>
<li><strong>Superior Understanding</strong>: In molecule captioning and prediction, ChemVLM showed significant improvements over general baseline models, scoring 80.9% on prediction compared to GPT-4V&rsquo;s 38.6%. This is a natural consequence of testing a custom-trained model on domain-specific benchmarks.</li>
<li><strong>OCR Capabilities vs. Dedicated Tools</strong>: ChemVLM outperformed generalist MLLMs in chemical structure recognition, achieving an average Tanimoto similarity of 71.0% (vs. GPT-4V&rsquo;s 15.0%). However, it remains significantly inferior to pure structural OCR tools like MolScribe in strict modality conversion tasks, only achieving an exact structural match (<code>Tanimoto@1.0</code>) of 42.9% compared to MolScribe&rsquo;s 89.1%.</li>
<li><strong>Textual Retention and Generalization Claims</strong>: The authors claim the diverse training strategy imparts broad scientific reasoning, pointing to performance retention on non-chemistry subjects (Biology, Physics, Math) and strong results on the purely textual SciBench benchmark. However, this cross-domain generalization most likely stems from the underlying base model (ChemLLM-20B/InternLM2) or the inclusion of 1.3 million &ldquo;General&rdquo; visual QA pairs in their training blend, rather than emergent general scientific skills originating purely from learning chemistry representations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training and evaluation data relied on a mix of open-source repositories and custom curation. Many of the curated datasets have been formally released by the authors on Hugging Face (<a href="https://huggingface.co/datasets/di-zhang-fdu/chemvlm-sft-datasets"><code>di-zhang-fdu/chemvlm-sft-datasets</code></a>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Source/Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/">DECIMER HDM</a></strong></td>
          <td>7,000+ hand-drawn molecular images.</td>
      </tr>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong>MolScribe Data</strong></td>
          <td>Scanned/photographed images from literature.</td>
      </tr>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong>Synthetic</strong></td>
          <td>Generated via ChemDraw, RDKit, and Indigo with style transfer (blurring, rotation, handwritten styles).</td>
      </tr>
      <tr>
          <td><strong>Training (Reaction)</strong></td>
          <td><strong>PEACE &amp; USPTO-50K</strong></td>
          <td>Inorganic and organic reaction schemes.</td>
      </tr>
      <tr>
          <td><strong>Training (Reasoning)</strong></td>
          <td><strong>Exam Questions</strong></td>
          <td>200,000 questions from OpenDataLab (Chinese education level). <a href="https://huggingface.co/collections/di-zhang-fdu/multi-corpus-datasets-for-chemllm">Available on Hugging Face</a>.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>ChemOCR</strong></td>
          <td>1,000 bilingual image-text pairs for SMILES recognition. Released via Google Drive link in repo.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>MMCR-Bench</strong></td>
          <td>1,000 multimodal chemistry exam questions. <strong>Requires emailing authors directly for access.</strong></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>MMChemBench</strong></td>
          <td>Extension of <a href="/notes/chemistry/llm-applications/chembench-llm-chemistry-evaluation/">ChemBench</a> for captioning and property prediction. Released via Google Drive link in repo.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>: Images were augmented using <strong>RanDepict</strong> for style variation. Text data (SMILES) was validated and cleaned. Prompts were diversified using GPT-4 to generate different linguistic styles.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;ViT-MLP-LLM&rdquo; structure.
<ul>
<li><strong>Vision Encoder</strong>: InternViT-6B, processing images at $448 \times 448$ resolution. Images are segmented into tiles (max 12).</li>
<li><strong>Projector</strong>: Multi-Layer Perceptron (MLP) initialized randomly to map visual features to text embedding space.</li>
<li><strong>LLM</strong>: ChemLLM-20B, a domain-specific model.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Two-stage supervised fine-tuning.
<ol>
<li><strong>Modal Alignment</strong>: Freeze LLM and base Vision Encoder weights. Train only the randomly initialized MLP projector and LoRA layers (rank 32) of the Vision Encoder. Uses diverse multimodal data.</li>
<li><strong>Supervised Fine-Tuning (SFT)</strong>: Keep LLM and Vision Encoder base weights frozen, but add LoRA (rank 16) to the LLM and retain LoRA (rank 32) on the Vision Encoder. The MLP projector is fully trained. Data includes specialized chemistry and general corpora.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: AdamW</li>
<li>Context Length: 2048 tokens</li>
<li>Chat Template: InternLM2 dialogue schema</li>
</ul>
</li>
</ul>
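<p>The LoRA layers in both stages follow the standard low-rank adaptation idea: the frozen weight $W$ is perturbed by $\Delta W = BA$ with rank $r$ far smaller than the layer width, so only $B$ and $A$ are trained. A dimension-only sketch in plain Python (the paper uses ranks 32 and 16; everything else here is illustrative):</p>

```python
def matmul(A, B):
    """Plain-Python matrix product over lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_delta(B, A):
    """Low-rank weight update delta_W = B @ A, shape (d_out, d_in)."""
    return matmul(B, A)

d_out, d_in, rank = 8, 8, 2  # toy sizes; ChemVLM uses rank 32 / 16
B = [[0.0] * rank for _ in range(d_out)]  # zero-init: delta starts at 0
A = [[0.1] * d_in for _ in range(rank)]
delta = lora_delta(B, A)

# Only B and A are trained, not the full d_out x d_in matrix.
trainable = d_out * rank + rank * d_in
print(trainable, d_out * d_in)  # -> 32 64
```

<p>Zero-initializing $B$ means the adapted model starts exactly at the frozen base model, which is why stage 1 can train only the projector and vision-encoder LoRA without destabilizing the LLM.</p>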
<h3 id="models">Models</h3>
<ul>
<li><strong>ChemVLM-26B</strong>: The primary model released. It combines the 6B parameter vision encoder and the 20B parameter language model. Weights are fully available at <a href="https://huggingface.co/AI4Chem/ChemVLM-26B-1-2"><code>AI4Chem/ChemVLM-26B-1-2</code></a>. An 8B version is also available.</li>
<li><strong>Baselines</strong>: Comparisons were made against <strong>GPT-4V</strong>, <strong>Qwen-VL-Chat</strong>, <strong>LLaVA-v1.5-13B</strong>, <strong>InternVL-v1.5</strong>, and <strong>Yi-VL-Plus</strong>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured across three distinct task types. Exact <a href="https://github.com/lijunxian111/ChemVlm/tree/master/evaluation">evaluation scripts</a> have been released in the official repository.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto Similarity</strong></td>
          <td>ChemOCR</td>
          <td>Comparison of generated SMILES vs. ground truth using RDKit. Reports Average Similarity and <code>Tanimoto@1.0</code> (exact match).</td>
      </tr>
      <tr>
          <td><strong>Accuracy</strong></td>
          <td>MMCR (Reasoning)</td>
          <td>+1 point for correct multiple-choice/fill-in-the-blank; 0 otherwise. Scored via Qwen-max API prompting.</td>
      </tr>
      <tr>
          <td><strong>Prediction Score</strong></td>
          <td>Property Prediction</td>
          <td>Evaluated on MMChemBench subsets.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Compute</strong>: Training utilized <strong>16 NVIDIA A100 (80GB)</strong> GPUs.</li>
<li><strong>Configuration</strong>:
<ul>
<li>Batch size: 4 (per GPU, resulting in an effective global batch size of 256)</li>
<li>Gradient Accumulation: 4 iterations</li>
<li>Precision: <strong><a href="https://en.wikipedia.org/wiki/DeepSpeed">Deepspeed</a> bfloat16 (bf16)</strong> with <strong>ZeRO-3</strong> offloading strategy</li>
<li>Framework: Training runs on the InternVL-v1.5 codebase rather than standalone scripts.</li>
</ul>
</li>
<li><strong>Inference Compute</strong>: Evaluating the 26B model requires at least one 80GB A100 GPU (with Flash Attention + bfloat16). The 8B variant requires a GPU with at least 48GB of VRAM.</li>
</ul>
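<p>As a quick check, the effective global batch size quoted above follows from per-GPU batch &times; GPU count &times; gradient-accumulation steps:</p>

```python
# Effective global batch size implied by the reported ChemVLM setup.
per_gpu_batch = 4   # sequences per GPU per step
num_gpus = 16       # NVIDIA A100 80GB
grad_accum = 4      # gradient-accumulation iterations
effective = per_gpu_batch * num_gpus * grad_accum
print(effective)  # -> 256
```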
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemVLM-26B">ChemVLM-26B</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Original 26B model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemVLM-26B-1-2">ChemVLM-26B-1-2</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Updated 26B model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/di-zhang-fdu/chemvlm-sft-datasets">chemvlm-sft-datasets</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>SFT training data (~51.7k rows)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/lijunxian111/ChemVlm">ChemVlm (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, evaluation, and inference code</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, J., et al. (2025). ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(1), 415-423. <a href="https://doi.org/10.1609/aaai.v39i1.32020">https://doi.org/10.1609/aaai.v39i1.32020</a></p>
<p><strong>Publication</strong>: AAAI 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{li2025chemvlm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Junxian and Zhang, Di and Wang, Xunzhi and Hao, Zeying and Lei, Jingdi and Tan, Qian and Zhou, Cai and Liu, Wei and Yang, Yaotian and Xiong, Xinrui and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Li, Wei and Su, Mao and Zhang, Shufei and Ouyang, Wanli and Li, Yuqiang and Zhou, Dongzhan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{415--423}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://doi.org/10.1609/aaai.v39i1.32020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i1.32020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/lijunxian111/ChemVlm">Official Repository</a></li>
</ul>
]]></content:encoded></item><item><title>ChemReco: Hand-Drawn Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</guid><description>A deep learning method using EfficientNet and Transformer to convert hand-drawn chemical structures into SMILES codes, achieving 96.9% accuracy.</description><content:encoded><![CDATA[<h2 id="research-contribution--classification">Research Contribution &amp; Classification</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: The primary contribution is &ldquo;ChemReco,&rdquo; a specific deep learning pipeline (EfficientNet + Transformer) designed to solve the Optical Chemical Structure Recognition (OCSR) task for hand-drawn images. The authors conduct extensive ablation studies on architecture and data mixing ratios to validate performance.</li>
<li><strong>Resource</strong>: The authors explicitly state that &ldquo;the primary focus of this paper is constructing datasets&rdquo; due to the scarcity of hand-drawn molecular data. They introduce a comprehensive synthetic data generation pipeline involving RDKit modifications and image degradation to create training data.</li>
</ul>
<h2 id="motivation-digitizing-hand-drawn-chemical-sketches">Motivation: Digitizing Hand-Drawn Chemical Sketches</h2>
<p>Hand-drawing is the most intuitive method for chemists and students to record molecular structures. However, digitizing these drawings into machine-readable formats (like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) usually requires time-consuming manual entry or specialized software.</p>
<ul>
<li><strong>Gap</strong>: Existing OCSR tools and rule-based methods often fail on hand-drawn sketches due to diverse writing styles, poor image quality, and the absence of labeled data.</li>
<li><strong>Application</strong>: Automated recognition enables efficient chemical research and allows for automatic grading in educational settings.</li>
</ul>
<h2 id="core-innovation-synthetic-pipeline-and-hybrid-architecture">Core Innovation: Synthetic Pipeline and Hybrid Architecture</h2>
<p>The paper introduces <strong>ChemReco</strong>, an end-to-end system for recognizing C-H-O structures. Key novelties include:</p>
<ol>
<li><strong>Synthetic Data Pipeline</strong>: A multi-stage generation method that modifies RDKit source code to randomize bond/angle parameters, followed by OpenCV-based augmentation, degradation, and background addition to simulate realistic hand-drawn artifacts.</li>
<li><strong>Architectural Choice</strong>: The specific application of <strong>EfficientNet</strong> (encoder) combined with a <strong>Transformer</strong> (decoder) for this domain, which the authors demonstrate outperforms the more common ResNet+LSTM baselines.</li>
<li><strong>Hybrid Training Strategy</strong>: Finding that a mix of 90% synthetic and 10% real data yields optimal performance, superior to using either dataset alone.</li>
</ol>
<h2 id="methodology--ablation-studies">Methodology &amp; Ablation Studies</h2>
<p>The authors performed a series of ablation studies and comparisons:</p>
<ul>
<li><strong>Synthesis Ablation</strong>: Evaluated the impact of each step in the generation pipeline (RDKit only $\rightarrow$ Augmentation $\rightarrow$ Degradation $\rightarrow$ Background) on validation loss and accuracy.</li>
<li><strong>Dataset Size Ablation</strong>: Tested model performance when trained on synthetic datasets ranging from 100,000 to 1,000,000 images.</li>
<li><strong>Real/Synthetic Ratio</strong>: Investigated the optimal mixing ratio of synthetic to real hand-drawn images (100:0, 90:10, 50:50, 10:90, 0:100), finding that the 90:10 ratio achieved 93.81% exact match, compared to 63.33% for synthetic-only and 65.83% for real-only.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked four encoder-decoder combinations: ResNet vs. EfficientNet encoders paired with LSTM vs. Transformer decoders.</li>
<li><strong>Baseline Comparison</strong>: Compared results against a related study utilizing a CNN+LSTM framework.</li>
</ul>
<h2 id="results--interpretations">Results &amp; Interpretations</h2>
<ul>
<li><strong>Best Performance</strong>: The EfficientNet + Transformer model trained on a 90:10 synthetic-to-real ratio achieved a <strong>96.90% Exact Match</strong> rate on the test set.</li>
<li><strong>Background Robustness</strong>: When training on synthetic data alone (no real images), the best accuracy on background-free test images was approximately 46% (using RDKit-aug-deg), while background test images reached approximately 53% (using RDKit-aug-bkg-deg). Adding random backgrounds during training helped prevent the model from overfitting to clean white backgrounds.</li>
<li><strong>Data Volume</strong>: Increasing the synthetic dataset size from 100k to 1M consistently improved accuracy (average exact match: 49.40% at 100k, 54.29% at 200k, 61.31% at 500k, 63.33% at 1M, all without real images in training).</li>
<li><strong>Encoder-Decoder Comparison</strong> (at 90:10 mix with 1M images):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Encoder</th>
          <th style="text-align: left">Decoder</th>
          <th style="text-align: left">Avg. Exact Match (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">93.81</td>
      </tr>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left">94.76</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">96.31</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left"><strong>96.90</strong></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Superiority over Baselines</strong>: The model outperformed the cited CNN+LSTM baseline from ChemPix (93% vs 76% on the ChemPix test set).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Restricted atom types</strong>: The system only handles molecules composed of carbon, hydrogen, and oxygen (C-H-O), excluding nitrogen, sulfur, halogens, and other heteroatoms commonly found in organic chemistry.</li>
<li><strong>Structural complexity</strong>: Only structures with at most one ring are supported. Complex multi-ring systems and fused ring structures are not covered.</li>
<li><strong>Dataset availability</strong>: The real hand-drawn dataset (2,598 images) is not publicly released and is only available upon request from the corresponding author.</li>
<li><strong>Future directions</strong>: The authors suggest expanding to more heteroatoms, complex ring structures, and applications in automated grading of chemistry exams.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/a-die/hdr-DeepLearning">hdr-DeepLearning</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation in PyTorch</td>
      </tr>
      <tr>
          <td style="text-align: left">Paper</td>
          <td style="text-align: left">Publication</td>
          <td style="text-align: left">CC-BY-4.0</td>
          <td style="text-align: left">Open access via Nature</td>
      </tr>
  </tbody>
</table>
<p>The real hand-drawn dataset (2,598 images) is available upon request from the corresponding author, not publicly downloadable. The synthetic data generation pipeline is described in detail but relies on modified RDKit source code, which is included in the repository.</p>
<h3 id="data">Data</h3>
<p>The study utilizes a combination of collected SMILES data, real hand-drawn images, and generated synthetic images.</p>
<ul>
<li><strong>Source Data</strong>: SMILES codes collected from PubChem, ZINC, <a href="/notes/chemistry/datasets/gdb-11/">GDB-11</a>, and <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>. Filtered for C, H, O atoms and max 1 ring.</li>
<li><strong>Real Dataset</strong>: 670 selected SMILES codes drawn by multiple volunteers, totaling <strong>2,598 images</strong>.</li>
<li><strong>Synthetic Dataset</strong>: Generated up to <strong>1,000,000 images</strong> using the pipeline below.</li>
<li><strong>Training Mix</strong>: The optimal training set used 1 million images with a <strong>90:10 ratio</strong> of synthetic to real images.</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset Type</th>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Real</strong></td>
          <td style="text-align: left">Volunteer Drawings</td>
          <td style="text-align: left">2,598 images</td>
          <td style="text-align: left">Used for mixed training and testing</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Synthetic</strong></td>
          <td style="text-align: left">Generated</td>
          <td style="text-align: left">100k - 1M</td>
          <td style="text-align: left">Generated via modified RDKit + OpenCV augmentation/degradation; optionally enhanced with Stable Diffusion</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>Synthetic Image Generation Pipeline</strong> is critical for reproduction:</p>
<ol>
<li><strong>RDKit Modification</strong>: Modify source code to introduce random keys, character width, length, and bond angles.</li>
<li><strong>Augmentation (OpenCV)</strong>: Apply sequence: Resize ($p=0.5$), Blur ($p=0.4$), Erode/Dilate ($p=0.2$), Distort ($p=0.8$), Flip ($p=0.5$), Affine ($p=0.7$).</li>
<li><strong>Degradation</strong>: Apply sequence: Salt+pepper noise ($p=0.1$), Contrast ($p=0.7$), Sharpness ($p=0.5$), Invert ($p=0.3$).</li>
<li><strong>Background Addition</strong>: Random backgrounds are augmented (Crop, Distort, Flip) and added to the molecular image to prevent background overfitting.</li>
<li><strong>Diffusion Enhancement</strong>: Stable Diffusion (v1-4) is used for image-to-image enhancement to better simulate hand-drawn styles (prompt: &ldquo;A pencil sketch of [Formula]&hellip; without charge distribution&rdquo;).</li>
</ol>
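<p>Steps 2 and 3 apply each transform independently with a fixed probability, in sequence. A minimal sketch of that sampling logic (transform names and probabilities are from the paper; the transform bodies are stubbed out, whereas the real pipeline calls OpenCV):</p>

```python
import random

# (name, probability) pairs for the augmentation and degradation stages.
AUGMENT = [("resize", 0.5), ("blur", 0.4), ("erode_dilate", 0.2),
           ("distort", 0.8), ("flip", 0.5), ("affine", 0.7)]
DEGRADE = [("salt_pepper", 0.1), ("contrast", 0.7),
           ("sharpness", 0.5), ("invert", 0.3)]

def apply_pipeline(image, stages, rng):
    """Fire each transform with its probability, in order; return the
    image plus the list of transforms that fired (bodies are stubs)."""
    applied = []
    for name, p in stages:
        if rng.random() < p:
            applied.append(name)  # real code would transform `image` here
    return image, applied

rng = random.Random(0)
_, fired = apply_pipeline("molecule.png", AUGMENT + DEGRADE, rng)
print(fired)
```

<p>Because each transform is sampled independently, the pipeline yields a combinatorially large space of distortion combinations from a single rendered molecule, which is what lets 1M synthetic images cover varied hand-drawn styles.</p>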
<h3 id="models">Models</h3>
<p>The system uses an encoder-decoder architecture:</p>
<ul>
<li><strong>Encoder</strong>: <strong>EfficientNet</strong> (pre-trained on ImageNet). The last layer is removed, and features are extracted as a NumPy array.</li>
<li><strong>Decoder</strong>: <strong>Transformer</strong>. Utilizes self-attention to generate the SMILES sequence. Chosen over LSTM for better handling of long-range dependencies.</li>
<li><strong>Output</strong>: Canonical SMILES string.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: <strong>Exact Match (EM)</strong>. A strict binary evaluation checking whether the complete generated SMILES perfectly replicates the target string.</li>
<li><strong>Other Metrics</strong>: <strong>Levenshtein Distance</strong> measures edit-level character proximity, while the <strong>Tanimoto coefficient</strong> evaluates structural similarity based on chemical fingerprints. Both were monitored during validation ablation runs.</li>
</ul>
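<p>Exact match and Levenshtein distance are both plain string metrics over the predicted and target SMILES. A compact reference implementation of the standard dynamic-programming edit distance (not the authors' code):</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

pred, truth = "CCO", "CC=O"
print(pred == truth)             # exact match -> False
print(levenshtein(pred, truth))  # one insertion -> 1
```

<p>Exact match is the binary <code>pred == truth</code> check averaged over the test set; Levenshtein gives partial credit for near-miss SMILES, while the Tanimoto coefficient instead compares the decoded structures via fingerprints.</p>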
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Baseline (CNN+LSTM)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Exact Match</strong></td>
          <td style="text-align: left"><strong>96.90%</strong></td>
          <td style="text-align: left">76%</td>
          <td style="text-align: left">Tested on the provided test set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>CPU</strong>: Intel(R) Xeon(R) Gold 6130 (40 GB RAM).</li>
<li><strong>GPU</strong>: NVIDIA Tesla V100 (32 GB video memory).</li>
<li><strong>Framework</strong>: PyTorch 1.9.1.</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Optimizer: Adam (learning rate 1e-4).</li>
<li>Batch size: 32.</li>
<li>Epochs: 100.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, H., Liu, W., Tao, J., et al. (2024). ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning. <em>Scientific Reports</em>, 14, 17126. <a href="https://doi.org/10.1038/s41598-024-67496-7">https://doi.org/10.1038/s41598-024-67496-7</a></p>
<p><strong>Publication</strong>: Scientific Reports 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/a-die/hdr-DeepLearning">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ouyangChemRecoAutomatedRecognition2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemReco: Automated Recognition of Hand-Drawn Carbon--Hydrogen--Oxygen Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Hengjie and Liu, Wei and Tao, Jiajun and Luo, Yanghong and Zhang, Wanjia and Zhou, Jiayu and Geng, Shuqi and Zhang, Chengpeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{17126}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-024-67496-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AtomLenz: Atom-Level OCSR with Limited Supervision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</guid><description>Weakly supervised OCSR framework combining object detection and graph construction to recognize chemical structures from hand-drawn images using SMILES.</description><content:encoded><![CDATA[<h2 id="dual-contribution-method-and-data-resource">Dual Contribution: Method and Data Resource</h2>
<p>The paper proposes an architecture (AtomLenz) and training framework (ProbKT* + Edit-Correction) to solve the problem of Optical Chemical Structure Recognition (OCSR) in data-sparse domains. It also releases a curated, relabeled dataset of hand-drawn molecules with atom-level bounding box annotations.</p>
<h2 id="overcoming-annotation-bottlenecks-in-ocsr">Overcoming Annotation Bottlenecks in OCSR</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for digitizing chemical literature and lab notes. However, existing methods face three main limitations:</p>
<ol>
<li><strong>Generalization Limits:</strong> They struggle with sparse or stylistically unique domains, such as hand-drawn images, where massive datasets for pretraining are unavailable.</li>
<li><strong>Annotation Cost:</strong> &ldquo;Atom-level&rdquo; methods (which detect individual atoms and bonds) require expensive bounding box annotations, which are rarely available for real-world sketch data.</li>
<li><strong>Lack of Interpretability/Localization:</strong> Pure &ldquo;Image-to-SMILES&rdquo; models (like DECIMER) work well but fail to localize the atoms or bonds in the original image, limiting human-in-the-loop review and mechanistic interpretability.</li>
</ol>
<h2 id="atomlenz-probkt-and-graph-edit-correction">AtomLenz, ProbKT*, and Graph Edit-Correction</h2>
<p>The core contribution is <strong>AtomLenz</strong>, an OCSR framework that achieves atom-level entity detection using <strong>only SMILES supervision</strong> on target domains. The authors construct an explicit object detection pipeline using Faster R-CNN trained via a composite multi-task loss. The objective aims to optimize a multi-class log loss $L_{cls}$ for predicted class $\hat{c}$ and a regression loss $L_{reg}$ for predicted bounding box coordinates $\hat{b}$:</p>
<p>$$ \mathcal{L} = L_{cls}(c, \hat{c}) + L_{reg}(b, \hat{b}) $$</p>
<p>To bridge the gap between image inputs and the weakly supervised SMILES labels, the system leverages:</p>
<ul>
<li><strong>ProbKT* (Probabilistic Knowledge Transfer):</strong> Uses probabilistic logic and Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; derived from the SMILES strings, enabling backpropagation without explicit bounding boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Generates pseudo-labels by solving an optimization problem that finds the smallest edit on the predicted graph such that the corrected graph and the ground-truth SMILES graph become isomorphic, which forces fine-tuning on less frequent atom types. The combination of ProbKT* and Edit-Correction is abbreviated as <strong>EditKT</strong>*.</li>
<li><strong>ChemExpert:</strong> A chemically sound ensemble strategy that cascades predictions from multiple models (e.g., passing through DECIMER, then AtomLenz), halting at the first output that clears basic RDKit chemical validity checks.</li>
</ul>
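<p>The cascade logic of ChemExpert can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the toy validity check are ours (the paper relies on RDKit parsing, roughly <code>Chem.MolFromSmiles(s) is not None</code>, for the validity gate).</p>

```python
# Illustrative ChemExpert-style cascade: run predictors in priority order and
# return the first prediction that passes a chemical-validity check.

def chem_expert(image, predictors, is_valid):
    """predictors: ordered list of (name, image -> SMILES) pairs."""
    for name, predict in predictors:
        smiles = predict(image)
        if smiles is not None and is_valid(smiles):
            return name, smiles
    return None, None  # no model produced a chemically valid structure

# Toy validity check standing in for RDKit's SMILES parser.
def toy_is_valid(smiles):
    return len(smiles) > 0 and smiles.count("(") == smiles.count(")")

# Toy predictors: the first returns an invalid string, so the cascade falls through.
predictors = [
    ("decimer", lambda img: "C1=CC=CC1("),   # unbalanced parenthesis -> rejected
    ("atomlenz", lambda img: "c1ccccc1"),    # benzene -> accepted
]
winner, smiles = chem_expert("fake_image", predictors, toy_is_valid)
```

<p>Because the check only gates on validity, the ordering of <code>predictors</code> encodes the user's trust in each model, exactly as described above.</p>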
<h2 id="data-efficiency-and-domain-adaptation-experiments">Data Efficiency and Domain Adaptation Experiments</h2>
<p>The authors evaluated the model specifically on domain adaptation and sample efficiency, treating hand-drawn molecules as the primary low-data target distribution:</p>
<ul>
<li><strong>Pretraining:</strong> Initially trained on ~214k synthetic images from ChEMBL explicitly labeled with bounding boxes (generated via RDKit).</li>
<li><strong>Target Domain Adaptation:</strong> Fine-tuned on the Brinkhaus hand-drawn dataset (4,070 images) using purely SMILES supervision.</li>
<li><strong>Evaluation Sets:</strong>
<ul>
<li><strong>Hand-drawn test set</strong>: 1,018 images.</li>
<li><strong>ChemPix</strong>: 614 out-of-domain hand-drawn images.</li>
<li><strong>Atom Localization set</strong>: 1,000 synthetic images to evaluate precise bounding box capabilities.</li>
</ul>
</li>
<li><strong>Baselines:</strong> Compared against leading OCSR methods, including DECIMER (v2.2.0), Img2Mol, MolScribe, ChemGrapher, and OSRA.</li>
</ul>
<h2 id="state-of-the-art-ensembles-vs-standalone-limitations">State-of-the-Art Ensembles vs. Standalone Limitations</h2>
<ul>
<li><strong>SOTA Ensemble Performance:</strong> The <strong>ChemExpert</strong> module (combining AtomLenz and DECIMER) achieved state-of-the-art accuracy on both hand-drawn (63.5%) and ChemPix (51.8%) test sets.</li>
<li><strong>Data Efficiency under Bottleneck Regimes:</strong> AtomLenz effectively bypassed the massive data constraints of competing models. When all methods were retrained from scratch on the same 4,070-sample hand-drawn training set (enriched with atom-level annotations from EditKT*), AtomLenz achieved 33.8% exact accuracy, outperforming baselines like Img2Mol (0.0%), MolScribe (1.3%), and DECIMER (0.1%), illustrating its sample efficiency.</li>
<li><strong>Localization Success:</strong> The base framework achieved strong localization (mAP 0.801), a capability not provided by end-to-end transformers like DECIMER.</li>
<li><strong>Methodological Tradeoffs:</strong> While AtomLenz is highly sample efficient, its standalone performance when fine-tuned on the target domain (33.8% accuracy) underperforms fine-tuned models trained on larger datasets like DECIMER (62.2% accuracy). AtomLenz achieves state-of-the-art results primarily when deployed as part of the ChemExpert ensemble alongside DECIMER, since errors from the two approaches tend to occur on different samples, allowing them to complement each other.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz">Official Repository (AtomLenz)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Complete pipeline for AtomLenz, ProbKT*, and Graph Edit-Correction.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz/tree/main/models">Pre-trained Models</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Downloadable weights for Faster R-CNN detection backbones.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset (Brinkhaus)</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Images and SMILES used for target domain fine-tuning and evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599172">Relabeled Hand-drawn Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">1,417 images with bounding box annotations generated via EditKT*.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/spaces/moldenhof/atomlenz">AtomLenz Web Demo</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Interactive Hugging Face space for testing model inference.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilizes a mix of large synthetic datasets and smaller curated hand-drawn datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>Synthetic ChEMBL</td>
          <td>~214,000</td>
          <td>Generated via RDKit/Indigo. Annotated with atoms, bonds, charges, stereocenters.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-drawn (Brinkhaus)</td>
          <td>4,070</td>
          <td>Used for weakly supervised adaptation (SMILES only).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Hand-drawn Test</td>
          <td>1,018</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChemPix</td>
          <td>614</td>
          <td>Out-of-distribution hand-drawn images.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Atom Localization</td>
          <td>1,000</td>
          <td>Synthetic images with ground truth bounding boxes.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Molecular Graph Constructor (Algorithm 1):</strong> A rule-based system to assemble the graph from detected objects:
<ol>
<li><strong>Filtering:</strong> Removes overlapping atom boxes (IoU threshold).</li>
<li><strong>Node Creation:</strong> Merges overlapping charge and stereocenter objects with their corresponding atom objects.</li>
<li><strong>Edge Creation:</strong> Iterates over bond objects; if a bond overlaps with exactly two atoms, an edge is added. If &gt;2, it selects the most probable pair.</li>
<li><strong>Validation:</strong> Checks valency constraints; removes bonds iteratively if constraints are violated.</li>
</ol>
</li>
<li><strong>Weakly Supervised Training:</strong>
<ul>
<li><strong>ProbKT*:</strong> Uses Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; implied by the SMILES string, allowing backpropagation without explicit boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Finds the smallest edit on the predicted graph such that the corrected and true SMILES graphs become isomorphic, then uses the correction to generate pseudo-labels for retraining.</li>
</ul>
</li>
</ul>
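<p>A condensed sketch of the graph constructor's filtering and edge-creation steps follows. The box format <code>(x1, y1, x2, y2)</code> and the IoU threshold are illustrative assumptions, and the "most probable pair" tie-break for bonds overlapping more than two atoms is omitted for brevity.</p>

```python
# Rule-based graph assembly from detected boxes (simplified Algorithm 1 sketch).

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def build_graph(atom_boxes, bond_boxes, iou_thresh=0.5):
    atoms = []
    for box in atom_boxes:                 # step 1: drop duplicate atom detections
        if all(iou(box, kept) < iou_thresh for kept in atoms):
            atoms.append(box)
    edges = []
    for bond in bond_boxes:                # step 3: bond overlapping exactly 2 atoms
        hits = [i for i, a in enumerate(atoms) if iou(bond, a) > 0]
        if len(hits) == 2:
            edges.append(tuple(hits))
    return atoms, edges

atoms, edges = build_graph(
    [(0, 0, 10, 10), (1, 1, 10, 10), (20, 0, 30, 10)],  # second box duplicates the first
    [(8, 2, 22, 8)],                                     # one bond spanning both atoms
)
```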
<h3 id="models">Models</h3>
<ul>
<li><strong>Object Detection Backbone:</strong> <strong>Faster R-CNN</strong>.
<ul>
<li>Four distinct models are trained for different entity types: Atoms ($O^a$), Bonds ($O^b$), Charges ($O^c$), and Stereocenters ($O^s$).</li>
<li><strong>Loss Function:</strong> Multi-task loss combining Multi-class Log Loss ($L_{cls}$) and Regression Loss ($L_{reg}$).</li>
</ul>
</li>
<li><strong>ChemExpert:</strong> An ensemble wrapper that prioritizes models based on user preference (e.g., DECIMER first, then AtomLenz). It accepts the first prediction that passes RDKit chemical validity checks.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Primary metrics focused on structural correctness and localization accuracy.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Hand-drawn)</th>
          <th>Baseline (DECIMER FT)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (T=1)</strong></td>
          <td>33.8% (AtomLenz+EditKT*)</td>
          <td>62.2%</td>
          <td>Exact ECFP6 fingerprint match.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto Sim.</strong></td>
          <td>0.484</td>
          <td>0.727</td>
          <td>Average similarity.</td>
      </tr>
      <tr>
          <td><strong>mAP</strong></td>
          <td>0.801</td>
          <td>N/A</td>
          <td>Localization accuracy (IoU 0.05-0.35).</td>
      </tr>
      <tr>
          <td><strong>Ensemble Acc.</strong></td>
          <td><strong>63.5%</strong></td>
          <td>62.2%</td>
          <td>ChemExpert (DECIMER + AtomLenz).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Experiments utilized the Flemish Supercomputer Center (VSC) resources.</li>
<li><strong>Note:</strong> Specific GPU models (e.g., A100/V100) are not explicitly detailed in the text, but Faster R-CNN training is standard on consumer or enterprise GPUs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., De Brouwer, E., Arany, Á., &amp; Moreau, Y. (2024). Atom-Level Optical Chemical Structure Recognition with Limited Supervision. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2024.</p>
<p><strong>Publication venue/year</strong>: CVPR 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molden/atomlenz">Official Repository</a></li>
<li><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset on Figshare</a></li>
</ul>
<p><strong>BibTeX</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{oldenhofAtomLevelOpticalChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Atom-Level Optical Chemical Structure Recognition with Limited Supervision}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Oldenhof, Martijn and De Brouwer, Edward and Arany, {\&#39;A}d{\&#39;a}m and Moreau, Yves}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2404.01743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SwinOCSR: End-to-End Chemical OCR with Swin Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</guid><description>Deep learning model using Swin Transformer and Focal Loss for OCSR, achieving 98.58% accuracy on synthetic benchmarks.</description><content:encoded><![CDATA[<h2 id="contribution-methodological-architecture-and-datasets">Contribution: Methodological Architecture and Datasets</h2>
<p>This is a <strong>Methodological Paper</strong> with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).</li>
<li><strong>Resource</strong>: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.</li>
</ul>
<h2 id="motivation-addressing-visual-context-and-data-imbalance">Motivation: Addressing Visual Context and Data Imbalance</h2>
<ul>
<li><strong>Problem</strong>: OCSR (converting images of chemical structures to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.</li>
<li><strong>Technical Gap</strong>: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and miss global dependencies required for interpreting complex molecular diagrams.</li>
<li><strong>Data Imbalance</strong>: Chemical strings suffer from severe class imbalance (e.g., &lsquo;C&rsquo; and &lsquo;H&rsquo; are frequent; &lsquo;Br&rsquo; or &lsquo;Cl&rsquo; are rare), which causes standard Cross Entropy loss to underperform.</li>
</ul>
<h2 id="core-innovation-swin-transformers-and-focal-loss">Core Innovation: Swin Transformers and Focal Loss</h2>
<ul>
<li><strong>Swin Transformer Backbone</strong>: SwinOCSR replaces the standard CNN backbone with a <strong>Swin Transformer</strong>, using shifted window attention to capture both local and global image features more effectively.</li>
<li><strong>Multi-label Focal Loss (MFL)</strong>: The paper introduces a modified Focal Loss to OCSR, the first explicit attempt to address token imbalance in OCSR (per the authors). This penalizes the model for errors on rare tokens, addressing the &ldquo;long-tail&rdquo; distribution of chemical elements. The standard Focal Loss formulation heavily weights hard-to-classify examples:
$$
FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
$$</li>
<li><strong>Structured Synthetic Dataset</strong>: Creation of a dataset explicitly balanced across four structural categories: Kekule rings, Aromatic rings, and their combinations with substituents.</li>
</ul>
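<p>The focusing behavior of the loss above is easy to verify numerically. The sketch below implements the scalar formulation as written (the paper's MFL variant applies it per token with sigmoid outputs); the <code>alpha</code>/<code>gamma</code> values are the common defaults, not necessarily the paper's.</p>

```python
import math

# Scalar focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).
# The (1 - p_t)^gamma factor suppresses loss on well-classified tokens,
# leaving gradient signal concentrated on rare, hard tokens.

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9)   # confident, correct token: loss is nearly zero
hard = focal_loss(0.1)   # misclassified rare token: loss remains large
```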
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Backbone Comparison</strong>: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).</li>
<li><strong>Loss Function Ablation</strong>: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).</li>
<li><strong>Category Stress Test</strong>: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.</li>
<li><strong>Real-world Evaluation</strong>: The model was tested on 100 images manually extracted from the literature (with manually labeled SMILES), and separately on 100 CDK-generated images from those same SMILES, to measure the domain gap between synthetic and real-world data.</li>
</ul>
<h2 id="results-and-limitations">Results and Limitations</h2>
<ul>
<li><strong>Synthetic test set performance</strong>: With Multi-label Focal Loss (MFL), SwinOCSR achieved <strong>98.58% accuracy</strong> on the synthetic test set, compared to 97.36% with standard CE loss. Both ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones scored lower when using CE loss (Table 3).</li>
<li><strong>Handling of long sequences</strong>: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating effective global feature extraction.</li>
<li><strong>Per-category results</strong>: Performance was consistent across molecule categories: Category 1 (Kekule, 98.20%), Category 2 (Aromatic, 98.46%), Category 3 (Kekule + Substituents, 98.76%), Category 4 (Aromatic + Substituents, 98.89%). The model performed slightly better on molecules with substituents and aromatic rings.</li>
<li><strong>Domain shift</strong>: While performance on synthetic data was strong, accuracy dropped to <strong>25%</strong> on 100 real-world literature images. On 100 CDK-generated images from the same SMILES strings, accuracy was 94%, confirming that the gap stems from stylistic differences between CDK-rendered and real-world images. The authors attribute this to noise, low resolution, and variations such as condensed structural formulas and abbreviations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The first 8.5 million structures from <strong>PubChem</strong> were downloaded, yielding ~6.9 million unique SMILES.</li>
<li><strong>Generation Pipeline</strong>:
<ul>
<li><strong>Tools</strong>: <strong>CDK</strong> (Chemistry Development Kit) for image rendering; <strong>RDKit</strong> for SMILES canonicalization.</li>
<li><strong>Augmentation</strong>: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekule, (2) Aromatic, (3) Kekule + Substituents, (4) Aromatic + Substituents. Substituents were randomly added from a list of 224 common patent substituents.</li>
<li><strong>Preprocessing</strong>: Images rendered as binary, resized to <strong>224x224</strong>, and copied to 3 channels (RGB simulation).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>4,500,000</td>
          <td>18:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
      <tr>
          <td>Test</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Loss Function</strong>: <strong>Multi-label Focal Loss (MFL)</strong>. The single-label classification task was cast as multi-label to apply Focal Loss, using a sigmoid activation on logits.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Optimizer</strong>: <strong>Adam</strong> with initial learning rate <code>5e-4</code>.</li>
<li><strong>Schedulers</strong>: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.</li>
<li><strong>Regularization</strong>: Dropout rate of <code>0.1</code>.</li>
</ul>
</li>
</ul>
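<p>The two-scheduler setup can be made concrete with plain decay functions. This is an illustrative sketch: the cosine form is standard, but the step size and decay factor below are assumptions, since the paper does not report them.</p>

```python
import math

# Cosine decay (used for the Swin backbone) and step decay (used for the
# Transformer encoder/decoder), as pure functions of the epoch index.

def cosine_decay(base_lr, epoch, total_epochs):
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

def step_decay(base_lr, epoch, step=10, factor=0.5):
    return base_lr * factor ** (epoch // step)

base_lr = 5e-4                                    # initial Adam learning rate
lr_backbone_mid = cosine_decay(base_lr, 15, 30)   # halfway through 30 epochs
lr_decoder_mid = step_decay(base_lr, 15)          # after one decay step
```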
<h3 id="models">Models</h3>
<ul>
<li><strong>Backbone (Encoder 1)</strong>: <strong>Swin Transformer</strong>.
<ul>
<li>Patch size: $4 \times 4$.</li>
<li>Linear embedding dimension: 192.</li>
<li>Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).</li>
<li>Output: Flattened patch sequence $S_b$.</li>
</ul>
</li>
<li><strong>Transformer Encoder (Encoder 2)</strong>: 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.</li>
<li><strong>Transformer Decoder</strong>: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).</li>
<li><strong>Tokenization</strong>: <strong>DeepSMILES</strong> format used (syntactically more robust than SMILES). Vocabulary size: <strong>76 tokens</strong> (the unique characters found in the dataset). Embedding dimension: 256.</li>
</ul>
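<p>Building a character-level vocabulary of the kind described above is a one-liner over the training strings. The special tokens in this sketch are our assumption; the paper only reports the 76-token size.</p>

```python
# Character-level vocabulary over DeepSMILES strings: special tokens first,
# then every distinct character observed in the corpus, sorted for determinism.

def build_vocab(strings, specials=("<pad>", "<sos>", "<eos>")):
    chars = sorted({ch for s in strings for ch in s})
    return {tok: idx for idx, tok in enumerate(list(specials) + chars)}

# "cccccc6" is the DeepSMILES rendering of benzene (ring size replaces the
# paired ring-closure digits of SMILES).
vocab = build_vocab(["CCO", "cccccc6"])
```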
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy (Exact Match), Tanimoto Similarity (PubChem fingerprints), BLEU, ROUGE.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SwinOCSR (CE)</th>
          <th>SwinOCSR (MFL)</th>
          <th>ResNet-50 (CE)</th>
          <th>EfficientNet-B3 (CE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>97.36%</td>
          <td><strong>98.58%</strong></td>
          <td>89.17%</td>
          <td>86.70%</td>
      </tr>
      <tr>
          <td>Tanimoto</td>
          <td>99.65%</td>
          <td><strong>99.77%</strong></td>
          <td>98.79%</td>
          <td>98.46%</td>
      </tr>
      <tr>
          <td>BLEU</td>
          <td>99.46%</td>
          <td><strong>99.59%</strong></td>
          <td>98.62%</td>
          <td>98.37%</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>99.64%</td>
          <td><strong>99.78%</strong></td>
          <td>98.87%</td>
          <td>98.66%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Trained on <strong>NVIDIA Tesla V100-PCIE</strong>.</li>
<li><strong>Training Time</strong>: 30 epochs.</li>
<li><strong>Batch Size</strong>: 256 images ($224 \times 224$ pixels).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/suanfaxiaohuo/SwinOCSR">SwinOCSR</a></td>
          <td>Code + Data</td>
          <td>Unknown</td>
          <td>Official implementation with dataset and trained models</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. <em>Journal of Cheminformatics</em>, 14(41). <a href="https://doi.org/10.1186/s13321-022-00624-5">https://doi.org/10.1186/s13321-022-00624-5</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/suanfaxiaohuo/SwinOCSR">GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>One Strike, You're Out: Detecting Markush Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/</guid><description>Patch-based CNN method for detecting Markush structures in chemical images, addressing low signal-to-noise ratios in OCSR.</description><content:encoded><![CDATA[<h2 id="methodology-and-classification">Methodology and Classification</h2>
<p>This is a <strong>Method</strong> paper (Classification: $\Psi_{\text{Method}}$).</p>
<p>It proposes a patch-based classification pipeline to solve a technical failure mode in Optical Chemical Structure Recognition (OCSR). Distinct rhetorical indicators include a baseline comparison (CNN vs. traditional ORB), ablation studies (architecture, pretraining), and a focus on evaluating the filtering efficacy against a known failure mode.</p>
<h2 id="the-markush-structure-challenge">The Markush Structure Challenge</h2>
<p><strong>The Problem</strong>: Optical Chemical Structure Recognition (OCSR) tools convert 2D images of molecules into machine-readable formats. These tools struggle with &ldquo;Markush structures,&rdquo; generic structural templates used frequently in patents that contain variables rather than specific atoms (e.g., $R$, $X$, $Y$).</p>
<p><strong>The Gap</strong>: Markush structures are difficult to detect because they often appear as small indicators (a single &ldquo;R&rdquo; or variable) within a large image, resulting in a very low Signal-to-Noise Ratio (SNR). Existing OCSR research pipelines typically bypass this by manually excluding these structures from their datasets.</p>
<p><strong>The Goal</strong>: To build an automated filter that can identify images containing Markush structures so they can be removed from OCSR pipelines, improving overall database quality without requiring manual data curation.</p>
<h2 id="patch-based-classification-pipeline">Patch-Based Classification Pipeline</h2>
<p>The core technical contribution is an end-to-end deep learning pipeline tailored for low-SNR chemical images where standard global resizing or cropping fails due to large variations in image resolution and pixel scales.</p>
<ul>
<li><strong>Patch Generation</strong>: The system slices input images into overlapping patches generated from two offset grids, ensuring that variables falling on boundaries are fully captured in at least one crop.</li>
<li><strong>Targeted Annotation</strong>: The labels rely on pixel-level bounding boxes around Markush indicators, minimizing the noise that would otherwise overwhelm a full-image classification attempt.</li>
<li><strong>Inference Strategy</strong>: During inference, the query image is broken into patches, each patch is classified individually, and the patch scores are aggregated with a maximum-pooling rule, $X = \max_{i=1}^{n} \{ x_i \}$.</li>
<li><strong>Evaluation</strong>: Provides the first systematic comparison between fixed-feature extraction (ORB + XGBoost) and end-to-end deep learning for this specific domain.</li>
</ul>
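<p>The two-offset-grid patching and max-pool aggregation can be sketched as follows. The patch size and the half-patch offset are illustrative assumptions, not values from the paper.</p>

```python
# Two offset grids of patch corners: an indicator cut by a boundary of one
# grid falls inside the interior of a patch from the other grid.

def patch_coords(width, height, size=256):
    coords = []
    for offset in (0, size // 2):
        for y in range(offset, max(height - size, 0) + 1, size):
            for x in range(offset, max(width - size, 0) + 1, size):
                coords.append((x, y))
    return coords

def classify_image(patch_scores):
    """Image-level score is the max over patch scores: X = max_i x_i."""
    return max(patch_scores)

coords = patch_coords(512, 512)                   # 4 grid-one patches + 1 offset patch
score = classify_image([0.05, 0.10, 0.92, 0.03])  # one positive patch flags the image
```

<p>The max rule makes the filter deliberately trigger-happy: a single confident patch marks the whole image as Markush, which suits the &ldquo;one strike, you're out&rdquo; filtering goal.</p>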
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared two distinct paradigms on a manually annotated dataset:</p>
<ol>
<li>
<p><strong>Fixed-Feature Baseline</strong>: Used <strong>ORB</strong> (Oriented FAST and Rotated BRIEF) to detect keypoints and match them against a template bank of known Markush symbols. Features (match counts, Hamming distances) were fed into an <strong>XGBoost</strong> model.</p>
</li>
<li>
<p><strong>Deep Learning Method</strong>: Fine-tuned <strong>ResNet18</strong> and <strong>Inception V3</strong> models on the generated image patches.</p>
<ul>
<li><strong>Ablations</strong>: Contrasted pretraining sources, evaluating general domain (ImageNet) against chemistry-specific domain (USPTO images).</li>
<li><strong>Fine-tuning</strong>: Compared full-network fine-tuning against freezing all but the fully connected layers.</li>
</ul>
</li>
</ol>
<p>To handle significant class imbalance, the primary evaluation metric was the Macro F1 score, defined as:</p>
<p>$$ \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{precision}_i \cdot \text{recall}_i}{\text{precision}_i + \text{recall}_i} $$</p>
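<p>The definition above translates directly into code: compute F1 per class, then average uniformly so the rare Markush class counts as much as the majority class. The counts in the example are made up for illustration.</p>

```python
# Macro F1 from per-class (precision, recall) pairs, averaged uniformly.

def macro_f1(per_class):
    f1s = [2 * p * r / (p + r) if (p + r) else 0.0 for p, r in per_class]
    return sum(f1s) / len(f1s)

# A strong majority class cannot mask a weak minority class.
score = macro_f1([(0.95, 0.90), (0.60, 0.50)])
```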
<h2 id="performance-outcomes">Performance Outcomes</h2>
<ul>
<li>
<p><strong>CNN vs. ORB</strong>: Deep learning architectures outperformed the fixed-feature baseline. The best model (<strong>Inception V3</strong> pretrained on ImageNet) achieved an image-level Macro F1 of <strong>0.928</strong>, compared to <strong>0.701</strong> (image-level) for the ORB baseline, and a patch-level Macro F1 of <strong>0.917</strong>.</p>
</li>
<li>
<p><strong>The Pretraining Surprise</strong>: Counterintuitively, ImageNet pretraining consistently outperformed the domain-specific USPTO pretraining. The authors hypothesize that the filters learned from ImageNet pretraining generalize well outside the ImageNet domain, though why the USPTO-pretrained filters underperform remains unclear.</p>
</li>
<li>
<p><strong>Full Model Tuning</strong>: Unfreezing the entire network yielded higher performance than tuning only the classifier head, indicating that standard low-level visual filters require substantial adaptation to reliably distinguish chemical line drawings.</p>
</li>
<li>
<p><strong>Limitations and Edge Cases</strong>: The best CNN achieved an ROC AUC of <strong>0.97</strong> on the primary patch test set, while the ORB baseline scored <strong>0.81</strong> on the auxiliary dataset (the paper notes these ROC curves are not directly comparable because they come from different evaluation sets). The aggregation rule ($X = \max \{ x_i \}$) is naive and was not optimized. Furthermore, the patching approach introduces inherent label noise when a Markush indicator is cleanly bisected by a patch edge, potentially forcing the network to learn incomplete visual features.</p>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a primary dataset labeled by domain experts and a larger auxiliary dataset for evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Val</strong></td>
          <td><strong>Primary Dataset</strong></td>
          <td>272 Images</td>
          <td>Manually annotated with bounding boxes for Markush indicators. Split 60/20/20.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>Auxiliary Dataset</strong></td>
          <td>~5.4k Images</td>
          <td>5117 complete structures, 317 Markush. Used for image-level testing only (no bbox).</td>
      </tr>
  </tbody>
</table>
<p><strong>Patch Generation</strong>:</p>
<ul>
<li>Images are cropped into patches of size <strong>224x224</strong> (ResNet) or <strong>299x299</strong> (Inception).</li>
<li>Patches are generated from 2 grids offset by half the patch width/height to ensure annotations aren&rsquo;t lost on edges.</li>
<li><strong>Labeling Rule</strong>: A patch is labeled &ldquo;Markush&rdquo; if &gt;50% of an annotation&rsquo;s pixels fall inside it.</li>
</ul>
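<p>The two-grid patching and labeling rule can be sketched as follows (a hypothetical helper, not the authors&rsquo; code; boxes are <code>(x, y, w, h)</code> in pixels):</p>

```python
def patch_origins(width, height, patch=224):
    """Top-left corners for a base grid plus a grid offset by half a patch,
    so an indicator cut by one grid lands whole inside the other."""
    half = patch // 2
    origins = []
    for off in (0, half):
        for y in range(off, height - patch + 1, patch):
            for x in range(off, width - patch + 1, patch):
                origins.append((x, y))
    return origins

def label_patch(patch_box, annotation_boxes, threshold=0.5):
    """Label 'markush' if >50% of any annotation's area falls inside the patch."""
    px, py, pw, ph = patch_box
    for ax, ay, aw, ah in annotation_boxes:
        ix = max(0, min(px + pw, ax + aw) - max(px, ax))  # overlap width
        iy = max(0, min(py + ph, ay + ah) - max(py, ay))  # overlap height
        if ix * iy > threshold * (aw * ah):
            return "markush"
    return "no-markush"

print(len(patch_origins(448, 448)))  # 4 base patches + 1 offset patch
```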
<h3 id="algorithms">Algorithms</h3>
<p><strong>ORB (Baseline)</strong>:</p>
<ul>
<li>Matches query images against a bank of template patches containing Markush indicators.</li>
<li><strong>Features</strong>: Number of keypoints, number of matches, Hamming distance of best 5 matches.</li>
<li><strong>Classifier</strong>: XGBoost trained on these features.</li>
<li><strong>Hyperparameters</strong>: Search over number of features (500-2000) and template patches (50-250).</li>
</ul>
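<p>A toy sketch of the hand-crafted features fed to XGBoost, using small integers in place of ORB&rsquo;s 256-bit binary descriptors (the match threshold and padding value are assumptions for illustration):</p>

```python
def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")

def orb_style_features(query_desc, template_desc, k=5):
    """Per image: keypoint count, match count, and best-k Hamming distances."""
    dists = sorted(min(hamming(q, t) for t in template_desc) for q in query_desc)
    n_matches = sum(1 for d in dists if d <= 8)       # assumed match threshold
    best_k = dists[:k] + [32] * max(0, k - len(dists))  # pad with max distance
    return [len(query_desc), n_matches] + best_k

feats = orb_style_features([0b1010, 0b1111], [0b1010, 0b0000])
print(feats)
```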
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Framework</strong>: PyTorch with Optuna for optimization.</li>
<li><strong>Optimization</strong>: 25 trials per configuration.</li>
<li><strong>Augmentations</strong>: Random perspective shift, posterization, sharpness/blur.</li>
</ul>
<h3 id="models">Models</h3>
<p>Two main architectures were compared.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Input Size</th>
          <th>Parameters</th>
          <th>Pretraining Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ResNet18</strong></td>
          <td>224x224</td>
          <td>11.5M</td>
          <td>ImageNet</td>
      </tr>
      <tr>
          <td><strong>Inception V3</strong></td>
          <td>299x299</td>
          <td>23.8M</td>
          <td>ImageNet &amp; USPTO</td>
      </tr>
  </tbody>
</table>
<p><strong>Best Configuration</strong>: Inception V3, ImageNet weights, Full Model fine-tuning (all layers unfrozen).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Primary metric was <strong>Macro F1</strong> due to class imbalance.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best CNN (Inception V3)</th>
          <th>Baseline (ORB)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Patch Test F1</strong></td>
          <td>$0.917 \pm 0.014$</td>
          <td>N/A</td>
          <td>ORB does not support patch-level</td>
      </tr>
      <tr>
          <td><strong>Image Test F1</strong></td>
          <td>$0.928 \pm 0.035$</td>
          <td>$0.701 \pm 0.052$</td>
          <td>CNN aggregates patch predictions</td>
      </tr>
      <tr>
          <td><strong>Aux Test F1</strong></td>
          <td>0.914</td>
          <td>0.533</td>
          <td>Evaluation on large secondary dataset</td>
      </tr>
      <tr>
          <td><strong>ROC AUC</strong></td>
          <td>0.97</td>
          <td>0.81</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla V100-SXM2-16GB</li>
<li><strong>CPU</strong>: Intel Xeon E5-2686 @ 2.30GHz</li>
<li><strong>RAM</strong>: 64 GB</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Thomasjurriaans/markush-recognition-msc-thesis">GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>MSc thesis code: CNN training, ORB baseline, evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p>The primary dataset was manually annotated by Elsevier domain experts and is not publicly available. The auxiliary dataset (from Elsevier) is also not public. Pre-trained model weights are not released in the repository.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jurriaans, T., Szarkowska, K., Nalisnick, E., Schwörer, M., Thorne, C., &amp; Akhondi, S. (2023). One Strike, You&rsquo;re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images. <em>arXiv preprint arXiv:2311.14633</em>. <a href="https://doi.org/10.48550/arXiv.2311.14633">https://doi.org/10.48550/arXiv.2311.14633</a></p>
<p><strong>Publication</strong>: arXiv 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Thomasjurriaans/markush-recognition-msc-thesis">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{jurriaansOneStrikeYoure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{One {{Strike}}, {{You}}&#39;re {{Out}}: {{Detecting Markush Structures}} in {{Low Signal-to-Noise Ratio Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{One {{Strike}}, {{You}}&#39;re {{Out}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Jurriaans, Thomas and Szarkowska, Kinga and Nalisnick, Eric and Schwoerer, Markus and Thorne, Camilo and Akhondi, Saber}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MICER: Molecular Image Captioning with Transfer Learning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/</guid><description>Encoder-decoder model using pre-trained ResNet and attention-based LSTM to translate molecular images into SMILES strings, reaching 97.54% sequence accuracy.</description><content:encoded><![CDATA[<h2 id="micers-contribution-to-optical-structure-recognition">MICER&rsquo;s Contribution to Optical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper according to the AI for Physical Sciences taxonomy. It proposes MICER, an encoder-decoder architecture that integrates transfer learning (fine-tuning pre-trained models) and attention mechanisms for Optical Chemical Structure Recognition (OCSR). The study includes rigorous benchmarking comparing MICER against three rule-based tools (OSRA, MolVec, Imago) and existing deep learning methods (DECIMER). The authors conduct extensive factor comparison experiments to isolate the effects of stereochemistry, molecular complexity, data volume, and encoder backbone choices.</p>
<h2 id="the-challenge-of-generalizing-in-ocsr">The Challenge of Generalizing in OCSR</h2>
<p>Chemical structures in scientific literature are valuable for drug discovery, but they are locked in image formats that are difficult to mine automatically. Traditional OCSR tools (like OSRA) rely on hand-crafted rules and expert knowledge. They are brittle, struggle with stylistic variations, and have low generalization ability. While deep learning has been applied (e.g., DECIMER), previous attempts often used frozen pre-trained feature extractors (without fine-tuning) or failed to fully exploit transfer learning, leading to suboptimal performance. The goal of this work is to build an end-to-end &ldquo;image captioning&rdquo; system that translates molecular images directly into <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings without intermediate segmentation steps.</p>
<h2 id="integrating-fine-tuning-and-attention-for-chemistry">Integrating Fine-Tuning and Attention for Chemistry</h2>
<p>The core novelty lies in the specific architectural integration of transfer learning with fine-tuning for the chemical domain. Unlike DECIMER, which used a frozen network, MICER fine-tunes a pre-trained ResNet on molecular images. This allows the encoder to adapt from general object recognition to specific chemical feature extraction.</p>
<p>The model incorporates an attention mechanism into the LSTM decoder, allowing the model to focus on specific image regions (atoms and bonds) when generating each character of the SMILES string. The paper explicitly analyzes &ldquo;intrinsic features&rdquo; of molecular data (stereochemistry, complexity) to guide the design of the training dataset, combining multiple chemical toolkits (Indigo, RDKit) to generate diverse styles.</p>
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The authors performed two types of experiments: Factor Comparison (ablations) and Benchmarking.</p>
<p><strong>Factor Comparisons</strong>: They evaluated how performance is affected by:</p>
<ul>
<li><strong>Stereochemistry (SI)</strong>: Comparing models trained on data with and without stereochemical information.</li>
<li><strong>Molecular Complexity (MC)</strong>: Analyzing performance across 5 molecular weight intervals.</li>
<li><strong>Data Volume (DV)</strong>: Training on datasets ranging from 0.64 million to 10 million images.</li>
<li><strong>Pre-trained Models (PTMs)</strong>: Comparing 8 different backbones (e.g., ResNet, VGG, Inception, MobileNet) versus a base CNN.</li>
</ul>
<p><strong>Benchmarking</strong>:</p>
<ul>
<li><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based); Base CNN, DECIMER (deep learning).</li>
<li><strong>Datasets</strong>: Four test sets (100k images each, except UOB): Uni-style, Multi-style, Noisy, and Real-world (UOB dataset).</li>
<li><strong>Metrics</strong>: Sequence Accuracy (Exact Match), Levenshtein Distance (ALD), and Tanimoto Similarity (Fingerprint match).</li>
</ul>
<h2 id="results-and-core-insights">Results and Core Insights</h2>
<p>MICER achieved 97.54% Sequence Accuracy on uni-style data and 82.33% on the real-world UOB dataset, outperforming rule-based and deep learning baselines across all four test sets.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Method</th>
          <th>SA (%)</th>
          <th>AMFTS (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Uni-style</td>
          <td>OSRA</td>
          <td>23.14</td>
          <td>56.83</td>
      </tr>
      <tr>
          <td>Uni-style</td>
          <td>DECIMER</td>
          <td>35.32</td>
          <td>86.92</td>
      </tr>
      <tr>
          <td>Uni-style</td>
          <td><strong>MICER</strong></td>
          <td><strong>97.54</strong></td>
          <td><strong>99.74</strong></td>
      </tr>
      <tr>
          <td>Multi-style</td>
          <td>OSRA</td>
          <td>15.68</td>
          <td>44.50</td>
      </tr>
      <tr>
          <td>Multi-style</td>
          <td><strong>MICER</strong></td>
          <td><strong>95.09</strong></td>
          <td><strong>99.28</strong></td>
      </tr>
      <tr>
          <td>Noisy</td>
          <td><strong>MICER</strong></td>
          <td><strong>94.95</strong></td>
          <td><strong>99.25</strong></td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td>OSRA</td>
          <td>80.24</td>
          <td>91.17</td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td>DECIMER</td>
          <td>21.75</td>
          <td>65.15</td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td><strong>MICER</strong></td>
          <td><strong>82.33</strong></td>
          <td><strong>94.47</strong></td>
      </tr>
  </tbody>
</table>
<p>ResNet101 was identified as the most effective encoder (87.58% SA in preliminary tests on 0.8M images), outperforming deeper (DenseNet121 at 81.41%) and lighter (MobileNetV2 at 39.83%) networks. Performance saturates around 6 million training samples, reaching 98.84% SA. Stereochemical information drops accuracy by approximately 6.1% (from 87.61% to 81.50%), indicating wedge and dash bonds are harder to recognize. Visualizing attention maps showed the model correctly attends to specific atoms (e.g., focusing on &lsquo;S&rsquo; or &lsquo;Cl&rsquo; pixels) when generating the corresponding character.</p>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations. MICER struggles with superatoms, R-groups, text labels, and uncommon atoms (e.g., Sn) that were not seen during training. On noisy data, noise spots near Cl atoms can cause misclassification as O atoms. Complex molecular images with noise lead to misrecognition of noise points as single bonds and wedge-shaped bonds as double bonds. All methods, including MICER, have substantial room for improvement on real-world datasets that contain these challenging elements.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was curated from the <strong>ZINC20</strong> database.</p>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Filtering</strong>: Removed organometallics, mixtures, and invalid molecules.</li>
<li><strong>Standardization</strong>: SMILES were canonicalized and de-duplicated.</li>
<li><strong>Generation</strong>: Images generated using <strong>Indigo</strong> and <strong>RDKit</strong> toolkits to vary styles.</li>
</ul>
<p><strong>Dataset Size</strong>:</p>
<ul>
<li><strong>Total</strong>: 10 million images selected for the final model.</li>
<li><strong>Composition</strong>: 6 million &ldquo;default style&rdquo; (Indigo) + 4 million &ldquo;multi-style&rdquo; (Indigo + RDKit).</li>
<li><strong>Splits</strong>: 8:1:1 ratio for Training/Validation/Test.</li>
</ul>
<p><strong>Vocabulary</strong>: A token dictionary of 39 SMILES characters plus 3 special tokens: <code>[pad]</code>, <code>[sos]</code>, <code>[eos]</code>, <code>[0]</code>-<code>[9]</code>, <code>[C]</code>, <code>[l]</code>, <code>[c]</code>, <code>[O]</code>, <code>[N]</code>, <code>[n]</code>, <code>[F]</code>, <code>[H]</code>, <code>[o]</code>, <code>[S]</code>, <code>[s]</code>, <code>[B]</code>, <code>[r]</code>, <code>[I]</code>, <code>[i]</code>, <code>[P]</code>, <code>[p]</code>, <code>(</code>, <code>)</code>, <code>[</code>, <code>]</code>, <code>@</code>, <code>=</code>, <code>#</code>, <code>/</code>, <code>-</code>, <code>+</code>, <code>\</code>, <code>%</code>. Two-letter atoms like &lsquo;Br&rsquo; are tokenized as the distinct characters <code>[B]</code>, <code>[r]</code>, and &lsquo;Cl&rsquo; as <code>[C]</code>, <code>[l]</code>.</p>
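<p>A minimal sketch of the character-level tokenization described above (not MICER&rsquo;s actual code):</p>

```python
def tokenize(smiles, max_len=16):
    """Split a SMILES string into single characters, framed by [sos]/[eos]
    and padded to a fixed length; 'Cl' becomes the two tokens 'C', 'l'."""
    tokens = ["[sos]"] + list(smiles) + ["[eos]"]
    tokens += ["[pad]"] * (max_len - len(tokens))
    return tokens

print(tokenize("CCl=O", max_len=10))
```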
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Character-level tokenization (not atom-level); the model learns to assemble &lsquo;C&rsquo; and &lsquo;l&rsquo; into &lsquo;Cl&rsquo;.</li>
<li><strong>Attention Mechanism</strong>: Uses a soft attention mechanism where the decoder calculates an attention score between the encoder&rsquo;s feature map ($8 \times 8 \times 512$) and the current hidden vector. Formula:
$$
\text{att\_score} = \text{softmax}\big(L_a(\tanh(L_f(F) + L_b(b_t)))\big)
$$</li>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Loss Function</strong>: Cross-entropy loss</li>
<li><strong>Optimizer</strong>: Adam optimizer</li>
<li><strong>Learning Rate</strong>: 2e-5</li>
<li><strong>Batch Size</strong>: 256</li>
<li><strong>Epochs</strong>: 15</li>
</ul>
</li>
</ul>
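<p>The attention score can be sketched in NumPy; the linear maps $L_f$, $L_b$, $L_a$ below are random stand-ins for the learned layers, and the attention dimension is an assumption:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, feat_dim, hid_dim, att_dim = 64, 512, 256, 128

F = rng.standard_normal((n_regions, feat_dim))        # flattened 8x8x512 feature map
b_t = rng.standard_normal(hid_dim)                    # current decoder hidden state
L_f = rng.standard_normal((feat_dim, att_dim)) * 0.05  # stand-in for learned layer
L_b = rng.standard_normal((hid_dim, att_dim)) * 0.05
L_a = rng.standard_normal(att_dim) * 0.05

e = np.tanh(F @ L_f + b_t @ L_b) @ L_a                # one raw score per image region
att = np.exp(e - e.max()); att /= att.sum()           # softmax over the 64 regions
context = att @ F                                     # attention-weighted image context

print(att.shape, context.shape)
```

The context vector lets the decoder condition each emitted SMILES character on the image regions it is currently attending to.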
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: Pre-trained <strong>ResNet101</strong> (trained on ImageNet).</li>
<li><strong>Modifications</strong>: The final layer is removed to output a Feature Map of size $8 \times 8 \times 512$.</li>
<li><strong>Flattening</strong>: Reshaped to a $64 \times 512$ feature matrix for the decoder.</li>
</ul>
<p><strong>Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Long Short-Term Memory (LSTM) with Attention.</li>
<li><strong>Dropout</strong>: 0.3 applied to minimize overfitting.</li>
</ul>
<p>The encoder uses a pilot network (for universal feature extraction), a max-pooling layer, and multiple feature extraction layers containing convolutional blocks (CBs), feeding into the attention LSTM.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>SA (Sequence Accuracy)</strong>: Strict exact match of SMILES strings.</li>
<li><strong>ALD (Average Levenshtein Distance)</strong>: Edit distance for character-level error analysis.</li>
<li><strong>AMFTS / MFTS@1.0</strong>: Tanimoto similarity of ECFP4 fingerprints to measure structural similarity.</li>
</ul>
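<p>ALD averages the Levenshtein distance over test pairs; a minimal implementation of the per-pair distance:</p>

```python
def levenshtein(a, b):
    """Edit distance between two strings via the standard two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Phenol predicted as aniline: one character substitution.
print(levenshtein("c1ccccc1O", "c1ccccc1N"))  # -> 1
```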
<p><strong>Test Sets</strong>:</p>
<ul>
<li><strong>Uni-style</strong>: 100,000 images (Indigo default).</li>
<li><strong>Multi-style</strong>: 100,000 images (&gt;10 styles).</li>
<li><strong>Noisy</strong>: 100,000 images with noise added.</li>
<li><strong>UOB</strong>: 5,575 real-world images from literature.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 GPUs</li>
<li><strong>Training Time</strong>: Approximately 42 hours for the final model</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Jiacai-Yi/MICER">MICER</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<p>The training data (generated from ZINC20) and pre-trained model weights are not publicly released. The repository contains code but has minimal documentation (2 commits, no description).</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yi, J., Wu, C., Zhang, X., Xiao, X., Qiu, Y., Zhao, W., Hou, T., &amp; Cao, D. (2022). MICER: a pre-trained encoder-decoder architecture for molecular image captioning. <em>Bioinformatics</em>, 38(19), 4562-4572. <a href="https://doi.org/10.1093/bioinformatics/btac545">https://doi.org/10.1093/bioinformatics/btac545</a></p>
<p><strong>Publication</strong>: Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Jiacai-Yi/MICER">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yiMICERPretrainedEncoder2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MICER}}: A Pre-Trained Encoder--Decoder Architecture for Molecular Image Captioning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MICER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yi, Jiacai and Wu, Chengkun and Zhang, Xiaochen and Xiao, Xinyi and Qiu, Yanlong and Zhao, Wentao and Hou, Tingjun and Cao, Dongsheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{19}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4562--4572}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1367-4811}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bioinformatics/btac545}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2SMILES: Transformer OCSR with Synthetic Data Pipeline</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</guid><description>Transformer-based OCSR using a novel synthetic data generation pipeline for robust molecular image interpretation across diverse drawing styles.</description><content:encoded><![CDATA[<h2 id="contribution-image2smiles-as-a-method-and-resource">Contribution: Image2SMILES as a Method and Resource</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a specific neural architecture (ResNet backbone and Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering &ldquo;How well does this work?&rdquo; with extensive benchmarks against rule-based systems like OSRA.</li>
<li><strong>Resource</strong>: A core contribution is the &ldquo;Generate and Train!&rdquo; paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.</li>
</ul>
<h2 id="motivation-bottlenecks-in-recognizing-trapped-chemical-structures">Motivation: Bottlenecks in Recognizing Trapped Chemical Structures</h2>
<p>Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.</p>
<ul>
<li><strong>Problem</strong>: Chemical structures are often &ldquo;trapped&rdquo; in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, &ldquo;Markush&rdquo; structures (templates), or visual contamination.</li>
<li><strong>Gap</strong>: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.</li>
<li><strong>Goal</strong>: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).</li>
</ul>
<h2 id="core-innovation-the-generate-and-train-pipeline-and-fg-smiles">Core Innovation: The &ldquo;Generate and Train!&rdquo; Pipeline and FG-SMILES</h2>
<ul>
<li><strong>&ldquo;Generate and Train!&rdquo; Paradigm</strong>: The authors assert that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates geometry (rotation, bonds) alongside specific chemical drawing artifacts like &ldquo;Markush&rdquo; variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual &ldquo;contamination&rdquo; (stray text, arrows).</li>
<li><strong>FG-SMILES</strong>: A modified SMILES syntax designed to handle functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds.</li>
<li><strong>Encoder-Free Architecture</strong>: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.</li>
</ul>
<h2 id="methodology-and-benchmarking-against-osra">Methodology and Benchmarking Against OSRA</h2>
<ul>
<li><strong>Training</strong>: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.</li>
<li><strong>Validation (Synthetic)</strong>: Evaluated on a hold-out set of 1M synthetic images.</li>
<li><strong>Validation (Real World)</strong>:
<ul>
<li><strong>Dataset A</strong>: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.</li>
<li><strong>Dataset B</strong>: 296 structures systematically extracted from <em>Journal of Organic Chemistry</em> (one paper per issue from 2020) to reduce selection bias.</li>
</ul>
</li>
<li><strong>Comparison</strong>: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.</li>
</ul>
<h2 id="results-high-precision-extraction-and-key-limitations">Results: High-Precision Extraction and Key Limitations</h2>
<ul>
<li><strong>Performance</strong>:
<ul>
<li><strong>Synthetic</strong>: 90.7% exact match accuracy.</li>
<li><strong>Real Data (Dataset A)</strong>: Image2SMILES achieved <strong>79.2%</strong> accuracy compared to OSRA&rsquo;s <strong>62.1%</strong>.</li>
<li><strong>Real Data (Dataset B)</strong>: Image2SMILES achieved <strong>62.5%</strong> accuracy compared to OSRA&rsquo;s <strong>24.0%</strong>.</li>
</ul>
</li>
<li><strong>Confidence Correlation</strong>: There is a strong correlation between the model&rsquo;s confidence score and prediction validity. Thresholding at 0.995 yields 99.85% accuracy while ignoring 22.5% of data, enabling high-precision automated pipelines.</li>
<li><strong>Key Failures</strong>: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confusion of R-group indices ($R'$ vs $R_1$), and explicit hydrogens rendered as groups.</li>
</ul>
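<p>Confidence-gated extraction can be sketched as a simple filter (the predictions and scores below are illustrative, not the paper&rsquo;s):</p>

```python
def threshold_predictions(preds, tau=0.995):
    """Keep only (SMILES, confidence) pairs that clear the threshold;
    also report coverage, the fraction of images not ignored."""
    kept = [(smiles, conf) for smiles, conf in preds if conf >= tau]
    coverage = len(kept) / len(preds)
    return kept, coverage

preds = [("CCO", 0.999), ("c1ccccc1", 0.42), ("CC(=O)O", 0.997)]
kept, coverage = threshold_predictions(preds)
print(len(kept), round(coverage, 2))
```

Trading coverage for precision this way is what enables the fully automated high-precision pipelines the authors describe.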
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: A subset of 10 million molecules sampled from PubChem.</li>
<li><strong>Selection Logic</strong>: Bias towards complex/rare structures using a &ldquo;Full Coefficient&rdquo; (FC) probability metric based on molecule size and ring/atom rarity.
<ul>
<li>Size term: $BC=0.1+1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$, where $n$ is the molecule size and $n_{\max}=60$.</li>
</ul>
</li>
<li><strong>Generation</strong>: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).</li>
<li><strong>Contamination</strong>: &ldquo;Visual noise&rdquo; is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.</li>
<li><strong>Target Format</strong>: <strong>FG-SMILES</strong> (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a <code>v</code> token.</li>
</ul>
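<p>Evaluating the size-dependent coefficient as printed above (treating $n$ as the molecule size; the ring/atom-rarity terms of FC are omitted here):</p>

```python
def bias_coefficient(n, n_max=60):
    """BC = 0.1 + 1.2 * ((n_max - n) / n_max)^3, clamped so the
    coefficient stays in [0.1, 1.3] for any molecule size."""
    n = min(n, n_max)
    return 0.1 + 1.2 * ((n_max - n) / n_max) ** 3

print(round(bias_coefficient(0), 2), round(bias_coefficient(30), 2), round(bias_coefficient(60), 2))
```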
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Contamination Augmentation</strong>: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.</li>
<li><strong>Functional Group Resolution</strong>: An algorithm identifies overlapping functional group templates (SMARTS) and resolves them to prevent nested group conflicts (e.g., resolving Methyl vs Methoxy).</li>
<li><strong>Markush Support</strong>: Stochastic replacement of substituents with R-group labels ($R_1$, $R'$, etc.) based on a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$).</li>
</ul>
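<p>A sketch of the stochastic R-group replacement; only $P(R)=0.2$ and $P(R_1)=0.15$ are given in the text, so the remaining probability mass below is an assumption for illustration:</p>

```python
import random

def sample_r_label(rng):
    """Draw an R-group label (or None to keep the real substituent)
    from an assumed probability table."""
    table = [("R", 0.20), ("R1", 0.15), ("R'", 0.15), (None, 0.50)]
    x = rng.random()
    acc = 0.0
    for label, p in table:
        acc += p
        if x < acc:
            return label
    return None

rng = random.Random(7)
labels = [sample_r_label(rng) for _ in range(1000)]
print(sum(1 for l in labels if l == "R") / 1000)  # should be near 0.2
```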
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;Image-to-Sequence&rdquo; hybrid model.
<ul>
<li><strong>Backbone</strong>: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.</li>
<li><strong>Neck</strong>: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.</li>
<li><strong>Decoder</strong>: Standard Transformer Decoder with parameters from the original Transformer architecture.</li>
</ul>
</li>
<li><strong>Input</strong>: Images resized to $384 \times 384 \times 3$.</li>
<li><strong>Output</strong>: Sequence of FG-SMILES tokens.</li>
</ul>
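<p>The encoder-free wiring amounts to a reshape of the backbone&rsquo;s feature map into a token sequence for the decoder (shapes only):</p>

```python
import numpy as np

# C x H x W map from the truncated ResNet-50 (values are placeholders).
feature_map = np.zeros((512, 48, 48))
# Flatten spatial positions into a sequence: (H*W) tokens of dimension C.
tokens = feature_map.reshape(512, -1).T
print(tokens.shape)  # -> (2304, 512)
```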
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Binary &ldquo;Exact Match&rdquo; (valid/invalid).
<ul>
<li>Strict criteria: Stereochemistry and R-group indices must match exactly (e.g., predicting $R'$ for $R_1$ counts as a failure).</li>
</ul>
</li>
<li><strong>Datasets</strong>:
<ul>
<li><strong>Internal</strong>: 5% random split of generated data (500k samples).</li>
<li><strong>External (Dataset A &amp; B)</strong>: Manually cropped real-world images from specified journals.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.</li>
<li><strong>Duration</strong>: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.</li>
<li><strong>Optimizer</strong>: RAdam with learning rate $3 \cdot 10^{-4}$.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/syntelly/img2smiles_generator">Data Generator (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic training data generator</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5069806">1M Generated Samples (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Randomly generated image-SMILES pairs</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5356500">Real-World Test Images (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Cropped structures from real papers with target FG-SMILES</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></td>
          <td>Other</td>
          <td>Proprietary</td>
          <td>Web demo for PDF-to-SMILES extraction</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Khokhlov, I., Krasnov, L., Fedorov, M. V., &amp; Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. <em>Chemistry-Methods</em>, 2(1), e202100069. <a href="https://doi.org/10.1002/cmtd.202100069">https://doi.org/10.1002/cmtd.202100069</a></p>
<p><strong>Publication</strong>: Chemistry-Methods 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/syntelly/img2smiles_generator">Official Code (Data Generator)</a></li>
<li><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{khokhlovImage2SMILESTransformerBasedMolecular2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image2SMILES: Transformer-Based Molecular Optical Recognition Engine}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Image2SMILES}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Chemistry-Methods}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{e202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2628-9725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1002/cmtd.202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image-to-Graph Transformers for Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</guid><description>A deep learning model that converts molecular images directly into graph structures, enabling recognition of abbreviated non-atomic symbols.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-classification">Contribution and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture designed to extract molecular structures from images by directly predicting the graph topology. The paper validates this approach through ablation studies (comparing ResNet-only baselines to the Transformer-augmented model) and extensive benchmarking against existing tools.</p>
<h2 id="the-challenge-with-smiles-and-non-atomic-symbols">The Challenge with SMILES and Non-Atomic Symbols</h2>
<ul>
<li><strong>Handling Abbreviations:</strong> Chemical structures in scientific literature often use non-atomic symbols (superatoms like &ldquo;R&rdquo; or &ldquo;Ph&rdquo;) to reduce complexity. Standard tools that generate SMILES strings fail here because SMILES syntax does not support arbitrary non-atomic symbols.</li>
<li><strong>Robustness to Style:</strong> Existing rule-based tools are brittle to the diverse drawing styles found in literature.</li>
<li><strong>Data Utilization:</strong> Pixel-wise graph recognition tools (like ChemGrapher) require expensive pixel-level labeling. An end-to-end approach can utilize massive amounts of image-molecule pairs (like USPTO data) without needing exact coordinate labels.</li>
</ul>
<h2 id="the-image-to-graph-i2g-architecture">The Image-to-Graph (I2G) Architecture</h2>
<p>The core novelty is the <strong>Image-to-Graph (I2G)</strong> architecture that bypasses string representations entirely:</p>
<ul>
<li><strong>Hybrid Encoder:</strong> Combines a ResNet backbone (for locality) with a Transformer encoder (for global context), allowing the model to capture relationships between atoms that are far apart in the image.</li>
<li><strong>Graph Decoder (GRAT):</strong> A modified Transformer decoder that generates the graph auto-regressively. It uses feature-wise transformations to modulate attention weights based on edge information (bond types).</li>
<li><strong>Coordinate-Aware Training:</strong> The model is forced to predict the exact 2D coordinates of atoms in the source image. Combined with auxiliary losses, this boosts SMI accuracy from 0.009 to 0.567 on the UoB ablation (Table 1 in the paper).</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Baselines:</strong> The model was compared against OSRA (rule-based), MolVec (rule-based), and ChemGrapher (deep learning pixel-wise).</li>
<li><strong>Benchmarks:</strong> Evaluated on four standard datasets: UoB, USPTO, CLEF, and JPO. Images were converted to PDF and back to simulate degradation.</li>
<li><strong>Large Molecule Test:</strong> A custom dataset (<strong>OLED</strong>) was created from 12 journal papers (434 images) to test performance on larger, more complex structures (average 52.8 atoms).</li>
<li><strong>Ablations:</strong> The authors tested the impact of the Transformer encoder, auxiliary losses, and coordinate prediction.</li>
</ul>
<h2 id="empirical-results-and-robustness">Empirical Results and Robustness</h2>
<ul>
<li><strong>Benchmark Performance:</strong> The proposed model outperformed existing models with a 17.1% relative improvement on benchmark datasets.</li>
<li><strong>Robustness:</strong> On large molecules (OLED dataset), it achieved a 12.8% relative improvement over MolVec (and 20.0% over OSRA).</li>
<li><strong>Data Scaling:</strong> Adding real-world USPTO data to the synthetic training set improved performance by 20.5%, demonstrating the model&rsquo;s ability to learn from noisy, unlabeled coordinates.</li>
<li><strong>Handling Superatoms:</strong> The model successfully recognized pseudo-atoms (e.g., $R_1$, $R_2$, $R_3$) as distinct nodes. OSRA, which outputs SMILES, collapsed them into generic &ldquo;Any&rdquo; atoms since SMILES does not support non-atomic symbols. MolVec could not recognize them properly at all.</li>
</ul>
<h2 id="limitations-and-error-analysis">Limitations and Error Analysis</h2>
<p>The paper identifies two main failure modes on the USPTO, CLEF, and JPO benchmarks:</p>
<ol>
<li><strong>Unrecognized superatoms:</strong> The model struggles with complex multi-character superatoms not seen during training (e.g., NHNHCOCH$_3$ or H$_3$CO$_2$S). The authors propose character-level atom decoding as a future solution.</li>
<li><strong>Caption interference:</strong> The model sometimes misidentifies image captions as atoms, particularly on the JPO dataset. Data augmentation with arbitrary caption text or a dedicated image segmentation step could mitigate this.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors used a combination of synthetic and real-world data for training.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem</strong></td>
          <td>4.6M</td>
          <td>Synthetic images generated using RDKit. Random superatoms (e.g., $CF_3$, $NO_2$) were substituted to simulate abbreviations.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO</strong></td>
          <td>2.5M</td>
          <td>Real image-molecule pairs from patents. Used for robustness; lacks coordinate labels.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Benchmarks</strong></td>
          <td>~5.7k</td>
          <td>UoB, USPTO, CLEF, JPO. Average ~15.8 atoms per molecule.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>OLED</strong></td>
          <td>434</td>
          <td>Manually segmented from 12 journal papers. Large molecules (avg 52.8 atoms).</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing:</strong></p>
<ul>
<li>Input resolution is fixed at $800 \times 800$ pixels.</li>
<li>Images are virtually split into a $25 \times 25$ grid (625 patches total), where each patch is $32 \times 32$ pixels.</li>
</ul>
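<p>The virtual grid split is a plain reshape; <code>to_patch_grid</code> is a hypothetical helper, shown for a single-channel image:</p>

```python
import numpy as np

def to_patch_grid(img: np.ndarray, patch: int = 32) -> np.ndarray:
    """Split an (H, W) image into an (H//patch, W//patch) grid of patches."""
    h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    return img.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)

grid = to_patch_grid(np.zeros((800, 800)), patch=32)
# grid.shape == (25, 25, 32, 32): 625 patches of 32 x 32 pixels
```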
<h3 id="algorithms">Algorithms</h3>
<p><strong>Encoder Logic:</strong></p>
<ul>
<li><strong>Grid Serialization:</strong> The $25 \times 25$ grid is flattened into a 1D sequence. 2D position information is concatenated to ResNet features before the Transformer.</li>
<li><strong>Auxiliary Losses:</strong> To aid convergence, classifiers on the encoder predict three things <em>per patch</em>: (1) number of atoms, (2) characters in atom labels, and (3) edge-sharing neighbors. These losses decrease to zero during training.</li>
</ul>
<p><strong>Decoder Logic:</strong></p>
<ul>
<li><strong>Auto-regressive Generation:</strong> At step $t$, the decoder generates a new node and connects it to existing nodes.</li>
<li><strong>Attention Modulation:</strong> Attention weights are transformed using bond information:
$$
\begin{aligned}
\text{Att}(Q, K, V) = \text{softmax} \left( \frac{\Gamma \odot (QK^T) + B}{\sqrt{d_k}} \right) V
\end{aligned}
$$
where $(\gamma_{ij}, \beta_{ij}) = f(e_{ij})$, with $e_{ij}$ being the edge type (in one-hot representation) between nodes $i$ and $j$, and $f$ is a multi-layer perceptron. $\Gamma$ and $B$ are matrices whose elements at position $(i, j)$ are $\gamma_{ij}$ and $\beta_{ij}$, respectively.</li>
<li><strong>Coordinate Prediction:</strong> The decoder outputs coordinates for each atom, which acts as a mechanism to track attention history.</li>
</ul>
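<p>A minimal numpy sketch of the modulated attention above; with $\Gamma$ all ones and $B$ all zeros it reduces to standard scaled dot-product attention:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulated_attention(Q, K, V, Gamma, B):
    """Scaled dot-product attention whose logits are scaled and shifted
    per node pair; Gamma[i, j] and B[i, j] would come from an MLP applied
    to the one-hot edge type between nodes i and j."""
    d_k = Q.shape[-1]
    logits = (Gamma * (Q @ K.T) + B) / np.sqrt(d_k)
    return softmax(logits, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = rng.normal(size=(3, n, d))
Gamma, B = np.ones((n, n)), np.zeros((n, n))  # identity modulation
out = modulated_attention(Q, K, V, Gamma, B)
```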
<h3 id="models">Models</h3>
<ul>
<li><strong>Image Encoder:</strong> ResNet-34 backbone followed by a Transformer encoder.</li>
<li><strong>Graph Decoder:</strong> A &ldquo;Graph-Aware Transformer&rdquo; (GRAT) that outputs nodes (atom labels, coordinates) and edges (bond types).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on structural identity, as standard string matching (SMILES) is insufficient for graphs with superatoms.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMI</strong></td>
          <td>Canonical SMILES Match</td>
          <td>Correct if predicted SMILES is identical to ground truth.</td>
      </tr>
      <tr>
          <td><strong>TS 1</strong></td>
          <td>Tanimoto Similarity = 1.0</td>
          <td>Ratio of predictions with perfect fingerprint overlap.</td>
      </tr>
      <tr>
          <td><strong>Sim.</strong></td>
          <td>Average Tanimoto Similarity</td>
          <td>Measures average structural overlap across all predictions.</td>
      </tr>
  </tbody>
</table>
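<p>Tanimoto similarity over fingerprints represented as sets of on-bits can be sketched as below; real evaluations compute it on actual molecular fingerprints (e.g. via RDKit), so the set representation is a simplification:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy on-bit sets standing in for molecular fingerprints
a, b = {1, 2, 3, 4}, {2, 3, 4, 5}
sim = tanimoto(a, b)  # 3 shared bits / 5 total bits = 0.6
```

<p>The <strong>TS 1</strong> metric is then the fraction of predictions whose similarity to the ground truth equals 1.0.</p>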
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper does not release source code, pre-trained models, or the custom OLED evaluation dataset. The training data sources (PubChem, USPTO) are publicly available, but the specific image generation pipeline (modified RDKit with coordinate extraction and superatom substitution) is not released. Key architectural details (ResNet-34 backbone, Transformer encoder/decoder configuration) and training techniques are described, but exact hyperparameters for full reproduction are limited.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source of 4.6M molecules for synthetic image generation</td>
      </tr>
      <tr>
          <td><a href="https://www.uspto.gov/">USPTO</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>2.5M real image-molecule pairs from patents</td>
      </tr>
      <tr>
          <td><a href="https://www.rdkit.org/">RDKit</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Used (with modifications) for synthetic image generation</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoo, S., Kwon, O., &amp; Lee, H. (2022). Image-to-Graph Transformers for Chemical Structure Recognition. <em>ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 3393-3397. <a href="https://doi.org/10.1109/ICASSP43922.2022.9746088">https://doi.org/10.1109/ICASSP43922.2022.9746088</a></p>
<p><strong>Publication</strong>: ICASSP 2022</p>
]]></content:encoded></item><item><title>ICMDT: Automated Chemical Structure Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</guid><description>A Transformer-based model (ICMDT) for converting chemical structure images into InChI text strings using a novel Deep TNT block.</description><content:encoded><![CDATA[<h2 id="contribution-image-to-text-translation-for-chemical-structures">Contribution: Image-to-Text Translation for Chemical Structures</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel neural network architecture, the <strong>Image Captioning Model based on Deep TNT (ICMDT)</strong>, to solve the specific problem of &ldquo;molecular translation&rdquo; (image-to-text). The classification is supported by the following rhetorical indicators:</p>
<ul>
<li><strong>Novel Mechanism:</strong> It introduces the &ldquo;Deep TNT block&rdquo; to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).</li>
<li><strong>Baseline Comparison:</strong> The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).</li>
<li><strong>Ablation Study:</strong> Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.</li>
</ul>
<h2 id="motivation-digitizing-historical-chemical-literature">Motivation: Digitizing Historical Chemical Literature</h2>
<p>The primary motivation is to speed up chemical research by digitizing historical chemical literature.</p>
<ul>
<li><strong>Problem:</strong> Historical sources often contain corrupted or noisy images, making automated recognition difficult.</li>
<li><strong>Gap:</strong> Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.</li>
<li><strong>Goal:</strong> To build a dependable generative model that can accurately translate these noisy images into <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> (International Chemical Identifier) text strings.</li>
</ul>
<h2 id="novelty-multi-level-feature-fusion-with-deep-tnt">Novelty: Multi-Level Feature Fusion with Deep TNT</h2>
<p>The core contribution is the <strong>Deep TNT block</strong> and the resulting <strong>ICMDT</strong> architecture.</p>
<ul>
<li><strong>Deep TNT Block:</strong> The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
<ol>
<li><strong>Internal Transformer:</strong> Processes pixel embeddings.</li>
<li><strong>Middle Transformer:</strong> Processes small patch embeddings.</li>
<li><strong>Exterior Transformer:</strong> Processes large patch embeddings.</li>
</ol>
</li>
<li><strong>Multi-level Fusion:</strong> The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.</li>
<li><strong>Position Encoding:</strong> A specific strategy of applying shared position encodings to small patches and pixels, while using a learnable 1D encoding for large patches.</li>
</ul>
<h2 id="methodology-benchmarking-on-the-bms-dataset">Methodology: Benchmarking on the BMS Dataset</h2>
<p>The authors evaluated the model on the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset.</p>
<ul>
<li><strong>Baselines:</strong> They constructed four comparative models:
<ul>
<li>EfficientNetb0 + RNN (Bi-LSTM)</li>
<li>ResNet50d + RNN (Bi-LSTM)</li>
<li>EfficientNetb0 + Transformer</li>
<li>ResNet101d + Transformer</li>
</ul>
</li>
<li><strong>Ablation:</strong> They tested the impact of removing the large patch position encoding (ICMDT*), reverting the encoder to a standard TNT-S (TNTD), and setting the patch size to 32 directly on TNT-S without the exterior transformer block (TNTD-B).</li>
<li><strong>Pre-processing Study:</strong> They experimented with denoising ratios and cropping strategies.</li>
</ul>
<h2 id="results--conclusions-improved-inchi-translation-accuracy">Results &amp; Conclusions: Improved InChI Translation Accuracy</h2>
<ul>
<li><strong>Performance:</strong> ICMDT achieved the lowest <strong>Levenshtein distance (0.69)</strong> among all five models tested (Table 3). The best-performing baseline was ResNet101d+Transformer.</li>
<li><strong>Convergence:</strong> The model converged significantly faster than the baselines, outperforming others as early as epoch 6.7.</li>
<li><strong>Ablation Results:</strong> The full Deep TNT block reduced error by nearly half compared to the standard TNT encoder (0.69 vs 1.29 Levenshtein distance). Removing large patch position encoding (ICMDT*) degraded performance to 1.04, and directly using patch size 32 on TNT-S (TNTD-B) scored 1.37.</li>
<li><strong>Limitations:</strong> The model struggles with <strong>stereochemical layers</strong> (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.</li>
<li><strong>Inference &amp; Fusion:</strong> The multi-model inference and fusion pipeline (beam search, TTA, step-wise logit ensemble, and voting) reduced Levenshtein distance by 0.24 to 2.5 relative to single models.</li>
<li><strong>Future Work:</strong> Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p><strong>Status: Partially Reproducible.</strong> The dataset is publicly available through Kaggle, and the paper provides detailed hyperparameters and architecture specifications. However, no source code or pretrained model weights have been released.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Molecular Translation (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition Terms</td>
          <td>Training/test images with InChI labels</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components:</strong> No official code repository or pretrained weights. Reimplementation requires reconstructing the Deep TNT block, training pipeline, and inference/fusion strategy from the paper description alone.</p>
<p><strong>Hardware/compute requirements:</strong> Not explicitly stated in the paper.</p>
<h3 id="data">Data</h3>
<p>The experiments used the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset from Kaggle.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BMS Training Set</td>
          <td>2,424,186 images</td>
          <td>Supervised; contains noise and blur</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BMS Test Set</td>
          <td>1,616,107 images</td>
          <td>Higher noise variation than training set</td>
      </tr>
  </tbody>
</table>
<p><strong>Pre-processing Strategy</strong>:</p>
<ul>
<li><strong>Effective:</strong> Padding resizing (reshaping to a square using the longer edge, filling the padded regions with pixels taken from the middle of the image).</li>
<li><strong>Ineffective:</strong> Smart cropping (removing white borders degraded performance).</li>
<li><strong>Augmentation:</strong> GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).</li>
<li><strong>Denoising:</strong> Best results found by mixing denoised and original data (Ratio 2:13) during training.</li>
</ul>
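<p>A minimal sketch of the padding resize, assuming a single-channel image and using a constant white fill where the paper fills with pixels taken from the middle of the image:</p>

```python
import numpy as np

def pad_to_square(img: np.ndarray, fill: int = 255) -> np.ndarray:
    """Pad an (H, W) image to a square whose side is the longer edge."""
    h, w = img.shape
    side = max(h, w)
    out = np.full((side, side), fill, dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    out[top:top + h, left:left + w] = img  # center the original content
    return out

sq = pad_to_square(np.zeros((100, 160), dtype=np.uint8))
# sq.shape == (160, 160), with the 100 x 160 image centered vertically
```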
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer:</strong> Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).</li>
<li><strong>Loss Function:</strong> Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. Standard Focal Loss adds a modulating factor $(1-p_t)^\gamma$ to cross-entropy to focus on hard negatives. Anti-Focal Loss (Raunak et al., 2020) modifies this factor to reduce the disparity between training and inference distributions in Seq2Seq models.</li>
<li><strong>Training Schedule:</strong>
<ul>
<li>Initial resolution: $224 \times 224$</li>
<li>Fine-tuning: resolution $384 \times 384$ for labels of length greater than 150.</li>
<li>Batch size: Dynamic, increasing from 16 to 1024 (with proportional learning rate scaling).</li>
<li>Noisy Labels: Randomly replacing chemical elements in labels with a certain probability to improve robustness during inference.</li>
</ul>
</li>
<li><strong>Inference Strategy:</strong>
<ul>
<li>Beam Search ($k=16$ initially, $k=64$ if failing InChI validation).</li>
<li>Test Time Augmentation (TTA): Rotations of $90^\circ$.</li>
<li>Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.</li>
</ul>
</li>
</ul>
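<p>The standard focal loss that the paper starts from can be written directly; anti-focal loss modifies the $(1-p_t)^\gamma$ factor, so this sketch shows only the baseline form:</p>

```python
import math

def focal_loss(p_t: float, gamma: float = 0.5) -> float:
    """Cross-entropy scaled by the focal modulating factor (1 - p_t)**gamma.
    Anti-focal loss (the variant used in the paper) alters this factor."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Confident correct predictions are down-weighted relative to hard ones
easy, hard = focal_loss(0.9), focal_loss(0.1)
```

<p>With $\gamma = 0$ the factor vanishes and the loss reduces to plain cross-entropy.</p>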
<h3 id="models">Models</h3>
<p><strong>ICMDT Architecture:</strong></p>
<ul>
<li><strong>Encoder (Deep TNT)</strong> (Depth: 12 layers):
<ul>
<li><strong>Internal Block:</strong> Dim 160, Heads 4, Hidden size 640, MLP act GELU, Pixel patch size 4.</li>
<li><strong>Middle Block:</strong> Dim 10, Heads 6, Hidden size 128, MLP act GELU, Small patch size 16.</li>
<li><strong>Exterior Block:</strong> Dim 2560, Heads 10, Hidden size 5120, MLP act GELU, Large patch size 32.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Decoder dim: 2560, FFN dim: 1024.</li>
<li>Depth: 3 layers, Heads: 8.</li>
<li>Vocab size: 193 (InChI tokens), text_dim: 384.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric:</strong> Levenshtein Distance (measures single-character edit operations between generated and ground truth InChI strings).</p>
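<p>Levenshtein distance is the classic edit-distance dynamic program; the InChI strings in the example are invented for illustration:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

d = levenshtein("InChI=1S/CH4", "InChI=1S/CH3F")  # -> 2
```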
<p><strong>Ablation Results (Table 3 from paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Levenshtein Distance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td><strong>0.69</strong></td>
      </tr>
      <tr>
          <td>ICMDT*</td>
          <td>138.16</td>
          <td>1.04</td>
      </tr>
      <tr>
          <td>TNTD</td>
          <td>114.36</td>
          <td>1.29</td>
      </tr>
      <tr>
          <td>TNTD-B</td>
          <td>114.36</td>
          <td>1.37</td>
      </tr>
  </tbody>
</table>
<p><strong>Baseline Comparison (from convergence curves, Figure 9):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Convergence (Epochs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td>~9.76</td>
      </tr>
      <tr>
          <td>ResNet101d + Transformer</td>
          <td>302.02</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + Transformer</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>ResNet50d + RNN</td>
          <td>90.6</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + RNN</td>
          <td>46.3</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, Y., Chen, G., &amp; Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. <em>Applied Sciences</em>, 12(2), 680. <a href="https://doi.org/10.3390/app12020680">https://doi.org/10.3390/app12020680</a></p>
<p><strong>Publication</strong>: MDPI Applied Sciences 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">Kaggle Competition: BMS Molecular Translation</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liAutomatedRecognitionChemical2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Li, Yanchi and Chen, Guanyu and Li, Xiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Applied Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{680}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Multidisciplinary Digital Publishing Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2076-3417}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3390/app12020680}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Structure Recognition with RCGD</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</guid><description>An end-to-end framework (RCGD) and unambiguous markup language (SSML) for recognizing complex handwritten chemical structures with guided graph traversal.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-framework">Contribution and Methodological Framework</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architectural framework (<strong>RCGD</strong>) and a new representation syntax (<strong>SSML</strong>) to solve the specific problem of handwritten chemical structure recognition.</li>
<li><strong>Resource</strong>: It introduces a new benchmark dataset, <strong>EDU-CHEMC</strong>, containing 50,000 handwritten images to address the lack of public data in this domain.</li>
</ul>
<h2 id="the-ambiguity-of-handwritten-chemical-structures">The Ambiguity of Handwritten Chemical Structures</h2>
<p>Recognizing handwritten chemical structures is significantly harder than printed ones due to:</p>
<ol>
<li><strong>Inherent Ambiguity</strong>: Handwritten atoms and bonds vary greatly in appearance.</li>
<li><strong>Projection Complexity</strong>: Converting 2D projected layouts (like Natta or Fischer projections) into linear strings is difficult.</li>
<li><strong>Limitations of Existing Formats</strong>: Standard formats like SMILES require domain knowledge (valence rules) and have a high semantic gap with the visual image. They often fail to represent &ldquo;invalid&rdquo; structures commonly found in educational/student work.</li>
</ol>
<h2 id="bridging-the-semantic-gap-with-ssml-and-rcgd">Bridging the Semantic Gap with SSML and RCGD</h2>
<p>The paper introduces two core contributions to bridge the semantic gap between image and markup:</p>
<ol>
<li>
<p><strong>Structure-Specific Markup Language (SSML)</strong>: An extension of Chemfig that provides an unambiguous, visual-based graph representation. Unlike SMILES, it describes <em>how to draw</em> the molecule step-by-step, making it easier for models to learn visual alignments. It supports &ldquo;reconnection marks&rdquo; to handle cyclic structures explicitly.</p>
</li>
<li>
<p><strong>Random Conditional Guided Decoder (RCGD)</strong>: A decoder that treats recognition as a graph traversal problem. It introduces three novel mechanisms:</p>
<ul>
<li><strong>Conditional Attention Guidance</strong>: Uses branch angle directions to guide the attention mechanism, preventing the model from getting lost in complex structures.</li>
<li><strong>Memory Classification</strong>: A module that explicitly stores and classifies &ldquo;unexplored&rdquo; branch points to handle ring closures (reconnections).</li>
<li><strong>Path Selection</strong>: A training strategy that randomly samples traversal paths to prevent overfitting to a specific serialization order.</li>
</ul>
</li>
</ol>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p><strong>Datasets</strong>:</p>
<ul>
<li><strong>Mini-CASIA-CSDB</strong> (Printed): A subset of 97,309 printed molecular structure images, upscaled to $500 \times 500$ resolution.</li>
<li><strong>EDU-CHEMC</strong> (Handwritten): A new dataset of 52,987 images collected from educational settings (cameras, scanners, screens), including erroneous/non-existent structures.</li>
</ul>
<p><strong>Baselines</strong>:</p>
<ul>
<li>Compared against standard <strong>String Decoders (SD)</strong> (based on DenseWAP), tested with both SMILES and SSML on Mini-CASIA-CSDB and exclusively with SSML on EDU-CHEMC.</li>
<li>Compared against <strong>BTTR</strong> and <strong>ABM</strong> (recent mathematical expression recognition models) adapted for the chemical structure task, both using SSML on EDU-CHEMC.</li>
<li>On Mini-CASIA-CSDB, also compared against <strong>WYGIWYS</strong> (a SMILES-based string decoder at 300x300 resolution).</li>
</ul>
<p><strong>Ablation Studies</strong>:</p>
<ul>
<li>Evaluated the impact of removing Path Selection (PS) and Memory Classification (MC) mechanisms on EDU-CHEMC.</li>
<li>Tested robustness to image rotation ($180^{\circ}$) on Mini-CASIA-CSDB.</li>
</ul>
<h2 id="recognition-performance-and-robustness">Recognition Performance and Robustness</h2>
<ul>
<li><strong>Superiority of SSML</strong>: Models trained with SSML significantly outperformed those trained with SMILES (92.09% vs 81.89% EM on printed data) due to reduced semantic gap.</li>
<li><strong>Best Performance</strong>: RCGD achieved the highest Exact Match (EM) scores on both datasets:
<ul>
<li><strong>Mini-CASIA-CSDB</strong>: 95.01% EM.</li>
<li><strong>EDU-CHEMC</strong>: 62.86% EM.</li>
</ul>
</li>
<li><strong>EDU-CHEMC Baselines</strong>: On the handwritten dataset, SD (DenseWAP) achieved 61.35% EM, outperforming both BTTR (58.21% EM) and ABM (58.78% EM). The authors note that BTTR and ABM&rsquo;s reverse training mode, which helps in regular formula recognition, does not transfer well to graph-structured molecular data.</li>
<li><strong>Ablation Results</strong> (Table 5, EDU-CHEMC): Removing Path Selection alone dropped EM from 62.86% to 62.15%. Removing both Path Selection and Memory Classification dropped EM further to 60.31%, showing that memory classification has a larger impact.</li>
<li><strong>Robustness</strong>: RCGD showed minimal performance drop (0.85%) on rotated images compared to SMILES-based methods (10.36% drop). The SD with SSML dropped by 2.19%, confirming that SSML itself improves rotation invariance.</li>
<li><strong>Educational Utility</strong>: The method can recognize and reconstruct chemically invalid structures (e.g., a Carbon atom with 5 bonds), making it applicable for correcting and revising handwritten answers in chemistry education.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>1. EDU-CHEMC (Handwritten)</strong></p>
<ul>
<li><strong>Total Size</strong>: 52,987 images.</li>
<li><strong>Splits</strong>: Training (48,998), Validation (999), Test (2,992).</li>
<li><strong>Characteristics</strong>: Real-world educational data, mixture of isolated molecules and reaction equations, includes invalid chemical structures.</li>
</ul>
<p><strong>2. Mini-CASIA-CSDB (Printed)</strong></p>
<ul>
<li><strong>Total Size</strong>: 97,309 images.</li>
<li><strong>Splits</strong>: Training (80,781), Validation (8,242), Test (8,286).</li>
<li><strong>Preprocessing</strong>: Original $300 \times 300$ images were upscaled to $500 \times 500$ RGB to resolve blurring issues.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SSML Generation</strong></p>
<p>To convert a molecular graph to SSML:</p>
<ol>
<li><strong>Traverse</strong>: Start from the left-most atom.</li>
<li><strong>Bonds/Atoms</strong>: Output atom text and bond format <code>&lt;bond&gt;[:&lt;angle&gt;]</code>.</li>
<li><strong>Branches</strong>: At branch points, use phantom symbols <code>(</code> and <code>)</code> to enclose branches, ordered by ascending bond angle.</li>
<li><strong>Reconnections</strong>: Use <code>?[tag]</code> and <code>?[tag, bond]</code> to mark start/end of ring closures.</li>
</ol>
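<p>The four steps above can be sketched in a few lines of Python on a toy acyclic graph. The graph encoding, bond tokens, and angle values below are illustrative assumptions (ring reconnection marks are omitted), not the paper&rsquo;s implementation:</p>

```python
# Sketch of SSML-style serialization for a toy acyclic molecular graph.
# The graph representation and angle conventions are assumptions for
# illustration; the paper's generator also handles ring reconnections.

def to_ssml(graph, atom, parent=None):
    """Depth-first traversal emitting atom text and <bond>[:<angle>] tokens."""
    out = [graph["atoms"][atom]]
    # Neighbors other than the atom we came from, ordered by ascending angle.
    nbrs = sorted(
        (b for b in graph["bonds"].get(atom, []) if b["to"] != parent),
        key=lambda b: b["angle"],
    )
    for i, b in enumerate(nbrs):
        sub = f'{b["bond"]}[:{b["angle"]}]' + to_ssml(graph, b["to"], atom)
        # All but the last branch are wrapped in phantom symbols ( ).
        out.append(sub if i == len(nbrs) - 1 else f"({sub})")
    return "".join(out)

# Toy molecule: CH3-CH(OH)-CH3 drawn with explicit branch angles.
mol = {
    "atoms": {0: "CH_3", 1: "CH", 2: "OH", 3: "CH_3"},
    "bonds": {
        0: [{"to": 1, "bond": "-", "angle": 0}],
        1: [{"to": 0, "bond": "-", "angle": 180},
            {"to": 2, "bond": "-", "angle": 90},
            {"to": 3, "bond": "-", "angle": 0}],
        2: [{"to": 1, "bond": "-", "angle": 270}],
        3: [{"to": 1, "bond": "-", "angle": 180}],
    },
}

print(to_ssml(mol, 0))  # → CH_3-[:0]CH(-[:0]CH_3)-[:90]OH
```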
<p><strong>2. RCGD Specifics</strong></p>
<ul>
<li><strong>RCGD-SSML</strong>: Modified version of SSML for the decoder. Removes <code>(</code> <code>)</code> delimiters; adds <code>\eob</code> (end of branch). Maintains a dynamic <strong>Branch Angle Set ($M$)</strong>.</li>
<li><strong>Path Selection</strong>: During training, when multiple branches exist in $M$, the model randomly selects one to traverse next. During inference, it uses beam search to score candidate paths.</li>
<li><strong>Loss Function</strong>:
$$
\begin{aligned}
L_{\text{total}} = L_{\text{ce}} + L_{\text{bc}}
\end{aligned}
$$
<ul>
<li>$L_{\text{ce}}$: Cross-entropy loss for character sequence generation.</li>
<li>$L_{\text{bc}}$: Multi-label classification loss for the memory module (predicting reconnection bond types for stored branch states).</li>
</ul>
</li>
</ul>
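<p>A minimal numerical sketch of the combined loss, assuming a single decoding step with hand-picked probabilities (the paper trains on batched tensors; this pure-Python version only illustrates the two terms):</p>

```python
import math

def cross_entropy(probs, target):
    """L_ce for one decoding step: negative log-likelihood of the target token."""
    return -math.log(probs[target])

def binary_cross_entropy(preds, labels):
    """L_bc: multi-label loss over reconnection bond-type predictions."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(preds, labels)
    ) / len(preds)

# One step: the decoder assigns 0.7 to the correct SSML token, and the
# memory module scores three candidate reconnection bond types (made-up values).
l_ce = cross_entropy({"C": 0.7, "O": 0.2, "-": 0.1}, "C")
l_bc = binary_cross_entropy([0.9, 0.1, 0.2], [1, 0, 0])
l_total = l_ce + l_bc
```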
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>: DenseNet</p>
<ul>
<li><strong>Structure</strong>: 3 dense blocks.</li>
<li><strong>Growth Rate</strong>: 24.</li>
<li><strong>Depth</strong>: 32 per block.</li>
<li><strong>Output</strong>: High-dimensional feature map $x \in \mathbb{R}^{d_x \times h \times w}$.</li>
</ul>
<p><strong>Decoder</strong>: GRU with Attention</p>
<ul>
<li><strong>Hidden State Dimension</strong>: 256.</li>
<li><strong>Embedding Dimension</strong>: 256.</li>
<li><strong>Attention Projection</strong>: 128.</li>
<li><strong>Memory Classification Projection</strong>: 256.</li>
</ul>
<p><strong>Training Config</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: Adam.</li>
<li><strong>Learning Rate</strong>: 2e-4 with multi-step decay (gamma 0.5).</li>
<li><strong>Dropout</strong>: 15%.</li>
<li><strong>Strategy</strong>: Teacher-forcing used for validation selection.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Exact Match (EM)</strong>: Percentage of samples where the predicted graph structure perfectly matches the label. For SMILES, string comparison; for SSML, converted to graph for isomorphism check.</li>
<li><strong>Structure EM</strong>: Auxiliary metric for samples with mixed content (text + molecules), counting samples where <em>all</em> molecular structures are correct.</li>
</ul>
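<p>Since EM for SSML reduces to a labeled-graph isomorphism check, the idea can be sketched with a brute-force matcher that is only feasible for tiny graphs; the graph encoding here is a hypothetical simplification, not the paper&rsquo;s evaluator:</p>

```python
from itertools import permutations

def graphs_match(g1, g2):
    """Brute-force labeled-graph isomorphism check (tiny graphs only).
    A graph is (atom_labels, edges) with edges as {(i, j): bond_label}, i < j."""
    atoms1, edges1 = g1
    atoms2, edges2 = g2
    if len(atoms1) != len(atoms2) or len(edges1) != len(edges2):
        return False
    n = len(atoms1)
    for perm in permutations(range(n)):
        # Atom labels must agree under the candidate vertex mapping.
        if any(atoms1[i] != atoms2[perm[i]] for i in range(n)):
            continue
        mapped = {tuple(sorted((perm[i], perm[j]))): b
                  for (i, j), b in edges1.items()}
        if mapped == edges2:
            return True
    return False

# Ethanol written with two different atom orderings still counts as a match.
pred = (["C", "C", "O"], {(0, 1): "-", (1, 2): "-"})
gold = (["O", "C", "C"], {(0, 1): "-", (1, 2): "-"})
print(graphs_match(pred, gold))  # True
```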
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">EDU-CHEMC</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Dataset annotations and download links (actual data hosted on Google Drive)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing Components</strong>:</p>
<ul>
<li>No training or inference code is publicly released; only the dataset is available.</li>
<li>Pre-trained model weights are not provided.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, J., Wu, H., Chen, M., Liu, C., Wu, J., Yin, S., Yin, B., Yin, B., Liu, C., Du, J., &amp; Dai, L. (2023). Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder. <em>Proceedings of the 31st ACM International Conference on Multimedia</em> (pp. 8114-8124). <a href="https://doi.org/10.1145/3581783.3612573">https://doi.org/10.1145/3581783.3612573</a></p>
<p><strong>Publication</strong>: ACM Multimedia 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">GitHub Repository / EDU-CHEMC Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{huHandwrittenChemicalStructure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 31st ACM International Conference on Multimedia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hu, Jinshui and Wu, Hao and Chen, Mingjun and Liu, Chenyu and Wu, Jiajia and Yin, Shi and Yin, Baocai and Yin, Bing and Liu, Cong and Du, Jun and Dai, Lirong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{8114--8124}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Ottawa ON Canada}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3581783.3612573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{979-8-4007-0108-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>End-to-End Transformer for Molecular Image Captioning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</guid><description>Vision Transformer encoder with Transformer decoder for molecular image-to-InChI translation, outperforming CNN baselines on noisy molecular datasets.</description><content:encoded><![CDATA[<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological Paper</strong>. It proposes a novel architectural approach to molecular image translation by replacing the standard CNN encoder with a Vision Transformer (ViT). The authors validate this method through comparative benchmarking against standard CNN+RNN baselines (e.g., ResNet+LSTM) and provide optimizations for inference speed.</p>
<h2 id="motivation-and-problem-statement">Motivation and Problem Statement</h2>
<p>The core problem addressed is that existing molecular translation methods (which extract chemical structures from images into the computer-readable InChI format) rely heavily on rule-based systems or CNN+RNN architectures. These approaches often underperform on noisy images (common in scanned old journals) or images with few distinguishable features. There is a significant need in drug discovery to digitize and analyze legacy experimental data locked in image format within scientific publications.</p>
<h2 id="core-innovations-end-to-end-vit-encoder">Core Innovations: End-to-End ViT Encoder</h2>
<p>The primary contribution is the use of a completely convolution-free Vision Transformer (ViT) as the encoder, allowing the model to utilize long-range dependencies among image patches from the very beginning via self-attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
The architecture is a pure Transformer (Encoder-Decoder), treating the molecular image similarly to a sequence of tokens (patches). Furthermore, the authors implement a specific caching strategy for the decoder to avoid recomputing embeddings for previously decoded tokens, reducing the time complexity of the decoding step.</p>
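<p>The effect of the caching strategy can be illustrated by counting the work done per decoding step. The cost model below (with $M$ encoder patches and $N$ decoded tokens) is a rough sketch of the stated complexities, not the authors&rsquo; implementation:</p>

```python
def naive_cost(M, N):
    """Without caching: step t re-embeds all t decoded tokens, each
    attending over M encoder patches and up to t decoder positions."""
    return sum(t * (M + t) for t in range(1, N + 1))

def cached_cost(M, N):
    """With caching: step t only processes the newest token,
    O(M + t) work per step."""
    return sum(M + t for t in range(1, N + 1))

print(naive_cost(100, 50), cached_cost(100, 50))  # 170425 6275
```

Summing the per-step costs recovers the stated totals: on the order of $MN^2 + N^3$ without the cache versus $MN + N^2$ with it.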
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was compared against a standard CNN + RNN baseline and ResNet (18, 34, 50) + LSTM with attention. Ablation studies varied the number of transformer layers (3, 6, 12, 24) and the image resolution (224x224 vs 384x384). The model was trained on a large combined dataset, including Bristol Myers Squibb data, SMILES, GDB-13, and synthetically augmented images containing noise and artifacts. Performance was evaluated using the Levenshtein distance, the minimum number of single-character edits needed to transform the predicted string into the ground truth.</p>
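<p>The Levenshtein metric can be computed with the standard dynamic-programming recurrence; a compact sketch:</p>

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca with cb
            ))
        prev = cur
    return prev[-1]

# Two InChI-like strings differing by one insertion and one substitution.
print(levenshtein("InChI=1S/CH4", "InChI=1S/C2H6"))  # 2
```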
<h2 id="performance-outcomes-and-capabilities">Performance Outcomes and Capabilities</h2>
<p>The proposed 24-layer ViT model (input size 384) achieved the lowest Levenshtein distance of <strong>6.95</strong>, outperforming the ResNet50+LSTM baseline (7.49) and the standard CNN+RNN (103.7). Increasing the number of layers had a strong positive impact, with the 24-layer model becoming competitive with current approaches. The authors note the model was evaluated on datasets with low distinguishable features and noise, where the ViT encoder&rsquo;s self-attention over all patches from the first layer helped capture relevant structure. The proposed caching optimization reduced the total decoding time complexity from $O(MN^2 + N^3)$ to $O(MN + N^2)$ for $N$ timesteps, by reducing the per-timestep cost to $O(M + N)$.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combined dataset randomly split into 70% training, 10% test, and 20% validation.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Bristol Myers Squibb</strong></td>
          <td>~2.4 million synthetic images with InChI labels.</td>
          <td>Provided by BMS global biopharmaceutical company.</td>
      </tr>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>Kaggle contest data converted to InChI.</td>
          <td>Images generated using RDKit.</td>
      </tr>
      <tr>
          <td><strong><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></strong></td>
          <td>Subset of 977 million small organic molecules (up to 13 atoms).</td>
          <td>Converted from SMILES using RDKit.</td>
      </tr>
      <tr>
          <td><strong>Augmented Images</strong></td>
          <td>Synthetic images with salt/pepper noise, dropped atoms, and bond modifications.</td>
          <td>Used to improve robustness against noise.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Objective</strong>: Cross-entropy loss minimization.</li>
<li><strong>Inference Decoding</strong>: Autoregressive decoding predicting the next character of the InChI string.</li>
<li><strong>Positional Encoding</strong>: Standard sine and cosine functions of different frequencies.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Caching</strong>: Caches the output of each layer during decoding to avoid recomputing embeddings for already decoded tokens.</li>
<li><strong>JIT</strong>: PyTorch JIT compiler used for graph optimization (1.2-1.5x speed increase on GPU).</li>
<li><strong>Self-Critical Training</strong>: Finetuning performed using self-critical sequence training (SCST).</li>
</ul>
</li>
</ul>
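<p>The sinusoidal positional encoding mentioned above follows the original Transformer formulation; a minimal sketch (interleaving sin/cos pairs, assuming an even model dimension):</p>

```python
import math

def positional_encoding(pos, d_model):
    """Standard sinusoidal encoding: sin/cos pairs with frequency
    1 / 10000^(2i / d_model) for dimension pair index i."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

print(positional_encoding(0, 8))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```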
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder (Vision Transformer)</strong>:
<ul>
<li>Input: Flattened 2D patches of the image. Patch size: $16 \times 16$.</li>
<li>Projection: Trainable linear projection to latent vector size $D$.</li>
<li>Structure: Alternating layers of Multi-Head Self-Attention (MHSA) and MLP blocks.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Input: Tokenized InChI string + sinusoidal positional embedding.</li>
<li>Vocabulary: 275 tokens (including <code>&lt;SOS&gt;</code>, <code>&lt;PAD&gt;</code>, <code>&lt;EOS&gt;</code>).</li>
</ul>
</li>
<li><strong>Hyperparameters (Best Model)</strong>:
<ul>
<li>Image Size: $384 \times 384$.</li>
<li>Layers: 24.</li>
<li>Feature Dimension: 512.</li>
<li>Attention Heads: 12.</li>
<li>Optimizer: Adam.</li>
<li>Learning Rate: $3 \times 10^{-5}$ (decayed by 0.5 in last 2 epochs).</li>
<li>Batch Size: Varied [64-512].</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Levenshtein Distance (lower is better).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Image Size</th>
          <th>Layers</th>
          <th>Epochs</th>
          <th>Levenshtein Dist.</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard CNN+RNN</td>
          <td>224</td>
          <td>3</td>
          <td>10</td>
          <td>103.7</td>
      </tr>
      <tr>
          <td>ResNet18 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>75.03</td>
      </tr>
      <tr>
          <td>ResNet34 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>45.72</td>
      </tr>
      <tr>
          <td>ResNet50 + LSTM</td>
          <td>224</td>
          <td>5</td>
          <td>10</td>
          <td>7.49</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>3</td>
          <td>5</td>
          <td>79.82</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>6</td>
          <td>5</td>
          <td>54.58</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>12</td>
          <td>5</td>
          <td>31.30</td>
      </tr>
      <tr>
          <td>ViT Transformers (Best)</td>
          <td>384</td>
          <td>24</td>
          <td>10</td>
          <td><strong>6.95</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>System</strong>: 70GB GPU system.</li>
<li><strong>Framework</strong>: PyTorch and PyTorch Lightning.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sundaramoorthy, C., Kelvin, L. Z., Sarin, M., &amp; Gupta, S. (2021). End-to-End Attention-based Image Captioning. <em>arXiv preprint arXiv:2104.14721</em>. <a href="https://doi.org/10.48550/arXiv.2104.14721">https://doi.org/10.48550/arXiv.2104.14721</a></p>
<p><strong>Publication</strong>: arXiv 2021 (preprint)</p>
<p><strong>Note</strong>: This is an arXiv preprint and has not undergone formal peer review.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{sundaramoorthyEndtoEndAttentionbasedImage2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{End-to-{{End Attention-based Image Captioning}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sundaramoorthy, Carola and Kelvin, Lin Ziwen and Sarin, Mahak and Gupta, Shubham}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER 1.0: Transformers for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</guid><description>Transformer-based approach for Optical Chemical Structure Recognition converting chemical images to SELFIES strings with 96% accuracy.</description><content:encoded><![CDATA[<h2 id="evaluating-the-contribution-a-methodological-shift">Evaluating the Contribution: A Methodological Shift</h2>
<p><strong>Method (Dominant)</strong> with strong <strong>Resource</strong> elements.</p>
<p>This is primarily a <strong>Method</strong> paper because it proposes a specific architectural evolution. It replaces CNN-RNN/Encoder-Decoder models with a <strong>Transformer-based network</strong> to solve the problem of image-to-structure translation. It validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.</p>
<p>It also serves as a <strong>Resource</strong> contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (&gt;35 million molecules).</p>
<h2 id="motivation-inaccessible-chemical-knowledge">Motivation: Inaccessible Chemical Knowledge</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.</li>
<li><strong>Manual Bottlenecks</strong>: Manual curation and extraction of this data is tedious, slow, and error-prone.</li>
<li><strong>Limitations of Prior Tools</strong>: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.</li>
</ul>
<h2 id="key-innovation-transformer-based-molecular-translation">Key Innovation: Transformer-Based Molecular Translation</h2>
<ul>
<li><strong>Transformer Architecture</strong>: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a <strong>Transformer-based decoder</strong>, significantly improving accuracy.</li>
<li><strong>EfficientNet Backbone</strong>: Replaces the standard InceptionV3 feature extractor with <strong>EfficientNet-B3</strong>, which improved feature extraction quality for chemical images.</li>
<li><strong>SELFIES Representation</strong>: Utilizes <a href="/notes/chemistry/molecular-representations/notations/selfies/"><strong>SELFIES</strong></a> (SELF-referencing Embedded Strings) as the target output. This guarantees 100% robust molecular strings and eliminates the &ldquo;invalid SMILES&rdquo; problem common in generative models.</li>
<li><strong>Massive Scaling</strong>: Trains on synthetic datasets derived from PubChem (up to <strong>39 million molecules</strong> total, with the largest training subset at ~35 million), demonstrating that scaling data size directly correlates with improved model performance.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<ul>
<li><strong>Feature Extractor Ablation</strong>: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics. The structural similarity between predicted and ground truth molecules was measured via Tanimoto similarity over molecular fingerprints:
$$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$</li>
<li><strong>Data Scaling</strong>: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.</li>
<li><strong>Stereochemistry &amp; Ions</strong>: Tested the model&rsquo;s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.</li>
<li><strong>Augmentation Robustness</strong>: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.</li>
</ul>
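<p>For binary fingerprints, the dot products in the Tanimoto formula reduce to set cardinalities, so the metric can be sketched directly over sets of on-bits (the fingerprint values below are made up for illustration):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity for binary fingerprints given as sets of on-bits:
    |A∩B| / (|A| + |B| - |A∩B|), matching T(A,B) = A·B / (|A|² + |B|² - A·B)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Hypothetical on-bit sets for a predicted and a ground-truth molecule.
fp_pred = {1, 5, 9, 42}
fp_true = {1, 5, 9, 77}
print(tanimoto(fp_pred, fp_true))  # 0.6
```

An exact fingerprint match gives a Tanimoto score of 1.0, the threshold the paper uses for its &ldquo;Tanimoto 1.0&rdquo; exact-match statistic.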
<h2 id="results-and-scaling-observations">Results and Scaling Observations</h2>
<ul>
<li><strong>Architecture Comparison</strong>: The Transformer model with EfficientNet-B3 features outperformed the Encoder-Decoder baseline by a wide margin. On the 1M dataset, the Transformer achieved <strong>74.57%</strong> exact matches (Tanimoto 1.0) compared to only <strong>7.03%</strong> for the Encoder-Decoder (Table 4 in the paper).</li>
<li><strong>High Accuracy at Scale</strong>: With the full 35-million molecule training set (Dataset 1), the model achieved a <strong>Tanimoto 1.0 score of 96.47%</strong> and an average Tanimoto similarity of 0.99.</li>
<li><strong>Isomorphism</strong>: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>).</li>
<li><strong>Stereochemistry Costs</strong>: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).</li>
<li><strong>Hardware Efficiency</strong>: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs. For the 1M molecule model, convergence took ~8h 41min on TPU v3-8 vs ~29h 48min on V100 GPU. The largest model (35M) took less than 14 days on TPU.</li>
<li><strong>Augmentation Robustness (Dataset 3)</strong>: When trained on augmented images and tested on non-augmented images, the model achieved 86.43% Tanimoto 1.0. Using a pre-trained model from Dataset 2 and refitting on augmented images improved this to 88.04% on non-augmented test images and 80.87% on augmented test images, retaining above 97% isomorphism rates.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors generated synthetic data from PubChem.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 1 (Clean)</td>
          <td>39M total (35M train)</td>
          <td>No stereo/ions. Filtered for MW &lt; 1500, bond count 3-40, SMILES len &lt; 40.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 2 (Complex)</td>
          <td>37M total (33M train)</td>
          <td>Includes stereochemistry and charged groups (ions).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 3 (Augmented)</td>
          <td>37M total (33M train)</td>
          <td>Dataset 2 with image augmentations applied.</td>
      </tr>
      <tr>
          <td><strong>Preprocessing</strong></td>
          <td>N/A</td>
          <td>N/A</td>
          <td>Molecules converted to <strong>SELFIES</strong>. Images generated via CDK Structure Diagram Generator (SDG) as $299 \times 299$ 8-bit PNGs.</td>
      </tr>
      <tr>
          <td><strong>Format</strong></td>
          <td>TFRecords</td>
          <td>75 MB chunks</td>
          <td>128 Data points (image vector + tokenized string) per record.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Text Representation</strong>: <strong>SELFIES</strong> used to avoid invalid intermediate strings. Tokenized via Keras tokenizer.
<ul>
<li><em>Dataset 1 Tokens</em>: 27 unique tokens. Max length 47.</li>
<li><em>Dataset 2/3 Tokens</em>: 61 unique tokens (due to stereo/ion tokens).</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Implemented using <code>imgaug</code> python package. Random application of:
<ul>
<li>Gaussian/Average Blur, Additive Gaussian Noise, Salt &amp; Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler (following the &ldquo;Attention is all you need&rdquo; paper).</li>
</ul>
<h3 id="models">Models</h3>
<p>The final architecture is an <strong>Image-to-SELFIES Transformer</strong>.</p>
<ul>
<li><strong>Encoder (Feature Extractor)</strong>:
<ul>
<li><strong>EfficientNet-B3</strong> (pre-trained on Noisy-student).</li>
<li>Input: $299 \times 299 \times 3$ images (normalized -1 to 1).</li>
<li>Output Feature Vector: $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder (Transformer)</strong>:
<ul>
<li>4 Encoder-Decoder layers.</li>
<li>8 Parallel Attention Heads.</li>
<li>Dimension size: 512.</li>
<li>Feed-forward size: 2048.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td><strong>96.47%</strong></td>
          <td>74.57% (1M subset)</td>
          <td>Percentage of predictions with perfect fingerprint match (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Avg Tanimoto</strong></td>
          <td><strong>0.9923</strong></td>
          <td>0.9371 (1M subset)</td>
          <td>Average similarity score (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Isomorphism</strong></td>
          <td><strong>99.75%</strong></td>
          <td>-</td>
          <td>Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Hardware</strong>: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.</li>
<li><strong>Comparison Hardware</strong>: Nvidia Tesla V100 (32GB GPU).</li>
<li><strong>Performance</strong>:
<ul>
<li>TPU v3-8 was ~4x faster than V100 GPU.</li>
<li>Convergence for the 1-million-molecule model: 8 h 41 min on TPU vs. ~29 h 48 min on GPU.</li>
<li>Largest model (35M) took less than 14 days on TPU.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper is open-access, and both code and data are publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER-TPU (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using TensorFlow and TPU training</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archival snapshot of the codebase</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>SMILES data used for training (images generated via CDK SDG)</td>
      </tr>
      <tr>
          <td><a href="https://decimer.ai/">DECIMER Project Page</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Project landing page</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Hardware Requirements</strong>: Training requires TPU v3-8 (Google Cloud) or Nvidia V100 GPU. The largest model (35M molecules) took less than 14 days on TPU v3-8.</li>
<li><strong>Missing Components</strong>: None of note. Augmentation parameters are documented in the paper (Table 14), and pre-trained model weights are available through the GitHub repository.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. <em>Journal of Cheminformatics</em>, 13(1), 61. <a href="https://doi.org/10.1186/s13321-021-00538-8">https://doi.org/10.1186/s13321-021-00538-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">GitHub Repository</a></li>
<li><a href="https://decimer.ai/">DECIMER Project Page</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMER10Deep2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{DECIMER 1.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00538-8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1186/s13321-021-00538-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemPix: Hand-Drawn Hydrocarbon Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</guid><description>Deep learning framework using CNN-LSTM image captioning to convert hand-drawn hydrocarbon structures into SMILES strings with 76% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-core-contribution">Paper Classification and Core Contribution</h2>
<p>This is primarily a <strong>Method</strong> paper, with a secondary contribution as a <strong>Resource</strong> paper.</p>
<p>The paper&rsquo;s core contribution is the <strong>ChemPix architecture and training strategy</strong> using neural image captioning (CNN-LSTM) to convert hand-drawn chemical structures to SMILES. The extensive ablation studies on synthetic data generation (augmentation, degradation, backgrounds) and ensemble learning strategies confirm the methodological focus. The secondary resource contribution includes releasing a curated dataset of hand-drawn hydrocarbons and code for generating synthetic training data.</p>
<h2 id="the-structural-input-bottleneck-in-computational-chemistry">The Structural Input Bottleneck in Computational Chemistry</h2>
<p>Inputting molecular structures into computational chemistry software for quantum calculations is often a bottleneck, requiring domain expertise and cumbersome manual entry in drawing software. While optical chemical structure recognition (OCSR) tools exist, they typically struggle with the noise and variability of hand-drawn sketches. There is a practical need for a tool that allows chemists to simply photograph a hand-drawn sketch and immediately convert it into a machine-readable format (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>), making computational workflows more accessible.</p>
<h2 id="cnn-lstm-image-captioning-and-synthetic-generalization">CNN-LSTM Image Captioning and Synthetic Generalization</h2>
<ol>
<li><strong>Image Captioning Paradigm</strong>: The authors treat the problem as <strong>neural image captioning</strong>, using an encoder-decoder (CNN-LSTM) framework to &ldquo;translate&rdquo; an image directly to a SMILES string. This avoids the complexity of explicit atom/bond detection and graph assembly.</li>
<li><strong>Synthetic Data Engineering</strong>: The paper introduces a rigorous synthetic data generation pipeline that transforms clean RDKit-generated images into &ldquo;pseudo-hand-drawn&rdquo; images via randomized backgrounds, degradation, and heavy augmentation. This allows the model to achieve &gt;50% accuracy on real hand-drawn data without ever seeing it during training.</li>
<li><strong>Ensemble Uncertainty Estimation</strong>: The method utilizes a &ldquo;committee&rdquo; (ensemble) of networks to improve accuracy and estimate confidence based on vote agreement, providing users with reliability indicators for predictions.</li>
</ol>
<h2 id="extensive-ablation-and-real-world-evaluation">Extensive Ablation and Real-World Evaluation</h2>
<ol>
<li><strong>Ablation Studies on Data Pipeline</strong>: The authors trained models on datasets generated at different stages of the pipeline (Clean RDKit $\rightarrow$ Augmented $\rightarrow$ Backgrounds $\rightarrow$ Degraded) to quantify the value of each transformation in bridging the synthetic-to-real domain gap.</li>
<li><strong>Sample Size Scaling</strong>: They analyzed performance scaling by training on synthetic dataset sizes ranging from 10,000 to 500,000 images to understand data requirements.</li>
<li><strong>Real-world Validation</strong>: The model was evaluated on a held-out test set of hand-drawn images collected via a custom web app, providing genuine out-of-distribution testing.</li>
<li><strong>Fine-tuning Experiments</strong>: Comparisons of synthetic-only training versus fine-tuning with a small fraction of real hand-drawn data to assess the value of limited real-world supervision.</li>
</ol>
<h2 id="state-of-the-art-hand-drawn-ocsr-performance">State-of-the-Art Hand-Drawn OCSR Performance</h2>
<ol>
<li>
<p><strong>Pipeline Efficacy</strong>: Augmentation and image degradation were the most critical factors for generalization, achieving over 50% accuracy on hand-drawn data when training with 500,000 synthetic images. Adding backgrounds had a negligible effect on accuracy compared to degradation.</p>
</li>
<li>
<p><strong>State-of-the-Art Performance</strong>: The final ensemble model (5 out of 17 trained NNs, selected for achieving &gt;50% individual accuracy) achieved <strong>76% accuracy</strong> (top-1) and <strong>85.5% accuracy</strong> (top-3) on the hand-drawn test set, a significant improvement over the best single model&rsquo;s 67.5%.</p>
</li>
<li>
<p><strong>Synthetic Generalization</strong>: A model trained on 500,000 synthetic images achieved &gt;50% accuracy on real hand-drawn data without any fine-tuning, validating the synthetic data generation strategy as a viable alternative to expensive manual labeling.</p>
</li>
<li>
<p><strong>Ensemble Benefits</strong>: The voting committee approach improved accuracy and provided interpretable uncertainty estimates through vote distributions. When all five committee members agree ($V=5$), the confidence value reaches 98%.</p>
</li>
</ol>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations of the current system:</p>
<ul>
<li><strong>Hydrocarbons only</strong>: The model is restricted to hydrocarbon structures and does not handle heteroatoms or functional groups.</li>
<li><strong>No conjoined rings</strong>: Molecules with multiple conjoined rings are excluded due to limitations of RDKit&rsquo;s image generation, which depicts bridges differently from standard chemistry drawing conventions.</li>
<li><strong>Resonance hybrid notation</strong>: The network struggles with benzene rings drawn in the resonance-hybrid style (with an inscribed circle) compared to the Kekulé structure, since the RDKit training images use exclusively Kekulé representations.</li>
<li><strong>Challenging backgrounds</strong>: Lined and squared paper increase recognition difficulty, and structures bleeding through from the opposite side of the page can confuse the network.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study relies on two primary data sources: a massive synthetic dataset generated procedurally and a smaller collected dataset of real drawings.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic (RDKit)</td>
          <td>500,000 images</td>
          <td>Generated via RDKit with &ldquo;heavy&rdquo; augmentation: rotation ($0-360°$), blur, salt-and-pepper noise, and background texture addition.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-Drawn (Real)</td>
          <td>613 images</td>
          <td>Crowdsourced via a web app from over 100 unique users; split into 200-image test set and 413 training/validation images.</td>
      </tr>
      <tr>
          <td><strong>Backgrounds</strong></td>
          <td>Texture Images</td>
          <td>1,052 images</td>
          <td>A pool of unlabeled texture photos (paper, desks, shadows) used to generate synthetic backgrounds.</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation Parameters</strong>:</p>
<ul>
<li><strong>Augmentations</strong>: Rotation, Resize ($200-300px$), Blur, Dilate, Erode, Aspect Ratio, Affine transform ($\pm 20px$), Contrast, Quantize, Sharpness</li>
<li><strong>Backgrounds</strong>: Randomly translated $\pm 100$ pixels and reflected</li>
</ul>
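<p>Two of these degradation steps can be sketched in pure Python on a grayscale image stored as nested lists. Parameters and helper names are illustrative; the actual pipeline operates on rendered PNGs with the larger battery of transforms listed above.</p>

```python
import random

def salt_and_pepper(img, p, rng=None):
    """Flip each pixel to black (0) or white (255) with probability p each,
    mimicking the degradation used to make clean renders look photographed."""
    rng = rng or random.Random(0)
    out = []
    for row in img:
        new_row = []
        for px in row:
            r = rng.random()
            if r < p:
                new_row.append(0)        # pepper
            elif r < 2 * p:
                new_row.append(255)      # salt
            else:
                new_row.append(px)
        out.append(new_row)
    return out

def translate(img, dx, dy, fill=255):
    """Shift the image by (dx, dy) pixels, padding with background."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out

clean = [[255] * 8 for _ in range(8)]
clean[0][0] = 0                           # a single dark "ink" pixel
noisy = salt_and_pepper(clean, p=0.1)
shifted = translate(clean, dx=2, dy=1)    # ink pixel moves from (0, 0) to (1, 2)
```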
<h3 id="algorithms">Algorithms</h3>
<p><strong>Ensemble Voting</strong><br>
A committee of networks casts votes for the predicted SMILES string. The final prediction is the one with the highest vote count. Validity of SMILES is checked using RDKit.</p>
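<p>A minimal sketch of the committee vote, with a pluggable validity check standing in for the RDKit parse (helper names are illustrative):</p>

```python
from collections import Counter

def committee_vote(predictions, is_valid=lambda s: True):
    """Majority vote over the SMILES strings predicted by an ensemble.
    Invalid strings (checked with RDKit in the paper; `is_valid` is a
    stand-in here) are discarded before counting votes."""
    valid = [p for p in predictions if is_valid(p)]
    if not valid:
        return None, 0
    best, votes = Counter(valid).most_common(1)[0]
    return best, votes

preds = ["CCO", "CCO", "C(C)O", "CCO", "CC"]
smiles, votes = committee_vote(preds)
print(smiles, votes)  # CCO 3
```

<p>The vote count doubles as the confidence signal: the paper reports 98% confidence when all five committee members agree.</p>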
<p><strong>Beam Search</strong><br>
Used in the decoding layer with a beam width of $k=5$ to explore multiple potential SMILES strings. It approximates the sequence $\mathbf{\hat{y}}$ that maximizes the joint probability:</p>
<p>$$ \mathbf{\hat{y}} = \arg\max_{\mathbf{y}} \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
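<p>The decoding loop can be sketched with a toy stand-in for the LSTM&rsquo;s per-step distribution (the transition table below is invented for illustration; only the beam mechanics mirror the paper&rsquo;s setup):</p>

```python
def beam_search(step_logprobs, vocab, k=5, max_len=4, eos="$"):
    """Keep the k prefixes with the highest joint log-probability,
    expanding each by every vocabulary token until they emit `eos`."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished sequence
                continue
            logprobs = step_logprobs(prefix)
            for tok in vocab:
                candidates.append((prefix + (tok,), score + logprobs[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Toy "decoder": prefers C, then O, then end-of-sequence.
def toy_model(prefix):
    table = {0: {"C": -0.1, "O": -3.0, "$": -4.0},
             1: {"C": -2.0, "O": -0.2, "$": -3.0},
             2: {"C": -3.0, "O": -2.5, "$": -0.1}}
    return table.get(len(prefix), {"C": -1.0, "O": -1.0, "$": -0.5})

best_seq, best_score = beam_search(toy_model, ["C", "O", "$"], k=3)[0]
print("".join(best_seq))  # CO$
```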
<p><strong>Optimization</strong>:</p>
<ul>
<li>
<p><strong>Optimizer</strong>: Adam</p>
</li>
<li>
<p><strong>Learning Rate</strong>: $1 \times 10^{-4}$</p>
</li>
<li>
<p><strong>Batch Size</strong>: 20</p>
</li>
<li>
<p><strong>Loss Function</strong>: Cross-entropy loss across the sequence of $T$ tokens, computed as:</p>
<p>$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
<p>where $\mathbf{x}$ is the image representation and $y_t$ is the predicted SMILES character. This is calculated as perplexity for validation.</p>
</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture is a standard image captioning model (Show, Attend and Tell style) adapted for chemical structures.</p>
<p><strong>Encoder (CNN)</strong>:</p>
<ul>
<li><strong>Input</strong>: 256x256 pixel PNG images</li>
<li><strong>Structure</strong>: 4 blocks of Conv2D + MaxPool
<ul>
<li>Block 1: 64 filters, (3,3) kernel</li>
<li>Block 2: 128 filters, (3,3) kernel</li>
<li>Block 3: 256 filters, (3,3) kernel</li>
<li>Block 4: 512 filters, (3,3) kernel</li>
</ul>
</li>
<li><strong>Activation</strong>: ReLU throughout</li>
</ul>
<p><strong>Decoder (LSTM)</strong>:</p>
<ul>
<li><strong>Hidden Units</strong>: 512</li>
<li><strong>Embedding Dimension</strong>: 80</li>
<li><strong>Attention</strong>: intermediate attention vector dimension of 512</li>
</ul>
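<p>A quick sanity check on the encoder geometry above: with $3 \times 3$ convolutions that preserve spatial size (an assumption; padding is not stated in the summary) and a $2 \times 2$ max-pool per block, four blocks take a $256 \times 256$ input down to a $16 \times 16 \times 512$ feature map.</p>

```python
def encoder_output_shape(h, w, blocks=(64, 128, 256, 512)):
    """Shape bookkeeping for the 4 Conv2D + MaxPool blocks: assuming 'same'
    padding, each block halves the spatial size and sets the channel count."""
    channels = None
    for filters in blocks:
        h, w = h // 2, w // 2   # 2x2 max-pool
        channels = filters
    return h, w, channels

print(encoder_output_shape(256, 256))  # (16, 16, 512)
```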
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Exact SMILES match accuracy (character-by-character identity between predicted and ground truth SMILES)</li>
<li><strong>Perplexity</strong>: Used for saving model checkpoints (minimizing uncertainty)</li>
<li><strong>Top-k Accuracy</strong>: Reported for $k=1$ (76%) and $k=3$ (85.5%)</li>
</ul>
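<p>The accuracy metrics reduce to simple string comparisons over ranked beam candidates. A sketch with invented toy predictions:</p>

```python
def topk_accuracy(ranked_preds, targets, k):
    """Fraction of molecules whose ground-truth SMILES appears among the
    top-k beam-search candidates (exact string match)."""
    hits = sum(target in preds[:k] for preds, target in zip(ranked_preds, targets))
    return hits / len(targets)

# Toy beam outputs for two molecules, ranked by probability.
ranked = [["CCO", "CC", "CCC"], ["CC", "O", "CO"]]
truth = ["CCO", "O"]
print(topk_accuracy(ranked, truth, k=1))  # 0.5
print(topk_accuracy(ranked, truth, k=3))  # 1.0
```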
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mtzgroup/ChemPixCH">ChemPixCH</a></td>
          <td>Code + Dataset</td>
          <td>Apache-2.0</td>
          <td>Official implementation with synthetic data generation pipeline and collected hand-drawn dataset</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weir, H., Thompson, K., Woodward, A., Choi, B., Braun, A., &amp; Martínez, T. J. (2021). ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning. <em>Chemical Science</em>, 12(31), 10622-10633. <a href="https://doi.org/10.1039/D1SC02957F">https://doi.org/10.1039/D1SC02957F</a></p>
<p><strong>Publication</strong>: Chemical Science 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/mtzgroup/ChemPixCH">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{weir2021chempix,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Weir, Hayley and Thompson, Keiran and Woodward, Amelia and Choi, Benjamin and Braun, Augustin and Mart{\&#39;i}nez, Todd J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{10622--10633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D1SC02957F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ABC-Net: Keypoint-Based Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</guid><description>Deep learning OCSR model using keypoint estimation to detect atom and bond centers for graph-based molecular structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-and-paper-type">Contribution and Paper Type</h2>
<p><strong>Method</strong>. The paper proposes a novel architectural framework (ABC-Net) for Optical Chemical Structure Recognition (OCSR). It reformulates the problem from image captioning (sequence generation) to keypoint estimation (pixel-wise detection), backed by ablation studies on noise and comparative benchmarks against state-of-the-art tools.</p>
<h2 id="motivation-for-keypoint-based-ocsr">Motivation for Keypoint-Based OCSR</h2>
<ul>
<li><strong>Inefficiency of Rule-Based Methods</strong>: Traditional tools (OSRA, MolVec) rely on hand-coded rules that are brittle, require domain expertise, and fail to handle the wide variance in molecular drawing styles.</li>
<li><strong>Data Inefficiency of Captioning Models</strong>: Recent Deep Learning approaches (like DECIMER, Img2mol) treat OCSR as image captioning (Image-to-SMILES). This is data-inefficient because canonical SMILES require learning traversal orders, necessitating millions of training examples.</li>
<li><strong>Goal</strong>: To create a scalable, data-efficient model that predicts graph structures directly by detecting atomic/bond primitives.</li>
</ul>
<h2 id="abc-nets-divide-and-conquer-architecture">ABC-Net&rsquo;s Divide-and-Conquer Architecture</h2>
<ul>
<li><strong>Divide-and-Conquer Strategy</strong>: ABC-Net breaks the problem down into detecting <strong>atom centers</strong> and <strong>bond centers</strong> as independent keypoints.</li>
<li><strong>Keypoint Estimation</strong>: A Fully Convolutional Network (FCN) generates heatmaps for object centers. This is inspired by computer vision techniques like CornerNet and CenterNet.</li>
<li><strong>Angle-Based Bond Detection</strong>: To handle overlapping bonds, the model classifies bond angles into 60 distinct bins ($0-360°$) at detected bond centers, allowing separation of intersecting bonds.</li>
<li><strong>Implicit Hydrogen Prediction</strong>: The model explicitly predicts the number of implicit hydrogens for heterocyclic atoms to resolve ambiguity in dearomatization.</li>
</ul>
<h2 id="experimental-setup-and-synthetic-data">Experimental Setup and Synthetic Data</h2>
<ul>
<li><strong>Dataset Construction</strong>: Synthetic dataset of 100,000 molecules from ChEMBL, rendered using two different engines (RDKit and Indigo) to ensure style diversity.</li>
<li><strong>Baselines</strong>: Compared against two rule-based methods (MolVec, OSRA) and one deep learning method (Img2mol).</li>
<li><strong>Robustness Testing</strong>: Evaluated on the external UOB dataset (real-world images) and synthetic images with varying levels of salt-and-pepper noise (up to $p=0.6$).</li>
<li><strong>Data Efficiency</strong>: Analyzed performance scaling with training set size (10k to 160k images).</li>
</ul>
<h2 id="results-generalization-and-noise-robustness">Results, Generalization, and Noise Robustness</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ABC-Net achieved <strong>94-98% accuracy</strong> across all test sets (Table 1), outperforming MolVec (12-45% on synthetic data, ~83% on UOB), OSRA (26-62% on synthetic, ~82% on UOB), and Img2mol (78-93% on non-stereo subsets).</li>
<li><strong>Generalization</strong>: On the external UOB benchmark, ABC-Net achieved <strong>&gt;95% accuracy</strong>, whereas the deep learning baseline (Img2mol) dropped to 78.2%, indicating better generalization.</li>
<li><strong>Data Efficiency</strong>: The model reached ~95% performance with only 80,000 training images, requiring roughly an order of magnitude less data than captioning-based models like Img2mol (which use millions of training examples).</li>
<li><strong>Noise Robustness</strong>: Performance remained stable (&lt;2% drop) with noise levels up to $p=0.1$. Even at extreme noise ($p=0.6$), Tanimoto similarity remained high, suggesting the model recovers most substructures even when exact matches fail.</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Drawing style coverage</strong>: The synthetic training data covers only styles available through RDKit and Indigo renderers. Many real-world styles (e.g., hand-drawn structures, atomic group abbreviations) are not represented.</li>
<li><strong>No stereo baseline from Img2mol</strong>: The Img2mol comparison only covers non-stereo subsets because stereo results were not available from the original Img2mol paper.</li>
<li><strong>Scalability to large molecules</strong>: Molecules with more than 50 non-hydrogen atoms are excluded from the dataset, and performance on such large structures is untested.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/zhang-xuan1314/ABC-Net">ABC-Net Repository</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation. Missing requirements.txt and pre-trained weights.</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status: Partially Reproducible</strong>. The code is provided, but key components like the pre-trained weights, exact training environment dependencies, and the generated synthetic datasets are missing from the open-source release, making exact reproduction difficult.</p>
<h3 id="data">Data</h3>
<p>The authors constructed a synthetic dataset because labeled pixel-wise OCSR data is unavailable.</p>
<ul>
<li><strong>Source</strong>: ChEMBL database</li>
<li><strong>Filtering</strong>: Excluded molecules with &gt;50 non-H atoms or rare atom types/charges (&lt;1000 occurrences).</li>
<li><strong>Sampling</strong>: 100,000 unique SMILES selected such that every atom type/charge appears in at least 1,000 compounds.</li>
<li><strong>Generation</strong>: Images generated via <strong>RDKit</strong> and <strong>Indigo</strong> libraries.
<ul>
<li><em>Augmentation</em>: Varied bond thickness, label mode, orientation, and aromaticity markers.</li>
<li><em>Resolution</em>: $512 \times 512$ pixels.</li>
<li><em>Noise</em>: Salt-and-pepper noise added during training ($P$ = probability of flipping a background pixel, with $Q = 50P$).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (RDKit/Indigo)</td>
          <td>80k</td>
          <td>8:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>UOB Dataset</td>
          <td>~5.7k images</td>
          <td>External benchmark from Univ. of Birmingham</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Keypoint Detection (Heatmaps)</strong></p>
<ul>
<li>
<p><strong>Down-sampling</strong>: Input $512 \times 512$ → Output $128 \times 128$ (stride 4).</p>
</li>
<li>
<p><strong>Label Softening</strong>: To handle discretization error, ground truth peaks are set to 1, first-order neighbors to 0.95, others to 0.</p>
</li>
<li>
<p><strong>Loss Function</strong>: Penalty-reduced pixel-wise binary focal loss (variants of CornerNet loss). The loss formulation is given as:</p>
<p>$$ L_{det} = - \frac{1}{N} \sum_{x,y} \begin{cases} (1 - \hat{A}_{x,y})^{\alpha} \log(\hat{A}_{x,y}) &amp; \text{if } A_{x,y} = 1 \\ (1 - A_{x,y}) (\hat{A}_{x,y})^{\alpha} \log(1 - \hat{A}_{x,y}) &amp; \text{otherwise} \end{cases} $$</p>
<ul>
<li>$\alpha=2$ (focal parameter). The $(1 - A_{x,y})$ term reduces the penalty for first-order neighbors of ground truth locations.</li>
<li>Property classification losses use a separate focal parameter $\beta=2$ with weight balancing: classes with &lt;10% frequency are weighted 10x.</li>
</ul>
</li>
</ul>
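<p>The detection loss is easy to read off the formula directly. A pure-Python sketch over a three-pixel &ldquo;heatmap&rdquo; (values invented for illustration):</p>

```python
import math

def focal_loss(pred, gt, alpha=2.0):
    """Penalty-reduced pixel-wise focal loss (CornerNet/CenterNet style).
    `pred` and `gt` are flat lists of per-pixel values; ground-truth peaks
    are 1.0, softened first-order neighbours 0.95, background 0."""
    n = sum(1 for g in gt if g == 1.0) or 1   # number of keypoints
    total = 0.0
    for p, g in zip(pred, gt):
        if g == 1.0:
            total += (1 - p) ** alpha * math.log(p)
        else:
            # the (1 - g) factor shrinks the penalty near softened neighbours
            total += (1 - g) * p ** alpha * math.log(1 - p)
    return -total / n

loss = focal_loss(pred=[0.9, 0.5, 0.1], gt=[1.0, 0.95, 0.0])  # ~0.0108
```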
<p><strong>2. Bond Direction Classification</strong></p>
<ul>
<li><strong>Angle Binning</strong>: $360°$ divided into 60 intervals.</li>
<li><strong>Inference</strong>: A bond is detected if the angle probability is a local maximum and exceeds a threshold.</li>
<li><strong>Non-Maximum Suppression (NMS)</strong>: Required for opposite angles (e.g., $30°$ and $210°$) representing the same non-stereo bond.</li>
</ul>
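<p>A sketch of this inference step, assuming a 60-bin probability vector at a detected bond center (threshold and values invented):</p>

```python
def detect_bond_angles(probs, threshold=0.5):
    """Keep angle bins that are local maxima above a threshold, then merge
    opposite bins (i and i + 30 of 60) that describe the same non-stereo
    bond seen from its two endpoints."""
    n = len(probs)
    peaks = [i for i in range(n)
             if probs[i] > threshold
             and probs[i] >= probs[(i - 1) % n]
             and probs[i] >= probs[(i + 1) % n]]
    merged = set()
    for i in peaks:
        opposite = (i + n // 2) % n
        merged.add(min(i, opposite) if opposite in peaks else i)
    return sorted(merged)

probs = [0.0] * 60
probs[5] = 0.9    # a bond at ~30 degrees ...
probs[35] = 0.85  # ... and its opposite direction at ~210 degrees
print(detect_bond_angles(probs))  # [5]
```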
<p><strong>3. Multi-Task Weighting</strong></p>
<ul>
<li>Uses Kendall&rsquo;s uncertainty weighting to balance 8 different loss terms (atom det, bond det, atom type, charge, H-count, bond angle, bond type, bond length).</li>
</ul>
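<p>The weighting scheme can be sketched as follows, with the eight task losses and the learned log-variances reduced to plain floats (values illustrative):</p>

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Kendall-style homoscedastic uncertainty weighting: each task loss L_i
    is scaled by exp(-s_i) and regularised by +s_i, where s_i = log(sigma_i^2)
    is a learned per-task parameter (plain floats here)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# Eight tasks: atom det, bond det, atom type, charge, H-count,
# bond angle, bond type, bond length (values invented).
losses = [1.2, 0.8, 0.5, 0.3, 0.4, 0.9, 0.6, 0.2]
total = uncertainty_weighted_loss(losses, [0.0] * 8)
# with all s_i = 0 this is just the plain sum of the task losses
```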
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: ABC-Net (Custom U-Net / FCN)</p>
<ul>
<li><strong>Input</strong>: $512 \times 512 \times 1$ (Grayscale).</li>
<li><strong>Contracting Path</strong>: 5 steps. Each step has conv-blocks + $2 \times 2$ MaxPool.</li>
<li><strong>Expansive Path</strong>: 3 steps. Transpose-Conv upsampling + Concatenation (Skip Connections).</li>
<li><strong>Heads</strong>: Separate $1 \times 1$ convs for each task map (Atom Heatmap, Bond Heatmap, Property Maps).</li>
<li><strong>Output Dimensions</strong>:
<ul>
<li>Heatmaps: $(1, 128, 128)$</li>
<li>Bond Angles: $(60, 128, 128)$</li>
</ul>
</li>
<li><strong>Pre-trained Weights</strong>: Not included in the public <a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub repository</a>. The paper&rsquo;s availability statement mentions code and training datasets but not weights.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Detection</strong>: Precision &amp; Recall (Object detection level).</li>
<li><strong>Regression</strong>: Mean Absolute Error (MAE) for bond lengths.</li>
<li><strong>Structure Recovery</strong>:
<ul>
<li><em>Accuracy</em>: Exact SMILES match rate.</li>
<li><em>Tanimoto</em>: ECFP similarity (fingerprint overlap).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>ABC-Net</th>
          <th>Img2mol (Baseline)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (UOB)</strong></td>
          <td><strong>96.1%</strong></td>
          <td>78.2%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Accuracy (Indigo)</strong></td>
          <td><strong>96.4%</strong></td>
          <td>89.5%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (UOB)</strong></td>
          <td><strong>0.989</strong></td>
          <td>0.953</td>
          <td>Higher substructure recovery</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>: 15 epochs, Batch size 64.</li>
<li><strong>Optimization</strong>: Adam Optimizer. LR $2.5 \times 10^{-4}$ (first 5 epochs) → $2.5 \times 10^{-5}$ (last 10).</li>
<li><strong>Repetition</strong>: Every experiment was repeated 3 times with random dataset splitting; mean values are reported.</li>
<li><strong>Compute</strong>: High-Performance Computing Center of Central South University. Specific GPU model not listed.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Yi, J.-C., Yang, G.-P., Wu, C.-K., Hou, T.-J., &amp; Cao, D.-S. (2022). ABC-Net: A divide-and-conquer based deep learning architecture for SMILES recognition from molecular images. <em>Briefings in Bioinformatics</em>, 23(2), bbac033. <a href="https://doi.org/10.1093/bib/bbac033">https://doi.org/10.1093/bib/bbac033</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangABCNetDivideandconquerBased2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ABC-Net: A Divide-and-Conquer Based Deep Learning Architecture for {SMILES} Recognition from Molecular Images}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Xiao-Chen and Yi, Jia-Cai and Yang, Guo-Ping and Wu, Cheng-Kun and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{bbac033}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bib/bbac033}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Unified Framework for Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</guid><description>A 2009 unified framework for inorganic/organic chemical handwriting recognition using graph search and statistical symbol grouping.</description><content:encoded><![CDATA[<h2 id="addressing-the-complexity-of-handwritten-organic-chemistry">Addressing the Complexity of Handwritten Organic Chemistry</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) from Microsoft Research Asia that addresses the challenge of recognizing complex 2D organic chemistry structures. By 2009, math expression recognition had seen significant commercial progress, but chemical expression recognition remained less developed.</p>
<p>The specific gap addressed is the geometric complexity of organic formulas. While inorganic formulas typically follow a linear, equation-like structure, organic formulas present complex 2D diagrammatic structures with various bond types and rings. Existing work often relied on strong assumptions (like single-stroke symbols) or failed to handle arbitrary compounds. There was a clear need for a unified solution capable of handling both inorganic and organic domains consistently.</p>
<h2 id="the-chemical-expression-structure-graph-cesg">The Chemical Expression Structure Graph (CESG)</h2>
<p>The core innovation is a unified statistical framework that processes inorganic and organic expressions within the same pipeline. Key technical novelties include:</p>
<ol>
<li><strong>Unified Bond Modeling</strong>: Bonds are treated as special symbols. The framework detects &ldquo;extended bond symbols&rdquo; (multi-stroke bonds) and splits them into single, double, or triple bonds using corner detection for consistent processing.</li>
<li><strong>Chemical Expression Structure Graph (CESG)</strong>: A defined graph representation for generic chemical expressions where nodes represent symbols and edges represent bonds or spatial relations.</li>
<li><strong>Non-Symbol Modeling</strong>: During the symbol grouping phase, the system explicitly models &ldquo;invalid groups&rdquo; to reduce over-grouping errors.</li>
<li><strong>Global Graph Search</strong>: Structure analysis is formulated as finding the optimal CESG by searching over a Weighted Direction Graph ($G_{WD}$).</li>
</ol>
<h2 id="graph-search-and-statistical-validation">Graph Search and Statistical Validation</h2>
<p>The authors validated the framework on a proprietary database of 35,932 handwritten chemical expressions collected from 300 writers.</p>
<ul>
<li><strong>Setup</strong>: The data was split into roughly 26,000 training and 6,400 testing samples.</li>
<li><strong>Metric</strong>: Recognition accuracy was measured strictly by expression (all symbols and the complete structure must be correct).</li>
<li><strong>Ablations</strong>: The team evaluated the performance contribution of symbol grouping, structure analysis, and full semantic verification.</li>
</ul>
<h2 id="recognition-accuracy-and-outcomes">Recognition Accuracy and Outcomes</h2>
<p>The full framework achieved a Top-1 accuracy of 75.4% and a Top-5 accuracy of 83.1%.</p>
<ul>
<li><strong>Component Contribution</strong>: Structure analysis is the primary bottleneck. With perfect grouping alone, Top-1 accuracy would reach 85.9%; adding structure analysis lowers it to 74.1% due to structural errors.</li>
<li><strong>Semantic Verification</strong>: Checking valence and grammar rules lifted Top-1 accuracy from 74.1% to 75.4%, a relative improvement of 1.7%.</li>
</ul>
<p>The unified framework effectively handles the variance in 2D space for chemical expressions, demonstrating that delayed decision-making (keeping top-N candidates) improves robustness.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No public artifacts (code, data, models) were released by the authors.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used a private Microsoft Research Asia dataset, making direct reproduction difficult.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total</td>
          <td>Proprietary MSRA DB</td>
          <td>35,932 expressions</td>
          <td>Written by 300 people</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Subset</td>
          <td>25,934 expressions</td>
          <td></td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Subset</td>
          <td>6,398 expressions</td>
          <td></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Content</strong>: 2,000 unique expressions from high school/college textbooks.</li>
<li><strong>Composition</strong>: ~25% of samples are organic expressions.</li>
<li><strong>Vocabulary</strong>: 163 symbol classes (elements, digits, <code>+</code>, <code>↑</code>, <code>%</code>, bonds, etc.).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Symbol Grouping (Dynamic Programming)</strong></p>
<ul>
<li>Objective: Find the optimal symbol sequence $G_{max}$ maximizing the posterior probability given the ink strokes:
$$ G_{max} = \arg\max_{G} P(G | \text{Ink}) $$</li>
<li><strong>Non-symbol modeling</strong>: Models are iteratively trained on &ldquo;incorrect grouping results&rdquo; so the system learns to reject invalid stroke groups.</li>
<li><strong>Inter-group modeling</strong>: Uses Gaussian Mixture Models (GMM) to model spatial relations ($R_j$) between groups.</li>
</ul>
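<p>The GMM spatial-relation term can be sketched in a few lines. This is a minimal illustration, not the paper's trained model: the mixture parameters below are placeholders, and a diagonal covariance is assumed.</p>

```python
import math

def gaussian2d_logpdf(x, y, mean, var):
    """Log-density of a diagonal-covariance 2-D Gaussian (one mixture
    component; a full GMM is a weighted sum of several)."""
    (mx, my), (vx, vy) = mean, var
    return (-0.5 * ((x - mx) ** 2 / vx + (y - my) ** 2 / vy)
            - 0.5 * math.log(4.0 * math.pi ** 2 * vx * vy))

def relation_loglik(dx, dy, components):
    """GMM log-likelihood that the offset (dx, dy) between two symbol
    groups realizes a given spatial relation R_j. `components` is a list
    of (weight, mean, var) tuples; real values would be fit on training
    pairs -- the ones below are illustrative only."""
    return math.log(sum(w * math.exp(gaussian2d_logpdf(dx, dy, m, v))
                        for w, m, v in components))

# Illustrative single-component "subscript" relation: offset down-right
subscript = [(1.0, (0.6, -0.4), (0.05, 0.05))]
```

<p>In the full grouping objective, relation log-likelihoods like these are combined with per-group symbol posteriors and maximized by dynamic programming over stroke partitions.</p>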
<p><strong>2. Bond Processing</strong></p>
<ul>
<li><strong>Extended Bond Symbol</strong>: Recognizes connected strokes (e.g., a messy double bond written in one stroke) as a single &ldquo;extended&rdquo; symbol.</li>
<li><strong>Splitting</strong>: Uses <strong>Curvature Scale Space (CSS)</strong> corner detection to split extended symbols into primitive lines.</li>
<li><strong>Classification</strong>: A Neural Network verifies if the split lines form valid single, double, or triple bonds.</li>
</ul>
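<p>The splitting step hinges on finding corners along the stroke. As a stand-in for full Curvature Scale Space detection (which smooths the curve at multiple scales before thresholding curvature), a simple turning-angle test over a small neighborhood captures the idea; the threshold and window size below are assumed values.</p>

```python
import math

def find_corners(points, angle_thresh=math.radians(60), k=3):
    """Corner-detection sketch for splitting extended bond strokes.
    Flags point i as a corner when the turning angle between the
    incoming chord (i-k -> i) and outgoing chord (i -> i+k) exceeds
    `angle_thresh`. Adjacent indices may co-fire; a real system would
    apply non-maximum suppression afterwards."""
    corners = []
    for i in range(k, len(points) - k):
        ax, ay = points[i][0] - points[i - k][0], points[i][1] - points[i - k][1]
        bx, by = points[i + k][0] - points[i][0], points[i + k][1] - points[i][1]
        na, nb = math.hypot(ax, ay), math.hypot(bx, by)
        if na == 0 or nb == 0:
            continue
        cos_angle = max(-1.0, min(1.0, (ax * bx + ay * by) / (na * nb)))
        if math.acos(cos_angle) > angle_thresh:
            corners.append(i)
    return corners
```

<p>Splitting the stroke at the detected indices yields the primitive lines that the neural-network verifier then classifies as single, double, or triple bonds.</p>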
<p><strong>3. Structure Analysis (Graph Search)</strong></p>
<ul>
<li><strong>Graph Construction</strong>: Builds a Weighted Direction Graph ($G_{WD}$) where nodes are symbol candidates and edges are potential relationships ($E_{c}, E_{nc}, E_{peer}, E_{sub}$).</li>
<li><strong>Edge Weights</strong>: Calculated as the product of observation, spatial, and contextual probabilities:
$$ W(S, O, R) = P(O|S) \times P(\text{Spatial}|R) \times P(\text{Context}|S, R) $$
<ul>
<li>Spatial probability uses rectangular control regions and distance functions.</li>
<li>Contextual probability uses statistical co-occurrence (e.g., &lsquo;C&rsquo; often appears with &lsquo;H&rsquo;).</li>
</ul>
</li>
<li><strong>Search</strong>: Breadth-first search with pruning to find the top-N optimal CESGs.</li>
</ul>
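<p>The edge-weight product and pruned search can be sketched compactly. The log-space weights and beam pruning below are implementation choices of ours (log space avoids underflow when many probabilities are multiplied along a path), not details from the paper.</p>

```python
import heapq
import math

def edge_logweight(p_obs, p_spatial, p_context):
    # Product of observation, spatial, and contextual probabilities,
    # computed in log space to avoid numerical underflow
    return math.log(p_obs) + math.log(p_spatial) + math.log(p_context)

def top_n_paths(graph, start, goal, n=5, beam=50):
    """Breadth-first search with beam pruning over a weighted direction
    graph, returning the top-n highest-weight start->goal paths.
    `graph` maps node -> list of (neighbor, log_weight) pairs."""
    frontier = [(0.0, [start])]
    complete = []
    while frontier:
        nxt = []
        for logw, path in frontier:
            if path[-1] == goal:
                complete.append((logw, path))
                continue
            for nb, w in graph.get(path[-1], []):
                if nb not in path:  # keep candidate paths simple (no cycles)
                    nxt.append((logw + w, path + [nb]))
        # Prune to the highest-scoring partial paths (beam search)
        frontier = heapq.nlargest(beam, nxt)
    return heapq.nlargest(n, complete)
```

<p>In the paper's setting, nodes are symbol candidates and the surviving top-N paths correspond to the candidate CESGs passed on to semantic verification.</p>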
<h3 id="models">Models</h3>
<ul>
<li><strong>Symbol Recognition</strong>: Implementation details not specified, but likely HMM or NN based on the era. Bond verification explicitly uses a <strong>Neural Network</strong>.</li>
<li><strong>Spatial Models</strong>: <strong>Gaussian Mixture Models (GMM)</strong> are used to model the 9 spatial relations (e.g., Left-super, Above, Subscript).</li>
<li><strong>Semantic Model</strong>: A <strong>Context-Free Grammar (CFG)</strong> parser is used for final verification (e.g., ensuring digits aren&rsquo;t reactants).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation is performed using &ldquo;Expression-level accuracy&rdquo;.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Top-1)</th>
          <th>Value (Top-5)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full Framework</td>
          <td>75.4%</td>
          <td>83.1%</td>
          <td></td>
      </tr>
      <tr>
          <td>Without Semantics</td>
          <td>74.1%</td>
          <td>83.0%</td>
          <td></td>
      </tr>
      <tr>
          <td>Grouping Only</td>
          <td>85.9%</td>
          <td>95.6%</td>
          <td>Theoretical max if structure analysis was perfect</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, M., Han, S., &amp; Zhang, D. (2009). A Unified Framework for Recognizing Handwritten Chemical Expressions. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1345&ndash;1349. <a href="https://doi.org/10.1109/ICDAR.2009.64">https://doi.org/10.1109/ICDAR.2009.64</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{changUnifiedFrameworkRecognizing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A {{Unified Framework}} for {{Recognizing Handwritten Chemical Expressions}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Ming and Han, Shi and Zhang, Dongmei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2009</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1345--1349}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.64}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SVM-HMM Online Classifier for Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</guid><description>A dual-stage classifier combining SVM and HMM to recognize online handwritten chemical symbols, introducing a reordering algorithm for organic rings.</description><content:encoded><![CDATA[<h2 id="contribution-double-stage-classification-method">Contribution: Double-Stage Classification Method</h2>
<p>This is a <strong>Method</strong> paper. It makes a methodological contribution by proposing a novel &ldquo;double-stage classifier&rdquo; architecture: a specific algorithmic pipeline (SVM rough classification followed by HMM fine classification) plus a novel pre-processing algorithm (Point Sequence Reordering) that resolves technical limitations in recognizing organic ring structures. The contribution is validated through ablation studies (comparing SVM kernels and HMM state/Gaussian counts) and performance benchmarks.</p>
<h2 id="motivation-recognizing-complex-organic-ring-structures">Motivation: Recognizing Complex Organic Ring Structures</h2>
<p>The primary motivation is the complexity of recognizing handwritten chemical symbols, specifically the distinction between <strong>Organic Ring Structures (ORS)</strong> and <strong>Non-Ring Structures (NRS)</strong>. Existing single-stage classifiers are unreliable for ORS because these symbols have arbitrary writing styles, variable stroke numbers, and inconsistent stroke orders due to their 2D hexagonal structure. A robust system is needed to handle this uncertainty and achieve high accuracy.</p>
<h2 id="core-innovation-point-sequence-reordering-psr">Core Innovation: Point Sequence Reordering (PSR)</h2>
<p>The authors introduce two main novelties:</p>
<ol>
<li><strong>Double-Stage Architecture</strong>: A hybrid system where an SVM (using RBF kernel) first roughly classifies inputs as either ORS or NRS, followed by specialized HMMs for fine-grained recognition.</li>
<li><strong>Point Sequence Reordering (PSR) Algorithm</strong>: A stroke-order independent algorithm designed specifically for ORS. It reorders the point sequence of a symbol based on a counter-clockwise scan from the centroid, effectively eliminating the uncertainty caused by variations in stroke number and writing order.</li>
</ol>
<h2 id="methodology--experimental-design">Methodology &amp; Experimental Design</h2>
<p>The authors collected a custom dataset and performed sequential optimizations:</p>
<ul>
<li><strong>SVM Optimization</strong>: Compared Polynomial, RBF, and Sigmoid kernels to find the best rough classifier.</li>
<li><strong>HMM Optimization</strong>: Tested multiple combinations of states (4, 6, 8) and Gaussians (3, 4, 6, 8, 9, 12) to maximize fine classification accuracy.</li>
<li><strong>PSR Validation</strong>: Conducted an ablation study comparing HMM accuracy on ORS symbols &ldquo;Before PSR&rdquo; vs &ldquo;After PSR&rdquo; to quantify the algorithm&rsquo;s impact.</li>
</ul>
<h2 id="results--final-conclusions">Results &amp; Final Conclusions</h2>
<ul>
<li><strong>Architecture Performance</strong>: The RBF-based SVM achieved 99.88% accuracy in differentiating ORS from NRS.</li>
<li><strong>HMM Configuration</strong>: The optimal HMM topology was found to be 8-states and 12-Gaussians for both symbol types.</li>
<li><strong>PSR Impact</strong>: The PSR algorithm improved ORS recognition. Top-1 accuracy shifted from <strong>49.84% (Before PSR)</strong> to <strong>98.36% (After PSR)</strong>.</li>
<li><strong>Overall Accuracy</strong>: The final integrated system achieved a Top-1 accuracy of <strong>93.10%</strong> and Top-3 accuracy of <strong>98.08%</strong> on the test set.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study defined 101 chemical symbols split into two categories.</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Count</th>
          <th>Content</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>NRS</strong> (Non-Ring)</td>
          <td>63</td>
          <td>Digits 0-9, 44 letters, 9 operators</td>
          <td>Operators include +, -, =, $\rightarrow$, etc.</td>
      </tr>
      <tr>
          <td><strong>ORS</strong> (Organic Ring)</td>
          <td>38</td>
          <td>2D hexagonal structures</td>
          <td>Benzene rings, cyclohexane, etc.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Collection</strong>: 12,322 total samples (122 per symbol) collected from 20 writers (teachers and students).</li>
<li><strong>Split</strong>: 9,090 training samples and 3,232 test samples.</li>
<li><strong>Constraints</strong>: Writers produced samples under three writing specifications: normal, standard, and freestyle.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SVM Feature Extraction (Rough Classification)</strong>
The input strokes are scaled, and a 58-dimensional feature vector is calculated:</p>
<ul>
<li><strong>Mesh ($4 \times 4$)</strong>: Ratio of points in 16 grids (16 features).</li>
<li><strong>Outline</strong>: Normalized scan distance from 4 edges with 5 scan lines each (20 features).</li>
<li><strong>Projection</strong>: Point density in 5 bins per edge (20 features).</li>
<li><strong>Aspect Ratio</strong>: Height/Width ratios (2 features).</li>
</ul>
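<p>A sketch of this 58-dimensional extraction follows. The dimension counts match the paper (16 + 20 + 20 + 2), but the exact normalizations, scan-line conventions, and aspect-ratio encoding are our assumptions.</p>

```python
import numpy as np

def extract_features(points):
    """58-dim feature sketch: mesh (16) + outline (20) + projection (20)
    + aspect ratio (2). Normalization details are assumed."""
    pts = np.asarray(points, dtype=float)
    mn, mx = pts.min(axis=0), pts.max(axis=0)
    span = np.where(mx - mn > 0, mx - mn, 1.0)
    norm = (pts - mn) / span  # scale into the unit square

    # Mesh (4x4): fraction of points falling in each of 16 cells
    ix = np.minimum((norm[:, 0] * 4).astype(int), 3)
    iy = np.minimum((norm[:, 1] * 4).astype(int), 3)
    mesh = np.bincount(iy * 4 + ix, minlength=16) / len(pts)

    # Outline: per edge, 5 scan bands; normalized distance from the edge
    # to the nearest ink in each band (4 edges x 5 bands = 20 features)
    outline = []
    for axis, from_end in [(0, False), (0, True), (1, False), (1, True)]:
        other = 1 - axis
        for k in range(5):
            band = norm[(norm[:, other] >= k / 5) & (norm[:, other] < (k + 1) / 5)]
            if len(band) == 0:
                outline.append(1.0)  # no ink in this scan band
            else:
                v = band[:, axis]
                outline.append(1.0 - v.max() if from_end else v.min())

    # Projection: point density in 5 bins per edge direction
    proj = []
    for axis in (0, 1):
        hist = np.histogram(norm[:, axis], bins=5, range=(0, 1))[0]
        proj.extend(hist / len(pts))
    proj = proj * 2  # mirror for the opposite edges (assumed symmetric)

    # Aspect ratio, encoded as two normalized height/width shares
    w, h = span[0], span[1]
    aspect = [h / (h + w), w / (h + w)]

    return np.concatenate([mesh, outline, proj, aspect])  # length 58
```

<p>The resulting vector feeds the Stage-1 SVM that separates ORS from NRS inputs.</p>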
<p><strong>2. Point Sequence Reordering (PSR)</strong>
Used strictly for ORS preprocessing:</p>
<ol>
<li>Calculate the centroid $(x_c, y_c)$ of the symbol.</li>
<li>Initialize a scan line at angle $\theta = 0$.</li>
<li>Traverse points; if a point $p_i = (x_i, y_i)$ satisfies the distance threshold to the scan line, add it to the reordered list. Distance $d_i$ is calculated as:
$$ d_i = |(y_i - y_c)\cos(\theta) - (x_i - x_c)\sin(\theta)| $$</li>
<li>Increment $\theta$ by $\Delta\theta$ and repeat until a full circle ($2\pi$) is completed.</li>
</ol>
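<p>A direct implementation of this sweep might look like the following. The half-plane check (so each scan line acts as a ray from the centroid rather than a full line) and the distance threshold are our additions; the paper's text as summarized gives only the distance formula.</p>

```python
import math

def point_sequence_reorder(points, delta_theta=math.pi / 180, dist_thresh=2.0):
    """Reorder a symbol's points by a counter-clockwise scan-line sweep
    around the centroid, making the sequence independent of stroke
    number and writing order. `dist_thresh` is an assumed tuning value
    for the point-to-scan-line distance test."""
    xc = sum(x for x, _ in points) / len(points)
    yc = sum(y for _, y in points) / len(points)
    reordered, seen = [], set()
    theta = 0.0
    while theta < 2 * math.pi:
        for i, (x, y) in enumerate(points):
            # Perpendicular distance from point i to the scan line at angle theta
            d = abs((y - yc) * math.cos(theta) - (x - xc) * math.sin(theta))
            if d <= dist_thresh and i not in seen:
                # Keep only points on the scan ray's forward half-plane
                if (x - xc) * math.cos(theta) + (y - yc) * math.sin(theta) >= 0:
                    reordered.append((x, y))
                    seen.add(i)
        theta += delta_theta
    return reordered
```

<p>After reordering, two ORS samples drawn with different stroke orders yield nearly identical point sequences, which is what lets a single left-right HMM model them.</p>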
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Stage 1)</strong>: RBF Kernel was selected as optimal with parameters $C=512$ and $\gamma=0.5$.</li>
<li><strong>HMM (Stage 2)</strong>: Left-right continuous HMM trained via Baum-Welch algorithm. The topology is one model per symbol using <strong>8 states and 12 Gaussians</strong>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported are Top-1, Top-2, and Top-3 accuracy on the held-out test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NRS Accuracy</th>
          <th>ORS Accuracy</th>
          <th>Overall Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Top-1</strong></td>
          <td>91.91%</td>
          <td>97.53%</td>
          <td>93.10%</td>
      </tr>
      <tr>
          <td><strong>Top-3</strong></td>
          <td>99.12%</td>
          <td>99.34%</td>
          <td>98.08%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: HP Pavilion tx1000 Tablet PC.</li>
<li><strong>Processor</strong>: 2.00GHz CPU.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Wang, K. (2010). A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols. <em>2010 International Conference on Pattern Recognition</em>, 1888&ndash;1891. <a href="https://doi.org/10.1109/ICPR.2010.465">https://doi.org/10.1109/ICPR.2010.465</a></p>
<p><strong>Publication</strong>: ICPR 2010</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2010svm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2010 International Conference on Pattern Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Wang, Kai}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2010}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1888--1891}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2010.465}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Recognition of On-line Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</guid><description>A two-level recognition algorithm for on-line handwritten chemical expressions using structural and syntactic features.</description><content:encoded><![CDATA[<h2 id="contribution-on-line-chemical-expression-recognition-framework">Contribution: On-line Chemical Expression Recognition Framework</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline (&ldquo;Algorithm Model&rdquo;) for recognizing on-line handwritten chemical expressions. The paper focuses on detailing the specific mechanisms of this pipeline (pre-processing, segmentation, two-level recognition, and HCI) and validates its effectiveness through quantitative comparison against a conventional baseline. The rhetorical structure aligns with the &ldquo;Methodological Basis&rdquo; of the taxonomy, prioritizing the &ldquo;how well does this work?&rdquo; question over theoretical derivation or dataset curation.</p>
<h2 id="motivation-the-hci-gap-in-chemical-drawing">Motivation: The HCI Gap in Chemical Drawing</h2>
<p>The authors identify a gap in existing human-computer interaction (HCI) for chemistry. While mathematical formula recognition had seen progress, chemical expression recognition was under-researched. Existing tools relied on keyboard/mouse input, which was time-consuming and inefficient for the complex, variable nature of chemical structures. Previous attempts were either too slow (vectorization-based) or failed to leverage specific chemical knowledge effectively. There was a practical need for a system that could handle the specific syntactic rules of chemistry in an on-line (real-time) handwriting setting.</p>
<h2 id="novelty-two-level-recognition-architecture">Novelty: Two-Level Recognition Architecture</h2>
<p>The core contribution is a <strong>two-level recognition algorithm</strong> that integrates chemical domain knowledge.</p>
<ul>
<li><strong>Level 1 (Substance Level):</strong> Treats connected strokes as a potential &ldquo;substance unit&rdquo; (e.g., &ldquo;H2O&rdquo;) and matches them against a dictionary using a modified edit distance algorithm.</li>
<li><strong>Level 2 (Character Level):</strong> If the substance match fails, it falls back to segmenting the unit into isolated characters and reconstructing them using syntactic rules.</li>
<li><strong>Hybrid Segmentation:</strong> Combines structural analysis (using bounding box geometry for super/subscript detection) with &ldquo;partial recognition&rdquo; (identifying special symbols like <code>+</code>, <code>=</code>, <code>-&gt;</code> early to split the expression).</li>
</ul>
<h2 id="methodology-custom-dataset-and-baseline-comparisons">Methodology: Custom Dataset and Baseline Comparisons</h2>
<p>The authors conducted a validation experiment in a laboratory environment with 20 participants (chemistry students and teachers).</p>
<ul>
<li><strong>Dataset:</strong> 1,197 total samples (983 from a standard set of 341 expressions, 214 arbitrary expressions written by users).</li>
<li><strong>Baselines:</strong> They compared their &ldquo;Two-Level&rdquo; algorithm against a &ldquo;Conventional&rdquo; algorithm that skips the substance-level check and directly recognizes characters (&ldquo;Recognize Character Directly&rdquo;).</li>
<li><strong>Conditions:</strong> They also tested the impact of their Human-Computer Interaction (HCI) module which allows user corrections.</li>
</ul>
<h2 id="results-high-accuracy-and-hci-corrections">Results: High Accuracy and HCI Corrections</h2>
<ul>
<li><strong>Accuracy:</strong> The proposed two-level algorithm achieved significantly higher accuracy (<strong>96.4%</strong> for expression recognition) compared to the conventional baseline (<strong>91.5%</strong>).</li>
<li><strong>Robustness:</strong> The method performed well even on &ldquo;arbitrary&rdquo; expressions not in the standard set (92.5% accuracy vs 88.2% baseline).</li>
<li><strong>HCI Impact:</strong> Allowing users to correct results via the HCI module raised final accuracy to <strong>98.8%</strong>.</li>
<li><strong>Conclusion:</strong> The authors concluded the algorithm is reliable for real applications and flexible enough to be extended to other domains like physics or engineering.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a public benchmark but collected its own data for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Custom Lab Dataset</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">Collected from 20 chemistry students/teachers using Tablet PCs. Includes 341 standard expressions + arbitrary user inputs.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct phases with specific algorithmic choices:</p>
<p><strong>1. Pre-processing</strong></p>
<ul>
<li><strong>Smoothing:</strong> Uses a 5-tap Gaussian low-pass filter (Eq. 1) with specific coefficients to smooth stroke data.</li>
<li><strong>Redundancy:</strong> Merges redundant points and removes &ldquo;prickles&rdquo; (isolated noise).</li>
<li><strong>Re-ordering:</strong> Strokes are spatially re-sorted left-to-right, top-to-bottom to correct for arbitrary writing order.</li>
</ul>
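<p>The smoothing step can be sketched as a 5-tap convolution over each stroke coordinate. The binomial kernel and replicate-border handling below are assumptions; the paper's exact Eq. 1 coefficients are not reproduced here.</p>

```python
def smooth_stroke(coords, kernel=(1, 4, 6, 4, 1)):
    """5-tap low-pass smoothing of one stroke coordinate sequence
    (apply separately to the x and y sequences). The binomial kernel
    (1,4,6,4,1)/16 approximates a Gaussian."""
    k = [c / sum(kernel) for c in kernel]
    half = len(k) // 2
    out = []
    for i in range(len(coords)):
        acc = 0.0
        for j, w in enumerate(k):
            # Clamp indices at the stroke ends (replicate-border, assumed)
            idx = min(max(i + j - half, 0), len(coords) - 1)
            acc += w * coords[idx]
        out.append(acc)
    return out
```
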
<p><strong>2. Segmentation</strong></p>
<ul>
<li><strong>Structural Analysis:</strong> Distinguishes relationships (Superscript vs. Subscript vs. Horizontal) using a geometric feature vector $(T, B)$ based on bounding box heights ($h$), vertical centers ($C$), and barycenters ($B_{bary}$):
$$
\begin{aligned}
d &amp;= 0.7 \cdot y_{12} - y_{22} + 0.3 \cdot y_{11} \\
T &amp;= 1000 \cdot \frac{d}{h_1} \\
B &amp;= 1000 \cdot \frac{B_{bary1} - B_{bary2}}{h_1}
\end{aligned}
$$</li>
<li><strong>Partial Recognition:</strong> Detects special symbols (<code>+</code>, <code>=</code>, <code>-&gt;</code>) early to break expressions into &ldquo;super-substance units&rdquo; (e.g., separating reactants from products).</li>
</ul>
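<p>Translating these equations into code is direct once the bounding-box quantities are fixed. The mapping below ($y_{11}, y_{12}$ as the reference symbol's top and bottom, $y_{22}$ as the following symbol's bottom) is our reading of the notation, not stated explicitly in the summary.</p>

```python
def spatial_features(box1, box2):
    """Compute the (T, B) geometric feature pair from two bounding boxes.
    Each box is (y_top, y_bottom, barycenter_y) with y growing downward.
    Coefficients follow the equations in the text; the index-to-box
    mapping is an assumption."""
    y11, y12, bary1 = box1
    _, y22, bary2 = box2
    h1 = y12 - y11                      # height of the reference symbol
    d = 0.7 * y12 - y22 + 0.3 * y11     # weighted vertical offset
    T = 1000 * d / h1
    B = 1000 * (bary1 - bary2) / h1
    return T, B
```

<p>The (T, B) pair is then thresholded (or classified) to decide Superscript vs. Subscript vs. Horizontal placement.</p>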
<p><strong>3. Recognition (Two-Level)</strong></p>
<ul>
<li><strong>Level 1 (Dictionary Match):</strong>
<ul>
<li>Uses a modified <strong>Edit Distance</strong> (Eq. 6) incorporating a specific distance matrix based on chemical syntax.</li>
<li>Similarity $\lambda_{ij}$ is weighted by stroke credibility $\mu_i$ and normalized by string length.</li>
</ul>
</li>
<li><strong>Level 2 (Character Segmentation):</strong>
<ul>
<li>Falls back to this if Level 1 fails.</li>
<li>Segments characters by analyzing pixel density in horizontal/vertical/diagonal directions to find concave/convex points.</li>
<li>Recombines characters using syntactic rules (e.g., valency checks) to verify validity.</li>
</ul>
</li>
</ul>
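<p>The Level 1 match can be sketched as classic edit distance with a per-pair substitution-cost table standing in for the paper's chemistry-specific distance matrix; the stroke-credibility weighting $\mu_i$ is omitted here, and the threshold value is an assumption.</p>

```python
def normalized_edit_similarity(candidate, entry, sub_cost=None):
    """Edit distance with an optional symbol-pair substitution-cost
    table, converted to a length-normalized similarity in [0, 1]."""
    m, n = len(candidate), len(entry)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = float(i)
    for j in range(n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == entry[j - 1]:
                cost = 0.0
            else:
                # Cheap substitutions for visually confusable symbol pairs
                cost = (sub_cost or {}).get((candidate[i - 1], entry[j - 1]), 1.0)
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return 1.0 - dp[m][n] / max(m, n, 1)

def best_match(candidate, dictionary, threshold=0.8):
    """Return the best dictionary entry if its similarity clears the
    threshold; otherwise fall back to Level 2 (character segmentation)."""
    best = max(dictionary, key=lambda e: normalized_edit_similarity(candidate, e))
    s = normalized_edit_similarity(candidate, best)
    return (best, s) if s >= threshold else (None, s)
```

<p>A failed match (returning <code>None</code>) is exactly the condition that triggers the fall-back to character-level segmentation.</p>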
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on recognition accuracy at both the character and expression level.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Proposed)</th>
          <th style="text-align: left">Value (Baseline)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>96.4%</strong></td>
          <td style="text-align: left">91.5%</td>
          <td style="text-align: left">&ldquo;Standard&rdquo; dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>92.5%</strong></td>
          <td style="text-align: left">88.2%</td>
          <td style="text-align: left">&ldquo;Other&rdquo; (arbitrary) dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>HCI-Assisted Accuracy</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Accuracy after user correction.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Input Devices:</strong> Tablet PCs were used for data collection and testing.</li>
<li><strong>Compute:</strong> Specific training hardware is not listed, but the algorithm is designed for real-time interaction on standard 2008-era computing devices.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, Q., &amp; Zhang, Y. (2008). Recognition of On-line Handwritten Chemical Expressions. <em>2008 IEEE International Joint Conference on Neural Networks</em>, 2360&ndash;2365. <a href="https://doi.org/10.1109/IJCNN.2008.4634125">https://doi.org/10.1109/IJCNN.2008.4634125</a></p>
<p><strong>Publication</strong>: IJCNN 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{jufengyangRecognitionOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Recognition of On-Line Handwritten Chemical Expressions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 {{IEEE International Joint Conference}} on {{Neural Networks}} ({{IEEE World Congress}} on {{Computational Intelligence}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yang, Jufeng and Shi, Guangshun and Wang, Qingren and Zhang, Yong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2360--2365}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Hong Kong, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IJCNN.2008.4634125}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-1820-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Online Handwritten Chemical Formula Structure Analysis</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</guid><description>A hierarchical grammar-based approach for recognizing and analyzing online handwritten chemical formulas in mobile education contexts.</description><content:encoded><![CDATA[<h2 id="hierarchical-grammatical-framework-contribution">Hierarchical Grammatical Framework Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for processing chemical formulas by decomposing them into three hierarchical levels (Formula, Molecule, Text). The contribution is defined by a specific set of formal grammatical rules and parsing algorithms used to construct a &ldquo;grammar spanning tree&rdquo; and &ldquo;molecule spanning graph&rdquo; from online handwritten strokes.</p>
<h2 id="motivation-for-online-formula-recognition">Motivation for Online Formula Recognition</h2>
<p>The primary motivation is the application of mobile computing in chemistry education, where precise comprehension of casual, <em>online</em> handwritten formulas is a significant challenge.</p>
<ul>
<li><strong>2D Complexity</strong>: Unlike 1D text, chemical formulas utilize complex 2D spatial relationships that convey specific chemical meaning (e.g., bonds, rings).</li>
<li><strong>Format Limitations</strong>: Existing storage formats like CML (Chemical Markup Language) or MDL MOLFILE do not natively record the layout or abbreviated information necessary for recognizing handwritten input.</li>
<li><strong>Online Gap</strong>: Previous research focused heavily on <em>offline</em> (image-based) recognition, lacking solutions for <em>online</em> (stroke-based) handwritten chemical formulas (OHCF).</li>
</ul>
<h2 id="core-novelty-in-three-level-grammatical-analysis">Core Novelty in Three-Level Grammatical Analysis</h2>
<p>The core novelty is the <strong>Three-Level Grammatical Analysis</strong> approach:</p>
<ol>
<li><strong>Formula Level (1D)</strong>: Treats the reaction equation as a linear sequence of components (Reactants, Products, Separators), parsed via a context-free grammar to build a spanning tree.</li>
<li><strong>Molecule Level (2D)</strong>: Treats molecules as graphs where &ldquo;text groups&rdquo; are vertices and &ldquo;bonds&rdquo; are edges. It introduces specific handling for &ldquo;hidden Carbon dots&rdquo; (intersections of bonds without text).</li>
<li><strong>Text Level (1D)</strong>: Analyzes the internal structure of text groups (atoms, subscripts).</li>
</ol>
<p>Unique to this approach is the <strong>formal definition of the chemical grammar</strong> as a 5-tuple $G=(T,N,P,M,S)$ and the generation of an <strong>Adjacency Matrix</strong> directly from the handwritten sketch to represent chemical connectivity.</p>
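<p>The adjacency-matrix representation described above can be sketched in a few lines. This is an illustration of the idea, not the paper's code; the function and variable names are ours.</p>

```python
# Sketch (not the paper's code): building the molecule-level adjacency
# matrix, where recognized "text groups" are vertices and bonds are edges.

def adjacency_matrix(text_groups, bonds):
    """text_groups: list of labels, e.g. ["CH3", "CH2", "OH"].
    bonds: list of (i, j, order) tuples linking group indices."""
    n = len(text_groups)
    matrix = [[0] * n for _ in range(n)]
    for i, j, order in bonds:
        matrix[i][j] = order  # bond order encodes single/double/triple
        matrix[j][i] = order  # undirected: keep the matrix symmetric
    return matrix

# Ethanol drawn as CH3-CH2-OH: two single bonds.
m = adjacency_matrix(["CH3", "CH2", "OH"], [(0, 1, 1), (1, 2, 1)])
# m == [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```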
<h2 id="experimental-validation-on-handwritten-strokes">Experimental Validation on Handwritten Strokes</h2>
<p>The authors validated their model using a custom dataset of online handwritten formulas.</p>
<ul>
<li><strong>Data Source</strong>: 25 formulas were randomly selected from a larger pool of 1,250 samples.</li>
<li><strong>Scope</strong>: The test set included 484 total symbols, comprising generators, separators, text symbols, rings, and various bond types.</li>
<li><strong>Granular Validation</strong>: The system was tested at multiple distinct stages:
<ul>
<li>Key Symbol Extraction (Formula Level)</li>
<li>Text Localization (Molecule Level)</li>
<li>Bond End Grouping (Molecule Level)</li>
<li>Text Recognition (Text Level)</li>
</ul>
</li>
</ul>
<h2 id="downstream-impact-and-parsing-accuracy">Downstream Impact and Parsing Accuracy</h2>
<p>The system achieved high accuracy across all sub-tasks, demonstrating that the hierarchical grammar approach is effective for both inorganic and organic formulas.</p>
<ul>
<li><strong>Formula Level</strong>: 98.3% accuracy for Key Symbols; 100% for State-assisted symbols.</li>
<li><strong>Molecule Level</strong>: 98.8% accuracy for Bond End Grouping; 100% for Free End-Text connection detection.</li>
<li><strong>Text Recognition</strong>: 98.7% accuracy (Top-3) using HMMs.</li>
<li><strong>Impact</strong>: The method successfully preserves the writer&rsquo;s &ldquo;online information&rdquo; (habits/intentions) while converting the handwritten input into standard formats suitable for visual editing or data retrieval.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To replicate this work, one would need to implement the specific grammatical production rules and the geometric thresholds defined for bond analysis.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Symbol HMMs</td>
          <td>5,670 samples</td>
          <td>Used to train the text recognition module</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Text Recognition</td>
          <td>2,016 samples</td>
          <td>Test set for character HMMs</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Formula Analysis</td>
          <td>25 formulas</td>
          <td>Random subset of 1,250 collected samples; contains 484 symbols</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Formula Level Parsing</strong></p>
<ul>
<li><strong>HBL Analysis</strong>: Identify the &ldquo;Horizontal Baseline&rdquo; (HBL) containing the most symbols to locate key operators (e.g., $+$, $\rightarrow$).</li>
<li><strong>Grammar</strong>: Use the productions defined in Figure 4. Example rules include:
<ul>
<li>$Reaction ::= ReactantList \ Generator \ ProductList$</li>
<li>$Reactant ::= BalancingNum \ Molecule \ IonicCharacter$</li>
</ul>
</li>
</ul>
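<p>The formula-level productions above amount to splitting the 1D symbol sequence at the generator and at separators. A minimal sketch, with <code>"-&gt;"</code> standing in for the arrow generator (the implementation details here are ours, not the paper's):</p>

```python
# Illustrative formula-level parse: split at the generator (arrow) and at
# '+' separators, mirroring Reaction ::= ReactantList Generator ProductList.

def parse_reaction(symbols):
    if "->" not in symbols:
        raise ValueError("no generator found on the baseline")
    arrow = symbols.index("->")

    def split_terms(seq):
        terms, current = [], []
        for s in seq:
            if s == "+":
                terms.append(current)
                current = []
            else:
                current.append(s)
        terms.append(current)
        return terms

    return {"reactants": split_terms(symbols[:arrow]),
            "products": split_terms(symbols[arrow + 1:])}

tree = parse_reaction(["2", "H2", "+", "O2", "->", "2", "H2O"])
# tree["reactants"] == [["2", "H2"], ["O2"]]
```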
<p><strong>2. Molecule Level Analysis (Bond Grouping)</strong></p>
<ul>
<li><strong>Endpoint Classification</strong>: Points are classified as <em>free ends</em>, <em>junctions</em> (3+ bonds), or <em>connections</em> (2 bonds).</li>
<li><strong>Grouping Equation</strong>: An endpoint $(x_k, y_k)$ belongs to Group A based on distance thresholding:
$$
\begin{aligned}
Include(x_k, y_k) = \begin{cases} 1, &amp; d_k &lt; t \cdot d_{\max} + \delta \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$
Where $d_k$ is the Euclidean distance from the endpoint to the group center $(x_a, y_a)$, $d_{\max}$ is the largest such distance within the group, $t$ is a scaling factor, and $\delta$ is a small additive tolerance.</li>
</ul>
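<p>The grouping test can be sketched as follows, under the reading that an endpoint joins a group when its distance to the group center is within a scaled maximum plus a tolerance. The threshold values <code>t</code> and <code>delta</code> are illustrative, not the paper's.</p>

```python
import math

# Sketch of the bond-end grouping decision. t and delta are illustrative.

def include(endpoint, group, t=0.5, delta=2.0):
    # Group center (x_a, y_a) is the mean of the member endpoints.
    xa = sum(x for x, _ in group) / len(group)
    ya = sum(y for _, y in group) / len(group)
    # Largest member distance sets the scale of the group.
    d_max = max(math.hypot(x - xa, y - ya) for x, y in group)
    d = math.hypot(endpoint[0] - xa, endpoint[1] - ya)
    return d < t * d_max + delta  # the 1/0 decision from the paper
```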
<p><strong>3. Connection Detection</strong></p>
<ul>
<li><strong>Text-Bond Connection</strong>: A text group is connected to a bond if the free end falls within a bounding box expanded by thresholds $t_W$ and $t_H$:
$$
\begin{aligned}
Con(x,y) = \begin{cases} 1, &amp; \min x - t_W &lt; x &lt; \max x + t_W \text{ AND } \min y - t_H &lt; y &lt; \max y + t_H \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$</li>
</ul>
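<p>The connection test is a point-in-expanded-box check. A minimal sketch, with illustrative threshold values:</p>

```python
# Sketch of the free-end/text connection test: a bond's free end connects
# to a text group when it falls inside the group's bounding box expanded
# by thresholds t_w and t_h (values here are illustrative).

def connected(free_end, box, t_w=5.0, t_h=5.0):
    x, y = free_end
    min_x, min_y, max_x, max_y = box
    return (min_x - t_w < x < max_x + t_w) and (min_y - t_h < y < max_y + t_h)

# A free end 3 px left of a label's box still counts as connected:
# connected((7, 15), (10, 10, 30, 20)) -> True
```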
<h3 id="models">Models</h3>
<ul>
<li><strong>Text Recognition</strong>: Hidden Markov Models (HMM) are used for recognizing individual text symbols.</li>
<li><strong>Grammar</strong>: Context-Free Grammar (CFG) designed with ambiguity elimination to ensure a single valid parse tree for any valid formula.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured by recognition accuracy at specific processing stages:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>F1 (Key Symbol Extraction)</td>
          <td>98.3%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>F2 (State-assisted Symbol)</td>
          <td>100%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M2 (Bond End Grouping)</td>
          <td>98.8%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M3 (Free End-Text Conn)</td>
          <td>100%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>T1 (Text Recognition)</td>
          <td>98.7%</td>
          <td>Top-3 Accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, X., Shi, G., &amp; Yang, J. (2009). The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1056&ndash;1060. <a href="https://doi.org/10.1109/ICDAR.2009.70">https://doi.org/10.1109/ICDAR.2009.70</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wangUnderstandingStructureAnalyzing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The {{Understanding}} and {{Structure Analyzing}} for {{Online Handwritten Chemical Formulas}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Wang, Xin and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1056--1060}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.70}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-4500-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>On-line Handwritten Chemical Expression Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</guid><description>Two-level algorithm for recognizing on-line handwritten chemical expressions using structural analysis, ANNs, and string edit distance.</description><content:encoded><![CDATA[<h2 id="a-methodological-approach-to-chemical-recognition">A Methodological Approach to Chemical Recognition</h2>
<p>This is a <strong>Method</strong> paper. It proposes a specific &ldquo;novel two-level algorithm&rdquo; and a &ldquo;System model&rdquo; for recognizing chemical expressions. The paper focuses on the architectural design of the recognition pipeline (segmentation, substance recognition, symbol recognition) and validates it against a &ldquo;conventional algorithm&rdquo; baseline, fitting the standard profile of a methodological contribution.</p>
<h2 id="bridging-the-gap-in-pen-based-chemical-input">Bridging the Gap in Pen-Based Chemical Input</h2>
<p>While pen-based computing has advanced for text and mathematical formulas, inputting chemical expressions remains &ldquo;time-consuming&rdquo;. Existing research often lacks &ldquo;adequate chemical knowledge&rdquo; or relies on algorithms that are too slow (global optimization) or structurally weak (local optimization). The authors aim to bridge this gap by integrating chemical domain knowledge into the recognition process to improve speed and accuracy.</p>
<h2 id="two-level-recognition-strategy-for-formulas">Two-Level Recognition Strategy for Formulas</h2>
<p>The core novelty is a <strong>two-level recognition strategy</strong>:</p>
<ol>
<li><strong>Level 1 (Substance Recognition)</strong>: Uses global structural information to identify entire &ldquo;substance units&rdquo; (e.g., $H_2SO_4$) by matching against a dictionary.</li>
<li><strong>Level 2 (Symbol Recognition)</strong>: If Level 1 fails, the system falls back to segmenting the substance into isolated characters and recognizing them individually.</li>
</ol>
<p>Additionally, the method integrates <strong>syntactic features</strong> (chemical knowledge) such as element conservation to validate and correct results and uses specific geometric features to distinguish superscript/subscript relationships.</p>
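<p>The element-conservation idea can be illustrated with a small balance check. This is our simplified sketch (no parentheses, charges, or hydrates), not the paper's implementation:</p>

```python
import re
from collections import Counter

# Sketch of element conservation as a syntactic validity feature: a
# recognition result is plausible only if each element's atom count
# balances across the reaction arrow.

def atom_counts(term):
    coeff_match = re.match(r"(\d*)(.*)", term)
    coeff = int(coeff_match.group(1) or 1)  # leading balancing number
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", coeff_match.group(2)):
        counts[elem] += coeff * int(num or 1)
    return counts

def balanced(reactants, products):
    left, right = Counter(), Counter()
    for t in reactants:
        left += atom_counts(t)
    for t in products:
        right += atom_counts(t)
    return left == right

# balanced(["2H2", "O2"], ["2H2O"]) -> True
```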
<h2 id="dataset-collection-and-baseline-comparisons">Dataset Collection and Baseline Comparisons</h2>
<ul>
<li><strong>Dataset Collection</strong>: The authors collected 1,197 handwritten expression samples from 20 chemistry professionals and students. This included 983 &ldquo;standard&rdquo; expressions (from 341 templates) and 214 &ldquo;arbitrary&rdquo; expressions written freely.</li>
<li><strong>Comparison</strong>: They compared their &ldquo;Two-level recognition&rdquo; approach against a &ldquo;conventional algorithm&rdquo; baseline that bypasses the first level and segments directly into characters.</li>
<li><strong>Metrics</strong>: They measured Material Accuracy (MA), the number of correctly recognized expressions (AEN), and Expression Accuracy (EA).</li>
</ul>
<h2 id="high-accuracy-in-formula-recognition">High Accuracy in Formula Recognition</h2>
<ul>
<li><strong>High Accuracy</strong>: The proposed algorithm achieved <strong>96.4% Material Accuracy (MA)</strong> and <strong>95.7% Expression Accuracy (EA)</strong> on the total test set.</li>
<li><strong>Robustness</strong>: The method performed well on both standard (96.3% EA) and arbitrary (92.5% EA) expressions.</li>
<li><strong>Validation</strong>: The authors conclude the algorithm is &ldquo;reliable,&rdquo; &ldquo;flexible,&rdquo; and suitable for real-time applications compared to prior work.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors constructed two distinct datasets for training and evaluation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Symbol Training</strong></td>
          <td style="text-align: left">ISF Files</td>
          <td style="text-align: left">12,240 files</td>
          <td style="text-align: left">Used to train the ANN classifier. Covers 102 symbol classes (numerals, letters, operators, organic loops).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Testing</strong></td>
          <td style="text-align: left">Handwritten Expressions</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">983 standard + 214 arbitrary expressions collected from 20 chemistry teachers/students.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Structural Segmentation (Superscript/Subscript)</strong></p>
<p>To distinguish relationships (superscript, subscript, in-line), the authors define geometric parameters based on the bounding boxes of adjacent symbols ($x_{i1}, y_{i1}, x_{i2}, y_{i2}$):</p>
<p>$$d = 0.7 \times y_{12} - y_{22} + 0.3 \times y_{11}$$
$$T = 1000 \times d/h$$
$$B = 1000 \times (B_1 - B_2)/h_1$$</p>
<p>Where $B_1, B_2$ are the vertical barycenters of the two symbols and $h$, $h_1$ are symbol heights. $(T, B)$ serves as the feature vector for classifying the spatial relationship.</p>
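<p>A direct transcription of the $(T, B)$ formulas as printed above, purely for illustration (argument names are ours, and the formulas are taken verbatim from the text rather than re-derived):</p>

```python
# Direct transcription of the (T, B) feature formulas as printed above,
# for two adjacent symbols with bounding-box y-coordinates y11, y12, y22,
# barycenters b1, b2, and heights h, h1. Purely illustrative.

def relationship_features(y11, y12, y22, h, b1, b2, h1):
    d = 0.7 * y12 - y22 + 0.3 * y11
    t_feat = 1000 * d / h
    b_feat = 1000 * (b1 - b2) / h1
    return t_feat, b_feat  # fed to the classifier as the vector (T, B)
```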
<p><strong>2. Segmentation Reliability</strong></p>
<p>For segmenting strokes into units, the reliability of a segmentation path is calculated as:</p>
<p>$$Cof(K_{i},N)=\sum_{j=0}^{N}P(k_{j},k_{j+1})+P(S_{K_{i}})+\delta(N)$$</p>
<p>Where $P(k_j, k_{j+1})$ is the reliability of strokes being recognized as symbol $S_{k_j}$.</p>
<p><strong>3. Substance Matching (Level 1)</strong></p>
<p>A modified string edit distance is used to match handwritten input against a dictionary:</p>
<p>$$\lambda_{\overline{u}}=\mu_{i} \times f(Dis(i,j,r)/\sqrt{Max(Len_{i},Len_{j})})$$</p>
<p>Where $\mu_i$ is the recognizer credibility and $Dis(i,j,r)$ is the edit distance.</p>
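<p>The dictionary-matching step can be sketched as follows. The recognizer-credibility weight $\mu_i$ and the mapping $f$ are simplified away here; this shows only the length-normalized edit-distance scoring, with names of our choosing.</p>

```python
import math

# Sketch of level-1 substance matching: score each dictionary entry by an
# edit distance normalized by sqrt(max length), keeping the best match.

def edit_distance(a, b):
    # Standard Levenshtein distance via rolling rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def best_match(candidate, dictionary):
    def score(entry):
        return edit_distance(candidate, entry) / math.sqrt(
            max(len(candidate), len(entry)))
    return min(dictionary, key=score)

# best_match("H2SO4", ["H2SO4", "H2O", "NaOH"]) -> "H2SO4"
```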
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: An ANN-based classifier is used for isolated symbol recognition.</li>
<li><strong>Input Features</strong>: A set of ~30 features is extracted from strokes, including writing time, interval time, elastic mesh, and stroke outline.</li>
<li><strong>Performance</strong>: The classifier achieved 92.1% accuracy on a test set of 2,702 isolated symbols.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The system was evaluated on the 1,197 expression samples.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Total)</th>
          <th style="text-align: left">Value (Standard)</th>
          <th style="text-align: left">Value (Other)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Material Accuracy (MA)</strong></td>
          <td style="text-align: left">96.4%</td>
          <td style="text-align: left">97.7%</td>
          <td style="text-align: left">94%</td>
          <td style="text-align: left">Accuracy of substance recognition.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left">95.7%</td>
          <td style="text-align: left">96.3%</td>
          <td style="text-align: left">92.5%</td>
          <td style="text-align: left">Accuracy of full expression recognition.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, K., Geng, Q., &amp; Wang, Q. (2008). A Study of On-Line Handwritten Chemical Expressions Recognition. <em>2008 19th International Conference on Pattern Recognition</em>, 1&ndash;4. <a href="https://doi.org/10.1109/ICPR.2008.4761824">https://doi.org/10.1109/ICPR.2008.4761824</a></p>
<p><strong>Publication</strong>: ICPR 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{yangStudyOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Study of On-Line Handwritten Chemical Expressions Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 19th {{International Conference}} on {{Pattern Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yang, Jufeng and Shi, Guangshun and Wang, Kai and Geng, Qian and Wang, Qingren}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1--4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tampa, FL, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2008.4761824}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Img2Mol: Accurate SMILES Recognition from Depictions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/</guid><description>Two-stage CNN approach for converting molecular images to SMILES using CDDD embeddings and extensive data augmentation.</description><content:encoded><![CDATA[<h2 id="method-classification">Method Classification</h2>
<p>This is a <strong>method paper</strong> that introduces Img2Mol, a deep learning system for Optical Chemical Structure Recognition (OCSR). The work focuses on building a fast, accurate, and robust system for converting molecular structure depictions into machine-readable SMILES strings.</p>
<h2 id="systematization-and-motivation">Systematization and Motivation</h2>
<p>Vast amounts of chemical knowledge exist only as images in scientific literature and patents, making this data inaccessible for computational analysis, database searches, or machine learning pipelines. Manually extracting this information is slow and error-prone, creating a bottleneck for drug discovery and chemical research.</p>
<p>While rule-based OCSR systems like OSRA, MolVec, and Imago exist, they are brittle. Small variations in drawing style or image quality can cause them to fail. The authors argue that a deep learning approach, trained on diverse synthetic data, can generalize better across different depiction styles and handle the messiness of real-world images more reliably.</p>
<h2 id="two-stage-architecture-and-core-novelty">Two-Stage Architecture and Core Novelty</h2>
<p>The novelty lies in a two-stage architecture that separates perception from decoding, combined with aggressive data augmentation to ensure robustness. The key contributions are:</p>
<p><strong>1. Two-Stage Architecture with CDDD Embeddings</strong></p>
<p>Img2Mol uses an intermediate representation to predict SMILES from pixels. A <strong>custom CNN encoder</strong> maps the input image to a 512-dimensional <strong>Continuous and Data-Driven Molecular Descriptor (CDDD)</strong> embedding - a pre-trained, learned molecular representation that smoothly captures chemical similarity. A <strong>pre-trained decoder</strong> then converts this CDDD vector into the final canonical SMILES string.</p>
<p>This two-stage design has several advantages:</p>
<ul>
<li>The CDDD space is continuous and chemically meaningful, so nearby embeddings correspond to structurally similar molecules. This makes the regression task easier than learning discrete token sequences directly.</li>
<li>The decoder is pre-trained and fixed, so the CNN only needs to learn the image → CDDD mapping. This decouples the visual recognition problem from the sequence generation problem.</li>
<li>CDDD embeddings naturally enforce chemical validity constraints, reducing the risk of generating nonsensical structures.</li>
</ul>
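<p>At the interface level, the two-stage design separates cleanly into an encode step and a decode step. The sketch below stubs out both networks; <code>encode_image</code> and <code>decode_cddd</code> are hypothetical names standing in for the trained CNN and the frozen pre-trained decoder.</p>

```python
# Interface-level sketch of the two-stage design: a CNN encoder maps the
# image to a 512-d CDDD vector, and a frozen decoder maps that vector to
# a SMILES string. Both stages are stubs; names are hypothetical.

CDDD_DIM = 512

def encode_image(image):
    # Stand-in for the trained CNN: real code would run a forward pass.
    return [0.0] * CDDD_DIM

def decode_cddd(embedding):
    # Stand-in for the frozen, pre-trained CDDD decoder.
    assert len(embedding) == CDDD_DIM
    return "CCO"  # placeholder SMILES output

def img2mol(image):
    # Only encode_image needs training; the decoder stays fixed.
    return decode_cddd(encode_image(image))
```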
<p><strong>2. Extensive Data Augmentation for Robustness</strong></p>
<p>The model was trained on 11.1 million unique molecules from ChEMBL and PubChem, but the critical insight is how the training images were generated. To expose the CNN to maximum variation in depiction styles, the authors:</p>
<ul>
<li>Used <strong>three different cheminformatics libraries</strong> (RDKit, OEChem, Indigo) to render images, each with its own drawing conventions</li>
<li>Applied <strong>wide-ranging augmentations</strong>: varying bond thickness, font size, rotation, resolution (originally 192-256 px; expanded to 190-2500 px in the final model), and other stylistic parameters</li>
<li><strong>Over-sampled larger molecules</strong> to improve performance on complex structures, which are underrepresented in chemical databases</li>
</ul>
<p>This ensures the network rarely sees the same depiction of a molecule twice, forcing it to learn invariant features.</p>
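<p>The augmentation strategy can be sketched as per-image sampling of a rendering library and style parameters. The resolution range follows the text; the bond-width and font-size ranges are illustrative guesses, and all parameter names are ours.</p>

```python
import random

# Sketch of depiction augmentation: each training image samples a rendering
# library and style parameters so the network rarely sees the same
# depiction of a molecule twice.

def sample_depiction_params(rng):
    return {
        "library": rng.choice(["rdkit", "oechem", "indigo"]),
        "resolution": rng.randint(190, 2500),  # final model's stated range
        "rotation_deg": rng.uniform(-180, 180),
        "bond_width": rng.uniform(0.5, 3.0),   # illustrative range
        "font_size": rng.randint(8, 24),       # illustrative range
    }

params = sample_depiction_params(random.Random(0))
```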
<p><strong>3. Fast Inference</strong></p>
<p>Because the architecture is a simple CNN followed by a fixed decoder, inference is very fast - especially compared to rule-based systems that rely on iterative graph construction algorithms. This makes Img2Mol practical for large-scale document mining.</p>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating that Img2Mol is more accurate, robust, and generalizable than existing rule-based systems:</p>
<ol>
<li>
<p><strong>Benchmark Comparisons</strong>: Img2Mol was tested on several standard OCSR benchmarks, including USPTO (patent images), University of Birmingham (UoB), CLEF, and JPO (Japanese Patent Office) datasets, against three open-source baselines: <strong>OSRA, MolVec, and Imago</strong>. No deep learning baselines were available at the time for comparison.</p>
</li>
<li>
<p><strong>Resolution and Molecular Size Analysis</strong>: The initial model, <code>Img2Mol(no aug.)</code>, was evaluated across different image resolutions and molecule sizes (measured by number of atoms) to understand failure modes. This revealed that:</p>
<ul>
<li>Performance degraded for molecules with &gt;35 atoms</li>
<li>Very high-resolution images lost detail when downscaled to the fixed input size</li>
<li>Low-resolution images (where rule-based methods failed completely) were handled well</li>
</ul>
</li>
<li>
<p><strong>Data Augmentation Ablation</strong>: A final model, <strong>Img2Mol</strong>, was trained with the full augmentation pipeline (wider resolution range, over-sampling of large molecules). Performance was compared to the initial version to quantify the effect of augmentation.</p>
</li>
<li>
<p><strong>Depiction Library Robustness</strong>: The model was tested on images generated by each of the three rendering libraries separately to confirm that training on diverse styles improved generalization.</p>
</li>
<li>
<p><strong>Input Perturbation for Benchmark Fairness</strong>: For the smaller benchmark datasets (USPTO, UoB, CLEF, JPO), the authors applied slight random rotation (within +/-5 degrees) and shearing to each image five times to detect potential overfitting of rule-based methods to well-known benchmarks.</p>
</li>
<li>
<p><strong>Generalization Tests</strong>: Img2Mol was evaluated on real-world patent images from the <strong>STAKER</strong> dataset, which were not synthetically generated. This tested whether the model could transfer from synthetic training data to real documents.</p>
</li>
<li>
<p><strong>Hand-Drawn Molecule Recognition</strong>: As an exploratory test, the authors evaluated performance on hand-drawn molecular structures, a task the model was never trained for, to see if the learned features could generalize to completely different visual styles.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured and compared to rule-based baselines to demonstrate the practical efficiency of the approach.</p>
</li>
</ol>
<h2 id="results-conclusions-and-limitations">Results, Conclusions, and Limitations</h2>
<p>Key benchmark results from Table 1 of the paper (accuracy / Tanimoto similarity, in %):</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Img2Mol</th>
          <th>MolVec 0.9.8</th>
          <th>Imago 2.0</th>
          <th>OSRA 2.1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Img2Mol test set</td>
          <td>88.25 / 95.27</td>
          <td>2.59 / 13.03</td>
          <td>0.02 / 4.74</td>
          <td>2.59 / 13.03</td>
      </tr>
      <tr>
          <td>STAKER</td>
          <td>64.33 / 83.76</td>
          <td>5.32 / 31.78</td>
          <td>0.07 / 5.06</td>
          <td>5.23 / 26.98</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>42.29 / 73.07</td>
          <td>30.68 / 65.50</td>
          <td>5.07 / 7.28</td>
          <td>6.37 / 44.21</td>
      </tr>
      <tr>
          <td>UoB</td>
          <td>78.18 / 88.51</td>
          <td>75.01 / 86.88</td>
          <td>5.12 / 7.19</td>
          <td>70.89 / 85.27</td>
      </tr>
      <tr>
          <td>CLEF</td>
          <td>48.84 / 78.04</td>
          <td>44.48 / 76.61</td>
          <td>26.72 / 41.29</td>
          <td>17.04 / 58.84</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>45.14 / 69.43</td>
          <td>49.48 / 66.46</td>
          <td>23.18 / 37.47</td>
          <td>33.04 / 49.62</td>
      </tr>
  </tbody>
</table>
<p>Per-library accuracy on a 5,000-compound subset (depicted five times each):</p>
<table>
  <thead>
      <tr>
          <th>Library</th>
          <th>Img2Mol</th>
          <th>MolVec</th>
          <th>Imago</th>
          <th>OSRA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RDKit</td>
          <td>93.4%</td>
          <td>3.7%</td>
          <td>0.3%</td>
          <td>4.4%</td>
      </tr>
      <tr>
          <td>OEChem</td>
          <td>89.5%</td>
          <td>33.4%</td>
          <td>12.3%</td>
          <td>26.3%</td>
      </tr>
      <tr>
          <td>Indigo</td>
          <td>79.0%</td>
          <td>22.2%</td>
          <td>4.2%</td>
          <td>22.6%</td>
      </tr>
  </tbody>
</table>
<ul>
<li>
<p><strong>Substantial Performance Gains</strong>: Img2Mol outperformed all three rule-based baselines on nearly every benchmark; the one exception was JPO, where MolVec scored higher (49.48% vs. 45.14% accuracy). Accuracy was measured both as exact SMILES match and as <strong>Tanimoto similarity</strong> (using ECFP6 1024-bit fingerprints). Even when Img2Mol did not predict the exact molecule, it often predicted a chemically similar one.</p>
</li>
<li>
<p><strong>Robustness Across Conditions</strong>: The full Img2Mol model (with aggressive augmentation) showed consistent performance across all image resolutions and molecule sizes. In contrast, rule-based systems were &ldquo;brittle&rdquo; - performance dropped sharply with minor perturbations to image quality or style.</p>
</li>
<li>
<p><strong>Depiction Library Invariance</strong>: Img2Mol&rsquo;s performance was stable across all three rendering libraries (RDKit, OEChem, Indigo), validating the multi-library training strategy. Rule-based methods struggled particularly with RDKit-generated images.</p>
</li>
<li>
<p><strong>Strong Generalization to Real-World Data</strong>: Despite being trained exclusively on synthetic images, Img2Mol performed well on real patent images from the STAKER dataset. This suggests the augmentation strategy successfully captured the diversity of real-world depictions.</p>
</li>
<li>
<p><strong>Overfitting in Baselines</strong>: Rule-based methods performed surprisingly well on older benchmarks (USPTO, UoB, CLEF) but failed on newer datasets (Img2Mol&rsquo;s test set, STAKER). This suggests they may be implicitly tuned to specific drawing conventions in legacy datasets.</p>
</li>
<li>
<p><strong>Limited Hand-Drawn Recognition</strong>: Img2Mol could recognize simple hand-drawn structures but struggled with complex or large molecules. This is unsurprising given the lack of hand-drawn data in training, but it highlights a potential avenue for future work.</p>
</li>
<li>
<p><strong>Speed Advantage</strong>: Img2Mol processed 5,000 images in approximately 4 minutes at the smallest input size, with compute time mostly independent of input resolution due to the fixed 224x224 rescaling. Rule-based methods showed sharply increasing compute times at higher resolutions.</p>
</li>
</ul>
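<p>The Tanimoto metric used above operates on fingerprint bit vectors, not raw SMILES. As a minimal illustration (the paper computes ECFP6 1024-bit fingerprints with a cheminformatics toolkit; here a fingerprint is simplified to a plain set of on-bit indices):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each represented as the set of indices of its on bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Identical fingerprints score 1.0; half-overlapping ones score 0.5.
print(tanimoto({1, 5, 9}, {1, 5, 9}))  # 1.0
print(tanimoto({1, 5, 9}, {2, 5, 9}))  # 0.5
```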
<p>The work establishes that deep learning can outperform traditional rule-based OCSR systems when combined with a principled two-stage architecture and comprehensive data augmentation. The CDDD embedding acts as a bridge between visual perception and chemical structure, providing a chemically meaningful intermediate representation that improves both accuracy and robustness. The focus on synthetic data diversity proves to be an effective strategy for generalizing to real-world documents.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Custom 8-layer Convolutional Neural Network (CNN) encoder</p>
<ul>
<li><strong>Input</strong>: $224 \times 224$ pixel grayscale images</li>
<li><strong>Backbone Structure</strong>: 8 convolutional layers organized into 3 stacks, followed by 3 fully connected layers
<ul>
<li><strong>Stack 1</strong>: 3 Conv layers ($7 \times 7$ filters, stride 3, padding 4) + Max Pooling</li>
<li><strong>Stack 2</strong>: 2 Conv layers + Max Pooling</li>
<li><strong>Stack 3</strong>: 3 Conv layers + Max Pooling</li>
<li><strong>Head</strong>: 3 fully connected layers</li>
</ul>
</li>
<li><strong>Output</strong>: 512-dimensional CDDD embedding vector</li>
</ul>
<p><strong>Decoder</strong>: Pre-trained CDDD decoder (from Winter et al.) - fixed during training, not updated</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Loss Function</strong>: Mean Squared Error (MSE) regression minimizing the distance between the predicted and true embeddings:</p>
<p>$$
\mathcal{L} = \frac{1}{512} \sum_{i=1}^{512} \left( \text{cddd}_{\text{true},\,i} - \text{cddd}_{\text{predicted},\,i} \right)^2
$$</p>
<p><strong>Optimizer</strong>: AdamW with initial learning rate $10^{-4}$</p>
<p><strong>Training Schedule</strong>:</p>
<ul>
<li>Batch size: 256</li>
<li>Training duration: 300 epochs</li>
<li>Plateau scheduler: Multiplies learning rate by 0.7 if validation loss plateaus for 10 epochs</li>
<li>Early stopping: Triggered if no improvement in validation loss for 50 epochs</li>
</ul>
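<p>The schedule above can be sketched framework-free. The factor (0.7), LR patience (10 epochs), and stopping patience (50 epochs) come from the paper; the exact bookkeeping (e.g., whether the two patience counters are shared, as assumed here) is not specified:</p>

```python
class PlateauSchedule:
    """Sketch of the reported training schedule: multiply LR by 0.7 after
    10 epochs without validation-loss improvement, stop after 50."""

    def __init__(self, lr=1e-4, factor=0.7, lr_patience=10, stop_patience=50):
        self.lr, self.factor = lr, factor
        self.lr_patience, self.stop_patience = lr_patience, stop_patience
        self.best = float("inf")
        self.since_best = 0  # epochs since the best validation loss

    def step(self, val_loss):
        """Update after one epoch; returns False once training should stop."""
        if val_loss < self.best:
            self.best, self.since_best = val_loss, 0
        else:
            self.since_best += 1
            if self.since_best % self.lr_patience == 0:
                self.lr *= self.factor  # decay on every full patience window
        return self.since_best < self.stop_patience
```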
<p><strong>Noise Tolerance</strong>: The decoder requires the CNN to predict embeddings with noise level $\sigma \le 0.15$ to achieve &gt;90% accuracy</p>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>: 11.1 million unique molecules from ChEMBL and PubChem</p>
<p><strong>Splits</strong>: Approximately 50,000 examples each for validation and test sets</p>
<p><strong>Synthetic Image Generation</strong>:</p>
<ul>
<li>Three cheminformatics libraries: RDKit, OEChem, and Indigo</li>
<li>Augmentations: Resolution (190-2500 pixels), rotation, bond thickness, font size</li>
<li>Salt stripping: Keep only the largest fragment</li>
<li>Over-sampling: Larger molecules (&gt;35 atoms) over-sampled to improve performance</li>
</ul>
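<p>The augmentation strategy amounts to sampling per-image rendering parameters. In the sketch below, the library choice and the 190&ndash;2500&nbsp;px resolution range are from the paper; the rotation, bond-thickness, and font-size ranges are illustrative placeholders, not values reported by the authors:</p>

```python
import random

def sample_depiction_params(rng=None):
    """Sample rendering parameters for one synthetic training depiction."""
    rng = rng or random.Random()
    return {
        "library": rng.choice(["RDKit", "OEChem", "Indigo"]),
        "resolution_px": rng.randint(190, 2500),   # range from the paper
        "rotation_deg": rng.uniform(-15.0, 15.0),  # placeholder range
        "bond_thickness": rng.uniform(0.5, 2.0),   # placeholder range
        "font_size": rng.randint(10, 24),          # placeholder range
    }
```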
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li>Exact SMILES match accuracy</li>
<li>Tanimoto similarity (chemical fingerprint-based structural similarity)</li>
</ul>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li>Img2Mol test set (25,000 synthetic images at 224x224 px)</li>
<li>STAKER (30,000 real-world USPTO patent images at 256x256 px)</li>
<li>USPTO (4,852 patent images, avg. 649x417 px)</li>
<li>UoB (5,716 images from University of Birmingham, avg. 762x412 px)</li>
<li>CLEF (711 images, avg. 1243x392 px)</li>
<li>JPO (365 Japanese Patent Office images, avg. 607x373 px)</li>
<li>Hand-drawn molecular structures (exploratory, no defined benchmark)</li>
</ul>
<p><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based systems)</p>
<h3 id="hardware">Hardware</h3>
<p>⚠️ <strong>Unspecified in paper or supplementary materials.</strong> Inference speed reported as ~4 minutes for 5000 images; training hardware (GPU model, count) is undocumented.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">Img2Mol GitHub</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">Img2Mol model weights</a></td>
          <td>Model</td>
          <td>CC BY-NC 4.0</td>
          <td>Non-commercial use only</td>
      </tr>
  </tbody>
</table>
<h3 id="known-limitations">Known Limitations</h3>
<p><strong>Molecular Size</strong>: Performance degrades for molecules with &gt;35 atoms. This is partly a property of the CDDD latent space itself: for larger molecules, the &ldquo;volume of decodable latent space&rdquo; shrinks, making the decoder more sensitive to small noise perturbations in the predicted embedding.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clevert, D.-A., Le, T., Winter, R., &amp; Montanari, F. (2021). Img2Mol &ndash; accurate SMILES recognition from molecular graphical depictions. <em>Chemical Science</em>, 12(42), 14174&ndash;14181. <a href="https://doi.org/10.1039/d1sc01839f">https://doi.org/10.1039/d1sc01839f</a></p>
<p><strong>Publication</strong>: Chemical Science (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">GitHub Repository</a></li>
<li><a href="https://doi.org/10.1039/d1sc01839f">Paper on Royal Society of Chemistry</a></li>
</ul>
]]></content:encoded></item><item><title>HMM-based Online Recognition of Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</guid><description>Online recognition of handwritten chemical symbols using Hidden Markov Models with 11-dimensional local features, achieving 89.5% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper that proposes a specific algorithmic pipeline for the online recognition of handwritten chemical symbols. The core contribution is the engineering of an 11-dimensional feature vector combined with a Hidden Markov Model (HMM) architecture. The paper validates this method through quantitative experiments on a custom dataset, focusing on recognition accuracy as the primary metric.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Recognizing chemical symbols is uniquely challenging due to the complex structure of chemical expressions and the nature of pen-based input, which often results in broken or conglutinate strokes. Additionally, variations in writing style and random noise make the task difficult. While online recognition for Western characters and CJK scripts is well-developed, works specifically targeting online chemical symbol recognition are scarce, with most prior research focusing on offline recognition or global optimization.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The primary novelty is the application of continuous HMMs specifically to the domain of <strong>online</strong> chemical symbol recognition, utilizing a specialized set of <strong>11-dimensional local features</strong>. While HMMs have been used for other scripts, this paper tailors the feature extraction (including curliness, linearity, and writing direction) to capture the specific geometric properties of chemical symbols.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors constructed a specific dataset for this task involving 20 participants (college teachers and students).</p>
<ul>
<li><strong>Dataset</strong>: 64 distinct symbols (digits, English letters, Greek letters, operators)</li>
<li><strong>Volume</strong>: 7,808 total samples (122 per symbol), split into 5,670 training samples and 2,016 testing samples</li>
<li><strong>Model Sweeps</strong>: They evaluated the HMM performance by varying the number of states (4, 6, 8) and the number of Gaussians per state (3, 4, 6, 9, 12)</li>
</ul>
<h2 id="what-were-the-outcomes-and-conclusions-drawn">What were the outcomes and conclusions drawn?</h2>
<ul>
<li><strong>Performance</strong>: The best configuration (6 states, 9 Gaussians) achieved a <strong>top-1 accuracy of 89.5%</strong> and a <strong>top-3 accuracy of 98.7%</strong></li>
<li><strong>Scaling</strong>: Results showed that generally, increasing the number of states and Gaussians improved accuracy, though at the cost of computational efficiency</li>
<li><strong>Error Analysis</strong>: The primary sources of error were shape similarities between specific characters (e.g., &lsquo;0&rsquo; vs &lsquo;O&rsquo; vs &lsquo;o&rsquo;, and &lsquo;C&rsquo; vs &lsquo;c&rsquo; vs &lsquo;(&rsquo;)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed / Very Low Reproducibility. This 2009 study relies on a private, custom-collected dataset and does not provide source code, model weights, or an open-access preprint.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None publicly available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No open source code, open datasets, or open-access preprints were released with this publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilized a custom dataset collected in a laboratory environment.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">5,670 samples</td>
          <td style="text-align: left">90 samples per symbol</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">2,016 samples</td>
          <td style="text-align: left">32 samples per symbol</td>
      </tr>
  </tbody>
</table>
<p><strong>Dataset Composition</strong>: The set includes <strong>64 symbols</strong>: Digits (0-9), Uppercase (A-Z, missing Q), Lowercase (a-z, selected), Greek letters ($\alpha$, $\beta$, $\gamma$, $\pi$), and operators ($+$, $=$, $\rightarrow$, $\uparrow$, $\downarrow$, $($ , $)$).</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Preprocessing</strong></p>
<p>The raw tablet data undergoes a 6-step pipeline:</p>
<ol>
<li><strong>Duplicate Point Elimination</strong>: Removing sequential points with identical coordinates</li>
<li><strong>Broken Stroke Connection</strong>: Using Bezier curves to interpolate missing points/connect broken strokes</li>
<li><strong>Hook Elimination</strong>: Removing artifacts at the start/end of strokes characterized by short length and sharp angle changes</li>
<li><strong>Smoothing</strong>: Reducing noise from erratic pen movement</li>
<li><strong>Re-sampling</strong>: Spacing points equidistantly to remove temporal variation</li>
<li><strong>Size Normalization</strong>: Removing variation in writing scale</li>
</ol>
<p><strong>2. Feature Extraction (11 Dimensions)</strong></p>
<p>Features are extracted from a 5-point window centered on $t$ ($t-2$ to $t+2$). The 11 dimensions are:</p>
<ol>
<li><strong>Normalized Vertical Position</strong>: $y(t)$ mapped to $[0,1]$</li>
<li><strong>Normalized First Derivative ($x'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized First Derivative ($y'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized Second Derivative ($x''$)</strong>: Computed using $x'$ values</li>
<li><strong>Normalized Second Derivative ($y''$)</strong>: Computed using $y'$ values</li>
<li><strong>Curvature</strong>: $\frac{x'y'' - x''y'}{(x'^2 + y'^2)^{3/2}}$</li>
<li><strong>Writing Direction (Cos)</strong>: $\cos \alpha(t)$ based on vector from $t-1$ to $t+1$</li>
<li><strong>Writing Direction (Sin)</strong>: $\sin \alpha(t)$</li>
<li><strong>Aspect Ratio</strong>: Ratio of height to width in the 5-point window</li>
<li><strong>Curliness</strong>: Deviation from the straight line connecting the first and last point of the window</li>
<li><strong>Linearity</strong>: Average squared distance of points in the window to the straight line connecting start/end points</li>
</ol>
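<p>A few of these window features can be computed directly from the five points. The sketch below covers the direction, aspect-ratio, and linearity dimensions; the exact normalization constants are assumptions where the paper's formulas are not reproduced here:</p>

```python
import math

def window_features(pts):
    """Direction, aspect ratio, and linearity for a 5-point window
    [p(t-2), ..., p(t+2)] -- a sketch of three of the 11 dimensions."""
    assert len(pts) == 5
    # Writing direction: angle of the vector from p(t-1) to p(t+1).
    (x1, y1), (x3, y3) = pts[1], pts[3]
    dx, dy = x3 - x1, y3 - y1
    norm = math.hypot(dx, dy) or 1.0
    cos_a, sin_a = dx / norm, dy / norm
    # Aspect ratio: bounding-box height over width.
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    width = (max(xs) - min(xs)) or 1.0
    aspect = (max(ys) - min(ys)) / width
    # Linearity: mean squared distance to the chord from first to last point.
    (ax, ay), (bx, by) = pts[0], pts[-1]
    chord = math.hypot(bx - ax, by - ay) or 1.0
    lin = sum(((bx - ax) * (py - ay) - (by - ay) * (px - ax)) ** 2
              for px, py in pts) / (chord ** 2 * len(pts))
    return {"cos": cos_a, "sin": sin_a, "aspect": aspect, "linearity": lin}
```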
<p><strong>3. Feature Normalization</strong></p>
<p>The final feature matrix $V$ is normalized to zero mean and unit standard deviation using the covariance matrix: $o_t = \Sigma^{-1/2}(v_t - \mu)$.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Continuous Hidden Markov Models (HMM)</li>
<li><strong>Topology</strong>: Left-to-right (Bakis model)</li>
<li><strong>Initialization</strong>: Initial distribution $\pi = \{1, 0, \ldots, 0\}$; uniform transition matrix $A$; segmental k-means for observation matrix $B$</li>
<li><strong>Training</strong>: Baum-Welch re-estimation</li>
<li><strong>Decision</strong>: Maximum likelihood classification ($\hat{\lambda} = \arg \max P(O|\lambda)$)</li>
</ul>
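<p>The left-to-right (Bakis) topology with a uniform initial transition matrix can be sketched as follows; restricting each state to self-loops and single-step advances (<code>max_jump=1</code>) is an assumption, since the paper does not state the allowed jump width:</p>

```python
def bakis_transitions(n_states, max_jump=1):
    """Left-to-right (Bakis) transition matrix: state i may stay or advance
    by up to max_jump states, uniform over the allowed moves."""
    A = []
    for i in range(n_states):
        allowed = range(i, min(i + max_jump, n_states - 1) + 1)
        A.append([1.0 / len(allowed) if j in allowed else 0.0
                  for j in range(n_states)])
    return A
```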
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Best Value</th>
          <th style="text-align: left">Configuration</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Accuracy</strong></td>
          <td style="text-align: left"><strong>89.5%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Highest reported accuracy</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-3 Accuracy</strong></td>
          <td style="text-align: left"><strong>98.7%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Top-3 candidate accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Yang, J. (2009). HMM-Based Online Recognition of Handwritten Chemical Symbols. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1255&ndash;1259. <a href="https://doi.org/10.1109/ICDAR.2009.99">https://doi.org/10.1109/ICDAR.2009.99</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2009hmm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{HMM-Based Online Recognition of Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th International Conference on Document Analysis and Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{75}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1255--1259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.99}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Symbol Recognition Using SVMs</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</guid><description>A hybrid SVM and elastic matching approach for recognizing handwritten chemical symbols drawn on touch devices, achieving 89.7% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-taxonomy">Paper Contribution and Taxonomy</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI for Physical Sciences taxonomy</a>.</p>
<ul>
<li><strong>Dominant Basis</strong>: The authors propose a novel hybrid architecture (SVM-EM) that combines two existing techniques to solve a specific recognition problem.</li>
<li><strong>Rhetorical Indicators</strong>: The paper explicitly defines algorithms (Algorithm 1 &amp; 2), presents a system architecture, and validates the method via ablation studies comparing the hybrid approach against its individual components.</li>
</ul>
<h2 id="motivation-for-pen-based-input">Motivation for Pen-Based Input</h2>
<p>Entering chemical expressions on digital devices is difficult due to their complex 2D spatial structure.</p>
<ul>
<li><strong>The Problem</strong>: While handwriting recognition for text and math is mature, chemical structures involve unique symbols and spatial arrangements that existing tools struggle to process efficiently.</li>
<li><strong>Existing Solutions</strong>: Standard tools (like ChemDraw) rely on point-and-click interactions, which are described as complicated and non-intuitive compared to direct handwriting.</li>
<li><strong>Goal</strong>: To enable fluid handwriting input on pen/touch-based devices (like iPads) by accurately recognizing individual chemical symbols in real-time.</li>
</ul>
<h2 id="novelty-hybrid-svm-and-elastic-matching">Novelty: Hybrid SVM and Elastic Matching</h2>
<p>The core contribution is the <strong>Hybrid SVM-EM</strong> approach, which splits recognition into a coarse classification stage and a fine-grained verification stage.</p>
<ul>
<li><strong>Two-Stage Pipeline</strong>:
<ol>
<li><strong>SVM Recognition</strong>: Uses statistical features (stroke count, turning angles) to generate a short-list of candidate symbols.</li>
<li><strong>Elastic Matching (EM)</strong>: Uses a geometric point-to-point distance metric to re-rank these candidates against a library of stored symbol prototypes.</li>
</ol>
</li>
<li><strong>Online Stroke Partitioning</strong>: A heuristic-based method to group strokes into symbols in real-time based on time adjacency (grouping the last $n$ strokes) and spatial intersection checks, without waiting for the user to finish the entire drawing.</li>
</ul>
<h2 id="experimental-design-and-data-collection">Experimental Design and Data Collection</h2>
<p>The authors conducted a user study to collect data and evaluate the system:</p>
<ul>
<li><strong>Participants</strong>: 10 users were recruited to write chemical symbols on an iPad.</li>
<li><strong>Task</strong>: Each user wrote 78 distinct chemical symbols (digits, alphabets, bonds) 3 times each.</li>
<li><strong>Baselines</strong>: The hybrid method was compared against two baselines:
<ol>
<li>SVM only</li>
<li>Elastic Matching only.</li>
</ol>
</li>
<li><strong>Metrics</strong>: Evaluation focused on <strong>Precision@k</strong> (where $k=1, 3, 5$), measuring how often the correct symbol appeared in the top-$k$ suggestions.</li>
</ul>
<h2 id="recognition-performance-and-outcomes">Recognition Performance and Outcomes</h2>
<p>The hybrid approach demonstrated improved performance compared to using either technique in isolation.</p>
<ul>
<li><strong>Key Results</strong>:
<ul>
<li><strong>Hybrid SVM-EM</strong>: 89.7% Precision@1 (Top-1 accuracy).</li>
<li><strong>SVM Only</strong>: 85.1% Precision@1.</li>
<li><strong>EM Only</strong>: 76.7% Precision@1.</li>
</ul>
</li>
<li><strong>Category Performance</strong>: The system performed best on Operators (91.9%) and Digits (91.3%), with slightly lower performance on Alphabetic characters (88.6%).</li>
<li><strong>Impact</strong>: The system was successfully implemented as a real-time iOS application, allowing users to draw complex structures that are then converted to SMILES strings such as <code>C#CC(O)</code>.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study generated a custom dataset for training and evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Stats</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>2,340 samples</td>
          <td>Collected from 10 users. Consists of <strong>78 unique symbols</strong>: 10 digits (0-9), 52 letters (A-Z, a-z), and 16 bonds/operators (e.g., $=$, $+$, hash bonds).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Unspecified size</td>
          <td>A &ldquo;Chemical Elastic Symbol Library&rdquo; was created containing samples of all supported symbols to serve as prototypes for the Elastic Matching step.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct algorithmic steps:</p>
<p><strong>1. Stroke Partitioning</strong></p>
<ul>
<li><strong>Logic</strong>: Groups the most recently written stroke with up to the last 4 previous strokes.</li>
<li><strong>Filtering</strong>: Invalid groups are removed using &ldquo;Spatial Distance Checking&rdquo; (strokes too far apart) and &ldquo;Stroke Intersection Checking&rdquo; (strokes that don&rsquo;t intersect where expected).</li>
</ul>
<p><strong>2. Preprocessing</strong></p>
<ul>
<li><strong>Size Normalization</strong>: Scales symbol to a standard size based on its bounding box.</li>
<li><strong>Smoothing</strong>: Uses average smoothing (replacing mid-points with the average of neighbors) to remove jitter.</li>
<li><strong>Sampling</strong>: Resamples valid strokes to a fixed number of <strong>50 points</strong>.</li>
</ul>
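<p>The resampling step above can be sketched as equidistant arc-length resampling. The target of 50 points is from the paper; the linear-interpolation algorithm itself is an assumption, since the authors do not spell it out:</p>

```python
import math

def resample_stroke(points, n=50):
    """Resample a stroke to n points spaced equally by arc length."""
    if len(points) < 2:
        return list(points) * n if points else []
    # Cumulative arc length at each input point.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    out, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment containing the target arc length.
        while j < len(cum) - 2 and cum[j + 1] < target:
            j += 1
        seg = (cum[j + 1] - cum[j]) or 1.0
        t = (target - cum[j]) / seg
        (x0, y0), (x1, y1) = points[j], points[j + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out
```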
<p><strong>3. SVM Feature Extraction</strong></p>
<ul>
<li><strong>Horizontal Angle</strong>: Calculated between two consecutive points ($P_1, P_2$). Values are binned into 12 groups ($30^{\circ}$ each).</li>
<li><strong>Turning Angle</strong>: The difference between two consecutive horizontal angles. Values are binned into 18 groups ($10^{\circ}$ each).</li>
<li><strong>Features</strong>: Input vector consists of stroke count, normalized coordinates, and the percentage of angles falling into the histograms described above.</li>
</ul>
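<p>The angle binning can be sketched directly. The bin widths (12 &times; 30&deg; and 18 &times; 10&deg;) are from the paper; folding the turning angle into [0&deg;, 180&deg;) is an assumption about how direction is handled:</p>

```python
import math

def horizontal_angle_bin(p1, p2, n_bins=12):
    """Bin the angle of segment p1 -> p2 into one of 12 groups of 30 degrees."""
    ang = math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0])) % 360.0
    return int(ang // (360.0 / n_bins))

def turning_angle_bin(a1_deg, a2_deg, n_bins=18):
    """Bin the turning angle between two consecutive segment angles into
    18 groups of 10 degrees, folding the difference into [0, 180)."""
    diff = abs(a1_deg - a2_deg) % 360.0
    if diff >= 180.0:
        diff = 360.0 - diff
    return min(int(diff // (180.0 / n_bins)), n_bins - 1)
```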
<p><strong>4. Elastic Matching (Verification)</strong></p>
<ul>
<li><strong>Distance Function</strong>: Euclidean distance summation between the points of the candidate symbol ($s$) and the partitioned input ($s_p$).
$$
\begin{aligned}
D(s, s_p) = \sum_{j=1}^{n} \sqrt{(x_{s,j} - x_{p,j})^2 + (y_{s,j} - y_{p,j})^2}
\end{aligned}
$$
<em>Note: The paper formula sums the distances; $n$ is the number of points (50).</em></li>
<li><strong>Ranking</strong>: Candidates are re-ranked in ascending order of this elastic distance.</li>
</ul>
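<p>The elastic distance and re-ranking step can be sketched directly from the formula above (the candidate-library representation as a name-to-prototype dict is an illustrative assumption):</p>

```python
import math

def elastic_distance(s, s_p):
    """D(s, s_p): sum of point-to-point Euclidean distances between two
    strokes resampled to the same number of points (50 in the paper)."""
    assert len(s) == len(s_p), "both strokes must be resampled to n points"
    return sum(math.hypot(xa - xb, ya - yb)
               for (xa, ya), (xb, yb) in zip(s, s_p))

def rank_candidates(candidates, s_p):
    """Re-rank SVM candidates (name -> prototype stroke) by ascending
    elastic distance to the partitioned input stroke s_p."""
    return sorted(candidates,
                  key=lambda name: elastic_distance(candidates[name], s_p))
```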
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: Linear Support Vector Machine (SVM) implemented using <strong>LibSVM</strong>.</li>
<li><strong>Symbol Library</strong>: A &ldquo;Chemical Elastic Symbol Library&rdquo; stores the raw stroke point sequences for all 78 supported symbols to enable the elastic matching comparison.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using precision at different ranks (Top-N accuracy).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision@1</strong></td>
          <td><strong>89.7%</strong></td>
          <td>85.1% (SVM)</td>
          <td>Hybrid model reduces error rate significantly over baselines.</td>
      </tr>
      <tr>
          <td><strong>Precision@3</strong></td>
          <td><strong>94.1%</strong></td>
          <td>N/A</td>
          <td>High recall in top 3 allows users to quickly correct errors via UI selection.</td>
      </tr>
      <tr>
          <td><strong>Precision@5</strong></td>
          <td><strong>94.6%</strong></td>
          <td>N/A</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: Apple iPad (iOS platform).</li>
<li><strong>Input</strong>: Touch/Pen-based input recording digital ink (x, y coordinates and pen-up/down events).</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, P., Hui, S. C., &amp; Fu, C. W. (2013). Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition. <em>2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)</em>, 535&ndash;540. <a href="https://doi.org/10.1109/ICIS.2013.6607894">https://doi.org/10.1109/ICIS.2013.6607894</a></p>
<p><strong>Publication</strong>: IEEE ICIS 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{tangOnlineChemicalSymbol2013,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Tang, Peng and Hui, Siu Cheung and Fu, Chi-Wing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2013</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{535--540}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICIS.2013.6607894}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Ring Recognition with Neural Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</guid><description>A two-phase Classifier-Recognizer neural network pipeline for recognizing 23 types of handwritten heterocyclic chemical rings, achieving ~94% accuracy.</description><content:encoded><![CDATA[<h2 id="contribution-recognition-architecture-for-heterocyclic-rings">Contribution: Recognition Architecture for Heterocyclic Rings</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a specific algorithmic architecture (the &ldquo;Classifier-Recognizer Approach&rdquo;) to solve a pattern recognition problem. The rhetorical structure centers on defining three variations of a method, performing ablation-like comparisons between them (Whole Image vs. Lower Part), and demonstrating superior performance metrics (~94% accuracy) for the proposed technique.</p>
<h2 id="motivation-enabling-sketch-based-chemical-search">Motivation: Enabling Sketch-Based Chemical Search</h2>
<p>The authors identify a gap in existing OCR and handwriting recognition research, which typically focuses on alphanumeric characters or whole words.</p>
<ul>
<li><strong>Missing Capability</strong>: Recognition of specific <em>heterocyclic chemical rings</em> (23 types) had not been performed previously.</li>
<li><strong>Practical Utility</strong>: Existing chemical search engines require text-based queries (names); this work enables &ldquo;backward&rdquo; search where a user can draw a ring to find its information.</li>
<li><strong>Educational/Professional Aid</strong>: Useful for chemistry departments and mobile applications where chemists can sketch formulas on screens.</li>
</ul>
<h2 id="innovation-the-classifier-recognizer-pipeline">Innovation: The Classifier-Recognizer Pipeline</h2>
<p>The core novelty is the <strong>two-phase &ldquo;Classifier-Recognizer&rdquo; architecture</strong> designed to handle the visual similarity of heterocyclic rings:</p>
<ol>
<li><strong>Phase 1 (Classifier)</strong>: A neural network classifies the ring into one of four broad categories (S, N, O, Others) based solely on the <em>upper part</em> of the image (40x15 pixels).</li>
<li><strong>Phase 2 (Recognizer)</strong>: A class-specific neural network identifies the exact ring.</li>
<li><strong>Optimization</strong>: The most successful variation (&ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo;) uses only the <em>lower part</em> of the image and <em>odd rows</em> (half-grid) to reduce input dimensionality and computation time while improving accuracy. This effectively subsamples the input grid matrix $M \in \mathbb{R}^{H \times W}$ to a reduced matrix $M_{\text{sub}}$:
$$ M_{\text{sub}} = \{ m_{i,j} \in M \mid i \text{ is odd} \} $$</li>
</ol>
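<p>The half-grid subsampling above amounts to a simple strided row slice. A minimal sketch (the paper gives no code, and which row parity counts as &ldquo;odd&rdquo; is an assumption):</p>

```python
import numpy as np

def half_grid(image: np.ndarray) -> np.ndarray:
    """Keep every other row of the binary grid (the paper's 'half size
    grid'), halving the input dimensionality. Which parity is kept is an
    assumption; the paper only says 'odd rows'."""
    return image[::2, :]

grid = np.zeros((40, 40), dtype=np.uint8)
sub = half_grid(grid)
assert sub.shape == (20, 40)  # 1600 inputs reduced to 800
```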
<h2 id="failed-preliminary-approaches">Failed Preliminary Approaches</h2>
<p>Before arriving at the Classifier-Recognizer architecture, the authors tried three simpler methods that all failed:</p>
<ol>
<li><strong>Ordinary NN</strong>: A single neural network with 1600 inputs (40x40 grid), 1600 hidden units, and 23 outputs. This standard approach achieved only 7% accuracy.</li>
<li><strong>Row/Column pixel counts</strong>: Using the number of black pixels per row and per column as features ($N_c + N_r$ inputs), which dramatically reduced dimensionality. This performed even worse, below 1% accuracy.</li>
<li><strong>Midline crossing count</strong>: Drawing a horizontal midline and counting the number of line crossings. This failed because the crossing count varies between writers for the same ring.</li>
</ol>
<p>These failures motivated the two-phase Classifier-Recognizer design.</p>
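<p>For concreteness, the row/column pixel-count featurization from the second failed attempt collapses a 40x40 binary grid to just 80 numbers, which discards the spatial layout the rings need for discrimination. A sketch (feature ordering is an assumption):</p>

```python
def pixel_count_features(grid):
    """Black-pixel counts per row and per column of a binary grid:
    N_r + N_c = 40 + 40 = 80 features for a 40x40 input."""
    rows = [sum(row) for row in grid]
    cols = [sum(col) for col in zip(*grid)]
    return rows + cols
```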
<h2 id="experimental-setup-and-network-variations">Experimental Setup and Network Variations</h2>
<p>The authors conducted a comparative study of three methodological variations:</p>
<ol>
<li><strong>Whole Image Recognizer</strong>: Uses the full image.</li>
<li><strong>Whole Image (Half Size Grid)</strong>: Uses only odd rows ($20 \times 40$ pixels).</li>
<li><strong>Lower Part (Half Size Grid)</strong>: Uses the lower part of the image with odd rows (the proposed method).</li>
</ol>
<p><strong>Setup</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 23 types of heterocyclic rings.</li>
<li><strong>Training</strong>: 1500 samples (distributed across S, N, O, and Others classes).</li>
<li><strong>Testing</strong>: 1150 samples.</li>
<li><strong>Metric</strong>: Recognition accuracy (Performance %) and Error %.</li>
</ul>
<h2 id="results-high-accuracy-via-dimension-reduction">Results: High Accuracy via Dimension Reduction</h2>
<ul>
<li><strong>Superior Method</strong>: The &ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo; achieved the best performance (~94% overall).</li>
<li><strong>High Classifier Accuracy</strong>: The first phase (classification into S/N/O/Other) achieves 100% accuracy for class S, 98.67% for O, 97.75% for N, and 97.67% for Others (Table 3).</li>
<li><strong>Class &lsquo;Others&rsquo; Difficulty</strong>: The &lsquo;Others&rsquo; class showed lower performance (93% in the best variation, vs. 96&ndash;98% for S/N/O) due to the higher complexity and visual similarity of rings in that category.</li>
<li><strong>Efficiency</strong>: The half-grid approach reduced training time from ~53 hours (Whole Image) to ~35 hours (Lower Part Half Size Grid) while improving accuracy from 87% to 94%.</li>
</ul>
<p><strong>Training/Testing comparison across the three Classifier-Recognizer variations (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Method</th>
          <th style="text-align: left">Hidden Nodes</th>
          <th style="text-align: left">Iterations</th>
          <th style="text-align: left">Training Time (hrs)</th>
          <th style="text-align: left">Error</th>
          <th style="text-align: left">Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Whole Image</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~53</td>
          <td style="text-align: left">13.0%</td>
          <td style="text-align: left">87.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Whole Image (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~41</td>
          <td style="text-align: left">9.0%</td>
          <td style="text-align: left">91.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Lower Part (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~35</td>
          <td style="text-align: left">6.0%</td>
          <td style="text-align: left">94.0%</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset consists of handwritten samples of 23 specific heterocyclic rings.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1500 samples</td>
          <td style="text-align: left">Split: 300 (S), 400 (N), 400 (O), 400 (Others)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1150 samples</td>
          <td style="text-align: left">Split: 150 (S), 300 (O), 400 (N), 300 (Others)</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing Steps</strong>:</p>
<ol>
<li><strong>Monochrome Conversion</strong>: Convert image to monochrome bitmap.</li>
<li><strong>Grid Scaling</strong>: Convert drawing area (regardless of original size) to a fixed <strong>40x40</strong> grid.</li>
<li><strong>Bounding</strong>: Scale the ring shape itself to fit the 40x40 grid.</li>
</ol>
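<p>A minimal sketch of the preprocessing steps above, assuming simple thresholding for the monochrome conversion and nearest-neighbor sampling for the grid scaling (the paper does not specify either, and the bounding step is omitted here):</p>

```python
import numpy as np

def to_grid(image: np.ndarray, size: int = 40, threshold: int = 128) -> np.ndarray:
    """Binarize a grayscale drawing (ink = 1 on a light background) and
    scale it to a fixed size x size grid via nearest-neighbor sampling."""
    binary = (image < threshold).astype(np.uint8)
    h, w = binary.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return binary[np.ix_(rows, cols)]
```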
<h3 id="algorithms">Algorithms</h3>
<p><strong>The &ldquo;Lower Part with Half Size&rdquo; Pipeline</strong>:</p>
<ol>
<li><strong>Cut Point</strong>: A horizontal midline is defined; the algorithm separates the &ldquo;Upper Part&rdquo; and &ldquo;Lower Part&rdquo;.</li>
<li><strong>Phase 1 Input</strong>: The <strong>Upper Part</strong> (approximately the top 15 rows, scaled to the 40x15 classifier input) is fed to the Classifier NN to determine the class (S, N, O, or Others).</li>
<li><strong>Phase 2 Input</strong>:
<ul>
<li>For classes <strong>S, N, O</strong>: The <strong>Lower Part</strong> of the image is used.</li>
<li>For class <strong>Others</strong>: The <strong>Whole Ring</strong> is used.</li>
</ul>
</li>
<li><strong>Dimensionality Reduction</strong>: For the recognizer networks, only <strong>odd rows</strong> are used (effectively a 20x40 input grid) to reduce inputs from 1600 to 800.</li>
</ol>
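<p>The dispatch logic of the pipeline above can be sketched as follows. This is a structural sketch only: <code>classifier</code> and <code>recognizers</code> stand in for the trained networks, and the cut point at row 15 is assumed to match the 40x15 classifier input.</p>

```python
def recognize(grid, classifier, recognizers):
    """Two-phase Classifier-Recognizer dispatch.

    Phase 1 classifies the upper part into S/N/O/Others; Phase 2 runs the
    class-specific recognizer on the lower part (or the whole ring for
    'Others'), subsampled to every other row (the half-size grid).
    """
    upper, lower = grid[:15], grid[15:]
    ring_class = classifier(upper)          # one of 'S', 'N', 'O', 'Others'
    phase2_input = grid if ring_class == "Others" else lower
    return ring_class, recognizers[ring_class](phase2_input[::2])

# toy usage with stand-in "networks"
toy_grid = [[0] * 40 for _ in range(40)]
cls, label = recognize(toy_grid, lambda upper: "S",
                       {"S": lambda x: "thiophene"})
assert (cls, label) == ("S", "thiophene")
```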
<h3 id="models">Models</h3>
<p>The system uses multiple distinct feed-forward neural networks (backpropagation training is implied by the &ldquo;training&rdquo; and &ldquo;epochs&rdquo; terminology, though the paper never names the training algorithm explicitly):</p>
<ul>
<li><strong>Structure</strong>: 1 Classifier NN + 4 Recognizer NNs (one for each class).</li>
<li><strong>Hidden Layers</strong>: The preliminary &ldquo;ordinary method&rdquo; experiment used 1600 hidden units. The Classifier-Recognizer methods all used 50 hidden nodes per Table 2. The paper also notes that the ordinary approach tried various hidden layer sizes.</li>
<li><strong>Input Nodes</strong>:
<ul>
<li>Standard: 1600 (40x40).</li>
<li>Optimized: 800 (20x40 via half-grid).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Classifier Phase Testing Results (Table 3)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">150</td>
          <td style="text-align: left"><strong>100.00%</strong></td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">296</td>
          <td style="text-align: left"><strong>98.67%</strong></td>
          <td style="text-align: left">1.33%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">391</td>
          <td style="text-align: left"><strong>97.75%</strong></td>
          <td style="text-align: left">2.25%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">293</td>
          <td style="text-align: left"><strong>97.67%</strong></td>
          <td style="text-align: left">2.33%</td>
      </tr>
  </tbody>
</table>
<p><strong>Recognizer Phase Testing Results (Lower Part Image Recognizer with Half Size Grid, Table 4)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">147</td>
          <td style="text-align: left"><strong>98.00%</strong></td>
          <td style="text-align: left">2.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">289</td>
          <td style="text-align: left"><strong>96.33%</strong></td>
          <td style="text-align: left">3.67%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">386</td>
          <td style="text-align: left"><strong>96.50%</strong></td>
          <td style="text-align: left">3.50%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">279</td>
          <td style="text-align: left"><strong>93.00%</strong></td>
          <td style="text-align: left">7.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Overall</strong></td>
          <td style="text-align: left"><strong>1150</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
          <td style="text-align: left"><strong>~94.0%</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>No source code, trained models, or datasets were released with this paper. The handwritten ring samples were collected by the authors, and the software described (a desktop application) is not publicly available. The neural network architecture details (50 hidden nodes, 1000 iterations) and preprocessing pipeline are described in sufficient detail for reimplementation, but reproducing results would require collecting a new handwritten dataset of heterocyclic rings.</p>
<p><strong>Status</strong>: Closed (no public code, data, or models).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hewahi, N., Nounou, M. N., Nassar, M. S., Abu-Hamad, M. I., &amp; Abu-Hamad, H. I. (2008). Chemical Ring Handwritten Recognition Based on Neural Networks. <em>Ubiquitous Computing and Communication Journal</em>, 3(3).</p>
<p><strong>Publication</strong>: Ubiquitous Computing and Communication Journal 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{hewahiCHEMICALRINGHANDWRITTEN2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CHEMICAL RING HANDWRITTEN RECOGNITION BASED ON NEURAL NETWORKS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hewahi, Nabil and Nounou, Mohamed N and Nassar, Mohamed S and Abu-Hamad, Mohamed I and Abu-Hamad, Husam I}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2008}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Ubiquitous Computing and Communication Journal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Deep Learning for Molecular Structure Extraction (2019)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</guid><description>An end-to-end deep learning approach using U-Net segmentation and a CNN encoder with GridLSTM decoder to predict chemical structures from document images.</description><content:encoded><![CDATA[<h2 id="contribution-type-method-and-resource">Contribution Type: Method and Resource</h2>
<p>This is primarily a <strong>methodological</strong> paper with a secondary <strong>resource</strong> contribution.</p>
<p><strong>Method</strong>: It proposes a novel end-to-end deep learning architecture (Segmentation U-Net + Recognition Encoder-Decoder) to replace traditional rule-based optical chemical structure recognition (OCSR) systems.</p>
<p><strong>Resource</strong>: It details a pipeline for generating large-scale synthetic datasets (images overlaying patent/journal backgrounds) necessary to train the deep learning models.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Existing tools for extracting chemical structures from literature (e.g., OSRA, CLIDE) rely on complex, handcrafted rules and heuristics (edge detection, vectorization). These systems suffer from:</p>
<ol>
<li><strong>Brittleness</strong>: They fail when image quality is low (low resolution, noise) or when drawing styles vary (wavy bonds, crossing lines).</li>
<li><strong>Maintenance difficulty</strong>: Improvements require manually codifying new rules for every edge case, which is difficult to scale.</li>
<li><strong>Data volume</strong>: The explosion of published life science papers (2000+ per day in Medline) creates a need for automated, robust curation tools that humans cannot match.</li>
</ol>
<h2 id="core-innovation-end-to-end-pixel-to-smiles-recognition">Core Innovation: End-to-End Pixel-to-SMILES Recognition</h2>
<p>The authors present an <strong>end-to-end deep learning approach</strong> for this task that operates directly on raw pixels without explicit subcomponent recognition (e.g., detecting atoms and bonds separately). Key innovations include:</p>
<ol>
<li><strong>Pixel-to-SMILES</strong>: Treating structure recognition as an image captioning problem using an encoder-decoder architecture with attention, generating SMILES directly.</li>
<li><strong>Low-Resolution Robustness</strong>: The model is trained on aggressively downsampled images (~60 dpi for segmentation, 256x256 for prediction), making it robust to poor quality and noisy inputs from legacy PDF extractions.</li>
<li><strong>Implicit Superatom Handling</strong>: The model learns to recognize and generate sequences for superatoms (e.g., &ldquo;OTBS&rdquo;) contextually.</li>
</ol>
<h2 id="experimental-setup-and-large-scale-synthetic-data">Experimental Setup and Large-Scale Synthetic Data</h2>
<p>The authors validated their approach using a mix of large-scale synthetic training sets and real-world test sets:</p>
<ol>
<li><strong>Synthetic Generation</strong>: They created a segmentation dataset by overlaying USPTO molecules onto &ldquo;whited-out&rdquo; journal pages.</li>
<li><strong>Ablation/Training</strong>: Metrics were tracked on Indigo (synthetic) and USPTO (real patent images) datasets.</li>
<li><strong>External Validation</strong>:
<ul>
<li><strong>Valko Dataset</strong>: A standard benchmark of 454 heterogeneous images from literature.</li>
<li><strong>Proprietary Dataset</strong>: A collection of images from 47 articles and 5 patents to simulate real-world drug discovery curation.</li>
</ul>
</li>
<li><strong>Stress Testing</strong>: They analyzed performance distributions across molecular weight, heavy atom count, and rare elements (e.g., Uranium, Vanadium).</li>
</ol>
<h2 id="results-and-limitations-in-complex-structures">Results and Limitations in Complex Structures</h2>
<ul>
<li><strong>High Accuracy on Standard Sets</strong>: The model achieved <strong>82% accuracy</strong> on the Indigo validation set and <strong>77%</strong> on the USPTO validation set. No apparent overfitting was observed on the Indigo data (57M training examples), though some overfitting occurred on the smaller USPTO set (1.7M training examples).</li>
<li><strong>Real-World Viability</strong>: It achieved <strong>83% accuracy</strong> on the proprietary internal test set, with validation and proprietary accuracies ranging from 77-83%, indicating the training sets reasonably approximate real drug discovery data.</li>
<li><strong>Segmentation Quality</strong>: Low segmentation error rates were observed: only 3.3% of the Valko dataset and 6.6% of the proprietary images failed to segment properly.</li>
<li><strong>Limitations on Complexity</strong>: Performance dropped to <strong>41% on the Valko test set</strong>. Superatoms were the single largest contributor to prediction errors, with 21% of Valko samples containing one or more incorrectly predicted superatoms. Only 6.6% of total training images contained any superatom, limiting the model&rsquo;s exposure.</li>
<li><strong>Stereochemistry Challenges</strong>: 60% of compounds with incorrectly predicted stereochemistry had explicit stereochemistry in both the ground truth and the prediction, but with wrong configurations assigned (e.g., predicting R instead of S). The model often correctly identified which atoms have stereocenters but assigned the wrong direction, suggesting the architecture may not incorporate sufficient spatial context for configuration assignment.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized three primary sources for generating training data. All inputs were strictly downsampled to improve robustness.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>Indigo Set</strong></td>
          <td>57M</td>
          <td>PubChem molecules rendered via Indigo (256x256).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO Set</strong></td>
          <td>1.7M</td>
          <td>Image/SMILES pairs from public patent data.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>OS X Indigo</strong></td>
          <td>10M</td>
          <td>Additional Indigo renders from Mac OS for style diversity.</td>
      </tr>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td><strong>Synthetic Pages</strong></td>
          <td>N/A</td>
          <td>Generated by overlaying USPTO images on text-cleared PDF pages.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Segmentation Inputs</strong>: Grayscale, downsampled to ~60 dpi.</li>
<li><strong>Prediction Inputs</strong>: Resized to 256x256 such that bond lengths are approximately 3-12 pixels.</li>
<li><strong>Augmentation</strong>: Random affine transforms, brightness scaling, and binarization applied during training.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Segmentation Pipeline</strong>:</p>
<ul>
<li><strong>Multi-scale Inference</strong>: Masks generated at resolutions from 30 to 60 dpi (3 dpi increments) and averaged for the final mask.</li>
<li><strong>Post-processing</strong>: Hough transform used to remove long straight lines (table borders). Mask blobs filtered by pixel count thresholds.</li>
</ul>
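<p>The multi-scale averaging step might look like the following sketch. Here <code>predict_mask</code> stands in for U-Net inference at a given dpi and is assumed to return a mask already resampled to the page's shape; the 0.5 threshold is an assumption, as the paper only describes averaging the masks.</p>

```python
import numpy as np

def multiscale_mask(predict_mask, page, dpis=range(30, 61, 3), threshold=0.5):
    """Average segmentation masks predicted at several resolutions
    (30-60 dpi in 3-dpi increments) and binarize the result."""
    masks = [predict_mask(page, dpi) for dpi in dpis]
    return (np.mean(masks, axis=0) >= threshold).astype(np.uint8)
```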
<p><strong>Prediction Pipeline</strong>:</p>
<ul>
<li><strong>Sequence Generation</strong>: SMILES generated character-by-character via greedy decoding. During inference, predictions are made at several low resolutions and the sequence with the highest confidence (product of per-character softmax outputs) is returned.</li>
<li><strong>Attention-based Verification</strong>: Attention weights used to re-project predicted atoms back into 2D space to visually verify alignment with the input image.</li>
</ul>
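<p>The confidence-based selection across resolutions can be sketched as below. The product of per-character softmax probabilities is computed in log space to avoid underflow; the <code>(smiles, per_char_probs)</code> pair format is an assumption for illustration.</p>

```python
import math

def best_prediction(candidates):
    """Return the decoded SMILES whose product of per-character softmax
    probabilities (sequence confidence) is highest, selecting among the
    predictions made at several inference resolutions."""
    def log_confidence(probs):
        return sum(math.log(p) for p in probs)
    return max(candidates, key=lambda c: log_confidence(c[1]))[0]
```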
<h3 id="models">Models</h3>
<p><strong>1. Segmentation Model (U-Net Variant)</strong>:</p>
<ul>
<li><strong>Architecture</strong>: U-Net style with skip connections.</li>
<li><strong>Input</strong>: 128x128x1 grayscale image.</li>
<li><strong>Layers</strong>: Alternating 3x3 Conv and 2x2 Max Pool.</li>
<li><strong>Activation</strong>: Parametric ReLU (pReLU).</li>
<li><strong>Parameters</strong>: ~380,000.</li>
</ul>
<p><strong>2. Structure Prediction Model (Encoder-Decoder)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: CNN with 5x5 convolutions, 2x2 Max Pooling, pReLU. No pooling in first layers to preserve fine features.</li>
<li><strong>Decoder</strong>: 3 layers of <strong>GridLSTM</strong> cells.</li>
<li><strong>Attention</strong>: Soft/Global attention mechanism conditioned on the encoder state.</li>
<li><strong>Input</strong>: 256x256x1 image.</li>
<li><strong>Output</strong>: Sequence of characters (vocab size 65).</li>
<li><strong>Parameters</strong>: ~46.3 million.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation required an exact string match of the Canonical SMILES (including stereochemistry) to the ground truth.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Dataset</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td><strong>82%</strong></td>
          <td>Indigo Val</td>
          <td>Synthetic validation set</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>77%</strong></td>
          <td>USPTO Val</td>
          <td>Real patent images</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>83%</strong></td>
          <td>Proprietary</td>
          <td>Internal pharma dataset (real world)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>41%</strong></td>
          <td>Valko Test</td>
          <td>External benchmark; difficult due to superatoms</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Segmentation Training</strong>: 1 GPU, ~4 days (650k steps).</li>
<li><strong>Prediction Training</strong>: 8 NVIDIA Pascal GPUs, ~26 days (1M steps).</li>
<li><strong>Framework</strong>: TensorFlow.</li>
<li><strong>Optimizer</strong>: Adam.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code, pre-trained models, or generated datasets were released with this paper. The training pipeline relies on publicly available molecular databases (PubChem, USPTO) and open-source rendering tools (Indigo), but the specific training sets, model weights, and inference code remain unavailable.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Staker, J., Marshall, K., Abel, R., &amp; McQuaw, C. (2019). Molecular Structure Extraction From Documents Using Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 59(3), 1017-1029. <a href="https://doi.org/10.1021/acs.jcim.8b00669">https://doi.org/10.1021/acs.jcim.8b00669</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.schrodinger.com/publications/">Schrödinger Publication Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{stakerMolecularStructureExtraction2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Structure Extraction From Documents Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Staker, Joshua and Marshall, Kyle and Abel, Robert and McQuaw, Carolyn}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{feb}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{59}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1017--1029}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.8b00669}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1021/acs.jcim.8b00669}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER: Deep Learning for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/</guid><description>Deep learning method for optical chemical structure recognition using image captioning networks trained on millions of synthetic molecular images.</description><content:encoded><![CDATA[<h2 id="contribution-method-for-optical-chemical-entity-recognition">Contribution: Method for Optical Chemical Entity Recognition</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper with a strong <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (DECIMER) that repurposes &ldquo;show-and-tell&rdquo; image captioning networks for Optical Chemical Entity Recognition (OCER), providing an alternative to traditional rule-based segmentation pipelines.</li>
<li><strong>Resource</strong>: It establishes a framework for generating large-scale synthetic training data using open-source cheminformatics tools (CDK) and databases (PubChem), circumventing the scarcity of manually annotated chemical images.</li>
</ul>
<h2 id="motivation-brittleness-of-heuristic-pipelines">Motivation: Brittleness of Heuristic Pipelines</h2>
<p>The extraction of chemical structures from scientific literature (OCER) is critical for populating open-access databases. Traditional OCER systems (like OSRA or CLiDE) rely on complex multi-step pipelines involving vectorization, character recognition, and graph compilation. These systems are brittle and incorporating new structural features requires laborious engineering. Inspired by the success of deep neural network approaches like AlphaGo Zero, the authors sought to formulate an end-to-end deep learning approach that learns directly from data with minimal prior assumptions.</p>
<h2 id="novelty-image-captioning-for-molecular-graphs">Novelty: Image Captioning for Molecular Graphs</h2>
<ul>
<li><strong>Image-to-Text Formulation</strong>: The paper frames chemical structure recognition as an image captioning problem, translating a bitmap image directly into a SMILES string using an encoder-decoder network. This bypasses explicit segmentation of atoms and bonds entirely.</li>
<li><strong>Synthetic Data Strategy</strong>: The authors generate synthetic images from PubChem using the CDK Structure Diagram Generator, scaling the dataset size to 15 million.</li>
<li><strong>Robust String Representations</strong>: The study performs key ablation experiments on string representations, comparing standard SMILES against DeepSMILES to evaluate how syntactic validity affects the network&rsquo;s learning capability.</li>
</ul>
<h2 id="experimental-setup-and-validation-strategies">Experimental Setup and Validation Strategies</h2>
<ul>
<li><strong>Data Scaling</strong>: Models were trained on dataset sizes ranging from 54,000 to 15 million synthetic images to observe empirical scaling laws regarding accuracy and compute time.</li>
<li><strong>Representation Comparison</strong>: The authors compared the validity of predicted strings and recognition accuracy when training on SMILES versus DeepSMILES. The cross-entropy loss formulation for sequence generation can be represented as:
$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$
where $\mathbf{x}$ is the image representation and $y_t$ are the tokens of the SMILES/DeepSMILES string.</li>
<li><strong>Metric Evaluation</strong>: Performance was measured using Validity (syntactic correctness) and Tanimoto Similarity $T$, computed on molecular fingerprints to capture partial correctness even if the exact string prediction failed:
$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</li>
</ul>
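<p>The Tanimoto formula above, applied to fingerprints represented as sets of on-bits, is a few lines of code (the value for two empty fingerprints is a convention, not something the paper specifies):</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets:
    |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    if not a and not b:
        return 1.0  # convention for two empty fingerprints (an assumption)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)
```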
<h2 id="results-and-critical-conclusions">Results and Critical Conclusions</h2>
<ul>
<li><strong>Data Representation</strong>: DeepSMILES proved superior to standard SMILES for training stability and output validity. Preliminary tests suggested SELFIES performs even better (0.78 Tanimoto vs 0.53 for DeepSMILES at 6M images).</li>
<li><strong>Scaling Behavior</strong>: Accuracy improves linearly with dataset size. The authors extrapolate that near-perfect detection would require training on 50 to 100 million structures.</li>
<li><strong>Current Limitations</strong>: At the reported training scale (up to 15M), the model does not yet rival traditional heuristic approaches, but the learning curve suggests it is a viable trajectory given sufficient compute and data.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is synthetic, generated using the Chemistry Development Kit (CDK) Structure Diagram Generator (SDG) based on molecules from PubChem.</p>
<p><strong>Curation Rules</strong> (applied to PubChem data):</p>
<ul>
<li>Molecular weight &lt; 1500 Daltons.</li>
<li>Elements restricted to: C, H, O, N, P, S, F, Cl, Br, I, Se, B.</li>
<li>No counter ions or charged groups.</li>
<li>No isotopes (e.g., D, T).</li>
<li>Bond count between 5 and 40.</li>
<li>SMILES length &lt; 40 characters.</li>
<li>Implicit hydrogens only (except in functional groups).</li>
</ul>
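<p>The curation rules can be sketched as a simple filter. This is an illustrative reconstruction, not code from the DECIMER repository: the record fields and the <code>passes_curation</code> helper are hypothetical stand-ins for whatever representation the CDK pipeline uses.</p>

```python
# Elements permitted by the curation rules above
ALLOWED = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_curation(mol: dict) -> bool:
    """Apply the paper's curation rules to a (hypothetical) molecule record."""
    return (
        mol["mol_weight"] < 1500                 # < 1500 Daltons
        and set(mol["elements"]) <= ALLOWED      # restricted element set
        and not mol["has_charge"]                # no counter ions / charges
        and not mol["has_isotope"]               # no isotopes (D, T, ...)
        and 5 <= mol["bond_count"] <= 40         # bond count between 5 and 40
        and len(mol["smiles"]) < 40              # SMILES length < 40 chars
    )

caffeine = {"mol_weight": 194.19, "elements": ["C", "H", "N", "O"],
            "has_charge": False, "has_isotope": False,
            "bond_count": 15, "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"}
print(passes_curation(caffeine))  # True
```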
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Images</strong>: Generated as 299x299 bitmaps to match Inception V3 input requirements.</li>
<li><strong>Augmentation</strong>: One random rotation applied per molecule; no noise or blurring added in this iteration.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem)</td>
          <td>54k - 15M</td>
          <td>Scaled across 12 experiments</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Independent Set</td>
          <td>6k - 1.6M</td>
          <td>10% of training size</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: <code>&quot;Show, Attend and Tell&quot;</code> (Attention-based Image Captioning).</li>
<li><strong>Optimization</strong>: Adam optimizer with learning rate 0.0005.</li>
<li><strong>Loss Function</strong>: Sparse Categorical Crossentropy.</li>
<li><strong>Training Loop</strong>: Trained for 25 epochs per model. Batch size of 640 images.</li>
</ul>
<h3 id="models">Models</h3>
<p>The network is implemented in TensorFlow 2.0.</p>
<ul>
<li><strong>Encoder</strong>: Inception V3 (Convolutional NN), used unaltered. Extracts feature vectors saved as NumPy arrays.</li>
<li><strong>Decoder</strong>: Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) with soft attention mechanism.</li>
<li><strong>Embeddings</strong>: Image embedding dimension size of 600.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is Tanimoto similarity (Jaccard index) on PubChem fingerprints, which is robust for measuring structural similarity even when exact identity is not reached.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td>Percentage of predictions that are chemically identical to ground truth (isomorphic).</td>
      </tr>
      <tr>
          <td><strong>Average Tanimoto</strong></td>
          <td>Mean similarity score across the test set (captures partial correctness).</td>
      </tr>
      <tr>
          <td><strong>Validity</strong></td>
          <td>Percentage of predicted strings that are valid DeepSMILES/SMILES.</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER">DECIMER (Java utilities)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>CDK-based data generation and conversion tools</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image-to-SMILES">DECIMER-Image-to-SMILES</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>TensorFlow training and inference scripts (archived)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source of molecular structures for synthetic training data</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a single node.</p>
<ul>
<li><strong>GPU</strong>: 1x NVIDIA Tesla V100.</li>
<li><strong>CPU</strong>: 2x Intel Xeon Gold 6230.</li>
<li><strong>RAM</strong>: 384 GB.</li>
<li><strong>Compute Time</strong>:
<ul>
<li>Linear scaling with data size.</li>
<li>15 million structures took ~27 days (91,881s per epoch).</li>
<li>Projected time for 100M structures: ~4 months on a single GPU.</li>
</ul>
</li>
</ul>
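<p>The reported per-epoch time is consistent with the quoted total for the 15M-structure run, as a quick back-of-the-envelope check shows:</p>

```python
# Sanity check on the reported timings: 91,881 s/epoch over 25 epochs
seconds = 91_881 * 25
days = seconds / 86_400   # seconds per day
print(f"{days:.1f} days")  # ≈ 26.6 days, matching the "~27 days" figure
```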
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2020). DECIMER: towards deep learning for chemical image recognition. <em>Journal of Cheminformatics</em>, 12(1), 65. <a href="https://doi.org/10.1186/s13321-020-00469-w">https://doi.org/10.1186/s13321-020-00469-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER">Official GitHub Repository</a></li>
<li><a href="https://github.com/Kohulan/DECIMER-Image-to-SMILES">DECIMER Image-to-SMILES Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMERDeepLearning2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{DECIMER}}: Towards Deep Learning for Chemical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{DECIMER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{65}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00469-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGrapher: Deep Learning for Chemical Graph OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</guid><description>Deep learning OCSR method using semantic segmentation and classification CNNs to reconstruct chemical graphs with improved stereochemistry.</description><content:encoded><![CDATA[<h2 id="classifying-the-methodology">Classifying the Methodology</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture and a specific graph-reconstruction algorithm to solve the problem of Optical Chemical Structure Recognition (OCSR). It validates this method by comparing it against the existing standard tool (OSRA), demonstrating superior performance on specific technical challenges like stereochemistry.</p>
<h2 id="the-ocr-stereochemistry-challenge">The OCR Stereochemistry Challenge</h2>
<p>Chemical knowledge is frequently locked in static images within scientific publications. Extracting this structure into machine-readable formats (graphs, SMILES) is essential for drug discovery and database querying. Existing tools, such as OSRA, rely on optical character recognition (OCR) and expert systems or hand-coded rules. These tools struggle with bond multiplicity and stereochemical information, often missing atoms or misinterpreting 3D cues (wedges and dashes). A machine learning approach allows for improvement via data scaling.</p>
<h2 id="decoupled-semantic-segmentation-and-classification-pipeline">Decoupled Semantic Segmentation and Classification Pipeline</h2>
<p>The core novelty is the <strong>segmentation-classification pipeline</strong> which decouples object detection from type assignment:</p>
<ol>
<li><strong>Semantic Segmentation</strong>: The model first predicts pixel-wise maps for atoms, bonds, and charges using a Dense Prediction Convolutional Network built on dilated convolutions.</li>
<li><strong>Graph Building Algorithm</strong>: A specific algorithm iterates over the segmentation maps to generate candidate locations for atoms and bonds.</li>
<li><strong>Refinement via Classification</strong>: Dedicated classification networks take cutouts of the original image combined with the segmentation mask to verify and classify each candidate (e.g., distinguishing a single bond from a double bond, or a wedge from a dash).</li>
</ol>
<p>Additionally, the authors developed a novel method for <strong>synthetic data generation</strong> by modifying the source code of RDKit to output pixel-wise labels during the image drawing process. This solves the lack of labeled training data.</p>
<h2 id="evaluating-synthetics-and-benchmarks">Evaluating Synthetics and Benchmarks</h2>
<ul>
<li><strong>Synthetic Benchmarking</strong>: The authors generated test sets in 3 different stylistic variations. For each style, they tested on both stereo (complex 3D information) and non-stereo compounds.</li>
<li><strong>Baseline Comparison</strong>: They compared the error rates of ChemGrapher against <strong>OSRA</strong> (Optical Structure Recognition Application).</li>
<li><strong>Component-level Evaluation</strong>: They analyzed the F1 scores of the segmentation networks versus the classification networks independently to understand where errors propagated.</li>
<li><strong>Real-world Case Study</strong>: They manually curated 61 images cut from journal articles to test performance on real, non-synthetic data.</li>
</ul>
<h2 id="advancements-over-osra">Advancements Over OSRA</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ChemGrapher consistently achieved lower error rates than OSRA across all synthetic styles, particularly for stereochemical information (wedge and dash bonds).</li>
<li><strong>Component Performance</strong>: The classification networks showed higher F1 scores than the segmentation networks across all prediction types (Figure 4 in the paper). This suggests the two-stage approach allows the classifier to correct segmentation noise.</li>
<li><strong>Real-world Viability</strong>: In the manual case study, ChemGrapher correctly predicted 46 of 61 images, compared to 42 of 61 for OSRA.</li>
<li><strong>Limitations</strong>: The model struggles with thick bond lines in real-world images. Performance is stronger on carbon-only compounds, where no element letters appear in the image.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors created a custom synthetic dataset using ChEMBL and RDKit, as no pixel-wise labeled dataset existed.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>ChEMBL</td>
          <td>1.9M</td>
          <td>Split into training pool (1.5M), val/train pool (300K), and test pools (35K each).</td>
      </tr>
      <tr>
          <td><strong>Segmentation Train</strong></td>
          <td>Synthetic</td>
          <td>~114K</td>
          <td>Sampled from ChEMBL pool such that every atom type appears in &gt;1000 compounds.</td>
      </tr>
      <tr>
          <td><strong>Labels</strong></td>
          <td>Pixel-wise</td>
          <td>N/A</td>
          <td>Generated by modifying <strong>RDKit</strong> source code to output label masks (atom type, bond type, charge) during drawing.</td>
      </tr>
      <tr>
          <td><strong>Candidates (Val)</strong></td>
          <td>Cutouts</td>
          <td>~27K (Atom)<br>~55K (Bond)</td>
          <td>Validation candidates generated from ~450 compounds for evaluating the classification networks.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Algorithm 1: Graph Building</strong></p>
<ol>
<li><strong>Segment</strong>: Apply segmentation network $s(x)$ to get maps $S^a$ (atoms), $S^b$ (bonds), $S^c$ (charges).</li>
<li><strong>Atom Candidates</strong>: Identify candidate blobs in $S^a$.</li>
<li><strong>Classify Atoms</strong>: For each candidate, crop the input image and segmentation map. Feed to $c_A$ and $c_C$ to predict Atom Type and Charge. Add to Vertex set $V$ if valid.</li>
<li><strong>Bond Candidates</strong>: Generate all pairs of nodes in $V$ within $2 \times$ bond length distance.</li>
<li><strong>Classify Bonds</strong>: For each pair, create a candidate mask (two rectangles meeting in the middle to encode directionality). Feed to $c_B$ to predict Bond Type (single, double, wedge, etc.). Add to Edge set $E$.</li>
</ol>
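<p>Steps 4&ndash;5 of the graph-building algorithm can be sketched as follows. This is a hedged illustration, not the authors' code: the coordinates and the <code>bond_length</code> estimate are made up, and in the real pipeline each surviving pair is handed to the bond classification network $c_B$.</p>

```python
import itertools
import math

# Classified atom centers (vertex set V); coordinates are illustrative
atoms = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.9, 0.0), 3: (5.0, 5.0)}
bond_length = 1.0  # typical bond length estimated from the image

def bond_candidates(atoms, bond_length):
    """Propose bond candidates: all atom pairs within 2x the bond length."""
    pairs = []
    for (i, p), (j, q) in itertools.combinations(atoms.items(), 2):
        if math.dist(p, q) <= 2 * bond_length:
            pairs.append((i, j))  # later verified by the bond classifier c_B
    return pairs

print(bond_candidates(atoms, bond_length))  # [(0, 1), (0, 2), (1, 2)]
```

<p>Note how atom 3, far from the rest, generates no candidates; pruning by distance keeps the classifier from scoring every one of the $O(|V|^2)$ pairs.</p>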
<h3 id="models">Models</h3>
<p>The pipeline uses four distinct Convolutional Neural Networks (CNNs).</p>
<p><strong>1. Semantic Segmentation Network ($s$)</strong></p>
<ul>
<li><strong>Architecture</strong>: 8 convolutional layers (3x3) plus a final 1x1 linear layer (Dense Prediction Convolutional Network).</li>
<li><strong>Kernels</strong>: $3 \times 3$ for all convolutional layers; $1 \times 1$ for the final linear layer.</li>
<li><strong>Dilation</strong>: Uses dilated convolutions to expand receptive field without losing resolution. Six of the eight convolutional layers use dilation (factors: 2, 4, 8, 8, 4, 2); the first and last convolutional layers have no dilation.</li>
<li><strong>Input</strong>: Binary B/W image.</li>
<li><strong>Output</strong>: Multi-channel probability maps for Atom Types ($S^a$), Bond Types ($S^b$), and Charges ($S^c$).</li>
</ul>
<p><strong>2. Classification Networks ($c_A, c_B, c_C$)</strong></p>
<ul>
<li><strong>Purpose</strong>: Refines predictions on small image patches.</li>
<li><strong>Architecture</strong>: 5 convolutional layers, followed by a MaxPool layer and a final linear (1x1) layer.
<ul>
<li>Layer 1: <strong>Depthwise separable convolution</strong> (no dilation).</li>
<li>Layers 2-4: Dilated convolutions (factors 2, 4, 8).</li>
<li>Layer 5: Standard convolution (no dilation).</li>
<li>MaxPool: $124 \times 124$.</li>
<li>Final: 1x1 linear layer.</li>
</ul>
</li>
<li><strong>Inputs</strong>:
<ul>
<li>Crop of the binary image ($x^{cut}$).</li>
<li>Crop of the segmentation map ($S^{cut}$).</li>
<li>&ldquo;Highlight&rdquo; mask ($h_L$) indicating the specific candidate location (e.g., a dot for atoms, two rectangles for bonds).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: <strong>F1 Score</strong> for individual network performance (segmentation pixels and classification accuracy).</li>
<li><strong>Metric</strong>: <strong>Error Rate</strong> (percentage of incorrect graphs) for overall system. A graph is &ldquo;incorrect&rdquo; if there is at least one mistake in atoms or bonds.</li>
<li><strong>Baselines</strong>: Compared against <strong>OSRA</strong>.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Training and inference performed on a single <strong>NVIDIA Titan Xp</strong> (donated by NVIDIA).</li>
</ul>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Closed.</strong> The authors did not release source code, pre-trained models, or the synthetic dataset. The data generation pipeline requires modifications to RDKit&rsquo;s internal drawing code, which are not publicly available. The ChEMBL source compounds are public, but the pixel-wise labeling procedure cannot be reproduced without the modified RDKit code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., Arany, Á., Moreau, Y., &amp; Simm, J. (2020). ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 60(10), 4506-4517. <a href="https://doi.org/10.1021/acs.jcim.0c00459">https://doi.org/10.1021/acs.jcim.0c00459</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2020 (arXiv preprint Feb 2020)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2002.09914">arXiv Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oldenhof2020chemgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Oldenhof, Martijn and Arany, Ádám and Moreau, Yves and Simm, Jaak}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4506--4517}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c00459}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Research on Chemical Expression Images Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</guid><description>A 2015 methodology for Optical Chemical Structure Recognition (OCSR) focusing on improved handling of adhesive symbols and wedge bonds.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hong, C., Du, X., &amp; Zhang, L. (2015). Research on Chemical Expression Images Recognition. <em>Proceedings of the 2015 Joint International Mechanical, Electronic and Information Technology Conference</em>, 267-271. <a href="https://doi.org/10.2991/jimet-15.2015.50">https://doi.org/10.2991/jimet-15.2015.50</a></p>
<p><strong>Publication</strong>: JIMET 2015 (Atlantis Press)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://jsme-editor.github.io/">JSME Editor (used for visualization)</a></li>
</ul>
<h2 id="contribution-new-ocsr-workflow-for-adhesion-and-wedge-bonds">Contribution: New OCSR Workflow for Adhesion and Wedge Bonds</h2>
<p><strong>Method</strong>. The paper proposes a novel algorithmic pipeline (OCSR) for recognizing 2D organic chemical structures from images. It validates this method by comparing it against an existing tool (OSRA) using a quantitative metric (Tanimoto Coefficient) on a test set of 200 images.</p>
<h2 id="motivation-challenges-with-connecting-symbols-and-stereochemistry">Motivation: Challenges with Connecting Symbols and Stereochemistry</h2>
<p>A vast amount of chemical structural information exists in scientific literature (PDFs/images) that is not machine-readable. Manually converting these images to formats like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or CML is labor-intensive. Existing tools face challenges with:</p>
<ol>
<li><strong>Adhesion</strong>: Poor separation when chemical symbols touch or overlap with bonds.</li>
<li><strong>Stereochemistry</strong>: Incomplete identification of &ldquo;real&rdquo; (solid) and &ldquo;virtual&rdquo; (dashed/hashed) wedge bonds.</li>
</ol>
<h2 id="core-innovation-vector-based-separation-and-stereochemical-logic">Core Innovation: Vector-Based Separation and Stereochemical Logic</h2>
<p>The authors propose a specific <strong>OCSR (Optical Chemical Structure Recognition)</strong> workflow with two key technical improvements:</p>
<ol>
<li><strong>Vector-based Separation</strong>: The method vectorizes the image (using Potrace) to extract straight lines and curves, allowing better separation of &ldquo;adhesive&rdquo; chemical symbols (like H, N, O attached to bonds).</li>
<li><strong>Stereochemical Logic</strong>: Specific rules for identifying wedge bonds:
<ul>
<li><strong>Virtual (Dashed) Wedges</strong>: Identified by grouping connected domains and checking linear correlation of their center points.</li>
<li><strong>Real (Solid) Wedges</strong>: Identified after thinning by analyzing linear correlation and width variance of line segments.</li>
</ul>
</li>
</ol>
<h2 id="methodology--experimental-setup">Methodology &amp; Experimental Setup</h2>
<ul>
<li>
<p><strong>Dataset</strong>: 200 chemical structure images collected from the network.</p>
</li>
<li>
<p><strong>Baselines</strong>: Compared against <strong>OSRA</strong> (Optical Structure Recognition Application), a free online tool.</p>
</li>
<li>
<p><strong>Metric</strong>: <strong>Tanimoto Coefficient</strong>, measuring the similarity of the set of recognized bonds and symbols against the ground truth. The similarity $T(A, B)$ is defined as:</p>
<p>$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</p>
</li>
</ul>
<h2 id="results--conclusions">Results &amp; Conclusions</h2>
<ul>
<li><strong>Performance</strong>: The proposed OCSR method achieved higher recognition rates than OSRA.
<ul>
<li><strong>Exact Match (100%)</strong>: OCSR achieved 90.0% vs. OSRA&rsquo;s 82.2%.</li>
<li><strong>High Similarity (&gt;85%)</strong>: OCSR recognized 157 structures vs. OSRA&rsquo;s 114.</li>
</ul>
</li>
<li><strong>Limitations</strong>: The paper notes that &ldquo;real wedge&rdquo; and &ldquo;virtual wedge&rdquo; identification was a primary focus, but general recognition effectiveness still &ldquo;has room for improvement&rdquo;.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom collection of images, not a standard benchmark.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Web-crawled chemical images</td>
          <td>200 structures</td>
          <td>Images containing 2D organic structures; specific source URLs not provided.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows these specific steps:</p>
<ol>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Grayscale</strong>: via <code>cvCvtColor</code> (OpenCV).</li>
<li><strong>Binarization</strong>: via Otsu&rsquo;s method.</li>
</ul>
</li>
<li><strong>Isolated Symbol Removal</strong>:
<ul>
<li>Identifies connected domains with aspect ratios in <code>[0.8, 3.0]</code>.</li>
<li>Recognizes them using OCR (GOCR, OCRAD, Tesseract) and removes them from the image.</li>
</ul>
</li>
<li><strong>Virtual Wedge Recognition</strong>:
<ul>
<li>Groups small connected domains (points/clumps).</li>
<li>Calculates linear correlation of center points; if collinear, treats as a dashed bond.</li>
</ul>
</li>
<li><strong>Vectorization &amp; Thinning</strong>:
<ul>
<li><strong>Thinning</strong>: Rosenfeld algorithm (optimized) to reduce lines to single pixel width.</li>
<li><strong>Vectorization</strong>: Uses <strong>Potrace</strong> to convert pixels to vector segments.</li>
<li><strong>Merging</strong>: Combines split vector segments based on angle thresholds to form long straight lines.</li>
</ul>
</li>
<li><strong>Adhesive Symbol Separation</strong>:
<ul>
<li>Identifies curves (short segments after vectorization) attached to long lines.</li>
<li>Separates these domains and re-runs OCR.</li>
</ul>
</li>
<li><strong>&ldquo;Super Atom&rdquo; Merging</strong>:
<ul>
<li>Merges adjacent vertical/horizontal symbols (e.g., &ldquo;HO&rdquo;, &ldquo;CH3&rdquo;) based on distance thresholds between bounding boxes.</li>
</ul>
</li>
</ol>
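<p>The "virtual wedge" step (step 3) hinges on a collinearity test over the center points of small connected domains. The paper does not give its exact formulation; a plausible proxy, shown here as an assumed sketch, is Pearson's correlation over the center coordinates (a production version would also need to handle near-vertical dashes, where the $y$-variance dominates):</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation of paired coordinates; |r| near 1 => collinear."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Center points of four small dash marks (illustrative values)
centers_x = [0.0, 1.0, 2.0, 3.0]
centers_y = [0.0, 1.02, 1.98, 3.01]
print(abs(pearson(centers_x, centers_y)) > 0.99)  # True -> one dashed bond
```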
<h3 id="models">Models</h3>
<p>The system relies on off-the-shelf OCR tools for character recognition; no custom ML models were trained.</p>
<ul>
<li><strong>OCR Engines</strong>: GOCR, OCRAD, Tesseract.</li>
<li><strong>Visualization</strong>: JSME (JavaScript Molecule Editor) used to render output strings.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (OCSR)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Exact Match (100%)</td>
          <td><strong>90.0%</strong></td>
          <td>82.2%</td>
          <td>Percentage of 200 images perfectly recognized.</td>
      </tr>
      <tr>
          <td>&gt;95% Similarity</td>
          <td><strong>95 images</strong></td>
          <td>71 images</td>
          <td>Count of images with Tanimoto &gt; 0.95.</td>
      </tr>
      <tr>
          <td>&gt;85% Similarity</td>
          <td><strong>157 images</strong></td>
          <td>114 images</td>
          <td>Count of images with Tanimoto &gt; 0.85.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; runs on standard CPU architecture (implied by use of standard libraries like OpenCV and Potrace).</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{hongResearchChemicalExpression2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Research on {{Chemical Expression Images Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2015 {{Joint International Mechanical}}, {{Electronic}} and {{Information Technology Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hong, Chen and Du, Xiaoping and Zhang, Lu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Atlantis Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Chongqing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2991/jimet-15.2015.50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-94-6252-129-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Probabilistic OCSR with Markov Logic Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</guid><description>A probabilistic approach using Markov Logic Networks to recognize chemical structures from images, improving robustness over rule-based systems.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Frasconi, P., Gabbrielli, F., Lippi, M., &amp; Marinai, S. (2014). Markov Logic Networks for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 54(8), 2380-2390. <a href="https://doi.org/10.1021/ci5002197">https://doi.org/10.1021/ci5002197</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2014</p>
<h2 id="contribution-probabilistic-method-for-ocsr">Contribution: Probabilistic Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a novel algorithmic architecture (<strong>MLOCSR</strong>) that integrates low-level pattern recognition with a high-level probabilistic reasoning engine based on Markov Logic Networks (MLNs). While it contributes to resources by creating a clustered dataset for evaluation, the primary focus is on demonstrating that probabilistic inference offers a superior methodology to the deterministic, rule-based heuristics employed by previous state-of-the-art systems like OSRA and CLiDE.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for converting the vast archive of chemical literature (bitmap images in patents and papers) into machine-readable formats.</p>
<ul>
<li><strong>Limitation of Prior Work</strong>: Existing systems (OSRA, CLiDE, ChemReader) rely on &ldquo;empirical hard-coded geometrical rules&rdquo; to assemble atoms and bonds. These heuristics are brittle, requiring manual tuning of parameters for different image resolutions and failing when images are degraded or noisy.</li>
<li><strong>Gap</strong>: Chemical knowledge is typically used only in post-processing (e.g., to fix valency errors).</li>
<li><strong>Goal</strong>: To create a resolution-independent system that uses probabilistic reasoning to handle noise and ambiguity in graphical primitives.</li>
</ul>
<h2 id="core-innovation-markov-logic-networks-for-diagram-interpretation">Core Innovation: Markov Logic Networks for Diagram Interpretation</h2>
<p>The core novelty is the application of <strong>Markov Logic Networks (MLNs)</strong> to the problem of diagram interpretation.</p>
<ul>
<li><strong>Probabilistic Reasoning</strong>: The system treats extracted visual elements (lines, text boxes) as &ldquo;evidence&rdquo; and uses weighted first-order logic formulas to infer the most likely molecular graph (Maximum A Posteriori inference). The probability of a state $x$ is defined by the MLN log-linear model:
$$ P(X=x) = \frac{1}{Z} \exp\left(\sum_{i} w_i n_i(x)\right) $$
where $w_i$ is the weight of the $i$-th formula and $n_i(x)$ is the number of true groundings in $x$.</li>
<li><strong>Unified Knowledge Representation</strong>: Geometric constraints (e.g., collinearity) and chemical rules (e.g., valency) are encoded in the same logic framework.</li>
<li><strong>Resolution Independence</strong>: The low-level extraction module dynamically estimates character size ($T$) and stroke width ($S$) to normalize parameters, removing the dependence on image DPI metadata.</li>
</ul>
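<p>As a concrete illustration of the log-linear model above, the MLN probability can be evaluated directly when the candidate states are small enough to enumerate (real MLN inference avoids computing $Z$ explicitly; the weights and grounding counts below are hypothetical toy values):</p>

```python
import math

def mln_score(weights, counts):
    """Log of the unnormalized MLN probability: sum_i w_i * n_i(x)."""
    return sum(w * n for w, n in zip(weights, counts))

def mln_probabilities(weights, count_vectors):
    """Normalize over an enumerable set of candidate states x."""
    scores = [mln_score(weights, c) for c in count_vectors]
    z = sum(math.exp(s) for s in scores)  # partition function Z
    return [math.exp(s) / z for s in scores]

# Two toy states; counts[i] = number of true groundings n_i(x) of formula i.
weights = [1.5, -0.8]        # hypothetical formula weights w_i
states = [[3, 1], [2, 0]]    # grounding counts per candidate state
probs = mln_probabilities(weights, states)
```

<p>MAP inference then amounts to picking the state with the highest unnormalized score, which is why $Z$ never needs to be computed in practice.</p>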
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated the system on recognition accuracy against the leading open-source baseline, <strong>OSRA (v1.4.0)</strong>.</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>USPTO Clustered</strong>: A non-redundant subset of 937 images derived from a larger set of 5,719 US Patent Office images.</li>
<li><strong>ChemInfty</strong>: 869 images from Japanese patents.</li>
<li><strong>Degraded Images</strong>: The USPTO set was synthetically degraded at three resampling levels (Low, Medium, High degradation) to test robustness.</li>
</ul>
</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Geometric</strong>: Precision, Recall, and $F_1$ scores for individual atoms and bonds.</li>
<li><strong>Chemical</strong>: Tanimoto similarity (using path fingerprints) and InChI string matching (basic and full stereochemistry).</li>
</ul>
</li>
</ul>
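<p>The Tanimoto metric used here reduces to a Jaccard index over fingerprint sets. A minimal sketch (the path strings below are hypothetical stand-ins for real hashed path fingerprints):</p>

```python
def tanimoto(fp_a, fp_b):
    """Jaccard/Tanimoto index of two fingerprint sets (e.g. path fingerprints)."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(a & b) / len(a | b)

# Hypothetical path fingerprints: linear atom/bond paths up to some length.
pred = {"C-C", "C-O", "C-C-O", "C=O"}
true = {"C-C", "C-O", "C-C-O"}
sim = tanimoto(pred, true)  # 3 shared paths / 4 total = 0.75
```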
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Superior Robustness</strong>: MLOCSR significantly outperformed OSRA on degraded images. On high-degradation images, MLOCSR achieved an atom $F_1$ of 80.3% compared to OSRA&rsquo;s 76.0%.</li>
<li><strong>Geometric Accuracy</strong>: In clean datasets (USPTO cluster), MLOCSR achieved higher $F_1$ scores for atoms (99.1% vs 97.5%) and bonds (98.8% vs 97.8%).</li>
<li><strong>Chemical Fidelity</strong>: The system achieved comparable Tanimoto similarity scores (0.948 vs 0.940 for OSRA).</li>
<li><strong>Limitation</strong>: OSRA slightly outperformed MLOCSR on &ldquo;Full <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>&rdquo; matching (81.4% vs 79.4%), indicating the probabilistic model still needs improvement in handling complex stereochemistry.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized public datasets, with specific preprocessing to ensure non-redundancy.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>USPTO Clustered</strong></td>
          <td>937 images</td>
          <td>Selected via spectral clustering from 5,719 raw images to remove near-duplicates.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>ChemInfty</strong></td>
          <td>869 images</td>
          <td>Ground-truthed dataset from Japanese patent applications (2008).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of two distinct phases: Low-Level Vectorization and High-Level Inference.</p>
<p><strong>1. Low-Level Extraction (Image Processing)</strong></p>
<ul>
<li><strong>Binarization</strong>: Global thresholding followed by morphological closing.</li>
<li><strong>Text/Stroke Estimation</strong>:
<ul>
<li>Finds text height ($T$) by locating &ldquo;N&rdquo; or &ldquo;H&rdquo; characters via OCR, or by averaging the heights of compatible connected components.</li>
<li>Estimates stroke width ($S$) by inspecting pixel density on potential segments identified by Hough transform.</li>
</ul>
</li>
<li><strong>Vectorization</strong>:
<ul>
<li><strong>Canny Edge Detection</strong> + <strong>Hough Transform</strong> to find lines.</li>
<li><strong>Douglas-Peucker algorithm</strong> for polygonal approximation of contours.</li>
<li><strong>Circle Detection</strong>: Finds aromatic rings by checking for circular arrangements of carbon candidates.</li>
</ul>
</li>
</ul>
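<p>The Douglas-Peucker step can be sketched as follows; the <code>epsilon</code> tolerance and the sample polyline are illustrative, not values from the paper:</p>

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / norm

def douglas_peucker(points, epsilon):
    """Recursively simplify a polyline: keep the farthest point from the
    chord if it exceeds epsilon, otherwise collapse to the endpoints."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        return [a, b]
    left = douglas_peucker(points[:idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right  # drop duplicated split point

pts = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
simplified = douglas_peucker(pts, 1.0)  # jitter removed, corner at (3, 5) kept
```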
<p><strong>2. High-Level Inference (Markov Logic)</strong></p>
<ul>
<li><strong>Evidence Generation</strong>: Visual primitives (lines, text boxes, circles) are converted into logical ground atoms (e.g., <code>LineBetweenCpoints(c1, c2)</code>).</li>
<li><strong>Inference Engine</strong>: Uses <strong>MaxWalkSAT</strong> for Maximum A Posteriori (MAP) inference to determine the most probable state of query predicates (e.g., <code>DoubleBond(c1, c2)</code>).</li>
<li><strong>Parameters</strong>: MaxWalkSAT run with 3 tries and 1,000,000 steps per try.</li>
</ul>
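<p>A simplified local search in the spirit of MaxWalkSAT, minimizing the total weight of unsatisfied clauses (the toy knowledge base is hypothetical, and Alchemy&rsquo;s implementation adds refinements such as tabu moves that are omitted here):</p>

```python
import random

def max_walksat(clauses, n_vars, max_tries=3, max_flips=1000, p=0.5, seed=0):
    """clauses: list of (weight, literals); literal v > 0 means var v is true,
    literal -v means var v is false. Returns (best assignment, unsat weight)."""
    rng = random.Random(seed)

    def satisfied(lits, assign):
        return any((lit > 0) == assign[abs(lit)] for lit in lits)

    def cost(assign):
        return sum(w for w, lits in clauses if not satisfied(lits, assign))

    best, best_cost = None, float("inf")
    for _ in range(max_tries):
        assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            c = cost(assign)
            if c < best_cost:
                best, best_cost = dict(assign), c
            if c == 0:
                return best, 0.0
            unsat = [lits for w, lits in clauses if not satisfied(lits, assign)]
            lits = rng.choice(unsat)
            if rng.random() < p:                 # random-walk move
                var = abs(rng.choice(lits))
            else:                                # greedy move: cheapest flip
                def delta(v):
                    assign[v] = not assign[v]
                    d = cost(assign)
                    assign[v] = not assign[v]
                    return d
                var = min((abs(l) for l in lits), key=delta)
            assign[var] = not assign[var]
    return best, best_cost

# Toy weighted KB: prefer x1 true (w=2), soft rule x1 -> x2, penalty on x2.
clauses = [(2.0, [1]), (1.0, [-1, 2]), (1.0, [-2])]
assignment, unsat_weight = max_walksat(clauses, n_vars=2)
```

<p>Because the second and third clauses conflict when x1 is true, the optimum leaves weight 1.0 unsatisfied, which is exactly the kind of trade-off MAP inference resolves over the 128 weighted formulas.</p>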
<h3 id="models">Models</h3>
<ul>
<li><strong>Markov Logic Network (MLN)</strong>:
<ul>
<li>Contains <strong>128 first-order logic formulas</strong>.</li>
<li><strong>Geometric Rules</strong>: Example: <code>VeryCloseCpoints(c1, c2) =&gt; SameCarbon(c1, c2)</code> (weighted rule to merge close nodes).</li>
<li><strong>Chemical Rules</strong>: Example: <code>IsHydroxyl(t) ^ Connected(c,t) =&gt; SingleBond(c,t)</code> (imposes valency constraints).</li>
</ul>
</li>
<li><strong>OCR Engine</strong>: Tesseract is used for character recognition on text connected components.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The authors introduced a bipartite graph matching method to evaluate geometric accuracy when superatoms (e.g., &ldquo;COOH&rdquo;) are not expanded.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Atom/Bond $F_1$</strong></td>
          <td>Calculated via minimum-weight bipartite matching between predicted graph and ground truth, weighted by Euclidean distance.</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>Standard unique identifier string. &ldquo;Basic&rdquo; ignores stereochemistry; &ldquo;Full&rdquo; includes it.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Jaccard index of path fingerprints between predicted and ground truth molecules.</td>
      </tr>
  </tbody>
</table>
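<p>The bipartite-matching evaluation can be sketched as follows. This is a brute-force matching over permutations (fine for small molecular graphs; the paper&rsquo;s minimum-weight matching would scale better), and <code>max_dist</code> is a hypothetical cutoff for counting a matched pair as a true positive:</p>

```python
import math
from itertools import permutations

def matched_f1(pred, truth, max_dist=10.0):
    """Minimum-weight bipartite matching between predicted and ground-truth
    atom coordinates; matched pairs within max_dist count as true positives."""
    if not pred or not truth:
        return 0.0, 0.0, 0.0
    small, large = (pred, truth) if len(pred) <= len(truth) else (truth, pred)
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    best = None
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(small[i], large[j]) for i, j in enumerate(perm)]
        w = sum(dist(a, b) for a, b in pairs)
        if best is None or w < best[0]:
            best = (w, pairs)
    tp = sum(1 for a, b in best[1] if dist(a, b) <= max_dist)
    prec = tp / len(pred)
    rec = tp / len(truth)
    f1 = 0.0 if tp == 0 else 2 * prec * rec / (prec + rec)
    return prec, rec, f1

# Three predicted atoms vs. two ground-truth atoms (illustrative coordinates).
prec, rec, f1 = matched_f1([(0, 0), (10, 10), (50, 50)], [(1, 1), (11, 9)])
```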
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software</strong>: Logic inference performed using the <strong>Alchemy</strong> software package (University of Washington).</li>
<li><strong>Web Server</strong>: The system was made available at <code>http://mlocsr.dinfo.unifi.it</code> (Note: URL likely inactive).</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{frasconiMarkovLogicNetworks2014,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Markov {{Logic Networks}} for {{Optical Chemical Structure Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Frasconi, Paolo and Gabbrielli, Francesco and Lippi, Marco and Marinai, Simone}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2014</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{54}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2380--2390}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596, 1549-960X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci5002197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at CLEF-IP 2012: Native TIFF Processing for Patents</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</guid><description>Evaluation of OSRA on CLEF-IP 2012 patent data showing native TIFF processing outperforms external splitting tools and pairwise-distance segmentation.</description><content:encoded><![CDATA[<h2 id="contribution-evaluating-native-processing-in-osra">Contribution: Evaluating Native Processing in OSRA</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It focuses on evaluating the algorithmic performance of the Optical Structure Recognition Application (OSRA) and justifies specific implementation details (such as pairwise distance clustering) through comparative analysis. The paper systematically compares preprocessing workflows (native vs. <code>tiffsplit</code>) to demonstrate how implementation choices impact precision, recall, and F1 scores.</p>
<h2 id="motivation-advancing-chemical-structure-recognition">Motivation: Advancing Chemical Structure Recognition</h2>
<p>The primary motivation is to solve the <strong>Chemical Structure Recognition</strong> task within the context of the CLEF-IP 2012 challenge. The goal is to accurately convert images of chemical structures found in patent documents into established computerized molecular formats (connection tables).</p>
<p>A secondary technical motivation is to address issues in page segmentation where standard bounding box approaches fail to separate overlapping or nested molecular structures.</p>
<h2 id="core-innovation-pairwise-distance-segmentation">Core Innovation: Pairwise Distance Segmentation</h2>
<p>The core novelty lies in the algorithmic approach to object detection and page segmentation:</p>
<ol>
<li>
<p><strong>Rejection of Bounding Boxes</strong>: Unlike standard OCR approaches, OSRA does not use a bounding box paradigm internally. Instead, it relies on the <strong>minimum pairwise distance</strong> between points of different connected components. This allows the system to correctly handle cases where a larger molecule &ldquo;surrounds&rdquo; a smaller one, which bounding boxes would incorrectly merge.</p>
</li>
<li>
<p><strong>Native TIFF Processing</strong>: The authors identify that external tools (specifically <code>tiffsplit</code>) introduce artifacts during multi-page TIFF conversion. They implement native splitting facilities within OSRA, which substantially improves precision (from 0.433 to 0.708 at tolerance 0).</p>
</li>
</ol>
<h2 id="experimental-setup-segmentation-and-recognition-tracks">Experimental Setup: Segmentation and Recognition Tracks</h2>
<p>The authors performed two specific tracks for the CLEF-IP 2012 challenge:</p>
<ol>
<li>
<p><strong>Page Segmentation</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 5421 ground truth structures.</li>
<li><strong>Comparison</strong>: Run 1 used <code>tiffsplit</code> (external tool) to separate pages; Run 2 used OSRA&rsquo;s native internal page splitting.</li>
<li><strong>Metrics</strong>: Precision, Recall, and F1 scores calculated at varying pixel tolerances (0, 10, 20, 40, 55 pixels).</li>
</ul>
</li>
<li>
<p><strong>Structure Recognition</strong>:</p>
<ul>
<li><strong>Dataset</strong>: A test set split into an &ldquo;Automatic&rdquo; evaluation set (865 structures checkable via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> keys) and a &ldquo;Manual&rdquo; evaluation set (95 structures requiring human review due to Markush labels).</li>
<li><strong>Metric</strong>: Recognition rate (Recalled %).</li>
</ul>
</li>
</ol>
<h2 id="results-and-conclusions-native-processing-gains">Results and Conclusions: Native Processing Gains</h2>
<ul>
<li><strong>Native vs. External Splitting</strong>: The native OSRA page splitting outperformed the external <code>tiffsplit</code> tool by a wide margin. At tolerance 0, native processing achieved <strong>0.708 Precision</strong> compared to <strong>0.433</strong> for <code>tiffsplit</code>. The authors attribute this gap to artifacts introduced during <code>tiffsplit</code>&rsquo;s internal TIFF format conversion. The native run also returned far fewer records (5,254 vs. 8,800 for <code>tiffsplit</code>), indicating fewer false detections.</li>
<li><strong>Recognition Rate</strong>: Across 960 total structures, the system achieved an <strong>83% recognition rate</strong> (88% on the automatic set, 40% on the manual Markush set).</li>
<li><strong>Context</strong>: The results were consistent with OSRA&rsquo;s second-place finish (out of 6 participants) at TREC-CHEM 2011.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The experiments used the CLEF-IP 2012 benchmark datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Set</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td>Ground Truth</td>
          <td>5,421 structures</td>
          <td>Used to evaluate bounding box/coordinate accuracy.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Automatic</td>
          <td>865 structures</td>
          <td>Evaluated via InChI key matching.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Manual</td>
          <td>95 structures</td>
          <td>Evaluated manually due to Markush-style labels.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Component Clustering (Pairwise Distance)</strong></p>
<p>The segmentation algorithm avoids bounding boxes.</p>
<ul>
<li><strong>Logic</strong>: Calculate the minimum pairwise distance between points of distinct graphical components.</li>
<li><strong>Criterion</strong>: If distance $d &lt; \text{threshold}$, components are grouped.</li>
<li><strong>Advantage</strong>: Enables separation of complex geometries where a bounding box $B_1$ might fully encompass $B_2$ (e.g., a large ring surrounding a salt ion), whereas the actual pixels are disjoint.</li>
</ul>
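<p>The grouping criterion above can be sketched as a union-find pass over minimum pixel distances. This is a brute-force illustration with made-up pixel sets (a real implementation would use a spatial index rather than all-pairs distances):</p>

```python
import math

def min_pairwise_distance(comp_a, comp_b):
    """Minimum distance between any two pixels of two connected components."""
    return min(math.hypot(ax - bx, ay - by)
               for ax, ay in comp_a for bx, by in comp_b)

def cluster_components(components, threshold):
    """Merge components whose minimum pairwise pixel distance is below
    the threshold (union-find on component indices)."""
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            if min_pairwise_distance(components[i], components[j]) < threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# A ring of pixels "surrounding" a distant inner component: bounding boxes
# would merge them, but the pixel-wise distance keeps them separate.
ring = [(0, 0), (0, 20), (20, 0), (20, 20)]
inner = [(10, 10)]
nearby = [(21, 20)]
clusters = cluster_components([ring, inner, nearby], threshold=5.0)
```

<p>Here the ring merges with the genuinely adjacent component but not with the enclosed one, which is exactly the case where bounding boxes fail.</p>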
<p><strong>2. Image Pre-processing</strong></p>
<ul>
<li><strong>Workflow A (Run 1)</strong>: Multi-page TIFF → <code>tiffsplit</code> binary → Single TIFFs → OSRA.</li>
<li><strong>Workflow B (Run 2)</strong>: Multi-page TIFF → OSRA Internal Split → Recognition.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Page Segmentation Results (tiffsplit, Run 1)</strong></p>
<p>Using <code>tiffsplit</code> for page splitting returned 8,800 records against 5,421 ground truth structures.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.433</td>
          <td>0.703</td>
          <td>0.536</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.490</td>
          <td>0.795</td>
          <td>0.606</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.507</td>
          <td>0.823</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.536</td>
          <td>0.870</td>
          <td>0.663</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.549</td>
          <td>0.891</td>
          <td>0.679</td>
      </tr>
  </tbody>
</table>
<p><strong>Page Segmentation Results (Native Split, Run 2)</strong></p>
<p>Using OSRA&rsquo;s native TIFF reading returned 5,254 records, with much higher precision.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.708</td>
          <td>0.686</td>
          <td>0.697</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.793</td>
          <td>0.769</td>
          <td>0.781</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.821</td>
          <td>0.795</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.867</td>
          <td>0.840</td>
          <td>0.853</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.887</td>
          <td>0.860</td>
          <td>0.873</td>
      </tr>
  </tbody>
</table>
<p><strong>Structure Recognition Results</strong></p>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Count</th>
          <th>Recalled</th>
          <th>Percentage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Automatic</td>
          <td>865</td>
          <td>761</td>
          <td>88%</td>
      </tr>
      <tr>
          <td>Manual</td>
          <td>95</td>
          <td>38</td>
          <td>40%</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>960</strong></td>
          <td><strong>799</strong></td>
          <td><strong>83%</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://cactus.nci.nih.gov/osra">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Official project page at NCI/NIH</td>
      </tr>
  </tbody>
</table>
<p>OSRA is described as an open source utility. The CLEF-IP 2012 benchmark datasets were provided as part of the shared task. No hardware or compute requirements are specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2012). Optical Structure Recognition Application entry to CLEF-IP 2012. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>.</p>
<p><strong>Publication</strong>: CLEF 2012</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="http://cactus.nci.nih.gov/osra">Project Home Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{filippovOpticalStructureRecognition2012,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry to {{CLEF-IP}} 2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V and Katsubo, Dmitry and Nicklaus, Marc C}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present our entry to CLEF 2012 Chemical Structure Recognition task. Our submission includes runs for both bounding box extraction and molecule structure recognition tasks using Optical Structure Recognition Application. OSRA is an open source utility to convert images of chemical structures to connection tables into established computerized molecular formats. It has been under constant development since 2007.}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolRec at CLEF 2012: Rule-Based Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</guid><description>Overview and failure analysis of the MolRec rule-based chemical structure recognition system evaluated on the CLEF 2012 chemical structure recognition task.</description><content:encoded><![CDATA[<h2 id="contribution-to-chemical-structure-recognition">Contribution to Chemical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper. It describes the architecture of an engineered artifact (the &ldquo;MolRec&rdquo; system) and evaluates its efficacy on a specific task (Chemical Structure Recognition) using a standardized benchmark. It focuses on the mechanisms of vectorization and rule-based rewriting.</p>
<h2 id="motivation-and-clef-2012-context">Motivation and CLEF 2012 Context</h2>
<p>The work was motivated by the <strong>CLEF 2012 chemical structure recognition task</strong>. The goal was to automatically interpret chemical diagram images clipped from patent documents. This is challenging because real-world patent images contain complex structures, such as bridge bonds and elements not supported by standard conversion tools like OpenBabel.</p>
<h2 id="novelty-in-rule-based-vectorization">Novelty in Rule-Based Vectorization</h2>
<p>The primary contribution is an <strong>improved rule-based rewrite engine</strong> compared to the authors&rsquo; previous TREC 2011 submission, featuring a fully overhauled implementation that improves both recognition performance and computational efficiency. The system uses a two-stage approach:</p>
<ol>
<li><strong>Vectorization</strong>: Extracts geometric primitives (lines, circles, arrows) and characters.</li>
<li><strong>Rule Engine</strong>: Applies 18 specific geometric rewriting rules to transform primitives into a chemical graph, which can then be exported to MOL or SMILES format.</li>
</ol>
<p>Notably, the system explicitly handles &ldquo;bridge bonds&rdquo; (3D perspective structures) by applying specific recognition rules before general bond detection.</p>
<h2 id="experimental-setup-on-the-clef-2012-corpus">Experimental Setup on the CLEF 2012 Corpus</h2>
<p>The system was evaluated on the <strong>CLEF 2012 corpus</strong> of 961 test images, split into two distinct sets to test different capabilities:</p>
<ul>
<li><strong>Automatic Set</strong>: 865 images evaluated automatically using OpenBabel to compare generated MOL files against ground truth.</li>
<li><strong>Manual Set</strong>: 95 &ldquo;challenging&rdquo; images containing elements beyond OpenBabel&rsquo;s scope (e.g., Markush structures), evaluated via manual visual inspection.</li>
</ul>
<p>The authors performed <strong>four runs</strong> with slightly different internal parameters to test system stability.</p>
<h2 id="performance-outcomes-and-failure-analysis">Performance Outcomes and Failure Analysis</h2>
<p><strong>Performance:</strong></p>
<ul>
<li><strong>Automatic Set</strong>: High performance, achieving accuracy between <strong>94.91% and 96.18%</strong>.</li>
<li><strong>Manual Set</strong>: Lower performance, with accuracy between <strong>46.32% and 58.95%</strong>, reflecting the difficulty of complex patent diagrams containing Markush structures and other elements beyond OpenBabel&rsquo;s scope.</li>
</ul>
<p><strong>Failure Analysis:</strong></p>
<p>The authors conducted a detailed error analysis on 52 distinct mis-recognized diagrams from the manual set and 46 from the automatic set. Key failure modes include:</p>
<ul>
<li><strong>Character Grouping</strong>: The largest error source in the manual set (26 images). A bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, and closely-spaced atom groups were incorrectly merged.</li>
<li><strong>Touching Characters</strong>: 8 images in the manual set and 1 in the automatic set. The system lacks segmentation for characters that touch, causing OCR failure.</li>
<li><strong>Four-way Junctions</strong>: 6 manual and 7 automatic images. Vectorization failed to correctly identify junctions where four lines meet.</li>
<li><strong>Missed Wedge Bonds</strong>: 6 images each for missed solid wedge and dashed wedge bonds in the automatic set.</li>
<li><strong>OCR Errors</strong>: 5 manual and 11 automatic images, including misrecognition of &ldquo;G&rdquo; as &ldquo;O&rdquo; and &ldquo;I&rdquo; interpreted as a vertical single bond.</li>
<li><strong>Charge Signs</strong>: MolRec correctly recognized positive charge signs but missed three negative charge signs, including one placed at the top left of an atom name.</li>
<li><strong>Dataset Errors</strong>: The authors identified 11 images where the ground truth MOL files were incorrect, but MolRec&rsquo;s recognition was actually correct.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was provided by CLEF 2012 organizers and consists of images clipped from patent documents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation (Auto)</td>
          <td>CLEF 2012 Set 1</td>
          <td>865 images</td>
          <td>Evaluated via OpenBabel</td>
      </tr>
      <tr>
          <td>Evaluation (Manual)</td>
          <td>CLEF 2012 Set 2</td>
          <td>95 images</td>
          <td>Complex/Markush structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>MolRec</strong> pipeline consists of two primary modules:</p>
<p><strong>1. Vectorization Module</strong></p>
<ul>
<li><strong>Binarization</strong>: Uses <strong>Otsu&rsquo;s method</strong>.</li>
<li><strong>OCR</strong>: Extracts connected components and classifies them using <strong>nearest neighbor classification</strong> with a Euclidean metric. Detected characters are removed from the image.</li>
<li><strong>Bond Separation</strong>:
<ul>
<li>Thins remaining components to single-pixel width.</li>
<li>Builds polyline representations.</li>
<li>Splits polylines at junctions (3+ lines meeting).</li>
<li><strong>Simplification</strong>: Applies the <strong>Douglas-Peucker algorithm</strong> with a threshold of 1-2 average line widths to remove scanning artifacts while preserving corners. The threshold is based on measured average line width, allowing adaptation to different line styles.</li>
<li>Also detects circles, arrow heads, and solid triangles (annotated with direction).</li>
</ul>
</li>
</ul>
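<p>The Otsu binarization step can be sketched directly from a grayscale histogram; the bimodal toy histogram below (ink near bin 20, paper near bin 200) is illustrative:</p>

```python
def otsu_threshold(histogram):
    """Otsu's method: pick the threshold maximizing between-class variance
    over a 256-bin grayscale histogram."""
    total = sum(histogram)
    total_sum = sum(i * h for i, h in enumerate(histogram))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(256):
        w_bg += histogram[t]          # background (dark) class weight
        if w_bg == 0:
            continue
        w_fg = total - w_bg           # foreground (light) class weight
        if w_fg == 0:
            break
        sum_bg += t * histogram[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (total_sum - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

hist = [0] * 256
for b in (18, 20, 22):
    hist[b] = 100     # dark ink pixels
for b in (198, 200, 202):
    hist[b] = 300     # light paper pixels
t = otsu_threshold(hist)  # lands between the two modes
```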
<p><strong>2. Rule Engine</strong></p>
<ul>
<li><strong>Input</strong>: Geometric primitives (segments, circles, triangles, arrows, character groups).</li>
<li><strong>Structure</strong>: 18 rewrite rules.</li>
<li><strong>Priority</strong>: Two rules for <strong>Bridge Bonds</strong> (Open/Closed) are applied <em>first</em>.</li>
<li><strong>Standard Rules</strong>: 16 rules applied in arbitrary order for standard bonds (Single, Double, Triple, Wedge, Dative, etc.).</li>
<li><strong>Implicit Nodes</strong>: Some rules handle cases where carbon atoms are implicit at bond junctions. These rules detect double or triple bonds while producing new geometric objects by splitting bonds at implicit nodes for further processing.</li>
<li><strong>Example Rule (Wavy Bond)</strong>:
<ul>
<li><em>Condition 1</em>: A set of line segments $L$ containing at least three segments ($|L| \ge 3$).</li>
<li><em>Condition 2</em>: Segment lengths match &ldquo;dash length&rdquo; parameter.</li>
<li><em>Condition 3</em>: All elements are connected.</li>
<li><em>Condition 4</em>: Center points are approximately collinear.</li>
<li><em>Condition 5</em>: Endpoints form a single sequence (end elements have 1 neighbor, internal have 2).</li>
<li><em>Condition 6</em>: Two unconnected endpoints must be the pair of endpoints that are furthest apart.</li>
<li><em>Consequence</em>: Replace $L$ with a Wavy Bond between the furthest two endpoints. The bond has unknown direction.</li>
</ul>
</li>
</ul>
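<p>Two of the wavy-bond conditions lend themselves to a short geometric sketch: the collinearity test on segment centers (Condition 4) and the furthest-endpoint pair that spans the bond (Condition 6). The tolerance and dash coordinates below are illustrative, not parameters from the paper:</p>

```python
import math

def centers_collinear(segments, tol=2.0):
    """Condition 4: segment center points lie approximately on the line
    through the first and last centers."""
    centers = [((x1 + x2) / 2, (y1 + y2) / 2)
               for (x1, y1), (x2, y2) in segments]
    (ax, ay), (bx, by) = centers[0], centers[-1]
    norm = math.hypot(bx - ax, by - ay)
    if norm == 0:
        return True
    for px, py in centers[1:-1]:
        d = abs((by - ay) * px - (bx - ax) * py + bx * ay - by * ax) / norm
        if d > tol:
            return False
    return True

def furthest_endpoints(segments):
    """Condition 6: the wavy bond spans the pair of endpoints furthest apart."""
    pts = [p for seg in segments for p in seg]
    return max(((a, b) for a in pts for b in pts),
               key=lambda ab: math.hypot(ab[0][0] - ab[1][0],
                                         ab[0][1] - ab[1][1]))

# Three short dashes whose centers are roughly collinear.
dashes = [((0, 0), (2, 1)), ((3, -1), (5, 0)), ((6, 1), (8, 0))]
ok = centers_collinear(dashes)
a, b = furthest_endpoints(dashes)  # the bond's two end atoms
```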
<h3 id="models">Models</h3>
<p>MolRec is a <strong>rule-based system</strong> and does not use trained deep learning models or weights.</p>
<ul>
<li><strong>Superatoms</strong>: Uses a dictionary look-up to resolve character groups representing superatoms into subgraphs.</li>
<li><strong>Disambiguation</strong>: Context-based logic is applied <em>after</em> graph construction to resolve ambiguities (e.g., distinguishing vertical bond <code>|</code> from letter <code>I</code> or digit <code>1</code>).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Run 3</th>
          <th>Run 4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Auto (865 images)</td>
          <td>96.18% (832/865)</td>
          <td>94.91% (821/865)</td>
          <td>94.91% (821/865)</td>
          <td>96.18% (832/865)</td>
      </tr>
      <tr>
          <td>Manual (95 images)</td>
          <td>46.32% (44/95)</td>
          <td>58.95% (56/95)</td>
          <td>46.32% (44/95)</td>
          <td>56.84% (54/95)</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Parameters</strong>:</p>
<ul>
<li><strong>Dash Length</strong>: Range of acceptable values for dashed lines.</li>
<li><strong>Simplification Threshold</strong>: 1-2x average line width for Douglas-Peucker.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">CLEF 2012 Workshop Paper</a></td>
          <td>Other</td>
          <td>Open Access</td>
          <td>CEUR Workshop Proceedings</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-classification-closed">Reproducibility Classification: Closed</h3>
<p>No source code for the MolRec system has been publicly released. The CLEF 2012 evaluation dataset was distributed to task participants and is not openly available. The rule-based algorithm is described in sufficient detail to re-implement, but exact parameter values and the character classification training set are not fully specified. No hardware or compute requirements are reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012 &ndash; Overview and Analysis of Results. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawi2012molrec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolRec at CLEF 2012--Overview and Analysis of Results}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sadawi, Noureddin M and Sexton, Alan P and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader Image-to-Structure OCR at TREC 2011 Chemical IR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</guid><description>ChemReader OCR software evaluation on TREC 2011 Chemical IR campaign achieving 93% accuracy on image-to-structure task.</description><content:encoded><![CDATA[<h2 id="methodological-application-applying-chemreader-to-chemical-ocr">Methodological Application: Applying ChemReader to Chemical OCR</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$).</p>
<p>The dominant vector is $\Psi_{\text{Method}}$ because the paper&rsquo;s core contribution is the empirical evaluation and refinement of ChemReader on the Image-to-Structure (I2S) task. The rhetorical indicators align with this classification: the paper reports quantitative performance metrics, conducts a detailed <strong>error analysis</strong>, and focuses on <strong>how well the system works</strong> and where its underlying algorithms need refinement.</p>
<h2 id="motivation-bridging-the-gap-in-image-to-structure-tasks">Motivation: Bridging the Gap in Image-to-Structure Tasks</h2>
<p>The motivation is two-fold:</p>
<ol>
<li>
<p><strong>Scientific Need</strong>: Traditional text-based chemical mining methods cannot utilize image data in scientific literature. Chemical OCR software is required to extract 2D chemical structure diagrams from raster images and convert them into a machine-readable chemical file format, paving the way for advanced chemical literature mining.</p>
</li>
<li>
<p><strong>Benchmark Participation</strong>: The immediate motivation was participation in the <strong>TREC Chemical IR campaign&rsquo;s Image-to-Structure (I2S) task</strong>, which was designed to evaluate existing chemical OCR software and establish a platform for developing chemical information retrieval techniques utilizing image data.</p>
</li>
</ol>
<h2 id="novelty-benchmark-evaluation-and-error-analysis-of-chemreader">Novelty: Benchmark Evaluation and Error Analysis of ChemReader</h2>
<p>ChemReader was previously introduced in earlier publications and is a chemical OCR system tailored to a chemical database annotation scheme. The novelty of this paper lies in <strong>evaluating ChemReader within the formal I2S benchmark setting</strong> and conducting a detailed <strong>error analysis</strong> of its performance. After fixing a stereo bond omission and a corner detection bug discovered during the evaluation, ChemReader achieved 93% accuracy (930/1000) on the benchmark test set.</p>
<h2 id="experimental-setup-the-trec-2011-i2s-challenge">Experimental Setup: The TREC 2011 I2S Challenge</h2>
<p>The experiment was the application of the ChemReader software to the <strong>Image-to-Structure (I2S) task</strong> of the TREC Chemical IR campaign.</p>
<ul>
<li><strong>Setup</strong>: The software was used to process image data provided for the I2S task.</li>
<li><strong>Evaluation</strong>: The system was initially evaluated, revealing two issues: the omission of <strong>bond stereo types</strong> in the output structures and a bug in the <strong>corner detection</strong> code that failed on lines touching the image boundary. Each issue lowered accuracy by approximately 10%.</li>
<li><strong>Analysis</strong>: After fixing these issues, ChemReader was re-evaluated on the full 1000-image test set (<strong>Test III</strong>). A detailed error analysis was then conducted on 20 randomly selected samples from Test III results.</li>
</ul>
<h2 id="training-progress">Training Progress</h2>
<p>The paper reports three rounds of major training, each improving accuracy by approximately 15% (Figure 1 in the paper shows the progression):</p>
<ul>
<li><strong>Initial (untrained)</strong>: 57% accuracy on 100 selected training images.</li>
<li><strong>Key changes</strong>: deactivating unnecessary heuristic algorithms (resizing, de-noising, line merging), limiting the character set, updating the chemical dictionary to a lightweight version, and fixing precision loss from type conversions.</li>
</ul>
<h2 id="outcomes-high-accuracy-hindered-by-complex-connectivity-rules">Outcomes: High Accuracy Hindered by Complex Connectivity Rules</h2>
<ul>
<li>
<p><strong>Submitted Results</strong>: Test I achieved 691/1000 correct outputs (avg. Tanimoto similarity 0.9769), and Test II achieved 689/1000 (avg. Tanimoto similarity 0.9823). Both scored lower than training accuracy due to the stereo bond omission and corner detection bug.</p>
</li>
<li>
<p><strong>Key Finding</strong>: After fixing these two issues, ChemReader achieved <strong>93% accuracy</strong> (930/1000) on the I2S task (Test III), comparable to the highest accuracy among participants.</p>
</li>
<li>
<p><strong>Limitation/Future Direction</strong>: A detailed <strong>error analysis</strong> on 20 randomly selected samples from Test III (Table 2) showed that the software requires the incorporation of <strong>more chemical intelligence in its algorithms</strong> to address remaining systematic errors. The most frequent errors were:</p>
<ul>
<li>Wrongly merged nodes: 6 samples (30%), caused by nodes too close to be distinguished by a distance threshold</li>
<li>Missed bonds: 4 samples (20%), caused by filtering out short line segments</li>
<li>Nonstandard representations: noise symbols confusing the system, nonstandard wedge/hatched bond styles, and 3D crossing bonds that ChemReader cannot interpret</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Training</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Training Set</td>
          <td style="text-align: left">1000 images (100 used for quick eval)</td>
          <td style="text-align: left">TIF format, one chemical structure per image</td>
      </tr>
      <tr>
          <td style="text-align: left">Evaluation</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Test Set</td>
          <td style="text-align: left">1000 images (20 sampled for error analysis)</td>
          <td style="text-align: left">Same format constraints; 930/1000 correct in Test III</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>ChemReader is a <strong>chemical Optical Character Recognition (OCR) system</strong> with a 17-step pipeline:</p>
<ol>
<li><strong>Pixel clustering</strong>: Region-growing to identify the chemical structure region</li>
<li><strong>Preprocessing</strong>: Resizing, de-noising, and bond length estimation (deactivated for I2S task)</li>
<li><strong>Text identification</strong>: Connected components with similar heights/areas labeled as characters</li>
<li><strong>Benzene ring detection</strong>: Identifying circles representing aromatic bonds</li>
<li><strong>Hatched bond detection</strong>: Finding short collinear line segments of uniform length</li>
<li><strong>Skeletonization</strong>: Thinning bond pixels for downstream processing</li>
<li><strong>Ring structure detection</strong>: Pentagonal/hexagonal rings via Generalized Hough Transformation (GHT)</li>
<li><strong>Line detection</strong>: Modified Hough Transformation with corner detection for bond extraction</li>
<li><strong>Line filtering</strong>: Removing spurious short segments</li>
<li><strong>Secondary text identification</strong>: Re-examining unidentified fragments for text</li>
<li><strong>Character recognition</strong>: Dual-engine approach (GOCR template matching + Euclidean distance-based engine)</li>
<li><strong>Chemical spell checker</strong>: Matching against a dictionary of 770 chemical abbreviations</li>
<li><strong>Secondary line detection</strong>: Re-running line detection on remaining pixels</li>
<li><strong>Line merging/breaking</strong>: Combining fragmented bonds or splitting at junction nodes</li>
<li><strong>Graph construction</strong>: Creating nodes from bond endpoints and chemical symbol centers, merging nearby nodes</li>
<li><strong>Connected component selection</strong>: Selecting the largest graph component</li>
<li><strong>Output</strong>: Connection table in machine-readable format</li>
</ol>
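<p>Step 15&rsquo;s proximity-based node merging can be sketched as follows (illustrative only; ChemReader&rsquo;s actual threshold and clustering are not published). Note that this is also the failure mode behind the &ldquo;wrongly merged nodes&rdquo; errors: nodes closer than the threshold collapse into one.</p>

```python
from math import dist

def merge_nodes(points, threshold):
    """Greedily merge 2-D endpoints closer than `threshold` into single
    graph nodes, each represented by its cluster centroid (sketch)."""
    clusters = []  # each cluster is a list of member points
    for p in points:
        for c in clusters:
            if any(dist(p, q) < threshold for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return [tuple(sum(coord) / len(c) for coord in zip(*c))
            for c in clusters]
```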
<h3 id="models">Models</h3>
<p>ChemReader is a rule-based system relying on traditional computer vision (Hough Transformation, region growing, skeletonization) and template-based character recognition. It does not use machine learning model architectures such as CNNs or neural networks.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Test</th>
          <th style="text-align: left">Correct Outputs</th>
          <th style="text-align: left">Avg. Tanimoto Similarity</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Test I (submitted)</td>
          <td style="text-align: left">691/1000</td>
          <td style="text-align: left">0.9769</td>
          <td style="text-align: left">Original submission</td>
      </tr>
      <tr>
          <td style="text-align: left">Test II (submitted)</td>
          <td style="text-align: left">689/1000</td>
          <td style="text-align: left">0.9823</td>
          <td style="text-align: left">Alternative parameter setting</td>
      </tr>
      <tr>
          <td style="text-align: left">Test III (post-fix)</td>
          <td style="text-align: left">930/1000 (93%)</td>
          <td style="text-align: left">0.9913</td>
          <td style="text-align: left">After fixing stereo bond omission and corner detection bug</td>
      </tr>
  </tbody>
</table>
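<p>The Tanimoto similarity reported above is the Jaccard coefficient over structure fingerprint bit sets, which is why the average similarity can sit near 0.99 while exact-match accuracy is lower: near-misses still score just under 1.0. A minimal sketch:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |a & b| / |a | b|."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```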
<p><strong>Error Breakdown</strong> (from 20-sample analysis of Test III):</p>
<ul>
<li>Wrongly merged nodes: 6 (30%)</li>
<li>Missed bonds: 4 (20%)</li>
<li>Nonstandard representations (noise symbols, nonstandard wedge/hatched bonds, 3D crossing bonds): remaining errors</li>
</ul>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>ChemReader&rsquo;s source code is not publicly available. The TREC 2011 Chemical IR I2S image sets were distributed to task participants but are not openly hosted. No pre-trained models apply (rule-based system). The paper provides a detailed algorithmic description (17-step pipeline) and parameter values, but full reproduction requires access to both the ChemReader codebase and the TREC image sets.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Li, Y., Rosania, G. R., &amp; Saitou, K. (2011). Image-to-Structure Task by ChemReader. <em>TREC 2011 Chemical IR Track Report</em>.</p>
<p><strong>Publication</strong>: TREC 2011 Chemical IR Track</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/papers/CHEM.OVERVIEW.pdf">TREC 2011 Chemical IR Track Overview</a></li>
<li><a href="/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/">ChemReader 2009 original paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{parkImagetoStructureTaskChemReader2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image-to-Structure Task by {ChemReader}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Park, Jungkap and Li, Ye and Rosania, Gus R. and Saitou, Kazuhiro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{University of Michigan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">type</span> = <span style="color:#e6db74">{TREC 2011 Chemical IR Track Report}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Reconstruction with chemoCR (2011)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</guid><description>A hybrid system combining pattern recognition and rule-based expert systems to reconstruct chemical structures from bitmap images.</description><content:encoded><![CDATA[<h2 id="contribution-the-chemocr-architecture">Contribution: The chemoCR Architecture</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper focuses entirely on the architecture and workflow of the <strong>chemoCR</strong> system. It proposes specific algorithmic innovations (texture-based vectorization, graph constraint exploration) and defines a comprehensive pipeline for converting raster images into semantic chemical graphs. The primary contribution is the system design and its operational efficacy.</p>
<h2 id="motivation-digitizing-image-locked-chemical-structures">Motivation: Digitizing Image-Locked Chemical Structures</h2>
<p>Chemical structures are the preferred language of chemistry, yet they are frequently locked in non-machine-readable formats (bitmap images like GIF, BMP) within patents and journals.</p>
<ul>
<li><strong>The Problem:</strong> Once published as images, chemical structure information is &ldquo;dead&rdquo; to analysis software.</li>
<li><strong>The Gap:</strong> Manual reconstruction is slow and error-prone. Existing tools struggled with the diversity of drawing styles (e.g., varying line thickness, font types, and non-standard bond representations).</li>
<li><strong>The Goal:</strong> To automate the conversion of these depictions into connection tables (SDF/MOL files) to make the data accessible for computational chemistry applications.</li>
</ul>
<h2 id="core-innovation-rule-based-semantic-object-identification">Core Innovation: Rule-Based Semantic Object Identification</h2>
<p>The system is based on a &ldquo;Semantic Entity&rdquo; approach that identifies chemically significant objects (chiral bonds, superatoms, reaction arrows) from structural formula depictions. Key technical innovations include:</p>
<ol>
<li><strong>Texture-based Vectorization:</strong> A new algorithm that computes local directions to vectorize lines, robust against varying drawing styles.</li>
<li><strong>Expert System Integration:</strong> A graph constraint exploration algorithm that applies an XML-based rule set to classify geometric primitives into chemical classes such as <code>BOND</code>, <code>DOUBLEBOND</code>, <code>TRIPLEBOND</code>, <code>BONDSET</code>, <code>DOTTED CHIRAL</code>, <code>STRINGASSOCIATION</code>, <code>DOT</code>, <code>RADICAL</code>, <code>REACTION</code>, <code>REACTION ARROW</code>, <code>REACTION PLUS</code>, <code>CHARGE</code>, and <code>UNKNOWN</code>.</li>
<li><strong>Validation Scoring:</strong> A built-in validation module that tests valences, bond lengths and angles, typical atom types, and fragments to assign a confidence score (0 to 1) to the reconstruction.</li>
</ol>
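<p>The paper does not give the aggregation formula behind the confidence score; a hedged sketch with invented equal weights:</p>

```python
def validation_score(checks, weights=None):
    """Combine boolean chemistry checks (valences, bond geometry, atom
    types, fragments) into a 0-1 confidence score. Equal weighting is an
    assumption; the paper does not specify the aggregation."""
    names = list(checks)
    weights = weights or {n: 1.0 for n in names}
    total = sum(weights[n] for n in names)
    return sum(weights[n] for n in names if checks[n]) / total

score = validation_score({
    "valences_ok": True,
    "bond_geometry_ok": True,
    "atom_types_typical": False,
    "fragments_ok": True,
})  # 3 of 4 equally weighted checks pass -> 0.75
```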
<h2 id="experiments-the-trec-2011-image-to-structure-task">Experiments: The TREC 2011 Image-to-Structure Task</h2>
<p>The system was evaluated as part of the <strong>TREC 2011 Image-to-Structure (I2S) Task</strong>.</p>
<ul>
<li><strong>Dataset:</strong> 1,000 unique chemical structure images provided by USPTO.</li>
<li><strong>Configuration:</strong> The authors used chemoCR v0.93 in batch mode with a single pre-configured parameter set (&ldquo;Houben-Weyl&rdquo;), originally developed for the Houben-Weyl book series of organic chemistry reactions published by Thieme.</li>
<li><strong>Process:</strong> The workflow included image binarization, connected component analysis, OCR for atom labels, and final molecule assembly.</li>
<li><strong>Metric:</strong> Perfect match recall against ground-truth MOL files.</li>
</ul>
<h2 id="results-and-conclusions-expert-systems-vs-dirty-data">Results and Conclusions: Expert Systems vs. &ldquo;Dirty&rdquo; Data</h2>
<ul>
<li><strong>Performance:</strong> The system achieved a <strong>perfect match for 656 out of 1,000 structures (65.6%)</strong>.</li>
<li><strong>Error Analysis:</strong> Failures were primarily attributed to &ldquo;unclear semantics&rdquo; in drawing styles, such as:
<ul>
<li>Overlapping objects (e.g., atom labels clashing with bonds).</li>
<li>Ambiguous primitives (dots interpreted as both radicals and chiral centers).</li>
<li>Markush structures (variable groups), which were excluded from the I2S task definition. A prototype for Markush detection existed but was not used.</li>
</ul>
</li>
<li><strong>Limitations:</strong> The vectorizer cannot recognize curves and circles, only straight lines. Aromatic ring detection (via a heuristic that looks for a large &ldquo;O&rdquo; character in the center of a ring system) was switched off for the I2S task. The system maintained 12 different parameter sets for various drawing styles, and selecting the correct set was critical.</li>
<li><strong>Impact:</strong> Demonstrated that rule-based expert systems combined with standard pattern recognition could handle high-quality datasets effectively, though non-standard drawing styles remain a challenge.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper relies on the TREC 2011 I2S dataset, comprising images extracted from USPTO patents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 I2S</td>
          <td>1,000 images</td>
          <td>Binarized bitmaps from USPTO patents.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Internal Training Set</td>
          <td>Unknown</td>
          <td>Used to optimize parameter sets (e.g., &ldquo;Houben-Weyl&rdquo; set).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper describes three main workflow phases (preprocessing, semantic entity recognition, and molecule reconstruction plus validation), organized into four pipeline sections:</p>
<ol>
<li>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><em>Vaporizer Unit:</em> Erases parts of the image that are presumably not structure diagrams (e.g., text or other human-readable information), isolating the chemical depictions.</li>
<li><em>Connected Components:</em> Groups all foreground pixels that are 8-connected into components.</li>
<li><em>Text Tagging and OCR:</em> Identifies components that map to text areas and converts bitmap letters into characters.</li>
</ul>
</li>
<li>
<p><strong>Vectorization:</strong></p>
<ul>
<li><em>Algorithm:</em> <strong>Compute Local Directions</strong>. It analyzes segment clusters to detect ascending, descending, horizontal, and vertical trends in pixel data, converting them into vectors.</li>
<li><em>Feature:</em> Explicitly handles &ldquo;thick chirals&rdquo; (wedges) by computing orientation.</li>
</ul>
</li>
<li>
<p><strong>Reconstruction (Expert System):</strong></p>
<ul>
<li><em>Core Logic:</em> <strong>Graph Constraint Exploration</strong>. It visits connected components and evaluates them against an XML Rule Set.</li>
<li><em>Classification:</em> Objects are tagged with chemical keywords (e.g., <code>BONDSET</code> for ring systems and chains, <code>STRINGASSOCIATION</code> for atom labels, <code>DOTTED CHIRAL</code> for chiral bonds).</li>
<li><em>Rules:</em> Configurable via <code>chemoCRSettings.xml</code>. The successful rule with the highest priority value defines the annotation for each component.</li>
</ul>
</li>
<li>
<p><strong>Assembly &amp; Validation:</strong></p>
<ul>
<li>Combines classified vectors and OCR text into a semantic graph.</li>
<li><em>Superatoms:</em> Matches text groups against a loaded superatom database (e.g., &ldquo;COOH&rdquo;, &ldquo;Boc&rdquo;).</li>
<li><em>Validation:</em> Calculates a score (0-1) based on chemical feasibility (valences, bond lengths and angles, typical atom types, and fragments).</li>
</ul>
</li>
</ol>
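<p>The &ldquo;successful rule with the highest priority wins&rdquo; logic can be sketched as follows (the predicates and priorities are hypothetical; the real rules live in <code>chemoCRSettings.xml</code>):</p>

```python
def classify_component(component, rules):
    """Apply the highest-priority rule whose predicate matches the
    connected component, falling back to UNKNOWN (illustrative stand-in
    for the XML rule set)."""
    for priority, tag, predicate in sorted(rules, key=lambda r: r[0],
                                           reverse=True):
        if predicate(component):
            return tag
    return "UNKNOWN"

# Hypothetical rules: a parallel vector pair -> DOUBLEBOND, a lone
# vector -> BOND.
rules = [
    (10, "DOUBLEBOND", lambda c: c["parallel_pairs"] >= 1),
    (5, "BOND", lambda c: c["vectors"] == 1),
]
```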
<h3 id="models">Models</h3>
<p>The system is primarily rule-based but utilizes machine learning components for specific sub-tasks:</p>
<ul>
<li><strong>OCR:</strong> A trainable OCR module using supervised machine learning to recognize atom labels (H, C, N, O). The specific classifier is not detailed in the paper.</li>
<li><strong>Rule Base:</strong> An XML file containing the expert system logic. This is the &ldquo;model&rdquo; for structural interpretation.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed strictly within the context of the TREC competition.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall (Perfect Match)</td>
          <td>656 / 1000</td>
          <td>N/A</td>
          <td>Strict structural identity required.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software Stack:</strong> Platform-independent Java libraries.</li>
<li><strong>Compute:</strong> Batch mode processing supported; specific hardware requirements (CPU/RAM) were not disclosed.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>chemoCR (Fraunhofer SCAI)</td>
          <td>Software</td>
          <td>Unknown</td>
          <td>Project page defunct; tool was proprietary</td>
      </tr>
      <tr>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemoCR.chem.update.pdf">TREC 2011 Proceedings Paper</a></td>
          <td>Paper</td>
          <td>Public</td>
          <td>Official NIST proceedings</td>
      </tr>
  </tbody>
</table>
<p>No source code was publicly released. The chemoCR system was a proprietary tool from Fraunhofer SCAI. The TREC 2011 I2S dataset was distributed to competition participants and is not independently hosted.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zimmermann, M. (2011). Chemical Structure Reconstruction with chemoCR. <em>TREC 2011 Proceedings</em>.</p>
<p><strong>Publication</strong>: Text REtrieval Conference (TREC) 2011</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zimmermannChemicalStructureReconstruction2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Structure Reconstruction with {{chemoCR}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Text {{REtrieval Conference}} ({{TREC}}) 2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zimmermann, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Structural Analysis of Handwritten Chemical Formulas</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</guid><description>A 1999 methodology for recognizing handwritten chemical structures using a structural graph representation and recursive specialists.</description><content:encoded><![CDATA[<h2 id="contribution-structural-approach-to-document-analysis">Contribution: Structural Approach to Document Analysis</h2>
<p><strong>Method</strong>.
This paper proposes a system architecture for document analysis. It introduces a specific pipeline (Global Perception followed by Incremental Extraction) and validates this strategy with recognition rates on specific tasks. The core contribution is the shift from bitmap-based processing to a <strong>structural graph representation</strong> of graphical primitives.</p>
<h2 id="motivation-overcoming-bitmap-limitations-in-freehand-drawings">Motivation: Overcoming Bitmap Limitations in Freehand Drawings</h2>
<ul>
<li><strong>Complexity of Freehand</strong>: Freehand drawings contain fluctuating lines and noise that make standard vectorization techniques difficult to apply directly.</li>
<li><strong>Limitation of Bitmap Analysis</strong>: Most existing systems at the time attempted to interpret the document by working directly on the static bitmap image throughout the process.</li>
<li><strong>Need for Context</strong>: Interpretation requires a dynamic resource that can evolve as knowledge is extracted (e.g., recognizing a polygon changes the context for its neighbors).</li>
</ul>
<h2 id="novelty-dynamic-structural-graphs-and-recursive-specialists">Novelty: Dynamic Structural Graphs and Recursive Specialists</h2>
<p>The authors propose a <strong>Structural Representation</strong> as the unique resource for interpretation.</p>
<ul>
<li><strong>Quadrilateral Primitives</strong>: The system builds Quadrilaterals (pairs of vectors) to represent thin shapes, which are robust to handwriting fluctuations.</li>
<li><strong>Structural Graph</strong>: These primitives are organized into a graph where arcs represent geometric relationships (T-junctions, L-junctions, parallels).</li>
<li><strong>Specialist Agents</strong>: Interpretation is driven by independent modules (specialists) that browse this graph recursively to identify high-level chemical entities like rings (polygons) or chains.</li>
</ul>
<h2 id="experimental-setup-and-outcomes">Experimental Setup and Outcomes</h2>
<ul>
<li><strong>Validation Set</strong>: The system was tested on 20 handwritten off-line documents containing chemical formulas at 300 dpi resolution.</li>
<li><strong>Text Database</strong>: A separate base of 328 models was used for the text recognition component.</li>
<li><strong>High Graphical Accuracy</strong>: The system achieved a recognition rate of $\approx 97\%$ for graphical parts (chemical elements like rings and bonds).</li>
<li><strong>Text Recognition</strong>: The text recognition module achieved a success rate of $\approx 93\%$.</li>
<li><strong>Robustness</strong>: The structural graph approach successfully handled multiple liaisons, polygons, and chains, allowing a solution to be built progressively and kept consistent with the evolving context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Handwritten Documents</td>
          <td>20 docs</td>
          <td>Off-line documents at 300 dpi</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Character Models</td>
          <td>328 models</td>
          <td>Used for the Pattern Matching text recognition base</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The interpretation process is divided into two distinct phases:</p>
<p><strong>1. Global Perception (Graph Construction)</strong></p>
<ul>
<li><strong>Vectorization</strong>: Contour tracking produces a chain of vectors, which are simplified via iterative polygonal approximation until fusion stabilizes (2-5 iterations).</li>
<li><strong>Quadrilateral Formation</strong>: Vectors are paired to form quadrilaterals based on Euclidean distance and &ldquo;empirical&rdquo; alignment criteria.</li>
<li><strong>Graph Generation</strong>: Quadrilaterals become nodes. Arcs are created based on &ldquo;zones of influence&rdquo; and classified into 5 types: T-junction, Intersection (X), Parallel (//), L-junction, and Successive (S).</li>
<li><strong>Redraw Heuristic</strong>: A pre-processing step transforms T, X, and S junctions into L or // relations, as chemical drawings primarily consist of L-junctions and parallels.</li>
</ul>
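<p>The iterative fusion step above can be sketched as follows. This is a minimal illustration rather than the authors&rsquo; implementation; the 10-degree collinearity tolerance is an assumed parameter.</p>

```python
import math

def simplify_chain(points, angle_tol_deg=10.0):
    """One fusion pass: drop a vertex when the incoming and outgoing
    directions differ by less than angle_tol_deg (assumed tolerance)."""
    if len(points) < 3:
        return points
    out = [points[0]]
    for i in range(1, len(points) - 1):
        ax, ay = points[i][0] - out[-1][0], points[i][1] - out[-1][1]
        bx, by = points[i + 1][0] - points[i][0], points[i + 1][1] - points[i][1]
        turn = math.degrees(abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by)))
        if turn >= angle_tol_deg:  # keep the vertex: real direction change
            out.append(points[i])
    out.append(points[-1])
    return out

def polygonal_approximation(points):
    """Repeat fusion passes until the chain stabilizes
    (the paper reports convergence in 2-5 iterations)."""
    prev = None
    while prev != points:
        prev, points = points, simplify_chain(points)
    return points
```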
<p><strong>2. Specialists (Interpretation)</strong></p>
<ul>
<li><strong>Liaison Specialist</strong>: Scans the graph for // arcs or quadrilaterals with free extremities to identify bonds.</li>
<li><strong>Polygon/Chain Specialist</strong>: Uses recursive <code>look-left</code> and <code>look-right</code> procedures. If a search returns to the start node after $n$ steps, a polygon is detected.</li>
<li><strong>Text Localization</strong>: Clusters &ldquo;short&rdquo; quadrilaterals by physical proximity into &ldquo;focus zones&rdquo;. Zones are classified as text/non-text based on connected components.</li>
</ul>
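<p>The polygon detection rule above can be sketched as a bounded depth-first walk that reports a cycle when it returns to its start node. The adjacency-dictionary representation and the step bound are assumptions; the original <code>look-left</code>/<code>look-right</code> procedures choose the next arc by turning direction.</p>

```python
def find_polygon(adjacency, start, max_steps=8):
    """Return the node path of a polygon through `start`, or None."""
    def walk(node, path):
        for nxt in adjacency.get(node, []):
            if nxt == start and len(path) >= 3:
                return path          # closed after n >= 3 steps: polygon
            if nxt not in path and len(path) < max_steps:
                found = walk(nxt, path + [nxt])
                if found:
                    return found
        return None
    return walk(start, [start])
```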
<h3 id="models">Models</h3>
<p><strong>Text Recognition Hybrid</strong>:</p>
<ol>
<li><strong>Normalization &amp; Pattern Matching</strong>: A classic method using the database of 328 models.</li>
<li><strong>Structural Rule Base</strong>: Uses &ldquo;significant&rdquo; quadrilaterals (length $\ge 1/3$ of zone dimension) to verify characters. A rule base defines the expected count of horizontal, vertical, right-diagonal, and left-diagonal lines for each character.</li>
</ol>
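<p>A sketch of the structural verification step: count significant strokes by orientation and compare against per-character expectations. The rule entries and the 22.5-degree orientation bins are illustrative assumptions; only the &ldquo;length $\ge 1/3$ of zone dimension&rdquo; filter comes from the paper.</p>

```python
# Hypothetical rule entries: expected (horizontal, vertical,
# right-diagonal, left-diagonal) stroke counts per character.
RULES = {"H": (1, 2, 0, 0), "N": (0, 2, 1, 0)}

def classify_strokes(quads, zone_w, zone_h):
    """quads are (length, angle_deg) pairs for detected quadrilaterals."""
    counts = [0, 0, 0, 0]
    threshold = max(zone_w, zone_h) / 3  # "significant" length filter
    for length, angle in quads:
        if length < threshold:
            continue
        a = angle % 180
        if a < 22.5 or a >= 157.5:
            counts[0] += 1            # horizontal
        elif 67.5 <= a < 112.5:
            counts[1] += 1            # vertical
        elif a < 67.5:
            counts[2] += 1            # right diagonal (assumed convention)
        else:
            counts[3] += 1            # left diagonal
    return tuple(counts)

def verify(char, quads, zone_w, zone_h):
    return RULES.get(char) == classify_strokes(quads, zone_w, zone_h)
```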
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graphical Element Recognition</td>
          <td>~97%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents (Fig. 7 examples)</td>
      </tr>
      <tr>
          <td>Text Recognition</td>
          <td>~93%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ramel, J.-Y., Boissier, G., &amp; Emptoz, H. (1999). Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image. <em>Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR &lsquo;99)</em>, 83-86. <a href="https://doi.org/10.1109/ICDAR.1999.791730">https://doi.org/10.1109/ICDAR.1999.791730</a></p>
<p><strong>Publication</strong>: ICDAR 1999</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ramelAutomaticReadingHandwritten1999,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{Fifth International Conference}} on {{Document Analysis}} and {{Recognition}}. {{ICDAR}} &#39;99 ({{Cat}}. {{No}}.{{PR00318}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ramel, J.-Y. and Boissier, G. and Emptoz, H.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1999</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{83--86}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Bangalore, India}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1999.791730}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-0-7695-0318-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at TREC-CHEM 2011: Optical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</guid><description>A methodological overview of OSRA, an open-source pipeline for converting chemical structure images into machine-readable formats.</description><content:encoded><![CDATA[<h2 id="contribution-method-and-resource">Contribution: Method and Resource</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>, with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<p>It is Methodological because it details the specific algorithmic workflow (segmentation, binarization, vectorization, and rule-based recognition) used to translate pixel data into chemical semantics. It specifically addresses the &ldquo;Image2Structure&rdquo; task. It also serves as a Resource contribution by introducing OSRA as a free, open-source utility available to the community.</p>
<h2 id="motivation-limitations-of-standard-ocr-in-chemistry">Motivation: Limitations of Standard OCR in Chemistry</h2>
<p>A vast body of chemical information exists in journal publications and patents as two-dimensional structure diagrams. While human-readable, these images are inaccessible to machine data-mining techniques like virtual screening. Standard Optical Character Recognition (OCR) is insufficient, and widely used techniques such as wavelet transforms or neural networks (as used in face recognition) do not transfer directly: chemical diagrams contain far more structural complexity than alphabet characters, and misinterpreting a single element can yield a valid but incorrect molecule.</p>
<h2 id="core-innovation-chemistry-aware-heuristic-pipeline">Core Innovation: Chemistry-Aware Heuristic Pipeline</h2>
<p>The authors present a specialized pipeline distinct from standard OCR that combines image processing with domain-specific chemical logic. Key technical contributions include:</p>
<ul>
<li><strong>Entropy-based Page Segmentation</strong>: A statistical method using row entropy to distinguish between pages with mixed text/graphics and pages with single structures.</li>
<li><strong>Custom Binarization</strong>: A specific grayscale conversion ($Gr=\min(R,G,B)$).</li>
<li><strong>Heuristic Confidence Scoring</strong>: A linear &ldquo;confidence function&rdquo; derived from atom and ring counts to select the best structure resolution.</li>
<li><strong>Specialized Bond Recognition</strong>: Algorithms to detect bridge bonds, wedge/dashed bonds (3D info), and aromatic rings via inner circles.</li>
</ul>
<h2 id="methodology-evaluation-on-trec-chem-image2structure">Methodology: Evaluation on TREC-CHEM Image2Structure</h2>
<p>The system was validated through submission to the <strong>Image2Structure task of TREC-CHEM</strong>.</p>
<ul>
<li><strong>Version</strong>: OSRA version 1.3.8 was used without modifications.</li>
<li><strong>Setup</strong>: Two runs were submitted: one with default settings (automatic scale selection) and one fixed at 300 dpi.</li>
<li><strong>Data</strong>: The evaluation used a &ldquo;Training set&rdquo; and a &ldquo;Challenge Set&rdquo; provided by the task organizers.</li>
<li><strong>Metric</strong>: Recall rates were measured for both sets.</li>
</ul>
<h2 id="results-and-real-world-impact">Results and Real-World Impact</h2>
<ul>
<li><strong>Performance</strong>: The default settings achieved an <strong>84.3%</strong> recall on the training set and <strong>84.8%</strong> on the challenge set. The 300 dpi run performed slightly better (86.1% training, 85.6% challenge).</li>
<li><strong>Utility</strong>: The tool is widely used by academic and commercial researchers to extract data from patents (USPTO, JPO).</li>
<li><strong>Validation</strong>: Recognition rates have shown steady improvement over a 3-year development period.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://osra.sourceforge.net">OSRA (SourceForge)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Open-source OCSR tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The primary evaluation data came from the TREC-CHEM Image2Structure task.</li>
<li><strong>Reference Datasets</strong>: The paper references the &ldquo;Chem-Infty Dataset&rdquo; as a source of ground-truthed chemical structure images.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The OSRA pipeline is heuristic-heavy. Key implementation details for replication include:</p>
<p><strong>1. Page Segmentation</strong></p>
<ul>
<li><strong>Entropy Calculation</strong>: Used to detect text vs. graphics. Entropy $E = -p \log p$ is calculated for rows in a feature matrix of component distances.</li>
<li><strong>Thresholds</strong>: Max entropy &gt; 6 indicates mixed text/graphics; $\le$ 3 indicates a single structure. A threshold of <strong>4</strong> is used to distinguish the two.</li>
<li><strong>Separator Removal</strong>: Linear separators (aspect ratio above 100 or below 0.01, size above 300 pixels) are deleted early. Table frames are identified as connected components with aspect ratio between 0.1 and 10, with at least 300 pixels lying on the surrounding rectangle.</li>
<li><strong>Text Removal</strong>: Text blocks are identified if a group of segments (distance determined by local minima in distance matrix) contains &gt; 8 segments, has a fill ratio &gt; 0.2, or aspect ratio &gt; 10.</li>
</ul>
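<p>The entropy test above can be sketched directly from the reported thresholds. The base-2 logarithm and the toy feature rows are assumptions; the decision threshold of 4 is the paper&rsquo;s.</p>

```python
import math
from collections import Counter

def row_entropy(row):
    """Shannon entropy E = -sum(p log p) over the value distribution
    of one row of the component-distance feature matrix."""
    total = len(row)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(row).values())

def classify_page(feature_matrix, threshold=4.0):
    """Max row entropy above the threshold indicates mixed
    text/graphics; otherwise a single structure."""
    max_e = max(row_entropy(r) for r in feature_matrix)
    return "mixed" if max_e > threshold else "single"
```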
<p><strong>2. Image Preprocessing</strong></p>
<ul>
<li><strong>Grayscale</strong>: $Gr = \min(R, G, B)$.</li>
<li><strong>Resolutions</strong>: Processed at 72, 150, 300 dpi, and a dynamic resolution between 500-1200 dpi.</li>
<li><strong>Noise Factor</strong>: Ratio of 2-pixel line segments to 3-pixel line segments. If this factor is between <strong>0.5 and 1.0</strong>, anisotropic smoothing (GREYCstoration) is applied.</li>
<li><strong>Thinning</strong>: Uses the method by J. M. Cychosz to reduce lines to 1 pixel width.</li>
</ul>
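<p>Two of the preprocessing rules are simple enough to state in code; a sketch using the reported formulas (the zero-denominator behavior is an assumption):</p>

```python
def to_grayscale(pixel):
    """OSRA's reported grayscale conversion: Gr = min(R, G, B)."""
    r, g, b = pixel
    return min(r, g, b)

def needs_smoothing(two_px_segments, three_px_segments):
    """Noise factor = ratio of 2-pixel to 3-pixel line segments;
    anisotropic smoothing applies when it lies in [0.5, 1.0]."""
    if three_px_segments == 0:
        return False  # assumption: undefined ratio means no smoothing
    factor = two_px_segments / three_px_segments
    return 0.5 <= factor <= 1.0
```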
<p><strong>3. Vectorization &amp; Atom Detection</strong></p>
<ul>
<li><strong>Library</strong>: Potrace is used for vectorization.</li>
<li><strong>Atom Identification</strong>: Atoms are detected at Bezier curve control points if:
<ul>
<li>Potrace classifies it as a corner.</li>
<li>Direction change normal component is $\ge$ 2 pixels.</li>
<li>The distance from the last atom to the next control point is less than the distance from the last atom to the current control point.</li>
</ul>
</li>
<li><strong>OCR</strong>: GOCR and OCRAD are used for label recognition on connected sets smaller than max character dimensions. Tesseract and Cuneiform were also tested but did not improve recognition results.</li>
</ul>
<p><strong>4. Chemical Logic</strong></p>
<ul>
<li><strong>Average Bond Length</strong>: Defined as the value at the <strong>75th percentile</strong> of the sorted bond length list (to avoid bias from small artifacts).</li>
<li><strong>Aromaticity</strong>: Flagged if a circle is found inside a ring, atoms are within half the average bond length of the circle, and bond angles to the center are less than 90 degrees.</li>
<li><strong>Bridge Bonds</strong>: Detected if an atom connected to 4 pairwise collinear single bonds (none terminal) can be removed without changing the fragment count or the number of rotatable bonds, and without reducing the number of 5- and 6-membered rings by 2.</li>
</ul>
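<p>The percentile-based bond length is easy to pin down; the exact index convention is an assumption (the paper only states the 75th percentile of the sorted list):</p>

```python
def average_bond_length(lengths):
    """'Average' bond length taken at the 75th percentile of the
    sorted list, which downweights small artifact segments."""
    s = sorted(lengths)
    return s[int(0.75 * (len(s) - 1))]  # nearest-rank-style index (assumed)
```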
<p><strong>5. Connection Table Compilation</strong></p>
<ul>
<li><strong>Library</strong>: OpenBabel is used for conversion into SMILES or SDF formats.</li>
<li><strong>Process</strong>: A molecular object is constructed from connectivity information along with stereo- and aromaticity flags. Superatom fragments are added at this stage using a user-modifiable dictionary.</li>
</ul>
<h3 id="models">Models</h3>
<p>This is a non-learning based system (Rule-based/Heuristic). However, it uses a tuned linear function for confidence estimation.</p>
<p><strong>Confidence Function</strong>: Used to select the best resolution result.</p>
<p>$$
\begin{aligned}
\text{confidence} &amp;= 0.316030 - 0.016315 N_C + 0.034336 N_N + 0.066810 N_O \\
&amp;+ 0.035674 N_F + 0.065504 N_S + 0.04 N_{Cl} + 0.066811 N_{Br} \\
&amp;+ 0.01 N_R - 0.02 N_{Xx} - 0.212739 N_{rings} + 0.071300 N_{aromatic} \\
&amp;+ 0.329922 N_{rings5} + 0.342865 N_{rings6} - 0.037796 N_{fragments}
\end{aligned}
$$</p>
<p>Where $N_C$ is carbon count, $N_{rings}$ is ring count, etc.</p>
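<p>Since the confidence function is fully specified, it can be transcribed directly; only the dictionary-based interface is an illustrative choice:</p>

```python
# Coefficients exactly as reported in the paper.
INTERCEPT = 0.316030
COEFFS = {
    "C": -0.016315, "N": 0.034336, "O": 0.066810, "F": 0.035674,
    "S": 0.065504, "Cl": 0.04, "Br": 0.066811, "R": 0.01, "Xx": -0.02,
    "rings": -0.212739, "aromatic": 0.071300,
    "rings5": 0.329922, "rings6": 0.342865, "fragments": -0.037796,
}

def confidence(counts):
    """counts maps feature name (e.g. 'C', 'rings6') to its count;
    the resolution with the highest confidence is selected."""
    return INTERCEPT + sum(COEFFS[k] * v for k, v in counts.items())
```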
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run</th>
          <th>Training Set</th>
          <th>Challenge Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall</td>
          <td>Default Settings</td>
          <td>84.3%</td>
          <td>84.8%</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>Fixed 300 dpi</td>
          <td>86.1%</td>
          <td>85.6%</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2011). Optical Structure Recognition Application entry in Image2Structure task. <em>TREC-CHEM</em>.</p>
<p><strong>Publication</strong>: TREC-CHEM 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://osra.sourceforge.net">SourceForge Project</a></li>
<li><a href="https://launchpad.net/cuneiform-linux">Cuneiform Linux Port</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{filippovOpticalStructureRecognition2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry in {{Image2Structure}} Task}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V. and Katsubo, Dmitry and Nicklaus, Marc C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{National Cancer Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">note</span> = <span style="color:#e6db74">{TREC-CHEM Entry}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé-1 System for Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</guid><description>Foundational OCSR method combining neural OCR with chemical rule-based post-processing for automated structure interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1996). Automatic Interpretation of Chemical Structure Diagrams. <em>Graphics Recognition. Methods and Applications</em>, 148-158. <a href="https://doi.org/10.1007/3-540-61226-2_13">https://doi.org/10.1007/3-540-61226-2_13</a></p>
<p><strong>Publication</strong>: Lecture Notes in Computer Science (LNCS), Vol. 1072, Springer, 1996.</p>
<h2 id="system-architecture-and-contribution">System Architecture and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel software architecture (&ldquo;Kekulé-1&rdquo;) designed to solve the specific technical problem of converting rasterized chemical diagrams into machine-readable connection tables. The paper is characterized by:</p>
<ul>
<li><strong>Algorithmic Specification</strong>: It details specific algorithms for vectorization, polygon approximation, and character recognition.</li>
<li><strong>Performance Metrics</strong>: It validates the method using quantitative accuracy (98.9%) and speed comparisons against manual entry.</li>
<li><strong>System Architecture</strong>: It describes the integration of typically disparate components (OCR, vectorization, chemical rules) into a cohesive pipeline.</li>
</ul>
<h2 id="motivation-the-chemical-data-entry-bottleneck">Motivation: The Chemical Data Entry Bottleneck</h2>
<p>Chemical structure diagrams are the primary medium for communication between chemists, but computers cannot natively &ldquo;read&rdquo; these raster images.</p>
<ul>
<li><strong>Efficiency Gap</strong>: Manual redrawing of structures into chemical databases takes 6 to 10 minutes per structure.</li>
<li><strong>Technical Challenge</strong>: Existing commercial OCR systems failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), nor could they recognize small fonts (3-7 points) or chemical symbols accurately.</li>
<li><strong>Goal</strong>: To create an &ldquo;Optical Chemical Structure Recognition&rdquo; (OCSR) system that reduces processing time to seconds while handling complex notation like stereochemistry and group formulas.</li>
</ul>
<h2 id="core-innovations-in-chemical-ocr">Core Innovations in Chemical OCR</h2>
<p>Kekulé-1 represents the &ldquo;first successful attempt&rdquo; to integrate image processing, OCR, and structure editing into a single workflow. Key innovations include:</p>
<ul>
<li><strong>Context-Aware OCR</strong>: Unlike standard OCR, Kekulé-1 uses &ldquo;chemical spell checking&rdquo; by applying valence rules and chemical context to correct raw character recognition errors (e.g., distinguishing &lsquo;5&rsquo; from &lsquo;S&rsquo; based on bonding).</li>
<li><strong>Adaptive Polygon Approximation</strong>: A modified vectorization algorithm that partitions objects at the farthest node to prevent artifact nodes in U-shaped structures.</li>
<li><strong>Hybrid Parsing</strong>: It treats the diagram as a graph where nodes can be explicit atoms or geometric intersections, using rule-based logic to parse &ldquo;group formulas&rdquo; (like $COOH$) recursively.</li>
</ul>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The authors evaluated the system on a private test set to validate robustness and speed.</p>
<ul>
<li><strong>Dataset</strong>: 524 chemical structures chosen from a &ldquo;wide variety of sources&rdquo; specifically to test the system&rsquo;s limits.</li>
<li><strong>Metrics</strong>: Success rate (percentage of structures processed with minimal editing) and processing time per structure.</li>
<li><strong>Comparators</strong>: Performance was compared against the &ldquo;manual redrawing&rdquo; baseline.</li>
</ul>
<h2 id="results-performance-and-conclusions">Results, Performance, and Conclusions</h2>
<ul>
<li><strong>High Accuracy</strong>: 98.9% of the test structures were successfully processed (with an average of 0.74 user prompts per structure).</li>
<li><strong>Speedup</strong>: Processing took 7 to 30 seconds per structure, a significant improvement over the 6 to 10 minute manual baseline.</li>
<li><strong>Robustness</strong>: The system successfully handled pathological cases like broken characters, skew (rotation), and touching characters.</li>
<li><strong>Impact</strong>: The authors conclude that the techniques are generalizable to other domains like electrical circuits and utility maps.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: The evaluation used 524 chemical structures. These were not released publicly but were selected to represent &ldquo;limit&rdquo; cases.</li>
<li><strong>Input format</strong>: Scanned images at 300-400 dpi. The authors note that higher resolutions do not add information due to ink wicking and paper limitations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details several specific algorithmic implementations:</p>
<p><strong>Vectorization (Polygon Approximation)</strong>:</p>
<ul>
<li>Standard thinning and raster-to-vector translation are used.</li>
<li><strong>Innovation</strong>: The algorithm searches for the node <em>farthest</em> from the current start node to partition the object. This prevents artifact nodes in curved lines.</li>
<li><strong>Threshold Formula</strong>: The allowed deviation ($dist$) from a straight line is adaptive based on segment length ($length$):</li>
</ul>
<p>$$\mathit{dist} = \max\left(1,\ \frac{\mathit{length}}{10.0} + 0.4\right)$$</p>
<p>(Units in pixels)</p>
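<p>The adaptive threshold and its use in a straightness test can be sketched as follows; the perpendicular-distance check is an illustrative reading (the original additionally partitions at the node farthest from the start node):</p>

```python
import math

def deviation_threshold(length):
    """Kekulé-1's adaptive tolerance in pixels: max(1, length/10 + 0.4)."""
    return max(1.0, length / 10.0 + 0.4)

def is_straight(points):
    """Accept a chain as one straight segment when every interior point
    lies within the adaptive threshold of the chord."""
    (ax, ay), (bx, by) = points[0], points[-1]
    length = math.hypot(bx - ax, by - ay)
    tol = deviation_threshold(length)
    for px, py in points[1:-1]:
        dev = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / (length or 1.0)
        if dev > tol:
            return False
    return True
```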
<p><strong>Rotation Correction</strong>:</p>
<ul>
<li>The system computes the angle of all &ldquo;long&rdquo; line segments modulo 15 degrees.</li>
<li>It bins these angles; the bin with the highest count (representing &lt; 4 degrees rotation) is treated as the scan skew and corrected.</li>
</ul>
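<p>A sketch of the skew estimate under these rules; the 0.5-degree bin width is an assumed parameter:</p>

```python
import math
from collections import Counter

def estimate_skew(segments, bin_width=0.5):
    """Bin long-segment angles modulo 15 degrees and take the fullest
    bin as the scan skew (corrected when under ~4 degrees)."""
    bins = Counter()
    for (ax, ay), (bx, by) in segments:
        angle = math.degrees(math.atan2(by - ay, bx - ax)) % 15
        bins[round(angle / bin_width)] += 1
    return bins.most_common(1)[0][0] * bin_width
```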
<p><strong>Optical Character Recognition (OCR)</strong>:</p>
<ul>
<li>Uses a neural network with linked/shared weights (similar to Convolutional Neural Networks, though not named as such) acting as a feature detector.</li>
<li><strong>Training</strong>: Trained on specific chemical fonts.</li>
<li><strong>Inference</strong>: Outputs are ranked; if multiple characters (e.g., &lsquo;5&rsquo; and &lsquo;S&rsquo;) exceed a threshold, both are kept, and chemical context resolves the ambiguity later.</li>
</ul>
<p><strong>Chemical Parsing</strong>:</p>
<ul>
<li>Group formulas (e.g., $COOH$) are parsed left-to-right by subtracting valences.</li>
<li>Example: For $COOH$, the external bond reduces Carbon&rsquo;s valence to 3. The first Oxygen takes 2, leaving 1. The final Oxygen takes 1 (attaching to Carbon), and the Hydrogen takes 1 (attaching to Oxygen).</li>
</ul>
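<p>The worked example can be reproduced with a small valence-subtraction routine. The greedy &ldquo;attach to the earliest atom with free valence, taking as many bonds as both allow&rdquo; rule is an assumption that matches the $COOH$ walkthrough; it is not claimed to be the full parser.</p>

```python
VALENCE = {"C": 4, "O": 2, "H": 1}

def parse_group(symbols, external_bonds=1):
    """Left-to-right valence subtraction over a linear group formula.
    Returns bonds as (i, j, order) index triples."""
    free = []    # [atom index, remaining valence]
    bonds = []
    for i, sym in enumerate(symbols):
        v = VALENCE[sym] - (external_bonds if i == 0 else 0)
        for slot in free:
            take = min(slot[1], v)
            if take:                     # bond to earliest free atom
                bonds.append((slot[0], i, take))
                slot[1] -= take
                v -= take
                break
        free.append([i, v])
    return bonds

# COOH yields a C=O double bond, a C-O single bond, and an O-H bond.
```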
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Model</strong>: A neural network with a &ldquo;shared weights&rdquo; paradigm, effectively creating a learned convolution map. It achieves ~99.9% raw accuracy on isolated test sets of chemical fonts.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: The evaluation was performed on an <strong>80486 processor at 33 MHz</strong>.</li>
<li><strong>Time</strong>: Average processing time was 9 seconds per structure.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{mcdanielAutomaticInterpretationChemical1996,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Interpretation of Chemical Structure Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Graphics Recognition. Methods and Applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{O&#39;Gorman, Lawrence and Kasturi, Rangachar}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{Lecture Notes in Computer Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1072}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{148--158}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1996}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/3-540-61226-2_13}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CLiDE Pro: Optical Chemical Structure Recognition Tool</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</guid><description>A methodological paper presenting CLiDE Pro, an OCSR system for reconstructing chemical graphs from images with ~90% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Valko, A. T., &amp; Johnson, A. P. (2009). CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 49(4), 780-787. <a href="https://doi.org/10.1021/ci800449t">https://doi.org/10.1021/ci800449t</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2009</p>
<h2 id="contribution-robust-algorithmic-pipeline-for-ocsr">Contribution: Robust Algorithmic Pipeline for OCSR</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper, as it proposes a specific algorithmic architecture (CLiDE Pro) for converting raster images of chemical structures into connection tables. It details the procedural steps for segmentation, vectorization, and graph reconstruction.</p>
<p>It also has a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution, as the authors compile and release a validation set of 454 real-world images to serve as a community benchmark for OCSR systems.</p>
<h2 id="motivation-bridging-the-gap-between-legacy-document-images-and-machine-readable-chemistry">Motivation: Bridging the Gap Between Legacy Document Images and Machine-Readable Chemistry</h2>
<p>While modern chemical drawing software captures structural information explicitly, the vast majority of legacy and current chemical literature (journals, patents, reports) exists as images or PDF documents. These images are human-readable but lack the semantic &ldquo;connection table&rdquo; data required for chemical databases and software. Manual redrawing is time-consuming and error-prone. Therefore, there is a critical need for efficient Optical Chemical Structure Recognition (OCSR) systems to automate this extraction.</p>
<h2 id="novelty-integrated-document-segmentation-and-ambiguity-resolution-heuristics">Novelty: Integrated Document Segmentation and Ambiguity Resolution Heuristics</h2>
<p>CLiDE Pro introduces several algorithmic improvements over its predecessor (CLiDE) and contemporary tools:</p>
<ul>
<li><strong>Integrated Document Segmentation</strong>: Unlike page-oriented systems, it processes whole documents to link information across pages.</li>
<li><strong>Robust &ldquo;Difficult Feature&rdquo; Handling</strong>: It implements specific heuristic rules to resolve ambiguities in crossing bonds (bridged structures), which are often misinterpreted as carbon atoms in other systems.</li>
<li><strong>Generic Structure Interpretation</strong>: It includes a module to parse &ldquo;generic&rdquo; (Markush) structures by matching R-group labels in the diagram with text-based definitions found in the document.</li>
<li><strong>Ambiguity Resolution</strong>: It uses context-aware rules to distinguish between geometrically similar features, such as vertical lines representing bonds vs. the letter &lsquo;l&rsquo; in &lsquo;Cl&rsquo;.</li>
</ul>
<h2 id="methodology-and-benchmarking-on-real-world-data">Methodology and Benchmarking on Real-World Data</h2>
<p>The authors conducted a systematic validation on a dataset of <strong>454 images</strong> containing <strong>519 structure diagrams</strong>.</p>
<ul>
<li><strong>Source Material</strong>: Images were extracted from published materials (journals, patents), ensuring &ldquo;real artifacts&rdquo; like noise and scanning distortions were present.</li>
<li><strong>Automation</strong>: The test was fully automated without human intervention.</li>
<li><strong>Metrics</strong>: The primary metric was the &ldquo;success rate,&rdquo; defined as the correct reconstruction of the molecule&rsquo;s connection table. They also performed fine-grained error analysis on specific features (e.g., atom labels, dashed bonds, wavy bonds).</li>
</ul>
<h2 id="results-high-topological-accuracy-and-persistent-ocr-challenges">Results: High Topological Accuracy and Persistent OCR Challenges</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved an <strong>89.79%</strong> retrieval rate (466/519 molecules correctly reconstructed).</li>
<li><strong>Robustness on Primitives</strong>: Solid straight bonds were recognized with 99.92% accuracy.</li>
<li><strong>Key Failure Modes</strong>: The majority of errors (58 cases) occurred in atom label construction, specifically when labels touched nearby bonds or other artifacts, causing OCR failures.</li>
<li><strong>Impact</strong>: The study demonstrated that handling &ldquo;difficult&rdquo; drawing features like crossing bonds and bridged structures significantly reduces topological errors. The authors released the test set to encourage standardized benchmarking in the OCSR field.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized a custom dataset designed to reflect real-world noise.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>CLiDE Pro Validation Set</td>
          <td>454 images (519 structures)</td>
          <td>Extracted from scanned journals and PDFs. Includes noise/artifacts. Available in Supporting Information.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The CLiDE Pro pipeline consists of five distinct phases. To replicate this system, one would need to implement:</p>
<ol>
<li>
<p><strong>Image Binarization</strong>:</p>
<ul>
<li>Input images are binarized using a threshold-based technique to separate foreground (molecule) from background.</li>
<li><strong>Connected Component Analysis (CCA)</strong>: A non-recursive scan identifies connected components (CCs) and generates interpixel contours (using N, S, E, W directions).</li>
</ul>
</li>
<li>
<p><strong>Document Segmentation</strong>:</p>
<ul>
<li><strong>Layout Analysis</strong>: Uses a bottom-up approach building a tree structure. It treats CCs as graph vertices and distances as edges.</li>
<li><strong>Clustering</strong>: A minimal-cost spanning tree (Kruskal&rsquo;s algorithm) groups CCs into words, lines, and blocks.</li>
<li><strong>Classification</strong>: CCs are classified (Character, Dash, Line, Graphics, Noise) based on size thresholds derived from statistical image analysis.</li>
</ul>
</li>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li><strong>Contour Approximation</strong>: Uses a method similar to <strong>Sklansky and Gonzalez</strong> to approximate contours into polygons.</li>
<li><strong>Vector Formation</strong>: Long polygon sides become straight lines; short consecutive sides become curves. Opposing borders of a line are matched to define the bond vector.</li>
<li><strong>Wavy Bonds</strong>: Detected by finding groups of short vectors lying on a straight line.</li>
<li><strong>Dashed Bonds</strong>: Detected using the <strong>Hough transform</strong> to find collinear or parallel dashes.</li>
</ul>
</li>
<li>
<p><strong>Atom Label Construction</strong>:</p>
<ul>
<li><strong>OCR</strong>: An OCR engine (filtering + topological analysis) interprets characters.</li>
<li><strong>Grouping</strong>: Characters are grouped into words based on horizontal and vertical proximity (for vertical labels).</li>
<li><strong>Superatom Lookup</strong>: Labels are matched against a database of elements, functional groups, and R-groups. Unknown linear formulas (e.g., $\text{CH}_2\text{CH}_2\text{OH}$) are parsed.</li>
</ul>
</li>
<li>
<p><strong>Graph Reconstruction</strong>:</p>
<ul>
<li><strong>Connection Logic</strong>: Bond endpoints are joined to atoms if they are within a distance threshold and &ldquo;point toward&rdquo; the label.</li>
<li><strong>Implicit Carbons</strong>: Unconnected bond ends are joined if close; parallel bonds merge into double/triple bonds.</li>
<li><strong>Crossing Bonds</strong>: Rules check proximity, length, and ring membership to determine if crossing lines are valid atoms or 3D visual artifacts.</li>
</ul>
</li>
<li>
<p><strong>Generic Structure Interpretation</strong>:</p>
<ul>
<li><strong>Text Mining</strong>: A lexical/syntactic analyzer extracts R-group definitions (e.g., &ldquo;R = Me or H&rdquo;) from text blocks.</li>
<li><strong>Matching</strong>: The system attempts to match R-group labels in the diagram with the parsed text definitions.</li>
</ul>
</li>
</ol>
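<p>The binarization and connected-component steps (phase 1) can be sketched in a few lines. This is an illustrative reimplementation under simplifying assumptions, not CLiDE Pro's actual code: the paper's non-recursive scan additionally traces interpixel contours, which is omitted here.</p>

```python
from collections import deque

def binarize(gray, threshold=128):
    """Threshold a grayscale image (list of rows) into 0/1 foreground."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def connected_components(binary):
    """Label 4-connected foreground components (N, S, E, W neighbours)
    with an iterative BFS instead of recursion."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 1 and labels[y][x] == 0:
                count += 1
                queue = deque([(y, x)])
                labels[y][x] = count
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                           and binary[ny][nx] == 1 and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

<p>The resulting components would then feed the segmentation phase, where they are classified by size and clustered into words, lines, and blocks.</p>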
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Engine</strong>: The system relies on a customized OCR engine capable of handling rotation and chemical symbols, though the specific architecture (neural vs. feature-based) is not detailed beyond &ldquo;topological and geometrical feature analysis&rdquo;.</li>
<li><strong>Superatom Database</strong>: A lookup table containing elements, common functional groups, and R-group labels.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The evaluation focused on the topological correctness of the output.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total Success Rate</strong></td>
          <td>89.79%</td>
          <td>466/519 structures perfectly reconstructed.</td>
      </tr>
      <tr>
          <td><strong>Atom Label Accuracy</strong></td>
          <td>98.54%</td>
          <td>3923/3981 labels correct. Main error source: labels touching bonds.</td>
      </tr>
      <tr>
          <td><strong>Solid Bond Accuracy</strong></td>
          <td>99.92%</td>
          <td>16061/16074 solid bonds correct.</td>
      </tr>
      <tr>
          <td><strong>Dashed Bond Accuracy</strong></td>
          <td>98.37%</td>
          <td>303/308 dashed bonds correct.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; described as efficient.</li>
<li><strong>Performance</strong>: The system processed the complex Palytoxin structure &ldquo;within a few seconds&rdquo;. This implies low computational overhead suitable for standard desktop hardware of the 2009 era.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{valkoCLiDEProLatest2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Valko, Aniko T. and Johnson, A. Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{780--787}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800449t}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInk: Real-Time Recognition for Chemical Drawings</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</guid><description>A sketch recognition framework for chemical diagrams using a joint CRF model to combine multi-level visual features for real-time interpretation.</description><content:encoded><![CDATA[<h2 id="contribution-real-time-sketch-recognition-method">Contribution: Real-Time Sketch Recognition Method</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for sketch recognition that integrates visual features at three distinct levels (inkpoints, segments, symbols) into a single probabilistic model. The rhetorical structure centers on the proposal of this new architecture, the introduction of a specific &ldquo;trainable corner detector&rdquo; algorithm, and the validation of these methods against existing benchmarks and alternative toolsets (ChemDraw).</p>
<h2 id="motivation-bridging-the-gap-between-sketching-and-cad">Motivation: Bridging the Gap Between Sketching and CAD</h2>
<p>The primary motivation is to bridge the gap between the natural, efficient process of drawing chemical diagrams by hand and the cumbersome &ldquo;point-click-and-drag&rdquo; interactions required by CAD tools like ChemDraw. While chemists prefer sketching for communication, existing digital tools do not offer the same speed or ease of use. The goal is to build an intelligent system that understands freehand sketches in real-time, converting them into structured data suitable for analysis or search.</p>
<h2 id="core-innovation-hierarchical-joint-crf-model">Core Innovation: Hierarchical Joint CRF Model</h2>
<p>The core novelty lies in the <strong>hierarchical joint model</strong>. Unlike previous approaches that might treat stroke segmentation and symbol recognition as separate, isolated steps, ChemInk uses a <strong>Conditional Random Field (CRF)</strong> to jointly model dependencies across three levels:</p>
<ol>
<li><strong>Inkpoints</strong>: Local visual appearance.</li>
<li><strong>Segments</strong>: Stroke fragments separated by corners.</li>
<li><strong>Candidates</strong>: Potential symbol groupings.</li>
</ol>
<p>Additionally, the paper introduces a <strong>trainable corner detector</strong> that learns domain-specific corner definitions from data.</p>
<h2 id="experimental-design-and-baselines">Experimental Design and Baselines</h2>
<p>The authors conducted two primary evaluations:</p>
<ol>
<li><strong>Off-line Accuracy Evaluation</strong>:
<ul>
<li><strong>Dataset</strong>: 12 real-world organic compounds drawn by 10 participants.</li>
<li><strong>Metric</strong>: Recognition accuracy (Recall and Precision).</li>
<li><strong>Baseline</strong>: Comparison against their own previous work (O&amp;D 2009) and ablations (with/without context).</li>
</ul>
</li>
<li><strong>On-line User Study</strong>:
<ul>
<li><strong>Task</strong>: 9 participants (chemistry students) drew 5 diagrams using both ChemInk (Tablet PC) and ChemDraw (Mouse/Keyboard).</li>
<li><strong>Metric</strong>: Time to completion and subjective user ratings (speed/ease of use).</li>
</ul>
</li>
</ol>
<h2 id="results-accuracy-and-user-study-outcomes">Results: Accuracy and User Study Outcomes</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>97.4% symbol recognition accuracy</strong>, slightly outperforming the best prior result (97.1%). The trainable corner detector achieved <strong>99.91% recall</strong>.</li>
<li><strong>Speed</strong>: Users were <strong>twice as fast</strong> using ChemInk (avg. 36s) compared to ChemDraw (avg. 79s).</li>
<li><strong>Usability</strong>: Participants rated ChemInk significantly higher for speed (6.3 vs 4.5) and ease of use (6.3 vs 4.7) on a 7-point scale.</li>
<li><strong>Conclusion</strong>: Sketch recognition is a viable, superior alternative to standard CAD tools for authoring chemical diagrams.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: 12 real-world organic compounds (e.g., Aspirin, Penicillin) drawn by 10 participants familiar with organic chemistry.</li>
<li><strong>Evaluation Split</strong>: User-independent cross-validation (training on 9 users, testing on 1).</li>
<li><strong>Input</strong>: Raw digital ink (strokes) collected on a Tablet PC.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Corner Detection (Trainable)</strong></p>
<ul>
<li><strong>Method</strong>: Iterative vertex elimination.</li>
<li><strong>Cost Function</strong>: $cost(p_{i}) = \sqrt{mse(s_{i}; p_{i-1}, p_{i+1})} \cdot dist(p_{i}; p_{i-1}, p_{i+1})$</li>
<li><strong>Procedure</strong>: Repeatedly remove the vertex with the lowest cost until the classifier (trained on features like cost, diagonal length, ink density) predicts the remaining vertices are corners.</li>
</ul>
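<p>The iterative vertex elimination above can be sketched as follows. Note the simplification: <code>simplify_until</code> uses a fixed cost threshold as a stand-in for the paper's trained classifier, and <code>mse</code> is taken only over the retained vertices rather than all raw stroke points.</p>

```python
import math

def point_line_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((bx - ax) * (py - ay) - (by - ay) * (px - ax))
    den = math.hypot(bx - ax, by - ay)
    return num / den if den else math.hypot(px - ax, py - ay)

def vertex_cost(points, i):
    """cost(p_i) = sqrt(mse(s_i; p_{i-1}, p_{i+1})) * dist(p_i; p_{i-1}, p_{i+1})."""
    a, b = points[i - 1], points[i + 1]
    span = points[i - 1:i + 2]
    mse = sum(point_line_dist(q, a, b) ** 2 for q in span) / len(span)
    return math.sqrt(mse) * point_line_dist(points[i], a, b)

def simplify_until(points, keep_threshold):
    """Repeatedly remove the cheapest interior vertex while its cost stays
    below the threshold; surviving vertices are the detected corners."""
    pts = list(points)
    while len(pts) > 2:
        cost, i = min((vertex_cost(pts, i), i) for i in range(1, len(pts) - 1))
        if cost >= keep_threshold:
            break
        del pts[i]
    return pts
```

<p>On a near-straight stroke the interior vertices have near-zero cost and are eliminated, while a genuine corner (large deviation from the chord between its neighbours) survives.</p>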
<p><strong>2. Feature Extraction</strong></p>
<ul>
<li><strong>Inkpoints</strong>: Sampled at regular intervals. Features = $10 \times 10$ pixel orientation filters (0, 45, 90, 135 degrees) at two scales ($L/2$, $L$), smoothed and downsampled to $5 \times 5$. Total 400 features.</li>
<li><strong>Segments</strong>: Similar image features centered at segment midpoint, plus geometric features (length, ink density).</li>
<li><strong>Candidates</strong>: 5 feature images ($20 \times 20$) including an &ldquo;endpoint&rdquo; image, stretched to normalize aspect ratio.</li>
<li><strong>Dimensionality Reduction</strong>: PCA used to compress feature images to 256 components.</li>
</ul>
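<p>A minimal sketch of the inkpoint feature idea: render a stroke into four orientation-binned feature images. This is a simplified stand-in, not the paper's exact pipeline, which filters at two scales, smooths, and downsamples before PCA.</p>

```python
import math

def orientation_images(points, grid=5):
    """Bin each ink segment of a stroke into the nearest of four
    orientation channels (0, 45, 90, 135 degrees) on a grid x grid image."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    images = [[[0.0] * grid for _ in range(grid)] for _ in range(4)]
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        theta = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180
        k = int(((theta + 22.5) // 45) % 4)  # nearest of 0/45/90/135
        cx = min(grid - 1, int((x1 - min(xs)) / w * grid))
        cy = min(grid - 1, int((y1 - min(ys)) / w * grid))
        images[k][cy][cx] += 1.0
    return images
```

<p>Concatenating such channel images (at multiple scales) yields the 400-dimensional inkpoint descriptors that PCA then compresses.</p>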
<p><strong>3. Structure Generation</strong></p>
<ul>
<li><strong>Clustering</strong>: Agglomerative clustering with a complete-link metric to connect symbols.</li>
<li><strong>Threshold</strong>: Stop clustering at distance $0.4L$.</li>
</ul>
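<p>The structure-generation step can be sketched as naive complete-link agglomerative clustering; this illustrative version assumes plain Euclidean distances between symbol positions and stops at the given threshold (the paper uses $0.4L$, with $L$ a characteristic symbol scale).</p>

```python
import math

def complete_link_clusters(points, stop_distance):
    """Merge the two closest clusters (complete-link metric) until the
    smallest inter-cluster distance exceeds stop_distance."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # Complete link: the farthest pair of members across two clusters.
        return max(math.hypot(p[0] - q[0], p[1] - q[1]) for p in a for q in b)

    while len(clusters) > 1:
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > stop_distance:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```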
<h3 id="models">Models</h3>
<p><strong>Conditional Random Field (CRF)</strong></p>
<ul>
<li><strong>Structure</strong>: 3-level hierarchy (Inkpoints $V_p$, Segments $V_s$, Candidates $V_c$).</li>
<li><strong>Nodes</strong>:
<ul>
<li>$V_p, V_s$ labels: &ldquo;bond&rdquo;, &ldquo;hash&rdquo;, &ldquo;wedge&rdquo;, &ldquo;text&rdquo;.</li>
<li>$V_c$ labels: specific candidate interpretations.</li>
</ul>
</li>
<li><strong>Edges/Potentials</strong>:
<ul>
<li><strong>Entity-Feature</strong>: $\phi(y, x)$ (Linear classifier).</li>
<li><strong>Consistency</strong>: $\psi(y_i, y_j)$ (Hard constraint: child must match parent label).</li>
<li><strong>Spatial Context</strong>: $\psi_{ss}(y_i, y_j)$ (Pairwise geometric relationships between segments: angle, distance).</li>
<li><strong>Overlap</strong>: Prevents conflicting candidates from sharing segments.</li>
</ul>
</li>
<li><strong>Inference</strong>: Loopy Belief Propagation (up to 100 iterations).</li>
<li><strong>Training</strong>: Maximum Likelihood via gradient ascent (L-BFGS).</li>
</ul>
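<p>The inference step can be illustrated with sum-product loopy belief propagation on a generic pairwise model. This is a sketch of the algorithm class only; ChemInk's actual CRF adds the three-level hierarchy, consistency constraints, and overlap potentials described above.</p>

```python
import numpy as np

def loopy_bp(n_states, unary, edges, pair, n_iters=100):
    """Sum-product loopy BP on a pairwise model.  unary[i] is a length
    n_states potential for node i; pair[(i, j)] is an n_states x n_states
    potential for edge (i, j).  Returns approximate node marginals
    (exact on trees)."""
    msgs = {(i, j): np.ones(n_states)
            for (a, b) in edges for (i, j) in ((a, b), (b, a))}
    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            # Message i -> j: combine unary with incoming messages except j's.
            belief = unary[i].copy()
            for (k, l) in msgs:
                if l == i and k != j:
                    belief *= msgs[(k, l)]
            psi = pair[(i, j)] if (i, j) in pair else pair[(j, i)].T
            m = psi.T @ belief
            new[(i, j)] = m / m.sum()
        msgs = new
    marginals = []
    for i in range(len(unary)):
        b = unary[i].copy()
        for (k, l) in msgs:
            if l == i:
                b *= msgs[(k, l)]
        marginals.append(b / b.sum())
    return marginals
```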
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Accuracy (Recall/Precision) of symbol detection.</li>
<li><strong>Comparison</strong>: Compared against Ouyang &amp; Davis 2009 (previous SOTA).</li>
<li><strong>Speed Metric</strong>: Wall-clock time for diagram creation (ChemInk vs. ChemDraw).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Processor</strong>: 3.7 GHz processor (single thread) for base benchmarking (approx. 1 sec/sketch).</li>
<li><strong>Deployment</strong>: Validated on 1.8 GHz Tablet PCs using multi-core parallelization for real-time feedback.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2011). ChemInk: A Natural Real-Time Recognition System for Chemical Drawings. <em>Proceedings of the 16th International Conference on Intelligent User Interfaces</em>, 267&ndash;276. <a href="https://doi.org/10.1145/1943403.1943444">https://doi.org/10.1145/1943403.1943444</a></p>
<p><strong>Publication</strong>: IUI &lsquo;11</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyangChemInkNaturalRealtime2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemInk: A Natural Real-Time Recognition System for Chemical Drawings}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{ChemInk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 16th International Conference on Intelligent User Interfaces}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Tom Y. and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{267--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Palo Alto, CA, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/1943403.1943444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4503-0419-1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{http://hdl.handle.net/1721.1/78898}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Recognition (Rule-Based)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</guid><description>A strictly rule-based expert system (MolRec) for converting raster chemical diagrams into graph representations.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). Chemical structure recognition: A rule based approach. <em>Proceedings of SPIE</em>, 8297, 82970E. <a href="https://doi.org/10.1117/12.912185">https://doi.org/10.1117/12.912185</a></p>
<p><strong>Publication</strong>: IS&amp;T/SPIE Electronic Imaging 2012</p>
<h2 id="methodological-contribution">Methodological Contribution</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper proposes a novel mechanism (MolRec) for Optical Chemical Structure Recognition (OCSR). It focuses on defining a &ldquo;strictly rule based system&rdquo; to transform vectorised molecule images into graph representations, contrasting this declarative approach with procedural or heuristic-heavy methods. The contribution is validated through direct comparison with the leading open-source tool (OSRA).</p>
<h2 id="motivation-overcoming-procedural-heuristics">Motivation: Overcoming Procedural Heuristics</h2>
<p>Chemical literature contains vast amounts of information locked in 2D diagrams. This visual data is generally inaccessible to search tools or electronic processing. While commercial and academic tools existed (e.g., OSRA, Kekulé), they typically relied on procedural heuristics that required experimental tuning and were difficult to extend. The authors sought to create a system based on precise, declarative rewrite rules to handle the ambiguity inherent in chemical drawing conventions.</p>
<h2 id="core-innovation-geometric-rewrite-rules">Core Innovation: Geometric Rewrite Rules</h2>
<p>The core novelty is the <strong>geometric rewrite rule system</strong> (MolRec).</p>
<ul>
<li><strong>Geometric Primitives</strong>: The system operates on five high-level primitives: Line Segment, Arrow, Circle, Triangle, and Character Group.</li>
<li><strong>Fuzzy Parameters</strong>: It introduces formal definitions for &ldquo;fuzzy&rdquo; relationships (e.g., <code>dash-neighbouring</code>, <code>approximate collinearity</code>) to handle drawing irregularities and scanning artifacts.</li>
<li><strong>Ambiguity Resolution</strong>: Specific rules (R4-R6) are designed to disambiguate visual homoglyphs, such as distinguishing a &ldquo;triple bond&rdquo; from a &ldquo;dashed bold bond&rdquo; based on context (connected atoms).</li>
<li><strong>Explicit &ldquo;Cutting&rdquo;</strong>: A mechanism to identify implicit carbon nodes within continuous line segments (e.g., splitting a long line intersected by parallel lines into a double bond).</li>
</ul>
<h2 id="experimental-setup-vs-baselines">Experimental Setup vs. Baselines</h2>
<p>The authors compared their system (MolRec) against <strong>OSRA</strong> (the leading open-source system) on two datasets:</p>
<ol>
<li><strong>OSRA Benchmark</strong>: 5,735 computer-generated diagrams with ground truth MOL files.</li>
<li><strong>Maybridge Dataset</strong>: 5,730 scanned images (300dpi) from a drug catalogue, converted to ground truth MOL files via InChI lookups.</li>
</ol>
<p>Evaluation was semantic: The output MOL files were compared using OpenBabel to check for structural equivalence, ignoring syntactic file differences.</p>
<h2 id="results-and-key-findings">Results and Key Findings</h2>
<p><strong>MolRec outperformed OSRA</strong> on both datasets:</p>
<ul>
<li><strong>OSRA Benchmark</strong>: MolRec achieved <strong>88.46%</strong> accuracy vs. OSRA&rsquo;s 77.23%.</li>
<li><strong>Maybridge Dataset</strong>: MolRec achieved <strong>83.84%</strong> accuracy vs. OSRA&rsquo;s 72.57%.</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>Robustness</strong>: The line thinning + Douglas-Peucker vectorization approach was found to be more robust than Hough transform approaches used by other tools.</li>
<li><strong>Failure Modes</strong>: Major remaining errors were caused by &ldquo;touching components&rdquo; (ligatures, characters touching bonds) and complex &ldquo;superatoms&rdquo; (abbreviations like &ldquo;-Ph&rdquo; or &ldquo;-COOH&rdquo;) with ambiguous connection points.</li>
<li><strong>Triangle Detection</strong>: The &ldquo;expanding disc&rdquo; method for identifying wedge bonds was highly effective.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>Two distinct datasets were used for validation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA Benchmark</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">5,735</td>
          <td style="text-align: left">Computer-generated diagrams provided by the OSRA project.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">5,730</td>
          <td style="text-align: left">Scanned at 300dpi from the Maybridge drug catalogue. Ground truth generated via CAS Registry Number $\to$ <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> $\to$ OpenBabel.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline consists of three stages: <strong>Vectorization</strong>, <strong>Geometric Processing</strong>, and <strong>Rule Application</strong>.</p>
<p><strong>1. Vectorization &amp; Primitives</strong></p>
<ul>
<li><strong>Binarization &amp; OCR</strong>: Connected components are labelled and passed to an OCR engine to extract &ldquo;Character Groups&rdquo;.</li>
<li><strong>Thinning</strong>: Image is thinned to unit width.</li>
<li><strong>Simplification</strong>: Douglas-Peucker algorithm converts pixel paths into straight <strong>Line Segments</strong>.</li>
<li><strong>Triangle Detection</strong>: A disc growing algorithm walks inside black regions to identify <strong>Triangles</strong> (wedges). If the disc cannot grow, it is a thick line (Bold Bond).</li>
</ul>
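<p>The Douglas-Peucker simplification step can be sketched directly; this is the standard recursive algorithm, not MolRec's specific implementation.</p>

```python
import math

def douglas_peucker(points, epsilon):
    """Simplify a pixel path into straight segments: keep the point
    farthest from the end-to-end chord if it deviates by more than
    epsilon, otherwise collapse the run to a single segment."""
    def dist(p, a, b):
        (px, py), (ax, ay), (bx, by) = p, a, b
        num = abs((bx - ax) * (py - ay) - (by - ay) * (px - ax))
        den = math.hypot(bx - ax, by - ay)
        return num / den if den else math.hypot(px - ax, py - ay)

    if len(points) < 3:
        return list(points)
    d_max, idx = max((dist(points[i], points[0], points[-1]), i)
                     for i in range(1, len(points) - 1))
    if d_max <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right  # drop duplicated split point
```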
<p><strong>2. Fuzzy Parameters</strong></p>
<p>The rules rely on tolerating drawing imperfections using defined parameters:</p>
<ul>
<li>$r_e$: Radius of collinearity (strict).</li>
<li>$d_l$ / $d_s$: Dash length / Dash separation (fuzzy).</li>
<li>$bdl$ / $bdw$: Bold dash length / width (fuzzy).</li>
<li>$bs$: Bond separation (max distance between parallel bonds).</li>
<li>$ol$: Minimal overlap.</li>
</ul>
<p><strong>3. The Rule System (R1-R18)</strong></p>
<p>The core logic uses 18 mutual-exclusion rules to rewrite geometric primitives into chemical graph edges.</p>
<ul>
<li><strong>Planar Bonds</strong>:
<ul>
<li><strong>R1-R3 (Single/Double/Triple)</strong>: Identifies parallel lines based on <code>bs</code> and <code>ol</code>. Uses &ldquo;cutting&rdquo; to split lines at implicit nodes.</li>
</ul>
</li>
<li><strong>Ambiguity Resolution (Stereo vs. Planar)</strong>:
<ul>
<li><strong>R4 (Dashed Bold vs. Triple)</strong>: Checks context. If purely geometric measures match both, it defaults to Triple unless specific dash constraints are met.</li>
<li><strong>R5 (Dashed Wedge vs. Triple)</strong>: Similar disambiguation based on length monotonicity.</li>
<li><strong>R6 (Dashed Wedge vs. Double)</strong>: Differentiates based on line length differences ($l_1 &gt; l_2$).</li>
</ul>
</li>
<li><strong>Stereo Bonds</strong>:
<ul>
<li><strong>R7-R9 (Dashed Types)</strong>: Identifies collinear segments with specific neighbor patterns (1 neighbor for ends, 2 for internal).</li>
<li><strong>R10-R11 (Hollow Wedge)</strong>: Detects triangles formed by 3 or 4 lines.</li>
<li><strong>R14 (Solid Wedge)</strong>: Direct mapping from Triangle primitive.</li>
</ul>
</li>
<li><strong>Special Structures</strong>:
<ul>
<li><strong>R12 (Wavy Bond)</strong>: Zig-zag line segments.</li>
<li><strong>R13 (Arrow)</strong>: Dative bond.</li>
<li><strong>R16 (Aromatic Ring)</strong>: Circle inside a cycle of &gt;5 lines.</li>
<li><strong>R17-R18 (Bridge Bonds)</strong>: Handles 2.5D crossing bonds (open or closed gaps).</li>
</ul>
</li>
</ul>
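<p>A reduced sketch of the R2-style parallel-bond check: two line segments are rewritten into a double bond when they are approximately parallel and closer than the bond-separation parameter <code>bs</code>. The real rules also verify the minimal overlap <code>ol</code> and apply &ldquo;cutting&rdquo; at implicit carbon nodes, both omitted here.</p>

```python
import math

def _angle(seg):
    """Undirected orientation of a segment, in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def _midpoint_gap(s1, s2):
    (ax, ay), (bx, by) = s1
    (cx, cy), (dx, dy) = s2
    return math.hypot((ax + bx) / 2 - (cx + dx) / 2,
                      (ay + by) / 2 - (cy + dy) / 2)

def is_double_bond(s1, s2, bs, angle_tol=0.1):
    """Fuzzy parallel-line test: near-equal orientations (mod pi) and
    midpoint separation at most bs."""
    diff = abs(_angle(s1) - _angle(s2))
    parallel = diff < angle_tol or abs(diff - math.pi) < angle_tol
    return parallel and _midpoint_gap(s1, s2) <= bs
```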
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Semantic graph matching. The output MOL file is compared to the ground truth MOL file using OpenBabel. Success = correct graph isomorphism.</p>
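<p>A lightweight approximation of this semantic check can be sketched with a graph invariant: equal fingerprints are necessary (though not sufficient) for structural equivalence. The paper's actual evaluation converts both MOL files with OpenBabel and compares the resulting structures.</p>

```python
def graph_fingerprint(atoms, bonds):
    """Order-independent invariant of a molecular graph: the sorted
    multiset of (element, sorted neighbour (element, bond order) pairs).
    atoms is a list of element symbols; bonds is a list of (i, j, order)."""
    neigh = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neigh[i].append((atoms[j], order))
        neigh[j].append((atoms[i], order))
    return sorted((atoms[i], tuple(sorted(neigh[i])))
                  for i in range(len(atoms)))

def probably_equivalent(mol_a, mol_b):
    """True when the two (atoms, bonds) graphs share the same invariant;
    a cheap screen, not a full isomorphism test."""
    return graph_fingerprint(*mol_a) == graph_fingerprint(*mol_b)
```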
<p><strong>Results Table</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">System</th>
          <th style="text-align: left">Success Rate</th>
          <th style="text-align: left">Fail Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>88.46%</strong></td>
          <td style="text-align: left">11.54%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">77.23%</td>
          <td style="text-align: left">22.77%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>83.84%</strong></td>
          <td style="text-align: left">16.16%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">72.57%</td>
          <td style="text-align: left">27.43%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Requirements not specified, but the approach (vectorization + rule matching) is computationally lightweight compared to modern deep learning methods.</li>
</ul>
]]></content:encoded></item><item><title>Tea Party in the House: Legislative Ideology via HIPTM</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/tea-party-hiptm/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/tea-party-hiptm/</guid><description>A hierarchical probabilistic model combining roll call votes, bill text, and legislative speeches to analyze political polarization and framing.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p><strong>Method</strong>.</p>
<p>This paper is primarily a <strong>Methodological</strong> contribution. It proposes a novel probabilistic architecture, the Hierarchical Ideal Point Topic Model (HIPTM), designed to address the limitations of existing political science models that typically rely on either voting data or text data in isolation. The paper validates this method by demonstrating its superior performance in predicting &ldquo;Tea Party&rdquo; membership compared to text-only baselines and its ability to provide interpretable &ldquo;framing&rdquo; analysis.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The primary motivation is to better understand political polarization, specifically the &ldquo;Tea Party&rdquo; phenomenon within the Republican party during the 112th Congress.</p>
<p>An ideal point is a scalar score representing a legislator&rsquo;s ideological position, estimated from voting patterns. Standard &ldquo;Ideal Point&rdquo; models (like DW-NOMINATE) typically project legislators onto a single liberal-conservative dimension using only binary voting data. This is insufficient for capturing complex, multi-dimensional intra-party conflicts where legislators might agree on votes but differ on policy &ldquo;framing&rdquo; or specific sub-issues. Furthermore, existing multi-dimensional models often produce dimensions that are difficult for humans to interpret.</p>
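<p>The standard one-dimensional setup can be written as a logistic item-response model; this sketch uses generic parameter names (<code>a_j</code>, <code>b_j</code> for a bill's polarity and popularity) rather than HIPTM's notation, which further decomposes <code>x_i</code> into per-issue ideal points.</p>

```python
import math

def vote_yes_probability(x_i, a_j, b_j):
    """P(legislator i votes yes on bill j) = sigma(a_j * x_i + b_j),
    where x_i is the legislator's scalar ideal point."""
    return 1.0 / (1.0 + math.exp(-(a_j * x_i + b_j)))
```

<p>For example, a legislator at $x_i = 2$ facing a bill with polarity $a_j = 1.5$ votes yes with probability $\sigma(3) \approx 0.95$; mirroring the ideal point to $x_i = -2$ flips that to $\approx 0.05$.</p>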
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Hierarchical Ideal Point Topic Model (HIPTM)</strong>. It distinguishes itself from prior work through three main technical innovations:</p>
<ol>
<li><strong>Joint Modeling of Three Data Sources</strong>: It integrates roll call votes, the text of bills, and the floor speeches of legislators into a single probabilistic framework.</li>
<li><strong>Hierarchical Topic Structure</strong>: It models &ldquo;frames&rdquo; as a second level of the topic hierarchy. &ldquo;Issues&rdquo; (level 1) are fixed and non-polarized, while &ldquo;Frames&rdquo; (level 2) are discovered dynamically and carry polarity (ideal point weights). For example, Health Care is an issue; &ldquo;government overreach&rdquo; vs. &ldquo;patient protection&rdquo; are frames legislators use when debating it.</li>
<li><strong>Text-Based Ideal Point Prediction</strong>: HIPTM regresses ideal points on speech text, allowing it to predict the political alignment of legislators based solely on their writing or speeches without requiring voting records for inference.</li>
</ol>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the model using data from the 112th U.S. Congress (Republican legislators only).</p>
<ul>
<li><strong>Prediction Task</strong>: Classifying legislators as members of the &ldquo;Tea Party Caucus&rdquo;.</li>
<li><strong>Baselines</strong>: The model was compared against Support Vector Machines (SVM) trained on:
<ul>
<li>TF-IDF vectors (Text only)</li>
<li>Normalized TF-IDF vectors (Text only)</li>
<li>Binary Vote vectors (Vote only)</li>
</ul>
</li>
<li><strong>Metric</strong>: Area Under the Receiver Operating Characteristic Curve (AUC-ROC) via 5-fold cross-validation.</li>
<li><strong>Qualitative Analysis</strong>: The authors examined the &ldquo;span&rdquo; of ideal points within specific topics (e.g., Macroeconomics, Health) to identify which issues were most polarized between Tea Party and Establishment Republicans.</li>
</ul>
<h2 id="what-were-the-outcomes-and-conclusions-drawn">What were the outcomes and conclusions drawn?</h2>
<ul>
<li><strong>Quantitative Performance</strong>: HIPTM features combined with voting data (HIPTM-VOTE) achieved the highest classification performance (AUC-ROC in the ~0.70-0.75 range, approximate, read from Figure 2). Vote-only features slightly trail HIPTM-VOTE, while text-only baselines (TF-IDF, normalized TF-IDF) fall considerably lower. The one-dimensional Tea Party ideal points correlate with DW-NOMINATE ($\rho = 0.91$). When voting data was withheld (simulating a candidate without a record), HIPTM&rsquo;s text-based features outperformed standard text baselines TF-IDF and normalized TF-IDF (approximate, read from Figure 3).</li>
<li><strong>Political Insight</strong>: The model identified &ldquo;Government Operations,&rdquo; &ldquo;Macroeconomics,&rdquo; and &ldquo;Transportation&rdquo; as the three most polarized topics between Tea Party and establishment Republicans.</li>
<li><strong>Framing Analysis</strong>: The hierarchical topic structure reveals how legislators frame issues differently. For Macroeconomics, frame M3 (most Tea Party-oriented) focuses on criticizing government overspending, while frame M1 (least Tea Party-oriented) focuses on the downsides of a government shutdown. For Health, frame H3 captures Tea Party framing of the Affordable Care Act as an unconstitutional government takeover, while frame H1 frames opposition in terms of implementation costs and health care exchanges.</li>
<li><strong>Framing vs. Voting Taxonomy</strong>: The authors construct a 2x2 taxonomy of disagreement across issues, crossing whether ideal points are polarized with whether issue frames are polarized. Issues like Civil Rights fall in the &ldquo;neither polarized&rdquo; quadrant, where cooperation is expected. Banking/Finance and Transportation fall in the &ldquo;ideal points polarized, frames not&rdquo; quadrant, where Republicans frame the issue similarly but have underlying policy disagreements. Issues like Health and Public Lands fall in the &ldquo;frames polarized, ideal points not&rdquo; quadrant: Republicans voted similarly but framed the issue very differently. Issues like Macroeconomics and Government Operations fall in the &ldquo;both polarized&rdquo; quadrant, posing the greatest challenge for Republican leadership.</li>
<li><strong>Sub-group Identification</strong>: The model identifies legislators whose language marks them as ideologically aligned with the Tea Party even without formal caucus membership. For example, Jeff Flake (R-AZ) received the second-highest ideal point, disagreeing with Freedom Works on only one of 60 key votes, despite not being a Tea Party Caucus member. Justin Amash (R-MI), founder and chairman of the Liberty Caucus, agreed with Freedom Works on every key vote since 2011. Conversely, some self-identified Tea Partiers like Rodney Alexander (R-LA) only agreed with Freedom Works 48% of the time. Alexander and Ander Crenshaw (R-FL, 50% agreement) are categorized as &ldquo;Green Tea&rdquo; by Gervais and Morris (2014): Republican legislators who associate with the Tea Party on their own initiative but lack support from Tea Party organizations.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<ul>
<li>HIPTM does not formally distinguish frames from other kinds of subtopics. For example, the model discovered a strongly Tea Party-oriented frame under &ldquo;Labor, Employment and Immigration&rdquo; that reflected a Boeing labor dispute specific to South Carolina legislators, capturing geographic rather than ideological framing.</li>
<li>The model is validated only on Republican legislators in the 112th Congress. Generalization to other parties, chambers, or time periods is untested.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study focuses on the <strong>112th U.S. Congress</strong> (Jan 2011 - Jan 2013).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Subjects</strong></td>
          <td>Republican Legislators</td>
          <td>240 Reps</td>
          <td>60 are Tea Party Caucus members.</td>
      </tr>
      <tr>
          <td><strong>Votes</strong></td>
          <td>Roll Call Votes</td>
          <td>13,856 votes</td>
          <td>Agreement/disagreement with Freedom Works on 60 key votes (40 in 2011, 20 in 2012).</td>
      </tr>
      <tr>
          <td><strong>Text</strong></td>
          <td>Floor Speeches</td>
          <td>5,349 word types</td>
          <td>Sourced from GovTrack. Vocabulary size after preprocessing.</td>
      </tr>
      <tr>
          <td><strong>Priors</strong></td>
          <td>Congressional Bills Project</td>
          <td>19 Topics</td>
          <td>Used to set informed priors $\phi^*_k$ for top-level issues.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The model uses a <strong>Stochastic EM</strong> approach for inference.</p>
<ul>
<li><strong>Generative Process</strong>:
<ul>
<li><strong>Speeches</strong>: Modeled as a mixture of $K$ Hierarchical Dirichlet Processes (HDPs). A legislator chooses an issue $z$, then a frame $t$ from a Dirichlet Process, then a word $w$.</li>
<li><strong>Bills</strong>: Modeled using Latent Dirichlet Allocation (LDA). Each bill is a mixture over $K$ issues.</li>
<li><strong>Votes</strong>: Modeled via a probabilistic ideal point function (logistic/inverse-logit). The probability of a &ldquo;Yes&rdquo; vote depends on the bill&rsquo;s polarity $x_b$, popularity $y_b$, and the legislator&rsquo;s issue-specific ideal point $u_{a,k}$.</li>
</ul>
</li>
<li><strong>Optimization Steps</strong>:
<ol>
<li><strong>Sampling</strong>: Issue assignments $z$ and frame assignments $t$ are sampled for tokens in speeches and bills.</li>
<li><strong>Regression</strong>: Frame-specific regression weights $\eta_{k,j}$ are optimized using <strong>L-BFGS</strong>.</li>
<li><strong>Ideal Points</strong>: Legislator ideal points $u_{a,k}$ and bill parameters ($x_b, y_b$) are updated using <strong>Gradient Ascent</strong>.</li>
</ol>
</li>
</ul>
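<p>A minimal sketch of the vote component above, assuming the standard ideal-point parameterization in which the bill&rsquo;s polarity multiplies the legislator&rsquo;s issue-specific ideal point and the bill&rsquo;s popularity enters as an additive offset (the paper&rsquo;s Section 4 gives the exact form):</p>

```python
import math

def p_yes(u_ak, x_b, y_b):
    """Probability that legislator a votes 'Yes' on bill b under a
    logistic ideal-point model: alignment between the legislator's
    issue-specific ideal point u_ak and the bill's polarity x_b,
    plus the bill's popularity y_b, raises the odds of a Yes vote.
    (Parameterization assumed; see the paper for the exact form.)"""
    return 1.0 / (1.0 + math.exp(-(u_ak * x_b + y_b)))

# For a bill with positive polarity, a legislator with a positive ideal
# point is far more likely to vote Yes than one with a negative one.
print(p_yes(1.5, 2.0, 0.0))   # well above 0.5
print(p_yes(-1.5, 2.0, 0.0))  # well below 0.5
```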
<h3 id="models">Models</h3>
<ul>
<li><strong>Ideal Point Definition</strong>: A legislator&rsquo;s ideal point on issue $k$ ($u_{a,k}$) is defined as a linear combination of the ideal points of the <em>frames</em> they use ($\eta_{k,j}$), weighted by their usage frequency ($\hat{\psi}_{a,k,j}$).</li>
<li><strong>Topic Hierarchy</strong>:
<ul>
<li><strong>Level 1 (Issues)</strong>: Fixed at $K=19$ (based on Policy Agendas Project major headings). These nodes use informed Dirichlet priors.</li>
<li><strong>Level 2 (Frames)</strong>: Unbounded number of frames per issue, discovered non-parametrically via Dirichlet Process.</li>
</ul>
</li>
<li><strong>Prediction Features</strong>: The model runs for 1,000 iterations total with a 500-iteration burn-in. After burn-in, the sampled state is kept every 50 iterations, and feature values are averaged over the 10 stored models.</li>
</ul>
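<p>The linear-combination definition of the ideal point can be sketched directly (the frame values below are hypothetical, for illustration only):</p>

```python
def issue_ideal_point(frame_weights, frame_usage):
    """Legislator's ideal point on issue k: the usage-weighted average
    of the ideal points of the frames they use.
    frame_weights: eta_{k,j}, one scalar per frame j under issue k.
    frame_usage:   psi-hat_{a,k,j}, empirical usage frequencies (sum to 1)."""
    return sum(eta * psi for eta, psi in zip(frame_weights, frame_usage))

# Two hypothetical frames under one issue: a strongly Tea Party frame
# (eta = 2.0) and an establishment frame (eta = -1.0). A legislator using
# the first 75% of the time lands at 2.0*0.75 + (-1.0)*0.25 = 1.25.
print(issue_ideal_point([2.0, -1.0], [0.75, 0.25]))  # 1.25
```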
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: AUC-ROC (Area Under the Receiver Operating Characteristic Curve).</li>
<li><strong>Classifier</strong>: $\text{SVM}^{\text{light}}$ (Joachims, 1999).</li>
<li><strong>Cross-Validation</strong>: 5-fold stratified sampling.</li>
</ul>
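<p>For reference, AUC-ROC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal rank-based sketch of the metric itself (not the paper&rsquo;s $\text{SVM}^{\text{light}}$ pipeline):</p>

```python
def auc_roc(labels, scores):
    """AUC-ROC via the rank (Mann-Whitney) formulation: the fraction of
    (positive, negative) pairs where the positive is scored higher,
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking gives 1.0; a fully reversed ranking gives 0.0.
print(auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
print(auc_roc([1, 0, 1, 0], [0.2, 0.9, 0.4, 0.6]))  # 0.0
```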
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.govtrack.us/">GovTrack Congressional Speeches</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Source of floor speech text</td>
      </tr>
      <tr>
          <td><a href="http://www.congressionalbills.org/">Congressional Bills Project</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Bill text with Policy Agendas Project topic labels</td>
      </tr>
      <tr>
          <td>Freedom Works Key Votes</td>
          <td>Dataset</td>
          <td>Public</td>
          <td>60 key votes used to define Tea Party alignment (freedomworks.org is no longer available)</td>
      </tr>
  </tbody>
</table>
<p>No official code release accompanies this paper. The inference algorithm (Stochastic EM with Gibbs sampling, L-BFGS, and gradient ascent) is described in detail in Section 4 of the paper, but a full reimplementation would be required.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nguyen, V., Boyd-Graber, J., Resnik, P., &amp; Miler, K. (2015). Tea Party in the House: A Hierarchical Ideal Point Topic Model and Its Application to Republican Legislators in the 112th Congress. <em>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics</em>, 1438-1448. <a href="https://doi.org/10.3115/v1/P15-1139">https://doi.org/10.3115/v1/P15-1139</a></p>
<p><strong>Publication</strong>: ACL 2015</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{nguyenTeaPartyHouse2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Tea {{Party}} in the {{House}}: {{A Hierarchical Ideal Point Topic Model}} and {{Its Application}} to {{Republican Legislators}} in the 112th {{Congress}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Tea {{Party}} in the {{House}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 53rd {{Annual Meeting}} of the {{Association}} for {{Computational Linguistics}} and the 7th {{International Joint Conference}} on {{Natural Language Processing}} ({{Volume}} 1: {{Long Papers}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Nguyen, Viet-An and {Boyd-Graber}, Jordan and Resnik, Philip and Miler, Kristina}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1438--1448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Beijing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3115/v1/P15-1139}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2023-11-02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We introduce the Hierarchical Ideal Point Topic Model, which provides a rich picture of policy issues, framing, and voting behavior using a joint model of votes, bill text, and the language that legislators use when debating bills. We use this model to look at the relationship between Tea Party Republicans and ``establishment&#39;&#39; Republicans in the U.S. House of Representatives during the 112th Congress.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://aclanthology.org/P15-1139/">ACL Anthology: Tea Party in the House</a></li>
<li>Gervais, B. T., &amp; Morris, I. L. (2012). Reading the tea leaves: Understanding Tea Party Caucus membership in the US House of Representatives. <em>PS: Political Science &amp; Politics</em>, 45(2), 245-250.</li>
<li>Gervais, B. T., &amp; Morris, I. L. (2014). Black Tea, Green Tea, White Tea, and Coffee: Understanding the variation in attachment to the Tea Party among members of Congress. In <em>Annual Meeting of the American Political Science Association</em>. (Source of the &ldquo;Green Tea&rdquo; Republican taxonomy cited in the paper)</li>
</ul>
]]></content:encoded></item><item><title>Stillinger-Weber Potential for Silicon Simulation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/stillinger-weber-1985/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/stillinger-weber-1985/</guid><description>The 1985 paper introducing the Stillinger-Weber potential, a 3-body interaction model for molecular dynamics of tetrahedral semiconductors.</description><content:encoded><![CDATA[<h2 id="core-methodological-contribution">Core Methodological Contribution</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>Its primary contribution is the formulation of the <strong>Stillinger-Weber potential</strong>, a non-additive potential energy function designed to model tetrahedral semiconductors. The paper also uses molecular dynamics simulation to explore physical properties of silicon in both crystalline and liquid phases, but the methodological contribution (the potential architecture) is what enabled subsequent research on covalent materials.</p>
<h2 id="the-failure-of-pair-potentials-in-silicon">The Failure of Pair Potentials in Silicon</h2>
<p>The authors aimed to simulate the melting and liquid properties of tetrahedral semiconductors (Silicon and Germanium).</p>
<ul>
<li><strong>The Problem:</strong> Standard pair potentials (like Lennard-Jones) favor close-packed structures (12 nearest neighbors) and cannot stabilize the open diamond structure (4 nearest neighbors) of Silicon.</li>
<li><strong>The Gap:</strong> Earlier classical potentials lacked the flexibility to describe the profound structural change where Silicon shrinks upon melting (coordination number increases from 4 to &gt;6) while becoming metallic and electrically conductive.</li>
<li><strong>The Goal:</strong> To construct a potential that spans the entire configuration space, describing both the rigid crystal and the diffusive liquid, without requiring quantum mechanical calculations.</li>
</ul>
<h2 id="the-three-body-interaction-novelty">The Three-Body Interaction Novelty</h2>
<p>The core novelty is the introduction of a stabilizing <strong>three-body interaction term</strong> ($v_3$) to the potential energy function.</p>
<ul>
<li><strong>3-Body Term:</strong> Explicitly penalizes deviations from the ideal tetrahedral angle ($\cos \theta_t = -1/3$).</li>
<li><strong>Unified Model:</strong> The potential handles bond breaking and reforming, allowing simulation of both melting and liquid diffusion; previous &ldquo;Keating&rdquo;-type potentials modeled only small elastic deformations.</li>
<li><strong>Mapping Technique:</strong> The application of &ldquo;steepest-descent mapping&rdquo; to quench dynamical configurations into their underlying &ldquo;inherent structures&rdquo; (local minima), revealing the fundamental topology of the liquid energy landscape.</li>
</ul>
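<p>The steepest-descent mapping can be illustrated on a toy double-well landscape standing in for the full $\Phi$ (the potential and step size here are illustrative choices, not from the paper):</p>

```python
def quench(x, grad, eta=1e-3, steps=20000):
    """Steepest-descent mapping: follow -grad(Phi) from an instantaneous
    configuration down to its inherent structure, i.e. the local minimum
    whose basin of attraction contains it."""
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Toy double well Phi(x) = (x^2 - 1)^2: configurations on either side of
# the barrier at x = 0 quench to their own inherent structure at x = -1
# or x = +1, just as liquid snapshots map to amorphous-network minima.
grad = lambda x: 4.0 * x * (x**2 - 1.0)
print(quench(0.5, grad))   # ~ +1.0
print(quench(-0.3, grad))  # ~ -1.0
```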
<h2 id="molecular-dynamics-validation">Molecular Dynamics Validation</h2>
<p>The authors performed Molecular Dynamics (MD) simulations using the proposed potential.</p>
<ul>
<li><strong>System:</strong> 216 Silicon atoms in a cubic cell with periodic boundary conditions.</li>
<li><strong>State Points:</strong> Fixed density $\rho = 2.53 \text{ g/cm}^3$ (matching experimental liquid density at melting).</li>
<li><strong>Process:</strong>
<ol>
<li>Start with diamond crystal at low temperature.</li>
<li>Systematically heat to induce spontaneous nucleation and melting.</li>
<li>Equilibrate the liquid.</li>
<li>Periodically map configurations to potential minima (inherent structures) using steepest descent.</li>
</ol>
</li>
</ul>
<h2 id="phase-topology-and-inverse-lindemann-criterion">Phase Topology and Inverse Lindemann Criterion</h2>
<ul>
<li><strong>Validation:</strong> The potential successfully stabilizes the diamond structure as the global minimum at zero pressure.</li>
<li><strong>Liquid Structure:</strong> The simulated liquid pair-correlation function $g(r)$ and structure factor $S(k)$ qualitatively match experimental diffraction data, including the characteristic shoulder on the structure factor peak.</li>
<li><strong>Inherent Structure:</strong> The liquid possesses a temperature-independent inherent structure (amorphous network) hidden beneath thermal vibrations.</li>
<li><strong>Melting/Freezing Criteria:</strong> The study proposes an &ldquo;Inverse Lindemann Criterion&rdquo;: while crystals melt when vibration amplitude exceeds ~0.19 lattice spacings, liquids freeze when atom displacements from their inherent minima drop below ~0.30 neighbor spacings.</li>
</ul>
<h2 id="limitations-and-energy-scale-problem">Limitations and Energy Scale Problem</h2>
<p>The authors acknowledge a quantitative energy scale discrepancy. To match the observed melting temperature of Si ($1410°$C), $\epsilon$ would need to be approximately 42 kcal/mol, considerably less than the 50 kcal/mol required to reproduce the correct cohesive energy of the crystal. The authors suggest this could be resolved either by further optimization of $v_2$ and $v_3$, or by adding position-independent single-particle terms $v_1 \approx -16$ kcal/mol arising from the electronic structure. Adding $v_1$ terms only affects the temperature scale and has no influence on local structure at a given reduced temperature.</p>
<p>The simulated liquid coordination number (8.07) is also higher than the experimentally reported value of approximately 6.4, though the authors note that the experimental definition of &ldquo;nearest neighbors&rdquo; was not precisely stated.</p>
<h2 id="bonding-statistics-in-inherent-structures">Bonding Statistics in Inherent Structures</h2>
<p>Analysis of potential-energy minima (inherent structures) using a bond cutoff of $r/\sigma = 1.40$ reveals the coordination distribution in the liquid:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Coordination Number</th>
          <th style="text-align: left">Fraction of Atoms</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">4</td>
          <td style="text-align: left">0.201</td>
      </tr>
      <tr>
          <td style="text-align: left">5</td>
          <td style="text-align: left">0.568</td>
      </tr>
      <tr>
          <td style="text-align: left">6</td>
          <td style="text-align: left">0.205</td>
      </tr>
      <tr>
          <td style="text-align: left">7</td>
          <td style="text-align: left">0.024</td>
      </tr>
  </tbody>
</table>
<p>Five-coordinate atoms dominate the liquid&rsquo;s inherent structure, with four- and six-coordinate atoms each accounting for about 20% of the population. The three-body interactions prevent any occurrence of coordination numbers near 12 that would indicate local close packing.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Integration:</strong> Equations of motion integrated using a <strong>fifth-order Gear algorithm</strong>.</li>
<li><strong>Time Step:</strong> $\Delta t = 5 \times 10^{-3} \tau$ (approx $3.83 \times 10^{-16}$ s), where $\tau = \sigma(m/\epsilon)^{1/2} = 7.6634 \times 10^{-14}$ s.</li>
<li><strong>Minimization:</strong> Steepest-descent mapping utilized <strong>Newton&rsquo;s method</strong> to find limiting solutions ($\nabla \Phi = 0$).</li>
</ul>
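<p>The quoted reduced time unit can be verified from $\tau = \sigma(m/\epsilon)^{1/2}$ using the paper&rsquo;s $\sigma$ and $\epsilon$; the silicon atomic mass is supplied here as an outside value (it is not stated in this note):</p>

```python
import math

# Check the quoted reduced time unit tau = sigma * sqrt(m / epsilon).
sigma = 0.20951e-9            # m   (from the paper)
eps = 3.4723e-19              # J   (50 kcal/mol per particle, from the paper)
m_si = 28.0855 * 1.66054e-27  # kg  (silicon atomic mass; assumed, not in the note)

tau = sigma * math.sqrt(m_si / eps)
print(tau)          # ~7.67e-14 s, matching the quoted 7.6634e-14 s
dt = 5e-3 * tau
print(dt)           # ~3.83e-16 s time step
```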
<h3 id="models">Models</h3>
<p>To reproduce this work, one must implement the potential $\Phi = \sum v_2 + \sum v_3$ with the exact functional forms and parameters provided.</p>















<figure class="post-figure center ">
    <img src="/img/notes/chemistry/stillinger-weber-potential.webp"
         alt="Stillinger-Weber potential visualization"
         title="Stillinger-Weber potential visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Left: Two-body radial potential $v_2(r)$ showing the characteristic well at $r_{min} \approx 1.12\sigma$. Right: Three-body angular penalty $h(r_{min}, r_{min}, \theta)$ demonstrating the minimum at the tetrahedral angle (109.5°), which enforces the diamond crystal structure.</figcaption>
    
</figure>

<h4 id="reduced-units">Reduced Units</h4>
<ul>
<li>$\sigma = 0.20951 \text{ nm}$</li>
<li>$\epsilon = 50 \text{ kcal/mol} = 3.4723 \times 10^{-12} \text{ erg}$</li>
</ul>
<h4 id="two-body-term-v_2">Two-Body Term ($v_2$)</h4>
<p>$$
v_2(r_{ij}) = \epsilon A (B r_{ij}^{-p} - r_{ij}^{-q}) \exp[(r_{ij} - a)^{-1}] \quad \text{for } r_{ij} &lt; a
$$</p>
<p><em>(Vanishes for $r \geq a$)</em></p>
<h4 id="three-body-term-v_3">Three-Body Term ($v_3$)</h4>
<p>$$
v_3(r_i, r_j, r_k) = \epsilon [h(r_{ij}, r_{ik}, \theta_{jik}) + h(r_{ji}, r_{jk}, \theta_{ijk}) + h(r_{ki}, r_{kj}, \theta_{ikj})]
$$</p>
<p>where:</p>
<p>$$
h(r_{ij}, r_{ik}, \theta_{jik}) = \lambda \exp[\gamma(r_{ij}-a)^{-1} + \gamma(r_{ik}-a)^{-1}] (\cos\theta_{jik} + \frac{1}{3})^2
$$</p>
<p><em>(Vanishes if distances $\geq a$)</em></p>
<h4 id="parameters">Parameters</h4>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">$A$</td>
          <td style="text-align: left">$7.049556277$</td>
      </tr>
      <tr>
          <td style="text-align: left">$B$</td>
          <td style="text-align: left">$0.6022245584$</td>
      </tr>
      <tr>
          <td style="text-align: left">$p$</td>
          <td style="text-align: left">$4$</td>
      </tr>
      <tr>
          <td style="text-align: left">$q$</td>
          <td style="text-align: left">$0$</td>
      </tr>
      <tr>
          <td style="text-align: left">$a$</td>
          <td style="text-align: left">$1.80$</td>
      </tr>
      <tr>
          <td style="text-align: left">$\lambda$</td>
          <td style="text-align: left">$21.0$</td>
      </tr>
      <tr>
          <td style="text-align: left">$\gamma$</td>
          <td style="text-align: left">$1.20$</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>The paper evaluates the model against experimental diffraction data.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Simulated Value</th>
          <th style="text-align: left">Experimental Value</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Melting Point ($T_m^*$)</strong></td>
          <td style="text-align: left">$\approx 0.080$</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Reduced units. Requires $\epsilon \approx 42$ kcal/mol to match real $T_m = 1410°$C, vs 50 kcal/mol for correct cohesive energy.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Coordination (Liquid)</strong></td>
          <td style="text-align: left">$8.07$</td>
          <td style="text-align: left">$\approx 6.4$</td>
          <td style="text-align: left">Evaluated at first $g(r)$ minimum ($r/\sigma = 1.625$). Simulated value is higher than experiment.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ First Peak</strong></td>
          <td style="text-align: left">$2.53$ $\AA^{-1}$</td>
          <td style="text-align: left">$2.80$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ Shoulder</strong></td>
          <td style="text-align: left">$3.25$ $\AA^{-1}$</td>
          <td style="text-align: left">$3.25$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I. Exact match with experiment.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ Second Peak</strong></td>
          <td style="text-align: left">$5.35$ $\AA^{-1}$</td>
          <td style="text-align: left">$5.75$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ Third Peak</strong></td>
          <td style="text-align: left">$8.16$ $\AA^{-1}$</td>
          <td style="text-align: left">$8.50$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ Fourth Peak</strong></td>
          <td style="text-align: left">$10.60$ $\AA^{-1}$</td>
          <td style="text-align: left">$11.20$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Entropy of Melting ($\Delta S / N k_B$)</strong></td>
          <td style="text-align: left">$\approx 3.7$</td>
          <td style="text-align: left">$3.25$</td>
          <td style="text-align: left">Simulated at constant volume; experimental at constant pressure (1 atm).</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Stillinger, F. H., &amp; Weber, T. A. (1985). Computer simulation of local order in condensed phases of silicon. <em>Physical Review B</em>, 31(8), 5262-5271. <a href="https://doi.org/10.1103/PhysRevB.31.5262">https://doi.org/10.1103/PhysRevB.31.5262</a></p>
<p><strong>Publication</strong>: Physical Review B, 1985</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{stillingerComputerSimulationLocal1985,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Computer Simulation of Local Order in Condensed Phases of Silicon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Stillinger, Frank H. and Weber, Thomas A.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1985</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Physical Review B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5262--5271}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Physical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1103/PhysRevB.31.5262}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Second-Order Langevin Equation for Field Simulations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/second-order-langevin-1987/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/second-order-langevin-1987/</guid><description>Hyperbolic Algorithm adds second-order derivatives to Langevin dynamics, reducing systematic errors to O(ε²) for lattice field simulations.</description><content:encoded><![CDATA[<h2 id="contribution-and-paper-type">Contribution and Paper Type</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$). It proposes a novel stochastic algorithm, the Hyperbolic Algorithm (HA), and validates its superior efficiency against the existing Langevin Algorithm (LA) through formal error analysis and numerical simulation. It contains significant theoretical derivation (Liouville dynamics) that serves primarily to justify the algorithmic performance claims.</p>
<h2 id="motivation-and-gaps-in-prior-work">Motivation and Gaps in Prior Work</h2>
<p>The standard Langevin Algorithm (LA) for numerical simulation of Euclidean field theories suffers from efficiency bottlenecks. The simplest Euler-discretization of the LA introduces systematic errors of $O(\epsilon)$ (where $\epsilon$ is the step size). To maintain accuracy, $\epsilon$ must be kept small, which increases the sweep-sweep correlation time (autocorrelation time), making simulations computationally expensive.</p>
<h2 id="core-novelty-second-order-dynamics">Core Novelty: Second-Order Dynamics</h2>
<p>The core contribution is the introduction of a <strong>second-order derivative in fictitious time</strong> to the stochastic equation. This converts the parabolic Langevin equation into a hyperbolic equation:</p>
<p>$$
\begin{aligned}
\frac{\partial^{2}\phi}{\partial t^{2}}+\gamma\frac{\partial\phi}{\partial t}=-\frac{\partial S}{\partial\phi}+\eta
\end{aligned}
$$</p>
<h3 id="equation-comparison">Equation Comparison</h3>
<p>The key difference from the standard (first-order) Langevin equation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Equation Type</th>
          <th style="text-align: left">Formula</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Hyperbolic (Second Order)</strong></td>
          <td style="text-align: left">$$\frac{\partial^{2}\phi}{\partial t^{2}}+\gamma\frac{\partial\phi}{\partial t}=-\frac{\partial S}{\partial\phi}+\eta$$</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Langevin (First Order)</strong></td>
          <td style="text-align: left">$$\frac{\partial\phi}{\partial t}=-\frac{\partial S}{\partial\phi}+\eta$$</td>
      </tr>
  </tbody>
</table>
<p>The standard Langevin equation corresponds to the overdamped limit where the acceleration term is absent. Physically, the Hyperbolic equation can be viewed as microcanonical equations of motion with an added friction term.</p>
<h3 id="key-innovations">Key Innovations</h3>
<ul>
<li><strong>Higher Order Accuracy</strong>: The simplest discretization of this equation leads to systematic errors of only $O(\epsilon^2)$ compared to $O(\epsilon)$ for LA.</li>
<li><strong>Tunable Damping</strong>: The addition of the damping parameter $\gamma$ allows tuning to minimize autocorrelation tails.</li>
<li><strong>Uniform Evolution</strong>: The method evolves structures of different wavelengths more uniformly than LA due to the specific dissipation structure.</li>
</ul>
<h2 id="methodology-and-experiments">Methodology and Experiments</h2>
<p>The author validated the method using the <strong>XY Model</strong> on 2D lattices.</p>
<ul>
<li><strong>System</strong>: Euclidean action $S = -\sum_{x,\mu} \cos(\theta_{x+\mu} - \theta_x)$.</li>
<li><strong>Setup</strong>:
<ul>
<li>Lattice sizes: $15^2$ (helical boundary conditions) and $30^2$.</li>
<li>$\beta$ range: 0.9 to 1.2 (crossing the critical point $\approx 1.0$).</li>
<li>Run length: &gt;100,000 updates in equilibrium.</li>
</ul>
</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Autocorrelation time ($\tau$)</strong>: Defined as the number of updates for the time-correlation function to drop to 10% of its initial value.</li>
<li><strong>Systematic Error</strong>: Measured via deviation of average action from Monte Carlo values.</li>
</ul>
</li>
</ul>
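<p>The XY-model action is straightforward to evaluate. A sketch on a small square lattice with periodic boundary conditions (the paper&rsquo;s $15^2$ run uses helical boundaries; periodic ones are used here for simplicity):</p>

```python
import math

def xy_action(theta):
    """Euclidean XY-model action S = -sum_{x,mu} cos(theta_{x+mu} - theta_x)
    on an L x L lattice, summing each bond once via the two forward
    directions, with periodic boundary conditions."""
    L = len(theta)
    S = 0.0
    for x in range(L):
        for y in range(L):
            S -= math.cos(theta[(x + 1) % L][y] - theta[x][y])  # mu = x-hat
            S -= math.cos(theta[x][(y + 1) % L] - theta[x][y])  # mu = y-hat
    return S

# Fully ordered configuration: every bond contributes -1, and a 4x4
# lattice has 2 * 16 = 32 bonds.
print(xy_action([[0.0] * 4 for _ in range(4)]))  # -32.0
```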
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Efficiency</strong>: The Hyperbolic Algorithm (HA) is far more efficient. For equal systematic errors, sweep-sweep correlation times are significantly lower than LA.</li>
<li><strong>Error Scaling</strong>: Numerical results confirmed that HA step size $\epsilon_H = 0.1$ yields systematic errors comparable to LA step size $\epsilon_L \approx 0.008$ ($O(\epsilon^2)$ vs $O(\epsilon)$ scaling).</li>
<li><strong>Speedup</strong>: In the disordered phase, HA is roughly $\epsilon_H / \epsilon_L$ times faster (approximately a factor of 12.5 for $\epsilon_H = 0.1$, $\epsilon_L = 0.008$). In the ordered phase, efficiency gains increase with distance scale, reaching factors of 20 or more for long-range correlations.</li>
<li><strong>Optimal Damping</strong>: For the XY model, the optimal damping parameter was found to be $\gamma \approx 0.4$.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. The Hyperbolic Algorithm (HA)</strong></p>
<p>The discretized update equations for scalar fields are:</p>
<p>$$
\begin{aligned}
\pi_{t+\epsilon} - \pi_{t} &amp;= -\epsilon\gamma\pi_{t} - \epsilon\frac{\partial S}{\partial\phi_{t}} + \sqrt{2\epsilon\gamma/\beta}\xi_{t} \\
\phi_{t+\epsilon} - \phi_{t} &amp;= \epsilon\pi_{t+\epsilon}
\end{aligned}
$$</p>
<ul>
<li><strong>Variables</strong>: $\phi$ is the field, $\pi$ is the conjugate momentum ($\dot{\phi}$).</li>
<li><strong>Parameters</strong>: $\epsilon$ (step size), $\gamma$ (damping constant).</li>
<li><strong>Noise</strong>: $\xi$ is Gaussian noise with $\langle\xi_x \xi_y\rangle = \delta_{x,y}$.</li>
<li><strong>Storage</strong>: Requires storing both $\phi$ and $\pi$ vectors.</li>
</ul>
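<p>For concreteness, the two update equations can be sketched for the 2D XY model used in the paper's experiments (the helper names and the vectorized gradient are my own; the paper states the updates for general scalar fields):</p>

```python
import numpy as np

def grad_xy(theta):
    """dS/dtheta for S = -sum_{x,mu} cos(theta_{x+mu} - theta_x) on a 2D lattice
    with periodic boundaries."""
    g = np.zeros_like(theta)
    for ax in (0, 1):
        fwd = np.roll(theta, -1, axis=ax) - theta   # theta_{x+mu} - theta_x
        bwd = theta - np.roll(theta, 1, axis=ax)    # theta_x - theta_{x-mu}
        g += -np.sin(fwd) + np.sin(bwd)
    return g

def ha_update(phi, pi, eps=0.1, gamma=0.4, beta=1.0, rng=None):
    """One Hyperbolic Algorithm step: momentum first, then the field is
    advanced with the *updated* momentum (this ordering is what the
    O(eps^2) analysis relies on)."""
    rng = rng or np.random.default_rng()
    xi = rng.standard_normal(phi.shape)             # <xi_x xi_y> = delta_{x,y}
    pi = pi - eps * gamma * pi - eps * grad_xy(phi) + np.sqrt(2 * eps * gamma / beta) * xi
    phi = phi + eps * pi
    return phi, pi
```

<p>As noted above, both $\phi$ and $\pi$ must be kept in memory between sweeps.</p>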
<p><strong>2. Non-Abelian Generalization</strong></p>
<p>For Lie group elements $U$ with generators $T^a$:</p>
<p>$$
\begin{aligned}
\pi_{t+\epsilon}^a - \pi_{t}^a &amp;= -\epsilon\gamma\pi_{t}^a - \epsilon\delta^a S[U_t] + \sqrt{2\epsilon\gamma/\beta}\xi_{t}^a \\
U_{t+\epsilon} &amp;= e^{i\epsilon\pi_{t+\epsilon}^a T^a} U_t
\end{aligned}
$$</p>
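<p>The group-multiplication step has a closed form for SU(2), where $T^a = \sigma^a/2$ gives $e^{i\epsilon\pi^a T^a} = \cos(\epsilon|\pi|/2)\,\mathbb{1} + i\sin(\epsilon|\pi|/2)\,\hat{\pi}\cdot\sigma$. A sketch of just this step (the momentum update is as in the scalar case; the force $\delta^a S$ is model-dependent and omitted here, and all names are my own):</p>

```python
import numpy as np

SIGMA = np.array([[[0, 1], [1, 0]],
                  [[0, -1j], [1j, 0]],
                  [[1, 0], [0, -1]]])  # Pauli matrices, generators T^a = sigma^a / 2

def su2_update(U, pi, eps):
    """U <- exp(i * eps * pi^a T^a) U via the closed-form SU(2) exponential."""
    n = np.linalg.norm(pi)
    axis = pi / n if n > 0 else np.zeros(3)
    half = eps * n / 2.0
    rot = np.cos(half) * np.eye(2) + 1j * np.sin(half) * np.tensordot(axis, SIGMA, axes=1)
    return rot @ U
```

<p>Because the update is an exact group element, the link stays in SU(2) to machine precision regardless of step size.</p>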
<h3 id="theoretical-proof-of-oepsilon2-accuracy">Theoretical Proof of $O(\epsilon^2)$ Accuracy</h3>
<p>The derivation relies on the generalized Liouville equation for the probability distribution $P[\phi, \pi; t]$.</p>
<ol>
<li><strong>Transition Probability</strong>: The transition probability $W$ corresponding to a single iteration of the discretized update equations is written down.</li>
<li><strong>Effective Liouville Operator</strong>: The evolution is written as $P(t+\epsilon) = \exp(\epsilon L_{\text{eff}}) P(t)$.</li>
<li><strong>Baker-Hausdorff Expansion</strong>: Using normal ordering of operators, the equilibrium distribution $P_{\text{eq}}$ is derived through $O(\epsilon^2)$:</li>
</ol>
<p>$$
\begin{aligned}
P_{\text{eq}} &amp;= \exp\left\lbrace-\frac{1}{2}\beta_{1}\sum_{x}\pi_{x}^{2} - \beta S[\phi] + \frac{1}{2}\epsilon\beta\sum_{x}\pi_{x}S_{x} + \epsilon^{2}G + O(\epsilon^3)\right\rbrace
\end{aligned}
$$</p>
<p>where $\beta_1 = \beta\left(1 - \frac{1}{2}\epsilon\gamma\right)$.</p>
<ol start="4">
<li><strong>Effective Action</strong>: Integrating out $\pi$ yields the effective action for $\phi$:</li>
</ol>
<p>$$
\begin{aligned}
S_{\text{eff}}[\phi] &amp;= S[\phi] - \frac{1}{8}\epsilon^2 \sum_x S_x^2 + \dots
\end{aligned}
$$</p>
<p>The absence of $O(\epsilon)$ terms proves the higher-order accuracy.</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Model</strong>: XY Model (2D)</li>
<li><strong>Hamiltonian</strong>: $H = \frac{1}{2}\sum \pi^2 + S[\phi]$ where $S = -\sum \cos(\Delta \theta)$.</li>
<li><strong>Observables</strong>:
<ul>
<li>$\Gamma_n = \langle \cos(\theta_{m+n} - \theta_m) \rangle_m$ (averaged over lattice sites $m$).</li>
</ul>
</li>
<li><strong>Comparisons</strong>:
<ul>
<li><strong>LA Step</strong>: $\epsilon_L \approx 0.005 - 0.02$.</li>
<li><strong>HA Step</strong>: $\epsilon_H \approx 0.1 - 0.2$.</li>
<li><strong>Equivalence</strong>: $\epsilon_H = 0.1$ matches error of $\epsilon_L \approx 0.008$.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="terminology-note">Terminology Note</h2>
<p>The naming conventions in this paper differ from those commonly used in molecular dynamics (MD). The following table provides a cross-field mapping:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Concept</th>
          <th style="text-align: left"><strong>Field Theory (This Paper)</strong></th>
          <th style="text-align: left"><strong>Molecular Dynamics</strong></th>
          <th style="text-align: left"><strong>Mathematics</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Equation 1</strong></td>
          <td style="text-align: left">&ldquo;Langevin Equation&rdquo;</td>
          <td style="text-align: left">Brownian Dynamics (BD)</td>
          <td style="text-align: left">Overdamped Langevin</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Equation 2</strong></td>
          <td style="text-align: left">&ldquo;Hyperbolic Equation&rdquo;</td>
          <td style="text-align: left">Langevin Dynamics (LD)</td>
          <td style="text-align: left">Underdamped Langevin</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Integrator 1</strong></td>
          <td style="text-align: left">Euler Discretization</td>
          <td style="text-align: left">Euler Integrator</td>
          <td style="text-align: left">Euler-Maruyama</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Integrator 2</strong></td>
          <td style="text-align: left">Hyperbolic Algorithm (HA)</td>
          <td style="text-align: left">Velocity Verlet / Leapfrog</td>
          <td style="text-align: left">Quasi-Symplectic Splitting</td>
      </tr>
  </tbody>
</table>
<p><strong>Key insight</strong>: The paper&rsquo;s &ldquo;Hyperbolic Algorithm&rdquo; is mathematically equivalent to Langevin Dynamics with a Leapfrog/Verlet integrator, commonly used in MD. The baseline &ldquo;Langevin Algorithm&rdquo; corresponds to Brownian Dynamics. The term &ldquo;Langevin equation&rdquo; is overloaded: field theorists often use it for overdamped dynamics (no inertia), while chemists assume it includes momentum ($F=ma$).</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Horowitz, A. M. (1987). The Second Order Langevin Equation and Numerical Simulations. <em>Nuclear Physics B</em>, 280, 510-522. <a href="https://doi.org/10.1016/0550-3213(87)90159-3">https://doi.org/10.1016/0550-3213(87)90159-3</a></p>
<p><strong>Publication</strong>: Nuclear Physics B 1987</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{horowitzSecondOrderLangevin1987,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The Second Order {{Langevin}} Equation and Numerical Simulations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Horowitz, Alan M.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1987</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Nuclear Physics B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{280}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{510--522}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{05503213}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1016/0550-3213(87)90159-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Reconstruction of Chemical Molecules from Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</guid><description>A 5-module system converting raster images of chemical structures into machine-readable SDF files with custom vectorization.</description><content:encoded><![CDATA[<h2 id="methodological-basis">Methodological Basis</h2>
<p>This paper is a clear methodological contribution describing a novel system architecture. It proposes a five-stage pipeline to solve a specific engineering problem: converting rasterized chemical images into structured chemical files (SDF). The authors validate the method by benchmarking it against a commercial product (CLIDE) and analyzing performance across multiple databases.</p>
<h2 id="the-inaccessibility-of-raster-chemical-images">The Inaccessibility of Raster Chemical Images</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A massive amount of chemical knowledge (scientific articles, patents) exists only as raster images, rendering it inaccessible to computational analysis.</li>
<li><strong>Inefficiency of Manual Entry</strong>: Manual replication of molecules into CAD programs is the standard but unscalable solution for extracting this information.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous academic and commercial attempts (early-1990s systems like CLIDE) had either been discontinued or remained limited in robustness, leaving the problem &ldquo;wide open&rdquo;.</li>
</ul>
<h2 id="topology-preserving-chemical-vectorization">Topology-Preserving Chemical Vectorization</h2>
<p>The core novelty is the <strong>topology-preserving vectorization</strong> strategy designed specifically for chemical graphs.</p>
<ul>
<li><strong>Graph-Centric Vectorizer</strong>: This system prioritizes graph characteristics over the pixel precision of traditional CAD vectorizers, ensuring one line in the image becomes exactly one vector, regardless of line width or vertex thickness.</li>
<li><strong>Chemical Knowledge Module</strong>: The inclusion of a final validation step that applies chemical rules (valence, charge) to detect and potentially correct reconstruction errors.</li>
<li><strong>Hybrid Recognition</strong>: The separation of the pipeline into a &ldquo;Body&rdquo; path (vectorizer for bonds) and an &ldquo;OCR&rdquo; path (SVM for atomic symbols), which are re-integrated in a reconstruction phase.</li>
</ul>
<h2 id="validating-reconstruction-accuracy">Validating Reconstruction Accuracy</h2>
<p>The authors performed a quantitative validation using <strong>ground-truth SDF files</strong> to verify reconstruction accuracy. The success rate metric evaluated whether the reconstructed graph perfectly matched the true SDF:</p>
<p>$$ \text{Accuracy} = \frac{\text{Correctly Reconstructed SDFs}}{\text{Total Images Evaluated}} $$</p>
<ul>
<li><strong>Baselines</strong>: The system was benchmarked against the commercial software <strong>CLIDE</strong> on &ldquo;Database 1&rdquo;.</li>
<li><strong>Datasets</strong>: Three distinct databases were used:
<ul>
<li><strong>Database 1</strong>: 100 images (varied fonts/line widths).</li>
<li><strong>Database 2</strong>: 100 images.</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale test).</li>
</ul>
</li>
</ul>
<h2 id="system-performance-and-scalability">System Performance and Scalability</h2>
<ul>
<li><strong>Superior Performance</strong>: On Database 1, the proposed system correctly reconstructed <strong>97%</strong> of images, whereas the commercial CLIDE system only reconstructed <strong>25%</strong> (after parameter tuning).</li>
<li><strong>Scalability</strong>: The system maintained reasonable performance on the large dataset (Database 3), achieving <strong>67%</strong> accuracy.</li>
<li><strong>Robustness</strong>: The system can handle varying fonts and line widths via parameterization.</li>
<li><strong>Future Work</strong>: The authors plan to implement a feedback loop where the Chemical Knowledge Module can send error signals back to earlier modules to correct inconsistencies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Reproducible (Paywalled paper, no public code or data).</p>
<h3 id="data">Data</h3>
<p>The paper utilizes three databases for validation. The authors note that for these images, the correct SDF files were already available, allowing for direct automated checking.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Varied line widths, fonts, symbols; used for CLIDE comparison.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>General chemical database.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale database.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The system is composed of five distinct modules executed in sequence:</p>
<p><strong>1. Binarization &amp; Segmentation</strong></p>
<ul>
<li><strong>Preprocessing</strong>: Removal of anti-aliasing effects followed by <strong>adaptive histogram binarization</strong>.</li>
<li><strong>Connected Components</strong>: A non-recursive raster-scan algorithm identifies connected Run-Length Encoded (RLE) segments.</li>
</ul>
<p><strong>2. Optical Character Recognition (OCR)</strong></p>
<ul>
<li><strong>Feature Extraction</strong>: Uses functions similar to <strong>Zernike moments</strong> and a <strong>wavelet transform strategy</strong>.</li>
<li><strong>Classification</strong>: Identifies isolated characters/symbols and separates them from the molecular &ldquo;body&rdquo;.</li>
</ul>
<p><strong>3. Vectorizer</strong></p>
<ul>
<li><strong>Logic</strong>: Assigns local directions to RLE segments based on neighbors, then groups segments with similar local direction patterns.</li>
<li><strong>Constraint</strong>: Enforces a 1-to-1 mapping between visual lines and graph vectors to prevent spurious small vectors at thick joints.</li>
</ul>
<p><strong>4. Reconstruction (Heuristics)</strong></p>
<p>This module annotates vectors with chemical significance:</p>
<ul>
<li><strong>Chiral Bonds (Wedges)</strong>: Identified by registering vectors against original pixel density. If a vector corresponds to a thick geometric form (triangle/rectangle), it is labeled chiral.</li>
<li><strong>Dotted Chiral Bonds</strong>: Identified by clustering isolated vectors (no neighbors) using <strong>quadtree clustering</strong> on geometric centers. Coherent parallel clusters are fused into a single bond.</li>
<li><strong>Double/Triple Bonds</strong>: Detected by checking for parallel vectors within a <strong>Region of Interest (ROI)</strong> defined as the vector&rsquo;s bounding box <strong>dilated by a factor of 2</strong>.</li>
<li><strong>Superatoms</strong>: OCR results are clustered by dilating bounding boxes; overlapping boxes are grouped into names (e.g., &ldquo;COOH&rdquo;).</li>
</ul>
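<p>The double-bond rule can be sketched as a small geometric predicate. This is my own reading of the ROI criterion: the paper only specifies parallelism within a bounding box dilated by a factor of 2, so the function names, the angle tolerance, and the minimum-padding guard for degenerate (axis-aligned) boxes are all assumptions:</p>

```python
import numpy as np

def dilated_bbox(seg, factor=2.0, pad=2.0):
    """Axis-aligned bounding box of segment `seg`, dilated by `factor` about
    its center; `pad` (hypothetical) keeps thin boxes from degenerating."""
    p, q = np.asarray(seg, dtype=float)
    lo, hi = np.minimum(p, q), np.maximum(p, q)
    center = (lo + hi) / 2.0
    half = np.maximum((hi - lo) / 2.0 * factor, pad)
    return center - half, center + half

def parallel(a, b, tol_deg=10.0):
    """True if the two segments' directions differ by less than tol_deg."""
    da = np.subtract(a[1], a[0]).astype(float)
    db = np.subtract(b[1], b[0]).astype(float)
    c = abs(da @ db) / (np.linalg.norm(da) * np.linalg.norm(db))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0))) < tol_deg

def double_bond_candidate(a, b):
    """True if segment b is parallel to a and its midpoint falls inside a's
    dilated bounding box (the ROI criterion)."""
    lo, hi = dilated_bbox(a)
    mid = (np.asarray(b[0], float) + np.asarray(b[1], float)) / 2.0
    return bool(np.all(mid >= lo) and np.all(mid <= hi) and parallel(a, b))
```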
<p><strong>5. Chemical Knowledge</strong></p>
<p>Validates the generated graph against rules for valences and charges. If valid, an SDF file is generated.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Support Vector Machine)</strong>: Used within the OCR module to classify connected components as characters or symbols. It is trained to be tolerant to rotation and font variations.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is binary success rate per molecule (perfect reconstruction of the SDF).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (DB1)</th>
          <th>Value (DB3)</th>
          <th>Baseline (CLIDE on DB1)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Reconstruction</td>
          <td><strong>97%</strong></td>
          <td>67%</td>
          <td>25%</td>
          <td>CLIDE required significant parameter tuning to reach 25%.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., Friedrich, C. M., Akle, S., &amp; Hofmann-Apitius, M. (2007). Reconstruction of Chemical Molecules from Images. <em>Proceedings of the 29th Annual International Conference of the IEEE EMBS</em>, 4609-4612. <a href="https://doi.org/10.1109/IEMBS.2007.4353366">https://doi.org/10.1109/IEMBS.2007.4353366</a></p>
<p><strong>Publication venue</strong>: IEEE EMBS 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriReconstructionChemicalMolecules2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Reconstruction of {{Chemical Molecules}} from {{Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 29th Annual International Conference of the IEEE EMBS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and Friedrich, Christoph M. and Akle, Santiago and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4609--4612}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IEMBS.2007.4353366}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Party Matters: Enhancing Legislative Vote Embeddings</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/party-matters-hiptm/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/party-matters-hiptm/</guid><description>A method for improving legislative vote prediction across sessions by augmenting bill text embeddings with sponsor metadata.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel neural architecture that modifies how bill embeddings are constructed by explicitly incorporating sponsor metadata alongside text. The authors validate this method by comparing it against text-only baselines (MWE and CNN) and demonstrating superior performance in a newly defined &ldquo;out-of-session&rdquo; evaluation setting.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Existing models for predicting legislative roll-call votes rely heavily on text or voting history within a single session. However, these models fail to generalize across sessions because the underlying data generation process changes. Specifically, the ideological position of bills on similar topics shifts depending on which party is in power. A model trained on a single session learns an implicit ideological prior that becomes inaccurate when the political context changes in subsequent sessions.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is a neural architecture that augments bill text representations with sponsor ideology, specifically the percentage of Republican vs. Democrat sponsors.</p>
<ul>
<li><strong>Sponsor-Weighted Embeddings</strong>: They compute a composite embedding where the text representation is weighted by party sponsorship percentages ($p_{r}, p_{d}$) and party-specific influence vectors ($a_{r}, a_{d}$).</li>
<li><strong>Out-of-Session Evaluation</strong>: They introduce a rigorous evaluation setting where models trained on past sessions (e.g., 2005-2012) are tested on future sessions (e.g., 2013-2014) to test generalization, which previous work had ignored.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated their models using a dataset of U.S. Congressional bills from 2005 to 2016.</p>
<ul>
<li><strong>Models Tested</strong>: They compared text-only models (MWE (Mean Word Embedding), CNN) against metadata-augmented versions (MWE+Meta, CNN+Meta) and a &ldquo;Meta-Only&rdquo; baseline (using dummy text).</li>
<li><strong>Settings</strong>:
<ul>
<li><strong>In-Session</strong>: 5-fold cross-validation on 2005-2012 data.</li>
<li><strong>Out-of-Session</strong>: Training on 2005-2012 and testing on 2013-2014 and 2015-2016.</li>
</ul>
</li>
<li><strong>Baselines</strong>: Comparisons included a &ldquo;Guess Yes&rdquo; baseline and an SVM trained on bag-of-words summaries with sponsor indicators.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Metadata is Critical</strong>: Augmenting text with sponsor metadata consistently outperformed text-only models. The <code>CNN+Meta</code> model achieved the highest accuracy in-session (86.21% vs. 83.24% for CNN) and on 2013-2014 out-of-session (83.59%), while <code>MWE+Meta</code> achieved the best 2015-2016 accuracy (71.90%).</li>
<li><strong>Generalization</strong>: Text-only models degraded significantly in out-of-session testing. For example, CNN dropped from 83.24% in-session to 77.49% on 2013-2014 and 69.63% on 2015-2016, confirming that text alone fails to capture shifting ideological contexts.</li>
<li><strong>Sponsor Signal</strong>: The <code>Meta-Only</code> model (using no text) outperformed text-only models in the 2013-2014 out-of-session test (82.28% vs. 77.57% for MWE), suggesting that in some contexts, the sponsors&rsquo; identities provide a stronger predictive signal than the bill&rsquo;s content.</li>
<li><strong>2015-2016 Difficulty</strong>: All models performed worse on the 2015-2016 session, where intra-party divisions within the House Republican caucus disrupted typical voting dynamics.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: Collected from GovTrack. The paper text references the &ldquo;106th to 111th&rdquo; Congressional sessions, but the data tables show coverage from 2005 to 2016, which corresponds to the 109th through 114th sessions.</li>
<li><strong>Content</strong>: Non-unanimous roll call votes, full text of bills/resolutions, and Congressional Research Service (CRS) summaries.</li>
<li><strong>Filtering</strong>: Bills with unanimous votes were excluded.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>Text lowercased and stop-words removed.</li>
<li>Summaries truncated to $N=400$ words; full text truncated to $N=2000$ words (80th percentile lengths).</li>
</ul>
</li>
<li><strong>Splits</strong>:
<ul>
<li><em>Training</em>: Sessions 2005-2012 (1718 bills).</li>
<li><em>Testing</em>: Sessions 2013-2014 (360 bills) and 2015-2016 (382 bills).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Bill Representation ($v_{B}$)</strong>:
$$v_{B}=((a_{r}p_{r})\cdot T_{r})+((a_{d}p_{d})\cdot T_{d})$$
where $T$ is the text embedding (CNN or MWE), $p$ is the percentage of sponsors from a party, and $a$ is a learnable party influence vector. $T_{r}$ and $T_{d}$ are Republican and Democratic copies of the same bill&rsquo;s text representation, each weighted by the corresponding party&rsquo;s sponsorship proportion.</li>
<li><strong>Vote Prediction</strong>:
<ul>
<li>Project bill embedding to legislator space: $v_{BL}=W_{B}v_{B}+b_{B}$.</li>
<li>Alignment score: $W_{v}(v_{BL}\odot v_{L})+b_{v}$ (using element-wise multiplication).</li>
<li>Output: Sigmoid activation.</li>
</ul>
</li>
<li><strong>Optimization</strong>: AdaMax algorithm with binary cross-entropy loss.</li>
</ul>
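<p>The forward pass described above can be sketched in a few lines (dimensions follow the 50-d text / 25-d legislator embeddings reported in the Models section; the weights here are random placeholders, not trained values, and all function names are my own):</p>

```python
import numpy as np

D_TEXT, D_LEG = 50, 25  # text and legislator embedding sizes from the paper

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bill_embedding(T, p_r, p_d, a_r, a_d):
    """v_B = ((a_r p_r) . T_r) + ((a_d p_d) . T_d), with T_r = T_d = T."""
    return (a_r * p_r) * T + (a_d * p_d) * T

def vote_prob(T, p_r, p_d, v_L, a_r, a_d, W_B, b_B, W_v, b_v):
    v_B = bill_embedding(T, p_r, p_d, a_r, a_d)
    v_BL = W_B @ v_B + b_B                      # project bill into legislator space
    return sigmoid(W_v @ (v_BL * v_L) + b_v)    # elementwise product, linear, sigmoid
```

<p>In the paper these parameters are learned end-to-end with AdaMax and a binary cross-entropy loss; the sketch shows only the forward pass.</p>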
<h3 id="models">Models</h3>
<ul>
<li><strong>Text Encoders</strong>:
<ul>
<li><strong>CNN</strong>: 4-grams with 400 filter maps.</li>
<li><strong>MWE</strong>: <a href="/posts/intro-to-word-embeddings/">Mean Word Embedding</a>.</li>
</ul>
</li>
<li><strong>Embeddings</strong>:
<ul>
<li>Initialized with 50-dimensional GloVe vectors.</li>
<li>Embeddings are non-static (updated during training).</li>
<li>Legislator embedding size ($v_{L}$): 25 dimensions.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Weights initialized with Glorot uniform distribution.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy.</li>
<li><strong>Comparison</strong>:
<ul>
<li><strong>In-session</strong>: 5-fold cross-validation.</li>
<li><strong>Out-of-session</strong>: Train on past sessions, predict future sessions.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Config</strong>: Models trained for 50 epochs with mini-batches of size 50. No specific GPU or compute requirements are reported.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.govtrack.us/">GovTrack</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Source for bill texts and roll-call votes</td>
      </tr>
  </tbody>
</table>
<p>No official code repository or pretrained models were released with this paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kornilova, A., Argyle, D., &amp; Eidelman, V. (2018). Party Matters: Enhancing Legislative Embeddings with Author Attributes for Vote Prediction. <em>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</em>, 510-515. <a href="https://doi.org/10.18653/v1/p18-2081">https://doi.org/10.18653/v1/p18-2081</a></p>
<p><strong>Publication</strong>: ACL 2018</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kornilovaPartyMattersEnhancing2018,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Party {{Matters}}: {{Enhancing Legislative Embeddings}} with {{Author Attributes}} for {{Vote Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Party {{Matters}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kornilova, Anastassia and Argyle, Daniel and Eidelman, Vlad}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 56th {{Annual Meeting}} of the {{Association}} for {{Computational Linguistics}} ({{Volume}} 2: {{Short Papers}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{510--515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Melbourne, Australia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.18653/v1/p18-2081}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{1805.08182}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Oscillatory CO Oxidation on Pt(110): Temporal Modeling</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/oscillatory-co-oxidation-pt110-1992/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/oscillatory-co-oxidation-pt110-1992/</guid><description>A kinetic model using coupled ODEs to explain temporal self-organization and mixed-mode oscillations in catalytic CO oxidation on Pt(110).</description><content:encoded><![CDATA[<p><strong>Related Work</strong>: This builds on <a href="/notes/chemistry/molecular-simulation/surface-science/kinetic-oscillations-pt100-1985/">Kinetic Oscillations on Pt(100)</a>, which established that surface phase transitions drive oscillatory catalysis. The Pt(110) system exhibits richer dynamics including mixed-mode oscillations and chaos.</p>
<h2 id="method-presentation-modeling-temporal-self-organization">Method Presentation: Modeling Temporal Self-Organization</h2>
<p>This is primarily a <strong>Method</strong> paper, supported by <strong>Theory</strong>.</p>
<ul>
<li><strong>Method</strong>: The authors construct a specific computational architecture, a set of coupled Ordinary Differential Equations (ODEs), to simulate the catalytic oxidation of CO. They systematically &ldquo;ablate&rdquo; the model, starting with 2 variables (bistability only), adding a 3rd (simple oscillations), and finally a 4th (mixed-mode oscillations) to demonstrate the necessity of each physical component.</li>
<li><strong>Theory</strong>: The model is analyzed using formal bifurcation theory (continuation methods) to map the topology of the phase space (Hopf bifurcations, saddle-node loops, etc.).</li>
</ul>
<h2 id="motivation-bridging-microscopic-structure-and-macroscopic-dynamics">Motivation: Bridging Microscopic Structure and Macroscopic Dynamics</h2>
<p>The Pt(110) surface exhibits complex temporal behavior during CO oxidation, including bistability, sustained oscillations, mixed-mode oscillations (MMOs), and chaos. Previous simple models could explain bistability but failed to capture the oscillatory dynamics observed experimentally. There was a need for a &ldquo;realistic&rdquo; model that used physically derived parameters to quantitatively link microscopic surface changes (structural phase transitions) to macroscopic reaction rates.</p>
<h2 id="novelty-coupling-reaction-kinetics-and-surface-phase-transitions">Novelty: Coupling Reaction Kinetics and Surface Phase Transitions</h2>
<p>The core novelty is the <strong>&ldquo;Reconstruction Model&rdquo;</strong>, which couples the chemical kinetics (Langmuir-Hinshelwood mechanism) with the physical structural phase transition of the platinum surface ($1\times1 \leftrightarrow 1\times2$).</p>
<ul>
<li>They treat the surface structure as a dynamic variable ($w$).</li>
<li>They introduce a fourth variable ($z$) representing &ldquo;faceting&rdquo; to explain complex mixed-mode oscillations, identifying the interplay between two negative feedback loops on different time scales as the driver for this behavior.</li>
</ul>
<h2 id="methodology-experimental-parameters-and-bifurcation-topology">Methodology: Experimental Parameters and Bifurcation Topology</h2>
<p>The validation approach involved a tight loop between numerical simulation and physical experiment:</p>
<ol>
<li><strong>Parameter Determination</strong>: They experimentally measured individual rate constants (sticking coefficients, desorption energies) using Surface Science techniques (LEED, TDS) to ground the model in reality.</li>
<li><strong>Bifurcation Analysis</strong>: They used numerical continuation methods (AUTO package) to compute &ldquo;skeleton bifurcation diagrams,&rdquo; mapping the boundaries between stable states, simple oscillations, and chaos in parameter space ($p_{CO}$ vs $p_{O_2}$).</li>
<li><strong>Physical Validation</strong>: These diagrams were compared directly against experimental work function ($\Delta \phi$) measurements and LEED intensity profiles to verify the existence regions of different dynamic regimes.</li>
</ol>
<h2 id="results-and-limitations-mixed-mode-oscillations-vs-spatiotemporal-chaos">Results and Limitations: Mixed-Mode Oscillations vs. Spatiotemporal Chaos</h2>
<ul>
<li><strong>Successes</strong>: The 3-variable model successfully reproduces bistability and simple oscillations (limit cycles). The extended 4-variable model qualitatively captures mixed-mode oscillations (MMOs).</li>
<li><strong>Mechanism</strong>: Oscillations arise from the delay between CO adsorption and the resulting surface phase transition (which changes oxygen sticking probabilities).</li>
<li><strong>Limitations</strong>: The 4-variable model only reproduces one type of MMO; certain experimental patterns (e.g., square-wave forms with small oscillations on both high and low work-function levels) were not obtained. The oscillatory region also does not extend to low temperatures as observed experimentally. More fundamentally, the ODE model fails to predict the period-doubling cascade to chaos or hyperchaos observed in experiments. The authors conclude these are likely spatiotemporal phenomena (involving wave propagation and pattern formation) that require Partial Differential Equations (PDEs).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The paper provides a complete set of equations and parameters required to reproduce the dynamics.</p>
<h3 id="data-parameters">Data (Parameters)</h3>
<p>The model uses kinetic parameters derived from Pt(110) experiments. Key constants for reproduction:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">$\kappa_c$</td>
          <td style="text-align: left">$3.135 \times 10^5 \, s^{-1} \text{mbar}^{-1}$</td>
          <td style="text-align: left">Rate of CO hitting surface</td>
      </tr>
      <tr>
          <td style="text-align: left">$s_c$</td>
          <td style="text-align: left">$1.0$</td>
          <td style="text-align: left">CO sticking coefficient</td>
      </tr>
      <tr>
          <td style="text-align: left">$q$</td>
          <td style="text-align: left">$3$</td>
          <td style="text-align: left">Mobility parameter of precursor adsorption</td>
      </tr>
      <tr>
          <td style="text-align: left">$u_s$</td>
          <td style="text-align: left">$1.0$</td>
          <td style="text-align: left">Saturation coverage ($CO$)</td>
      </tr>
      <tr>
          <td style="text-align: left">$\kappa_o$</td>
          <td style="text-align: left">$5.858 \times 10^5 \, s^{-1} \text{mbar}^{-1}$</td>
          <td style="text-align: left">Rate of $O_2$ hitting surface</td>
      </tr>
      <tr>
          <td style="text-align: left">$s_{o,1\times2}$</td>
          <td style="text-align: left">$0.4$</td>
          <td style="text-align: left">$O_2$ sticking coeff ($1\times2$ phase)</td>
      </tr>
      <tr>
          <td style="text-align: left">$s_{o,1\times1}$</td>
          <td style="text-align: left">$0.6$</td>
          <td style="text-align: left">$O_2$ sticking coeff ($1\times1$ phase)</td>
      </tr>
      <tr>
          <td style="text-align: left">$v_s$</td>
          <td style="text-align: left">$0.8$</td>
          <td style="text-align: left">Saturation coverage ($O$)</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_{r}^{0}$</td>
          <td style="text-align: left">$3 \times 10^6 \, s^{-1}$</td>
          <td style="text-align: left">Reaction pre-exponential</td>
      </tr>
      <tr>
          <td style="text-align: left">$E_r$</td>
          <td style="text-align: left">$10 \, \text{kcal/mol}$</td>
          <td style="text-align: left">Reaction activation energy</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_{d}^{0}$</td>
          <td style="text-align: left">$2 \times 10^{16} \, s^{-1}$</td>
          <td style="text-align: left">Desorption pre-exponential</td>
      </tr>
      <tr>
          <td style="text-align: left">$E_d$</td>
          <td style="text-align: left">$38 \, \text{kcal/mol}$</td>
          <td style="text-align: left">Desorption activation energy</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_{p}^{0}$</td>
          <td style="text-align: left">$10^2 \, s^{-1}$</td>
          <td style="text-align: left">Phase transition pre-exponential</td>
      </tr>
      <tr>
          <td style="text-align: left">$E_p$</td>
          <td style="text-align: left">$7 \, \text{kcal/mol}$</td>
          <td style="text-align: left">Phase transition activation energy</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_f$</td>
          <td style="text-align: left">$0.03 \, s^{-1}$</td>
          <td style="text-align: left">Rate of facet formation</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_{t}^{0}$</td>
          <td style="text-align: left">$2.65 \times 10^5 \, s^{-1}$</td>
          <td style="text-align: left">Thermal annealing pre-exponential</td>
      </tr>
      <tr>
          <td style="text-align: left">$E_t$</td>
          <td style="text-align: left">$20 \, \text{kcal/mol}$</td>
          <td style="text-align: left">Thermal annealing activation energy</td>
      </tr>
      <tr>
          <td style="text-align: left">$s_{o,3}$</td>
          <td style="text-align: left">$0.2$</td>
          <td style="text-align: left">Increase of $s_o$ for max faceting ($z=1$)</td>
      </tr>
  </tbody>
</table>
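<p>The Arrhenius parameters in the table fix the working rate constants once a temperature is chosen. A quick sketch (evaluated at the 540 K operating point used for the Figure 7 replication later in these notes) makes the separation of time scales explicit:</p>

```python
import math

R = 0.001987  # gas constant, kcal/(mol K)
T = 540.0     # temperature of the Figure 7 oscillation regime

def arrhenius(k0, Ea, T=T):
    """Rate constant k = k0 * exp(-Ea / (R T)), with Ea in kcal/mol."""
    return k0 * math.exp(-Ea / (R * T))

k_r = arrhenius(3.0e6, 10.0)   # LH surface reaction: hundreds per second
k_d = arrhenius(2.0e16, 38.0)  # CO desorption: a few per second
k_p = arrhenius(1.0e2, 7.0)    # phase transition: fraction per second
```

Note the ordering $k_r \gg k_d \gg k_p$: reaction is fast, desorption is intermediate, and the surface reconstruction is slow. This disparity is why stiff integrators are recommended in the evaluation section.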
<h3 id="algorithms-the-equations">Algorithms (The Equations)</h3>
<p>The system is defined by a set of coupled Ordinary Differential Equations (ODEs).</p>
<p><strong>1. Basic 3-Variable Model (Reconstruction Model)</strong></p>
<p>The core system couples three variables: the CO coverage ($u$), the oxygen coverage ($v$), and the fraction of the surface in the $1\times1$ phase ($w$):</p>
<p>$$
\begin{aligned}
\dot{u} &amp;= p_{CO} \kappa_c s_c \left(1 - \left(\frac{u}{u_s}\right)^q \right) - k_d u - k_r u v \\
\dot{v} &amp;= p_{O_2} \kappa_o s_o \left(1 - \frac{u}{u_s} - \frac{v}{v_s}\right)^2 - k_r u v \\
\dot{w} &amp;= k_p (w_{eq}(u) - w)
\end{aligned}
$$</p>
<p><em>Note:</em> The oxygen sticking coefficient $s_o$ dynamically depends on the structure $w$, calculated as $s_o = w \cdot s_{o,1\times1} + (1-w) \cdot s_{o,1\times2}$. The equilibrium function $w_{eq}(u)$ is a polynomial step function that activates the phase transition:</p>
<p>$$
w_{eq}(u) =
\begin{cases}
0 &amp; u \le 0.2 \\
\sum_{i=0}^3 r_i u^i &amp; 0.2 &lt; u &lt; 0.5 \\
1 &amp; u \ge 0.5
\end{cases}
$$</p>
<p>The polynomial coefficients from Table II are: $r_3 = -1/0.0135$, $r_2 = -1.05 r_3$, $r_1 = 0.3 r_3$, $r_0 = -0.026 r_3$.</p>
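<p>As a sanity check, the Table II coefficients make the polynomial branch join the constant branches continuously at $u = 0.2$ and $u = 0.5$ and pass through $1/2$ at the midpoint. A small sketch, assuming the coefficients exactly as transcribed above:</p>

```python
# Table II coefficients for the cubic branch of w_eq(u)
r3 = -1.0 / 0.0135
r2 = -1.05 * r3
r1 = 0.3 * r3
r0 = -0.026 * r3

def w_eq(u):
    """Equilibrium 1x1 surface fraction as a function of CO coverage u."""
    if u <= 0.2:
        return 0.0
    if u >= 0.5:
        return 1.0
    return r0 + r1 * u + r2 * u**2 + r3 * u**3
```

Evaluating near the breakpoints confirms the cubic meets 0 at $u = 0.2$ and 1 at $u = 0.5$, so $w_{eq}$ is a smooth sigmoidal switch rather than a hard step.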
<p><strong>2. Extended 4-Variable Model (Faceting)</strong></p>
<p>To reproduce Mixed-Mode Oscillations, the model adds a faceting variable $z$:</p>
<p>$$
\begin{aligned}
s_o &amp;= w \cdot s_{o,1\times1} + (1-w) \cdot s_{o,1\times2} + s_{o,3} z \\
\dot{z} &amp;= k_f \cdot u \cdot v \cdot w \cdot (1-z) - k_t z (1-u)
\end{aligned}
$$</p>
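<p>The extended sticking coefficient and the faceting dynamics translate directly into code. A minimal sketch, with $k_t$ evaluated at 540 K from the Arrhenius parameters in the table above:</p>

```python
import math

R, T = 0.001987, 540.0
s_o11, s_o12, s_o3 = 0.6, 0.4, 0.2   # sticking coefficients per phase
k_f = 0.03                            # facet formation rate, s^-1
k_t = 2.65e5 * math.exp(-20.0 / (R * T))  # thermal annealing of facets

def sticking(w, z):
    """Oxygen sticking coefficient including the faceting contribution."""
    return w * s_o11 + (1 - w) * s_o12 + s_o3 * z

def dz(u, v, w, z):
    """Facet growth (driven by reaction on the 1x1 phase) vs. annealing."""
    return k_f * u * v * w * (1 - z) - k_t * z * (1 - u)
```

The two terms of <code>dz</code> implement the second, slower negative feedback loop: facets grow only while the reaction runs on the $1\times1$ phase, and anneal away on CO-poor surface.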
<h3 id="models">Models</h3>
<p>The authors define two distinct configurations:</p>
<ol>
<li><strong>3-Variable (u, v, w)</strong>: Sufficient for bistability and simple oscillations (limit cycles).</li>
<li><strong>4-Variable (u, v, w, z)</strong>: Required for mixed-mode oscillations (small oscillations superimposed on large relaxation spikes).</li>
</ol>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Bifurcation Analysis</strong>: The system should be evaluated by computing steady states and detecting Hopf bifurcations as a function of $p_{CO}$ and $p_{O_2}$.</li>
<li><strong>Time Integration</strong>: Stiff ODE solvers (e.g., <code>scipy.integrate.odeint</code> or <code>solve_ivp</code> with &lsquo;Radau&rsquo; or &lsquo;BDF&rsquo; method) are recommended due to the differing time scales of reaction ($u,v$) and reconstruction ($w,z$).</li>
</ul>
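<p>A hedged sketch of the recommended stiff time integration, using the same 3-variable right-hand side and 540 K operating point as the reference implementation below (note that <code>solve_ivp</code> expects the signature <code>f(t, y)</code>, the reverse of <code>odeint</code>):</p>

```python
import numpy as np
from scipy.integrate import solve_ivp

R, T = 0.001987, 540.0
p_CO, p_O2 = 3.0e-5, 6.67e-5
k_c, s_c, q = 3.135e5, 1.0, 3.0
k_o, s_o1, s_o2 = 5.858e5, 0.6, 0.4
u_s, v_s = 1.0, 0.8
k_d = 2.0e16 * np.exp(-38.0 / (R * T))
k_r = 3.0e6 * np.exp(-10.0 / (R * T))
k_p = 1.0e2 * np.exp(-7.0 / (R * T))

def rhs(t, y):
    u, v, w = y
    s_o = w * s_o1 + (1 - w) * s_o2
    # smooth cubic step standing in for the Table II polynomial
    if u <= 0.2:
        weq = 0.0
    elif u >= 0.5:
        weq = 1.0
    else:
        x = (u - 0.2) / 0.3
        weq = 3 * x**2 - 2 * x**3
    du = p_CO * k_c * s_c * (1 - (u / u_s)**q) - k_d * u - k_r * u * v
    dv = p_O2 * k_o * s_o * (1 - u / u_s - v / v_s)**2 - k_r * u * v
    dw = k_p * (weq - w)
    return [du, dv, dw]

# Radau is an implicit method suited to the fast/slow time-scale split
sol = solve_ivp(rhs, (0.0, 300.0), [0.1, 0.1, 0.0],
                method="Radau", max_step=0.5)
```

The same call with <code>method="BDF"</code> should give equivalent trajectories; comparing both against the <code>odeint</code> (LSODA) run below is a cheap consistency check.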
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Original</strong>: VAX 6800 and VAX station 3100.</li>
<li><strong>Modern Reqs</strong>: Minimal. Can be solved in milliseconds on any modern CPU using standard scientific libraries (Python/Matlab).</li>
</ul>
<h3 id="reference-implementation">Reference Implementation</h3>
<p>The following Python script implements the 3-variable Reconstruction Model described in the paper, replicating the stable oscillations shown in Figure 7 (T=540K):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> scipy.integrate <span style="color:#f92672">import</span> odeint
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> matplotlib.pyplot <span style="color:#66d9ef">as</span> plt
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># --- 1. CONSTANTS &amp; PARAMETERS ---</span>
</span></span><span style="display:flex;"><span>R <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.001987</span>
</span></span><span style="display:flex;"><span>k_c, s_c, q <span style="color:#f92672">=</span> <span style="color:#ae81ff">3.135e5</span>, <span style="color:#ae81ff">1.0</span>, <span style="color:#ae81ff">3.0</span>
</span></span><span style="display:flex;"><span>k_o, s_o1, s_o2 <span style="color:#f92672">=</span> <span style="color:#ae81ff">5.858e5</span>, <span style="color:#ae81ff">0.6</span>, <span style="color:#ae81ff">0.4</span>
</span></span><span style="display:flex;"><span>k_d0, E_d <span style="color:#f92672">=</span> <span style="color:#ae81ff">2.0e16</span>, <span style="color:#ae81ff">38.0</span>
</span></span><span style="display:flex;"><span>k_r0, E_r <span style="color:#f92672">=</span> <span style="color:#ae81ff">3.0e6</span>, <span style="color:#ae81ff">10.0</span>
</span></span><span style="display:flex;"><span>k_p0, E_p <span style="color:#f92672">=</span> <span style="color:#ae81ff">100.0</span>, <span style="color:#ae81ff">7.0</span>
</span></span><span style="display:flex;"><span>u_s, v_s <span style="color:#f92672">=</span> <span style="color:#ae81ff">1.0</span>, <span style="color:#ae81ff">0.8</span>
</span></span><span style="display:flex;"><span>T, p_CO, p_O2 <span style="color:#f92672">=</span> <span style="color:#ae81ff">540.0</span>, <span style="color:#ae81ff">3.0e-5</span>, <span style="color:#ae81ff">6.67e-5</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate Arrhenius rates</span>
</span></span><span style="display:flex;"><span>k_d <span style="color:#f92672">=</span> k_d0 <span style="color:#f92672">*</span> np<span style="color:#f92672">.</span>exp(<span style="color:#f92672">-</span>E_d <span style="color:#f92672">/</span> (R <span style="color:#f92672">*</span> T))
</span></span><span style="display:flex;"><span>k_r <span style="color:#f92672">=</span> k_r0 <span style="color:#f92672">*</span> np<span style="color:#f92672">.</span>exp(<span style="color:#f92672">-</span>E_r <span style="color:#f92672">/</span> (R <span style="color:#f92672">*</span> T))
</span></span><span style="display:flex;"><span>k_p <span style="color:#f92672">=</span> k_p0 <span style="color:#f92672">*</span> np<span style="color:#f92672">.</span>exp(<span style="color:#f92672">-</span>E_p <span style="color:#f92672">/</span> (R <span style="color:#f92672">*</span> T))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">model</span>(y, t):
</span></span><span style="display:flex;"><span>    u, v, w <span style="color:#f92672">=</span> y
</span></span><span style="display:flex;"><span>    s_o <span style="color:#f92672">=</span> w <span style="color:#f92672">*</span> s_o1 <span style="color:#f92672">+</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> w) <span style="color:#f92672">*</span> s_o2
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Smooth cubic step in place of the paper&#39;s Table II polynomial for w_eq</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> u <span style="color:#f92672">&lt;=</span> <span style="color:#ae81ff">0.2</span>: weq <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> u <span style="color:#f92672">&gt;=</span> <span style="color:#ae81ff">0.5</span>: weq <span style="color:#f92672">=</span> <span style="color:#ae81ff">1.0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        x <span style="color:#f92672">=</span> (u <span style="color:#f92672">-</span> <span style="color:#ae81ff">0.2</span>) <span style="color:#f92672">/</span> <span style="color:#ae81ff">0.3</span>
</span></span><span style="display:flex;"><span>        weq <span style="color:#f92672">=</span> <span style="color:#ae81ff">3</span><span style="color:#f92672">*</span>x<span style="color:#f92672">**</span><span style="color:#ae81ff">2</span> <span style="color:#f92672">-</span> <span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>x<span style="color:#f92672">**</span><span style="color:#ae81ff">3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    r_reac <span style="color:#f92672">=</span> k_r <span style="color:#f92672">*</span> u <span style="color:#f92672">*</span> v
</span></span><span style="display:flex;"><span>    du <span style="color:#f92672">=</span> p_CO <span style="color:#f92672">*</span> k_c <span style="color:#f92672">*</span> s_c <span style="color:#f92672">*</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> (u<span style="color:#f92672">/</span>u_s)<span style="color:#f92672">**</span>q) <span style="color:#f92672">-</span> k_d <span style="color:#f92672">*</span> u <span style="color:#f92672">-</span> r_reac
</span></span><span style="display:flex;"><span>    dv <span style="color:#f92672">=</span> p_O2 <span style="color:#f92672">*</span> k_o <span style="color:#f92672">*</span> s_o <span style="color:#f92672">*</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> u<span style="color:#f92672">/</span>u_s <span style="color:#f92672">-</span> v<span style="color:#f92672">/</span>v_s)<span style="color:#f92672">**</span><span style="color:#ae81ff">2</span> <span style="color:#f92672">-</span> r_reac
</span></span><span style="display:flex;"><span>    dw <span style="color:#f92672">=</span> k_p <span style="color:#f92672">*</span> (weq <span style="color:#f92672">-</span> w)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> [du, dv, dw]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># --- 2. SIMULATION STRATEGY ---</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Simulate for 300 seconds to kill transients</span>
</span></span><span style="display:flex;"><span>t_full <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linspace(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">300</span>, <span style="color:#ae81ff">3000</span>)
</span></span><span style="display:flex;"><span>y0 <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.0</span>]
</span></span><span style="display:flex;"><span>solution <span style="color:#f92672">=</span> odeint(model, y0, t_full)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># --- 3. SLICING FOR FIGURE 7 ---</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Only take the last 60 seconds (stable limit cycle)</span>
</span></span><span style="display:flex;"><span>mask <span style="color:#f92672">=</span> (t_full <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">240</span>) <span style="color:#f92672">&amp;</span> (t_full <span style="color:#f92672">&lt;</span> <span style="color:#ae81ff">300</span>)
</span></span><span style="display:flex;"><span>t_plot <span style="color:#f92672">=</span> t_full[mask]
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Shift time axis to start at 10s (matching Fig 7 style)</span>
</span></span><span style="display:flex;"><span>t_display <span style="color:#f92672">=</span> t_plot <span style="color:#f92672">-</span> t_plot[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">+</span> <span style="color:#ae81ff">10</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>u_plot <span style="color:#f92672">=</span> solution[mask, <span style="color:#ae81ff">0</span>]
</span></span><span style="display:flex;"><span>v_plot <span style="color:#f92672">=</span> solution[mask, <span style="color:#ae81ff">1</span>]
</span></span><span style="display:flex;"><span>w_plot <span style="color:#f92672">=</span> solution[mask, <span style="color:#ae81ff">2</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># --- 4. VISUALIZATION ---</span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>figure(figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">8</span>, <span style="color:#ae81ff">5</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Plot CO (u) and Structure (w) on top (Primary Axis)</span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>plot(t_display, w_plot, <span style="color:#e6db74">&#39;g--&#39;</span>, label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;1x1 Fraction (w)&#39;</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">1.5</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>plot(t_display, u_plot, <span style="color:#e6db74">&#39;k-&#39;</span>, label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;CO Coverage (u)&#39;</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Plot Oxygen (v) on bottom</span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>plot(t_display, v_plot, <span style="color:#e6db74">&#39;r-.&#39;</span>, label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Oxygen (v)&#39;</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">1.5</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>title(<span style="color:#e6db74">&#39;Replication of Figure 7: Stable Oscillations&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>xlabel(<span style="color:#e6db74">&#39;Time (s)&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74">&#39;Coverage [ML]&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>legend(loc<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;upper center&#39;</span>, ncol<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>xlim(<span style="color:#ae81ff">10</span>, <span style="color:#ae81ff">60</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>ylim(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1.0</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>grid(<span style="color:#66d9ef">True</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.3</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>show()
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/notes/oscillatory-co-pt110-replication.webp"
         alt="Replication of Figure 7 showing stable oscillations in CO oxidation on Pt(110)"
         title="Replication of Figure 7 showing stable oscillations in CO oxidation on Pt(110)"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Output of the reference implementation showing stable oscillations on Pt(110)</figcaption>
    
</figure>

<p>This plot faithfully replicates the stable limit cycle shown in <strong>Figure 7</strong> of the paper:</p>
<ul>
<li><strong>Timeframe</strong>: Shows a 50-second window (labeled 10-60s) after initial transients have died out.</li>
<li><strong>Period</strong>: Regular oscillations with a period of roughly 7-8 seconds.</li>
<li><strong>Phase Relationship</strong>: The surface phase reconstruction ($w$, green dashed) lags slightly behind the CO coverage ($u$, black solid). This delay is the crucial &ldquo;memory&rdquo; effect that enables the oscillation.</li>
<li><strong>Anticorrelation</strong>: The oxygen coverage ($v$, red dash-dot) spikes exactly when the surface is in the active $1\times1$ phase (high $w$) and CO is low, confirming the &ldquo;Langmuir-Hinshelwood&rdquo; reaction mechanism.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krischer, K., Eiswirth, M., &amp; Ertl, G. (1992). Oscillatory CO oxidation on Pt(110): Modeling of temporal self-organization. <em>The Journal of Chemical Physics</em>, 96(12), 9161-9172. <a href="https://doi.org/10.1063/1.462226">https://doi.org/10.1063/1.462226</a></p>
<p><strong>Publication</strong>: Journal of Chemical Physics 1992</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krischerOscillatoryCOOxidation1992,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Oscillatory {{CO}} Oxidation on {{Pt}}(110): {{Modeling}} of Temporal Self-organization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Oscillatory {{CO}} Oxidation on {{Pt}}(110)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krischer, K. and Eiswirth, M. and Ertl, G.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1992</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Chemical Physics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{96}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{9161--9172}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0021-9606, 1089-7690}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1063/1.462226}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Optical Recognition of Chemical Graphics</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</guid><description>A 1993 prototype system for converting scanned chemical diagrams into connection tables using vectorization and heuristic-based structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-early-ocsr-pipeline-methodology">Contribution: Early OCSR Pipeline Methodology</h2>
<p><strong>Method</strong>. This paper proposes a novel architectural pipeline for the automatic recognition of chemical structure diagrams. It defines a specific sequence of algorithmic steps, including diagram separation, vectorization, segmentation, and structural analysis, which converts pixel data into a semantic chemical representation (MDL Molfile).</p>
<h2 id="motivation-digitizing-legacy-chemical-data">Motivation: Digitizing Legacy Chemical Data</h2>
<p><strong>Problem</strong>: In 1993, vast databases of chemical information existed, but the entry of graphical data was significantly less advanced than the facilities for manipulating it.</p>
<p><strong>Gap</strong>: Creating digital chemical structures required trained operators to manually redraw diagrams that already existed in printed journals and catalogs, leading to a costly duplication of effort.</p>
<p><strong>Goal</strong>: To automate the creation of coded representations (connection tables) directly from optically scanned diagrams on printed pages.</p>
<h2 id="novelty-general-document-analysis-integrated-with-chemical-rules">Novelty: General Document Analysis Integrated with Chemical Rules</h2>
<p><strong>Pipeline Approach</strong>: The authors present a complete end-to-end system that integrates general document analysis with domain-specific chemical rules.</p>
<p><strong>Convex Bounding Separation</strong>: A novel use of &ldquo;bounding polygons&rdquo; defined by 8 fixed-direction bands to distinguish diagram components from text with linear computational cost.</p>
<p><strong>Vector-Based Segmentation</strong>: The system uses the output of a vectorizer (GIFTS) to classify diagram elements. It relies on the observation that vectorizers approximate characters with sets of short vectors to distinguish them from bonds.</p>
<h2 id="methodology-and-system-evaluation">Methodology and System Evaluation</h2>
<p><strong>System Implementation</strong>: The algorithm was implemented in &lsquo;C&rsquo; on IBM PS/2 personal computers running OS/2 Presentation Manager.</p>
<p><strong>Input Specification</strong>: The system was tested on documents scanned at 300 dpi using an IBM 3119 scanner.</p>
<p><strong>Qualitative Evaluation</strong>: The authors evaluated the system on &ldquo;typical scanned structures&rdquo; and &ldquo;simple planar diagrams&rdquo;. Large-scale quantitative benchmarking was not conducted in this work.</p>
<h2 id="results-performance-and-limitations">Results, Performance, and Limitations</h2>
<p><strong>Performance</strong>: The prototype processes a typical structure (after extraction) in less than one minute.</p>
<p><strong>Accuracy</strong>: It is reported to be accurate for simple planar diagrams.</p>
<p><strong>Output Format</strong>: The system successfully generates MDL Molfiles that interface with standard chemistry software like REACCS, MACCS, and modeling tools.</p>
<p><strong>Limitations</strong>: The system struggles with broken lines, characters touching bond structures, and requires manual intervention for complex errors.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed (Historical). As an early prototype from 1993, no source code, datasets, or digital models were publicly released. Reproducing this exact system would require recreating the pipeline from the described heuristics and sourcing vintage OCR software.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No digital artifacts were released with this 1993 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The paper does not release a dataset but specifies the input requirements for the system.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input</td>
          <td>Scanned Documents</td>
          <td>N/A</td>
          <td>Black ink on white paper; scanned at 300 dpi bi-level.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper relies on a pipeline of specific heuristics and geometric rules.</p>
<p><strong>1. Diagram Separation (Region Growing)</strong></p>
<ul>
<li><strong>Bounding Polygons</strong>: Uses convex polygons defined by pairs of parallel sides in 8 fixed directions. This approximation improves distance estimation compared to bounding rectangles.</li>
<li><strong>Seed Detection</strong>: Finds a connected component with bounding dimension $D &gt; d_{\text{max char size}}$.</li>
<li><strong>Aggregation</strong>: Iteratively searches for neighboring components within a specific distance threshold $d_t$ (where $d_t$ is smaller than the whitespace margin) and merges them into the bounding polygon.</li>
</ul>
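<p>The 8-direction bounding polygon can be represented by extents along four axes ($x$, $y$, and the two diagonals), each (min, max) pair defining two parallel supporting lines. A sketch of this idea, assuming components given as simple point sets:</p>

```python
def bounding_polygon_8(points):
    """Extents of a point set along x, y, x+y, and x-y.

    Each (min, max) pair defines a band of two parallel supporting
    lines, so the eight values together describe a convex octagonal
    bound computable in one linear pass over the pixels, matching
    the paper's linear-cost claim.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    d1 = [p[0] + p[1] for p in points]  # NE/SW diagonal band
    d2 = [p[0] - p[1] for p in points]  # NW/SE diagonal band
    return {
        "x": (min(xs), max(xs)),
        "y": (min(ys), max(ys)),
        "x+y": (min(d1), max(d1)),
        "x-y": (min(d2), max(d2)),
    }
```

Distance between two components can then be estimated from the gaps between their corresponding bands, which is tighter than axis-aligned bounding rectangles for diagonally adjacent components.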
<p><strong>2. Vectorization &amp; Segmentation</strong></p>
<ul>
<li><strong>Vectorization</strong>: Uses the GIFTS system (IBM Tokyo) to fit lines to pixels.</li>
<li><strong>Classification Heuristics</strong>:
<ul>
<li><strong>Ratio Test</strong>: If the ratio of a group&rsquo;s dimension to the full diagram dimension is below a threshold $\tau$, it is classified as a <strong>Symbol</strong>:
$$ \frac{D_{\text{group}}}{D_{\text{diagram}}} &lt; \tau $$</li>
<li><strong>Context Rule</strong>: Small vector groups near letters are classified as <strong>Characters</strong> (handles &lsquo;l&rsquo; in &lsquo;Cl&rsquo;).</li>
<li><strong>Circle Rule</strong>: A group is a <strong>Circle</strong> (aromatic ring) if it contains $N \ge 8$ vectors in a roughly circular arrangement.</li>
<li><strong>Default</strong>: Otherwise, classified as <strong>Bond Structure</strong>.</li>
</ul>
</li>
</ul>
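<p>Taken together, these heuristics form a small decision cascade. A minimal Python sketch, where <code>tau</code>, the adjacency flag, and the circularity test are illustrative stand-ins for the paper&rsquo;s geometric checks, not the original implementation:</p>

```python
def classify_vector_group(group_dim, diagram_dim, n_vectors,
                          is_circular, near_letters, tau=0.1):
    """Sketch of the classification cascade for a vector group."""
    if group_dim / diagram_dim < tau:
        # Ratio test: small relative to the diagram -> symbol,
        # unless nearby letters say it belongs to an atom label.
        return "Character" if near_letters else "Symbol"
    if is_circular and n_vectors >= 8:
        # Circle rule: >= 8 vectors in a roughly circular arrangement
        return "Circle"
    return "Bond Structure"
```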
<p><strong>3. Cleanup &amp; Structure Recognition</strong></p>
<ul>
<li><strong>Short Vector Removal</strong>: Vectors shorter than a fraction of the median line length $L_{\text{median}}$ are shrunk to their midpoint (fixing broken junctions).</li>
<li><strong>Vertex Merging</strong>: If two vectors meet at an angle $\theta &lt; 35^{\circ}$, the vertex is removed (fixing single lines broken into two).</li>
<li><strong>Aromatic Processing</strong>: If a circle is detected, the system identifies the 6 closest atoms and adds double bonds to every second bond in the ring.</li>
</ul>
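<p>The vertex-merging rule reads as a collinearity test. A sketch under the assumption that &ldquo;meet at an angle $\theta &lt; 35^{\circ}$&rdquo; refers to the bend between consecutive segments at the shared vertex, with points as plain tuples:</p>

```python
import math

def should_merge_vertex(a, b, c, max_bend_deg=35.0):
    """Return True if segments a->b and b->c bend by less than
    max_bend_deg at b, i.e. the vertex should be removed and the
    two vectors fused into a single line."""
    ux, uy = b[0] - a[0], b[1] - a[1]
    vx, vy = c[0] - b[0], c[1] - b[1]
    cos_bend = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    bend = math.degrees(math.acos(max(-1.0, min(1.0, cos_bend))))
    return bend < max_bend_deg
```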
<h3 id="models">Models</h3>
<p><strong>OCR</strong>:</p>
<ul>
<li>The system uses a feature-based, single-font OCR engine.</li>
<li>It assumes non-serif, plain styles typical of drafting standards.</li>
<li>Character images are normalized for size before recognition.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Scanner</strong>: IBM 3119 (300 dpi).</li>
<li><strong>Compute</strong>: IBM PS/2 series running OS/2.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Casey, R., et al. (1993). Optical Recognition of Chemical Graphics. <em>Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR &lsquo;93)</em>, 627-631. <a href="https://doi.org/10.1109/ICDAR.1993.395658">https://doi.org/10.1109/ICDAR.1993.395658</a></p>
<p><strong>Publication</strong>: ICDAR 1993</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{caseyOpticalRecognitionChemical1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical Recognition of Chemical Graphics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of 2nd {{International Conference}} on {{Document Analysis}} and {{Recognition}} ({{ICDAR}} &#39;93)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Casey, R. and Boyer, S. and Healey, P. and Miller, A. and Oudot, B. and Zilles, K.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{627--631}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE Comput. Soc. Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tsukuba Science City, Japan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1993.395658}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mixture Density Networks: Modeling Multimodal Distributions</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/mixture-density-networks/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/mixture-density-networks/</guid><description>A 1994 technical report introducing Mixture Density Networks (MDNs) to model arbitrary conditional probability distributions using neural networks.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It identifies a specific failure mode in existing neural network methodologies (least-squares regression on multi-valued inverse problems) and proposes a novel architecture (combining MLPs with Mixture Models) to solve it. It derives the mathematical framework for training this architecture via standard back-propagation and validates it against the established baseline.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Standard neural networks trained with sum-of-squares (MSE) or cross-entropy error functions approximate the <strong>conditional average</strong> of the target data, $\langle t|x \rangle$.</p>
<p>While optimal for single-valued functions or classification, this produces completely erroneous results for <strong>inverse problems</strong> where the mapping is multi-valued (one input has multiple valid outputs). For example, in robot inverse kinematics, &ldquo;elbow-up&rdquo; and &ldquo;elbow-down&rdquo; configurations can achieve the same hand position. An MSE-trained network will average these two valid angles, resulting in an invalid configuration (the paper shows this produces end-effector positions at the outer boundary of the accessible region, corresponding to $\theta_2 = \pi$).</p>















<figure class="post-figure center ">
    <img src="/img/notes/single-gaussian-mse-prediction.webp"
         alt="Single Gaussian MSE prediction averaging multimodal distribution"
         title="Single Gaussian MSE prediction averaging multimodal distribution"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">MSE-trained networks predict the mean, which averages across modes and produces invalid outputs for inverse problems.</figcaption>
    
</figure>

<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The introduction of the <strong>Mixture Density Network (MDN)</strong>.</p>
<p>The neural network predicts the <strong>parameters</strong> (mixing coefficients, means, and variances) of a kernel mixture distribution (typically Gaussian).</p>















<figure class="post-figure center ">
    <img src="/img/notes/gaussian-mixture-mdn-prediction.webp"
         alt="Gaussian mixture model prediction capturing multimodal distribution"
         title="Gaussian mixture model prediction capturing multimodal distribution"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">MDNs predict mixture parameters to capture the full conditional probability density, representing all modes.</figcaption>
    
</figure>

<p>Key technical contributions include:</p>
<ol>
<li><strong>Architecture</strong>: Mapping network outputs to mixture parameters using specific activation functions to satisfy constraints (Softmax for priors $\alpha$, Exponential for variances $\sigma$).</li>
<li><strong>Training</strong>: Deriving the error function as the negative log-likelihood of the mixture model.</li>
<li><strong>Optimization</strong>: Deriving the exact derivatives (gradients) of the error with respect to network outputs, allowing the mixture model parameters to be learned via standard back-propagation.</li>
</ol>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>Bishop validated the method on two tasks, comparing an MDN against a standard MLP trained with least-squares:</p>
<ol>
<li><strong>Toy Inverse Problem</strong>: A sinusoidal mapping $x = t + 0.3\sin(2\pi t) + \epsilon$. The forward problem ($t \to x$) is single-valued, but the inverse ($x \to t$) is multi-valued.</li>
<li><strong>Robot Kinematics</strong>: A 2-link robot arm simulation. The task is to map end-effector Cartesian coordinates $(x_1, x_2)$ back to joint angles $(\theta_1, \theta_2)$.</li>
</ol>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Toy Problem</strong>: The standard least-squares network failed completely, drawing a smooth curve through the average of the multiple branches, which did not correspond to valid data. The MDN correctly modeled the tri-modal density and discontinuous jumps in the most probable solution.</li>
<li><strong>Robot Kinematics</strong>: The MDN reduced the RMS positioning error by an order of magnitude compared to the standard network (0.0053 vs 0.0578).</li>
<li><strong>Generality</strong>: The paper concludes that MDNs provide a complete description of the conditional probability density, allowing users to calculate any statistic (mean, mode, variance) needed for the application.</li>
</ul>
<h2 id="extracting-predictions">Extracting Predictions</h2>
<p>Once trained, the MDN outputs a full conditional density $p(t|x)$, from which several useful statistics can be derived:</p>
<ul>
<li><strong>Conditional mean</strong>: $\langle t|x \rangle = \sum_i \alpha_i(x) \mu_i(x)$, equivalent to the standard least-squares network output.</li>
<li><strong>Conditional variance</strong>: $s^2(x) = \sum_i \alpha_i(x) \left\{ \sigma_i(x)^2 + \left\| \mu_i(x) - \sum_j \alpha_j(x) \mu_j(x) \right\|^2 \right\}$, which is input-dependent (more general than the constant-variance least-squares assumption).</li>
<li><strong>Most probable branch</strong>: Select the kernel $i$ with the largest mixing coefficient $\alpha_i(x)$, then use its center $\mu_i$ as the prediction. This yields a discontinuous but accurate mapping for multi-valued problems.</li>
</ul>
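<p>A minimal NumPy sketch of these three statistics, assuming for one input $x$ the MDN has produced mixing coefficients <code>alpha</code> of shape <code>(m,)</code>, centers <code>mu</code> of shape <code>(m, c)</code>, and common widths <code>sigma</code> of shape <code>(m,)</code>:</p>

```python
import numpy as np

def mdn_statistics(alpha, mu, sigma):
    """Conditional mean, conditional variance, and most-probable-branch
    prediction from MDN outputs (shapes are assumptions, see above)."""
    mean = (alpha[:, None] * mu).sum(axis=0)            # <t|x>
    # s^2(x): within-kernel variance plus between-kernel spread
    var = (alpha * (sigma ** 2 + ((mu - mean) ** 2).sum(axis=1))).sum()
    branch = mu[np.argmax(alpha)]                       # dominant kernel's center
    return mean, var, branch
```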
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Model order selection</strong>: The number of mixture components $m$ must be chosen in advance. The paper acknowledges this as an open problem and suggests cross-validation or Bayesian model comparison as potential approaches.</li>
<li><strong>Computational overhead</strong>: The number of network outputs grows as $(c + 2) \times m$, where $c$ is the target dimensionality. For high-dimensional targets or many kernels, this can become significant.</li>
<li><strong>Isotropic kernels</strong>: The paper uses a single variance parameter $\sigma_i$ per kernel (shared across target dimensions), which assumes isotropic covariance. The paper notes this can be generalized to full covariance matrices at the cost of additional parameters.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>1. Toy Inverse Problem</strong></p>
<ul>
<li><strong>Function</strong>: $x = t + 0.3\sin(2\pi t) + \epsilon$</li>
<li><strong>Noise</strong>: $\epsilon \sim U(-0.1, 0.1)$</li>
<li><strong>Sampling</strong>: 1,000 points generated by sampling $t$ at equal intervals in the range $(0, 1)$.</li>
<li><strong>Task</strong>: Inverse mapping (predict $t$ given $x$).</li>
</ul>
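<p>Regenerating this dataset is a few lines; the random seed and the use of NumPy are assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)                  # seed is an assumption
t = np.linspace(0.0, 1.0, 1000)                 # equal intervals over (0, 1)
x = t + 0.3 * np.sin(2 * np.pi * t) + rng.uniform(-0.1, 0.1, size=t.shape)
# Forward task: t -> x (single-valued); inverse task: predict t from x.
```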
<p><strong>2. Robot Kinematics</strong></p>
<ul>
<li><strong>System</strong>: 2-link arm with lengths $L_1=0.8, L_2=0.2$.</li>
<li><strong>Forward Kinematics</strong>:
<ul>
<li>$x_1 = L_1 \cos(\theta_1) - L_2 \cos(\theta_1 + \theta_2)$</li>
<li>$x_2 = L_1 \sin(\theta_1) - L_2 \sin(\theta_1 + \theta_2)$</li>
</ul>
</li>
<li><strong>Constraints</strong>: $\theta_1 \in (0.3, 1.2)$, $\theta_2 \in (\pi/2, 3\pi/2)$.</li>
<li><strong>Dataset</strong>: 1,000 training points, 1,000 test points.</li>
</ul>
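<p>The forward kinematics above double as the evaluation harness: predicted angles are pushed back through them and compared with the target position. A direct transcription:</p>

```python
import numpy as np

def forward_kinematics(theta1, theta2, L1=0.8, L2=0.2):
    """End-effector position of the 2-link arm (equations as given above)."""
    x1 = L1 * np.cos(theta1) - L2 * np.cos(theta1 + theta2)
    x2 = L1 * np.sin(theta1) - L2 * np.sin(theta1 + theta2)
    return x1, x2
```

At $\theta_2 = \pi$ the two terms add, placing the end effector on the outer boundary of radius $L_1 + L_2 = 1.0$, which is exactly the invalid-average failure mode of the MSE-trained network described earlier.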
<h3 id="algorithms">Algorithms</h3>
<p><strong>Mixture Model Definition</strong></p>
<p>The conditional density is defined as:</p>
<p>$$p(t|x) = \sum_{i=1}^{m} \alpha_i(x) \phi_i(t|x)$$</p>
<p>Where kernels $\phi_i$ are Gaussians with centers $\mu_i(x)$ and variances $\sigma_i(x)$.</p>
<p><strong>Network Output Mappings</strong></p>
<p>If the network produces raw outputs $z$, they are mapped to parameters as follows to satisfy probability constraints:</p>
<ul>
<li><strong>Mixing Coefficients ($\alpha$)</strong>: Softmax. $\alpha_i = \frac{\exp(z_i^\alpha)}{\sum_j \exp(z_j^\alpha)}$</li>
<li><strong>Variances ($\sigma$)</strong>: Exponential. $\sigma_i = \exp(z_i^\sigma)$</li>
<li><strong>Means ($\mu$)</strong>: Linear/Identity. $\mu_{ik} = z_{ik}^\mu$</li>
</ul>
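<p>Assuming the raw output vector is laid out as $[z^\alpha, z^\sigma, z^\mu]$ (the ordering is an assumption; the paper fixes only the activations), the mapping can be sketched as:</p>

```python
import numpy as np

def split_mdn_outputs(z, m, c):
    """Map raw network outputs z of length (c + 2) * m to valid
    mixture parameters using the activations listed above."""
    z_alpha, z_sigma, z_mu = z[:m], z[m:2 * m], z[2 * m:]
    alpha = np.exp(z_alpha - z_alpha.max())
    alpha /= alpha.sum()                  # softmax: positive, sums to 1
    sigma = np.exp(z_sigma)               # exponential: strictly positive
    mu = z_mu.reshape(m, c)               # identity: unconstrained means
    return alpha, sigma, mu
```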
<p><strong>Loss Function</strong></p>
<p>Negative Log Likelihood:</p>
<p>$$E^q = - \ln \left\{ \sum_{i=1}^{m} \alpha_i(x^q) \phi_i(t^q|x^q) \right\}$$</p>
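<p>With isotropic Gaussian kernels $\phi_i(t|x) = (2\pi)^{-c/2} \sigma_i^{-c} \exp\left(-\|t - \mu_i\|^2 / 2\sigma_i^2\right)$, the per-pattern loss can be sketched as:</p>

```python
import numpy as np

def mdn_nll(t, alpha, mu, sigma):
    """Negative log-likelihood of one target t (shape (c,)) under the
    mixture; mu has shape (m, c), alpha and sigma have shape (m,)."""
    c = t.shape[0]
    sq = ((t - mu) ** 2).sum(axis=1)                   # ||t - mu_i||^2
    phi = np.exp(-sq / (2 * sigma ** 2)) / ((2 * np.pi) ** (c / 2) * sigma ** c)
    return -np.log((alpha * phi).sum())
```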
<h3 id="models">Models</h3>
<p><strong>1. Toy Problem Configuration</strong></p>
<ul>
<li><strong>Structure</strong>: MLP with 1 input ($x$), 1 hidden layer.</li>
<li><strong>Hidden Units</strong>: 20 units (tanh activation).</li>
<li><strong>Outputs</strong>: 9 units.
<ul>
<li>$m=3$ Gaussian kernels.</li>
<li>Parameters per kernel: 1 $\alpha$, 1 $\sigma$, 1 $\mu$. Total = $3 \times 3 = 9$.</li>
</ul>
</li>
<li><strong>Training</strong>: 1,000 cycles of BFGS.</li>
</ul>
<p><strong>2. Robot Kinematics Configuration (Least-Squares Baseline)</strong></p>
<ul>
<li><strong>Structure</strong>: MLP with 2 inputs ($x_1, x_2$), 2 linear outputs ($\theta_1, \theta_2$).</li>
<li><strong>Hidden Units</strong>: Best result with 20 units (tanh activation), tested with 5, 10, 15, 20, 25, 30.</li>
<li><strong>Training</strong>: 3,000 cycles of BFGS.</li>
</ul>
<p><strong>3. Robot Kinematics Configuration (MDN)</strong></p>
<ul>
<li><strong>Structure</strong>: MLP with 2 inputs ($x_1, x_2$).</li>
<li><strong>Hidden Units</strong>: 10 units (tanh activation).</li>
<li><strong>Outputs</strong>: 8 units.
<ul>
<li>$m=2$ Gaussian kernels.</li>
<li>Target dimension $c=2$ (predicting $\theta_1, \theta_2$).</li>
<li>Parameters per kernel: 1 $\alpha$ + 1 $\sigma$ (common variance) + 2 $\mu$ (means for $\theta_1, \theta_2$).</li>
<li>Total = $2 \times (1 + 1 + 2) = 8$.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: RMS Euclidean distance between the desired end-effector position and the achieved position (calculated by plugging predicted angles back into forward kinematics).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Hidden Units</th>
          <th>Kernels</th>
          <th>RMS Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Least Squares</td>
          <td>20</td>
          <td>N/A</td>
          <td>0.0578</td>
      </tr>
      <tr>
          <td>MDN</td>
          <td>10</td>
          <td>2</td>
          <td>0.0053</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bishop, C. M. (1994). Mixture Density Networks. <em>Neural Computing Research Group Report: NCRG/94/004</em>, Aston University.</p>
<p><strong>Publication</strong>: Neural Computing Research Group Technical Report 1994</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{bishopMixtureDensityNetworks1994,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Mixture {{Density Networks}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Bishop, Christopher M.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1994</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{NCRG/94/004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{Aston University}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé: OCR-Optical Chemical Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</guid><description>A seminal 1992 system for Optical Chemical Structure Recognition (OCSR) using neural networks and heuristic graph compilation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1992). Kekulé: OCR-Optical Chemical (Structure) Recognition. <em>Journal of Chemical Information and Computer Sciences</em>, 32(4), 373-378. <a href="https://doi.org/10.1021/ci00008a018">https://doi.org/10.1021/ci00008a018</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1992</p>
<h2 id="system-architecture-and-methodological-approach">System Architecture and Methodological Approach</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$). It proposes a novel software architecture (&ldquo;Kekulé&rdquo;) designed to solve a specific technical problem: the automatic conversion of printed chemical structure diagrams into computer-readable connection tables. The paper focuses on the &ldquo;how&rdquo; of the system by detailing the seven-step pipeline from scanning to graph compilation, validating the method through performance testing on a specific dataset.</p>
<h2 id="motivation-bridging-visual-diagrams-and-connection-tables">Motivation: Bridging Visual Diagrams and Connection Tables</h2>
<p>The primary motivation is to bridge the gap between how chemists communicate (structural diagrams) and how chemical databases store information (connection tables like MOLfiles).</p>
<ul>
<li><strong>Inefficiency of Manual Entry</strong>: Manual compilation of structural descriptions is &ldquo;tedious and highly prone to error&rdquo;.</li>
<li><strong>Redrawing Costs</strong>: Even using drawing programs (like ChemDraw ancestors) to capture connectivity is inefficient; redrawing a complex molecule like vitamin $B_{12}$ takes ~20 minutes.</li>
<li><strong>Lack of Existing Solutions</strong>: Existing OCR systems at the time failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), and struggled with small, mixed fonts.</li>
</ul>
<h2 id="novelty-a-hybrid-ocr-and-heuristic-approach">Novelty: A Hybrid OCR and Heuristic Approach</h2>
<p>Kekulé represents the first successful attempt to integrate all of the required elements of image processing, OCR, structure editing, and database communication into a complete system.</p>
<ul>
<li><strong>Hybrid OCR Approach</strong>: Unlike commercial OCR of the time, it used a custom implementation combining rotation correction (for skew) with a <strong>multilayer perceptron neural network</strong> trained specifically on small fonts (down to 3.2 points).</li>
<li><strong>Heuristic Feature Extraction</strong>: The authors developed specific heuristics to handle chemical artifacts, such as an exhaustive search for dashed lines, explicitly rejecting Hough transforms as unreliable for short segments.</li>
<li><strong>Contextual &ldquo;Spell Checking&rdquo;</strong>: The system uses chemical context to verify OCR results, such as checking atom symbols against a valid list and using bond connections to disambiguate characters.</li>
</ul>
<h2 id="experimental-setup-and-dataset-validation">Experimental Setup and Dataset Validation</h2>
<p>The authors performed a validation study on a diverse set of chemical structures to stress-test the system:</p>
<ul>
<li><strong>Dataset</strong>: 444 chemical structures were selected from a wide variety of sources, including the <em>Merck Index</em>, <em>Aldrich Handbook</em>, and <em>ACS Nomenclature Guide</em>, specifically chosen to &ldquo;test Kekulé&rsquo;s limits&rdquo;.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Processing Success</strong>: Percentage of structures processed.</li>
<li><strong>User Intervention</strong>: Average number of prompts per structure for verification.</li>
<li><strong>Editing Time</strong>: Time required to correct interpretation errors (arbitrary &ldquo;good&rdquo; limit set at 30 seconds).</li>
</ul>
</li>
</ul>
<h2 id="results-and-system-performance">Results and System Performance</h2>
<ul>
<li><strong>High Success Rate</strong>: 98.9% of the 444 structures were processed successfully.</li>
<li><strong>Performance Speed</strong>: The average processing time was 9 seconds per structure on an 80486 (33 MHz) processor.</li>
<li><strong>Error Modes</strong>: The primary bottleneck was broken characters in scanned images (e.g., breaks in &lsquo;H&rsquo; or &lsquo;N&rsquo; crossbars), which slowed down the OCR significantly.</li>
<li><strong>Impact</strong>: The system demonstrated that automated interpretation was faster and less error-prone than manual redrawing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The following details outline the specific technical implementation described in the 1992 paper.</p>
<h3 id="data">Data</h3>
<p>The authors did not release a public dataset but described their test set sources in detail.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Mixed Chemical Sources</td>
          <td>444 structures</td>
          <td>Sourced from <em>Merck Index</em>, <em>Aldrich Handbook</em>, <em>ACS Nomenclature Guide</em>, etc.</td>
      </tr>
      <tr>
          <td>Training (OCR)</td>
          <td>Font Exemplars</td>
          <td>Unknown</td>
          <td>&ldquo;Exemplars of characters from numerous serif and sanserif fonts&rdquo;.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a 7-step pipeline. Key algorithmic choices include:</p>
<ul>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li>Images are reduced to 1-pixel width using <strong>thinning</strong> and <strong>raster-to-vector translation</strong>.</li>
<li>An <strong>adaptive smoothing algorithm</strong> is applied to remove pixel-level jitter.</li>
</ul>
</li>
<li>
<p><strong>Feature Extraction (Dashed Lines)</strong>:</p>
<ul>
<li><strong>Hough Transforms</strong> were rejected due to poor performance on short line segments.</li>
<li><strong>Slope sorting</strong> was rejected due to variance in short dashes.</li>
<li><strong>Chosen Method</strong>: Exhaustive search/testing of all features that <em>might</em> be dashed lines (subset of features).</li>
</ul>
</li>
<li>
<p><strong>Graph Compilation</strong>:</p>
<ul>
<li><strong>Character Grouping</strong>: Characters are assembled into strings based on XY adjacency.</li>
<li><strong>Node Creation</strong>: Character strings become nodes. Vectors with endpoints &ldquo;too far&rdquo; from strings create new nodes.</li>
<li><strong>Heuristics</strong>: Circles are converted to alternating single-double bonds; &ldquo;thick&rdquo; bonds between wedges are automatically generated.</li>
</ul>
</li>
</ul>
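<p>The character-grouping step can be approximated greedily. A sketch in which recognized characters arrive as <code>(x, y, symbol)</code> tuples and <code>max_gap</code> is an assumed adjacency threshold; the paper describes only the XY-adjacency idea, not this exact procedure:</p>

```python
def group_characters(chars, max_gap=5):
    """Assemble OCR'd characters into atom-label strings by XY adjacency."""
    chars = sorted(chars)                   # left-to-right by x coordinate
    strings, current = [], [chars[0]]
    for ch in chars[1:]:
        prev = current[-1]
        if ch[0] - prev[0] <= max_gap and abs(ch[1] - prev[1]) <= max_gap:
            current.append(ch)              # close enough: same label
        else:
            strings.append("".join(c[2] for c in current))
            current = [ch]
    strings.append("".join(c[2] for c in current))
    return strings
```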
<h3 id="models">Models</h3>
<p>The core machine learning component is the OCR engine.</p>
<ul>
<li><strong>Architecture</strong>: A <strong>multilayer perceptron neural network</strong> (fully connected).</li>
<li><strong>Input</strong>: Normalized characters. Normalization involves rotation (for skew), scaling, under-sampling, and contrast/density adjustments.</li>
<li><strong>Output</strong>: Ranked probability matches. Outputs above an experimental threshold are retained. If a character is ambiguous (e.g., &lsquo;5&rsquo; vs &lsquo;S&rsquo;), both are kept and resolved via chemical context.</li>
<li><strong>Performance</strong>: Raw accuracy ~96% on small fonts (compared to ~85% for commercial OCR of the era).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The system was developed and tested on hardware typical of the early 1990s.</p>
<ul>
<li><strong>Processor</strong>: Intel 80486 at 33 MHz.</li>
<li><strong>Scanners</strong>: Hewlett-Packard ScanJet (300 dpi) and Logitech ScanMan (400 dpi hand-held).</li>
<li><strong>Platform</strong>: Microsoft Windows.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mcdanielKekuleOCRopticalChemical1992,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Kekulé: {{OCR-optical}} Chemical (Structure) Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Kekulé}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1992</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{32}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{373--378}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00008a018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>IMG2SMI: Translating Molecular Structure Images to SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</guid><description>Campos &amp; Ji's method for converting 2D molecular images to SMILES strings using Transformers and SELFIES representation.</description><content:encoded><![CDATA[<h2 id="contributions--taxonomy">Contributions &amp; Taxonomy</h2>
<p>This is both a <strong>Method</strong> and <strong>Resource</strong> paper:</p>
<ul>
<li><strong>Method</strong>: It adapts standard image captioning architectures (encoder-decoder) to the domain of Optical Chemical Structure Recognition (OCSR), treating molecule recognition as a translation task.</li>
<li><strong>Resource</strong>: It introduces <strong>MOLCAP</strong>, a large-scale dataset of 81 million molecules aggregated from public chemical databases, addressing the data scarcity that previously hindered deep learning approaches to OCSR.</li>
</ul>
<h2 id="the-bottleneck-in-chemical-literature-translation">The Bottleneck in Chemical Literature Translation</h2>
<p>Chemical literature is &ldquo;full of recipes written in a language computers cannot understand&rdquo; because molecules are depicted as 2D images. This creates a fundamental bottleneck:</p>
<ul>
<li><strong>The Problem</strong>: Chemists must manually redraw molecular structures to search for related compounds or reactions. This is slow, error-prone, and makes large-scale literature mining impossible.</li>
<li><strong>Existing Tools</strong>: Legacy systems like OSRA (Optical Structure Recognition Application) rely on handcrafted rules and often require human correction, making them unfit for unsupervised, high-throughput processing.</li>
<li><strong>The Goal</strong>: An automated system that can translate structure images directly to machine-readable strings (SMILES/SELFIES) without human supervision, enabling large-scale knowledge extraction from decades of chemistry literature and patents.</li>
</ul>
<h2 id="core-innovation-selfies-and-image-captioning">Core Innovation: SELFIES and Image Captioning</h2>
<p>The core novelty is demonstrating that <strong>how you represent the output text is as important as the model architecture itself</strong>. Key contributions:</p>
<ol>
<li>
<p><strong>Image Captioning Framework</strong>: Applies modern encoder-decoder architectures (ResNet-101 + Transformer) to OCSR, treating it as an image-to-text translation problem with a standard cross-entropy loss objective over the generation sequence:
$$ \mathcal{L} = -\sum\limits_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x) $$</p>
</li>
<li>
<p><strong>SELFIES as Target Representation</strong>: The key mechanism relies on using <strong>SELFIES</strong> (Self-Referencing Embedded Strings) as the output format. SELFIES is based on a formal grammar where every possible string corresponds to a valid molecule, eliminating the syntactic invalidity problems (unmatched parentheses, invalid characters) that plague SMILES generation.</p>
</li>
<li>
<p><strong>MOLCAP Dataset</strong>: Created a comprehensive dataset of 81 million unique molecules from PubChem, ChEMBL, <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>, and other sources. Generated 256x256 pixel images using RDKit for 1 million training samples and 5,000 validation samples.</p>
</li>
<li>
<p><strong>Task-Specific Evaluation</strong>: Demonstrated that traditional NLP metrics (BLEU) are poor indicators of scientific utility. Introduced evaluation based on <strong>molecular fingerprints</strong> (MACCS, RDK, Morgan) and <strong>Tanimoto similarity</strong>:
$$ T(a, b) = \frac{c}{a + b - c} $$
where $c$ is the number of common fingerprint bits, and $a$ and $b$ are the number of set bits in each respective molecule&rsquo;s fingerprint. This formulation reliably measures functional chemical similarity.</p>
</li>
</ol>
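<p>With fingerprints held as sets of set-bit indices, the Tanimoto metric is a few lines. This is a sketch of the formula above, not the RDKit routine the authors actually used:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints,
    each given as the set of its set-bit indices."""
    c = len(fp_a & fp_b)                  # bits set in both fingerprints
    return c / (len(fp_a) + len(fp_b) - c)
```

Identical fingerprints score 1.0 and disjoint fingerprints score 0.0, which is what makes the metric a graded measure of functional similarity rather than an exact-match test.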
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The evaluation focused on comparing IMG2SMI to existing systems and identifying which design choices matter most:</p>
<ol>
<li>
<p><strong>Baseline Comparisons</strong>: Benchmarked against OSRA (rule-based system) and DECIMER (first deep learning approach) on the MOLCAP dataset to establish whether modern architectures could surpass traditional methods.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Extensive ablations isolating key factors:</p>
<ul>
<li><strong>Decoder Architecture</strong>: Transformer vs. RNN/LSTM decoders</li>
<li><strong>Encoder Fine-tuning</strong>: Fine-tuned vs. frozen pre-trained ResNet weights</li>
<li><strong>Output Representation</strong>: SELFIES vs. character-level SMILES vs. BPE-tokenized SMILES (the most critical ablation)</li>
</ul>
</li>
</ol>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>MACCS FTS</th>
          <th>Valid Captions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN + Fixed Encoder</td>
          <td>0.1526</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RNN + Fine-tuned Encoder</td>
          <td>0.4180</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Transformer + Fixed Encoder</td>
          <td>0.7674</td>
          <td>61.1%</td>
      </tr>
      <tr>
          <td>Transformer + Fine-tuned Encoder</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>Character-level SMILES (fine-tuned)</td>
          <td>N/A</td>
          <td>2.1%</td>
      </tr>
      <tr>
          <td>BPE SMILES (2000 vocab, fine-tuned)</td>
          <td>N/A</td>
          <td>20.0%</td>
      </tr>
      <tr>
          <td>SELFIES (fine-tuned)</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
  </tbody>
</table>
<ol start="3">
<li><strong>Metric Analysis</strong>: Systematic comparison of evaluation metrics including BLEU, ROUGE, Levenshtein distance, exact match accuracy, and molecular fingerprint-based similarity measures.</li>
</ol>
<h2 id="results-findings-and-limitations">Results, Findings, and Limitations</h2>
<p><strong>Performance Gains</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>Random Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>0.0000</td>
          <td>0.3378</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>0.0000</td>
          <td>0.2229</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>0.0000</td>
          <td>0.1081</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>0.0000</td>
          <td>0.0422</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>0.00%</td>
          <td>0.00%</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<ul>
<li>163% improvement over OSRA on MACCS Tanimoto similarity.</li>
<li>Nearly 10x improvement on ROUGE scores (0.6240 vs. 0.0684).</li>
<li>Average Tanimoto similarity exceeds 0.85 (predictions are functionally similar molecules even when they are not exact matches).</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>SELFIES is Critical</strong>: Using SELFIES yields <strong>99.4% valid molecules</strong>, compared to only ~2% validity for character-level SMILES.</li>
<li><strong>Architecture Matters</strong>: Transformer decoder significantly outperforms RNN/LSTM approaches. Fine-tuning the ResNet encoder (vs. frozen weights) yields substantial performance gains (e.g., MACCS FTS: 0.7674 to 0.9475).</li>
<li><strong>Metric Insights</strong>: BLEU is a poor metric for this task. Molecular fingerprint-based Tanimoto similarity is most informative because it measures functional chemical similarity.</li>
</ul>
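<p>The syntactic failure modes of SMILES are easy to see concretely. The minimal checker below is an illustrative sketch only (not RDKit&rsquo;s full sanitization, which also enforces valence); it tests just two bracket-pairing constraints, which character-level generators routinely violate, while every SELFIES string decodes to some valid molecule by construction:</p>

```python
def smiles_syntax_ok(s: str) -> bool:
    """Check two syntactic SMILES constraints: balanced parentheses and
    paired single-digit ring closures. (Real validity additionally
    requires valence checks, e.g. via RDKit sanitization.)"""
    depth = 0
    open_rings = set()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            # first occurrence opens a ring bond, second closes it
            open_rings ^= {ch}
    return depth == 0 and not open_rings

print(smiles_syntax_ok("C1CCCCC1"))  # cyclohexane -> True
print(smiles_syntax_ok("C1CCCCC"))   # unclosed ring -> False
print(smiles_syntax_ok("CC(C"))      # unmatched parenthesis -> False
```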
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Low Exact Match</strong>: Only <strong>7.24%</strong> exact matches. The model captures the overarching functional groups and structure but misses fine details like exact double bond placement.</li>
<li><strong>Complexity Bias</strong>: Trained on large molecules (average length &gt;40 tokens), so it performs poorly on very simple structures where OSRA still excels.</li>
</ul>
<p><strong>Conclusion</strong>: The work shows that modern encoder-decoder architectures combined with valid-by-construction molecular representations (SELFIES) can outperform traditional rule-based systems by large margins on fingerprint-based similarity metrics. The system is useful for literature mining where functional similarity matters more than exact matches, though 7.24% exact match accuracy and poor performance on simple molecules indicate clear directions for future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image captioning system based on DETR (Detection Transformer) framework.</p>
<p><strong>Visual Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: ResNet-101 pre-trained on ImageNet</li>
<li><strong>Feature Extraction</strong>: 4th layer extraction (convolutions only)</li>
<li><strong>Output</strong>: 2048-dimensional dense feature vector</li>
</ul>
<p><strong>Caption Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Transformer encoder-decoder</li>
<li><strong>Layers</strong>: 3 stacked encoder layers, 3 stacked decoder layers</li>
<li><strong>Attention Heads</strong>: 8</li>
<li><strong>Hidden Dimensions</strong>: 2048 (feed-forward networks)</li>
<li><strong>Dropout</strong>: 0.1</li>
<li><strong>Layer Normalization</strong>: 1e-12</li>
</ul>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate</strong>: 5e-5 (selected after sweep from 1e-4 to 1e-6)</li>
<li><strong>Weight Decay</strong>: 1e-4</li>
<li><strong>Batch Size</strong>: 32</li>
<li><strong>Epochs</strong>: 5</li>
<li><strong>Codebase</strong>: Built on open-source DETR implementation</li>
</ul>
<h3 id="data">Data</h3>
<p><strong>MOLCAP Dataset</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total Size</td>
          <td>81,230,291 molecules</td>
          <td>Aggregated from PubChem, ChEMBL, GDB13</td>
      </tr>
      <tr>
          <td>Training Split</td>
          <td>1,000,000 molecules</td>
          <td>Randomly selected unique molecules</td>
      </tr>
      <tr>
          <td>Validation Split</td>
          <td>5,000 molecules</td>
          <td>Randomly selected for evaluation</td>
      </tr>
      <tr>
          <td>Image Resolution</td>
          <td>256x256 pixels</td>
          <td>Generated using RDKit</td>
      </tr>
      <tr>
          <td>Median SELFIES Length</td>
          <td>&gt;45 characters</td>
          <td>More complex than typical benchmarks</td>
      </tr>
      <tr>
          <td>Full Dataset Storage</td>
          <td>~16.24 TB</td>
          <td>Necessitated use of 1M subset</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>None</td>
          <td>No cropping, rotation, or other augmentation</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li>Images generated using RDKit at 256x256 resolution</li>
<li>Molecules converted to canonical representations</li>
<li>SELFIES tokenization for model output</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metrics</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI Value</th>
          <th>OSRA Baseline</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>Fingerprint Tanimoto Similarity (functional groups)</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>RDKit fingerprint similarity</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>Morgan fingerprint similarity (circular)</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>Text overlap metric</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>Structural identity (strict)</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>Syntactic validity (with SELFIES)</td>
      </tr>
      <tr>
          <td>Levenshtein Distance</td>
          <td>21.13</td>
          <td>32.76</td>
          <td>String edit distance (lower is better)</td>
      </tr>
  </tbody>
</table>
<p><strong>Secondary Metrics</strong> (shown to be less informative for chemical tasks):</p>
<ul>
<li>BLEU, ROUGE (better suited for natural language)</li>
<li>Levenshtein distance (doesn&rsquo;t capture chemical similarity)</li>
</ul>
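<p>The Levenshtein caveat can be made concrete: a one-character edit to a SMILES string can swap in a chemically unrelated molecule. A standard dynamic-programming implementation (not from the paper) illustrates this with ethanol (<code>CCO</code>) and propane (<code>CCC</code>), which sit at edit distance 1:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# "CCO" (ethanol) vs. "CCC" (propane): edit distance 1, yet the
# molecules share little chemistry -- string metrics mislead here.
print(levenshtein("CCO", "CCC"))  # 1
```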
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Single NVIDIA GeForce RTX 2080 Ti</li>
<li><strong>Training Time</strong>: ~5 hours per epoch, approximately 25 hours total for 5 epochs</li>
<li><strong>Memory</strong>: Sufficient for batch size 32 with ResNet-101 + Transformer architecture</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>The paper mentions releasing both code and the MOLCAP dataset, but no public repository or download link has been confirmed as available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MOLCAP dataset</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>81M molecules; claimed released but no public URL found</td>
      </tr>
      <tr>
          <td>IMG2SMI code</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Built on DETR; claimed released but no public URL found</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Campos, D., &amp; Ji, H. (2021). IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System (No. arXiv:2109.04202). arXiv. <a href="https://doi.org/10.48550/arXiv.2109.04202">https://doi.org/10.48550/arXiv.2109.04202</a></p>
<p><strong>Publication</strong>: arXiv preprint (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2109.04202">Paper on arXiv</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{campos2021img2smi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Campos, Daniel and Ji, Heng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2109.04202}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2109.04202}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Hand-Drawn Chemical Diagram Recognition (AAAI 2007)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</guid><description>A sketch recognition system for organic chemistry that uses domain knowledge (chemical valence) to correct recognition errors.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-approach">Contribution and Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a multi-stage pipeline for interpreting hand-drawn diagrams that integrates a trainable symbol recognizer with a domain-specific verification step. The authors validate the method through an ablation study comparing the full system against a baseline lacking domain knowledge.</p>
<h2 id="motivation-for-sketch-based-interfaces">Motivation for Sketch-Based Interfaces</h2>
<p>Current software for specifying chemical structures (e.g., ChemDraw, ISIS/Draw) relies on mouse and keyboard interfaces, which lack the speed, ease of use, and naturalness of drawing on paper. The goal is to bridge the gap between natural expression and computer interpretation by building a system that understands freehand chemical sketches.</p>
<h2 id="novel-integration-of-chemical-domain-knowledge">Novel Integration of Chemical Domain Knowledge</h2>
<p>The primary novelty is the integration of <strong>domain knowledge</strong> (specifically chemical valence rules) directly into the interpretation loop to resolve ambiguities and correct errors.</p>
<p>Specific technical contributions include:</p>
<ul>
<li><strong>Hybrid Recognizer</strong>: Combines feature-based SVMs, image-based template matching (modified Tanimoto), and off-the-shelf handwriting recognition to handle the mix of geometry and text.</li>
<li><strong>Domain Verification Loop</strong>: A post-processing step that checks the chemical validity of the structure (e.g., nitrogen must have 3 bonds). If an inconsistency is found, the system searches the space of alternative hypotheses generated during the initial parsing phase to find a valid interpretation.</li>
<li><strong>Contextual Parsing</strong>: Uses a sliding window (up to 7 strokes) and spatial context to parse interspersed symbols.</li>
<li><strong>Implicit Structure Handling</strong>: Supports two common chemistry notations: (1) implicit elements, where carbon and hydrogen atoms are omitted and inferred from bond connectivity and valence rules, and (2) aromatic rings, detected as a circle drawn inside a hexagonal 6-carbon cycle.</li>
</ul>
<h2 id="experimental-design-and-user-study">Experimental Design and User Study</h2>
<p>The authors conducted a user study to evaluate the system&rsquo;s robustness on unconstrained sketches.</p>
<ul>
<li><strong>Participants</strong>: 6 users familiar with organic chemistry.</li>
<li><strong>Task</strong>: Each user drew 12 pre-specified molecular compounds on a Tablet PC.</li>
<li><strong>Conditions</strong>: The system was evaluated in two modes:
<ol>
<li><strong>Domain</strong>: The full system with chemical valence checks.</li>
<li><strong>Baseline</strong>: A simplified version with no knowledge of chemical valence/verification.</li>
</ol>
</li>
<li><strong>Data Split</strong>: Evaluated on collected sketches using a leave-one-out style approach (training on 11 examples from the same users).</li>
</ul>
<h2 id="results-and-error-reduction-analysis">Results and Error Reduction Analysis</h2>
<ul>
<li><strong>Performance</strong>: The full system achieved an overall <strong>F-measure of 0.87</strong> (Precision 0.86, Recall 0.89).</li>
<li><strong>Impact of Domain Knowledge</strong>: Using domain knowledge reduced the overall error rate (measured by recall) by <strong>27%</strong> compared to the baseline. The improvement was statistically significant ($p &lt; .05$).</li>
<li><strong>Error Recovery</strong>: The system successfully recovered from interpretations that were geometrically plausible but chemically impossible (e.g., misinterpreting &ldquo;N&rdquo; as bonds), as illustrated in their qualitative analysis.</li>
<li><strong>Output Integration</strong>: Once interpreted, the resulting structure is expressed in a standard chemical specification format that can be passed to tools such as ChemDraw (for rendering) or SciFinder (for database queries).</li>
<li><strong>Limitations</strong>: The system struggled with &ldquo;messy&rdquo; sketches where users drew single bonds with multiple strokes or over-traced lines, as the current bond recognizer assumes single-stroke straight bonds.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study collected a custom dataset of hand-drawn diagrams.</p>
<ul>
<li><strong>Volume</strong>: 6 participants $\times$ 12 molecules = 72 total sketches (implied).</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Scale Normalization</strong>: The system estimates scale based on the average length of straight bonds (chosen because they are easy to identify). This normalizes geometric features for the classifier.</li>
<li><strong>Stroke Segmentation</strong>: Poly-line approximation using recursive splitting (minimizing least squared error) to break multi-segment strokes (e.g., connected bonds) into primitives.</li>
</ul>
</li>
</ul>
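<p>The recursive splitting step can be sketched as follows. The paper splits to minimize least-squared error; the closely related Douglas&ndash;Peucker-style variant below (an assumption, not the authors&rsquo; exact criterion) splits at the point of maximum deviation from the chord:</p>

```python
import math

def _point_line_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / norm

def split_stroke(points, tol=2.0):
    """Recursively split a stroke into line segments: if the farthest
    interior point deviates more than tol from the chord, split there."""
    if len(points) <= 2:
        return points
    dists = [_point_line_dist(p, points[0], points[-1]) for p in points[1:-1]]
    k = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[k - 1] <= tol:
        return [points[0], points[-1]]
    left = split_stroke(points[:k + 1], tol)
    right = split_stroke(points[k:], tol)
    return left[:-1] + right  # drop duplicated split point

# An L-shaped stroke (two connected bonds) splits at the corner.
stroke = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
print(split_stroke(stroke))  # [(0, 0), (3, 0), (3, 3)]
```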
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Ink Parsing (Sliding Window)</strong></p>
<ul>
<li>Examines all combinations of up to <strong>$n=7$</strong> sequential strokes.</li>
<li>Classifies each group as a valid symbol or invalid garbage.</li>
</ul>
<p><strong>2. Template Matching (Image-based)</strong></p>
<ul>
<li>Used for resolving ambiguities in text/symbols (e.g., &lsquo;H&rsquo; vs &lsquo;N&rsquo;).</li>
<li><strong>Metric</strong>: Modified <strong>Tanimoto coefficient</strong>. Unlike standard Tanimoto (point overlap), this version accounts for relative angle and curvature at each point.</li>
</ul>
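<p>The standard Tanimoto overlap that the paper modifies can be sketched on binary templates represented as sets of filled cells. The angle and curvature terms of the modified version are omitted here, so this is only a baseline illustration with made-up template data:</p>

```python
def template_tanimoto(img_a: set, img_b: set) -> float:
    """Standard Tanimoto overlap of two binary templates, each a set of
    filled (x, y) cells. The paper's modified coefficient additionally
    compares stroke angle and curvature at each point (omitted here)."""
    common = len(img_a & img_b)
    return common / (len(img_a) + len(img_b) - common)

# Two toy 3x3 templates: an 'H'-like glyph, and the same glyph with one
# extra cell (an 'N'-like diagonal start). They overlap heavily.
tmpl_h = {(0, 0), (0, 1), (0, 2), (1, 1), (2, 0), (2, 1), (2, 2)}
tmpl_n = tmpl_h | {(1, 0)}
print(template_tanimoto(tmpl_h, tmpl_h))  # 1.0
print(template_tanimoto(tmpl_h, tmpl_n))  # 7 / (7 + 8 - 7) = 0.875
```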
<p><strong>3. Domain Verification</strong></p>
<ul>
<li><strong>Trigger</strong>: An element with incorrect valence (e.g., Hydrogen with &gt;1 bond).</li>
<li><strong>Resolution</strong>: Searches stored alternative hypotheses for the affected strokes. It accepts a new hypothesis if it resolves the valence error without introducing new ones.</li>
<li><strong>Constraint</strong>: It keeps an inconsistent structure when the original interpretation&rsquo;s confidence score is significantly higher than any alternative&rsquo;s, on the assumption that the user is still drawing or intentionally left the structure incomplete.</li>
</ul>
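<p>The verification trigger can be sketched with a small valence table (an illustrative subset, not the paper&rsquo;s full element list). This check flags only exceeded valences; under-filled atoms are acceptable because omitted hydrogens fill the remaining bonding slots:</p>

```python
# Typical neutral-atom valences used for the consistency check
# (illustrative subset; the paper covers its full symbol set).
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}

def valence_errors(atoms, bonds):
    """Return atoms whose total bond order exceeds their allowed valence.
    atoms: {atom_id: element}; bonds: [(atom_id, atom_id, order), ...]."""
    totals = {a: 0 for a in atoms}
    for a, b, order in bonds:
        totals[a] += order
        totals[b] += order
    return [a for a, elem in atoms.items() if totals[a] > VALENCE[elem]]

# A hydrogen drawn with two bonds triggers the hypothesis search.
atoms = {1: "C", 2: "H", 3: "O"}
bonds = [(1, 2, 1), (2, 3, 1)]  # H bonded twice: chemically impossible
print(valence_errors(atoms, bonds))  # [2]
```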
<h3 id="models">Models</h3>
<p><strong>Symbol Recognizer (Discriminative Classifier)</strong></p>
<ul>
<li><strong>Type</strong>: Support Vector Machine (SVM).</li>
<li><strong>Classes</strong>: Element letters, straight bonds, hash bonds, wedge bonds, invalid groups.</li>
<li><strong>Input Features</strong>:
<ol>
<li>Number of strokes</li>
<li>Bounding-box dimensions (width, height, diagonal)</li>
<li>Ink density (ink length / diagonal length)</li>
<li>Inter-stroke distance (max distance between strokes in group)</li>
<li>Inter-stroke orientation (vector of relative orientations)</li>
</ol>
</li>
</ul>
<p><strong>Text Recognition</strong></p>
<ul>
<li><strong>Microsoft Tablet PC SDK</strong>: Used for recognizing alphanumeric characters (elements and subscripts).</li>
<li>Integrated with the SVM and Template Matcher via a combined scoring mechanism.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Overall)</th>
          <th>Baseline Comparison</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision</strong></td>
          <td>0.86</td>
          <td>0.81 (Baseline)</td>
          <td>Full system vs. no domain knowledge</td>
      </tr>
      <tr>
          <td><strong>Recall</strong></td>
          <td>0.89</td>
          <td>0.85 (Baseline)</td>
          <td>27% error reduction</td>
      </tr>
      <tr>
          <td><strong>F-Measure</strong></td>
          <td>0.87</td>
          <td>0.83 (Baseline)</td>
          <td>Statistically significant ($p &lt; .05$)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>True Positive Definition</strong>: Match in both location (stroke grouping) and classification (label).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: 1.5GHz Tablet PC.</li>
<li><strong>Performance</strong>: Real-time feedback.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<p>No source code, trained models, or collected sketch data were publicly released. The paper is openly available through the AAAI digital library. The system depends on the Microsoft Tablet PC SDK (a proprietary, now-discontinued component), which would make exact replication difficult even with the algorithm descriptions provided.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2007). Recognition of Hand Drawn Chemical Diagrams. <em>Proceedings of the 22nd National Conference on Artificial Intelligence</em> (AAAI-07), 846-851.</p>
<p><strong>Publication</strong>: AAAI 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyang2007recognition,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recognition of Hand Drawn Chemical Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ouyang, Tom Y and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 22nd National Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{846--851}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph Perception for Chemical Structure OCR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</guid><description>A 1990 methodological paper presenting an early OCR system for digitizing chemical structure images into connectivity tables using C and Prolog.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Contreras, M. L., Allendes, C., Alvarez, L. T., &amp; Rozas, R. (1990). Computational perception and recognition of digitized molecular structures. <em>Journal of Chemical Information and Computer Sciences</em>, 30(3), 302-307. <a href="https://doi.org/10.1021/ci00067a014">https://doi.org/10.1021/ci00067a014</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1990</p>
<h2 id="contribution-graph-perception-and-character-recognition">Contribution: Graph Perception and Character Recognition</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>.</p>
<p>It proposes a specific algorithmic pipeline (&ldquo;graph perception and character recognition&rdquo;) to solve the technical problem of converting pixelated images of molecules into machine-readable connectivity tables. The dominant contribution is the novel set of algorithms (contour search, circular inspection, matrix parametrization).</p>
<h2 id="motivation-automating-chemical-database-entry">Motivation: Automating Chemical Database Entry</h2>
<p>The primary motivation is to automate the input of chemical structures into databases.</p>
<ul>
<li><strong>Problem</strong>: Manual input of structures (especially large ones with stereochemistry) is time-consuming and prone to human error.</li>
<li><strong>Gap</strong>: Existing methods required significant human intervention. The authors created a system that handles the &ldquo;graph/skeleton&rdquo; and the &ldquo;alphanumeric characters&rdquo; effectively to speed up entry into systems like ARIUSA or CAD tools.</li>
</ul>
<h2 id="algorithmic-novelty-circular-inspection-processing">Algorithmic Novelty: Circular Inspection Processing</h2>
<p>The paper introduces a unified &ldquo;capture-to-recognition&rdquo; system written in C that handles both type-printed and hand-printed structures. Key novelties include:</p>
<ul>
<li><strong>Circular Inspection Algorithm</strong>: A specific technique for detecting internal rings and multiple bonds by sweeping a radius of 0.3 bond lengths around atoms.</li>
<li><strong>Hybrid Recognition</strong>: Combining &ldquo;graph perception&rdquo; (vectorizing the lines) with &ldquo;character recognition&rdquo; (OCR for atom labels) in a single pipeline.</li>
<li><strong>Matrix Parametrization for OCR</strong>: A feature extraction method that assigns hexadecimal IDs to character matrices based on pixel gradients and &ldquo;semibytes&rdquo;.</li>
</ul>
<h2 id="methodology-validation-via-custom-structure-dataset">Methodology: Validation via Custom Structure Dataset</h2>
<p>The authors validated the system by digitizing and recognizing a set of test structures:</p>
<ul>
<li><strong>Dataset</strong>: 200 type-printed structures and 50 hand-printed structures.</li>
<li><strong>Metric</strong>: &ldquo;Reliability&rdquo; percentage (correct recognition of the connectivity table).</li>
<li><strong>Speed Comparison</strong>: Measured processing time against a &ldquo;qualified person&rdquo; performing manual input for an average 20-atom molecule.</li>
</ul>
<h2 id="results-speed-and-file-size-efficiency">Results: Speed and File Size Efficiency</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>94% reliability</strong> for both type- and hand-printed graphs.</li>
<li><strong>Character Recognition</strong>: Isolated character recognition achieved <strong>&gt;99% reliability</strong>.</li>
<li><strong>Speed</strong>: The system was <strong>3-5 times faster</strong> than manual human input.</li>
<li><strong>Efficiency</strong>: The storage required for a recognized molecule (e.g., $C_{19}H_{31}N$) was significantly smaller (4.1 kb) than the raw image bitmap.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a standard external dataset but rather a custom set of structures for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Type-printed structures</td>
          <td style="text-align: left">200 images</td>
          <td style="text-align: left">Used to test reliability</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Hand-printed structures</td>
          <td style="text-align: left">50 images</td>
          <td style="text-align: left">&ldquo;Straight enough&rdquo; drawings required</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details three specific algorithmic components crucial for replication:</p>
<ol>
<li>
<p><strong>Graph Perception (Contour Search)</strong>:</p>
<ul>
<li><strong>Sweep</strong>: Left-to-right horizontal sweep to find the first pixel.</li>
<li><strong>Contour Follow</strong>: Counter-clockwise algorithm used to trace borders.</li>
<li><strong>Vertex Detection</strong>: A vertex is flagged if the linear trajectory deflection angle is $&gt;18^\circ$.</li>
<li><strong>Atom Localization</strong>: Two or more vertices in a small space indicate an atom position.</li>
</ul>
</li>
<li>
<p><strong>Circular Inspection (Branching/Rings)</strong>:</p>
<ul>
<li><strong>Radius</strong>: A circle is inspected around each atom with $r = 0.3 \times \text{single bond length}$.</li>
<li><strong>Branch Detection</strong>: &ldquo;Unknown border pixels&rdquo; found on this circle trigger new contour searches to find attached bonds or rings.</li>
</ul>
</li>
<li>
<p><strong>Character Recognition (Matrix Feature Extraction)</strong>:</p>
<ul>
<li><strong>Separation</strong>: Characters are separated into isolated matrices and &ldquo;relocated&rdquo; to the top-left corner.</li>
<li><strong>Parametrization</strong>: The matrix is divided into zones. A &ldquo;semibyte&rdquo; (4-bit code) is generated by checking for pixel density in specific directions.</li>
<li><strong>ID Assignment</strong>: Matrices are assigned a Hex ID (e.g., <code>8</code>, <code>1</code>, <code>0</code>, <code>6</code>) based on these semibytes.</li>
<li><strong>Differentiation</strong>: Secondary parameters (concavities, vertical lines) resolve conflicts (e.g., between &lsquo;b&rsquo; and &lsquo;h&rsquo;).</li>
</ul>
</li>
</ol>
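<p>The vertex-detection rule from step 1 can be sketched directly: compute the deflection of the traced trajectory at each interior point and flag those exceeding the paper&rsquo;s $18^\circ$ threshold (helper names here are illustrative, not from the paper):</p>

```python
import math

def deflection_deg(p0, p1, p2):
    """Angle (degrees) by which the trajectory p0->p1 turns at p1 toward p2."""
    a1 = math.atan2(p1[1] - p0[1], p1[0] - p0[0])
    a2 = math.atan2(p2[1] - p1[1], p2[0] - p1[0])
    d = abs(math.degrees(a2 - a1)) % 360
    return min(d, 360 - d)

def vertices(trace, threshold=18.0):
    """Flag trace points where deflection exceeds the paper's 18-degree cutoff."""
    return [trace[i] for i in range(1, len(trace) - 1)
            if deflection_deg(trace[i - 1], trace[i], trace[i + 1]) > threshold]

# A contour turning 90 degrees at (2, 0) is flagged as a vertex (atom candidate).
trace = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(vertices(trace))  # [(2, 0)]
```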
<h3 id="models">Models</h3>
<p>The system does not use learned weights (neural networks). It relies on <strong>rule-based topological recognition</strong>.</p>
<ul>
<li><strong>Representation</strong>: The final output is a Prolog data structure converted into a connectivity table.</li>
<li><strong>Atom Recognition</strong>: Terminal atoms are identified by linear projection; if no pixels are found, it defaults to Carbon.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The performance metrics reflect 1990s hardware, useful for historical context or low-resource reimplementation.</p>
<ul>
<li><strong>Capture</strong>: PC-AT microcomputer with HP-Scanjet.</li>
<li><strong>Processing</strong>: MicroVax II (8 MB real memory, 159 MB hard disc) running Ultrix-32.</li>
<li><strong>Memory Usage</strong>: A $300 \times 300$ dpi image required ~175 kb; a recognized graph required ~1.6 kb.</li>
<li><strong>Time</strong>: Processing time per molecule was 0.7 - 1.0 minutes.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{contrerasComputationalPerceptionRecognition1990,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Computational Perception and Recognition of Digitized Molecular Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Contreras, M. Leonor and Allendes, Carlos and Alvarez, L. Tomas and Rozas, Roberto}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1990</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{30}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{302--307}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00067a014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Evans 1986: Thermal Conductivity of Lennard-Jones Fluid</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/evans-thermal-conductivity-1986/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/evans-thermal-conductivity-1986/</guid><description>A 1986 validation of the Evans NEMD method for simulating heat flow, identifying long-time tail anomalies near the critical point.</description><content:encoded><![CDATA[<h2 id="methodological-validation-and-physical-discovery">Methodological Validation and Physical Discovery</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>, with a significant secondary component of <strong>Discovery ($\Psi_{\text{Discovery}}$)</strong>.</p>
<p>It focuses on validating a specific algorithm (the &ldquo;Evans method&rdquo;) for Non-Equilibrium Molecular Dynamics (NEMD) by comparing its results against experimental benchmarks. However, it also uncovers physical anomalies, specifically &ldquo;long-time tails&rdquo; in the heat flux autocorrelation function that deviate significantly from theoretical predictions, marking a discovery about the physics of the Lennard-Jones fluid itself.</p>
<h2 id="flow-gradients-and-boundary-limitations">Flow Gradients and Boundary Limitations</h2>
<p>The primary motivation is to overcome the limitations of simulating heat flow with physical boundaries (e.g., walls held at different temperatures), which cause severe interpretive difficulties due to the density and temperature gradients they induce.</p>
<p>The &ldquo;Evans method&rdquo; uses a fictitious external field to induce heat flow in a periodic, homogeneous system. This paper serves to:</p>
<ol>
<li>Validate this method across a wide range of state points (temperatures and densities) beyond the triple point.</li>
<li>Investigate the system&rsquo;s behavior near the critical point, where transport properties are known to be anomalous.</li>
</ol>
<h2 id="core-innovations-of-the-evans-algorithm">Core Innovations of the Evans Algorithm</h2>
<p>The core contribution is the rigorous stress-testing of the <strong>homogeneous heat flow algorithm</strong> (Evans method) combined with a <strong>Gaussian thermostat</strong>.</p>
<p>Specific novel insights include:</p>
<ul>
<li><strong>Linearity Validation</strong>: Establishing that, away from phase boundaries, the effective thermal conductivity is a monotonic, virtually linear function of the external field, justifying the extrapolation to zero field.</li>
<li><strong>Critical Anomaly Detection</strong>: Finding that near the critical point, conductivity becomes a non-monotonic function of the field, challenging standard simulation approaches in this regime.</li>
<li><strong>Tail Amplitude Discovery</strong>: Demonstrating that the &ldquo;long-time tails&rdquo; of the heat flux autocorrelation function have amplitudes roughly 6 times larger than those predicted by mode-coupling theory.</li>
</ul>
<h2 id="nemd-simulation-setup">NEMD Simulation Setup</h2>
<p>The author performed <strong>Non-Equilibrium Molecular Dynamics (NEMD)</strong> simulations using the Lennard-Jones potential.</p>
<ul>
<li><strong>System</strong>: Mostly $N=108$ particles, with some checks using $N=256$ to test size dependence.</li>
<li><strong>Thermostat</strong>: A Gaussian thermostat was used to keep the kinetic energy (temperature) constant.</li>
<li><strong>State Points</strong>:
<ul>
<li><strong>Critical Isotherm</strong>: $T=1.35$, varying density.</li>
<li><strong>Supercritical Isotherm</strong>: $T=2.0$.</li>
<li><strong>Freezing Line</strong>: Two points ($T=2.74, \rho=1.113$ and $T=2.0, \rho=1.04$).</li>
</ul>
</li>
<li><strong>Validation</strong>: Results were compared against <strong>experimental data for Argon</strong> (using standard LJ parameters).</li>
<li><strong>Ablation</strong>:
<ul>
<li><strong>Field Strength ($F$)</strong>: Varied to check for linearity/non-linearity.</li>
<li><strong>System Size ($N$)</strong>: Comparison between 108 and 256 particles to rule out finite-size artifacts.</li>
</ul>
</li>
</ul>
<h2 id="linearity-regimes-and-long-time-tail-anomalies">Linearity Regimes and Long-Time Tail Anomalies</h2>
<ul>
<li><strong>Agreement with Experiment</strong>: The Evans method yields thermal conductivities in broad agreement with experimental Argon data for most state points.</li>
<li><strong>Linearity</strong>: Away from the critical point, conductivity is a virtually linear function of the field strength $F$, allowing for accurate zero-field extrapolation.</li>
<li><strong>Critical Region Failure</strong>: Near the critical point ($T=1.35, \rho=0.4$), the method struggles; the conductivity is non-monotonic with respect to $F$, and the zero-field extrapolation underestimates the experimental value by ~11%.</li>
<li><strong>Long-Time Tails</strong>: The decay of the heat flux autocorrelation function follows a $t^{-3/2}$ tail (consistent with mode-coupling theory), but the <strong>amplitude is ~6x larger</strong> than predicted.</li>
<li><strong>Phase Hysteresis</strong>: In high-density regions near the freezing line, the system exhibits hysteresis and bi-stability between solid and liquid phases depending on the field strength.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The simulation relies on the Lennard-Jones (LJ) potential to model Argon. No external training data is used; the &ldquo;data&rdquo; consists of the physical constants defining the system.</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value/Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Potential</strong></td>
          <td>$\Phi(q)=4(q^{-12}-q^{-6})$</td>
          <td>Standard LJ 12-6 potential</td>
      </tr>
      <tr>
          <td><strong>Cutoff</strong></td>
          <td>$r_c = 2.5$</td>
          <td>Truncated at 2.5 distance units</td>
      </tr>
      <tr>
          <td><strong>Comparison</strong></td>
          <td>Argon Experimental Data</td>
          <td>Sourced from NBS recommended values</td>
      </tr>
  </tbody>
</table>
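<p>As a quick sanity check, the truncated potential from the table can be evaluated directly. A minimal Python sketch (assuming plain truncation at $r_c = 2.5$ with no energy shift, since none is stated):</p>

```python
import numpy as np

def lj_potential(q, r_cut=2.5):
    """Reduced-unit LJ 12-6 potential, Phi(q) = 4(q^-12 - q^-6),
    truncated at r_cut (assumption: no energy shift applied)."""
    q = np.asarray(q, dtype=float)
    phi = 4.0 * (q**-12 - q**-6)
    return np.where(q < r_cut, phi, 0.0)

# The well minimum sits at q = 2^(1/6) with Phi = -1 in reduced units.
print(lj_potential(2.0 ** (1.0 / 6.0)))  # ~ -1.0
print(lj_potential(3.0))                 # 0.0 beyond the cutoff
```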
<h3 id="algorithms">Algorithms</h3>
<p>The core algorithm is the <strong>Evans Homogeneous Heat Flow</strong> method. To reproduce this, one must implement the specific Equations of Motion (EOM) derived from linear response theory.</p>
<p><strong>Equations of Motion:</strong></p>
<p>The trajectories are generated by:
$$
\begin{aligned}
\dot{q}_i &amp;= \frac{p_i}{m} \\
\dot{p}_i &amp;= F_i^{\text{inter}} + (E_i - \bar{E})\,F(t) + \frac{1}{2}\sum_{j} F_{ij}\,\big(q_{ij} \cdot F(t)\big) - \frac{1}{2N} \sum_{j,k} F_{jk}\,\big(q_{jk} \cdot F(t)\big) - \alpha p_i
\end{aligned}
$$</p>
<p>Where:</p>
<ul>
<li>$F(t)$ is the fictitious external field driving heat flow.</li>
<li>$E_i$ is the instantaneous energy of particle $i$.</li>
<li>$\alpha$ is the <strong>Gaussian thermostat multiplier</strong> (recomputed at every step so that the kinetic energy, and hence the temperature, is strictly conserved):
$$\alpha = \frac{\sum_i [\dots]_{\text{force terms}} \cdot p_i}{\sum_i p_i \cdot p_i}$$</li>
</ul>
<p><strong>Conductivity Calculation:</strong></p>
<p>The zero-frequency limit is extrapolated as:
$$ \lambda = \lim_{F \to 0} \frac{J_Q}{FT} $$</p>
<p>The frequency-dependent conductivity relies on the heat-flux autocorrelation:
$$ \lambda(\omega) = \frac{V}{3k_B T^2} \int_0^\infty dt \, e^{i\omega t} \langle J_Q(t) \cdot J_Q(0) \rangle $$</p>
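<p>The interplay between the driving terms and the thermostat can be sketched numerically. The snippet below is a schematic illustration of the Gaussian isokinetic constraint only: the total force on each particle (interatomic plus field terms) is replaced by placeholder random values, and $\alpha$ is computed from the same ratio of force-momentum to momentum-momentum sums so that the kinetic energy is stationary:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
p = rng.normal(size=(N, 3))        # momenta (m = 1, reduced units)
F_total = rng.normal(size=(N, 3))  # placeholder for interatomic + field force terms

# Gaussian isokinetic multiplier: chosen so that d/dt [sum_i p_i^2 / 2] = 0
alpha = np.sum(F_total * p) / np.sum(p * p)
pdot = F_total - alpha * p

# Check: the thermostatted momentum derivatives leave the kinetic energy stationary
print(np.sum(pdot * p))  # ~ 0
```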
<h3 id="models">Models</h3>
<p>The &ldquo;model&rdquo; here is the physical simulation setup.</p>
<ul>
<li><strong>Particle Count</strong>: $N = 108$ (primary), $N = 256$ (validation).</li>
<li><strong>Boundary Conditions</strong>: Periodic Boundary Conditions (PBC).</li>
<li><strong>Thermostat</strong>: Gaussian Isokinetic (Temperature is a constant of motion).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the <strong>Thermal Conductivity</strong> ($\lambda$).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
          <th>Baseline</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Thermal Conductivity</strong></td>
          <td>Ratio of heat flux $J_Q$ to field $F$ (extrapolated to $F=0$)</td>
          <td>Experimental Argon (NBS Data)</td>
          <td>Good agreement away from critical point</td>
      </tr>
      <tr>
          <td><strong>Tail Amplitude</strong></td>
          <td>Coefficient of the $\omega^{1/2}$ term in frequency-dependent conductivity</td>
          <td>Mode-Coupling Theory ($\approx 0.05$)</td>
          <td>Simulation value $\approx 0.3$ (6x larger)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: While 1986 hardware is obsolete, reproducing this requires a standard MD code capable of non-conservative forces (NEMD).</li>
<li><strong>Compute Cost</strong>: Low by modern standards. 108 particles for $\sim 10^5$ to $10^6$ steps is trivial on modern CPUs.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Evans, D. J. (1986). Thermal conductivity of the Lennard-Jones fluid. <em>Physical Review A</em>, 34(2), 1449-1453. <a href="https://doi.org/10.1103/PhysRevA.34.1449">https://doi.org/10.1103/PhysRevA.34.1449</a></p>
<p><strong>Publication</strong>: Physical Review A, 1986</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{PhysRevA.34.1449,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Thermal conductivity of the Lennard-Jones fluid}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Evans, Denis J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Phys. Rev. A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1449--1453}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">numpages</span> = <span style="color:#e6db74">{0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1986}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{Aug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Physical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1103/PhysRevA.34.1449}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://link.aps.org/doi/10.1103/PhysRevA.34.1449}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Dynamical Corrections to TST for Surface Diffusion</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/self-diffusion-lj-fcc111-1989/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/self-diffusion-lj-fcc111-1989/</guid><description>Application of dynamical corrections formalism to TST for LJ surface diffusion, revealing bounce-back recrossings at low T.</description><content:encoded><![CDATA[<h2 id="bridging-md-and-tst-for-surface-diffusion">Bridging MD and TST for Surface Diffusion</h2>
<p>This is primarily a <strong>Methodological Paper</strong> with a secondary contribution in <strong>Discovery</strong>.</p>
<p>The authors&rsquo; primary goal is to demonstrate the validity of the &ldquo;dynamical corrections formalism&rdquo; for calculating diffusion constants. They validate this by reproducing Molecular Dynamics (MD) results at high temperatures and then extending the method into low-temperature regimes where MD is infeasible.</p>
<p>By applying this method, they uncover a specific physical phenomenon, &ldquo;bounce-back recrossings&rdquo;, that causes a dip in the diffusion coefficient at low temperatures, a detail previously unobserved.</p>
<h2 id="timescale-limits-in-molecular-dynamics">Timescale Limits in Molecular Dynamics</h2>
<p>The authors aim to solve the timescale problem in simulating surface diffusion.</p>
<p><strong>Limit of MD</strong>: Molecular Dynamics (MD) is effective at high temperatures but becomes computationally infeasible at low temperatures because the time between diffusive hops increases drastically.</p>
<p><strong>Limit of TST</strong>: Standard Transition State Theory (TST) can handle long timescales but assumes all barrier crossings are successful, ignoring correlated dynamical events like immediate recrossings or multiple jumps.</p>
<p><strong>Goal</strong>: They seek to apply a formalism that corrects TST using short-time trajectory data, allowing for accurate calculation of diffusion constants across the entire temperature range.</p>
<h2 id="the-bounce-back-mechanism">The Bounce-Back Mechanism</h2>
<p>The core novelty is the rigorous application of the dynamical corrections formalism to a multi-site system (fcc/hcp sites) to characterize non-Arrhenius behavior at low temperatures.</p>
<p><strong>Unified Approach</strong>: They demonstrate that this method works for all temperatures, bridging the gap between the &ldquo;rare-event regime&rdquo; and the high-temperature regime dominated by fluid-like motion.</p>
<p><strong>Bounce-back Mechanism</strong>: They identify a specific &ldquo;dip&rdquo; in the dynamical correction factor ($f_d &lt; 1$) at low temperatures ($T \approx 0.038$), attributed to trajectories where the adatom collides with a substrate atom on the far side of the binding site and immediately recrosses the dividing surface.</p>
<h2 id="simulating-the-lennard-jones-fcc111-surface">Simulating the Lennard-Jones fcc(111) Surface</h2>
<p>The authors performed computational experiments on a Lennard-Jones fcc(111) surface cluster.</p>
<p><strong>System Setup</strong>: A single adatom on a 3-layer substrate (30 atoms/layer) with periodic boundary conditions.</p>
<p><strong>Baselines</strong>: They compared their high-temperature results against standard Molecular Dynamics simulations to validate the method.</p>
<p><strong>Ablation of Substrate Freedom</strong>: They ran a control experiment with a 6-layer substrate (top 3 free, 800 trajectories) to confirm the bounce-back effect persisted independently of the fixed deep layers, obtaining $D/D^{TST} = 0.75 \pm 0.06$, consistent with the original result.</p>
<p><strong>Trajectory Analysis</strong>: They analyzed the angular distribution of initial momenta to characterize the specific geometry of the bounce-back trajectories. Bounce-back trajectories were more strongly peaked at $\phi = 90°$ (perpendicular to the TST gate), confirming the effect arises from interaction with the substrate atom directly across the binding site.</p>
<p><strong>Temperature Range</strong>: The full calculation spanned $0.013 \leq T \leq 0.383$ in reduced units, bridging the rare-event regime and the high-temperature fluid-like regime.</p>
<h2 id="resolving-non-arrhenius-behavior">Resolving Non-Arrhenius Behavior</h2>
<p><strong>Arrhenius Behavior of TST</strong>: The uncorrected TST diffusion constant ($D^{TST}$) followed a near-perfect Arrhenius law, with a linear least-squares fit of $\ln(D^{TST}) = -1.8 - 0.30/T$.</p>
<p><strong>High-Temperature Correction</strong>: At high T, the dynamical correction factor $D/D^{TST} &gt; 1$, indicating correlated multiple forward jumps (long flights).</p>
<p><strong>Low-Temperature Dip</strong>: At low T, $D/D^{TST} &lt; 1$ for $T = 0.013, 0.026, 0.038, 0.051$ (minimum at $T = 0.038$), caused by the bounce-back mechanism.</p>
<p><strong>Validation</strong>: The method successfully reproduced high-T literature values while providing access to low-T dynamics inaccessible to direct MD.</p>
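<p>Because an Arrhenius law is linear in $1/T$, the reported TST fit can be recovered with an ordinary least-squares line. The sketch below uses synthetic points placed exactly on the reported line (these are not the paper's raw data):</p>

```python
import numpy as np

# Reported fit for the uncorrected TST diffusion constant (reduced units):
#   ln(D_TST) = -1.8 - 0.30 / T
T = np.array([0.013, 0.026, 0.038, 0.051, 0.128, 0.255, 0.383])
lnD = -1.8 - 0.30 / T  # synthetic values on the reported Arrhenius line

# An Arrhenius fit is a straight line in 1/T; polyfit recovers slope and intercept
slope, intercept = np.polyfit(1.0 / T, lnD, 1)
print(round(slope, 3), round(intercept, 3))  # -0.3 -1.8
```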
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use external datasets but generates simulation data based on the Lennard-Jones potential.</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Parameter</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Potential</strong></td>
          <td>$\epsilon, \sigma$</td>
          <td>1.0 (Reduced units)</td>
          <td>Standard Lennard-Jones 6-12</td>
      </tr>
      <tr>
          <td><strong>Cutoff</strong></td>
          <td>Spline</td>
          <td>$r_1=1.5\sigma, r_2=2.5\sigma$</td>
          <td>5th-order spline smooths potential to 0 at $r_2$</td>
      </tr>
      <tr>
          <td><strong>Geometry</strong></td>
          <td>Lattice Constant</td>
          <td>$a_0 = 1.549$</td>
          <td>Minimum energy for this potential</td>
      </tr>
      <tr>
          <td><strong>Cluster</strong></td>
          <td>Size</td>
          <td>3 layers, 30 atoms/layer</td>
          <td>Periodic boundary conditions parallel to surface</td>
      </tr>
  </tbody>
</table>
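<p>The quintic-spline cutoff can be sketched by solving for the 5th-order polynomial that matches the LJ potential at $r_1$ and vanishes smoothly at $r_2$. The matching conditions at $r_1$ (value plus first and second derivatives) are an assumption; the paper states only that the spline smooths the potential to zero at $r_2$:</p>

```python
import numpy as np

eps, sigma, r1, r2 = 1.0, 1.0, 1.5, 2.5

def lj(r):   return 4 * eps * ((sigma / r)**12 - (sigma / r)**6)
def dlj(r):  return 4 * eps * (-12 * sigma**12 / r**13 + 6 * sigma**6 / r**7)
def d2lj(r): return 4 * eps * (156 * sigma**12 / r**14 - 42 * sigma**6 / r**8)

# Quintic S(r) = sum_k c_k r^k: match V, V', V'' at r1; force 0, 0, 0 at r2
def rows(r):
    return [
        [r**k for k in range(6)],
        [k * r**(k - 1) if k >= 1 else 0.0 for k in range(6)],
        [k * (k - 1) * r**(k - 2) if k >= 2 else 0.0 for k in range(6)],
    ]

A = np.array(rows(r1) + rows(r2))
b = np.array([lj(r1), dlj(r1), d2lj(r1), 0.0, 0.0, 0.0])
c = np.linalg.solve(A, b)

S = lambda r: sum(ck * r**k for k, ck in enumerate(c))
print(abs(S(r1) - lj(r1)) < 1e-9, abs(S(r2)) < 1e-9)  # True True
```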
<h3 id="algorithms">Algorithms</h3>
<p>The diffusion constant $D$ is calculated as $D = D^{TST} \times (D/D^{TST})$.</p>
<p><strong>1. TST Rate Calculation ($D^{TST}$)</strong></p>
<ul>
<li><strong>Method</strong>: Monte Carlo integration of the flux through the dividing surface.</li>
<li><strong>Technique</strong>: Calculate free energy difference between the entire binding site and the TST dividing region.</li>
<li><strong>Dividing Surface</strong>: Defined geometrically with respect to equilibrium substrate positions (honeycomb boundaries around fcc/hcp sites).</li>
</ul>
<p><strong>2. Dynamical Correction Factor ($D/D^{TST}$)</strong></p>
<p>The method relies on evaluating the dynamical correction factor $f_d$, initialized via a Metropolis walk restricted to the TST boundary region, computed as:</p>
<p>$$
\begin{aligned}
f_d(i\rightarrow j) = \frac{2}{N}\sum_{I=1}^{N}\eta_{ij}(I)
\end{aligned}
$$</p>
<ul>
<li><strong>Initialization</strong>:
<ul>
<li><strong>Position</strong>: Sampled via Metropolis walk restricted to the TST boundary region.</li>
<li><strong>Momentum</strong>: Maxwellian distribution for parallel components; Maxwellian-flux distribution for normal component.</li>
<li><strong>Symmetry</strong>: Trajectories entering hcp sites are generated by reversing momenta of those entering fcc sites.</li>
</ul>
</li>
<li><strong>Integration</strong>:
<ul>
<li><strong>Integrator</strong>: Adams-Bashforth-Moulton predictor-corrector formulas of orders 1 through 12.</li>
<li><strong>Duration</strong>: Integrated until time $t &gt; \tau_{corr}$ (approximately $\tau_{corr} \approx 13$ reduced time units).</li>
<li><strong>Sample Size</strong>: 1400 trajectories per temperature point (700 initially entering each type of site).</li>
</ul>
</li>
</ul>
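<p>The structure of the estimator can be sketched with synthetic trajectory outcomes. Here $\eta$ is simplified to a binary success indicator, the recrossing probability is set by hand to mimic the reported $T = 0.038$ result, and the $2/N$ prefactor plus fcc/hcp bookkeeping of the actual formalism are folded away:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1400  # trajectories per temperature point, as in the paper

# Schematic stand-in: eta[I] = 1 if trajectory I, fired from the dividing
# surface into a site, is still in that site at t > tau_corr; a bounce-back
# recrossing gives 0. The assumed 18% recrossing rate mimics T = 0.038.
eta = (rng.random(N) > 0.18).astype(float)

# Under this simplification, f_d reduces to the fraction of trajectories
# that do not recross, with a Monte Carlo error bar.
f_d = eta.mean()
sem = eta.std(ddof=1) / np.sqrt(N)
print(f"D/D_TST ~ {f_d:.2f} +/- {sem:.2f}")
```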
<h3 id="models">Models</h3>
<ul>
<li><strong>System</strong>: Single component Lennard-Jones solid (Argon-like).</li>
<li><strong>Adsorbate</strong>: Single adatom on fcc(111) surface.</li>
<li><strong>Substrate Flexibility</strong>: Adatom plus top layer atoms are free to move. Layers 2 and 3 are fixed. (Validation run used 6 layers with top 3 free).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the Diffusion Constant $D$, analyzed via the Dynamical Correction Factor.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Slope ($E_a$)</strong></td>
          <td>0.30</td>
          <td>0.303 fcc / 0.316 hcp (Newton-Raphson)</td>
          <td>TST slope in good agreement with static barrier height.</td>
      </tr>
      <tr>
          <td><strong>$D/D^{TST}$ (Low T)</strong></td>
          <td>$0.82 \pm 0.04$</td>
          <td>1.0 (TST)</td>
          <td>At $T=0.038$. Indicates 18% reduction due to recrossing.</td>
      </tr>
      <tr>
          <td><strong>$D/D^{TST}$ (High T)</strong></td>
          <td>$&gt; 1.0$</td>
          <td>MD Literature</td>
          <td>Increases with T due to multiple jumps.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Specific hardware configurations (e.g., node architectures, supercomputers) or training times were not specified in the original publication, which is typical for 1989 literature. Modern open-source MD engines (e.g., LAMMPS, ASE) could perform identical Lennard-Jones molecular dynamics integrations in negligible time on any consumer workstation.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cohen, J. M., &amp; Voter, A. F. (1989). Self-diffusion on the Lennard-Jones fcc(111) surface: Effects of temperature on dynamical corrections. <em>The Journal of Chemical Physics</em>, 91(8), 5082-5086. <a href="https://doi.org/10.1063/1.457599">https://doi.org/10.1063/1.457599</a></p>
<p><strong>Publication</strong>: The Journal of Chemical Physics 1989</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cohenSelfDiffusionLennard1989,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Self-diffusion on the {{Lennard}}-{{Jones}} Fcc(111) Surface: {{Effects}} of Temperature on Dynamical Corrections}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Self-diffusion on the {{Lennard}}-{{Jones}} Fcc(111) Surface}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Cohen, J. M. and Voter, A. F.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1989}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Chemical Physics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{91}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5082--5086}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0021-9606, 1089-7690}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1063/1.457599}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader: Automated Structure Extraction</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</guid><description>ChemReader extracts chemical structures from raster images using modified Hough transform and chemical spell checking for improved accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Rosania, G. R., Shedden, K. A., Nguyen, M., Lyu, N., &amp; Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. <em>Chemistry Central Journal</em>, 3(1), 4. <a href="https://doi.org/10.1186/1752-153X-3-4">https://doi.org/10.1186/1752-153X-3-4</a></p>
<p><strong>Publication</strong>: Chemistry Central Journal 2009</p>
<h2 id="paper-contribution-method--pipeline">Paper Contribution: Method &amp; Pipeline</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel software system, <strong>ChemReader</strong>, designed to automate the analog-to-digital conversion of chemical structure diagrams. The paper focuses on the algorithmic pipeline, specifically modifying standard computer vision techniques like the Hough Transform to suit chemical graphs. It validates the method through direct performance comparisons against existing State-of-the-Art tools (OSRA, CLiDE, Kekule).</p>
<h2 id="motivation-unlocking-analog-chemical-information">Motivation: Unlocking Analog Chemical Information</h2>
<p>There is a massive amount of chemical information (molecular interactions, pathways, disease processes) locked in scientific literature. However, this information is typically encoded as &ldquo;analog diagrams&rdquo; (raster images) embedded in text. Existing text-based search engines cannot index these structures effectively.</p>
<p>While previous tools existed (Kekule, OROCS, CLiDE), they often required high-resolution images (150&ndash;300 dpi) or manual intervention to separate diagrams from text, making fully automated, large-scale database annotation impractical.</p>
<h2 id="core-innovation-modified-transforms-and-spell-checking">Core Innovation: Modified Transforms and Spell Checking</h2>
<p>The authors introduce <strong>ChemReader</strong>, a fully automated toolkit with several specific algorithmic innovations tailored for chemical diagrams:</p>
<ul>
<li><strong>Modified Hough Transform (HT):</strong> Unlike standard HT, which treats all pixels equally, ChemReader uses a modified weight function that accounts for pixel connectivity and line thickness to better detect chemical bonds.</li>
<li><strong>Chemical Spell Checker:</strong> A post-processing step that uses a dictionary of common chemical abbreviations (770 entries) and n-gram probabilities to correct Optical Character Recognition (OCR) errors (e.g., correcting specific atom labels based on valence rules), improving accuracy from 66% to 87%.</li>
<li><strong>Specific Substructure Detection:</strong> Dedicated algorithms for detecting stereochemical &ldquo;wedge&rdquo; bonds using corner detection and aromatic rings using the Generalized Hough Transform.</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared ChemReader against three other systems: <strong>OSRA V1.0.1</strong>, <strong>CLiDE V2.1 Lite</strong>, and <strong>Kekule V2.0 demo</strong>.</p>
<p>They used three distinct datasets to test robustness:</p>
<ol>
<li><strong>Set I (50 images):</strong> Diverse drawing styles and fonts collected via Google Image Search.</li>
<li><strong>Set II (100 images):</strong> Ligand images from the GLIDA database, linked to PubChem for ground truth.</li>
<li><strong>Set III (212 images):</strong> Low-resolution images embedded in 121 scanned journal articles from PubMed.</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Accuracy:</strong> ChemReader significantly outperformed competitors. In the difficult <strong>Set III</strong> (journal articles), ChemReader achieved <strong>30.2%</strong> correct exact output, compared to 17% for OSRA and 6.6% for CLiDE.</li>
<li><strong>Similarity:</strong> Even when exact matches failed, ChemReader maintained high Tanimoto similarity scores (0.74&ndash;0.86), indicating it successfully captured the majority of chemically significant features.</li>
<li><strong>Substructure Recognition:</strong> ChemReader demonstrated higher recall rates across all PubChem fingerprint categories (rings, atom pairs, SMARTS patterns) compared to other tools.</li>
<li><strong>Error Correction:</strong> The &ldquo;Chemical Spell Checker&rdquo; improved character recognition accuracy from <strong>66% to 87%</strong>.</li>
</ul>
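<p>For reference, the Tanimoto score used above is the ratio of shared to total &ldquo;on&rdquo; bits between two fingerprints. A minimal sketch on toy bit sets (real PubChem substructure fingerprints have 881 positions):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Toy bit positions standing in for PubChem substructure fingerprints
print(tanimoto({1, 2, 3, 4}, {2, 3, 4, 5}))  # 0.6
```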
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized three test sets collected from public sources.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set I</strong></td>
          <td>50 images</td>
          <td>Sourced from Google Image Search to vary styles/fonts.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set II</strong></td>
          <td>100 images</td>
          <td>Randomly selected ligands from the GLIDA database; ground truth via PubChem.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set III</strong></td>
          <td>212 images</td>
          <td>Extracted from 121 PubMed journal articles; specifically excludes non-chemical figures.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of several sequential processing steps:</p>
<ul>
<li><strong>De-noising:</strong> Uses <strong>GREYCstoration</strong>, an anisotropic smoothing algorithm, to regulate image noise.</li>
<li><strong>Segmentation:</strong> Uses an <strong>8-connectivity algorithm</strong> to group pixels. Components are classified as text or graphics based on height/area ratios.</li>
<li><strong>Line Detection (Modified Hough Transform):</strong>
<ul>
<li>Standard Hough Transform is modified to weight pixel pairs $(P_i, P_j)$ based on connectivity.</li>
<li><strong>Weight Function ($W_{ij}$):</strong>
$$W_{ij} = \begin{cases} n_{ij}(P_0 - x_{ij}) &amp; \text{if } x_{ij}/n_{ij} &gt; P_0 \\ 0 &amp; \text{otherwise} \end{cases}$$
Where $n_{ij}$ is the pixel count between points, $x_{ij}$ is the count of black pixels, and $P_0$ is a density threshold.</li>
</ul>
</li>
<li><strong>Wedge Bond Detection:</strong> Uses corner detection to find triangles where the area equals the number of black pixels (isosceles shape check).</li>
<li><strong>Chemical Spell Checker:</strong>
<ul>
<li>Calculates the Maximum Likelihood ($ML$) of a character string being a valid chemical word $T$ from a dictionary.</li>
<li><strong>Similarity Metric:</strong>
$$Sim(S_i, T_i) = 1 - \sqrt{\sum_{j=1}^{M} [I^{S_i}(j) - I^{T_i}(j)]^2}$$
Uses pixel-by-pixel intensity difference between the input segment $S$ and candidate template $T$.</li>
</ul>
</li>
</ul>
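<p>The spell-checker scoring step can be sketched as template matching over pixel intensities. The glyph bitmaps below are toy stand-ins, and the per-pixel normalization (dividing the intensity differences by $\sqrt{M}$ so identical images score exactly 1) is an assumption about how the distance is kept in $[0, 1]$:</p>

```python
import numpy as np

def template_similarity(seg, tmpl):
    """Sim(S, T) = 1 - sqrt(sum_j (I_S(j) - I_T(j))^2) over normalized intensities."""
    s = np.asarray(seg, float).ravel()
    t = np.asarray(tmpl, float).ravel()
    d = (s - t) / np.sqrt(s.size)  # assumption: per-pixel normalization
    return 1.0 - np.sqrt(np.sum(d * d))

def best_match(seg, dictionary):
    """Pick the dictionary template maximizing the similarity score."""
    return max(dictionary, key=lambda name: template_similarity(seg, dictionary[name]))

# Toy 3x3 glyph bitmaps standing in for GOCR character segments
templates = {
    "O": np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]]),
    "C": np.array([[1, 1, 1], [1, 0, 0], [1, 1, 1]]),
}
noisy_O = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 0]])  # one flipped pixel
print(best_match(noisy_O, templates))  # O
```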
<h3 id="models">Models</h3>
<ul>
<li><strong>Character Recognition:</strong> Uses the open-source <strong>GOCR</strong> library. It employs template matching based on features like holes, pixel densities, and transitions.</li>
<li><strong>Chemical Dictionary:</strong> A lookup table containing <strong>770</strong> frequently used chemical abbreviations and fundamental valence rules.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using exact structure matching and fingerprint similarity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Set III)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>% Correct</strong></td>
          <td><strong>30.2%</strong></td>
          <td>17%</td>
          <td>Exact structure match using ChemAxon JChem.</td>
      </tr>
      <tr>
          <td><strong>Avg Similarity</strong></td>
          <td><strong>0.740</strong></td>
          <td>0.526</td>
          <td>Tanimoto similarity on PubChem Substructure Fingerprints.</td>
      </tr>
      <tr>
          <td><strong>Precision (Rings)</strong></td>
          <td><strong>0.87</strong></td>
          <td>0.84</td>
          <td>Precision rate for recognizing ring systems.</td>
      </tr>
      <tr>
          <td><strong>Recall (Rings)</strong></td>
          <td><strong>0.83</strong></td>
          <td>0.73</td>
          <td>Recall rate for recognizing ring systems.</td>
      </tr>
  </tbody>
</table>
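<p>The Tanimoto similarity used in the evaluation is the standard Jaccard coefficient over fingerprint bits. A minimal sketch, representing a binary fingerprint as the set of its on-bit indices:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    each given as a set of on-bit indices (e.g. PubChem substructure
    fingerprint bits): |A & B| / |A | B|."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are conventionally identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```

For example, fingerprints sharing 2 of 4 total on-bits score 0.5.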
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> C++ implementation running on MS Windows.</li>
<li><strong>Dependencies:</strong> GOCR (OCR), GREYCstoration (Image processing).</li>
</ul>
]]></content:encoded></item><item><title>Chemical Machine Vision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</guid><description>Machine vision approach using Gabor wavelets and Kohonen networks to classify chemical raster images and extract structural metadata.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gkoutos, G. V., Rzepa, H., Clark, R. M., Adjei, O., &amp; Johal, H. (2003). Chemical Machine Vision: Automated Extraction of Chemical Metadata from Raster Images. <em>Journal of Chemical Information and Computer Sciences</em>, 43(5), 1342-1355. <a href="https://doi.org/10.1021/ci034017n">https://doi.org/10.1021/ci034017n</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 2003</p>
<h2 id="paper-classification-methodological-approach">Paper Classification: Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline applying &ldquo;machine vision&rdquo; techniques (Gabor wavelets and Kohonen networks) to the problem of identifying chemical diagrams in low-resolution raster images. The paper focuses on the &ldquo;how&rdquo; (the algorithm and its parameters) and validates the method through quantitative experiments optimizing feature vectors and masks.</p>
<h2 id="motivation-extracting-legacy-chemical-data">Motivation: Extracting Legacy Chemical Data</h2>
<p>The primary motivation is to unlock the &ldquo;large amount of data&rdquo; trapped in legacy raster images (GIF, JPEG) on the Web that lack semantic metadata.</p>
<ul>
<li><strong>Legacy Data Problem</strong>: Most chemical structural information on the Web is embedded in raster images, not machine-readable formats like Molfiles.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous tools like Kekulé and CLiDE acted as &ldquo;Chemical OCR,&rdquo; attempting to reconstruct exact atom-bond connections. This required high-resolution images (&gt;300 dpi) and human intervention, making them unsuitable for automated Web crawling of low-resolution (72-96 dpi) images.</li>
<li><strong>Goal</strong>: To create a low-cost, automated tool for a &ldquo;robot-based Internet resource discovery tool&rdquo; that can classify images (e.g., &ldquo;is this a molecule?&rdquo;).</li>
</ul>
<h2 id="core-innovation-texture-recognition-over-structural-ocr">Core Innovation: Texture Recognition over Structural OCR</h2>
<p>The core novelty is the shift from &ldquo;Optical Character Recognition&rdquo; (exact reconstruction) to <strong>&ldquo;Texture Recognition&rdquo;</strong> (classification).</p>
<ul>
<li><strong>Texture-Based Approach</strong>: The authors treat chemical diagrams as textures. They use <strong>Gabor wavelets</strong> to extract texture features. <strong>Crucially, this system does not recognize specific chemical structures</strong> (i.e., atom-bond connectivity tables, <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, or Molfiles). It only classifies images into broad categories.</li>
<li><strong>Incremental Learning</strong>: The system uses a <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong> combined with Class Boundary Analysis (CBA). This allows for &ldquo;incremental learning,&rdquo; where new classes (e.g., aromatic vs. non-aromatic) can be added without retraining the entire system.</li>
<li><strong>Optimization for Chemistry</strong>: The authors identify specific parameters (frequency channels, mask sizes) that are optimal for the &ldquo;texture&rdquo; of chemical diagrams.</li>
<li><strong>Integration with ChemDig</strong>: The method was designed to feed into ChemDig, a robot-based index engine for automated web crawling and metadata generation.</li>
</ul>
<h2 id="experimental-setup-parameter-optimization">Experimental Setup: Parameter Optimization</h2>
<p>The authors performed optimization and validation experiments using a dataset of <strong>300 images</strong> divided into three classes: Ring Systems, Non-Ring Systems, and Non-Chemistry (textures, biological figures, etc.).</p>
<ol>
<li><strong>Parameter Optimization</strong>: They systematically varied hyperparameters to find the optimal configuration:
<ul>
<li><strong>Feature Vector Size</strong>: Tested sizes from 100 to 4000 elements.</li>
<li><strong>Energy Mask Size</strong>: Tested windows from $3 \times 3$ to $15 \times 15$ pixels.</li>
<li><strong>Frequency Channels</strong>: Tested seven spatial frequencies ($\sqrt{2}$ to $64\sqrt{2}$).</li>
</ul>
</li>
<li><strong>Classification Performance</strong>: Evaluated the system&rsquo;s ability to classify unseen test images using a 50:50 training/test split.</li>
<li><strong>Comparison</strong>: Qualitatively compared the approach against vectorization tools (Autotrace, CR2V).</li>
</ol>
<h2 id="results-robust-classification-of-low-resolution-images">Results: Robust Classification of Low-Resolution Images</h2>
<ul>
<li><strong>Optimal Configuration</strong>: The system performed best with a feature vector size of ~1500 elements, a $9 \times 9$ energy mask, and frequency channel $4\sqrt{2}$.</li>
<li><strong>High Accuracy</strong>: Achieved a recognition rate of <strong>91%</strong> with a 50:50 training/test split, and up to <strong>92%</strong> with a 70:30 split.</li>
<li><strong>Robustness</strong>: The system successfully distinguished between chemical and non-chemical images (zero false negatives for chemical images).</li>
<li><strong>Limitations</strong>: Misclassifications occurred between &ldquo;ring&rdquo; and &ldquo;non-ring&rdquo; systems when structures had similar visual &ldquo;textures&rdquo; (e.g., similar density or layout).</li>
<li><strong>Impact</strong>: The method is viable for automating metadata generation (e.g., <code>alt</code> tags) for web crawlers, functioning as a coarse-grained filter before more expensive processing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom dataset of raster images collected from the Web.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><strong>Custom Web Dataset</strong></td>
          <td>300 images</td>
          <td>Split into 3 classes: Ring Systems, Non-Ring Systems, Non-Chemistry.</td>
      </tr>
      <tr>
          <td>Resolution</td>
          <td><strong>Low-Res Web Images</strong></td>
          <td>72-96 dpi</td>
          <td>Deliberately chosen to mimic Web conditions where OCR fails.</td>
      </tr>
      <tr>
          <td>Format</td>
          <td><strong>Raster</strong></td>
          <td>GIF, JPEG</td>
          <td>Typical web formats.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The core pipeline consists of a <strong>Gabor Transform Unit</strong> followed by a <strong>Training/Classification Unit</strong>.</p>
<ul>
<li><strong>Gabor Wavelets</strong>: Used for feature extraction. The 2D Gabor wavelet equation is:
$$h(x,y)=\exp\left\{-\frac{1}{2}\left[\frac{x^{2}}{\sigma_{x}^{2}}+\frac{y^{2}}{\sigma_{y}^{2}}\right]\right\}\cos(2\pi\mu_{\sigma}x+\phi)$$
<ul>
<li><strong>Bank Structure</strong>: 28 filters total (4 orientations $\times$ 7 radial frequencies).</li>
<li><strong>Orientations</strong>: $0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}$.</li>
<li><strong>Frequencies</strong>: 1 octave apart, specifically $\sqrt{2}, \dots, 64\sqrt{2}$.</li>
<li><strong>Selected Frequency</strong>: $4\sqrt{2}$ was found to be optimal for chemistry.</li>
</ul>
</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Buffer Mounting</strong>: Images are mounted in a buffer (set to 0) to handle edge artifacts.</li>
<li><strong>Look-Up-Tables (LUT/LUF)</strong>: A binary Look-Up-Frame (LUF) indicates Regions of Interest (ROI) to avoid computing empty space; values are stored in a Look-Up-Table (LUT) to prevent re-computation of overlapping windows.</li>
</ul>
</li>
<li><strong>Feature Extraction</strong>:
<ul>
<li><strong>Non-linear Thresholding</strong>: $\psi(t) = \tanh(\alpha t)$ with $\alpha = 0.25$.</li>
<li><strong>Energy Function</strong>: Calculated as average absolute deviation from the mean using a window $W_{xy}$.
$$e_{k}(x,y)=\frac{1}{M^{2}}\sum_{(a,b)\in W_{xy}}|\psi(r_{k}(a,b))|$$</li>
<li><strong>Optimal Window</strong>: $9 \times 9$ pixels.</li>
</ul>
</li>
</ul>
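<p>The feature-extraction steps above can be sketched directly from the stated formulas: apply the non-linear threshold $\psi(t) = \tanh(\alpha t)$ to a Gabor filter response, then average $|\psi|$ over an $M \times M$ window. The function name and the brute-force sliding window are illustrative; a real implementation would vectorize the windowed sum.</p>

```python
import numpy as np


def gabor_energy(response: np.ndarray, m: int = 9, alpha: float = 0.25) -> np.ndarray:
    """Energy feature map e_k for one Gabor filter response r_k.

    Applies psi(t) = tanh(alpha * t), then averages |psi| over an
    m x m window centered on each pixel (m = 9 and alpha = 0.25 are
    the values reported as optimal). Windows are clipped at borders.
    """
    psi = np.abs(np.tanh(alpha * response))
    h, w = psi.shape
    r = m // 2
    energy = np.zeros_like(psi)
    for y in range(h):
        for x in range(w):
            win = psi[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            energy[y, x] = win.sum() / (m * m)  # 1/M^2 normalization
    return energy
```

On a constant response, every interior pixel's energy is simply $\tanh(\alpha)$.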
<h3 id="models">Models</h3>
<p>The classification model relies on competitive learning.</p>
<ul>
<li><strong>Architecture</strong>: <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong>.</li>
<li><strong>Training</strong>:
<ul>
<li><strong>Learning Rate</strong>: Starts at 1.0, decreases to 0.1.</li>
<li><strong>Class Boundary Analysis (CBA)</strong>: Computes the centroid (mean) and variance of each cluster. The variance defines the class boundary.</li>
</ul>
</li>
<li><strong>Classification Metric</strong>: <strong>Euclidean Distance Norm</strong>. An unknown vector is classified based on the shortest distance to a cluster center, provided it falls within the variance boundary.
$$D_{ij}=||x_{i}-x_{j}||$$</li>
</ul>
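<p>The decision rule can be sketched as a nearest-centroid classifier with a variance boundary. The cluster structure shown (one centroid and a scalar boundary radius per class) is an assumed simplification of the KSOFM/CBA machinery; training the map itself is not shown.</p>

```python
import math


def classify(vec, clusters):
    """Nearest-centroid classification with a class-boundary check (CBA).

    clusters : dict label -> (centroid, boundary_radius), where the
               radius stands in for the variance-derived class boundary.
    Returns the label of the closest centroid (Euclidean distance norm)
    if the vector falls inside that cluster's boundary, else None
    (rejected as unknown).
    """
    best_label, best_dist = None, float("inf")
    for label, (centroid, _radius) in clusters.items():
        d = math.dist(vec, centroid)  # D_ij = ||x_i - x_j||
        if d < best_dist:
            best_label, best_dist = label, d
    if best_label is not None and best_dist <= clusters[best_label][1]:
        return best_label
    return None
```

A vector that is closest to a centroid but outside its boundary is rejected rather than forced into a class.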
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using recognition rate ($R_s$) and misclassification error ($E_s$).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recognition Rate</td>
          <td><strong>91%</strong></td>
          <td>N/A</td>
          <td>Achieved with 50:50 split. 92% with 70:30 split.</td>
      </tr>
      <tr>
          <td>Feature Size</td>
          <td><strong>~1500</strong></td>
          <td>4000</td>
          <td>Reducing vector size from 4000 to 1500 maintained ~80% accuracy while improving speed.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gkoutosChemicalMachineVision2003,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}: {{Automated Extraction}} of {{Chemical Metadata}} from {{Raster Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Gkoutos, Georgios V. and Rzepa, Henry and Clark, Richard M. and Adjei, Osei and Johal, Harpal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2003</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{43}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1342--1355}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci034017n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Literature Data Extraction: The CLiDE Project</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</guid><description>Seminal OCSR system converting scanned chemical diagrams into connection tables via primitive recognition and semantic interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ibison, P., Jacquot, M., Kam, F., Neville, A. G., Simpson, R. W., Tonnelier, C., Venczel, T., &amp; Johnson, A. P. (1993). Chemical Literature Data Extraction: The CLiDE Project. <em>Journal of Chemical Information and Computer Sciences</em>, 33(3), 338-344. <a href="https://doi.org/10.1021/ci00013a010">https://doi.org/10.1021/ci00013a010</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 1993</p>
<h2 id="contribution-and-taxonomy">Contribution and Taxonomy</h2>
<p><strong>Classification: Method ($\Psi_{\text{Method}}$)</strong></p>
<p>This methodological paper proposes a novel software architecture for Optical Chemical Structure Recognition (OCSR). It details specific algorithms for image segmentation, vectorization, and chemical interpretation, validated through the successful extraction of complex structures from literature.</p>
<h2 id="motivation-automating-literature-extraction">Motivation: Automating Literature Extraction</h2>
<p>The manual creation of chemical reaction databases is a time-consuming and expensive process requiring trained chemists to abstract information from literature. While commercial tools existed for interpreting isolated scanned structures (like Kekulé), there was a lack of systems capable of processing whole pages of journals (including embedded text, reaction schemes, and structures) without significant human intervention.</p>
<h2 id="core-innovation-a-three-phase-hybrid-architecture">Core Innovation: A Three-Phase Hybrid Architecture</h2>
<p>CLiDE introduces a comprehensive <strong>three-phase architecture</strong> (Recognition, Grouping, Interpretation) that integrates computer vision with chemical knowledge. Key novelties include:</p>
<ul>
<li><strong>Context-Aware Interpretation:</strong> The use of an extendable <strong>superatom database</strong> to resolve ambiguities in chemical text (e.g., expanding &ldquo;OAc&rdquo; or &ldquo;Me&rdquo; into connection tables).</li>
<li><strong>Hybrid Primitive Detection:</strong> A combination of contour coding for solid lines and a modified Hough transform specifically tuned for detecting dashed chemical bonds.</li>
<li><strong>Semantic Reconstruction:</strong> A scoring system for bond-atom association that considers both distance and vector direction to handle poorly drawn structures.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<p>The authors validated the system on a set of &ldquo;difficult cases&rdquo; selected to test specific capabilities. These included:</p>
<ul>
<li><strong>Crossing Bonds:</strong> Structures where bonds intersect without a central atom (Fig. 9d, 9e).</li>
<li><strong>Stereochemistry:</strong> Identification of wedged, dashed, and wavy bonds.</li>
<li><strong>Generic Structures:</strong> Parsing generic text blocks (e.g., $R^1 = Me$) and performing substitutions.</li>
<li><strong>Accuracy Estimation:</strong> The authors report an approximate 90% recognition rate for distinct characters in literature scans.</li>
</ul>
<h2 id="results-and-structural-reconstruction">Results and Structural Reconstruction</h2>
<p>The system successfully generates connection tables (exported as MOLfiles or ChemDraw files) from scanned bitmaps. It effectively distinguishes between graphical primitives (wedges, lines) and text, accurately reconstructing stereochemistry and resolving superatom synonyms (e.g., converting &ldquo;MeO&rdquo; to &ldquo;OMe&rdquo;). The authors conclude that while character recognition depends heavily on image quality, the graphic primitive recognition is robust for lines above a threshold length.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Input Format:</strong> Binary bitmaps scanned from journal pages.</li>
<li><strong>Resolution:</strong> 300 dpi (generating ~1 MB per page).</li>
<li><strong>Superatom Database:</strong> A lookup table containing ~200 entries. Each entry includes:
<ul>
<li><strong>Valency/Charge:</strong> Explicit constraints (e.g., &ldquo;HO&rdquo; takes 1 bond, &ldquo;CO2&rdquo; takes 2).</li>
<li><strong>Bonding Index:</strong> Specifies which letter in the string serves as the attachment point (e.g., letter 2 for &ldquo;HO&rdquo;, letters 1 and 2 for &ldquo;CO2&rdquo;).</li>
<li><strong>Sub-Connection Table:</strong> The internal atomic representation of the group.</li>
</ul>
</li>
</ul>
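<p>A hypothetical sketch of what a superatom entry might look like, following the fields described above. The dictionary layout and helper name are illustrative, not CLiDE's actual format, and the internal sub-connection table is omitted for brevity.</p>

```python
# Each abbreviation stores how many bonds it accepts and which letter(s)
# in the string serve as the attachment point (1-indexed, per the paper's
# examples). Values here mirror the examples given in the text.
SUPERATOMS = {
    "HO":  {"valency": 1, "bonding_letters": [2]},     # bonds through the O
    "CO2": {"valency": 2, "bonding_letters": [1, 2]},  # takes 2 bonds
    "OAc": {"valency": 1, "bonding_letters": [1]},     # illustrative entry
}


def lookup(label: str):
    """Return the superatom entry for a recognized text label, or None
    if the label is unknown (left to downstream valence checking)."""
    return SUPERATOMS.get(label)
```

Resolving a label like &ldquo;HO&rdquo; then amounts to a dictionary lookup plus substitution of the stored sub-connection table.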
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Primitive Recognition (Vectorization)</strong></p>
<ul>
<li><strong>Contour Coding:</strong> Uses the <strong>Ahronovitz-Bertier-Habib</strong> method to generate interpixel contours (directions N, S, E, W) for connected components.</li>
<li><strong>Polygonal Approximation:</strong> A method similar to that of <strong>Sklansky and Gonzalez</strong> breaks contours into &ldquo;fractions&rdquo;.
<ul>
<li><em>Rule:</em> Long sides are &ldquo;straight fractions&rdquo;; consecutive short sides are &ldquo;curved fractions&rdquo;.</li>
<li><em>Reconstruction:</em> Parallel fractions are paired to form bond borders. If a border is split (due to noise or crossing lines), the system attempts to merge collinear segments.</li>
</ul>
</li>
<li><strong>Dash Detection:</strong> A <strong>modified Hough transform</strong> is applied to small connected components. It requires at least <strong>three collinear dashes</strong> to classify a sequence as a dashed bond.</li>
</ul>
<p><strong>2. Interpretation Rules</strong></p>
<ul>
<li><strong>Bond-Atom Association:</strong>
<ul>
<li><em>Candidate Selection:</em> The system identifies $m$ closest bonds for a superatom requiring $n$ connections ($m \ge n$).</li>
<li><em>Scoring Function:</em> Connections are selected based on minimizing <strong>perpendicular distance</strong> (alignment).</li>
</ul>
</li>
<li><strong>Crossing Bonds:</strong> Resolved using rules based on <strong>proximity, length, collinearity, and ring membership</strong> to distinguish actual crossings from central carbon atoms.</li>
</ul>
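<p>The distance-based half of the bond-atom scoring can be sketched as below: from the candidate bonds near a superatom, keep the $n$ whose supporting lines pass closest to the atom's position. This is a simplification (the paper's score also weighs vector direction, omitted here), and the names are illustrative.</p>

```python
import math


def associate_bonds(atom_xy, bonds, n_required):
    """Select the n best bond attachments for a superatom by the
    perpendicular distance from the atom position to each bond's
    supporting line.

    bonds : list of ((x1, y1), (x2, y2)) candidate bond segments.
    """
    ax, ay = atom_xy

    def perp_dist(seg):
        (x1, y1), (x2, y2) = seg
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        if length == 0:
            return math.hypot(ax - x1, ay - y1)  # degenerate segment
        # distance from the atom to the infinite line through the segment
        return abs(dy * (ax - x1) - dx * (ay - y1)) / length

    return sorted(bonds, key=perp_dist)[:n_required]
```

A bond whose line points straight at the atom scores a distance near zero and is preferred over a misaligned one.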
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR:</strong> A neural network trained on alphanumeric characters.
<ul>
<li><strong>Input Representation:</strong> Density matrices derived from character bitmaps.</li>
<li><strong>Post-processing:</strong> Unrecognized characters are flagged for manual correction.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> SUN SPARC workstation.</li>
<li><strong>Scanner:</strong> Agfa Focus S 800GS.</li>
<li><strong>Implementation Language:</strong> C++.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ibisonChemicalLiteratureData1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction: {{The CLiDE Project}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ibison, P. and Jacquot, M. and Kam, F. and Neville, A. G. and Simpson, R. W. and Tonnelier, C. and Venczel, T. and Johnson, A. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{338--344}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00013a010}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Automatic Recognition of Chemical Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</guid><description>A rule-based system for extracting chemical structure information from raster images, validated against commercial baselines.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-image-mining-architecture">Contribution: Rule-Based Image Mining Architecture</h2>
<p><strong>$\Psi_{\text{Method}}$ (Methodological Basis)</strong></p>
<p>This is a methodological paper describing a system architecture for <strong>image mining</strong> in the chemical domain. It focuses on the engineering challenge of converting rasterized depictions of molecules into computer-readable SDF files. The paper details the algorithmic pipeline and validates it through quantitative benchmarking against a commercial alternative.</p>
<h2 id="motivation-digitizing-chemical-literature">Motivation: Digitizing Chemical Literature</h2>
<ul>
<li><strong>Loss of Information</strong>: Chemical software renders molecules as images; once these are published in the scientific literature, the underlying chemical significance is lost, leaving the data &ldquo;dead&rdquo; to computers.</li>
<li><strong>Gap in Technology</strong>: Image mining lags behind advances in text mining; existing commercial solutions (like CLIDE) have either faded away or remained limited.</li>
<li><strong>Scale of Problem</strong>: The colossal production of chemical documents requires automated tools to exploit this information at large scale.</li>
</ul>
<h2 id="core-innovation-graph-preserving-vectorization">Core Innovation: Graph-Preserving Vectorization</h2>
<ul>
<li><strong>Graph-Preserving Vectorization</strong>: The system uses a custom vectorizer designed to preserve the &ldquo;graph characteristics&rdquo; of chemical diagrams (1 vector = 1 line), which avoids creating spurious vectors at thick joints. It aims to generate a mathematical graph, $G = (V, E)$, mapped geometrically to the image lines.</li>
<li><strong>Chemical Knowledge Integration</strong>: A distinct module validates the reconstructed graph against chemical rules (valences, charges) to ensure the output is chemically valid.</li>
<li><strong>Hybrid Processing</strong>: The system splits the image into &ldquo;connected components&rdquo; for an OCR path (text/symbols) and a &ldquo;body&rdquo; path (bonds), reassembling them later.</li>
</ul>
<h2 id="methodology--experiments-benchmark-validation">Methodology &amp; Experiments: Benchmark Validation</h2>
<p>The authors performed a quantitative validation using <strong>three different databases</strong> where ground-truth SDF files were available. They also compared their system against the commercial tool <strong>CLIDE</strong> (Chemical Literature Data Extraction).</p>
<ul>
<li><strong>Database 1</strong>: 100 images (varied line widths/fonts)</li>
<li><strong>Database 2</strong>: 100 images</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale batch processing)</li>
</ul>
<h2 id="results--conclusions-superior-accuracy-over-baselines">Results &amp; Conclusions: Superior Accuracy over Baselines</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved <strong>94%</strong> correct reconstruction on Database 1 and <strong>77%</strong> on Database 2. Accuracy was measured as correct recovery of identical geometry and connections.</li>
</ul>
<p>$$ \text{Acc} = \frac{\text{Correct Images}}{\text{Total Images}} $$</p>
<ul>
<li><strong>Baseline Superiority</strong>: The commercial tool CLIDE only successfully reconstructed ~50% of images in Database 1 (compared to the authors&rsquo; 94%).</li>
<li><strong>Scalability</strong>: On the large dataset (Database 3), the system achieved <strong>67%</strong> accuracy in batch mode.</li>
<li><strong>Robustness</strong>: The authors state the system uses a handful of parameters and works robustly across different image types. CLIDE lacked flexibility and required manual intervention.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Formally Reproducible. As is common with applied research from this era, the source code, training models (SVM), and specific datasets used for benchmarking do not appear to be publicly maintained or available.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">No public code, models, or datasets were released with this 2007 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Used for comparison with CLIDE; 94% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>77% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale test; 67% success rate</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper outlines a 5-module pipeline:</p>
<ol>
<li><strong>Pre-processing</strong>: Adaptive histogram binarization and non-recursive connected component labeling using RLE segments.</li>
<li><strong>OCR</strong>: A &ldquo;chemically oriented OCR&rdquo; using wavelet functions for feature extraction and a <strong>Support Vector Machine (SVM)</strong> for classification. It distinguishes characters from molecular structure.</li>
<li><strong>Vectorizer</strong>: Assigns local directions to RLE segments and groups them into patterns. Crucially, it enforces a one-to-one mapping between image lines and graph vectors.</li>
<li><strong>Reconstruction</strong>: A rule-based module that annotates vectors:
<ul>
<li><strong>Stereochemistry</strong>: Registers vectors against original pixels; thick geometric forms (triangles) become chiral wedges.</li>
<li><strong>Dotted Bonds</strong>: Identifies isolated vectors and clusters them using <strong>quadtree clustering</strong>.</li>
<li><strong>Multi-bonds</strong>: Identifies parallel vectors within a dilated bounding box (factor of 2).</li>
</ul>
</li>
<li><strong>Chemical Knowledge</strong>: Validates the graph valences and properties before exporting SDF.</li>
</ol>
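<p>The multi-bond rule in the reconstruction step might be sketched as below: two vectors form a double/triple bond if they are roughly parallel and one lies inside the other's dilated bounding box. The angle tolerance and the exact dilation scheme are assumptions on my part; the paper only states a dilation factor of 2.</p>

```python
import math


def are_multibond(v1, v2, angle_tol=0.1, dilation=2.0):
    """Heuristic multi-bond test for two vectors ((x1, y1), (x2, y2)):
    nearly parallel, and v2's endpoints fall inside v1's bounding box
    dilated by the given factor."""

    def angle(v):
        (x1, y1), (x2, y2) = v
        return math.atan2(y2 - y1, x2 - x1) % math.pi  # undirected angle

    def dilated_box(v, factor):
        (x1, y1), (x2, y2) = v
        xmin, xmax = min(x1, x2), max(x1, x2)
        ymin, ymax = min(y1, y2), max(y1, y2)
        # pad both dimensions so thin (axis-aligned) boxes still grow
        pad = (factor - 1) * max(xmax - xmin, ymax - ymin) / 2
        return xmin - pad, ymin - pad, xmax + pad, ymax + pad

    def inside(v, box):
        xmin, ymin, xmax, ymax = box
        return all(xmin <= x <= xmax and ymin <= y <= ymax for x, y in v)

    d = abs(angle(v1) - angle(v2))
    parallel = min(d, math.pi - d) < angle_tol  # handle wrap-around at pi
    return parallel and inside(v2, dilated_box(v1, dilation))
```

Two parallel strokes one line-width apart pass the test; a stroke at 45&deg; fails the parallelism check immediately.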
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM</strong>: Used in the OCR module to classify text/symbols. It supports dynamic training to correct classification mistakes.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the percentage of correctly reconstructed images (generating a valid, matching SDF file).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>System Value (DB1)</th>
          <th>Baseline (CLIDE)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reconstruction Accuracy</td>
          <td><strong>94%</strong></td>
          <td>~50%</td>
          <td>CLIDE noted as unsuitable for batch processing</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., &amp; Hofmann-Apitius, M. (2007). Automatic Recognition of Chemical Images. <em>Eighth Mexican International Conference on Current Trends in Computer Science</em>, 41-46. <a href="https://doi.org/10.1109/ENC.2007.25">https://doi.org/10.1109/ENC.2007.25</a></p>
<p><strong>Publication</strong>: ENC 2007 (IEEE Computer Society)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriAutomaticRecognitionChemical2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic {{Recognition}} of {{Chemical Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Eighth {{Mexican International Conference}} on {{Current Trends}} in {{Computer Science}} ({{ENC}} 2007)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{41--46}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ENC.2007.25}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>A Spatial Model for Legislative Roll Call Analysis</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/nominate-1985/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/nominate-1985/</guid><description>Introduces NOMINATE, a probabilistic spatial model estimating legislator ideal points from roll call data via maximum likelihood.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It introduces a &ldquo;general nonlinear logit model&rdquo; and a specific estimation algorithm (<strong>NOMINATE</strong>) to analyze political choice data. The paper focuses on deriving a metric spatial map from nominal data (yea/nay votes). It validates this method by comparing it against existing techniques like Guttman scaling and factor analysis, demonstrating that the new method recovers geometric structures that previous methods obscured.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Prior research relied on &ldquo;black box&rdquo; statistical methods (like factor analysis or nonmetric scaling) or Guttman scaling to analyze legislative behavior. These methods had significant limitations:</p>
<ul>
<li><strong>Metric Recovery</strong>: They struggled to accurately recover the underlying Euclidean coordinates of legislators and choices from nominal data.</li>
<li><strong>Dimensionality</strong>: They tended to exaggerate the number of dimensions (issues) because they did not account for probabilistic error in voting.</li>
<li><strong>Identification</strong>: Pure Guttman scaling (assuming perfect voting) identifies only the order of legislators, leaving the location of policy alternatives unknown.</li>
</ul>
<p>The authors sought to bridge the &ldquo;crucial gap&rdquo; between spatial theory and data by developing a model-driven procedure that simultaneously estimates the locations of choosers and choices while accounting for error.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core contribution is the <strong>NOMINATE</strong> (Nominal Three-step Estimation) procedure. Key innovations include:</p>
<ul>
<li><strong>Simultaneous Estimation</strong>: This method estimates coordinates for <em>both</em> the legislators ($x_i$) and the roll call outcomes ($z_{jl}$) in a common space simultaneously.</li>
<li><strong>Probabilistic Utility</strong>: It employs a specific bell-shaped utility function with a stochastic error term (log of the inverse exponential), allowing for a tractable probabilistic voting model.</li>
<li><strong>Metric Unfolding</strong>: It successfully performs &ldquo;unfolding methodology for nominal level data,&rdquo; recovering metric distances solely from binary choices.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the model through both historical data analysis and synthetic testing:</p>
<ul>
<li><strong>US House Analysis (1957-58)</strong>: Analyzed 172 roll calls from the 85th Congress to compare NOMINATE results against Miller and Stokes&rsquo; influential Guttman scales.</li>
<li><strong>US Senate Analysis (1979-1982)</strong>: Performed separate estimations for four years of Senate voting to assess stability and validity.</li>
<li><strong>Monte Carlo Simulations</strong>: Generated synthetic data (98 legislators and 291 roll calls in most runs, 50 legislators in one run) for different values of $\beta$ to test the robustness of parameter recovery under known &ldquo;truth&rdquo; conditions.</li>
<li><strong>Robustness Checks</strong>: Tested sensitivity to &ldquo;perfect&rdquo; legislators (who never vote against their side) and outliers (like Senator Proxmire).</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Unidimensionality</strong>: A single liberal-conservative dimension correctly classified ~80% of individual choices in the US House and Senate.</li>
<li><strong>Dimensionality Reduction</strong>: The model demonstrated that distinct &ldquo;issue scales&rdquo; found in previous research (e.g., social welfare vs. foreign policy) could largely be mapped onto a single dimension when error is accounted for.</li>
<li><strong>Strategic Behavior</strong>: The analysis revealed that majority leadership tends to place roll call midpoints slightly away from the median legislator to increase the probability of passage.</li>
<li><strong>Geometric Mean Probability</strong>: The authors introduced the geometric mean probability as a more robust metric than simple classification error for evaluating probabilistic models.</li>
<li><strong>Limitations</strong>: The authors acknowledge that the model is restricted to one dimension with a common utility function, and that civil rights voting represents a genuinely separate dimension not captured by the liberal-conservative axis. Standard errors computed from the alternating procedure are theoretically approximate (computed from separate information matrices rather than the full joint matrix), though Monte Carlo tests showed them to be reasonably reliable in practice. Extensions to multidimensional models and variable utility functions are deferred to later work.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper analyzes roll call voting matrices (a roll call is a procedure in which each legislator&rsquo;s name is called and their individual vote is recorded, producing a complete public record of who voted which way) where rows are legislators and columns are roll calls.</p>
<table>
  <thead>
      <tr>
          <th>Context</th>
          <th>Size</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>US House (85th)</strong></td>
          <td>440 Legislators x 172 Roll Calls</td>
          <td>68,284 choices; 1957-58</td>
      </tr>
      <tr>
          <td><strong>US Senate</strong></td>
          <td>~100 Senators/year</td>
          <td>Years 1979, 1980, 1981, 1982</td>
      </tr>
      <tr>
          <td><strong>Filtering</strong></td>
          <td>Cutoff &gt; 2.5%</td>
          <td>Roll calls with &lt; 2.5% minority vote are excluded to prevent &ldquo;noise&rdquo; from distorting estimates.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>NOMINATE</strong> algorithm maximizes the log-likelihood of observed choices using a constrained nonlinear maximum likelihood procedure.</p>
<p><strong>Utility Function</strong>:
The utility of legislator $i$ for outcome $j$ on roll call $l$ is:
$$U_{ijl}=\beta~\exp\left[\frac{-\omega^{2}d_{ijl}^{2}}{2}\right]+\epsilon_{ijl}$$
Where $d_{ijl}$ is the Euclidean distance between legislator $i$ and outcome $j$.</p>
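<p>Under the logit error assumption, the implied choice probability has a closed form: the probability of voting Yea is the logistic function of the deterministic utility difference. A minimal one-dimensional sketch (the default $\beta$ and $\omega$ values are illustrative, not the paper's estimates):</p>

```python
import math

def vote_prob_yea(x_i, z_yea, z_nay, beta=15.0, w=0.5):
    """P(legislator at ideal point x_i votes Yea) under the utility
    U = beta * exp(-w^2 * d^2 / 2) + logit error, which yields the
    standard logit choice probability. beta and w defaults are
    illustrative only."""
    u_yea = beta * math.exp(-(w ** 2) * (x_i - z_yea) ** 2 / 2.0)
    u_nay = beta * math.exp(-(w ** 2) * (x_i - z_nay) ** 2 / 2.0)
    return 1.0 / (1.0 + math.exp(u_nay - u_yea))
```

<p><em>A legislator at $x_i = -0.8$ facing a Yea outcome at $-0.5$ and a Nay outcome at $+0.5$ votes Yea with high probability; a legislator exactly at the midpoint between the two outcomes is at 50/50.</em></p>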
<p><strong>Optimization Strategy (Global Iteration)</strong>:
Because estimating ~800 parameters simultaneously is impractical, the algorithm uses an alternating three-step method:</p>
<ol>
<li><strong>Utility Parameters</strong>: Estimate $\beta$ and $\omega$ while holding legislator ($x$) and roll call ($z$) coordinates fixed.</li>
<li><strong>Legislator Coordinates</strong>: Estimate $x_i$ for each legislator (independent of others) holding $\beta, \omega, z$ fixed.</li>
<li><strong>Roll Call Coordinates</strong>: Estimate $z_{yl}, z_{nl}$ for each roll call holding $\beta, \omega, x$ fixed.</li>
</ol>
<p>This cycle repeats until parameters correlate at the 0.99 level between iterations.</p>
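<p>The stopping rule can be sketched directly: compare successive estimates of a parameter vector by Pearson correlation and halt once it reaches 0.99 (a minimal stdlib-only version):</p>

```python
def converged(prev, curr, threshold=0.99):
    """Stopping rule from the paper: iterate the three-step cycle
    until successive parameter estimates correlate at the 0.99 level.
    prev and curr are same-length parameter vectors from consecutive
    global iterations."""
    n = len(prev)
    mp, mc = sum(prev) / n, sum(curr) / n
    cov = sum((p - mp) * (c - mc) for p, c in zip(prev, curr))
    sp = sum((p - mp) ** 2 for p in prev) ** 0.5
    sc = sum((c - mc) ** 2 for c in curr) ** 0.5
    return cov / (sp * sc) >= threshold
```
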
<h3 id="models">Models</h3>
<p>The model estimates the following parameters for a one-dimensional space:</p>
<ul>
<li><strong>Legislator Coordinates ($x_i$)</strong>: The ideal point of each legislator, normalized to the range $[-1, +1]$.</li>
<li><strong>Outcome Coordinates ($z_{yl}, z_{nl}$)</strong>: The spatial location of the &ldquo;Yea&rdquo; and &ldquo;Nay&rdquo; policy outcomes for each vote.</li>
<li><strong>Signal-to-Noise ($\beta$)</strong>: Represents the weight of the spatial component versus the error term.</li>
<li><strong>Weighting ($\omega$)</strong>: A shape parameter for the utility function (often fixed to $0.5$ in practice due to collinearity with $\beta$).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is evaluated primarily via classification accuracy and probabilistic fit.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Context</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Classification</strong></td>
          <td>78.9%</td>
          <td>House (1957-58)</td>
          <td>Correctly predicts Yea/Nay choice</td>
      </tr>
      <tr>
          <td><strong>Classification</strong></td>
          <td>80.3 / 80.6 / 83.2 / 81.7%</td>
          <td>Senate (1979 / 1980 / 1981 / 1982)</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Geo. Mean Prob.</strong></td>
          <td>0.642 (House); 0.654 / 0.638 / 0.657 / 0.637 (Senate 1979 / 1980 / 1981 / 1982)</td>
          <td>Unconstrained roll calls</td>
          <td>Exponential of the average log likelihood</td>
      </tr>
  </tbody>
</table>
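<p>The geometric mean probability reported above is simply the exponential of the average log-likelihood of the observed choices:</p>

```python
import math

def geometric_mean_probability(probs):
    """exp((1/n) * sum(log p_i)) over the model's probabilities for
    the choices actually made. Unlike classification accuracy, one
    confidently wrong prediction (p_i near zero) drags the geometric
    mean toward zero, making it a stricter probabilistic metric."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))
```
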
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Development</strong>: DEC-2060</li>
<li><strong>Production</strong>: VAX-11/780</li>
</ul>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p>This paper predates modern open-source conventions. No original source code was released, and the NOMINATE algorithm was described at an overview level rather than with full pseudocode. However, the underlying roll call voting data for the U.S. Congress is now freely available through the <a href="https://voteview.com/">Voteview</a> project, which Poole and Rosenthal later maintained. Modern open-source reimplementations exist, including the R packages <code>wnominate</code> and <code>pscl</code>. Reproducibility status: <strong>Partially Reproducible</strong> (data available, modern reimplementations exist, but original code not released).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Poole, K. T., &amp; Rosenthal, H. (1985). A Spatial Model for Legislative Roll Call Analysis. <em>American Journal of Political Science</em>, 29(2), 357-384. <a href="https://doi.org/10.2307/2111172">https://doi.org/10.2307/2111172</a></p>
<p><strong>Publication</strong>: American Journal of Political Science 1985</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{pooleSpatialModelLegislative1985,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A {{Spatial Model}} for {{Legislative Roll Call Analysis}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Poole, Keith T. and Rosenthal, Howard}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1985</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{American Journal of Political Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{29}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{357--384}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2307/2111172}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/NOMINATE_(scaling_method)">Wikipedia: NOMINATE</a></li>
<li><a href="https://voteview.com/">Voteview (Modern Repository)</a></li>
</ul>
]]></content:encoded></item><item><title>Correlations in the Motion of Atoms in Liquid Argon</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/correlations-motion-atoms-liquid-argon/</link><pubDate>Sat, 13 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/correlations-motion-atoms-liquid-argon/</guid><description>Rahman's 1964 MD simulation of 864 argon atoms with Lennard-Jones potential revealed the cage effect and validated classical molecular dynamics for liquids.</description><content:encoded><![CDATA[<h2 id="contribution-methodological-validation-of-md">Contribution: Methodological Validation of MD</h2>
<p>This is the archetypal <strong>Method</strong> paper (dominant classification with secondary <strong>Theory</strong> contribution). It establishes the architectural validity of Molecular Dynamics (MD) as a scientific tool. Rahman answers the question: &ldquo;Can a digital computer solving classical difference equations faithfully represent a physical liquid?&rdquo;</p>
<p>The paper utilizes specific rhetorical indicators of a methodological contribution:</p>
<ul>
<li><strong>Algorithmic Explication</strong>: A dedicated Appendix details the predictor-corrector difference equations.</li>
<li><strong>Validation against Ground Truth</strong>: Extensive comparison of calculated diffusion constants and pair-correlation functions against experimental neutron and X-ray scattering data.</li>
<li><strong>Robustness Checks</strong>: Ablation studies on the numerical integration stability (one vs. two corrector cycles).</li>
</ul>
<h2 id="motivation-bridging-neutron-scattering-and-many-body-theory">Motivation: Bridging Neutron Scattering and Many-Body Theory</h2>
<p>In the early 1960s, neutron scattering data provided insights into the dynamic structure of liquids, but theorists lacked concrete models to explain the observed two-body dynamical correlations. Analytic theories were limited by the difficulty of the many-body problem.</p>
<p>Rahman sought to bypass these analytical bottlenecks by assuming that <strong>classical dynamics</strong> with a simple 2-body potential (Lennard-Jones) could sufficiently describe the motion of atoms in liquid argon. The goal was to generate &ldquo;experimental&rdquo; data via simulation to test theoretical models (like the Vineyard convolution approximation) and provide a microscopic understanding of diffusion.</p>
<h2 id="core-innovation-system-stability-and-the-cage-effect">Core Innovation: System Stability and the Cage Effect</h2>
<p>This paper is widely considered the birth of modern molecular dynamics for continuous potentials. Its key novelties include:</p>
<ol>
<li><strong>System Size &amp; Stability</strong>: Successfully simulating 864 particles interacting via a continuous Lennard-Jones potential with stable temperature over the full simulation duration (approximately $10^{-11}$ sec, as confirmed by Table I in the paper).</li>
<li><strong>The &ldquo;Cage Effect&rdquo;</strong>: The discovery that the velocity autocorrelation function becomes negative after a short time:
$$ \langle \textbf{v}(0) \cdot \textbf{v}(t) \rangle &lt; 0 \quad \text{for } t &gt; 0.33 \times 10^{-12} \text{ s} $$
This proved that atoms in a liquid &ldquo;rattle&rdquo; against the cage of their nearest neighbors.</li>
<li><strong>Delayed Convolution</strong>: Proposing an improvement to the Vineyard approximation for the distinct Van Hove function $G_d(r,t)$ by introducing a time-delayed convolution to account for the persistence of local structure. Instead of convolving $g(r)$ with $G_s(r,t)$ at the same time $t$, Rahman convolves at a delayed time $t' &lt; t$, using a one-parameter function with $\tau = 1.0 \times 10^{-12}$ sec. This makes $G_d(r,t)$ decay as $t^4$ at short times (instead of $t^2$ in the Vineyard approximation) and as $t$ at long times.</li>
</ol>
<h2 id="methodology-simulating-864-argon-atoms">Methodology: Simulating 864 Argon Atoms</h2>
<p>Rahman performed a &ldquo;computer experiment&rdquo; (simulation) of <strong>Liquid Argon</strong>:</p>
<ul>
<li><strong>System</strong>: 864 particles in a cubic box of side $L=10.229\sigma$.</li>
<li><strong>Conditions</strong>: Temperature $94.4^\circ$K, Density $1.374 \text{ g cm}^{-3}$.</li>
<li><strong>Interaction</strong>: Lennard-Jones potential, truncated at $R=2.25\sigma$.</li>
<li><strong>Time Step</strong>: $\Delta t = 10^{-14}$ s (780 steps total, covering approximately $7.8 \times 10^{-12}$ s).</li>
<li><strong>Output Analysis</strong>:
<ul>
<li>Radial distribution function $g(r)$.</li>
<li>Mean square displacement $\langle r^2 \rangle$.</li>
<li>Velocity autocorrelation function $\langle v(0)\cdot v(t) \rangle$.</li>
<li>Van Hove space-time correlation functions $G_s(r,t)$ and $G_d(r,t)$.</li>
</ul>
</li>
</ul>
<h2 id="results-validation-and-non-gaussian-diffusion-analysis">Results: Validation and Non-Gaussian Diffusion Analysis</h2>
<ul>
<li><strong>Validation</strong>: The calculated pair-distribution function $g(r)$ agreed well with X-ray scattering data from Eisenstein and Gingrich (at $91.8^\circ$K). The self-diffusion constant $D = 2.43 \times 10^{-5} \text{ cm}^2 \text{ sec}^{-1}$ at $94.4^\circ$K matched the experimental value from Naghizadeh and Rice at $90^\circ$K and the same density ($1.374 \text{ g cm}^{-3}$).</li>
<li><strong>Dynamics</strong>: The velocity autocorrelation has a negative region, contradicting simple exponential decay models (Langevin). Its frequency spectrum $f(\omega)$ shows a broad maximum at $\omega \approx 0.25 (k_BT/\hbar)$, reminiscent of solid-like behavior.</li>
<li><strong>Non-Gaussian Behavior</strong>: The self-diffusion function $G_s(r,t)$ attains its maximum departure from a Gaussian shape at about $t \approx 3.0 \times 10^{-12}$ s (with $\langle r^4 \rangle$ departing from its Gaussian value by about 13%), returning to Gaussian form by $\sim 10^{-11}$ s. At that time, the rms displacement ($3.8$ Angstrom) is close to the first-neighbor distance ($3.7$ Angstrom). This indicates that Fickian diffusion is an asymptotic limit and does not apply at short times.</li>
<li><strong>Fourier Transform Validation</strong>: The Fourier transform of $g(r)$ has peaks at $\kappa\sigma = 6.8$, 12.5, 18.5, 24.8, closely matching the X-ray scattering peaks at $\kappa\sigma = 6.8$, 12.3, 18.4, 24.4.</li>
<li><strong>Temperature Dependence</strong>: A second simulation at $130^\circ$K and $1.16 \text{ g cm}^{-3}$ yielded $D = 5.67 \times 10^{-5} \text{ cm}^2 \text{ sec}^{-1}$, compared to the experimental value of $6.06 \times 10^{-5} \text{ cm}^2 \text{ sec}^{-1}$ from Naghizadeh and Rice at $120^\circ$K and $1.16 \text{ g cm}^{-3}$. The paper notes that both calculated values are lower than experiment by about 20%, and suggests that allowing for a softer repulsive part in the interaction potential might reduce this discrepancy.</li>
<li><strong>Vineyard Approximation</strong>: The standard Vineyard convolution approximation ($G_d \approx g * G_s$) produces a too-rapid decay of $G_d(r,t)$ with time. The delayed convolution, matching pairs of $(t', t)$ in units of $10^{-12}$ sec as (0.2, 0.4), (0.5, 0.8), (1.0, 1.6), (1.5, 2.3), (2.0, 2.9), (2.5, 3.5), provides a substantially better fit.</li>
<li><strong>Conclusion</strong>: Classical N-body dynamics with a truncated pair potential is a sufficient model to reproduce both the structural and dynamical properties of simple liquids.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The simulation uses physical constants for Argon:</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Particle Mass ($M$)</td>
          <td>$39.95 \times 1.6747 \times 10^{-24}$ g</td>
          <td>Mass of Argon atom</td>
      </tr>
      <tr>
          <td>Potential Depth ($\epsilon/k_B$)</td>
          <td>$120^\circ$K</td>
          <td>Lennard-Jones parameter</td>
      </tr>
      <tr>
          <td>Potential Size ($\sigma$)</td>
          <td>$3.4$ Å</td>
          <td>Lennard-Jones parameter</td>
      </tr>
      <tr>
          <td>Cutoff Radius ($R$)</td>
          <td>$2.25\sigma$</td>
          <td>Potential truncated beyond this</td>
      </tr>
      <tr>
          <td>Density ($\rho$)</td>
          <td>$1.374$ g cm$^{-3}$</td>
          <td></td>
      </tr>
      <tr>
          <td>Particle Count ($N$)</td>
          <td>864</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Rahman utilized a <strong>Predictor-Corrector</strong> scheme for solving the second-order differential equations of motion.</p>
<p><strong>Step Size</strong>: $\Delta t = 10^{-14}$ sec.</p>
<p><strong>The Algorithm:</strong></p>
<ol>
<li><strong>Predict</strong> positions $\bar{\xi}$ at $t + \Delta t$ based on previous steps:
$$\bar{\xi}_i^{(n+1)} = \xi_i^{(n-1)} + 2\Delta u \eta_i^{(n)}$$</li>
<li><strong>Calculate Forces</strong> (Accelerations $\alpha$) using predicted positions.</li>
<li><strong>Correct</strong> positions and velocities using the trapezoidal rule:
$$
\begin{aligned}
\eta_i^{(n+1)} &amp;= \eta_i^{(n)} + \frac{1}{2}\Delta u (\alpha_i^{(n+1)} + \alpha_i^{(n)}) \\
\xi_i^{(n+1)} &amp;= \xi_i^{(n)} + \frac{1}{2}\Delta u (\eta_i^{(n+1)} + \eta_i^{(n)})
\end{aligned}
$$</li>
</ol>
<p><em>Note: The paper compared one vs. two repetitions of the corrector step, finding that two passes improved precision slightly. The results presented in the paper were obtained using two passes.</em></p>
<h3 id="models">Models</h3>
<p><strong>Interaction Potential</strong>: Lennard-Jones 12-6
$$V(r_{ij}) = 4\epsilon \left[ \left(\frac{\sigma}{r_{ij}}\right)^{12} - \left(\frac{\sigma}{r_{ij}}\right)^6 \right]$$</p>
<p><strong>Boundary Conditions</strong>: Periodic Boundary Conditions (PBC) in 3 dimensions. When a particle moves out of the box ($x &gt; L$), it re-enters at $x - L$.</p>
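<p>The truncated pair interaction and the periodic boundary handling together fit in a few lines; a minimal sketch using the minimum-image convention (equivalent, for pair distances, to the re-entry rule above):</p>

```python
def minimum_image(dx, box_length):
    """Wrap a coordinate difference into [-L/2, L/2] under PBC."""
    return dx - box_length * round(dx / box_length)

def lj_pair_energy(ri, rj, sigma, epsilon, box_length, cutoff):
    """Lennard-Jones 12-6 energy for one pair under periodic boundary
    conditions, truncated at the cutoff (2.25*sigma in the paper)."""
    d2 = sum(minimum_image(a - b, box_length) ** 2 for a, b in zip(ri, rj))
    if d2 > cutoff * cutoff:
        return 0.0
    s6 = (sigma * sigma / d2) ** 3  # (sigma/r)^6
    return 4.0 * epsilon * (s6 * s6 - s6)
```

<p><em>The potential minimum sits at $r = 2^{1/6}\sigma$ with depth $-\epsilon$, and a pair separated by nearly the full box length interacts through the periodic image instead.</em></p>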
<h3 id="hardware">Hardware</h3>
<p>This is a historical benchmark for computational capability in 1964:</p>
<table>
  <thead>
      <tr>
          <th>Resource</th>
          <th>Specification</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Computer</strong></td>
          <td>CDC 3600</td>
          <td>Control Data Corporation mainframe</td>
      </tr>
      <tr>
          <td><strong>Compute Time</strong></td>
          <td>45 seconds / cycle</td>
          <td>Per predictor-corrector cycle for 864 particles (floating point)</td>
      </tr>
      <tr>
          <td><strong>Language</strong></td>
          <td>FORTRAN + Machine Language</td>
          <td>Machine language used for the most time-consuming parts</td>
      </tr>
  </tbody>
</table>
<p><em>Modern Context: Rahman&rsquo;s system (864 Argon atoms, LJ-potential) is highly reproducible today and serves as a classic pedagogical exercise. It can be simulated in standard MD frameworks (LAMMPS, OpenMM) in fractions of a second on consumer hardware.</em></p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rahman, A. (1964). Correlations in the Motion of Atoms in Liquid Argon. <em>Physical Review</em>, 136(2A), A405-A411. <a href="https://doi.org/10.1103/PhysRev.136.A405">https://doi.org/10.1103/PhysRev.136.A405</a></p>
<p><strong>Publication</strong>: Physical Review 1964</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rahman1964correlations,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Correlations in the motion of atoms in liquid argon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rahman, A.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Physical Review}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{136}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{A405--A411}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1964}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{APS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1103/PhysRev.136.A405}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Aneesur_Rahman">Aneesur Rahman - Wikipedia</a></li>
</ul>
]]></content:encoded></item><item><title>Importance Weighted Autoencoders (IWAE) for Tighter Bounds</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/importance-weighted-autoencoders/</link><pubDate>Wed, 05 Nov 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/importance-weighted-autoencoders/</guid><description>Summary of Burda, Grosse &amp; Salakhutdinov's ICLR 2016 paper introducing Importance Weighted Autoencoders for tighter variational bounds</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper that introduces the <strong>Importance Weighted Autoencoder (IWAE)</strong>, a generative model that shares the same architecture as the Variational Autoencoder (VAE) but uses a different, tighter objective function. The key innovation is using importance weighting to derive a strictly tighter log-likelihood lower bound than the standard VAE objective.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The standard VAE has several limitations that motivated this work:</p>
<ul>
<li><strong>Strong assumptions</strong>: VAEs typically assume the posterior distribution is simple (e.g., approximately factorial) and that its parameters can be easily approximated from observations.</li>
<li><strong>Simplified representations</strong>: The VAE objective can force models to learn overly simplified representations that underutilize the network&rsquo;s full modeling capacity.</li>
<li><strong>Harsh penalization</strong>: The VAE objective harshly penalizes approximate posterior samples that are poor explanations for the data, which can be overly restrictive.</li>
<li><strong>Inactive units</strong>: VAEs tend to learn latent spaces with effective dimensions far below their capacity, where many latent units are ignored (a phenomenon later termed <strong>posterior collapse</strong>, where the approximate posterior collapses to the prior and conveys no information). The authors wanted to investigate whether a new objective could address this issue.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>IWAE objective function</strong>, denoted as $\mathcal{L}_{k}$.</p>
<ul>
<li>
<p><strong>VAE ($\mathcal{L}_{1}$ Bound)</strong>: The standard VAE maximizes $\mathcal{L}(x)=\mathbb{E}_{q(h|x)}[\log\frac{p(x,h)}{q(h|x)}]$. This is equivalent to the new bound when $k=1$.</p>
</li>
<li>
<p><strong>IWAE ($\mathcal{L}_{k}$ Bound)</strong>: The IWAE maximizes a tighter bound that uses $k$ samples drawn from the recognition model $q(h|x)$:</p>
</li>
</ul>
<p>$$\mathcal{L}_{k}(x)=\mathbb{E}_{h_{1},\ldots,h_{k}\sim q(h|x)}\left[\log\frac{1}{k}\sum_{i=1}^{k}\frac{p(x,h_{i})}{q(h_{i}|x)}\right]$$</p>
<ul>
<li>
<p><strong>Tighter Bound</strong>: The authors prove that this bound is always tighter than or equal to the VAE bound ($\mathcal{L}_{k+1} \geq \mathcal{L}_{k}$) and that as $k$ approaches infinity, $\mathcal{L}_{k}$ approaches the true log-likelihood $\log p(x)$.</p>
</li>
<li>
<p><strong>Increased Flexibility</strong>: Using multiple samples gives the IWAE additional flexibility to learn generative models whose posterior distributions are complex and violate the VAE&rsquo;s simplifying assumptions.</p>
</li>
</ul>
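<p>In practice $\mathcal{L}_k$ is estimated from the $k$ log importance weights, computed with a log-sum-exp for numerical stability. A minimal sketch:</p>

```python
import math

def iwae_bound_estimate(log_weights):
    """Monte Carlo estimate of L_k from k log importance weights
    log w_i = log p(x, h_i) - log q(h_i | x):
        L_k ~= logsumexp(log_w) - log(k),
    using the max-shift trick to avoid overflow/underflow."""
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights)) \
             - math.log(len(log_weights))
```

<p><em>With $k = 1$ this reduces to the single-sample ELBO term; with $k > 1$, Jensen's inequality guarantees the log-of-average is at least the average-of-logs.</em></p>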
<h3 id="key-concept-averaging-inside-vs-outside-the-log">Key Concept: Averaging Inside vs. Outside the Log</h3>
<p>A crucial distinction exists between how the VAE and the IWAE use their $k$ samples: increasing $k$ tightens the bound in the IWAE, whereas in the VAE it only reduces the variance of the gradient estimator.</p>
<figure class="post-figure center ">
    <img src="/img/notes/variational-autoencoder-vae-vs-importance-weighted-autoencoder-iwae.webp"
         alt="Flowchart comparing VAE and IWAE computation: VAE takes the log of each weight then averages (average of logs). IWAE averages the weights first then takes the log (log of average)"
         title="Flowchart comparing VAE and IWAE computation: VAE takes the log of each weight then averages (average of logs). IWAE averages the weights first then takes the log (log of average)"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">VAE vs IWAE computation flow. The key difference is where the log operation occurs: VAE computes log(w_i) for each sample then averages. IWAE averages the weights first then applies log to the result.</figcaption>
    
</figure>

<p><strong>VAE (Average of Logs):</strong></p>
<p>For a VAE, the $k$-sample objective has expectation equal to the ELBO:</p>
<p>$$\mathbb{E}\left[ \frac{1}{k} \sum_{i=1}^k \log w_i \right] = \text{ELBO}$$</p>
<p>where $w_i = p(x, h_i) / q(h_i | x)$. Increasing $k$ here only reduces the variance of the gradient estimator; the model still targets the same ELBO bound, so performance gains saturate quickly.</p>
<p><strong>IWAE (Log of Average):</strong></p>
<p>IWAE performs the averaging <em>inside</em> the logarithm:</p>
<p>$$\mathbb{E}\left[ \log \left( \frac{1}{k} \sum_{i=1}^k w_i \right) \right] = \mathcal{L}_k$$</p>
<p>By Jensen&rsquo;s Inequality ($\log(\mathbb{E}[X]) \geq \mathbb{E}[\log(X)]$, since $\log$ is concave), this bound is mathematically guaranteed to be at least as tight as the VAE bound, and each increase in $k$ yields a lower bound on the log-likelihood that is at least as tight as the previous one.</p>
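<p>The inequality holds even for a fixed batch of weights: for any positive numbers, the log of the mean is at least the mean of the logs. A minimal numerical check with hypothetical importance weights (not from a trained model):</p>

```python
import math

# Hypothetical importance weights w_i = p(x, h_i) / q(h_i | x) from 5 samples.
weights = [0.8, 1.2, 0.05, 2.0, 0.5]

avg_of_logs = sum(math.log(w) for w in weights) / len(weights)  # VAE-style term
log_of_avg = math.log(sum(weights) / len(weights))              # IWAE-style term

# Jensen's inequality: log(mean(w)) >= mean(log(w)).
assert log_of_avg >= avg_of_logs
print(log_of_avg - avg_of_logs)  # strictly positive whenever the weights differ
```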
<p><strong>Why This Matters for Gradients:</strong></p>
<p>In IWAE, the gradient weights are normalized importance weights $\tilde{w}_i = w_i / \sum_j w_j$. This means &ldquo;bad&rdquo; samples (those with low $w_i$) contribute very little to the gradient update since they vanish from the weighted sum. VAE uses unweighted samples, so a single sample with extremely low probability produces a massive negative log value that can dominate the loss and harshly penalize the model. IWAE&rsquo;s formulation allows the model to focus learning on the samples that explain the data well.</p>
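<p>A small illustration of this down-weighting, again with hypothetical weights:</p>

```python
# Normalized importance weights: each sample's share of the IWAE gradient.
weights = [0.9, 1.1, 1e-6]  # the third sample explains the data very poorly
total = sum(weights)
normalized = [w / total for w in weights]

# The bad sample contributes a vanishing fraction of the IWAE gradient...
assert normalized[2] < 1e-6
# ...whereas in the unweighted VAE average its log(1e-6) ~ -13.8 term
# would dominate the k=3 Monte Carlo loss.
```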
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors compared VAE and IWAE on density estimation tasks using the MNIST and Omniglot datasets. They evaluated two main network architectures: one with a single stochastic layer and another with two stochastic layers. The models were trained with varying numbers of importance samples ($k \in \{1, 5, 50\}$) to observe the effect on performance and latent space utilization. The primary metrics for evaluation were the test log-likelihood (estimated using 5000 samples) and the number of &ldquo;active&rdquo; latent units, which quantifies the richness of the learned representations.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>















<figure class="post-figure center ">
    <img src="/img/notes/iwae-vs-vae-active-latent-units-comparison.webp"
         alt="Bar chart comparing active latent units between VAE and IWAE across different k values on MNIST and Omniglot datasets"
         title="Bar chart comparing active latent units between VAE and IWAE across different k values on MNIST and Omniglot datasets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Active latent units for VAE vs IWAE (1 stochastic layer). VAE active units remain flat. IWAE increases with k. Data from Table 1 of Burda et al. (2016).</figcaption>
    
</figure>

<ul>
<li>
<p><strong>Better Performance</strong>: IWAE achieved higher log-likelihoods than VAEs across all configurations. On MNIST with two stochastic layers and $k=50$, IWAE reached $-82.90$ nats compared to $-84.78$ for VAE. On Omniglot, the best IWAE achieved $-103.38$ nats versus $-106.30$ for VAE. IWAE performance improved consistently with increasing $k$, while VAE performance benefited only slightly from using more samples ($k&gt;1$).</p>
</li>
<li>
<p><strong>Richer Representations</strong>: In all experiments with $k&gt;1$, IWAE learned more active latent dimensions than VAE, suggesting richer latent representations.</p>
</li>
<li>
<p><strong>Objective Drives Representation</strong>: The authors found that latent dimension inactivation is driven by the objective function. They demonstrated this through an &ldquo;objective swap&rdquo; experiment:</p>
</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/notes/iwae-objective-swap-experiment.webp"
         alt="Bar charts showing the objective swap experiment results with active units and NLL changes when switching between VAE and IWAE objectives"
         title="Bar charts showing the objective swap experiment results with active units and NLL changes when switching between VAE and IWAE objectives"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Objective swap experiment on MNIST (1 stochastic layer). Switching a trained VAE to the IWAE objective improves both metrics. Switching IWAE to VAE degrades them. Data from Table 2 of Burda et al. (2016).</figcaption>
    
</figure>

<p>This experiment provides evidence that the objective function itself influences latent utilization:</p>
<ul>
<li><strong>VAE → IWAE</strong>: A converged VAE model, when fine-tuned with the IWAE objective ($k=50$), gained 3 active units (19 → 22) and improved test NLL from 86.76 to 84.88.</li>
<li><strong>IWAE → VAE</strong>: A converged IWAE model fine-tuned with the VAE objective lost 2 active units (25 → 23) and worsened test NLL from 84.78 to 86.02.</li>
</ul>
<p>These results strongly suggest that inactivation of latent dimensions is driven primarily by the objective function rather than by initialization or architecture. The authors note that optimization dynamics also play a role, as the swap results do not exactly match training from scratch.</p>
<ul>
<li>
<p><strong>Comparison to Other Models</strong>: On MNIST, the best IWAE ($-82.90$ nats) outperformed deep belief networks ($-84.55$ nats) and deep autoregressive networks ($-84.13$ nats), though DRAW ($-80.97$ nats), which exploits spatial structure, achieved better results. On Omniglot, the best IWAE ($-103.38$ nats) fell slightly behind RBMs trained with persistent contrastive divergence ($-100.46$ nats).</p>
</li>
<li>
<p><strong>Conclusion</strong>: IWAEs learn richer latent representations and achieve better generative performance than VAEs with equivalent architectures and training time.</p>
</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>MNIST</strong>: $28 \times 28$ binarized handwritten digits (60,000 training / 10,000 test).</li>
<li><strong>Omniglot</strong>: $28 \times 28$ binarized handwritten characters from various alphabets (24,345 training / 8,070 test).</li>
<li><strong>Binarization</strong>: Dynamic sampling where binary values are sampled with expectations equal to the real pixel intensities (following Salakhutdinov &amp; Murray, 2008).</li>
<li><strong>Fixed Binarization</strong>: Results on a fixed binarization of MNIST (Larochelle, 2011) confirm that IWAE outperforms VAE across preprocessing methods, though fixed binarization exhibits notably more overfitting than dynamic sampling.</li>
</ul>
<h3 id="models">Models</h3>
<p>Two main network architectures were tested:</p>
<ol>
<li>One stochastic layer (50 units) with two deterministic layers (200 units each).</li>
<li>Two stochastic layers (100 and 50 units), with two deterministic layers of 200 units each between $x$ and $h_1$, and two deterministic layers of 100 units each between $h_1$ and $h_2$.</li>
</ol>
<ul>
<li><strong>Activations</strong>: <code>tanh</code> for deterministic layers; <code>exp</code> applied to variance predictions to ensure positivity.</li>
<li><strong>Distributions</strong>: Gaussian latent layers with diagonal covariance; Bernoulli observation layer.</li>
<li><strong>Initialization</strong>: Glorot &amp; Bengio (2010) heuristic.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam ($\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-4}$).</li>
<li><strong>Batch Size</strong>: 20.</li>
<li><strong>Learning Rate Schedule</strong>: Annealed rate of $0.001 \cdot 10^{-i/7}$ for $3^i$ epochs (where $i = 0, \ldots, 7$), totaling 3,280 passes over the data.</li>
<li><strong>Variance Control</strong>: A common concern with importance sampling is high variance. The authors prove that the Mean Absolute Deviation of their estimator is bounded by $2 + 2\delta$, where $\delta$ is the gap between the bound and true log-likelihood. As the bound tightens, variance remains controlled.</li>
<li><strong>Computational trick</strong>: In the basic IWAE implementation, both forward and backward passes must be done independently for each of the $k$ samples, so the cost scales linearly with $k$. However, the authors describe an optional optimization: stochastically approximate the gradient sum by sampling a single $\epsilon_i$ proportional to its normalized weight $\tilde{w}_i$, then computing only that one backward pass. This reduces the cost to $k$ forward passes and one backward pass. Since the backward pass costs roughly twice the forward pass, this yields approximately a 3x speedup for large $k$ at the cost of increased gradient variance.</li>
</ul>
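<p>The annealing schedule above can be reproduced in a few lines (the per-stage interpretation below is ours, chosen to match the stated 3,280 total passes):</p>

```python
# Annealed learning-rate schedule from Burda et al.: run 3^i epochs at
# rate 0.001 * 10^(-i/7), for i = 0..7.
stages = [(3 ** i, 0.001 * 10 ** (-i / 7)) for i in range(8)]

total_epochs = sum(epochs for epochs, _ in stages)
assert total_epochs == 3280  # matches the "3,280 passes over the data"

# Rates decay from 1e-3 at the first stage down to 1e-4 at the last.
assert abs(stages[0][1] - 1e-3) < 1e-12
assert abs(stages[-1][1] - 1e-4) < 1e-12
```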
<p><strong>Relationship to Reweighted Wake-Sleep (RWS):</strong> Both IWAE and Reweighted Wake-Sleep (Bornschein &amp; Bengio, 2015) use importance-weighted samples and have closely related generative model updates. The key difference is that IWAE derives a single unified lower bound $\mathcal{L}_k$ and uses the reparameterization trick to train the recognition network jointly. RWS instead uses separate wake and sleep phases for the recognition network, which are not derived from $\mathcal{L}_k$.</p>
<h3 id="evaluation">Evaluation</h3>
<ol>
<li><strong>Test Log-Likelihood</strong>: Primary measure of generative performance, estimated as the mean of $\mathcal{L}_{5000}$ (5000 samples) on the test set.</li>
<li><strong>Active Units</strong>: To quantify latent space richness, the authors measured &ldquo;active&rdquo; latent dimensions. A unit $u$ was defined as active if its activity statistic $A_{u}=\text{Cov}_{x}(\mathbb{E}_{u\sim q(u|x)}[u])$ exceeded $10^{-2}$. The $10^{-2}$ threshold is justified by a bimodal distribution of the log activity statistic, showing clear separation between active and inactive units.</li>
</ol>
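<p>The activity statistic is simply the variance, across the dataset, of each unit&rsquo;s posterior mean. A sketch with made-up posterior means (not data from the paper):</p>

```python
# Rows: datapoints; columns: latent units. Entries are posterior means E_q[u|x].
# Unit 0 varies with the input (active); unit 1 is nearly constant (inactive).
posterior_means = [
    [1.0, 0.500],
    [-1.0, 0.501],
    [0.5, 0.499],
    [-0.5, 0.500],
]

def activity(col):
    # A_u = Cov_x(E_q[u|x]): population variance of the posterior mean.
    vals = [row[col] for row in posterior_means]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

active = [u for u in range(2) if activity(u) > 1e-2]
assert active == [0]  # only unit 0 exceeds the 1e-2 threshold
```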
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Hardware</strong>: GPU-based implementation using mini-batch replication to parallelize the $k$ samples. Specific GPU type and training times are not reported.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/yburda/iwae">yburda/iwae</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official Theano implementation for MNIST and Omniglot</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Burda, Y., Grosse, R., &amp; Salakhutdinov, R. (2016). Importance Weighted Autoencoders. <em>International Conference on Learning Representations (ICLR) 2016</em>. <a href="https://arxiv.org/abs/1509.00519">https://arxiv.org/abs/1509.00519</a></p>
<p><strong>Publication</strong>: ICLR 2016</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{burda2016importance,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Importance Weighted Autoencoders}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yuri Burda and Roger Grosse and Ruslan Salakhutdinov}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/1509.00519}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/1509.00519">ArXiv</a></li>
</ul>
]]></content:encoded></item><item><title>Auto-Encoding Variational Bayes: VAE Paper Summary</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/autoencoding-variational-bayes/</link><pubDate>Wed, 05 Nov 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/autoencoding-variational-bayes/</guid><description>Summary of Kingma &amp; Welling's 2013 VAE paper introducing the reparameterization trick and variational autoencoders.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper that introduces a generative mechanism (the VAE) and an optimization technique (the reparameterization trick), with formal theoretical derivation. The method, called the Auto-Encoding VB (AEVB) algorithm, leads to what we now know as the <strong>variational auto-encoder (VAE)</strong> when neural networks are used as the recognition model.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The authors address two central intractabilities in directed probabilistic models with continuous latent variables:</p>















<figure class="post-figure center ">
    <img src="/img/notes/autoencoding-variational-bayes-figure-1-model-diagram.webp"
         alt="VAE graphical model showing latent variable z, observed variable x, and parameters phi and theta"
         title="VAE graphical model showing latent variable z, observed variable x, and parameters phi and theta"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Figure 1 from the paper: The directed graphical model. Solid lines denote the generative model $p_\theta(z)p_\theta(x|z)$, dashed lines denote the variational approximation $q_\phi(z|x)$. The variational parameters $\phi$ are learned jointly with the generative parameters $\theta$.</figcaption>
    
</figure>

<ol>
<li>
<p><strong>Intractable Posteriors</strong>: In models with continuous latent variables (like those with non-linear hidden layers), the true posterior $p_{\theta}(z|x)$ cannot be calculated analytically, preventing the use of standard EM algorithms.</p>
</li>
<li>
<p><strong>Large Datasets</strong>: Sampling-based solutions like Monte Carlo EM (MCEM) require expensive sampling loops per datapoint. This makes them too slow for large datasets where batch optimization is too costly and efficient minibatch updates are required.</p>
</li>
</ol>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<h3 id="the-reparameterization-trick-sgvb-estimator">The Reparameterization Trick (SGVB Estimator)</h3>
<p>The core innovation is the <strong>Stochastic Gradient Variational Bayes (SGVB)</strong> estimator. The authors avoid the high variance of naive Monte Carlo gradient estimators by &ldquo;reparameterizing&rdquo; the random variable $\tilde{z}$.</p>
<p>They express $z$ as a deterministic function of the input $x$ and an auxiliary noise variable $\epsilon$:</p>
<p>$$\tilde{z} = g_{\phi}(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon)$$</p>















<figure class="post-figure center ">
    <img src="/img/notes/variational-autoencoder-reparameterization-trick.webp"
         alt="Comparison of standard stochastic node vs reparameterization trick showing gradient flow"
         title="Comparison of standard stochastic node vs reparameterization trick showing gradient flow"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The reparameterization trick. (A) Standard stochastic nodes block gradient flow during backpropagation. (B) By expressing $z = \mu + \sigma \odot \epsilon$ with external noise $\epsilon \sim \mathcal{N}(0,1)$, gradients can flow through the deterministic path to the parameters $\phi$.</figcaption>
    
</figure>

<ul>
<li><strong>Mechanism</strong>: For a Gaussian posterior, $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.</li>
<li><strong>Impact</strong>: This makes the Monte Carlo estimate differentiable with respect to the variational parameters $\phi$, allowing the variational lower bound to be optimized via standard stochastic gradient ascent (like SGD or Adagrad).</li>
</ul>
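<p>A toy check of why this matters (our own example, not from the paper): to differentiate $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]$ with respect to $\mu$, write $z=\mu+\sigma\epsilon$ and push the gradient through the deterministic path. The analytic answer is $2\mu$:</p>

```python
import random

random.seed(0)
mu, sigma = 0.5, 1.0

# Reparameterize: z = mu + sigma * eps with eps ~ N(0, 1), so
# d/d_mu E[z^2] = E[d/d_mu (mu + sigma*eps)^2] = E[2 * (mu + sigma*eps)].
n = 100_000
grad_est = sum(2 * (mu + sigma * random.gauss(0, 1)) for _ in range(n)) / n

# Monte Carlo estimate matches the analytic gradient 2*mu = 1.0.
assert abs(grad_est - 2 * mu) < 0.05
```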
<h3 id="the-aevb-algorithm-the-vae">The AEVB Algorithm (The VAE)</h3>
<p>The <strong>Auto-Encoding VB (AEVB)</strong> algorithm amortizes inference by learning a global recognition model (encoder) $q_{\phi}(z|x)$ jointly with the generative model (decoder) $p_{\theta}(x|z)$.</p>
<p><strong>Objective Function</strong>: Maximize the variational lower bound $\mathcal{L}(\theta, \phi; x^{(i)})$:</p>
<p>$$\mathcal{L} \simeq -D_{KL}(q_\phi(z|x^{(i)}) \| p_\theta(z)) + \frac{1}{L} \sum_{l=1}^L \log p_\theta(x^{(i)}|z^{(i,l)})$$</p>
<ul>
<li><strong>First Term (Regularizer)</strong>: Forces the approximate posterior to match the prior (integrated analytically for Gaussians).</li>
<li><strong>Second Term (Reconstruction Error)</strong>: The expected negative reconstruction error (estimated via sampling).</li>
</ul>
<p>This mirrors the standard auto-encoder objective, adding a variational regularizer.</p>
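<p>For a Gaussian encoder and standard-normal prior, the regularizer has the closed form from Appendix B, $-D_{KL} = \frac{1}{2}\sum_j \left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$. A sketch (ours) verifying the one-dimensional case against a Monte Carlo estimate:</p>

```python
import math
import random

random.seed(0)
mu, logvar = 0.5, math.log(0.8 ** 2)  # one latent dimension, sigma = 0.8
sigma = math.exp(0.5 * logvar)

# Closed-form KL(q || p) for q = N(mu, sigma^2), p = N(0, 1)  (Appendix B).
kl_closed = -0.5 * (1 + logvar - mu ** 2 - sigma ** 2)

def log_normal(z, m, s):
    return -0.5 * math.log(2 * math.pi * s ** 2) - (z - m) ** 2 / (2 * s ** 2)

# Monte Carlo: E_q[log q(z) - log p(z)] with z = mu + sigma * eps.
n = 200_000
kl_mc = sum(
    log_normal(z, mu, sigma) - log_normal(z, 0.0, 1.0)
    for z in (mu + sigma * random.gauss(0, 1) for _ in range(n))
) / n

assert abs(kl_closed - kl_mc) < 0.01  # estimates agree
```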
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The method was benchmarked against the <strong>Wake-Sleep</strong> algorithm and <strong>Monte Carlo EM (MCEM)</strong> using the <strong>MNIST</strong> (digits) and <strong>Frey Face</strong> (continuous faces) datasets.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li>
<p><strong>Efficiency</strong>: AEVB converged faster and reached a better lower bound than Wake-Sleep (Figure 2). It scaled efficiently to the full MNIST dataset. MCEM&rsquo;s per-datapoint sampling cost made it impractical at full dataset scale, so comparisons were limited to small subsets (Figure 3).</p>
</li>
<li>
<p><strong>Regularization</strong>: The KL-divergence term provided a regularizing effect, preventing overfitting while increasing latent dimensions ($N_z$).</p>
</li>
<li>
<p><strong>Manifold Learning</strong>: The model successfully learned smooth 2D latent manifolds (visualized in Appendix A), grouping similar digits/faces together.</p>
</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Evaluation Data</strong>: For the marginal likelihood comparison (Figure 3), the paper used MNIST with $N_{\text{train}} = 100$ and $N_{\text{train}} = 5000$ to compare data efficiency (marginal log-likelihood vs. training samples seen) across algorithms. A smaller network (100 hidden units, 3 latent variables) was used for this comparison because the marginal likelihood estimator only works reliably in low-dimensional latent spaces.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Algorithm</strong>: Stochastic gradient ascent with <strong>Adagrad</strong> (global stepsizes chosen from $\{0.01, 0.02, 0.1\}$).</li>
<li><strong>Regularization</strong>: The objective included a weight decay term corresponding to a prior $p(\theta)=\mathcal{N}(0,I)$.</li>
<li><strong>Minibatches</strong>: Size $M=100$ with $L=1$ sample per datapoint.</li>
<li><strong>Initialization</strong>: Parameters sampled from $\mathcal{N}(0, 0.01)$.</li>
</ul>
<h3 id="models">Models</h3>
<p>The original VAE used simple Multi-Layered Perceptrons (MLPs):</p>
<ul>
<li><strong>Symmetry</strong>: The encoder and decoder were symmetric, having an equal number of hidden units.</li>
<li><strong>Hidden Units</strong>: 500 units for MNIST, 200 for Frey Face (to prevent overfitting on the smaller dataset).</li>
<li><strong>Activations</strong>: <strong>Tanh</strong> activation functions for the hidden layers.</li>
<li><strong>Latent Space</strong>: Experimented with $N_z$ ranging from 2 to 200.</li>
<li><strong>Outputs</strong>:
<ul>
<li><em>MNIST</em>: <strong>Bernoulli</strong> MLP (sigmoid output).</li>
<li><em>Frey Face</em>: <strong>Gaussian</strong> MLP, with means constrained to $(0,1)$ via sigmoid.</li>
</ul>
</li>
<li><strong>Encoder Architecture</strong>: For the Gaussian encoder, the mean $\mu$ and log-variance $\log(\sigma^2)$ are linear outputs from the shared hidden layer (they share the hidden layer weights and have separate output weights).</li>
<li><strong>Log-Variance</strong>: The encoder predicted $\log(\sigma^2)$ for numerical stability.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The paper distinguishes between two metrics:</p>
<ul>
<li><strong>Variational Lower Bound</strong>: Used as the training objective (what the model optimizes).</li>
<li><strong>Marginal Likelihood</strong>: Used for final evaluation (Figure 3). The true marginal likelihood $p_\theta(x)$ was estimated using an Importance Sampling estimator constructed from samples drawn via Hybrid Monte Carlo (HMC), as detailed in Appendix D. This estimator uses: $p_{\theta}(x^{(i)}) \simeq \left(\frac{1}{L}\sum_{l=1}^{L} \frac{q(z^{(l)})}{p_\theta(z^{(l)})\,p_\theta(x^{(i)}|z^{(l)})}\right)^{-1}$.</li>
</ul>
<p>This distinction is critical: the training metric (lower bound) differs from the evaluation metric (estimated marginal likelihood).</p>
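<p>A toy instance of this estimator (our own conjugate-Gaussian example, not the paper&rsquo;s HMC setup): with prior $p(z)=\mathcal{N}(0,1)$ and likelihood $p(x|z)=\mathcal{N}(z,1)$, the marginal is $p(x)=\mathcal{N}(0,2)$, and if $q$ equals the exact posterior $\mathcal{N}(x/2, 1/2)$ the estimator has zero variance:</p>

```python
import math
import random

random.seed(0)

def normal_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x = 1.0
# Exact posterior for this conjugate model: N(x/2, 1/2).
post_mean, post_var = x / 2, 0.5

L = 100
ratios = []
for _ in range(L):
    z = random.gauss(post_mean, math.sqrt(post_var))
    q = normal_pdf(z, post_mean, post_var)
    ratios.append(q / (normal_pdf(z, 0.0, 1.0) * normal_pdf(x, z, 1.0)))

p_x_est = 1.0 / (sum(ratios) / L)   # inverse of the averaged ratio
p_x_true = normal_pdf(x, 0.0, 2.0)  # analytic marginal N(0, 2)

# With q equal to the true posterior, every ratio is exactly 1/p(x).
assert abs(p_x_est - p_x_true) < 1e-9
```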
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Hardware</strong>: Trained on a standard Intel Xeon CPU (approx. 40 GFLOPS); no GPUs were used.</li>
<li><strong>Training Time</strong>: Approximately 20-40 minutes per million training samples.</li>
</ul>
<h3 id="key-implementation-details-from-appendices">Key Implementation Details from Appendices</h3>
<ul>
<li><strong>Appendix A</strong>: Visualizations of 2D latent manifolds learned for MNIST and Frey Face datasets.</li>
<li><strong>Appendix B</strong>: Closed-form solution for the KL divergence of two Gaussians, essential for implementing the efficient version of the estimator (Equation 10).</li>
<li><strong>Appendix C</strong>: Exact MLP equations, including the use of tanh hidden layers and specific output layers for Bernoulli vs. Gaussian data. Includes specifications for <strong>Bernoulli MLPs</strong> (binary data) and <strong>Gaussian MLPs</strong> (real-valued data).</li>
<li><strong>Appendix D</strong>: Marginal likelihood estimation protocol using Hybrid Monte Carlo (HMC) and importance sampling for evaluation (Figure 3).</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Diederik P. Kingma and Max Welling. &ldquo;Auto-Encoding Variational Bayes.&rdquo; arXiv:1312.6114 [stat.ML], 2013. <a href="https://doi.org/10.48550/arXiv.1312.6114">https://doi.org/10.48550/arXiv.1312.6114</a></p>
<p><strong>Publication</strong>: ICLR 2014 (arXiv preprint December 2013)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{kingma2022autoencodingvariationalbayes,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Auto-Encoding Variational Bayes}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Diederik P Kingma and Max Welling}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{1312.6114}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{stat.ML}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/1312.6114}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Variational_autoencoder">Wikipedia: Variational Autoencoder</a> - General overview</li>
<li><a href="https://openreview.net/forum?id=33X9fd2-9FyZd">OpenReview</a> - Original peer review with author responses</li>
<li><a href="/posts/modern-variational-autoencoder-in-pytorch/">Modern VAE in PyTorch</a> - Implementation tutorial on this site</li>
</ul>
]]></content:encoded></item><item><title>SMILES Notation: The Original Paper by Weininger (1988)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</guid><description>Weininger's 1988 paper introducing SMILES notation, a string-based molecular representation that became a standard in computational chemistry.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. <em>Journal of Chemical Information and Computer Sciences</em>, 28(1), 31-36. <a href="https://doi.org/10.1021/ci00057a005">https://doi.org/10.1021/ci00057a005</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1988</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation overview</a> - Modern usage summary</li>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES to 2D images</a> - Practical visualization tutorial</li>
</ul>
<h2 id="core-contribution-a-string-based-molecular-notation">Core Contribution: A String-Based Molecular Notation</h2>
<p>This is a <strong>Method</strong> paper that introduces a novel notation system for representing chemical structures as text strings. It establishes the encoding rules and input conventions for SMILES (Simplified Molecular Input Line Entry System), while explicitly deferring the canonicalization algorithm to subsequent papers in the series.</p>
<h2 id="the-computational-complexity-of-chemical-information-in-the-1980s">The Computational Complexity of Chemical Information in the 1980s</h2>
<p>As computers became central to chemical information processing in the 1980s, the field faced a fundamental problem: existing line notations were either too complex for chemists to use practically or too limited for computational applications. Previous systems required extensive training to write correctly and were prone to errors.</p>
<p>The goal was ambitious: create a system that could represent any molecule as a simple text string, making it both human-readable and machine-efficient. This would enable compact database storage, fast processing, and easy exchange between software systems.</p>
<h2 id="separating-input-rules-from-canonicalization">Separating Input Rules from Canonicalization</h2>
<p>Weininger&rsquo;s key insight was to separate the problem into two parts: create simple, flexible rules that chemists could easily learn for input, while deferring to the computer the complex task of generating a unique, canonical representation. This division of labor made SMILES both practical and powerful.</p>
<p>The specific innovations include:</p>
<ol>
<li><strong>Simple input rules</strong> - Chemists could write molecules intuitively (e.g., <code>CCO</code> or <code>OCC</code> for ethanol)</li>
<li><strong>Ring closure notation</strong> - Breaking one bond and marking ends with matching digits</li>
<li><strong>Implicit hydrogens</strong> - Automatic calculation based on standard valences keeps strings compact</li>
<li><strong>Algorithmic aromaticity detection</strong> - Automatic recognition of aromatic systems from Kekulé structures</li>
<li><strong>Human-readable output</strong> - Unlike binary formats, SMILES strings are readable and debuggable</li>
</ol>
<p><strong>Important scope note</strong>: This first paper in the series establishes the input syntax and encoding rules. The canonicalization algorithm (how to generate unique SMILES) is explicitly stated as the subject of following papers: &ldquo;specification of isomerisms, substructures, and unique SMILES generation are the subjects of following papers.&rdquo;</p>
<h2 id="demonstrating-notation-rules-across-molecular-classes">Demonstrating Notation Rules Across Molecular Classes</h2>
<p>The paper is primarily a specification document establishing notation rules. The methodology is demonstrated through worked examples showing how to encode various molecular structures:</p>
<ul>
<li><strong>Basic molecules</strong>: Ethane (<code>CC</code>), ethylene (<code>C=C</code>), acetylene (<code>C#C</code>)</li>
<li><strong>Branches</strong>: Isobutyric acid (<code>CC(C)C(=O)O</code>)</li>
<li><strong>Rings</strong>: Cyclohexane (<code>C1CCCCC1</code>), benzene (<code>c1ccccc1</code>)</li>
<li><strong>Aromatic systems</strong>: Tropone (<code>O=c1cccccc1</code>), quinone (showing exocyclic bond effects)</li>
<li><strong>Complex structures</strong>: Morphine (40 characters vs 1000-2000 for connection tables)</li>
<li><strong>Edge cases</strong>: Salts, isotopes, charged species, tautomers</li>
</ul>
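<p>The ring-closure convention can be checked mechanically: each single-digit ring-bond label must appear exactly twice in the string. A simplified validator (ours; it ignores bracket atoms, the <code>%nn</code> two-digit form, and reuse of a digit after it closes):</p>

```python
from collections import Counter

def ring_digits_paired(smiles: str) -> bool:
    """Check that every single-digit ring-closure label occurs exactly twice.

    Simplified sketch: assumes no bracket atoms, no %nn closures, and that
    each digit labels at most one ring bond.
    """
    counts = Counter(ch for ch in smiles if ch.isdigit())
    return all(n == 2 for n in counts.values())

assert ring_digits_paired("C1CCCCC1")     # cyclohexane: digit 1 opens and closes
assert ring_digits_paired("c1ccccc1")     # benzene (aromatic lowercase atoms)
assert ring_digits_paired("CC(C)C(=O)O")  # acyclic strings trivially pass
assert not ring_digits_paired("C1CCCCC")  # unclosed ring bond
```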
<p>Performance comparisons are mentioned qualitatively: SMILES processing was approximately 100 times faster than traditional connection table methods on the hardware of the era (1988), with dramatic reductions in storage space.</p>
<h2 id="performance-and-practical-viability">Performance and Practical Viability</h2>
<p>The paper successfully establishes SMILES as a practical notation system with several key outcomes:</p>
<p><strong>Practical benefits</strong>:</p>
<ul>
<li><strong>Compactness</strong>: 40 characters for morphine vs 1000-2000 for connection tables</li>
<li><strong>Speed</strong>: ~100x faster processing than traditional methods</li>
<li><strong>Accessibility</strong>: Simple enough for chemists to learn without extensive training</li>
<li><strong>Machine-friendly</strong>: Efficient parsing and string-based operations</li>
</ul>
<p><strong>Design principles validated</strong>:</p>
<ul>
<li>Separating user input from canonical representation makes the system both usable and rigorous</li>
<li>Implicit hydrogens reduce string length without loss of information</li>
<li>Ring closure notation with digit markers is more intuitive than complex graph syntax</li>
<li>Automatic aromaticity detection handles most cases correctly</li>
</ul>
<p><strong>Acknowledged limitations</strong>:</p>
<ul>
<li>Canonicalization algorithm not included in this paper</li>
<li>Stereochemistry handling deferred to subsequent papers</li>
<li>Some edge cases (like unusual valence states) require explicit specification</li>
</ul>
<p>The paper concludes by positioning SMILES as a foundation for database storage, substructure searching, and chemical informatics applications - a vision that proved accurate as SMILES became one of the most widely used molecular representations in computational chemistry.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To implement the method described in this paper, the following look-up tables and algorithms are required. <strong>Note</strong>: These details are critical for replication but are often glossed over in high-level summaries.</p>
<h3 id="1-the-valence-look-up-table">1. The Valence Look-Up Table</h3>
<p>To calculate implicit hydrogens, the system assumes the &ldquo;lowest normal valence&rdquo; greater than or equal to the explicit bond count. The paper explicitly defines these valences:</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Allowed Valences</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>B</td>
          <td>3</td>
      </tr>
      <tr>
          <td>C</td>
          <td>4</td>
      </tr>
      <tr>
          <td>N</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>O</td>
          <td>2</td>
      </tr>
      <tr>
          <td>P</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>S (aliphatic)</td>
          <td>2, 4, 6</td>
      </tr>
      <tr>
          <td>S (aromatic)</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>F, Cl, Br, I</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p><strong>Example</strong>: For sulfur in $\text{H}_2\text{SO}_4$ written as <code>OS(=O)(=O)O</code>, the explicit bond count is 6 (two single bonds + two double bonds to four oxygens), so the system uses valence 6 with zero implicit hydrogens. Without knowing S allows valence 6, the algorithm would fail.</p>
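The "lowest normal valence" rule can be sketched directly from the table. This is a toy illustration only: it assumes the explicit bond-order sum has already been parsed from the string, and the table covers just the organic-subset elements listed above (aliphatic sulfur shown; aromatic sulfur would need its own entry).

```python
# Toy implicit-hydrogen calculator using the paper's valence table.
# A sketch only: it takes an element symbol and the explicit bond-order
# sum already parsed from the SMILES string, not a full parser.

VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),          # aliphatic S
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(element: str, explicit_bonds: int) -> int:
    """Return implicit H count: lowest normal valence >= explicit bond count."""
    for valence in VALENCES[element]:
        if valence >= explicit_bonds:
            return valence - explicit_bonds
    return 0  # hypervalent beyond the table: no implicit hydrogens

# C written bare has 0 explicit bonds -> methane, 4 implicit H
print(implicit_hydrogens("C", 0))   # 4
# S in OS(=O)(=O)O has bond-order sum 1+2+2+1 = 6 -> valence 6, 0 implicit H
print(implicit_hydrogens("S", 6))   # 0
# N with one single bond -> lowest valence 3, so 2 implicit H (CN = methylamine)
print(implicit_hydrogens("N", 1))   # 2
```

Note how nitrogen's two allowed valences matter: an N with four explicit bonds skips valence 3 and lands on valence 5, yielding one implicit hydrogen.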
<h3 id="2-explicit-hydrogen-requirements">2. Explicit Hydrogen Requirements</h3>
<p>The paper lists exactly three cases where hydrogen atoms are retained (not suppressed):</p>
<ol>
<li><strong>Hydrogen connected to other hydrogen</strong> (molecular hydrogen, $\text{H}_2$, written as <code>[H][H]</code>)</li>
<li><strong>Hydrogen connected to zero or more than one other atom</strong> (bridging hydrogens, isolated protons)</li>
<li><strong>Isotopic hydrogen specifications</strong> in isomeric SMILES (deuterium <code>[2H]</code>, tritium <code>[3H]</code>)</li>
</ol>
<p>For all other cases, hydrogens are implicit and calculated from the valence table.</p>
<h3 id="3-ring-closure-notation">3. Ring Closure Notation</h3>
<p>Standard SMILES supports single digits <code>1-9</code> for ring closures. For rings numbered 10 and higher, the notation requires a <strong>percent sign prefix</strong>:</p>
<ul>
<li>Ring closures 1-9: <code>C1CCCCC1</code></li>
<li>Ring closures 10+: <code>C%10CCCCC%10</code>, <code>C2%13%24</code> (ring 2, ring 13, ring 24)</li>
</ul>
<p>Without this rule, a parser would fail on large polycyclic structures.</p>
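The digit-and-percent rule is simple enough to sketch as a scanner. This toy pairs each ring-closure label with its matching reopening and ignores every other character; a real parser would of course also track atoms and bond symbols.

```python
# Sketch of a ring-closure token scanner covering both single digits and
# the %NN two-digit form. It pairs openings with closings by label;
# everything else in the string is skipped for this illustration.

def ring_bond_pairs(smiles: str):
    """Return ring-closure labels paired as (open_pos, close_pos)."""
    open_at = {}    # label -> character position of first occurrence
    pairs = []
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        start = i
        if ch == "%":                       # %NN: two-digit ring label
            label = smiles[i + 1:i + 3]
            i += 3
        elif ch.isdigit():                  # single-digit label 1-9
            label = ch
            i += 1
        else:
            i += 1
            continue
        if label in open_at:                # second occurrence closes the ring
            pairs.append((open_at.pop(label), start))
        else:
            open_at[label] = start
    return pairs

print(ring_bond_pairs("C1CCCCC1"))           # [(1, 7)]  one six-membered ring
print(len(ring_bond_pairs("C%10CCCCC%10")))  # 1
```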
<h3 id="4-aromaticity-detection-algorithm">4. Aromaticity Detection Algorithm</h3>
<p>The system uses an extended version of Hückel&rsquo;s Rule ($4N+2$ π-electrons). The &ldquo;excess electron&rdquo; count for the aromatic system is determined by these rules:</p>
<p><strong>Carbon contribution</strong>:</p>
<ul>
<li><strong>C in aromatic ring</strong>: Contributes 1 electron</li>
<li><strong>C double-bonded to exocyclic electronegative atom</strong> (e.g., $\text{C}=\text{O}$ in quinone): Contributes 0 electrons (the carbon &ldquo;loses&rdquo; its electron to the oxygen)</li>
</ul>
<p><strong>Heteroatom contribution</strong>:</p>
<ul>
<li><strong>O, S in ring</strong>: Contributes 2 electrons (lone pair)</li>
<li><strong>N in ring</strong>: Contributes 1 electron (pyridine-like) or 2 electrons (pyrrole-like, must have explicit hydrogen <code>[nH]</code>)</li>
</ul>
<p><strong>Charge effects</strong>:</p>
<ul>
<li><strong>Positive charge</strong>: Reduces electron count by 1</li>
<li><strong>Negative charge</strong>: Increases electron count by 1</li>
</ul>
<p><strong>Critical example - Quinone</strong>:</p>
<pre tabindex="0"><code>O=C1C=CC(=O)C=C1
</code></pre><p>Quinone has 6 carbons in the ring, but the two carbons bonded to exocyclic oxygens contribute 0 electrons each. The four remaining carbons contribute 4 electrons total (not 6), so quinone is <strong>not aromatic</strong> by this algorithm. This exocyclic bond rule is essential for correct aromaticity detection.</p>
<p><strong>Aromatic ring test</strong>:</p>
<ol>
<li>All atoms must be sp² hybridized</li>
<li>Count excess electrons using the rules above</li>
<li>Check whether the count satisfies Hückel&rsquo;s parity constraint:
$$ \text{Excess Electrons} \equiv 2 \pmod 4 \iff \text{Excess Electrons} = 4N + 2 $$
If the count satisfies this for some non-negative integer $N$, the ring is aromatic.</li>
</ol>
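The counting rules above can be condensed into a small table-driven check. This is a toy version: atoms are labeled with made-up "kind" tags (<code>ring</code>, <code>exo_electroneg</code>, <code>pyridine</code>, <code>pyrrole</code>), and the sp² test is assumed to have passed already; it illustrates the arithmetic, not a real perception algorithm.

```python
# Toy excess-electron counter for the extended Hückel test described above.
# Atoms are (symbol, kind) tuples using invented kind tags; this illustrates
# the counting rules only, not a real aromaticity-perception algorithm.

CONTRIB = {
    ("C", "ring"): 1,            # plain ring carbon
    ("C", "exo_electroneg"): 0,  # C=O style exocyclic double bond
    ("O", "ring"): 2,            # lone-pair donors
    ("S", "ring"): 2,
    ("N", "pyridine"): 1,
    ("N", "pyrrole"): 2,         # [nH]-type nitrogen
}

def is_aromatic(atoms, charge=0):
    """Hückel 4N+2 test on the excess electrons (all atoms assumed sp2)."""
    electrons = sum(CONTRIB[a] for a in atoms) - charge  # + charge removes 1
    return electrons >= 2 and electrons % 4 == 2

benzene = [("C", "ring")] * 6
print(is_aromatic(benzene))          # True  (6 electrons)

quinone = [("C", "ring")] * 4 + [("C", "exo_electroneg")] * 2
print(is_aromatic(quinone))          # False (4 electrons)

pyrrole = [("C", "ring")] * 4 + [("N", "pyrrole")]
print(is_aromatic(pyrrole))          # True  (6 electrons)
```

The charge rule also falls out: cyclopentadienyl anion contributes 5 ring-carbon electrons plus 1 for the negative charge, giving 6 and hence aromaticity.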
<h2 id="encoding-rules-reference">Encoding Rules Reference</h2>
<p>The following sections provide a detailed reference for the six fundamental SMILES encoding rules. These are the rules a user would apply when writing SMILES strings.</p>
<h3 id="1-atoms">1. Atoms</h3>
<p>Atoms use their standard chemical symbols. Elements in the &ldquo;organic subset&rdquo; (B, C, N, O, P, S, F, Cl, Br, I) can be written directly when they have their most common valence - so <code>C</code> automatically means a carbon with enough implicit hydrogens to satisfy its valence.</p>
<p>Everything else goes in square brackets: <code>[Au]</code> for gold, <code>[NH4+]</code> for ammonium ion, or <code>[13C]</code> for carbon-13. Aromatic atoms get lowercase letters: <code>c</code> for aromatic carbon in benzene.</p>
<h3 id="2-bonds">2. Bonds</h3>
<p>Bond notation is straightforward:</p>
<ul>
<li><code>-</code> for single bonds (usually omitted)</li>
<li><code>=</code> for double bonds</li>
<li><code>#</code> for triple bonds</li>
<li><code>:</code> for aromatic bonds (also usually omitted)</li>
</ul>
<p>So <code>CC</code> and <code>C-C</code> both represent ethane, while <code>C=C</code> is ethylene.</p>
<h3 id="3-branches">3. Branches</h3>
<p>Branches use parentheses, just like in mathematical expressions. Isobutyric acid becomes <code>CC(C)C(=O)O</code>: the main chain reads <code>C-C-C(=O)O</code>, with a methyl <code>(C)</code> branch on the second carbon.</p>
<h3 id="4-rings">4. Rings</h3>
<p>This is where SMILES gets clever. You break one bond and mark both ends with the same digit. Cyclohexane becomes <code>C1CCCCC1</code> - the <code>1</code> connects the first and last carbon, closing the ring.</p>
<p>You can reuse digits for different rings in the same molecule, making complex structures manageable.</p>
<h3 id="5-disconnected-parts">5. Disconnected Parts</h3>
<p>Salts and other disconnected structures use periods. Sodium phenoxide: <code>[Na+].[O-]c1ccccc1</code>. The order doesn&rsquo;t matter - you&rsquo;re just listing the separate components.</p>
<h3 id="6-aromaticity">6. Aromaticity</h3>
<p>Aromatic rings can be written directly with lowercase letters. Benzoic acid becomes <code>c1ccccc1C(=O)O</code>. The system can also detect aromaticity automatically from Kekulé structures, so <code>C1=CC=CC=C1C(=O)O</code> works just as well.</p>
<h3 id="simplified-subset-for-organic-chemistry">Simplified Subset for Organic Chemistry</h3>
<p>Weininger recognized that most chemists work primarily with organic compounds, so he defined a simplified subset that covers the vast majority of cases. For organic molecules, you only need four rules:</p>
<ol>
<li><strong>Atoms</strong>: Use standard symbols (C, N, O, etc.)</li>
<li><strong>Multiple bonds</strong>: Use <code>=</code> and <code>#</code> for double and triple bonds</li>
<li><strong>Branches</strong>: Use parentheses <code>()</code></li>
<li><strong>Rings</strong>: Use matching digits</li>
</ol>
<p>This &ldquo;basic SMILES&rdquo; covers the vast majority of organic compounds, making the system immediately accessible without having to learn all the edge cases.</p>
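The four rules are small enough that a minimal parser fits in a few lines. The following sketch handles only this subset, single-letter organic atoms, <code>=</code>/<code>#</code> bonds, parenthesized branches, and digit ring closures, and returns the molecular graph as bond triples; brackets, aromatics, and two-letter elements are deliberately out of scope.

```python
# Minimal parser for the four-rule "basic SMILES" subset: single-letter
# atoms, = and # bonds, () branches, digit ring closures. Returns atoms
# plus bonds as (atom_i, atom_j, order) triples. A sketch only.

def parse_basic_smiles(s):
    atoms, bonds = [], []
    stack, prev, order, rings = [], None, 1, {}
    for ch in s:
        if ch == "(":
            stack.append(prev)          # remember branch point
        elif ch == ")":
            prev = stack.pop()          # return to branch point
        elif ch == "=":
            order = 2
        elif ch == "#":
            order = 3
        elif ch.isdigit():
            if ch in rings:             # close ring back to opening atom
                bonds.append((rings.pop(ch), prev, order))
            else:
                rings[ch] = prev
            order = 1
        else:                           # an organic-subset atom
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, order))
            prev, order = idx, 1
    return atoms, bonds

atoms, bonds = parse_basic_smiles("CC(C)C(=O)O")   # isobutyric acid
print(len(atoms), len(bonds))                      # 6 5
```

Run on <code>C1CCCCC1</code> it yields 6 atoms and 6 bonds, the extra bond being the ring closure.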
<h2 id="design-decisions-and-edge-cases">Design Decisions and Edge Cases</h2>
<p>Beyond the basic rules, the paper established several important conventions for handling ambiguous cases:</p>
<h3 id="hydrogen-handling">Hydrogen Handling</h3>
<p>Hydrogens are usually implicit - the system calculates how many each atom needs based on standard valences. So <code>C</code> represents CH₄, <code>N</code> represents NH₃, and so on. This keeps strings compact and readable.</p>
<p>Explicit hydrogens only appear in special cases: when hydrogen connects to multiple atoms, when you need to specify an exact count, or in isotopic specifications like <code>[2H]</code> for deuterium.</p>
<h3 id="bond-representation">Bond Representation</h3>
<p>The paper made an important choice about how to represent bonds in ambiguous cases. For example, nitromethane could be written as charge-separated <code>C[N+](=O)[O-]</code> or with covalent double bonds <code>CN(=O)=O</code>. Weininger chose to prefer the covalent form when possible, because it preserves the correct topological symmetry.</p>
<p>However, when covalent representation would require unusual valences, charge separation is preferred. Diazomethane becomes <code>C=[N+]=[N-]</code> to avoid forcing carbon into an unrealistic valence state.</p>
<h3 id="tautomers">Tautomers</h3>
<p>SMILES doesn&rsquo;t try to be too clever about tautomers - it represents exactly what you specify. So 2-pyridone can be written as either the enol form <code>Oc1ncccc1</code> or the keto form <code>O=c1[nH]cccc1</code>. The system won&rsquo;t automatically convert between them.</p>
<p>This explicit approach means you need to decide which tautomeric form to represent, but it also means the notation precisely captures what you intend.</p>
<h3 id="aromaticity-detection">Aromaticity Detection</h3>
<p>One of the most sophisticated parts of the original system was automatic aromaticity detection. The algorithm uses an extended Hückel rule: a ring is aromatic if all atoms are sp² hybridized and it contains 4N+2 π-electrons.</p>
<p>This means you can input benzene as the Kekulé structure <code>C1=CC=CC=C1</code> and the system will automatically recognize it as aromatic and convert it to <code>c1ccccc1</code>. The algorithm handles complex cases like tropone (<code>O=c1cccccc1</code>) and correctly identifies them as aromatic.</p>
<h3 id="aromatic-nitrogen">Aromatic Nitrogen</h3>
<p>The system makes an important distinction for nitrogen in aromatic rings. Pyridine-type nitrogen (like in pyridine itself) is written as <code>n</code> and has no attached hydrogens. Pyrrole-type nitrogen has an attached hydrogen that must be specified explicitly: <code>[nH]1cccc1</code> for pyrrole.</p>
<p>This distinction captures the fundamental difference in electron contribution between these two nitrogen types in aromatic systems.</p>
<h2 id="impact-and-legacy">Impact and Legacy</h2>
<p>Nearly four decades later, SMILES remains one of the most widely used molecular notations in computational chemistry. The notation became the foundation for:</p>
<ul>
<li><strong>Database storage</strong> - Compact, searchable molecular representations</li>
<li><strong>Substructure searching</strong> - Pattern matching in chemical databases</li>
<li><strong>Property prediction</strong> - Input format for QSAR models</li>
<li><strong>Chemical informatics</strong> - Standard exchange format between software</li>
<li><strong>Modern ML</strong> - Text-based representation for neural networks</li>
</ul>
<p>While newer approaches like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> have addressed some limitations (like the possibility of invalid strings), SMILES&rsquo; combination of simplicity and power has made it enduringly useful.</p>
<p>The paper established both a notation system and a design philosophy: chemical informatics tools should be powerful enough for computers while remaining accessible to working chemists. That balance remains relevant today as we develop new molecular representations for machine learning and AI applications.</p>
]]></content:encoded></item><item><title>SELFIES: The Original Paper on Robust Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</guid><description>The 2020 paper introducing SELFIES, the 100% robust molecular representation that solves SMILES validity problems in ML applications.</description><content:encoded><![CDATA[<h2 id="contribution-a-100-robust-representation-for-ml">Contribution: A 100% Robust Representation for ML</h2>
<p>This is a <strong>Method</strong> paper that introduces a new molecular string representation designed specifically for machine learning applications.</p>
<h2 id="motivation-the-invalidity-bottleneck">Motivation: The Invalidity Bottleneck</h2>
<p>When neural networks generate molecules using <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation</a>, a huge fraction of output strings are invalid: either syntax errors or chemically impossible structures. This was a fundamental bottleneck: if your generative model produces a large fraction of invalid molecules, you are wasting computational effort and severely limiting chemical space exploration.</p>
<h2 id="novelty-a-formal-grammar-approach">Novelty: A Formal Grammar Approach</h2>
<p>The authors&rsquo; key insight was using a <strong>formal grammar approach</strong> (specifically, a Chomsky type-2, context-free grammar with self-referencing functions) where each symbol is interpreted based on chemical context. The &ldquo;state of the derivation&rdquo; tracks available valence bonds, preventing impossible structures like a carbon with five single bonds.</p>
<p>For example, generating 2-Fluoroethenimine (<code>FC=C=N</code>) follows a state derivation where each step restricts the available valency for the next element:</p>
<p>$$
\mathbf{X}_0 \xrightarrow{[F]} \text{F } \mathbf{X}_1 \xrightarrow{[=C]} \text{FC } \mathbf{X}_3 \xrightarrow{[=C]} \text{FC=C } \mathbf{X}_2 \xrightarrow{[\#N]} \text{FC=C=N}
$$</p>
<p>Note that the final <code>[#N]</code> symbol decodes as a double bond here: state $\mathbf{X}_2$ leaves only two bonds available, so the same symbol is interpreted differently depending on the derivation state. This approach guarantees 100% validity: every SELFIES string corresponds to a valid molecule, and every valid molecule can be represented.</p>
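The state-tracking mechanism can be caricatured in a few lines. The sketch below is not the real SELFIES grammar or alphabet: each toy symbol is just an element plus a requested bond order, and the running state caps how many bonds the next atom may actually form, reproducing the derivation above.

```python
# Toy illustration of the derivation-state idea behind SELFIES. Each symbol
# carries an element and a requested bond order; the current state caps how
# many bonds the next atom may form. A simplified sketch of the mechanism,
# not the real SELFIES grammar.

MAX_BONDS = {"F": 1, "O": 2, "N": 3, "C": 4}
BOND_STR = {1: "", 2: "=", 3: "#"}

def decode(symbols):
    out, state = "", None      # state = bonds still available on last atom
    for element, requested in symbols:
        if state is None:                  # first atom: no incoming bond
            out += element
            state = MAX_BONDS[element]
        else:                              # cap the requested bond order
            order = min(requested, state, MAX_BONDS[element])
            out += BOND_STR[order] + element
            state = MAX_BONDS[element] - order
    return out

# [F][=C][=C][#N]: the final triple-bond request is capped to a double bond
# because the preceding carbon has only two bonds left.
print(decode([("F", 1), ("C", 2), ("C", 2), ("N", 3)]))  # FC=C=N
```

Because every symbol is reinterpreted against the current state, no sequence of these toy symbols can ever demand an impossible valence, which is the intuition behind the 100% validity guarantee.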
<h2 id="methodology--experiments-validating-robustness">Methodology &amp; Experiments: Validating Robustness</h2>
<p>The authors ran several experiments to demonstrate SELFIES&rsquo; robustness:</p>
<h3 id="random-mutation-test">Random Mutation Test</h3>
<p>They took the SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations of MDMA and introduced random changes:</p>
<ul>
<li><strong>SMILES</strong>: After just one random mutation, only 9.9% of strings remained valid (dropping to 1.1% after three mutations).</li>
<li><strong>SELFIES</strong>: 100% of mutated strings still represented valid molecules (though different from the original).</li>
</ul>
<p>This empirical difference demonstrates why SELFIES is well suited for evolutionary algorithms and genetic programming approaches to molecular design, where random mutations of strings are a core operation.</p>
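The mutation protocol itself is easy to reproduce in outline. Real validity checking requires a cheminformatics toolkit; the sketch below substitutes a crude syntactic proxy (balanced parentheses and paired ring digits), so its numbers illustrate the experimental procedure only, not the paper's reported rates.

```python
# Toy version of the mutation experiment. A crude syntax check stands in
# for full chemical validation, so results are illustrative of the
# protocol only, not of the paper's numbers.
import random

ALPHABET = "CNOF=#()123"   # made-up mutation alphabet for this sketch

def syntactically_plausible(s):
    depth = 0
    for ch in s:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:                      # ")" before "("
            return False
    digits = [ch for ch in s if ch.isdigit()]
    return depth == 0 and all(digits.count(d) % 2 == 0 for d in set(digits))

def mutate(s, rng):
    """Replace one random character with a random alphabet symbol."""
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]

rng = random.Random(0)
base = "CC(C)C(=O)O"
survivors = sum(
    syntactically_plausible(mutate(base, rng)) for _ in range(1000)
)
print(f"{survivors / 10:.1f}% pass the crude syntax check after one mutation")
```

Even this weak syntactic filter rejects many single-character mutants; a chemical validity check is stricter still, which is the fragility the SELFIES mutation test quantifies.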
<h3 id="generative-model-performance">Generative Model Performance</h3>
<p>The real test came with actual machine learning models. The authors trained Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) on both representations:</p>
<p><strong>VAE Results:</strong></p>
<ul>
<li>SMILES-based VAE: Large invalid regions scattered throughout the latent space</li>
<li>SELFIES-based VAE: Every point in the continuous latent space mapped to a valid molecule</li>
<li>The SELFIES model encoded <strong>over 100 times more diverse molecules</strong></li>
</ul>
<p><strong>GAN Results:</strong></p>
<ul>
<li>Best SMILES GAN: 18.6% diverse, valid molecules</li>
<li>Best SELFIES GAN: 78.9% diverse, valid molecules</li>
</ul>
<p><strong>Evaluation Metrics:</strong></p>
<ul>
<li><strong>Validity</strong>: Percentage of generated strings representing valid molecular structures</li>
<li><strong>Diversity</strong>: Number of unique valid molecules produced</li>
<li><strong>Reconstruction Accuracy</strong>: How well the autoencoder reproduced input molecules</li>
</ul>
<h3 id="scalability-test">Scalability Test</h3>
<p>The authors showed SELFIES works beyond toy molecules by successfully encoding and decoding all <strong>72 million molecules</strong> from the PubChem database (with fewer than 500 SMILES characters per molecule), demonstrating practical applicability to real chemical databases.</p>
<h2 id="results--conclusions-chemical-space-exploration">Results &amp; Conclusions: Chemical Space Exploration</h2>
<p><strong>Key Findings:</strong></p>
<ul>
<li>SELFIES achieves 100% validity guarantee: every string represents a valid molecule</li>
<li>SELFIES-based VAEs encode over 100x more diverse molecules than SMILES-based models</li>
<li>SELFIES-based GANs produce 78.9% diverse valid molecules vs. 18.6% for SMILES GANs</li>
<li>Successfully validated on all 72 million PubChem molecules</li>
</ul>
<p><strong>Limitations Acknowledged:</strong></p>
<ul>
<li>No standardization or canonicalization method at time of publication</li>
<li>The initial grammar covered only small biomolecules; extensions for stereochemistry, ions, polyvalency, and full periodic table coverage were planned</li>
<li>Requires community testing and adoption</li>
</ul>
<p><strong>Impact:</strong></p>
<p>This work demonstrated that designing ML-native molecular representations could enable new approaches in drug discovery and materials science. SELFIES was subsequently evaluated as an alternative input representation to SMILES in <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, a transformer pretrained on molecular strings for property prediction, where it performed comparably to SMILES on the Tox21 benchmark, though the comparison was limited to a single task.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The machine learning experiments used two distinct datasets:</p>
<ul>
<li><strong><a href="/notes/chemistry/datasets/qm9/">QM9</a></strong> (134k molecules): Primary training dataset for VAE and GAN models</li>
<li><strong>PubChem</strong> (72M molecules): Used only to test representation coverage and scalability; not used for model training</li>
</ul>
<h3 id="models">Models</h3>
<p>The VAE implementation included:</p>
<ul>
<li><strong>Latent space</strong>: 241-dimensional with Gaussian distributions</li>
<li><strong>Input encoding</strong>: One-hot encoding of SELFIES/SMILES strings</li>
<li>Full architectural details (encoder/decoder structures, layer types) provided in Supplementary Information</li>
</ul>
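One-hot encoding of the token sequence is the only input transformation named here, and it is simple to sketch. The token alphabet below is invented for illustration; the actual alphabet, maximum length, and padding scheme are in the paper's Supplementary Information.

```python
# Minimal one-hot encoding of a tokenized molecular string, as used for
# the VAE inputs. The alphabet here is made up for illustration.

def one_hot(tokens, alphabet, max_len):
    index = {tok: i for i, tok in enumerate(alphabet)}
    rows = []
    for t in range(max_len):
        row = [0] * len(alphabet)
        if t < len(tokens):
            row[index[tokens[t]]] = 1
        rows.append(row)               # all-zero rows act as padding
    return rows

alphabet = ["[C]", "[=C]", "[F]", "[#N]"]
enc = one_hot(["[F]", "[=C]", "[=C]", "[#N]"], alphabet, max_len=6)
print(len(enc), len(enc[0]))   # 6 4
print(enc[0])                  # [0, 0, 1, 0]
```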
<h3 id="algorithms">Algorithms</h3>
<p>The authors found GAN performance was highly sensitive to hyperparameter selection:</p>
<ul>
<li>Searched <strong>200 different hyperparameter configurations</strong> to achieve the reported 78.9% diversity</li>
<li>Specific optimizers, learning rates, and training duration detailed in Supplementary Information</li>
<li>Full rule generation algorithm provided in Table 2</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>All models evaluated on:</p>
<ul>
<li><strong>Validity rate</strong>: Percentage of syntactically and chemically valid outputs</li>
<li><strong>Diversity</strong>: Count of unique valid molecules generated</li>
<li><strong>Reconstruction accuracy</strong>: Fidelity of autoencoder reconstruction (VAEs only)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training performed on the SciNet supercomputing infrastructure.</li>
<li>The paper does not specify GPU types or training times.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation; has evolved significantly since the original paper</td>
      </tr>
  </tbody>
</table>
<h3 id="replication-resources">Replication Resources</h3>
<p>Complete technical replication is straightforward: the paper was published open access in <em>Machine Learning: Science and Technology</em>, and replication primarily requires:</p>
<ul>
<li>The full rule generation algorithm (Table 2 in paper)</li>
<li>Code: <a href="https://github.com/aspuru-guzik-group/selfies">https://github.com/aspuru-guzik-group/selfies</a></li>
<li>Supplementary Information for complete architectural and hyperparameter specifications</li>
</ul>
<p><strong>Note</strong>: The <a href="/notes/chemistry/molecular-representations/notations/selfies/">modern SELFIES library</a> has evolved significantly since this foundational paper, addressing many of the implementation challenges identified by the authors.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024. <a href="https://doi.org/10.1088/2632-2153/aba947">https://doi.org/10.1088/2632-2153/aba947</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn_2020,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/aba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1088%2F2632-2153%2Faba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{{IOP} Publishing}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{045024}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Mario Krenn and Florian H{\&#34;{a}}se and AkshatKumar Nigam and Pascal Friederich and Alan Aspuru-Guzik}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Self-referencing embedded strings ({SELFIES}): A 100{\%} robust molecular string representation}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">Modern SELFIES Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>αExtractor: Chemical Info from Biomedical Literature</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</guid><description>αExtractor uses ResNet-Transformer to extract chemical structures from literature images, including noisy and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="methodological-contribution-a-robust-optical-recognition-system">Methodological Contribution: A Robust Optical Recognition System</h2>
<p>This is primarily a <strong>Method</strong> ($\Psi_{\text{Method}}$) paper with a significant secondary <strong>Resource</strong> ($\Psi_{\text{Resource}}$) contribution (see the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> for more on these categories).</p>
<p>The dominant methodological contribution is a ResNet-Transformer recognition architecture that outperforms existing OCSR tools across multiple benchmarks through robustness engineering: training on 20 million synthetic images with aggressive augmentation to handle degraded image conditions. The work answers the core methodological question of &ldquo;how well does this work?&rdquo; through extensive benchmarking against existing OCSR tools and ablation studies validating the architectural choices.</p>
<p>The secondary resource contribution comes from releasing αExtractor as a freely available web service, correcting labeling errors in standard benchmarks (CLEF, UOB, JPO), and providing an end-to-end document processing pipeline for biomedical literature mining.</p>
<h2 id="motivation-extracting-visual-chemical-knowledge-from-biomedical-literature">Motivation: Extracting Visual Chemical Knowledge from Biomedical Literature</h2>
<p>The motivation addresses a familiar pain point in chemical informatics within a biomedical context. Vast amounts of chemical knowledge in biomedical literature exist only as images, such as molecular structures embedded in figures, chemical synthesis schemes, and compound diagrams. This visual knowledge remains effectively invisible to computational methods, which creates a massive bottleneck for drug discovery research, systematic reviews, and large-scale chemical database construction.</p>
<p>Existing OCSR tools face two critical problems when applied to biomedical literature:</p>
<ol>
<li>
<p><strong>Real-world image quality</strong>: Biomedical papers often contain low-resolution figures, images with complex backgrounds, noise from scanning/digitization, and inconsistent drawing styles across different journals and decades of publications.</p>
</li>
<li>
<p><strong>End-to-end extraction</strong>: Most OCSR systems assume the presence of clean, cropped molecular images. In practice, you need to first find the molecular structures within multi-panel figures, reaction schemes, and dense document layouts before you can recognize them.</p>
</li>
</ol>
<p>The authors argue that a practical literature mining system needs to solve both problems simultaneously via robust recognition under noisy conditions and automated detection of molecular images within complex documents.</p>
<h2 id="core-innovation-robust-resnet-transformer-architecture">Core Innovation: Robust ResNet-Transformer Architecture</h2>
<p>The core innovation lies in combining a competition-winning recognition architecture with extensive robustness engineering and end-to-end document processing. The key contributions include:</p>
<ol>
<li>
<p><strong>ResNet-Transformer Recognition Model</strong>: The core recognition system uses a <strong>Residual Neural Network (ResNet)</strong> encoder paired with a <strong>Transformer decoder</strong> in an image-captioning framework. This architecture won first place in a Kaggle molecular translation competition, which provided a strong foundation for the recognition task. Let the input image be $I$. The model maximizes the joint likelihood of the SMILES tokens $T$ and coordinate sequences $X, Y$:
$$
\begin{aligned}
\mathcal{L}_{\text{total}} = - \sum_{i=1}^{L} \log P(T_i \mid I, T_{&lt;i}) - \lambda \sum_{i=1}^{L} \big(\log P(X_i \mid I, X_{&lt;i}) + \log P(Y_i \mid I, Y_{&lt;i})\big)
\end{aligned}
$$
Here, the continuous $X$ and $Y$ atom coordinates are quantized into 200 discrete bins, turning coordinate prediction into a standard classification task alongside SMILES generation.</p>
</li>
<li>
<p><strong>Enhanced Molecular Representation</strong>: The model produces an augmented representation that encompasses:</p>
<ul>
<li>Standard molecular connectivity information</li>
<li><strong>Bond type tokens</strong> (solid wedge bonds, dashed bonds, etc.) that preserve 3D stereochemical information</li>
<li><strong>Atom coordinate predictions</strong> that allow reconstruction of the exact molecular pose from the original image</li>
</ul>
<p>This dual prediction of discrete structure and continuous coordinates keeps the output faithful to the source depiction and enables better quality assessment.</p>
</li>
<li>
<p><strong>Massive Synthetic Training Dataset</strong>: The model was trained on approximately <strong>20 million synthetic molecular images</strong> generated from PubChem SMILES with aggressive data augmentation. The augmentation strategy randomized visual styles, image quality, and rendering parameters to create maximum diversity, ensuring the network rarely saw the same molecular depiction twice. This forces the model to learn robust, style-invariant features.</p>
</li>
<li>
<p><strong>End-to-End Document Processing Pipeline</strong>: αExtractor integrates <strong>object detection</strong> and <strong>structure recognition</strong> into a complete document mining system:</p>
<ul>
<li>An object detection model automatically locates molecular images within PDF documents</li>
<li>The recognition model converts detected images to structured representations</li>
<li>A web service interface makes the entire pipeline accessible to researchers without machine learning expertise</li>
</ul>
</li>
<li>
<p><strong>Robustness-First Design</strong>: The system was explicitly designed to handle degraded image conditions that break traditional OCSR tools, including low resolution, background interference, color variations, and scanning artifacts commonly found in legacy biomedical literature.</p>
</li>
</ol>
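The coordinate discretization in point 1 can be sketched directly. The bin count of 200 matches the paper; the normalization to $[0, 1]$ and the helper names are illustrative assumptions.

```python
# Sketch of the coordinate discretization described above: continuous atom
# positions (assumed normalized to [0, 1]) are quantized into 200 bins so
# coordinate prediction becomes ordinary classification. The bin count
# matches the paper; everything else is illustrative.

N_BINS = 200

def to_bin(coord: float) -> int:
    """Map a normalized coordinate in [0, 1] to a class label in [0, 199]."""
    return min(int(coord * N_BINS), N_BINS - 1)

def from_bin(label: int) -> float:
    """Map a class label back to its bin-center coordinate."""
    return (label + 0.5) / N_BINS

x = 0.7312
label = to_bin(x)
print(label, round(from_bin(label), 4))   # 146 0.7325
```

The round trip loses at most half a bin width (0.0025 in normalized units), which is why 200 bins suffice to reconstruct the molecular pose from the image.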
<h2 id="experimental-methodology-stress-testing-under-real-world-conditions">Experimental Methodology: Stress Testing under Real-World Conditions</h2>
<p>The evaluation focused on demonstrating robust performance across diverse image conditions, from pristine benchmarks to challenging real-world scenarios:</p>
<ol>
<li>
<p><strong>Benchmark Dataset Evaluation</strong>: αExtractor was tested on four standard OCSR benchmarks:</p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
</ul>
<p>Performance was measured using exact SMILES match accuracy.</p>
</li>
<li>
<p><strong>Error Analysis and Dataset Correction</strong>: During evaluation, the researchers discovered numerous labeling errors in the original benchmark datasets. They systematically identified and corrected these errors, then re-evaluated all methods on the cleaned datasets to get more accurate performance measurements.</p>
</li>
<li>
<p><strong>Robustness Stress Testing</strong>: The system was evaluated on two challenging datasets specifically designed to test robustness:</p>
<ul>
<li><strong>Color background images</strong> (200 samples): Molecular structures on complex, colorful backgrounds that simulate real figure conditions</li>
<li><strong>Low-quality images</strong> (200 samples): Degraded images with noise, blur, and artifacts typical of scanned documents</li>
</ul>
<p>These tests compared αExtractor against three open-source tools (OSRA, MolVec, and Imago) under realistic degradation conditions.</p>
</li>
<li>
<p><strong>Generalization Testing</strong>: In the most challenging experiment, αExtractor was tested on the <strong>DECIMER hand-drawn molecule images dataset</strong> (Brinkhaus et al., 2022), representing a completely different visual domain not represented in the training data. This tested whether the learned features could generalize beyond digital rendering styles to human-drawn chemistry.</p>
</li>
<li>
<p><strong>End-to-End Document Extraction</strong>: The complete pipeline was evaluated on 50 PDF files containing 2,336 molecular images. This tested both the object detection component (finding molecules in complex documents) and the recognition component (converting them to SMILES) in a realistic literature mining scenario.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured to demonstrate the practical efficiency needed for large-scale document processing.</p>
</li>
</ol>
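The exact-SMILES-match metric used throughout can be sketched as follows. This is an illustrative reconstruction, assuming canonicalization with RDKit; the paper does not publish its evaluation script, and `exact_match`/`accuracy` are hypothetical helper names.

```python
# Hedged sketch of the exact-SMILES-match metric: canonicalize both
# strings with RDKit and compare; an unparseable prediction counts as a miss.
from rdkit import Chem

def exact_match(pred_smiles: str, true_smiles: str) -> bool:
    try:
        pred = Chem.CanonSmiles(pred_smiles)
    except Exception:
        return False  # invalid SMILES cannot match anything
    return pred == Chem.CanonSmiles(true_smiles)

def accuracy(preds, refs):
    return sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(refs)
```

Canonicalization makes the comparison insensitive to atom ordering, so `C(C)O` and `OCC` count as the same molecule.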
<h2 id="results--conclusions-strong-performance-on-degraded-images">Results &amp; Conclusions: Strong Performance on Degraded Images</h2>
<ul>
<li>
<p><strong>Substantial Accuracy Gains</strong>: On the four benchmark datasets, αExtractor achieved accuracies of 91.83% (CLEF), 98.47% (UOB), 88.67% (JPO), and 93.64% (USPTO), compared to previous best results of 84.6%, 90.0%, 72.2%, and 89.9% respectively. After correcting dataset labeling errors, the true accuracies were even higher, reaching <strong>95.77% on CLEF, 99.86% on UOB, and 92.44% on JPO</strong>.</p>
</li>
<li>
<p><strong>Robustness on Degraded Images</strong>: Open-source competitors struggled on degraded images (achieving 5.5% accuracy at best). αExtractor maintained <strong>over 90% accuracy</strong> on both color background and low-quality image datasets, demonstrating the effectiveness of the synthetic training strategy.</p>
</li>
<li>
<p><strong>Generalization to Hand-Drawn Molecules</strong>: On hand-drawn molecules, a domain completely absent from training data, αExtractor achieved <strong>61.4% accuracy</strong> while other tools scored between 0.69% and 2.93%. This suggests the model learned genuine chemical features rather than style-specific patterns.</p>
</li>
<li>
<p><strong>Practical End-to-End Performance</strong>: In the complete document processing evaluation, αExtractor detected <strong>95.1% of molecular images</strong> (2,221 out of 2,336) and correctly recognized <strong>94.5% of detected structures</strong> (2,098 correct predictions). This demonstrates the system&rsquo;s readiness for real-world literature mining applications.</p>
</li>
<li>
<p><strong>Ablation Results</strong>: Ablation experiments confirmed that each architectural component (ResNet backbone, Transformer encoder, Transformer decoder) contributes to performance, with the Transformer decoder having the largest impact. Replacing the Transformer decoder with an LSTM decoder substantially reduced accuracy (Table S6 in the paper).</p>
</li>
<li>
<p><strong>Dataset Quality Issues</strong>: The systematic discovery of labeling errors in standard benchmarks highlights a broader problem in OCSR evaluation. The corrected datasets provide more reliable baselines for future method development.</p>
</li>
<li>
<p><strong>Spatial Layout Limitation</strong>: αExtractor correctly identifies molecular connectivity, but the re-rendered structures may have different spatial layouts than the originals. This could complicate visual verification for complex molecules, even if the chemical information remains accurate.</p>
</li>
<li>
<p><strong>Non-Standard Depiction Handling</strong>: For images with non-standard bond depictions or atomic valences, αExtractor correctly identifies and normalizes them to standard representations. While chemically accurate, this means the re-rendered structure may visually differ from the original image.</p>
</li>
</ul>
<p>Overall, αExtractor combines accurate recognition (over 90% on degraded images), end-to-end document processing, and strong generalization across image conditions. It targets large-scale literature mining tasks where previous tools struggled with degraded inputs. The focus on real-world robustness over benchmark optimization reflects a practical approach to deploying machine learning in scientific workflows.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is <strong>Partially Reproducible</strong>. While the authors detail the model architectures and training techniques, the source code, training dataset (20M synthetic images), and pre-trained weights remain closed-source and proprietary. The authors released a sample of their test data and host an online web server for running inference.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/CLEF_corrected">Corrected CLEF Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the CLEF benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/UOB_corrected">Corrected UOB Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the UOB benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/JPO_corrected">Corrected JPO Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the JPO benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Colored_Background">Color Background Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of molecular structures on complex, colorful backgrounds.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Low_Quality">Low Quality Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of degraded images with noise, blur, and artifacts.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/PDF">PDF Test Set</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Sample PDF files for end-to-end document extraction evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://extractor.alphama.com.cn/csr">αExtractor Web Server</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Online service for running inference using the proprietary system.</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Backbone:</strong> ResNet50 producing output of shape $2048 \times 19 \times 19$, projected to 512 channels via a feed-forward layer</li>
<li><strong>Transformer Architecture:</strong> 3 encoder layers and 3 decoder layers with hidden dimension of 512</li>
<li><strong>Output Format:</strong> Generates SMILES tokens plus two auxiliary coordinate sequences (X-axis and Y-axis) that are length-aligned with the SMILES tokens via padding</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Architecture:</strong> DETR (Detection Transformer) with ResNet101 backbone</li>
<li><strong>Transformer Architecture:</strong> 6 encoder layers and 6 decoder layers with hidden dimension of 256</li>
<li><strong>Purpose:</strong> Locates molecular images within PDF pages before recognition</li>
</ul>
<p><strong>Coordinate Prediction:</strong></p>
<ul>
<li>Continuous X/Y coordinates are discretized into <strong>200 discrete bins</strong></li>
<li>Padding tokens added to coordinate sequences to align perfectly with SMILES token sequence, enabling simultaneous structure and pose prediction</li>
</ul>
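The 200-bin discretization can be sketched in a few lines. The bin count follows the text; the `[0, 1]` normalization convention and the rounding/clamping behavior are assumptions, since the paper does not specify them.

```python
# Sketch of the coordinate discretization: map a normalized coordinate
# to one of 200 integer tokens, and recover the bin center on decode.
NUM_BINS = 200

def discretize(coord: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [0, 1] to an integer token in [0, num_bins)."""
    b = int(coord * num_bins)
    return min(max(b, 0), num_bins - 1)  # clamp so coord == 1.0 stays in range

def undiscretize(token: int, num_bins: int = NUM_BINS) -> float:
    """Recover the bin-center coordinate from a token."""
    return (token + 0.5) / num_bins
```

Decoding to the bin center bounds the reconstruction error at half a bin width (0.25% of the image extent here).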
<h3 id="data">Data</h3>
<p><strong>Training Data:</strong></p>
<ul>
<li><strong>Synthetic Generation:</strong> Python script rendering PubChem SMILES into 2D images</li>
<li><strong>Dataset Size:</strong> Approximately 20.3 million synthetic molecular images from PubChem</li>
<li><strong>Superatom Handling:</strong> 50% of molecules had functional groups replaced with superatoms (e.g., &ldquo;COOH&rdquo;) or generic labels (R1, X1) to match literature drawing conventions</li>
<li><strong>Rendering Augmentation:</strong> Randomized bond thickness, bond spacing, font size, font color, and padding size</li>
</ul>
<p><strong>Geometric Augmentation:</strong></p>
<ul>
<li>Shear along x-axis: $\pm 15^\circ$</li>
<li>Rotation: $\pm 15^\circ$</li>
<li>Piecewise affine scaling</li>
</ul>
<p><strong>Noise Injection:</strong></p>
<ul>
<li>Pepper noise: 0-2%</li>
<li>Salt noise: 0-40%</li>
<li>Gaussian noise: scale 0-0.16</li>
</ul>
<p><strong>Destructive Augmentation:</strong></p>
<ul>
<li>JPEG compression: severity levels 2-5</li>
<li>Random masking</li>
</ul>
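A minimal sketch of the noise-injection step, using the sampling ranges listed above. The exact augmentation code is not published, so the application order and the white-background convention are assumptions.

```python
import numpy as np

# Hedged sketch of the noise augmentations: pepper (0-2%), salt (0-40%),
# and additive Gaussian noise (scale 0-0.16), on a float grayscale image
# in [0, 1] with a white (1.0) background.
def inject_noise(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    out = img.copy()
    pepper = rng.uniform(0.0, 0.02)   # fraction of pixels forced to black
    salt = rng.uniform(0.0, 0.40)     # fraction of pixels forced to white
    out[rng.random(img.shape) < pepper] = 0.0
    out[rng.random(img.shape) < salt] = 1.0
    sigma = rng.uniform(0.0, 0.16)    # Gaussian noise scale
    out = out + rng.normal(0.0, sigma, img.shape)
    return np.clip(out, 0.0, 1.0)
```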
<p><strong>Evaluation Datasets:</strong></p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
<li><strong>Color background images</strong>: 200 samples</li>
<li><strong>Low-quality images</strong>: 200 samples</li>
<li><strong>Hand-drawn structures</strong>: Test set for generalization</li>
<li><strong>End-to-end document extraction</strong>: 50 PDFs (567 pages, 2,336 molecular images)</li>
</ul>
<h3 id="training">Training</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 100</li>
<li><strong>Epochs:</strong> 5</li>
<li><strong>Loss Function:</strong> Cross-entropy loss for both SMILES prediction and coordinate prediction</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 24</li>
<li><strong>Training Strategy:</strong> Pre-trained on synthetic &ldquo;Lower Quality&rdquo; data for 5 epochs, then fine-tuned on annotated real &ldquo;High Quality&rdquo; data for 30 epochs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics:</strong></p>
<ul>
<li><strong>Recognition</strong>: SMILES accuracy (exact match)</li>
<li><strong>End-to-End Pipeline</strong>:
<ul>
<li><strong>Recall</strong>: 95.1% for detection</li>
<li><strong>Accuracy</strong>: 94.5% for recognition</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Inference Hardware:</strong></p>
<ul>
<li>Cloud CPU server (8 CPUs, 64 GB RAM)</li>
<li><strong>Throughput:</strong> Processed 50 PDFs (567 pages) in 40 minutes</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xiong, J., Liu, X., Li, Z., Xiao, H., Wang, G., Niu, Z., Fei, C., Zhong, F., Wang, G., Zhang, W., Fu, Z., Liu, Z., Chen, K., Jiang, H., &amp; Zheng, M. (2023). αExtractor: a system for automatic extraction of chemical information from biomedical literature. <em>Science China Life Sciences</em>, 67(3), 618-621. <a href="https://doi.org/10.1007/s11427-023-2388-x">https://doi.org/10.1007/s11427-023-2388-x</a></p>
<p><strong>Publication</strong>: Science China Life Sciences (2023)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.1007/s11427-023-2388-x">Paper on Springer</a></li>
</ul>
]]></content:encoded></item><item><title>MolRec: Rule-Based OCSR System at TREC 2011 Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</guid><description>Rule-based system for optical chemical structure recognition using vectorization and geometric analysis, achieving 95% accuracy on TREC 2011.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-ocsr-system">Contribution: Rule-Based OCSR System</h2>
<p>This is a <strong>Method</strong> paper that presents and validates MolRec, a rule-based system for Optical Chemical Structure Recognition (OCSR). While the paper emphasizes performance analysis on the TREC 2011 benchmark, the core contribution is the system architecture itself: a multi-stage pipeline using vectorization, geometric rule-based analysis, and graph construction to convert chemical diagram images into machine-readable MOL files.</p>
<h2 id="motivation-robust-conversion-of-chemical-diagrams">Motivation: Robust Conversion of Chemical Diagrams</h2>
<p>Chemical molecular diagrams are ubiquitous in scientific documents across chemistry and life sciences. Converting these static raster images into machine-readable formats (like MOL files) that encode precise spatial and connectivity information is important for cheminformatics applications such as database indexing, similarity searching, and automated literature mining.</p>
<p>While pixel-based pattern matching approaches exist, they struggle with variations in drawing style, image quality, and diagram complexity. An approach that can handle the geometric and topological diversity of real-world chemical diagrams is needed.</p>
<h2 id="novelty-vectorization-and-geometric-rules">Novelty: Vectorization and Geometric Rules</h2>
<p>MolRec uses a <strong>vectorization and geometric rule-based pipeline</strong>. Key technical innovations include:</p>
<p><strong>Disk-Growing Heuristic for Wedge Bonds</strong>: A novel dynamic algorithm to distinguish wedge bonds from bold lines. A disk with radius greater than the average line width is placed inside the connected component and grown to the largest size that still covers only foreground pixels. The disk is then walked in the direction that allows it to continue growing. When it can grow no further, it has reached the base of the triangle (the stereo-center), which identifies the wedge orientation.</p>
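The core test behind this heuristic can be sketched as a maximal-inscribed-disk check. This toy version omits the walking step and the line-width threshold; it is a simplified illustration, not the authors' implementation, and `max_disk_radius` is a hypothetical name.

```python
import numpy as np

# Toy version of the disk test: the largest radius disk centered at
# (r, c) that covers only foreground (ink) pixels. In a wedge bond the
# radius is large near the base and shrinks toward the narrow tip.
def max_disk_radius(mask: np.ndarray, r: int, c: int) -> int:
    """mask: boolean array, True = foreground ink."""
    if not mask[r, c]:
        return 0
    rows, cols = np.indices(mask.shape)
    dist2 = (rows - r) ** 2 + (cols - c) ** 2
    radius = 0
    # Grow while every pixel inside the next-larger disk is still ink.
    while np.all(mask[dist2 <= (radius + 1) ** 2]):
        radius += 1
    return radius
```

Comparing this radius along the component separates the wide base from the narrow tip, which is what distinguishes a wedge from a uniformly bold line.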
<p><strong>Joint Breaking Strategy</strong>: Explicitly breaking all connected joints in the vectorization stage to avoid combinatorial connection complexity. This allows uniform treatment of all line segment connections regardless of junction complexity.</p>
<p><strong>Superatom Dictionary Mining</strong>: The system mines MOL files from the OSRA dataset to build a comprehensive superatom dictionary (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;), supplemented by the Marvin abbreviation collection.</p>
<p><strong>Comprehensive Failure Analysis</strong>: Unlike most OCSR papers that report only aggregate accuracy, this work provides a detailed categorization of all 55 failures, identifying 61 specific error reasons and their root causes.</p>
<h2 id="methodology-and-trec-2011-experiments">Methodology and TREC 2011 Experiments</h2>
<p><strong>Benchmark</strong>: The system was evaluated on the <strong>TREC 2011 Chemical Track</strong> test set consisting of 1,000 molecular diagram images. The authors performed two independent runs with slightly different internal parameter settings to assess reproducibility.</p>
<p><strong>Evaluation Metric</strong>: Correct recall of chemical structures. Output MOL files were compared semantically to ground truth using <strong>OpenBabel</strong>, which ignores syntactically different but chemically equivalent representations.</p>
<p><strong>Failure Analysis</strong>: Across both runs, 55 unique diagrams were misrecognized (50 in run 1, 51 in run 2, with significant overlap). The authors manually examined all 55 and categorized them, identifying 61 specific reasons for mis-recognition. This analysis provides insight into systematic limitations of the rule-based approach.</p>
<h2 id="results-and-top-failure-modes">Results and Top Failure Modes</h2>
<p><strong>High Accuracy</strong>: MolRec achieved a <strong>95% correct recovery rate</strong> on the TREC 2011 benchmark:</p>
<ul>
<li>Run 1: 950/1000 structures correctly recognized (95.0%)</li>
<li>Run 2: 949/1000 structures correctly recognized (94.9%)</li>
</ul>
<p>The near-identical results across runs with slightly different internal parameters show stability of the rule-based approach.</p>
<p><strong>Top Failure Modes</strong> (from detailed analysis of 55 unique misrecognized diagrams, yielding 61 total error reasons):</p>
<ul>
<li><strong>Dashed wedge bond misidentification (15 cases)</strong>: Most common failure. Short dashes at the narrow end were interpreted as a separate dashed bond while longer dashes were treated as a dashed wedge or dashed bold bond, splitting one bond into two with a spurious node.</li>
<li><strong>Incorrect stereochemistry (10 cases)</strong>: Heuristics guessed wrong 3D orientations for ambiguous bold/dashed bonds where syntax alone is insufficient.</li>
<li><strong>Touching components (6 cases)</strong>: Characters touching bonds, letters touching symbols, or ink bleed between close parallel lines caused segmentation failures.</li>
<li><strong>Incorrect character grouping (5 cases)</strong>: Characters too close together for reliable separation.</li>
<li><strong>Solid circles without 3D hydrogen bond (5 cases)</strong>: MolRec correctly interprets solid circles as implying a hydrogen atom via a solid wedge bond, but some solution MOL files in the test set omit this bond, causing a mismatch.</li>
<li><strong>Diagram caption confusion (5 cases)</strong>: Captions appearing within images are mistakenly parsed as part of the molecular structure.</li>
<li><strong>Unrecognised syntax (5 cases)</strong>: User annotations, unusual notations (e.g., wavy line crossing a dashed wedge), and repetition structures.</li>
<li><strong>Broken characters (3 cases)</strong>: Degraded or partial characters without recovery mechanisms.</li>
<li><strong>Connectivity of superatoms (3 cases)</strong>: Ambiguous permutation of connection points for multi-bonded superatoms.</li>
<li><strong>Problematic bridge bonds (3 cases)</strong>: Extreme perspective or angles outside MolRec&rsquo;s thresholds.</li>
<li><strong>Unhandled bond type (1 case)</strong>: A dashed dative bond not previously encountered.</li>
</ul>
<p><strong>System Strengths</strong>:</p>
<ul>
<li>Douglas-Peucker line simplification proves faster and more robust than Hough transforms across different drawing styles</li>
<li>Disk-growing wedge bond detection effectively distinguishes 3D orientations in most cases</li>
<li>Mining MOL files for superatom dictionary captures real-world chemical abbreviation usage patterns</li>
</ul>
<p><strong>Fundamental Limitations Revealed</strong>:</p>
<ul>
<li><strong>Brittleness</strong>: Small variations in drawing style or image quality can cause cascading failures</li>
<li><strong>Stereochemistry ambiguity</strong>: Even humans disagree on ambiguous cases; automated resolution based purely on syntax is inherently limited</li>
<li><strong>Segmentation dependence</strong>: Most failures trace back to incorrect separation of text, bonds, and graphical elements</li>
<li><strong>No error recovery</strong>: Early-stage mistakes propagate through the pipeline with no mechanism for correction</li>
</ul>
<p><strong>Test Set Quality Issues</strong>: The paper also highlights several cases where the TREC 2011 ground truth itself was questionable. Some solution MOL files omitted stereo bond information for solid circle notations, dative (polar) bonds were inconsistently interpreted as either double bonds or single bonds across the training and test sets, and one diagram contained over-connected carbon atoms (5 bonds without the required positive charge indication) that the solution MOL file did not flag.</p>
<p>The systematic error analysis reveals what 95% accuracy means in practice. The failure modes highlight scalability challenges for rule-based systems when applied to diverse real-world documents with noise, artifacts, and non-standard conventions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dictionary Mining</td>
          <td>OSRA Dataset</td>
          <td>Unknown</td>
          <td>Mined to create superatom dictionary for abbreviations like &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;</td>
      </tr>
      <tr>
          <td>Dictionary</td>
          <td>Marvin Collection</td>
          <td>N/A</td>
          <td>Integrated Marvin abbreviation group collection for additional superatoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 Test Set</td>
          <td>1,000 images</td>
          <td>Standard benchmark for Text REtrieval Conference Chemical Track</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The MolRec pipeline consists of sequential image processing and graph construction stages:</p>
<p><strong>1. Preprocessing</strong></p>
<ul>
<li><strong>Binarization</strong>: Input image converted to binary</li>
<li><strong>Connected Component Labeling</strong>: Identifies distinct graphical elements</li>
<li><strong>OCR</strong>: Simple metric space-based engine identifies characters (letters $L$, digits $N$, symbols $S$)</li>
<li><strong>Character Grouping</strong>: Spatial proximity and type-based heuristics group characters:
<ul>
<li>Horizontal: Letter-Letter, Digit-Digit, Letter-Symbol</li>
<li>Vertical: Letter-Letter only</li>
<li>Diagonal: Letter-Digit, Letter-Charge</li>
</ul>
</li>
</ul>
<p><strong>2. Vectorization (Line Finding)</strong></p>
<ul>
<li><strong>Image Thinning</strong>: Reduce lines to unit width</li>
<li><strong>Douglas-Peucker Algorithm</strong>: Simplify polylines into straight line segments</li>
<li><strong>Joint Breaking</strong>: Explicitly split lines at junctions where $&gt;2$ segments meet, avoiding combinatorial connection complexity</li>
</ul>
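The Douglas-Peucker step above is the textbook recursive algorithm; a compact sketch (epsilon tuning and the preceding thinning step not shown):

```python
import math

def _point_line_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / norm

def douglas_peucker(points, epsilon):
    """Simplify a polyline to within `epsilon` of the original points."""
    if len(points) < 3:
        return list(points)
    dists = [_point_line_dist(p, points[0], points[-1]) for p in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1  # farthest point
    if dists[i - 1] > epsilon:
        # Keep the farthest point and recurse on both halves.
        left = douglas_peucker(points[: i + 1], epsilon)
        right = douglas_peucker(points[i:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]
```

A small epsilon preserves genuine kinks (atom positions) while a large one collapses a thinned stroke to a single segment.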
<p><strong>3. Bond Recognition Rules</strong></p>
<p>After erasing text from the image, remaining line segments are analyzed:</p>
<ul>
<li><strong>Double/Triple Bonds</strong>: Cluster segments with same slope within threshold distance</li>
<li><strong>Dashed Bonds</strong>: Identify repeated short segments of similar length with collinear center points</li>
<li><strong>Wedge/Bold Bonds</strong>: Dynamic disk algorithm:
<ul>
<li>Place disk with radius $&gt;$ average line width inside component</li>
<li>Grow disk to maximum size to locate triangle base (stereo-center)</li>
<li>&ldquo;Walk&rdquo; disk to find narrow end, distinguishing wedge orientation</li>
</ul>
</li>
<li><strong>Wavy Bonds</strong>: Identify sawtooth pattern polylines after thinning</li>
<li><strong>Implicit Nodes</strong>: Split longer segments at points where parallel shorter segments terminate (carbon atoms in chains)</li>
</ul>
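The dashed-bond rule (short segments with collinear center points) can be sketched as a midpoint-collinearity test. The pixel tolerance here is illustrative, not the paper's threshold, and `centers_collinear` is a hypothetical name.

```python
import math

# Sketch of the dashed-bond test: True if the midpoints of the candidate
# segments deviate from the first-to-last midpoint line by at most `tol`.
def centers_collinear(segments, tol=1.5):
    """segments: list of ((x1, y1), (x2, y2)) line segments."""
    mids = [((x1 + x2) / 2, (y1 + y2) / 2) for (x1, y1), (x2, y2) in segments]
    (ax, ay), (bx, by) = mids[0], mids[-1]
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return False  # degenerate: all dashes at one spot
    return all(abs(dy * (mx - ax) - dx * (my - ay)) / norm <= tol
               for mx, my in mids)
```

In the full pipeline this check would be combined with the similar-length requirement before merging the dashes into one bond.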
<p><strong>4. Graph Construction</strong></p>
<ul>
<li><strong>Node Formation</strong>: Group line segment endpoints by distance threshold</li>
<li><strong>Disambiguation</strong>: Logic separates lowercase &ldquo;l&rdquo;, uppercase &ldquo;I&rdquo;, digit &ldquo;1&rdquo;, and vertical bonds</li>
<li><strong>Superatom Expansion</strong>: Replace abbreviations with full structures using mined dictionary</li>
<li><strong>Stereochemistry Resolution</strong>: Heuristics based on neighbor counts determine direction for ambiguous bold/dashed bonds (known limitation)</li>
</ul>
<p><strong>5. MOL File Generation</strong></p>
<ul>
<li>Final graph structure converted to standard MOL file format</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Recall</td>
          <td>950/1000</td>
          <td>949/1000</td>
          <td>Slightly different internal parameters between runs</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>95.0%</td>
          <td>94.9%</td>
          <td>Semantic comparison using OpenBabel</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison Method</strong>: OpenBabel converts graphs to MOL files and compares them semantically to ground truth, ignoring syntactic variations that don&rsquo;t affect chemical meaning.</p>
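The paper performs this semantic comparison with OpenBabel; an equivalent check can be sketched with RDKit instead (a substitution for illustration, not the authors' tooling): parse both MOL blocks and compare canonical SMILES, which ignores atom ordering and other syntactic differences.

```python
# Hedged sketch of semantic MOL-file comparison using RDKit in place of
# OpenBabel: two MOL blocks match if their canonical SMILES agree.
from rdkit import Chem

def same_molecule(molblock_a: str, molblock_b: str) -> bool:
    ma = Chem.MolFromMolBlock(molblock_a)
    mb = Chem.MolFromMolBlock(molblock_b)
    if ma is None or mb is None:
        return False  # unparseable output counts as a mismatch
    return Chem.MolToSmiles(ma) == Chem.MolToSmiles(mb)
```

Note this is stricter than a graph-isomorphism check only in edge cases (e.g., differing stereo annotations), which is exactly where the test-set quality issues discussed below arise.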
<p><strong>Failure Categorization</strong>: 55 unique misrecognized diagrams analyzed across both runs, identifying 61 specific error reasons across 11 categories including dashed wedge bond misidentification (15), incorrect stereochemistry (10), touching components (6), incorrect character grouping (5), solid circles (5), diagram caption confusion (5), unrecognised syntax (5), broken characters (3), superatom connectivity (3), problematic bridge bonds (3), and unhandled bond type (1).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://openbabel.org/">Open Babel</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Used for semantic MOL file comparison</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/projects/osra/">OSRA</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Source of superatom dictionary data (MOL files mined)</td>
      </tr>
      <tr>
          <td>TREC 2011 Chemical Track</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>1,000 molecular diagram images (available via NIST)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The MolRec source code is not publicly available. The evaluation dataset (TREC 2011) is accessible through NIST, and the tools used for comparison (OpenBabel) are open source. However, full reproduction of MolRec&rsquo;s pipeline would require reimplementation from the paper&rsquo;s descriptions.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Details</strong>: Not explicitly specified in the paper</li>
<li><strong>Performance Note</strong>: Vectorization approach noted as &ldquo;proven to be fast&rdquo; compared to Hough transform alternatives</li>
</ul>
<h3 id="references">References</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawiPerformanceMolRecTREC2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of {{MolRec}} at {{TREC}} 2011 {{Overview}} and {{Analysis}} of {{Results}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 20th {{Text REtrieval Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sadawi, Noureddin M. and Sexton, Alan P. and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2011). Performance of MolRec at TREC 2011 Overview and Analysis of Results. <em>Proceedings of the 20th Text REtrieval Conference</em>. <a href="https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf">https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf</a></p>
<p><strong>Publication</strong>: TREC 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openbabel.org/">Open Babel</a> - Used for semantic MOL file comparison</li>
<li><a href="https://sourceforge.net/projects/osra/">OSRA Project</a> - Source of superatom dictionary data</li>
</ul>
]]></content:encoded></item><item><title>MolNexTR: A Dual-Stream Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</guid><description>Dual-stream encoder combining ConvNext and ViT for robust optical chemical structure recognition across diverse molecular drawing styles.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Leung, C. T., Huang, Y., Sun, J., Chen, H., &amp; Gao, H. (2024). MolNexTR: a generalized deep learning model for molecular image recognition. <em>Journal of Cheminformatics</em>, 16(141). <a href="https://doi.org/10.1186/s13321-024-00926-w">https://doi.org/10.1186/s13321-024-00926-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/CYF2000127/MolNexTR">GitHub Repository</a></li>
<li><a href="https://huggingface.co/datasets/CYF200127/MolNexTR/tree/main">HuggingFace Dataset/Model</a></li>
</ul>
<h2 id="methodology-overview-and-taxonomic-classification">Methodology Overview and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$). It proposes a neural network architecture (MolNexTR) that integrates ConvNext and Vision Transformers to solve the Optical Chemical Structure Recognition (OCSR) task. The paper validates this method through ablation studies and benchmarking against existing methods including MolScribe and DECIMER.</p>
<h2 id="the-challenge-of-domain-specific-drawing-styles-in-ocsr">The Challenge of Domain-Specific Drawing Styles in OCSR</h2>
<p>Converting molecular images from chemical literature into machine-readable formats (SMILES) is critical but challenging due to the high variance in drawing styles, fonts, and conventions (e.g., Markush structures, abbreviations). Existing methods have limitations:</p>
<ul>
<li>CNN-based and ViT-based models often struggle to generalize across diverse, non-standard drawing styles found in real literature.</li>
<li>Pure ViT methods lack translation invariance and local feature representation, while pure CNNs struggle with global dependencies.</li>
<li>Many models predict SMILES strings directly, making it difficult to enforce chemical validity or resolve complex stereochemistry and abbreviations.</li>
</ul>
<h2 id="core-innovation-dual-stream-encoding-and-image-contamination">Core Innovation: Dual-Stream Encoding and Image Contamination</h2>
<p>MolNexTR introduces three main innovations:</p>
<ol>
<li><strong>Dual-Stream Encoder</strong>: A hybrid architecture processing images simultaneously through a ConvNext stream (for local features) and a Vision Transformer stream (for long-range dependencies), fusing them to capture multi-scale information.</li>
<li><strong>Image Contamination Augmentation</strong>: A specialized data augmentation algorithm that simulates real-world &ldquo;noise&rdquo; found in literature, such as overlapping text, arrows, and partial molecular fragments, to improve robustness.</li>
<li><strong>Graph-Based Decoding with Post-Processing</strong>: Unlike pure image-to-SMILES translation, it predicts atoms and bonds (graph generation) and uses a stereochemical discrimination and abbreviation self-correction module to enforce chemical rules (e.g., chirality) and resolve superatoms (e.g., &ldquo;Ph&rdquo;, &ldquo;Bn&rdquo;).</li>
</ol>
<p>The prediction of atom labels and coordinates is formulated as a conditional autoregressive generation task, optimized via a cross-entropy loss:
$$ \mathcal{L}_{\text{atom}} = -\sum_{t=1}^{T} \log P(x_t \mid \text{Image}, x_{&lt;t}) $$</p>
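<p>Spelled out, the loss is just the summed negative log-probability the decoder assigns to each ground-truth token. A dependency-free sketch (the probability values are placeholders, not MolNexTR outputs):</p>

```python
import math

def atom_sequence_loss(token_log_probs):
    """Negative log-likelihood of the ground-truth atom-token sequence:
    -sum over t of log P(x_t | Image, x_<t)."""
    return -sum(token_log_probs)

# Example: a 3-step decode whose ground-truth tokens received
# probabilities 0.9, 0.5, and 0.8 from the decoder (placeholder values).
probs = [0.9, 0.5, 0.8]
loss = atom_sequence_loss([math.log(p) for p in probs])
```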
<h2 id="experimental-setup-benchmarking-on-synthetic-and-real-data">Experimental Setup: Benchmarking on Synthetic and Real Data</h2>
<p>The model was trained on synthetic data (PubChem) and real patent data (USPTO). It was evaluated on nine benchmarks (three synthetic, six real-world):</p>
<ul>
<li><strong>Synthetic</strong>: Indigo, ChemDraw, RDKit (rendered from 5,719 molecules)</li>
<li><strong>Real-World</strong>: CLEF, UOB, JPO, USPTO, Staker, and a newly curated ACS dataset (diverse styles)</li>
</ul>
<p><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec) and deep learning models (MolScribe, DECIMER, SwinOCSR, Img2Mol).</p>
<p><strong>Ablations</strong>: Tested the impact of the dual-stream encoder vs. single streams, and the contribution of individual augmentation strategies.</p>
<h2 id="empirical-results-and-robustness-findings">Empirical Results and Robustness Findings</h2>
<ul>
<li><strong>Performance</strong>: MolNexTR achieved 81-97% accuracy across test sets, outperforming the second-best method (often MolScribe) by margins ranging from 0.3% to 10.0%, the largest gain coming on the difficult ACS dataset.</li>
<li><strong>Perturbation resilience</strong>: The model maintained higher accuracy under image perturbations (rotation, noise) and &ldquo;curved arrow&rdquo; noise common in reaction mechanisms compared to MolScribe and DECIMER (Table 3).</li>
<li><strong>Ablation Results</strong>: The dual-stream encoder consistently outperformed single CNN or ViT baselines, and the image contamination algorithm significantly boosted performance on noisy real-world data (ACS).</li>
<li><strong>Limitations</strong>: The model still struggles with extremely complex hand-drawn molecules and mechanism diagrams where arrows or text are conflated with structure. The authors also note that R-group information in real literature often appears in separate text or tables, which the model does not incorporate.</li>
</ul>
<p><strong>Key Results (Table 2, SMILES exact match accuracy %)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolScribe</th>
          <th>MolNexTR</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Indigo</td>
          <td>97.5</td>
          <td>97.8</td>
          <td>+0.3</td>
      </tr>
      <tr>
          <td>ChemDraw</td>
          <td>93.8</td>
          <td>95.1</td>
          <td>+1.3</td>
      </tr>
      <tr>
          <td>RDKit</td>
          <td>94.6</td>
          <td>96.4</td>
          <td>+1.8</td>
      </tr>
      <tr>
          <td>CLEF</td>
          <td>88.3</td>
          <td>90.4</td>
          <td>+2.1</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td>87.9</td>
          <td>88.5</td>
          <td>+0.6</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>77.7</td>
          <td>82.1</td>
          <td>+4.4</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>92.6</td>
          <td>93.8</td>
          <td>+1.2</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>86.9</td>
          <td>88.3</td>
          <td>+1.4</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td>71.9</td>
          <td>81.9</td>
          <td>+10.0</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: ~1M molecules randomly selected from PubChem, rendered using RDKit and Indigo with varied styles (thickness, fonts, bond width)</li>
<li><strong>Real</strong>: 0.68M images from USPTO, with coordinates normalized from MOLfiles</li>
</ul>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Render Augmentation</strong>: Randomized drawing styles (line width, font size, label modes)</li>
<li><strong>Image Augmentation</strong>: Rotation, cropping, blurring, noise (Gaussian, salt-and-pepper)</li>
<li><strong>Molecular Augmentation</strong>: Randomly replacing functional groups with abbreviations (from a list of &gt;100) or complex chains (e.g., CH3CH2NH2); adding R-groups</li>
<li><strong>Image Contamination</strong>: Adding &ldquo;noise&rdquo; objects (arrows, lines, text, partial structures) at a minimum distance from the main molecule to simulate literature artifacts</li>
</ul>
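<p>The contamination step can be realized as rejection sampling: draw candidate positions for noise objects and keep only those outside a margin around the molecule's bounding box. A minimal sketch with illustrative function and parameter names (the paper's algorithm also renders the arrows, text, and fragments themselves):</p>

```python
import random

def place_contaminants(mol_bbox, canvas, n_objects, min_dist, rng=None):
    """Sample (x, y) positions for "noise" objects (arrows, text, partial
    structures) at least min_dist pixels from the molecule's bounding box.

    mol_bbox: (x0, y0, x1, y1); canvas: (width, height).
    Illustrative only -- not the authors' implementation.
    """
    rng = rng or random.Random(0)
    x0, y0, x1, y1 = mol_bbox
    w, h = canvas
    placed = []
    while len(placed) < n_objects:
        x, y = rng.uniform(0, w), rng.uniform(0, h)
        # distance from the candidate point to the axis-aligned molecule box
        dx = max(x0 - x, 0, x - x1)
        dy = max(y0 - y, 0, y - y1)
        if (dx * dx + dy * dy) ** 0.5 >= min_dist:
            placed.append((x, y))
    return placed

spots = place_contaminants((100, 100, 250, 200), (384, 384), 5, min_dist=30)
```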
<h3 id="algorithms">Algorithms</h3>
<p><strong>Dual-Stream Encoder</strong>:</p>
<ul>
<li><strong>CNN Stream</strong>: ConvNext backbone (pre-trained on ImageNet), generating feature maps at scales $H/4$ to $H/32$</li>
<li><strong>ViT Stream</strong>: Parallel transformer blocks receiving patches of sizes $p=4, 8, 16, 32$. Uses Multi-Head Self-Attention (MHSA) and Feed-Forward Networks (FFN)</li>
<li><strong>Fusion</strong>: Outputs from both streams are concatenated</li>
</ul>
<p><strong>Decoder (Graph Generation)</strong>:</p>
<ul>
<li><strong>Transformer Decoder</strong>: 6 layers, 8 heads, hidden dim 256</li>
<li><strong>Task 1 (Atoms)</strong>: Autoregressive prediction of atom tokens $(l, x, y)$ (label + coordinates)</li>
<li><strong>Task 2 (Bonds)</strong>: Prediction of bond types between atom pairs (None, Single, Double, Triple, Aromatic, Solid Wedge, Dashed Wedge)</li>
</ul>
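<p>Because bonds are classified per atom pair, the bond head scores every unordered pair of predicted atoms against the seven labels above. A small sketch of the pair enumeration (illustrative, not the authors' code):</p>

```python
BOND_TYPES = ["None", "Single", "Double", "Triple", "Aromatic",
              "Solid Wedge", "Dashed Wedge"]

def bond_pairs(n_atoms):
    """Unordered atom-index pairs the bond head must classify; each pair
    is assigned one of the seven bond-type labels in BOND_TYPES."""
    return [(i, j) for i in range(n_atoms) for j in range(i + 1, n_atoms)]

# a 4-atom molecule yields n*(n-1)/2 = 6 candidate pairs
pairs = bond_pairs(4)
```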
<p><strong>Post-Processing</strong>:</p>
<ul>
<li><strong>Stereochemistry</strong>: Uses predicted coordinates and bond types (wedge/dash) to resolve chirality using RDKit logic</li>
<li><strong>Abbreviation Correction</strong>: Matches superatoms to a dictionary; if unknown, attempts to greedily connect atoms based on valence or finds the nearest match ($\sigma=0.8$ similarity threshold)</li>
</ul>
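<p>The abbreviation-correction fallback can be sketched as a nearest-match dictionary lookup; here Python's <code>difflib</code> string similarity stands in for the paper's $\sigma=0.8$ match criterion, and the dictionary entries are illustrative:</p>

```python
from difflib import SequenceMatcher

# Small illustrative superatom dictionary (the real one has >100 entries).
SUPERATOMS = {"Ph": "c1ccccc1", "Bn": "Cc1ccccc1", "OMe": "OC", "Et": "CC"}

def resolve_superatom(label, dictionary=SUPERATOMS, threshold=0.8):
    """Resolve an abbreviation to a SMILES fragment by exact lookup,
    falling back to the nearest dictionary key above a similarity
    threshold; returns None when nothing matches."""
    if label in dictionary:
        return dictionary[label]
    best_key, best_score = None, 0.0
    for key in dictionary:
        score = SequenceMatcher(None, label, key).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return dictionary[best_key] if best_score >= threshold else None

frag = resolve_superatom("Phh")  # near-miss still maps to the phenyl entry
```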
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder (ConvNext + ViT Encoder -&gt; Transformer Decoder)</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Optimizer: Adam (max lr 3e-4, linear warmup for 5% of steps)</li>
<li>Batch Size: 256</li>
<li>Image Size: $384 \times 384$</li>
<li>Dropout: 0.1</li>
</ul>
</li>
<li><strong>Training</strong>: Fine-tuned CNN backbone for 40 epochs on 10 NVIDIA RTX 3090 GPUs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: SMILES sequence exact matching accuracy (canonicalized)</p>
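<p>The metric itself is a straightforward exact match after canonicalization. A dependency-free sketch (in practice <code>canonicalize</code> would round-trip each string through RDKit; the identity default here is a stand-in):</p>

```python
def exact_match_accuracy(preds, refs, canonicalize=lambda s: s):
    """Fraction of predictions whose canonicalized SMILES exactly matches
    the canonicalized reference."""
    hits = sum(canonicalize(p) == canonicalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)

acc = exact_match_accuracy(["CCO", "CCN"], ["CCO", "CCC"])
```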
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: Indigo (5,719), ChemDraw (5,719), RDKit (5,719)</li>
<li><strong>Real</strong>: CLEF (992), UOB (5,740), JPO (450), USPTO (5,719), Staker (50,000), ACS (331)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: 10 NVIDIA RTX 3090 GPUs</li>
<li><strong>Cluster</strong>: HPC3 Cluster at HKUST (ITSC)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CYF2000127/MolNexTR">MolNexTR GitHub</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation (PyTorch, Jupyter notebooks)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/CYF200127/MolNexTR">MolNexTR HuggingFace</a></td>
          <td>Dataset/Model</td>
          <td>Apache-2.0</td>
          <td>Training data and model checkpoint</td>
      </tr>
  </tbody>
</table>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chenMolNexTRGeneralizedDeep2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{MolNexTR}: a generalized deep learning model for molecular image recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chen, Yufan and Leung, Ching Ting and Huang, Yong and Sun, Jianwei and Chen, Hao and Gao, Hanyu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{141}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00926-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInfty: Chemical Structure Recognition in Patent Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</guid><description>Fujiyoshi et al.'s segment-based approach for recognizing chemical structures in challenging Japanese patent images with touching characters and broken lines.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fujiyoshi, A., Nakagawa, K., &amp; Suzuki, M. (2011). Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty. <em>Pre-Proceedings of the 9th IAPR International Workshop on Graphics Recognition, GREC.</em></p>
<p><strong>Publication</strong>: GREC 2011 (Graphics Recognition Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.sciaccess.net/en/InftyReader/">InftyReader Project</a></li>
</ul>
<h2 id="contribution-segment-based-ocsr-method">Contribution: Segment-Based OCSR Method</h2>
<p>This is a <strong>method paper</strong> that introduces ChemInfty, a rule-based system for Optical Chemical Structure Recognition (OCSR) specifically designed to handle the challenging, low-quality images found in Japanese patent applications.</p>
<h2 id="motivation-the-challenge-of-degraded-patent-images">Motivation: The Challenge of Degraded Patent Images</h2>
<p>The motivation is straightforward: Japanese patent applications contain a massive amount of chemical knowledge, but the images are remarkably poor quality. Unlike the relatively clean molecular diagrams in scientific papers, patent images suffer from multiple problems that break conventional OCSR systems.</p>
<p>The authors quantified these issues in a sample of 200 patent images and found that 22% contained touching characters (where atom labels merge together), 19.5% had characters touching bond lines, and 8.5% had broken lines. These are not edge cases; they are pervasive enough to cripple existing recognition tools. Established systems like CLIDE, ChemReader, and OSRA struggle significantly with line-touching characters and broken lines, leading to recognition failures.</p>
<p>The challenge is compounded by the sheer diversity of creation methods. Some structures are drawn with sophisticated molecular editors, others with basic paint programs, and some are even handwritten. This means there&rsquo;s no standardization in fonts, character sizes, or line thickness. Add in the effects of scanning and faxing, and you have images with significant noise, distortion, and degradation.</p>
<p>The goal of ChemInfty is to build a system robust enough to handle these messy real-world conditions and make Japanese patent chemistry computer-searchable.</p>
<h2 id="core-innovation-segment-decomposition-and-dynamic-programming">Core Innovation: Segment Decomposition and Dynamic Programming</h2>
<p>The novelty lies in a segment-based decomposition approach that separates the recognition problem into manageable pieces before attempting to classify them. The key insight is that traditional OCR fails on these images because characters and lines are physically merged. You cannot recognize a character if you cannot cleanly separate it from the surrounding bonds first.</p>
<p>ChemInfty&rsquo;s approach has several distinctive elements:</p>
<ol>
<li>
<p><strong>Line and Curve Segmentation</strong>: The system first decomposes the image into smaller line and curve segments. The decomposition happens at natural breakpoints&mdash;crossings, sharp bends, and other locations where touching is likely to occur. This creates a set of primitive elements that can be recombined in different ways.</p>
</li>
<li>
<p><strong>Linear Order Assumption for Scalability</strong>: To make the dynamic programming approach computationally tractable and avoid combinatorial explosion, the system assumes that segments to be combined are adjacent when sorted in one of four directional orderings ($\perp, \setminus, \triangle, \rightarrow$). This constraint dramatically reduces the search space while still capturing the natural spatial relationships in chemical diagrams.</p>
</li>
<li>
<p><strong>Dynamic Programming for Segment Combination</strong>: Once the image is decomposed, the system faces a combinatorial problem: which segments should be grouped together to form characters, and which should be classified as bonds? The authors use dynamic programming to efficiently search for the &ldquo;most suitable combination&rdquo; of segments. This optimization finds the configuration that maximizes the likelihood of valid chemical structure elements.</p>
</li>
<li>
<p><strong>Two-Pass OCR Strategy</strong>: ChemInfty integrates with InftyReader, a powerful OCR engine. The system uses OCR twice in the pipeline:</p>
<ul>
<li><strong>First pass</strong>: High-confidence character recognition removes obvious atom labels early, simplifying the remaining image</li>
<li><strong>Second pass</strong>: After the segment-based method identifies and reconstructs difficult character regions, OCR is applied again to the cleaned-up character image</li>
</ul>
<p>This two-stage approach handles both easy and hard cases effectively: simple characters are recognized immediately, while complex cases get special treatment.</p>
</li>
<li>
<p><strong>Image Thinning for Structure Analysis</strong>: Before segmentation, the system thins the remaining graphical elements (after removing high-confidence characters) to skeleton lines. This thinning operation reveals the underlying topological structure&mdash;crossings, bends, and endpoints&mdash;making it easier to detect where segments should be divided.</p>
</li>
<li>
<p><strong>Proximity-Based Grouping</strong>: After identifying potential character segments, the system groups nearby segments together. This spatial clustering ensures that parts of the same character that were separated by bonds get recombined correctly.</p>
</li>
</ol>
<h2 id="methodology-real-world-patent-evaluation">Methodology: Real-World Patent Evaluation</h2>
<p>The evaluation focused on demonstrating that ChemInfty could handle real-world patent images at scale:</p>
<ol>
<li>
<p><strong>Large-Scale Patent Dataset</strong>: The system was tested on chemical structure images from Japanese patent applications published in 2008. This represents a realistic deployment scenario with all the messiness of actual documents.</p>
</li>
<li>
<p><strong>Touching Character Separation</strong>: The authors specifically measured the system&rsquo;s ability to separate characters from bonds when they were touching. Success was defined as cleanly extracting the character region so that OCR could recognize it.</p>
</li>
<li>
<p><strong>Recognition Accuracy by Object Type</strong>: Performance was broken down by element type (characters, line segments, solid wedges, and hashed wedges). This granular analysis revealed which components were easier or harder for the system to handle.</p>
</li>
<li>
<p><strong>End-to-End Performance</strong>: The overall recognition ratio was calculated across all object types to establish the system&rsquo;s practical utility for automated patent processing.</p>
</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Effective Separation for Line-Touching Characters</strong>: The segment-based method successfully separated 63.5% of characters that were touching bond lines. This is a substantial improvement over standard OCR, which typically fails completely on such cases. The authors note that when image quality is reasonable, the separation method works well.</p>
</li>
<li>
<p><strong>Strong Overall Character Recognition</strong>: Character recognition achieved 85.86% accuracy, which is respectable given the poor quality of the input images. Combined with the 90.73% accuracy for line segments, this demonstrates the system can reliably reconstruct the core molecular structure.</p>
</li>
<li>
<p><strong>Weak Performance on Wedges</strong>: The system struggled significantly with stereochemistry notation. Solid wedges were correctly recognized only 52.54% of the time, and hashed wedges fared even worse at 23.63%. This is a critical limitation since stereochemistry is often essential for understanding molecular properties.</p>
</li>
<li>
<p><strong>Image Quality Dependency</strong>: The authors acknowledge that the method&rsquo;s effectiveness is ultimately limited by image quality. When images are severely degraded (blurred to the point where even humans struggle to distinguish characters from noise), the segmentation approach cannot reliably separate touching elements.</p>
</li>
<li>
<p><strong>Overall System Performance</strong>: The combined recognition ratio of 86.58% for all objects indicates that ChemInfty is a working system but not yet production-ready. The authors conclude that further refinement is necessary, particularly for wedge recognition and handling extremely low-quality images.</p>
</li>
</ul>
<p>The work establishes that segment-based decomposition with dynamic programming is a viable approach for handling the specific challenges of patent image OCSR. The two-pass OCR strategy and the use of image thinning to reveal structure are practical engineering solutions that improve robustness. However, the results also highlight that rule-based methods are fundamentally limited by image quality. There is only so much you can do with algorithmic cleverness when the input is severely degraded. This limitation would motivate later work on deep learning approaches that can learn robust feature representations from large datasets.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="technical-paradigm">Technical Paradigm</h3>
<p><strong>This is a pre-deep learning (2011) classical computer vision paper.</strong> The system uses rule-based methods and traditional OCR engines, not neural networks.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>InftyReader</strong>: A mathematical OCR engine used for the initial high-confidence character recognition pass. This is a pre-existing external tool.</li>
<li><strong>DEF-based OCR</strong>: A standard OCR engine based on Directional Element Features (DEF). These are manually engineered statistical features (histograms of edge directions), not learned neural network features.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a multi-step recognition pipeline:</p>
<ol>
<li><strong>Preprocessing</strong>: Binarization and smoothing</li>
<li><strong>Initial Character Removal</strong>: High-confidence characters are recognized by the InftyReader OCR engine and removed from the image to simplify segmentation</li>
<li><strong>Skeletonization</strong>: Thinning using <strong>Hilditch&rsquo;s algorithm</strong> to skeletonize graphical elements, revealing topological structure (crossings, bends, endpoints)</li>
<li><strong>Feature Point Detection</strong>:
<ul>
<li><strong>Crossing points</strong>: Direct detection on skeleton</li>
<li><strong>Bending points</strong>: Detected using the <strong>Hough transformation</strong></li>
</ul>
</li>
<li><strong>Dynamic Programming Search</strong>:
<ul>
<li><strong>Input</strong>: Set of line/curve segments $S$</li>
<li><strong>Procedure</strong>: Sort segments in 4 directions ($\perp, \setminus, \triangle, \rightarrow$). For each direction, use DP to find the grouping that minimizes a heuristic score</li>
<li><strong>Complexity</strong>: $O(n^2)$ where $n$ is the number of segments</li>
<li><strong>Scoring</strong>: Uses a function <code>Measure(S')</code> that returns a score (0-100) indicating if a subset of segments forms a valid character or bond</li>
</ul>
</li>
</ol>
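<p>Under the linear-order assumption, grouping reduces to partitioning a direction-sorted sequence into contiguous blocks, which the textbook $O(n^2)$ interval DP solves. A sketch with a stand-in cost function (the paper never defines <code>Measure(S')</code>, so the scoring here is purely illustrative):</p>

```python
def best_grouping(segments, cost):
    """Partition a direction-sorted list of segments into contiguous
    groups minimizing total cost; cost(group) stands in for the paper's
    unspecified Measure(S'). O(n^2) group evaluations."""
    n = len(segments)
    dp = [0.0] + [float("inf")] * n   # dp[i]: best cost of first i segments
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            c = dp[j] + cost(segments[j:i])
            if c < dp[i]:
                dp[i], back[i] = c, j
    groups, i = [], n
    while i > 0:                      # recover the optimal partition
        groups.append(segments[back[i]:i])
        i = back[i]
    groups.reverse()
    return dp[n], groups

# Toy cost that prefers two-segment groups (e.g. two strokes per character).
total, groups = best_grouping(list("abcd"), lambda g: 1 if len(g) == 2 else 5)
```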
<p>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined in the paper, limiting replicability.</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Dataset</strong>: Chemical structure images from Japanese patent applications published in 2008. The complete 2008 dataset contains 229,969 total images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Japanese Published Patent Applications (2008)</td>
          <td>1,599 images</td>
          <td>Contains 229,969 total images for the year. Format: TIFF, 200-400 dpi.</td>
      </tr>
      <tr>
          <td>Analysis</td>
          <td>Random subset for frequency analysis</td>
          <td>200 images</td>
          <td>Used to estimate frequency of touching/broken characters (found in ~20% of images).</td>
      </tr>
  </tbody>
</table>
<p><strong>No Training Set</strong>: The system is rule-based and uses pre-built OCR engines, so no model training was performed.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: Recognition ratio (percentage of correctly recognized objects)</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Line-touching Separation</td>
          <td>63.5%</td>
          <td>Success rate for separating text glued to lines</td>
      </tr>
      <tr>
          <td>Character Recognition</td>
          <td>85.86%</td>
          <td>For all character sizes</td>
      </tr>
      <tr>
          <td>Line segments</td>
          <td>90.73%</td>
          <td>Standard bond recognition</td>
      </tr>
      <tr>
          <td>Solid Wedge Recognition</td>
          <td>52.54%</td>
          <td>Low performance noted as area for improvement</td>
      </tr>
      <tr>
          <td>Hashed Wedges</td>
          <td>23.63%</td>
          <td>Poorest performing element type</td>
      </tr>
      <tr>
          <td>Overall</td>
          <td>86.58%</td>
          <td>Combined across all object types</td>
      </tr>
  </tbody>
</table>
<p><strong>Total Objects Evaluated</strong>: 742,287 objects (characters, line segments, solid wedges, hashed wedges) extracted from the patent images.</p>
<h3 id="hardware">Hardware</h3>
<p>Not reported. Computational cost was not a primary concern for this classical CV system.</p>
<h3 id="replicability">Replicability</h3>
<p><strong>Low.</strong> The paper does not provide sufficient detail for full replication:</p>
<ul>
<li>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined</li>
<li>Dependency on the proprietary/specialized InftyReader engine</li>
<li>No pseudocode provided for the segment decomposition heuristics</li>
</ul>
<h3 id="notes-on-wedge-recognition">Notes on Wedge Recognition</h3>
<p>The system&rsquo;s poor performance on solid wedges (52.54%) and hashed wedges (23.63%) reflects a fundamental challenge for classical thinning algorithms. Wedge bonds are dense triangular regions that indicate 3D stereochemistry. When skeletonized using algorithms like Hilditch&rsquo;s method, these &ldquo;blob&rdquo; shapes often distort into unrecognizable patterns, unlike the clean thin lines that represent regular bonds.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fujiyoshiRobustMethodSegmentation2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Robust {{Method}} of {{Segmentation}} and {{Recognition}} of {{Chemical Structure Images}} in {{ChemInfty}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fujiyoshi, Akio and Nakagawa, Koji and Suzuki, Masakazu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2011</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Pre-proceedings of the 9th IAPR international workshop on graphics recognition, GREC}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser: End-to-End Molecular Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</guid><description>MolParser converts molecular images from scientific documents to machine-readable formats using end-to-end learning with Extended SMILES representation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Wang, J., Cai, X., Chen, S., Yang, S., Tao, H., Wang, N., Yao, L., Zhang, L., &amp; Ke, G. (2025). MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild. In <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em> (pp. 24528-24538). <a href="https://doi.org/10.48550/arXiv.2411.11098">https://doi.org/10.48550/arXiv.2411.11098</a></p>
<p><strong>Publication</strong>: ICCV 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/">MolParser-7M Dataset</a> - 7M+ image-text pairs for OCSR</li>
<li><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M on HuggingFace</a> - Dataset repository</li>
<li><a href="https://huggingface.co/UniParser/MolDet">MolDet YOLO Detector</a> - Object detection model for extracting molecular images from documents</li>
</ul>
<h2 id="contribution-end-to-end-ocsr-and-real-world-resources">Contribution: End-to-End OCSR and Real-World Resources</h2>
<p>This is primarily a <strong>Method</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a>), with a significant secondary contribution as a <strong>Resource</strong> paper.</p>
<p><strong>Method contribution ($\Psi_{\text{Method}}$)</strong>: The paper proposes a novel end-to-end architecture combining a Swin Transformer encoder with a BART decoder, and crucially introduces <strong>Extended SMILES (E-SMILES)</strong>, a new syntactic extension to standard SMILES notation that enables representation of Markush structures, abstract rings, and variable attachment points found in patents. The work validates this method through extensive ablation studies, achieving the highest accuracy among tested OCSR systems on WildMol-10k (76.9%).</p>
<p><strong>Resource contribution ($\Psi_{\text{Resource}}$)</strong>: The paper introduces <strong>MolParser-7M</strong>, the largest OCSR dataset to date (7.7M image-text pairs), and <strong>WildMol</strong>, a challenging benchmark of 20,000 manually annotated real-world molecular images. The construction of these datasets through an active learning data engine with human-in-the-loop validation represents significant infrastructure that enables future OCSR research.</p>
<h2 id="motivation-extracting-chemistry-from-real-world-documents">Motivation: Extracting Chemistry from Real-World Documents</h2>
<p>The motivation stems from a practical problem in chemical informatics: vast amounts of chemical knowledge remain embedded in unstructured formats. Patents, research papers, and legacy documents depict molecular structures as images. This creates a barrier for large-scale data analysis and prevents Large Language Models from effectively understanding scientific literature in chemistry and drug discovery.</p>
<p>Existing OCSR methods struggle with real-world documents for two fundamental reasons:</p>
<ol>
<li><strong>Representational limitations</strong>: Standard SMILES notation cannot capture complex structural templates like <strong>Markush structures</strong>, which are ubiquitous in patents. These structures define entire families of compounds using variable R-groups and abstract patterns, making them essential for intellectual property but impossible to represent with conventional methods.</li>
<li><strong>Data distribution mismatch</strong>: Real-world molecular images suffer from noise, inconsistent drawing styles, variable resolution, and interference from surrounding text. Models trained exclusively on clean, synthetically rendered molecules fail to generalize when applied to actual documents.</li>
</ol>
<h2 id="novelty-e-smiles-and-human-in-the-loop-data-engine">Novelty: E-SMILES and Human-in-the-Loop Data Engine</h2>
<p>The novelty lies in a comprehensive system that addresses both representation and data quality challenges through four integrated contributions:</p>
<ol>
<li>
<p><strong>Extended SMILES (E-SMILES)</strong>: A backward-compatible extension to the SMILES format that can represent complex structures previously inexpressible in standard chemical notations. E-SMILES uses a separator token <code>&lt;sep&gt;</code> to delineate the core molecular structure from supplementary annotations. These annotations employ XML-like tags to encode Markush structures, polymers, abstract rings, and other complex patterns. Critically, the core structure remains parseable by standard cheminformatics tools like RDKit, while the supplementary tags provide a structured, LLM-friendly format for capturing edge cases.</p>
</li>
<li>
<p><strong>MolParser-7M Dataset</strong>: The largest publicly available OCSR dataset, containing over 7 million image-text pairs. What distinguishes this dataset is both its scale and its composition. It includes 400,000 &ldquo;in-the-wild&rdquo; samples: molecular images extracted from actual patents and scientific papers, then curated by human annotators. This real-world data addresses the distribution mismatch problem directly by exposing the model to the same noise, artifacts, and stylistic variations it encounters in production.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Data Engine</strong>: A systematic approach to collecting and annotating real-world training data. The pipeline begins with an object detection model that extracts molecular images from over a million PDF documents. An active learning algorithm then identifies the most informative samples (those where the current model struggles) for human annotation. The model pre-annotates these images, and human experts review and correct them. This creates an iterative improvement cycle: annotate, train, identify new challenging cases, repeat.</p>
</li>
<li>
<p><strong>Efficient End-to-End Architecture</strong>: The model treats OCSR as an image captioning problem. A Swin-Transformer vision encoder extracts visual features, a simple MLP compresses them, and a BART decoder generates the E-SMILES string autoregressively. The model minimizes the standard negative log-likelihood of the target E-SMILES token sequence $y$ given the sequence history and input image $x$:</p>
</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x; \theta)
\end{aligned}
$$</p>
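<p>Concretely, this objective is just the sum of negative log-probabilities the decoder assigns to each correct token of the target E-SMILES sequence. A minimal Python sketch, using stand-in probabilities rather than real decoder outputs:</p>

```python
import math

def nll_loss(token_probs):
    """Negative log-likelihood of a target token sequence.

    token_probs[t] stands in for P(y_t | y_<t, x): the probability the
    decoder assigns to the *correct* token at step t.
    """
    return -sum(math.log(p) for p in token_probs)

# A confident decoder (probabilities near 1) incurs a small loss;
# an uncertain one incurs a large loss.
confident = nll_loss([0.9, 0.95, 0.99])
uncertain = nll_loss([0.5, 0.4, 0.3])
assert confident < uncertain
```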
<p>The training strategy employs curriculum learning, starting with simple molecules and gradually introducing complexity and heavier data augmentation.</p>
<h2 id="experimental-setup-two-stage-training-and-benchmarking">Experimental Setup: Two-Stage Training and Benchmarking</h2>
<p>The evaluation focused on demonstrating that MolParser generalizes to real-world documents:</p>
<ol>
<li>
<p><strong>Two-Stage Training Protocol</strong>: The model underwent a systematic training process:</p>
<ul>
<li><strong>Pre-training</strong>: Initial training on millions of synthetic molecular images using curriculum learning. The curriculum progresses from simple molecules to complex structures while gradually increasing data augmentation intensity (blur, noise, perspective transforms).</li>
<li><strong>Fine-tuning</strong>: Subsequent training on 400,000 curated real-world samples extracted from patents and papers. This fine-tuning phase is critical for adapting to the noise and stylistic variations of actual documents.</li>
</ul>
</li>
<li>
<p><strong>Benchmark Evaluation</strong>: The model was evaluated on multiple standard OCSR benchmarks to establish baseline performance on clean data. These benchmarks test recognition accuracy on well-formatted molecular diagrams.</p>
</li>
<li>
<p><strong>Real-World Document Analysis</strong>: The critical test involved applying MolParser to molecular structures extracted directly from scientific documents. This evaluation measures the gap between synthetic benchmark performance and real-world applicability (the core problem the paper addresses).</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Experiments isolating the contribution of each component:</p>
<ul>
<li>The impact of real-world training data versus synthetic-only training</li>
<li>The effectiveness of curriculum learning versus standard training</li>
<li>The value of the human-in-the-loop annotation pipeline versus random sampling</li>
<li>The necessity of E-SMILES extensions for capturing complex structures</li>
</ul>
</li>
</ol>
<h2 id="outcomes-and-empirical-findings">Outcomes and Empirical Findings</h2>
<ul>
<li>
<p><strong>Performance on Benchmarks</strong>: MolParser achieves competitive results on standard benchmarks and the best performance on real-world documents. On clean benchmarks like USPTO-10K, MolScribe (96.0%) slightly edges MolParser-Base (94.5%), but on WildMol-10k, MolParser-Base achieved 76.9% accuracy, significantly outperforming MolScribe (66.4%) and MolGrapher (45.5%). This gap validates the core hypothesis that training on actual document images is essential for practical deployment.</p>
</li>
<li>
<p><strong>Real-World Data is Critical</strong>: Models trained exclusively on synthetic data show substantial performance degradation when applied to real documents. The 400,000 in-the-wild training samples bridge this gap, demonstrating that data quality and distribution matching matter as much as model architecture. Ablation experiments showed that pretraining on MolParser-7M synthetic data alone achieved 51.9% accuracy on WildMol, while adding real-world fine-tuning raised this to 76.9%. Using the smaller MolGrapher-300k synthetic dataset without fine-tuning yielded only 22.4%.</p>
</li>
<li>
<p><strong>E-SMILES Enables Broader Coverage</strong>: The extended representation successfully captures molecular structures that were previously inexpressible, particularly Markush structures from patents. This expands the scope of what can be automatically extracted from chemical literature to include patent-style structural templates.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Scales Efficiently</strong>: The active learning pipeline reduces annotation time by approximately 90% while maintaining high quality. This approach makes it feasible to curate large-scale, high-quality datasets for specialized domains where expert knowledge is expensive.</p>
</li>
<li>
<p><strong>Speed and Accuracy</strong>: The end-to-end architecture achieves both high accuracy and fast inference, making it practical for large-scale document processing. MolParser-Base processes roughly 40 images per second on an RTX 4090D, while the Tiny variant achieves 131 FPS. The direct image-to-text approach avoids the error accumulation of multi-stage pipelines.</p>
</li>
<li>
<p><strong>Downstream Applications</strong>: The Swin Transformer encoder, once trained on MolParser-7M, serves as an effective molecular fingerprint for property prediction. Paired with a simple two-layer MLP on MoleculeNet benchmarks, MolParser-pretrained features achieved an average ROC-AUC of 73.7% across five tasks, compared to 68.9% for ImageNet-pretrained Swin-T features. The authors also demonstrate chemical reaction parsing by feeding MolDet detections and MolParser E-SMILES into GPT-4o.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge that molecular chirality is not yet fully exploited by the system. The E-SMILES format does not currently support dashed abstract rings, coordination bonds, special symbol Markush patterns, or replication of long structural segments. Additionally, scaling up the volume of real annotated training data could further improve performance.</p>
</li>
</ul>
<p>The work establishes that practical OCSR requires more than architectural innovations. It demands careful attention to data quality, representation design, and the distribution mismatch between synthetic training data and real-world applications. The combination of E-SMILES, the MolParser-7M dataset, and the human-in-the-loop data engine provides a template for building reliable vision systems in scientific domains where clean training data is scarce but expert knowledge is available.</p>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>7.7M image-SMILES pairs for OCSR pretraining and fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet</a></td>
          <td>Model</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>YOLO11-based molecule detector for PDF documents</td>
      </tr>
  </tbody>
</table>
<p>No official source code repository has been released. Model weights for MolParser itself are not publicly available as of the dataset release.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is split into a massive synthetic pre-training set and a curated fine-tuning set.</p>
<p><strong>Training Data Composition (MolParser-7M)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Name</th>
          <th>Size</th>
          <th>Composition / Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pre-training</strong></td>
          <td>MolParser-7M (Synthetic)</td>
          <td>~7.7M</td>
          <td><strong>Markush-3M</strong> (40%), <strong>ChEMBL-2M</strong> (27%), <strong>Polymer-1M</strong> (14%), PAH-600k (8%), BMS-360k (5%), MolGrapher-300K (4%), Pauling-100k (2%). Generated via RDKit/Indigo with randomized styles.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-SFT-400k</td>
          <td>400k</td>
          <td>Real images from patents/papers selected via active learning (confidence filtering 0.6-0.9) and manually annotated. 66% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-Gen-200k</td>
          <td>200k</td>
          <td>Subset of synthetic data kept to prevent catastrophic forgetting. 32% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Handwrite-5k</td>
          <td>5k</td>
          <td>Handwritten molecules from Img2Mol to support hand-drawn queries. 1% of fine-tuning mix.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Sources</strong>: 1.2M patents and scientific papers (PDF documents)</li>
<li><strong>Extraction</strong>: MolDet (YOLO11-based detector) identified ~20M molecular images, deduplicated to ~4M candidates</li>
<li><strong>Selection</strong>: Active learning ensemble (5-fold models) identified high-uncertainty samples for annotation</li>
<li><strong>Annotation</strong>: Human experts corrected model pre-annotations (90% time savings vs. from-scratch annotation)</li>
</ul>
<p><strong>Test Benchmarks</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-10k</td>
          <td>10,000</td>
          <td>Standard synthetic benchmark</td>
      </tr>
      <tr>
          <td>Maybridge UoB</td>
          <td>-</td>
          <td>Synthetic molecules</td>
      </tr>
      <tr>
          <td>CLEF-2012</td>
          <td>-</td>
          <td>Patent images</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>-</td>
          <td>Japanese patent office</td>
      </tr>
      <tr>
          <td>ColoredBG</td>
          <td>-</td>
          <td>Colored background molecules</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td>10,000</td>
          <td>Ordinary molecules cropped from real PDFs (new)</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k-M</strong></td>
          <td>10,000</td>
          <td>Markush structures (significantly harder, new)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Extended SMILES (E-SMILES) Encoding</strong>:</p>
<ul>
<li><strong>Format</strong>: <code>SMILES&lt;sep&gt;EXTENSION</code> where <code>&lt;sep&gt;</code> separates core structure from supplementary annotations</li>
<li><strong>Extensions use XML-like tags</strong>:
<ul>
<li><code>&lt;a&gt;index:group&lt;/a&gt;</code> for substituents/variable groups (Markush structures)</li>
<li><code>&lt;r&gt;</code> for groups connected at any ring position</li>
<li><code>&lt;c&gt;</code> for abstract rings</li>
<li><code>&lt;dum&gt;</code> for connection points</li>
</ul>
</li>
<li><strong>Backward compatible</strong>: Core SMILES parseable by RDKit; extensions provide structured format for edge cases</li>
</ul>
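<p>To make the format concrete, here is a minimal parsing sketch. The tag grammar (e.g. <code>&lt;a&gt;index:group&lt;/a&gt;</code>) follows the paper's description, but the exact serialization is an assumption; in practice the extracted core would then be handed to RDKit:</p>

```python
import re

def parse_esmiles(esmiles):
    """Split an E-SMILES string into its RDKit-parseable core and its
    supplementary annotations (a sketch; the exact tag grammar is assumed
    from the paper's examples, not from a released spec)."""
    core, _sep, ext = esmiles.partition("<sep>")
    # Markush substituents encoded as <a>atom_index:group</a>
    subs = {int(i): g for i, g in re.findall(r"<a>(\d+):([^<]+)</a>", ext)}
    return core, subs

# Hypothetical E-SMILES: a benzene core with an R1 group at atom index 6.
core, subs = parse_esmiles("c1ccccc1[*]<sep><a>6:R1</a>")
assert core == "c1ccccc1[*]"
assert subs == {6: "R1"}
```

A plain SMILES string with no <code>&lt;sep&gt;</code> passes through unchanged, which is the backward compatibility the authors emphasize.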
<p><strong>Curriculum Learning Strategy</strong>:</p>
<ul>
<li><strong>Phase 1</strong>: No augmentation, simple molecules (&lt;60 tokens)</li>
<li><strong>Phase 2</strong>: Gradually increase augmentation intensity and sequence length</li>
<li>Progressive complexity allows stable training on diverse molecular structures</li>
</ul>
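<p>The two-phase schedule above can be sketched as a function of training progress. The linear ramp shape, phase boundary, and 480-token cap are illustrative assumptions, not values reported in the paper:</p>

```python
def curriculum(step, total_steps, max_len=480):
    """Map training progress to (max token length, augmentation strength).

    Phase 1 trains on short sequences (<60 tokens) with no augmentation;
    in Phase 2 both the length cap and the augmentation intensity ramp up.
    """
    progress = step / total_steps
    if progress < 0.2:                 # Phase 1: easy samples only
        return 60, 0.0
    ramp = (progress - 0.2) / 0.8      # Phase 2: linear ramp to full difficulty
    return int(60 + ramp * (max_len - 60)), min(1.0, ramp)

assert curriculum(0, 100) == (60, 0.0)     # start: short molecules, no augmentation
assert curriculum(100, 100) == (480, 1.0)  # end: full length, full augmentation
```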
<p><strong>Active Learning Data Selection</strong>:</p>
<ol>
<li>Train 5 model folds on current dataset</li>
<li>Compute pairwise Tanimoto similarity of predictions on candidate images</li>
<li>Select samples with confidence scores <strong>0.6-0.9</strong> for human review (highest learning value)</li>
<li>Human experts correct model pre-annotations</li>
<li>Iteratively expand training set with hard samples</li>
</ol>
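<p>The selection step can be sketched as follows, scoring each candidate by the mean pairwise Tanimoto similarity of its ensemble predictions. The paper does not spell out the exact confidence definition, so representing predictions as fingerprint bit sets and averaging pairwise agreement are assumptions:</p>

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def select_for_annotation(ensemble_fps, low=0.6, high=0.9):
    """Keep candidates whose ensemble agreement falls in the 0.6-0.9 band:
    confident enough that pre-annotations are useful, uncertain enough to
    carry learning value for the next training round."""
    selected = []
    for img_id, fps in ensemble_fps.items():
        sims = [tanimoto(a, b) for a, b in combinations(fps, 2)]
        confidence = sum(sims) / len(sims)
        if low <= confidence <= high:
            selected.append(img_id)
    return selected

# Toy 3-fold ensemble: unanimous predictions are skipped, partial agreement kept.
selected = select_for_annotation({
    "img_easy": [{1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3, 4}],  # agreement 1.0
    "img_hard": [{1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3, 5}],  # agreement ~0.73
})
assert selected == ["img_hard"]
```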
<p><strong>Data Augmentations</strong>:</p>
<ul>
<li>RandomAffine (rotation, scale, translation)</li>
<li>JPEGCompress (compression artifacts)</li>
<li>InverseColor (color inversion)</li>
<li>SurroundingCharacters (text interference)</li>
<li>RandomCircle (circular artifacts)</li>
<li>ColorJitter (brightness, contrast variations)</li>
<li>Downscale (resolution reduction)</li>
<li>Bounds (boundary cropping variations)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture follows a standard <strong>Image Captioning</strong> (Encoder-Decoder) paradigm.</p>
<p><strong>Architecture Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Vision Encoder</strong></td>
          <td>Swin Transformer (ImageNet pretrained)</td>
      </tr>
      <tr>
          <td>- Tiny variant</td>
          <td>66M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Small variant</td>
          <td>108M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Base variant</td>
          <td>216M parameters, $384 \times 384$ input</td>
      </tr>
      <tr>
          <td><strong>Connector</strong></td>
          <td>2-layer MLP reducing channel dimension by half</td>
      </tr>
      <tr>
          <td><strong>Text Decoder</strong></td>
          <td>BART-Decoder (12 layers, 16 attention heads)</td>
      </tr>
  </tbody>
</table>
<p><strong>Training Configuration</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Pre-training</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Hardware</strong></td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
      </tr>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>AdamW</td>
          <td>AdamW</td>
      </tr>
      <tr>
          <td><strong>Learning Rate</strong></td>
          <td>$1 \times 10^{-4}$</td>
          <td>$5 \times 10^{-5}$</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong></td>
          <td>$1 \times 10^{-2}$</td>
          <td>$1 \times 10^{-2}$</td>
      </tr>
      <tr>
          <td><strong>Scheduler</strong></td>
          <td>Cosine with warmup</td>
          <td>Cosine with warmup</td>
      </tr>
      <tr>
          <td><strong>Epochs</strong></td>
          <td>20</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Label Smoothing</strong></td>
          <td>0.01</td>
          <td>0.005</td>
      </tr>
  </tbody>
</table>
<p><strong>Curriculum Learning Schedule</strong> (Pre-training):</p>
<ul>
<li>Starts with simple molecules (&lt;60 tokens, no augmentation)</li>
<li>Gradually adds complexity and augmentation (blur, noise, perspective transforms)</li>
<li>Enables stable learning across diverse molecular structures</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>: Exact match accuracy on predicted E-SMILES strings (molecule-level exact match)</p>
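<p>A minimal sketch of the metric. In the real evaluation the core SMILES would be canonicalized (e.g. via RDKit) before comparison so that chemically equivalent strings match; the trivial default below is only a stand-in for that step:</p>

```python
def exact_match_accuracy(preds, targets, canonicalize=lambda s: s.strip()):
    """Fraction of molecules whose predicted E-SMILES exactly matches the
    reference after canonicalization (here a trivial strip placeholder)."""
    hits = sum(canonicalize(p) == canonicalize(t) for p, t in zip(preds, targets))
    return hits / len(targets)

# Two of three predictions match after normalization.
acc = exact_match_accuracy(["CCO", "c1ccccc1 ", "CCN"],
                           ["CCO", "c1ccccc1", "CCC"])
assert abs(acc - 2 / 3) < 1e-9
```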
<p><strong>Key Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MolParser-Base</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td><strong>76.9%</strong></td>
          <td>66.4%</td>
          <td>45.5%</td>
          <td>Real-world patent/paper crops</td>
      </tr>
      <tr>
          <td><strong>USPTO-10k</strong></td>
          <td>94.5%</td>
          <td><strong>96.0%</strong></td>
          <td>93.3%</td>
          <td>Synthetic benchmark</td>
      </tr>
      <tr>
          <td><strong>Throughput (FPS)</strong></td>
          <td><strong>39.8</strong></td>
          <td>16.5</td>
          <td>2.2</td>
          <td>Measured on RTX 4090D</td>
      </tr>
  </tbody>
</table>
<p><strong>Additional Performance</strong>:</p>
<ul>
<li>MolParser-Tiny: 131 FPS on RTX 4090D (66M params)</li>
<li>Real-world vs. synthetic gap: Fine-tuning on MolParser-SFT-400k closed the performance gap between clean benchmarks and in-the-wild documents</li>
</ul>
<p><strong>Ablation Findings</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Real-world training data</td>
          <td>Fine-tuning on real data raised accuracy from 51.9% to 76.9% on WildMol-10k</td>
      </tr>
      <tr>
          <td>Curriculum learning</td>
          <td>Augmentation alone raised WildMol-10k from 40.1% to 69.5%; adding curriculum learning further raised it to 76.9%</td>
      </tr>
      <tr>
          <td>Active learning selection</td>
          <td>More effective than random sampling for annotation budget</td>
      </tr>
      <tr>
          <td>E-SMILES extensions</td>
          <td>Essential for Markush structure recognition (impossible with standard SMILES)</td>
      </tr>
      <tr>
          <td>Dataset scale</td>
          <td>Larger pre-training dataset (7M vs 300k) improved WildMol-10k accuracy from 22.4% to 51.9% before fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 8x NVIDIA RTX 4090D GPUs</li>
<li><strong>Inference</strong>: Single RTX 4090D sufficient for real-time processing</li>
<li><strong>Training time</strong>: 20 epochs pre-training + 4 epochs fine-tuning (specific duration not reported)</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2025molparser,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Xi and Wang, Jiankun and Cai, Xiaochen and Chen, Shangqian and Yang, Shuwen and Tao, Haoyi and Wang, Nan and Yao, Lin and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24528--24538}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2411.11098}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2411.11098}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DenoiseVAE: Adaptive Noise for Molecular Pre-training</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/denoise-vae/</link><pubDate>Sun, 24 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/denoise-vae/</guid><description>Liu et al.'s ICLR 2025 paper introducing DenoiseVAE, which learns adaptive, atom-specific noise distributions for better molecular force fields.</description><content:encoded><![CDATA[<h2 id="paper-contribution-type">Paper Contribution Type</h2>
<p>This is a <strong>method paper</strong> with a supporting theoretical component. It introduces a new pre-training framework, DenoiseVAE, that challenges the standard practice of using fixed, hand-crafted noise distributions in denoising-based molecular representation learning.</p>
<h2 id="motivation-the-inter--and-intra-molecular-variations-problem">Motivation: The Inter- and Intra-molecular Variations Problem</h2>
<p>The motivation is to create a more physically principled denoising pre-training task for 3D molecules. The core idea of denoising is to learn molecular force fields by corrupting an equilibrium conformation with noise and then learning to recover it. However, existing methods use a single, hand-crafted noise strategy (e.g., Gaussian noise of a fixed scale) for all atoms across all molecules. This is physically unrealistic for two main reasons:</p>
<ol>
<li><strong>Inter-molecular differences</strong>: Different molecules have unique Potential Energy Surfaces (PES), meaning the space of low-energy (i.e., physically plausible) conformations is highly molecule-specific.</li>
<li><strong>Intra-molecular differences (Anisotropy)</strong>: Within a single molecule, different atoms have different degrees of freedom. For instance, an atom in a rigid functional group can move much less than one connected by a single, rotatable bond.</li>
</ol>
<p>The authors argue that this &ldquo;one-size-fits-all&rdquo; noise approach leads to inaccurate force field learning because it samples many physically improbable conformations.</p>
<h2 id="novelty-a-learnable-atom-specific-noise-generator">Novelty: A Learnable, Atom-Specific Noise Generator</h2>
<p>The core novelty is a framework that learns to generate noise tailored to each specific molecule and atom. This is achieved through three key innovations:</p>
<ol>
<li><strong>Learnable Noise Generator</strong>: The authors introduce a Noise Generator module (a 4-layer Equivariant Graph Neural Network) that takes a molecule&rsquo;s equilibrium conformation $X$ as input and outputs a unique, atom-specific Gaussian noise distribution (i.e., a different variance $\sigma_i^2$ for each atom $i$). This directly addresses the issues of PES specificity and force field anisotropy.</li>
<li><strong>Variational Autoencoder (VAE) Framework</strong>: The Noise Generator (encoder) and a Denoising Module (a 7-layer EGNN decoder) are trained jointly within a VAE paradigm. The noisy conformation is sampled using the reparameterization trick:
$$
\begin{aligned}
\tilde{x}_i &amp;= x_i + \epsilon \sigma_i
\end{aligned}
$$</li>
<li><strong>Principled Optimization Objective</strong>: The training loss balances two competing goals:
$$
\begin{aligned}
\mathcal{L}_{DenoiseVAE} &amp;= \mathcal{L}_{Denoise} + \lambda \mathcal{L}_{KL}
\end{aligned}
$$
<ul>
<li>A denoising reconstruction loss ($\mathcal{L}_{Denoise}$) encourages the Noise Generator to produce physically plausible perturbations from which the original conformation can be recovered. This implicitly constrains the noise to respect the molecule&rsquo;s underlying force fields.</li>
<li>A KL divergence regularization term ($\mathcal{L}_{KL}$) pushes the generated noise distributions towards a predefined prior. This prevents the trivial solution of generating zero noise and encourages the model to explore a diverse set of low-energy conformations.</li>
</ul>
</li>
</ol>
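<p>The sampling step and the KL regularizer can be sketched in NumPy. The closed-form KL between the generated per-atom Gaussian $\mathcal{N}(0, \sigma_i^2 I_3)$ and the prior $\mathcal{N}(0, \sigma_p^2 I_3)$ shows why the trivial near-zero-noise solution is penalized; the EGNN Noise Generator is replaced here by given $\sigma$ values:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noisy_conformation(X, sigma):
    """Reparameterization trick: x~_i = x_i + eps * sigma_i, eps ~ N(0, I).

    X: (N, 3) equilibrium coordinates; sigma: (N,) atom-specific scales
    (in the paper, predicted by the Noise Generator)."""
    eps = rng.standard_normal(X.shape)
    return X + eps * sigma[:, None]

def kl_to_prior(sigma, sigma_prior=0.1):
    """KL( N(0, sigma_i^2 I_3) || N(0, sigma_p^2 I_3) ), summed over atoms.
    Zero when sigma matches the prior; grows as sigma collapses toward 0."""
    ratio = sigma**2 / sigma_prior**2
    return np.sum(3 * (np.log(sigma_prior / sigma) + 0.5 * ratio - 0.5))

sigma_match = np.full(4, 0.1)
assert abs(kl_to_prior(sigma_match)) < 1e-12   # matching the prior: KL = 0
assert kl_to_prior(np.full(4, 0.01)) > 0       # near-zero noise is penalized
```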
<p>The authors also provide a theoretical analysis showing that optimizing their objective is equivalent to maximizing the Evidence Lower Bound (ELBO) on the log-likelihood of observing physically realistic conformations.</p>
<h2 id="methodology--experimental-baselines">Methodology &amp; Experimental Baselines</h2>
<p>The model was pretrained on the PCQM4Mv2 dataset (approximately 3.4 million organic molecules) and then evaluated on a comprehensive suite of downstream tasks to test the quality of the learned representations:</p>
<ol>
<li><strong>Molecular Property Prediction (<a href="/notes/chemistry/datasets/qm9/">QM9</a>)</strong>: The model was evaluated on 12 quantum chemical property prediction tasks for small molecules (134k molecules; 100k train, 18k val, 13k test split). DenoiseVAE achieved state-of-the-art or second-best performance on 11 of the 12 tasks, with particularly significant gains on $C_v$ (heat capacity), indicating better capture of vibrational modes.</li>
<li><strong>Force Prediction (MD17)</strong>: The task was to predict atomic forces from molecular dynamics trajectories for 8 different small molecules (9,500 train, 500 val split). DenoiseVAE was the top performer on 5 of the 8 molecules (Aspirin, Benzene, Ethanol, Naphthalene, Toluene), though it underperformed Frad on Malonaldehyde, Salicylic Acid, and Uracil by significant margins.</li>
<li><strong>Ligand Binding Affinity (PDBBind v2019)</strong>: On the PDBBind dataset with 30% and 60% protein sequence identity splits, the model showed strong generalization, outperforming baselines like Uni-Mol particularly on the more stringent 30% split across RMSE, Pearson correlation, and Spearman correlation.</li>
<li><strong>PCQM4Mv2 Validation</strong>: DenoiseVAE achieved a validation MAE of 0.0777 on the PCQM4Mv2 HOMO-LUMO gap prediction task with only 1.44M parameters, competitive with models 10-40x larger (e.g., GPS++ at 44.3M params achieves 0.0778).</li>
<li><strong>Ablation Studies</strong>: The authors analyzed the sensitivity to key hyperparameters, namely the prior&rsquo;s standard deviation ($\sigma$) and the KL-divergence weight ($\lambda$), confirming that $\lambda=1$ and $\sigma=0.1$ are optimal. Removing the KL term leads to trivial solutions (near-zero noise). An additional ablation on the Noise Generator depth found 4 EGNN layers optimal over 2 layers. A comparison of independent (diagonal) versus non-independent (full covariance) noise sampling showed comparable results, suggesting the EGNN already captures inter-atomic dependencies implicitly.</li>
<li><strong>Case Studies</strong>: Visualizations of the learned noise variances for different molecules confirmed that the model learns chemically intuitive noise patterns. For example, it applies smaller perturbations to atoms in a rigid bicyclic norcamphor derivative and larger ones to atoms in flexible functional groups of a cyclopropane derivative. Even identical functional groups (e.g., hydroxyl) receive different noise scales in different molecular contexts.</li>
</ol>
<h2 id="key-findings-on-force-field-learning">Key Findings on Force Field Learning</h2>
<ul>
<li><strong>Primary Conclusion</strong>: Learning a <strong>molecule-adaptive and atom-specific</strong> noise distribution is a superior strategy for denoising-based pre-training compared to using fixed, hand-crafted heuristics. This more physically-grounded approach leads to representations that better capture molecular force fields.</li>
<li><strong>Strong Benchmark Performance</strong>: DenoiseVAE achieves best or second-best results on 11 of 12 QM9 tasks, 5 of 8 MD17 molecules, and leads on the stringent 30% LBA split. Performance is mixed on some MD17 molecules (Malonaldehyde, Salicylic Acid, Uracil), where it trails Frad.</li>
<li><strong>Effective Framework</strong>: The proposed VAE-based framework, which jointly trains a Noise Generator and a Denoising Module, is an effective and theoretically sound method for implementing this adaptive noise strategy. The interplay between the reconstruction loss and the KL-divergence regularization is key to its success.</li>
<li><strong>Limitation and Future Direction</strong>: The method is based on classical force field assumptions. The authors note that integrating more accurate force fields represents a promising direction for future work.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Serendipity-r/DenoiseVAE">Serendipity-r/DenoiseVAE</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<ul>
<li><strong>Source Code</strong>: The authors have released their code at <a href="https://github.com/Serendipity-r/DenoiseVAE">Serendipity-r/DenoiseVAE</a> on GitHub. No license is specified in the repository.</li>
<li><strong>Implementation</strong>: Hyperparameters and architectures are detailed in the paper&rsquo;s appendix (A.14), and the repository provides reference implementations.</li>
</ul>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pre-training Dataset</strong>: <a href="https://ogb.stanford.edu/docs/lsc/pcqm4mv2/">PCQM4Mv2</a> (approximately 3.4 million organic molecules)</li>
<li><strong>Property Prediction</strong>: <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.QM9.html">QM9 dataset</a> (134k molecules; 100k train, 18k val, 13k test split) for 12 quantum chemical properties</li>
<li><strong>Force Prediction</strong>: <a href="http://www.sgdml.org/#datasets">MD17 dataset</a> (9,500 train, 500 val split) for 8 different small molecules</li>
<li><strong>Ligand Binding Affinity</strong>: PDBBind v2019 (4,463 protein-ligand complexes) with 30% and 60% sequence identity splits</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Noise Generator</strong>: 4-layer Equivariant Graph Neural Network (EGNN) that outputs atom-specific Gaussian noise distributions</li>
<li><strong>Denoising Module</strong>: 7-layer EGNN decoder</li>
<li><strong>Training Objective</strong>: $\mathcal{L}_{DenoiseVAE} = \mathcal{L}_{Denoise} + \lambda \mathcal{L}_{KL}$ with $\lambda=1$</li>
<li><strong>Noise Sampling</strong>: Reparameterization trick with $\tilde{x}_i = x_i + \epsilon \sigma_i$</li>
<li><strong>Prior Distribution</strong>: Standard deviation $\sigma=0.1$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Model Size</strong>: 1.44M parameters total</li>
<li><strong>Fine-tuning Protocol</strong>: Noise Generator discarded after pre-training; only the pre-trained Denoising Module (7-layer EGNN) is retained for downstream fine-tuning</li>
<li><strong>Optimizer</strong>: AdamW with cosine learning rate decay (max LR of 0.0005)</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>System Training</strong>: Fine-tuned end-to-end for specific tasks; force prediction involves computing the gradient of the predicted energy</li>
</ul>
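<p>The force-prediction protocol (forces as the negative gradient of a predicted energy) can be illustrated with a toy energy function; autodiff is replaced here by central finite differences, and the harmonic bond energy is purely illustrative, not the model's learned energy:</p>

```python
import numpy as np

def harmonic_energy(X, k=1.0, r0=1.0):
    """Toy diatomic bond energy: E = 1/2 * k * (|x1 - x0| - r0)^2."""
    r = np.linalg.norm(X[1] - X[0])
    return 0.5 * k * (r - r0) ** 2

def forces(energy_fn, X, h=1e-5):
    """F = -dE/dX via central finite differences; in the paper's setup the
    gradient of the predicted energy is taken by autodiff instead."""
    F = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += h
        Xm[idx] -= h
        F[idx] = -(energy_fn(Xp) - energy_fn(Xm)) / (2 * h)
    return F

X = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])  # bond stretched past r0 = 1.0
F = forces(harmonic_energy, X)
assert F[1, 0] < 0 and F[0, 0] > 0   # restoring forces pull the atoms together
assert np.allclose(F[0], -F[1], atol=1e-6)  # Newton's third law
```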
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Ablation Studies</strong>: Sensitivity analysis confirmed $\lambda=1$ and $\sigma=0.1$ as optimal hyperparameters; removing the KL term leads to trivial solutions (near-zero noise)</li>
<li><strong>Noise Generator Depth</strong>: 4 EGNN layers outperformed 2 layers across both QM9 and MD17 benchmarks</li>
<li><strong>Covariance Structure</strong>: Full covariance matrix (non-independent noise sampling) yielded comparable results to diagonal variance (independent sampling), likely because the EGNN already integrates neighboring atom information</li>
<li><strong>O(3) Invariance</strong>: The method satisfies O(3) probabilistic invariance, meaning the noise distribution is unchanged under rotations and reflections</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU Configuration</strong>: Experiments conducted on a single NVIDIA RTX 3090 GPU; six such GPUs (144GB total memory) are sufficient for full reproduction</li>
<li><strong>CPU</strong>: Intel Xeon Gold 5318Y @ 2.10GHz</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, Y., Chen, J., Jiao, R., Li, J., Huang, W., &amp; Su, B. (2025). DenoiseVAE: Learning Molecule-Adaptive Noise Distributions for Denoising-based 3D Molecular Pre-training. <em>The Thirteenth International Conference on Learning Representations (ICLR)</em>.</p>
<p><strong>Publication</strong>: ICLR 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liu2025denoisevae,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DenoiseVAE: Learning Molecule-Adaptive Noise Distributions for Denoising-based 3D Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yurou Liu and Jiahao Chen and Rui Jiao and Jiangmeng Li and Wenbing Huang and Bing Su}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Thirteenth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=ym7pr83XQr}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://iclr.cc/virtual/2025/poster/27701">ICLR 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=ym7pr83XQr">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=ym7pr83XQr">PDF on OpenReview</a></li>
</ul>
]]></content:encoded></item><item><title>eSEN: Smooth Interatomic Potentials (ICML Spotlight)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/learning-smooth-interatomic-potentials/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/learning-smooth-interatomic-potentials/</guid><description>Fu et al. propose energy conservation as a key MLIP diagnostic and introduce eSEN, bridging test accuracy and real performance.</description><content:encoded><![CDATA[<h2 id="paper-overview">Paper Overview</h2>
<p>This is a <strong>method paper</strong>. It addresses a critical disconnect in the evaluation of Machine Learning Interatomic Potentials (MLIPs) and introduces a novel architecture, <strong>eSEN</strong>, designed based on insights from this analysis. The paper proposes a new standard for evaluating MLIPs beyond simple test-set errors.</p>
<h2 id="the-energy-conservation-gap-in-mlip-evaluation">The Energy Conservation Gap in MLIP Evaluation</h2>
<p>The paper tackles a well-known but under-addressed problem in the field: improvements in standard MLIP metrics (lower energy/force MAE on static test sets) do not reliably translate into better performance on complex downstream tasks such as molecular dynamics (MD) simulations, materials stability prediction, and phonon calculations. The authors seek to understand why this gap exists and how to design models that are both accurate on test sets and physically reliable in practical scientific workflows.</p>
<h2 id="the-esen-architecture-and-continuous-representation">The eSEN Architecture and Continuous Representation</h2>
<p>The novelty is twofold, spanning both a conceptual framework for evaluation and a new model architecture:</p>
<ol>
<li>
<p><strong>Energy Conservation as a Diagnostic Test</strong>: The core conceptual contribution is using an MLIP&rsquo;s ability to conserve energy in out-of-distribution MD simulations as a crucial diagnostic test. The authors demonstrate that for models passing this test, a strong correlation between test-set error and downstream task performance is restored.</p>
</li>
<li>
<p><strong>The eSEN Architecture</strong>: The paper introduces the <strong>equivariant Smooth Energy Network (eSEN)</strong>, designed with specific choices to ensure a smooth and well-behaved Potential Energy Surface (PES):</p>
<ul>
<li><strong>Strictly Conservative Forces</strong>: Forces are computed exclusively as the negative gradient of energy ($F = -\nabla E$), using conservative force prediction instead of faster direct-force prediction heads.</li>
<li><strong>Continuous Representations</strong>: Maintains strict equivariance and smoothness by using equivariant gated non-linearities instead of discretizing spherical harmonic representations during nodewise processing.</li>
<li><strong>Smooth PES Construction</strong>: Critical design choices include using distance cutoffs, polynomial envelope functions ensuring derivatives go to zero at cutoffs, and limited radial basis functions to avoid overly sensitive PES.</li>
</ul>
</li>
<li>
<p><strong>Efficient Training Strategy</strong>: A two-stage training regimen with fast pre-training using a non-conservative direct-force model, followed by fine-tuning to enforce energy conservation. This captures the efficiency of direct-force training while ensuring physical robustness.</p>
</li>
</ol>
<h2 id="evaluating-ood-energy-conservation-and-physical-properties">Evaluating OOD Energy Conservation and Physical Properties</h2>
<p>The paper presents a comprehensive experimental validation:</p>
<ol>
<li>
<p><strong>Ablation Studies on Energy Conservation</strong>: MD simulations on out-of-distribution systems (TM23 and MD22 datasets) systematically tested key design choices (direct-force vs. conservative, representation discretization, neighbor limits, envelope functions). This empirically demonstrated which choices lead to energy drift despite negligible impact on test-set MAE.</p>
</li>
<li>
<p><strong>Physical Property Prediction Benchmarks</strong>: The eSEN model was evaluated on challenging downstream tasks:</p>
<ul>
<li><strong>Matbench-Discovery</strong>: Materials stability and thermal conductivity prediction, where eSEN achieved the highest F1 score among compliant models and excelled at both metrics simultaneously.</li>
<li><strong>MDR Phonon Benchmark</strong>: Predicting phonon properties that test accurate second and third-order derivatives of the PES. eSEN achieved state-of-the-art results, particularly outperforming direct-force models.</li>
<li><strong>SPICE-MACE-OFF</strong>: Standard energy and force prediction on organic molecules, demonstrating that physical plausibility design choices enhanced raw accuracy.</li>
</ul>
</li>
<li>
<p><strong>Correlation Analysis</strong>: Explicit plots of test-set energy MAE versus performance on downstream benchmarks showed weak overall correlation that becomes strong and predictive when restricted to models passing the energy conservation test.</p>
</li>
</ol>
<h2 id="outcomes-and-conclusions">Outcomes and Conclusions</h2>
<ul>
<li>
<p><strong>Primary Conclusion</strong>: Energy conservation is a critical, practical property for MLIPs. Using it as a filter re-establishes test-set error as a reliable proxy for model development, dramatically accelerating the innovation cycle. Models that are not conservative, even with low test error, are unreliable for many critical scientific applications.</p>
</li>
<li>
<p><strong>Model Performance</strong>: The eSEN architecture outperforms base models across diverse tasks, from energy/force prediction to geometry optimization, phonon calculations, and thermal conductivity prediction.</p>
</li>
<li>
<p><strong>Actionable Design Principles</strong>: The paper provides experimentally-validated architectural choices that promote physical plausibility. Seemingly minor details, like how atomic neighbors are selected, can have profound impacts on a model&rsquo;s utility in simulations.</p>
</li>
<li>
<p><strong>Efficient Path to Robust Models</strong>: The direct-force pre-training plus conservative fine-tuning strategy offers a practical method for developing physically robust models without incurring the full computational cost of conservative training from scratch.</p>
</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/facebookresearch/fairchem">fairchem (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation within FAIR Chemistry framework</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/facebook/OMAT24">OMAT24 (Hugging Face)</a></td>
          <td>Model</td>
          <td>FAIR Acceptable Use Policy</td>
          <td>Pre-trained eSEN-30M-MP and eSEN-30M-OAM checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://openreview.net/forum?id=R0PBjxIbgm">OpenReview</a></td>
          <td>Paper</td>
          <td>CC BY 4.0</td>
          <td>ICML 2025 camera-ready paper</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>The eSEN architecture builds on components from <strong>eSCN</strong> (Equivariant Spherical Channel Network) and <strong>Equiformer</strong>, combining them with design choices that prioritize smoothness and energy conservation. The implementation integrates into the standard <code>fairchem</code> Open Catalyst experimental framework.</p>
<h4 id="layer-structure">Layer Structure</h4>
<ul>
<li><strong>Edgewise Convolution</strong>: Uses <code>SO2</code> convolution layers (from eSCN) with an envelope function applied. Source and target embeddings are concatenated before convolution.</li>
<li><strong>Nodewise Feed-Forward</strong>: Two equivariant linear layers with an intermediate <strong>SiLU-based gated non-linearity</strong> (from Equiformer).</li>
<li><strong>Normalization</strong>: Equivariant Layer Normalization (from Equiformer).</li>
</ul>
<h4 id="smoothness-design-choices">Smoothness Design Choices</h4>
<p>Several architectural decisions distinguish eSEN from prior work:</p>
<ul>
<li><strong>No Grid Projection</strong>: eSEN performs operations directly in the spherical harmonic space to maintain equivariance and energy conservation, bypassing the projection of spherical harmonics to spatial grids for non-linearity.</li>
<li><strong>Distance Cutoff for Graph Construction</strong>: Uses a strict distance cutoff (6 Å for MPTrj models, 5 Å for SPICE models). Neighbor limits introduce discontinuities that break energy conservation.</li>
<li><strong>Polynomial Envelope Functions</strong>: Ensures derivatives go to zero smoothly at the cutoff radius.</li>
</ul>
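<p>The envelope idea is easy to make concrete (a DimeNet-style polynomial form shown here for illustration; the exact polynomial eSEN uses may differ):</p>

```python
import numpy as np

def poly_envelope(r, r_cut=6.0, p=5):
    """Smoothly damp a radial feature to zero at r_cut: the value and its
    low-order derivatives vanish at the cutoff, so nothing is chopped off
    discontinuously and no kink enters the PES."""
    x = np.asarray(r, dtype=float) / r_cut
    env = (1.0
           - (p + 1) * (p + 2) / 2.0 * x**p
           + p * (p + 2) * x**(p + 1)
           - p * (p + 1) / 2.0 * x**(p + 2))
    return np.where(x < 1.0, env, 0.0)
```

<p>At $r = 0$ the envelope is 1; at and beyond the cutoff it is exactly 0, approached with vanishing slope rather than a hard truncation.</p>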
<h3 id="algorithms">Algorithms</h3>
<h4 id="two-stage-training-esen-30m-mp">Two-Stage Training (eSEN-30M-MP)</h4>
<ol>
<li><strong>Direct-Force Pre-training</strong> (60 epochs): Uses <strong>DeNS</strong> (Denoising Non-equilibrium Structures) to reduce overfitting. This stage is fast because it does not require backpropagation through energy gradients.</li>
<li><strong>Conservative Fine-tuning</strong> (40 epochs): The direct-force head is removed, and forces are calculated via gradients ($F = -\nabla E$). This enforces energy conservation.</li>
</ol>
<p><strong>Important</strong>: DeNS is used exclusively during the direct-force pre-training stage, with a noising probability of 0.5, a standard deviation of 0.1 Å for the added Gaussian noise, and a DeNS loss coefficient of 10. The fine-tuning strategy reduces the wall-clock time for model training by 40%.</p>
<h4 id="optimization">Optimization</h4>
<ul>
<li><strong>Optimizer</strong>: AdamW with cosine learning rate scheduler</li>
<li><strong>Max Learning Rate</strong>: $4 \times 10^{-4}$</li>
<li><strong>Batch Size</strong>: 512 (for MPTrj models)</li>
<li><strong>Weight Decay</strong>: $1 \times 10^{-3}$</li>
<li><strong>Gradient Clipping</strong>: Norm of 100</li>
<li><strong>Warmup</strong>: 0.1 epochs with a factor of 0.2</li>
</ul>
<h4 id="loss-function">Loss Function</h4>
<p>A composite loss combining per-atom energy MAE, force $L_2$ loss, and stress MAE:</p>
<p>$$
\begin{aligned}
\mathcal{L} = \lambda_{\text{e}} \frac{1}{N} \sum_{i=1}^N \lvert E_{i} - \hat{E}_{i} \rvert + \lambda_{\text{f}} \frac{1}{3N} \sum_{i=1}^N \lVert \mathbf{F}_{i} - \hat{\mathbf{F}}_{i} \rVert_2^2 + \lambda_{\text{s}} \lVert \mathbf{S} - \hat{\mathbf{S}} \rVert_1
\end{aligned}
$$</p>
<p>For MPTrj-30M, the weighting coefficients are set to $\lambda_{\text{e}} = 20$, $\lambda_{\text{f}} = 20$, and $\lambda_{\text{s}} = 5$.</p>
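<p>In code, the composite loss is a direct translation (a sketch with hypothetical array shapes: per-structure energies, an $N \times 3$ force array, and a $3 \times 3$ stress tensor):</p>

```python
import numpy as np

def mlip_loss(E, E_hat, F, F_hat, S, S_hat,
              lam_e=20.0, lam_f=20.0, lam_s=5.0):
    """Energy MAE + per-component force MSE + stress L1,
    with the MPTrj-30M weights as defaults."""
    n_atoms = F.shape[0]
    loss_e = lam_e * np.mean(np.abs(E - E_hat))
    loss_f = lam_f * np.sum((F - F_hat) ** 2) / (3 * n_atoms)
    loss_s = lam_s * np.sum(np.abs(S - S_hat))
    return loss_e + loss_f + loss_s
```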
<h3 id="data">Data</h3>
<h4 id="training-data">Training Data</h4>
<ul>
<li><strong>Inorganic</strong>: MPTrj (Materials Project Trajectory) dataset</li>
<li><strong>Organic</strong>: SPICE-MACE-OFF dataset</li>
</ul>
<h4 id="test-data-construction">Test Data Construction</h4>
<ul>
<li><strong>MPTrj Testing</strong>: Since MPTrj lacks an official test split, the authors created a test set using 5,000 random samples from the <strong>subsampled Alexandria (sAlex)</strong> dataset to ensure fair comparison.</li>
<li><strong>Out-of-Distribution Conservation Testing</strong>:
<ul>
<li><em>Inorganic</em>: <strong>TM23</strong> dataset (transition metal defects). Simulation: 100 ps, 5 fs timestep.</li>
<li><em>Organic</em>: <strong>MD22</strong> dataset (large molecules). Simulation: 100 ps, 1 fs timestep.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Compute for training operations predominantly utilizes <strong>80GB NVIDIA A100 GPUs</strong>.</p>
<h4 id="inference-efficiency">Inference Efficiency</h4>
<p>For a periodic system of <strong>216 atoms</strong> on a single A100 (PyTorch 2.4.0, CUDA 12.1, no compile/torchscript), the 2-layer eSEN models achieve approximately <strong>0.4 million steps per day</strong> (3.2M parameters) and <strong>0.8 million steps per day</strong> (6.5M parameters), comparable to MACE-OFF-L at 0.7 million steps per day.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The paper evaluated eSEN across three major benchmark tasks. Key evaluation metrics included energy MAE (meV/atom), force MAE (meV/Å), stress MAE (meV/Å/atom), F1 score for stability prediction, $\kappa_{\text{SRME}}$ for thermal conductivity, and phonon frequency accuracy.</p>
<h4 id="ablation-test-set-mae-table-1">Ablation Test-Set MAE (Table 1)</h4>
<p>Design choices that dramatically affect energy conservation have negligible impact on static test-set MAE, which is precisely why test-set error alone is misleading. All models are 2-layer with 3.2M parameters, $L_{\text{max}} = 2$, $M_{\text{max}} = 2$:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Energy MAE</th>
          <th>Force MAE</th>
          <th>Stress MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>eSEN (default)</td>
          <td>17.02</td>
          <td>43.96</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, direct-force</td>
          <td>18.66</td>
          <td>43.62</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>eSEN, neighbor limit</td>
          <td>17.30</td>
          <td>44.11</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, no envelope</td>
          <td>17.60</td>
          <td>44.69</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, $N_{\text{basis}} = 512$</td>
          <td>19.87</td>
          <td>48.29</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>eSEN, Bessel</td>
          <td>17.65</td>
          <td>44.83</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>eSEN, discrete, res=6</td>
          <td>17.05</td>
          <td>43.10</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, discrete, res=10</td>
          <td>17.11</td>
          <td>43.13</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, discrete, res=14</td>
          <td>17.12</td>
          <td>43.09</td>
          <td>0.14</td>
      </tr>
  </tbody>
</table>
<p>Energy MAE in meV/atom. Force MAE in meV/Å. Stress MAE in meV/Å/atom.</p>
<h4 id="matbench-discovery-tables-2-and-3">Matbench-Discovery (Tables 2 and 3)</h4>
<p><strong>Compliant models</strong> (trained only on MPTrj or its subset), unique prototype split:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>F1</th>
          <th>DAF</th>
          <th>$\kappa_{\text{SRME}}$</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>eSEN-30M-MP</strong></td>
          <td><strong>0.831</strong></td>
          <td><strong>5.260</strong></td>
          <td><strong>0.340</strong></td>
          <td><strong>0.0752</strong></td>
      </tr>
      <tr>
          <td>eqV2-S-DeNS</td>
          <td>0.815</td>
          <td>5.042</td>
          <td>1.676</td>
          <td>0.0757</td>
      </tr>
      <tr>
          <td>MatRIS-MP</td>
          <td>0.809</td>
          <td>5.049</td>
          <td>0.861</td>
          <td>0.0773</td>
      </tr>
      <tr>
          <td>AlphaNet-MP</td>
          <td>0.799</td>
          <td>4.863</td>
          <td>1.31</td>
          <td>0.1067</td>
      </tr>
      <tr>
          <td>DPA3-v2-MP</td>
          <td>0.786</td>
          <td>4.822</td>
          <td>0.959</td>
          <td>0.0823</td>
      </tr>
      <tr>
          <td>ORB v2 MPtrj</td>
          <td>0.765</td>
          <td>4.702</td>
          <td>1.725</td>
          <td>0.1007</td>
      </tr>
      <tr>
          <td>SevenNet-13i5</td>
          <td>0.760</td>
          <td>4.629</td>
          <td>0.550</td>
          <td>0.0847</td>
      </tr>
      <tr>
          <td>GRACE-2L-MPtrj</td>
          <td>0.691</td>
          <td>4.163</td>
          <td>0.525</td>
          <td>0.0897</td>
      </tr>
      <tr>
          <td>MACE-MP-0</td>
          <td>0.669</td>
          <td>3.777</td>
          <td>0.647</td>
          <td>0.0915</td>
      </tr>
      <tr>
          <td>CHGNet</td>
          <td>0.613</td>
          <td>3.361</td>
          <td>1.717</td>
          <td>0.0949</td>
      </tr>
      <tr>
          <td>M3GNet</td>
          <td>0.569</td>
          <td>2.882</td>
          <td>1.412</td>
          <td>0.1117</td>
      </tr>
  </tbody>
</table>
<p>eSEN-30M-MP excels at both F1 and $\kappa_{\text{SRME}}$ simultaneously, while all previous models only achieve SOTA on one or the other.</p>
<p><strong>Non-compliant models</strong> (trained on additional datasets):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>F1</th>
          <th>$\kappa_{\text{SRME}}$</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>eSEN-30M-OAM</strong></td>
          <td><strong>0.925</strong></td>
          <td><strong>0.170</strong></td>
          <td><strong>0.0608</strong></td>
      </tr>
      <tr>
          <td>eqV2-M-OAM</td>
          <td>0.917</td>
          <td>1.771</td>
          <td>0.0691</td>
      </tr>
      <tr>
          <td>ORB v3</td>
          <td>0.905</td>
          <td>0.210</td>
          <td>0.0750</td>
      </tr>
      <tr>
          <td>SevenNet-MF-ompa</td>
          <td>0.901</td>
          <td>0.317</td>
          <td>0.0639</td>
      </tr>
      <tr>
          <td>DPA3-v2-OpenLAM</td>
          <td>0.890</td>
          <td>0.687</td>
          <td>0.0679</td>
      </tr>
      <tr>
          <td>GRACE-2L-OAM</td>
          <td>0.880</td>
          <td>0.294</td>
          <td>0.0666</td>
      </tr>
      <tr>
          <td>MatterSim-v1-5M</td>
          <td>0.862</td>
          <td>0.574</td>
          <td>0.0733</td>
      </tr>
      <tr>
          <td>MACE-MPA-0</td>
          <td>0.852</td>
          <td>0.412</td>
          <td>0.0731</td>
      </tr>
  </tbody>
</table>
<p>The eSEN-30M-OAM model is pre-trained on the OMat24 dataset, then fine-tuned on the subsampled Alexandria (sAlex) dataset and MPTrj dataset.</p>
<h4 id="mdr-phonon-benchmark-table-4">MDR Phonon Benchmark (Table 4)</h4>
<p>Metrics: maximum phonon frequency MAE($\omega_{\text{max}}$) in K, vibrational entropy MAE($S$) in J/K/mol, Helmholtz free energy MAE($F$) in kJ/mol, heat capacity MAE($C_V$) in J/K/mol.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>MAE($\omega_{\text{max}}$)</th>
          <th>MAE($S$)</th>
          <th>MAE($F$)</th>
          <th>MAE($C_V$)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>eSEN-30M-MP</strong></td>
          <td><strong>21</strong></td>
          <td><strong>13</strong></td>
          <td><strong>5</strong></td>
          <td><strong>4</strong></td>
      </tr>
      <tr>
          <td>SevenNet-13i5</td>
          <td>26</td>
          <td>28</td>
          <td>10</td>
          <td>5</td>
      </tr>
      <tr>
          <td>GRACE-2L (r6)</td>
          <td>40</td>
          <td>25</td>
          <td>9</td>
          <td>5</td>
      </tr>
      <tr>
          <td>SevenNet-0</td>
          <td>40</td>
          <td>48</td>
          <td>19</td>
          <td>9</td>
      </tr>
      <tr>
          <td>MACE</td>
          <td>61</td>
          <td>60</td>
          <td>24</td>
          <td>13</td>
      </tr>
      <tr>
          <td>CHGNet</td>
          <td>89</td>
          <td>114</td>
          <td>45</td>
          <td>21</td>
      </tr>
      <tr>
          <td>M3GNet</td>
          <td>98</td>
          <td>150</td>
          <td>56</td>
          <td>22</td>
      </tr>
  </tbody>
</table>
<p>Direct-force models show dramatically worse performance at the standard 0.01 Å displacement (e.g., eqV2-S-DeNS: 280/224/54/94) but improve at larger displacements (0.2 Å: 58/26/8/8), revealing that their PES is rough near energy minima.</p>
<h4 id="spice-mace-off-table-5">SPICE-MACE-OFF (Table 5)</h4>
<p>Test set MAE for organic molecule energy/force prediction. Energy MAE in meV/atom, force MAE in meV/Å:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MACE-4.7M (E/F)</th>
          <th>EscAIP-45M* (E/F)</th>
          <th>eSEN-3.2M (E/F)</th>
          <th>eSEN-6.5M (E/F)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>0.88 / 14.75</td>
          <td>0.53 / 5.86</td>
          <td>0.22 / 6.10</td>
          <td><strong>0.15</strong> / <strong>4.21</strong></td>
      </tr>
      <tr>
          <td>DES370K M.</td>
          <td>0.59 / 6.58</td>
          <td>0.41 / 3.48</td>
          <td>0.17 / 1.85</td>
          <td><strong>0.13</strong> / <strong>1.24</strong></td>
      </tr>
      <tr>
          <td>DES370K D.</td>
          <td>0.54 / 6.62</td>
          <td>0.38 / 2.18</td>
          <td>0.20 / 2.77</td>
          <td><strong>0.15</strong> / <strong>2.12</strong></td>
      </tr>
      <tr>
          <td>Dipeptides</td>
          <td>0.42 / 10.19</td>
          <td>0.31 / 5.21</td>
          <td>0.10 / 3.04</td>
          <td><strong>0.07</strong> / <strong>2.00</strong></td>
      </tr>
      <tr>
          <td>Sol. AA</td>
          <td>0.98 / 19.43</td>
          <td>0.61 / 11.52</td>
          <td>0.30 / 5.76</td>
          <td><strong>0.25</strong> / <strong>3.68</strong></td>
      </tr>
      <tr>
          <td>Water</td>
          <td>0.83 / 13.57</td>
          <td>0.72 / 10.31</td>
          <td>0.24 / 3.88</td>
          <td><strong>0.15</strong> / <strong>2.50</strong></td>
      </tr>
      <tr>
          <td>QMugs</td>
          <td>0.45 / 16.93</td>
          <td>0.41 / 8.74</td>
          <td>0.16 / 5.70</td>
          <td><strong>0.12</strong> / <strong>3.78</strong></td>
      </tr>
  </tbody>
</table>
<p>*EscAIP-45M is a direct-force model. eSEN-6.5M outperforms MACE-OFF-L and EscAIP on all test splits. The smaller eSEN-3.2M has inference efficiency comparable to MACE-4.7M while achieving lower MAE.</p>
<hr>
<h2 id="why-these-design-choices-matter">Why These Design Choices Matter</h2>
<h3 id="bounded-energy-derivatives-and-the-verlet-integrator">Bounded Energy Derivatives and the Verlet Integrator</h3>
<p>The theoretical foundation for why smoothness matters comes from Theorem 5.1 of Hairer et al. (2003). For the Verlet integrator (the standard NVE integrator), the total energy drift satisfies:</p>
<p>$$
|E(\mathbf{r}_T, \mathbf{a}) - E(\mathbf{r}_0, \mathbf{a})| \leq C \Delta t^2 + C_N \Delta t^N T
$$</p>
<p>where $T$ is the total simulation time ($T \leq \Delta t^{-N}$), $N$ is the highest order for which the $N$th derivative of $E$ is continuously differentiable with bounded derivative, and $C$, $C_N$ are constants independent of $T$ and $\Delta t$. The first term is a time-independent fluctuation of $O(\Delta t^2)$; the second term governs long-term conservation. This means the PES must be continuously differentiable to high order, with bounded derivatives, for energy conservation in long-time simulations.</p>
<h3 id="architectural-choices-that-break-conservation">Architectural Choices That Break Conservation</h3>
<p>The authors provide theoretical justification for why specific architectural choices break energy conservation:</p>
<ul>
<li><strong>Max Neighbor Limit (KNN)</strong>: Introduces discontinuity in the PES. If a neighbor at distance $r$ moves to $r + \epsilon$ and drops out of the top-$K$, the energy changes discontinuously.</li>
<li><strong>Grid Discretization</strong>: Projecting spherical harmonics to a spatial grid introduces discretization errors in energy gradients that break conservation. This can be mitigated with higher-resolution grids but not eliminated.</li>
<li><strong>Direct-Force Prediction</strong>: Imposes no mathematical constraint that forces must be the gradient of an energy scalar field. In other words, $\nabla \times \mathbf{F} \neq 0$ is permitted, violating the requirement for a conservative force field.</li>
</ul>
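<p>The neighbor-limit discontinuity is concrete enough to demonstrate in a few lines (a toy 1-D sketch of my own, not the paper&rsquo;s code; the pair term depends on neighbor species, as learned messages do):</p>

```python
import numpy as np

def knn_energy(x2, k=1):
    """Energy contribution of atom 0 from its k nearest neighbors.
    Atom 1 (species weight 1.0) is fixed at x = 1.0; atom 2 (weight 2.0) at x2."""
    neighbors = [(1.0, 1.0), (abs(x2), 2.0)]   # (distance, species weight)
    neighbors.sort(key=lambda t: t[0])
    return sum(z / d for d, z in neighbors[:k])

# Moving atom 2 infinitesimally across d = 1.0 swaps which neighbor survives
# the top-k cut, so the energy jumps by ~1: a finite discontinuity from an
# infinitesimal displacement. A distance cutoff with a smooth envelope instead
# lets each contribution fade to zero continuously.
jump = abs(knn_energy(0.9999) - knn_energy(1.0001))
```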
<h3 id="displacement-sensitivity-in-phonon-calculations">Displacement Sensitivity in Phonon Calculations</h3>
<p>An important empirical finding concerns how displacement values affect phonon predictions. Conservative models (eSEN, MACE) show convergent phonon band structures as displacement decreases toward zero. In contrast, direct-force models (eqV2-S-DeNS) fail to converge, exhibiting missing acoustic branches and spurious imaginary frequencies at small displacements. While direct-force models achieve competitive thermodynamic property accuracy at large displacements (0.2 Å), this is deceptive: the underlying phonon band structures remain inaccurate, and the apparent accuracy comes from Boltzmann-weighted integrals smoothing over errors.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fu, X., Wood, B. M., Barroso-Luque, L., Levine, D. S., Gao, M., Dzamba, M., &amp; Zitnick, C. L. (2025). Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>, PMLR 267:17875–17893.</p>
<p><strong>Publication</strong>: ICML 2025 (Spotlight)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fu2025learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fu, Xiang and Wood, Brandon M. and Barroso-Luque, Luis and Levine, Daniel S. and Gao, Meng and Dzamba, Misko and Zitnick, C. Lawrence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17875--17893}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://icml.cc/virtual/2025/poster/45302">ICML 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=R0PBjxIbgm">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=R0PBjxIbgm">PDF on OpenReview</a></li>
<li><a href="https://huggingface.co/facebook/OMAT24">OMAT24 model on Hugging Face</a></li>
<li><a href="https://github.com/facebookresearch/fairchem">Code on GitHub (fairchem)</a></li>
</ul>
]]></content:encoded></item><item><title>Efficient DFT Hamiltonian Prediction via Adaptive Sparsity</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/efficient-dft-hamiltonian-predicton-sphnet/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/efficient-dft-hamiltonian-predicton-sphnet/</guid><description>Luo et al. introduce SPHNet, using adaptive sparsity to achieve up to 7x speedup in SE(3)-equivariant Hamiltonian prediction.</description><content:encoded><![CDATA[<h2 id="core-innovation-adaptive-sparsity-in-se3-networks">Core Innovation: Adaptive Sparsity in SE(3) Networks</h2>
<p>This is a <strong>methodological paper</strong> introducing a novel architecture and training curriculum to solve efficiency bottlenecks in Geometric Deep Learning. It directly tackles the primary computational bottleneck in modern SE(3)-equivariant graph neural networks (the tensor product operation) and proposes a generalizable solution through adaptive network sparsification.</p>
<h2 id="the-computational-bottleneck-in-dft-hamiltonian-prediction">The Computational Bottleneck in DFT Hamiltonian Prediction</h2>
<p>SE(3)-equivariant networks are accurate but unscalable for DFT Hamiltonian prediction due to two key bottlenecks:</p>
<ul>
<li><strong>Atom Scaling</strong>: Tensor Product (TP) operations grow quadratically with the number of atoms ($N^2$).</li>
<li><strong>Basis Set Scaling</strong>: Computational complexity grows with the sixth power of the angular momentum order ($L^6$). Larger basis sets (e.g., def2-TZVP) require higher orders ($L=6$), making them prohibitively slow.</li>
</ul>
<p>Existing SE(3)-equivariant models cannot handle large molecules (40-100 atoms) with high-quality basis sets, limiting their practical applicability in computational chemistry.</p>
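<p>To get a feel for these scalings, here is a back-of-the-envelope estimate (my illustration, not the paper&rsquo;s accounting) of the relative tensor-product cost when moving from a QH9-scale molecule at def2-SVP ($L=4$) to a PubChemQH-scale molecule at def2-TZVP ($L=6$):</p>

```python
# Rough cost model: TP cost ~ N^2 * L^6 (atom-pair scaling times
# angular-momentum scaling). Purely illustrative; constants omitted.

def tp_cost(n_atoms: int, l_max: int) -> float:
    """Relative tensor-product cost for N atoms at angular order L."""
    return n_atoms**2 * l_max**6

small = tp_cost(20, 4)   # QH9-scale molecule, def2-SVP
large = tp_cost(100, 6)  # PubChemQH-scale molecule, def2-TZVP

print(f"relative cost: {large / small:.0f}x")  # ~285x
```

The two factors compound: a 5x larger molecule contributes 25x from the $N^2$ term, and the jump from $L=4$ to $L=6$ contributes another ~11x from the $L^6$ term.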
<h2 id="sphnet-architecture-and-the-three-phase-sparsity-scheduler">SPHNet Architecture and the Three-Phase Sparsity Scheduler</h2>
<p><strong>SPHNet</strong> introduces <strong>Adaptive Sparsity</strong> via two gates that prune redundant computations, together with a training curriculum that optimizes them:</p>
<ol>
<li><strong>Sparse Pair Gate</strong>: Learns which atom pairs to include in message passing, adapting the interaction graph based on importance.</li>
<li><strong>Sparse TP Gate</strong>: Filters which spherical harmonic triplets $(l_1, l_2, l_3)$ are computed in tensor product operations, pruning higher-order combinations that contribute less to accuracy.</li>
<li><strong>Three-Phase Sparsity Scheduler</strong>: A training curriculum (Random → Adaptive → Fixed) that enables stable convergence to high-performing sparse subnetworks.</li>
</ol>
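<p>The gating idea behind the Sparse Pair Gate can be sketched in a few lines of numpy. Everything here (shapes, the random vector standing in for the learned linear layer $F_p$) is illustrative, not the paper&rsquo;s implementation:</p>

```python
import numpy as np

# Sketch of the Sparse Pair Gate idea: score every atom pair with a learned
# sigmoid weight, then keep only the top (1 - k) fraction of pairs.
rng = np.random.default_rng(0)

n_atoms, feat_dim, sparsity = 8, 16, 0.5
pair_feats = rng.normal(size=(n_atoms * n_atoms, feat_dim))
w_gate = rng.normal(size=feat_dim)      # stands in for the linear layer F_p

scores = 1.0 / (1.0 + np.exp(-pair_feats @ w_gate))  # sigmoid gate weights
n_keep = int((1.0 - sparsity) * len(scores))
keep_idx = np.argsort(scores)[-n_keep:]              # top (1 - k) pairs

mask = np.zeros(len(scores), dtype=bool)
mask[keep_idx] = True
print(f"kept {mask.sum()} of {len(mask)} pairs")
```

Message passing then runs only over the pairs selected by the mask, which is where the memory and speed savings come from.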
<p>Key insight: The Sparse Pair Gate learns to preserve long-range interactions (16-25 Angstrom) at higher rates than short-range ones. Short-range pairs are abundant and easier to learn, while rare long-range interactions require more samples for accurate representation, making them more critical to retain.</p>
<h2 id="benchmarks-and-ablation-studies">Benchmarks and Ablation Studies</h2>
<p>The authors evaluated SPHNet on three datasets (MD17, QH9, and PubChemQH) with varying molecule sizes and basis set complexities. Baselines include SchNOrb, PhiSNet, QHNet, and WANet. SchNOrb and PhiSNet results are limited to MD17, as those models are designed for trajectory datasets. WANet was not open-sourced, so only partial metrics from its paper are reported.</p>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<ul>
<li><strong>Hamiltonian MAE ($H$)</strong>: Mean absolute error between predicted and DFT-computed Hamiltonian matrices, in Hartrees ($E_h$)</li>
<li><strong>Occupied Orbital Energy MAE ($\epsilon$)</strong>: Mean absolute error of all occupied molecular orbital energies derived from the predicted Hamiltonian</li>
<li><strong>Orbital Coefficient Similarity ($\psi$)</strong>: Cosine similarity of occupied molecular orbital coefficients between predicted and reference wavefunctions</li>
</ul>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Sparse Gates</strong> (on PubChemQH):</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>$H$ [$10^{-6} E_h$] $\downarrow$</th>
          <th>Memory [GB] $\downarrow$</th>
          <th>Speedup $\uparrow$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Both gates</td>
          <td>97.31</td>
          <td>5.62</td>
          <td>7.09x</td>
      </tr>
      <tr>
          <td>Pair Gate only</td>
          <td>87.70</td>
          <td>6.98</td>
          <td>2.73x</td>
      </tr>
      <tr>
          <td>TP Gate only</td>
          <td>94.31</td>
          <td>8.04</td>
          <td>3.98x</td>
      </tr>
      <tr>
          <td>Neither gate</td>
          <td>86.35</td>
          <td>10.91</td>
          <td>1.73x</td>
      </tr>
  </tbody>
</table>
<p>The Sparse Pair Gate contributes a 78% speedup with 30% memory reduction. The Sparse TP Gate (pruning 70% of combinations) yields a 160% speedup. Both gates together achieve the highest speedup, though accuracy slightly decreases compared to no gating.</p>
<p><strong>Three-Phase Scheduler</strong>: Removing the random phase causes convergence to local optima ($112.68 \pm 10.75$ vs $97.31 \pm 0.52$). Removing the adaptive phase increases variance and lowers accuracy ($122.79 \pm 19.02$). Removing the fixed phase has minimal accuracy impact but reduces speedup from 7.09x to 5.45x due to dynamic graph overhead.</p>
<p><strong>Sparsity Rate</strong>: The critical sparsity threshold scales with system complexity: 30% for MD17 (small molecules), 40% for QH9 (medium), and 70% for PubChemQH (large). Beyond the threshold, MAE increases sharply. Computational cost decreases approximately linearly with sparsity rate.</p>
<h3 id="transferability-to-other-models">Transferability to Other Models</h3>
<p>To demonstrate the speedup is architecture-agnostic, the authors applied the Sparse Pair Gate and Sparse TP Gate to the QHNet baseline on PubChemQH:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>$H$ [$10^{-6} E_h$] $\downarrow$</th>
          <th>Memory [GB] $\downarrow$</th>
          <th>Speedup $\uparrow$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QHNet baseline</td>
          <td>123.74</td>
          <td>22.50</td>
          <td>1.00x</td>
      </tr>
      <tr>
          <td>+ TP Gate</td>
          <td>128.16</td>
          <td>12.68</td>
          <td>2.04x</td>
      </tr>
      <tr>
          <td>+ Pair Gate</td>
          <td>126.27</td>
          <td>10.07</td>
          <td>1.66x</td>
      </tr>
      <tr>
          <td>+ Both gates</td>
          <td>128.89</td>
          <td>8.46</td>
          <td>3.30x</td>
      </tr>
  </tbody>
</table>
<p>The gates reduced QHNet&rsquo;s memory by 62% and improved speed by 3.3x with modest accuracy trade-off, confirming the gates are portable modules applicable to other SE(3)-equivariant architectures.</p>
<h2 id="performance-results">Performance Results</h2>
<h3 id="qh9-134k-molecules-leq-20-atoms">QH9 (134k molecules, $\leq$ 20 atoms)</h3>
<p>SPHNet achieves 3.3x to 4.0x speedup over QHNet across all four QH9 splits, with improved Hamiltonian MAE and orbital energy MAE. Memory drops to 0.23 GB/sample (33% of QHNet&rsquo;s 0.70 GB). On the stable-iid split, Hamiltonian MAE improves from 76.31 to 45.48 ($10^{-6} E_h$).</p>
<h3 id="pubchemqh-50k-molecules-40-100-atoms">PubChemQH (50k molecules, 40-100 atoms)</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$H$ [$10^{-6} E_h$] $\downarrow$</th>
          <th>$\epsilon$ [$E_h$] $\downarrow$</th>
          <th>$\psi$ [$10^{-2}$] $\uparrow$</th>
          <th>Memory [GB] $\downarrow$</th>
          <th>Speedup $\uparrow$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QHNet</td>
          <td>123.74</td>
          <td>3.33</td>
          <td>2.32</td>
          <td>22.5</td>
          <td>1.0x</td>
      </tr>
      <tr>
          <td>WANet</td>
          <td>99.98</td>
          <td><strong>1.17</strong></td>
          <td><strong>3.13</strong></td>
          <td>15.0</td>
          <td>2.4x</td>
      </tr>
      <tr>
          <td>SPHNet</td>
          <td><strong>97.31</strong></td>
          <td>2.16</td>
          <td>2.97</td>
          <td><strong>5.62</strong></td>
          <td><strong>7.1x</strong></td>
      </tr>
  </tbody>
</table>
<p>SPHNet achieves the best Hamiltonian MAE and efficiency, though WANet outperforms on orbital energy MAE and coefficient similarity. The higher speedup on PubChemQH (vs QH9) reflects greater computational redundancy in larger systems with higher-order basis sets ($L_{max} = 6$ for def2-TZVP vs $L_{max} = 4$ for def2-SVP).</p>
<h3 id="md17-small-molecule-trajectories">MD17 (Small Molecule Trajectories)</h3>
<p>SPHNet achieves accuracy comparable to QHNet and PhiSNet on four MD17 molecules (water, ethanol, malondialdehyde, uracil; 3-12 atoms). MD17 represents a simpler task where baseline models already perform well, leaving limited room for improvement. For water (3 atoms), the number of interaction combinations is inherently small, limiting the benefit of adaptive sparsification.</p>
<h3 id="scaling-limit">Scaling Limit</h3>
<p>SPHNet can train on systems with approximately 3000 atomic orbitals on a single A6000 GPU; the QHNet baseline runs out of memory at approximately 1800 orbitals. SPHNet&rsquo;s memory consumption also grows more slowly than the baseline&rsquo;s as molecule size increases.</p>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li><strong>Adaptive sparsity scales with system complexity</strong>: The method is most effective for large systems where redundancy is high. For small molecules (e.g., water with only 3 atoms), every interaction is critical, so pruning hurts accuracy and yields negligible speedup.</li>
<li><strong>Long-range pair preservation</strong>: The Sparse Pair Gate selects long-range pairs (16-25 Angstrom) at higher rates than short-range ones. Short-range pairs are numerous and easier to learn, while rare long-range interactions are harder to represent and thus more critical to retain.</li>
<li><strong>Generalizable components</strong>: The sparsification techniques are portable modules, demonstrated by successful integration into QHNet with 3.3x speedup.</li>
<li><strong>Architecture ablation</strong>: Removing one Vectorial Node Interaction block or Spherical Node Interaction block significantly hurts accuracy, confirming the importance of the progressive order-increase design. Removing one Pair Construction block has less impact, suggesting room for further speedup.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/microsoft/SPHNet">SPHNet (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation; archived by Microsoft (Dec 2025), read-only</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/EperLuo/PubChemQH">PubChemQH (Hugging Face)</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>50k molecules, 40-100 atoms, def2-TZVP basis</td>
      </tr>
  </tbody>
</table>
<p>No pre-trained model weights are provided. MD17 and QH9 are publicly available community datasets. Training requires 4x NVIDIA A100 (80GB) GPUs; benchmarking uses a single NVIDIA RTX A6000 (46GB).</p>
<h3 id="data">Data</h3>
<p>The experiments evaluated SPHNet on three datasets with different molecular sizes and basis set complexities. All datasets use DFT calculations as ground truth, with MD17 using the PBE exchange-correlation functional and QH9/PubChemQH using B3LYP.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Molecule Size</th>
          <th>Basis Set</th>
          <th>$L_{max}$</th>
          <th>Functional</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MD17</td>
          <td>4 systems</td>
          <td>3-12 atoms (water, ethanol, malondialdehyde, uracil)</td>
          <td>def2-SVP</td>
          <td>4</td>
          <td>PBE</td>
      </tr>
      <tr>
          <td>QH9</td>
          <td>134k</td>
          <td>$\leq$ 20 atoms (Stable/Dynamic splits)</td>
          <td>def2-SVP</td>
          <td>4</td>
          <td>B3LYP</td>
      </tr>
      <tr>
          <td>PubChemQH</td>
          <td>50k</td>
          <td>40-100 atoms</td>
          <td>def2-TZVP</td>
          <td>6</td>
          <td>B3LYP</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Availability</strong>:</p>
<ul>
<li><strong>MD17 &amp; QH9</strong>: Publicly available</li>
<li><strong>PubChemQH</strong>: Publicly available on Hugging Face (<a href="https://huggingface.co/datasets/EperLuo/PubChemQH">EperLuo/PubChemQH</a>)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Loss Function</strong>:</p>
<p>The model learns the <strong>residual</strong> $\Delta H$:</p>
<p>$$
\begin{aligned}
\Delta H &amp;= H_{\text{ref}} - H_{\text{init}} \\
\mathcal{L} &amp;= \text{MAE}(H_{\text{ref}}, H_{\text{pred}}) + \text{MSE}(H_{\text{ref}}, H_{\text{pred}})
\end{aligned}
$$</p>
<p>where $H_{\text{init}}$ is a computationally inexpensive initial guess computed via PySCF, so the full prediction is reconstructed as $H_{\text{pred}} = H_{\text{init}} + \Delta H_{\text{pred}}$.</p>
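<p>A minimal numpy sketch of this residual setup and the combined MAE + MSE loss, under my reading of the equations above (toy matrices, not the authors&rsquo; code):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6  # toy orbital count

H_init = rng.normal(size=(n, n))                 # cheap initial guess (PySCF in the paper)
H_ref = H_init + 0.1 * rng.normal(size=(n, n))   # DFT reference Hamiltonian
delta_pred = rng.normal(scale=0.1, size=(n, n))  # network output: the residual

H_pred = H_init + delta_pred                     # reconstruct the full Hamiltonian
loss = np.abs(H_ref - H_pred).mean() + ((H_ref - H_pred) ** 2).mean()
print(f"loss = {loss:.4f}")
```

Learning the residual rather than the full matrix means the network only has to model the (much smaller) correction on top of the cheap initial guess.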
<p><strong>Hyperparameters</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>PubChemQH</th>
          <th>QH9</th>
          <th>MD17</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Batch Size</td>
          <td>8</td>
          <td>32</td>
          <td>10 (uracil: 5)</td>
      </tr>
      <tr>
          <td>Training Steps</td>
          <td>300k</td>
          <td>260k</td>
          <td>200k</td>
      </tr>
      <tr>
          <td>Warmup Steps</td>
          <td>1k</td>
          <td>1k</td>
          <td>1k</td>
      </tr>
      <tr>
          <td>Learning Rate</td>
          <td>1e-3</td>
          <td>1e-3</td>
          <td>5e-4</td>
      </tr>
      <tr>
          <td>Sparsity Rate</td>
          <td>0.7</td>
          <td>0.4</td>
          <td>0.1-0.3</td>
      </tr>
      <tr>
          <td>TSS Epoch $t$</td>
          <td>3</td>
          <td>3</td>
          <td>3</td>
      </tr>
  </tbody>
</table>
<p><strong>Sparse Pair Gate</strong>: Adapts the interaction graph. It concatenates zero-order features and inner products of atom pairs, then passes them through a linear layer $F_p$ with sigmoid activation to learn a weight $W_p^{ij}$ for every pair. Pairs are kept only if selected by the scheduler ($U_p^{TSS}$). The overhead comes primarily from the linear layer $F_p$.</p>
<p><strong>Sparse TP Gate</strong>: Filters triplets $(l_1, l_2, l_3)$ inside the TP operation. Higher-order combinations are more likely to be pruned. Complexity: $\mathcal{O}(L^3)$.</p>
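<p>The candidate triplets are constrained by the angular-momentum selection rule $|l_1 - l_2| \le l_3 \le l_1 + l_2$. A quick count (my own illustration) shows how the candidate set grows with $L_{max}$, consistent with the cubic complexity noted above:</p>

```python
def count_triplets(l_max: int) -> int:
    """Number of valid (l1, l2, l3) tensor-product paths up to l_max,
    subject to the triangle inequality |l1 - l2| <= l3 <= l1 + l2."""
    return sum(
        1
        for l1 in range(l_max + 1)
        for l2 in range(l_max + 1)
        for l3 in range(abs(l1 - l2), min(l1 + l2, l_max) + 1)
    )

print(count_triplets(4))  # 65 paths at the def2-SVP order
print(count_triplets(6))  # 175 paths at the def2-TZVP order
```

At the PubChemQH sparsity rate of 70%, only about 52 of those 175 paths would survive pruning.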
<p><strong>Three-Phase Sparsity Scheduler</strong>: Training curriculum designed to optimize the sparse gates effectively:</p>
<ul>
<li><strong>Phase 1 (Random)</strong>: Random selection ($1-k$ probability) to ensure unbiased weight updates. Complexity: $\mathcal{O}(|U|)$.</li>
<li><strong>Phase 2 (Adaptive)</strong>: Selects the top $(1-k)$ fraction of units by learned magnitude. Complexity: $\mathcal{O}(|U|\log|U|)$.</li>
<li><strong>Phase 3 (Fixed)</strong>: Freezes the connectivity mask for maximum inference speed. No overhead.</li>
</ul>
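<p>The three selection rules above can be sketched as one function of the training phase, applied to a set of candidate units $U$ (pairs or TP triplets) with learned magnitudes $w$ and sparsity rate $k$. Names and structure are illustrative, not the official implementation:</p>

```python
import numpy as np

def select(phase, w, k, rng, frozen_mask=None):
    """Return a boolean keep-mask over |U| units for the given phase."""
    n_keep = int((1.0 - k) * len(w))
    if phase == "random":                  # unbiased weight updates early on
        idx = rng.choice(len(w), size=n_keep, replace=False)
    elif phase == "adaptive":              # keep top (1 - k) by learned magnitude
        idx = np.argsort(w)[-n_keep:]
    else:                                  # "fixed": reuse the frozen mask
        return frozen_mask
    mask = np.zeros(len(w), dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(0)
w = rng.random(100)                        # learned sparsity magnitudes
m_random = select("random", w, k=0.7, rng=rng)
m_adaptive = select("adaptive", w, k=0.7, rng=rng)
m_fixed = select("fixed", w, k=0.7, rng=rng, frozen_mask=m_adaptive)
```

Freezing the mask in the final phase removes the per-step selection overhead, which matches the ablation result that the fixed phase mainly buys speed rather than accuracy.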
<p><strong>Weight Initialization</strong>: Learnable sparsity weights ($W$) initialized as all-ones vector.</p>
<h3 id="models">Models</h3>
<p>The model predicts the Hamiltonian matrix $H$ from atomic numbers $Z$ and coordinates $r$.</p>
<p><strong>Inputs</strong>: Atomic numbers ($Z$) and 3D coordinates.</p>
<p><strong>Backbone Structure</strong>:</p>
<ol>
<li><strong>Vectorial Node Interaction (x4)</strong>: Uses long-short range message passing. Extracts vectorial representations ($l=1$) without high-order TPs to save cost.</li>
<li><strong>Spherical Node Interaction (x2)</strong>: Projects features to high-order spherical harmonics (up to $L_{max}$). The first block increases the maximum order from 0 to $L_{max}$ without the Sparse Pair Gate; the second block applies the <strong>Sparse Pair Gate</strong> to filter node pairs.</li>
<li><strong>Pair Construction Block (x2)</strong>: Splits into <strong>Diagonal</strong> (self-interaction) and <strong>Non-Diagonal</strong> (cross-interaction) blocks. Both use the <strong>Sparse TP Gate</strong> to prune cross-order combinations $(l_1, l_2, l_3)$. The Non-Diagonal blocks also use the <strong>Sparse Pair Gate</strong> to filter atom pairs. The two Pair Construction blocks receive representations from the two Spherical Node Interaction blocks respectively, and their outputs are summed.</li>
<li><strong>Expansion Block</strong>: Reconstructs the full Hamiltonian matrix from the sparse irreducible representations, exploiting symmetry ($H_{ji} = H_{ij}^T$) to halve computations.</li>
</ol>
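<p>The symmetry trick in the Expansion block can be illustrated with a toy block-structured matrix (block sizes and values are my choice, not the paper&rsquo;s): only atom-pair blocks with $i \le j$ are produced, and the lower triangle is filled by transposition.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms, bs = 3, 2                        # 3 atoms, 2 orbitals each (toy sizes)
H = np.zeros((n_atoms * bs, n_atoms * bs))

for i in range(n_atoms):
    for j in range(i, n_atoms):           # only upper-triangular atom pairs
        block = rng.normal(size=(bs, bs))
        if i == j:
            block = 0.5 * (block + block.T)        # diagonal blocks symmetric
        H[i*bs:(i+1)*bs, j*bs:(j+1)*bs] = block
        H[j*bs:(j+1)*bs, i*bs:(i+1)*bs] = block.T  # H_ji = H_ij^T
```

This halves the number of blocks that must be predicted while guaranteeing a symmetric Hamiltonian.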
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 4x NVIDIA A100 (80GB)</li>
<li><strong>Benchmarking</strong>: Single NVIDIA RTX A6000 (46GB)</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, E., Wei, X., Huang, L., Li, Y., Yang, H., Xia, Z., Wang, Z., Liu, C., Shao, B., &amp; Zhang, J. (2025). Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity. <em>Proceedings of the 42nd International Conference on Machine Learning</em>, PMLR 267:41368&ndash;41390.</p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{luo2025efficient,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Erpai and Wei, Xinran and Huang, Lin and Li, Yunyang and Yang, Han and Xia, Zaishuo and Wang, Zun and Liu, Chang and Shao, Bin and Zhang, Jia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{41368--41390}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://icml.cc/virtual/2025/poster/45656">ICML 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=K3lykWhXON">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=K3lykWhXON">PDF on OpenReview</a></li>
<li><a href="https://github.com/microsoft/SPHNet">GitHub Repository</a> <em>(Note: The official repository was archived by Microsoft in December 2025. It is available for reference but no longer actively maintained.)</em></li>
</ul>
]]></content:encoded></item><item><title>Beyond Atoms: 3D Space Modeling for Molecular Pretraining</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/beyond-atoms/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/beyond-atoms/</guid><description>Lu et al. introduce SpaceFormer, a Transformer that models entire 3D molecular space including atoms for superior representations.</description><content:encoded><![CDATA[<h2 id="paper-typology-and-contribution">Paper Typology and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It challenges the atom-centric paradigm of molecular representation learning by proposing a novel framework that models the continuous 3D space surrounding atoms. The core contribution is <strong>SpaceFormer</strong>, a Transformer-based architecture that discretizes molecular space into grids to capture physical phenomena (electron density, electromagnetic fields) often missed by traditional point-cloud models.</p>
<h2 id="the-physical-intuition-modeling-empty-space">The Physical Intuition: Modeling &ldquo;Empty&rdquo; Space</h2>
<p><strong>The Gap</strong>: Prior 3D molecular representation models, such as Uni-Mol, treat molecules as discrete sets of atoms, essentially point clouds in 3D space. However, from a quantum physics perspective, the &ldquo;empty&rdquo; space between atoms is far from empty. It is permeated by electron density distributions and electromagnetic fields that determine molecular properties.</p>
<p><strong>The Hypothesis</strong>: Explicitly modeling this continuous 3D space alongside discrete atom positions yields superior representations for downstream tasks, particularly for computational properties that depend on electronic structure, such as HOMO/LUMO energies and energy gaps.</p>
<h2 id="a-surprising-observation-virtual-points-improve-representations">A Surprising Observation: Virtual Points Improve Representations</h2>
<p>Before proposing SpaceFormer, the authors present a simple yet revealing experiment. They augment Uni-Mol by adding randomly sampled virtual points (VPs) from the 3D space within the circumscribed cuboid of each molecule. These VPs carry no chemical information whatsoever: they are purely random noise points.</p>
<p>The result is surprising: adding just 10 random VPs already yields a noticeable improvement in validation loss. The improvement remains consistent and gradually increases as the number of VPs grows, eventually reaching a plateau. This observation holds across downstream tasks as well, with Uni-Mol + VPs improving on several quantum property predictions (LUMO, E1-CC2, E2-CC2) compared to vanilla Uni-Mol.</p>
<p>The implication is that even uninformative spatial context helps the model learn better representations, motivating a principled framework for modeling the full 3D molecular space.</p>
<h2 id="spaceformer-voxelization-and-3d-positional-encodings">SpaceFormer: Voxelization and 3D Positional Encodings</h2>
<p>The key innovation is treating the molecular representation problem as <strong>3D space modeling</strong>. SpaceFormer follows these core steps:</p>
<ol>
<li><strong>Voxelizes the entire 3D space</strong> into a grid with cells of $0.49\text{\AA}$ (based on O-H bond length to ensure at most one atom per cell).</li>
<li><strong>Uses adaptive multi-resolution grids</strong> to efficiently handle empty space, keeping it fine-grained near atoms and coarse-grained far away.</li>
<li><strong>Applies Transformers to 3D spatial tokens</strong> with custom positional encodings that achieve linear complexity.</li>
</ol>
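<p>The voxelization step can be sketched with a toy water geometry (approximate coordinates, my illustration): atoms map to integer cell indices at the $0.49\text{\AA}$ resolution, with each atom landing in its own cell and carrying an inner-cell offset.</p>

```python
import numpy as np

CELL = 0.49  # grid cell length in Angstrom, per the paper

coords = np.array([
    [ 0.000, 0.000, 0.0],   # O
    [ 0.757, 0.586, 0.0],   # H
    [-0.757, 0.586, 0.0],   # H
])

cells = np.floor(coords / CELL).astype(int)  # integer cell indices
offsets = coords - cells * CELL              # inner-cell positions, in [0, CELL)

unique_cells = {tuple(c) for c in cells}
print(cells)
```

Because the cell size is below the shortest bond length, no two atoms share a cell, so each token unambiguously encodes one atom (or NULL for empty space).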
<p>Specifically, the model utilizes two forms of 3D Positional Encoding:</p>
<p><strong>3D Directional PE (RoPE Extension)</strong>
They extend Rotary Positional Encoding (RoPE) to 3D continuous space by splitting the Query and Key vectors into three blocks (one for each spatial axis). The directional attention mechanism takes the form:</p>
<p>$$
\begin{aligned}
\mathbf{q}_{i}^{\top} \mathbf{k}_{j} = \sum_{s=1}^{3} \mathbf{q}_{i,s}^{\top} \mathbf{R}(c_{j,s} - c_{i,s}) \mathbf{k}_{j,s}
\end{aligned}
$$</p>
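<p>The relative-position property behind this construction can be checked numerically on a single axis (my illustration, one 2D rotation block): rotating $\mathbf{q}$ and $\mathbf{k}$ by their own coordinates and taking the dot product equals applying a single rotation by the coordinate difference.</p>

```python
import numpy as np

def rot(v, angle):
    """Apply a 2D rotation by `angle` to vector v."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
ci, cj = 0.7, -1.3   # coordinates of tokens i and j along one axis

lhs = rot(q, ci) @ rot(k, cj)   # rotate each by its absolute coordinate
rhs = q @ rot(k, cj - ci)       # single relative rotation R(c_j - c_i)
print(lhs, rhs)
```

This is why the attention score depends only on $c_{j,s} - c_{i,s}$, making the encoding translation-invariant along each axis.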
<p><strong>3D Distance PE (RFF Approximation)</strong>
To compute invariant geometric distance without incurring quadratic memory overhead, they use Random Fourier Features (RFF) to approximate a Gaussian kernel of pairwise distances:</p>
<p>$$
\begin{aligned}
\exp \left( - \frac{| \mathbf{c}_i - \mathbf{c}_j |_2^2}{2\sigma^2} \right) &amp;\approx z(\mathbf{c}_i)^\top z(\mathbf{c}_j) \\
z(\mathbf{c}_i) &amp;= \sqrt{\frac{2}{d}} \cos(\sigma^{-1} \mathbf{c}_i^\top \boldsymbol{\omega} + \mathbf{b})
\end{aligned}
$$</p>
<p>This approach enables the model to natively encode complex field-like phenomena without computing exhaustive $O(N^2)$ distance matrices.</p>
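<p>The quality of the RFF approximation is easy to check numerically. This sketch (my illustration, with rows of $\boldsymbol{\omega}$ drawn from $\mathcal{N}(0, I)$ and $\mathbf{b} \sim U[0, 2\pi]$, as in standard RFF constructions) compares the exact Gaussian kernel with the feature inner product:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 20000, 2.0   # illustrative feature dimension and bandwidth

omega = rng.normal(size=(d, 3))              # frequency samples, scaled by 1/sigma below
b = rng.uniform(0.0, 2.0 * np.pi, size=d)    # random phase offsets

def z(c):
    """Random Fourier feature map for a 3D coordinate c."""
    return np.sqrt(2.0 / d) * np.cos((omega @ c) / sigma + b)

ci = np.array([0.3, -1.2, 0.8])
cj = np.array([1.1, 0.4, -0.5])

exact = np.exp(-np.sum((ci - cj) ** 2) / (2.0 * sigma**2))
approx = z(ci) @ z(cj)
print(f"exact={exact:.4f}  approx={approx:.4f}")
```

Because each token's feature vector $z(\mathbf{c}_i)$ is computed independently, pairwise distances never have to be materialized, which is the source of the linear scaling.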
<h2 id="experimental-setup-and-downstream-tasks">Experimental Setup and Downstream Tasks</h2>
<p><strong>Pretraining Data</strong>: 19 million unlabeled molecules from the same dataset used by Uni-Mol.</p>
<p><strong>Downstream Benchmarks</strong>: The authors propose a new benchmark of 15 tasks, motivated by known limitations of MoleculeNet: invalid structures, inconsistent chemical representations, data curation errors, and an inability to adequately distinguish model performance. The tasks split into two categories:</p>
<ol>
<li>
<p><strong>Computational Properties (Quantum Mechanics)</strong></p>
<ul>
<li>Subsets of <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> (HOMO, LUMO, GAP energy prediction, 20K samples; E1-CC2, E2-CC2, f1-CC2, f2-CC2, 21.7K samples)</li>
<li>Cata-condensed polybenzenoid hydrocarbons (Dipole moment, adiabatic ionization potential, D3 dispersion correction, 8,678 samples)</li>
<li>Metric: Mean Absolute Error (MAE)</li>
</ul>
</li>
<li>
<p><strong>Experimental Properties (Pharma/Bio)</strong></p>
<ul>
<li>MoleculeNet tasks (BBBP, BACE for drug discovery)</li>
<li>Biogen ADME tasks (HLM, MME, Solubility)</li>
<li>Metrics: AUC for classification, MAE for regression</li>
</ul>
</li>
</ol>
<p><strong>Splitting Strategy</strong>: All datasets use 8:1:1 train/validation/test ratio with <strong>scaffold splitting</strong> to test out-of-distribution generalization.</p>
<p><strong>Training Setup</strong>:</p>
<ul>
<li><strong>Objective</strong>: Masked Auto-Encoder (MAE) with 30% random masking. Model predicts whether a cell contains an atom, and if so, regresses both atom type and precise offset position.</li>
<li><strong>Hardware</strong>: ~50 hours on 8 NVIDIA A100 GPUs</li>
<li><strong>Optimizer</strong>: Adam ($\beta_1=0.9, \beta_2=0.99$)</li>
<li><strong>Learning Rate</strong>: Peak 1e-4 with linear decay and 0.01 warmup ratio</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>Total Updates</strong>: 1 million</li>
</ul>
<p><strong>Baseline Comparisons</strong>: GROVER (2D graph-based MPR), GEM (2D graph enhanced with 3D information), 3D Infomax (GNN with 3D information), Uni-Mol (3D MPR, primary baseline using the same pretraining dataset), and Mol-AE (extends Uni-Mol with atom-based MAE pretraining).</p>
<h2 id="results-and-analysis">Results and Analysis</h2>
<p><strong>Strong Contextual Performance</strong>: SpaceFormer ranked 1st in 10 of 15 tasks and in the top 2 for 14 of 15 tasks. It surpassed the runner-up models by approximately 20% on quantum property tasks (HOMO, LUMO, GAP, E1-CC2, Dipmom), validating that modeling non-atom space captures electronic structure better than atom-only regimes.</p>
<h3 id="key-results-on-quantum-properties">Key Results on Quantum Properties</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GROVER</th>
          <th>GEM</th>
          <th>3D Infomax</th>
          <th>Uni-Mol</th>
          <th>Mol-AE</th>
          <th><strong>SpaceFormer</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HOMO (Ha)</td>
          <td>0.0075</td>
          <td>0.0068</td>
          <td>0.0065</td>
          <td>0.0052</td>
          <td>0.0050</td>
          <td><strong>0.0042</strong></td>
      </tr>
      <tr>
          <td>LUMO (Ha)</td>
          <td>0.0086</td>
          <td>0.0080</td>
          <td>0.0070</td>
          <td>0.0060</td>
          <td>0.0057</td>
          <td><strong>0.0040</strong></td>
      </tr>
      <tr>
          <td>GAP (Ha)</td>
          <td>0.0109</td>
          <td>0.0107</td>
          <td>0.0095</td>
          <td>0.0081</td>
          <td>0.0080</td>
          <td><strong>0.0064</strong></td>
      </tr>
      <tr>
          <td>E1-CC2 (eV)</td>
          <td>0.0101</td>
          <td>0.0090</td>
          <td>0.0089</td>
          <td>0.0067</td>
          <td>0.0070</td>
          <td><strong>0.0058</strong></td>
      </tr>
      <tr>
          <td>Dipmom (Debye)</td>
          <td>0.0752</td>
          <td>0.0289</td>
          <td>0.0291</td>
          <td>0.0106</td>
          <td>0.0113</td>
          <td><strong>0.0083</strong></td>
      </tr>
  </tbody>
</table>
<p>SpaceFormer&rsquo;s advantage is most pronounced on computational properties that depend on electronic structure. On experimental biological tasks (e.g., BBBP), where measurements are noisy, the advantage narrows or reverses: Uni-Mol achieves 0.9066 AUC on BBBP compared to SpaceFormer&rsquo;s 0.8605.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The authors present several ablations that isolate the source of SpaceFormer&rsquo;s improvements:</p>
<p><strong>MAE vs. Denoising</strong>: SpaceFormer with MAE pretraining outperforms SpaceFormer with denoising on all four ablation tasks. The MAE objective requires predicting <em>whether</em> an atom exists in a masked voxel, which forces the model to learn global structural dependencies. In the denoising variant, only atom cells are masked so the model never needs to predict atom existence, reducing the task to coordinate regression.</p>
<p><strong>FLOPs Control</strong>: A SpaceFormer-Large model (4x width, atom-only) trained with comparable FLOPs still falls short of SpaceFormer with 1000 non-atom cells on most downstream tasks. This confirms the improvement comes from modeling 3D space, not from additional compute.</p>
<p><strong>Virtual Points vs. SpaceFormer</strong>: Adding up to 200 random virtual points to Uni-Mol improves some tasks but leaves a significant gap compared to SpaceFormer, demonstrating that principled space discretization outperforms naive point augmentation.</p>
<p><strong>Efficiency Validation</strong>: The Adaptive Grid Merging method reduces the number of cells by roughly 10x with virtually no performance degradation. The 3D positional encodings scale linearly with the number of cells, while Uni-Mol&rsquo;s pretraining cost scales quadratically.</p>
<h3 id="scope-and-future-directions">Scope and Future Directions</h3>
<p>SpaceFormer does not incorporate built-in SE(3) equivariance, relying instead on data augmentation (random rotations and random boundary padding) during training. The authors identify extending SpaceFormer to force field tasks and larger systems such as proteins and complexes as promising future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="code-and-data-availability">Code and Data Availability</h3>
<ul>
<li><strong>Source Code</strong>: At the time of writing, the authors have not released the official source code or pre-trained weights.</li>
<li><strong>Datasets</strong>: Pretraining utilized the same 19M unlabeled molecule dataset as Uni-Mol. Downstream tasks use a newly curated internal benchmark built from subsets of GDB-17, MoleculeNet, and Biogen ADME. The exact customized scaffold splits for these evaluations are pending the official code release.</li>
<li><strong>Compute</strong>: Pretraining the base SpaceFormer encoder (~67.8M parameters, configured to merge level 3) required approximately 50 hours on 8 NVIDIA A100 GPUs.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source code</td>
          <td>Code</td>
          <td>N/A</td>
          <td>Not publicly released as of March 2026</td>
      </tr>
      <tr>
          <td>Pre-trained weights</td>
          <td>Model</td>
          <td>N/A</td>
          <td>Not publicly released</td>
      </tr>
      <tr>
          <td>Pretraining data (19M molecules)</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Same dataset as Uni-Mol; not independently released</td>
      </tr>
      <tr>
          <td>Downstream benchmark splits</td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Custom scaffold splits pending code release</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>The model treats a molecule as a 3D &ldquo;image&rdquo; via voxelization, processed by a Transformer.</p>
<p><strong>Input Representation</strong>:</p>
<ul>
<li><strong>Discretization</strong>: 3D space divided into grid cells with length <strong>$0.49\text{\AA}$</strong> (based on O-H bond length to ensure at most one atom per cell)</li>
<li><strong>Tokenization</strong>: Tokens are pairs $(t_i, c_i)$ where $t_i$ is atom type (or NULL) and $c_i$ is the coordinate</li>
<li><strong>Embeddings</strong>: Continuous embeddings with dimension 512. Inner-cell positions discretized with $0.01\text{\AA}$ precision</li>
</ul>
<p><strong>Transformer Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>Embedding Dim</th>
          <th>FFN Dim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Encoder</strong></td>
          <td>16</td>
          <td>8</td>
          <td>512</td>
          <td>2048</td>
      </tr>
      <tr>
          <td><strong>Decoder</strong> (MAE)</td>
          <td>4</td>
          <td>4</td>
          <td>256</td>
          <td>1024</td>
      </tr>
  </tbody>
</table>
<p><strong>Attention Mechanism</strong>: FlashAttention for efficient handling of large sequence lengths.</p>
<p><strong>Positional Encodings</strong>:</p>
<ol>
<li><strong>3D Directional PE</strong>: Extension of Rotary Position Embedding (RoPE) to 3D continuous space, capturing relative directionality</li>
<li><strong>3D Distance PE</strong>: Random Fourier Features (RFF) to approximate Gaussian kernel of pairwise distances with linear complexity</li>
</ol>
<h4 id="visualizing-rff-and-rope">Visualizing RFF and RoPE</h4>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-rff-rope-visualization.webp"
         alt="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         title="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Visual intuition for SpaceFormer&rsquo;s positional encodings: Top row shows RFF distance encoding (Gaussian-like attention decay and high-frequency feature fingerprints). Bottom row shows RoPE directional encoding (vector rotation fields and resulting attention patterns).</figcaption>
    
</figure>

<p><strong>Top Row (Distance / RFF):</strong> Shows how the model learns &ldquo;closeness.&rdquo; Distance is represented by a complex &ldquo;fingerprint&rdquo; of waves that creates a Gaussian-like force field.</p>
<ul>
<li><strong>Top Left (The Force Field):</strong> The attention score (dot product) naturally forms a Gaussian curve. It is high when atoms are close and decays to zero as they move apart. This mimics physical forces without the model needing to learn that math from scratch.</li>
<li><strong>Top Right (The Fingerprint):</strong> Each dimension oscillates at a different frequency. A specific distance (e.g., $d=2$) has a unique combination of high and low values across these dimensions, creating a unique &ldquo;fingerprint&rdquo; for that exact distance.</li>
</ul>
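<p>The Gaussian-like decay in the top row is the standard random-Fourier-feature identity: the dot product of two RFF vectors approximates a Gaussian kernel of the distance between their inputs. A generic numerical check (dimensions and kernel width here are illustrative, not the paper's settings):</p>

```python
import numpy as np

def rff_features(x, W, b):
    """Random Fourier features: phi(x) . phi(y) approximates
    exp(-||x - y||^2 / (2 sigma^2)) in expectation."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(x @ W.T + b)

rng = np.random.default_rng(0)
D, sigma = 4096, 1.0
W = rng.normal(0.0, 1.0 / sigma, size=(D, 3))   # frequencies ~ N(0, 1/sigma^2)
b = rng.uniform(0.0, 2 * np.pi, size=D)         # random phase per dimension

p = np.array([0.0, 0.0, 0.0])
q = np.array([1.0, 0.0, 0.0])
approx = rff_features(p, W, b) @ rff_features(q, W, b)
exact = np.exp(-np.sum((p - q) ** 2) / (2 * sigma ** 2))
# approx converges to exact as D grows, with O(1/sqrt(D)) error
```

<p>Because the kernel is approximated by a plain dot product of per-atom features, pairwise-distance information enters attention at linear rather than quadratic cost in the feature computation.</p>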
<p><strong>Bottom Row (Direction / RoPE):</strong> Shows how the model learns &ldquo;relative position.&rdquo; It visualizes the vector rotation and how that creates a grid-like attention pattern.</p>
<ul>
<li><strong>Bottom Left (The Rotation):</strong> This visualizes the &ldquo;X-axis chunk&rdquo; of the vector. As you move from left ($x=-3$) to right ($x=3$), the arrows rotate. The model compares angles between atoms to determine relative positions.</li>
<li><strong>Bottom Right (The Grid):</strong> The resulting attention pattern when combining X-rotations and Y-rotations. The red/blue regions show where the model pays attention relative to the center, forming a grid-like interference pattern that distinguishes relative positions (e.g., &ldquo;top-right&rdquo; vs &ldquo;bottom-left&rdquo;).</li>
</ul>
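<p>The relative-position property behind the bottom row can be demonstrated on a single axis. This is a minimal sketch of standard RoPE applied to one continuous coordinate (the frequency schedule is illustrative; the paper extends the idea to all three axes):</p>

```python
import numpy as np

def rope_1d(v, pos):
    """Rotate successive 2D chunks of v by angles proportional to a
    continuous coordinate, one frequency per chunk (standard RoPE)."""
    freqs = 1.0 / (10.0 ** np.arange(len(v) // 2))  # per-chunk frequencies
    angles = pos * freqs
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(v)
    out[0::2] = c * v[0::2] - s * v[1::2]
    out[1::2] = s * v[0::2] + c * v[1::2]
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

# The attention score depends only on the relative offset x1 - x2:
lhs = rope_1d(q, 1.3) @ rope_1d(k, 0.4)
rhs = rope_1d(q, 1.3 - 0.4) @ k
```

<p>Since rotations compose, the query and key rotations cancel down to their difference, which is exactly why the attention pattern encodes &ldquo;top-right vs. bottom-left&rdquo; rather than absolute placement.</p>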
<h4 id="adaptive-grid-merging">Adaptive Grid Merging</h4>
<p>To make the 3D grid approach computationally tractable, two key strategies are employed:</p>
<ol>
<li><strong>Grid Sampling</strong>: Randomly selecting 10-20% of empty cells during training</li>
<li><strong>Adaptive Grid Merging</strong>: Recursively merging $2 \times 2 \times 2$ blocks of empty cells into larger &ldquo;coarse&rdquo; cells, creating a multi-resolution view that is fine-grained near atoms and coarse-grained in empty space (merging set to Level 3)</li>
</ol>
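<p>A sketch of the merging logic (my own greedy bottom-up implementation of the idea, not the authors' code): at each level, a block of eight empty cells collapses into one coarse cell, while blocks touching an atom leave their empty children behind as finer tokens.</p>

```python
import numpy as np

def adaptive_merge(occupied, max_level=3):
    """Count tokens per level after recursive 2x2x2 merging of empty cells.

    occupied: boolean 3D array; side length divisible by 2**max_level.
    Level 0 holds every occupied cell plus empty cells whose block
    contains an atom; higher levels hold fully-empty merged blocks.
    """
    counts = {0: int(occupied.sum())}
    avail = ~occupied                       # empty cells, candidates for merging
    for level in range(1, max_level + 1):
        s = avail.shape[0] // 2
        grouped = avail.reshape(s, 2, s, 2, s, 2)
        merged = grouped.all(axis=(1, 3, 5))               # fully-empty blocks
        leftover = grouped.sum(axis=(1, 3, 5)) - 8 * merged  # empties that can't merge
        counts[level - 1] = counts.get(level - 1, 0) + int(leftover.sum())
        avail = merged
    counts[max_level] = int(avail.sum())
    return counts

occ = np.zeros((8, 8, 8), dtype=bool)   # 8x8x8 fine grid, one atom at a corner
occ[0, 0, 0] = True
counts = adaptive_merge(occ)            # token count at each level
```

<p>For this toy grid the dense representation needs 512 tokens while the merged one needs 22, illustrating the order-of-magnitude savings described above.</p>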
<p><strong>Visualizing Adaptive Grid Merging</strong>:</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-merging.webp"
         alt="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         title="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging demonstrated on H₂O. Red cells (Level 0) contain atoms and remain at full resolution. Progressively darker blue cells represent merged empty regions at higher levels, covering the same volume with fewer tokens.</figcaption>
    
</figure>

<p>The adaptive grid process compresses empty space around molecules while maintaining high resolution near atoms:</p>
<ul>
<li><strong>Red Cells (Level 0):</strong> The smallest squares ($0.49$Å) containing atoms. These are kept at highest resolution because electron density changes rapidly here.</li>
<li><strong>Light Blue Cells (Level 0/1):</strong> Small empty regions close to atoms.</li>
<li><strong>Darker Blue Cells (Level 2/3):</strong> Large blocks of empty space further away.</li>
</ul>
<p>If we used a naive uniform grid, we would have to process thousands of empty &ldquo;Level 0&rdquo; cells containing almost zero information. By merging them into larger blocks (the dark blue squares), the model covers the same volume with significantly fewer input tokens, reducing the number of tokens by roughly <strong>10x</strong> compared to a dense grid.</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-benzene.webp"
         alt="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         title="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging for benzene (C₆H₆). The model maintains maximum resolution (red Level 0 cells) only where atoms exist, while merging vast empty regions into large blocks (dark blue L3/L4 cells). This allows the model to focus computational power on chemically active zones.</figcaption>
    
</figure>

<p>The benzene example above demonstrates how this scales to larger molecules. The characteristic hexagonal ring of 6 carbon atoms (black) and 6 hydrogen atoms (white) occupies a small fraction of the total grid. The dark blue corners (L3, L4) represent massive merged blocks of empty space, letting the model concentrate the bulk of its computation on the red &ldquo;active&rdquo; zones where chemistry actually happens.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lu, S., Ji, X., Zhang, B., Yao, L., Liu, S., Gao, Z., Zhang, L., &amp; Ke, G. (2025). Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>, 267, 40491-40504. <a href="https://proceedings.mlr.press/v267/lu25e.html">https://proceedings.mlr.press/v267/lu25e.html</a></p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lu2025beyond,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lu, Shuqi and Ji, Xiaohong and Zhang, Bohang and Yao, Lin and Liu, Siyuan and Gao, Zhifeng and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{40491--40504}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=Wd9KPQCKwq">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=Wd9KPQCKwq">PDF on OpenReview</a></li>
<li><a href="https://icml.cc/virtual/2025/poster/45004">ICML 2025 poster page</a></li>
</ul>
]]></content:encoded></item><item><title>Embedded-Atom Method: Impurities and Defects in Metals</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method/</link><pubDate>Fri, 22 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method/</guid><description>Daw and Baskes's foundational 1984 paper introducing the Embedded-Atom Method (EAM), a many-body potential for metal simulations.</description><content:encoded><![CDATA[<h2 id="contribution-adaptive-many-body-potentials">Contribution: Adaptive Many-Body Potentials</h2>
<p>This is a foundational <strong>method paper</strong> that introduces a new class of semi-empirical, many-body interatomic potential: the <strong>Embedded-Atom Method (EAM)</strong>. It is designed for large-scale atomistic simulations of metallic systems, bridging the gap between computationally cheap (but physically limited) pair potentials and accurate (but expensive) quantum mechanical methods. The EAM achieves pair-potential speed while incorporating many-body physics inspired by density functional theory.</p>
<h2 id="motivation-the-geometric-limits-of-pair-potentials">Motivation: The Geometric Limits of Pair Potentials</h2>
<p>The authors sought to overcome the limitations of <strong>pair potentials</strong> (the dominant method of the time), which failed in three key areas:</p>
<ul>
<li><strong>Elastic Anisotropy:</strong> Pair potentials enforce the Cauchy relation ($C_{12} = C_{44}$), which is violated by most transition metals.</li>
<li><strong>Volume Ambiguity:</strong> Pair potentials require a volume-dependent energy term, making them impossible to use accurately on surfaces or cracks where local volume is undefined.</li>
<li><strong>Chemical Incompatibility:</strong> Pair potentials cannot model chemically active impurities like Hydrogen.</li>
</ul>
<p>First-principles quantum mechanical methods (e.g., band theory) are limited by basis-set size and periodicity requirements, making them impractical for the large systems (thousands of atoms) needed to study defects, surfaces, and mechanical properties.</p>
<p>The goal was to create a new model that bridges this gap in accuracy and computational cost.</p>
<h2 id="core-innovation-the-embedding-energy-function">Core Innovation: The Embedding Energy Function</h2>
<p>The EAM postulates that the energy of an atom is determined by the local electron density of its neighbors. The total energy is:</p>
<p>$$E_{tot} = \sum_{i} F_i(\rho_{h,i}) + \frac{1}{2}\sum_{i \neq j} \phi_{ij}(R_{ij})$$</p>
<ul>
<li><strong>$F_i(\rho_{h,i})$ (Embedding Energy):</strong> The energy required to embed atom $i$ into the background electron density $\rho$ provided by its neighbors. This term is non-linear and captures many-body effects.</li>
<li><strong>$\phi_{ij}$ (Pair Potential):</strong> A short-range electrostatic repulsion between cores.</li>
<li><strong>$\rho_{h,i}$ (Host Density):</strong> Approximated as a linear superposition of atomic densities: $\rho_{h,i} = \sum_{j \neq i} \rho^a_j(R_{ij})$.</li>
</ul>
<p>The key innovations are:</p>
<ol>
<li><strong>The Embedding Energy</strong>: Each atom $i$ contributes an energy $F_i$ which is a non-linear function of the local electron density $\rho_{h,i}$ it is embedded in. This density is approximated as a simple linear superposition of the atomic electron densities of all its neighbors. This term captures the crucial many-body effects of metallic bonding.</li>
<li><strong>A Redefined Pair Potential</strong>: A short-range, two-body potential $\phi_{ij}$ is retained, but it primarily models the electrostatic core-core repulsion.</li>
<li><strong>Elimination of the &ldquo;Volume&rdquo; Problem</strong>: Because the embedding energy depends on the local electron density (a quantity that is always well-defined, even at a surface or a crack tip), the method circumvents the ambiguities of volume-dependent pair potentials.</li>
<li><strong>Intrinsic Many-Body Nature</strong>: The non-linearity of the embedding function $F(\rho)$ naturally accounts for why chemically active impurities (like hydrogen) cannot be described by pair potentials and correctly breaks the Cauchy relation for elastic constants.</li>
</ol>
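<p>The decomposition is compact enough to sketch directly. Below, toy exponential functionals stand in for the paper's fitted cubic splines, and <code>eam_energy</code> is an illustrative name, not the authors' code:</p>

```python
import numpy as np

def eam_energy(R, F, phi, rho_a):
    """E_tot = sum_i F(rho_{h,i}) + (1/2) sum_{i != j} phi(R_ij)."""
    d = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-interaction
    host_rho = rho_a(d).sum(axis=1)      # rho_{h,i} = sum_{j != i} rho^a(R_ij)
    return F(host_rho).sum() + 0.5 * phi(d).sum()

# Toy functional forms (illustrative, not the paper's fitted splines)
F = lambda rho: -np.sqrt(rho)            # embedding energy
phi = lambda r: np.exp(-2 * r) / r       # screened core-core repulsion
rho_a = lambda r: np.exp(-r)             # atomic density tail

dimer = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
E = eam_energy(dimer, F, phi, rho_a)
```

<p>The nonlinearity of $F$ is where the many-body physics lives: doubling an atom's neighbor count does not double its embedding energy, which is exactly what pair potentials cannot express.</p>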
<h2 id="experimental-design-robust-parameter-validation">Experimental Design: Robust Parameter Validation</h2>
<p>The authors validated EAM through a rigorous split between parameterization data and prediction tasks:</p>
<p><strong>Fitting Data (Bulk Properties Only):</strong></p>
<p>The model parameters were fitted exclusively to these experimental values for Ni and Pd:</p>
<ul>
<li>Lattice constant ($a_0$)</li>
<li>Elastic constants ($C_{11}, C_{12}, C_{44}$)</li>
<li>Sublimation energy ($E_s$)</li>
<li>Vacancy-formation energy ($E^F_{1V}$)</li>
<li>Hydrogen heat of solution (for fitting H parameters)</li>
</ul>
<p><strong>Validation Tests (No Further Fitting):</strong></p>
<p>The model was then evaluated on its ability to predict these properties without any additional parameter adjustments:</p>
<ul>
<li><strong>Surface Relaxations:</strong> Ni(110) surface contraction</li>
<li><strong>Surface Energy:</strong> Ni(100) surface energy</li>
<li><strong>Hydrogen Migration:</strong> H migration energy in Pd</li>
<li><strong>Fracture Mechanics:</strong> Hydrogen embrittlement in Ni slabs</li>
</ul>
<h2 id="results-extending-predictive-power-to-surfaces-and-defects">Results: Extending Predictive Power to Surfaces and Defects</h2>
<ol>
<li><strong>Many-Body Physics:</strong> The embedding function $F(\rho)$ successfully captures the volume-dependence of metallic cohesion, fixing the &ldquo;Cauchy discrepancy&rdquo; inherent in pair potentials.</li>
<li><strong>Surface Properties:</strong> A single set of functions, fitted only to bulk data, correctly reproduces surface relaxations within 0.1 Å of experiment across three faces (100), (110), and (111) for Ni. The Ni(100) surface energy (1550 erg/cm²) compares well with the measured crystal-vapor average (1725 erg/cm²).</li>
<li><strong>Hydrogen in Bulk:</strong> The method predicts H migration energy in Pd as 0.26 eV, matching experiment exactly. Hydride lattice expansions are also well reproduced: 4.5% for NiH (experiment: 5%) and 4% for PdH (experiment: 3.5% for PdH$_{0.6}$).</li>
<li><strong>Hydrogen on Surfaces:</strong> Calculated adsorption sites on all three Ni and Pd faces agree with experimentally determined sites. Adsorption energies on Ni surfaces are systematically about 0.25 eV too low, while on Pd surfaces the error is much smaller (about 0.05 eV too high on average).</li>
<li><strong>Fracture Mechanics:</strong> Static fracture calculations on Ni slabs demonstrate brittle fracture behavior and show that hydrogen lowers the fracture stress, providing a qualitative model of hydrogen embrittlement.</li>
</ol>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>The functions $F$ and $\phi$ are not uniquely determined by the empirical fitting procedure. The short-range pair potential (restricted to first neighbors in fcc metals) may not be the best choice for all crystal structures.</li>
<li>The choice of hydrogen embedding function (Puska et al. vs. Norskov&rsquo;s corrected function) remains undecided and may affect hydrogen binding energies.</li>
<li>The fracture calculations are static, and dynamical effects and plasticity play important roles in real fracture that are not captured.</li>
<li>The method has only been demonstrated for fcc metals (Ni and Pd). Extension to bcc metals and other crystal structures requires further investigation.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="algorithms">Algorithms</h3>
<p>To replicate the method, three specific algorithmic definitions are needed:</p>
<ol>
<li>
<p><strong>Atomic Density Construction</strong>: The electron density $\rho^a(r)$ is a weighted sum of Hartree-Fock $s$ and $d$ orbital densities (from Clementi &amp; Roetti tables), controlled by a parameter $N_s$ (the number of s-like electrons):
$$\rho^a(r) = N_s\rho_s^a(r) + (N-N_s)\rho_d^a(r)$$
For Ni, $N_s = 0.85$; for Pd, $N_s = 0.65$ (fitted to H solution heat).</p>
</li>
<li>
<p><strong>Pair Potential Form</strong>: The short-range pair interaction derives from an effective charge function $Z(r)$ to handle core repulsion:
$$\phi_{ij}(r) = \frac{Z_i(r)Z_j(r)}{r}$$
Splines for $Z(r)$ are provided in Table II.</p>
</li>
<li>
<p><strong>Analytic Forces</strong>: Because the embedding energy depends on neighbor density, the force calculation is many-body:
$$\vec{f}_{k} = -\sum_{j(\neq k)} \left[ F'_{k}\,\rho'_{j}(R_{jk}) + F'_{j}\,\rho'_{k}(R_{jk}) + \phi'_{jk}(R_{jk}) \right] \hat{r}_{jk}$$</p>
</li>
</ol>
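<p>The force expression can be checked against a numerical derivative of the energy. This self-contained sketch uses illustrative exponential functionals with hand-coded derivatives in place of the paper's splines:</p>

```python
import numpy as np

# Illustrative functionals and their derivatives (not the paper's splines)
F = lambda rho: -np.sqrt(rho)
dF = lambda rho: -0.5 / np.sqrt(rho)
rhoa = lambda r: np.exp(-r)
drho = lambda r: -np.exp(-r)
phi = lambda r: np.exp(-2 * r) / r
dphi = lambda r: -np.exp(-2 * r) * (2 / r + 1 / r**2)

def eam_forces(R):
    """Analytic many-body forces from the expression above."""
    N = len(R)
    host = np.array([sum(rhoa(np.linalg.norm(R[i] - R[j]))
                         for j in range(N) if j != i) for i in range(N)])
    f = np.zeros_like(R)
    for k in range(N):
        for j in range(N):
            if j == k:
                continue
            rvec = R[k] - R[j]
            r = np.linalg.norm(rvec)
            mag = (dF(host[k]) + dF(host[j])) * drho(r) + dphi(r)
            f[k] -= mag * rvec / r      # unit vector from j to k
    return f

dimer = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
f = eam_forces(dimer)
```

<p>Note the cross terms: moving atom $k$ changes the host density seen by every neighbor $j$, so both $F'_k$ and $F'_j$ appear, unlike a pure pair force.</p>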
<h3 id="models">Models</h3>
<p>The functions $F(\rho)$ and $\phi(r)$ are modeled using <strong>cubic splines</strong>, with parameters fitted to reproduce bulk experimental constants. The embedding function $F(\rho)$ is constrained to have a single minimum and to be linear at high densities, matching the qualitative form of the first-principles calculations by Puska et al. Energy minimization uses the <strong>conjugate gradients</strong> technique. The paper explicitly lists spline knots, coefficients, and cutoffs in Tables II and IV, making the method fully reproducible.</p>















<figure class="post-figure center ">
    <img src="/img/notes/chemistry/eam-embedding-effective-charge.webp"
         alt="Reproduction of Figures 1 and 2 from Daw &amp; Baskes (1984) showing the embedding energy and effective charge functions for Ni and Pd"
         title="Reproduction of Figures 1 and 2 from Daw &amp; Baskes (1984) showing the embedding energy and effective charge functions for Ni and Pd"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Left:</strong> Dimensionless embedding energy ($E/E_s$) vs. normalized electron density ($\rho/\bar{\rho}$). The minimum near $\rho/\bar{\rho} \approx 1.0$ drives metallic cohesion. <strong>Right:</strong> Normalized effective charge ($Z/Z_0$) vs. normalized distance ($R/a_0$). The charge drops to zero near $R/a_0 = 0.85$, ensuring short-range interactions. Reproduced from Table II spline knots.</figcaption>
    
</figure>

<h3 id="evaluation">Evaluation</h3>
<p><strong>Fitting Data (Used for Parameterization):</strong></p>
<p>Bulk experimental properties for Ni and Pd only:</p>
<ul>
<li>Lattice constant ($a_0$)</li>
<li>Elastic constants ($C_{11}, C_{12}, C_{44}$)</li>
<li>Sublimation energy ($E_s$)</li>
<li>Vacancy-formation energy ($E^F_{1V}$)</li>
<li>Hydrogen heat of solution (for fitting H parameters)</li>
</ul>
<p><strong>Validation Results (Predictions Without Further Fitting):</strong></p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Predicted</th>
          <th>Experimental</th>
          <th>Agreement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ni(110) surface contraction</td>
          <td>-0.11 Å</td>
          <td>-0.06 to -0.10 Å</td>
          <td>Within 0.1 Å</td>
      </tr>
      <tr>
          <td>Ni(100) surface energy</td>
          <td>1550 erg/cm²</td>
          <td>1725 erg/cm² (avg.)</td>
          <td>Close</td>
      </tr>
      <tr>
          <td>H migration in Pd</td>
          <td>0.26 eV</td>
          <td>0.26 eV</td>
          <td>Exact</td>
      </tr>
      <tr>
          <td>NiH lattice expansion</td>
          <td>4.5%</td>
          <td>5%</td>
          <td>Close</td>
      </tr>
      <tr>
          <td>PdH lattice expansion</td>
          <td>4%</td>
          <td>3.5% (PdH$_{0.6}$)</td>
          <td>Close</td>
      </tr>
      <tr>
          <td>H adsorption sites (Ni, Pd)</td>
          <td>Correct on all faces</td>
          <td>Matches experiment</td>
          <td>Exact</td>
      </tr>
      <tr>
          <td>H embrittlement in Ni</td>
          <td>Qualitative model</td>
          <td>-</td>
          <td>Qualitative</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Daw, M. S., &amp; Baskes, M. I. (1984). Embedded-atom method: Derivation and application to impurities, surfaces, and other defects in metals. <em>Physical Review B</em>, 29(12), 6443-6453. <a href="https://doi.org/10.1103/PhysRevB.29.6443">https://doi.org/10.1103/PhysRevB.29.6443</a></p>
<p><strong>Publication</strong>: Physical Review B, 1984</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{daw1984embedded,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Embedded-atom method: Derivation and application to impurities, surfaces, and other defects in metals}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Daw, Murray S and Baskes, Mike I}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Physical Review B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{29}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6443--6453}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1984}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{APS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1103/PhysRevB.29.6443}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-review-1993/">EAM Review (1993)</a></li>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-voter-1994/">EAM User Guide (1994)</a></li>
<li><a href="https://www.ctcms.nist.gov/potentials/">NIST Interatomic Potentials Repository</a></li>
</ul>
]]></content:encoded></item><item><title>Umbrella Sampling: Monte Carlo Free-Energy Estimation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/umbrella-sampling/</link><pubDate>Thu, 21 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/umbrella-sampling/</guid><description>Torrie and Valleau's 1977 paper introducing Umbrella Sampling, an importance sampling technique for Monte Carlo free-energy calculations.</description><content:encoded><![CDATA[<h2 id="a-methodological-shift-in-monte-carlo-simulations">A Methodological Shift in Monte Carlo Simulations</h2>
<p>This is a <strong>Method</strong> paper that introduces a novel computational technique for Monte Carlo simulations. It presents Umbrella Sampling, an importance sampling approach that uses non-physical distributions to calculate free energy differences in molecular systems.</p>
<h2 id="the-sampling-gap-in-phase-transitions">The Sampling Gap in Phase Transitions</h2>
<p>The paper addresses the failure of conventional Boltzmann-weighted Monte Carlo to estimate free energy differences.</p>
<ul>
<li><strong>The Problem</strong>: Free energy depends on the integral of configurations that are rare in the reference system. In a standard simulation, the relevant probability density $f_0(\Delta U^*)$ is too small to be sampled accurately by conventional Boltzmann-weighted Monte Carlo.</li>
<li><strong>Phase Transitions</strong>: Conventional &ldquo;thermodynamic integration&rdquo; fails near phase transitions because it requires a path of integration where ensemble averages can be reliably measured, which is difficult in unstable regions.</li>
</ul>
<h2 id="bridging-states-with-non-physical-distributions">Bridging States with Non-Physical Distributions</h2>
<p>The authors introduce a non-physical distribution $\pi(q^N)$ to bridge the gap between a reference system (0) and a system of interest (1).</p>
<ul>
<li><strong>Arbitrary Weights</strong>: They generate a Markov chain with a limiting distribution $\pi(q^N)$ that differs from the Boltzmann distribution of either system. This distribution is written as $\pi(q^N) = w(q^N)\,\exp(-U_0(q^N)/kT_0)\,/\,Z$, where $w(q^N) = W(\Delta U^*)$ is a weighting function chosen to favor configurations with values of $\Delta U^*$ important to the free-energy integral.</li>
<li><strong>Reweighting Formula</strong>: The unbiased average of any property $\theta$ is recovered via the ratio of biased averages:</li>
</ul>
<p>$$\langle\theta\rangle_{0}=\frac{\langle\theta/w\rangle_{w}}{\langle1/w\rangle_{w}}$$</p>
<ul>
<li><strong>Overlap</strong>: The method allows sampling a range of $\Delta U^*$ up to <strong>three times</strong> that of a conventional Monte Carlo experiment, enabling accurate determination of values of $f_0(\Delta U^*)$ as small as $10^{-8}$. If a single weight function cannot span the entire gap, additional overlapping umbrella-sampling experiments are carried out with different weighting functions exploring successively overlapping ranges of $\Delta U^*$.</li>
</ul>
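<p>The reweighting identity $\langle\theta\rangle_0 = \langle\theta/w\rangle_w / \langle1/w\rangle_w$ is easy to demonstrate on a toy one-dimensional system (the distribution, weight function, and sample size below are all illustrative, not from the paper):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference distribution on a discrete grid (stand-in for the
# Boltzmann-weighted reference system)
x = np.linspace(-4, 4, 401)
p0 = np.exp(-x**2 / 2)
p0 /= p0.sum()

# Umbrella weight w favoring the rarely-sampled right tail
w = np.exp(2 * x)
pi = w * p0
pi /= pi.sum()

# Sample from the *biased* distribution pi, then unbias via the ratio
samples = rng.choice(x, size=200_000, p=pi)
w_s = np.exp(2 * samples)

biased = samples.mean()                               # ~2, badly wrong
unbiased = np.mean(samples / w_s) / np.mean(1 / w_s)  # recovers <x>_0 = 0
```

<p>The biased chain spends its time in the tail that the physical ensemble almost never visits, yet dividing out the weight recovers the unbiased average, which is the entire trick.</p>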
<h2 id="validation-on-lennard-jones-fluids">Validation on Lennard-Jones Fluids</h2>
<p>The authors validated Umbrella Sampling using Monte Carlo simulations of model fluids.</p>
<h3 id="experimental-setup">Experimental Setup</h3>
<ul>
<li><strong>System Specifications</strong>: The study used a <strong>Lennard-Jones (LJ)</strong> fluid and an <strong>inverse-12 &ldquo;soft-sphere&rdquo;</strong> fluid.</li>
<li><strong>System Size</strong>: Simulations were primarily performed with <strong>$N=32$ particles</strong>, with some validation runs at <strong>$N=108$ particles</strong> to check for size dependence.</li>
<li><strong>State Points</strong>: Calculations covered a wide range of densities ($N\sigma^3/V = 0.50$ to $0.85$) and temperatures ($kT/\epsilon = 0.7$ to $2.8$), including the gas-liquid coexistence region.</li>
</ul>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>Baselines</strong>: Results were compared to thermodynamic integration data from <strong>Hansen</strong>, <strong>Levesque</strong>, and <strong>Verlet</strong>.</li>
<li><strong>Quantitative Success</strong>:
<ul>
<li><strong>Agreement</strong>: The free energy estimates agreed with pressure integration results to within statistical uncertainties (e.g., at $kT/\epsilon=1.35$, Umbrella Sampling gave -3.236 vs. Conventional -3.25).</li>
<li><strong>Precision</strong>: Free energy differences were obtained with high precision ($\pm 0.005 NkT$ for $N=108$).</li>
<li><strong>Efficiency</strong>: A single umbrella run could replace the &ldquo;numerous runs&rdquo; required for conventional $1/T$ integrations.</li>
</ul>
</li>
</ul>
<h2 id="temperature-scaling-via-reweighting">Temperature Scaling via Reweighting</h2>
<p>When the reference system has the same internal energy function as the system of interest (i.e., the same fluid at a different temperature), the free-energy expression simplifies to:</p>
<p>$$\frac{A(T)}{kT} = \frac{A(T_0)}{kT_0} - \ln \int f_0(U) \exp\left[-U\left(\frac{1}{kT} - \frac{1}{kT_0}\right)\right] dU$$</p>
<p>This is especially useful because a single determination of $f_0(U)$ over a wide energy range gives the free energy over a whole range of temperatures simultaneously. For 32 Lennard-Jones particles, only two umbrella-sampling experiments are needed to span the temperature range from the triple point ($kT/\epsilon = 0.7$) to twice the critical temperature ($kT/\epsilon = 2.8$). For 108 particles, four experiments suffice.</p>
<h2 id="mapping-the-liquid-gas-free-energy-surface">Mapping the Liquid-Gas Free Energy Surface</h2>
<ul>
<li><strong>Methodological Utility</strong>: The method successfully mapped the free energy of the LJ fluid across the liquid-gas transition, a region where conventional methods face convergence problems.</li>
<li><strong>N-Dependence</strong>: Comparison between $N=32$ and $N=108$ showed no statistically significant size dependence for free energy differences, suggesting small systems are sufficient for these estimates.</li>
<li><strong>Comparison with Gosling-Singer Method</strong>: The paper contrasts its results with free energies derived from Gosling and Singer&rsquo;s entropy estimation technique, finding discrepancies as large as $0.4N\epsilon$ (a 20% error in the nonideal entropy), equivalent to overestimating the configurational integral of a 108-particle system by a factor of $10^{16}$.</li>
<li><strong>Generality</strong>: While demonstrated on energy ($U$), the authors note the weighting function $w$ can be any function of the coordinates, generalizing the technique beyond simple free energy differences.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>This 1977 paper predates modern code-sharing practices, and no source code or data files are publicly available. However, the paper provides sufficient algorithmic detail for reimplementation:</p>
<ul>
<li><strong>Constructing $W$</strong>: The paper does not derive $W$ analytically. It uses a <strong>trial-and-error procedure</strong>: start with a short Boltzmann-weighted experiment, then broaden the distribution in stages through short test runs, adjusting weights to flatten the probability density $f_w(\Delta U^*)$. The paper acknowledges this requires &ldquo;interaction between the trial computer results and human judgment.&rdquo;</li>
<li><strong>Specific Weights</strong>: Table I provides the exact numerical weights used for the 32-particle soft-sphere experiment at $N\sigma^3/V = 0.85$, $kT/\epsilon = 2.74$, with values spanning from $W=1{,}500{,}000$ at the lowest energies down to $W=1.0$ at the center and back up to $W=16.0$ at the highest energies.</li>
<li><strong>Potentials</strong>: The Lennard-Jones and inverse-twelve potentials are fully specified (Eqs. 8 and 9).</li>
<li><strong>State Points</strong>: Densities and temperatures are enumerated in Tables II and III.</li>
<li><strong>Block Averaging</strong>: Errors were estimated by treating sequences of $m$ steps as independent samples, where $m$ is determined by increasing block size until no systematic trends can be detected in either the average or the standard deviation of the mean.</li>
</ul>
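<p>The staged flattening procedure above can be sketched as a loop over short test runs. This is a hypothetical reconstruction, not the paper's code: <code>sample_energies</code> stands in for a short Monte Carlo experiment under the current weights, and the update $w \leftarrow w / f_w(\Delta U^*)$ is one simple way to realize the &ldquo;adjust weights until the sampled distribution is flat&rdquo; step that the paper describes as requiring human judgment.</p>

```python
import numpy as np

def flatten_weights(sample_energies, bins, w=None, n_rounds=5):
    """Iteratively adjust umbrella weights so the sampled distribution
    f_w(dU) becomes roughly flat over the target energy bins.

    `sample_energies` is a stand-in for a short test MC run: a callable
    that returns dU samples drawn under the current weights (hypothetical API).
    """
    if w is None:
        w = np.ones(len(bins) - 1)  # start from Boltzmann sampling (W = 1)
    for _ in range(n_rounds):
        dU = sample_energies(w)                       # short test run
        f_w, _ = np.histogram(dU, bins=bins, density=True)
        f_w = np.clip(f_w, f_w[f_w > 0].min(), None)  # guard empty bins
        w = w / f_w                                   # boost under-sampled bins
        w = w / w.max()                               # normalize for stability
    return w
```

<p>In practice the paper's Table I weights span six orders of magnitude, so this kind of multiplicative update is typically applied in stages, broadening the covered energy range a little at a time.</p>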
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Torrie, G. M., &amp; Valleau, J. P. (1977). Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. <em>Journal of Computational Physics</em>, 23(2), 187-199. <a href="https://doi.org/10.1016/0021-9991(77)90121-8">https://doi.org/10.1016/0021-9991(77)90121-8</a></p>
<p><strong>Publication</strong>: Journal of Computational Physics, 1977</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{torrie1977nonphysical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Torrie, Glenn M and Valleau, John P}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Computational Physics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{187--199}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1977}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/0021-9991(77)90121-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Contrastive Learning for Variational Autoencoder Priors</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/contrastive-learning-for-vae-priors/</link><pubDate>Sun, 17 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/contrastive-learning-for-vae-priors/</guid><description>Aneja et al.'s NeurIPS 2021 paper introducing Noise Contrastive Priors (NCPs) to address VAE's 'prior hole' problem with energy-based priors.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>method paper</strong> that introduces a training approach for Variational Autoencoders (VAEs) to address fundamental limitations in their generative quality through improved prior learning.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The work is motivated by a critical limitation in Variational Autoencoders known as the <strong>&ldquo;prior hole&rdquo; problem</strong>, where the prior distribution $p(z)$ fails to match the aggregate approximate posterior $q(z)$. This mismatch leaves regions of the latent space with high density under the prior that do not map to realistic data samples, resulting in poor generative quality compared to GANs and other generative models.</p>
<figure class="post-figure center ">
    <img src="/img/notes/vae-prior-hole-problem-illustrated.webp"
         alt="Visualization of the VAE prior hole problem showing a ring-shaped aggregate posterior q(z) with an empty center, while the standard Gaussian prior p(z) has highest density at the center where no data exists"
         title="Visualization of the VAE prior hole problem showing a ring-shaped aggregate posterior q(z) with an empty center, while the standard Gaussian prior p(z) has highest density at the center where no data exists"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The &lsquo;prior hole&rsquo; problem: the standard Gaussian prior (red dashed contours) assigns highest probability to the center, but the aggregate posterior (blue dots) forms a ring with no data in that region.</figcaption>
    
</figure>

<p>The figure above illustrates this mismatch. The blue dots represent where a trained encoder actually places data in the latent space (the aggregate posterior $q(z)$), which often forms complex, non-Gaussian shapes. The red dashed contours show the standard Gaussian prior $p(z) = \mathcal{N}(0, I)$, which assumes data is centered at the origin. When generating new samples, we draw from this prior, making it likely to sample from the empty &ldquo;hole&rdquo; where the decoder has never seen training data, producing unrealistic outputs.</p>
<p>A natural question arises: the prior $p(z)$ is used for <em>sampling</em> at inference time, so why does learning a better prior also improve <em>likelihood</em> (NLL)? The answer lies in the VAE objective. VAEs maximize the Evidence Lower Bound (ELBO):</p>
<p>$$ \log p(x) \geq \mathcal{L}_{\text{ELBO}}(x) = \underbrace{\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{Reconstruction}} - \underbrace{\text{KL}(q(z|x) \parallel p(z))}_{\text{Regularization}} $$</p>
<p>The KL divergence term penalizes the mismatch between each data point&rsquo;s approximate posterior $q(z|x)$ and the prior $p(z)$. When the prior is a simple Gaussian but the aggregate posterior forms a complex shape (as in the figure above), this KL term remains unnecessarily high for every data point.</p>
<p>By replacing the simple prior with a learned $p_{\text{NCP}}(z)$ that matches the aggregate posterior, the KL penalty decreases, tightening the ELBO and improving NLL. The learned prior thus provides a <strong>unified solution</strong>: better likelihood during training (tighter bound) and better sampling at inference (no &ldquo;holes&rdquo;).</p>
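<p>For the usual Gaussian VAE posterior $q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and standard Normal prior, the KL term above has a closed form, which makes the penalty easy to see numerically. This is an illustrative sketch, not code from the paper:</p>

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), i.e. the
    regularization term of the ELBO for a diagonal-Gaussian posterior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# A posterior pushed away from the origin (as on a ring-shaped aggregate
# posterior) pays a larger KL penalty than one centered on the prior:
centered = kl_diag_gaussian_to_std_normal(np.zeros(2), np.zeros(2))          # -> 0.0
off_ring = kl_diag_gaussian_to_std_normal(np.array([2.0, 0.0]), np.zeros(2)) # -> 2.0
```

<p>A learned prior that tracks where the posteriors actually sit shrinks exactly this term, which is why it tightens the ELBO rather than only improving sample quality.</p>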
<p>The OpenReview discussion contains a significant theoretical debate regarding the paper&rsquo;s core premise. Reviewers argued that the &ldquo;prior hole&rdquo; problem is actually a failure of the posterior to match the prior, or a failure of the encoder. The authors defended their approach by noting that even with a perfect posterior, a simple Normal prior might fail because the decoder lacks capacity to map a simple distribution to complex data without dropping modes. This justifies fixing the prior by making it learned and complex.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The authors propose an <strong>energy-based model (EBM) prior</strong> that is trained using <strong>Noise Contrastive Estimation (NCE)</strong>, which they term a <strong>Noise Contrastive Prior (NCP)</strong>. The key innovations are:</p>
<ul>
<li><strong>Two-Stage Training Process</strong>: First, a standard VAE is trained with a simple base prior. Then, the VAE weights are frozen and a binary classifier learns to distinguish between samples from the aggregate posterior $q(z)$ and the base prior $p(z)$.</li>
<li><strong>Reweighting Strategy</strong>: The core idea is to reweight the base prior $p(z)$ with a learned reweighting factor $r(z)$ to make the resulting prior $p_{\text{NCP}}(z)$ better match the aggregate posterior $q(z)$.</li>
<li><strong>NCE for EBM Training</strong>: The method frames EBM training as a binary classification task to avoid computationally expensive MCMC sampling.</li>
<li><strong>Scalability to Hierarchical Models</strong>: For hierarchical VAEs with multiple latent groups, the NCP approach can be applied independently and in parallel to each group&rsquo;s conditional prior.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The method was evaluated on several standard image generation benchmarks:</p>
<ul>
<li><strong>MNIST</strong> (dynamically binarized): Likelihood evaluation on a controlled, small-latent-space task</li>
<li><strong>CIFAR-10</strong>: Standard computer vision benchmark for generative modeling</li>
<li><strong>CelebA 64x64</strong>: Applied to both standard VAE architectures and more advanced VAEs with GMM priors (RAE model)</li>
<li><strong>CelebA HQ 256x256</strong>: High-resolution face generation task</li>
</ul>
<p>The hierarchical NVAE models used 30 latent groups for CIFAR-10 and CelebA-64, 20 groups for CelebA-HQ-256, and 10 groups of $4 \times 4$ latent variables for MNIST (deliberately small to enable reliable partition function estimation). The experiments compared FID scores, likelihood metrics, and qualitative sample quality between baseline VAEs and NCP-enhanced versions, with particular focus on hierarchical VAEs (NVAE).</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<p>The proposed NCP method demonstrated improvements in generative quality across evaluated datasets, with modest gains on standard VAEs and particularly large gains on hierarchical models like NVAE:</p>
<ul>
<li><strong>CelebA-64</strong>: NCP improved FID scores from 48.12 to 41.28 for standard VAEs, and from 40.95 to 39.00 for RAE models with GMM priors.</li>
<li><strong>Hierarchical Models (NVAE)</strong>: The impact was particularly pronounced on hierarchical VAEs:
<ul>
<li><strong>CIFAR-10</strong>: FID improved from 51.71 to 24.08</li>
<li><strong>CelebA-64</strong>: FID improved from 13.48 to 5.25, making it competitive with GANs</li>
<li><strong>CelebA HQ 256x256</strong>: FID reduced from 40.26 to 24.79</li>
</ul>
</li>
<li><strong>Likelihood Performance</strong>: On MNIST, NCP-VAE achieved 78.10 nats NLL vs. baseline NVAE&rsquo;s 78.67 nats</li>
</ul>
<p>On CIFAR-10 and CelebA-HQ-256, the concurrent VAEBM method (which forms an EBM on the data space rather than the latent space) outperforms NCP-VAE. However, the authors argue the two approaches are complementary: NCP-VAE targets the latent space while VAEBM operates in data space, and combining them could yield further gains. NCP-VAE also has the advantage of applicability to discrete data (e.g., binarized MNIST) and simpler setup since it only requires training binary classifiers rather than MCMC-based training and sampling.</p>
<p>The key conclusions are that <strong>two-stage training with noise contrastive estimation</strong> provides an effective framework for learning expressive energy-based priors that addresses the prior hole problem while scaling efficiently to hierarchical models.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://drive.google.com/drive/folders/15tCGruQcSdm2G4yLkUpKvGASluSZPIBD">Code (Google Drive)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation; hosted on Google Drive (may become inaccessible)</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://openreview.net/forum?id=LcSfRundgwI">OpenReview</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Reviews, author responses, and supplementary material</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<h4 id="the-reweighting-mechanism">The Reweighting Mechanism</h4>
<p>The core innovation is defining the NCP prior as $p_{\text{NCP}}(z) \propto p(z)r(z)$. The reweighting factor $r(z)$ is derived from the binary classifier $D(z)$ using the <strong>likelihood ratio trick</strong>:</p>
<p>$$ r(z) \approx \frac{D(z)}{1 - D(z)} $$</p>
<p>Here, $D(z)$ is the sigmoid output of the trained discriminator, representing the probability that sample $z$ came from the aggregate posterior $q(z)$ (&ldquo;real&rdquo;). For an optimal discriminator $D^*(z)$, this ratio exactly equals $\frac{q(z)}{p(z)}$, allowing the model to approximate the density ratio without explicit density estimation.</p>
<figure class="post-figure center ">
    <img src="/img/notes/ncp-vae-reweighting-the-prior-posterior.webp"
         alt="Visualization of the NCP reweighting mechanism showing three 1D distributions: q(z) the complex bimodal aggregate posterior, p(z) the simple Gaussian prior, and r(z) the learned reweighting factor that transforms p(z) to match q(z)"
         title="Visualization of the NCP reweighting mechanism showing three 1D distributions: q(z) the complex bimodal aggregate posterior, p(z) the simple Gaussian prior, and r(z) the learned reweighting factor that transforms p(z) to match q(z)"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The reweighting mechanism: the learned factor $r(z)$ (bottom) reweights the simple Gaussian prior $p(z)$ (middle) to approximate the complex aggregate posterior $q(z)$ (top). Where $q(z)$ has high density but $p(z)$ is low, $r(z)$ compensates with high values.</figcaption>
    
</figure>

<h4 id="hierarchical-architecture-strategy">Hierarchical Architecture Strategy</h4>
<p>For hierarchical models (like NVAE), the method trains $K$ binary classifiers in parallel (one for each latent group). Crucially, to ensure efficiency, the classifiers reuse the <strong>context feature</strong> $c(z_{&lt;k})$ extracted by the frozen VAE&rsquo;s prior network. This architectural choice provides significant computational savings.</p>
<h4 id="test-time-sampling-inference">Test-Time Sampling (Inference)</h4>
<p>Since $p_{\text{NCP}}(z)$ is an energy-based model with an intractable normalizing constant, it cannot be sampled from directly. The paper employs two methods to generate samples:</p>
<ul>
<li><strong>Sampling-Importance-Resampling (SIR):</strong> Used for most results. It draws $M$ samples (e.g., $M=5000$) from the base prior $p(z)$ and resamples them based on weights $w^{(m)} = r(z^{(m)})$.</li>
<li><strong>Langevin Dynamics (LD):</strong> An iterative refinement method using the gradient of the energy function $E(z) = -\log r(z) - \log p(z)$.</li>
</ul>
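<p>The SIR procedure above can be sketched in a few lines. This is a hedged sketch, not the paper's implementation: <code>base_sampler</code> and <code>r</code> stand in for the trained base prior and the learned reweighting factor.</p>

```python
import numpy as np

def sir_sample(base_sampler, r, n_out, m=5000, rng=None):
    """Sampling-Importance-Resampling from p_NCP(z) proportional to p(z) r(z):
    draw m proposals from the base prior, then resample n_out of them with
    probability proportional to the reweighting factor r(z)."""
    if rng is None:
        rng = np.random.default_rng()
    z = base_sampler(m)                    # proposals from the base prior p(z)
    w = r(z)
    w = w / w.sum()                        # self-normalized importance weights
    idx = rng.choice(m, size=n_out, p=w)   # resample proportionally to r(z)
    return z[idx]
```

<p>When $r$ concentrates on a region the base prior rarely visits, most of the normalized weight falls on a handful of proposals and the effective sample size collapses, which is exactly the SIR failure mode the paper monitors via ESS.</p>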
<h3 id="models">Models</h3>
<h4 id="decoder-architecture">Decoder Architecture</h4>
<p>For RGB datasets (CIFAR-10, CelebA), the output likelihood must be changed from <strong>Discretized Logistic</strong> (standard NVAE) to a <strong>Normal distribution</strong>. The authors note this change alone led to &ldquo;significant improvements in the base model performance.&rdquo; Using the standard NVAE decoder will result in a weaker baseline than reported.</p>
<h4 id="discriminator-architecture">Discriminator Architecture</h4>
<p>The binary classifier uses a ResNet-style architecture with <strong>Squeeze-and-Excitation (SE)</strong> blocks:</p>
<ul>
<li><strong>Activation:</strong> Swish</li>
<li><strong>Normalization:</strong> Batch Normalization</li>
<li><strong>Optimization:</strong> Adam with Cosine Annealing (learning rate: $10^{-3} \to 10^{-7}$)</li>
</ul>
<p>The SE blocks help the model focus on channel-wise feature recalibration, which is important for distinguishing subtle differences between prior and aggregate posterior in high-dimensional latent spaces.</p>
<h3 id="hardware">Hardware</h3>
<p>The main paper is vague on training time, but the OpenReview rebuttal explicitly lists hardware costs:</p>
<ul>
<li><strong>Hardware:</strong> NVIDIA Tesla V100 (32GB) GPUs</li>
<li><strong>Per-Discriminator Training:</strong> ~13 hours for 100 epochs</li>
<li><strong>Parallelization:</strong> Because latent groups are independent, all discriminators can train in parallel</li>
<li><strong>Total Cost (CelebA-64):</strong> ~8.1 GPU-days</li>
<li><strong>Comparison:</strong> The authors argue this is efficient compared to VDVAE, which requires ~560 GPU-days</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<h4 id="inference-speed-vs-quality-trade-off">Inference Speed vs. Quality Trade-off</h4>
<p>Reviewers flagged that SIR sampling can be prohibitively slow. The authors clarified the exact trade-off:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Proposal Samples ($M$)</th>
          <th style="text-align: left">Time per Image</th>
          <th style="text-align: left">FID (CelebA-64)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">5,000 (paper default)</td>
          <td style="text-align: left">~10.11 seconds</td>
          <td style="text-align: left">5.25</td>
      </tr>
      <tr>
          <td style="text-align: left">500 (practical)</td>
          <td style="text-align: left">~1.25 seconds</td>
          <td style="text-align: left">6.76</td>
      </tr>
  </tbody>
</table>
<p>The quality gain from 500 to 5,000 samples is modest (FID difference of 1.51) while inference time increases roughly 8x, suggesting $M=500$ may be a practical default.</p>
<h4 id="hyperparameters">Hyperparameters</h4>
<ul>
<li><strong>FID Calculation:</strong> 50,000 samples</li>
<li><strong>SIR Proposals:</strong> 5,000 samples (paper default) or 500 (practical)</li>
<li><strong>MNIST:</strong> Dynamically binarized version used for likelihood evaluation</li>
<li><strong>Optimizers:</strong> The study largely adopts hyperparameters from baseline papers (e.g., Lawson et al. for MNIST, Ghosh et al. for RAE)</li>
</ul>
<h4 id="debugging-benchmark-25-gaussians">Debugging Benchmark: 25-Gaussians</h4>
<p>The supplement provides a toy experiment ideal for verifying a new implementation before running on expensive image datasets:</p>
<ul>
<li><strong>Setup:</strong> Synthetic dataset of 25 2D-Gaussians arranged on a grid</li>
<li><strong>Target NLL:</strong> ~-0.954 nats (NCP) vs. ~-2.753 nats (Vanilla VAE)</li>
<li><strong>Success Criterion:</strong> Samples should avoid low-density regions between grid points. A standard VAE will generate samples in these &ldquo;prior holes,&rdquo; while a working NCP implementation should cleanly remove these artifacts.</li>
</ul>
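<p>A generator for this toy benchmark is easy to sketch; the grid spacing and component standard deviation below are assumptions for illustration, since the supplement's exact values are not reproduced here:</p>

```python
import numpy as np

def make_25_gaussians(n, scale=2.0, std=0.05, rng=None):
    """Toy 25-Gaussians dataset: 2D points around a 5x5 grid of means.
    `scale` (grid spacing) and `std` (component spread) are assumed values."""
    if rng is None:
        rng = np.random.default_rng()
    means = np.array([(x, y) for x in range(-2, 3) for y in range(-2, 3)], float) * scale
    idx = rng.integers(0, 25, n)           # pick a mixture component per point
    return means[idx] + rng.normal(0.0, std, (n, 2))
```

<p>With data like this, a sanity check is visual: vanilla VAE samples smear into the gaps between grid points, while an NCP-reweighted prior should keep samples on the 25 modes.</p>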
<h4 id="implementation-warnings">Implementation Warnings</h4>
<ul>
<li><strong>SIR Failure Mode:</strong> If the learned prior $p_{\text{NCP}}$ deviates too far from the base prior, SIR sampling collapses (low effective sample size). The paper shows a strong correlation between the NCE classification loss and the effective sample size (Fig. 5(b)), indicating that SIR reliability depends on how well the base prior matches the aggregate posterior.</li>
<li><strong>Temperature Scaling:</strong> The qualitative images in the paper use reduced temperature for improved visual sharpness (Section 5.3). The FID tables do not specify a temperature, so it is unclear whether those results were computed at $T=1.0$.</li>
</ul>
<h3 id="data">Data</h3>
<p>The method was evaluated on several standard image generation benchmarks:</p>
<ul>
<li><strong>MNIST</strong> (dynamically binarized): Likelihood evaluation on a controlled, small-latent-space task</li>
<li><strong>CIFAR-10</strong>: Standard computer vision benchmark for generative modeling (32x32 RGB images)</li>
<li><strong>CelebA 64x64</strong>: Face generation task with moderate resolution</li>
<li><strong>CelebA HQ 256x256</strong>: High-resolution face generation task</li>
</ul>
<p>All datasets use standard train/test splits from the computer vision literature.</p>
<h4 id="additional-metrics">Additional Metrics</h4>
<p>Beyond FID and NLL, the paper uses:</p>
<ul>
<li><strong>Effective Sample Size (ESS):</strong> Validates reliability of the SIR sampling procedure</li>
<li><strong>Maximum Mean Discrepancy (MMD):</strong> Measures distance between aggregate posterior and NCP prior distributions</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Aneja, J., Schwing, A. G., Kautz, J., &amp; Vahdat, A. (2021). A contrastive learning approach for training variational autoencoder priors. <em>Advances in Neural Information Processing Systems</em>, 34, 29604-29616. <a href="https://proceedings.neurips.cc/paper/2021/hash/0496604c1d80f66fbeb963c12e570a26-Abstract.html">https://proceedings.neurips.cc/paper/2021/hash/0496604c1d80f66fbeb963c12e570a26-Abstract.html</a></p>
<p><strong>Publication</strong>: NeurIPS 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{aneja2021contrastive,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Contrastive Learning Approach for Training Variational Autoencoder Priors}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Aneja, Jyoti and Schwing, Alexander G and Kautz, Jan and Vahdat, Arash}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{29604--29616}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=LcSfRundgwI">OpenReview Discussion</a></li>
<li><a href="https://drive.google.com/drive/folders/15tCGruQcSdm2G4yLkUpKvGASluSZPIBD">Code Repository</a> (Google Drive; link may become inaccessible over time)</li>
</ul>
]]></content:encoded></item><item><title>SubGrapher: Visual Fingerprinting of Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</link><pubDate>Mon, 28 Apr 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</guid><description>SubGrapher creates molecular fingerprints directly from chemical structure images through functional group segmentation for database retrieval.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-taxonomy">Paper Classification and Taxonomy</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution. Using the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> framework:</p>
<p><strong>Primary Classification: Method</strong></p>
<p>The dominant basis vector is Methodological because SubGrapher introduces an architecture that replaces the two-step OCSR workflow (image, then structure, then fingerprint) with single-step fingerprinting (image to visual fingerprint). The paper validates this approach through systematic comparison against state-of-the-art methods (MolGrapher, OSRA, DECIMER, MolScribe), demonstrating superior performance on specific tasks like retrieval and robustness to image quality degradation.</p>
<p><strong>Secondary Classification: Resource</strong></p>
<p>The paper makes non-negligible resource contributions by releasing:</p>
<ul>
<li>Code and model weights on <a href="https://github.com/DS4SD/SubGrapher">GitHub</a> and <a href="https://huggingface.co/docling-project/SubGrapher">HuggingFace</a></li>
<li>Five new visual fingerprinting benchmark datasets for molecule retrieval tasks</li>
<li>Comprehensive functional group knowledge base (1,534 substructures)</li>
</ul>
<h2 id="motivation-extracting-complex-structures-from-noisy-images">Motivation: Extracting Complex Structures from Noisy Images</h2>
<p>The motivation tackles a fundamental challenge in chemical informatics: extracting molecular information from the vast amounts of unstructured scientific literature, particularly patents. Millions of molecular structures exist only as images in these documents, making them inaccessible for computational analysis, database searches, or machine learning applications.</p>
<p>Traditional Optical Chemical Structure Recognition (OCSR) tools attempt to fully reconstruct molecular graphs from images, converting them into machine-readable formats like SMILES. However, these approaches face two critical limitations:</p>
<ol>
<li><strong>Brittleness to image quality</strong>: Poor resolution, noise, or unconventional drawing styles frequently degrade recognition accuracy</li>
<li><strong>Limited handling of complex structures</strong>: Markush structures, generic molecular templates with variable R-groups commonly used in patents, are poorly supported by most conventional OCSR methods</li>
</ol>
<p>The key insight driving SubGrapher is that full molecular reconstruction may be unnecessary for many applications. For tasks like database searching, similarity analysis, or document retrieval, a molecular fingerprint (a vectorized representation capturing structural features) is often sufficient. This realization opens up a new approach: bypass the fragile reconstruction step and create fingerprints directly from visual information.</p>
<h2 id="key-innovation-direct-visual-fingerprinting">Key Innovation: Direct Visual Fingerprinting</h2>
<p>SubGrapher takes a different approach to extracting chemical information from images. It creates &ldquo;visual fingerprints&rdquo; through functional group recognition. The key innovations are:</p>
<ol>
<li>
<p><strong>Direct Image-to-Fingerprint Pipeline</strong>: SubGrapher eliminates the traditional two-step process (image → structure → fingerprint) by generating fingerprints directly from pixels. This single-stage approach avoids error accumulation from failed structure reconstructions and can handle images where conventional OCSR tools produce invalid outputs.</p>
</li>
<li>
<p><strong>Dual Instance Segmentation Architecture</strong>: The system employs two specialized Mask-RCNN networks working in parallel:</p>
<ul>
<li><strong>Functional group detector</strong>: Trained to identify 1,534 expert-defined functional groups using pixel-level segmentation masks</li>
<li><strong>Carbon backbone detector</strong>: Recognizes 27 common carbon chain patterns to capture the molecular scaffold</li>
</ul>
<p>Using instance segmentation provides detailed spatial information and higher accuracy through richer supervision during training.</p>
</li>
<li>
<p><strong>Extensive Functional Group Knowledge Base</strong>: The method uses one of the most comprehensive open-source collections of functional groups, encompassing 1,534 substructures. These were systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Additional halogen substituents and organometallic groups relevant to EUV photoresists</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li>
<p><strong>Substructure-Graph Construction</strong>: After detecting functional groups and carbon backbones, SubGrapher builds a connectivity graph where:</p>
<ul>
<li>Each node represents an identified substructure</li>
<li>Edges connect substructures whose bounding boxes overlap (with 10% margin expansion)</li>
<li>This graph captures both the chemical components and their spatial relationships</li>
</ul>
</li>
<li>
<p><strong>Substructure-based Visual Molecular Fingerprint (SVMF)</strong>: The final output is a continuous, count-based fingerprint formally defined as a matrix $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$ (1,534 functional groups + 27 carbon backbones). The matrix is stored as a compressed upper triangular form:</p>
<p><strong>Diagonal elements</strong> ($i = j$): Weighted count of substructure $i$ plus self-intersection
$$SVMF_{ii}(m) = h_1 \cdot n_i + g_{ii}$$
where $h_1 = 10$ is the diagonal weight hyperparameter, $n_i$ is the instance count, and $g_{ii}$ is the self-intersection coefficient.</p>
<p><strong>Off-diagonal elements</strong> ($i \neq j$): Intersection coefficient based on shortest path distance $d$ in the substructure graph
$$SVMF_{ij}(m) = h_2(d) \cdot \text{intersection}(s_i, s_j)$$
where the distance decay function $h_2(d)$ is:</p>
<ul>
<li>$d \leq 1$: weight = 2</li>
<li>$d = 2$: weight = 2/4 = 0.5</li>
<li>$d = 3$: weight = 2/16 = 0.125</li>
<li>$d = 4$: weight = $2/256 \approx 0.0078$</li>
<li>$d &gt; 4$: weight = 0</li>
</ul>
<p><strong>Key properties</strong>:</p>
<ul>
<li>Carbon chain intersection coefficients are divided by 2, giving functional groups higher effective weight</li>
<li>Similarity between fingerprints calculated using a normalized Euclidean distance (ratio of L2 norm of difference to L2 norm of sum)</li>
<li>Resulting fingerprints are highly sparse (average 0.001% non-zero elements)</li>
<li>Compressed storage enables efficient database searches</li>
</ul>
</li>
<li>
<p><strong>Markush Structure Compatibility</strong>: SubGrapher processes Markush structures by recognizing their constituent functional groups and creating meaningful fingerprints for similarity searches, achieving higher accuracy than existing OCSR methods on the USPTO-Markush benchmark (S-F1: 88).</p>
</li>
</ol>
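<p>The SVMF construction above can be sketched with a toy number of substructure types. The decay table copies the listed $h_2(d)$ values and $h_1 = 10$; the inputs (<code>n</code> counts, <code>dist</code> graph distances, <code>g</code> intersection coefficients) are stand-ins for the segmentation outputs, and this sketch omits the halving of carbon-chain coefficients:</p>

```python
import numpy as np

H1 = 10.0  # diagonal weight hyperparameter from the paper
H2 = {0: 2.0, 1: 2.0, 2: 0.5, 3: 0.125, 4: 2.0 / 256}

def h2(d):
    """Distance-decay weight for substructures d hops apart in the graph."""
    return H2.get(d, 0.0)  # zero beyond d = 4

def svmf(n, dist, g):
    """Fill a toy SVMF-style matrix from substructure counts `n` (length k),
    graph distances `dist` (k x k), and intersection coefficients `g` (k x k).
    The real fingerprint uses k = 1561 types and compressed triangular storage."""
    k = len(n)
    m = np.zeros((k, k))
    for i in range(k):
        m[i, i] = H1 * n[i] + g[i, i]  # diagonal: weighted count + self-intersection
        for j in range(i + 1, k):
            m[i, j] = m[j, i] = h2(dist[i, j]) * g[i, j]
    return m

def similarity_distance(a, b):
    """Normalized Euclidean distance between two fingerprints:
    ||a - b|| / ||a + b||, as described for fingerprint comparison."""
    return np.linalg.norm(a - b) / np.linalg.norm(a + b)
```

<p>Because $h_2(d)$ vanishes beyond four hops and most substructure pairs never co-occur, the resulting matrices are extremely sparse, which is what makes the compressed storage and fast database search practical.</p>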
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating SubGrapher&rsquo;s effectiveness across two critical tasks: accurate substructure detection and robust molecule retrieval from diverse image collections.</p>
<h4 id="substructure-detection-performance">Substructure Detection Performance</h4>
<p>SubGrapher&rsquo;s ability to identify functional groups was tested on three challenging benchmarks that expose different failure modes of OCSR systems:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Description</th>
          <th>Key Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>JPO</strong></td>
          <td>341 images</td>
          <td>Japanese Patent Office images (molecules with abbreviations removed)</td>
          <td>Low quality, noise, artifacts, non-standard drawing styles</td>
      </tr>
      <tr>
          <td><strong>USPTO-10K-L</strong></td>
          <td>1,000 images</td>
          <td>Large molecules (&gt;70 atoms)</td>
          <td>Scale variation, structural complexity, many functional groups</td>
      </tr>
      <tr>
          <td><strong>USPTO-Markush</strong></td>
          <td>74 images</td>
          <td>Generic Markush structures</td>
          <td>Variable R-groups, abstract patterns, template representation</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings:</strong></p>
<ol>
<li>
<p><strong>JPO Dataset (Low-Quality Patent Images)</strong>: SubGrapher achieved the highest Molecule Exact Match rate (83%), demonstrating robustness to image quality degradation where rule-based methods like OSRA scored lower (67% M-EM).</p>
</li>
<li>
<p><strong>USPTO-10K-L (Large Molecules)</strong>: SubGrapher achieved an S-F1 of 97, matching the rule-based OSRA and outperforming all other learning-based methods (MolScribe: 90, DECIMER: 86, MolGrapher: 56). The object detection approach handled scale variation better than other deep-learning OCSR tools on these challenging targets.</p>
</li>
<li>
<p><strong>USPTO-Markush (Generic Structures)</strong>: SubGrapher achieved the highest Substructure F1-score (88) on this benchmark, outperforming MolScribe (86), OSRA (74), and DECIMER (10). While other OCSR tools can attempt these images, they have limited support for Markush features. SubGrapher&rsquo;s instance segmentation approach handles complex Markush structures more effectively by focusing on relevant image regions.</p>
</li>
</ol>
<p>Qualitative analysis revealed that SubGrapher correctly identified functional groups in scenarios where other methods failed completely: images with captions, unconventional drawing styles, or significant quality degradation.</p>
<h4 id="visual-fingerprinting-for-molecule-retrieval">Visual Fingerprinting for Molecule Retrieval</h4>
<p>The core application was evaluated using a retrieval task designed to simulate real-world database searching:</p>
<ol>
<li>
<p><strong>Benchmark Creation</strong>: Five benchmark datasets were constructed around structurally similar molecules (adenosine, camphor, cholesterol, limonene, and pyridine), each containing 500 molecules sampled from PubChem with at least 90% Tanimoto similarity to the reference molecule, rendered as augmented images.</p>
</li>
<li>
<p><strong>Retrieval Task</strong>: Given a SMILES string as a query, the goal was to find the corresponding molecular image within the dataset of 500 visually similar structures. This tests whether the visual fingerprint can distinguish between closely related molecules.</p>
</li>
<li>
<p><strong>Performance Comparison</strong>: SubGrapher significantly outperformed baseline methods, retrieving the correct molecule at an average rank of 95 out of 500. The key advantage was robustness: SubGrapher generates a unique fingerprint for every image, even with partial or uncertain predictions. In contrast, OCSR-based methods frequently fail to produce valid SMILES, making them unable to generate fingerprints for comparison.</p>
</li>
<li>
<p><strong>Real-World Case Study</strong>: A practical demonstration involved searching a 54-page patent document containing 356 chemical images for a specific Markush structure. SubGrapher successfully located the target structure, highlighting its utility for large-scale document mining.</p>
</li>
</ol>
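<p>The retrieval protocol above can be sketched as ranking database images by fingerprint similarity to the SMILES-derived query (a minimal dense illustration; the real fingerprints are the sparse SVMF matrices):</p>

```python
import numpy as np

def nes(a, b):
    # Normalized Euclidean similarity, as in the paper: 1 - ||a - b|| / ||a + b||.
    denom = np.linalg.norm(a + b)
    return 1.0 if denom == 0 else 1.0 - np.linalg.norm(a - b) / denom

def retrieval_rank(query_fp, database_fps, target_index):
    """1-based rank of the target image's fingerprint when the database
    is sorted by similarity to the query fingerprint (helper names are
    illustrative)."""
    sims = [nes(query_fp, fp) for fp in database_fps]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    return order.index(target_index) + 1
```

Averaging this rank over 50 queries per benchmark yields the reported average-rank metric (e.g., 95 out of 500 for SubGrapher).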
<h4 id="training-data-generation">Training Data Generation</h4>
<p>Since no public datasets existed with the required pixel-level mask annotations for functional groups, the researchers developed a comprehensive synthetic data generation pipeline:</p>
<ol>
<li>
<p><strong>Extended MolDepictor</strong>: They enhanced existing molecular rendering tools to create images from SMILES strings and generate corresponding segmentation masks for all substructures present in each molecule.</p>
</li>
<li>
<p><strong>Markush Structure Rendering</strong>: The pipeline was extended to handle complex generic structures using CXSMILES representations and the CDK library for rendering, creating training data for molecular templates with structural, positional, and frequency variation indicators.</p>
</li>
<li>
<p><strong>Diverse Molecular Sources</strong>: Training molecules were sourced from PubChem to ensure broad chemical diversity and coverage of different structural families.</p>
</li>
</ol>
<h2 id="results-impact-and-limitations">Results, Impact, and Limitations</h2>
<ul>
<li><strong>Superior Robustness to Image Quality</strong>: SubGrapher consistently outperformed traditional OCSR methods on degraded images, particularly the JPO patent dataset. SubGrapher&rsquo;s learned representations proved more resilient to noise, artifacts, and unconventional drawing styles than rule-based alternatives like OSRA (M-EM: 83 vs. 67 on JPO).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SubGrapher</th>
          <th>MolScribe</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>MolGrapher</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>S-F1</strong> (JPO)</td>
          <td>92</td>
          <td><strong>94</strong></td>
          <td>81</td>
          <td>86</td>
          <td>89</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (JPO)</td>
          <td><strong>83</strong></td>
          <td>82</td>
          <td>67</td>
          <td>79</td>
          <td>80</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-10K-L)</td>
          <td><strong>97</strong></td>
          <td>90</td>
          <td><strong>97</strong></td>
          <td>86</td>
          <td>56</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-10K-L)</td>
          <td>55</td>
          <td>55</td>
          <td><strong>75</strong></td>
          <td>66</td>
          <td>31</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-Markush)</td>
          <td><strong>88</strong></td>
          <td>86</td>
          <td>74</td>
          <td>10</td>
          <td>35</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-Markush)</td>
          <td>82</td>
          <td><strong>86</strong></td>
          <td>70</td>
          <td>11</td>
          <td>30</td>
      </tr>
      <tr>
          <td><strong>Avg Retrieval Rank</strong></td>
          <td><strong>95/500</strong></td>
          <td>181-241/500</td>
          <td>138-185/500</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<p>Note: Retrieval rank ranges reflect the best and worst fingerprint method pairing for each OCSR model (RDKit Daylight or MHFP).</p>
<ul>
<li>
<p><strong>Effective Handling of Scale and Complexity</strong>: The instance segmentation approach successfully managed large molecules and complex structures where traditional graph-reconstruction methods struggled. The Substructure F1-scores on USPTO-10K-L and USPTO-Markush benchmarks demonstrated clear advantages for challenging molecular targets.</p>
</li>
<li>
<p><strong>Markush Structure Processing</strong>: SubGrapher achieves the highest Substructure F1-score on Markush structures (88 vs. MolScribe&rsquo;s 86 and OSRA&rsquo;s 74). While other OCSR methods can attempt Markush images, they support only limited features such as abbreviation-based variable groups. SubGrapher handles complex Markush features more effectively, expanding the scope of automatically extractable chemical information from patent literature.</p>
</li>
<li>
<p><strong>Robust Molecule Retrieval Performance</strong>: The visual fingerprinting approach achieved reliable retrieval performance (average rank 95/500) across diverse molecular families. The key advantage was consistency: SubGrapher generates meaningful fingerprints even from partial or uncertain predictions, while OCSR-based methods often fail to produce any usable output.</p>
</li>
<li>
<p><strong>Practical Document Mining Capability</strong>: The successful identification of specific Markush structures within large patent documents (54 pages, 356 images) demonstrates real-world applicability for large-scale literature mining and intellectual property analysis.</p>
</li>
<li>
<p><strong>Single-Stage Architecture Benefits</strong>: By eliminating the traditional image → structure → fingerprint pipeline, SubGrapher avoids error accumulation from failed molecular reconstructions. Every input image produces a fingerprint, making the system more reliable for batch processing of diverse document collections.</p>
</li>
<li>
<p><strong>Limitations and Scope</strong>: The method remains focused on common organic functional groups and may struggle with inorganic chemistry, organometallic complexes, or highly specialized molecular classes not well-represented in the training data. The 1,534 functional groups, while extensive, represent a curated subset of chemical space. SubGrapher also cannot distinguish enantiomers, as the detected substructures lack stereochemistry information. Additionally, the method currently cannot recognize substructures in abbreviations or single-atom fragments.</p>
</li>
</ul>
<p>The work demonstrates that direct fingerprint generation can be more robust and practical than traditional structure reconstruction approaches. SubGrapher&rsquo;s ability to handle Markush structures and degraded images makes it particularly valuable for patent analysis and large-scale document mining, where traditional OCSR methods frequently fail. The approach suggests that task-specific learning (fingerprints for retrieval) can outperform general-purpose reconstruction methods in many practical applications.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data Generation</strong>: The paper developed a custom synthetic data pipeline since no public datasets existed with pixel-level mask annotations for functional groups:</p>
<ul>
<li><strong>Extended MolDepictor</strong>: Enhanced molecular rendering tool to generate both images and corresponding segmentation masks for all substructures</li>
<li><strong>Markush Structure Rendering</strong>: Pipeline extended to handle complex generic structures</li>
<li><strong>Source Molecules</strong>: PubChem for broad chemical diversity</li>
</ul>
<p><strong>Evaluation Benchmarks</strong>:</p>
<ul>
<li><strong>JPO Dataset</strong>: Real patent images with poor resolution, noise, and artifacts</li>
<li><strong>USPTO-10K-L</strong>: Large complex molecular structures</li>
<li><strong>USPTO-Markush</strong>: Generic structures with variable R-groups</li>
<li><strong>Retrieval Benchmarks</strong>: Five datasets (adenosine, camphor, cholesterol, limonene, pyridine), each with 500 similar molecular images</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Dual instance segmentation system using Mask-RCNN</p>
<ul>
<li><strong>Functional Group Detector</strong>: Mask-RCNN trained to identify 1,534 expert-defined functional groups</li>
<li><strong>Carbon Backbone Detector</strong>: Mask-RCNN trained to recognize 27 common carbon chain patterns</li>
<li><strong>Backbone Network</strong>: Not specified in the paper</li>
</ul>
<p><strong>Functional Group Knowledge Base</strong>: 1,534 substructures systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (at least ~1,000 occurrences in PubChem)</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Definition</strong>:</p>
<ul>
<li><strong>1,534 Functional Groups</strong>: Defined by manually curated SMARTS patterns
<ul>
<li>Must contain heteroatoms (O, N, S, P, B)</li>
<li>Frequency threshold: at least ~1,000 occurrences in PubChem</li>
<li>Systematically constructed from chemically logical atom combinations</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li><strong>27 Carbon Backbones</strong>: Patterns of 3-6 carbon atoms (rings and chains) to capture molecular scaffolds</li>
</ul>
<p><strong>Substructure-Graph Construction</strong>:</p>
<ol>
<li>Detect functional groups and carbon backbones using Mask-RCNN models</li>
<li>Build connectivity graph:
<ul>
<li>Each node represents an identified substructure instance</li>
<li>Edges connect substructures whose bounding boxes overlap</li>
<li>Bounding boxes expanded by 10% of smallest box&rsquo;s diagonal to ensure connectivity between adjacent groups</li>
<li>Carbon chain intersection coefficients divided by 2, giving functional groups higher effective weight</li>
</ul>
</li>
</ol>
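<p>A minimal sketch of the connectivity rule, assuming detections are simple <code>(label, box)</code> tuples and that the 10% margin is applied to every box (the paper states the margin relative to the smallest box's diagonal; how it is applied per pair is our reading):</p>

```python
import itertools
import math

def boxes_overlap(b1, b2):
    # Axis-aligned overlap test; boxes are (x0, y0, x1, y1).
    return b1[0] < b2[2] and b2[0] < b1[2] and b1[1] < b2[3] and b2[1] < b1[3]

def expand(box, margin):
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)

def build_substructure_graph(detections):
    """Edges connect detected substructures whose expanded bounding
    boxes overlap. detections: list of (label, (x0, y0, x1, y1))."""
    diagonals = [math.hypot(x1 - x0, y1 - y0)
                 for _, (x0, y0, x1, y1) in detections]
    margin = 0.10 * min(diagonals)  # 10% of the smallest box's diagonal
    edges = []
    for (i, (_, bi)), (j, (_, bj)) in itertools.combinations(
            enumerate(detections), 2):
        if boxes_overlap(expand(bi, margin), expand(bj, margin)):
            edges.append((i, j))
    return edges
```

The halving of carbon-chain intersection coefficients happens downstream, when edge weights enter the fingerprint, so it is not shown here.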
<p><strong>SVMF Fingerprint Generation</strong>:</p>
<ul>
<li>Matrix form: $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$</li>
<li>Stored as compressed sparse upper triangular matrix</li>
<li><strong>Diagonal elements</strong>: $SVMF_{ii} = h_1 \cdot n_i + g_{ii}$ where $h_1 = 10$</li>
<li><strong>Off-diagonal elements</strong>: $SVMF_{ij} = h_2(d) \cdot \text{intersection}(s_i, s_j)$ where:
<ul>
<li>$h_2(d) = 2$ for $d = 0, 1$</li>
<li>$h_2(2) = 2/4$, $h_2(3) = 2/16$, $h_2(4) = 2/256$</li>
<li>$h_2(d) = 0$ for $d &gt; 4$</li>
</ul>
</li>
<li>Average sparsity: 0.001% non-zero elements</li>
<li>Similarity metric: Normalized Euclidean distance (L2 norm of difference divided by L2 norm of sum)</li>
</ul>
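<p>The SVMF assembly can be sketched as below. The diagonal self-term $g_{ii}$ is omitted because its definition is not given here, so only the count term ($h_1 \cdot n_i$) and the distance-decayed off-diagonal entries are shown:</p>

```python
import numpy as np
from scipy.sparse import dok_matrix

H1 = 10.0  # diagonal count weight from the paper

def h2(d):
    # Distance decay from the paper: 2 for d in {0, 1}, then 2/4, 2/16,
    # 2/256, and 0 beyond graph distance 4.
    table = {0: 2.0, 1: 2.0, 2: 2.0 / 4, 3: 2.0 / 16, 4: 2.0 / 256}
    return table.get(d, 0.0)

def build_svmf(counts, pair_terms, n=1561):
    """Assemble a sparse upper-triangular SVMF (illustrative sketch).
    counts: {substructure_index: occurrence count n_i}
    pair_terms: {(i, j): (graph_distance, intersection_coefficient)}"""
    mat = dok_matrix((n, n))
    for i, n_i in counts.items():
        mat[i, i] = H1 * n_i  # self-intersection term g_ii omitted
    for (i, j), (d, inter) in pair_terms.items():
        mat[min(i, j), max(i, j)] += h2(d) * inter
    return mat.tocsr()
```

With $n = 1561$ (1,534 functional groups plus 27 carbon backbones) and only a handful of substructures per molecule, the resulting matrix is overwhelmingly zero, matching the reported 0.001% density.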
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Substructure F1-score (S-F1)</strong>: Harmonic mean of precision and recall for individual substructure detection across all molecules in the dataset</li>
<li><strong>Molecule Exact Match (M-EM)</strong>: Percentage of molecules where S-F1 = 1.0 (all substructures correctly identified)</li>
<li><strong>Retrieval Rank</strong>: Average rank of ground truth molecule in candidate list of 500 similar structures when querying with SMILES fingerprint, averaged across 50 queries per benchmark</li>
</ul>
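<p>A sketch of the two detection metrics, reading S-F1 as a multiset precision/recall over predicted substructure labels (the paper aggregates across the whole dataset; this per-molecule version is illustrative):</p>

```python
from collections import Counter

def substructure_f1(pred, true):
    """Per-molecule S-F1: multiset overlap of predicted vs. ground-truth
    substructure labels, combined as harmonic mean of precision/recall."""
    pred_c, true_c = Counter(pred), Counter(true)
    tp = sum((pred_c & true_c).values())  # multiset intersection
    if tp == 0:
        return 0.0
    precision = tp / sum(pred_c.values())
    recall = tp / sum(true_c.values())
    return 2 * precision * recall / (precision + recall)

def molecule_exact_match(pairs):
    """M-EM: fraction of (pred, true) pairs with a perfect S-F1 of 1.0."""
    return sum(substructure_f1(p, t) == 1.0 for p, t in pairs) / len(pairs)
```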
<p><strong>Baselines</strong>: Compared against SOTA OCSR methods:</p>
<ul>
<li>Deep learning: MolScribe, MolGrapher, DECIMER</li>
<li>Rule-based: OSRA</li>
<li>Fingerprint methods: RDKit Daylight, MHFP (applied to OCSR outputs)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified. Hardware details for training and inference are not provided in the main text; if reported at all, they would be in the supplementary materials.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/SubGrapher">SubGrapher (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official inference code</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/docling-project/SubGrapher">SubGrapher (HuggingFace)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/docling-project/SubGrapher-Datasets">SubGrapher-Datasets (HuggingFace)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Visual fingerprinting benchmark datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="implementation-gaps">Implementation Gaps</h3>
<p>The following details are not available in the paper and would require access to the code repository or supplementary information:</p>
<ul>
<li>Specific backbone architecture for Mask-RCNN (ResNet variant, Swin Transformer, etc.)</li>
<li>Optimizer type (AdamW, SGD, etc.)</li>
<li>Learning rate and scheduler</li>
<li>Batch size and number of training epochs</li>
<li>Loss function weights (box loss vs. mask loss)</li>
<li>GPU/TPU specifications used for training</li>
<li>Training time and computational requirements</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Meijer, G. I., Weber, V., Van Gool, L., &amp; Staar, P. W. J. (2025). SubGrapher: Visual fingerprinting of chemical structures. Journal of Cheminformatics, 17(1), 149. <a href="https://doi.org/10.1186/s13321-025-01091-4">https://doi.org/10.1186/s13321-025-01091-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{morinSubGrapherVisualFingerprinting2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{SubGrapher: Visual Fingerprinting of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{SubGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valéry and Van Gool, Luc and Staar, Peter W. J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{149}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-025-01091-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>3D Steerable CNNs: Rotationally Equivariant Features</title><link>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/3d-steerable-cnns/</link><pubDate>Thu, 16 Jan 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/3d-steerable-cnns/</guid><description>Weiler et al.'s NeurIPS 2018 paper introducing SE(3)-equivariant CNNs for volumetric data using group theory and spherical harmonics.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>method paper</strong> that introduces a novel neural network architecture, the 3D Steerable CNN. It provides a comprehensive theoretical derivation for the architecture grounded in group representation theory and demonstrates its practical application.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The work is motivated by the prevalence of <strong>symmetry</strong> in problems from the natural sciences. Standard 3D CNNs lack inherent equivariance to 3D rotations, a fundamental symmetry governed by the SE(3) group in many scientific datasets like molecular or protein structures. Building this symmetry directly into the model architecture as an <strong>inductive bias</strong> is expected to yield more data-efficient, generalizable, and physically meaningful models.</p>















<figure class="post-figure center ">
    <img src="/img/notes/3d-cnn-versus-3d-steerable-cnn.webp"
         alt="Comparison of standard 3D CNN versus 3D Steerable CNN for handling rotational symmetry. Panel A shows how standard CNNs produce distorted outputs when inputs are rotated, requiring data augmentation. Panel B shows how Steerable CNNs use spherical harmonic kernel bases to produce equivariant geometric field outputs that transform predictably under rotation."
         title="Comparison of standard 3D CNN versus 3D Steerable CNN for handling rotational symmetry. Panel A shows how standard CNNs produce distorted outputs when inputs are rotated, requiring data augmentation. Panel B shows how Steerable CNNs use spherical harmonic kernel bases to produce equivariant geometric field outputs that transform predictably under rotation."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Standard 3D CNNs (Panel A) produce inconsistent feature maps when inputs are rotated, requiring expensive data augmentation. 3D Steerable CNNs (Panel B) use analytically-derived spherical harmonic kernels to produce geometric field outputs that transform equivariantly under rotation.</figcaption>
    
</figure>

<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the rigorous and practical construction of a CNN architecture that is equivariant to 3D rigid body motions (SE(3) group). The key contributions are:</p>
<ul>
<li><strong>Geometric Feature Representation</strong>: Features are modeled as geometric <strong>fields</strong> (collections of scalars, vectors, and higher-order tensors) defined over $\mathbb{R}^{3}$. Each type of feature transforms according to an <strong>irreducible representation (irrep)</strong> of the rotation group SO(3).</li>
<li><strong>General Equivariant Convolution</strong>: The paper proves that the most general form of an SE(3)-equivariant linear map between these fields is a convolution with a <strong>rotation-steerable kernel</strong>.</li>
<li><strong>Analytical Kernel Basis</strong>: The main theoretical breakthrough is the analytical derivation of a complete basis for these steerable kernels. They solve the kernel&rsquo;s equivariance constraint, $\kappa(rx) = D^{j}(r)\kappa(x)D^{l}(r)^{-1}$, showing the solutions are functions whose angular components are <strong>spherical harmonics</strong>. The network&rsquo;s kernels are then parameterized as a learnable linear combination of these pre-computed basis functions, making the implementation a minor modification to standard 3D convolutions.</li>
</ul>
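<p>For intuition, the $l=1$ slice of the kernel basis (the three vector-type kernels, proportional to $x/r$, $y/r$, $z/r$ times a Gaussian radial shell) can be sampled on a voxel grid as below; the shell parameters are arbitrary illustrative choices, not the paper's:</p>

```python
import numpy as np

def radial_shell(r, mean, sigma=0.6):
    # Gaussian radial profile exp(-(|x| - m)^2 / (2 sigma^2)).
    return np.exp(-0.5 * ((r - mean) / sigma) ** 2)

def steerable_basis_l1(size=5, shell_mean=1.5):
    """Sample the three l=1 basis kernels on a size^3 voxel grid.
    A learned kernel is a linear combination of such basis elements;
    this sketch covers only the vector (l=1) angular part."""
    c = np.arange(size) - (size - 1) / 2.0
    x, y, z = np.meshgrid(c, c, c, indexing="ij")
    r = np.sqrt(x**2 + y**2 + z**2)
    safe_r = np.where(r == 0, 1.0, r)  # avoid division by zero at origin
    shell = radial_shell(r, shell_mean)
    # Real l=1 spherical harmonics are proportional to x/r, y/r, z/r.
    return np.stack([x / safe_r * shell,
                     y / safe_r * shell,
                     z / safe_r * shell])  # shape (3, size, size, size)
```

Each basis kernel is odd under reflection of its own axis, exactly as a p-orbital is, which is the grid-level fingerprint of its $l=1$ transformation behavior.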















<figure class="post-figure center ">
    <img src="/img/notes/spherical-harmonics.webp"
         alt="Spherical harmonics visualization showing the angular basis functions organized by degree l (rows) and order m (columns). Row 0 shows the single s-type orbital (l=0), row 1 shows three p-type orbitals (l=1), row 2 shows five d-type orbitals (l=2), and row 3 shows seven f-type orbitals (l=3)."
         title="Spherical harmonics visualization showing the angular basis functions organized by degree l (rows) and order m (columns). Row 0 shows the single s-type orbital (l=0), row 1 shows three p-type orbitals (l=1), row 2 shows five d-type orbitals (l=2), and row 3 shows seven f-type orbitals (l=3)."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Spherical harmonics $Y_l^m$ organized by degree $l$ (rows) and order $m$ (columns). These functions form the angular basis for steerable kernels: $l=0$ (scalar), $l=1$ (vector/p-orbital), $l=2$ (rank-2 tensor/d-orbital), $l=3$ (rank-3 tensor/f-orbital). Each degree $l$ has $2l+1$ components.</figcaption>
    
</figure>

<ul>
<li><strong>Equivariant Nonlinearity</strong>: A novel <strong>gated nonlinearity</strong> is proposed for non-scalar features. It preserves equivariance by multiplying a feature field by a separately computed, learned scalar field (the gate).</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The model&rsquo;s performance was evaluated on a series of tasks with inherent rotational symmetry:</p>
<ol>
<li><strong>Tetris Classification</strong>: A toy problem to empirically validate the model&rsquo;s rotational equivariance by training on aligned blocks and testing on randomly rotated ones.</li>
<li><strong>SHREC17 3D Model Classification</strong>: A benchmark for classifying complex 3D shapes that are arbitrarily rotated.</li>
<li><strong>Amino Acid Propensity Prediction</strong>: A scientific application to predict amino acid types from their 3D atomic environments.</li>
<li><strong>CATH Protein Structure Classification</strong>: A challenging task on a new dataset introduced by the authors, requiring classification of global protein architecture, a problem with full SE(3) invariance.</li>
</ol>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<p>The 3D Steerable CNN demonstrated clear advantages due to its built-in equivariance:</p>
<ul>
<li>It was empirically confirmed to be <strong>rotationally equivariant</strong>, achieving $99 \pm 2\%$ test accuracy on the rotated Tetris dataset (averaged over 17 runs), compared to a standard 3D CNN&rsquo;s $27 \pm 7\%$ accuracy.</li>
<li>On the amino acid prediction task, the model achieved 0.58 accuracy, compared to 0.50 (regular-grid) and 0.56 (concentric-grid) baselines, using roughly half the parameters. On SHREC17 it reached a total score (micro + macro mAP) of 1.11, compared to 1.13 for the leading contemporary system.</li>
<li>On the CATH protein classification task, it <strong>outperformed a deep 3D CNN baseline</strong> while using ~110x fewer parameters. This performance gap widened as the training data was reduced, highlighting the model&rsquo;s superior <strong>data efficiency</strong>.</li>
</ul>
<p>The paper concludes that 3D Steerable CNNs provide a universal and effective framework for incorporating SE(3) symmetry into deep learning models, leading to improved accuracy and efficiency for tasks involving volumetric data, particularly in scientific domains.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Input Format</strong>: All inputs must be voxelized. Point clouds require voxelization before use.
<ul>
<li><strong>Proteins (CATH)</strong>: $50^3$ grid, 0.2 nm voxel size. Simplified to $C_\alpha$ atoms only; Gaussian density placed at each atom position.</li>
<li><strong>3D Objects (SHREC17)</strong>: $64^3$ voxel grids.</li>
<li><strong>Tetris</strong>: $36^3$ input grid.</li>
</ul>
</li>
<li><strong>Splitting Strategy</strong>: CATH used a 10-fold split (7 train, 1 val, 2 test) strictly separated by &ldquo;superfamily&rdquo; level to prevent data leakage (&lt;40% sequence identity).</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Kernel Basis Construction</strong>:</p>
<ul>
<li>Constructed from <strong>Spherical Harmonics</strong> multiplied by <strong>Gaussian Radial Shells</strong>: $\exp\left(-\frac{1}{2}(|x|-m)^{2}/\sigma^{2}\right)$</li>
<li><strong>Anti-aliasing</strong>: A radially dependent angular frequency cutoff ($J_{\max}$) is applied to prevent aliasing near the origin.</li>
</ul>
<p><strong>Normalization</strong>: Uses <strong>Equivariant Batch Norm</strong>. Non-scalar fields are normalized by the average of their norms.</p>
<p><strong>Downsampling</strong>: Standard strided convolution breaks equivariance. The architecture uses <strong>low-pass filtering</strong> (Gaussian blur) before downsampling to maintain equivariance.</p>
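<p>A sketch of this antialiased downsampling for a scalar feature field (the <code>sigma = stride / 2</code> default is an illustrative choice, not the paper's stated value):</p>

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_then_stride(field, stride=2, sigma=None):
    """Downsample a scalar field by Gaussian low-pass filtering followed
    by subsampling, rather than bare strided convolution, so that high
    frequencies do not alias and break rotational equivariance."""
    if sigma is None:
        sigma = stride / 2.0  # heuristic: blur radius tied to stride
    return gaussian_filter(field, sigma=sigma)[::stride, ::stride, ::stride]
```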
<p><strong>Exact Architecture Configurations</strong>:</p>
<p><strong>Tetris Architecture</strong> (4 layers):</p>
<table>
  <thead>
      <tr>
          <th>Layer</th>
          <th>Field Types</th>
          <th>Spatial Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input</td>
          <td>1 scalar</td>
          <td>$36^3$</td>
      </tr>
      <tr>
          <td>Layer 1</td>
          <td>4 scalars, 4 vectors ($l=1$), 4 tensors ($l=2$), 1 tensor ($l=3$)</td>
          <td>$40^3$</td>
      </tr>
      <tr>
          <td>Layer 2</td>
          <td>16 scalars, 16 vectors, 16 tensors ($l=2$)</td>
          <td>$22^3$ (stride 2)</td>
      </tr>
      <tr>
          <td>Layer 3</td>
          <td>32 scalars, 16 vectors, 16 tensors ($l=2$)</td>
          <td>$13^3$ (stride 2)</td>
      </tr>
      <tr>
          <td>Layer 4</td>
          <td>128 scalars</td>
          <td>$17^3$</td>
      </tr>
      <tr>
          <td>Output</td>
          <td>8 scalars (global average pool)</td>
          <td>$1$</td>
      </tr>
  </tbody>
</table>
<p><strong>SHREC17 Architecture</strong> (8 layers):</p>
<table>
  <thead>
      <tr>
          <th>Layers</th>
          <th>Field Types</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1-2</td>
          <td>8 scalars, 4 vectors, 2 tensors ($l=2$)</td>
      </tr>
      <tr>
          <td>3-4</td>
          <td>16 scalars, 8 vectors, 4 tensors</td>
      </tr>
      <tr>
          <td>5-7</td>
          <td>32 scalars, 16 vectors, 8 tensors</td>
      </tr>
      <tr>
          <td>8</td>
          <td>512 scalars</td>
      </tr>
      <tr>
          <td>Output</td>
          <td>55 scalars (classes)</td>
      </tr>
  </tbody>
</table>
<p><strong>CATH Architecture</strong> (ResNet34-inspired):</p>
<p>Block progression: <code>(2,2,2,2)</code>, <code>(4,4,4,4)</code>, <code>(8,8,8,8)</code>, <code>(16,16,16,16)</code></p>
<p>Notation: <code>(a,b,c,d)</code> = $a$ scalars ($l=0$), $b$ vectors ($l=1$), $c$ rank-2 tensors ($l=2$), $d$ rank-3 tensors ($l=3$).</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Parameter Counts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CATH</td>
          <td>3D Steerable CNN</td>
          <td>143,560</td>
      </tr>
      <tr>
          <td>CATH</td>
          <td>Baseline (ResNet34-style)</td>
          <td>15,878,764</td>
      </tr>
      <tr>
          <td>Amino Acid</td>
          <td>3D Steerable CNN</td>
          <td>~32,600,000</td>
      </tr>
      <tr>
          <td>Amino Acid</td>
          <td>Regular grid baseline</td>
          <td>~61,100,000</td>
      </tr>
      <tr>
          <td>Amino Acid</td>
          <td>Concentric grid baseline</td>
          <td>~75,300,000</td>
      </tr>
  </tbody>
</table>
<p>Note: The concentric grid baseline groups voxels by distance from the molecular center, reflecting that atomic interactions are primarily distance-dependent (Torng, W., &amp; Altman, R. B. (2017). 3D deep convolutional neural networks for amino acid environment similarity analysis. <em>BMC Bioinformatics</em>, 18, 302). Amino acid parameter counts are rounded figures as reported in the paper.</p>
<p><strong>Hyperparameters &amp; Training</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>Adam</td>
      </tr>
      <tr>
          <td><strong>LR Scheduler</strong></td>
          <td>Exponential decay (0.94/epoch) after 40 epoch burn-in</td>
      </tr>
      <tr>
          <td><strong>Dropout</strong> (CATH)</td>
          <td>0.1 (Capsule-wide convolutional dropout)</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong> (CATH)</td>
          <td>L1 &amp; L2 regularization: $10^{-8.5}$</td>
      </tr>
  </tbody>
</table>
<p><strong>Mathematical Formulations for Equivariance</strong>:</p>
<p>Standard operations like Batch Normalization and ReLU break rotational equivariance. The paper derives equivariant alternatives:</p>
<p><strong>Equivariant Batch Normalization</strong>:</p>
<p>Standard BN subtracts a mean, which introduces a preferred direction and breaks symmetry. <strong>Norm-based normalization</strong> scales feature fields by the average of their squared norms to preserve symmetry:</p>
<p>$$f_{i}(x) \mapsto f_{i}(x) \left( \frac{1}{|B|} \sum_{j \in B} \frac{1}{V} \int dx |f_{j}(x)|^{2} + \epsilon \right)^{-1/2}$$</p>
<p>This scales vector lengths to unit variance on average while avoiding mean subtraction, preserving directional information and symmetry.</p>
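<p>In code, the norm-based normalization for a single field type might look like this (an inference-style NumPy sketch without running statistics):</p>

```python
import numpy as np

def equivariant_batch_norm(fields, eps=1e-5):
    """Norm-based normalization for one feature field of type l.
    fields: array of shape (batch, 2l+1, D, H, W). Divides by the square
    root of the batch- and space-averaged squared norm; no mean is
    subtracted, so no preferred direction is introduced."""
    sq_norm = (fields ** 2).sum(axis=1)    # |f_j(x)|^2 per voxel
    avg = sq_norm.mean(axis=(0, 1, 2, 3))  # average over batch and space
    return fields / np.sqrt(avg + eps)
```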
<p><strong>Equivariant Nonlinearities</strong>:</p>
<p>Applying ReLU to vector components independently breaks equivariance (this depends on the coordinate frame). Two approaches:</p>
<ol>
<li>
<p><strong>Norm Nonlinearity</strong> (geometric shrinking): Acts on magnitude, preserves direction. Shrinks vectors shorter than learned bias $\beta$ to zero:
$$f(x) \mapsto \text{ReLU}(|f(x)| - \beta) \frac{f(x)}{|f(x)|}$$
<em>Note: Found to converge slowly; omitted from final models.</em></p>
</li>
<li>
<p><strong>Gated Nonlinearity</strong> (used in practice): A separate scalar field $s(x)$ passes through sigmoid to create a gate $\sigma(s(x))$, which multiplies the geometric field:
$$f_{\text{out}}(x) = f_{\text{in}}(x) \cdot \sigma(s(x))$$
<em>Architecture implication: Requires extra scalar channels ($l=0$) specifically for gating higher-order channels ($l&gt;0$).</em></p>
</li>
</ol>
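<p>The gated variant can be sketched under the same assumed array layout, with a scalar ($l=0$) channel gating a vector ($l=1$) channel:</p>

```python
import numpy as np

def gated_nonlinearity(f_geom, s_scalar):
    """Rescale a geometric field by the sigmoid of a scalar gate field.

    f_geom:   (batch, dim, X, Y, Z) vector field (dim=3 for l=1)
    s_scalar: (batch, X, Y, Z) scalar (l=0) gate channel

    Only vector lengths change; directions are untouched, so the
    operation commutes with rotations.
    """
    gate = 1.0 / (1.0 + np.exp(-s_scalar))  # sigmoid
    return f_geom * gate[:, None]           # broadcast over the vector dim
```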
<p><strong>Voxelization Details</strong>:</p>
<p>For CATH protein inputs, Gaussian density is placed at each atom position with standard deviation equal to <strong>half the voxel width</strong> ($0.5 \times 0.2\text{ nm} = 0.1\text{ nm}$).</p>
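<p>The atom-to-voxel smearing can be sketched as follows. The grid size and the assumption that coordinates already lie inside the box are ours; the paper fixes only the 0.2 nm voxel width and the resulting 0.1 nm standard deviation:</p>

```python
import numpy as np

def voxelize(positions, grid_size=16, voxel_width=0.2):
    """Deposit a Gaussian of std = 0.5 * voxel_width at each atom.

    positions: (N, 3) coordinates in nm, assumed inside the grid box.
    Returns a (grid_size, grid_size, grid_size) density volume.
    """
    sigma = 0.5 * voxel_width                      # 0.1 nm for 0.2 nm voxels
    centers = (np.arange(grid_size) + 0.5) * voxel_width
    X, Y, Z = np.meshgrid(centers, centers, centers, indexing="ij")
    density = np.zeros((grid_size,) * 3)
    for x, y, z in positions:
        d2 = (X - x) ** 2 + (Y - y) ** 2 + (Z - z) ** 2
        density += np.exp(-d2 / (2 * sigma ** 2))
    return density
```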
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Steerable CNN</th>
          <th>Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tetris (rotated test)</td>
          <td>Accuracy</td>
          <td>$99 \pm 2\%$</td>
          <td>$27 \pm 7\%$ (standard 3D CNN)</td>
      </tr>
      <tr>
          <td>Amino Acid Propensity</td>
          <td>Accuracy</td>
          <td><strong>0.58</strong> (32.6M params)</td>
          <td>0.50 (regular grid, 61.1M params); 0.56 (concentric grid, 75.3M params)</td>
      </tr>
      <tr>
          <td>SHREC17</td>
          <td>micro + macro mAP (higher is better)</td>
          <td>1.11</td>
          <td>1.13 (SOTA)</td>
      </tr>
      <tr>
          <td>CATH</td>
          <td>Accuracy</td>
          <td>Higher accuracy at every training set size (per-size curves in Figure 4; no single aggregate value reported) (143,560 params)</td>
          <td>Deep 3D CNN (15,878,764 params; ~110x more)</td>
      </tr>
  </tbody>
</table>
<p>Note: On SHREC17, from Table 4 in the supplementary material, the steerable CNN achieves micro mAP = 0.661 and macro mAP = 0.449, for a total of 1.11. On CATH, the steerable CNN outperformed the baseline with ~110x fewer parameters, a gap that widened as training data was reduced.</p>
<h2 id="historical-context-from-peer-reviews">Historical Context (From Peer Reviews)</h2>
<p>The NeurIPS peer reviews reveal important context about the paper&rsquo;s structure and claims:</p>
<ul>
<li>
<p><strong>Evolution of Experiments</strong>: The <strong>SHREC17</strong> experiment and the <strong>arbitrary rotation</strong> test in Tetris were added during the rebuttal phase to address reviewer concerns about the lack of standard computer vision benchmarks. This explains why SHREC17 feels somewhat disconnected from the paper&rsquo;s &ldquo;AI for Science&rdquo; narrative.</p>
</li>
<li>
<p><strong>Continuous vs. Discrete Rotations</strong>: The Tetris experiment validates equivariance to <strong>continuous</strong> ($SO(3)$) rotations alongside discrete 90-degree turns. This distinction is crucial and separates Steerable CNNs from earlier Group CNNs that handled discrete rotation groups exclusively.</p>
</li>
<li>
<p><strong>Terminology Warning</strong>: Reviewers critiqued terms like &ldquo;fiber&rdquo; and &ldquo;induced representation&rdquo; as denser than necessary and inconsistent with related work (e.g., Tensor Field Networks). If you find Section 3 difficult, know that this is a recognized barrier of the paper; focus on the resulting kernel constraints.</p>
</li>
<li>
<p><strong>Parameter Efficiency Quantified</strong>: Reviewers highlighted that the main practical win is <strong>parameter efficiency</strong>. Standard 3D CNNs hit diminishing returns around $10^7$ parameters, while Steerable CNNs achieve better results with ~110x fewer parameters ($10^5$).</p>
</li>
</ul>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/ENLJACPHSEA?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mariogeiger/se3cnn">se3cnn (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation; superseded by <a href="https://github.com/e3nn/e3nn">e3nn</a> for point cloud applications</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wouterboomsma/cath_datasets">CATH Datasets (GitHub)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Protein structure classification dataset introduced in this paper</td>
      </tr>
  </tbody>
</table>
<p>Pre-trained model weights are not publicly released. Hardware and compute requirements are not specified in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weiler, M., Geiger, M., Welling, M., Boomsma, W., &amp; Cohen, T. S. (2018). 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. <em>Advances in Neural Information Processing Systems</em>, 31. <a href="https://proceedings.neurips.cc/paper/2018/hash/488e4104520c6aab692863cc1dba45af-Abstract.html">https://proceedings.neurips.cc/paper/2018/hash/488e4104520c6aab692863cc1dba45af-Abstract.html</a></p>
<p><strong>Publication</strong>: NeurIPS 2018</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{weiler20183d,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Weiler, Maurice and Geiger, Mario and Welling, Max and Boomsma, Wouter and Cohen, Taco S}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/mariogeiger/se3cnn">GitHub Repository</a></li>
<li><a href="https://www.youtube.com/watch?v=ENLJACPHSEA">YouTube Video</a></li>
<li><a href="https://github.com/wouterboomsma/cath_datasets">CATH Dataset</a></li>
</ul>
]]></content:encoded></item><item><title>RFL: Simplifying Chemical Structure Recognition (AAAI 2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/</link><pubDate>Thu, 19 Dec 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/</guid><description>Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD) for improved optical chemical structure recognition from molecular images.</description><content:encoded><![CDATA[<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological</strong> paper ($\Psi_{\text{Method}}$). It introduces a novel representation system (Ring-Free Language) and a specialized neural architecture (Molecular Skeleton Decoder) designed to solve specific limitations in converting 2D images to 1D chemical strings. The paper validates this method through direct comparison with existing baselines and ablation studies.</p>
<h2 id="motivation-limitations-of-1d-serialization">Motivation: Limitations of 1D Serialization</h2>
<p>Current Optical Chemical Structure Recognition (OCSR) methods typically rely on &ldquo;unstructured modeling,&rdquo; where 2D molecular graphs are flattened into 1D strings like SMILES or SSML. While simple, these linear formats struggle to explicitly capture complex spatial relationships, particularly in molecules with multiple rings and branches. End-to-end models often fail to &ldquo;understand&rdquo; the graph structure when forced to predict these implicit 1D sequences, leading to error accumulation in complex scenarios.</p>
<h2 id="innovation-ring-free-language-rfl-and-molecular-skeleton-decoder-msd">Innovation: Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD)</h2>
<p>The authors propose two primary contributions to decouple spatial complexity:</p>
<ol>
<li><strong>Ring-Free Language (RFL)</strong>: A divide-and-conquer representation that splits a molecular graph $G$ into three explicit components: a molecular skeleton $\mathcal{S}$, individual ring structures $\mathcal{R}$, and branch information $\mathcal{F}$. This allows rings to be collapsed into &ldquo;SuperAtoms&rdquo; or &ldquo;SuperBonds&rdquo; during initial parsing.</li>
<li><strong>Molecular Skeleton Decoder (MSD)</strong>: A hierarchical architecture that progressively predicts the skeleton first, then the individual rings (using SuperAtom features as conditions), and finally classifies the branch connections.</li>
</ol>
<h2 id="methodology-and-experiments">Methodology and Experiments</h2>
<p>The method was evaluated on both handwritten and printed chemical structures against two baselines: DenseWAP (Zhang et al. 2018) and RCGD (Hu et al. 2023).</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>EDU-CHEMC</strong>: ~49k handwritten samples (challenging, diverse styles)</li>
<li><strong>Mini-CASIA-CSDB</strong>: ~89k printed samples (from ChEMBL)</li>
<li><strong>Synthetic Complexity Dataset</strong>: A custom split of ChEMBL data grouped by structural complexity (atoms + bonds + rings) to test generalization</li>
</ul>
</li>
<li><strong>Ablation Studies</strong> (Table 2, on EDU-CHEMC with MSD-DenseWAP): Without MSD or <code>[conn]</code>, EM=38.70%. Adding <code>[conn]</code> alone raised EM to 44.02%. Adding MSD alone raised EM to 52.76%. Both together achieved EM=64.96%, confirming each component&rsquo;s contribution.</li>
</ul>
<h2 id="outcomes-and-conclusions">Outcomes and Conclusions</h2>
<ul>
<li><strong>New best results</strong>: MSD-RCGD achieved 65.39% EM on EDU-CHEMC (handwritten) and 95.23% EM on Mini-CASIA-CSDB (printed), outperforming the RCGD baseline (62.86% and 95.01%, respectively). MSD-DenseWAP surpassed the previous best on EDU-CHEMC by 2.06% EM (64.92% vs. 62.86%).</li>
<li><strong>Universal improvement</strong>: Applying MSD/RFL to DenseWAP improved its accuracy from 61.35% to 64.92% EM on EDU-CHEMC and from 92.09% to 94.10% EM on Mini-CASIA-CSDB, demonstrating the method is model-agnostic.</li>
<li><strong>Complexity handling</strong>: When trained on low-complexity molecules only (levels 1-2), MSD-DenseWAP still recognized higher-complexity unseen structures, while standard DenseWAP could hardly recognize them at all (Figure 6 in the paper).</li>
</ul>
<p>The authors note that this is the first end-to-end solution that decouples and models chemical structures in a structured form. Future work aims to extend structured modeling to other tasks such as tables, flowcharts, and diagrams.</p>
<hr>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/JingMog/RFL-MSD">RFL-MSD</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized one handwritten and one printed dataset, plus a synthetic set for stress-testing complexity.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>EDU-CHEMC</strong></td>
          <td>48,998 Train / 2,992 Test</td>
          <td>Handwritten images from educational scenarios</td>
      </tr>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>Mini-CASIA-CSDB</strong></td>
          <td>89,023 Train / 8,287 Test</td>
          <td>Printed images rendered from ChEMBL using RDKit</td>
      </tr>
      <tr>
          <td><strong>Generalization</strong></td>
          <td><strong>ChEMBL Subset</strong></td>
          <td>5 levels of complexity</td>
          <td>Custom split by complexity score $N_{\text{atom}} + N_{\text{bond}} + 12 \times N_{\text{ring}}$</td>
      </tr>
  </tbody>
</table>
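<p>The complexity score used for that split is a direct transcription of the formula above:</p>

```python
def complexity_score(n_atoms: int, n_bonds: int, n_rings: int) -> int:
    """Structural complexity used to bin ChEMBL molecules into five
    difficulty levels: N_atom + N_bond + 12 * N_ring."""
    return n_atoms + n_bonds + 12 * n_rings
```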
<h3 id="algorithms">Algorithms</h3>
<p><strong>RFL Splitting (Encoding)</strong>:</p>
<ol>
<li><strong>Detect Rings</strong>: Use DFS to find all non-nested rings $\mathcal{R}$.</li>
<li><strong>Determine Adjacency ($\gamma$)</strong>: Calculate shared edges between rings.</li>
<li><strong>Merge</strong>:
<ul>
<li>If $\gamma(r_i) = 0$ (isolated), merge ring into a <strong>SuperAtom</strong> node.</li>
<li>If $\gamma(r_i) &gt; 0$ (adjacent), merge ring into a <strong>SuperBond</strong> edge.</li>
</ul>
</li>
<li><strong>Update</strong>: Record connection info in $\mathcal{F}$ and remove ring details from the main graph to form Skeleton $\mathcal{S}$.</li>
</ol>
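<p>Step 3's isolated-vs-adjacent decision reduces to a shared-edge check. A minimal sketch, with ring detection assumed already done and $\gamma$ simplified to a boolean test:</p>

```python
def classify_rings(rings):
    """Classify detected rings as SuperAtoms (isolated) or SuperBonds.

    rings: list of rings, each a list of atom indices in cycle order.
    A ring sharing at least one edge with another ring is merged as a
    SuperBond; otherwise it collapses to a SuperAtom node.
    """
    def edges(ring):
        n = len(ring)
        # undirected edges of the cycle, as frozensets for order-free comparison
        return {frozenset((ring[i], ring[(i + 1) % n])) for i in range(n)}

    edge_sets = [edges(r) for r in rings]
    labels = []
    for i, ei in enumerate(edge_sets):
        shared = any(ei & ej for j, ej in enumerate(edge_sets) if j != i)
        labels.append("SuperBond" if shared else "SuperAtom")
    return labels
```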
<p><strong>MSD Decoding</strong>:</p>
<ul>
<li><strong>Hierarchical Prediction</strong>: The model predicts the Skeleton $\mathcal{S}$ first.</li>
<li><strong>Contextual Ring Prediction</strong>: When a SuperAtom/Bond token is predicted, its hidden state $f^s$ is stored. After the skeleton is finished, $f^s$ is used as a condition to autoregressively decode the specific ring structure.</li>
<li><strong>Token <code>[conn]</code></strong>: A special token separates connected ring bonds from unconnected ones to sparsify the branch classification task.</li>
</ul>
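<p>The two-stage decoding loop above can be sketched as follows. This is a hypothetical control-flow skeleton, not the paper's implementation: <code>decoder_step</code> stands in for the attention-GRU cell, and its interface is our assumption:</p>

```python
def msd_decode(decoder_step, context, super_tokens, end_token, max_len=64):
    """Hypothetical sketch of MSD's hierarchical decoding.

    decoder_step(prev_token, hidden, context) -> (token, hidden) is an
    assumed interface standing in for the attention GRU. Stage 1 decodes
    the skeleton, caching the hidden state f^s at each SuperAtom/SuperBond;
    stage 2 re-enters the decoder once per cached state to expand a ring.
    """
    skeleton, cached = [], []
    token, hidden = "<s>", None
    for _ in range(max_len):                      # stage 1: skeleton
        token, hidden = decoder_step(token, hidden, context)
        if token == end_token:
            break
        skeleton.append(token)
        if token in super_tokens:
            cached.append(hidden)                 # store f^s for this ring

    rings = []
    for f_s in cached:                            # stage 2: ring expansion
        ring, token, hidden = [], "<ring>", f_s
        for _ in range(max_len):
            token, hidden = decoder_step(token, hidden, context)
            if token == end_token:
                break
            ring.append(token)
        rings.append(ring)
    return skeleton, rings
```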
<h3 id="models">Models</h3>
<p>The architecture follows a standard Image-to-Sequence pattern but with a forked decoder.</p>
<ul>
<li><strong>Encoder</strong>: DenseNet (Growth rate=24, Depth=32 per block)</li>
<li><strong>Decoder (MSD)</strong>:
<ul>
<li><strong>Core</strong>: GRU with Attention (Hidden dim=256, Embedding dim=256, Dropout=0.15)</li>
<li><strong>Skeleton Module</strong>: Autoregressively predicts sequence tokens. Uses Maxout activation.</li>
<li><strong>Branch Module</strong>: A binary classifier (MLP) taking concatenated features of skeleton bonds $f_{bs}$ and ring bonds $f_{br}$ to predict connectivity matrix $\mathcal{F}$.</li>
</ul>
</li>
<li><strong>Loss Function</strong>: $\mathcal{O} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{cls}$ (where $\lambda_1 = \lambda_2 = 1$)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on exact image reconstruction and structural validity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>EM (Exact Match)</strong></td>
          <td>% of images where predicted graph exactly matches ground truth.</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td><strong>Struct-EM</strong></td>
          <td>% of correctly identified chemical structures (ignoring non-chemical text).</td>
          <td>Auxiliary metric</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 (32GB RAM)</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch size: 8 (Handwritten), 32 (Printed)</li>
<li>Epochs: 50</li>
<li>Optimizer: Adam ($\text{lr} = 2\times10^{-4}$, decayed by a factor of 0.5 via MultiStepLR)</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, Q., Chen, M., Pi, C., Hu, P., Zhang, Z., Ma, J., Du, J., Yin, B., &amp; Hu, J. (2025). RFL: Simplifying Chemical Structure Recognition with Ring-Free Language. In <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(2), 2007-2015. <a href="https://doi.org/10.1609/aaai.v39i2.32197">https://doi.org/10.1609/aaai.v39i2.32197</a></p>
<p><strong>Publication</strong>: AAAI 2025 (Oral)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/JingMog/RFL-MSD">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{changRFLSimplifyingChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{RFL: Simplifying Chemical Structure Recognition with Ring-Free Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{RFL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Qikai and Chen, Mingjun and Pi, Changpeng and Hu, Pengfei and Zhang, Zhenrong and Ma, Jiefeng and Du, Jun and Yin, Baocai and Hu, Jinshui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2007--2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2412.07594}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1609/aaai.v39i2.32197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>