<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Datasets on Hunter Heidenreich | Senior AI Research Scientist</title><link>https://hunterheidenreich.com/tags/datasets/</link><description>Recent content in Datasets on Hunter Heidenreich | Senior AI Research Scientist</description><image><title>Hunter Heidenreich | Senior AI Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Mon, 01 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/datasets/index.xml" rel="self" type="application/rss+xml"/><item><title>VQM24: 836k Molecules at DFT and Diffusion QMC</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/vqm24/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/vqm24/</guid><description>Dataset card for VQM24, providing DFT and diffusion QMC properties for 836k exhaustively enumerated small molecules across 9 elements.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>VQM24 (Vector-QM24) is the first exhaustive quantum mechanical dataset covering all possible neutral closed-shell small molecules with up to five heavy atoms from nine p-block elements (C, N, O, F, Si, P, S, Cl, Br). It provides DFT-level properties for all 836k structures and <a href="https://en.wikipedia.org/wiki/Diffusion_Monte_Carlo">diffusion quantum Monte Carlo</a> (DMC) energies for a 10,793-molecule subset, constituting the largest QMC dataset in chemical space to date. ML benchmarking reveals that VQM24 is significantly more challenging than <a href="/notes/chemistry/datasets/qm9/">QM9</a> despite containing smaller molecules.</p>
<h2 id="overview">Overview</h2>
<p>Most existing QM datasets (QM7, QM9, ANI-1x) are derived from string-based molecular lists and are restricted to a few elements (typically CHONF), introducing selection bias and limiting ML model generalizability. VQM24 addresses this by exhaustively enumerating all valid stoichiometries, <a href="https://en.wikipedia.org/wiki/Lewis_structure">Lewis-rule-consistent</a> graphs, and stable conformers for molecules composed of 9 elements with their most common valencies:</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Valencies</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>C</td>
          <td>4</td>
      </tr>
      <tr>
          <td>N</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>O</td>
          <td>2</td>
      </tr>
      <tr>
          <td>F</td>
          <td>1</td>
      </tr>
      <tr>
          <td>Si</td>
          <td>4</td>
      </tr>
      <tr>
          <td>P</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>S</td>
          <td>2, 4, 6</td>
      </tr>
      <tr>
          <td>Cl</td>
          <td>1</td>
      </tr>
      <tr>
          <td>Br</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Heavy Atoms</th>
          <th>Stoichiometries</th>
          <th>Graphs</th>
          <th>Geometries</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>9</td>
          <td>9</td>
          <td>9</td>
      </tr>
      <tr>
          <td>2</td>
          <td>69</td>
          <td>69</td>
          <td>81</td>
      </tr>
      <tr>
          <td>3</td>
          <td>367</td>
          <td>766</td>
          <td>1,287</td>
      </tr>
      <tr>
          <td>4</td>
          <td>1,321</td>
          <td>10,992</td>
          <td>29,581</td>
      </tr>
      <tr>
          <td>5</td>
          <td>3,793</td>
          <td>246,406</td>
          <td>753,917</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>5,559</strong></td>
          <td><strong>258,242</strong></td>
          <td><strong>784,875</strong> (minima)</td>
      </tr>
  </tbody>
</table>
<p>Including saddle points, the full dataset contains 835,947 converged structures. Extrapolation suggests ~33 million geometries at 6 heavy atoms.</p>
<h2 id="generation-pipeline">Generation Pipeline</h2>
<ol>
<li><strong>Stoichiometry enumeration</strong>: All combinations of up to 5 heavy atoms from the 13 element/valency types, with hydrogen counts determined by integer partitioning of total valency</li>
<li><strong>Graph generation</strong>: <a href="https://en.wikipedia.org/wiki/Structural_isomer">Constitutional isomers</a> enumerated using <a href="/notes/chemistry/molecular-design/chemical-space/surge-chemical-graph-generator/">Surge</a> for each stoichiometry</li>
<li><strong>Geometry initialization</strong>: RDKit <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94</a> force field generates initial 3D coordinates</li>
<li><strong>Semi-empirical optimization</strong>: GFN2-xTB geometry optimization</li>
<li><strong>Conformer search</strong>: CREST identifies conformational isomers (~1.1M initial geometries)</li>
<li><strong>DFT optimization</strong>: Three-pass $\omega$B97X-D3/cc-pVDZ optimization in PSI4 v1.7, all using Gaussian Tight convergence criteria with density fitting (cc-pVDZ-JKFIT auxiliary basis):
<ul>
<li><strong>Pass 1</strong>: Default PSI4 settings (DIIS for SCF, RFO optimizer in redundant internal coordinates), max 100 steps</li>
<li><strong>Pass 2</strong>: SOSCF with full Newton step, ultrafine Lebedev-Treutler grid (590 spherical, 99 radial points), max 100 steps</li>
<li><strong>Pass 3</strong>: Full Hessian evaluation at initial geometry and every 20th step, Cartesian coordinates, max 50 steps</li>
</ul>
</li>
<li><strong>DMC calculations</strong>: For 10,793 lowest-energy conformers with up to 4 heavy atoms, using QMCPACK with PBE0/ccECP/cc-pVQZ trial wavefunctions. Slater-Jastrow trial wavefunctions with Jastrow terms for 1-body (16 params/atom type, 8 Bohr cutoff), 2-body (20 params/spin-channel, 10 Bohr cutoff), and 3-body (26 params, 5 Bohr cutoff) interactions. DMC used a timestep of 0.001 a.u., 16,000 walkers, and 1,500 blocks of 40 imaginary time steps. ccECP pseudopotentials with the determinant-localization approximation and t-moves (DLTM) handled core electrons.</li>
</ol>
<p>The $\omega$B97X-D3 functional was chosen for its strong GMTKN55 benchmark performance and for compatibility with ANI-1, ANI-1x, OrbNet Denali, QMugs, SPICE, and MultiXC-QM9, all of which use $\omega$B97X variants with double-zeta basis sets. This enables transfer learning across datasets.</p>
<h2 id="data-files-and-access">Data Files and Access</h2>
<p>The Zenodo dataset contains separate .npz files, loadable via NumPy:</p>
<table>
  <thead>
      <tr>
          <th>File</th>
          <th>Contents</th>
          <th>Molecules</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>DFT_all.npz</code></td>
          <td>DFT properties for all conformational minima</td>
          <td>784,875</td>
      </tr>
      <tr>
          <td><code>DFT_uniques.npz</code></td>
          <td>DFT properties for constitutional isomers (most stable conformer)</td>
          <td>258,242</td>
      </tr>
      <tr>
          <td><code>DFT_saddles.npz</code></td>
          <td>DFT properties for saddle point structures</td>
          <td>51,072</td>
      </tr>
      <tr>
          <td><code>DMC.npz</code></td>
          <td>DMC total energies and error bars</td>
          <td>10,793</td>
      </tr>
      <tr>
          <td><code>wavefunctions.tar.gz</code></td>
          <td>Wavefunction .molden files (includes MO energies)</td>
          <td>~106.7 GB</td>
      </tr>
  </tbody>
</table>
<p>All molecules are ordered consistently across every array within a file. Properties are accessed by key:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span>data <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>load(<span style="color:#e6db74">&#39;DFT_all.npz&#39;</span>, allow_pickle<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>print(data<span style="color:#f92672">.</span>files)  <span style="color:#75715e"># list all available properties</span>
</span></span><span style="display:flex;"><span>freqs <span style="color:#f92672">=</span> data[<span style="color:#e6db74">&#39;freqs&#39;</span>]  <span style="color:#75715e"># vibrational frequencies</span>
</span></span></code></pre></div><h2 id="computed-properties">Computed Properties</h2>
<p>DFT ($\omega$B97X-D3/cc-pVDZ) properties and their NPZ access keys:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Unit</th>
          <th>Key</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total energies</td>
          <td>Ha</td>
          <td><code>Etot</code></td>
      </tr>
      <tr>
          <td>Internal energies</td>
          <td>Ha</td>
          <td><code>U0</code></td>
      </tr>
      <tr>
          <td>Atomization energies</td>
          <td>Ha</td>
          <td><code>Eatomization</code></td>
      </tr>
      <tr>
          <td>Electron-electron energies</td>
          <td>Ha</td>
          <td><code>Eee</code></td>
      </tr>
      <tr>
          <td>Exchange-correlation energies</td>
          <td>Ha</td>
          <td><code>Exc</code></td>
      </tr>
      <tr>
          <td>Dispersion energy</td>
          <td>Ha</td>
          <td><code>Edisp</code></td>
      </tr>
      <tr>
          <td>HOMO-LUMO gap</td>
          <td>Ha</td>
          <td><code>gap</code></td>
      </tr>
      <tr>
          <td>Dipole moments</td>
          <td>a.u.</td>
          <td><code>dipole</code></td>
      </tr>
      <tr>
          <td>Quadrupole moments</td>
          <td>a.u.</td>
          <td><code>quadrupole</code></td>
      </tr>
      <tr>
          <td>Octupole moments</td>
          <td>a.u.</td>
          <td><code>octupole</code></td>
      </tr>
      <tr>
          <td>Hexadecapole moments</td>
          <td>a.u.</td>
          <td><code>hexadecapole</code></td>
      </tr>
      <tr>
          <td>Rotational constants</td>
          <td>MHz</td>
          <td><code>rots</code></td>
      </tr>
      <tr>
          <td>Vibrational modes</td>
          <td>Å</td>
          <td><code>vibmodes</code></td>
      </tr>
      <tr>
          <td>Vibrational frequencies</td>
          <td>cm$^{-1}$</td>
          <td><code>freqs</code></td>
      </tr>
      <tr>
          <td>Gibbs free energy (H)</td>
          <td>Ha</td>
          <td><code>G</code></td>
      </tr>
      <tr>
          <td>Internal (thermal) energy (H)</td>
          <td>Ha</td>
          <td><code>U298</code></td>
      </tr>
      <tr>
          <td>Enthalpy (H)</td>
          <td>Ha</td>
          <td><code>H</code></td>
      </tr>
      <tr>
          <td>ZPVE (H)</td>
          <td>Ha</td>
          <td><code>zpves</code></td>
      </tr>
      <tr>
          <td>Entropy (H)</td>
          <td>cal/mol K</td>
          <td><code>S</code></td>
      </tr>
      <tr>
          <td>Heat capacities (H)</td>
          <td>cal/mol K</td>
          <td><code>Cv</code>, <code>Cp</code></td>
      </tr>
      <tr>
          <td>Electrostatic potentials at nuclei</td>
          <td>a.u.</td>
          <td><code>Vesp</code></td>
      </tr>
      <tr>
          <td>Mulliken charges</td>
          <td>a.u.</td>
          <td><code>Qmulliken</code></td>
      </tr>
      <tr>
          <td>SMILES</td>
          <td></td>
          <td><code>graphs</code></td>
      </tr>
      <tr>
          <td>InChI strings</td>
          <td></td>
          <td><code>inchi</code></td>
      </tr>
  </tbody>
</table>
<p>(H) indicates thermodynamic properties computed via the harmonic approximation. Molecular orbital energies are available in the wavefunction .molden files.</p>
<p>DMC properties (<code>DMC.npz</code>) include total energy (<code>Etot</code>) and statistical error bar (<code>std</code>) for each molecule.</p>
<p>DMC energies (PBE0/ccECP/cc-pVQZ nodal surfaces, Slater-Jastrow trial wavefunctions) achieve average statistical uncertainty of 0.4 mHa across ~2.3 billion samples per molecule.</p>
<h2 id="ml-benchmarking-harder-than-qm9">ML Benchmarking: Harder Than QM9</h2>
<p>Learning curves for atomization energy prediction show that VQM24 is substantially more challenging than QM9 for all tested models:</p>
<ul>
<li>KRR models (CM, ACSF, LMBTR, FCHL19, cMBDF) and GNNs (SchNet, PaiNN) all show up to ~8x larger mean errors on VQM24 than QM9 at the same training set size</li>
<li>None of the tested models achieve chemical accuracy (1 kcal/mol) on VQM24, even with 128k training molecules</li>
<li>The atomization energy range in VQM24 (1,545 kcal/mol) is smaller than QM9 (2,427 kcal/mol), so the higher errors reflect greater chemical diversity rather than a wider property range</li>
<li>For a fair comparison with QM9 (which has no conformational isomers), learning curves use only the 258k unique constitutional isomers from VQM24</li>
</ul>
<p><strong>Benchmark methodology</strong>: KRR models use an atomic Gaussian kernel with hyperparameters (length-scale $l$, regularizer $\lambda$) optimized via grid search and 5-fold cross-validation. Both GNNs (SchNet, PaiNN) use 128 atomic basis functions (589k total parameters), trained for 1,000 epochs with Adam (lr = $10^{-4}$). Test set size is 10,000 randomly selected molecules, with results averaged over 5 runs. Training and evaluation scripts are available in the <a href="https://github.com/dkhan42/VQM24">GitHub repository</a>.</p>
<p>Prediction error analysis with the best KRR model (cMBDF, trained on 200k across 4 disjoint training sets on all 784,875 equilibrium geometries) yields an overall MAE of 0.75 kcal/mol (standard deviation 1.55 kcal/mol). The largest individual error reaches 167.3 kcal/mol, and the 25 largest outliers have a mean absolute error of 85.9 kcal/mol.</p>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths</strong>:</p>
<ul>
<li>Exhaustive coverage of 1-5 heavy atom chemical space across 9 elements</li>
<li>Both DFT and DMC-level data (largest QMC dataset in chemical space)</li>
<li>Includes conformational isomers (average 3 per constitutional isomer)</li>
<li>Extensive property set including wavefunctions and multipole moments up to hexadecapole</li>
<li>More challenging ML benchmark than QM9, exposing model limitations</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li>Limited to 5 heavy atoms (very small molecules)</li>
<li>262,542 structures (~24%) failed DFT convergence, with a strong silicon bias in failures</li>
<li>51,072 structures converged to saddle points rather than minima</li>
<li>DMC subset limited to 4 heavy atoms (10,793 molecules)</li>
<li>Does not include metals, rare gases, or heavier halogens (I)</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible</strong></p>
<p>The paper, dataset, and code are all publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://zenodo.org/records/15442257">VQM24 Dataset (Zenodo)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>DFT .npz files + DMC .npz + wavefunction tarball (~108 GB total)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dkhan42/VQM24">dkhan42/VQM24 (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Generation tools, PSI4 templates, KRR and GNN training scripts</td>
      </tr>
      <tr>
          <td><a href="https://arxiv.org/abs/2405.05961">arXiv preprint</a></td>
          <td>Paper</td>
          <td>arXiv license</td>
          <td>Open-access preprint of the Scientific Data article</td>
      </tr>
  </tbody>
</table>
<p><strong>Software stack</strong>: Surge (graph enumeration), RDKit/MMFF94 (initial geometries), GFN2-xTB (semi-empirical optimization), CREST (conformer search), PSI4 v1.7 (DFT), PySCF (trial wavefunctions), QMCPACK (DMC), QMLcode (KRR models), SchNetPack (GNN models).</p>
<p><strong>Hardware requirements</strong>:</p>
<ul>
<li>DFT: Three-pass $\omega$B97X-D3/cc-pVDZ optimization in PSI4 (compute details not specified per-molecule for DFT)</li>
<li>DMC trial wavefunctions: Argonne LCRC Improv, single node (2x AMD EPYC 7713, 64 cores, 2 GHz), ~45 seconds per molecule, ~134 node-hours total</li>
<li>DMC calculations: Argonne Polaris HPC (AMD EPYC 7543P, 64 cores, 2.8 GHz), 20 nodes per molecule, ~15 minutes each, ~54,000 node-hours total</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{khan2025quantum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Quantum mechanical dataset of 836k neutral closed-shell molecules
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">         with up to 5 heavy atoms from C, N, O, F, Si, P, S, Cl, Br}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Khan, Danish and Benali, Anouar and Kim, Scott Y. H.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">          and von Rudorff, Guido Falk and von Lilienfeld, O. Anatole}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1551}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Portfolio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41597-025-05428-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>VEHICLe: Heteroaromatic Rings of the Future</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/vehicle-heteroaromatic-rings/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/vehicle-heteroaromatic-rings/</guid><description>Pitt et al. enumerate all 24,867 possible small heteroaromatic ring systems and predict over 3,000 novel synthetically tractable candidates.</description><content:encoded><![CDATA[<h2 id="exhaustive-enumeration-of-heteroaromatic-ring-systems">Exhaustive Enumeration of Heteroaromatic Ring Systems</h2>
<p>VEHICLe (Virtual Exploratory Heterocyclic Library) is a complete enumeration of all possible heteroaromatic ring systems under a set of constraints designed to capture the ring types most relevant to medicinal chemistry. The library contains 24,867 ring systems (23,895 after collapsing tautomers), yet only 1,701 of these have ever appeared in published compounds across databases totaling over 10 million molecules. The authors use this complete library to predict which unsynthesized ring systems could plausibly be made and to challenge organic chemists to conquer them.</p>
<h2 id="why-heteroaromatic-rings-matter-for-drug-design">Why Heteroaromatic Rings Matter for Drug Design</h2>
<p>Heteroaromatic rings are central to synthetic bioactive small molecules for several reasons: they bind proteins efficiently through shape and hydrophobicity, their rigidity combined with heteroatom hydrogen bonding provides target selectivity, they support parallelizable coupling reactions (<a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki</a>, <a href="https://en.wikipedia.org/wiki/Stille_reaction">Stille</a>) for rapid <a href="https://en.wikipedia.org/wiki/Structure%E2%80%93activity_relationship">SAR</a> exploration, multiple substitution positions can be explored without introducing stereocenters, and unusual ring systems or substitution patterns provide patent novelty. These advantages come with tradeoffs: low aqueous solubility, restricted SAR from rigidity, tendency toward molecular bloat during optimization, and difficulty achieving patent novelty with well-explored ring systems.</p>
<h2 id="vehicle-construction">VEHICLe Construction</h2>
<p>The library is built through a simple combinatorial pipeline implemented in Pipeline Pilot (Accelrys Software Inc.) that runs in about 3 minutes on a single-core 3 GHz Intel Xeon workstation:</p>
<ol>
<li><strong>Building blocks</strong>: Six atomic units (C, N, O, S variants with appropriate bond types) serve as starting materials.</li>
<li><strong>Chain formation</strong>: Building blocks are combined into all possible chains of length 5 and 6 using two bond-forming rules (single and double bond).</li>
<li><strong>Ring closure</strong>: Chains are closed into five- and six-membered rings using three closure rules. Only rings satisfying <a href="https://en.wikipedia.org/wiki/H%C3%BCckel%27s_rule">Hückel&rsquo;s</a> $4n + 2$ aromaticity rule are retained.</li>
<li><strong>Ring fusion</strong>: Monocyclic rings are fused pairwise into all possible bicyclic combinations using four fusion rules. Aromatic bicycles are retained.</li>
</ol>
<p>The enumeration constraints are: mono- and bicyclic rings only, five- and six-membered rings only, atoms restricted to C, N, O, S, and H, all neutral, all aromatic by Hückel&rsquo;s rule, and only exocyclic carbonyls allowed. Including the carbonyl building block expands the library from 2,986 to 24,867 ring systems. Within this count, 1,744 tautomeric pairs exist in 772 clusters. Building blocks are input as MDL mol files, chains formed using MDL REACCS rxn format reactions, and duplicates removed by <a href="/notes/chemistry/molecular-representations/notations/smiles/">canonical SMILES</a> comparison.</p>
<p>The following table summarizes VEHICLe ring system coverage across the compound datasets used for analysis:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: right">Molecules</th>
          <th style="text-align: right">Distinct Ring Systems</th>
          <th style="text-align: right">VEHICLe Rings</th>
          <th style="text-align: right">VEHICLe %</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Launched + Phases II/III</td>
          <td style="text-align: right">2,461</td>
          <td style="text-align: right">950</td>
          <td style="text-align: right">120</td>
          <td style="text-align: right">13%</td>
      </tr>
      <tr>
          <td>Phase I</td>
          <td style="text-align: right">730</td>
          <td style="text-align: right">494</td>
          <td style="text-align: right">86</td>
          <td style="text-align: right">17%</td>
      </tr>
      <tr>
          <td>Derwent patents</td>
          <td style="text-align: right">44,367</td>
          <td style="text-align: right">7,910</td>
          <td style="text-align: right">388</td>
          <td style="text-align: right">5%</td>
      </tr>
      <tr>
          <td>Vendor catalogues</td>
          <td style="text-align: right">2,991,988</td>
          <td style="text-align: right">24,073</td>
          <td style="text-align: right">708</td>
          <td style="text-align: right">3%</td>
      </tr>
  </tbody>
</table>
<h2 id="synthetic-tractability-prediction">Synthetic Tractability Prediction</h2>
<p>Many VEHICLe ring systems are clearly impractical (e.g., rings composed almost entirely of nitrogen). To separate plausible candidates from outlandish ones, the authors train a random forest classifier using the NovoD ArborPharm decision tree software (NovoDynamics, Inc.) within Pipeline Pilot:</p>
<ul>
<li><strong>Features</strong>: ECFP_2 circular fingerprints (346 unique fragment types across VEHICLe), recording the presence or absence of each small substructure fragment per ring system</li>
<li><strong>Training labels</strong>: &ldquo;Good&rdquo; (769 ring systems found in compound databases totaling 3M+ molecules) vs. &ldquo;bad&rdquo; (24,098 remaining)</li>
<li><strong>Method</strong>: 100 trees using the Buja pure-bucket split method, optimized to minimize false negatives (GoodBias = 32, the ratio of bad to good examples). The PreserveMinority parameter was set to true, ensuring that training data selected for exclusion came exclusively from the &ldquo;bad&rdquo; class.</li>
<li><strong>Tree depth</strong>: 200 layers, chosen by systematic variation (50 to 250 in steps of 50) showing diminishing returns beyond this depth</li>
<li><strong>Node parameters</strong>: EnrichmentThreshold = 0.2 (if $\geq 20%$ of molecules in a node are &ldquo;good&rdquo;, the whole node is classified as good); minimum bucket size = 10 molecules per node ($0.04%$ of the dataset)</li>
</ul>
<p>The classifier produces a $p(\text{good})$ score for each ring system. All 769 known ring systems scored $p(\text{good}) &gt; 0.9$. Of the unknown ring systems, 2,185 (9%) were predicted tractable ($p(\text{good}) &gt; 0.5$).</p>
<p><strong>Validation</strong>: 36 VEHICLe rings from UCB&rsquo;s corporate collection (not in the training set) were all correctly classified as good ($p(\text{good}) \geq 0.95$). Against the Beilstein database, 663 of 2,185 predicted-good unknowns had at least one substructure hit (30% minimum true positive rate), compared to only 374 of 21,913 predicted-bad unknowns (2% false negative rate), a 15-fold improvement over random. Selecting only $p(\text{good}) = 1.0$ predictions raised this ratio to 56-fold.</p>
<p>A final random forest incorporating Beilstein data predicted 3,288 unique unknown ring systems as tractable, with 232 having fewer than five heteroatoms and $p(\text{good}) &gt; 0.95$. The authors manually selected 22 of these as &ldquo;unconquered&rdquo; challenges for synthetic chemists.</p>
<h2 id="ring-system-usage-patterns">Ring System Usage Patterns</h2>
<p>Analysis of ring system frequency across compound databases reveals striking concentration:</p>
<ul>
<li><strong>Phenyl dominance</strong>: 2% of ring systems (15 types) account for 90% of occurrences, with phenyl alone at 70%.</li>
<li><strong>Heteroatom penalty</strong>: The significance of ring system usage drops sharply with increasing heteroatom count, quantified as:</li>
</ul>
<p>$$
\text{significance}_{i,j} = \frac{\text{nobs}_{i,j} / \text{nobs}_{j}}{\text{ntot}_{i,j} / \text{ntot}_{j}}
$$</p>
<p>where $i$ is the number of heteroatoms, $j$ is the compound set, $\text{nobs}$ is the frequency of observation, and $\text{ntot}$ is the total count in VEHICLe. Drug molecules in clinical trials show an even steeper drop-off than the broader compound set.</p>
<ul>
<li><strong>Frequency distribution</strong>: Ring system frequency does not follow <a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf&rsquo;s power law</a> across the full range. Only ring systems occurring fewer than 500 times follow a power-law distribution.</li>
<li><strong>Publication rate decline</strong>: The rate of first publication of novel heteroaromatic ring systems peaked at about 41 per year in the late 1970s and declined to 5-10 per year by the early 2000s.</li>
</ul>
<p>The concentration likely reflects the &ldquo;<a href="https://en.wikipedia.org/wiki/Principle_of_least_effort">principle of least effort</a>,&rdquo; the phylogenetic nature of drug discovery, and conservative risk management in pharma, rather than inherent unsuitability of the unused ring systems.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The enumeration method is fully described and could be reimplemented, but the original implementation relies on proprietary software. The random forest model also uses proprietary tools but is specified in sufficient detail for reproduction with open-source alternatives.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://datarepository.wolframcloud.com/resources/VEHICLe/">VEHICLe on Wolfram Data Repository</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>24,867 ring systems with 16 properties each</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Software dependencies</strong>: Pipeline Pilot (Accelrys Software Inc.) for enumeration; NovoD ArborPharm (NovoDynamics, Inc.) for decision trees. Both are proprietary.</li>
<li><strong>Hardware</strong>: 3 GHz Intel Xeon workstation (enumeration completes in ~3 minutes).</li>
<li><strong>Missing components</strong>: Original Pipeline Pilot protocols and rxn files are not publicly released. ECFP_2 fingerprints used a proprietary Accelrys implementation, though open-source equivalents (RDKit Morgan fingerprints with radius 1) exist.</li>
<li><strong>Reproducibility status</strong>: Partially Reproducible. The VEHICLe library itself is publicly available, and the method is described in sufficient detail for reimplementation with modern open-source tools, but the original code and protocols are not released.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Journal</strong>: Journal of Medicinal Chemistry, Vol. 52, No. 9, pp. 2952-2963</li>
<li><strong>Published</strong>: April 6, 2009</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{pitt2009heteroaromatic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Heteroaromatic Rings of the Future}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Pitt, William R. and Parry, David M. and Perry, Benjamin G. and Groom, Colin R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Medicinal Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{52}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2952--2963}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/jm801513z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>QM9: Quantum Chemistry Properties of 134k Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/qm9/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/qm9/</guid><description>Dataset card for QM9, providing DFT-computed geometric, electronic, and thermodynamic properties for 134k small organic molecules from GDB-9.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>QM9 provides a consistent, comprehensive set of quantum chemical properties for 133,885 small organic molecules (up to 9 heavy atoms of C, N, O, F) from the <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> chemical universe. It is among the most widely used benchmark datasets in molecular machine learning, enabling systematic development and evaluation of structure-property prediction methods.</p>
<h2 id="overview">Overview</h2>
<p>The dataset corresponds to the GDB-9 subset of the GDB-17 chemical universe: all neutral molecules with up to nine heavy atoms (C, O, N, F), not counting hydrogen. Cations, anions, and molecules containing S, Br, Cl, or I were excluded, though 1,705 <a href="https://en.wikipedia.org/wiki/Zwitterion">zwitterions</a> (relevant for small biomolecules like amino acids) were retained. The dataset spans 621 stoichiometries. It includes small amino acids (glycine, alanine), nucleobases (cytosine, uracil, thymine), and pharmaceutically relevant building blocks (pyruvic acid, piperazine, hydroxy urea).</p>
<h2 id="computed-properties">Computed Properties</h2>
<p>All properties were calculated at the <a href="https://en.wikipedia.org/wiki/Hybrid_functionals">B3LYP</a>/6-31G(2df,p) level of DFT. The 15 scalar properties per molecule are:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Unit</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>A, B, C</td>
          <td>GHz</td>
          <td>Rotational constants</td>
      </tr>
      <tr>
          <td>$\mu$</td>
          <td>D</td>
          <td>Dipole moment</td>
      </tr>
      <tr>
          <td>$\alpha$</td>
          <td>$a_0^3$</td>
          <td>Isotropic polarizability</td>
      </tr>
      <tr>
          <td>$\varepsilon_{\text{HOMO}}$</td>
          <td>Ha</td>
          <td>HOMO energy</td>
      </tr>
      <tr>
          <td>$\varepsilon_{\text{LUMO}}$</td>
          <td>Ha</td>
          <td>LUMO energy</td>
      </tr>
      <tr>
          <td>$\varepsilon_{\text{gap}}$</td>
          <td>Ha</td>
          <td>HOMO-LUMO gap</td>
      </tr>
      <tr>
          <td>$\langle R^2 \rangle$</td>
          <td>$a_0^2$</td>
          <td>Electronic spatial extent</td>
      </tr>
      <tr>
          <td>ZPVE</td>
          <td>Ha</td>
          <td>Zero-point vibrational energy</td>
      </tr>
      <tr>
          <td>$U_0$</td>
          <td>Ha</td>
          <td>Internal energy at 0 K</td>
      </tr>
      <tr>
          <td>$U$</td>
          <td>Ha</td>
          <td>Internal energy at 298.15 K</td>
      </tr>
      <tr>
          <td>$H$</td>
          <td>Ha</td>
          <td>Enthalpy at 298.15 K</td>
      </tr>
      <tr>
          <td>$G$</td>
          <td>Ha</td>
          <td>Free energy at 298.15 K</td>
      </tr>
      <tr>
          <td>$C_v$</td>
          <td>cal/mol K</td>
          <td>Heat capacity at 298.15 K</td>
      </tr>
  </tbody>
</table>
<p>Each molecule is stored in an extended XYZ file. The first line gives the atom count, and the second (comment) line packs all 15 scalar properties. Lines 3 through $n_a + 2$ contain element type, Cartesian coordinates (x, y, z in Angstroms), and <a href="https://en.wikipedia.org/wiki/Mulliken_population_analysis">Mulliken partial charges</a> as a fifth column. Three trailing lines append harmonic vibrational frequencies ($3n_a - 5$ or $3n_a - 6$ modes, in cm$^{-1}$), <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings (from GDB-17 and from the B3LYP-relaxed geometry), and <a href="/notes/chemistry/molecular-representations/notations/inchi/">InChI</a> strings (from Corina and B3LYP geometries).</p>
<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-9 (Full)</strong></td>
          <td>133,885</td>
          <td>All molecules, B3LYP properties</td>
      </tr>
      <tr>
          <td><strong>C7H10O2 isomers</strong></td>
          <td>6,095</td>
          <td>Predominant stoichiometry, with additional G4MP2 energetics</td>
      </tr>
      <tr>
          <td><strong>Validation set</strong></td>
          <td>100</td>
          <td>Random subset with G4MP2, G4, and CBS-QB3 reference values</td>
      </tr>
  </tbody>
</table>
<h2 id="geometry-generation-pipeline">Geometry Generation Pipeline</h2>
<p>Starting from GDB-17 SMILES strings, initial 3D coordinates were generated with Corina, then relaxed at the PM7 semi-empirical level (<a href="https://en.wikipedia.org/wiki/MOPAC">MOPAC</a>), followed by B3LYP/6-31G(2df,p) geometry optimization (<a href="https://en.wikipedia.org/wiki/Gaussian_(software)">Gaussian 09</a>). A five-stage iterative convergence procedure handled difficult cases: default thresholds, then ultrafine grids, tighter SCF criteria, Hessian-guided optimization (calcfc), and full Hessian optimization (calcall). After all stages, 11 molecules still failed to converge to true minima (6 converged with loose thresholds, 2 near-linear molecules converged to saddle points with very low imaginary frequencies below $i10 \text{ cm}^{-1}$).</p>
<h2 id="validation">Validation</h2>
<p><strong>Geometry consistency</strong>: B3LYP-relaxed geometries were converted back to InChI strings and compared against the original GDB-17 InChI. 3,054 molecules failed this round-trip test, primarily due to implementation-specific artifacts in SMILES/InChI conversion rather than actual geometry problems. Coulomb-matrix distances between Corina and B3LYP geometries quantified the magnitude of geometric changes.</p>
<p><strong>Energy accuracy</strong>: For 100 randomly selected molecules, B3LYP atomization enthalpies were compared against higher-level composite methods. These reference methods are themselves near experimental accuracy: G4MP2 achieves MAE 1.0 and RMSE 1.5 kcal/mol against the G3/05 test set of 454 experimental energies, while G4 achieves MAE 0.8 and RMSE 1.2 kcal/mol on the same set. G4MP2 also deviates by only 1.4 kcal/mol from the highly accurate W1w composite procedure on 261 bond dissociation enthalpies (BDE261 dataset). Against these references, B3LYP shows:</p>
<table>
  <thead>
      <tr>
          <th>Reference</th>
          <th>MAE (kcal/mol)</th>
          <th>RMSE (kcal/mol)</th>
          <th>Max AE (kcal/mol)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>G4MP2</td>
          <td>5.0</td>
          <td>6.1</td>
          <td>16.0</td>
      </tr>
      <tr>
          <td>G4</td>
          <td>4.9</td>
          <td>5.9</td>
          <td>14.4</td>
      </tr>
      <tr>
          <td>CBS-QB3</td>
          <td>4.5</td>
          <td>5.5</td>
          <td>13.4</td>
      </tr>
  </tbody>
</table>
<p>All 6,095 C7H10O2 isomers passed the geometry consistency check, and their G4MP2-level energetics provide a higher-accuracy benchmark within a fixed stoichiometry.</p>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths</strong>:</p>
<ul>
<li>Comprehensive and consistent: same level of theory across all 134k molecules</li>
<li>Derived from a systematically enumerated chemical space (GDB-17), reducing selection bias</li>
<li>Rich property set covering geometric, electronic, energetic, and thermodynamic quantities</li>
<li>Widely adopted benchmark enabling reproducible comparisons across ML methods</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li>Restricted to very small molecules (up to 9 heavy atoms), limiting relevance to drug-sized compounds</li>
<li>Only CHONF elements, excluding sulfur, halogens (Cl, Br, I), and metals</li>
<li>B3LYP/6-31G(2df,p) has known systematic errors (~5 kcal/mol MAE for atomization enthalpies)</li>
<li>3,054 molecules have geometry consistency issues in SMILES/InChI round-tripping</li>
<li>Single conformer per molecule (energy-minimized geometry only)</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://figshare.com/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904">Figshare collection</a></td>
          <td>Dataset</td>
          <td>CC BY-NC-SA 4.0</td>
          <td>Full dataset: 134k molecules, C7H10O2 isomers, validation set, atomic references</td>
      </tr>
  </tbody>
</table>
<p>The Figshare deposit contains four files:</p>
<ul>
<li><code>dsgdb9nsd.xyz.tar.bz2</code>: All 133,885 GDB-1 through GDB-9 molecules with B3LYP properties</li>
<li><code>dsC7O2H10nsd.xyz.tar.bz2</code>: 6,095 C7H10O2 constitutional isomers with G4MP2 energetics</li>
<li><code>validation.txt</code>: Atomization enthalpies at B3LYP, G4MP2, G4, and CBS-QB3 for 100 random molecules</li>
<li><code>atomref.txt</code>: Atomic reference energies for computing atomization energies from total energies</li>
</ul>
<p>All data is in extended XYZ plain-text format. The paper and its metadata are open access (CC BY-NC-SA 4.0 for the article, CC0 for metadata).</p>
<p>No source code is provided. The computational pipeline relies on commercial and semi-commercial software: Corina (3D coordinate generation), MOPAC (PM7 semi-empirical relaxation), and Gaussian 09 (B3LYP DFT calculations). Specific convergence keywords and iteration procedures are documented in the paper. Hardware requirements are not reported.</p>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The dataset itself is fully available, but regenerating it requires commercial licenses for Corina and Gaussian 09.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ramakrishnan2014quantum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Quantum chemistry structures and properties of 134 kilo molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ramakrishnan, Raghunathan and Dral, Pavlo O. and Rupp, Matthias and von Lilienfeld, O. Anatole}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{140022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Portfolio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/sdata.2014.22}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDBMedChem: Drug-Like Subset of GDB-17 (10M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-medchem/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-medchem/</guid><description>Dataset card for GDBMedChem, 10 million drug-like molecules from GDB-17 filtered by medicinal chemistry criteria and evenly sampled.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>GDBMedChem is a 10 million molecule subset of <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> selected using medicinal chemistry criteria rather than the fragment-likeness rules used for <a href="/notes/chemistry/datasets/fdb-17/">FDB-17</a>. The resulting database has reduced complexity and better synthetic accessibility than the full GDB-17, while retaining higher Fsp3 carbon fraction and natural product likeness compared to known drugs. Critically, 97% of its MHFP6 substructure shingles are absent from <a href="https://en.wikipedia.org/wiki/DrugBank">DrugBank</a>, <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, and ZINC, making it an unprecedented source of structural diversity for drug design.</p>
<h2 id="overview">Overview</h2>
<p>GDB-17 enumerates 166.4 billion molecules following chemical stability and synthetic feasibility rules, but does not consider medicinal chemistry criteria such as acceptable functional group types, overall structural complexity, or drug-likeness. GDBMedChem addresses this gap with a different filtering philosophy than FDB-17: instead of enforcing fragment-likeness (rotatable bond limits, small size), it applies medicinal chemistry-inspired rules that allow larger, more flexible molecules while excluding problematic functional groups and overly complex scaffolds.</p>
<h2 id="assembly-pipeline">Assembly Pipeline</h2>
<p><strong>Stage 1: Medicinal chemistry filters (166.4B to 17.8B, ~9.4x reduction)</strong></p>
<p>Three categories of filters, each benchmarked against ChEMBL, DrugBank, and UNPD (natural products) to ensure low elimination of known bioactives:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Key Filters</th>
          <th>GDB-17 Eliminated</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Functional groups</strong></td>
          <td>No amidines, imidates, aldehydes, aziridines, epoxides; no Br/I; no Cl/F on heterocycles; max 1 nitrile/alkyne/sulfone; max 2 ethers/amides/esters</td>
          <td>53%</td>
      </tr>
      <tr>
          <td><strong>Structural complexity</strong></td>
          <td>Max 18 avalon fingerprint density; max 1 cyclic tetravalent node; max 4 stereocenters; max 3 bonds in fused ring systems; max 3 rings</td>
          <td>62%</td>
      </tr>
      <tr>
          <td><strong>Polarity</strong></td>
          <td>Heteroatom-to-carbon ratio max 0.7</td>
          <td>6%</td>
      </tr>
      <tr>
          <td><strong>Combined</strong></td>
          <td>All filters together</td>
          <td>86%</td>
      </tr>
  </tbody>
</table>
<p>These filters eliminate 86% of GDB-17 but only 36% of ChEMBL molecules and 50% of DrugBank drugs (the higher DrugBank rate is driven mainly by the heteroatom-to-carbon ratio filter removing highly polar drugs with negative clogP values).</p>
<p>Of the 21 filters, 16 are implemented as SMARTS queries and 5 (stereocenters, ring count, avalon density, heteroatom-to-carbon ratio, largest aromatic ring size) use other <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> functions. Filters were applied progressively (simplest first), not in the order listed above. The benchmarking percentages for ChEMBL and DrugBank refer to ChEMBL 22 and DrugBank 5.011 molecules with HAC ≤ 17.</p>
<p><strong>Stage 2: Even sampling (17.8B to 10M)</strong></p>
<p>The 17,804,900,000 molecules in the filtered set are binned into 425 possible triplet combinations of HAC (1-17), heteroatoms (≤1, 2, 3, 4, ≥5), and stereocenters (0, 1, 2, 3, 4). Of these, 181 bins are unoccupied, leaving 244 bins. PySpark&rsquo;s <code>sampleBy</code> function performs stratified sampling without replacement, using a round-robin allocation that increments each bin&rsquo;s quota by one until the total reaches 10M. The resulting distribution is uniform except in low-HAC bins (HAC ≤ 10) where all available molecules are taken.</p>
<h2 id="comparison-with-fdb-17">Comparison with FDB-17</h2>
<p>GDBMedChem and FDB-17 are both 10M-molecule subsets of GDB-17 but take fundamentally different approaches:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>GDBMedChem</th>
          <th>FDB-17</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Parent set</strong></td>
          <td>17.8B (medchem filters)</td>
          <td>4.6B (fragment filters)</td>
      </tr>
      <tr>
          <td><strong>Overlap</strong></td>
          <td>480M molecules shared between parent sets</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Rotatable bonds</strong></td>
          <td>Similar to known drugs</td>
          <td>Restricted to max 3 (fragment-like)</td>
      </tr>
      <tr>
          <td><strong>Key difference</strong></td>
          <td>Drug-like flexibility, medchem FG rules</td>
          <td>Fragment-like rigidity, strict FG removal</td>
      </tr>
  </tbody>
</table>
<p>Both databases retain GDB-17&rsquo;s characteristic high Fsp3 fraction and 3D molecular shape diversity compared to predominantly planar known molecules.</p>
<h2 id="substructure-novelty">Substructure Novelty</h2>
<p>MHFP6 (<a href="https://en.wikipedia.org/wiki/MinHash">MinHash fingerprint</a> with diameter 6) shingle analysis reveals striking structural novelty:</p>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Molecules</th>
          <th>Unique Shingles</th>
          <th>Unique to Database</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDBMedChem</strong></td>
          <td>10M</td>
          <td>17.3M</td>
          <td>97%</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>1.4M</td>
          <td>1.6M</td>
          <td>57%</td>
      </tr>
      <tr>
          <td>ZINC</td>
          <td>15M</td>
          <td>1.5M</td>
          <td>53%</td>
      </tr>
      <tr>
          <td>DrugBank</td>
          <td>8.3k</td>
          <td>82k</td>
          <td>12%</td>
      </tr>
  </tbody>
</table>
<p>GDBMedChem contains 17.3 million unique shingles, roughly 10x more than the 15 million-molecule <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a>, with 97% appearing in no other database. The cumulative unique shingle count grows faster and more steadily with database size for GDBMedChem than for known molecule databases, reflecting greater internal diversity. Among the most frequent shingles, oxygen-containing saturated or singly unsaturated substructures dominate GDBMedChem, in contrast to aromatic and nitrogen heterocycles in ZINC.</p>
<h2 id="property-profiles">Property Profiles</h2>
<p>Compared to known drugs (DrugBank17, ChEMBL17):</p>
<ul>
<li><strong>Synthetic accessibility</strong>: Slightly better than GDB-17 due to complexity filters, but still lower than known molecules</li>
<li><strong>Natural product likeness</strong>: Significantly higher than drugs, approaching natural products (UNPD17)</li>
<li><strong>Fsp3 fraction</strong>: Higher than drugs, reflecting more 3D-shaped molecules</li>
<li><strong>Compound categories</strong>: Much higher fraction of heterocyclic molecules, much lower fraction of aromatic molecules (a consequence of combinatorial enumeration favoring heteroatom-in-ring combinations)</li>
</ul>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths</strong>:</p>
<ul>
<li>97% structurally novel substructures provide unprecedented diversity for drug design</li>
<li>Medicinal chemistry filters retain drug-relevant functional group patterns</li>
<li>Even sampling corrects GDB-17&rsquo;s combinatorial bias toward large, complex molecules</li>
<li>Higher Fsp3 and natural product likeness compared to known drugs</li>
<li>Available with interactive 3D visualization, MQN/MHFP6 similarity search, and download</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li>Synthetic accessibility scores remain lower than for known molecules</li>
<li>Excludes Br, I, and Cl/F on heterocycles, which are common in medicinal chemistry</li>
<li>Random sampling means specific molecules of interest from the 17.8B parent set may be absent</li>
<li>Overlap with FDB-17 is limited (different filtering philosophies), so both databases complement rather than replace each other</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="molecule-preprocessing">Molecule Preprocessing</h3>
<p>Before filtering, each molecule undergoes: counter-ion removal, largest-fragment retention, conversion to non-chiral SMILES, valence-error checking, and protonation at pH 7.4 (using ChemAxon JChem). Duplicates are removed by <a href="/notes/chemistry/molecular-representations/notations/smiles/">canonical SMILES</a> comparison within each database.</p>
<h3 id="reference-databases">Reference Databases</h3>
<p>The comparison databases used specific versions: ChEMBL 22 (1.4M compounds with HAC ≤ 50; 105,423 with HAC ≤ 17), DrugBank 5.011 (8,299 approved/experimental drugs with HAC ≤ 50; 2,284 with HAC ≤ 17), UNPD (20,302 natural products with HAC ≤ 17), and ZINC 12 (15M commercially available compounds).</p>
<h3 id="mhfp6-shingle-computation">MHFP6 Shingle Computation</h3>
<p>Shingles were computed using the <a href="https://github.com/reymond-group/mhfp"><code>mhfp</code> Python package</a> (also on <a href="https://pypi.org/project/mhfp/">PyPI</a>), specifically the <code>shingling_from_smiles</code> function from the <code>MHFPEncoder</code> class. Each shingle represents an extended-connectivity substructure around an atom with a diameter of up to 6 bonds, plus all ring structures, encoded as rooted SMILES strings.</p>
<h3 id="avalon-fingerprint-density">Avalon Fingerprint Density</h3>
<p>The avalon fingerprint density, used as the overall structural complexity filter (max 18), is defined as the number of on-bits in the avalon fingerprint scaled to the heavy atom count.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">GDBMedChem download</a></td>
          <td>Dataset</td>
          <td>Non-commercial (no patents, no redistribution)</td>
          <td>10M molecules in SMILES format</td>
      </tr>
      <tr>
          <td><a href="https://gdb.unibe.ch">GDB web tools</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>3D visualization, MQN/MHFP6 similarity search</td>
      </tr>
      <tr>
          <td><a href="https://github.com/reymond-group/mhfp"><code>mhfp</code> Python package</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>MHFP6 fingerprint and shingle computation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/reymond-group/pca">PCA visualization tools</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>MQN-to-3D PCA projection preprocessing</td>
      </tr>
  </tbody>
</table>
<p><strong>Status: Partially Reproducible.</strong> The dataset itself is publicly available for download, and the paper describes the filtering and sampling pipeline in detail (RDKit 2017_09_03, PySpark 2.3.2, 98-node cluster with 252 GB RAM). The <code>mhfp</code> package for shingle analysis is open-source. However, no standalone filtering/sampling code is released: reproducing the pipeline from scratch requires reimplementing the 16 SMARTS filters and 5 RDKit-based filters, plus the PySpark stratified sampling procedure. The molecule preprocessing step also depends on ChemAxon JChem (commercial) for pH 7.4 protonation and MQN calculation.</p>
<p>The paper is published in the closed-access journal <em>Molecular Informatics</em>. An open-access preprint is available on <a href="https://doi.org/10.26434/chemrxiv.7770809.v1">ChemRxiv</a>.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{awale2019medicinal,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Medicinal Chemistry Aware Database GDBMedChem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Awale, Mahendra and Sirockin, Finton and Stiefl, Nikolaus and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Molecular Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{8-9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e1900031}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Wiley}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1002/minf.201900031}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>FDB-17: Fragment Database (10M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/fdb-17/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/fdb-17/</guid><description>Dataset card for FDB-17, a 10 million fragment-like molecule subset of GDB-17 evenly sampled across size, polarity, and stereochemical complexity.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>FDB-17 is a curated subset of 10 million <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-like</a> molecules extracted from the 166.4 billion molecules in <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>. It corrects the combinatorial bias of exhaustive enumeration (which overwhelmingly produces large, complex molecules) by evenly sampling across molecular size, polarity, and stereochemical complexity. The result is a database sized for practical virtual screening tools while retaining GDB-17&rsquo;s distinctive 3D molecular shape diversity.</p>
<h2 id="overview">Overview</h2>
<p>GDB-17 exhaustively enumerates molecules up to 17 heavy atoms, but the combinatorial explosion means the database is dominated by the largest, most functionalized, and stereochemically most complex entries. This makes it impractical for most <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a> workflows and poorly suited for identifying simple, synthetically accessible fragments. FDB-17 addresses both problems through a two-stage reduction.</p>
<h2 id="assembly-pipeline">Assembly Pipeline</h2>
<p><strong>Stage 1: Fragment-likeness filters (166.4B to 4.6B, 36x reduction)</strong></p>
<p>Criteria limiting structural and functional group complexity:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Constraints</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Scaffolds</strong></td>
          <td>Max 3 rings, max 2 small (3/4-membered) rings, max 2 quaternary centers, max 4 stereocenters, max 3 rotatable bonds</td>
      </tr>
      <tr>
          <td><strong>FG density</strong></td>
          <td>Max 5 N+O atoms, max 1 positive/negative charge at neutral pH, max 3 HBA, max 2 HBD</td>
      </tr>
      <tr>
          <td><strong>Excluded groups</strong></td>
          <td>Aldehydes, epoxides, aziridines, carbonates, imidates, nitro groups, aromatic rings &gt;6 atoms, ≤ 1 cyano group</td>
      </tr>
      <tr>
          <td><strong>Removed elements</strong></td>
          <td>Non-aromatic C=C, C triple bonds, halogens (approximated by saturated C-C and methyl)</td>
      </tr>
  </tbody>
</table>
<p><strong>Stage 2: Even sampling (4.6B to 10M, 460x reduction)</strong></p>
<p>The 4.6B fragment subset is binned into 175 cells defined by value triplets of (HAC, heteroatoms, stereocenters):</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Bin values</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>HAC</strong></td>
          <td>≤11, 12, 13, 14, 15, 16, 17 (7 bins)</td>
      </tr>
      <tr>
          <td><strong>Heteroatoms (N+O+S)</strong></td>
          <td>≤1, 2, 3, 4, ≥5 (5 bins)</td>
      </tr>
      <tr>
          <td><strong>Stereocenters</strong></td>
          <td>0, 1, 2, 3, 4 (5 bins)</td>
      </tr>
  </tbody>
</table>
<p>Individual bins ranged from 3,359 to 446,322,188 molecules, reflecting the extreme combinatorial skew toward large, complex structures. Bins with ≤70,000 molecules are taken entirely; larger bins are randomly sampled to approximately 60,000 molecules each. The filtering was implemented in Java using ChemAxon&rsquo;s JChem libraries and executed on a 500-node cluster in 10,000 CPU hours. The resulting even distribution across molecular size, polarity, and complexity replaces the exponentially skewed distribution of the parent database.</p>
<h2 id="property-profiles-vs-commercial-fragments">Property Profiles vs. Commercial Fragments</h2>
<p>FDB-17 was compared against 40,986 commercial fragments collected from 8 vendors (AnalytiCon, ChemBridge, Enamine, FRAGMENTA, BIONET, LifeChemical, Maybridge, Vitas) and filtered by Congreve&rsquo;s <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">rule of three</a> (mass ≤300, HBA ≤3, HBD ≤3, logP ≤3, RBC ≤3, PSA ≤60). Only 31% (12,847) of these commercial fragments appeared in the 4.6B fragment subset at all, due to functional groups absent from GDB-17 (halogens, thiols, azides, thioethers). Of those, only 6.7% (2,740) appeared in FDB-17 due to the random sampling step.</p>
<p>Key differences:</p>
<ul>
<li><strong>Size and polarity</strong>: FDB-17&rsquo;s even sampling produces distributions comparable to commercial fragments, unlike the parent GDB-17 which peaks sharply at HAC = 17</li>
<li><strong>Compound categories</strong>: Half are heteroaromatic in both sets, but FDB-17&rsquo;s second half is predominantly heterocyclic vs. aromatic for commercial fragments</li>
<li><strong>3D character</strong>: FDB-17 retains GDB-17&rsquo;s coverage of the full PMI (principal moments of inertia) shape triangle, with a frequency peak at center-left (PMI computed from single low-energy CORINA conformers). Commercial fragments are predominantly planar. FDB-17 has significantly higher Fsp3 values</li>
<li><strong>Ring count</strong>: Fragment subsets of GDB-17 are enriched in 2- and 3-ring molecules (a consequence of the rotatable bond limit, which constrains monocyclic molecules more than polycyclic ones)</li>
</ul>
<h2 id="virtual-screening-validation">Virtual Screening Validation</h2>
<p>Nearest-neighbor searches were performed using two fingerprint spaces: MQN (42-dimensional molecular quantum numbers counting atoms, bonds, polarity, and topology) and Xfp (55-dimensional extended <a href="https://en.wikipedia.org/wiki/Pharmacophore">pharmacophore</a> fingerprint capturing shape and pharmacophore features). Four fragment-like drugs were used as queries: fencamfamine, gabapentin, rimantadine, and levetiracetam. For each drug, 10,000 nearest neighbors were retrieved and scored by 3D-shape similarity using ROCS (Rapid Overlay of Chemical Structures). 3D conformers were generated with OMEGA (all possible stereoisomers, keeping the highest-scoring one). Molecules with ROCS Tanimoto Combo &gt; 1.4 were considered virtual hits.</p>
<p>FDB-17 delivered comparable numbers of virtual hits to the full 4.6B fragment subset and the entire GDB-17, despite being 460x and 16,640x smaller respectively. Both close analogs (high substructure similarity, Tsfp &gt; 0.7) and scaffold-hopping compounds (low substructure similarity but high shape similarity) were identified. Random sampling from FDB-17 and searches in the 41k commercial fragment set returned far fewer hits.</p>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths</strong>:</p>
<ul>
<li>Manageable size (10M) compatible with docking and 3D-shape virtual screening tools</li>
<li>Even coverage of molecular size, polarity, and complexity avoids combinatorial bias</li>
<li>High 3D shape diversity compared to predominantly flat commercial fragment libraries</li>
<li>Available with interactive visualization (MQN/SMIfp-mapplet) and web-based nearest neighbor search</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li>Only the 10M FDB-17 is released, not the 4.6B fragment-filtered intermediate. Practitioners who want a different sampling strategy or the full fragment subset cannot access it</li>
<li>Random sampling means specific molecules of interest from the 4.6B subset may be absent</li>
<li>Excludes halogens, non-aromatic unsaturations, and several functional group classes present in commercial fragments</li>
<li>Only 6.7% overlap with commercial fragments limits direct comparison</li>
<li>Still derived from GDB-17&rsquo;s enumeration rules, so molecules outside those rules (e.g., containing metals or larger rings) are excluded</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>FDB-17 is publicly available for download from the <a href="https://gdb.unibe.ch/downloads/">GDB project page</a> as a single SMILES file (62.2 MB), hosted on Zenodo. Interactive visualization via the MQN/SMIfp-mapplet and web-based nearest neighbor search tools are also accessible through the same site. The multi-fingerprint browser supports nearest-neighbor search across six fingerprints: MQN (42D), SMIfp (34D), APfp (21D), Xfp (55D), Sfp (1024-bit Daylight-type), and ECfp4 (1024-bit circular). The filtering code was written in Java using JChem libraries (ChemAxon) and executed on a 500-node cluster in 10,000 CPU hours. The filtering code itself is not publicly released. Virtual screening additionally requires OMEGA (conformer generation) and ROCS (3D-shape scoring), both commercial tools from OpenEye.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">FDB-17 SMILES</a></td>
          <td>Dataset</td>
          <td>Custom (no patents, no redistribution)</td>
          <td>10M fragment-like molecules from GDB-17</td>
      </tr>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">MQN/SMIfp-mapplet</a></td>
          <td>Other</td>
          <td>Web tool</td>
          <td>Interactive PCA visualization on 1000x1000 grids</td>
      </tr>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">Multi-fingerprint browser</a></td>
          <td>Other</td>
          <td>Web tool</td>
          <td>Nearest neighbor search across 6 fingerprints (MQN, SMIfp, APfp, Xfp, Sfp, ECfp4)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The 10M FDB-17 is freely downloadable, but the 4.6B fragment-filtered intermediate is not released. The filtering criteria are fully documented, but the Java filtering code is not released and depends on proprietary ChemAxon libraries. Reproducing the virtual screening experiments requires commercial tools (OMEGA, ROCS from OpenEye; CORINA for PMI analysis).</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{visini2017fragment,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Fragment Database FDB-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Visini, Ricardo and Awale, Mahendra and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{57}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{700--709}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.7b00020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CHX8: Complete Eight-Carbon Hydrocarbon Space</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/chx8-hydrocarbon-chemical-space/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/chx8-hydrocarbon-chemical-space/</guid><description>Harman &amp; Ermanis exhaustively enumerate and DFT-optimize all hydrocarbons up to 8 carbons, yielding 31,497 stable structures with strain energies.</description><content:encoded><![CDATA[<h2 id="exhaustive-hydrocarbon-enumeration-without-exclusion-filters">Exhaustive Hydrocarbon Enumeration Without Exclusion Filters</h2>
<p>CHX8 is the first dataset to fully enumerate all closed-shell <a href="https://en.wikipedia.org/wiki/Hydrocarbon">hydrocarbons</a> with up to eight carbon atoms, deliberately including strained, <a href="https://en.wikipedia.org/wiki/Bredt%27s_rule">anti-Bredt</a>, and unconventional architectures that prior enumerations (e.g., <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>, <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>) excluded. Of 77,524 enumerated structures, 31,497 are stable under DFT optimization, covering 16x more C8 hydrocarbons than GDB-13. A universal relative strain energy (RSE) metric provides a quantitative synthesizability proxy for every molecule.</p>
<h2 id="motivation-strained-scaffolds-are-no-longer-inaccessible">Motivation: Strained Scaffolds Are No Longer Inaccessible</h2>
<p>GDB-series databases applied strict filters during enumeration, excluding highly strained polycyclic systems, cyclic <a href="https://en.wikipedia.org/wiki/Allene">allenes</a>, anti-Bredt frameworks, and other &ldquo;unconventional&rdquo; motifs. Recent synthetic advances have shown that many of these structures can be accessed and exploited: 3D strained <a href="https://en.wikipedia.org/wiki/Bioisostere">bioisosteres</a> improve pharmacokinetic properties, cyclic allenes enable rapid construction of complex skeletons, and anti-Bredt olefins can be generated and trapped stereospecifically. CHX8 deliberately retains all of these motifs to provide a future-proofed database that remains relevant as synthetic capabilities expand.</p>
<h2 id="enumeration-and-optimization">Enumeration and Optimization</h2>
<p><strong>CHX8-enum (77,524 structures)</strong>: All mathematically feasible hydrocarbons generated by exhaustively enumerating saturated carbon frameworks using the GENG tool from the <a href="https://pallini.di.uniroma1.it/">nauty</a> graph-isomorphism package (all 1-to-8-node connected graphs with 1-4 edges per node), then converting graphs to 3D coordinates via <a href="https://en.wikipedia.org/wiki/Open_Babel">OpenBabel</a>&rsquo;s <code>--Gen3D</code> with the <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94</a> force field. Unsaturations (double bonds, triple bonds, allenes) were introduced iteratively in all valid positions by identifying C-C bonds flanked by hydrogen atoms (SMARTS: <code>[#1]~[#6]~[#6]~[#1]</code>), removing H atoms, and incrementing bond order. Point <a href="https://en.wikipedia.org/wiki/Diastereomer">diastereoisomers</a> and E/Z isomers were generated by manipulating <a href="/notes/chemistry/molecular-representations/notations/inchi/">InChI</a> chiral layers. Duplicate detection relied on canonical InChI strings; residual duplicates account for no more than 1.5% of CHX8.</p>
<table>
  <thead>
      <tr>
          <th>HAC</th>
          <th>Graphs</th>
          <th>Saturated</th>
          <th>Unsaturated</th>
          <th>CHX8-enum</th>
          <th>CHX8 (stable)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>1</td>
          <td>1</td>
          <td>0</td>
          <td>1</td>
          <td>1</td>
      </tr>
      <tr>
          <td>2</td>
          <td>1</td>
          <td>1</td>
          <td>2</td>
          <td>3</td>
          <td>3</td>
      </tr>
      <tr>
          <td>3</td>
          <td>2</td>
          <td>2</td>
          <td>7</td>
          <td>9</td>
          <td>8</td>
      </tr>
      <tr>
          <td>4</td>
          <td>6</td>
          <td>7</td>
          <td>31</td>
          <td>38</td>
          <td>30</td>
      </tr>
      <tr>
          <td>5</td>
          <td>21</td>
          <td>25</td>
          <td>138</td>
          <td>163</td>
          <td>117</td>
      </tr>
      <tr>
          <td>6</td>
          <td>78</td>
          <td>114</td>
          <td>753</td>
          <td>867</td>
          <td>522</td>
      </tr>
      <tr>
          <td>7</td>
          <td>353</td>
          <td>746</td>
          <td>4,939</td>
          <td>5,685</td>
          <td>2,917</td>
      </tr>
      <tr>
          <td>8</td>
          <td>1,929</td>
          <td>12,903</td>
          <td>57,856</td>
          <td>70,758</td>
          <td>27,899</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>2,391</strong></td>
          <td><strong>13,799</strong></td>
          <td><strong>63,726</strong></td>
          <td><strong>77,524</strong></td>
          <td><strong>31,497</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>DFT optimization</strong>: All structures were geometry-optimized at the PBE0-D4/def2-TZVP level of theory. 66.5% of structures converged after a single optimization; the remainder required one or two additional passes. 59% of CHX8-enum structures underwent $\sigma$-framework rearrangements during optimization and were classified as unstable. Rearranged structures were identified by comparing input and output InChI strings. Analysis confirmed that all rearrangement products (closed-shell, zwitterionic, or <a href="https://en.wikipedia.org/wiki/Carbene">carbene</a> species) were already present in the enumeration, so no new compounds were missed.</p>
<h2 id="relative-strain-energy-as-a-synthesizability-proxy">Relative Strain Energy as a Synthesizability Proxy</h2>
<p>A universal <a href="https://en.wikipedia.org/wiki/Ring_strain">RSE</a> metric, referenced to <a href="https://en.wikipedia.org/wiki/Cyclohexane">cyclohexane</a> (zero strain), was developed and assigned to every molecule. The RSE for a molecule of interest (subscript $n$) relative to a reference structure (subscript $r$) is:</p>
<p>$$
\text{RSE} = E_{n} - E_{r} - (c_{n} - c_{r}),E_{\text{CH}_2} + E_{\text{unsat}}
$$</p>
<p>where $E_{n}$ and $E_{r}$ are Gibbs energies, $c_{n}$ and $c_{r}$ are carbon counts, $E_{\text{CH}_2}$ is the average energy cost of adding an unstrained CH$_2$ unit, computed from the Gibbs energy differences between consecutive linear alkanes (ethane through octane, six increments), and $E_{\text{unsat}}$ corrects for differences in unsaturation:</p>
<p>$$
E_{\text{unsat}} = (r_{n} - r_{r}),E_{\text{ring}} + (d_{n} - d_{r}),E_{\text{double}} + (t_{n} - t_{r}),E_{\text{triple}}
$$</p>
<p>$E_{\text{double}}$ and $E_{\text{triple}}$ are each derived from internal transformations between the second and third carbon of linear chains, averaged over four chain lengths (n-butane through n-octane). Initial attempts using terminal unsaturations systematically underestimated RSE for structures containing double and triple bonds. $E_{\text{ring}}$ is derived separately using the Dudev-Lim homolytic bond dissociation approach:</p>
<p>$$
E_{\text{ring}} = 2E_{\text{C-H}} - E_{\text{C-C}}
$$</p>
<p>where the individual bond energies are obtained from ethane:</p>
<p>$$
E_{\text{C-H}} = E_{\text{ethane}} - E_{\text{ethyl radical}}, \quad E_{\text{C-C}} = E_{\text{ethane}} - 2E_{\text{methyl radical}}
$$</p>
<p>The highest-RSE molecule with synthetic precedent (a C6 structure detected by <a href="https://en.wikipedia.org/wiki/Atomic_force_microscopy">atomic force microscopy</a> on a metal surface) has an RSE of 201.4 kcal/mol. Using this as a threshold, over 90% of the novel structures in CHX8 should be considered synthetically accessible in principle.</p>
<p>Notable reference points on the RSE scale:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Cyclopropane">Cyclopropane</a>: 27.5 kcal/mol</li>
<li><a href="https://en.wikipedia.org/wiki/Tetrahedrane">Tetrahedrane</a>: 140.1 kcal/mol (substituted variants synthesized, unsubstituted not yet)</li>
<li><a href="https://en.wikipedia.org/wiki/Cubane">Cubane</a>: 157.4 kcal/mol (synthesized)</li>
<li>Highest synthesized: 201.4 kcal/mol (C6 structure on metal surface)</li>
</ul>
<h2 id="key-findings-on-strained-motifs">Key Findings on Strained Motifs</h2>
<p>The exhaustive enumeration enables systematic analysis of structural classes previously excluded:</p>
<ol>
<li><strong>Trans-cycloalkenes</strong>: All trans-cycloalkenes in 6-membered rings or larger should be synthetically feasible. The stability of multi-trans systems depends on the relative position of double bonds: parallel trans-double bonds in a ring can undergo thermally accessible 4$\pi$-electrocyclisation, while non-parallel arrangements may be conformationally locked and stable.</li>
<li><strong>Cyclic alkynes and allenes</strong>: 37% of the CHX8 dataset consists of cyclic alkynes or allenes. All cyclic alkynes except cyclopropyne, and all cyclic allenes, should be synthesizable (in singlet or triplet states), with RSE values below cubane.</li>
<li><strong>Trans-fused rings</strong>: All but [3,3]- and [3,4]-unsubstituted trans-fused rings should be accessible. The proposed lower limit for trans-ring junctions is either (i) a 3-membered ring trans-fused to a ring of five or more atoms, or (ii) a 4-membered ring trans-fused to another 4-membered ring.</li>
<li><strong>Anti-Bredt structures</strong>: CHX8 contains seven hydrocarbon skeletons with a bridging section, yielding fourteen possible anti-Bredt (bridgehead-unsaturated) derivatives. Of these, thirteen are stable under DFT optimization, and over 200 substituted anti-Bredt structures are present in the dataset. All stable anti-Bredt structures have RSE values below cubane. Stability is classified using Fawcett&rsquo;s S parameter (the number of non-bridgehead ring atoms): CHX8 finds structures with S $\geq$ 4 are stable to optimization, consistent with recent experimental work that has accessed anti-Bredt intermediates at S values as low as 4.</li>
</ol>
<h2 id="comparison-to-existing-databases">Comparison to Existing Databases</h2>
<ul>
<li><strong>vs. GDB-13</strong>: CHX8 contains 31,497 C1-C8 hydrocarbons vs. 1,966 in GDB-13 (16x more). For C8 hydrocarbons specifically, GDB-13 has more coverage than GDB-17 (1,966 vs. 1,121). All GDB-13 hydrocarbons appear in CHX8-enum, though some were unstable to DFT optimization.</li>
<li><strong>vs. <a href="/notes/chemistry/datasets/vqm24/">VQM24</a></strong>: For C1-C5 hydrocarbons, VQM24 contains 123 closed-shell isomers vs. 154 in CHX8 (14-25% more). Many missing structures in VQM24 are diastereoisomers not generated by the <a href="/notes/chemistry/molecular-design/chemical-space/surge-chemical-graph-generator/">SURGE</a> process.</li>
<li><strong>vs. PubChem</strong>: Less than 44% of CHX8 structures appear in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></li>
<li><strong>vs. Reaxys</strong>: Only 25% of CHX7 (up to 7 carbons) structures are commercially available</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The enumeration pipeline uses open-source tools: GENG from the <a href="/notes/interdisciplinary/graph-theory/nauty-traces-graph-isomorphism/">nauty</a> package for graph generation, <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> for molecular manipulation and InChI canonicalization, and OpenBabel for 3D coordinate generation (MMFF94). <a href="https://en.wikipedia.org/wiki/Density_functional_theory">DFT</a> calculations used the PBE0-D4/def2-TZVP level of theory via the <a href="https://en.wikipedia.org/wiki/ORCA_(quantum_chemistry_program)">ORCA</a> quantum chemistry package. The paper does not report total compute time or hardware specifications.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.17639/nott.7626">CHX8 Dataset (Nottingham Repository)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>All optimized 3D structures, optimization/frequency output files, organized into CHX7, CHX8-sat, and CHX8-unsat subsets</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components for full reproduction</strong>: No source code for the enumeration or unsaturation-introduction scripts is released. The RSE calculation scripts and DFT input templates are not provided. Hardware/compute requirements are not reported.</p>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The dataset itself is deposited, but the enumeration and analysis code is not released.</p>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Preprint</strong>: ChemRxiv, January 2, 2026</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{harman2026complete,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Complete Computational Exploration of Eight-Carbon Hydrocarbon Chemical Space}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Harman, Stephen J. and Ermanis, Kristaps}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ChemRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv-2026-qjr5r}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AllChem: Generating and Searching 10^20 Structures</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/allchem-synthetically-accessible-structures/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/allchem-synthetically-accessible-structures/</guid><description>AllChem generates and searches 10^20 synthetically accessible structures by combining synthons from recursive reaction application.</description><content:encoded><![CDATA[<h2 id="combinatorial-synthon-assembly-at-scale">Combinatorial Synthon Assembly at Scale</h2>
<p>AllChem is a computer-aided molecular design system that generates and searches an unprecedentedly large space of synthetically accessible structures (on the order of $10^{20}$). Rather than enumerating molecules from mathematical graphs (as in the <a href="/notes/chemistry/datasets/gdb-17/">GDB databases</a>), AllChem builds its chemical space from real synthetic chemistry: it recursively applies known reactions to commercial building blocks, producing <a href="https://en.wikipedia.org/wiki/Synthon">synthons</a> (structures with open valences of defined reactivity) that combinatorially assemble into complete molecules. Every structure found by a search comes paired with a proposed synthetic route.</p>
<h2 id="motivation-costs-and-benefits-together">Motivation: Costs and Benefits Together</h2>
<p>Most computer-aided molecular design methods focus on predicting biological activity (the benefit) while leaving synthesis feasibility (the cost) to the laboratory chemist. AllChem addresses both simultaneously. Its predecessor, ChemSpace, accessed $\sim 10^{14}$ structures built from simple <a href="https://en.wikipedia.org/wiki/Combinatorial_chemistry">combinatorial libraries</a> (chemist-proposed scaffolds plus commercial side chains), but only about 5% of structures in the medicinal chemistry literature fit that template. AllChem aims to cover roughly 50% of published structures by allowing multi-step synthon generation that produces more complex, non-trivial scaffolds.</p>
<h2 id="the-gensyn-synthon-generator">The gensyn Synthon Generator</h2>
<p>The core component is <code>gensyn</code>, a program that recursively applies a curated set of approximately 100 reactions to approximately 7,000 commercially available building blocks. Each product becomes a new building block for subsequent reaction steps, with recursion bounded primarily by a cumulative synthesis &ldquo;cost&rdquo; limit (roughly five AllChem-type steps per sequence). Structures bearing open valences are collected as synthons. A typical run produces around $5 \times 10^6$ synthons, which combinatorially represent $(5 \times 10^6)^3 = 10^{20}$ complete structures with an A-B-C topology.</p>
<p>Key design decisions in gensyn:</p>
<ul>
<li><strong>Reaction curation</strong>: All reactions come from external human-readable text files, based on reactions already practiced by laboratory chemists. Scope constraints are calibrated so that at least 90% of randomly sampled reaction applications appear unchallengeable to synthetic chemists.</li>
<li><strong>Reactive intermediates</strong>: Explicitly represented. For example, amide formation requires three steps: acid chloride to electrophilic synthon, amine to nucleophilic synthon, then coupling.</li>
<li><strong>Protective groups</strong>: Addition and removal are treated as standard reactions.</li>
<li><strong>Concerted cyclizations</strong>: Represented by splitting the ring formation across two complementary synthons with specially labeled open valences.</li>
<li><strong>Bimolecular reactions</strong>: In addition to unimolecular transformations, gensyn performs reactions that combine selected synthons with other synthons, increasing overall structural diversity.</li>
<li><strong>Constraints</strong>: Maximum of one prochiral center (to avoid diastereomeric mixtures), heavy atom count limits for lead-likeness, and a cumulative cost bound on synthetic routes. Each reaction step has a default cost of $-5$, and the maximum allowed cumulative cost is $-25$ (roughly five steps per sequence).</li>
</ul>
<h2 id="reaction-description-language">Reaction Description Language</h2>
<p>Reactions are described using an extension of Sybyl Line Notation (SLN), a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-like notation. Each reaction description specifies the structural pattern required in the substrate, the transformation to apply, the reactivity class of resulting open valences, the relative cost, incompatible functional groups, and rules for handling multiple equivalent reactive sites. A separate reactivity table defines which valence classes can react with each other (e.g., nucleophilic with electrophilic).</p>
<h2 id="topomer-similarity-search">Topomer Similarity Search</h2>
<p>Searching among $10^{20}$ complete structures relies on topomer shape similarity as a branch-and-bound filter. A query structure is fragmented by breaking acyclic single bonds (individually and pairwise), each fragment is converted to a topomer (a canonical 3D shape), and the topomer is compared against all stored synthons. Topomer comparisons run at tens of thousands per second. Because the vast majority of synthons are individually shape-dissimilar enough to eliminate every complete structure containing them, the search space collapses rapidly. To be acceptable, a product must also have been formed by joining open valences with complementary reactivity.</p>
<p>Validation used repeated &ldquo;self-searches,&rdquo; in which a query structure is assembled from randomly chosen synthons and searched for in the database. On the 250,000-synthon leadhopping database, average self-search time was 7.1 minutes; complete searches of the full-scale database take several hours on standard hardware.</p>
<h2 id="applications-lead-hopping-and-scaffold-generation">Applications: Lead Hopping and Scaffold Generation</h2>
<p><strong>Lead hopping</strong>: Finding structurally novel molecules that are shape-similar (and therefore likely biologically similar) to a query lead. Using a 250,000-synthon leadhopping database, 18 of 19 self-search queries recovered the query structure perfectly (shape difference of 0 topomer units). The remaining query also recovered itself as the closest hit.</p>
<p><strong>Scaffold idea generation</strong>: Filtering the synthon collection for small ($\leq$ 14 heavy atoms), low-chirality scaffolds with at least two diversification sites (primarily through nucleophilic heteroatom reactions on activated carbon electrophiles or <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki-type couplings</a>), UV chromophores, minimal freely rotatable bonds (especially between diversification sites and rings), a ring, and short synthetic paths (all branches fewer than about six AllChem steps). Over 20% of gensyn-proposed synthons pass these scaffold filters, suggesting on the order of $10^6$ accessible and structurally distinct scaffolds, compared to the few thousand scaffolds typically represented in large screening collections.</p>
<h2 id="compute-and-infrastructure">Compute and Infrastructure</h2>
<p>Full-scale synthon database recreation takes approximately one week using two standard workstations (one Oracle database server, one compute engine). The codebase was rewritten from Java to Python for portability and performance. All data is managed through an Oracle relational database, including synthons, intermediates, and a reactions table recording every gensyn conversion.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Variable reactivity of open valences (e.g., weakly nucleophilic amines may not form the implied bond readily) is handled only approximately via reagent class annotations.</li>
<li>Stereospecificity and most aromatic electrophilic substitution reactions are omitted.</li>
<li>The system was described as under active development at the time of publication, giving the paper the character of an interim progress report.</li>
<li>Drug-likeness of 3-synthon products (average MW ~800, CLOGP ~8.0) requires careful filtering of the synthon distribution toward smaller, less lipophilic components.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>AllChem was developed as proprietary software at Tripos Inc. (Tripos Discovery Research, Bude, Cornwall, UK). No source code, synthon databases, or reaction files have been publicly released. The paper functions as a description of the system&rsquo;s architecture and early results rather than a reproducibility-oriented publication.</p>
<ul>
<li><strong>Code</strong>: Not publicly available. The system was proprietary to Tripos Inc.</li>
<li><strong>Data</strong>: Synthon databases and reaction description files are not shared.</li>
<li><strong>Hardware</strong>: Two standard workstations (one Oracle server, one compute engine); no specialized hardware required.</li>
<li><strong>Funding</strong>: NIH/GMS SBIR grant 2 R44 GM068359-02.</li>
</ul>
<p><strong>Reproducibility status</strong>: Closed.</p>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Journal</strong>: Journal of Computer-Aided Molecular Design, Vol. 21, No. 6, pp. 341-350</li>
<li><strong>Published</strong>: January 25, 2007</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cramer2007allchem,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AllChem: generating and searching 10^{20} synthetically accessible structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cramer, Richard D. and Soltanshahi, Farhad and Jilek, Robert J. and Campbell, Brian}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Computer-Aided Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{21}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{341--350}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Science+Business Media}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1007/s10822-006-9093-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SlimPajama-DC: Data Combinations for LLM Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</guid><description>Shen et al. study how global deduplication and domain combinations in SlimPajama affect LLM training, finding diversity after dedup is key.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-data-domain-combinations">An empirical study of data domain combinations</h2>
<p>This is a <strong>discovery paper</strong> that empirically investigates how different combinations and proportions of data domains affect language model pretraining. Using the SlimPajama dataset (a globally deduplicated, 627B token refinement of RedPajama), the study trains seven 1.3B model configurations with varying domain mixtures to identify which combinations and deduplication strategies produce the best downstream performance.</p>
<h2 id="why-data-combination-strategy-matters">Why data combination strategy matters</h2>
<p>Multi-source pretraining datasets combine data from web crawls, code repositories, books, academic papers, and other sources. Two underexplored questions drive this work: (1) Does deduplication within each source (local) versus across all sources (global) meaningfully affect model quality? (2) When sources are thoroughly deduplicated, how does the combination and proportion of domains affect downstream performance? Most open-source LLM training datasets (RedPajama, The Pile) perform only local deduplication, leaving cross-source redundancy unaddressed.</p>
<h2 id="global-deduplication-and-the-slimpajama-dataset">Global deduplication and the SlimPajama dataset</h2>
<p>SlimPajama applies global MinHashLSH deduplication (Jaccard similarity threshold 0.8, 13-gram signatures) across all seven data sources simultaneously. This reduces RedPajama&rsquo;s 1.2T tokens to 627B tokens, a roughly 48% reduction. The heaviest deduplication hits CommonCrawl and GitHub, which had the most cross-source overlap.</p>
<p>The key processing steps:</p>
<ol>
<li><strong>Low-length document filtering</strong>: Remove documents below a minimum length threshold.</li>
<li><strong>Global deduplication</strong>: MinHashLSH across all sources simultaneously, requiring 64 CPU cores and 1.4TB peak memory. This removes both within-source and between-source duplicates.</li>
</ol>
<p>The resulting dataset composition:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>SlimPajama</th>
          <th>RedPajama</th>
          <th>LLaMA 1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>52.2% (333B)</td>
          <td>72.6% (878B)</td>
          <td>67.0%</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>26.7% (170B)</td>
          <td>14.4% (175B)</td>
          <td>15.0%</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>5.2% (33B)</td>
          <td>4.9% (59B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>4.2% (27B)</td>
          <td>2.1% (26B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>4.6% (29B)</td>
          <td>2.3% (28B)</td>
          <td>2.5%</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>3.8% (24B)</td>
          <td>2.0% (24B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>3.3% (21B)</td>
          <td>1.7% (20B)</td>
          <td>2.0%</td>
      </tr>
  </tbody>
</table>
<h2 id="seven-domain-combination-configurations">Seven domain combination configurations</h2>
<p>All configurations train 1.3B parameter models on 330B tokens with identical architecture and hyperparameters. The configurations systematically vary domain diversity:</p>
<ul>
<li><strong>DC-1</strong>: CommonCrawl only (single source)</li>
<li><strong>DC-2</strong>: CommonCrawl + C4 (two web sources)</li>
<li><strong>DC-3</strong>: CommonCrawl + C4 with adjusted proportions</li>
<li><strong>DC-4</strong>: Wikipedia + Books + GitHub + ArXiv + StackExchange (no web crawl)</li>
<li><strong>DC-5</strong>: CommonCrawl + C4 + Wikipedia + Books (four sources, no code/academic)</li>
<li><strong>DC-6</strong>: All seven SlimPajama sources (maximum diversity)</li>
<li><strong>DC-7</strong>: RefinedWeb CommonCrawl (external single-source baseline)</li>
</ul>
<p>The experimental design probes: incremental diversity (DC-1 to DC-2 to DC-5 to DC-6), proportion sensitivity (DC-2 vs DC-3), source importance (DC-3 vs DC-4), and specialization vs generalization (individual vs combined).</p>
<h2 id="diversity-after-global-deduplication-drives-performance">Diversity after global deduplication drives performance</h2>
<h3 id="hugging-face-leaderboard-results">Hugging Face leaderboard results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Average</th>
          <th>ARC</th>
          <th>HellaSwag</th>
          <th>MMLU</th>
          <th>TruthfulQA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RedPajama-1.3B</td>
          <td>38.0</td>
          <td>37.2</td>
          <td>55.8</td>
          <td>24.9</td>
          <td>34.3</td>
      </tr>
      <tr>
          <td>DC-1 (CC only)</td>
          <td>38.5</td>
          <td>36.3</td>
          <td>56.0</td>
          <td>27.0</td>
          <td>34.8</td>
      </tr>
      <tr>
          <td>DC-4 (no web)</td>
          <td>37.6</td>
          <td>33.4</td>
          <td>53.3</td>
          <td>26.0</td>
          <td>37.6</td>
      </tr>
      <tr>
          <td>DC-6 (all sources)</td>
          <td>40.0</td>
          <td>33.7</td>
          <td>61.0</td>
          <td>26.9</td>
          <td>38.4</td>
      </tr>
      <tr>
          <td>DC-7 (RefinedWeb)</td>
          <td>41.0</td>
          <td>35.1</td>
          <td>64.7</td>
          <td>26.2</td>
          <td>37.9</td>
      </tr>
  </tbody>
</table>
<p><strong>Key patterns:</strong></p>
<ol>
<li>
<p><strong>More domain diversity improves average performance.</strong> The progression DC-1 (38.5) to DC-2 (38.4) to DC-5 (38.6) to DC-6 (40.0) shows that adding domains consistently lifts average accuracy once global deduplication has removed cross-source redundancy.</p>
</li>
<li>
<p><strong>Global deduplication enables clean combination.</strong> All SlimPajama configurations except DC-4 outperform RedPajama-1.3B (38.0), which uses local deduplication only. The elimination of cross-source overlap means adding sources contributes genuinely new information.</p>
</li>
<li>
<p><strong>Removing web crawl data hurts.</strong> DC-4 (no CommonCrawl/C4) scores lowest (37.6), demonstrating that web text provides essential breadth even when specialized sources are included.</p>
</li>
<li>
<p><strong>Individual domains excel at specific tasks.</strong> DC-1 (CC only) achieves the highest ARC and MMLU scores. DC-4 leads on Winogrande. DC-5 leads on WSC273. No single combination dominates all tasks, reinforcing that diversity trades specialization for generalization.</p>
</li>
<li>
<p><strong>Findings transfer to 7B scale.</strong> The best 1.3B configuration insights were applied to a 7B model trained with large batch sizes, achieving 63.4 average accuracy across the extended benchmark suite.</p>
</li>
</ol>
<h3 id="training-loss-patterns">Training loss patterns</h3>
<p>DC-6 (all sources) achieves the lowest training loss among SlimPajama configurations, consistent with the downstream results. DC-4 (no web crawl) shows the highest training loss, confirming that the large, diverse web crawl data is the most important single component.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The central finding is that <strong>diversity matters most after deduplication</strong>. When cross-source redundancy is removed, each additional source contributes genuinely new signal. Without global deduplication, adding sources may just increase redundancy without proportional benefit.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>Only seven fixed configurations are tested. No systematic search over continuous mixture proportions (contrast with <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> or <a href="/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/">Data Mixing Laws</a>).</li>
<li>The configurations are not independent: DC-6 includes all sources from DC-1 through DC-5, making it difficult to isolate the contribution of any single addition.</li>
<li>Only 1.3B and 7B scales tested. Whether the diversity benefit continues scaling is unverified.</li>
<li>English-only. Cross-lingual diversity effects are not studied.</li>
<li>The paper is a technical report without formal peer review.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> All 1.3B models and datasets are publicly released under MIT license on HuggingFace.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>SlimPajama</td>
          <td>627B tokens</td>
          <td>Globally deduplicated from 1.2T RedPajama</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>RefinedWeb</td>
          <td>600B tokens</td>
          <td>External CC-only baseline</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HF Leaderboard (ARC, HellaSwag, MMLU, TruthfulQA)</td>
          <td>Standard</td>
          <td>4 benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Extended suite</td>
          <td>12 additional benchmarks</td>
          <td>Zero and few-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>1.3B parameter Cerebras-GPT architecture with ALiBi positional encoding and SwiGLU activation. All configurations trained on 330B tokens. 7B model trained with large batch-size (LBS) strategy on Cerebras 16x CS-2 cluster (80 PFLOP/s in bf16).</p>
<h3 id="hardware">Hardware</h3>
<p>Cerebras 16x CS-2 cluster, 80 PFLOP/s in bf16 mixed precision.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/MBZUAI-LLM/SlimPajama-DC">SlimPajama-DC Models</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>All 1.3B DC configurations (select via revision)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC">SlimPajama-627B-DC Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>Source-split version of SlimPajama-627B</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2023slimpajamadc,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SlimPajama-DC: Understanding Data Combinations for LLM Training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Zhiqiang and Tao, Tianhua and Ma, Liqun and Neiswanger, Willie and Liu, Zhengzhong and Wang, Hongyi and Tan, Bowen and Hestness, Joel and Vassilieva, Natalia and Soboleva, Daria and Xing, Eric}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.10818}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemLLM: A Chemical Large Language Model Framework</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/</guid><description>ChemLLM introduces the first LLM dedicated to chemistry, with ChemData for instruction tuning and ChemBench for evaluation across nine chemical tasks.</description><content:encoded><![CDATA[<h2 id="a-resource-for-chemistry-specific-language-modeling">A Resource for Chemistry-Specific Language Modeling</h2>
<p>ChemLLM is a <strong>Resource</strong> paper that delivers three interconnected artifacts: ChemData (a 7M-sample instruction tuning dataset for chemistry), ChemBench (a 4,100-question multiple-choice benchmark spanning nine chemistry tasks), and ChemLLM itself (a 7B-parameter language model fine-tuned on InternLM2-Base-7B). Together, these components form the first comprehensive framework for building and evaluating LLMs dedicated to the chemical domain. The primary contribution is not a novel architecture but rather the data curation pipeline, evaluation benchmark, and training methodology that converts structured chemical knowledge into dialogue-formatted instruction data.</p>
<h2 id="bridging-structured-chemical-databases-and-conversational-llms">Bridging Structured Chemical Databases and Conversational LLMs</h2>
<p>While general-purpose LLMs like GPT-4 have shown promise on chemistry tasks, they are not specifically designed for the chemical domain. Several challenges motivate ChemLLM:</p>
<ol>
<li>
<p><strong>Structured data incompatibility</strong>: Most chemical information resides in structured databases (<a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, <a href="https://en.wikipedia.org/wiki/ChEBI">ChEBI</a>, <a href="/notes/chemistry/datasets/zinc-22/">ZINC</a>, USPTO) that are not naturally suited for training conversational language models. Using this data directly can degrade natural language processing capabilities.</p>
</li>
<li>
<p><strong>Molecular notation understanding</strong>: Molecules are represented in specialized notations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, which differ from natural language and require explicit alignment during training.</p>
</li>
<li>
<p><strong>Task diversity</strong>: Chemical tasks span name conversion, property prediction, molecular captioning, <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a>, product prediction, yield prediction, and more. A uniform training pipeline must handle this diversity without task-specific adaptation.</p>
</li>
<li>
<p><strong>Evaluation gaps</strong>: Existing chemical benchmarks (e.g., <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>) are designed for specialist models, not LLMs. Text-based evaluation metrics like <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a> and <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a> are sensitive to output style rather than factual correctness, making them unreliable for scientific accuracy assessment.</p>
</li>
</ol>
<p>Prior work focused on developing specialist models for individual downstream tasks while neglecting instruction-following and dialogue capabilities that are essential for broader reasoning and generalization.</p>
<h2 id="template-based-instruction-construction-from-structured-data">Template-Based Instruction Construction from Structured Data</h2>
<p>The core innovation is a systematic approach for converting structured chemical data into instruction-tuning format through two techniques:</p>
<h3 id="seed-template-prompt-technique">Seed Template Prompt Technique</h3>
<p>For each task type, the authors design a foundational seed template and use GPT-4 to generate variations that differ in expression but maintain semantic consistency. For each structured data entry, one template is randomly selected to create a single-turn dialogue sample. For example, converting <a href="https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_organic_chemistry">IUPAC</a>-to-SMILES entries:</p>
<ul>
<li>&ldquo;Convert the IUPAC name [name] to its corresponding SMILES representation.&rdquo;</li>
<li>&ldquo;What&rsquo;s the SMILES notation for the chemical known as [name]?&rdquo;</li>
<li>&ldquo;Show me the SMILES sequence for [name], please.&rdquo;</li>
</ul>
<h3 id="play-as-playwrights-technique">Play as Playwrights Technique</h3>
<p>To generate richer, multi-turn dialogues, the authors prompt GPT-4 with a chain-of-thought (CoT) style &ldquo;script&rdquo; construction method. GPT-4 is guided to create multi-turn exchanges that simulate expert discussions, smoothly transitioning between question and answer stages. An additional &ldquo;answer masking&rdquo; variant has the model inquire about supplementary chemical information before providing a final answer, simulating realistic expert reasoning.</p>
<h3 id="training-objective">Training Objective</h3>
<p>The model is fine-tuned using <a href="https://en.wikipedia.org/wiki/LoRA_(machine_learning)">LoRA</a> with an autoregressive cross-entropy loss:</p>
<p>$$L_{CE} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$$</p>
<p>where $M$ is the vocabulary size, $y_{o,c}$ is a binary indicator for whether observation $o$ belongs to class $c$, and $p_{o,c}$ is the predicted probability.</p>
<h2 id="two-stage-training-pipeline-and-chembench-evaluation">Two-Stage Training Pipeline and ChemBench Evaluation</h2>
<h3 id="training-setup">Training Setup</h3>
<p>ChemLLM uses a two-stage instruction tuning approach built on InternLM2-Base-7B:</p>
<p><strong>Stage 1</strong>: Fine-tune on Multi-Corpus (1.7M Q&amp;A pairs from Hugging Face) to enhance general linguistic capabilities, producing InternLM2-Chat-7B.</p>
<p><strong>Stage 2</strong>: Fine-tune on a mixture of ChemData (7M entries) and Multi-Corpus, balancing domain-specific chemical expertise with general language ability.</p>
<p>Training details include:</p>
<ul>
<li>LoRA with rank 8, scale factor 16.0, dropout 0.1</li>
<li>AdamW optimizer with initial learning rate $5.0 \times 10^{-5}$</li>
<li>NEFTune noise injection (alpha = 5) to prevent overfitting</li>
<li>Flash Attention-2 and KV Cache for efficiency</li>
<li>ZeRO Stage-2 for parameter offloading</li>
<li>Per-card batch size of 8 (total batch size 128)</li>
<li>1.06 epochs, 85,255 steps</li>
<li>Training loss reduced from 1.4998 to 0.7158</li>
</ul>
<h3 id="chemdata-composition">ChemData Composition</h3>
<p>ChemData spans three principal task categories with 7M instruction-tuning Q&amp;A pairs:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Tasks</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecules</td>
          <td>Name Conversion, Caption2Mol, Mol2Caption, Molecular Property Prediction</td>
      </tr>
      <tr>
          <td>Reactions</td>
          <td>Retrosynthesis, Product Prediction, Yield Prediction, Temperature Prediction, Solvent Prediction</td>
      </tr>
      <tr>
          <td>Domain-specific</td>
          <td>General chemical knowledge for broader chemical space understanding</td>
      </tr>
  </tbody>
</table>
<p>Data sources include PubChem, ChEMBL, ChEBI, ZINC, USPTO, ORDerly, ChemRxiv, LibreTexts Chemistry, Wikipedia, and Wikidata.</p>
<h3 id="chembench-design">ChemBench Design</h3>
<p>ChemBench contains 4,100 multiple-choice questions across the same nine tasks as ChemData. The choice of multiple-choice format is deliberate: it minimizes the influence of output style and focuses evaluation on factual correctness, unlike BLEU/ROUGE-based evaluation. Wrong answers are generated by sampling nearby values (for prediction tasks) or using GPT-4 to create plausible distractors. Deduplication ensures no overlap between ChemData training entries and ChemBench questions.</p>
<p>ChemBench has been contributed to the OpenCompass evaluation platform.</p>
<h3 id="baselines">Baselines</h3>
<p>All evaluations use 5-shot prompting. Baselines include:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LLaMA-2</td>
          <td>Open-source</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>Mistral</td>
          <td>Open-source</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>ChatGLM3</td>
          <td>Open-source</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>Qwen</td>
          <td>Open-source</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>InternLM2-Chat-7B</td>
          <td>Open-source (Stage 1 only)</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>GPT-3.5</td>
          <td>Closed-source</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>GPT-4</td>
          <td>Closed-source</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<h2 id="chemllm-matches-gpt-4-on-chemical-tasks-and-outperforms-7b-peers">ChemLLM Matches GPT-4 on Chemical Tasks and Outperforms 7B Peers</h2>
<h3 id="chemical-evaluation-chembench">Chemical Evaluation (ChemBench)</h3>
<p>ChemLLM significantly outperforms general LLMs of similar scale and surpasses GPT-3.5 across all nine tasks. Compared to GPT-4, ChemLLM achieves higher scores on six of nine tasks, with the remaining three ranking just below GPT-4. LLaMA-2 scores near random chance (~25 per task), highlighting the difficulty of these tasks for models without chemical training.</p>
<p>Compared to InternLM2-Chat-7B (the Stage 1 model), ChemLLM shows substantial improvement, confirming the effectiveness of the Stage 2 chemical fine-tuning.</p>
<h3 id="general-evaluation">General Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>ChemLLM</th>
          <th>Best 7B Baseline</th>
          <th>GPT-4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MMLU</td>
          <td>65.6</td>
          <td>&lt; 65.6</td>
          <td>Higher</td>
      </tr>
      <tr>
          <td>C-Eval</td>
          <td>67.2</td>
          <td>&lt; 67.2</td>
          <td>Higher</td>
      </tr>
      <tr>
          <td>GSM8K</td>
          <td>67.2</td>
          <td>&lt; 67.2</td>
          <td>Higher</td>
      </tr>
      <tr>
          <td>C-MHChem</td>
          <td>76.4</td>
          <td>&lt; 76.4</td>
          <td>&lt; 76.4</td>
      </tr>
  </tbody>
</table>
<p>ChemLLM outperforms all competing 7B models on MMLU, C-Eval, and GSM8K. On C-MHChem (Chinese middle and high school chemistry), ChemLLM scores 76.4, surpassing GPT-4. The authors note that chemical data fine-tuning may enhance reasoning capabilities due to the logical reasoning required in chemical problem-solving. ChemLLM also comprehensively surpasses InternLM2-Chat-7B on all four general benchmarks, indicating that chemical data does not harm general capabilities.</p>
<h3 id="qualitative-capabilities">Qualitative Capabilities</h3>
<p>The paper demonstrates qualitative performance on chemistry-related NLP tasks including:</p>
<ul>
<li>Chemical literature translation (English to Chinese and vice versa)</li>
<li>Chemical poetry creation</li>
<li>Information extraction from chemical text</li>
<li>Text summarization of chemical research</li>
<li>Reading comprehension on chemistry topics</li>
<li>Named entity recognition for chemical entities</li>
<li>Ethics and safety reasoning in chemical contexts</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The paper does not provide individual task-level scores in tabular form for ChemBench (only radar charts), making precise comparison difficult. Specific scores for each of the nine tasks across all baselines are not reported numerically. The evaluation is limited to 5-shot prompting without exploration of zero-shot or chain-of-thought prompting variants. The paper also does not discuss failure modes or systematic weaknesses of ChemLLM on particular task types.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stage 1 Training</td>
          <td>Multi-Corpus</td>
          <td>1.7M Q&amp;A</td>
          <td>Collected from Hugging Face</td>
      </tr>
      <tr>
          <td>Stage 2 Training</td>
          <td>ChemData + Multi-Corpus</td>
          <td>7M + 1.7M</td>
          <td>Chemical + general mixture</td>
      </tr>
      <tr>
          <td>Chemical Evaluation</td>
          <td>ChemBench</td>
          <td>4,100 MCQ</td>
          <td>9 tasks, contributed to OpenCompass</td>
      </tr>
      <tr>
          <td>General Evaluation</td>
          <td>MMLU, C-Eval, GSM8K, C-MHChem</td>
          <td>Varies</td>
          <td>Standard benchmarks</td>
      </tr>
  </tbody>
</table>
<p>Data sources for ChemData: PubChem, ChEMBL, ChEBI, ZINC, USPTO, ORDerly, ChemRxiv, LibreTexts Chemistry, Wikipedia, Wikidata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Two-stage instruction tuning (general then chemical)</li>
<li>LoRA fine-tuning (rank 8, scale 16.0, dropout 0.1)</li>
<li>Template-based instruction construction with GPT-4 for diversity</li>
<li>Play as Playwrights CoT prompting for multi-turn dialogue generation</li>
<li>NEFTune noise injection (alpha 5)</li>
<li>DeepSpeed ZeRO++ for distributed training</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Base</th>
          <th>Parameters</th>
          <th>Availability</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemLLM-7B-Chat</td>
          <td>InternLM2-Base-7B</td>
          <td>7B</td>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-7B-Chat">Hugging Face</a></td>
      </tr>
      <tr>
          <td>ChemLLM-7B-Chat-1.5-DPO</td>
          <td>InternLM2</td>
          <td>7B</td>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1_5-DPO">Hugging Face</a></td>
      </tr>
      <tr>
          <td>ChemLLM-20B-Chat-DPO</td>
          <td>InternLM</td>
          <td>20B</td>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-20B-Chat-DPO">Hugging Face</a></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>5-shot evaluation across all benchmarks. Multiple-choice format for ChemBench to minimize output style bias.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>2 machines, each with 8 NVIDIA A100 SMX GPUs</li>
<li>2 AMD EPYC 7742 64-Core CPUs per machine (256 threads each)</li>
<li>SLURM cluster management</li>
<li>BF16 mixed precision training</li>
<li>Flash Attention-2 + KV Cache</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-7B-Chat">ChemLLM-7B-Chat</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Original 7B chat model</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1_5-DPO">ChemLLM-7B-Chat-1.5-DPO</a></td>
          <td>Model</td>
          <td>Other</td>
          <td>Updated v1.5 with DPO</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-20B-Chat-DPO">ChemLLM-20B-Chat-DPO</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>20B parameter variant</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem">AI4Chem HuggingFace</a></td>
          <td>Collection</td>
          <td>Various</td>
          <td>All models, datasets, and code</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, D., Liu, W., Tan, Q., Chen, J., Yan, H., Yan, Y., Li, J., Huang, W., Yue, X., Ouyang, W., Zhou, D., Zhang, S., Su, M., Zhong, H.-S., &amp; Li, Y. (2024). ChemLLM: A Chemical Large Language Model. <em>arXiv preprint arXiv:2402.06852</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2024chemllm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemLLM: A Chemical Large Language Model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Di and Liu, Wei and Tan, Qian and Chen, Jingdan and Yan, Hang and Yan, Yuliang and Li, Jiatong and Huang, Weiran and Yue, Xiangyu and Ouyang, Wanli and Zhou, Dongzhan and Zhang, Shufei and Su, Mao and Zhong, Han-Sen and Li, Yuqiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2402.06852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoleculeNet: Benchmarking Molecular Machine Learning</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/</guid><description>MoleculeNet curates 17 datasets across quantum mechanics, physical chemistry, biophysics, and physiology with standardized splits and metrics for molecular ML.</description><content:encoded><![CDATA[<h2 id="a-resource-paper-for-molecular-machine-learning-benchmarking">A Resource Paper for Molecular Machine Learning Benchmarking</h2>
<p>This is a <strong>Resource</strong> paper. MoleculeNet provides a standardized benchmark suite for evaluating molecular machine learning methods. Its primary contribution is the curation of 17 public datasets spanning four categories of molecular properties, together with standardized evaluation metrics, multiple dataset splitting strategies, and open-source implementations of featurization and learning algorithms via the DeepChem library.</p>
<h2 id="why-molecular-ml-needed-a-unified-benchmark">Why Molecular ML Needed a Unified Benchmark</h2>
<p>Prior to MoleculeNet, algorithmic progress in molecular machine learning was difficult to measure. Individual papers benchmarked proposed methods on different datasets with different metrics, making cross-method comparison unreliable. Several factors make molecular ML particularly challenging:</p>
<ol>
<li><strong>Data scarcity</strong>: Molecular datasets are much smaller than those available for computer vision or NLP, since obtaining accurate chemical property measurements requires specialized instruments and expert supervision.</li>
<li><strong>Heterogeneous outputs</strong>: Properties of interest range from quantum mechanical characteristics to macroscopic physiological effects on the human body.</li>
<li><strong>Variable input structures</strong>: Molecules have arbitrary size, variable connectivity, and many possible 3D conformers, all of which must be encoded into fixed-length representations for conventional ML algorithms.</li>
<li><strong>No standard evaluation protocol</strong>: Without prescribed metrics, splits, or data subsets, two methods using the same underlying database (e.g., PubChem) could be entirely incomparable.</li>
</ol>
<p>Existing databases like PubChem, ChEMBL, and the Quantum Machine collections provided raw data but did not define evaluation protocols suitable for machine learning development. MoleculeNet bridges this gap, following the precedent set by ImageNet in computer vision and WordNet in NLP.</p>
<h2 id="core-design-datasets-splits-metrics-and-featurizations">Core Design: Datasets, Splits, Metrics, and Featurizations</h2>
<p>MoleculeNet is organized around four components: curated datasets, splitting methods, evaluation metrics, and molecular featurizations.</p>
<h3 id="datasets-across-four-property-categories">Datasets Across Four Property Categories</h3>
<p>The benchmark includes 17 datasets covering over 700,000 compounds and more than 800 tasks. These are organized into four categories reflecting different levels of molecular properties:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Compounds</th>
          <th>Task Type</th>
          <th>Rec. Split</th>
          <th>Rec. Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Quantum Mechanics</td>
          <td>QM7</td>
          <td>1</td>
          <td>7,165</td>
          <td>Regression</td>
          <td>Stratified</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM7b</td>
          <td>14</td>
          <td>7,211</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM8</td>
          <td>12</td>
          <td>21,786</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM9</td>
          <td>12</td>
          <td>133,885</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>1,128</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>643</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>Lipophilicity</td>
          <td>1</td>
          <td>4,200</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA</td>
          <td>128</td>
          <td>439,863</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>MUV</td>
          <td>17</td>
          <td>93,127</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>HIV</td>
          <td>1</td>
          <td>41,913</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>PDBbind</td>
          <td>1</td>
          <td>11,908</td>
          <td>Regression</td>
          <td>Time</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>BACE</td>
          <td>1</td>
          <td>1,522</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>2,053</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>Tox21</td>
          <td>12</td>
          <td>8,014</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ToxCast</td>
          <td>617</td>
          <td>8,615</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>SIDER</td>
          <td>27</td>
          <td>1,427</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ClinTox</td>
          <td>2</td>
          <td>1,491</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p><strong>Quantum mechanics</strong> datasets (QM7, QM7b, QM8, <a href="/notes/chemistry/datasets/qm9/">QM9</a>) contain DFT-computed electronic properties for subsets of the <a href="/notes/chemistry/datasets/gdb-17/">GDB</a> database. <strong>Physical chemistry</strong> datasets cover solubility (ESOL), hydration free energy (FreeSolv), and lipophilicity. <strong>Biophysics</strong> datasets include high-throughput screening results (PCBA, MUV), HIV inhibition activity, protein-ligand binding affinity (PDBbind), and BACE-1 inhibition. <strong>Physiology</strong> datasets cover blood-brain barrier penetration (BBBP), toxicity (Tox21, ToxCast), side effects (SIDER), and clinical trial toxicity (ClinTox).</p>
<h3 id="data-splitting-strategies">Data Splitting Strategies</h3>
<p>MoleculeNet implements four splitting methods, all using an 80/10/10 train/validation/test ratio:</p>
<ul>
<li><strong>Random splitting</strong>: Standard random assignment to subsets.</li>
<li><strong>Scaffold splitting</strong>: Separates molecules by their 2D structural frameworks (Bemis-Murcko scaffolds), providing a harder generalization test since structurally different molecules appear in different subsets.</li>
<li><strong>Stratified splitting</strong>: Ensures each subset contains the full range of label values (used for QM7).</li>
<li><strong>Time splitting</strong>: Trains on older data and tests on newer data to mimic real-world development (used for PDBbind).</li>
</ul>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>Regression tasks use MAE or RMSE depending on the dataset. Classification tasks use either ROC-AUC or PRC-AUC. The choice between ROC-AUC and PRC-AUC depends on class imbalance: PRC-AUC is recommended for datasets with positive rates below 2% (PCBA, MUV), since precision-recall curves better capture performance under extreme imbalance.</p>
<p>The false positive rate and precision are defined as:</p>
<p>$$
\text{FPR} = \frac{\text{false positive}}{\text{false positive} + \text{true negative}}
$$</p>
<p>$$
\text{precision} = \frac{\text{true positive}}{\text{false positive} + \text{true positive}}
$$</p>
<p>When positive samples form a small fraction of the data, false positives influence precision much more than FPR, making PRC-AUC more informative than ROC-AUC.</p>
<h3 id="featurization-methods">Featurization Methods</h3>
<p>MoleculeNet implements six molecular featurization approaches:</p>
<ol>
<li><strong>ECFP (Extended-Connectivity Fingerprints)</strong>: Fixed-length binary fingerprints capturing topological substructures via hashing.</li>
<li><strong><a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb Matrix</a></strong>: Encodes nuclear charges and 3D coordinates through atomic self-energies and Coulomb repulsion:</li>
</ol>
<p>$$
M_{IJ} = \begin{cases} 0.5 Z_{I}^{2.4} &amp; \text{for } I = J \\ \frac{Z_{I} Z_{J}}{|\mathbf{R}_{I} - \mathbf{R}_{J}|} &amp; \text{for } I \neq J \end{cases}
$$</p>
<ol start="3">
<li><strong>Grid Featurizer</strong>: Designed for PDBbind, incorporating both ligand and protein structural information including salt bridges, hydrogen bonds, and SPLIF fingerprints.</li>
<li><strong>Symmetry Functions</strong>: Preserve rotational and permutation symmetry through radial and angular functions between atom pairs and triplets.</li>
<li><strong>Graph Convolutions</strong>: Compute initial atom feature vectors and neighbor lists from molecular graphs.</li>
<li><strong>Weave</strong>: Similar to graph convolutions but also computes pairwise atom features encoding bond properties, graph distance, and ring information.</li>
</ol>
<h2 id="benchmarked-models-and-experimental-setup">Benchmarked Models and Experimental Setup</h2>
<p>MoleculeNet benchmarks 12 learning algorithms divided into conventional methods and graph-based methods.</p>
<h3 id="conventional-methods">Conventional Methods</h3>
<ul>
<li><strong>Logistic Regression</strong> (classification only)</li>
<li><strong>Kernel SVM</strong> with radial basis function kernel</li>
<li><strong>Kernel Ridge Regression (KRR)</strong></li>
<li><strong>Random Forests</strong></li>
<li><strong>Gradient Boosting</strong> (XGBoost)</li>
<li><strong>Singletask/Multitask Networks</strong>: Fully connected networks with shared layers across tasks</li>
<li><strong>Bypass Networks</strong>: Multitask networks augmented with per-task &ldquo;bypass&rdquo; layers that directly connect inputs to outputs</li>
<li><strong>Influence Relevance Voting (IRV)</strong>: Refined K-nearest neighbor classifiers using Jaccard-Tanimoto similarity:</li>
</ul>
<p>$$
S(\vec{A}, \vec{B}) = \frac{A \cap B}{A \cup B}
$$</p>
<h3 id="graph-based-methods">Graph-Based Methods</h3>
<ul>
<li><strong>Graph Convolutional Models (GC)</strong>: Extend circular fingerprints with learnable convolutions over molecular graphs.</li>
<li><strong>Weave Models</strong>: Update atom features using information from all other atoms and their pairwise features.</li>
<li><strong>Directed Acyclic Graph (DAG) Models</strong>: Define directed bonds toward a central atom and propagate features through the directed graph.</li>
<li><strong>Deep Tensor Neural Networks (DTNN)</strong>: Use nuclear charges and distance matrices directly, updating atom embeddings based on pairwise physical distances.</li>
<li><strong>ANI-1</strong>: Learns transferable potentials using symmetry function features with atom-type-specific neural networks.</li>
<li><strong>Message Passing Neural Networks (MPNN)</strong>: Generalized framework with edge-dependent message functions and set2set readout.</li>
</ul>
<h3 id="experimental-protocol">Experimental Protocol</h3>
<p>Gaussian process hyperparameter optimization was applied to each dataset-model combination, followed by three independent runs with different random seeds. All results are reported as means with standard deviations. Variable training-size experiments were conducted on Tox21, FreeSolv, and QM7 to study data efficiency.</p>
<h2 id="key-findings-across-property-categories">Key Findings Across Property Categories</h2>
<h3 id="biophysics-and-physiology">Biophysics and Physiology</h3>
<p>Graph convolutional and weave models showed strong performance on larger datasets with less overfitting than conventional methods. Graph-based models outperformed multitask networks at 30% training data compared to 90% on Tox21. However, for smaller single-task datasets (under 3,000 samples), kernel SVM and ensemble tree methods were more robust. On highly imbalanced datasets like MUV (0.20% positive rate), graph-based models struggled to control false positives.</p>
<p>Multitask training had a regularizing effect, reducing the gap between train and test scores compared to single-task models. Bypass networks consistently matched or exceeded vanilla multitask networks, confirming that per-task layers add explanatory power.</p>
<h3 id="physical-chemistry">Physical Chemistry</h3>
<p>Graph-based methods (GC, DAG, MPNN, Weave) provided significant improvements over single-task networks for predicting solubility, solvation energy, and lipophilicity. The best models achieved accuracy comparable to ab initio predictions (within 0.5 RMSE for ESOL, within 1.5 kcal/mol for FreeSolv). On FreeSolv, a weave model trained on approximately 200 samples matched the accuracy of alchemical free energy calculations.</p>
<h3 id="quantum-mechanics">Quantum Mechanics</h3>
<p>Models incorporating 3D distance information (DTNN, MPNN, KRR with Coulomb matrix) substantially outperformed models using only topological features. DTNN and MPNN covered the best-performing models on 28 of 39 tasks across QM datasets. The choice of physics-aware featurization proved more important than the choice of learning algorithm for these tasks.</p>
<h3 id="summary-of-best-performances">Summary of Best Performances</h3>
<p>Graph-based models outperformed conventional methods on 11 of 17 datasets. Key results on the test set:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>Best Conventional</th>
          <th>Best Graph-Based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM7</td>
          <td>MAE</td>
          <td>KRR (CM): 10.22</td>
          <td>DTNN: 8.75</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>MAE</td>
          <td>Multitask (CM): 4.35</td>
          <td>DTNN: 2.35</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>XGBoost: 0.99</td>
          <td>MPNN: 0.58</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>XGBoost: 1.74</td>
          <td>MPNN: 1.15</td>
      </tr>
      <tr>
          <td>PCBA</td>
          <td>PRC-AUC</td>
          <td>Logreg: 0.129</td>
          <td>GC: 0.136</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.822</td>
          <td>GC: 0.829</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.792</td>
          <td>GC: 0.763</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>ROC-AUC</td>
          <td>RF: 0.867</td>
          <td>Weave: 0.806</td>
      </tr>
  </tbody>
</table>
<p>Conventional methods (KernelSVM, RF) still won on several smaller or scaffold-split datasets (HIV, BACE, MUV, PDBbind, BBBP, SIDER), highlighting that graph-based models are not universally superior, particularly under data scarcity or challenging splits.</p>
<h2 id="conclusions-and-limitations">Conclusions and Limitations</h2>
<p>MoleculeNet demonstrated that learnable representations broadly offer the best performance for molecular machine learning. However, the authors identify several important caveats:</p>
<ol>
<li><strong>Data scarcity</strong>: Graph-based methods are not robust enough on complex tasks with limited training data.</li>
<li><strong>Class imbalance</strong>: On heavily imbalanced classification datasets, conventional methods such as kernel SVM outperform learnable featurizations with respect to recall of positives.</li>
<li><strong>Task-specific featurizations</strong>: For quantum mechanical and biophysical datasets, incorporating physics-aware features (<a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrix</a>, 3D coordinates) is more important than the choice of learning algorithm.</li>
<li><strong>Data-driven physical chemistry</strong>: On FreeSolv, data-driven methods outperformed ab initio calculations with moderate data, suggesting data-driven approaches will become increasingly important as methods and datasets mature.</li>
</ol>
<p>The authors express hope that MoleculeNet will stimulate algorithmic development similar to how ImageNet catalyzed breakthroughs in computer vision. Future directions include extending coverage to 3D protein structure prediction, DNA topological modeling, and other areas of molecular science.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All 17 datasets are publicly available and integrated into the DeepChem Python package. Users can load any dataset with a single library call.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM benchmark</td>
          <td>QM7/QM7b/QM8/QM9</td>
          <td>7K-134K compounds</td>
          <td>DFT-computed properties from GDB subsets</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL/FreeSolv/Lipophilicity</td>
          <td>643-4,200 compounds</td>
          <td>Experimental measurements</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA/MUV/HIV/PDBbind/BACE</td>
          <td>1.5K-440K compounds</td>
          <td>Bioassay and binding data</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP/Tox21/ToxCast/SIDER/ClinTox</td>
          <td>1.4K-8.6K compounds</td>
          <td>Toxicity and drug safety data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>All splitting methods (random, scaffold, stratified, time) and featurizations (ECFP, Coulomb matrix, grid, symmetry functions, graph convolutions, weave) are implemented in DeepChem. Hyperparameters were tuned via Gaussian process optimization. Three random seeds were used per experiment.</p>
<h3 id="models">Models</h3>
<p>All 12 models are implemented in DeepChem, built on Scikit-Learn and TensorFlow. No pretrained weights are provided; models are trained from scratch on each dataset.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics include MAE, RMSE, ROC-AUC, and PRC-AUC as specified per dataset. Multi-task datasets report mean metric values across all tasks.</p>
<h3 id="hardware">Hardware</h3>
<p>The authors used Stanford&rsquo;s Sherlock and Xstream GPU nodes. Specific GPU types and training times per model are provided in Table S1 of the supplementary material.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source library with all datasets, featurizations, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., &amp; Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. <em>Chemical Science</em>, 9(2), 513-530. <a href="https://doi.org/10.1039/c7sc02664a">https://doi.org/10.1039/c7sc02664a</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2018moleculenet,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MoleculeNet: a benchmark for molecular machine learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Zhenqin and Ramsundar, Bharath and Feinberg, Evan N. and Gomes, Joseph and Geniesse, Caleb and Pappu, Aneesh S. and Leswing, Karl and Pande, Vijay}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{513--530}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/c7sc02664a}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DOCKSTRING: Docking-Based Benchmarks for Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/dockstring-docking-benchmarks-ligand-design/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/dockstring-docking-benchmarks-ligand-design/</guid><description>DOCKSTRING provides an open-source Python docking package, 15M+ score dataset across 58 targets, and benchmark tasks for ML-driven drug design.</description><content:encoded><![CDATA[<h2 id="a-three-part-resource-for-docking-based-ml-benchmarks">A Three-Part Resource for Docking-Based ML Benchmarks</h2>
<p>DOCKSTRING is a <strong>Resource</strong> paper that delivers three integrated components for benchmarking machine learning models in drug discovery using molecular docking. The primary contributions are: (1) an open-source Python package wrapping <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a> for deterministic docking from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, (2) a dataset of over 15 million docking scores and poses covering 260,000+ molecules docked against 58 medically relevant protein targets, and (3) a suite of benchmark tasks spanning regression, <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a>, and de novo molecular design. The paper additionally provides baseline results across classical and deep learning methods.</p>
<h2 id="why-existing-molecular-benchmarks-fall-short">Why Existing Molecular Benchmarks Fall Short</h2>
<p>ML methods for drug discovery are frequently evaluated using simple physicochemical properties such as penalized logP or QED (quantitative estimate of druglikeness). These properties are computationally cheap and easy to optimize, but they do not depend on the interaction between a candidate compound and a protein target. As a result, strong performance on logP or QED benchmarks does not necessarily translate to strong performance on real drug design tasks.</p>
<p><a href="https://en.wikipedia.org/wiki/Docking_(molecular)">Molecular docking</a> offers a more realistic evaluation objective because docking scores depend on the 3D structure of the ligand-target complex. Docking is routinely used by medicinal chemists to estimate binding affinities during hit discovery and lead optimization. Several prior efforts attempted to bring docking into ML benchmarking, but each had limitations:</p>
<ul>
<li><strong>VirtualFlow and DockStream</strong> require manually prepared target files and domain expertise.</li>
<li><strong>TDC and Cieplinski et al.</strong> provide SMILES-to-score wrappers but lack proper ligand protonation and randomness control, and cover very few targets (one and four, respectively).</li>
<li><strong>DUD-E</strong> is easily overfit by ML models that memorize actives vs. decoys.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></strong> rely on physicochemical properties or similarity functions that miss 3D structural subtleties.</li>
<li><strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> compiles experimental datasets but does not support on-the-fly label computation needed for transfer learning or de novo design.</li>
</ul>
<p>DOCKSTRING addresses all of these gaps: it standardizes the docking procedure, automates ligand and target preparation, controls randomness for reproducibility, and provides a large, diverse target set.</p>
<h2 id="core-innovation-standardized-end-to-end-docking-pipeline">Core Innovation: Standardized End-to-End Docking Pipeline</h2>
<p>The key innovation is a fully automated, deterministic docking pipeline that produces reproducible scores from a SMILES string in four lines of Python code. The pipeline consists of three stages:</p>
<p><strong>Target Preparation.</strong> 57 of the 58 protein targets originate from the Directory of Useful Decoys Enhanced (DUD-E). PDB files are standardized with <a href="https://en.wikipedia.org/wiki/Open_Babel">Open Babel</a>, polar hydrogens are added, and conversion to PDBQT format is performed with AutoDock Tools. Search boxes are derived from crystallographic ligands with 12.5 A padding and a minimum side length of 30 A. The 58th target (DRD2, <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a>) was prepared separately following the same protocol.</p>
<p><strong>Ligand Preparation.</strong> Ligands are protonated at pH 7.4 with Open Babel, embedded into 3D conformations using the ETKDG algorithm in RDKit, refined with the <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94 force field</a>, and assigned Gasteiger partial charges. Stereochemistry of determined stereocenters is maintained, while undetermined stereocenters are assigned randomly but consistently across runs.</p>
<p><strong>Docking.</strong> AutoDock Vina runs with default exhaustiveness (8), up to 9 binding modes, and an energy range of 3 kcal/mol. The authors verified that fixing the random seed yields docking score variance of less than 0.1 kcal/mol across runs, making the pipeline fully deterministic.</p>
<p>The three de novo design objective functions incorporate a QED penalty to enforce druglikeness:</p>
<p>$$
f_{\text{F2}}(l) = s(l, \text{F2}) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{PPAR}}(l) = \max_{t \in \text{PPAR}} s(l, t) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{JAK2}}(l) = s(l, \text{JAK2}) - \min(s(l, \text{LCK}), -8.1) + 10(1 - \text{QED}(l))
$$</p>
<p>The F2 task optimizes binding to a single protease. The Promiscuous <a href="https://en.wikipedia.org/wiki/Peroxisome_proliferator-activated_receptor">PPAR</a> task requires strong binding to three nuclear receptors simultaneously. The Selective <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> task is adversarial, requiring strong JAK2 binding while avoiding <a href="https://en.wikipedia.org/wiki/Tyrosin-protein_kinase_Lck">LCK</a> binding (two kinases with a score correlation of 0.80).</p>
<h2 id="experimental-setup-regression-virtual-screening-and-de-novo-design">Experimental Setup: Regression, Virtual Screening, and De Novo Design</h2>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>The dataset combines molecules from ExCAPE-DB (which curates PubChem and ChEMBL bioactivity assays). The authors selected all molecules with active labels against targets having at least 1,000 experimental actives, plus 150,000 inactive-only molecules. After discarding 1.8% of molecules that failed ligand preparation, the final dataset contains 260,155 compounds docked against 58 targets, producing over 15 million docking scores and poses. The dataset required over 500,000 CPU hours to generate.</p>
<p>Cluster analysis using <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard distance</a> threshold of 0.25 on RDKit fingerprints) found 52,000 clusters, and Bemis-Murcko scaffold decomposition identified 102,000 scaffolds, confirming high molecular diversity. Train/test splitting follows cluster labels to prevent data leakage.</p>
<h3 id="regression-baselines">Regression Baselines</h3>
<p>Five targets of varying difficulty were selected: <a href="https://en.wikipedia.org/wiki/Poly_(ADP-ribose)_polymerase">PARP1</a> (easy), F2 (easy-medium), KIT (medium), ESR2 (hard), and PGR (hard). Baselines include Ridge, Lasso, XGBoost, exact GP, sparse GP, MPNN, and Attentive FP.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Ridge</th>
          <th>Lasso</th>
          <th>XGBoost</th>
          <th>GP (exact)</th>
          <th>GP (sparse)</th>
          <th>MPNN</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>0.640</td>
          <td>0.640</td>
          <td>0.734</td>
          <td>0.707</td>
          <td>0.716</td>
          <td>0.953</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.519</td>
          <td>0.483</td>
          <td>0.660</td>
          <td>0.640</td>
          <td>0.598</td>
          <td>0.901</td>
          <td>0.981</td>
      </tr>
      <tr>
          <td>ESR2</td>
          <td>0.421</td>
          <td>0.416</td>
          <td>0.497</td>
          <td>0.441</td>
          <td>0.508</td>
          <td>0.506</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>F2</td>
          <td>0.672</td>
          <td>0.663</td>
          <td>0.688</td>
          <td>0.705</td>
          <td>0.744</td>
          <td>0.798</td>
          <td>0.880</td>
      </tr>
      <tr>
          <td>KIT</td>
          <td>0.604</td>
          <td>0.594</td>
          <td>0.674</td>
          <td>0.637</td>
          <td>0.684</td>
          <td>0.755</td>
          <td>0.806</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>0.706</td>
          <td>0.700</td>
          <td>0.723</td>
          <td>0.743</td>
          <td>0.772</td>
          <td>0.815</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>0.242</td>
          <td>0.245</td>
          <td>0.345</td>
          <td>0.291</td>
          <td>0.387</td>
          <td>0.324</td>
          <td>0.678</td>
      </tr>
  </tbody>
</table>
<p>Values are mean $R^2$ over three runs. Attentive FP achieves the best performance on every target but remains well below perfect prediction on the harder targets, confirming that docking score regression is a meaningful benchmark.</p>
<h3 id="virtual-screening-baselines">Virtual Screening Baselines</h3>
<p>Models trained on PARP1, KIT, and PGR docking scores rank all molecules in <a href="/notes/chemistry/datasets/zinc-22/">ZINC20</a> (~1 billion compounds). The top 5,000 predictions are docked, and the enrichment factor (EF) is computed relative to a 0.1 percentile activity threshold.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Threshold</th>
          <th>FSS</th>
          <th>Ridge</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>KIT</td>
          <td>-10.7</td>
          <td>239.2</td>
          <td>451.6</td>
          <td>766.5</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>-12.1</td>
          <td>313.1</td>
          <td>325.9</td>
          <td>472.2</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>-10.1</td>
          <td>161.4</td>
          <td>120.5</td>
          <td>461.3</td>
      </tr>
  </tbody>
</table>
<p>The maximum possible EF is 1,000. Attentive FP substantially outperforms fingerprint similarity search (FSS) and Ridge regression across all targets.</p>
<h3 id="de-novo-design-baselines">De Novo Design Baselines</h3>
<p>Four optimization methods were tested: <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> GA, <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, GP-BO with UCB acquisition ($\beta = 10$), and GP-BO with expected improvement (EI), each with a budget of 5,000 objective function evaluations. Without QED penalties, all methods easily surpass the best training set molecules but produce large, lipophilic, undrug-like compounds. With QED penalties, the tasks become substantially harder: GP-BO with EI is the only method that finds 25 molecules better than the training set across all three tasks.</p>
<p>The Selective JAK2 task proved hardest due to the high correlation between JAK2 and LCK scores. Pose analysis of the top de novo molecule revealed a dual binding mode: type V inhibitor behavior in JAK2 (binding distant N- and C-terminal lobe regions) and type I behavior in LCK (hinge-binding), suggesting a plausible selectivity mechanism.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ol>
<li>Docking scores are substantially harder to predict than logP or QED, making them more suitable for benchmarking high-performing ML models. Graph neural networks (Attentive FP) achieve near-perfect $R^2$ on logP but only 0.63-0.91 on docking targets.</li>
<li>In-distribution regression difficulty does not necessarily predict out-of-distribution virtual screening difficulty. PARP1 is easiest for regression, but KIT is easiest for virtual screening.</li>
<li>Adding a QED penalty to de novo design objectives transforms trivially solvable tasks into meaningful benchmarks. The adversarial Selective JAK2 objective, which exploits correlated docking scores, may be an effective way to avoid docking score biases toward large and lipophilic molecules.</li>
<li>Docking scores from related protein targets are highly correlated, supporting the biological meaningfulness of the dataset and enabling multiobjective and transfer learning tasks.</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>Docking scores are approximate heuristics. They use static binding sites and force fields with limited calibration for certain metal ions. DOCKSTRING benchmarks should not substitute for rational drug design and experimental validation.</li>
<li>The pipeline relies on AutoDock Vina specifically; other docking programs may produce different rankings.</li>
<li>Top de novo molecules for F2 and Promiscuous PPAR contain conjugated ring structures uncommon in successful drugs.</li>
<li>Platform support is primarily Linux, with noted scoring inconsistencies on macOS.</li>
</ul>
<p><strong>Future directions</strong> mentioned include multiobjective tasks (transfer learning, few-shot learning), improved objective functions for better pharmacokinetic properties and synthetic feasibility, and multifidelity optimization tasks combining docking with more expensive computational methods.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ligand source</td>
          <td>ExCAPE-DB (PubChem + ChEMBL)</td>
          <td>260,155 molecules</td>
          <td>Actives against 58 targets + 150K inactive-only</td>
      </tr>
      <tr>
          <td>Docking scores</td>
          <td>DOCKSTRING dataset</td>
          <td>15M+ scores and poses</td>
          <td>Full matrix across all molecule-target pairs</td>
      </tr>
      <tr>
          <td>Virtual screening library</td>
          <td>ZINC20</td>
          <td>~1 billion molecules</td>
          <td>Used for out-of-distribution evaluation</td>
      </tr>
      <tr>
          <td>Target structures</td>
          <td>DUD-E + PDB 6CM4 (DRD2)</td>
          <td>58 targets</td>
          <td>Kinases (22), enzymes (12), nuclear receptors (9), proteases (7), GPCRs (5), cytochromes (2), chaperone (1)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Docking engine</strong>: AutoDock Vina with default exhaustiveness (8), up to 9 binding modes, energy range of 3 kcal/mol</li>
<li><strong>Ligand preparation</strong>: Open Babel (protonation at pH 7.4), RDKit ETKDG (3D embedding), MMFF94 (force field refinement), Gasteiger charges</li>
<li><strong>Regression models</strong>: Ridge, Lasso, XGBoost (hyperparameters via 20-configuration random search with 5-fold CV), exact GP and sparse GP (Tanimoto kernel on fingerprints), MPNN, Attentive FP (DeepChem defaults, 10 epochs)</li>
<li><strong>Optimization</strong>: Graph GA (population 250, offspring 25, mutation rate 0.01), SELFIES GA (same population/offspring settings), GP-BO with UCB ($\beta = 10$) or EI (batch size 5, 1000 offspring, 25 generations per iteration)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Setting</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2$ (coefficient of determination)</td>
          <td>Regression</td>
          <td>Cluster-split train/test</td>
      </tr>
      <tr>
          <td>EF (enrichment factor)</td>
          <td>Virtual screening</td>
          <td>Top 5,000 from ZINC20, 0.1 percentile threshold</td>
      </tr>
      <tr>
          <td>Objective value trajectory</td>
          <td>De novo design</td>
          <td>5,000 function evaluation budget</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The dataset required over 500,000 CPU hours to compute, using the University of Cambridge Research Computing Service (EPSRC and DiRAC funded). Per-target docking takes approximately 15 seconds on 8 CPUs.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">DOCKSTRING Python package</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Wraps AutoDock Vina; available via conda-forge and PyPI</td>
      </tr>
      <tr>
          <td><a href="https://dockstring.github.io">DOCKSTRING dataset</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>15M+ docking scores and poses for 260K molecules x 58 targets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">Benchmark baselines</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Regression, virtual screening, and de novo design baseline implementations</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: García-Ortegón, M., Simm, G. N. C., Tripp, A. J., Hernández-Lobato, J. M., Bender, A., &amp; Bacallado, S. (2022). DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design. <em>Journal of Chemical Information and Modeling</em>, 62(15), 3486-3502. <a href="https://doi.org/10.1021/acs.jcim.1c01334">https://doi.org/10.1021/acs.jcim.1c01334</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://dockstring.github.io">DOCKSTRING Project Page</a></li>
<li><a href="https://github.com/dockstring/dockstring">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{garciaortegon2022dockstring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{DOCKSTRING}: Easy Molecular Docking Yields Better Benchmarks for Ligand Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Garc{\&#39;\i}a-Orteg{\&#39;o}n, Miguel and Simm, Gregor N. C. and Tripp, Austin J. and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and Bender, Andreas and Bacallado, Sergio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3486--3502}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.1c01334}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBench: Evaluating LLM Chemistry Against Experts</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chembench-llm-chemistry-evaluation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chembench-llm-chemistry-evaluation/</guid><description>ChemBench benchmarks LLM chemical knowledge with 2,700+ questions across topics, finding top models outperform expert chemists on average.</description><content:encoded><![CDATA[<h2 id="a-benchmark-resource-for-chemistry-focused-llm-evaluation">A Benchmark Resource for Chemistry-Focused LLM Evaluation</h2>
<p>ChemBench is a <strong>Resource</strong> paper that introduces an automated benchmarking framework for evaluating the chemical knowledge and reasoning abilities of large language models against human expert chemists. The primary contribution is the benchmark corpus itself (2,788 question-answer pairs), the evaluation infrastructure, and the human baseline study that contextualizes model performance. The framework is designed to be extensible and can evaluate any system that returns text, including tool-augmented agents.</p>
<h2 id="why-chemistry-needs-its-own-llm-benchmark">Why Chemistry Needs Its Own LLM Benchmark</h2>
<p>Existing LLM benchmarks provide poor coverage of chemistry. BigBench contains only 2 of 204 tasks classified as chemistry-related, and the LM Eval Harness contains none. Developers of chemical language models often fall back on tabular property-prediction datasets (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>, Therapeutic Data Commons, MatBench), which give a narrow view of chemical capabilities. Prior attempts at chemistry-specific benchmarks based on university entrance exams or automatic text mining have not gained wide acceptance because they cannot be used with black-box or tool-augmented systems, do not cover a broad range of topics and skills, or are not validated by domain experts.</p>
<p>At the same time, LLMs are increasingly used in chemistry: for property prediction, reaction optimization, materials generation, information extraction, and even autonomous experiment execution. Some users (students, general public) may rely on LLMs for safety-critical chemical questions without the expertise to evaluate outputs. Understanding where LLMs succeed and fail in chemistry is therefore both a scientific and a safety question.</p>
<h2 id="chembench-framework-design-and-benchmark-corpus">ChemBench: Framework Design and Benchmark Corpus</h2>
<p>ChemBench addresses these gaps with several design choices that distinguish it from prior work.</p>
<p><strong>Diverse question corpus.</strong> The benchmark contains 2,788 question-answer pairs from multiple sources: 1,039 manually generated (from university exams, chemistry olympiads, textbooks, and novel questions) and 1,749 semi-automatically generated (from chemical databases covering <a href="https://en.wikipedia.org/wiki/Globally_Harmonized_System_of_Classification_and_Labelling_of_Chemicals">GHS pictograms</a>, daily allowed intakes, hazard statements, <a href="https://en.wikipedia.org/wiki/Nuclear_magnetic_resonance_spectroscopy">NMR</a> peak counts, electron counts, IUPAC-SMILES conversions, oxidation states, and <a href="https://en.wikipedia.org/wiki/Point_group">point groups</a>). Questions span general, organic, inorganic, physical, analytical, and technical chemistry, among other topics.</p>
<p><strong>Skill-based classification.</strong> Each question is annotated with the skills required to answer it: knowledge, reasoning, calculation, intuition, or combinations thereof. Questions are also classified by difficulty level (basic vs. advanced), enabling fine-grained analysis of model capabilities.</p>
<p><strong>Both MCQ and open-ended formats.</strong> The corpus includes 2,544 multiple-choice and 244 open-ended questions, reflecting the reality that chemistry education and research involve more than multiple-choice testing.</p>
<p><strong>Semantic annotation.</strong> Questions use tagged annotations for molecules (<code>[START_SMILES]...[END_SMILES]</code>), equations, units, and reactions. This allows models with special processing for scientific notation (e.g., <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a>) to handle these modalities appropriately, while remaining compatible with standard text-completion APIs.</p>
<p><strong>Text-completion evaluation.</strong> ChemBench operates on text completions rather than raw logits, enabling evaluation of tool-augmented and agentic systems (not just bare models). Parsing uses multi-step regex followed by LLM-based extraction as a fallback.</p>
<p><strong>ChemBench-Mini.</strong> A curated 236-question subset balances topic and skill diversity for fast, cost-effective routine evaluations. This subset was also used for the full human baseline study.</p>
<h2 id="evaluation-setup-models-human-experts-and-confidence">Evaluation Setup: Models, Human Experts, and Confidence</h2>
<h3 id="models-evaluated">Models evaluated</h3>
<p>The study evaluated a wide range of leading models, including both open-source and proprietary systems: o1-preview, GPT-4, Claude-3.5 (Sonnet), Llama-3.1-405B-Instruct, and others, as well as the agentic literature-search system PaperQA2. All models used greedy decoding (temperature 0) via API endpoints.</p>
<h3 id="human-baseline">Human baseline</h3>
<p>Nineteen chemistry experts participated through a custom web application (chembench.org). Volunteers included 2 post-postdoc researchers, 13 PhD students (with master&rsquo;s degrees), and 1 bachelor&rsquo;s holder. The analysis excluded anyone with fewer than 2 years of chemistry experience. For a subset of questions, volunteers were allowed to use external tools (web search, ChemDraw) but not LLMs or other people.</p>
<h3 id="confidence-calibration">Confidence calibration</h3>
<p>Selected top-performing models were prompted to estimate their confidence on a 1-5 ordinal scale (verbalized confidence estimates). This approach captures semantic uncertainty and works with models that do not expose logits.</p>
<h2 id="key-results-where-llms-outperform-chemists-and-where-they-fail">Key Results: Where LLMs Outperform Chemists and Where They Fail</h2>
<h3 id="overall-performance">Overall performance</h3>
<p>On ChemBench-Mini, the leading model (o1-preview) outperformed the best human expert by nearly a factor of two in overall accuracy. Many other models also exceeded average human performance. Llama-3.1-405B-Instruct achieved performance close to the leading proprietary models, showing that open-source models can be competitive in chemical settings.</p>
<h3 id="performance-varies-by-topic">Performance varies by topic</h3>
<p>While models scored well on general and technical chemistry, they performed poorly on toxicity/safety and analytical chemistry. Predicting the number of NMR signals was particularly difficult (22% correct for o1-preview). This task requires reasoning about molecular symmetry from a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string, which models struggle with compared to humans who can view molecular drawings.</p>
<h3 id="textbook-questions-vs-database-derived-questions">Textbook questions vs. database-derived questions</h3>
<p>Models performed better on textbook-inspired questions than on semi-automatically constructed tasks. For example, models could pass the German Chemical Prohibition Ordinance certification exam (71% for GPT-4, 61% for Claude-3.5 Sonnet) while human experts scored only 3% on the sampled subset. This suggests that good textbook question performance does not transfer to tasks requiring deeper reasoning or knowledge outside the training corpus.</p>
<h3 id="knowledge-intensive-limitations">Knowledge-intensive limitations</h3>
<p>Models struggled with knowledge-intensive questions that required looking up facts in specialized databases (<a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, Gestis). PaperQA2, which augments LLMs with literature search, could not compensate because the required knowledge lives in specialized databases rather than papers.</p>
<h3 id="chemical-preference-judgment">Chemical preference judgment</h3>
<p>When asked to judge chemical preference (choosing between two molecules in an early <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a> setting, following the Choung et al. dataset), model performance was often indistinguishable from random guessing, even for models that excelled at other ChemBench tasks. Human chemists showed reasonable inter-rater agreement on the same questions.</p>
<h3 id="confidence-calibration-is-poor">Confidence calibration is poor</h3>
<p>For most models, verbalized confidence estimates did not correlate meaningfully with actual correctness. GPT-4 reported confidence of 1.0 for a correctly answered safety question but 4.0 for six incorrectly answered ones. Claude-3.5 Sonnet showed slightly better calibration on average but still produced misleading estimates in specific topic areas (e.g., GHS pictogram labeling: average confidence of 2.0 for correct answers vs. 1.83 for incorrect ones).</p>
<h3 id="scaling-and-molecular-complexity">Scaling and molecular complexity</h3>
<p>Model performance correlated with model size, consistent with observations in other domains. However, performance did not correlate with molecular complexity indicators, suggesting that models may rely on training data proximity rather than genuine structural reasoning.</p>
<h2 id="implications-for-chemistry-and-llm-development">Implications for Chemistry and LLM Development</h2>
<p>The authors draw several conclusions from the ChemBench evaluation.</p>
<p><strong>Chemistry education needs rethinking.</strong> Since LLMs already outperform average human chemists on many textbook-style questions, the value of rote memorization and problem-solving in chemistry curricula is diminishing. Critical reasoning and evaluation of model outputs become more important skills.</p>
<p><strong>Breadth vs. depth matters.</strong> Model performance varies widely across topics and question types, even within a single topic. Aggregate scores can mask significant weaknesses in safety-critical areas.</p>
<p><strong>Better human-model interaction is needed.</strong> Poor confidence calibration means users cannot trust models&rsquo; self-reported uncertainty. Developing better uncertainty estimation for chemical LLMs is an important direction.</p>
<p><strong>Room for improvement through specialized data.</strong> Training on specialized chemical databases (rather than just papers) and integrating domain-specific tools could address the knowledge-intensive gaps identified by ChemBench.</p>
<p><strong>Open science framework.</strong> ChemBench is designed for extensibility: new models can be added by contributors, and the leaderboard is publicly accessible. The use of a BigBench-compatible canary string helps prevent test set contamination in future training corpora.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>ChemBench (full corpus)</td>
          <td>2,788 Q-A pairs</td>
          <td>1,039 manual + 1,749 semi-automatic</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ChemBench-Mini</td>
          <td>236 questions</td>
          <td>Curated diverse subset; used for human baseline</td>
      </tr>
      <tr>
          <td>Chemical preference</td>
          <td>Choung et al. dataset</td>
          <td>1,000 sampled pairs</td>
          <td>From original 5,000+ dataset</td>
      </tr>
  </tbody>
</table>
<p>All benchmark data is publicly available on GitHub and archived on Zenodo.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Evaluation uses greedy decoding (temperature 0) for all models. Parsing is multi-step: regex extraction of answer environments and enumeration letters/numbers, word-to-number conversion, and LLM-based fallback parsing (Claude-3.5 Sonnet). Confidence estimates are verbalized on an ordinal 1-5 scale.</p>
<h3 id="models">Models</h3>
<p>The paper evaluates multiple models including o1-preview, GPT-4, Claude-3.5 (Sonnet), Llama-3.1-405B-Instruct, Galactica, and PaperQA2. Model weights are not released (the contribution is the benchmark, not a model).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy (% correct)</td>
          <td>Per question, per topic, overall</td>
          <td>Strict: partially correct = incorrect</td>
      </tr>
      <tr>
          <td>Confidence calibration</td>
          <td>Ordinal 1-5 scale</td>
          <td>Verbalized, not logit-based</td>
      </tr>
      <tr>
          <td>Human comparison</td>
          <td>19 experts on ChemBench-Mini</td>
          <td>Tools allowed for subset</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not applicable; the benchmark is designed for API-based evaluation. Cost context: Liang et al. report &gt;US$10,000 for a single HELM evaluation, motivating ChemBench-Mini.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/lamalab-org/chembench">ChemBench Code &amp; Data</a></td>
          <td>Code + Dataset</td>
          <td>MIT</td>
          <td>Framework and benchmark corpus</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/14010212">ChemBench Zenodo Archive</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Version v0.2.0, archived</td>
      </tr>
      <tr>
          <td><a href="https://github.com/lamalab-org/chem-bench-app">ChemBench Web App</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Human baseline survey application</td>
      </tr>
      <tr>
          <td><a href="https://chembench.org">ChemBench Leaderboard</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Public model leaderboard</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Mirza, A., Alampara, N., Kunchapu, S., Ríos-García, M., Emoekabu, B., Krishnan, A., &hellip; &amp; Jablonka, K. M. (2025). A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. <em>Nature Chemistry</em>, 17(7), 1027-1034. <a href="https://doi.org/10.1038/s41557-025-01815-x">https://doi.org/10.1038/s41557-025-01815-x</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mirza2025chembench,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Mirza, Adrian and Alampara, Nawaf and Kunchapu, Sreekanth and R{\&#39;\i}os-Garc{\&#39;\i}a, Marti{\~n}o and Emoekabu, Benedict and Krishnan, Aswanth and Gupta, Tanya and Schilling-Wilhelmi, Mara and Okereke, Macjonathan and Aneesh, Anagha and Asgari, Mehrdad and Eberhardt, Juliane and Elahi, Amir Mohammad and Elbeheiry, Hani M. and Gil, Mar{\&#39;\i}a Victoria and Glaubitz, Christina and Greiner, Maximilian and Holick, Caroline T. and Hoffmann, Tim and Ibrahim, Abdelrahman and Klepsch, Lea C. and K{\&#34;o}ster, Yannik and Kreth, Fabian Alexander and Meyer, Jakob and Miret, Santiago and Peschel, Jan Matthias and Ringleb, Michael and Roesner, Nicole C. and Schreiber, Johanna and Schubert, Ulrich S. and Stafast, Leanne M. and Wonanke, A. D. Dinga and Pieler, Michael and Schwaller, Philippe and Jablonka, Kevin Maik}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1027--1034}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41557-025-01815-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OCSU: Optical Chemical Structure Understanding (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</guid><description>OCSU task for translating molecular images into multi-level descriptions. Introduces Vis-CheBI20 dataset and DoubleCheck/Mol-VL for molecular understanding.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, S., Xie, Y., Cai, B., Xie, A., Liu, G., Qiao, M., Xing, J., &amp; Nie, Z. (2025). OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery. <em>arXiv preprint arXiv:2501.15415</em>. <a href="https://doi.org/10.48550/arXiv.2501.15415">https://doi.org/10.48550/arXiv.2501.15415</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/PharMolix/OCSU">Code and Dataset (GitHub)</a></li>
</ul>
<h2 id="multi-level-chemical-understanding-method-and-resource">Multi-Level Chemical Understanding (Method and Resource)</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution.</p>
<ul>
<li><strong>Methodological</strong>: It proposes two novel architectures, <strong>DoubleCheck</strong> (an enhanced recognition model) and <strong>Mol-VL</strong> (an end-to-end vision-language model), to solve the newly formulated OCSU task.</li>
<li><strong>Resource</strong>: It constructs and releases <strong>Vis-CheBI20</strong>, the first large-scale dataset specifically designed for optical chemical structure understanding, containing 29.7K images and 117.7K image-text pairs.</li>
</ul>
<h2 id="the-motivation-for-ocsu-beyond-basic-graph-recognition">The Motivation for OCSU Beyond Basic Graph Recognition</h2>
<p>Existing methods for processing molecular images focus narrowly on <strong>Optical Chemical Structure Recognition (OCSR)</strong>, which translates an image solely into a machine-readable graph or SMILES string. However, SMILES strings are not chemist-friendly and lack high-level semantic context.</p>
<ul>
<li><strong>Gap</strong>: There is a lack of systems that can translate chemical diagrams into human-readable descriptions (e.g., functional groups, IUPAC names) alongside the graph structure.</li>
<li><strong>Goal</strong>: To enable <strong>Optical Chemical Structure Understanding (OCSU)</strong>, bridging the gap between visual representations and both machine/chemist-readable descriptions to support drug discovery and property prediction.</li>
</ul>
<h2 id="key-innovations-doublecheck-mol-vl-and-the-vis-chebi20-dataset">Key Innovations: DoubleCheck, Mol-VL, and the Vis-CheBI20 Dataset</h2>
<p>The paper introduces the <strong>OCSU task</strong>, enabling multi-level understanding (motif, molecule, and abstract levels). To solve this, it introduces two distinct paradigms:</p>
<ol>
<li><strong>DoubleCheck (OCSR-based)</strong>: An enhancement to standard OCSR models (like MolScribe) that performs a &ldquo;second look&rdquo; at locally ambiguous atoms. It uses attentive feature enhancement to fuse global molecular features with local features from ambiguous regions.</li>
<li><strong>Mol-VL (OCSR-free)</strong>: An end-to-end Vision-Language Model (VLM) based on Qwen2-VL. It uses multi-task learning to directly generate text descriptions from molecular images without an intermediate SMILES step.</li>
<li><strong>Vis-CheBI20 Dataset</strong>: A new benchmark specifically constructed for OCSU, deriving captions and functional group data from ChEBI-20 and PubChem.</li>
</ol>
<h2 id="methodology-and-experimental-evaluation">Methodology and Experimental Evaluation</h2>
<p>The authors evaluated both paradigms on <strong>Vis-CheBI20</strong> and existing benchmarks (USPTO, ACS) across four subtasks:</p>
<ol>
<li><strong>Functional Group Caption</strong>: Retrieval/F1 score evaluation.</li>
<li><strong>Molecule Description</strong>: Natural language generation metrics (BLEU, ROUGE, METEOR).</li>
<li><strong>IUPAC Naming</strong>: Text generation metrics (BLEU, ROUGE).</li>
<li><strong>SMILES Naming (OCSR)</strong>: Exact matching accuracy ($Acc_s$).</li>
</ol>
<p><strong>Baselines</strong>:</p>
<ul>
<li><strong>Task-Specific</strong>: MolScribe, MolVec, OSRA.</li>
<li><strong>LLM/VLM</strong>: Qwen2-VL, BioT5+, Mol-Instructions.</li>
<li><strong>Ablation</strong>: DoubleCheck vs. MolScribe backbone to test the &ldquo;feature enhancement&rdquo; mechanism.</li>
</ul>
<h2 id="results-and-conclusions-paradigm-trade-offs">Results and Conclusions: Paradigm Trade-Offs</h2>
<ul>
<li><strong>DoubleCheck Superiority</strong>: DoubleCheck outperformed MolScribe on OCSR tasks across all benchmarks. On USPTO, it achieved <strong>92.85%</strong> $Acc_s$ (vs. 92.57%), and on the ACS dataset it showed a <strong>+3.12%</strong> gain on chiral molecules. On Vis-CheBI20, DoubleCheck improved over MolScribe by an average of 2.27% across all metrics.</li>
<li><strong>Paradigm Trade-offs</strong>:
<ul>
<li><strong>Mol-VL (OCSR-free)</strong> excelled at semantic tasks like <strong>Functional Group Captioning</strong>, achieving <strong>97.32%</strong> F1 (vs. 93.63% for DoubleCheck &amp; RDKit and 89.60% for MolScribe &amp; RDKit). It benefits from end-to-end learning of structural context.</li>
<li><strong>DoubleCheck (OCSR-based)</strong> performed better on <strong>IUPAC naming recall</strong> and exact SMILES recovery, as explicit graph reconstruction is more precise for rigid nomenclature than VLM generation.</li>
</ul>
</li>
<li><strong>Conclusion</strong>: Enhancing submodules improves OCSR-based paradigms, while end-to-end VLMs offer stronger semantic understanding but struggle with exact syntax generation (SMILES/IUPAC).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Vis-CheBI20 Dataset</strong></p>
<ul>
<li><strong>Source</strong>: Derived from ChEBI-20 and PubChem.</li>
<li><strong>Size</strong>: 29,700 molecular diagrams, 117,700 image-text pairs.</li>
<li><strong>Generation</strong>: Images generated from SMILES using RDKit to simulate real-world journal/patent styles.</li>
<li><strong>Splits</strong> (vary by task, see table below):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Train Size</th>
          <th style="text-align: left">Test Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Functional Group</td>
          <td style="text-align: left">26,144</td>
          <td style="text-align: left">3,269</td>
      </tr>
      <tr>
          <td style="text-align: left">Description</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
      <tr>
          <td style="text-align: left">IUPAC Naming</td>
          <td style="text-align: left">26,200</td>
          <td style="text-align: left">2,680</td>
      </tr>
      <tr>
          <td style="text-align: left">SMILES Naming</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>DoubleCheck (Attentive Feature Enhancement)</strong></p>
<ol>
<li><strong>Ambiguity Detection</strong>: Uses atom prediction confidence to identify &ldquo;ambiguous atoms&rdquo;.</li>
<li><strong>Masking</strong>: Applies a 2D Gaussian mask to the image centered on the ambiguous atom.</li>
<li><strong>Local Encoding</strong>: A Swin-B encoder ($\Phi_l$) encodes the masked image region.</li>
<li><strong>Fusion</strong>: Aligns local features ($\mathcal{F}_l$) with global features ($\mathcal{F}_g$) using a 2-layer MLP and fuses them via weighted summation.</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{F}_e = \mathcal{F}_g + \text{MLP}(\mathcal{F}_g \oplus \hat{\mathcal{F}}_l) \cdot \hat{\mathcal{F}}_l
\end{aligned}
$$</p>
<ol start="5">
<li><strong>Two-Stage Training</strong>:
<ul>
<li>Stage 1: Train atom/bond predictors (30 epochs).</li>
<li>Stage 2: Train alignment/fusion modules with random Gaussian mask noise (10 epochs).</li>
</ul>
</li>
</ol>
<p><strong>Mol-VL (Multi-Task VLM)</strong></p>
<ul>
<li><strong>Prompting</strong>: System prompt: &ldquo;You are working as an excellent assistant in chemistry&hellip;&rdquo;</li>
<li><strong>Tokens</strong>: Uses <code>&lt;image&gt;</code> and <code>&lt;/image&gt;</code> special tokens.</li>
<li><strong>Auxiliary Task</strong>: Functional group recognition (identifying highlighted groups) added to training to improve context learning.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>DoubleCheck</strong>:
<ul>
<li><strong>Backbone</strong>: MolScribe architecture.</li>
<li><strong>Encoders</strong>: Swin-B for both global and local atom encoding.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>:
<ul>
<li><strong>Base Model</strong>: Qwen2-VL (2B and 7B versions).</li>
<li><strong>Vision Encoder</strong>: ViT with naive dynamic resolution and M-RoPE.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Metrics</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Exact Match Accuracy ($Acc_s$), Chiral Accuracy ($Acc_c$).</li>
<li><strong>Functional Groups</strong>: F1 Score (Information Retrieval task).</li>
<li><strong>Text Generation</strong>: BLEU-2/4, METEOR, ROUGE-L.</li>
</ul>
<p><strong>Selected Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left"><strong>92.85%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>MolScribe</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left">92.57%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mol-VL-7B</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left"><strong>97.32%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck &amp; RDKit</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left">93.63%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>DoubleCheck</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>4 days</strong>.
<ul>
<li>Max LR: 4e-4.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>10 days</strong>.
<ul>
<li>Max LR: 1e-5, 50 epochs.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/PharMolix/OCSU">PharMolix/OCSU (GitHub)</a></td>
          <td style="text-align: left">Code, Model, Dataset</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation, Mol-VL-7B weights, and Vis-CheBI20 dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>The long-tail distribution of functional groups in training data limits performance on uncommon chemical structures.</li>
<li>Mol-VL struggles with exact syntax generation (SMILES and IUPAC) compared to explicit graph-reconstruction approaches.</li>
<li>Vis-CheBI20 images are synthetically generated via RDKit, which may not fully capture the diversity of real-world journal and patent images.</li>
<li>The authors note that OCSU technologies should be restricted to research purposes, as downstream molecule discovery applications could potentially generate harmful molecules.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{fanOCSUOpticalChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSU}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fan, Siqi and Xie, Yuguang and Cai, Bowen and Xie, Ailin and Liu, Gaochao and Qiao, Mu and Xing, Jie and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Sets (MOSES): A Generative Modeling Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/</link><pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/</guid><description>MOSES provides a standardized benchmarking platform for molecular generative models, featuring datasets, metrics, and baselines.</description><content:encoded><![CDATA[<h2 id="the-role-of-moses-a-benchmarking-resource">The Role of MOSES: A Benchmarking Resource</h2>
<p>This is a <strong>Resource and Benchmarking</strong> paper. It introduces Molecular Sets (MOSES), a platform designed to standardize the training, comparison, and evaluation of molecular generative models. It provides a standardized dataset, a suite of evaluation metrics, and a collection of baseline models to serve as reference points for the field.</p>
<h2 id="motivation-the-reproducibility-crisis-in-generative-chemistry">Motivation: The Reproducibility Crisis in Generative Chemistry</h2>
<p>Generative models are increasingly popular for drug discovery and material design, capable of exploring the vast chemical space ($10^{23}$ to $10^{80}$ compounds) more efficiently than traditional methods. However, the field faces a significant reproducibility crisis:</p>
<ol>
<li><strong>Lack of Standardization</strong>: There is no consensus on how to properly compare and rank the efficacy of different generative models.</li>
<li><strong>Inconsistent Metrics</strong>: Different papers use different metrics or distinct implementations of the same metrics.</li>
<li><strong>Data Variance</strong>: Models are often trained on different subsets of chemical databases (like ZINC), making direct comparison impossible.</li>
</ol>
<p>MOSES aims to solve these issues by providing a unified &ldquo;measuring stick&rdquo; for distribution learning models in chemistry.</p>
<h2 id="core-innovation-standardizing-chemical-distribution-learning">Core Innovation: Standardizing Chemical Distribution Learning</h2>
<p>The core contribution is the <strong>standardization of the distribution learning definition</strong> for molecular generation. Why focus on distribution learning? Rule-based filters enforce strict boundaries like molecular weight limits. Distribution learning complements this by allowing chemists to apply <strong>implicit or soft restrictions</strong>. This ensures that generated molecules satisfy hard constraints and reflect complex chemical realities defined by the training distribution. These realities include the prevalence of certain substructures and the avoidance of unstable motifs.</p>
<p>MOSES specifically targets distribution learning by providing:</p>
<ol>
<li><strong>A Clean, Standardized Dataset</strong>: A specific subset of the ZINC Clean Leads collection with rigorous filtering.</li>
<li><strong>Diverse Metrics</strong>: A comprehensive suite of metrics that measure validity alongside novelty, diversity (internal and external), chemical properties (properties distribution), and substructure similarity.</li>
<li><strong>Open Source Platform</strong>: A Python library <code>molsets</code> that decouples the data and evaluation logic from the model implementation, ensuring everyone measures performance exactly the same way.</li>
</ol>
<h2 id="experimental-setup-and-baseline-generative-models">Experimental Setup and Baseline Generative Models</h2>
<p>The authors benchmarked a wide variety of generative models against the MOSES dataset to establish baselines:</p>
<ul>
<li><strong>Baselines</strong>: Character-level RNN (CharRNN), <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoder</a> (VAE), Adversarial Autoencoder (AAE), Junction Tree VAE (JTN-VAE), and <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>.</li>
<li><strong>Non-Neural Baselines</strong>: HMM, n-gram models, and a combinatorial generator (randomly connecting fragments).</li>
<li><strong>Evaluation</strong>: Models were trained on the standard set and evaluated on:
<ul>
<li><strong>Validity/Uniqueness</strong>: Can the model generate valid, non-duplicate SMILES? Uniqueness is measured at $k = 1{,}000$ and $k = 10{,}000$ samples.</li>
<li><strong>Filters</strong>: What fraction of generated molecules pass the same medicinal chemistry and PAINS filters used for dataset construction?</li>
<li><strong>Feature Distribution</strong>: Do generated molecules match the physicochemical properties of the training set? Evaluated using the <strong>Wasserstein-1 distance</strong> on 1D distributions of:
<ul>
<li><strong>LogP</strong>: Octanol-water partition coefficient (lipophilicity).</li>
<li><strong>SA</strong>: Synthetic Accessibility score (ease of synthesis).</li>
<li><strong>QED</strong>: Quantitative Estimation of Drug-likeness.</li>
<li><strong>MW</strong>: Molecular Weight.</li>
</ul>
</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Measures similarity in biological/chemical space using the penultimate-layer (second-to-last layer) activations of a pre-trained network (ChemNet).</li>
<li><strong>Similarity to Nearest Neighbor (SNN)</strong>: Measures the precision of generation by checking the closest match in the training set (Tanimoto similarity).</li>
</ul>
</li>
</ul>
<h2 id="key-findings-and-metric-trade-offs">Key Findings and Metric Trade-offs</h2>
<ul>
<li><strong>CharRNN Performance</strong>: The simple character-level RNN (CharRNN) outperformed more complex models (like VAEs and <a href="/posts/what-is-a-gan/">GANs</a>) on many metrics, achieving the best FCD scores ($0.073$).</li>
<li><strong>Metric Trade-offs</strong>: No single metric captures &ldquo;quality.&rdquo;
<ul>
<li>The <strong>Combinatorial Generator</strong> achieved 100% validity and high diversity. It struggled with distribution learning metrics (FCD), indicating it explores chemical space broadly without capturing natural distributions.</li>
<li><strong>VAEs</strong> often achieve high <strong>Similarity to Nearest Neighbor (SNN)</strong> while exhibiting low novelty. The authors suggest this pattern may indicate overfitting to training set prototypes, though they treat this as a hypothesis rather than a proven mechanism.</li>
</ul>
</li>
<li><strong>Implicit Constraints</strong>: A major finding was that neural models successfully learned implicit chemical rules (like avoiding <a href="https://en.wikipedia.org/wiki/Pan-assay_interference_compounds">PAINS</a> structures) purely from the data distribution.</li>
<li><strong>Recommendation</strong>: The authors suggest using FCD/Test for general model ranking, while emphasizing the importance of checking specific metrics (validity, diversity) to diagnose model failure modes.</li>
<li><strong>Limitations of the Benchmark</strong>: MOSES focuses on distribution learning and uses FCD as a primary ranking metric. As the authors note, FCD captures multiple aspects of other metrics in a single number but does not give insights into specific issues, so more interpretable metrics are necessary for thorough investigation. The benchmark evaluates only 1D (SMILES) and 2D molecular features, without assessing 3D conformational properties.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The benchmark uses a curated subset of the <strong>ZINC Clean Leads</strong> collection.</p>
<ul>
<li><strong>Source Size</strong>: ~4.6M molecules (4,591,276 after initial extraction).</li>
<li><strong>Final Size</strong>: 1,936,962 molecules.</li>
<li><strong>Splits</strong>: Train (1,584,664), Test (176,075), Scaffold Test (176,226).
<ul>
<li><strong>Scaffold Test Split</strong>: This split is crucial for distinct generalization testing. It contains molecules whose <a href="https://pubs.acs.org/doi/10.1021/jm9602928">Bemis-Murcko scaffolds</a> are <em>completely absent</em> from the training and test sets. Evaluating on this split strictly tests a model&rsquo;s ability to generate novel chemical structures (generalization).</li>
</ul>
</li>
<li><strong>Filters Applied</strong>:
<ul>
<li>Molecular weight: 250 to 350 Da</li>
<li>Rotatable bonds: $\leq 7$</li>
<li>XlogP: $\leq 3.5$</li>
<li>Atom types: C, N, S, O, F, Cl, Br, H</li>
<li>No charged atoms or cycles &gt; 8 atoms</li>
<li>Medicinal Chemistry Filters (MCF) and PAINS filters applied.</li>
</ul>
</li>
</ul>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>MOSES introduces a standard suite of metrics. Key definitions:</p>
<ul>
<li><strong>Validity</strong>: Fraction of valid <a href="/posts/visualizing-smiles-and-selfies-strings/">SMILES</a> strings (via <a href="https://www.rdkit.org/">RDKit</a>).</li>
<li><strong>Unique@k</strong>: Fraction of unique molecules in the first $k$ valid samples ($k = 1{,}000$ and $k = 10{,}000$).</li>
<li><strong>Filters</strong>: Fraction of generated molecules passing the MCF and PAINS filters used during dataset construction. High scores here indicate the model learned implicit chemical validity constraints from the data distribution.</li>
<li><strong>Novelty</strong>: Fraction of generated molecules not present in the training set.</li>
<li><strong>Internal Diversity (IntDiv)</strong>: Average Tanimoto distance between generated molecules ($G$), useful for detecting mode collapse:
$$ \text{IntDiv}_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2} \sum_{m_1, m_2 \in G} T(m_1, m_2)^p} $$</li>
<li><strong>Fragment Similarity (Frag)</strong>: Cosine similarity of fragment frequency vectors (BRICS decomposition) between generated and test sets.</li>
<li><strong>Scaffold Similarity (Scaff)</strong>: Cosine similarity of Bemis-Murcko scaffold frequency vectors between sets. Measures how well the model captures higher-level structural motifs.</li>
<li><strong>Similarity to Nearest Neighbor (SNN)</strong>: The average Tanimoto similarity between a generated molecule&rsquo;s fingerprint and its nearest neighbor in the reference set. This serves as a measure of precision; high SNN suggests the model produces molecules very similar to the training distribution, potentially indicating memorization if novelty is low.
$$ \text{SNN}(G, R) = \frac{1}{|G|} \sum_{m_G \in G} \max_{m_R \in R} T(m_G, m_R) $$</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Fréchet distance between the Gaussian approximations (mean and covariance) of penultimate-layer activations from ChemNet. This measures how close the distribution of generated molecules is to the real distribution in chemical/biological space. The authors note that FCD correlates with other metrics. For example, if the generated structures are not diverse enough or the model produces too many duplicates, FCD will decrease because the variance is smaller. The authors suggest using FCD for hyperparameter tuning and final model selection.
$$ \text{FCD}(G, R) = |\mu_G - \mu_R|^2 + \text{Tr}(\Sigma_G + \Sigma_R - 2(\Sigma_G \Sigma_R)^{1/2}) $$</li>
<li><strong>Properties Distribution (Wasserstein-1)</strong>: The 1D <a href="/posts/what-is-a-gan/#wasserstein-gan-wgan-a-mathematical-revolution">Wasserstein-1 distance</a> between the distributions of molecular properties (MW, LogP, SA, <a href="https://www.nature.com/articles/nchem.1243">QED</a>) in the generated and test sets.</li>
</ul>
<h3 id="models--baselines">Models &amp; Baselines</h3>
<p>The paper selects baselines to represent different theoretical approaches to distribution learning:</p>
<ol>
<li><strong>Explicit Density Models</strong>: Models where the probability mass function $P(x)$ can be computed analytically.
<ul>
<li><strong>N-gram</strong>: Simple statistical models. They failed to generate valid molecules reliably due to limited long-range dependency modeling.</li>
</ul>
</li>
<li><strong>Implicit Density Models</strong>: Models that sample from the distribution without explicitly computing $P(x)$.
<ul>
<li><strong>VAE/AAE</strong>: Optimizes a lower bound on the log-likelihood (ELBO) or uses adversarial training.</li>
<li><strong>GANs (<a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>)</strong>: Directly minimizes the distance between real and generated distributions via a discriminator.</li>
</ul>
</li>
</ol>
<p>Models are also distinguished by their data representation:</p>
<ul>
<li><strong>String-based (SMILES)</strong>: Models like <strong>CharRNN</strong>, <strong>VAE</strong>, and <strong>AAE</strong> treat molecules as SMILES strings. SMILES encodes a molecular graph by traversing a spanning tree in depth-first order, storing atom and edge tokens.</li>
<li><strong>Graph-based</strong>: <strong>JTN-VAE</strong> operates directly on molecular subgraphs (junction tree), ensuring chemical validity by construction but often requiring more complex training.</li>
</ul>
<p>Key baselines implemented in PyTorch (hyperparameters are detailed in Supplementary Information 3 of the original paper):</p>
<ul>
<li><strong>CharRNN</strong>: LSTM-based sequence model (3 layers, 768 hidden units). Trained with Adam ($lr = 10^{-3}$, batch size 64, 80 epochs, learning rate halved every 10 epochs).</li>
<li><strong>VAE</strong>: Encoder-decoder architectures (bidirectional GRU encoder, 3-layer GRU decoder with 512 hidden units) with KL regularization.</li>
<li><strong>AAE</strong>: Encoder (single layer bidirectional LSTM with 512 units) and decoder (2-layer LSTM with 512 units) initialized with adversarial formulation.</li>
<li><strong>LatentGAN</strong>: GAN (5-layer fully connected generator) trained on the latent space of a pre-trained heteroencoder.</li>
<li><strong>JTN-VAE</strong>: Tree-structured graph generation.</li>
</ul>
<h3 id="code--hardware-requirements">Code &amp; Hardware Requirements</h3>
<ul>
<li><strong>Code Repository</strong>: Available at <a href="https://github.com/molecularsets/moses">github.com/molecularsets/moses</a> as well as the PyPI library <code>molsets</code>. The platform provides standard scripts (<code>scripts/run.py</code> to evaluate models end-to-end, and <code>scripts/run_all_models.sh</code> for multi-seed evaluations).</li>
<li><strong>Hardware</strong>: The repository supports GPU acceleration via <code>nvidia-docker</code> (defaulting to 10GB shared memory). However, specific training times and exact GPU models used by the authors for the baselines are not formally documented in the source text.</li>
<li><strong>Model Weights</strong>: Pre-trained model checkpoints are not natively pre-packaged as standalone downloads; practitioners are expected to re-train the default baselines using the provided scripts.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molecularsets/moses">molecularsets/moses</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official benchmark platform with baseline models and evaluation metrics</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/molsets/">molsets (PyPI)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>pip-installable package for dataset access and metric computation</td>
      </tr>
      <tr>
          <td>ZINC Clean Leads subset</td>
          <td>Dataset</td>
          <td>See ZINC terms</td>
          <td>Curated dataset of 1,936,962 molecules distributed via the repository</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., Kadurin, A., Johansson, S., Chen, H., Nikolenko, S., Aspuru-Guzik, A., and Zhavoronkov, A. (2020). Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. <em>Frontiers in Pharmacology</em>, 11, 565644. <a href="https://doi.org/10.3389/fphar.2020.565644">https://doi.org/10.3389/fphar.2020.565644</a></p>
<p><strong>Publication</strong>: Frontiers in Pharmacology, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{polykovskiy2020moses,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular Sets (MOSES): A benchmarking platform for molecular generation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Al{\&#39;a}n and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Frontiers in Pharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{565644}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Frontiers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.3389/fphar.2020.565644}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PubMed-OCR: PMC Open Access OCR Annotations</title><link>https://hunterheidenreich.com/research/pubmed-ocr-pmc-open-access-ocr-annotations/</link><pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/pubmed-ocr-pmc-open-access-ocr-annotations/</guid><description>A large-scale dataset of 209K+ articles with OCR and layout bounding boxes, enabling layout-aware modeling and document understanding research.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>OCR-First Supervision</strong>: Unlike prior datasets for PubMed that align XML to PDFs, PubMed-OCR provides native OCR annotations (Google Cloud Vision), bypassing alignment errors and covering non-digital scanned pages.</li>
<li><strong>High-Density Annotation</strong>: At <strong>~1.3B words across 1.5M pages</strong>, PubMed-OCR is far denser per page than comparable corpora like OCR-IDL: <strong>~13x the word density</strong> (844 vs. 62.5 words/page) and <strong>~6x the line density</strong> (106 vs. 17.5 lines/page), achieved despite drawing from fewer total pages.</li>
<li><strong>Multi-Level Bounding Boxes</strong>: Includes explicit word-, line-, and paragraph-level bounding boxes to support hierarchical document understanding and layout-aware modeling. We also hope that this leads to VQA datasets with grounded answers in document layout.</li>
<li><strong>Open Access &amp; Reproducibility</strong>: Derived strictly from the redistributable PMCOA subset, releasing both the JSON annotations and original PDFs to ensure verifiable and reproducible research.</li>
</ul>
<h2 id="technical-implementation">Technical Implementation</h2>
<h3 id="corpus-construction">Corpus Construction</h3>
<p>PubMed-OCR is built from PubMed Central Open Access (PMCOA) PDFs, chosen specifically because the PMCOA license permits redistribution of both the original documents and derived annotations. Each PDF is rendered to page images, then passed to the Google Cloud Vision (GCV) API. Each page produces a structured JSON annotation file capturing the detected text along with bounding box geometry at word, line, and paragraph levels.</p>
<h3 id="json-annotation-schema">JSON Annotation Schema</h3>
<p>Each page annotation follows this compact schema. Bounding boxes are axis-aligned rectangles in <code>[x1, y1, x2, y2]</code> pixel coordinates. Words, lines, and paragraphs are stored as parallel flat lists under the <code>text</code> key:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;text&#34;</span>: {
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;words&#34;</span>: [
</span></span><span style="display:flex;"><span>      {<span style="color:#f92672">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;Example&#34;</span>, <span style="color:#f92672">&#34;box&#34;</span>: [<span style="color:#ae81ff">180</span>, <span style="color:#ae81ff">746</span>, <span style="color:#ae81ff">210</span>, <span style="color:#ae81ff">786</span>]}
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;lines&#34;</span>: [
</span></span><span style="display:flex;"><span>      {<span style="color:#f92672">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;Example sentence&#34;</span>, <span style="color:#f92672">&#34;box&#34;</span>: [<span style="color:#ae81ff">180</span>, <span style="color:#ae81ff">746</span>, <span style="color:#ae81ff">540</span>, <span style="color:#ae81ff">786</span>]}
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;paragraphs&#34;</span>: [
</span></span><span style="display:flex;"><span>      {<span style="color:#f92672">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;Example sentence\nSecond line&#34;</span>, <span style="color:#f92672">&#34;box&#34;</span>: [<span style="color:#ae81ff">180</span>, <span style="color:#ae81ff">746</span>, <span style="color:#ae81ff">540</span>, <span style="color:#ae81ff">820</span>]}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>  },
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;image&#34;</span>: <span style="color:#e6db74">&#34;...&#34;</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/pubmed-ocr-annotation-levels.webp"
         alt="Tri-panel figure showing the same scientific article page annotated at word level (red), line level (blue), and paragraph level (green)."
         title="Tri-panel figure showing the same scientific article page annotated at word level (red), line level (blue), and paragraph level (green)."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The same page annotated at three granularities: word (left), line (center), and paragraph (right). Page from Zhou et al., &ldquo;Regulation of alternative splicing by local histone modifications: potential roles for RNA-guided mechanisms,&rdquo; <em>Nucleic Acids Research</em> 42(2):701-713, 2014 (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3902899/">PMC3902899</a>, DOI:<a href="https://doi.org/10.1093/nar/gkt875">10.1093/nar/gkt875</a>). Licensed under CC BY-NC.</figcaption>
    
</figure>

<h3 id="line-reconstruction">Line Reconstruction</h3>
<p>GCV returns word-level detections natively. Line and paragraph groupings are reconstructed using spatial heuristics: words are clustered into lines by vertical overlap and horizontal proximity, and paragraph grouping follows a similar process at a coarser scale. These heuristics work well for standard single-column scientific layouts but can fail on multi-column or irregularly structured pages (see Limitations).</p>
<h2 id="using-the-dataset">Using the Dataset</h2>
<p>The corpus spans 1.5M pages, so streaming is recommended for most use cases:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> datasets <span style="color:#f92672">import</span> load_dataset
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Streaming is recommended for the full 1.5M-page corpus</span>
</span></span><span style="display:flex;"><span>ds <span style="color:#f92672">=</span> load_dataset(<span style="color:#e6db74">&#34;rootsautomation/pubmed-ocr&#34;</span>, streaming<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, split<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;train&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Inspect a page</span>
</span></span><span style="display:flex;"><span>page <span style="color:#f92672">=</span> next(iter(ds))
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Article: </span><span style="color:#e6db74">{</span>page[<span style="color:#e6db74">&#39;accession_id&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">,  Page: </span><span style="color:#e6db74">{</span>page[<span style="color:#e6db74">&#39;page&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Parse OCR annotations</span>
</span></span><span style="display:flex;"><span>ocr <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(page[<span style="color:#e6db74">&#34;ocr_json&#34;</span>])
</span></span><span style="display:flex;"><span>text <span style="color:#f92672">=</span> ocr[<span style="color:#e6db74">&#34;text&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Iterate over lines and words</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> text[<span style="color:#e6db74">&#34;lines&#34;</span>][:<span style="color:#ae81ff">5</span>]:
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  Line: </span><span style="color:#e6db74">{</span>line[<span style="color:#e6db74">&#39;text&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  BBox: </span><span style="color:#e6db74">{</span>line[<span style="color:#e6db74">&#39;box&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Access individual word detections</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> word <span style="color:#f92672">in</span> text[<span style="color:#e6db74">&#34;words&#34;</span>][:<span style="color:#ae81ff">5</span>]:
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  Word: </span><span style="color:#e6db74">{</span>word[<span style="color:#e6db74">&#39;text&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">, BBox: </span><span style="color:#e6db74">{</span>word[<span style="color:#e6db74">&#39;box&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p>Full schema documentation is available on the <a href="https://huggingface.co/datasets/rootsautomation/pubmed-ocr">HuggingFace dataset card</a>.</p>
<h2 id="why-this-matters">Why This Matters</h2>
<p>The lack of large-scale, high-quality OCR datasets with explicit geometric grounding has been a major bottleneck for training layout-aware models. By releasing PubMed-OCR, we provide the community with the dense, multi-level bounding box annotations necessary to build the next generation of document understanding systems. This dataset directly supports the development of models like <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>, enabling them to learn precise token-to-pixel alignment and robust layout reasoning.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Single OCR engine</strong>: All annotations come from Google Cloud Vision. GCV&rsquo;s error modes (handwriting, degraded scans, complex math, non-Latin scripts) propagate uncorrected into the dataset. Different OCR engines could yield different coverage patterns and error distributions.</li>
<li><strong>Heuristic line reconstruction</strong>: Spatial word-to-line clustering is approximate. Multi-column layouts, rotated text, or unusual page orientations may produce incorrect line groupings.</li>
<li><strong>PMCOA scope</strong>: Coverage is limited to the Open Access subset of PubMed Central. Commercial or subscription articles are excluded.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2026pubmedocrpmcopenaccess,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PubMed-OCR: PMC Open Access OCR Annotations}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich and Yosheb Getachew and Olivia Dinica and Ben Elliott}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2601.11425}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2601.11425}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>This dataset directly enables <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>, a family of vision-language models trained on PubMed-OCR annotations to produce grounded OCR outputs with explicit bounding boxes.</p>
<p>For related work on document processing pipelines that consume OCR output, see <a href="/research/llm-page-stream-segmentation/">LLMs for Page Stream Segmentation</a> and <a href="/research/page-stream-segmentation-llms/">Page Stream Segmentation with LLMs: Challenges and Applications</a>.</p>
]]></content:encoded></item><item><title>ChemDFM-R: Chemical Reasoning LLM with Atomized Knowledge</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-r/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-r/</guid><description>A 14B-parameter chemical reasoning LLM enhanced with atomized functional group knowledge and mix-sourced distillation strategy.</description><content:encoded><![CDATA[<h2 id="method-and-resource-contributions">Method and Resource Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper with significant <strong>Resource</strong> contributions.</p>
<ul>
<li><strong>Methodological Basis</strong>: The paper introduces a training pipeline (&ldquo;mix-sourced distillation&rdquo;) and domain-specific reinforcement learning to improve reasoning capabilities in chemical LLMs. It validates the approach through ablation studies across training stages.</li>
<li><strong>Resource Contribution</strong>: The authors constructed <strong>ChemFG</strong>, a 101 billion-token corpus annotated with &ldquo;atomized&rdquo; knowledge regarding functional groups and reaction centers.</li>
</ul>
<h2 id="bridging-the-chemical-reasoning-gap">Bridging the Chemical Reasoning Gap</h2>
<p>Current chemical LLMs struggle to reason logically for two main reasons:</p>
<ol>
<li><strong>Shallow Domain Understanding</strong>: Models generally learn molecule-level properties directly, bypassing the intermediate &ldquo;atomized&rdquo; characteristics (e.g., <a href="https://en.wikipedia.org/wiki/Functional_group">functional groups</a>) that ultimately dictate chemical behavior.</li>
<li><strong>Specialized Reasoning Logic</strong>: Chemical logic differs fundamentally from math or code. Distilling reasoning from general teacher models like DeepSeek-R1 frequently fails because the teachers lack the domain intuition required to generate valid chemical rationales.</li>
</ol>
<h2 id="atomized-knowledge-and-mixed-source-distillation">Atomized Knowledge and Mixed-Source Distillation</h2>
<p>The authors introduce three structural innovations to solve the reasoning gap:</p>
<ol>
<li><strong>Atomized Knowledge Enhancement (ChemFG)</strong>: A toolkit was built leveraging SMARTS notations to identify functional group changes during reactions. A critique of this approach is that it relies heavily on 2D cheminformatics abstractions, potentially missing deeper 3D stereochemical interactions.</li>
<li><strong>Mix-Sourced Distillation</strong>: General models (DeepSeek-R1/o3-mini) are fed &ldquo;pseudo-reasoning&rdquo; prompts that include ground truth answers and functional group data. While this forces the teacher to generate high-quality rationales for the student to learn, it introduces a layer of hindsight bias into the generated reasoning chains. During inference, the student model lacks both the pre-calculated functional group metadata and the ground truth, forcing it to bridge an artificially steep generalization gap.</li>
<li><strong>Chemical Reinforcement Learning</strong>: The intermediate model undergoes domain-specific reinforcement learning. The RL details are described in the paper&rsquo;s Appendix D, with the authors citing the open-source DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) framework. The optimization relies on rule-based rewards (format adherence and canonicalized <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> accuracy) across a variety of chemical tasks.</li>
</ol>
<h2 id="benchmark-evaluation-and-ablation-studies">Benchmark Evaluation and Ablation Studies</h2>
<p>The model was evaluated on comprehensive chemical benchmarks: <strong>SciKnowEval</strong> (19 tasks) and <strong><a href="/notes/chemistry/llm-applications/chemeval-multilevel-chemical-evaluation/">ChemEval</a></strong> (36 tasks).</p>
<ul>
<li><strong>Baselines</strong>: Compared against similarly sized open models (Qwen2.5-14B-Instruct, Qwen3-14B), domain models (<a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM</a>, MolInst), and frontier models (GPT-4o, DeepSeek-R1).</li>
<li><strong>Ablation</strong>: Evaluated across training stages (Base → ChemDFM-I → ChemDFM-R) to measure the specific impact of the instruction tuning versus the reasoning stages.</li>
<li><strong>Qualitative Analysis</strong>: The paper includes case studies demonstrating the model&rsquo;s step-by-step chemical reasoning and its potential for human-AI collaboration (Sections 4.2 and 4.3).</li>
</ul>
<h2 id="performance-outcomes-and-numerical-limitations">Performance Outcomes and Numerical Limitations</h2>
<ul>
<li><strong>Performance vs. Baselines</strong>: ChemDFM-R outperforms similarly sized open models and domain models on molecule-centric and reaction-centric tasks, and surpasses the much larger DeepSeek-R1 on ChemEval (0.78 vs. 0.58 overall). It shows competitive results relative to o4-mini, though o4-mini leads on SciKnowEval (0.74 vs. 0.70).</li>
<li><strong>Reasoning Interactivity</strong>: The model generates readable rationales that allow users to catch structural errors or identify reaction mechanisms accurately. Section 4.3 of the paper demonstrates human-AI collaboration scenarios.</li>
<li><strong>Quantitative Limitations</strong>: The model struggles with tasks involving numerical prediction and calculation (e.g., yield extraction, molecular property calculation). The paper notes that all molecule-centric and reaction-centric tasks where ChemDFM-R falls short of Qwen2.5-14B-Instruct involve numerical reasoning.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is constructed in three phases:</p>
<p><strong>1. Domain Pre-training (ChemFG)</strong>:</p>
<ul>
<li><strong>Size</strong>: 101 billion tokens</li>
<li><strong>Composition</strong>:
<ul>
<li>12M literature documents (79B tokens)</li>
<li>30M molecules from PubChem/PubChemQC</li>
<li>7M reactions from USPTO-FULL</li>
</ul>
</li>
<li><strong>Augmentation</strong>: SMILES augmentation (10x) using R-SMILES</li>
<li><strong>Atomized Features</strong>: Annotated with a custom &ldquo;Functional Group Identification Toolkit&rdquo; that identifies 241 functional group types and tracks changes in reaction centers. <em>Note: Data and toolkit are partially reproduced; while the toolkit (<a href="https://github.com/OpenDFM/ChemFG-Tool">ChemFG-Tool</a>) was open-sourced on GitHub, the 101 billion-token ChemFG dataset itself has not been publicly released.</em></li>
</ul>
<p><strong>2. Instruction Tuning</strong>:</p>
<ul>
<li><strong>Sources</strong>: Molecule-centric (<a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>), Reaction-centric (USPTO), and Knowledge-centric (Exams, Literature QA) tasks</li>
<li><strong>Mixing</strong>: Mixed with general instruction data in a 1:2 ratio</li>
</ul>
<p><strong>3. Distillation Dataset</strong>:</p>
<ul>
<li><strong>Sources</strong>:
<ul>
<li>~70% ChemDFM-R instruction data</li>
<li>~22% constructed pseudo-reasoning (functional group descriptions)</li>
<li>~8% teacher rationales (from DeepSeek-R1/o3-mini)</li>
</ul>
</li>
<li><strong>Mixing</strong>: Mixed with general data (including AM-Deepseek-R1-Distill-1.4M) in a 1:2 ratio</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Identification</strong>:</p>
<ul>
<li>Extends the <code>thermo</code> library&rsquo;s SMARTS list</li>
<li>For reactions, identifies &ldquo;reacting functional groups&rdquo; by finding reactants containing atoms involved in bond changes (reaction centers) that do not appear in the product</li>
</ul>
<p><strong>Mix-Sourced Distillation</strong>:</p>
<ul>
<li>Teacher models (DeepSeek-R1, o3-mini) are prompted with Question + Ground Truth + Functional Group Info to generate high-quality &ldquo;Thoughts&rdquo;</li>
<li>These rationales are distilled into the student model using a supervised fine-tuning loss across target tokens $y_t$:
$$ \mathcal{L}_{\text{SFT}} = - \sum_{t=1}^T \log P_\theta(y_t \mid x, y_{&lt;t}) $$</li>
</ul>
<p><strong>Reinforcement Learning</strong>:</p>
<ul>
<li><strong>Algorithm</strong>: The paper cites DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) as the RL framework; full details are in Appendix D of the paper. <em>Note: While the underlying DAPO framework is open-source, the specific chemistry-oriented RL pipeline and environment used for ChemDFM-R has not been publicly released.</em></li>
<li><strong>Hyperparameters</strong> (from paper appendix): Learning rate <code>5e-7</code>, rollout batch size <code>512</code>, training batch size <code>128</code></li>
<li><strong>Rewards</strong>: The reward system applies rule-based constraints focusing on physical form and chemical validity. The total reward $R(y, y^*)$ for a generated response $y$ given target $y^*$ combines a format adherence reward ($R_{\text{format}}$) and an accuracy reward ($R_{\text{acc}}$) evaluated on canonicalized SMILES:
$$ R(y, y^*) = R_{\text{format}}(y) + R_{\text{acc}}(\text{canonicalize}(y), \text{canonicalize}(y^*)) $$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Base Model</strong>: Qwen2.5-14B</li>
<li><strong>ChemDFM-I</strong>: Result of instruction tuning the domain-pretrained model for 2 epochs</li>
<li><strong>ChemDFM-R</strong>: Result of applying mix-sourced distillation (1 epoch) followed by RL on ChemDFM-I. <em>Note: Model weights are publicly available on <a href="https://huggingface.co/OpenDFM/ChemDFM-R-14B">Hugging Face</a>.</em></li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware and training time details are described in the paper&rsquo;s appendices, which are not available in the extracted text. The details below are reported from the paper but could not be independently cross-verified against the main text:</p>
<ul>
<li><strong>Compute</strong>: NVIDIA A800 Tensor Core GPUs</li>
<li><strong>Training Time</strong>: 30,840 GPU hours total (Domain Pretraining: 24,728 hours; Instruction Tuning: 3,785 hours; Distillation: 2,059 hours; Reinforcement Learning: 268 hours)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>SciKnowEval</strong>: 19 tasks (text-centric, molecule-centric, reaction-centric)</li>
<li><strong>ChemEval</strong>: 36 tasks, categorized similarly</li>
</ul>
<p><strong>Key Metrics</strong>: Accuracy, F1 Score, BLEU score (with PRS normalization for ChemEval)</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>SciKnowEval (all)</th>
          <th>ChemEval* (all)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen2.5-14B-Instruct</td>
          <td>0.61</td>
          <td>0.57</td>
          <td>General-domain baseline</td>
      </tr>
      <tr>
          <td>ChemDFM-I</td>
          <td>0.69</td>
          <td>0.72</td>
          <td>After domain pretraining + instruction tuning</td>
      </tr>
      <tr>
          <td>ChemDFM-R</td>
          <td><strong>0.70</strong></td>
          <td><strong>0.78</strong></td>
          <td>After distillation + RL</td>
      </tr>
      <tr>
          <td>DeepSeek-R1</td>
          <td>0.62</td>
          <td>0.58</td>
          <td>General-domain reasoning model</td>
      </tr>
      <tr>
          <td>o4-mini</td>
          <td><strong>0.74</strong></td>
          <td>0.69</td>
          <td>Frontier reasoning model</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/OpenDFM/ChemDFM-R-14B">ChemDFM-R-14B</a></td>
          <td>Model</td>
          <td>AGPL-3.0</td>
          <td>Final reasoning model weights on Hugging Face</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OpenDFM/ChemFG-Tool">ChemFG-Tool</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Functional group identification toolkit (241 groups)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components</strong>: The 101B-token ChemFG pretraining dataset is not publicly released. The chemistry-oriented RL pipeline and training code are not open-sourced. The instruction tuning and distillation datasets are not available.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhao, Z., Chen, B., Wan, Z., Chen, L., Lin, X., Yu, S., Zhang, S., Ma, D., Zhu, Z., Zhang, D., Wang, H., Dai, Z., Wen, L., Chen, X., &amp; Yu, K. (2025). ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge. <em>arXiv preprint arXiv:2507.21990</em>. <a href="https://doi.org/10.48550/arXiv.2507.21990">https://doi.org/10.48550/arXiv.2507.21990</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{zhao2025chemdfmr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zihan Zhao and Bo Chen and Ziping Wan and Lu Chen and Xuanze Lin and Shiyang Yu and Situo Zhang and Da Ma and Zichen Zhu and Danyang Zhang and Huayang Wang and Zhongyang Dai and Liyang Wen and Xin Chen and Kai Yu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2507.21990}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2507.21990}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa: Molecular Property Prediction via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</guid><description>A systematic evaluation of RoBERTa transformers pretrained on 77M PubChem SMILES for molecular property prediction tasks.</description><content:encoded><![CDATA[<h2 id="taxonomy-and-paper-contributions">Taxonomy and Paper Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$), with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>It is a methodological investigation because it systematically evaluates a specific architecture (Transformers/RoBERTa) against established State-of-the-Art (SOTA) baselines like directed Message Passing Neural Networks (D-MPNNs) to determine &ldquo;how well does this work?&rdquo; in the chemical domain. It ablates dataset size, tokenization, and input representation.</p>
<p>It is also a resource paper as it introduces &ldquo;PubChem-77M,&rdquo; a curated dataset of 77 million SMILES strings designed to facilitate large-scale self-supervised pretraining for the community.</p>
<h2 id="overcoming-data-scarcity-in-property-prediction">Overcoming Data Scarcity in Property Prediction</h2>
<p>The primary motivation is <strong>data scarcity</strong> in molecular property prediction. Graph Neural Networks (GNNs) achieve strong performance on property prediction tasks when provided with sufficient labeled data. Generating these labels requires costly and time-consuming laboratory testing, leading to severe data scarcity in specialized chemical domains.</p>
<p>Massive quantities of <strong>unlabeled chemical structure data</strong> exist in the form of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. Inspired by the success of Transformers in NLP, where self-supervised pretraining on large corpora yields strong transfer learning, the authors aim to use these unlabeled datasets to learn effective molecular representations. Additionally, Transformers benefit from a mature software ecosystem (HuggingFace) that offers efficiency advantages over GNNs.</p>
<h2 id="pretraining-scaling-laws-and-novelty">Pretraining Scaling Laws and Novelty</h2>
<p>Previous works applied Transformers to SMILES strings. This paper advances the field by systematically evaluating scaling laws and architectural components for this domain. Specifically:</p>
<ul>
<li><strong>Scaling Analysis</strong>: It explicitly tests how pretraining dataset size (100K to 10M) impacts downstream performance.</li>
<li><strong>Tokenizer Comparison</strong>: It compares standard NLP <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte-Pair Encoding (BPE)</a> against a chemically-aware &ldquo;SmilesTokenizer&rdquo;.</li>
<li><strong>Representation Comparison</strong>: It evaluates if the robust <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representation offers advantages over standard SMILES in a Transformer context.</li>
</ul>
<h2 id="experimental-setup-pretraining-and-finetuning">Experimental Setup: Pretraining and Finetuning</h2>
<p>The authors trained <strong>ChemBERTa</strong> (based on RoBERTa) using Masked Language Modeling (MLM) on subsets of the <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> dataset. The core training objective minimizes the cross-entropy loss over a corrupted input where a subset of basic tokens, denoted by $\mathcal{M}$, are masked:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $x_i$ is the exact masked token, $x_{\setminus \mathcal{M}}$ is the corrupted SMILES context string, and $\theta$ represents the network parameters.</p>
<ul>
<li><strong>Pretraining</strong>: Models were pretrained on dataset sizes of 100K, 250K, 1M, and 10M compounds.</li>
<li><strong>Baselines</strong>: Performance was compared against D-MPNN (Graph Neural Network), Random Forest (RF), and SVM using 2048-bit Morgan Fingerprints.</li>
<li><strong>Downstream Tasks</strong>: Finetuning was performed individually on small <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks: BBBP (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>), ClinTox (clinical toxicity), HIV, and Tox21 (p53 stress-response). This poses a transfer learning challenge, as the model must adapt from pretraining on 10 million molecules to classifying datasets ranging from ~1.5K to ~41K examples.</li>
<li><strong>Ablations</strong>:
<ul>
<li><strong>Tokenization</strong>: BPE vs. SmilesTokenizer on the 1M dataset, evaluated on Tox21.</li>
<li><strong>Input</strong>: SMILES vs. SELFIES strings on the Tox21 task.</li>
</ul>
</li>
</ul>
<h2 id="results-vs-graph-neural-network-baselines">Results vs. Graph Neural Network Baselines</h2>
<p>The main comparison between ChemBERTa (pretrained on 10M compounds) and Chemprop baselines on MoleculeNet tasks is summarized below (Table 1 from the paper):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>BBBP PRC</th>
          <th>ClinTox ROC</th>
          <th>ClinTox PRC</th>
          <th>HIV ROC</th>
          <th>HIV PRC</th>
          <th>Tox21 ROC</th>
          <th>Tox21 PRC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemBERTa 10M</td>
          <td>0.643</td>
          <td>0.620</td>
          <td>0.733</td>
          <td>0.975</td>
          <td>0.622</td>
          <td>0.119</td>
          <td>0.728</td>
          <td>0.207</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.708</td>
          <td>0.697</td>
          <td>0.906</td>
          <td>0.993</td>
          <td>0.752</td>
          <td>0.152</td>
          <td>0.688</td>
          <td>0.429</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>0.681</td>
          <td>0.692</td>
          <td>0.693</td>
          <td>0.968</td>
          <td>0.780</td>
          <td>0.383</td>
          <td>0.724</td>
          <td>0.335</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>0.702</td>
          <td>0.724</td>
          <td>0.833</td>
          <td>0.986</td>
          <td>0.763</td>
          <td>0.364</td>
          <td>0.708</td>
          <td>0.345</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Scaling Improvements &amp; Training Dynamics</strong>: Performance scales predictably with pretraining data size. Increasing data from 100K to 10M improved ROC-AUC by +0.110 and PRC-AUC by +0.059 on average across BBBP, ClinTox, and Tox21 (HIV was omitted due to resource constraints). Notably, researchers had to halt pretraining on the 10M subset after just 3 epochs due to overfitting, suggesting that simple 15% token masking might not provide a sufficiently difficult learning curvature for large-scale chemical representation.</li>
<li><strong>Performance Limits vs. GNNs</strong>: ChemBERTa generally performs below the D-MPNN baseline. On the Tox21 dataset, ChemBERTa-10M achieved a higher ROC-AUC (0.728) than D-MPNN (0.688); nonetheless, it recorded a substantially lower PRC-AUC (0.207 vs 0.429). This gap indicates that current Transformer iterations lack the explicit inductive biases of graph algorithms and struggle with the severe class imbalances typical of chemical datasets.</li>
<li><strong>Ablation Limitations (Tokenization &amp; SELFIES)</strong>: The authors&rsquo; ablation studies for tokenization (SmilesTokenizer narrowly beating BPE) and input representation (SELFIES performing comparably to SMILES) were evaluated exclusively on the single Tox21 task. Deriving broad architectural conclusions regarding &ldquo;semantically-aware tokenization&rdquo; or string robustness from an $N=1$ empirical evaluation is a significant limitation of the study. Broader benchmarking is required to validate these findings.</li>
<li><strong>Interpretability</strong>: Attention heads organically learn to track chemically relevant substructures (like specific functional groups and aromatic rings), mimicking the inductive biases of graph convolutions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors curated a massive dataset for pretraining and utilized standard benchmarks for evaluation.</p>
<ul>
<li><strong>Pretraining Data</strong>: <strong>PubChem-77M</strong>.
<ul>
<li>Source: 77 million unique SMILES from PubChem.</li>
<li>Preprocessing: Canonicalized and globally shuffled.</li>
<li>Subsets used: 100K, 250K, 1M, and 10M subsets.</li>
<li><em>Availability Note</em>: The authors provided a direct link to the <a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">canonicalized 10M compound subset</a> used for their largest experiments. Full reproducibility of the smaller (100K, 250K, 1M) or full 77M sets may require re-extracting from PubChem.</li>
</ul>
</li>
<li><strong>Evaluation Data</strong>: <strong>MoleculeNet</strong>.
<ul>
<li>Tasks: BBBP (2,039), ClinTox (1,478), HIV (41,127), Tox21 (7,831).</li>
<li>Splitting: 80/10/10 train/valid/test split using a <strong>scaffold splitter</strong> to ensure chemical diversity between splits.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The core training methodology mirrors standard BERT/RoBERTa procedures adapted for chemical strings.</p>
<ul>
<li><strong>Objective</strong>: Masked Language Modeling (MLM) with <strong>15% token masking</strong>.</li>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>BPE</strong>: Byte-Pair Encoder (vocab size 52K).</li>
<li><strong>SmilesTokenizer</strong>: Regex-based custom tokenizer available in DeepChem (documented <a href="https://deepchem.readthedocs.io/en/latest/tokenizers.html#smilestokenizer">here</a>).</li>
</ul>
</li>
<li><strong>Sequence Length</strong>: Maximum sequence length of <strong>512 tokens</strong>.</li>
<li><strong>Finetuning</strong>: Appended a linear classification layer; backpropagated through the base model for up to 25 epochs with early stopping on ROC-AUC.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: <strong>RoBERTa</strong> (via HuggingFace).
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 12 (72 distinct mechanisms total).</li>
<li><em>Implementation Note</em>: The original training notebooks and scripts are maintained in the authors&rsquo; <a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry repository</a>, alongside the primary downstream tasks integrated into DeepChem. A <a href="https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Transfer_Learning_With_ChemBERTa_Transformers.ipynb">full Tox21 transfer learning tutorial</a> has been incorporated into the DeepChem repository.</li>
</ul>
</li>
<li><strong>Baselines</strong> (via Chemprop library):
<ul>
<li><strong>D-MPNN</strong>: Directed Message Passing Neural Network with default hyperparameters.</li>
<li><strong>RF/SVM</strong>: Scikit-learn Random Forest and SVM using 2048-bit Morgan fingerprints (<a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured using dual metrics to account for class imbalance common in toxicity datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ROC-AUC</strong></td>
          <td>Area Under Receiver Operating Characteristic Curve</td>
      </tr>
      <tr>
          <td><strong>PRC-AUC</strong></td>
          <td>Area Under Precision-Recall Curve (vital for imbalanced data)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Single <strong>NVIDIA V100 GPU</strong>.</li>
<li><strong>Training Time</strong>: Approximately <strong>48 hours</strong> for the 10M compound subset.</li>
<li><strong>Carbon Footprint</strong>: Estimated 17.1 kg $\text{CO}_2\text{eq}$ (offset by Google Cloud).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training notebooks and finetuning scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration of ChemBERTa and SmilesTokenizer</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">ChemBERTa-zinc-base-v1</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained RoBERTa on 100K ZINC SMILES</td>
      </tr>
      <tr>
          <td><a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">PubChem-10M subset</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Canonicalized 10M compound subset used for largest experiments</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code and pre-trained models are available, and the 10M pretraining subset is downloadable. However, smaller subsets (100K, 250K, 1M) may need re-extraction from PubChem, and exact hyperparameter details for finetuning (learning rate, batch size) are not fully specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. <em>arXiv preprint arXiv:2010.09885</em>. <a href="https://doi.org/10.48550/arXiv.2010.09885">https://doi.org/10.48550/arXiv.2010.09885</a></p>
<p><strong>Publication</strong>: arXiv 2020 (Preprint)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">HuggingFace Model Hub (ChemBERTa-zinc-base-v1)</a> - <em>Additional pre-trained variations on PubChem &amp; ZINC datasets are available on the author&rsquo;s <a href="https://huggingface.co/seyonec">seyonec</a> HF profile.</em></li>
<li><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry GitHub Repository</a> - <em>Notebooks and scripts used for MLM pretraining and finetuning evaluations.</em></li>
</ul>
<h3 id="bibtex">BibTeX</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{chithranandaChemBERTaLargeScaleSelfSupervised2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa}}: {{Large-Scale Self-Supervised Pretraining}} for {{Molecular Property Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MERMaid: Multimodal Chemical Reaction Mining from PDFs</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/mermaid/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/mermaid/</guid><description>Vision-language pipeline extracting chemical reaction data from PDF figures and tables into structured knowledge graphs with 87% accuracy.</description><content:encoded><![CDATA[<h2 id="methodological-and-resource-contributions">Methodological and Resource Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper ($\Psi_{\text{Method}}$) that introduces a novel pipeline (MERMaid) for extracting structured chemical data from unstructured PDF documents. It proposes a specific architecture combining fine-tuned vision models (VisualHeist) with vision-language models (DataRaider) and a retrieval-augmented generation system (KGWizard) to solve the problem of multimodal data ingestion.</p>
<p>Secondarily, it is a <strong>Resource</strong> paper ($\Psi_{\text{Resource}}$) as it releases the source code, prompts, and a new benchmark dataset (<strong>MERMaid-100</strong>) consisting of annotated reaction data across three chemical domains.</p>
<h2 id="the-inaccessibility-of-diagrammatic-reaction-data">The Inaccessibility of Diagrammatic Reaction Data</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A significant volume of chemical knowledge currently resides in &ldquo;print-optimized&rdquo; PDF formats, specifically within graphical elements like figures, schemes, and tables, which resist standard text mining.</li>
<li><strong>Limitations of Prior Work</strong>: Existing tools (e.g., ChemDataExtractor, <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/">OpenChemIE</a>) focus primarily on text, struggle with multimodal parsing, or lack the &ldquo;contextual awareness&rdquo; needed to interpret implicit information (e.g., &ldquo;standard conditions&rdquo; with modifications in optimization tables).</li>
<li><strong>Need for Structured Data</strong>: To enable <a href="/notes/chemistry/llm-applications/autonomous-chemical-research-coscientist/">self-driving laboratories</a> and data-driven discovery, this unstructured literature must be converted into machine-actionable formats like <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a>.</li>
</ul>
<h2 id="the-mermaid-pipeline-vision-models-and-llm-rag">The MERMaid Pipeline: Vision Models and LLM RAG</h2>
<ul>
<li><strong>VisualHeist (Fine-tuned Segmentation)</strong>: A custom fine-tuned model based on Microsoft&rsquo;s Florence-2 that accurately segments figures, captions, and footnotes, even in messy supplementary materials.</li>
<li><strong>DataRaider (Context-Aware Extraction)</strong>: A VLM-powered module (using GPT-4o) with a <strong>two-step prompt framework</strong> that performs &ldquo;self-directed context completion.&rdquo; It can infer missing reaction parameters from context and resolve footnote labels (e.g., linking &ldquo;condition a&rdquo; in a table to its footnote description).</li>
<li><strong>KGWizard (Schema-Adaptive Graph Construction)</strong>: A text-to-graph engine that uses LLMs as higher-order functions to synthesize parsers dynamically. It employs <strong>Retrieval-Augmented Generation (RAG)</strong> to check for existing nodes during creation, implicitly resolving coreferences (e.g., unifying &ldquo;MeCN&rdquo; and &ldquo;Acetonitrile&rdquo;).</li>
<li><strong>Topic-Agnostic Design</strong>: MERMaid features a flexible design that works across three distinct domains: <a href="https://en.wikipedia.org/wiki/Electrosynthesis">organic electrosynthesis</a>, <a href="https://en.wikipedia.org/wiki/Photocatalysis">photocatalysis</a>, and organic synthesis.</li>
</ul>
<h2 id="benchmarking-segmentation-and-extraction-accuracy">Benchmarking Segmentation and Extraction Accuracy</h2>
<ul>
<li><strong>Segmentation Benchmarking</strong>: The authors compared VisualHeist against OpenChemIE (LayoutParser) and PDFigCapX using a dataset of 121 PDFs from 5 publishers.</li>
<li><strong>End-to-End Extraction</strong>: Evaluated the full pipeline on <strong>MERMaid-100</strong>, a curated dataset of 100 articles across three domains (organic electrosynthesis, photocatalysis, organic synthesis).
<ul>
<li>Validating extraction of specific parameters (e.g., catalysts, solvents, yields) using &ldquo;hard-match&rdquo; accuracy.</li>
</ul>
</li>
<li><strong>Knowledge Graph Construction</strong>: Automatically generated knowledge graphs for the three domains and assessed the structural integrity and <a href="https://en.wikipedia.org/wiki/Coreference">coreference resolution</a> accuracy.</li>
</ul>
<h2 id="end-to-end-extraction-performance">End-to-End Extraction Performance</h2>
<ul>
<li><strong>Segmentation Results</strong>: VisualHeist achieved &gt;93% F1 score across all document types (including pre-2000 papers and supplementary materials), outperforming OpenChemIE by 15-75% and PDFigCapX by 28-75% across all metrics.</li>
<li><strong>Extraction Accuracy</strong>: DataRaider achieved &gt;92% accuracy for VLM-based parameter extraction and near-unity accuracy for domain-specific reaction parameters (e.g., anode, cathode, photocatalyst).</li>
<li><strong>Graph Building</strong>: KGWizard achieved 96% accuracy in node creation and coreference resolution.</li>
<li><strong>Overall Performance</strong>: The pipeline demonstrated an 87% end-to-end overall accuracy.</li>
<li><strong>Limitations</strong>: The architecture relies heavily on closed-weight models (GPT-4o) for reasoning and graph construction, which risks future reproducibility if API snapshots are deprecated. Additionally, the system remains vulnerable to cumulative error propagation from upstream OCR/OCSR tools like <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">RxnScribe</a>.</li>
<li><strong>Availability</strong>: The authors provide a modular, extensible framework that can be adapted to other scientific domains.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training Data (VisualHeist)</strong>:
<ul>
<li>Dataset of <strong>3,435 figures</strong> and <strong>1,716 tables</strong> annotated from 3,518 PDF pages.</li>
<li>Includes main text, supplementary materials, and unformatted archive papers.</li>
</ul>
</li>
<li><strong>Evaluation Data (MERMaid-100)</strong>:
<ul>
<li><strong>100 PDF articles</strong> curated from three domains: organic electrosynthesis, photocatalysis, and organic synthesis.</li>
<li>Includes 104 image-caption/table-heading pairs relevant to reaction optimization.</li>
<li>Available for download at Zenodo (DOI: 10.5281/zenodo.14917752).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Two-Step Prompt Framework (DataRaider)</strong>:
<ul>
<li><em>Step 1</em>: Generic base prompt + domain keys to extract &ldquo;reaction dictionaries&rdquo; and &ldquo;footnote dictionaries&rdquo;. Uses &ldquo;fill-in-the-blank&rdquo; inference for missing details.</li>
<li><em>Step 2</em>: Safety check prompt where the VLM updates the reaction dictionary using the footnote dictionary to resolve entry-specific modifications.</li>
</ul>
</li>
<li><strong>LLM-Synthesized Parsers (KGWizard)</strong>:
<ul>
<li>Uses LLM as a function $g_{A,B}: A \times B \rightarrow (X \rightarrow Y)$ to generate Python code (parsers) dynamically based on input schema instructions.</li>
</ul>
</li>
<li><strong>RAG for Coreference</strong>:
<ul>
<li>During graph construction, the system queries the existing database for matching values (e.g., &ldquo;MeCN&rdquo;) before creating new nodes to prevent duplication.</li>
</ul>
</li>
<li><strong>Batching</strong>:
<ul>
<li>Articles processed in dynamic batch sizes (starting at 1, increasing to 30) to balance speed and redundancy checks.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>VisualHeist</strong>: Fine-tuned <strong>Florence-2-large</strong> (Microsoft vision foundation model).
<ul>
<li><em>Hyperparameters</em>: 12 epochs, learning rate $5 \times 10^{-6}$, batch size 4.</li>
</ul>
</li>
<li><strong>DataRaider &amp; KGWizard</strong>: <strong>GPT-4o</strong> (version <code>gpt-4o-2024-08-06</code>). Note: Requires an active OpenAI API key. The pipeline&rsquo;s long-term reproducibility is currently tied to the continued availability of this specific closed-source endpoint.</li>
<li><strong>RxnScribe</strong>: Used for <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">Optical Chemical Structure Recognition (OCSR)</a> to convert reactant/product images to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><em>Segmentation</em>: Precision, Recall, F1, Accuracy.</li>
<li><em>Caption Extraction</em>: Evaluated via <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a>, mapping predicted token sets $A$ and true token sets $B$ to a threshold condition: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \ge 0.70$$</li>
<li><em>Data Extraction</em>: Evaluated via Hard-Match accuracy, requiring exact correspondence between predicted sets ($\hat{Y}$) and ground-truth parameters ($Y$) for specific roles (e.g., anode vs. cathode): $$\text{HMA} = \frac{1}{|N|} \sum_{i=1}^{N} \mathbb{1}[y_i = \hat{y}_i]$$</li>
</ul>
</li>
<li><strong>Baselines</strong>: OpenChemIE (LayoutParser + EasyOCR) and PDFigCapX.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training (VisualHeist)</strong>: 2x NVLINK Nvidia RTX A6000 GPUs (48GB VRAM) + Intel Xeon w7-2495X CPU (48 cores).</li>
<li><strong>DataRaider Evaluation</strong>: 13th Gen Intel Core i7-1360P CPU (12 cores).</li>
<li><strong>Inference Costs</strong>:
<ul>
<li>DataRaider: ~$0.051 per image.</li>
<li>KGWizard: ~$0.40 per JSON.</li>
</ul>
</li>
<li><strong>Timing</strong>:
<ul>
<li>VisualHeist inference: ~4.5 seconds/image.</li>
<li>DataRaider inference: ~41.3 seconds/image.</li>
<li>KGWizard processing: ~110.6 seconds/file.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leong, S. X., Pablo-García, S., Wong, B., &amp; Aspuru-Guzik, A. (2025). MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models. <em>Matter</em>, 8(12), 102331. <a href="https://doi.org/10.1016/j.matt.2025.102331">https://doi.org/10.1016/j.matt.2025.102331</a></p>
<p><strong>Publication</strong>: Matter, 2025</p>
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/MERMaid">GitHub Repository</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (VisualHeist, DataRaider, KGWizard)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.14917752">Zenodo Data/Prompts</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>MERMaid-100 benchmark, prompts, and raw VLM responses</td>
      </tr>
  </tbody>
</table>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leong2025mermaid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leong, Shi Xuan and Pablo-Garc{\&#39;i}a, Sergio and Wong, Brandon and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Matter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{102331}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.matt.2025.102331}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGrapher: Graph-based Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</guid><description>A graph-based deep learning approach for optical chemical structure recognition that outperforms image captioning methods.</description><content:encoded><![CDATA[<h2 id="1-contribution--type">1. Contribution / Type</h2>
<p>This is primarily a <strong>Methodological</strong> paper that proposes a novel neural architecture (MolGrapher), shifting the paradigm of Optical Chemical Structure Recognition (OCSR) from image captioning back to graph reconstruction. It also has a significant <strong>Resource</strong> component, releasing a synthetic data generation pipeline and a new large-scale benchmark (USPTO-30K) to address the scarcity of annotated real-world data.</p>
<h2 id="2-motivation">2. Motivation</h2>
<p>The automatic analysis of chemical literature is critical for accelerating drug and material discovery, but much of this information is locked in 2D images of molecular structures.</p>
<ul>
<li><strong>Problem:</strong> Existing rule-based methods are rigid, while recent deep learning methods based on &ldquo;image captioning&rdquo; (predicting <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings) struggle with complex molecules and fail to exploit the natural graph structure of molecules.</li>
<li><strong>Gap:</strong> There is a lack of diverse, annotated real-world training data, and captioning models suffer from &ldquo;hallucinations&rdquo; where they predict valid SMILES that do not match the image.</li>
</ul>
<h2 id="3-novelty--core-innovation">3. Novelty / Core Innovation</h2>
<p>MolGrapher introduces a <strong>graph-based deep learning pipeline</strong> that explicitly models the molecule&rsquo;s geometry and topology.</p>
<ul>
<li><strong>Supergraph Concept:</strong> It first detects all atom keypoints and builds a &ldquo;supergraph&rdquo; of all plausible bonds.</li>
<li><strong>Hybrid Approach:</strong> It combines a ResNet-based keypoint detector with a Graph Neural Network (GNN) that classifies both atom nodes and bond nodes within the supergraph context. Both atoms and bonds are represented as nodes, with edges only connecting atom nodes to bond nodes.</li>
<li><strong>Synthetic Pipeline:</strong> A data generation pipeline that renders molecules with varying styles (fonts, bond widths) and augmentations (pepper patches, random lines, captions) to simulate real document noise.</li>
</ul>
<p>At the core of the Keypoint Detector&rsquo;s performance is the <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss. Since pixels without an atom drastically outnumber pixels containing an atom, WAHR loss is designed to counter the class imbalance. For ground-truth heatmap $y$ and prediction $p$:</p>
<p>$$ L_{WAHR}(p, y) = \sum_i \alpha_y (p_i - y_i)^2 $$</p>
<p>where $\alpha_y$ dynamically down-weights easily classified background pixels.</p>
<h2 id="4-methodology--experiments">4. Methodology &amp; Experiments</h2>
<p>The authors evaluated MolGrapher against both rule-based (OSRA, MolVec) and deep learning baselines (DECIMER, Img2Mol, Image2Graph).</p>
<ul>
<li><strong>Benchmarks:</strong> Evaluated on standard datasets: USPTO, Maybridge UoB, CLEF-2012, and JPO.</li>
<li><strong>New Benchmark:</strong> Introduced and tested on <strong>USPTO-30K</strong>, split into clean, abbreviated, and large molecule subsets.</li>
<li><strong>Ablations:</strong> Analyzed the impact of synthetic augmentations, keypoint loss functions, supergraph connectivity radius, and GNN layers.</li>
<li><strong>Robustness:</strong> Tested on perturbed images (rotations, shearing) to mimic scanned patent quality.</li>
</ul>
<p>The GNN iteratively updates node embeddings through layers ${g^k}_{k \in [1, N]}$, where $e^{k+1} = g^k(e^k)$. Final predictions are obtained via two MLPs (one for atoms, one for bonds): $p_i = MLP_t(e_i^N)$, where $p_i \in \mathbb{R}^{C_t}$ contains the logits for atom or bond classes.</p>
<h2 id="5-results--conclusions">5. Results &amp; Conclusions</h2>
<p>MolGrapher achieved the highest accuracy among synthetic-only deep learning methods on most benchmarks tested.</p>
<ul>
<li><strong>Accuracy:</strong> It achieved <strong>91.5%</strong> accuracy on USPTO, outperforming all other synthetic-only deep learning methods including ChemGrapher (80.9%), Graph Generation (67.0%), and DECIMER 2.0 (61.0%).</li>
<li><strong>Large Molecules:</strong> It demonstrated superior scaling, correctly recognizing large molecules (USPTO-10K-L) where image captioning methods like Img2Mol failed completely (0.0% accuracy).</li>
<li><strong>Generalization:</strong> The method proved robust to image perturbations and style variations without requiring fine-tuning on real data. The paper acknowledges that MolGrapher cannot recognize Markush structures (depictions of sets of molecules with positional and frequency variation indicators).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model relies on synthetic data for training due to the scarcity of annotated real-world images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic Data</td>
          <td>300,000 images</td>
          <td>Generated from PubChem SMILES using RDKit. Augmentations include pepper patches, random lines, and variable bond styles.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>USPTO-30K</td>
          <td>30,000 images</td>
          <td>Created by authors from USPTO patents (2001-2020). Subsets: 10K clean, 10K abbreviated, 10K large (&gt;70 atoms).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Standard Benchmarks</td>
          <td>Various</td>
          <td>USPTO (5,719), Maybridge UoB (5,740), CLEF-2012 (992), JPO (450).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of three distinct algorithmic stages:</p>
<ol>
<li>
<p><strong>Keypoint Detection</strong>:</p>
<ul>
<li>Predicts a heatmap of atom locations using a CNN.</li>
<li>Thresholds heatmaps at the bottom 10th percentile and uses a $5\times5$ window for local maxima.</li>
<li>Uses <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss to handle class imbalance (background vs. atoms).</li>
</ul>
</li>
<li>
<p><strong>Supergraph Construction</strong>:</p>
<ul>
<li>Connects every detected keypoint to neighbors within a radius of $3 \times$ the estimated bond length.</li>
<li>Prunes edges with no filled pixels or if obstructed by a third keypoint.</li>
<li>Keeps a maximum of 6 bond candidates per atom.</li>
</ul>
</li>
<li>
<p><strong>Superatom Recognition</strong>:</p>
<ul>
<li>Detects &ldquo;superatom&rdquo; nodes (abbreviations like <code>COOH</code>).</li>
<li>Uses <strong>PP-OCR</strong> to transcribe the text at these node locations.</li>
</ul>
</li>
</ol>
<h3 id="models">Models</h3>
<p>The architecture utilizes standard backbones tailored for specific sub-tasks:</p>
<ul>
<li><strong>Keypoint Detector</strong>: <strong>ResNet-18</strong> backbone with $8\times$ dilation to preserve spatial resolution.</li>
<li><strong>Node Classifier</strong>: <strong>ResNet-50</strong> backbone with $2\times$ dilation for extracting visual features at node locations.</li>
<li><strong>Graph Neural Network</strong>: A custom GNN that updates node embeddings based on visual features and neighborhood context. The initial node embedding combines the visual feature vector $v_i$ and a learnable type encoding $w_{t_i}$.</li>
<li><strong>Readout</strong>: MLPs classify nodes into atom types (e.g., C, O, N) and bond types (No Bond, Single, Double, Triple).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Accuracy is defined strictly: the predicted molecule must have an identical <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> string to the ground truth. Stereochemistry and Markush structures are excluded from evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>MolGrapher Score</th>
          <th>Best DL Baseline (Synthetic)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>USPTO</td>
          <td><strong>91.5%</strong></td>
          <td>80.9% (ChemGrapher)</td>
          <td>Full USPTO benchmark</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>USPTO-10K-L</td>
          <td><strong>31.4%</strong></td>
          <td>0.0% (Img2Mol)</td>
          <td>Large molecules (&gt;70 atoms)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>JPO</td>
          <td><strong>67.5%</strong></td>
          <td>64.0% (DECIMER 2.0)</td>
          <td>Challenging, low-quality images</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: Trained on 3 NVIDIA A100 GPUs.</li>
<li><strong>Training Time</strong>: 20 epochs.</li>
<li><strong>Optimization</strong>: ADAM optimizer, learning rate 0.0001, decayed by 0.8 after 5000 iterations.</li>
<li><strong>Loss Weighting</strong>: Atom classifier loss weighted by 1; bond classifier loss weighted by 3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MolGrapher">DS4SD/MolGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training and inference scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Title</strong>: MolGrapher: Graph-based Visual Recognition of Chemical Structures</p>
<p><strong>Authors</strong>: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valéry Weber, Ingmar Meijer, Peter Staar, Fisher Yu</p>
<p><strong>Citation</strong>: Morin, L., Danelljan, M., Agea, M. I., Nassar, A., Weber, V., Meijer, I., Staar, P., &amp; Yu, F. (2023). MolGrapher: Graph-based Visual Recognition of Chemical Structures. <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, 19552-19561.</p>
<p><strong>Publication</strong>: ICCV 2023</p>
<p><strong>Links</strong>:</p>
<ul>
<li><a href="https://openaccess.thecvf.com/content/ICCV2023/html/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.html">Paper</a></li>
<li><a href="https://github.com/DS4SD/MolGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMolGrapherGraphbasedVisual2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolGrapher}}: {{Graph-based Visual Recognition}} of {{Chemical Structures}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolGrapher}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{IEEE}}/{{CVF International Conference}} on {{Computer Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valéry and Meijer, Ingmar and Staar, Peter and Yu, Fisher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{19552--19561}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICCV51070.2023.01791}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-18}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER.ai: Optical Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/</guid><description>Open-source OCSR platform combining Mask R-CNN segmentation and Transformer recognition, trained on 450M+ synthetic images from RanDepict.</description><content:encoded><![CDATA[<h2 id="project-scope-and-contribution-type">Project Scope and Contribution Type</h2>
<p>This is primarily a <strong>Resource</strong> paper (Infrastructure Basis) with a significant <strong>Method</strong> component.</p>
<p>The primary contribution is DECIMER.ai, a fully open-source platform (web app and Python packages) for the entire chemical structure mining pipeline, filling a gap where most tools were proprietary or fragmented. It also contributes the RanDepict toolkit for massive synthetic data generation.</p>
<p>The secondary methodological contribution proposes and validates a specific deep learning architecture (EfficientNet-V2 encoder + Transformer decoder) that treats chemical structure recognition as an image-to-text translation task (SMILES generation).</p>
<h2 id="the-scarcity-of-machine-readable-chemical-data">The Scarcity of Machine-Readable Chemical Data</h2>
<p><strong>Data Scarcity</strong>: While the number of chemical publications is increasing, most chemical information is locked in non-machine-readable formats (images in PDFs) and is not available in public databases.</p>
<p><strong>Limitations of Existing Tools</strong>: Prior OCSR (Optical Chemical Structure Recognition) tools were largely rule-based (fragile to noise) or proprietary.</p>
<p><strong>Lack of Integration</strong>: There was no existing open-source system that combined segmentation (finding the molecule on a page), classification (confirming it is a molecule), and recognition (translating it to SMILES) into a single workflow.</p>
<h2 id="decimer-architecture-and-novel-image-to-smiles-approach">DECIMER Architecture and Novel Image-to-SMILES Approach</h2>
<p><strong>Comprehensive Workflow</strong>: It is the first open-source platform to integrate segmentation (Mask R-CNN), classification (EfficientNet), and recognition (Transformer) into a unified pipeline.</p>
<p><strong>Data-Driven Approach</strong>: Unlike tools like MolScribe which use intermediate graph representations and rules, DECIMER uses a purely data-driven &ldquo;image-to-SMILES&rdquo; translation approach without hard-coded chemical rules. The core recognition model operates as a sequence-to-sequence generator, mathematically formalizing the task as maximizing the conditional probability of a SMILES sequence given an image.</p>
<p><strong>Massive Synthetic Training</strong>: The use of RanDepict to generate over 450 million synthetic images, covering diverse depiction styles and augmentations (including Markush structures), to train the model from scratch.</p>
<h2 id="benchmarking-and-evaluation-methodology">Benchmarking and Evaluation Methodology</h2>
<p><strong>Benchmarking</strong>: The system was tested against openly available tools (OSRA, MolVec, Imago, Img2Mol, SwinOCSR, MolScribe) on standard datasets: USPTO, UOB, CLEF, JPO, and a custom &ldquo;Hand-drawn&rdquo; dataset.</p>
<p><strong>Robustness Testing</strong>: Performance was evaluated on both clean images and images with added distortions (rotation, shearing) to test the fragility of rule-based systems vs. DECIMER.</p>
<p><strong>Markush Structure Analysis</strong>: Specific evaluation of the model&rsquo;s ability to interpret Markush structures (generic structures with R-groups).</p>
<p><strong>Comparison of Approaches</strong>: A direct comparison with MolScribe by training DECIMER on MolScribe&rsquo;s smaller training set to isolate the impact of architecture vs. data volume.</p>
<h2 id="performance-outcomes-and-key-findings">Performance Outcomes and Key Findings</h2>
<p><strong>Comparative Performance</strong>: DECIMER Image Transformer consistently produced average Tanimoto similarities above 0.95 on in-domain test data and achieved competitive or leading results across external benchmarks, with extremely low rates of catastrophic failure. Tanimoto similarity is calculated based on molecular fingerprints $A$ and $B$ as:
$$ T(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</p>
<p><strong>Data Volume Necessity</strong>: When trained on small datasets, MolScribe (graph/rule-based) outperformed DECIMER. DECIMER&rsquo;s performance advantage relies heavily on its massive training scale (&gt;400M images).</p>
<p><strong>Robustness</strong>: The model showed no performance degradation on distorted images, unlike rule-based legacy tools.</p>
<p><strong>Generalization</strong>: Despite having no hand-drawn images in the training set, the base model recognized 27% of hand-drawn structures perfectly (average Tanimoto 0.69), outperforming all alternative open tools. After fine-tuning with synthetic hand-drawn-like images from RanDepict, perfect predictions increased to 60% (average Tanimoto 0.89).</p>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/OBrink/DECIMER.ai">DECIMER.ai Web App</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Laravel-based web application for the full pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Core OCSR Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image-Segmentation">DECIMER Image Segmentation</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Mask R-CNN segmentation for chemical structures in documents</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Iagea/DECIMER-Image-Classifier">DECIMER Image Classifier</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>EfficientNet-based chemical structure image classifier</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic training data generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The models were trained on synthetic data generated from PubChem molecules.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Generation/Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_1</code></td>
          <td>~108M mols</td>
          <td>PubChem molecules (mass &lt; 1500 Da), processed with RanDepict (v1.0.5). Included image augmentations.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_2</code></td>
          <td>~126M mols</td>
          <td>Included Markush structures generated by pseudo-randomly replacing atoms with R-groups. Image size 299x299.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_3</code></td>
          <td>&gt;453M images</td>
          <td>Re-depicted <code>pubchem_2</code> molecules at <strong>512x512</strong> resolution. Used RanDepict v1.0.8.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>In-domain</td>
          <td>250,000</td>
          <td>Held-out set generated similarly to training data.</td>
      </tr>
      <tr>
          <td><strong>Benchmark</strong></td>
          <td>External</td>
          <td>Various</td>
          <td>USPTO (5719), UOB (5740), CLEF (992), JPO (450), Indigo (50k), Hand-drawn (5088).</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: RanDepict (uses CDK, RDKit, Indigo, PIKAChU)</li>
<li><strong>Augmentations</strong>: Rotation, shearing, noise, pixelation, curved arrows, text labels</li>
<li><strong>Format</strong>: Data saved as TFRecord files for TPU training</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting (atoms, brackets, bonds). Added <code>&lt;start&gt;</code>, <code>&lt;end&gt;</code>, and padded with <code>&lt;pad&gt;</code>. <code>&lt;unk&gt;</code> used for unknown tokens.</li>
<li><strong>Markush Token Handling</strong>: To avoid ambiguity, digits following &lsquo;R&rsquo; (e.g., R1) were replaced with unique non-digit characters during training to distinguish them from ring-closure numbers.</li>
<li><strong>Image Augmentation Pipeline</strong>: Custom RanDepict features (v1.1.4) were used to simulate &ldquo;hand-drawn-like&rdquo; styles based on ChemPIX&rsquo;s implementation.</li>
</ul>
<h3 id="models">Models</h3>
<p>The platform consists of three distinct models:</p>
<ol>
<li>
<p><strong>DECIMER Segmentation</strong>:</p>
<ul>
<li><strong>Architecture</strong>: Mask R-CNN (TensorFlow 2.10.0 implementation)</li>
<li><strong>Purpose</strong>: Detects and cuts chemical structures from full PDF pages</li>
</ul>
</li>
<li>
<p><strong>DECIMER Image Classifier</strong>:</p>
<ul>
<li><strong>Architecture</strong>: EfficientNet-V1-B0</li>
<li><strong>Input</strong>: 224x224 pixels</li>
<li><strong>Training</strong>: Fine-tuned on ~10.9M images (balanced chemical/non-chemical)</li>
<li><strong>Performance</strong>: AUC 0.99 on in-domain test set</li>
</ul>
</li>
<li>
<p><strong>DECIMER Image Transformer (OCSR Engine)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-V2-M (CNN). Input size <strong>512x512</strong>. 52M parameters</li>
<li><strong>Decoder</strong>: Transformer. 4 encoder blocks, 4 decoder blocks, 8 attention heads. d_model=512, d_ff=2048. 59M parameters</li>
<li><strong>Total Params</strong>: ~111 Million</li>
</ul>
</li>
</ol>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Tanimoto Similarity (calculated on PubChem fingerprints of the predicted vs. ground truth SMILES)</li>
<li><strong>Secondary Metrics</strong>: Exact Match (Identity), BLEU score (for string similarity, esp. Markush)</li>
<li><strong>Failure Analysis</strong>: &ldquo;Catastrophic failure&rdquo; defined as Tanimoto similarity of 0 or invalid SMILES</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on Google Cloud TPUs due to the massive dataset size.</p>
<ul>
<li><strong><code>pubchem_1</code>/<code>pubchem_2</code></strong>: Trained on TPU v3-32 pod slice</li>
<li><strong><code>pubchem_3</code> (Final Model)</strong>: Trained on <strong>TPU v3-256</strong> pod slice</li>
<li><strong>Training Time</strong>:
<ul>
<li>Data generation (512x512): ~2 weeks on cluster (20 threads, 36 cores)</li>
<li>Model Training (EffNet-V2-M): <strong>1 day and 7 hours per epoch</strong> on TPU v3-256</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H. O., Agea, M. I., Zielesny, A., &amp; Steinbeck, C. (2023). DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. <em>Nature Communications</em>, 14(1), 5045. <a href="https://doi.org/10.1038/s41467-023-40782-0">https://doi.org/10.1038/s41467-023-40782-0</a></p>
<p><strong>Publication</strong>: Nature Communications 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://decimer.ai">Web Application</a></li>
<li><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer GitHub</a></li>
<li><a href="https://github.com/OBrink/RanDepict">RanDepict GitHub</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMERaiOpenPlatform2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Agea, M. Isabel and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5045}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41467-023-40782-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReco: Hand-Drawn Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</guid><description>A deep learning method using EfficientNet and Transformer to convert hand-drawn chemical structures into SMILES codes, achieving 96.9% accuracy.</description><content:encoded><![CDATA[<h2 id="research-contribution--classification">Research Contribution &amp; Classification</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: The primary contribution is &ldquo;ChemReco,&rdquo; a specific deep learning pipeline (EfficientNet + Transformer) designed to solve the Optical Chemical Structure Recognition (OCSR) task for hand-drawn images. The authors conduct extensive ablation studies on architecture and data mixing ratios to validate performance.</li>
<li><strong>Resource</strong>: The authors explicitly state that &ldquo;the primary focus of this paper is constructing datasets&rdquo; due to the scarcity of hand-drawn molecular data. They introduce a comprehensive synthetic data generation pipeline involving RDKit modifications and image degradation to create training data.</li>
</ul>
<h2 id="motivation-digitizing-hand-drawn-chemical-sketches">Motivation: Digitizing Hand-Drawn Chemical Sketches</h2>
<p>Hand-drawing is the most intuitive method for chemists and students to record molecular structures. However, digitizing these drawings into machine-readable formats (like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) usually requires time-consuming manual entry or specialized software.</p>
<ul>
<li><strong>Gap</strong>: Existing OCSR tools and rule-based methods often fail on hand-drawn sketches due to diverse writing styles, poor image quality, and the absence of labeled data.</li>
<li><strong>Application</strong>: Automated recognition enables efficient chemical research and allows for automatic grading in educational settings.</li>
</ul>
<h2 id="core-innovation-synthetic-pipeline-and-hybrid-architecture">Core Innovation: Synthetic Pipeline and Hybrid Architecture</h2>
<p>The paper introduces <strong>ChemReco</strong>, an end-to-end system for recognizing C-H-O structures. Key novelties include:</p>
<ol>
<li><strong>Synthetic Data Pipeline</strong>: A multi-stage generation method that modifies RDKit source code to randomize bond/angle parameters, followed by OpenCV-based augmentation, degradation, and background addition to simulate realistic hand-drawn artifacts.</li>
<li><strong>Architectural Choice</strong>: The specific application of <strong>EfficientNet</strong> (encoder) combined with a <strong>Transformer</strong> (decoder) for this domain, which the authors demonstrate outperforms the more common ResNet+LSTM baselines.</li>
<li><strong>Hybrid Training Strategy</strong>: Finding that a mix of 90% synthetic and 10% real data yields optimal performance, superior to using either dataset alone.</li>
</ol>
<h2 id="methodology--ablation-studies">Methodology &amp; Ablation Studies</h2>
<p>The authors performed a series of ablation studies and comparisons:</p>
<ul>
<li><strong>Synthesis Ablation</strong>: Evaluated the impact of each step in the generation pipeline (RDKit only $\rightarrow$ Augmentation $\rightarrow$ Degradation $\rightarrow$ Background) on validation loss and accuracy.</li>
<li><strong>Dataset Size Ablation</strong>: Tested model performance when trained on synthetic datasets ranging from 100,000 to 1,000,000 images.</li>
<li><strong>Real/Synthetic Ratio</strong>: Investigated the optimal mixing ratio of synthetic to real hand-drawn images (100:0, 90:10, 50:50, 10:90, 0:100), finding that the 90:10 ratio achieved 93.81% exact match, compared to 63.33% for synthetic-only and 65.83% for real-only.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked four encoder-decoder combinations: ResNet vs. EfficientNet encoders paired with LSTM vs. Transformer decoders.</li>
<li><strong>Baseline Comparison</strong>: Compared results against a related study utilizing a CNN+LSTM framework.</li>
</ul>
<h2 id="results--interpretations">Results &amp; Interpretations</h2>
<ul>
<li><strong>Best Performance</strong>: The EfficientNet + Transformer model trained on a 90:10 synthetic-to-real ratio achieved a <strong>96.90% Exact Match</strong> rate on the test set.</li>
<li><strong>Background Robustness</strong>: When training on synthetic data alone (no real images), the best accuracy on background-free test images was approximately 46% (using RDKit-aug-deg), while background test images reached approximately 53% (using RDKit-aug-bkg-deg). Adding random backgrounds during training helped prevent the model from overfitting to clean white backgrounds.</li>
<li><strong>Data Volume</strong>: Increasing the synthetic dataset size from 100k to 1M consistently improved accuracy (average exact match: 49.40% at 100k, 54.29% at 200k, 61.31% at 500k, 63.33% at 1M, all without real images in training).</li>
<li><strong>Encoder-Decoder Comparison</strong> (at 90:10 mix with 1M images):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Encoder</th>
          <th style="text-align: left">Decoder</th>
          <th style="text-align: left">Avg. Exact Match (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">93.81</td>
      </tr>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left">94.76</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">96.31</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left"><strong>96.90</strong></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Superiority over Baselines</strong>: The model outperformed the cited CNN+LSTM baseline from ChemPix (93% vs 76% on the ChemPix test set).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Restricted atom types</strong>: The system only handles molecules composed of carbon, hydrogen, and oxygen (C-H-O), excluding nitrogen, sulfur, halogens, and other heteroatoms commonly found in organic chemistry.</li>
<li><strong>Structural complexity</strong>: Only structures with at most one ring are supported. Complex multi-ring systems and fused ring structures are not covered.</li>
<li><strong>Dataset availability</strong>: The real hand-drawn dataset (2,598 images) is not publicly released and is only available upon request from the corresponding author.</li>
<li><strong>Future directions</strong>: The authors suggest expanding to more heteroatoms, complex ring structures, and applications in automated grading of chemistry exams.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/a-die/hdr-DeepLearning">hdr-DeepLearning</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation in PyTorch</td>
      </tr>
      <tr>
          <td style="text-align: left">Paper</td>
          <td style="text-align: left">Publication</td>
          <td style="text-align: left">CC-BY-4.0</td>
          <td style="text-align: left">Open access via Nature</td>
      </tr>
  </tbody>
</table>
<p>The real hand-drawn dataset (2,598 images) is available upon request from the corresponding author, not publicly downloadable. The synthetic data generation pipeline is described in detail but relies on modified RDKit source code, which is included in the repository.</p>
<h3 id="data">Data</h3>
<p>The study utilizes a combination of collected SMILES data, real hand-drawn images, and generated synthetic images.</p>
<ul>
<li><strong>Source Data</strong>: SMILES codes collected from PubChem, ZINC, <a href="/notes/chemistry/datasets/gdb-11/">GDB-11</a>, and <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>. Filtered for C, H, O atoms and max 1 ring.</li>
<li><strong>Real Dataset</strong>: 670 selected SMILES codes drawn by multiple volunteers, totaling <strong>2,598 images</strong>.</li>
<li><strong>Synthetic Dataset</strong>: Generated up to <strong>1,000,000 images</strong> using the pipeline below.</li>
<li><strong>Training Mix</strong>: The optimal training set used 1 million images with a <strong>90:10 ratio</strong> of synthetic to real images.</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset Type</th>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Real</strong></td>
          <td style="text-align: left">Volunteer Drawings</td>
          <td style="text-align: left">2,598 images</td>
          <td style="text-align: left">Used for mixed training and testing</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Synthetic</strong></td>
          <td style="text-align: left">Generated</td>
          <td style="text-align: left">100k - 1M</td>
          <td style="text-align: left">Generated via modified RDKit + OpenCV augmentation/degradation; optionally enhanced with Stable Diffusion</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>Synthetic Image Generation Pipeline</strong> is critical for reproduction:</p>
<ol>
<li><strong>RDKit Modification</strong>: Modify source code to introduce random keys, character width, length, and bond angles.</li>
<li><strong>Augmentation (OpenCV)</strong>: Apply sequence: Resize ($p=0.5$), Blur ($p=0.4$), Erode/Dilate ($p=0.2$), Distort ($p=0.8$), Flip ($p=0.5$), Affine ($p=0.7$).</li>
<li><strong>Degradation</strong>: Apply sequence: Salt+pepper noise ($p=0.1$), Contrast ($p=0.7$), Sharpness ($p=0.5$), Invert ($p=0.3$).</li>
<li><strong>Background Addition</strong>: Random backgrounds are augmented (Crop, Distort, Flip) and added to the molecular image to prevent background overfitting.</li>
<li><strong>Diffusion Enhancement</strong>: Stable Diffusion (v1-4) is used for image-to-image enhancement to better simulate hand-drawn styles (prompt: &ldquo;A pencil sketch of [Formula]&hellip; without charge distribution&rdquo;).</li>
</ol>
<h3 id="models">Models</h3>
<p>The system uses an encoder-decoder architecture:</p>
<ul>
<li><strong>Encoder</strong>: <strong>EfficientNet</strong> (pre-trained on ImageNet). The last layer is removed, and features are extracted into a Numpy array.</li>
<li><strong>Decoder</strong>: <strong>Transformer</strong>. Utilizes self-attention to generate the SMILES sequence. Chosen over LSTM for better handling of long-range dependencies.</li>
<li><strong>Output</strong>: Canonical SMILES string.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: <strong>Exact Match (EM)</strong>. A strict binary evaluation checking whether the complete generated SMILES perfectly replicates the target string.</li>
<li><strong>Other Metrics</strong>: <strong>Levenshtein Distance</strong> measures edit-level character proximity, while the <strong>Tanimoto coefficient</strong> evaluates structural similarity based on chemical fingerprints. Both were monitored during validation ablation runs.</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Baseline (CNN+LSTM)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Exact Match</strong></td>
          <td style="text-align: left"><strong>96.90%</strong></td>
          <td style="text-align: left">76%</td>
          <td style="text-align: left">Tested on the provided test set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>CPU</strong>: Intel(R) Xeon(R) Gold 6130 (40 GB RAM).</li>
<li><strong>GPU</strong>: NVIDIA Tesla V100 (32 GB video memory).</li>
<li><strong>Framework</strong>: PyTorch 1.9.1.</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Optimizer: Adam (learning rate 1e-4).</li>
<li>Batch size: 32.</li>
<li>Epochs: 100.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, H., Liu, W., Tao, J., et al. (2024). ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning. <em>Scientific Reports</em>, 14, 17126. <a href="https://doi.org/10.1038/s41598-024-67496-7">https://doi.org/10.1038/s41598-024-67496-7</a></p>
<p><strong>Publication</strong>: Scientific Reports 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/a-die/hdr-DeepLearning">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ouyangChemRecoAutomatedRecognition2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemReco: Automated Recognition of Hand-Drawn Carbon--Hydrogen--Oxygen Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Hengjie and Liu, Wei and Tao, Jiajun and Luo, Yanghong and Zhang, Wanjia and Zhou, Jiayu and Geng, Shuqi and Zhang, Chengpeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{17126}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-024-67496-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AtomLenz: Atom-Level OCSR with Limited Supervision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</guid><description>Weakly supervised OCSR framework combining object detection and graph construction to recognize chemical structures from hand-drawn images using SMILES.</description><content:encoded><![CDATA[<h2 id="dual-contribution-method-and-data-resource">Dual Contribution: Method and Data Resource</h2>
<p>The paper proposes an architecture (AtomLenz) and training framework (ProbKT* + Edit-Correction) to solve the problem of Optical Chemical Structure Recognition (OCSR) in data-sparse domains. It also releases a curated, relabeled dataset of hand-drawn molecules with atom-level bounding box annotations.</p>
<h2 id="overcoming-annotation-bottlenecks-in-ocsr">Overcoming Annotation Bottlenecks in OCSR</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for digitizing chemical literature and lab notes. However, existing methods face three main limitations:</p>
<ol>
<li><strong>Generalization Limits:</strong> They struggle with sparse or stylistically unique domains, such as hand-drawn images, where massive datasets for pretraining are unavailable.</li>
<li><strong>Annotation Cost:</strong> &ldquo;Atom-level&rdquo; methods (which detect individual atoms and bonds) require expensive bounding box annotations, which are rarely available for real-world sketch data.</li>
<li><strong>Lack of Interpretability/Localization:</strong> Pure &ldquo;Image-to-SMILES&rdquo; models (like DECIMER) work well but fail to localize the atoms or bonds in the original image, limiting human-in-the-loop review and mechanistic interpretability.</li>
</ol>
<h2 id="atomlenz-probkt-and-graph-edit-correction">AtomLenz, ProbKT*, and Graph Edit-Correction</h2>
<p>The core contribution is <strong>AtomLenz</strong>, an OCSR framework that achieves atom-level entity detection using <strong>only SMILES supervision</strong> on target domains. The authors construct an explicit object detection pipeline using Faster R-CNN trained via a composite multi-task loss. The objective aims to optimize a multi-class log loss $L_{cls}$ for predicted class $\hat{c}$ and a regression loss $L_{reg}$ for predicted bounding box coordinates $\hat{b}$:</p>
<p>$$ \mathcal{L} = L_{cls}(c, \hat{c}) + L_{reg}(b, \hat{b}) $$</p>
<p>To bridge the gap between image inputs and the weakly supervised SMILES labels, the system leverages:</p>
<ul>
<li><em><em>ProbKT</em> (Probabilistic Knowledge Transfer):</em>* Uses probabilistic logic and Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; derived from the SMILES strings, enabling backpropagation without explicit bounding boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Generates pseudo-labels by solving an optimization problem that finds the smallest edit on the predicted graph such that the corrected graph and the ground-truth SMILES graph become isomorphic, which forces fine-tuning on less frequent atom types. The combination of ProbKT* and Edit-Correction is abbreviated as <strong>EditKT</strong>*.</li>
<li><strong>ChemExpert:</strong> A chemically sound ensemble strategy that cascades predictions from multiple models (e.g., passing through DECIMER, then AtomLenz), halting at the first output that clears basic RDKit chemical validity checks.</li>
</ul>
<h2 id="data-efficiency-and-domain-adaptation-experiments">Data Efficiency and Domain Adaptation Experiments</h2>
<p>The authors evaluated the model specifically on domain adaptation and sample efficiency, treating hand-drawn molecules as the primary low-data target distribution:</p>
<ul>
<li><strong>Pretraining:</strong> Initially trained on ~214k synthetic images from ChEMBL explicitly labeled with bounding boxes (generated via RDKit).</li>
<li><strong>Target Domain Adaptation:</strong> Fine-tuned on the Brinkhaus hand-drawn dataset (4,070 images) using purely SMILES supervision.</li>
<li><strong>Evaluation Sets:</strong>
<ul>
<li><strong>Hand-drawn test set</strong>: 1,018 images.</li>
<li><strong>ChemPix</strong>: 614 out-of-domain hand-drawn images.</li>
<li><strong>Atom Localization set</strong>: 1,000 synthetic images to evaluate precise bounding box capabilities.</li>
</ul>
</li>
<li><strong>Baselines:</strong> Compared against leading OCSR methods, including DECIMER (v2.2.0), Img2Mol, MolScribe, ChemGrapher, and OSRA.</li>
</ul>
<h2 id="state-of-the-art-ensembles-vs-standalone-limitations">State-of-the-Art Ensembles vs. Standalone Limitations</h2>
<ul>
<li><strong>SOTA Ensemble Performance:</strong> The <strong>ChemExpert</strong> module (combining AtomLenz and DECIMER) achieved state-of-the-art accuracy on both hand-drawn (63.5%) and ChemPix (51.8%) test sets.</li>
<li><strong>Data Efficiency under Bottleneck Regimes:</strong> AtomLenz effectively bypassed the massive data constraints of competing models. When all methods were retrained from scratch on the same 4,070-sample hand-drawn training set (enriched with atom-level annotations from EditKT*), AtomLenz achieved 33.8% exact accuracy, outperforming baselines like Img2Mol (0.0%), MolScribe (1.3%), and DECIMER (0.1%), illustrating its sample efficiency.</li>
<li><strong>Localization Success:</strong> The base framework achieved strong localization (mAP 0.801), a capability not provided by end-to-end transformers like DECIMER.</li>
<li><strong>Methodological Tradeoffs:</strong> While AtomLenz is highly sample efficient, its standalone performance when fine-tuned on the target domain (33.8% accuracy) underperforms fine-tuned models trained on larger datasets like DECIMER (62.2% accuracy). AtomLenz achieves state-of-the-art results primarily when deployed as part of the ChemExpert ensemble alongside DECIMER, since errors from the two approaches tend to occur on different samples, allowing them to complement each other.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz">Official Repository (AtomLenz)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Complete pipeline for AtomLenz, ProbKT*, and Graph Edit-Correction.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz/tree/main/models">Pre-trained Models</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Downloadable weights for Faster R-CNN detection backbones.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset (Brinkhaus)</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Images and SMILES used for target domain fine-tuning and evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599172">Relabeled Hand-drawn Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">1,417 images with bounding box annotations generated via EditKT*.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/spaces/moldenhof/atomlenz">AtomLenz Web Demo</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Interactive Hugging Face space for testing model inference.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilizes a mix of large synthetic datasets and smaller curated hand-drawn datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>Synthetic ChEMBL</td>
          <td>~214,000</td>
          <td>Generated via RDKit/Indigo. Annotated with atoms, bonds, charges, stereocenters.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-drawn (Brinkhaus)</td>
          <td>4,070</td>
          <td>Used for weakly supervised adaptation (SMILES only).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Hand-drawn Test</td>
          <td>1,018</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChemPix</td>
          <td>614</td>
          <td>Out-of-distribution hand-drawn images.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Atom Localization</td>
          <td>1,000</td>
          <td>Synthetic images with ground truth bounding boxes.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Molecular Graph Constructor (Algorithm 1):</strong> A rule-based system to assemble the graph from detected objects:
<ol>
<li><strong>Filtering:</strong> Removes overlapping atom boxes (IoU threshold).</li>
<li><strong>Node Creation:</strong> Merges overlapping charge and stereocenter objects with their corresponding atom objects.</li>
<li><strong>Edge Creation:</strong> Iterates over bond objects; if a bond overlaps with exactly two atoms, an edge is added. If &gt;2, it selects the most probable pair.</li>
<li><strong>Validation:</strong> Checks valency constraints; removes bonds iteratively if constraints are violated.</li>
</ol>
</li>
<li><strong>Weakly Supervised Training:</strong>
<ul>
<li><strong>ProbKT*:</strong> Uses Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; implied by the SMILES string, allowing backpropagation without explicit boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Finds the smallest edit on the predicted graph such that the corrected and true SMILES graphs become isomorphic, then uses the correction to generate pseudo-labels for retraining.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Object Detection Backbone:</strong> <strong>Faster R-CNN</strong>.
<ul>
<li>Four distinct models are trained for different entity types: Atoms ($O^a$), Bonds ($O^b$), Charges ($O^c$), and Stereocenters ($O^s$).</li>
<li><strong>Loss Function:</strong> Multi-task loss combining Multi-class Log Loss ($L_{cls}$) and Regression Loss ($L_{reg}$).</li>
</ul>
</li>
<li><strong>ChemExpert:</strong> An ensemble wrapper that prioritizes models based on user preference (e.g., DECIMER first, then AtomLenz). It accepts the first prediction that passes RDKit chemical validity checks.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Primary metrics focused on structural correctness and localization accuracy.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Hand-drawn)</th>
          <th>Baseline (DECIMER FT)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (T=1)</strong></td>
          <td>33.8% (AtomLenz+EditKT*)</td>
          <td>62.2%</td>
          <td>Exact ECFP6 fingerprint match.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto Sim.</strong></td>
          <td>0.484</td>
          <td>0.727</td>
          <td>Average similarity.</td>
      </tr>
      <tr>
          <td><strong>mAP</strong></td>
          <td>0.801</td>
          <td>N/A</td>
          <td>Localization accuracy (IoU 0.05-0.35).</td>
      </tr>
      <tr>
          <td><strong>Ensemble Acc.</strong></td>
          <td><strong>63.5%</strong></td>
          <td>62.2%</td>
          <td>ChemExpert (DECIMER + AtomLenz).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Experiments utilized the Flemish Supercomputer Center (VSC) resources.</li>
<li><strong>Note:</strong> Specific GPU models (e.g., A100/V100) are not explicitly detailed in the text, but Faster R-CNN training is standard on consumer or enterprise GPUs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., De Brouwer, E., Arany, Á., &amp; Moreau, Y. (2024). Atom-Level Optical Chemical Structure Recognition with Limited Supervision. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2024.</p>
<p><strong>Publication venue/year</strong>: CVPR 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molden/atomlenz">Official Repository</a></li>
<li><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset on Figshare</a></li>
</ul>
<p><strong>BibTeX</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{oldenhofAtomLevelOpticalChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Atom-Level Optical Chemical Structure Recognition with Limited Supervision}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Oldenhof, Martijn and De Brouwer, Edward and Arany, {\&#39;A}d{\&#39;a}m and Moreau, Yves}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2404.01743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SwinOCSR: End-to-End Chemical OCR with Swin Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</guid><description>Deep learning model using Swin Transformer and Focal Loss for OCSR, achieving 98.58% accuracy on synthetic benchmarks.</description><content:encoded><![CDATA[<h2 id="contribution-methodological-architecture-and-datasets">Contribution: Methodological Architecture and Datasets</h2>
<p>This is a <strong>Methodological Paper</strong> with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).</li>
<li><strong>Resource</strong>: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.</li>
</ul>
<h2 id="motivation-addressing-visual-context-and-data-imbalance">Motivation: Addressing Visual Context and Data Imbalance</h2>
<ul>
<li><strong>Problem</strong>: OCSR (converting images of chemical structures to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.</li>
<li><strong>Technical Gap</strong>: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and miss global dependencies required for interpreting complex molecular diagrams.</li>
<li><strong>Data Imbalance</strong>: Chemical strings suffer from severe class imbalance (e.g., &lsquo;C&rsquo; and &lsquo;H&rsquo; are frequent; &lsquo;Br&rsquo; or &lsquo;Cl&rsquo; are rare), which causes standard Cross Entropy loss to underperform.</li>
</ul>
<h2 id="core-innovation-swin-transformers-and-focal-loss">Core Innovation: Swin Transformers and Focal Loss</h2>
<ul>
<li><strong>Swin Transformer Backbone</strong>: SwinOCSR replaces the standard CNN backbone with a <strong>Swin Transformer</strong>, using shifted window attention to capture both local and global image features more effectively.</li>
<li><strong>Multi-label Focal Loss (MFL)</strong>: The paper introduces a modified Focal Loss to OCSR, the first explicit attempt to address token imbalance in OCSR (per the authors). This penalizes the model for errors on rare tokens, addressing the &ldquo;long-tail&rdquo; distribution of chemical elements. The standard Focal Loss formulation heavily weights hard-to-classify examples:
$$
\begin{aligned}
FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) \\
\end{aligned}
$$</li>
<li><strong>Structured Synthetic Dataset</strong>: Creation of a dataset explicitly balanced across four structural categories: Kekule rings, Aromatic rings, and their combinations with substituents.</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Backbone Comparison</strong>: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).</li>
<li><strong>Loss Function Ablation</strong>: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).</li>
<li><strong>Category Stress Test</strong>: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.</li>
<li><strong>Real-world Evaluation</strong>: The model was tested on 100 images manually extracted from the literature (with manually labeled SMILES), and separately on 100 CDK-generated images from those same SMILES, to measure the domain gap between synthetic and real-world data.</li>
</ul>
<h2 id="results-and-limitations">Results and Limitations</h2>
<ul>
<li><strong>Synthetic test set performance</strong>: With Multi-label Focal Loss (MFL), SwinOCSR achieved <strong>98.58% accuracy</strong> on the synthetic test set, compared to 97.36% with standard CE loss. Both ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones scored lower when using CE loss (Table 3).</li>
<li><strong>Handling of long sequences</strong>: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating effective global feature extraction.</li>
<li><strong>Per-category results</strong>: Performance was consistent across molecule categories: Category 1 (Kekule, 98.20%), Category 2 (Aromatic, 98.46%), Category 3 (Kekule + Substituents, 98.76%), Category 4 (Aromatic + Substituents, 98.89%). The model performed slightly better on molecules with substituents and aromatic rings.</li>
<li><strong>Domain shift</strong>: While performance on synthetic data was strong, accuracy dropped to <strong>25%</strong> on 100 real-world literature images. On 100 CDK-generated images from the same SMILES strings, accuracy was 94%, confirming that the gap stems from stylistic differences between CDK-rendered and real-world images. The authors attribute this to noise, low resolution, and variations such as condensed structural formulas and abbreviations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The first 8.5 million structures from <strong>PubChem</strong> were downloaded, yielding ~6.9 million unique SMILES.</li>
<li><strong>Generation Pipeline</strong>:
<ul>
<li><strong>Tools</strong>: <strong>CDK</strong> (Chemistry Development Kit) for image rendering; <strong>RDKit</strong> for SMILES canonicalization.</li>
<li><strong>Augmentation</strong>: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekule, (2) Aromatic, (3) Kekule + Substituents, (4) Aromatic + Substituents. Substituents were randomly added from a list of 224 common patent substituents.</li>
<li><strong>Preprocessing</strong>: Images rendered as binary, resized to <strong>224x224</strong>, and copied to 3 channels (RGB simulation).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>4,500,000</td>
          <td>18:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
      <tr>
          <td>Test</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Loss Function</strong>: <strong>Multi-label Focal Loss (MFL)</strong>. The single-label classification task was cast as multi-label to apply Focal Loss, using a sigmoid activation on logits.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Optimizer</strong>: <strong>Adam</strong> with initial learning rate <code>5e-4</code>.</li>
<li><strong>Schedulers</strong>: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.</li>
<li><strong>Regularization</strong>: Dropout rate of <code>0.1</code>.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Backbone (Encoder 1)</strong>: <strong>Swin Transformer</strong>.
<ul>
<li>Patch size: $4 \times 4$.</li>
<li>Linear embedding dimension: 192.</li>
<li>Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).</li>
<li>Output: Flattened patch sequence $S_b$.</li>
</ul>
</li>
<li><strong>Transformer Encoder (Encoder 2)</strong>: 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.</li>
<li><strong>Transformer Decoder</strong>: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).</li>
<li><strong>Tokenization</strong>: <strong>DeepSMILES</strong> format used (syntactically more robust than SMILES). Vocabulary size: <strong>76 tokens</strong> (76 unique characters found in dataset). Embedding dimension: 256.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy (Exact Match), Tanimoto Similarity (PubChem fingerprints), BLEU, ROUGE.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SwinOCSR (CE)</th>
          <th>SwinOCSR (MFL)</th>
          <th>ResNet-50 (CE)</th>
          <th>EfficientNet-B3 (CE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>97.36%</td>
          <td><strong>98.58%</strong></td>
          <td>89.17%</td>
          <td>86.70%</td>
      </tr>
      <tr>
          <td>Tanimoto</td>
          <td>99.65%</td>
          <td><strong>99.77%</strong></td>
          <td>98.79%</td>
          <td>98.46%</td>
      </tr>
      <tr>
          <td>BLEU</td>
          <td>99.46%</td>
          <td><strong>99.59%</strong></td>
          <td>98.62%</td>
          <td>98.37%</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>99.64%</td>
          <td><strong>99.78%</strong></td>
          <td>98.87%</td>
          <td>98.66%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Trained on <strong>NVIDIA Tesla V100-PCIE</strong>.</li>
<li><strong>Training Time</strong>: 30 epochs.</li>
<li><strong>Batch Size</strong>: 256 images ($224 \times 224$ pixels).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/suanfaxiaohuo/SwinOCSR">SwinOCSR</a></td>
          <td>Code + Data</td>
          <td>Unknown</td>
          <td>Official implementation with dataset and trained models</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. <em>Journal of Cheminformatics</em>, 14(41). <a href="https://doi.org/10.1186/s13321-022-00624-5">https://doi.org/10.1186/s13321-022-00624-5</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/suanfaxiaohuo/SwinOCSR">GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>MolMiner: Deep Learning OCSR with YOLOv5 Detection</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/</guid><description>Deep learning OCSR tool using YOLOv5 and MobileNetV2 to extract machine-readable molecular structures from scientific documents and PDFs.</description><content:encoded><![CDATA[<h2 id="classification-and-contribution">Classification and Contribution</h2>
<p>This is primarily a <strong>Resource</strong> paper ($\Psi_{\text{Resource}}$) with a strong <strong>Method</strong> component ($\Psi_{\text{Method}}$).</p>
<ul>
<li><strong>Resource</strong>: It presents a complete software application (published as an &ldquo;Application Note&rdquo;) for Optical Chemical Structure Recognition (OCSR), including a graphical user interface (GUI) and a new curated &ldquo;Real-World&rdquo; dataset of 3,040 molecular images.</li>
<li><strong>Method</strong>: It proposes a novel &ldquo;rule-free&rdquo; pipeline that replaces traditional vectorization algorithms with deep learning object detection (YOLOv5) and segmentation models.</li>
</ul>
<h2 id="motivation-bottlenecks-in-rule-based-systems">Motivation: Bottlenecks in Rule-Based Systems</h2>
<ul>
<li><strong>Legacy Backlog</strong>: Decades of scientific literature contain chemical structures only as 2D images (PDFs), which are not machine-readable.</li>
<li><strong>Limitations of Legacy Architecture</strong>: Existing tools (like OSRA, CLIDE, MolVec) rely on rule-based vectorization (interpreting vectors and nodes) which struggle with noise, low resolution, and complex drawing styles found in scanned documents.</li>
<li><strong>Deep Learning Gap</strong>: While deep learning (DL) has advanced computer vision, few practical, end-to-end DL tools existed for OCSR that could handle the full pipeline from PDF extraction to graph generation with high accuracy.</li>
</ul>
<h2 id="core-innovation-object-detection-paradigm-for-ocsr">Core Innovation: Object Detection Paradigm for OCSR</h2>
<ul>
<li><strong>Object Detection Paradigm</strong>: MolMiner shifts away from the strategy of line-tracing (vectorization), opting to treat atoms and bonds directly as objects to be detected using <strong>YOLOv5</strong>. This allows it to &ldquo;look once&rdquo; at the image.</li>
<li><strong>End-to-End Pipeline</strong>: Integration of three specialized modules:
<ol>
<li><strong>MobileNetV2</strong> for segmenting molecular figures from PDF pages.</li>
<li><strong>YOLOv5</strong> for detecting chemical elements (atoms/bonds) as bounding boxes.</li>
<li><strong>EasyOCR</strong> for recognizing text labels and resolving abbreviations (supergroups) to full explicit structures.</li>
</ol>
</li>
<li><strong>Synthetic Training Strategy</strong>: The authors bypassed manual labeling by building a data generation module that uses RDKit to create chemically valid images with perfect ground-truth annotations automatically.</li>
</ul>
<h2 id="methodology-end-to-end-object-detection-pipeline">Methodology: End-to-End Object Detection Pipeline</h2>
<ul>
<li><strong>Benchmarks</strong>: Evaluated on four standard OCSR datasets: <strong>USPTO</strong> (5,719 images), <strong>UOB</strong> (5,740 images), <strong>CLEF2012</strong> (992 images), and <strong>JPO</strong> (450 images).</li>
<li><strong>New External Dataset</strong>: Collected and annotated a &ldquo;Real-World&rdquo; dataset of <strong>3,040 images</strong> from 239 scientific papers to test generalization beyond synthetic benchmarks.</li>
<li><strong>Baselines</strong>: Compared against open-source tools: <strong>MolVec</strong> (v0.9.8), <strong>OSRA</strong> (v2.1.0), and <strong>Imago</strong> (v2.0).</li>
<li><strong>Qualitative Tests</strong>: Tested on difficult cases like hand-drawn molecules and large-sized scans (e.g., Palytoxin).</li>
</ul>
<h2 id="results-speed-and-generalization-metrics">Results: Speed and Generalization Metrics</h2>
<ul>
<li><strong>Benchmark Performance</strong>: MolMiner outperformed open-source baselines on standard validation splits.
<ul>
<li><em>USPTO</em>: 93% MCS accuracy (vs. 89% for MolVec, per Table 2). The commercial CLiDE Pro tool reports 93.8% on USPTO, slightly higher than MolMiner&rsquo;s 93.3%.</li>
<li><em>Real-World Set</em>: 87.8% MCS accuracy (vs. 50.1% for MolVec, 8.9% for OSRA, and 10.3% for Imago).</li>
</ul>
</li>
<li><strong>Inference Velocity</strong>: The architecture allows for faster processing compared to CPU rule-based systems. On JPO (450 images), MolMiner finishes in under 1 minute versus 8-23 minutes for rule-based tools (Table 3).</li>
<li><strong>Robustness</strong>: Demonstrated ability to handle hand-drawn sketches and noisy scans, though limitations remain with crossing bonds, colorful backgrounds, crowded layout segmentation, and Markush structures.</li>
<li><strong>Software Release</strong>: Released as a free desktop application for Mac and Windows with a Ketcher-based editing plugin.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The system relies heavily on synthetic data for training, while evaluation uses both standard and novel real-world datasets.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left"><strong>Synthetic RDKit</strong></td>
          <td style="text-align: left">Large-scale</td>
          <td style="text-align: left">Generated using RDKit v2021.09.1 and ReportLab v3.5.0. Includes augmentations (rotation, thinning, noise).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>USPTO</strong></td>
          <td style="text-align: left">5,719</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 380.0.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>UOB</strong></td>
          <td style="text-align: left">5,740</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 213.5.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>CLEF2012</strong></td>
          <td style="text-align: left">992</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 401.2.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>JPO</strong></td>
          <td style="text-align: left">450</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 360.3.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>Real-World</strong></td>
          <td style="text-align: left">3,040</td>
          <td style="text-align: left"><strong>New Contribution</strong>. Collected from 239 scientific papers. <a href="https://zenodo.org/records/6973361">Download Link</a>.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Data Generation</strong>:
<ul>
<li>Uses <strong>RDKit</strong> <code>MolDraw2DSVG</code> and <code>CondenseMolAbbreviations</code> to generate images and ground truth.</li>
<li><strong>Augmentation</strong>: Rotation, line thinning/thickness variation, noise injection.</li>
</ul>
</li>
<li><strong>Graph Construction</strong>:
<ul>
<li>A distance-based algorithm connects recognized &ldquo;Atom&rdquo; and &ldquo;Bond&rdquo; objects into a molecular graph.</li>
<li><strong>Supergroup Parser</strong>: Matches detected text against a dictionary collected from RDKit, ChemAxon, and OSRA to resolve abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Me&rdquo;).</li>
</ul>
</li>
<li><strong>Image Preprocessing</strong>:
<ul>
<li><strong>Resizing</strong>: Images with max dim &gt; 2560 are resized to 2560. Small images (&lt; 640) resized to 640.</li>
<li><strong>Padding</strong>: Images padded to nearest upper bound (640, 1280, 1920, 2560) with white background (255, 255, 255).</li>
<li><strong>Dilation</strong>: For thick-line images, <code>cv2.dilate</code> (3x3 or 2x2 kernel) is applied to estimate median line width.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<p>The system is a cascade of three distinct deep learning models:</p>
<ol>
<li><strong>MolMiner-ImgDet</strong> (Page Segmentation):
<ul>
<li><strong>Architecture</strong>: <strong>MobileNetV2</strong>.</li>
<li><strong>Task</strong>: Semantic segmentation to identify and crop chemical figures from full PDF pages.</li>
<li><strong>Classes</strong>: Background vs. Compound.</li>
<li><strong>Performance</strong>: Recall 95.5%.</li>
</ul>
</li>
<li><strong>MolMiner-ImgRec</strong> (Structure Recognition):
<ul>
<li><strong>Architecture</strong>: <strong>YOLOv5</strong> (One-stage object detector). Selected over MaskRCNN/EfficientDet for speed/accuracy trade-off.</li>
<li><strong>Task</strong>: Detects atoms and bonds as bounding boxes.</li>
<li><strong>Labels</strong>:
<ul>
<li><em>Atoms</em>: Si, N, Br, S, I, Cl, H, P, O, C, B, F, Text.</li>
<li><em>Bonds</em>: Single, Double, Triple, Wedge, Dash, Wavy.</li>
</ul>
</li>
<li><strong>Performance</strong>: <a href="mailto:mAP@0.5">mAP@0.5</a> = 97.5%.</li>
</ul>
</li>
<li><strong>MolMiner-TextOCR</strong> (Character Recognition):
<ul>
<li><strong>Architecture</strong>: <strong>EasyOCR</strong> (fine-tuned).</li>
<li><strong>Task</strong>: Recognize specific characters in &ldquo;Text&rdquo; regions identified by YOLO (e.g., supergroups, complex labels).</li>
<li><strong>Performance</strong>: ~96.4% accuracy.</li>
</ul>
</li>
</ol>
<h2 id="performance-evaluation--accuracy-metrics">Performance Evaluation &amp; Accuracy Metrics</h2>
<p>The paper argues that computing the Maximum Common Substructure (MCS) accuracy is superior to string comparisons of canonical identifiers like InChI or SMILES. The InChI string is heavily sensitive to slight canonicalization or tautomerization discrepancies (like differing aromaticity models). Therefore, for comparing structural isomorphism:</p>
<p>$$ \text{MCS_Accuracy} = \frac{|\text{Edges}_{\text{MCS}}| + |\text{Nodes}_{\text{MCS}}|}{|\text{Edges}_{\text{Ground_Truth}}| + |\text{Nodes}_{\text{Ground_Truth}}|} $$</p>
<p>Using this metric to evaluate bond- and atom-level recall directly measures OCR extraction fidelity.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">MolMiner (Real-World)</th>
          <th style="text-align: left">MolVec</th>
          <th style="text-align: left">OSRA</th>
          <th style="text-align: left">Imago</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>MCS Accuracy</strong></td>
          <td style="text-align: left"><strong>87.8%</strong></td>
          <td style="text-align: left">50.1%</td>
          <td style="text-align: left">8.9%</td>
          <td style="text-align: left">10.3%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>InChI Accuracy</strong></td>
          <td style="text-align: left"><strong>88.9%</strong></td>
          <td style="text-align: left">62.6%</td>
          <td style="text-align: left">64.5%</td>
          <td style="text-align: left">10.8%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Inference Hardware</strong>: Tested on Intel Xeon Gold 6230R CPU @ 2.10 GHz.</li>
<li><strong>Acceleration</strong>: Supports batch inference on GPU, which provides the reported speedups over rule-based CPU tools.</li>
<li><strong>Runtime</strong>: Under 1 minute on JPO (450 images), 7 minutes on USPTO (5,719 images), compared to 29-148 minutes for baseline tools on USPTO (Table 3).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/iipharma/pharmamind-molminer">pharmamind-molminer</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">GitHub repo with user guides and release downloads</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://zenodo.org/records/6973361">Real-World Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">3,040 molecular images from 239 papers</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Y., Xiao, J., Chou, C.-H., Zhang, J., Zhu, J., Hu, Q., Li, H., Han, N., Liu, B., Zhang, S., Han, J., Zhang, Z., Zhang, S., Zhang, W., Lai, L., &amp; Pei, J. (2022). MolMiner: You only look once for chemical structure recognition. <em>Journal of Chemical Information and Modeling</em>, 62(22), 5321&ndash;5328. <a href="https://doi.org/10.1021/acs.jcim.2c00733">https://doi.org/10.1021/acs.jcim.2c00733</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iipharma/pharmamind-molminer">Github Repository</a></li>
<li><a href="https://zenodo.org/records/6973361">Zenodo Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xuMolMinerYouOnly2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MolMiner: You only look once for chemical structure recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MolMiner}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Xu, Youjun and Xiao, Jinchuan and Chou, Chia-Han and Zhang, Jianhang and Zhu, Jintao and Hu, Qiwan and Li, Hemin and Han, Ningsheng and Liu, Bingyu and Zhang, Shuaipeng and Han, Jinyu and Zhang, Zhen and Zhang, Shuhao and Zhang, Weilin and Lai, Luhua and Pei, Jianfeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5321--5328}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.2c00733}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Structure Recognition with RCGD</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</guid><description>An end-to-end framework (RCGD) and unambiguous markup language (SSML) for recognizing complex handwritten chemical structures with guided graph traversal.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-framework">Contribution and Methodological Framework</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architectural framework (<strong>RCGD</strong>) and a new representation syntax (<strong>SSML</strong>) to solve the specific problem of handwritten chemical structure recognition.</li>
<li><strong>Resource</strong>: It introduces a new benchmark dataset, <strong>EDU-CHEMC</strong>, containing 50,000 handwritten images to address the lack of public data in this domain.</li>
</ul>
<h2 id="the-ambiguity-of-handwritten-chemical-structures">The Ambiguity of Handwritten Chemical Structures</h2>
<p>Recognizing handwritten chemical structures is significantly harder than printed ones due to:</p>
<ol>
<li><strong>Inherent Ambiguity</strong>: Handwritten atoms and bonds vary greatly in appearance.</li>
<li><strong>Projection Complexity</strong>: Converting 2D projected layouts (like Natta or Fischer projections) into linear strings is difficult.</li>
<li><strong>Limitations of Existing Formats</strong>: Standard formats like SMILES require domain knowledge (valence rules) and have a high semantic gap with the visual image. They often fail to represent &ldquo;invalid&rdquo; structures commonly found in educational/student work.</li>
</ol>
<h2 id="bridging-the-semantic-gap-with-ssml-and-rcgd">Bridging the Semantic Gap with SSML and RCGD</h2>
<p>The paper introduces two core contributions to bridge the semantic gap between image and markup:</p>
<ol>
<li>
<p><strong>Structure-Specific Markup Language (SSML)</strong>: An extension of Chemfig that provides an unambiguous, visual-based graph representation. Unlike SMILES, it describes <em>how to draw</em> the molecule step-by-step, making it easier for models to learn visual alignments. It supports &ldquo;reconnection marks&rdquo; to handle cyclic structures explicitly.</p>
</li>
<li>
<p><strong>Random Conditional Guided Decoder (RCGD)</strong>: A decoder that treats recognition as a graph traversal problem. It introduces three novel mechanisms:</p>
<ul>
<li><strong>Conditional Attention Guidance</strong>: Uses branch angle directions to guide the attention mechanism, preventing the model from getting lost in complex structures.</li>
<li><strong>Memory Classification</strong>: A module that explicitly stores and classifies &ldquo;unexplored&rdquo; branch points to handle ring closures (reconnections).</li>
<li><strong>Path Selection</strong>: A training strategy that randomly samples traversal paths to prevent overfitting to a specific serialization order.</li>
</ul>
</li>
</ol>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p><strong>Datasets</strong>:</p>
<ul>
<li><strong>Mini-CASIA-CSDB</strong> (Printed): A subset of 97,309 printed molecular structure images, upscaled to $500 \times 500$ resolution.</li>
<li><strong>EDU-CHEMC</strong> (Handwritten): A new dataset of 52,987 images collected from educational settings (cameras, scanners, screens), including erroneous/non-existent structures.</li>
</ul>
<p><strong>Baselines</strong>:</p>
<ul>
<li>Compared against standard <strong>String Decoders (SD)</strong> (based on DenseWAP), tested with both SMILES and SSML on Mini-CASIA-CSDB and exclusively with SSML on EDU-CHEMC.</li>
<li>Compared against <strong>BTTR</strong> and <strong>ABM</strong> (recent mathematical expression recognition models) adapted for the chemical structure task, both using SSML on EDU-CHEMC.</li>
<li>On Mini-CASIA-CSDB, also compared against <strong>WYGIWYS</strong> (a SMILES-based string decoder at 300x300 resolution).</li>
</ul>
<p><strong>Ablation Studies</strong>:</p>
<ul>
<li>Evaluated the impact of removing Path Selection (PS) and Memory Classification (MC) mechanisms on EDU-CHEMC.</li>
<li>Tested robustness to image rotation ($180^{\circ}$) on Mini-CASIA-CSDB.</li>
</ul>
<h2 id="recognition-performance-and-robustness">Recognition Performance and Robustness</h2>
<ul>
<li><strong>Superiority of SSML</strong>: Models trained with SSML significantly outperformed those trained with SMILES (92.09% vs 81.89% EM on printed data) due to reduced semantic gap.</li>
<li><strong>Best Performance</strong>: RCGD achieved the highest Exact Match (EM) scores on both datasets:
<ul>
<li><strong>Mini-CASIA-CSDB</strong>: 95.01% EM.</li>
<li><strong>EDU-CHEMC</strong>: 62.86% EM.</li>
</ul>
</li>
<li><strong>EDU-CHEMC Baselines</strong>: On the handwritten dataset, SD (DenseWAP) achieved 61.35% EM, outperforming both BTTR (58.21% EM) and ABM (58.78% EM). The authors note that BTTR and ABM&rsquo;s reverse training mode, which helps in regular formula recognition, does not transfer well to graph-structured molecular data.</li>
<li><strong>Ablation Results</strong> (Table 5, EDU-CHEMC): Removing Path Selection alone dropped EM from 62.86% to 62.15%. Removing both Path Selection and Memory Classification dropped EM further to 60.31%, showing that memory classification has a larger impact.</li>
<li><strong>Robustness</strong>: RCGD showed minimal performance drop (0.85%) on rotated images compared to SMILES-based methods (10.36% drop). The SD with SSML dropped by 2.19%, confirming that SSML itself improves rotation invariance.</li>
<li><strong>Educational Utility</strong>: The method can recognize and reconstruct chemically invalid structures (e.g., a Carbon atom with 5 bonds), making it applicable for correcting and revising handwritten answers in chemistry education.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>1. EDU-CHEMC (Handwritten)</strong></p>
<ul>
<li><strong>Total Size</strong>: 52,987 images.</li>
<li><strong>Splits</strong>: Training (48,998), Validation (999), Test (2,992).</li>
<li><strong>Characteristics</strong>: Real-world educational data, mixture of isolated molecules and reaction equations, includes invalid chemical structures.</li>
</ul>
<p><strong>2. Mini-CASIA-CSDB (Printed)</strong></p>
<ul>
<li><strong>Total Size</strong>: 97,309 images.</li>
<li><strong>Splits</strong>: Training (80,781), Validation (8,242), Test (8,286).</li>
<li><strong>Preprocessing</strong>: Original $300 \times 300$ images were upscaled to $500 \times 500$ RGB to resolve blurring issues.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SSML Generation</strong></p>
<p>To convert a molecular graph to SSML:</p>
<ol>
<li><strong>Traverse</strong>: Start from the left-most atom.</li>
<li><strong>Bonds/Atoms</strong>: Output atom text and bond format <code>&lt;bond&gt;[:&lt;angle&gt;]</code>.</li>
<li><strong>Branches</strong>: At branch points, use phantom symbols <code>(</code> and <code>)</code> to enclose branches, ordered by ascending bond angle.</li>
<li><strong>Reconnections</strong>: Use <code>?[tag]</code> and <code>?[tag, bond]</code> to mark start/end of ring closures.</li>
</ol>
<p><strong>2. RCGD Specifics</strong></p>
<ul>
<li><strong>RCGD-SSML</strong>: Modified version of SSML for the decoder. Removes <code>(</code> <code>)</code> delimiters; adds <code>\eob</code> (end of branch). Maintains a dynamic <strong>Branch Angle Set ($M$)</strong>.</li>
<li><strong>Path Selection</strong>: During training, when multiple branches exist in $M$, the model randomly selects one to traverse next. During inference, it uses beam search to score candidate paths.</li>
<li><strong>Loss Function</strong>:
$$
\begin{aligned}
L_{\text{total}} = L_{\text{ce}} + L_{\text{bc}}
\end{aligned}
$$
<ul>
<li>$L_{\text{ce}}$: Cross-entropy loss for character sequence generation.</li>
<li>$L_{\text{bc}}$: Multi-label classification loss for the memory module (predicting reconnection bond types for stored branch states).</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>: DenseNet</p>
<ul>
<li><strong>Structure</strong>: 3 dense blocks.</li>
<li><strong>Growth Rate</strong>: 24.</li>
<li><strong>Depth</strong>: 32 per block.</li>
<li><strong>Output</strong>: High-dimensional feature map $x \in \mathbb{R}^{d_x \times h \times w}$.</li>
</ul>
<p><strong>Decoder</strong>: GRU with Attention</p>
<ul>
<li><strong>Hidden State Dimension</strong>: 256.</li>
<li><strong>Embedding Dimension</strong>: 256.</li>
<li><strong>Attention Projection</strong>: 128.</li>
<li><strong>Memory Classification Projection</strong>: 256.</li>
</ul>
<p><strong>Training Config</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: Adam.</li>
<li><strong>Learning Rate</strong>: 2e-4 with multi-step decay (gamma 0.5).</li>
<li><strong>Dropout</strong>: 15%.</li>
<li><strong>Strategy</strong>: Teacher-forcing used for validation selection.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Exact Match (EM)</strong>: Percentage of samples where the predicted graph structure perfectly matches the label. For SMILES, string comparison; for SSML, converted to graph for isomorphism check.</li>
<li><strong>Structure EM</strong>: Auxiliary metric for samples with mixed content (text + molecules), counting samples where <em>all</em> molecular structures are correct.</li>
</ul>
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">EDU-CHEMC</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Dataset annotations and download links (actual data hosted on Google Drive)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing Components</strong>:</p>
<ul>
<li>No training or inference code is publicly released; only the dataset is available.</li>
<li>Pre-trained model weights are not provided.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, J., Wu, H., Chen, M., Liu, C., Wu, J., Yin, S., Yin, B., Yin, B., Liu, C., Du, J., &amp; Dai, L. (2023). Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder. <em>Proceedings of the 31st ACM International Conference on Multimedia</em> (pp. 8114-8124). <a href="https://doi.org/10.1145/3581783.3612573">https://doi.org/10.1145/3581783.3612573</a></p>
<p><strong>Publication</strong>: ACM Multimedia 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">GitHub Repository / EDU-CHEMC Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{huHandwrittenChemicalStructure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 31st ACM International Conference on Multimedia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hu, Jinshui and Wu, Hao and Chen, Mingjun and Liu, Chenyu and Wu, Jiajia and Yin, Shi and Yin, Baocai and Yin, Bing and Liu, Cong and Du, Jun and Dai, Lirong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{8114--8124}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Ottawa ON Canada}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3581783.3612573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{979-8-4007-0108-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Deep Learning for Molecular Structure Extraction (2019)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</guid><description>An end-to-end deep learning approach using U-Net segmentation and a CNN encoder with GridLSTM decoder to predict chemical structures from document images.</description><content:encoded><![CDATA[<h2 id="contribution-type-method-and-resource">Contribution Type: Method and Resource</h2>
<p>This is primarily a <strong>methodological</strong> paper with a secondary <strong>resource</strong> contribution.</p>
<p><strong>Method</strong>: It proposes a novel end-to-end deep learning architecture (Segmentation U-Net + Recognition Encoder-Decoder) to replace traditional rule-based optical chemical structure recognition (OCSR) systems.</p>
<p><strong>Resource</strong>: It details a pipeline for generating large-scale synthetic datasets (images overlaying patent/journal backgrounds) necessary to train the deep learning models.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Existing tools for extracting chemical structures from literature (e.g., OSRA, CLIDE) rely on complex, handcrafted rules and heuristics (edge detection, vectorization). These systems suffer from:</p>
<ol>
<li><strong>Brittleness</strong>: They fail when image quality is low (low resolution, noise) or when artistic styles vary (wavy bonds, crossing lines).</li>
<li><strong>Maintenance difficulty</strong>: Improvements require manual codification of new rules for every edge case, which is difficult to scale.</li>
<li><strong>Data volume</strong>: The explosion of published life science papers (2000+ per day in Medline) creates a need for automated, robust curation tools that humans cannot match.</li>
</ol>
<h2 id="core-innovation-end-to-end-pixel-to-smiles-recognition">Core Innovation: End-to-End Pixel-to-SMILES Recognition</h2>
<p>The authors present an <strong>end-to-end deep learning approach</strong> for this task that operates directly on raw pixels without explicit subcomponent recognition (e.g., detecting atoms and bonds separately). Key innovations include:</p>
<ol>
<li><strong>Pixel-to-SMILES</strong>: Treating structure recognition as an image captioning problem using an encoder-decoder architecture with attention, generating SMILES directly.</li>
<li><strong>Low-Resolution Robustness</strong>: The model is trained on aggressively downsampled images (~60 dpi for segmentation, 256x256 for prediction), making it robust to poor quality and noisy inputs from legacy PDF extractions.</li>
<li><strong>Implicit Superatom Handling</strong>: The model learns to recognize and generate sequences for superatoms (e.g., &ldquo;OTBS&rdquo;) contextually.</li>
</ol>
<h2 id="experimental-setup-and-large-scale-synthetic-data">Experimental Setup and Large-Scale Synthetic Data</h2>
<p>The authors validated their approach using a mix of large-scale synthetic training sets and real-world test sets:</p>
<ol>
<li><strong>Synthetic Generation</strong>: They created a segmentation dataset by overlaying USPTO molecules onto &ldquo;whited-out&rdquo; journal pages.</li>
<li><strong>Ablation/Training</strong>: Metrics were tracked on Indigo (synthetic) and USPTO (real patent images) datasets.</li>
<li><strong>External Validation</strong>:
<ul>
<li><strong>Valko Dataset</strong>: A standard benchmark of 454 heterogeneous images from literature.</li>
<li><strong>Proprietary Dataset</strong>: A collection of images from 47 articles and 5 patents to simulate real-world drug discovery curation.</li>
</ul>
</li>
<li><strong>Stress Testing</strong>: They analyzed performance distributions across molecular weight, heavy atom count, and rare elements (e.g., Uranium, Vanadium).</li>
</ol>
<h2 id="results-and-limitations-in-complex-structures">Results and Limitations in Complex Structures</h2>
<ul>
<li><strong>High Accuracy on Standard Sets</strong>: The model achieved <strong>82% accuracy</strong> on the Indigo validation set and <strong>77%</strong> on the USPTO validation set. No apparent overfitting was observed on the Indigo data (57M training examples), though some overfitting occurred on the smaller USPTO set (1.7M training examples).</li>
<li><strong>Real-World Viability</strong>: It achieved <strong>83% accuracy</strong> on the proprietary internal test set, with validation and proprietary accuracies ranging from 77-83%, indicating the training sets reasonably approximate real drug discovery data.</li>
<li><strong>Segmentation Quality</strong>: Low segmentation error rates were observed: only 3.3% of the Valko dataset and 6.6% of the proprietary images failed to segment properly.</li>
<li><strong>Limitations on Complexity</strong>: Performance dropped to <strong>41% on the Valko test set</strong>. Superatoms were the single largest contributor to prediction errors, with 21% of Valko samples containing one or more incorrectly predicted superatoms. Only 6.6% of total training images contained any superatom, limiting the model&rsquo;s exposure.</li>
<li><strong>Stereochemistry Challenges</strong>: 60% of compounds with incorrectly predicted stereochemistry had explicit stereochemistry in both the ground truth and the prediction, but with wrong configurations assigned (e.g., predicting R instead of S). The model often correctly identified which atoms have stereocenters but assigned the wrong direction, suggesting the architecture may not incorporate sufficient spatial context for configuration assignment.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized three primary sources for generating training data. All inputs were strictly downsampled to improve robustness.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>Indigo Set</strong></td>
          <td>57M</td>
          <td>PubChem molecules rendered via Indigo (256x256).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO Set</strong></td>
          <td>1.7M</td>
          <td>Image/SMILES pairs from public patent data.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>OS X Indigo</strong></td>
          <td>10M</td>
          <td>Additional Indigo renders from Mac OS for style diversity.</td>
      </tr>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td><strong>Synthetic Pages</strong></td>
          <td>N/A</td>
          <td>Generated by overlaying USPTO images on text-cleared PDF pages.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Segmentation Inputs</strong>: Grayscale, downsampled to ~60 dpi.</li>
<li><strong>Prediction Inputs</strong>: Resized to 256x256 such that bond lengths are approximately 3-12 pixels.</li>
<li><strong>Augmentation</strong>: Random affine transforms, brightness scaling, and binarization applied during training.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Segmentation Pipeline</strong>:</p>
<ul>
<li><strong>Multi-scale Inference</strong>: Masks generated at resolutions from 30 to 60 dpi (3 dpi increments) and averaged for the final mask.</li>
<li><strong>Post-processing</strong>: Hough transform used to remove long straight lines (table borders). Mask blobs filtered by pixel count thresholds.</li>
</ul>
<p><strong>Prediction Pipeline</strong>:</p>
<ul>
<li><strong>Sequence Generation</strong>: SMILES generated character-by-character via greedy decoding. During inference, predictions are made at several low resolutions and the sequence with the highest confidence (product of per-character softmax outputs) is returned.</li>
<li><strong>Attention-based Verification</strong>: Attention weights used to re-project predicted atoms back into 2D space to visually verify alignment with the input image.</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>1. Segmentation Model (U-Net Variant)</strong>:</p>
<ul>
<li><strong>Architecture</strong>: U-Net style with skip connections.</li>
<li><strong>Input</strong>: 128x128x1 grayscale image.</li>
<li><strong>Layers</strong>: Alternating 3x3 Conv and 2x2 Max Pool.</li>
<li><strong>Activation</strong>: Parametric ReLU (pReLU).</li>
<li><strong>Parameters</strong>: ~380,000.</li>
</ul>
<p><strong>2. Structure Prediction Model (Encoder-Decoder)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: CNN with 5x5 convolutions, 2x2 Max Pooling, pReLU. No pooling in first layers to preserve fine features.</li>
<li><strong>Decoder</strong>: 3 layers of <strong>GridLSTM</strong> cells.</li>
<li><strong>Attention</strong>: Soft/Global attention mechanism conditioned on the encoder state.</li>
<li><strong>Input</strong>: 256x256x1 image.</li>
<li><strong>Output</strong>: Sequence of characters (vocab size 65).</li>
<li><strong>Parameters</strong>: ~46.3 million.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation required an exact string match of the Canonical SMILES (including stereochemistry) to the ground truth.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Dataset</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td><strong>82%</strong></td>
          <td>Indigo Val</td>
          <td>Synthetic validation set</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>77%</strong></td>
          <td>USPTO Val</td>
          <td>Real patent images</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>83%</strong></td>
          <td>Proprietary</td>
          <td>Internal pharma dataset (real world)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>41%</strong></td>
          <td>Valko Test</td>
          <td>External benchmark; difficult due to superatoms</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Segmentation Training</strong>: 1 GPU, ~4 days (650k steps).</li>
<li><strong>Prediction Training</strong>: 8 NVIDIA Pascal GPUs, ~26 days (1M steps).</li>
<li><strong>Framework</strong>: TensorFlow.</li>
<li><strong>Optimizer</strong>: Adam.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code, pre-trained models, or generated datasets were released with this paper. The training pipeline relies on publicly available molecular databases (PubChem, USPTO) and open-source rendering tools (Indigo), but the specific training sets, model weights, and inference code remain unavailable.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Staker, J., Marshall, K., Abel, R., &amp; McQuaw, C. (2019). Molecular Structure Extraction From Documents Using Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 59(3), 1017-1029. <a href="https://doi.org/10.1021/acs.jcim.8b00669">https://doi.org/10.1021/acs.jcim.8b00669</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.schrodinger.com/publications/">Schrödinger Publication Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{stakerMolecularStructureExtraction2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Structure Extraction From Documents Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Staker, Joshua and Marshall, Kyle and Abel, Robert and McQuaw, Carolyn}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{feb}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{59}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1017--1029}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.8b00669}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1021/acs.jcim.8b00669}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mixfile &amp; MInChI: Machine-Readable Mixture Formats</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/mixfile-minchi/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/mixfile-minchi/</guid><description>Mixfile and MInChI provide the first standardized, machine-readable formats for representing chemical mixtures.</description><content:encoded><![CDATA[<h2 id="a-standardized-resource-for-chemical-mixtures">A Standardized Resource for Chemical Mixtures</h2>
<p>This is a <strong>Resource</strong> paper that introduces two complementary standards for representing chemical mixtures: the detailed <strong>Mixfile</strong> format for comprehensive mixture descriptions and the compact <strong>MInChI</strong> (Mixtures InChI) specification for canonical mixture identifiers.</p>
<h2 id="the-missing-format-for-complex-formulations">The Missing Format for Complex Formulations</h2>
<p>There is a fundamental gap in chemical informatics: current standards excel at representing pure individual molecules (SMILES, InChI, Molfile), but a corresponding standard for multi-component mixtures remains an open challenge. This is a major problem because real-world chemistry predominantly involves complex mixtures.</p>
<p>Everyday chemical work frequently involves:</p>
<ul>
<li>Reagents with specified purity (e.g., &ldquo;$\geq$ 97% pure&rdquo;)</li>
<li>Solutions and formulations</li>
<li>Complex mixtures like &ldquo;hexanes&rdquo; (which contains multiple isomers)</li>
<li>Drug formulations with active ingredients and excipients</li>
</ul>
<p>Without a machine-readable standard, chemists are forced to describe these mixtures in plain text that software cannot parse or analyze systematically. This creates barriers for automated safety analysis, inventory management, and data sharing.</p>
<h2 id="dual-design-comprehensive-mixfiles-and-canonical-minchis">Dual Design: Comprehensive Mixfiles and Canonical MInChIs</h2>
<p>The authors propose a two-part solution:</p>
<ol>
<li><strong>Mixfile</strong>: A detailed, hierarchical JSON format that captures the complete composition of a mixture</li>
<li><strong>MInChI</strong>: A compact, canonical string identifier derived from Mixfile data</li>
</ol>
<p>This dual approach provides both comprehensive description (Mixfile) and simple identification (MInChI), similar to having both a detailed recipe and a short name for a dish.</p>
<h3 id="what-makes-a-good-mixture-format">What Makes a Good Mixture Format?</h3>
<p>The authors identify three essential properties any mixture format must capture:</p>
<ol>
<li><strong>Compound</strong>: What molecules are present?</li>
<li><strong>Quantity</strong>: How much of each component?</li>
<li><strong>Hierarchy</strong>: How are components organized (e.g., mixtures-of-mixtures)?</li>
</ol>
<p>The hierarchical aspect is crucial. Consider &ldquo;hexanes&rdquo;: it is a named mixture containing specific proportions of n-hexane, 2-methylpentane, 3-methylpentane, etc. A mixture format needs to represent both the individual isomers and the fact that they are grouped under the umbrella term &ldquo;hexanes.&rdquo;</p>
<h3 id="mixfile-format-details">Mixfile Format Details</h3>
<p>Mixfile uses JSON as its foundation, making it both human-readable and easy to parse in modern programming languages. The core structure is a hierarchical tree where each component can contain:</p>
<ul>
<li><strong>name</strong>: Component identifier</li>
<li><strong>molfile/smiles/inchi/formula</strong>: Molecular structure (molfile is the primary source of truth)</li>
<li><strong>quantity/units/relation/ratio</strong>: Concentration data with optional relation operators</li>
<li><strong>contents</strong>: Array of sub-components for hierarchical mixtures</li>
<li><strong>identifiers</strong>: Database IDs or URLs for additional information</li>
</ul>
<h4 id="simple-example">Simple Example</h4>
<p>A basic Mixfile might look like:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Acetone, ≥99%&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;acetone&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">99</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;relation&#34;</span>: <span style="color:#e6db74">&#34;&gt;=&#34;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Note that the paper specifies distinct fields for molecular structures: <code>molfile</code> (the primary source of truth), <code>smiles</code>, <code>inchi</code>, and <code>formula</code>. Concentration data uses separate <code>quantity</code>, <code>units</code>, and <code>relation</code> fields.</p>
<h4 id="complex-example-mixture-of-mixtures">Complex Example: Mixture-of-Mixtures</h4>
<p>For something like &ldquo;ethyl acetate dissolved in hexanes,&rdquo; the structure would be:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Ethyl acetate in hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;ethyl acetate&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCOC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">10</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;n-hexane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCCCCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">60</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;2-methylpentane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(C)CCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">25</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>      ]
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>This hierarchical structure captures the &ldquo;recipe&rdquo; of complex mixtures while remaining machine-readable.</p>
<h3 id="minchi-canonical-mixture-identifiers">MInChI: Canonical Mixture Identifiers</h3>
<p>While Mixfiles provide comprehensive descriptions, simple identifiers are also needed for database storage and searching. This is where MInChI comes in.</p>
<p>A MInChI string is structured as:</p>
<pre><code>MInChI=0.00.1S/&lt;components&gt;/n&lt;indexing&gt;/g&lt;concentration&gt;
</code></pre>
<ul>
<li><strong>Header</strong>: Version information (<code>0.00.1S</code> in the paper&rsquo;s specification)</li>
<li><strong>Components</strong>: Standard InChI for each unique molecule, sorted alphabetically <em>by the InChI strings themselves</em>, then concatenated with <code>&amp;</code></li>
<li><strong>Indexing</strong> (prefixed with <code>/n</code>): Hierarchical structure using curly braces <code>{}</code> for branches and <code>&amp;</code> for adjacent nodes; uses 1-based integer indices referring to the sorted InChI list</li>
<li><strong>Concentration</strong> (prefixed with <code>/g</code>): Quantitative information for each component, with units converted to canonical codes</li>
</ul>
<h4 id="why-this-matters">Why This Matters</h4>
<p>MInChI strings enable simple database searches:</p>
<ul>
<li>Check if a specific component appears in any mixture</li>
<li>Compare different formulations of the same product</li>
<li>Identify similar mixtures based on string similarity</li>
</ul>
<h2 id="validating-the-standard-through-practical-tooling">Validating the Standard Through Practical Tooling</h2>
<p>The paper demonstrates the format&rsquo;s capabilities through several practical applications and a proof-of-concept implementation:</p>
<h3 id="text-extraction-algorithm">Text Extraction Algorithm</h3>
<p>The authors demonstrate a proof-of-concept algorithm that uses regular expressions and chemical name recognition to parse plain-text mixture descriptions into structured Mixfile data. The algorithm:</p>
<ol>
<li>Applies regex rules to remove filler words and extract concentrations</li>
<li>Looks up cleaned names against a custom chemical database</li>
<li>Falls back to OPSIN for SMILES generation from chemical names</li>
<li>Generates 2D coordinates for molecular structures</li>
</ol>
<h3 id="graphical-editor">Graphical Editor</h3>
<p>An open-source editor provides:</p>
<ul>
<li>Tree-based interface for building and editing hierarchical structures</li>
<li>Chemical structure sketching and editing</li>
<li>Database lookup (e.g., PubChem integration)</li>
<li>Automatic MInChI generation</li>
<li>Import/export capabilities</li>
</ul>
<h3 id="example-use-cases">Example Use Cases</h3>
<p>The paper validates the format through real-world applications:</p>
<ul>
<li><strong>Safety compliance</strong>: Automated hazard assessment based on concentration-dependent properties (e.g., solid osmium tetroxide vs. 1% aqueous solution)</li>
<li><strong>Inventory management</strong>: Precise, searchable laboratory records</li>
<li><strong>Data extraction</strong>: Parsing vendor catalogs and safety data sheets</li>
</ul>
<h2 id="outcomes-and-future-extensibility">Outcomes and Future Extensibility</h2>
<p>The work successfully establishes the first standardized, machine-readable formats for chemical mixtures. Key achievements:</p>
<ul>
<li><strong>Comprehensive representation</strong>: Mixfile captures component identity, quantity, and hierarchy</li>
<li><strong>Canonical identification</strong>: MInChI provides compact, searchable identifiers</li>
<li><strong>Practical tooling</strong>: Open-source editor and text extraction demonstrate feasibility</li>
<li><strong>Real-world validation</strong>: Format handles diverse use cases from safety to inventory</li>
</ul>
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The authors acknowledge areas for improvement:</p>
<ul>
<li><strong>Machine learning improvements</strong>: Better text extraction using modern NLP techniques</li>
<li><strong>Extended coverage</strong>: Support for polymers, complex formulations, analytical results</li>
<li><strong>Community adoption</strong>: Integration with existing chemical databases and software</li>
</ul>
<p>The hierarchical design makes Mixfile suitable for both &ldquo;recipe&rdquo; descriptions (how to make something) and analytical results (what was found). This flexibility should help drive adoption across different use cases in chemistry and materials science.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="open-source-tooling--data">Open Source Tooling &amp; Data</h3>
<p>While the central repository focusing on validating and establishing the MInChI standard is <a href="https://github.com/IUPAC/MInChI">github.com/IUPAC/MInChI</a>, the tools and datasets actually used to develop the paper&rsquo;s proofs-of-concept are hosted elsewhere:</p>
<ul>
<li><strong>Graphical Editor &amp; App codebase</strong>: The Electron application and Mixfile handling codebase (<code>console.js</code>) can be found at <a href="https://github.com/cdd/mixtures">github.com/cdd/mixtures</a>.</li>
<li><strong>Text Extraction Data</strong>: The several thousand extracted mixture records generated through the text extraction method can be accessed inside the <code>cdd/mixtures</code> repository under <a href="https://github.com/cdd/mixtures/tree/master/reference"><code>reference/gathering.zip</code></a>.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/IUPAC/MInChI">IUPAC/MInChI</a></td>
          <td style="text-align: left">Code / Data</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Validation test suite with ~150 mixture JSON files</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/cdd/mixtures">cdd/mixtures</a></td>
          <td style="text-align: left">Code / Data</td>
          <td style="text-align: left">GPL-3.0</td>
          <td style="text-align: left">Electron-based Mixfile editor, CLI tools, and reference mixture corpus</td>
      </tr>
  </tbody>
</table>
<p>The paper was funded by NIH Grant 1R43TR002528-01. No specific hardware requirements are needed, as this is a format specification with lightweight tooling.</p>
<h3 id="algorithms">Algorithms</h3>
<p>This section provides the specific algorithmic logic, schema definitions, and standardization rules needed to replicate the Mixfile parser or MInChI generator.</p>
<h4 id="the-strict-mixfile-json-schema">The Strict Mixfile JSON Schema</h4>
<p>To implement the format, a parser must recognize these specific fields:</p>
<p><strong>Root Structure</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;header&#34;</span>: {},
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: []
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Component Fields</strong>:</p>
<ul>
<li><code>name</code>: string (required if no structure is provided)</li>
<li><code>molfile</code>: string (the primary source of truth for molecular structure)</li>
<li><code>smiles</code>, <code>inchi</code>, <code>formula</code>: derived/transient fields for convenience</li>
<li><code>quantity</code>: number OR <code>[min, max]</code> array for ranges</li>
<li><code>units</code>: string (must map to supported ontology)</li>
<li><code>relation</code>: string (e.g., <code>&quot;&gt;&quot;</code>, <code>&quot;~&quot;</code>, <code>&quot;&gt;=&quot;</code>)</li>
<li><code>ratio</code>: array of two numbers <code>[numerator, denominator]</code></li>
<li><code>identifiers</code>: database assignments (e.g., CASRN, PubChem)</li>
<li><code>links</code>: URLs relevant to the component</li>
<li><code>contents</code>: recursive array for hierarchical mixtures</li>
</ul>
<h4 id="minchi-generation-algorithm">MInChI Generation Algorithm</h4>
<p>To generate <code>MInChI=0.00.1S/...</code>, the software must follow these steps:</p>
<ol>
<li>
<p><strong>Component Layer</strong>:</p>
<ul>
<li>Calculate standard <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> for all structures in the mixture</li>
<li>Sort distinct InChIs alphabetically by the InChI string itself</li>
<li>Join with <code>&amp;</code> to form the structure layer</li>
</ul>
</li>
<li>
<p><strong>Hierarchy &amp; Concentration Layers</strong>:</p>
<ul>
<li>Traverse the Mixfile tree recursively</li>
<li><strong>Indexing</strong>: Use integer indices (1-based) referring to the sorted InChI list</li>
<li><strong>Grouping</strong>: Use <code>{}</code> to denote hierarchy branches and <code>&amp;</code> to separate nodes at the same level</li>
<li><strong>Concentration</strong>: Convert all quantities to canonical unit codes and apply scaling factors</li>
</ul>
</li>
</ol>
<h4 id="unit-standardization-table">Unit Standardization Table</h4>
<p>Replication requires mapping input units to canonical MInChI codes. The full table from the paper (Table 1) includes:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Input Unit</th>
          <th style="text-align: left">MInChI Code</th>
          <th style="text-align: left">Scale Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">%</td>
          <td style="text-align: left">pp</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">w/v%</td>
          <td style="text-align: left">wv</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">w/w%</td>
          <td style="text-align: left">wf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">v/v%</td>
          <td style="text-align: left">vf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/mol%</td>
          <td style="text-align: left">mf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/L (M)</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">mmol/L</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
      <tr>
          <td style="text-align: left">g/L</td>
          <td style="text-align: left">wv</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/kg</td>
          <td style="text-align: left">mb</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">ratio</td>
          <td style="text-align: left">vp</td>
          <td style="text-align: left">1</td>
      </tr>
  </tbody>
</table>
<h4 id="text-extraction-logic">Text Extraction Logic</h4>
<p>The paper defines a recursive procedure for parsing plain-text mixture descriptions:</p>
<ol>
<li><strong>Input</strong>: Raw text string (e.g., &ldquo;2 M acetone in water&rdquo;)</li>
<li><strong>Rule Application</strong>: Apply RegEx rules in order:
<ul>
<li><em>Remove</em>: Delete common filler words (&ldquo;solution&rdquo;, &ldquo;in&rdquo;)</li>
<li><em>Replace</em>: Substitute known variations</li>
<li><em>Concentration</em>: Extract quantities like &ldquo;2 M&rdquo;, &ldquo;97%&rdquo;</li>
<li><em>Branch</em>: Split phrases like &ldquo;A in B&rdquo; into sub-nodes</li>
</ul>
</li>
<li><strong>Lookup</strong>: Check cleaned name against a custom table (handles cases like &ldquo;xylenes&rdquo; or specific structures)</li>
<li><strong>OPSIN</strong>: If no lookup match, send to the OPSIN tool to generate SMILES from the chemical name</li>
<li><strong>Embed</strong>: If structure found, generate 2D coordinates (Molfile) via RDKit</li>
</ol>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clark, A. M., McEwen, L. R., Gedeck, P., &amp; Bunin, B. A. (2019). Capturing mixture composition: an open machine-readable format for representing mixed substances. <em>Journal of Cheminformatics</em>, <em>11</em>(1), 33. <a href="https://doi.org/10.1186/s13321-019-0357-4">https://doi.org/10.1186/s13321-019-0357-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2019)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{clark2019capturing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Capturing mixture composition: an open machine-readable format for representing mixed substances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Clark, Alex M and McEwen, Leah R and Gedeck, Peter and Bunin, Barry A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IUPAC/MInChI">Official MInChI GitHub repository</a></li>
</ul>
]]></content:encoded></item><item><title>What is Optical Chemical Structure Recognition (OCSR)?</title><link>https://hunterheidenreich.com/posts/what-is-ocsr/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/what-is-ocsr/</guid><description>A micro-review of Optical Chemical Structure Recognition (OCSR), covering rule-based systems to modern deep learning models.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Decades of chemical research, breakthroughs in medicine, and novel materials are archived in journals, patents, and textbooks.
A huge portion of this knowledge is stored as images, a format inaccessible to standard computational tools.
This imposes challenges for both data retrieval and leveraging modern computational tools to analyze and predict chemical properties, inefficiencies that compound across the literature: knowledge locked in image form is invisible to search, mining, and downstream model training.</p>
<p>This is the central challenge that <strong>Optical Chemical Structure Recognition (OCSR)</strong> aims to solve. At its heart, OCSR is to chemistry what OCR (Optical Character Recognition) is to text: a technology that teaches computers to extract chemical information directly from 2D diagrams of molecules. It&rsquo;s the bridge between a picture of a molecule and a machine-readable format like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> (Simplified Molecular Input Line Entry System) that can be stored, searched, and used to power new discoveries.</p>















<figure class="post-figure center ">
    <img src="/img/ocsr/img2smiles.webp"
         alt="The transformation from a 2D chemical structure image to a SMILES representation."
         title="The transformation from a 2D chemical structure image to a SMILES representation."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The transformation from a 2D chemical structure image to a SMILES representation.</figcaption>
    
</figure>

<p>Teaching a computer to read a chemical structure requires specialized techniques.</p>
<h2 id="the-complexity-of-chemical-graphs">The Complexity of Chemical Graphs</h2>
<p>Recognizing a molecule requires specialized techniques that extend standard Optical Character Recognition (OCR). A molecule is a <em>graph</em>: a collection of atoms (nodes) connected by bonds (edges).</p>
<blockquote>
<p>(While this simplified view excludes complex structures like coordination compounds and polymers, it provides a highly effective starting point for this discussion.)</p></blockquote>
<p>An OCSR system must overcome several hurdles:</p>
<ul>
<li><strong>Varying Styles:</strong> Chemical drawings vary widely across publications. Bond lengths, angles, and fonts can differ dramatically from one document to another.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/ocsr/acs.orglett.2c02187_1.webp"
         alt="An example from the Colored Background OSCR Benchmark, showing a complex and colorful chemical structure."
         title="An example from the Colored Background OSCR Benchmark, showing a complex and colorful chemical structure."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example from the <a href="https://huggingface.co/datasets/hheiden/Colored_Background_OCSR_benchmark">Colored Background OSCR Benchmark</a>, showing a complex and colorful chemical structure.</figcaption>
    
</figure>

<ul>
<li><strong>Image Quality:</strong> Older documents might be scanned at low resolutions, containing noise, blur, or other artifacts that make interpretation difficult.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/ocsr/2008239616_449_chem.webp"
         alt="A challenging chemical structure image from the JPO benchmark, difficult due to its low quality."
         title="A challenging chemical structure image from the JPO benchmark, difficult due to its low quality."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A challenging chemical structure image from the <a href="https://huggingface.co/datasets/hheiden/JPO_OCSR_benchmark">JPO benchmark</a>, difficult due to its low quality.</figcaption>
    
</figure>

<ul>
<li><strong>Structural Complexity:</strong> From simple rings to sprawling polymers and complex <strong>Markush structures</strong> (common in patents to represent a whole family of related compounds), the variety is immense.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/ocsr/markush.webp"
         alt="An example of a Markush structure, illustrating the complexity and variety of chemical compounds."
         title="An example of a Markush structure, illustrating the complexity and variety of chemical compounds."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example of a Markush structure, illustrating the complexity and variety of chemical compounds.</figcaption>
    
</figure>

<h2 id="the-evolution-of-ocsr">The Evolution of OCSR</h2>
<p>The quest to automate this process has evolved significantly, moving from brittle, hand-coded systems to sophisticated AI that can learn from data.</p>
<h3 id="act-1-the-rule-based-pioneers-ocr-10">Act 1: The Rule-Based Pioneers (OCR-1.0)</h3>
<p>The first OCSR systems, developed in the early 1990s, represent what we can now call the <strong>&ldquo;OCR-1.0&rdquo; era</strong>. Tools like <a href="https://pubs.acs.org/doi/10.1021/ci00008a018">Kekulé</a>, and later open-source solutions like <a href="/notes/chemistry/optical-structure-recognition/rule-based/osra/">OSRA</a> and <a href="https://github.com/ncats/molvec">MolVec</a>, operated like meticulous draftsmen. Their approach was methodical:</p>
<ol>
<li><strong>Vectorize the Image:</strong> Convert the pixel-based image into a collection of lines and shapes</li>
<li><strong>Identify Components:</strong> Use a set of hard-coded rules to classify these components. &ldquo;This thick line is a wedge bond.&rdquo; &ldquo;This group of pixels is the letter &lsquo;O&rsquo;.&rdquo;</li>
<li><strong>Reconstruct the Graph:</strong> Piece together the identified atoms and bonds into a coherent molecular graph</li>
</ol>
<p>This rule-based approach was a real first step but brittle. It struggled with the messiness of real-world documents and was expensive to maintain because each new style or error required new rules.</p>
<p>Additionally, they were designed as interactive tools to assist human experts in digitizing chemical structures.
There was always the assumption that a human would review and correct the output.</p>
<p>As a concrete case-study, consider the (reproduced) results from <a href="https://arxiv.org/abs/2411.11098">MolParser</a>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Method</th>
          <th style="text-align: center">USPTO</th>
          <th style="text-align: center">UoB</th>
          <th style="text-align: center">CLEF</th>
          <th style="text-align: center">JPO</th>
          <th style="text-align: center">ColoredBG</th>
          <th style="text-align: center">USPTO-10K</th>
          <th style="text-align: center">WildMol-10K</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Rule-based methods</strong></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
      </tr>
      <tr>
          <td style="text-align: left">OSRA 2.1 *</td>
          <td style="text-align: center">89.3</td>
          <td style="text-align: center">86.3</td>
          <td style="text-align: center"><strong>93.4</strong></td>
          <td style="text-align: center">56.3</td>
          <td style="text-align: center">5.5</td>
          <td style="text-align: center">89.7</td>
          <td style="text-align: center">26.3</td>
      </tr>
      <tr>
          <td style="text-align: left">MolVec 0.9.7 *</td>
          <td style="text-align: center">91.6</td>
          <td style="text-align: center">79.7</td>
          <td style="text-align: center">81.2</td>
          <td style="text-align: center">66.8</td>
          <td style="text-align: center">8.0</td>
          <td style="text-align: center">92.4</td>
          <td style="text-align: center">26.4</td>
      </tr>
      <tr>
          <td style="text-align: left">Imago 2.0 *</td>
          <td style="text-align: center">89.4</td>
          <td style="text-align: center">63.9</td>
          <td style="text-align: center">68.2</td>
          <td style="text-align: center">41.0</td>
          <td style="text-align: center">2.0</td>
          <td style="text-align: center">89.9</td>
          <td style="text-align: center">6.9</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Only synthetic training</strong></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
      </tr>
      <tr>
          <td style="text-align: left">Img2Mol *</td>
          <td style="text-align: center">30.0</td>
          <td style="text-align: center">68.1</td>
          <td style="text-align: center">17.9</td>
          <td style="text-align: center">16.1</td>
          <td style="text-align: center">3.5</td>
          <td style="text-align: center">33.7</td>
          <td style="text-align: center">24.4</td>
      </tr>
      <tr>
          <td style="text-align: left">MolGrapher †*</td>
          <td style="text-align: center">91.5</td>
          <td style="text-align: center"><strong>94.9</strong></td>
          <td style="text-align: center">90.5</td>
          <td style="text-align: center">67.5</td>
          <td style="text-align: center">7.5</td>
          <td style="text-align: center">93.3</td>
          <td style="text-align: center">45.5</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Real data finetuning</strong></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
      </tr>
      <tr>
          <td style="text-align: left">DECIMER 2.7 *</td>
          <td style="text-align: center">59.9</td>
          <td style="text-align: center">88.3</td>
          <td style="text-align: center">72.0</td>
          <td style="text-align: center">64.0</td>
          <td style="text-align: center">14.5</td>
          <td style="text-align: center">82.4</td>
          <td style="text-align: center">56.0</td>
      </tr>
      <tr>
          <td style="text-align: left">MolScribe *</td>
          <td style="text-align: center"><u>93.1</u></td>
          <td style="text-align: center">87.4</td>
          <td style="text-align: center">88.9</td>
          <td style="text-align: center">76.2</td>
          <td style="text-align: center">21.0</td>
          <td style="text-align: center"><strong>96.0</strong></td>
          <td style="text-align: center">66.4</td>
      </tr>
      <tr>
          <td style="text-align: left">MolParser-Tiny (Ours)</td>
          <td style="text-align: center">93.0</td>
          <td style="text-align: center">91.6</td>
          <td style="text-align: center"><u>91.0</u></td>
          <td style="text-align: center">75.6</td>
          <td style="text-align: center"><strong>58.5</strong></td>
          <td style="text-align: center">89.5</td>
          <td style="text-align: center">73.1</td>
      </tr>
      <tr>
          <td style="text-align: left">MolParser-Small (Ours)</td>
          <td style="text-align: center"><strong>93.1</strong></td>
          <td style="text-align: center">91.1</td>
          <td style="text-align: center">90.8</td>
          <td style="text-align: center">76.2</td>
          <td style="text-align: center">57.0</td>
          <td style="text-align: center"><u>94.8</u></td>
          <td style="text-align: center">76.3</td>
      </tr>
      <tr>
          <td style="text-align: left">MolParser-Base (Ours)</td>
          <td style="text-align: center">93.0</td>
          <td style="text-align: center"><u>91.8</u></td>
          <td style="text-align: center">90.7</td>
          <td style="text-align: center"><strong>78.9</strong></td>
          <td style="text-align: center">57.0</td>
          <td style="text-align: center">94.5</td>
          <td style="text-align: center"><strong>76.9</strong></td>
      </tr>
  </tbody>
</table>
<blockquote>
<p><strong>Table 2. Comparison of our method with existing OCSR models.</strong> We report the accuracy. We use <strong>bold</strong> to indicate the best performance and <u>underline</u> to denote the second-best performance. *: re-implemented results. †: results from original publications.</p></blockquote>
<p>In this table, we see that the rule-based methods (OSRA, MolVec, Imago) perform reasonably well on cleaner datasets like USPTO and UoB but falter on more challenging ones like JPO and ColoredBG. Modern AI-based methods (MolGrapher, DECIMER, MolScribe, MolParser) improve most on the hardest benchmarks (like JPO and ColoredBG), especially when fine-tuned on real data, while the rule-based tools still do reasonably well on cleaner sets like USPTO and UoB.</p>
<h3 id="act-2-the-ai-fork-in-the-road-2010s-2020s">Act 2: The AI Fork in the Road (2010s-2020s)</h3>
<p>The rise of deep learning in the 2010s brought new paradigms that could learn from data. Here, the field split into two distinct paths.</p>
<h4 id="path-a-the-rise-of-the-specialists-graph-based-ai">Path A: The Rise of the Specialists (Graph-Based AI)</h4>
<p>Some models replaced the hard-coded rules with AI components. Systems like <a href="https://github.com/DS4SD/MolGrapher">MolGrapher</a> and <a href="https://github.com/thomas0809/MolScribe">MolScribe</a> use a two-stage process:</p>
<ul>
<li><strong>Atom Detection:</strong> A neural network first identifies all the atoms in the image</li>
<li><strong>Bond Prediction:</strong> A second process then predicts the connections (bonds) between those atoms to form the final graph</li>
</ul>
<p>These are highly specialized tools, trained specifically for the task of building a molecular graph.</p>
<h4 id="path-b-the-rise-of-the-generalists-lvlms">Path B: The Rise of the Generalists (LVLMs)</h4>
<p>Another, more direct method treats OCSR as an image captioning task. This approach aligns with the broader trend of <strong>Large Vision-Language Models (LVLMs)</strong>: massive, general-purpose AIs like GPT-4V. Models like <a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER</a> and <a href="/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/">MolParser</a> look at a molecular image and directly generate its textual representation, most commonly a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES string</a>. This direct, end-to-end approach is powerful, though it requires enormous datasets to train effectively.</p>
<h2 id="the-next-frontier-the-ocr-20-vision-2024">The Next Frontier: The OCR-2.0 Vision (2024+)</h2>
<p>Recently, a proposal has emerged that charts a third path forward: <strong>OCR-2.0</strong>. This vision, proposed by <a href="https://arxiv.org/abs/2409.01704">Wei et al.</a> in 2024, argues for a new class of models that combine the best of both worlds. An OCR-2.0 model should be:</p>
<ol>
<li><strong>End-to-End:</strong> A single, unified model that simplifies maintenance</li>
<li><strong>Efficient &amp; Low-Cost:</strong> A specialized, highly efficient perception engine. The paper argues that using a giant LVLM for a pure recognition task is often inefficient</li>
<li><strong>Versatile:</strong> Capable of handling diverse artificial optical signals</li>
</ol>
<p>The flagship model for this theory is <a href="https://huggingface.co/stepfun-ai/GOT-OCR2_0">GOT (General OCR Theory)</a>. It&rsquo;s a single, unified model that can read an image and output structured text for a wide variety of inputs. It can translate a molecular diagram into a SMILES string, transcribe sheet music into musical notation, parse a bar chart into a data table, and describe a geometric shape using code.</p>
<p>This demonstrates that OCSR can be integrated into broader systems for processing human visual information. The same OCR-2.0 philosophy extends beyond chemistry: <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>, for instance, applies grounded vision-language modeling to general document OCR, producing both text transcriptions and bounding-box outputs from a single model.</p>
<h2 id="pushing-the-boundaries-of-recognition">Pushing the Boundaries of Recognition</h2>
<p>OCR-2.0 models like GOT push for <em>breadth</em>, and other state-of-the-art research deepens the <em>depth</em> of understanding for the uniquely complex task of chemical recognition.</p>
<h3 id="deepening-reasoning-with-a-visual-chain-of-thought">Deepening Reasoning with a &ldquo;Visual Chain of Thought&rdquo;</h3>
<p>The <a href="https://arxiv.org/abs/2506.07553">GTR-Mol-VLM</a> model makes recognition more intelligent by mimicking how a person might analyze a complex diagram. The model traverses the molecule step-by-step, predicting an atom, then its bond, then the next atom, and so on. This &ldquo;Visual Chain of Thought&rdquo; improves accuracy, especially for complex molecules. It also faithfully recognizes abbreviations like &ldquo;Ph&rdquo; as single units, better representing the source image.</p>
<h3 id="deepening-application-with-visual-fingerprinting">Deepening Application with &ldquo;Visual Fingerprinting&rdquo;</h3>
<p><a href="https://link.springer.com/article/10.1186/s13321-025-01091-4">Subgrapher</a> rethinks the end goal. Many applications (like searching a patent database) require only the identification of specific molecular features. Subgrapher detects key functional groups and backbones directly from the image and creates a visual fingerprint. This approach mirrors identifying a person by key features (&ldquo;has glasses, a mustache&rdquo;), making it well-suited to finding matches in a large set.</p>
<h2 id="why-it-matters">Why It Matters</h2>
<p>The evolution of OCSR directly enables practical scientific advancements. This technology is a critical enabler for the future of science.</p>
<h3 id="searching-past-knowledge">Searching Past Knowledge</h3>
<p>OCSR digitizes decades of research from patents and journals, making it searchable and accessible for data mining. Imagine being able to search through every molecule ever published with a simple query. Or consider the practical impact: pharmaceutical companies can now automatically scan thousands of patent documents to ensure their new drug candidates don&rsquo;t infringe existing intellectual property, a process that previously required substantial manual review by patent analysts.</p>
<h3 id="accelerating-drug-discovery">Accelerating Drug Discovery</h3>
<p>By extracting vast datasets of molecules, scientists can train AI models to predict drug efficacy and toxicity, speeding up the discovery pipeline. The more molecular data we can digitize, the better our predictive models become.</p>
<h3 id="building-universal-document-intelligence">Building Universal Document Intelligence</h3>
<p>OCSR contributes to building AI systems capable of processing complex human documents. A scientific paper is a mix of text, equations, charts, tables, and molecular diagrams. Unified OCR-2.0 models are the key to making all of this knowledge searchable holistically.</p>
<h2 id="looking-forward">Looking Forward</h2>
<p>The goal is a loop where scientific knowledge, regardless of how it is stored, can be fed back into systems that read, search, and reason over it.</p>
<p>From the rule-based systems of the 1990s to today&rsquo;s models that read many printed diagrams reliably (though hard cases like low-quality scans and Markush structures remain open), OCSR has improved a great deal. As accuracy, efficiency, and breadth improve, more of the chemical literature becomes machine-readable.</p>
<p>This entire process begins with teaching a computer how to read a picture.</p>
]]></content:encoded></item><item><title>MolParser: End-to-End Molecular Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</guid><description>MolParser converts molecular images from scientific documents to machine-readable formats using end-to-end learning with Extended SMILES representation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Wang, J., Cai, X., Chen, S., Yang, S., Tao, H., Wang, N., Yao, L., Zhang, L., &amp; Ke, G. (2025). MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild. In <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em> (pp. 24528-24538). <a href="https://doi.org/10.48550/arXiv.2411.11098">https://doi.org/10.48550/arXiv.2411.11098</a></p>
<p><strong>Publication</strong>: ICCV 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/">MolParser-7M Dataset</a> - 7M+ image-text pairs for OCSR</li>
<li><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M on HuggingFace</a> - Dataset repository</li>
<li><a href="https://huggingface.co/UniParser/MolDet">MolDet YOLO Detector</a> - Object detection model for extracting molecular images from documents</li>
</ul>
<h2 id="contribution-end-to-end-ocsr-and-real-world-resources">Contribution: End-to-End OCSR and Real-World Resources</h2>
<p>This is primarily a <strong>Method</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a>), with a significant secondary contribution as a <strong>Resource</strong> paper.</p>
<p><strong>Method contribution ($\Psi_{\text{Method}}$)</strong>: The paper proposes a novel end-to-end architecture combining a Swin Transformer encoder with a BART decoder, and crucially introduces <strong>Extended SMILES (E-SMILES)</strong>, a new syntactic extension to standard SMILES notation that enables representation of Markush structures, abstract rings, and variable attachment points found in patents. The work validates this method through extensive ablation studies, achieving the highest accuracy among tested OCSR systems on WildMol-10k (76.9%).</p>
<p><strong>Resource contribution ($\Psi_{\text{Resource}}$)</strong>: The paper introduces <strong>MolParser-7M</strong>, the largest OCSR dataset to date (7.7M image-text pairs), and <strong>WildMol</strong>, a challenging benchmark of 20,000 manually annotated real-world molecular images. The construction of these datasets through an active learning data engine with human-in-the-loop validation represents significant infrastructure that enables future OCSR research.</p>
<h2 id="motivation-extracting-chemistry-from-real-world-documents">Motivation: Extracting Chemistry from Real-World Documents</h2>
<p>The motivation stems from a practical problem in chemical informatics: vast amounts of chemical knowledge remain embedded in unstructured formats. Patents, research papers, and legacy documents depict molecular structures as images. This creates a barrier for large-scale data analysis and prevents Large Language Models from effectively understanding scientific literature in chemistry and drug discovery.</p>
<p>Existing OCSR methods struggle with real-world documents for two fundamental reasons:</p>
<ol>
<li><strong>Representational limitations</strong>: Standard SMILES notation cannot capture complex structural templates like <strong>Markush structures</strong>, which are ubiquitous in patents. These structures define entire families of compounds using variable R-groups and abstract patterns, making them essential for intellectual property but impossible to represent with conventional methods.</li>
<li><strong>Data distribution mismatch</strong>: Real-world molecular images suffer from noise, inconsistent drawing styles, variable resolution, and interference from surrounding text. Models trained exclusively on clean, synthetically rendered molecules fail to generalize when applied to actual documents.</li>
</ol>
<h2 id="novelty-e-smiles-and-human-in-the-loop-data-engine">Novelty: E-SMILES and Human-in-the-Loop Data Engine</h2>
<p>The novelty lies in a comprehensive system that addresses both representation and data quality challenges through four integrated contributions:</p>
<ol>
<li>
<p><strong>Extended SMILES (E-SMILES)</strong>: A backward-compatible extension to the SMILES format that can represent complex structures previously inexpressible in standard chemical notations. E-SMILES uses a separator token <code>&lt;sep&gt;</code> to delineate the core molecular structure from supplementary annotations. These annotations employ XML-like tags to encode Markush structures, polymers, abstract rings, and other complex patterns. Critically, the core structure remains parseable by standard cheminformatics tools like RDKit, while the supplementary tags provide a structured, LLM-friendly format for capturing edge cases.</p>
</li>
<li>
<p><strong>MolParser-7M Dataset</strong>: The largest publicly available OCSR dataset, containing over 7 million image-text pairs. What distinguishes this dataset is both its scale and its composition. It includes 400,000 &ldquo;in-the-wild&rdquo; samples (molecular images extracted from actual patents and scientific papers) and subsequently curated by human annotators. This real-world data addresses the distribution mismatch problem directly by exposing the model to the same noise, artifacts, and stylistic variations it encounters in production.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Data Engine</strong>: A systematic approach to collecting and annotating real-world training data. The pipeline begins with an object detection model that extracts molecular images from over a million PDF documents. An active learning algorithm then identifies the most informative samples (those where the current model struggles) for human annotation. The model pre-annotates these images, and human experts review and correct them. This creates an iterative improvement cycle: annotate, train, identify new challenging cases, repeat.</p>
</li>
<li>
<p><strong>Efficient End-to-End Architecture</strong>: The model treats OCSR as an image captioning problem. A Swin-Transformer vision encoder extracts visual features, a simple MLP compresses them, and a BART decoder generates the E-SMILES string autoregressively. The model minimizes the standard negative log-likelihood of the target E-SMILES token sequence $y$ given the sequence history and input image $x$:</p>
</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x; \theta)
\end{aligned}
$$</p>
<p>The training strategy employs curriculum learning, starting with simple molecules and gradually introducing complexity and heavier data augmentation.</p>
<h2 id="experimental-setup-two-stage-training-and-benchmarking">Experimental Setup: Two-Stage Training and Benchmarking</h2>
<p>The evaluation focused on demonstrating that MolParser generalizes to real-world documents:</p>
<ol>
<li>
<p><strong>Two-Stage Training Protocol</strong>: The model underwent a systematic training process:</p>
<ul>
<li><strong>Pre-training</strong>: Initial training on millions of synthetic molecular images using curriculum learning. The curriculum progresses from simple molecules to complex structures while gradually increasing data augmentation intensity (blur, noise, perspective transforms).</li>
<li><strong>Fine-tuning</strong>: Subsequent training on 400,000 curated real-world samples extracted from patents and papers. This fine-tuning phase is critical for adapting to the noise and stylistic variations of actual documents.</li>
</ul>
</li>
<li>
<p><strong>Benchmark Evaluation</strong>: The model was evaluated on multiple standard OCSR benchmarks to establish baseline performance on clean data. These benchmarks test recognition accuracy on well-formatted molecular diagrams.</p>
</li>
<li>
<p><strong>Real-World Document Analysis</strong>: The critical test involved applying MolParser to molecular structures extracted directly from scientific documents. This evaluation measures the gap between synthetic benchmark performance and real-world applicability (the core problem the paper addresses).</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Experiments isolating the contribution of each component:</p>
<ul>
<li>The impact of real-world training data versus synthetic-only training</li>
<li>The effectiveness of curriculum learning versus standard training</li>
<li>The value of the human-in-the-loop annotation pipeline versus random sampling</li>
<li>The necessity of E-SMILES extensions for capturing complex structures</li>
</ul>
</li>
</ol>
<h2 id="outcomes-and-empirical-findings">Outcomes and Empirical Findings</h2>
<ul>
<li>
<p><strong>Performance on Benchmarks</strong>: MolParser achieves competitive results on standard benchmarks and the best performance on real-world documents. On clean benchmarks like USPTO-10K, MolScribe (96.0%) slightly edges MolParser-Base (94.5%), but on WildMol-10k, MolParser-Base achieved 76.9% accuracy, significantly outperforming MolScribe (66.4%) and MolGrapher (45.5%). This gap validates the core hypothesis that training on actual document images is essential for practical deployment.</p>
</li>
<li>
<p><strong>Real-World Data is Critical</strong>: Models trained exclusively on synthetic data show substantial performance degradation when applied to real documents. The 400,000 in-the-wild training samples bridge this gap, demonstrating that data quality and distribution matching matter as much as model architecture. Ablation experiments showed that pretraining on MolParser-7M synthetic data alone achieved 51.9% accuracy on WildMol, while adding real-world fine-tuning raised this to 76.9%. Using the smaller MolGrapher-300k synthetic dataset without fine-tuning yielded only 22.4%.</p>
</li>
<li>
<p><strong>E-SMILES Enables Broader Coverage</strong>: The extended representation successfully captures molecular structures that were previously inexpressible, particularly Markush structures from patents. This expands the scope of what can be automatically extracted from chemical literature to include patent-style structural templates.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Scales Efficiently</strong>: The active learning pipeline reduces annotation time by approximately 90% while maintaining high quality. This approach makes it feasible to curate large-scale, high-quality datasets for specialized domains where expert knowledge is expensive.</p>
</li>
<li>
<p><strong>Speed and Accuracy</strong>: The end-to-end architecture achieves both high accuracy and fast inference, making it practical for large-scale document processing. MolParser-Base processes 40 images per second on RTX 4090D, while the Tiny variant achieves 131 FPS. The direct image-to-text approach avoids the error accumulation of multi-stage pipelines.</p>
</li>
<li>
<p><strong>Downstream Applications</strong>: The Swin Transformer encoder, once trained on MolParser-7M, serves as an effective molecular fingerprint for property prediction. Paired with a simple two-layer MLP on MoleculeNet benchmarks, MolParser-pretrained features achieved an average ROC-AUC of 73.7% across five tasks, compared to 68.9% for ImageNet-pretrained Swin-T features. The authors also demonstrate chemical reaction parsing by feeding MolDet detections and MolParser E-SMILES into GPT-4o.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge that molecular chirality is not yet fully exploited by the system. The E-SMILES format does not currently support dashed abstract rings, coordination bonds, special symbol Markush patterns, or replication of long structural segments. Additionally, scaling up the volume of real annotated training data could further improve performance.</p>
</li>
</ul>
<p>The work establishes that practical OCSR requires more than architectural innovations. It demands careful attention to data quality, representation design, and the distribution mismatch between synthetic training data and real-world applications. The combination of E-SMILES, the MolParser-7M dataset, and the human-in-the-loop data engine provides a template for building reliable vision systems in scientific domains where clean training data is scarce but expert knowledge is available.</p>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>7.7M image-SMILES pairs for OCSR pretraining and fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet</a></td>
          <td>Model</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>YOLO11-based molecule detector for PDF documents</td>
      </tr>
  </tbody>
</table>
<p>No official source code repository has been released. Model weights for MolParser itself are not publicly available as of the dataset release.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is split into a massive synthetic pre-training set and a curated fine-tuning set.</p>
<p><strong>Training Data Composition (MolParser-7M)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Name</th>
          <th>Size</th>
          <th>Composition / Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pre-training</strong></td>
          <td>MolParser-7M (Synthetic)</td>
          <td>~7.7M</td>
          <td><strong>Markush-3M</strong> (40%), <strong>ChEMBL-2M</strong> (27%), <strong>Polymer-1M</strong> (14%), PAH-600k (8%), BMS-360k (5%), MolGrapher-300K (4%), Pauling-100k (2%). Generated via RDKit/Indigo with randomized styles.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-SFT-400k</td>
          <td>400k</td>
          <td>Real images from patents/papers selected via active learning (confidence filtering 0.6-0.9) and manually annotated. 66% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-Gen-200k</td>
          <td>200k</td>
          <td>Subset of synthetic data kept to prevent catastrophic forgetting. 32% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Handwrite-5k</td>
          <td>5k</td>
          <td>Handwritten molecules from Img2Mol to support hand-drawn queries. 1% of fine-tuning mix.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Sources</strong>: 1.2M patents and scientific papers (PDF documents)</li>
<li><strong>Extraction</strong>: MolDet (YOLO11-based detector) identified ~20M molecular images, deduplicated to ~4M candidates</li>
<li><strong>Selection</strong>: Active learning ensemble (5-fold models) identified high-uncertainty samples for annotation</li>
<li><strong>Annotation</strong>: Human experts corrected model pre-annotations (90% time savings vs. from-scratch annotation)</li>
</ul>
<p><strong>Test Benchmarks</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-10k</td>
          <td>10,000</td>
          <td>Standard synthetic benchmark</td>
      </tr>
      <tr>
          <td>Maybridge UoB</td>
          <td>-</td>
          <td>Synthetic molecules</td>
      </tr>
      <tr>
          <td>CLEF-2012</td>
          <td>-</td>
          <td>Patent images</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>-</td>
          <td>Japanese patent office</td>
      </tr>
      <tr>
          <td>ColoredBG</td>
          <td>-</td>
          <td>Colored background molecules</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td>10,000</td>
          <td>Ordinary molecules cropped from real PDFs (new)</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k-M</strong></td>
          <td>10,000</td>
          <td>Markush structures (significantly harder, new)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Extended SMILES (E-SMILES) Encoding</strong>:</p>
<ul>
<li><strong>Format</strong>: <code>SMILES&lt;sep&gt;EXTENSION</code> where <code>&lt;sep&gt;</code> separates core structure from supplementary annotations</li>
<li><strong>Extensions use XML-like tags</strong>:
<ul>
<li><code>&lt;a&gt;index:group&lt;/a&gt;</code> for substituents/variable groups (Markush structures)</li>
<li><code>&lt;r&gt;</code> for groups connected at any ring position</li>
<li><code>&lt;c&gt;</code> for abstract rings</li>
<li><code>&lt;dum&gt;</code> for connection points</li>
</ul>
</li>
<li><strong>Backward compatible</strong>: Core SMILES parseable by RDKit; extensions provide structured format for edge cases</li>
</ul>
<p><strong>Curriculum Learning Strategy</strong>:</p>
<ul>
<li><strong>Phase 1</strong>: No augmentation, simple molecules (&lt;60 tokens)</li>
<li><strong>Phase 2</strong>: Gradually increase augmentation intensity and sequence length</li>
<li>Progressive complexity allows stable training on diverse molecular structures</li>
</ul>
<p><strong>Active Learning Data Selection</strong>:</p>
<ol>
<li>Train 5 model folds on current dataset</li>
<li>Compute pairwise Tanimoto similarity of predictions on candidate images</li>
<li>Select samples with confidence scores <strong>0.6-0.9</strong> for human review (highest learning value)</li>
<li>Human experts correct model pre-annotations</li>
<li>Iteratively expand training set with hard samples</li>
</ol>
<p><strong>Data Augmentations</strong>:</p>
<ul>
<li>RandomAffine (rotation, scale, translation)</li>
<li>JPEGCompress (compression artifacts)</li>
<li>InverseColor (color inversion)</li>
<li>SurroundingCharacters (text interference)</li>
<li>RandomCircle (circular artifacts)</li>
<li>ColorJitter (brightness, contrast variations)</li>
<li>Downscale (resolution reduction)</li>
<li>Bounds (boundary cropping variations)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture follows a standard <strong>Image Captioning</strong> (Encoder-Decoder) paradigm.</p>
<p><strong>Architecture Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Vision Encoder</strong></td>
          <td>Swin Transformer (ImageNet pretrained)</td>
      </tr>
      <tr>
          <td>- Tiny variant</td>
          <td>66M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Small variant</td>
          <td>108M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Base variant</td>
          <td>216M parameters, $384 \times 384$ input</td>
      </tr>
      <tr>
          <td><strong>Connector</strong></td>
          <td>2-layer MLP reducing channel dimension by half</td>
      </tr>
      <tr>
          <td><strong>Text Decoder</strong></td>
          <td>BART-Decoder (12 layers, 16 attention heads)</td>
      </tr>
  </tbody>
</table>
<p><strong>Training Configuration</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Pre-training</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Hardware</strong></td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
      </tr>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>AdamW</td>
          <td>AdamW</td>
      </tr>
      <tr>
          <td><strong>Learning Rate</strong></td>
          <td>$1 \times 10^{-4}$</td>
          <td>$5 \times 10^{-5}$</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong></td>
          <td>$1 \times 10^{-2}$</td>
          <td>$1 \times 10^{-2}$</td>
      </tr>
      <tr>
          <td><strong>Scheduler</strong></td>
          <td>Cosine with warmup</td>
          <td>Cosine with warmup</td>
      </tr>
      <tr>
          <td><strong>Epochs</strong></td>
          <td>20</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Label Smoothing</strong></td>
          <td>0.01</td>
          <td>0.005</td>
      </tr>
  </tbody>
</table>
<p><strong>Curriculum Learning Schedule</strong> (Pre-training):</p>
<ul>
<li>Starts with simple molecules (&lt;60 tokens, no augmentation)</li>
<li>Gradually adds complexity and augmentation (blur, noise, perspective transforms)</li>
<li>Enables stable learning across diverse molecular structures</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>: Exact match accuracy on predicted E-SMILES strings (molecule-level exact match)</p>
<p><strong>Key Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MolParser-Base</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td><strong>76.9%</strong></td>
          <td>66.4%</td>
          <td>45.5%</td>
          <td>Real-world patent/paper crops</td>
      </tr>
      <tr>
          <td><strong>USPTO-10k</strong></td>
          <td>94.5%</td>
          <td><strong>96.0%</strong></td>
          <td>93.3%</td>
          <td>Synthetic benchmark</td>
      </tr>
      <tr>
          <td><strong>Throughput (FPS)</strong></td>
          <td><strong>39.8</strong></td>
          <td>16.5</td>
          <td>2.2</td>
          <td>Measured on RTX 4090D</td>
      </tr>
  </tbody>
</table>
<p><strong>Additional Performance</strong>:</p>
<ul>
<li>MolParser-Tiny: 131 FPS on RTX 4090D (66M params)</li>
<li>Real-world vs. synthetic gap: Fine-tuning on MolParser-SFT-400k closed the performance gap between clean benchmarks and in-the-wild documents</li>
</ul>
<p><strong>Ablation Findings</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Real-world training data</td>
          <td>Fine-tuning on real data raised accuracy from 51.9% to 76.9% on WildMol-10k</td>
      </tr>
      <tr>
          <td>Curriculum learning</td>
          <td>Augmentation alone raised WildMol-10k from 40.1% to 69.5%; adding curriculum learning further raised it to 76.9%</td>
      </tr>
      <tr>
          <td>Active learning selection</td>
          <td>More effective than random sampling for annotation budget</td>
      </tr>
      <tr>
          <td>E-SMILES extensions</td>
          <td>Essential for Markush structure recognition (impossible with standard SMILES)</td>
      </tr>
      <tr>
          <td>Dataset scale</td>
          <td>Larger pre-training dataset (7M vs 300k) improved WildMol-10k accuracy from 22.4% to 51.9% before fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 8x NVIDIA RTX 4090D GPUs</li>
<li><strong>Inference</strong>: Single RTX 4090D sufficient for real-time processing</li>
<li><strong>Training time</strong>: 20 epochs pre-training + 4 epochs fine-tuning (specific duration not reported)</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2025molparser,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Xi and Wang, Jiankun and Cai, Xiaochen and Chen, Shangqian and Yang, Shuwen and Tao, Haoyi and Wang, Nan and Yao, Lin and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24528--24538}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2411.11098}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2411.11098}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser-7M &amp; WildMol: Large-Scale OCSR Datasets</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</guid><description>MolParser-7M is the largest open-source OCSR dataset with 7.7M image-SMILES pairs including 400k real-world annotated samples.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/molparser-markush-example.webp"
         alt="Example of a complex Markush structure"
         title="Example of a complex Markush structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example of a complex Markush structure that can be represented by the E-SMILES format but not by standard SMILES or FG-SMILES.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/molparser-low-quality-example.webp"
         alt="Sample from the WildMol benchmark"
         title="Sample from the WildMol benchmark"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A sample from the WildMol benchmark, showing a low-quality, noisy molecular image cropped from real-world literature that challenges OCSR systems.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/molparser-colored-example.webp"
         alt="Colored molecule with annotations"
         title="Colored molecule with annotations"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser-7M (Training Set)</strong></td>
          <td>7,740,871</td>
          <td>A large-scale dataset for training OCSR models, split into pre-training and fine-tuning stages.</td>
      </tr>
      <tr>
          <td><strong>WildMol (Test Set)</strong></td>
          <td>20,000</td>
          <td>A benchmark of 20,000 human-annotated samples cropped from real PDF files to evaluate OCSR models in &lsquo;in-the-wild&rsquo; scenarios. Comprises WildMol-10k (10k ordinary molecules) and WildMol-10k-M (10k Markush structures).</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="wildmol-10k-accuracy">WildMol-10K Accuracy<a hidden class="anchor" aria-hidden="true" href="#wildmol-10k-accuracy">#</a></h3>
    <p class="benchmark-description">Evaluation of OCSR models on 10,000 real-world molecular images cropped from scientific literature and patents</p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Accuracy (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>MolParser-Base</strong><br><small>End-to-end visual recognition trained on MolParser-7M</small>
          </td>
          <td>76.9</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>MolScribe</strong><br><small>Transformer-based OCSR system</small>
          </td>
          <td>66.4</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>DECIMER 2.7</strong><br><small>Deep learning for chemical image recognition</small>
          </td>
          <td>56</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>MolGrapher</strong><br><small>Graph-based molecular structure recognition</small>
          </td>
          <td>45.5</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>MolVec 0.9.7</strong><br><small>Vector-based structure recognition</small>
          </td>
          <td>26.4</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>OSRA 2.1</strong><br><small>Optical Structure Recognition Application</small>
          </td>
          <td>26.3</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Img2Mol</strong><br><small>Image-to-molecule translation</small>
          </td>
          <td>24.4</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Imago 2.0</strong><br><small>Chemical structure recognition toolkit</small>
          </td>
          <td>6.9</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="key-contribution">Key Contribution</h2>
<p>Introduces MolParser-7M, the largest open-source Optical Chemical Structure Recognition (OCSR) dataset, uniquely combining diverse synthetic data with a large volume of manually-annotated, &ldquo;in-the-wild&rdquo; images from real scientific documents to improve model robustness. Also introduces WildMol, a new challenging benchmark for evaluating OCSR performance on real-world data, including Markush structures.</p>
<h2 id="overview">Overview</h2>
<p>The MolParser project addresses the challenge of recognizing molecular structures from images found in real-world scientific documents. Unlike existing OCSR datasets that rely primarily on synthetically generated images, MolParser-7M incorporates 400,000 manually annotated images cropped from actual patents and scientific papers, making it the first large-scale dataset to bridge the gap between synthetic training data and real-world deployment scenarios.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Largest open-source OCSR dataset with over 7.7 million pairs</li>
<li>The only large-scale OCSR training set that includes a significant amount (400k) of &ldquo;in-the-wild&rdquo; data cropped from real patents and literature</li>
<li>High diversity of molecular structures from numerous sources (PubChem, ChEMBL, polymers, etc.)</li>
<li>Introduces the WildMol benchmark for evaluating performance on challenging, real-world data, including Markush structures</li>
<li>The &ldquo;in-the-wild&rdquo; fine-tuning data (MolParser-SFT-400k) was curated via an efficient active learning data engine with human-in-the-loop validation</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The E-SMILES format cannot represent certain complex cases, such as coordination bonds, dashed abstract rings, Markush structures depicted with special patterns, and replication of long structural segments on the skeleton</li>
<li>The model and data do not yet fully exploit molecular chirality, which is critical for chemical properties</li>
<li>Performance could be further improved by scaling up the amount of real annotated training data</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="synthetic-data-generation">Synthetic Data Generation</h3>
<p>To ensure diversity, molecular structures were collected from databases like ChEMBL, PubChem, and Kaggle BMS. A significant number of Markush, polymer, and fused-ring structures were also randomly generated. Images were rendered using RDKit and epam.indigo with randomized parameters (e.g., bond width, font size, rotation) to increase visual diversity. The pretraining dataset is composed of the following subsets:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Markush-3M</td>
          <td>40%</td>
          <td>Random groups replacement from PubChem</td>
      </tr>
      <tr>
          <td>ChEMBL-2M</td>
          <td>27%</td>
          <td>Molecules selected from ChEMBL</td>
      </tr>
      <tr>
          <td>Polymer-1M</td>
          <td>14%</td>
          <td>Randomly generated polymer molecules</td>
      </tr>
      <tr>
          <td>PAH-600k</td>
          <td>8%</td>
          <td>Randomly generated fused-ring molecules</td>
      </tr>
      <tr>
          <td>BMS-360k</td>
          <td>5%</td>
          <td>Molecules with long carbon chains from BMS</td>
      </tr>
      <tr>
          <td>MolGrapher-300K</td>
          <td>4%</td>
          <td>Training data from MolGrapher</td>
      </tr>
      <tr>
          <td>Pauling-100k</td>
          <td>2%</td>
          <td>Pauling-style images drawn using epam.indigo</td>
      </tr>
  </tbody>
</table>
<h3 id="in-the-wild-data-engine-molparser-sft-400k">In-the-Wild Data Engine (MolParser-SFT-400k)</h3>
<p>A YOLO11 object detection model (MolDet) located and cropped over 20 million molecule images from 1.22 million real PDFs (patents and papers). After de-duplication via p-hash similarity, 4 million unique images remained.</p>
<p>An active learning algorithm was used to select the most informative samples for annotation, targeting images where an ensemble of 5-fold models showed moderate confidence (0.6-0.9 Tanimoto similarity), indicating they were challenging but learnable.</p>
<p>This active learning approach with model pre-annotations reduced manual annotation time per molecule to 30 seconds, approximately 90% savings compared to annotating from scratch. In the final fine-tuning dataset, 56.04% of annotations directly utilized raw model pre-annotations, 20.97% passed review after a single manual correction, 13.87% were accepted after a second round of annotation, and 9.13% required three or more rounds.</p>
<p>The fine-tuning dataset is composed of:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-SFT-400k</td>
          <td>66%</td>
          <td>Manually annotated data obtained via data engine</td>
      </tr>
      <tr>
          <td>MolParser-Gen-200k</td>
          <td>32%</td>
          <td>Synthetic data selected from pretraining stage</td>
      </tr>
      <tr>
          <td>Handwrite-5k</td>
          <td>1%</td>
          <td>Handwritten molecules selected from Img2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="e-smiles-specification">E-SMILES Specification</h3>
<p>To accommodate complex patent structures that standard SMILES cannot support, the authors introduced an Extended SMILES format (<code>SMILES&lt;sep&gt;EXTENSION</code>). The <code>EXTENSION</code> component uses XML-like tokens to manage complexities:</p>
<ul>
<li><code>&lt;a&gt;...&lt;/a&gt;</code> encapsulates Markush R-groups and abbreviation groups.</li>
<li><code>&lt;r&gt;...&lt;/r&gt;</code> denotes ring attachments with uncertainty positions.</li>
<li><code>&lt;c&gt;...&lt;/c&gt;</code> defines abstract rings.</li>
<li><code>&lt;dum&gt;</code> identifies a connection point.</li>
</ul>
<p>This format enables Markush-molecule matching and LLM integration, while retaining RDKit compatibility for the standard SMILES portion.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>Training and test data on HuggingFace. SFT subset is partially released.</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet (YOLO11)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Molecule detection model on HuggingFace</td>
      </tr>
      <tr>
          <td><a href="https://ocsr.dp.tech/">MolParser Demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online OCSR demo using MolParser-Base</td>
      </tr>
  </tbody>
</table>
<p>The dataset is publicly available on HuggingFace under a CC-BY-NC-SA-4.0 (non-commercial) license. The MolParser-SFT-400k subset is only partially released. The YOLO11-based MolDet detection model is also available on HuggingFace. No public code repository is provided for the MolParser recognition model itself. All experiments were conducted on 8 NVIDIA RTX 4090D GPUs, and throughput benchmarks were measured on a single RTX 4090D GPU.</p>
]]></content:encoded></item><item><title>ZINC-22: A Multi-Billion Scale Database for Ligand Discovery</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/zinc-22/</link><pubDate>Sat, 27 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/zinc-22/</guid><description>The ZINC-22 dataset provides over 37 billion make-on-demand molecules enabling virtual screening and modern drug discovery.</description><content:encoded><![CDATA[<h2 id="key-contribution-scaling-make-on-demand-libraries">Key Contribution: Scaling Make-on-Demand Libraries</h2>
<p>ZINC-22 addresses the critical infrastructure challenges of managing multi-billion-scale libraries of make-on-demand chemical compounds through a federated database architecture, the CartBlanche web interface, and cloud distribution systems that enable modern virtual screening.</p>
<h2 id="overview">Overview</h2>
<p>ZINC-22 is a multi-billion scale public database of commercially available chemical compounds designed for virtual screening. It contains over 37 billion make-on-demand molecules and utilizes a distributed infrastructure capable of managing database indexing limits. For structural biology pipelines, it provides 4.5 billion ready-to-dock 3D conformations alongside pre-calculated pH-specific protonation states, tautomers, and AMSOL partial charges.</p>
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/zinc-22-sample.webp"
         alt="ZINC-22&#39;s 2D Tranche Browser"
         title="ZINC-22&#39;s 2D Tranche Browser"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">ZINC-22&rsquo;s 2D Tranche Browser showing the organization of 37.2 billion molecules by physicochemical properties</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Database</strong></td>
          <td>37B+</td>
          <td>Complete 2D chemical structures from make-on-demand catalogs (Enamine REAL, Enamine REAL Space, WuXi GalaXi, Mcule Ultimate)</td>
      </tr>
      <tr>
          <td><strong>3D Database</strong></td>
          <td>4.5B+</td>
          <td>Ready-to-dock 3D conformations with pre-calculated charges and solvation energies</td>
      </tr>
      <tr>
          <td><strong>Custom Tranches</strong></td>
          <td>Variable</td>
          <td>User-selected molecular subsets via Tranche Browser (e.g., lead-like, fragment-like)</td>
      </tr>
  </tbody>
</table>
<h2 id="use-cases">Use Cases</h2>
<p>ZINC-22 is designed for ultra-large virtual screening (ULVS), analog searching, and molecular docking campaigns. The Tranche Browser enables targeted subset selection (e.g., lead-like, fragment-like) for screening, and the CartBlanche interface supports both interactive and programmatic access to the database. The authors note that as the database grows, docking can identify better-fitting molecules.</p>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ZINC-20</strong></td>
          <td>Predecessor</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Enamine REAL</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>WuXi GalaXi</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Massive scale</strong>: 37+ billion purchasable compounds from major vendors (Enamine, WuXi, Mcule)</li>
<li><strong>Federated architecture</strong>: Supports asynchronous building and horizontal scaling to trillion-molecule growth</li>
<li><strong>Platform access</strong>: CartBlanche GUI provides a shopping cart metaphor for compound acquisition</li>
<li><strong>Privacy protection</strong>: Dual public/private server clusters protect patentability of undisclosed catalogs</li>
<li><strong>Chemical diversity</strong>: Linear growth (1 new scaffold per 10 molecules added), with 96.3M+ unique Bemis-Murcko scaffolds</li>
<li><strong>Ready-to-dock</strong>: 3D models include pre-calculated charges, protonation states, and solvation energies</li>
<li><strong>Cloud distribution</strong>: Available via AWS Open Data, Oracle OCI, and UCSF servers</li>
<li><strong>Scale-aware search</strong>: SmallWorld (similarity) and Arthor (substructure) tools partitioned to address specific constraints of billion-scale queries</li>
<li><strong>Organized access</strong>: Tranche system enables targeted selection of chemical space</li>
<li><strong>Open access</strong>: Entire database freely available to academic and commercial users</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Data Transfer Bottlenecks</strong>: Distributing 4.5 billion 3D alignments in standard rigid format (like db2 flexibase) requires roughly 1 Petabyte of storage. Transferring this takes months over standard gigabit connections, effectively mandating cloud-based compilation and rendering local copies impractical.</li>
<li><strong>Search Result Caps</strong>: Interactive Arthor searches are capped at 20,000 molecules to maintain a reliable public service. Users needing more results can use the asynchronous Arthor search tool via TLDR, which sends results by email.</li>
<li><strong>Enumeration Ceiling</strong>: Scaling relies entirely on PostgreSQL sharding. To continue using rigid docking tools, the database must fully enumerate structural states. The authors acknowledge that hardware limitations will likely cap full database enumeration well before the 10-trillion molecule mark, forcing future pipelines to accommodate unenumerated combinatorial fragment spaces.</li>
<li><strong>Download Workflow</strong>: Individual 3D molecule downloads are unavailable directly; researchers must rebuild them via the TLDR tool.</li>
<li><strong>Vendor Updates</strong>: There is difficulty removing discontinued vendor molecules due to the federated structure.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="hardware--software">Hardware &amp; Software</h3>
<p><strong>Compute infrastructure</strong>:</p>
<ul>
<li>1,700 cores across 14 computers for parallel processing</li>
<li>174 independent PostgreSQL 12.0 databases (110 &lsquo;Sn&rsquo; for ZINC-ID, 64 &lsquo;Sb&rsquo; for Supplier Codes)</li>
<li>Distributed across Amazon AWS, Oracle OCI, and UCSF servers</li>
</ul>
<p><strong>Software stack</strong>:</p>
<ul>
<li>PostgreSQL 12.2</li>
<li>Python 3.6.8</li>
<li>RDKit 2020.03</li>
<li>Celery task queue with Redis for background processing</li>
<li>All code available on GitHub: docking-org/zinc22-2d, zinc22-3d</li>
</ul>
<h3 id="data-organization--access">Data Organization &amp; Access</h3>
<p><strong>Tranche system</strong>: Molecules organized into &ldquo;Tranches&rdquo; based on 4 dimensions:</p>
<ol>
<li>Heavy Atom Count</li>
<li>Lipophilicity (LogP)</li>
<li>Charge</li>
<li>File Format</li>
</ol>
<p>This enables downloading specific chemical neighborhoods (e.g., neutral lead-like molecules) without accessing the entire database.</p>
<p><strong>Search infrastructure</strong>:
Searching at the billion-molecule scale actively exceeds rapid-access computer memory limits. ZINC-22 splits retrieval between two distinct algorithms:</p>
<ul>
<li>
<p><strong>SmallWorld</strong>: Handles whole-molecule similarity using Graph Edit Distance (GED). GED defines the minimum cost of operations (node/edge insertions, deletions, or substitutions) required to transform graph $G_1$ into graph $G_2$:</p>
<p>$$
\text{GED}(G_1, G_2) = \min_{(e_1, &hellip;, e_k) \in \mathcal{P}(G_1, G_2)} \sum_{i=1}^k c(e_i)
$$</p>
<p>Because SmallWorld searches pre-calculated anonymous graphs, it evaluates close neighbors in near $\mathcal{O}(1)$ time and scales sub-linearly, though it struggles with highly distant structural matches.</p>
</li>
<li>
<p><strong>Arthor</strong>: Provides exact substructure and pattern matching. It scales linearly $\mathcal{O}(N)$ with database size and successfully finds distant hits (e.g., PAINS filters), but performance heavily degrades if the index exceeds available RAM.</p>
</li>
<li>
<p><strong>CartBlanche</strong>: Web interface wrapping these search tools with shopping cart functionality.</p>
</li>
</ul>
<h3 id="3d-generation-pipeline">3D Generation Pipeline</h3>
<p>The 3D database construction pipeline involves multiple specialized tools:</p>
<ol>
<li><strong>ChemAxon JChem</strong>: Protonation state and tautomer generation at physiological pH</li>
<li><strong>Corina</strong>: Initial 3D structure generation</li>
<li><strong>Omega</strong>: Conformation sampling</li>
<li><strong>AMSOL 7.1</strong>: Calculation of atomic partial charges and desolvation energies</li>
<li><strong>Strain calculation</strong>: Relative energies of conformations</li>
</ol>
<p>At sustained throughput, the pipeline builds approximately 11 million molecules per day, each with hundreds of pre-calculated conformations.</p>
<h3 id="chemical-diversity-analysis">Chemical Diversity Analysis</h3>
<p>A core debate in billion-scale library generation involves whether continuous enumeration merely yields repetitive derivatives. Analysis of Bemis-Murcko (BM) scaffolds demonstrates that chemical diversity in ZINC-22 continues to grow, but scales sub-linearly based on a power law. Specifically, the authors observe a $\log$ increase in BM scaffolds for every two $\log$ increase in database size:</p>
<p>$$
\log(\text{Scaffolds}_{BM}) \propto 0.5 \log(\text{Molecules})
$$</p>
<p>This suggests that while diversity does not saturate, it grows proportionally to the square root of the library size ($\mathcal{O}(\sqrt{N})$). The majority of this scaffold novelty stems from compounds with the highest heavy atom counts (HAC 24-25), which contribute roughly twice as many unique core structures as the combined HAC 06-23 subset.</p>
<h3 id="vendor-integration">Vendor Integration</h3>
<p>ZINC-22 is built from five source catalogs with the following approximate sizes:</p>
<ul>
<li><strong>Enamine REAL Database</strong>: 5 billion compounds</li>
<li><strong>Enamine REAL Space</strong>: 29 billion compounds</li>
<li><strong>WuXi GalaXi</strong>: 2.5 billion compounds</li>
<li><strong>Mcule Ultimate</strong>: 128 million compounds</li>
<li><strong>ZINC20 in-stock</strong>: 4 million compounds (incorporated as layer &ldquo;g&rdquo;)</li>
</ul>
<p>This focus on purchasable, make-on-demand molecules distinguishes ZINC-22 from theoretical chemical space databases. ZINC20 continues to be maintained separately for smaller catalogs and in-stock compounds.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://cartblanche22.docking.org/">CartBlanche web interface</a></td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Web GUI for searching and downloading ZINC-22</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>2D curation and loading pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>3D building pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>CartBlanche22 web application</td>
      </tr>
      <tr>
          <td>AWS Open Data / Oracle OCI</td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Cloud-hosted 3D database mirrors</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Data Availability</strong>: The compiled database is openly accessible and searchable through the <a href="https://cartblanche22.docking.org/">CartBlanche web interface</a>. Subsets can be downloaded, and programmatic access is provided via curl, wget, and Globus.</li>
<li><strong>Code &amp; Algorithms</strong>: The source code for database construction, parallel processing, and querying is open-source.
<ul>
<li>2D Pipeline: <a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></li>
<li>3D Pipeline: <a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></li>
<li>CartBlanche: <a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></li>
<li>TLDR modules: docking-org/TLDR and docking-org/tldr-modules (repositories no longer available)</li>
</ul>
</li>
<li><strong>Software Dependencies</strong>: While the orchestration code is public, the 3D structure generation relies on commercial software that requires separate licenses (CORINA, OpenEye OMEGA, ChemAxon JChem). This limits end-to-end reproducibility for researchers without access to these tools.</li>
<li><strong>Hardware Limitations</strong>: Recreating the entire 37+ billion molecule database from raw vendor catalogs requires approximately 1,700 CPU cores and petabytes of data transfer, restricting full recreation to large institutional clusters or substantial cloud compute budgets.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Tingle, B. I., Tang, K. G., Castanon, M., Gutierrez, J. J., Khurelbaatar, M., Dandarchuluun, C., Moroz, Y. S., and Irwin, J. J. (2023). ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. <em>Journal of Chemical Information and Modeling</em>, 63(4), 1166&ndash;1176. <a href="https://doi.org/10.1021/acs.jcim.2c01253">https://doi.org/10.1021/acs.jcim.2c01253</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Tingle_2023,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{1549-960X}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society (ACS)}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tingle, Benjamin I. and Tang, Khanh G. and Castanon, Mar and Gutierrez, John J. and Khurelbaatar, Munkhzul and Dandarchuluun, Chinzorig and Moroz, Yurii S. and Irwin, John J.}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span>=<span style="color:#e6db74">{Feb}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1166--1176}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MARCEL: Molecular Conformer Ensemble Learning Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/marcel/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/marcel/</guid><description>MARCEL dataset provides 722K+ conformers across 76K+ molecules for drug discovery, catalysis, and molecular representation learning research.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>MARCEL provides a benchmark for conformer ensemble learning. It demonstrates that explicitly modeling full conformer distributions improves property prediction across drug-like molecules and organometallic catalysts.</p>
<h2 id="overview">Overview</h2>
<p>The Molecular Representation and Conformer Ensemble Learning (MARCEL) dataset provides 722K+ conformations across 76K+ molecules spanning four diverse chemical domains: drug-like molecules (Drugs-75K), organophosphorus ligands (Kraken), chiral catalysts (EE), and organometallic complexes (BDE). MARCEL evaluates conformer ensemble methods across both pharmaceutical and catalysis applications.</p>
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/marcel-drugs-75k-example-conformer.webp"
         alt="Example conformer from Drugs-75K"
         title="Example conformer from Drugs-75K"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Drugs-75K (SMILES: <code>COC(=O)[C@@]1(Cc2ccc(OC)cc2)[C@H]2c3cc(C(=O)N(C)C)n(Cc4ccc(OC(F)(F)F)cc4)c3C[C@H]2CN1C(=O)c1ccccc1</code>; IUPAC: methyl (2R,3R,6R)-4-benzoyl-10-(dimethylcarbamoyl)-3-[(4-methoxyphenyl)methyl]-9-[[4-(trifluoromethoxy)phenyl]methyl]-4,9-diazatricyclo[6.3.0.02,6]undeca-1(8),10-diene-3-carboxylate)</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-drugs-75k-example-conformer-2d.webp"
         alt="2D structure of Drugs-75K conformer"
         title="2D structure of Drugs-75K conformer"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">2D structure of Drugs-75K conformer above</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-kraken-ligand10-conf0-2d.webp"
         alt="Example conformer from Kraken in 2D"
         title="Example conformer from Kraken in 2D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Kraken (ligand 10, conformer 0) in 2D</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-kraken-ligand10-conf0-3d.webp"
         alt="Example conformer from Kraken in 3D"
         title="Example conformer from Kraken in 3D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Kraken (ligand 10, conformer 0) in 3D</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-bde-Pt_9.63-3d.webp"
         alt="Example substrate from BDE in 3D"
         title="Example substrate from BDE in 3D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example substrate from BDE in 3D (Pt_9.63)</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-bde-Pt_9.63-2d.webp"
         alt="2D structure of BDE substrate"
         title="2D structure of BDE substrate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">2D structure of BDE substrate above</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Drugs-75K</strong></td>
          <td>75,099 molecules</td>
          <td>Drug-like molecules with at least 5 rotatable bonds</td>
      </tr>
      <tr>
          <td><strong>Kraken</strong></td>
          <td>1,552 molecules</td>
          <td>Monodentate organophosphorus (III) ligands</td>
      </tr>
      <tr>
          <td><strong>EE</strong></td>
          <td>872 reactions</td>
          <td>Rhodium (Rh)-bound atropisomeric catalyst-substrate pairs derived from chiral bisphosphine</td>
      </tr>
      <tr>
          <td><strong>BDE</strong></td>
          <td>5,915 reactions</td>
          <td>Organometallic catalysts ML$_1$L$_2$ with electronic binding energies</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="drugs-75k-ionization-potential">Ionization Potential (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-ionization-potential">#</a></h3>
    <p class="benchmark-description">Predict ionization potential from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.4066</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.4069</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.4126</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.4149</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.4174</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.428</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4351</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4354</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4361</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.4393</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.4394</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4441</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4452</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.4466</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4505</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.4595</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4788</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4987</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.6617</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="drugs-75k-electron-affinity">Electron Affinity (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-electron-affinity">#</a></h3>
    <p class="benchmark-description">Predict electron affinity from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.391</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.3922</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.3944</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.3953</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.3964</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.4033</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4085</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4169</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4169</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.4207</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4233</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4232</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.4251</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.4269</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.4417</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4495</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4648</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4747</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.585</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="drugs-75k-electronegativity">Electronegativity (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-electronegativity">#</a></h3>
    <p class="benchmark-description">Predict electronegativity (χ) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.197</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.2027</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2069</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.2083</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.2199</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.2212</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.2243</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2243</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.226</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.2267</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.2267</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.2294</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.2324</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.2378</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.2436</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.2441</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.2505</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.2732</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.4073</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-b5">B₅ Sterimol Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-b5">#</a></h3>
    <p class="benchmark-description">Predict B₅ sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.2225</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.2313</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.263</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2644</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2704</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.2789</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.3072</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.3128</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.3228</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.3293</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.3443</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.345</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.351</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.3567</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.476</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.485</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.4873</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4879</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.9611</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-l">L Sterimol Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-l">#</a></h3>
    <p class="benchmark-description">Predict L sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.3386</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.3468</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.3619</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.3643</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.3754</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4003</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4174</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4303</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4322</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4344</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4363</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4471</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.4485</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.4493</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.5142</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.5452</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.5458</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.6417</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.8389</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-burb5">Buried B₅ Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-burb5">#</a></h3>
    <p class="benchmark-description">Predict buried B₅ sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.1589</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.1693</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.1719</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.1782</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.1783</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2024</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2017</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.2066</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.2097</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.2178</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.2176</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.2295</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.2395</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.2422</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.2758</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.2813</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.2884</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.3002</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.4929</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-burl">Buried L Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-burl">#</a></h3>
    <p class="benchmark-description">Predict buried L sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.0947</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.1185</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.12</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.1324</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.1386</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.1443</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.1486</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.15</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.1521</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.1526</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.1548</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.1635</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.1673</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.1741</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.1861</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.1924</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.1948</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.2529</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.2781</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="ee-enantioselectivity">Enantioselectivity (EE)<a hidden class="anchor" aria-hidden="true" href="#ee-enantioselectivity">#</a></h3>
    <p class="benchmark-description">Predict enantiomeric excess for Rh-catalyzed asymmetric reactions</p>
    <p class="benchmark-meta"><strong>Subset:</strong> EE
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>11.61</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>12.03</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>13.56</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>13.96</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>14.22</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>14.64</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>17.74</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>18.03</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>18.42</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>19.8</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>20.24</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>33.95</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>61.03</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>61.3</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>61.63</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>62.08</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>62.31</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>62.38</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>64.01</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="bde-bond-dissociation">Bond Dissociation Energy (BDE)<a hidden class="anchor" aria-hidden="true" href="#bde-bond-dissociation">#</a></h3>
    <p class="benchmark-description">Predict metal-ligand bond dissociation energy for organometallic catalysts</p>
    <p class="benchmark-meta"><strong>Subset:</strong> BDE
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>1.45</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>1.47</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>1.53</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>1.53</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>1.61</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>1.65</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>1.87</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>1.97</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>2.01</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>2.13</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>2.48</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>2.55</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>2.61</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>2.64</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>2.66</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>2.74</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>2.83</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>3.03</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>10.08</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GEOM</strong></td>
          <td>Source</td>
          <td><a href="/notes/chemistry/datasets/geom/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Domain diversity</strong>: Beyond drug-like molecules, includes organometallics and catalysts rarely covered in existing benchmarks</li>
<li><strong>Ensemble-based</strong>: Provides full conformer ensembles with statistical weights</li>
<li><strong>DFT-quality energies</strong>: Drugs-75K features DFT-level conformers and energies (higher accuracy than GEOM-Drugs)</li>
<li><strong>Realistic scenarios</strong>: BDE subset models the practical constraint of lacking DFT-computed conformers for large catalyst systems</li>
<li><strong>Comprehensive baselines</strong>: Benchmarks 18 models across 1D (SMILES), 2D (graph), 3D (single conformer), and ensemble methods</li>
<li><strong>Property diversity</strong>: Covers ionization potential, electron affinity, electronegativity, ligand descriptors, and catalytic properties</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Regression only</strong>: All tasks evaluate regression metrics exclusively</li>
<li><strong>Chemical space coverage</strong>: The 76K molecules encapsulate a fraction of the expansive drug-like and catalyst chemical spaces</li>
<li><strong>Compute requirements</strong>: Working with large conformer ensembles demands significant computational resources</li>
<li><strong>Proprietary data</strong>: EE subset is proprietary (as of December 2025)</li>
<li><strong>DFT bottleneck</strong>: BDE demonstrates a practical limitation: single DFT optimization can take 2-3 days, making conformer-level DFT infeasible for large organometallics</li>
<li><strong>Uniform sampling baseline</strong>: The initial data augmentation strategy tested for handling ensembles samples conformers uniformly rather than by Boltzmann weight. This unprincipled physical assumption likely explains why the strategy occasionally introduces noise and fails to aid complex 3D architectures.</li>
<li><strong>Drugs-75K properties</strong>: The large-scale benchmark (Drugs-75K) specifically targets electronic properties (Ionization Potential, Electron Affinity, Electronegativity). As the authors explicitly highlight in Section 5.2, these properties are generally less sensitive to conformational rotations compared to steric or spatial interactions. This significantly confounds evaluating whether explicit conformer ensembles actually benefit large-scale regression tasks.</li>
<li><strong>Unrealistic single-conformer baselines</strong>: The 3D single-conformer models are exclusively evaluated on the lowest-energy conformer. This setup is inherently flawed for real-world application, as knowing the global minimum <em>a priori</em> requires exhaustively searching and computing energies for the entire conformer space.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="data-generation-pipeline">Data Generation Pipeline</h3>
<h4 id="drugs-75k">Drugs-75K</h4>
<p><strong>Source</strong>: GEOM-Drugs subset</p>
<p><strong>Filtering</strong>:</p>
<ul>
<li>Minimum 5 rotatable bonds (focus on flexible molecules)</li>
<li>Allowed elements: H, C, N, O, F, Si, P, S, Cl</li>
</ul>
<p><strong>Conformer generation</strong>:</p>
<ul>
<li>DFT-level calculations for both conformers and energies</li>
<li>Higher accuracy than original GEOM-Drugs (semi-empirical GFN2-xTB)</li>
</ul>
<p><strong>Properties</strong>: Ionization Potential (IP), Electron Affinity (EA), Electronegativity (χ)</p>
<h4 id="kraken">Kraken</h4>
<p><strong>Source</strong>: Original Kraken dataset (1,552 monodentate organophosphorus(III) ligands)</p>
<p><strong>Properties</strong>: 4 of 78 available properties (selected for high variance across conformer ensembles)</p>
<ul>
<li>$B_5$: Sterimol B5, maximum width of substituent (steric descriptor)</li>
<li>$L$: Sterimol L, length of substituent (steric descriptor)</li>
<li>$\text{Bur}B_5$: Buried Sterimol B5, steric effects within the first coordination sphere</li>
<li>$\text{Bur}L$: Buried Sterimol L, steric effects within the first coordination sphere</li>
</ul>
<h4 id="ee-enantiomeric-excess">EE (Enantiomeric Excess)</h4>
<p><strong>Generation method</strong>: Q2MM (Quantum-guided Molecular Mechanics)</p>
<p><strong>Reactions</strong>: 872 catalyst-substrate pairs involving 253 Rhodium (Rh)-bound atropisomeric catalysts from chiral bisphosphine with 10 enamide substrates</p>
<p><strong>Property</strong>: Enantiomeric excess (EE) for asymmetric catalysis</p>
<p><strong>Availability</strong>: Proprietary-only (closed-source as of December 2025)</p>
<h4 id="bde-bond-dissociation-energy">BDE (Bond Dissociation Energy)</h4>
<p><strong>Molecules</strong>: 5,915 organometallic catalysts (ML₁L₂ structure)</p>
<p><strong>Initial conformers</strong>: OpenBabel with geometric optimization</p>
<p><strong>Energies</strong>: DFT calculations</p>
<p><strong>Property</strong>: Electronic binding energy (difference in minimum energies of bound-catalyst complex and unbound catalyst)</p>
<p><strong>Key constraint</strong>: DFT optimization for full conformer ensembles computationally infeasible (2-3 days per molecule)</p>
<h3 id="benchmark-setup">Benchmark Setup</h3>
<p><strong>Task</strong>: Predict molecular properties from structure using different representation strategies (1D/2D/3D/Ensemble).
The ground-truth regression targets are calculated as the Boltzmann-averaged value of the property across the conformer ensemble:</p>
<p>$$
\langle y \rangle_{k_B} = \sum_{\mathbf{C}_i \in \mathcal{C}} p_i y_i
$$</p>
<p>Where $p_i$ is the conformer probability (Boltzmann weight) under experimental conditions derived from the conformer energy $e_i$:</p>
<p>$$
p_i = \frac{\exp(-e_i / k_B T)}{\sum_j \exp(-e_j / k_B T)}
$$</p>
<p><strong>Data splits</strong>: Datasets are partitioned 70% train, 10% validation, and 20% test.</p>
<p><strong>Model categories</strong>:</p>
<ol>
<li><strong>1D Models</strong>: SMILES-based (Random Forest on concatenated MACCS/ECFP/RDKit fingerprints, LSTM, Transformer).</li>
<li><strong>2D Models</strong>: Graph-based (GIN, GIN+VN, ChemProp, GraphGPS).</li>
<li><strong>3D Models</strong>: Single conformer (SchNet, DimeNet++, GemNet, PaiNN, ClofNet, LEFTNet). For evaluation, single 3D models exclusively ingest the lowest-energy conformer. This baseline setting often yields strong performance but is unrealistic in practice, as identifying the global minimum requires exhaustively searching the entire conformer space.</li>
<li><strong>Ensemble Models</strong>: Full conformer ensemble processing via explicit set encoders. For each conformer embedding $\mathbf{z}_i$, three aggregation strategies are evaluated:</li>
</ol>
<p><strong>Mean Pooling:</strong>
$$
\mathbf{s}_{\text{MEAN}} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \mathbf{z}_i
$$</p>
<p><strong>DeepSets:</strong>
$$
\mathbf{s}_{\text{DS}} = g\left(\sum_{i=1}^{|\mathcal{C}|} h(\mathbf{z}_i)\right)
$$</p>
<p><strong>Self-Attention:</strong>
$$
\begin{aligned}
\mathbf{s}_{\text{ATT}} &amp;= \sum_{i=1}^{|\mathcal{C}|} \mathbf{c}_i, \quad \text{where} \quad \mathbf{c}_i = g\left( \sum_{j=1}^{|\mathcal{C}|} \alpha_{ij} h(\mathbf{z}_j) \right) \\
\alpha_{ij} &amp;= \frac{\exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_j))\right)}{\sum_{k=1}^{|\mathcal{C}|} \exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_k))\right)}
\end{aligned}
$$</p>
<p><strong>Evaluation metric</strong>: Mean Absolute Error (MAE) for all tasks.</p>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Ensemble superiority (task-dependent)</strong>: Across benchmarks, explicitly modeling the full conformer set using DeepSets often achieved top performance. However, these improvements are not uniform:</p>
<ul>
<li><strong>Small-Scale Success</strong>: Ensemble methods show large improvements on tasks like Kraken (Ensemble PaiNN achieves 0.2225 on $B_5$ vs 0.3443 single) and EE (Ensemble GemNet achieves 11.61% vs 18.03% single).</li>
<li><strong>Large-Scale Plateau</strong>: The performance improvements did not strongly transfer to large subsets like Drugs-75K (best ensemble strategy for GemNet achieves 0.4066 eV on IP vs 0.4069 eV single). The authors conjecture that the computational burden of encoding all conformers in each ensemble alters learning dynamics and increases training difficulty.</li>
</ul>
<p><strong>Conformer Sampling for Noise</strong>: Data augmentation (randomly sampling one conformer from an ensemble during training) improves performance and robustness when underlying conformers are imprecise (e.g., the forcefield-generated conformers in the BDE subset).</p>
<p><strong>3D vs 2D</strong>: 3D models generally outperform 2D graph models, especially for conformationally-sensitive properties, though 1D and 2D methods remain highly competitive on low-resource datasets or less rotation-sensitive properties.</p>
<p><strong>Model architecture</strong>: No single model dominates all tasks. GemNet and LEFTNet excel on large-scale Drugs-75K, while DimeNet++ shows strong performance on smaller Kraken and reaction datasets. Model selection depends on dataset size and task characteristics.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL">SXKDZ/MARCEL</a></td>
          <td>Code + Dataset</td>
          <td>Apache-2.0</td>
          <td>Benchmark suite, dataset loaders, and hyperparameter configs</td>
      </tr>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL/tree/main/datasets/Drugs">Drugs-75K</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>DFT-level conformers and energies derived from GEOM-Drugs</td>
      </tr>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL/tree/main/datasets/Kraken">Kraken</a></td>
          <td>Dataset</td>
          <td>Copyright retained by original authors</td>
          <td>Conformer ensembles and four steric descriptors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL/tree/main/datasets/BDE">BDE</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>OpenBabel-generated conformers with DFT binding energies</td>
      </tr>
      <tr>
          <td>EE</td>
          <td>Dataset</td>
          <td>Proprietary</td>
          <td>Closed-source as of 2026</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Data</strong>: The Drugs-75K, Kraken, and BDE subsets are openly available via the project&rsquo;s GitHub repository. The EE dataset remains closed-source/proprietary (as of 2026), making the EE suite of the benchmark currently irreproducible.</li>
<li><strong>Code</strong>: The benchmark suite and PyTorch-Geometric dataset loaders are open-sourced at <a href="https://github.com/SXKDZ/MARCEL">GitHub (SXKDZ/MARCEL)</a> under the Apache-2.0 license.</li>
<li><strong>Hardware</strong>: The authors trained models using Nvidia A100 (40GB) GPUs. Memory-intensive models (e.g., GemNet, LEFTNet) required Nvidia H100 (80GB) GPUs. Total computation across all benchmark experiments was approximately 6,000 GPU hours.</li>
<li><strong>Algorithms/Models</strong>: Hyperparameters for all 18 evaluated models are provided in the repository configuration files (<code>benchmarks/params</code>). All baseline models use publicly available frameworks (e.g., PyTorch Geometric, OGB, RDKit).</li>
<li><strong>Evaluation</strong>: Evaluation scripts are provided in the repository with consistent tracking of Mean Absolute Error (MAE) and proper configuration of benchmark splits.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Zhu, Y., Hwang, J., Adams, K., Liu, Z., Nan, B., Stenfors, B., Du, Y., Chauhan, J., Wiest, O., Isayev, O., Coley, C. W., Sun, Y., and Wang, W. (2024). Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks. In <em>The Twelfth International Conference on Learning Representations (ICLR 2024)</em>. <a href="https://openreview.net/forum?id=NSDszJ2uIV">https://openreview.net/forum?id=NSDszJ2uIV</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2024learning,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yanqiao Zhu and Jeehyun Hwang and Keir Adams and Zhen Liu and Bozhao Nan and Brock Stenfors and Yuanqi Du and Jatin Chauhan and Olaf Wiest and Olexandr Isayev and Connor W. Coley and Yizhou Sun and Wei Wang}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=NSDszJ2uIV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GEOM: Energy-Annotated Molecular Conformations Dataset</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/geom/</link><pubDate>Thu, 04 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/geom/</guid><description>Dataset card for GEOM, providing energy-annotated molecular conformations generated via CREST/xTB and refined with DFT for property prediction benchmarks.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>GEOM addresses the gap between 2D molecular graphs and flexible 3D properties by providing 450k+ molecules with 37M+ conformations. This extensive sampling connects conformer ensembles to experimental properties, providing the necessary infrastructure to benchmark conformer generation methods and train 3D-aware property predictors.</p>
<h2 id="overview">Overview</h2>
<p>The Geometric Ensemble Of Molecules (GEOM) dataset provides energy-annotated molecular conformations generated through systematic computational methods. The dataset includes molecules from drug discovery campaigns (AICures), quantum chemistry benchmarks (<a href="/notes/chemistry/datasets/qm9/">QM9</a>), and molecular property prediction benchmarks (MoleculeNet), with conformations sampled using CREST/GFN2-xTB and a subset refined with high-quality DFT calculations.</p>
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/GEOM-sample-_4-pyrimidin-2-yloxyphenyl_acetamide.webp"
         alt="Example SARS-CoV-2 3CL protease active molecule"
         title="Example SARS-CoV-2 3CL protease active molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example SARS-CoV-2 3CL protease active molecule: CC(=O)Nc1ccc(Oc2ncccn2)cc1 (N-(4-pyrimidin-2-yloxyphenyl)acetamide)</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Drug-like (AICures)</strong></td>
          <td>304,466 molecules</td>
          <td>Drug-like molecules from AICures COVID-19 challenge (avg 44 atoms)</td>
      </tr>
      <tr>
          <td><strong>QM9</strong></td>
          <td>133,258 molecules</td>
          <td>Small molecules from QM9 (up to 9 heavy atoms)</td>
      </tr>
      <tr>
          <td><strong>MoleculeNet</strong></td>
          <td>16,865 molecules</td>
          <td>Molecules from MoleculeNet benchmarks for physical chemistry, biophysics, and physiology (includes BACE)</td>
      </tr>
      <tr>
          <td><strong>BACE (High-quality DFT)</strong></td>
          <td>1,511 molecules</td>
          <td>BACE subset of MoleculeNet with high-quality DFT energies (r2scan-3c) and experimental inhibition data</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="gibbs-free-energy-prediction">Gibbs Free Energy Prediction<a hidden class="anchor" aria-hidden="true" href="#gibbs-free-energy-prediction">#</a></h3>
    <p class="benchmark-description">Predict ensemble Gibbs free energy (G) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.203</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.225</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.274</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.289</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.406</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="average-energy-prediction">Average Energy Prediction<a hidden class="anchor" aria-hidden="true" href="#average-energy-prediction">#</a></h3>
    <p class="benchmark-description">Predict ensemble average energy (E) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.11</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.113</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.119</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.131</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.166</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="conformer-count-prediction">Conformer Count Prediction<a hidden class="anchor" aria-hidden="true" href="#conformer-count-prediction">#</a></h3>
    <p class="benchmark-description">Predict ln(number of unique conformers) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.363</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.38</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.455</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.484</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.763</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>QM9</strong></td>
          <td>134k small molecules with up to 9 heavy atoms and DFT properties</td>
      </tr>
      <tr>
          <td><strong>PCQM4Mv2</strong></td>
          <td>Millions of computationally generated molecules for HOMO-LUMO gap prediction</td>
      </tr>
      <tr>
          <td><strong>PubChemQC</strong></td>
          <td>DFT structures and energy properties for millions of PubChem molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Scale</strong>: 37M+ conformations across 450k+ molecules, providing massive coverage of drug-like and small molecule chemical space.</li>
<li><strong>Energy Annotations</strong>: All conformations include semi-empirical energies (GFN2-xTB); the BACE subset includes high-quality DFT energies.</li>
<li><strong>Quality Tiers</strong>: Three levels of computational rigor allow researchers to trade off dataset size for simulation accuracy.</li>
<li><strong>Benchmark Ready</strong>: Includes validated splits and architectural baselines (e.g., ChemProp, SchNet) for property prediction tasks.</li>
<li><strong>Task Diversity</strong>: Combines molecules sourced from drug discovery (AICures), quantum chemistry (QM9), and biophysiology domains (MoleculeNet).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Computational Constraints</strong>: The highest-accuracy DFT subset (BACE) is limited to 1,511 molecules due to the extreme computational cost of exact free energy sampling and Hessian estimation.</li>
<li><strong>Semi-Empirical Accuracy Gap</strong>: The $p^{\text{CREST}}$ statistical weights rely on GFN2-xTB energies, which exhibit a $\sim$2 kcal/mol MAE against true DFT. At room temperature ($k_BT \approx 0.59$ kcal/mol), this error heavily skews the Boltzmann distribution, meaning standard subset weights are imprecise.</li>
<li><strong>Solvation Assumptions</strong>: Most subsets rely on vacuum calculations. Only the BACE subset uses an implicit solvent (ALPB/C-PCM for water).</li>
<li><strong>Coverage Lapses</strong>: Extremely flexible molecules (e.g., within the SIDER dataset) frequently failed the conformer generation pipeline due to runaway topologies.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="data-generation-pipeline">Data Generation Pipeline</h3>
<p><strong>Initial conformer sampling</strong> (RDKit):</p>
<ul>
<li><code>EmbedMultipleConfs</code> with <code>numConfs=50</code>, <code>pruneRmsThresh=0.01</code> Å</li>
<li>MMFF force field optimization</li>
<li>GFN2-xTB optimization of seed conformer</li>
</ul>
<p><strong>Conformational exploration</strong> (CREST):</p>
<ul>
<li>Metadynamics in NVT ensemble driven by a pushing bias potential:
$$ V_{\text{bias}} = \sum_i k_i \exp(-\alpha_i \Delta_i^2) $$
where $\Delta_i$ is the root-mean-square displacement (RMSD) against the $i$-th reference structure.</li>
<li>12 independent MTD runs per molecule with different settings for $k_i$ and $\alpha_i$.</li>
<li>6.0 kcal/mol safety window for conformer retention.</li>
<li>Solvent: ALPB for water (BACE); vacuum for others.</li>
</ul>
<p><strong>Energy calculation &amp; Weighting</strong>:</p>
<ul>
<li>
<p><strong>Standard (GFN2-xTB)</strong>: Semi-empirical tight-binding DFT ($\approx$ 2 kcal/mol MAE vs DFT). Conformers are assigned a statistical probability based on energy $E_i$ and rotamer degeneracy $d_i$:
$$ p^{\text{CREST}}_i = \frac{d_i \exp(-E_i / k_B T)}{\sum_j d_j \exp(-E_j / k_B T)} $$</p>
</li>
<li>
<p><strong>High-Quality DFT (CENSO)</strong>: Refines structures using the <code>r2scan-3c</code> functional, computing exact conformation-dependent free energies ($G_i$) that remove the need for explicit rotamer degeneracy approximations:</p>
<p>$$
\begin{aligned}
p^{\text{CENSO}}_i &amp;= \frac{\exp(-G_i / k_B T)}{\sum_j \exp(-G_j / k_B T)} \\
G_i &amp;= E_{\text{gas}}^{(i)} + \delta G_{\text{solv}}^{(i)}(T) + G_{\text{trv}}^{(i)}(T)
\end{aligned}
$$</p>
</li>
</ul>
<h3 id="quality-levels">Quality Levels</h3>
<table>
  <thead>
      <tr>
          <th>Level</th>
          <th>Method</th>
          <th>Subset</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Standard</strong></td>
          <td>CREST/GFN2-xTB</td>
          <td>All subsets</td>
          <td>~2 kcal/mol MAE vs DFT</td>
      </tr>
      <tr>
          <td><strong>DFT Single-Point</strong></td>
          <td>r2scan-3c/mTZVPP on CREST geometries</td>
          <td>BACE (1,511 molecules)</td>
          <td>Sub-kcal/mol</td>
      </tr>
      <tr>
          <td><strong>DFT Optimized</strong></td>
          <td>CENSO full optimization + free energies</td>
          <td>BACE (534 molecules)</td>
          <td>~0.3 kcal/mol vs CCSD(T)</td>
      </tr>
  </tbody>
</table>
<h3 id="benchmark-setup">Benchmark Setup</h3>
<p><strong>Task</strong>: Predict ensemble summary statistics directly from the 2D molecular structure. The target properties include:</p>
<ul>
<li><strong>Conformational Free Energy ($G$)</strong>: $G = -TS$, where $S = -R \sum_i p_i \log p_i$.</li>
<li><strong>Average Energy ($\langle E \rangle$)</strong>: $\langle E \rangle = \sum_i p_i E_i$.</li>
<li><strong>Unique Conformers</strong>: Natural log of the conformer count retained within the energy window.</li>
</ul>
<p><strong>Data</strong>: 100,000 species randomly sampled from AICures subset, split 60/20/20 (train/validation/test).</p>
<p><strong>Hyperparameters</strong>: Optimized using Hyperopt package for each model/task combination.</p>
<p><strong>Models</strong>:</p>
<ul>
<li><strong>SchNetFeatures</strong>: 3D SchNet architecture + graph features, trained on highest-probability conformer</li>
<li><strong>ChemProp</strong>: Message Passing Neural Network on molecular graphs</li>
<li><strong>FFNN</strong>: Feed-forward network on Morgan fingerprints</li>
<li><strong>KRR</strong>: Kernel Ridge Regression on Morgan fingerprints</li>
<li><strong>Random Forest</strong>: Random Forest on Morgan fingerprints</li>
</ul>
<h3 id="hardware--computational-cost">Hardware &amp; Computational Cost</h3>
<h4 id="crestgfn2-xtb-generation">CREST/GFN2-xTB Generation</h4>
<p><strong>Total compute</strong>: ~15.7 million core hours</p>
<p><strong>AICures subset</strong>:</p>
<ul>
<li>13M core hours on Knights Landing (32-core nodes)</li>
<li>1.2M core hours on Cascade Lake/Sky Lake (13-core nodes)</li>
<li>Average wall time: 2.8 hours/molecule (KNL) or 0.63 hours/molecule (Sky Lake)</li>
</ul>
<p><strong>MoleculeNet subset</strong>: 1.5M core hours</p>
<h4 id="dft-calculations-bace-only">DFT Calculations (BACE only)</h4>
<p><strong>Software</strong>: CENSO 1.1.2 + ORCA 5.0.1 (r2scan-3c/mTZVPP functional)</p>
<p><strong>Solvent</strong>: C-PCM implicit solvation (water)</p>
<p><strong>Hardware</strong>: ~54 cores per job</p>
<p><strong>Compute cost</strong>:</p>
<ul>
<li>781,000 CPU hours for CENSO optimizations</li>
<li>1.1M CPU hours for single-point energy calculations</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Data Availability</strong>: All generated conformations, energies, and thermodynamic properties are publicly hosted on <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JNGTDF">Harvard Dataverse</a>. The data is provided in language-agnostic MessagePack format and Python-specific RDKit <code>.pkl</code> formats.</li>
<li><strong>Code &amp; Analysis</strong>: The primary GitHub repository (<a href="https://github.com/learningmatter-mit/geom">learningmatter-mit/geom</a>) provides tutorials for data extraction, RDKit processing, and conformational visualization.</li>
<li><strong>Model Training &amp; Baselines</strong>: The machine learning benchmarks (SchNet, ChemProp) and corresponding training scripts used to evaluate the dataset can be reproduced using the authors&rsquo; <a href="https://github.com/learningmatter-mit/NeuralForceField">NeuralForceField repository</a>.</li>
<li><strong>Hardware &amp; Compute</strong>: Extreme compute was required (15.7M core hours for CREST sampling alone), heavily utilizing Knights Landing (KNL) and Cascade Lake architectures. See <em>Hardware &amp; Computational Cost</em> section above for full details.</li>
<li><strong>Software Versions</strong>: Precise reproduction of conformational properties requires specific versions to mitigate numerical variances: CREST v2.9, xTB v6.2.3/v6.4.1, CENSO v1.1.2, ORCA v5.0.1/v5.0.2, and RDKit v2020.09.1.</li>
<li><strong>Open-Access Paper</strong>: The full methodology is accessible via the <a href="https://arxiv.org/abs/2006.05531">arXiv preprint</a>.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Axelrod, S. and Gómez-Bombarelli, R. (2022). GEOM, energy-annotated molecular conformations for property prediction and molecular generation. <em>Scientific Data</em>, 9(1), 185. <a href="https://doi.org/10.1038/s41597-022-01288-4">https://doi.org/10.1038/s41597-022-01288-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Axelrod_2022,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GEOM, energy-annotated molecular conformations for property prediction and molecular generation}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{2052-4463}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1038/s41597-022-01288-4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1038/s41597-022-01288-4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Data}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Science and Business Media LLC}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Axelrod, Simon and Gómez-Bombarelli, Rafael}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span>=<span style="color:#e6db74">{apr}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{185}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDB-11: Chemical Universe Database (26.4M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-11/</link><pubDate>Fri, 29 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-11/</guid><description>GDB-11 systematically enumerates 26.4M small organic molecules (up to 11 atoms of C, N, O, F) for virtual screening and drug discovery.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/gdb_11_sample.webp"
         alt="GDB-11 molecule"
         title="GDB-11 molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">GDB-11 molecule (SMILES: <code>FC1C2OC1c3c(F)coc23</code>)</figcaption>
    
</figure>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-13</strong></td>
          <td>Successor</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">Notes</a></td>
      </tr>
      <tr>
          <td><strong>GDB-17</strong></td>
          <td>Successor</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="key-contribution">Key Contribution</h2>
<p>The generation and analysis of the Generated Database (GDB), an exhaustive collection of all possible small molecules that meet specific criteria for stability and synthetic feasibility.</p>
<h2 id="overview">Overview</h2>
<p>GDB-11 represents the first systematic enumeration of the small molecule chemical universe up to 11 atoms of C, N, O, and F. The database contains 26.4 million unique molecules corresponding to 110.9 million stereoisomers. It was created to support virtual screening and drug discovery by providing a comprehensive collection of diverse, drug-like small molecules that obey standard chemical stability rules.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Systematic Enumeration</strong>: Exhaustive coverage of mathematically and chemically possible structures up to 11 atoms.</li>
<li><strong>Drug-Likeness</strong>: 100% of compounds follow Lipinski&rsquo;s &ldquo;Rule of 5&rdquo; for bioavailability, and 50% (13.2 million) follow Congreve&rsquo;s more restrictive &ldquo;Rule of 3&rdquo; for lead-likeness.</li>
<li><strong>Structural Novelty</strong>: Features 538 newly identified ring systems that were previously unknown in existing chemical databases (like the CAS Registry or Beilstein).</li>
<li><strong>High Chirality</strong>: Over 70% of GDB molecules are chiral, providing rich 3D structural diversity, particularly in fused carbocycles and heterocycles.</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Size Restriction</strong>: Strictly limited to small molecules with a maximum of 11 heavy atoms.</li>
<li><strong>Element Restriction</strong>: Only contains C, N, O, and F. Important biological and pharmaceutical elements like Phosphorus (P), Sulfur (S), and Silicon (Si) are excluded to prevent combinatorial explosion.</li>
<li><strong>Excluded Topologies</strong>: Excludes highly strained molecules (e.g., specific bridged systems), allenes, and bridgehead double bonds.</li>
<li><strong>Unstable Functional Groups</strong>: Excludes chemical classes deemed unstable or highly reactive (e.g., gem-diols, hemiacetals, aminals, enols, orthoacids).</li>
<li><strong>Computational Nature</strong>: Consists entirely of computer-generated, theoretical structures without experimental synthesis or biological validation.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="construction">Construction</h3>
<h4 id="graph-selection">Graph Selection</h4>
<p>The program GENG was used to generate an initial set of 843,335 connected graphs with up to 11 nodes and a maximum node connectivity of 4. These were filtered to 15,726 stable saturated hydrocarbon graphs using:</p>
<ul>
<li><strong>Topological Criteria</strong>: Removed graphs with a node in multiple small (3- or 4-membered) rings, tetravalent bridgeheads in small rings, and nonplanar graphs (e.g., Claus-benzol).</li>
<li><strong>Steric Criteria</strong>: Graphs containing highly distorted centers were removed using an adapted MM2 force field energy-minimization with a cutoff of +17 kcal/mol.</li>
</ul>
<h4 id="structure-generation">Structure Generation</h4>
<p>Graph symmetry algorithms identified valid locations for unsaturations and heteroatoms (C, N, O, F). Specific valence constraints were continuously enforced. Combinatorial distribution of elements and multiple bonds (excluding bridgehead double bonds, triple bonds in rings smaller than nine, and allenes) yielded a theoretical &ldquo;dark matter universe&rdquo; (DMU) of over 1.7 billion unique structures.</p>
<h4 id="filters">Filters</h4>
<p>The 1.7 billion structural candidates contained unstable environments which were aggressively filtered, reducing the set to 27.7 million possible stable molecules. Rejected unstable/reactive features included:</p>
<ul>
<li><strong>High-Energy Bonds</strong>: Gem-diols, non-stabilized aminals, hemiaminals, enols, orthoesters, unstable imines, acyl fluorides, and geminal di-heteroatoms.</li>
<li><strong>Heteroatom-Heteroatom Bonds</strong>: Peroxides (O-O), N-O, N-N, N-F, and triazanes, unless stabilized (e.g., hydrazones, oximes).</li>
<li><strong>Strained Topologies</strong>: 3/4-membered rings containing N-N or N-O bonds, and bridgehead heteroatom bonds causing instabilities (like Bredt&rsquo;s rule violations).</li>
</ul>
<p>Removal of redundant tautomeric forms collapsed the set to the foundational 26.4 million structures.</p>
<h4 id="stereoisomer-generation">Stereoisomer Generation</h4>
<p>Stereoisomers were cleanly enumerated by identifying all asymmetric centers and functional double bonds, blocking Z/E isomerism in rings smaller than 10 nodes. From the 26.4 million unique constitutional isomers, 110.9 million stereoisomers were generated (averaging 4.2 stereoisomers per molecule).</p>
<h3 id="analysis-methodology">Analysis Methodology</h3>
<h4 id="kohonen-maps-self-organizing-maps">Kohonen Maps (Self-Organizing Maps)</h4>
<p>The chemical space visualization and compound class analysis used a Kohonen Map (Self-Organizing Map/SOM):</p>
<ul>
<li><strong>Input Features</strong>: 48-dimensional autocorrelation vectors encoding topological relationships and atomic properties. The autocorrelation vector $\text{AC}_d$ for a topological distance $d$ is defined as:</li>
</ul>
<p>$$
\text{AC}_d = \sum_{i=1}^{N} \sum_{j=1}^{N} \delta (p_i p_j)_d
$$</p>
<p><em>(where $N$ is the number of atoms, $p$ is the atomic property, and $\delta (p_i, p_j)_d = p_i p_j$ if the topological distance between atoms $i$ and $j$ is $d$, and 0 otherwise).</em></p>
<ul>
<li><strong>Training Data</strong>: Random subset of 1,000,000 GDB molecules</li>
<li><strong>Architecture</strong>: 200x200 neuron grid</li>
<li><strong>Training Protocol</strong>: 250,000 epochs with 100 molecules presented per epoch</li>
<li><strong>Algorithm</strong>: Standard Kohonen algorithm</li>
<li><strong>Key Insight</strong>: Reveals that &ldquo;lead-like&rdquo; compounds cluster in chiral regions of fused carbocycles/heterocycles</li>
</ul>
<h4 id="comparison">Comparison</h4>
<p>The full database was compared comprehensively to a Reference Database (RDB) of 63,857 known compounds (up to 11 atoms) extracted from PubChem, ChemACX, ChemSCX, NCI Open Database, and the Merck Index. Of the 63,857 RDB compounds, 37,393 (58.6%) were found in GDB. The remaining 26,464 compounds were absent due to structural rule violations, exclusion of elements beyond C/N/O/F, and filtered unstable chemistries.</p>
<h4 id="new-rings">New Rings</h4>
<p>All 309 entirely acyclic graphs in GDB mapped cleanly to published structures. External databases contained only 670 of the 1,208 purely cyclic theoretical ring systems (55.5%). Furthermore, 367 of the 538 newly identified ring systems (68.2%) express inherently chiral topologies.</p>
<h4 id="stereochemistry">Stereochemistry</h4>
<p>Small molecules under 5 heavy atoms skew strongly towards simple achiral structures. As the atom count increases, a dominant stereochemical shift emerges: over two-thirds of structures containing exactly 10 or 11 atoms occupy chiral configuration spaces. Approximately 86% of the molecules in GDB contain exactly 11 atoms (22.8 million of 26.4 million).</p>
<h4 id="physicochemical-properties">Physicochemical Properties</h4>
<p>Because all GDB molecules contain at most 11 heavy atoms, 100% of them satisfy Lipinski&rsquo;s &ldquo;Rule of 5&rdquo; for bioavailability. Under the more restrictive Congreve &ldquo;Rule of 3&rdquo; for lead-likeness (MW &lt; 300, RBC &lt; 3, logP &lt; 3, HBDC &lt; 3, HBAC &lt; 3, TPSA &lt; 60 $\text{\AA}^2$), exactly 50% (13.2 million structures) qualify. Virtual screening using the Molinspiration miscreen toolkit (Bayesian statistics-based) identified 42,804 virtual hits across three drug target classes: 3,043 kinase inhibitor candidates, 24,489 GPCR ligand candidates, and 19,696 ion-channel modulator candidates. Of these virtual hits, 59.8% occupied Kohonen map neurons not populated by any known RDB compound.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>While the generated GDB-11 database is openly available, reproducing the exact generation from graph to stereoisomer relies on in-house and proprietary software which is not publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">GDB Downloads (University of Berne)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Official host for GDB databases</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5172017">Zenodo Record (10.5281/zenodo.5172017)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Version-agnostic Zenodo archive of GDB-11</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Paper Accessibility</strong>: Closed-access (Published in JCIM 2007; no preprint available).</li>
<li><strong>Data Availability</strong>: The complete dataset is hosted on an open-access Zenodo repository (version-agnostic DOI): <a href="https://doi.org/10.5281/zenodo.5172017">10.5281/zenodo.5172017</a>.</li>
<li><strong>Software Dependencies (Closed/Commercial)</strong>:
<ul>
<li>Generation code is a closed-source Java (J2SE v5.0) application.</li>
<li>Relies on proprietary ChemAxon libraries (JChem v3.1, Marvin v4.0 API).</li>
<li>Virtual screening evaluation utilized the commercial Molinspiration <code>miscreen</code> toolkit.</li>
</ul>
</li>
<li><strong>Hardware Profile</strong>:
<ul>
<li><strong>CPUs</strong>: Two AMD Opteron 252 2.6 GHz processors</li>
<li><strong>Parallelization</strong>: 80-fold parallelization</li>
<li><strong>Compute Time</strong>: Approximately 20 hours for full generation</li>
</ul>
</li>
</ul>
<h3 id="force-field">Force Field</h3>
<p>A custom implementation of the MM2 force field was used for steric energy minimization during structure validation. It used the parameter set from Allinger, specifically adding a quartic term for bond stretching to prevent bond lengthening far from equilibrium:</p>
<p>$$
\begin{aligned}
E_{\text{Steric}} &amp;= \sum_{\text{bonds}} k_b(l_i - l_{0,i})^2 \left[1 + k&rsquo;_b(l_i - l_{0,i}) + k&rsquo;&rsquo;_b(l_i - l_{0,i})^2\right] \\
&amp;\quad + \sum_{\text{angles}} k_\theta(\theta_i - \theta_{0,i})^2 \left[1 + k&rsquo;_\theta(\theta_i - \theta_{0,i})^4\right] \\
&amp;\quad + \sum_{\text{angles}} k_{b,\theta}(\theta_i - \theta_{0,i})^2 \left[(l_a - l_{0,a}) + (l_b - l_{0,b})\right] \\
&amp;\quad + \sum_{\text{torsions}} \left[ \frac{V_1}{2} (1 + \cos \omega) + \frac{V_2}{2} (1 - \cos 2\omega) + \frac{V_3}{2} (1 + \cos 3\omega) \right] \\
&amp;\quad + \sum_{i=1}^N \sum_{j=i+1}^N \epsilon_{ij} \left[ A \exp \left( \frac{-B r_{ij}}{\sum r^{\ast}_{ij}} \right) - C \left( \frac{r_{ij}}{\sum r^{\ast}_{ij}} \right)^6 \right]
\end{aligned}
$$</p>
<h2 id="paper-information">Paper Information</h2>
<p>Fink, T. and Reymond, J.-L. (2007). Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F: Assembly of 26.4 Million Structures (110.9 Million Stereoisomers) and Analysis for New Ring Systems, Stereochemistry, Physicochemical Properties, Compound Classes, and Drug Discovery. <em>Journal of Chemical Information and Modeling</em>, 47(2), 342&ndash;353. <a href="https://doi.org/10.1021/ci600423u">https://doi.org/10.1021/ci600423u</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fink2007virtual,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Virtual exploration of the chemical universe up to 11 atoms of C, N, O, and F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fink, Tobias and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{342--353}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDB-17: Chemical Universe Database (166.4B Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-17/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-17/</guid><description>Dataset card for GDB-17, containing 166.4 billion small organic molecules representing the largest enumerated chemical space to date.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>The systematic enumeration of 166.4 billion organic molecules (GDB-17) up to 17 atoms, extending the known chemical universe into the drug-relevant size range. This reveals a highly dense novel chemical space that is measurably richer in complex stereochemical and three-dimensional structures compared to historically biased chemical databases.</p>
<h2 id="overview">Overview</h2>
<p>GDB-17 represents the largest enumerated database of drug-like small molecules, reaching the size range typical of lead compounds and approved drugs ($100 &lt; \text{MW} &lt; 350$ Da). It contains 166.4 billion structures consisting of up to 17 atoms of C, N, O, S, and halogens (F, Cl, Br, I). Because the bounds of combinatorial possibilities scale exponentially with heavy atom count (HAC), the MW distribution of the database sharply peaks in the $240$-$250 \text{ Da}$ range. Compared to known molecules in databases like PubChem, GDB-17 molecules are significantly richer in non-aromatic heterocycles, quaternary centers, and stereoisomers, avoiding &ldquo;flatland&rdquo; by deeply populating the third dimension in shape space.</p>
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/gdb_17_sample.webp"
         alt="Example GDB-17 molecule"
         title="Example GDB-17 molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example GDB-17 molecule (SMILES: <code>C1CC2C3CCCC3C3(C4CCC3CC4)C2C1</code>) demonstrating the complex polycyclic structures and 3D diversity characteristic of the database</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-17 (Full)</strong></td>
          <td>166.4B</td>
          <td>Complete enumeration of the database</td>
      </tr>
      <tr>
          <td><strong>GDBLL-17</strong></td>
          <td>29B</td>
          <td>Lead-like subset ($1 &lt; \text{clogP} &lt; 3$ and $100 &lt; \text{MW} &lt; 350$ Da)</td>
      </tr>
      <tr>
          <td><strong>GDBLLnoSR-17</strong></td>
          <td>22B</td>
          <td>Lead-like subset excluding compounds with small rings (3- or 4-membered)</td>
      </tr>
      <tr>
          <td><strong>Random Sample</strong></td>
          <td>50M</td>
          <td>Random 50M subset available for download, including pre-filtered lead-like and no-small-ring fractions</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>
<p><em>Note: As an enumerated database of theoretical structures, GDB-17 lacks standard supervised ML benchmarks. It functions primarily as a generative compass and foundational exploration library for unsupervised learning and molecular generation.</em></p>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-11</strong></td>
          <td>Predecessor</td>
          <td><a href="/notes/chemistry/datasets/gdb-11/">Notes</a></td>
      </tr>
      <tr>
          <td><strong>GDB-13</strong></td>
          <td>Predecessor</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths:</strong></p>
<ul>
<li><strong>3D Shape Space (&ldquo;Escape out of Flatland&rdquo;)</strong>: Populates the third dimension (spherical, non-planar shapes) significantly better than known structures in PubChem or ChEMBL, which are primarily flat and rod-like due to aromatic dominance</li>
<li><strong>Stereochemical Complexity</strong>: Averages 6.4 possible stereoisomers per molecule (compared to 2.0 in PubChem-17), driven by an abundance of non-planar features and small rings</li>
<li><strong>Massive Scaffold Diversity</strong>: Features 35-fold more Murcko scaffolds and 61-fold more ring systems than molecules of matching size in PubChem</li>
<li><strong>Rich in Known Drug Isomers</strong>: Contains millions of exact geometric and formula isomers of approved drugs, offering direct variations and &ldquo;methyl walk&rdquo; analogs</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li><strong>Experimental Gap</strong>: These are virtual, combinatorially enumerated molecules. Despite strict computational stability filtering, they remain unsynthesized and lack experimental validation.</li>
<li><strong>Small Ring Dominance</strong>: Up to 16 atoms, roughly 83% of the database consists of compounds with challenging small (3- or 4-membered) rings, though this drops for the 17-atom set, resulting in an overall 28% fraction of small ring compounds</li>
<li><strong>Elemental Scope Restrictions</strong>: Elements like P, Si, and B, which occasionally appear in drugs, are completely excluded</li>
<li><strong>Strict Stability Filters</strong>: Excludes some potentially viable functional groups strictly to manage the combinatorial explosion and avoid unstable structures (e.g., hemiacetals, aminals, acyclic acetals)</li>
<li><strong>Polarity Skew</strong>: The full database contains disproportionately more polar molecules ($\text{clogP} &lt; 0$) than reference sets, and its sheer size makes it computationally demanding to query using advanced docking or 3D shape tools</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>GDB-17 was generated from first principles through a highly filtered, multi-stage pipeline:</p>
<ol>
<li><strong>Graphs $\rightarrow$ Hydrocarbons</strong>: Started with 114.3 billion topologies (generated using GENG), filtered down to 5.4 million stable hydrocarbons by applying geometrical strain rules (H-filters).</li>
<li><strong>Hydrocarbons $\rightarrow$ Skeletons</strong>: Substituted single bonds with double and triple bonds to yield 1.3 billion skeletons, simultaneously removing reactive unsaturations like allenes (S-filters).</li>
<li><strong>Skeletons $\rightarrow$ CNO Molecules</strong>: Diversified into 110.4 billion molecules by combinatorially substituting C with N and O, explicitly avoiding heteroatom-heteroatom bounds and enforcing stability filters (F-filters).</li>
<li><strong>Post-processing</strong>: Added diversity by transforming groups to generate aromatics, oximes, $\text{CF}_3$, halogens, and sulfones (P-filters), yielding the final 166.4 billion count.</li>
</ol>
<h3 id="hardware--software">Hardware &amp; Software</h3>
<ul>
<li><strong>Compute</strong>: Mastered over 40,000 jobs spread across a 360-CPU cluster, consuming 100,000 CPU hours (~11 CPU years)</li>
<li><strong>Software</strong>: Powered by <strong>GENG</strong> (Nauty package) for graph generation, <strong>CORINA</strong> for 3D stereoisomer generation, and ChemAxon JChem libraries running inside custom Java 1.6 applications</li>
</ul>
<h3 id="shape-analysis-pmi">Shape Analysis (PMI)</h3>
<p>To quantitatively define the &ldquo;escape from flatland,&rdquo; the origin paper classifies molecular shape using the normalized Principal Moments of Inertia (PMI) of the generated 3D conformers. The principal moments ($I_1 \le I_2 \le I_3$) are derived by diagonalizing the standard moment of inertia tensor. Molecules are plotted within a normalized 2D triangular space mapped by the ratios:</p>
<p>$$ P_1 = \frac{I_1}{I_3}, \quad P_2 = \frac{I_2}{I_3} $$</p>
<p>The vertices of this plot define the three geometrical boundaries of chemical space:</p>
<ul>
<li><strong>Rod-like (1D)</strong>: $(0, 1)$ typical of stretched alkanes</li>
<li><strong>Disc-like (2D)</strong>: $(0.5, 0.5)$ typical of flat aromatics like benzene</li>
<li><strong>Sphere-like (3D)</strong>: $(1, 1)$ typical of globular structures like cubane</li>
</ul>
<p>GDB-17&rsquo;s core structural finding is that mathematically enumerated chemical space thickly populates the interior and $(1,1)$ spherical regions of this plot, demonstrating significant 3D structure. Empirical libraries traditionally cluster densely along the rod-to-disc axis.</p>
<h3 id="differences-from-gdb-13">Differences from GDB-13</h3>
<ul>
<li>The algorithm was completely rewritten optimizing memory efficiency, boosting computing speed roughly 400-fold and allowing enumeration beyond the previous 13-atom limit</li>
<li>Scope aggressively expanded to include all functional halogens (F, Cl, Br, I) within the base framework</li>
<li>Introduced intensive, size-dependent graph selection filters (prohibiting complex bridgeheads and completely eliminating small rings in 17-atom graphs) to manage combinatorial explosion</li>
<li>Functional post-processing cycles deliberately decoupled to add features like cyclic oximes, aromatic halogens, and sulfones that would otherwise be rejected or break underlying generation constraints</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Paper Accessibility</strong>: The original paper is published in the <em>Journal of Chemical Information and Modeling</em> and is available as an Open Access publication under a CC-BY license.</li>
<li><strong>Data Availability</strong>: The full 166.4 billion molecule dataset is not publicly available for download (estimated &gt;400 GB compressed). However, a 50 million random subset and pre-filtered lead-like fractions are openly available on the <a href="https://gdb.unibe.ch/downloads/">GDB website</a> and archived on <a href="https://zenodo.org/records/5172018">Zenodo</a>.</li>
<li><strong>Code &amp; Algorithms</strong>: The enumeration rules and logic are well-described in the paper, but the actual Java 1.6 source code has not been released.</li>
<li><strong>Dependencies</strong>: The pipeline is a mix of open-source and proprietary software tools. Graph generation uses open-source GENG (Nauty), while chemical logic and stereoisomer generation rely on proprietary ChemAxon JChem libraries and CORINA.</li>
<li><strong>Hardware Specifications</strong>: The original database generation was explicitly parallelized across a 360-CPU cluster, consuming 100,000 CPU hours (approximately 11 CPU years) with over 40,000 calculation runs.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Ruddigkeit, L., van Deursen, R., Blum, L. C., and Reymond, J.-L. (2012). Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. <em>Journal of Chemical Information and Modeling</em>, 52(11), 2864&ndash;2875. <a href="https://doi.org/10.1021/ci300415d">https://doi.org/10.1021/ci300415d</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Ruddigkeit_2012,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{52}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{1549-960X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1021/ci300415d}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1021/ci300415d}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society (ACS)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ruddigkeit, Lars and van Deursen, Ruud and Blum, Lorenz C. and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span>=nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2864--2875}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDB-13: Chemical Universe Database (970M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-13/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-13/</guid><description>A dataset card for the Generated Database 13 (GDB-13), a database of nearly 1 billion small organic molecules for virtual screening and drug discovery.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/gdb_13_sample.webp"
         alt="Example GDB-13 molecule"
         title="Example GDB-13 molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example GDB-13 molecule (SMILES: <code>CCCC(O)(CO)CC1CC1CN</code>)</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>C/N/O Set</strong></td>
          <td>~910.1M</td>
          <td>Molecules containing up to 13 atoms of Carbon, Nitrogen, and Oxygen.</td>
      </tr>
      <tr>
          <td><strong>Cl/S Set</strong></td>
          <td>~67.3M</td>
          <td>Molecules containing up to 13 atoms, adding Sulfur (aromatic heterocycles, sulfones, sulfonamides, thioureas) and Chlorine (aromatic substituents).</td>
      </tr>
  </tbody>
</table>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-11</strong></td>
          <td>Predecessor</td>
          <td><a href="/notes/chemistry/datasets/gdb-11/">Notes</a></td>
      </tr>
      <tr>
          <td><strong>GDB-17</strong></td>
          <td>Successor</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="key-contribution">Key Contribution</h2>
<p>The creation and release of the 977.5 million-compound GDB-13, a significant expansion in molecular size (up to 13 atoms) and elemental diversity (including S and Cl) made possible by key algorithmic optimizations that significantly accelerated the enumeration process.</p>
<h2 id="overview">Overview</h2>
<p>GDB-13 extends the systematic enumeration of drug-like chemical space to molecules containing up to 13 atoms of Carbon, Nitrogen, Oxygen, Sulfur, and Chlorine. Building on the methodology established in GDB-11, this database represents a 37-fold increase in size while maintaining 100% Lipinski compliance for virtual screening applications. The enumeration results in a vast array of cyclic topologies, where 54% of the database comprises molecules with at least one three- or four-membered ring.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Systematic coverage of structures with up to 13 atoms</li>
<li>High drug-likeness: 100% Lipinski compliance and 99.5% Vieth compliance</li>
<li>High proportion of leadlike (98.9%) and fragmentlike (45.1%) molecules</li>
<li>Structural novelty providing fragments absent from established databases like ZINC, ACX, and PubChem</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Limited to small molecules with up to 13 atoms of C, N, O, S, and Cl</li>
<li>Omits 66.2% of known chemical space up to 13 atoms found in external databases</li>
<li>Excludes specific nonenumerated elements (F, Br, I, P, Si, metals) and functional groups (chlorine on nonaromatic carbons, mercaptans, sulfoxides, enamines, allenes)</li>
<li>Excludes highly strained molecules and highly polar combinations</li>
<li>Consists entirely of computer-generated structures pending experimental validation</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="algorithmic-approach">Algorithmic Approach</h3>
<p><strong>Type</strong>: Rule-Based Combinatorial Graph Enumeration</p>
<p>This approach relies on <strong>combinatorial enumeration</strong>. It utilizes a rule-based graph generation algorithm (GENG) paired with chemical stability filters to construct the dataset.</p>
<p><strong>Process</strong>:</p>
<ol>
<li>Start with mathematical graphs representing saturated hydrocarbons up to 13 nodes using GENG (non-planar graphs discarded)</li>
<li>Apply topological filters to remove highly strained small ring systems (e.g., fused cyclopropanes and bridgehead 3/4-membered rings)</li>
<li>Generate 3D structures via CORINA or ChemAxon to apply a 3D volume-based strain filter. The local strain of a tetravalent carbon is estimated by the volume $V$ of the tetrahedron formed by extending a $1 \text{ \AA}$ line along its four single bonds. Hydrocarbons with planar or pyramidal carbon centers are discarded if:
$$ V &lt; 0.345 \text{ \AA}^3 $$</li>
<li>Introduce unsaturations and heteroatoms through systematic substitution</li>
<li>Apply chemical rule filters and element-ratio heuristics to ensure stability and drug-likeness</li>
<li>Apply post-processing algorithms to introduce nitro groups, nitriles, aromatic chlorines, thiophenes, sulfonamides, and thioureas</li>
</ol>
<p><strong>Key Optimization</strong>: Replaced computationally expensive MM2 minimization (used in GDB-11) with a fast geometry-based estimation of strained polycyclic ring systems, combined with fast &ldquo;element-ratio&rdquo; filters. This achieved a <strong>6.4-fold speedup</strong> in structure validation early in the pipeline.</p>
<h3 id="differences-from-gdb-11">Differences from GDB-11</h3>
<ul>
<li><strong>Element Selection</strong>: Fluorine removed from allowed elements; sulfur and chlorine added for higher drug relevance (e.g., thiophenes, sulfonamides).</li>
<li><strong>Optimization Method</strong>: MM2-based structure optimization replaced with a much faster, custom geometry-based estimation of local strain (measuring the tetrahedron volume of carbon centers).</li>
<li><strong>Heuristic Filters</strong>: Fast elemental ratio filters added to quickly reject highly polar, unstable combinations early in the pipeline.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="paper--data-availability">Paper &amp; Data Availability</h3>
<ul>
<li><strong>Paper Access</strong>: The original paper is published in the Journal of the American Chemical Society (JACS) and is closed-access/paywalled. No open-access preprint exists on arXiv or ChemRxiv.</li>
<li><strong>Data Access</strong>: The full GDB-13 database and its subsets are freely available via the <a href="https://gdb.unibe.ch/downloads/">Reymond Group Downloads Page</a> and are persistently hosted on <a href="https://doi.org/10.5281/zenodo.5172018">Zenodo</a>.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">GDB-13 Database (Reymond Group)</a></td>
          <td>Dataset</td>
          <td>Free download</td>
          <td>Official download page hosted by the Reymond Group</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5172018">GDB-13 on Zenodo</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Persistent archival copy</td>
      </tr>
  </tbody>
</table>
<h3 id="source-code--algorithms">Source Code &amp; Algorithms</h3>
<p>The exact custom source code (e.g., GENG orchestration, local strain filters) is not publicly available. Researchers must re-implement the rules strictly described in the paper and supplementary materials.</p>
<h3 id="heuristic-filters">Heuristic Filters</h3>
<p>Implemented element-ratio filters derived from analyzing known compound databases to reject chemically unstable or highly polar molecules early in the generation pipeline:</p>
<p>$$
\begin{aligned}
\frac{N + O}{C} &amp;&lt; 1.0 \\
\frac{N}{C} &amp;&lt; 0.571 \\
\frac{O}{C} &amp;&lt; 0.666
\end{aligned}
$$</p>
<h3 id="excluded-functional-groups">Excluded Functional Groups</h3>
<ul>
<li>O-O bonds (peroxides)</li>
<li>Hemiacetals, aminals, acyclic imines, non-aromatic enols</li>
<li>Compounds containing both primary/secondary amines and aldehydes/ketones</li>
<li>Nonenumerated elements (F, Br, I, P, Si, metals)</li>
<li>High-heteroatom ratio structures (e.g., mannitol)</li>
</ul>
<h3 id="hardware--compute">Hardware &amp; Compute</h3>
<ul>
<li><strong>Compute Cost</strong>: ~40,000 CPU hours for the 910 million C/N/O structures.</li>
<li><strong>Infrastructure</strong>: Executed in parallel on a <strong>500-node cluster</strong></li>
<li><strong>Assembly Optimization</strong>: The switch from MM2 minimization to geometry-based estimation of strained polycyclic ring systems, alongside element-ratio filters, reduced assembly time 6.4-fold comparing GDB-11 workloads (1600 CPU hours to 250 CPU hours).</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Blum, L. C. and Reymond, J.-L. (2009). 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. <em>Journal of the American Chemical Society</em>, 131(25), 8732&ndash;8733. <a href="https://doi.org/10.1021/ja902302h">https://doi.org/10.1021/ja902302h</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blum2009gdb13,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{970 million druglike small molecules for virtual screening in the chemical universe database GDB-13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blum, Lorenz C and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{131}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8732--8733}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/ja902302h}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GEOM Dataset: 3D Molecular Conformer Generation</title><link>https://hunterheidenreich.com/posts/geom-conformer-generation-dataset/</link><pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/geom-conformer-generation-dataset/</guid><description>Learn how GEOM transforms 2D molecular graphs into dynamic 3D conformer ensembles for molecular machine learning applications.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In molecular machine learning, we often start with a 2D graph, a blueprint of atoms and bonds. A molecule&rsquo;s function is deeply tied to its dynamic 3D shape. Molecules are flexible entities that exist as an <strong>ensemble of low-energy conformations</strong>. Capturing 3D molecular shapes is crucial for predicting molecular behavior.</p>
<p>The <a href="/notes/chemistry/datasets/geom/">GEOM</a> (Geometric Ensemble Of Molecules) dataset was created to bridge this gap. It provides a massive collection of high-quality 3D conformer ensembles, transforming static 2D graphs into something much closer to physical reality. This makes it an invaluable resource for anyone working in geometric deep learning for chemistry and drug discovery.</p>















<figure class="post-figure center ">
    <img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-022-01288-4/MediaObjects/41597_2022_1288_Fig1_HTML.png?as=webp"
         alt="Overlay of conformers for a complex molecule"
         title="Overlay of conformers for a complex molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">3D conformer ensembles expand upon 2D blueprints by revealing the diverse shapes the latanoprost molecule adopts.</figcaption>
    
</figure>

<h2 id="the-challenge-of-conformer-generation">The Challenge of Conformer Generation</h2>
<p>Generating 3D structures for every molecule is computationally hard for two main reasons:</p>
<ol>
<li><strong>Combinatorial Explosion</strong>: Think of a molecule with several rotatable bonds. Each bond is like a joint that can be twisted. The number of possible 3D shapes grows exponentially with each new joint. Trying every combination is impractical for most molecules.</li>
<li><strong>Speed vs. Accuracy</strong>: We need to calculate the energy of each shape to know if it&rsquo;s realistic (low energy). Classical <strong>force fields</strong> are fast. <strong>Density Functional Theory (DFT)</strong> provides quantum mechanical accuracy.</li>
</ol>
<p>GEOM uses a semi-empirical method to capture the underlying quantum mechanics efficiently, enabling the generation of millions of conformations for a large dataset.</p>
<h2 id="a-deeper-look-inside-the-geom-dataset">A Deeper Look Inside the GEOM Dataset</h2>
<p>The scale of GEOM is impressive: over <strong>37 million conformations</strong> for more than <strong>450,000 unique molecules</strong>. But the numbers in the paper&rsquo;s tables tell a more interesting story about the dataset&rsquo;s composition.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">AICures drug dataset (N=304,466)</th>
          <th style="text-align: left">Mean</th>
          <th style="text-align: left">Max</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Number of heavy atoms</td>
          <td style="text-align: left">24.9</td>
          <td style="text-align: left">91</td>
      </tr>
      <tr>
          <td style="text-align: left">Number of rotatable bonds</td>
          <td style="text-align: left">6.5</td>
          <td style="text-align: left">53</td>
      </tr>
      <tr>
          <td style="text-align: left">Conformers</td>
          <td style="text-align: left">102.6</td>
          <td style="text-align: left">7,451</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>QM9 dataset (N=133,258)</strong></td>
          <td style="text-align: left"><strong>Mean</strong></td>
          <td style="text-align: left"><strong>Max</strong></td>
      </tr>
      <tr>
          <td style="text-align: left">Number of heavy atoms</td>
          <td style="text-align: left">8.8</td>
          <td style="text-align: left">9</td>
      </tr>
      <tr>
          <td style="text-align: left">Number of rotatable bonds</td>
          <td style="text-align: left">2.2</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left">Conformers</td>
          <td style="text-align: left">13.5</td>
          <td style="text-align: left">1,101</td>
      </tr>
  </tbody>
</table>
<p><em>A simplified view of Tables 1 &amp; 4 from the paper, highlighting the key differences.</em></p>
<p>What does this tell us?</p>
<ul>
<li><strong>Two Worlds of Molecules</strong>: The dataset is clearly split. The <strong>QM9</strong> subset contains small, relatively rigid molecules (mean of 2.2 rotatable bonds). In contrast, the <strong>AICures</strong> subset contains larger, more flexible drug-like molecules (mean of 6.5 rotatable bonds, with one molecule having 53!). This diversity is ideal for training machine learning models that need to generalize from simple cases to complex, real-world examples.</li>
<li><strong>Conformational Complexity</strong>: The number of conformers found per molecule reflects this flexibility. A typical QM9 molecule has about 13 conformers, while a drug-like molecule has over 100 on average. This highlights the necessity of 3D ensembles for flexible molecules.</li>
</ul>
<p>Beyond the structures themselves, GEOM is rich with experimental data, connecting the 3D shapes to real-world properties. The molecules are labeled with data for everything from <strong>water solubility</strong> and <strong>blood-brain barrier penetration</strong> to <strong>toxicity</strong> and inhibition of key viral targets like the <strong>SARS-CoV-2 3CL protease</strong>. This makes it a powerful tool for developing property prediction models.</p>
<p>In fact, this creates a benchmark for:</p>
<ul>
<li>Property prediction models that can leverage conformer ensembles (or members of the ensemble) as input.</li>
<li>Conformer generation models that must transform 2D graphs into realistic, 3D distributions.</li>
<li>End-to-end property-based evaluation of the conformer ensembles generated by a model.</li>
</ul>
<h2 id="the-toolbox-behind-geom-key-techniques-explained">The Toolbox Behind GEOM: Key Techniques Explained</h2>
<p>The GEOM paper mentions several advanced computational chemistry methods. Let&rsquo;s briefly break down the most important ones:</p>
<ul>
<li><strong>GFN2-xTB</strong>: This is the semi-empirical quantum mechanical method used to calculate energies and forces in GEOM. Think of it as a &ldquo;middle ground&rdquo; method. It provides greater speed than full DFT while capturing electronic effects absent in classical force fields, making it a pragmatic choice for generating a large dataset.</li>
<li><strong>CREST</strong>: This is the program that actually performs the conformer search. It uses a clever technique based on <strong>metadynamics</strong>, where it simulates the molecule&rsquo;s movement and adds a &ldquo;penalty&rdquo; potential to discourage it from revisiting shapes it has already seen. This pushes the molecule to explore its conformational space efficiently, finding many diverse, low-energy structures.</li>
<li><strong>CENSO</strong>: For a small subset of molecules, the authors went a step further with CENSO. This program takes the conformers found by CREST and refines them with more accurate (and expensive) DFT calculations. It&rsquo;s a way of getting very high-quality &ldquo;gold standard&rdquo; data for benchmarking.</li>
<li><strong>Implicit Solvent Models</strong>: Molecules in the body exist in aqueous environments. Methods like <strong>C-PCM</strong> and <strong>ALPB</strong> model water as a continuous medium, which affects the molecule&rsquo;s preferred shape and energy. This is crucial for biological applications.</li>
</ul>
<h2 id="the-math-behind-the-molecules-explained-simply">The Math Behind the Molecules (Explained Simply)</h2>
<p>The paper includes a couple of equations based on the Boltzmann distribution, which is a fundamental concept from statistical mechanics that tells us the probability of finding a system in a certain state.</p>
<p>The key equation used by CREST to assign a probability (or &ldquo;statistical weight&rdquo;) to the <em>i</em>-th conformer is:</p>
<p>$$ P_{i}^{\text{CREST}} = \frac{d_{i}\exp(-E_{i}/k_{B}T)}{\sum_{j}d_{j}\exp(-E_{j}/k_{B}T)} $$</p>
<p>Let&rsquo;s demystify this:</p>
<ul>
<li>$E_i$ is the energy of the conformer. The negative sign and the exponential mean that <strong>lower energy leads to a much higher probability</strong>.</li>
<li>$k_B T$ is the thermal energy at a given temperature $T$. It sets the energy scale. If the energy difference between two conformers is much larger than $k_B T$, the higher-energy one will be virtually nonexistent.</li>
<li>$d_i$ represents the degeneracy of the conformer, which accounts for the number of equivalent states or configurations that share the same energy $E_i$.
<ul>
<li>Degeneracy refers to the number of equivalent, indistinguishable atomic arrangements (rotamers) that correspond to a single overall molecular shape (conformer). For example, the rotation of a methyl group ($-\text{CH}_3$) produces multiple identical-looking orientations of its hydrogen atoms.</li>
</ul>
</li>
<li>The denominator, $\sum_{j}d_{j}\exp(-E_{j}/k_{B}T)$, is the <strong>partition function</strong>. Its job is to sum up the terms from all possible conformers to ensure that all the probabilities add up to 100%.</li>
</ul>
<p>For the high-quality CENSO calculations, the equation uses the <strong>Gibbs Free Energy ($G_i$)</strong>. Free energy provides a complete measure by including the molecule&rsquo;s internal energy, its interaction with a solvent, and entropic effects (like how much it can &ldquo;wiggle&rdquo;). This gives a more accurate ranking of the conformer probabilities.</p>
<h2 id="a-closer-look-at-the-figures-what-the-data-really-shows">A Closer Look at the Figures: What the Data Really Shows</h2>
<p>The paper&rsquo;s figures offer some honest insights into the dataset&rsquo;s quality and the trade-offs involved.</p>















<figure class="post-figure center ">
    <img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-022-01288-4/MediaObjects/41597_2022_1288_Fig4_HTML.png?as=webp"
         alt="Scatter plot comparing energy calculation methods."
         title="Scatter plot comparing energy calculation methods."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Comparing the &lsquo;fast&rsquo; GFN2-xTB energies with &lsquo;accurate&rsquo; DFT energies. (a) There&rsquo;s a clear correlation, but also a lot of spread. (b) The ranking accuracy (Spearman ρ) is decent on average (0.39) but highly variable.</figcaption>
    
</figure>

<p>Figure 4 is particularly important. It compares the fast GFN2-xTB (CREST) energies with much more accurate single-point r2scan-3c DFT energies.</p>
<ul>
<li>The <strong>Mean Absolute Error (MAE) of 1.96 kcal/mol</strong> shows that, on average, the fast method gets the energy wrong by about 2 kcal/mol. At room temperature, the thermal energy ($k_B T$) is only about 0.6 kcal/mol. Because the Boltzmann probability depends on the energy _exponentially_, a 2 kcal/mol error can dramatically change the predicted importance of a conformer.</li>
<li>The <strong>Spearman correlation plot</strong> (right side) shows how well GFN2-xTB <em>ranks</em> the conformers from lowest to highest energy compared to DFT. An average correlation of 0.39 provides a strong baseline, though the wide distribution indicates variable performance across different molecules. The ranking accuracy fluctuates, achieving near perfection for certain molecules and showing significant deviation for others.</li>
</ul>
<p>This is a key takeaway: the GFN2-xTB/CREST method excels at <strong>discovering</strong> low-energy shapes. For accurate probability <strong>ranking</strong>, the higher-level DFT energies provided in GEOM are required.</p>
<h2 id="conclusion-what-this-means-for-machine-learning">Conclusion: What This Means for Machine Learning</h2>
<p>For researchers at the intersection of machine learning and chemistry, GEOM provides a realistic foundation to build upon. By shifting the focus from static 2D graphs to dynamic 3D ensembles, GEOM enables a new generation of models.</p>
<p>This dataset is an ideal training ground for models designed to understand 3D geometry, such as <strong>SE(3)-equivariant neural networks</strong>, <strong>diffusion models</strong>, <strong>transformers</strong>, and <strong>VAEs</strong>, which can learn to generate conformer ensembles directly from a 2D graph. By training on GEOM, these models can learn the complex relationship between a molecule&rsquo;s chemical blueprint and its real-world, flexible nature.</p>
<p>For a comprehensive technical reference including detailed specifications, quality metrics, and performance leaderboards, see my <a href="/notes/chemistry/datasets/geom/">GEOM Dataset Card</a>.</p>
<p>Explore the GEOM dataset further by visiting its <a href="https://github.com/learningmatter-mit/geom">GitHub repository</a>.</p>
<h2 id="references">References</h2>
<ul>
<li>Axelrod, S. &amp; Gómez-Bombarelli, R. &ldquo;GEOM, energy-annotated molecular conformations for property prediction and molecular generation.&rdquo; <em>Scientific Data</em> 9, 185 (2022). <a href="https://doi.org/10.1038/s41597-022-01288-4">https://doi.org/10.1038/s41597-022-01288-4</a></li>
<li>GitHub repositories:
<ul>
<li><a href="https://github.com/learningmatter-mit/geom">learningmatter-mit/geom</a></li>
<li><a href="https://github.com/learningmatter-mit/NeuralForceField">learningmatter-mit/NeuralForceField</a></li>
</ul>
</li>
</ul>
]]></content:encoded></item><item><title>Synthetic Isomer Data Generation Pipeline</title><link>https://hunterheidenreich.com/projects/isomer-dataset-generation/</link><pubDate>Sat, 09 Mar 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/isomer-dataset-generation/</guid><description>An end-to-end cheminformatics pipeline transforming 1D chemical formulas into 3D conformer datasets using graph enumeration and physics-based featurization.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>In computational drug discovery, data scarcity is often the bottleneck. This project builds a synthetic data generator that creates labeled 3D molecular datasets starting from nothing but a raw chemical formula (e.g., $C_6H_{14}$).</p>
<p>The pipeline bridges the gap between <strong>1D Chemical Information</strong> (stoichiometry) and <strong>3D Geometric Data</strong> (conformers), effectively serving as a &ldquo;data factory&rdquo; for training molecular machine learning models.</p>
<h2 id="features">Features</h2>
<h3 id="1-graph-enumeration--3d-embedding">1. Graph Enumeration &amp; 3D Embedding</h3>
<p>The core of the project is <code>pysomer/data/gen.py</code>, which orchestrates a multi-step generation process:</p>
<ul>
<li><strong>Structural Isomerism:</strong> Uses <strong>MAYGEN</strong> (via a Java bridge) to mathematically enumerate all valid graph connectivities for a given formula</li>
<li><strong>Conformer Sampling:</strong> Uses <strong>RDKit</strong> to embed these graphs into 3D space, generating multiple conformers (rotamers) per isomer to capture flexibility</li>
<li><strong>IUPAC Labeling:</strong> Automatically queries PubChem APIs to assign human-readable labels (e.g., &ldquo;2-methylpentane&rdquo;) to the generated structures</li>
</ul>
<h3 id="2-physics-aware-featurization">2. Physics-Aware Featurization</h3>
<p>The pipeline computes <strong>Coulomb Matrices</strong>, ensuring the input respects physical invariants:</p>
<p>$$C_{ij} = \begin{cases} 0.5 Z_i^{2.4} &amp; i = j \ \frac{Z_i Z_j}{|R_i - R_j|} &amp; i \neq j \end{cases}$$</p>
<p>This representation encodes the electrostatic potential of the molecule, providing a more informative signal for the neural network than raw Cartesian coordinates.</p>
<h3 id="3-hdf5-data-storage">3. HDF5 Data Storage</h3>
<p>To handle the large volume of generated conformers, the system writes to hierarchical <strong>HDF5</strong> files. This allows for efficient, chunked I/O during training, a critical pattern for scaling to larger chemical spaces.</p>
<h2 id="usage">Usage</h2>
<p>The pipeline is executed via a CLI, taking a chemical formula as input and outputting an HDF5 dataset of 3D conformers.</p>
<h2 id="results">Results</h2>
<p>This project serves as a &ldquo;vertical slice&rdquo; of a cheminformatics workflow.</p>
<ul>
<li><strong>The Good:</strong> The separation of concerns is clean: <code>dataclasses</code> for configuration and HDF5 for storage keep the data-engineering layer tidy and extensible.</li>
<li><strong>The &ldquo;Old School&rdquo;:</strong> The model used is a simple Multi-Layer Perceptron (MLP) on flattened Coulomb Matrices. In a modern production setting (post-2020), I would replace this with an <strong>E(3)-Equivariant GNN</strong> (like SchNet or E3NN) to handle rotational symmetry natively, eliminating manual feature engineering.</li>
<li><strong>Dependency Management:</strong> The reliance on an external Java JAR (<code>MAYGEN</code>) for graph enumeration makes the environment brittle. Today, I would likely swap this for a pure Python enumerator or a containerized microservice to improve portability.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<p>This data pipeline powers the analysis in my comprehensive guide on molecular representation:</p>
<ul>
<li><a href="/posts/alkane-constitutional-isomer-classification/">Coulomb Matrix Eigenvalues: Can You Hear the Shape of a Molecule?</a>: A deep dive into data generation, unsupervised clustering, and supervised classification of alkane isomers.</li>
</ul>
<p>See also:</p>
<ul>
<li><a href="/posts/molecular-descriptor-coulomb-matrix/">The Coulomb Matrix</a>: Deep dive into the physics-based featurization used here</li>
<li><a href="/notes/chemistry/molecular-representations/notations/number-of-isomeric-hydrocarbons/">The Number of Isomeric Hydrocarbons</a>: The foundational 1931 paper on alkane enumeration</li>
</ul>
]]></content:encoded></item><item><title>Sarcasm Detection with Transformers: A Cautionary Tale</title><link>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</link><pubDate>Sun, 25 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</guid><description>Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that learned to classify news sources.</description><content:encoded><![CDATA[<h2 id="why-sarcasm-detection-is-hard">Why Sarcasm Detection Is Hard</h2>
<p>Sarcasm detection represents one of the most challenging problems in NLP. The difficulties include:</p>
<p><strong>Context dependence</strong>: Sarcasm relies on situational knowledge and shared understanding that extends beyond the text itself.</p>
<p><strong>Subtlety</strong>: Even humans struggle with sarcastic interpretation, especially in written text without vocal cues.</p>
<p><strong>Cultural variability</strong>: Sarcastic expressions vary significantly across cultures and regions.</p>
<p><strong>Annotation disagreement</strong>: Human annotators often disagree on what constitutes sarcasm.</p>
<p>These challenges raise a fundamental question: can sarcasm detection be well-defined as a computational problem? This case study explores what happens when we try (and reveals a common pitfall in dataset construction).</p>
<h2 id="the-dataset-a-hidden-flaw">The Dataset: A Hidden Flaw</h2>
<p>I used the <a href="https://huggingface.co/datasets/raquiba/Sarcasm_News_Headline">Sarcasm News Headlines dataset</a>, which combines headlines from <a href="https://theonion.com/">The Onion</a> (satirical) and <a href="https://www.huffpost.com/">The Huffington Post</a> (traditional news). The dataset contains ~50,000 examples.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> datasets <span style="color:#f92672">import</span> load_dataset
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> load_dataset(<span style="color:#e6db74">&#34;raquiba/Sarcasm_News_Headline&#34;</span>)
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">1</span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>{&#39;headline&#39;: &#39;thirtysomething scientists unveil doomsday clock of hair loss&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 1}
</span></span><span style="display:flex;"><span>{&#39;headline&#39;: &#39;dem rep. totally nails why congress is falling short on gender, racial equality&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 0}
</span></span></code></pre></div><p><strong>The critical flaw</strong>: This dataset uses binary classification based on source domain. The Onion headlines are labeled sarcastic, HuffPost headlines are not. This creates a dangerous shortcut where models learn to detect the publication source.</p>
<p>After preprocessing to standardize column names:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">lambda</span> example: {<span style="color:#e6db74">&#34;text&#34;</span>: example[<span style="color:#e6db74">&#34;headline&#34;</span>], <span style="color:#e6db74">&#34;label&#34;</span>: example[<span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]},
</span></span><span style="display:flex;"><span>    remove_columns<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;headline&#34;</span>, <span style="color:#e6db74">&#34;article_link&#34;</span>, <span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="fine-tuning-roberta">Fine-Tuning RoBERTa</h2>
<p>I fine-tuned a pre-trained RoBERTa model using standard practices:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model_name <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;FacebookAI/roberta-base&#34;</span>
</span></span><span style="display:flex;"><span>tokenizer <span style="color:#f92672">=</span> AutoTokenizer<span style="color:#f92672">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(model_name, num_labels<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Tokenize the data</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">tokenize_function</span>(examples):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> tokenizer(examples[<span style="color:#e6db74">&#34;text&#34;</span>], truncation<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, max_length<span style="color:#f92672">=</span><span style="color:#ae81ff">512</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tokenized_datasets <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(tokenize_function, batched<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Training configuration</span>
</span></span><span style="display:flex;"><span>training_args <span style="color:#f92672">=</span> TrainingArguments(
</span></span><span style="display:flex;"><span>    output_dir<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;./results&#34;</span>,
</span></span><span style="display:flex;"><span>    num_train_epochs<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>,
</span></span><span style="display:flex;"><span>    per_device_train_batch_size<span style="color:#f92672">=</span><span style="color:#ae81ff">32</span>,
</span></span><span style="display:flex;"><span>    evaluation_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    save_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    load_best_model_at_end<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer <span style="color:#f92672">=</span> Trainer(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span>model,
</span></span><span style="display:flex;"><span>    args<span style="color:#f92672">=</span>training_args,
</span></span><span style="display:flex;"><span>    train_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;train&#34;</span>],
</span></span><span style="display:flex;"><span>    eval_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;test&#34;</span>],
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#f92672">=</span>tokenizer,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer<span style="color:#f92672">.</span>train()
</span></span></code></pre></div><h2 id="results-too-good-to-be-true">Results: Too Good to Be True</h2>
<p>The model achieved high accuracy:</p>
<table>
  <thead>
      <tr>
          <th>Epoch</th>
          <th>Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>96.3%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>97.8%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>99.8%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>99.8%</td>
      </tr>
  </tbody>
</table>
<p>This should immediately raise red flags. Sarcasm detection is notoriously difficult, even for humans. Such high accuracy suggests the model learned a proxy task.</p>
<p>My hypothesis: <strong>The model bypassed sarcasm detection entirely, learning only to distinguish between The Onion and HuffPost writing styles.</strong></p>
<h2 id="interacting-with-the-model">Interacting with the Model</h2>
<p>Let&rsquo;s test our hypothesis by interacting with the model.</p>
<p>First, let&rsquo;s load the model and tokenizer:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> pipeline
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(<span style="color:#e6db74">&#39;results/2024-02-25_20-24-51/checkpoint-4475&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>clf <span style="color:#f92672">=</span> pipeline(<span style="color:#e6db74">&#39;text-classification&#39;</span>, model<span style="color:#f92672">=</span>model, tokenizer<span style="color:#f92672">=</span>tokenizer)
</span></span></code></pre></div><p>Now, let&rsquo;s test the model with some examples.</p>
<p>First, let&rsquo;s try an Onion article from this week, something I know to be sarcastic and not in the training data.
Let&rsquo;s use <a href="https://theonion.com/alabama-supreme-court-justice-invokes-veggietales-in-1851282252/">&ldquo;Alabama Supreme Court Justice Invokes &lsquo;VeggieTales&rsquo; In Ruling&rdquo;</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Alabama Supreme Court Justice Invokes ‘VeggieTales&#39; In Ruling&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.99916672706604}]
</span></span></code></pre></div><p>The model is extremely confident that this is not sarcastic.</p>
<p>Let&rsquo;s try a different Onion article, possibly even more difficult: <a href="https://theonion.com/trump-booed-frozen-burritos-and-more-this-week-in-br-1851282066/">Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993497729301453}]
</span></span></code></pre></div><p>Again, very confident that this is not sarcastic. Hmm. It could be the temporal accuracy of our model just cannot capture the sarcasm of the Onion in 2024.</p>
<p>Let&rsquo;s try one more Onion article, this one that is still recent but a bit more of a low-hanging fruit: <a href="https://theonion.com/mom-only-likes-the-other-outback-steakhouse-1851265335/">Mom Only Likes The Other Outback Steakhouse</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Mom Only Likes The Other Outback Steakhouse&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_1&#39;, &#39;score&#39;: 0.9997231364250183}]
</span></span></code></pre></div><p>Finally, a correct prediction! The model is confident that this is sarcastic.
Our model detects only very specific types of sarcasm. It fails to generalize to new, unseen data within the same domain.</p>
<p>Let&rsquo;s also try some headlines from the Huffington Post, which the model should predict as not sarcastic.
Let&rsquo;s try the five most recent headlines from the Huffington Post:</p>
<ul>
<li><a href="https://www.huffpost.com/entry/donald-trump-south-carolina-nikki-haley_n_65db61f5e4b0e4346d52bed8">Donald Trump Won South Carolina - But There&rsquo;s 1 Big Caveat</a></li>
<li><a href="https://www.huffpost.com/entry/israeli-embassy-washington-man-set-fire_n_65db9364e4b0e4346d52ce3d">Man Sets Himself On Fire In Front Of Israeli Embassy In Washington</a></li>
<li><a href="https://www.huffpost.com/entry/bc-ml-israel-palestinians-temporary-truce-cease-fire_n_65db2e9ae4b0189a6a7e32ea">Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange</a></li>
<li><a href="https://www.huffpost.com/entry/george-latimer-race-comments-democratic-primary_n_65d8fac3e4b0cc1f2f7bafd8">A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.</a></li>
<li><a href="https://www.huffpost.com/entry/mongolia-climate-change-extreme-weather_n_65d90294e4b0cc1f2f7bb527">Climate Change-Fueled Winter Extremes Put 90% Of This Country At &lsquo;High Risk&rsquo;</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf([
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Donald Trump Won South Carolina - But There&#39;s 1 Big Caveat&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Man Sets Himself On Fire In Front Of Israeli Embassy In Washington&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Climate Change-Fueled Winter Extremes Put 90% Of This Country At &#39;High Risk&#39;&#34;</span>
</span></span><span style="display:flex;"><span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993808269500732},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993786811828613},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9985186457633972},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993883371353149},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993487000465393}]
</span></span></code></pre></div><p>The model is extremely confident that these are not sarcastic.</p>
<p>The model detects sarcasm in limited cases. It fails to generalize to new, unseen data within the same domain. This is a common problem in machine learning. Training a model that performs well on a specific dataset is straightforward. Training a model that generalizes to new, unseen data remains a significant challenge.
Furthermore, our sarcasm detection project resulted in a domain classifier. For fuzzier concepts like sarcasm, it&rsquo;s important to be clear about what we&rsquo;re actually detecting, and to collect the necessary scale of data to capture the full range of the concept.</p>
<h2 id="key-takeaways">Key Takeaways</h2>
<p>This case study reveals a fundamental problem in ML: <strong>high accuracy guarantees only performance on the training distribution</strong>. Here&rsquo;s what actually happened:</p>
<ol>
<li><strong>Dataset bias</strong>: Using publication source as a proxy for sarcasm created a shortcut for the model</li>
<li><strong>Domain classification</strong>: The model exclusively learned to distinguish writing styles</li>
<li><strong>Poor generalization</strong>: New examples from the same sources often failed</li>
</ol>
<p>This is a common pitfall when building datasets for subjective concepts. The lesson: high accuracy must be accompanied by validation of the model&rsquo;s actual learned behavior.</p>
<p>For better sarcasm detection, we&rsquo;d need:</p>
<ul>
<li>Diverse sources beyond two publications</li>
<li>Human annotation across multiple contexts</li>
<li>Careful evaluation on out-of-domain examples</li>
</ul>
<p>Instructive failures in ML projects provide valuable lessons about our assumptions and the limitations of our approaches.</p>
]]></content:encoded></item><item><title>How Does Congress Actually Work? Data from 15K Bills</title><link>https://hunterheidenreich.com/posts/us-117th-congress-data-exploration/</link><pubDate>Thu, 05 Oct 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/us-117th-congress-data-exploration/</guid><description>What happens to bills in Congress? Analyzing 15K+ bills from the 117th Congress to understand legislative patterns, party dynamics, and success rates.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Analyzing congressional data reveals the underlying mechanics of the legislative process. Legislative text is a large, structured corpus well suited to text classification and other NLP tasks. I scraped data from Congress.gov to analyze what actually happens to the thousands of bills introduced each session and to build a foundational dataset for downstream machine learning tasks.</p>
<p>This analysis focuses on the 117th Congress (2021-2023), examining 15,000+ bills to understand basic patterns: Which bills get introduced? How many receive votes? What factors influence success?</p>
<p>This post covers the foundational exploratory analysis and data collection process, setting the stage for <a href="/posts/congressional-bill-policy-area-classification/">predictive modeling and policy area classification</a>.</p>
<h2 id="data-collection">Data Collection</h2>
<p>My primary source is <a href="https://www.congress.gov/">Congress.gov</a>, maintained by the Library of Congress. I focused on the 117th Congress (2021-2023), collecting data on bills and joint resolutions, omitting simple resolutions, concurrent resolutions, and amendments.</p>
<p><strong>Data collected:</strong></p>
<table>
  <thead>
      <tr>
          <th>Bill Type</th>
          <th>Introduced</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>House Bill</td>
          <td>9,698</td>
      </tr>
      <tr>
          <td>House Joint Resolution</td>
          <td>106</td>
      </tr>
      <tr>
          <td>Senate Bill</td>
          <td>5,357</td>
      </tr>
      <tr>
          <td>Senate Joint Resolution</td>
          <td>70</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>15,231</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="technical-implementation">Technical Implementation</h3>
<p>Building a usable NLP dataset requires careful handling of the source. Congress.gov loads content dynamically and presents nested DOM structures, so the scraper combines static HTML parsing with a headless browser to render JavaScript before parsing.</p>
<p><strong>Implementation details:</strong></p>
<ul>
<li><a href="https://www.python.org/">Python</a> for core orchestration and data schema management</li>
<li><a href="https://www.selenium.dev/">Selenium</a> for executing JavaScript and loading dynamic page elements</li>
<li><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a> for structured HTML parsing</li>
<li>Regex for text normalization and extracting clean legislative text for language models</li>
</ul>
<p>The crawler used 5-second delays between requests to respect server limits, a roughly 3-day collection run. It handles edge cases in congressional text formatting and writes one JSON record per bill on a fixed schema. The crawler and processed data are available on <a href="https://github.com/hunter-heidenreich/congress-scraper">GitHub</a>.</p>
<p>For each bill, I queried two pages:</p>
<ul>
<li>All info page: <code>https://www.congress.gov/bill/117th-congress/{bill_type}/{bill_id}/all-info</code></li>
<li>Text page: <code>https://www.congress.gov/bill/117th-congress/{bill_type}/{bill_id}/text?format=txt</code></li>
</ul>
<p>The parsing process involved targeting specific HTML elements and implementing basic caching to avoid redundant requests.</p>
<h2 id="key-findings">Key Findings</h2>
<p>The analysis reveals clear patterns in congressional activity. Most bills never receive votes, and success rates vary significantly by party and policy area.</p>
<h3 id="legislative-outcomes">Legislative Outcomes</h3>
<p>The fundamental question: what happens to bills after introduction?</p>
<p>Each bill has a tracker status indicating its position in the legislative process. The eight possible statuses can be grouped into three meaningful categories:</p>
<ul>
<li><strong>Introduced</strong>: Bills introduced but never voted on</li>
<li><strong>Stalled</strong>: Bills that saw votes but didn&rsquo;t become law (since the 117th Congress ended, these effectively died)</li>
<li><strong>Law</strong>: Bills signed by the President</li>
</ul>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Introduced</th>
          <th>Stalled</th>
          <th>Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>House Bill</td>
          <td>8,977</td>
          <td>523</td>
          <td>198</td>
      </tr>
      <tr>
          <td>House Joint Resolution</td>
          <td>102</td>
          <td>1</td>
          <td>3</td>
      </tr>
      <tr>
          <td>Senate Bill</td>
          <td>5,083</td>
          <td>114</td>
          <td>160</td>
      </tr>
      <tr>
          <td>Senate Joint Resolution</td>
          <td>57</td>
          <td>9</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>14,219</strong></td>
          <td><strong>647</strong></td>
          <td><strong>365</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Key insights:</strong></p>
<ul>
<li>Only 7% of introduced bills ever receive a vote</li>
<li>Of bills that receive votes, 36% become law</li>
<li>Overall, just 2% of introduced bills become law</li>
</ul>
<h3 id="sponsor-analysis">Sponsor Analysis</h3>
<p>The bill sponsor (the primary member who introduces legislation) provides insights into party and geographic patterns.</p>
<h4 id="party-breakdown">Party Breakdown</h4>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Introduced</th>
          <th>Stalled</th>
          <th>Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Democrat</td>
          <td>8,271</td>
          <td>437</td>
          <td>235</td>
      </tr>
      <tr>
          <td>Republican</td>
          <td>5,883</td>
          <td>210</td>
          <td>130</td>
      </tr>
      <tr>
          <td>Independent</td>
          <td>65</td>
          <td>0</td>
          <td>0</td>
      </tr>
  </tbody>
</table>
<p><strong>Party comparison:</strong></p>
<ul>
<li><strong>Democrats</strong>: 7.5% of bills moved beyond introduction; 2.6% became law</li>
<li><strong>Republicans</strong>: 5.5% of bills moved beyond introduction; 2.1% became law</li>
<li>When bills do advance, Republicans have a slightly higher success rate (38% vs 35%)</li>
</ul>
<h4 id="geographic-distribution">Geographic Distribution</h4>
<p><strong>Top 10 states by bills introduced:</strong></p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>State: Introduced</th>
          <th>State: Stalled</th>
          <th>State: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>CA: 1,350</td>
          <td>CA: 93</td>
          <td>CA: 34</td>
      </tr>
      <tr>
          <td>2</td>
          <td>TX: 879</td>
          <td>NY: 44</td>
          <td>MI: 30</td>
      </tr>
      <tr>
          <td>3</td>
          <td>NY: 784</td>
          <td>TX: 43</td>
          <td>TX: 25</td>
      </tr>
      <tr>
          <td>4</td>
          <td>FL: 766</td>
          <td>MI: 28</td>
          <td>NY: 24</td>
      </tr>
      <tr>
          <td>5</td>
          <td>IL: 660</td>
          <td>NJ: 28</td>
          <td>MN: 17</td>
      </tr>
      <tr>
          <td>6</td>
          <td>PA: 521</td>
          <td>IL: 27</td>
          <td>IL: 16</td>
      </tr>
      <tr>
          <td>7</td>
          <td>NJ: 478</td>
          <td>VA: 26</td>
          <td>OH: 11</td>
      </tr>
      <tr>
          <td>8</td>
          <td>MI: 380</td>
          <td>FL: 24</td>
          <td>VA: 11</td>
      </tr>
      <tr>
          <td>9</td>
          <td>OH: 377</td>
          <td>PA: 22</td>
          <td>FL: 11</td>
      </tr>
      <tr>
          <td>10</td>
          <td>MA: 361</td>
          <td>OH: 19</td>
          <td>GA: 9</td>
      </tr>
  </tbody>
</table>
<p><strong>Per-representative normalization reveals different patterns:</strong></p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>State: Introduced</th>
          <th>State: Stalled</th>
          <th>State: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>DC: 101.0</td>
          <td>DC: 7.0</td>
          <td>AK: 2.2</td>
      </tr>
      <tr>
          <td>2</td>
          <td>NH: 47.5</td>
          <td>AK: 2.8</td>
          <td>NH: 2.0</td>
      </tr>
      <tr>
          <td>3</td>
          <td>MT: 44.0</td>
          <td>IA: 2.3</td>
          <td>MT: 2.0</td>
      </tr>
      <tr>
          <td>4</td>
          <td>OR: 41.0</td>
          <td>SD: 2.3</td>
          <td>MI: 1.9</td>
      </tr>
      <tr>
          <td>5</td>
          <td>NV: 40.0</td>
          <td>NH: 2.2</td>
          <td>MN: 1.5</td>
      </tr>
      <tr>
          <td>6</td>
          <td>DE: 38.7</td>
          <td>VA: 2.0</td>
          <td>HI: 1.5</td>
      </tr>
      <tr>
          <td>7</td>
          <td>SD: 38.3</td>
          <td>NJ: 2.0</td>
          <td>CT: 1.3</td>
      </tr>
      <tr>
          <td>8</td>
          <td>IA: 37.7</td>
          <td>PR: 2.0</td>
          <td>IA: 1.2</td>
      </tr>
      <tr>
          <td>9</td>
          <td>RI: 36.5</td>
          <td>NV: 1.8</td>
          <td>OR: 1.1</td>
      </tr>
      <tr>
          <td>10</td>
          <td>UT: 36.0</td>
          <td>MO: 1.8</td>
          <td>SD: 1.0</td>
      </tr>
  </tbody>
</table>
<h4 id="top-individual-sponsors">Top Individual Sponsors</h4>
<p><strong>Most prolific legislators by bills introduced:</strong></p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>Individual: Introduced</th>
          <th>Individual: Stalled</th>
          <th>Individual: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Sen. Rubio (R-FL): 186</td>
          <td>Sen. Peters (D-MI): 11</td>
          <td>Sen. Peters (D-MI): 19</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Sen. Klobuchar (D-MN): 143</td>
          <td>Sen. Cornyn (R-TX): 8</td>
          <td>Sen. Cornyn (R-TX): 15</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Sen. Lee (R-UT): 125</td>
          <td>Rep. Connolly (D-VA-11): 8</td>
          <td>Sen. Klobuchar (D-MN): 7</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Sen. Markey (D-MA): 118</td>
          <td>Rep. Takano (D-CA-41): 8</td>
          <td>Sen. Tester (D-MT): 6</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Sen. Casey (D-PA): 116</td>
          <td>Sen. Grassley (R-IA): 7</td>
          <td>Sen. Rubio (R-FL): 6</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Sen. Cortez Masto (D-NV): 109</td>
          <td>Del. Norton (D-DC): 7</td>
          <td>Rep. DeLauro (D-CT-3): 6</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Sen. Booker (D-NJ): 106</td>
          <td>Rep. Johnson (D-TX-30): 7</td>
          <td>Sen. Grassley (R-IA): 5</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Sen. Durbin (D-IL): 102</td>
          <td>Rep. Katko (R-NY-24): 7</td>
          <td>Sen. Ossoff (D-GA): 4</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Del. Norton (D-DC): 101</td>
          <td>Rep. Dean (D-PA-4): 6</td>
          <td>Sen. Murkowski (R-AK): 4</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Sen. Menendez (D-NJ): 99</td>
          <td>Rep. Wagner (R-MO-2): 6</td>
          <td>Sen. Padilla (D-CA): 4</td>
      </tr>
  </tbody>
</table>
<p><strong>Effectiveness score (laws enacted / total bills):</strong></p>
<p>$$
\text{effectiveness} = \frac{\text{bills that became law}}{\text{total bills introduced}}
$$</p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>Individual: Effectiveness Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Rep. Pelosi (D-CA-12): 0.500</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Rep. Mrvan (D-IN-1): 0.444</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Rep. Yarmuth (D-KY-3): 0.333</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Rep. Stivers (R-OH-15): 0.250</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Rep. Graves (R-MO-6): 0.222</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Rep. Jeffries (D-NY-8): 0.200</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Rep. Neal (D-MA-1): 0.200</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Rep. Palazzo (R-MS-4): 0.200</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Sen. Peters (D-MI): 0.186</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Rep. Fischbach (R-MN-7): 0.176</td>
      </tr>
  </tbody>
</table>
<h3 id="policy-focus-areas">Policy Focus Areas</h3>
<p>Each bill is assigned a primary policy area. Here are the most active areas by legislative outcome:</p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>Policy Area: Introduced</th>
          <th>Policy Area: Stalled</th>
          <th>Policy Area: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Health: 1,885</td>
          <td>Government Operations: 79</td>
          <td>Government Operations: 94</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Armed Forces: 1,114</td>
          <td>Armed Forces: 60</td>
          <td>Armed Forces: 69</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Taxation: 1,066</td>
          <td>International Affairs: 60</td>
          <td>Crime &amp; Law Enforcement: 31</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Government Operations: 982</td>
          <td>Health: 56</td>
          <td>Health: 19</td>
      </tr>
      <tr>
          <td>5</td>
          <td>International Affairs: 866</td>
          <td>Crime &amp; Law Enforcement: 44</td>
          <td>Native Americans: 17</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Crime &amp; Law Enforcement: 842</td>
          <td>Public Lands: 44</td>
          <td>International Affairs: 14</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Education: 663</td>
          <td>Science &amp; Technology: 44</td>
          <td>Economics &amp; Finance: 13</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Transportation: 663</td>
          <td>Commerce: 43</td>
          <td>Public Lands: 13</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Public Lands: 548</td>
          <td>Finance: 34</td>
          <td>Commerce: 13</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Finance: 547</td>
          <td>Emergency Management: 27</td>
          <td>Emergency Management: 11</td>
      </tr>
  </tbody>
</table>
<p>Notable patterns: Health dominates introductions but has lower success rates, while government operations and armed forces bills are more likely to become law.</p>
<h2 id="next-steps">Next Steps</h2>
<p>This analysis establishes baseline patterns: most bills fail, party affiliation affects success rates, and certain policy areas perform better than others.</p>
<p>Future work could explore:</p>
<ul>
<li>Committee dynamics and voting patterns</li>
<li>Geographic analysis of state-level interests</li>
<li>Bill text analysis using NLP techniques</li>
<li>Predictive modeling for bill outcomes</li>
</ul>
<blockquote>
<p><strong>Update</strong>: I&rsquo;ve since applied machine learning to this type of data in <a href="/posts/congressional-bill-policy-area-classification/">Congressional Bill Policy Area Classification</a>, using 48K+ bills from three Congresses to automatically categorize bills by policy area.</p></blockquote>
<p>The complete dataset and code are publicly available to support further research into legislative transparency.</p>
]]></content:encoded></item><item><title>LAMMPS Tutorial: Copper and Platinum Adatom Diffusion</title><link>https://hunterheidenreich.com/posts/adatom-cu-diffusion/</link><pubDate>Wed, 27 Sep 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/adatom-cu-diffusion/</guid><description>LAMMPS tutorial for copper and platinum surface diffusion simulation and ML training data generation. Includes setup, analysis, and Ovito visualization.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Understanding how individual atoms move on crystal surfaces is fundamental to materials science, catalysis, and nanotechnology. This atomic-scale motion, called adatom diffusion, drives processes like thin film growth and surface chemical reactions.</p>
<p>While learning molecular dynamics simulations for my graduate work, I discovered these simulations generate valuable training data for machine learning models. This tutorial walks through simulating copper adatom diffusion on a Cu(100) surface using LAMMPS, building on Eric N. Hahn&rsquo;s excellent <a href="https://www.ericnhahn.com/tutorials/lammps-tutorials/adatom">adatom tutorial</a>.</p>
<p><strong>What you&rsquo;ll learn:</strong></p>
<ul>
<li>Setting up LAMMPS for surface diffusion simulations</li>
<li>Understanding simulation parameters and their impact</li>
<li>Visualizing results with Ovito</li>
<li>Analyzing trajectory data for ML applications</li>
<li>Connecting simulation data to machine learning workflows</li>
</ul>
<p>In this tutorial, we will explore both Copper (Cu) and Platinum (Pt) to show how atomic properties affect diffusion behavior, generating data for training element-aware ML models.</p>
<h2 id="prerequisites">Prerequisites</h2>
<p>Before starting this tutorial, you&rsquo;ll need:</p>
<ul>
<li><strong>LAMMPS</strong> with EAM potential support (version 2020 or later recommended)</li>
<li><strong>Python 3.x</strong> with matplotlib for analysis scripts</li>
<li><strong>Ovito</strong> (free version) for trajectory visualization</li>
<li><strong>Cu01.eam.alloy</strong> potential file from the <a href="https://www.ctcms.nist.gov/potentials/">NIST repository</a></li>
<li>Basic familiarity with molecular dynamics concepts (atoms, forces, timesteps)</li>
</ul>
<h2 id="understanding-adatoms-and-surface-diffusion">Understanding Adatoms and Surface Diffusion</h2>
<h3 id="what-is-an-adatom">What is an Adatom?</h3>
<p>An <strong>adatom</strong> (adsorbed atom) sits on a crystal surface but isn&rsquo;t incorporated into the bulk structure. Adatoms have fewer bonds than fully coordinated bulk atoms, making them highly mobile and reactive.</p>















<figure class="post-figure center ">
    <img src="/img/posts/crystal-surface.webp"
         alt="Ball model representation of a real (atomically rough) crystal surface with steps, kinks, adatoms, and vacancies in a closely-packed crystalline material. Adsorbed molecules, substitutional and interstitial atoms are also illustrated."
         title="Ball model representation of a real (atomically rough) crystal surface with steps, kinks, adatoms, and vacancies in a closely-packed crystalline material. Adsorbed molecules, substitutional and interstitial atoms are also illustrated."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ball model representation of a real (atomically rough) crystal surface with steps, kinks, adatoms, and vacancies in a closely-packed crystalline material. Adsorbed molecules, substitutional and interstitial atoms are also illustrated. (<a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en">CC-BY-SA-4.0: ShutterWaves</a>)</figcaption>
    
</figure>

<h3 id="why-study-adatom-diffusion">Why Study Adatom Diffusion?</h3>
<p>Adatom diffusion is important for several technological processes:</p>
<ul>
<li><strong>Thin film growth</strong>: Adatoms are the building blocks of deposited films</li>
<li><strong>Catalysis</strong>: Many reactions happen at these mobile surface atoms</li>
<li><strong>Corrosion</strong>: How surface atoms move affects material degradation</li>
<li><strong>Self-assembly</strong>: Adatom movement enables formation of ordered structures</li>
</ul>
<p>From a <strong>machine learning perspective</strong>, adatom diffusion is an ideal test case because:</p>
<ul>
<li>Well-understood physics provides ground truth for validation</li>
<li>Small system size enables extensive simulation</li>
<li>Behavior varies significantly with temperature and atomic species</li>
<li>Systematic data generation across different conditions</li>
</ul>
<h3 id="why-cu100">Why Cu(100)?</h3>
<p>Cu(100) surfaces are well-studied in literature, making them excellent benchmarks. The face-centered cubic (fcc) structure creates clear diffusion pathways, and copper&rsquo;s moderate binding energy lets us observe diffusion at reasonable temperatures without extreme computational demands.</p>
<h2 id="simulation-overview">Simulation Overview</h2>
<p>Before diving into the code details, let&rsquo;s understand the simulation design:</p>
<h3 id="key-simulation-parameters">Key Simulation Parameters</h3>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
          <th>Why this choice</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>System size</strong></td>
          <td>$8 \x8 \x6$ unit cells</td>
          <td>Large enough to avoid edge effects while keeping simulation time reasonable</td>
      </tr>
      <tr>
          <td><strong>Ensemble</strong></td>
          <td>NVT (constant volume, temperature)</td>
          <td>Appropriate for surface studies where pressure isn&rsquo;t the focus</td>
      </tr>
      <tr>
          <td><strong>Potential</strong></td>
          <td>EAM (Embedded Atom Method)</td>
          <td>Captures metallic bonding better than simple pair potentials</td>
      </tr>
      <tr>
          <td><strong>Time step</strong></td>
          <td>5 fs</td>
          <td>Small enough for numerical stability while allowing reasonable run times</td>
      </tr>
      <tr>
          <td><strong>Duration</strong></td>
          <td>500 ps</td>
          <td>Long enough to see multiple diffusion events</td>
      </tr>
      <tr>
          <td><strong>Temperature</strong></td>
          <td>600 K initial seed; 850 K thermostat on the bottom reservoir layer</td>
          <td>Drives thermal energy up from the substrate into the free surface where the adatom diffuses</td>
      </tr>
  </tbody>
</table>
<h3 id="simulation-strategy">Simulation Strategy</h3>
<p>The approach uses a <strong>thermal gradient setup</strong>:</p>
<ul>
<li>Bottom layers: Fixed to represent bulk crystal</li>
<li>Middle layers: Heated to 850 K for thermal energy</li>
<li>Top layers and adatom: Equilibrate to $\sim 600$ K for diffusion</li>
<li>This lets thermal energy propagate up from the heated reservoir to the free surface where the adatom diffuses</li>
</ul>
<p>The complete LAMMPS script implementing this approach:</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">### Original Created by Eric N. Hahn  ###
### ericnhahn@gmail.com ###

### Modifications by Hunter Heidenreich, CSE lab (Harvard, 2023)
### hheidenreich@g.harvard.edu
### 2023-09-01

### Simulating adatoms ###
### Version 0.2 ###


units metal
dimension 3
boundary p p s
atom_style atomic

lattice fcc 3.614
variable cubel equal 4
variable fixer1 equal &#34;v_cubel+2&#34;
variable fixer2 equal &#34;v_cubel+1.49&#34;
region  box block -${cubel} ${cubel} -${cubel} ${cubel} -${fixer1} 1 units lattice
region cbox block -${cubel} ${cubel} -${cubel} ${cubel} -${fixer1} 0 units lattice
create_box 1 box
create_atoms 1 region cbox
create_atoms 1 single -0.5 0 0.5 units lattice
region hold block INF INF INF INF -${fixer1} -${fixer2} units lattice
region temp block INF INF INF INF -${fixer2} -${cubel} units lattice
group hold region hold
group temp region temp

pair_style eam/alloy
pair_coeff * * Cu01.eam.alloy Cu

timestep        0.005
compute         new all temp
velocity        temp create 600 12345
fix heater temp temp/rescale 1 850 850 5 1
fix nve all nve
fix freeze hold setforce 0 0 0

variable e     equal pe
variable k     equal ke
variable t     equal etotal
variable T     equal temp
fix energy all ave/time 1 50 50 v_k v_e v_t v_T file energy_avg.txt

minimize 1.0e-4 1.0e-6 1000 10000

dump eve all custom 5 dump.lammpstrj id type xu yu zu   # fx fy fz  # uncomment for forces
dump_modify eve sort id

thermo 50
run 100000  # 100_000 * 5 fs = 500 ps
</code></pre><h2 id="line-by-line-breakdown">Line-by-Line Breakdown</h2>
<p>Let&rsquo;s examine each part of the LAMMPS script:</p>
<h3 id="simulation-setup">Simulation Setup</h3>
<h4 id="units">Units</h4>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">units metal
</code></pre><p>Sets simulation units to &ldquo;metal&rdquo; units (a standard choice for metallic systems). Key conversions: length in $\text{\AA}$, energy in eV, time in ps. Full details in the <a href="https://docs.lammps.org/units.html">LAMMPS documentation</a>.</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">dimension 3
</code></pre><p>Sets 3D simulation.</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">boundary p p s
</code></pre><p>Boundary conditions: periodic in x,y (infinite surface) and shrink-wrapped in z (finite surface height). This allows the adatom to potentially leave the surface if needed.</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">atom_style atomic
</code></pre><p>Uses &ldquo;atomic&rdquo; style, atoms as point masses without internal structure. Standard for metallic systems.</p>
<h4 id="lattice">Lattice</h4>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">lattice fcc 3.614
</code></pre><p>Defines face-centered cubic lattice with experimental Cu lattice constant ($3.614 \text{ \AA}$).</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">variable cubel equal 4
variable fixer1 equal &#34;v_cubel+2&#34;
variable fixer2 equal &#34;v_cubel+1.49&#34;
</code></pre><p>Define variables for simulation box dimensions. <code>cubel=4</code> sets system size, while <code>fixer1</code> and <code>fixer2</code> define the frozen and heated regions.</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">region  box block -${cubel} ${cubel} -${cubel} ${cubel} -${fixer1} 1 units lattice
region cbox block -${cubel} ${cubel} -${cubel} ${cubel} -${fixer1} 0 units lattice
</code></pre><p>Define regions: <code>box</code> for the entire simulation volume and <code>cbox</code> for crystal creation (excludes the surface layer where we&rsquo;ll place the adatom).</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">create_box 1 box
create_atoms 1 region cbox
create_atoms 1 single -0.5 0 0.5 units lattice
</code></pre><p>Create simulation box, populate with Cu atoms, then add single adatom at specified position.</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">region hold block INF INF INF INF -${fixer1} -${fixer2} units lattice
region temp block INF INF INF INF -${fixer2} -${cubel} units lattice
group hold region hold
group temp region temp
</code></pre><p>Define atom groups: <code>hold</code> (frozen bottom layers) and <code>temp</code> (heated middle layers for thermal energy).</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">pair_style eam/alloy
pair_coeff * * Cu01.eam.alloy Cu
</code></pre><p>Use <a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method/">Embedded Atom Method (EAM)</a> potential for metallic bonding. The Cu01.eam.alloy potential from <a href="https://doi.org/10.1103/PhysRevB.63.224106">Mishin et al.</a> is available from the <a href="https://www.ctcms.nist.gov/potentials/testing/entry/2001--Mishin-Y-Mehl-M-J-Papaconstantopoulos-D-A-et-al--Cu-1/">NIST repository</a>.</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">timestep        0.005
</code></pre><p>5 femtosecond timestep (small enough for numerical stability).</p>
<h4 id="initial-conditions">Initial Conditions</h4>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">velocity        temp create 600 12345
</code></pre><p>Initialize velocities for 600 K temperature using random seed 12345.</p>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">fix heater temp temp/rescale 1 850 850 5 1
fix nve all nve
fix freeze hold setforce 0 0 0
</code></pre><p>Three fixes control dynamics:</p>
<ul>
<li><code>heater</code>: Maintains 850 K in middle layers</li>
<li><code>nve</code>: Velocity Verlet integration for all atoms</li>
<li><code>freeze</code>: Sets forces to zero for bottom atoms</li>
</ul>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">variable e     equal pe
variable k     equal ke
variable t     equal etotal
variable T     equal temp
fix energy all ave/time 1 50 50 v_k v_e v_t v_T file energy_avg.txt
</code></pre><p>Track energies and temperature, averaging every 50 timesteps and writing to file.</p>
<h3 id="execution">Execution</h3>
<h4 id="minimization">Minimization</h4>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">minimize 1.0e-4 1.0e-6 1000 10000
</code></pre><p>Relax initial structure. Should converge quickly, indicating the system is already well-optimized.</p>
<h4 id="output-setup">Output Setup</h4>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">dump eve all custom 5 dump.lammpstrj id type xu yu zu   # fx fy fz  # uncomment for forces
dump_modify eve sort id
</code></pre><p>Write atomic positions every 5 timesteps, sorted by atom ID. Uncomment force components if needed for analysis.</p>
<h4 id="production-run">Production Run</h4>
<pre tabindex="0"><code class="language-lammps" data-lang="lammps">thermo 50
run 100000  # 100_000 * 5 fs = 500 ps
</code></pre><p>Run simulation for 500 ps with thermo output every 50 steps.</p>
<h2 id="visualization-and-analysis">Visualization and Analysis</h2>
<p>Visualize results using <a href="https://www.ovito.org/">Ovito</a>, a free atomistic visualization tool:</p>
<ol>
<li>Open the trajectory file in Ovito</li>
<li>Color atoms by z-coordinate</li>
<li>Restrict height range to $0\text{-}2 \text{ \AA}$ for surface focus</li>
<li>Animate to observe diffusion events</li>
</ol>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/nIdbNqEEPys?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<h2 id="analysis-results">Analysis Results</h2>
<p>The simulation generates rich data for machine learning applications:</p>
<h3 id="energy-analysis">Energy Analysis</h3>
<p>Energy fluctuations reveal thermal motion patterns:</p>















<figure class="post-figure center ">
    <img src="/img/adatom_cu_energy_avg.webp"
         alt="Average kinetic energy, potential energy, total energy, and temperature over time."
         title="Average kinetic energy, potential energy, total energy, and temperature over time."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Energy and temperature evolution over 500 ps simulation.</figcaption>
    
</figure>

<p>Skipping the first 30 logged data points (each averaged over 50 timesteps, so the first ~1500 timesteps / 7.5 ps of equilibration), these fluctuations enable:</p>
<ul>
<li><strong>Anomaly detection</strong>: Identifying unusual diffusion events</li>
<li><strong>Temperature prediction</strong>: Estimating local temperature from atomic motion</li>
<li><strong>Stability analysis</strong>: Detecting equilibrium states</li>
</ul>
<h3 id="trajectory-analysis">Trajectory Analysis</h3>
<p>Adatom motion reveals diffusion mechanisms:</p>















<figure class="post-figure center ">
    <img src="/img/adatom_cu_xy.webp"
         alt="x and y coordinates of the adatom over time."
         title="x and y coordinates of the adatom over time."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adatom surface trajectory showing random walk behavior.</figcaption>
    
</figure>

<p>This data enables:</p>
<ul>
<li><strong>Path prediction</strong>: Training models for future position forecasting</li>
<li><strong>Diffusion coefficient estimation</strong>: Learning temperature-mobility relationships</li>
<li><strong>Transition state identification</strong>: Detecting hops between stable sites</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/adatom_cu_z.webp"
         alt="z coordinate of the adatom over time."
         title="z coordinate of the adatom over time."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Height fluctuations revealing exchange events with surface atoms.</figcaption>
    
</figure>

<p>Z-coordinate data shows <strong>exchange events</strong> where the adatom swaps with surface atoms (crucial for surface chemistry understanding). This enables:</p>
<ul>
<li><strong>Event classification</strong>: Distinguishing diffusion vs. exchange mechanisms</li>
<li><strong>Activation barrier estimation</strong>: Learning energy landscapes from fluctuations</li>
<li><strong>Surface coordination analysis</strong>: Correlating height with local environment</li>
</ul>
<h3 id="machine-learning-applications">Machine Learning Applications</h3>
<p>This simulation produces multiple data types for ML training:</p>
<ol>
<li><strong>Coordinate trajectories</strong>: Neural network potential inputs or graph neural network features</li>
<li><strong>Energy time series</strong>: Regression model features for system property prediction</li>
<li><strong>Event annotations</strong>: Supervised learning labels for diffusion mechanism classification</li>
<li><strong>Environmental descriptors</strong>: Local atomic arrangement features</li>
</ol>
<p>Systematic MD simulations generate large, labeled datasets across varied conditions.</p>
<h2 id="extending-to-platinum-mass-and-bonding-effects">Extending to Platinum: Mass and Bonding Effects</h2>
<p>To understand how different elements behave, we can extend this framework to platinum (Pt). Platinum&rsquo;s higher atomic mass and stronger metallic bonding create notably different diffusion behavior, providing comparative data for machine learning.</p>
<h3 id="key-differences-from-copper">Key Differences from Copper</h3>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Copper (Cu)</th>
          <th>Platinum (Pt)</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Atomic mass</strong></td>
          <td>63.5 u</td>
          <td>195.1 u</td>
          <td>Slower diffusion, longer correlation times</td>
      </tr>
      <tr>
          <td><strong>Lattice const.</strong></td>
          <td>3.614 Å</td>
          <td>3.96 Å</td>
          <td>Larger diffusion barriers, different pathways</td>
      </tr>
      <tr>
          <td><strong>Potential</strong></td>
          <td>Mishin et al.</td>
          <td>Zhou et al.</td>
          <td>Different interaction strengths</td>
      </tr>
      <tr>
          <td><strong>Melting point</strong></td>
          <td>1358 K</td>
          <td>2041 K</td>
          <td>Stronger surface binding</td>
      </tr>
  </tbody>
</table>
<h3 id="modifying-the-lammps-script">Modifying the LAMMPS Script</h3>
<p>The platinum simulation uses the exact same framework as the copper case, with three simple element-specific modifications:</p>
<ol>
<li><strong>Lattice constant</strong>: Change <code>lattice fcc 3.614</code> to <code>lattice fcc 3.96</code></li>
<li><strong>Potential file</strong>: Change <code>Cu01.eam.alloy</code> to <code>Pt_Zhou04.eam.alloy</code> (available from the <a href="https://www.ctcms.nist.gov/potentials/testing/entry/2004--Zhou-X-W-Johnson-R-A-Wadley-H-N-G--Pt/">NIST repository</a>)</li>
<li><strong>Element specification</strong>: Change <code>Cu</code> to <code>Pt</code> in the <code>pair_coeff</code> line</li>
</ol>
<p>These simple changes capture the essential physics differences between elements while maintaining the same simulation protocol, which is ideal for generating comparative datasets for ML training.</p>
<h3 id="expected-behavior-vs-copper">Expected Behavior vs. Copper</h3>
<p>When you run the analysis scripts on the platinum trajectory, you will observe:</p>
<ul>
<li><strong>Slower motion</strong>: Heavier atoms move more slowly at the same temperature. Platinum&rsquo;s ~3x greater mass reduces diffusion rates.</li>
<li><strong>Higher energy barriers</strong>: Stronger metallic bonding creates deeper potential wells, requiring more thermal energy for diffusion hops.</li>
<li><strong>Different pathways</strong>: The larger lattice constant changes the energy landscape, potentially favoring different diffusion mechanisms.</li>
</ul>
<p>Comparing Cu and Pt trajectories enables training element-aware models that account for atomic mass effects, binding strengths, and temperature scaling across different metals.</p>
<h2 id="code-and-data">Code and Data</h2>
<p>The complete simulation scripts and analysis tools are available for reproducibility:</p>
<h3 id="energy-analysis-script">Energy Analysis Script</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Hunter Heidenreich, 2023</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Plots the energy of a simulation over time.</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> matplotlib.pyplot <span style="color:#66d9ef">as</span> plt
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> argparse <span style="color:#f92672">import</span> ArgumentParser
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">&#39;__main__&#39;</span>:
</span></span><span style="display:flex;"><span>    parser <span style="color:#f92672">=</span> ArgumentParser()
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#39;--input&#39;</span>, type<span style="color:#f92672">=</span>str, required<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#39;--output&#39;</span>, type<span style="color:#f92672">=</span>str, required<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#39;--skip&#39;</span>, type<span style="color:#f92672">=</span>int, default<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    args <span style="color:#f92672">=</span> parser<span style="color:#f92672">.</span>parse_args()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Parse energy data</span>
</span></span><span style="display:flex;"><span>    data <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#39;ts&#39;</span>: [], <span style="color:#e6db74">&#39;kes&#39;</span>: [], <span style="color:#e6db74">&#39;pes&#39;</span>: [], <span style="color:#e6db74">&#39;tes&#39;</span>: [], <span style="color:#e6db74">&#39;Ts&#39;</span>: []}
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(args<span style="color:#f92672">.</span>input, <span style="color:#e6db74">&#39;r&#39;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> f:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> line<span style="color:#f92672">.</span>startswith(<span style="color:#e6db74">&#39;#&#39;</span>) <span style="color:#f92672">or</span> <span style="color:#f92672">not</span> line<span style="color:#f92672">.</span>strip():
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>            t, v_k, v_e, v_t, v_T <span style="color:#f92672">=</span> map(float, line<span style="color:#f92672">.</span>split())
</span></span><span style="display:flex;"><span>            data[<span style="color:#e6db74">&#39;ts&#39;</span>]<span style="color:#f92672">.</span>append(t)
</span></span><span style="display:flex;"><span>            data[<span style="color:#e6db74">&#39;kes&#39;</span>]<span style="color:#f92672">.</span>append(v_k)
</span></span><span style="display:flex;"><span>            data[<span style="color:#e6db74">&#39;pes&#39;</span>]<span style="color:#f92672">.</span>append(v_e)
</span></span><span style="display:flex;"><span>            data[<span style="color:#e6db74">&#39;tes&#39;</span>]<span style="color:#f92672">.</span>append(v_t)
</span></span><span style="display:flex;"><span>            data[<span style="color:#e6db74">&#39;Ts&#39;</span>]<span style="color:#f92672">.</span>append(v_T)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Skip initial equilibration</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> key <span style="color:#f92672">in</span> data:
</span></span><span style="display:flex;"><span>        data[key] <span style="color:#f92672">=</span> data[key][args<span style="color:#f92672">.</span>skip:]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Create subplots</span>
</span></span><span style="display:flex;"><span>    fig, axs <span style="color:#f92672">=</span> plt<span style="color:#f92672">.</span>subplots(<span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">2</span>, figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">16</span>, <span style="color:#ae81ff">12</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    plots <span style="color:#f92672">=</span> [(<span style="color:#e6db74">&#39;Kinetic Energy&#39;</span>, <span style="color:#e6db74">&#39;kes&#39;</span>), (<span style="color:#e6db74">&#39;Potential Energy&#39;</span>, <span style="color:#e6db74">&#39;pes&#39;</span>),
</span></span><span style="display:flex;"><span>             (<span style="color:#e6db74">&#39;Total Energy&#39;</span>, <span style="color:#e6db74">&#39;tes&#39;</span>), (<span style="color:#e6db74">&#39;Temperature&#39;</span>, <span style="color:#e6db74">&#39;Ts&#39;</span>)]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> ax, (title, key) <span style="color:#f92672">in</span> zip(axs<span style="color:#f92672">.</span>flat, plots):
</span></span><span style="display:flex;"><span>        ax<span style="color:#f92672">.</span>plot(data[<span style="color:#e6db74">&#39;ts&#39;</span>], data[key])
</span></span><span style="display:flex;"><span>        ax<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#39;TimeStep&#39;</span>)
</span></span><span style="display:flex;"><span>        ax<span style="color:#f92672">.</span>set_ylabel(title)
</span></span><span style="display:flex;"><span>        ax<span style="color:#f92672">.</span>set_title(title)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    plt<span style="color:#f92672">.</span>tight_layout()
</span></span><span style="display:flex;"><span>    plt<span style="color:#f92672">.</span>savefig(args<span style="color:#f92672">.</span>output, dpi<span style="color:#f92672">=</span><span style="color:#ae81ff">300</span>, bbox_inches<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;tight&#39;</span>)
</span></span></code></pre></div><h3 id="trajectory-analysis-script">Trajectory Analysis Script</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Hunter Heidenreich, 2023</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Plots the coordinates of the adatom.</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> matplotlib.pyplot <span style="color:#66d9ef">as</span> plt
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> argparse <span style="color:#f92672">import</span> ArgumentParser
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">&#39;__main__&#39;</span>:
</span></span><span style="display:flex;"><span>    parser <span style="color:#f92672">=</span> ArgumentParser()
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#39;--input&#39;</span>, type<span style="color:#f92672">=</span>str, required<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#39;--output&#39;</span>, type<span style="color:#f92672">=</span>str, required<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#39;--id&#39;</span>, type<span style="color:#f92672">=</span>int, default<span style="color:#f92672">=</span><span style="color:#ae81ff">1665</span>,
</span></span><span style="display:flex;"><span>                       help<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Atom ID to track (the adatom is the last created atom)&#39;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#39;--do_z&#39;</span>, action<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;store_true&#39;</span>,
</span></span><span style="display:flex;"><span>                       help<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Plot z-coordinate instead of xy scatter&#39;</span>)
</span></span><span style="display:flex;"><span>    args <span style="color:#f92672">=</span> parser<span style="color:#f92672">.</span>parse_args()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    coords <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#39;x&#39;</span>: [], <span style="color:#e6db74">&#39;y&#39;</span>: [], <span style="color:#e6db74">&#39;z&#39;</span>: []}
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(args<span style="color:#f92672">.</span>input, <span style="color:#e6db74">&#39;r&#39;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> f:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> line<span style="color:#f92672">.</span>startswith(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;</span><span style="color:#e6db74">{</span>args<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74"> &#39;</span>):
</span></span><span style="display:flex;"><span>                x, y, z <span style="color:#f92672">=</span> map(float, line<span style="color:#f92672">.</span>split()[<span style="color:#ae81ff">2</span>:<span style="color:#ae81ff">5</span>])
</span></span><span style="display:flex;"><span>                coords[<span style="color:#e6db74">&#39;x&#39;</span>]<span style="color:#f92672">.</span>append(x)
</span></span><span style="display:flex;"><span>                coords[<span style="color:#e6db74">&#39;y&#39;</span>]<span style="color:#f92672">.</span>append(y)
</span></span><span style="display:flex;"><span>                coords[<span style="color:#e6db74">&#39;z&#39;</span>]<span style="color:#f92672">.</span>append(z)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    plt<span style="color:#f92672">.</span>figure(figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">10</span>, <span style="color:#ae81ff">8</span>))
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> args<span style="color:#f92672">.</span>do_z:
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>plot(range(len(coords[<span style="color:#e6db74">&#39;z&#39;</span>])), coords[<span style="color:#e6db74">&#39;z&#39;</span>], <span style="color:#e6db74">&#39;b-&#39;</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>xlabel(<span style="color:#e6db74">&#39;Simulation Step&#39;</span>)
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74">&#39;Z Coordinate (Å)&#39;</span>)
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>title(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;Height vs. Time for Adatom </span><span style="color:#e6db74">{</span>args<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>grid(<span style="color:#66d9ef">True</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.3</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>scatter(coords[<span style="color:#e6db74">&#39;x&#39;</span>], coords[<span style="color:#e6db74">&#39;y&#39;</span>], s<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.7</span>, c<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;red&#39;</span>)
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>xlabel(<span style="color:#e6db74">&#39;X Coordinate (Å)&#39;</span>)
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74">&#39;Y Coordinate (Å)&#39;</span>)
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>title(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;XY Trajectory for Adatom </span><span style="color:#e6db74">{</span>args<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>axis(<span style="color:#e6db74">&#39;equal&#39;</span>)
</span></span><span style="display:flex;"><span>        plt<span style="color:#f92672">.</span>grid(<span style="color:#66d9ef">True</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.3</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    plt<span style="color:#f92672">.</span>savefig(args<span style="color:#f92672">.</span>output, dpi<span style="color:#f92672">=</span><span style="color:#ae81ff">300</span>, bbox_inches<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;tight&#39;</span>)
</span></span></code></pre></div><h2 id="summary-and-next-steps">Summary and Next Steps</h2>
<p>This tutorial demonstrates how molecular dynamics generates valuable ML training data for materials science. Adatom diffusion provides an ideal starting point because it:</p>
<ul>
<li><strong>Has interpretable physics</strong>: Well-understood mechanisms enable ML validation</li>
<li><strong>Shows diverse behaviors</strong>: Temperature-dependent dynamics create rich datasets</li>
<li><strong>Scales efficiently</strong>: Small systems allow extensive parameter exploration</li>
<li><strong>Connects to applications</strong>: Direct relevance to catalysis and surface engineering</li>
</ul>
<h3 id="whats-next">What&rsquo;s Next</h3>
<p>Future posts will extend this framework:</p>
<ol>
<li><strong>Mixed-metal surfaces</strong>: Alloy effects on diffusion pathways</li>
<li><strong>Stepped surfaces</strong>: How defects alter atomic mobility</li>
<li><strong>ML implementation</strong>: Training neural networks on simulation data</li>
</ol>
<h3 id="broader-applications">Broader Applications</h3>
<p>These simulation techniques enable various ML applications:</p>
<ul>
<li><strong>Neural network potentials</strong>: Replacing expensive quantum calculations with trained models</li>
<li><strong>Rare event sampling</strong>: ML-enhanced diffusion pathway identification</li>
<li><strong>Catalyst design</strong>: Predicting surface modification effects on reactivity</li>
<li><strong>Materials discovery</strong>: Screening alloy compositions for desired properties</li>
</ul>
<h3 id="getting-started">Getting Started</h3>
<p>To reproduce these simulations:</p>
<ol>
<li>Install LAMMPS with EAM potential support</li>
<li>Download Cu01.eam.alloy from the <a href="https://www.ctcms.nist.gov/potentials/entry/2001--Mishin-Y-Mehl-M-J-Papaconstantopoulos-D-A-et-al--Cu-1/">NIST repository</a> and place in your working directory</li>
<li>Save the LAMMPS script as <code>adatom_cu.lammps</code> and run:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>lammps -in adatom_cu.lammps
</span></span></code></pre></div></li>
<li>Analyze the results with the Python scripts:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>python plot_energy.py --input energy_avg.txt --output energy.png --skip <span style="color:#ae81ff">30</span>
</span></span><span style="display:flex;"><span>python plot_trajectory.py --input dump.lammpstrj --output trajectory_xy.png
</span></span><span style="display:flex;"><span>python plot_trajectory.py --input dump.lammpstrj --output trajectory_z.png --do_z
</span></span></code></pre></div></li>
<li>Visualize in Ovito by opening <code>dump.lammpstrj</code></li>
<li>Experiment with different temperatures, orientations, or elements</li>
</ol>
<hr>
<p>The full project, including the simulation architecture and automated analysis pipeline, is documented on the <a href="/projects/lammps-adatom-diffusion/">Automated Adatom Diffusion Workflow project page</a>.</p>
<p><em>Questions about the simulation setup or interested in applying these techniques to your research? Feel free to reach out. I&rsquo;m always happy to discuss molecular dynamics and machine learning applications.</em></p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://www.lammps.org/">LAMMPS</a></li>
<li><a href="https://www.ovito.org/">Ovito</a></li>
<li><a href="https://www.ctcms.nist.gov/potentials/">NIST Interatomic Potentials Repository</a></li>
<li><a href="https://doi.org/10.1103/PhysRevB.63.224106">Mishin et al.</a></li>
</ul>
]]></content:encoded></item><item><title>Generating Mini-Protein Trajectories with GROMACS</title><link>https://hunterheidenreich.com/posts/mini-proteins/</link><pubDate>Thu, 21 Sep 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/mini-proteins/</guid><description>Systematic GROMACS workflows for simulating mini-proteins across multiple amino acids to generate diverse MD trajectories for ML applications.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>When developing machine learning models for protein dynamics, I needed training data, lots of it. Most researchers start with alanine dipeptide, a tiny two-amino-acid system that&rsquo;s become the &ldquo;hello world&rdquo; of protein simulation. It&rsquo;s small enough to simulate quickly but complex enough to show interesting folding behavior.</p>
<p>I wanted more diversity in my training data. Different amino acid side chains behave differently, and I was curious how this would affect model performance. So I extended the typical alanine dipeptide approach to include eight other amino acids, creating a small collection of &ldquo;mini-proteins&rdquo; for ML studies.</p>
<p>These dipeptides give a controlled testbed for studying how different chemical properties (aromatic rings, flexibility, branching) affect molecular dynamics, and for generating training data that varies those properties systematically.</p>
<h2 id="what-are-mini-proteins">What Are Mini-Proteins?</h2>
<p>In this context, &ldquo;mini-proteins&rdquo; are single amino acid residues capped with acetyl and N-methyl groups (Ace-X-Nme, where X is the amino acid). These systems act as the simplest possible models that still capture essential protein-like behavior.</p>
<p>These systems are popular in computational studies because they:</p>
<ul>
<li>Simulate quickly (seconds to minutes instead of hours)</li>
<li>Have well-characterized behavior for validation</li>
<li>Show enough complexity to be interesting</li>
<li>Can be systematically varied to study different chemical effects</li>
</ul>
<h2 id="getting-started">Getting Started</h2>
<p>The complete workflow and scripts are available on GitHub: <a href="https://github.com/hunter-heidenreich/mini-proteins/">mini-proteins</a>. The full project overview is on the <a href="/projects/mini-protein-trajectories/">Mini-Protein Trajectory Generation project page</a>.</p>
<h3 id="requirements">Requirements</h3>
<ul>
<li>Linux system with GROMACS installed</li>
<li>Python 3 with numpy and matplotlib</li>
<li>Basic familiarity with molecular dynamics concepts</li>
</ul>
<h3 id="quick-start">Quick Start</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>git clone https://github.com/hunter-heidenreich/mini-proteins.git
</span></span><span style="display:flex;"><span>cd mini-proteins
</span></span><span style="display:flex;"><span>ID<span style="color:#f92672">=</span>ala sh scripts/run.sh
</span></span></code></pre></div><p>This runs the complete pipeline: energy minimization, solvation, equilibration, and production simulation. The default settings generate 1 ns of trajectory data saved every 100 fs. I chose high temporal resolution for my ML models, but you can adjust this in <code>config/md_langevin.mdp</code>.</p>
<p>For longer production runs (recommended for most applications), increase the simulation time to ~100 ns and reduce the save frequency to manage file sizes.</p>
<h2 id="the-collection">The Collection</h2>
<p>I&rsquo;ve included nine different amino acid dipeptides, each with distinct chemical properties:</p>
<p><strong>Flexible systems</strong>: Glycine (smallest side chain), Alanine (methyl group)</p>
<p><strong>Branched systems</strong>: Valine, Isoleucine, Leucine (different branching patterns)</p>
<p><strong>Aromatic systems</strong>: Phenylalanine, Tryptophan (different ring structures)</p>
<p><strong>Special cases</strong>: Proline (ring constraint), Methionine (sulfur chemistry)</p>
<p>This systematic set allows studying how different chemical features affect dynamics:</p>
<ul>
<li>Does the flexibility of glycine lead to more diverse conformational sampling?</li>
<li>How do aromatic rings in tryptophan affect folding pathways?</li>
<li>Does the ring constraint in proline create different energy landscapes?</li>
</ul>
<p>These fundamental questions provide systematic data to test ML models against known chemical intuition, building confidence in the approach.</p>
<p>Ideally, a neural network trained on this dataset should learn physical <em>invariances</em>. By training on both aliphatic (Val, Leu, Ile) and aromatic (Phe, Trp) systems, the model learns to focus entirely on how electron density (π-systems vs. σ-bonds) influences local potential energy surfaces.</p>
<h3 id="generating-ml-ready-trajectory-data">Generating ML-Ready Trajectory Data</h3>
<p>Generating raw coordinates is easy; generating <strong>ML-ready data</strong> requires specific configurations. Standard MD simulations compress trajectory files to save space, discarding high-frequency velocity and force data. To train Neural Network Potentials (NNPs), I configured the GROMACS pipeline differently.</p>
<p>The fastest way to generate trajectory data is using the <code>run.sh</code> script:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ID<span style="color:#f92672">=</span>ala sh scripts/run.sh
</span></span></code></pre></div><p>where <code>ID</code> is the three-letter amino acid code (here, <code>ala</code> for alanine).</p>
<p>This script performs energy minimization, solvation, neutralization, NVT equilibration, NPT equilibration, and production simulation. The resulting trajectory saves to the <code>out/ID/data</code> directory.</p>
<h4 id="why-this-pipeline-differs-from-standard-tutorials">Why This Pipeline Differs from Standard Tutorials</h4>
<p>A key deviation from standard tutorials is the use of <strong>Stochastic Dynamics (Langevin)</strong> as the integrator. This adds friction and noise terms to the equations of motion, ensuring correct thermodynamic sampling:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-ini" data-lang="ini"><span style="display:flex;"><span><span style="color:#75715e">; config/md_langevin.mdp</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">integrator</span>  <span style="color:#f92672">=</span> <span style="color:#e6db74">sd        ; Stochastic dynamics (Langevin)</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">dt</span>          <span style="color:#f92672">=</span> <span style="color:#e6db74">0.001     ; 1 fs timestep</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">nstxout</span>     <span style="color:#f92672">=</span> <span style="color:#e6db74">100       ; Save coordinates every 100 steps</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">nstvout</span>     <span style="color:#f92672">=</span> <span style="color:#e6db74">100       ; Save velocities every 100 steps</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">nstfout</span>     <span style="color:#f92672">=</span> <span style="color:#e6db74">100       ; Save forces every 100 steps</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">tc-grps</span>     <span style="color:#f92672">=</span> <span style="color:#e6db74">Protein Non-Protein</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">tau_t</span>       <span style="color:#f92672">=</span> <span style="color:#e6db74">0.1  0.1  ; Friction constant (ps)</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">ref_t</span>       <span style="color:#f92672">=</span> <span style="color:#e6db74">298  298  ; Reference temperature (K)</span>
</span></span></code></pre></div><p>The critical settings for ML applications:</p>
<ol>
<li><strong>Langevin Dynamics (<code>sd</code>)</strong>: Ensures proper canonical (NVT) sampling, providing a robust alternative to the velocity-rescaling thermostat often used in tutorials</li>
<li><strong>Uncompressed Force Output (<code>nstfout = 100</code>)</strong>: Writing to <code>.trr</code> format captures the precise atomic forces acting on every atom, essential for force-matching in NNP training</li>
<li><strong>High-Frequency Sampling (0.1 ps)</strong>: Saving frames every 100 fs captures fast bond vibrations often missed in standard 10 ps snapshots</li>
</ol>
<p><strong>Note</strong>: A production simulation currently runs for 1 nanosecond, saved every 0.1 picoseconds (100 fs). For most applications, increase this to 100 nanoseconds and adjust the save frequency to avoid large data files. I targeted 100 fs because I needed correlated time data for ML models; other applications may require a lower frequency.</p>
<p>You can also run each step individually (see <code>scripts/run.sh</code> for examples).</p>
<h2 id="the-systems">The Systems</h2>
<p>Here are the nine amino acid dipeptides I&rsquo;ve included, each chosen for different chemical properties:</p>
<h3 id="alanine-dipeptide-the-standard">Alanine Dipeptide: The Standard</h3>















<figure class="post-figure center ">
    <img src="/img/alanine-dipeptide-molecular-dynamics.webp"
         alt="Alanine dipeptide molecular dynamics simulation animation"
         title="Alanine dipeptide molecular dynamics simulation animation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Alanine Dipeptide</figcaption>
    
</figure>

<p>The classic starting point for protein folding studies. The small methyl side chain provides a simple yet challenging system.</p>
<h3 id="glycine-dipeptide-maximum-flexibility">Glycine Dipeptide: Maximum Flexibility</h3>















<figure class="post-figure center ">
    <img src="/img/glycine-dipeptide-molecular-dynamics.webp"
         alt="Glycine dipeptide molecular dynamics simulation animation"
         title="Glycine dipeptide molecular dynamics simulation animation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Glycine Dipeptide</figcaption>
    
</figure>

<p>No side chain means maximum backbone flexibility. Great for studying how constraints affect conformational sampling.</p>
<h3 id="proline-dipeptide-built-in-rigidity">Proline Dipeptide: Built-in Rigidity</h3>















<figure class="post-figure center ">
    <img src="/img/proline-dipeptide-molecular-dynamics.webp"
         alt="Proline dipeptide molecular dynamics simulation animation"
         title="Proline dipeptide molecular dynamics simulation animation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Proline Dipeptide</figcaption>
    
</figure>

<p>The ring structure creates backbone constraints. Interesting comparison to glycine&rsquo;s flexibility.</p>
<h3 id="aromatic-systems">Aromatic Systems</h3>















<figure class="post-figure center ">
    <img src="/img/phenylalanine-dipeptide-molecular-dynamics.webp"
         alt="Phenylalanine dipeptide molecular dynamics simulation animation"
         title="Phenylalanine dipeptide molecular dynamics simulation animation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Phenylalanine Dipeptide</figcaption>
    
</figure>

<p><strong>Phenylalanine</strong>: Simple benzene ring for studying aromatic interactions.</p>















<figure class="post-figure center ">
    <img src="/img/tryptophan-dipeptide-molecular-dynamics.webp"
         alt="Tryptophan dipeptide molecular dynamics simulation animation"
         title="Tryptophan dipeptide molecular dynamics simulation animation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Tryptophan Dipeptide</figcaption>
    
</figure>

<p><strong>Tryptophan</strong>: Larger indole ring system with more complex aromatic chemistry.</p>
<h3 id="branched-aliphatic-systems">Branched Aliphatic Systems</h3>















<figure class="post-figure center ">
    <img src="/img/valine-dipeptide-molecular-dynamics.webp"
         alt="Valine dipeptide molecular dynamics simulation animation"
         title="Valine dipeptide molecular dynamics simulation animation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Valine Dipeptide</figcaption>
    
</figure>

<p><strong>Valine</strong>: β-branched, creates steric constraints near the backbone.</p>















<figure class="post-figure center ">
    <img src="/img/isoleucine-dipeptide-molecular-dynamics.webp"
         alt="Isoleucine dipeptide molecular dynamics simulation animation"
         title="Isoleucine dipeptide molecular dynamics simulation animation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Isoleucine Dipeptide</figcaption>
    
</figure>

<p><strong>Isoleucine</strong>: γ-branched, different steric profile than valine.</p>















<figure class="post-figure center ">
    <img src="/img/leucine-dipeptide-molecular-dynamics.webp"
         alt="Leucine dipeptide molecular dynamics simulation animation"
         title="Leucine dipeptide molecular dynamics simulation animation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Leucine Dipeptide</figcaption>
    
</figure>

<p><strong>Leucine</strong>: Longer branched chain with more conformational freedom.</p>
<h3 id="special-chemistry">Special Chemistry</h3>















<figure class="post-figure center ">
    <img src="/img/methionine-dipeptide-molecular-dynamics.webp"
         alt="Methionine dipeptide molecular dynamics simulation animation"
         title="Methionine dipeptide molecular dynamics simulation animation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Methionine Dipeptide</figcaption>
    
</figure>

<p><strong>Methionine</strong>: Sulfur chemistry, different from the others and interesting for studying heteroatom effects.</p>
<h2 id="whats-next">What&rsquo;s Next?</h2>
<p>These mini-protein simulations have been useful for my ML work, providing systematic training data with controlled chemical variation. These simple systems have helped me understand how different amino acid properties affect molecular behavior, knowledge that&rsquo;s valuable when working with larger, more complex proteins.</p>
<p>The primary value of this pipeline lies in the <strong>force extraction</strong> workflow. Having atomic forces alongside coordinates enables training NNPs via force matching; force information is a richer training signal than energies alone. Tools like <a href="https://github.com/torchmd/torchmd-net">TorchMD-Net</a>, <a href="https://github.com/mir-group/nequip">NequIP</a>, and <a href="https://github.com/ACEsuit/mace">MACE</a> can directly consume this data format.</p>
<p>The scripts are designed to be easily modified for different amino acids or simulation conditions. I&rsquo;ve tried to make the workflow straightforward while keeping it flexible.</p>
<p>This work complements my other molecular dynamics projects:</p>
<ul>
<li><a href="/posts/adatom-cu-diffusion/">LAMMPS Tutorial: Copper and Platinum Adatom Diffusion</a>: Learning LAMMPS for surface simulations and extending to different elements</li>
</ul>
<p>Together, these projects have given me a solid foundation in MD simulations for generating ML training data across different molecular systems.</p>
<hr>
<p><em>Find the complete code and documentation on <a href="https://github.com/hunter-heidenreich/mini-proteins">GitHub</a>. Questions or suggestions? I&rsquo;d love to hear from you, especially if you&rsquo;ve found interesting ways to extend or improve the approach.</em></p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>The scripts build on the <a href="https://cbp-unitn.gitlab.io/qcb22-23/QCB/tutorial2_gromacs">GROMACS tutorial</a> by Luca Tubiana at the University of Trento.</p>
]]></content:encoded></item><item><title>Mini-Protein Trajectory Generation</title><link>https://hunterheidenreich.com/projects/mini-protein-trajectories/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/mini-protein-trajectories/</guid><description>Automated GROMACS pipeline generating MD trajectories with atomic force extraction for Neural Network Potential training.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>I developed an automated GROMACS pipeline to generate molecular dynamics (MD) datasets for machine learning applications. The workflow automates the simulation of capped dipeptides across nine distinct residue types, creating a diverse training set suitable for Neural Network Potentials (NNPs). The pipeline is built off Luca Tubiana&rsquo;s GROMACS tutorial (University of Trento); the Python analysis layer and the curated dipeptide dataset are my own.</p>
<h2 id="features">Features</h2>
<h3 id="automated-simulation-pipeline">Automated Simulation Pipeline</h3>
<ul>
<li><strong>End-to-End Scripting</strong>: Bash-automated workflow handling topology generation (<code>pdb2gmx</code>), solvation, ionization, and equilibration</li>
<li><strong>Langevin Dynamics</strong>: Implemented Stochastic Dynamics (SD) integration to ensure proper canonical (NVT) ensemble sampling</li>
<li><strong>High-Resolution Output</strong>: Configured to capture <strong>0.1 ps (100 fs) resolution</strong> trajectories, critical for capturing fast bond vibrations</li>
<li><strong>Force Extraction</strong>: Optimized output to <code>.trr</code> format preserving uncompressed atomic forces, a key requirement for force-matching in ML potentials</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-ini" data-lang="ini"><span style="display:flex;"><span><span style="color:#75715e">; md_langevin.mdp</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">integrator</span>  <span style="color:#f92672">=</span> <span style="color:#e6db74">sd        ; Stochastic dynamics for proper sampling</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">dt</span>          <span style="color:#f92672">=</span> <span style="color:#e6db74">0.001     ; 1 fs timestep</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">nstxout</span>     <span style="color:#f92672">=</span> <span style="color:#e6db74">100       ; Output every 100 steps = 0.1 ps resolution</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">tc-grps</span>     <span style="color:#f92672">=</span> <span style="color:#e6db74">Protein Non-Protein</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">tau_t</span>       <span style="color:#f92672">=</span> <span style="color:#e6db74">0.1  0.1  ; Friction constant (ps)</span>
</span></span></code></pre></div><h3 id="chemical-diversity-suite">Chemical Diversity Suite</h3>
<p>Designed to stress-test ML models against varied kinematic constraints:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Residues</th>
          <th>Dynamics Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Aromatic</strong></td>
          <td>Phe, Trp</td>
          <td>π-stacking, bulky side chains</td>
      </tr>
      <tr>
          <td><strong>Constrained</strong></td>
          <td>Pro</td>
          <td>Cyclic backbone restrictions</td>
      </tr>
      <tr>
          <td><strong>Flexible</strong></td>
          <td>Gly, Ala</td>
          <td>High conformational entropy</td>
      </tr>
      <tr>
          <td><strong>Branched</strong></td>
          <td>Val, Ile, Leu</td>
          <td>Steric clashes, rotamer preferences</td>
      </tr>
      <tr>
          <td><strong>Sulfur-Containing</strong></td>
          <td>Met</td>
          <td>Flexible thioether linkage</td>
      </tr>
  </tbody>
</table>
<h2 id="usage">Usage</h2>
<p>The pipeline is executed via bash scripts, requiring GROMACS to be installed.</p>
<h2 id="results">Results</h2>
<ul>
<li><strong>Data Volume vs. Fidelity</strong>: Balanced high-frequency force outputs (every 100 steps) against storage constraints by automating post-processing extraction of forces into lightweight <code>.xvg</code> formats</li>
<li><strong>Force Field Consistency</strong>: Standardized the Amber03 force field and TIP3P water model across all residues to ensure consistent potential energy surfaces for downstream model training</li>
</ul>
<blockquote>
<p><strong>Note</strong>: This pipeline uses Amber03 for consistency across residue types. For production ML potentials, consider swapping to Charmm36m or similar modern force fields.</p></blockquote>
<h2 id="retrospective">Retrospective</h2>
<ul>
<li><strong>Demonstrative, not production-scale</strong>: the 1 ns trajectories exercise the pipeline and capture fast bond vibrations, but proper conformational sampling needs 100 ns to 1 µs runs. This is a working reference, not a finished dataset.</li>
<li><strong>Dated force field</strong>: Amber03 / TIP3P keeps the potential energy surface consistent across residues, but it is not state-of-the-art for ML-potential training; CHARMM36m or Amber ff19SB would be the upgrade path.</li>
<li><strong>Paused, not abandoned</strong>: a candidate to revive and extend (more residues, longer trajectories, Ramachandran analysis) for future force-matching work.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<ul>
<li><a href="/posts/mini-proteins/">Mini-Protein Dynamics</a> - Detailed blog post on the simulation methodology</li>
</ul>
]]></content:encoded></item><item><title>Congressional Knowledge Graph &amp; Policy Classification</title><link>https://hunterheidenreich.com/projects/congressional-data-analysis/</link><pubDate>Wed, 01 Mar 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/congressional-data-analysis/</guid><description>A 47,000+ bill knowledge graph from Congress.gov with co-sponsorship networks and TF-IDF baselines for 33-class policy-area classification.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>A computational social science project that constructed a dataset of 47,000+ US congressional bills by extracting legislative text and metadata from the 115th-117th Congresses. The project creates a &ldquo;legislative graph&rdquo;
(linking sponsors, committees, and bill text) and establishes TF-IDF baseline models for policy area classification across 33 (highly imbalanced) policy classes, now hosted on Hugging Face to support reproducible political science research.</p>
<h2 id="features">Features</h2>
<h3 id="intelligent-data-acquisition">Intelligent Data Acquisition</h3>
<p>Standard APIs impose strict rate limits. I built a Selenium-based extraction engine to handle Congress.gov&rsquo;s complex DOM structures.</p>
<ul>
<li><strong>Optimization</strong>: Targeted aggregate endpoints (e.g., <code>/all-info</code>) to pull each bill&rsquo;s text and metadata in fewer requests.</li>
<li><strong>Resilience</strong>: Implemented a local caching layer to store raw HTML, separating the fetch step from the parse step. This made the parse step re-runnable without re-fetching, and minimized server load during iterative development.</li>
<li><strong>Graph construction</strong>: Beyond simple text, the script extracts relational data including co-sponsorship networks, committee assignments, and related bill lineage.</li>
</ul>
<h3 id="natural-language-processing">Natural Language Processing</h3>
<ul>
<li><strong>Corpus construction</strong>: Cleaned and normalized legislative text, removing procedural artifacts (e.g., &ldquo;A BILL TO&hellip;&rdquo;) to isolate semantic policy content.</li>
<li><strong>Feature engineering</strong>: Utilized TF-IDF vectorization with N-gram analysis to capture legislative jargon.</li>
<li><strong>Modeling</strong>: Benchmarked Naive Bayes, Logistic Regression, and gradient-boosted trees (XGBoost), reaching ~0.86 weighted F1 on bill summaries and up to ~0.89 on full text (cross-validated). Weighted F1, not raw accuracy, is the honest metric here: the 33 policy classes are severely imbalanced (Health has 5,911 bills; Social Sciences and History has 15).</li>
</ul>
<h2 id="usage">Usage</h2>
<p>The dataset is available on Hugging Face and can be loaded directly via the <code>datasets</code> library. The scraper can be run locally to fetch new bills.</p>
<h2 id="results">Results</h2>
<ul>
<li><strong>The &ldquo;partisan vocabulary&rdquo;</strong>: Feature importance analysis revealed distinct linguistic markers separating Democratic and Republican legislation, identifiable even without metadata.</li>
<li><strong>Temporal drift</strong>: Policy priorities and terminology showed measurable shifts across congressional sessions (115th vs 117th).</li>
<li><strong>Classification success</strong>: Simple linear models (Logistic Regression and Naive Bayes) proved effective at distinguishing policy domains, outperforming gradient-boosted trees on these sparse TF-IDF features and suggesting legislative language is highly structured.</li>
</ul>
<h2 id="impact--deliverables">Impact &amp; Deliverables</h2>
<ul>
<li><strong>Hugging Face dataset</strong>: Released a machine-readable, ML-ready dataset of modern bills (115th-117th Congresses) on Hugging Face for reproducible research.</li>
<li><strong>Open source tooling</strong>: Published the scraper and parsing logic to allow others to extend the dataset to future congresses.</li>
<li><strong>Academic benchmark</strong>: Establishing a clear baseline for &ldquo;Government NLP&rdquo; tasks, aiding in the automated transparency and monitoring of new legislation.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<ul>
<li><a href="/posts/us-117th-congress-data-exploration/">117th Congress Data Exploration</a></li>
<li><a href="/posts/congressional-bill-policy-area-classification/">Congressional Bill Policy Area Classification</a></li>
</ul>
]]></content:encoded></item><item><title>Look, Don't Tweet: Unified Data Models for Social NLP</title><link>https://hunterheidenreich.com/research/look-dont-tweet/</link><pubDate>Wed, 30 Jun 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/look-dont-tweet/</guid><description>PyConversations library and unified data schema for normalizing 300M+ posts across Twitter, Reddit, Facebook, and 4chan.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>This is my undergraduate senior thesis, completed at Drexel University in 2021. The scope (308 million posts across four platforms, structural topology analysis, and domain adaptation experiments with Transformer models) was unusually broad for a senior thesis, spanning large-scale data engineering, graph-structural analysis, and representation-learning experiments.</p>
<p>Social media research is often siloed by platform, with tools built specifically for Twitter&rsquo;s flat structure or Reddit&rsquo;s tree structure. This fragmentation makes cross-platform analysis difficult. In this work, I introduce <strong><a href="https://github.com/hunter-heidenreich/pyconversations">PyConversations</a></strong>, an open-source Python package that normalizes data from Twitter, Facebook, Reddit, and 4chan into a single, platform-agnostic data model. <em>(Note: the repository is archived and no longer actively maintained.)</em></p>
<p>Leveraging this tool, I processed over <strong>308 million posts</strong> to analyze the structural &ldquo;shape&rdquo; of online conversations. I then evaluated the efficacy of domain-adaptive pre-training (DAPT) for Transformer-based language models, finding that training on a toxic domain (4chan) boosts hate-speech detection by over 5 F1.</p>
<h2 id="the-engineering-problem-data-normalization">The Engineering Problem: Data Normalization</h2>
<p>Social media platforms impose different structural constraints on discourse, making it difficult to feed heterogeneous data into a single ML pipeline:</p>
<ul>
<li><strong>Twitter:</strong> Technically allows infinite depth, but functionally operates as a flat stream or shallow tree.</li>
<li><strong>Facebook:</strong> Enforces a hard limit of two depth levels (comments and replies), resulting in &ldquo;short and fat&rdquo; conversation trees.</li>
<li><strong>Reddit &amp; 4chan:</strong> Allow for deep, branching tree structures.</li>
</ul>
<p>To solve this, I designed a <strong>Universal Message Schema</strong> and the <strong>PyConversations</strong> library. This system ingests raw dumps from these disparate sources and maps them to a unified Directed Acyclic Graph (DAG) format, preserving the parent-child relationships regardless of the source platform&rsquo;s constraints.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>PyConversations Library</strong>: An open-source package for robust conversational analysis, featuring graph-based traversing and filtering.</li>
<li><strong>Massive Dataset Analysis</strong>: Processed a collection of <strong>308 million posts</strong> and <strong>15.8 million conversations</strong>, creating one of the largest comparative cross-platform analyses at the time of thesis submission.</li>
<li><strong>Structural Insights</strong>: Quantified how UI constraints shape human behavior. For instance, Facebook&rsquo;s depth limit forces users to &ldquo;bunch&rdquo; comments, creating uniquely wide conversation trees compared to Reddit&rsquo;s deep, narrow threads.</li>
<li><strong>Domain Adaptation Experiments</strong>: Continued-pretrained RoBERTa on platform-specific slices (e.g., the 4chan-adapted <code>RoBERTa-4chan</code>), demonstrating that exposing models to toxic domains improved hate-speech detection F1 by over 5 points.</li>
</ul>
<h2 id="structural-analysis-findings">Structural Analysis Findings</h2>
<p>By treating conversations as graphs, we uncovered distinct topological signatures for each platform:</p>
<h3 id="the-shape-of-discourse">The &ldquo;Shape&rdquo; of Discourse</h3>
<p>We measured the <strong>width</strong> (max posts at any depth) and <strong>depth</strong> (max distance from root) of conversation trees.</p>
<ul>
<li><strong>Facebook</strong> exhibited a &ldquo;short and fat&rdquo; topology due to its 2-level nesting limit.</li>
<li><strong>4chan</strong> threads were surprisingly shallow despite having no depth limits. This suggests that the platform&rsquo;s <strong>ephemerality</strong> (threads are deleted quickly) and the &ldquo;bump limit&rdquo; mechanic discourage long-term dialogue, though data scraping limitations on this transient platform also contribute to this topology.</li>
<li><strong>Reddit</strong> maintained the most robust tree structures, with &ldquo;good faith&rdquo; communities like <em>r/ChangeMyView</em> showing distinct patterns of sustained engagement.</li>
</ul>
<h3 id="information-density">Information Density</h3>
<p>We analyzed <strong>Innovation Rate</strong>, a measure of how quickly a text introduces new vocabulary. We found that Twitter threads have negative innovation rates (indicating high novelty per token) likely forced by the strict character limits. In contrast, Reddit posts showed higher redundancy, typical of longer-form essay writing.</p>
<h2 id="representation-learning--domain-adaptation">Representation Learning &amp; Domain Adaptation</h2>
<p>We experimented with &ldquo;Warm-Start&rdquo; tuning: taking a standard RoBERTa model and pre-training it further on platform-specific data before fine-tuning on downstream tasks (TweetEval).</p>
<ul>
<li><strong>Limited gains on most general tasks:</strong> Domain-adaptive pre-training added little on sentiment and emotion (from well under 1 up to a few F1 points), with irony detection the exception (+5.6 to +5.9 F1). Base RoBERTa already covers most of the signal for general NLP tasks.</li>
<li><strong>The Toxic Exception:</strong> The notable exception was <strong>Hate Speech Detection</strong>. The 4chan-adapted model (<code>RoBERTa-4chan</code>) was the strongest here, outperforming the baseline by over 5 F1. This highlights that for specialized, out-of-distribution language (like toxic slang), domain adaptation remains valuable.</li>
</ul>
<h2 id="significance">Significance</h2>
<p>This work bridges the gap between <strong>Computational Social Science</strong> and <strong>ML Engineering</strong>. It provides the community with a reusable tool (<code>PyConversations</code>) to handle the messy reality of social data and offers empirical evidence on the limits and benefits of domain-adaptive pre-training for LLMs.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@thesis</span>{heidenreich2021look,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Look, Don&#39;t Tweet: Representation Learning and Social Media}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">school</span>=<span style="color:#e6db74">{Drexel University}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">type</span>=<span style="color:#e6db74">{Undergraduate Senior Thesis}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For related work on how social media content surfaces in digital journalism, including a dataset of embedded tweets across 273,899 news articles, see <a href="/research/newstweet-social-media-journalism/">NewsTweet Dataset: Social Media in Digital Journalism</a>.</p>
]]></content:encoded></item><item><title>NewsTweet Dataset: Social Media in Digital Journalism</title><link>https://hunterheidenreich.com/research/newstweet-social-media-journalism/</link><pubDate>Sat, 01 Aug 2020 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/newstweet-social-media-journalism/</guid><description>NewsTweet dataset for studying embedded tweets in online journalism. Analysis shows 13% of Google News stories contain tweets.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>We introduce NewsTweet, a dataset and data collection pipeline designed to study the embedding of social media in digital journalism. Our descriptive analysis of articles collected from Google News (chosen for its significant role in shaping attention) reveals that 13% of stories include embedded tweets. The dataset provides a foundation for exploring how social media content is sourced and which users become newsworthy. <em>(Note: this is an arXiv preprint from 2020 and was not published at a peer-reviewed venue.)</em></p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Large-Scale Dataset</strong>: A dataset of 273,899 news articles, with 35,218 containing embedded tweets, collected from Google News RSS feeds over a four-month period.</li>
<li><strong>Data Collection Pipeline</strong>: Details an automated pipeline for acquiring news articles, extracting embedded tweets, and collecting the corresponding user timelines from Twitter&rsquo;s API.</li>
<li><strong>Descriptive Statistics</strong>: Presents statistics on the prevalence of tweet embedding across different news categories, outlets, and users, highlighting key patterns.</li>
</ul>
<h2 id="data-availability">Data Availability</h2>
<p>The NewsTweet dataset is not publicly available for direct download. Due to Twitter/X&rsquo;s Terms of Service restrictions on redistributing tweet content, the full dataset cannot be shared openly. Researchers interested in accessing the data or the collection pipeline are encouraged to contact the authors via the <a href="https://arxiv.org/abs/2008.02870">arXiv paper (arXiv:2008.02870)</a>.</p>
<h2 id="dataset-characteristics">Dataset Characteristics</h2>
<h3 id="scale-and-coverage">Scale and Coverage</h3>
<ul>
<li><strong>News Sources</strong>: 5,961 unique news domains aggregated through Google News RSS feeds.</li>
<li><strong>Time Period</strong>: Data collection initiated on May 15th, 2019, with the paper describing the first four months of data.</li>
<li><strong>Collection Velocity</strong>: The pipeline averaged <strong>2,302 articles per day</strong>, with approximately 296 containing embedded tweets.</li>
<li><strong>Content Types</strong>: Focuses specifically on embedded tweets from Twitter, the most frequently embedded platform.</li>
<li><strong>Metadata</strong>: Includes article source, Google News category (e.g., Sports, Health), and full tweet and user objects from the Twitter API.</li>
</ul>
<h3 id="technical-implementation">Technical Implementation</h3>
<ul>
<li><strong>RSS-to-API Pipeline</strong>: Automatically crawls Google News RSS feeds to extract article HTML, identifying embedded tweet IDs to fetch full objects via the Twitter API.</li>
<li><strong>Artifact Filtering</strong>: Implements cleaning protocols to handle artifacts, such as detecting and excluding YouTube pages that appear as articles in Google News feeds.</li>
<li><strong>Longitudinal Tracking</strong>: Features a &ldquo;top-off&rdquo; mechanism that continuously tracks discovered users, updating their timelines to capture historical context.</li>
<li><strong>Rate Limit Management</strong>: Utilizes a random sampling queue to maintain continuous data collection across thousands of users without exceeding Twitter API limits.</li>
</ul>
<h2 id="key-findings">Key Findings</h2>
<h3 id="embedding-prevalence">Embedding Prevalence</h3>
<ul>
<li><strong>13% of news articles</strong> in our Google News-sourced collection contained embedded tweets.</li>
<li><strong>Significant variation across categories</strong>: Sports (24% of articles) and Entertainment (14%) had the highest rates of embedding, while Health (2%) had the lowest.</li>
<li>News outlets that publish the most articles are well-known mass media organizations, while outlets with the highest average number of embeds per article are often focused on Sports and Entertainment.</li>
</ul>
<h3 id="user-and-content-patterns">User and Content Patterns</h3>
<ul>
<li><strong>Public figures dominate</strong>: Well-known figures like politicians and celebrities, alongside organizations, are embedded far more often than ordinary users.</li>
<li>Some users have a small number of their tweets embedded many times, while others gain newsworthiness from a wider range of their content.</li>
<li>The Health category, despite having few embedded tweets, had the highest proportion of unique tweets (93%), suggesting that when tweets are embedded, they are less likely to be reused across multiple stories.</li>
<li><strong>&ldquo;Catch-up&rdquo; Phenomenon</strong>: Data reveals a class of users with high &ldquo;embedding effectiveness&rdquo;: those embedded more frequently than they tweet. This suggests journalists often use embeddings to &ldquo;catch readers up&rdquo; on backstories for previously unknown individuals.</li>
</ul>
<h2 id="significance">Significance</h2>
<p>The dataset is a foundation for studying how social media surfaces in journalism: how sourcing routines are evolving in the digital age, how traditional outlets and social platforms interact, and how previously-unknown users become newsworthy, grounded in the per-category and per-outlet embedding rates and the user-newsworthiness patterns the dataset captures.</p>
<h2 id="my-contribution">My Contribution</h2>
<p>I am the second of six authors on this paper. My contributions focused on the descriptive analysis: writing code to process the collected data, generating summary tables and statistics, and helping write and review the manuscript.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{mujib2020newstweetdatasetsocialmedia,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{NewsTweet: A Dataset of Social Media Embedding in Online Journalism}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Munif Ishad Mujib and Hunter Scott Heidenreich and Colin J. Murphy and Giovanni C. Santia and Asta Zelenkauskaite and Jake Ryland Williams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2008.02870}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.SI}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2008.02870}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<ul>
<li><a href="/research/look-dont-tweet/">Look, Don&rsquo;t Tweet: Unified Data Models for Social NLP</a>: provides the unified cross-platform social media data model underlying broader Twitter analysis.</li>
<li><a href="/research/coordinated-social-targeting/">Coordinated Social Targeting on Twitter</a>: a companion study from the same research group and time period, documenting coordinated follower-manipulation patterns on high-profile Twitter accounts.</li>
</ul>
]]></content:encoded></item><item><title>Data-Driven WordNet Construction from Wiktionary</title><link>https://hunterheidenreich.com/research/semantic-network-induction/</link><pubDate>Fri, 01 Nov 2019 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/semantic-network-induction/</guid><description>We introduce an unsupervised algorithm for inducing semantic networks from noisy, crowd-sourced data, producing a resource with over 344,000 linked examples.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>We introduce a novel <strong>unsupervised algorithm</strong> for inducing semantic networks from noisy, crowd-sourced data. By framing network construction as a &ldquo;relationship disambiguation&rdquo; task, we process Wiktionary&rsquo;s English entries to build a massive, WordNet-like semantic resource. The resulting network is more than 5x larger than Princeton WordNet and features over <strong>344,000 linked example sentences</strong> (vs. WordNet&rsquo;s 68k). Evaluation on standard word similarity benchmarks demonstrates that our fully data-driven approach yields semantic structures competitive with expert-annotated resources.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Unsupervised Hierarchy Induction</strong>: We propose a deterministic algorithm to construct a Directed Acyclic Graph (DAG) of senses from pairwise relationships, effectively inducing a semantic hierarchy without human supervision.</li>
<li><strong>A Massive Semantic Resource</strong>: We release a dataset enriched with hundreds of thousands of semantically linked usage examples, serving as a critical resource for tasks like Word Sense Disambiguation (WSD).</li>
<li><strong>Disambiguation Framework</strong>: We model &ldquo;relationship disambiguation&rdquo; using a Laplacian kernel and FastText embeddings to filter noisy user annotations.</li>
<li><strong>Open-Source Infrastructure</strong>: We provide a full pipeline for downloading, parsing, and constructing networks from Wiktionary data.</li>
</ul>
<h2 id="technical-approach">Technical Approach</h2>
<p>The core of our method addresses the noise inherent in crowd-sourced dictionaries. We frame the problem as <strong>Latent Semantic Network Induction</strong>:</p>
<ol>
<li><strong>Relationship Disambiguation</strong>: For every linked pair of words (e.g., <em>go</em> ~ <em>proceed</em>), we define a semantic subspace using their definitions. We utilize <strong>FastText embeddings</strong> and a <strong>Laplacian kernel</strong> to identify which specific definitions participate in the relationship.</li>
<li><strong>Hierarchy Construction</strong>: We apply a custom intersection algorithm that treats more general senses as the &ldquo;overlap&rdquo; between specific definition sets. We formalize this as a set-theoretic &ldquo;hole punching&rdquo; operation, where a general sense $t$ is defined by the intersection of definition sets $\mathbb{D}&rsquo;$, excluding any broader intersections:</li>
</ol>
<p>$$f^{-1}(t) = \left(\bigcap_{\mathbb{D}&rsquo;} D_{u\sim v}\right) \setminus \left(\bigcup_{\mathbb{D} \supset \mathbb{D}&rsquo;} \bigcap_{\mathbb{D}} D_{u\sim v}\right)$$</p>
<h2 id="evaluation--validation">Evaluation &amp; Validation</h2>
<p>The primary achievement is scale: our induced network contains over <strong>344,000 linked example sentences</strong>, compared to Princeton WordNet&rsquo;s 68,000 (more than 5x the coverage), built entirely from crowd-sourced data without expert annotation.</p>
<p>Beyond scale, the network holds up semantically. On standard noun-similarity benchmarks (RG-65), the unsupervised network achieves a Spearman rank correlation of $\rho = 0.83$, matching the performance of Explicit Semantic Analysis (ESA) models built on expert-annotated WordNet ($\rho = 0.82$). The point is not that we beat WordNet by 0.01. It is that a fully automated approach over noisy Wiktionary data produces a resource of comparable quality at 5x the scale.</p>
<h2 id="why-this-matters">Why This Matters</h2>
<p>Building high-quality linguistic resources typically requires expensive expert annotation. Princeton WordNet took decades of lexicographer effort. This work demonstrates that an unsupervised algorithm over crowd-sourced data can produce a resource of comparable semantic quality at more than 5x the scale. For ML practitioners, that matters: larger coverage means more training signal for downstream tasks like Word Sense Disambiguation. For this portfolio, it shows early experience building structured NLP datasets from scratch, a theme that continues in later work on large-scale document corpora.</p>
<h2 id="related-work">Related Work</h2>
<p>For a theoretical treatment of word semantics from the same collaboration, including the first analytical solution to Word2Vec&rsquo;s softmax objective, see <a href="/research/word-company-vicinity/">Analytical Solution to Word2Vec Softmax &amp; Bias Probing</a>.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{heidenreich2019latent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Latent semantic network induction in the context of linked example senses}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heidenreich, Hunter and Williams, Jake}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{170--180}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>QuAC: Question Answering in Context Dataset</title><link>https://hunterheidenreich.com/posts/quac-question-answering-in-context/</link><pubDate>Wed, 31 Oct 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/quac-question-answering-in-context/</guid><description>Analysis of QuAC's conversational QA through student-teacher interactions, featuring ~100K context-dependent questions and coreference challenges.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>The <a href="https://aclanthology.org/D18-1241/">QuAC dataset</a> (Question Answering in Context) presents a conversational question answering approach that models student-teacher interactions. Published at EMNLP 2018, this work by Choi et al. addresses how systems can understand dialogue context, resolve references across conversation turns, and handle natural conversation ambiguity. Previous datasets treated questions independently.</p>
<p>The dataset addresses limitations in question answering research by incorporating real-world information-seeking dialogue complexities, where questions build upon previous exchanges and context drives understanding.</p>
<p>For comparison with related work, see my analysis of <a href="/posts/coqa-conversation-question-answering/">CoQA</a>.</p>
<h2 id="the-student-teacher-framework">The Student-Teacher Framework</h2>
<p>QuAC models information-seeking dialogue through a student-teacher setup:</p>
<ul>
<li><strong>Teacher</strong>: Has complete access to information (Wikipedia passage)</li>
<li><strong>Student</strong>: Seeks knowledge through questioning with limited initial context</li>
<li><strong>Interaction</strong>: Handles context-dependent questions, abstract inquiries, and unanswerable requests</li>
</ul>
<p>This framework mirrors real-world scenarios where one party has expertise while another seeks to learn through dialogue. AI systems must act as effective teachers, using available information to provide helpful responses despite ambiguous or incomplete questions.</p>
<p>The dataset contains roughly 100K questions across ~14K dialogues (precisely 98,407 questions and 13,594 dialogues), providing substantial scale for training and evaluation.</p>















<figure class="post-figure center ">
    <img src="/img/quac_stats.webp"
         alt="QuAC dataset statistics and scale"
         title="QuAC dataset statistics and scale"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">QuAC dataset statistics and scale</figcaption>
    
</figure>

<h2 id="dataset-construction">Dataset Construction</h2>
<p>QuAC was built using Amazon Mechanical Turk with a two-person dialogue setup:</p>
<p><strong>Teacher role</strong>: Has access to the complete Wikipedia passage and provides answers extracted directly from the text</p>
<p><strong>Student role</strong>: Sees only the article title, introduction paragraph, and section heading, then asks questions to learn about the content</p>
<p>This asymmetric information design ensures student questions naturally differ from the passage content, creating realistic information-seeking scenarios. The extractive answer requirement maintains objective evaluation while simplifying scoring.</p>
<p><strong>Dialogue termination</strong>:</p>
<ul>
<li>12 questions answered</li>
<li>Manual termination by either participant</li>
<li>Two consecutive unanswerable questions</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/quac_convo.webp"
         alt="Example QuAC conversation showing student-teacher interaction"
         title="Example QuAC conversation showing student-teacher interaction"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example QuAC conversation showing student-teacher interaction</figcaption>
    
</figure>

<h3 id="content-selection">Content Selection</h3>
<p>QuAC focuses on Wikipedia biographical articles for several practical reasons:</p>
<ul>
<li><strong>Reduced complexity</strong>: People-focused content requires less specialized domain knowledge</li>
<li><strong>Natural question flow</strong>: Biographical information lends itself to sequential questioning</li>
<li><strong>Quality control</strong>: Articles filtered to include only subjects with 100+ incoming links, ensuring content depth</li>
</ul>
<p>This focused scope enables consistent evaluation while maintaining broad coverage through diverse biographical subjects across fields and time periods.</p>
<h2 id="key-dataset-characteristics">Key Dataset Characteristics</h2>
<p>QuAC introduces several features that distinguish it from existing question answering benchmarks:</p>















<figure class="post-figure center ">
    <img src="/img/quac_comparison.webp"
         alt="Comparative analysis of QuAC against other QA datasets"
         title="Comparative analysis of QuAC against other QA datasets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Comparative analysis of QuAC against other QA datasets</figcaption>
    
</figure>

<p><strong>Notable features</strong>:</p>
<ul>
<li><strong>High contextual dependency</strong>: a large majority of questions depend on the conversation context, and a substantial share require coreference resolution</li>
<li><strong>Non-factoid focus</strong>: 54% of questions go beyond simple fact retrieval</li>
<li><strong>Extended answers</strong>: Responses are longer and more detailed</li>
<li><strong>Unanswerable questions</strong>: Realistic scenarios where information isn&rsquo;t available</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/quac_dist.webp"
         alt="Distribution of question types in QuAC"
         title="Distribution of question types in QuAC"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Distribution of question types in QuAC</figcaption>
    
</figure>

<h3 id="the-coreference-resolution-challenge">The Coreference Resolution Challenge</h3>
<p>QuAC&rsquo;s complexity stems from its heavy reliance on coreference resolution across multiple contexts:</p>
<p><strong>Reference types</strong>:</p>
<ul>
<li><strong>Passage references</strong>: Pronouns and references to entities in the source text</li>
<li><strong>Dialogue references</strong>: References to previously discussed topics</li>
<li><strong>Abstract references</strong>: Challenging cases like &ldquo;what else?&rdquo; that require inferring the inquiry scope</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/quac_coref.webp"
         alt="Types and distribution of coreferences in QuAC"
         title="Types and distribution of coreferences in QuAC"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Types and distribution of coreferences in QuAC</figcaption>
    
</figure>

<p>The prevalence of coreference resolution makes QuAC particularly challenging, as this remains an active research problem in NLP. Models must understand passage content, track dialogue history, and resolve complex referential expressions simultaneously.</p>
<h2 id="performance-results">Performance Results</h2>
<p>Models face substantial challenges on QuAC, with significant gaps between human and machine performance:</p>















<figure class="post-figure center ">
    <img src="/img/quac_performance.webp"
         alt="Baseline model performance comparison on QuAC"
         title="Baseline model performance comparison on QuAC"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Baseline model performance comparison on QuAC</figcaption>
    
</figure>

<p><strong>Performance summary</strong>:</p>
<ul>
<li><strong>Human performance</strong>: 81.1% F1 score</li>
<li><strong>Best baseline</strong>: BiDAF++ with context achieves 60.2% F1</li>
<li><strong>Performance gap</strong>: 20+ point difference shows room for improvement</li>
</ul>
<h3 id="human-equivalence-metrics">Human Equivalence Metrics</h3>
<p>QuAC introduces evaluation metrics beyond traditional F1 scores:</p>
<p><strong>HEQ-Q (Human Equivalence Question-level)</strong>: Percentage of questions where the model achieves human-level or better performance</p>
<p><strong>HEQ-D (Human Equivalence Dialogue-level)</strong>: Percentage of complete dialogues where the model matches human performance across all questions</p>
<p><strong>Current results</strong>:</p>
<ul>
<li>Human baseline: 100% HEQ-Q, 100% HEQ-D (by definition)</li>
<li>Best model: 55.1% HEQ-Q, 5.2% HEQ-D</li>
</ul>
<p>These metrics show both average performance and consistency across questions and conversations, important for practical dialogue systems.</p>
<h2 id="research-impact">Research Impact</h2>
<p>QuAC represents an important step in question answering research by introducing realistic conversational dynamics that existing datasets lack. The student-teacher framework captures natural information-seeking behavior while maintaining extractive evaluation for objective assessment.</p>
<p><strong>Key contributions</strong>:</p>
<ul>
<li><strong>Conversational realism</strong>: Context-dependent questions that mirror dialogue patterns</li>
<li><strong>Coreference complexity</strong>: Integration of challenging NLP problems into QA evaluation</li>
<li><strong>Evaluation metrics</strong>: HEQ scores that measure consistency alongside average performance</li>
<li><strong>Large-scale framework</strong>: Substantial dataset enabling robust model training and evaluation</li>
</ul>
<p>The dataset&rsquo;s <a href="https://quac.ai/">leaderboard</a> provides researchers with a challenging benchmark for developing conversational AI systems. As models improve on QuAC, we can expect progress in dialogue agents, virtual assistants, and educational AI systems that engage in more natural, context-aware conversations.</p>
<p>QuAC&rsquo;s focus on dialogue context and reference resolution pushes the field toward AI systems that can engage in genuine conversation and understand complex dialogue flows.</p>
<h2 id="a-builders-perspective-quac-and-modern-instruction-tuning">A Builder&rsquo;s Perspective: QuAC and Modern Instruction Tuning</h2>
<p>Looking at QuAC through the lens of modern production ML, the student-teacher framework maps directly onto how we now train and evaluate assistants. Today, we train foundation models using Reinforcement Learning from Human Feedback (RLHF) and instruction tuning, which rely heavily on multi-turn, context-aware interactions.</p>
<p>When building a system like GutenOCR, users rarely ask perfectly formulated, context-free questions. They ask follow-ups, use pronouns, and expect the system to act as a knowledgeable &ldquo;teacher&rdquo; guiding them through the document. QuAC was an early dataset to formalize this asymmetric information dynamic. It highlighted the necessity of handling unanswerable questions gracefully, a critical feature for preventing hallucinations in today&rsquo;s production LLMs.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{choi-etal-2018-quac,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">&#34;{Q}u{AC}: Question Answering in Context&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">&#34;Choi, Eunsol and He, He and Iyyer, Mohit and Yatskar, Mark and Yih, Wen-tau and Choi, Yejin and Liang, Percy and Zettlemoyer, Luke&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">&#34;Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span> = oct # <span style="color:#e6db74">&#34;-&#34;</span> # nov,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">&#34;2018&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">&#34;Brussels, Belgium&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">&#34;Association for Computational Linguistics&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">&#34;https://aclanthology.org/D18-1241/&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">&#34;10.18653/v1/D18-1241&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">&#34;2174--2184&#34;</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CoQA Dataset: Advancing Conversational Question Answering</title><link>https://hunterheidenreich.com/posts/coqa-conversation-question-answering/</link><pubDate>Thu, 23 Aug 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/coqa-conversation-question-answering/</guid><description>Analysis of CoQA, a conversational QA dataset with multi-turn dialogue, coreference resolution, and natural answers for QA research.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>The <a href="https://doi.org/10.1162/tacl_a_00266">CoQA dataset</a> (Reddy et al., 2019) introduces conversational dynamics to question answering research. CoQA requires models to maintain context across multi-turn conversations while reading and reasoning about text passages. Previous datasets focused on isolated question-answer pairs.</p>
<p>This dataset addresses a gap in conversational AI research by providing a benchmark for systems that must understand dialogue flow and implicit references. These are key components of natural human conversation.</p>
<p>For related work on conversational question answering, see my analysis of <a href="/posts/quac-question-answering-in-context/">QuAC</a>.</p>
<h2 id="what-makes-conversational-qa-different">What Makes Conversational QA Different</h2>
<p>Conversational question answering introduces challenges beyond traditional reading comprehension:</p>
<ol>
<li><strong>Context dependency</strong>: Questions rely on previous dialogue turns for meaning</li>
<li><strong>Coreference resolution</strong>: Understanding pronouns and implicit references</li>
<li><strong>Abstractive answering</strong>: Rephrasing information to generate natural responses</li>
<li><strong>Multi-turn reasoning</strong>: Maintaining coherent dialogue across multiple exchanges</li>
</ol>
<p>These requirements differentiate CoQA from existing question answering datasets that treat each question independently.</p>
<h2 id="why-coqa-matters">Why CoQA Matters</h2>
<p>Question answering systems typically excel at finding specific information in text. However, they often struggle with natural conversation. Human communication involves building on previous exchanges, using pronouns and implicit references, and expressing ideas in varied ways.</p>
<p>CoQA addresses this by creating a large-scale dataset for conversational question answering with three primary characteristics:</p>
<ol>
<li>
<p><strong>Conversation-dependent questions</strong>: After the first question, every subsequent question depends on dialogue history across 127,000 questions spanning 8,000 conversations</p>
</li>
<li>
<p><strong>Natural, abstractive answers</strong>: CoQA requires rephrased responses that sound natural in conversation. The answerer first highlighted the relevant text span, then rephrased the information.</p>
</li>
<li>
<p><strong>Domain diversity</strong>: Training covers 5 domains with testing on 7 domains, including 2 unseen during training</p>
</li>
</ol>
<p>The performance gap is notable: humans achieve 88.8% F1 score while the best models at the time reached 65.1% F1, indicating substantial room for improvement.</p>
<h2 id="dataset-construction">Dataset Construction</h2>
<p>CoQA was constructed using Amazon Mechanical Turk, pairing workers in a question-answer dialogue setup. One worker asked questions about a given passage while another provided answers. The answerer first highlighted the relevant text span, then rephrased the information using different words to create natural, abstractive responses.</p>
<p>This methodology produces answers that sound conversational. This makes the dataset highly realistic for dialogue applications.</p>
<h3 id="domain-coverage">Domain Coverage</h3>
<p>CoQA spans diverse text types to ensure evaluation across different writing styles and topics:</p>
<p><strong>Training domains (5):</strong></p>
<ul>
<li>Children&rsquo;s stories from <a href="https://web.archive.org/web/20180829214346/https://uclmr.github.io/ai4exams/data.html#mctest">MCTest</a></li>
<li>Literature from <a href="https://www.gutenberg.org/">Project Gutenberg</a></li>
<li>Educational content from <a href="https://www.cs.cmu.edu/~glai1/data/race/">RACE</a> (middle/high school English)</li>
<li>CNN news articles</li>
<li>Wikipedia articles</li>
</ul>
<p><strong>Test-only domains (2):</strong></p>
<ul>
<li>Science articles from <a href="http://data.allenai.org/ai2-science-questions/">AI2 Science Questions</a></li>
<li>Creative writing from <a href="https://www.reddit.com/r/WritingPrompts/">Reddit WritingPrompts</a></li>
</ul>















<figure class="post-figure center ">
    <img src="/img/coqa_domains.webp"
         alt="Domain distribution in the CoQA dataset"
         title="Domain distribution in the CoQA dataset"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Domain distribution in the CoQA dataset</figcaption>
    
</figure>

<p>The inclusion of test-only domains provides a rigorous evaluation of model generalization to unseen text types.</p>
<h2 id="comparison-with-existing-datasets">Comparison with Existing Datasets</h2>
<p>Prior to CoQA, the dominant question answering benchmark was <a href="https://rajpurkar.github.io/SQuAD-explorer/">SQuAD (Stanford Question Answering Dataset)</a>. SQuAD established foundations for reading comprehension and presented specific constraints:</p>
<ul>
<li><strong>SQuAD 1.0</strong>: 100,000+ questions requiring exact text extraction from Wikipedia passages</li>
<li><strong>SQuAD 2.0</strong>: Added 50,000+ unanswerable questions to test when no answer exists</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/squad_coqa_size.webp"
         alt="Scale comparison between SQuAD and CoQA datasets"
         title="Scale comparison between SQuAD and CoQA datasets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Scale comparison between SQuAD and CoQA datasets</figcaption>
    
</figure>

<p>SQuAD treats each question independently and requires only extractive answers. CoQA addresses these constraints through conversational context and abstractive responses.</p>
<h3 id="question-and-answer-analysis">Question and Answer Analysis</h3>
<p>The differences between SQuAD and CoQA extend beyond conversational context:</p>
<p><strong>Question diversity</strong>: SQuAD heavily favors &ldquo;what&rdquo; questions (~50%). CoQA shows a more balanced distribution across question types, reflecting natural conversation patterns.</p>















<figure class="post-figure center ">
    <img src="/img/squad_v_coqa.webp"
         alt="Question type distribution comparison between SQuAD and CoQA"
         title="Question type distribution comparison between SQuAD and CoQA"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Question type distribution comparison between SQuAD and CoQA</figcaption>
    
</figure>

<p><strong>Context dependence</strong>: CoQA includes challenging single-word questions like &ldquo;who?&rdquo;, &ldquo;where?&rdquo;, or &ldquo;why?&rdquo; that depend entirely on dialogue history.</p>
<p><strong>Answer characteristics</strong>: CoQA answers vary significantly in length and style. SQuAD primarily features extractive spans.</p>















<figure class="post-figure center ">
    <img src="/img/squad_coqa_answers.webp"
         alt="Answer length distribution in SQuAD vs CoQA"
         title="Answer length distribution in SQuAD vs CoQA"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Answer length distribution in SQuAD vs CoQA</figcaption>
    
</figure>

<h2 id="the-coreference-challenge">The Coreference Challenge</h2>
<p>CoQA&rsquo;s difficulty stems largely from its reliance on coreference resolution (determining when different expressions refer to the same entity). This remains a challenging research problem in NLP.</p>
<p><strong>Coreference types in CoQA</strong>:</p>
<ul>
<li><strong>Explicit coreferences</strong> (~50% of questions): Clear indicators like pronouns (&ldquo;him,&rdquo; &ldquo;it,&rdquo; &ldquo;her,&rdquo; &ldquo;that&rdquo;)</li>
<li><strong>Implicit coreferences</strong> (~20% of questions): Context-dependent references requiring inference (e.g., asking &ldquo;where?&rdquo; without specifying what)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/coqa_coreferences.webp"
         alt="Distribution of coreference types in CoQA questions"
         title="Distribution of coreference types in CoQA questions"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Distribution of coreference types in CoQA questions</figcaption>
    
</figure>

<p>These linguistic phenomena make CoQA more difficult than traditional reading comprehension, as models must resolve references across dialogue turns while maintaining conversational coherence.</p>
<h2 id="performance-benchmarks">Performance Benchmarks</h2>
<p>Models faced significant challenges on CoQA, with substantial room for improvement:</p>















<figure class="post-figure center ">
    <img src="/img/coqa_scores.webp"
         alt="Performance comparison on CoQA across different model types"
         title="Performance comparison on CoQA across different model types"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Performance comparison on CoQA across different model types</figcaption>
    
</figure>

<p>The performance gap between human and machine capabilities highlighted conversational question answering as a challenging frontier in NLP research.</p>
<h2 id="research-impact-and-future-directions">Research Impact and Future Directions</h2>
<p>CoQA represents a step toward more natural conversational AI systems. By requiring models to handle dialogue context, coreference resolution, and abstractive reasoning simultaneously, it challenges current NLP system capabilities.</p>
<p>The dataset&rsquo;s <a href="https://stanfordnlp.github.io/coqa/">leaderboard</a> provides a benchmark for measuring progress on this task. As models improve on CoQA, we can expect advances in conversational AI applications, from chatbots to virtual assistants that engage in more natural, context-aware dialogue.</p>
<p>CoQA&rsquo;s contribution to the field aims to parallel ImageNet&rsquo;s impact on computer vision, providing a challenging, well-constructed benchmark that drives research toward more capable AI systems.</p>
<h2 id="a-builders-perspective-coqa-in-the-era-of-llms">A Builder&rsquo;s Perspective: CoQA in the Era of LLMs</h2>
<p>Looking back at CoQA from the perspective of modern production systems, the dataset anticipated where the field went. The challenges it introduced, such as multi-turn reasoning, coreference resolution, and abstractive answering, are the exact capabilities we now expect from instruction-tuned Large Language Models (LLMs).</p>
<p>Production document-processing pipelines rarely extract isolated facts. Users want to chat with their documents, asking follow-up questions like, &ldquo;What does that mean for the Q3 budget?&rdquo; Resolving &ldquo;that&rdquo; to a previous turn&rsquo;s context is exactly the problem CoQA formalized. Datasets like CoQA shifted the field&rsquo;s focus from simple extraction toward dialogue comprehension, the foundation modern conversational document interfaces are built on.</p>
<h2 id="references">References</h2>
<p>Reddy, S., Chen, D., &amp; Manning, C. D. (2019). CoQA: A conversational question answering challenge. <em>Transactions of the Association for Computational Linguistics</em>, 7, 249-266.</p>
]]></content:encoded></item></channel></rss>