<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular Databases &amp; Datasets on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/</link><description>Recent content in Molecular Databases &amp; Datasets on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Fri, 13 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/datasets/index.xml" rel="self" type="application/rss+xml"/><item><title>ZINC-22: A Multi-Billion Scale Database for Ligand Discovery</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/zinc-22/</link><pubDate>Sat, 27 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/zinc-22/</guid><description>The ZINC-22 dataset provides over 37 billion make-on-demand molecules enabling virtual screening and modern drug discovery.</description><content:encoded><![CDATA[<h2 id="key-contribution-scaling-make-on-demand-libraries">Key Contribution: Scaling Make-on-Demand Libraries</h2>
<p>ZINC-22 addresses the critical infrastructure challenges of managing multi-billion-scale libraries of make-on-demand chemical compounds through a federated database architecture, the CartBlanche web interface, and cloud distribution systems that enable modern virtual screening.</p>
<h2 id="overview">Overview</h2>
<p>ZINC-22 is a multi-billion scale public database of commercially available chemical compounds designed for virtual screening. It contains over 37 billion make-on-demand molecules and spreads them across a distributed infrastructure to stay within the indexing limits of any single database. For structural biology pipelines, it provides ready-to-dock 3D models for 4.5 billion molecules alongside pre-calculated pH-specific protonation states, tautomers, and AMSOL partial charges.</p>
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/zinc-22-sample.webp"
         alt="ZINC-22&#39;s 2D Tranche Browser"
         title="ZINC-22&#39;s 2D Tranche Browser"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">ZINC-22&rsquo;s 2D Tranche Browser showing the organization of 37.2 billion molecules by physicochemical properties</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Database</strong></td>
          <td>37B+</td>
          <td>Complete 2D chemical structures from make-on-demand catalogs (Enamine REAL, Enamine REAL Space, WuXi GalaXi, Mcule Ultimate)</td>
      </tr>
      <tr>
          <td><strong>3D Database</strong></td>
          <td>4.5B+</td>
          <td>Ready-to-dock 3D conformations with pre-calculated charges and solvation energies</td>
      </tr>
      <tr>
          <td><strong>Custom Tranches</strong></td>
          <td>Variable</td>
          <td>User-selected molecular subsets via Tranche Browser (e.g., lead-like, fragment-like)</td>
      </tr>
  </tbody>
</table>
<h2 id="use-cases">Use Cases</h2>
<p>ZINC-22 is designed for ultra-large virtual screening (ULVS), analog searching, and molecular docking campaigns. The Tranche Browser enables targeted subset selection (e.g., lead-like, fragment-like) for screening, and the CartBlanche interface supports both interactive and programmatic access to the database. The authors note that as the database grows, docking can identify better-fitting molecules.</p>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ZINC-20</strong></td>
          <td>Predecessor</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Enamine REAL</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>WuXi GalaXi</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Massive scale</strong>: 37+ billion purchasable compounds from major vendors (Enamine, WuXi, Mcule)</li>
<li><strong>Federated architecture</strong>: Supports asynchronous building and horizontal scaling toward trillion-molecule libraries</li>
<li><strong>Platform access</strong>: CartBlanche GUI provides a shopping cart metaphor for compound acquisition</li>
<li><strong>Privacy protection</strong>: Dual public/private server clusters protect patentability of undisclosed catalogs</li>
<li><strong>Chemical diversity</strong>: Linear growth (1 new scaffold per 10 molecules added), with 96.3M+ unique Bemis-Murcko scaffolds</li>
<li><strong>Ready-to-dock</strong>: 3D models include pre-calculated charges, protonation states, and solvation energies</li>
<li><strong>Cloud distribution</strong>: Available via AWS Open Data, Oracle OCI, and UCSF servers</li>
<li><strong>Scale-aware search</strong>: SmallWorld (similarity) and Arthor (substructure) divide the search workload, since each tool faces different constraints on billion-scale queries</li>
<li><strong>Organized access</strong>: Tranche system enables targeted selection of chemical space</li>
<li><strong>Open access</strong>: Entire database freely available to academic and commercial users</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Data Transfer Bottlenecks</strong>: Distributing the 4.5 billion 3D molecules in standard docking formats (such as db2 flexibase) requires roughly 1 petabyte of storage. Transferring that much data takes months over a standard gigabit connection, which makes local copies impractical and effectively mandates working with the data in the cloud.</li>
<li><strong>Search Result Caps</strong>: Interactive Arthor searches are capped at 20,000 molecules to maintain a reliable public service. Users needing more results can use the asynchronous Arthor search tool via TLDR, which sends results by email.</li>
<li><strong>Enumeration Ceiling</strong>: Scaling relies entirely on PostgreSQL sharding. To continue using rigid docking tools, the database must fully enumerate structural states. The authors acknowledge that hardware limitations will likely cap full database enumeration well before the 10-trillion molecule mark, forcing future pipelines to accommodate unenumerated combinatorial fragment spaces.</li>
<li><strong>Download Workflow</strong>: Individual 3D molecule downloads are unavailable directly; researchers must rebuild them via the TLDR tool.</li>
<li><strong>Vendor Updates</strong>: Removing discontinued vendor molecules is difficult because of the federated structure.</li>
</ul>
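<p>The transfer-time claim in the first limitation can be checked with back-of-the-envelope arithmetic. The sketch below assumes a decimal petabyte and a 1 Gbit/s link; the <code>efficiency</code> parameter and the 50% figure are hypothetical illustrations, not numbers from the paper.</p>

```python
def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move size_bytes over a link at the given usable fraction."""
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day

PETABYTE = 10**15  # decimal petabyte (assumption; the text says "roughly 1 Petabyte")
GIGABIT = 10**9    # 1 Gbit/s link

ideal = transfer_days(PETABYTE, GIGABIT)
# Hypothetical 50% sustained utilization of the link.
realistic = transfer_days(PETABYTE, GIGABIT, efficiency=0.5)

print(f"ideal: {ideal:.0f} days, at 50% utilization: {realistic:.0f} days")
```

<p>Even a fully saturated gigabit link needs about 93 days, which is consistent with the "months" estimate above.</p>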
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="hardware--software">Hardware &amp; Software</h3>
<p><strong>Compute infrastructure</strong>:</p>
<ul>
<li>1,700 cores across 14 computers for parallel processing</li>
<li>174 independent PostgreSQL databases (110 &lsquo;Sn&rsquo; for ZINC-ID, 64 &lsquo;Sb&rsquo; for Supplier Codes)</li>
<li>Distributed across Amazon AWS, Oracle OCI, and UCSF servers</li>
</ul>
<p><strong>Software stack</strong>:</p>
<ul>
<li>PostgreSQL 12.2</li>
<li>Python 3.6.8</li>
<li>RDKit 2020.03</li>
<li>Celery task queue with Redis for background processing</li>
<li>All code available on GitHub: docking-org/zinc22-2d, zinc22-3d</li>
</ul>
<h3 id="data-organization--access">Data Organization &amp; Access</h3>
<p><strong>Tranche system</strong>: Molecules are organized into &ldquo;Tranches&rdquo; along four dimensions:</p>
<ol>
<li>Heavy Atom Count</li>
<li>Lipophilicity (LogP)</li>
<li>Charge</li>
<li>File Format</li>
</ol>
<p>This enables downloading specific chemical neighborhoods (e.g., neutral lead-like molecules) without accessing the entire database.</p>
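<p>As a rough illustration of the idea, the sketch below groups molecules by binned heavy atom count, logP, and charge. The label layout, the 1-unit logP bins, and the example molecules are all hypothetical; this is not ZINC-22's actual tranche naming scheme, and the file-format dimension is left out.</p>

```python
from collections import defaultdict

def tranche_key(heavy_atoms, logp, charge):
    """Build an illustrative tranche label from three binned properties.

    The label layout is hypothetical; it is not ZINC-22's real naming scheme.
    """
    logp_bin = int(logp // 1)  # hypothetical 1-unit logP bins (floor)
    sign = "neutral" if charge == 0 else ("pos" if charge > 0 else "neg")
    return f"H{heavy_atoms:02d}_P{logp_bin}_{sign}"

# Hypothetical molecules: (name, heavy atom count, logP, formal charge)
molecules = [
    ("mol_a", 18, 2.3, 0),
    ("mol_b", 18, 2.9, 0),
    ("mol_c", 24, 3.1, 1),
]

# Group molecule names by tranche so one chemical neighborhood can be fetched alone.
tranches = defaultdict(list)
for name, hac, logp, charge in molecules:
    tranches[tranche_key(hac, logp, charge)].append(name)

for key, names in sorted(tranches.items()):
    print(key, names)
```

<p>Selecting a neighborhood then amounts to picking the keys of interest instead of scanning every molecule.</p>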
<p><strong>Search infrastructure</strong>:
At the billion-molecule scale, search indexes no longer fit in RAM. ZINC-22 therefore splits retrieval between two search tools, both exposed through a web interface:</p>
<ul>
<li>
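<p><strong>SmallWorld</strong>: Handles whole-molecule similarity using Graph Edit Distance (GED). GED is the minimum total cost of the operations (node/edge insertions, deletions, or substitutions) required to transform graph $G_1$ into graph $G_2$, where $\mathcal{P}(G_1, G_2)$ is the set of edit paths between the two graphs and $c(e_i)$ is the cost of edit operation $e_i$:</p>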
<p>$$
\text{GED}(G_1, G_2) = \min_{(e_1, \dots, e_k) \in \mathcal{P}(G_1, G_2)} \sum_{i=1}^k c(e_i)
$$</p>
<p>Because SmallWorld searches pre-calculated anonymous graphs, it evaluates close neighbors in near $\mathcal{O}(1)$ time and scales sub-linearly, though it struggles with highly distant structural matches.</p>
</li>
<li>
<p><strong>Arthor</strong>: Provides exact substructure and pattern matching. It scales linearly $\mathcal{O}(N)$ with database size and successfully finds distant hits (e.g., PAINS filters), but performance heavily degrades if the index exceeds available RAM.</p>
</li>
<li>
<p><strong>CartBlanche</strong>: Web interface wrapping these search tools with shopping cart functionality.</p>
</li>
</ul>
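<p>To make the GED definition concrete, here is a minimal sketch for tiny labeled graphs, assuming unit cost for every edit operation. It brute-forces all node mappings, so it is only feasible for a handful of atoms, and it is not how SmallWorld works (SmallWorld searches pre-calculated anonymous graphs). The function name and the graph encoding are illustrative.</p>

```python
from itertools import permutations

def graph_edit_distance(labels1, edges1, labels2, edges2):
    """Exact unit-cost GED by brute force over node mappings (tiny graphs only)."""
    n = max(len(labels1), len(labels2))
    # Pad with None so every node maps either to a real node or to "nothing".
    a = list(labels1) + [None] * (n - len(labels1))
    b = list(labels2) + [None] * (n - len(labels2))
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}

    def cost(perm):
        # A label mismatch covers node substitution, insertion, and deletion.
        node_cost = sum(a[i] != b[perm[i]] for i in range(n))
        # Map graph-1 edges into graph-2 indices; unmatched edges on either
        # side are deletions or insertions.
        mapped = {frozenset(perm[i] for i in e) for e in e1}
        return node_cost + len(mapped ^ e2)

    return min(cost(p) for p in permutations(range(n)))

# Heavy-atom graphs: ethanol (C-C-O), propane (C-C-C), ethane (C-C).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propane = (["C", "C", "C"], [(0, 1), (1, 2)])
ethane = (["C", "C"], [(0, 1)])

print(graph_edit_distance(*ethanol, *propane))  # one atom substitution
print(graph_edit_distance(*ethane, *propane))   # one atom plus one bond inserted
```

<p>The factorial cost of this enumeration is why a production tool cannot compute GED pairwise at query time and instead relies on precomputation.</p>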
<h3 id="3d-generation-pipeline">3D Generation Pipeline</h3>
<p>The 3D database construction pipeline involves multiple specialized tools:</p>
<ol>
<li><strong>ChemAxon JChem</strong>: Protonation state and tautomer generation at physiological pH</li>
<li><strong>Corina</strong>: Initial 3D structure generation</li>
<li><strong>Omega</strong>: Conformation sampling</li>
<li><strong>AMSOL 7.1</strong>: Calculation of atomic partial charges and desolvation energies</li>
<li><strong>Strain calculation</strong>: Relative energies of conformations</li>
</ol>
<p>At sustained throughput, the pipeline builds approximately 11 million molecules per day, each with hundreds of pre-calculated conformations.</p>
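<p>For a sense of scale, the quoted throughput can be related to the size of the 3D database. This is only arithmetic on the two figures in these notes, assuming a constant build rate; it does not describe how the authors actually scheduled the build.</p>

```python
def build_days(n_molecules, molecules_per_day):
    """Days needed to build n_molecules at a constant daily rate."""
    return n_molecules / molecules_per_day

# 4.5 billion 3D molecules at roughly 11 million molecules per day.
days = build_days(4.5e9, 11e6)
print(f"{days:.0f} days (~{days / 365:.1f} years)")
```

<p>At that rate, building the full 3D set takes a little over a year of sustained compute.</p>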
<h3 id="chemical-diversity-analysis">Chemical Diversity Analysis</h3>
<p>A core debate in billion-scale library generation is whether continued enumeration merely yields repetitive derivatives. Analysis of Bemis-Murcko (BM) scaffolds shows that chemical diversity in ZINC-22 continues to grow, but sub-linearly, following a power law. Specifically, the authors observe one log-unit increase in BM scaffolds for every two log-unit increase in database size:</p>
<p>$$
\log(\text{Scaffolds}_{BM}) \approx 0.5 \log(\text{Molecules}) + \text{const}
$$</p>
<p>This suggests that while diversity does not saturate, it grows proportionally to the square root of the library size ($\mathcal{O}(\sqrt{N})$). The majority of this scaffold novelty stems from compounds with the highest heavy atom counts (HAC 24-25), which contribute roughly twice as many unique core structures as the combined HAC 06-23 subset.</p>
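<p>The square-root relationship is easy to explore numerically. The sketch below assumes the exponent of 0.5 holds exactly, which is an idealization of the observed trend; the function name is illustrative.</p>

```python
def scaffold_growth(molecule_growth, exponent=0.5):
    """Factor by which scaffold count grows when the library grows by molecule_growth."""
    return molecule_growth ** exponent

# Growing the library 100x (two log units) yields 10x the scaffolds (one log unit).
for factor in (10, 100, 10_000):
    print(f"{factor}x molecules -> {scaffold_growth(factor):.2f}x scaffolds")
```

<p>Under this assumption, each additional order of magnitude in scaffolds costs two orders of magnitude in library size.</p>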
<h3 id="vendor-integration">Vendor Integration</h3>
<p>ZINC-22 is built from five source catalogs with the following approximate sizes:</p>
<ul>
<li><strong>Enamine REAL Database</strong>: 5 billion compounds</li>
<li><strong>Enamine REAL Space</strong>: 29 billion compounds</li>
<li><strong>WuXi GalaXi</strong>: 2.5 billion compounds</li>
<li><strong>Mcule Ultimate</strong>: 128 million compounds</li>
<li><strong>ZINC20 in-stock</strong>: 4 million compounds (incorporated as layer &ldquo;g&rdquo;)</li>
</ul>
<p>This focus on purchasable, make-on-demand molecules distinguishes ZINC-22 from theoretical chemical space databases. ZINC20 continues to be maintained separately for smaller catalogs and in-stock compounds.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://cartblanche22.docking.org/">CartBlanche web interface</a></td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Web GUI for searching and downloading ZINC-22</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>2D curation and loading pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>3D building pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>CartBlanche22 web application</td>
      </tr>
      <tr>
          <td>AWS Open Data / Oracle OCI</td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Cloud-hosted 3D database mirrors</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Data Availability</strong>: The compiled database is openly accessible and searchable through the <a href="https://cartblanche22.docking.org/">CartBlanche web interface</a>. Subsets can be downloaded, and programmatic access is provided via curl, wget, and Globus.</li>
<li><strong>Code &amp; Algorithms</strong>: The source code for database construction, parallel processing, and querying is open-source.
<ul>
<li>2D Pipeline: <a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></li>
<li>3D Pipeline: <a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></li>
<li>CartBlanche: <a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></li>
<li>TLDR modules: docking-org/TLDR and docking-org/tldr-modules (repositories no longer available)</li>
</ul>
</li>
<li><strong>Software Dependencies</strong>: While the orchestration code is public, the 3D structure generation relies on commercial software that requires separate licenses (CORINA, OpenEye OMEGA, ChemAxon JChem). This limits end-to-end reproducibility for researchers without access to these tools.</li>
<li><strong>Hardware Limitations</strong>: Recreating the entire 37+ billion molecule database from raw vendor catalogs requires approximately 1,700 CPU cores and petabytes of data transfer, restricting full recreation to large institutional clusters or substantial cloud compute budgets.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Tingle, B. I., Tang, K. G., Castanon, M., Gutierrez, J. J., Khurelbaatar, M., Dandarchuluun, C., Moroz, Y. S., and Irwin, J. J. (2023). ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. <em>Journal of Chemical Information and Modeling</em>, 63(4), 1166&ndash;1176. <a href="https://doi.org/10.1021/acs.jcim.2c01253">https://doi.org/10.1021/acs.jcim.2c01253</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Tingle_2023,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{1549-960X}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society (ACS)}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tingle, Benjamin I. and Tang, Khanh G. and Castanon, Mar and Gutierrez, John J. and Khurelbaatar, Munkhzul and Dandarchuluun, Chinzorig and Moroz, Yurii S. and Irwin, John J.}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span>=<span style="color:#e6db74">{Feb}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1166--1176}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MARCEL: Molecular Conformer Ensemble Learning Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/marcel/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/marcel/</guid><description>MARCEL dataset provides 722K+ conformers across 76K+ molecules for drug discovery, catalysis, and molecular representation learning research.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>MARCEL provides a benchmark for conformer ensemble learning. It demonstrates that explicitly modeling full conformer distributions improves property prediction across drug-like molecules and organometallic catalysts.</p>
<h2 id="overview">Overview</h2>
<p>The Molecular Conformer Ensemble Learning (MARCEL) dataset provides 722K+ conformations across 76K+ molecules spanning four diverse chemical domains: drug-like molecules (Drugs-75K), organophosphorus ligands (Kraken), chiral catalysts (EE), and organometallic complexes (BDE). MARCEL evaluates conformer ensemble methods across both pharmaceutical and catalysis applications.</p>
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/marcel-drugs-75k-example-conformer.webp"
         alt="Example conformer from Drugs-75K"
         title="Example conformer from Drugs-75K"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Drugs-75K (SMILES: <code>COC(=O)[C@@]1(Cc2ccc(OC)cc2)[C@H]2c3cc(C(=O)N(C)C)n(Cc4ccc(OC(F)(F)F)cc4)c3C[C@H]2CN1C(=O)c1ccccc1</code>; IUPAC: methyl (2R,3R,6R)-4-benzoyl-10-(dimethylcarbamoyl)-3-[(4-methoxyphenyl)methyl]-9-[[4-(trifluoromethoxy)phenyl]methyl]-4,9-diazatricyclo[6.3.0.0<sup>2,6</sup>]undeca-1(8),10-diene-3-carboxylate)</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-drugs-75k-example-conformer-2d.webp"
         alt="2D structure of Drugs-75K conformer"
         title="2D structure of Drugs-75K conformer"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">2D structure of Drugs-75K conformer above</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-kraken-ligand10-conf0-2d.webp"
         alt="Example conformer from Kraken in 2D"
         title="Example conformer from Kraken in 2D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Kraken (ligand 10, conformer 0) in 2D</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-kraken-ligand10-conf0-3d.webp"
         alt="Example conformer from Kraken in 3D"
         title="Example conformer from Kraken in 3D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Kraken (ligand 10, conformer 0) in 3D</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-bde-Pt_9.63-3d.webp"
         alt="Example substrate from BDE in 3D"
         title="Example substrate from BDE in 3D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example substrate from BDE in 3D (Pt_9.63)</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-bde-Pt_9.63-2d.webp"
         alt="2D structure of BDE substrate"
         title="2D structure of BDE substrate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">2D structure of BDE substrate above</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Drugs-75K</strong></td>
          <td>75,099 molecules</td>
          <td>Drug-like molecules with at least 5 rotatable bonds</td>
      </tr>
      <tr>
          <td><strong>Kraken</strong></td>
          <td>1,552 molecules</td>
          <td>Monodentate organophosphorus (III) ligands</td>
      </tr>
      <tr>
          <td><strong>EE</strong></td>
          <td>872 reactions</td>
          <td>Rhodium (Rh)-bound atropisomeric catalyst-substrate pairs derived from chiral bisphosphine ligands</td>
      </tr>
      <tr>
          <td><strong>BDE</strong></td>
          <td>5,915 reactions</td>
          <td>Organometallic catalysts ML$_1$L$_2$ with electronic binding energies</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="drugs-75k-ionization-potential">Ionization Potential (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-ionization-potential">#</a></h3>
    <p class="benchmark-description">Predict ionization potential from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.4066</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.4069</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.4126</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.4149</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.4174</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.4280</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4351</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4354</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4361</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete local frames network (single conformer)</small>
          </td>
          <td>0.4393</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.4394</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4441</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4452</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.4466</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4505</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.4595</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4788</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4987</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.6617</td>
        </tr>
      </tbody>
    </table>
  </div>
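  <p>One way to read the ionization potential table is to compare each 3D architecture with its ensemble counterpart. The sketch below copies the MAE values from the table above and computes the difference per architecture; the variable names are illustrative, and a small difference may not be statistically meaningful.</p>

```python
# MAE (eV) values copied from the ionization potential table above.
ensemble = {"GemNet": 0.4066, "DimeNet++": 0.4126, "LEFTNet": 0.4149,
            "ClofNet": 0.4280, "SchNet": 0.4452, "PaiNN": 0.4466}
single = {"GemNet": 0.4069, "DimeNet++": 0.4441, "LEFTNet": 0.4174,
          "ClofNet": 0.4393, "SchNet": 0.4394, "PaiNN": 0.4505}

# A positive delta means the ensemble model has lower error than the
# single-conformer version of the same architecture.
deltas = {name: round(single[name] - ensemble[name], 4) for name in ensemble}
improved = [name for name, d in deltas.items() if d > 0]

for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {d:+.4f}")
print(f"{len(improved)} of {len(deltas)} architectures improve with ensembles")
```

  <p>On this task, five of the six architectures have lower error with ensembles; SchNet is the exception, and the GemNet gain is only 0.0003 eV.</p>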
  <div class="benchmark-section">
    <h3 id="drugs-75k-electron-affinity">Electron Affinity (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-electron-affinity">#</a></h3>
    <p class="benchmark-description">Predict electron affinity from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.3910</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.3922</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.3944</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.3953</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.3964</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.4033</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4085</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4169</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4169</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.4207</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4232</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4233</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete local frames network (single conformer)</small>
          </td>
          <td>0.4251</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.4269</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.4417</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4495</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4648</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4747</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.5850</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="drugs-75k-electronegativity">Electronegativity (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-electronegativity">#</a></h3>
    <p class="benchmark-description">Predict electronegativity (χ) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.197</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.2027</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2069</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.2083</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.2199</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.2212</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.2243</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2243</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.226</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.2267</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.2267</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.2294</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.2324</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete local frames network (single conformer)</small>
          </td>
          <td>0.2378</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.2436</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.2441</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.2505</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.2732</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.4073</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-b5">B₅ Sterimol Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-b5">#</a></h3>
    <p class="benchmark-description">Predict B₅ sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.2225</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.2313</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.263</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2644</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2704</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.2789</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.3072</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.3128</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.3228</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.3293</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.3443</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.345</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.351</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.3567</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.476</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.485</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete local frames network (single conformer)</small>
          </td>
          <td>0.4873</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4879</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.9611</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-l">L Sterimol Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-l">#</a></h3>
    <p class="benchmark-description">Predict L sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.3386</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.3468</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.3619</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.3643</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.3754</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4003</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4174</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4303</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4322</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4344</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4363</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4471</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.4485</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.4493</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.5142</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.5452</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.5458</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete local frames network (single conformer)</small>
          </td>
          <td>0.6417</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.8389</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-burb5">Buried B₅ Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-burb5">#</a></h3>
    <p class="benchmark-description">Predict buried B₅ sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.1589</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.1693</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.1719</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.1782</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.1783</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2017</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2024</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.2066</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.2097</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.2176</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.2178</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.2295</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.2395</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.2422</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.2758</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.2813</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete local frames network (single conformer)</small>
          </td>
          <td>0.2884</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.3002</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.4929</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-burl">Buried L Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-burl">#</a></h3>
    <p class="benchmark-description">Predict buried L sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.0947</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.1185</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.12</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.1324</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.1386</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.1443</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.1486</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.15</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.1521</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.1526</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.1548</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.1635</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.1673</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.1741</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.1861</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.1924</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.1948</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete local frames network (single conformer)</small>
          </td>
          <td>0.2529</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.2781</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="ee-enantioselectivity">Enantioselectivity (EE)<a hidden class="anchor" aria-hidden="true" href="#ee-enantioselectivity">#</a></h3>
    <p class="benchmark-description">Predict enantiomeric excess for Rh-catalyzed asymmetric reactions</p>
    <p class="benchmark-meta"><strong>Subset:</strong> EE
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>11.61</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>12.03</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>13.56</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>13.96</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>14.22</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>14.64</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>17.74</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>18.03</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>18.42</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>19.8</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>20.24</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete local frames network (single conformer)</small>
          </td>
          <td>33.95</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>61.03</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>61.3</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>61.63</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>62.08</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>62.31</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>62.38</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>64.01</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="bde-bond-dissociation">Bond Dissociation Energy (BDE)<a hidden class="anchor" aria-hidden="true" href="#bde-bond-dissociation">#</a></h3>
    <p class="benchmark-description">Predict metal-ligand bond dissociation energy for organometallic catalysts</p>
    <p class="benchmark-meta"><strong>Subset:</strong> BDE
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>1.45</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>1.47</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>1.53</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>1.53</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>1.61</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>1.65</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>1.87</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>1.97</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>2.01</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>2.13</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>2.48</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>2.55</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete local frames network (single conformer)</small>
          </td>
          <td>2.61</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>2.64</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>2.66</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>2.74</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>2.83</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>3.03</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>10.08</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GEOM</strong></td>
          <td>Source</td>
          <td><a href="/notes/chemistry/datasets/geom/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Domain diversity</strong>: Beyond drug-like molecules, includes organometallics and catalysts rarely covered in existing benchmarks</li>
<li><strong>Ensemble-based</strong>: Provides full conformer ensembles with statistical weights</li>
<li><strong>DFT-quality energies</strong>: Drugs-75K features DFT-level conformers and energies (higher accuracy than GEOM-Drugs)</li>
<li><strong>Realistic scenarios</strong>: BDE subset models the practical constraint of lacking DFT-computed conformers for large catalyst systems</li>
<li><strong>Comprehensive baselines</strong>: Benchmarks 18 models across 1D (SMILES), 2D (graph), 3D (single conformer), and ensemble methods</li>
<li><strong>Property diversity</strong>: Covers ionization potential, electron affinity, electronegativity, ligand descriptors, and catalytic properties</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Regression only</strong>: All tasks evaluate regression metrics exclusively</li>
<li><strong>Chemical space coverage</strong>: The 76K molecules cover only a small fraction of the vast drug-like and catalyst chemical spaces</li>
<li><strong>Compute requirements</strong>: Working with large conformer ensembles demands significant computational resources</li>
<li><strong>Proprietary data</strong>: EE subset is proprietary (as of December 2025)</li>
<li><strong>DFT bottleneck</strong>: BDE demonstrates a practical limitation: a single DFT optimization can take 2-3 days, making conformer-level DFT infeasible for large organometallics</li>
<li><strong>Uniform sampling baseline</strong>: The initial data augmentation strategy tested for handling ensembles samples conformers uniformly rather than by Boltzmann weight. This physically unprincipled assumption likely explains why the strategy occasionally introduces noise and fails to help complex 3D architectures.</li>
<li><strong>Drugs-75K properties</strong>: The large-scale benchmark (Drugs-75K) specifically targets electronic properties (Ionization Potential, Electron Affinity, Electronegativity). As the authors explicitly highlight in Section 5.2, these properties are generally less sensitive to conformational rotations than properties governed by steric or spatial interactions. This confounds any assessment of whether explicit conformer ensembles actually benefit large-scale regression tasks.</li>
<li><strong>Unrealistic single-conformer baselines</strong>: The 3D single-conformer models are exclusively evaluated on the lowest-energy conformer. This setup is inherently flawed for real-world application, as knowing the global minimum <em>a priori</em> requires exhaustively searching and computing energies for the entire conformer space.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="data-generation-pipeline">Data Generation Pipeline</h3>
<h4 id="drugs-75k">Drugs-75K</h4>
<p><strong>Source</strong>: GEOM-Drugs subset</p>
<p><strong>Filtering</strong>:</p>
<ul>
<li>Minimum 5 rotatable bonds (focus on flexible molecules)</li>
<li>Allowed elements: H, C, N, O, F, Si, P, S, Cl</li>
</ul>
<p><strong>Conformer generation</strong>:</p>
<ul>
<li>DFT-level calculations for both conformers and energies</li>
<li>Higher accuracy than original GEOM-Drugs (semi-empirical GFN2-xTB)</li>
</ul>
<p><strong>Properties</strong>: Ionization Potential (IP), Electron Affinity (EA), Electronegativity (χ)</p>
<h4 id="kraken">Kraken</h4>
<p><strong>Source</strong>: Original Kraken dataset (1,552 monodentate organophosphorus(III) ligands)</p>
<p><strong>Properties</strong>: 4 of 78 available properties (selected for high variance across conformer ensembles)</p>
<ul>
<li>$B_5$: Sterimol B5, maximum width of substituent (steric descriptor)</li>
<li>$L$: Sterimol L, length of substituent (steric descriptor)</li>
<li>$\text{Bur}B_5$: Buried Sterimol B5, steric effects within the first coordination sphere</li>
<li>$\text{Bur}L$: Buried Sterimol L, steric effects within the first coordination sphere</li>
</ul>
<h4 id="ee-enantiomeric-excess">EE (Enantiomeric Excess)</h4>
<p><strong>Generation method</strong>: Q2MM (Quantum-guided Molecular Mechanics)</p>
<p><strong>Reactions</strong>: 872 catalyst-substrate pairs involving 253 Rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphine ligands, paired with 10 enamide substrates</p>
<p><strong>Property</strong>: Enantiomeric excess (EE) for asymmetric catalysis</p>
<p><strong>Availability</strong>: Proprietary-only (closed-source as of December 2025)</p>
<h4 id="bde-bond-dissociation-energy">BDE (Bond Dissociation Energy)</h4>
<p><strong>Molecules</strong>: 5,915 organometallic catalysts (ML₁L₂ structure)</p>
<p><strong>Initial conformers</strong>: OpenBabel with geometric optimization</p>
<p><strong>Energies</strong>: DFT calculations</p>
<p><strong>Property</strong>: Electronic binding energy (difference in minimum energies of bound-catalyst complex and unbound catalyst)</p>
<p><strong>Key constraint</strong>: DFT optimization for full conformer ensembles is computationally infeasible (2-3 days per molecule)</p>
<h3 id="benchmark-setup">Benchmark Setup</h3>
<p><strong>Task</strong>: Predict molecular properties from structure using different representation strategies (1D/2D/3D/Ensemble).
The ground-truth regression targets are calculated as the Boltzmann-averaged value of the property across the conformer ensemble:</p>
<p>$$
\langle y \rangle_{k_B} = \sum_{\mathbf{C}_i \in \mathcal{C}} p_i y_i
$$</p>
<p>Here, $p_i$ is the conformer probability (Boltzmann weight) under experimental conditions, derived from the conformer energy $e_i$:</p>
<p>$$
p_i = \frac{\exp(-e_i / k_B T)}{\sum_j \exp(-e_j / k_B T)}
$$</p>
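<p>The two formulas above can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark's own code: the temperature of 298.15 K, the energy unit (kcal/mol), and all numeric values are assumptions chosen for the example, and the function names are hypothetical.</p>

```python
import math

K_B_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_weights(energies, temperature=298.15):
    """Return p_i = exp(-e_i / k_B T) / sum_j exp(-e_j / k_B T)."""
    kt = K_B_KCAL_PER_MOL_K * temperature  # assumed temperature, not from the paper
    e_min = min(energies)  # shifting by the minimum leaves p_i unchanged but avoids overflow
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def boltzmann_average(values, energies, temperature=298.15):
    """Return the ensemble target: sum_i p_i * y_i."""
    weights = boltzmann_weights(energies, temperature)
    return sum(p * y for p, y in zip(weights, values))


energies = [0.0, 0.5, 1.5]  # hypothetical relative conformer energies (kcal/mol)
values = [7.9, 8.1, 8.6]    # hypothetical per-conformer property values y_i
weights = boltzmann_weights(energies)
average = boltzmann_average(values, energies)
```

<p>Because the weights decay exponentially with energy, the average is dominated by the low-energy conformers, which is why small energy errors can shift the regression target.</p>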
<p><strong>Data splits</strong>: Datasets are partitioned 70% train, 10% validation, and 20% test.</p>
<p><strong>Model categories</strong>:</p>
<ol>
<li><strong>1D Models</strong>: SMILES-based (Random Forest on concatenated MACCS/ECFP/RDKit fingerprints, LSTM, Transformer).</li>
<li><strong>2D Models</strong>: Graph-based (GIN, GIN+VN, ChemProp, GraphGPS).</li>
<li><strong>3D Models</strong>: Single conformer (SchNet, DimeNet++, GemNet, PaiNN, ClofNet, LEFTNet). For evaluation, single-conformer 3D models exclusively ingest the lowest-energy conformer. This baseline setting often yields strong performance but is unrealistic in practice, as identifying the global minimum requires exhaustively searching the entire conformer space.</li>
<li><strong>Ensemble Models</strong>: Full conformer ensemble processing via explicit set encoders. For each conformer embedding $\mathbf{z}_i$, three aggregation strategies are evaluated:</li>
</ol>
<p><strong>Mean Pooling:</strong>
$$
\mathbf{s}_{\text{MEAN}} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \mathbf{z}_i
$$</p>
<p><strong>DeepSets:</strong>
$$
\mathbf{s}_{\text{DS}} = g\left(\sum_{i=1}^{|\mathcal{C}|} h(\mathbf{z}_i)\right)
$$</p>
<p><strong>Self-Attention:</strong>
$$
\begin{aligned}
\mathbf{s}_{\text{ATT}} &amp;= \sum_{i=1}^{|\mathcal{C}|} \mathbf{c}_i, \quad \text{where} \quad \mathbf{c}_i = g\left( \sum_{j=1}^{|\mathcal{C}|} \alpha_{ij} h(\mathbf{z}_j) \right) \\
\alpha_{ij} &amp;= \frac{\exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_j))\right)}{\sum_{k=1}^{|\mathcal{C}|} \exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_k))\right)}
\end{aligned}
$$</p>
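<p>The three aggregation strategies above can be written out in plain Python to make the data flow concrete. This is a sketch under stated assumptions: <code>h</code> and <code>g</code> are hypothetical stand-ins (elementwise tanh and identity) for the learned networks, <code>project</code> stands in for $\mathbf{W}$ and defaults to the identity, and the embeddings are made-up numbers.</p>

```python
import math


def h(z):
    # Hypothetical per-conformer map: elementwise tanh stands in for a learned MLP.
    return [math.tanh(v) for v in z]


def g(v):
    # Hypothetical output map: identity stands in for a learned MLP.
    return list(v)


def vec_sum(vectors):
    # Componentwise sum of a list of equal-length vectors.
    return [sum(components) for components in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(embeddings):
    # s_MEAN: average of the raw conformer embeddings z_i.
    n = len(embeddings)
    return [v / n for v in vec_sum(embeddings)]


def deep_sets(embeddings):
    # s_DS: g applied to the sum of h(z_i).
    return g(vec_sum([h(z) for z in embeddings]))


def self_attention(embeddings, project=lambda v: v):
    # s_ATT: sum over i of c_i, with c_i = g(sum_j alpha_ij * h(z_j)).
    hidden = [h(z) for z in embeddings]
    keys = [project(v) for v in hidden]  # `project` plays the role of W (assumed identity)
    contexts = []
    for q in keys:
        scores = [dot(q, k) for k in keys]
        top = max(scores)  # subtract the max before exponentiating for stability
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        alphas = [e / norm for e in exps]
        mixed = vec_sum([[a * x for x in v] for a, v in zip(alphas, hidden)])
        contexts.append(g(mixed))
    return vec_sum(contexts)


conformers = [[0.2, -1.0], [0.4, 0.3], [-0.5, 0.8]]  # hypothetical embeddings z_i
s_mean = mean_pool(conformers)
s_ds = deep_sets(conformers)
s_att = self_attention(conformers)
```

<p>All three functions are invariant to the order of the conformers, which is the property that makes them suitable as set encoders for an ensemble.</p>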
<p><strong>Evaluation metric</strong>: Mean Absolute Error (MAE) for all tasks.</p>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Ensemble superiority (task-dependent)</strong>: Across benchmarks, explicitly modeling the full conformer set using DeepSets often achieved top performance. However, these improvements are not uniform:</p>
<ul>
<li><strong>Small-Scale Success</strong>: Ensemble methods show large improvements on tasks like Kraken (Ensemble PaiNN achieves 0.2225 on $B_5$ vs 0.3443 single) and EE (Ensemble GemNet achieves 11.61% vs 18.03% single).</li>
<li><strong>Large-Scale Plateau</strong>: The performance improvements did not strongly transfer to large subsets like Drugs-75K (best ensemble strategy for GemNet achieves 0.4066 eV on IP vs 0.4069 eV single). The authors conjecture that the computational burden of encoding all conformers in each ensemble alters learning dynamics and increases training difficulty.</li>
</ul>
<p><strong>Conformer Sampling for Noise</strong>: Data augmentation (randomly sampling one conformer from an ensemble during training) improves performance and robustness when underlying conformers are imprecise (e.g., the forcefield-generated conformers in the BDE subset).</p>
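<p>A minimal sketch of this augmentation step is shown below. It is illustrative only: the helper name, the conformer handles, and the weights are hypothetical. The uniform draw corresponds to the strategy described in the Limitations section; the weighted draw is included purely as a contrast and is not something the source reports testing.</p>

```python
import random


def sample_conformer(ensemble, rng, weights=None):
    """Pick one conformer from an ensemble for the current training step.

    With weights=None the draw is uniform; passing Boltzmann weights
    gives a physically weighted draw instead.
    """
    return rng.choices(ensemble, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
ensemble = ["conf_a", "conf_b", "conf_c"]  # hypothetical conformer handles
boltzmann = [0.70, 0.25, 0.05]             # hypothetical Boltzmann weights

uniform_draws = [sample_conformer(ensemble, rng) for _ in range(5)]
weighted_draws = [sample_conformer(ensemble, rng, boltzmann) for _ in range(5)]
```

<p>In a training loop, a call like this would run once per molecule per epoch, so the model sees a different geometry of the same molecule over time.</p>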
<p><strong>3D vs 2D</strong>: 3D models generally outperform 2D graph models, especially for conformationally sensitive properties, though 1D and 2D methods remain highly competitive on low-resource datasets or less rotation-sensitive properties.</p>
<p><strong>Model architecture</strong>: No single model dominates all tasks. GemNet and LEFTNet excel on large-scale Drugs-75K, while DimeNet++ shows strong performance on smaller Kraken and reaction datasets. Model selection depends on dataset size and task characteristics.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL">SXKDZ/MARCEL</a></td>
          <td>Code + Dataset</td>
          <td>Apache-2.0</td>
          <td>Benchmark suite, dataset loaders, and hyperparameter configs</td>
      </tr>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL/tree/main/datasets/Drugs">Drugs-75K</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>DFT-level conformers and energies derived from GEOM-Drugs</td>
      </tr>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL/tree/main/datasets/Kraken">Kraken</a></td>
          <td>Dataset</td>
          <td>Copyright retained by original authors</td>
          <td>Conformer ensembles and four steric descriptors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL/tree/main/datasets/BDE">BDE</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>OpenBabel-generated conformers with DFT binding energies</td>
      </tr>
      <tr>
          <td>EE</td>
          <td>Dataset</td>
          <td>Proprietary</td>
          <td>Closed-source as of 2026</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Data</strong>: The Drugs-75K, Kraken, and BDE subsets are openly available via the project&rsquo;s GitHub repository. The EE dataset remains closed-source/proprietary (as of 2026), making the EE suite of the benchmark currently irreproducible.</li>
<li><strong>Code</strong>: The benchmark suite and PyTorch-Geometric dataset loaders are open-sourced at <a href="https://github.com/SXKDZ/MARCEL">GitHub (SXKDZ/MARCEL)</a> under the Apache-2.0 license.</li>
<li><strong>Hardware</strong>: The authors trained models using Nvidia A100 (40GB) GPUs. Memory-intensive models (e.g., GemNet, LEFTNet) required Nvidia H100 (80GB) GPUs. Total computation across all benchmark experiments was approximately 6,000 GPU hours.</li>
<li><strong>Algorithms/Models</strong>: Hyperparameters for all 18 evaluated models are provided in the repository configuration files (<code>benchmarks/params</code>). All baseline models use publicly available frameworks (e.g., PyTorch Geometric, OGB, RDKit).</li>
<li><strong>Evaluation</strong>: Evaluation scripts are provided in the repository with consistent tracking of Mean Absolute Error (MAE) and proper configuration of benchmark splits.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Zhu, Y., Hwang, J., Adams, K., Liu, Z., Nan, B., Stenfors, B., Du, Y., Chauhan, J., Wiest, O., Isayev, O., Coley, C. W., Sun, Y., and Wang, W. (2024). Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks. In <em>The Twelfth International Conference on Learning Representations (ICLR 2024)</em>. <a href="https://openreview.net/forum?id=NSDszJ2uIV">https://openreview.net/forum?id=NSDszJ2uIV</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2024learning,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yanqiao Zhu and Jeehyun Hwang and Keir Adams and Zhen Liu and Bozhao Nan and Brock Stenfors and Yuanqi Du and Jatin Chauhan and Olaf Wiest and Olexandr Isayev and Connor W. Coley and Yizhou Sun and Wei Wang}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=NSDszJ2uIV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GEOM: Energy-Annotated Molecular Conformations Dataset</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/geom/</link><pubDate>Thu, 04 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/geom/</guid><description>Dataset card for GEOM, providing energy-annotated molecular conformations generated via CREST/xTB and refined with DFT for property prediction benchmarks.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>GEOM addresses the gap between 2D molecular graphs and flexible 3D properties by providing 450k+ molecules with 37M+ conformations. This extensive sampling connects conformer ensembles to experimental properties, providing the necessary infrastructure to benchmark conformer generation methods and train 3D-aware property predictors.</p>
<h2 id="overview">Overview</h2>
<p>The Geometric Ensemble Of Molecules (GEOM) dataset provides energy-annotated molecular conformations generated through systematic computational methods. The dataset includes molecules from drug discovery campaigns (AICures), quantum chemistry benchmarks (QM9), and molecular property prediction benchmarks (MoleculeNet), with conformations sampled using CREST/GFN2-xTB and a subset refined with high-quality DFT calculations.</p>
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/GEOM-sample-_4-pyrimidin-2-yloxyphenyl_acetamide.webp"
         alt="Example SARS-CoV-2 3CL protease active molecule"
         title="Example SARS-CoV-2 3CL protease active molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example SARS-CoV-2 3CL protease active molecule: CC(=O)Nc1ccc(Oc2ncccn2)cc1 (N-(4-pyrimidin-2-yloxyphenyl)acetamide)</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Drug-like (AICures)</strong></td>
          <td>304,466 molecules</td>
          <td>Drug-like molecules from AICures COVID-19 challenge (avg 44 atoms)</td>
      </tr>
      <tr>
          <td><strong>QM9</strong></td>
          <td>133,258 molecules</td>
          <td>Small molecules from QM9 (up to 9 heavy atoms)</td>
      </tr>
      <tr>
          <td><strong>MoleculeNet</strong></td>
          <td>16,865 molecules</td>
          <td>Molecules from MoleculeNet benchmarks for physical chemistry, biophysics, and physiology (includes BACE)</td>
      </tr>
      <tr>
          <td><strong>BACE (High-quality DFT)</strong></td>
          <td>1,511 molecules</td>
          <td>BACE subset of MoleculeNet with high-quality DFT energies (r2scan-3c) and experimental inhibition data</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="gibbs-free-energy-prediction">Gibbs Free Energy Prediction<a hidden class="anchor" aria-hidden="true" href="#gibbs-free-energy-prediction">#</a></h3>
    <p class="benchmark-description">Predict ensemble Gibbs free energy (G) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.203</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.225</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.274</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.289</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.406</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="average-energy-prediction">Average Energy Prediction<a hidden class="anchor" aria-hidden="true" href="#average-energy-prediction">#</a></h3>
    <p class="benchmark-description">Predict ensemble average energy (E) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.11</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.113</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.119</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.131</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.166</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="conformer-count-prediction">Conformer Count Prediction<a hidden class="anchor" aria-hidden="true" href="#conformer-count-prediction">#</a></h3>
    <p class="benchmark-description">Predict ln(number of unique conformers) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.363</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.38</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.455</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.484</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.763</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>QM9</strong></td>
          <td>134k small molecules with up to 9 heavy atoms and DFT properties</td>
      </tr>
      <tr>
          <td><strong>PCQM4Mv2</strong></td>
          <td>Millions of computationally generated molecules for HOMO-LUMO gap prediction</td>
      </tr>
      <tr>
          <td><strong>PubChemQC</strong></td>
          <td>DFT structures and energy properties for millions of PubChem molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Scale</strong>: 37M+ conformations across 450k+ molecules, providing massive coverage of drug-like and small molecule chemical space.</li>
<li><strong>Energy Annotations</strong>: All conformations include semi-empirical energies (GFN2-xTB); the BACE subset includes high-quality DFT energies.</li>
<li><strong>Quality Tiers</strong>: Three levels of computational rigor allow researchers to trade off dataset size for simulation accuracy.</li>
<li><strong>Benchmark Ready</strong>: Includes validated splits and architectural baselines (e.g., ChemProp, SchNet) for property prediction tasks.</li>
<li><strong>Task Diversity</strong>: Combines molecules sourced from drug discovery (AICures), quantum chemistry (QM9), and biophysiology domains (MoleculeNet).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Computational Constraints</strong>: The highest-accuracy DFT subset (BACE) is limited to 1,511 molecules due to the extreme computational cost of exact free energy sampling and Hessian estimation.</li>
<li><strong>Semi-Empirical Accuracy Gap</strong>: The $p^{\text{CREST}}$ statistical weights rely on GFN2-xTB energies, which exhibit a $\sim$2 kcal/mol MAE against true DFT. At room temperature ($k_BT \approx 0.59$ kcal/mol), this error heavily skews the Boltzmann distribution, meaning standard subset weights are imprecise.</li>
<li><strong>Solvation Assumptions</strong>: Most subsets rely on vacuum calculations. Only the BACE subset uses an implicit solvent (ALPB/C-PCM for water).</li>
<li><strong>Coverage Lapses</strong>: Extremely flexible molecules (e.g., within the SIDER dataset) frequently failed the conformer generation pipeline due to runaway topologies.</li>
</ul>
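<p>As a rough worked illustration of the accuracy gap noted above, assume the full 2 kcal/mol error falls on the relative energy of just two conformers and take $k_BT \approx 0.59$ kcal/mol. Their Boltzmann population ratio then shifts by a factor of:</p>
<p>$$
\exp\left(\frac{2}{0.59}\right) \approx 30
$$</p>
<p>Under these simplifying assumptions, an error of that size can move a conformer from roughly equal population to a few percent, which is why the semi-empirical weights should be treated as approximate.</p>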
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="data-generation-pipeline">Data Generation Pipeline</h3>
<p><strong>Initial conformer sampling</strong> (RDKit):</p>
<ul>
<li><code>EmbedMultipleConfs</code> with <code>numConfs=50</code>, <code>pruneRmsThresh=0.01</code> Å</li>
<li>MMFF force field optimization</li>
<li>GFN2-xTB optimization of seed conformer</li>
</ul>
<p><strong>Conformational exploration</strong> (CREST):</p>
<ul>
<li>Metadynamics in NVT ensemble driven by a pushing bias potential:
$$ V_{\text{bias}} = \sum_i k_i \exp(-\alpha_i \Delta_i^2) $$
where $\Delta_i$ is the root-mean-square displacement (RMSD) against the $i$-th reference structure.</li>
<li>12 independent MTD runs per molecule with different settings for $k_i$ and $\alpha_i$.</li>
<li>6.0 kcal/mol safety window for conformer retention.</li>
<li>Solvent: ALPB for water (BACE); vacuum for others.</li>
</ul>
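<p>The bias potential above is straightforward to evaluate once the RMSDs to the stored reference structures are known. The sketch below is a toy evaluation only: the function name is hypothetical, and the $k_i$, $\alpha_i$, and RMSD values are illustrative numbers, not CREST's actual settings or units.</p>

```python
import math


def bias_potential(rmsds, strengths, widths):
    """Evaluate V_bias = sum_i k_i * exp(-alpha_i * Delta_i**2)."""
    return sum(
        k * math.exp(-alpha * delta ** 2)
        for delta, k, alpha in zip(rmsds, strengths, widths)
    )


# Hypothetical inputs: RMSDs to two reference structures, with made-up k_i and alpha_i.
near = bias_potential([0.1, 0.2], [1.0, 1.0], [0.5, 0.5])
far = bias_potential([2.0, 3.0], [1.0, 1.0], [0.5, 0.5])
```

<p>The bias is largest close to previously visited structures and decays as the RMSD grows, which is what pushes the simulation toward unexplored conformations.</p>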
<p><strong>Energy calculation &amp; Weighting</strong>:</p>
<ul>
<li>
<p><strong>Standard (GFN2-xTB)</strong>: Semi-empirical tight-binding DFT ($\approx$ 2 kcal/mol MAE vs DFT). Conformers are assigned a statistical probability based on energy $E_i$ and rotamer degeneracy $d_i$:
$$ p^{\text{CREST}}_i = \frac{d_i \exp(-E_i / k_B T)}{\sum_j d_j \exp(-E_j / k_B T)} $$</p>
</li>
<li>
<p><strong>High-Quality DFT (CENSO)</strong>: Refines structures using the <code>r2scan-3c</code> functional, computing exact conformation-dependent free energies ($G_i$) that remove the need for explicit rotamer degeneracy approximations:</p>
<p>$$
\begin{aligned}
p^{\text{CENSO}}_i &amp;= \frac{\exp(-G_i / k_B T)}{\sum_j \exp(-G_j / k_B T)} \\
G_i &amp;= E_{\text{gas}}^{(i)} + \delta G_{\text{solv}}^{(i)}(T) + G_{\text{trv}}^{(i)}(T)
\end{aligned}
$$</p>
</li>
</ul>
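<p>To see how the rotamer degeneracy $d_i$ enters the CREST weights, here is a minimal sketch. The function name, the energies, and the degeneracies are hypothetical, and the value of $k_BT$ assumes a temperature near 298 K.</p>

```python
import math

KT_KCAL_PER_MOL = 0.593  # k_B T near 298 K in kcal/mol (assumed temperature)


def crest_weights(energies, degeneracies, kt=KT_KCAL_PER_MOL):
    """Return p_i = d_i * exp(-E_i / kT) / sum_j d_j * exp(-E_j / kT)."""
    e_min = min(energies)  # shifting by the minimum does not change the weights
    terms = [d * math.exp(-(e - e_min) / kt) for e, d in zip(energies, degeneracies)]
    total = sum(terms)
    return [t / total for t in terms]


energies = [0.0, 0.4]  # hypothetical conformer energies (kcal/mol)
degeneracies = [1, 3]  # hypothetical rotamer degeneracies d_i
p = crest_weights(energies, degeneracies)
```

<p>In this toy case the higher-energy conformer ends up with the larger weight because its degeneracy outweighs the 0.4 kcal/mol penalty, showing that $d_i$ can change the ranking of conformers and not just rescale it.</p>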
<h3 id="quality-levels">Quality Levels</h3>
<table>
  <thead>
      <tr>
          <th>Level</th>
          <th>Method</th>
          <th>Subset</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Standard</strong></td>
          <td>CREST/GFN2-xTB</td>
          <td>All subsets</td>
          <td>~2 kcal/mol MAE vs DFT</td>
      </tr>
      <tr>
          <td><strong>DFT Single-Point</strong></td>
          <td>r2scan-3c/mTZVPP on CREST geometries</td>
          <td>BACE (1,511 molecules)</td>
          <td>Sub-kcal/mol</td>
      </tr>
      <tr>
          <td><strong>DFT Optimized</strong></td>
          <td>CENSO full optimization + free energies</td>
          <td>BACE (534 molecules)</td>
          <td>~0.3 kcal/mol vs CCSD(T)</td>
      </tr>
  </tbody>
</table>
<h3 id="benchmark-setup">Benchmark Setup</h3>
<p><strong>Task</strong>: Predict ensemble summary statistics directly from the 2D molecular structure. The target properties include:</p>
<ul>
<li><strong>Conformational Free Energy ($G$)</strong>: $G = -TS$, where $S = -R \sum_i p_i \log p_i$.</li>
<li><strong>Average Energy ($\langle E \rangle$)</strong>: $\langle E \rangle = \sum_i p_i E_i$.</li>
<li><strong>Unique Conformers</strong>: Natural log of the conformer count retained within the energy window.</li>
</ul>
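<p>The three targets above can be computed from a list of conformer probabilities and energies as sketched below. This is an illustration under assumptions: the function name and dictionary keys are hypothetical, the temperature of 298.15 K and the kcal/mol units are assumed, and the input list is taken to contain exactly the conformers retained within the energy window.</p>

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol*K)


def ensemble_targets(probabilities, energies, temperature=298.15):
    """Compute G = -T*S, the average energy, and ln(conformer count)."""
    # S = -R * sum_i p_i * log(p_i); zero-probability terms contribute nothing.
    entropy = -R_KCAL_PER_MOL_K * sum(
        p * math.log(p) for p in probabilities if p > 0
    )
    return {
        "G": -temperature * entropy,
        "avg_E": sum(p * e for p, e in zip(probabilities, energies)),
        "ln_n_conformers": math.log(len(probabilities)),
    }


# Hypothetical ensemble: three conformers with made-up weights and energies.
targets = ensemble_targets([0.6, 0.3, 0.1], [0.0, 0.4, 1.1])
```

<p>Since $S \geq 0$, the conformational free energy is never positive, and it becomes more negative as the ensemble spreads over more conformers.</p>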
<p><strong>Data</strong>: 100,000 species randomly sampled from AICures subset, split 60/20/20 (train/validation/test).</p>
<p><strong>Hyperparameters</strong>: Optimized using Hyperopt package for each model/task combination.</p>
<p><strong>Models</strong>:</p>
<ul>
<li><strong>SchNetFeatures</strong>: 3D SchNet architecture + graph features, trained on highest-probability conformer</li>
<li><strong>ChemProp</strong>: Message Passing Neural Network on molecular graphs</li>
<li><strong>FFNN</strong>: Feed-forward network on Morgan fingerprints</li>
<li><strong>KRR</strong>: Kernel Ridge Regression on Morgan fingerprints</li>
<li><strong>Random Forest</strong>: Random Forest on Morgan fingerprints</li>
</ul>
<h3 id="hardware--computational-cost">Hardware &amp; Computational Cost</h3>
<h4 id="crestgfn2-xtb-generation">CREST/GFN2-xTB Generation</h4>
<p><strong>Total compute</strong>: ~15.7 million core hours</p>
<p><strong>AICures subset</strong>:</p>
<ul>
<li>13M core hours on Knights Landing (32-core nodes)</li>
<li>1.2M core hours on Cascade Lake/Sky Lake (13-core nodes)</li>
<li>Average wall time: 2.8 hours/molecule (KNL) or 0.63 hours/molecule (Sky Lake)</li>
</ul>
<p><strong>MoleculeNet subset</strong>: 1.5M core hours</p>
<h4 id="dft-calculations-bace-only">DFT Calculations (BACE only)</h4>
<p><strong>Software</strong>: CENSO 1.1.2 + ORCA 5.0.1 (r2scan-3c composite method with the mTZVPP basis set)</p>
<p><strong>Solvent</strong>: C-PCM implicit solvation (water)</p>
<p><strong>Hardware</strong>: ~54 cores per job</p>
<p><strong>Compute cost</strong>:</p>
<ul>
<li>781,000 CPU hours for CENSO optimizations</li>
<li>1.1M CPU hours for single-point energy calculations</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Data Availability</strong>: All generated conformations, energies, and thermodynamic properties are publicly hosted on <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JNGTDF">Harvard Dataverse</a>. The data is provided in language-agnostic MessagePack format and Python-specific RDKit <code>.pkl</code> formats.</li>
<li><strong>Code &amp; Analysis</strong>: The primary GitHub repository (<a href="https://github.com/learningmatter-mit/geom">learningmatter-mit/geom</a>) provides tutorials for data extraction, RDKit processing, and conformational visualization.</li>
<li><strong>Model Training &amp; Baselines</strong>: The machine learning benchmarks (SchNet, ChemProp) and corresponding training scripts used to evaluate the dataset can be reproduced using the authors&rsquo; <a href="https://github.com/learningmatter-mit/NeuralForceField">NeuralForceField repository</a>.</li>
<li><strong>Hardware &amp; Compute</strong>: Compute requirements were substantial (15.7M core hours for CREST sampling alone), relying heavily on Knights Landing (KNL) and Cascade Lake architectures. See the <em>Hardware &amp; Computational Cost</em> section above for full details.</li>
<li><strong>Software Versions</strong>: Precise reproduction of conformational properties requires specific versions to mitigate numerical variances: CREST v2.9, xTB v6.2.3/v6.4.1, CENSO v1.1.2, ORCA v5.0.1/v5.0.2, and RDKit v2020.09.1.</li>
<li><strong>Open-Access Paper</strong>: The full methodology is accessible via the <a href="https://arxiv.org/abs/2006.05531">arXiv preprint</a>.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Axelrod, S. and Gómez-Bombarelli, R. (2022). GEOM, energy-annotated molecular conformations for property prediction and molecular generation. <em>Scientific Data</em>, 9(1), 185. <a href="https://doi.org/10.1038/s41597-022-01288-4">https://doi.org/10.1038/s41597-022-01288-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Axelrod_2022,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GEOM, energy-annotated molecular conformations for property prediction and molecular generation}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{2052-4463}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1038/s41597-022-01288-4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1038/s41597-022-01288-4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Data}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Science and Business Media LLC}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Axelrod, Simon and Gómez-Bombarelli, Rafael}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span>=<span style="color:#e6db74">{apr}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{185}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDB-11: Chemical Universe Database (26.4M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-11/</link><pubDate>Fri, 29 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-11/</guid><description>GDB-11 systematically enumerates 26.4M small organic molecules (up to 11 atoms of C, N, O, F) for virtual screening and drug discovery.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/gdb_11_sample.webp"
         alt="GDB-11 molecule"
         title="GDB-11 molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">GDB-11 molecule (SMILES: <code>FC1C2OC1c3c(F)coc23</code>)</figcaption>
    
</figure>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-13</strong></td>
          <td>Successor</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">Notes</a></td>
      </tr>
      <tr>
          <td><strong>GDB-17</strong></td>
          <td>Successor</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="key-contribution">Key Contribution</h2>
<p>The generation and analysis of the Generated Database (GDB), an exhaustive collection of all possible small molecules that meet specific criteria for stability and synthetic feasibility.</p>
<h2 id="overview">Overview</h2>
<p>GDB-11 represents the first systematic enumeration of the small molecule chemical universe up to 11 atoms of C, N, O, and F. The database contains 26.4 million unique molecules corresponding to 110.9 million stereoisomers. It was created to support virtual screening and drug discovery by providing a comprehensive collection of diverse, drug-like small molecules that obey standard chemical stability rules.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Systematic Enumeration</strong>: Exhaustive coverage of mathematically and chemically possible structures up to 11 atoms.</li>
<li><strong>Drug-Likeness</strong>: 100% of compounds follow Lipinski&rsquo;s &ldquo;Rule of 5&rdquo; for bioavailability, and 50% (13.2 million) follow Congreve&rsquo;s more restrictive &ldquo;Rule of 3&rdquo; for lead-likeness.</li>
<li><strong>Structural Novelty</strong>: Features 538 newly identified ring systems that were previously unknown in existing chemical databases (like the CAS Registry or Beilstein).</li>
<li><strong>High Chirality</strong>: Over 70% of GDB molecules are chiral, providing rich 3D structural diversity, particularly in fused carbocycles and heterocycles.</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Size Restriction</strong>: Strictly limited to small molecules with a maximum of 11 heavy atoms.</li>
<li><strong>Element Restriction</strong>: Contains only C, N, O, and F. Biologically and pharmaceutically important elements such as phosphorus (P), sulfur (S), and silicon (Si) are excluded to limit combinatorial explosion.</li>
<li><strong>Excluded Topologies</strong>: Excludes highly strained molecules (e.g., specific bridged systems), allenes, and bridgehead double bonds.</li>
<li><strong>Unstable Functional Groups</strong>: Excludes chemical classes deemed unstable or highly reactive (e.g., gem-diols, hemiacetals, aminals, enols, orthoacids).</li>
<li><strong>Computational Nature</strong>: Consists entirely of computer-generated, theoretical structures without experimental synthesis or biological validation.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="construction">Construction</h3>
<h4 id="graph-selection">Graph Selection</h4>
<p>The program GENG was used to generate an initial set of 843,335 connected graphs with up to 11 nodes and at most 4 edges per node (matching the maximum valence of carbon). These were filtered to 15,726 stable saturated hydrocarbon graphs using:</p>
<ul>
<li><strong>Topological Criteria</strong>: Removed graphs with a node in multiple small (3- or 4-membered) rings, tetravalent bridgeheads in small rings, and nonplanar graphs (e.g., Claus-benzol).</li>
<li><strong>Steric Criteria</strong>: Graphs containing highly distorted centers were removed using an adapted MM2 force field energy-minimization with a cutoff of +17 kcal/mol.</li>
</ul>
<h4 id="structure-generation">Structure Generation</h4>
<p>Graph symmetry algorithms identified valid locations for unsaturations and heteroatoms (N, O, F), with valence constraints enforced throughout. Combinatorial distribution of elements and multiple bonds (excluding bridgehead double bonds, triple bonds in rings smaller than nine, and allenes) yielded a theoretical &ldquo;dark matter universe&rdquo; (DMU) of over 1.7 billion unique structures.</p>
<h4 id="filters">Filters</h4>
<p>Many of the 1.7 billion structural candidates contained unstable or reactive environments. Filtering these out reduced the set to 27.7 million potentially stable molecules. Rejected unstable/reactive features included:</p>
<ul>
<li><strong>High-Energy Bonds</strong>: Gem-diols, non-stabilized aminals, hemiaminals, enols, orthoesters, unstable imines, acyl fluorides, and geminal di-heteroatoms.</li>
<li><strong>Heteroatom-Heteroatom Bonds</strong>: Peroxides (O-O), N-O, N-N, N-F, and triazanes, unless stabilized (e.g., hydrazones, oximes).</li>
<li><strong>Strained Topologies</strong>: 3/4-membered rings containing N-N or N-O bonds, and bridgehead heteroatom bonds causing instabilities (like Bredt&rsquo;s rule violations).</li>
</ul>
<p>Removing redundant tautomeric forms reduced the set to the final 26.4 million structures.</p>
<h4 id="stereoisomer-generation">Stereoisomer Generation</h4>
<p>Stereoisomers were enumerated by identifying all asymmetric centers and functional double bonds, with Z/E isomerism blocked in rings smaller than 10 nodes. From the 26.4 million unique constitutional isomers, 110.9 million stereoisomers were generated (averaging 4.2 stereoisomers per molecule).</p>
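<p>As a quick sanity check on the construction funnel, the stage counts quoted in this section can be tabulated and the ratio between consecutive stages computed. This is only an arithmetic sketch: the stage labels are informal names chosen here, and the 1.7 billion figure is a lower bound (&ldquo;over 1.7 billion&rdquo;), so ratios involving it are approximate.</p>

```python
# Stage counts quoted in the Construction notes above (GDB-11).
# The labels are informal names chosen for this sketch.
STAGES = [
    ("GENG graphs", 843_335),
    ("stable hydrocarbon graphs", 15_726),
    ("candidate structures (lower bound)", 1_700_000_000),
    ("after stability filters", 27_700_000),
    ("after tautomer removal", 26_400_000),
    ("stereoisomers", 110_900_000),
]


def stage_ratios(stages):
    """Return (from_name, to_name, ratio) for each consecutive pair of stages."""
    ratios = []
    for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
        ratios.append((prev_name, name, count / prev_count))
    return ratios


if __name__ == "__main__":
    for prev_name, name, ratio in stage_ratios(STAGES):
        print(f"{prev_name} -> {name}: x{ratio:.4g}")
```

<p>The last ratio reproduces the quoted average of about 4.2 stereoisomers per molecule, and the first shows that under 2% of the initial graphs survive the topological and steric criteria.</p>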
<h3 id="analysis-methodology">Analysis Methodology</h3>
<h4 id="kohonen-maps-self-organizing-maps">Kohonen Maps (Self-Organizing Maps)</h4>
<p>The chemical space visualization and compound class analysis used a Kohonen Map (Self-Organizing Map/SOM):</p>
<ul>
<li><strong>Input Features</strong>: 48-dimensional autocorrelation vectors encoding topological relationships and atomic properties. The autocorrelation vector $\text{AC}_d$ for a topological distance $d$ is defined as:</li>
</ul>
<p>$$
\text{AC}_d = \sum_{i=1}^{N} \sum_{j=1}^{N} \delta (p_i, p_j)_d
$$</p>
<p><em>(where $N$ is the number of atoms, $p$ is the atomic property, and $\delta (p_i, p_j)_d = p_i p_j$ if the topological distance between atoms $i$ and $j$ is $d$, and 0 otherwise).</em></p>
<ul>
<li><strong>Training Data</strong>: Random subset of 1,000,000 GDB molecules</li>
<li><strong>Architecture</strong>: 200x200 neuron grid</li>
<li><strong>Training Protocol</strong>: 250,000 epochs with 100 molecules presented per epoch</li>
<li><strong>Algorithm</strong>: Standard Kohonen algorithm</li>
<li><strong>Key Insight</strong>: Reveals that &ldquo;lead-like&rdquo; compounds cluster in chiral regions of fused carbocycles/heterocycles</li>
</ul>
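<p>The autocorrelation definition above can be illustrated with a minimal pure-Python sketch. It assumes a molecule is given as an adjacency list of heavy atoms and a single atomic property per atom; the three-atom chain and its property values are made up for illustration, and the function names are not from the paper. The sum runs over all ordered pairs $(i, j)$, exactly as the double sum is written, so each unordered pair contributes twice and $d = 0$ collects the $p_i^2$ terms.</p>

```python
from collections import deque


def topological_distances(adjacency, source):
    """Breadth-first search: bond-count distance from `source` to every reachable atom."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in distances:
                distances[neighbor] = distances[atom] + 1
                queue.append(neighbor)
    return distances


def autocorrelation(adjacency, properties, max_distance):
    """AC_d for d = 0..max_distance, summing p_i * p_j over all ordered pairs (i, j)."""
    ac = [0.0] * (max_distance + 1)
    for i in adjacency:
        for j, d in topological_distances(adjacency, i).items():
            if max_distance >= d:
                ac[d] += properties[i] * properties[j]
    return ac


# Hypothetical three-atom chain 0-1-2 with made-up property values.
chain = {0: [1], 1: [0, 2], 2: [1]}
values = {0: 1.0, 1: 2.0, 2: 3.0}
print(autocorrelation(chain, values, max_distance=2))  # [14.0, 16.0, 6.0]
```

<p>A full descriptor would presumably concatenate such vectors over several atomic properties and distances to reach the 48 dimensions mentioned above; that breakdown is not given here, so the sketch stops at a single property.</p>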
<h4 id="comparison">Comparison</h4>
<p>The full database was compared to a Reference Database (RDB) of 63,857 known compounds (up to 11 atoms) extracted from PubChem, ChemACX, ChemSCX, NCI Open Database, and the Merck Index. Of the 63,857 RDB compounds, 37,393 (58.6%) were found in GDB. The remaining 26,464 compounds were absent due to structural rule violations, exclusion of elements beyond C/N/O/F, and filtered unstable chemistries.</p>
<h4 id="new-rings">New Rings</h4>
<p>All 309 entirely acyclic graphs in GDB mapped to published structures. External databases contained only 670 of the 1,208 purely cyclic theoretical ring systems (55.5%), leaving 538 ring systems that are new. Of these 538, 367 (68.2%) have inherently chiral topologies.</p>
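<p>The counts and percentages in the comparison and ring-system paragraphs above are mutually consistent, which a few lines of arithmetic can confirm. The variable names below are chosen for this sketch; the numbers are those quoted in the text.</p>

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Counts quoted in the text above.
rdb_total, rdb_in_gdb = 63_857, 37_393
ring_systems, rings_known = 1_208, 670
new_rings, new_chiral = 538, 367

print(rdb_total - rdb_in_gdb)              # 26464 RDB compounds absent from GDB
print(percent(rdb_in_gdb, rdb_total))      # 58.6
print(ring_systems - rings_known)          # 538 ring systems absent from external databases
print(percent(rings_known, ring_systems))  # 55.5
print(percent(new_chiral, new_rings))      # 68.2
```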
<h4 id="stereochemistry">Stereochemistry</h4>
<p>Small molecules under 5 heavy atoms skew strongly towards simple achiral structures. As the atom count increases, the share of chiral molecules rises: over two-thirds of structures containing exactly 10 or 11 atoms are chiral. Approximately 86% of the molecules in GDB contain exactly 11 atoms (22.8 million of 26.4 million).</p>
<h4 id="physicochemical-properties">Physicochemical Properties</h4>
<p>Because all GDB molecules contain at most 11 heavy atoms, 100% of them satisfy Lipinski&rsquo;s &ldquo;Rule of 5&rdquo; for bioavailability. Under the more restrictive Congreve &ldquo;Rule of 3&rdquo; for lead-likeness (molecular weight MW &lt; 300, rotatable bond count RBC &lt; 3, logP &lt; 3, hydrogen-bond donor count HBDC &lt; 3, hydrogen-bond acceptor count HBAC &lt; 3, TPSA &lt; 60 $\text{\AA}^2$), 50% (13.2 million structures) qualify. Virtual screening using the Molinspiration miscreen toolkit (Bayesian statistics-based) identified 42,804 virtual hits across three drug target classes: 3,043 kinase inhibitor candidates, 24,489 GPCR ligand candidates, and 19,696 ion-channel modulator candidates. Of these virtual hits, 59.8% occupied Kohonen map neurons not populated by any known RDB compound.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>While the generated GDB-11 database is openly available, reproducing the exact generation from graph to stereoisomer relies on in-house and proprietary software which is not publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">GDB Downloads (University of Berne)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Official host for GDB databases</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5172017">Zenodo Record (10.5281/zenodo.5172017)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Version-agnostic Zenodo archive of GDB-11</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Paper Accessibility</strong>: Closed-access (Published in JCIM 2007; no preprint available).</li>
<li><strong>Data Availability</strong>: The complete dataset is hosted on an open-access Zenodo repository (version-agnostic DOI): <a href="https://doi.org/10.5281/zenodo.5172017">10.5281/zenodo.5172017</a>.</li>
<li><strong>Software Dependencies (Closed/Commercial)</strong>:
<ul>
<li>Generation code is a closed-source Java (J2SE v5.0) application.</li>
<li>Relies on proprietary ChemAxon libraries (JChem v3.1, Marvin v4.0 API).</li>
<li>Virtual screening evaluation utilized the commercial Molinspiration <code>miscreen</code> toolkit.</li>
</ul>
</li>
<li><strong>Hardware Profile</strong>:
<ul>
<li><strong>CPUs</strong>: Two AMD Opteron 252 2.6 GHz processors</li>
<li><strong>Parallelization</strong>: 80-fold parallelization</li>
<li><strong>Compute Time</strong>: Approximately 20 hours for full generation</li>
</ul>
</li>
</ul>
<h3 id="force-field">Force Field</h3>
<p>A custom implementation of the MM2 force field was used for steric energy minimization during structure validation. It used the parameter set from Allinger, specifically adding a quartic term for bond stretching to prevent bond lengthening far from equilibrium:</p>
<p>$$
\begin{aligned}
E_{\text{Steric}} &amp;= \sum_{\text{bonds}} k_b(l_i - l_{0,i})^2 \left[1 + k'_b(l_i - l_{0,i}) + k''_b(l_i - l_{0,i})^2\right] \\
&amp;\quad + \sum_{\text{angles}} k_\theta(\theta_i - \theta_{0,i})^2 \left[1 + k'_\theta(\theta_i - \theta_{0,i})^4\right] \\
&amp;\quad + \sum_{\text{angles}} k_{b,\theta}(\theta_i - \theta_{0,i})^2 \left[(l_a - l_{0,a}) + (l_b - l_{0,b})\right] \\
&amp;\quad + \sum_{\text{torsions}} \left[ \frac{V_1}{2} (1 + \cos \omega) + \frac{V_2}{2} (1 - \cos 2\omega) + \frac{V_3}{2} (1 + \cos 3\omega) \right] \\
&amp;\quad + \sum_{i=1}^N \sum_{j=i+1}^N \epsilon_{ij} \left[ A \exp \left( \frac{-B r_{ij}}{\sum r^{\ast}_{ij}} \right) - C \left( \frac{\sum r^{\ast}_{ij}}{r_{ij}} \right)^6 \right]
\end{aligned}
$$</p>
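<p>To see why a quartic stretch term helps, the first sum of the energy expression can be evaluated for a single bond. The sketch below is illustrative only: the equilibrium length, force constant, and cubic and quartic coefficients are made-up values (with the cubic coefficient assumed negative, as is typical for anharmonic stretch corrections), not the parameters used for GDB-11.</p>

```python
def bond_stretch_energy(length, l0, k_b, k_cubic, k_quartic):
    """Stretch term k_b*dl^2 * [1 + k'_b*dl + k''_b*dl^2] for one bond."""
    dl = length - l0
    return k_b * dl**2 * (1.0 + k_cubic * dl + k_quartic * dl**2)


# Illustrative values only (not the parameters used for GDB-11).
L0, K_B, K_CUBIC, K_QUARTIC = 1.5, 300.0, -2.0, 2.0

for stretch in (0.0, 0.2, 1.0):
    cubic_only = bond_stretch_energy(L0 + stretch, L0, K_B, K_CUBIC, 0.0)
    with_quartic = bond_stretch_energy(L0 + stretch, L0, K_B, K_CUBIC, K_QUARTIC)
    print(f"dl={stretch:.1f}  cubic only: {cubic_only:8.2f}  with quartic: {with_quartic:8.2f}")
```

<p>With these assumed values, the cubic-only energy turns negative at large extension, which would reward unphysical bond lengthening during minimization, while the quartic term keeps the energy positive far from equilibrium.</p>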
<h2 id="paper-information">Paper Information</h2>
<p>Fink, T. and Reymond, J.-L. (2007). Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F: Assembly of 26.4 Million Structures (110.9 Million Stereoisomers) and Analysis for New Ring Systems, Stereochemistry, Physicochemical Properties, Compound Classes, and Drug Discovery. <em>Journal of Chemical Information and Modeling</em>, 47(2), 342&ndash;353. <a href="https://doi.org/10.1021/ci600423u">https://doi.org/10.1021/ci600423u</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fink2007virtual,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Virtual exploration of the chemical universe up to 11 atoms of C, N, O, and F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fink, Tobias and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{342--353}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDB-17: Chemical Universe Database (166.4B Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-17/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-17/</guid><description>Dataset card for GDB-17, containing 166.4 billion small organic molecules representing the largest enumerated chemical space to date.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>The systematic enumeration of 166.4 billion organic molecules (GDB-17) up to 17 atoms, extending the known chemical universe into the drug-relevant size range. This reveals a highly dense novel chemical space that is measurably richer in complex stereochemical and three-dimensional structures compared to historically biased chemical databases.</p>
<h2 id="overview">Overview</h2>
<p>GDB-17 represents the largest enumerated database of drug-like small molecules, reaching the size range typical of lead compounds and approved drugs ($100 &lt; \text{MW} &lt; 350$ Da). It contains 166.4 billion structures consisting of up to 17 atoms of C, N, O, S, and halogens (F, Cl, Br, I). Because the number of possible structures grows exponentially with heavy atom count (HAC), the largest molecules dominate and the MW distribution of the database peaks sharply in the $240$&ndash;$250$ Da range. Compared to known molecules in databases like PubChem, GDB-17 molecules are significantly richer in non-aromatic heterocycles, quaternary centers, and stereoisomers, avoiding &ldquo;flatland&rdquo; by populating the third dimension of shape space.</p>
<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/gdb_17_sample.webp"
         alt="Example GDB-17 molecule"
         title="Example GDB-17 molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example GDB-17 molecule (SMILES: <code>C1CC2C3CCCC3C3(C4CCC3CC4)C2C1</code>) demonstrating the complex polycyclic structures and 3D diversity characteristic of the database</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-17 (Full)</strong></td>
          <td>166.4B</td>
          <td>Complete enumeration of the database</td>
      </tr>
      <tr>
          <td><strong>GDBLL-17</strong></td>
          <td>29B</td>
          <td>Lead-like subset ($1 &lt; \text{clogP} &lt; 3$ and $100 &lt; \text{MW} &lt; 350$ Da)</td>
      </tr>
      <tr>
          <td><strong>GDBLLnoSR-17</strong></td>
          <td>22B</td>
          <td>Lead-like subset excluding compounds with small rings (3- or 4-membered)</td>
      </tr>
      <tr>
          <td><strong>Random Sample</strong></td>
          <td>50M</td>
          <td>Random 50M subset available for download, including pre-filtered lead-like and no-small-ring fractions</td>
      </tr>
  </tbody>
</table>
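<p>The lead-like criteria listed for GDBLL-17 in the table above amount to a simple range filter. The sketch below assumes clogP and MW have already been computed by some cheminformatics toolkit; the molecule names and descriptor values are hypothetical.</p>

```python
def is_lead_like(clogp, mw):
    """True if clogP lies strictly between 1 and 3 and MW strictly between 100 and 350 Da."""
    return 3 > clogp > 1 and 350 > mw > 100


# Hypothetical (clogP, MW) pairs; real descriptors would come from a cheminformatics toolkit.
candidates = {"mol_a": (2.1, 245.3), "mol_b": (-0.4, 238.0), "mol_c": (2.8, 361.2)}
lead_like = [name for name, (clogp, mw) in candidates.items() if is_lead_like(clogp, mw)]
print(lead_like)  # ['mol_a']
```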
<h2 id="benchmarks">Benchmarks</h2>
<p><em>Note: As an enumerated database of theoretical structures, GDB-17 has no standard supervised ML benchmarks. It serves primarily as a reference and exploration library for unsupervised learning and molecular generation.</em></p>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-11</strong></td>
          <td>Predecessor</td>
          <td><a href="/notes/chemistry/datasets/gdb-11/">Notes</a></td>
      </tr>
      <tr>
          <td><strong>GDB-13</strong></td>
          <td>Predecessor</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths:</strong></p>
<ul>
<li><strong>3D Shape Space (&ldquo;Escape out of Flatland&rdquo;)</strong>: Populates the third dimension (spherical, non-planar shapes) significantly better than known structures in PubChem or ChEMBL, which are primarily flat and rod-like due to aromatic dominance</li>
<li><strong>Stereochemical Complexity</strong>: Averages 6.4 possible stereoisomers per molecule (compared to 2.0 in PubChem-17), driven by an abundance of non-planar features and small rings</li>
<li><strong>Massive Scaffold Diversity</strong>: Features 35-fold more Murcko scaffolds and 61-fold more ring systems than molecules of matching size in PubChem</li>
<li><strong>Rich in Known Drug Isomers</strong>: Contains millions of exact geometric and formula isomers of approved drugs, offering direct variations and &ldquo;methyl walk&rdquo; analogs</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li><strong>Experimental Gap</strong>: These are virtual, combinatorially enumerated molecules. Despite strict computational stability filtering, they remain unsynthesized and lack experimental validation.</li>
<li><strong>Small Ring Dominance</strong>: Among molecules of up to 16 atoms, roughly 83% contain synthetically challenging small (3- or 4-membered) rings. The fraction drops for the 17-atom set, giving an overall small-ring fraction of 28%.</li>
<li><strong>Elemental Scope Restrictions</strong>: Elements like P, Si, and B, which occasionally appear in drugs, are completely excluded</li>
<li><strong>Strict Stability Filters</strong>: Excludes some potentially viable functional groups strictly to manage the combinatorial explosion and avoid unstable structures (e.g., hemiacetals, aminals, acyclic acetals)</li>
<li><strong>Polarity Skew</strong>: The full database contains disproportionately more polar molecules ($\text{clogP} &lt; 0$) than reference sets, and its sheer size makes it computationally demanding to query using advanced docking or 3D shape tools</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>GDB-17 was generated from first principles through a highly filtered, multi-stage pipeline:</p>
<ol>
<li><strong>Graphs $\rightarrow$ Hydrocarbons</strong>: Started with 114.3 billion topologies (generated using GENG), filtered down to 5.4 million stable hydrocarbons by applying geometrical strain rules (H-filters).</li>
<li><strong>Hydrocarbons $\rightarrow$ Skeletons</strong>: Substituted single bonds with double and triple bonds to yield 1.3 billion skeletons, simultaneously removing reactive unsaturations like allenes (S-filters).</li>
<li><strong>Skeletons $\rightarrow$ CNO Molecules</strong>: Diversified into 110.4 billion molecules by combinatorially substituting C with N and O, explicitly avoiding heteroatom-heteroatom bonds and enforcing stability filters (F-filters).</li>
<li><strong>Post-processing</strong>: Added diversity by transforming groups to generate aromatics, oximes, $\text{CF}_3$, halogens, and sulfones (P-filters), yielding the final 166.4 billion count.</li>
</ol>
<h3 id="hardware--software">Hardware &amp; Software</h3>
<ul>
<li><strong>Compute</strong>: Over 40,000 jobs were run across a 360-CPU cluster, consuming 100,000 CPU hours (~11 CPU years)</li>
<li><strong>Software</strong>: Powered by <strong>GENG</strong> (Nauty package) for graph generation, <strong>CORINA</strong> for 3D stereoisomer generation, and ChemAxon JChem libraries running inside custom Java 1.6 applications</li>
</ul>
<h3 id="shape-analysis-pmi">Shape Analysis (PMI)</h3>
<p>To quantify the &ldquo;escape from flatland,&rdquo; the original paper classifies molecular shape using the normalized Principal Moments of Inertia (PMI) of the generated 3D conformers. The principal moments ($I_1 \le I_2 \le I_3$) are obtained by diagonalizing the standard moment of inertia tensor. Molecules are plotted within a normalized 2D triangular space defined by the ratios:</p>
<p>$$ P_1 = \frac{I_1}{I_3}, \quad P_2 = \frac{I_2}{I_3} $$</p>
<p>The vertices of this plot define the three geometrical boundaries of chemical space:</p>
<ul>
<li><strong>Rod-like (1D)</strong>: $(0, 1)$ typical of stretched alkanes</li>
<li><strong>Disc-like (2D)</strong>: $(0.5, 0.5)$ typical of flat aromatics like benzene</li>
<li><strong>Sphere-like (3D)</strong>: $(1, 1)$ typical of globular structures like cubane</li>
</ul>
<p>GDB-17&rsquo;s core structural finding is that mathematically enumerated chemical space densely populates the interior and the $(1,1)$ spherical region of this plot, demonstrating significant 3D character. Empirical libraries, by contrast, cluster densely along the rod-to-disc axis.</p>
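<p>The normalized PMI ratios can be sketched in pure Python to check the three vertices of the plot. This is a minimal illustration, not the paper&rsquo;s pipeline: it assumes unit atomic masses by default, uses idealized point sets (a line, a regular hexagon, a cube) rather than real conformers, and computes eigenvalues with the standard closed-form trigonometric method for symmetric 3x3 matrices. All function names are chosen for this example.</p>

```python
import math


def inertia_tensor(coords, masses=None):
    """Moment of inertia tensor about the center of mass (unit masses by default)."""
    masses = masses or [1.0] * len(coords)
    total = sum(masses)
    center = [sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)]
    tensor = [[0.0] * 3 for _ in range(3)]
    for m, c in zip(masses, coords):
        r = [c[k] - center[k] for k in range(3)]
        r2 = sum(x * x for x in r)
        for a in range(3):
            for b in range(3):
                tensor[a][b] += m * ((r2 if a == b else 0.0) - r[a] * r[b])
    return tensor


def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def symmetric_eigenvalues(a):
    """Eigenvalues (ascending) of a symmetric 3x3 matrix via the trigonometric method."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted(a[k][k] for k in range(3))
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[k][k] - q) ** 2 for k in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    r = max(-1.0, min(1.0, det3(b) / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [smallest, 3.0 * q - smallest - largest, largest]


def normalized_pmi(coords, masses=None):
    """Return (I1/I3, I2/I3); assumes the atoms do not all coincide (I3 nonzero)."""
    i1, i2, i3 = symmetric_eigenvalues(inertia_tensor(coords, masses))
    return i1 / i3, i2 / i3


# Idealized shapes standing in for rod-, disc-, and sphere-like molecules.
rod = [(float(x), 0.0, 0.0) for x in range(5)]
disc = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3), 0.0) for k in range(6)]
cube = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]

for name, shape in (("rod", rod), ("disc", disc), ("sphere-like", cube)):
    print(name, tuple(round(v, 6) for v in normalized_pmi(shape)))
```

<p>The three point sets land at approximately $(0, 1)$, $(0.5, 0.5)$, and $(1, 1)$, matching the vertices listed above. For real molecules, atomic masses and 3D conformer coordinates would replace the unit masses and idealized geometries.</p>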
<h3 id="differences-from-gdb-13">Differences from GDB-13</h3>
<ul>
<li>The algorithm was completely rewritten to optimize memory efficiency, boosting computing speed roughly 400-fold and allowing enumeration beyond the previous 13-atom limit</li>
<li>Scope was expanded to include the halogens (F, Cl, Br, I) within the base framework</li>
<li>Size-dependent graph selection filters were introduced (prohibiting complex bridgeheads and completely eliminating small rings in 17-atom graphs) to manage combinatorial explosion</li>
<li>Post-processing was decoupled from the main enumeration to add features like cyclic oximes, aromatic halogens, and sulfones that would otherwise be rejected or break underlying generation constraints</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Paper Accessibility</strong>: The original paper is published in the <em>Journal of Chemical Information and Modeling</em> and is available as an Open Access publication under a CC-BY license.</li>
<li><strong>Data Availability</strong>: The full 166.4 billion molecule dataset is not publicly available for download (estimated &gt;400 GB compressed). However, a 50 million random subset and pre-filtered lead-like fractions are openly available on the <a href="https://gdb.unibe.ch/downloads/">GDB website</a> and archived on <a href="https://zenodo.org/records/5172018">Zenodo</a>.</li>
<li><strong>Code &amp; Algorithms</strong>: The enumeration rules and logic are well-described in the paper, but the actual Java 1.6 source code has not been released.</li>
<li><strong>Dependencies</strong>: The pipeline is a mix of open-source and proprietary software tools. Graph generation uses open-source GENG (Nauty), while chemical logic and stereoisomer generation rely on proprietary ChemAxon JChem libraries and CORINA.</li>
<li><strong>Hardware Specifications</strong>: The original database generation was explicitly parallelized across a 360-CPU cluster, consuming 100,000 CPU hours (approximately 11 CPU years) with over 40,000 calculation runs.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Ruddigkeit, L., van Deursen, R., Blum, L. C., and Reymond, J.-L. (2012). Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. <em>Journal of Chemical Information and Modeling</em>, 52(11), 2864&ndash;2875. <a href="https://doi.org/10.1021/ci300415d">https://doi.org/10.1021/ci300415d</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Ruddigkeit_2012,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{52}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{1549-960X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1021/ci300415d}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1021/ci300415d}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society (ACS)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ruddigkeit, Lars and van Deursen, Ruud and Blum, Lorenz C. and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span>=nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2864--2875}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDB-13: Chemical Universe Database (970M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-13/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-13/</guid><description>A dataset card for the Generated Database 13 (GDB-13), a database of nearly 1 billion small organic molecules for virtual screening and drug discovery.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/gdb_13_sample.webp"
         alt="Example GDB-13 molecule"
         title="Example GDB-13 molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example GDB-13 molecule (SMILES: <code>CCCC(O)(CO)CC1CC1CN</code>)</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>C/N/O Set</strong></td>
          <td>~910.1M</td>
          <td>Molecules containing up to 13 atoms of Carbon, Nitrogen, and Oxygen.</td>
      </tr>
      <tr>
          <td><strong>Cl/S Set</strong></td>
          <td>~67.3M</td>
          <td>Molecules containing up to 13 atoms, adding Sulfur (aromatic heterocycles, sulfones, sulfonamides, thioureas) and Chlorine (aromatic substituents).</td>
      </tr>
  </tbody>
</table>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-11</strong></td>
          <td>Predecessor</td>
          <td><a href="/notes/chemistry/datasets/gdb-11/">Notes</a></td>
      </tr>
      <tr>
          <td><strong>GDB-17</strong></td>
          <td>Successor</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="key-contribution">Key Contribution</h2>
<p>The creation and release of the 977.5 million-compound GDB-13, a significant expansion in molecular size (up to 13 atoms) and elemental diversity (including S and Cl) made possible by key algorithmic optimizations that significantly accelerated the enumeration process.</p>
<h2 id="overview">Overview</h2>
<p>GDB-13 extends the systematic enumeration of drug-like chemical space to molecules containing up to 13 atoms of carbon, nitrogen, oxygen, sulfur, and chlorine. Building on the methodology established in GDB-11, the database is a 37-fold increase in size over its predecessor while maintaining 100% Lipinski compliance for virtual screening applications. The enumeration yields a wide range of cyclic topologies: 54% of the database consists of molecules with at least one three- or four-membered ring.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Systematic coverage of structures with up to 13 atoms</li>
<li>High drug-likeness: 100% Lipinski compliance and 99.5% Vieth compliance</li>
<li>High proportion of leadlike (98.9%) and fragmentlike (45.1%) molecules</li>
<li>Structural novelty providing fragments absent from established databases like ZINC, ACX, and PubChem</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Limited to small molecules with up to 13 atoms of C, N, O, S, and Cl</li>
<li>Omits 66.2% of known chemical space up to 13 atoms found in external databases</li>
<li>Excludes specific nonenumerated elements (F, Br, I, P, Si, metals) and functional groups (chlorine on nonaromatic carbons, mercaptans, sulfoxides, enamines, allenes)</li>
<li>Excludes highly strained molecules and highly polar combinations</li>
<li>Consists entirely of computer-generated structures pending experimental validation</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="algorithmic-approach">Algorithmic Approach</h3>
<p><strong>Type</strong>: Rule-Based Combinatorial Graph Enumeration</p>
<p>The dataset is built by <strong>combinatorial enumeration</strong>: a rule-based graph generation algorithm (GENG) proposes candidate structures, and chemical stability filters prune them.</p>
<p><strong>Process</strong>:</p>
<ol>
<li>Use GENG to generate mathematical graphs of up to 13 nodes that represent saturated hydrocarbons, discarding non-planar graphs</li>
<li>Apply topological filters to remove highly strained small ring systems (e.g., fused cyclopropanes and bridgehead 3/4-membered rings)</li>
<li>Generate 3D structures with CORINA or ChemAxon and apply a volume-based strain filter. The local strain at a tetravalent carbon is estimated from the volume $V$ of the tetrahedron whose corners lie $1 \text{ \AA}$ from the carbon along each of its four single bonds. Hydrocarbons with planar or pyramidal carbon centers are discarded if:
$$ V &lt; 0.345 \text{ \AA}^3 $$</li>
<li>Introduce unsaturations and heteroatoms through systematic substitution</li>
<li>Apply chemical rule filters and element-ratio heuristics to ensure stability and drug-likeness</li>
<li>Apply post-processing algorithms to introduce nitro groups, nitriles, aromatic chlorines, thiophenes, sulfonamides, and thioureas</li>
</ol>
<p><strong>Key Optimization</strong>: Replaced the computationally expensive MM2 minimization used in GDB-11 with a geometry-based estimate of strain in polycyclic ring systems, combined with cheap &ldquo;element-ratio&rdquo; filters. Together these changes gave a <strong>6.4-fold speedup</strong> in structure validation early in the pipeline.</p>
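<p>The volume test in step 3 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code (which is unpublished): the function names, the coordinate-tuple input format, and the use of a scalar triple product to obtain the volume are assumptions. Only the 1 Å construction and the 0.345 Å³ cutoff come from the description above.</p>

```python
import math

STRAIN_THRESHOLD = 0.345  # cubic angstroms, the cutoff quoted in step 3


def _unit(v):
    """Scale a 3-vector to length 1 (i.e. 1 angstrom along the bond)."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def tetrahedron_volume(center, neighbors):
    """Volume of the tetrahedron whose corners sit 1 angstrom from `center`
    along each of the four bonds to `neighbors` (hypothetical helper)."""
    if len(neighbors) != 4:
        raise ValueError("expected exactly four bonded neighbors")
    # Corner points, expressed relative to the central carbon.
    p = [_unit(tuple(n[i] - center[i] for i in range(3))) for n in neighbors]
    # Edge vectors from the first corner to the other three corners.
    a, b, c = (tuple(q[i] - p[0][i] for i in range(3)) for q in p[1:])
    # Scalar triple product a . (b x c); the volume is |triple| / 6.
    triple = (
        a[0] * (b[1] * c[2] - b[2] * c[1])
        - a[1] * (b[0] * c[2] - b[2] * c[0])
        + a[2] * (b[0] * c[1] - b[1] * c[0])
    )
    return abs(triple) / 6.0


def is_too_strained(center, neighbors, threshold=STRAIN_THRESHOLD):
    """True when the carbon center falls below the volume cutoff."""
    return tetrahedron_volume(center, neighbors) < threshold


# Made-up test geometries: an ideal tetrahedral carbon and a planar one.
ideal = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
planar = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
print(round(tetrahedron_volume((0, 0, 0), ideal), 3))  # 0.513
print(is_too_strained((0, 0, 0), planar))              # True
```

<p>For an ideal tetrahedral carbon this construction gives $8 / (9\sqrt{3}) \approx 0.513 \text{ \AA}^3$, comfortably above the cutoff, while a perfectly planar center gives zero. The check needs only coordinates and a determinant, which is consistent with its being much cheaper than a force-field minimization.</p>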
<h3 id="differences-from-gdb-11">Differences from GDB-11</h3>
<ul>
<li><strong>Element Selection</strong>: Fluorine removed from allowed elements; sulfur and chlorine added for higher drug relevance (e.g., thiophenes, sulfonamides).</li>
<li><strong>Optimization Method</strong>: MM2-based structure optimization replaced with a much faster, custom geometry-based estimation of local strain (measuring the tetrahedron volume of carbon centers).</li>
<li><strong>Heuristic Filters</strong>: Fast elemental ratio filters added to quickly reject highly polar, unstable combinations early in the pipeline.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="paper--data-availability">Paper &amp; Data Availability</h3>
<ul>
<li><strong>Paper Access</strong>: The original paper is published in the Journal of the American Chemical Society (JACS) and is closed-access/paywalled. No open-access preprint exists on arXiv or ChemRxiv.</li>
<li><strong>Data Access</strong>: The full GDB-13 database and its subsets are freely available via the <a href="https://gdb.unibe.ch/downloads/">Reymond Group Downloads Page</a> and are persistently hosted on <a href="https://doi.org/10.5281/zenodo.5172018">Zenodo</a>.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">GDB-13 Database (Reymond Group)</a></td>
          <td>Dataset</td>
          <td>Free download</td>
          <td>Official download page hosted by the Reymond Group</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5172018">GDB-13 on Zenodo</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Persistent archival copy</td>
      </tr>
  </tbody>
</table>
<h3 id="source-code--algorithms">Source Code &amp; Algorithms</h3>
<p>The custom source code (e.g., GENG orchestration, local strain filters) is not publicly available. Researchers must re-implement the rules from the descriptions in the paper and its supplementary materials.</p>
<h3 id="heuristic-filters">Heuristic Filters</h3>
<p>Element-ratio filters, derived from an analysis of known compound databases, reject chemically unstable or highly polar molecules early in the generation pipeline. With $C$, $N$, and $O$ denoting the number of atoms of each element in a molecule:</p>
<p>$$
\begin{aligned}
\frac{N + O}{C} &amp;&lt; 1.0 \\
\frac{N}{C} &amp;&lt; 0.571 \\
\frac{O}{C} &amp;&lt; 0.666
\end{aligned}
$$</p>
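<p>As a rough illustration, the three inequalities can be applied to atom counts as below. The function names and the regex-based SMILES tally are assumptions made for this sketch (the tally only handles the C, N, O, S, Cl element set and is not a real SMILES parser). Rejecting carbon-free inputs is also an assumption, since the ratios are undefined there.</p>

```python
import re

# Upper bounds from the inequalities above (all comparisons are strict).
MAX_HETERO_PER_C = 1.0
MAX_N_PER_C = 0.571
MAX_O_PER_C = 0.666

# Matches GDB-13 elements in a SMILES string; "Cl" must be tried before "C".
_ATOM = re.compile(r"Cl|[CNOS]|[cnos]")


def count_elements(smiles):
    """Rough heavy-atom tally for SMILES limited to C, N, O, S, and Cl."""
    counts = {"C": 0, "N": 0, "O": 0, "S": 0, "Cl": 0}
    for token in _ATOM.findall(smiles):
        key = token if token == "Cl" else token.upper()
        counts[key] += 1
    return counts


def passes_ratio_filters(n_c, n_n, n_o):
    """Apply the three element-ratio inequalities to raw atom counts."""
    if n_c == 0:
        return False  # assumption: no carbon means the ratios are undefined
    return (
        (n_n + n_o) / n_c < MAX_HETERO_PER_C
        and n_n / n_c < MAX_N_PER_C
        and n_o / n_c < MAX_O_PER_C
    )


# The example molecule from the figure caption: 10 C, 1 N, 2 O.
example = count_elements("CCCC(O)(CO)CC1CC1CN")
print(example["C"], example["N"], example["O"])                          # 10 1 2
print(passes_ratio_filters(example["C"], example["N"], example["O"]))  # True

# Mannitol (C6H14O6) has O/C = 1.0 and is rejected.
print(passes_ratio_filters(6, 0, 6))                                     # False
```

<p>The example molecule has 13 heavy atoms and ratios of 0.3, 0.1, and 0.2, so it passes all three bounds, whereas mannitol fails both the combined and the oxygen bound.</p>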
<h3 id="excluded-functional-groups">Excluded Functional Groups</h3>
<ul>
<li>O-O bonds (peroxides)</li>
<li>Hemiacetals, aminals, acyclic imines, non-aromatic enols</li>
<li>Compounds containing both primary/secondary amines and aldehydes/ketones</li>
<li>Nonenumerated elements (F, Br, I, P, Si, metals)</li>
<li>High-heteroatom ratio structures (e.g., mannitol)</li>
</ul>
<h3 id="hardware--compute">Hardware &amp; Compute</h3>
<ul>
<li><strong>Compute Cost</strong>: ~40,000 CPU hours for the 910 million C/N/O structures.</li>
<li><strong>Infrastructure</strong>: Executed in parallel on a <strong>500-node cluster</strong>.</li>
<li><strong>Assembly Optimization</strong>: Replacing MM2 minimization with geometry-based estimation of strained polycyclic ring systems, together with the element-ratio filters, cut assembly time 6.4-fold on the GDB-11 workload (from 1,600 to 250 CPU hours).</li>
</ul>
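<p>The figures above imply a few derived quantities, worked out in the short calculation below. The wall-clock estimate is hypothetical: it assumes one CPU per node and perfect parallel scaling, neither of which is stated in the source.</p>

```python
# Figures quoted in the list above.
gdb11_mm2_hours = 1600        # GDB-11 workload with MM2 minimization
gdb11_geometry_hours = 250    # same workload with the geometry-based estimate
cno_structures = 910_000_000  # C/N/O subset size
cpu_hours = 40_000            # compute cost for that subset
nodes = 500                   # cluster size

# Speedup from replacing MM2 with the geometry-based strain estimate.
speedup = gdb11_mm2_hours / gdb11_geometry_hours

# Average enumeration throughput for the C/N/O subset.
structures_per_cpu_hour = cno_structures / cpu_hours

# ASSUMPTION: one CPU per node and perfect scaling (not stated in the source).
ideal_wall_clock_hours = cpu_hours / nodes

print(speedup)                  # 6.4
print(structures_per_cpu_hour)  # 22750.0
print(ideal_wall_clock_hours)   # 80.0
```

<p>Under those assumptions the C/N/O enumeration would take a little over three days of wall-clock time, at an average of roughly 22,750 accepted structures per CPU hour.</p>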
<h2 id="paper-information">Paper Information</h2>
<p>Blum, L. C. and Reymond, J.-L. (2009). 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. <em>Journal of the American Chemical Society</em>, 131(25), 8732&ndash;8733. <a href="https://doi.org/10.1021/ja902302h">https://doi.org/10.1021/ja902302h</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blum2009gdb13,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{970 million druglike small molecules for virtual screening in the chemical universe database GDB-13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blum, Lorenz C and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{131}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8732--8733}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/ja902302h}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>