<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Evaluation, Benchmarks &amp; Surveys on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/</link><description>Recent content in Evaluation, Benchmarks &amp; Surveys on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/index.xml" rel="self" type="application/rss+xml"/><item><title>RNNs vs Transformers for Molecular Generation Tasks</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</guid><description>Empirical comparison of RNN and Transformer architectures for molecular generation using SMILES and SELFIES across three generative tasks.</description><content:encoded><![CDATA[<h2 id="an-empirical-comparison-of-sequence-architectures-for-molecular-generation">An Empirical Comparison of Sequence Architectures for Molecular Generation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares two dominant sequence modeling architectures, recurrent neural networks (RNNs) and the Transformer, for chemical language modeling. The primary contribution is a controlled experimental comparison across three generative tasks of increasing complexity, combined with an evaluation of two molecular string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The paper does not propose a new method; instead, it provides practical guidance on when each architecture is more appropriate for molecular generation.</p>
<h2 id="why-compare-rnns-and-transformers-for-molecular-design">Why Compare RNNs and Transformers for Molecular Design?</h2>
<p>Exploring unknown molecular space and designing molecules with target properties is a central goal in computational drug design. Language models trained on molecular string representations (SMILES, SELFIES) have shown the capacity to learn complex molecular distributions. RNN-based models, including LSTM and GRU variants, were the first widely adopted architectures for this task. Models like <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">CharRNN</a>, ReLeaSE, and conditional RNNs demonstrated success in generating focused molecular libraries. More recently, self-attention-based Transformer models (Mol-GPT, LigGPT) have gained popularity due to their parallelizability and ability to capture long-range dependencies.</p>
<p>Despite the widespread adoption of Transformers across NLP, it was not clear whether they uniformly outperform RNNs for molecular generation. Prior work by Dollar et al. showed that RNN-based models achieved higher validity than Transformer-based models in some settings. Flam-Shepherd et al. demonstrated that RNN language models could learn complex molecular distributions across challenging generative tasks. This paper extends that comparison by adding the Transformer architecture to the same set of challenging tasks and evaluating both SMILES and SELFIES representations.</p>
<h2 id="experimental-design-three-tasks-two-architectures-two-representations">Experimental Design: Three Tasks, Two Architectures, Two Representations</h2>
<p>The core experimental design uses a 2x2 setup: two architectures (RNN and Transformer) crossed with two molecular representations (SMILES and SELFIES), yielding four model variants: SM-RNN, SF-RNN, SM-Transformer, and SF-Transformer.</p>
<h3 id="three-generative-tasks">Three generative tasks</h3>
<p>The three tasks, drawn from <a href="/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/">Flam-Shepherd et al.</a>, are designed with increasing complexity:</p>
<ol>
<li>
<p><strong>Penalized LogP task</strong>: Generate molecules with high penalized LogP scores (LogP minus synthetic accessibility and long-cycle penalties). The dataset is built from ZINC15 molecules with penalized LogP &gt; 4.0. Molecule sequences are relatively short (50-75 tokens).</p>
</li>
<li>
<p><strong>Multidistribution task</strong>: Learn a multimodal molecular weight distribution constructed from four distinct subsets: GDB13 (MW &lt;= 185), ZINC (185 &lt;= MW &lt;= 425), Harvard Clean Energy Project (460 &lt;= MW &lt;= 600), and POLYMERS (MW &gt; 600). This tests the ability to capture multiple modes simultaneously.</p>
</li>
<li>
<p><strong>Large-scale task</strong>: Generate large molecules from PubChem with more than 100 heavy atoms and MW ranging from 1250 to 5000. This tests long-sequence generation capability.</p>
</li>
</ol>
<h3 id="model-configuration">Model configuration</h3>
<p>Models are compared at matched parameter counts (roughly 5.2M to 36.4M parameters). Hyperparameters are tuned by random search over learning rate [0.0001, 0.001], hidden units (500-1000 for RNNs, 376-776 for Transformers), number of layers [3, 5], and dropout [0.0, 0.5]. A regex-based tokenizer replaces character-by-character tokenization, reducing sequence lengths for the largest molecules from roughly 10,000 tokens to under 3,000.</p>
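<p>The exact pattern the authors use is not reproduced in this summary, but regex-based SMILES tokenizers commonly follow the shape below; the pattern here is an illustrative approximation, not the paper&rsquo;s tokenizer:</p>

```python
import re

# Illustrative regex-based SMILES tokenizer (an approximation; the paper's
# exact pattern is not given in this summary).
SMILES_TOKENS = re.compile(
    r"\[[^\]]+\]"            # bracket atoms, e.g. [C@H], [nH], [O-]
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|%\d{2}"               # ring-closure labels above 9
    r"|[BCNOSPFI]"           # aliphatic organic-subset atoms
    r"|[bcnosp]"             # aromatic atoms
    r"|[()=#\-+/\\.:~*\d]"   # bonds, branches, ring closures, dots
)

def tokenize(smiles: str) -> list[str]:
    tokens = SMILES_TOKENS.findall(smiles)
    # A token-level vocabulary must cover the string exactly.
    assert "".join(tokens) == smiles, f"unrecognized characters in {smiles!r}"
    return tokens

print(tokenize("C[C@H](N)C(=O)O"))  # alanine: 11 tokens for a 15-character string
```

<p>The length reduction comes from multi-character units, such as bracket atoms and two-letter elements, collapsing to single tokens.</p>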
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>The evaluation covers multiple dimensions:</p>
<ul>
<li><strong>Standard metrics</strong>: validity, uniqueness, novelty</li>
<li><strong>Molecular properties</strong>: <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, LogP, SA, QED, Bertz complexity (BCT), natural product likeness (NP), molecular weight (MW)</li>
<li><strong>Wasserstein distance</strong>: measures distributional similarity between generated and training molecules for each property</li>
<li><strong>Tanimoto similarity</strong>: structural and scaffold similarity between generated and training molecules</li>
<li><strong>Token length (TL)</strong>: comparison of generated vs. training sequence lengths</li>
</ul>
<p>For each task, 10,000 molecules are generated and evaluated.</p>
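<p>For intuition, the per-property Wasserstein metric can be sketched in a few lines; the molecular-weight samples below are synthetic stand-ins, not the paper&rsquo;s data:</p>

```python
import numpy as np

def wasserstein_1d(u, v) -> float:
    """1-D Wasserstein (earth mover's) distance between two equal-size
    empirical samples: the mean absolute difference of sorted values."""
    u, v = np.sort(np.asarray(u)), np.sort(np.asarray(v))
    assert len(u) == len(v), "equal sample sizes assumed in this sketch"
    return float(np.mean(np.abs(u - v)))

rng = np.random.default_rng(0)
# Stand-in molecular weights; in practice these would be RDKit descriptors
# computed on the 10K generated molecules and on the training set.
train_mw = rng.normal(350.0, 40.0, size=10_000)
gen_mw = rng.normal(360.0, 45.0, size=10_000)

w = wasserstein_1d(gen_mw, train_mw)
print(f"W1(MW_gen, MW_train) = {w:.2f} Da")
```

<p>Because the distance is expressed in each property&rsquo;s own units, values are comparable across models for the same property but not across properties.</p>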
<h2 id="key-results-across-tasks">Key Results Across Tasks</h2>
<h3 id="penalized-logp-task">Penalized LogP task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.56</td>
          <td>0.12</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>16.61</td>
          <td>0.09</td>
          <td>5.90</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.63</td>
          <td>0.25</td>
          <td>0.42</td>
          <td>0.02</td>
          <td>36.43</td>
          <td>0.23</td>
          <td>2.35</td>
          <td>0.40</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.83</td>
          <td>0.18</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>23.77</td>
          <td>0.09</td>
          <td>7.99</td>
          <td>0.84</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.97</td>
          <td>0.22</td>
          <td>0.47</td>
          <td>0.02</td>
          <td>44.43</td>
          <td>0.28</td>
          <td>5.04</td>
          <td>0.53</td>
      </tr>
  </tbody>
</table>
<p>RNN-based models achieve smaller Wasserstein distances across most properties. The authors attribute this to LogP being computed as a sum of atomic contributions (a local property), which aligns with RNNs&rsquo; strength in capturing local structural features. RNNs also generated ring counts closer to the training distribution (4.10 for SM-RNN vs. 4.04 for SM-Transformer, with training data at 4.21). The Transformer performed better on global structural similarity (higher Tanimoto similarity to training data).</p>
<h3 id="multidistribution-task">Multidistribution task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.16</td>
          <td>0.07</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>18.34</td>
          <td>0.02</td>
          <td>7.07</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.46</td>
          <td>0.38</td>
          <td>0.55</td>
          <td>0.03</td>
          <td>110.72</td>
          <td>0.24</td>
          <td>10.00</td>
          <td>1.58</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.16</td>
          <td>0.16</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>39.94</td>
          <td>0.02</td>
          <td>10.03</td>
          <td>1.28</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.73</td>
          <td>0.37</td>
          <td>0.63</td>
          <td>0.04</td>
          <td>107.46</td>
          <td>0.30</td>
          <td>17.57</td>
          <td>2.40</td>
      </tr>
  </tbody>
</table>
<p>Both SMILES-based models captured all four modes of the MW distribution well. While RNNs had smaller overall Wasserstein distances, the Transformer fitted the higher-MW modes better. This aligns with the observation that longer molecular sequences (which correlate with higher MW) favor the Transformer&rsquo;s global attention mechanism over the RNN&rsquo;s sequential processing.</p>
<h3 id="large-scale-task">Large-scale task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.46</td>
          <td>1.89</td>
          <td>0.20</td>
          <td>0.01</td>
          <td>307.09</td>
          <td>0.03</td>
          <td>105.29</td>
          <td>12.05</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.65</td>
          <td>1.78</td>
          <td>0.43</td>
          <td>0.01</td>
          <td>456.98</td>
          <td>0.14</td>
          <td>100.79</td>
          <td>15.26</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.36</td>
          <td>1.64</td>
          <td>0.07</td>
          <td>0.01</td>
          <td>172.93</td>
          <td>0.02</td>
          <td>59.04</td>
          <td>7.41</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.91</td>
          <td>2.82</td>
          <td>0.47</td>
          <td>0.01</td>
          <td>464.75</td>
          <td>0.18</td>
          <td>92.91</td>
          <td>11.57</td>
      </tr>
  </tbody>
</table>
<p>The Transformer demonstrates a clear advantage on large molecules. SM-Transformer achieves substantially lower Wasserstein distances than SM-RNN across nearly all properties, with particularly large improvements in BCT (172.93 vs. 307.09) and MW (59.04 vs. 105.29). The Transformer also produces better Tanimoto similarity scores and more accurate token length distributions.</p>
<h3 id="standard-metrics-across-all-tasks">Standard metrics across all tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>SM-RNN</th>
          <th>SF-RNN</th>
          <th>SM-Transformer</th>
          <th>SF-Transformer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>Valid</td>
          <td>0.90</td>
          <td>1.00</td>
          <td>0.89</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Uniqueness</td>
          <td>0.98</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Novelty</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.71</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Valid</td>
          <td>0.95</td>
          <td>1.00</td>
          <td>0.97</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Uniqueness</td>
          <td>0.96</td>
          <td>1.00</td>
          <td>1.00</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Novelty</td>
          <td>0.91</td>
          <td>0.98</td>
          <td>0.91</td>
          <td>0.98</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Valid</td>
          <td>0.84</td>
          <td>1.00</td>
          <td>0.88</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Uniqueness</td>
          <td>0.99</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Novelty</td>
          <td>0.85</td>
          <td>0.92</td>
          <td>0.86</td>
          <td>0.94</td>
      </tr>
  </tbody>
</table>
<p>SELFIES achieves 100% validity across all tasks by construction, while SMILES validity drops for large molecules. The Transformer achieves slightly higher validity than the RNN for SMILES-based models, particularly on the large-scale task (0.88 vs. 0.84).</p>
<h2 id="conclusions-and-practical-guidelines">Conclusions and Practical Guidelines</h2>
<p>The central finding is that neither architecture universally dominates. The choice between RNNs and Transformers should depend on the characteristics of the molecular data:</p>
<ul>
<li>
<p><strong>RNNs are preferred</strong> when molecular properties depend on local structural features (e.g., LogP, ring counts) and when sequences are relatively short. They better capture local fragment distributions.</p>
</li>
<li>
<p><strong>Transformers are preferred</strong> when dealing with large molecules (high MW, long sequences), where global attention can capture the overall distribution more effectively. RNNs suffer from progressive information loss (forgetting) over long sequences.</p>
</li>
<li>
<p><strong>SMILES outperforms SELFIES</strong> on property distribution metrics across nearly all tasks and models. While SELFIES guarantees 100% syntactic validity, its generated molecules show worse distributional fidelity to training data. The authors argue that validity is a less important concern than property fidelity, since invalid SMILES can be filtered easily.</p>
</li>
</ul>
<p>The authors acknowledge that longer sequences remain challenging for both architectures. For Transformers, the quadratic growth of the attention matrix limits scalability. For RNNs, the vanishing gradient problem limits effective context length.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Task 1</td>
          <td>ZINC15 (penalized LogP &gt; 4.0)</td>
          <td>Not specified</td>
          <td>High penalized LogP molecules</td>
      </tr>
      <tr>
          <td>Task 2</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> + ZINC + CEP + POLYMERS</td>
          <td>~200K</td>
          <td>Multimodal MW distribution</td>
      </tr>
      <tr>
          <td>Task 3</td>
          <td>PubChem (&gt;100 heavy atoms)</td>
          <td>Not specified</td>
          <td>MW range 1250-5000</td>
      </tr>
  </tbody>
</table>
<p>Data processing code available at <a href="https://github.com/danielflamshep/genmoltasks">https://github.com/danielflamshep/genmoltasks</a> (from the original Flam-Shepherd et al. study).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenizer (not character-by-character)</li>
<li><strong>Hyperparameter search</strong>: Random search over learning rate [0.0001, 0.001], hidden units, layers [3, 5], dropout [0.0, 0.5]</li>
<li><strong>Selection</strong>: Top 20% of configurations by the sum of validity, uniqueness, and novelty; final selection across all indicators</li>
<li><strong>Generation</strong>: 10K molecules per model per task</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN variants</td>
          <td>5.2M - 36.4M</td>
          <td>RNN (LSTM/GRU)</td>
      </tr>
      <tr>
          <td>Transformer variants</td>
          <td>5.3M - 36.4M</td>
          <td>Transformer decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Wasserstein distance for property distributions (FCD, LogP, SA, QED, BCT, NP, MW, TL), Tanimoto similarity (molecular and scaffold), validity, uniqueness, novelty.</p>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/viko-3/language_model">trans_language</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transformer implementation by the authors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">genmoltasks</a></td>
          <td>Code/Data</td>
          <td>Apache-2.0</td>
          <td>Dataset construction from Flam-Shepherd et al.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Wang, Z., Zeng, X., Li, Y., Li, P., Ye, X., &amp; Sakurai, T. (2023). Molecular language models: RNNs or transformer? <em>Briefings in Functional Genomics</em>, 22(4), 392-400. <a href="https://doi.org/10.1093/bfgp/elad012">https://doi.org/10.1093/bfgp/elad012</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chen2023molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular language models: RNNs or transformer?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Yangyang and Wang, Zixu and Zeng, Xiangxiang and Li, Yayang and Li, Pengyong and Ye, Xiucai and Sakurai, Tetsuya}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Functional Genomics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--400}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bfgp/elad012}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review: Deep Learning for Molecular Design (2019)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/</guid><description>A 2019 review surveying deep generative models for molecular design, covering RNNs, VAEs, GANs, and RL approaches with SMILES and graph representations.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-deep-generative-models-for-molecular-design">A Systematization of Deep Generative Models for Molecular Design</h2>
<p>This is a <strong>Systematization</strong> paper that organizes and compares the rapidly growing literature on deep generative modeling for molecules. Published in 2019, it catalogs 45 papers from the preceding two years, classifying them by architecture (RNNs, VAEs, GANs, reinforcement learning) and molecular representation (SMILES strings, context-free grammars, graph tensors, 3D voxels). The review provides mathematical foundations for each technique, identifies cross-cutting themes, and proposes a framework for reward function design that addresses diversity, novelty, stability, and synthesizability.</p>
<h2 id="the-challenge-of-navigating-vast-chemical-space">The Challenge of Navigating Vast Chemical Space</h2>
<p>The space of potential drug-like molecules has been estimated to contain between $10^{23}$ and $10^{60}$ compounds, while only about $10^{8}$ have ever been synthesized. Traditional approaches to molecular design rely on combinatorial methods, mixing known scaffolds and functional groups, but these generate many unstable or unsynthesizable candidates. High-throughput screening (HTS) and virtual screening (HTVS) help but remain computationally expensive. The average cost to bring a new drug to market exceeds one billion USD, with a 13-year average timeline from discovery to market.</p>
<p>By 2016, <a href="/notes/machine-learning/generative-models/">deep generative models</a> had shown strong results in producing original images, music, and text. The &ldquo;molecular autoencoder&rdquo; of <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016/2018)</a> first applied these techniques to molecular generation, triggering an explosion of follow-up work. By the time of this review, the landscape had grown complex enough, with many architectures, representation schemes, and no agreed-upon benchmarking standards, to warrant systematic organization.</p>
<h2 id="molecular-representations-and-architecture-taxonomy">Molecular Representations and Architecture Taxonomy</h2>
<p>The review&rsquo;s core organizational contribution is a two-axis taxonomy: molecular representations on one axis and deep learning architectures on the other.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The review categorizes representations into 3D and 2D graph-based schemes:</p>
<p><strong>3D representations</strong> include raw voxels (placing nuclear charges on a grid), smoothed voxels (Gaussian blurring around nuclei), and tensor field networks. These capture full geometric information but suffer from high dimensionality, sparsity, and difficulty encoding rotation/translation invariance.</p>
<p><strong>2D graph representations</strong> include:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings</strong>: The dominant representation, encoding molecular graphs as ASCII character sequences via depth-first traversal. Non-unique (each molecule with $N$ heavy atoms has at least $N$ SMILES representations), but invertible and widely supported.</li>
<li><strong>Canonical SMILES</strong>: Unique but potentially encode grammar rules rather than chemical structure.</li>
<li><strong>Context-free grammars (CFGs)</strong>: Decompose SMILES into grammar rules to improve validity rates, though not to 100%.</li>
<li><strong>Tensor representations</strong>: Store atom types in a vertex feature matrix $X \in \mathbb{R}^{N \times |\mathcal{A}|}$ and bond types in an adjacency tensor $A \in \mathbb{R}^{N \times N \times Y}$.</li>
<li><strong>Graph operations</strong>: Directly build molecular graphs by adding atoms and bonds, guaranteeing 100% chemical validity.</li>
</ul>
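<p>As a concrete sketch of the tensor scheme (with a toy three-type atom and bond alphabet, not the review&rsquo;s), formaldehyde&rsquo;s heavy-atom skeleton could be encoded as:</p>

```python
import numpy as np

atom_alphabet = ["C", "N", "O"]                 # |A| = 3 (toy alphabet)
bond_alphabet = ["single", "double", "triple"]  # Y = 3

atoms = ["C", "O"]            # heavy atoms of formaldehyde (H is implicit)
bonds = [(0, 1, "double")]    # the C=O bond
N = len(atoms)

# Vertex feature matrix X in R^{N x |A|}: one-hot atom types.
X = np.zeros((N, len(atom_alphabet)))
for i, a in enumerate(atoms):
    X[i, atom_alphabet.index(a)] = 1.0

# Adjacency tensor A in R^{N x N x Y}: one-hot bond types, symmetric.
A = np.zeros((N, N, len(bond_alphabet)))
for i, j, b in bonds:
    k = bond_alphabet.index(b)
    A[i, j, k] = A[j, i, k] = 1.0

print(X.shape, A.shape)  # (2, 3) (2, 2, 3)
```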
<h3 id="deep-learning-architectures">Deep Learning Architectures</h3>
<p><strong>Recurrent Neural Networks (RNNs)</strong> generate SMILES strings character by character, typically using LSTM or GRU units. Training uses maximum likelihood estimation (MLE) with teacher forcing:</p>
<p>$$
L^{\text{MLE}} = -\sum_{s \in \mathcal{X}} \sum_{t=2}^{T} \log \pi_{\theta}(s_{t} \mid s_{1:t-1})
$$</p>
<p>Thermal rescaling of the output distribution controls the diversity-validity tradeoff via a temperature parameter $T$. RNNs achieved SMILES validity rates of 94-98%.</p>
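<p>Thermal rescaling is a one-line change at the sampling step; a minimal sketch with toy logits (not a trained model):</p>

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng) -> int:
    """Sample a token index from thermally rescaled logits. T < 1 sharpens
    the distribution (higher validity, lower diversity); T > 1 flattens it."""
    scaled = logits / temperature
    scaled -= scaled.max()          # for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])   # toy next-token scores
cold = np.bincount([sample_token(logits, 0.2, rng) for _ in range(1000)], minlength=3)
hot = np.bincount([sample_token(logits, 5.0, rng) for _ in range(1000)], minlength=3)
print("T=0.2:", cold, " T=5.0:", hot)  # low T concentrates on the argmax
```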
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$
\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}[q_{\phi}(z|x), p(z)]
$$</p>
<p>The first term encourages accurate reconstruction while the KL divergence term regularizes the latent distribution toward a standard Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$. Variants include <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">grammar VAEs</a> (GVAEs), syntax-directed VAEs, junction tree VAEs, and adversarial autoencoders (AAEs) that replace the KL term with adversarial training.</p>
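<p>For a diagonal-Gaussian encoder, the KL regularizer in the ELBO has a closed form, which a short sketch makes concrete:</p>

```python
import numpy as np

def kl_to_standard_normal(mu: np.ndarray, logvar: np.ndarray) -> float:
    """Closed-form D_KL[ N(mu, diag(exp(logvar))) || N(0, I) ],
    the regularization term of the ELBO for a diagonal-Gaussian encoder."""
    return 0.5 * float(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))

# The KL vanishes exactly when the posterior equals the prior...
print(kl_to_standard_normal(np.zeros(8), np.zeros(8)))       # 0.0
# ...and grows as the encoder's means move away from zero.
print(kl_to_standard_normal(0.5 * np.ones(8), np.zeros(8)))  # 1.0
```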
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> train a generator against a discriminator using the minimax objective:</p>
<p>$$
\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{d}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]
$$</p>
<p>The review shows that with an optimal discriminator, the generator objective reduces to minimizing the Jensen-Shannon divergence, which captures both forward and reverse KL divergence terms. This provides a more &ldquo;balanced&rdquo; training signal than MLE alone. The Wasserstein GAN (WGAN) uses the Earth mover&rsquo;s distance for more stable training:</p>
<p>$$
W(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma} \lVert x - y \rVert
$$</p>
<p><strong>Reinforcement Learning</strong> recasts molecular generation as a sequential decision problem. The policy gradient (REINFORCE) update is:</p>
<p>$$
\nabla J(\theta) = \mathbb{E}\left[G_{t} \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid y_{1:t-1})}{\pi_{\theta}(a_{t} \mid y_{1:t-1})}\right]
$$</p>
<p>To prevent RL fine-tuning from causing the generator to &ldquo;drift&rdquo; away from viable chemical structures, an augmented objective ties the fine-tuned model&rsquo;s likelihood to the pretrained prior&rsquo;s:</p>
<p>$$
R^{\prime}(S) = [\sigma R(S) + \log P_{\text{prior}}(S) - \log P_{\text{current}}(S)]^{2}
$$</p>
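<p>A sketch of the formula above as a fine-tuning loss (the weight $\sigma = 60$ is illustrative, not a value from the review):</p>

```python
def augmented_objective(reward: float, logp_prior: float,
                        logp_current: float, sigma: float = 60.0) -> float:
    """Squared deviation between the current model's log-likelihood and the
    prior log-likelihood augmented by the task reward. Minimizing this keeps
    the fine-tuned generator close to the pretrained prior. sigma = 60 is an
    illustrative weight, not a value from the review."""
    return (sigma * reward + logp_prior - logp_current) ** 2

# Zero reward and an unchanged model give zero loss; a positive reward at
# the prior's likelihood creates pressure to move toward higher reward.
print(augmented_objective(0.0, -30.0, -30.0))  # 0.0
print(augmented_objective(0.5, -30.0, -30.0))  # (60 * 0.5)^2 = 900.0
```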
<h2 id="cataloging-45-models-and-their-design-choices">Cataloging 45 Models and Their Design Choices</h2>
<p>Rather than running new experiments, the review&rsquo;s methodology involves systematically cataloging and comparing 45 published models. Table 2 in the paper lists each model&rsquo;s architecture, representation, training dataset, and dataset size. Key patterns include:</p>
<ul>
<li><strong>RNN-based models</strong> (16 entries): Almost exclusively use SMILES, trained on ZINC or ChEMBL datasets with 0.1M-1.7M molecules.</li>
<li><strong>VAE variants</strong> (20 entries): The most diverse category, spanning SMILES VAEs, grammar VAEs, junction tree VAEs, graph-based VAEs, and 3D VAEs. Training sets range from 10K to 72M molecules.</li>
<li><strong>GAN models</strong> (7 entries): Include <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, RANC, ATNC, MolGAN, and CycleGAN approaches. Notably, GANs appear to work with fewer training samples.</li>
<li><strong>Other approaches</strong> (2 entries): Pure RL methods from Zhou et al. and Stahl et al. that do not require pretraining on a dataset.</li>
</ul>
<p>The review also catalogs 13 publicly available datasets (Table 3), ranging from QM9 (133K molecules with quantum chemical properties) to <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> (977M combinatorially generated molecules) and ZINC15 (750M+ commercially available compounds).</p>
<h3 id="metrics-and-reward-function-design">Metrics and Reward Function Design</h3>
<p>A significant contribution is the systematic treatment of reward functions. The review argues that generated molecules should satisfy six desiderata: diversity, novelty, stability, synthesizability, non-triviality, and good properties. Key metrics formalized include:</p>
<p><strong>Diversity</strong> using Tanimoto similarity over fingerprints:</p>
<p>$$
r_{\text{diversity}} = 1 - \frac{1}{|\mathcal{G}|} \sum_{(x_{1}, x_{2}) \in \mathcal{G} \times \mathcal{G}} D(x_{1}, x_{2})
$$</p>
<p><strong>Novelty</strong> measured from the overlap between the generated set $\mathcal{G}$ and a hold-out test set $\mathcal{T}$:</p>
<p>$$
r_{\text{novel}} = 1 - \frac{|\mathcal{G} \cap \mathcal{T}|}{|\mathcal{T}|}
$$</p>
<p><strong>Synthesizability</strong> primarily assessed via the SA score, sometimes augmented with ring penalties and medicinal chemistry filters.</p>
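<p>Given fingerprints as sets of on-bits, the diversity and novelty metrics are a few lines each; this sketch follows the formulas above, including the $|\mathcal{T}|$ normalization for novelty (normalization conventions for diversity vary across papers):</p>

```python
def tanimoto(fp1: set, fp2: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    inter = len(fp1 & fp2)
    return inter / (len(fp1) + len(fp2) - inter)

def diversity(fingerprints: list) -> float:
    """One minus the mean pairwise Tanimoto similarity over the generated set
    (here averaged over distinct pairs; conventions vary)."""
    pairs = [(i, j) for i in range(len(fingerprints))
             for j in range(i + 1, len(fingerprints))]
    mean_sim = sum(tanimoto(fingerprints[i], fingerprints[j])
                   for i, j in pairs) / len(pairs)
    return 1.0 - mean_sim

def novelty(generated: set, test: set) -> float:
    """1 - |G intersect T| / |T|, following the review's formula."""
    return 1.0 - len(generated & test) / len(test)

fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(round(diversity(fps), 3))
print(novelty({"c1ccccc1", "CCO"}, {"CCO", "CCC"}))  # 0.5
```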
<p>The review also discusses the <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> as an analog of FID for molecular generation, and notes the emergence of standardized benchmarking platforms including <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, and DiversityNet.</p>
<h2 id="key-findings-and-future-directions">Key Findings and Future Directions</h2>
<p>The review identifies several major trends and conclusions:</p>
<p><strong>Shift from SMILES to graph-based representations.</strong> SMILES-based methods struggle with validity (the molecular autoencoder VAE achieved only 0.7-75% valid SMILES depending on sampling strategy). Methods that work directly on molecular graphs with chemistry-preserving operations achieve 100% validity, and the review predicts this trend will continue.</p>
<p><strong>Advantages of adversarial and RL training over MLE.</strong> The mathematical analysis shows that MLE only optimizes forward KL divergence, which can lead to models that place probability mass where the data distribution is zero. GAN training optimizes the Jensen-Shannon divergence, which balances forward and reverse KL terms. RL approaches, particularly pure RL without pretraining, showed competitive performance with much less training data.</p>
<p><strong>Genetic algorithms remain competitive.</strong> The review notes that the latest genetic algorithm approaches (Grammatical Evolution) could match deep learning methods for molecular optimization under some metrics, and at 100x lower computational cost in some comparisons. This serves as an important baseline calibration.</p>
<p><strong>Reward function design is underappreciated.</strong> Early models generated unstable molecules with labile groups (enamines, hemiaminals, enol ethers). Better reward functions that incorporate synthesizability, diversity, and stability constraints significantly improved practical utility.</p>
<p><strong>Need for standardized benchmarks.</strong> The review identifies a lack of agreement on evaluation methodology as a major barrier to progress, noting that published comparisons are often subtly biased toward novel methods.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from early 2019, the work predates several important developments: transformer-based architectures (which would soon dominate), SELFIES representations, diffusion models for molecules, and large-scale pretrained chemical language models. The review focuses primarily on drug-like small molecules and does not deeply cover protein design or materials optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper that does not present new experimental results. The paper catalogs 13 publicly available datasets used across the reviewed works; a representative subset is shown below:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>977M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC15</td>
          <td>750M+</td>
          <td>Commercially available compounds</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td>50M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>2M</td>
          <td>Curated bioactive molecules</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>QM9</td>
          <td>133,885</td>
          <td>Small organic molecules with DFT properties</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>PubChemQC</td>
          <td>3.98M</td>
          <td>PubChem compounds with DFT data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review provides mathematical derivations for MLE training (Eq. 1), VAE ELBO (Eqs. 9-13), AAE objectives (Eqs. 15-16), GAN objectives (Eqs. 19-22), WGAN (Eq. 24), REINFORCE gradient (Eq. 7), and numerous reward function formulations (Eqs. 26-36).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Key evaluation frameworks discussed:</p>
<ul>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (molecular analog of FID)</li>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> benchmarking platform</li>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmarking suite</li>
<li>Validity rate, uniqueness, novelty, and internal diversity metrics</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Elton, D. C., Boukouvalas, Z., Fuge, M. D., &amp; Chung, P. W. (2019). Deep Learning for Molecular Design: A Review of the State of the Art. <em>Molecular Systems Design &amp; Engineering</em>, 4(4), 828-849. <a href="https://doi.org/10.1039/C9ME00039A">https://doi.org/10.1039/C9ME00039A</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{elton2019deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep Learning for Molecular Design -- A Review of the State of the Art}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Elton, Daniel C. and Boukouvalas, Zois and Fuge, Mark D. and Chung, Peter W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Molecular Systems Design \&amp; Engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{828--849}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C9ME00039A}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Re-evaluating Sample Efficiency in Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/</guid><description>Thomas et al. re-evaluate generative model benchmarks for de novo drug design, adding property filters and diversity metrics that re-rank model performance.</description><content:encoded><![CDATA[<h2 id="an-empirical-re-evaluation-of-generative-model-benchmarks">An Empirical Re-evaluation of Generative Model Benchmarks</h2>
<p>This is an <strong>Empirical</strong> paper. The primary contribution is a critical reassessment of the <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO)</a> benchmark for de novo molecule generation. Rather than proposing a new generative model, the authors modify existing benchmark metrics to account for chemical desirability (molecular weight, LogP, topological novelty) and molecular diversity. They then re-evaluate all 25 generative models from the original PMO benchmark plus the recently proposed <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb (AHC)</a> method.</p>
<h2 id="sample-efficiency-and-chemical-quality-in-drug-design">Sample Efficiency and Chemical Quality in Drug Design</h2>
<p>Deep generative models for de novo molecule generation often require large numbers of oracle evaluations (up to $10^5$ samples) to optimize toward a target objective. This is a practical limitation when using computationally expensive scoring functions like molecular docking. The <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> by Gao et al. addressed this by reformulating performance as maximizing an objective within a fixed budget of 10,000 oracle calls, finding <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> to be the most sample-efficient model across 23 tasks.</p>
<p>However, the authors identify a key limitation: the PMO benchmark measures only sample efficiency without considering the chemical quality of proposed molecules. Investigating the top-performing REINVENT model on the <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> task, they find that 4 of 5 replicate runs produce molecules with molecular weight and LogP distributions far outside the training data (ZINC250k). The resulting molecules contain large structures with repeating substructures that are undesirable from a medicinal chemistry perspective. This disconnect between benchmark performance and practical utility motivates the modified evaluation metrics.</p>
<h2 id="modified-metrics-property-filters-and-diversity-requirements">Modified Metrics: Property Filters and Diversity Requirements</h2>
<p>The core innovation is the introduction of three modified AUC Top-10 metrics that extend the original PMO benchmark evaluation:</p>
<p><strong>AUC Top-10 (Filtered)</strong>: Molecules are excluded if their molecular weight or LogP falls beyond 4 standard deviations from the mean of the ZINC250k pre-training dataset ($\mu \pm 4\sigma$, covering approximately 99.99% of a normal distribution). Molecules with more than 10% de novo (unobserved in ZINC250k) ECFP4 fingerprint bits are also filtered out. This ensures the generative model does not drift beyond its applicability domain.</p>
<p><strong>AUC Top-10 (Diverse)</strong>: The top 10 molecules are selected iteratively, where a molecule is only added if its Tanimoto similarity (by ECFP4 fingerprints) to any previously selected compound does not exceed 0.35. This threshold corresponds to an approximately 80% probability that more-similar molecules belong to the same bioactivity class, enforcing that distinct candidates possess different profiles.</p>
<p><strong>AUC Top-10 (Combined)</strong>: Applies both property filters and diversity filters simultaneously, providing the most stringent evaluation of practical performance.</p>
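<p>A minimal sketch of the two modifications, using hypothetical dict-based molecule records and set fingerprints (the actual evaluation uses RDKit ECFP4 fingerprints and property statistics computed from ZINC250k):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def property_filter(molecules, mu, sigma, n_sigma=4.0):
    """Keep molecules whose property lies within mu +/- n_sigma * sigma."""
    lo, hi = mu - n_sigma * sigma, mu + n_sigma * sigma
    return [m for m in molecules if lo <= m["prop"] <= hi]

def diverse_top_k(molecules, k=10, max_sim=0.35):
    """Greedily take the best-scoring molecules, skipping any whose similarity
    to an already-selected molecule exceeds max_sim."""
    selected = []
    for m in sorted(molecules, key=lambda m: m["score"], reverse=True):
        if all(tanimoto(m["fp"], s["fp"]) <= max_sim for s in selected):
            selected.append(m)
        if len(selected) == k:
            break
    return selected

mols = [
    {"score": 0.9, "prop": 350, "fp": frozenset({1, 2, 3, 4})},
    {"score": 0.8, "prop": 360, "fp": frozenset({1, 2, 3, 5})},  # near-duplicate of the first
    {"score": 0.7, "prop": 900, "fp": frozenset({7, 8})},        # property outlier
    {"score": 0.6, "prop": 340, "fp": frozenset({7, 9})},
]
picks = diverse_top_k(property_filter(mols, mu=350, sigma=50), k=2)
print([m["score"] for m in picks])  # [0.9, 0.6]
```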
<h2 id="benchmark-setup-and-generative-models-evaluated">Benchmark Setup and Generative Models Evaluated</h2>
<h3 id="implementation-details">Implementation Details</h3>
<p>The authors re-implement the PMO benchmark using the original code and data (MIT license) with no changes beyond adding AHC and the new metrics. For Augmented Hill-Climb, the architecture follows REINVENT: an embedding layer of size 128 and 3 layers of Gated Recurrent Units (GRU) with size 512. The prior is trained on ZINC250k using SMILES notation with batch size 128 for 5 epochs.</p>
<p>Two AHC variants are benchmarked:</p>
<ul>
<li><strong>SMILES-AHC</strong>: Hyperparameters optimized via the standard PMO procedure, yielding batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>SMILES-AHC*</strong>: uses $\sigma = 60$, chosen based on prior knowledge that lower $\sigma$ values maintain better regularization and chemical quality</li>
</ul>
<p>Both omit diversity filters and non-unique penalization for standardized comparison, despite these being shown to improve performance in prior work.</p>
<h3 id="models-compared">Models Compared</h3>
<p>The benchmark includes 25 generative models from the original PMO paper spanning diverse architectures: REINVENT (RNN + RL), Graph GA (graph-based genetic algorithm), GP BO (Gaussian process Bayesian optimization), SMILES GA (SMILES-based genetic algorithm), SELFIES-based VAEs, and others. The 23 objective tasks derive primarily from the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark.</p>
<h2 id="re-ranked-results-and-augmented-hill-climb-performance">Re-ranked Results and Augmented Hill-Climb Performance</h2>
<p>The modified metrics substantially re-order the ranking of generative models:</p>
<ol>
<li>
<p><strong>SMILES-AHC* achieves top performance on AUC Top-10 (Combined)</strong>, where both property filters and diversity are enforced. The use of domain-informed hyperparameter selection ($\sigma = 60$) proves critical.</p>
</li>
<li>
<p><strong>SMILES-AHC (data-driven hyperparameters) ranks first</strong> when accounting for property filters alone, diversity alone, or both combined, demonstrating that the AHC algorithm itself provides strong performance even without manual tuning.</p>
</li>
<li>
<p><strong>REINVENT retains its first-place rank under property filters alone</strong>, suggesting that the minority of compounds staying within acceptable property space still perform well. However, it drops when diversity is also required.</p>
</li>
<li>
<p><strong>Evolutionary algorithms (Graph GA, GP BO, SMILES GA) drop significantly</strong> under the new metrics. This is expected because rule-based methods are not constrained by the ZINC250k distribution and tend to propose molecules that diverge from drug-like chemical space.</p>
</li>
<li>
<p><strong>Both AHC variants excel on empirically difficult tasks</strong>, including isomer-based tasks, Zaleplon MPO, and Sitagliptin MPO, where other methods struggle.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Results are preliminary because generative models have not undergone hyperparameter optimization against the new metrics</li>
<li>Property filter thresholds are subjective, and the 10% de novo ECFP4 bit threshold was chosen by visual inspection</li>
<li>Comparing rule-based models against distribution-based models using ZINC250k similarity introduces a bias toward distribution-based approaches</li>
<li>Six objective task reference molecules sit in the lowest 0.01% of ZINC250k property space, raising questions about whether distribution-based models can reasonably optimize for these objectives</li>
<li>Property filters and diversity could alternatively be incorporated directly into the objective function as additional oracles, though this would not necessarily produce the same results</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC250k</td>
          <td>~250K molecules</td>
          <td>Subset of ZINC15, provided by PMO benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO</a> benchmark tasks</td>
          <td>23 objectives</td>
          <td>Derived primarily from <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Augmented Hill-Climb</strong>: RL strategy from Thomas et al. (2022), patience of 5</li>
<li><strong>Hyperparameters (SMILES-AHC)</strong>: batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>Hyperparameters (SMILES-AHC*)</strong>: $\sigma = 60$ (domain-informed selection)</li>
<li><strong>Prior training</strong>: 5 epochs, batch size 128, SMILES notation</li>
<li><strong>Oracle budget</strong>: 10,000 evaluations per task</li>
<li><strong>Replicates</strong>: 5 per model per task</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Embedding (128) + 3x GRU (512), following REINVENT</li>
<li><strong>All 25 PMO benchmark models</strong> re-evaluated using original implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-10 (Original)</td>
          <td>Area under curve of average top 10 molecules</td>
          <td>Standard PMO metric</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Filtered)</td>
          <td>Original with MW/LogP and ECFP4 novelty filters</td>
          <td>$\mu \pm 4\sigma$ from ZINC250k</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Diverse)</td>
          <td>Top 10 selected with Tanimoto &lt; 0.35 diversity</td>
          <td>ECFP4 fingerprints</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Combined)</td>
          <td>Both filters and diversity applied</td>
          <td>Most stringent metric</td>
      </tr>
  </tbody>
</table>
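<p>Reading AUC Top-10 as the normalized area under the running top-10 average plotted against oracle calls (the PMO convention, with the curve held flat over any unused budget), a sketch:</p>

```python
import heapq

def auc_top_k(scores, budget, k=10):
    """Normalized area under the running mean of the best k oracle scores."""
    top, curve = [], []          # min-heap of the k best scores seen so far
    for s in scores[:budget]:
        heapq.heappush(top, s)
        if len(top) > k:
            heapq.heappop(top)
        curve.append(sum(top) / len(top))
    curve += [curve[-1]] * (budget - len(curve))  # hold last value if run stops early
    return sum(curve) / budget

print(auc_top_k([0.1, 0.5, 0.9], budget=4, k=2))  # (0.1 + 0.3 + 0.7 + 0.7) / 4 = 0.45
```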
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The benchmark uses 10,000 oracle evaluations per task with 5 replicates, which is computationally modest compared to standard generative model training.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Scoring and benchmarking framework by the first author</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original benchmark code and data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Re-evaluating sample efficiency in de novo molecule generation. <em>arXiv preprint arXiv:2212.01385</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{thomas2022reevaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Re-evaluating sample efficiency in de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2212.01385}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2212.01385}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Inverse Molecular Design with ML Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/</guid><description>Review of inverse molecular design approaches including VAEs, GANs, and RL for navigating chemical space and generating novel molecules with desired properties.</description><content:encoded><![CDATA[<h2 id="a-foundational-systematization-of-inverse-molecular-design">A Foundational Systematization of Inverse Molecular Design</h2>
<p>This paper is a <strong>Systematization</strong> of the nascent field of inverse molecular design using machine learning generative models. Published in <em>Science</em> in 2018, it organizes and contextualizes the rapidly emerging body of work on using deep generative models (variational autoencoders, generative adversarial networks, and reinforcement learning) to navigate chemical space and propose novel molecules with targeted properties. Rather than introducing a new method, the paper synthesizes the conceptual framework connecting molecular representations, generative architectures, and inverse design objectives, establishing a reference point for the field at a critical early stage.</p>
<h2 id="the-challenge-of-navigating-chemical-space">The Challenge of Navigating Chemical Space</h2>
<p>The core problem is the sheer scale of chemical space. For pharmacologically relevant small molecules alone, the number of possible structures is estimated at $10^{60}$. Traditional approaches to materials discovery rely on trial and error or high-throughput virtual screening (HTVS), both of which are fundamentally limited by the need to enumerate and evaluate candidates from a predefined library.</p>
<p>The conventional materials discovery pipeline, from concept to commercial product, historically takes 15 to 20 years, involving iterative cycles of simulation, synthesis, device integration, and characterization. Inverse design offers a conceptual alternative: start from a desired functionality and search for molecular structures that satisfy it. This inverts the standard paradigm where a molecule is proposed first and its properties are computed or measured afterward.</p>
<p>The key distinction the authors draw is between discriminative and generative models. A discriminative model learns $p(y|x)$, the conditional probability of properties $y$ given a molecule $x$. A <a href="/notes/machine-learning/generative-models/">generative model</a> instead learns the joint distribution $p(x,y)$, which can be conditioned to yield either the direct design problem $p(y|x)$ or the inverse design problem $p(x|y)$.</p>
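<p>The distinction can be made concrete with a toy discrete joint distribution (invented numbers): the same table, conditioned along different axes, answers both the direct query $p(y|x)$ and the inverse query $p(x|y)$:</p>

```python
# Hypothetical joint p(x, y) over three "molecules" and one binary property.
joint = {
    ("mol_A", "soluble"): 0.30, ("mol_A", "insoluble"): 0.10,
    ("mol_B", "soluble"): 0.05, ("mol_B", "insoluble"): 0.35,
    ("mol_C", "soluble"): 0.15, ("mol_C", "insoluble"): 0.05,
}

def condition(joint, axis, value):
    """Condition on one variable: axis=0 fixes x (direct design, p(y|x));
    axis=1 fixes y (inverse design, p(x|y))."""
    kept = {k: v for k, v in joint.items() if k[axis] == value}
    z = sum(kept.values())
    return {k[1 - axis]: v / z for k, v in kept.items()}

print(condition(joint, 0, "mol_A"))    # p(y | x = mol_A)
print(condition(joint, 1, "soluble"))  # p(x | y = soluble): ranks candidate molecules
```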
<h2 id="three-pillars-vaes-gans-and-reinforcement-learning">Three Pillars: VAEs, GANs, and Reinforcement Learning</h2>
<p>The review organizes inverse molecular design approaches around three generative paradigms and the molecular representations they operate on.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The paper surveys representations across three broad categories:</p>
<ul>
<li><strong>Discrete (text-based)</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings encode molecular structure as 1D text following a grammar syntax. Their adoption has been driven by the availability of NLP deep learning tools.</li>
<li><strong>Continuous (vectors/tensors)</strong>: <a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrices</a>, bag of bonds, fingerprints, symmetry functions, and electronic density representations. These expose different physical symmetries (permutational, rotational, reflectional, translational invariance).</li>
<li><strong>Weighted graphs</strong>: Molecules as undirected graphs where atoms are nodes and bonds are edges, with vectorized features on edges and nodes (bonding type, aromaticity, charge, distance).</li>
</ul>
<p>An ideal representation for inverse design should be invertible, meaning it supports mapping back to a synthesizable molecular structure. SMILES strings and molecular graphs are invertible, while many continuous representations require lookup tables or auxiliary methods.</p>
<h3 id="variational-autoencoders-vaes">Variational Autoencoders (VAEs)</h3>
<p><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAEs</a> encode molecules into a continuous latent space and decode latent vectors back to molecular representations. The key insight is that by constraining the encoder to produce latent vectors following a Gaussian distribution, the model gains the ability to <a href="/posts/modern-variational-autoencoder-in-pytorch/">interpolate between molecules and sample novel structures</a>. The latent space encodes a geometry: nearby points decode to similar molecules, and gradient-based optimization over this continuous space enables direct property optimization.</p>
<p>The VAE loss function combines a reconstruction term with a KL divergence regularizer:</p>
<p>$$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\big(q(z|x) \,\|\, p(z)\big)$$</p>
<p>where $q(z|x)$ is the encoder (approximate posterior), $p(x|z)$ is the decoder, and $p(z)$ is the prior (typically Gaussian).</p>
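<p>For the usual diagonal-Gaussian posterior and standard-normal prior, the KL term has the closed form $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2)$; a minimal sketch:</p>

```python
import math

def kl_diag_gaussian(mu, sigma):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

print(kl_diag_gaussian([0.0, 0.0], [1.0, 1.0]))  # 0.0: posterior equals the prior
print(kl_diag_gaussian([1.0], [1.0]))            # 0.5: penalty for shifting the mean
```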
<p>Semi-supervised variants jointly train on molecules and properties, reorganizing latent space so molecules with similar properties cluster together. <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> demonstrated local and global optimization across generated distributions using Bayesian optimization over latent space.</p>
<p>The review traces the evolution from character-level SMILES VAEs to <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">grammar-aware and syntax-directed variants</a> that improve the generation of syntactically valid structures.</p>
<h3 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p><a href="/posts/what-is-a-gan/">GANs</a> pit a generator against a discriminator in an adversarial training framework. The generator learns to produce synthetic molecules from noise, while the discriminator learns to distinguish synthetic from real molecules. Training convergence for GANs is challenging, suffering from mode collapse and generator-discriminator imbalance.</p>
<p>For molecular applications, dealing with discrete SMILES data introduces nondifferentiability, addressed through workarounds like SeqGAN&rsquo;s policy gradient approach and boundary-seeking GANs.</p>
<h3 id="reinforcement-learning-rl">Reinforcement Learning (RL)</h3>
<p>RL treats molecule generation as a sequential decision process where an agent (the generator) takes actions (adding characters to a SMILES string) to maximize a reward (desired molecular properties). Since rewards can only be assigned after sequence completion, Monte Carlo Tree Search (MCTS) is used to simulate possible completions and weight paths based on their success.</p>
<p>Applications include generation of drug-like molecules and <a href="https://en.wikipedia.org/wiki/Retrosynthesis">retrosynthesis</a> planning. Notable examples cited include RL for optimizing putative <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> inhibitors and molecules active against <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2</a>.</p>
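<p>The sequential-reward setup can be sketched with a deliberately tiny, context-free REINFORCE loop (toy vocabulary and reward, no MCTS, invented hyperparameters): the policy is a single categorical over tokens, whole sequences are sampled, and each taken action is reinforced by a baselined terminal reward:</p>

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce(vocab, reward_fn, steps=2000, seq_len=5, lr=0.2, seed=0):
    """Context-free REINFORCE: one shared categorical policy over the vocabulary,
    updated after each sampled sequence using a running-mean baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(vocab)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        seq = rng.choices(range(len(vocab)), weights=probs, k=seq_len)
        reward = reward_fn([vocab[i] for i in seq])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        for a in seq:                  # grad of log-softmax wrt logit j: 1{j=a} - p_j
            for j in range(len(vocab)):
                logits[j] += lr * advantage * ((1.0 if j == a else 0.0) - probs[j])
    return softmax(logits)

# Toy terminal reward: fraction of "C" tokens in the finished sequence.
probs = reinforce(["C", "O", "N"], lambda s: s.count("C") / len(s))
print(probs)  # probability mass concentrates on "C"
```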
<h3 id="hybrid-approaches">Hybrid Approaches</h3>
<p>The review highlights that these paradigms are not exclusive. Examples include druGAN (adversarial autoencoder) and <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a> (combined GAN and RL), which leverage strengths of multiple frameworks.</p>
<h2 id="survey-of-applications-and-design-paradigms">Survey of Applications and Design Paradigms</h2>
<p>Being a review paper, this work does not present new experiments but surveys existing applications across domains:</p>
<p><strong>Drug Discovery</strong>: Most generative model applications at the time of writing targeted pharmaceutical properties, including solubility, melting temperature, synthesizability, and target activity. Popova et al. optimized for JAK2 inhibitors, and Olivecrona et al. targeted dopamine receptor type 2.</p>
<p><strong>Materials Science</strong>: HTVS had been applied to organic photovoltaics (screening by frontier orbital energies and conversion efficiency), organic redox flow batteries (redox potential and solubility), organic LEDs (singlet-triplet gap), and inorganic materials via the Materials Project.</p>
<p><strong>Chemical Space Exploration</strong>: Evolution strategies had been applied to map chemical space, with structured search procedures incorporating genotype representations and mutation operations. Bayesian sampling with sequential Monte Carlo and gradient-based optimization of properties with respect to molecular systems represented alternative inverse design strategies.</p>
<p><strong>Graph-Based Generation</strong>: The paper notes the emerging extension of VAEs to molecular graphs (junction tree VAE) and message passing networks for incremental graph construction, though the graph isomorphism approximation problem remained a practical challenge.</p>
<h2 id="future-directions-and-open-challenges">Future Directions and Open Challenges</h2>
<p>The authors identify several open directions for the field:</p>
<p><strong>Closed-Loop Discovery</strong>: The ultimate goal is to concurrently propose, create, and characterize new materials with simultaneous data flow between components. At the time of writing, very few examples of successful closed-loop approaches existed.</p>
<p><strong>Active Learning</strong>: Combining inverse design with Bayesian optimization enables models that adapt as they explore chemical space, expanding in regions of high uncertainty and discovering molecular regions with desirable properties as a function of composition.</p>
<p><strong>Representation Learning</strong>: No single molecular representation works optimally for all properties. Graph and hierarchical representations were identified as areas needing further study. Representations that encode relevant physics tend to generalize better.</p>
<p><strong>Improved Architectures</strong>: Memory-augmented sequence generation models, Riemannian optimization methods exploiting latent space geometry, multi-level VAEs for structured latent spaces, and inverse RL for learning reward functions were highlighted as promising research directions.</p>
<p><strong>Integration into Education</strong>: The authors advocate for integrating ML into curricula across chemical, biochemical, medicinal, and materials sciences.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from 2018, this work captures the field at an early stage. Several limitations are worth noting:</p>
<ul>
<li>The survey is dominated by SMILES-based approaches, reflecting the state of the field at the time. Graph-based and 3D-aware generative models were just emerging.</li>
<li>Quantitative benchmarking of generative models was not yet standardized. The review does not provide systematic comparisons across methods.</li>
<li>The synthesis feasibility of generated molecules receives limited attention. The gap between computationally generated candidates and experimentally realizable molecules was (and remains) a significant challenge.</li>
<li>Transformer-based architectures, which would come to dominate chemical language modeling, are not discussed, as the Transformer had only been published the year prior.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a review/perspective paper, this work does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the cited primary works rather than the review itself.</p>
<h3 id="key-cited-methods-and-their-resources">Key Cited Methods and Their Resources</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Authors</th>
          <th>Type</th>
          <th>Availability</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Automatic Chemical Design (VAE)</a></td>
          <td>Gomez-Bombarelli et al.</td>
          <td>Code + Data</td>
          <td>Published in ACS Central Science</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></td>
          <td>Kusner et al.</td>
          <td>Code</td>
          <td>arXiv:1703.01925</td>
      </tr>
      <tr>
          <td>Junction Tree VAE</td>
          <td>Jin et al.</td>
          <td>Code</td>
          <td>arXiv:1802.04364</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a></td>
          <td>Sanchez-Lengeling et al.</td>
          <td>Code</td>
          <td>ChemRxiv preprint</td>
      </tr>
      <tr>
          <td>SeqGAN</td>
          <td>Yu et al.</td>
          <td>Code</td>
          <td>AAAI 2017</td>
      </tr>
      <tr>
          <td>Neural Message Passing</td>
          <td>Gilmer et al.</td>
          <td>Code</td>
          <td>arXiv:1704.01212</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sánchez-Lengeling, B., &amp; Aspuru-Guzik, A. (2018). Inverse molecular design using machine learning: Generative models for matter engineering. <em>Science</em>, 361(6400), 360-365. <a href="https://doi.org/10.1126/science.aat2663">https://doi.org/10.1126/science.aat2663</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sanchez-lengeling2018inverse,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inverse molecular design using machine learning: Generative models for matter engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{S{\&#39;a}nchez-Lengeling, Benjamin and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{361}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6400}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{360--365}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1126/science.aat2663}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Generative AI Survey for De Novo Molecule and Protein Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/generative-ai-drug-design-survey/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/generative-ai-drug-design-survey/</guid><description>Comprehensive survey of generative AI for de novo drug design covering molecule and protein generation with VAEs, GANs, diffusion, and flow models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-generative-ai-for-drug-design">A Systematization of Generative AI for Drug Design</h2>
<p>This is a <strong>Systematization</strong> paper that provides a broad survey of generative AI methods applied to de novo drug design. The survey organizes the field into two overarching themes: small molecule generation and protein generation. Within each theme, the authors identify subtasks, catalog datasets and benchmarks, describe model architectures, and compare the performance of leading methods using standardized metrics. The paper covers over 200 references and provides 12 comparative benchmark tables.</p>
<p>The primary contribution is a unified organizational framework that allows both micro-level comparisons within each subtask and macro-level observations across the two application domains. The authors highlight parallel developments in both fields, particularly the shift from sequence-based to structure-based approaches and the growing dominance of diffusion models.</p>
<h2 id="the-challenge-of-navigating-de-novo-drug-design">The Challenge of Navigating De Novo Drug Design</h2>
<p>The drug design process requires creating ligands that interact with specific biological targets. These range from small molecules (tens of atoms) to large proteins (monoclonal antibodies). Traditional discovery pipelines are slow and expensive, with preclinical development costing hundreds of millions of dollars and taking 3-6 years. The chemical space of potential drug-like compounds is estimated at $10^{23}$ to $10^{60}$, making brute-force exploration infeasible.</p>
<p>AI-driven generative methods have gained traction in recent years: AI-focused biotech companies have advanced over 150 small-molecule drugs into the discovery phase and 15 into clinical trials, and the number of AI-fueled drug design programs has grown by almost 40% each year.</p>
<p>The rapid development of the field, combined with its inherent complexity, creates barriers for new researchers. Several prior surveys exist, but they focus on specific aspects: molecule generation, protein generation, antibody generation, or specific model architectures like diffusion models. This survey takes a broader approach, covering both molecule and protein generation under a single organizational framework.</p>
<h2 id="unified-taxonomy-two-themes-seven-subtasks">Unified Taxonomy: Two Themes, Seven Subtasks</h2>
<p>The survey&rsquo;s core organizational insight is structuring de novo drug design into two themes with distinct subtasks, while identifying common architectural patterns across them.</p>
<h3 id="generative-model-architectures">Generative Model Architectures</h3>
<p>The survey covers four main generative model families used across both molecule and protein generation:</p>
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode inputs into a latent distribution and decode from sampled points. The encoder maps input $x$ to a distribution parameterized by mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. Training minimizes reconstruction loss plus KL divergence:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$$</p>
<p>where the KL loss is:</p>
<p>$$\mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k} \left(1 + \log\left(\sigma_k^{(i)}\right)^2 - \left(\mu_k^{(i)}\right)^2 - \left(\sigma_k^{(i)}\right)^2\right)$$</p>
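<p>As a minimal, framework-free sketch (my own illustration, not code from the survey), the &beta;-weighted objective can be computed from encoder outputs parameterized as per-dimension means and log-variances:</p>

```python
import math

def kl_loss(mu, log_var):
    # KL divergence between N(mu, sigma^2) and the standard normal prior,
    # summed over latent dimensions, using the log-variance parameterization
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def vae_loss(recon_loss, mu, log_var, beta=1.0):
    # Total beta-VAE objective: reconstruction term plus beta-weighted KL term
    return recon_loss + beta * kl_loss(mu, log_var)
```

<p>Parameterizing the encoder output as $\log \sigma^2$ rather than $\sigma$ is a common numerical-stability choice, not something specific to the survey.</p>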
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> use a generator-discriminator game. The generator $G$ creates instances from random noise $z$ sampled from a prior $p_z(z)$, while the discriminator $D$ distinguishes real from synthetic data:</p>
<p>$$\min_{G} \max_{D} \mathbb{E}_x[\log D(x; \theta_d)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$$</p>
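<p>On a single batch of discriminator outputs, the two sides of this game can be sketched as follows (an illustration, not code from any cited model; note the generator term uses the widely adopted non-saturating variant $-\log D(G(z))$ rather than the literal $\log(1 - D(G(z)))$ minimax term):</p>

```python
import math

def gan_batch_losses(d_real, d_fake):
    # d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    # d_fake: discriminator outputs D(G(z)) on generated samples
    # Discriminator loss: negated average of log D(x) + log(1 - D(G(z)))
    d_loss = -sum(math.log(r) + math.log(1.0 - f)
                  for r, f in zip(d_real, d_fake)) / len(d_real)
    # Generator loss: non-saturating variant -log D(G(z))
    g_loss = -sum(math.log(f) for f in d_fake) / len(d_fake)
    return d_loss, g_loss
```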
<p><strong>Flow-Based Models</strong> generate data by applying an invertible function $f: z_0 \mapsto x$ to transform a simple latent distribution (Gaussian) to the target distribution. The log-likelihood is computed using the change-of-variable formula:</p>
<p>$$\log p(x) = \log p_0(z) - \log \left| \det \frac{\partial f}{\partial z} \right|$$</p>
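<p>A one-dimensional affine flow makes the formula concrete (an illustrative sketch, not a model from the survey): inverting $x = f(z) = s z + b$ gives $z = (x - b)/s$, and the Jacobian term is $\log|s|$, subtracted because $f$ maps latent to data space:</p>

```python
import math

def affine_flow_logp(x, scale, shift):
    # Invert the flow: z = f^{-1}(x) = (x - shift) / scale
    z = (x - shift) / scale
    # Standard-normal base density log p_0(z)
    log_base = -0.5 * (z ** 2 + math.log(2 * math.pi))
    # Change of variables: subtract log|det df/dz| = log|scale|
    return log_base - math.log(abs(scale))
```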
<p><strong>Diffusion Models</strong> gradually add Gaussian noise over $T$ steps in a forward process and learn to reverse the noising via a denoising neural network. The forward step is:</p>
<p>$$x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$</p>
<p>The training loss minimizes the difference between the true noise and the predicted noise:</p>
<p>$$L_t = \mathbb{E}_{t \sim [1,T], x_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2 \right]$$</p>
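<p>A scalar toy version of the forward chain and the noise-matching objective (a sketch under the step equation above, not any cited implementation):</p>

```python
import math
import random

def forward_diffuse(x0, betas):
    # Apply the forward noising chain one step at a time:
    # x_{t+1} = sqrt(1 - beta_t) * x_t + sqrt(beta_t) * eps
    x = x0
    for beta in betas:
        eps = random.gauss(0.0, 1.0)
        x = math.sqrt(1 - beta) * x + math.sqrt(beta) * eps
    return x

def denoising_loss(eps_true, eps_pred):
    # Squared error between the true noise and the network's prediction
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred))
```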
<p>Graph neural networks (GNNs), particularly equivariant GNNs (EGNNs), are commonly paired with these generative methods to handle molecular and protein inputs: diffusion and flow-based models typically rely on GNNs for 2D/3D graph input, while VAEs and GANs are more often applied to 1D string representations.</p>
<h2 id="small-molecule-generation-tasks-datasets-and-models">Small Molecule Generation: Tasks, Datasets, and Models</h2>
<h3 id="target-agnostic-molecule-design">Target-Agnostic Molecule Design</h3>
<p>The goal is to generate a set of novel, valid, and stable molecules without conditioning on any specific biological target. Models are evaluated on atom stability, molecule stability, validity, uniqueness, novelty, and QED (Quantitative Estimate of Drug-Likeness).</p>
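<p>A hedged sketch of how the uniqueness and novelty metrics are typically computed, assuming the generated molecules have already been validity-filtered and canonicalized to SMILES (e.g. with RDKit) upstream:</p>

```python
def generation_metrics(generated, training_set):
    # generated: list of canonical SMILES from the model (post validity filter)
    # training_set: canonical SMILES the model was trained on
    unique = set(generated)
    novel = unique - set(training_set)
    return {
        "uniqueness": len(unique) / len(generated),  # fraction non-duplicate
        "novelty": len(novel) / len(unique),         # fraction unseen in training
    }
```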
<p><strong>Datasets</strong>: QM9 (small stable molecules from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>) and <a href="/notes/chemistry/datasets/geom/">GEOM</a>-Drug (more complex, drug-like molecules).</p>
<p>The field has shifted from SMILES-based VAEs (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GVAE</a>, SD-VAE) to 2D graph methods (JTVAE) and then to 3D diffusion-based models. Current leading methods on QM9:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>EGNN, Diffusion</td>
          <td>99.8</td>
          <td>97.5</td>
          <td>97.9</td>
          <td>97.6</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>EGNN, VAE, Diffusion</td>
          <td>99.2</td>
          <td>89.6</td>
          <td>98.6</td>
          <td>94.6</td>
      </tr>
      <tr>
          <td>JODO</td>
          <td>EGNN, Diffusion</td>
          <td>99.2</td>
          <td>93.4</td>
          <td>99.0</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>VAE, Diffusion</td>
          <td>98.9</td>
          <td>89.4</td>
          <td>93.8</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>EGNN, Diffusion</td>
          <td>98.7</td>
          <td>82.0</td>
          <td>91.9</td>
          <td>90.7</td>
      </tr>
  </tbody>
</table>
<p>EDM provided an initial baseline using diffusion with an equivariant GNN. GCDM introduced attention-based geometric message-passing. MDM separately handles covalent bond edges and van der Waals forces, and also addresses diversity through an additional distribution-controlling noise variable. GeoLDM maps molecules to a lower-dimensional latent space for more efficient diffusion. MiDi uses a &ldquo;relaxed&rdquo; EGNN and jointly models 2D and 3D information through a graph representation capturing both spatial and connectivity data.</p>
<p>On the larger GEOM-Drugs dataset, performance drops for most models:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>99.8</td>
          <td>91.6</td>
          <td>77.8</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>&ndash;</td>
          <td>62.2</td>
          <td>99.5</td>
          <td>99.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>84.4</td>
          <td>&ndash;</td>
          <td>99.3</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>81.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<p>MiDi distinguishes itself by generating more stable complex molecules, though at the expense of validity. Models generally perform well on QM9 but show room for improvement on more complex GEOM-Drugs molecules.</p>
<h3 id="target-aware-molecule-design">Target-Aware Molecule Design</h3>
<p>Target-aware generation produces molecules for specific protein targets, using either ligand-based (LBDD) or structure-based (SBDD) approaches. SBDD methods have become more prevalent as protein structure information becomes increasingly available.</p>
<p><strong>Datasets</strong>: CrossDocked2020 (22.5M ligand-protein pairs), ZINC20, Binding MOAD.</p>
<p><strong>Metrics</strong>: Vina Score (docking energy), High Affinity Percentage, QED, SA Score (synthetic accessibility), Diversity (Tanimoto similarity).</p>
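<p>The diversity metric is typically an average pairwise Tanimoto distance over molecular fingerprints. A minimal sketch, representing each fingerprint as a set of on-bit indices (real pipelines would use e.g. RDKit Morgan fingerprints; the set representation here is an illustrative assumption):</p>

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto similarity: |intersection| / |union| of on bits
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def diversity(fingerprints):
    # Average pairwise (1 - Tanimoto) over all generated molecules
    pairs = [(a, b) for i, a in enumerate(fingerprints)
             for b in fingerprints[i + 1:]]
    return sum(1 - tanimoto(a, b) for a, b in pairs) / len(pairs)
```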
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Vina</th>
          <th>Affinity (%)</th>
          <th>QED</th>
          <th>SA</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffSBDD</td>
          <td>EGNN, Diffusion</td>
          <td>-7.333</td>
          <td>&ndash;</td>
          <td>0.467</td>
          <td>0.554</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Luo et al.</td>
          <td>SchNet</td>
          <td>-6.344</td>
          <td>29.09</td>
          <td>0.525</td>
          <td>0.657</td>
          <td>0.720</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>EGNN, Diffusion</td>
          <td>-6.3</td>
          <td>58.1</td>
          <td>0.48</td>
          <td>0.58</td>
          <td>0.72</td>
      </tr>
      <tr>
          <td>LiGAN</td>
          <td>CNN, VAE</td>
          <td>-6.144</td>
          <td>21.1</td>
          <td>0.39</td>
          <td>0.59</td>
          <td>0.66</td>
      </tr>
      <tr>
          <td>Pocket2Mol</td>
          <td>EGNN, MLP</td>
          <td>-5.14</td>
          <td>48.4</td>
          <td>0.56</td>
          <td>0.74</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugGPT is an LBDD autoregressive model using transformers on tokenized protein-ligand pairs. Among the SBDD models, LiGAN introduces a 3D CNN-VAE framework, Pocket2Mol emphasizes binding pocket geometry using an EGNN with geometric vector MLP layers, and Luo et al. model atomic probabilities in the binding site using SchNet. TargetDiff performs diffusion on an EGNN and optimizes binding affinity by reflecting low atom type entropy. DiffSBDD applies an inpainting approach by masking and replacing segments of ligand-protein complexes. DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high affinity. Interestingly, diffusion-based methods are outperformed by Pocket2Mol on drug-likeness metrics (QED and SA).</p>
<h3 id="molecular-conformation-generation">Molecular Conformation Generation</h3>
<p>Conformation generation involves producing 3D structures from 2D connectivity graphs. Models are evaluated on Coverage (COV, percentage of ground-truth conformations &ldquo;covered&rdquo; within an RMSD threshold) and Matching (MAT, average RMSD to closest ground-truth conformation).</p>
<p><strong>Datasets</strong>: GEOM-QM9, GEOM-Drugs, ISO17.</p>
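<p>Under the recall-style definitions above, COV and MAT can be sketched from a reference-by-generated RMSD matrix (illustrative pure Python; the <code>rmsd[i][j]</code> layout is my own assumption):</p>

```python
def cov_mat(rmsd, threshold=1.25):
    # rmsd[i][j]: RMSD (in angstroms) between ground-truth conformer i
    # and generated conformer j
    best = [min(row) for row in rmsd]  # closest generated match per reference
    # COV: fraction of references covered within the RMSD threshold
    cov = sum(b <= threshold for b in best) / len(best)
    # MAT: mean RMSD to the closest generated conformer
    mat = sum(best) / len(best)
    return cov, mat
```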
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>GEOM-QM9 COV (%)</th>
          <th>GEOM-QM9 MAT</th>
          <th>GEOM-Drugs COV (%)</th>
          <th>GEOM-Drugs MAT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Torsional Diff.</td>
          <td>Diffusion</td>
          <td>92.8</td>
          <td>0.178</td>
          <td>72.7*</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>DGSM</td>
          <td>MPNN, Diffusion</td>
          <td>91.49</td>
          <td>0.2139</td>
          <td>78.73</td>
          <td>1.0154</td>
      </tr>
      <tr>
          <td>GeoDiff</td>
          <td>GFN, Diffusion</td>
          <td>90.07</td>
          <td>0.209</td>
          <td>89.13</td>
          <td>0.8629</td>
      </tr>
      <tr>
          <td>ConfGF</td>
          <td>GIN, Diffusion</td>
          <td>88.49</td>
          <td>0.2673</td>
          <td>62.15</td>
          <td>1.1629</td>
      </tr>
      <tr>
          <td>GeoMol</td>
          <td>MPNN</td>
          <td>71.26</td>
          <td>0.3731</td>
          <td>67.16</td>
          <td>1.0875</td>
      </tr>
  </tbody>
</table>
<p>*Torsional Diffusion uses a 0.75 &Aring; threshold instead of the standard 1.25 &Aring; for GEOM-Drugs coverage, leading to a deflated score. It outperforms GeoDiff and GeoMol when evaluated at the same threshold.</p>
<p>Torsional Diffusion operates in the space of torsion angles rather than Cartesian coordinates, allowing for improved representation and fewer denoising steps. GeoDiff uses Euclidean-space diffusion, treating each atom as a particle and incorporating Markov kernels that preserve E(3) equivariance through a graph field network (GFN) layer.</p>
<h2 id="protein-generation-from-sequence-to-structure">Protein Generation: From Sequence to Structure</h2>
<h3 id="protein-representation-learning">Protein Representation Learning</h3>
<p>Representation learning creates embeddings for protein inputs to support downstream tasks. Models are evaluated on contact prediction, fold classification (at family, superfamily, and fold levels), and stability prediction (Spearman&rsquo;s $\rho$).</p>
<p>Key models include: UniRep (mLSTM RNN), ProtBERT (BERT applied to amino acid sequences), ESM-1B (33-layer, 650M parameter transformer), MSA Transformer (pre-trained on MSA input), and GearNET (Geo-EGNN using 3D structure with directed edges). OntoProtein and KeAP incorporate knowledge graphs for direct knowledge injection.</p>
<h3 id="protein-structure-prediction">Protein Structure Prediction</h3>
<p>Given an amino acid sequence, models predict 3D point coordinates for each residue. Evaluated using RMSD, GDT-TS, TM-score, and LDDT on CASP14 and CAMEO benchmarks.</p>
<p>AlphaFold2 is the landmark model, integrating MSA and pair representations through transformers with invariant point attention (IPA). ESMFold uses ESM-2 language model representations instead of MSAs, achieving faster processing. RoseTTAFold uses a three-track neural network learning from 1D sequence, 2D distance map, and 3D backbone coordinate information simultaneously. EigenFold uses diffusion, representing the protein as a system of harmonic oscillators.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>CAMEO RMSD</th>
          <th>CAMEO TMScore</th>
          <th>CAMEO GDT-TS</th>
          <th>CAMEO lDDT</th>
          <th>CASP14 TMScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AlphaFold2</td>
          <td>Transformer</td>
          <td>3.30</td>
          <td>0.87</td>
          <td>0.86</td>
          <td>0.90</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>ESMFold</td>
          <td>Transformer</td>
          <td>3.99</td>
          <td>0.85</td>
          <td>0.83</td>
          <td>0.87</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>RoseTTAFold</td>
          <td>Transformer</td>
          <td>5.72</td>
          <td>0.77</td>
          <td>0.71</td>
          <td>0.79</td>
          <td>0.37</td>
      </tr>
      <tr>
          <td>EigenFold</td>
          <td>Diffusion</td>
          <td>7.37</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.78</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="sequence-generation-inverse-folding">Sequence Generation (Inverse Folding)</h3>
<p>Given a fixed protein backbone structure, models generate amino acid sequences that will fold into that structure. The space of possible sequences is estimated at between $10^{65}$ and $10^{130}$.</p>
<p>Evaluated using Amino Acid Recovery (AAR), diversity, RMSD, nonpolar loss, and perplexity (PPL):</p>
<p>$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_1, x_2, \ldots, x_{i-1})\right)$$</p>
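<p>Concretely (a small sketch of the standard definition; note that perplexity is the exponential of the negative mean log-likelihood, so lower is better):</p>

```python
import math

def perplexity(log_probs):
    # log_probs: per-token log-probabilities log P(x_i | x_1..x_{i-1})
    # PPL = exp(-(1/N) * sum of log-probabilities)
    return math.exp(-sum(log_probs) / len(log_probs))
```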
<p>ProteinMPNN is the current top performer, generating the most accurate sequences and leading in AAR, RMSD, and nonpolar loss. It uses a message-passing neural network with a flexible, order-agnostic autoregressive approach.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>AAR (%)</th>
          <th>Div.</th>
          <th>RMSD</th>
          <th>Non.</th>
          <th>Time (s)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ProteinMPNN</td>
          <td>MPNN</td>
          <td>48.7</td>
          <td>0.168</td>
          <td>1.019</td>
          <td>1.061</td>
          <td>112</td>
      </tr>
      <tr>
          <td>ESM-IF1</td>
          <td>Transformer</td>
          <td>47.7</td>
          <td>0.184</td>
          <td>1.265</td>
          <td>1.201</td>
          <td>1980</td>
      </tr>
      <tr>
          <td>GPD</td>
          <td>Transformer</td>
          <td>46.2</td>
          <td>0.219</td>
          <td>1.758</td>
          <td>1.333</td>
          <td>35</td>
      </tr>
      <tr>
          <td>ABACUS-R</td>
          <td>Transformer</td>
          <td>45.7</td>
          <td>0.124</td>
          <td>1.482</td>
          <td>0.968</td>
          <td>233280</td>
      </tr>
      <tr>
          <td>3D CNN</td>
          <td>CNN</td>
          <td>44.5</td>
          <td>0.272</td>
          <td>1.62</td>
          <td>1.027</td>
          <td>536544</td>
      </tr>
      <tr>
          <td>PiFold</td>
          <td>GNN</td>
          <td>42.8</td>
          <td>0.141</td>
          <td>1.592</td>
          <td>1.464</td>
          <td>221</td>
      </tr>
      <tr>
          <td>ProteinSolver</td>
          <td>GNN</td>
          <td>24.6</td>
          <td>0.186</td>
          <td>5.354</td>
          <td>1.389</td>
          <td>180</td>
      </tr>
  </tbody>
</table>
<p>Results are from the independent benchmark by Yu et al. GPD remains the fastest method, generating sequences around three times faster than ProteinMPNN. Current SOTA models recover fewer than half of target amino acid residues, indicating room for improvement.</p>
<h3 id="backbone-design">Backbone Design</h3>
<p>Backbone design creates protein structures from scratch, representing the core of de novo protein design. Models generate coordinates for backbone atoms (nitrogen, alpha-carbon, carbonyl, oxygen) and use external tools like Rosetta for side-chain packing.</p>
<p>Two evaluation paradigms exist: context-free generation (evaluated by self-consistency TM, or scTM) and context-given generation (inpainting, evaluated by AAR, PPL, RMSD).</p>
<p>ProtDiff represents residues as 3D Cartesian coordinates and uses particle-filtering diffusion. FoldingDiff instead uses an angular representation (six angles per residue) with a BERT-based DDPM. LatentDiff embeds proteins into a latent space using an equivariant autoencoder, then applies equivariant diffusion, analogous to GeoLDM for molecules. These early models work well for short proteins (up to 128 residues) but struggle with longer structures.</p>
<p>Frame-based methods address this scaling limitation. Genie uses Frenet-Serret frames with paired residue representations and IPA for noise prediction. FrameDiff parameterizes backbone structures on the $SE(3)^N$ manifold of frames using a score-based generative model. RFDiffusion is the current leading model, combining RoseTTAFold structure prediction with diffusion. It fine-tunes RoseTTAFold weights on a masked input sequence and random noise coordinates, using &ldquo;self-conditioning&rdquo; on predicted structures. Protpardelle co-designs sequence and structure by creating a &ldquo;superposition&rdquo; over possible sidechain states and collapsing them during each iterative diffusion step.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>scTM (%)</th>
          <th>Design. (%)</th>
          <th>PPL</th>
          <th>AAR (%)</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RFDiffusion</td>
          <td>Diffusion</td>
          <td>&ndash;</td>
          <td>95.1</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Protpardelle</td>
          <td>Diffusion</td>
          <td>85</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FrameDiff</td>
          <td>Diffusion</td>
          <td>84</td>
          <td>48.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Genie</td>
          <td>Diffusion</td>
          <td>81.5</td>
          <td>79.0</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>LatentDiff</td>
          <td>EGNN, Diffusion</td>
          <td>31.6</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FoldingDiff</td>
          <td>Diffusion</td>
          <td>14.2</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>ProtDiff</td>
          <td>EGNN, Diffusion</td>
          <td>11.8</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>12.47*</td>
          <td>8.01*</td>
      </tr>
  </tbody>
</table>
<p>*ProtDiff context-given results are tested only on beta-lactamase metalloproteins from PDB.</p>
<h3 id="antibody-design">Antibody Design</h3>
<p>The survey covers antibody structure prediction, representation learning, and CDR-H3 generation. Antibodies are Y-shaped proteins with complementarity-determining regions (CDRs), where CDR-H3 is the most variable and functionally important region.</p>
<p>For CDR-H3 generation, models have progressed from sequence-based (LSTM) to structure-based (RefineGNN) and sequence-structure co-design approaches (MEAN, AntiDesigner, DiffAb). dyMEAN is the current leading model, providing an end-to-end method incorporating structure prediction, docking, and CDR generation into a single framework. Multiple sequence alignments (MSAs) cannot be used for antibody input, which makes general models like AlphaFold2 inefficient for antibody prediction. Specialized models like IgFold use sequence embeddings from AntiBERTy with invariant point attention to achieve faster antibody structure prediction.</p>
<h3 id="peptide-design">Peptide Design</h3>
<p>The survey briefly covers peptide generation, including models for therapeutic peptide generation (MMCD), peptide-protein interaction prediction (PepGB), peptide representation learning (PepHarmony), peptide sequencing (AdaNovo), and signal peptide prediction (PEFT-SP).</p>
<h2 id="current-trends-challenges-and-future-directions">Current Trends, Challenges, and Future Directions</h2>
<h3 id="current-trends">Current Trends</h3>
<p>The survey identifies several parallel trends across molecule and protein generation:</p>
<ol>
<li>
<p><strong>Shift from sequence to structure</strong>: In molecule generation, graph-based diffusion models (GeoLDM, MiDi, TargetDiff) now dominate. In protein generation, structure-based representation learning (GearNET) and diffusion-based backbone design (RFDiffusion) have overtaken sequence-only methods.</p>
</li>
<li>
<p><strong>Dominance of E(3) equivariant architectures</strong>: EGNNs appear across nearly all subtasks, reflecting the physical requirement that molecular and protein properties should be invariant to rotation and translation.</p>
</li>
<li>
<p><strong>Structure-based over ligand-based approaches</strong>: In target-aware molecule design, SBDD methods that use 3D protein structures demonstrate clear advantages over LBDD approaches that operate on amino acid sequences alone.</p>
</li>
</ol>
<h3 id="challenges">Challenges</h3>
<p><strong>For small molecule generation:</strong></p>
<ul>
<li><strong>Complexity</strong>: Models perform well on simple QM9 but struggle with complex GEOM-Drugs molecules.</li>
<li><strong>Applicability</strong>: Generating molecules with high binding affinity to targets remains difficult.</li>
<li><strong>Explainability</strong>: Methods are black-box, offering no insight into why generated molecules have desired properties.</li>
</ul>
<p><strong>For protein generation:</strong></p>
<ul>
<li><strong>Benchmarking</strong>: Protein generative tasks lack a standard evaluative procedure, with variance between each model&rsquo;s metrics and testing conditions.</li>
<li><strong>Performance</strong>: SOTA models still struggle with fold classification, gene ontology, and antibody CDR-H3 generation.</li>
</ul>
<p>The authors also note that many generative tasks are evaluated using predictive models (e.g., classifier networks for binding affinity or molecular properties). Improvements to these classification methods would lead to more precise alignment with real-world biological applications.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors identify increasing performance in existing tasks, defining more applicable tasks (especially in molecule-protein binding, antibody generation), and exploring entirely new areas of research as key future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a survey paper, this work does not produce new models, datasets, or experimental results. All benchmark numbers reported are from the original papers cited.</p>
<h3 id="data">Data</h3>
<p>The survey catalogs the following key datasets across subtasks:</p>
<table>
  <thead>
      <tr>
          <th>Subtask</th>
          <th>Datasets</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target-agnostic molecule</td>
          <td>QM9, <a href="/notes/chemistry/datasets/geom/">GEOM</a>-Drug</td>
          <td>QM9 from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>; GEOM-Drug for complex molecules</td>
      </tr>
      <tr>
          <td>Target-aware molecule</td>
          <td>CrossDocked2020, ZINC20, Binding MOAD</td>
          <td>CrossDocked2020 most used (22.5M pairs)</td>
      </tr>
      <tr>
          <td>Conformation generation</td>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a>-QM9, GEOM-Drugs, ISO17</td>
          <td>Conformer sets for molecules</td>
      </tr>
      <tr>
          <td>Protein structure prediction</td>
          <td>PDB, CASP14, CAMEO</td>
          <td>CASP biennial blind evaluation</td>
      </tr>
      <tr>
          <td>Protein sequence generation</td>
          <td>PDB, UniRef, UniParc, CATH, TS500</td>
          <td>CATH for domain classification</td>
      </tr>
      <tr>
          <td>Backbone design</td>
          <td>PDB, AlphaFoldDB, SCOP, CATH</td>
          <td>AlphaFoldDB for expanded structural coverage</td>
      </tr>
      <tr>
          <td>Antibody structure</td>
          <td>SAbDab, RAB</td>
          <td>SAbDab: all antibody structures from PDB</td>
      </tr>
      <tr>
          <td>Antibody CDR generation</td>
          <td>SAbDab, RAB, SKEMPI</td>
          <td>SKEMPI for affinity optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gersteinlab/GenAI4Drug">GenAI4Drug</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Organized repository of all covered sources</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, X., Dai, H., Knight, E., Wu, F., Li, Y., Li, T., &amp; Gerstein, M. (2024). A survey of generative AI for de novo drug design: New frontiers in molecule and protein generation. <em>Briefings in Bioinformatics</em>, 25(4), bbae338. <a href="https://doi.org/10.1093/bib/bbae338">https://doi.org/10.1093/bib/bbae338</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics, Volume 25, Issue 4, 2024.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2402.08703">arXiv: 2402.08703</a></li>
<li><a href="https://github.com/gersteinlab/GenAI4Drug">GitHub: GenAI4Drug</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11247410/">PMC: PMC11247410</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{tang2024survey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae338}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Language Models for De Novo Drug Design Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/</guid><description>Review of chemical language models for de novo drug design covering string representations, architectures, training strategies, and experimental validation.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-chemical-language-models-for-drug-design">A Systematization of Chemical Language Models for Drug Design</h2>
<p>This paper is a <strong>Systematization</strong> (minireview) that surveys the landscape of chemical language models (CLMs) for de novo drug design. It organizes the field along three axes: molecular string representations, deep learning architectures, and generation strategies (distribution learning, goal-directed, and conditional). The review also highlights experimental validations, current gaps, and future opportunities.</p>
<h2 id="why-chemical-language-models-matter-for-drug-design">Why Chemical Language Models Matter for Drug Design</h2>
<p>De novo drug design faces an enormous combinatorial challenge: the &ldquo;chemical universe&rdquo; is estimated to contain up to $10^{60}$ drug-like small molecules. Exhaustive enumeration is infeasible, and traditional design algorithms rely on hand-crafted assembly rules. Chemical language models address this by borrowing natural language processing techniques to learn the &ldquo;chemical language,&rdquo; generating molecules as string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, DeepSMILES) that satisfy both syntactic validity (chemically valid structures) and semantic correctness (desired pharmacological properties).</p>
<p>CLMs have gained traction because string representations are readily available for most molecular databases, generation is computationally cheap (one molecule per forward pass through a sequence model), and the same architecture can be applied to diverse tasks (property prediction, de novo generation, reaction prediction). At the time of this review, CLMs had produced experimentally validated bioactive molecules in several prospective studies, establishing them as practical tools for drug discovery.</p>
<h2 id="molecular-string-representations-smiles-deepsmiles-and-selfies">Molecular String Representations: SMILES, DeepSMILES, and SELFIES</h2>
<p>The review covers three main string representations used as input/output for CLMs:</p>
<p><strong>SMILES</strong> (Simplified Molecular-Input Line-Entry System) encodes hydrogen-depleted molecular graphs as strings in which atoms are denoted by atomic symbols, bonds and branching by punctuation, and ring openings/closures by numbers. SMILES strings are non-univocal (one molecule admits multiple valid strings), so canonicalization algorithms are needed to obtain unique representations. Multiple studies show that using randomized (non-canonical) SMILES for data augmentation improves CLM performance, with diminishing returns beyond 10- to 20-fold augmentation.</p>
<p><strong><a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a></strong> modifies SMILES to improve machine-readability by replacing the paired ring-opening/closure digits with a count-based system and using closing parentheses only (no opening ones). This reduces the frequency of syntactically invalid strings but does not eliminate them entirely.</p>
<p><strong><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a></strong> (Self-Referencing Embedded Strings) use a formal grammar that guarantees 100% syntactic validity of decoded molecules. Every SELFIES string maps to a valid molecular graph. However, SELFIES can produce chemically unrealistic molecules (e.g., highly strained ring systems), and the mapping between string edits and molecular changes is less intuitive than for SMILES.</p>
<p>The review notes a key tradeoff: SMILES offer a richer, more interpretable language with well-studied augmentation strategies, while SELFIES guarantee validity at the cost of chemical realism and edit interpretability.</p>
<h2 id="clm-architectures-and-training-strategies">CLM Architectures and Training Strategies</h2>
<h3 id="architectures">Architectures</h3>
<p>The review describes the main architectures used in CLMs:</p>
<p><strong>Recurrent Neural Networks (RNNs)</strong>, particularly LSTMs and GRUs, dominated early CLM work. These models process SMILES character-by-character and generate new strings autoregressively via next-token prediction. RNNs are computationally efficient and well-suited to the sequential nature of molecular strings.</p>
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode molecules into a continuous latent space and decode them back into strings. This enables smooth interpolation between molecules and latent-space optimization, but generated strings may be syntactically invalid.</p>
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> have been adapted for molecular string generation (e.g., <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>), though they face training instability and mode collapse challenges that limit their adoption.</p>
<p><strong>Transformers</strong> have emerged as an increasingly popular alternative, offering parallelized training and the ability to capture long-range dependencies in molecular strings. The review notes the growing relevance of Transformer-based CLMs, particularly for large-scale pretraining.</p>
<h3 id="generation-strategies">Generation Strategies</h3>
<p>The review organizes CLM generation into three categories:</p>
<ol>
<li>
<p><strong>Distribution learning</strong>: The model learns to reproduce the statistical distribution of a training set of molecules. No explicit scoring function is used during generation. The generated molecules are evaluated post-hoc by comparing their property distributions to the training set. This approach is end-to-end but provides no direct indication of individual molecule quality.</p>
</li>
<li>
<p><strong>Goal-directed generation</strong>: A pretrained CLM is steered toward molecules optimizing a specified scoring function (e.g., predicted bioactivity, physicochemical properties). Common approaches include reinforcement learning (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and variants), hill-climbing, and Bayesian optimization. Scoring functions provide direct quality signals but can introduce biases, shortcuts, and limited structural diversity.</p>
</li>
<li>
<p><strong>Conditional generation</strong>: An intermediate approach that learns a joint semantic space between molecular structures and desired properties. The desired property profile serves as an input &ldquo;prompt&rdquo; for generation (e.g., a protein target, gene expression signature, or 3D shape). This bypasses the need for external scoring functions but has seen limited experimental application.</p>
</li>
</ol>
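<p>The goal-directed loop described above can be sketched in a few lines. This is a toy illustration under assumed stand-ins (a four-letter token alphabet and a hypothetical scoring function that rewards one token type), not the actual SMILES-LSTM hill-climbing or REINVENT procedure:</p>

```python
import random

# Toy goal-directed hill-climbing loop (hypothetical setup): keep the
# top-scoring candidates, "generate" new ones by mutating them, repeat.

ALPHABET = "CNOS"  # stand-in token set, not a real SMILES grammar


def score(s: str) -> float:
    """Hypothetical scoring function: fraction of 'N' tokens, a proxy
    for whatever property the generator is being steered toward."""
    return s.count("N") / len(s)


def mutate(s: str) -> str:
    """Replace one random position with a random token."""
    i = random.randrange(len(s))
    return s[:i] + random.choice(ALPHABET) + s[i + 1:]


def hill_climb(population, n_steps=50, keep=10, offspring=5):
    for _ in range(n_steps):
        # select the best-scoring candidates (elitism: best never drops)...
        population.sort(key=score, reverse=True)
        elite = population[:keep]
        # ...then regenerate the population by mutating around them
        population = elite + [mutate(random.choice(elite))
                              for _ in range(keep * offspring)]
    return max(population, key=score)


random.seed(0)
start = ["".join(random.choices(ALPHABET, k=12)) for _ in range(50)]
best = hill_climb(start)
print(best, round(score(best), 2))
```

<p>A real system would replace <code>mutate</code> with samples from a fine-tuned generative model and <code>score</code> with a QSAR model or property predictor; the select-then-regenerate skeleton is the same.</p>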
<h3 id="transfer-learning-and-chemical-space-exploration">Transfer Learning and Chemical Space Exploration</h3>
<p>Transfer learning is the dominant paradigm for CLM-driven chemical space exploration. A large-scale pretraining step (on $10^5$ to $10^6$ molecules via next-character prediction) is followed by fine-tuning on a smaller set of molecules with desired properties (often 10 to $10^2$ molecules). Key findings from the literature:</p>
<ul>
<li>The minimum training set size depends on target molecule complexity and heterogeneity.</li>
<li>SMILES augmentation is most beneficial with small training sets (fewer than 10,000 molecules) and plateaus for large, structurally complex datasets.</li>
<li>Fine-tuning with as few as 10 to 100 molecules has produced experimentally validated bioactive designs.</li>
<li>Hyperparameter tuning has relatively little effect on overall CLM performance.</li>
</ul>
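<p>The pretrain-then-fine-tune recipe can be illustrated with a deliberately tiny stand-in for a CLM. The character-bigram &ldquo;model,&rdquo; the corpora, and the <code>weight</code> knob below are all hypothetical; a real CLM would be an RNN or Transformer trained by next-token prediction, but the two-stage data flow is the same:</p>

```python
import random
from collections import Counter, defaultdict

# Toy pretrain/fine-tune sketch: pretraining counts come from a large
# general corpus, fine-tuning overweights a small focused set.


def count_bigrams(strings):
    """Count next-character transitions, with ^ = start and $ = end."""
    counts = defaultdict(Counter)
    for s in strings:
        for a, b in zip("^" + s, s + "$"):
            counts[a][b] += 1
    return counts


def finetune(pretrain_counts, focused_set, weight=10):
    """Emulate extra fine-tuning epochs by adding weighted counts from
    the small focused set on top of the pretrained counts."""
    tuned = defaultdict(Counter)
    for a, c in pretrain_counts.items():
        tuned[a].update(c)
    for a, c in count_bigrams(focused_set).items():
        for b, n in c.items():
            tuned[a][b] += weight * n
    return tuned


def sample(counts, max_len=20, rng=random):
    """Autoregressively sample one string from the bigram model."""
    s, cur = "", "^"
    while len(s) < max_len:
        if not counts[cur]:
            break
        tokens, weights = zip(*counts[cur].items())
        cur = rng.choices(tokens, weights=weights)[0]
        if cur == "$":
            break
        s += cur
    return s


pretrained = count_bigrams(["CCO", "CCN", "CCC", "CCCO"])
model = finetune(pretrained, ["NNO"], weight=50)
print(sample(model, rng=random.Random(0)))
```

<p>After fine-tuning, sampled strings shift toward the statistics of the focused set while retaining the pretraining distribution as a prior, mirroring how a CLM fine-tuned on 10 to 100 actives biases generation toward their chemotype.</p>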
<h2 id="evaluating-clm-designs-and-experimental-validation">Evaluating CLM Designs and Experimental Validation</h2>
<p>The review identifies evaluation as a critical gap. CLMs are often benchmarked on &ldquo;toy&rdquo; properties such as calculated logP, molecular weight, or QED (quantitative estimate of drug-likeness). These metrics capture the ability to satisfy predefined criteria but fail to reflect real-world drug discovery complexity and may lead to trivial solutions.</p>
<p>Existing benchmarks (<a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>) enable comparability across independently developed approaches but do not fully address the quality of generated compounds. The review emphasizes that experimental validation is the ultimate test. At the time of writing, only a few prospective applications had been published:</p>
<ul>
<li>Dual modulator of <a href="https://en.wikipedia.org/wiki/Retinoid_X_receptor">retinoid X</a> and <a href="https://en.wikipedia.org/wiki/Peroxisome_proliferator-activated_receptor">PPAR</a> receptors (EC50 ranging from 0.06 to 2.3 uM)</li>
<li>Inhibitor of <a href="https://en.wikipedia.org/wiki/Pim_kinase">Pim1 kinase</a> and <a href="https://en.wikipedia.org/wiki/Cyclin-dependent_kinase_4">CDK4</a> (manually modified from generated design)</li>
<li>Natural-product-inspired <a href="https://en.wikipedia.org/wiki/RAR-related_orphan_receptor_gamma">RORgamma</a> agonist (EC50 = 0.68 uM)</li>
<li>Molecules designed via combined generative AI and on-chip synthesis</li>
</ul>
<p>The scarcity of experimental validations reflects the interdisciplinary expertise required and the time/cost of chemical synthesis.</p>
<h2 id="gaps-limitations-and-future-directions">Gaps, Limitations, and Future Directions</h2>
<p>The review identifies several key gaps and opportunities:</p>
<p><strong>Scoring function limitations</strong>: Current scoring functions struggle with activity cliffs and non-additive structure-activity relationships. Conditional generation methods may help overcome these limitations by learning direct structure-property mappings.</p>
<p><strong>Structure-based design</strong>: Generating molecules that match electrostatic and shape features of protein binding pockets holds promise for addressing unexplored targets. However, prospective applications have been limited, potentially due to bias in existing protein-ligand affinity datasets.</p>
<p><strong>Synthesizability</strong>: Improving the ability of CLMs to propose synthesizable molecules is expected to increase practical relevance. Automated synthesis platforms may help but could also limit accessible chemical space.</p>
<p><strong>Few-shot learning</strong>: Large-scale pretrained CLMs combined with few-shot learning approaches are expected to boost prospective applications.</p>
<p><strong>Extensions beyond small molecules</strong>: Extending chemical languages to more complex molecular entities (proteins with non-natural amino acids, crystals, supramolecular chemistry) is an open frontier.</p>
<p><strong>Failure modes</strong>: Several studies have documented failure modes in goal-directed generation, including model shortcuts (exploiting scoring function artifacts), limited structural diversity, and generation of chemically unrealistic molecules.</p>
<p><strong>Interdisciplinary collaboration</strong>: The review emphasizes that bridging deep learning, cheminformatics, and medicinal chemistry expertise is essential for translating CLM designs into real-world drug candidates.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper and does not present novel experimental data. The paper surveys results from the literature.</p>
<h3 id="algorithms">Algorithms</h3>
<p>No novel algorithms are introduced. The review categorizes existing approaches (RNNs, VAEs, GANs, Transformers) and generation strategies (distribution learning, goal-directed, conditional).</p>
<h3 id="models">Models</h3>
<p>No new models are presented. The paper references existing implementations including REINVENT, ORGAN, and various RNN-based and Transformer-based CLMs.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The review discusses existing benchmarks:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong>: Benchmarking suite for de novo molecular design</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></strong>: Benchmarking platform for molecular generation models</li>
<li><strong>QED</strong>: Quantitative estimate of drug-likeness</li>
<li>Various physicochemical property metrics (logP, molecular weight)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grisoni, F. (2023). Chemical language models for de novo drug design: Challenges and opportunities. <em>Current Opinion in Structural Biology</em>, 79, 102527. <a href="https://doi.org/10.1016/j.sbi.2023.102527">https://doi.org/10.1016/j.sbi.2023.102527</a></p>
<p><strong>Publication</strong>: Current Opinion in Structural Biology, Volume 79, April 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grisoni2023chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language models for de novo drug design: Challenges and opportunities}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Current Opinion in Structural Biology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{102527}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.sbi.2023.102527}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Avoiding Failure Modes in Goal-Directed Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/avoiding-failure-modes-goal-directed-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/avoiding-failure-modes-goal-directed-generation/</guid><description>Langevin et al. show that apparent failure modes in goal-directed molecular generation stem from QSAR model disagreement, not algorithmic flaws.</description><content:encoded><![CDATA[<h2 id="reinterpreting-goal-directed-generation-failures-as-qsar-model-issues">Reinterpreting Goal-Directed Generation Failures as QSAR Model Issues</h2>
<p>This is an <strong>Empirical</strong> study that challenges a widely cited finding about failure modes in goal-directed molecular generation. <a href="/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/">Renz et al. (2019)</a> had shown that when molecules are optimized against a machine learning scoring function, control models trained on the same data distribution assign much lower scores to the generated molecules. This was interpreted as evidence that generation algorithms exploit model-specific biases. Langevin et al. demonstrate that this divergence is already present in the original data distribution and is attributable to disagreement among the QSAR classifiers, not to flaws in the generation algorithms themselves.</p>
<h2 id="why-qsar-model-agreement-matters-for-molecular-generation">Why QSAR Model Agreement Matters for Molecular Generation</h2>
<p>Goal-directed generation uses a scoring function (typically a <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> model) to guide the design of molecules that maximize predicted activity. In the experimental framework from Renz et al., three Random Forest classifiers are trained: an optimization model $C_{opt}$ on Split 1, a model control $C_{mc}$ on Split 1 with a different random seed, and a data control $C_{dc}$ on Split 2. Each returns a confidence score ($S_{opt}$, $S_{mc}$, $S_{dc}$). The expectation is that molecules with high $S_{opt}$ should also score highly under $S_{mc}$ and $S_{dc}$, since all three models are trained on the same data distribution for the same target.</p>
<p>Renz et al. observed that during optimization, $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$, reaching substantially lower values. This was interpreted as goal-directed generation exploiting biases unique to the optimization model. The recommendation was to halt generation when control scores stop increasing, requiring a held-out dataset for a control model, which may not be feasible in low-data regimes.</p>
<p>The key insight of Langevin et al. is that nobody had checked whether this score disagreement existed before generation even began. If the classifiers already disagree on high-scoring molecules in the original dataset, the divergence during generation is expected behavior, not evidence of algorithmic failure.</p>
<h2 id="pre-existing-classifier-disagreement-explains-the-divergence">Pre-Existing Classifier Disagreement Explains the Divergence</h2>
<p>The core contribution is showing that the gap between optimization and control scores is a property of the QSAR models, not of the generation algorithms.</p>
<p>The authors introduce a held-out test set (10% of the data, used for neither training split) and augment it via Topliss tree enumeration to produce structural analogs for smoother statistical estimates. On this held-out set, they compute the Mean Average Difference (MAD) between $S_{opt}$ and control scores as a function of $S_{opt}$:</p>
<p>$$
\text{MAD}(x) = \frac{1}{|\{i : S_{opt}(x_i) \geq x\}|} \sum_{S_{opt}(x_i) \geq x} |S_{opt}(x_i) - S_{dc}(x_i)|
$$</p>
<p>On the three original datasets (DRD2, EGFR, JAK2), the MAD between $S_{opt}$ and $S_{dc}$ grows substantially with $S_{opt}$, reaching approximately 0.3 for the highest-scoring molecules. For EGFR, even the top molecules (with $S_{opt}$ between 0.5 and 0.6) have $S_{dc}$ below 0.2. This disagreement exists entirely within the original data distribution, before any generative algorithm is applied.</p>
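<p>A direct implementation of this diagnostic is short; the scores below are made-up numbers for illustration, not values from the paper:</p>

```python
# Mean Average Difference (MAD): for a threshold x, average |S_opt - S_dc|
# over held-out molecules whose optimization score is at least x.


def mad(s_opt, s_dc, x):
    pairs = [(o, d) for o, d in zip(s_opt, s_dc) if o >= x]
    if not pairs:
        return float("nan")  # no molecule scores above the threshold
    return sum(abs(o - d) for o, d in pairs) / len(pairs)


# Hypothetical held-out scores: the models agree at low S_opt but
# disagree on the top scorers, as observed on DRD2/EGFR/JAK2.
s_opt = [0.1, 0.2, 0.5, 0.7, 0.9]
s_dc = [0.1, 0.2, 0.4, 0.4, 0.5]

print(mad(s_opt, s_dc, 0.0))  # disagreement averaged over everything
print(mad(s_opt, s_dc, 0.7))  # disagreement among top scorers only
```

<p>Rising MAD as the threshold increases is exactly the pattern the authors report: disagreement concentrated among the highest-scoring molecules, before any generation takes place.</p>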
<p>The authors formalize this with tolerance intervals. At each generation time step $t$, the distribution of optimization scores is $P_t[S_{opt}(x)]$. From the held-out set, the conditional distributions $P[S_{dc}(x) | S_{opt}(x)]$ and $P[S_{mc}(x) | S_{opt}(x)]$ are estimated empirically. The expected control scores at time $t$ are then:</p>
<p>$$
\mathbb{E}[S_{dc}] = \int P[S_{dc}(x) | S_{opt}(x)] \cdot P_t[S_{opt}(x)] \, dS_{opt}
$$</p>
<p>By sampling from these distributions, the authors construct 95% tolerance intervals for the expected control scores at each time step. The observed trajectories of $S_{mc}$ and $S_{dc}$ during generation fall within these intervals, demonstrating that the divergence is fully explained by pre-existing classifier disagreement.</p>
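<p>This construction can be mimicked with a small Monte-Carlo sketch. The conditional distribution below is a made-up Gaussian stand-in for the empirical $P[S_{dc}(x) | S_{opt}(x)]$, chosen so that control scores sit below $S_{opt}$ at the high end; the paper instead estimates this distribution from the held-out set:</p>

```python
import random
import statistics

# Monte-Carlo tolerance interval: given a sampler for the control score
# conditioned on S_opt, repeatedly draw control scores for the current
# population and read off an empirical interval on the mean.


def expected_control_interval(pop_s_opt, conditional_sampler,
                              n_rounds=2000, coverage=0.95, rng=None):
    rng = rng or random.Random(0)
    means = []
    for _ in range(n_rounds):
        draws = [conditional_sampler(s, rng) for s in pop_s_opt]
        means.append(statistics.fmean(draws))
    means.sort()
    lo = means[int((1 - coverage) / 2 * n_rounds)]
    hi = means[int((1 + coverage) / 2 * n_rounds) - 1]
    return lo, hi


def sampler(s_opt, rng):
    """Hypothetical conditional: control score centered below S_opt,
    mimicking pre-existing classifier disagreement at high scores."""
    return max(0.0, min(1.0, rng.gauss(0.7 * s_opt, 0.05)))


population = [0.8, 0.85, 0.9, 0.95]  # S_opt of generated molecules
lo, hi = expected_control_interval(population, sampler)
print(f"95% interval for mean control score: [{lo:.3f}, {hi:.3f}]")
```

<p>If the observed mean control score during generation falls inside this interval, the divergence from $S_{opt}$ is explained by pre-existing disagreement alone, which is what the authors find on all three original datasets.</p>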
<h2 id="experimental-setup-original-reproduction-and-corrected-experiments">Experimental Setup: Original Reproduction and Corrected Experiments</h2>
<h3 id="reproduction-of-renz-et-al">Reproduction of Renz et al.</h3>
<p>The original experimental framework uses three datasets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> (842 molecules, 59 actives), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> (842 molecules, 40 actives), and <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> (667 molecules, 140 actives). These are small, noisy, and chemically diverse. Three goal-directed generation algorithms are tested:</p>
<table>
  <thead>
      <tr>
          <th>Algorithm</th>
          <th>Type</th>
          <th>Mechanism</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph GA</td>
          <td>Genetic algorithm on molecular graphs</td>
          <td>Mutation and crossover of molecular graphs</td>
      </tr>
      <tr>
          <td>SMILES-LSTM</td>
          <td>Recurrent neural network</td>
          <td>Hill-climbing fine-tuning on best molecules</td>
      </tr>
      <tr>
          <td>MSO</td>
          <td>Particle swarm in CDDD latent space</td>
          <td>Multiple swarm optimization</td>
      </tr>
  </tbody>
</table>
<p>All algorithms are run for 151 epochs with 10 runs each. The reproduction confirms the findings of Renz et al.: $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$ during optimization.</p>
<h3 id="tolerance-interval-analysis">Tolerance interval analysis</h3>
<p>The held-out set is augmented using Topliss tree enumeration on phenyl rings, providing structural analogs that are reasonable from a medicinal chemistry perspective. The optimization score range is divided into 25 equal bins, and for each molecule at each time step, 10 samples from the conditional control score distribution are drawn to construct empirical tolerance intervals.</p>
<h3 id="corrected-experiments-with-adequate-models">Corrected experiments with adequate models</h3>
<p>To test whether generation algorithms actually exploit biases when the classifiers agree, the authors construct two tasks where optimization and control models correlate well:</p>
<ol>
<li><strong>ALDH1 dataset</strong>: 464 molecules from LIT-PCBA, split using similarity-based pairing to maximize intra-pair chemical similarity. This ensures both splits sample similar chemistry.</li>
<li><strong>Modified JAK2</strong>: The same JAK2 dataset but with Random Forest hyperparameters adjusted (200 trees instead of 100, minimum 3 samples per leaf instead of 1) to reduce overfitting to spurious correlations.</li>
</ol>
<p>In both cases, $S_{opt}$, $S_{mc}$, and $S_{dc}$ agree well on the held-out test set. The starting population for generation is set to the held-out test set (rather than random ChEMBL molecules) to avoid building in a distribution shift.</p>
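<p>The similarity-based pairing idea can be sketched with set-based fingerprints. This is a guess at the general mechanism (greedily pair the most similar molecules, then put one member of each pair in each split), not the paper's exact procedure, and the integer feature-ID fingerprints are hypothetical stand-ins for ECFPs:</p>

```python
from itertools import combinations

# Similarity-based split: pairing the most similar molecules and sending
# one member of each pair to each split ensures both splits sample
# similar chemistry, so the two classifiers see comparable data.


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity on set fingerprints: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity_split(fps):
    remaining = set(range(len(fps)))
    split1, split2 = [], []
    while len(remaining) >= 2:
        # pick the most similar remaining pair...
        i, j = max(combinations(remaining, 2),
                   key=lambda p: tanimoto(fps[p[0]], fps[p[1]]))
        # ...and send one member to each split
        split1.append(i)
        split2.append(j)
        remaining -= {i, j}
    return split1, split2, list(remaining)  # leftover if odd count


fps = [frozenset({1, 2, 3}), frozenset({1, 2, 4}),
       frozenset({7, 8, 9}), frozenset({7, 8, 5})]
s1, s2, rest = similarity_split(fps)
print(s1, s2, rest)
```

<p>The greedy pairwise search is quadratic per pair and only workable for small datasets like ALDH1's 464 molecules, which is consistent with the authors' note that scaling to larger datasets is not demonstrated.</p>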
<h2 id="findings-no-algorithmic-failure-when-models-agree">Findings: No Algorithmic Failure When Models Agree</h2>
<p>On the corrected experimental setups (ALDH1 and modified JAK2), there is no major divergence between optimization and control scores during generation. The three algorithms produce molecules that score similarly under all three classifiers.</p>
<p>Key findings:</p>
<ol>
<li>
<p><strong>Pre-existing disagreement explains divergence</strong>: On all three original datasets, the divergence between $S_{opt}$ and control scores during generation falls within the tolerance intervals predicted from the initial data distribution alone. The generation algorithms are not exploiting model-specific biases beyond what already exists in the data.</p>
</li>
<li>
<p><strong>Split similarity bias is also pre-existing</strong>: Renz et al. observed that generated molecules are more similar to Split 1 (used to train $C_{opt}$) than Split 2. The authors show this bias is already present in the top-5 percentile of the held-out set: on EGFR and DRD2, high-scoring molecules are inherently more similar to Split 1.</p>
</li>
<li>
<p><strong>Appropriate model design resolves the issue</strong>: When Random Forest hyperparameters are chosen to avoid overfitting (more trees, higher minimum samples per leaf), or when data splits are constructed to be chemically balanced, the classifiers agree and the generation algorithms behave as expected.</p>
</li>
<li>
<p><strong>Quality problems remain independent</strong>: Even when optimization and control scores align, the generated molecules can still be poor drug candidates (unreactive, unsynthesizable, containing unusual fragments). The score divergence issue and the chemical quality issue are separate problems.</p>
</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations acknowledged by the authors</h3>
<ul>
<li>The study focuses on Random Forest classifiers with ECFP fingerprints. The behavior of other model types (e.g., graph neural networks) and descriptor types is not fully explored, though supplementary results show similar patterns with physico-chemical descriptors and Atom-Pair fingerprints.</li>
<li>The corrected ALDH1 task uses a relatively small dataset (464 molecules) with careful split construction. Scaling this approach to larger, more heterogeneous datasets is not demonstrated.</li>
<li>The authors note that their results do not prove generation algorithms never exploit biases; they show that the specific evidence from Renz et al. can be explained without invoking algorithmic failure.</li>
<li>The problem of low-quality generated molecules (poor synthesizability, unusual fragments) remains unresolved and is acknowledged as an open question.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Original tasks</td>
          <td>DRD2, EGFR, JAK2</td>
          <td>842, 842, 667 molecules</td>
          <td>Extracted from ChEMBL; small with few actives</td>
      </tr>
      <tr>
          <td>New task</td>
          <td>ALDH1</td>
          <td>464 molecules (173 with purine substructure)</td>
          <td>Extracted from LIT-PCBA; similarity-based split</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>Topliss tree analogs</td>
          <td>~10x augmentation of held-out set</td>
          <td>Structural analogs via phenyl ring enumeration</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Three goal-directed generation algorithms from the original Renz et al. study:</p>
<ul>
<li><strong>Graph GA</strong>: Genetic algorithm on molecular graphs (Jensen, 2019)</li>
<li><strong>SMILES-LSTM</strong>: Hill-climbing on LSTM-generated SMILES (Segler et al., 2018)</li>
<li><strong>MSO</strong>: Multi-Swarm Optimization in CDDD latent space (Winter et al., 2019)</li>
</ul>
<p>All run for 151 epochs, 10 runs each.</p>
<h3 id="models">Models</h3>
<p>Random Forest classifiers (scikit-learn) with:</p>
<ul>
<li>ECFP fingerprints (radius 2, 1024 bits, RDKit)</li>
<li>Default parameters for original tasks</li>
<li>Modified parameters for JAK2 correction: 200 trees, min 3 samples per leaf</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Purpose</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean Average Difference (MAD)</td>
          <td>Measures disagreement between optimization and control scores</td>
          <td>Computed as function of $S_{opt}$ on held-out set</td>
      </tr>
      <tr>
          <td>95% tolerance intervals</td>
          <td>Expected range of control scores given optimization scores</td>
          <td>Empirical, constructed from held-out set</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Split bias assessment</td>
          <td>Morgan fingerprints, radius 2, 1024 bits</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classifier predictive performance</td>
          <td>Used to verify models have comparable accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Sanofi-Public/IDD-papers-avoiding_failure_modes">Code and datasets</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Fork of Renz et al. codebase with modifications</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Langevin, M., Vuilleumier, R., &amp; Bianciotto, M. (2022). Explaining and avoiding failure modes in goal-directed generation of small molecules. <em>Journal of Cheminformatics</em>, 14, 20. <a href="https://doi.org/10.1186/s13321-022-00601-y">https://doi.org/10.1186/s13321-022-00601-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{langevin2022explaining,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Explaining and avoiding failure modes in goal-directed generation of small molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Langevin, Maxime and Vuilleumier, Rodolphe and Bianciotto, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00601-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPECTRA: Evaluating Generalizability of Molecular AI</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/spectra-evaluating-generalizability-molecular-ai/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/spectra-evaluating-generalizability-molecular-ai/</guid><description>SPECTRA evaluates ML model generalizability on molecular datasets by plotting performance across a spectrum of train-test overlap levels.</description><content:encoded><![CDATA[<h2 id="a-spectral-framework-for-evaluating-molecular-ml-generalizability">A Spectral Framework for Evaluating Molecular ML Generalizability</h2>
<p>This is a <strong>Method</strong> paper that introduces SPECTRA (SPECtral framework for model evaluaTion on moleculaR dAtasets), a systematic approach for evaluating how well machine learning models generalize on molecular sequencing data. The primary contribution is a framework that generates train-test splits with controlled, decreasing levels of overlap, producing a spectral performance curve (SPC) and a single summary metric, the area under the spectral performance curve (AUSPC), for comparing model generalizability across tasks and architectures.</p>
<h2 id="why-existing-molecular-benchmarks-overestimate-generalizability">Why Existing Molecular Benchmarks Overestimate Generalizability</h2>
<p>Deep learning has achieved high performance on molecular sequencing benchmarks, but a persistent gap exists between benchmark performance and real-world deployment. The authors identify the root cause: existing evaluation approaches use either metadata-based (MB) splits or similarity-based (SB) splits, both of which provide an incomplete picture of generalizability.</p>
<p>MB splits partition data by metadata properties (e.g., temporal splits, random splits) without controlling sequence similarity between train and test sets. This means high train-test similarity can inflate performance metrics. SB splits control similarity at a single threshold, but the model&rsquo;s behavior at other similarity levels remains unknown.</p>
<p>For example, the TAPE benchmark&rsquo;s remote homology family split has 97% cross-split overlap, while the superfamily split has 71%. Model accuracy drops by 50% between these two points, yet the full curve of performance degradation is never characterized. This gap between evaluated and real-world overlap levels leads to overoptimistic deployment expectations, as demonstrated by the case of <a href="https://en.wikipedia.org/wiki/Rifampicin">rifampicin</a> resistance prediction in <em>M. tuberculosis</em>, where commercial genotypic assays later proved unreliable in specific geographic regions.</p>
<h2 id="the-spectra-framework-spectral-properties-graphs-and-performance-curves">The SPECTRA Framework: Spectral Properties, Graphs, and Performance Curves</h2>
<p>SPECTRA takes three inputs: a molecular sequencing dataset, a machine learning model, and a spectral property definition. A spectral property (SP) is a molecular sequence property expected to influence model generalizability for a specific task. For sequence-to-sequence datasets, the spectral property is typically sequence identity (proportion of aligned positions &gt; 0.3). For mutational scan datasets, it is defined by sample barcodes (string representations of mutations present in each sample).</p>
<h3 id="spectral-property-graph-construction">Spectral Property Graph Construction</h3>
<p>SPECTRA constructs a spectral property graph (SPG) where nodes represent samples and edges connect samples that share the spectral property. The goal is to generate train-test splits with controlled levels of cross-split overlap by finding approximate <a href="https://en.wikipedia.org/wiki/Maximal_independent_set">maximal independent sets</a> of this graph.</p>
<p>Finding the exact maximal independent set is NP-Hard, so SPECTRA uses a greedy randomized algorithm parameterized by a spectral parameter $\mathbf{SP} \in [0, 1]$:</p>
<ol>
<li>Randomly order SPG vertices</li>
<li>Select the first vertex and delete each neighbor with probability equal to $\mathbf{SP}$</li>
<li>Continue until no vertices remain</li>
</ol>
<p>When $\mathbf{SP} = 0$, this produces a random split (maximum cross-split overlap). When $\mathbf{SP} = 1$, it approximates the maximal independent set (minimum cross-split overlap). For each spectral parameter value (incremented by 0.05 from 0 to 1), three splits with different random seeds are generated.</p>
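<p>The three steps above can be sketched in pure Python. This is an illustrative approximation rather than the authors&rsquo; implementation: the SPG is assumed to be an adjacency dictionary, and the function name is hypothetical:</p>

```python
import random

def spectra_split(adjacency, sp, seed=0):
    """Greedy randomized independent-set selection over a spectral
    property graph (SPG), parameterized by spectral parameter sp in [0, 1].

    adjacency: dict mapping each node to the set of its SPG neighbors.
    Returns the set of selected (kept) nodes.
    """
    rng = random.Random(seed)
    remaining = set(adjacency)
    order = list(remaining)
    rng.shuffle(order)                 # step 1: random vertex order
    selected = set()
    for v in order:
        if v not in remaining:
            continue                   # vertex was deleted as a neighbor
        selected.add(v)                # step 2: select the next vertex...
        remaining.discard(v)
        for u in adjacency[v]:         # ...and delete each neighbor w.p. sp
            if u in remaining and rng.random() < sp:
                remaining.discard(u)
    return selected                    # step 3: loop ran until no vertices remain
```

<p>The selected samples are then partitioned into train and test sets; at $\mathbf{SP} = 1$ neighbors are always deleted, so the kept set is an approximate maximal independent set.</p>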
<h3 id="the-spectral-performance-curve-and-auspc">The Spectral Performance Curve and AUSPC</h3>
<p>The model is trained and evaluated on each split. Plotting test performance against the spectral parameter produces the spectral performance curve (SPC). The area under this curve, the AUSPC, serves as a single summary metric for model generalizability that captures behavior across the full spectrum of train-test overlap.</p>
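<p>Given per-split test performances, the AUSPC can be approximated with the trapezoidal rule. The paper does not spell out its exact integration scheme, so the following is a sketch:</p>

```python
def auspc(spectral_params, performances):
    """Area under the spectral performance curve via the trapezoidal rule.

    spectral_params: increasing spectral parameter values in [0, 1].
    performances: mean test performance at each spectral parameter.
    """
    area = 0.0
    for i in range(1, len(spectral_params)):
        width = spectral_params[i] - spectral_params[i - 1]
        area += width * (performances[i] + performances[i - 1]) / 2.0
    return area

# A model whose AUROC decays linearly from 0.9 to 0.5 across the spectrum:
sps = [0.0, 0.25, 0.5, 0.75, 1.0]
perf = [0.9, 0.8, 0.7, 0.6, 0.5]
print(round(auspc(sps, perf), 3))  # -> 0.7
```

<p>Because the spectral parameter spans the unit interval, the AUSPC of a linearly decaying curve equals its average performance.</p>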
<h3 id="handling-mutational-scan-datasets">Handling Mutational Scan Datasets</h3>
<p>For mutational scan datasets where sample barcodes map to multiple samples, SPECTRA introduces two modifications: (1) weighting nodes in the SPG by the number of samples they represent, and (2) running a subset sum algorithm to ensure 80/20 train-test splits by sample count.</p>
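<p>The paper runs a subset-sum algorithm for the 80/20 enforcement; the greedy largest-first heuristic below is a simplified stand-in to give the flavor. Names and the selection rule are illustrative, not the authors&rsquo; implementation:</p>

```python
def split_by_weight(group_sizes, test_frac=0.2):
    """Greedily assign barcode groups to the test set until it holds
    roughly `test_frac` of all samples, largest groups first.

    group_sizes: dict mapping barcode/group id -> number of samples.
    Returns (train_ids, test_ids).
    """
    target = test_frac * sum(group_sizes.values())
    test, filled = [], 0
    for gid, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        if filled + size <= target:   # group fits in the test budget
            test.append(gid)
            filled += size
    train = [g for g in group_sizes if g not in test]
    return train, test

groups = {"b1": 50, "b2": 30, "b3": 15, "b4": 5}   # 100 samples total
train, test = split_by_weight(groups)
print(test)  # -> ['b3', 'b4']  (20 of 100 samples)
```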
<h2 id="evaluation-across-18-datasets-and-19-models">Evaluation Across 18 Datasets and 19 Models</h2>
<p>The authors apply SPECTRA to 18 molecular sequencing datasets spanning three benchmarks (TAPE, PEER, ProteinGym) plus PDBBind, evaluating 19 models including CNNs, LSTMs, GNNs (GearNet), LLMs (ESM2), diffusion models (DiffDock), variational autoencoders (EVE), and logistic regression.</p>
<h3 id="benchmark-datasets">Benchmark Datasets</h3>
<p>The core evaluation covers five primary tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Dataset</th>
          <th>Type</th>
          <th>Metric</th>
          <th>Samples</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Rifampicin resistance (RIF)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>17,474</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Isoniazid">Isoniazid</a> resistance (INH)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>26,574</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Pyrazinamide">Pyrazinamide</a> resistance (PZA)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>12,146</td>
      </tr>
      <tr>
          <td>Fluorescence prediction</td>
          <td><a href="https://en.wikipedia.org/wiki/Green_fluorescent_protein">GFP</a> variants</td>
          <td>MSD</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>54,024</td>
      </tr>
      <tr>
          <td>Vaccine escape</td>
          <td>SARS-CoV-2 RBD</td>
          <td>MSD</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>438,046</td>
      </tr>
  </tbody>
</table>
<p>Additional benchmarks include remote homology detection, secondary structure prediction, subcellular localization, and protein-ligand binding (PDBBind, Astex diverse set, Posebusters).</p>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>Eight models were evaluated in depth across the five primary tasks: logistic regression, CNN, ESM2 (pretrained), ESM2-Finetuned, GearNet, GearNet-Finetuned, EVE, and SeqDesign. Additional models (LSTM, ResNet, DeepSF, Transformer, HHblits, Equibind, DiffDock, TankBind, Transception, MSA Transformer, ESM1v, Progen2) were evaluated on specific benchmark tasks.</p>
<h3 id="existing-splits-as-points-on-the-spc">Existing Splits as Points on the SPC</h3>
<p>SPECTRA reveals that existing benchmark splits correspond to specific points on the spectral performance curve. For instance:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Benchmark Split</th>
          <th>Cross-Split Overlap</th>
          <th>Spectral Parameter</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Remote homology</td>
          <td>TAPE family</td>
          <td>97%</td>
          <td>0.025</td>
      </tr>
      <tr>
          <td>Remote homology</td>
          <td>TAPE superfamily</td>
          <td>71%</td>
          <td>0.475</td>
      </tr>
      <tr>
          <td>Secondary structure</td>
          <td>CASP12</td>
          <td>48%</td>
          <td>0.5</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>Equibind temporal</td>
          <td>76%</td>
          <td>0.55</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>LPPDBind similarity</td>
          <td>91%</td>
          <td>0.275</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>Posebusters</td>
          <td>70%</td>
          <td>0.575</td>
      </tr>
  </tbody>
</table>
<h2 id="performance-degradation-and-foundation-model-insights">Performance Degradation and Foundation Model Insights</h2>
<h3 id="universal-performance-decline">Universal Performance Decline</h3>
<p>All evaluated models demonstrate decreased performance as cross-split overlap decreases. Logistic regression drops from AUROC &gt; 0.9 to 0.5 for rifampicin resistance. ESM2-Finetuned decreases from Spearman&rsquo;s $\rho &gt; 0.9$ to less than 0.4 for GFP fluorescence prediction.</p>
<p>No single model achieves the highest AUSPC across all tasks. CNN maintains AUSPC &gt; 0.6 across all tasks but is surpassed by ESM2-Finetuned and ESM2 on rifampicin resistance. Some models retain reasonable performance even at $\mathbf{SP} = 1$ (minimal overlap): ESM2, ESM2-Finetuned, and CNN maintain AUROC &gt; 0.7 for RIF and PZA at this extreme.</p>
<h3 id="uncovering-hidden-spectral-properties">Uncovering Hidden Spectral Properties</h3>
<p>SPECTRA can detect unconsidered spectral properties through high variance in model performance at fixed spectral parameters. For rifampicin resistance, the CNN shows high variance at $\mathbf{SP} = 0.9$, $0.95$, and $1.0$ (standard deviations of 0.09, 0.10, and 0.08 respectively).</p>
<p>The authors trace this to the rifampicin resistance determining region (RRDR), a 26-amino-acid region of the rpoB gene. They define diff-RRDR as:</p>
<p>$$
\text{diff-RRDR} = \left(\max\left(\text{position}_{\text{train}}\right) - \max\left(\text{position}_{\text{test}}\right)\right) + \left(\min\left(\text{position}_{\text{train}}\right) - \min\left(\text{position}_{\text{test}}\right)\right)
$$</p>
<p>diff-RRDR correlates with CNN performance variance (Spearman&rsquo;s $\rho = -0.51$, p-value $= 1.79 \times 10^{-5}$) but not with ESM2 performance. The authors attribute this to ESM2&rsquo;s larger context window (512 positions vs. CNN&rsquo;s 12), making it more invariant to positional shifts in resistance-determining mutations.</p>
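<p>Given the mutated rpoB positions present in each split, diff-RRDR is a one-liner; the position lists below are hypothetical:</p>

```python
def diff_rrdr(train_positions, test_positions):
    """diff-RRDR: difference in the extremes of the mutated RRDR
    positions covered by the train vs. test split."""
    return ((max(train_positions) - max(test_positions))
            + (min(train_positions) - min(test_positions)))

# Hypothetical mutated rpoB positions seen in each split:
train_pos = [430, 445, 450, 452]
test_pos = [435, 440, 448]
print(diff_rrdr(train_pos, test_pos))  # -> (452-448) + (430-435) = -1
```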
<h3 id="foundation-model-generalizability">Foundation Model Generalizability</h3>
<p>For protein foundation models, SPECTRA reveals that AUSPC correlates with the similarity between task-specific datasets and the pretraining dataset. ESM2&rsquo;s AUSPC varies from 0.91 (RIF) to 0.26 (SARS-CoV-2). The correlation between UniRef50 overlap and AUSPC is strong (Spearman&rsquo;s $\rho = 0.9$, p-value $= 1.4 \times 10^{-27}$).</p>
<p>This finding holds across multiple foundation models (Transception, MSA Transformer, ESM1v, Progen2) evaluated on five ProteinGym datasets (Spearman&rsquo;s $\rho = 0.9$, p-value $= 0.04$). Fine-tuning improves AUSPC for tasks with low pretraining overlap (PZA, SARS-CoV-2, GFP).</p>
<h3 id="computational-cost">Computational Cost</h3>
<p>Generating SPECTRA splits ranges from 5 minutes (amyloid beta aggregation) to 9 hours (PDBBind). Generating spectral performance curves ranges from 1 hour (logistic regression) to 5 days (ESM2-Finetuned). The authors recommend releasing SPECTRA splits alongside new benchmarks to amortize this cost.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Spectral property selection is pivotal</strong>: The choice of spectral property must be biologically informed and task-specific. Standardized definitions across the community are needed.</li>
<li><strong>Computational cost</strong>: Running SPECTRA is expensive, especially for large models. The authors mitigate this with multi-core CPU parallelization and multi-GPU training.</li>
<li><strong>Not a model ranking tool</strong>: SPECTRA is designed for understanding generalizability patterns, not for ranking models. Proper ranking requires averaging AUSPCs across many tasks in a standardized benchmark.</li>
<li><strong>Spectral parameter vs. cross-split overlap</strong>: The minimal achievable cross-split overlap varies across tasks, so SPECTRA plots performance against the spectral parameter rather than overlap directly. This means the AUSPC reflects relative impact on performance per unit decrease in overlap.</li>
</ul>
<p>The authors envision SPECTRA as a foundation for next-generation molecular benchmarks that explicitly characterize generalizability across the full spectrum of distribution shift, applicable beyond molecular data to small molecule therapeutics, inverse protein folding, and patient-level clinical datasets.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All data used in this study is publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>TB RIF resistance</td>
          <td>17,474 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TB INH resistance</td>
          <td>26,574 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TB PZA resistance</td>
          <td>12,146 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>GFP fluorescence</td>
          <td>54,024 samples</td>
          <td>From Sarkisyan et al. (2016)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>SARS-CoV-2 escape</td>
          <td>438,046 samples</td>
          <td>From Greaney et al. (2021)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>TAPE (remote homology, secondary structure)</td>
          <td>Various</td>
          <td>From Rao et al. (2019)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>PEER (subcellular localization)</td>
          <td>13,949 samples</td>
          <td>From Xu et al. (2022)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>ProteinGym (amyloid, RRM)</td>
          <td>Various</td>
          <td>From Notin et al. (2022)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>PDBBind (protein-ligand binding)</td>
          <td>14,993-16,742 complexes</td>
          <td>From Wang et al. (2005)</td>
      </tr>
  </tbody>
</table>
<p>Data is also available on <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN">Harvard Dataverse</a>.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Spectral property comparison uses Biopython pairwise alignment (match=1, mismatch=-2, gap=-2.5) with a 0.3 similarity threshold for sequence-to-sequence datasets</li>
<li>Greedy randomized maximal independent set approximation for split generation</li>
<li>Spectral parameter incremented in 0.05 steps from 0 to 1</li>
<li>Three random seeds per spectral parameter value</li>
<li>80/20 train-test split ratio enforced via subset sum for mutational scan datasets</li>
</ul>
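<p>As a rough stand-in for the Biopython aligner, a textbook Needleman-Wunsch global alignment with the same scoring parameters (match=1, mismatch=-2, gap=-2.5) can produce the sequence identity that is thresholded at 0.3. This sketch uses a simple linear gap penalty, which may differ from the paper&rsquo;s exact gap handling:</p>

```python
def global_align_identity(a, b, match=1.0, mismatch=-2.0, gap=-2.5):
    """Needleman-Wunsch global alignment, returning sequence identity:
    the fraction of alignment columns where the two sequences agree."""
    n, m = len(a), len(b)
    # DP score matrix with a linear gap penalty.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback, counting matching columns and total columns.
    i, j, matches, cols = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        cols += 1
    return matches / cols

# Two sequences differing by a single-residue deletion: 5 of 6 columns match.
print(round(global_align_identity("ACGTAC", "ACGAC"), 2))  # -> 0.83
```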
<h3 id="models">Models</h3>
<ul>
<li>ESM2: 650M parameter version from Lin et al. (2023)</li>
<li>ESM2-Finetuned: First 30 layers frozen, masked language head replaced with linear prediction layer</li>
<li>GearNet and GearNet-Finetuned: Protein structures generated via ESMFold</li>
<li>CNN: Architecture from Green et al. (2022), one-hot encoded sequences</li>
<li>Logistic regression: One-hot encoded mutational barcodes</li>
<li>EVE and SeqDesign: MSAs constructed via Jackhmmer against UniRef100</li>

</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>TB resistance (RIF, INH, PZA)</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>GFP fluorescence, SARS-CoV-2 escape</td>
          <td>Regression tasks</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Remote homology, secondary structure, subcellular localization</td>
          <td>Per-label/class accuracy</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Protein-ligand binding</td>
          <td>Predicted vs. actual complex</td>
      </tr>
      <tr>
          <td>AUSPC</td>
          <td>All tasks</td>
          <td>Area under spectral performance curve</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Most models: 1x Tesla A10 GPU</li>
<li>ESM2-Finetuned: 4x Tesla A100 GPUs on Azure cluster</li>
<li>Hyperparameter optimization: Weights &amp; Biases random search over learning rate</li>
<li>All code in PyTorch</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mims-harvard/SPECTRA">SPECTRA Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework implementation and reproduction scripts</td>
      </tr>
      <tr>
          <td><a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN">Harvard Dataverse</a></td>
          <td>Dataset</td>
          <td>CC0 1.0</td>
          <td>All datasets and generated splits</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ektefaie, Y., Shen, A., Bykova, D., Marin, M. G., Zitnik, M., &amp; Farhat, M. (2024). Evaluating generalizability of artificial intelligence models for molecular datasets. <em>Nature Machine Intelligence</em>, 6(12), 1512-1524. <a href="https://doi.org/10.1038/s42256-024-00931-6">https://doi.org/10.1038/s42256-024-00931-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ektefaie2024evaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evaluating generalizability of artificial intelligence models for molecular datasets}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin, Maximillian G. and Zitnik, Marinka and Farhat, Maha}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1512--1524}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-024-00931-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PMO: Benchmarking Sample-Efficient Molecular Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/</guid><description>PMO benchmarks 25 molecular optimization algorithms across 23 tasks under a 10K oracle budget, finding older methods like REINVENT still lead.</description><content:encoded><![CDATA[<h2 id="a-standardized-benchmark-for-molecular-optimization">A Standardized Benchmark for Molecular Optimization</h2>
<p>This is a <strong>Resource</strong> paper that introduces PMO (Practical Molecular Optimization), an open-source benchmark for evaluating molecular optimization algorithms with a focus on sample efficiency. The primary contribution is not a new algorithm but a comprehensive evaluation framework that exposes blind spots in how the field measures progress. By benchmarking 25 methods across 23 oracle functions under a fixed budget of 10,000 oracle calls, the authors provide a standardized protocol for transparent and reproducible comparison of molecular design methods.</p>
<h2 id="the-missing-dimension-oracle-budget-in-molecular-design">The Missing Dimension: Oracle Budget in Molecular Design</h2>
<p>Molecular optimization is central to drug and materials discovery, and the field has seen rapid growth in computational methods. Despite this progress, the authors identify three persistent problems with how methods are evaluated:</p>
<ol>
<li>
<p><strong>Lack of oracle budget control</strong>: Most papers do not report how many candidate molecules were evaluated by the oracle to achieve their results, despite this number spanning orders of magnitude. In practice, the most valuable oracles (wet-lab experiments, high-accuracy simulations) are expensive, making sample efficiency critical.</p>
</li>
<li>
<p><strong>Trivial or self-designed oracles</strong>: Many papers only report on easy objectives like QED or penalized LogP, or introduce custom tasks that make cross-method comparison impossible.</p>
</li>
<li>
<p><strong>Insufficient handling of randomness</strong>: Many algorithms are stochastic, yet existing benchmarks examined no more than five methods and rarely reported variance across independent runs.</p>
</li>
</ol>
<p>Prior benchmarks such as <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, Therapeutics Data Commons (TDC), and Tripp et al.&rsquo;s analysis each suffer from at least one of these issues. PMO addresses all three simultaneously.</p>
<h2 id="the-pmo-benchmark-design">The PMO Benchmark Design</h2>
<p>The core innovation of PMO is its evaluation protocol rather than any single algorithmic contribution. The benchmark enforces three design principles:</p>
<p><strong>Oracle budget constraint</strong>: All methods are limited to 10,000 oracle calls. This is deliberately much smaller than the unconstrained budgets typical in the literature, reflecting the practical reality that experimental evaluations are costly.</p>
<p><strong>AUC-based metric</strong>: Instead of reporting only the final top-K score, PMO uses the area under the curve (AUC) of top-K average property value versus oracle calls:</p>
<p>$$
\text{AUC Top-}K = \int_{0}^{N} \bar{f}_{K}(n) \, dn
$$</p>
<p>where $\bar{f}_{K}(n)$ is the average property value of the top $K$ molecules found after $n$ oracle calls, and $N = 10{,}000$. The paper uses $K = 10$. This metric rewards methods that reach high property values quickly, not just those that eventually converge given enough budget. All AUC values are min-max scaled to [0, 1].</p>
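<p>The metric can be computed incrementally from the oracle call history. A sketch under two assumptions not fixed by the text above: oracle scores arrive already normalized to [0, 1], and the integral is divided by the budget $N$ as one way to realize the min-max scaling:</p>

```python
import heapq

def auc_top_k(scores, k=10, budget=10_000):
    """AUC of the running top-k mean property value vs. oracle calls,
    normalized by the budget so a run of all-1.0 oracle scores gives 1.0.

    scores: oracle values (in [0, 1]) in the order molecules were queried.
    """
    top = []          # min-heap holding the best k scores seen so far
    auc = 0.0
    for s in scores[:budget]:
        if len(top) < k:
            heapq.heappush(top, s)
        elif s > top[0]:
            heapq.heapreplace(top, s)
        auc += sum(top) / len(top)   # running top-k mean after this call
    # Calls past the end of `scores` keep the final top-k mean.
    if scores and len(scores) < budget:
        auc += (budget - len(scores)) * sum(top) / len(top)
    return auc / budget

# A method that finds score-1.0 molecules immediately scores a perfect 1.0:
print(auc_top_k([1.0] * 100, k=10, budget=100))  # -> 1.0
```

<p>This running formulation makes the metric&rsquo;s bias explicit: a method that wastes its early oracle calls is penalized for every subsequent call, even if it eventually converges to the same top-10.</p>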
<p><strong>Standardized data</strong>: All methods use only the ZINC 250K dataset (approximately 250,000 molecules) whenever a database is required, ensuring a level playing field.</p>
<p>The benchmark includes 23 oracle functions: QED, <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a>, <a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>-beta, <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a>, and 19 oracles from <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> covering multi-property objectives (MPOs) based on similarity, molecular weight, CLogP, and other pharmaceutically relevant criteria. All oracle scores are normalized to [0, 1].</p>
<h2 id="25-methods-across-nine-algorithm-families">25 Methods Across Nine Algorithm Families</h2>
<p>The benchmark evaluates 25 molecular optimization methods organized along two dimensions: molecular assembly strategy (SMILES, SELFIES, atom-level graphs, fragment-level graphs, synthesis-based) and optimization algorithm (GA, MCTS, BO, VAE, GAN, score-based modeling, hill climbing, RL, gradient ascent). Each method was hyperparameter-tuned on two held-out tasks (zaleplon_mpo and perindopril_mpo) and then evaluated across all 23 oracles for 5 independent runs.</p>
<p>The following table summarizes the top 10 methods by sum of mean AUC Top-10 across all 23 tasks:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Method</th>
          <th>Assembly</th>
          <th>Sum AUC Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>SMILES</td>
          <td>14.196</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Graph GA</td>
          <td>Fragments</td>
          <td>13.751</td>
      </tr>
      <tr>
          <td>3</td>
          <td>SELFIES-REINVENT</td>
          <td>SELFIES</td>
          <td>13.471</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GP BO</td>
          <td>Fragments</td>
          <td>13.156</td>
      </tr>
      <tr>
          <td>5</td>
          <td><a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a></td>
          <td>SELFIES</td>
          <td>13.024</td>
      </tr>
      <tr>
          <td>6</td>
          <td>LSTM HC</td>
          <td>SMILES</td>
          <td>12.223</td>
      </tr>
      <tr>
          <td>7</td>
          <td>SMILES GA</td>
          <td>SMILES</td>
          <td>12.054</td>
      </tr>
      <tr>
          <td>8</td>
          <td>SynNet</td>
          <td>Synthesis</td>
          <td>11.498</td>
      </tr>
      <tr>
          <td>9</td>
          <td>DoG-Gen</td>
          <td>Synthesis</td>
          <td>11.456</td>
      </tr>
      <tr>
          <td>10</td>
          <td>DST</td>
          <td>Fragments</td>
          <td>10.989</td>
      </tr>
  </tbody>
</table>
<p>The bottom five methods by overall ranking were GFlowNet-AL, Pasithea, JT-VAE, Graph MCTS, and MolDQN.</p>
<p>REINVENT is ranked first across all six metrics considered (AUC Top-1, AUC Top-10, AUC Top-100, Top-1, Top-10, Top-100). Graph GA is consistently second. Both methods were released several years before many of the methods they outperform, yet they are rarely used as baselines in newer work.</p>
<h2 id="key-findings-older-methods-win-and-selfies-offers-limited-advantage">Key Findings: Older Methods Win and SELFIES Offers Limited Advantage</h2>
<p>The benchmark yields several findings with practical implications:</p>
<p><strong>No method solves optimization within realistic budgets.</strong> None of the 25 methods can optimize the included objectives within hundreds of oracle calls (the scale at which experimental evaluations would be feasible), except for trivially easy oracles like QED, DRD2, and osimertinib_mpo.</p>
<p><strong>Older algorithms remain competitive.</strong> REINVENT (2017) and Graph GA (2019) outperform all newer methods tested, including those published at top AI conferences. The absence of standardized benchmarking had obscured this fact.</p>
<p><strong>SMILES versus SELFIES.</strong> <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> was designed to guarantee syntactically valid molecular strings, but head-to-head comparisons show that SELFIES-based variants of language model methods (REINVENT, LSTM HC, VAE) generally do not outperform their <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> counterparts. Modern language models learn SMILES grammar well enough that syntactic invalidity is no longer a practical issue. The one exception is genetic algorithms, where SELFIES-based GAs (<a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a>) outperform SMILES-based GAs, likely because SELFIES provides more intuitive mutation operations.</p>
<p><strong>Model-based methods need careful design.</strong> Model-based variants (GP BO relative to Graph GA, GFlowNet-AL relative to GFlowNet) do not consistently outperform their model-free counterparts. GP BO outperformed Graph GA in 12 of 23 tasks but trailed it in the summed AUC ranking used for the overall comparison, and GFlowNet-AL underperformed GFlowNet in nearly every task. The bottleneck is the quality of the predictive surrogate model, and naive surrogate integration can actually hurt performance.</p>
<p><strong>Oracle landscape determines method suitability.</strong> Clustering analysis of relative AUC Top-10 scores reveals clear patterns. String-based GAs excel on isomer-type oracles (which are sums of atomic contributions), while RL-based and fragment-based methods perform better on similarity-based MPOs. This suggests there is no single best algorithm, and method selection should be informed by the optimization landscape.</p>
<p><strong>Hyperparameter tuning and multiple runs are essential.</strong> Optimal hyperparameters differed substantially from default values in original papers. For example, REINVENT&rsquo;s performance is highly sensitive to its sigma parameter, and the best value under the constrained-budget setting is much larger than originally suggested. Methods like Graph GA and GP BO also show high variance across runs, underscoring the importance of reporting distributional outcomes rather than single-run results.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations: they cannot exhaustively tune every hyperparameter or include every variant of each method; the conclusion may be biased toward similarity-based oracles (which dominate the 23 tasks); important quantities like synthesizability and diversity are not thoroughly evaluated; and oracle calls from pre-training data in model-based methods are counted against the budget, which may disadvantage methods that could leverage prior data collection. For a follow-up study that adds property filters and diversity requirements to the PMO evaluation, see <a href="/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/">Re-evaluating Sample Efficiency</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecule library</td>
          <td>ZINC 250K</td>
          <td>~250,000 molecules</td>
          <td>Used for screening, pre-training generative models, and fragment extraction</td>
      </tr>
      <tr>
          <td>Oracle functions</td>
          <td>TDC / GuacaMol</td>
          <td>23 tasks</td>
          <td>All scores normalized to [0, 1]</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>25 molecular optimization methods spanning 9 algorithm families and 5 molecular assembly strategies. Each method was hyperparameter-tuned on 2 held-out tasks (zaleplon_mpo, perindopril_mpo) using 3 independent runs, then evaluated on all 23 tasks with 5 independent runs each.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-K</td>
          <td>Area under curve of top-K average vs. oracle calls</td>
          <td>Primary metric; K=10; min-max scaled to [0, 1]</td>
      </tr>
      <tr>
          <td>Top-K</td>
          <td>Final top-K average property value at 10K calls</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>Sum rank</td>
          <td>Sum of AUC Top-10 across all 23 tasks</td>
          <td>Used for overall ranking</td>
      </tr>
  </tbody>
</table>
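<p>As a concrete illustration of the primary metric, the sketch below computes AUC Top-K from a stream of oracle scores. This is not the reference implementation (the official PMO code may, for instance, sample the curve at fixed intervals of oracle calls); it assumes scores are already normalized to [0, 1] and pads early-stopping runs with their final value, one common convention.</p>

```python
import heapq

def auc_top_k(scores, k=10, budget=10_000):
    """Sketch of the PMO-style AUC Top-K metric: track the running
    mean of the best k oracle scores as a function of oracle calls,
    then average that curve over the full budget so the result lands
    in [0, 1] when scores are normalized to [0, 1]."""
    top_k = []   # min-heap holding the best k scores seen so far
    curve = []
    for s in scores[:budget]:
        if len(top_k) < k:
            heapq.heappush(top_k, s)
        elif s > top_k[0]:
            heapq.heapreplace(top_k, s)
        curve.append(sum(top_k) / len(top_k))
    if not curve:
        return 0.0
    # Pad runs that stop before exhausting the budget with their
    # final value, so early termination is not rewarded.
    curve += [curve[-1]] * (budget - len(curve))
    return sum(curve) / budget
```

<p>Since methods like Graph GA and GP BO show high run-to-run variance, this quantity would be computed per run and reported as a distribution (e.g., mean and standard deviation over the 5 independent runs) rather than as a single number.</p>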
<h3 id="hardware">Hardware</h3>
<p>The paper states hardware details are in Appendix C.2. The benchmark runs on standard compute infrastructure and does not require GPUs for most methods. Specific compute requirements vary by method.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">mol_opt</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full benchmark implementation with all 25 methods</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Results_for_practival_molecular_optimization_PMO_benchmark/20123453">Benchmark results</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>All experimental results from the paper</td>
      </tr>
      <tr>
          <td><a href="https://tdcommons.ai">TDC</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Oracle functions and evaluation infrastructure</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{gao2022sample,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Wenhao and Fu, Tianfan and Sun, Jimeng and Coley, Connor W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{35}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{21342--21357}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, W., Fu, T., Sun, J., &amp; Coley, C. W. (2022). Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization. <em>Advances in Neural Information Processing Systems</em>, 35, 21342-21357. <a href="https://arxiv.org/abs/2206.12411">https://arxiv.org/abs/2206.12411</a></p>
<p><strong>Publication</strong>: NeurIPS 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark Code (GitHub)</a></li>
<li><a href="https://figshare.com/articles/dataset/Results_for_practival_molecular_optimization_PMO_benchmark/20123453">Benchmark Results (Figshare)</a></li>
<li><a href="https://tdcommons.ai">Therapeutics Data Commons</a></li>
</ul>
]]></content:encoded></item><item><title>MolScore: Scoring and Benchmarking for Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molscore-scoring-benchmarking-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molscore-scoring-benchmarking-framework/</guid><description>MolScore provides a unified, open-source Python framework for scoring, evaluating, and benchmarking generative models applied to de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-unified-resource-for-generative-molecular-design">A Unified Resource for Generative Molecular Design</h2>
<p>MolScore is a <strong>Resource</strong> paper that introduces an open-source Python framework for scoring, evaluating, and benchmarking generative models in de novo drug design. The primary contribution is the software itself: a modular, configurable platform that consolidates functionality previously scattered across multiple tools (GuacaMol, MOSES, MolOpt, REINVENT, TDC) into a single package. MolScore provides scoring functions for molecular optimization, evaluation metrics for assessing the quality of generated molecules, and a benchmark mode for standardized comparison of generative models.</p>
<h2 id="the-fragmented-landscape-of-generative-model-evaluation">The Fragmented Landscape of Generative Model Evaluation</h2>
<p>Generative models for molecular design have proliferated rapidly, but evaluating and comparing them remains difficult. Existing benchmarks each address only part of the problem:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong> provides 20 fixed optimization objectives but cannot separate top-performing models on most tasks, and custom objectives require code modification.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></strong> focuses on distribution-learning metrics but does not support molecular optimization.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">MolOpt</a></strong> extends benchmark evaluation to 25 generative approaches but lacks evaluation of the quality of generated chemistry.</li>
<li><strong>Docking benchmarks</strong> (<a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">smina-docking-benchmark</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/dockstring-docking-benchmarks-ligand-design/">DOCKSTRING</a>, TDC) test structure-based scoring but often lack proper ligand preparation, leading generative models to exploit non-holistic objectives by generating large or greasy molecules.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></strong> provides configurable scoring functions but is tightly coupled to its own generative model architecture.</li>
</ul>
<p>No single tool offered configurable objectives, comprehensive evaluation metrics, generative-model-agnostic design, and graphical user interfaces together. This fragmentation forces practitioners to write custom glue code and makes reproducible comparison across methods difficult.</p>
<h2 id="modular-architecture-for-scoring-evaluation-and-benchmarking">Modular Architecture for Scoring, Evaluation, and Benchmarking</h2>
<p>MolScore is split into two sub-packages:</p>
<h3 id="molscore-molecule-scoring">molscore: Molecule Scoring</h3>
<p>The <code>molscore</code> sub-package handles iterative scoring of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> generated by any generative model. The workflow for each iteration:</p>
<ol>
<li>Parse and validate SMILES via RDKit, canonicalize, and check intra-batch uniqueness.</li>
<li>Cross-reference against previously generated molecules to reuse cached scores (saving compute for expensive scoring functions like docking).</li>
<li>Run user-specified scoring functions on valid, unique molecules (invalid molecules receive a score of 0).</li>
<li>Transform each score to a 0-1 range using configurable transformation functions (normalize, linear threshold, Gaussian threshold, step threshold).</li>
<li>Aggregate transformed scores into a single desirability score using configurable aggregation (weighted sum, product, geometric mean, arithmetic mean, <a href="https://en.wikipedia.org/wiki/Pareto_front">Pareto front</a>, or auto-weighted variants).</li>
<li>Optionally apply diversity filters to penalize non-diverse molecules, or use any scoring function as a multiplicative filter.</li>
</ol>
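<p>The per-iteration loop above can be sketched as follows. This is an illustrative approximation, not MolScore's actual API: RDKit validation and canonicalization are replaced by simple deduplication, and the scoring functions, the linear-threshold transform, and the weighted geometric-mean aggregation are hypothetical stand-ins for the configurable options listed.</p>

```python
import math

def linear_threshold(x, lo, hi):
    """Map a raw score to [0, 1]: 0 below lo, 1 above hi, linear between."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def score_batch(smiles_batch, scorers, weights, cache):
    """One scoring iteration (illustrative only): enforce intra-batch
    uniqueness, reuse cached scores for previously seen molecules,
    score and transform, then aggregate with a weighted geometric
    mean. `scorers` maps a name to a (raw_fn, transform_fn) pair;
    all names here are hypothetical."""
    results = {}
    for smi in dict.fromkeys(smiles_batch):   # intra-batch uniqueness
        if smi in cache:                      # reuse expensive scores
            results[smi] = cache[smi]
            continue
        transformed = [tf(fn(smi)) for fn, tf in scorers.values()]
        total_w = sum(weights)
        # weighted geometric mean of transformed scores (clamped away
        # from zero so log() stays defined)
        agg = math.exp(sum(w * math.log(max(t, 1e-9))
                           for w, t in zip(weights, transformed)) / total_w)
        cache[smi] = agg
        results[smi] = agg
    return results
```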
<p>The full objective is specified in a single JSON configuration file, with a Streamlit GUI provided for interactive configuration writing. The available scoring functions span:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Examples</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>RDKit descriptors, linker descriptors, penalized logP</td>
      </tr>
      <tr>
          <td>Similarity</td>
          <td>Fingerprint similarity, ROCS, Open3DAlign, substructure matching</td>
      </tr>
      <tr>
          <td>Predictive models</td>
          <td>Scikit-learn models, PIDGINv5 (2,337 ChEMBL31 targets), ChemProp, ADMET-AI</td>
      </tr>
      <tr>
          <td>Docking</td>
          <td>Glide, PLANTS, GOLD, OEDock, Smina, Gnina, Vina, rDock</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>SA score, RA Score, AiZynthFinder, reaction filters</td>
      </tr>
  </tbody>
</table>
<p>Most scoring functions support multiprocessing, and computationally expensive functions (docking, ligand preparation) can be distributed across compute clusters via Dask.</p>
<h3 id="moleval-molecule-evaluation">moleval: Molecule Evaluation</h3>
<p>The <code>moleval</code> sub-package computes performance metrics on generated molecules relative to reference datasets. It extends the MOSES metric suite with additional intrinsic metrics (sphere exclusion diversity, scaffold uniqueness, functional group and ring system diversity, ZINC20 purchasability via molbloom) and extrinsic metrics (analogue similarity/coverage, functional group and ring system similarity, outlier bits or &ldquo;Silliness&rdquo;).</p>
<h3 id="benchmark-mode">Benchmark Mode</h3>
<p>A <code>MolScoreBenchmark</code> class iterates over a list of JSON configuration files, providing standardized comparison. Pre-built presets reimplement GuacaMol and MolOpt benchmarks, and users can define custom benchmark suites without writing code.</p>
<h2 id="case-studies-5-ht2a-ligand-design-and-fine-tuning-evaluation">Case Studies: 5-HT2A Ligand Design and Fine-Tuning Evaluation</h2>
<p>The authors demonstrate MolScore with a SMILES-based RNN generative model using <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb</a> for optimization, designing serotonin <a href="https://en.wikipedia.org/wiki/5-HT2A_receptor">5-HT2A</a> receptor ligands across three objective sets of increasing complexity.</p>
<h3 id="first-objective-set-basic-drug-properties">First Objective Set: Basic Drug Properties</h3>
<p>Four objectives combine predicted 5-HT2A activity (via PIDGINv5 random forest models at 1 uM) with synthesizability (RAscore) and/or <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a> permeability property ranges (<a href="https://en.wikipedia.org/wiki/Polar_surface_area">TPSA</a> &lt; 70, HBD &lt; 2, logP 2-4, MW &lt; 400). All objectives were optimized successfully, with diversity filters preventing mode collapse. Counterintuitively, the single objective of 5-HT2A activity alone proved hardest, primarily because the diversity filter penalized similar molecules more heavily on this relatively easy task.</p>
<h3 id="second-objective-set-selectivity">Second Objective Set: Selectivity</h3>
<p>Six objectives incorporate selectivity proxies using PIDGINv5 models for off-target prediction against <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">Class A GPCR</a> membrane receptors (266 models), the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">D2 dopamine receptor</a>, dopamine receptor family, serotonin receptor subtypes, and combinations. These proved substantially harder: selectivity against dopamine and serotonin receptor families combined was barely improved during optimization. Even with imperfect predictive models, the PIDGINv5 ensemble correctly identified 95 of 126 known selective 5-HT2A ligands. Nearest-neighbor analysis of de novo molecules (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> 0.3-0.6) showed they tended to be structurally simpler versions of known selective ligands.</p>
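<p>The nearest-neighbor analysis above relies on Tanimoto similarity between fingerprints; over fingerprints represented as sets of on-bit indices it reduces to the Jaccard index:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two molecular
    fingerprints given as sets of on-bit indices:
    |A intersect B| / |A union B|."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```

<p>In practice the fingerprints would come from a cheminformatics toolkit such as RDKit; the set representation here is just the minimal form of the same computation.</p>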
<h3 id="third-objective-set-structure-based-docking">Third Objective Set: Structure-Based Docking</h3>
<p>Two objectives use molecular docking via GlideSP into 5-HT2A (PDB: 6A93) and D2 (PDB: 6CM4) crystal structures with full ligand preparation (LigPrep for stereoisomer/tautomer/protonation state enumeration). Multi-parameter optimization includes docking score, D155 polar interaction constraint, formal charge, and consecutive rotatable bond limits. Single-target docking scores reached the mean of known ligands within 200 steps, but optimizing for divergent 5-HT2A vs D2 docking scores was much harder due to binding pocket similarity. Protein-ligand interaction fingerprint analysis (ProLIF) revealed that molecules optimized for selectivity avoided specific binding pocket regions shared between the two receptors.</p>
<h3 id="evaluation-case-study-fine-tuning-epochs">Evaluation Case Study: Fine-Tuning Epochs</h3>
<p>The moleval sub-package was used to track metrics across fine-tuning epochs of a SMILES RNN on A2A receptor ligands, showing that just one or two epochs sufficed to increase similarity to the fine-tuning set, while further epochs reduced novelty and diversity.</p>
<h2 id="configurable-benchmarking-with-practical-drug-design-relevance">Configurable Benchmarking with Practical Drug Design Relevance</h2>
<p>MolScore provides a more comprehensive platform than any single existing tool. Compared to prior work:</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>GuacaMol</th>
          <th>MOSES</th>
          <th>MolOpt</th>
          <th>TDC</th>
          <th>REINVENT</th>
          <th>MolScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Configurable objectives</td>
          <td>No</td>
          <td>N/A</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Optimization objectives</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Evaluation metrics</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Model-agnostic</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>GUI</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p>The framework integrates into any Python-based generative model in three lines of code. Dependency conflicts between scoring function libraries are handled by running conflicting components as local servers from isolated conda environments.</p>
<p>Key limitations acknowledged by the authors include: the assumption of conda for environment management, the inherent difficulty of designing non-exploitable objectives, and the fact that ligand-based predictive models may have limited applicability domains for out-of-distribution de novo molecules.</p>
<p>Future directions include accepting 3D molecular conformations as inputs, structure interaction fingerprint rescoring, and dynamic configuration files for curriculum learning.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL compounds</td>
          <td>Not specified</td>
          <td>Standard ChEMBL training set for SMILES RNN</td>
      </tr>
      <tr>
          <td>Evaluation reference</td>
          <td>5-HT2A ligands from ChEMBL31</td>
          <td>3,771 compounds</td>
          <td>Extracted for score distribution comparison</td>
      </tr>
      <tr>
          <td>Activity models</td>
          <td>PIDGINv5 on ChEMBL31</td>
          <td>2,337 target models</td>
          <td>Random forest classifiers at various concentration thresholds</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>A2A receptor ligands</td>
          <td>Not specified</td>
          <td>Used for moleval case study</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The generative model used in case studies is a SMILES-based RNN with Augmented Hill-Climb reinforcement learning. Diversity filters penalize non-diverse molecules during optimization. Score transformation functions (normalize, linear threshold, Gaussian threshold, step threshold) map raw scores to 0-1 range. Aggregation functions (arithmetic mean, weighted sum, product, geometric mean, Pareto front) combine multi-parameter objectives.</p>
<h3 id="models">Models</h3>
<p>PIDGINv5 provides 2,337 pre-trained random forest classifiers on ChEMBL31 targets. RAscore provides pre-trained synthesizability prediction. ADMET-AI and ChemProp models are supported via isolated environments. Docking uses GlideSP with LigPrep for ligand preparation in the structure-based case study.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Intrinsic metrics: validity, uniqueness, scaffold uniqueness, internal diversity, sphere exclusion diversity, Solow-Polasky diversity, scaffold diversity, functional group diversity, ring system diversity, MCF and <a href="https://en.wikipedia.org/wiki/Pan-assay_interference_compounds">PAINS</a> filters, ZINC20 purchasability.</p>
<p>Extrinsic metrics: novelty, <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, analogue similarity/coverage, functional group similarity, ring system similarity, SNN similarity, fragment similarity, scaffold similarity, outlier bits, Wasserstein distance on LogP/SA/NP/QED/MW.</p>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Docking-based objectives can be distributed across compute clusters via Dask.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Main framework, installable via pip</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore_examples">MolScore Examples</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration examples with SMILES-RNN, CReM, GraphGA</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2024). MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design. <em>Journal of Cheminformatics</em>, 16(1), 64. <a href="https://doi.org/10.1186/s13321-024-00861-w">https://doi.org/10.1186/s13321-024-00861-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2024molscore,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00861-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGenBench: Benchmarking Molecular Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgenbench-molecular-generative-models/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgenbench-molecular-generative-models/</guid><description>MolGenBench benchmarks 17 molecular generative models across 120 protein targets using novel metrics for target awareness, hit rates, and lead optimization.</description><content:encoded><![CDATA[<h2 id="a-comprehensive-benchmark-for-structure-based-molecular-generation">A Comprehensive Benchmark for Structure-Based Molecular Generation</h2>
<p>MolGenBench is a <strong>Resource</strong> paper that provides a large-scale, application-oriented benchmark for evaluating molecular generative models in the context of structure-based drug design (SBDD). The primary contribution is a dataset of 220,005 experimentally validated active molecules across 120 protein targets, organized into 5,433 chemical series, along with a suite of novel evaluation metrics. The benchmark addresses both <a href="https://en.wikipedia.org/wiki/De_novo_drug_design">de novo molecular design</a> and hit-to-lead (H2L) optimization, a critical drug discovery stage that existing benchmarks largely ignore.</p>
<h2 id="gaps-in-existing-molecular-generation-benchmarks">Gaps in Existing Molecular Generation Benchmarks</h2>
<p>Despite rapid progress in deep generative models for drug discovery, the evaluation landscape has not kept pace. The authors identify four categories of limitations in existing benchmarks:</p>
<ol>
<li>
<p><strong>Dataset construction</strong>: Existing benchmarks use overly stringent activity cutoffs and too few protein targets. The widely used CrossDocked2020 dataset contains very few reference ligands per target, making it difficult to evaluate whether a model can rediscover the full distribution of active compounds.</p>
</li>
<li>
<p><strong>Model selection</strong>: Prior benchmark studies evaluate a narrow range of architectures and do not systematically examine the effects of training data composition, prior knowledge integration, or architectural paradigm.</p>
</li>
<li>
<p><strong>Evaluation scenarios</strong>: Existing benchmarks focus exclusively on de novo generation. Hit-to-lead optimization, where a hit compound is refined through R-group modifications, remains unstandardized.</p>
</li>
<li>
<p><strong>Evaluation metrics</strong>: Standard metrics (QED, Vina score, SA score) correlate strongly with atom count and fail to assess target-specific generation capacity. The AddCarbon model illustrates this: simply adding random carbon atoms to training molecules achieves near-perfect scores on standard metrics while producing nonsensical chemistry.</p>
</li>
</ol>
<h2 id="novel-metrics-for-evaluating-molecular-generation">Novel Metrics for Evaluating Molecular Generation</h2>
<p>MolGenBench introduces three key metrics designed to capture aspects of model performance that existing metrics miss.</p>
<h3 id="target-aware-score-tascore">Target-Aware Score (TAScore)</h3>
<p>The TAScore measures whether a model generates target-specific molecules rather than generic structures. It compares the ratio of active molecule or scaffold recovery on a specific target to the background recovery across all targets:</p>
<p>$$
\text{TAScore}_{\text{label}, i} = \frac{S_{i} / S_{\text{all}}}{R_{i} / R_{\text{all}}}; \quad \text{label} \in \{\text{SMILES}, \text{scaffold}\}
$$</p>
<p>For target $i$: $R_{\text{all}}$ is the total number of distinct molecules generated across all 120 targets; $R_{i}$ is the subset matching known actives for target $i$ (without conditioning on target $i$); $S_{\text{all}}$ is the total generated when conditioned on target $i$; and $S_{i}$ is the subset matching known actives for target $i$. A TAScore above 1 indicates the model uses target-specific information effectively.</p>
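<p>In code, the definition above is a direct ratio of hit fractions (a sketch with hypothetical counts):</p>

```python
def ta_score(s_i, s_all, r_i, r_all):
    """TAScore for target i: the hit fraction when conditioning on
    target i (S_i / S_all) divided by the background hit fraction
    across all targets (R_i / R_all). Values above 1 indicate the
    model exploits target-specific information."""
    return (s_i / s_all) / (r_i / r_all)
```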
<h3 id="hit-rate">Hit Rate</h3>
<p>The hit rate quantifies the efficiency of active compound discovery:</p>
<p>$$
\text{HitRate}_{\text{label}} = \frac{\mathcal{M}_{\text{active}}}{\mathcal{M}_{\text{sampled}}}; \quad \text{label} \in \{\text{SMILES}, \text{scaffold}\}
$$</p>
<p>where $\mathcal{M}_{\text{active}}$ is the number of unique active molecules or scaffolds found, and $\mathcal{M}_{\text{sampled}}$ is the total number of generated molecules.</p>
<h3 id="mean-normalized-affinity-mna-score">Mean Normalized Affinity (MNA) Score</h3>
<p>For H2L optimization, the MNA Score measures whether models generate compounds with improved potency relative to the known activity range within each chemical series:</p>
<p>$$
\text{NA}_{g} = \frac{\text{Affinity}_{g}^{\text{series}} - \text{Affinity}_{\min}^{\text{series}}}{\text{Affinity}_{\max}^{\text{series}} - \text{Affinity}_{\min}^{\text{series}}}
$$</p>
<p>$$
\text{MNAScore} = \frac{1}{G} \sum_{g}^{G} \text{NA}_{g}
$$</p>
<p>This normalizes affinities to [0, 1] within each series, enabling cross-series comparison.</p>
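<p>A direct translation of the two equations above, using hypothetical affinity values. Note that a normalized affinity can exceed 1 when a generated compound beats the best known affinity in its series:</p>

```python
def mna_score(generated_affinities, series_min, series_max):
    """Mean Normalized Affinity for one series: min-max normalize
    each generated compound's affinity against the series' known
    activity range, then average over the G generated compounds."""
    span = series_max - series_min
    norm = [(a - series_min) / span for a in generated_affinities]
    return sum(norm) / len(norm)
```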
<h2 id="systematic-evaluation-of-17-generative-models-across-two-drug-discovery-scenarios">Systematic Evaluation of 17 Generative Models Across Two Drug Discovery Scenarios</h2>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>The MolGenBench dataset was built from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL v33</a>. Ligands failing RDKit validation were discarded, along with entries where binding affinity exceeded 10 uM. The 120 protein targets were selected based on minimum thresholds: at least 50 active molecules, at least 50 unique Bemis-Murcko scaffolds, and at least 20 distinct chemical series per target. For H2L optimization, maximum common substructures (MCS) were identified per series, with dual thresholds requiring the MCS to appear in over 80% of molecules and cover more than one-third of each molecule&rsquo;s atoms. The top 5 series per target (ranked by dockable ligands) formed the H2L test set: 600 compound series across 120 targets.</p>
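<p>The dual MCS thresholds can be expressed as a small check. This is a sketch with hypothetical inputs: the actual pipeline derives the MCS and per-molecule matches via substructure search (e.g., with RDKit), which is omitted here.</p>

```python
def mcs_passes(mcs_atoms, mol_atom_counts, match_flags,
               min_frac_mols=0.8, min_frac_atoms=1 / 3):
    """Check the dual MCS thresholds: the MCS must appear in more
    than min_frac_mols of the series' molecules, and must cover more
    than min_frac_atoms of each matched molecule's atoms.
    `match_flags[j]` says whether molecule j contains the MCS."""
    matched = [n for n, hit in zip(mol_atom_counts, match_flags) if hit]
    if len(matched) / len(mol_atom_counts) <= min_frac_mols:
        return False
    return all(mcs_atoms / n > min_frac_atoms for n in matched)
```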
<h3 id="evaluated-models">Evaluated Models</h3>
<p><strong>De novo models (10)</strong>: Pocket2Mol, TargetDiff, FLAG, DecompDiff, SurfGen, PocketFlow, MolCraft, <a href="/notes/chemistry/molecular-design/generation/target-aware/tamgen-target-aware-molecule-generation/">TamGen</a>, DiffSBDD-M (trained on BindingMOAD), DiffSBDD-C (trained on CrossDock). These span autoregressive, diffusion, and Bayesian flow network architectures.</p>
<p><strong>H2L models (7)</strong>: Fragment-based (DiffSBDD-M/C inpainting, Delete, DiffDec) and ligand-based (ShEPhERD, ShapeMol, PGMG). These use pharmacophore, surface, or shape priors.</p>
<p>Models were further stratified by whether test proteins appeared in their CrossDock training set (&ldquo;Proteins in CrossDock&rdquo; vs. &ldquo;Proteins Not in CrossDock&rdquo;), enabling direct measurement of generalization.</p>
<h3 id="evaluation-dimensions">Evaluation Dimensions</h3>
<p>The benchmark evaluates six dimensions:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Key Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Basic molecular properties</td>
          <td>Validity, QED, SA score, uniqueness, diversity, JSD alignment</td>
      </tr>
      <tr>
          <td>Chemical safety</td>
          <td>Industry-standard filter pass rates (Eli Lilly, Novartis, ChEMBL rules)</td>
      </tr>
      <tr>
          <td>Conformational quality</td>
          <td>PoseBusters pass rate, strain energy, steric clash frequency</td>
      </tr>
      <tr>
          <td>Active compound recovery</td>
          <td>Hit rate, hit fraction, active molecule and scaffold recovery counts</td>
      </tr>
      <tr>
          <td>Target awareness</td>
          <td>TAScore at molecule and scaffold levels</td>
      </tr>
      <tr>
          <td>Lead optimization</td>
          <td>MNA Score, number of series with hits</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-basic-properties-and-chemical-safety">Key Results: Basic Properties and Chemical Safety</h3>
<p>Most models generate drug-like molecules with reasonable QED (0.4-0.6) and SA scores (0.5-0.8). However, two models (FLAG, SurfGen) showed validity below 0.4. TamGen exhibited low uniqueness (~27%), suggesting overreliance on pretrained patterns.</p>
<p>Chemical filter pass rates revealed a more concerning picture: only TamGen and PGMG exceeded 50% of molecules passing all industry-standard filters. Most models fell below 40%, and some (FLAG, SurfGen) below 5%. Nearly 70% of reference active molecules passed the same filters, indicating models frequently generate high-risk compounds.</p>
<h3 id="key-results-conformational-quality">Key Results: Conformational Quality</h3>
<p>MolCraft achieved the highest PoseBusters validity (0.783 PB-valid score among valid molecules). PocketFlow, despite perfect SMILES validity, had fewer than half of its valid molecules pass conformational checks. Most models produced conformations with higher <a href="https://en.wikipedia.org/wiki/Strain_(chemistry)">strain energy</a> than those from <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>. Some models (MolCraft for de novo, DiffDec for H2L) surpassed Vina in minimizing steric clashes, suggesting advanced architectures can exceed the patterns in their training data.</p>
<h3 id="key-results-active-compound-recovery-and-hit-rates">Key Results: Active Compound Recovery and Hit Rates</h3>
<p>De novo models exhibited very low hit rates. The highest molecular hit rate among de novo models was 0.124% on proteins in CrossDock, dropping to 0.024% on unseen proteins. Scaffold-level hit rates were 10-fold higher, showing that generating pharmacologically plausible scaffolds is considerably easier than generating fully active molecules.</p>
<p>After removing molecules overlapping with the CrossDock training set, TamGen&rsquo;s recovery dropped substantially (from 30.3 to 18.7 targets), confirming significant memorization effects. On proteins not in CrossDock, half of the de novo models failed to recover any active molecules at all.</p>
<p>Fragment-based H2L models substantially outperformed both de novo models and ligand-based H2L approaches. Delete recovered active molecules in 44.3 series (out of 600), and DiffDec in 34.7 series.</p>
<h3 id="key-results-target-awareness">Key Results: Target Awareness</h3>
<p>Most de novo models failed the TAScore evaluation. PocketFlow showed the strongest target awareness at the scaffold level, with only 27% of targets showing TAScore &lt; 1 (indicating no target specificity). At the molecular level, results were even weaker: TamGen achieved TAScore &gt; 1 for only 30.6% of CrossDock-seen targets and just 4 out of 35 unseen targets. Most models generated structurally similar molecules regardless of which target they were conditioned on.</p>
<h3 id="key-results-h2l-optimization-mna-score">Key Results: H2L Optimization (MNA Score)</h3>
<p>DiffDec achieved the highest total active hits (121.7) and the best MNA Score (0.523), followed by Delete (104.7 hits, MNA Score 0.482). Ligand-based models (ShEPhERD, PGMG) recovered fewer hits but showed higher MNA Scores per hit, suggesting pharmacophore-based priors help prioritize more potent molecules when actives are found. The most successful model (Delete) achieved a hit in only 9.6% of series (57/600), indicating substantial room for improvement.</p>
<h2 id="critical-findings-and-limitations-of-current-molecular-generative-models">Critical Findings and Limitations of Current Molecular Generative Models</h2>
<p>The benchmark reveals several consistent limitations:</p>
<ol>
<li>
<p><strong>Low screening efficiency</strong>: De novo models achieve molecular hit rates below 0.13%, far from practical utility. Scaffold recovery is more feasible but still limited.</p>
</li>
<li>
<p><strong>Weak target awareness</strong>: Most SBDD models fail to use protein structural information effectively, generating similar molecules across different targets. This raises concerns about off-target effects.</p>
</li>
<li>
<p><strong>Conformational prediction remains difficult</strong>: Most models produce conformations with higher strain energy than classical docking, and only a small fraction (typically below 23%) of generated poses match redocked conformations within 2 Å RMSD.</p>
</li>
<li>
<p><strong>Generalization gap</strong>: Performance consistently drops on proteins not in the training set, and prior benchmarks that do not stratify by training data exposure overestimate real-world utility.</p>
</li>
<li>
<p><strong>Inference-time scaling does not solve the problem</strong>: Sampling up to 100,000 molecules increased the absolute number of active discoveries but with diminishing efficiency. Without better scoring functions, scaling sampling offers limited practical value.</p>
</li>
<li>
<p><strong>Chemical safety</strong>: Most models produce a majority of molecules that fail industry-standard reactivity and promiscuity filters.</p>
</li>
</ol>
<p>The authors acknowledge that the benchmark&rsquo;s 220,005 active molecules represent a biased subset of bioactive chemical space. A model&rsquo;s failure to rediscover known actives for a given target may therefore reflect sampling limitations rather than evidence that the molecules it generates are inactive.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Active compounds</td>
          <td>ChEMBL v33</td>
          <td>220,005 molecules, 120 targets</td>
          <td>Filtered at 10 μM affinity threshold</td>
      </tr>
      <tr>
          <td>H2L series</td>
          <td>ChEMBL v33 + PDB</td>
          <td>5,433 series (600 used for H2L test)</td>
          <td>MCS-based series construction</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td><a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">PDB</a></td>
          <td>120 targets</td>
          <td>One PDB entry per target</td>
      </tr>
      <tr>
          <td>Training (most models)</td>
          <td>CrossDocked2020</td>
          <td>Varies</td>
          <td>Standard SBDD training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>De novo models sampled 1,000 molecules per target; H2L models sampled 200 per series</li>
<li>All experiments repeated three times with different random seeds</li>
<li>Docking performed with AutoDock Vina using standard parameters</li>
<li>Chemical filters applied via the medchem library</li>
<li>Conformational quality assessed with PoseBusters and PoseCheck</li>
<li>Interaction scores computed via ProLIF with frequency-weighted normalization</li>
</ul>
<h3 id="models">Models</h3>
<p>All 17 models were obtained from their official GitHub repositories and run with default configurations. The benchmark does not introduce new model architectures.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Summary of key metrics across the best-performing models in each category:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best De Novo</th>
          <th>Value</th>
          <th>Best H2L</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PB-valid score</td>
          <td>MolCraft</td>
          <td>0.783</td>
          <td>DiffSBDD-M</td>
          <td>0.597</td>
      </tr>
      <tr>
          <td>Molecular hit rate (in CrossDock)</td>
          <td>TamGen</td>
          <td>0.124%</td>
          <td>DiffDec</td>
          <td>Higher than de novo</td>
      </tr>
      <tr>
          <td>Scaffold hit rate (in CrossDock)</td>
          <td>PocketFlow</td>
          <td>&gt;10%</td>
          <td>Delete</td>
          <td>Lower than PocketFlow</td>
      </tr>
      <tr>
          <td>TAScore scaffold (% targets &gt;1)</td>
          <td>PocketFlow</td>
          <td>73%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>MNA Score</td>
          <td>N/A</td>
          <td>N/A</td>
          <td>DiffDec</td>
          <td>0.523</td>
      </tr>
      <tr>
          <td>Filter pass rate</td>
          <td>TamGen</td>
          <td>&gt;50%</td>
          <td>PGMG</td>
          <td>&gt;50%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Specific hardware requirements are not detailed in the paper. Models were run using their default configurations from official repositories.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CAODH/MolGenBench">MolGenBench</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmark evaluation framework</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/17572553">Zenodo dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-ND 4.0</td>
          <td>Processed data and source data for all results</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cao, D., Fan, Z., Yu, J., Chen, M., Jiang, X., Sheng, X., Wang, X., Zeng, C., Luo, X., Teng, D., &amp; Zheng, M. (2025). Benchmarking Real-World Applicability of Molecular Generative Models from De novo Design to Lead Optimization with MolGenBench. <em>bioRxiv</em>. <a href="https://doi.org/10.1101/2025.11.03.686215">https://doi.org/10.1101/2025.11.03.686215</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cao2025molgenbench,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Benchmarking Real-World Applicability of Molecular Generative Models from De novo Design to Lead Optimization with MolGenBench}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cao, Duanhua and Fan, Zhehuan and Yu, Jie and Chen, Mingan and Jiang, Xinyu and Sheng, Xia and Wang, Xingyou and Zeng, Chuanlong and Luo, Xiaomin and Teng, Dan and Zheng, Mingyue}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2025.11.03.686215}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GuacaMol: Benchmarking Models for De Novo Molecular Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/</guid><description>GuacaMol introduces a standardized benchmark suite for evaluating de novo molecular design models across distribution learning and goal-directed optimization.</description><content:encoded><![CDATA[<h2 id="a-standardized-benchmark-for-molecular-design">A Standardized Benchmark for Molecular Design</h2>
<p>GuacaMol is a <strong>Resource</strong> paper. Its primary contribution is a standardized, open-source benchmarking framework for evaluating models for de novo molecular design. The framework defines 5 distribution-learning benchmarks and 20 goal-directed optimization benchmarks, implemented as a Python package. The authors also provide baseline results for several classical and neural generative models, establishing reference performance levels for future comparisons.</p>
<h2 id="the-need-for-consistent-evaluation-in-generative-chemistry">The Need for Consistent Evaluation in Generative Chemistry</h2>
<p>By 2018, deep generative models for molecular design (<a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAEs</a>, RNNs, <a href="/posts/what-is-a-gan/">GANs</a>) had shown promising results, but the field lacked consistent evaluation standards. Different papers used different tasks, different datasets, and different metrics, making it difficult to compare models or assess real progress. Comparative studies between neural approaches and well-established algorithms like genetic algorithms were rare.</p>
<p>In other areas of machine learning, standardized benchmarks (ImageNet for vision, GLUE for NLP) had driven rapid progress by enabling fair comparisons. The de novo design community lacked an equivalent. Additionally, many existing evaluations focused on easily optimizable properties (logP, QED) that could not differentiate between models, since even simple baselines achieved near-perfect scores on those tasks.</p>
<h2 id="benchmark-design-distribution-learning-and-goal-directed-optimization">Benchmark Design: Distribution Learning and Goal-Directed Optimization</h2>
<p>GuacaMol separates evaluation into two independent dimensions, reflecting the two main use cases of generative models.</p>
<h3 id="distribution-learning-benchmarks">Distribution-Learning Benchmarks</h3>
<p>These five benchmarks assess how well a model learns to generate molecules similar to a training set (a standardized subset of ChEMBL 24):</p>
<ol>
<li><strong>Validity</strong>: Fraction of generated molecules that are chemically valid (parseable by RDKit), measured over 10,000 generated samples.</li>
<li><strong>Uniqueness</strong>: Fraction of unique canonical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> among 10,000 valid generated molecules.</li>
<li><strong>Novelty</strong>: Fraction of generated molecules not present in the training set, measured over 10,000 unique samples.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Measures distributional similarity between generated and reference molecules using hidden representations from ChemNet (trained on biological activity prediction). The FCD score is transformed as:</li>
</ol>
<p>$$S = \exp(-0.2 \cdot \text{FCD})$$</p>
<ol start="5">
<li><strong>KL Divergence</strong>: Compares distributions of nine physicochemical descriptors (BertzCT, MolLogP, MolWt, TPSA, NumHAcceptors, NumHDonors, NumRotatableBonds, NumAliphaticRings, NumAromaticRings) plus maximum nearest-neighbor ECFP4 similarity. The final score aggregates per-descriptor KL divergences:</li>
</ol>
<p>$$S = \frac{1}{k} \sum_{i}^{k} \exp(-D_{\text{KL}, i})$$</p>
<p>where the sum runs over the $k$ compared distributions (the nine physicochemical descriptors plus the similarity distribution).</p>
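<p>Both score transforms are simple to implement (a minimal sketch; the official GuacaMol package provides its own implementations):</p>

```python
import math

def fcd_score(fcd: float) -> float:
    # exp(-0.2 * FCD): maps raw FCD (lower is better) onto (0, 1],
    # with 1 meaning identical distributions.
    return math.exp(-0.2 * fcd)

def kl_score(kl_divergences: list[float]) -> float:
    # Average of exp(-D_KL) over the compared descriptor distributions;
    # a perfect match (all divergences zero) scores 1.
    return sum(math.exp(-d) for d in kl_divergences) / len(kl_divergences)
```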
<h3 id="goal-directed-benchmarks">Goal-Directed Benchmarks</h3>
<p>The 20 goal-directed benchmarks evaluate a model&rsquo;s ability to generate molecules that maximize a given scoring function. These span several categories:</p>
<ul>
<li><strong>Rediscovery</strong> (3 tasks): Regenerate a specific target molecule (Celecoxib, Troglitazone, Thiothixene) using Tanimoto similarity on ECFP4 fingerprints.</li>
<li><strong>Similarity</strong> (3 tasks): Generate many molecules similar to a target (Aripiprazole, Albuterol, Mestranol) above a threshold of 0.75.</li>
<li><strong>Isomers</strong> (2 tasks): Generate molecules matching a target molecular formula ($\text{C}_{11}\text{H}_{24}$ and $\text{C}_9\text{H}_{10}\text{N}_2\text{O}_2\text{PF}_2\text{Cl}$).</li>
<li><strong>Median molecules</strong> (2 tasks): Maximize similarity to two reference molecules simultaneously (camphor/menthol and tadalafil/sildenafil).</li>
<li><strong>Multi-property optimization</strong> (7 tasks): Optimize combinations of similarity, physicochemical properties, and structural features for drug-relevant molecules (Osimertinib, Fexofenadine, Ranolazine, Perindopril, Amlodipine, Sitagliptin, Zaleplon).</li>
<li><strong>SMARTS-based</strong> (1 task): Target molecules containing specific substructure patterns with constrained physicochemical properties (Valsartan SMARTS).</li>
<li><strong>Scaffold/decorator hop</strong> (2 tasks): Modify molecular scaffolds while preserving substituent patterns, or vice versa.</li>
</ul>
<p>The benchmark score for most goal-directed tasks combines top-1, top-10, and top-100 molecule scores:</p>
<p>$$S = \frac{1}{3}\left(s_1 + \frac{1}{10}\sum_{i=1}^{10} s_i + \frac{1}{100}\sum_{i=1}^{100} s_i\right)$$</p>
<p>where $s_i$ are molecule scores sorted in decreasing order.</p>
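<p>The combined top-1/top-10/top-100 score can be sketched as follows (assumes at least 100 scored molecules, each scored in [0, 1]):</p>

```python
def goal_directed_score(molecule_scores: list[float]) -> float:
    # Average of the single best score, the mean of the top 10,
    # and the mean of the top 100 (scores sorted in decreasing order).
    s = sorted(molecule_scores, reverse=True)
    return (s[0] + sum(s[:10]) / 10 + sum(s[:100]) / 100) / 3
```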
<h3 id="score-modifiers">Score Modifiers</h3>
<p>Raw molecular properties are transformed via modifier functions to restrict scores to [0, 1]:</p>
<ul>
<li><strong>Gaussian($\mu$, $\sigma$)</strong>: Targets a specific property value</li>
<li><strong>MinGaussian($\mu$, $\sigma$)</strong>: Full score below $\mu$, decreasing above</li>
<li><strong>MaxGaussian($\mu$, $\sigma$)</strong>: Full score above $\mu$, decreasing below</li>
<li><strong>Thresholded($t$)</strong>: Full score above threshold $t$, linear decrease below</li>
</ul>
<p>Multi-property objectives use either arithmetic or geometric means to combine individual scores.</p>
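<p>The four modifier shapes can be sketched as follows (illustrative forms consistent with the descriptions above; the exact functions in the GuacaMol package may differ in detail):</p>

```python
import math

def gaussian(x: float, mu: float, sigma: float) -> float:
    # Full score only at the target value mu, decaying on both sides.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def min_gaussian(x: float, mu: float, sigma: float) -> float:
    # Full score at or below mu, Gaussian decay above.
    return 1.0 if x <= mu else gaussian(x, mu, sigma)

def max_gaussian(x: float, mu: float, sigma: float) -> float:
    # Full score at or above mu, Gaussian decay below.
    return 1.0 if x >= mu else gaussian(x, mu, sigma)

def thresholded(x: float, t: float) -> float:
    # Full score above threshold t, linear decrease toward zero below.
    return 1.0 if x >= t else max(0.0, x / t)
```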
<h2 id="baseline-models-and-experimental-setup">Baseline Models and Experimental Setup</h2>
<p>The authors evaluate six baseline models spanning different paradigms:</p>
<p><strong>Distribution-learning baselines:</strong></p>
<ul>
<li><strong>Random sampler</strong>: Samples molecules directly from the dataset (provides upper/lower bounds).</li>
<li><strong>SMILES LSTM</strong>: 3-layer LSTM (hidden size 1024) trained to predict next SMILES characters.</li>
<li><strong>Graph MCTS</strong>: Monte Carlo Tree Search building molecules atom-by-atom.</li>
<li><strong>VAE</strong>: Variational autoencoder on SMILES representations.</li>
<li><strong>AAE</strong>: Adversarial autoencoder.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></strong>: Objective-reinforced generative adversarial network.</li>
</ul>
<p><strong>Goal-directed baselines:</strong></p>
<ul>
<li><strong>Best of dataset</strong>: Scores all training molecules and returns the best (virtual screening baseline).</li>
<li><strong>SMILES LSTM</strong>: Same model with 20 iterations of hill-climbing (8192 samples per iteration, top 1024 for fine-tuning).</li>
<li><strong>SMILES GA</strong>: Genetic algorithm operating on SMILES strings with grammar-based mutations.</li>
<li><strong>Graph GA</strong>: Genetic algorithm operating on molecular graphs with crossover and mutation.</li>
<li><strong>Graph MCTS</strong>: Monte Carlo Tree Search with 40 simulations per molecule.</li>
</ul>
<p>The training dataset is ChEMBL 24 after filtering: salt removal, charge neutralization, SMILES length cap of 100, element restrictions, and removal of molecules similar (ECFP4 &gt; 0.323) to 10 held-out drug molecules used in benchmarks.</p>
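<p>The similarity-based holdout filter boils down to a Tanimoto comparison (a sketch over fingerprints represented as sets of on-bit indices; generating the ECFP4 bits themselves, e.g. with RDKit, is assumed to happen elsewhere):</p>

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    # Tanimoto similarity: |A ∩ B| / |A ∪ B| over on-bit indices.
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def too_similar_to_holdout(fp: set[int], holdout_fps: list[set[int]],
                           threshold: float = 0.323) -> bool:
    # Flag a training molecule for removal if it exceeds the similarity
    # threshold against any of the 10 held-out drug molecules.
    return any(tanimoto(fp, h) > threshold for h in holdout_fps)
```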
<h3 id="distribution-learning-results">Distribution-Learning Results</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Random</th>
          <th>SMILES LSTM</th>
          <th>Graph MCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
      </tr>
  </tbody>
</table>
<h3 id="goal-directed-results-selected">Goal-Directed Results (Selected)</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th>SMILES LSTM</th>
          <th>SMILES GA</th>
          <th>Graph GA</th>
          <th>Graph MCTS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.732</td>
          <td>1.000</td>
          <td>0.355</td>
      </tr>
      <tr>
          <td>Osimertinib MPO</td>
          <td>0.839</td>
          <td>0.907</td>
          <td>0.886</td>
          <td>0.953</td>
          <td>0.784</td>
      </tr>
      <tr>
          <td>Sitagliptin MPO</td>
          <td>0.509</td>
          <td>0.545</td>
          <td>0.689</td>
          <td>0.891</td>
          <td>0.458</td>
      </tr>
      <tr>
          <td>Scaffold Hop</td>
          <td>0.738</td>
          <td>0.998</td>
          <td>0.885</td>
          <td>1.000</td>
          <td>0.478</td>
      </tr>
      <tr>
          <td><strong>Total (20 tasks)</strong></td>
          <td><strong>12.144</strong></td>
          <td><strong>17.340</strong></td>
          <td><strong>14.396</strong></td>
          <td><strong>17.983</strong></td>
          <td><strong>9.009</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-findings">Main Findings</h3>
<p>The Graph GA achieves the highest total score across goal-directed benchmarks (17.983), followed closely by the SMILES LSTM (17.340). This result is notable because genetic algorithms are well-established methods, and the LSTM-based neural approach nearly matches their optimization performance.</p>
<p>However, compound quality tells a different story. When examining the top 100 molecules per task through chemical quality filters (SureChEMBL, Glaxo, PAINS rules), 77% of LSTM-generated molecules pass, matching the Best of ChEMBL baseline. In contrast, Graph GA produces only 40% passing molecules, and Graph MCTS only 22%. This suggests that neural models benefit from pre-training on real molecular distributions, which encodes implicit knowledge about what constitutes a &ldquo;reasonable&rdquo; molecule.</p>
<p><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> performs poorly across all distribution-learning tasks, with more than half its generated molecules being invalid. This is consistent with mode collapse, a known problem in GAN training.</p>
<p>Simpler generative models (LSTM, VAE) outperform more complex architectures (ORGAN, AAE) on distribution learning. Graph MCTS struggles with both distribution learning and goal-directed optimization, suggesting that single-molecule search trees are less effective than population-based approaches.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors explicitly identify several issues:</p>
<ul>
<li><strong>Compound quality is hard to quantify</strong>: The rule-based filters used are acknowledged as &ldquo;high precision, low recall&rdquo; surrogates. They catch some problematic molecules but cannot encode the full breadth of medicinal chemistry expertise.</li>
<li><strong>Some benchmarks are too easy</strong>: The trivially optimizable tasks (logP, QED, CNS MPO) cannot differentiate between models. All baselines achieve near-perfect scores on these.</li>
<li><strong>Sample efficiency and runtime are not benchmarked</strong>: The framework does not penalize models for requiring excessive scoring function calls.</li>
<li><strong>Synthesis accessibility is not addressed</strong>: Generated molecules may be valid but impractical to synthesize.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors call for harder benchmark tasks, better compound quality metrics, attention to sample efficiency and runtime constraints, and further development of graph-based neural generative models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL 24 (post-processed)</td>
          <td>~1.6M molecules</td>
          <td>Salt removal, neutralization, SMILES length cap, element restrictions</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>10 held-out drug molecules</td>
          <td>10</td>
          <td>Removed from training set via ECFP4 similarity threshold</td>
      </tr>
      <tr>
          <td>Quality filters</td>
          <td>SureChEMBL, Glaxo, PAINS, in-house rules</td>
          <td>N/A</td>
          <td>Applied via rd_filters</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES LSTM</strong>: 3-layer LSTM, hidden size 1024; hill-climbing with 20 iterations, 8192 samples per iteration, top 1024 for fine-tuning</li>
<li><strong>Graph GA</strong>: Population of 100, mating pool of 200, crossover + mutation (probability 0.5), 1000 epochs max</li>
<li><strong>SMILES GA</strong>: Population of 300, offspring of 600, SMILES grammar-based mutations, 1000 epochs max</li>
<li><strong>Graph MCTS</strong>: 40 simulations per molecule, 25 children per step, rollout to 60 atoms, starting from CC</li>
</ul>
<h3 id="models">Models</h3>
<p>All baseline implementations are released as open-source code. VAE, AAE, and ORGAN implementations are from the <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> repository.</p>
<h3 id="evaluation">Evaluation</h3>
<p>All distribution-learning benchmarks sample 10,000 molecules. Goal-directed benchmarks use combinations of top-1, top-10, and top-100 scores. Compound quality is assessed via the percentage of top-100 molecules passing chemical filters.</p>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/guacamol">GuacaMol</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmarking framework and scoring functions</td>
      </tr>
      <tr>
          <td><a href="https://github.com/BenevolentAI/guacamol_baselines">GuacaMol Baselines</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Baseline model implementations</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/projects/GuacaMol/56639">ChEMBL dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Post-processed ChEMBL 24 for benchmarks</td>
      </tr>
      <tr>
          <td><a href="https://github.com/bioinf-jku/FCD">FCD package</a></td>
          <td>Code</td>
          <td>LGPL-3.0</td>
          <td>Fréchet ChemNet Distance implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Brown, N., Fiscato, M., Segler, M. H. S., &amp; Vaucher, A. C. (2019). GuacaMol: Benchmarking Models for De Novo Molecular Design. <em>Journal of Chemical Information and Modeling</em>, 59(3), 1096-1108. <a href="https://doi.org/10.1021/acs.jcim.8b00839">https://doi.org/10.1021/acs.jcim.8b00839</a></p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BenevolentAI/guacamol">GuacaMol Python package</a></li>
<li><a href="https://github.com/BenevolentAI/guacamol_baselines">GuacaMol baselines</a></li>
<li><a href="https://figshare.com/projects/GuacaMol/56639">Post-processed ChEMBL datasets</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{brown2019guacamol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GuacaMol: Benchmarking Models for de Novo Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Brown, Nathan and Fiscato, Marco and Segler, Marwin H. S. and Vaucher, Alain C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{59}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1096--1108}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.8b00839}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Frechet ChemNet Distance for Molecular Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/</guid><description>FCD uses ChemNet activations and the Wasserstein-2 distance to evaluate molecular generative models for chemical validity, biological relevance, and diversity.</description><content:encoded><![CDATA[<h2 id="a-unified-evaluation-metric-for-molecular-generation">A Unified Evaluation Metric for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Frechet ChemNet Distance (FCD), a single scalar metric for evaluating generative models that produce molecules for drug discovery. FCD adapts the Frechet Inception Distance (FID) from image generation to the molecular domain. By comparing distributions of learned representations from a drug-activity prediction network (ChemNet), FCD simultaneously captures whether generated molecules are chemically valid, biologically relevant, and structurally diverse.</p>
<h2 id="inconsistent-evaluation-of-molecular-generative-models">Inconsistent Evaluation of Molecular Generative Models</h2>
<p>At the time of this work (2018), deep generative models for molecules were proliferating: RNNs combined with <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoders</a>, reinforcement learning, and <a href="/posts/what-is-a-gan/">GANs</a> all produced <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings representing novel molecules. The evaluation landscape was fragmented. Different papers reported different metrics: percentage of valid SMILES, mean logP, druglikeness, synthetic accessibility (SA) scores, or internal diversity via Tanimoto distance.</p>
<p>This inconsistency created several problems. First, method comparison across publications was difficult because no common metric existed. Second, simple metrics like &ldquo;fraction of valid SMILES&rdquo; could be trivially maximized by generating short, simple molecules (e.g., &ldquo;CC&rdquo; or &ldquo;CCC&rdquo;). Third, individual property metrics (logP, druglikeness) each captured only one dimension of quality. A model could score well on logP but produce molecules that were not diverse or not biologically meaningful.</p>
<p>The authors argued that a good metric should capture three properties simultaneously: (1) chemical validity and similarity to real drug-like molecules, (2) biological relevance, and (3) diversity within the generated set.</p>
<h2 id="core-innovation-frechet-distance-over-chemnet-activations">Core Innovation: Frechet Distance over ChemNet Activations</h2>
<p>The key insight is to use a neural network trained on biological activity prediction as a feature extractor for molecules, then compare distributions of these features using the Frechet (Wasserstein-2) distance.</p>
<h3 id="chemnet-architecture">ChemNet Architecture</h3>
<p>ChemNet is a multi-task neural network trained to predict bioactivities across approximately 6,000 assays from three major drug discovery databases (ChEMBL, ZINC, PubChem). The architecture processes one-hot encoded SMILES strings through:</p>
<ol>
<li>Two 1D convolutional layers with SELU activations</li>
<li>A max-pooling layer</li>
<li>Two stacked LSTM layers</li>
<li>A fully connected output layer</li>
</ol>
<p>The penultimate layer (the second LSTM&rsquo;s hidden state after processing the full input sequence) serves as the molecular representation. Because ChemNet was trained to predict drug activities, its internal representations encode both chemical structure (from the input side) and biological function (from the output side).</p>
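<p>The described layer stack can be sketched in PyTorch. This is an illustrative reimplementation, not the authors&rsquo; code: the channel counts, kernel sizes, and hidden width below are assumptions, since the summary above does not specify them.</p>

```python
import torch
import torch.nn as nn

class ChemNetSketch(nn.Module):
    """Illustrative sketch of the ChemNet layout described above.
    Layer sizes (32/64 channels, kernel 4/6, hidden 512) are assumptions."""

    def __init__(self, vocab_size=35, hidden=512, n_tasks=6000):
        super().__init__()
        # Two 1D convolutions with SELU activations, then max-pooling
        self.conv = nn.Sequential(
            nn.Conv1d(vocab_size, 32, kernel_size=4), nn.SELU(),
            nn.Conv1d(32, 64, kernel_size=6), nn.SELU(),
            nn.MaxPool1d(2),
        )
        # Two stacked LSTM layers
        self.lstm = nn.LSTM(64, hidden, num_layers=2, batch_first=True)
        # Fully connected output over ~6,000 bioactivity assays
        self.out = nn.Linear(hidden, n_tasks)

    def forward(self, x):
        # x: (batch, vocab, seq_len) one-hot encoded SMILES
        h = self.conv(x).transpose(1, 2)        # -> (batch, seq', 64)
        _, (h_n, _) = self.lstm(h)
        feats = h_n[-1]                         # final hidden state of the
                                                # second LSTM: the FCD features
        return self.out(feats), feats
```

The tuple&rsquo;s second element is the penultimate-layer representation that FCD compares between molecule sets.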
<h3 id="the-fcd-formula">The FCD Formula</h3>
<p>Given a set of real molecules and a set of generated molecules, FCD is computed as follows:</p>
<ol>
<li>Pass each molecule (as a SMILES string) through ChemNet and extract penultimate-layer activations.</li>
<li>Fit a multivariate Gaussian to each set by computing the mean $\mathbf{m}$ and covariance $\mathbf{C}$ for the generated set, and mean $\mathbf{m}_w$ and covariance $\mathbf{C}_w$ for the real set.</li>
<li>Compute the squared Frechet distance:</li>
</ol>
<p>$$
d^{2}\left((\mathbf{m}, \mathbf{C}), (\mathbf{m}_w, \mathbf{C}_w)\right) = \left\lVert \mathbf{m} - \mathbf{m}_w \right\rVert_2^{2} + \mathrm{Tr}\left(\mathbf{C} + \mathbf{C}_w - 2(\mathbf{C}\mathbf{C}_w)^{1/2}\right)
$$</p>
<p>The Gaussian assumption is justified by the maximum entropy principle: the Gaussian is the maximum-entropy distribution for given mean and covariance. A lower FCD indicates that the generated distribution is closer to the real distribution.</p>
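<p>The formula above can be computed with plain NumPy. This is a minimal sketch, not the official implementation (see the linked <code>bioinf-jku/FCD</code> repository for that); the function name is my own. It uses the identity that $\mathrm{Tr}\big((\mathbf{C}\mathbf{C}_w)^{1/2}\big)$ equals the sum of the square roots of the eigenvalues of $\mathbf{C}\mathbf{C}_w$.</p>

```python
import numpy as np

def frechet_distance(act_gen, act_ref):
    """Sketch of the squared Frechet distance between two sets of
    activations, each of shape (n_molecules, n_features)."""
    # Fit a multivariate Gaussian to each activation set
    m, mw = act_gen.mean(axis=0), act_ref.mean(axis=0)
    C = np.cov(act_gen, rowvar=False)
    Cw = np.cov(act_ref, rowvar=False)
    # Tr((C Cw)^{1/2}) = sum of sqrt of eigenvalues of C @ Cw;
    # clip tiny negative values arising from numerical error
    eigvals = np.linalg.eigvals(C @ Cw).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = float(((m - mw) ** 2).sum())
    return mean_term + float(np.trace(C) + np.trace(Cw)) - 2.0 * tr_sqrt
```

Identical sets give a distance of (numerically) zero, and a mean shift in the activations inflates the first term, matching the &ldquo;lower is closer&rdquo; interpretation.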
<h3 id="why-not-just-fingerprints">Why Not Just Fingerprints?</h3>
<p>The authors also define a Frechet Fingerprint Distance (FFD) that replaces ChemNet activations with 2048-bit ECFP_4 fingerprints. FFD captures chemical structure but not biological function. The experimental comparison shows that FCD produces more distinct separations between biased and unbiased molecule sets, particularly for biologically meaningful biases.</p>
<h2 id="detecting-flaws-in-generative-models">Detecting Flaws in Generative Models</h2>
<p>The experiments evaluate whether FCD can detect specific failure modes in generative models. The authors simulate five types of biased generators by selecting molecules from real databases that exhibit particular properties, then compare FCD against individual metrics (logP, druglikeness, SA score, internal diversity) and FFD.</p>
<h3 id="simulated-bias-experiments">Simulated Bias Experiments</h3>
<p>Each experiment uses samples of 5,000 molecules, drawn 5 times. The reference distribution is 200,000 randomly drawn real molecules not used for ChemNet training.</p>
<table>
  <thead>
      <tr>
          <th>Bias Type</th>
          <th>logP</th>
          <th>Druglikeness</th>
          <th>SA Score</th>
          <th>Int. Diversity</th>
          <th>FFD</th>
          <th>FCD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Low druglikeness (&lt;5th pct)</td>
          <td>-</td>
          <td>Detects</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>High logP (&gt;95th pct)</td>
          <td>Detects</td>
          <td>Detects</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Low SA score (&lt;5th pct)</td>
          <td>-</td>
          <td>Partial</td>
          <td>-</td>
          <td>Partial</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Mode collapse (cluster)</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Kinase inhibitors (PLK1)</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
  </tbody>
</table>
<p>FCD is the only metric that detects all five bias types. The biological bias test (kinase inhibitors for PLK1-PBD from PubChem AID 720504) is particularly notable: only FFD and FCD detect this bias, and FCD provides a more distinct separation. This validates the hypothesis that incorporating biological information through ChemNet activations improves evaluation beyond purely chemical descriptors.</p>
<h3 id="sample-size-requirements">Sample Size Requirements</h3>
<p>The authors tested FCD convergence with varying sample sizes (5 to 300,000 molecules). Mean FCD values for samples drawn from the real distribution:</p>
<table>
  <thead>
      <tr>
          <th>Sample Size</th>
          <th>Mean FCD</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>5</td>
          <td>76.46</td>
          <td>5.03</td>
      </tr>
      <tr>
          <td>50</td>
          <td>31.86</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>500</td>
          <td>4.41</td>
          <td>0.03</td>
      </tr>
      <tr>
          <td>5,000</td>
          <td>0.42</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>0.05</td>
          <td>0.00</td>
      </tr>
      <tr>
          <td>300,000</td>
          <td>0.02</td>
          <td>0.00</td>
      </tr>
  </tbody>
</table>
<p>A sample size of 5,000 molecules is sufficient for reliable estimation, with the mean FCD approaching zero and negligible variance.</p>
<h3 id="benchmarking-published-generative-models">Benchmarking Published Generative Models</h3>
<p>The authors computed FCD for several published generative methods:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>FCD</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Random real molecules</td>
          <td>0.22</td>
          <td>Baseline (near zero as expected)</td>
      </tr>
      <tr>
          <td>Segler et al. (LSTM)</td>
          <td>1.62</td>
          <td>Trained to approximate full ChEMBL distribution</td>
      </tr>
      <tr>
          <td>DRD2-targeted methods</td>
          <td>24.14 to 47.85</td>
          <td>Olivecrona, RL, and ORGAN agents</td>
      </tr>
      <tr>
          <td>Rule-based baseline</td>
          <td>58.76</td>
          <td>Random concatenation of C, N, O atoms</td>
      </tr>
  </tbody>
</table>
<p>The ranking matches expectations. The Segler model, trained to approximate the overall molecule distribution, achieves the lowest FCD (1.62). Models optimized for a specific target (DRD2), including the Olivecrona RL agents, the RL method by Benhenda, and ORGAN, produce higher FCD values (24.14 to 47.85) against the general distribution. More training iterations push these models further from the general distribution, as they become increasingly DRD2-specific. The canonical and reduced Olivecrona agents learn similar chemical spaces, consistent with the original authors&rsquo; conclusions. The rule-based system scores worst (58.76), confirming FCD as a meaningful quality metric.</p>
<h2 id="conclusions-and-impact">Conclusions and Impact</h2>
<p>FCD provides a single metric that unifies the evaluation of chemical validity, biological relevance, and diversity for molecular generative models. Its main advantages are:</p>
<ol>
<li>It captures multiple quality dimensions in one score, simplifying method comparison.</li>
<li>It detects biases that no single existing metric can catch alone.</li>
<li>It requires only SMILES strings as input, making it applicable to any generative method (including graph-based approaches via SMILES conversion).</li>
<li>It incorporates biological information through ChemNet, distinguishing it from purely chemical metrics like FFD.</li>
</ol>
<p><strong>Limitations</strong>: The metric depends on the ChemNet model, which was trained on a specific set of bioactivity assays. Molecules outside ChemNet&rsquo;s training distribution may not be well represented. The Gaussian assumption for the activation distributions may not hold perfectly. FCD measures distance to a reference set, so it evaluates how well a generator approximates a given distribution rather than the absolute quality of individual molecules. When using FCD for targeted generation (e.g., molecules active against a specific protein), the reference set should be chosen accordingly rather than defaulting to the general drug-like molecule distribution.</p>
<p>FCD has since become a standard evaluation metric in the molecular generation community, adopted by benchmarking platforms like <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemNet training</td>
          <td>ChEMBL, ZINC, PubChem</td>
          <td>~6,000 assays</td>
          <td>Two-thirds for training, one-third for testing</td>
      </tr>
      <tr>
          <td>Reference distribution</td>
          <td>Combined databases</td>
          <td>200,000 molecules</td>
          <td>Excluded from ChemNet training</td>
      </tr>
      <tr>
          <td>Bias simulations</td>
          <td>Subsets of combined databases</td>
          <td>5,000 per experiment</td>
          <td>5 repetitions each</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>ChemNet: 2x 1D-conv (SELU), max-pool, 2x stacked LSTM, FC output</li>
<li>FCD: Squared Frechet distance between Gaussian-fitted ChemNet penultimate-layer activations</li>
<li>FFD: Same as FCD but using 2048-bit ECFP_4 fingerprints instead of ChemNet activations</li>
<li>Molecular property calculations: RDKit (logP, druglikeness, SA score, Morgan fingerprints with radius 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCD</td>
          <td>Frechet distance over ChemNet activations (lower = closer to reference)</td>
      </tr>
      <tr>
          <td>FFD</td>
          <td>Frechet distance over ECFP_4 fingerprints</td>
      </tr>
      <tr>
          <td>logP</td>
          <td>Mean partition coefficient</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Geometric mean of desired molecular properties (QED)</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility score</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Tanimoto distance within generated set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not provided in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bioinf-jku/FCD">FCD Implementation</a></td>
          <td>Code</td>
          <td>LGPL-3.0</td>
          <td>Official Python implementation; requires only SMILES input</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., &amp; Klambauer, G. (2018). Fréchet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery. <em>Journal of Chemical Information and Modeling</em>, 58(9), 1736-1741.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{preuer2018frechet,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Fr{\&#39;e}chet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Preuer, Kristina and Renz, Philipp and Unterthiner, Thomas and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{58}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1736--1741}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.8b00234}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Failure Modes in Molecule Generation &amp; Optimization</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/</guid><description>Renz et al. show trivial models fool distribution-learning metrics and ML scoring functions introduce exploitable biases in goal-directed molecule generation.</description><content:encoded><![CDATA[<h2 id="an-empirical-critique-of-molecular-generation-evaluation">An Empirical Critique of Molecular Generation Evaluation</h2>
<p>This is an <strong>Empirical</strong> paper that critically examines evaluation practices for molecular generative models. Rather than proposing a new generative method, the paper exposes systematic weaknesses in both distribution-learning metrics and goal-directed optimization scoring functions. The primary contributions are: (1) demonstrating that a trivially simple &ldquo;AddCarbon&rdquo; model can achieve near-perfect scores on widely used distribution-learning benchmarks, and (2) introducing an experimental framework with optimization scores and control scores that reveals model-specific and data-specific biases when ML models serve as scoring functions for goal-directed generation.</p>
<h2 id="evaluation-gaps-in-de-novo-molecular-design">Evaluation Gaps in De Novo Molecular Design</h2>
<p>The rapid growth of deep learning methods for molecular generation (RNN-based SMILES generators, VAEs, GANs, graph neural networks) created a need for standardized evaluation. Benchmarking suites like <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> introduced metrics for validity, uniqueness, novelty, KL divergence over molecular properties, and <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Frechet ChemNet Distance (FCD)</a>. For goal-directed generation, penalized logP became a common optimization target.</p>
<p>However, these metrics leave significant blind spots. Distribution-learning metrics do not detect whether a model merely copies training molecules with minimal modifications. Goal-directed benchmarks often use scoring functions that fail to capture the full requirements of drug discovery (synthetic feasibility, drug-likeness, absence of reactive substructures). When ML models serve as scoring functions, the problem worsens because generated molecules can exploit artifacts of the learned model rather than exhibiting genuinely desirable properties.</p>
<p>At the time of writing, wet-lab validations of generative models remained scarce, with only a handful of studies (Merk et al., Zhavoronkov et al.) demonstrating in vitro activity for generated compounds. The lack of rigorous evaluation left the field unable to distinguish meaningfully innovative methods from those that simply exploit metric weaknesses.</p>
<h2 id="the-copy-problem-and-control-score-framework">The Copy Problem and Control Score Framework</h2>
<p>The paper introduces two key conceptual contributions.</p>
<h3 id="the-addcarbon-model-for-distribution-learning">The AddCarbon Model for Distribution-Learning</h3>
<p>The AddCarbon model is deliberately trivial: it samples a molecule from the training set, inserts a single carbon atom at a random position in its SMILES string, and returns the result if it produces a valid, novel molecule. This model achieves near-perfect scores across most <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> distribution-learning benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>RS</th>
          <th>LSTM</th>
          <th>GraphMCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
          <th>AddCarbon</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
          <td>0.871</td>
      </tr>
  </tbody>
</table>
<p>The AddCarbon model beats all baselines except the LSTM on the FCD metric, despite being practically useless. This exposes what the authors call the &ldquo;copy problem&rdquo;: current metrics check only for exact matches to training molecules, so minimal edits evade novelty detection. The authors argue that likelihood-based evaluation on hold-out test sets, analogous to standard practice in NLP, would provide a more comprehensive metric.</p>
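<p>The AddCarbon procedure is simple enough to sketch in a few lines. This is an illustrative stdlib-only version: the paper&rsquo;s model additionally verifies chemical validity with RDKit before accepting a candidate, which is omitted here.</p>

```python
import random

def add_carbon(training_smiles, seed=None):
    """Sketch of the AddCarbon baseline: copy a training molecule and
    insert a single 'C' at a random position in its SMILES string.
    (RDKit validity checking, used in the paper, is omitted here.)"""
    rng = random.Random(seed)
    smiles = rng.choice(training_smiles)
    pos = rng.randrange(len(smiles) + 1)      # any position, incl. the end
    candidate = smiles[:pos] + "C" + smiles[pos:]
    # Keep only novel strings, i.e. not exact copies of training molecules
    return candidate if candidate not in training_smiles else None
```

Because novelty is checked only by exact string match, such one-atom edits pass the benchmark&rsquo;s novelty filter while contributing nothing new chemically, which is precisely the copy problem.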
<h3 id="control-scores-for-goal-directed-generation">Control Scores for Goal-Directed Generation</h3>
<p>For goal-directed generation, the authors introduce a three-score experimental design:</p>
<ul>
<li><strong>Optimization Score (OS)</strong>: Output of a classifier trained on data split 1, used to guide the molecular optimizer.</li>
<li><strong>Model Control Score (MCS)</strong>: Output of a second classifier trained on split 1 with a different random seed. Divergence between OS and MCS quantifies model-specific biases.</li>
<li><strong>Data Control Score (DCS)</strong>: Output of a classifier trained on data split 2. Divergence between OS and DCS quantifies data-specific biases.</li>
</ul>
<p>This mirrors the training/test split paradigm in supervised learning. If a generator truly produces molecules with the desired bioactivity, the control scores should track the optimization score. Divergence between them indicates the optimizer is exploiting artifacts of the specific model or training data rather than learning generalizable chemical properties.</p>
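<p>A minimal version of this three-score setup can be sketched with scikit-learn. The data here is schematic (the paper uses 1024-bit ECFP4 fingerprints of ChEMBL actives/inactives), and the function name is my own; the essential point is that OS and MCS share training data but differ in random seed, while DCS is trained on the second split.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_scoring_functions(X1, y1, X2, y2):
    """Sketch of the OS / MCS / DCS construction described above.
    X1, y1: fingerprints and activity labels for split 1;
    X2, y2: the same for split 2."""
    os_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X1, y1)
    mcs_clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X1, y1)  # same data, new seed
    dcs_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X2, y2)  # different split

    def score_with(clf):
        # Predicted probability of the "active" class guides/controls optimization
        return lambda X: clf.predict_proba(X)[:, 1]

    return score_with(os_clf), score_with(mcs_clf), score_with(dcs_clf)
```

During optimization only the OS is maximized; tracking the MCS and DCS on the same generated molecules reveals whether apparent gains generalize or merely exploit one classifier instance or one data split.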
<h2 id="experimental-setup-three-targets-three-generators">Experimental Setup: Three Targets, Three Generators</h2>
<h3 id="targets-and-data">Targets and Data</h3>
<p>The authors selected three biological targets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">Janus kinase 2</a> (JAK2), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">epidermal growth factor receptor</a> (EGFR), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a> (DRD2). For each target, the data was split into two halves (split 1 and split 2) with balanced active/inactive ratios. Random forest classifiers using binary folded ECFP4 fingerprints (radius 2, size 1024) were trained to produce three scoring functions per target: the OS and MCS on split 1 (different random seeds), and the DCS on split 2.</p>
<h3 id="generators">Generators</h3>
<p>Three molecular generators were evaluated:</p>
<ol>
<li><strong>Graph-based Genetic Algorithm (GA)</strong>: Iteratively applies random mutations and crossovers to a population of molecules, retaining the best in each generation. One of the top performers in GuacaMol.</li>
<li><strong>SMILES-LSTM</strong>: An autoregressive model that generates SMILES character by character, optimized via hill climbing (iteratively sampling, keeping top molecules, fine-tuning). Also a top GuacaMol performer.</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">Particle Swarm Optimization</a> (PS)</strong>: Optimizes molecules in the continuous latent space of a SMILES-based sequence-to-sequence model.</li>
</ol>
<p>Each optimizer was run 10 times per target dataset.</p>
<h2 id="score-divergence-and-exploitable-biases">Score Divergence and Exploitable Biases</h2>
<h3 id="optimization-vs-control-score-divergence">Optimization vs. Control Score Divergence</h3>
<p>Across all three targets and all three generators, the OS consistently outpaced both control scores during optimization. The DCS sometimes stagnated or even decreased while the OS continued to climb. This divergence demonstrates that the generators exploit biases in the scoring function rather than discovering genuinely active compounds.</p>
<p>The MCS also diverged from the OS despite being trained on exactly the same data, confirming model-specific biases: the optimization exploits features unique to the particular random forest instance. The larger gap between OS and DCS (compared to OS and MCS) indicates that data-specific biases contribute more to the divergence than model-specific biases.</p>
<h3 id="chemical-space-migration">Chemical Space Migration</h3>
<p>Optimized molecules migrated toward the region of split 1 actives (used to train the OS), as shown by t-SNE embeddings and nearest-neighbor Tanimoto similarity analysis. Optimized molecules had more similar neighbors in split 1 than in split 2, confirming data-specific bias. By the end of optimization, generated molecules occupied different regions of chemical space than known actives when measured by logP and molecular weight, with compounds from the same optimization run forming distinct clusters.</p>
<h3 id="quality-of-generated-molecules">Quality of Generated Molecules</h3>
<p>High-scoring generated molecules frequently contained problematic substructures: reactive dienes, nitrogen-fluorine bonds, long heteroatom chains that are synthetically infeasible, and highly uncommon functional groups. The LSTM optimizer showed a bias toward high molecular weight, low diversity, and high logP values. These molecules would be rejected by medicinal chemists despite their high optimization scores.</p>
<h3 id="key-takeaways">Key Takeaways</h3>
<p>The authors emphasize several practical implications:</p>
<ol>
<li><strong>Early stopping</strong>: Control scores can indicate when further optimization is exploiting biases rather than finding better molecules. Optimization should stop when control scores plateau.</li>
<li><strong>Scoring function iteration</strong>: In practice, generative models are &ldquo;highly adept at exploiting&rdquo; incomplete scoring functions, necessitating several iterations of generation and scoring function refinement.</li>
<li><strong>Synthetic accessibility</strong>: Even high-scoring molecules are useless if they cannot be synthesized. The authors consider this a major challenge for practical adoption.</li>
<li><strong>Likelihood-based evaluation</strong>: For distribution-learning, the authors recommend reporting test-set likelihoods for likelihood-based models, following standard NLP practice.</li>
</ol>
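<p>The early-stopping recommendation in point 1 amounts to a plateau test on the control-score trajectory. The sketch below is one possible formulation, not the paper&rsquo;s; the window size and tolerance are arbitrary assumptions.</p>

```python
def should_stop(control_scores, window=5, tol=1e-3):
    """Sketch of an early-stopping heuristic: halt optimization once the
    control score stops improving, i.e. the mean over the most recent
    window no longer exceeds the mean over the window before it."""
    if len(control_scores) < 2 * window:
        return False  # not enough history to judge a plateau
    recent = sum(control_scores[-window:]) / window
    previous = sum(control_scores[-2 * window:-window]) / window
    return recent - previous < tol
```

Applied to the MCS or DCS rather than the OS, this flags the point where further optimization is likely exploiting scoring-function biases instead of finding genuinely better molecules.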
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bioactivity data</td>
          <td>ChEMBL (JAK2, EGFR, DRD2)</td>
          <td>See Table S1</td>
          <td>Binary classification tasks, split 50/50</td>
      </tr>
      <tr>
          <td>Distribution-learning</td>
          <td>GuacaMol training set</td>
          <td>Subset of ChEMBL</td>
          <td>Used as starting population for GA and PS</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Scoring function</strong>: Random forest classifier (scikit-learn) on binary ECFP4 fingerprints (size 1024, radius 2, RDKit)</li>
<li><strong>GA</strong>: Graph-based genetic algorithm from Jensen (2019)</li>
<li><strong>LSTM</strong>: SMILES-LSTM with hill climbing, pretrained model from GuacaMol</li>
<li><strong>PS</strong>: Particle swarm optimization in latent space of a sequence-to-sequence model (Winter et al. 2019)</li>
<li>Each optimizer run 10 times per target</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization Score (OS)</td>
          <td>RF classifier on split 1</td>
          <td>Guides optimization</td>
      </tr>
      <tr>
          <td>Model Control Score (MCS)</td>
          <td>RF on split 1, different seed</td>
          <td>Detects model-specific bias</td>
      </tr>
      <tr>
          <td>Data Control Score (DCS)</td>
          <td>RF on split 2</td>
          <td>Detects data-specific bias</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> metrics</td>
          <td>Validity, uniqueness, novelty, KL div, FCD</td>
          <td>For distribution-learning</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ml-jku/mgenerators-failure-modes">ml-jku/mgenerators-failure-modes</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Data, code, and results</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{renz2019failure,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{On failure modes in molecule generation and optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Renz, Philipp and Van Rompaey, Dries and Wegner, J{\&#34;o}rg Kurt and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Drug Discovery Today: Technologies}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{32-33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55--63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ddtec.2020.09.003}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S., &amp; Klambauer, G. (2019). On failure modes in molecule generation and optimization. <em>Drug Discovery Today: Technologies</em>, 32-33, 55-63. <a href="https://doi.org/10.1016/j.ddtec.2020.09.003">https://doi.org/10.1016/j.ddtec.2020.09.003</a></p>
<p><strong>Publication</strong>: Drug Discovery Today: Technologies, Volume 32-33, 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ml-jku/mgenerators-failure-modes">Code and data (GitHub)</a></li>
</ul>
]]></content:encoded></item><item><title>DOCKSTRING: Docking-Based Benchmarks for Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/dockstring-docking-benchmarks-ligand-design/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/dockstring-docking-benchmarks-ligand-design/</guid><description>DOCKSTRING provides an open-source Python docking package, 15M+ score dataset across 58 targets, and benchmark tasks for ML-driven drug design.</description><content:encoded><![CDATA[<h2 id="a-three-part-resource-for-docking-based-ml-benchmarks">A Three-Part Resource for Docking-Based ML Benchmarks</h2>
<p>DOCKSTRING is a <strong>Resource</strong> paper that delivers three integrated components for benchmarking machine learning models in drug discovery using molecular docking. The primary contributions are: (1) an open-source Python package wrapping <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a> for deterministic docking from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, (2) a dataset of over 15 million docking scores and poses covering 260,000+ molecules docked against 58 medically relevant protein targets, and (3) a suite of benchmark tasks spanning regression, <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a>, and de novo molecular design. The paper additionally provides baseline results across classical and deep learning methods.</p>
<h2 id="why-existing-molecular-benchmarks-fall-short">Why Existing Molecular Benchmarks Fall Short</h2>
<p>ML methods for drug discovery are frequently evaluated using simple physicochemical properties such as penalized logP or QED (quantitative estimate of druglikeness). These properties are computationally cheap and easy to optimize, but they do not depend on the interaction between a candidate compound and a protein target. As a result, strong performance on logP or QED benchmarks does not necessarily translate to strong performance on real drug design tasks.</p>
<p><a href="https://en.wikipedia.org/wiki/Docking_(molecular)">Molecular docking</a> offers a more realistic evaluation objective because docking scores depend on the 3D structure of the ligand-target complex. Docking is routinely used by medicinal chemists to estimate binding affinities during hit discovery and lead optimization. Several prior efforts attempted to bring docking into ML benchmarking, but each had limitations:</p>
<ul>
<li><strong>VirtualFlow and DockStream</strong> require manually prepared target files and domain expertise.</li>
<li><strong>TDC and Cieplinski et al.</strong> provide SMILES-to-score wrappers but lack proper ligand protonation and randomness control, and cover very few targets (one and four, respectively).</li>
<li><strong>DUD-E</strong> is easily overfit by ML models that memorize actives vs. decoys.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></strong> rely on physicochemical properties or similarity functions that miss 3D structural subtleties.</li>
<li><strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> compiles experimental datasets but does not support on-the-fly label computation needed for transfer learning or de novo design.</li>
</ul>
<p>DOCKSTRING addresses all of these gaps: it standardizes the docking procedure, automates ligand and target preparation, controls randomness for reproducibility, and provides a large, diverse target set.</p>
<h2 id="core-innovation-standardized-end-to-end-docking-pipeline">Core Innovation: Standardized End-to-End Docking Pipeline</h2>
<p>The key innovation is a fully automated, deterministic docking pipeline that produces reproducible scores from a SMILES string in four lines of Python code. The pipeline consists of three stages:</p>
<p><strong>Target Preparation.</strong> 57 of the 58 protein targets originate from the Directory of Useful Decoys Enhanced (DUD-E). PDB files are standardized with <a href="https://en.wikipedia.org/wiki/Open_Babel">Open Babel</a>, polar hydrogens are added, and conversion to PDBQT format is performed with AutoDock Tools. Search boxes are derived from crystallographic ligands with 12.5 Å padding and a minimum side length of 30 Å. The 58th target (DRD2, <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a>) was prepared separately following the same protocol.</p>
<p><strong>Ligand Preparation.</strong> Ligands are protonated at pH 7.4 with Open Babel, embedded into 3D conformations using the ETKDG algorithm in RDKit, refined with the <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94 force field</a>, and assigned Gasteiger partial charges. Stereochemistry of determined stereocenters is maintained, while undetermined stereocenters are assigned randomly but consistently across runs.</p>
<p><strong>Docking.</strong> AutoDock Vina runs with default exhaustiveness (8), up to 9 binding modes, and an energy range of 3 kcal/mol. The authors verified that fixing the random seed yields docking score variance of less than 0.1 kcal/mol across runs, making the pipeline fully deterministic.</p>
<p>The three de novo design objective functions incorporate a QED penalty to enforce druglikeness:</p>
<p>$$
f_{\text{F2}}(l) = s(l, \text{F2}) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{PPAR}}(l) = \max_{t \in \text{PPAR}} s(l, t) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{JAK2}}(l) = s(l, \text{JAK2}) - \min(s(l, \text{LCK}), -8.1) + 10(1 - \text{QED}(l))
$$</p>
<p>The F2 task optimizes binding to a single protease. The Promiscuous <a href="https://en.wikipedia.org/wiki/Peroxisome_proliferator-activated_receptor">PPAR</a> task requires strong binding to three nuclear receptors simultaneously. The Selective <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> task is adversarial, requiring strong JAK2 binding while avoiding <a href="https://en.wikipedia.org/wiki/Tyrosin-protein_kinase_Lck">LCK</a> binding (two kinases with a score correlation of 0.80).</p>
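<p>Because docking scores are negative for strong binders, all three objectives are minimized. Under that convention, they can be sketched in plain Python (function names and target keys such as <code>PPARA</code> are illustrative; docking scores and QED values are assumed precomputed):</p>

```python
# Sketch of the three de novo objective functions, assuming docking
# scores (kcal/mol, lower = stronger binding) and QED values are
# precomputed; names and dict keys are illustrative, not the paper's API.

def f_f2(scores, qed):
    """F2 task: bind a single protease, with a druglikeness penalty."""
    return scores["F2"] + 10 * (1 - qed)

def f_ppar(scores, qed):
    """Promiscuous PPAR task: worst (max) score over the PPAR subtypes,
    so all three nuclear receptors must be bound strongly."""
    return max(scores[t] for t in ("PPARA", "PPARD", "PPARG")) + 10 * (1 - qed)

def f_jak2(scores, qed):
    """Selective JAK2 task: bind JAK2 while avoiding LCK (clipped at -8.1)."""
    return scores["JAK2"] - min(scores["LCK"], -8.1) + 10 * (1 - qed)
```

<p>The <code>max</code> in the PPAR objective and the subtracted, clipped LCK score in the JAK2 objective are what make those tasks promiscuous and selective, respectively.</p>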
<h2 id="experimental-setup-regression-virtual-screening-and-de-novo-design">Experimental Setup: Regression, Virtual Screening, and De Novo Design</h2>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>The dataset draws its molecules from ExCAPE-DB, which curates PubChem and ChEMBL bioactivity assays. The authors selected all molecules with active labels against targets having at least 1,000 experimental actives, plus 150,000 inactive-only molecules. After discarding the 1.8% of molecules that failed ligand preparation, the final dataset contains 260,155 compounds docked against all 58 targets, producing over 15 million docking scores and poses. Generating the dataset required over 500,000 CPU hours.</p>
<p>Cluster analysis using <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard distance</a> threshold of 0.25 on RDKit fingerprints) found 52,000 clusters, and Bemis-Murcko scaffold decomposition identified 102,000 scaffolds, confirming high molecular diversity. Train/test splitting follows cluster labels to prevent data leakage.</p>
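<p>The Jaccard distance underlying the clustering reduces to a one-liner when fingerprints are represented as sets of &ldquo;on&rdquo; bit indices; a minimal sketch (the set representation is an assumption for illustration):</p>

```python
# Jaccard distance between binary fingerprints, represented as sets of
# "on" bit indices; pairs within distance 0.25 may be merged by DBSCAN.

def jaccard_distance(fp_a, fp_b):
    if not fp_a and not fp_b:
        return 0.0  # two empty fingerprints are identical by convention
    return 1.0 - len(fp_a & fp_b) / len(fp_a | fp_b)
```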
<h3 id="regression-baselines">Regression Baselines</h3>
<p>Five targets of varying difficulty were selected: <a href="https://en.wikipedia.org/wiki/Poly_(ADP-ribose)_polymerase">PARP1</a> (easy), F2 (easy-medium), KIT (medium), ESR2 (hard), and PGR (hard). Baselines include Ridge, Lasso, XGBoost, exact GP, sparse GP, MPNN, and Attentive FP.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Ridge</th>
          <th>Lasso</th>
          <th>XGBoost</th>
          <th>GP (exact)</th>
          <th>GP (sparse)</th>
          <th>MPNN</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>0.640</td>
          <td>0.640</td>
          <td>0.734</td>
          <td>0.707</td>
          <td>0.716</td>
          <td>0.953</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.519</td>
          <td>0.483</td>
          <td>0.660</td>
          <td>0.640</td>
          <td>0.598</td>
          <td>0.901</td>
          <td>0.981</td>
      </tr>
      <tr>
          <td>ESR2</td>
          <td>0.421</td>
          <td>0.416</td>
          <td>0.497</td>
          <td>0.441</td>
          <td>0.508</td>
          <td>0.506</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>F2</td>
          <td>0.672</td>
          <td>0.663</td>
          <td>0.688</td>
          <td>0.705</td>
          <td>0.744</td>
          <td>0.798</td>
          <td>0.880</td>
      </tr>
      <tr>
          <td>KIT</td>
          <td>0.604</td>
          <td>0.594</td>
          <td>0.674</td>
          <td>0.637</td>
          <td>0.684</td>
          <td>0.755</td>
          <td>0.806</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>0.706</td>
          <td>0.700</td>
          <td>0.723</td>
          <td>0.743</td>
          <td>0.772</td>
          <td>0.815</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>0.242</td>
          <td>0.245</td>
          <td>0.345</td>
          <td>0.291</td>
          <td>0.387</td>
          <td>0.324</td>
          <td>0.678</td>
      </tr>
  </tbody>
</table>
<p>Values are mean $R^2$ over three runs. Attentive FP achieves the best performance on every target but remains well below perfect prediction on the harder targets, confirming that docking score regression is a meaningful benchmark.</p>
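<p>For reference, the reported metric is the standard coefficient of determination (this is the textbook definition, not code from the paper):</p>

```python
# Coefficient of determination R^2: fraction of variance in the docking
# scores explained by the model's predictions.

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot
```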
<h3 id="virtual-screening-baselines">Virtual Screening Baselines</h3>
<p>Models trained on PARP1, KIT, and PGR docking scores rank all molecules in <a href="/notes/chemistry/datasets/zinc-22/">ZINC20</a> (~1 billion compounds). The top 5,000 predictions are docked, and the enrichment factor (EF) is computed relative to a 0.1 percentile activity threshold.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Threshold (kcal/mol)</th>
          <th>FSS</th>
          <th>Ridge</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>KIT</td>
          <td>-10.7</td>
          <td>239.2</td>
          <td>451.6</td>
          <td>766.5</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>-12.1</td>
          <td>313.1</td>
          <td>325.9</td>
          <td>472.2</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>-10.1</td>
          <td>161.4</td>
          <td>120.5</td>
          <td>461.3</td>
      </tr>
  </tbody>
</table>
<p>The maximum possible EF is 1,000. Attentive FP substantially outperforms fingerprint similarity search (FSS) and Ridge regression across all targets.</p>
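<p>The enrichment-factor calculation itself is simple; a sketch, using the KIT threshold from the table and the 0.1th-percentile library hit fraction (0.001), which is what caps EF at 1,000:</p>

```python
# Enrichment factor: hit rate among the selected (docked) molecules
# divided by the library-wide hit rate. With a 0.1th-percentile activity
# threshold the library hit fraction is 0.001, so the maximum EF is
# 1 / 0.001 = 1,000. Default threshold is the KIT value from the table.

def enrichment_factor(selected_scores, threshold=-10.7,
                      library_hit_fraction=0.001):
    hits = sum(1 for s in selected_scores if s <= threshold)
    return (hits / len(selected_scores)) / library_hit_fraction
```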
<h3 id="de-novo-design-baselines">De Novo Design Baselines</h3>
<p>Four optimization methods were tested: <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> GA, <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, GP-BO with UCB acquisition ($\beta = 10$), and GP-BO with expected improvement (EI), each with a budget of 5,000 objective function evaluations. Without QED penalties, all methods easily surpass the best training set molecules but produce large, lipophilic, undrug-like compounds. With QED penalties, the tasks become substantially harder: GP-BO with EI is the only method that finds 25 molecules better than the training set across all three tasks.</p>
<p>The Selective JAK2 task proved hardest due to the high correlation between JAK2 and LCK scores. Pose analysis of the top de novo molecule revealed a dual binding mode: type V inhibitor behavior in JAK2 (binding distant N- and C-terminal lobe regions) and type I behavior in LCK (hinge-binding), suggesting a plausible selectivity mechanism.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ol>
<li>Docking scores are substantially harder to predict than logP or QED, making them more suitable for benchmarking high-performing ML models. Graph neural networks (Attentive FP) achieve near-perfect $R^2$ on logP but only 0.63-0.91 on docking targets.</li>
<li>In-distribution regression difficulty does not necessarily predict out-of-distribution virtual screening difficulty. PARP1 is easiest for regression, but KIT is easiest for virtual screening.</li>
<li>Adding a QED penalty to de novo design objectives transforms trivially solvable tasks into meaningful benchmarks. The adversarial Selective JAK2 objective, which exploits correlated docking scores, may be an effective way to avoid docking score biases toward large and lipophilic molecules.</li>
<li>Docking scores from related protein targets are highly correlated, supporting the biological meaningfulness of the dataset and enabling multiobjective and transfer learning tasks.</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>Docking scores are approximate heuristics. They use static binding sites and force fields with limited calibration for certain metal ions. DOCKSTRING benchmarks should not substitute for rational drug design and experimental validation.</li>
<li>The pipeline relies on AutoDock Vina specifically; other docking programs may produce different rankings.</li>
<li>Top de novo molecules for F2 and Promiscuous PPAR contain conjugated ring structures uncommon in successful drugs.</li>
<li>Platform support is primarily Linux, with noted scoring inconsistencies on macOS.</li>
</ul>
<p><strong>Future directions</strong> mentioned include multiobjective tasks (transfer learning, few-shot learning), improved objective functions for better pharmacokinetic properties and synthetic feasibility, and multifidelity optimization tasks combining docking with more expensive computational methods.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ligand source</td>
          <td>ExCAPE-DB (PubChem + ChEMBL)</td>
          <td>260,155 molecules</td>
          <td>Actives against 58 targets + 150K inactive-only</td>
      </tr>
      <tr>
          <td>Docking scores</td>
          <td>DOCKSTRING dataset</td>
          <td>15M+ scores and poses</td>
          <td>Full matrix across all molecule-target pairs</td>
      </tr>
      <tr>
          <td>Virtual screening library</td>
          <td>ZINC20</td>
          <td>~1 billion molecules</td>
          <td>Used for out-of-distribution evaluation</td>
      </tr>
      <tr>
          <td>Target structures</td>
          <td>DUD-E + PDB 6CM4 (DRD2)</td>
          <td>58 targets</td>
          <td>Kinases (22), enzymes (12), nuclear receptors (9), proteases (7), GPCRs (5), cytochromes (2), chaperone (1)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Docking engine</strong>: AutoDock Vina with default exhaustiveness (8), up to 9 binding modes, energy range of 3 kcal/mol</li>
<li><strong>Ligand preparation</strong>: Open Babel (protonation at pH 7.4), RDKit ETKDG (3D embedding), MMFF94 (force field refinement), Gasteiger charges</li>
<li><strong>Regression models</strong>: Ridge, Lasso, XGBoost (hyperparameters via 20-configuration random search with 5-fold CV), exact GP and sparse GP (Tanimoto kernel on fingerprints), MPNN, Attentive FP (DeepChem defaults, 10 epochs)</li>
<li><strong>Optimization</strong>: Graph GA (population 250, offspring 25, mutation rate 0.01), SELFIES GA (same population/offspring settings), GP-BO with UCB ($\beta = 10$) or EI (batch size 5, 1000 offspring, 25 generations per iteration)</li>
</ul>
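<p>The two GP-BO acquisition functions can be sketched for a minimization objective, given the GP posterior mean and standard deviation at a candidate molecule (standard textbook forms, not the paper's implementation; beta = 10 matches the UCB setting above):</p>

```python
import math

# Acquisition functions for GP-BO on a minimization objective; mu and
# sigma are the GP posterior mean and std at a candidate point.

def ucb(mu, sigma, beta=10.0):
    # Lower confidence bound: optimistic (low) estimates are preferred.
    return mu - beta * sigma

def expected_improvement(mu, sigma, f_best):
    # Expected amount by which the candidate improves on the incumbent.
    if sigma == 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (f_best - mu) * cdf + sigma * pdf
```

<p>A large beta weights the sigma term heavily, so UCB with beta = 10 is strongly exploratory, while EI concentrates evaluations near the current best objective value.</p>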
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Setting</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2$ (coefficient of determination)</td>
          <td>Regression</td>
          <td>Cluster-split train/test</td>
      </tr>
      <tr>
          <td>EF (enrichment factor)</td>
          <td>Virtual screening</td>
          <td>Top 5,000 from ZINC20, 0.1 percentile threshold</td>
      </tr>
      <tr>
          <td>Objective value trajectory</td>
          <td>De novo design</td>
          <td>5,000 function evaluation budget</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The dataset required over 500,000 CPU hours to compute, using the University of Cambridge Research Computing Service (EPSRC and DiRAC funded). Per-target docking takes approximately 15 seconds on 8 CPUs.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">DOCKSTRING Python package</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Wraps AutoDock Vina; available via conda-forge and PyPI</td>
      </tr>
      <tr>
          <td><a href="https://dockstring.github.io">DOCKSTRING dataset</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>15M+ docking scores and poses for 260K molecules x 58 targets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">Benchmark baselines</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Regression, virtual screening, and de novo design baseline implementations</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: García-Ortegón, M., Simm, G. N. C., Tripp, A. J., Hernández-Lobato, J. M., Bender, A., &amp; Bacallado, S. (2022). DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design. <em>Journal of Chemical Information and Modeling</em>, 62(15), 3486-3502. <a href="https://doi.org/10.1021/acs.jcim.1c01334">https://doi.org/10.1021/acs.jcim.1c01334</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://dockstring.github.io">DOCKSTRING Project Page</a></li>
<li><a href="https://github.com/dockstring/dockstring">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{garciaortegon2022dockstring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{DOCKSTRING}: Easy Molecular Docking Yields Better Benchmarks for Ligand Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Garc{\&#39;\i}a-Orteg{\&#39;o}n, Miguel and Simm, Gregor N. C. and Tripp, Austin J. and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and Bender, Andreas and Bacallado, Sergio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3486--3502}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.1c01334}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Tartarus: Realistic Inverse Molecular Design Benchmarks</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/tartarus-inverse-molecular-design/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/tartarus-inverse-molecular-design/</guid><description>Tartarus provides physics-based benchmark tasks for inverse molecular design spanning materials, drugs, and reactions with algorithm-domain dependencies.</description><content:encoded><![CDATA[<h2 id="a-resource-for-realistic-molecular-design-evaluation">A Resource for Realistic Molecular Design Evaluation</h2>
<p>This is a <strong>Resource</strong> paper. Its primary contribution is Tartarus, a modular benchmarking platform for inverse molecular design that provides physically grounded evaluation tasks across four application domains: organic photovoltaics, organic emitters, protein ligands, and chemical reaction substrates. Each task pairs a curated reference dataset with a computational simulation workflow that evaluates proposed molecular structures using established methods from computational chemistry (<a href="https://en.wikipedia.org/wiki/Force_field_(chemistry)">force fields</a>, semi-empirical quantum chemistry, <a href="https://en.wikipedia.org/wiki/Density_functional_theory">density functional theory</a>, and <a href="https://en.wikipedia.org/wiki/Docking_(molecular)">molecular docking</a>).</p>
<h2 id="the-problem-with-existing-molecular-design-benchmarks">The Problem with Existing Molecular Design Benchmarks</h2>
<p>Inverse molecular design, the challenge of crafting molecules with specific optimal properties, is central to drug, catalyst, and materials discovery. Many algorithms have been proposed for this task, but the benchmarks used to evaluate them have significant limitations:</p>
<ul>
<li><strong>Penalized logP</strong>, one of the most common benchmarks, depends heavily on molecule size and chain composition, limiting its informativeness.</li>
<li><strong>QED maximization</strong> has reached saturation, with numerous models achieving near-perfect scores.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong> often yields near-perfect scores across models, obscuring meaningful performance differences. <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">Gao et al. (2022)</a> traced this saturation to the assumption of unlimited property evaluations; once evaluation budgets are imposed, much larger performance disparities emerge.</li>
<li><strong>MOSES</strong> evaluates distribution-matching ability, but the emergence of <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> and simple algorithms has made these tasks relatively straightforward.</li>
<li><strong>Molecular docking</strong> benchmarks are gaining popularity, but tend to favor reactive or unstable molecules and typically cover only drug design.</li>
</ul>
<p>These benchmarks share a common weakness: they rely on cheap, approximate property estimators (often QSAR models or simple heuristics) rather than physics-based simulations. This makes them poor proxies for real molecular design campaigns, where properties must be validated through computational or experimental workflows. Tartarus addresses this by providing benchmark tasks grounded in established simulation methods.</p>
<h2 id="physics-based-simulation-workflows-as-benchmark-oracles">Physics-Based Simulation Workflows as Benchmark Oracles</h2>
<p>The core innovation in Tartarus is the use of computational chemistry simulation pipelines as objective functions for benchmarking. Rather than relying on learned property predictors, each benchmark task runs a full simulation workflow to evaluate proposed molecules:</p>
<ol>
<li><strong>Organic Photovoltaics (OPV)</strong>: Starting from a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string, the workflow generates 3D coordinates with Open Babel, performs conformer search with CREST at the GFN-FF level, optimizes geometry at GFN2-xTB, and computes <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO/LUMO</a> energies. Power conversion efficiency (PCE) is estimated via the Scharber model for single-junction <a href="https://en.wikipedia.org/wiki/Organic_solar_cell">organic solar cells</a>. HOMO and LUMO energies are calibrated against DFT results from the Harvard Clean Energy Project Database using <a href="https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator">Theil-Sen regression</a>:</li>
</ol>
<p>$$
E_{\text{HOMO, calibrated}} = E_{\text{HOMO, GFN2-xTB}} \cdot 0.8051 + 2.5377 \text{ eV}
$$</p>
<p>$$
E_{\text{LUMO, calibrated}} = E_{\text{LUMO, GFN2-xTB}} \cdot 0.8788 + 3.7913 \text{ eV}
$$</p>
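<p>The two calibration maps above, written as plain functions (coefficients are the quoted Theil-Sen fits; all energies in eV):</p>

```python
# GFN2-xTB -> DFT calibration of frontier orbital energies (eV), using
# the Theil-Sen regression coefficients quoted above.

def calibrate_homo(e_homo_xtb):
    return e_homo_xtb * 0.8051 + 2.5377

def calibrate_lumo(e_lumo_xtb):
    return e_lumo_xtb * 0.8788 + 3.7913
```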
<ol start="2">
<li>
<p><strong>Organic Emitters (OLED)</strong>: The workflow uses conformer search via CREST, geometry optimization at GFN0-xTB, and TD-DFT single-point calculations at the B3LYP/6-31G* level with PySCF to extract singlet-triplet gaps, <a href="https://en.wikipedia.org/wiki/Oscillator_strength">oscillator strengths</a>, and vertical excitation energies.</p>
</li>
<li>
<p><strong>Protein Ligands</strong>: The workflow generates 3D coordinates, applies structural filters (<a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a>, reactive moiety checks), and performs molecular docking using QuickVina2 with re-scoring via smina against three protein targets: 1SYH (ionotropic glutamate receptor), 6Y2F (<a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 main protease</a>), and 4LDE (beta-2 adrenoceptor).</p>
</li>
<li>
<p><strong>Chemical Reaction Substrates</strong>: The workflow models the intramolecular double hydrogen transfer in syn-sesquinorbornenes using the SEAM force field approach at the GFN-FF/GFN2-xTB level to compute activation and reaction energies.</p>
</li>
</ol>
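<p>The Rule-of-Five stage of the ligand workflow reduces to four descriptor thresholds; a minimal sketch, assuming the descriptors (molecular weight, logP, H-bond donor/acceptor counts) have already been computed with a cheminformatics toolkit such as RDKit:</p>

```python
# Lipinski Rule-of-Five check on precomputed descriptors; descriptor
# values would normally come from a toolkit such as RDKit.

def passes_lipinski(mol_weight, logp, h_donors, h_acceptors):
    return (mol_weight <= 500
            and logp <= 5
            and h_donors <= 5
            and h_acceptors <= 10)
```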
<p>Each benchmark also includes a curated reference dataset for training generative models and a standardized evaluation protocol: train on 80% of the dataset, use 20% for hyperparameter optimization, then optimize structures starting from the best reference molecule with a constrained budget of 5,000 proposed compounds, a 24-hour runtime cap, and five independent repetitions.</p>
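<p>The evaluation-budget constraint can be enforced by wrapping the simulation workflow in a counting oracle; a hypothetical harness (class and function names are illustrative, not part of the Tartarus API):</p>

```python
# Hypothetical oracle wrapper enforcing the 5,000-evaluation budget;
# `oracle` stands in for a physics-based simulation workflow.

class BudgetExhausted(Exception):
    pass

class BudgetedOracle:
    def __init__(self, oracle, budget=5000):
        self.oracle = oracle
        self.budget = budget
        self.calls = 0
        self.best = None  # (score, smiles) of the best proposal so far

    def __call__(self, smiles):
        if self.calls >= self.budget:
            raise BudgetExhausted(f"budget of {self.budget} evaluations spent")
        self.calls += 1
        score = self.oracle(smiles)
        if self.best is None or score > self.best[0]:
            self.best = (score, smiles)
        return score
```

<p>Any generative model can then be scored by the best objective value found before <code>BudgetExhausted</code> is raised, which keeps comparisons fair across sample-hungry and sample-efficient optimizers.</p>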
<h2 id="benchmark-tasks-datasets-and-model-comparisons">Benchmark Tasks, Datasets, and Model Comparisons</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>Eight generative models spanning major algorithm families were tested:</p>
<ul>
<li><strong>VAEs</strong>: SMILES-VAE and SELFIES-VAE</li>
<li><strong>Flow models</strong>: MoFlow</li>
<li><strong>Reinforcement learning</strong>: <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></li>
<li><strong>LSTM-based hill climbing</strong>: SMILES-LSTM-HC and SELFIES-LSTM-HC</li>
<li><strong>Genetic algorithms</strong>: <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">GB-GA</a> and JANUS</li>
</ul>
<h3 id="organic-photovoltaics-results">Organic Photovoltaics Results</h3>
<p>The reference dataset (CEP_SUB) contains approximately 25,000 molecules from the Harvard Clean Energy Project Database. Two objectives combine PCE with synthetic accessibility (SAscore):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>PCE_PCBM - SAscore</th>
          <th>PCE_PCDTBT - SAscore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>7.57</td>
          <td>31.71</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>7.44 +/- 0.28</td>
          <td>10.23 +/- 11.14</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>7.05 +/- 0.66</td>
          <td>29.24 +/- 0.65</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>7.08 +/- 0.31</td>
          <td>29.81 +/- 0.37</td>
      </tr>
      <tr>
          <td>SMILES-LSTM-HC</td>
          <td>6.69 +/- 0.40</td>
          <td>31.79 +/- 0.15</td>
      </tr>
      <tr>
          <td>SELFIES-LSTM-HC</td>
          <td>7.40 +/- 0.41</td>
          <td>30.71 +/- 1.20</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>7.48 +/- 0.11</td>
          <td>30.47 +/- 0.44</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>7.78 +/- 0.02</td>
          <td>30.24 +/- 0.80</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>7.59 +/- 0.14</td>
          <td>31.34 +/- 0.74</td>
      </tr>
  </tbody>
</table>
<p>GB-GA achieves the best score on the first task (7.78), while SMILES-LSTM-HC leads on the second (31.79). Most models can marginally improve PCE but struggle to simultaneously improve PCE and reduce SAscore.</p>
<h3 id="organic-emitters-results">Organic Emitters Results</h3>
<p>The reference dataset (GDB-13_SUB) contains approximately 380,000 molecules filtered for conjugated pi-systems from <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>. Three objectives target singlet-triplet gap minimization, oscillator strength maximization, and a combined multi-objective:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Delta E(S1-T1)</th>
          <th>f12</th>
          <th>Multi-objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>0.020</td>
          <td>2.97</td>
          <td>-0.04</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>0.071 +/- 0.003</td>
          <td>0.50 +/- 0.27</td>
          <td>-0.57 +/- 0.33</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>0.016 +/- 0.001</td>
          <td>0.36 +/- 0.31</td>
          <td>0.17 +/- 0.10</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>0.013 +/- 0.001</td>
          <td>0.81 +/- 0.11</td>
          <td>-0.04 +/- 0.06</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>0.012 +/- 0.002</td>
          <td>2.14 +/- 0.45</td>
          <td>0.07 +/- 0.03</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>0.008 +/- 0.001</td>
          <td>2.07 +/- 0.16</td>
          <td>0.02 +/- 0.05</td>
      </tr>
  </tbody>
</table>
<p>Only JANUS, GB-GA, and SELFIES-VAE generate compounds comparable to or improving upon the best training molecules. JANUS achieves the lowest singlet-triplet gap (0.008 eV), while SELFIES-VAE achieves the highest multi-objective fitness (0.17). Some proposed structures contain reactive moieties, likely because stability is not explicitly penalized in the objective functions.</p>
<h3 id="protein-ligand-results">Protein Ligand Results</h3>
<p>The reference dataset contains approximately 152,000 molecules from the DTP Open Compound Collection, filtered for drug-likeness. Docking is performed against three protein targets using both QuickVina2 and smina re-scoring:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1SYH (smina)</th>
          <th>6Y2F (smina)</th>
          <th>4LDE (smina)</th>
          <th>SR (1SYH)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>-10.2</td>
          <td>-8.2</td>
          <td>-13.1</td>
          <td>100.0%</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>-10.4 +/- 0.6</td>
          <td>-8.9 +/- 0.8</td>
          <td>-11.1 +/- 0.4</td>
          <td>12.3%</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>-10.9 +/- 0.3</td>
          <td>-10.1 +/- 0.4</td>
          <td>-11.9 +/- 0.2</td>
          <td>34.8%</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>-12.1 +/- 0.2</td>
          <td>-11.4 +/- 0.3</td>
          <td>-13.7 +/- 0.5</td>
          <td>77.8%</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>-12.0 +/- 0.2</td>
          <td>-11.0 +/- 0.2</td>
          <td>-13.8 +/- 0.4</td>
          <td>72.6%</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>-11.9 +/- 0.2</td>
          <td>-11.9 +/- 0.4</td>
          <td>-13.6 +/- 0.5</td>
          <td>68.4%</td>
      </tr>
  </tbody>
</table>
<p>No single model consistently achieves the best docking score across all three targets. REINVENT leads on 1SYH, JANUS on 6Y2F, and GB-GA on 4LDE. Both VAE models show low success rates for structural filter compliance (12-39%), while REINVENT, GAs, and LSTMs achieve 68-78%.</p>
<h3 id="chemical-reaction-substrates-results">Chemical Reaction Substrates Results</h3>
<p>The reference dataset (SNB-60K) contains approximately 60,000 syn-sesquinorbornene derivatives generated via <a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED-SELFIES</a> mutations. Four objectives target activation energy, reaction energy, and two combined metrics:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Delta E(activation)</th>
          <th>Delta E(reaction)</th>
          <th>Delta E(act) + Delta E(rxn)</th>
          <th>-Delta E(act) + Delta E(rxn)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>64.94</td>
          <td>-34.39</td>
          <td>56.48</td>
          <td>-95.25</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>76.81 +/- 0.25</td>
          <td>-10.96 +/- 0.71</td>
          <td>71.01 +/- 0.62</td>
          <td>-90.94 +/- 1.04</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>70.12 +/- 2.13</td>
          <td>-20.21 +/- 4.13</td>
          <td>63.21 +/- 0.69</td>
          <td>-92.82 +/- 3.06</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>56.04 +/- 3.07</td>
          <td>-41.39 +/- 5.76</td>
          <td>45.20 +/- 6.78</td>
          <td>-100.07 +/- 1.35</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>47.56 +/- 2.19</td>
          <td>-45.37 +/- 7.90</td>
          <td>39.22 +/- 3.99</td>
          <td>-97.14 +/- 1.13</td>
      </tr>
  </tbody>
</table>
<p>Only JANUS and GB-GA consistently outperform the best reference compounds. Both VAE models fail to surpass the dataset baseline on any objective. JANUS achieves the best single-objective scores for activation energy (47.56) and reaction energy (-45.37), and the best combined score (39.22).</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="central-finding-algorithm-performance-is-domain-dependent">Central Finding: Algorithm Performance is Domain-Dependent</h3>
<p>The most important result from Tartarus is that no single generative model consistently outperforms the others across all benchmark domains. This has several implications:</p>
<ul>
<li><strong>Genetic algorithms (GB-GA and JANUS) show the most consistently strong performance</strong> across benchmarks, despite being among the simplest approaches and requiring minimal pre-conditioning time (seconds vs. hours for deep models).</li>
<li><strong>VAE-based models (SMILES-VAE and SELFIES-VAE) show the weakest overall performance</strong>, often failing to surpass the best molecules in the reference datasets. Their reliance on the available training data appears to limit their effectiveness.</li>
<li><strong>REINVENT performs competitively on protein ligand tasks</strong> but shows weaker performance on other benchmarks.</li>
<li><strong>Representation matters</strong>: SELFIES-based models generally outperform their SMILES-based counterparts (e.g., SELFIES-VAE vs. SMILES-VAE), consistent with SELFIES providing 100% validity guarantees.</li>
</ul>
<h3 id="timing-analysis">Timing Analysis</h3>
<p>Training time varies dramatically across models. Both VAEs require over 9 hours of GPU training, with estimated CPU-only training times of approximately 25 days. REINVENT and MoFlow train in under 1 hour. Both GAs complete pre-conditioning in seconds and require no GPU.</p>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ul>
<li>Benchmark domains covered are not comprehensive and need expansion.</li>
<li>3D generative models are not well supported, as proposed conformers are ignored in favor of simulation-derived geometries.</li>
<li>The chemical reaction substrate benchmark requires specialized geometries (reactant, product, transition state) that most 3D generative models cannot produce.</li>
<li>Results depend heavily on both model hyperparameters and benchmark settings (compute budget, number of evaluations).</li>
<li>Objective functions may need revision when undesired structures are promoted.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OPV Training</td>
          <td>CEP_SUB (Harvard Clean Energy Project subset)</td>
          <td>~25,000 molecules</td>
          <td>From HIPS/neural-fingerprint repository</td>
      </tr>
      <tr>
          <td>Emitter Training</td>
          <td>GDB-13_SUB (filtered GDB-13)</td>
          <td>~380,000 molecules</td>
          <td>Conjugated pi-system filter applied</td>
      </tr>
      <tr>
          <td>Ligand Training</td>
          <td>DTP Open Compound Collection (filtered)</td>
          <td>~152,000 molecules</td>
          <td>Drug-likeness and structural filters applied</td>
      </tr>
      <tr>
          <td>Reaction Training</td>
          <td>SNB-60K (STONED-SELFIES mutations)</td>
          <td>~60,000 molecules</td>
          <td>Generated from syn-sesquinorbornene core</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>All eight algorithms are implemented in the Tartarus repository with configuration files and installation instructions. The evaluation protocol specifies an 80/20 train/validation split, a population size of 5,000, a 24-hour runtime cap, and five independent runs per model.</p>
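<p>The protocol can be captured as a small configuration sketch (the key names below are ours, not the repository's):</p>

```python
# Illustrative configuration mirroring the Tartarus evaluation protocol;
# key names are invented for this sketch.
PROTOCOL = {
    "train_val_split": (0.8, 0.2),   # 80/20 train/validation
    "population_size": 5_000,
    "runtime_cap_hours": 24,
    "independent_runs": 5,
}

def split_sizes(n_molecules, split=PROTOCOL["train_val_split"]):
    """Sizes of the train/validation partitions for a reference dataset."""
    n_train = int(n_molecules * split[0])
    return n_train, n_molecules - n_train
```

<p>For example, the ~25,000-molecule CEP_SUB dataset would split into 20,000 training and 5,000 validation molecules under this protocol.</p>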
<h3 id="models">Models</h3>
<p>Pre-trained model checkpoints are not provided. Training must be performed from scratch using the provided reference datasets and hyperparameter configurations documented in the Supporting Information.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Properties are evaluated through physics-based simulation workflows (not learned surrogates). Each workflow accepts a SMILES string and returns computed properties. Key software dependencies include: Open Babel, CREST, xTB, PySCF, QuickVina2, smina, and RDKit.</p>
<h3 id="hardware">Hardware</h3>
<p>Training and sampling benchmarks were conducted using 24 CPU cores (AMD Rome 7532 @ 2.40 GHz) and a single Tesla A100 GPU. Simulations were run on the Beluga, Narval, Niagara, Cedar, and Sherlock supercomputing clusters.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Tartarus">Tartarus GitHub</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Benchmark tasks, simulation workflows, model configs</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/badge/latestdoi/444879123">Zenodo Archive</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Reference datasets for all four benchmark domains</td>
      </tr>
      <tr>
          <td><a href="https://discord.gg/KypwPXTY2s">Discord Community</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Discussion and collaboration channel</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nigam, A., Pollice, R., Tom, G., Jorner, K., Willes, J., Thiede, L. A., Kundaje, A., &amp; Aspuru-Guzik, A. (2023). Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design. <em>Advances in Neural Information Processing Systems 36</em>, 3263-3306.</p>
<p><strong>Publication</strong>: NeurIPS 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/Tartarus">Tartarus GitHub Repository</a></li>
<li><a href="https://zenodo.org/badge/latestdoi/444879123">Zenodo Dataset Archive</a></li>
<li><a href="https://discord.gg/KypwPXTY2s">Discord Community</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{nigam2023tartarus,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nigam, AkshatKumar and Pollice, Robert and Tom, Gary and Jorner, Kjell and Willes, John and Thiede, Luca A. and Kundaje, Anshul and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3263--3306}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMINA Docking Benchmark for De Novo Drug Design Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/</guid><description>A docking-based benchmark for evaluating de novo drug design generative models, using SMINA scoring across eight protein targets from ChEMBL.</description><content:encoded><![CDATA[<h2 id="a-docking-based-benchmark-for-de-novo-drug-design">A Docking-Based Benchmark for De Novo Drug Design</h2>
<p>This is a <strong>Resource</strong> paper. Its primary contribution is a standardized benchmark for evaluating generative models in de novo drug design. Rather than introducing a new generative method, the paper provides a reusable evaluation framework built around molecular docking, a widely used computational proxy for predicting protein-ligand binding. The benchmark uses SMINA (a fork of <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>) to score generated molecules against eight protein targets, offering a more realistic evaluation than commonly used proxy metrics like logP or QED.</p>
<h2 id="why-existing-benchmarks-fall-short">Why Existing Benchmarks Fall Short</h2>
<p>De novo drug design methods are typically evaluated using simple proxy tasks that do not reflect the complexity of real drug discovery. The octanol-water partition coefficient (logP) can be trivially optimized by producing unrealistic molecules. The QED drug-likeness score suffers from the same issue. Neural network-based bioactivity predictors are similarly exploitable.</p>
<p>As Coley et al. (2020) note: &ldquo;The current evaluations for generative models do not reflect the complexity of real discovery problems.&rdquo;</p>
<p>More realistic evaluation approaches exist in adjacent domains (photovoltaics, excitation energies), where physical calculations are used to both train and evaluate models. Yet de novo drug design has largely relied on the same simplistic proxies. This gap between proxy task performance and real-world utility motivates the development of a docking-based benchmark that, while still a proxy, captures more of the structural complexity involved in protein-ligand interactions.</p>
<h2 id="benchmark-design-smina-docking-with-the-vinardo-scoring-function">Benchmark Design: SMINA Docking with the Vinardo Scoring Function</h2>
<p>The benchmark is defined by three components: (1) docking software that computes a ligand&rsquo;s pose in the binding site, (2) a scoring function that evaluates the pose, and (3) a training set of compounds with precomputed docking scores.</p>
<p>The concrete instantiation uses SMINA v. 2017.11.9 with the Vinardo scoring function:</p>
<p>$$S = -0.045 \cdot G + 0.8 \cdot R - 0.035 \cdot H - 0.6 \cdot B$$</p>
<p>where $S$ is the docking score, $G$ is the gauss term, $R$ is repulsion, $H$ is the hydrophobic term, and $B$ is the non-directional hydrogen bond term. The gauss and repulsion terms measure steric interactions between the ligand and the protein, while the hydrophobic and hydrogen bond terms capture favorable non-covalent contacts.</p>
<p>The benchmark includes three task variants:</p>
<ol>
<li><strong>Docking Score Function</strong>: Optimize the full Vinardo docking score (lower is better).</li>
<li><strong>Repulsion</strong>: Minimize only the repulsion component, defined as:</li>
</ol>
<p>$$
R(a_1, a_2) = \begin{cases}
d(a_1, a_2)^2 &amp; d(a_1, a_2) &lt; 0 \\
0 &amp; \text{otherwise}
\end{cases}
$$</p>
<p>where $d(a_1, a_2)$ is the inter-atomic distance minus the sum of <a href="https://en.wikipedia.org/wiki/Van_der_Waals_radius">van der Waals radii</a>.</p>
<ol start="3">
<li><strong>Hydrogen Bonding</strong>: Maximize the hydrogen bond term:</li>
</ol>
<p>$$
B(a_1, a_2) = \begin{cases}
0 &amp; (a_1, a_2) \text{ do not form H-bond} \\
1 &amp; d(a_1, a_2) &lt; -0.6 \\
0 &amp; d(a_1, a_2) \geq 0 \\
\frac{d(a_1, a_2)}{-0.6} &amp; \text{otherwise}
\end{cases}
$$</p>
<p>Scores are averaged over the top 5 binding poses for stability. Generated compounds are filtered by <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and a minimum molecular weight of 100. Each model must generate 250 unique molecules per target.</p>
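<p>The composite score and the two piecewise terms above can be written out directly. The following is an illustrative sketch of the per-atom-pair terms and the weighted combination (function names are ours; this is not a docking engine, just the arithmetic):</p>

```python
def repulsion_term(d):
    """Repulsion R(a1, a2): squared overlap when the surface distance d
    (inter-atomic distance minus the sum of vdW radii) is negative."""
    return d * d if d < 0 else 0.0

def hbond_term(d, forms_hbond=True):
    """Non-directional hydrogen bond term B(a1, a2): 0 outside an H-bond,
    ramping linearly from 0 at d = 0 to 1 at d = -0.6."""
    if not forms_hbond:
        return 0.0
    if d < -0.6:
        return 1.0
    if d >= 0:
        return 0.0
    return d / -0.6

def vinardo_score(gauss, repulsion, hydrophobic, hbond):
    """Weighted Vinardo combination; lower (more negative) is better."""
    return -0.045 * gauss + 0.8 * repulsion - 0.035 * hydrophobic - 0.6 * hbond

# The benchmark averages over the top 5 binding poses for stability:
poses = [-10.2, -10.0, -9.8, -9.7, -9.5]
mean_top5 = sum(poses) / len(poses)
```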
<p>Training data comes from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, covering eight drug targets: 5-HT1B, 5-HT2B, ACM2, CYP2D6, ADRB1, MOR, A2A, and D2. Dataset sizes range from 1,082 (ADRB1) to 10,225 (MOR) molecules.</p>
<h2 id="experimental-evaluation-of-three-generative-models">Experimental Evaluation of Three Generative Models</h2>
<h3 id="models-tested">Models Tested</h3>
<p>Three popular generative models were evaluated:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">CVAE</a></strong> (Chemical Variational Autoencoder): A VAE operating on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GVAE</a></strong> (Grammar Variational Autoencoder): Extends CVAE by enforcing grammatical correctness of generated SMILES.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></strong>: A recurrent neural network trained first on ChEMBL in a supervised manner, then fine-tuned with reinforcement learning using docking scores as rewards.</li>
</ul>
<p>For CVAE and GVAE, molecules are generated by sampling from the latent space and taking 50 gradient steps to optimize an MLP that predicts the docking score. For REINVENT, a random forest model predicts docking scores from ECFP fingerprints, and the reward combines this prediction with the QED score.</p>
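<p>The latent-space optimization step can be illustrated with a toy differentiable surrogate. Here a hand-coded quadratic (with an analytic gradient) stands in for the MLP docking-score predictor; the real setup backpropagates through a trained network:</p>

```python
def predicted_score(z):
    """Toy stand-in for the MLP docking-score predictor,
    minimized at z = (1.0, -2.0)."""
    return (z[0] - 1.0) ** 2 + (z[1] + 2.0) ** 2

def gradient(z):
    """Analytic gradient of the toy surrogate."""
    return [2.0 * (z[0] - 1.0), 2.0 * (z[1] + 2.0)]

# 50 gradient steps in latent space, mirroring the benchmark protocol
# (lower predicted docking score is better, so we descend).
z = [0.0, 0.0]
lr = 0.1
for _ in range(50):
    g = gradient(z)
    z = [z[0] - lr * g[0], z[1] - lr * g[1]]
```

<p>After 50 steps the latent point has essentially converged to the surrogate's minimum; decoding such an optimized point is what produces the candidate molecule in the CVAE/GVAE pipeline.</p>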
<h3 id="baselines">Baselines</h3>
<p>Two baselines provide context:</p>
<ul>
<li><strong>Training set</strong>: The top 50%, 10%, and 1% of docking scores from the ChEMBL training set.</li>
<li><strong><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a> subset</strong>: A random sample of ~9.2 million drug-like molecules from ZINC, with the same percentile breakdowns.</li>
</ul>
<p>Diversity is measured as the mean <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto distance</a> (using 1024-bit ECFP with radius 2) between all pairs of generated molecules.</p>
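<p>As a sketch of this diversity metric (pure Python, with sets of on-bit indices standing in for RDKit's 1024-bit ECFP vectors):</p>

```python
from itertools import combinations

def tanimoto_distance(fp_a, fp_b):
    """1 - Jaccard/Tanimoto similarity between two fingerprints,
    each given as a set of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / union

def mean_pairwise_diversity(fingerprints):
    """Mean Tanimoto distance over all molecule pairs; higher = more diverse."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

# Three toy fingerprints: two similar, one disjoint from both.
fps = [{1, 5, 9}, {1, 5, 20}, {30, 40, 50}]
```

<p>With these toy fingerprints the mean pairwise distance is (0.5 + 1.0 + 1.0) / 3 ≈ 0.83, in the range of the diverse baselines reported above; a mode-collapsed generator would push this value toward 0.</p>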
<h3 id="key-results">Key Results</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>5-HT1B Score</th>
          <th>5-HT1B Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking Score</td>
          <td>CVAE</td>
          <td>-4.647</td>
          <td>0.907</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>GVAE</td>
          <td>-4.955</td>
          <td>0.901</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>REINVENT</td>
          <td>-9.774</td>
          <td>0.506</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>ZINC (10%)</td>
          <td>-9.894</td>
          <td>0.862</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>ZINC (1%)</td>
          <td>-10.496</td>
          <td>0.861</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>Train (10%)</td>
          <td>-10.837</td>
          <td>0.749</td>
      </tr>
  </tbody>
</table>
<p>On the full docking score task, CVAE and GVAE fail to match even the mean ZINC docking score. REINVENT performs substantially better (e.g., -9.774 on 5-HT1B) but still falls short of the top 10% ZINC scores (-9.894) in most cases. The exception is ACM2, where REINVENT&rsquo;s score (-9.775) exceeds the ZINC 10% threshold (-8.282).</p>
<p>On the repulsion task, all three models fail to outperform the top 10% ZINC scores. On the hydrogen bonding task (the easiest), GVAE and REINVENT nearly match the top 1% ZINC scores, suggesting that optimizing individual scoring components is more tractable than the full docking score.</p>
<p>A consistent finding across all experiments is that REINVENT generates substantially less diverse molecules than the training set (e.g., 0.506 vs. 0.787 mean Tanimoto distance on 5-HT1B). The t-SNE visualizations show generated molecules clustering in a single dense region, separate from the training data, regardless of optimization target.</p>
<p>The paper also notes a moderately strong correlation between docking scores and molecular weight or the number of rotatable bonds. Generated compounds achieve better docking scores at the same molecular weight after optimization, suggesting the models learn some structural preferences rather than simply exploiting molecular size.</p>
<h2 id="limitations-of-current-generative-models-for-drug-design">Limitations of Current Generative Models for Drug Design</h2>
<p>The main finding is negative: popular generative models for de novo drug design struggle to generate molecules that dock well when trained on realistically sized datasets (1,000 to 10,000 compounds). Even the best-performing model (REINVENT) generally cannot outperform the top 10% of a random ZINC subset on the full docking score task.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Docking is itself a proxy</strong>: The SMINA docking score is only an approximation of true binding affinity. The fact that even this simpler proxy is challenging should raise concerns about these models&rsquo; readiness for real drug discovery pipelines.</li>
<li><strong>Limited model selection</strong>: Only three models were tested (CVAE, GVAE, REINVENT). The authors note that CVAE and GVAE were not designed for small training sets, and REINVENT may not represent the state of the art in all respects.</li>
<li><strong>ML-based scoring surrogate</strong>: All models use an ML model (MLP or random forest) to predict docking scores during generation, rather than running SMINA directly. This introduces an additional approximation layer.</li>
<li><strong>No similarity constraints</strong>: The benchmark does not impose constraints on the distance between generated and training molecules. A trivial baseline is to simply return the training set.</li>
</ul>
<p>On a more positive note, the tested models perform well on the simplest subtask (hydrogen bonding), suggesting that optimizing docking scores from limited data is attainable but challenging. The benchmark has already been adopted by other groups, notably Nigam et al. (2021) for evaluating their JANUS genetic algorithm.</p>
<p>Future directions include adding similarity constraints, extending to additional protein targets, and using the benchmark to evaluate newer structure-based generative models that employ equivariant neural networks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ChEMBL (8 targets)</td>
          <td>1,082-10,225 molecules per target</td>
          <td>90/10 train/test split</td>
      </tr>
      <tr>
          <td>Baseline</td>
          <td>ZINC 15 subset</td>
          <td>~9.2M drug-like molecules</td>
          <td>In-stock, standard reactivity, drug-like</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td><a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a></td>
          <td>8 structures</td>
          <td>Cleaned with Schrodinger modeling package</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>CVAE/GVAE: Fine-tuned 5 epochs on target data, then 50 gradient steps in latent space to optimize MLP-predicted score</li>
<li>REINVENT: Pretrained on ChEMBL, fine-tuned with RL; reward = random forest prediction * QED score</li>
<li>All docking performed with SMINA v. 2017.11.9 using Vinardo scoring function in score_only mode</li>
<li>Scores averaged over top 5 binding poses</li>
<li>Filtering: Lipinski Rule of Five, minimum molecular weight 100</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean docking score</td>
          <td>Average over 250 generated molecules</td>
          <td>Lower is better for docking score and repulsion</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Mean Tanimoto distance (ECFP, r=2)</td>
          <td>Higher is more diverse</td>
      </tr>
      <tr>
          <td>ZINC percentile baselines</td>
          <td>Top 50%, 10%, 1% from random ZINC subset</td>
          <td>Task considered &ldquo;solved&rdquo; if generated score exceeds ZINC 1%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/cieplinski-tobiasz/smina-docking-benchmark">smina-docking-benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmark code, data, evaluation notebooks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cieplinski, T., Danel, T., Podlewska, S., &amp; Jastrzebski, S. (2023). Generative Models Should at Least Be Able to Design Molecules That Dock Well: A New Benchmark. <em>Journal of Chemical Information and Modeling</em>, 63(11), 3238-3247. <a href="https://doi.org/10.1021/acs.jcim.2c01355">https://doi.org/10.1021/acs.jcim.2c01355</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/cieplinski-tobiasz/smina-docking-benchmark">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cieplinski2023generative,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generative Models Should at Least Be Able to Design Molecules That Dock Well: A New Benchmark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cieplinski, Tobiasz and Danel, Tomasz and Podlewska, Sabina and Jastrzebski, Stanislaw}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3238--3247}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01355}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGenSurvey: Systematic Survey of ML for Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgensurvey-molecule-design/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgensurvey-molecule-design/</guid><description>Survey of ML molecule design methods across 1D string, 2D graph, and 3D geometry representations with deep generative and optimization approaches.</description><content:encoded><![CDATA[<h2 id="a-taxonomy-for-ml-driven-molecule-design">A Taxonomy for ML-Driven Molecule Design</h2>
<p>This is a <strong>Systematization</strong> paper that reviews machine learning approaches for molecule design across all three major molecular representations (1D string, 2D graph, 3D geometry) and both deep generative and combinatorial optimization paradigms. Prior surveys (including <a href="/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/">Sánchez-Lengeling &amp; Aspuru-Guzik, 2018</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/">Elton et al., 2019</a>, Xue et al. 2019, Vanhaelen et al. 2020, Alshehri et al. 2020, Jiménez-Luna et al. 2020, and Axelrod et al. 2022) each covered subsets of the literature (e.g., only generative methods, or only specific task types). MolGenSurvey extends these by unifying the field into a single taxonomy based on input type, output type, and generation goal, identifying eight distinct molecule generation tasks. It catalogs over 100 methods across these categories and provides a structured comparison of evaluation metrics, datasets, and experimental setups.</p>
<p>The chemical space of drug-like molecules is estimated at $10^{23}$ to $10^{60}$, making exhaustive enumeration computationally infeasible. Traditional high-throughput screening searches existing databases but is slow and expensive. ML-based generative approaches offer a way to intelligently explore this space, either by learning continuous latent representations (deep generative models) or by directly searching the discrete chemical space (combinatorial optimization methods).</p>
<h2 id="molecular-representations">Molecular Representations</h2>
<p>The survey identifies three mainstream featurization approaches for molecules, each carrying different tradeoffs for generation tasks.</p>
<h3 id="1d-string-descriptions">1D String Descriptions</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> are the two dominant string representations. SMILES encodes molecules as character strings following grammar rules for bonds, branches, and ring closures. Its main limitation is that arbitrary strings are often chemically invalid. SELFIES augments the encoding rules for branches and rings to achieve 100% validity by construction.</p>
<p>Other string representations exist (InChI, SMARTS) but are less commonly used for generation. Representation learning over strings has adopted CNNs, RNNs, and Transformers from NLP.</p>
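<p>To make the validity point concrete: full SMILES validation requires a cheminformatics toolkit such as RDKit, but even two purely syntactic conditions (balanced branch parentheses, paired single-digit ring closures) already rule out many arbitrary strings. The toy check below is a necessary-but-not-sufficient sketch, not a parser:</p>

```python
def plausible_smiles(s):
    """Toy syntactic check on a SMILES-like string: branch parentheses
    must balance, and each ring-closure digit must occur an even number
    of times (one open + one close). Real validity needs full chemistry."""
    depth = 0
    ring_counts = {}
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            ring_counts[ch] = ring_counts.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in ring_counts.values())
```

<p>Strings like <code>CC(C</code> or <code>C1CC</code> fail even this weak check, which is exactly the failure mode SELFIES eliminates by making every string decodable to a valid molecule.</p>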
<h3 id="2d-molecular-graphs">2D Molecular Graphs</h3>
<p>Molecules naturally map to graphs where atoms are nodes and bonds are edges. Graph neural networks (GNNs), particularly those following the message-passing neural network (MPNN) framework, have become the standard representation method. The MPNN updates each node&rsquo;s representation by aggregating messages from its immediate neighbors; stacking $K$ such rounds gives each node a $K$-hop receptive field. Notable architectures include D-MPNN (directional message passing), PNA (diverse aggregation methods), AttentiveFP (attention-based), and Graphormer (transformer-based).</p>
<h3 id="3d-molecular-geometry">3D Molecular Geometry</h3>
<p>Molecules are inherently 3D objects with conformations (3D structures at local energy minima) that determine function. Representing 3D geometry requires models that respect E(3) or SE(3) equivariance (invariance to rotation and translation). The survey catalogs architectures along this line including SchNet, DimeNet, EGNN, SphereNet, and PaiNN.</p>
<p>Additional featurization methods (molecular fingerprints/descriptors, 3D density maps, 3D surface meshes, and chemical images) are noted but have seen limited use in generation tasks.</p>
<h2 id="deep-generative-models">Deep Generative Models</h2>
<p>The survey covers six families of deep generative models applied to molecule design.</p>
<h3 id="autoregressive-models-ars">Autoregressive Models (ARs)</h3>
<p>ARs factorize the joint distribution of a molecule as a product of conditional distributions over its subcomponents:</p>
<p>$$p(\boldsymbol{x}) = \prod_{i=1}^{d} p(x_i \mid x_1, x_2, \ldots, x_{i-1})$$</p>
<p>For molecular graphs, this means sequentially predicting the next atom or bond conditioned on the partial structure built so far. RNNs, Transformers, and BERT-style models all implement this paradigm.</p>
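<p>The factorization above can be made concrete with a toy bigram model over a three-token vocabulary. The table of conditionals below is invented for illustration; a real chemical language model conditions on the entire prefix via an RNN or Transformer rather than only the previous token.</p>

```python
import math

# A toy autoregressive model over a tiny SMILES-like vocabulary.
# p(x) = prod_i p(x_i | x_{i-1}), with "^" as start and "$" as end token.
# Each row of conditionals sums to 1.
cond = {
    ("^", "C"): 0.7, ("^", "O"): 0.3,
    ("C", "C"): 0.5, ("C", "O"): 0.2, ("C", "$"): 0.3,
    ("O", "C"): 0.6, ("O", "$"): 0.4,
}

def log_prob(tokens):
    """Sum log conditionals from the start token to the end token."""
    total = 0.0
    prev = "^"
    for t in tokens + ["$"]:
        total += math.log(cond[(prev, t)])
        prev = t
    return total

# p("CO") = p(C|^) * p(O|C) * p($|O) = 0.7 * 0.2 * 0.4
print(round(math.exp(log_prob(["C", "O"])), 4))  # 0.056
```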
<h3 id="variational-autoencoders-vaes">Variational Autoencoders (VAEs)</h3>
<p>VAEs learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$\log p(\boldsymbol{x}) \geq \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}[\log p(\boldsymbol{x}|\boldsymbol{z})] - D_{KL}(q(\boldsymbol{z}|\boldsymbol{x}) \| p(\boldsymbol{z}))$$</p>
<p>The first term is the reconstruction objective, and the second is a KL-divergence regularizer encouraging diverse, disentangled latent codes. Key molecular VAEs include <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">ChemVAE</a> (SMILES-based), JT-VAE (junction tree graphs), and <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GrammarVAE</a> (grammar-constrained SMILES).</p>
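<p>For the common choice of a diagonal Gaussian encoder and a standard normal prior, the KL regularizer in the ELBO has a closed form, sketched here for a two-dimensional latent code.</p>

```python
import math

def kl_diag_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ):
    0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))

# At the prior (mu = 0, var = 1) the KL vanishes;
# shifting one mean component by 1 costs 0.5.
print(kl_diag_gaussian([0.0, 0.0], [0.0, 0.0]))  # 0.0
print(kl_diag_gaussian([1.0, 0.0], [0.0, 0.0]))  # 0.5
```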
<h3 id="normalizing-flows-nfs">Normalizing Flows (NFs)</h3>
<p>NFs model $p(\boldsymbol{x})$ via an invertible, deterministic mapping between data and latent space, using the change-of-variable formula with Jacobian determinants. Molecular applications include GraphNVP, MoFlow (one-shot graph generation), GraphAF (autoregressive flow), and GraphDF (discrete flow).</p>
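<p>A minimal sketch of the change-of-variable formula, using a 1D affine flow $x = az + b$ where the Jacobian determinant reduces to the scalar $|a|$; flows like GraphNVP replace it with the log-determinant of a full Jacobian.</p>

```python
import math

def log_prob_flow(x, a=2.0, b=1.0):
    """log p(x) = log N(z; 0, 1) - log|a| with z = (x - b) / a,
    i.e. the change-of-variable formula for an invertible affine map."""
    z = (x - b) / a
    log_base = -0.5 * (z * z + math.log(2 * math.pi))
    return log_base - math.log(abs(a))

def log_normal(x, mean, std):
    """Direct Gaussian log-density, for checking the flow."""
    return -0.5 * (((x - mean) / std) ** 2 + math.log(2 * math.pi)) - math.log(std)

# The affine flow of a standard normal is exactly N(b, a^2).
print(abs(log_prob_flow(3.0) - log_normal(3.0, 1.0, 2.0)) < 1e-12)  # True
```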
<h3 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p>GANs use a generator-discriminator game where the generator produces molecules and the discriminator distinguishes real from generated samples. Molecular GANs include MolGAN (graph-based with RL reward), <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> (SMILES-based with RL), and Mol-CycleGAN (molecule-to-molecule translation).</p>
<h3 id="diffusion-models">Diffusion Models</h3>
<p>Diffusion models learn to reverse a gradual noising process. The forward process adds Gaussian noise over $T$ steps; a neural network learns to denoise at each step. The training objective reduces to predicting the noise added at each step:</p>
<p>$$\mathcal{L}_t = \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{\epsilon}_t}\left[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\boldsymbol{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2\right]$$</p>
<p>Diffusion has been particularly successful for 3D conformation generation (ConfGF, GeoDiff, DGSM).</p>
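<p>A sketch of the forward process for a scalar feature under a linear $\beta$ schedule (the schedule endpoints are illustrative): the pair returned by <code>noise_sample</code> is exactly the (noisy input, regression target) used in the noise-prediction objective.</p>

```python
import math, random

random.seed(0)

# Linear beta schedule and cumulative product abar_t = prod_s (1 - beta_s),
# so that q(x_t | x_0) = N( sqrt(abar_t) x_0, (1 - abar_t) ).
T = 100
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
abar = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

def noise_sample(x0, t):
    """Sample x_t from q(x_t | x_0); eps is the denoiser's target."""
    eps = random.gauss(0.0, 1.0)
    x_t = math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps
    return x_t, eps

x_t, eps = noise_sample(1.0, t=99)
print(abar[0] > abar[99])  # signal decays monotonically: True
```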
<h3 id="energy-based-models-ebms">Energy-Based Models (EBMs)</h3>
<p>EBMs define $p(\boldsymbol{x}) = \frac{\exp(-E_\theta(\boldsymbol{x}))}{A}$ where $E_\theta$ is a learned energy function. The challenge is computing the intractable partition function $A$, addressed via contrastive divergence, noise-contrastive estimation, or score matching.</p>
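<p>On a toy discrete space the partition function can be enumerated exactly, which makes the definition concrete. The energies below are made up for illustration; real molecular spaces are far too large for this enumeration, which is what motivates the approximate training schemes.</p>

```python
import math

# p(x) = exp(-E(x)) / A over a four-element toy space,
# with A = sum_x exp(-E(x)) computed exactly.
space = ["CC", "CO", "CN", "OO"]
energy = {"CC": 0.0, "CO": 0.5, "CN": 1.0, "OO": 3.0}

A = sum(math.exp(-energy[x]) for x in space)          # partition function
p = {x: math.exp(-energy[x]) / A for x in space}      # normalized density

print(abs(sum(p.values()) - 1.0) < 1e-12)  # normalized: True
print(p["CC"] > p["OO"])                   # low energy => high probability: True
```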
<h2 id="combinatorial-optimization-methods">Combinatorial Optimization Methods</h2>
<p>Unlike DGMs that learn from data distributions, combinatorial optimization methods (COMs) search directly over discrete chemical space using oracle calls to evaluate candidate molecules.</p>
<h3 id="reinforcement-learning-rl">Reinforcement Learning (RL)</h3>
<p>RL formulates molecule generation as a Markov Decision Process: states are partial molecules, actions are adding/removing atoms or bonds, and rewards come from property oracles. Methods include GCPN (graph convolutional policy network), MolDQN (deep Q-network), RationaleRL (property-aware substructure assembly), and REINVENT (SMILES-based policy gradient).</p>
<h3 id="genetic-algorithms-ga">Genetic Algorithms (GA)</h3>
<p>GAs maintain a population of molecules and evolve them through mutation and crossover operations. GB-GA operates on molecular graphs, GA+D uses SELFIES with adversarial discriminator enhancement, and JANUS uses SELFIES with parallel exploration strategies.</p>
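<p>A minimal GA skeleton over token lists standing in for SELFIES strings. The oracle, vocabulary, and truncation-selection scheme are illustrative stand-ins, not GB-GA&rsquo;s or JANUS&rsquo;s actual operators; a real run would call a property predictor or docking oracle instead.</p>

```python
import random

random.seed(1)

VOCAB = ["[C]", "[O]", "[N]"]

def oracle(mol):
    """Toy objective: count of carbon tokens (max 8 for length-8 molecules)."""
    return mol.count("[C]")

def mutate(mol):
    """Replace one random token."""
    i = random.randrange(len(mol))
    return mol[:i] + [random.choice(VOCAB)] + mol[i + 1:]

def crossover(a, b):
    """Single-point crossover between two parents of equal length."""
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

pop = [[random.choice(VOCAB) for _ in range(8)] for _ in range(20)]
for _ in range(30):
    pop.sort(key=oracle, reverse=True)
    parents = pop[:10]                       # truncation selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    pop = parents + children

best = max(pop, key=oracle)
print(oracle(best))  # typically 7-8 after 30 generations
```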
<h3 id="bayesian-optimization-bo">Bayesian Optimization (BO)</h3>
<p>BO builds a Gaussian process surrogate of the objective function and uses an acquisition function to decide which molecules to evaluate next. It is often combined with VAE latent spaces (Constrained-BO-VAE, MSO) to enable continuous optimization.</p>
<h3 id="monte-carlo-tree-search-mcts">Monte Carlo Tree Search (MCTS)</h3>
<p>MCTS explores the molecular construction tree by branching and evaluating promising intermediates. ChemTS and MP-MCTS combine MCTS with autoregressive SMILES generators.</p>
<h3 id="mcmc-sampling">MCMC Sampling</h3>
<p>MCMC methods (MIMOSA, MARS) formulate molecule optimization as sampling from a target distribution defined by multiple property objectives, using graph neural networks as proposal distributions.</p>
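<p>The sampling formulation can be sketched with Metropolis-Hastings over fixed-length token strings, targeting $p(x) \propto \exp(\mathrm{score}(x))$. The toy score and the random single-token proposal stand in for the multi-objective targets and learned GNN proposal distributions of MIMOSA and MARS.</p>

```python
import math, random

random.seed(0)

VOCAB = ["C", "O", "N"]

def score(mol):
    """Toy stand-in for a multi-objective property score."""
    return mol.count("C") - mol.count("N")

def propose(mol):
    """Symmetric proposal: resample one random position."""
    i = random.randrange(len(mol))
    return mol[:i] + random.choice(VOCAB) + mol[i + 1:]

x = "NNNNNNNN"
best, best_score = x, score(x)
for _ in range(2000):
    y = propose(x)
    # Symmetric proposal => accept with probability min(1, p(y)/p(x)).
    if math.log(random.random()) < score(y) - score(x):
        x = y
    if score(x) > best_score:
        best, best_score = x, score(x)

print(best, best_score)  # the chain drifts toward high-scoring molecules
```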
<h3 id="other-approaches">Other Approaches</h3>
<p>The survey also identifies two additional paradigms that do not fit neatly into either DGM or COM categories. <strong>Optimal Transport (OT)</strong> is used when matching between groups of molecules, particularly for conformation generation where each molecule has multiple associated 3D structures (e.g., GeoMol, EquiBind). <strong>Differentiable Learning</strong> formulates discrete molecules as differentiable objects, enabling gradient-based continuous optimization directly on molecular graphs (e.g., DST).</p>
<h2 id="task-taxonomy-eight-molecule-generation-tasks">Task Taxonomy: Eight Molecule Generation Tasks</h2>
<p>The survey&rsquo;s central organizational contribution is a unified taxonomy of eight distinct molecule design tasks, defined by three axes: (1) whether generation is <em>de novo</em> (from scratch, no reference molecule) or conditioned on an input molecule, (2) whether the goal is <em>generation</em> (distribution learning, producing valid and diverse molecules) or <em>optimization</em> (goal-directed search for molecules with specific properties), and (3) the input/output data representation (1D string, 2D graph, 3D geometry). The paper&rsquo;s Table 2 maps all combinations of these axes, showing that many are not meaningful (e.g., 1D string input to 2D graph output with no goal). Only eight combinations correspond to active research areas.</p>
<h3 id="1d2d-tasks">1D/2D Tasks</h3>
<ul>
<li><strong>De novo 1D/2D molecule generation</strong>: Generate new molecules from scratch to match a training distribution. Methods span VAEs (ChemVAE, JT-VAE), flows (GraphNVP, MoFlow, GraphAF), GANs (MolGAN, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>), ARs (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/">MolecularRNN</a>), and EBMs (GraphEBM).</li>
<li><strong>De novo 1D/2D molecule optimization</strong>: Generate molecules with optimal properties from scratch, using oracle feedback. Methods include RL (GCPN, MolDQN), GA (GB-GA, JANUS), MCTS (ChemTS), and MCMC (MIMOSA, MARS).</li>
<li><strong>1D/2D molecule optimization</strong>: Optimize properties of a given input molecule via local search. Methods include graph-to-graph translation (VJTNN, CORE, MOLER), VAE+BO (MSO, Constrained-BO-VAE), GANs (Mol-CycleGAN, <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>), and differentiable approaches (DST).</li>
</ul>
<h3 id="3d-tasks">3D Tasks</h3>
<ul>
<li><strong>De novo 3D molecule generation</strong>: Generate novel 3D molecular structures from scratch, respecting geometric validity. Methods include ARs (G-SchNet, G-SphereNet), VAEs (3DMolNet), flows (E-NFs), and RL (MolGym).</li>
<li><strong>De novo 3D conformation generation</strong>: Generate 3D conformations from given 2D molecular graphs. Methods include VAEs (CVGAE, ConfVAE), diffusion models (ConfGF, GeoDiff, DGSM), and optimal transport (GeoMol).</li>
<li><strong>De novo binding-based 3D molecule generation</strong>: Design 3D molecules for specific protein binding pockets. Methods include density-based VAEs (liGAN), RL (DeepLigBuilder), and ARs (3DSBDD).</li>
<li><strong>De novo binding-pose conformation generation</strong>: Find the appropriate 3D conformation of a given molecule for a given protein pocket. Methods include EBMs (DeepDock) and optimal transport (EquiBind).</li>
<li><strong>3D molecule optimization</strong>: Optimize 3D molecular properties (scaffold replacement, conformation refinement). Methods include BO (BOA), ARs (3D-Scaffold, cG-SchNet), and VAEs (Coarse-GrainingVAE).</li>
</ul>
<h2 id="evaluation-metrics">Evaluation Metrics</h2>
<p>The survey organizes evaluation metrics into four categories.</p>
<h3 id="generation-evaluation">Generation Evaluation</h3>
<p>Basic metrics assess the quality of generated molecules:</p>
<ul>
<li><strong>Validity</strong>: fraction of chemically valid molecules among all generated molecules</li>
<li><strong>Novelty</strong>: fraction of generated molecules absent from the training set</li>
<li><strong>Uniqueness</strong>: fraction of distinct molecules among generated samples</li>
<li><strong>Quality</strong>: fraction passing a predefined chemical rule filter</li>
<li><strong>Diversity</strong> (internal/external): measured via pairwise similarity (Tanimoto, scaffold, or fragment) within generated set and between generated and training sets</li>
</ul>
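<p>The first four metrics reduce to simple set operations once a validity oracle is fixed. The sketch below uses a lookup-table &ldquo;oracle&rdquo; and character-set Tanimoto similarity in place of RDKit parsing and real molecular fingerprints.</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two molecules via their character sets
    (a crude stand-in for fingerprint-based Tanimoto)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def metrics(generated, valid_oracle, train):
    valid = [m for m in generated if m in valid_oracle]
    unique = set(valid)
    sims = [tanimoto(x, y) for i, x in enumerate(valid)
            for y in valid[i + 1:]]
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid),
        "novelty": len(unique - set(train)) / len(unique),
        # Internal diversity = 1 - mean pairwise similarity.
        "internal_diversity": 1.0 - sum(sims) / len(sims),
    }

train = ["CCO", "CCN"]
generated = ["CCO", "CCC", "CCC", "C(("]   # last one is invalid
m = metrics(generated, {"CCO", "CCC"}, train)
print(m["validity"], m["novelty"])  # validity 3/4, novelty 1/2
```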
<h3 id="distribution-evaluation">Distribution Evaluation</h3>
<p>Metrics measuring how well generated molecules capture the training distribution: KL divergence over physicochemical descriptors, <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD), and Maximum Mean Discrepancy (MMD).</p>
<h3 id="optimization-evaluation">Optimization Evaluation</h3>
<p>Property oracles used as optimization targets: Synthetic Accessibility (SA), Quantitative Estimate of Drug-likeness (QED), LogP, kinase inhibition scores (GSK3-beta, JNK3), DRD2 activity, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark oracles, and Vina docking scores. Constrained optimization additionally considers structural similarity to reference molecules via Tanimoto, scaffold, or fragment similarity.</p>
<h3 id="3d-evaluation">3D Evaluation</h3>
<p>3D-specific metrics include stability (matching valence rules in 3D), RMSD and Kabsch-RMSD (conformation alignment), and Coverage/Matching scores for conformation ensembles.</p>
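<p>Plain RMSD is a direct computation once atom correspondences are fixed; Kabsch-RMSD additionally finds the optimal rigid rotation and translation first. A sketch assuming pre-aligned conformations with matching atom order:</p>

```python
import math

def rmsd(conf_a, conf_b):
    """Root-mean-square deviation between two conformations given as
    lists of (x, y, z) coordinates with identical atom ordering."""
    n = len(conf_a)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(conf_a, conf_b))
    return math.sqrt(sq / n)

a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.5, 0.1, 0.0)]   # second atom displaced by 0.1 A
print(round(rmsd(a, b), 4))  # 0.0707
```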
<h2 id="datasets">Datasets</h2>
<p>The survey catalogs 12 major datasets spanning 1D/2D and 3D molecule generation:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Scale</th>
          <th>Dimensionality</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC</td>
          <td>250K</td>
          <td>1D/2D</td>
          <td>Virtual screening compounds</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>2.1M</td>
          <td>1D/2D</td>
          <td>Bioactive molecules</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></td>
          <td>1.9M</td>
          <td>1D/2D</td>
          <td>Benchmarking generation</td>
      </tr>
      <tr>
          <td>CEPDB</td>
          <td>4.3M</td>
          <td>1D/2D</td>
          <td>Organic photovoltaics</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>970M</td>
          <td>1D/2D</td>
          <td>Enumerated small molecules</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>134K</td>
          <td>1D/2D/3D</td>
          <td>Quantum chemistry properties</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450K/37M</td>
          <td>1D/2D/3D</td>
          <td>Conformer ensembles</td>
      </tr>
      <tr>
          <td>ISO17</td>
          <td>200/431K</td>
          <td>1D/2D/3D</td>
          <td>Molecule-conformation pairs</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>3.9M</td>
          <td>1D/2D/3D</td>
          <td>DFT ground-state geometries</td>
      </tr>
      <tr>
          <td>CrossDock2020</td>
          <td>22.5M</td>
          <td>1D/2D/3D</td>
          <td>Docked ligand poses</td>
      </tr>
      <tr>
          <td>scPDB</td>
          <td>16K</td>
          <td>1D/2D/3D</td>
          <td>Binding sites</td>
      </tr>
      <tr>
          <td>DUD-E</td>
          <td>23K</td>
          <td>1D/2D/3D</td>
          <td>Active compounds with decoys</td>
      </tr>
  </tbody>
</table>
<h2 id="challenges-and-opportunities">Challenges and Opportunities</h2>
<h3 id="challenges">Challenges</h3>
<ol>
<li><strong>Out-of-distribution generation</strong>: Most deep generative models imitate known molecule distributions and struggle to explore truly novel chemical space.</li>
<li><strong>Unrealistic problem formulation</strong>: Many task setups do not respect real-world chemistry constraints.</li>
<li><strong>Expensive oracle calls</strong>: Methods typically assume unlimited access to property evaluators, which is unrealistic in drug discovery.</li>
<li><strong>Lack of interpretability</strong>: Few methods explain why generated molecules have desired properties. Quantitative interpretability evaluation remains an open problem.</li>
<li><strong>No unified evaluation protocols</strong>: The field lacks consensus on what defines a &ldquo;good&rdquo; drug candidate and how to fairly compare methods.</li>
<li><strong>Insufficient benchmarking</strong>: Despite the enormous chemical space ($10^{23}$ to $10^{60}$ drug-like molecules), available benchmarks use only small fractions of large databases.</li>
<li><strong>Low-data regime</strong>: Many real-world applications have limited training data, and generating molecules under data scarcity remains difficult.</li>
</ol>
<h3 id="opportunities">Opportunities</h3>
<ol>
<li><strong>Extension to complex structured data</strong>: Techniques from small molecule generation may transfer to proteins, antibodies, genes, crystal structures, and polysaccharides.</li>
<li><strong>Connection to later drug development phases</strong>: Bridging the gap between molecule design and preclinical/clinical trial outcomes could improve real-world impact.</li>
<li><strong>Knowledge discovery</strong>: Generative models over molecular latent spaces could reveal chemical rules governing molecular properties, and graph structure learning could uncover implicit non-bonded interactions.</li>
</ol>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The survey was published in March 2022, so it does not cover subsequent advances in diffusion models for molecules (e.g., EDM, DiffSBDD), large language models applied to chemistry, or flow matching approaches.</li>
<li>Coverage focuses on small molecules. Macromolecule design (proteins, nucleic acids) is noted as a future direction rather than surveyed.</li>
<li>The survey catalogs methods but does not provide head-to-head experimental comparisons across all 100+ methods. Empirical discussion relies on individual papers&rsquo; reported results.</li>
<li>1D string-based methods receive less detailed coverage than graph and geometry-based approaches, reflecting the field&rsquo;s shift toward structured representations at the time of writing.</li>
<li>As a survey, this paper produces no code, models, or datasets. The surveyed methods&rsquo; individual repositories are referenced in their original publications but are not aggregated here.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Du, Y., Fu, T., Sun, J., &amp; Liu, S. (2022). MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design. <em>arXiv preprint arXiv:2203.14500</em>.</p>
<p><strong>Publication</strong>: arXiv preprint, March 2022. <strong>Note</strong>: This survey covers literature through early 2022 and does not include subsequent advances in diffusion models, LLMs for chemistry, or flow matching.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2203.14500">arXiv: 2203.14500</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{du2022molgensurvey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Du, Yuanqi and Fu, Tianfan and Sun, Jimeng and Liu, Shengchao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2203.14500}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>UnCorrupt SMILES: Post Hoc Correction for De Novo Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</guid><description>A transformer-based SMILES corrector that fixes invalid outputs from molecular generators, recovering 60-95% of erroneous SMILES strings.</description><content:encoded><![CDATA[<h2 id="a-transformer-based-smiles-error-corrector">A Transformer-Based SMILES Error Corrector</h2>
<p>This is a <strong>Method</strong> paper that proposes a post hoc approach to fixing invalid SMILES produced by de novo molecular generators. Rather than trying to prevent invalid outputs through alternative representations (<a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or constrained architectures (graph models), the authors train a transformer model to translate invalid SMILES into valid ones. The corrector is framed as a sequence-to-sequence translation task, drawing on techniques from grammatical error correction (GEC) in natural language processing.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based generative models produce a fraction of outputs that cannot be converted to molecules. The invalidity rate varies substantially across model types:</p>
<ul>
<li><strong>RNN models</strong> (DrugEx): 5.7% invalid (pretrained) and 4.7% invalid (target-directed)</li>
<li><strong>GANs</strong> (ORGANIC): 9.5% invalid</li>
<li><strong>VAEs</strong> (GENTRL): 88.9% invalid</li>
</ul>
<p>These invalid outputs represent wasted computation and potentially introduce bias toward molecules that are easier to generate correctly. Previous approaches to this problem include using alternative representations (<a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or graph-based models, but these either limit the search space or increase computational cost. The authors propose a complementary strategy: fix the errors after generation.</p>
<h2 id="error-taxonomy-across-generator-types">Error Taxonomy Across Generator Types</h2>
<p>The paper classifies invalid SMILES errors into six categories based on RDKit error messages:</p>
<ol>
<li><strong>Syntax errors</strong>: malformed SMILES grammar</li>
<li><strong>Unclosed rings</strong>: unmatched ring closure digits</li>
<li><strong>Parentheses errors</strong>: unbalanced open/close parentheses</li>
<li><strong>Bond already exists</strong>: duplicate bonds between the same atoms</li>
<li><strong>Aromaticity errors</strong>: atoms incorrectly marked as aromatic or kekulization failures</li>
<li><strong>Valence errors</strong>: atoms exceeding their maximum bond count</li>
</ol>
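<p>The two purely syntactic categories (parentheses errors and unclosed rings) can be detected without any chemistry, as sketched below for single-digit ring labels. The paper itself derives labels from RDKit parser error messages, which is also what catches valence, aromaticity, and duplicate-bond errors.</p>

```python
def classify_syntax_errors(smiles: str):
    """Detect unbalanced parentheses and unclosed single-digit ring
    labels; valence/aromaticity/bond errors require a real parser."""
    errors = []
    depth = 0
    ring_open = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch.isdigit():
            ring_open.symmetric_difference_update(ch)  # toggle open/closed
    if depth != 0:
        errors.append("parentheses")
    if ring_open:
        errors.append("unclosed ring")
    return errors

print(classify_syntax_errors("C1CC(C"))  # ['parentheses', 'unclosed ring']
print(classify_syntax_errors("C1CCC1"))  # []
```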
<p>The distribution of error types differs across generators. RNN-based models primarily produce aromaticity errors, suggesting they learn SMILES grammar well but struggle with chemical validity. The GAN (ORGANIC) produces mostly valence errors. The VAE (GENTRL) produces more grammar-level errors (syntax, parentheses, unclosed rings), indicating that sampling from the continuous latent space often produces sequences that violate basic SMILES structure.</p>
<h2 id="architecture-and-training">Architecture and Training</h2>
<p>The SMILES corrector uses a standard encoder-decoder transformer architecture based on Vaswani et al., with learned positional encodings. Key specifications:</p>
<ul>
<li>Embedding dimension: 256</li>
<li>Encoder/decoder layers: 3 each</li>
<li>Attention heads: 8</li>
<li>Feed-forward dimension: 512</li>
<li>Dropout: 0.1</li>
<li>Optimizer: Adam (learning rate 0.0005)</li>
<li>Training: 20 epochs, batch size 16</li>
</ul>
<p>Since no dataset of manually corrected invalid-valid SMILES pairs exists, the authors create synthetic training data by introducing errors into valid SMILES from the Papyrus bioactivity dataset (approximately 1.3M pairs). Errors are introduced through random perturbations following SMILES syntax rules: character substitutions, bond order changes, fragment additions from the <a href="/notes/chemistry/datasets/gdb-11/">GDB</a>-8 database to atoms with full valence, and other structural modifications.</p>
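<p>A simplified sketch of the pair-construction step: corrupt a valid SMILES with $n$ random single-character edits to obtain an (invalid input, valid target) training pair. The token list and edit operations here are illustrative; the paper&rsquo;s scheme additionally changes bond orders and grafts GDB-8 fragments onto atoms with full valence.</p>

```python
import random

random.seed(7)

TOKENS = list("CNOcno()=#123")

def corrupt(smiles: str, n_errors: int) -> str:
    """Apply n random single-character edits (substitute/insert/delete)."""
    s = list(smiles)
    for _ in range(n_errors):
        op = random.choice(["substitute", "insert", "delete"])
        i = random.randrange(len(s))
        if op == "substitute":
            s[i] = random.choice(TOKENS)
        elif op == "insert":
            s.insert(i, random.choice(TOKENS))
        elif len(s) > 1:           # delete, but never empty the string
            del s[i]
    return "".join(s)

valid = "CC(=O)Oc1ccccc1C(=O)O"                 # aspirin
pair = (corrupt(valid, n_errors=12), valid)     # (corrupted input, target)
print(pair[0])
print(pair[0] != valid)  # True
```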
<h2 id="training-with-multiple-errors-improves-correction">Training with Multiple Errors Improves Correction</h2>
<p>A key finding is that training the corrector on inputs with multiple errors per SMILES substantially improves performance on real generator outputs. The baseline model (1 error per input) fixes 35-80% of invalid outputs depending on the generator. Increasing errors per training input to 12 raises this to 62-95%:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>1 error/input</th>
          <th>12 errors/input</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN (DrugEx)</td>
          <td>~60% fixed</td>
          <td>62% fixed</td>
      </tr>
      <tr>
          <td>Target-directed RNN</td>
          <td>~60% fixed</td>
          <td>68% fixed</td>
      </tr>
      <tr>
          <td>GAN (ORGANIC)</td>
          <td>~80% fixed</td>
          <td>95% fixed</td>
      </tr>
      <tr>
          <td>VAE (GENTRL)</td>
          <td>~35% fixed</td>
          <td>80% fixed</td>
      </tr>
  </tbody>
</table>
<p>Training beyond 12 errors per input yields diminishing returns (80% average at 20 errors vs. 78% at 12). The improvement from multi-error training is consistent with GEC literature, where models learn to &ldquo;distrust&rdquo; inputs more when exposed to higher error rates.</p>
<p>The model also shows low overcorrection: only 14% of valid SMILES are altered during translation, comparable to overcorrection rates in spelling correction systems.</p>
<h2 id="fixed-molecules-are-comparable-to-generator-outputs">Fixed Molecules Are Comparable to Generator Outputs</h2>
<p>The corrected molecules are evaluated against both the training set and the readily generated (valid) molecules from each generator:</p>
<ul>
<li><strong>Uniqueness</strong>: 97% of corrected molecules are unique</li>
<li><strong>Novelty vs. generated</strong>: 97% of corrected molecules are novel compared to the valid generator outputs</li>
<li><strong>Similarity to nearest neighbor (SNN)</strong>: 0.45 between fixed and generated sets, indicating the corrected molecules explore different parts of chemical space</li>
<li><strong>Property distributions</strong>: KL divergence scores between fixed molecules and the training set are comparable to those between generated molecules and the training set</li>
</ul>
<p>This demonstrates that SMILES correction produces molecules that are as chemically reasonable as the generator&rsquo;s valid outputs while exploring complementary regions of chemical space.</p>
<h2 id="local-chemical-space-exploration-via-error-introduction">Local Chemical Space Exploration via Error Introduction</h2>
<p>Beyond fixing generator errors, the authors propose using the SMILES corrector for analog generation. The workflow is:</p>
<ol>
<li>Take a known active molecule</li>
<li>Introduce random errors into its SMILES (repeated 1000 times)</li>
<li>Correct the errors using the trained corrector</li>
</ol>
<p>This &ldquo;local sequence exploration&rdquo; generates novel analogs with 97% validity. The uniqueness (39%) and novelty (16-37%) are lower than for generator correction because the corrector often regenerates the original molecule. However, the approach produces molecules that are structurally similar to the starting compound (SNN of 0.85 to known ligands).</p>
<p>The authors demonstrate this on selective <a href="https://en.wikipedia.org/wiki/Aurora_kinase_B">Aurora kinase B</a> (AURKB) inhibitors. The generated analogs occupy the same binding site region as the co-crystallized ligand VX-680 in docking studies, with predicted bioactivities similar to known compounds. Compared to target-directed RNN generation, SMILES exploration produces molecules closer to known actives (higher SNN, scaffold similarity, and KL divergence scores).</p>
<h2 id="limitations">Limitations</h2>
<p>The corrector&rsquo;s performance drops on real generator outputs compared to synthetic test data, because the synthetic error distribution does not perfectly match the errors that generators actually produce. Generator-specific correctors trained on actual invalid outputs could improve performance. The local exploration approach has limited novelty since the corrector frequently regenerates the original molecule. The evaluation uses predicted rather than experimental bioactivities for the Aurora kinase case study.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">LindeSchoenmaker/SMILES-corrector</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training code, synthetic error generation, and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Synthetic training pairs derived from the Papyrus bioactivity dataset (v5.5). Approximately 1.3M invalid-valid pairs per error-count setting.</p>
<p><strong>Code</strong>: Transformer implemented in PyTorch, adapted from Ben Trevett&rsquo;s seq2seq tutorial. Generative model baselines use DrugEx, GENTRL, and ORGANIC.</p>
<p><strong>Evaluation</strong>: Validity assessed with RDKit. Similarity metrics (SNN, fragment, scaffold) and KL divergence computed following <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark protocols.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schoenmaker, L., Béquignon, O. J. M., Jespers, W., &amp; van Westen, G. J. P. (2023). UnCorrupt SMILES: a novel approach to de novo design. <em>Journal of Cheminformatics</em>, 15, 22.</p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">GitHub: LindeSchoenmaker/SMILES-corrector</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schoenmaker2023uncorrupt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{UnCorrupt SMILES: a novel approach to de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Schoenmaker, Linde and B{\&#39;e}quignon, Olivier J. M. and Jespers, Willem and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00696-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Sets (MOSES): A Generative Modeling Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/</link><pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/</guid><description>MOSES provides a standardized benchmarking platform for molecular generative models, featuring datasets, metrics, and baselines.</description><content:encoded><![CDATA[<h2 id="the-role-of-moses-a-benchmarking-resource">The Role of MOSES: A Benchmarking Resource</h2>
<p>This is a <strong>Resource and Benchmarking</strong> paper. It introduces Molecular Sets (MOSES), a platform designed to standardize the training, comparison, and evaluation of molecular generative models. It provides a standardized dataset, a suite of evaluation metrics, and a collection of baseline models to serve as reference points for the field.</p>
<h2 id="motivation-the-reproducibility-crisis-in-generative-chemistry">Motivation: The Reproducibility Crisis in Generative Chemistry</h2>
<p>Generative models are increasingly popular for drug discovery and material design, capable of exploring the vast chemical space ($10^{23}$ to $10^{80}$ compounds) more efficiently than traditional methods. However, the field faces a significant reproducibility crisis:</p>
<ol>
<li><strong>Lack of Standardization</strong>: There is no consensus on how to properly compare and rank the efficacy of different generative models.</li>
<li><strong>Inconsistent Metrics</strong>: Different papers use different metrics or distinct implementations of the same metrics.</li>
<li><strong>Data Variance</strong>: Models are often trained on different subsets of chemical databases (like ZINC), making direct comparison impossible.</li>
</ol>
<p>MOSES aims to solve these issues by providing a unified &ldquo;measuring stick&rdquo; for distribution learning models in chemistry.</p>
<h2 id="core-innovation-standardizing-chemical-distribution-learning">Core Innovation: Standardizing Chemical Distribution Learning</h2>
<p>The core contribution is a <strong>standardized definition of distribution learning</strong> for molecular generation. Why focus on distribution learning? Rule-based filters enforce hard boundaries like molecular weight limits; distribution learning complements them by letting chemists impose <strong>implicit or soft restrictions</strong>. Together, the two ensure that generated molecules satisfy hard constraints while also reflecting complex chemical realities defined by the training distribution, such as the prevalence of certain substructures and the avoidance of unstable motifs.</p>
<p>MOSES specifically targets distribution learning by providing:</p>
<ol>
<li><strong>A Clean, Standardized Dataset</strong>: A specific subset of the ZINC Clean Leads collection with rigorous filtering.</li>
<li><strong>Diverse Metrics</strong>: A comprehensive suite of metrics that measure validity alongside novelty, diversity (internal and external), chemical properties (properties distribution), and substructure similarity.</li>
<li><strong>Open Source Platform</strong>: A Python library <code>molsets</code> that decouples the data and evaluation logic from the model implementation, ensuring everyone measures performance exactly the same way.</li>
</ol>
<h2 id="experimental-setup-and-baseline-generative-models">Experimental Setup and Baseline Generative Models</h2>
<p>The authors benchmarked a wide variety of generative models against the MOSES dataset to establish baselines:</p>
<ul>
<li><strong>Baselines</strong>: Character-level RNN (CharRNN), <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoder</a> (VAE), Adversarial Autoencoder (AAE), Junction Tree VAE (JTN-VAE), and <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>.</li>
<li><strong>Non-Neural Baselines</strong>: HMM, n-gram models, and a combinatorial generator (randomly connecting fragments).</li>
<li><strong>Evaluation</strong>: Models were trained on the standard set and evaluated on:
<ul>
<li><strong>Validity/Uniqueness</strong>: Can the model generate valid, non-duplicate SMILES? Uniqueness is measured at $k = 1{,}000$ and $k = 10{,}000$ samples.</li>
<li><strong>Filters</strong>: What fraction of generated molecules pass the same medicinal chemistry and PAINS filters used for dataset construction?</li>
<li><strong>Feature Distribution</strong>: Do generated molecules match the physicochemical properties of the training set? Evaluated using the <strong>Wasserstein-1 distance</strong> on 1D distributions of:
<ul>
<li><strong>LogP</strong>: Octanol-water partition coefficient (lipophilicity).</li>
<li><strong>SA</strong>: Synthetic Accessibility score (ease of synthesis).</li>
<li><strong>QED</strong>: Quantitative Estimation of Drug-likeness.</li>
<li><strong>MW</strong>: Molecular Weight.</li>
</ul>
</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Measures similarity in biological/chemical space using the penultimate-layer (second-to-last layer) activations of a pre-trained network (ChemNet).</li>
<li><strong>Similarity to Nearest Neighbor (SNN)</strong>: Measures the precision of generation by checking the closest match in the training set (Tanimoto similarity).</li>
</ul>
</li>
</ul>
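<p>To make the property-distribution comparison concrete: for two equal-size 1D samples, the Wasserstein-1 distance reduces to the mean absolute difference between the sorted values (the quantile coupling). A minimal pure-Python sketch with toy numbers — MOSES itself computes these via its <code>molsets</code> package on RDKit descriptors:</p>

```python
def wasserstein1(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    For equal-size samples, the optimal transport plan pairs sorted values,
    so W1 is simply the mean absolute difference of the sorted sequences.
    """
    assert len(xs) == len(ys), "sketch assumes equal sample sizes"
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# e.g. generated vs. test LogP values (toy numbers, not real descriptors)
gen_logp = [1.2, 2.5, 3.1, 0.8]
test_logp = [1.0, 2.4, 3.3, 1.1]
print(wasserstein1(gen_logp, test_logp))
```

<p>A small distance means the generated property histogram closely tracks the test set's; MOSES reports this separately for MW, LogP, SA, and QED.</p>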
<h2 id="key-findings-and-metric-trade-offs">Key Findings and Metric Trade-offs</h2>
<ul>
<li><strong>CharRNN Performance</strong>: The simple character-level RNN (CharRNN) outperformed more complex models (like VAEs and <a href="/posts/what-is-a-gan/">GANs</a>) on many metrics, achieving the best FCD scores ($0.073$).</li>
<li><strong>Metric Trade-offs</strong>: No single metric captures &ldquo;quality.&rdquo;
<ul>
<li>The <strong>Combinatorial Generator</strong> achieved 100% validity and high diversity but struggled with distribution learning metrics (FCD), indicating that it explores chemical space broadly without capturing the natural distribution of the training data.</li>
<li><strong>VAEs</strong> often achieve high <strong>Similarity to Nearest Neighbor (SNN)</strong> while exhibiting low novelty. The authors suggest this pattern may indicate overfitting to training set prototypes, though they treat this as a hypothesis rather than a proven mechanism.</li>
</ul>
</li>
<li><strong>Implicit Constraints</strong>: A major finding was that neural models successfully learned implicit chemical rules (like avoiding <a href="https://en.wikipedia.org/wiki/Pan-assay_interference_compounds">PAINS</a> structures) purely from the data distribution.</li>
<li><strong>Recommendation</strong>: The authors suggest using FCD/Test for general model ranking, while emphasizing the importance of checking specific metrics (validity, diversity) to diagnose model failure modes.</li>
<li><strong>Limitations of the Benchmark</strong>: MOSES focuses on distribution learning and uses FCD as a primary ranking metric. As the authors note, FCD captures multiple aspects of other metrics in a single number but does not give insights into specific issues, so more interpretable metrics are necessary for thorough investigation. The benchmark evaluates only 1D (SMILES) and 2D molecular features, without assessing 3D conformational properties.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The benchmark uses a curated subset of the <strong>ZINC Clean Leads</strong> collection.</p>
<ul>
<li><strong>Source Size</strong>: ~4.6M molecules (4,591,276 after initial extraction).</li>
<li><strong>Final Size</strong>: 1,936,962 molecules.</li>
<li><strong>Splits</strong>: Train (1,584,664), Test (176,075), Scaffold Test (176,226).
<ul>
<li><strong>Scaffold Test Split</strong>: This split is crucial for testing generalization. It contains molecules whose <a href="https://pubs.acs.org/doi/10.1021/jm9602928">Bemis-Murcko scaffolds</a> are <em>completely absent</em> from the training and test sets, so evaluating on it directly probes a model&rsquo;s ability to generate novel chemical structures.</li>
</ul>
</li>
<li><strong>Filters Applied</strong>:
<ul>
<li>Molecular weight: 250 to 350 Da</li>
<li>Rotatable bonds: $\leq 7$</li>
<li>XlogP: $\leq 3.5$</li>
<li>Atom types: C, N, S, O, F, Cl, Br, H</li>
<li>No charged atoms or cycles &gt; 8 atoms</li>
<li>Medicinal Chemistry Filters (MCF) and PAINS filters applied.</li>
</ul>
</li>
</ul>
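<p>The filter criteria above can be expressed as a single predicate. The sketch below is hypothetical: it assumes the descriptors (molecular weight, rotatable bonds, XlogP, element set, charge, largest ring size) have already been computed — in practice via RDKit — and the dictionary keys are illustrative, not the MOSES API. The MCF/PAINS substructure filters are omitted:</p>

```python
ALLOWED_ATOMS = {"C", "N", "S", "O", "F", "Cl", "Br", "H"}

def passes_moses_filters(props: dict) -> bool:
    """Apply the published MOSES dataset filter criteria to precomputed
    descriptors. The `props` keys are hypothetical; a real pipeline would
    derive these values from an RDKit Mol object."""
    return (
        250 <= props["mol_weight"] <= 350          # molecular weight in Da
        and props["rotatable_bonds"] <= 7
        and props["xlogp"] <= 3.5
        and set(props["atoms"]) <= ALLOWED_ATOMS   # no exotic elements
        and not props["has_charged_atoms"]
        and props["largest_ring"] <= 8             # no cycles > 8 atoms
    )

candidate = {"mol_weight": 300.2, "rotatable_bonds": 4, "xlogp": 2.1,
             "atoms": ["C", "N", "O", "H"], "has_charged_atoms": False,
             "largest_ring": 6}
print(passes_moses_filters(candidate))  # True for this toy descriptor set
```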
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>MOSES introduces a standard suite of metrics. Key definitions:</p>
<ul>
<li><strong>Validity</strong>: Fraction of valid <a href="/posts/visualizing-smiles-and-selfies-strings/">SMILES</a> strings (via <a href="https://www.rdkit.org/">RDKit</a>).</li>
<li><strong>Unique@k</strong>: Fraction of unique molecules in the first $k$ valid samples ($k = 1{,}000$ and $k = 10{,}000$).</li>
<li><strong>Filters</strong>: Fraction of generated molecules passing the MCF and PAINS filters used during dataset construction. High scores here indicate the model learned implicit chemical validity constraints from the data distribution.</li>
<li><strong>Novelty</strong>: Fraction of generated molecules not present in the training set.</li>
<li><strong>Internal Diversity (IntDiv)</strong>: Average Tanimoto distance between generated molecules ($G$), useful for detecting mode collapse:
$$ \text{IntDiv}_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2} \sum_{m_1, m_2 \in G} T(m_1, m_2)^p} $$</li>
<li><strong>Fragment Similarity (Frag)</strong>: Cosine similarity of fragment frequency vectors (BRICS decomposition) between generated and test sets.</li>
<li><strong>Scaffold Similarity (Scaff)</strong>: Cosine similarity of Bemis-Murcko scaffold frequency vectors between sets. Measures how well the model captures higher-level structural motifs.</li>
<li><strong>Similarity to Nearest Neighbor (SNN)</strong>: The average Tanimoto similarity between a generated molecule&rsquo;s fingerprint and its nearest neighbor in the reference set. This serves as a measure of precision; high SNN suggests the model produces molecules very similar to the training distribution, potentially indicating memorization if novelty is low.
$$ \text{SNN}(G, R) = \frac{1}{|G|} \sum_{m_G \in G} \max_{m_R \in R} T(m_G, m_R) $$</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Fréchet distance between the Gaussian approximations (mean and covariance) of penultimate-layer activations from ChemNet. This measures how close the distribution of generated molecules is to the real distribution in chemical/biological space. The authors note that FCD correlates with other metrics: for example, if the generated structures are not diverse enough, or if the model produces too many duplicates, FCD will increase because the generated variance falls below that of the test set. The authors suggest using FCD for hyperparameter tuning and final model selection.
$$ \text{FCD}(G, R) = \|\mu_G - \mu_R\|^2 + \text{Tr}\left(\Sigma_G + \Sigma_R - 2(\Sigma_G \Sigma_R)^{1/2}\right) $$</li>
<li><strong>Properties Distribution (Wasserstein-1)</strong>: The 1D <a href="/posts/what-is-a-gan/#wasserstein-gan-wgan-a-mathematical-revolution">Wasserstein-1 distance</a> between the distributions of molecular properties (MW, LogP, SA, <a href="https://www.nature.com/articles/nchem.1243">QED</a>) in the generated and test sets.</li>
</ul>
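<p>To ground these definitions, here is a hedged pure-Python sketch of several of the metrics, with fingerprints idealized as sets of &ldquo;on&rdquo; bit indices. MOSES itself uses RDKit Morgan fingerprints and, for FCD, 512-dimensional ChemNet activations with full covariance matrices; the 1D Fréchet distance below is a deliberate simplification of the matrix formula:</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def unique_at_k(smiles, k):
    """Unique@k: fraction of distinct molecules among the first k samples."""
    head = smiles[:k]
    return len(set(head)) / len(head)

def internal_diversity(fps, p=1):
    """IntDiv_p(G) = 1 - (mean of T(m1, m2)^p over all ordered pairs)^(1/p)."""
    n = len(fps)
    mean_pow = sum(tanimoto(a, b) ** p for a in fps for b in fps) / n**2
    return 1 - mean_pow ** (1 / p)

def snn(gen_fps, ref_fps):
    """SNN(G, R): average similarity of each generated molecule to its
    nearest neighbor in the reference set."""
    return sum(max(tanimoto(g, r) for r in ref_fps) for g in gen_fps) / len(gen_fps)

def frechet_1d(xs, ys):
    """1D special case of the FCD formula:
    (mu_G - mu_R)^2 + (sigma_G - sigma_R)^2."""
    def stats(v):
        mu = sum(v) / len(v)
        return mu, (sum((x - mu) ** 2 for x in v) / len(v)) ** 0.5
    (mx, sx), (my, sy) = stats(xs), stats(ys)
    return (mx - my) ** 2 + (sx - sy) ** 2

# A mode-collapsed sample: identical fingerprints give IntDiv = 0.
collapsed = [frozenset({1, 2, 3})] * 4
print(internal_diversity(collapsed))  # 0.0
```

<p>The mode-collapse example shows why IntDiv is reported alongside FCD: a generator repeating one good molecule can score well on per-molecule metrics while IntDiv drops to zero.</p>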
<h3 id="models--baselines">Models &amp; Baselines</h3>
<p>The paper selects baselines to represent different theoretical approaches to distribution learning:</p>
<ol>
<li><strong>Explicit Density Models</strong>: Models where the probability mass function $P(x)$ can be computed analytically.
<ul>
<li><strong>N-gram</strong>: Simple statistical models. They failed to generate valid molecules reliably due to limited long-range dependency modeling.</li>
</ul>
</li>
<li><strong>Implicit Density Models</strong>: Models that sample from the distribution without explicitly computing $P(x)$.
<ul>
<li><strong>VAE/AAE</strong>: Optimizes a lower bound on the log-likelihood (ELBO) or uses adversarial training.</li>
<li><strong>GANs (<a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>)</strong>: Directly minimizes the distance between real and generated distributions via a discriminator.</li>
</ul>
</li>
</ol>
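<p>The n-gram baseline&rsquo;s weakness is easy to see in code: a character-level bigram model over SMILES conditions each token only on the previous one, so it cannot balance ring-closure digits or parentheses. A minimal, illustrative sketch (not the MOSES implementation, which uses higher-order n-grams):</p>

```python
import random
from collections import defaultdict

def train_bigram(smiles_list):
    """Count character-bigram transitions, with ^ / $ as start/end tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in smiles_list:
        tokens = ["^"] + list(s) + ["$"]
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts

def sample(counts, rng, max_len=60):
    """Sample one string. Only the previous character conditions each step,
    which is why long-range constraints are routinely violated."""
    out, cur = [], "^"
    for _ in range(max_len):
        options = list(counts[cur])
        weights = [counts[cur][t] for t in options]
        cur = rng.choices(options, weights=weights)[0]
        if cur == "$":
            break
        out.append(cur)
    return "".join(out)

model = train_bigram(["CCO", "CCN", "CCC"])
print(sample(model, random.Random(0)))
```

<p>Trained on real SMILES, such a model emits locally plausible character sequences that frequently fail RDKit parsing — consistent with the low validity the paper reports for n-gram baselines.</p>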
<p>Models are also distinguished by their data representation:</p>
<ul>
<li><strong>String-based (SMILES)</strong>: Models like <strong>CharRNN</strong>, <strong>VAE</strong>, and <strong>AAE</strong> treat molecules as SMILES strings. SMILES encodes a molecular graph by traversing a spanning tree in depth-first order, storing atom and edge tokens.</li>
<li><strong>Graph-based</strong>: <strong>JTN-VAE</strong> operates directly on molecular subgraphs (junction tree), ensuring chemical validity by construction but often requiring more complex training.</li>
</ul>
<p>Key baselines implemented in PyTorch (hyperparameters are detailed in Supplementary Information 3 of the original paper):</p>
<ul>
<li><strong>CharRNN</strong>: LSTM-based sequence model (3 layers, 768 hidden units). Trained with Adam (learning rate $10^{-3}$, batch size 64, 80 epochs, learning rate halved every 10 epochs).</li>
<li><strong>VAE</strong>: Encoder-decoder architectures (bidirectional GRU encoder, 3-layer GRU decoder with 512 hidden units) with KL regularization.</li>
<li><strong>AAE</strong>: Encoder (single-layer bidirectional LSTM with 512 units) and decoder (2-layer LSTM with 512 units) trained with an adversarial formulation.</li>
<li><strong>LatentGAN</strong>: GAN (5-layer fully connected generator) trained on the latent space of a pre-trained heteroencoder.</li>
<li><strong>JTN-VAE</strong>: Tree-structured graph generation.</li>
</ul>
<h3 id="code--hardware-requirements">Code &amp; Hardware Requirements</h3>
<ul>
<li><strong>Code Repository</strong>: Available at <a href="https://github.com/molecularsets/moses">github.com/molecularsets/moses</a> as well as the PyPI library <code>molsets</code>. The platform provides standard scripts (<code>scripts/run.py</code> to evaluate models end-to-end, and <code>scripts/run_all_models.sh</code> for multi-seed evaluations).</li>
<li><strong>Hardware</strong>: The repository supports GPU acceleration via <code>nvidia-docker</code> (defaulting to 10GB shared memory). However, specific training times and exact GPU models used by the authors for the baselines are not formally documented in the source text.</li>
<li><strong>Model Weights</strong>: Pre-trained model checkpoints are not natively pre-packaged as standalone downloads; practitioners are expected to re-train the default baselines using the provided scripts.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molecularsets/moses">molecularsets/moses</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official benchmark platform with baseline models and evaluation metrics</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/molsets/">molsets (PyPI)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>pip-installable package for dataset access and metric computation</td>
      </tr>
      <tr>
          <td>ZINC Clean Leads subset</td>
          <td>Dataset</td>
          <td>See ZINC terms</td>
          <td>Curated dataset of 1,936,962 molecules distributed via the repository</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., Kadurin, A., Johansson, S., Chen, H., Nikolenko, S., Aspuru-Guzik, A., and Zhavoronkov, A. (2020). Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. <em>Frontiers in Pharmacology</em>, 11, 565644. <a href="https://doi.org/10.3389/fphar.2020.565644">https://doi.org/10.3389/fphar.2020.565644</a></p>
<p><strong>Publication</strong>: Frontiers in Pharmacology, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{polykovskiy2020moses,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular Sets (MOSES): A benchmarking platform for molecular generation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Al{\&#39;a}n and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Frontiers in Pharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{565644}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Frontiers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.3389/fphar.2020.565644}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>