<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Systematization Papers: Surveys, Reviews, and Taxonomies on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/paper-types/systematization/</link><description>Recent content in Systematization Papers: Surveys, Reviews, and Taxonomies on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 11 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/paper-types/systematization/index.xml" rel="self" type="application/rss+xml"/><item><title>T5: Exploring Transfer Learning Limits</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</guid><description>Raffel et al. systematically study transfer learning for NLP with a text-to-text framework, ablating architectures, objectives, data, and multi-task mixing.</description><content:encoded><![CDATA[<h2 id="a-systematic-study-of-nlp-transfer-learning">A systematic study of NLP transfer learning</h2>
<p>This is a <strong>systematization paper</strong> that provides a comprehensive empirical survey of transfer learning techniques for NLP. Rather than proposing a single new method, T5 introduces a unified text-to-text framework and uses it as a testbed to systematically compare pre-training objectives, architectures, unlabeled data sources, transfer approaches, and multi-task mixing strategies. The scale of the ablation study (covering dozens of configurations) and the release of C4, pre-trained models, and code make it both a reference guide and a practical resource for follow-up research.</p>
<h2 id="unifying-nlp-tasks-as-text-to-text">Unifying NLP tasks as text-to-text</h2>
<p>The core design decision is to cast every NLP task as a text-to-text problem: both the input and output are text strings, with a task-specific prefix. Classification, regression, summarization, translation, and question answering all use the same model, loss function (cross-entropy on output tokens), and decoding procedure. This simplicity enables fair comparison across tasks and training strategies.</p>
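<p>As a minimal illustration, the text-to-text interface amounts to prepending a task prefix to the input string. The prefix strings and the translation example below follow the paper; the helper function and task names are a hypothetical sketch, not the T5 API:</p>

```python
# Illustrative sketch of T5-style text-to-text formatting.
# Prefix strings follow the paper's examples; the helper function
# and the task keys are our own naming, not part of T5's codebase.
PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "cola": "cola sentence: ",
}

def to_text_to_text(task, text, target):
    """Return the (input, target) string pair for one example."""
    return PREFIXES[task] + text, target

inp, tgt = to_text_to_text("translate_en_de", "That is good.", "Das ist gut.")
```

Every task formatted this way shares the same cross-entropy loss on output tokens and the same decoding procedure.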
<p>The model architecture is a standard encoder-decoder Transformer. The paper finds that this form outperforms decoder-only (language model) and encoder-only (BERT-style) variants in the text-to-text setting, while having similar computational cost to decoder-only models despite twice the parameters (the encoder processes the input only once, then the decoder attends to it).</p>
<h2 id="multi-task-mixing-strategies-and-findings">Multi-task mixing: strategies and findings</h2>
<p>The most thesis-relevant contribution is the systematic ablation of multi-task mixing strategies (Section 3.5.2). When training on multiple tasks simultaneously (which in the text-to-text framework simply means mixing data from different sources), the central question is how to set the proportion of data from each task.</p>
<h3 id="three-mixing-strategies">Three mixing strategies</h3>
<p><strong>Examples-proportional mixing.</strong> Sample in proportion to each dataset&rsquo;s size, with an artificial cap $K$ on the maximum dataset size. Without the cap, the unsupervised pre-training data (orders of magnitude larger) would dominate all batches. The mixing rate for task $m$ is:</p>
<p>$$
r_{m} = \frac{\min(e_{m}, K)}{\sum_{n} \min(e_{n}, K)}
$$</p>
<p>where $e_{m}$ is the number of examples in task $m$&rsquo;s dataset.</p>
<p><strong>Temperature-scaled mixing.</strong> Raise each mixing rate $r_{m}$ to the power $1/T$ and renormalize. At $T=1$ this equals examples-proportional mixing; as $T$ increases, the proportions approach equal mixing. This strategy uses a large cap $K = 2^{21}$ when computing the initial rates.</p>
<p><strong>Equal mixing.</strong> Sample uniformly from all tasks. Included as a negative reference: the model overfits on low-resource tasks and underfits on high-resource tasks.</p>
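<p>All three strategies can be expressed as one function of the dataset sizes (a sketch, not the paper's code): at $T=1$ it reduces to examples-proportional mixing with cap $K$, and large $T$ approaches equal mixing.</p>

```python
import numpy as np

def mixing_rates(sizes, K=2**21, T=1.0):
    """Per-task sampling rates from dataset sizes.

    At T=1 this is examples-proportional mixing with cap K;
    raising each rate to 1/T and renormalizing gives
    temperature-scaled mixing, approaching uniform as T grows.
    """
    r = np.minimum(np.asarray(sizes, dtype=float), K)
    r = r / r.sum()
    r = r ** (1.0 / T)
    return r / r.sum()
```

For example, with two small tasks and one huge one, $T=1$ lets the huge (capped) task dominate, while a high temperature pulls the rates back toward uniform.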
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Mixing strategy</th>
          <th>GLUE</th>
          <th>CNN/DM</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
          <th>EnDe</th>
          <th>EnFr</th>
          <th>EnRo</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Baseline (pre-train/fine-tune)</td>
          <td>83.28</td>
          <td>19.24</td>
          <td>80.88</td>
          <td>71.36</td>
          <td>26.98</td>
          <td>39.82</td>
          <td>27.65</td>
      </tr>
      <tr>
          <td>Equal</td>
          <td>76.13</td>
          <td>19.02</td>
          <td>76.51</td>
          <td>63.37</td>
          <td>23.89</td>
          <td>34.31</td>
          <td>26.78</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{18}$</td>
          <td>81.67</td>
          <td>19.07</td>
          <td>78.17</td>
          <td>67.94</td>
          <td>24.57</td>
          <td>35.19</td>
          <td>27.39</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{19}$</td>
          <td>81.42</td>
          <td>19.24</td>
          <td>79.78</td>
          <td>67.30</td>
          <td>25.21</td>
          <td>36.30</td>
          <td>27.76</td>
      </tr>
      <tr>
          <td>Temperature-scaled, $T=2$</td>
          <td>81.90</td>
          <td>19.28</td>
          <td>79.42</td>
          <td>69.92</td>
          <td>25.42</td>
          <td>36.72</td>
          <td>27.20</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings on mixing:</strong></p>
<ol>
<li>
<p><strong>Multi-task training underperforms pre-train-then-fine-tune on most tasks.</strong> No mixing strategy matches the baseline of unsupervised pre-training followed by task-specific fine-tuning.</p>
</li>
<li>
<p><strong>Equal mixing is worst.</strong> It dramatically degrades performance, confirming that proportions matter.</p>
</li>
<li>
<p><strong>There exists a task-specific sweet spot for the cap $K$.</strong> Most tasks have an optimal $K$ value; larger or smaller values hurt. The exception is very high-resource tasks (WMT English-French) that always benefit from higher mixing proportions.</p>
</li>
<li>
<p><strong>Temperature scaling at $T=2$ provides the best single compromise.</strong> It achieves reasonable performance across all tasks without requiring per-task tuning of $K$.</p>
</li>
<li>
<p><strong>Multi-task pre-training followed by fine-tuning closes the gap.</strong> When multi-task training is used as pre-training (not as the final training stage), followed by task-specific fine-tuning, performance becomes comparable to unsupervised pre-training alone. This suggests that multi-task exposure during pre-training provides useful early signal without the negative effects of forcing a single model to perform all tasks simultaneously.</p>
</li>
<li>
<p><strong>&ldquo;Leave-one-out&rdquo; training works.</strong> Pre-training on a multi-task mixture that excludes a target task, then fine-tuning on it, produces only slightly worse results. This indicates that multi-task pre-training builds general capabilities that transfer to unseen tasks without dramatic task interference.</p>
</li>
</ol>
<h2 id="data-repetition-degrades-performance">Data repetition degrades performance</h2>
<p>The paper also systematically tests the effect of pre-training dataset size by truncating C4 and training for the same total token budget over repeated data:</p>
<table>
  <thead>
      <tr>
          <th>Unique tokens</th>
          <th>Repeats</th>
          <th>GLUE</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full dataset</td>
          <td>0</td>
          <td>83.28</td>
          <td>80.88</td>
          <td>71.36</td>
      </tr>
      <tr>
          <td>$2^{29}$</td>
          <td>64</td>
          <td>82.87</td>
          <td>80.97</td>
          <td>72.03</td>
      </tr>
      <tr>
          <td>$2^{27}$</td>
          <td>256</td>
          <td>82.62</td>
          <td>79.78</td>
          <td>69.97</td>
      </tr>
      <tr>
          <td>$2^{25}$</td>
          <td>1,024</td>
          <td>79.55</td>
          <td>76.27</td>
          <td>64.76</td>
      </tr>
      <tr>
          <td>$2^{23}$</td>
          <td>4,096</td>
          <td>76.34</td>
          <td>70.92</td>
          <td>59.29</td>
      </tr>
  </tbody>
</table>
<p>Performance degrades as the unique data shrinks: 64 repeats have limited effect, but 1,024+ repeats cause significant degradation. Training loss curves confirm memorization at high repetition counts. The paper recommends using large, diverse pre-training datasets whenever possible.</p>
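<p>The &ldquo;Repeats&rdquo; column follows directly from the fixed pre-training budget: the baseline trains on $2^{35} \approx 34$B tokens ($2^{19}$ steps at $2^{16}$ tokens per batch), so the repeat count is just total tokens over unique tokens. A sanity-check sketch:</p>

```python
# Baseline pre-training budget from the paper:
# 2^19 steps * 2^16 tokens per batch = 2^35 tokens total.
TOTAL_TOKENS = 2 ** 35

def repeats(unique_tokens):
    """Epochs over a truncated C4 with `unique_tokens` tokens."""
    return TOTAL_TOKENS // unique_tokens
```

Plugging in the table's unique-token counts ($2^{29}$ down to $2^{23}$) recovers 64, 256, 1,024, and 4,096 repeats.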
<h2 id="scaling-and-final-configuration">Scaling and final configuration</h2>
<p>The paper compares scaling strategies: more data, larger models, and ensembles. Training a larger model for fewer steps generally outperforms training a smaller model on more data. Ensembles of independently pre-trained and fine-tuned models provide orthogonal gains.</p>
<p>The final T5-11B model combines the best choices from all ablations: encoder-decoder architecture, span corruption objective, C4 pre-training data, multi-task pre-training followed by fine-tuning, and scaling to 11B parameters trained on over 1 trillion tokens. It achieves state-of-the-art results on GLUE (90.3 average), SuperGLUE (88.9, near human performance of 89.8), SQuAD, and CNN/Daily Mail. It does not achieve state-of-the-art on WMT translation tasks, where methods using backtranslation and cross-lingual pre-training retain the lead.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The T5 paper&rsquo;s multi-task mixing findings are its most enduring contribution beyond the model itself. The core lessons: proportions matter enormously (equal mixing fails), examples-proportional mixing with a cap is a reasonable default, temperature scaling provides a single-knob alternative, and multi-task pre-training followed by fine-tuning can match pure unsupervised pre-training.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>All ablations use the same encoder-decoder architecture. Findings may not transfer to decoder-only models that dominate current practice.</li>
<li>The multi-task mixing experiments treat each task as a separate &ldquo;domain.&rdquo; Interactions between similar tasks (e.g., multiple classification tasks) are not isolated.</li>
<li>The paper does not provide a principled method for choosing $K$ or $T$; both require empirical search.</li>
<li>C4 has known quality issues (templated text, noisy content) that have been addressed in later datasets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, pre-trained models, and the C4 dataset are all publicly released.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>C4 (Colossal Clean Crawled Corpus)</td>
          <td>~750 GB</td>
          <td>Heuristically cleaned Common Crawl</td>
      </tr>
      <tr>
          <td>Downstream</td>
          <td>GLUE, SuperGLUE, SQuAD, CNN/DM, WMT (EnDe, EnFr, EnRo)</td>
          <td>Standard splits</td>
          <td>Text-to-text format</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>Encoder-decoder Transformer. Sizes: Small (60M), Base (220M), Large (770M), 3B, and 11B parameters; the baseline uses the Base size. SentencePiece vocabulary with 32K tokens. Pre-trained for $2^{19}$ steps, fine-tuned for $2^{18}$ steps on individual tasks.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Multi-task mixing: examples-proportional with cap $K \in \{2^{16}, \ldots, 2^{21}\}$, temperature-scaled with $T \in \{2, 4, 8\}$, and equal mixing. Unsupervised objective: span corruption (mean span length 3, 15% corruption rate). Training with Adafactor optimizer, inverse square root learning rate schedule.</p>
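<p>The span-corruption objective replaces contiguous token spans with sentinel tokens; the target sequence reconstructs each span after its sentinel. A minimal sketch with explicit span positions (the actual objective samples spans randomly at a 15% rate with mean length 3):</p>

```python
def span_corrupt(tokens, spans):
    """Replace (start, length) spans with sentinels <X0>, <X1>, ...;
    the target reconstructs each masked span after its sentinel."""
    inp, tgt, prev = [], [], 0
    for k, (start, length) in enumerate(spans):
        sentinel = f"<X{k}>"
        inp += tokens[prev:start] + [sentinel]
        tgt += [sentinel] + tokens[start:start + length]
        prev = start + length
    inp += tokens[prev:]  # keep the uncorrupted tail
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 2), (8, 1)])
```

On the paper's running example this yields the input &ldquo;Thank you &lt;X0&gt; me to your party &lt;X1&gt; week&rdquo; with target &ldquo;&lt;X0&gt; for inviting &lt;X1&gt; last&rdquo;.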
<h3 id="hardware">Hardware</h3>
<p>All models trained using Mesh TensorFlow on TPU slices. T5-11B pre-trained for 1M steps with batch size $2^{11}$ sequences of length 512 (~1 trillion tokens total). Exact TPU pod configurations per experiment not detailed.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer">T5 Code</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official TensorFlow implementation (JAX successor: T5X)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints">T5 Models</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained checkpoints (Small through 11B)</td>
      </tr>
      <tr>
          <td><a href="https://www.tensorflow.org/datasets/catalog/c4">C4 Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>~750 GB cleaned Common Crawl, via TensorFlow Datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{raffel2020exploring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{21}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{140}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Materials Representations for ML Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</guid><description>Review of representation strategies for encoding solid-state materials as ML inputs, covering structural descriptors, crystal graphs, and generative models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-material-representations">A Systematization of Material Representations</h2>
<p>This paper is a <strong>Systematization</strong> that organizes and categorizes the strategies researchers use to convert solid-state materials into numerical representations suitable for machine learning models. Rather than proposing a new method, the review provides a structured taxonomy of existing approaches, connecting each to the practical constraints of data availability, computational cost, and prediction targets. It covers structural descriptors, graph-based learned representations, compositional features, transfer learning, and generative models for inverse design.</p>
<h2 id="why-material-representations-matter">Why Material Representations Matter</h2>
<p>Machine learning has enabled rapid property prediction for materials, but every ML pipeline depends on how the material is encoded as a numerical input. The authors identify three guiding principles for effective representations:</p>
<ol>
<li><strong>Similarity preservation</strong>: Similar materials should have similar representations, and dissimilar materials should diverge in representation space.</li>
<li><strong>Domain coverage</strong>: The representation should be constructable for every material in the target domain.</li>
<li><strong>Cost efficiency</strong>: Computing the representation should be cheaper than computing the target property directly (e.g., via <a href="https://en.wikipedia.org/wiki/Density_functional_theory">DFT</a>).</li>
</ol>
<p>In practice, materials scientists face several barriers. Atomistic structures span diverse space groups, supercell sizes, and disorder parameters. Real material performance depends on defects, microstructure, and interfaces. Structural information often requires expensive experimental or computational effort to obtain. Datasets in materials science tend to be small, sparse, and biased toward well-studied systems.</p>
<h2 id="structural-descriptors-local-global-and-topological">Structural Descriptors: Local, Global, and Topological</h2>
<p>The review covers three families of hand-crafted structural descriptors that encode atomic positions and types.</p>
<h3 id="local-descriptors">Local Descriptors</h3>
<p>Local descriptors characterize the environment around each atom. Atom-centered symmetry functions (ACSF), introduced by Behler and Parrinello, define radial and angular functions:</p>
<p>$$
G_{i}^{1} = \sum_{j \neq i}^{\text{neighbors}} e^{-\eta(R_{ij} - R_{s})^{2}} f_{c}(R_{ij})
$$</p>
<p>$$
G_{i}^{2} = 2^{1-\zeta} \sum_{j,k \neq i}^{\text{neighbors}} (1 + \lambda \cos \theta_{ijk})^{\zeta} e^{-\eta(R_{ij}^{2} + R_{ik}^{2} + R_{jk}^{2})} f_{c}(R_{ij}) f_{c}(R_{ik}) f_{c}(R_{jk})
$$</p>
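<p>The radial function $G_{i}^{1}$ is straightforward to implement. A NumPy sketch, assuming the standard Behler&ndash;Parrinello cosine cutoff for $f_{c}$ (the review does not fix a specific cutoff form):</p>

```python
import numpy as np

def cosine_cutoff(r, r_c):
    # Behler-Parrinello cutoff: decays smoothly to zero at r_c.
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def acsf_g1(positions, i, eta, r_s, r_c):
    """Radial ACSF G^1 for atom i: a Gaussian-weighted,
    cutoff-smoothed count of neighbors near radius r_s."""
    d = np.linalg.norm(positions - positions[i], axis=1)
    d = np.delete(d, i)  # exclude the central atom itself
    return float(np.sum(np.exp(-eta * (d - r_s) ** 2) * cosine_cutoff(d, r_c)))
```

A single neighbor exactly at $R_s$ contributes $f_c(R_s)$, since the Gaussian factor is 1 there.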
<p>The Smooth Overlap of Atomic Positions (SOAP), proposed by Bartók et al., defines atomic neighborhood density as a sum of Gaussians and computes a rotationally invariant kernel through expansion in radial functions and <a href="https://en.wikipedia.org/wiki/Spherical_harmonics">spherical harmonics</a>:</p>
<p>$$
\rho_{i}(\mathbf{r}) = \sum_{j} \exp\left(-\frac{|\mathbf{r} - \mathbf{r}_{ij}|^{2}}{2\sigma^{2}}\right) = \sum_{nlm} c_{nlm} g_{n}(\mathbf{r}) Y_{lm}(\hat{\mathbf{r}})
$$</p>
<p>The power spectrum $p_{nn'l} \equiv \sum_{m} c_{nlm}(c_{n'lm})^{*}$, collected over radial indices $n, n'$ and angular momentum $l$, serves as a vector descriptor of the local environment. SOAP has seen wide adoption both as a similarity metric and as input to ML models.</p>
<p><a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi tessellation</a> provides another local approach, segmenting space into cells and extracting features like effective coordination numbers, cell volumes, and neighbor properties.</p>
<h3 id="global-descriptors">Global Descriptors</h3>
<p>Global descriptors encode the full structure. The Coulomb matrix models electrostatic interactions between atoms:</p>
<p>$$
M_{i,j} = \begin{cases} \frac{1}{2} Z_{i}^{2.4} &amp; \text{for } i = j \\ \frac{Z_{i}Z_{j}}{|r_{i} - r_{j}|} &amp; \text{for } i \neq j \end{cases}
$$</p>
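<p>A direct NumPy translation of the Coulomb matrix (with the conventional $\tfrac{1}{2} Z_i^{2.4}$ diagonal from Rupp et al.); note the raw matrix is not permutation-invariant, which in practice is handled by sorting rows or using eigenvalues:</p>

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: 0.5 * Z_i^2.4 on the diagonal (Rupp et al.),
    Z_i * Z_j / |r_i - r_j| off the diagonal."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M
```

For an H$_2$ molecule at bond length 0.74 (in whatever length unit the positions use), the off-diagonal entries are $1/0.74$ and the diagonal entries are 0.5.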
<p>Other global methods include partial radial distribution functions (PRDF), the many-body tensor representation (MBTR), and cluster expansions. The Atomic Cluster Expansion (ACE) framework generalizes cluster expansions to continuous environments and has become a foundation for modern deep learning potentials.</p>
<h3 id="topological-descriptors">Topological Descriptors</h3>
<p><a href="https://en.wikipedia.org/wiki/Persistent_homology">Persistent homology</a> from topological data analysis (TDA) identifies geometric features at multiple length scales. Topological descriptors capture pore geometries in porous materials and have outperformed traditional structural descriptors for predicting CO$_{2}$ adsorption in metal-organic frameworks and methane storage in <a href="https://en.wikipedia.org/wiki/Zeolite">zeolites</a>. A caveat is the $O(N^{3})$ worst-case computational cost per filtration.</p>
<h2 id="crystal-graph-neural-networks">Crystal Graph Neural Networks</h2>
<p>Graph neural networks bypass manual feature engineering by learning representations directly from structural data. Materials are converted to graphs $G(V, E)$ where nodes represent atoms and edges connect neighbors within a cutoff radius, with periodic boundary conditions.</p>
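<p>Building the periodic graph reduces to a neighbor search over lattice translations. A brute-force sketch for a single unit cell (production codes use spatial partitioning, and cutoffs larger than the cell would need more image shells than the &plusmn;1 shown here):</p>

```python
import numpy as np
from itertools import product

def periodic_edges(frac_coords, lattice, r_cut):
    """Edges (i, j, distance) between atoms within r_cut,
    counting periodic images in the +/-1 shell of lattice shifts."""
    L = np.asarray(lattice, dtype=float)
    cart = np.asarray(frac_coords, dtype=float) @ L
    edges = []
    for i, j in product(range(len(cart)), repeat=2):
        for shift in product((-1, 0, 1), repeat=3):
            if i == j and shift == (0, 0, 0):
                continue  # skip the self-pair in the home cell
            d = np.linalg.norm(cart[j] + np.asarray(shift, dtype=float) @ L - cart[i])
            if d < r_cut:
                edges.append((i, j, d))
    return edges
```

A simple-cubic cell with one atom and a cutoff between the first and second neighbor shells yields exactly six edges, all to periodic images of the same atom.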
<p>Key architectures discussed include:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Key Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>Crystal graph convolutions for broad property prediction</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>Materials graph networks with global state attributes</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>Line graph neural networks incorporating three-body angular features</td>
      </tr>
      <tr>
          <td>Equivariant GNNs</td>
          <td>E(3)-equivariant message passing for tensorial properties</td>
      </tr>
  </tbody>
</table>
<p>The review identifies several limitations. Graph convolutions based on local neighborhoods can fail to capture long-range interactions or periodicity-dependent properties (e.g., lattice parameters, phonon spectra). Strategies to address this include concatenation with hand-tuned descriptors, plane-wave periodic basis modulation, and reciprocal-space features.</p>
<p>A major practical restriction is the requirement for relaxed atomic positions. Graphs built from unrelaxed crystal prototypes lose information about geometric distortions, degrading accuracy. Approaches to mitigate this include data augmentation with perturbed structures, Bayesian optimization of prototypes, and surrogate force-field relaxation.</p>
<p>Equivariant models that introduce higher-order tensors to node and edge features, constrained to transform correctly under E(3) operations, achieve state-of-the-art accuracy and can match structural descriptor performance even in low-data (~100 datapoints) regimes.</p>
<h2 id="compositional-descriptors-without-structure">Compositional Descriptors Without Structure</h2>
<p>When crystal structures are unavailable, representations can be built purely from stoichiometry and tabulated atomic properties (radii, electronegativity, valence electrons). Despite their simplicity, these methods have distinct advantages: zero computational overhead, accessibility to non-experts, and robustness for high-throughput screening.</p>
<p>Key methods include:</p>
<ul>
<li><strong>MagPie</strong>: 145 input features derived from elemental properties</li>
<li><strong>SISSO</strong>: Compressive sensing over algebraic combinations of atomic properties, capable of discovering interpretable descriptors (e.g., a new tolerance factor $\tau$ for perovskite stability)</li>
<li><strong>ElemNet</strong>: Deep neural network using only fractional stoichiometry as input, outperforming MagPie with &gt;3,000 training points</li>
<li><strong>ROOST</strong>: Fully-connected compositional graph with attention-based message passing, achieving strong performance with only hundreds of examples</li>
<li><strong>CrabNet</strong>: Self-attention on element embeddings with fractional encoding, handling dopant-level concentrations via log-scale inputs</li>
</ul>
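<p>The simplest such input is the fractional-stoichiometry vector that ElemNet consumes. An illustrative parser (the element list here is a tiny hypothetical subset; real featurizers such as Matminer cover the full periodic table):</p>

```python
import re

ELEMENTS = ["H", "C", "N", "O", "Si", "Fe"]  # illustrative subset only

def composition_vector(formula):
    """Fractional stoichiometry over ELEMENTS,
    e.g. Fe2O3 -> Fe 0.4, O 0.6."""
    counts = {}
    for el, num in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[el] = counts.get(el, 0.0) + float(num or 1)
    total = sum(counts.values())
    return [counts.get(el, 0.0) / total for el in ELEMENTS]
```

By construction the vector sums to one and discards all structural information, which is exactly why such models cannot distinguish polymorphs.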
<p>Compositional models cannot distinguish polymorphs and generally underperform structural approaches. They are most valuable when atomistic resolution is unavailable.</p>
<h2 id="defects-surfaces-and-grain-boundaries">Defects, Surfaces, and Grain Boundaries</h2>
<p>The review extends beyond idealized unit cells to practical materials challenges:</p>
<p><strong>Point defects</strong>: Representations of the pristine bulk can predict vacancy formation energies through linear relationships with band structure descriptors. Frey et al. proposed using relative differences between defect and parent structure properties, requiring no DFT on the defect itself.</p>
<p><strong>Surfaces and catalysis</strong>: Binding energy prediction for catalysis requires representations beyond the bulk unit cell. The d-band center for metals and oxygen 2p-band center for metal oxides serve as simple electronic descriptors, following the <a href="https://en.wikipedia.org/wiki/Sabatier_principle">Sabatier principle</a> that optimal catalytic activity requires intermediate binding strength. Graph neural networks trained on the Open Catalyst 2020 dataset (&gt;1 million DFT energies) have enabled broader screening, though errors remain high for certain adsorbates and non-metallic surfaces.</p>
<p><strong>Grain boundaries</strong>: SOAP descriptors computed for atoms near grain boundaries and clustered into local environment classes can predict grain boundary energy, mobility, and shear coupling. This approach provides interpretable structure-property relationships.</p>
<h2 id="transfer-learning-across-representations">Transfer Learning Across Representations</h2>
<p>When target datasets are small, transfer learning leverages representations learned from large, related datasets. The standard procedure involves: (1) pretraining on a large dataset (e.g., all Materials Project formation energies), (2) freezing parameters up to a chosen depth, and (3) either fine-tuning remaining layers or extracting features for a separate model.</p>
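<p>The feature-extraction variant of step (3) can be sketched in a few lines: treat the pretrained encoder as frozen and fit only a linear head on the small target set. This toy NumPy stand-in uses random weights in place of a real pretrained network:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained encoder: two frozen layers whose
# weights would, in practice, come from a large source task.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def encode(X):
    """Frozen feature extractor; its parameters are never updated."""
    return np.tanh(np.tanh(X @ W1) @ W2)

# Small target dataset: only the linear head is fit.
X = rng.normal(size=(32, 8))
y = rng.normal(size=32)
features = encode(X)
head, *_ = np.linalg.lstsq(features, y, rcond=None)
predictions = features @ head
```

Fine-tuning differs only in that the layers below the chosen depth are also updated during target-task training.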
<p>Key findings from the review:</p>
<ul>
<li>Transfer learning is most effective when the source dataset is orders of magnitude larger than the target</li>
<li>Physically related tasks transfer better (e.g., Open Catalyst absorption energies transfer well to new adsorbates, less so to unrelated small molecules)</li>
<li>Earlier neural network layers learn more general representations and transfer better across properties</li>
<li>Multi-depth feature extraction, combining activations from multiple layers, can improve transfer</li>
<li>Predictions from surrogate models can serve as additional descriptors, expanding screening domains by orders of magnitude</li>
</ul>
<h2 id="generative-models-for-crystal-inverse-design">Generative Models for Crystal Inverse Design</h2>
<p>Generative models for solid-state materials face challenges beyond molecular generation: more diverse atomic species, the need to specify both positions and lattice parameters, non-unique definitions (rotations, translations, supercell scaling), and large unit cells (&gt;100 atoms for zeolites and MOFs).</p>
<p>The review traces the progression of approaches:</p>
<ol>
<li><strong>Voxel representations</strong>: Discretize unit cells into volume elements. Early work (iMatGen, Court et al.) demonstrated feasibility but was restricted to specific chemistries or cubic systems.</li>
<li><strong>Continuous coordinate models</strong>: Point cloud and invertible representations allowed broader chemical spaces but lacked symmetry invariances.</li>
<li><strong>Symmetry-aware models</strong>: Crystal Diffusion <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAE</a> (CDVAE) uses periodic graphs and SE(3)-equivariant message passing for translationally and rotationally invariant generation, establishing benchmark tasks for the field.</li>
<li><strong>Constrained models for porous materials</strong>: Approaches like SmVAE represent MOFs through their topological building blocks (RFcodes), ensuring all generated structures are physically valid.</li>
</ol>
<h2 id="open-problems-and-future-directions">Open Problems and Future Directions</h2>
<p>The review highlights four high-impact open questions:</p>
<ol>
<li><strong>Local vs. global descriptor trade-offs</strong>: Local descriptors (SOAP) excel for short-range interactions but struggle with long-range physics. Global descriptors model periodicity but lack generality across space groups. Combining local and long-range features could provide more universal models.</li>
<li><strong>Prediction from unrelaxed prototypes</strong>: ML force fields can relax structures at a fraction of DFT cost, potentially expanding screening domains. Key questions remain about required training data scale and generalizability.</li>
<li><strong>Applicability of compositional descriptors</strong>: The performance gap between compositional and structural models may be property-dependent, being smaller for properties like band gap that depend on global features rather than local site energies.</li>
<li><strong>Extensions of generative models</strong>: Diffusion-based architectures have improved on voxel approaches for small unit cells, but extending to microstructure, dimensionality, and surface generation remains open.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is a review and does not present new experimental results or release any novel code, data, or models. The paper is open-access (hybrid OA at Annual Reviews) and the arXiv preprint is freely available. The following artifacts table covers key publicly available resources discussed in the review.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://arxiv.org/abs/2301.08813">arXiv preprint (2301.08813)</a></td>
          <td>Other</td>
          <td>arXiv (open access)</td>
          <td>Free preprint version</td>
      </tr>
      <tr>
          <td><a href="https://materialsproject.org">Materials Project</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>DFT energies, band gaps, structures for &gt;100,000 compounds</td>
      </tr>
      <tr>
          <td><a href="https://oqmd.org">OQMD</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Open Quantum Materials Database, &gt;600,000 DFT entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Open-Catalyst-Project/ocp">Open Catalyst 2020 (OC20)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>&gt;1,000,000 DFT surface adsorption energies</td>
      </tr>
      <tr>
          <td><a href="https://aflowlib.org">AFLOW</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>High-throughput ab initio library, &gt;3,000,000 entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/hackingmaterials/matminer">Matminer</a></td>
          <td>Code</td>
          <td>BSD</td>
          <td>Open-source toolkit for materials data mining and featurization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review covers: ACSF, SOAP, Voronoi tessellation, Coulomb matrices, PRDF, MBTR, cluster expansions, ACE, persistent homology, CGCNN, MEGNet, ALIGNN, E(3)-equivariant GNNs, MagPie, SISSO, ElemNet, ROOST, CrabNet, VAE, GAN, and diffusion-based crystal generators.</p>
<h3 id="hardware">Hardware</h3>
<p>No new experiments are conducted. Hardware requirements vary across the referenced methods: DFT calculations require HPC clusters, while GNN training typically requires 1-8 GPUs.</p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Partially Reproducible</strong>: The review paper itself is open-access. All major datasets discussed (Materials Project, OQMD, OC20, AFLOW) are publicly available under permissive licenses. Most referenced model implementations (CGCNN, MEGNet, ALIGNN, ROOST, CDVAE) have open-source code. No novel artifacts are released by the authors.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Damewood, J., Karaguesian, J., Lunger, J. R., Tan, A. R., Xie, M., Peng, J., &amp; Gómez-Bombarelli, R. (2023). Representations of Materials for Machine Learning. <em>Annual Review of Materials Research</em>, 53. <a href="https://doi.org/10.1146/annurev-matsci-080921-085947">https://doi.org/10.1146/annurev-matsci-080921-085947</a></p>
<p><strong>Publication</strong>: Annual Review of Materials Research, 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{damewood2023representations,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Representations of Materials for Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Damewood, James and Karaguesian, Jessica and Lunger, Jaclyn R. and Tan, Aik Rui and Xie, Mingrou and Peng, Jiayu and G{\&#39;o}mez-Bombarelli, Rafael}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Annual Review of Materials Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{53}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1146/annurev-matsci-080921-085947}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformers and LLMs for Chemistry Drug Discovery</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/transformers-llms-chemistry-drug-discovery/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/transformers-llms-chemistry-drug-discovery/</guid><description>Bran and Schwaller review transformer architectures for chemistry, from task-specific SMILES models to multimodal LLMs and chemistry agents.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformers-in-chemistry">A Systematization of Transformers in Chemistry</h2>
<p>This book chapter by Bran and Schwaller is a <strong>Systematization</strong> paper that organizes the growing body of work applying transformer architectures to chemistry and drug discovery. Rather than proposing a new method, the authors trace a three-stage evolution: (1) task-specific single-modality models operating on SMILES and reaction strings, (2) multimodal models bridging molecular representations with spectra, synthesis actions, and natural language, and (3) large language models and LLM-powered agents capable of general chemical reasoning.</p>
<h2 id="why-transformers-for-chemistry">Why Transformers for Chemistry?</h2>
<p>The authors motivate the review by drawing analogies between natural language and chemical language. Just as text can be decomposed into subwords and tokens, molecules can be linearized into <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings, and chemical reactions can be encoded as reaction SMILES. This structural parallel enabled direct transfer of transformer architectures, originally designed for machine translation, to chemical prediction tasks.</p>
<p>Several factors accelerated this adoption:</p>
<ul>
<li>The publication of open chemical databases and benchmarks (e.g., <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>, Open Reaction Database, Therapeutics Data Commons)</li>
<li>Improvements in compute infrastructure and training algorithms</li>
<li>The success of attention mechanisms at capturing context-dependent relationships, which proved effective for learning chemical grammar and atom-level correspondences</li>
</ul>
<p>The review positions the transformer revolution in chemistry as a natural extension of NLP advances, noting that the gap between chemical and natural language is progressively closing.</p>
<h2 id="molecular-representations-as-language">Molecular Representations as Language</h2>
<p>A key section of the review covers text-based molecular representations that make transformer applications possible:</p>
<ul>
<li><strong>SMILES</strong> (Simplified Molecular Input Line Entry System): The dominant linearization scheme since the 1980s, encoding molecular graphs as character sequences with special symbols for bonds, branches, and rings.</li>
<li><strong>SELFIES</strong> (Self-Referencing Embedded Strings): A newer representation that guarantees every string maps to a valid molecule, addressing the robustness issues of SMILES in generative settings.</li>
<li><strong>Reaction SMILES</strong>: Extends molecular representations to encode full chemical reactions in the format &ldquo;A.B &gt; catalyst.reagent &gt; C.D&rdquo;, enabling reaction prediction as a sequence-to-sequence task.</li>
</ul>
<p>The authors note that while IUPAC names, InChI, and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> exist as alternatives, SMILES and SELFIES dominate practical applications.</p>
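<p>As a minimal illustration of the reaction SMILES format above, the three groups can be recovered with plain string splitting. This is a hedged sketch (the function name is mine; production parsers also handle atom maps and the abbreviated &ldquo;reactants&gt;&gt;products&rdquo; form):</p>

```python
def parse_reaction_smiles(rxn: str):
    """Split a reaction SMILES "reactants>agents>products" into its three
    groups; each group is a '.'-separated list of molecules."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 2 '>' separators, got {len(parts) - 1}")
    # Drop empty entries so the agent-free shorthand "A.B>>C" yields [].
    return tuple([m for m in part.split(".") if m] for part in parts)

# Toy acid-catalyzed esterification written as a reaction SMILES
reactants, agents, products = parse_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(reactants)  # ['CC(=O)O', 'OCC']
print(agents)     # ['[H+]']
print(products)   # ['CC(=O)OCC', 'O']
```

<p>Framing reaction prediction as sequence-to-sequence translation then amounts to mapping the reactant and agent tokens to the product tokens.</p>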
<h2 id="stage-1-task-specific-transformer-models">Stage 1: Task-Specific Transformer Models</h2>
<p>The first stage of transformer adoption focused on clearly defined chemical tasks, with models trained on a single data modality (molecular strings).</p>
<h3 id="chemical-translation-tasks">Chemical Translation Tasks</h3>
<p>The encoder-decoder architecture was directly applied to tasks framed as translation:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a></strong> (Schwaller et al.): Treated reaction prediction as translation from reactant SMILES to product SMILES, becoming a leading method for forward synthesis prediction.</li>
<li><strong>Retrosynthetic planning</strong>: The reverse task, predicting reactants from products, with iterative application to construct full retrosynthetic trees mapping to commercially available building blocks.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a></strong> (Irwin et al.): A model pre-trained once and fine-tuned across multiple chemical tasks, transferring to new applications with improved performance.</li>
<li><strong>Graph-to-sequence models</strong> (Tu and Coley): Used a custom graph encoder with a transformer decoder, achieving improvements through permutation-invariant molecular graph encoding.</li>
</ul>
<h3 id="representation-learning-and-feature-extraction">Representation Learning and Feature Extraction</h3>
<p>Encoder-only transformers proved valuable for generating molecular and reaction embeddings:</p>
<ul>
<li><strong>Reaction representations</strong> (Wang et al., SMILES-BERT): Trained models to generate reaction vectors that outperformed hand-engineered features on downstream regression tasks.</li>
<li><strong>Reaction classification</strong> (Schwaller et al.): Replaced the decoder with a classification layer to map chemical reactions by class, revealing clustering patterns by reaction type, data source, and molecular properties.</li>
<li><strong>Yield prediction</strong>: Regression heads attached to encoders achieved strong results on high-throughput experimentation datasets.</li>
<li><strong>Protein language models</strong> (Rives et al., ESM): Trained on 250 million protein sequences using unsupervised learning, achieving strong performance on protein property and structure prediction.</li>
<li><strong>RXNMapper</strong> (Schwaller et al.): A notable application where attention weight analysis revealed that transformers internally learn atom-to-atom mappings in chemical reactions, leading to an open-source atom mapping algorithm that outperformed existing approaches.</li>
</ul>
<h2 id="stage-2-multimodal-chemical-models">Stage 2: Multimodal Chemical Models</h2>
<p>The second stage extended transformers beyond molecular strings to incorporate additional data types:</p>
<ul>
<li><strong>Molecular captioning</strong>: Describing molecules in natural language, covering scaffolds, sources, drug interactions, and other features (Edwards et al.).</li>
<li><strong>Bidirectional molecule-text conversion</strong>: Models capable of generating molecules from text queries and performing molecule-to-molecule tasks (Christofidellis et al.).</li>
<li><strong>Experimental procedure prediction</strong>: Generating actionable synthesis steps from reaction SMILES (Vaucher et al.), bridging the gap between retrosynthetic planning and laboratory execution.</li>
<li><strong>Structural elucidation from IR spectra</strong>: Encoding IR spectra as text sequences alongside chemical formulas, then predicting SMILES from these inputs (Alberts et al.), achieving 45% accuracy in structure prediction and surpassing prior approaches for functional group identification.</li>
</ul>
<h2 id="stage-3-large-language-models-and-chemistry-agents">Stage 3: Large Language Models and Chemistry Agents</h2>
<p>The most recent stage builds on foundation models pre-trained on vast text corpora, adapted for chemistry through fine-tuning and in-context learning.</p>
<h3 id="scaling-laws-and-emergent-capabilities">Scaling Laws and Emergent Capabilities</h3>
<p>The authors discuss how model scaling leads to emergent capabilities relevant to chemistry:</p>
<ul>
<li>Below certain compute thresholds, model performance on chemistry tasks appears random.</li>
<li>Above critical sizes, sudden improvements emerge, along with capabilities like chain-of-thought (CoT) reasoning and instruction following.</li>
<li>These emergent abilities enable chemistry tasks that require multi-step reasoning without explicit training on chemical data.</li>
</ul>
<h3 id="llms-as-chemistry-tools">LLMs as Chemistry Tools</h3>
<p>Key applications of LLMs in chemistry include:</p>
<ul>
<li><strong><a href="/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/">Fine-tuning for low-data chemistry</a></strong> (Jablonka et al.): GPT-3 fine-tuned on limited chemistry datasets performed comparably to, and sometimes exceeded, specialized models with engineered features for tasks like predicting transition wavelengths and phase classification.</li>
<li><strong>In-context learning</strong>: Providing LLMs with a few examples enables prediction on chemistry tasks without any parameter updates, particularly valuable when data is scarce.</li>
<li><strong>Bayesian optimization with LLMs</strong> (Ramos et al.): Using GPT models for uncertainty-calibrated regression, enabling catalyst and molecular optimization directly from synthesis procedures without feature engineering.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">3D structure generation</a></strong> (Flam-Shepherd and Aspuru-Guzik): Using language models to generate molecular structures with three-dimensional atomic positions in XYZ, CIF, and PDB formats, matching graph-based algorithms while overcoming representation limitations.</li>
</ul>
<h3 id="llm-powered-chemistry-agents">LLM-Powered Chemistry Agents</h3>
<p>The review highlights the agent paradigm as the most impactful recent development:</p>
<ul>
<li><strong>14 LLM use-cases</strong> (Jablonka et al.): A large-scale collaborative effort demonstrating applications from computational tool wrappers to reaction optimization assistants and scientific question answering.</li>
<li><strong><a href="/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/">ChemCrow</a></strong> (Bran, Cox et al.): An LLM-powered agent equipped with curated computational chemistry tools, capable of planning and executing tasks across drug design, materials design, and synthesis. ChemCrow demonstrated that tool integration overcomes LLM hallucination issues by grounding responses in reliable data sources.</li>
<li><strong>Autonomous scientific research</strong> (Boiko et al.): Agent systems that plan experiments and execute them through cloud laboratories.</li>
</ul>
<p>The agent paradigm offers tool composability through natural language interfaces, allowing users to chain multiple computational tools into custom pipelines.</p>
<h2 id="outlook-and-limitations">Outlook and Limitations</h2>
<p>The authors identify several themes for the future:</p>
<ul>
<li>The three stages represent increasing generality, from task-specific single-modality models to open-ended agents.</li>
<li>Natural language interfaces are progressively closing the gap between chemical and human language.</li>
<li>Tool integration through agents provides grounding that mitigates hallucination, a known limitation of direct LLM application to chemistry.</li>
<li>The review acknowledges that LLMs have a &ldquo;high propensity to generate false and inaccurate content&rdquo; on chemical tasks, making tool-augmented approaches preferable to direct application.</li>
</ul>
<p>The chapter does not provide quantitative benchmarks or systematic comparisons across the methods discussed, as its goal is to organize the landscape rather than evaluate individual methods.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review/survey chapter and does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the referenced works rather than the review itself.</p>
<h3 id="key-referenced-resources">Key Referenced Resources</h3>
<p>Several open-source tools and datasets discussed in the review are publicly available:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/rxn4chemistry/rxnmapper">RXNMapper</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Attention-based atom mapping</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-public">ChemCrow</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>LLM-powered chemistry agent</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>Various</td>
          <td>Molecular ML benchmarks</td>
      </tr>
      <tr>
          <td><a href="https://open-reaction-database.org/">Open Reaction Database</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA-4.0</td>
          <td>Curated reaction data</td>
      </tr>
      <tr>
          <td><a href="https://tdcommons.ai/">Therapeutics Data Commons</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Drug discovery ML datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-classification">Reproducibility Classification</h3>
<p><strong>Not applicable</strong> (review paper). Individual referenced works range from Highly Reproducible (open-source models like RXNMapper, ChemCrow) to Partially Reproducible (some models without released code) to Closed (proprietary LLMs like GPT-3/GPT-4 used in fine-tuning studies).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bran, A. M., &amp; Schwaller, P. (2024). Transformers and Large Language Models for Chemistry and Drug Discovery. In <em>Drug Development Supported by Informatics</em> (pp. 143-163). Springer Nature Singapore. <a href="https://doi.org/10.1007/978-981-97-4828-0_8">https://doi.org/10.1007/978-981-97-4828-0_8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@incollection</span>{bran2024transformers,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformers and Large Language Models for Chemistry and Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bran, Andres M. and Schwaller, Philippe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Drug Development Supported by Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{143--163}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature Singapore}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1007/978-981-97-4828-0_8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformers for Molecular Property Prediction Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</guid><description>A systematic review of 16 transformer models for molecular property prediction, analyzing architecture, data, tokenization, and benchmarking gaps.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformers-for-molecular-property-prediction">A Systematization of Transformers for Molecular Property Prediction</h2>
<p>This is a <strong>Systematization</strong> paper. Sultan et al. provide the first comprehensive, structured review of sequence-based transformer models applied to molecular property prediction (MPP). The review catalogs 16 models published between 2019 and 2023, organizes them by architecture type (encoder-decoder, encoder-only, decoder-only), and systematically examines seven key design decisions that arise when building a transformer for MPP. The paper&rsquo;s primary contribution is identifying gaps in current evaluation practices and articulating what standardization the field needs for meaningful progress.</p>
<h2 id="the-problem-inconsistent-evaluation-hinders-progress">The Problem: Inconsistent Evaluation Hinders Progress</h2>
<p>Molecular property prediction is essential for drug discovery, crop protection, and environmental science. Deep learning approaches, including transformers, have been increasingly applied to this task by learning molecular representations from string notations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. However, the field faces several challenges:</p>
<ol>
<li><strong>Small labeled datasets</strong>: Labeled molecular property datasets typically contain only hundreds or thousands of molecules, making supervised learning alone insufficient.</li>
<li><strong>No standardized evaluation protocol</strong>: Different papers use different data splits (scaffold vs. random), different splitting implementations, different numbers of repetitions (3 to 50), and sometimes do not share their test sets. This makes direct comparison across models infeasible.</li>
<li><strong>Unclear design choices</strong>: With many possible configurations for pre-training data, chemical language, tokenization, positional embeddings, model size, pre-training objectives, and fine-tuning approaches, the field lacks systematic analyses to guide practitioners.</li>
</ol>
<p>The authors note that standard machine learning methods with fixed-size molecular fingerprints remain strong baselines for real-world datasets, illustrating that the promise of transformers for MPP has not yet been fully realized.</p>
<h2 id="seven-design-questions-for-molecular-transformers">Seven Design Questions for Molecular Transformers</h2>
<p>The central organizing framework of this review addresses seven questions practitioners must answer when building a transformer for MPP. For each, the authors synthesize findings across the 16 reviewed models.</p>
<h3 id="reviewed-models">Reviewed Models</h3>
<p>The paper catalogs 16 models organized by architecture:</p>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Base Model</th>
          <th>Models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Encoder-Decoder</td>
          <td>Transformer, BART</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">ST</a>, Transformer-CNN, <a href="/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/">X-Mol</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>BERT</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, MAT, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, Mol-BERT, Chen et al., K-BERT, FP-BERT, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>RoBERTa</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
      </tr>
      <tr>
          <td>Decoder-Only</td>
          <td>XLNet</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> (RT)</td>
      </tr>
  </tbody>
</table>
<p>The core attention mechanism shared by all these models is the scaled dot-product attention:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
$$</p>
<p>where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_{k}$ is the dimension of the key vectors.</p>
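<p>For concreteness, the formula can be evaluated directly on small matrices. A didactic pure-Python sketch (real implementations vectorize this and add masking, multiple heads, and dropout):</p>

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention over plain nested lists.
    Q is n x d_k, K is m x d_k, V is m x d_v."""
    d_k = len(K[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [[sum(qi * kj for qi, kj in zip(q, k)) / math.sqrt(d_k) for k in K]
              for q in Q]
    # Row-wise softmax (subtract the row max for numerical stability)
    weights = []
    for row in scores:
        top = max(row)
        exps = [math.exp(s - top) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # output[i] = sum_j weights[i][j] * v_j
    return [[sum(w * v[c] for w, v in zip(wrow, V)) for c in range(len(V[0]))]
            for wrow in weights]

# The query aligns with the first key, so the output leans toward V[0]
out = attention(Q=[[1.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0], [0.0]])
print(out)
```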
<h3 id="question-1-which-database-and-how-many-molecules">Question 1: Which Database and How Many Molecules?</h3>
<p>Pre-training data sources vary considerably. The three main databases are ZINC (37 billion molecules in ZINC22), ChEMBL (2.4 million unique molecules with 20 million bioactivity measurements), and PubChem (111 million unique molecules). Pre-training set sizes ranged from 900K (ST on ChEMBL) to 1.1B molecules (MolFormer on ZINC + PubChem).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Database</th>
          <th>Size</th>
          <th>Language</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>ChEMBL</td>
          <td>900K</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>ChEMBL (<a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>)</td>
          <td>1.6M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>PubChem</td>
          <td>100K-10M</td>
          <td>SMILES, SELFIES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>PubChem</td>
          <td>5M-77M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>ZINC</td>
          <td>2M</td>
          <td>List of atoms</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>ZINC + PubChem</td>
          <td>1.1B</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>Chen et al.</td>
          <td>C, CP, CPZ</td>
          <td>2M-775M</td>
          <td>SMILES</td>
      </tr>
  </tbody>
</table>
<p>A key finding is that larger pre-training datasets do not consistently improve downstream performance. MolFormer showed minimal difference between models trained on 100M vs. 1.1B molecules. ChemBERTa-2 found that a model trained with MLM on 5M molecules performed comparably to one trained on 77M molecules on BBBP (both around 0.70 ROC-AUC). Chen et al. reported comparable $R^{2}$ values of $0.925 \pm 0.01$, $0.917 \pm 0.012$, and $0.915 \pm 0.01$ for ESOL across datasets of 2M, 103M, and 775M molecules, respectively. The data composition and covered chemical space appear to matter more than raw size.</p>
<h3 id="question-2-which-chemical-language">Question 2: Which Chemical Language?</h3>
<p>Most models use SMILES. ChemBERTa, RT, and SELFormer also explored SELFIES. MAT uses a simple list of atoms with structural features, while Mol-BERT and FP-BERT use circular fingerprints.</p>
<p>Direct comparisons between SMILES and SELFIES (by ChemBERTa on Tox21 SR-p53 and RT for drug-likeness prediction) found no significant performance difference. The RT authors reported that SELFIES models performed approximately $0.004 \pm 0.01$ better on RMSE, while SMILES models performed approximately $0.004 \pm 0.01$ better on Pearson correlation. The choice of chemical language does not appear to be a major factor in prediction performance, and even non-string representations (atom lists in MAT, fingerprints in Mol-BERT) perform competitively.</p>
<h3 id="question-3-how-to-tokenize">Question 3: How to Tokenize?</h3>
<p>Tokenization methods span atom-level (42-66 vocabulary tokens), regex-based (47-2,362 tokens), BPE (509-52K tokens), and substructure-based (3,357-13,325 tokens) approaches. No systematic comparison of tokenization strategies exists in the literature. The vocabulary size varied dramatically, from 42 tokens for MolBERT to over 52K for ChemBERTa. The authors argue that chemically meaningful tokenization (e.g., functional group-based fragmentation) could improve both performance and explainability.</p>
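<p>As an example of the regex-based family, a tokenizer in the spirit of the pattern published with the Molecular Transformer keeps bracket atoms, two-letter elements, and two-digit ring closures as single tokens. The pattern below is a simplified sketch, not any specific model&rsquo;s vocabulary:</p>

```python
import re

# Order matters: "Br"/"Cl" must precede "B"/"C" so two-letter
# elements win; "%\d{2}" must precede "\d" for ring closures >= 10.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|N|O|S|P|F|I|b|c|n|o|s|p|B|C"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: a lossy tokenizer silently corrupts molecules.
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, 21 tokens
```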
<h3 id="question-4-how-to-add-positional-embeddings">Question 4: How to Add Positional Embeddings?</h3>
<p>Most models inherited the absolute positional embedding from their NLP base models. MolBERT and RT adopted relative positional embeddings. MolFormer combined absolute and Rotary Positional Embedding (RoPE). MAT incorporated spatial information (inter-atomic 3D distances and adjacency) alongside self-attention.</p>
<p>MolFormer&rsquo;s comparison showed that RoPE became superior to absolute embeddings only when the pre-training dataset was very large. The performance difference (MAE on QM9) between absolute and RoPE embeddings for models trained on 111K, 111M, and 1.1B molecules was approximately $-0.20 \pm 0.18$, $-0.44 \pm 0.22$, and $0.27 \pm 0.12$, respectively.</p>
<p>The authors highlight that SMILES and SELFIES are linearizations of a 2D molecular graph, so consecutive tokens in a sequence are not necessarily spatially close. Positional embeddings that reflect 2D or 3D molecular structure remain underexplored.</p>
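<p>The absolute scheme most models inherit is the sinusoidal embedding of the original Transformer, which depends only on the token&rsquo;s index in the string, underscoring the point that sequence position need not track molecular proximity. A minimal sketch:</p>

```python
import math

def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Absolute sinusoidal positional embedding from "Attention Is All
    You Need": PE(pos, 2i) = sin(pos / 10000^(2i/d)) and
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]  # trim the extra cos term when d_model is odd

# Position 0 embeds as alternating sin(0)/cos(0) = [0, 1, 0, 1, ...]
print(sinusoidal_position(0, 8))
```

<p>Two ring-atom tokens separated by a long branch get distant embeddings under this scheme even when the atoms are bonded, which is exactly the mismatch the authors flag.</p>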
<h3 id="question-5-how-many-parameters">Question 5: How Many Parameters?</h3>
<p>Model sizes range from approximately 7M (ST, Mol-BERT) to over 100M parameters (MAT). Most chemical language models operate with 100M parameters or fewer, much smaller than NLP models like BERT (110M-330M) or GPT-3 (175B).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Dimensions</th>
          <th>Heads</th>
          <th>Layers</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>256</td>
          <td>4</td>
          <td>4</td>
          <td>7M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>768</td>
          <td>12</td>
          <td>12</td>
          <td>85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>768</td>
          <td>12</td>
          <td>6, 12</td>
          <td>43M, 85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
          <td>768</td>
          <td>12, 4</td>
          <td>8, 12</td>
          <td>57M, 85M</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>1024</td>
          <td>16</td>
          <td>8</td>
          <td>101M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>768</td>
          <td>12</td>
          <td>6</td>
          <td>43M</td>
      </tr>
  </tbody>
</table>
<p>SELFormer and MolFormer both tested different model sizes. SELFormer&rsquo;s larger model (approximately 86M parameters) showed approximately 0.034 better ROC-AUC for BBBP compared to the smaller model. MolFormer&rsquo;s larger model (approximately 87M parameters) performed approximately 0.04 better ROC-AUC on average for BBBP, HIV, BACE, and SIDER. The field lacks the systematic scaling analyses (analogous to Kaplan et al. and Hoffmann et al. in NLP) needed to establish proper scaling laws for chemical language models.</p>
<h3 id="question-6-which-pre-training-objectives">Question 6: Which Pre-training Objectives?</h3>
<p>Pre-training objectives fall into domain-agnostic and domain-specific categories:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Pre-training Objective</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>MLM</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a></td>
          <td>MLM</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>MLM, PhysChemPred, SMILES-EQ</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td>K-BERT</td>
          <td>Atom feature, MACCS prediction, CL</td>
          <td>Update last layer</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>MLM, MTR</td>
          <td>Update</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>MLM, 2D Adjacency, 3D Distance</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a></td>
          <td>Denoising Span MLM, Augmentation</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">RT</a></td>
          <td>PLM (Permutation Language Modeling)</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Domain-specific objectives (predicting physico-chemical properties, atom features, or MACCS keys) showed promising but inconsistent results. MolBERT&rsquo;s PhysChemPred objective alone performed comparably to the full three-objective model (approximately $0.72 \pm 0.06$ vs. $0.71 \pm 0.06$ ROC-AUC in virtual screening), while the SMILES-EQ objective (identifying equivalent SMILES) lowered performance when combined with the other objectives. K-BERT&rsquo;s contrastive learning objective made no significant difference (average ROC-AUC of 0.806 with CL vs. 0.807 without).</p>
<p>ChemBERTa-2&rsquo;s Multi-Task Regression (MTR) objective noticeably outperformed MLM-only pre-training on nearly all of the four classification tasks, across pre-training dataset sizes.</p>
<h3 id="question-7-how-to-fine-tune">Question 7: How to Fine-tune?</h3>
<p>Fine-tuning through weight updates generally outperforms frozen representations. SELFormer showed this most dramatically, with a difference of 2.187 RMSE between frozen and updated models on FreeSolv. MolBERT showed a much smaller difference (0.575 RMSE on FreeSolv), likely because its domain-specific pre-training objectives already produced representations closer to the downstream tasks.</p>
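<p>The distinction between the two strategies can be illustrated with a toy optimizer step that either skips or updates the encoder weights. The parameter names below are hypothetical, not drawn from any reviewed model.</p>

```python
def sgd_step(params, grads, lr=0.1, freeze_encoder=False):
    """One SGD step over a flat parameter dict; optionally skip the encoder."""
    new = {}
    for name, value in params.items():
        if freeze_encoder and name.startswith("encoder."):
            new[name] = value  # frozen: keep the pre-trained weight
        else:
            new[name] = value - lr * grads[name]
    return new

params = {"encoder.w": 1.0, "head.w": 0.5}
grads = {"encoder.w": 0.2, "head.w": 0.4}

frozen = sgd_step(params, grads, freeze_encoder=True)   # only head.w moves
updated = sgd_step(params, grads, freeze_encoder=False)  # all weights move
```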
<h2 id="benchmarking-challenges-and-performance-comparison">Benchmarking Challenges and Performance Comparison</h2>
<h3 id="downstream-datasets">Downstream Datasets</h3>
<p>The review focuses on nine benchmark datasets across three categories from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Application</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>LogD at pH 7.4</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,050</td>
          <td>1 classification</td>
          <td>Physiology</td>
          <td>Blood-brain barrier</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>1,484</td>
          <td>2 classification</td>
          <td>Physiology</td>
          <td>Clinical trial toxicity</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>1,427</td>
          <td>27 classification</td>
          <td>Physiology</td>
          <td>Drug side effects</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>7,831</td>
          <td>12 classification</td>
          <td>Physiology</td>
          <td>Nuclear receptor/stress pathways</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>1,513</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Beta-secretase 1 binding</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Anti-HIV activity</td>
      </tr>
  </tbody>
</table>
<h3 id="inconsistencies-in-evaluation">Inconsistencies in Evaluation</h3>
<p>The authors document substantial inconsistencies that prevent fair model comparison:</p>
<ol>
<li><strong>Data splitting</strong>: Models used different splitting methods (scaffold vs. random) and different implementations even when using the same method. Not all models adhered to scaffold splitting for classification tasks as recommended.</li>
<li><strong>Different test sets</strong>: Even models using the same split type may not evaluate on identical test molecules due to different random seeds.</li>
<li><strong>Varying repetitions</strong>: Repetitions ranged from 3 (RT) to 50 (Chen et al.), making some analyses more statistically robust than others.</li>
<li><strong>Metric inconsistency</strong>: Most use ROC-AUC for classification and RMSE for regression, but some models report only averages without standard deviations, while others report standard errors.</li>
</ol>
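<p>To make the first point concrete, here is a minimal sketch of a deterministic scaffold split. The <code>scaffold_fn</code> argument is a placeholder; real pipelines derive Bemis-Murcko scaffolds, for example with RDKit&rsquo;s <code>MurckoScaffold</code> utilities.</p>

```python
from collections import defaultdict

def scaffold_split(smiles_list, scaffold_fn, test_frac=0.2):
    """Group molecules by scaffold, then fill the test set with the
    smallest scaffold groups so train and test share no scaffold."""
    groups = defaultdict(list)
    for smi in smiles_list:
        groups[scaffold_fn(smi)].append(smi)
    ordered = sorted(groups.values(), key=len)  # smallest groups -> test
    n_test = int(test_frac * len(smiles_list))
    train, test = [], []
    for group in ordered:
        (test if len(test) < n_test else train).extend(group)
    return train, test

# Toy scaffold function (purely illustrative): strip trailing digits.
toy_scaffold = lambda s: s.rstrip("0123456789")
train, test = scaffold_split(["c1", "c1", "c1", "N2", "O3"], toy_scaffold, 0.4)
```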
<h3 id="performance-findings">Performance Findings</h3>
<p>When comparing only models evaluated on the same test sets (Figure 2 in the paper), the authors observe that transformer models show comparable, but not consistently superior, performance to existing ML and DL models. The performance varies considerably across models and datasets.</p>
<p>For BBBP, Mol-BERT reported a lower ROC-AUC than its corresponding MPNN (approximately 0.88 vs. 0.91), while MolBERT outperformed both its corresponding CDDD model (approximately 0.86 vs. 0.76 ROC-AUC) and its SVM baseline (approximately 0.86 vs. 0.70 ROC-AUC). A similar mixed pattern appeared for HIV: ChemBERTa performed worse than its corresponding ML models, while MolBERT beat its ML baselines by approximately 0.08 ROC-AUC and its DL baselines by approximately 0.03. On SIDER, Mol-BERT exceeded its corresponding MPNN by approximately 0.1 ROC-AUC. For regression, MAT and MolBERT improved over their ML and DL baselines on ESOL, FreeSolv, and Lipophilicity; on ESOL, for example, MAT&rsquo;s RMSE was approximately 0.2 lower than the SVM model&rsquo;s and approximately 0.03 lower than the Weave model&rsquo;s.</p>
<h2 id="key-takeaways-and-future-directions">Key Takeaways and Future Directions</h2>
<p>The review concludes with six main takeaways:</p>
<ol>
<li><strong>Performance</strong>: Transformers using SMILES show comparable but not consistently superior performance to existing ML and DL models for MPP.</li>
<li><strong>Scaling</strong>: No systematic analysis of model parameter scaling relative to data size exists for chemical language models. Such analysis is essential.</li>
<li><strong>Pre-training data</strong>: Dataset size alone is not the sole determinant of downstream performance. Composition and chemical space coverage matter.</li>
<li><strong>Chemical language</strong>: SMILES and SELFIES perform similarly. Alternative representations (atom lists, fingerprints) also work when the architecture is adjusted.</li>
<li><strong>Domain knowledge</strong>: Domain-specific pre-training objectives show promise, but tokenization and positional encoding remain underexplored.</li>
<li><strong>Benchmarking</strong>: The community needs standardized data splitting, fixed test sets, statistical analysis, and consistent reporting to enable meaningful comparison.</li>
</ol>
<p>The authors also highlight the need for attention visualization and explainability analysis, investigation of NLP-originated techniques (pre-training regimes, fine-tuning strategies like LoRA, explainability methods), and adaptation of these techniques to the specific characteristics of chemical data (smaller vocabularies, shorter sequences).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper. No new data or models are introduced. All analyses use previously reported results from the 16 reviewed papers, with additional visualization and comparison. The authors provide a GitHub repository with the code and data used to generate their comparative figures.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Not applicable (review paper). The paper describes training strategies at a conceptual level, referencing the original publications for implementation details.</p>
<h3 id="models">Models</h3>
<p>Not applicable (review paper). The paper catalogs 16 models with their architecture details, parameter counts, and training configurations across Tables 1, 4, 5, 6, and 7.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The paper compiles performance across nine MoleculeNet datasets. Key comparison figures (Figures 2 and 7) restrict to models evaluated on the same test sets for fair comparison, using ROC-AUC for classification and RMSE for regression.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/Transformers4MPP_review">Transformers4MPP_review</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Figure generation code and compiled data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sultan, A., Sieg, J., Mathea, M., &amp; Volkamer, A. (2024). Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years. <em>Journal of Chemical Information and Modeling</em>, 64(16), 6259-6280. <a href="https://doi.org/10.1021/acs.jcim.4c00747">https://doi.org/10.1021/acs.jcim.4c00747</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sultan2024transformers,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sultan, Afnan and Sieg, Jochen and Mathea, Miriam and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6259--6280}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00747}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer CLMs for SMILES: Literature Review 2024</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformer-clms-smiles-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformer-clms-smiles-review/</guid><description>Review of transformer-based chemical language models for SMILES, covering encoder, decoder, and encoder-decoder architectures for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-based-chemical-language-models">A Systematization of Transformer-Based Chemical Language Models</h2>
<p>This paper is a <strong>Systematization</strong> (literature review) that surveys the landscape of transformer-based chemical language models (CLMs) operating on SMILES representations. It organizes the field into three architectural categories (encoder-only, decoder-only, encoder-decoder), discusses tokenization strategies, pre-training and fine-tuning methodologies, and identifies open challenges and future research directions. The review covers approximately 30 distinct CLMs published through early 2024.</p>
<h2 id="why-review-transformer-clms-for-smiles">Why Review Transformer CLMs for SMILES?</h2>
<p>The chemical space is vast, with databases like ZINC20 exceeding 5.5 billion compounds, and the amount of unlabeled molecular data far outstrips available labeled data for specific tasks like toxicity prediction or binding affinity estimation. Traditional molecular representations (fingerprints, descriptors, graph-based methods) require expert-engineered features and extensive domain knowledge.</p>
<p>Transformer-based language models, originally developed for NLP, have emerged as a compelling alternative. By treating <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings as a &ldquo;chemical language,&rdquo; these models can leverage large-scale unsupervised pre-training on abundant unlabeled molecules, then fine-tune on small labeled datasets for specific downstream tasks. Earlier approaches like Seq2Seq and Seq3Seq fingerprint methods used RNN-based encoder-decoders, but these suffered from vanishing gradients and sequential processing bottlenecks when handling long SMILES sequences.</p>
<p>The authors motivate this review by noting that no prior survey has comprehensively organized transformer-based CLMs by architecture type while simultaneously covering tokenization, embedding strategies, and downstream application domains.</p>
<h2 id="architectural-taxonomy-encoder-decoder-and-encoder-decoder-models">Architectural Taxonomy: Encoder, Decoder, and Encoder-Decoder Models</h2>
<p>The core organizational contribution is a three-way taxonomy of transformer CLMs based on their architectural backbone.</p>
<h3 id="encoder-only-models-bert-family">Encoder-Only Models (BERT Family)</h3>
<p>These models capture bidirectional context, making them well suited for extracting molecular representations for property prediction tasks. The review covers:</p>
<ul>
<li><strong>BERT</strong> (Lee and Nam, 2022): Adapted for SMILES processing with linguistic knowledge infusion, using BPE tokenization</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MOLBERT</a></strong> (Fabian et al., 2020): Chemistry-specific BERT for physicochemical property and bioactivity prediction</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a></strong> (Wang et al., 2019): BERT variant designed to learn molecular representations directly from SMILES without feature engineering</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> / <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></strong> (Chithrananda et al., 2020; Ahmad et al., 2022): RoBERTa-based models optimized for chemical property prediction, with ChemBERTa-2 exploring multi-task pre-training</li>
<li><strong>GPT-MolBERTa</strong> (Balaji et al., 2023): Combines GPT molecular features with a RoBERTa backbone</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a></strong> (Ross et al., 2022): Large-scale model trained on 1.1 billion molecules, published in Nature Machine Intelligence</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></strong> (Yuksel et al., 2023): Operates on <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> representations rather than SMILES</li>
<li><strong>Mol-BERT / MolRoPE-BERT</strong> (Li and Jiang, 2021; Liu et al., 2023): Differ in positional embedding strategy, with MolRoPE-BERT using rotary position embedding to handle longer sequences</li>
<li><strong>BET</strong> (Chen et al., 2021): Extracts predictive representations from hundreds of millions of molecules</li>
</ul>
<h3 id="decoder-only-models-gpt-family">Decoder-Only Models (GPT Family)</h3>
<p>These models excel at generative tasks, including de novo molecular design:</p>
<ul>
<li><strong>GPT-2-based model</strong> (Adilov, 2021): Generative pre-training from molecules</li>
<li><strong>MolXPT</strong> (Liu et al., 2023): Wraps molecules with text for generative pre-training, connecting chemical and natural language</li>
<li><strong>BioGPT</strong> (Luo et al., 2022): Focuses on biomedical text generation and mining</li>
<li><strong>MolGPT</strong> (Haroon et al., 2023): Uses relative attention to capture token distances and relationships for de novo drug design</li>
<li><strong>Mol-Instructions</strong> (Fang et al., 2023): Large-scale biomolecular instruction dataset for LLMs</li>
</ul>
<h3 id="encoder-decoder-models">Encoder-Decoder Models</h3>
<p>These combine encoding and generation capabilities for sequence-to-sequence tasks:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a></strong> (Irwin et al., 2022): BART-based model for reaction prediction and molecular property prediction</li>
<li><strong>MolT5</strong> (adapted T5): Unified text-to-text framework for molecular tasks</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">SMILES Transformer</a></strong> (Honda et al., 2019): Pre-trained molecular fingerprints for low-data drug discovery</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/">X-MOL</a></strong> (Xue et al., 2020): Large-scale pre-training for molecular understanding</li>
<li><strong><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a></strong> (Born and Manica, 2023): Operates on <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, enabling concurrent regression and generation</li>
<li><strong>TransAntivirus</strong> (Mao et al., 2023): Specialized for antiviral drug design using IUPAC nomenclature</li>
</ul>
<h2 id="tokenization-embedding-and-pre-training-strategies">Tokenization, Embedding, and Pre-Training Strategies</h2>
<h3 id="smiles-tokenization">SMILES Tokenization</h3>
<p>The review identifies tokenization as a critical preprocessing step that affects downstream performance. SMILES tokenization differs from standard NLP tokenization because SMILES strings lack whitespace and use parentheses for branching rather than sentence separation. The key approaches include:</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Source</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES (AIS)</a></td>
          <td>Ucak et al. (2023)</td>
          <td>Atom-level tokens preserving chemical identity</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a></td>
          <td>Li and Fourches (2021)</td>
          <td>BPE-inspired substructure tokenization</td>
      </tr>
      <tr>
          <td>Byte-Pair Encoding (BPE)</td>
          <td>Chithrananda et al. (2020); Lee and Nam (2022)</td>
          <td>Standard subword tokenization adapted for SMILES</td>
      </tr>
      <tr>
          <td>SMILESTokenizer</td>
          <td>Chithrananda et al. (2020)</td>
          <td>Character-level tokenization with chemical adjustments</td>
      </tr>
  </tbody>
</table>
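<p>As an illustration of atom-level tokenization, the sketch below uses the regular expression popularized by Schwaller et al. for splitting SMILES into chemically meaningful tokens; the BPE and SPE strategies in the table would segment the same string into larger subword units.</p>

```python
import re

# Atom-level SMILES tokenizer: bracket atoms, two-letter elements (Br, Cl),
# single atoms, bonds, branches, and ring-closure digits.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_REGEX.findall(smiles)
    # Round-trip check: every character must belong to some token.
    assert "".join(tokens) == smiles, "unrecognized characters in input"
    return tokens

tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, 21 tokens
```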
<h3 id="positional-embeddings">Positional Embeddings</h3>
<p>The models use various positional encoding strategies: absolute, relative key, relative key-query, rotary (RoPE), and sinusoidal. Notably, SMILES-based models omit segmentation embeddings since SMILES data consists of single sequences rather than sentence pairs.</p>
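<p>For reference, here is a minimal sketch of the absolute sinusoidal variant, one of the strategies listed above; rotary embeddings (RoPE) instead rotate query/key pairs by position-dependent angles rather than adding a fixed vector.</p>

```python
import math

def sinusoidal_pe(num_positions, d_model):
    """Sine on even dimensions, cosine on odd, per the original Transformer."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_pe(4, 8)  # position 0 is [0, 1, 0, 1, ...]
```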
<h3 id="pre-training-and-fine-tuning-pipeline">Pre-Training and Fine-Tuning Pipeline</h3>
<p>The standard workflow follows two phases:</p>
<ol>
<li><strong>Pre-training</strong>: Unsupervised training on large unlabeled SMILES databases (ZINC, PubChem, ChEMBL) using masked language modeling (MLM), where the model learns to predict masked tokens within SMILES strings</li>
<li><strong>Fine-tuning</strong>: Supervised adaptation on smaller labeled datasets for specific tasks (classification or regression)</li>
</ol>
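<p>Step 1 can be sketched as BERT-style token corruption applied to a tokenized SMILES string. The 15%/80%/10%/10% ratios below follow the original BERT recipe and may differ across the reviewed CLMs.</p>

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=random.Random(0)):
    """Select ~15% of positions as prediction targets: 80% become [MASK],
    10% a random vocabulary token, 10% are left unchanged."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok  # the model must reconstruct this token
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = "[MASK]"
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token
    return corrupted, labels

corrupted, labels = mask_tokens(["C"] * 200, vocab=["C", "N", "O"])
```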
<p>The self-attention mechanism, central to all transformer CLMs, is formulated as:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}}\right) XW^V
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^Q$, $W^K$, $W^V \in \mathbb{R}^{M \times d_k}$ are learnable weight matrices, and $\sqrt{d_k}$ is the scaling factor.</p>
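<p>For tiny matrices the formula can be evaluated directly in plain Python; here $Q$, $K$, and $V$ stand for the already-projected matrices $XW^Q$, $XW^K$, and $XW^V$. Real implementations batch this over multiple heads with tensor operations.</p>

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(c) for c in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

# One query attending over two keys/values.
Z = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
```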
<h2 id="benchmark-datasets-and-evaluation-landscape">Benchmark Datasets and Evaluation Landscape</h2>
<p>The review catalogs the standard evaluation ecosystem for CLMs. Pre-training databases include ZINC, PubChem, and ChEMBL. Fine-tuning and evaluation rely heavily on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Datasets</th>
          <th>Task Type</th>
          <th>Example Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Regression</td>
          <td>642 to 4,200</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA, MUV, HIV, PDBbind, BACE</td>
          <td>Classification/Regression</td>
          <td>11,908 to 437,929</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox</td>
          <td>Classification</td>
          <td>1,427 to 8,575</td>
      </tr>
  </tbody>
</table>
<p>The authors also propose four new fine-tuning datasets targeting diseases: COVID-19 drug compounds, cocrystal formation, antimalarial drugs (Plasmodium falciparum targets), and cancer gene expression/drug response data.</p>
<h2 id="challenges-limitations-and-future-directions">Challenges, Limitations, and Future Directions</h2>
<h3 id="current-challenges">Current Challenges</h3>
<p>The review identifies several persistent limitations:</p>
<ol>
<li><strong>Data efficiency</strong>: Despite transfer learning, transformer CLMs still require substantial pre-training data, and labeled datasets for specific tasks remain scarce</li>
<li><strong>Interpretability</strong>: The complexity of transformer architectures makes it difficult to understand how specific molecular features contribute to predictions</li>
<li><strong>Computational cost</strong>: Training large-scale models demands significant GPU resources, limiting accessibility</li>
<li><strong>Handling rare molecules</strong>: Models struggle with molecular structures that deviate significantly from training data distributions</li>
<li><strong>SMILES limitations</strong>: Non-unique representations, invalid strings, exceeded atom valency, and inadequate spatial information capture</li>
</ol>
<h3 id="smiles-representation-issues">SMILES Representation Issues</h3>
<p>The authors highlight five specific problems with SMILES as an input representation:</p>
<ul>
<li>Non-canonical representations reduce string uniqueness for the same molecule</li>
<li>Many symbol combinations produce chemically invalid outputs</li>
<li>Valid SMILES strings can encode chemically impossible molecules (e.g., exceeded valency)</li>
<li>Spatial information is inadequately captured</li>
<li>Syntactic and semantic robustness is limited</li>
</ul>
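<p>The second and third points are distinct failure modes: a string can be syntactically malformed, or well-formed yet chemically impossible. The toy check below catches only two syntactic issues, unbalanced parentheses and unclosed ring-bond digits; it ignores bracket atoms and <code>%nn</code> ring closures, and full validity checking (valence, aromaticity) requires a chemistry toolkit such as RDKit.</p>

```python
def syntactically_plausible(smiles):
    """Toy syntax check: balanced parentheses and paired ring-closure digits."""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing branch that was never opened
        elif ch.isdigit():
            open_rings ^= {ch}  # each ring digit must appear an even number of times
    return depth == 0 and not open_rings
```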
<h3 id="future-research-directions">Future Research Directions</h3>
<p>The review proposes several directions:</p>
<ul>
<li><strong>Alternative molecular representations</strong>: Exploring <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, IUPAC, and InChI beyond SMILES</li>
<li><strong>Role of SMILES token types</strong>: Strategic masking of metals, non-metals, bonds, and branches during MLM pre-training to identify which components are most critical</li>
<li><strong>Few-shot learning</strong>: Combining few-shot approaches with large-scale pre-trained CLMs for data-scarce scenarios</li>
<li><strong>Drug repurposing</strong>: Training CLMs to distinguish identical compounds with different biological activity profiles across therapeutic domains</li>
<li><strong>Improved benchmarks</strong>: Incorporating disease-specific datasets (malaria, cancer, COVID-19) for more realistic evaluation</li>
<li><strong>Ethical considerations</strong>: Addressing dual-use risks, data biases, and responsible open-source release of CLMs</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a literature review paper. It does not introduce new models, code, or experimental results. The reproducibility assessment focuses on the accessibility of the reviewed works and proposed datasets.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20</td>
          <td>5.5B+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>100M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL</td>
          <td>2M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>642 to 437,929</td>
          <td>Standard benchmark suite</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>COVID-19 drug compounds</td>
          <td>740</td>
          <td>From Harigua-Souiai et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cocrystal formation</td>
          <td>3,282</td>
          <td>From Mswahili et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Antimalarial drugs</td>
          <td>4,794</td>
          <td>From Mswahili et al. (2024)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cancer gene/drug response</td>
          <td>201 drugs, 734 cell lines</td>
          <td>From Kim et al. (2021)</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://dai.chungbuk.ac.kr/">DAI Lab website</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Authors&rsquo; research lab</td>
      </tr>
  </tbody>
</table>
<p>No code, models, or evaluation scripts are released with this review. The paper does not include a supplementary materials section or GitHub repository.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (literature review).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Mswahili, M. E., &amp; Jeong, Y.-S. (2024). Transformer-based models for chemical SMILES representation: A comprehensive literature review. <em>Heliyon</em>, 10(20), e39038. <a href="https://doi.org/10.1016/j.heliyon.2024.e39038">https://doi.org/10.1016/j.heliyon.2024.e39038</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mswahili2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based models for chemical {SMILES} representation: A comprehensive literature review}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Mswahili, Medard Edmund and Jeong, Young-Seob}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Heliyon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e39038}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.heliyon.2024.e39038}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Systematic Review of Deep Learning CLMs (2020-2024)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/systematic-review-deep-learning-clms/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/systematic-review-deep-learning-clms/</guid><description>Systematic review of 72 deep learning molecular generation studies using MOSES and GuacaMol benchmarks across RNNs, transformers, VAEs, and GANs.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-chemical-language-models-for-molecular-generation">A Systematization of Chemical Language Models for Molecular Generation</h2>
<p>This paper is a <strong>Systematization</strong> that provides a comprehensive, PRISMA-guided systematic review of deep learning chemical language models (CLMs) used for de novo molecular generation. The primary contribution is a structured statistical analysis of 72 retrieved articles from 2020 to June 2024, comparing architectures (RNNs, transformers, VAEs, GANs, S4 models), molecular representations, biased generation strategies, and quality metrics from the MOSES and GuacaMol benchmarking platforms. The review addresses five research questions about architecture configuration effects, best-performing architectures, impactful hyperparameters, common molecular representations, and effective biased generation methods.</p>
<h2 id="motivation-evaluating-four-years-of-generative-clm-progress">Motivation: Evaluating Four Years of Generative CLM Progress</h2>
<p>Deep learning molecular generation has expanded rapidly since 2018, when <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> and <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al.</a> demonstrated that deep generative models could learn to produce novel molecules from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations. By 2020, multiple architectures (RNNs, transformers, VAEs, GANs) were being applied to chemical language modeling, and benchmarking platforms like <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> had been introduced to enable standardized evaluation.</p>
<p>Despite this growth, existing reviews largely focused on theoretical background or drug development applications rather than systematic statistical comparison of model performance. Few studies had examined how architecture choice, training dataset size, molecular representation format, and biased learning strategies interact to affect generation quality metrics like validity, uniqueness, and novelty. This review fills that gap by restricting the analysis to papers reporting MOSES or GuacaMol metrics, enabling quantitative cross-study comparison.</p>
<h2 id="prisma-based-systematic-review-methodology">PRISMA-Based Systematic Review Methodology</h2>
<p>The review follows the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) guidelines. Articles were retrieved from Scopus, Web of Science, and Google Scholar using six Boolean search queries combining terms like &ldquo;Molecule Generation,&rdquo; &ldquo;Chemical Language Models,&rdquo; &ldquo;Deep Learning,&rdquo; and specific architecture names. The search window covered January 2020 to June 2024.</p>
<h3 id="eligibility-criteria">Eligibility Criteria</h3>
<p>Papers were included if they:</p>
<ol>
<li>Were written in English</li>
<li>Explicitly presented at least two metrics of uniqueness, validity, or novelty</li>
<li>Defined these metrics consistent with MOSES or GuacaMol concepts</li>
<li>Used deep learning generative models for de novo molecule design</li>
<li>Used conventional (non-quantum) deep learning methods</li>
<li>Were published between January 2020 and June 2024</li>
</ol>
<p>This yielded 48 articles from the query-based search and 25 from the citation search, for a combined total of 72 articles after deduplication. Of these, 62 used CLM approaches (string-based molecular representations) and 10 used graph-based representations.</p>
<h3 id="data-collection">Data Collection</h3>
<p>For each article, the authors extracted: journal details, database name, training dataset size, molecular representation type (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, InChI, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>), architecture details (embedding length, layers, hidden units, trainable parameters, dropout, temperature, batch size, epochs, learning rate, optimizer), biased method usage (TL, RL, conditional learning), and generation metrics (validity, uniqueness, novelty, scaffold diversity, SNN, FCD).</p>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The review focuses on three core MOSES metrics:</p>
<p>$$
\text{Validity}(V_m) = \frac{\text{Valid molecules}}{\text{Molecules produced}}
$$</p>
<p>$$
\text{Uniqueness} = \frac{|\text{set}(V_m)|}{|V_m|}
$$</p>
<p>$$
\text{Novelty} = 1 - \frac{|V_m \cap T_d|}{|V_m|}
$$</p>
<p>where $V_m$ denotes the valid generated molecules, $\text{set}(V_m)$ their deduplicated set, and $T_d$ the training dataset.</p>
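<p>As a concrete illustration, all three metrics can be computed in a few lines. This is a minimal sketch, not code from the review: the validity check is passed in as a predicate, since a real pipeline would typically attempt to parse each string with RDKit.</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """MOSES-style validity, uniqueness, and novelty.

    generated: list of generated strings (duplicates and invalids allowed)
    training_set: set of training-set molecules
    is_valid: validity predicate (a real pipeline would parse with RDKit)
    """
    valid = [m for m in generated if is_valid(m)]
    if not valid:
        return {"validity": 0.0, "uniqueness": 0.0, "novelty": 0.0}
    unique = set(valid)                 # set(V_m)
    novel = unique - set(training_set)  # molecules absent from T_d
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid),
        "novelty": len(novel) / len(unique),
    }
```

<p>For example, four samples containing one invalid string and one duplicate, scored against a training set that contains one of the two unique valid molecules, yield validity 0.75, uniqueness 2/3, and novelty 0.5.</p>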
<h2 id="architecture-distribution-and-performance-comparison">Architecture Distribution and Performance Comparison</h2>
<h3 id="architecture-trends-2020-2024">Architecture Trends (2020-2024)</h3>
<p>The review found that RNNs and transformers dominate CLM usage, with a growing trend toward transformers over time. The reviewed models break down as 24 RNN-based, 23 transformer-based, 16 VAE-based, 8 GAN-based, and 1 S4-based. Among RNN variants, LSTM was the most common, followed by GRU, despite GRU having fewer trainable parameters.</p>
<p>The increase in transformer adoption is attributed to self-attention mechanisms enabling parallel computation and effective long-range dependency capture. Meanwhile, GANs and VAEs saw lower adoption rates, partly due to higher memory and time complexity and reduced ability to generate large molecules.</p>
<h3 id="molecular-representations-and-databases">Molecular Representations and Databases</h3>
<p>SMILES was used exclusively in 77.27% of CLM articles, reflecting its wide database availability and compact format. <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and InChI each appeared in smaller fractions. The dominant databases were ChEMBL and ZINC (27 articles each), followed by PubChem (4 articles). Approximately 71% of reviewed articles focused on drug discovery applications.</p>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Molecules (millions)</th>
          <th>Representation</th>
          <th>Articles</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>2.4</td>
          <td>SMILES, InChI</td>
          <td>27</td>
      </tr>
      <tr>
          <td>ZINC</td>
          <td>750</td>
          <td>SMILES</td>
          <td>27</td>
      </tr>
      <tr>
          <td>PubChem</td>
          <td>115.3</td>
          <td>SMILES, InChI</td>
          <td>4</td>
      </tr>
      <tr>
          <td>COCONUT</td>
          <td>0.695</td>
          <td>SMILES, InChI</td>
          <td>1</td>
      </tr>
      <tr>
          <td>DNA-Encoded Library</td>
          <td>1,040</td>
          <td>SMILES</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<h3 id="unbiased-model-performance">Unbiased Model Performance</h3>
<p><strong>Validity</strong>: No statistically significant differences were observed across architecture families. Transformers generally achieved high validity through self-attention mechanisms that retain uncompressed sequence information. However, one transformer model (TransMol) achieved only 6.9% validity when using stochastic sampling with Gaussian noise to explore unseen chemical space. GANs showed high dispersion, with validity as low as 8.5% when learning from gene expression signatures rather than molecular structures directly.</p>
<p><strong>Uniqueness</strong>: No significant differences in median uniqueness across architectures. Transformer-based models using masked self-attention achieved near-perfect uniqueness scores. Scaffold decoration and fragment-linking approaches sometimes compromised uniqueness due to overfit-driven redundancy.</p>
<p><strong>Validity-Novelty Trade-off</strong>: The authors propose a &ldquo;Valid/Sample&rdquo; metric (Validity x Novelty) and find an inverse trend between validity and novelty (Spearman $\rho = -0.3575$, p-value = 0.0618). Only 17.9% of models achieved above-median values for both validity (95.6%) and novelty (96.5%) simultaneously. SELFIES-based models achieve 100% validity by construction, which can help address this trade-off.</p>
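<p>The trade-off analysis can be sketched as follows. This is an illustrative re-implementation, not the authors' code: in practice <code>scipy.stats.spearmanr</code> would be used (and also returns the p-value), but a tie-aware rank correlation fits in a few lines of plain Python.</p>

```python
def average_ranks(xs):
    """1-based ranks, with ties assigned the average rank of their group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def valid_per_sample(validity, novelty):
    """The review's composite Valid/Sample metric: Validity x Novelty."""
    return [v * n for v, n in zip(validity, novelty)]
```

<p>A perfectly inverse ranking gives $\rho = -1$; the review's reported $\rho = -0.3575$ indicates a weak inverse trend across models.</p>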
<h3 id="biased-model-performance">Biased Model Performance</h3>
<p>The review examines three biased generation strategies:</p>
<p><strong>Transfer Learning (TL)</strong>: The most prevalent biased method, used across all architecture types. Fine-tuning transfers pre-trained parameters to a target model, requiring significantly fewer training molecules (median ~2,507 vs. ~1.1M for unbiased). TL does not significantly affect validity (p = 0.16) or novelty (p = 0.84), but uniqueness decreases significantly (median 90.2% vs. 97.9%, p = 0.014), likely due to overfitting on small target datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>TL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training size</td>
          <td>1,128,920</td>
          <td>2,507</td>
          <td>&lt;0.0001</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>98.05%</td>
          <td>95.5%</td>
          <td>0.1602</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>97.9%</td>
          <td>90.2%</td>
          <td>0.0144</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.6%</td>
          <td>96.0%</td>
          <td>0.8438</td>
      </tr>
  </tbody>
</table>
<p><strong>Reinforcement Learning (RL)</strong>: Applied only to RNNs and transformers in the reviewed set. 90.1% of RL implementations used policy gradient methods with scoring functions for properties like synthesizability, binding affinity, and membrane permeability. No significant effects on generation metrics were observed.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>RL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>91.1%</td>
          <td>96.5%</td>
          <td>0.1289</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>89.7%</td>
          <td>0.0935</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.5%</td>
          <td>93.5%</td>
          <td>0.2500</td>
      </tr>
  </tbody>
</table>
<p><strong>Conditional Learning (CL)</strong>: Integrates domain-specific data (properties, bioactivities, functional groups) directly into training via constraint tokens or property embeddings. Used primarily with encoder-decoder architectures (ARAEs, VAEs, transformers). CL does not significantly degrade generation metrics relative to unbiased models.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>CL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>98.5%</td>
          <td>96.8%</td>
          <td>0.4648</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>97.5%</td>
          <td>0.0753</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>89.3%</td>
          <td>99.6%</td>
          <td>0.2945</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-directions-for-chemical-language-models">Key Findings and Directions for Chemical Language Models</h2>
<h3 id="main-conclusions">Main Conclusions</h3>
<ol>
<li>
<p><strong>Transformers are overtaking RNNs</strong> as the dominant CLM architecture, driven by self-attention mechanisms that capture long-range dependencies without the gradient vanishing issues of recurrent models.</p>
</li>
<li>
<p><strong>SMILES remains dominant</strong> (77% of models) despite known limitations (non-uniqueness, syntax errors). SELFIES shows promise for improving the validity-novelty trade-off.</p>
</li>
<li>
<p><strong>No architecture achieves both high validity and high novelty easily.</strong> Only 17.9% of unbiased models exceeded medians for both metrics simultaneously, highlighting a fundamental tension in generative chemistry.</p>
</li>
<li>
<p><strong>Transfer learning requires only ~2,500 molecules</strong> to generate targeted compounds, compared to ~1.1M for unbiased training, but at the cost of reduced uniqueness.</p>
</li>
<li>
<p><strong>Combining biased methods</strong> (e.g., TL + RL, CL + TL) shows promise for multi-objective optimization and exploring distant regions of chemical space.</p>
</li>
<li>
<p><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/s4-chemical-language-modeling/">S4 models</a></strong> were newly introduced for CLMs in 2023, showing competitive performance thanks to their dual formulation: convolutional during training and recurrent during generation.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The review is restricted to papers reporting MOSES or GuacaMol metrics, which excludes many molecular generation studies that use alternative evaluation frameworks. The statistical comparisons rely on median values reported across different experimental settings, making direct architecture comparisons approximate. Graph-based approaches are included only for coarse comparison (10 of 72 articles) and are not the focus of the analysis.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a systematic review, so no new models were trained. The authors collected metadata from 72 published articles. No datasets were generated or analyzed beyond the literature corpus.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Statistical comparisons used Mann-Whitney U tests, a non-parametric test for independent (unpaired) samples. Spearman rank correlation was used to assess the validity-novelty relationship. Outlier identification used the Valid/Sample (Validity × Novelty) metric with box-plot analysis.</p>
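<p>For intuition, the rank-sum comparison can be sketched using only the standard library and a normal approximation. This is illustrative, not the authors' implementation; <code>scipy.stats.mannwhitneyu</code> is the standard choice and handles ties and small samples exactly.</p>

```python
from statistics import NormalDist

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the large-sample normal approximation."""
    n1, n2 = len(a), len(b)
    # U = number of (a_i, b_j) pairs with a_i > b_j, counting ties as 0.5
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mu = n1 * n2 / 2                                  # mean of U under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5     # std of U under H0
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p
```

<p>Two fully separated samples give a small p-value, while identical samples give $p = 1$, mirroring how the review distinguishes significant shifts (e.g. TL uniqueness) from non-significant ones.</p>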
<h3 id="evaluation">Evaluation</h3>
<p>The review evaluates models using MOSES metrics: validity, uniqueness, novelty, scaffold diversity, scaffold novelty, fragment similarity, SNN, internal diversity, and <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>. Statistical tests were applied to compare medians across architecture families and between biased and unbiased models.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (systematic review, no model training performed).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flores-Hernandez, H., &amp; Martínez-Ledesma, E. (2024). A systematic review of deep learning chemical language models in recent era. <em>Journal of Cheminformatics</em>, 16(1), 129. <a href="https://doi.org/10.1186/s13321-024-00916-y">https://doi.org/10.1186/s13321-024-00916-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{floreshernandez2024systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of deep learning chemical language models in recent era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flores-Hernandez, Hector and Mart{\&#39;i}nez-Ledesma, Emmanuel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{129}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00916-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Survey of Transformer Architectures in Molecular Science</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformers-molecular-science-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformers-molecular-science-review/</guid><description>A comprehensive review of 12 transformer architectures applied to molecular science, covering GPT, BERT, BART, graph transformers, and more.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-architectures-for-molecular-science">A Systematization of Transformer Architectures for Molecular Science</h2>
<p>This paper is a <strong>Systematization</strong> review. It organizes and taxonomizes 12 families of transformer architectures that have been applied across molecular science, including chemistry, biology, and drug discovery. The primary contribution is not a new method or dataset, but a structured technical overview of the algorithmic internals of each transformer variant and their specific applications to molecular problems. The review covers 201 references and provides a unified treatment of how these architectures capture molecular patterns from sequential, graphical, and image-based data.</p>
<h2 id="bridging-the-gap-between-transformer-variants-and-molecular-applications">Bridging the Gap Between Transformer Variants and Molecular Applications</h2>
<p>Transformer-based models have become widespread in molecular science, yet the authors identify a gap: there is no organized taxonomy linking these diverse techniques in the existing literature. Individual papers introduce specific architectures or applications, but practitioners lack a unified reference that explains the technical differences between GPT, BERT, BART, graph transformers, and other variants in the context of molecular data. The review aims to fill this gap by providing an in-depth investigation of the algorithmic components of each model family, explaining how their architectural innovations contribute to processing complex molecular data. The authors note that the success of transformers in molecular science stems from several factors: the sequential nature of chemical and biological molecules (DNA, RNA, proteins, SMILES strings), the attention mechanism&rsquo;s ability to capture long-range dependencies within molecular structures, and the capacity for transfer learning through pre-training on large chemical and biological datasets.</p>
<h2 id="twelve-transformer-families-and-their-molecular-mechanisms">Twelve Transformer Families and Their Molecular Mechanisms</h2>
<p>The review covers transformer preliminaries before diving into 12 specific architecture families. The core self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimension of the key vectors. The position-wise feed-forward network is:</p>
<p>$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$</p>
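<p>The attention equation above can be made concrete in a dependency-free sketch (lists of lists stand in for matrices; real implementations use tensor libraries with batched, multi-head variants):</p>

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
              for qr in Q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v for w, v in zip(wr, vc)) for vc in zip(*V)]
            for wr in weights]
```

<p>With a query that strongly matches the first key, the output is numerically the first row of <code>V</code>, showing how attention routes information between positions.</p>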
<p>The 12 architecture families covered are:</p>
<ol>
<li>
<p><strong>GPT (Generative Pre-trained Transformer)</strong>: Uses the decoder part of the transformer for autoregressive generation. Applications include MolGPT for molecular generation, DrugGPT for protein-ligand binding, and cMolGPT for target-specific de novo molecular generation.</p>
</li>
<li>
<p><strong>BERT (Bidirectional Encoder Representations from Transformers)</strong>: Uses transformer encoders with masked language modeling and next-sentence prediction for pre-training. Molecular applications include FP-BERT for molecular property prediction using composite fingerprint representations, Graph-BERT for protein-protein interaction identification, SMILES-BERT, and Mol-BERT.</p>
</li>
<li>
<p><strong>BART (Bidirectional and Auto-Regressive Transformers)</strong>: Functions as a denoising autoencoder with both encoder and decoder. Molecular applications include Chemformer for sequence-to-sequence chemistry tasks, MS2Mol for mass spectrometry analysis, and MolBART for molecular feature learning.</p>
</li>
<li>
<p><strong>Graph Transformer</strong>: Leverages self-attention on graph-structured data to capture global context. Applications include GraphSite for protein-DNA binding site prediction (using AlphaFold2 structure predictions), KPGT for knowledge-guided molecular graph pre-training, and PAGTN for establishing long-range dependencies in molecular graphs.</p>
</li>
<li>
<p><strong>Transformer-XL</strong>: Incorporates relative positional encoding for modeling long sequences. Used for small-molecule retention time prediction, drug design with ChEMBL data (1.27 million molecules), and Heck reaction generation.</p>
</li>
<li>
<p><strong><a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5 (Text-to-Text Transfer Transformer)</a></strong>: Unifies NLP tasks into text-to-text mapping. T5Chem was pre-trained on 97 million molecules from PubChem and achieved 99.5% accuracy on reaction classification (USPTO 500 MT). C5T5 uses IUPAC naming for molecular optimization in drug discovery.</p>
</li>
<li>
<p><strong>Vision Transformer (ViT)</strong>: Applies transformer architecture to image patches. Used for organic molecule classification (97% accuracy with WGAN-generated data), bacterial identification via SERS, and molecular property prediction from mass spectrometry data (TransG-Net).</p>
</li>
<li>
<p><strong>DETR (Detection Transformer)</strong>: End-to-end object detection using transformers. Applied to cryo-EM particle picking (TransPicker), molecular structure image recognition (IMG2SMI), and cell segmentation (Cell-DETR).</p>
</li>
<li>
<p><strong>Conformer</strong>: Integrates convolutional modules into the transformer structure. Used for DNA storage error correction (RRCC-DNN) and drug-target affinity prediction (NG-DTA, with the Davis and Kiba datasets).</p>
</li>
<li>
<p><strong>CLIP (Contrastive Language-Image Pre-training)</strong>: Multimodal learning linking text and images. Applied to peptide design (Cut&amp;CLIP for protein degradation), gene identification (pathCLIP), and drug discovery (CLOOME for zero-shot transfer learning).</p>
</li>
<li>
<p><strong>Sparse Transformers</strong>: Use sparse attention matrices to reduce complexity to $O(n\sqrt{n})$. Applied to drug-target interaction prediction with gated cross-attention mechanisms.</p>
</li>
<li>
<p><strong>Mobile and Efficient Transformers</strong>: Compressed variants (TinyBERT, MobileBERT) for resource-constrained environments. Molormer uses ProbSparse self-attention for drug-drug interaction prediction. LOGO is a lightweight pre-trained language model for non-coding genome interpretation.</p>
</li>
</ol>
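<p>A key distinction running through this taxonomy, separating GPT-style decoders (family 1) from BERT-style encoders (family 2), is the attention mask. A minimal sketch of the two masking schemes (illustrative, not from the paper):</p>

```python
def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j.

    causal=True  -> GPT-style decoder: attend only to self and the past,
                    enabling autoregressive generation.
    causal=False -> BERT-style encoder: full bidirectional attention,
                    suited to masked-token prediction.
    """
    return [[j <= i if causal else True for j in range(n)] for i in range(n)]
```

<p>The causal variant yields a lower-triangular mask; the bidirectional variant allows every position to see the whole sequence, which is why BERT-style models excel at representation learning while GPT-style models excel at generation.</p>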
<h2 id="survey-organization-and-coverage-of-molecular-domains">Survey Organization and Coverage of Molecular Domains</h2>
<p>As a survey paper, this work does not present new experiments. Instead, it catalogues existing applications across multiple molecular domains:</p>
<p><strong>Drug Discovery and Design</strong>: GPT-based ligand design (DrugGPT), BART-based molecular generation (Chemformer, MolBART), graph transformer pre-training for molecular property prediction (KPGT), T5-based chemical reaction prediction (T5Chem), and sparse transformer methods for drug-target interactions.</p>
<p><strong>Protein Science</strong>: BERT-based protein-protein interaction prediction (Graph-BERT), graph transformer methods for protein-DNA binding (GraphSite with AlphaFold2 integration), conformer-based drug-target affinity prediction (NG-DTA), and CLIP-based peptide design (Cut&amp;CLIP).</p>
<p><strong>Molecular Property Prediction</strong>: FP-BERT for fingerprint-based prediction, SMILES-BERT and Mol-BERT for end-to-end prediction from SMILES, KPGT for knowledge-guided graph pre-training, and Transformer-XL for property modeling with relative positional encoding.</p>
<p><strong>Structural Biology</strong>: DETR-based cryo-EM particle picking (TransPicker), vision transformer applications in cell imaging, and Cell-DETR for instance segmentation in microscopy.</p>
<p><strong>Genomics</strong>: Conformer-based DNA storage error correction (RRCC-DNN), LOGO for non-coding genome interpretation, and MetaTransformer for metagenomic sequencing analysis.</p>
<h2 id="future-directions-and-limitations-of-the-survey">Future Directions and Limitations of the Survey</h2>
<p>The review concludes with four future directions:</p>
<ol>
<li>
<p><strong>ChatGPT integration into molecular science</strong>: Using LLMs for data analysis, literature review, and hypothesis generation in chemistry and biology.</p>
</li>
<li>
<p><strong>Multifunction transformers</strong>: Models that extract features across diverse molecular structures and sequences simultaneously.</p>
</li>
<li>
<p><strong>Molecular-aware transformers</strong>: Architectures that handle multiple data types (text, sequence, structure, image, energy, molecular dynamics, function) in a unified framework.</p>
</li>
<li>
<p><strong>Self-assessment transformers and superintelligence</strong>: Speculative discussion of models that learn from seemingly unrelated data sources.</p>
</li>
</ol>
<p>The review has several limitations worth noting. The coverage is broad but shallow: each architecture family receives only 1-2 pages of discussion, and the paper largely describes existing work rather than critically evaluating it. The review does not systematically compare the architectures against each other on common benchmarks. The future directions section (particularly the superintelligence discussion) is speculative and lacks concrete proposals. The paper also focuses primarily on technical architecture descriptions rather than analyzing failure modes, scalability challenges, or reproducibility concerns across the surveyed methods. As a review article, no new data were created or analyzed.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a survey paper. No new datasets were created or used. The paper reviews applications involving datasets such as PubChem (97 million molecules for T5Chem), ChEMBL (1.27 million molecules for Transformer-XL drug design), USPTO 500 MT (reaction classification), ESOL (5,328 molecules for property prediction), and Davis/Kiba (drug-target affinity).</p>
<h3 id="algorithms">Algorithms</h3>
<p>No new algorithms are introduced. The paper provides mathematical descriptions of the core transformer components (self-attention, positional encoding, feed-forward networks, layer normalization) and describes how 12 architecture families modify these components.</p>
<h3 id="models">Models</h3>
<p>No new models are presented. The paper surveys existing models including MolGPT, DrugGPT, FP-BERT, SMILES-BERT, Chemformer, MolBART, GraphSite, KPGT, T5Chem, TransPicker, Cell-DETR, CLOOME, and Molormer, among others.</p>
<h3 id="evaluation">Evaluation</h3>
<p>No new evaluation is performed. Performance numbers cited from the literature include: T5Chem reaction classification accuracy of 99.5%, ViT organic molecule classification at 97%, Transformer-XL property prediction RMSE of 0.6 on ESOL, and Heck reaction generation feasibility rate of 47.76%.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware requirements are specified, as this is a survey paper.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/wcms.1725">Paper (open access)</a></td>
          <td>Paper</td>
          <td>CC-BY-NC-ND</td>
          <td>Open access via Wiley</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jiang, J., Ke, L., Chen, L., Dou, B., Zhu, Y., Liu, J., Zhang, B., Zhou, T., &amp; Wei, G.-W. (2024). Transformer technology in molecular science. <em>WIREs Computational Molecular Science</em>, 14(4), e1725. <a href="https://doi.org/10.1002/wcms.1725">https://doi.org/10.1002/wcms.1725</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jiang2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer technology in molecular science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jiang, Jian and Ke, Lu and Chen, Long and Dou, Bozheng and Zhu, Yueying and Liu, Jie and Zhang, Bengong and Zhou, Tianshou and Wei, Guo-Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{WIREs Computational Molecular Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e1725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Wiley}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1002/wcms.1725}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Survey of Scientific LLMs in Bio and Chem Domains</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/scientific-llm-survey-bio-chem/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/scientific-llm-survey-bio-chem/</guid><description>Survey of scientific LLMs covering textual, molecular, protein, genomic, and multimodal models for biological and chemical research.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-scientific-language-models">A Systematization of Scientific Language Models</h2>
<p>This paper is a <strong>Systematization</strong> (survey) that provides a comprehensive review of scientific large language models (Sci-LLMs) designed for biological and chemical domains. The survey covers five main branches of scientific language modeling: textual, molecular, protein, genomic, and multimodal LLMs. For each branch, the authors analyze model architectures, capabilities, training datasets, evaluation benchmarks, and assessment criteria, then identify open challenges and future research directions.</p>
<h2 id="motivation-bridging-scientific-languages-and-llms">Motivation: Bridging Scientific Languages and LLMs</h2>
<p>Large language models have demonstrated strong capabilities in natural language understanding, but scientific research involves specialized &ldquo;languages&rdquo; that differ fundamentally from natural text. Chemical molecules are expressed as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings, proteins as amino acid sequences, and genomes as nucleotide sequences. Each of these language systems has its own vocabulary and grammar. General-purpose LLMs like ChatGPT and GPT-4 often fail to properly handle these scientific data types because the semantics and grammar of scientific languages diverge substantially from natural language.</p>
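<p>The vocabulary mismatch is easy to see in code. A chemistry-aware SMILES tokenizer might look like the following sketch (the regex is a simplified, hypothetical variant of patterns common in the CLM literature, not one defined in this survey); a natural-language subword tokenizer would instead split such strings at arbitrary points:</p>

```python
import re

# Multi-character chemical tokens (Cl, Br, bracket atoms) must stay intact;
# this simplified pattern covers common organic-subset SMILES.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|[=#$/\\%@+\-().]|\d)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: every character must be accounted for.
    assert "".join(tokens) == smiles, "unrecognized characters in input"
    return tokens
```

<p>For example, <code>CCl</code> tokenizes to two tokens, <code>C</code> and <code>Cl</code>, rather than three characters, and a bracket atom like <code>[NH4+]</code> stays a single token.</p>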
<p>Prior surveys have focused on individual modalities (molecules, proteins, or genomes) in isolation. No comprehensive review had unified these language modeling advances into a single framework. This survey fills that gap by systematically covering all five modalities and, notably, the emerging area of multimodal Sci-LLMs that integrate multiple scientific languages.</p>
<h2 id="taxonomy-of-scientific-language-models">Taxonomy of Scientific Language Models</h2>
<p>The survey organizes Sci-LLMs into a clear taxonomic framework built on two axes: the scientific language modality and the model architecture type.</p>
<h3 id="scientific-language-modalities">Scientific Language Modalities</h3>
<p>The authors define five categories of Sci-LLMs:</p>
<ol>
<li>
<p><strong>Text-Sci-LLMs</strong>: LLMs trained on scientific textual corpora (medical, biological, chemical, and comprehensive domains). Examples include BioBERT, BioGPT, ChemBERT, SciBERT, and <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a>.</p>
</li>
<li>
<p><strong>Mol-LLMs</strong>: Models that process molecular languages (SMILES, SELFIES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>). These include encoder-only models like <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a> for property prediction, decoder-only models like MolGPT for molecular generation, and encoder-decoder models like Molecular Transformer and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> for reaction prediction.</p>
</li>
<li>
<p><strong>Prot-LLMs</strong>: Models operating on protein amino acid sequences. The ESM series (ESM-1b, ESM-2) and ProtTrans serve as encoders for function and structure prediction, while ProGen and ProtGPT2 generate novel protein sequences.</p>
</li>
<li>
<p><strong>Gene-LLMs</strong>: Models for DNA and RNA sequences, including DNABERT, Nucleotide Transformer, HyenaDNA, and Evo, covering tasks from variant effect prediction to genome-scale sequence modeling.</p>
</li>
<li>
<p><strong>MM-Sci-LLMs</strong>: Multimodal models integrating multiple scientific data types (molecule-text, protein-text, gene-cell-text, molecule-protein), such as MoleculeSTM, <a href="/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/">BioT5</a>, Mol-Instructions, and BioMedGPT.</p>
</li>
</ol>
<h3 id="architecture-classification">Architecture Classification</h3>
<p>For each modality, models are categorized into three architecture types:</p>
<ul>
<li><strong>Encoder-only</strong>: Based on BERT/RoBERTa, these models learn fixed-size representations via masked language modeling. They excel at discriminative tasks like property prediction and classification.</li>
<li><strong>Decoder-only</strong>: Based on GPT, these models perform autoregressive generation. They are used for de novo molecule design, protein sequence generation, and DNA sequence generation.</li>
<li><strong>Encoder-decoder</strong>: Based on architectures like <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a> or BART, these handle sequence-to-sequence tasks such as reaction prediction, molecule captioning, and protein sequence-structure translation.</li>
</ul>
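<p>The practical difference between the first two objectives can be sketched with a toy example. This is illustrative only: the tokenization, mask positions, and <code>[MASK]</code> convention below do not follow any specific model in the survey.</p>

```python
# Toy tokenized SMILES sequence (acetic acid, CC(=O)O)
tokens = ["C", "C", "(", "=", "O", ")", "O"]

# Encoder-only (BERT-style) masked language modeling:
# hide some positions and predict them in place from full context.
masked_positions = [1, 4]
mlm_input = [t if i not in masked_positions else "[MASK]"
             for i, t in enumerate(tokens)]
mlm_targets = {i: tokens[i] for i in masked_positions}

# Decoder-only (GPT-style) autoregressive modeling:
# predict each token from everything strictly to its left.
clm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(mlm_input)    # ['C', '[MASK]', '(', '=', '[MASK]', ')', 'O']
print(mlm_targets)  # {1: 'C', 4: 'O'}
print(clm_pairs[0])  # (['C'], 'C')
```

The masked objective yields bidirectional representations suited to discriminative tasks, while the left-to-right objective directly supports generation.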
<h2 id="comprehensive-catalog-of-models-datasets-and-benchmarks">Comprehensive Catalog of Models, Datasets, and Benchmarks</h2>
<p>A central contribution of the survey is its exhaustive cataloging of resources across all five modalities. The authors compile detailed summary tables covering over 100 Sci-LLMs, their parameter counts, base architectures, training data, and capabilities.</p>
<h3 id="molecular-llms">Molecular LLMs</h3>
<p>The survey documents a rich landscape of Mol-LLMs:</p>
<p><strong>Encoder-only models</strong> for property prediction include <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, ChemBERTa, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, MolFormer, MG-BERT, GROVER, MAT, Uni-Mol, and others. These models are pre-trained on ZINC, PubChem, or ChEMBL datasets and fine-tuned for molecular property prediction tasks on benchmarks like <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</p>
<p><strong>Decoder-only models</strong> for molecular generation include MolGPT, SMILES GPT, iupacGPT, cMolGPT, and Taiga. These generate SMILES strings autoregressively, often combining GPT with reinforcement learning for property optimization.</p>
<p><strong>Encoder-decoder models</strong> for reaction prediction include Molecular Transformer, Retrosynthesis Transformer, Chemformer, <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a>, Graph2SMILES, and MOLGEN. These handle forward reaction prediction and retrosynthesis.</p>
<h3 id="key-datasets-surveyed">Key Datasets Surveyed</h3>
<p>The survey catalogs pre-training datasets and benchmarks for each modality:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Pre-training Sources</th>
          <th>Key Benchmarks</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Text</td>
          <td>PubMed, PMC, arXiv, Semantic Scholar</td>
          <td>MMLU, MedQA, PubMedQA, SciEval</td>
      </tr>
      <tr>
          <td>Molecule</td>
          <td>ZINC, PubChem, ChEMBL, USPTO, <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td>MoleculeNet, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>, SPECTRA</td>
      </tr>
      <tr>
          <td>Protein</td>
          <td>UniRef50/90/100, BFD, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">PDB</a>, <a href="https://en.wikipedia.org/wiki/AlphaFold">AlphaFoldDB</a></td>
          <td><a href="https://en.wikipedia.org/wiki/CASP">CASP</a>, TAPE, ProteinGym, FLIP, PEER</td>
      </tr>
      <tr>
          <td>Genome</td>
          <td>GRCh38, 1000 Genomes, <a href="https://en.wikipedia.org/wiki/ENCODE">ENCODE</a></td>
          <td>NT-Bench, GenBench, BEACON</td>
      </tr>
      <tr>
          <td>Multimodal</td>
          <td>ChEBI-20, PubChemSTM, Mol-Instructions</td>
          <td>Various cross-modal retrieval and generation tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>For molecular generation, the survey details standard metrics:</p>
<ul>
<li><strong>Validity</strong>: percentage of chemically viable molecules</li>
<li><strong>Uniqueness</strong>: fraction of distinct generated structures</li>
<li><strong>Novelty</strong>: fraction not present in the training set</li>
<li><strong>Internal diversity</strong>: measured as</li>
</ul>
<p>$$
\text{IntDiv}_{p}(G) = 1 - \sqrt[p]{\frac{1}{|G|^{2}} \sum_{m_{1}, m_{2} \in G} T(m_{1}, m_{2})^{p}}
$$</p>
<p>where $T(m_{1}, m_{2})$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between molecules $m_{1}$ and $m_{2}$.</p>
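<p>A minimal pure-Python sketch of this metric, assuming binary fingerprints represented as sets of &ldquo;on&rdquo; bit indices (the function names are illustrative; real pipelines compute Tanimoto over RDKit fingerprints):</p>

```python
from itertools import product

def tanimoto(fp1: set, fp2: set) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    represented here as sets of 'on' bit indices."""
    if not fp1 and not fp2:
        return 1.0
    return len(fp1 & fp2) / len(fp1 | fp2)

def internal_diversity(fps: list, p: int = 1) -> float:
    """IntDiv_p over a generated set G: 1 - (mean of T(m1, m2)^p)^(1/p),
    averaging over all |G|^2 ordered pairs (self-pairs included)."""
    n = len(fps)
    mean_sim = sum(tanimoto(a, b) ** p for a, b in product(fps, fps)) / n**2
    return 1.0 - mean_sim ** (1.0 / p)

fps = [{0, 1, 2}, {1, 2, 3}, {4, 5}]
print(round(internal_diversity(fps, p=1), 3))  # 0.556
```

Note that a set of identical molecules gives IntDiv = 0, and increasingly dissimilar sets approach 1.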
<ul>
<li><strong>Frechet ChemNet Distance (FCD)</strong>: comparing distributions of generated and reference molecules</li>
</ul>
<p>$$
\text{FCD}(G, R) = \lVert \mu_{G} - \mu_{R} \rVert^{2} + \text{Tr}\left[\Sigma_{G} + \Sigma_{R} - 2(\Sigma_{G}\Sigma_{R})^{1/2}\right]
$$</p>
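<p>This Fréchet form can be computed directly from the means and covariances of embedding vectors (stand-ins here for ChemNet activations). A sketch using the identity $\text{Tr}[(\Sigma_{G}\Sigma_{R})^{1/2}] = \sum_{i} \sqrt{\lambda_{i}}$, where $\lambda_{i}$ are the eigenvalues of $\Sigma_{G}\Sigma_{R}$; the function name is illustrative:</p>

```python
import numpy as np

def frechet_distance(mu_g, sigma_g, mu_r, sigma_r):
    """Frechet distance between two Gaussians (the form used by FCD/FID):
    ||mu_G - mu_R||^2 + Tr[Sigma_G + Sigma_R - 2 (Sigma_G Sigma_R)^{1/2}]."""
    diff = mu_g - mu_r
    # Eigenvalues of a product of PSD matrices are real and non-negative;
    # clip tiny negative values caused by floating-point noise.
    eigvals = np.linalg.eigvals(sigma_g @ sigma_r)
    covmean_trace = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return diff @ diff + np.trace(sigma_g) + np.trace(sigma_r) - 2 * covmean_trace

mu = np.zeros(4)
sigma = np.eye(4)
print(frechet_distance(mu, sigma, mu, sigma))      # ~0.0 for identical stats
print(frechet_distance(mu + 1.0, sigma, mu, sigma))  # 4.0: mean shift only
```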
<p>For protein generation, analogous metrics include perplexity, Frechet Protein Distance (FPD), foldability (pLDDT), sequence recovery, and novelty (sequence identity).</p>
<h2 id="critical-challenges-and-future-directions">Critical Challenges and Future Directions</h2>
<p>The survey identifies four major challenges and seven future research directions for Sci-LLMs.</p>
<h3 id="challenges">Challenges</h3>
<ol>
<li>
<p><strong>Training data limitations</strong>: Sci-LLM training datasets are orders of magnitude smaller than those for general LLMs. ProGen was trained on 280M protein sequences (tens of billions of tokens), while ChatGPT used approximately 570 billion tokens. Scaling laws suggest larger datasets would improve performance, and advances in sequencing technologies may help close this gap.</p>
</li>
<li>
<p><strong>Architecture mismatch</strong>: Standard Transformer architectures face difficulties with scientific languages. Scientific sequences (proteins with hundreds or thousands of amino acids, DNA with millions of base pairs) are far longer than typical natural language sentences. Additionally, 3D structural information is critical for function prediction but does not naturally map to sequence tokens. Autoregressive generation is also a poor fit since biological sequences function as a whole rather than being read left-to-right.</p>
</li>
<li>
<p><strong>Evaluation gaps</strong>: Computational metrics for generated molecules and proteins provide only indirect quality measures. Wet-lab validation remains the gold standard but is beyond the scope of most AI research teams. Better computational evaluation methods that correlate with experimental outcomes are needed.</p>
</li>
<li>
<p><strong>Ethics</strong>: Sensitive biological data raises privacy concerns. The potential for misuse (e.g., generating harmful substances) requires careful safeguards. Algorithmic bias and equitable access to Sci-LLM benefits also demand attention.</p>
</li>
</ol>
<h3 id="future-directions">Future Directions</h3>
<ol>
<li><strong>Larger-scale, cross-modal training datasets</strong> with strong semantic alignment across modalities</li>
<li><strong>Incorporating 3D structural and temporal information</strong> into language-based modeling, including structural motifs as tokens</li>
<li><strong>Integration with external knowledge sources</strong> such as <a href="https://en.wikipedia.org/wiki/Gene_Ontology">Gene Ontology</a> and chemical knowledge graphs to reduce hallucination</li>
<li><strong>Coupling with physical simulation</strong> (e.g., <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a>) to ground language models in physical reality</li>
<li><strong>Augmenting Sci-LLMs with specialized tools and agents</strong>, following the success of tool-augmented general LLMs like <a href="/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/">ChemCrow</a></li>
<li><strong>Development of computational evaluation metrics</strong> that are both fast and accurate, enabling rapid research iteration</li>
<li><strong>Super-alignment with human ethics</strong>, ensuring ethical reasoning is deeply integrated into Sci-LLM behavior</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a survey paper that does not present new experimental results. The authors catalog extensive datasets across five modalities (see tables in the paper for comprehensive listings). The survey itself is maintained as an open resource.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HICAI-ZJU/Scientific-LLM-Survey">Scientific-LLM-Survey GitHub</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Curated list of papers, models, and resources</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (survey paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Q., Ding, K., Lyv, T., Wang, X., Yin, Q., Zhang, Y., Yu, J., Wang, Y., Li, X., Xiang, Z., Feng, K., Zhuang, X., Wang, Z., Qin, M., Zhang, M., Zhang, J., Cui, J., Huang, T., Yan, P., Xu, R., Chen, H., Li, X., Fan, X., Xing, H., &amp; Chen, H. (2025). Scientific Large Language Models: A Survey on Biological &amp; Chemical Domains. <em>ACM Computing Surveys</em>, 57(6), 1–38. <a href="https://doi.org/10.1145/3715318">https://doi.org/10.1145/3715318</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2025scientific,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scientific Large Language Models: A Survey on Biological \&amp; Chemical Domains}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Qiang and Ding, Keyan and Lyv, Tianwen and Wang, Xinda and Yin, Qingyu and Zhang, Yiwen and Yu, Jing and Wang, Yuhao and Li, Xiaotong and Xiang, Zhuoyi and Feng, Kehua and Zhuang, Xiang and Wang, Zeyuan and Qin, Ming and Zhang, Mengyao and Zhang, Jinlu and Cui, Jiyu and Huang, Tao and Yan, Pengju and Xu, Renjun and Chen, Hongyang and Li, Xiaolin and Fan, Xiaohui and Xing, Huabin and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACM Computing Surveys}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{57}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3715318}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review: Deep Learning for Molecular Design (2019)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/</guid><description>A 2019 review surveying deep generative models for molecular design, covering RNNs, VAEs, GANs, and RL approaches with SMILES and graph representations.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-deep-generative-models-for-molecular-design">A Systematization of Deep Generative Models for Molecular Design</h2>
<p>This is a <strong>Systematization</strong> paper that organizes and compares the rapidly growing literature on deep generative modeling for molecules. Published in 2019, it catalogs 45 papers from the preceding two years, classifying them by architecture (RNNs, VAEs, GANs, reinforcement learning) and molecular representation (SMILES strings, context-free grammars, graph tensors, 3D voxels). The review provides mathematical foundations for each technique, identifies cross-cutting themes, and proposes a framework for reward function design that addresses diversity, novelty, stability, and synthesizability.</p>
<h2 id="the-challenge-of-navigating-vast-chemical-space">The Challenge of Navigating Vast Chemical Space</h2>
<p>The space of potential drug-like molecules has been estimated to contain between $10^{23}$ and $10^{60}$ compounds, while only about $10^{8}$ have ever been synthesized. Traditional approaches to molecular design rely on combinatorial methods, mixing known scaffolds and functional groups, but these generate many unstable or unsynthesizable candidates. High-throughput screening (HTS) and virtual screening (HTVS) help but remain computationally expensive. The average cost to bring a new drug to market exceeds one billion USD, with a 13-year average timeline from discovery to market.</p>
<p>By 2016, <a href="/notes/machine-learning/generative-models/">deep generative models</a> had shown strong results in producing original images, music, and text. The &ldquo;molecular autoencoder&rdquo; of <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016/2018)</a> first applied these techniques to molecular generation, triggering an explosion of follow-up work. By the time of this review, the landscape had grown complex enough, with many architectures, representation schemes, and no agreed-upon benchmarking standards, to warrant systematic organization.</p>
<h2 id="molecular-representations-and-architecture-taxonomy">Molecular Representations and Architecture Taxonomy</h2>
<p>The review&rsquo;s core organizational contribution is a two-axis taxonomy: molecular representations on one axis and deep learning architectures on the other.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The review categorizes representations into 3D and 2D graph-based schemes:</p>
<p><strong>3D representations</strong> include raw voxels (placing nuclear charges on a grid), smoothed voxels (Gaussian blurring around nuclei), and tensor field networks. These capture full geometric information but suffer from high dimensionality, sparsity, and difficulty encoding rotation/translation invariance.</p>
<p><strong>2D graph representations</strong> include:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings</strong>: The dominant representation, encoding molecular graphs as ASCII character sequences via depth-first traversal. Non-unique (each molecule with $N$ heavy atoms has at least $N$ SMILES representations), but invertible and widely supported.</li>
<li><strong>Canonical SMILES</strong>: Unique but potentially encode grammar rules rather than chemical structure.</li>
<li><strong>Context-free grammars (CFGs)</strong>: Decompose SMILES into grammar rules to improve validity rates, though not to 100%.</li>
<li><strong>Tensor representations</strong>: Store atom types in a vertex feature matrix $X \in \mathbb{R}^{N \times |\mathcal{A}|}$ and bond types in an adjacency tensor $A \in \mathbb{R}^{N \times N \times Y}$.</li>
<li><strong>Graph operations</strong>: Directly build molecular graphs by adding atoms and bonds, guaranteeing 100% chemical validity.</li>
</ul>
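<p>The tensor representation can be made concrete with a toy example. The sketch below encodes CO<sub>2</sub> (O=C=O, heavy atoms only) with a deliberately tiny atom alphabet and bond alphabet; real models use larger alphabets and fixed maximum sizes with padding:</p>

```python
import numpy as np

atom_types = ["C", "O"]                      # atom alphabet A
bond_types = ["single", "double", "triple"]  # bond alphabet Y
atoms = ["O", "C", "O"]
bonds = [(0, 1, "double"), (1, 2, "double")]

N = len(atoms)
X = np.zeros((N, len(atom_types)))     # vertex feature matrix, N x |A|
A = np.zeros((N, N, len(bond_types)))  # adjacency tensor, N x N x Y
for i, a in enumerate(atoms):
    X[i, atom_types.index(a)] = 1.0    # one-hot atom type per row
for i, j, b in bonds:
    k = bond_types.index(b)
    A[i, j, k] = A[j, i, k] = 1.0      # undirected graph: symmetric slices

print(X.shape, A.shape)  # (3, 2) (3, 3, 3)
```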
<h3 id="deep-learning-architectures">Deep Learning Architectures</h3>
<p><strong>Recurrent Neural Networks (RNNs)</strong> generate SMILES strings character by character, typically using LSTM or GRU units. Training uses maximum likelihood estimation (MLE) with teacher forcing:</p>
<p>$$
L^{\text{MLE}} = -\sum_{s \in \mathcal{X}} \sum_{t=2}^{T} \log \pi_{\theta}(s_{t} \mid s_{1:t-1})
$$</p>
<p>Thermal rescaling of the output distribution controls the diversity-validity tradeoff via a temperature parameter $T$. RNNs achieved SMILES validity rates of 94-98%.</p>
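<p>Thermal rescaling amounts to dividing the output logits by $T$ before the softmax. A minimal sketch (illustrative logits, not from any trained model):</p>

```python
import numpy as np

def thermal_rescale(logits, T):
    """Softmax of logits / T: T < 1 sharpens the distribution (more
    conservative sampling, higher validity), T > 1 flattens it (more
    diverse but more error-prone sampling)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    print(T, np.round(thermal_rescale(logits, T), 3))
```

As $T \to 0$ sampling approaches greedy argmax decoding; as $T \to \infty$ it approaches uniform sampling over the character vocabulary.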
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$
\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}[q_{\phi}(z|x) \,\|\, p(z)]
$$</p>
<p>The first term encourages accurate reconstruction while the KL divergence term regularizes the latent distribution toward a standard Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$. Variants include <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">grammar VAEs</a> (GVAEs), syntax-directed VAEs, junction tree VAEs, and adversarial autoencoders (AAEs) that replace the KL term with adversarial training.</p>
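<p>For the common diagonal-Gaussian encoder, the KL regularizer has a closed form, $\tfrac{1}{2}\sum_{i}(\sigma_{i}^{2} + \mu_{i}^{2} - 1 - \log \sigma_{i}^{2})$. A sketch (function name illustrative):</p>

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, log_var):
    """Closed-form KL[ N(mu, diag(exp(log_var))) || N(0, I) ], the
    regularization term of the ELBO for a diagonal-Gaussian encoder:
    0.5 * sum(exp(log_var) + mu^2 - 1 - log_var)."""
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

print(kl_diag_gaussian_to_standard([0.0, 0.0], [0.0, 0.0]))  # 0.0: q equals p
print(kl_diag_gaussian_to_standard([1.0, -1.0], [0.0, 0.0]))  # 1.0: mean shift
```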
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> train a generator against a discriminator using the minimax objective:</p>
<p>$$
\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{d}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]
$$</p>
<p>The review shows that with an optimal discriminator, the generator objective reduces to minimizing the Jensen-Shannon divergence, which captures both forward and reverse KL divergence terms. This provides a more &ldquo;balanced&rdquo; training signal than MLE alone. The Wasserstein GAN (WGAN) uses the Earth mover&rsquo;s distance for more stable training:</p>
<p>$$
W(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma} \lVert x - y \rVert
$$</p>
<p><strong>Reinforcement Learning</strong> recasts molecular generation as a sequential decision problem. The policy gradient (REINFORCE) update is:</p>
<p>$$
\nabla J(\theta) = \mathbb{E}\left[G_{t} \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid y_{1:t-1})}{\pi_{\theta}(a_{t} \mid y_{1:t-1})}\right]
$$</p>
<p>To prevent RL fine-tuning from causing the generator to &ldquo;drift&rdquo; away from viable chemical structures, an augmented reward function incorporates the prior likelihood:</p>
<p>$$
R'(S) = [\sigma R(S) + \log P_{\text{prior}}(S) - \log P_{\text{current}}(S)]^{2}
$$</p>
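<p>The augmented objective can be sketched in a few lines. The $\sigma$ value below is an arbitrary illustration (REINVENT-style implementations treat it as a tunable hyperparameter), and the log-likelihoods are made-up numbers:</p>

```python
def augmented_likelihood_loss(log_p_prior, log_p_current, reward, sigma=10.0):
    """Squared-error loss [sigma * R(S) + log P_prior(S) - log P_current(S)]^2:
    pulls the fine-tuned (current) likelihood toward the prior likelihood
    augmented by the scaled reward, preventing the RL-tuned generator from
    drifting away from chemically viable structures."""
    return (sigma * reward + log_p_prior - log_p_current) ** 2

# Current likelihood exactly matches the augmented target -> zero loss
print(augmented_likelihood_loss(-30.0, -22.0, reward=0.8, sigma=10.0))  # 0.0
# Current likelihood unchanged from prior -> loss equals (sigma * R)^2
print(augmented_likelihood_loss(-30.0, -30.0, reward=0.8, sigma=10.0))  # 64.0
```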
<h2 id="cataloging-45-models-and-their-design-choices">Cataloging 45 Models and Their Design Choices</h2>
<p>Rather than running new experiments, the review&rsquo;s methodology involves systematically cataloging and comparing 45 published models. Table 2 in the paper lists each model&rsquo;s architecture, representation, training dataset, and dataset size. Key patterns include:</p>
<ul>
<li><strong>RNN-based models</strong> (16 entries): Almost exclusively use SMILES, trained on ZINC or ChEMBL datasets with 0.1M-1.7M molecules.</li>
<li><strong>VAE variants</strong> (20 entries): The most diverse category, spanning SMILES VAEs, grammar VAEs, junction tree VAEs, graph-based VAEs, and 3D VAEs. Training sets range from 10K to 72M molecules.</li>
<li><strong>GAN models</strong> (7 entries): Include <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, RANC, ATNC, MolGAN, and CycleGAN approaches. Notably, GANs appear to work with fewer training samples.</li>
<li><strong>Other approaches</strong> (2 entries): Pure RL methods from Zhou et al. and Stahl et al. that do not require pretraining on a dataset.</li>
</ul>
<p>The review also catalogs 13 publicly available datasets (Table 3), ranging from <a href="/notes/chemistry/datasets/qm9/">QM9</a> (133K molecules with quantum chemical properties) to <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> (977M combinatorially generated molecules) and ZINC15 (750M+ commercially available compounds).</p>
<h3 id="metrics-and-reward-function-design">Metrics and Reward Function Design</h3>
<p>A significant contribution is the systematic treatment of reward functions. The review argues that generated molecules should satisfy six desiderata: diversity, novelty, stability, synthesizability, non-triviality, and good properties. Key metrics formalized include:</p>
<p><strong>Diversity</strong> using Tanimoto similarity over fingerprints:</p>
<p>$$
r_{\text{diversity}} = 1 - \frac{1}{|\mathcal{G}|^{2}} \sum_{(x_{1}, x_{2}) \in \mathcal{G} \times \mathcal{G}} D(x_{1}, x_{2})
$$</p>
<p><strong>Novelty</strong> measured as the fraction of generated molecules not appearing in a hold-out test set:</p>
<p>$$
r_{\text{novel}} = 1 - \frac{|\mathcal{G} \cap \mathcal{T}|}{|\mathcal{G}|}
$$</p>
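<p>Validity, uniqueness, and novelty are simple set computations once molecules are canonicalized. A sketch with a toy validity predicate (real evaluations parse candidates with a cheminformatics toolkit such as RDKit; the function name and inputs are illustrative):</p>

```python
def generation_metrics(generated, reference, is_valid):
    """Validity, uniqueness, and novelty over a list of generated
    (canonicalized) molecule strings. `is_valid` is a caller-supplied
    predicate; `reference` is the training/hold-out set."""
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(reference)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

gen = ["CCO", "CCO", "CCC", "C(("]   # one duplicate, one malformed string
ref = {"CCC"}
m = generation_metrics(gen, ref, is_valid=lambda s: "((" not in s)
print(m)  # validity 0.75, uniqueness 2/3, novelty 0.5
```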
<p><strong>Synthesizability</strong> primarily assessed via the SA score, sometimes augmented with ring penalties and medicinal chemistry filters.</p>
<p>The review also discusses the <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> as an analog of FID for molecular generation, and notes the emergence of standardized benchmarking platforms including <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, and DiversityNet.</p>
<h2 id="key-findings-and-future-directions">Key Findings and Future Directions</h2>
<p>The review identifies several major trends and conclusions:</p>
<p><strong>Shift from SMILES to graph-based representations.</strong> SMILES-based methods struggle with validity (the molecular autoencoder VAE achieved only 0.7-75% valid SMILES depending on sampling strategy). Methods that work directly on molecular graphs with chemistry-preserving operations achieve 100% validity, and the review predicts this trend will continue.</p>
<p><strong>Advantages of adversarial and RL training over MLE.</strong> The mathematical analysis shows that MLE only optimizes forward KL divergence, which can lead to models that place probability mass where the data distribution is zero. GAN training optimizes the Jensen-Shannon divergence, which balances forward and reverse KL terms. RL approaches, particularly pure RL without pretraining, showed competitive performance with much less training data.</p>
<p><strong>Genetic algorithms remain competitive.</strong> The review notes that the latest genetic algorithm approaches (Grammatical Evolution) could match deep learning methods for molecular optimization under some metrics, and at 100x lower computational cost in some comparisons. This serves as an important baseline calibration.</p>
<p><strong>Reward function design is underappreciated.</strong> Early models generated unstable molecules with labile groups (enamines, hemiaminals, enol ethers). Better reward functions that incorporate synthesizability, diversity, and stability constraints significantly improved practical utility.</p>
<p><strong>Need for standardized benchmarks.</strong> The review identifies a lack of agreement on evaluation methodology as a major barrier to progress, noting that published comparisons are often subtly biased toward novel methods.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from early 2019, the work predates several important developments: transformer-based architectures (which would soon dominate), SELFIES representations, diffusion models for molecules, and large-scale pretrained chemical language models. The review focuses primarily on drug-like small molecules and does not deeply cover protein design or materials optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper that does not present new experimental results. The paper catalogs 13 publicly available datasets used across the reviewed works:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>977M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC15</td>
          <td>750M+</td>
          <td>Commercially available compounds</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td>50M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>2M</td>
          <td>Curated bioactive molecules</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>QM9</td>
          <td>133,885</td>
          <td>Small organic molecules with DFT properties</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>PubChemQC</td>
          <td>3.98M</td>
          <td>PubChem compounds with DFT data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review provides mathematical derivations for MLE training (Eq. 1), VAE ELBO (Eqs. 9-13), AAE objectives (Eqs. 15-16), GAN objectives (Eqs. 19-22), WGAN (Eq. 24), REINFORCE gradient (Eq. 7), and numerous reward function formulations (Eqs. 26-36).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Key evaluation frameworks discussed:</p>
<ul>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (molecular analog of FID)</li>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> benchmarking platform</li>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmarking suite</li>
<li>Validity rate, uniqueness, novelty, and internal diversity metrics</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Elton, D. C., Boukouvalas, Z., Fuge, M. D., &amp; Chung, P. W. (2019). Deep Learning for Molecular Design: A Review of the State of the Art. <em>Molecular Systems Design &amp; Engineering</em>, 4(4), 828-849. <a href="https://doi.org/10.1039/C9ME00039A">https://doi.org/10.1039/C9ME00039A</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{elton2019deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep Learning for Molecular Design -- A Review of the State of the Art}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Elton, Daniel C. and Boukouvalas, Zois and Fuge, Mark D. and Chung, Peter W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Molecular Systems Design \&amp; Engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{828--849}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C9ME00039A}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Inverse Molecular Design with ML Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/</guid><description>Review of inverse molecular design approaches including VAEs, GANs, and RL for navigating chemical space and generating novel molecules with desired properties.</description><content:encoded><![CDATA[<h2 id="a-foundational-systematization-of-inverse-molecular-design">A Foundational Systematization of Inverse Molecular Design</h2>
<p>This paper is a <strong>Systematization</strong> of the nascent field of inverse molecular design using machine learning generative models. Published in <em>Science</em> in 2018, it organizes and contextualizes the rapidly emerging body of work on using deep generative models (variational autoencoders, generative adversarial networks, and reinforcement learning) to navigate chemical space and propose novel molecules with targeted properties. Rather than introducing a new method, the paper synthesizes the conceptual framework connecting molecular representations, generative architectures, and inverse design objectives, establishing a reference point for the field at a critical early stage.</p>
<h2 id="the-challenge-of-navigating-chemical-space">The Challenge of Navigating Chemical Space</h2>
<p>The core problem is the sheer scale of chemical space. For pharmacologically relevant small molecules alone, the number of possible structures is estimated at $10^{60}$. Traditional approaches to materials discovery rely on trial and error or high-throughput virtual screening (HTVS), both of which are fundamentally limited by the need to enumerate and evaluate candidates from a predefined library.</p>
<p>The conventional materials discovery pipeline, from concept to commercial product, historically takes 15 to 20 years, involving iterative cycles of simulation, synthesis, device integration, and characterization. Inverse design offers a conceptual alternative: start from a desired functionality and search for molecular structures that satisfy it. This inverts the standard paradigm where a molecule is proposed first and its properties are computed or measured afterward.</p>
<p>The key distinction the authors draw is between discriminative and generative models. A discriminative model learns $p(y|x)$, the conditional probability of properties $y$ given a molecule $x$. A <a href="/notes/machine-learning/generative-models/">generative model</a> instead learns the joint distribution $p(x,y)$, which can be conditioned to yield either the direct design problem $p(y|x)$ or the inverse design problem $p(x|y)$.</p>
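<p>As a toy numerical illustration of this distinction (the molecules, labels, and probabilities below are invented, not from the paper), a single joint table $p(x,y)$ supports both conditioning directions:</p>

```python
# Toy joint distribution p(x, y) over two molecules and two property labels.
# All numbers are illustrative.
p_xy = {
    ("mol_A", "soluble"): 0.30,
    ("mol_A", "insoluble"): 0.10,
    ("mol_B", "soluble"): 0.15,
    ("mol_B", "insoluble"): 0.45,
}

def condition_on_x(p_xy, x):
    """Direct design direction: p(y | x) = p(x, y) / p(x)."""
    px = sum(v for (xi, _), v in p_xy.items() if xi == x)
    return {y: v / px for (xi, y), v in p_xy.items() if xi == x}

def condition_on_y(p_xy, y):
    """Inverse design direction: p(x | y) = p(x, y) / p(y)."""
    py = sum(v for (_, yi), v in p_xy.items() if yi == y)
    return {x: v / py for (x, yi), v in p_xy.items() if yi == y}

print(condition_on_x(p_xy, "mol_A"))    # properties given a molecule
print(condition_on_y(p_xy, "soluble"))  # molecules given a property
```

<p>Deep generative models play the same role at scale, where the joint over molecules and properties cannot be tabulated and must instead be learned.</p>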
<h2 id="three-pillars-vaes-gans-and-reinforcement-learning">Three Pillars: VAEs, GANs, and Reinforcement Learning</h2>
<p>The review organizes inverse molecular design approaches around three generative paradigms and the molecular representations they operate on.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The paper surveys representations across three broad categories:</p>
<ul>
<li><strong>Discrete (text-based)</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings encode molecular structure as 1D text governed by a formal grammar. Their adoption has been driven by the availability of deep learning tools from NLP.</li>
<li><strong>Continuous (vectors/tensors)</strong>: <a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrices</a>, bag of bonds, fingerprints, symmetry functions, and electronic density representations. These expose different physical symmetries (permutational, rotational, reflectional, translational invariance).</li>
<li><strong>Weighted graphs</strong>: Molecules as undirected graphs where atoms are nodes and bonds are edges, with vectorized features on edges and nodes (bonding type, aromaticity, charge, distance).</li>
</ul>
<p>An ideal representation for inverse design should be invertible, meaning it supports mapping back to a synthesizable molecular structure. SMILES strings and molecular graphs are invertible, while many continuous representations require lookup tables or auxiliary methods.</p>
<h3 id="variational-autoencoders-vaes">Variational Autoencoders (VAEs)</h3>
<p><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAEs</a> encode molecules into a continuous latent space and decode latent vectors back to molecular representations. The key insight is that by constraining the encoder to produce latent vectors following a Gaussian distribution, the model gains the ability to <a href="/posts/modern-variational-autoencoder-in-pytorch/">interpolate between molecules and sample novel structures</a>. The latent space encodes a geometry: nearby points decode to similar molecules, and gradient-based optimization over this continuous space enables direct property optimization.</p>
<p>The VAE training objective, the evidence lower bound (maximized during training), combines a reconstruction term with a KL divergence regularizer:</p>
<p>$$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \,\|\, p(z))$$</p>
<p>where $q(z|x)$ is the encoder (approximate posterior), $p(x|z)$ is the decoder, and $p(z)$ is the prior (typically Gaussian).</p>
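<p>The two terms can be sketched in a few lines of NumPy. This is a minimal illustration assuming a Gaussian encoder and a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to constants); these modeling choices are ours, not prescribed by the review:</p>

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo(x, x_recon, mu, log_var):
    """One-sample evidence lower bound with a unit-variance Gaussian decoder,
    so the reconstruction term is a negative squared error (up to constants)."""
    recon = -0.5 * np.sum((x - x_recon) ** 2)
    return recon - kl_to_standard_normal(mu, log_var)

# A perfect reconstruction with q(z|x) equal to the prior attains the maximum, 0.
x = np.array([0.5, -1.0])
print(elbo(x, x, np.zeros(3), np.zeros(3)))
```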
<p>Semi-supervised variants jointly train on molecules and properties, reorganizing latent space so molecules with similar properties cluster together. <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> demonstrated local and global optimization across generated distributions using Bayesian optimization over latent space.</p>
<p>The review traces the evolution from character-level SMILES VAEs to <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">grammar-aware and syntax-directed variants</a> that improve the generation of syntactically valid structures.</p>
<h3 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p><a href="/posts/what-is-a-gan/">GANs</a> pit a generator against a discriminator in an adversarial training framework. The generator learns to produce synthetic molecules from noise, while the discriminator learns to distinguish synthetic from real molecules. Training convergence for GANs is challenging, suffering from mode collapse and generator-discriminator imbalance.</p>
<p>For molecular applications, the discreteness of SMILES data makes the generator&rsquo;s outputs non-differentiable, a problem addressed through workarounds such as SeqGAN&rsquo;s policy gradient approach and boundary-seeking GANs.</p>
<h3 id="reinforcement-learning-rl">Reinforcement Learning (RL)</h3>
<p>RL treats molecule generation as a sequential decision process where an agent (the generator) takes actions (adding characters to a SMILES string) to maximize a reward (desired molecular properties). Since rewards can only be assigned after sequence completion, Monte Carlo Tree Search (MCTS) is used to simulate possible completions and weight paths based on their success.</p>
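<p>The rollout idea can be sketched with a toy vocabulary and reward (both invented for illustration; a real system would score molecular properties of completed SMILES strings, and full MCTS additionally maintains tree statistics on top of these plain rollouts):</p>

```python
import random

VOCAB = ["C", "O", "N"]
MAX_LEN = 6

def reward(seq):
    """Toy terminal reward: fraction of carbons in the finished string.
    A real system would score molecular properties instead."""
    return seq.count("C") / len(seq)

def rollout_value(prefix, n_rollouts=200, rng=random.Random(0)):
    """Estimate the value of a partial sequence by averaging the rewards
    of random completions (Monte Carlo rollouts)."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = prefix
        while len(seq) < MAX_LEN:
            seq += rng.choice(VOCAB)
        total += reward(seq)
    return total / n_rollouts

# Score each candidate first character by the estimated value of its rollouts.
values = {c: rollout_value(c) for c in VOCAB}
best = max(values, key=values.get)
print(values, "-> pick", best)
```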
<p>Applications include generation of drug-like molecules and <a href="https://en.wikipedia.org/wiki/Retrosynthesis">retrosynthesis</a> planning. Notable examples cited include RL for optimizing putative <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> inhibitors and molecules active against <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2</a>.</p>
<h3 id="hybrid-approaches">Hybrid Approaches</h3>
<p>The review highlights that these paradigms are not exclusive. Examples include druGAN (adversarial autoencoder) and <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a> (combined GAN and RL), which leverage strengths of multiple frameworks.</p>
<h2 id="survey-of-applications-and-design-paradigms">Survey of Applications and Design Paradigms</h2>
<p>Being a review paper, this work does not present new experiments but surveys existing applications across domains:</p>
<p><strong>Drug Discovery</strong>: Most generative model applications at the time of writing targeted pharmaceutical properties, including solubility, melting temperature, synthesizability, and target activity. Popova et al. optimized for JAK2 inhibitors, and Olivecrona et al. targeted dopamine receptor type 2.</p>
<p><strong>Materials Science</strong>: HTVS had been applied to organic photovoltaics (screening by frontier orbital energies and conversion efficiency), organic redox flow batteries (redox potential and solubility), organic LEDs (singlet-triplet gap), and inorganic materials via the Materials Project.</p>
<p><strong>Chemical Space Exploration</strong>: Evolution strategies had been applied to map chemical space, with structured search procedures incorporating genotype representations and mutation operations. Bayesian sampling with sequential Monte Carlo and gradient-based optimization of properties with respect to molecular systems represented alternative inverse design strategies.</p>
<p><strong>Graph-Based Generation</strong>: The paper notes the emerging extension of VAEs to molecular graphs (junction tree VAE) and message passing networks for incremental graph construction, though the graph isomorphism approximation problem remained a practical challenge.</p>
<h2 id="future-directions-and-open-challenges">Future Directions and Open Challenges</h2>
<p>The authors identify several open directions for the field:</p>
<p><strong>Closed-Loop Discovery</strong>: The ultimate goal is to concurrently propose, create, and characterize new materials with simultaneous data flow between components. At the time of writing, very few examples of successful closed-loop approaches existed.</p>
<p><strong>Active Learning</strong>: Combining inverse design with Bayesian optimization enables models that adapt as they explore chemical space, expanding in regions of high uncertainty and discovering molecular regions with desirable properties as a function of composition.</p>
<p><strong>Representation Learning</strong>: No single molecular representation works optimally for all properties. Graph and hierarchical representations were identified as areas needing further study. Representations that encode relevant physics tend to generalize better.</p>
<p><strong>Improved Architectures</strong>: Memory-augmented sequence generation models, Riemannian optimization methods exploiting latent space geometry, multi-level VAEs for structured latent spaces, and inverse RL for learning reward functions were highlighted as promising research directions.</p>
<p><strong>Integration into Education</strong>: The authors advocate for integrating ML into curricula across chemical, biochemical, medicinal, and materials sciences.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from 2018, this work captures the field at an early stage. Several limitations are worth noting:</p>
<ul>
<li>The survey is dominated by SMILES-based approaches, reflecting the state of the field at the time. Graph-based and 3D-aware generative models were just emerging.</li>
<li>Quantitative benchmarking of generative models was not yet standardized. The review does not provide systematic comparisons across methods.</li>
<li>The synthesis feasibility of generated molecules receives limited attention. The gap between computationally generated candidates and experimentally realizable molecules was (and remains) a significant challenge.</li>
<li>Transformer-based architectures, which would come to dominate chemical language modeling, are not discussed, as the Transformer had only been published the year prior.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a review/perspective paper, this work does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the cited primary works rather than the review itself.</p>
<h3 id="key-cited-methods-and-their-resources">Key Cited Methods and Their Resources</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Authors</th>
          <th>Type</th>
          <th>Availability</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Automatic Chemical Design (VAE)</a></td>
          <td>Gomez-Bombarelli et al.</td>
          <td>Code + Data</td>
          <td>Published in ACS Central Science</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></td>
          <td>Kusner et al.</td>
          <td>Code</td>
          <td>arXiv:1703.01925</td>
      </tr>
      <tr>
          <td>Junction Tree VAE</td>
          <td>Jin et al.</td>
          <td>Code</td>
          <td>arXiv:1802.04364</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a></td>
          <td>Sanchez-Lengeling et al.</td>
          <td>Code</td>
          <td>ChemRxiv preprint</td>
      </tr>
      <tr>
          <td>SeqGAN</td>
          <td>Yu et al.</td>
          <td>Code</td>
          <td>AAAI 2017</td>
      </tr>
      <tr>
          <td>Neural Message Passing</td>
          <td>Gilmer et al.</td>
          <td>Code</td>
          <td>arXiv:1704.01212</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sánchez-Lengeling, B., &amp; Aspuru-Guzik, A. (2018). Inverse molecular design using machine learning: Generative models for matter engineering. <em>Science</em>, 361(6400), 360-365. <a href="https://doi.org/10.1126/science.aat2663">https://doi.org/10.1126/science.aat2663</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sanchez-lengeling2018inverse,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inverse molecular design using machine learning: Generative models for matter engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{S{\&#39;a}nchez-Lengeling, Benjamin and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{361}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6400}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{360--365}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1126/science.aat2663}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Generative AI Survey for De Novo Molecule and Protein Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/generative-ai-drug-design-survey/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/generative-ai-drug-design-survey/</guid><description>Comprehensive survey of generative AI for de novo drug design covering molecule and protein generation with VAEs, GANs, diffusion, and flow models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-generative-ai-for-drug-design">A Systematization of Generative AI for Drug Design</h2>
<p>This is a <strong>Systematization</strong> paper that provides a broad survey of generative AI methods applied to de novo drug design. The survey organizes the field into two overarching themes: small molecule generation and protein generation. Within each theme, the authors identify subtasks, catalog datasets and benchmarks, describe model architectures, and compare the performance of leading methods using standardized metrics. The paper covers over 200 references and provides 12 comparative benchmark tables.</p>
<p>The primary contribution is a unified organizational framework that allows both micro-level comparisons within each subtask and macro-level observations across the two application domains. The authors highlight parallel developments in both fields, particularly the shift from sequence-based to structure-based approaches and the growing dominance of diffusion models.</p>
<h2 id="the-challenge-of-navigating-de-novo-drug-design">The Challenge of Navigating De Novo Drug Design</h2>
<p>The drug design process requires creating ligands that interact with specific biological targets. These range from small molecules (tens of atoms) to large proteins (such as monoclonal antibodies). Traditional discovery methods are slow and expensive, with preclinical trials costing hundreds of millions of dollars and taking 3-6 years. The chemical space of potential drug-like compounds is estimated at $10^{23}$ to $10^{60}$, making brute-force exploration infeasible.</p>
<p>AI-driven generative methods have gained traction in recent years, with over 150 AI-focused biotech companies initiating small-molecule drugs in the discovery phase and 15 in clinical trials. AI-fueled drug design activity has expanded by almost 40% each year.</p>
<p>The rapid development of the field, combined with its inherent complexity, creates barriers for new researchers. Several prior surveys exist, but they focus on specific aspects: molecule generation, protein generation, antibody generation, or specific model architectures like diffusion models. This survey takes a broader approach, covering both molecule and protein generation under a single organizational framework.</p>
<h2 id="unified-taxonomy-two-themes-seven-subtasks">Unified Taxonomy: Two Themes, Seven Subtasks</h2>
<p>The survey&rsquo;s core organizational insight is structuring de novo drug design into two themes with distinct subtasks, while identifying common architectural patterns across them.</p>
<h3 id="generative-model-architectures">Generative Model Architectures</h3>
<p>The survey covers four main generative model families used across both molecule and protein generation:</p>
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode inputs into a latent distribution and decode from sampled points. The encoder maps input $x$ to a distribution parameterized by mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. Training minimizes reconstruction loss plus KL divergence:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$$</p>
<p>where the KL loss is:</p>
<p>$$\mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k} \left(1 + \log(\sigma_k^{(i)2}) - \mu_k^{(i)2} - \sigma_k^{(i)2}\right)$$</p>
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> use a generator-discriminator game. The generator $G$ creates instances from random noise $z$ sampled from a prior $p_z(z)$, while the discriminator $D$ distinguishes real from synthetic data:</p>
<p>$$\min_{G} \max_{D} \mathbb{E}_x[\log D(x; \theta_d)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$$</p>
<p><strong>Flow-Based Models</strong> generate data by applying an invertible function $f: z_0 \mapsto x$ to transform a simple latent distribution (Gaussian) to the target distribution. The log-likelihood is computed using the change-of-variable formula:</p>
<p>$$\log p(x) = \log p_0(z) - \log \left| \det \frac{\partial f}{\partial z} \right|, \qquad z = f^{-1}(x)$$</p>
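<p>A minimal sketch of this change-of-variable computation for an elementwise affine flow (the flow and its parameters are invented for illustration; the diagonal Jacobian makes the log-determinant a simple sum):</p>

```python
import numpy as np

def affine_flow_logp(x, scale, shift):
    """Log-density of x under the flow f(z) = scale * z + shift applied to a
    standard normal base. The Jacobian is diagonal, so its log-determinant
    is sum(log|scale|), and p(x) = p0(z) / |det df/dz| with z = f^{-1}(x)."""
    z = (x - shift) / scale
    log_p0 = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))
    log_det = np.sum(np.log(np.abs(scale)))
    return log_p0 - log_det

# Illustrative parameters: this flow yields independent N(shift, scale^2) marginals.
x = np.array([1.0, -0.5])
scale = np.array([2.0, 0.5])
shift = np.array([0.0, 1.0])
print(affine_flow_logp(x, scale, shift))
```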
<p><strong>Diffusion Models</strong> gradually add Gaussian noise over $T$ steps in a forward process and learn to reverse the noising via a denoising neural network. The forward step is:</p>
<p>$$x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$</p>
<p>The training loss minimizes the difference between the true noise and the predicted noise:</p>
<p>$$L_t = \mathbb{E}_{t \sim [1,T], x_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2 \right]$$</p>
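<p>The forward process and noise-prediction objective can be sketched in a few lines of NumPy (the noise schedule and point-cloud &ldquo;molecule&rdquo; here are invented for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_t, beta_t, rng):
    """One forward (noising) step:
    x_{t+1} = sqrt(1 - beta_t) * x_t + sqrt(beta_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x_t.shape)
    return np.sqrt(1.0 - beta_t) * x_t + np.sqrt(beta_t) * eps, eps

def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between true and predicted noise (the L_t objective)."""
    return np.mean((eps_true - eps_pred) ** 2)

# Repeated noising drives a toy "molecule" (a 5-atom point cloud in 3D)
# toward an isotropic Gaussian.
x = np.ones((5, 3))
for beta in np.linspace(1e-4, 0.5, 50):
    x, eps = forward_step(x, beta, rng)

# A perfect noise predictor would achieve zero loss on the last step.
print(noise_prediction_loss(eps, np.zeros_like(eps)))
```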
<p>Graph neural networks (GNNs), particularly equivariant GNNs (EGNNs), are commonly paired with these generative methods to handle 2D/3D molecular and protein inputs. Diffusion and flow-based models are often paired with GNNs for processing 2D/3D-based input, while VAEs and GANs are typically used for 1D input.</p>
<h2 id="small-molecule-generation-tasks-datasets-and-models">Small Molecule Generation: Tasks, Datasets, and Models</h2>
<h3 id="target-agnostic-molecule-design">Target-Agnostic Molecule Design</h3>
<p>The goal is to generate a set of novel, valid, and stable molecules without conditioning on any specific biological target. Models are evaluated on atom stability, molecule stability, validity, uniqueness, novelty, and QED (Quantitative Estimate of Drug-Likeness).</p>
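<p>Uniqueness and novelty reduce to set arithmetic over generated strings; the validity check below is a placeholder for illustration (in practice a cheminformatics toolkit such as RDKit attempts to parse each candidate):</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity over all samples, uniqueness among valid samples, and
    novelty (valid, unique strings absent from the training set)."""
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

# Placeholder validity check and tiny sample lists, for illustration only.
gen = ["CCO", "CCO", "CCN", "??", "CCC"]
train = {"CCC"}
print(generation_metrics(gen, train, lambda s: "?" not in s))
```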
<p><strong>Datasets</strong>: <a href="/notes/chemistry/datasets/qm9/">QM9</a> (small stable molecules from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>) and <a href="/notes/chemistry/datasets/geom/">GEOM</a>-Drug (more complex, drug-like molecules).</p>
<p>The field has shifted from SMILES-based VAEs (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GVAE</a>, SD-VAE) to 2D graph methods (JTVAE) and then to 3D diffusion-based models. Current leading methods on QM9:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>EGNN, Diffusion</td>
          <td>99.8</td>
          <td>97.5</td>
          <td>97.9</td>
          <td>97.6</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>EGNN, VAE, Diffusion</td>
          <td>99.2</td>
          <td>89.6</td>
          <td>98.6</td>
          <td>94.6</td>
      </tr>
      <tr>
          <td>JODO</td>
          <td>EGNN, Diffusion</td>
          <td>99.2</td>
          <td>93.4</td>
          <td>99.0</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>VAE, Diffusion</td>
          <td>98.9</td>
          <td>89.4</td>
          <td>93.8</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>EGNN, Diffusion</td>
          <td>98.7</td>
          <td>82.0</td>
          <td>91.9</td>
          <td>90.7</td>
      </tr>
  </tbody>
</table>
<p>EDM provided an initial baseline using diffusion with an equivariant GNN. GCDM introduced attention-based geometric message-passing. MDM separately handles covalent bond edges and van der Waals forces, and also addresses diversity through an additional distribution-controlling noise variable. GeoLDM maps molecules to a lower-dimensional latent space for more efficient diffusion. MiDi uses a &ldquo;relaxed&rdquo; EGNN and jointly models 2D and 3D information through a graph representation capturing both spatial and connectivity data.</p>
<p>On the larger GEOM-Drugs dataset, performance drops for most models:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>99.8</td>
          <td>91.6</td>
          <td>77.8</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>&ndash;</td>
          <td>62.2</td>
          <td>99.5</td>
          <td>99.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>84.4</td>
          <td>&ndash;</td>
          <td>99.3</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>81.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<p>MiDi distinguishes itself by generating more stable complex molecules, though at the expense of validity. Models generally perform well on QM9 but show room for improvement on the more complex GEOM-Drugs molecules.</p>
<h3 id="target-aware-molecule-design">Target-Aware Molecule Design</h3>
<p>Target-aware generation produces molecules for specific protein targets, using either ligand-based (LBDD) or structure-based (SBDD) approaches. SBDD methods have become more prevalent as protein structure information becomes increasingly available.</p>
<p><strong>Datasets</strong>: CrossDocked2020 (22.5M ligand-protein pairs), ZINC20, Binding MOAD.</p>
<p><strong>Metrics</strong>: Vina Score (docking energy), High Affinity Percentage, QED, SA Score (synthetic accessibility), Diversity (Tanimoto similarity).</p>
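<p>Diversity as one minus the mean pairwise Tanimoto similarity can be computed directly on bit-set fingerprints (the fingerprints below are stand-ins; real pipelines typically use hashed circular fingerprints from a cheminformatics toolkit):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two bit-set fingerprints."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity over a molecule set."""
    pairs = [(a, b) for i, a in enumerate(fingerprints)
             for b in fingerprints[i + 1:]]
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Stand-in fingerprints: sets of "on" bit indices.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(diversity(fps))
```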
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Vina</th>
          <th>Affinity (%)</th>
          <th>QED</th>
          <th>SA</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffSBDD</td>
          <td>EGNN, Diffusion</td>
          <td>-7.333</td>
          <td>&ndash;</td>
          <td>0.467</td>
          <td>0.554</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Luo et al.</td>
          <td>SchNet</td>
          <td>-6.344</td>
          <td>29.09</td>
          <td>0.525</td>
          <td>0.657</td>
          <td>0.720</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>EGNN, Diffusion</td>
          <td>-6.3</td>
          <td>58.1</td>
          <td>0.48</td>
          <td>0.58</td>
          <td>0.72</td>
      </tr>
      <tr>
          <td>LiGAN</td>
          <td>CNN, VAE</td>
          <td>-6.144</td>
          <td>21.1</td>
          <td>0.39</td>
          <td>0.59</td>
          <td>0.66</td>
      </tr>
      <tr>
          <td>Pocket2Mol</td>
          <td>EGNN, MLP</td>
          <td>-5.14</td>
          <td>48.4</td>
          <td>0.56</td>
          <td>0.74</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugGPT is an LBDD autoregressive model using transformers on tokenized protein-ligand pairs. Among the SBDD models, LiGAN introduces a 3D CNN-VAE framework, Pocket2Mol emphasizes binding pocket geometry using an EGNN with geometric vector MLP layers, and Luo et al. model atomic probabilities in the binding site using SchNet. TargetDiff performs diffusion on an EGNN and optimizes binding affinity by reflecting low atom type entropy. DiffSBDD applies an inpainting approach by masking and replacing segments of ligand-protein complexes. DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high affinity. Interestingly, diffusion-based methods are outperformed by Pocket2Mol on drug-likeness metrics (QED and SA).</p>
<h3 id="molecular-conformation-generation">Molecular Conformation Generation</h3>
<p>Conformation generation involves producing 3D structures from 2D connectivity graphs. Models are evaluated on Coverage (COV, percentage of ground-truth conformations &ldquo;covered&rdquo; within an RMSD threshold) and Matching (MAT, average RMSD to closest ground-truth conformation).</p>
<p><strong>Datasets</strong>: GEOM-QM9, GEOM-Drugs, ISO17.</p>
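<p>Given a matrix of RMSDs between generated and reference conformers, COV and MAT are simple reductions. A sketch of the recall-style variant (exact conventions and thresholds vary across papers):</p>

```python
import numpy as np

def cov_mat(rmsd, threshold=1.25):
    """COV and MAT from an (n_generated, n_reference) RMSD matrix.
    COV: fraction of reference conformers within `threshold` of some
    generated conformer. MAT: mean over references of the minimum
    RMSD to any generated conformer."""
    min_per_ref = rmsd.min(axis=0)
    cov = float(np.mean(min_per_ref <= threshold))
    mat = float(np.mean(min_per_ref))
    return cov, mat

# 3 generated conformers scored against 2 reference conformers (toy values).
rmsd = np.array([[0.5, 2.0],
                 [1.5, 1.0],
                 [3.0, 2.5]])
cov, mat = cov_mat(rmsd)
print(cov, mat)
```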
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>GEOM-QM9 COV (%)</th>
          <th>GEOM-QM9 MAT</th>
          <th>GEOM-Drugs COV (%)</th>
          <th>GEOM-Drugs MAT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Torsional Diff.</td>
          <td>Diffusion</td>
          <td>92.8</td>
          <td>0.178</td>
          <td>72.7*</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>DGSM</td>
          <td>MPNN, Diffusion</td>
          <td>91.49</td>
          <td>0.2139</td>
          <td>78.73</td>
          <td>1.0154</td>
      </tr>
      <tr>
          <td>GeoDiff</td>
          <td>GFN, Diffusion</td>
          <td>90.07</td>
          <td>0.209</td>
          <td>89.13</td>
          <td>0.8629</td>
      </tr>
      <tr>
          <td>ConfGF</td>
          <td>GIN, Diffusion</td>
          <td>88.49</td>
          <td>0.2673</td>
          <td>62.15</td>
          <td>1.1629</td>
      </tr>
      <tr>
          <td>GeoMol</td>
          <td>MPNN</td>
          <td>71.26</td>
          <td>0.3731</td>
          <td>67.16</td>
          <td>1.0875</td>
      </tr>
  </tbody>
</table>
<p>*Torsional Diffusion uses a 0.75 Å threshold instead of the standard 1.25 Å for GEOM-Drugs coverage, leading to a deflated score. It outperforms GeoDiff and GeoMol when evaluated at the same threshold.</p>
<p>Torsional Diffusion operates in the space of torsion angles rather than Cartesian coordinates, allowing for improved representation and fewer denoising steps. GeoDiff uses Euclidean-space diffusion, treating each atom as a particle and incorporating Markov kernels that preserve E(3) equivariance through a graph field network (GFN) layer.</p>
<h2 id="protein-generation-from-sequence-to-structure">Protein Generation: From Sequence to Structure</h2>
<h3 id="protein-representation-learning">Protein Representation Learning</h3>
<p>Representation learning creates embeddings for protein inputs to support downstream tasks. Models are evaluated on contact prediction, fold classification (at family, superfamily, and fold levels), and stability prediction (Spearman&rsquo;s $\rho$).</p>
<p>Key models include: UniRep (mLSTM RNN), ProtBERT (BERT applied to amino acid sequences), ESM-1B (33-layer, 650M parameter transformer), MSA Transformer (pre-trained on MSA input), and GearNET (Geo-EGNN using 3D structure with directed edges). OntoProtein and KeAP incorporate knowledge graphs for direct knowledge injection.</p>
<h3 id="protein-structure-prediction">Protein Structure Prediction</h3>
<p>Given an amino acid sequence, models predict 3D point coordinates for each residue. Evaluated using RMSD, GDT-TS, TM-score, and LDDT on CASP14 and CAMEO benchmarks.</p>
<p>AlphaFold2 is the landmark model, integrating MSA and pair representations through transformers with invariant point attention (IPA). ESMFold uses ESM-2 language model representations instead of MSAs, achieving faster processing. RoseTTAFold uses a three-track neural network learning from 1D sequence, 2D distance map, and 3D backbone coordinate information simultaneously. EigenFold uses diffusion, representing the protein as a system of harmonic oscillators.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>CAMEO RMSD</th>
          <th>CAMEO TMScore</th>
          <th>CAMEO GDT-TS</th>
          <th>CAMEO lDDT</th>
          <th>CASP14 TMScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AlphaFold2</td>
          <td>Transformer</td>
          <td>3.30</td>
          <td>0.87</td>
          <td>0.86</td>
          <td>0.90</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>ESMFold</td>
          <td>Transformer</td>
          <td>3.99</td>
          <td>0.85</td>
          <td>0.83</td>
          <td>0.87</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>RoseTTAFold</td>
          <td>Transformer</td>
          <td>5.72</td>
          <td>0.77</td>
          <td>0.71</td>
          <td>0.79</td>
          <td>0.37</td>
      </tr>
      <tr>
          <td>EigenFold</td>
          <td>Diffusion</td>
          <td>7.37</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.78</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="sequence-generation-inverse-folding">Sequence Generation (Inverse Folding)</h3>
<p>Given a fixed protein backbone structure, models generate amino acid sequences that will fold into that structure. The space of candidate sequences is vast, estimated at between $10^{65}$ and $10^{130}$ possibilities.</p>
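That range corresponds to proteins of roughly 50-100 residues drawn from the 20 standard amino acids, which is easy to verify:

```python
import math

# 20 standard amino acids -> 20**L candidate sequences of length L.
def log10_sequence_space(length, alphabet_size=20):
    """Base-10 logarithm of the number of possible sequences."""
    return length * math.log10(alphabet_size)

print(round(log10_sequence_space(50)))   # 65  -> ~10^65 sequences
print(round(log10_sequence_space(100)))  # 130 -> ~10^130 sequences
```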
<p>Evaluated using Amino Acid Recovery (AAR), diversity, RMSD, nonpolar loss, and perplexity (PPL):</p>
<p>$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_1, x_2, \ldots, x_{i-1})\right)$$</p>
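As a sanity check on the metric, PPL is the exponential of the negative mean log-likelihood over the sequence; a minimal computation from per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities P(x_i | x_1, ..., x_{i-1})."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model assigning every residue probability 1/20 (uniform over the
# 20 amino acids) has perplexity exactly 20.
print(perplexity([0.05] * 8))  # 20.0 (up to floating point)
```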
<p>ProteinMPNN is the current top performer, generating the most accurate sequences and leading in AAR, RMSD, and nonpolar loss. It uses a message-passing neural network with a flexible, order-agnostic autoregressive approach.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>AAR (%)</th>
          <th>Div.</th>
          <th>RMSD</th>
          <th>Non.</th>
          <th>Time (s)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ProteinMPNN</td>
          <td>MPNN</td>
          <td>48.7</td>
          <td>0.168</td>
          <td>1.019</td>
          <td>1.061</td>
          <td>112</td>
      </tr>
      <tr>
          <td>ESM-IF1</td>
          <td>Transformer</td>
          <td>47.7</td>
          <td>0.184</td>
          <td>1.265</td>
          <td>1.201</td>
          <td>1980</td>
      </tr>
      <tr>
          <td>GPD</td>
          <td>Transformer</td>
          <td>46.2</td>
          <td>0.219</td>
          <td>1.758</td>
          <td>1.333</td>
          <td>35</td>
      </tr>
      <tr>
          <td>ABACUS-R</td>
          <td>Transformer</td>
          <td>45.7</td>
          <td>0.124</td>
          <td>1.482</td>
          <td>0.968</td>
          <td>233280</td>
      </tr>
      <tr>
          <td>3D CNN</td>
          <td>CNN</td>
          <td>44.5</td>
          <td>0.272</td>
          <td>1.62</td>
          <td>1.027</td>
          <td>536544</td>
      </tr>
      <tr>
          <td>PiFold</td>
          <td>GNN</td>
          <td>42.8</td>
          <td>0.141</td>
          <td>1.592</td>
          <td>1.464</td>
          <td>221</td>
      </tr>
      <tr>
          <td>ProteinSolver</td>
          <td>GNN</td>
          <td>24.6</td>
          <td>0.186</td>
          <td>5.354</td>
          <td>1.389</td>
          <td>180</td>
      </tr>
  </tbody>
</table>
<p>Results are from the independent benchmark by Yu et al. GPD remains the fastest method, generating sequences around three times faster than ProteinMPNN. Current SOTA models recover fewer than half of target amino acid residues, indicating room for improvement.</p>
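AAR itself is straightforward: the percentage of positions at which the designed sequence reproduces the native residue. A toy sketch (the sequences below are illustrative, not from any benchmark):

```python
def amino_acid_recovery(designed, native):
    """AAR: percentage of positions where the designed sequence
    matches the native residue."""
    assert len(designed) == len(native)
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)

print(amino_acid_recovery("MKTAYIA", "MKTAYIA"))  # 100.0
print(amino_acid_recovery("MKTAYIA", "MKSAYLA"))  # 5/7 positions match
```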
<h3 id="backbone-design">Backbone Design</h3>
<p>Backbone design creates protein structures from scratch, representing the core of de novo protein design. Models generate coordinates for backbone atoms (nitrogen, alpha-carbon, carbonyl, oxygen) and use external tools like Rosetta for side-chain packing.</p>
<p>Two evaluation paradigms exist: context-free generation (evaluated by self-consistency TM, or scTM) and context-given generation (inpainting, evaluated by AAR, PPL, RMSD).</p>
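The TM-score underlying scTM can be sketched for an already-superposed residue pairing (the full metric also searches over superpositions); the $d_0$ length normalization below is the standard one:

```python
def tm_score(deviations, length):
    """TM-score for superposed structures, given per-residue C-alpha
    deviations (Angstroms) and the target length L for normalization."""
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8  # standard length scaling
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in deviations) / length

print(tm_score([0.0] * 100, 100))  # 1.0 for a perfect match
```

Unlike RMSD, large per-residue deviations saturate rather than dominate, which is why TM-score is preferred for judging global fold similarity.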
<p>ProtDiff represents residues as 3D Cartesian coordinates and uses particle-filtering diffusion. FoldingDiff instead uses an angular representation (six angles per residue) with a BERT-based DDPM. LatentDiff embeds proteins into a latent space using an equivariant autoencoder, then applies equivariant diffusion, analogous to GeoLDM for molecules. These early models work well for short proteins (up to 128 residues) but struggle with longer structures.</p>
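FoldingDiff's angular setup can be illustrated with the standard DDPM forward process plus angle wrapping; the schedule, shapes, and seed below are placeholders, not the paper's values:

```python
import numpy as np

def wrap(a):
    """Wrap angles into [-pi, pi)."""
    return (a + np.pi) % (2.0 * np.pi) - np.pi

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I),
    then wrap, since each residue is represented by angles."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return wrap(xt)

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear beta schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.uniform(-np.pi, np.pi, size=(128, 6))  # six angles per residue
xt = forward_noise(x0, T - 1, alpha_bar, rng)   # near-pure wrapped noise
```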
<p>Frame-based methods address this scaling limitation. Genie uses Frenet-Serret frames with paired residue representations and IPA for noise prediction. FrameDiff parameterizes backbone structures on the $SE(3)^N$ manifold of frames using a score-based generative model. RFDiffusion is the current leading model, combining RoseTTAFold structure prediction with diffusion. It fine-tunes RoseTTAFold weights on a masked input sequence and random noise coordinates, using &ldquo;self-conditioning&rdquo; on predicted structures. Protpardelle co-designs sequence and structure by creating a &ldquo;superposition&rdquo; over possible sidechain states and collapsing them during each iterative diffusion step.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>scTM (%)</th>
          <th>Design. (%)</th>
          <th>PPL</th>
          <th>AAR (%)</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RFDiffusion</td>
          <td>Diffusion</td>
          <td>&ndash;</td>
          <td>95.1</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Protpardelle</td>
          <td>Diffusion</td>
          <td>85</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FrameDiff</td>
          <td>Diffusion</td>
          <td>84</td>
          <td>48.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Genie</td>
          <td>Diffusion</td>
          <td>81.5</td>
          <td>79.0</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>LatentDiff</td>
          <td>EGNN, Diffusion</td>
          <td>31.6</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FoldingDiff</td>
          <td>Diffusion</td>
          <td>14.2</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>ProtDiff</td>
          <td>EGNN, Diffusion</td>
          <td>11.8</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>12.47*</td>
          <td>8.01*</td>
      </tr>
  </tbody>
</table>
<p>*ProtDiff context-given results are tested only on beta-lactamase metalloproteins from PDB.</p>
<h3 id="antibody-design">Antibody Design</h3>
<p>The survey covers antibody structure prediction, representation learning, and CDR-H3 generation. Antibodies are Y-shaped proteins with complementarity-determining regions (CDRs), where CDR-H3 is the most variable and functionally important region.</p>
<p>For CDR-H3 generation, models have progressed from sequence-based (LSTM) to structure-based (RefineGNN) and sequence-structure co-design approaches (MEAN, AntiDesigner, DiffAb). dyMEAN is the current leading model, providing an end-to-end method incorporating structure prediction, docking, and CDR generation into a single framework. Because CDRs are highly variable, informative MSAs cannot be constructed for antibody inputs, which makes general models like AlphaFold2 less effective for antibody prediction. Specialized models like IgFold use sequence embeddings from AntiBERTy with invariant point attention to achieve faster antibody structure prediction.</p>
<h3 id="peptide-design">Peptide Design</h3>
<p>The survey briefly covers peptide generation, including models for therapeutic peptide generation (MMCD), peptide-protein interaction prediction (PepGB), peptide representation learning (PepHarmony), peptide sequencing (AdaNovo), and signal peptide prediction (PEFT-SP).</p>
<h2 id="current-trends-challenges-and-future-directions">Current Trends, Challenges, and Future Directions</h2>
<h3 id="current-trends">Current Trends</h3>
<p>The survey identifies several parallel trends across molecule and protein generation:</p>
<ol>
<li>
<p><strong>Shift from sequence to structure</strong>: In molecule generation, graph-based diffusion models (GeoLDM, MiDi, TargetDiff) now dominate. In protein generation, structure-based representation learning (GearNET) and diffusion-based backbone design (RFDiffusion) have overtaken sequence-only methods.</p>
</li>
<li>
<p><strong>Dominance of E(3) equivariant architectures</strong>: EGNNs appear across nearly all subtasks, reflecting the physical requirement that molecular and protein properties should be invariant to rotation and translation.</p>
</li>
<li>
<p><strong>Structure-based over ligand-based approaches</strong>: In target-aware molecule design, SBDD methods that use 3D protein structures demonstrate clear advantages over LBDD approaches that operate on amino acid sequences alone.</p>
</li>
</ol>
<h3 id="challenges">Challenges</h3>
<p><strong>For small molecule generation:</strong></p>
<ul>
<li><strong>Complexity</strong>: Models perform well on simple QM9 but struggle with complex GEOM-Drugs molecules.</li>
<li><strong>Applicability</strong>: Generating molecules with high binding affinity to targets remains difficult.</li>
<li><strong>Explainability</strong>: Methods are black-box, offering no insight into why generated molecules have desired properties.</li>
</ul>
<p><strong>For protein generation:</strong></p>
<ul>
<li><strong>Benchmarking</strong>: Protein generative tasks lack a standard evaluative procedure, with variance between each model&rsquo;s metrics and testing conditions.</li>
<li><strong>Performance</strong>: SOTA models still struggle with fold classification, gene ontology, and antibody CDR-H3 generation.</li>
</ul>
<p>The authors also note that many generative tasks are evaluated using predictive models (e.g., classifier networks for binding affinity or molecular properties). Improvements to these classification methods would lead to more precise alignment with real-world biological applications.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors identify increasing performance in existing tasks, defining more applicable tasks (especially in molecule-protein binding, antibody generation), and exploring entirely new areas of research as key future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a survey paper, this work does not produce new models, datasets, or experimental results. All benchmark numbers reported are from the original papers cited.</p>
<h3 id="data">Data</h3>
<p>The survey catalogs the following key datasets across subtasks:</p>
<table>
  <thead>
      <tr>
          <th>Subtask</th>
          <th>Datasets</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target-agnostic molecule</td>
          <td>QM9, <a href="/notes/chemistry/datasets/geom/">GEOM</a>-Drug</td>
          <td>QM9 from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>; GEOM-Drug for complex molecules</td>
      </tr>
      <tr>
          <td>Target-aware molecule</td>
          <td>CrossDocked2020, ZINC20, Binding MOAD</td>
          <td>CrossDocked2020 most used (22.5M pairs)</td>
      </tr>
      <tr>
          <td>Conformation generation</td>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a>-QM9, GEOM-Drugs, ISO17</td>
          <td>Conformer sets for molecules</td>
      </tr>
      <tr>
          <td>Protein structure prediction</td>
          <td>PDB, CASP14, CAMEO</td>
          <td>CASP biennial blind evaluation</td>
      </tr>
      <tr>
          <td>Protein sequence generation</td>
          <td>PDB, UniRef, UniParc, CATH, TS500</td>
          <td>CATH for domain classification</td>
      </tr>
      <tr>
          <td>Backbone design</td>
          <td>PDB, AlphaFoldDB, SCOP, CATH</td>
          <td>AlphaFoldDB for expanded structural coverage</td>
      </tr>
      <tr>
          <td>Antibody structure</td>
          <td>SAbDab, RAB</td>
          <td>SAbDab: all antibody structures from PDB</td>
      </tr>
      <tr>
          <td>Antibody CDR generation</td>
          <td>SAbDab, RAB, SKEMPI</td>
          <td>SKEMPI for affinity optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gersteinlab/GenAI4Drug">GenAI4Drug</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Organized repository of all covered sources</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, X., Dai, H., Knight, E., Wu, F., Li, Y., Li, T., &amp; Gerstein, M. (2024). A survey of generative AI for de novo drug design: New frontiers in molecule and protein generation. <em>Briefings in Bioinformatics</em>, 25(4), bbae338. <a href="https://doi.org/10.1093/bib/bbae338">https://doi.org/10.1093/bib/bbae338</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics, Volume 25, Issue 4, 2024.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2402.08703">arXiv: 2402.08703</a></li>
<li><a href="https://github.com/gersteinlab/GenAI4Drug">GitHub: GenAI4Drug</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11247410/">PMC: PMC11247410</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{tang2024survey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae338}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Foundation Models in Chemistry: A 2025 Perspective</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/foundation-models-chemistry-perspective/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/foundation-models-chemistry-perspective/</guid><description>Perspective reviewing foundation models for chemistry across property prediction, MLIPs, inverse design, and multi-domain applications.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-foundation-models-for-chemistry">A Systematization of Foundation Models for Chemistry</h2>
<p>This is a <strong>Systematization</strong> paper. It organizes the rapidly growing landscape of foundation models in chemistry into a coherent taxonomy. The paper distinguishes between &ldquo;small&rdquo; foundation models (pretrained for a single application domain) and &ldquo;big&rdquo; foundation models (adaptable across multiple domains such as property prediction and inverse design). It covers models based on graph neural networks (GNNs) and language models, reviews pretraining strategies (self-supervised, multimodal, supervised), and maps approximately 40 models across four application domains.</p>
<h2 id="why-a-foundation-model-perspective-for-chemistry">Why a Foundation Model Perspective for Chemistry?</h2>
<p>Foundation models have transformed NLP and computer vision through large-scale pretraining and transfer learning. In chemistry, however, several persistent challenges motivate the adoption of this paradigm:</p>
<ol>
<li><strong>Data scarcity</strong>: Chemical datasets are often small and expensive to generate (requiring experiments or quantum mechanical calculations), unlike the large annotated datasets available in NLP/CV.</li>
<li><strong>Poor generalization</strong>: ML models in chemistry frequently need to extrapolate to out-of-domain compounds (e.g., novel drug candidates, unseen crystal structures), where conventional models struggle.</li>
<li><strong>Limited transferability</strong>: Traditional ML interatomic potentials (MLIPs) are trained on system-specific datasets and cannot be easily transferred across different chemical systems.</li>
</ol>
<p>Foundation models address these by learning general representations from large unlabeled datasets, which can then be adapted to specific downstream tasks via finetuning. The paper argues that summarizing this fast-moving field is timely, given the diversity of approaches emerging across molecular property prediction, MLIPs, inverse design, and multi-domain applications.</p>
<h2 id="small-vs-big-foundation-models-a-two-tier-taxonomy">Small vs. Big Foundation Models: A Two-Tier Taxonomy</h2>
<p>The paper&rsquo;s central organizing framework distinguishes two scopes of foundation model:</p>
<p><strong>Small foundation models</strong> are pretrained models adapted to various tasks within a single application domain. Examples include:</p>
<ul>
<li>A model pretrained on large molecular databases that predicts multiple molecular properties (band gap, formation energy, etc.)</li>
<li>A universal MLIP that can simulate diverse chemical systems</li>
<li>A pretrained generative model adapted for inverse design of different target properties</li>
</ul>
<p><strong>Big foundation models</strong> span multiple application domains, handling both property prediction and inverse design within a single framework. These typically use multimodal learning (combining SMILES/graphs with text) or build on large language models.</p>
<h3 id="architectures">Architectures</h3>
<p>The paper reviews two primary architecture families:</p>
<p><strong>Graph Neural Networks (GNNs)</strong> represent molecules and crystals as graphs $G = (V, E)$ with nodes (atoms) and edges (bonds). Node features are updated through message passing:</p>
<p>$$
m_{i}^{t+1} = \sum_{j \in N(i)} M_{t}(v_{i}^{t}, v_{j}^{t}, e_{ij}^{t})
$$</p>
<p>$$
v_{i}^{t+1} = U_{t}(v_{i}^{t}, m_{i}^{t+1})
$$</p>
<p>After $T$ message-passing steps, a readout function produces a graph-level feature:</p>
<p>$$
g = R(\{ v_{i}^{T} \mid i \in G \})
$$</p>
<p>Recent equivariant GNNs (e.g., NequIP, MACE, EquiformerV2) use vectorial features that respect geometric symmetries, improving expressivity for tasks sensitive to 3D structure.</p>
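The update equations above can be made concrete with a toy dense implementation, taking $M_t$ and $U_t$ as small learned maps (weights here are random placeholders, and edge features are omitted for brevity):

```python
import numpy as np

def mp_step(v, edges, W_msg, W_upd):
    """One message-passing step: m_i = sum over neighbors j of M(v_j);
    v_i' = U(v_i, m_i). `edges` lists directed pairs (i, j): j sends to i."""
    n, d = v.shape
    m = np.zeros((n, d))
    for i, j in edges:
        m[i] += np.tanh(v[j] @ W_msg)                       # message function M_t
    return np.tanh(np.concatenate([v, m], axis=1) @ W_upd)  # update function U_t

def readout(v):
    """Permutation-invariant readout R: a sum over node features."""
    return v.sum(axis=0)

rng = np.random.default_rng(0)
d = 8
W_msg = rng.normal(size=(d, d)) / np.sqrt(d)
W_upd = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
v = rng.normal(size=(4, d))                               # 4 atoms
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]  # an undirected chain
g = readout(mp_step(v, edges, W_msg, W_upd))              # graph-level feature
```

The sum readout is what makes $g$ invariant to atom ordering; equivariant variants additionally carry vector-valued features through each step.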
<p><strong>Language Models</strong> operate on string representations of molecules (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or crystal structures. Autoregressive models like GPT maximize:</p>
<p>$$
\prod_{t=1}^{T} P(x_{t} \mid x_{1}, x_{2}, \ldots, x_{t-1})
$$</p>
<p>Transformers use self-attention:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
$$</p>
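A minimal NumPy rendering of this attention operation (single head, no masking; shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))   # 5 query tokens (e.g., SMILES positions)
K = rng.normal(size=(7, 16))   # 7 key/value tokens
V = rng.normal(size=(7, 16))
out = attention(Q, K, V)       # shape (5, 16): one mixture of V rows per query
```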
<h3 id="pretraining-strategies">Pretraining Strategies</h3>
<p>The paper categorizes pretraining methods into three self-supervised learning (SSL) approaches plus supervised and multimodal strategies:</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Mechanism</th>
          <th>Example Models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Contrastive learning</td>
          <td>Maximize similarity between positive pairs, minimize for negatives</td>
          <td>GraphCL, MolCLR, GraphMVP, CrysGNN</td>
      </tr>
      <tr>
          <td>Predictive learning</td>
          <td>Predict self-generated labels (node context, functional groups, space group)</td>
          <td>GROVER, Hu et al., CrysGNN</td>
      </tr>
      <tr>
          <td>Generative learning</td>
          <td>Reconstruct masked nodes/edges or entire molecules/SMILES</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a></td>
      </tr>
      <tr>
          <td>Supervised pretraining</td>
          <td>Train on energy, forces, stress from DFT databases</td>
          <td>M3GNet, CHGNet, MACE-MP-0, MatterSim</td>
      </tr>
      <tr>
          <td>Multimodal learning</td>
          <td>Learn joint representations across SMILES/graph + text modalities</td>
          <td>KV-PLM, <a href="/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/">MoMu</a>, MoleculeSTM, <a href="/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/">SPMM</a></td>
      </tr>
  </tbody>
</table>
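The contrastive row can be made concrete with an in-batch InfoNCE-style objective, where the two augmented views of each molecule form the positive pair and all other in-batch pairings act as negatives (a generic sketch, not any specific paper's loss):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss over a batch of positive pairs (z1[i], z2[i])."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau  # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # -log P(positive | row)

rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 16))                # view A embeddings, 8 molecules
z_b = z_a + 0.05 * rng.normal(size=(8, 16))   # slightly perturbed view B
loss = info_nce(z_a, z_b)                     # small: pairs are well matched
```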
<p>A common finding across studies is that combining local and global information (e.g., via contrastive learning between node-level and graph-level views, or supervised learning on both forces and total energy) produces more transferable representations.</p>
<h2 id="survey-of-models-across-four-domains">Survey of Models Across Four Domains</h2>
<h3 id="property-prediction">Property Prediction</h3>
<p>The paper reviews 13 models for molecular and materials property prediction. Key findings:</p>
<ul>
<li><strong>Contrastive learning approaches</strong> (GraphCL, MolCLR, GraphMVP) achieve strong results by defining positive pairs through augmentation, 2D/3D structure views, or crystal system membership.</li>
<li><strong>Language model approaches</strong> (<a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>) show that transformers trained on SMILES via masked language modeling can compete with GNN-based approaches.</li>
<li><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>, pretrained on 1.1 billion SMILES from PubChem and ZINC, outperformed many baselines including GNNs on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and <a href="/notes/chemistry/datasets/qm9/">QM9</a> benchmarks. Its attention maps captured molecular structural features directly from SMILES strings.</li>
<li>For crystalline materials, CrysGNN combined contrastive, predictive, and generative learning, demonstrating improvements even on small experimental datasets.</li>
</ul>
<h3 id="machine-learning-interatomic-potentials-mlips">Machine Learning Interatomic Potentials (MLIPs)</h3>
<p>The paper surveys 10 universal MLIPs, all using supervised learning on DFT-calculated energies, forces, and stresses:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Training Data Size</th>
          <th>Key Capability</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>M3GNet</td>
          <td>GNN</td>
          <td>187K (MP)</td>
          <td>First universal MLIP</td>
      </tr>
      <tr>
          <td>CHGNet</td>
          <td>GNN</td>
          <td>1.58M (MPtrj)</td>
          <td>Predicts magnetic moments</td>
      </tr>
      <tr>
          <td>MACE-MP-0</td>
          <td>MACE</td>
          <td>1.58M (MPtrj)</td>
          <td>35 diverse applications</td>
      </tr>
      <tr>
          <td>GNoME potential</td>
          <td>NequIP</td>
          <td>89M</td>
          <td>Zero-shot comparable to trained MLIPs</td>
      </tr>
      <tr>
          <td>MatterSim</td>
          <td>M3GNet/Graphormer</td>
          <td>17M</td>
          <td>SOTA on Matbench Discovery</td>
      </tr>
      <tr>
          <td>eqV2</td>
          <td>EquiformerV2</td>
          <td>118M (OMat24)</td>
          <td>Structural relaxation</td>
      </tr>
  </tbody>
</table>
<p>The GNoME potential, trained on approximately 89 million data points, achieved zero-shot performance comparable to state-of-the-art MLIPs trained from scratch. MatterSim, trained on over 17 million entries across wide temperature (0-5000 K) and pressure (0-1000 GPa) ranges, achieved state-of-the-art on Matbench Discovery and accurately computed thermodynamic and lattice dynamic properties.</p>
<h3 id="inverse-design">Inverse Design</h3>
<p>Few pretrained generative models for inverse design exist. The paper highlights three:</p>
<ul>
<li><strong>MatterGen</strong> (Microsoft): Diffusion model pretrained on Alexandria/MP databases (607K structures), finetuned for conditional generation on band gap, elastic modulus, spacegroup, and composition. Generated S.U.N. (stable, unique, novel) materials at rates more than 2x the previous state of the art.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/">GP-MoLFormer</a></strong> (IBM): MoLFormer pretrained on 1.1B SMILES, finetuned via pair-tuning for property-guided molecular optimization.</li>
<li><strong>CrystalLLM</strong>: Finetuned LLaMA-2 70B for crystal generation with target spacegroup and composition using string representations and prompting.</li>
</ul>
<h3 id="multi-domain-models">Multi-Domain Models</h3>
<p>The paper covers two multi-domain categories:</p>
<p><strong>Property prediction + MLIP</strong>: Denoising pretraining learns virtual forces that guide noisy configurations back to equilibrium, connecting to force prediction. Joint multi-domain pretraining (JMP) from Meta FAIR achieved state-of-the-art on 34 of 40 tasks spanning molecules, crystals, and MOFs by training simultaneously on diverse energy/force databases.</p>
<p><strong>Property prediction + inverse design</strong>: Multimodal models (KV-PLM, <a href="/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/">MoMu</a>, MoleculeSTM, <a href="/notes/chemistry/molecular-representations/multimodal/molfm-multimodal-molecular-foundation/">MolFM</a>, <a href="/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/">SPMM</a>) learn joint representations from molecular structures and text, enabling text-based inverse design and property prediction in a single framework. LLM-based models (<a href="/notes/chemistry/llm-applications/chemdfm-x/">ChemDFM</a>, <a href="/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/">nach0</a>, <a href="/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/">finetuned GPT-3</a>) can interact with humans and handle diverse chemistry tasks through instruction tuning.</p>
<h2 id="trends-and-future-directions">Trends and Future Directions</h2>
<h3 id="scope-expansion">Scope Expansion</h3>
<p>The authors identify three axes for expanding foundation model scope:</p>
<ol>
<li><strong>Material types</strong>: Most models target molecules or a single material class. Foundation models that span molecules, crystals, surfaces, and MOFs could exploit shared chemistry across materials.</li>
<li><strong>Modalities</strong>: Beyond SMILES, graphs, and text, additional modalities (images, spectral data like XRD patterns) remain underexplored.</li>
<li><strong>Downstream tasks</strong>: Extending to new chemistry and tasks through emergent capabilities, analogous to the capabilities observed in LLMs at scale.</li>
</ol>
<h3 id="performance-and-scaling">Performance and Scaling</h3>
<p>Key scaling challenges include:</p>
<ul>
<li><strong>Data quality vs. quantity</strong>: Noisy DFT labels (e.g., HOMO-LUMO gaps with high uncertainty from different functionals/basis sets) can limit scalability and out-of-distribution performance.</li>
<li><strong>GNN scalability</strong>: While transformers scale to hundreds of billions of parameters, GNNs have rarely been explored above one million parameters due to oversmoothing and the curse of dimensionality. Recent work by Sypetkowski et al. demonstrated scaling GNNs to 3 billion parameters with consistent improvements.</li>
<li><strong>Database integration</strong>: Combining datasets from different DFT codes requires proper alignment (e.g., total energy alignment methods).</li>
</ul>
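One common total-energy alignment approach is to fit per-element reference energies by least squares and subtract the composition baseline, leaving interaction-scale energies that are comparable across codes; the compositions and energies below are toy values:

```python
import numpy as np

def fit_reference_energies(counts, energies):
    """Least-squares per-element reference energies.
    `counts` is an (N, n_elements) matrix of atom counts per structure."""
    refs, *_ = np.linalg.lstsq(counts, energies, rcond=None)
    return refs

# Toy dataset: columns = (H, O) atom counts; energies in arbitrary units.
counts = np.array([[2.0, 1.0], [1.0, 1.0], [4.0, 2.0], [2.0, 2.0]])
energies = counts @ np.array([-13.6, -75.0]) + np.array([0.1, -0.05, 0.2, -0.1])
refs = fit_reference_energies(counts, energies)
aligned = energies - counts @ refs  # interaction-scale residuals
```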
<h3 id="efficiency">Efficiency</h3>
<p>For MLIPs, efficiency is critical since MD simulations require millions of inference steps. Approaches include:</p>
<ul>
<li>Knowledge distillation from expensive teacher models to lighter student models</li>
<li>Model compression techniques (quantization, pruning) adapted for GNNs</li>
<li>Investigating whether strict equivariance is always necessary</li>
</ul>
<h3 id="interpretability">Interpretability</h3>
<p>Foundation models can generate hallucinations or mode-collapsed outputs. The authors highlight recent interpretability advances (feature extraction from Claude 3, knowledge localization and editing in transformers) as promising directions for more reliable chemical applications.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Combining local and global information in pretraining consistently improves downstream performance across all domains reviewed.</li>
<li>Self-supervised pretraining enables effective transfer learning even in low-data regimes, a critical advantage for chemistry.</li>
<li>Universal MLIPs have reached the point where zero-shot performance can be comparable to system-specific trained models.</li>
<li>Multimodal learning is the most promising approach for big foundation models capable of spanning property prediction and inverse design.</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The precise definition of &ldquo;foundation model&rdquo; in chemistry is not established and varies by scope.</li>
<li>Most surveyed models focus on molecules, with crystalline materials less explored.</li>
<li>Benchmarks for low-data regimes and out-of-distribution performance are insufficient.</li>
<li>The paper focuses on three domains (property prediction, MLIPs, inverse design) and does not cover retrosynthesis, reaction prediction, or other chemical tasks in depth.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a perspective/review paper. No new data or models are introduced. The paper surveys existing models and their training datasets, summarized in Table 1 of the paper.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Not applicable (review paper). The paper describes pretraining strategies (contrastive, predictive, generative, supervised, multimodal) at a conceptual level with references to the original works.</p>
<h3 id="models">Models</h3>
<p>Not applicable (review paper). The paper catalogs approximately 40 foundation models across four domains. See Table 1 in the paper for the complete listing.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Not applicable (review paper). The paper references benchmark results from the original studies (MoleculeNet, QM9, Matbench, Matbench Discovery, JARVIS-DFT) but does not perform independent evaluation.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Choi, J., Nam, G., Choi, J., &amp; Jung, Y. (2025). A Perspective on Foundation Models in Chemistry. <em>JACS Au</em>, 5(4), 1499-1518. <a href="https://doi.org/10.1021/jacsau.4c01160">https://doi.org/10.1021/jacsau.4c01160</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{choi2025perspective,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Perspective on Foundation Models in Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Choi, Junyoung and Nam, Gunwook and Choi, Jaesik and Jung, Yousung}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{JACS Au}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1499--1518}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/jacsau.4c01160}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Language Models for De Novo Drug Design Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/</guid><description>Review of chemical language models for de novo drug design covering string representations, architectures, training strategies, and experimental validation.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-chemical-language-models-for-drug-design">A Systematization of Chemical Language Models for Drug Design</h2>
<p>This paper is a <strong>Systematization</strong> (minireview) that surveys the landscape of chemical language models (CLMs) for de novo drug design. It organizes the field along three axes: molecular string representations, deep learning architectures, and generation strategies (distribution learning, goal-directed, and conditional). The review also highlights experimental validations, current gaps, and future opportunities.</p>
<h2 id="why-chemical-language-models-matter-for-drug-design">Why Chemical Language Models Matter for Drug Design</h2>
<p>De novo drug design faces an enormous combinatorial challenge: the &ldquo;chemical universe&rdquo; is estimated to contain up to $10^{60}$ drug-like small molecules. Exhaustive enumeration is infeasible, and traditional design algorithms rely on hand-crafted assembly rules. Chemical language models address this by borrowing natural language processing techniques to learn the &ldquo;chemical language,&rdquo; generating molecules as string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, DeepSMILES) that satisfy both syntactic validity (chemically valid structures) and semantic correctness (desired pharmacological properties).</p>
<p>CLMs have gained traction because string representations are readily available for most molecular databases, generation is computationally cheap (one molecule per forward pass through a sequence model), and the same architecture can be applied to diverse tasks (property prediction, de novo generation, reaction prediction). At the time of this review, CLMs had produced experimentally validated bioactive molecules in several prospective studies, establishing them as practical tools for drug discovery.</p>
<h2 id="molecular-string-representations-smiles-deepsmiles-and-selfies">Molecular String Representations: SMILES, DeepSMILES, and SELFIES</h2>
<p>The review covers three main string representations used as input/output for CLMs:</p>
<p><strong>SMILES</strong> (Simplified Molecular-Input Line-Entry System) converts hydrogen-depleted molecular graphs into strings where atoms are denoted by atomic symbols, bonds and branching by punctuation, and ring openings/closures by numbers. SMILES are non-univocal (multiple valid strings per molecule), so canonicalization algorithms are needed to obtain a unique representation. Multiple studies show that using randomized (non-canonical) SMILES for data augmentation improves CLM performance, with diminishing returns beyond 10- to 20-fold augmentation.</p>
<p><strong><a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a></strong> modifies SMILES to improve machine-readability by replacing the paired ring-opening/closure digits with a count-based system and using closing parentheses only (no opening ones). This reduces the frequency of syntactically invalid strings but does not eliminate them entirely.</p>
<p><strong><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a></strong> (Self-Referencing Embedded Strings) use a formal grammar that guarantees 100% syntactic validity of decoded molecules. Every SELFIES string maps to a valid molecular graph. However, SELFIES can produce chemically unrealistic molecules (e.g., highly strained ring systems), and the mapping between string edits and molecular changes is less intuitive than for SMILES.</p>
<p>The review notes a key tradeoff: SMILES offer a richer, more interpretable language with well-studied augmentation strategies, while SELFIES guarantee validity at the cost of chemical realism and edit interpretability.</p>
<h2 id="clm-architectures-and-training-strategies">CLM Architectures and Training Strategies</h2>
<h3 id="architectures">Architectures</h3>
<p>The review describes the main architectures used in CLMs:</p>
<p><strong>Recurrent Neural Networks (RNNs)</strong>, particularly LSTMs and GRUs, dominated early CLM work. These models process SMILES character-by-character and generate new strings autoregressively via next-token prediction. RNNs are computationally efficient and well-suited to the sequential nature of molecular strings.</p>
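<p>The character-by-character (or token-by-token) processing that RNN-based CLMs rely on starts with tokenization. Below is an illustrative sketch of the regex-based SMILES tokenizer commonly used in this literature; the exact pattern varies between implementations, and this one covers only common organic-subset symbols:</p>

```python
import re

# Regex-based SMILES tokenizer (pattern adapted from common usage in the
# CLM literature; illustrative, not exhaustive). Bracket atoms, two-letter
# halogens, and two-digit ring closures are matched before single characters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|N|O|S|P|F|I|B|C|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|\*|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any characters are left over."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

The resulting token sequence is what an autoregressive model consumes and emits one step at a time during next-token prediction.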
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode molecules into a continuous latent space and decode them back into strings. This enables smooth interpolation between molecules and latent-space optimization, but generated strings may be syntactically invalid.</p>
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> have been adapted for molecular string generation (e.g., <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>), though they face training instability and mode collapse challenges that limit their adoption.</p>
<p><strong>Transformers</strong> have emerged as an increasingly popular alternative, offering parallelized training and the ability to capture long-range dependencies in molecular strings. The review notes the growing relevance of Transformer-based CLMs, particularly for large-scale pretraining.</p>
<h3 id="generation-strategies">Generation Strategies</h3>
<p>The review organizes CLM generation into three categories:</p>
<ol>
<li>
<p><strong>Distribution learning</strong>: The model learns to reproduce the statistical distribution of a training set of molecules. No explicit scoring function is used during generation. The generated molecules are evaluated post-hoc by comparing their property distributions to the training set. This approach is end-to-end but provides no direct indication of individual molecule quality.</p>
</li>
<li>
<p><strong>Goal-directed generation</strong>: A pretrained CLM is steered toward molecules optimizing a specified scoring function (e.g., predicted bioactivity, physicochemical properties). Common approaches include reinforcement learning (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and variants), hill-climbing, and Bayesian optimization. Scoring functions provide direct quality signals but can introduce biases, shortcuts, and limited structural diversity.</p>
</li>
<li>
<p><strong>Conditional generation</strong>: An intermediate approach that learns a joint semantic space between molecular structures and desired properties. The desired property profile serves as an input &ldquo;prompt&rdquo; for generation (e.g., a protein target, gene expression signature, or 3D shape). This bypasses the need for external scoring functions but has seen limited experimental application.</p>
</li>
</ol>
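<p>The goal-directed strategy above can be sketched as a generic mutate-score-accept loop. This is a toy illustration in plain Python: <code>score</code> and <code>mutate</code> are made-up stand-ins for the property predictor and CLM sampler a real system (e.g., a hill-climbing variant of REINVENT-style optimization) would use:</p>

```python
import random

random.seed(0)

# Toy hill-climbing sketch of goal-directed generation: mutate a string
# "molecule" and keep the mutation whenever a stand-in scoring function
# improves. Real systems replace `score` with a bioactivity/property
# predictor and `mutate` with sampling from a fine-tuned CLM.
ALPHABET = list("CNOcno=#()1")

def score(s: str) -> float:
    # Hypothetical objective: reward a target fraction of aromatic carbons.
    target = 0.4
    frac_c = s.count("c") / max(len(s), 1)
    return -abs(frac_c - target)

def mutate(s: str) -> str:
    i = random.randrange(len(s))
    return s[:i] + random.choice(ALPHABET) + s[i + 1:]

candidate = "CCCCCCCCCC"
best = score(candidate)
for _ in range(500):
    proposal = mutate(candidate)
    if score(proposal) > best:
        candidate, best = proposal, score(proposal)

print(candidate, round(best, 3))
```

The failure modes the review mentions show up even in this toy: a greedy loop happily exploits any artifact of the scoring function and converges to low-diversity solutions.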
<h3 id="transfer-learning-and-chemical-space-exploration">Transfer Learning and Chemical Space Exploration</h3>
<p>Transfer learning is the dominant paradigm for CLM-driven chemical space exploration. A large-scale pretraining step (on $10^5$ to $10^6$ molecules via next-character prediction) is followed by fine-tuning on a smaller set of molecules with desired properties (often only 10 to 100 molecules). Key findings from the literature:</p>
<ul>
<li>The minimum training set size depends on target molecule complexity and heterogeneity.</li>
<li>SMILES augmentation is most beneficial with small training sets (fewer than 10,000 molecules) and plateaus for large, structurally complex datasets.</li>
<li>Fine-tuning with as few as 10 to 100 molecules has produced experimentally validated bioactive designs.</li>
<li>Hyperparameter tuning has relatively little effect on overall CLM performance.</li>
</ul>
<h2 id="evaluating-clm-designs-and-experimental-validation">Evaluating CLM Designs and Experimental Validation</h2>
<p>The review identifies evaluation as a critical gap. CLMs are often benchmarked on &ldquo;toy&rdquo; properties such as calculated logP, molecular weight, or QED (quantitative estimate of drug-likeness). These metrics capture the ability to satisfy predefined criteria but fail to reflect real-world drug discovery complexity and may lead to trivial solutions.</p>
<p>Existing benchmarks (<a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>) enable comparability across independently developed approaches but do not fully address the quality of generated compounds. The review emphasizes that experimental validation is the ultimate test. At the time of writing, only a few prospective applications had been published:</p>
<ul>
<li>Dual modulator of <a href="https://en.wikipedia.org/wiki/Retinoid_X_receptor">retinoid X</a> and <a href="https://en.wikipedia.org/wiki/Peroxisome_proliferator-activated_receptor">PPAR</a> receptors (EC50 ranging from 0.06 to 2.3 uM)</li>
<li>Inhibitor of <a href="https://en.wikipedia.org/wiki/Pim_kinase">Pim1 kinase</a> and <a href="https://en.wikipedia.org/wiki/Cyclin-dependent_kinase_4">CDK4</a> (manually modified from generated design)</li>
<li>Natural-product-inspired <a href="https://en.wikipedia.org/wiki/RAR-related_orphan_receptor_gamma">RORgamma</a> agonist (EC50 = 0.68 uM)</li>
<li>Molecules designed via combined generative AI and on-chip synthesis</li>
</ul>
<p>The scarcity of experimental validations reflects the interdisciplinary expertise required and the time/cost of chemical synthesis.</p>
<h2 id="gaps-limitations-and-future-directions">Gaps, Limitations, and Future Directions</h2>
<p>The review identifies several key gaps and opportunities:</p>
<p><strong>Scoring function limitations</strong>: Current scoring functions struggle with activity cliffs and non-additive structure-activity relationships. Conditional generation methods may help overcome these limitations by learning direct structure-property mappings.</p>
<p><strong>Structure-based design</strong>: Generating molecules that match electrostatic and shape features of protein binding pockets holds promise for addressing unexplored targets. However, prospective applications have been limited, potentially due to bias in existing protein-ligand affinity datasets.</p>
<p><strong>Synthesizability</strong>: Improving the ability of CLMs to propose synthesizable molecules is expected to increase practical relevance. Automated synthesis platforms may help but could also limit accessible chemical space.</p>
<p><strong>Few-shot learning</strong>: Large-scale pretrained CLMs combined with few-shot learning approaches are expected to boost prospective applications.</p>
<p><strong>Extensions beyond small molecules</strong>: Extending chemical languages to more complex molecular entities (proteins with non-natural amino acids, crystals, supramolecular chemistry) is an open frontier.</p>
<p><strong>Failure modes</strong>: Several studies have documented failure modes in goal-directed generation, including model shortcuts (exploiting scoring function artifacts), limited structural diversity, and generation of chemically unrealistic molecules.</p>
<p><strong>Interdisciplinary collaboration</strong>: The review emphasizes that bridging deep learning, cheminformatics, and medicinal chemistry expertise is essential for translating CLM designs into real-world drug candidates.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper and does not present novel experimental data. The paper surveys results from the literature.</p>
<h3 id="algorithms">Algorithms</h3>
<p>No novel algorithms are introduced. The review categorizes existing approaches (RNNs, VAEs, GANs, Transformers) and generation strategies (distribution learning, goal-directed, conditional).</p>
<h3 id="models">Models</h3>
<p>No new models are presented. The paper references existing implementations including REINVENT, ORGAN, and various RNN-based and Transformer-based CLMs.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The review discusses existing benchmarks:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong>: Benchmarking suite for de novo molecular design</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></strong>: Benchmarking platform for molecular generation models</li>
<li><strong>QED</strong>: Quantitative estimate of drug-likeness</li>
<li>Various physicochemical property metrics (logP, molecular weight)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grisoni, F. (2023). Chemical language models for de novo drug design: Challenges and opportunities. <em>Current Opinion in Structural Biology</em>, 79, 102527. <a href="https://doi.org/10.1016/j.sbi.2023.102527">https://doi.org/10.1016/j.sbi.2023.102527</a></p>
<p><strong>Publication</strong>: Current Opinion in Structural Biology, Volume 79, April 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grisoni2023chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language models for de novo drug design: Challenges and opportunities}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Current Opinion in Structural Biology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{102527}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.sbi.2023.102527}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of Molecular Representation Learning Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</guid><description>A systematic review of molecular representation learning foundation models for drug discovery, covering five modalities and four pretraining strategies.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-molecular-representation-foundation-models">A Systematization of Molecular Representation Foundation Models</h2>
<p>This paper is a <strong>Systematization</strong> that provides the first comprehensive review of foundation models for molecular representation learning (MRL). The authors classify existing models by their input modality (unimodal vs. multimodal), analyze four mainstream pretraining strategies, survey five downstream application domains, and propose practical guidelines for model selection. The review covers over 35 representative models published between 2020 and 2024, with parameter counts ranging from 2 million to over 1 trillion.</p>
<h2 id="why-a-systematic-review-of-mrl-foundation-models-is-needed">Why a Systematic Review of MRL Foundation Models Is Needed</h2>
<p>Molecular representation learning transforms molecular structures and properties into numerical vectors that serve as inputs for machine learning models. The field has evolved rapidly from molecular fingerprints through SMILES-based sequence models to graph neural networks and 3D geometry-aware architectures. Foundation models, characterized by large-scale pretraining on unlabeled molecular data followed by fine-tuning on downstream tasks, have introduced new opportunities for generalizability and transfer learning in drug discovery.</p>
<p>Despite this rapid progress, the authors identify a gap: no prior work has systematically reviewed MRL foundation models across all input modalities and pretraining paradigms. Existing surveys tend to focus on specific representations (e.g., graph-based methods) or specific applications (e.g., property prediction) without providing the cross-cutting perspective needed to guide model selection. This review fills that gap by offering a unified taxonomy and practical guidelines.</p>
<h2 id="taxonomy-of-molecular-descriptors-and-model-architectures">Taxonomy of Molecular Descriptors and Model Architectures</h2>
<p>The core organizational framework classifies models along two axes: the molecular descriptor used as input and the backbone architecture.</p>
<h3 id="molecular-descriptors">Molecular Descriptors</h3>
<p>The review identifies five primary descriptor types:</p>
<ol>
<li><strong>Molecular fingerprints</strong>: Binary vectors encoding structural features (e.g., Morgan fingerprints). Rarely used in foundation models due to information loss and dimensional complexity.</li>
<li><strong>1D sequences</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representations. SMILES is compact and widely used, but models that generate SMILES can emit syntactically invalid strings; SELFIES guarantees valid molecular strings by construction.</li>
<li><strong>2D topological graphs</strong>: Atoms as nodes, bonds as edges. Can be derived from SMILES via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, making graph datasets effectively interchangeable with SMILES datasets.</li>
<li><strong>3D geometry</strong>: Spatial coordinates capturing conformational information, energy states, and stereochemistry. Experimentally expensive to obtain, limiting dataset availability.</li>
<li><strong>Multimodal</strong>: Combinations of the above with text, IUPAC names, knowledge graphs, and molecular images.</li>
</ol>
<p>The paper also discusses mathematically abstract molecular representations. For example, the <a href="https://en.wikipedia.org/wiki/Wiener_index">Wiener index</a> quantifies structural complexity:</p>
<p>$$
W = \frac{1}{2} \sum_{i \neq j} d_{ij}
$$</p>
<p>where $d_{ij}$ is the topological distance (shortest bonding path length) between atoms $i$ and $j$.</p>
<p>Degree centrality captures local connectivity:</p>
<p>$$
C_{D}(v_{i}) = \sum_{j=1}^{n} A_{ij}
$$</p>
<p>where $A \in \mathbb{R}^{n \times n}$ is the molecular graph adjacency matrix.</p>
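<p>Both descriptors can be computed directly from a small adjacency matrix. The sketch below does so for the hydrogen-depleted graph of n-butane (a path of four carbons), whose Wiener index is the textbook value of 10:</p>

```python
from collections import deque

# Compute the Wiener index and degree centralities of a small
# hydrogen-depleted molecular graph (n-butane: C-C-C-C) from its
# adjacency matrix, using BFS for the topological distances d_ij.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

def shortest_paths(A, src):
    """BFS distances from `src` on an unweighted graph."""
    n = len(A)
    dist = [None] * n
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in range(n):
            if A[u][v] and dist[v] is None:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

n = len(A)
# Wiener index: sum of topological distances over unordered atom pairs.
W = sum(shortest_paths(A, i)[j] for i in range(n) for j in range(i + 1, n))
degrees = [sum(row) for row in A]  # degree centrality C_D(v_i) per atom

print(W, degrees)  # butane: W = 10, degrees [1, 2, 2, 1]
```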
<h3 id="model-architectures">Model Architectures</h3>
<p>Models are classified into two primary categories:</p>
<p><strong>Unimodal-based models:</strong></p>
<ul>
<li><strong>Sequence-based</strong>: Transformer models operating on SMILES/SELFIES (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>, MolGEN, <a href="/notes/chemistry/llm-applications/llamsmol-instruction-tuning-chemistry/">LlaSMol</a>). These capture syntactic patterns but miss spatial and topological features.</li>
<li><strong>Topological graph-based</strong>: GNN variants (GIN, GCN, GAT) and Transformer-based graph models (Graphormer). GNNs capture local topology through message passing; Transformers overcome locality limitations through global self-attention.</li>
<li><strong>3D geometry-based</strong>: Models like Uni-Mol and 3D PGT that incorporate spatial coordinates. Uni-Mol uses distance-aware self-attention with an SE(3)-equivariant coordinate head for rotation/translation invariance.</li>
<li><strong>Image-based</strong>: CNN-based models (ImageMol) that process 2D molecular images using visual representation learning.</li>
</ul>
<p><strong>Multimodal-based models:</strong></p>
<ul>
<li><strong>Sequence + Graph</strong>: <a href="/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/">DVMP</a>, PanGu Drug Model. Combines the strengths of string and topological representations.</li>
<li><strong>Graph + 3D Geometry</strong>: GraphMVP, Transformer-M. Enriches topological features with spatial information.</li>
<li><strong>Text + Molecular Structure</strong>: KV-PLM, MolT5, MoleculeSTM, MolReGPT, Y-mol. Aligns molecular structural information with biomedical text through cross-modal learning.</li>
</ul>
<h2 id="four-pretraining-paradigms-for-mrl">Four Pretraining Paradigms for MRL</h2>
<p>The review systematically categorizes pretraining strategies into four paradigms:</p>
<h3 id="masked-language-modeling-mlm">Masked Language Modeling (MLM)</h3>
<p>The cornerstone strategy for sequence-based models. Randomly masks tokens in molecular sequences and trains the model to predict them. ChemBERTa pretrained on 77 million SMILES sequences from PubChem achieves 5-10% improvement in AUC-ROC on property prediction tasks compared to task-specific models. MLM captures local dependencies and global sequence patterns but cannot model spatial or topological features, making it best suited for unimodal sequence inputs.</p>
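<p>The corruption step of MLM can be sketched in a few lines. The 15% masking rate and <code>[MASK]</code> token below follow BERT conventions and are illustrative rather than any specific model's settings; for simplicity the sequence is tokenized at the character level:</p>

```python
import random

random.seed(7)

# Minimal sketch of MLM-style corruption for a tokenized SMILES sequence:
# mask ~15% of tokens and record the original tokens as prediction targets,
# as a BERT-style molecular encoder would see them during pretraining.
def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok          # the model must predict this token
            corrupted.append(mask_token)
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = list("CC(=O)Oc1ccccc1")      # character-level tokens for simplicity
corrupted, targets = mask_tokens(tokens)
print(corrupted, targets)
```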
<h3 id="contrastive-learning-cl">Contrastive Learning (CL)</h3>
<p>The dominant strategy for multimodal models. Constructs positive-negative sample pairs to align features across modalities or views. In unimodal settings, CL generates negative samples by perturbing molecular graphs. In multimodal settings, it aligns features from different modalities. GraphMVP, which contrasts 2D topological features with 3D spatial features, reduces RMSE by 15% on QM9 energy prediction compared to unimodal models. Performance depends heavily on the quality of positive sample construction.</p>
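<p>The alignment objective behind this strategy is typically an InfoNCE-style contrastive loss: each molecule's embedding in one view should score highest against its own embedding in the other view, among all candidates in the batch. A pure-Python sketch with made-up two-dimensional embeddings standing in for the 2D- and 3D-view encoder outputs:</p>

```python
import math

# InfoNCE-style contrastive objective: the i-th anchor should match the
# i-th positive among all positives in the batch. Embeddings here are
# tiny made-up vectors; real models use encoder outputs.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of matching anchor i to positive i."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += log_z - logits[i]
    return loss / len(anchors)

view_2d = [[1.0, 0.0], [0.0, 1.0]]       # e.g. graph-encoder embeddings
view_3d_good = [[1.0, 0.0], [0.0, 1.0]]  # correctly paired 3D embeddings
view_3d_bad = [[0.0, 1.0], [1.0, 0.0]]   # mismatched pairing

print(info_nce(view_2d, view_3d_good) < info_nce(view_2d, view_3d_bad))  # True
```

The loss is near zero when the two views are correctly aligned and large when pairings are shuffled, which is exactly the signal that drives cross-view alignment in models like GraphMVP.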
<h3 id="reconstruction-based-pretraining-rbp">Reconstruction-Based Pretraining (RBP)</h3>
<p>Learns global molecular features by reconstructing original data from corrupted inputs. Tasks include node feature reconstruction, graph structure reconstruction, and coordinate/energy reconstruction. MGMAE masks more than 50% of nodes and edges in molecular graphs and trains the model to reconstruct them, achieving 94.2% AUC-ROC on BBBP. RBP captures global structural patterns but requires high model complexity and training cost.</p>
<h3 id="multimodal-alignment-pretraining-map">Multimodal Alignment Pretraining (MAP)</h3>
<p>Designed for multimodal inputs, aligning and fusing features from different modalities through cross-modal tasks. KV-PLM uses SMILES-to-text matching to align molecular structure and functional information. MAP fuses structural information (SMILES, graphs) with semantic information (text) but requires large-scale cross-modal labeled data, posing significant data acquisition challenges.</p>
<h2 id="downstream-applications-and-performance-benchmarks">Downstream Applications and Performance Benchmarks</h2>
<p>The review evaluates MRL foundation models across five application domains.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The most common benchmark for MRL models. The review provides comprehensive ROC-AUC comparisons across eight <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification datasets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>BBBP</th>
          <th>BACE</th>
          <th>ClinTox</th>
          <th>Tox21</th>
          <th>SIDER</th>
          <th>HIV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MGMAE</td>
          <td>Graph</td>
          <td>94.2</td>
          <td>92.7</td>
          <td>96.7</td>
          <td>86.0</td>
          <td>66.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MPG</td>
          <td>Graph</td>
          <td>92.2</td>
          <td>92.0</td>
          <td>96.3</td>
          <td>83.7</td>
          <td>66.1</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GROVER</td>
          <td>Graph+Trans.</td>
          <td>94.0</td>
          <td>89.4</td>
          <td>94.4</td>
          <td>83.1</td>
          <td>65.8</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MoLFormer</td>
          <td>Sequence</td>
          <td>93.7</td>
          <td>88.2</td>
          <td>94.8</td>
          <td>84.7</td>
          <td>69.0</td>
          <td>82.2</td>
      </tr>
      <tr>
          <td>MM-Deacon</td>
          <td>Seq.+IUPAC</td>
          <td>78.5</td>
          <td>-</td>
          <td>99.5</td>
          <td>-</td>
          <td>69.3</td>
          <td>80.1</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>3D</td>
          <td>72.9</td>
          <td>85.7</td>
          <td>91.9</td>
          <td>79.6</td>
          <td>65.9</td>
          <td>80.8</td>
      </tr>
      <tr>
          <td>DVMP</td>
          <td>Seq.+Graph</td>
          <td>77.8</td>
          <td>89.4</td>
          <td>95.6</td>
          <td>79.1</td>
          <td>69.8</td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>TxD-T-LLM</td>
          <td>Seq.+Text</td>
          <td>-</td>
          <td>-</td>
          <td>86.3</td>
          <td>88.2</td>
          <td>-</td>
          <td>73.2</td>
      </tr>
  </tbody>
</table>
<p>The table shows that no single architecture dominates across all datasets. Transformer- and GIN-based architectures with graph inputs generally perform well. The review notes that model effectiveness depends heavily on the dataset, with Mole-BERT encountering negative transfer due to a small and unbalanced atomic vocabulary.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>MolGEN (SELFIES-based, 8B parameters) achieves 100% validity on synthetic molecules. MolT5 excels at text-to-molecule generation. Uni-Mol generates 3D conformations with 97.95% coverage on <a href="/notes/chemistry/datasets/qm9/">QM9</a>.</p>
<h3 id="drug-drug-interaction-prediction"><a href="https://en.wikipedia.org/wiki/Drug_interaction">Drug-Drug Interaction</a> Prediction</h3>
<p>MPG achieves 96.6% AUC-ROC on BIOSNAP by combining unsupervised pretraining with supervised fine-tuning and multi-task learning.</p>
<h3 id="retrosynthesis-prediction"><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a> Prediction</h3>
<p>DVMP achieves 66.5% top-1 accuracy on USPTO-50K when reaction types are provided as priors (54.2% without).</p>
<h3 id="drug-synergy-prediction">Drug Synergy Prediction</h3>
<p>SynerGPT (GPT-based) achieves 77.7% AUC-ROC in few-shot settings for novel drug combinations, outperforming baselines through contextual learning.</p>
<h2 id="guidelines-limitations-and-future-directions">Guidelines, Limitations, and Future Directions</h2>
<h3 id="model-selection-guidelines">Model Selection Guidelines</h3>
<p>The authors provide structured guidelines for choosing MRL foundation models based on:</p>
<ol>
<li><strong>Task objective</strong>: Property prediction favors GNNs or large pretrained frameworks (ChemBERTa-2, Uni-Mol). Generation tasks favor GPT-style autoregressive models (MolGEN). Retrosynthesis benefits from multimodal architectures.</li>
<li><strong>Data characteristics</strong>: SMILES/graph representations suit generation tasks. Knowledge graph-enhanced models benefit interaction and synergy prediction. Transfer learning helps data-limited scenarios.</li>
<li><strong>Interpretability needs</strong>: Transformer architectures are preferred when interpretability is required, as attention matrices enable visualization of learned molecular features.</li>
<li><strong>Computational budget</strong>: GIN-based models have $\mathcal{O}(|V| + |E|)$ complexity, while Transformer-based models scale as $\mathcal{O}(n^2 \cdot d)$.</li>
</ol>
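<p>The last guideline can be made concrete with a back-of-envelope operation count. The numbers below are arbitrary per-layer unit costs implied by the stated asymptotics, not benchmarks:</p>

```python
# Illustrative comparison of the asymptotic per-layer costs quoted above:
# GIN-style message passing scales with graph size, O(|V| + |E|), while
# Transformer self-attention scales as O(n^2 * d) in sequence length.
def gin_ops(num_atoms: int, num_bonds: int) -> int:
    return num_atoms + num_bonds

def attention_ops(num_tokens: int, dim: int) -> int:
    return num_tokens ** 2 * dim

for n in (20, 100, 500):  # roughly drug-like molecule sizes and beyond
    print(n, gin_ops(n, n), attention_ops(n, 64))
```

Even for modest molecules the quadratic attention term dominates, which is why sequence length (and sparse-attention variants) matters more for Transformer-based MRL models than for GNNs.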
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The review identifies five key challenges:</p>
<ol>
<li><strong>Multimodal data integration</strong>: Each representation paradigm has distinct limitations (1D neglects spatial configuration, 2D omits conformational details, 3D faces rotational invariance challenges). The authors propose incorporating <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> trajectories as a dynamic modality and using cross-modal data augmentation.</li>
<li><strong>Data scarcity</strong>: Semi-supervised learning can achieve more than 90% of fully supervised performance using only 10% labeled data on QM9. Cross-modal augmentation (e.g., 3D InfoMax) can generate plausible 3D conformers from 2D graphs.</li>
<li><strong>Interpretability</strong>: Current methods rely primarily on attention-based visualization, which is insufficient for multimodal models. The authors suggest assessing decision consistency across modalities and incorporating chemical knowledge graphs.</li>
<li><strong>Training efficiency</strong>: Large parameter counts demand distributed parallel training techniques, with data parallelism being the most common approach.</li>
<li><strong>Robustness and generalization</strong>: Strategies include data augmentation (multiple SMILES representations, 3D conformer generation), meta-learning for rapid adaptation, and sparse attention mechanisms to reduce sensitivity to irrelevant long-range interactions.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review paper, so standard reproducibility criteria for experimental papers do not directly apply. The review compiles results from the original publications of each surveyed model.</p>
<h3 id="data">Data</h3>
<p>The review catalogs 28 representative molecular datasets used by the surveyed foundation models:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Descriptor</th>
          <th>Primary Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>~118M</td>
          <td>SMILES, 3D, Image, IUPAC</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ZINC15</td>
          <td>~980M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>~2.4M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>133,884</td>
          <td>SMILES</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450,000</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>950,000</td>
          <td>SMILES</td>
          <td>Reaction prediction</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>4M</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Z-dot-max/MRL_Foundation_Review">Review Materials (GitHub)</a></td>
          <td>Code/Data</td>
          <td>Not specified</td>
          <td>Code and data tables for figures</td>
      </tr>
      <tr>
          <td><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12784970/">Paper (PMC)</a></td>
          <td>Paper</td>
          <td>CC-BY</td>
          <td>Open access via PubMed Central</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>All performance metrics reported in the review are directly cited from the original studies. The evaluation protocols follow each model&rsquo;s original setup. The review covers:</p>
<ul>
<li>ROC-AUC for classification tasks (property prediction, DDI, synergy)</li>
<li>RMSE/MAE for regression tasks</li>
<li>Validity and novelty for molecular generation</li>
<li>Top-k accuracy for retrosynthesis</li>
<li>COV and MAT for conformation generation</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, B., Zhang, J., Liu, Y., Liu, Y., Jiang, J., Yuan, S., Zhen, X., &amp; Liu, Y. (2025). A systematic review of molecular representation learning foundation models. <em>Briefings in Bioinformatics</em>, 27(1), bbaf703. <a href="https://doi.org/10.1093/bib/bbaf703">https://doi.org/10.1093/bib/bbaf703</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{song2025systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of molecular representation learning foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Song, Bosheng and Zhang, Jiayi and Liu, Ying and Liu, Yuansheng and Jiang, Jing and Yuan, Sisi and Zhen, Xia and Liu, Yiping}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbaf703}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbaf703}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ROGI-XD: Roughness of Pretrained Molecular Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/rogi-xd-roughness-pretrained-representations/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/rogi-xd-roughness-pretrained-representations/</guid><description>ROGI-XD enables cross-representation roughness comparison, showing pretrained chemical models produce no smoother QSPR surfaces than fingerprints.</description><content:encoded><![CDATA[<h2 id="evaluating-chemical-foundation-models-through-surface-roughness">Evaluating Chemical Foundation Models Through Surface Roughness</h2>
<p>This is a <strong>Systematization</strong> paper that introduces a metric reformulation (ROGI-XD) and uses it to evaluate whether pretrained chemical models (PCMs) learn representations that produce smoother <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-property relationship</a> (QSPR) surfaces than simple baselines. The key finding is negative: pretrained representations are no smoother than molecular fingerprints or descriptors, offering a principled explanation for their inconsistent performance on property prediction benchmarks.</p>
<h2 id="the-smoothness-gap-in-chemical-foundation-models">The Smoothness Gap in Chemical Foundation Models</h2>
<p>Chemical foundation models like ChemBERTa, ChemGPT, and graph-based pretrained networks promise to learn meaningful molecular representations from large unlabeled datasets via self-supervised learning. However, empirical benchmarks consistently show mixed results: these learned representations sometimes match and sometimes underperform simple baselines like Morgan fingerprints or RDKit descriptors.</p>
<p>Prior work by Deng et al. demonstrated that a random forest trained on 2048-bit Morgan fingerprints was competitive with, or superior to, pretrained models like <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a> and GROVER on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and opioid bioactivity tasks. The authors sought to explain this pattern through the lens of QSPR surface roughness: if pretrained representations do not produce smoother mappings from molecular structure to property, they cannot consistently outperform baselines.</p>
<h2 id="rogi-xd-a-dimensionality-independent-roughness-metric">ROGI-XD: A Dimensionality-Independent Roughness Metric</h2>
<p>The original ROuGhness Index (ROGI) captures global surface roughness by measuring the loss in property dispersion as a dataset is progressively coarse-grained through <a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clustering</a>. However, ROGI values are not comparable across representations of different dimensionalities because distances between randomly sampled points increase with dimension, artificially deflating ROGI for high-dimensional representations.</p>
<p>ROGI-XD addresses this by changing the integration variable. Instead of integrating over normalized distance threshold $t$, ROGI-XD integrates over $1 - \log N_{\text{clusters}} / \log N$, where $N_{\text{clusters}}$ is the number of clusters at a given dendrogram step and $N$ is the dataset size. This variable captures the degree of coarse-graining independent of representation dimensionality, producing comparable roughness values across representations ranging from 14 dimensions (descriptors) to 2048 dimensions (ChemGPT).</p>
<p>The procedure follows five steps: (1) cluster molecules using <a href="https://en.wikipedia.org/wiki/Complete-linkage_clustering">complete linkage</a> at distance threshold $t$, (2) coarse-grain by replacing each property label $y_i$ with its cluster mean $\bar{y}_j$, (3) compute the standard deviation $\sigma_t$ of the coarse-grained dataset, (4) repeat for all dendrogram steps, and (5) compute the area under the curve of $2(\sigma_0 - \sigma_t)$ versus the new integration variable.</p>
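<p>The five steps can be sketched with SciPy&rsquo;s hierarchical clustering (an illustrative reimplementation, not the authors&rsquo; released code; labels are normalized to $[0,1]$ so values across representations land on a common scale):</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def rogi_xd(X, y):
    """Illustrative ROGI-XD: dispersion lost under coarse-graining,
    integrated over 1 - log(n_clusters) / log(N)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    y = (y - y.min()) / (y.max() - y.min())   # normalize labels to [0, 1]
    sigma0 = y.std()

    Z = linkage(pdist(X), method="complete")  # step 1: complete linkage
    xs, fs = [0.0], [0.0]                     # N clusters: no dispersion lost yet
    for h in np.unique(Z[:, 2]):              # step 4: every dendrogram step
        labels = fcluster(Z, t=h, criterion="distance")
        y_cg = y.copy()
        for c in np.unique(labels):           # step 2: cluster-mean labels
            y_cg[labels == c] = y[labels == c].mean()
        xs.append(1.0 - np.log(labels.max()) / np.log(n))
        fs.append(2.0 * (sigma0 - y_cg.std()))  # step 3: dispersion loss
    xs, fs = np.array(xs), np.array(fs)       # step 5: area under the curve
    order = np.argsort(xs)
    xs, fs = xs[order], fs[order]
    return float(np.sum(0.5 * (fs[1:] + fs[:-1]) * np.diff(xs)))
```

<p>Because the integration variable lives in $[0, 1]$ regardless of the dimensionality of <code>X</code>, a rough (e.g., randomly labeled) surface yields a larger value than a smooth one under the same representation.</p>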
<h2 id="representations-and-tasks-evaluated">Representations and Tasks Evaluated</h2>
<p>The study compares seven molecular representations:</p>
<table>
  <thead>
      <tr>
          <th>Representation</th>
          <th>Type</th>
          <th>Dimensionality</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>Fixed</td>
          <td>14</td>
          <td>RDKit (14 properties)</td>
      </tr>
      <tr>
          <td>Morgan FP</td>
          <td>Fixed</td>
          <td>512</td>
          <td>Radius 2, 512-bit</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>Pretrained</td>
          <td>128</td>
          <td>Character-based <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> VAE, <a href="/notes/chemistry/datasets/zinc-22/">ZINC 250k</a></td>
      </tr>
      <tr>
          <td>GIN</td>
          <td>Pretrained</td>
          <td>300</td>
          <td>Node attribute masking, ZINC 250k</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>Pretrained</td>
          <td>384</td>
          <td>77M molecules, masked LM</td>
      </tr>
      <tr>
          <td>ChemGPT</td>
          <td>Pretrained</td>
          <td>2048</td>
          <td>PubChem 10M, causal LM</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>Baseline</td>
          <td>128</td>
          <td>Uniform $[0,1]^{128}$</td>
      </tr>
  </tbody>
</table>
<p>These are evaluated on 17 regression tasks drawn from two sources: ADMET datasets from the Therapeutics Data Commons (TDC) and toy datasets generated using <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> oracle functions. Five ML models are used for cross-validation: KNN, MLP, <a href="https://en.wikipedia.org/wiki/Partial_least_squares_regression">PLS</a>, random forest, and SVR.</p>
<h2 id="pretrained-representations-are-not-smoother">Pretrained Representations Are Not Smoother</h2>
<p>ROGI-XD correlates strongly with cross-validated RMSE across representations (median Pearson $r = 0.72$-$0.88$ depending on model), compared to the original ROGI which produces weak cross-representation correlations (median $r \in [-0.32, 0.28]$). When correlating over both representations and tasks simultaneously, ROGI-XD achieves $r = 0.91$-$0.99$ versus $r = 0.68$-$0.84$ for the original ROGI.</p>
<p>Using this validated metric, the authors find that pretrained representations do not produce smoother QSPR surfaces than fingerprints or descriptors. In more than 50% of tasks, both descriptors and fingerprints generate smoother surfaces. The median relative ROGI-XD increase for pretrained representations is 9.1-21.3% compared to descriptors and 2.3-10.1% compared to fingerprints, indicating rougher surfaces.</p>
<p>As a practical tool, ROGI-XD can guide representation selection without exhaustive benchmarking. Selecting the representation with the lowest ROGI-XD for each task and then optimizing over model architecture results in only a 6.8% average relative increase in best-case model error across the 17 tasks. In 8 of 17 tasks, the lowest ROGI-XD correctly identifies the optimal representation.</p>
<p>Fine-tuning can improve smoothness. On the Lipophilicity task ($N_{\text{tot}} = 4200$), fine-tuning the VAE with a contrastive loss reduces ROGI-XD from 0.254 to 0.107 ($\pm 0.02$), well below the descriptor baseline of 0.227. On the smaller CACO2 task ($N_{\text{tot}} = 910$), fine-tuning yields ROGI-XD of 0.143 ($\pm 0.05$), comparable to descriptors at 0.132. The impact of fine-tuning is sensitive to both the task and the amount of labeled data.</p>
<h2 id="implications-for-chemical-foundation-model-development">Implications for Chemical Foundation Model Development</h2>
<p>The lack of smoothness in pretrained QSPR surfaces explains the inconsistent empirical performance of chemical foundation models. The authors note that ROGI-XD is thematically similar to a contrastive loss, as both scale proportionally with the frequency and severity of activity cliffs. This connection suggests that imposing stronger smoothness assumptions during pretraining, for example through weak supervision on calculable molecular properties, could help produce representations that generalize better to downstream property prediction. ROGI-XD provides a practical tool for evaluating new pretraining strategies without exhaustive benchmark testing: a representation with lower ROGI-XD on a given task is likely to yield lower model error.</p>
<p>A limitation is that the study treats pretrained representations as static (frozen features). Fine-tuning introduces many additional design choices and can substantially improve representation quality, but this evaluation is left for future work. Additionally, the survey of pretrained models is not exhaustive and focuses on four representative architectures.</p>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/coleygroup/rogi-xd">coleygroup/rogi-xd</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained models and notebooks; results reproducible via <code>make all</code></td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (VAE, GIN)</td>
          <td>ZINC 250k</td>
          <td>250,000</td>
          <td>80/20 train/val split</td>
      </tr>
      <tr>
          <td>Pretraining (ChemBERTa)</td>
          <td>PubChem</td>
          <td>77M</td>
          <td>Masked language modeling</td>
      </tr>
      <tr>
          <td>Pretraining (ChemGPT)</td>
          <td>PubChem 10M</td>
          <td>10M</td>
          <td>Causal language modeling</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TDC ADMET</td>
          <td>~900-10,000 per task</td>
          <td>12 regression tasks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>GuacaMol oracles</td>
          <td>10,000 per task</td>
          <td>5 synthetic tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>ROGI-XD</strong>: Hierarchical clustering (complete linkage) with integration over $1 - \log N_{\text{clusters}} / \log N$</li>
<li><strong>Cross-validation</strong>: 5-fold CV with KNN, MLP, PLS, RF (n_estimators=50), SVR from scikit-learn</li>
<li><strong>Fine-tuning loss</strong>: $\mathscr{L} = \mathscr{L}_{\text{CE}} + \beta \cdot \mathscr{L}_{\text{KL}} + \gamma \cdot \mathscr{L}_{\text{cont}}$ with $\beta = 0.1$, $\gamma = 50$; the contrastive term uses cosine distance in latent space and absolute difference in target space</li>
</ul>
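<p>One hypothetical reading of the contrastive term, sketched for illustration only (the exact pairing and weighting used by the authors may differ): penalize disagreement between cosine distance in latent space and absolute difference in target space across molecule pairs.</p>

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two latent vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def contrastive_term(Z, y):
    """Hypothetical L_cont: squared mismatch between latent cosine distance
    and absolute target difference, averaged over all molecule pairs."""
    n = len(y)
    total, num_pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (cosine_distance(Z[i], Z[j]) - abs(y[i] - y[j])) ** 2
            num_pairs += 1
    return total / num_pairs
```

<p>A loss of this shape pushes molecules with similar properties together in latent space, which is exactly the smoothness that ROGI-XD measures.</p>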
<h3 id="hardware">Hardware</h3>
<p>Two AMD Ryzen Threadripper PRO 3995WX CPUs, four NVIDIA A5000 GPUs, 512 GB RAM, Ubuntu 20.04 LTS.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Graff, D. E., Pyzer-Knapp, E. O., Jordan, K. E., Shakhnovich, E. I., &amp; Coley, C. W. (2023). Evaluating the roughness of structure-property relationships using pretrained molecular representations. <em>Digital Discovery</em>, 2(5), 1452-1460. <a href="https://doi.org/10.1039/d3dd00088e">https://doi.org/10.1039/d3dd00088e</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/coleygroup/rogi-xd">ROGI-XD Code Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{graff2023roughness,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evaluating the roughness of structure--property relationships using pretrained molecular representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Graff, David E. and Pyzer-Knapp, Edward O. and Jordan, Kirk E. and Shakhnovich, Eugene I. and Coley, Connor W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1452--1460}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d3dd00088e}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGenSurvey: Systematic Survey of ML for Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgensurvey-molecule-design/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgensurvey-molecule-design/</guid><description>Survey of ML molecule design methods across 1D string, 2D graph, and 3D geometry representations with deep generative and optimization approaches.</description><content:encoded><![CDATA[<h2 id="a-taxonomy-for-ml-driven-molecule-design">A Taxonomy for ML-Driven Molecule Design</h2>
<p>This is a <strong>Systematization</strong> paper that reviews machine learning approaches for molecule design across all three major molecular representations (1D string, 2D graph, 3D geometry) and both deep generative and combinatorial optimization paradigms. Prior surveys (including <a href="/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/">Sánchez-Lengeling &amp; Aspuru-Guzik, 2018</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/">Elton et al., 2019</a>, Xue et al. 2019, Vanhaelen et al. 2020, Alshehri et al. 2020, Jiménez-Luna et al. 2020, and Axelrod et al. 2022) each covered subsets of the literature (e.g., only generative methods, or only specific task types). MolGenSurvey extends these by unifying the field into a single taxonomy based on input type, output type, and generation goal, identifying eight distinct molecule generation tasks. It catalogs over 100 methods across these categories and provides a structured comparison of evaluation metrics, datasets, and experimental setups.</p>
<p>The chemical space of drug-like molecules is estimated at $10^{23}$ to $10^{60}$, making exhaustive enumeration computationally infeasible. Traditional high-throughput screening searches existing databases but is slow and expensive. ML-based generative approaches offer a way to intelligently explore this space, either by learning continuous latent representations (deep generative models) or by directly searching the discrete chemical space (combinatorial optimization methods).</p>
<h2 id="molecular-representations">Molecular Representations</h2>
<p>The survey identifies three mainstream featurization approaches for molecules, each carrying different tradeoffs for generation tasks.</p>
<h3 id="1d-string-descriptions">1D String Descriptions</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> are the two dominant string representations. SMILES encodes molecules as character strings following grammar rules for bonds, branches, and ring closures. Its main limitation is that arbitrary strings are often chemically invalid. SELFIES redefines the derivation rules for branches and rings so that every string decodes to a valid molecule, achieving 100% validity by construction.</p>
<p>Other string representations exist (InChI, SMARTS) but are less commonly used for generation. Representation learning over strings has adopted CNNs, RNNs, and Transformers from NLP.</p>
<h3 id="2d-molecular-graphs">2D Molecular Graphs</h3>
<p>Molecules naturally map to graphs where atoms are nodes and bonds are edges. Graph neural networks (GNNs), particularly those following the message-passing neural network (MPNN) framework, have become the standard representation method. An MPNN updates each node&rsquo;s representation by aggregating messages from its immediate neighbors at every layer, so that $K$ stacked layers cover a $K$-hop neighborhood. Notable architectures include D-MPNN (directional message passing), PNA (diverse aggregation methods), AttentiveFP (attention-based), and Graphormer (transformer-based).</p>
<h3 id="3d-molecular-geometry">3D Molecular Geometry</h3>
<p>Molecules are inherently 3D objects with conformations (3D structures at local energy minima) that determine function. Representing 3D geometry requires models that respect E(3) or SE(3) symmetry: outputs must either transform consistently with (equivariance) or remain unchanged under (invariance) rotations and translations of the input coordinates. The survey catalogs architectures along this line, including SchNet, DimeNet, EGNN, SphereNet, and PaiNN.</p>
<p>Additional featurization methods (molecular fingerprints/descriptors, 3D density maps, 3D surface meshes, and chemical images) are noted but have seen limited use in generation tasks.</p>
<h2 id="deep-generative-models">Deep Generative Models</h2>
<p>The survey covers six families of deep generative models applied to molecule design.</p>
<h3 id="autoregressive-models-ars">Autoregressive Models (ARs)</h3>
<p>ARs factorize the joint distribution of a molecule as a product of conditional distributions over its subcomponents:</p>
<p>$$p(\boldsymbol{x}) = \prod_{i=1}^{d} p(\bar{x}_i \mid \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{i-1})$$</p>
<p>For molecular graphs, this means sequentially predicting the next atom or bond conditioned on the partial structure built so far. RNNs, Transformers, and BERT-style models all implement this paradigm.</p>
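<p>A toy bigram model over SMILES characters (not any surveyed architecture) makes the factorization concrete: the probability of a string is the product of next-token conditionals, here estimated by counting with add-one smoothing.</p>

```python
import math
from collections import Counter, defaultdict

# Toy corpus; "^" and "$" mark sequence start and end.
corpus = ["CCO", "CCC", "CCN", "CC(=O)O"]
vocab = sorted(set("".join(corpus)) | {"$"})

# Fit p(x_i | x_{i-1}) by counting bigrams, with add-one smoothing.
counts = defaultdict(Counter)
for s in corpus:
    chars = ["^"] + list(s) + ["$"]
    for prev, cur in zip(chars, chars[1:]):
        counts[prev][cur] += 1


def log_prob(s):
    """log p(x) = sum_i log p(x_i | x_{<i}): the autoregressive factorization
    (each conditional here only looks one token back)."""
    total = 0.0
    chars = ["^"] + list(s) + ["$"]
    for prev, cur in zip(chars, chars[1:]):
        total += math.log((counts[prev][cur] + 1) / (sum(counts[prev].values()) + len(vocab)))
    return total
```

<p>Neural ARs replace the count table with an RNN or Transformer that conditions on the full prefix, but the chain-rule decomposition is identical.</p>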
<h3 id="variational-autoencoders-vaes">Variational Autoencoders (VAEs)</h3>
<p>VAEs learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$\log p(\boldsymbol{x}) \geq \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}[\log p(\boldsymbol{x}|\boldsymbol{z})] - D_{KL}(q(\boldsymbol{z}|\boldsymbol{x}) \,\|\, p(\boldsymbol{z}))$$</p>
<p>The first term is the reconstruction objective, and the second is a KL-divergence regularizer encouraging diverse, disentangled latent codes. Key molecular VAEs include <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">ChemVAE</a> (SMILES-based), JT-VAE (junction tree graphs), and <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GrammarVAE</a> (grammar-constrained SMILES).</p>
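<p>A minimal numeric sketch of the two ELBO terms, assuming a diagonal-Gaussian encoder and a standard-normal prior (the reconstruction term is passed in as a scalar stand-in for $\log p(\boldsymbol{x}|\boldsymbol{z})$):</p>

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps so gradients can flow through the sampler."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)


def elbo(log_px_given_z, mu, log_var):
    """ELBO = E_q[log p(x|z)] - KL(q(z|x) || N(0, I)), using the closed-form
    KL for a diagonal Gaussian: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return log_px_given_z - kl
```

<p>Because the KL term is non-negative, the ELBO never exceeds the reconstruction term; it equals it only when the posterior matches the prior exactly.</p>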
<h3 id="normalizing-flows-nfs">Normalizing Flows (NFs)</h3>
<p>NFs model $p(\boldsymbol{x})$ via an invertible, deterministic mapping between data and latent space, using the change-of-variable formula with Jacobian determinants. Molecular applications include GraphNVP, MoFlow (one-shot graph generation), GraphAF (autoregressive flow), and GraphDF (discrete flow).</p>
<h3 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p>GANs use a generator-discriminator game where the generator produces molecules and the discriminator distinguishes real from generated samples. Molecular GANs include MolGAN (graph-based with RL reward), <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> (SMILES-based with RL), and Mol-CycleGAN (molecule-to-molecule translation).</p>
<h3 id="diffusion-models">Diffusion Models</h3>
<p>Diffusion models learn to reverse a gradual noising process. The forward process adds Gaussian noise over $T$ steps; a neural network learns to denoise at each step. The training objective reduces to predicting the noise added at each step:</p>
<p>$$\mathcal{L}_t = \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\boldsymbol{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\right)\right\|^2\right]$$</p>
<p>Diffusion has been particularly successful for 3D conformation generation (ConfGF, GeoDiff, DGSM).</p>
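<p>The objective can be exercised numerically: noise $\boldsymbol{x}_0$ to step $t$ in closed form, then score a noise predictor with a squared error. The <code>eps_model</code> below is a hypothetical stand-in, not any surveyed architecture:</p>

```python
import numpy as np


def q_sample(x0, t, alphas_bar, eps):
    """Closed-form forward process: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps


def diffusion_loss(eps_model, x0, t, alphas_bar, rng):
    """L_t: mean squared error between the true and predicted noise."""
    eps = rng.normal(size=x0.shape)
    x_t = q_sample(x0, t, alphas_bar, eps)
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```

<p>A useful sanity check when implementing the objective: a predictor that algebraically inverts the forward process drives the loss to zero.</p>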
<h3 id="energy-based-models-ebms">Energy-Based Models (EBMs)</h3>
<p>EBMs define $p(\boldsymbol{x}) = \frac{\exp(-E_\theta(\boldsymbol{x}))}{A}$ where $E_\theta$ is a learned energy function. The challenge is computing the intractable partition function $A$, addressed via contrastive divergence, noise-contrastive estimation, or score matching.</p>
<h2 id="combinatorial-optimization-methods">Combinatorial Optimization Methods</h2>
<p>Unlike DGMs that learn from data distributions, combinatorial optimization methods (COMs) search directly over discrete chemical space using oracle calls to evaluate candidate molecules.</p>
<h3 id="reinforcement-learning-rl">Reinforcement Learning (RL)</h3>
<p>RL formulates molecule generation as a Markov Decision Process: states are partial molecules, actions are adding/removing atoms or bonds, and rewards come from property oracles. Methods include GCPN (graph convolutional policy network), MolDQN (deep Q-network), RationaleRL (property-aware substructure assembly), and REINVENT (SMILES-based policy gradient).</p>
<h3 id="genetic-algorithms-ga">Genetic Algorithms (GA)</h3>
<p>GAs maintain a population of molecules and evolve them through mutation and crossover operations. GB-GA operates on molecular graphs, GA+D uses SELFIES with adversarial discriminator enhancement, and JANUS uses SELFIES with parallel exploration strategies.</p>
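<p>These methods share a common skeleton, sketched here generically (a toy illustration; real GB-GA operators apply chemistry-aware mutations and crossovers to molecular graphs, and <code>mutate</code>/<code>crossover</code> below are placeholders supplied by the caller):</p>

```python
import random


def genetic_search(pop, fitness, mutate, crossover, generations=20, seed=0):
    """Generic GA loop: keep the fitter half as parents (elitism), refill
    the population with mutated crossovers, return the best individual."""
    rng = random.Random(seed)
    for _ in range(generations):
        pop = sorted(pop, key=fitness, reverse=True)
        parents = pop[: max(2, len(pop) // 2)]
        children = []
        while len(parents) + len(children) < len(pop):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pop = parents + children
    return max(pop, key=fitness)
```

<p>Elitism guarantees the best fitness seen so far never decreases, while mutation and crossover keep the search exploring new structures.</p>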
<h3 id="bayesian-optimization-bo">Bayesian Optimization (BO)</h3>
<p>BO builds a Gaussian process surrogate of the objective function and uses an acquisition function to decide which molecules to evaluate next. It is often combined with VAE latent spaces (Constrained-BO-VAE, MSO) to enable continuous optimization.</p>
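<p>A minimal sketch of the surrogate-plus-acquisition loop, with a hand-rolled RBF-kernel GP posterior and an upper-confidence-bound (UCB) acquisition standing in for the more elaborate kernels and acquisitions used by the methods above:</p>

```python
import numpy as np


def rbf(A, B, length_scale=0.5):
    """Squared-exponential kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior(X, y, X_query, noise=1e-6):
    """GP surrogate: posterior mean and variance at the query points."""
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(X, X_query)
    mu = K_s.T @ np.linalg.solve(K, y)
    var = rbf(X_query, X_query).diagonal() - (K_s * np.linalg.solve(K, K_s)).sum(axis=0)
    return mu, var


def ucb_pick(X, y, candidates, kappa=2.0):
    """Acquisition: choose the candidate maximizing mean + kappa * std."""
    mu, var = gp_posterior(X, y, candidates)
    return candidates[np.argmax(mu + kappa * np.sqrt(np.maximum(var, 0.0)))]
```

<p>In the VAE-latent-space variants, <code>X</code> holds latent codes of already-evaluated molecules and the chosen candidate is decoded back to a molecule for the next oracle call.</p>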
<h3 id="monte-carlo-tree-search-mcts">Monte Carlo Tree Search (MCTS)</h3>
<p>MCTS explores the molecular construction tree by branching and evaluating promising intermediates. ChemTS and MP-MCTS combine MCTS with autoregressive SMILES generators.</p>
<h3 id="mcmc-sampling">MCMC Sampling</h3>
<p>MCMC methods (MIMOSA, MARS) formulate molecule optimization as sampling from a target distribution defined by multiple property objectives, using graph neural networks as proposal distributions.</p>
<h3 id="other-approaches">Other Approaches</h3>
<p>The survey also identifies two additional paradigms that do not fit neatly into either DGM or COM categories. <strong>Optimal Transport (OT)</strong> is used when matching between groups of molecules, particularly for conformation generation where each molecule has multiple associated 3D structures (e.g., GeoMol, EquiBind). <strong>Differentiable Learning</strong> formulates discrete molecules as differentiable objects, enabling gradient-based continuous optimization directly on molecular graphs (e.g., DST).</p>
<h2 id="task-taxonomy-eight-molecule-generation-tasks">Task Taxonomy: Eight Molecule Generation Tasks</h2>
<p>The survey&rsquo;s central organizational contribution is a unified taxonomy of eight distinct molecule design tasks, defined by three axes: (1) whether generation is <em>de novo</em> (from scratch, no reference molecule) or conditioned on an input molecule, (2) whether the goal is <em>generation</em> (distribution learning, producing valid and diverse molecules) or <em>optimization</em> (goal-directed search for molecules with specific properties), and (3) the input/output data representation (1D string, 2D graph, 3D geometry). The paper&rsquo;s Table 2 maps all combinations of these axes, showing that many are not meaningful (e.g., 1D string input to 2D graph output with no goal). Only eight combinations correspond to active research areas.</p>
<h3 id="1d2d-tasks">1D/2D Tasks</h3>
<ul>
<li><strong>De novo 1D/2D molecule generation</strong>: Generate new molecules from scratch to match a training distribution. Methods span VAEs (ChemVAE, JT-VAE), flows (GraphNVP, MoFlow, GraphAF), GANs (MolGAN, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>), ARs (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/">MolecularRNN</a>), and EBMs (GraphEBM).</li>
<li><strong>De novo 1D/2D molecule optimization</strong>: Generate molecules with optimal properties from scratch, using oracle feedback. Methods include RL (GCPN, MolDQN), GA (GB-GA, JANUS), MCTS (ChemTS), and MCMC (MIMOSA, MARS).</li>
<li><strong>1D/2D molecule optimization</strong>: Optimize properties of a given input molecule via local search. Methods include graph-to-graph translation (VJTNN, CORE, MOLER), VAE+BO (MSO, Constrained-BO-VAE), GANs (Mol-CycleGAN, <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>), and differentiable approaches (DST).</li>
</ul>
<h3 id="3d-tasks">3D Tasks</h3>
<ul>
<li><strong>De novo 3D molecule generation</strong>: Generate novel 3D molecular structures from scratch, respecting geometric validity. Methods include ARs (G-SchNet, G-SphereNet), VAEs (3DMolNet), flows (E-NFs), and RL (MolGym).</li>
<li><strong>De novo 3D conformation generation</strong>: Generate 3D conformations from given 2D molecular graphs. Methods include VAEs (CVGAE, ConfVAE), diffusion models (ConfGF, GeoDiff, DGSM), and optimal transport (GeoMol).</li>
<li><strong>De novo binding-based 3D molecule generation</strong>: Design 3D molecules for specific protein binding pockets. Methods include density-based VAEs (liGAN), RL (DeepLigBuilder), and ARs (3DSBDD).</li>
<li><strong>De novo binding-pose conformation generation</strong>: Find the appropriate 3D conformation of a given molecule for a given protein pocket. Methods include EBMs (DeepDock) and optimal transport (EquiBind).</li>
<li><strong>3D molecule optimization</strong>: Optimize 3D molecular properties (scaffold replacement, conformation refinement). Methods include BO (BOA), ARs (3D-Scaffold, cG-SchNet), and VAEs (Coarse-GrainingVAE).</li>
</ul>
<h2 id="evaluation-metrics">Evaluation Metrics</h2>
<p>The survey organizes evaluation metrics into four categories.</p>
<h3 id="generation-evaluation">Generation Evaluation</h3>
<p>Basic metrics assess the quality of generated molecules:</p>
<ul>
<li><strong>Validity</strong>: fraction of chemically valid molecules among all generated molecules</li>
<li><strong>Novelty</strong>: fraction of generated molecules absent from the training set</li>
<li><strong>Uniqueness</strong>: fraction of distinct molecules among generated samples</li>
<li><strong>Quality</strong>: fraction passing a predefined chemical rule filter</li>
<li><strong>Diversity</strong> (internal/external): measured via pairwise similarity (Tanimoto, scaffold, or fragment) within generated set and between generated and training sets</li>
</ul>
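<p>These set-level metrics are simple to compute once a validity check is available. A minimal Python sketch, where <code>is_valid</code> stands in for a chemistry-toolkit parse (e.g., RDKit) and molecules are compared as canonical strings; the helper names are illustrative, not from the survey:</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty over generated molecule strings.

    `is_valid` is a placeholder for a real chemistry check (e.g., RDKit
    parsing of SMILES). Uniqueness is computed among valid molecules and
    novelty over the unique valid set, one common convention.
    """
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(unique - set(training_set)) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty
```

<p>Diversity metrics additionally require a pairwise similarity (e.g., Tanimoto over fingerprints) averaged within the generated set or against the training set.</p>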
<h3 id="distribution-evaluation">Distribution Evaluation</h3>
<p>Metrics measuring how well generated molecules capture the training distribution: KL divergence over physicochemical descriptors, <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD), and Maximum Mean Discrepancy (MMD).</p>
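<p>MMD compares two samples through a kernel. A minimal numpy sketch of the biased squared-MMD estimator with an RBF kernel over descriptor vectors; <code>gamma</code> is an illustrative bandwidth, not a value from the survey:</p>

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased squared-MMD estimate between descriptor sets X, Y (n, d arrays)."""
    def k(A, B):
        # pairwise squared distances, then RBF kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

<p>Identical samples give a value near zero; well-separated samples give a value near 2 for the RBF kernel.</p>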
<h3 id="optimization-evaluation">Optimization Evaluation</h3>
<p>Property oracles used as optimization targets: Synthetic Accessibility (SA), Quantitative Estimate of Drug-likeness (QED), LogP, kinase inhibition scores (GSK3-beta, JNK3), DRD2 activity, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark oracles, and Vina docking scores. Constrained optimization additionally considers structural similarity to reference molecules via Tanimoto, scaffold, or fragment similarity.</p>
<h3 id="3d-evaluation">3D Evaluation</h3>
<p>3D-specific metrics include stability (matching valence rules in 3D), RMSD and Kabsch-RMSD (conformation alignment), and Coverage/Matching scores for conformation ensembles.</p>
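<p>Kabsch-RMSD is RMSD after optimal rigid superposition: center both coordinate sets, take an SVD of their covariance, and correct for reflections. A minimal numpy sketch of the standard algorithm (not code from the survey):</p>

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between (N, 3) coordinate arrays after optimal superposition."""
    P = P - P.mean(axis=0)          # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                     # covariance of corresponding atoms
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])      # avoid improper rotation (reflection)
    R = Vt.T @ D @ U.T              # optimal rotation mapping P onto Q
    P_rot = P @ R.T
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))
```

<p>Plain RMSD skips the alignment step; Coverage/Matching scores for conformer ensembles are built on top of such pairwise RMSDs.</p>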
<h2 id="datasets">Datasets</h2>
<p>The survey catalogs 12 major datasets spanning 1D/2D and 3D molecule generation:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Scale</th>
          <th>Dimensionality</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC</td>
          <td>250K</td>
          <td>1D/2D</td>
          <td>Virtual screening compounds</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>2.1M</td>
          <td>1D/2D</td>
          <td>Bioactive molecules</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></td>
          <td>1.9M</td>
          <td>1D/2D</td>
          <td>Benchmarking generation</td>
      </tr>
      <tr>
          <td>CEPDB</td>
          <td>4.3M</td>
          <td>1D/2D</td>
          <td>Organic photovoltaics</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>970M</td>
          <td>1D/2D</td>
          <td>Enumerated small molecules</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>134K</td>
          <td>1D/2D/3D</td>
          <td>Quantum chemistry properties</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450K/37M</td>
          <td>1D/2D/3D</td>
          <td>Conformer ensembles</td>
      </tr>
      <tr>
          <td>ISO17</td>
          <td>200/431K</td>
          <td>1D/2D/3D</td>
          <td>Molecule-conformation pairs</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>3.9M</td>
          <td>1D/2D/3D</td>
          <td>DFT ground-state geometries</td>
      </tr>
      <tr>
          <td>CrossDock2020</td>
          <td>22.5M</td>
          <td>1D/2D/3D</td>
          <td>Docked ligand poses</td>
      </tr>
      <tr>
          <td>scPDB</td>
          <td>16K</td>
          <td>1D/2D/3D</td>
          <td>Binding sites</td>
      </tr>
      <tr>
          <td>DUD-E</td>
          <td>23K</td>
          <td>1D/2D/3D</td>
          <td>Active compounds with decoys</td>
      </tr>
  </tbody>
</table>
<h2 id="challenges-and-opportunities">Challenges and Opportunities</h2>
<h3 id="challenges">Challenges</h3>
<ol>
<li><strong>Out-of-distribution generation</strong>: Most deep generative models imitate known molecule distributions and struggle to explore truly novel chemical space.</li>
<li><strong>Unrealistic problem formulation</strong>: Many task setups do not respect real-world chemistry constraints.</li>
<li><strong>Expensive oracle calls</strong>: Methods typically assume unlimited access to property evaluators, which is unrealistic in drug discovery.</li>
<li><strong>Lack of interpretability</strong>: Few methods explain why generated molecules have desired properties. Quantitative interpretability evaluation remains an open problem.</li>
<li><strong>No unified evaluation protocols</strong>: The field lacks consensus on what defines a &ldquo;good&rdquo; drug candidate and how to fairly compare methods.</li>
<li><strong>Insufficient benchmarking</strong>: Despite the enormous chemical space ($10^{23}$ to $10^{60}$ drug-like molecules), available benchmarks use only small fractions of large databases.</li>
<li><strong>Low-data regime</strong>: Many real-world applications have limited training data, and generating molecules under data scarcity remains difficult.</li>
</ol>
<h3 id="opportunities">Opportunities</h3>
<ol>
<li><strong>Extension to complex structured data</strong>: Techniques from small molecule generation may transfer to proteins, antibodies, genes, crystal structures, and polysaccharides.</li>
<li><strong>Connection to later drug development phases</strong>: Bridging the gap between molecule design and preclinical/clinical trial outcomes could improve real-world impact.</li>
<li><strong>Knowledge discovery</strong>: Generative models over molecular latent spaces could reveal chemical rules governing molecular properties, and graph structure learning could uncover implicit non-bonded interactions.</li>
</ol>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The survey was published in March 2022, so it does not cover subsequent advances in diffusion models for molecules (e.g., EDM, DiffSBDD), large language models applied to chemistry, or flow matching approaches.</li>
<li>Coverage focuses on small molecules. Macromolecule design (proteins, nucleic acids) is noted as a future direction rather than surveyed.</li>
<li>The survey catalogs methods but does not provide head-to-head experimental comparisons across all 100+ methods. Empirical discussion relies on individual papers&rsquo; reported results.</li>
<li>1D string-based methods receive less detailed coverage than graph and geometry-based approaches, reflecting the field&rsquo;s shift toward structured representations at the time of writing.</li>
<li>As a survey, this paper produces no code, models, or datasets. The surveyed methods&rsquo; individual repositories are referenced in their original publications but are not aggregated here.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Du, Y., Fu, T., Sun, J., &amp; Liu, S. (2022). MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design. <em>arXiv preprint arXiv:2203.14500</em>.</p>
<p><strong>Publication</strong>: arXiv preprint, March 2022. <strong>Note</strong>: This survey covers literature through early 2022 and does not include subsequent advances in diffusion models, LLMs for chemistry, or flow matching.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2203.14500">arXiv: 2203.14500</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{du2022molgensurvey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Du, Yuanqi and Fu, Tianfan and Sun, Jimeng and Liu, Shengchao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2203.14500}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Exposing Limitations of Molecular ML with Activity Cliffs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/</guid><description>A benchmark of 24 ML methods on activity cliff compounds across 30 drug targets, showing descriptor-based models outperform deep learning.</description><content:encoded><![CDATA[<h2 id="a-benchmark-for-activity-cliff-prediction">A Benchmark for Activity Cliff Prediction</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>The paper systematically benchmarks 24 machine learning and deep learning approaches on their ability to predict bioactivity for activity cliff compounds: pairs of structurally similar molecules that exhibit large differences in potency. These cases violate the similarity principle (similar structure implies similar activity) and represent a practical failure mode for <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a> in drug discovery. The authors release MoleculeACE, an open-source benchmarking platform for evaluating ML models on activity cliffs.</p>
<h2 id="activity-cliffs-as-a-blind-spot-in-molecular-ml">Activity Cliffs as a Blind Spot in Molecular ML</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Chemical_similarity">similarity principle</a> underpins most molecular ML: structurally similar compounds should have similar properties. Activity cliffs are the exceptions, where small structural changes cause large potency shifts (e.g., a single substituent change causing a $10\times$ difference in $K_i$).</p>
<p>Despite their importance for <a href="https://en.wikipedia.org/wiki/Hit_to_lead">hit-to-lead optimization</a>, activity cliffs have received limited attention in ML benchmarking. Standard metrics like RMSE computed over entire test sets can mask poor predictions on cliff compounds. A model might achieve low overall error while systematically mispredicting these edge cases, which are precisely the molecules that matter most for medicinal chemistry applications.</p>
<p>The authors identify 7-52% of compounds as activity cliff molecules across their 30 target datasets, showing this is not a rare phenomenon.</p>
<h2 id="defining-and-detecting-activity-cliffs">Defining and Detecting Activity Cliffs</h2>
<p>The authors use three complementary similarity metrics to identify activity cliffs:</p>
<ol>
<li><strong>Substructure similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto coefficient</a> on extended connectivity fingerprints (ECFPs), capturing shared radial substructures</li>
<li><strong>Scaffold similarity</strong>: Tanimoto coefficient on ECFPs computed from molecular graph frameworks, detecting core/decoration differences</li>
<li><strong>SMILES similarity</strong>: <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> on canonical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, capturing character-level insertions, deletions, and translocations</li>
</ol>
<p>Pairs with $\geq 90\%$ similarity on <strong>any one</strong> of the three metrics and $&gt; 10\times$ difference in bioactivity ($K_i$ or $\text{EC}_{50}$) are classified as activity cliff pairs. This union-based approach (rather than requiring agreement across all metrics) captures different types of structural relationships relevant to medicinal chemistry.</p>
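<p>The detection rule is straightforward to sketch. Below, fingerprints are represented as sets of on-bit indices and only a single fingerprint-based similarity is checked; the paper takes a union over three metrics, and the function names here are illustrative:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def is_activity_cliff(fp_a, fp_b, pot_a, pot_b, sim_cut=0.9, fold=10.0):
    """Cliff pair if similarity >= sim_cut on this metric AND >10x potency gap.

    A single-metric sketch of the paper's union-based rule.
    """
    ratio = max(pot_a, pot_b) / min(pot_a, pot_b)
    return tanimoto(fp_a, fp_b) >= sim_cut and ratio > fold
```

<p>In the full rule, a pair counts as a cliff if any of the substructure, scaffold, or SMILES-based similarities clears the cutoff.</p>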
<h2 id="24-methods-across-30-drug-targets">24 Methods Across 30 Drug Targets</h2>
<p>The benchmark evaluates 16 traditional ML configurations (4 algorithms $\times$ 4 descriptor types) and 8 deep learning approaches across 30 curated <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v29 datasets (48,707 total molecules).</p>
<p><strong>Traditional ML algorithms</strong>: KNN, RF, GBM, SVM, each combined with ECFPs, MACCS keys, WHIM descriptors, or physicochemical properties.</p>
<p><strong>Deep learning methods</strong>: MPNN, GCN, GAT, Attentive FP (graph-based), plus LSTM, CNN, Transformer/<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> (SMILES-based), and an MLP on ECFPs.</p>
<p>Performance is measured with both standard RMSE and a dedicated $\text{RMSE}_{\text{cliff}}$ computed only on activity cliff compounds in the test set:</p>
<p>$$
\text{RMSE}_{\text{cliff}} = \sqrt{\frac{\sum_{j=1}^{n_c} (\hat{y}_j - y_j)^2}{n_c}}
$$</p>
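<p>Restricting RMSE to cliff compounds amounts to masking before averaging. A minimal numpy sketch of both metrics (illustrative, not the MoleculeACE implementation):</p>

```python
import numpy as np

def rmse(y_true, y_pred, mask=None):
    """RMSE, optionally restricted by a boolean mask (e.g., cliff compounds)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if mask is not None:
        y_true, y_pred = y_true[mask], y_pred[mask]
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```

<p>Passing the cliff-compound indicator as <code>mask</code> yields $\text{RMSE}_{\text{cliff}}$; omitting it yields standard RMSE.</p>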
<p>Key results:</p>
<ul>
<li><strong>Molecular descriptors matter more than algorithms</strong>: The choice of descriptor (ECFPs vs. MACCS vs. WHIM vs. physicochemical) had a larger impact on $\text{RMSE}_{\text{cliff}}$ than the choice of ML algorithm ($p &lt; 0.05$, <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Wilcoxon rank-sum test</a> with <a href="https://en.wikipedia.org/wiki/False_discovery_rate">Benjamini-Hochberg correction</a>).</li>
<li><strong>SVM + ECFPs wins on average</strong>: The best overall method for activity cliff prediction, though the difference from RF + ECFPs or GBM + ECFPs was not statistically significant.</li>
<li><strong>Deep learning underperforms</strong>: All graph and SMILES-based deep learning methods performed worse than a simple MLP on ECFPs. Among deep learning, LSTM with transfer learning (pretrained on 36K molecules) was the best, outperforming the ChemBERTa transformer pretrained on 10M compounds.</li>
<li><strong>Large case-by-case variation</strong>: $\text{RMSE}_{\text{cliff}}$ ranged from 0.62 to 1.60 log units across datasets, with no method consistently best. Deep learning methods showed the highest variance across targets.</li>
</ul>
<h2 id="simple-descriptors-beat-complex-architectures-on-cliffs">Simple Descriptors Beat Complex Architectures on Cliffs</h2>
<p>The core finding is that activity cliffs expose a gap in learned molecular representations. Despite graph neural networks and transformers being able to learn directly from molecular structure, they fail to capture the subtle structural differences that drive activity cliffs.</p>
<p>Key observations:</p>
<ul>
<li><strong>RMSE and $\text{RMSE}_{\text{cliff}}$ correlate ($r = 0.81$ on average)</strong>, so optimizing overall error usually helps with cliffs too. But this correlation breaks down for some targets (e.g., CLK4), where methods with similar RMSE can have very different $\text{RMSE}_{\text{cliff}}$.</li>
<li><strong>Training set size matters for the RMSE/$\text{RMSE}_{\text{cliff}}$ correlation</strong>: Datasets with $&gt; 1000$ training molecules show $r &gt; 0.80$ between the two metrics. In low-data regimes, the correlation weakens, making dedicated cliff evaluation more important.</li>
<li><strong>No relationship between % cliff compounds and model performance</strong>, and no target-family-specific effects were found.</li>
<li><strong>Transfer learning helped SMILES models (LSTM) but not graph models</strong>: Self-supervised pretraining strategies (context prediction, infomax, edge prediction, masking) did not improve GNN performance, consistent with findings from other studies.</li>
</ul>
<p>The MoleculeACE platform provides standardized data curation, activity cliff detection, and cliff-specific evaluation, enabling researchers to assess new methods against this benchmark.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Source</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmarking</td>
          <td>ChEMBL v29</td>
          <td>48,707 molecules (35,632 unique) across 30 targets</td>
          <td>Curated for duplicates, salts, outliers</td>
      </tr>
      <tr>
          <td>Smallest dataset</td>
          <td>JAK1</td>
          <td>615 molecules</td>
          <td>7% activity cliffs</td>
      </tr>
      <tr>
          <td>Largest dataset</td>
          <td>DRD3</td>
          <td>3,657 molecules</td>
          <td>39% activity cliffs</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Activity cliff detection</strong>: Pairwise similarity $\geq 0.9$ (Tanimoto on ECFPs, scaffold ECFPs, or Levenshtein on SMILES) with $&gt; 10\times$ potency difference</li>
<li><strong>Splitting</strong>: <a href="https://en.wikipedia.org/wiki/Spectral_clustering">Spectral clustering</a> on ECFPs (5 clusters), 80/20 stratified split preserving cliff proportion</li>
<li><strong>Hyperparameter optimization</strong>: <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian optimization</a> with Gaussian process, max 50 combinations, 5-fold cross-validation</li>
<li><strong>SMILES augmentation</strong>: 10-fold for all SMILES-based methods</li>
<li><strong>Transfer learning</strong>: LSTM pretrained on 36,281 merged training molecules (next-character prediction); ChemBERTa pretrained on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> compounds</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Traditional ML</strong>: KNN, RF, GBM, SVM (scikit-learn v1.0.2)</li>
<li><strong>Descriptors</strong>: ECFPs (1024-bit, radius 2), MACCS keys (166-bit), WHIM (114 descriptors), physicochemical (11 properties)</li>
<li><strong>GNNs</strong>: MPNN, GCN, GAT, AFP (PyTorch Geometric v2.0.4), with graph multiset transformer pooling</li>
<li><strong>SMILES models</strong>: LSTM (4 layers, 5.8M params), 1D CNN, ChemBERTa transformer</li>
<li><strong>Total models trained</strong>: 720 (24 methods $\times$ 30 targets)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE</td>
          <td>All test molecules</td>
          <td>Standard root-mean-square error on $\text{pK}_i$ / $\text{pEC}_{50}$</td>
      </tr>
      <tr>
          <td>$\text{RMSE}_{\text{cliff}}$</td>
          <td>Activity cliff compounds only</td>
          <td>RMSE restricted to cliff molecules in test set</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE">MoleculeACE</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Benchmark platform with all 30 curated datasets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE/tree/main/MoleculeACE/Data/benchmark_data">Curated datasets</a></td>
          <td>Data</td>
          <td>MIT</td>
          <td>Processed ChEMBL bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: van Tilborg, D., Alenicheva, A., &amp; Grisoni, F. (2022). Exposing the Limitations of Molecular Machine Learning with Activity Cliffs. <em>Journal of Chemical Information and Modeling</em>, 62(23), 5938-5951. <a href="https://doi.org/10.1021/acs.jcim.2c01073">https://doi.org/10.1021/acs.jcim.2c01073</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molML/MoleculeACE">MoleculeACE GitHub Repository</a></li>
<li><a href="https://chemrxiv.org/engage/chemrxiv/article-details/630cc44058843b8403a19810">ChemRxiv Preprint</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{vantilborg2022activity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exposing the Limitations of Molecular Machine Learning with Activity Cliffs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{van Tilborg, Derek and Alenicheva, Alisa and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5938--5951}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01073}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Scaling Laws vs Model Architectures: Inductive Bias</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/</guid><description>Tay et al.'s 2022 study comparing scaling behavior across ten model architectures, showing that inductive bias affects scaling properties in distinct ways.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>systematization paper</strong> that conducts a large-scale empirical comparison of how ten different model architectures scale. Rather than proposing a new architecture, it characterizes the relationship between inductive bias and scaling behavior across both upstream (pretraining) and downstream (transfer) performance.</p>
<h2 id="why-architecture-aware-scaling-matters">Why architecture-aware scaling matters</h2>
<p>Prior work on scaling laws (Kaplan et al., 2020) focused almost exclusively on vanilla Transformers, finding that loss scales as a power law with model size, dataset size, and compute. A common assumption in the field is that improvements observed at one scale transfer to other scales, and new architectures are often evaluated at a single compute point (e.g., base size). This paper challenges that assumption by asking whether different inductive biases scale differently.</p>
<h2 id="ten-architectures-one-controlled-setup">Ten architectures, one controlled setup</h2>
<p>All models are implemented in Mesh TensorFlow under a shared encoder-decoder (<a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style) framework, pretrained on C4 for $2^{19}$ steps with the Adafactor optimizer and an inverse square root learning rate schedule, and finetuned for 100K steps on GLUE + SuperGLUE + SQuAD. Models range from 15M to 40B parameters, trained on 16 TPU-v3 chips. The ten architectures span four categories:</p>
<p><strong>Transformer variants</strong>: vanilla Transformer, Evolved Transformer (AutoML-derived), Universal Transformer (parameter sharing + recurrence), Switch Transformer (sparse MoE)</p>
<p><strong>Efficient variants</strong>: Performer (linear attention), Funnel Transformer (sequence downsampling), ALBERT (cross-layer parameter sharing + embedding factorization)</p>
<p><strong>General improvements</strong>: Mixture of Softmaxes (MoS), Gated Linear Units (GLU)</p>
<p><strong>Non-Transformers</strong>: Lightweight Convolutions, Dynamic Convolutions, MLP-Mixer</p>
<h2 id="key-findings-on-scaling-behavior">Key findings on scaling behavior</h2>
<h3 id="architecture-changes-the-scaling-slope">Architecture changes the scaling slope</h3>
<p>The paper fits linear scaling laws in log-log space (i.e., power law fits of the form $L \propto C^{-\alpha}$) for each model across multiple axes (FLOPs vs. upstream, FLOPs vs. downstream, etc.). The vanilla Transformer has the highest scaling coefficient on most reported axes ($\alpha_{F,U} = 0.54$, $\alpha_{F,D} = 0.28$). Models that make minimal changes to the Transformer (GLU, MoS) retain similar scaling behavior. Models with more radical inductive biases show worse scaling:</p>
<ul>
<li><strong>Performer</strong> (linear attention): $\alpha_{F,U} = 0.25$, upstream perplexity decreases only 2.7% from base to large vs. 8.4% for vanilla Transformer</li>
<li><strong>ALBERT</strong>: scales negatively on downstream ($\alpha_{F,D} = -0.12$), getting worse as compute increases. ALBERT was designed for parameter efficiency (cross-layer weight sharing, embedding factorization), not compute efficiency, so this result is expected: additional FLOPs reuse the same parameters without adding capacity</li>
<li><strong>MLP-Mixer</strong>: near-zero downstream scaling ($\alpha_{F,D} = -0.03$)</li>
</ul>
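<p>Fitting such a power law reduces to linear regression in log-log space, with $\alpha$ recovered as minus the slope. A numpy sketch on synthetic data (illustrative values, not the paper's measurements):</p>

```python
import numpy as np

def fit_scaling_exponent(compute, loss):
    """Fit L = a * C^(-alpha) by linear regression in log-log space.

    Returns alpha; np.polyfit gives [slope, intercept], and slope = -alpha.
    """
    slope, _ = np.polyfit(np.log(compute), np.log(loss), 1)
    return -slope
```

<p>On exact power-law data, e.g. $L = 5 C^{-0.5}$, the fit recovers $\alpha = 0.5$; on real measurements the fit quality itself indicates how well a single power law describes the architecture's scaling.</p>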
<h3 id="the-best-architecture-changes-with-scale">The best architecture changes with scale</h3>
<p>Models that perform well at small compute budgets are not necessarily the best at larger budgets. For example, the Evolved Transformer outperforms vanilla Transformers at tiny-to-small scale on downstream tasks but falls behind when scaled up. MoS-Transformer outperforms vanilla Transformers in some compute regions but not others.</p>
<h3 id="upstream-and-downstream-scaling-diverge">Upstream and downstream scaling diverge</h3>
<p>Good upstream perplexity scaling does not guarantee good downstream transfer scaling. Funnel Transformers and Lightweight Convolutions hold up reasonably well on upstream perplexity but suffer substantially on downstream tasks. Switch Transformers show the best upstream-to-downstream transfer ratio ($\alpha_{U,D} = 0.58$).</p>
<h3 id="depth-and-width-affect-architectures-differently">Depth and width affect architectures differently</h3>
<p>Depth scaling has a more substantial impact on downstream performance than width scaling across most architectures. Evolved Transformers are a partial exception, scaling slightly better under width scaling compared to other architectures on downstream tasks.</p>
<h2 id="practical-implications">Practical implications</h2>
<p>The authors offer concrete guidance: practitioners should be cautious about staking expensive large-scale runs on architectures that drastically modify the attention mechanism. Performers and MLP-Mixers are characterized as &ldquo;high risk&rdquo; options. This helps explain why most large language models at the time (PaLM, Gopher, UL2) use relatively vanilla Transformer architectures.</p>
<p>The paper also notes that not every use case requires billion-parameter models. Inductive biases tailored to small or low-compute regimes remain valuable when scaling is not the priority.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>No code or trained model weights were publicly released with this paper. The experiments rely on Google&rsquo;s internal Mesh TensorFlow infrastructure with 16 TPU-v3 chips, and pretraining uses the publicly available C4 corpus. Finetuning benchmarks (GLUE, SuperGLUE, SQuAD) are all publicly available. However, reproducing the full study would require substantial compute resources and re-implementation of all ten architectures within a shared framework.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://arxiv.org/abs/2207.10551">arXiv paper</a></td>
          <td>Paper</td>
          <td>Open access</td>
          <td>Full paper with appendices</td>
      </tr>
      <tr>
          <td><a href="https://www.tensorflow.org/datasets/catalog/c4">C4 corpus</a></td>
          <td>Dataset</td>
          <td>ODC-BY</td>
          <td>Pretraining data</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components</strong>: No released code, model checkpoints, or training scripts. Internal Mesh TensorFlow codebase is not publicly available.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W., Rao, J., Narang, S., Tran, V. Q., Yogatama, D., &amp; Metzler, D. (2022). Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? <em>EMNLP 2022</em>.</p>
<p><strong>Publication</strong>: EMNLP 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2207.10551">arXiv</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{tay2022scaling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tay, Yi and Dehghani, Mostafa and Abnar, Samira and Chung, Hyung Won and Fedus, William and Rao, Jinfeng and Narang, Sharan and Tran, Vinh Q. and Yogatama, Dani and Metzler, Donald}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Relational Inductive Biases in Deep Learning (2018)</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/relational-inductive-biases-deep-learning-graph-networks/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/relational-inductive-biases-deep-learning-graph-networks/</guid><description>Battaglia et al.'s 2018 paper unifying graph neural network variants under a general graph network framework and analyzing relational inductive biases.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>systematization paper</strong> that is part position paper, part review, and part unification. It argues that combinatorial generalization, the ability to construct new inferences from known building blocks, is a top priority for AI. It frames relational inductive biases as the key design principle connecting standard deep learning architectures, presents the graph network (GN) as a general framework subsuming prior graph neural network variants, and advocates for combining structured approaches with deep learning rather than choosing between them.</p>
<h2 id="the-case-for-relational-inductive-biases">The case for relational inductive biases</h2>
<p>Human intelligence relies on representing the world as compositions of entities, relations, and rules. We understand complex systems by decomposing them into parts and their interactions. Modern deep learning&rsquo;s &ldquo;end-to-end&rdquo; philosophy minimizes structural assumptions, relying on data and compute to learn representations from scratch. The paper argues this approach struggles with combinatorial generalization: generalizing beyond one&rsquo;s experiences by composing known elements in new ways.</p>
<p>The authors reject the false dichotomy between &ldquo;hand-engineering&rdquo; and &ldquo;end-to-end&rdquo; learning. Just as biology uses both nature and nurture, they advocate for architectures that bake in useful structural assumptions (inductive biases) while still learning flexibly from data.</p>
<h2 id="inductive-biases-across-standard-architectures">Inductive biases across standard architectures</h2>
<p>The paper provides a systematic analysis of how existing architectures encode relational structure:</p>
<p><strong>Fully connected networks (MLPs)</strong>: The weakest relational inductive bias. All input units can interact with all others, with no reuse of parameters. No assumptions about the structure of the input.</p>
<p><strong>Convolutional networks (CNNs)</strong>: Encode locality (nearby elements interact) and translation invariance (the same local function is applied everywhere). The entities are individual units or grid elements (e.g., pixels), the relations are defined by the grid neighborhood, and the rule (convolution kernel) is shared across all positions.</p>
<p><strong>Recurrent networks (RNNs)</strong>: Encode sequential structure and temporal invariance. The entities are time steps; each step relates to the previous one through a shared transition function. This imposes a Markovian bias (the future depends on the present state, not the full history directly).</p>
<p><strong>Sets and self-attention</strong>: Permutation-invariant architectures impose no ordering on entities. Self-attention (as in Transformers) allows all pairwise interactions but with no structural prior on which interactions matter.</p>
<p>Each architecture can be understood as making specific commitments about what the entities are, what the relations between them are, and what rules govern their interactions.</p>
<h2 id="the-graph-network-framework">The graph network framework</h2>
<p>The paper defines a general &ldquo;graph network&rdquo; (GN) block that operates on graphs with attributes on nodes, edges, and the global graph level. A GN block performs three update steps and three aggregation steps:</p>
<ol>
<li><strong>Edge update</strong>: For each edge, compute updated edge attributes using the current edge attributes, the sender node attributes, the receiver node attributes, and the global attributes</li>
<li><strong>Node update</strong>: For each node, aggregate incoming updated edge attributes, then compute updated node attributes using the aggregated edges, current node attributes, and global attributes</li>
<li><strong>Global update</strong>: Aggregate all updated edge and node attributes, then compute updated global attributes</li>
</ol>
<p>Each update function is learned (typically a small neural network), and each aggregation function must be permutation invariant (typically sum, mean, or max).</p>
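<p>The three update and aggregation steps above can be sketched concretely. The following is a minimal NumPy sketch, not the paper&rsquo;s released library: the toy callables <code>phi_e</code>, <code>phi_v</code>, and <code>phi_u</code> stand in for the learned update networks, and sum serves as the permutation-invariant aggregator.</p>

```python
import numpy as np

def gn_block(V, E, senders, receivers, u, phi_e, phi_v, phi_u):
    """One graph network (GN) block: edge, node, and global updates.

    V: (n_nodes, d) node attributes; E: (n_edges, d) edge attributes;
    senders/receivers: integer arrays of edge endpoints;
    u: (d,) global attributes; phi_*: learned update functions.
    """
    # 1. Edge update: each edge sees its attributes, both endpoints, and u.
    E_new = np.stack([
        phi_e(E[k], V[senders[k]], V[receivers[k]], u)
        for k in range(len(E))
    ])
    # 2. Node update: aggregate incoming updated edges (sum is permutation
    #    invariant), then update each node from its aggregate, itself, and u.
    V_new = np.zeros_like(V)
    for i in range(len(V)):
        incoming = E_new[receivers == i]
        agg = incoming.sum(axis=0) if len(incoming) else np.zeros(E_new.shape[1])
        V_new[i] = phi_v(agg, V[i], u)
    # 3. Global update: aggregate all updated edges and nodes, then update u.
    u_new = phi_u(E_new.sum(axis=0), V_new.sum(axis=0), u)
    return V_new, E_new, u_new
```

<p>Stacking such blocks, with the <code>phi</code> functions implemented as small neural networks and weights optionally shared across blocks, recovers the composable architectures the paper describes.</p>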
<p>This framework generalizes prior work:</p>
<ul>
<li><strong>Message Passing Neural Networks</strong> (Gilmer et al., 2017): edge and node updates with a readout function but no explicit global attribute in message passing</li>
<li><strong>Non-local Neural Networks</strong> (Wang et al., 2018): attention-weighted edge interactions</li>
<li><strong>Interaction Networks</strong> (Battaglia et al., 2016): physics-inspired message passing</li>
<li><strong>Relation Networks</strong> (Santoro et al., 2017): a simple neural network module for relational reasoning</li>
<li><strong>Discovering objects and their relations</strong> (Raposo et al., 2017): discovering objects and their relations from entangled scene representations</li>
<li><strong>Deep Sets</strong> (Zaheer et al., 2017): node-only aggregation without edge structure</li>
<li><strong>CommNet, Structure2Vec, GGNNs</strong>, and others</li>
</ul>
<p>The paper shows how each prior approach corresponds to a specific configuration of which GN components are used and how they are connected.</p>
<h2 id="design-principles-for-graph-networks">Design principles for graph networks</h2>
<p>The paper identifies several key design choices:</p>
<p><strong>Flexible representations</strong>: GN blocks can output graphs with different structure than their input (e.g., predicting edge existence), enabling tasks like link prediction, clustering, or property regression.</p>
<p><strong>Configurable within-block structure</strong>: The internal update and aggregation functions can be swapped freely. The framework separates what is computed (the relational structure) from how it is computed (the function approximators).</p>
<p><strong>Composable multi-block architectures</strong>: GN blocks can be stacked, sharing or not sharing weights across layers. They can be composed with other architectures (e.g., an encoder-GN-decoder pattern) or arranged in recurrent configurations.</p>
<p><strong>Combinatorial generalization</strong>: Because GN blocks share functions across edges and nodes, they can generalize to graphs of different sizes and topologies than those seen during training. A GN trained on small graphs can, in principle, be applied to larger ones.</p>
<h2 id="connections-to-broader-ai-themes">Connections to broader AI themes</h2>
<p>The paper frames graph networks as supporting:</p>
<ul>
<li><strong>Relational reasoning</strong>: Learning about entities and their interactions</li>
<li><strong>Combinatorial generalization</strong>: Applying learned rules to novel combinations</li>
<li><strong>Structured prediction</strong>: Producing complex, structured outputs including graphs and sequences</li>
<li><strong>Interpretable representations</strong>: Graph structure provides a natural vocabulary for understanding what the model has learned</li>
</ul>
<p>The authors also discuss connections to classical AI (logic, planning, causal reasoning) and argue that graph networks provide a bridge between the flexibility of deep learning and the compositionality of symbolic approaches.</p>
<h2 id="limitations-and-open-questions">Limitations and open questions</h2>
<p>The paper identifies several limitations of graph networks:</p>
<ul>
<li><strong>Graph isomorphism</strong>: Learned message-passing cannot be guaranteed to discriminate between certain non-isomorphic graphs. Kondor et al. (2018) suggested that covariance, rather than invariance to permutations, may be preferable.</li>
<li><strong>Expressivity limits of graphs</strong>: Notions like recursion, control flow, and conditional iteration are not straightforward to represent with graphs. Programs and more &ldquo;computer-like&rdquo; processing may offer greater representational and computational expressivity for these concepts.</li>
<li><strong>Where do graphs come from?</strong>: Converting raw sensory data (images, text) into graph-structured representations remains an open problem. Fully connected graphs between spatial or linguistic entities are a common workaround but may not reflect the true underlying structure.</li>
<li><strong>Adaptive graph structure</strong>: How to modify graph topology during computation (e.g., splitting a node when an object fractures, or adding/removing edges based on contact) is an active research direction.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>The authors released an open-source software library for building graph networks in TensorFlow/Sonnet, including demos for shortest-path finding, sorting, and physical prediction tasks.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepmind/graph_nets">Graph Nets library</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official TensorFlow/Sonnet implementation with demos</td>
      </tr>
  </tbody>
</table>
<p>This is a position/systematization paper rather than an empirical one, so reproducibility pertains to the accompanying library rather than experimental results. The library and demos are publicly available, making the framework highly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., &hellip; &amp; Pascanu, R. (2018). Relational inductive biases, deep learning, and graph networks. <em>arXiv preprint arXiv:1806.01261</em>.</p>
<p><strong>Publication</strong>: arXiv 2018</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/1806.01261">arXiv</a></li>
<li><a href="https://github.com/deepmind/graph_nets">Graph Nets library (GitHub)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{battaglia2018relational,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Relational inductive biases, deep learning, and graph networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Battaglia, Peter W. and Hamrick, Jessica B. and Bapst, Victor and Sanchez-Gonzalez, Alvaro and Zambaldi, Vinicius and Malinowski, Mateusz and Tacchetti, Andrea and Raposo, David and Santoro, Adam and Faulkner, Ryan and others}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1806.01261}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of OCSR Techniques and Models (Musazade 2022)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</guid><description>Systematization of OCSR evolution from rule-based systems to deep learning, highlighting the paradigm shift to image captioning approaches.</description><content:encoded><![CDATA[<h2 id="systematization-of-ocsr-evolution">Systematization of OCSR Evolution</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$). It organizes existing literature into two distinct evolutionary phases: <strong>Rule-based systems</strong> (1990s-2010s) and <strong>Machine Learning-based systems</strong> (2015-present). It synthesizes performance metrics across these paradigms to highlight the shift from simple classification to &ldquo;image captioning&rdquo; (sequence generation).</p>
<p><strong>Justification</strong>: The paper focuses on &ldquo;organizing and synthesizing existing literature&rdquo; and answers the core question: &ldquo;What do we know?&rdquo; The dominant contribution is systematization based on several key indicators:</p>
<ol>
<li>
<p><strong>Survey Structure</strong>: The paper explicitly structures content by categorizing the field into two distinct historical and methodological groups: &ldquo;Rule-based systems&rdquo; and &ldquo;ML-based systems&rdquo;. It traces the &ldquo;evolution of approaches from rule-based structure analyses to complex statistical models&rdquo;, moving chronologically from early tools like OROCS and OSRA (1990s-2000s) to modern Deep Learning approaches like DECIMER and Vision Transformers.</p>
</li>
<li>
<p><strong>Synthesis of Knowledge</strong>: The paper aggregates performance metrics from various distinct studies into unified comparison tables (Table 1 for rule-based and Table 2 for ML-based). It synthesizes technical details of different models, explaining how specific architectures (CNNs, LSTMs, Attention mechanisms) are applied to the specific problem of Optical Chemical Structure Recognition (OCSR).</p>
</li>
<li>
<p><strong>Identification of Gaps</strong>: The authors dedicate specific sections to &ldquo;Gaps of rule-based systems&rdquo; and &ldquo;Gaps of ML-based systems&rdquo;. It concludes with recommendations for future development, such as the need for &ldquo;standardized datasets&rdquo; and specific improvements in image augmentation and evaluation metrics.</p>
</li>
</ol>
<h2 id="motivation-for-digitization-in-cheminformatics">Motivation for Digitization in Cheminformatics</h2>
<p>The primary motivation is the need to digitize vast amounts of chemical knowledge locked in non-digital formats (e.g., scanned PDFs, older textbooks). This is challenging because:</p>
<ol>
<li><strong>Representational Variety</strong>: A single chemical formula can be drawn in many visually distinct ways (e.g., different orientations, bond styles, fonts).</li>
<li><strong>Legacy Data</strong>: Older documents contain noise, low resolution, and disconnected strokes that confuse standard computer vision models.</li>
<li><strong>Lack of Standardization</strong>: There is no centralized database or standardized benchmark for evaluating OCSR performance, making comparison difficult.</li>
</ol>
<h2 id="key-insights-and-the-paradigm-shift">Key Insights and the Paradigm Shift</h2>
<p>The paper provides a structured comparison of the &ldquo;evolution&rdquo; of OCSR, specifically identifying the pivot point where the field moved from object detection to <strong>NLP-inspired sequence generation</strong>.</p>
<p>Key insights include:</p>
<ul>
<li><strong>The Paradigm Shift</strong>: Identifying that OCSR has effectively become an &ldquo;image captioning&rdquo; problem where the &ldquo;caption&rdquo; is a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> string.</li>
<li><strong>Metric Critique</strong>: It critically analyzes the flaws in current evaluation metrics, noting that Levenshtein Distance (LD) is better than simple accuracy but still fails to capture the chemical severity of an error (e.g., mistaking &ldquo;F&rdquo; for &ldquo;S&rdquo; is worse than a wrong digit).</li>
<li><strong>Hybrid Potential</strong>: Despite the dominance of ML, the authors argue that rule-based heuristics are still valuable for post-processing validation (e.g., checking element order, sequence structure, and formula correspondence).</li>
</ul>
<h2 id="comparative-analysis-of-rule-based-vs-ml-systems">Comparative Analysis of Rule-Based vs. ML Systems</h2>
<p>As a review paper, it aggregates experimental results from primary sources. It compares:</p>
<ul>
<li><strong>Rule-based systems</strong>: OSRA, chemoCR, Imago, Markov Logic OCSR, and various heuristic approaches.</li>
<li><strong>ML-based systems</strong>: DECIMER (multiple versions), MSE-DUDL, ICMDT (Image Captioning Model based on Deep Transformer-in-Transformer), and other BMS Kaggle competition solutions.</li>
</ul>
<p>It contrasts these systems using:</p>
<ul>
<li><strong>Datasets</strong>: BMS (synthetic, 4M images), PubChem (synthetic), U.S. Patents (real-world scanned).</li>
<li><strong>Metrics</strong>: Tanimoto similarity (structural overlap) and Levenshtein distance (string edit distance).</li>
</ul>
<h2 id="outcomes-critical-gaps-and-recommendations">Outcomes, Critical Gaps, and Recommendations</h2>
<ol>
<li><strong>Transformers are SOTA</strong>: Attention-based encoder-decoder models outperform CNN-RNN hybrids. DECIMER 1.0 achieved Tanimoto $= 1.0$ on 96.47% of its test set using an EfficientNet-B3 encoder and Transformer decoder.</li>
<li><strong>Data Hungry</strong>: Modern approaches require massive datasets (millions of images) and significant compute. DECIMER 1.0 trained on 39M images for 14 days on TPU, while the original DECIMER took 27 days on a single GPU. Rule-based systems required neither large data nor heavy compute but hit a performance ceiling.</li>
<li><strong>Critical Gaps</strong>:
<ul>
<li><strong>Super-atoms</strong>: Current models struggle with abbreviated super-atoms (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;).</li>
<li><strong>Stereochemistry</strong>: 3D information (wedges/dashes) is often lost or misinterpreted.</li>
<li><strong>Resolution</strong>: Models are brittle to resolution changes; some require high-res, others fail if images aren&rsquo;t downscaled.</li>
</ul>
</li>
<li><strong>Recommendation</strong>: Future systems should integrate &ldquo;smart&rdquo; pre-processing (denoising without cropping) and use domain-specific distance metrics. The authors also note that post-processing formula validation (checking element order, sequence structure, and formula correspondence) increases accuracy by around 5-6% on average. They suggest exploring Capsule Networks as an alternative to CNNs, since capsules preserve positional information through routing-by-agreement rather than discarding it via max-pooling.</li>
</ol>
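<p>As a rough illustration of the formula-correspondence check recommended above, here is a deliberately naive sketch over organic-subset SMILES. The regex and helper names are illustrative, not from any reviewed system: bracket atoms, aromatic lowercase forms, charges, and hydrogens are ignored, so this suits a quick sanity check rather than chemistry-aware validation.</p>

```python
import re
from collections import Counter

# Two-letter symbols must be tried before one-letter ones, or "Cl"
# would be miscounted as carbon plus an unmatched character.
ATOM_RE = re.compile(r"Cl|Br|[BCNOSPFI]")

def heavy_atom_counts(smiles: str) -> Counter:
    """Count heavy (non-hydrogen) atoms in a simple SMILES string."""
    return Counter(ATOM_RE.findall(smiles))

def formula_matches(smiles: str, expected: dict) -> bool:
    """Post-processing check: does the predicted string's element tally
    agree with the expected heavy-atom formula?"""
    return heavy_atom_counts(smiles) == Counter(expected)
```

<p>For example, <code>formula_matches("CC(=O)O", {"C": 2, "O": 2})</code> accepts acetic acid, while a prediction that swapped an oxygen for a fluorine would be flagged for review.</p>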
<h2 id="reproducibility">Reproducibility</h2>
<p>As a review paper, this work does not introduce original code, models, or datasets. The paper itself is open access via the Journal of Cheminformatics. This section summarizes the technical details of the systems reviewed.</p>
<h3 id="data">Data</h3>
<p>The review identifies the following key datasets used for training OCSR models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BMS (Bristol-Myers Squibb)</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~4M images</td>
          <td style="text-align: left">2.4M train / 1.6M test. Used for Kaggle competition. Test images contain noise (salt &amp; pepper, blur) and rotations absent from training images.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>PubChem</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~39M</td>
          <td style="text-align: left">Generated via CDK (Chemistry Development Kit). Used by DECIMER 1.0 (90/10 train/test split).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>U.S. Patents (USPTO)</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">Variable</td>
          <td style="text-align: left">Real-world noise, often low resolution. One of several training sources for MSE-DUDL (alongside PubChem and Indigo, totaling 50M+ samples).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ChemInfty</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">869 images</td>
          <td style="text-align: left">Older benchmark used to evaluate rule-based systems (e.g., Markov Logic OCSR).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review highlights the progression of algorithms:</p>
<ul>
<li><strong>Rule-Based</strong>: Hough transforms for bond detection, vectorization/skeletonization, and OCR for atom labels.</li>
<li><strong>Sequence Modeling</strong>:
<ul>
<li><strong>Image Captioning</strong>: Encoder (CNN/ViT) → Decoder (RNN/Transformer).</li>
<li><strong>Tokenization</strong>: Parsing InChI/SMILES into discrete tokens (e.g., splitting <code>C13</code> into <code>C</code>, <code>13</code>).</li>
<li><strong>Beam Search</strong>: Used at inference time (typically $k = 15$ to $20$) to find the most likely chemical string.</li>
</ul>
</li>
</ul>
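<p>The tokenization and beam-search steps above can be sketched as follows. This is a minimal illustration, not code from any reviewed system: the regex splits element symbols from multi-digit counts (the <code>C13</code> → <code>C</code>, <code>13</code> case), and the scoring interface is a stand-in for a decoder&rsquo;s next-token distribution.</p>

```python
import re

# Element symbols (capital letter plus optional lowercase), multi-digit
# counts, and single punctuation marks become separate tokens,
# e.g. "C13H18O2" -> ["C", "13", "H", "18", "O", "2"].
TOKEN_RE = re.compile(r"[A-Z][a-z]?|\d+|[^A-Za-z0-9\s]")

def tokenize_formula(layer: str) -> list[str]:
    return TOKEN_RE.findall(layer)

def beam_search(score_next, start, k=15, steps=5):
    """Minimal beam search: keep the k highest-scoring partial sequences
    at each step. score_next(seq) yields (token, log-prob) candidates;
    k = 15 mirrors the typical width cited above."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for tok, lp in score_next(seq):
                candidates.append((logp + lp, seq + [tok]))
        beams = sorted(candidates, reverse=True)[:k]
    return beams[0][1]  # token sequence of the best-scoring beam
```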
<h3 id="models">Models</h3>
<p>Key architectures reviewed:</p>
<ul>
<li><strong>DECIMER 1.0</strong>: Uses <strong>EfficientNet-B3</strong> (Encoder) and <strong>Transformer</strong> (Decoder). Predicts <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings (more robust than <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>).</li>
<li><strong>Swin Transformer</strong>: Often used in Kaggle ensembles as the visual encoder due to better handling of variable image sizes.</li>
<li><strong>Grid LSTM</strong>: Used in older deep learning approaches (MSE-DUDL) to capture spatial dependencies.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics standard in the field:</p>
<ul>
<li><strong>Levenshtein Distance (LD)</strong>: Edit distance between predicted and ground truth strings. Lower is better. Formally, for two sequences $a$ and $b$ (e.g. SMILES strings) of lengths $|a|$ and $|b|$, the distance $LD(a, b)$ is bounded between $0$ and $\max(|a|, |b|)$.</li>
<li><strong>Tanimoto Similarity</strong>: Measures overlap of molecular fingerprints ($0.0 - 1.0$). Higher is better. DECIMER 1.0 achieved a Tanimoto of 0.99 on PubChem data (Table 2). Calculated as:
$$
\begin{aligned}
T(A, B) = \frac{N_c}{N_a + N_b - N_c}
\end{aligned}
$$
where $N_a$ and $N_b$ are the number of bits set to 1 in fingerprints $A$ and $B$, and $N_c$ is the number of common bits set to 1.</li>
<li><strong>1-1 Match Rate</strong>: Exact string matching (accuracy). For DECIMER 1.0, 96.47% of results achieved Tanimoto $= 1.0$.</li>
</ul>
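<p>Both metrics can be implemented directly from the definitions above. The following is a minimal reference sketch, not any reviewed tool&rsquo;s code; fingerprints are represented abstractly as sets of set-bit indices rather than any specific fingerprint format.</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (e.g. predicted vs. reference
    SMILES), bounded by 0 and max(len(a), len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def tanimoto(fp_a: set, fp_b: set) -> float:
    """T(A, B) = N_c / (N_a + N_b - N_c), where N_a and N_b count the
    set bits of each fingerprint and N_c counts the shared set bits."""
    n_c = len(fp_a & fp_b)
    return n_c / (len(fp_a) + len(fp_b) - n_c)
```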
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Cost</strong>: High for SOTA. DECIMER 1.0 required ~14 days on TPU. The original DECIMER took ~27 days on a single NVIDIA GPU.</li>
<li><strong>Inference</strong>: Transformer models are heavy; rule-based systems run on standard CPUs but with lower accuracy.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Musazade, F., Jamalova, N., &amp; Hasanov, J. (2022). Review of techniques and models used in optical chemical structure recognition in images and scanned documents. <em>Journal of Cheminformatics</em>, 14(1), 61. <a href="https://doi.org/10.1186/s13321-022-00642-3">https://doi.org/10.1186/s13321-022-00642-3</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{musazadeReviewTechniquesModels2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Review of Techniques and Models Used in Optical Chemical Structure Recognition in Images and Scanned Documents}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Musazade, Fidan and Jamalova, Narmin and Hasanov, Jamaladdin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-022-00642-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>A Review of Optical Chemical Structure Recognition Tools</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</guid><description>Comprehensive review and benchmarking of 30 years of Optical Chemical Structure Recognition (OCSR) methods and tools.</description><content:encoded><![CDATA[<h2 id="systematization-and-benchmarking-of-ocsr">Systematization and Benchmarking of OCSR</h2>
<p>This is primarily a <strong>Systematization</strong> paper ($0.7 \Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($0.3 \Psi_{\text{Resource}}$).</p>
<p>It serves as a <strong>Systematization</strong> because it organizes nearly three decades of research in Optical Chemical Structure Recognition (OCSR), categorizing methods into rule-based systems (e.g., Kekulé, CLiDE, OSRA) and emerging machine-learning approaches (e.g., MSE-DUDL, Chemgrapher). It synthesizes information on 16 distinct tools, many of which are commercial or no longer available.</p>
<p>It acts as a <strong>Resource</strong> by defining a benchmark for the field. The authors evaluate the three available open-source tools (Imago, MolVec, OSRA) against four distinct datasets to establish baseline performance metrics for accuracy and speed.</p>
<h2 id="motivation-digitizing-legacy-chemical-literature">Motivation: Digitizing Legacy Chemical Literature</h2>
<p>A vast amount of chemical knowledge remains &ldquo;hidden&rdquo; in the primary scientific literature (printed or PDF), conveyed as 2D images. Because these depictions are not machine-readable, there is a &ldquo;backlog of decades of chemical literature&rdquo; that cannot be easily indexed or searched in open-access databases.</p>
<p>While Chemical Named Entity Recognition (NER) exists for text, translating graphical depictions into formats like SMILES or SDfiles requires specialized OCSR tools. The motivation is to enable the automated curation of this legacy data to feed public databases.</p>
<h2 id="core-innovations-historical-taxonomy-and-open-standards">Core Innovations: Historical Taxonomy and Open Standards</h2>
<p>The primary novelty is the comprehensive aggregation of the history of the field, which had not been thoroughly reviewed recently. It details the algorithmic evolution from the first work in 1990 to deep learning methods in 2019.</p>
<p>Specific contributions include:</p>
<ul>
<li><strong>Historical Taxonomy</strong>: Classification of tools into rule-based vs. machine-learning, and open-source vs. commercial/unavailable.</li>
<li><strong>Open Source Benchmark</strong>: A comparative performance analysis of the only three open-source tools available at the time (Imago, MolVec, OSRA) on standardized datasets.</li>
<li><strong>Algorithmic Breakdown</strong>: Detailed summaries of the workflows for closed-source or lost tools (e.g., Kekulé, OROCS, ChemReader) based on their original publications.</li>
</ul>
<h2 id="benchmarking-methodology-and-open-source-evaluation">Benchmarking Methodology and Open-Source Evaluation</h2>
<p>The authors performed a benchmark study to evaluate the accuracy and speed of three open-source OCSR tools: <strong>MolVec (0.9.7)</strong>, <strong>Imago (2.0)</strong>, and <strong>OSRA (2.1.0)</strong>.</p>
<p>They tested these tools on four datasets of varying quality and origin:</p>
<ol>
<li><strong>USPTO</strong>: 5,719 images from US patents (high quality).</li>
<li><strong>UOB</strong>: 5,740 images from the University of Birmingham, published alongside MolRec.</li>
<li><strong>CLEF 2012</strong>: 961 images from the CLEF-IP evaluation (well-segmented, clean).</li>
<li><strong>JPO</strong>: 450 images from Japanese patents (low quality, noise, Japanese characters).</li>
</ol>
<p>Evaluation metrics were:</p>
<ul>
<li><strong>Accuracy</strong>: Percentage of perfectly recognized structures, defined as exact string matching between generated and reference <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> sequences: $\text{Accuracy} = \frac{\text{Correct InChI Matches}}{\text{Total Images}}$. Each tool&rsquo;s output was converted to an InChI string and matched against the reference InChI.</li>
<li><strong>Speed</strong>: Total processing time for the dataset.</li>
</ul>
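<p>The accuracy computation above reduces to exact string matching over InChI pairs; a minimal sketch (the function name is illustrative, and any InChI strings used with it would be tool outputs, not data from this benchmark):</p>

```python
def ocsr_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of images whose generated InChI exactly matches the
    reference InChI, i.e. the benchmark's definition of a correct result."""
    if len(predicted) != len(reference):
        raise ValueError("expected one prediction per reference image")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)
```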
<h2 id="results-and-general-conclusions">Results and General Conclusions</h2>
<p><strong>Benchmark Results (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>MolVec 0.9.7</th>
          <th>Imago 2.0</th>
          <th>OSRA 2.1.0</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO (5,719 images)</td>
          <td>Time (min)</td>
          <td>28.65</td>
          <td>72.83</td>
          <td>145.04</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.41%</td>
          <td>87.20%</td>
          <td>87.69%</td>
      </tr>
      <tr>
          <td>UOB (5,740 images)</td>
          <td>Time (min)</td>
          <td>28.42</td>
          <td>152.52</td>
          <td>125.78</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.39%</td>
          <td>63.54%</td>
          <td>86.50%</td>
      </tr>
      <tr>
          <td>CLEF 2012 (961 images)</td>
          <td>Time (min)</td>
          <td>4.41</td>
          <td>16.03</td>
          <td>21.33</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>80.96%</td>
          <td>65.45%</td>
          <td>94.90%</td>
      </tr>
      <tr>
          <td>JPO (450 images)</td>
          <td>Time (min)</td>
          <td>7.50</td>
          <td>22.55</td>
          <td>16.68</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>66.67%</td>
          <td>40.00%</td>
          <td>57.78%</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Observations</strong>:</p>
<ul>
<li><strong>MolVec</strong> was the fastest tool, processing datasets significantly quicker than competitors (e.g., 28.65 min for USPTO vs. 145.04 min for OSRA).</li>
<li><strong>OSRA</strong> performed exceptionally well on clean, well-segmented data (94.90% on CLEF 2012) but was slower.</li>
<li><strong>Imago</strong> generally lagged in accuracy compared to the other two, particularly on the UOB dataset (63.54% vs. 88.39% for MolVec and 86.50% for OSRA).</li>
<li><strong>JPO Difficulty</strong>: All tools struggled with the noisy Japanese Patent Office dataset (accuracies ranged from 40.00% to 66.67%), highlighting issues with noise and non-standard labels.</li>
</ul>
<p><strong>General Conclusions</strong>:</p>
<ul>
<li>No &ldquo;gold standard&rdquo; tool existed (as of 2020) that solved all problems (page segmentation, R-groups, NLP integration).</li>
<li>Rule-based approaches dominated the history of the field, but deep-learning methods (MSE-DUDL, Chemgrapher) were emerging, though both were closed-source at the time of writing.</li>
<li>There was a critical need for tools that could handle full-page recognition (combining segmentation and recognition).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The authors provided sufficient detail to replicate the benchmarking study.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/OCSR_Review">OCSR_Review (GitHub)</a></td>
          <td>Code / Data</td>
          <td>MIT</td>
          <td>Benchmark images (PNG, 72 dpi) and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/p/osra/wiki/Download/">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.1.0 tested; precompiled binaries are commercial</td>
      </tr>
      <tr>
          <td><a href="https://lifescience.opensource.epam.com/download/imago.html">Imago</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.0 tested; no longer actively developed</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ncats/molvec">MolVec</a></td>
          <td>Code</td>
          <td>LGPL-2.1</td>
          <td>Version 0.9.7 tested; Java-based standalone tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used four public datasets. Images were converted to PNG (72 dpi) to ensure compatibility across all tools.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
          <th>Characteristics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>5,719</td>
          <td>OSRA Validation Set</td>
          <td>US Patent images, generally clean.</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>5,740</td>
          <td>Univ. of Birmingham</td>
          <td>Published alongside MolRec.</td>
      </tr>
      <tr>
          <td><strong>CLEF 2012</strong></td>
          <td>961</td>
          <td>CLEF-IP 2012</td>
          <td>Well-segmented, high quality.</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>450</td>
          <td>Japanese Patent Office</td>
          <td>Low quality, noisy, contains Japanese text.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper does not propose a new algorithm but benchmarks existing ones. The execution commands for reproducibility were:</p>
<ul>
<li><strong>Imago</strong>: Executed via command line without installation.
<code>./imago_console -dir /image/directory/path</code></li>
<li><strong>MolVec</strong>: Executed as a JAR file.
<code>java -cp [dependencies] gov.nih.ncats.molvec.Main -dir [input_dir] -outDir [output_dir]</code></li>
<li><strong>OSRA</strong>: Installed via Conda (PyOSRA) due to compilation complexity. Required dictionaries for superatoms and spelling.
<code>osra -f sdf -a [superatom_dict] -l [spelling_dict] -w [output_file] [input_file]</code></li>
</ul>
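<p>For batch runs, these invocations can be assembled programmatically. A minimal sketch: the binary names mirror the commands above, while the dictionary filenames, classpath, and output paths are placeholders for a real installation.</p>

```python
def build_command(tool, image_path, out="out", superatoms="superatom.txt",
                  spelling="spelling.txt", classpath="molvec.jar"):
    """Assemble the argv list for each OCSR tool, following the paper's usage.

    Imago and MolVec operate on a directory of images; OSRA is invoked once
    per input file, so `image_path` names a single file in that case.
    """
    if tool == "imago":
        return ["./imago_console", "-dir", image_path]
    if tool == "molvec":
        return ["java", "-cp", classpath, "gov.nih.ncats.molvec.Main",
                "-dir", image_path, "-outDir", out]
    if tool == "osra":
        return ["osra", "-f", "sdf", "-a", superatoms, "-l", spelling,
                "-w", out, image_path]
    raise ValueError(f"unknown tool: {tool}")


# Each list can be handed to subprocess.run(...) once the tools are installed.
print(build_command("imago", "benchmark/uspto"))
```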
<h3 id="models">Models</h3>
<p>The specific versions of the open-source software tested were:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Version</th>
          <th>Technology</th>
          <th>License</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>0.9.7</td>
          <td>Java-based, rule-based</td>
          <td>LGPL-2.1</td>
      </tr>
      <tr>
          <td><strong>Imago</strong></td>
          <td>2.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>2.1.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Perfect structural match. The output SDfile/SMILES was converted to a Standard InChI string and compared to the ground truth InChI. Any deviation counted as a failure.</li>
<li><strong>Environment</strong>: Linux workstation (Ubuntu 20.04 LTS).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The benchmark was performed on a high-end workstation to measure processing time.</p>
<ul>
<li><strong>CPUs</strong>: 2x Intel Xeon Silver 4114 (40 threads total).</li>
<li><strong>RAM</strong>: 64 GB.</li>
<li><strong>Parallelization</strong>: MolVec&rsquo;s built-in parallelization contributed to its speed advantage.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H. O., Zielesny, A., &amp; Steinbeck, C. (2020). A review of optical chemical structure recognition tools. <em>Journal of Cheminformatics</em>, 12(1), 60. <a href="https://doi.org/10.1186/s13321-020-00465-0">https://doi.org/10.1186/s13321-020-00465-0</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanReviewOpticalChemical2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Review of Optical Chemical Structure Recognition Tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00465-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Three Domains of Life: Woese's Phylogenetic Revolution</title><link>https://hunterheidenreich.com/notes/biology/evolutionary-biology/woese-three-domain-1990/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/evolutionary-biology/woese-three-domain-1990/</guid><description>Woese, Kandler, and Wheelis proposed the three-domain system in 1990, replacing the prokaryote-eukaryote dichotomy with Bacteria, Archaea, and Eucarya.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Systematization</strong> paper with a strong <strong>Position</strong> component.</p>
<p><strong>Systematization</strong>: It synthesizes decades of molecular sequence data (specifically rRNA) to propose a &ldquo;formal system of organisms&rdquo; that replaces previous taxonomies.</p>
<p><strong>Position</strong>: It argues that the prevailing &ldquo;Prokaryote-Eukaryote&rdquo; and &ldquo;Five Kingdom&rdquo; models are &ldquo;outmoded,&rdquo; &ldquo;misleading,&rdquo; and based on &ldquo;flawed premises&rdquo; regarding the organization of life.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The authors aim to align formal taxonomy with the &ldquo;natural system emerging from molecular data&rdquo;.</p>
<p><strong>The Problem</strong>: Existing systems (Whittaker&rsquo;s Five Kingdoms) were based on morphology and nutrition, which are insufficient for microbial classification.</p>
<p><strong>The Gap</strong>: The &ldquo;Prokaryote&rdquo; definition was negative (defined by what they <em>lack</em>, a nucleus), obscuring the fact that &ldquo;Archaebacteria&rdquo; are as distinct from &ldquo;Eubacteria&rdquo; as they are from Eukaryotes.</p>
<p><strong>The Goal</strong>: To establish a taxonomic rank above Kingdom that recognizes the three primary evolutionary lineages.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core contribution is the formal proposal of the <strong>Domain</strong> as the highest taxonomic rank. Specific novel definitions include:</p>
<ol>
<li>
<p><strong>Three Domains</strong>:</p>
<ul>
<li><strong>Bacteria</strong> (formerly Eubacteria): Membrane lipids are diacyl glycerol diesters; eubacterial rRNA.</li>
<li><strong>Archaea</strong> (formerly Archaebacteria): Membrane lipids are isoprenoid glycerol diethers/tetraethers; archaeal rRNA. The term &ldquo;archaebacteria&rdquo; is abandoned to emphasize their independence.</li>
<li><strong>Eucarya</strong> (Eukaryotes): Cells with nuclei; glycerol fatty acyl diester lipids; eukaryotic rRNA.</li>
</ul>
</li>
<li>
<p><strong>Subdivision of Archaea</strong>: The domain is formally split into two kingdoms:</p>
<ul>
<li><strong>Euryarchaeota</strong> (methanogens, halophiles, thermoplasms, sulfate-reducing <em>Archaeoglobus</em>, and thermophiles <em>Thermococcus</em> and <em>Pyrococcus</em>).</li>
<li><strong>Crenarchaeota</strong> (sulfur-dependent extreme thermophiles).</li>
</ul>
</li>
</ol>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This paper is a synthesis of phylogenetic analysis. It relies on:</p>
<ul>
<li><strong>rRNA Sequencing</strong>: Comparison of 16S (small subunit) ribosomal RNA sequences. The paper cites over 400 known eubacterial cases of a characteristic structural feature (the 6-nucleotide side bulge at positions 500-545).</li>
<li><strong>Phylogenetic Tree Reconstruction</strong>: Analysis of branching orders and lengths based on rRNA sequence comparisons (citing Woese, 1987).</li>
<li><strong>Paralogous Gene Rooting</strong>: Determining the root of the universal tree by comparing duplicated genes (e.g., elongation factors) that diverged before the three lineages separated.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Tripartite Division</strong>: Life divides into three monophyletic groups. The evolutionary differences among the three domains are more profound than those separating traditional kingdoms such as animals and plants.</li>
<li><strong>Archaea-Eucarya Sisterhood</strong>: The root of the tree separates Bacteria from the other two, making Archaea and Eucarya sister groups.</li>
<li><strong>Molecular Definition</strong>: Phenotypes are replaced by molecular signatures. For example, Bacteria are defined by a 6-nucleotide bulge in the 16S rRNA (positions 500-545), whereas Archaea and Eucarya share a 7-nucleotide bulge.</li>
<li><strong>&ldquo;Prokaryote&rdquo; as Invalid Taxon</strong>: The paper explicitly argues that &ldquo;Prokaryote&rdquo; is not a valid natural taxon. Because it is defined by what the organisms <em>lack</em> (a nucleus), it groups together two deeply divergent domains (Bacteria and Archaea) by a plesiomorphic character. The term should be abandoned in natural classification.</li>
<li><strong>Domain Replaces Kingdom</strong>: Introducing the Domain rank above Kingdom resolves the issue. A bacterium is no more related to an archaeon than either is to a eukaryote, so all three deserve equivalent top-level status.</li>
<li><strong>Formal Conclusions (adapted from paper)</strong>:
<ol>
<li>Life comprises three primary groupings, the Domains Bacteria, Archaea, and Eucarya.</li>
<li>None of these is ancestral to the others; all descend from a common ancestor.</li>
<li>The Archaea comprise two kingdoms, Euryarchaeota and Crenarchaeota.</li>
<li>Both Bacteria and Eucarya will contain numerous kingdoms; for Eucarya, the paper anticipates preserving Plantae, Animalia, and Fungi while replacing Protista with several kingdoms.</li>
<li>&ldquo;Prokaryote&rdquo; has no phylogenetic meaning and should not be used as a formal taxon.</li>
</ol>
</li>
</ul>
<p><strong>Reception and ongoing debate</strong>: At publication, abandoning &ldquo;prokaryote&rdquo; was a controversial claim. Most microbiology and cell biology textbooks through the 2000s retained the term, and many introductory curricula continue to use it today. The three-domain framework has since been adopted in modern phylogenetics and comparative genomics, but the transition is not yet universal in pedagogy, and some researchers have proposed alternative deep-tree topologies (e.g., the eocyte hypothesis) that differ from Woese&rsquo;s original Archaea-Eucarya sisterhood.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><em>Note: As a theoretical systematics paper from 1990, &ldquo;reproducibility&rdquo; refers to the data sources and criteria used to construct the taxonomy.</em></p>
<h3 id="data">Data</h3>
<p>The taxonomy rests on comparative analysis of <strong>Ribosomal RNA (rRNA)</strong>, specifically the small subunit (16S in prokaryotes, 18S in eukaryotes).</p>
<table>
  <thead>
      <tr>
          <th>Data Type</th>
          <th>Specific Features</th>
          <th>Source Reference</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>16S rRNA</strong></td>
          <td>Region 500-545 (Hairpin Loop)</td>
          <td>Woese et al., 1983</td>
      </tr>
      <tr>
          <td><strong>16S rRNA</strong></td>
          <td>Region 180-197 &amp; 405-498</td>
          <td>Woese et al., 1983</td>
      </tr>
      <tr>
          <td><strong>Membrane Lipids</strong></td>
          <td>Diacyl esters vs. Isoprenoid ethers</td>
          <td>Used for Domain definition</td>
      </tr>
      <tr>
          <td><strong>RNA Polymerase</strong></td>
          <td>Subunit patterns and complexity</td>
          <td>Schnabel et al., 1983; Puhler et al., 1989</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper relies on rRNA sequence comparisons to generate the universal tree in <strong>Figure 1</strong>, using phylogenetic methods standard at the time.</p>
<ul>
<li><strong>Tree Inference</strong>: Branching order/lengths taken from <em>Microbiol. Rev. 51, 221-271 (1987)</em>.</li>
<li><strong>Rooting Strategy</strong>: The &ldquo;Outgroup&rdquo; method using anciently duplicated genes (paralogs) such as Elongation Factors Tu and G, which diverged prior to the Universal Ancestor.</li>
</ul>
<h3 id="models">Models</h3>
<p>The &ldquo;Model&rdquo; proposed is the <strong>Three-Domain System</strong>:</p>
<ol>
<li><strong>Domain Bacteria</strong>: Rooted independently. Includes Thermotogales, Flavobacteria, Cyanobacteria, Purple bacteria, Gram-positives, Green nonsulfur.</li>
<li><strong>Domain Archaea</strong>:
<ul>
<li><em>Kingdom Crenarchaeota</em>: &ldquo;Ancestral&rdquo; phenotype (thermophily). Includes <em>Pyrodictium</em>, <em>Thermoproteus</em>.</li>
<li><em>Kingdom Euryarchaeota</em>: &ldquo;Broad&rdquo; phenotype. Includes Methanogens, Halophiles, <em>Thermoplasma</em>, <em>Archaeoglobus</em> (sulfate-reducing), and <em>Thermococcus</em> and <em>Pyrococcus</em> (thermophilic).</li>
</ul>
</li>
<li><strong>Domain Eucarya</strong>: Includes Animals, Ciliates, Plants, Fungi, Flagellates, Microsporidia.</li>
</ol>
<h3 id="evaluation">Evaluation</h3>
<p>The authors validate the model by demonstrating <strong>Molecular Invariants</strong>: features present in all members of a domain but absent in others:</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Bacteria</th>
          <th>Archaea</th>
          <th>Eucarya</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>rRNA Loop (500-545)</strong></td>
          <td>6-nt bulge</td>
          <td>7-nt bulge</td>
          <td>7-nt bulge</td>
      </tr>
      <tr>
          <td><strong>Membrane Lipids</strong></td>
          <td>Glycerol fatty acyl diesters</td>
          <td>Isoprenoid glycerol ethers</td>
          <td>Glycerol fatty acyl diesters</td>
      </tr>
      <tr>
          <td><strong>RNA Polymerase</strong></td>
          <td>Simple subunit pattern</td>
          <td>Complex (Eucarya-like)</td>
          <td>Complex (3 separate pols)</td>
      </tr>
  </tbody>
</table>
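<p>These invariants amount to a decision rule. A hypothetical Python sketch (the feature encoding is invented for illustration) classifies an organism from the two signatures in the table:</p>

```python
def classify_domain(rrna_bulge_nt, lipid_type):
    """Assign a domain from the two molecular signatures in the table.

    rrna_bulge_nt: length of the rRNA bulge at positions 500-545 (6 or 7).
    lipid_type: "diester" (glycerol fatty acyl diesters) or
                "ether" (isoprenoid glycerol ethers).
    """
    if rrna_bulge_nt == 6:
        return "Bacteria"  # the 6-nt bulge is diagnostic of Bacteria
    if rrna_bulge_nt == 7:
        # Archaea and Eucarya share the 7-nt bulge; lipid chemistry separates them.
        return "Archaea" if lipid_type == "ether" else "Eucarya"
    raise ValueError("no domain matches these signatures")


print(classify_domain(7, "ether"))    # Archaea
print(classify_domain(6, "diester"))  # Bacteria
```

<p>The point of the paper&rsquo;s molecular definitions is exactly this: membership is decided by discrete sequence and chemical features rather than by gross phenotype.</p>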
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Woese, C. R., Kandler, O., &amp; Wheelis, M. L. (1990). Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. <em>Proc. Natl. Acad. Sci. USA</em>, 87(12), 4576-4579. <a href="https://doi.org/10.1073/pnas.87.12.4576">https://doi.org/10.1073/pnas.87.12.4576</a></p>
<p><strong>Publication</strong>: Proc. Natl. Acad. Sci. USA, Volume 87, Number 12, 1990</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{woeseNaturalSystemOrganisms1990,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Towards a Natural System of Organisms: Proposal for the Domains {{Archaea}}, {{Bacteria}}, and {{Eucarya}}.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Towards a Natural System of Organisms}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Woese, C R and Kandler, O and Wheelis, M L}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1990}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Proceedings of the National Academy of Sciences of the United States of America}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{87}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4576--4579}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0027-8424}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1073/pnas.87.12.4576}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.pnas.org/doi/10.1073/pnas.74.11.5088">Woese&rsquo;s 1977 Discovery of Archaea</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi">NCBI Taxonomy Browser</a></li>
</ul>
]]></content:encoded></item><item><title>Embedded-Atom Method: Theory and Applications Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-review-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-review-1993/</guid><description>Comprehensive 1993 review of the Embedded-Atom Method (EAM), covering theory, parameterization, and applications to metallic systems.</description><content:encoded><![CDATA[<h2 id="systematizing-the-embedded-atom-method">Systematizing the Embedded-Atom Method</h2>
<p>This is a <strong>Systematization (Review)</strong> paper. It consolidates the theoretical development, semi-empirical parameterization, and broad applications of the Embedded-Atom Method (EAM) into a unified framework. The paper systematizes the field by connecting the EAM to related theories (Effective Medium Theory, Finnis-Sinclair, &ldquo;glue&rdquo; models) and organizing phenomenological results across diverse physical regimes (bulk, surfaces, interfaces).</p>
<p>The authors explicitly frame the work as a survey, stating &ldquo;We review here the history, development, and application of the EAM&rdquo; and &ldquo;This review emphasizes the physical insight that motivated the EAM.&rdquo; The paper follows a classic survey structure, organizing the literature by application domains.</p>
<h2 id="the-failure-of-pair-potentials-in-metallic-systems">The Failure of Pair Potentials in Metallic Systems</h2>
<p>The primary motivation is the failure of pair-potential models to accurately describe metallic bonding, particularly at defects and interfaces.</p>
<p><strong>Physics Gap</strong>: Pair potentials assume bond strength is independent of environment, implying cohesive energy scales linearly with coordination ($Z$), whereas in reality it scales roughly as $\sqrt{Z}$.</p>
<p><strong>Empirical Failures</strong>: Pair potentials incorrectly predict the &ldquo;Cauchy relation&rdquo; ($C_{12} = C_{44}$) and predict a vacancy formation energy equal to the cohesive energy, contradicting experimental data for fcc metals.</p>
<p><strong>Practical Need</strong>: First-principles calculations (like DFT) were computationally too expensive for low-symmetry systems like grain boundaries and fracture tips, creating a need for an efficient, semi-empirical many-body potential.</p>
<h2 id="theoretical-unification--core-innovations">Theoretical Unification &amp; Core Innovations</h2>
<p>The paper&rsquo;s core contribution is the synthesis of the EAM as a practical computational tool that captures &ldquo;coordination-dependent bond strength&rdquo; without the cost of ab initio methods.</p>
<p><strong>Theoretical Unification</strong>: It demonstrates that the EAM ansatz can be derived from Density Functional Theory (DFT) by assuming the total electron density is a superposition of atomic densities.</p>
<p><strong>Environmental Dependence</strong>: It explicitly formulates how the &ldquo;effective&rdquo; pair interaction stiffens and shortens as coordination decreases (e.g., at surfaces), a feature naturally arising from the non-linearity of the embedding function.</p>
<p><strong>Broad Validation</strong>: It provides a centralized evaluation of the method across a vast array of metallic properties, establishing it as the standard for atomistic simulations of face-centered cubic (fcc) metals.</p>
<h2 id="validating-eam-across-application-domains">Validating EAM Across Application Domains</h2>
<p>The authors review computational experiments using Energy Minimization, Molecular Dynamics (MD), and Monte Carlo (MC) simulations across several domains:</p>
<p><strong>Bulk Properties</strong>: Calculation of phonon spectra, liquid structure factors, thermal expansion coefficients, and melting points for fcc metals (Ni, Pd, Pt, Cu, Ag, Au).</p>
<p><strong>Defects</strong>: Computation of vacancy formation/migration energies and self-interstitial geometries.</p>
<p><strong>Grain Boundaries</strong>: Calculation of grain boundary structures, energies, and elastic properties for twist and tilt boundaries in Au and Al. Computed structures show good agreement with X-ray diffraction and HRTEM experiments. The many-body interactions in the EAM produce somewhat better agreement than pair potentials, which tend to overestimate boundary expansion.</p>
<p><strong>Surfaces</strong>: Analysis of surface energies, relaxations, reconstructions (e.g., Au(110) missing row), and surface phonons.</p>
<p><strong>Alloys</strong>: Investigation of heat of solution, surface segregation profiles (e.g., Ni-Cu), and order-disorder transitions.</p>
<p><strong>Mechanical Properties</strong>: Simulation of dislocation mobility, pinning by defects (He bubbles), and crack tip plasticity (ductile vs. brittle fracture modes).</p>
<h2 id="key-outcomes-and-the-limits-of-eam">Key Outcomes and the Limits of EAM</h2>
<p><strong>Many-Body Success</strong>: The EAM successfully reproduces the breakdown of the Cauchy relation and the correct ratio of vacancy formation energy to cohesive energy (~0.35) for fcc metals.</p>
<p><strong>Surface Accuracy</strong>: It correctly predicts that surface bonds are shorter and stiffer than bulk bonds due to lower coordination. It accurately predicts surface reconstructions (e.g., Au(110) $(1 \times 2)$).</p>
<p><strong>Alloy Behavior</strong>: The method naturally captures segregation phenomena, including oscillating concentration profiles in Ni-Cu, driven by the embedding energy.</p>
<p><strong>Limitations</strong>: The method is less accurate for systems with strong directional bonding (covalent materials) or significant Fermi-surface effects, as it assumes spherically averaged electron densities.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Fitting Data</strong>: The semi-empirical functions are fitted to basic bulk properties: lattice constants, cohesive energy, elastic constants ($C_{11}$, $C_{12}$, $C_{44}$), and vacancy formation energy.</p>
<p><strong>Universal Binding Curve</strong>: The cohesive energy as a function of lattice constant is constrained to follow the &ldquo;universal binding curve&rdquo; of Rose et al. to ensure accurate anharmonic behavior.</p>
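<p>One common statement of the Rose et al. relation is $E(a^*) = -E_{coh}(1 + a^*)e^{-a^*}$, where $a^*$ is the deviation of the lattice constant from equilibrium scaled by a length involving the bulk modulus. A brief sketch (the cohesive energy value is illustrative, roughly that of Cu):</p>

```python
import math

def rose_energy(a_star, e_coh=3.54):
    """Universal binding curve: E(a*) = -E_coh * (1 + a*) * exp(-a*).

    a_star is the scaled deviation from the equilibrium lattice constant;
    e_coh (eV/atom) is an illustrative cohesive energy.
    """
    return -e_coh * (1.0 + a_star) * math.exp(-a_star)


# Minimum of -E_coh at equilibrium (a* = 0); the curve is anharmonic,
# so expansion costs less energy than an equal compression.
print(rose_energy(0.0))                      # -3.54
print(rose_energy(0.5) < rose_energy(-0.5))  # True: expansion side lies lower
```

<p>Constraining the fit to this curve, rather than only to the curvature at equilibrium, is what builds in sensible anharmonic (thermal expansion, equation-of-state) behavior.</p>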
<p><strong>Alloy Data</strong>: For binary alloys, dilute heats of alloying are used for fitting cross-interactions.</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Core Ansatz</strong>: The total energy is defined as:</p>
<p>$$E_{coh} = \sum_{i} G_i\left( \sum_{j \neq i} \rho_j^a(R_{ij}) \right) + \frac{1}{2} \sum_{i, j (j \neq i)} U_{ij}(R_{ij})$$</p>
<p>where $G$ is the embedding energy (function of local electron density $\rho$), and $U$ is a pair interaction.</p>
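<p>The ansatz is compact enough to sketch directly. The functional forms below (exponential atomic density, $G(\rho) = -A\sqrt{\rho}$, Born-Mayer-like repulsive pair term) are assumptions chosen to exhibit the $\sqrt{Z}$ coordination scaling, not any of the paper&rsquo;s fitted parameterizations:</p>

```python
import math

def rho_atomic(r, beta=3.0):
    # Assumed exponential atomic density contribution rho^a(r).
    return math.exp(-beta * (r - 1.0))

def embed(rho, A=1.0):
    # Embedding function G(rho) = -A*sqrt(rho): its nonlinearity yields the
    # ~sqrt(Z) scaling of cohesive energy with coordination.
    return -A * math.sqrt(rho)

def pair(r, eps=0.5, alpha=4.0):
    # Short-range repulsive pair term U(r) (assumed Born-Mayer form).
    return eps * math.exp(-alpha * (r - 1.0))

def eam_energy(positions):
    """E = sum_i G(sum_{j!=i} rho_j(R_ij)) + (1/2) sum_{i!=j} U(R_ij)."""
    energy = 0.0
    for i, ri in enumerate(positions):
        rho = 0.0
        for j, rj in enumerate(positions):
            if j == i:
                continue
            r = math.dist(ri, rj)
            rho += rho_atomic(r)          # host density at atom i
            energy += 0.5 * pair(r)       # half to avoid double counting
        energy += embed(rho)
    return energy


print(eam_energy([(0, 0, 0), (1, 0, 0)]))  # dimer at r = 1: -1.5
# Sublinear bond strength: doubling the density costs less than twice
# the single-neighbor embedding energy, unlike any pair potential.
print(embed(2.0) > 2 * embed(1.0))  # True
```

<p>Because $G$ is concave, each additional neighbor strengthens cohesion by less than the last; a pure pair potential ($G \equiv 0$) would instead scale linearly with coordination.</p>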
<p><strong>Simulation Techniques</strong>:</p>
<ul>
<li><strong>Molecular Dynamics (MD)</strong>: Used for liquids, phonons, and fracture simulations.</li>
<li><strong>Monte Carlo (MC)</strong>: Used for phase diagrams and segregation profiles (e.g., approximately $10^5$ iterations per atom).</li>
<li><strong>Phonons</strong>: Calculated via the dynamical matrix derived from the force-constant tensor $K_{ij}$.</li>
<li><strong>Normal-Mode Analysis</strong>: Vibrational normal modes obtained by diagonalizing the dynamical matrix, feasible for unit cells of up to about 260 atoms.</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Parameterizations</strong>: The review lists several specific function sets developed by the authors (Table 2), including:</p>
<ul>
<li><strong>Daw and Baskes</strong>: For Ni, Pd, H (elemental metals and H in solution/on surfaces)</li>
<li><strong>Foiles</strong>: For Cu, Ag, Au, Ni, Pd, Pt (elemental metals)</li>
<li><strong>Foiles</strong>: For Cu, Ni (tailored for the Ni-Cu alloy system)</li>
<li><strong>Foiles, Baskes and Daw</strong>: For Cu, Ag, Au, Ni, Pd, Pt (dilute alloys)</li>
<li><strong>Daw, Baskes, Bisson and Wolfer</strong>: For Ni, H (fracture, dislocations, H embrittlement)</li>
<li><strong>Foiles and Daw</strong>: For Ni, Al (Ni-rich end of the Ni-Al alloy system)</li>
<li><strong>Daw</strong>: For Ni (calculated from first principles, not semi-empirical)</li>
<li><strong>Hoagland, Daw, Foiles and Baskes</strong>: For Al (elemental Al)</li>
</ul>
<p>Many of these historical parameterizations are directly downloadable in machine-readable formats from the NIST Interatomic Potentials Repository (linked in the resources below).</p>
<p><strong>Transferability</strong>: EAM functions are generally <em>not</em> transferable between different parameterization sets; mixing functions from different sets (e.g., Daw-Baskes Ni with Foiles Pd) is invalid.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Bulk Validation</strong>: Phonon dispersion curves for Cu show excellent agreement with experiment across the full Brillouin zone.</p>
<p><strong>Thermal Properties</strong>: Linear thermal expansion coefficients match experiment well (e.g., Cu calculated: $16.4 \times 10^{-6}/K$ vs experimental: $16.7 \times 10^{-6}/K$).</p>
<p><strong>Defect Energetics</strong>: Vacancy migration energies and divacancy binding energies (~0.1-0.2 eV) align with experimental data.</p>
<p><strong>Surface Segregation</strong>: Correctly predicts segregation species for 18 distinct dilute alloy cases (e.g., Cu segregating in Ni).</p>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute Scale</strong>: At the time of publication (1993), Molecular Dynamics simulations of up to 35,000 atoms were possible.</p>
<p><strong>Platforms</strong>: Calculations were performed on supercomputers like the <strong>CRAY-XMP</strong>, though smaller calculations were noted as feasible on high-performance workstations.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Daw, M. S., Foiles, S. M., &amp; Baskes, M. I. (1993). The embedded-atom method: a review of theory and applications. <em>Materials Science Reports</em>, 9(7-8), 251-310. <a href="https://doi.org/10.1016/0920-2307(93)90001-U">https://doi.org/10.1016/0920-2307(93)90001-U</a></p>
<p><strong>Publication</strong>: Materials Science Reports 1993</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{dawEmbeddedatomMethodReview1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The embedded-atom method: a review of theory and applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{The Embedded-Atom Method}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Daw, Murray S. and Foiles, Stephen M. and Baskes, Michael I.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = mar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Materials Science Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{7-8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{251--310}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0920-2307}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1016/0920-2307(93)90001-U}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method/">Original EAM Paper (1984)</a></li>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-voter-1994/">EAM User Guide (1994)</a></li>
<li><a href="https://www.ctcms.nist.gov/potentials/">NIST Interatomic Potentials Repository</a></li>
</ul>
]]></content:encoded></item><item><title>Embedded-Atom Method User Guide: Voter's 1994 Chapter</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-voter-1994/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-voter-1994/</guid><description>Comprehensive user guide for the Embedded-Atom Method (EAM), covering theory, potential fitting, and applications to intermetallics.</description><content:encoded><![CDATA[<h2 id="contribution-systematizing-the-embedded-atom-method">Contribution: Systematizing the Embedded-Atom Method</h2>
<p>This is a <strong>Systematization</strong> paper (specifically a handbook chapter) with a strong secondary <strong>Method</strong> projection.</p>
<p>Its primary goal is to serve as a &ldquo;users&rsquo; guide&rdquo; to the Embedded-Atom Method (EAM). The text organizes existing knowledge:</p>
<ul>
<li>It traces the physical origins of EAM from Density Functional Theory (DFT) and Effective Medium Theory.</li>
<li>It synthesizes &ldquo;closely related methods&rdquo; (Second Moment Approximation, Glue Model), showing they are mathematically equivalent or very similar to EAM.</li>
<li>It provides a pedagogical, step-by-step methodology for fitting potentials to experimental data.</li>
</ul>
<h2 id="motivation-bridging-the-gap-between-dft-and-pair-potentials">Motivation: Bridging the Gap Between DFT and Pair Potentials</h2>
<p>The primary motivation is to bridge the gap between accurate, expensive electronic structure calculations and fast, inaccurate pair potentials.</p>
<ul>
<li><strong>Computational Efficiency</strong>: First-principles methods scale as $O(N^3)$ or worse, limiting simulations to $&lt;100$ atoms (in 1994). Pair potentials scale as $O(N)$ and fail to capture essential many-body physics of metals.</li>
<li><strong>Physical Accuracy</strong>: Simple pair potentials cannot accurately model metallic defects; they predict zero Cauchy pressure ($C_{12} - C_{44} = 0$) and equate vacancy formation energy to cohesive energy, both of which are incorrect for transition metals.</li>
<li><strong>Practical Utility</strong>: There was a need for a clear guide on how to construct and apply these potentials for large-scale simulations ($10^6+$ atoms) of fracture and defects.</li>
</ul>
<h2 id="novelty-a-unified-framework-and-robust-fitting-recipe">Novelty: A Unified Framework and Robust Fitting Recipe</h2>
<p>As a review chapter, the novelty lies in the synthesis and the specific, reproducible recipe for potential construction. Central to this synthesis is the core EAM energy functional:</p>
<p>$$E_{\text{tot}} = \sum_i \left( F(\bar{\rho}_i) + \frac{1}{2} \sum_{j \neq i} \phi(r_{ij}) \right)$$</p>
<p>where the total energy $E_{\text{tot}}$ depends on embedding an atom $i$ into a local background electron density $\bar{\rho}_i = \sum_{j \neq i} \rho(r_{ij})$, plus a repulsive pair interaction $\phi(r_{ij})$.</p>
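<p>As a sketch of the functional above, a direct $O(N^2)$ evaluation can be written with $F$, $\rho$, and $\phi$ left as arbitrary callables. This is an illustrative implementation, not Voter&rsquo;s code; any fitted functional forms would be supplied by the caller.</p>

```python
import numpy as np

def eam_total_energy(positions, F, rho, phi, r_cut):
    """E_tot = sum_i [ F(rho_bar_i) + (1/2) sum_{j != i} phi(r_ij) ].

    positions : (N, 3) array of atomic coordinates
    F         : embedding function F(rho_bar)
    rho       : density contribution rho(r) from a neighbor at distance r
    phi       : pair potential phi(r)
    r_cut     : cutoff radius beyond which rho and phi are taken as zero
    """
    energy = 0.0
    for i, xi in enumerate(positions):
        rho_bar = 0.0  # host electron density at atom i
        pair = 0.0     # atom i's half-share of its pair interactions
        for j, xj in enumerate(positions):
            if i == j:
                continue
            r = float(np.linalg.norm(xi - xj))
            if r < r_cut:
                rho_bar += rho(r)
                pair += 0.5 * phi(r)
        energy += F(rho_bar) + pair
    return energy
```

<p>For a dimer at separation $1.0$ with toy functions $F(\bar{\rho}) = -\sqrt{\bar{\rho}}$, $\rho(r) = e^{-r}$, $\phi(r) = r^{-12}$, this returns $2F(e^{-1}) + \phi(1)$, matching the hand-computed value.</p>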
<ul>
<li><strong>Unified Framework</strong>: It explicitly maps the &ldquo;Second Moment Approximation&rdquo; (Tight Binding) and the &ldquo;Glue Model&rdquo; onto the fundamental EAM framework above, clarifying that they differ primarily in terminology or specific functional choices (e.g., square root embedding functions).</li>
<li><strong>Cross-Potential Fitting Recipe</strong>: It details a robust method for fitting alloy potentials (specifically Ni-Al-B) by using &ldquo;transformation invariance&rdquo;, scaling the density and shifting the embedding function to fit alloy properties without disturbing pure element fits.</li>
<li><strong>Specific Parameters</strong>: It publishes optimized potential parameters for Ni, Al, and B that accurately reproduce properties like the Boron interstitial preference in $\text{Ni}_3\text{Al}$.</li>
</ul>
<h2 id="validation-computational-benchmarks-and-simulations">Validation: Computational Benchmarks and Simulations</h2>
<p>The &ldquo;experiments&rdquo; described are computational validations and simulations using the fitted Ni-Al-B potential:</p>
<ol>
<li>
<p><strong>Potential Fitting</strong>:</p>
<ul>
<li>Pure elements (Ni, Al) were fitted to elastic constants, vacancy formation energies, and diatomic data. The Ni fit achieved $\chi_{\text{rms}} = 0.75\%$ and Al achieved $\chi_{\text{rms}} = 3.85\%$.</li>
<li>Boron was fitted using hypothetical crystal structures (fcc, bcc) calculated via LMTO (Linear Muffin-Tin Orbital) since experimental data for fcc B does not exist.</li>
</ul>
</li>
<li>
<p><strong>Molecular Statics (Validation)</strong>:</p>
<ul>
<li><strong>Surface Relaxation</strong>: Demonstrated that EAM captures the oscillatory relaxation of atomic layers near a free surface, a many-body effect that pair potentials fail to capture.</li>
<li><strong>Defect Energetics</strong>: Calculated formation energies for Boron interstitials in $\text{Ni}_3\text{Al}$. Found the 6Ni-octahedral site is most stable ($-4.59$ eV relative to an isolated B atom and unperturbed crystal), followed by the 4Ni-2Al octahedral site ($-3.65$ eV) and the 3Ni-1Al tetrahedral site ($-2.99$ eV), consistent with channeling experiments.</li>
</ul>
</li>
<li>
<p><strong>Molecular Dynamics (Application)</strong>:</p>
<ul>
<li><strong>Grain Boundary (GB) Cleavage</strong>: Simulated the fracture of a (210) tilt grain boundary in $\text{Ni}_3\text{Al}$ at a strain rate of $5 \times 10^{10}$ s$^{-1}$.</li>
<li><strong>Comparison</strong>: Compared pure $\text{Ni}_3\text{Al}$ boundaries vs. those doped with Boron and substitutional Nickel.</li>
</ul>
</li>
</ol>
<h2 id="key-outcomes-eam-efficiency-and-boron-strengthening">Key Outcomes: EAM Efficiency and Boron Strengthening</h2>
<ul>
<li><strong>EAM Efficiency</strong>: Confirmed that EAM scales linearly with atom count ($N$), requiring only 2-5 times the computational work of pair potentials.</li>
<li><strong>Boron Strengthening Mechanism</strong>: The simulations suggested that Boron segregates to grain boundaries and, specifically when co-segregated with Ni, significantly increases cohesion.
<ul>
<li>The maximum stress for the enriched boundary was approximately 22 GPa, compared to approximately 19 GPa for the clean boundary.</li>
<li>The B-doped boundary required approximately 44% more work to cleave than the undoped boundary.</li>
<li>The fracture mode shifted from cleaving along the GB to failure in the bulk.</li>
</ul>
</li>
<li><strong>Grain Boundary Segregation</strong>: Molecular statics calculations found B interstitial energies at the GB as low as $-6.9$ eV, compared to $-4.59$ eV in the bulk, consistent with experimental observations of boron segregation to grain boundaries.</li>
<li><strong>Limitations</strong>: The author concludes that while EAM is excellent for metals, it lacks the angular dependence required for strongly covalent materials (like $\text{MoSi}_2$) or directional bonding.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The chapter provides nearly all details required to implement the described potential from scratch.</p>
<h3 id="data">Data</h3>
<ul>
<li><strong>Experimental/Reference Data</strong>: Used for fitting the cost function $\chi_{\text{rms}}$.
<ul>
<li><strong>Pure Elements</strong>: Lattice constants ($a_0$), cohesive energy ($E_{\text{coh}}$), bulk modulus ($B$), elastic constants ($C_{11}, C_{12}, C_{44}$), vacancy formation energy ($E_{\text{vac}}^f$), and diatomic bond length/strength ($R_e, D_e$).</li>
<li><strong>Alloys</strong>: Heat of solution and defect energies (APB, SISF) for $\text{Ni}_3\text{Al}$.</li>
<li><strong>Hypothetical Data</strong>: LMTO first-principles data used for unobserved phases (e.g., fcc Boron, B2 NiB) to constrain the fit.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Component Functions</strong>:
<ul>
<li><strong>Pair Potential $\phi(r)$</strong>: Morse potential form:
$$\phi(r) = D_M \{1 - \exp[-\alpha_M(r - R_M)]\}^2 - D_M$$</li>
<li><strong>Density Function $\rho(r)$</strong>: Modified hydrogenic 4s orbital:
$$\rho(r) = r^6(e^{-\beta r} + 2^9 e^{-2\beta r})$$</li>
<li><strong>Embedding Function $F(\bar{\rho})$</strong>: Derived numerically to force the crystal energy to match the &ldquo;Universal Energy Relation&rdquo; (Rose et al.) as a function of lattice constant.</li>
</ul>
</li>
<li><strong>Fitting Strategy</strong>:
<ul>
<li><strong>Smooth Cutoff</strong>: A polynomial smoothing function ($h_{\text{smooth}}$) applied at $r_{\text{cut}}$ to ensure continuous derivatives.</li>
<li><strong>Simplex Algorithm</strong>: Used to optimize parameters ($D_M, R_M, \alpha_M, \beta, r_{\text{cut}}$).</li>
<li><strong>Alloy Invariance</strong>: Used transformations $F'(\rho) = F(\rho) + g\rho$ and $\rho'(r) = s\rho(r)$ to fit cross-potentials without altering pure-element properties.</li>
</ul>
</li>
</ul>
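<p>The component functions and smooth cutoff can be sketched directly. The Ni values $D_M = 1.5335$ eV, $\alpha_M = 1.7728$ &Aring;$^{-1}$, and $r_{\text{cut}} = 4.7895$ &Aring; are the chapter&rsquo;s published parameters; $R_M$ and $\beta$ below are illustrative stand-ins (the fitted values live in the chapter&rsquo;s tables), and the shift-plus-slope cutoff is one common polynomial choice, not necessarily the chapter&rsquo;s exact $h_{\text{smooth}}$.</p>

```python
import numpy as np

# Chapter values for Ni: D_M (eV), alpha_M (1/Angstrom), r_cut (Angstrom).
# R_M and BETA are illustrative placeholders, not the fitted values.
D_M, ALPHA_M, R_M, R_CUT = 1.5335, 1.7728, 2.05, 4.7895
BETA = 3.6

def morse(r, D=D_M, alpha=ALPHA_M, R=R_M):
    """Morse pair potential: phi(r) = D {1 - exp[-alpha (r - R)]}^2 - D."""
    return D * (1.0 - np.exp(-alpha * (r - R)))**2 - D

def density(r, beta=BETA):
    """Modified hydrogenic 4s form: rho(r) = r^6 (e^{-beta r} + 2^9 e^{-2 beta r})."""
    return r**6 * (np.exp(-beta * r) + 2.0**9 * np.exp(-2.0 * beta * r))

def smooth_cutoff(f, r_cut=R_CUT, m=20):
    """Wrap f so both its value and first derivative vanish at r_cut."""
    eps = 1e-6
    f_rc = f(r_cut)
    df_rc = (f(r_cut + eps) - f(r_cut - eps)) / (2.0 * eps)  # numerical f'(r_cut)
    def g(r):
        r = np.asarray(r, dtype=float)
        val = f(r) - f_rc + (r_cut / m) * (1.0 - (r / r_cut)**m) * df_rc
        return np.where(r < r_cut, val, 0.0)
    return g
```

<p><code>smooth_cutoff(morse)</code> returns a pair potential that goes continuously and differentiably to zero at $r_{\text{cut}}$, which is what guarantees continuous forces (and hence energy conservation) in molecular dynamics.</p>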
<h3 id="models">Models</h3>
<ul>
<li><strong>Parameters</strong>: The text provides the exact optimized parameters for the Ni-Al-B potential in <strong>Table 2</strong> (Pure elements) and <strong>Table 5</strong> (Cross-potentials).
<ul>
<li>Example Ni parameters: $D_M=1.5335$ eV, $\alpha_M=1.7728$ Å$^{-1}$, $r_{\text{cut}}=4.7895$ Å.</li>
</ul>
</li>
</ul>
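<p>The transformation invariance used in the fitting strategy can be checked numerically. The chapter&rsquo;s shift $F'(\rho) = F(\rho) + g\rho$ leaves the total energy unchanged when paired with the compensating pair shift $\phi'(r) = \phi(r) - 2g\rho(r)$, since $\bar{\rho}_i = \sum_{j \neq i} \rho(r_{ij})$. The cluster and component functions below are toy stand-ins, not the fitted Ni-Al-B forms.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 3.0, size=(6, 3))     # small random monatomic cluster

rho = lambda r: np.exp(-2.0 * r)             # toy density contribution
F   = lambda rb: -np.sqrt(rb)                # toy embedding function
phi = lambda r: np.exp(-r) / r               # toy pair potential

def total_energy(F, phi):
    """Brute-force EAM energy of the cluster (all pairs, no cutoff)."""
    E = 0.0
    for i in range(len(pos)):
        rij = np.linalg.norm(pos - pos[i], axis=1)
        rij = rij[rij > 0.0]                 # drop the self-distance
        E += F(rho(rij).sum()) + 0.5 * phi(rij).sum()
    return E

g = 0.7                                      # arbitrary gauge parameter
F_g   = lambda rb: F(rb) + g * rb            # shifted embedding function
phi_g = lambda r: phi(r) - 2.0 * g * rho(r)  # compensating pair shift

# Both parameterizations describe the same physical model:
E1, E2 = total_energy(F, phi), total_energy(F_g, phi_g)
```

<p>Because the energy is identical for any $g$, the gauge can be chosen to simplify the alloy fit (e.g., normalizing the slope of $F$ at the equilibrium density) without disturbing any pure-element predictions.</p>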
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>1994 Context</strong>: Mentions that simulations of $10^6$ atoms were possible on the &ldquo;fastest computers available&rdquo;.</li>
<li><strong>Scaling</strong>: Explicitly notes computational work scales as $O(N)$, roughly 2-5x slower than pair potentials.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Voter, A. F. (1994). Chapter 4: The Embedded-Atom Method. In <em>Intermetallic Compounds: Vol. 1, Principles</em>, edited by J. H. Westbrook and R. L. Fleischer. John Wiley &amp; Sons Ltd.</p>
<p><strong>Publication</strong>: Intermetallic Compounds: Vol. 1, Principles (1994)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@incollection</span>{voterEmbeddedAtomMethod1994,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The Embedded-Atom Method}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Voter, Arthur F.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Intermetallic Compounds: Vol. 1, Principles}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{Westbrook, J. H. and Fleischer, R. L.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1994}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{John Wiley &amp; Sons Ltd}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{77--90}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">chapter</span> = <span style="color:#e6db74">{4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.ctcms.nist.gov/potentials/">NIST Interatomic Potentials Repository</a> (Modern repository often hosting EAM files)</li>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method/">Original EAM Paper (1984)</a></li>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-review-1993/">EAM Review (1993)</a></li>
</ul>
]]></content:encoded></item><item><title>Venus Evolution Through Time: Key Questions and Missions</title><link>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/venus-evolution-through-time/</link><pubDate>Sun, 07 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/venus-evolution-through-time/</guid><description>A review of Widemann and colleagues' synthesis of key science questions, mission concepts, and international cooperation for Venus exploration 2020-2050.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Systematization</strong> paper (referencing the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>) that synthesizes the current state of Venus science and organizes future exploration strategies. It serves as a comprehensive roadmap that consolidates knowledge from prior missions, articulates open questions, and coordinates upcoming international mission concepts.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Venus serves as a natural laboratory for understanding terrestrial planet habitability and evolution. While Earth and Venus share similar mass and bulk geophysical properties, they followed radically different evolutionary paths. Venus is the only spatially resolvable, Earth-sized world that allows us to monitor geophysical envelopes (atmosphere, surface, interior) to support long-term evolutionary models. Major gaps remain regarding the stability of water reservoirs, the transition from a potentially habitable state to the current greenhouse state, and the nature of current geological activity. Understanding Venus directly informs the interpretation of Venus-like exoplanets.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The paper provides a coordinated roadmap for Venus exploration by:</p>
<ol>
<li>Synthesizing key science questions across four domains (comparative planetology, primordial history, surface processes, and interior-atmosphere coupling).</li>
<li>Detailing the instrument suites and science goals of three selected missions (VERITAS, DAVINCI, and EnVision) and demonstrating their synergies.</li>
<li>Identifying technology gaps and future mission concepts required to fully answer the habitability question.</li>
</ol>
<p>The novelty lies in the <strong>coordinated, multi-mission approach</strong> where each mission addresses complementary aspects of Venus science.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This is a review/roadmap paper, so it does not present new experimental results. Instead, it:</p>
<ol>
<li><strong>Synthesizes prior mission data</strong>: Reviews findings from Magellan, Venus Express, Akatsuki, and ground-based radar observations.</li>
<li><strong>Analyzes mission concepts</strong>: Evaluates the science objectives and instrument capabilities of VERITAS, DAVINCI, EnVision, Venera-D, and Shukrayaan-1.</li>
<li><strong>Assesses technology readiness</strong>: Identifies gaps in high-temperature electronics, long-duration surface operations, and aerial platform capabilities.</li>
</ol>
<p>The &ldquo;experiments&rdquo; are the planned observations and measurements from the coordinated fleet of missions in the 2030s.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<p>The paper concludes that:</p>
<ol>
<li><strong>Synergistic approach is essential</strong>: No single mission can answer the habitability question. The fleet provides complementary global mapping (VERITAS), atmospheric chemistry (DAVINCI), and targeted geological analysis (EnVision).</li>
<li><strong>Key measurements identified</strong>: Noble gas isotopes (especially Xenon), D/H ratio, tesserae composition, and surface deformation are critical observables.</li>
<li><strong>Technology gaps remain</strong>: Long-lived surface landers and sample return require advances in high-temperature electronics and aerial platforms.</li>
<li><strong>Venus science informs exoplanet interpretation</strong>: Understanding the Venus Zone and the transition from habitable to runaway greenhouse states directly supports exoplanet characterization.</li>
</ol>
<p>The 2030s represent the most coordinated era of Venus exploration to date, with NASA, ESA, Roscosmos, ISRO, the Chinese Academy of Sciences, and private missions all targeting the planet within a decade.</p>
<h2 id="key-science-questions">Key Science Questions</h2>
<p>The paper organizes open questions into four primary domains.</p>
<h3 id="comparative-planetology-and-exoplanets">Comparative Planetology and Exoplanets</h3>
<p>The <strong>Venus Zone</strong> is defined as the orbital region where an Earth-sized planet is more likely to be a Venus analog than an Earth analog. Understanding Venus directly informs the interpretation of exoplanet observations.</p>
<p><strong>Magma Ocean Duration</strong>: Venus may lie at a boundary defined by magma ocean cooling times:</p>
<ul>
<li><strong>Type I</strong>: Short-lived magma ocean ($\sim 1$ Myr), allowing water condensation (Earth-like).</li>
<li><strong>Type II</strong>: Long-lived magma ocean ($\sim 100$ Myr) due to high insolation, leading to desiccation via photodissociation and escape of the steam atmosphere.</li>
</ul>
<p><strong>Rotation Rate</strong>: Slow rotation is critical for maintaining temperate conditions in the Venus Zone via cloud-albedo feedback. This has implications for habitability assessments of tidally locked exoplanets.</p>
<h3 id="accretion-and-primordial-history">Accretion and Primordial History</h3>
<p><strong>Impact History</strong>: Did Venus suffer a moon-forming giant impact? The absence of a moon challenges assumptions about early large-scale melting events.</p>
<p><strong>Differentiation</strong>: Determining the timing of silicate/metal differentiation (core formation) via Hf/W chronometry is essential to constrain the accretion phase.</p>
<p><strong>Volatile Delivery</strong>: Did volatiles arrive via solar nebula, asteroids, or comets? Xenon isotopes are key to detecting cometary contributions.</p>
<h3 id="surface-processes-and-resurfacing">Surface Processes and Resurfacing</h3>
<p>Two competing resurfacing models exist:</p>
<ul>
<li><strong>Catastrophic</strong>: A massive pulse of volcanism $\sim 1$ Ga ago followed by quiescence (suggested by random crater distribution).</li>
<li><strong>Equilibrium</strong>: Continuous resurfacing where craters are modified gradually.</li>
</ul>
<p><strong>Tesserae Terrain</strong>: Complex, highly deformed tectonic terrains that may represent the oldest surface rocks. Near-IR emissivity data suggesting low iron content indicates they may be felsic (silica-rich), potentially analogous to Earth&rsquo;s continental crust formed in the presence of water.</p>
<p><strong>Active Volcanism</strong>: Evidence includes variable $\text{SO}_2$ levels, emissivity anomalies at hotspots (Idunn Mons), and young lava flows.</p>
<h3 id="interior-and-atmosphere-coupling">Interior and Atmosphere Coupling</h3>
<p><strong>Tectonic Regime</strong>: Venus lacks plate tectonics but has deformation zones. It may be in a &ldquo;stagnant lid&rdquo; regime or a transitional state.</p>
<p><strong>Noble Gases</strong>: Abundances and isotopes (Ne, Ar, Kr, Xe) track atmospheric loss and outgassing history.</p>
<p><strong>Water Loss</strong>: The D/H ratio indicates water loss, but does not uniquely constrain <em>when</em> or <em>how fast</em> it happened.</p>
<h2 id="the-new-fleet-of-missions">The New Fleet of Missions</h2>
<p>A synergistic fleet of three selected missions (plus international partners) will address these questions in the 2030s.</p>
<h3 id="veritas-nasa-orbiter">VERITAS (NASA Orbiter)</h3>
<blockquote>
<p><strong>Status note</strong>: VERITAS was selected in 2021 but placed on indefinite hold by NASA in late 2022 due to budget pressures from the Mars Sample Return program. Its launch date and schedule remain uncertain as of 2026. The science case and instrument descriptions below reflect the mission as designed.</p></blockquote>
<p><strong>Primary Goal</strong>: Global mapping of topography, rock type, and active deformation.</p>
<p><strong>Key Instruments</strong>:</p>
<ul>
<li><strong>VISAR (X-band Radar)</strong>: Global DEM with 300m horizontal postings over 90% of the surface, with a height accuracy requirement of $\leq$10m (achieved accuracy of 5.9m for 95% of the mapped area after bundle adjustment), 30m SAR imagery globally (15m for ~27% of the surface), and interferometry (RPI) to detect cm-scale surface deformation.</li>
<li><strong>VEM (Emissivity Mapper)</strong>: 14 bands total: 6 surface bands (0.86, 0.91, 0.99, 1.02, 1.11, 1.18 $\mu$m) plus 8 atmospheric and calibration bands, mapping surface iron content (felsic vs. mafic) through atmospheric windows.</li>
</ul>
<p><strong>Science Target</strong>: Determine if Venus has &ldquo;continents&rdquo; (felsic tesserae), active volcanism, and subduction-like features. VERITAS provides the global geophysical map and target identification.</p>
<h3 id="davinci-nasa-probeflyby">DAVINCI (NASA Probe/Flyby)</h3>
<p><strong>Primary Goal</strong>: <em>In situ</em> chemical analysis of the deep atmosphere and descent imaging.</p>
<p><strong>Descent Probe Instruments</strong>:</p>
<ul>
<li><strong>VMS (Mass Spectrometer)</strong>: All noble gases (Ne, Ar, Kr, Xe isotopes), trace gases, and D/H ratio throughout descent.</li>
<li><strong>VTLS (Tunable Laser Spectrometer)</strong>: High-precision isotopes of H, S, C, O.</li>
<li><strong>VASI (Atmospheric Structure Investigation)</strong>: Temperature, pressure, winds, and turbulence characterization during the approximately one-hour descent from ~67 km to the surface.</li>
<li><strong>VenDI (Descent Imager)</strong>: Near-IR imaging of the western Alpha Regio tesserae landing ellipse (~348 $\times$ 160 km) at 2&ndash;200m imaging scales, with 5&ndash;60m topographic resolution derived via Structure-from-Motion.</li>
<li><strong>VfOx (Venus Oxygen Fugacity)</strong>: Student-built instrument to measure redox state of the near-surface atmosphere.</li>
</ul>
<p><strong>Carrier Instruments</strong> (flyby observations):</p>
<ul>
<li><strong>VISOR (4-camera UV and near-IR system)</strong>: Cloud structure and albedo mapping during two Venus flybys.</li>
<li><strong>CUVIS (Compact Ultraviolet Imaging System)</strong>: UV spectra of Venus upper cloud and haze.</li>
</ul>
<p><strong>Mission Timeline</strong>: Launch June 2029; Venus flyby 1 January 2030; Venus flyby 2 November 2030; probe descent June 2031 targeting western Alpha Regio tesserae.</p>
<p><strong>Science Target</strong>: Definitive atmospheric origin/evolution, history of water, and nature of tesserae. DAVINCI provides the chemical &ldquo;ground truth&rdquo; and high-res &ldquo;spot check&rdquo; of tesserae.</p>
<h3 id="envision-esa-orbiter">EnVision (ESA Orbiter)</h3>
<p><strong>Primary Goal</strong>: Holistic view from inner core to upper atmosphere, focusing on activity and geological history.</p>
<p><strong>Key Instruments</strong>:</p>
<ul>
<li><strong>VenSAR (S-band Radar)</strong>: Polarimetric imaging and stereo topography.</li>
<li><strong>SRS (Subsurface Radar Sounder)</strong>: Penetrates the subsurface (up to 1 km depth, 20m resolution) to map stratigraphy, buried craters, and tesserae edges.</li>
<li><strong>VenSpec Suite</strong>: Spectroscopy (IR and UV) to link surface activity to atmospheric gas variations ($\text{SO}_2$, $\text{H}_2\text{O}$).</li>
</ul>
<p><strong>Science Target</strong>: Characterize the sequence of geological events, subsurface layering, and atmospheric-interior coupling. EnVision provides targeted, multi-scale geological analysis and subsurface sounding.</p>
<h3 id="international-partners">International Partners</h3>
<p><strong>Venera-D (Russia)</strong>: Orbiter + Lander.</p>
<ul>
<li>The lander focuses on surface X-ray diffraction and fluorescence (XRD/XRF) analysis (mineralogy) and surviving 2-3 hours.</li>
<li>Includes an aerial platform (balloon) for cloud layer analysis.</li>
</ul>
<p><strong>Shukrayaan-1 (India)</strong>: Orbiter.</p>
<ul>
<li>Features a polarimetric radar (VSAR) and potentially a low-frequency subsurface sounder.</li>
</ul>
<p><strong>VOICE (China)</strong>: Venus Volcano Imaging and Climate Explorer (Dong et al. 2023), an orbiter carrying a Polarimetric Synthetic Aperture Radar (PolSAR), a Microwave Radiometric Sounder (MWRS), and a UV-Visible-Near IR Multi-Spectral Imager (UVN-MSI) on a ~350 km polar orbit, complementary to VERITAS and EnVision.</p>
<p><strong>Morning Star (Rocket Lab)</strong>: Private low-cost small entry probe mission concept (Seager et al. 2021), the Venus Life Finder mission, carrying an ultraviolet autofluorescence backscatter nephelometer to characterize cloud particles and search for biosignatures during descent through the clouds.</p>
<p><strong>CLOVE (Korea)</strong>: Earth-orbiting CubeSat concept by the Institute for Basic Science (IBS) of South Korea, designed to monitor Venus&rsquo;s long-term atmospheric variability from 320 nm to the near-infrared.</p>
<h2 id="future-concepts-and-technology-gaps">Future Concepts and Technology Gaps</h2>
<p>To fully answer the &ldquo;habitability&rdquo; question, investigations beyond the current fleet are required.</p>
<h3 id="long-lived-surface-landers">Long-Lived Surface Landers</h3>
<p><strong>Challenge</strong>: Electronics cannot survive Venus surface temperatures ($470^{\circ}\text{C}$) for long periods.</p>
<p><strong>Solution</strong>: High-temperature electronics (SiC, GaN) and battery technology.</p>
<p><strong>Science Goal</strong>: Seismology. Measuring &ldquo;Venusquakes&rdquo; is the only way to definitively resolve the core state and interior structure.</p>
<h3 id="aerial-platforms-balloons">Aerial Platforms (Balloons)</h3>
<p><strong>Environment</strong>: The cloud layer (50&ndash;60 km) is the &ldquo;habitable zone&rdquo; ($20^{\circ}\text{C}$, 0.5 atm).</p>
<p><strong>Science Goals</strong>:</p>
<ul>
<li>Long-term monitoring of atmospheric circulation and chemistry.</li>
<li><strong>Aerial Seismology</strong>: Detecting infrasound generated by groundquakes from the air (mechanical coupling is $60\times$ stronger on Venus than Earth).</li>
</ul>
<h3 id="sample-return">Sample Return</h3>
<p><strong>Concept</strong>: Skimming the upper atmosphere ($&lt; 120$ km) to collect noble gases and returning them to Earth for high-precision laboratory analysis.</p>
<h2 id="synergies-with-exoplanet-science">Synergies with Exoplanet Science</h2>
<p>Observations of Venus-like exoplanets (e.g., TRAPPIST-1 system) by JWST provide the statistical context for Venus&rsquo;s divergent evolution. The upcoming decade represents a coordinated campaign:</p>
<ol>
<li><strong>VERITAS</strong> provides the global geophysical map and target identification.</li>
<li><strong>DAVINCI</strong> provides the chemical &ldquo;ground truth&rdquo; and high-res &ldquo;spot check&rdquo; of tesserae.</li>
<li><strong>EnVision</strong> provides targeted, multi-scale geological analysis and subsurface sounding.</li>
</ol>
<p>Understanding Venus allows us to interpret spectra from Venus analogs around other stars, making Venus exploration directly relevant to the search for habitable worlds beyond our solar system.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>This is a review and roadmap paper, so there are no code, model, or dataset artifacts to reproduce. The paper is published open access in <em>Space Science Reviews</em> under a CC license. All referenced mission design documents and companion articles in the same volume are cited and accessible through their respective DOIs.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Widemann, T., Smrekar, S. E., Garvin, J. B., et al. (2023). Venus Evolution Through Time: Key Science Questions, Selected Mission Concepts and Future Investigations. <em>Space Science Reviews</em>, 219(7), 56. <a href="https://doi.org/10.1007/s11214-023-00992-w">https://doi.org/10.1007/s11214-023-00992-w</a></p>
<p><strong>Publication</strong>: <em>Space Science Reviews</em>, 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Widemann2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Widemann, Thomas and Smrekar, Suzanne E. and Garvin, James B. and Straume-Lindner, Anne Grete and Ocampo, Adriana C. and Schulte, Mitchell D. and Voirin, Thomas and Hensley, Scott and Dyar, M. Darby and Whitten, Jennifer L. and Nunes, Daniel C. and Getty, Stephanie A. and Arney, Giada N. and Johnson, Natasha M. and Kohler, Erika and Spohn, Tilman and O&#39;Rourke, Joseph G. and Wilson, Colin F. and Way, Michael J. and Ostberg, Colby and Westall, Frances and H{\&#34;o}ning, Dennis and Jacobson, Seth and Salvador, Arnaud and Avice, Guillaume and Breuer, Doris and Carter, Lynn and Gilmore, Martha S. and Ghail, Richard and Helbert, J{\&#34;o}rn and Byrne, Paul and Santos, Alison R. and Herrick, Robert R. and Izenberg, Noam and Marcq, Emmanuel and Rolf, Tobias and Weller, Matt and Gillmann, Cedric and Korablev, Oleg and Zelenyi, Lev and Zasova, Ludmila and Gorinov, Dmitry and Seth, Gaurav and Rao, C. V. Narasimha and Desai, Nilesh}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Venus Evolution Through Time: Key Science Questions, Selected Mission Concepts and Future Investigations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Space Science Reviews}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{219}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{56}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11214-023-00992-w}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1007/s11214-023-00992-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Life on Venus? Astrobiology and the Habitability Limits</title><link>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/life-on-venus/</link><pubDate>Fri, 05 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/life-on-venus/</guid><description>A systematic analysis of Venus's habitability limits, reviewing temperature, pressure, and acidity constraints against Cockell's predictions.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>synthetic review</strong> that evaluates Venus&rsquo;s past and present habitability by comparing physical conditions against the known limits of terrestrial extremophiles. It is a systematization of knowledge paper that rigorously analyzes environmental constraints based on existing literature.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The core question is: <em>To what degree were past habitats or are present habitats on Venus suitable for life?</em> Beyond the solar system, Cockell frames Venus as a critical <strong>template for extrasolar greenhouse planets</strong>, using it to establish baseline habitability constraints that should guide spectroscopic observations of Venus-like exoplanets. The paper systematically examines each environmental parameter (temperature, pressure, atmospheric composition, UV radiation, pH) to identify which are true biological barriers and which are surmountable based on what we know from terrestrial extremophiles.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The paper provides a rigorous, parameter-by-parameter assessment of Venus&rsquo;s habitability. The key insight is that <strong>temperature acts as the critical constraint</strong>, establishing a hierarchy for greenhouse planets where thermal limits are reached well before pressure limits. This suggests that surface pressure is rarely the primary exclusion factor for life on Venus-like exoplanets. While the surface is sterile, the cloud layers between 48-57 km altitude present a more nuanced picture where temperature and pressure fall within habitable ranges, though extreme acidity and low water activity pose the primary biological challenges.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This review paper evaluates Venus&rsquo;s environmental conditions by synthesizing data from the Venera and Pioneer missions and comparing them against the documented limits of terrestrial extremophiles (thermophiles like <em>Pyrolobus fumarii</em>, acidophiles like <em>Picrophilus</em>, and obligate barophiles from the Mariana Trench). It assesses theoretical metabolic pathways based on available chemical energy sources in the clouds.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<p>The paper concludes that:</p>
<ul>
<li><strong>Surface</strong>: Uninhabitable due to extreme temperature ($464^\circ\text{C}$), which exceeds biochemical limits</li>
<li><strong>Cloud layers (48-57 km)</strong>: Physically compatible with life (temperature, pressure, nutrients) but extreme acidity ($81\text{&ndash;}98\%\ \text{H}_2\text{SO}_4$, $\text{pH} \approx 0$) and low water activity present severe challenges</li>
<li><strong>Early Venus</strong>: May have had habitable oceans during a &ldquo;moist greenhouse&rdquo; period, with possible interplanetary exchange with early Earth</li>
<li><strong>Future missions</strong>: Should target cloud samples between 48-57 km altitude and look for sulfur isotope fractionation as biosignatures</li>
</ul>
<h2 id="how-cockells-1999-predictions-hold-up-today">How Cockell&rsquo;s 1999 Predictions Hold Up Today</h2>
<p>From a modern perspective (2026), Cockell&rsquo;s analysis remains the foundational baseline for Venusian astrobiology, though specific details have evolved:</p>
<ul>
<li><strong>Phosphine Detection (2020)</strong>: Cockell correctly identified the importance of searching for non-equilibrium trace gases. The <a href="https://doi.org/10.1038/s41550-020-1174-4">claimed detection of phosphine</a> ($\text{PH}_3$) in 2020 reignited interest in the cloud layer hypothesis, but subsequent re-analysis reduced the reported abundance from ~20 ppb to ~1 ppb, and multiple independent teams (Snellen et al. 2020; Villanueva et al. 2021; Thompson 2021) disputed the signal entirely as a likely instrument artifact. The current consensus leans toward a non-detection, though the question remains open pending new observations.</li>
<li><strong>Water Activity Limits (2021)</strong>: Later work (e.g., by <a href="https://doi.org/10.1038/s41550-021-01391-3">Hallsworth et al.</a>) quantified the water activity in Venus&rsquo;s clouds as ~0.004, far below the limit for terrestrial life (~0.585). This reinforces Cockell&rsquo;s concern that acidity and desiccation are the primary barriers, potentially even more severe than he estimated.</li>
<li><strong>Upcoming Missions</strong>: <strong><a href="https://en.wikipedia.org/wiki/DAVINCI">DAVINCI</a></strong> (probe descent June 2031) directly targets the deep atmosphere and cloud chemistry, fulfilling the &ldquo;Descent Probe&rdquo; requirement outlined in this 1999 paper. <strong><a href="https://en.wikipedia.org/wiki/VERITAS_(spacecraft)">VERITAS</a></strong> was selected for global surface mapping but was placed on indefinite hold by NASA in late 2022; its schedule remains uncertain.</li>
</ul>
<h2 id="physical-limits-of-the-venusian-surface">Physical Limits of the Venusian Surface</h2>
<p>The paper evaluates surface conditions against the known limits of terrestrial extremophiles.</p>
<h3 id="temperature-critical-constraint">Temperature (Critical Constraint)</h3>















<figure class="post-figure center ">
    <img src="/img/notes/planetary-science/pyrolobus-fumarii.webp"
         alt="Electron microscope image of Pyrolobus fumarii showing irregular coccoid cell structure"
         title="Electron microscope image of Pyrolobus fumarii showing irregular coccoid cell structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Electron microscope image of <em>Pyrolobus fumarii</em>, which grows optimally at 106°C and defines the upper temperature limit for known life at 113°C. (Manfred Rohde, <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>)</figcaption>
    
</figure>

<ul>
<li><strong>Condition</strong>: The surface is almost globally isothermal at <strong>$464^\circ\text{C}$</strong>.</li>
<li><strong>Biological Limit</strong>: While the known limit at the time was <strong>$113^\circ\text{C}$</strong> (<a href="https://en.wikipedia.org/wiki/Pyrolobus_fumarii"><em>Pyrolobus fumarii</em></a>), Cockell posits a <strong>generic theoretical upper limit of $150^\circ\text{C}$</strong> for his analysis.</li>
<li><strong>Biochemical Barrier</strong>: This theoretical limit sits well below <strong>$250^\circ\text{C}$</strong>, where most peptide bonds hydrolyze in less than 11 minutes (aspartate peptide bonds in less than 1 minute) and ATP decomposes in about 1 second.</li>
<li><strong>Conclusion</strong>: The surface temperature is a hard limit to life. Liquid water cannot exist because $464^\circ\text{C}$ exceeds the critical temperature of water ($374^\circ\text{C}$).</li>
</ul>
<h3 id="pressure-habitable-range">Pressure (Habitable Range)</h3>
<ul>
<li><strong>Condition</strong>: Surface pressure is <strong>9.5 MPa</strong> (~93 atm).</li>
<li><strong>Biological Context</strong>: This is equivalent to ~950 m ocean depth on Earth.</li>
<li><strong>Limit</strong>: Life exists at the Mariana Trench (~110 MPa); researchers have isolated obligate barophiles (such as <em>Shewanella</em>, <em>Moritella</em>, and <em>Colwellia</em>) that grow optimally at high pressures.</li>
<li><strong>Conclusion</strong>: Pressure levels on the surface are within the known tolerance range for piezophilic life.</li>
</ul>
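<p>The &ldquo;~950 m ocean depth&rdquo; equivalence can be sanity-checked with the hydrostatic relation $P = \rho g h$. A minimal sketch, assuming typical seawater density and Earth gravity (values are mine, not the paper&rsquo;s):</p>

```python
# Rough hydrostatic check of the "~950 m ocean depth" equivalence.
# Seawater density and g are assumed values, not taken from the paper.
RHO_SEAWATER = 1025.0  # kg/m^3, typical seawater density
G_EARTH = 9.81         # m/s^2

def equivalent_ocean_depth(pressure_pa: float) -> float:
    """Ocean depth (m) whose hydrostatic pressure matches pressure_pa."""
    return pressure_pa / (RHO_SEAWATER * G_EARTH)

venus_surface_pa = 9.5e6  # 9.5 MPa surface pressure
print(f"~{equivalent_ocean_depth(venus_surface_pa):.0f} m")  # ~945 m
```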
<h3 id="atmospheric-composition-bio-compatible">Atmospheric Composition (Bio-Compatible)</h3>















<figure class="post-figure center ">
    <img src="/img/notes/planetary-science/cyanidium-caldarium.webp"
         alt="Microscope image of Cyanidium and Cyanidiococcus cells showing nucleus, plastid, and mitochondria"
         title="Microscope image of Cyanidium and Cyanidiococcus cells showing nucleus, plastid, and mitochondria"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Thermoacidophilic algae <em>Cyanidium</em> (left) and <em>Cyanidiococcus</em> (right), which can tolerate pure CO₂ atmospheres. (Cho et al. 2020, <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>)</figcaption>
    
</figure>

<ul>
<li><strong>Condition</strong>: $96.5\%\ \text{CO}_2$, $3.5\%\ \text{N}_2$.</li>
<li><strong>Biological Context</strong>: Terrestrial algae like <em>Cyanidium caldarium</em> can tolerate pure $\text{CO}_2$. High $\text{CO}_2$ actually makes carbon assimilation energetically easier compared to Earth&rsquo;s 0.03%.</li>
</ul>
<h3 id="surface-acidity-indeterminate">Surface Acidity (Indeterminate)</h3>
<ul>
<li><strong>Condition</strong>: $\text{SO}_2$ and $\text{SO}_3$ in the atmosphere react with surface minerals to form sulfates. The surface lacks liquid acid, and the mineral chemistry is extremely oxidizing and sulfurous.</li>
<li><strong>Biological Context</strong>: Terrestrial thermoacidophiles (e.g., <em>Acidianus infernus</em>, which grows optimally at $88^\circ\text{C}$ with a pH range of 0.5&ndash;5.5) survive in hot, sulfur-rich, acidic environments. However, these organisms all require liquid water.</li>
<li><strong>Conclusion</strong>: Surface acidity is secondary to temperature as a constraint, and the surface provides no supportive chemistry for life.</li>
</ul>
<h3 id="uv-radiation-not-a-constraint">UV Radiation (Not a Constraint)</h3>
<ul>
<li><strong>Condition</strong>: The thick atmosphere ($\text{CO}_2$) scatters most harmful UVC/UVB via Rayleigh scattering, while sulfur-based absorbers in the upper clouds remove the penetrating remainder.</li>
<li><strong>Evolutionary Argument</strong>: The UV flux in the upper clouds is comparable to the surface of <strong>Archean Earth</strong> (when life evolved), despite Venus being closer to the Sun.</li>
<li><strong>Conclusion</strong>: Since life emerged on Earth under similar radiation conditions, UV flux cannot be considered a life-limiting constraint on Venus today or in its past.</li>
</ul>
<h2 id="the-cloud-habitat-a-potential-niche">The Cloud Habitat: A Potential Niche?</h2>
<p>The paper identifies a &ldquo;habitable zone&rdquo; within the lower and middle cloud layers where physical parameters relax.</p>
<h3 id="altitude-and-conditions-48-68-km">Altitude and Conditions (48-68 km)</h3>
<p>The three cloud layers span very different conditions. The <strong>lower and middle layers (48–57 km) are the most relevant for habitability</strong>: temperature and pressure fall within terrestrial extremophile tolerances there. The upper cloud layer (57–68 km) falls below the freezing point, further limiting metabolic activity. Note that H₂SO₄ concentration increases with depth, so the layers with the most favorable temperature and pressure also carry the highest acidity.</p>
<p>Cockell&rsquo;s Table 1 summarizes the key parameters:</p>
<table>
  <thead>
      <tr>
          <th>Layer</th>
          <th>Altitude</th>
          <th>Temperature</th>
          <th>Particle Sizes (modes, $\mu$m)</th>
          <th>Number/cm$^3$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Upper Cloud</td>
          <td>57–68 km</td>
          <td>$-40^\circ\text{C}\text{&ndash;}0^\circ\text{C}$</td>
          <td>0.30, 2.10</td>
          <td>200–350</td>
      </tr>
      <tr>
          <td>Middle Cloud</td>
          <td>51–57 km</td>
          <td>$0^\circ\text{C}\text{&ndash;}38^\circ\text{C}$</td>
          <td>0.30, 2.80, 6.70</td>
          <td>250–350</td>
      </tr>
      <tr>
          <td>Lower Cloud</td>
          <td>48–51 km</td>
          <td>$38^\circ\text{C}\text{&ndash;}60^\circ\text{C}$</td>
          <td>0.30, 2.80, 6.70</td>
          <td>50–150</td>
      </tr>
  </tbody>
</table>
<p>The overall $\text{H}_2\text{SO}_4$ concentration ranges from approximately 81% in the upper cloud layer to 98% in the lower layers. Pressures range from 0.1 to 1.0 MPa across the cloud deck.</p>
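<p>Table 1&rsquo;s thermal split can be made concrete with a toy screen of each layer against a liquid-water window ($0^\circ\text{C}$ up to the $113^\circ\text{C}$ <em>Pyrolobus fumarii</em> limit). The threshold choices are mine, not Cockell&rsquo;s:</p>

```python
# Toy screen of Table 1's layers against a liquid-water thermal window
# (0 C up to the ~113 C Pyrolobus fumarii limit); thresholds are my choice.
LAYERS = {
    "upper (57-68 km)":  (-40, 0),
    "middle (51-57 km)": (0, 38),
    "lower (48-51 km)":  (38, 60),
}

def overlaps_habitable(t_lo: float, t_hi: float,
                       t_min: float = 0.0, t_max: float = 113.0) -> bool:
    """True if the layer's range shares an interior with [t_min, t_max]."""
    return t_hi > t_min and t_lo < t_max

candidates = [name for name, (lo, hi) in LAYERS.items()
              if overlaps_habitable(lo, hi)]
print(candidates)  # the middle and lower layers pass; the upper is subfreezing
```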
<ul>
<li><strong>Droplet Size</strong>: Particles range from 0.3 to 6.7 $\mu$m across three modes, sufficient in diameter to enclose bacteria (0.2–2 $\mu$m) and even bacterial assemblages.</li>
<li><strong>Residence Time</strong>: Using Stokes&rsquo; law, Cockell calculates that an assemblage of 5-10 bacteria (average size 1.1 $\mu$m) would take <strong>over 200 days</strong> to drop through the lower cloud layer. This exceeds the division time of most bacteria by three orders of magnitude or more, meaning a population could reproduce far faster than it rains out.</li>
</ul>
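<p>The residence-time argument can be sketched with Stokes&rsquo; law. Every parameter value below is an illustrative assumption rather than the paper&rsquo;s exact input, but the result lands in the same regime as Cockell&rsquo;s &gt;200 days:</p>

```python
# Stokes-law sketch of the residence-time argument; every parameter value
# here is an illustrative assumption, not the paper's exact input.
def stokes_velocity(radius_m, rho_particle, rho_gas, g, viscosity):
    """Terminal settling velocity (m/s) of a small sphere, Stokes regime."""
    return 2.0 * (rho_particle - rho_gas) * g * radius_m**2 / (9.0 * viscosity)

RADIUS = 0.55e-6   # m: half of a 1.1 um assemblage diameter
RHO_CELL = 1000.0  # kg/m^3: roughly water-density biomass
RHO_CO2 = 1.7      # kg/m^3: CO2 gas near the lower cloud deck (assumed)
G_VENUS = 8.87     # m/s^2: Venus surface gravity
MU_CO2 = 1.5e-5    # Pa*s: approximate CO2 viscosity at ~310 K

v = stokes_velocity(RADIUS, RHO_CELL, RHO_CO2, G_VENUS, MU_CO2)
days = 3_000.0 / v / 86_400  # fall time through the ~3 km lower layer
print(f"v ~ {v:.1e} m/s, ~{days:.0f} days")  # hundreds of days to rain out
```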
<h3 id="the-primary-challenge-acidity-and-water-activity">The Primary Challenge: Acidity and Water Activity</h3>
<ul>
<li><strong>Acidity</strong>: Cloud droplets are composed of concentrated sulfuric acid, ranging from <strong>$\approx 81\%$</strong> in the upper clouds to <strong>$\approx 98\%$</strong> in the lower layers.</li>
<li><strong>pH</strong>: The pH is effectively <strong>0</strong>.</li>
<li><strong>Biological Limit</strong>: While terrestrial acidophiles (e.g., <em>Picrophilus</em>) grow at pH 0, they require high water activity. The hygroscopic nature of concentrated $\text{H}_2\text{SO}_4$ creates extreme desiccation (osmotic) stress. Microbes typically combat this by synthesizing &ldquo;biocompatible solutes&rdquo; (like betaine, proline, or glycerol) to balance internal pressure, but the energy cost at this extreme may be prohibitive.</li>
</ul>
<h2 id="metabolism-in-the-clouds-theoretical">Metabolism in the Clouds (Theoretical)</h2>
<p>If a microbe could survive the acidity, the paper proposes a theoretical metabolism based on the sulfur cycle.</p>
<h3 id="energy-sources">Energy Sources</h3>
<ul>
<li><strong>Photosynthesis</strong>: Solar flux at the bottom of the cloud layer is ~15% of incident light (about half that on Earth&rsquo;s surface on a clear day), sufficient to drive photosynthesis.</li>
</ul>
<h3 id="chemoautotrophy">Chemoautotrophy</h3>
<ul>
<li><strong>Electron Acceptor</strong>: Sulfate ($\text{SO}_4^{2-}$) is abundant.</li>
<li><strong>Electron Donors</strong>: Hydrogen ($\text{H}_2$) exists at ~25 ppm; carbon monoxide ($\text{CO}$) exists at 30&ndash;50 ppm.</li>
<li><strong>Analogs</strong>: Terrestrial sulfate-reducing bacteria (e.g., <em>Desulfobacterium autotrophicum</em>) serve as biochemical templates.</li>
</ul>
<h3 id="nutrients">Nutrients</h3>
<ul>
<li><strong>Phosphorus</strong>: Present (likely as phosphoric acid).</li>
<li><strong>Nitrogen</strong>: 3.5% of atmosphere, available for fixation.</li>
</ul>
<h2 id="early-venus-and-evolutionary-implications">Early Venus and Evolutionary Implications</h2>
<h3 id="moist-greenhouse-model">Moist Greenhouse Model</h3>
<ul>
<li>Deuterium/Hydrogen ratios suggest early Venus had ~100x more water than today.</li>
<li>A &ldquo;moist greenhouse&rdquo; period may have existed with hot oceans ($&lt; 100^\circ\text{C}$) for several hundred million years.</li>
</ul>
<h3 id="interplanetary-ecology">Interplanetary Ecology</h3>
<ul>
<li>High impact rates on early Earth favored thermophiles.</li>
<li>Transfer of material between Earth and Venus suggests a possible early &ldquo;interplanetary ecology&rdquo; where life could have transferred to Venusian oceans before the runaway greenhouse took over.</li>
</ul>
<h2 id="venus-as-an-exoplanet-analog">Venus as an Exoplanet Analog</h2>
<p>Cockell explicitly frames Venus as a template for understanding extrasolar greenhouse planets. By defining the <strong>sequence of habitability constraints</strong>, the paper argues that temperature becomes a limiting factor well before pressure.</p>
<ul>
<li><strong>Hierarchy of Limits</strong>: On runaway greenhouse planets, surface temperatures will exceed biochemical limits ($&gt;150^\circ\text{C}$) long before pressures exceed piezophilic limits (&gt;110 MPa).</li>
<li><strong>Spectroscopic Strategy</strong>: Consequently, exoplanet surveys should prioritize thermal characterization over pressure estimates when screening for surface habitability. High atmospheric pressure is not, in itself, a disqualifier for life.</li>
</ul>
<h2 id="future-directions-and-search-strategies">Future Directions and Search Strategies</h2>
<p>The paper concludes with specific recommendations for exobiology missions.</p>
<h3 id="planetary-protection">Planetary Protection</h3>
<p>The extreme acidity and temperature of the lower atmosphere likely sterilize incoming spacecraft, mitigating contamination risks.</p>
<h3 id="proposed-missions">Proposed Missions</h3>
<ul>
<li><strong>Descent Probe</strong>: Equipped with a sample collector arm to analyze cloud droplets between 48-57 km.</li>
<li><strong>Balloon Mission</strong>: A free-floating platform to study cloud chemistry and potentially culture organisms in situ.</li>
</ul>
<h3 id="key-biomarkers-to-search-for">Key Biomarkers to Search For</h3>
<ul>
<li><strong>Isotopic Fractionation</strong>: Biological sulfate reduction prefers $^{32}\text{S}$ over $^{34}\text{S}$; analyzing sulfur isotopes in rocks could reveal past life.</li>
<li><strong>Trace Gases</strong>: Precise measurement of non-equilibrium gases ($\text{H}_2, \text{CO}$) in the clouds.</li>
</ul>
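<p>The fractionation signal is conventionally reported in delta notation, $\delta^{34}\text{S} = (R_{\text{sample}}/R_{\text{std}} - 1) \times 1000$, where $R$ is the $^{34}\text{S}/^{32}\text{S}$ ratio. A minimal sketch with hypothetical ratios (illustration only; Cockell does not give numerical examples):</p>

```python
# Delta notation behind sulfur-isotope biosignatures: biological sulfate
# reduction favors 32S, leaving sulfide products depleted in 34S.
R_VCDT = 0.0441626  # 34S/32S ratio of the V-CDT reference standard

def delta_34s(r_sample: float) -> float:
    """delta-34S in per mil relative to V-CDT."""
    return (r_sample / R_VCDT - 1.0) * 1000.0

# Hypothetical ratios chosen for illustration only: a sulfate source and a
# biologically reduced sulfide depleted in the heavy isotope.
sulfate_ratio = 0.04425
sulfide_ratio = 0.04330
print(f"sulfate {delta_34s(sulfate_ratio):+.1f} permil, "
      f"sulfide {delta_34s(sulfide_ratio):+.1f} permil")
```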
<h3 id="earth-based-research-the-missing-venus-analog">Earth-Based Research: The Missing Venus Analog</h3>
<p>We have yet to find a terrestrial microbe that is simultaneously <strong>hyperthermophilic</strong>, <strong>acidophilic</strong>, and capable of <strong>extreme osmoregulation</strong>. Cockell identifies four potential explanations for this absence, each with different implications for whether Venusian life is possible:</p>
<ol>
<li><strong>Energetic Limitations</strong>: The adaptations required (synthesis of biocompatible solutes, continuous proton pumping against low pH, and synthesis of heat shock proteins and thermally stable proteins) are likely to be energetically demanding. The cumulative energy cost of multiple extreme adaptations may exceed what phototrophy or chemoautotrophy can supply. Cockell highlights this as an area needing more theoretical and laboratory experimentation.</li>
<li><strong>Biochemical Incompatibilities</strong>: Some adaptations to extreme environmental parameters may be possible individually but not simultaneously at great extremes for all parameters. Since our knowledge of many of these adaptations is still in its infancy, evaluating these interrelationships in detail for Venus is difficult.</li>
<li><strong>Habitat Limitation on Earth</strong>: Earth simply lacks stable environments that combine all Venus-like stressors. Deep-sea hydrothermal vents provide high temperature and pressure but not extreme acidity or low water activity. Hot springs can be acidic but rarely exceed $90\text{&ndash;}95^\circ\text{C}$. The absence of such combined habitats means evolution has not been driven to produce polyextremophiles.</li>
<li><strong>Insufficient Exploration of the Biosphere</strong>: Studies of organisms in hot regions of the deep subsurface through deep-drilling may yield additional insights. Subsurface organisms subjected to high temperatures and low water activities would provide a useful biochemical template for understanding adaptation requirements relevant to Venus-like environments.</li>
</ol>
<h2 id="comparative-parameter-summary">Comparative Parameter Summary</h2>
<p>Cockell&rsquo;s Table 2 provides a side-by-side assessment of key environmental parameters across Venus&rsquo;s surface, lower cloud layer, early Venus, and generic extrasolar Venus-like planets:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Venus Surface</th>
          <th style="text-align: left">Lower Clouds (48–51 km)</th>
          <th style="text-align: left">Early Venus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Temperature</td>
          <td style="text-align: left">$464^\circ\text{C}$ (<strong>lethal</strong>)</td>
          <td style="text-align: left">$38\text{&ndash;}60^\circ\text{C}$ (habitable)</td>
          <td style="text-align: left">Possibly $&lt; 100^\circ\text{C}$ in oceans</td>
      </tr>
      <tr>
          <td style="text-align: left">Pressure</td>
          <td style="text-align: left">~93 bar (habitable)</td>
          <td style="text-align: left">~1 bar (habitable)</td>
          <td style="text-align: left">~93 bar at surface</td>
      </tr>
      <tr>
          <td style="text-align: left">Atmospheric gas</td>
          <td style="text-align: left">$\text{CO}_2$ (tolerable)</td>
          <td style="text-align: left">$\text{CO}_2$ (tolerable)</td>
          <td style="text-align: left">$\text{CO}_2/\text{H}_2\text{O}$</td>
      </tr>
      <tr>
          <td style="text-align: left">$\text{H}_2\text{SO}_4$</td>
          <td style="text-align: left">Minerals only</td>
          <td style="text-align: left">~98% (<strong>lethal water activity</strong>)</td>
          <td style="text-align: left">Absent (water present)</td>
      </tr>
      <tr>
          <td style="text-align: left">UV radiation</td>
          <td style="text-align: left">Absent (shielded)</td>
          <td style="text-align: left">~Archean Earth (tolerable)</td>
          <td style="text-align: left">Unknown</td>
      </tr>
      <tr>
          <td style="text-align: left">Liquid water</td>
          <td style="text-align: left">Absent</td>
          <td style="text-align: left">Absent (acid droplets only)</td>
          <td style="text-align: left">Possibly present</td>
      </tr>
      <tr>
          <td style="text-align: left">Overall verdict</td>
          <td style="text-align: left"><strong>Uninhabitable</strong></td>
          <td style="text-align: left">Physically possible, chemistry severe</td>
          <td style="text-align: left"><strong>Potentially habitable</strong></td>
      </tr>
  </tbody>
</table>
<p>The table highlights that early Venus is the most favorable scenario, while the present surface is definitively uninhabitable and the cloud layer is a physical-but-not-chemical niche.</p>
<h2 id="connecting-habitability-to-terraforming">Connecting Habitability to Terraforming</h2>
<p>Understanding the baseline habitability of Venus is the first step in conceptualizing planetary engineering. The extreme limits identified here, especially the $464^\circ\text{C}$ surface temperature and $81\text{&ndash;}98\%\ \text{H}_2\text{SO}_4$ clouds, must be mitigated before complex life can take hold.</p>
<p>To explore how we might overcome these physical limits and engineer a second Earth, read my notes on:</p>
<ul>
<li><a href="/notes/interdisciplinary/planetary-science/surface-of-venus/">The Surface of Venus</a> for details on the geological constraints.</li>
<li><a href="/notes/interdisciplinary/planetary-science/venus-evolution-through-time/">Venus Evolution Through Time</a> for the history of its climate catastrophe and potential paths to recovery.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>This is a 1999 theoretical review paper with no associated code, datasets, or models. The paper synthesizes existing mission data (Venera, Pioneer) and published extremophile literature. All environmental parameters cited are drawn from publicly available planetary science databases. The paper is published in <em>Planetary and Space Science</em> (Elsevier), which is paywalled, and no open-access preprint exists (pre-arXiv era for this field).</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cockell, C. S. (1999). Life on Venus. <em>Planetary and Space Science</em>, 47(12), 1487-1501. <a href="https://doi.org/10.1016/S0032-0633(99)00036-7">https://doi.org/10.1016/S0032-0633(99)00036-7</a></p>
<p><strong>Publication</strong>: <em>Planetary and Space Science</em>, 1999</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Life_on_Venus">Wikipedia Article</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cockell1999life,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Cockell, Charles S.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Life on {Venus}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Planetary and Space Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1487--1501}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1999}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1016/S0032-0633(99)00036-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>NInChI: Toward a Chemical Identifier for Nanomaterials</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</guid><description>NInChI (Nanomaterials InChI) extends chemical identifiers to represent complex, multi-component nanomaterials.</description><content:encoded><![CDATA[<h2 id="a-new-standard-for-nanoinformatics">A New Standard for Nanoinformatics</h2>
<p>This is a <strong>systematization paper</strong> that proposes a new standard, the NInChI, to address a fundamental gap in nanoinformatics: the lack of a standardized identifier for complex nanomaterials. The result of a collaborative workshop organized by the H2020 research infrastructure NanoCommons and the nanoinformatics project NanoSolveIT, this work uses <strong>six detailed case studies</strong> to systematically develop a <strong>hierarchical, machine-readable notation</strong> for complex nanomaterials that could work across experimental research, regulatory frameworks, and computational modeling.</p>
<h2 id="the-breakdown-of-traditional-chemical-identifiers">The Breakdown of Traditional Chemical Identifiers</h2>
<p>Chemoinformatics has fantastic tools for representing small molecules: SMILES strings, InChI identifiers, and standardized databases that make molecular data searchable and shareable. But when you step into nanotechnology, everything breaks down.</p>
<p>Consider trying to describe a gold nanoparticle with a silica shell and organic surface ligands. How do you capture:</p>
<ul>
<li>The gold core composition and size</li>
<li>The silica shell thickness and interface</li>
<li>The surface chemistry and ligand density</li>
<li>The overall shape and morphology</li>
</ul>
<p>There&rsquo;s simply no standardized way to represent this complexity in a machine-readable format. This creates massive problems for:</p>
<ul>
<li><strong>Data sharing</strong> between research groups</li>
<li><strong>Regulatory assessment</strong> where precise identification matters</li>
<li><strong>Computational modeling</strong> that needs structured input</li>
<li><strong>Database development</strong> and search capabilities</li>
</ul>
<p>Without a standard notation, nanomaterials research suffers from the same data fragmentation that plagued small molecule chemistry before SMILES existed.</p>
<h2 id="the-five-tier-nanomaterial-description-hierarchy">The Five-Tier Nanomaterial Description Hierarchy</h2>
<p>The authors propose NInChI (Nanomaterials InChI), a layered extension to the existing InChI system. The core insight is organizing nanomaterial description from the inside out, following the OECD&rsquo;s framework for risk assessment, with a five-tier hierarchy:</p>
<ol>
<li><strong>Tier 1: Chemical Composition</strong>: What is the core made of? This differentiates uniform compositions (Tier 1.1), randomly mixed (Tier 1.2), ordered core-shell materials (Tier 1.3), and onion-like multi-shell morphologies (Tier 1.4).</li>
<li><strong>Tier 2: Morphology</strong>: What shape, size, and dimensionality? This encodes dimension (0D-3D), size and size distribution, and shape information.</li>
<li><strong>Tier 3: Surface Properties</strong>: Physical and chemical surface parameters such as charge, roughness, and hydrophobicity. Many of these depend on external conditions (pH, solvent, temperature).</li>
<li><strong>Tier 4: Surface Functionalization</strong>: How are coatings attached to the core? This includes functionalization density, orientation, and binding type (covalent vs. non-covalent).</li>
<li><strong>Tier 5: Surface Ligands</strong>: What molecules are on the surface, their density, orientation, and distribution?</li>
</ol>
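<p>As a rough illustration (not the paper&rsquo;s actual string grammar), the five tiers map naturally onto a nested record, inside-out from core composition to surface ligands. The field names and example values below are hypothetical:</p>

```python
# Hypothetical illustration of the five-tier layering as a nested record;
# the actual NInChI string grammar is defined in the paper, not here.
from dataclasses import dataclass, field

@dataclass
class NanoDescription:
    composition: str                                       # Tier 1: core chemistry
    morphology: dict = field(default_factory=dict)         # Tier 2: shape/size/dimension
    surface: dict = field(default_factory=dict)            # Tier 3: surface properties
    functionalization: dict = field(default_factory=dict)  # Tier 4: coating attachment
    ligands: list = field(default_factory=list)            # Tier 5: surface molecules

# Example: a citrate-capped gold nanoparticle (illustrative values)
gold_np = NanoDescription(
    composition="Au",
    morphology={"dimension": "0D", "diameter_nm": 15, "shape": "sphere"},
    surface={"charge": "negative"},
    functionalization={"binding": "non-covalent"},
    ligands=["citrate"],
)
print(gold_np.composition, gold_np.ligands)
```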
<p>This hierarchy captures the essential information needed to distinguish between different nanomaterials while building on familiar chemical concepts.</p>
<h2 id="testing-the-standard-six-case-studies">Testing the Standard: Six Case Studies</h2>
<p>The authors tested their concept against six real-world case studies to identify what actually matters in practice.</p>
<p><strong>Case Study 1: Gold Nanoparticles</strong></p>
<p>Gold NPs provided a relatively simple test case: an inert metallic core with various surface functionalizations. Key insights: core composition and size are essential, surface chemistry (what molecules are attached) matters critically, shape affects properties, and dynamic properties like protein corona formation belong outside the intrinsic NInChI representation. This established the boundary: NInChI should capture intrinsic, stable properties.</p>
<p><strong>Case Study 2: Graphene-Family NMs</strong></p>
<p>Carbon nanotubes and graphene introduced additional complexity: dimensionality (1D tubes vs 2D sheets vs 0D fullerenes), chirality (the (n,m) vector that defines a nanotube&rsquo;s structure), defects and impurities that can alter properties, and number of layers (for nanotubes, single-wall vs multi-wall). This case showed that the notation needed to handle both topological complexity and chemical composition.</p>
<p><strong>Case Study 3: Complex Engineered (Doped and Multi-Metallic) NMs</strong></p>
<p>Doped materials, alloys, and core-shell structures revealed key requirements: the notation must distinguish true alloys (homogeneous mixing) from core-shell structures with the same overall composition, crystal structure information becomes crucial, and component ratios must be precisely specified. The case study assessed whether the MInChI extension could represent these solid solutions.</p>
<p><strong>Case Study 4: Database Applications</strong></p>
<p>The FAIR (Findable, Accessible, Interoperable, Reusable) principles guided this analysis. NInChI addresses real database problems: it provides greater specificity than CAS numbers (which lack nanoform distinction), offers a systematic alternative to ad-hoc naming schemes, and enables machine-searchability.</p>
<p><strong>Case Study 5: Computational Modeling</strong></p>
<p>This explored several applications: automated descriptor generation from NInChI structure, read-across predictions for untested materials, and model input preparation from standardized notation. The layered structure provides structured input that computational tools need for both physics-based and data-driven nanoinformatics approaches.</p>
<p><strong>Case Study 6: Regulatory Applications</strong></p>
<p>Under frameworks like REACH, regulators need to distinguish between different &ldquo;nanoforms&rdquo;, which are materials with the same chemical composition but different sizes, shapes, or surface treatments. NInChI directly addresses this by encoding the specific properties that define regulatory categories, providing precision sufficient for legal definitions and risk assessment frameworks.</p>
<h2 id="the-ninchi-alpha-specification-in-practice">The NInChI Alpha Specification in Practice</h2>
<p>Synthesizing insights from all six case studies, the authors propose the <strong>NInChI alpha specification</strong> (version 0.00.1A), a three-layer structure. Importantly, the paper distinguishes the five-tier NM description hierarchy (outlined above) from the three-layer NInChI notation itself: NM properties from the five tiers are encoded into these three notation layers:</p>
<p><strong>Layer 1 (Version Number)</strong>: Standard header indicating the NInChI version, denoted as <code>0.00.1A</code> for the alpha version. This follows the convention of all <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>-based notations.</p>
<p><strong>Layer 2 (Composition)</strong>: Each component (core, shell, ligands, impurities, dopants, linkers) gets described using standard InChI (or PInChI/MInChI) for chemical composition, with additional sublayers for morphology (prefix <code>m</code>, e.g., <code>sp</code> for sphere, <code>sh</code> for shell, <code>tu</code> for tube), size (prefix <code>s</code>, in scientific notation in meters), crystal structure (prefix <code>k</code>), and chirality (prefix <code>w</code> for carbon nanotubes). Components are separated by <code>!</code>.</p>
<p><strong>Layer 3 (Arrangement)</strong>: Specified with prefix <code>y</code>, this layer describes how the components from Layer 2 are combined, proceeding from inside out. A core-shell material is written as <code>y2&amp;1</code> where the numbers reference components in Layer 2. Covalent bonding between components is indicated with parentheses, e.g., <code>(1&amp;2&amp;3)</code> for a nano core with a covalently bound ligand coating.</p>
<p>The paper provides concrete worked examples from the case studies:</p>
<ul>
<li><strong>Silica with gold coating</strong> (20 nm silica, 2 nm gold shell):
<code>NInChI=0.00.1A/Au/msh/s2t10r1-9;12r2-9!/O2Si/c1-3-2/msp/s20d-9/k000/y2&amp;1</code></li>
<li><strong>CTAB-capped gold nanoparticle</strong> (20 nm diameter):
<code>NInChI=0.00.1A/Au/msp/s20d-9!C19H42N.BrH/c1-5-6-7.../y1&amp;2</code></li>
<li><strong>Chiral single-wall nanotube</strong> of the (3,1) type with 0.4 nm diameter:
<code>NInChI=0.00.1A/C/mtu/s4d-10/w(3,1)/y1</code></li>
</ul>
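<p>The layout above can be parsed mechanically. Here is a minimal sketch (my own, not an official parser): it assumes the conventions exactly as described, components separated by <code>!</code>, sublayers tagged by the prefixes <code>m</code>, <code>s</code>, <code>k</code>, <code>w</code>, a trailing <code>y</code> arrangement layer, and ignores PInChI/MInChI nesting:</p>

```python
# Hypothetical parser for the NInChI alpha layout described above;
# not an official implementation.
PREFIXES = {"m": "morphology", "s": "size", "k": "crystal", "w": "chirality"}

def parse_ninchi(string):
    if not string.startswith("NInChI="):
        raise ValueError("not a NInChI string")
    version, _, rest = string[len("NInChI="):].partition("/")
    segments = rest.split("/")
    # The arrangement (y) layer, when present, is the trailing segment.
    arrangement = segments.pop() if segments and segments[-1].startswith("y") else None
    components = []
    # Components (core, shell, ligands, ...) are separated by '!'.
    for comp in "/".join(segments).split("!"):
        fields = [f for f in comp.split("/") if f]
        if not fields:
            continue
        layers = {"composition": fields[0]}
        for f in fields[1:]:
            # Tag each sublayer by its single-letter prefix; anything
            # unrecognized is treated as an ordinary InChI layer.
            layers[PREFIXES.get(f[0], "inchi_" + f[0])] = f[1:]
        components.append(layers)
    return {"version": version, "components": components, "arrangement": arrangement}

# The chiral (3,1) nanotube example from the paper:
parsed = parse_ninchi("NInChI=0.00.1A/C/mtu/s4d-10/w(3,1)/y1")
```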
<p><strong>Property Prioritization</strong>: The case studies produced a prioritization of NM properties into four categories (Table 3 in the paper):</p>
<table>
  <thead>
      <tr>
          <th>Category 1: Must Have</th>
          <th>Category 2a: Nice to Have</th>
          <th>Category 2b: Extrinsic</th>
          <th>Category 3: Out of Scope</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemical composition</td>
          <td>Structural defects</td>
          <td>Surface charge</td>
          <td>Optical properties</td>
      </tr>
      <tr>
          <td>Size/size distribution</td>
          <td>Density</td>
          <td>Corona</td>
          <td>Magnetic properties</td>
      </tr>
      <tr>
          <td>Shape</td>
          <td>Surface composition</td>
          <td>Agglomeration state</td>
          <td>Chemical/oxidation state</td>
      </tr>
      <tr>
          <td>Crystal structure</td>
          <td></td>
          <td>Dispersion</td>
          <td></td>
      </tr>
      <tr>
          <td>Chirality</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>Ligand and ligand binding</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p><strong>Implementation</strong>: The authors built a prototype NInChI generation tool using the ZK framework with a Java backend, available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>. The tool lets users specify core composition, morphology, size, crystal structure, and chirality, then build outward by adding shells or clusters. InChIs for shell components are retrieved via the NCI/CADD chemical structure REST API.</p>
<p><strong>Limitations</strong>: The alpha version acknowledges areas for future development: nanocomposite and nanostructured materials, inverse NMs (nano holes in bulk material), and nanoporous materials are beyond current scope. Dynamic properties such as dissolution, agglomeration, and protein corona formation are excluded. The stochastic nature of NMs (e.g., broad size distributions) is not yet fully addressed. Covalent bonding between components needs further refinement.</p>
<p><strong>Impact</strong>: For researchers, NInChI enables precise structural queries for nanomaterials data sharing. For regulators, it provides systematic identification for risk assessment and nanoform classification under frameworks like REACH. For computational modelers, it enables automated descriptor generation and read-across predictions.</p>
<p><strong>Key Conclusions</strong>: The 8-month collaborative process demonstrates that creating systematic notation for nanomaterials is feasible. The hierarchical, inside-out organization provides an approach that satisfies experimentalists, modelers, database owners, and regulators. Testing against six case studies identified the essential features that must be captured. By extending InChI and reusing conventions from MInChI, RInChI, and PInChI, the work builds on existing infrastructure. The proposed NInChI alpha is intended to stimulate further analysis and refinement with the broader community and the InChI Trust.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Paper Accessibility</strong>: The paper is fully open-access under the CC BY 4.0 license, allowing for straightforward reading and analysis.</li>
<li><strong>Tools &amp; Code</strong>: The authors provided a prototype NInChI generation tool available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>, built using the ZK framework with a Java backend. The underlying backend code was not released as an open-source library.</li>
<li><strong>Documentation</strong>: The paper serves as the first alpha specification for community discussion and refinement. No formal algorithmic pseudocode for automated string parsing or generation from structured nanomaterials files (like <code>.cif</code>) is provided.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">NInChI Generator (Enalos Cloud)</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Prototype web tool for generating NInChI strings; backend not open-source</td>
      </tr>
      <tr>
          <td><a href="https://www.mdpi.com/2079-4991/10/12/2493">Paper (MDPI)</a></td>
          <td>Other</td>
          <td>CC BY 4.0</td>
          <td>Open-access alpha specification</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lynch, I., Afantitis, A., Exner, T., Himly, M., Lobaskin, V., Doganis, P., &hellip; &amp; Melagraki, G. (2020). Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies? <em>Nanomaterials</em>, <em>10</em>(12), 2493. <a href="https://doi.org/10.3390/nano10122493">https://doi.org/10.3390/nano10122493</a></p>
<p><strong>Publication</strong>: Nanomaterials (2020)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lynch2020inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lynch, Iseult and Afantitis, Antreas and Exner, Thomas and others}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nanomaterials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2493}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{MDPI}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.3390/nano10122493}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI: The Worldwide Chemical Structure Identifier Standard</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</guid><description>Heller et al. (2013) explain how IUPAC's InChI became the global standard for representing chemical structures, its governance, and current limitations.</description><content:encoded><![CDATA[<h2 id="inchi-as-a-resource-and-systematization-standard">InChI as a Resource and Systematization Standard</h2>
<p>This is a <strong>Resource &amp; Systematization Paper</strong> that reviews the history, technical architecture, governance structure, and implementation status of the InChI standard. It documents both the institutional development of an open chemical identifier and the technical specification that enables it.</p>
<h2 id="the-motivation-interoperability-in-chemical-databases">The Motivation: Interoperability in Chemical Databases</h2>
<p>Before InChI, the chemistry community faced a fundamental interoperability problem. Chemical databases used proprietary systems like CAS Registry Numbers, or implementation-dependent representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, whose canonical forms differ between toolkits. These identifiers were expensive, access-restricted, or confined to &ldquo;in-house&rdquo; databases.</p>
<p>The authors argue the Internet and Open Source software acted as a <strong>&ldquo;black swan&rdquo; event</strong> that disrupted this status quo. The Internet created a need to link diverse, free and fee-based resources without a central gatekeeper. InChI was designed as the solution: a non-proprietary, open-source identifier enabling linking of distinct data compilations.</p>
<h2 id="technical-and-institutional-innovations-of-inchi">Technical and Institutional Innovations of InChI</h2>
<p>InChI&rsquo;s innovation is both technical and institutional:</p>
<p><strong>Technical novelty</strong>: A hierarchical &ldquo;layered&rdquo; canonicalization system where structure representations build from basic connectivity to full stereochemistry. This allows flexible matching: a molecule with unknown stereochemistry produces an InChI that&rsquo;s a subset of the same molecule with known stereochemistry.</p>
<p><strong>Institutional novelty</strong>: Creating an open standard governed by a charitable trust (the InChI Trust) that convinced commercial competitors (publishers, databases) to adopt it as a &ldquo;pre-competitive&rdquo; necessity. This solved the political problem of maintaining an open standard in a competitive industry.</p>
<h3 id="technical-architecture-layers-and-hashing">Technical Architecture: Layers and Hashing</h3>
<h4 id="the-inchi-string">The InChI String</h4>
<p>InChI is a <strong>canonicalized structure representation</strong> derived from IUPAC conventions. It uses a hierarchical &ldquo;layered&rdquo; format where specific layers add detail. The exact technical specification includes these string segments:</p>
<ol>
<li><strong>Main Layer</strong>: Chemical Formula</li>
<li><strong>Connectivity Layer (<code>/c</code>)</strong>: Atoms and bonds (excluding bond orders)</li>
<li><strong>Hydrogen Layer (<code>/h</code>)</strong>: Tautomeric and immobile H atoms</li>
<li><strong>Charge (<code>/q</code>) &amp; Proton Balance (<code>/p</code>)</strong>: Accounting for ionization</li>
<li><strong>Stereochemistry</strong>:
<ul>
<li>Double bond (<code>/b</code>) and Tetrahedral (<code>/t</code>) parity</li>
<li>Parity inversion (<code>/m</code>)</li>
<li>Stereo type (<code>/s</code>): absolute, relative, or racemic</li>
</ul>
</li>
<li><strong>Fixed-H Layer (<code>/f</code>)</strong>: Distinguishes specific tautomers if needed</li>
</ol>
<p>This layered approach means that a molecule with unknown stereochemistry will have an InChI that&rsquo;s a subset of the same molecule with known stereochemistry. This allows for flexible matching at the connectivity level even without complete stereochemical information.</p>
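<p>That matching idea can be sketched directly. The snippet below (a simplification, not the official algorithm) keeps the version and formula segments, then filters the top-level stereo layers <code>/b</code>, <code>/t</code>, <code>/m</code>, <code>/s</code>, ignoring fixed-H and isotope sublayers; the standard InChIs for L- and D-alanine are used for illustration:</p>

```python
STEREO = ("b", "t", "m", "s")  # double bond, tetrahedral, parity inversion, stereo type

def connectivity_core(inchi):
    """Drop the stereo layers so two InChIs can be compared at the
    connectivity level -- the flexible matching the layered design permits.
    Simplified sketch: filters top-level stereo layers only."""
    segments = inchi.split("/")
    head, rest = segments[:2], segments[2:]  # version + formula always kept
    return "/".join(head + [seg for seg in rest if seg[:1] not in STEREO])

# L- and D-alanine differ only in their stereo layers:
l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
assert connectivity_core(l_ala) == connectivity_core(d_ala)
```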
<h4 id="the-inchikey">The InChIKey</h4>
<p>Because InChI strings can be too long for search engines (which break at ~30 characters or at symbols like <code>/</code> and <code>+</code>), the InChIKey was created.</p>
<p><strong>Mechanism</strong>: A 27-character string generated via a <strong>SHA-256 hash</strong> of the InChI string. This can be represented as:</p>
<p>$$ \text{InChIKey} = f_{\text{SHA-256}}(\text{InChI}) $$</p>
<p><strong>Structure</strong>:</p>
<ul>
<li><strong>Block 1 (14 characters)</strong>: Encodes the molecular skeleton (connectivity)</li>
<li><strong>Block 2 (10 characters)</strong>: Eight letters encoding stereochemistry and isotopes, plus a flag indicating standard InChI (S) and an InChI version indicator (A for version 1)</li>
<li><strong>Block 3 (1 character)</strong>: Protonation flag (e.g., &lsquo;N&rsquo; for neutral)</li>
</ul>
<p>Because the InChIKey is a hash, it cannot be converted back to a structure (irreversible) and has a theoretical risk of collision. It is important to distinguish between <strong>InChI collisions</strong> (which are due to flaws/bugs and are very rare) and <strong>InChIKey collisions</strong> (which are mathematically inevitable due to hashing).</p>
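<p>The block layout can be pulled apart directly. A small sketch (ethanol&rsquo;s InChIKey is used as the example; the field names here are my own labels for the blocks described above):</p>

```python
def split_inchikey(key):
    """Decompose a 27-character InChIKey into its three hyphen-separated blocks."""
    block1, block2, block3 = key.split("-")
    if len(block1) != 14 or len(block2) != 10 or len(block3) != 1:
        raise ValueError("not a well-formed InChIKey")
    return {
        "skeleton": block1,             # connectivity hash (14 chars)
        "stereo_isotopes": block2[:8],  # stereochemistry/isotope hash (8 chars)
        "kind": block2[8],              # 'S' = standard InChI
        "version": block2[9],           # 'A' = InChI version 1
        "protonation": block3,          # 'N' = neutral
    }

# Ethanol's InChIKey:
parts = split_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
```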
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This is a systematization paper documenting an existing standard. However, the authors provide:</p>
<p><strong>Validation evidence</strong>:</p>
<ul>
<li><strong>Certification Suite</strong>: A test suite that software vendors must pass to display the &ldquo;InChI Certified&rdquo; logo, preventing fragmentation</li>
<li><strong>Round-trip conversion testing</strong>: Demonstrated &gt;99% success rate converting InChI back to structure (100% with AuxInfo layer)</li>
<li><strong>Real-world adoption metrics</strong>: Documented integration across major chemical databases and publishers</li>
</ul>
<p><strong>Known limitations identified</strong>:</p>
<ul>
<li>Tautomer representation issues in Version 1 (different drawings of same tautomer can generate different InChIs)</li>
<li>Edge cases in stereochemistry representation</li>
</ul>
<h3 id="institutional-history--governance">Institutional History &amp; Governance</h3>
<p><strong>Origin</strong>: The project was initiated at a March 2000 IUPAC meeting in Washington, DC. It was originally called the <strong>IUPAC Chemical Identifier Project (IChIP)</strong>.</p>
<p><strong>Development</strong>: Technical work was done by NIST (Stein, Heller, Tchekhovskoi), overseen by the IUPAC <strong>CCINS</strong> committee, which later became the <strong>InChI Subcommittee</strong> of Division VIII.</p>
<p><strong>The InChI Trust</strong>: To ensure the algorithm survived beyond a volunteer organization, the <strong>InChI Trust</strong> was formed in 2009. It is a UK charity supported by publishers and databases (e.g., Nature, RSC) to maintain the standard pre-competitively. This was a critical innovation: getting commercial publishers and software vendors to agree that a non-proprietary standard would benefit everyone.</p>
<h2 id="real-world-impact-and-future-directions">Real-World Impact and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Success through &ldquo;un-coerced adoption&rdquo;</strong>: InChI succeeded because commercial competitors viewed it as a &ldquo;pre-competitive&rdquo; necessity for the Internet age. The open governance model proved durable.</p>
<p><strong>Technical achievements</strong>:</p>
<ul>
<li>Reversible representation (&gt;99% without AuxInfo, 100% with it)</li>
<li>Hierarchical structure enables flexible matching at different levels of detail</li>
<li>InChIKey enables web search despite being a hash (with inherent collision risk)</li>
</ul>
<h3 id="limitations-acknowledged-as-of-2013">Limitations Acknowledged (as of 2013)</h3>
<ul>
<li><strong>Tautomerism Issues</strong>: Different drawings of the same tautomer (e.g., 1,4-oxime vs nitroso) can generate different InChIs in Version 1, which is targeted for Version 2</li>
<li><strong>Hash collision risk</strong>: InChIKey collisions are mathematically inevitable due to SHA-256 hashing, though InChI collisions (actual bugs) are very rare</li>
<li><strong>Certification required</strong>: To prevent fragmentation, software must pass the InChI Certification Suite</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors note that while this paper documents the state as of 2013, InChI continues to evolve. Tautomer handling and edge cases in stereochemistry representation were priorities for future versions. The governance model through the InChI Trust was designed to ensure long-term maintenance beyond the original volunteer contributors.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This systematization paper documents an existing standard. Key implementation resources are openly maintained by the InChI Trust.</p>
<h3 id="code--software">Code &amp; Software</h3>
<ul>
<li><strong>Official Open Source Implementation</strong>: The C source code and pre-compiled binaries for the InChI algorithm are freely available via the <a href="https://www.inchi-trust.org/downloads/">InChI Trust Downloads Page</a> and their <a href="https://github.com/IUPAC-InChI/InChI">official GitHub repository</a>.</li>
<li><strong>Canonicalization algorithm</strong>: Open-source implementation of IUPAC-based rules for generating unique representations from multiple possible drawings of the same molecule.</li>
</ul>
<h3 id="data--validation">Data &amp; Validation</h3>
<ul>
<li><strong>InChI Certification Suite</strong>: A test suite of chemical structures provided by the InChI Trust used to validate that third-party software implementations generate correct InChIs.</li>
<li><strong>Version 1 specification</strong>: Complete technical documentation of the layered format.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Round-trip conversion</strong>: &gt;99% success rate (100% with AuxInfo) as validated by NIST and IUPAC.</li>
<li><strong>Certification testing</strong>: Pass/fail validation for software claiming InChI compliance.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., &amp; Pletnev, I. (2013). InChI - the worldwide chemical structure identifier standard. <em>Journal of Cheminformatics</em>, <em>5</em>(1), 7. <a href="https://doi.org/10.1186/1758-2946-5-7">https://doi.org/10.1186/1758-2946-5-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{heller2013inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{InChI} - the worldwide chemical structure identifier standard}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heller, Stephen and McNaught, Alan and Stein, Stephen and Tchekhovskoi, Dmitrii and Pletnev, Igor}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/1758-2946-5-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolRec: Chemical Structure Recognition at CLEF 2012</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</guid><description>MolRec achieves 95%+ accuracy on simple structures but struggles with complex diagrams, revealing rule-based OCSR limits and systematic failures.</description><content:encoded><![CDATA[<h2 id="systematization-of-rule-based-ocsr">Systematization of Rule-Based OCSR</h2>
<p>This is a <strong>Systematization</strong> paper that evaluates and analyzes MolRec&rsquo;s performance in the CLEF 2012 chemical structure recognition competition. The work provides systematic insights into how the improved MolRec system performed on different types of molecular diagrams and reveals structural challenges facing rule-based OCSR approaches through comprehensive failure analysis.</p>
<h2 id="investigating-the-limits-of-rule-based-recognition">Investigating the Limits of Rule-Based Recognition</h2>
<p>This work builds on the TREC 2011 competition, where a previous implementation of MolRec already performed well. The CLEF 2012 competition provided an opportunity to test an improved, more computationally efficient version of MolRec on different datasets and understand how performance varies across complexity levels.</p>
<p>The motivation is to understand exactly where rule-based chemical structure recognition breaks down. Examining the specific types of structures that cause failures provides necessary context for the high accuracy rates achieved on simpler structures.</p>
<h2 id="the-two-stage-molrec-architecture">The Two-Stage MolRec Architecture</h2>
<p>The novelty lies in the systematic evaluation across two different difficulty levels and the comprehensive failure analysis. The authors tested an improved MolRec implementation that was more efficient than the TREC 2011 version, providing insights into both system evolution and the inherent challenges of chemical structure recognition.</p>
<p><strong>MolRec Architecture Overview</strong>: The system follows a two-stage pipeline approach:</p>
<ol>
<li>
<p><strong>Vectorization Stage</strong>: The system preprocesses input images through three steps:</p>
<ul>
<li><strong>Image binarization</strong> using Otsu&rsquo;s method to convert grayscale images to black and white, followed by labelling of connected components</li>
<li><strong>OCR processing</strong> using nearest neighbor classification with a Euclidean metric to identify and remove text components (atom labels, charges, etc.)</li>
<li><strong>Separation of bond elements</strong>: thinning connected components to single-pixel width, building polyline representations, detecting circles, arrows, and solid triangles, and applying the Douglas-Peucker line simplification algorithm to clean up vectorized bonds</li>
</ul>
</li>
<li>
<p><strong>Rule Engine Stage</strong>: A set of 18 chemical rules converts geometric primitives into molecular graphs:</p>
<ul>
<li><strong>Bridge bond recognition</strong> (2 rules applied before all others, handling structures with multiple connection paths depicted in 2.5-dimensional perspective drawings)</li>
<li><strong>Standard bond and atom recognition</strong> (16 rules applied in arbitrary order)</li>
<li><strong>Context-aware disambiguation</strong> resolving ambiguities using the full graph structure and character groups</li>
<li><strong>Superatom expansion</strong> looking up character groups identifying more than one atom in a dictionary and replacing them with molecule subgraphs</li>
</ul>
</li>
</ol>
<p>The system can output results in standard formats like MOL files or SMILES strings.</p>
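<p>Of the vectorization steps above, the Douglas-Peucker simplification is easy to illustrate. A self-contained sketch of the algorithm (not MolRec&rsquo;s implementation):</p>

```python
import math

def _perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)

def douglas_peucker(points, eps):
    """Simplify a polyline: keep the endpoints, recurse on the interior
    point farthest from the chord if it lies farther than eps, otherwise
    drop all interior points."""
    if len(points) < 3:
        return list(points)
    dists = [_perp_dist(p, points[0], points[-1]) for p in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__)
    if dists[i] <= eps:
        return [points[0], points[-1]]
    split = i + 1  # index of the farthest point in the full list
    left = douglas_peucker(points[:split + 1], eps)
    right = douglas_peucker(points[split:], eps)
    return left[:-1] + right  # avoid duplicating the split point

# A noisy near-straight segment collapses to its endpoints:
simplified = douglas_peucker([(0, 0), (1, 0.01), (2, -0.01), (3, 0)], eps=0.1)
```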
<h2 id="clef-2012-experimental-design">CLEF 2012 Experimental Design</h2>
<p>The CLEF 2012 organizers provided a set of 961 test images clipped from patent documents, split into two sets:</p>
<ol>
<li>
<p><strong>Automatic Evaluation Set (865 images)</strong>: Images selected for automatic evaluation by comparison of generated MOL files with ground truth MOL files using the OpenBabel toolkit.</p>
</li>
<li>
<p><strong>Manual Evaluation Set (95 images)</strong>: A more challenging collection of images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation. This set was intentionally included to provide a greater challenge.</p>
</li>
</ol>
<p>The authors ran MolRec four times with slightly adjusted internal parameters, then manually examined every incorrect recognition to categorize error types.</p>
<h2 id="performance-divergence-and-critical-failure-modes">Performance Divergence and Critical Failure Modes</h2>
<p>The results reveal a stark performance gap between simple and complex molecular structures:</p>
<p><strong>Performance on Automatic Evaluation Set</strong>: On the 865-image set, MolRec achieved <strong>94.91% to 96.18% accuracy</strong> across four runs with different parameter settings. A total of 46 different diagrams were mis-recognized across all runs.</p>
<p><strong>Performance on Manual Evaluation Set</strong>: On the 95-image set, accuracy dropped to <strong>46.32% to 58.95%</strong>. A total of 52 different diagrams were mis-recognized across all runs. Some diagrams failed for multiple reasons.</p>
<p><strong>Key Failure Modes Identified</strong> (with counts from the paper&rsquo;s Table 3):</p>
<ul>
<li>
<p><strong>Character Grouping</strong> (26 manual, 0 automatic): An implementation bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, so $R_{21}$ was incorrectly recognized as $R_{211}$. A separate problem was difficulty correctly separating closely spaced atom groups.</p>
</li>
<li>
<p><strong>Touching Characters</strong> (8 manual, 1 automatic): The system does not handle touching characters, so overlapping characters cause mis-recognition.</p>
</li>
<li>
<p><strong>Four-Way Junction Failures</strong> (6 manual, 7 automatic): The vectorization process could not handle junctions where four lines meet, leading to incorrect connectivity.</p>
</li>
<li>
<p><strong>OCR Errors</strong> (5 manual, 11 automatic): Character recognition errors included &ldquo;G&rdquo; interpreted as &ldquo;O&rdquo;, &ldquo;alkyl&rdquo; being mis-recognized, and &ldquo;I&rdquo; interpreted as a vertical single bond.</p>
</li>
<li>
<p><strong>Missed Solid and Dashed Wedge Bonds</strong> (6 solid and 6 dashed in automatic, 0 manual): The system failed to correctly recognize several solid wedge and dashed wedge bonds.</p>
</li>
<li>
<p><strong>Missed Wavy Bonds</strong> (2 manual, 1 automatic): Some wavy bonds were not recognized despite the dedicated wavy bond rule.</p>
</li>
<li>
<p><strong>Missed Charge Signs</strong> (1 manual, 2 automatic): While correctly recognizing positive charge signs, MolRec missed three negative charge signs, including one placed at the top left of an atom name.</p>
</li>
<li>
<p><strong>Other Errors</strong>: An atom too close to a bond endpoint was erroneously considered connected, a solid wedge bond too close to a closed node was incorrectly connected, and a dashed bold bond had its stereocentre incorrectly determined.</p>
</li>
</ul>
<p><strong>Dataset Quality Issues</strong>: The authors discovered 11 images where the ground truth MOL files were incorrect and MolRec&rsquo;s recognition was actually correct. As the authors note, such ground truth errors are very difficult to avoid in this complex task.</p>
<p><strong>Key Insights</strong>:</p>
<ul>
<li>
<p><strong>Performance gap between simple and complex structures</strong>: While MolRec achieved over 94% accuracy on standard molecular diagrams, the performance drop on the more challenging manual evaluation set (down to 46-59%) highlights the difficulty of real patent document images.</p>
</li>
<li>
<p><strong>Many errors are fixable</strong>: The authors note that many mis-recognition problems (such as the character grouping bug and four-way junction vectorization) can be solved with relatively simple enhancements.</p>
</li>
<li>
<p><strong>Touching character segmentation</strong> remains a notoriously difficult open problem that the authors plan to explore further.</p>
</li>
<li>
<p><strong>Evaluation challenges</strong>: The discovery of 11 incorrect ground truth MOL files illustrates how difficult it is to create reliable benchmarks for chemical structure recognition.</p>
</li>
</ul>
<p>The authors conclude that despite the high recognition rates on simpler structures, there is still plenty of room for improvement. They identify future work areas including recognition of more general Markush structures, robust charge sign spotting, and accurate identification of wedge bonds.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="system-architecture">System Architecture</h3>
<p><strong>Model Type</strong>: Non-neural, Rule-Based System (Vectorization Pipeline + Logic Engine)</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Datasets (CLEF 2012)</strong>: 961 total test images clipped from patent documents:</p>
<ul>
<li><strong>Automatic Evaluation Set</strong>: 865 images evaluated automatically using OpenBabel for exact structural matching of generated MOL files against ground truth</li>
<li><strong>Manual Evaluation Set</strong>: 95 images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation</li>
</ul>
<p><strong>Training Data</strong>: The paper does not describe the reference set used to build OCR character prototypes for the nearest neighbor classifier.</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Vectorization Pipeline</strong> (three steps):</p>
<ul>
<li><strong>Image Binarization</strong>: Otsu&rsquo;s method, followed by connected component labelling</li>
<li><strong>OCR</strong>: Nearest neighbor classification with Euclidean distance metric; recognized characters are removed from the image</li>
<li><strong>Bond Element Separation</strong>: Thinning to single-pixel width, polyline construction, Douglas-Peucker line simplification (threshold set to 1-2x average line width), detection of circles, arrows, and solid triangles</li>
</ul>
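<p>The Douglas-Peucker simplification step admits a compact sketch. The following is an illustrative re-implementation, not MolRec&rsquo;s code; <code>epsilon</code> plays the role of the paper&rsquo;s 1-2x average-line-width threshold:</p>

```python
import math

def perp_distance(p, a, b):
    """Perpendicular distance from point p to the infinite line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / length

def douglas_peucker(points, epsilon):
    """Recursively drop vertices that deviate less than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior vertex farthest from the chord joining the endpoints.
    dists = [perp_distance(p, points[0], points[-1]) for p in points[1:-1]]
    idx = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[idx - 1] > epsilon:
        left = douglas_peucker(points[:idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return left[:-1] + right   # avoid duplicating the split vertex
    return [points[0], points[-1]]

# A nearly straight stroke collapses to its endpoints...
print(douglas_peucker([(0, 0), (1, 0.01), (2, 0)], 0.1))  # [(0, 0), (2, 0)]
# ...while a genuine corner survives.
print(douglas_peucker([(0, 0), (1, 1), (2, 0)], 0.5))     # [(0, 0), (1, 1), (2, 0)]
```

<p>Tying <code>epsilon</code> to the measured stroke width, as the paper does, makes the simplification scale with the drawing&rsquo;s line thickness.</p>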
<p><strong>Rule Engine</strong>: 18 chemical structure rules converting geometric primitives to molecular graphs:</p>
<ul>
<li><strong>Bridge Bond Rules (2 rules)</strong>: Applied before all other rules, handling bridge bond structures depicted in 2.5-dimensional perspective drawings</li>
<li><strong>Wavy Bond Rule</strong>: Detailed in paper, identifies approximately collinear connected line segments with zig-zag patterns ($n \geq 3$ segments)</li>
<li><strong>Standard Recognition Rules</strong>: 16 rules for bonds, atoms, and chemical features (applied in arbitrary order; most not detailed in this paper)</li>
</ul>
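<p>Only the wavy bond rule is spelled out in the paper. A toy version of its zig-zag test might look as follows; the side-alternation criterion and the <code>max_offset</code> threshold are my guesses at &ldquo;approximately collinear,&rdquo; not the paper&rsquo;s exact geometry:</p>

```python
import math

def side_of_chord(p, a, b):
    """Signed perpendicular offset of p from the chord a->b (cross product / chord length)."""
    chord = math.hypot(b[0] - a[0], b[1] - a[1])
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return cross / chord

def looks_like_wavy_bond(polyline, max_offset=0.5):
    """Heuristic wavy-bond check: at least 3 connected segments whose interior
    vertices alternate sides of the end-to-end chord while staying near it."""
    if len(polyline) < 4 or polyline[0] == polyline[-1]:  # n segments = n+1 points
        return False
    offsets = [side_of_chord(p, polyline[0], polyline[-1]) for p in polyline[1:-1]]
    alternating = all(o1 * o2 < 0 for o1, o2 in zip(offsets, offsets[1:]))
    return alternating and all(abs(o) <= max_offset for o in offsets)

print(looks_like_wavy_bond([(0, 0), (1, 0.3), (2, -0.3), (3, 0.3), (4, 0)]))  # True
print(looks_like_wavy_bond([(0, 0), (1, 0), (2, 0), (3, 0)]))                 # False (straight)
```

<p>A plain straight bond fails the alternation test, so the rule fires only on genuine zig-zags.</p>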
<p><strong>Optimization</strong>: Performance tuned via manual adjustment of fuzzy and strict geometric threshold parameters.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Automated</strong>: Exact structural match via OpenBabel MOL file comparison</li>
<li><strong>Manual</strong>: Visual inspection by human experts for structures where OpenBabel fails</li>
</ul>
<p><strong>Results</strong>:</p>
<ul>
<li><strong>Automatic Evaluation Set (865 images)</strong>: 94.91% to 96.18% accuracy across four runs</li>
<li><strong>Manual Evaluation Set (95 images)</strong>: 46.32% to 58.95% accuracy across four runs</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Given the era (2012) and algorithmic nature (Otsu, thinning, geometric analysis), likely ran on standard CPUs.</p>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p><strong>Closed.</strong> No public code, data, or models are available. The paper outlines high-level logic (Otsu binarization, Douglas-Peucker simplification, 18-rule system) but does not provide:</p>
<ul>
<li>The complete specification of all 18 rules (only Rule 2.2 for wavy bonds is detailed)</li>
<li>Exact numerical threshold values for fuzzy/strict parameters used in the CLEF runs</li>
<li>OCR training data or character prototype specifications</li>
</ul>
<p>The authors refer readers to a separate 2012 DRR (SPIE) paper [5] for a more detailed overview of the MolRec system architecture.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012: Overview and Analysis of Results. Working Notes of CLEF 2012 Evaluation Labs and Workshop. CLEF. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Workshop (ImageCLEF Track)</p>
]]></content:encoded></item><item><title>The Surface of Venus: Stratigraphy and Resurfacing History</title><link>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/surface-of-venus/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/surface-of-venus/</guid><description>A review of Venus's "stagnant lid" geology and global resurfacing history, exploring why Earth's twin diverged so dramatically from our own planet.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Systematization</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>) that organizes and unifies decades of observational data from multiple planetary missions into a coherent geological framework for Venus.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Venus and Earth are planetary twins, nearly identical in size, mass, and bulk composition. Earth developed a biosphere. Venus developed a surface temperature of $\sim 740\;\text{K}$ ($\sim 467^\circ\text{C}$) and a 93-bar $\text{CO}_2$ atmosphere.</p>
<p><strong>Why did two similar planets diverge so drastically?</strong></p>
<p>Basilevsky and Head synthesize decades of data to answer this. By decoding the geological record preserved in the Venusian crust, they aim to reconstruct the planet&rsquo;s thermal evolution and understand why Venus operates under a &ldquo;stagnant lid&rdquo; regime characterized by a geological cycle of catastrophic global resurfacing.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The paper&rsquo;s contribution is a <strong>comprehensive synthesis</strong> integrating findings from Soviet Venera landers and NASA Magellan radar imaging into a unified geological history. Key novelties:</p>
<ul>
<li><strong>Global Stratigraphy</strong>: Establishes a planet-wide sequence of geological units:
<ol>
<li><strong>Tessera Terrain</strong>: The oldest, highly tectonized crust, forming &ldquo;islands&rdquo; and &ldquo;continents&rdquo; above the plains (~8% of the surface).</li>
<li><strong>Densely Fractured Plains</strong>: Widespread, heavily deformed volcanic plains showing global-scale extensional and shear fracturing.</li>
<li><strong>Ridge Belts</strong>: Linear bands of folded, compressed material (~3–5 km wide ridges), a transitional tectonic phase.</li>
<li><strong>Shield Plains</strong>: Widespread fields of small volcanic shields (5–15 km diameter), emplaced after the ridge belts.</li>
<li><strong>Wrinkle-Ridged Plains (Regional Plains)</strong>: The predominant variety of regional plains (which together with shield plains cover ~70% of the surface), consisting of vast basaltic lava flows marked by compressional ridges from gentle horizontal shortening.</li>
<li><strong>Younger Plains (Lobate/Smooth)</strong>: The most recent volcanic flows, showing little deformation and comprising ~10–15% of the surface.</li>
</ol>
</li>
<li>Surface dominated by <strong>widespread basaltic volcanism and tectonic deformation</strong>, operating under a single-plate regime with no evidence of subduction trenches, island arcs, or mid-oceanic ridges.</li>
<li><strong>The Synchronous Model</strong>: Argues geological units (like regional plains) formed <strong>synchronously</strong> planet-wide, supporting global catastrophic resurfacing events rather than geographically asynchronous activity.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This review synthesizes observational data from multiple spacecraft missions spanning four decades:</p>
<ul>
<li><strong>Radar Imaging and Altimetry (Magellan, 1990–1994)</strong>: Global high-resolution mapping (120–220 m/pixel SAR, altimetry, emissivity), revealing the full inventory of volcanoes, tectonic features, and impact craters.</li>
<li><strong>Venera 15/16 (1983)</strong>: First radar imaging of Venus from 30°N to the north pole at 1–2 km resolution, revealing tessera terrain for the first time.</li>
<li><strong>Lander Missions (Venera 9, 10, 13, 14; Vega 1, 2)</strong>: TV panoramic cameras providing the first direct surface images; gamma-ray and x-ray fluorescence (XRF) analysis confirming <strong>tholeiitic basalt</strong> composition at most sites (Venera 9, 10, 14, Vega 1, 2), with Venera 8 and 13 indicating <strong>alkaline basalt</strong> composition. The Venera 8 landing site, dominated by shield plains, showed elevated potassium, uranium, and thorium, suggesting geochemically evolved material.</li>
<li><strong>Atmospheric Probes (Pioneer Venus, Venera 4–12)</strong>: Atmospheric composition, temperature, and pressure profiles, plus a high D/H ratio ($\sim 0.024$, about 150$\times$ that of Earth&rsquo;s oceans) indicating significant primordial water loss.</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/notes/planetary-science/venus-magellan-topography.webp"
         alt="False-color radar topography map of Venus showing elevation data from the Magellan mission, with highlands in pink/white and lowlands in blue/purple"
         title="False-color radar topography map of Venus showing elevation data from the Magellan mission, with highlands in pink/white and lowlands in blue/purple"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Global topography of Venus from Magellan radar altimetry. Colors indicate planetary radius (elevation), with highlands like Ishtar Terra and Aphrodite Terra shown in pink/white. (NASA/JPL-Caltech)</figcaption>
    
</figure>

<h2 id="the-surface-environment">The Surface Environment</h2>
<p>The Venera landers provided the only direct ground-truth of Venusian surface conditions:</p>
<ul>
<li><strong>Temperature</strong>: $\sim 740\;\text{K}$ at mean surface level ($\sim 6051.5$ km radius), varying by altitude. Maxwell Montes (the highest peak, $+11$ km) reaches $\sim 650\;\text{K}$; deep depressions ($-2$ km) reach $\sim 755\;\text{K}$.</li>
<li><strong>Pressure</strong>: 93 bar at mean surface level, 45 bar at Maxwell Montes summit.</li>
<li><strong>Winds</strong>: Very low near-surface wind speeds ($0.3\text{&ndash;}1\;\text{m s}^{-1}$) at lander sites, but the <strong>zonal wind at cloud top</strong> reaches $\sim 100\;\text{m s}^{-1}$, driving the planet-wide atmospheric super-rotation.</li>
<li><strong>Surface appearance</strong>: The solid surface is very dark and reddish (reflectivity only 0.03–0.1 in visible light). All four Venera landers photographed <strong>platy rocks</strong> with prominent fine layering and soil in local depressions, consistent with either lithified aeolian sediment or thin volcanic tuff. At the Venera 13 and 14 sites, rock bearing capacity was measured at only $3\text{&ndash;}10\;\text{kg cm}^{-2}$, implying porous material.</li>
<li><strong>Chemical weathering</strong>: Thermodynamic calculations predict that basaltic minerals react with atmospheric gases to form magnetite, haematite, quartz, magnesite, anhydrite, and pyrite. In the highlands, above a critical altitude (which varies across the planet), iron in silicates segregates into highly conductive iron oxide or sulfide minerals, producing a radar-bright &ldquo;snow line.&rdquo;</li>
</ul>
<h2 id="geological-terrains-and-features">Geological Terrains and Features</h2>
<h3 id="volcanic-plains-80-of-surface">Volcanic Plains (80% of Surface)</h3>
<p>The vast majority of the surface consists of volcanic plains. Regional plains (including both wrinkle-ridged and shield varieties) cover ~70% of the surface. The dominant variety, <strong>plains with wrinkle ridges</strong>, consists of solidified basaltic lava flows deformed by gentle horizontal compression into networks of narrow (1–2 km wide), low ridges tens to hundreds of km long. Within the plains run sinuous channels (lava tubes or thermal erosion channels), including <strong>Baltis Vallis</strong>, the longest channel in the solar system at 6,800 km, about $\frac{1}{6}$ of Venus&rsquo;s circumference.</p>
<p>Younger volcanic units (10–15% of the surface) include <strong>lobate lava fields</strong> (over 200 fields each $&gt; 50{,}000\;\text{km}^2$) and <strong>smooth plains</strong>, representing the most recent volcanism. The highest volcano, <strong>Maat Mons</strong>, stands 9 km above mean planetary radius (MPR), and its lava flows extend 800 km across.</p>
<h3 id="coronae-a-uniquely-venusian-feature">Coronae: A Uniquely Venusian Feature</h3>
<p>Several hundred <strong>coronae</strong> are among the most distinctive structures on Venus. These oval-to-circular volcanic-tectonic features are typically 100–300 km in diameter (a few exceeding 1,000 km) and are unique to Venus in the solar system. A corona typically consists of:</p>
<ul>
<li>A <strong>tectonically deformed annulus</strong> (circular rim of compressed/fractured terrain) standing a few hundred metres above the surrounding plains.</li>
<li>A <strong>depressed interior</strong> flooded with plains-forming volcanics.</li>
<li><strong>Aprons of young lobate lava flows</strong> radiating outward from the annulus.</li>
</ul>
<p>Coronae form from <strong>rising hot mantle diapirs</strong>: the diapir pushes up the overlying lithosphere and crust, producing magmatic melts that reach the surface as lava flows. When the diapir cools, the uplifted surface collapses, creating the annular structure. Their long-lived circular geometry (rather than being deformed into elongated shapes) is strong evidence that plate tectonics did not operate during their formation.</p>
<h3 id="deformed-terrains-20-of-surface">Deformed Terrains (20% of Surface)</h3>
<p>About 20% of the surface is occupied by rough, tectonically deformed terrains:</p>
<ul>
<li><strong>Ridge belts</strong>: Fragments of globally widespread compressed plains-forming material, now partly flooded by regional plains lavas. Their folded ridges (3–5 km wide) indicate past regional-to-global horizontal compression.</li>
<li><strong>Densely fractured plains</strong>: &ldquo;Islands&rdquo; of 100–200 km extent of heavily fractured plains-forming material, elevated slightly above regional plains. Fracture patterns are often subparallel within a given island, implying global-scale deformation events.</li>
<li><strong>Tessera terrain</strong>: The most highly deformed and probably oldest surface unit, forming elevated &ldquo;islands&rdquo; and &ldquo;continents&rdquo; (e.g. Ishtar Terra at 60–70°N, Fortuna, Ovda, Tellus). The surface is dissected by criss-crossing ridges and grooves a few km wide and tens of km long. Composition is unknown; may be basaltic or more feldspathic (resembling lunar anorthosites or terrestrial granites).</li>
<li><strong>Rifts (Chasmata)</strong>: A global system of extensional troughs up to 40,000 km long, with floors a few km below surrounding terrain. Associated with young post-regional-plain volcanism and coronae chains.</li>
<li><strong>Mountain ranges</strong>: The highest topographic features (Maxwell Montes, 11 km above MPR), formed by intense horizontal compression. Lateral merger with tessera suggests mountain range formation may be the initial stage of tessera formation.</li>
</ul>
<h3 id="impact-craters">Impact Craters</h3>
<p>More than <strong>960 impact craters</strong> from 1.5 to 270 km in diameter have been identified on Venus. Their distribution is <strong>indistinguishable from random</strong>, confirming no plate tectonics (which would preferentially destroy craters). Key characteristics:</p>
<ul>
<li><strong>Atmospheric screening</strong>: The dense atmosphere breaks up relatively small projectiles, so craters smaller than $\sim 10\text{&ndash;}20$ km in diameter have irregular floors caused by impacting swarms of fragments rather than single bodies. The observed size-frequency distribution is well-matched by models of the present-day atmosphere.</li>
<li><strong>Impact melt outflows</strong>: Many craters show <strong>flow-like outflow features</strong> (impact melt) extending tens to hundreds of km from the rim, more abundant than on other planets due to the high surface temperature increasing melt production.</li>
<li><strong>Dark parabola halos</strong>: The youngest craters are associated with radar-dark parabolic haloes formed by fine crater ejecta carried westward by the strong zonal upper atmosphere winds and settled in a parabolic pattern. Parabolas degrade to non-parabolic halos with age, providing a crater aging tool.</li>
</ul>
<h3 id="aeolian-features">Aeolian Features</h3>
<p>In the absence of liquid water, <strong>aeolian (wind) processes</strong> dominate exogenic resurfacing:</p>
<ul>
<li><strong>Wind streaks</strong>: The most abundant aeolian features. Elongated radar-dark or bright features a few to tens of km long, originating from topographic obstacles; represent erosional and depositional products of wind turbulence.</li>
<li><strong>Dark mantles</strong>: Fine-grained debris from impact ejecta, deposited atmospherically and redistributed by wind. Common around impact craters as halos.</li>
<li><strong>Dunes</strong>: Only <strong>two dune fields</strong> identified (one near Fortuna–Meshkenet, one in Lavinia Planitia), each associated with large impact craters that provided debris. The scarcity of dunes implies a general <strong>deficit of sand-sized particles</strong> on Venus.</li>
<li><strong>Candidate yardangs</strong>: Wind-erosional grooves observed near the Mead crater (the largest impact crater on Venus).</li>
</ul>
<h2 id="the-age-debate-synchronous-vs-non-synchronous-resurfacing">The Age Debate: Synchronous vs. Non-Synchronous Resurfacing</h2>
<p>A central unresolved question is whether similar geological units across the planet share the same <strong>absolute age</strong> or different ages:</p>
<ul>
<li><strong>Synchronous model</strong> (favored by authors): Similar units are globally contemporaneous. Evidence: mapping of more than half the planet shows consistent age-sequence relationships across province boundaries; a complete latitude band mapped at 30°N shows the unit sequence is laterally traceable around the planet.</li>
<li><strong>Non-synchronous model</strong> (alternative): Unit L in one province may be younger than unit L in a neighboring province (the same stratigraphic position reflects local, not global, timing). Distinction requires fossil biostratigraphy or isotopic dating, neither of which is possible on Venus without sample return.</li>
</ul>
<p>The synchronous model, if correct, implies a brief period of <strong>intense global volcanism and tectonism</strong> (resurfacing 80–85% of the surface), followed by a dramatic drop to the low-activity stagnant lid regime that persists today.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<figure class="post-figure center ">
    <img src="/img/notes/planetary-science/venus-magellan-radar.webp"
         alt="Magellan radar mosaic of Venus showing the northern hemisphere with volcanic plains, tesserae, and lava flows in orange-brown tones"
         title="Magellan radar mosaic of Venus showing the northern hemisphere with volcanic plains, tesserae, and lava flows in orange-brown tones"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Magellan synthetic aperture radar mosaic of Venus&rsquo;s northern hemisphere, centered on the North Pole. The bright, highly deformed tessera terrain is visible at center, surrounded by darker volcanic plains. (NASA/JPL-Caltech)</figcaption>
    
</figure>

<p>The authors conclude that Venus operates under a <strong>&ldquo;stagnant lid&rdquo; regime</strong>, fundamentally different from Earth&rsquo;s plate tectonics.</p>
<ul>
<li><strong>Mean surface age</strong>: $\sim 500\text{&ndash;}800$ Myr (range $\sim 300\;\text{Myr}$ to $\sim 1\;\text{Gyr}$ with uncertainties), inferred from crater density calibrated against lunar chronology. The pre-regional-plains units and regional plains (occupying 80–85% of the surface together) formed in a compressed time window during the early era.</li>
<li><strong>Two-era history</strong>:
<ol>
<li><strong>Global/Early Era</strong>: Intense, planet-wide volcanic and tectonic activity resurfaced 80–85% of the surface. Mean volcanic rate was comparable to Earth&rsquo;s current mid-oceanic ridge volcanism. Tectonic deformation was most intense (tessera-forming) at the start, waning through ridge belts and wrinkle ridges.</li>
<li><strong>Localized/Late Era</strong>: Beginning $\sim 500\text{&ndash;}1000$ Myr ago and continuing to the present, activity dropped to rates <strong>lower than terrestrial intraplate volcanism</strong> and more comparable to lunar mare volcanism. Concentrated in rift zones ($\sim 4\%$ of the surface); lobate and smooth plains occupy only 10–15% of the surface.</li>
</ol>
</li>
<li><strong>The stagnant lid transition</strong>: Earth releases internal heat gradually through plate tectonics. On Venus, when the lithosphere thickened sufficiently, its yield strength exceeded the tectonic driving stresses (Solomatov &amp; Moresi 1996), locking the planet into a single immobile plate. This caused mantle heating, suppressed melting that feeds surface volcanism, and halted the overturning cycle. Volcanic and tectonic activity may still occur at low rates.</li>
<li><strong>No magnetic field</strong>: Cessation of rapid core cooling (which plate tectonics drives on Earth) likely halted geodynamo action, explaining the absence of an intrinsic magnetic field despite Venus having an Earth-like iron core.</li>
<li><strong>Interior structure</strong>: The crust–mantle boundary sits at $\sim 70\;\text{km}$; the mantle–core boundary at $\sim 2840\;\text{km}$. Gravity correlates strongly with topography (unlike Earth), suggesting Venus lacks an <strong>asthenosphere</strong> (mechanically soft upper mantle layer), possibly because the high surface temperature precludes the stability of chlorite and serpentine, the &ldquo;slippery&rdquo; minerals that weaken Earth&rsquo;s lithosphere.</li>
<li><strong>Surface-atmosphere coupling</strong>: Extensive early-era volcanism may have added $\text{H}_2\text{O}$ and $\text{SO}_2$ to the atmosphere, amplifying the greenhouse effect and causing surface temperature excursions of $\pm 100\;\text{K}$ on 100–200 Myr timescales that could in turn have driven tectonic stress and partial crustal melting.</li>
</ul>
<h2 id="connecting-to-venus-as-a-system">Connecting to Venus as a System</h2>
<p>The geological record decoded here provides essential context for understanding the full story of Venus. The stagnant lid and catastrophic resurfacing events explain how the planet lost its early surface water and why the atmosphere evolved into its current extreme state.</p>
<p>To explore what the surface conditions mean for life and planetary engineering, see:</p>
<ul>
<li><a href="/notes/interdisciplinary/planetary-science/life-on-venus/">Life on Venus</a> for how these surface conditions define the hard limits for any biology.</li>
<li><a href="/notes/interdisciplinary/planetary-science/venus-evolution-through-time/">Venus Evolution Through Time</a> for the coordinated mission strategy that will answer whether Venus was ever habitable.</li>
<li><a href="/notes/interdisciplinary/planetary-science/cloud-continents/">Terraforming Venus: The Cloud Continent Proposal</a> for a speculative look at how humanity might one day engineer around these geological constraints.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>This is a review paper synthesizing publicly available mission data. The primary datasets are accessible through NASA&rsquo;s Planetary Data System (PDS):</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pds-geosciences.wustl.edu/missions/magellan/">Magellan SAR and Altimetry (PDS)</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Global radar imagery and topography</td>
      </tr>
      <tr>
          <td><a href="https://www.nasa.gov/nssdc/">Venera/Vega Lander Data (NSSDC)</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Surface composition and imagery</td>
      </tr>
  </tbody>
</table>
<p>No custom software or models are associated with this paper. Reproducing the geological interpretations requires access to the Magellan radar mosaics and familiarity with planetary geological mapping techniques.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Basilevsky, A. T., &amp; Head, J. W., III. (2003). The surface of Venus. <em>Reports on Progress in Physics</em>, 66(10), 1699–1734. <a href="https://doi.org/10.1088/0034-4885/66/10/R04">https://doi.org/10.1088/0034-4885/66/10/R04</a></p>
<p><strong>Publication</strong>: Reports on Progress in Physics, 2003</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{basilevsky2003surface,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The surface of Venus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Basilevsky, Alexander T and Head, James W, III}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Reports on Progress in Physics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{66}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1699--1734}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2003}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/0034-4885/66/10/R04}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Dark Side of Forces: Non-Conservative ML Force Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/</guid><description>Bigi et al. critique non-conservative force models in ML potentials, showing their simulation failures and proposing hybrid solutions.</description><content:encoded><![CDATA[<h2 id="contribution-systematic-assessment-of-non-conservative-ml-force-models">Contribution: Systematic Assessment of Non-Conservative ML Force Models</h2>
<p>This is a <strong>Systematization</strong> paper. It systematically catalogs the exact failure modes of existing non-conservative force approaches, quantifies them with a new diagnostic metric, and proposes a hybrid Multiple Time-Stepping solution combining the speed benefits of direct force prediction with the physical correctness of conservative models.</p>
<h2 id="motivation-the-speed-accuracy-trade-off-in-ml-force-fields">Motivation: The Speed-Accuracy Trade-off in ML Force Fields</h2>
<p>Many recent machine learning interatomic potential (MLIP) architectures predict forces directly ($F_\theta(r)$). This &ldquo;non-conservative&rdquo; approach avoids the computational overhead of automatic differentiation, yielding faster inference (typically 2-3x speedup) and faster training (up to 3x). However, it sacrifices energy conservation and rotational constraints, potentially destabilizing molecular dynamics simulations. The field lacks rigorous quantification of when this trade-off breaks down and how to mitigate the failures.</p>
<h2 id="novelty-jacobian-asymmetry-and-hybrid-architectures">Novelty: Jacobian Asymmetry and Hybrid Architectures</h2>
<p>Four key contributions:</p>
<ol>
<li>
<p><strong>Jacobian Asymmetry Metric ($\lambda$):</strong> A quantitative diagnostic for non-conservation. Since conservative forces derive from a scalar field, their Jacobian (the Hessian of energy) must be symmetric. The normalized norm of the antisymmetric part quantifies the degree of violation:
$$ \lambda = \frac{|| \mathbf{J}_{\text{anti}} ||_F}{|| \mathbf{J} ||_F} $$
where $\mathbf{J}_{\text{anti}} = (\mathbf{J} - \mathbf{J}^\top)/2$. Measured values range from $\lambda \approx 0.004$ (PET-NC) to $\lambda \approx 0.032$ (SOAP-BPNN-NC), with ORB at 0.015 and EquiformerV2 at 0.017. Notably, the pairwise $\lambda_{ij}$ approaches 1 at large interatomic distances, meaning non-conservative artifacts disproportionately affect long-range and collective interactions.</p>
</li>
<li>
<p><strong>Systematic Failure Mode Catalog:</strong> First comprehensive demonstration that non-conservative models cause runaway heating in NVE ensembles (temperature drifts of $\sim 7 \times 10^{12}$ K/s for PET-NC and ~10x larger for ORB) and equipartition violations in NVT ensembles where different atom types equilibrate to different temperatures, a physical impossibility.</p>
</li>
<li>
<p><strong>Theoretical Analysis of Force vs. Energy Training:</strong> Force-only training overemphasizes high-frequency vibrational modes because force labels carry per-atom gradients that are dominated by stiff, short-range interactions. Energy labels provide a more balanced representation across the frequency spectrum. Additionally, conservative models benefit from backpropagation extending the effective receptive field to approximately 2x the interaction cutoff, while direct-force models are limited to the nominal cutoff radius.</p>
</li>
<li>
<p><strong>Hybrid Training and Inference Protocol:</strong> A practical workflow that combines fast direct-force prediction with conservative corrections:</p>
<ul>
<li><strong>Training:</strong> Pre-train on direct forces, then fine-tune on energy gradients (2-4x faster than training conservative models from scratch)</li>
<li><strong>Inference:</strong> Multiple Time-Stepping (MTS) where fast non-conservative forces are periodically corrected by slower conservative forces</li>
</ul>
</li>
</ol>
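<p>The asymmetry metric above is easy to sketch numerically. A minimal NumPy implementation (using toy linear force fields, not the paper's models) builds the Jacobian by central finite differences, splits off the antisymmetric part, and takes the ratio of Frobenius norms:</p>

```python
import numpy as np

def jacobian_asymmetry(force_fn, positions, eps=1e-5):
    """lambda = ||J_anti||_F / ||J||_F for a force model.

    force_fn maps a flat position vector (n,) to forces (n,).
    A conservative model (F = -grad E) has a symmetric Jacobian
    (the negative Hessian of E), so lambda ~ 0 up to numerical noise.
    """
    x = np.asarray(positions, dtype=float).ravel()
    n = x.size
    J = np.empty((n, n))
    for j in range(n):  # central finite differences, one column at a time
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (force_fn(x + dx) - force_fn(x - dx)) / (2 * eps)
    J_anti = 0.5 * (J - J.T)
    return np.linalg.norm(J_anti) / np.linalg.norm(J)

# Toy conservative force: a spring to the origin, F = -x (symmetric J).
conservative = lambda x: -x
# Add a skew (curl-carrying) term to make it non-conservative.
A = np.array([[0.0, 0.2], [-0.2, 0.0]])
nonconservative = lambda x: -x + A @ x

x0 = np.array([0.3, -0.7])
print(jacobian_asymmetry(conservative, x0))     # ~0
print(jacobian_asymmetry(nonconservative, x0))  # ~0.2
```

<p>For a neural force field the columns would come from automatic differentiation rather than finite differences, but the symmetric/antisymmetric split is the same.</p>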
<h2 id="methodology-systematic-failure-mode-analysis">Methodology: Systematic Failure Mode Analysis</h2>
<p>The evaluation systematically tests multiple state-of-the-art models across diverse simulation scenarios:</p>
<p><strong>Models tested:</strong></p>
<ul>
<li><strong>PET-C/PET-NC</strong> (Point Edge Transformer, conservative and non-conservative variants)</li>
<li><strong>PET-M</strong> (hybrid variant jointly predicting both conservative and non-conservative forces)</li>
<li><strong>ORB-v2</strong> (non-conservative, trained on Alexandria/MPtrj)</li>
<li><strong>EquiformerV2</strong> (non-conservative equivariant Transformer)</li>
<li><strong>MACE-MP-0</strong> (conservative message-passing)</li>
<li><strong>SevenNet</strong> (conservative message-passing)</li>
<li><strong>SOAP-BPNN-C/SOAP-BPNN-NC</strong> (descriptor-based baseline, both conservative and non-conservative variants)</li>
</ul>
<p><strong>Test scenarios:</strong></p>
<ol>
<li><strong>NVE stability tests</strong> on bulk liquid water, graphene, amorphous carbon, and FCC aluminum</li>
<li><strong>Thermostat artifact analysis</strong> with Langevin and GLE thermostats</li>
<li><strong>Geometry optimization</strong> on water snapshots and <a href="/notes/chemistry/datasets/qm9/">QM9</a> molecules using FIRE and L-BFGS</li>
<li><strong>MTS validation</strong> on OC20 catalysis dataset</li>
<li><strong>Species-resolved temperature measurements</strong> for equipartition testing</li>
</ol>
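<p>The geometry-optimization scenario highlights why quasi-Newton methods are fragile here: L-BFGS assumes the forces are the gradient of a scalar energy, an assumption a non-conservative model violates. A minimal SciPy sketch on a Lennard-Jones dimer (a stand-in for the paper's water/QM9 systems and MLIP calculators) shows the conservative setting the optimizer relies on:</p>

```python
import numpy as np
from scipy.optimize import minimize

# Lennard-Jones dimer (eps = sigma = 1) as a stand-in potential; the
# paper instead optimizes water snapshots and QM9 molecules with MLIPs.
def lj_energy_and_grad(x):
    r = np.linalg.norm(x)                  # interatomic distance
    energy = 4.0 * (r**-12 - r**-6)
    dedr = 4.0 * (-12.0 * r**-13 + 6.0 * r**-7)
    return energy, dedr * x / r            # gradient of E (= -force)

res = minimize(lj_energy_and_grad, x0=np.array([1.5, 0.0, 0.0]),
               jac=True, method="L-BFGS-B")
print(np.linalg.norm(res.x))  # equilibrium bond length 2**(1/6) ~ 1.122
```

<p>With a non-conservative model there is no scalar objective to hand to the optimizer, so the symmetric Hessian approximation L-BFGS builds loses its footing; FIRE, which only consumes forces, degrades more gracefully, matching the failure pattern reported below.</p>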
<p><strong>Key metrics:</strong></p>
<ul>
<li>Jacobian asymmetry ($\lambda$)</li>
<li>Kinetic temperature drift in NVE</li>
<li>Velocity-velocity correlations</li>
<li>Radial distribution functions</li>
<li>Species-resolved temperatures</li>
<li>Inference speed benchmarks</li>
</ul>
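<p>Species-resolved temperature is a simple equipartition check: compute the kinetic temperature separately for each element. A minimal NumPy sketch (natural units with $k_B = 1$ and a water-like 2:1 H:O composition, both illustrative choices):</p>

```python
import numpy as np

def kinetic_temperature(masses, velocities, k_B=1.0):
    """Instantaneous kinetic temperature from equipartition:
    (3N/2) k_B T = sum_i (1/2) m_i |v_i|^2  (natural units by default)."""
    ke = 0.5 * np.sum(masses[:, None] * velocities**2)
    return 2.0 * ke / (3.0 * len(masses) * k_B)

def species_temperatures(symbols, masses, velocities):
    """Per-element kinetic temperature. In equilibrium every species
    should read the same T; systematic splits (e.g. hot H, cold O)
    are the equipartition violations reported in the paper."""
    symbols, out = np.asarray(symbols), {}
    for s in np.unique(symbols):
        mask = symbols == s
        out[str(s)] = kinetic_temperature(masses[mask], velocities[mask])
    return out

# Maxwell-Boltzmann velocities at T = 300 (k_B = 1): v ~ N(0, sqrt(T/m)).
rng = np.random.default_rng(0)
T, n = 300.0, 10_000
symbols = np.array(["H"] * (2 * n) + ["O"] * n)  # water-like 2:1 ratio
masses = np.where(symbols == "H", 1.0, 16.0)
vel = rng.normal(0.0, np.sqrt(T / masses)[:, None], size=(3 * n, 3))
print(species_temperatures(symbols, masses, vel))  # both near 300
```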
<h2 id="results-simulation-instability-and-hybrid-solutions">Results: Simulation Instability and Hybrid Solutions</h2>
<p>Purely non-conservative models are <strong>unsuitable for production simulations</strong> due to uncontrollable unphysical artifacts that no thermostat can correct. Key findings:</p>
<p><strong>Performance failures:</strong></p>
<ul>
<li>Non-conservative models exhibited catastrophic temperature drift in NVE simulations: ~7,000 billion K/s for PET-NC and ~70,000 billion K/s for ORB, with EquiformerV2 comparable to PET-NC</li>
<li>Strong Langevin thermostats ($\tau=10$ fs) damped diffusion by ~5x, negating the speed benefits of non-conservative models</li>
<li>Advanced GLE thermostats also failed to control non-conservative drift (ORB reached 1181 K vs. 300 K target)</li>
<li>Equipartition violations: under stochastic velocity rescaling, O and H atoms equilibrated at different temperatures. For ORB, H atoms reached 336 K and O atoms 230 K against a 300 K target. For PET-NC, deviations were smaller but still significant (H at 296 K, O at 310 K).</li>
<li>Geometry optimization was more fragile with non-conservative forces: inaccurate NC models (SOAP-BPNN-NC) failed catastrophically, while more accurate ones (PET-NC) could converge with FIRE but showed large force fluctuations with L-BFGS. Non-conservative models consistently had lower success rates across water and QM9 benchmarks.</li>
</ul>
<p><strong>Hybrid solution success:</strong></p>
<ul>
<li>MTS with non-conservative forces corrected every 8 steps ($M=8$) achieved conservative stability with only ~20% overhead compared to a purely non-conservative trajectory. Results were essentially indistinguishable from fully conservative simulations. Higher stride values ($M=16$) became unstable due to resonances between fast degrees of freedom and integration errors.</li>
<li>Conservative fine-tuning achieved the accuracy of from-scratch training in about 1/3 the total training time (2-4x resource reduction)</li>
<li>Validated on OC20 catalysis benchmark</li>
</ul>
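<p>The MTS scheme can be sketched as an r-RESPA-style integrator. This toy version (a 1D harmonic oscillator with a deliberately biased "direct" force, my own illustrative setup rather than the paper's i-PI implementation) takes fast inner steps with the cheap force and applies the conservative correction as a periodic impulse:</p>

```python
def mts_velocity_verlet(x, v, m, f_fast, f_slow, dt, n_outer, M=8):
    """r-RESPA-style multiple time-stepping (schematic, scalar 1D).

    f_fast: cheap per-step force (stands in for direct NC prediction).
    f_slow: slow correction F_conservative - F_fast, evaluated only
            once per outer step and applied as an impulse over M*dt.
    """
    for _ in range(n_outer):
        v += 0.5 * (M * dt) * f_slow(x) / m   # outer half-kick
        for _ in range(M):                     # inner velocity-Verlet
            v += 0.5 * dt * f_fast(x) / m
            x += dt * v
            v += 0.5 * dt * f_fast(x) / m
        v += 0.5 * (M * dt) * f_slow(x) / m   # outer half-kick
    return x, v

# Harmonic oscillator whose "direct" force carries a constant bias;
# the slow correction restores the exact conservative force -k*x.
k = 1.0
f_fast = lambda x: -k * x + 0.05
f_slow = lambda x: -0.05
x1, v1 = mts_velocity_verlet(1.0, 0.0, 1.0, f_fast, f_slow,
                             dt=0.01, n_outer=100, M=8)
energy = 0.5 * v1**2 + 0.5 * k * x1**2
print(energy)  # stays close to the initial total energy of 0.5
```

<p>The cost structure mirrors the paper's numbers: the expensive conservative evaluation is amortized over $M$ cheap steps, and the resonance instability reported for $M=16$ corresponds to the outer impulse period approaching the fastest vibrational period.</p>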
<p><strong>Scaling caveat:</strong> The authors note that as training datasets grow and models become more expressive, non-conservative artifacts should shrink, since a model that fits conservative reference forces more accurately necessarily has a smaller non-conservative residual. They argue, however, that hybrid approaches are the better path forward rather than waiting for scale to solve the problem.</p>
<p><strong>Recommendation:</strong> The optimal production path is hybrid architectures using direct forces for acceleration (via MTS and pre-training) while anchoring models in conservative energy surfaces. This captures computational benefits without sacrificing physical reliability.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Primary training/evaluation:</strong></p>
<ul>
<li><strong>Bulk Liquid Water</strong> (Cheng et al., 2019): revPBE0-D3 calculations with over 250,000 force/energy targets, chosen for rigorous thermodynamic testing</li>
</ul>
<p><strong>Generalization tests:</strong></p>
<ul>
<li>Graphene, amorphous carbon, FCC aluminum (tested with general-purpose foundation models)</li>
</ul>
<p><strong>Benchmarks:</strong></p>
<ul>
<li><strong>QM9</strong>: Geometry optimization tests</li>
<li><strong>OC20</strong> (Open Catalyst): Oxygen on alloy surfaces for MTS validation</li>
</ul>
<p>All datasets are publicly available through the cited sources.</p>
<h3 id="models">Models</h3>
<p><strong>Point Edge Transformer (PET)</strong> variants:</p>
<ul>
<li><strong>PET-C (Conservative)</strong>: Forces via energy backpropagation</li>
<li><strong>PET-NC (Non-Conservative)</strong>: Direct force prediction head, slightly higher parameter count</li>
<li><strong>PET-M (Hybrid)</strong>: Jointly predicts both conservative and non-conservative forces, accuracy within ~10% of the best single-task models</li>
</ul>
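<p>The conservative/direct split can be made concrete with a toy PyTorch model (an illustrative MLP, not the PET architecture): forces obtained by backpropagating a scalar energy have a symmetric Jacobian by construction, while a direct-force head carries no such guarantee:</p>

```python
import torch

torch.manual_seed(0)

# Toy stand-in for an energy model (PET-C style): flat positions -> scalar.
energy_net = torch.nn.Sequential(
    torch.nn.Linear(6, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def conservative_forces(x):
    """F = -dE/dx by backpropagation; the force Jacobian is then a
    (negative) Hessian and symmetric by construction."""
    if not x.requires_grad:
        x = x.detach().requires_grad_(True)
    energy = energy_net(x).sum()
    (grad,) = torch.autograd.grad(energy, x, create_graph=True)
    return -grad

# A direct-force head (PET-NC style): positions -> forces, no energy.
direct_head = torch.nn.Linear(6, 6)

x = torch.randn(6)
J_cons = torch.autograd.functional.jacobian(conservative_forces, x)
J_dir = torch.autograd.functional.jacobian(direct_head, x)
print(torch.allclose(J_cons, J_cons.T, atol=1e-5))  # True: Hessian symmetry
print(torch.allclose(J_dir, J_dir.T, atol=1e-5))    # False for a generic head
```

<p>The extra backward pass is also where the roughly 2x speed gap between PET-NC and PET-C in the benchmark table comes from.</p>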
<p><strong>Baseline comparisons:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Training Data</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ORB-v2</td>
          <td>Non-conservative</td>
          <td>Alexandria/MPtrj</td>
          <td>Rotationally unconstrained</td>
      </tr>
      <tr>
          <td>EquiformerV2</td>
          <td>Non-conservative</td>
          <td>Alexandria/MPtrj</td>
          <td>Equivariant Transformer</td>
      </tr>
      <tr>
          <td>MACE-MP-0</td>
          <td>Conservative</td>
          <td>MPtrj</td>
          <td>Equivariant message-passing</td>
      </tr>
      <tr>
          <td>SevenNet</td>
          <td>Conservative</td>
          <td>MPtrj</td>
          <td>Equivariant message-passing</td>
      </tr>
      <tr>
          <td>SOAP-BPNN-C</td>
          <td>Conservative</td>
          <td>Bulk water</td>
          <td>Descriptor-based baseline</td>
      </tr>
      <tr>
          <td>SOAP-BPNN-NC</td>
          <td>Non-conservative</td>
          <td>Bulk water</td>
          <td>Descriptor-based baseline</td>
      </tr>
  </tbody>
</table>
<p><strong>Training details:</strong></p>
<ul>
<li><strong>Loss functions</strong>: PET-C uses joint Energy + Force $L^2$ loss; PET-NC uses Force-only $L^2$ loss</li>
<li><strong>Fine-tuning protocol</strong>: PET-NC converted to conservative via energy head fine-tuning</li>
<li><strong>MTS configuration</strong>: Non-conservative forces with conservative corrections every 8 steps ($M=8$)</li>
</ul>
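<p>The two training objectives can be written schematically (the relative weights here are illustrative, not the paper's exact settings):</p>

```python
import numpy as np

def ef_loss(E_pred, E_ref, F_pred, F_ref, w_e=1.0, w_f=1.0):
    """Schematic joint L2 objective: PET-C trains with both terms,
    while setting w_e = 0 recovers the force-only loss used for PET-NC."""
    loss_e = np.mean((np.asarray(E_pred) - np.asarray(E_ref)) ** 2)
    loss_f = np.mean((np.asarray(F_pred) - np.asarray(F_ref)) ** 2)
    return w_e * loss_e + w_f * loss_f

print(ef_loss([1.0, 2.0], [0.0, 2.0], np.zeros((2, 3)), np.zeros((2, 3))))
# 0.5: only the first energy contributes
```

<p>The conservative fine-tuning protocol amounts to switching from the $w_e = 0$ objective to the joint one, with forces now taken from the energy gradient.</p>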
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics &amp; Software:</strong>
Molecular dynamics evaluations were performed with <strong>i-PI</strong>, and geometry optimizations with <strong>ASE (Atomic Simulation Environment)</strong>. Code for reproducing the results is provided as an archived Zenodo snapshot; the authors did not link a live, public GitHub repository.</p>
<ol>
<li><strong>Jacobian asymmetry</strong> ($\lambda$): Quantifies non-conservation via antisymmetric component</li>
<li><strong>Temperature drift</strong>: NVE ensemble stability</li>
<li><strong>Velocity-velocity correlation</strong> ($\hat{c}_{vv}(\omega)$): Thermostat artifact detection</li>
<li><strong>Radial distribution functions</strong> ($g(r)$): Structural accuracy</li>
<li><strong>Species-resolved temperature</strong>: Equipartition testing</li>
<li><strong>Inference speed</strong>: Wall-clock time per MD step</li>
</ol>
<p><strong>Key results:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Speed (ms/step)</th>
          <th>NVE Stability</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PET-NC</td>
          <td>8.58</td>
          <td>Failed</td>
          <td>~7,000 billion K/s drift</td>
      </tr>
      <tr>
          <td>PET-C</td>
          <td>19.4</td>
          <td>Stable</td>
          <td>2.3x slower than PET-NC</td>
      </tr>
      <tr>
          <td>SevenNet</td>
          <td>52.8</td>
          <td>Stable</td>
          <td>Conservative baseline</td>
      </tr>
      <tr>
          <td><strong>PET Hybrid (MTS)</strong></td>
          <td><strong>~10.3</strong></td>
          <td><strong>Stable</strong></td>
          <td><strong>~20% overhead vs. pure NC</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Thermostat artifacts:</strong></p>
<ul>
<li>Langevin ($\tau=10$ fs) damped diffusion by ~5x (weaker coupling at $\tau=100$ fs reduced diffusion by ~1.5x)</li>
<li>GLE thermostats also failed to control non-conservative drift</li>
<li>Equipartition violations under SVR: ORB showed H at 336 K and O at 230 K (target 300 K); PET-NC showed smaller but significant species-resolved deviations</li>
</ul>
<p><strong>Optimization failures:</strong></p>
<ul>
<li>Non-conservative models showed lower geometry optimization success rates across water and QM9 benchmarks, with inaccurate NC models failing catastrophically</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute resources:</strong></p>
<ul>
<li><strong>Training</strong>: From-scratch baseline models were trained on 4x Nvidia H100 GPUs for roughly two days.</li>
<li><strong>Fine-tuning</strong>: Conservative fine-tuning used a single Nvidia H100 GPU for one day.</li>
<li>The fine-tuning route therefore cut compute by 2-4x relative to training conservative models from scratch.</li>
</ul>
<p><strong>Reproduction resources:</strong></p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://zenodo.org/records/14778891">Zenodo repository</a></td>
          <td>Code/Data</td>
          <td>Unknown</td>
          <td>Code and data to reproduce all results</td>
      </tr>
      <tr>
          <td><a href="https://atomistic-cookbook.org/examples/pet-mad-nc/pet-mad-nc.html">MTS inference tutorial</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Multiple time-stepping dynamics tutorial</td>
      </tr>
      <tr>
          <td><a href="https://atomistic-cookbook.org/examples/pet-finetuning/pet-ft-nc.html">Conservative fine-tuning tutorial</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Fine-tuning workflow tutorial</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bigi, F., Langer, M. F., &amp; Ceriotti, M. (2025). The dark side of the forces: assessing non-conservative force models for atomistic machine learning. <em>Proceedings of the 42nd International Conference on Machine Learning</em>, PMLR 267.</p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{bigi2025dark,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The dark side of the forces: assessing non-conservative force models for atomistic machine learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bigi, Filippo and Langer, Marcel F and Ceriotti, Michele}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span>=<span style="color:#e6db74">{Vancouver, Canada}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://icml.cc/virtual/2025/poster/45458">ICML 2025 poster page</a></li>
<li><a href="https://openreview.net/pdf?id=OEl3L8osas">PDF on OpenReview</a></li>
<li><a href="https://zenodo.org/records/14778891">Zenodo repository</a></li>
<li><a href="https://atomistic-cookbook.org/examples/pet-mad-nc/pet-mad-nc.html">MTS Inference Tutorial</a></li>
<li><a href="https://atomistic-cookbook.org/examples/pet-finetuning/pet-ft-nc.html">Conservative Fine-Tuning Tutorial</a></li>
</ul>
]]></content:encoded></item></channel></rss>