<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Property Prediction on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/</link><description>Recent content in Property Prediction on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Thu, 09 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/index.xml" rel="self" type="application/rss+xml"/><item><title>MTL-BERT: Multitask BERT for Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</guid><description>MTL-BERT combines BERT pretraining, multitask learning, and SMILES enumeration for molecular property prediction across 60 drug discovery datasets.</description><content:encoded><![CDATA[<h2 id="a-multitask-bert-framework-for-molecular-property-prediction">A Multitask BERT Framework for Molecular Property Prediction</h2>
<p>MTL-BERT is a <strong>Method</strong> paper that introduces a multitask learning framework built on BERT for predicting molecular properties from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a>. The primary contribution is the combination of three strategies to address data scarcity in drug discovery: (1) masked token pretraining on 1.7 million unlabeled molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, (2) multitask fine-tuning across 60 property prediction datasets simultaneously, and (3) <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> as a data augmentation technique applied during pretraining, fine-tuning, and inference. The model achieves strong performance across 60 <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> and molecular property datasets (44 classification and 16 regression), outperforming baselines including GNNs, XGBoost with molecular fingerprints, and prior <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a> approaches.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p>Deep learning methods for molecular property prediction face a fundamental tension: they require large amounts of labeled data to learn effectively, but labeled bioactivity data is scarce due to the cost and time of laboratory experiments. Existing approaches at the time of publication addressed this in isolation. Graph neural networks (GNNs) learn from molecular graphs but are typically shallow (2-3 layers) and prone to overfitting on small datasets. The original SMILES-BERT model applied masked language modeling to SMILES strings but fine-tuned separately for each task, missing opportunities to share information across related properties. Fixed molecular representations like <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (continuous and data-driven descriptors) cannot be further optimized for specific downstream tasks.</p>
<p>The authors identify three specific gaps: (1) single-task fine-tuning wastes the correlations between related ADMET properties (e.g., <a href="https://en.wikipedia.org/wiki/Lipophilicity">lipophilicity</a> relates to many ADMET endpoints), (2) using only canonical SMILES limits the model&rsquo;s ability to learn robust molecular features, and (3) no prior work had combined pretraining, multitask learning, and SMILES enumeration into a unified framework.</p>
<h2 id="three-strategies-combined-pretraining-multitask-learning-and-smiles-enumeration">Three Strategies Combined: Pretraining, Multitask Learning, and SMILES Enumeration</h2>
<p>The core innovation of MTL-BERT is the synergistic combination of three strategies in a single pipeline.</p>
<h3 id="masked-smiles-pretraining">Masked SMILES Pretraining</h3>
<p>Following the BERT paradigm, MTL-BERT pretrains on 1.7 million unlabeled molecules from ChEMBL using a masked token recovery task. For each SMILES string, 15% of tokens are randomly selected: 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% remain unchanged. The loss is computed only at masked positions. Unlike the original BERT, MTL-BERT omits the next-sentence prediction task since there is no sequential relationship between SMILES strings (following the RoBERTa finding that this task is unnecessary).</p>
<p>SMILES strings are tokenized with a regular expression that captures multi-character tokens (e.g., Si, Br, Cl) and common SMILES syntax. The model uses positional encoding to capture token order.</p>
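<p>The tokenization and masking scheme can be sketched as follows. The regex below is an illustrative approximation of such a SMILES tokenizer (not the paper's exact pattern), and the masking follows the 80/10/10 split described above:</p>

```python
import random
import re

# Illustrative SMILES tokenizer: bracketed atoms, two-letter elements
# (Br, Cl, Si), single atoms, then SMILES syntax characters.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|[BCNOSPFI]|[bcnops]"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN_RE.findall(smiles)

def mask_tokens(tokens, mask_rate=0.15, vocab=("C", "N", "O"), rng=random):
    """BERT-style masking: each token is selected with prob. mask_rate;
    of the selected tokens, 80% become [MASK], 10% become a random token,
    10% stay unchanged. Loss is computed only at the returned positions."""
    out, loss_positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            loss_positions.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: token kept unchanged, but still contributes to the loss
    return out, loss_positions

tokens = tokenize("CC(=O)Oc1ccccc1Cl")  # note "Cl" stays a single token
```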
<h3 id="transformer-architecture">Transformer Architecture</h3>
<p>The model uses a standard Transformer encoder with multihead self-attention. The scaled dot-product attention computes:</p>
<p>$$\mathbf{O}_h = \text{softmax}\left(\frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d_k}}\right) \mathbf{V}_h$$</p>
<p>where $\mathbf{Q}_h$, $\mathbf{K}_h$, and $\mathbf{V}_h$ are the query, key, and value matrices for head $h$, and $\sqrt{d_k}$ is a scaling factor. The outputs from all heads are concatenated and projected. Each attention sublayer is followed by a position-wise feedforward network with GELU activation, layer normalization, and residual connections.</p>
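<p>In code, the per-head attention above amounts to the following minimal NumPy sketch (dimensions are illustrative, not the paper's):</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """O_h = softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 5, 16                                        # 5 tokens, head dim 16
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
```

In the full model, the outputs of all heads are then concatenated and linearly projected, as described above.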
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
          <th>Fine-tuning Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MTL-BERT_SMALL</td>
          <td>4</td>
          <td>4</td>
          <td>128</td>
          <td>512</td>
          <td>0.931</td>
          <td>0.826</td>
      </tr>
      <tr>
          <td>MTL-BERT_MEDIUM</td>
          <td>8</td>
          <td>8</td>
          <td>256</td>
          <td>1,024</td>
          <td>0.962</td>
          <td>0.852</td>
      </tr>
      <tr>
          <td>MTL-BERT_LARGE</td>
          <td>12</td>
          <td>12</td>
          <td>576</td>
          <td>2,304</td>
          <td>0.974</td>
          <td>0.848</td>
      </tr>
  </tbody>
</table>
<p>The medium model was selected for its best fine-tuning performance with lower computational cost, despite the large model achieving higher pretraining recovery accuracy. The slight performance drop for the large model suggests mild overfitting.</p>
<h3 id="multitask-fine-tuning-with-task-tokens">Multitask Fine-tuning with Task Tokens</h3>
<p>During fine-tuning, task tokens ([T0], [T1], &hellip;) are prepended to each input SMILES string. The Transformer output at each task token position is passed through a task-specific two-layer feedforward network for the corresponding prediction task. An attention mask prevents direct information exchange between task tokens, allowing each task to learn directly from SMILES tokens without interference. This design also reduces the discrepancy between pretraining (no task tokens visible) and fine-tuning.</p>
<p>Cross-entropy loss is used for classification tasks and mean squared error for regression tasks. The total multitask loss is a simple sum of per-task losses without learned weighting.</p>
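<p>The unweighted multitask objective can be sketched as follows (a minimal NumPy illustration of summing per-task losses, not the paper's implementation):</p>

```python
import numpy as np

def binary_cross_entropy(p, y):
    p = np.clip(p, 1e-7, 1 - 1e-7)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

def multitask_loss(cls_outputs, reg_outputs):
    """Simple sum of per-task losses with no learned weighting:
    cross-entropy for classification heads, MSE for regression heads."""
    total = 0.0
    for p, y in cls_outputs:     # (predicted probabilities, binary labels)
        total += binary_cross_entropy(p, y)
    for pred, y in reg_outputs:  # (predicted values, targets)
        total += mse(pred, y)
    return total
```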
<h3 id="smiles-enumeration-as-data-augmentation">SMILES Enumeration as Data Augmentation</h3>
<p>A molecule can be represented by multiple valid SMILES strings by varying starting atoms and traversal orders. MTL-BERT applies SMILES enumeration at all three stages:</p>
<ol>
<li><strong>Pretraining</strong>: Enumerated SMILES increase diversity of the self-supervised training data.</li>
<li><strong>Fine-tuning</strong>: Each dataset is augmented 20x with random SMILES variants, increasing data diversity and helping the model learn position-invariant features.</li>
<li><strong>Inference</strong>: Multiple SMILES are generated per test molecule, and their predictions are fused (averaged) into a more robust final prediction.</li>

</ol>
<p>The 20x augmentation factor was chosen based on prior work showing diminishing returns beyond this level while significantly increasing computational cost.</p>
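<p>The enumeration step can be sketched with RDKit (assumed available here), whose <code>doRandom</code> flag produces random-order SMILES; the duplicate-retry loop mirrors the paper's strategy of re-sampling when an identical string is drawn:</p>

```python
from rdkit import Chem

def enumerate_smiles(smiles: str, n: int = 20, max_tries: int = 100) -> list[str]:
    """Generate up to n distinct random SMILES for one molecule,
    retrying up to max_tries per string when a duplicate is drawn
    (a sketch of 20x augmentation, not the paper's exact code)."""
    mol = Chem.MolFromSmiles(smiles)
    seen: list[str] = []
    for _ in range(n):
        for _ in range(max_tries):
            s = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
            if s not in seen:
                seen.append(s)
                break
    return seen

variants = enumerate_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

Every variant canonicalizes back to the same molecule, which is what makes enumeration a label-preserving augmentation.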
<h2 id="experimental-evaluation-across-60-datasets">Experimental Evaluation Across 60 Datasets</h2>
<h3 id="setup">Setup</h3>
<p>MTL-BERT was evaluated on 60 datasets (44 classification, 16 regression) covering ADMET properties and common molecular benchmarks. Datasets were sourced from ADMETlab and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>. Each dataset was split 8:1:1 (train/validation/test), and experiments were repeated 10 times with random splits, reporting mean and standard deviation.</p>
<p>Classification tasks were evaluated with <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> and accuracy; regression tasks with $R^2$ and RMSE.</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ul>
<li><strong>ECFP4-XGBoost</strong>: Extended-connectivity fingerprints (diameter 4) with gradient boosting</li>
<li><strong>Graph Attention Network (GAT)</strong></li>
<li><strong>Graph Convolutional Network (GCN)</strong></li>
<li><strong>AttentiveFP</strong>: A GNN with attention for molecular property prediction</li>
<li><strong>CDDD</strong>: Continuous and data-driven descriptors from a pretrained RNN auto-encoder</li>
</ul>
<h3 id="ablation-study">Ablation Study</h3>
<p>Three model variants were compared to isolate contributions:</p>
<ul>
<li><strong>MTL-BERT</strong>: Full model (pretraining + multitask + SMILES enumeration)</li>
<li><strong>STL-BERT</strong>: Single-task fine-tuning with SMILES enumeration (no multitask)</li>
<li><strong>Cano-BERT</strong>: Canonical SMILES only, single-task fine-tuning (equivalent to SMILES-BERT)</li>
</ul>
<p>Cano-BERT showed more than 10% degradation on several datasets (CL, Fu, LC50DM) compared to STL-BERT, demonstrating the importance of SMILES enumeration. MTL-BERT outperformed STL-BERT on most datasets, with improvements exceeding 5% on $F_{20\%}$, SR-ARE, and SR-ATAD5, confirming that multitask learning provides additional benefit on top of enumeration.</p>
<h3 id="results-vs-baselines">Results vs. Baselines</h3>
<p>MTL-BERT outperformed all baselines on nearly all 60 datasets. Specific findings:</p>
<ul>
<li>ECFP4-XGBoost performed inconsistently, doing well on some tasks (e.g., $F_{30\%}$, BACE, CL) but poorly on others, reflecting the limitation of fixed-length fingerprint representations.</li>
<li>GNNs generally improved over fingerprints but still suffered from data scarcity, falling behind ECFP4-XGBoost by more than 3% on $F_{30\%}$, Carcinogenicity, CL, and VD.</li>
<li>MTL-BERT surpassed all baselines except on CYP2C19-sub and BACE (by less than 1.1%).</li>
<li>On 14 tasks (NR-ER, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, Bioconcentration Factor, Fu, LC50FM, Lipophilicity, CL, PPB, VD, LC50DM), MTL-BERT exceeded the best baseline by margins of 5% to more than 10%.</li>
<li>Improvements were statistically significant at the 95% confidence level (paired t-test, $P \leq 0.001$).</li>
</ul>
<h3 id="representation-analysis">Representation Analysis</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pretrained token embeddings (from 1,000 randomly selected molecules, approximately 35,000 tokens) showed that:</p>
<ul>
<li>Tokens of the same type cluster together (capturing atomic type information).</li>
<li>Within type clusters, sub-groups correspond to different chemical environments (e.g., oxygen atoms in nitrate groups vs. carbonyl groups).</li>
<li>Nearby embeddings share similar molecular neighborhood environments.</li>
</ul>
<h3 id="attention-based-interpretability">Attention-based Interpretability</h3>
<p>The model&rsquo;s attention weights provide interpretability for predictions:</p>
<ul>
<li>For a solubility task (LogS/LogD), attention concentrated on polar groups, which are known determinants of aqueous solubility.</li>
<li>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> (mutagenicity), attention focused on <a href="https://en.wikipedia.org/wiki/Azide">azide</a>, nitrosamide, <a href="https://en.wikipedia.org/wiki/Acyl_chloride">acyl chloride</a>, and nitrite groups, which are known mutagenic structural alerts.</li>
</ul>
<h2 id="performance-gains-from-combined-strategies-with-interpretable-attention">Performance Gains from Combined Strategies with Interpretable Attention</h2>
<p>MTL-BERT demonstrates that the combination of pretraining, multitask learning, and SMILES enumeration is more effective than any individual strategy for molecular property prediction. The ablation study provides clear evidence for the additive benefit of each component.</p>
<p>Key strengths include the breadth of evaluation (60 datasets covering diverse ADMET endpoints), the consistent improvement over multiple baseline types (fingerprints, GNNs, pretrained representations), and the interpretable attention mechanism that highlights chemically meaningful substructures.</p>
<p>Limitations to note: the simple sum of multitask losses (no learned task weighting) may not be optimal when tasks have very different scales or when some tasks are unrelated. The authors observe slight degradation on a few datasets (AMES, CYP1A2-Sub, FreeSolv), suggesting negative transfer in those cases. The 20x SMILES enumeration significantly increases computational cost during fine-tuning and inference. The paper does not report wall-clock training times or GPU hours, making it difficult to assess the practical cost of the enumeration strategy. Hardware details are not specified beyond acknowledgment of the High-Performance Computing Center at Central South University.</p>
<p>The hierarchical clustering of task representations reveals meaningful task groupings (e.g., LogD and LogP cluster together due to their shared relationship with water solubility), supporting the premise that multitask learning captures cross-task correlations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>1.7M molecules</td>
          <td>Unlabeled SMILES; 10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning/Evaluation</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>60 datasets (44 classification, 16 regression)</td>
          <td>8:1:1 train/val/test split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Masked token prediction (15% masking rate: 80% [MASK], 10% random, 10% unchanged). Adam optimizer, learning rate 1e-4, batch size 512, 50 epochs.</li>
<li><strong>Fine-tuning</strong>: Adam optimizer, learning rate 5e-5, batch size 64, dropout 0.1. Cross-entropy for classification, MSE for regression. Early stopping with patience 20, max 200 epochs.</li>
<li><strong>SMILES enumeration</strong>: 20x augmentation. Sampling is retried (up to 100 times) when an enumerated SMILES duplicates a previously generated one.</li>
<li><strong>Inference fusion</strong>: Predictions from multiple enumerated SMILES are averaged.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>MTL-BERT_MEDIUM (selected model): 8 layers, 8 attention heads, 256 embedding size, 1,024 FFN size</li>
<li>Pretraining recovery accuracy: 0.962</li>
<li>1,000 task tokens pre-allocated for future tasks</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Classification</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Regression</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Secondary metric</td>
      </tr>
  </tbody>
</table>
<p>All experiments repeated 10 times with random splits; mean and standard deviation reported.</p>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The authors acknowledge the High-Performance Computing Center of Central South University.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/MTL-BERT">MTL-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Pretraining data source</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Fine-tuning benchmark</td>
      </tr>
      <tr>
          <td><a href="https://admetmesh.scbdd.com/">ADMETlab</a></td>
          <td>Dataset</td>
          <td>Free for academic use</td>
          <td>ADMET property datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yi, J.-C., Zeng, X.-X., Yang, C.-Q., Lu, A.-P., Hou, T.-J., &amp; Cao, D.-S. (2022). Pushing the boundaries of molecular property prediction for drug discovery with multitask learning BERT enhanced by SMILES enumeration. <em>Research</em>, 2022, Article 0004. <a href="https://doi.org/10.34133/research.0004">https://doi.org/10.34133/research.0004</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2022mtlbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yi, Jia-Cai and Zeng, Xiang-Xiang and Yang, Can-Qun and Lu, Ai-Ping and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{Article 0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.34133/research.0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science (AAAS)}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Maxsmi: SMILES Augmentation for Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/</guid><description>Maxsmi systematically evaluates five SMILES augmentation strategies with CNN and RNN models across solubility, lipophilicity, and bioactivity tasks.</description><content:encoded><![CDATA[<h2 id="systematic-benchmarking-of-smiles-data-augmentation">Systematic Benchmarking of SMILES Data Augmentation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates how SMILES augmentation affects deep learning molecular property prediction. The primary contribution is a comprehensive comparison of five augmentation strategies across three neural network architectures and four datasets, producing the &ldquo;Maxsmi&rdquo; models that maximize prediction performance. The study also demonstrates that test-time augmentation provides a practical confidence measure for predictions.</p>
<h2 id="the-data-scarcity-problem-in-qsar-modeling">The Data Scarcity Problem in QSAR Modeling</h2>
<p>Deep learning models require large training sets to perform well, but experimental physico-chemical and bioactivity datasets remain small, typically ranging from hundreds to a few thousand compounds. SMILES augmentation, where the non-unique <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES representation</a> of a molecule is exploited to generate multiple training examples per compound, has been shown to help in prior work by Bjerrum (2017), Kimber et al. (2018), and Li and Fourches (2020). However, no prior study had systematically compared different augmentation strategies, analyzed how much augmentation is needed, or examined the relationship between augmentation factor and prediction confidence. Most previous work chose augmentation numbers a priori without justification. Maxsmi fills this gap by providing a systematic analysis and practical guidelines.</p>
<h2 id="five-augmentation-strategies-and-test-time-ensemble-learning">Five Augmentation Strategies and Test-Time Ensemble Learning</h2>
<p>The core insight is twofold. First, the authors define five distinct strategies for generating augmented SMILES:</p>
<ol>
<li><strong>No augmentation</strong>: use only the canonical SMILES (baseline)</li>
<li><strong>Augmentation with duplication</strong>: generate $m$ random SMILES per compound, allowing duplicates; dataset grows to $N \times m$</li>
<li><strong>Augmentation without duplication</strong>: generate $m$ random SMILES and discard exact duplicates</li>
<li><strong>Augmentation with reduced duplication</strong>: keep only $f(m) = \sqrt{m}$ copies of each duplicate, a compromise between the above</li>
<li><strong>Augmentation with estimated maximum</strong>: sample random SMILES until the same string has been generated 10 times, attempting to cover most of the valid SMILES space</li>
</ol>
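<p>The duplicate-handling strategies above can be made concrete with a small sketch that post-processes the $m$ random-SMILES draws for a single compound (an illustration of the strategies' logic, not the authors' code; the $\sqrt{m}$ cap follows strategy 4's $f(m)$):</p>

```python
import math
from collections import Counter

def apply_duplication_strategy(draws: list[str], m: int, strategy: str) -> list[str]:
    """Post-process m random-SMILES draws for one compound according
    to the chosen duplicate-handling strategy (strategies 2-4 above)."""
    if strategy == "with_duplication":
        return draws                                   # keep all m draws
    counts = Counter(draws)
    if strategy == "without_duplication":
        return list(counts)                            # one copy of each string
    if strategy == "reduced_duplication":
        cap = max(1, math.isqrt(m))                    # keep ~sqrt(m) copies
        return [s for s, c in counts.items() for _ in range(min(c, cap))]
    raise ValueError(strategy)
```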
<p>Second, the authors formalize test-time augmentation as ensemble learning. Given a trained model $M_{\Theta}$, each test compound $C$ is represented by $k$ random SMILES $S_1(C), \ldots, S_k(C)$. The per-SMILES predictions are:</p>
<p>$$
\hat{y}_i(C) = M_{\Theta}(S_i(C))
$$</p>
<p>The compound-level prediction is an aggregation (mean) over these:</p>
<p>$$
\hat{y}(C) = A\big(\hat{y}_1(C), \ldots, \hat{y}_k(C)\big)
$$</p>
<p>The standard deviation of the per-SMILES predictions serves as a confidence measure: high variance indicates the model is uncertain about a compound.</p>
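<p>The aggregation and confidence measure can be sketched as follows (a minimal NumPy illustration with a toy stand-in for the trained model $M_{\Theta}$):</p>

```python
import numpy as np

def fuse_predictions(model, smiles_variants):
    """Test-time augmentation as ensembling: predict on k random SMILES
    of one compound, return the mean as the compound-level prediction
    and the standard deviation as a confidence measure."""
    preds = np.array([model(s) for s in smiles_variants])
    return preds.mean(), preds.std()

# Toy stand-in for a trained model: score a SMILES string by its length.
toy_model = lambda s: float(len(s))
mean_pred, uncertainty = fuse_predictions(toy_model, ["CCO", "C(C)O", "OCC"])
```

A high standard deviation flags compounds where the model's output depends strongly on the particular SMILES writing, i.e. where it is uncertain.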
<h2 id="experimental-design-three-architectures-four-datasets">Experimental Design: Three Architectures, Four Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size (after preprocessing)</th>
          <th>Train / Test</th>
          <th>Task</th>
          <th>Provenance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>902 / 226</td>
          <td>Water solubility</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></td>
      </tr>
      <tr>
          <td>ESOL_small</td>
          <td>1,068</td>
          <td>854 / 214</td>
          <td>Solubility (max 25 heavy atoms)</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>513 / 129</td>
          <td>Hydration free energy</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,199</td>
          <td>3,359 / 840</td>
          <td>Octanol/water distribution</td>
          <td><a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a></td>
      </tr>
      <tr>
          <td>Affinity (EGFR)</td>
          <td>5,849</td>
          <td>4,679 / 1,170</td>
          <td><a href="https://en.wikipedia.org/wiki/IC50">pIC50</a> against <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> kinase</td>
          <td>Kinodata</td>
      </tr>
  </tbody>
</table>
<h3 id="architectures">Architectures</h3>
<p>Three shallow neural networks are compared:</p>
<ul>
<li><strong>CONV1D</strong>: 1D convolution (kernel size 10, stride 1) followed by two fully connected layers</li>
<li><strong>CONV2D</strong>: 2D convolution on the one-hot encoded SMILES matrix, followed by two fully connected layers</li>
<li><strong>RNN</strong>: LSTM layer followed by two fully connected layers (128 and 64 units)</li>
</ul>
<p>All models are trained for 250 epochs with batch size 16, MSE loss, SGD optimizer, and learning rate 0.001. A Random Forest baseline with Morgan fingerprints (radius 2, length 1024) is also included.</p>
<h3 id="augmentation-sweep">Augmentation sweep</h3>
<p>The augmentation number $m$ is varied from 1 to 20 (step 1) and from 20 to 100 (step 10) for three strategies (with, without, and reduced duplication). The estimated maximum strategy is tested on the smaller datasets. Both training and test sets receive the same augmentation.</p>
<h2 id="key-findings-augmentation-consistently-improves-rmse">Key Findings: Augmentation Consistently Improves RMSE</h2>
<h3 id="augmentation-always-helps">Augmentation always helps</h3>
<p>Across all datasets and architectures, SMILES augmentation reduces test RMSE compared to the no-augmentation baseline. Performance improves sharply in the low augmentation range (1 to 10) and reaches a plateau around 40 to 70, after which additional augmentation provides diminishing returns.</p>
<h3 id="best-models-maxsmi">Best models (Maxsmi)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Model</th>
          <th>Augmentation Number</th>
          <th>Strategy</th>
          <th>Test RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>Reduced duplication</td>
          <td>0.569</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>With duplication</td>
          <td>1.032</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>CONV1D</td>
          <td>80</td>
          <td>Without duplication</td>
          <td>0.593</td>
      </tr>
  </tbody>
</table>
<p>The CONV1D architecture consistently outperforms RNN and CONV2D. For ESOL, the CONV1D model improves from 0.839 RMSE (no augmentation) to 0.569 RMSE (70x reduced duplication), a 32% reduction.</p>
<h3 id="no-single-best-augmentation-strategy">No single best augmentation strategy</h3>
<p>The three main augmentation strategies (with, without, and reduced duplication) perform similarly. Generating the estimated maximum number of unique SMILES does not yield the best results, suggesting a saturation point exists where additional SMILES diversity stops helping.</p>
<h3 id="canonical-smiles-outperform-single-random-smiles">Canonical SMILES outperform single random SMILES</h3>
<p>When augmentation is limited to a single representation ($m = 1$), the canonical SMILES consistently outperforms a single random SMILES. On ESOL with CONV1D, the canonical model achieves 0.839 RMSE versus 0.964 for a random SMILES. The authors attribute this to the simpler, more readable structure of canonical SMILES (fewer branches and brackets).</p>
<h3 id="comparison-to-prior-work">Comparison to prior work</h3>
<table>
  <thead>
      <tr>
          <th>Study</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Maxsmi</td>
          <td>0.569</td>
          <td>1.032</td>
          <td>0.593</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td>MoleculeNet</td>
          <td>0.58 +/- 0.03</td>
          <td>1.15 +/- 0.12</td>
          <td>0.655 +/- 0.036</td>
          <td>GNN</td>
      </tr>
      <tr>
          <td>CNF</td>
          <td>0.62</td>
          <td>1.11</td>
          <td>0.67</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a></td>
          <td>N/A</td>
          <td>1.197 +/- 0.127</td>
          <td>0.565 +/- 0.037</td>
          <td>RNN</td>
      </tr>
  </tbody>
</table>
<p>Maxsmi outperforms or matches MoleculeNet&rsquo;s graph neural networks and the CNF model on all three tasks. MolPMoFiT slightly outperforms Maxsmi on lipophilicity (0.565 vs 0.593) but performs worse on FreeSolv.</p>
<h3 id="confidence-estimation">Confidence estimation</h3>
<p>The standard deviation of per-SMILES predictions correlates with prediction error. Confidence curves show that sequentially removing compounds with the highest uncertainty leads to monotonically decreasing mean prediction error. For ESOL, keeping only the top 10% most confident predictions yields errors below 0.25.</p>
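<p>A confidence curve of this kind can be sketched as follows (an illustration of the analysis with made-up numbers, not the paper's data):</p>

```python
import numpy as np

def confidence_curve(errors, uncertainties):
    """Mean absolute error over the retained set when keeping only the
    k most confident compounds, for k = 1..N (sketch of the analysis)."""
    order = np.argsort(uncertainties)                 # most confident first
    errs = np.abs(np.asarray(errors, float))[order]
    return np.cumsum(errs) / np.arange(1, len(errs) + 1)

errors = [0.1, 0.9, 0.3, 1.5]                         # per-compound errors
uncertainties = [0.05, 0.80, 0.20, 1.10]              # per-SMILES prediction stds
curve = confidence_curve(errors, uncertainties)
```

When uncertainty tracks error, as here, the curve rises as less confident compounds are admitted, which is the behavior the paper reports.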
<h3 id="egfr-affinity-test-case">EGFR affinity test case</h3>
<p>Applying the Maxsmi approach (CONV1D, 70x augmentation, reduced duplication) to EGFR kinase affinity prediction yields test RMSE of 0.777 and $R^2$ of 0.712, compared to 1.031 RMSE and 0.494 $R^2$ for the canonical model (a 25% RMSE improvement). The Random Forest baseline (0.758 RMSE, 0.726 $R^2$) performs comparably, which the authors note without further explanation.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>All experiments use a single train/test split (80/20) without cross-validation, due to the computational cost of the full augmentation sweep. This means reported RMSE values lack uncertainty estimates for the Maxsmi models.</li>
<li>The study uses shallow networks only. Whether the same augmentation benefits apply to deeper architectures or pre-trained models is untested.</li>
<li>The EGFR test case shows the Random Forest baseline performing comparably to the Maxsmi model, raising questions about when SMILES augmentation provides a meaningful advantage over traditional fingerprint-based methods.</li>
<li>The comparison to prior work uses different splits, preprocessing, and evaluation protocols across studies, which the authors acknowledge limits direct comparability.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>MoleculeNet, water solubility</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet, hydration free energy</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,199</td>
          <td>ChEMBL, logD</td>
      </tr>
      <tr>
          <td>Test case</td>
          <td>EGFR Affinity</td>
          <td>5,849</td>
          <td>Kinodata (ChEMBL v28), pIC50</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet/DeepChem and Kinodata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES generation via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>&rsquo;s random SMILES enumeration</li>
<li>One-hot encoding of SMILES characters with padding to max length</li>
<li>Five augmentation strategies applied to both training and test sets</li>
<li>Mean aggregation for compound-level predictions</li>
</ul>
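<p>The encoding and aggregation steps above can be sketched in a few lines (a simplified illustration; the actual pipeline builds its alphabet from the training data and generates random SMILES with RDKit&rsquo;s enumeration):</p>

```python
def one_hot_encode(smiles, alphabet, max_len):
    """One-hot encode a SMILES string character-by-character, padding
    with all-zero rows up to max_len."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1
    return matrix

# Alphabet derived from a toy "training set" of one molecule.
alphabet = sorted(set("CC(=O)Oc1ccccc1"))
enc = one_hot_encode("CC(=O)O", alphabet, max_len=10)
print(len(enc), len(enc[0]))  # 10 padded positions, one column per character
```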
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CONV1D</td>
          <td>1D conv (kernel 10, stride 1) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>CONV2D</td>
          <td>2D conv (single channel) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RNN</td>
          <td>LSTM + FC(128) + FC(64)</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RF Baseline</td>
          <td>Random Forest (default sklearn)</td>
          <td>Morgan FP, radius 2, length 1024</td>
      </tr>
  </tbody>
</table>
<p>Training: 250 epochs, batch size 16, MSE loss, SGD, lr=0.001.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE (ESOL)</td>
          <td>0.569</td>
          <td>1.102 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
      <tr>
          <td>RMSE (FreeSolv)</td>
          <td>1.032</td>
          <td>2.563 (RF)</td>
          <td>CONV1D, 70x with dup</td>
      </tr>
      <tr>
          <td>RMSE (Lipophilicity)</td>
          <td>0.593</td>
          <td>0.860 (RF)</td>
          <td>CONV1D, 80x without dup</td>
      </tr>
      <tr>
          <td>RMSE (EGFR)</td>
          <td>0.777</td>
          <td>0.758 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a GeForce GTX 1080 Ti, provided by the HPC cluster at Freie Universität Berlin. Training CONV1D on ESOL with 100x augmentation (keeping duplicates, 90,200 data points) takes approximately 3 hours. Training with 19x augmentation achieves RMSE of 0.605 in under 30 minutes.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/maxsmi">volkamerlab/maxsmi</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full source code, trained models, CLI for prediction</td>
      </tr>
      <tr>
          <td><a href="https://maxsmi.readthedocs.io/en/latest/">Documentation</a></td>
          <td>Docs</td>
          <td>N/A</td>
          <td>Read the Docs documentation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/openkinome/kinodata">Kinodata</a></td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Curated kinase bioactivity data from ChEMBL v28</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code, data, trained models, and a command-line prediction tool are all publicly available under the MIT license.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kimber, T. B., Gagnebin, M., &amp; Volkamer, A. (2021). Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. <em>Artificial Intelligence in the Life Sciences</em>, 1, 100014. <a href="https://doi.org/10.1016/j.ailsci.2021.100014">https://doi.org/10.1016/j.ailsci.2021.100014</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimber2021maxsmi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kimber, Talia B. and Gagnebin, Maxime and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Artificial Intelligence in the Life Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{100014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ailsci.2021.100014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformers for Molecular Property Prediction Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</guid><description>A systematic review of 16 transformer models for molecular property prediction, analyzing architecture, data, tokenization, and benchmarking gaps.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformers-for-molecular-property-prediction">A Systematization of Transformers for Molecular Property Prediction</h2>
<p>This is a <strong>Systematization</strong> paper. Sultan et al. provide the first comprehensive, structured review of sequence-based transformer models applied to molecular property prediction (MPP). The review catalogs 16 models published between 2019 and 2023, organizes them by architecture type (encoder-decoder, encoder-only, decoder-only), and systematically examines seven key design decisions that arise when building a transformer for MPP. The paper&rsquo;s primary contribution is identifying gaps in current evaluation practices and articulating what standardization the field needs for meaningful progress.</p>
<h2 id="the-problem-inconsistent-evaluation-hinders-progress">The Problem: Inconsistent Evaluation Hinders Progress</h2>
<p>Molecular property prediction is essential for drug discovery, crop protection, and environmental science. Deep learning approaches, including transformers, have been increasingly applied to this task by learning molecular representations from string notations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. However, the field faces several challenges:</p>
<ol>
<li><strong>Small labeled datasets</strong>: Labeled molecular property datasets typically contain only hundreds or thousands of molecules, making supervised learning alone insufficient.</li>
<li><strong>No standardized evaluation protocol</strong>: Different papers use different data splits (scaffold vs. random), different splitting implementations, different numbers of repetitions (3 to 50), and sometimes do not share their test sets. This makes direct comparison across models infeasible.</li>
<li><strong>Unclear design choices</strong>: With many possible configurations for pre-training data, chemical language, tokenization, positional embeddings, model size, pre-training objectives, and fine-tuning approaches, the field lacks systematic analyses to guide practitioners.</li>
</ol>
<p>The authors note that standard machine learning methods with fixed-size molecular fingerprints remain strong baselines for real-world datasets, illustrating that the promise of transformers for MPP has not yet been fully realized.</p>
<h2 id="seven-design-questions-for-molecular-transformers">Seven Design Questions for Molecular Transformers</h2>
<p>The central organizing framework of this review addresses seven questions practitioners must answer when building a transformer for MPP. For each, the authors synthesize findings across the 16 reviewed models.</p>
<h3 id="reviewed-models">Reviewed Models</h3>
<p>The paper catalogs 16 models organized by architecture:</p>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Base Model</th>
          <th>Models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Encoder-Decoder</td>
          <td>Transformer, BART</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">ST</a>, Transformer-CNN, <a href="/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/">X-Mol</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>BERT</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, MAT, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, Mol-BERT, Chen et al., K-BERT, FP-BERT, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>RoBERTa</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
      </tr>
      <tr>
          <td>Decoder-Only</td>
          <td>XLNet</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> (RT)</td>
      </tr>
  </tbody>
</table>
<p>The core attention mechanism shared by all these models is the scaled dot-product attention:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
$$</p>
<p>where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_{k}$ is the dimension of the key vectors.</p>
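<p>As a concrete reference, the formula above translates directly into NumPy (a generic sketch, not code from any reviewed model):</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row-wise with a
    numerically stable softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one attended vector per query
```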
<h3 id="question-1-which-database-and-how-many-molecules">Question 1: Which Database and How Many Molecules?</h3>
<p>Pre-training data sources vary considerably. The three main databases are ZINC (37 billion molecules in ZINC22), ChEMBL (2.4 million unique molecules with 20 million bioactivity measurements), and PubChem (111 million unique molecules). Pre-training set sizes ranged from 900K (ST on ChEMBL) to 1.1B molecules (MolFormer on ZINC + PubChem).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Database</th>
          <th>Size</th>
          <th>Language</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>ChEMBL</td>
          <td>900K</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>ChEMBL (<a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>)</td>
          <td>1.6M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>PubChem</td>
          <td>100K-10M</td>
          <td>SMILES, SELFIES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>PubChem</td>
          <td>5M-77M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>ZINC</td>
          <td>2M</td>
          <td>List of atoms</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>ZINC + PubChem</td>
          <td>1.1B</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>Chen et al.</td>
          <td>C, CP, CPZ</td>
          <td>2M-775M</td>
          <td>SMILES</td>
      </tr>
  </tbody>
</table>
<p>A key finding is that larger pre-training datasets do not consistently improve downstream performance. MolFormer showed minimal difference between models trained on 100M vs. 1.1B molecules. ChemBERTa-2 found that the model trained on 5M molecules using MLM performed comparably to 77M molecules for BBBP (both around 0.70 ROC-AUC). Chen et al. reported comparable $R^{2}$ values of $0.925 \pm 0.01$, $0.917 \pm 0.012$, and $0.915 \pm 0.01$ for ESOL across datasets of 2M, 103M, and 775M molecules, respectively. The data composition and covered chemical space appear to matter more than raw size.</p>
<h3 id="question-2-which-chemical-language">Question 2: Which Chemical Language?</h3>
<p>Most models use SMILES. ChemBERTa, RT, and SELFormer also explored SELFIES. MAT uses a simple list of atoms with structural features, while Mol-BERT and FP-BERT use circular fingerprints.</p>
<p>Direct comparisons between SMILES and SELFIES (by ChemBERTa on Tox21 SR-p53 and RT for drug-likeness prediction) found no significant performance difference. The RT authors reported that SELFIES models performed approximately $0.004 \pm 0.01$ better on RMSE, while SMILES models performed approximately $0.004 \pm 0.01$ better on Pearson correlation. The choice of chemical language does not appear to be a major factor in prediction performance, and even non-string representations (atom lists in MAT, fingerprints in Mol-BERT) perform competitively.</p>
<h3 id="question-3-how-to-tokenize">Question 3: How to Tokenize?</h3>
<p>Tokenization methods span atom-level (42-66 vocabulary tokens), regex-based (47-2,362 tokens), BPE (509-52K tokens), and substructure-based (3,357-13,325 tokens) approaches. No systematic comparison of tokenization strategies exists in the literature. The vocabulary size varied dramatically, from 42 tokens for MolBERT to over 52K for ChemBERTa. The authors argue that chemically meaningful tokenization (e.g., functional group-based fragmentation) could improve both performance and explainability.</p>
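<p>For illustration, atom-level tokenization is typically implemented with a regular expression along these lines (a sketch; each reviewed model defines its own vocabulary and special tokens):</p>

```python
import re

# A widely used atom-level SMILES pattern: bracket atoms are kept whole,
# two-letter elements (Br, Cl) take priority over single letters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenizer must cover the full string"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: 21 tokens
```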
<h3 id="question-4-how-to-add-positional-embeddings">Question 4: How to Add Positional Embeddings?</h3>
<p>Most models inherited the absolute positional embedding from their NLP base models. MolBERT and RT adopted relative positional embeddings. MolFormer combined absolute and Rotary Positional Embedding (RoPE). MAT incorporated spatial information (inter-atomic 3D distances and adjacency) alongside self-attention.</p>
<p>MolFormer&rsquo;s comparison showed that RoPE became superior to absolute embeddings only when the pre-training dataset was very large. The performance difference (MAE on QM9) between absolute and RoPE embeddings for models trained on 111K, 111M, and 1.1B molecules was approximately $-0.20 \pm 0.18$, $-0.44 \pm 0.22$, and $0.27 \pm 0.12$, respectively.</p>
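<p>To make the RoPE comparison concrete: rotary embeddings rotate pairs of feature dimensions by position-dependent angles, so that dot products between tokens depend on their relative positions. A minimal NumPy sketch (not MolFormer&rsquo;s implementation):</p>

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embedding: rotate each (x1_i, x2_i) feature pair
    of token t by angle t * base**(-i/half). Assumes even feature dim."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(1).standard_normal((5, 8))
rotated = rope(x)
# Rotations preserve per-token norms.
print(np.allclose(np.linalg.norm(rotated, axis=1), np.linalg.norm(x, axis=1)))
```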
<p>The authors highlight that SMILES and SELFIES are linearizations of a 2D molecular graph, so consecutive tokens in a sequence are not necessarily spatially close. Positional embeddings that reflect 2D or 3D molecular structure remain underexplored.</p>
<h3 id="question-5-how-many-parameters">Question 5: How Many Parameters?</h3>
<p>Model sizes range from approximately 7M (ST, Mol-BERT) to over 100M parameters (MAT). Most chemical language models operate with 100M parameters or fewer, much smaller than NLP models like BERT (110M-330M) or GPT-3 (175B).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Dimensions</th>
          <th>Heads</th>
          <th>Layers</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>256</td>
          <td>4</td>
          <td>4</td>
          <td>7M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>768</td>
          <td>12</td>
          <td>12</td>
          <td>85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>768</td>
          <td>12</td>
          <td>6, 12</td>
          <td>43M, 85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
          <td>768</td>
          <td>12, 4</td>
          <td>8, 12</td>
          <td>57M, 85M</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>1024</td>
          <td>16</td>
          <td>8</td>
          <td>101M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>768</td>
          <td>12</td>
          <td>6</td>
          <td>43M</td>
      </tr>
  </tbody>
</table>
<p>SELFormer and MolFormer both tested different model sizes. SELFormer&rsquo;s larger model (approximately 86M parameters) showed approximately 0.034 better ROC-AUC for BBBP compared to the smaller model. MolFormer&rsquo;s larger model (approximately 87M parameters) performed approximately 0.04 better ROC-AUC on average for BBBP, HIV, BACE, and SIDER. The field lacks the systematic scaling analyses (analogous to Kaplan et al. and Hoffmann et al. in NLP) needed to establish proper scaling laws for chemical language models.</p>
<h3 id="question-6-which-pre-training-objectives">Question 6: Which Pre-training Objectives?</h3>
<p>Pre-training objectives fall into domain-agnostic and domain-specific categories:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Pre-training Objective</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>MLM</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a></td>
          <td>MLM</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>MLM, PhysChemPred, SMILES-EQ</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td>K-BERT</td>
          <td>Atom feature, MACCS prediction, CL</td>
          <td>Update last layer</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>MLM, MTR</td>
          <td>Update</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>MLM, 2D Adjacency, 3D Distance</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a></td>
          <td>Denoising Span MLM, Augmentation</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">RT</a></td>
          <td>PLM (Permutation Language Modeling)</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Domain-specific objectives (predicting physico-chemical properties, atom features, or MACCS keys) showed promising but inconsistent results. MolBERT&rsquo;s PhysChemPred alone performed on par with the full three-objective model (approximately $0.72 \pm 0.06$ vs. $0.71 \pm 0.06$ ROC-AUC in virtual screening). The SMILES-EQ objective (identifying equivalent SMILES) was found to lower performance when combined with other objectives. K-BERT&rsquo;s contrastive learning objective did not significantly change performance (average ROC-AUC of 0.806 vs. 0.807 with and without CL).</p>
<p>ChemBERTa-2&rsquo;s Multi-Task Regression (MTR) objective performed noticeably better than MLM-only for almost all four classification tasks across pre-training dataset sizes.</p>
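<p>The MLM objective shared by most of these models corrupts inputs following BERT&rsquo;s standard scheme: roughly 15% of positions are selected for prediction; of those, 80% become a mask token, 10% a random token, and 10% stay unchanged. A stdlib-only sketch (not any reviewed model&rsquo;s implementation):</p>

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", p=0.15, seed=0):
    """BERT-style MLM corruption. Labels record the original token at
    selected positions; None elsewhere (ignored by the loss)."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token       # 80%: mask
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: leave unchanged
    return corrupted, labels

tokens = list("CC(=O)Oc1ccccc1")
corrupted, labels = mlm_mask(tokens, vocab=list("CNOcno()=1"))
print(sum(l is not None for l in labels), "positions selected for prediction")
```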
<h3 id="question-7-how-to-fine-tune">Question 7: How to Fine-tune?</h3>
<p>Fine-tuning through weight updates generally outperforms frozen representations. SELFormer showed this most dramatically, with a difference of 2.187 RMSE between frozen and updated models on FreeSolv. MolBERT showed a much smaller difference (0.575 RMSE on FreeSolv), likely because its domain-specific pre-training objectives already produced representations closer to the downstream tasks.</p>
<h2 id="benchmarking-challenges-and-performance-comparison">Benchmarking Challenges and Performance Comparison</h2>
<h3 id="downstream-datasets">Downstream Datasets</h3>
<p>The review focuses on nine benchmark datasets across three categories from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Application</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>LogD at pH 7.4</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,050</td>
          <td>1 classification</td>
          <td>Physiology</td>
          <td>Blood-brain barrier</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>1,484</td>
          <td>2 classification</td>
          <td>Physiology</td>
          <td>Clinical trial toxicity</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>1,427</td>
          <td>27 classification</td>
          <td>Physiology</td>
          <td>Drug side effects</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>7,831</td>
          <td>12 classification</td>
          <td>Physiology</td>
          <td>Nuclear receptor/stress pathways</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>1,513</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Beta-secretase 1 binding</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Anti-HIV activity</td>
      </tr>
  </tbody>
</table>
<h3 id="inconsistencies-in-evaluation">Inconsistencies in Evaluation</h3>
<p>The authors document substantial inconsistencies that prevent fair model comparison:</p>
<ol>
<li><strong>Data splitting</strong>: Models used different splitting methods (scaffold vs. random) and different implementations even when using the same method. Not all models adhered to scaffold splitting for classification tasks as recommended.</li>
<li><strong>Different test sets</strong>: Even models using the same split type may not evaluate on identical test molecules due to different random seeds.</li>
<li><strong>Varying repetitions</strong>: Repetitions ranged from 3 (RT) to 50 (Chen et al.), making some analyses more statistically robust than others.</li>
<li><strong>Metric inconsistency</strong>: Most use ROC-AUC for classification and RMSE for regression, but some models report only averages without standard deviations, while others report standard errors.</li>
</ol>
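<p>The second point is easy to demonstrate: two random splits of the same dataset that differ only in seed share only a small fraction of test molecules (a stdlib sketch over indices; scaffold splits instead group molecules by their Bemis-Murcko scaffolds):</p>

```python
import random

def random_split(n, test_fraction=0.2, seed=0):
    """Index-level 80/20 random split with a fixed seed."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_fraction))
    return idx[:cut], idx[cut:]

# Same split type, different seeds: largely disjoint test sets, so
# reported metrics are not directly comparable across papers.
_, test_a = random_split(1128, seed=0)  # ESOL-sized dataset
_, test_b = random_split(1128, seed=1)
overlap = len(set(test_a) & set(test_b)) / len(test_a)
print(f"test-set overlap: {overlap:.0%}")
```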
<h3 id="performance-findings">Performance Findings</h3>
<p>When comparing only models evaluated on the same test sets (Figure 2 in the paper), the authors observe that transformer models show comparable, but not consistently superior, performance to existing ML and DL models. The performance varies considerably across models and datasets.</p>
<p>For BBBP, the Mol-BERT model reported lower ROC-AUC than its corresponding MPNN (approximately 0.88 vs. 0.91), while MolBERT outperformed its corresponding CDDD model (approximately 0.86 vs. 0.76 ROC-AUC) and its SVM baseline (approximately 0.86 vs. 0.70 ROC-AUC). A similar mixed pattern appeared for HIV: ChemBERTa performed worse than its corresponding ML models, while MolBERT performed better than its ML (approximately 0.08 higher ROC-AUC) and DL (approximately 0.03 higher ROC-AUC) baselines. For SIDER, Mol-BERT performed approximately 0.1 better ROC-AUC than its corresponding MPNN. For regression, MAT and MolBERT showed improved performance over their ML and DL baselines on ESOL, FreeSolv, and Lipophilicity. For example, MAT performed approximately 0.2 lower RMSE than an SVM model and approximately 0.03 lower RMSE than the Weave model on ESOL.</p>
<h2 id="key-takeaways-and-future-directions">Key Takeaways and Future Directions</h2>
<p>The review concludes with six main takeaways:</p>
<ol>
<li><strong>Performance</strong>: Transformers using SMILES show comparable but not consistently superior performance to existing ML and DL models for MPP.</li>
<li><strong>Scaling</strong>: No systematic analysis of model parameter scaling relative to data size exists for chemical language models. Such analysis is essential.</li>
<li><strong>Pre-training data</strong>: Dataset size alone is not the sole determinant of downstream performance. Composition and chemical space coverage matter.</li>
<li><strong>Chemical language</strong>: SMILES and SELFIES perform similarly. Alternative representations (atom lists, fingerprints) also work when the architecture is adjusted.</li>
<li><strong>Domain knowledge</strong>: Domain-specific pre-training objectives show promise, but tokenization and positional encoding remain underexplored.</li>
<li><strong>Benchmarking</strong>: The community needs standardized data splitting, fixed test sets, statistical analysis, and consistent reporting to enable meaningful comparison.</li>
</ol>
<p>The authors also highlight the need for attention visualization and explainability analysis, investigation of NLP-originated techniques (pre-training regimes, fine-tuning strategies like LoRA, explainability methods), and adaptation of these techniques to the specific characteristics of chemical data (smaller vocabularies, shorter sequences).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper. No new data or models are introduced. All analyses use previously reported results from the 16 reviewed papers, with additional visualization and comparison. The authors provide a GitHub repository with the code and data used to generate their comparative figures.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Not applicable (review paper). The paper describes training strategies at a conceptual level, referencing the original publications for implementation details.</p>
<h3 id="models">Models</h3>
<p>Not applicable (review paper). The paper catalogs 16 models with their architecture details, parameter counts, and training configurations across Tables 1, 4, 5, 6, and 7.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The paper compiles performance across nine MoleculeNet datasets. Key comparison figures (Figures 2 and 7) restrict to models evaluated on the same test sets for fair comparison, using ROC-AUC for classification and RMSE for regression.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/Transformers4MPP_review">Transformers4MPP_review</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Figure generation code and compiled data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sultan, A., Sieg, J., Mathea, M., &amp; Volkamer, A. (2024). Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years. <em>Journal of Chemical Information and Modeling</em>, 64(16), 6259-6280. <a href="https://doi.org/10.1021/acs.jcim.4c00747">https://doi.org/10.1021/acs.jcim.4c00747</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sultan2024transformers,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sultan, Afnan and Sieg, Jochen and Mathea, Miriam and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6259--6280}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00747}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer-CNN: SMILES Embeddings for QSAR Modeling</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</guid><description>Transformer-CNN uses SMILES embeddings from a canonicalization Transformer with a CNN head for interpretable QSAR property prediction.</description><content:encoded><![CDATA[<h2 id="transformer-based-smiles-embeddings-for-property-prediction">Transformer-Based SMILES Embeddings for Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces Transformer-CNN, a two-stage architecture for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> (Quantitative Structure-Activity Relationship) modeling. The primary contribution is a transfer learning approach: a Transformer model is first trained on the task of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> canonicalization (mapping non-canonical SMILES to canonical forms), and the encoder&rsquo;s internal representations are then used as &ldquo;dynamic SMILES embeddings&rdquo; for downstream property prediction via a convolutional neural network (TextCNN). The authors also contribute an interpretability framework based on Layer-wise Relevance Propagation (LRP) that traces predictions back to individual atom contributions.</p>
<h2 id="from-descriptors-to-learned-embeddings-in-qsar">From Descriptors to Learned Embeddings in QSAR</h2>
<p>Traditional QSAR methods rely on hand-engineered molecular descriptors (fragment counts, physicochemical features) coupled with feature selection and classical ML algorithms. While deep learning approaches that operate on raw SMILES strings or molecular graphs have reduced the need for manual feature engineering, they typically require large training datasets to learn effective representations from scratch. QSAR datasets, in contrast, often contain only hundreds of molecules, making it difficult to train end-to-end deep models.</p>
<p>The authors identify two specific gaps. First, existing SMILES-based autoencoders such as <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (Continuous and Data-Driven molecular Descriptors) produce fixed-length latent vectors, discarding positional information that could be useful for property prediction and interpretation. Second, QSAR models built on deep architectures generally lack interpretability, making it hard to verify that predictions rely on chemically meaningful structural features rather than spurious correlations.</p>
<h2 id="dynamic-smiles-embeddings-via-canonicalization-pre-training">Dynamic SMILES Embeddings via Canonicalization Pre-training</h2>
<p>The core insight is that training a Transformer to perform SMILES canonicalization (a Seq2Seq task mapping non-canonical SMILES to canonical SMILES) produces an encoder whose internal states serve as information-rich, position-dependent molecular embeddings.</p>
<h3 id="pre-training-on-smiles-canonicalization">Pre-training on SMILES Canonicalization</h3>
<p>The Transformer encoder-decoder is trained on approximately 17.7 million canonicalization pairs derived from the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database (SMILES with length up to 110 characters). Each molecule is augmented 10 times by generating non-canonical SMILES variants, plus one identity pair where both sides are canonical. The training uses character-level tokenization with a 66-symbol vocabulary covering drug-like molecules including stereochemistry, charges, and inorganic ions.</p>
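<p>The character-level encoding described above can be sketched in plain Python. This is illustrative only: the special tokens and the toy symbol set below are assumptions, not the paper&rsquo;s exact 66-symbol vocabulary.</p>

```python
def build_vocab(smiles_corpus):
    """Character-level vocabulary; the paper's charset has 66 symbols."""
    chars = sorted({ch for s in smiles_corpus for ch in s})
    vocab = {"[pad]": 0, "[sos]": 1, "[eos]": 2}  # special tokens are illustrative
    for ch in chars:
        vocab[ch] = len(vocab)
    return vocab

def encode(smiles, vocab, max_len=110):
    """Map a SMILES string (up to 110 characters) to padded token ids."""
    ids = [vocab["[sos]"]] + [vocab[ch] for ch in smiles] + [vocab["[eos]"]]
    return ids + [vocab["[pad]"]] * (max_len + 2 - len(ids))

vocab = build_vocab(["CCO", "c1ccccc1", "CC(=O)O"])
seq = encode("CCO", vocab)  # length 112: sos + 3 chars + eos + padding
```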
<p>The Transformer architecture follows Vaswani et al. with 3 layers and 10 self-attention heads. The learning rate schedule follows:</p>
<p>$$\lambda = \text{factor} \cdot \min(1.0,\; \text{step} / \text{warmup}) / \max(\text{step},\; \text{warmup})$$</p>
<p>where factor = 20, warmup = 16,000 steps, and $\lambda$ is clipped at a minimum of $10^{-4}$. Training runs for 10 epochs (275,907 batches per epoch) without early stopping.</p>
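<p>This schedule can be written out directly (a minimal sketch; the parameter names are mine, the constants are the paper&rsquo;s):</p>

```python
def learning_rate(step, factor=20.0, warmup=16000, floor=1e-4):
    """Noam-style schedule: linear warmup to factor/warmup, then 1/step
    decay, clipped below at floor (1e-4 in the paper)."""
    lr = factor * min(1.0, step / warmup) / max(step, warmup)
    return max(lr, floor)

peak = learning_rate(16000)   # 20 / 16000 = 1.25e-3 at the end of warmup
late = learning_rate(400000)  # 20 / 400000 = 5e-5, clipped up to 1e-4
```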
<p>On validation with 500,000 generated ChEMBL-like SMILES, the model correctly canonicalizes 83.6% of all samples. Performance drops for stereochemistry (37.2% for @-containing SMILES) and cis/trans notation (73.9%).</p>
<h3 id="from-encoder-states-to-qsar-predictions">From Encoder States to QSAR Predictions</h3>
<p>After pre-training, the encoder&rsquo;s output for a molecule with $N$ characters is a matrix of dimensions $(N, \text{EMBEDDINGS})$. Unlike fixed-length CDDD descriptors, these &ldquo;dynamic embeddings&rdquo; preserve positional information, meaning equivalent characters receive different embedding values depending on their context and position.</p>
<p>To handle variable-length embeddings, the authors use a TextCNN architecture (from DeepChem) with 1D convolutional filters at kernel sizes (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20) producing (100, 200, 200, 200, 200, 100, 100, 100, 100, 100, 160, 160) filters respectively. After GlobalMaxPool and concatenation, the features pass through Dropout (rate = 0.25), a Dense layer ($N = 512$), a Highway layer, and finally an output layer (1 neuron for regression, 2 for classification).</p>
<p>The Transformer weights are frozen during QSAR training. The Adam optimizer is used with a fixed learning rate of $10^{-4}$ and early stopping on a 10% held-out validation set. Critically, SMILES augmentation ($n = 10$) is applied during both training and inference, with the final prediction being the average over augmented SMILES for each molecule.</p>
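<p>The test-time averaging can be sketched as follows. <code>model</code> and <code>enumerate_smiles</code> are stand-ins for the frozen-encoder predictor and a SMILES randomizer (e.g. RDKit&rsquo;s <code>MolToSmiles</code> with <code>doRandom=True</code>); neither is the paper&rsquo;s exact implementation.</p>

```python
def predict_augmented(model, enumerate_smiles, smiles, n=10):
    """Average predictions over n SMILES variants of the same molecule,
    as done during both training and inference in the paper."""
    variants = [smiles] + enumerate_smiles(smiles, n - 1)
    preds = [model(s) for s in variants]
    return sum(preds) / len(preds)

# Toy usage with stand-in callables:
identity_enum = lambda s, k: [s] * k  # a real enumerator would randomize atom order
pred = predict_augmented(len, identity_enum, "CCO")  # averages 10 identical values
```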
<h3 id="interpretability-via-layer-wise-relevance-propagation">Interpretability via Layer-wise Relevance Propagation</h3>
<p>The LRP algorithm propagates relevance scores from the output back through the CNN layers to the Transformer encoder output (which is position-wise). The relevance conservation property holds:</p>
<p>$$y = R = f(x) = \sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} = \cdots = \sum_{l \in (1)} R_{l}$$</p>
<p>In practice, biases absorb some relevance, so the total propagated to the input is less than the output:</p>
<p>$$\sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} + B$$</p>
<p>For gated connections in the Highway block, the authors implement the signal-take-all redistribution rule. The interpretation algorithm generates one SMILES per non-hydrogen atom (each drawn starting from that atom), runs LRP on each, and averages contributions. If more than 50% of relevance dissipates on biases, the interpretation may be unreliable, serving as an applicability domain indicator.</p>
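<p>A minimal epsilon-stabilized LRP step for one dense layer illustrates the conservation-minus-bias behavior described above. This is a sketch, not the paper&rsquo;s implementation; the Highway signal-take-all rule is omitted.</p>

```python
import numpy as np

def lrp_dense(a, W, b, R_out, eps=1e-9):
    """Redistribute output relevance R_out onto inputs a of a dense layer
    in proportion to the contributions z_ij = a_i * W_ij. With nonzero
    bias b, part of the relevance is absorbed, so sum(R_in) < sum(R_out)."""
    z = a @ W + b                        # pre-activations of the layer
    s = R_out / (z + eps * np.sign(z))   # stabilized relevance ratio
    return a * (W @ s)                   # back-distribute along contributions

a = np.array([1.0, 2.0])
W = np.array([[1.0], [1.0]])
R_in = lrp_dense(a, W, np.zeros(1), np.array([3.0]))
# With zero bias, relevance is conserved: R_in sums to 3.0
```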
<h2 id="benchmarks-across-18-regression-and-classification-datasets">Benchmarks Across 18 Regression and Classification Datasets</h2>
<p>The authors evaluate on the same 18 datasets (9 regression, 9 classification) used in their previous SMILES augmentation study, enabling direct comparison. All experiments use five-fold cross-validation.</p>
<h3 id="regression-results-r2">Regression Results ($r^2$)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MP (19,104)</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center"><strong>0.86</strong></td>
          <td style="text-align: center">0.85</td>
      </tr>
      <tr>
          <td>BP (11,893)</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.97</td>
          <td style="text-align: center"><strong>0.98</strong></td>
          <td style="text-align: center">0.98</td>
      </tr>
      <tr>
          <td>BCF (378)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center"><strong>0.85</strong></td>
          <td style="text-align: center">0.81</td>
      </tr>
      <tr>
          <td>FreeSolv (642)</td>
          <td style="text-align: center"><strong>0.94</strong></td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
      </tr>
      <tr>
          <td>LogS (1,311)</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.91</td>
      </tr>
      <tr>
          <td>Lipo (4,200)</td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.60</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center"><strong>0.74</strong></td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.66</td>
          <td style="text-align: center"><strong>0.76</strong></td>
          <td style="text-align: center">0.75</td>
      </tr>
      <tr>
          <td>DHFR (739)</td>
          <td style="text-align: center">0.62</td>
          <td style="text-align: center">0.63</td>
          <td style="text-align: center">0.46</td>
          <td style="text-align: center"><strong>0.67</strong></td>
          <td style="text-align: center">0.61</td>
      </tr>
      <tr>
          <td>LEL (483)</td>
          <td style="text-align: center">0.19</td>
          <td style="text-align: center">0.25</td>
          <td style="text-align: center">0.20</td>
          <td style="text-align: center"><strong>0.27</strong></td>
          <td style="text-align: center">0.23</td>
      </tr>
  </tbody>
</table>
<h3 id="classification-results-auc">Classification Results (AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (41,127)</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.74</td>
      </tr>
      <tr>
          <td>AMES (6,542)</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center"><strong>0.89</strong></td>
          <td style="text-align: center">0.86</td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center"><strong>0.91</strong></td>
          <td style="text-align: center">0.90</td>
      </tr>
      <tr>
          <td>ClinTox (1,478)</td>
          <td style="text-align: center"><strong>0.77</strong></td>
          <td style="text-align: center">0.76</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center">0.77</td>
          <td style="text-align: center">0.73</td>
      </tr>
      <tr>
          <td>Tox21 (7,831)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.82</td>
      </tr>
      <tr>
          <td>BBBP (2,039)</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.89</td>
      </tr>
      <tr>
          <td>JAK3 (886)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.80</strong></td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.76</td>
      </tr>
      <tr>
          <td>BioDeg (1,737)</td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center"><strong>0.93</strong></td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.92</td>
      </tr>
      <tr>
          <td>RP AR (930)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center"><strong>0.87</strong></td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.87</td>
          <td style="text-align: center">0.86</td>
      </tr>
  </tbody>
</table>
<h3 id="key-comparisons">Key Comparisons</h3>
<p>Baselines include descriptor-based methods (the best from LibSVM, Random Forest, XGBoost, ASNN, and DNNs), direct SMILES-based models with augmentation, and CDDD descriptors analyzed by the same classical ML methods. CDDD descriptors come from the Sml2canSml autoencoder approach, which produces fixed 512-dimensional vectors.</p>
<p>Transformer-CNN with augmentation matches or exceeds all baselines on 14 of 18 datasets. The effect of augmentation is dramatic: without it, Transformer-CNN underperforms substantially (e.g., BCF drops from 0.85 to 0.71, JAK3 from 0.78 to 0.70). This confirms that the internal consensus from multiple SMILES representations is essential to the method&rsquo;s effectiveness.</p>
<p>A practical advantage over CDDD is that Transformer-CNN imposes no constraints on molecular properties (CDDD requires logP in (-5, 7), molecular weight between 12 and 600, 3-50 heavy atoms, and organic molecules only), since the Transformer was trained on the full diversity of ChEMBL.</p>
<h3 id="interpretability-case-studies">Interpretability Case Studies</h3>
<p>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> mutagenicity, the LRP analysis of 1-Bromo-4-nitrobenzene correctly identifies the nitro group and halogen as structural alerts, consistent with known mutagenicity rules. For aqueous solubility of <a href="https://en.wikipedia.org/wiki/Haloperidol">haloperidol</a>, the model assigns positive contributions to hydroxyl, carbonyl, and aliphatic nitrogen groups (which increase solubility) and negative contributions to aromatic carbons (which decrease it). Both cases align with established chemical knowledge, supporting the trustworthiness of the model.</p>
<h2 id="effective-transfer-learning-for-small-qsar-datasets">Effective Transfer Learning for Small QSAR Datasets</h2>
<p>Transformer-CNN achieves competitive or superior QSAR performance across 18 diverse benchmarks by combining three ingredients: (1) Transformer-based pre-training via SMILES canonicalization, (2) SMILES augmentation during training and inference, and (3) a lightweight CNN head. The method requires minimal hyperparameter tuning, as the Transformer weights are frozen and the CNN architecture is fixed.</p>
<p>The authors acknowledge several limitations and future directions:</p>
<ul>
<li>Stereochemistry canonicalization accuracy is low (37.2%), which could impact models for stereo-sensitive properties</li>
<li>The LRP interpretability depends on sufficient relevance propagation (at least 50% reaching the input layer)</li>
<li>The variance among augmented SMILES predictions could serve as a confidence estimate, but this is left to future work</li>
<li>Applicability domain assessment based on SMILES reconstruction quality is proposed but not fully developed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (SMILES &lt;= 110 chars)</td>
          <td>17.7M pairs</td>
          <td>10x augmentation + 1 identity pair per molecule</td>
      </tr>
      <tr>
          <td>Validation (canon.)</td>
          <td>Generated ChEMBL-like SMILES</td>
          <td>500,000</td>
          <td>From a molecular generator</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>9 regression + 9 classification</td>
          <td>378-41,127</td>
          <td>Available on OCHEM (<a href="https://ochem.eu">https://ochem.eu</a>)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer: 3 layers, 10 self-attention heads, character-level tokenization (66 symbols)</li>
<li>TextCNN: 12 kernel sizes (1-10, 15, 20) with 100-200 filters each, GlobalMaxPool, Dense(512), Highway, Dropout(0.25)</li>
<li>Augmentation: n=10 non-canonical SMILES per molecule during training and inference</li>
<li>LRP: signal-take-all redistribution for Highway gates, standard LRP for Dense and Conv layers</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Transformer encoder weights pre-trained on canonicalization task (frozen during QSAR training)</li>
<li>QSAR CNN trained with Adam optimizer, learning rate $10^{-4}$, early stopping</li>
<li>Pre-trained embeddings and standalone prediction models available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: coefficient of determination $r^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$</li>
<li>Classification: Area Under the ROC Curve (AUC)</li>
<li>Five-fold cross-validation with bootstrap standard errors</li>
</ul>
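<p>The regression metric listed above is the standard coefficient of determination; a plain-Python version for reference:</p>

```python
def r_squared(y_true, y_pred):
    """r^2 = 1 - SS_res / SS_tot, as used for the regression benchmarks."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```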
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P6000, Titan Xp, and Titan V GPUs (donated by NVIDIA)</li>
<li>TensorFlow v1.12.0, RDKit v2018.09.2</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bigchem/transformer-cnn">transformer-cnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Source code, pre-trained embeddings, standalone prediction models</td>
      </tr>
      <tr>
          <td><a href="https://ochem.eu">OCHEM</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online platform hosting the method, training datasets, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Karpov, P., Godin, G., &amp; Tetko, I. V. (2020). Transformer-CNN: Swiss knife for QSAR modeling and interpretation. <em>Journal of Cheminformatics</em>, 12, 17. <a href="https://doi.org/10.1186/s13321-020-00423-w">https://doi.org/10.1186/s13321-020-00423-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{karpov2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-{CNN}: Swiss knife for {QSAR} modeling and interpretation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Karpov, Pavel and Godin, Guillaume and Tetko, Igor V.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00423-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES2Vec: Interpretable Chemical Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/</guid><description>SMILES2Vec uses a Bayesian-optimized CNN-GRU architecture to predict chemical properties directly from SMILES strings with an interpretable explanation mask.</description><content:encoded><![CDATA[<h2 id="a-general-purpose-rnn-for-chemical-property-prediction-from-smiles">A General-Purpose RNN for Chemical Property Prediction from SMILES</h2>
<p>SMILES2Vec is a <strong>Method</strong> paper that introduces a deep recurrent neural network architecture for predicting chemical properties directly from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> text representations. The primary contributions are: (1) a Bayesian-optimized CNN-<a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> architecture that serves as a general-purpose predictor for diverse chemical properties (toxicity, activity, solubility, <a href="https://en.wikipedia.org/wiki/Solvation">solvation</a> energy), (2) an explanation mask mechanism that provides interpretable predictions by identifying which SMILES characters drive the network&rsquo;s decisions, and (3) evidence that representation learning from raw SMILES can match or outperform models using hand-crafted molecular descriptors.</p>
<h2 id="motivation-beyond-engineered-features-in-chemical-modeling">Motivation: Beyond Engineered Features in Chemical Modeling</h2>
<p>At the time of writing (2017), deep learning models in chemistry relied heavily on engineered <a href="https://en.wikipedia.org/wiki/Molecular_descriptor">molecular descriptors</a> and fingerprints as input features. Over 5,000 molecular descriptors had been developed since the late 1940s, and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a>/QSPR modeling remained the dominant paradigm. The authors identified two key limitations with this approach:</p>
<ol>
<li><strong>Restricted search space</strong>: Engineered features limit the neural network&rsquo;s ability to discover potentially useful representations that domain experts have not anticipated.</li>
<li><strong>Incomplete domain knowledge</strong>: For complex properties where first-principles understanding is incomplete, the lack of appropriate descriptors constrains model performance.</li>
</ol>
<p>In contrast, computer vision and NLP had shown that deep learning models trained on raw data (unaltered images, raw text) could learn powerful representations without feature engineering. The chemical SMILES notation, a text-based encoding of molecular structure that serves as the standard interchange format in cheminformatics, provided a natural analog to text data for NLP-style modeling.</p>
<p>A secondary motivation was interpretability. Most ML and DL models for chemistry operated as black boxes, which posed particular problems for regulated applications like FDA drug approval where mechanistic explanations are required.</p>
<h2 id="core-innovation-cnn-gru-architecture-with-explanation-masks">Core Innovation: CNN-GRU Architecture with Explanation Masks</h2>
<h3 id="architecture-design-via-bayesian-optimization">Architecture Design via <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a></h3>
<p>SMILES2Vec treats SMILES strings as character-level text input. The network processes one-hot encoded characters (padded to length 250, covering 99.9% of the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database) through three stages:</p>
<ol>
<li><strong>Embedding layer</strong>: Maps one-hot character vectors to a learned embedding space (size 50)</li>
<li><strong>1D convolutional layer</strong>: 192 filters with kernel size 3, stride 1</li>
<li><strong>Bidirectional GRU layers</strong>: Two layers with 224 and 384 units respectively</li>
</ol>
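<p>The input encoding feeding this pipeline (one-hot characters padded to 250 positions) can be sketched with NumPy; the charset below is a toy subset, not ChEMBL&rsquo;s full character set.</p>

```python
import numpy as np

def one_hot_smiles(smiles, charset, max_len=250):
    """One-hot encode a SMILES string, zero-padded to max_len positions
    (250 covers 99.9% of ChEMBL, per the paper)."""
    index = {ch: i for i, ch in enumerate(charset)}
    x = np.zeros((max_len, len(charset)), dtype=np.float32)
    for pos, ch in enumerate(smiles):
        x[pos, index[ch]] = 1.0
    return x

x = one_hot_smiles("CCO", list("#()+-=1CNOclns"))  # toy 14-symbol charset
```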
<p>The authors explored four architectural classes (GRU, LSTM, CNN-GRU, CNN-LSTM) using Bayesian optimization via SigOpt. Each class was evaluated over 60 trials, optimizing embedding size, convolutional filter count, and RNN layer widths. The CNN-GRU class was selected as the best compromise: CNN-LSTM performed best on classification (Tox21), while GRU-based networks excelled at regression (FreeSolv). The final architecture is summarized by the hyperparameters:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Parameter</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Embedding</td>
          <td>Size</td>
          <td>50</td>
      </tr>
      <tr>
          <td>Conv1D</td>
          <td>Filters</td>
          <td>192</td>
      </tr>
      <tr>
          <td>BiGRU Layer 1</td>
          <td>Units</td>
          <td>224</td>
      </tr>
      <tr>
          <td>BiGRU Layer 2</td>
          <td>Units</td>
          <td>384</td>
      </tr>
  </tbody>
</table>
<h3 id="explanation-mask-for-interpretability">Explanation Mask for Interpretability</h3>
<p>The explanation mask is a post-hoc interpretability mechanism. Given a trained (frozen) SMILES2Vec base model, a separate explanation network learns to produce a per-character mask over the input SMILES string. The mask is trained to preserve the base model&rsquo;s output while masking as much input as possible. The loss function for a single sample is:</p>
<p>$$
\text{Loss}_i = \left\| f(\text{SMILES}_i, \theta) - \text{Sol}(\text{SMILES}_i) \right\|_2 + 10^{-6} \left\| \text{MASK}_i \right\|_2 + 0.05 \, H(\text{MASK}_i)
$$</p>
<p>where $f(\text{SMILES}_i, \theta)$ is the base network prediction, $\text{Sol}(\text{SMILES}_i)$ is the ground truth solubility, $H$ is the entropy of the normalized mask, and $\text{MASK}_i$ is the per-character mask vector. The L2 term encourages sparsity and the entropy term penalizes uniform attention distributions.</p>
<p>The explanation network itself is a 20-layer residual network with SELU activations, ending in a 1D convolution of length 1, batch normalization, and a softplus activation. The softplus output ranges from 0 (fully masked) to infinity (amplified attention), allowing the mask to both suppress and emphasize specific SMILES characters.</p>
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="datasets">Datasets</h3>
<p>The model was evaluated on four datasets from the <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark and the ESOL solubility dataset:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Property</th>
          <th>Task</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>Toxicity</td>
          <td>Multi-task classification</td>
          <td>8,014</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Activity</td>
          <td>Single-task classification</td>
          <td>41,193</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Solvation energy</td>
          <td>Single-task regression</td>
          <td>643</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Solubility</td>
          <td>Single-task regression</td>
          <td>1,128</td>
      </tr>
  </tbody>
</table>
<p>SMILES strings longer than 250 characters were excluded. Classification datasets (Tox21, HIV) used a 1/6 test split with minority-class oversampling; regression datasets (FreeSolv, ESOL) used a 1/10 test split. All experiments used 5-fold cross-validation.</p>
<h3 id="training-protocol">Training Protocol</h3>
<ul>
<li><strong>Optimizer</strong>: RMSprop with learning rate $10^{-3}$, $\rho = 0.9$, $\epsilon = 10^{-8}$</li>
<li><strong>Batch size</strong>: 32</li>
<li><strong>Epochs</strong>: 250 with early stopping (patience of 25 epochs based on validation loss)</li>
<li><strong>Classification loss</strong>: Binary cross-entropy</li>
<li><strong>Regression loss</strong>: Mean absolute error</li>
<li><strong>Metrics</strong>: AUC for classification, RMSE for regression</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>SMILES2Vec was compared against:</p>
<ul>
<li><strong>MLP with engineered features</strong>: Standard multi-layer perceptron using molecular fingerprints (from MoleculeNet)</li>
<li><strong>Molecular graph convolutions</strong>: Graph-based neural network from MoleculeNet</li>
<li><strong>Chemception</strong>: CNN operating on 2D chemical images</li>
</ul>
<h3 id="bayesian-optimization-protocol">Bayesian Optimization Protocol</h3>
<p>Only two datasets were used for architecture optimization: the nr-ahr toxicity task from Tox21 (classification) and FreeSolv (regression). The remaining datasets (full Tox21, HIV, ESOL) served purely for generalization evaluation. A fixed test set was held out during optimization, and correlation between validation and test metrics (0.54 for Tox21, 0.78 for FreeSolv) confirmed limited overfitting to the validation set.</p>
<h2 id="results-competitive-accuracy-with-interpretable-predictions">Results: Competitive Accuracy with Interpretable Predictions</h2>
<h3 id="property-prediction-performance">Property Prediction Performance</h3>
<p>SMILES2Vec achieved the following validation metrics (with a pre-training approach from ChemNet improving performance slightly):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SMILES2Vec</th>
          <th>SMILES2Vec + Pre-training</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>AUC</td>
          <td>0.80</td>
          <td>0.81</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>AUC</td>
          <td>0.78</td>
          <td>0.80</td>
          <td>0.80</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE (kcal/mol)</td>
          <td>1.4</td>
          <td>1.2</td>
          <td>1.3</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.63</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Exact numbers for the MLP and Chemception baselines were reported only as a bar chart (Figure 6). The paper states that MLP with fingerprints performed worst across all tasks, and that Chemception fell between the MLP and the graph/SMILES methods.</p>
<p>Key findings:</p>
<ul>
<li>SMILES2Vec outperformed MLP models using engineered features across all tasks, despite using no feature engineering.</li>
<li>Against graph convolutions (the state-of-the-art at the time), SMILES2Vec matched on classification (Tox21: 0.81 vs 0.81, HIV: 0.80 vs 0.80) and outperformed on regression (FreeSolv: 1.2 vs 1.3).</li>
<li>SMILES2Vec outperformed Chemception (2D image CNN) on classification tasks but slightly underperformed on regression, which the authors attributed to SMILES lacking explicit atomic number information.</li>
</ul>
<h3 id="interpretability-evaluation">Interpretability Evaluation</h3>
<p>On the ESOL solubility dataset, the explanation mask was evaluated against first-principles chemical knowledge. The authors separated compounds into soluble (log solubility &gt; 1.0) and insoluble (log solubility &lt; -5.0) categories and defined ground truth: soluble compounds should attend to hydrophilic atoms (O, N) while insoluble compounds should attend to hydrophobic atoms (C, F, Cl, Br, I). The top-3 character accuracy was 88%, confirming that SMILES2Vec learned representations consistent with known functional group chemistry.</p>
<p>Qualitative analysis of the masks showed that for low-solubility molecules, characters corresponding to hydrophobic groups (c, C, Cl) received high attention, while high-solubility molecules showed attention focused on hydrophilic groups (O, N).</p>
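<p>The top-3 scoring itself is simple to state in code (a sketch of the evaluation logic as described; single-character tokens only, so two-character atoms like Cl are ignored here):</p>

```python
HYDROPHILIC = {"O", "N"}            # expected focus for soluble compounds
HYDROPHOBIC = {"C", "c", "F", "I"}  # expected focus for insoluble compounds

def top3_hit(smiles, attention, ground_truth):
    """True if any of the 3 highest-attention characters is in the ground-truth set."""
    ranked = sorted(range(len(smiles)), key=lambda i: attention[i], reverse=True)
    return any(smiles[i] in ground_truth for i in ranked[:3])
```

<p>Averaging <code>top3_hit</code> over all held-out molecules in a category gives the reported top-3 accuracy.</p>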
<h3 id="limitations">Limitations</h3>
<ul>
<li>The interpretability evaluation was limited to solubility, a well-understood property with simple first-principles rules. The authors acknowledged that quantifying interpretability for complex properties (toxicity, activity) where no simple ground truth exists is nontrivial.</li>
<li>The Bayesian optimization used only a subset of datasets, so the architecture may not be globally optimal across all chemical tasks.</li>
<li>SMILES strings lack explicit atomic number information, which may limit performance on physical property prediction compared to image or graph representations.</li>
<li>The explanation mask approach requires training a separate 20-layer network per property, adding computational overhead.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Architecture optimization</td>
          <td>Tox21 (nr-ahr task)</td>
          <td>8,014</td>
          <td>Single toxicity task for Bayesian optimization</td>
      </tr>
      <tr>
          <td>Architecture optimization</td>
          <td>FreeSolv</td>
          <td>643</td>
          <td>Solvation free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21 (full, 12 tasks)</td>
          <td>8,014</td>
          <td>Multi-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,193</td>
          <td>Single-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Solubility regression, also used for interpretability</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet. The ESOL dataset is from Delaney (2004).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Bayesian optimization via SigOpt (60 trials per architectural class, 4 classes, 6 manually seeded initial designs per class)</li>
<li>RMSprop optimizer with standard settings</li>
<li>Explanation mask trained with Adam, learning rate annealed from $10^{-2}$ to $10^{-6}$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Final architecture: Embedding(50) -&gt; Conv1D(192, kernel=3, stride=1) -&gt; BiGRU(224) -&gt; BiGRU(384)</li>
<li>Explanation network: 20-layer residual network with SELU activations</li>
<li>No pre-trained weights or code were released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC</td>
          <td>Tox21</td>
          <td>0.81</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>HIV</td>
          <td>0.80</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>FreeSolv</td>
          <td>1.2 kcal/mol</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>ESOL</td>
          <td>0.63</td>
          <td>Base model</td>
      </tr>
      <tr>
          <td>Top-3 accuracy</td>
          <td>ESOL interpretability</td>
          <td>88%</td>
          <td>Explanation mask</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The authors report using TensorFlow with GPU acceleration via NVIDIA cuDNN libraries. Specific GPU models and training times were not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No code, models, or data artifacts were released by the authors. The datasets used are publicly available through MoleculeNet.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Goh, G. B., Hodas, N. O., Siegel, C., &amp; Vishnu, A. (2017). SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties. <em>arXiv preprint arXiv:1712.02034</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{goh2017smiles2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Goh, Garrett B. and Hodas, Nathan O. and Siegel, Charles and Vishnu, Abhinav}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.02034}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1712.02034}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolPMoFiT: Inductive Transfer Learning for QSAR</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/</guid><description>MolPMoFiT adapts ULMFiT for QSAR by pre-training an LSTM language model on 1M ChEMBL SMILES and fine-tuning on small molecular property datasets.</description><content:encoded><![CDATA[<h2 id="transfer-learning-meets-molecular-property-prediction">Transfer Learning Meets Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces MolPMoFiT (Molecular Prediction Model Fine-Tuning), a transfer learning approach for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSPR/QSAR</a> modeling. The primary contribution is adapting the ULMFiT framework from NLP to molecular property prediction by treating <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> as a chemical language. A general-purpose molecular structure prediction model (MSPM) is pre-trained on one million <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> molecules via self-supervised next-token prediction, then fine-tuned for specific QSAR endpoints. The approach achieves competitive or superior results to graph neural networks and descriptor-based methods across four benchmark datasets, with particular benefits for small datasets.</p>
<h2 id="the-small-data-problem-in-qsar-modeling">The Small Data Problem in QSAR Modeling</h2>
<p>Deep learning models for molecular property prediction typically require large labeled training sets to learn useful structural representations. While methods like graph convolutional neural networks and SMILES-based models have achieved strong results on well-studied endpoints, they must be trained from scratch for each new task. This is a problem in drug discovery, where small labeled datasets remain common for specialized endpoints like <a href="https://en.wikipedia.org/wiki/Allosteric_regulation">allosteric inhibition</a>, renal clearance, and inhibitor residence times.</p>
<p>Transfer learning had already shown transformative impact in computer vision (ImageNet pre-training) and NLP (ELMo, BERT, ULMFiT). In chemistry, prior transfer learning efforts included ChemNet (supervised pre-training on computed descriptors), <a href="/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/">Mol2vec</a> (unsupervised substructure embeddings), and pre-trained graph neural networks. However, a systematic application of the ULMFiT self-supervised pre-training pipeline to SMILES-based molecular models had not been explored. MolPMoFiT fills this gap by treating the vast corpus of unlabeled molecular structures as a self-supervised training signal, analogous to how language models learn from unlabeled text.</p>
<h2 id="core-innovation-ulmfit-adapted-for-smiles">Core Innovation: ULMFiT Adapted for SMILES</h2>
<p>MolPMoFiT adapts ULMFiT&rsquo;s three-stage transfer learning pipeline to molecular property prediction:</p>
<p><strong>Stage 1: General-Domain MSPM Pre-training.</strong> A molecular structure prediction model is trained on one million curated ChEMBL molecules to predict the next token in a SMILES string. This is purely self-supervised: the SMILES string provides its own labels. The model learns general chemical syntax and structural patterns.</p>
<p><strong>Stage 2: Task-Specific MSPM Fine-tuning (Optional).</strong> The general MSPM is further fine-tuned on the unlabeled SMILES of the target task dataset. This adapts the language model to the specific chemical distribution of interest (e.g., HIV inhibitors vs. general bioactive molecules). Discriminative fine-tuning adjusts learning rates per layer:</p>
<p>$$\eta^{layer-1} = \eta^{layer} / 2.6$$</p>
<p>where higher layers (containing more task-specific features) receive higher learning rates.</p>
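<p>In code, this geometric decay gives each layer group its own rate (a sketch; 2.6 is the ULMFiT factor, and the toy top-layer rate matches the paper&rsquo;s classifier-head rate):</p>

```python
def discriminative_lrs(top_lr, n_groups, factor=2.6):
    """Learning rate per layer group, decaying by `factor` from the top down."""
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = discriminative_lrs(3e-2, 4)  # lowest group trains ~17x slower than the head
```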
<p><strong>Stage 3: QSAR/QSPR Model Fine-tuning.</strong> The embedding and encoder weights from the pre-trained MSPM are transferred to a new model with a task-specific classifier head. Fine-tuning uses three key techniques from ULMFiT:</p>
<ul>
<li><strong>Discriminative fine-tuning</strong>: Different learning rates per layer group</li>
<li><strong>Gradual unfreezing</strong>: Layers are unfrozen sequentially (classifier first, then progressively deeper LSTM layers)</li>
<li><strong>One cycle policy</strong>: Learning rate scheduling following Smith&rsquo;s approach</li>
</ul>
<p>The model architecture is AWD-LSTM (ASGD Weight-Dropped LSTM) with an embedding dimension of 400, three LSTM layers with 1152 hidden units, and dropouts applied at every layer (embedding, input, weights, hidden). The QSAR classifier concatenates max pooling, mean pooling, and the last hidden state $h_T$ from the final LSTM layer, feeding this into two feedforward layers.</p>
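<p>The classifier input described above (last hidden state plus max and mean pooling) can be sketched in plain Python:</p>

```python
def concat_pooling(hidden_states):
    """Concatenate [h_T ; max-pool ; mean-pool] over a sequence of hidden vectors."""
    dim = len(hidden_states[0])
    max_pool = [max(h[d] for h in hidden_states) for d in range(dim)]
    mean_pool = [sum(h[d] for h in hidden_states) / len(hidden_states) for d in range(dim)]
    return hidden_states[-1] + max_pool + mean_pool

# Three timesteps of a 2-dimensional hidden state
feats = concat_pooling([[1.0, -1.0], [3.0, 0.0], [2.0, 2.0]])
```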
<p><strong>SMILES Augmentation.</strong> Since multiple valid SMILES can represent the same molecule through different atom orderings, the authors use <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> as data augmentation. For regression tasks, Gaussian noise ($\sigma_{noise}$) is added to labels of augmented SMILES to simulate experimental error. Test-time augmentation (TTA) averages predictions across the canonical SMILES and four randomized SMILES.</p>
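<p>The regression-side augmentation amounts to replicating each label once per enumerated SMILES and perturbing the copies (a minimal sketch; the function name and seed handling are illustrative):</p>

```python
import random

def augment_labels(y, n_copies, sigma_noise, seed=0):
    """One label per enumerated SMILES; copies get Gaussian noise to mimic
    experimental error, while the canonical SMILES keeps the exact label."""
    rng = random.Random(seed)
    return [y] + [y + rng.gauss(0.0, sigma_noise) for _ in range(n_copies - 1)]

# FreeSolv-style settings: 50x augmentation with sigma_noise = 0.5
labels = augment_labels(-3.2, 50, 0.5)
```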
<h2 id="benchmarks-across-four-qsar-datasets">Benchmarks Across Four QSAR Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,200</td>
          <td>Regression (logD)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Solvation">solvation energy</a>)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>Classification (replication inhibition)</td>
          <td>AUROC</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Classification (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>)</td>
          <td>AUROC</td>
      </tr>
  </tbody>
</table>
<p>All datasets use the same 10 random 80:10:10 splits from <a href="/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/">Yang et al. (2019)</a> for fair comparison. Both random and scaffold splits were evaluated, with scaffold splits representing a more realistic test of generalization to novel chemical scaffolds.</p>
<h3 id="baselines">Baselines</h3>
<p>Models were compared against results reported by Yang et al. (2019): directed message passing neural network (D-MPNN), D-MPNN with RDKit features, random forest on Morgan fingerprints, feed-forward networks on Morgan fingerprints, and feed-forward networks on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> descriptors.</p>
<h3 id="hyperparameters">Hyperparameters</h3>
<p>The same set of fine-tuning hyperparameters was used across all four tasks (tuned on the HIV dataset):</p>
<table>
  <thead>
      <tr>
          <th>Layer Group</th>
          <th>Base Learning Rate</th>
          <th>Epochs</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Linear head only</td>
          <td>3e-2</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final LSTM layer</td>
          <td>5e-3</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final two LSTM layers</td>
          <td>5e-4</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Full model</td>
          <td>5e-5</td>
          <td>6</td>
      </tr>
  </tbody>
</table>
<p>Data augmentation settings were task-specific: lipophilicity training SMILES augmented 25x ($\sigma_{noise} = 0.3$); FreeSolv augmented 50x ($\sigma_{noise} = 0.5$); HIV active class augmented 60x and inactive 2x; BBBP positive class 10x and negative 30x.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="benchmark-results">Benchmark Results</h3>
<p><strong>Lipophilicity (random split):</strong> MolPMoFiT achieved RMSE of $0.565 \pm 0.037$ with TTA and $0.625 \pm 0.032$ without, outperforming D-MPNN and other baselines.</p>
<p><strong>FreeSolv (random split):</strong> RMSE of $1.197 \pm 0.127$ with TTA. The small dataset size (642 compounds) led to high variance across splits.</p>
<p><strong>BBBP (random split):</strong> AUROC of $0.950 \pm 0.020$, outperforming all comparison models. Task-specific MSPM fine-tuning showed no clear benefit over the general MSPM.</p>
<p><strong>HIV (random split):</strong> General MolPMoFiT achieved AUROC of $0.828 \pm 0.029$ with TTA. Task-specific fine-tuning yielded a slightly higher $0.834 \pm 0.025$ with TTA.</p>
<p>Scaffold splits consistently produced lower performance than random splits across all datasets, as expected for out-of-distribution generalization.</p>
<h3 id="transfer-learning-impact">Transfer Learning Impact</h3>
<p>Across all four datasets and varying training set sizes, MolPMoFiT consistently outperformed models trained from scratch with the same architecture. The improvement was most pronounced at smaller training set sizes, confirming the utility of pre-trained representations for low-data regimes.</p>
<h3 id="smiles-augmentation-analysis">SMILES Augmentation Analysis</h3>
<p>Training data augmentation provided significant improvements across all tasks. For classification (HIV, BBBP), augmentation improved performance regardless of whether class re-balancing was applied. For regression (lipophilicity, FreeSolv), both SMILES augmentation and label noise were beneficial, with optimal noise levels varying by dataset.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note a fundamental limitation: the model learns mappings from individual SMILES strings to properties rather than from molecular structures to properties. SMILES augmentation acts as a regularization technique to mitigate this, making the model more robust to different SMILES representations of the same molecule. The task-specific MSPM fine-tuning stage did not consistently improve results, requiring further investigation. All hyperparameters were tuned on one dataset (HIV) and applied uniformly, which may not be optimal for all endpoints.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (curated)</td>
          <td>1M molecules</td>
          <td>Filtered: no mixtures, max 50 heavy atoms, standardized with MolVS, canonized with RDKit</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>MoleculeNet benchmark</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>AWD-LSTM architecture with embedding dim 400, three LSTM layers (1152 hidden units), dropouts at all layers</li>
<li>ULMFiT fine-tuning: discriminative learning rates ($\eta^{layer-1} = \eta^{layer}/2.6$), gradual unfreezing, one cycle policy</li>
<li>SMILES character-level tokenization with special handling for two-character tokens (Cl, Br) and bracket-enclosed tokens</li>
<li>SMILES enumeration for data augmentation with optional Gaussian label noise for regression</li>
</ul>
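<p>The tokenization scheme above can be reproduced with a single regular expression (a sketch consistent with the description, not the authors&rsquo; exact code):</p>

```python
import re

# Bracket atoms ([nH], [O-], ...) and the two-character atoms Cl/Br are kept
# as single tokens; everything else is split per character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|.")

def tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)

print(tokenize("c1ccc(Cl)cc1"))  # ['c', '1', 'c', 'c', 'c', '(', 'Cl', ')', 'c', 'c', '1']
```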
<h3 id="models">Models</h3>
<ul>
<li>General-domain MSPM pre-trained on 1M ChEMBL molecules (10 epochs)</li>
<li>Task-specific MSPMs fine-tuned per dataset (optional stage)</li>
<li>QSAR models fine-tuned with transferred embeddings and encoder</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Split</th>
          <th>Metric</th>
          <th>MolPMoFiT (TTA)</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lipophilicity</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$0.565 \pm 0.037$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$0.635 \pm 0.031$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$1.197 \pm 0.127$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$2.082 \pm 0.460$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.950 \pm 0.020$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.931 \pm 0.025$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.828 \pm 0.029$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.816 \pm 0.022$</td>
          <td>D-MPNN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P4000 GPU (single GPU)</li>
<li>General-domain MSPM pre-training: approximately 1 day</li>
<li>Pre-training needs to be done only once; fine-tuning is fast per task</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>PyTorch + fastai v1 implementation with curated datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2020). Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. <em>Journal of Cheminformatics</em>, 12, 27. <a href="https://doi.org/10.1186/s13321-020-00430-x">https://doi.org/10.1186/s13321-020-00430-x</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2020molpmofit,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00430-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLM-Prop: Predicting Crystal Properties from Text</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</guid><description>LLM-Prop fine-tunes the T5 encoder on crystal text descriptions to predict band gap, volume, and other properties, outperforming GNN baselines.</description><content:encoded><![CDATA[<h2 id="text-based-crystal-property-prediction-with-llms">Text-Based Crystal Property Prediction with LLMs</h2>
<p>LLM-Prop is a <strong>Method</strong> paper that proposes using the encoder portion of <a href="https://en.wikipedia.org/wiki/T5_(language_model)">T5</a> (a general-purpose language model) fine-tuned on crystal text descriptions to predict physical and electronic properties of crystalline materials. The primary contribution is demonstrating that text-based representations of crystals, generated by Robocrystallographer, can serve as effective inputs for <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, outperforming graph neural network (GNN) baselines on several tasks despite using a non-domain-specific pre-trained model with fewer parameters.</p>
<h2 id="why-text-instead-of-crystal-graphs">Why Text Instead of Crystal Graphs?</h2>
<p>Graph neural networks have been the dominant approach for crystal property prediction. Models like CGCNN, MEGNet, and ALIGNN represent crystals as graphs where atoms are nodes and bonds are edges. However, GNNs face several fundamental challenges for crystals:</p>
<ol>
<li><strong>Periodicity encoding</strong>: Crystals have repetitive unit cell arrangements that are distinct from standard molecular graphs, and GNNs struggle to encode this periodicity efficiently.</li>
<li><strong>Information incorporation</strong>: Critical structural information like bond angles, <a href="https://en.wikipedia.org/wiki/Space_group">space group</a> symmetry, and <a href="https://en.wikipedia.org/wiki/Wyckoff_positions">Wyckoff sites</a> is difficult to incorporate into graph representations.</li>
<li><strong>Expressiveness</strong>: Graphs may lack the expressiveness needed to convey complex crystal information relevant to property prediction.</li>
</ol>
<p>Meanwhile, textual descriptions of crystals (generated by tools like Robocrystallographer) naturally encode space group information, bond geometries, coordination environments, and symmetry details in human-readable form. Despite this richness, text-based approaches for crystal property prediction had been largely unexplored.</p>
<h2 id="core-innovation-t5-encoder-with-careful-fine-tuning">Core Innovation: T5 Encoder with Careful Fine-Tuning</h2>
<p>The key insight of LLM-Prop is to take a pre-trained encoder-decoder model (<a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-small) and discard the decoder entirely, using only the encoder with a linear prediction head. This design has several advantages:</p>
<ul>
<li>Cutting the network in half (from ~60M to ~37M parameters) allows processing of longer input sequences</li>
<li>Longer sequences mean more crystal information can be included</li>
<li>The encoder-only approach avoids T5&rsquo;s known weakness at regression in text-to-text format</li>
</ul>
<p>The framework applies several preprocessing strategies to the crystal text descriptions:</p>
<ol>
<li><strong>Stopword removal</strong>: Standard English stopwords are removed, except digits and symbols carrying chemical information</li>
<li><strong>Numerical token replacement</strong>: Bond distances are replaced with a <code>[NUM]</code> token and bond angles with <code>[ANG]</code>, reducing sequence length while preserving structural cues</li>
<li><strong>[CLS] token prepending</strong>: A classification token is added at the start, and its learned embedding is used as input to the prediction layer</li>
<li><strong>Label scaling</strong>: For regression tasks, targets are normalized using z-score, min-max, or log normalization</li>
</ol>
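<p>Steps 1&ndash;3 are ordinary text preprocessing; a sketch of the numerical-token replacement and <code>[CLS]</code> prepending (the regexes are illustrative, and the paper&rsquo;s exact rules may differ):</p>

```python
import re

def compress_description(text):
    """Replace bond lengths/angles with placeholders and prepend [CLS]."""
    text = re.sub(r"\d+\.\d+\s*(?:A|Å)", "[NUM]", text)             # bond distances
    text = re.sub(r"\d+(?:\.\d+)?\s*(?:degrees|°)", "[ANG]", text)  # bond angles
    return "[CLS] " + text

compact = compress_description(
    "Si-O bond lengths are 1.61 A and O-Si-O angles are 109.5 degrees."
)
print(compact)  # [CLS] Si-O bond lengths are [NUM] and O-Si-O angles are [ANG].
```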
<p>The normalization schemes are defined as:</p>
<p>$$
\hat{Y}_{i}(\text{z-score}) = \frac{Y_{i} - \mu}{\sigma}
$$</p>
<p>$$
\hat{Y}_{i}(\text{min-max}) = \frac{Y_{i} - Y_{\min}}{Y_{\max} - Y_{\min}}
$$</p>
<p>$$
\hat{Y}_{i}(\text{log-norm}) = \log(Y_{i} + 1)
$$</p>
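<p>These schemes are one-liners; the only subtlety, as with any label scaling, is that the statistics ($\mu$, $\sigma$, $Y_{\min}$, $Y_{\max}$) should be computed from the training labels only:</p>

```python
import math

def z_score(y, mu, sigma):
    return [(v - mu) / sigma for v in y]

def min_max(y, y_min, y_max):
    return [(v - y_min) / (y_max - y_min) for v in y]

def log_norm(y):
    return [math.log(v + 1.0) for v in y]

band_gaps = [0.0, 1.0, 3.0]  # toy labels in eV
scaled = min_max(band_gaps, min(band_gaps), max(band_gaps))
```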
<p>The tokenizer is also retrained on the crystal text corpus with a vocabulary size of 32k, and the special tokens <code>[NUM]</code>, <code>[ANG]</code>, and <code>[CLS]</code> are added to the vocabulary.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="dataset-textedge">Dataset: TextEdge</h3>
<p>The authors collected data from the <a href="https://en.wikipedia.org/wiki/Materials_Project">Materials Project</a> database (as of November 2022), yielding 144,931 crystal structure-description pairs split into 125,098 training, 9,945 validation, and 9,888 test samples. Crystal text descriptions were generated using Robocrystallographer. The dataset covers six prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Type</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Band gap (eV)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Unit cell volume (A^3/cell)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Formation energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy above hull (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Is-gap-direct</td>
          <td>Classification</td>
          <td>AUC (higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<p>Seven baselines were compared:</p>
<ul>
<li><strong>GNN-based</strong>: CGCNN, MEGNet, ALIGNN, DeeperGATGNN</li>
<li><strong>Classic ML</strong>: XGBoost, Random Forest (on Robocrystallographer features)</li>
<li><strong>Text-based</strong>: MatBERT (domain-specific pre-trained BERT, ~110M parameters)</li>
</ul>
<p>All models were trained and evaluated on the same dataset splits for fair comparison. GNN models were retrained on the new data rather than using results from older, smaller Materials Project versions.</p>
<h3 id="main-results-llm-prop-vs-gnn-baselines">Main Results: LLM-Prop vs. GNN Baselines</h3>
<p>When using crystal text descriptions as input, LLM-Prop achieved:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (A^3/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>0.293</td>
          <td>188.834</td>
          <td>0.046</td>
          <td>0.082</td>
          <td>0.040</td>
          <td>0.830</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>0.304</td>
          <td>297.948</td>
          <td>0.077</td>
          <td>0.056</td>
          <td>0.051</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>0.250</td>
          <td>129.580</td>
          <td>0.027</td>
          <td>0.059</td>
          <td>0.028</td>
          <td>0.678</td>
      </tr>
      <tr>
          <td>DeeperGATGNN</td>
          <td>0.291</td>
          <td>111.857</td>
          <td>0.081</td>
          <td>0.116</td>
          <td>0.045</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>LLM-Prop (Descr.)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.252</strong></td>
          <td>0.056</td>
          <td>0.067</td>
          <td>0.047</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>LLM-Prop outperformed the best GNN baseline on each task by approximately 8% on <a href="https://en.wikipedia.org/wiki/Band_gap">band gap</a> prediction (vs. ALIGNN), roughly 65% on volume prediction (vs. DeeperGATGNN), and about 3% on band gap classification (Is-gap-direct, vs. CGCNN). For formation energy per atom, energy per atom, and energy above hull, ALIGNN retained the advantage.</p>
<h3 id="llm-prop-vs-matbert">LLM-Prop vs. MatBERT</h3>
<p>LLM-Prop also outperformed MatBERT (a domain-specific pre-trained BERT) across all tasks despite having roughly 3x fewer parameters. The table below shows the best result for each model across the three input preprocessing strategies (w/ Numbers, w/o Numbers, w/ [NUM]&amp;[ANG]):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MatBERT (best)</td>
          <td>0.258</td>
          <td>54.969</td>
          <td>0.071</td>
          <td>0.098</td>
          <td>0.050</td>
          <td>0.722</td>
      </tr>
      <tr>
          <td>LLM-Prop (best)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.138</strong></td>
          <td><strong>0.056</strong></td>
          <td><strong>0.067</strong></td>
          <td><strong>0.047</strong></td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: LLM-Prop&rsquo;s best band gap (0.231) comes from the &ldquo;w/o Numbers&rdquo; configuration, while the best volume (39.138) comes from &ldquo;w/ Numbers&rdquo;. The best Is-gap-direct AUC (0.857) uses the &ldquo;[NUM]&amp;[ANG]&rdquo; configuration.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The contribution of each preprocessing strategy was evaluated:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Band gap</th>
          <th>Volume</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LLM-Prop (baseline)</td>
          <td>0.256</td>
          <td>69.352</td>
          <td>0.796</td>
      </tr>
      <tr>
          <td>+ modified tokenizer</td>
          <td>0.247</td>
          <td>78.632</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>+ label scaling</td>
          <td>0.242</td>
          <td>44.515</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>+ [CLS] token</td>
          <td>0.231</td>
          <td>39.520</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>+ [NUM] token</td>
          <td>0.251</td>
          <td>86.090</td>
          <td>0.793</td>
      </tr>
      <tr>
          <td>+ [ANG] token</td>
          <td>0.242</td>
          <td>64.965</td>
          <td>0.810</td>
      </tr>
      <tr>
          <td>- stopwords</td>
          <td>0.252</td>
          <td>56.593</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>LLM-Prop+all (no space group)</td>
          <td>0.235</td>
          <td>97.457</td>
          <td>0.705</td>
      </tr>
      <tr>
          <td>LLM-Prop+all</td>
          <td><strong>0.229</strong></td>
          <td>42.259</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>The [CLS] token provided the single largest improvement across all tasks. Label scaling was critical for volume prediction (reducing MAE from 69.352 to 44.515). Removing space group information from descriptions degraded volume prediction dramatically (from 42.259 to 97.457), confirming that space group symmetry is a key factor.</p>
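<p>Label scaling here simply standardizes the regression targets before training and inverts the transform at prediction time. A minimal sketch (z-score scaling is one common choice; the exact per-task transform is not restated here, so treat the details as an assumption):</p>

```python
import statistics

def scale_labels(labels):
    """Z-score scaling of regression targets (one common choice;
    the exact scaling used per task is an assumption here)."""
    mean = statistics.fmean(labels)
    std = statistics.stdev(labels)
    return [(y - mean) / std for y in labels], mean, std

def unscale(pred, mean, std):
    # Map a model prediction back to the original label space.
    return pred * std + mean
```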
<h3 id="data-efficiency-and-transfer-learning">Data Efficiency and Transfer Learning</h3>
<p>LLM-Prop achieved SOTA results on band gap and volume prediction with only about 90k training samples (35k fewer than baselines). For volume prediction specifically, LLM-Prop outperformed all GNN baselines with just 30k training samples.</p>
<p>Transfer learning experiments showed that LLM-Prop transferred well between band gap and volume prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Volume-to-Band gap (Test)</th>
          <th>Band gap-to-Volume (Test)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN-transfer</td>
          <td>0.295</td>
          <td>182.997</td>
      </tr>
      <tr>
          <td>ALIGNN-transfer</td>
          <td>0.322</td>
          <td>136.164</td>
      </tr>
      <tr>
          <td>MatBERT-transfer</td>
          <td>0.266</td>
          <td>54.289</td>
      </tr>
      <tr>
          <td>LLM-Prop-transfer</td>
          <td><strong>0.244</strong></td>
          <td><strong>50.753</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Text descriptions of crystals carry rich structural information (space groups, Wyckoff sites, coordination geometries) that is difficult to encode in graphs but naturally expressed in text</li>
<li>A carefully fine-tuned general-purpose LLM encoder can outperform domain-specific pre-trained models, challenging the assumption that in-domain pre-training is always necessary</li>
<li>Removing numerical information (bond distances and angles) from descriptions often improves performance, because current LLMs treat numbers as regular tokens without understanding their quantitative meaning</li>
<li>Longer input sequences correlate with better performance, with 888 tokens as the default maximum on the hardware used</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The origin of LLM-Prop&rsquo;s performance advantage over GNNs is not fully understood. It remains unclear whether the boost comes from additional structured information in text or from the different data modality itself</li>
<li>LLM-Prop cannot perform zero-shot predictions since T5 was not pre-trained on materials science data</li>
<li>The approach depends on Robocrystallographer to generate text descriptions, adding a preprocessing dependency</li>
<li>Current LLMs&rsquo; inability to reason about numerical values limits the use of quantitative information in descriptions</li>
</ul>
<p><strong>Future directions</strong> suggested by the authors include investigating techniques to use <a href="/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">CIF files</a> directly as LLM inputs, developing new GNN architectures that incorporate space group and Wyckoff site information, and further exploring which information in crystal descriptions contributes most to each property prediction task.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>TextEdge</td>
          <td>144,931 crystals</td>
          <td>From Materials Project (Nov 2022), text generated by Robocrystallographer</td>
      </tr>
      <tr>
          <td>Training split</td>
          <td>TextEdge</td>
          <td>125,098</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Validation split</td>
          <td>TextEdge</td>
          <td>9,945</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Test split</td>
          <td>TextEdge</td>
          <td>9,888</td>
          <td>Random split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam with one-cycle learning rate scheduler</li>
<li><strong>Learning rate</strong>: 1e-3 for LLM-Prop, 5e-5 for MatBERT</li>
<li><strong>Dropout</strong>: 0.2 for LLM-Prop, 0.5 for MatBERT</li>
<li><strong>Batch size</strong>: 64 (888 tokens) or 16 (2000 tokens) for LLM-Prop</li>
<li><strong>Epochs</strong>: 200-300 depending on task</li>
<li><strong>Loss</strong>: MAE for regression, BCE for classification</li>
<li><strong>Evaluation</strong>: MAE for regression, AUC for classification</li>
<li><strong>Repetitions</strong>: each model was run 5 times on the test set; the averaged MAE is reported</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Base model</strong>: T5-small encoder (~60M parameters total, ~37M after discarding decoder and adding prediction head)</li>
<li><strong>Vocabulary size</strong>: 32k (retrained tokenizer)</li>
<li><strong>Max input tokens</strong>: 888 (default) or 2000</li>
<li><strong>Special tokens</strong>: [CLS], [NUM], [ANG]</li>
</ul>
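<p>The [NUM] and [ANG] strategies replace literal numbers in the crystal descriptions with placeholder tokens before tokenization. A rough sketch of such preprocessing (the regex patterns are illustrative assumptions, not the authors' exact rules):</p>

```python
import re

def preprocess_description(text, mode="num_ang"):
    """Sketch of the numeric preprocessing strategies (patterns are
    assumptions). In "num_ang" mode, numbers followed by 'degrees'
    become [ANG] and all remaining numbers become [NUM]."""
    if mode == "num_ang":
        text = re.sub(r"\d+(?:\.\d+)?(?=\s*degrees)", "[ANG]", text)
        text = re.sub(r"\d+(?:\.\d+)?", "[NUM]", text)
    elif mode == "no_numbers":
        text = re.sub(r"\d+(?:\.\d+)?", "", text)
    return text
```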
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/vertaix/LLM-Prop">LLM-Prop</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://drive.google.com/drive/folders/1YCDBzwjwNRIc1FRkB662G3Y5AOWaokUG">TextEdge + Checkpoints</a></td>
          <td>Dataset + Model</td>
          <td>Not specified</td>
          <td>Benchmark dataset and trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: NVIDIA RTX A6000</li>
<li><strong>Training time</strong>: ~40 minutes per epoch for LLM-Prop</li>
<li><strong>Inference</strong>: ~1 minute for 10,000 materials on one GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rubungo, A. N., Arnold, C. B., Rand, B. P., &amp; Dieng, A. B. (2025). LLM-Prop: predicting the properties of crystalline materials using large language models. <em>npj Computational Materials</em>, 11, 186. <a href="https://doi.org/10.1038/s41524-025-01536-2">https://doi.org/10.1038/s41524-025-01536-2</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rubungo2025llmprop,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LLM-Prop: predicting the properties of crystalline materials using large language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rubungo, Andre Niyongabo and Arnold, Craig B. and Rand, Barry P. and Dieng, Adji Bousso}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{npj Computational Materials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{186}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41524-025-01536-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Perplexity for Molecule Ranking and CLM Bias Detection</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/perplexity-molecule-ranking-bias-clms/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/perplexity-molecule-ranking-bias-clms/</guid><description>Perplexity scoring enables intrinsic molecule ranking and pretraining bias detection in chemical language models for de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-method-for-intrinsic-scoring-and-bias-detection-in-chemical-language-models">A Method for Intrinsic Scoring and Bias Detection in Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces two contributions to the chemical language model (CLM) pipeline for <a href="/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/">de novo molecular design</a>. First, the authors propose using perplexity as a model-intrinsic score to rank generated <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings by how well they match the design objectives encoded in the fine-tuning data. Second, they introduce a &ldquo;delta score&rdquo; that compares molecule rankings from pretrained and fine-tuned CLMs to detect pretraining bias, where molecules are generated primarily based on generic pretraining knowledge rather than task-specific fine-tuning objectives.</p>
<h2 id="the-ranking-and-bias-problem-in-clm-based-molecule-generation">The Ranking and Bias Problem in CLM-Based Molecule Generation</h2>
<p>Chemical language models generate new molecules as SMILES strings by iteratively predicting the next character based on learned probability distributions. After training, CLMs can produce large virtual libraries of candidate molecules via multinomial sampling. However, two key challenges remain: (1) the generated molecules lack a natural ranking, requiring external scoring methods such as similarity assessment or activity prediction for prioritization, and (2) <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">transfer learning</a> (pretraining on a large corpus followed by fine-tuning on a small target set) can introduce &ldquo;pretraining bias,&rdquo; where some generated molecules reflect generic chemical knowledge from pretraining rather than the specific design objectives of the fine-tuning data.</p>
<p>Beam search offers an alternative sampling approach that produces inherently ranked molecules by greedily selecting the most probable SMILES strings. However, beam search explores only a narrow portion of chemical space. The authors sought to combine the ranking advantage of beam search with the chemical space exploration of multinomial sampling by applying perplexity scoring as a post-hoc ranking criterion.</p>
<h2 id="perplexity-scoring-and-the-delta-score-for-bias-estimation">Perplexity Scoring and the Delta Score for Bias Estimation</h2>
<p>The core innovation is the application of <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a>, a standard evaluation metric from natural language processing, to score SMILES strings generated by CLMs. For a SMILES string of length $N$ with character probabilities $p_i$ assigned by the CLM, perplexity is computed as:</p>
<p>$$
\text{perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_{2}(p_{i})}
$$</p>
<p>Low perplexity indicates that the CLM assigns high probability to each character in the SMILES string, suggesting the molecule closely matches the learned distribution of the fine-tuning data. The metric is normalized by string length, making it comparable across molecules of different sizes.</p>
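<p>As a concrete illustration, perplexity can be computed directly from the per-character probabilities (a minimal sketch using base-2 logarithms, the standard convention for perplexity):</p>

```python
import math

def perplexity(char_probs):
    """Perplexity of a SMILES string from the per-character
    probabilities assigned by the CLM. Lower means the string
    better matches the model's learned distribution."""
    n = len(char_probs)
    avg_log2 = sum(math.log2(p) for p in char_probs) / n
    return 2 ** (-avg_log2)
```

For example, a string whose characters each receive probability 0.5 has perplexity exactly 2, regardless of its length.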
<p>To address pretraining bias, the authors introduce a delta score. For each generated molecule, the perplexity-based rank from the fine-tuned model ($\text{rank}_{ft}$) is compared against the rank from the pretrained model ($\text{rank}_{pt}$):</p>
<p>$$
\text{delta} = \text{rank}_{ft} - \text{rank}_{pt}
$$</p>
<p>A positive delta score indicates that the fine-tuned model ranks the molecule higher than the pretrained model, suggesting the molecule was generated based on task-specific fine-tuning knowledge. A negative delta score flags molecules that may have been generated primarily from pretraining information, which do not necessarily match the design objectives.</p>
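<p>A sketch of the delta computation. The rank convention used here (the best-scoring, lowest-perplexity molecule receives the largest rank value, so positive deltas flag molecules favored by the fine-tuned model) is an assumption chosen to match the sign interpretation above:</p>

```python
def ranks_best_high(perplexities):
    """Assign ranks so the lowest-perplexity molecule gets the
    highest rank value (a convention assumed for illustration)."""
    order = sorted(range(len(perplexities)),
                   key=lambda i: perplexities[i], reverse=True)
    ranks = [0] * len(perplexities)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def delta_scores(ppl_finetuned, ppl_pretrained):
    # delta = rank_ft - rank_pt; positive -> fine-tuned model
    # favors the molecule relative to the pretrained model.
    rft = ranks_best_high(ppl_finetuned)
    rpt = ranks_best_high(ppl_pretrained)
    return [f - p for f, p in zip(rft, rpt)]
```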
<p>The multinomial sampling probability for each character is computed via the softmax function:</p>
<p>$$
p_{i} = \frac{e^{z_{i}/T}}{\sum_{j} e^{z_{j}/T}}
$$</p>
<p>where $z_{i}$ is the CLM output logit for the $i$th character, $j$ runs over all dictionary characters, and $T$ is the temperature parameter (set to $T = 1$ in this study).</p>
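<p>This sampling step can be sketched in a few lines (illustrative only; the study's implementation uses Keras):</p>

```python
import math
import random

def sample_char(logits, temperature=1.0):
    """Draw the next SMILES character index by temperature softmax
    over the CLM's output logits (T = 1 in the study)."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # random.choices draws index i with probability probs[i]
    idx = random.choices(range(len(logits)), weights=probs, k=1)[0]
    return idx, probs
```

Lower temperatures sharpen the distribution toward the highest-logit character; higher temperatures flatten it toward uniform sampling.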
<h2 id="experimental-setup-10-protein-targets-across-four-data-regimes">Experimental Setup: 10 Protein Targets Across Four Data Regimes</h2>
<p>The authors systematically evaluated perplexity scoring across 10 macromolecular targets and four low-data fine-tuning regimes (5, 10, 20, and 40 molecules per target).</p>
<p><strong>Model architecture</strong>: A four-layer LSTM-based RNN (5,820,515 parameters) with batch normalization layers, LSTM layers of 1024 and 256 units, trained using the Adam optimizer with a learning rate of $10^{-4}$.</p>
<p><strong>Pretraining</strong>: The model was pretrained on 1,683,181 molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> (version 28), encoded as canonical SMILES (20-90 characters), for 90 epochs.</p>
<p><strong>Fine-tuning</strong>: For each of 10 randomly selected protein targets (Table 1), bioactive ligands with pChEMBL &gt; 6 were selected. Fine-tuning sets of 5, 10, 20, and 40 molecules were compiled for each target. Fine-tuning ran for 100 epochs, with 1,000 SMILES strings sampled every second epoch via multinomial sampling ($T = 1$).</p>
<table>
  <thead>
      <tr>
          <th>CHEMBL ID</th>
          <th>Target</th>
          <th>Protein Classification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CHEMBL1836</td>
          <td>Prostanoid EP4 receptor</td>
          <td><a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">G protein-coupled receptor</a></td>
      </tr>
      <tr>
          <td>CHEMBL1945</td>
          <td>Melatonin receptor 1A</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL1983</td>
          <td>Serotonin 1D (5-HT1D) receptor</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL202</td>
          <td><a href="https://en.wikipedia.org/wiki/Dihydrofolate_reductase">Dihydrofolate reductase</a></td>
          <td>Oxidoreductase</td>
      </tr>
      <tr>
          <td>CHEMBL3522</td>
          <td><a href="https://en.wikipedia.org/wiki/Cytochrome_P450">Cytochrome P450</a> 17A1</td>
          <td>Cytochrome P450</td>
      </tr>
      <tr>
          <td>CHEMBL4029</td>
          <td>Interleukin-8 receptor A</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL5073</td>
          <td>CaM kinase I delta</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5137</td>
          <td>Metabotropic glutamate receptor 2</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL5408</td>
          <td>Serine/threonine-protein kinase TBK1</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5608</td>
          <td>NT-3 growth factor receptor</td>
          <td>Kinase</td>
      </tr>
  </tbody>
</table>
<p><strong>Sampling comparison</strong>: Beam search sampling was performed with beam widths $k = 10$ and $k = 50$ for comparison against multinomial sampling.</p>
<p><strong>Molecular similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> was computed using Morgan fingerprints (radius 2, length 1024) and 2D <a href="https://en.wikipedia.org/wiki/Pharmacophore">pharmacophore</a> fingerprints via RDKit (2019.03.2).</p>
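<p>The Tanimoto coefficient itself reduces to intersection over union of fingerprint on-bits. A dependency-free sketch (the study used RDKit Morgan and pharmacophore fingerprints; representing a fingerprint as a set of on-bit indices is an assumption for illustration):</p>

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints
    given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0
    inter = len(bits_a & bits_b)
    union = len(bits_a | bits_b)
    return inter / union
```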
<h2 id="key-findings-multinomial-sampling-outperforms-beam-search">Key Findings: Multinomial Sampling Outperforms Beam Search</h2>
<p><strong>Perplexity correlates with molecular similarity.</strong> The Pearson correlation between perplexity and Tanimoto distance to the fine-tuning set stabilized at approximately 0.5 across all data regimes. This correlation emerged earlier with larger fine-tuning sets. The result confirms that perplexity captures both substructural and pharmacophore features while also incorporating additional CLM-learned information.</p>
<p><strong>Multinomial sampling produces better-ranked molecules than beam search.</strong> With the smallest fine-tuning sets (5 molecules), the top 50 molecules from multinomial sampling consistently exhibited lower (better) perplexity values than beam search at $k = 10$ or $k = 50$. Increasing the beam width from 10 to 50 did not markedly improve beam search performance. For novel molecules (Tanimoto similarity below 50% to the nearest fine-tuning compound), multinomial sampling identified lower-perplexity molecules in 72% of cases with the smallest fine-tuning sets.</p>
<p><strong>Perplexity scoring narrows the quality distribution.</strong> The top 50 molecules selected by perplexity from multinomial sampling spanned a narrower range of perplexity values compared to beam search, suggesting a more consistent pool of high-quality candidates for follow-up synthesis.</p>
<p><strong>Pretraining bias is substantial.</strong> The delta score analysis revealed that more than 40% of sampled molecules had negative delta scores during the first 20 fine-tuning epochs, meaning they were ranked higher by the pretrained model than the fine-tuned model. This fraction remained above 10% even at the end of 100 fine-tuning epochs across all data regimes, confirming that 10-40% of generated molecules reflect &ldquo;generic&rdquo; pretraining rather than task-focused fine-tuning.</p>
<p><strong>Perplexity alone partially mitigates bias.</strong> Among the top 50 molecules selected by perplexity from multinomial sampling, only up to 3% had negative delta scores, compared to 10-40% in the unfiltered population. This suggests that perplexity-based ranking already reduces pretraining bias, though the delta score provides additional filtering power.</p>
<p><strong>SMILES validity remained high.</strong> Mean SMILES string validity consistently exceeded 90% across all fine-tuned models and fine-tuning epochs.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note several limitations and future directions. The study used a fixed temperature of $T = 1$ for multinomial sampling; combining perplexity with temperature tuning or <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">SMILES augmentation</a> remains unexplored. The evaluation focused on 10 protein targets, and broader validation across diverse target classes would strengthen the conclusions. The authors also suggest that combining CLMs with perplexity scoring could be applied to screen large collections of commercially available compounds, which has not yet been tested.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v28</td>
          <td>1,683,181 molecules</td>
          <td>Canonical SMILES, 20-90 characters, salts and duplicates removed</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>ChEMBL v28 (split)</td>
          <td>84,160 molecules</td>
          <td>Random split from pretraining set</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ChEMBL v28 (per target)</td>
          <td>5, 10, 20, or 40 molecules</td>
          <td>pChEMBL &gt; 6, 10 targets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>LSTM-based CLM with character-level SMILES prediction</li>
<li>Multinomial sampling at $T = 1$</li>
<li>Beam search at $k = 10$ and $k = 50$</li>
<li>Perplexity computed per Equation 1; delta score per Equation 2</li>
<li>Adam optimizer, learning rate $10^{-4}$, 90 pretraining epochs, 100 fine-tuning epochs</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4-layer LSTM RNN: batch normalization, LSTM (1024 units), LSTM (256 units), batch normalization</li>
<li>5,820,515 parameters total</li>
<li>One-hot encoded SMILES input</li>
<li>Pretrained weights available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perplexity</td>
          <td>Model confidence in generated SMILES</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Delta score</td>
          <td>Rank difference between fine-tuned and pretrained models</td>
          <td>Positive indicates task-relevant generation</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Morgan and pharmacophore fingerprints</td>
          <td>Compared to fine-tuning set</td>
      </tr>
      <tr>
          <td>Pearson correlation</td>
          <td>Perplexity vs. Tanimoto distance</td>
          <td>Stabilizes at ~0.5</td>
      </tr>
      <tr>
          <td>SMILES validity</td>
          <td>Fraction of valid SMILES strings</td>
          <td>Consistently &gt; 90%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The implementation uses Keras (v2.2.0) with TensorFlow GPU backend (v1.9.0).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ETHmodlab/CLM_perplexity">CLM_perplexity</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework, pretrained weights, and training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ETHmodlab/molecular_design_with_beam_search">Beam search implementation</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Referenced beam search implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Moret, M., Grisoni, F., Katzberger, P., &amp; Schneider, G. (2022). Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models. <em>Journal of Chemical Information and Modeling</em>, 62(5), 1199-1206. <a href="https://doi.org/10.1021/acs.jcim.2c00079">https://doi.org/10.1021/acs.jcim.2c00079</a></p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{moret2022perplexity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Moret, Michael and Grisoni, Francesca and Katzberger, Paul and Schneider, Gisbert}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1199--1206}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c00079}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoleculeNet: Benchmarking Molecular Machine Learning</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/</guid><description>MoleculeNet curates 17 datasets across quantum mechanics, physical chemistry, biophysics, and physiology with standardized splits and metrics for molecular ML.</description><content:encoded><![CDATA[<h2 id="a-resource-paper-for-molecular-machine-learning-benchmarking">A Resource Paper for Molecular Machine Learning Benchmarking</h2>
<p>This is a <strong>Resource</strong> paper. MoleculeNet provides a standardized benchmark suite for evaluating molecular machine learning methods. Its primary contribution is the curation of 17 public datasets spanning four categories of molecular properties, together with standardized evaluation metrics, multiple dataset splitting strategies, and open-source implementations of featurization and learning algorithms via the DeepChem library.</p>
<h2 id="why-molecular-ml-needed-a-unified-benchmark">Why Molecular ML Needed a Unified Benchmark</h2>
<p>Prior to MoleculeNet, algorithmic progress in molecular machine learning was difficult to measure. Individual papers benchmarked proposed methods on different datasets with different metrics, making cross-method comparison unreliable. Several factors make molecular ML particularly challenging:</p>
<ol>
<li><strong>Data scarcity</strong>: Molecular datasets are much smaller than those available for computer vision or NLP, since obtaining accurate chemical property measurements requires specialized instruments and expert supervision.</li>
<li><strong>Heterogeneous outputs</strong>: Properties of interest range from quantum mechanical characteristics to macroscopic physiological effects on the human body.</li>
<li><strong>Variable input structures</strong>: Molecules have arbitrary size, variable connectivity, and many possible 3D conformers, all of which must be encoded into fixed-length representations for conventional ML algorithms.</li>
<li><strong>No standard evaluation protocol</strong>: Without prescribed metrics, splits, or data subsets, two methods using the same underlying database (e.g., PubChem) could be entirely incomparable.</li>
</ol>
<p>Existing databases like PubChem, ChEMBL, and the Quantum Machine collections provided raw data but did not define evaluation protocols suitable for machine learning development. MoleculeNet bridges this gap, following the precedent set by ImageNet in computer vision and WordNet in NLP.</p>
<h2 id="core-design-datasets-splits-metrics-and-featurizations">Core Design: Datasets, Splits, Metrics, and Featurizations</h2>
<p>MoleculeNet is organized around four components: curated datasets, splitting methods, evaluation metrics, and molecular featurizations.</p>
<h3 id="datasets-across-four-property-categories">Datasets Across Four Property Categories</h3>
<p>The benchmark includes 17 datasets covering over 700,000 compounds and more than 800 tasks. These are organized into four categories reflecting different levels of molecular properties:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Compounds</th>
          <th>Task Type</th>
          <th>Rec. Split</th>
          <th>Rec. Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Quantum Mechanics</td>
          <td>QM7</td>
          <td>1</td>
          <td>7,165</td>
          <td>Regression</td>
          <td>Stratified</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM7b</td>
          <td>14</td>
          <td>7,211</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM8</td>
          <td>12</td>
          <td>21,786</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM9</td>
          <td>12</td>
          <td>133,885</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>1,128</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>643</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>Lipophilicity</td>
          <td>1</td>
          <td>4,200</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA</td>
          <td>128</td>
          <td>439,863</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>MUV</td>
          <td>17</td>
          <td>93,127</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>HIV</td>
          <td>1</td>
          <td>41,913</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>PDBbind</td>
          <td>1</td>
          <td>11,908</td>
          <td>Regression</td>
          <td>Time</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>BACE</td>
          <td>1</td>
          <td>1,522</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>2,053</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>Tox21</td>
          <td>12</td>
          <td>8,014</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ToxCast</td>
          <td>617</td>
          <td>8,615</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>SIDER</td>
          <td>27</td>
          <td>1,427</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ClinTox</td>
          <td>2</td>
          <td>1,491</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p><strong>Quantum mechanics</strong> datasets (QM7, QM7b, QM8, QM9) contain DFT-computed electronic properties for subsets of the <a href="/notes/chemistry/datasets/gdb-17/">GDB</a> database. <strong>Physical chemistry</strong> datasets cover solubility (ESOL), hydration free energy (FreeSolv), and lipophilicity. <strong>Biophysics</strong> datasets include high-throughput screening results (PCBA, MUV), HIV inhibition activity, protein-ligand binding affinity (PDBbind), and BACE-1 inhibition. <strong>Physiology</strong> datasets cover blood-brain barrier penetration (BBBP), toxicity (Tox21, ToxCast), side effects (SIDER), and clinical trial toxicity (ClinTox).</p>
<h3 id="data-splitting-strategies">Data Splitting Strategies</h3>
<p>MoleculeNet implements four splitting methods, all using an 80/10/10 train/validation/test ratio:</p>
<ul>
<li><strong>Random splitting</strong>: Standard random assignment to subsets.</li>
<li><strong>Scaffold splitting</strong>: Separates molecules by their 2D structural frameworks (Bemis-Murcko scaffolds), providing a harder generalization test since structurally different molecules appear in different subsets.</li>
<li><strong>Stratified splitting</strong>: Ensures each subset contains the full range of label values (used for QM7).</li>
<li><strong>Time splitting</strong>: Trains on older data and tests on newer data to mimic real-world development (used for PDBbind).</li>
</ul>
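<p>As a sketch, the scaffold split can be approximated by grouping molecules under a precomputed scaffold key and assigning whole groups to subsets until the 80/10/10 budget fills. The <code>scaffold_of</code> mapping below is a stand-in for the real Bemis-Murcko scaffold computation (done with RDKit in DeepChem's implementation).</p>

```python
from collections import defaultdict

def scaffold_split(mols, scaffold_of, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test, largest first.

    mols: molecule identifiers; scaffold_of: molecule -> scaffold key
    (a stand-in for a Bemis-Murcko scaffold computed with RDKit).
    """
    groups = defaultdict(list)
    for m in mols:
        groups[scaffold_of[m]].append(m)
    # Larger scaffold groups are placed into train first.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mols)
    train, valid, test = [], [], []
    for g in ordered:
        if len(train) + len(g) <= frac_train * n:
            train += g
        elif len(valid) + len(g) <= frac_valid * n:
            valid += g
        else:
            test += g
    return train, valid, test
```

<p>Because every scaffold group moves as a unit, no 2D framework appears in more than one subset, which is precisely what makes the split a harder generalization test than random assignment.</p>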
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>Regression tasks use MAE or RMSE depending on the dataset. Classification tasks use either ROC-AUC or PRC-AUC. The choice between ROC-AUC and PRC-AUC depends on class imbalance: PRC-AUC is recommended for datasets with positive rates below 2% (PCBA, MUV), since precision-recall curves better capture performance under extreme imbalance.</p>
<p>The false positive rate and precision are defined as:</p>
<p>$$
\text{FPR} = \frac{\text{false positive}}{\text{false positive} + \text{true negative}}
$$</p>
<p>$$
\text{precision} = \frac{\text{true positive}}{\text{false positive} + \text{true positive}}
$$</p>
<p>When positive samples form only a small fraction of the data, each false positive depresses precision far more than it raises the FPR, which makes PRC-AUC more informative than ROC-AUC.</p>
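<p>The effect is easy to check numerically. The hypothetical confusion matrices below keep the classifier's per-class error rates fixed (80% recall, 10% FPR) and only change the positive rate; FPR stays at 0.1 while precision collapses.</p>

```python
def fpr(tp, fp, tn, fn):
    """False positive rate: FP / (FP + TN)."""
    return fp / (fp + tn)

def precision(tp, fp, tn, fn):
    """Precision (PPV): TP / (TP + FP)."""
    return tp / (tp + fp)

# Balanced data: 500 positives, 500 negatives.
balanced = dict(tp=400, fn=100, fp=50, tn=450)
# Imbalanced data: 10 positives, 990 negatives, same per-class rates.
imbalanced = dict(tp=8, fn=2, fp=99, tn=891)

print(fpr(**balanced), precision(**balanced))      # 0.1, ~0.889
print(fpr(**imbalanced), precision(**imbalanced))  # 0.1, ~0.075
```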
<h3 id="featurization-methods">Featurization Methods</h3>
<p>MoleculeNet implements six molecular featurization approaches:</p>
<ol>
<li><strong>ECFP (Extended-Connectivity Fingerprints)</strong>: Fixed-length binary fingerprints capturing topological substructures via hashing.</li>
<li><strong><a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb Matrix</a></strong>: Encodes nuclear charges and 3D coordinates through atomic self-energies and Coulomb repulsion:</li>
</ol>
<p>$$
M_{IJ} = \begin{cases} 0.5 Z_{I}^{2.4} &amp; \text{for } I = J \\ \frac{Z_{I} Z_{J}}{|\mathbf{R}_{I} - \mathbf{R}_{J}|} &amp; \text{for } I \neq J \end{cases}
$$</p>
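<p>The definition transcribes directly into code. The charges and coordinates below are a hypothetical H<sub>2</sub> example with distances already in atomic units, so no unit conversion appears.</p>

```python
import math

def coulomb_matrix(Z, R):
    """Z: nuclear charges; R: (x, y, z) coordinates in atomic units."""
    n = len(Z)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * Z[i] ** 2.4  # atomic self-energy
            else:
                # Coulomb repulsion between nuclei I and J
                M[i][j] = Z[i] * Z[j] / math.dist(R[i], R[j])
    return M

# Hypothetical H2 at a bond length of 1.4 bohr.
M = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```

<p>In practice the matrix is padded to a fixed size and made permutation-invariant (e.g., by sorting rows by norm or using eigenvalues) before being fed to a fixed-input model.</p>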
<ol start="3">
<li><strong>Grid Featurizer</strong>: Designed for PDBbind, incorporating both ligand and protein structural information including salt bridges, hydrogen bonds, and SPLIF fingerprints.</li>
<li><strong>Symmetry Functions</strong>: Preserve rotational and permutation symmetry through radial and angular functions between atom pairs and triplets.</li>
<li><strong>Graph Convolutions</strong>: Compute initial atom feature vectors and neighbor lists from molecular graphs.</li>
<li><strong>Weave</strong>: Similar to graph convolutions but also computes pairwise atom features encoding bond properties, graph distance, and ring information.</li>
</ol>
<h2 id="benchmarked-models-and-experimental-setup">Benchmarked Models and Experimental Setup</h2>
<p>MoleculeNet benchmarks 12 learning algorithms divided into conventional methods and graph-based methods.</p>
<h3 id="conventional-methods">Conventional Methods</h3>
<ul>
<li><strong>Logistic Regression</strong> (classification only)</li>
<li><strong>Kernel SVM</strong> with radial basis function kernel</li>
<li><strong>Kernel Ridge Regression (KRR)</strong></li>
<li><strong>Random Forests</strong></li>
<li><strong>Gradient Boosting</strong> (XGBoost)</li>
<li><strong>Singletask/Multitask Networks</strong>: Fully connected networks with shared layers across tasks</li>
<li><strong>Bypass Networks</strong>: Multitask networks augmented with per-task &ldquo;bypass&rdquo; layers that directly connect inputs to outputs</li>
<li><strong>Influence Relevance Voting (IRV)</strong>: Refined K-nearest neighbor classifiers using Jaccard-Tanimoto similarity:</li>
</ul>
<p>$$
S(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$</p>
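<p>On binary fingerprints the similarity reduces to set operations over the indices of the on-bits; a minimal sketch with two hypothetical fingerprints:</p>

```python
def tanimoto(a, b):
    """Jaccard-Tanimoto similarity of two binary fingerprints,
    each given as the set of its on-bit indices."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)

# Hypothetical fingerprints sharing 2 of 4 distinct on-bits.
print(tanimoto({0, 3, 5}, {3, 5, 7}))  # 2 / 4 = 0.5
```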
<h3 id="graph-based-methods">Graph-Based Methods</h3>
<ul>
<li><strong>Graph Convolutional Models (GC)</strong>: Extend circular fingerprints with learnable convolutions over molecular graphs.</li>
<li><strong>Weave Models</strong>: Update atom features using information from all other atoms and their pairwise features.</li>
<li><strong>Directed Acyclic Graph (DAG) Models</strong>: Define directed bonds toward a central atom and propagate features through the directed graph.</li>
<li><strong>Deep Tensor Neural Networks (DTNN)</strong>: Use nuclear charges and distance matrices directly, updating atom embeddings based on pairwise physical distances.</li>
<li><strong>ANI-1</strong>: Learns transferable potentials using symmetry function features with atom-type-specific neural networks.</li>
<li><strong>Message Passing Neural Networks (MPNN)</strong>: Generalized framework with edge-dependent message functions and set2set readout.</li>
</ul>
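<p>What these models share is a message-passing core: each atom aggregates information from its neighbors and updates its feature vector. A framework-free sketch of one round with plain sum aggregation (real models insert learned message and update functions at the commented step):</p>

```python
def message_passing_step(features, adjacency):
    """One round of sum-aggregation message passing.

    features: atom -> list of floats (current features).
    adjacency: atom -> list of neighboring atoms.
    """
    updated = {}
    for atom, feat in features.items():
        msg = [0.0] * len(feat)
        for nbr in adjacency[atom]:
            # Real MPNNs would pass features[nbr] through a learned,
            # possibly edge-conditioned, message function first.
            for k, v in enumerate(features[nbr]):
                msg[k] += v
        updated[atom] = [f + m for f, m in zip(feat, msg)]
    return updated

# Hypothetical 3-atom chain 0-1-2 with scalar features.
feats = {0: [1.0], 1: [2.0], 2: [3.0]}
adj = {0: [1], 1: [0, 2], 2: [1]}
print(message_passing_step(feats, adj))  # {0: [3.0], 1: [6.0], 2: [5.0]}
```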
<h3 id="experimental-protocol">Experimental Protocol</h3>
<p>Gaussian process hyperparameter optimization was applied to each dataset-model combination, followed by three independent runs with different random seeds. All results are reported as means with standard deviations. Variable training-size experiments were conducted on Tox21, FreeSolv, and QM7 to study data efficiency.</p>
<h2 id="key-findings-across-property-categories">Key Findings Across Property Categories</h2>
<h3 id="biophysics-and-physiology">Biophysics and Physiology</h3>
<p>Graph convolutional and weave models showed strong performance on larger datasets with less overfitting than conventional methods; on Tox21, graph-based models reached the performance of multitask networks with far less training data (30% versus 90%). However, for smaller single-task datasets (under 3,000 samples), kernel SVM and ensemble tree methods were more robust. On highly imbalanced datasets like MUV (0.20% positive rate), graph-based models struggled to control false positives.</p>
<p>Multitask training had a regularizing effect, reducing the gap between train and test scores compared to single-task models. Bypass networks consistently matched or exceeded vanilla multitask networks, confirming that per-task layers add explanatory power.</p>
<h3 id="physical-chemistry">Physical Chemistry</h3>
<p>Graph-based methods (GC, DAG, MPNN, Weave) provided significant improvements over single-task networks for predicting solubility, solvation energy, and lipophilicity. The best models achieved accuracy comparable to ab initio predictions (within 0.5 RMSE for ESOL, within 1.5 kcal/mol for FreeSolv). On FreeSolv, a weave model trained on approximately 200 samples matched the accuracy of alchemical free energy calculations.</p>
<h3 id="quantum-mechanics">Quantum Mechanics</h3>
<p>Models incorporating 3D distance information (DTNN, MPNN, KRR with Coulomb matrix) substantially outperformed models using only topological features. Between them, DTNN and MPNN accounted for the best-performing model on 28 of 39 tasks across the QM datasets. The choice of physics-aware featurization proved more important than the choice of learning algorithm for these tasks.</p>
<h3 id="summary-of-best-performances">Summary of Best Performances</h3>
<p>Graph-based models outperformed conventional methods on 11 of 17 datasets. Key results on the test set:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>Best Conventional</th>
          <th>Best Graph-Based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM7</td>
          <td>MAE</td>
          <td>KRR (CM): 10.22</td>
          <td>DTNN: 8.75</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>MAE</td>
          <td>Multitask (CM): 4.35</td>
          <td>DTNN: 2.35</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>XGBoost: 0.99</td>
          <td>MPNN: 0.58</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>XGBoost: 1.74</td>
          <td>MPNN: 1.15</td>
      </tr>
      <tr>
          <td>PCBA</td>
          <td>PRC-AUC</td>
          <td>Logreg: 0.129</td>
          <td>GC: 0.136</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.822</td>
          <td>GC: 0.829</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.792</td>
          <td>GC: 0.763</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>ROC-AUC</td>
          <td>RF: 0.867</td>
          <td>Weave: 0.806</td>
      </tr>
  </tbody>
</table>
<p>Conventional methods (KernelSVM, RF) still won on several smaller or scaffold-split datasets (HIV, BACE, MUV, PDBbind, BBBP, SIDER), highlighting that graph-based models are not universally superior, particularly under data scarcity or challenging splits.</p>
<h2 id="conclusions-and-limitations">Conclusions and Limitations</h2>
<p>MoleculeNet demonstrated that learnable representations broadly offer the best performance for molecular machine learning. However, the authors identify several important caveats:</p>
<ol>
<li><strong>Data scarcity</strong>: Graph-based methods are not robust enough on complex tasks with limited training data.</li>
<li><strong>Class imbalance</strong>: On heavily imbalanced classification datasets, conventional methods such as kernel SVM outperform learnable featurizations with respect to recall of positives.</li>
<li><strong>Task-specific featurizations</strong>: For quantum mechanical and biophysical datasets, incorporating physics-aware features (<a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrix</a>, 3D coordinates) is more important than the choice of learning algorithm.</li>
<li><strong>Data-driven physical chemistry</strong>: On FreeSolv, data-driven methods outperformed ab initio calculations with moderate data, suggesting data-driven approaches will become increasingly important as methods and datasets mature.</li>
</ol>
<p>The authors express hope that MoleculeNet will stimulate algorithmic development similar to how ImageNet catalyzed breakthroughs in computer vision. Future directions include extending coverage to 3D protein structure prediction, DNA topological modeling, and other areas of molecular science.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All 17 datasets are publicly available and integrated into the DeepChem Python package. Users can load any dataset with a single library call.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM benchmark</td>
          <td>QM7/QM7b/QM8/QM9</td>
          <td>7K-134K compounds</td>
          <td>DFT-computed properties from GDB subsets</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL/FreeSolv/Lipophilicity</td>
          <td>643-4,200 compounds</td>
          <td>Experimental measurements</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA/MUV/HIV/PDBbind/BACE</td>
          <td>1.5K-440K compounds</td>
          <td>Bioassay and binding data</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP/Tox21/ToxCast/SIDER/ClinTox</td>
          <td>1.4K-8.6K compounds</td>
          <td>Toxicity and drug safety data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>All splitting methods (random, scaffold, stratified, time) and featurizations (ECFP, Coulomb matrix, grid, symmetry functions, graph convolutions, weave) are implemented in DeepChem. Hyperparameters were tuned via Gaussian process optimization. Three random seeds were used per experiment.</p>
<h3 id="models">Models</h3>
<p>All 12 models are implemented in DeepChem, built on Scikit-Learn and TensorFlow. No pretrained weights are provided; models are trained from scratch on each dataset.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics include MAE, RMSE, ROC-AUC, and PRC-AUC as specified per dataset. Multi-task datasets report mean metric values across all tasks.</p>
<h3 id="hardware">Hardware</h3>
<p>The authors used Stanford&rsquo;s Sherlock and Xstream GPU nodes. Specific GPU types and training times per model are provided in Table S1 of the supplementary material.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source library with all datasets, featurizations, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., &amp; Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. <em>Chemical Science</em>, 9(2), 513-530. <a href="https://doi.org/10.1039/c7sc02664a">https://doi.org/10.1039/c7sc02664a</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2018moleculenet,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MoleculeNet: a benchmark for molecular machine learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Zhenqin and Ramsundar, Bharath and Feinberg, Evan N. and Gomes, Joseph and Geniesse, Caleb and Pappu, Aneesh S. and Leswing, Karl and Pande, Vijay}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{513--530}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/c7sc02664a}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Benchmarking Molecular Property Prediction at Scale</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/</guid><description>A study training 62,820 models finds fixed molecular representations often outperform learned representations for property prediction.</description><content:encoded><![CDATA[<h2 id="a-large-scale-empirical-study-of-molecular-property-prediction">A Large-Scale Empirical Study of Molecular Property Prediction</h2>
<p>This is an <strong>Empirical</strong> paper that systematically benchmarks molecular property prediction across multiple dimensions: molecular representations, model architectures, evaluation metrics, data splitting strategies, and chemical space generalization. The primary contribution is a rigorous, large-scale comparison (62,820 trained models) showing that traditional machine learning models on fixed molecular representations frequently outperform recent deep representation learning approaches, and that several overlooked evaluation factors (statistical testing, metric choice, activity cliffs, dataset size) significantly influence conclusions about model performance.</p>
<h2 id="motivation-overlooked-evaluation-pitfalls-in-molecular-property-prediction">Motivation: Overlooked Evaluation Pitfalls in Molecular Property Prediction</h2>
<p>Molecular property prediction is a core task in AI-driven drug discovery, and recent years have seen a proliferation of representation learning methods (transformers on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, GNNs on molecular graphs) claiming improved performance on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet benchmark datasets</a>. However, the authors identify several systemic problems in how these methods are evaluated:</p>
<ol>
<li><strong>Heavy reliance on MoleculeNet benchmarks</strong>, which may not reflect real-world drug discovery challenges. Some benchmark tasks (e.g., SIDER, ClinTox) are arguably unreasonable because they try to predict outcomes from chemical structure alone when other factors (food-drug interactions, patient-level variables) dominate.</li>
<li><strong>Lack of statistical rigor.</strong> Most papers report mean metrics over 3 or 10 splits without statistical tests. Without rigorous analysis, improved metrics could be statistical noise.</li>
<li><strong>Inconsistent data splits.</strong> Across studies, the actual splits vary because seeds and splitting implementations differ, making cross-paper comparisons unreliable.</li>
<li><strong>Inappropriate metrics.</strong> AUROC, the default for classification, can overestimate performance, especially on imbalanced datasets. Precision-oriented metrics (PPV, NPV) may be more relevant for virtual screening.</li>
<li><strong>Neglect of activity cliffs.</strong> Most studies only evaluate inter-scaffold generalization via scaffold splits, ignoring intra-scaffold generalization where structurally similar molecules exhibit drastically different activities (<a href="/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/">activity cliffs</a>).</li>
</ol>
<h2 id="core-contribution-fixed-representations-often-outperform-learned-representations">Core Contribution: Fixed Representations Often Outperform Learned Representations</h2>
<p>The central finding is that traditional ML models (RF, SVM, XGBoost) operating on fixed molecular representations (RDKit2D descriptors, Morgan fingerprints, MACCS keys, AtomPairs) frequently outperform recent self-supervised pretrained models (<a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, GROVER) across diverse datasets. The authors frame the paper around a central thesis:</p>
<blockquote>
<p>&ldquo;A model cannot save an unqualified dataset which cannot remedy an improper evaluation for an ambiguous chemical space generalization claim.&rdquo;</p></blockquote>
<p>Key findings on representations and models:</p>
<ul>
<li><strong>RF on RDKit2D descriptors</strong> achieves the best performance on BACE, BBBP, ESOL, and Lipop under scaffold split; MolBERT matches RF only on HIV.</li>
<li><strong>Concatenating RDKit2D descriptors to GROVER&rsquo;s learned embeddings (GROVER_RDKit)</strong> significantly improves performance, suggesting the learned representations alone are insufficient and that fixed descriptors carry substantial predictive signal.</li>
<li><strong>For binding activity datasets</strong> (<a href="https://en.wikipedia.org/wiki/Opioid_receptor">opioid receptors</a> MOR, DOR, KOR), MorganBits fingerprints outperform other representations, consistent with the structural nature of binding.</li>
<li><strong>PhysChem descriptors</strong> excel on datasets where properties correlate strongly with simple molecular features (e.g., ESOL has a near-linear relationship between MolLogP and solubility), but perform poorly on binding activity datasets where the relationship is more complex.</li>
</ul>
<h2 id="experimental-setup-62820-models-across-diverse-datasets">Experimental Setup: 62,820 Models Across Diverse Datasets</h2>
<h3 id="models-evaluated">Models evaluated</h3>
<p>The study evaluates nine models across three categories:</p>
<ul>
<li><strong>Traditional ML</strong>: Random Forest (RF), Support Vector Machine (SVM), XGBoost</li>
<li><strong>Regular neural networks</strong>: RNN (GRU variant), GCN, GIN</li>
<li><strong>Pretrained models</strong>: MolBERT (SMILES-based, ~85M parameters, pretrained on 1.6M molecules), GROVER (graph-based, ~48M parameters, pretrained on ~10M molecules), and GROVER_RDKit (GROVER with concatenated RDKit2D descriptors)</li>
</ul>
<h3 id="molecular-representations">Molecular representations</h3>
<p>Six fixed representations are evaluated: RDKit2D descriptors (200 features), PhysChem descriptors (11 features), MACCS keys, MorganBits fingerprints, MorganCounts fingerprints, and AtomPairs fingerprints. Morgan fingerprints use radius 2 and 2048 bits, as testing showed little difference among common parameter choices.</p>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Datasets</th>
          <th>Task Type</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoleculeNet benchmarks</td>
          <td>BACE, BBBP, HIV</td>
          <td>Classification</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>MoleculeNet benchmarks</td>
          <td>ESOL, FreeSolv, Lipop</td>
          <td>Regression</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>Opioids-related</td>
          <td>MDR1, CYP2D6, CYP3A4, MOR, DOR, KOR</td>
          <td>Classification + Regression</td>
          <td>ChEMBL</td>
      </tr>
      <tr>
          <td>Activity datasets</td>
          <td>24 targets</td>
          <td>Regression</td>
          <td>Cortes-Ciriano et al.</td>
      </tr>
      <tr>
          <td>Activity datasets</td>
          <td>30 targets (MoleculeACE)</td>
          <td>Regression</td>
          <td>Tilborg et al.</td>
      </tr>
      <tr>
          <td>Descriptor datasets</td>
          <td>MolWt, NumAtoms (16 sizes each)</td>
          <td>Regression</td>
          <td>ZINC250k</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-protocol">Evaluation protocol</h3>
<ul>
<li>Both scaffold and random splits (80:10:10 ratio)</li>
<li><strong>30 different random seeds</strong> per experiment for statistical rigor</li>
<li><a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U test</a> for pairwise significance ($p &lt; 0.05$, two-sided)</li>
<li>Multiple metrics per task: AUROC, AUPRC, PPV, NPV for classification; RMSE, MAE, $R^2$, Pearson $R$ for regression</li>
</ul>
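<p>The U statistic behind the Mann-Whitney test can be computed directly; in practice one would call <code>scipy.stats.mannwhitneyu</code>, which also returns the p-value. A tie-aware sketch of the statistic alone:</p>

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for sample x versus sample y.

    Counts the pairs where x wins; ties contribute 0.5. Converting U
    to a p-value (normal approximation or exact) is left to scipy.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Two hypothetical sets of per-split AUROC scores.
rf_scores = [0.84, 0.86, 0.85, 0.83]
bert_scores = [0.80, 0.82, 0.85, 0.79]
print(mann_whitney_u(rf_scores, bert_scores))  # 13.5 of a possible 16
```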
<h3 id="key-metrics">Key metrics</h3>
<p>Classification:</p>
<p>$$
\text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$</p>
<p>$$
\text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}
$$</p>
<p>Regression:</p>
<p>$$
\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}
$$</p>
<p>$$
\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|
$$</p>
<p>$$
\text{Pearson}_R = \frac{\sum_{i=1}^{N} (y_i - \bar{y}_{obs})(\hat{y}_i - \bar{y}_{pred})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y}_{obs})^2 \sum_{i=1}^{N} (\hat{y}_i - \bar{y}_{pred})^2}}
$$</p>
<p>$$
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y}_{obs})^2}
$$</p>
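<p>These formulas translate line-for-line into code, and a toy example makes the metric-disagreement problem concrete: predictions that are perfectly correlated with the labels but systematically offset give a Pearson R of exactly 1 while $R^2$ goes negative.</p>

```python
import math

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def pearson_r(y, yhat):
    my, mp = sum(y) / len(y), sum(yhat) / len(yhat)
    num = sum((a - my) * (b - mp) for a, b in zip(y, yhat))
    den = math.sqrt(sum((a - my) ** 2 for a in y)
                    * sum((b - mp) ** 2 for b in yhat))
    return num / den

def r_squared(y, yhat):
    my = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Hypothetical labels and systematically offset predictions:
# perfectly correlated, so Pearson R = 1, yet R^2 is negative.
y, yhat = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(rmse(y, yhat), mae(y, yhat))             # 3.0 3.0
print(pearson_r(y, yhat), r_squared(y, yhat))  # 1.0 -12.5
```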
<h2 id="key-findings-metrics-activity-cliffs-and-dataset-size">Key Findings: Metrics, Activity Cliffs, and Dataset Size</h2>
<h3 id="statistical-testing-is-essential">Statistical testing is essential</h3>
<p>Without statistical tests, there is a real risk of drawing incorrect conclusions. Analysis of individual splits shows that in certain splits, MolBERT or GROVER can appear to outperform RF, even though on aggregate with proper statistical testing, RF is significantly better. For example, in BBBP, RF dominates in 20 of 30 splits, but the remaining 10 could mislead a researcher using only a single split.</p>
<h3 id="metric-choice-changes-conclusions">Metric choice changes conclusions</h3>
<p>Different evaluation metrics can lead to contradictory conclusions about the same models:</p>
<ul>
<li>In BBBP under scaffold split, RF significantly outperforms other models by AUROC, but shows similar performance when evaluated by PPV or NPV.</li>
<li>In FreeSolv, GROVER outperforms RF by Pearson $R$ ($p &lt; 0.05$) but shows similar performance by $R^2$.</li>
<li>Pearson $R$ can paint a much more optimistic picture than $R^2$: even when $R^2$ drops to zero or below, Pearson $R$ can remain around 0.5.</li>
<li>AUROC can be over-optimistic, especially on imbalanced datasets like CYP2D6 and CYP3A4.</li>
</ul>
<p>The authors argue that PPV and NPV are more practically relevant for <a href="/notes/chemistry/molecular-design/generation/evaluation/molscore-scoring-benchmarking-framework/">virtual screening</a> than AUROC or AUPRC, since the goal is to identify true hits among predicted positives (or true non-binders among predicted negatives).</p>
<h3 id="activity-cliffs-pose-a-major-challenge">Activity cliffs pose a major challenge</h3>
<p>Activity cliffs, defined as <a href="https://en.wikipedia.org/wiki/IC50">IC50</a> values spanning at least two orders of magnitude within one scaffold, are prevalent in the opioid-related datasets. Although AC scaffolds represent only about 10% of scaffolds, they encompass 25-46% of all molecules:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>AC scaffolds (%)</th>
          <th>AC molecules (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MDR1</td>
          <td>62 (10.2%)</td>
          <td>594 (41.3%)</td>
      </tr>
      <tr>
          <td>CYP2D6</td>
          <td>124 (9.3%)</td>
          <td>710 (31.0%)</td>
      </tr>
      <tr>
          <td>CYP3A4</td>
          <td>146 (7.2%)</td>
          <td>926 (25.2%)</td>
      </tr>
      <tr>
          <td>MOR</td>
          <td>213 (13.1%)</td>
          <td>1627 (46.1%)</td>
      </tr>
      <tr>
          <td>DOR</td>
          <td>178 (11.6%)</td>
          <td>1342 (41.6%)</td>
      </tr>
      <tr>
          <td>KOR</td>
          <td>218 (13.1%)</td>
          <td>1502 (45.2%)</td>
      </tr>
  </tbody>
</table>
<p>Prediction performance is consistently worse for AC molecules, indicating limited intra-scaffold generalization. Removing edge-case molecules (those in scaffolds whose pIC50 values span the 5-7 range) from test sets generally improves classification performance, confirming that activity cliffs are a key source of prediction error.</p>
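<p>The AC definition above can be sketched directly: a scaffold is an AC scaffold if its pIC50 values span at least 2 log units (IC50 varying by two orders of magnitude). Scaffold keys and values below are illustrative, not taken from the paper&rsquo;s datasets:</p>

```python
# Map scaffold -> pIC50 values of the molecules sharing that scaffold.
scaffold_pic50 = {
    "c1ccccc1": [5.1, 5.3, 7.5],  # range 2.4 log units -> AC scaffold
    "c1ccncc1": [6.0, 6.2, 6.4],  # range 0.4 -> not an AC scaffold
    "C1CCNCC1": [4.8, 8.1],       # range 3.3 -> AC scaffold
}

def activity_cliff_scaffolds(groups, span=2.0):
    """Scaffolds whose pIC50 range covers at least `span` log units."""
    return {s for s, vals in groups.items()
            if len(vals) >= 2 and max(vals) - min(vals) >= span}

ac = activity_cliff_scaffolds(scaffold_pic50)
n_ac = sum(len(v) for s, v in scaffold_pic50.items() if s in ac)
n_tot = sum(len(v) for v in scaffold_pic50.values())
print(sorted(ac))                                   # ['C1CCNCC1', 'c1ccccc1']
print(f"{n_ac}/{n_tot} molecules in AC scaffolds")  # 5/8 molecules
```

<p>As in the table above, a small share of scaffolds can account for a much larger share of molecules.</p>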
<h3 id="dataset-size-is-critical-for-representation-learning">Dataset size is critical for representation learning</h3>
<p>Experiments on descriptor datasets (predicting MolWt and NumAtoms) reveal clear patterns:</p>
<ul>
<li>With fewer than 1K data points, traditional ML on fixed representations outperforms all neural network models except pretrained GROVER, which shows competitive performance in the low-data regime.</li>
<li>MolBERT shows severely limited performance (RMSE &gt; 200 for MolWt) with fewer than 10K data points.</li>
<li>RNN achieves the best performance when dataset size exceeds 10K, demonstrating the promise of representation learning in the &ldquo;big-data&rdquo; regime.</li>
<li>SVM achieves near-zero RMSE on datasets larger than 10K when paired with AtomPairs fingerprints.</li>
<li>GROVER&rsquo;s performance does not substantially improve with increasing dataset size, while MolBERT improves at 100K but is slow to benefit from more data.</li>
</ul>
<h3 id="representation-learning-models-show-higher-metric-variability">Representation learning models show higher metric variability</h3>
<p>Representation learning models, particularly GROVER, exhibit higher variability in performance metrics across splits. This variability correlates negatively with mean performance: models with higher variability tend to perform worse on average. The authors emphasize the importance of reporting metric variability alongside means.</p>
<h3 id="scaffold-split-versus-random-split">Scaffold split versus random split</h3>
<p>Prediction performance under scaffold split is consistently worse than under random split, confirming the inter-scaffold generalization challenge. Notably, random split alleviates the intra-scaffold generalization challenge because some AC scaffolds are seen during training.</p>
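<p>A minimal sketch of a deterministic scaffold split, assuming scaffold groups are precomputed (in practice, Bemis-Murcko scaffolds via RDKit); assigning whole scaffolds, largest first, to the training set keeps every test scaffold unseen during training:</p>

```python
def scaffold_split(scaffold_to_idxs, test_frac=0.2):
    """Greedy scaffold split: whole scaffolds go to train (largest first)
    until the train quota is met; remaining scaffolds form the test set.
    Greedy assignment means split sizes are only approximate."""
    n_total = sum(len(v) for v in scaffold_to_idxs.values())
    n_train_target = int((1.0 - test_frac) * n_total)
    train, test = [], []
    for _, idxs in sorted(scaffold_to_idxs.items(),
                          key=lambda kv: len(kv[1]), reverse=True):
        (train if len(train) < n_train_target else test).extend(idxs)
    return train, test

# Hypothetical scaffold groups (molecule indices per scaffold).
groups = {"scafA": [0, 1, 2, 3, 4, 5], "scafB": [6, 7, 8], "scafC": [9]}
train, test = scaffold_split(groups)
print(train, test)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] [9]
```

<p>Under a random split, molecules from <code>scafA</code> would land on both sides, which is exactly why random splits mask the inter-scaffold generalization challenge.</p>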
<h3 id="descriptors-correlate-with-specific-properties">Descriptors correlate with specific properties</h3>
<p>PhysChem descriptors excel on datasets where molecular properties correlate with simple descriptors (e.g., MolLogP has near $-1$ correlation with ESOL labels). For binding activity datasets, correlation coefficients mostly fall within $[-0.5, 0.5]$, explaining why PhysChem descriptors show limited performance on those tasks, while structural fingerprints are more useful.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors acknowledge several limitations:</p>
<ol>
<li><strong>Uncertainty from model training</strong> (random initialization, mini-batch shuffling) was not fully addressed. Ensembling was not evaluated due to computational cost.</li>
<li><strong>Experimental uncertainty in labels</strong> (noise, measurement error in pIC50 values) was not modeled, though it can be <a href="https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity">heteroscedastic</a> and impact performance.</li>
<li><strong>Model explainability</strong> was not covered, although it is important for building trust in AI tools for drug discovery.</li>
<li>The study focused on GROVERbase only (not GROVERlarge) due to computational constraints.</li>
</ol>
<p>Future directions include: exploring better ways to use fixed representations alongside learned ones, developing techniques for chemical space generalization (both inter- and intra-scaffold), incorporating experimental uncertainty into model training and evaluation, and generating larger high-quality datasets to fully harness representation learning models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmark</td>
          <td>MoleculeNet (BACE, BBBP, HIV, ESOL, FreeSolv, Lipop)</td>
          <td>642-41,127 molecules</td>
          <td>Downloaded from MolMapNet; max length &lt; 400</td>
      </tr>
      <tr>
          <td>Activity</td>
          <td>Opioids-related (MDR1, CYP2D6, CYP3A4, MOR, DOR, KOR)</td>
          <td>Varies</td>
          <td>Collected from ChEMBL27; pIC50 values</td>
      </tr>
      <tr>
          <td>Activity</td>
          <td>Cortes-Ciriano et al. 24 targets</td>
          <td>Varies</td>
          <td>Activity data for drug targets</td>
      </tr>
      <tr>
          <td>Activity</td>
          <td>MoleculeACE 30 targets</td>
          <td>Varies</td>
          <td>Activity cliffs emphasis</td>
      </tr>
      <tr>
          <td>Descriptor</td>
          <td>MolWt, NumAtoms from <a href="/notes/chemistry/datasets/zinc-22/">ZINC250k</a></td>
          <td>0.1K to 100K</td>
          <td>16 dataset sizes per descriptor</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>RF: 500 trees (following Chemprop)</li>
<li>SVM: linear kernel</li>
<li>XGBoost: gradient boosting regressor/classifier with default hyperparameters</li>
<li>RNN: GRU variant, hidden size 512, 3 fully connected layers</li>
<li>GCN/GIN: embedding dimension 300, 5 convolutional layers, hidden size 512</li>
<li>MolBERT: BERTBase architecture, 768 embedding, 12 layers, 12 heads, ~85M parameters (769 fine-tuned)</li>
<li>GROVER: GROVERbase, ~48M parameters (~5.2M fine-tuned)</li>
<li>All splits repeated 30 times with seeds 0-29</li>
</ul>
<h3 id="models">Models</h3>
<p>All model configurations, splits, and raw predictions are available in the <a href="https://github.com/dengjianyuan/Respite_MPP">GitHub repository</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics: AUROC, AUPRC, PPV, NPV (classification); RMSE, MAE, $R^2$, Pearson $R$ (regression). Statistical testing via Mann-Whitney U test ($p &lt; 0.05$, two-sided). <a href="https://en.wikipedia.org/wiki/Youden%27s_J_statistic">Youden&rsquo;s $J$ statistic</a> used to determine classification threshold for PPV/NPV.</p>
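<p>A sketch of the thresholding step: Youden&rsquo;s $J$ picks the score cutoff maximizing $\text{TPR} - \text{FPR}$, after which PPV and NPV follow from the confusion matrix. Labels and scores below are illustrative, not the paper&rsquo;s predictions:</p>

```python
import numpy as np

def confusion(y_true, scores, t):
    """Confusion-matrix counts at score threshold t (predict positive if >= t)."""
    pred = scores >= t
    tp = np.sum(pred & (y_true == 1)); fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1)); tn = np.sum(~pred & (y_true == 0))
    return tp, fp, fn, tn

def youden_threshold(y_true, scores):
    """Threshold maximizing Youden's J = TPR - FPR."""
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):
        tp, fp, fn, tn = confusion(y_true, scores, t)
        j = tp / (tp + fn) - fp / (fp + tn)
        if j > best_j:
            best_j, best_t = j, t
    return best_t

y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
s = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])
t = youden_threshold(y, s)
tp, fp, fn, tn = confusion(y, s, t)
ppv, npv = tp / (tp + fp), tn / (tn + fn)
print(f"threshold={t:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

<p>PPV and NPV answer the screening question directly: of the compounds predicted active (or inactive), how many truly are?</p>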
<h3 id="hardware">Hardware</h3>
<p>All neural network experiments run on a single NVIDIA V100 GPU for 100 epochs. Batch size 32 for most experiments; 256 for GROVER on HIV due to compute time (MolBERT takes ~3 hours per split on HIV at batch size 32; GROVER takes ~5 hours at batch size 256). The study is partially funded by Stony Brook University OVPR Seed Grant, using the AI Institute at Stony Brook for computational resources.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dengjianyuan/Respite_MPP">Respite_MPP</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Code, data, and raw predictions</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.1038/s41467-023-41948-6">Nature Communications article</a></td>
          <td>Paper</td>
          <td>CC-BY-4.0</td>
          <td>Open access</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Deng, J., Yang, Z., Wang, H., Ojima, I., Samaras, D., &amp; Wang, F. (2023). A systematic study of key elements underlying molecular property prediction. <em>Nature Communications</em>, 14, 6395. <a href="https://doi.org/10.1038/s41467-023-41948-6">https://doi.org/10.1038/s41467-023-41948-6</a></p>
<p><strong>Publication</strong>: Nature Communications 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/dengjianyuan/Respite_MPP">Respite_MPP GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{deng2023systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic study of key elements underlying molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Deng, Jianyuan and Yang, Zhibo and Wang, Hehe and Ojima, Iwao and Samaras, Dimitris and Wang, Fusheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6395}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-023-41948-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ROGI-XD: Roughness of Pretrained Molecular Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/rogi-xd-roughness-pretrained-representations/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/rogi-xd-roughness-pretrained-representations/</guid><description>ROGI-XD enables cross-representation roughness comparison, showing pretrained chemical models produce no smoother QSPR surfaces than fingerprints.</description><content:encoded><![CDATA[<h2 id="evaluating-chemical-foundation-models-through-surface-roughness">Evaluating Chemical Foundation Models Through Surface Roughness</h2>
<p>This is a <strong>Systematization</strong> paper that introduces a metric reformulation (ROGI-XD) and uses it to evaluate whether pretrained chemical models (PCMs) learn representations that produce smoother <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-property relationship</a> (QSPR) surfaces than simple baselines. The key finding is negative: pretrained representations are no smoother than molecular fingerprints or descriptors, offering a principled explanation for their inconsistent performance on property prediction benchmarks.</p>
<h2 id="the-smoothness-gap-in-chemical-foundation-models">The Smoothness Gap in Chemical Foundation Models</h2>
<p>Chemical foundation models like ChemBERTa, ChemGPT, and graph-based pretrained networks promise to learn meaningful molecular representations from large unlabeled datasets via self-supervised learning. However, empirical benchmarks consistently show mixed results: these learned representations sometimes match and sometimes underperform simple baselines like Morgan fingerprints or RDKit descriptors.</p>
<p>Prior work by Deng et al. demonstrated that a random forest trained on 2048-bit Morgan fingerprints was competitive with, or superior to, pretrained models like <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a> and GROVER on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and opioid bioactivity tasks. The authors sought to explain this pattern through the lens of QSPR surface roughness: if pretrained representations do not produce smoother mappings from molecular structure to property, they cannot consistently outperform baselines.</p>
<h2 id="rogi-xd-a-dimensionality-independent-roughness-metric">ROGI-XD: A Dimensionality-Independent Roughness Metric</h2>
<p>The original ROuGhness Index (ROGI) captures global surface roughness by measuring the loss in property dispersion as a dataset is progressively coarse-grained through <a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clustering</a>. However, ROGI values are not comparable across representations of different dimensionalities because distances between randomly sampled points increase with dimension, artificially deflating ROGI for high-dimensional representations.</p>
<p>ROGI-XD addresses this by changing the integration variable. Instead of integrating over normalized distance threshold $t$, ROGI-XD integrates over $1 - \log N_{\text{clusters}} / \log N$, where $N_{\text{clusters}}$ is the number of clusters at a given dendrogram step and $N$ is the dataset size. This variable captures the degree of coarse-graining independent of representation dimensionality, producing comparable roughness values across representations ranging from 14 dimensions (descriptors) to 2048 dimensions (ChemGPT).</p>
<p>The procedure follows five steps: (1) cluster molecules using <a href="https://en.wikipedia.org/wiki/Complete-linkage_clustering">complete linkage</a> at distance threshold $t$, (2) coarse-grain by replacing each property label $y_i$ with its cluster mean $\bar{y}_j$, (3) compute the standard deviation $\sigma_t$ of the coarse-grained dataset, (4) repeat for all dendrogram steps, and (5) compute the area under the curve of $2(\sigma_0 - \sigma_t)$ versus the new integration variable.</p>
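<p>The five steps above can be sketched with SciPy&rsquo;s hierarchical-clustering utilities. This is a simplified illustration; normalization details may differ from the published implementation:</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def rogi_xd(X, y):
    """Simplified ROGI-XD sketch: complete-linkage clustering, coarse-graining
    by cluster means, integration over 1 - log(N_clusters)/log(N)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    Z = linkage(pdist(X), method="complete")
    xs, sigmas = [], []
    # t = 0 keeps every molecule in its own cluster; merge heights follow.
    for t in np.concatenate(([0.0], Z[:, 2])):
        labels = fcluster(Z, t=t, criterion="distance")
        # Step 2: replace each property label with its cluster mean.
        y_cg = np.array([y[labels == c].mean() for c in labels])
        xs.append(1.0 - np.log(len(np.unique(labels))) / np.log(n))
        sigmas.append(y_cg.std())  # Step 3: dispersion after coarse-graining
    xs, sigmas = np.array(xs), np.array(sigmas)
    order = np.argsort(xs)
    f = 2.0 * (sigmas[0] - sigmas[order])
    # Step 5: trapezoidal area under 2*(sigma_0 - sigma_t).
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(xs[order])))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
smooth = X[:, 0]                 # property tied to one input coordinate
rough = rng.permutation(smooth)  # same values, scrambled over structures
print(rogi_xd(X, smooth), rogi_xd(X, rough))
```

<p>Because coarse-graining can only reduce dispersion (law of total variance), the integrand $2(\sigma_0 - \sigma_t)$ is non-negative; rougher property assignments lose dispersion earlier in the dendrogram, inflating the area.</p>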
<h2 id="representations-and-tasks-evaluated">Representations and Tasks Evaluated</h2>
<p>The study compares seven molecular representations:</p>
<table>
  <thead>
      <tr>
          <th>Representation</th>
          <th>Type</th>
          <th>Dimensionality</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>Fixed</td>
          <td>14</td>
          <td>RDKit (14 properties)</td>
      </tr>
      <tr>
          <td>Morgan FP</td>
          <td>Fixed</td>
          <td>512</td>
          <td>Radius 2, 512-bit</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>Pretrained</td>
          <td>128</td>
          <td>Character-based <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> VAE, <a href="/notes/chemistry/datasets/zinc-22/">ZINC 250k</a></td>
      </tr>
      <tr>
          <td>GIN</td>
          <td>Pretrained</td>
          <td>300</td>
          <td>Node attribute masking, ZINC 250k</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>Pretrained</td>
          <td>384</td>
          <td>77M molecules, masked LM</td>
      </tr>
      <tr>
          <td>ChemGPT</td>
          <td>Pretrained</td>
          <td>2048</td>
          <td>PubChem 10M, causal LM</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>Baseline</td>
          <td>128</td>
          <td>Uniform $[0,1]^{128}$</td>
      </tr>
  </tbody>
</table>
<p>These are evaluated on 17 regression tasks drawn from two sources: ADMET datasets from the Therapeutics Data Commons (TDC) and toy datasets generated using <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> oracle functions. Five ML models are used for cross-validation: KNN, MLP, <a href="https://en.wikipedia.org/wiki/Partial_least_squares_regression">PLS</a>, random forest, and SVR.</p>
<h2 id="pretrained-representations-are-not-smoother">Pretrained Representations Are Not Smoother</h2>
<p>ROGI-XD correlates strongly with cross-validated RMSE across representations (median Pearson $r = 0.72$-$0.88$ depending on model), compared to the original ROGI which produces weak cross-representation correlations (median $r \in [-0.32, 0.28]$). When correlating over both representations and tasks simultaneously, ROGI-XD achieves $r = 0.91$-$0.99$ versus $r = 0.68$-$0.84$ for the original ROGI.</p>
<p>Using this validated metric, the authors find that pretrained representations do not produce smoother QSPR surfaces than fingerprints or descriptors. In more than 50% of tasks, both descriptors and fingerprints generate smoother surfaces. The median relative ROGI-XD increase for pretrained representations is 9.1-21.3% compared to descriptors and 2.3-10.1% compared to fingerprints, indicating rougher surfaces.</p>
<p>As a practical tool, ROGI-XD can guide representation selection without exhaustive benchmarking. Selecting the representation with the lowest ROGI-XD for each task and then optimizing over model architecture results in only a 6.8% average relative increase in best-case model error across the 17 tasks. In 8 of 17 tasks, the lowest ROGI-XD correctly identifies the optimal representation.</p>
<p>Fine-tuning can improve smoothness. On the Lipophilicity task ($N_{\text{tot}} = 4200$), fine-tuning the VAE with a contrastive loss reduces ROGI-XD from 0.254 to 0.107 ($\pm 0.02$), well below the descriptor baseline of 0.227. On the smaller CACO2 task ($N_{\text{tot}} = 910$), fine-tuning yields ROGI-XD of 0.143 ($\pm 0.05$), comparable to descriptors at 0.132. The impact of fine-tuning is sensitive to both the task and the amount of labeled data.</p>
<h2 id="implications-for-chemical-foundation-model-development">Implications for Chemical Foundation Model Development</h2>
<p>The lack of smoothness in pretrained QSPR surfaces explains the inconsistent empirical performance of chemical foundation models. The authors note that ROGI-XD is thematically similar to a contrastive loss, as both scale proportionally with the frequency and severity of activity cliffs. This connection suggests that imposing stronger smoothness assumptions during pretraining, for example through weak supervision on calculable molecular properties, could help produce representations that generalize better to downstream property prediction. ROGI-XD provides a practical tool for evaluating new pretraining strategies without exhaustive benchmark testing: a representation with lower ROGI-XD on a given task is likely to yield lower model error.</p>
<p>A limitation is that the study treats pretrained representations as static (frozen features). Fine-tuning introduces many additional design choices and can substantially improve representation quality, but this evaluation is left for future work. Additionally, the survey of pretrained models is not exhaustive and focuses on four representative architectures.</p>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/coleygroup/rogi-xd">coleygroup/rogi-xd</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained models and notebooks; results reproducible via <code>make all</code></td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (VAE, GIN)</td>
          <td>ZINC 250k</td>
          <td>250,000</td>
          <td>80/20 train/val split</td>
      </tr>
      <tr>
          <td>Pretraining (ChemBERTa)</td>
          <td>PubChem</td>
          <td>77M</td>
          <td>Masked language modeling</td>
      </tr>
      <tr>
          <td>Pretraining (ChemGPT)</td>
          <td>PubChem 10M</td>
          <td>10M</td>
          <td>Causal language modeling</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TDC ADMET</td>
          <td>~900-10,000 per task</td>
          <td>12 regression tasks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>GuacaMol oracles</td>
          <td>10,000 per task</td>
          <td>5 synthetic tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>ROGI-XD</strong>: Hierarchical clustering (complete linkage) with integration over $1 - \log N_{\text{clusters}} / \log N$</li>
<li><strong>Cross-validation</strong>: 5-fold CV with KNN, MLP, PLS, RF (n_estimators=50), SVR from scikit-learn</li>
<li><strong>Fine-tuning loss</strong>: $\mathscr{L} = \mathscr{L}_{\text{CE}} + \beta \cdot \mathscr{L}_{\text{KL}} + \gamma \cdot \mathscr{L}_{\text{cont}}$ with $\beta = 0.1$, $\gamma = 50$; contrastive term uses cosine distance in latent space and absolute value in target space</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Two AMD Ryzen Threadripper PRO 3995WX CPUs, four NVIDIA A5000 GPUs, 512 GB RAM, Ubuntu 20.04 LTS.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Graff, D. E., Pyzer-Knapp, E. O., Jordan, K. E., Shakhnovich, E. I., &amp; Coley, C. W. (2023). Evaluating the roughness of structure-property relationships using pretrained molecular representations. <em>Digital Discovery</em>, 2(5), 1452-1460. <a href="https://doi.org/10.1039/d3dd00088e">https://doi.org/10.1039/d3dd00088e</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/coleygroup/rogi-xd">ROGI-XD Code Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{graff2023roughness,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evaluating the roughness of structure--property relationships using pretrained molecular representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Graff, David E. and Pyzer-Knapp, Edward O. and Jordan, Kirk E. and Shakhnovich, Eugene I. and Coley, Connor W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1452--1460}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d3dd00088e}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Regression Transformer: Prediction Meets Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</guid><description>The Regression Transformer unifies property prediction and conditional generation in one multitask model by casting regression as sequence modelling.</description><content:encoded><![CDATA[<h2 id="a-multitask-model-that-unifies-regression-and-generation">A Multitask Model That Unifies Regression and Generation</h2>
<p>The Regression Transformer (RT) is a <strong>Method</strong> paper. It introduces a single model architecture that can both predict continuous molecular properties and conditionally generate molecules with desired property values. The core idea is to reformulate regression as a sequence modelling task: instead of training a dedicated regression head, continuous property values are tokenized into sequences of digits and predicted alongside molecular tokens using a cross-entropy loss.</p>
<h2 id="closing-the-gap-between-predictors-and-generators">Closing the Gap Between Predictors and Generators</h2>
<p>Existing transformer-based approaches in computational chemistry develop property predictors and generative models as separate systems. Even when a single architecture like <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> (Irwin et al., 2022) addresses both tasks, it does so through task-specific heads. This means the two capabilities remain disjoint, and the generative model cannot use its own property prediction ability during generation.</p>
<p>The RT addresses three specific gaps:</p>
<ol>
<li><strong>No true multitask entanglement</strong>: Prior work either tunes separate heads for prediction and generation or limits communication between modules to a reward signal.</li>
<li><strong>No inductive bias for continuous properties</strong>: Molecular generative models lack mechanisms to condition generation on floating-point property values.</li>
<li><strong>Disconnected workflows</strong>: Property predictors cannot generate molecules, and generators cannot assess whether their outputs satisfy property constraints.</li>
</ol>
<h2 id="core-innovation-regression-as-conditional-sequence-modelling">Core Innovation: Regression as Conditional Sequence Modelling</h2>
<p>The RT&rsquo;s key insight is that regression can be cast as sequential classification over digit tokens while preserving predictive accuracy. This is achieved through three components:</p>
<h3 id="numerical-tokenization">Numerical Tokenization</h3>
<p>Floating-point property values are split into individual digit tokens that preserve decimal order. Each token $t_{v,p}$ encodes a digit value $v \in [0, 9]$ and its decimal place $p \in \mathbb{Z}$. For example, the value 12.3 becomes the token sequence <code>[1_1, 2_0, 3_-1]</code>.</p>
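<p>A sketch of this tokenization for non-negative values with a fixed number of decimal places (the precision handling here is an assumption for illustration):</p>

```python
def tokenize_float(value, precision=1):
    """Split a non-negative float into digit tokens 'v_p', where v is the
    digit and p its decimal place, e.g. 12.3 -> ['1_1', '2_0', '3_-1']."""
    digits = f"{value:.{precision}f}".replace(".", "")
    top = len(str(int(value))) - 1  # decimal place of the leading digit
    return [f"{d}_{top - i}" for i, d in enumerate(digits)]

print(tokenize_float(12.3))  # ['1_1', '2_0', '3_-1']
print(tokenize_float(0.5))   # ['0_0', '5_-1']
```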
<h3 id="numerical-encodings">Numerical Encodings</h3>
<p>To provide an inductive bias about the semantic proximity of digit tokens (which cross-entropy loss cannot convey), the RT introduces Numerical Encodings (NEs), analogous to positional encodings. For a token $t_{v,p}$ at embedding dimension $j$:</p>
<p>$$
\text{NE}_{\text{Float}}(v, p, j) = (-1)^j \cdot \frac{v \cdot 10^p}{j + 1}
$$</p>
<p>These encodings ensure that pairwise distances between digit tokens decay monotonically with their floating-point proximity. The model can also learn digit orderings from data alone, but NEs provide a useful inductive bias.</p>
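<p>A quick check of the encoding formula above: embeddings of digit tokens at the same decimal place should lie closer together when their float values are closer. The embedding dimension of 16 is an arbitrary choice for illustration:</p>

```python
import numpy as np

def numerical_encoding(v, p, d_model=16):
    """NE_Float(v, p, j) = (-1)^j * v * 10^p / (j + 1), per embedding dim j."""
    j = np.arange(d_model)
    return (-1.0) ** j * (v * 10.0 ** p) / (j + 1)

# '2_0' should sit nearer to '1_0' than '9_0' does, since |2-1| < |9-1|.
e1, e2, e9 = (numerical_encoding(v, 0) for v in (1, 2, 9))
d12 = np.linalg.norm(e2 - e1)
d19 = np.linalg.norm(e9 - e1)
print(d12 < d19, round(d19 / d12, 1))  # True 8.0
```

<p>Distances scale linearly with the difference in encoded value, giving the monotone decay described above.</p>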
<h3 id="alternating-training-with-self-consistency">Alternating Training with Self-Consistency</h3>
<p>The RT uses an <a href="https://en.wikipedia.org/wiki/XLNet">XLNet</a> backbone trained with permutation language modelling (PLM). The key is that the same model serves two roles depending on which tokens are masked:</p>
<ul>
<li><strong>Mask numerical tokens</strong>: the model performs property prediction (regression)</li>
<li><strong>Mask textual tokens</strong>: the model performs conditional sequence generation</li>
</ul>
<p>The base PLM objective is:</p>
<p>$$
\mathcal{L}_{\text{PLM}} = \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{i=c+1}^{T} \log p_\theta(x_{z_i} \mid \mathbf{x}_{\mathbf{z}_{&lt; i}}) \right]
$$</p>
<p>This is refined into two specialized objectives: a property prediction objective $\mathcal{L}_P$ that masks only numerical tokens, and a generation objective $\mathcal{L}_G$ that masks only textual tokens. Training alternates between these every 50 steps.</p>
<p>The self-consistency (SC) loss adds a critical feedback loop. After generating a candidate molecule $\hat{\mathbf{x}}$, the model re-evaluates it by predicting the property of the generated sequence:</p>
<p>$$
\mathcal{L}_{\text{SC}} = \mathcal{L}_G(\mathbf{x}) + \alpha \cdot \mathcal{L}_P(\hat{\mathbf{x}})
$$</p>
<p>This rewards generating molecules whose predicted properties match the primed property value, exploiting the RT&rsquo;s dual capability as both predictor and generator.</p>
<h2 id="experiments-across-molecules-proteins-and-reactions">Experiments Across Molecules, Proteins, and Reactions</h2>
<h3 id="drug-likeness-qed">Drug Likeness (QED)</h3>
<p>Initial validation on a synthetic QED dataset (~1.4M molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>) demonstrated that the RT can simultaneously learn to predict QED scores (RMSE &lt; 0.06) and generate novel molecules conditioned on desired QED values (Spearman&rsquo;s $\rho$ up to 0.517 between primers and generated molecule properties). Novelty exceeded 99% across all configurations. The alternating training scheme with SC loss outperformed both single-task models and the vanilla PLM objective.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> representations proved comparable to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> for property prediction and far superior for generation (~100% validity vs. ~40% for SMILES).</p>
<h3 id="moleculenet-regression-benchmarks">MoleculeNet Regression Benchmarks</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks ESOL, FreeSolv, and Lipophilicity, the RT outperformed XGBoost and MPNN baselines despite using only a classification loss. It performed on par with XLNet using a conventional regression head, and was only mildly inferior to models like BERT and BART that used large-scale self-supervised pre-training with regression losses.</p>
<p>Critically, only the RT could also conditionally generate molecules for these tasks. External validation with Grover (a self-supervised Graph Transformer) confirmed high correlation with the RT&rsquo;s own property predictions (0.86, 0.84, and 0.75 for ESOL, FreeSolv, and Lipophilicity respectively).</p>
<h3 id="constrained-property-optimization">Constrained Property Optimization</h3>
<p>On the penalized logP (plogP) benchmark with similarity constraints, the RT outperformed JT-VAE and GCPN by large margins. At similarity threshold $\delta = 0.4$, the RT achieved 3.16 average improvement with 97.1% success rate, while also predicting plogP with PCC of 0.92. Competing methods cannot perform property prediction at all.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Improvement ($\delta$=0.4)</th>
          <th>Success</th>
          <th>Property Prediction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.84</td>
          <td>83.6%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>2.49</td>
          <td>100%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>4.71</td>
          <td>85.7%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td><strong>RT</strong></td>
          <td><strong>3.16</strong></td>
          <td><strong>97.1%</strong></td>
          <td><strong>PCC = 0.92</strong></td>
      </tr>
  </tbody>
</table>
<p>The comparison is not strictly fair: all competing methods are trained specifically to maximize plogP, and some (GCPN, JT-VAE) apply gradient optimization at inference time. The RT is only trained to reconstruct molecules with similar predicted plogP to the seed, so its training objective is property-agnostic rather than directly optimizing for higher plogP values.</p>
<h3 id="protein-language-modelling">Protein Language Modelling</h3>
<p>On the TAPE benchmark, the RT matched or outperformed conventional transformers on fluorescence and stability prediction tasks, despite those baselines being pre-trained on 24-106 million protein sequences (vs. 2.6 million for the RT). The RT also performed conditional protein generation, a task that none of the TAPE baselines can address.</p>
<h3 id="chemical-reaction-modelling">Chemical Reaction Modelling</h3>
<p>The RT was applied to reaction yield prediction on <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig amination</a> and <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki coupling</a> datasets. It matched Yield-BERT performance ($R^2$ = 0.939 and 0.81 respectively) while also enabling novel capabilities: reconstructing missing precursors from partial reactions and decorating existing reactions to achieve higher predicted yields. Across both datasets, over 40% of top-five predicted sequences contained reactions with novel precursors and higher predicted yield.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Regression can be successfully reformulated as sequential classification over digit tokens without losing predictive accuracy compared to models using regression losses.</li>
<li>The alternating training scheme with self-consistency loss enables cross-task benefits, where the model outperforms single-task variants at both prediction and generation.</li>
<li>A single ~27M parameter model handles property prediction, conditional molecular generation, conditional protein generation, and reaction yield prediction with precursor generation.</li>
<li>The model learns the natural ordering of digits from data: 47% of embedding dimensions for the tenths place directly encode digit ordering even without explicit numerical encodings.</li>
</ol>
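<p>The first finding can be illustrated with a small tokenization sketch. The token format below is hypothetical (the RT&rsquo;s actual vocabulary differs); it only shows the idea of turning a continuous value into digit tokens, each tagged with its decimal place, so that regression becomes one classification per digit:</p>

```python
def tokenize_property(name: str, value: float, precision: int = 3) -> list[str]:
    """Turn a continuous property into a sequence of discrete tokens.

    Each digit becomes its own token annotated with its decimal place
    (illustrative format, not the RT's exact vocabulary).
    """
    tokens = [f"<{name}>"]
    text = f"{value:.{precision}f}"
    place = text.index(".") - 1  # decimal place of the leading digit
    for ch in text:
        if ch == ".":
            continue
        tokens.append(f"_{ch}_{place}_")  # e.g. '_8_-1_' = digit 8 at 10^-1
        place -= 1
    return tokens

tokenize_property("qed", 0.84)
# → ['<qed>', '_0_0_', '_8_-1_', '_4_-2_', '_0_-3_']
```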
<h3 id="limitations">Limitations</h3>
<ol>
<li><strong>No large-scale pre-training</strong>: The RT uses ~27M parameters trained from scratch on task-specific datasets, unlike <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a> or MoLFormer which pre-train on billions of molecules. Scaling up could improve results.</li>
<li><strong>Fine-grained regression precision</strong>: The model sometimes struggles with intra-mode precision (e.g., on the fluorescence dataset where predictions cluster around bright/dark modes rather than capturing continuous variation).</li>
<li><strong>Single-property focus</strong>: All reported experiments use a single continuous property, though the framework naturally extends to multi-property settings.</li>
<li><strong>SELFIES validity caveats</strong>: While SELFIES are always syntactically valid, they can produce degenerate short molecules (~1.9% defective generations where the output has fewer than 50% of the seed&rsquo;s atoms).</li>
<li><strong>XLNet backbone limitations</strong>: Results on MoleculeNet regression are slightly below models using BART or BERT backbones with large-scale pre-training, suggesting the RT framework could benefit from stronger base models.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/regression-transformer">Regression Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/GT4SD/gt4sd-core">GT4SD Integration</a></td>
          <td>Code + Models</td>
          <td>MIT</td>
          <td>Pre-trained model inference pipelines</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></td>
          <td>Demo</td>
          <td>-</td>
          <td>Interactive inference webapp</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug likeness</td>
          <td>ChEMBL (QED)</td>
          <td>~1.4M molecules</td>
          <td>Synthetic QED labels computed with RDKit</td>
      </tr>
      <tr>
          <td>Regression benchmark</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipo)</td>
          <td>642-4,200 compounds</td>
          <td>16x SMILES augmentation, 3 random splits</td>
      </tr>
      <tr>
          <td>Property optimization</td>
          <td>ZINC (plogP)</td>
          <td>215,381 train / 799 test</td>
          <td>Fixed split from Jin et al. (2018)</td>
      </tr>
      <tr>
          <td>Protein pre-training</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a> (Boman)</td>
          <td>2,648,205 peptides</td>
          <td>15-45 amino acid peptides</td>
      </tr>
      <tr>
          <td>Protein benchmarks</td>
          <td>TAPE (Fluorescence, Stability)</td>
          <td>21,446-53,416 samples</td>
          <td>Fixed splits</td>
      </tr>
      <tr>
          <td>Reaction pre-training</td>
          <td>USPTO</td>
          <td>2,830,616 reactions</td>
          <td>Molecular weight as numerical property</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig / Suzuki</td>
          <td>3,955 / 5,760 reactions</td>
          <td>Ten 70/30 random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: XLNet (32 hidden layers, 256 hidden dim, 1024 FFN dim, 16 attention heads, 20% dropout)</li>
<li>Parameters: ~27 million</li>
<li>Training: Permutation language modelling pre-training, then alternating objectives (property prediction + conditional generation with SC loss)</li>
<li>Decoding: Greedy for property prediction, beam search for sequence generation</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>RT Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED prediction</td>
          <td>RMSE</td>
          <td>0.037</td>
          <td>Best config (NE + SC)</td>
      </tr>
      <tr>
          <td>QED generation</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.517</td>
          <td>Between primers and generated QED</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>Comparable to XLNet</td>
          <td>Within s.d. of regression-loss XLNet</td>
      </tr>
      <tr>
          <td>plogP optimization ($\delta$=0.4)</td>
          <td>Improvement</td>
          <td>3.16</td>
          <td>Outperforms JT-VAE, GCPN</td>
      </tr>
      <tr>
          <td>Protein fluorescence</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.72</td>
          <td>Outperforms TAPE baselines</td>
      </tr>
      <tr>
          <td>BH yield prediction</td>
          <td>$R^2$</td>
          <td>0.939</td>
          <td>Near Yield-BERT (0.951)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained on single GPUs (NVIDIA A100 or V100)</li>
<li>Training time: ~4 days for pre-training, ~1 day for fine-tuning</li>
<li>Framework: PyTorch 1.3.1 with HuggingFace Transformers 3.1.0</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Born, J. &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <em>Nature Machine Intelligence</em>, 5(4), 432-444. <a href="https://doi.org/10.1038/s42256-023-00639-z">https://doi.org/10.1038/s42256-023-00639-z</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence, April 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/regression-transformer">Regression Transformer GitHub Repository</a></li>
<li><a href="https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer">GT4SD Integration</a></li>
<li><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{born2023regression,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Regression Transformer enables concurrent sequence regression and generation for molecular language modelling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Born, Jannis and Manica, Matteo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{432--444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Language Models Learn Complex Molecular Distributions</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/</guid><description>RNN language models trained on SMILES and SELFIES outperform graph models at learning complex, multi-modal, and large-scale molecular distributions.</description><content:encoded><![CDATA[<h2 id="rnn-language-models-as-flexible-molecular-generators">RNN Language Models as Flexible Molecular Generators</h2>
<p>This is an <strong>Empirical</strong> paper that investigates the capacity of simple recurrent neural network (RNN) language models to learn complex molecular distributions. The core finding is that LSTM-based models trained on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> (SM-RNN) or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (SF-RNN) string representations consistently outperform popular graph generative models (JTVAE, CGVAE) across three increasingly challenging generative modeling tasks. The paper positions language models as flexible, scalable alternatives to graph-based approaches for molecular generation.</p>
<h2 id="scaling-beyond-standard-benchmarks">Scaling Beyond Standard Benchmarks</h2>
<p>Most molecular generative models are evaluated on relatively small, drug-like molecules from datasets like <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> or <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>. These standard benchmarks do not test whether models can handle larger, more structurally diverse molecules or distributions with complex shapes (multi-modal, heavy-tailed). This gap matters because there is increasing interest in larger, more complex molecules for therapeutics, including peptides and natural products.</p>
<p>Graph generative models like JTVAE and CGVAE impose structural constraints (tree decompositions, valency restrictions) that help with validity but limit their ability to scale. Language models, by contrast, only need to generate a single character sequence, making them inherently more flexible.</p>
<h2 id="three-challenging-generative-modeling-tasks">Three Challenging Generative Modeling Tasks</h2>
<p>The paper introduces three benchmark tasks designed to stress-test generative models:</p>
<h3 id="task-1-penalized-logp-distribution">Task 1: Penalized LogP Distribution</h3>
<p>A dataset of approximately 160K molecules from ZINC15 with penalized <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a> scores exceeding 4.0. The training distribution is sharply peaked around 4.0 to 4.5 with a subtle tail extending above 6.0. Molecules in the tail tend to have long carbon chains and fewer rings. The challenge is learning this skewed distribution rather than just finding individual high-scoring molecules.</p>
<h3 id="task-2-multi-modal-distribution">Task 2: Multi-Modal Distribution</h3>
<p>A composite dataset of approximately 200K molecules drawn from four sources with distinct molecular weight ranges:</p>
<ul>
<li><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> (MW $\leq$ 185)</li>
<li>ZINC (185 $\leq$ MW $\leq$ 425)</li>
<li>Harvard Clean Energy Project (460 $\leq$ MW $\leq$ 600)</li>
<li>POLYMERS (MW $&gt;$ 600)</li>
</ul>
<p>Models must learn to generate from all four modes simultaneously, each with very different molecular structures.</p>
<h3 id="task-3-large-scale-molecules">Task 3: Large-Scale Molecules</h3>
<p>The largest molecules in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> with more than 100 heavy atoms, yielding approximately 300K molecules with molecular weights ranging from 1,250 to 5,000. These include small biomolecules, photovoltaics, peptides, and cyclic peptides. This task is particularly challenging because the SMILES/SELFIES strings are very long.</p>
<h2 id="evaluation-by-distributional-fidelity">Evaluation by Distributional Fidelity</h2>
<p>The evaluation framework focuses on how well a model learns the full training distribution rather than generating individual good molecules. The primary quantitative metric is the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> (earth mover&rsquo;s distance) between molecular property distributions of generated and training molecules:</p>
<p>$$W(P, Q) = \inf_{\gamma \in \Gamma(P,Q)} \int |x - y| \, d\gamma(x, y)$$</p>
<p>Properties evaluated include LogP, synthetic accessibility (SA), quantitative estimate of drug-likeness (QED), molecular weight (MW), Bertz complexity (BCT), and natural product likeness (NP). An oracle baseline is computed by measuring the Wasserstein distance between different random samples of the training data itself.</p>
<p>Standard metrics (validity, uniqueness, novelty) are also reported but are secondary to distributional fidelity.</p>
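<p>The paper computes these distances with SciPy (<code>scipy.stats.wasserstein_distance</code>). For equal-sized 1D samples, the optimal transport plan simply matches sorted values pairwise, so $W_1$ reduces to the mean absolute difference of the sorted samples. A dependency-free sketch:</p>

```python
def wasserstein_1d(sample_p, sample_q):
    """Empirical 1-Wasserstein distance between two equal-sized 1D samples.

    With equal sample sizes the optimal coupling pairs sorted values,
    so W1 is the mean absolute difference after sorting. (SciPy's
    wasserstein_distance also handles unequal sizes and weights.)
    """
    assert len(sample_p) == len(sample_q)
    p, q = sorted(sample_p), sorted(sample_q)
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)
```

<p>Applied per property (LogP, SA, QED, MW, etc.) between generated and training molecules, with the oracle baseline computed between two disjoint training samples.</p>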
<h2 id="architecture-lstm-language-models">Architecture: LSTM Language Models</h2>
<p>The language models use standard LSTM architectures trained autoregressively on molecular strings. Two variants are compared:</p>
<ul>
<li><strong>SM-RNN</strong>: Trained on canonical SMILES</li>
<li><strong>SF-RNN</strong>: Trained on SELFIES representations</li>
</ul>
<p>Hyperparameters are tuned via random search over learning rate ($\in [0.0001, 0.001]$), hidden units ($\in [100, 1000]$), layers (1 to 5), and dropout ($\in [0.0, 0.5]$). Model selection uses a combination of standard metrics and Wasserstein distance rankings.</p>
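<p>A minimal sketch of such a random search over the reported ranges; the paper&rsquo;s exact sampler and trial budget are not specified, so the loop below is purely illustrative:</p>

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the reported ranges."""
    return {
        "lr": rng.uniform(1e-4, 1e-3),
        "hidden_units": rng.randint(100, 1000),
        "layers": rng.randint(1, 5),
        "dropout": rng.uniform(0.0, 0.5),
    }

rng = random.Random(0)  # seed for reproducibility
trials = [sample_config(rng) for _ in range(20)]
```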
<p>The graph model baselines include JTVAE (junction tree VAE) and CGVAE (constrained graph VAE), along with several additional baselines (MolGAN, GraphNVP, and others).</p>
<h2 id="results-language-models-outperform-graph-models-across-all-tasks">Results: Language Models Outperform Graph Models Across All Tasks</h2>
<h3 id="penalized-logp">Penalized LogP</h3>
<p>Both RNN models learn the sharp training distribution far better than graph models. The SM-RNN achieves the lowest Wasserstein distances across most properties. The graph models produce substantial out-of-distribution mass around penalized LogP scores of 1.75 to 2.25, failing to capture the peaked nature of the training distribution.</p>
<p>Critically, the RNNs also learn the subtle tail above penalized LogP of 6.0, generating molecules with long carbon chains and fewer rings that match the structural characteristics of high-scoring training molecules. CGVAE and JTVAE almost entirely miss this tail.</p>
<h3 id="multi-modal-distribution">Multi-Modal Distribution</h3>
<p>Both RNN models capture all four modes of the training distribution. JTVAE entirely misses the GDB13 mode and poorly learns the ZINC and CEP modes. CGVAE learns GDB13 but misses the CEP mode. The SM-RNN again achieves the best Wasserstein metrics.</p>
<h3 id="large-scale-molecules">Large-Scale Molecules</h3>
<p>This is the most discriminating task. Both JTVAE and CGVAE completely fail to train on these large molecules. JTVAE&rsquo;s tree decomposition produces a vocabulary of approximately 11,000 substructures, making training intractable. Only the RNN models succeed, with the SF-RNN achieving slightly better distributional match due to SELFIES guaranteeing 100% validity even for very long strings.</p>
<p>Both RNN models also learn the bimodal LogP structure within the large-molecule distribution and can generate molecules with substructures resembling peptides, including backbone chains and standard amino acid side chains.</p>
<h3 id="summary-of-wasserstein-distance-results">Summary of Wasserstein Distance Results</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>MW</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>SM-RNN</td>
          <td>0.095</td>
          <td>0.031</td>
          <td>0.007</td>
          <td>3.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>SF-RNN</td>
          <td>0.177</td>
          <td>0.290</td>
          <td>0.010</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>JTVAE</td>
          <td>0.536</td>
          <td>0.289</td>
          <td>0.081</td>
          <td>35.9</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>CGVAE</td>
          <td>1.000</td>
          <td>2.120</td>
          <td>0.115</td>
          <td>69.3</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SM-RNN</td>
          <td>0.081</td>
          <td>0.025</td>
          <td>0.006</td>
          <td>5.5</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SF-RNN</td>
          <td>0.286</td>
          <td>0.179</td>
          <td>0.023</td>
          <td>11.4</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>JTVAE</td>
          <td>0.495</td>
          <td>0.274</td>
          <td>0.034</td>
          <td>27.7</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>CGVAE</td>
          <td>1.617</td>
          <td>1.802</td>
          <td>0.076</td>
          <td>30.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SM-RNN</td>
          <td>1.367</td>
          <td>0.213</td>
          <td>0.003</td>
          <td>124.5</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SF-RNN</td>
          <td>1.095</td>
          <td>0.342</td>
          <td>0.010</td>
          <td>67.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>JTVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>CGVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="smiles-vs-selfies-trade-off">SMILES vs. SELFIES Trade-off</h3>
<p>The SMILES and SELFIES RNNs show complementary strengths. The SF-RNN consistently achieves better standard metrics (validity, uniqueness, novelty) across all tasks, while the SM-RNN achieves better Wasserstein distance metrics. The authors suggest that the SELFIES grammar may reduce memorization of the training data, improving novelty but slightly hurting distributional fidelity.</p>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations. Language models cannot account for molecular geometry or 3D information, which is important for many applications. The study evaluates distributional fidelity but does not test downstream utility for specific molecular design tasks (e.g., optimizing for a particular biological target). Additionally, while the graph models (JTVAE, CGVAE) are more interpretable, the language models operate as black boxes over string representations. The comparison is also limited to two specific graph model architectures, and more recent or specialized graph models may close the performance gap. Finally, trained model weights are only available upon request rather than being publicly released.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">danielflamshep/genmoltasks</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>Processed training data and generated samples</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Three custom datasets constructed from ZINC15, <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>, Harvard Clean Energy Project, POLYMERS, and PubChem. Processed data available at the GitHub repository.</p>
<p><strong>Code</strong>: LSTM networks implemented in PyTorch using the char-rnn code from the <a href="https://github.com/molecularsets/moses">MOSES repository</a>. Baselines use the official <a href="https://github.com/wengong-jin/icml18-jtnn">JTVAE</a> and <a href="https://github.com/microsoft/constrained-graph-variational-autoencoder">CGVAE</a> implementations. No unified training script is provided in the repository.</p>
<p><strong>Evaluation</strong>: Wasserstein distances computed using SciPy. Molecular properties computed using RDKit. 10K molecules generated from each model for evaluation.</p>
<p><strong>Hyperparameters</strong>: Task-specific configurations reported. For example, the LogP task SM-RNN uses 2 hidden layers with 400 units, dropout of 0.2, and learning rate of 0.0001.</p>
<p><strong>Hardware</strong>: Models were trained on Compute Canada clusters. Specific GPU types and training times are not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D., Zhu, K., &amp; Aspuru-Guzik, A. (2022). Language models can learn complex molecular distributions. <em>Nature Communications</em>, 13, 3293. <a href="https://doi.org/10.1038/s41467-022-30839-x">https://doi.org/10.1038/s41467-022-30839-x</a></p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/danielflamshep/genmoltasks">GitHub: danielflamshep/genmoltasks</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2022language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can learn complex molecular distributions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Zhu, Kevin and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3293}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-022-30839-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Exposing Limitations of Molecular ML with Activity Cliffs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/</guid><description>A benchmark of 24 ML methods on activity cliff compounds across 30 drug targets, showing descriptor-based models outperform deep learning.</description><content:encoded><![CDATA[<h2 id="a-benchmark-for-activity-cliff-prediction">A Benchmark for Activity Cliff Prediction</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>The paper systematically benchmarks 24 machine learning and deep learning approaches on their ability to predict bioactivity for activity cliff compounds: pairs of structurally similar molecules that exhibit large differences in potency. These cases violate the similarity principle (similar structure implies similar activity) and represent a practical failure mode for <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a> in drug discovery. The authors release MoleculeACE, an open-source benchmarking platform for evaluating ML models on activity cliffs.</p>
<h2 id="activity-cliffs-as-a-blind-spot-in-molecular-ml">Activity Cliffs as a Blind Spot in Molecular ML</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Chemical_similarity">similarity principle</a> underpins most molecular ML: structurally similar compounds should have similar properties. Activity cliffs are the exceptions, where small structural changes cause large potency shifts (e.g., a single substituent change causing a 10x difference in $K_i$).</p>
<p>Despite their importance for <a href="https://en.wikipedia.org/wiki/Hit_to_lead">hit-to-lead optimization</a>, activity cliffs have received limited attention in ML benchmarking. Standard metrics like RMSE computed over entire test sets can mask poor predictions on cliff compounds. A model might achieve low overall error while systematically mispredicting these edge cases, which are precisely the molecules that matter most for medicinal chemistry applications.</p>
<p>The authors identify 7-52% of compounds as activity cliff molecules across their 30 target datasets, showing this is not a rare phenomenon.</p>
<h2 id="defining-and-detecting-activity-cliffs">Defining and Detecting Activity Cliffs</h2>
<p>The authors use three complementary similarity metrics to identify activity cliffs:</p>
<ol>
<li><strong>Substructure similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto coefficient</a> on extended connectivity fingerprints (ECFPs), capturing shared radial substructures</li>
<li><strong>Scaffold similarity</strong>: Tanimoto coefficient on ECFPs computed from molecular graph frameworks, detecting core/decoration differences</li>
<li><strong>SMILES similarity</strong>: <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> on canonical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, capturing character-level insertions, deletions, and translocations</li>
</ol>
<p>Pairs with $\geq 90\%$ similarity on <strong>any one</strong> of the three metrics and $&gt; 10\times$ difference in bioactivity ($K_i$ or $\text{EC}_{50}$) are classified as activity cliff pairs. This union-based approach (rather than requiring agreement across all metrics) captures different types of structural relationships relevant to medicinal chemistry.</p>
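<p>The union criterion can be sketched as follows. Fingerprints are represented here as plain bit sets (in practice ECFPs would come from RDKit), and the three similarity values for a pair are assumed precomputed:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity on fingerprint bit sets: |A & B| / |A | B|."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter) if (fp_a or fp_b) else 0.0

def is_activity_cliff(similarities, ki_a, ki_b,
                      sim_cut=0.9, fold_cut=10.0) -> bool:
    """Union criterion: ANY similarity >= 0.9 AND a >10-fold potency gap."""
    fold_change = max(ki_a, ki_b) / min(ki_a, ki_b)
    return any(s >= sim_cut for s in similarities) and fold_change > fold_cut
```

<p>Note that a single high similarity metric suffices, which is what makes the criterion a union rather than an intersection over the three structural views.</p>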
<h2 id="24-methods-across-30-drug-targets">24 Methods Across 30 Drug Targets</h2>
<p>The benchmark evaluates 16 traditional ML configurations (4 algorithms $\times$ 4 descriptor types) and 8 deep learning approaches across 30 curated <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v29 datasets (48,707 total molecules).</p>
<p><strong>Traditional ML algorithms</strong>: KNN, RF, GBM, SVM, each combined with ECFPs, MACCS keys, WHIM descriptors, or physicochemical properties.</p>
<p><strong>Deep learning methods</strong>: MPNN, GCN, GAT, Attentive FP (graph-based), plus LSTM, CNN, Transformer/<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> (SMILES-based), and an MLP on ECFPs.</p>
<p>Performance is measured with both standard RMSE and a dedicated $\text{RMSE}_{\text{cliff}}$ computed only on activity cliff compounds in the test set:</p>
<p>$$
\text{RMSE}_{\text{cliff}} = \sqrt{\frac{\sum_{j=1}^{n_c} (\hat{y}_j - y_j)^2}{n_c}}
$$</p>
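<p>The metric above is simply the RMSE restricted to the cliff subset of the test set; a minimal sketch:</p>

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmse_cliff(y_true, y_pred, is_cliff):
    """RMSE computed only on activity cliff compounds in the test set."""
    pairs = [(t, p) for t, p, c in zip(y_true, y_pred, is_cliff) if c]
    return rmse([t for t, _ in pairs], [p for _, p in pairs])
```

<p>Reporting both values side by side is what exposes models that achieve low overall RMSE while failing on the cliff compounds.</p>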
<p>Key results:</p>
<ul>
<li><strong>Molecular descriptors matter more than algorithms</strong>: The choice of descriptor (ECFPs vs. MACCS vs. WHIM vs. physicochemical) had a larger impact on $\text{RMSE}_{\text{cliff}}$ than the choice of ML algorithm ($p &lt; 0.05$, <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Wilcoxon rank-sum test</a> with <a href="https://en.wikipedia.org/wiki/False_discovery_rate">Benjamini-Hochberg correction</a>).</li>
<li><strong>SVM + ECFPs wins on average</strong>: The best overall method for activity cliff prediction, though the difference from RF + ECFPs or GBM + ECFPs was not statistically significant.</li>
<li><strong>Deep learning underperforms</strong>: All graph and SMILES-based deep learning methods performed worse than a simple MLP on ECFPs. Among deep learning, LSTM with transfer learning (pretrained on 36K molecules) was the best, outperforming the ChemBERTa transformer pretrained on 10M compounds.</li>
<li><strong>Large case-by-case variation</strong>: $\text{RMSE}_{\text{cliff}}$ ranged from 0.62 to 1.60 log units across datasets, with no method consistently best. Deep learning methods showed the highest variance across targets.</li>
</ul>
<h2 id="simple-descriptors-beat-complex-architectures-on-cliffs">Simple Descriptors Beat Complex Architectures on Cliffs</h2>
<p>The core finding is that activity cliffs expose a gap in learned molecular representations. Despite graph neural networks and transformers being able to learn directly from molecular structure, they fail to capture the subtle structural differences that drive activity cliffs.</p>
<p>Key observations:</p>
<ul>
<li><strong>RMSE and $\text{RMSE}_{\text{cliff}}$ correlate ($r = 0.81$ on average)</strong>, so optimizing overall error usually helps with cliffs too. But this correlation breaks down for some targets (e.g., CLK4), where methods with similar RMSE can have very different $\text{RMSE}_{\text{cliff}}$.</li>
<li><strong>Training set size matters for the RMSE/$\text{RMSE}_{\text{cliff}}$ correlation</strong>: Datasets with $&gt; 1000$ training molecules show $r &gt; 0.80$ between the two metrics. In low-data regimes, the correlation weakens, making dedicated cliff evaluation more important.</li>
<li><strong>No relationship between % cliff compounds and model performance</strong>: The fraction of cliff compounds in a dataset did not predict model performance, and no target-family-specific effects were found.</li>
<li><strong>Transfer learning helped SMILES models (LSTM) but not graph models</strong>: Self-supervised pretraining strategies (context prediction, infomax, edge prediction, masking) did not improve GNN performance, consistent with findings from other studies.</li>
</ul>
<p>The MoleculeACE platform provides standardized data curation, activity cliff detection, and cliff-specific evaluation, enabling researchers to assess new methods against this benchmark.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Source</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmarking</td>
          <td>ChEMBL v29</td>
          <td>48,707 molecules (35,632 unique) across 30 targets</td>
          <td>Curated for duplicates, salts, outliers</td>
      </tr>
      <tr>
          <td>Smallest dataset</td>
          <td>JAK1</td>
          <td>615 molecules</td>
          <td>7% activity cliffs</td>
      </tr>
      <tr>
          <td>Largest dataset</td>
          <td>DRD3</td>
          <td>3,657 molecules</td>
          <td>39% activity cliffs</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Activity cliff detection</strong>: Pairwise similarity $\geq 0.9$ (Tanimoto on ECFPs, scaffold ECFPs, or Levenshtein on SMILES) with $&gt; 10\times$ potency difference</li>
<li><strong>Splitting</strong>: <a href="https://en.wikipedia.org/wiki/Spectral_clustering">Spectral clustering</a> on ECFPs (5 clusters), 80/20 stratified split preserving cliff proportion</li>
<li><strong>Hyperparameter optimization</strong>: <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian optimization</a> with Gaussian process, max 50 combinations, 5-fold cross-validation</li>
<li><strong>SMILES augmentation</strong>: 10-fold for all SMILES-based methods</li>
<li><strong>Transfer learning</strong>: LSTM pretrained on 36,281 merged training molecules (next-character prediction); ChemBERTa pretrained on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> compounds</li>
</ul>
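<p>The cliff-preserving 80/20 split can be sketched in pure Python (illustrative only: the spectral clustering step on ECFPs is omitted here, and the benchmark additionally stratifies within the 5 clusters):</p>

```python
import random

def stratified_cliff_split(indices, is_cliff, test_frac=0.2, seed=0):
    """80/20 split preserving the proportion of activity cliff compounds.

    indices: molecule indices to split.
    is_cliff: boolean per index, True for activity cliff compounds.

    Cliff and non-cliff molecules are sampled into the test set
    separately, so both partitions keep roughly the same cliff fraction.
    """
    rng = random.Random(seed)
    cliff = [i for i in indices if is_cliff[i]]
    non_cliff = [i for i in indices if not is_cliff[i]]
    test = []
    for group in (cliff, non_cliff):
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_frac)
        test.extend(shuffled[:n_test])
    test_set = set(test)
    train = [i for i in indices if i not in test_set]
    return train, test
```

Fixing the seed keeps the split reproducible, which matters when 24 methods are compared on the same 30 per-target partitions.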
<h3 id="models">Models</h3>
<ul>
<li><strong>Traditional ML</strong>: KNN, RF, GBM, SVM (scikit-learn v1.0.2)</li>
<li><strong>Descriptors</strong>: ECFPs (1024-bit, radius 2), MACCS keys (166-bit), WHIM (114 descriptors), physicochemical (11 properties)</li>
<li><strong>GNNs</strong>: MPNN, GCN, GAT, AFP (PyTorch Geometric v2.0.4), with graph multiset transformer pooling</li>
<li><strong>SMILES models</strong>: LSTM (4 layers, 5.8M params), 1D CNN, ChemBERTa transformer</li>
<li><strong>Total models trained</strong>: 720 (24 methods $\times$ 30 targets)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE</td>
          <td>All test molecules</td>
          <td>Standard root-mean-square error on $\text{pK}_i$ / $\text{pEC}_{50}$</td>
      </tr>
      <tr>
          <td>$\text{RMSE}_{\text{cliff}}$</td>
          <td>Activity cliff compounds only</td>
          <td>RMSE restricted to cliff molecules in test set</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE">MoleculeACE</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Benchmark platform with all 30 curated datasets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE/tree/main/MoleculeACE/Data/benchmark_data">Curated datasets</a></td>
          <td>Data</td>
          <td>MIT</td>
          <td>Processed ChEMBL bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: van Tilborg, D., Alenicheva, A., &amp; Grisoni, F. (2022). Exposing the Limitations of Molecular Machine Learning with Activity Cliffs. <em>Journal of Chemical Information and Modeling</em>, 62(23), 5938-5951. <a href="https://doi.org/10.1021/acs.jcim.2c01073">https://doi.org/10.1021/acs.jcim.2c01073</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molML/MoleculeACE">MoleculeACE GitHub Repository</a></li>
<li><a href="https://chemrxiv.org/engage/chemrxiv/article-details/630cc44058843b8403a19810">ChemRxiv Preprint</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{vantilborg2022activity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exposing the Limitations of Molecular Machine Learning with Activity Cliffs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{van Tilborg, Derek and Alenicheva, Alisa and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5938--5951}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01073}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>